# Data preprocessing
    - Download data to the server
    - Convert text to sequences
    - Configure sequences for an RNN model

## Download data to the server

### Command line in the server
    Path to data:
        cd /home/ubuntu/data/training/text/sentiment
    Download dataset: 
        wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    Uncompress it:
        tar -zxvf aclImdb_v1.tar.gz
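
The same download and extraction can also be done from Python. The following is a minimal sketch using only the standard library; the helper name `download_and_extract` is hypothetical and the destination folder is simply the path used in the commands above.

```python
import os
import tarfile
import urllib.request

# URL taken from the wget command above.
DATASET_URL = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def download_and_extract(url, dest_dir):
    """Download a .tar.gz archive into dest_dir (if not already there) and unpack it."""
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, os.path.basename(url))
    # Skip the download when the archive is already present.
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return archive_path


# Example call (commented out because it downloads the full archive):
# download_and_extract(DATASET_URL, "/home/ubuntu/data/training/text/sentiment")
```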

## Convert text to sequences
    - List of all text files
    - Read files into Python
    - Tokenize
    - Create dictionaries to recode
    - Recode tokens into ids and create sentences
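
Before running the steps above on the full dataset, it may help to see them on a toy input. This sketch uses two made-up sentences and plain whitespace splitting in place of a real tokenizer; breaking frequency ties alphabetically is an assumption made here only to keep the result deterministic.

```python
from collections import Counter

# Two made-up "reviews"; str.split stands in for a real tokenizer.
texts = ["a good movie", "a bad bad movie"]
tokens = [t.split() for t in texts]

# Count the words and rank them by descending frequency (ties: alphabetical).
counts = Counter(w for sent in tokens for w in sent)
ranked = sorted(counts, key=lambda w: (-counts[w], w))

# Ids start at 2, leaving 0 and 1 free as in the notebook.
word_to_id = {w: i + 2 for i, w in enumerate(ranked)}

# Recode every token; words missing from the dictionary would become 1.
sequences = [[word_to_id.get(w, 1) for w in sent] for sent in tokens]
print(word_to_id)   # {'a': 2, 'bad': 3, 'movie': 4, 'good': 5}
print(sequences)    # [[2, 5, 4], [2, 3, 3, 4]]
```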

In [1]:
#Imports and paths
from __future__ import print_function

import numpy as np

data_path='../data/aclImdb/'


In [2]:
# Generator of list of files in a folder and subfolders
import os
import fnmatch

def gen_find(filepattern, toppath):
    '''
    Generator with a recursive list of files in the toppath that match filepattern
    Inputs:
        filepattern(str): Shell-style pattern (e.g. '*.txt')
        toppath(str): Root path
    '''
    for path, dirlist, filelist in os.walk(toppath):
        for name in fnmatch.filter(filelist, filepattern):
            yield os.path.join(path, name)

#Test
#print(next(gen_find("*.txt", data_path+'train/pos/')))
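
The order in which `os.walk` returns files depends on the file system, so the samples printed in the next cells can differ between machines. As a possible alternative, this sketch uses `pathlib` and sorts the result; the name `find_files` and the demo file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def find_files(filepattern, toppath):
    """Return a sorted list of files under toppath whose names match filepattern."""
    return sorted(str(p) for p in Path(toppath).rglob(filepattern) if p.is_file())


# Small demo on a throwaway directory (the file names are made up).
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "pos"))
for name in ["pos/1_9.txt", "pos/0_8.txt", "readme.md"]:
    Path(demo_root, name).write_text("demo", encoding="utf-8")

demo_files = find_files("*.txt", demo_root)
print([os.path.basename(p) for p in demo_files])   # ['0_8.txt', '1_9.txt']
```

Unlike `gen_find`, this version builds the whole list in memory rather than yielding paths lazily.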

In [3]:
def read_sentences(path):
    '''
    Read the first line of every .txt file under path
    '''
    sentences = []
    for ff in gen_find("*.txt", path):
        with open(ff, 'r', encoding='utf-8') as f:
            sentences.append(f.readline().strip())
    return sentences

#Test
print(read_sentences(data_path+'train/pos/')[0:2])

['For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.', 'Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV\'s "Flamingo Road") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the Gateway to Hell! The scenes with Raines modeling are very well captured, the mood music is perfect, Deborah Raffin is charming as Cristina\'s pal, but when Raines moves into a creepy Brooklyn Heights brownstone (inhabited by a blind priest on the top floor), things really start cooking. The neighbors, including a fantastically wicked Burgess Meredith and kinky couple Sylvia Miles & Beverly D\'Angelo, are a diabolical lot, and Eli Wallach is great fun as a wily police detect

In [4]:
print(read_sentences(data_path+'train/neg/')[0:2])

["Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.<br /><br />Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form.", 'Well...tremors I, the original started off in 1990 and i found the movie quite enjoyable to watch. however, they proceeded to make tremors II and III. Trust me, those movies started going downhill right after they finished the first one, i mean, ass blasters??? Now, only God himself is capable of answering the question "why in Gods name would they create another one of these dumpster dives of a movie?" Tremors IV cannot be considered a bad movie, in fact it cannot be even considered an epitome of a bad movie, for it lives up to more than that. As i attempted to sit though it, i noticed that my eyes started to bleed, and i hoped profusely that the little girl from the ring would crawl through the TV and kill me. did they really think t

In [5]:
def tokenize(sentences):
    '''
    Split each sentence into a list of tokens with NLTK's word_tokenize
    '''
    from nltk import word_tokenize
    print('Tokenizing...')
    tokens = []
    for sentence in sentences:
        tokens.append(word_tokenize(sentence))
    print('Done!')

    return tokens

print(tokenize(read_sentences(data_path+'train/pos/')[0:2]))

Tokenizing...
Done!
[['For', 'a', 'movie', 'that', 'gets', 'no', 'respect', 'there', 'sure', 'are', 'a', 'lot', 'of', 'memorable', 'quotes', 'listed', 'for', 'this', 'gem', '.', 'Imagine', 'a', 'movie', 'where', 'Joe', 'Piscopo', 'is', 'actually', 'funny', '!', 'Maureen', 'Stapleton', 'is', 'a', 'scene', 'stealer', '.', 'The', 'Moroni', 'character', 'is', 'an', 'absolute', 'scream', '.', 'Watch', 'for', 'Alan', '``', 'The', 'Skipper', "''", 'Hale', 'jr.', 'as', 'a', 'police', 'Sgt', '.'], ['Bizarre', 'horror', 'movie', 'filled', 'with', 'famous', 'faces', 'but', 'stolen', 'by', 'Cristina', 'Raines', '(', 'later', 'of', 'TV', "'s", '``', 'Flamingo', 'Road', "''", ')', 'as', 'a', 'pretty', 'but', 'somewhat', 'unstable', 'model', 'with', 'a', 'gummy', 'smile', 'who', 'is', 'slated', 'to', 'pay', 'for', 'her', 'attempted', 'suicides', 'by', 'guarding', 'the', 'Gateway', 'to', 'Hell', '!', 'The', 'scenes', 'with', 'Raines', 'modeling', 'are', 'very', 'well', 'captured', ',', 'the', 'mood', 

In [6]:
sentences_trn_pos = tokenize(read_sentences(data_path+'train/pos/'))
sentences_trn_neg = tokenize(read_sentences(data_path+'train/neg/'))
sentences_trn = sentences_trn_pos + sentences_trn_neg


Tokenizing...
Done!
Tokenizing...
Done!


In [7]:
# Create the dictionary that converts words to numbers, ordered with the most frequent words first
def build_dict(sentences):
    '''
    Build dictionary of train words
    Outputs: 
     - Dictionary of word --> word index
     - Dictionary of word --> word count freq
    '''
    print('Building dictionary..')
    wordcount = dict()
    # For each word in each sentence, accumulate its frequency
    for ss in sentences:
        for w in ss:
            if w not in wordcount:
                wordcount[w] = 1
            else:
                wordcount[w] += 1

    counts = list(wordcount.values()) # List of frequencies
    keys = list(wordcount) #List of words
    
    sorted_idx = reversed(np.argsort(counts))
    
    worddict = dict()
    for idx, ss in enumerate(sorted_idx):
        worddict[keys[ss]] = idx+2  # ids start at 2: 0 is left free and 1 marks unknown words (UNK)
    print( np.sum(counts), ' total words ', len(keys), ' unique words')

    return worddict, wordcount


worddict, wordcount = build_dict(sentences_trn)

print(worddict['the'], wordcount['the'])

Building dictionary..
7056532  total words  134957  unique words
2 289300
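
The counting loop in `build_dict` can also be written with `collections.Counter`. The sketch below is an assumed alternative, not the notebook's implementation: `build_dict_counter` is a hypothetical name, and words with equal counts may receive different ids than with the `np.argsort` version because ties are ordered differently.

```python
from collections import Counter


def build_dict_counter(sentences):
    """Return (word -> id, word -> count), with ids assigned by descending frequency."""
    wordcount = Counter(w for ss in sentences for w in ss)
    # most_common() lists words from most to least frequent; ids start at 2.
    worddict = {w: idx + 2 for idx, (w, _) in enumerate(wordcount.most_common())}
    return worddict, wordcount


toy_sentences = [["the", "cat"], ["the", "dog", "the"]]
toy_worddict, toy_wordcount = build_dict_counter(toy_sentences)
print(toy_worddict["the"], toy_wordcount["the"])   # 2 3
```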


In [8]:
# Recode tokens into integer ids
def generate_sequence(sentences, dictionary):
    '''
    Convert tokenized text into sequences of integers.
    Words missing from the dictionary are coded as 1 (UNK).
    '''
    seqs = [None] * len(sentences)
    for idx, ss in enumerate(sentences):
        seqs[idx] = [dictionary.get(w, 1) for w in ss]

    return seqs

In [9]:
# Create train and test data

#Read train sentences and generate target y
train_x_pos = generate_sequence(sentences_trn_pos, worddict)
train_x_neg = generate_sequence(sentences_trn_neg, worddict)
X_train_full = train_x_pos + train_x_neg
y_train_full = [1] * len(train_x_pos) + [0] * len(train_x_neg)

print(X_train_full[0], y_train_full[0])

[333, 6, 25, 17, 225, 85, 1203, 68, 300, 35, 6, 189, 7, 917, 4853, 3630, 24, 19, 1572, 4, 4289, 6, 25, 141, 902, 16298, 9, 183, 182, 41, 7579, 16907, 9, 6, 157, 21646, 4, 21, 40749, 123, 9, 46, 1634, 3210, 4, 1231, 24, 1545, 32, 21, 23991, 31, 9489, 44509, 22, 6, 679, 7480, 4] 1


In [10]:
#Read test sentences and generate target y
sentences_tst_pos = read_sentences(data_path+'test/pos/')
sentences_tst_neg = read_sentences(data_path+'test/neg/')

test_x_pos = generate_sequence(tokenize(sentences_tst_pos), worddict)
test_x_neg = generate_sequence(tokenize(sentences_tst_neg), worddict)
X_test_full = test_x_pos + test_x_neg
y_test_full = [1] * len(test_x_pos) + [0] * len(test_x_neg)

print(X_test_full[0])
print(y_test_full[0])

Tokenizing...
Done!
Tokenizing...
Done!
[4142, 33, 46, 772, 80, 3, 320, 13655, 299, 2, 1654, 7, 46, 338, 1249, 3, 625, 631, 5, 531, 81, 2026, 5, 75, 20, 6298, 9222, 23, 52, 1908, 4, 137, 3791, 8, 49716, 23, 52, 852, 474, 50, 6, 63, 339, 8, 98, 271, 49, 16, 45, 3, 29, 73, 52, 34981, 20, 2652, 14, 1, 3, 75, 96, 36, 604, 2, 757, 23, 52, 852, 3, 5, 20, 911, 8, 864, 171, 377, 75, 96, 98, 108221, 4, 7836, 49, 2, 338, 32442, 4, 415, 2228, 14, 6, 316, 187, 75, 96, 2603, 58, 3, 75, 569, 6, 1234, 97, 2, 4420, 23, 6, 3708, 4907, 4, 32, 15, 802, 1643, 166, 14, 172, 4718, 15238, 3, 29, 194, 16642, 14, 87, 4, 15, 20, 4718, 559, 4, 31, 12, 13, 10, 11, 12, 13, 10, 11, 7056, 45, 779, 3189, 1967, 5, 75, 20, 1057, 14, 6, 1013, 10899, 4, 565, 73, 16, 613, 50, 75, 76, 4087, 5, 7972, 23268, 6, 1550, 3, 75, 234, 52, 3708, 4907, 98, 3837, 5, 351, 4, 142, 6, 3515, 380, 75, 858, 8, 1926, 49, 2, 781, 1550, 5, 373, 8, 2384, 104, 3, 23, 85, 215, 7, 769, 4, 121482, 52, 144, 20, 14, 2605, 4, 12, 13, 10, 11, 12, 13, 

## Configure sequences for an RNN model
    - Remove words with low frequency
    - Truncate / complete sequences to the same length
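
To make the second step concrete, here is a rough NumPy sketch of truncating and padding to a fixed length. It assumes zeros are added on the left and that long sequences keep their last `maxlen` ids, which matches the padded output shown further down. The function name `pad_or_truncate` is hypothetical.

```python
import numpy as np


def pad_or_truncate(seqs, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` ids of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for i, seq in enumerate(seqs):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        tail = seq[-maxlen:]
        out[i, -len(tail):] = tail
    return out


padded = pad_or_truncate([[5, 6], [1, 2, 3, 4, 5]], maxlen=4)
print(padded)
# [[0 0 5 6]
#  [2 3 4 5]]
```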

In [11]:
# Median length (in tokens) of the test sequences
print('Median length: ', np.median([len(x) for x in X_test_full]))

Median length:  208.0
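
The median alone does not say how many reviews a given cut-off would truncate. A small sketch of one way to check this is shown below; `coverage` is a hypothetical helper and the lengths are made-up numbers, not values from the dataset.

```python
import numpy as np


def coverage(lengths, maxlen):
    """Fraction of sequences with length <= maxlen, i.e. not truncated."""
    lengths = np.asarray(lengths)
    return float(np.mean(lengths <= maxlen))


toy_lengths = [50, 120, 208, 400, 900]   # made-up lengths for illustration
toy_coverage = coverage(toy_lengths, maxlen=200)
print(toy_coverage)
print(np.percentile(toy_lengths, [50, 90]))
```

On the real data, the same call could be made with `[len(x) for x in X_train_full]` to judge a candidate `maxlen`.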


In [12]:
max_features = 50000  # keep word ids below this value (the most frequent words); ids >= max_features are recoded to 0
maxlen = 200  # fixed sequence length: longer texts are truncated, shorter ones are padded

In [13]:
# Keep ids below max_features (the most frequent words); recode the rest as 0.
# Note that 0 is also the default padding value of pad_sequences.
def remove_features(x):
    return [[0 if w >= max_features else w for w in sen] for sen in x]

X_train = remove_features(X_train_full)
X_test  = remove_features(X_test_full)
y_train = y_train_full
y_test = y_test_full

print(X_test[1])

[61, 9, 6, 1572, 4, 214, 6, 1352, 4931, 391, 91, 2, 7546, 529, 20, 1020, 2106, 4, 6712, 23, 106, 460, 17, 1577, 87, 62, 6431, 4644, 127, 3, 108, 7468, 5, 353, 3216, 4, 51, 18, 239, 294, 4788, 8, 2, 246, 16, 18, 7542, 4, 335, 139, 166, 105, 602, 28, 42, 71, 521, 44, 2, 594, 7, 10695, 7, 6, 421, 14, 2, 3089, 27, 29, 100, 35, 4921, 8, 84, 16, 3, 6, 244, 50, 6, 549, 1754, 14, 633, 1563, 4, 21, 80, 14646, 107, 18113, 1379, 5, 1373, 62, 3770, 89, 36, 373, 5, 13532, 49, 800, 2, 42650, 3080, 7, 2, 562, 3, 22, 111, 1995, 23, 2, 3506, 7, 2, 101, 1750, 337, 543, 104, 3, 1503, 188, 48, 89, 30, 671, 104, 14, 74, 4, 51, 569, 87, 6, 184, 646, 8, 98, 58, 7, 2, 8659, 17, 17281, 87, 138, 342, 19, 0, 109, 704, 4]


In [14]:
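# tensorflow.contrib was removed in TensorFlow 2.x; there the equivalent
# import is typically `from tensorflow.keras import preprocessing`.
from tensorflow.contrib.keras import preprocessing

# Truncate or pad the sequences to length = maxlen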
print("Pad sequences (samples x time)")

X_train = preprocessing.sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = preprocessing.sequence.pad_sequences(X_test, maxlen=maxlen)

print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print(X_test[0])

Pad sequences (samples x time)
X_train shape: (25000, 200)
X_test shape: (25000, 200)
[    2   338 32442     4   415  2228    14     6   316   187    75    96
  2603    58     3    75   569     6  1234    97     2  4420    23     6
  3708  4907     4    32    15   802  1643   166    14   172  4718 15238
     3    29   194 16642    14    87     4    15    20  4718   559     4
    31    12    13    10    11    12    13    10    11  7056    45   779
  3189  1967     5    75    20  1057    14     6  1013 10899     4   565
    73    16   613    50    75    76  4087     5  7972 23268     6  1550
     3    75   234    52  3708  4907    98  3837     5   351     4   142
     6  3515   380    75   858     8  1926    49     2   781  1550     5
   373     8  2384   104     3    23    85   215     7   769     4     0
    52   144    20    14  2605     4    12    13    10    11    12    13
    10    11   896     9     6   272    47  7034  9756     3 19787 12709
     3 21207    52   144     8   672  

In [15]:
# Shuffle data
from sklearn.utils import shuffle

X_train, y_train = shuffle(X_train, y_train, random_state=0)
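
The key point of this step is that rows and labels are reordered with the same permutation. A NumPy-only sketch of that idea follows; `shuffle_together` is a hypothetical helper and is not guaranteed to reproduce the same ordering as `sklearn.utils.shuffle`.

```python
import numpy as np


def shuffle_together(X, y, seed=0):
    """Apply one random permutation to both X and y so pairs stay aligned."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.RandomState(seed)
    perm = rng.permutation(len(X))
    return X[perm], y[perm]


X_demo = np.arange(10).reshape(5, 2)   # rows [0, 1], [2, 3], ...
y_demo = [0, 1, 2, 3, 4]
X_shuf, y_shuf = shuffle_together(X_demo, y_demo, seed=0)

# Each row still matches its label: the first column equals 2 * label.
print(np.array_equal(X_shuf[:, 0], 2 * y_shuf))   # True
```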

In [16]:
# Export train and test data
np.save(data_path + 'X_train', X_train)
np.save(data_path + 'y_train', y_train)
np.save(data_path + 'X_test',  X_test)
np.save(data_path + 'y_test',  y_test)


In [17]:
# Export worddict
import pickle

with open(data_path + 'worddict.pickle', 'wb') as pfile:
    pickle.dump(worddict, pfile)
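
For completeness, this is a sketch of how the exported files could be read back in a later notebook. It uses a temporary folder and small made-up objects in place of `data_path` and the real arrays. Note that `np.save` appends the `.npy` extension to the file name.

```python
import os
import pickle
import tempfile

import numpy as np

out_dir = tempfile.mkdtemp()   # stand-in for data_path

X_demo = np.array([[0, 2, 3], [4, 5, 6]])
worddict_demo = {"the": 2, "movie": 3}

# Save as in the cells above.
np.save(os.path.join(out_dir, "X_demo"), X_demo)   # written as X_demo.npy
with open(os.path.join(out_dir, "worddict_demo.pickle"), "wb") as pfile:
    pickle.dump(worddict_demo, pfile)

# Load the array and the dictionary back.
X_loaded = np.load(os.path.join(out_dir, "X_demo.npy"))
with open(os.path.join(out_dir, "worddict_demo.pickle"), "rb") as pfile:
    worddict_loaded = pickle.load(pfile)

# The inverse mapping (id -> word) is handy for decoding sequences back to text.
id_to_word = {idx: w for w, idx in worddict_loaded.items()}
print(id_to_word[2])   # the
```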
