# Data preprocessing
    - Download data to the server
    - Convert text to sequences
    - Configure sequences for an RNN model

## Download data to the server

### Command line on the server
    Go to the data folder:
        cd /home/ubuntu/data/training/keras
    Download dataset: 
        wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    Uncompress it:
        tar -zxvf aclImdb_v1.tar.gz
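
After extraction, it can help to confirm that the folders read later in this notebook actually exist. The following is a minimal sketch, not part of the original notebook: `missing_dirs` and `EXPECTED_DIRS` are hypothetical names, and the four sub-folders are inferred from the paths used in the cells of this notebook.

```python
import os

# Sub-folders this notebook reads from after extraction (assumed from its paths).
EXPECTED_DIRS = ['train/pos', 'train/neg', 'test/pos', 'test/neg']

def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-folders that do not exist under root."""
    return [d for d in expected
            if not os.path.isdir(os.path.join(root, *d.split('/')))]

# Example (path taken from the notebook; adjust it to your server):
# print(missing_dirs('/home/ubuntu/data/training/keras/aclImdb/'))
```

An empty list means all four folders were found.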

## Convert text to sequences
    - List all text files
    - Read files into Python
    - Tokenize
    - Create dictionaries to recode
    - Recode tokens into ids and create sentences

In [1]:
#Imports and paths
from __future__ import print_function

import numpy as np

data_path='/home/ubuntu/data/training/keras/aclImdb/'


In [2]:
# Generator that lists the files in a folder and its subfolders
import os
import shutil
import fnmatch

def gen_find(filepattern, toppath):
    '''
    Generator that recursively yields the files under toppath that match filepattern
    Inputs:
        filepattern(str): Shell-style wildcard pattern, e.g. "*.txt"
        toppath(str): Root path
    '''
    for path, dirlist, filelist in os.walk(toppath):
        for name in fnmatch.filter(filelist, filepattern):
            yield os.path.join(path, name)

#Test
#print(next(gen_find("*.txt", data_path+'train/pos/')))
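
`os.walk` returns files in whatever order the file system provides, so the first review printed in later cells may differ between machines. If a reproducible order matters, one option is to sort the listing. The sketch below is an assumption-laden alternative for Python 3 only: it uses `pathlib.Path.rglob`, and `find_sorted` is a hypothetical helper name, not something defined in this notebook.

```python
import os
import tempfile
from pathlib import Path

def find_sorted(filepattern, toppath):
    """Recursively list files under toppath matching filepattern, sorted by path."""
    return sorted(str(p) for p in Path(toppath).rglob(filepattern) if p.is_file())

# Small demonstration on a throwaway folder.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, 'pos'))
for name in ['b.txt', 'a.txt', 'notes.md']:
    with open(os.path.join(demo_root, 'pos', name), 'w') as f:
        f.write('example')

found = find_sorted('*.txt', demo_root)
print([os.path.basename(p) for p in found])
```

Unlike the generator, this builds the whole list in memory, which is usually acceptable for a few tens of thousands of file names.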

In [3]:
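def read_sentences(path):
    '''
    Read the first line of every .txt file under path
    (each review file is assumed to hold its whole text on one line)
    '''
    sentences = []
    sentences_list = gen_find("*.txt", path)
    for ff in sentences_list:
        with open(ff, 'r') as f:
            sentences.append(f.readline().strip())
    return sentences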

#Test
print(read_sentences(data_path+'train/pos/')[0:2])

["I read the book and saw the movie. Both excellent. The movie is diamond among coals during this era. Liebman and Selby dominate the screen and communicate the intensity of their characters without flaw. This film should have made them stars. Shame on the studio for not putting everything they had behind this film. It could have easily been a franchise. Release on DVD is a must and a worthy remake would revive this film. Look for it in your TV guide and if you see it listed, no matter how late, watch it. You won't be disappointed. Do yourself another favor - read the book (same title). It'll blow you away. Times have changed dramatically since those days, or at least we like to think they have.", "Let me first state that while I have viewed every episode of StarTrek at least twice, I do not consider myself a Trekker or Trekkie. Those are people who live in their parents basement and attend conventions wearing costumes with pointed rubber ears. I gave this movie a seven casting aside t

In [4]:
print(read_sentences(data_path+'train/neg/')[0:2])

['What Hopkins does succeed at with this effort as writer and director is giving us a sense that we know absolutely no one in the film. However, perhaps therein lies the problem. His movie has a lot of ambition and his intentions were obviously complex and drawn from very deep within, but it\'s so impersonal. There are no characters. We never know who anyone is, thus there is no investment on our part.<br /><br />It could be about a screenwriter intermingle with his own characters. Is it? Maybe. By that I don\'t mean that Slipstream is ambiguous; I mean that there is no telling. Hopkins\'s film is an experiment. On the face of it, one could make the case that it is about a would-be screenwriter, who at the very moment of his meeting with fate, realizes that life is hit and miss, and/or success is blind chance, as he is hurled into a "slipstream" of collisions between points in time, dreams, thoughts, and reality. Nevertheless, it is so unremittingly cerebral that it leaves no room for 

In [5]:
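def tokenize(sentences):
    '''
    Split each sentence into a list of tokens with NLTK
    '''
    from nltk import word_tokenize
    print('Tokenizing...')
    tokens = []
    for sentence in sentences:
        # Decode byte strings (Python 2 str) to unicode before tokenizing
        if isinstance(sentence, bytes):
            sentence = sentence.decode('utf-8')
        tokens.append(word_tokenize(sentence))
    print('Done!')

    return tokens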

print(tokenize(read_sentences(data_path+'train/pos/')[0:2]))

Tokenizing...
Done!
[[u'I', u'read', u'the', u'book', u'and', u'saw', u'the', u'movie', u'.', u'Both', u'excellent', u'.', u'The', u'movie', u'is', u'diamond', u'among', u'coals', u'during', u'this', u'era', u'.', u'Liebman', u'and', u'Selby', u'dominate', u'the', u'screen', u'and', u'communicate', u'the', u'intensity', u'of', u'their', u'characters', u'without', u'flaw', u'.', u'This', u'film', u'should', u'have', u'made', u'them', u'stars', u'.', u'Shame', u'on', u'the', u'studio', u'for', u'not', u'putting', u'everything', u'they', u'had', u'behind', u'this', u'film', u'.', u'It', u'could', u'have', u'easily', u'been', u'a', u'franchise', u'.', u'Release', u'on', u'DVD', u'is', u'a', u'must', u'and', u'a', u'worthy', u'remake', u'would', u'revive', u'this', u'film', u'.', u'Look', u'for', u'it', u'in', u'your', u'TV', u'guide', u'and', u'if', u'you', u'see', u'it', u'listed', u',', u'no', u'matter', u'how', u'late', u',', u'watch', u'it', u'.', u'You', u'wo', u"n't", u'be', u'disapp

In [6]:
sentences_trn_pos = tokenize(read_sentences(data_path+'train/pos/'))
sentences_trn_neg = tokenize(read_sentences(data_path+'train/neg/'))
sentences_trn = sentences_trn_pos + sentences_trn_neg


Tokenizing...
Done!
Tokenizing...
Done!


In [7]:
# Create the dictionary to convert words to numbers, ordered with the most frequent words first
def build_dict(sentences):
    '''
    Build the dictionaries from the training words
    Outputs: 
     - Dictionary of word --> word index
     - Dictionary of word --> word count freq
    '''
    print('Building dictionary..')
    wordcount = dict()
    # For each word in each sentence, accumulate its frequency
    for ss in sentences:
        for w in ss:
            if w not in wordcount:
                wordcount[w] = 1
            else:
                wordcount[w] += 1

    # list() keeps the indexing below valid on Python 3 as well
    counts = list(wordcount.values()) # List of frequencies
    keys = list(wordcount.keys()) # List of words, in the same order as counts

    sorted_idx = np.argsort(counts)[::-1] # Positions from most to least frequent

    worddict = dict()
    for idx, ss in enumerate(sorted_idx):
        worddict[keys[ss]] = idx+2  # ids 0 and 1 are reserved (1 = unknown word, UNK)
    print( np.sum(counts), ' total words ', len(keys), ' unique words')

    return worddict, wordcount


worddict, wordcount = build_dict(sentences_trn)

print(worddict['the'], wordcount['the'])

Building dictionary..
7056193  total words  135098  unique words
2 289298
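
The same two dictionaries can also be built with `collections.Counter` from the standard library. This is a sketch of an alternative, not the notebook's method: `build_dict_counter` is a hypothetical name, and words with equal counts may receive different ids than in the `argsort`-based version because ties are ordered differently.

```python
from collections import Counter

def build_dict_counter(sentences):
    """Return (word -> id, word -> count); ids start at 2, most frequent first."""
    wordcount = Counter(w for ss in sentences for w in ss)
    # most_common() lists (word, count) pairs from highest to lowest count.
    worddict = {w: idx + 2 for idx, (w, _) in enumerate(wordcount.most_common())}
    return worddict, wordcount

toy = [['the', 'film', 'the'], ['the', 'end']]
toy_dict, toy_count = build_dict_counter(toy)
print(toy_dict['the'], toy_count['the'])
```

As in the notebook, ids 0 and 1 are left unassigned so they stay available for padding and unknown words.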


In [8]:
# Recode tokens into integer ids
def generate_sequence(sentences, dictionary):
    '''
    Convert tokenized text into sequences of integers
    Words missing from the dictionary are coded as 1 (UNK)
    '''
    seqs = [None] * len(sentences)
    for idx, ss in enumerate(sentences):
        seqs[idx] = [dictionary.get(w, 1) for w in ss]

    return seqs
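
Because the dictionary is built from the training reviews only, some test tokens will be coded as 1. A quick way to see how often that happens is to measure the share of id 1 in the generated sequences. The helper below is a hypothetical sketch (`unknown_rate` is not part of the notebook) and runs on toy sequences rather than the real data.

```python
def unknown_rate(seqs, unk_id=1):
    """Fraction of tokens across all sequences that equal unk_id."""
    total = sum(len(s) for s in seqs)
    unknown = sum(1 for s in seqs for w in s if w == unk_id)
    return unknown / float(total) if total else 0.0

# Toy sequences: 2 of the 8 tokens are unknown.
toy_seqs = [[2, 1, 5], [1, 3, 4, 2, 6]]
print(unknown_rate(toy_seqs))
```

Applied to the real data, it would be called on the output of `generate_sequence` for the test reviews.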

In [9]:
# Create train and test data

#Read train sentences and generate target y
train_x_pos = generate_sequence(sentences_trn_pos, worddict)
train_x_neg = generate_sequence(sentences_trn_neg, worddict)
X_train_full = train_x_pos + train_x_neg
y_train_full = [1] * len(train_x_pos) + [0] * len(train_x_neg)

print(X_train_full[0], y_train_full[0])

[15, 357, 2, 302, 5, 234, 2, 25, 4, 1502, 350, 4, 21, 25, 9, 6565, 847, 33335, 347, 19, 1108, 4, 33177, 5, 29674, 8845, 2, 314, 5, 5961, 2, 3062, 7, 82, 122, 237, 3417, 4, 61, 26, 156, 37, 109, 112, 441, 4, 5224, 33, 2, 1216, 24, 36, 1600, 346, 48, 76, 543, 19, 26, 4, 51, 96, 37, 721, 95, 6, 3248, 4, 24263, 33, 308, 9, 6, 242, 5, 6, 1711, 1022, 66, 10263, 19, 26, 4, 2377, 24, 16, 14, 150, 280, 4872, 5, 78, 34, 84, 16, 3631, 3, 85, 578, 114, 573, 3, 130, 16, 4, 221, 538, 30, 39, 722, 4, 422, 655, 198, 2091, 91, 357, 2, 302, 28, 185, 469, 27, 4, 51, 254, 2976, 34, 271, 4, 4477, 37, 1185, 7409, 274, 172, 566, 3, 54, 43, 233, 100, 50, 8, 121, 48, 37, 4] 1


In [10]:
#Read test sentences and generate target y
sentences_tst_pos = read_sentences(data_path+'test/pos/')
sentences_tst_neg = read_sentences(data_path+'test/neg/')

test_x_pos = generate_sequence(tokenize(sentences_tst_pos), worddict)
test_x_neg = generate_sequence(tokenize(sentences_tst_neg), worddict)
X_test_full = test_x_pos + test_x_neg
y_test_full = [1] * len(test_x_pos) + [0] * len(test_x_neg)

print(X_test_full[0])
print(y_test_full[0])

Tokenizing...
Done!
Tokenizing...
Done!
[19503, 28, 2, 4152, 5189, 49, 32, 4132, 18, 6754, 31, 5, 307, 2161, 2341, 27, 9, 540, 4, 118, 38, 12972, 17511, 28, 1987, 11500, 27, 225, 38, 1316, 5, 28, 808, 32, 25426, 24, 32153, 31, 27, 46228, 104, 4, 472, 3, 165, 2366, 119, 3884, 8, 98, 97, 3, 13099, 9, 575, 5, 56, 52, 1492, 295, 97, 198, 5189, 41, 6884, 48, 830, 14, 146, 5, 7776, 6, 356, 388, 28, 1855, 1, 3, 11212, 21652, 27, 8, 217, 112, 8, 3411, 18, 8491, 8, 98, 46, 29487, 8, 113, 19503, 5, 17511, 174, 101, 202, 69, 12, 13, 10, 11, 12, 13, 10, 11, 137, 189, 148, 93, 16, 1004, 4, 416, 2, 268, 132, 2341, 8, 32, 4132, 18, 6754, 31, 28, 238, 7, 72, 81, 572, 27, 15, 20, 1009, 2, 279, 3, 29, 19, 183, 20, 885, 7, 277, 4, 21, 25, 88, 30, 217, 431, 765, 24, 6, 381, 28, 765, 159, 114, 96, 16, 59, 27, 5, 2, 436, 5, 1193, 35, 183, 200, 182, 4, 478, 68, 35, 6, 184, 2564, 2181, 1599, 1359, 14, 8, 4915, 224, 235, 493, 5, 2, 26, 136, 3085, 780, 4, 21, 25, 111, 56, 6, 184, 209, 692, 36, 273, 14, 6, 235, 

## Configure sequences for an RNN model
    - Remove words with low frequency
    - Truncate / pad sequences to the same length

In [11]:
# Median length (in tokens) of the test sequences
print('Median length: ', np.median([len(x) for x in X_test_full]))

Median length:  208.0
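
The median alone does not show how many reviews a given cut-off would shorten. A small helper can report that share for several candidate lengths. This is a sketch on toy data; `truncated_fraction` is a hypothetical name and the candidate values are illustrative only.

```python
def truncated_fraction(seqs, maxlen):
    """Fraction of sequences longer than maxlen, i.e. that would be cut."""
    if not seqs:
        return 0.0
    return sum(1 for s in seqs if len(s) > maxlen) / float(len(seqs))

# Four toy sequences with lengths 50, 150, 208 and 400.
toy = [[0] * n for n in (50, 150, 208, 400)]
for candidate in (100, 200, 500):
    print(candidate, truncated_fraction(toy, candidate))
```

A cut-off near the median shortens roughly half of the sequences, as the toy data illustrates.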


In [12]:
max_features = 50000 # Keep word ids below this value (the most frequent words); rarer words are recoded to 0
maxlen = 200  # Truncate or pad every sequence to this number of tokens

In [13]:
# Keep the ids of the most frequent words (ids below max_features); recode the others as 0
def remove_features(x):
    return [[0 if w >= max_features else w for w in sen] for sen in x]

X_train = remove_features(X_train_full)
X_test  = remove_features(X_test_full)
y_train = y_train_full
y_test = y_test_full

print(X_test[1])

[1050, 25, 3, 45, 785, 7, 6, 63, 25, 20, 68, 3, 80, 3, 177, 3, 251, 3, 5, 500, 4, 15, 20, 33, 2, 1551, 7, 86, 2473, 2, 236, 1732, 12, 13, 10, 11, 12, 13, 10, 11, 401, 910, 55, 16, 3, 9, 6, 524, 453, 26, 3, 29, 15, 435, 16, 64, 93, 131, 259, 453, 4198, 12, 13, 10, 11, 12, 13, 10, 11, 0, 1, 310, 1486, 6942, 3, 47, 9, 1993, 23, 38, 12772, 336, 4, 472, 6, 620, 14, 2, 12327, 5, 126, 57, 26794, 12, 13, 10, 11, 12, 13, 10, 11, 4695, 22829, 9, 334, 5, 940, 6, 992, 257, 4, 263, 9, 46, 350, 197, 8, 2, 856, 12, 13, 10, 11, 12, 13, 10, 11, 15, 435, 2, 9546, 7827, 833, 2, 155, 3, 83, 181, 100, 139, 17, 2, 305, 123, 71, 39, 588, 3, 100, 3668, 342, 346, 49, 38, 247, 7, 710, 5, 2211, 90, 846, 9, 2, 536, 59]
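
To judge a choice of `max_features`, it can be useful to know what share of all tokens the retained words account for. The sketch below computes that share from a word-count dictionary shaped like `wordcount`; `coverage` is a hypothetical helper and the counts are toy values. Since word ids start at 2 in this notebook, the filter `w >= max_features` keeps `max_features - 2` distinct words.

```python
def coverage(wordcount, n_words):
    """Share of all tokens accounted for by the n_words most frequent words."""
    counts = sorted(wordcount.values(), reverse=True)
    total = sum(counts)
    return sum(counts[:n_words]) / float(total) if total else 0.0

# Toy counts: the two most frequent words cover 9 of 10 tokens.
toy_count = {'the': 6, 'film': 3, 'zany': 1}
print(coverage(toy_count, 2))
```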


In [14]:
from keras.preprocessing import sequence

# Truncate or pad the sequences to length = maxlen
print("Pad sequences (samples x time)")

X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)

print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print(X_test[0])

Using TensorFlow backend.


Pad sequences (samples x time)
X_train shape: (25000, 200)
X_test shape: (25000, 200)
[ 2564  2181  1599  1359    14     8  4915   224   235   493     5     2
    26   136  3085   780     4    21    25   111    56     6   184   209
   692    36   273    14     6   235    25   159     6  1061   145   476
    28  2264   492     1    27    47     9  1106     5    36   275    24
   955     5     6   716    14    72     1    56    38  4812   149    57
     8   143    38 11482   691     4   320  3468    56     6   356  2057
   119    22     6 45613    12    13    10    11    12    13    10    11
    21   134     9    63   159     1     9   220     3  2564  2269     5
  1452   125 21652    88    30    37    94     8    60    29  2643    16
   149     5  2783 12881    28     2   620     7 19503    27     5 11500
    35   682    22     2  4152  7705     4   331   537   197     9    73
     2  7705    37   438    28    60    30   948    27     5    75  1616
    24     6  5722     5    40  8941  
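
The printed row starts in the middle of the first test review, which is consistent with `pad_sequences` keeping the end of long sequences by default. The pure-Python sketch below mimics that default behaviour as I understand it (pad at the front, truncate from the front); `pad_pre` is a hypothetical name, and it returns plain lists rather than a NumPy array.

```python
def pad_pre(seqs, maxlen, value=0):
    """Keep the last maxlen items of each sequence and left-pad with value."""
    out = []
    for s in seqs:
        s = list(s)
        if len(s) > maxlen:
            s = s[len(s) - maxlen:]                  # drop items from the front
        out.append([value] * (maxlen - len(s)) + s)  # pad at the front
    return out

print(pad_pre([[5, 6], [1, 2, 3, 4, 5]], maxlen=3))
```

Note that the padding value 0 is the same id that `remove_features` assigns to rare words, so the two cases cannot be told apart after padding.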

In [15]:
# Shuffle data
from sklearn.utils import shuffle
X_train, y_train = shuffle(X_train, y_train, random_state=0)
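
Shuffling matters here because `X_train` holds all positive reviews followed by all negative ones. The key property is that features and labels are reordered by the same permutation. The standard-library sketch below illustrates that idea on toy data; `shuffle_together` is a hypothetical helper and does not reproduce the exact ordering that `sklearn.utils.shuffle` produces.

```python
import random

def shuffle_together(X, y, seed=0):
    """Return X and y reordered by the same random permutation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]

# Each toy row carries its own label so alignment is easy to check.
X_toy = [[1, 1], [2, 2], [3, 3], [4, 4]]
y_toy = [1, 2, 3, 4]
X_s, y_s = shuffle_together(X_toy, y_toy)
print(all(x[0] == label for x, label in zip(X_s, y_s)))
```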