# Data preprocessing
    - Download data to the server
    - Convert text to sequences
    - Configure sequences for an RNN model

## Download data to the server

### Command line on the server
    Go to the data folder:
        cd /home/ubuntu/data/training/keras
    Download the dataset:
        wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    Uncompress it:
        tar -zxvf aclImdb_v1.tar.gz
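
For readers who prefer to stay inside Python, the same two steps can be sketched with the standard library. This is only an illustrative alternative to `wget` and `tar`, assuming Python 3; the helper names `extract_archive` and `download_and_extract` are hypothetical and not part of the notebook.

```python
import os
import tarfile
import urllib.request  # Python 3 only

# URL taken from the wget command above
DATA_URL = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir (equivalent of `tar -zxvf`)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)


def download_and_extract(url, dest_dir):
    """Download the archive into dest_dir if it is not there yet, then unpack it."""
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, os.path.basename(url))
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    extract_archive(archive_path, dest_dir)
    return archive_path


# Usage (not executed here because it needs network access):
# download_and_extract(DATA_URL, "/home/ubuntu/data/training/keras")
```

Only extract archives from sources you trust, since `extractall` writes whatever paths the archive contains.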

## Convert text to sequences
    - List all text files
    - Read files into Python
    - Tokenize
    - Create dictionaries to recode
    - Recode tokens into ids and create sentences
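
Before running the steps on the full dataset, it can help to see them on a toy corpus. The sketch below is not the notebook's code: it assumes whitespace splitting in place of NLTK's `word_tokenize`, uses `collections.Counter` for the frequency count, and all `toy_*` names are hypothetical. It keeps the same id convention as the cells that follow (ids start at 2, unknown words map to 1).

```python
from collections import Counter


def toy_tokenize(text):
    # Assumption: plain whitespace splitting stands in for nltk.word_tokenize
    return text.split()


def toy_build_dict(tokenized):
    # Count every token, then give the most frequent word the smallest id.
    counts = Counter(w for sent in tokenized for w in sent)
    # Ids start at 2 because 0 and 1 (UNK) are reserved
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common())}


def toy_encode(tokenized, worddict, unk_id=1):
    # Words that are not in the dictionary are recoded as unk_id
    return [[worddict.get(w, unk_id) for w in sent] for sent in tokenized]


toy_texts = ["the film is good", "the film is bad", "the end"]
toy_tokens = [toy_tokenize(t) for t in toy_texts]
toy_worddict = toy_build_dict(toy_tokens)

# "great" never appears in the toy corpus, so it becomes 1 (UNK)
toy_ids = toy_encode([toy_tokenize("the film is great")], toy_worddict)
print(toy_ids)
```

The most frequent word ("the") receives id 2, mirroring what the real dictionary does for the training reviews.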

In [1]:
#Imports and paths
from __future__ import print_function

import numpy as np

data_path='/home/ubuntu/data/training/keras/aclImdb/'


In [2]:
# Generator of list of files in a folder and subfolders
import os
import fnmatch

def gen_find(filepattern, toppath):
    '''
    Generator that recursively yields the files under toppath that match filepattern
    Inputs:
        filepattern(str): Shell-style wildcard pattern (e.g. "*.txt")
        toppath(str): Root path
    '''
    for path, dirlist, filelist in os.walk(toppath):
        for name in fnmatch.filter(filelist, filepattern):
            yield os.path.join(path, name)

#Test
#print(next(gen_find("*.txt", data_path+'train/pos/')))

In [3]:
def read_sentences(path):
    '''
    Read the first line of every .txt file found under path
    '''
    sentences = []
    sentences_list = gen_find("*.txt", path)
    for ff in sentences_list:
        with open(ff, 'r') as f:
            sentences.append(f.readline().strip())
    return sentences

#Test
print(read_sentences(data_path+'train/pos/')[0:2])

['Michael Is King. This film contains some of the best stuff Mike has ever done. Smooth Criminal is pure genius. The cameos are wonderful, but as always, the main event is MJ himself. He is the best, hands down.', 'This comment does contain spoilers!!<br /><br />There are few actors that have an intangible to them. That innate quality which is an amalgamation of charisma, panache and swagger. It\'s the quality that can separate good actors from the truly great. I think George Clooney has it and so does Jack Nicholson. You can look at Clooney\'s subtle touches in scenes like his one word good-bye to Andy Garcia in Ocean\'s 11 when they just utter each other\'s name disdainfully. "Terry." "Danny." You can pick any number of Jack\'s performances dating as far back as Five Easy Pieces in the diner to A Few Good Men and his court room interrogation scene. These guys just have it. You can add Denzel Washington to the small and exclusive list of actors who exudes that terrific trait in everyt

In [4]:
print(read_sentences(data_path+'train/neg/')[0:2])

["After Chaplin made one of his best films: Dough & Dynamite, he made one of his worst: Gentlemen Of Nerve. During this first year in films, Chaplin made about a third of all his films. Many of them were experimental in terms of ad-libbing, editing, gags, location shooting, etc. This one takes place at a racetrack where Chaplin and his friend try to get in without paying. Mabel Normand is there with her friend also, and Chaplin manages to rid himself of both his and Mabel's friends. He then woos Mabel in the grandstand with no apparent repercussions from his behavior. Lots of slapstick in here, but there is very little else to recommend this film for other then watching Chaplin develop. The print I saw was badly deteriorated, which may have affected its enjoyment. Charley Chase can be glimpsed. * of 4 stars.", "Please, for the love of God, don't watch it. Now saying that, I know what you're thinking, it can't be that bad can it? If everyone says it as bad as they say, I have to watch i

In [5]:
def tokenize(sentences):
    '''
    Split each sentence into a list of tokens with NLTK's word_tokenize
    '''
    from nltk import word_tokenize
    print('Tokenizing...')
    tokens = []
    for sentence in sentences:
        tokens.append(word_tokenize(sentence))
    print('Done!')

    return tokens

print(tokenize(read_sentences(data_path+'train/pos/')[0:2]))

Tokenizing...
Done!
[['Michael', 'Is', 'King', '.', 'This', 'film', 'contains', 'some', 'of', 'the', 'best', 'stuff', 'Mike', 'has', 'ever', 'done', '.', 'Smooth', 'Criminal', 'is', 'pure', 'genius', '.', 'The', 'cameos', 'are', 'wonderful', ',', 'but', 'as', 'always', ',', 'the', 'main', 'event', 'is', 'MJ', 'himself', '.', 'He', 'is', 'the', 'best', ',', 'hands', 'down', '.'], ['This', 'comment', 'does', 'contain', 'spoilers', '!', '!', '<', 'br', '/', '>', '<', 'br', '/', '>', 'There', 'are', 'few', 'actors', 'that', 'have', 'an', 'intangible', 'to', 'them', '.', 'That', 'innate', 'quality', 'which', 'is', 'an', 'amalgamation', 'of', 'charisma', ',', 'panache', 'and', 'swagger', '.', 'It', "'s", 'the', 'quality', 'that', 'can', 'separate', 'good', 'actors', 'from', 'the', 'truly', 'great', '.', 'I', 'think', 'George', 'Clooney', 'has', 'it', 'and', 'so', 'does', 'Jack', 'Nicholson', '.', 'You', 'can', 'look', 'at', 'Clooney', "'s", 'subtle', 'touches', 'in', 'scenes', 'like', 'his',

In [6]:
sentences_trn_pos = tokenize(read_sentences(data_path+'train/pos/'))
sentences_trn_neg = tokenize(read_sentences(data_path+'train/neg/'))
sentences_trn = sentences_trn_pos + sentences_trn_neg


Tokenizing...
Done!
Tokenizing...
Done!


In [7]:
# Create the dictionary to convert words to numbers. Order it with the most frequent words first
def build_dict(sentences):
    '''
    Build dictionary of train words
    Outputs: 
     - Dictionary of word --> word index
     - Dictionary of word --> word count freq
    '''
    print('Building dictionary..')
    wordcount = dict()
    # For each word in each sentence, accumulate its frequency
    for ss in sentences:
        for w in ss:
            if w not in wordcount:
                wordcount[w] = 1
            else:
                wordcount[w] += 1

    counts = list(wordcount.values()) # List of frequencies
    keys = list(wordcount) # List of words

    # Word positions sorted by decreasing frequency
    sorted_idx = reversed(np.argsort(counts))

    worddict = dict()
    for idx, ss in enumerate(sorted_idx):
        worddict[keys[ss]] = idx+2  # ids start at 2: 0 and 1 (UNK) are reserved
    print( np.sum(counts), ' total words ', len(keys), ' unique words')

    return worddict, wordcount


worddict, wordcount = build_dict(sentences_trn)

print(worddict['the'], wordcount['the'])

Building dictionary..
7056193  total words  135098  unique words
2 289298


In [8]:
# Recode tokens into ids
def generate_sequence(sentences, dictionary):
    '''
    Convert tokenized text into sequences of integers.
    Words missing from the dictionary are recoded as 1 (UNK).
    '''
    seqs = [None] * len(sentences)
    for idx, ss in enumerate(sentences):
        seqs[idx] = [dictionary[w] if w in dictionary else 1 for w in ss]

    return seqs

In [9]:
# Create train and test data

#Read train sentences and generate target y
train_x_pos = generate_sequence(sentences_trn_pos, worddict)
train_x_neg = generate_sequence(sentences_trn_neg, worddict)
X_train_full = train_x_pos + train_x_neg
y_train_full = [1] * len(train_x_pos) + [0] * len(train_x_neg)

print(X_train_full[0], y_train_full[0])

[492, 797, 778, 4, 61, 26, 1406, 62, 7, 2, 145, 567, 1944, 56, 147, 250, 4, 13398, 10409, 9, 1102, 1293, 4, 21, 3295, 35, 424, 3, 29, 22, 228, 3, 2, 305, 1557, 9, 9804, 326, 4, 154, 9, 2, 145, 3, 1002, 211, 4] 1


In [10]:
#Read test sentences and generate target y
sentences_tst_pos = read_sentences(data_path+'test/pos/')
sentences_tst_neg = read_sentences(data_path+'test/neg/')

test_x_pos = generate_sequence(tokenize(sentences_tst_pos), worddict)
test_x_neg = generate_sequence(tokenize(sentences_tst_neg), worddict)
X_test_full = test_x_pos + test_x_neg
y_test_full = [1] * len(test_x_pos) + [0] * len(test_x_neg)

print(X_test_full[0])
print(y_test_full[0])

Tokenizing...
Done!
Tokenizing...
Done!
[712, 17355, 18, 25, 1206, 1, 92, 392, 55, 49, 2, 174, 144, 3712, 97, 2, 2526, 7, 484, 36542, 14, 6, 11759, 14, 1418, 18044, 14, 2, 1466, 18, 4, 214, 228, 3, 17355, 3, 129, 2, 31532, 17, 40, 9, 3, 392, 8, 2, 6013, 94, 50, 45, 2, 366, 7, 38, 299, 4, 15, 183, 273, 19, 25, 4190, 44, 9154, 1582, 4, 15, 3099, 14, 2, 2819, 5, 71, 395, 2, 3064, 7, 2, 2478, 43, 101514, 2676, 28, 5264, 2050, 44, 62, 27, 53, 19, 25, 71, 136, 249, 599, 22, 88, 31672, 9893, 18288, 92, 4, 15, 121, 19, 9, 6, 63, 25, 72, 9, 64, 31532, 93, 16, 9, 771, 4, 118, 202, 3, 121, 7, 2, 74, 873, 16, 20, 109, 28, 9154, 27, 5, 67, 96, 39, 1657, 5, 319, 33, 26, 43, 17, 74, 4, 15, 37, 228, 273, 712, 17355, 8, 39, 2375, 736, 5, 65, 4609, 4, 15, 139, 131, 37, 8157, 2, 26, 22, 10973, 5, 1489, 2577, 4, 153, 34, 256, 19, 115, 3, 130, 31672, 9893, 18288, 92, 23, 2831, 765, 64672, 4, 15, 265, 6, 17355, 359, 5, 376, 2, 25, 24, 16, 18, 4118, 3, 36, 5479, 4, 214, 15, 37, 319, 3, 17355, 2505, 14, 19, 7

## Configure sequences for an RNN model
    - Remove words with low frequency
    - Truncate / pad sequences to the same length
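
The two steps can be illustrated on toy data with plain NumPy. This is a sketch, not the Keras implementation: `toy_remove_features` and `toy_pad_sequences` are hypothetical helpers, and the padding helper only mimics the default behaviour of Keras `pad_sequences` (`padding='pre'`, `truncating='pre'`), that is, zeros are added on the left and long sequences keep their last `maxlen` ids.

```python
import numpy as np


def toy_remove_features(seqs, max_features):
    # Ids at or above max_features belong to rare words and are recoded as 0
    return [[0 if w >= max_features else w for w in s] for s in seqs]


def toy_pad_sequences(seqs, maxlen):
    # Start from an all-zero matrix so that short sequences are left-padded with 0
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for i, s in enumerate(seqs):
        if not s:
            continue  # an empty sequence stays all zeros
        trunc = s[-maxlen:]            # keep only the last maxlen ids
        out[i, -len(trunc):] = trunc   # right-align them in the row
    return out


toy_seqs = [[2, 7, 12, 3], [5, 9]]
toy_filtered = toy_remove_features(toy_seqs, max_features=10)
toy_padded = toy_pad_sequences(toy_filtered, maxlen=3)
print(toy_padded)
```

In the first toy sequence the id 12 is replaced by 0 and the leading 2 is dropped by truncation; the second sequence gains one 0 on the left.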

In [11]:
# Median length (in tokens) of the test sequences
print('Median length: ', np.median([len(x) for x in X_test_full]))

Median length:  208.0
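
The median alone does not show how many sequences a given cut-off would shorten. A possible complement, sketched here on toy data with a hypothetical helper `length_summary`, is to report a few statistics of the length distribution together with the share of sequences longer than a candidate `maxlen`.

```python
import numpy as np


def length_summary(seqs, maxlen):
    """Summarize sequence lengths and the share of sequences longer than maxlen."""
    lengths = np.array([len(s) for s in seqs])
    return {
        "median": float(np.median(lengths)),
        "p90": float(np.percentile(lengths, 90)),
        "share_truncated": float(np.mean(lengths > maxlen)),
    }


# Toy sequences with lengths 1, 2, 3, 4 and 10
toy_lengths_demo = [[0] * n for n in (1, 2, 3, 4, 10)]
toy_summary = length_summary(toy_lengths_demo, maxlen=4)
print(toy_summary)
```

On the real data the call would be `length_summary(X_test_full, maxlen)`, once `maxlen` has been chosen.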


In [12]:
max_features = 50000 # Word ids below this value (the most frequent words) are kept; less frequent words are recoded to 0
maxlen = 200  # Length of every sequence: longer texts are truncated and shorter ones are padded with 0

In [13]:
# Keep the ids below max_features (the most frequent words) and recode the others as 0
def remove_features(x):
    return [[0 if w >= max_features else w for w in sen] for sen in x]

X_train = remove_features(X_train_full)
X_test  = remove_features(X_test_full)
y_train = y_train_full
y_test = y_test_full

print(X_test[1])

[41650, 0, 178, 85, 11828, 3, 1613, 24, 85, 9074, 3, 16, 18, 6, 85, 1783, 23014, 5721, 33, 2, 5768, 4, 5746, 88, 30, 491, 78, 40, 17422, 3, 78, 40, 3303, 45, 151, 2132, 5, 1848, 3, 19, 9, 26, 276, 43, 16, 18366, 4, 51, 18, 106, 8, 84, 6, 26, 4035, 624, 179, 737, 9, 30, 1, 23, 44, 2, 2196, 17, 39, 4, 5746, 9, 6, 1749, 7, 8809, 28956, 70, 7057, 1116, 5, 1176, 7055, 16, 23, 499, 58, 1407, 399, 125, 34, 195, 36, 300, 78, 34, 156, 39, 317, 271, 14, 7106, 54, 3184, 206, 1111, 3, 379, 115, 16, 18, 6, 9440, 10182, 5, 42, 36, 8, 39, 1065, 41, 51, 18, 36, 24, 352, 5, 15, 167, 6805, 114, 4742, 99, 218, 2, 26, 3, 475, 78, 34, 195, 36, 6, 359, 7, 5746, 18, 460, 3, 19, 227, 39, 6, 140, 296, 8, 7136, 3, 29, 0, 3, 16, 9, 6, 26, 72, 353, 56, 8, 39, 128, 4, 401, 1369, 31079, 26, 359, 156, 1056, 19, 4, 124, 2, 257, 7, 0, 30066, 69, 834, 60, 30, 34, 3572, 2990, 104, 41, 41]


In [14]:
from keras.preprocessing import sequence

# Truncate or pad the sequences to length = maxlen.
# With its default arguments, pad_sequences pads and truncates at the beginning of each sequence.
print("Pad sequences (samples x time)")

X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)

print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print(X_test[0])

Using TensorFlow backend.


Pad sequences (samples x time)
X_train shape: (25000, 200)
X_test shape: (25000, 200)
[ 9154  1582     4    15  3099    14     2  2819     5    71   395     2
  3064     7     2  2478    43     0  2676    28  5264  2050    44    62
    27    53    19    25    71   136   249   599    22    88 31672  9893
 18288    92     4    15   121    19     9     6    63    25    72     9
    64 31532    93    16     9   771     4   118   202     3   121     7
     2    74   873    16    20   109    28  9154    27     5    67    96
    39  1657     5   319    33    26    43    17    74     4    15    37
   228   273   712 17355     8    39  2375   736     5    65  4609     4
    15   139   131    37  8157     2    26    22 10973     5  1489  2577
     4   153    34   256    19   115     3   130 31672  9893 18288    92
    23  2831   765     0     4    15   265     6 17355   359     5   376
     2    25    24    16    18  4118     3    36  5479     4   214    15
    37   319     3 17355  2505    14  

In [15]:
# Shuffle data
from sklearn.utils import shuffle
X_train, y_train = shuffle(X_train, y_train, random_state=0)

In [16]:
# Export train and test data
np.save(data_path+'X_train', X_train)
np.save(data_path+'y_train', y_train)
np.save(data_path+'X_test', X_test)
np.save(data_path+'y_test', y_test)


In [17]:
# Export worddict
import pickle

with open(data_path + 'worddict.pickle', 'wb') as pfile:
    pickle.dump(worddict, pfile)
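
A later notebook will need to read these files back. The round trip below is a minimal sketch on toy data in a temporary folder, so it does not touch `data_path`; the `toy_*` names and the `<UNK>` / `<PAD>` labels are assumptions made for the illustration. Note that `np.save` appends `.npy` to the file name, so that extension has to be included when loading.

```python
import os
import pickle
import tempfile

import numpy as np

tmp_dir = tempfile.mkdtemp()

# Toy stand-ins for X_train and worddict
toy_X = np.array([[0, 2, 3], [2, 4, 1]])
toy_worddict = {"the": 2, "film": 3, "end": 4}

# Save in the same way as the cells above
np.save(os.path.join(tmp_dir, "X_train"), toy_X)  # written as X_train.npy
with open(os.path.join(tmp_dir, "worddict.pickle"), "wb") as pfile:
    pickle.dump(toy_worddict, pfile)

# Load back
X_loaded = np.load(os.path.join(tmp_dir, "X_train.npy"))
with open(os.path.join(tmp_dir, "worddict.pickle"), "rb") as pfile:
    worddict_loaded = pickle.load(pfile)

# Invert the dictionary to turn ids back into words
id2word = {i: w for w, i in worddict_loaded.items()}
reserved = {0: "<PAD>", 1: "<UNK>"}  # labels chosen for this sketch only
decoded = [id2word.get(int(i), reserved.get(int(i), "?")) for i in X_loaded[1]]
print(decoded)
```

Only unpickle files that you created yourself or that come from a trusted source.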
