# Table of Contents
1. [Get and prepare dataset](#Get-and-prepare-dataset)
2. [Truncate vocab and convert to numpy array](#Truncate-vocab-and-convert-to-numpy-array)
3. [Make every review 500 words long](#Make-every-review-500-words-long)
4. [Simple Embedding - Dense level](#Simple-Embedding---Dense-level)
5. [With a conv layer](#With-a-conv-layer)

# Sentiment analysis with the IMDB dataset

**Steps**
1. Get dataset --> train and test sets (cheating allowed) DONE
2. Truncate vocab to 5000 words DONE
3. Truncate and fill reviews to 500 words DONE
4. Create a simple model with Embedding and one Dense layer DONE
5. Create a convolutional model with max pool
6. Load pretrained embeddings (cheating allowed)
7. Create a multi-size CNN (several convolution widths in parallel)

In [15]:
from __future__ import division, print_function

## Get and prepare dataset

In [1]:
from keras.datasets import imdb

Using TensorFlow backend.


In [9]:
idx = imdb.get_word_index()
type(idx)

dict

In [12]:
idx_arr = sorted(idx, key=idx.get)
idx_arr[:10]

['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i']

In [13]:
# Reverse lookup: index -> word (.items() works on both Python 2 and 3)
idx2word = {v: k for k, v in idx.items()}
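
The index maps each word to its frequency rank, which is why sorting the keys by value lists the most common words first, and why inverting the dict gives an index-to-word lookup. A minimal sketch on a toy index (`toy_idx` and its values are made up for illustration, not taken from the real IMDB index):

```python
# Hypothetical word -> rank mapping standing in for imdb.get_word_index()
toy_idx = {'movie': 3, 'the': 1, 'great': 4, 'and': 2}

# Sort the words by rank: most frequent first
words_by_rank = sorted(toy_idx, key=toy_idx.get)

# Invert the mapping: rank -> word
rank2word = {rank: word for word, rank in toy_idx.items()}

# Decode a sequence of ranks back into text
decoded = ' '.join(rank2word[i] for i in [1, 4, 3])
print(words_by_rank)
print(decoded)
```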

In [18]:
# Download the reviews
from utils import get_file
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3
path = get_file('imdb_full.pkl',
                origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
                md5_hash='d091312047c43cf9e4e38fef92437263')
# Close the file as soon as the pickle is loaded
with open(path, 'rb') as f:
    (x_train, labels_train), (x_test, labels_test) = pickle.load(f)

In [19]:
len(x_train), len(x_test)

(25000, 25000)

In [20]:
x_train[999]

[54917,
 39,
 14,
 9,
 6,
 79,
 570,
 58086,
 6,
 28,
 4,
 1,
 88,
 2006,
 1377,
 105,
 10,
 25,
 123,
 107,
 1,
 3887,
 4,
 84242,
 2,
 3332,
 1378,
 6,
 748,
 13733,
 3,
 280,
 6529,
 15,
 1,
 520,
 1,
 428,
 13,
 3,
 13565,
 15,
 3,
 51770,
 2047,
 7648,
 8,
 55236,
 1,
 1341,
 432,
 5,
 8775,
 37479,
 31,
 10188,
 1091,
 3,
 19,
 11595,
 52,
 726,
 5,
 13763,
 1974,
 32,
 3432,
 8,
 704,
 848,
 196,
 105,
 2,
 27,
 3,
 1916,
 15,
 621]

In [21]:
# Convert to words
' '.join([idx2word[o] for o in x_train[999]])

"europa' or as it is also known zentropa' is one of the most visually stunning films i have ever seen the blend of grayscale and colour photography is near seamless a true feast for the eyes the picture was a contender for a 1991's golden palm in canners the award went to barton fink by coen brothers a film stylistically very similar to zentropa here's an exercise in class rent both films and be a judge for yourself"

In [22]:
labels_train[999]

1

## Truncate vocab and convert to numpy array

In [23]:
max_vocab = 5000

In [24]:
import numpy as np
train = [np.array([i if i < max_vocab else max_vocab for i in s]) 
         for s in x_train]

In [27]:
train[0]

array([5000,  309,    6,    3, 1069,  209,    9, 2175,   30,    1,  169,   55,   14,   46,   82,
       5000,   41,  393,  110,  138,   14, 5000,   58, 4477,  150,    8,    1, 5000, 5000,  482,
         69,    5,  261,   12, 5000, 5000, 2003,    6,   73, 2436,    5,  632,   71,    6, 5000,
          1, 5000,    5, 2004, 5000,    1, 5000, 1534,   34,   67,   64,  205,  140,   65, 1232,
       5000, 5000,    1, 5000,    4,    1,  223,  901,   29, 3024,   69,    4,    1, 5000,   10,
        694,    2,   65, 1534,   51,   10,  216,    1,  387,    8,   60,    3, 1472, 3724,  802,
          5, 3521,  177,    1,  393,   10, 1238, 5000,   30,  309,    3,  353,  344, 2989,  143,
        130,    5, 5000,   28,    4,  126, 5000, 1472, 2375,    5, 5000,  309,   10,  532,   12,
        108, 1470,    4,   58,  556,  101,   12, 5000,  309,    6,  227, 4187,   48,    3, 2237,
         12,    9,  215])

In [28]:
test = [np.array([i if i < max_vocab else max_vocab for i in s])
        for s in x_test]
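
The comprehension caps every index at `max_vocab`, so all words rarer than the cutoff share a single "rare word" id of 5000. An equivalent vectorized form uses `np.minimum`. This is a sketch on a made-up review (`review` is hypothetical data), not a change to the pipeline above:

```python
import numpy as np

max_vocab = 5000
# Hypothetical review: word ranks, some at or beyond the vocab cutoff
review = [54917, 39, 14, 5000, 4999, 58086]

# Comprehension form, as used in the notebook
capped_loop = np.array([i if i < max_vocab else max_vocab for i in review])

# Vectorized equivalent: every index >= max_vocab becomes max_vocab
capped_vec = np.minimum(np.array(review), max_vocab)

print(capped_vec)
```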

## Make every review 500 words long

In [29]:
from keras.preprocessing.sequence import pad_sequences

In [30]:
seq_len = 500

In [31]:
train = pad_sequences(train, maxlen=seq_len)

In [32]:
train[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,

In [33]:
test = pad_sequences(test, maxlen=seq_len)
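
With its default settings (`padding='pre'`, `truncating='pre'`, `value=0`), `pad_sequences` left-pads short reviews with zeros and keeps only the last `maxlen` items of long ones, which is why `train[0]` starts with a run of zeros. The NumPy function below is an illustrative stand-in for that default behaviour (`pad_pre` is a hypothetical helper, not part of Keras):

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Mimic pad_sequences defaults: pad on the left, truncate from the left."""
    out = np.full((len(seqs), maxlen), value, dtype='int32')
    for row, s in enumerate(seqs):
        s = list(s)[-maxlen:]        # keep the last maxlen items
        if s:                        # skip empty sequences
            out[row, -len(s):] = s   # right-align; padding stays on the left
    return out

padded = pad_pre([[7, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```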

## Simple Embedding - Dense level

In [54]:
# Data: train, labels_train (training); test, labels_test (validation)
from keras.layers.core import Dense, Dropout, Flatten
from keras.layers import Input, Embedding
from keras.regularizers import l2
from keras.models import Model

In [85]:
emb_size = 32

In [86]:
inp = Input(shape=(seq_len, ))
x = Embedding(max_vocab+1, emb_size)(inp)

In [87]:
x = Flatten()(x)

In [88]:
x = Dense(100, activation='relu')(x)

In [89]:
x = Dropout(0.7)(x)

In [90]:
out = Dense(1, activation='sigmoid')(x)

In [91]:
model = Model(inp, out)

In [92]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_6 (InputLayer)             (None, 500)           0                                            
____________________________________________________________________________________________________
embedding_6 (Embedding)          (None, 500, 32)       160032      input_6[0][0]                    
____________________________________________________________________________________________________
flatten_5 (Flatten)              (None, 16000)         0           embedding_6[0][0]                
____________________________________________________________________________________________________
dense_8 (Dense)                  (None, 100)           1600100     flatten_5[0][0]                  
___________________________________________________________________________________________
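
The parameter counts in the summary can be checked by hand. The embedding has one row per index from 0 to `max_vocab`, and the dense layer has one weight per flattened input per unit plus a bias per unit. A short arithmetic sketch using the values defined above:

```python
max_vocab, emb_size, seq_len = 5000, 32, 500

# Embedding: (max_vocab + 1) rows, each an emb_size vector
emb_params = (max_vocab + 1) * emb_size

# Flatten: seq_len * emb_size features per review
flat_units = seq_len * emb_size

# Dense(100): one weight per input per unit, plus one bias per unit
dense_params = flat_units * 100 + 100

print(emb_params, flat_units, dense_params)
```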

In [75]:
from keras.optimizers import Adam

In [93]:
opt = Adam()
model.compile(opt, 'binary_crossentropy', metrics=['accuracy'])

In [94]:
model.fit(train, labels_train, nb_epoch=2, validation_data=(test, labels_test),
          batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x12c31ac50>

## With a conv layer

In [84]:
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

In [100]:
inp = Input(shape=(seq_len,))
x = Embedding(max_vocab + 1, emb_size)(inp)
x = Dropout(0.2)(x)
x = Conv1D(64, 5, border_mode='same', activation='relu')(x)
x = Dropout(0.2)(x)
x = MaxPooling1D()(x)
x = Flatten()(x)
x = Dense(100, activation='relu')(x)
x = Dropout(0.7)(x)
out = Dense(1, activation='sigmoid')(x)

In [101]:
model = Model(inp, out)
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_10 (InputLayer)            (None, 500)           0                                            
____________________________________________________________________________________________________
embedding_10 (Embedding)         (None, 500, 32)       160032      input_10[0][0]                   
____________________________________________________________________________________________________
dropout_14 (Dropout)             (None, 500, 32)       0           embedding_10[0][0]               
____________________________________________________________________________________________________
convolution1d_4 (Convolution1D)  (None, 500, 64)       10304       dropout_14[0][0]                 
___________________________________________________________________________________________
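
The convolution layer's parameter count follows from its shape: each of the 64 filters spans 5 time steps across all 32 embedding channels and has one bias. With `border_mode='same'` the sequence length stays at 500, as the output shape in the summary shows. A quick check (the variable names are illustrative):

```python
emb_size = 32      # input channels coming from the embedding
n_filters = 64     # first argument to Conv1D
kernel_width = 5   # second argument to Conv1D

# Weights per filter: kernel_width * emb_size, plus one bias per filter
conv_params = kernel_width * emb_size * n_filters + n_filters
print(conv_params)
```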

In [102]:
# Use a fresh optimizer so no state is shared with the first model
model.compile(Adam(), 'binary_crossentropy', metrics=['accuracy'])

In [103]:
model.fit(train, labels_train, nb_epoch=2, validation_data=(test, labels_test),
          batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x12ca9d090>