# INTRODUCTION

In this Jupyter notebook we build a model that classifies movie reviews as positive or negative, using the IMDB dataset that ships with Keras. The first part of the notebook follows section 3.4 of the book Deep Learning with Python (DLWP) and replicates its baseline model. The later parts contain experiments intended to improve performance.

In [2]:
# modules
import itertools
import numpy as np
import tensorflow as tf

# shortcuts to submodules
imdb = tf.keras.datasets.imdb
models = tf.keras.models
layers = tf.keras.layers
optimizers = tf.keras.optimizers
losses = tf.keras.losses
metrics = tf.keras.metrics

In [3]:
# import dataset of reviews and their labels
# limit review text to the 10000 most frequently occurring words
NUM_WORDS = 10000
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words = NUM_WORDS)
min_index = min([min(s) for s in train_data])
max_index = max([max(s) for s in train_data])

In [4]:
print('summary of train_data')
print('---------------------')
print('type: ' + str(type(train_data)))
print('shape: ' + str(train_data.shape))
print('type of train_data[0]: ' + str(type(train_data[0])))
print('length of train_data[:5]: ' + str([len(train_data[i]) for i in range(5)]))
print('type of train_data[0][0]: ' + str(type(train_data[0][0])))
print('train_data[0][:10]: ' + str(train_data[0][:10]))
print('minimum entry in train_data: ' + str(min_index))
print('maximum entry in train_data: ' + str(max_index))
print()
print('summary of train_labels')
print('-----------------------')
print('type: ' + str(type(train_labels)))
print('shape: ' + str(train_labels.shape))
print('type of train_labels[0]: ' + str(type(train_labels[0])))
print('train_labels[:10]: ' + str(train_labels[:10]))
print('number of negative reviews: ' + str(np.sum(train_labels == 0)))
print('number of positive reviews: ' + str(np.sum(train_labels == 1)))
print()
print('summary of test_data')
print('--------------------')
print('shape: ' + str(test_data.shape))
print()
print('summary of test_labels')
print('----------------------')
print('shape: ' + str(test_labels.shape))
print('number of negative reviews: ' + str(np.sum(test_labels == 0)))
print('number of positive reviews: ' + str(np.sum(test_labels == 1)))

summary of train_data
---------------------
type: <class 'numpy.ndarray'>
shape: (25000,)
type of train_data[0]: <class 'list'>
length of train_data[:5]: [218, 189, 141, 550, 147]
type of train_data[0][0]: <class 'int'>
train_data[0][:10]: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]
minimum entry in train_data: 1
maximum entry in train_data: 9999

summary of train_labels
-----------------------
type: <class 'numpy.ndarray'>
shape: (25000,)
type of train_labels[0]: <class 'numpy.int64'>
train_labels[:10]: [1 0 0 1 0 0 1 0 1 0]
number of negative reviews: 12500
number of positive reviews: 12500

summary of test_data
--------------------
shape: (25000,)

summary of test_labels
----------------------
shape: (25000,)
number of negative reviews: 12500
number of positive reviews: 12500


In [5]:
# get word to index mapping and create reverse mapping from index to word
word_index = imdb.get_word_index()
reverse_word_index = { value:key for key,value in word_index.items() }

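# helper function to decode index-encoded reviews
# encoded values are offset by 3 relative to word_index; the lowest values are
# reserved markers (e.g. start of review, unknown word) and decode to '?'
def decode_review(encoded_review):
    return ' '.join([reverse_word_index.get(i-3, '?') for i in encoded_review])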

In [6]:
print()
print('number of entries word to index mapping dictionary: ' + str(len(word_index)))
print('number of entries index to word mapping dictionary: ' + str(len(reverse_word_index)))
print('low index values are common words: ' + str([reverse_word_index[i] for i in range(1,10)]))
print()
print('example of a positive review:')
print('-----------------------------')
positive_index = 6
print(decode_review(train_data[positive_index]))
print()
print('example of a negative review:')
print('-----------------------------')
negative_index = 2
print(decode_review(train_data[negative_index]))


number of entries word to index mapping dictionary: 88584
number of entries index to word mapping dictionary: 88584
low index values are common words: ['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it']

example of a positive review:
-----------------------------
? lavish production values and solid performances in this straightforward adaption of jane ? satirical classic about the marriage game within and between the classes in ? 18th century england northam and paltrow are a ? mixture as friends who must pass through ? and lies to discover that they love each other good humor is a ? virtue which goes a long way towards explaining the ? of the aged source material which has been toned down a bit in its harsh ? i liked the look of the film and how shots were set up and i thought it didn't rely too much on ? of head shots like most other films of the 80s and 90s do very good results

example of a negative review:
-----------------------------
? this has to be one of the worst films
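
The leading `?` in the decoded reviews comes from the index offset used by `decode_review`. The sketch below illustrates that offset on a toy vocabulary, assuming the default settings of `imdb.load_data` (0 for padding, 1 for the start marker, 2 for out-of-vocabulary words, real words shifted by 3). The names `toy_word_index` and `decode_toy_review` are hypothetical and the vocabulary is made up.

```python
# Toy stand-in for imdb.get_word_index(); words and ranks are invented.
toy_word_index = {"the": 1, "film": 2, "great": 3}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

def decode_toy_review(encoded_review):
    # Assumption: encoded values 0, 1, 2 are reserved markers,
    # so a word with rank r is stored as r + 3.
    return " ".join(toy_reverse_index.get(i - 3, "?") for i in encoded_review)

# 1 = start marker, 2 = out-of-vocabulary marker, 4/5/6 = ranks 1/2/3
decoded = decode_toy_review([1, 4, 5, 2, 6])
print(decoded)  # ? the film ? great
```

Every reserved marker maps to a key that is missing from the reverse dictionary, which is why it is rendered as `?`.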

## Encoding
Encode each review as a 10000-dimensional vector. Each component of this vector is 0 or 1, indicating the absence or presence of the corresponding word in the review. An alternative encoding stores how many times the corresponding word occurs in the review.

In [9]:
# binary encoding of reviews - 1 if word is present
def vectorize_sequences1( sequences, dimension = NUM_WORDS ):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i,sequence] = 1
    return results

# "histogram" encoding of reviews - counts number of times word is present
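def vectorize_sequences2( sequences, dimension = NUM_WORDS ):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # results[i,sequence] += 1 would add 1 only once per distinct index,
        # because fancy-index assignment does not accumulate repeated indices;
        # np.add.at performs an unbuffered add that counts every occurrence
        np.add.at(results[i], sequence, 1)
    return results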

x_train = vectorize_sequences1(train_data)
x_test = vectorize_sequences1(test_data)
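
As a quick check of the two encodings, here is a minimal sketch on a toy input with a 5-word vocabulary. The function names `binary_encode` and `count_encode` are hypothetical stand-ins, not the notebook's own helpers, and the count version uses `np.add.at` so that repeated indices accumulate.

```python
import numpy as np

def binary_encode(sequences, dimension):
    # 1 if the word index appears in the review, else 0
    out = np.zeros((len(sequences), dimension))
    for i, seq in enumerate(sequences):
        out[i, seq] = 1
    return out

def count_encode(sequences, dimension):
    # number of times each word index appears in the review
    out = np.zeros((len(sequences), dimension))
    for i, seq in enumerate(sequences):
        np.add.at(out[i], seq, 1)  # unbuffered add: repeats are counted
    return out

toy = [[1, 3, 3], [0, 2]]  # index 3 occurs twice in the first review
binary = binary_encode(toy, 5)
counts = count_encode(toy, 5)
print(binary[0])  # [0. 1. 0. 1. 0.]
print(counts[0])  # [0. 1. 0. 2. 0.]
```

The two encodings differ only where a word is repeated within a review.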

In [10]:
# convert labels to a float32 array (int64 --> float32) so that their dtype
# matches the floating-point sigmoid output they are compared against in the loss
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

In [11]:
# the model from the book
def build_model1():
    model = models.Sequential()
    model.add(layers.Dense(16, activation='relu', input_shape = (NUM_WORDS,)))
    model.add(layers.Dense(16, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    return model
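
The size of this network can be worked out by hand: a dense layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases. The sketch below applies that formula to the layer widths used in `build_model1`; the helper `dense_params` is a hypothetical name introduced for illustration.

```python
def dense_params(n_inputs, n_units):
    # weight matrix entries plus one bias per unit
    return n_inputs * n_units + n_units

# input width followed by the width of each Dense layer
layer_sizes = [10000, 16, 16, 1]
per_layer = [dense_params(a, b)
             for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [160016, 272, 17] 160305
```

Almost all of the parameters sit in the first layer, which maps the 10000-dimensional input to 16 units.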

In [12]:
model = build_model1()
model.compile(optimizer = 'rmsprop', loss = 'binary_crossentropy', metrics = ['accuracy'])



In [13]:
# set aside a validation set
VALIDATION_SIZE = 10000
x_val = x_train[:VALIDATION_SIZE]
partial_x_train = x_train[VALIDATION_SIZE:]
y_val = y_train[:VALIDATION_SIZE]
partial_y_train = y_train[VALIDATION_SIZE:]

In [14]:
history = model.fit(partial_x_train, partial_y_train, epochs = 20, 
                    batch_size = 512, validation_data = (x_val, y_val))

Train on 15000 samples, validate on 10000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
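
The object returned by `model.fit` keeps the per-epoch metrics in its `history` attribute, a dictionary of lists; when validation data is supplied it includes a `val_loss` entry. The sketch below shows one way to pick the epoch with the lowest validation loss. The numbers in `toy_history` are invented for illustration and are not results from this notebook.

```python
import numpy as np

# Hypothetical stand-in for history.history; values are made up.
toy_history = {"val_loss": [0.40, 0.31, 0.28, 0.29, 0.33]}

val_loss = np.asarray(toy_history["val_loss"])
best_epoch = int(np.argmin(val_loss)) + 1  # epochs are reported 1-based
print(best_epoch)  # 3
```

A validation loss that starts rising again after some epoch is the usual sign of overfitting, and that epoch is a natural place to stop training.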


In [7]:
# compute the relative frequency of each word index over all training reviews
flatlist = list(itertools.chain.from_iterable(train_data))  # concatenate reviews
freq = np.bincount(flatlist)/len(flatlist)  # occurrences of each index / total number of tokens
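
To make the frequency computation concrete, here is the same calculation on two made-up reviews. `np.bincount` returns, for each non-negative integer up to the largest one present, how many times it occurs; dividing by the total number of tokens turns the counts into relative frequencies. The names `toy_reviews` and `rel_freq` are hypothetical.

```python
import itertools
import numpy as np

toy_reviews = [[1, 4, 4], [1, 2, 4]]  # invented index-encoded reviews
flat = list(itertools.chain.from_iterable(toy_reviews))  # [1, 4, 4, 1, 2, 4]
rel_freq = np.bincount(flat) / len(flat)

print(rel_freq[4])  # 0.5, since index 4 accounts for 3 of the 6 tokens
```

The resulting array has one entry per index from 0 to the maximum index and sums to 1.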

In [8]:
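# create list of bigrams (pairs of adjacent indices) from the flat list
# note: because the reviews are concatenated, a pair is also formed across
# each boundary between two consecutive reviews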
bigramlist = list(zip(flatlist[:-1],flatlist[1:]))
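
Bigrams built from a concatenated list include pairs that straddle two reviews. If that is not wanted, one alternative is to form the pairs review by review and tally them with `collections.Counter`. This is a minimal sketch on made-up data; the helper `review_bigrams` is a hypothetical name and not part of the notebook.

```python
from collections import Counter

def review_bigrams(reviews):
    # yield adjacent index pairs within each review only
    for review in reviews:
        yield from zip(review[:-1], review[1:])

toy_reviews = [[1, 4, 4], [1, 2, 4]]  # invented index-encoded reviews
bigram_counts = Counter(review_bigrams(toy_reviews))

print(sorted(bigram_counts))  # [(1, 2), (1, 4), (2, 4), (4, 4)]
```

The pair `(4, 1)`, which would appear if the two reviews were concatenated first, is absent here.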