# Analyzing IMDB Data in Keras

In [1]:
# Imports
import numpy as np
import keras
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

Using TensorFlow backend.


## 1. Loading the data
This dataset comes bundled with Keras, so a single call to imdb.load_data gets us the training and testing data. The num_words parameter sets how many of the most frequent words we keep. We've set it to 5000, but feel free to experiment.

In [148]:
# Loading the data (it's preloaded in Keras)
num_features = 5000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_features)

print(x_train.shape)
print(x_test.shape)
print(y_test.shape)

(25000,)
(25000,)
(25000,)


## 2. Examining the data
Notice that the data has already been pre-processed: every word has been replaced by an integer index, and each review comes in as a list of the indices of the words it contains, in order. For example, the first review below starts with 1, 14, 22, 16; with the default settings of imdb.load_data, 1 marks the start of a review and 2 replaces any word outside the num_features most frequent ones.

The output comes as a vector of 1's and 0's, where 1 is a positive sentiment for the review, and 0 is negative.

In [149]:
print(x_train[0])
print(y_train[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 2, 19, 178, 32]
1
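To make the encoding concrete, here is a minimal sketch of how a list of indices maps back to words. It assumes the default settings of imdb.load_data, where 0 is padding, 1 marks the start of a review, 2 stands for an out-of-vocabulary word, and real words are stored as their frequency rank plus 3. The toy_word_index dictionary and the decode_review helper are made up for illustration; the real mapping would come from imdb.get_word_index().

```python
# Toy stand-in for imdb.get_word_index(): maps word -> frequency rank (1 = most frequent).
# These five entries are hypothetical and chosen only for illustration.
toy_word_index = {"the": 1, "and": 2, "a": 3, "movie": 4, "great": 5}

# Assumed defaults of imdb.load_data: 0 = padding, 1 = start, 2 = out-of-vocabulary,
# and every real word is stored as rank + 3.
INDEX_FROM = 3
id_to_word = {rank + INDEX_FROM: word for word, rank in toy_word_index.items()}
id_to_word.update({0: "<PAD>", 1: "<START>", 2: "<OOV>"})


def decode_review(encoded_review):
    """Turn a list of integer ids back into a readable string."""
    return " ".join(id_to_word.get(i, "<OOV>") for i in encoded_review)


toy_review = [1, 4, 7, 8, 2]
decoded = decode_review(toy_review)
print(decoded)  # <START> the movie great <OOV>
```

Swapping toy_word_index for the dictionary returned by imdb.get_word_index() should decode x_train[0] the same way, as long as the data was loaded with the default index offsets.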


## 3. One-hot encoding the input
Here, we'll turn each review into a (0,1)-vector of length num_features. For example, if the pre-processed review contains the number 14, then in the processed vector, the entry at index 14 will be 1.

In [150]:
# One-hot encoding the input: each review becomes a binary vector of length num_features
tokenizer = Tokenizer(num_words=num_features)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
# print(x_train[0])  # each review is now a 0/1 vector with num_features columns
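As a rough illustration of what the binary mode of sequences_to_matrix produces, the following NumPy sketch builds the same kind of matrix by hand. The multi_hot function and the toy reviews are hypothetical; the sketch assumes that indices at or above num_features are skipped and that repeated words still give a single 1.

```python
import numpy as np


def multi_hot(sequences, num_features):
    """Encode each sequence of word indices as a 0/1 vector of length num_features."""
    matrix = np.zeros((len(sequences), num_features), dtype="float32")
    for row, seq in enumerate(sequences):
        # Keep only indices that fit in the vector; duplicates collapse to a single 1.
        kept = [i for i in seq if i < num_features]
        matrix[row, kept] = 1.0
    return matrix


toy_reviews = [[1, 3, 3, 5], [2, 7, 9]]  # 9 is out of range for 8 features
encoded = multi_hot(toy_reviews, num_features=8)
print(encoded)
```

Note that this encoding discards word order and word counts: it only records whether each word appears in the review.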

In [151]:
len(x_train[0])

5000

In [152]:
print(x_train.shape)
print(x_test.shape)

(25000, 5000)
(25000, 5000)


And we'll also one-hot encode the output.

In [153]:
# One-hot encoding the output
num_classes = 2
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(y_train.shape)
print(y_test.shape)

(25000, 2)
(25000, 2)
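The label encoding can be pictured with a short NumPy sketch. This is not the Keras implementation, only an equivalent for integer labels in the range 0 to num_classes - 1; the one_hot_labels helper and the toy labels are made up for illustration.

```python
import numpy as np


def one_hot_labels(labels, num_classes):
    """Map integer class labels to one-hot rows, e.g. 1 -> [0, 1] when num_classes is 2."""
    labels = np.asarray(labels, dtype=int)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes, dtype="float32")[labels]


toy_labels = [1, 0, 1]
encoded_labels = one_hot_labels(toy_labels, num_classes=2)
print(encoded_labels)
```

With this layout, column 0 stands for a negative review and column 1 for a positive one, which matches the two softmax outputs of the model built in the next section.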


## 4. Building the model architecture
Build a model here using Sequential. Feel free to experiment with different layers and sizes! Also, experiment with adding dropout to reduce overfitting.

In [154]:
# TODO: Build the model architecture
model = Sequential()
model.add(Dense(128, activation='sigmoid', input_shape=(num_features,)))  # each input is a vector of length num_features
model.add(Dropout(.2))
model.add(Dense(64, activation='sigmoid'))
model.add(Dropout(.1))
model.add(Dense(2, activation='softmax'))  # 2 output classes: negative, positive

# TODO: Compile the model using a loss function and an optimizer.
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_49 (Dense)             (None, 128)               640128    
_________________________________________________________________
dropout_33 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_50 (Dense)             (None, 64)                8256      
_________________________________________________________________
dropout_34 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_51 (Dense)             (None, 2)                 130       
Total params: 648,514.0
Trainable params: 648,514.0
Non-trainable params: 0.0
_________________________________________________________________
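The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers have no parameters. The sketch below redoes that arithmetic for the layer sizes used above; the dense_params helper is just an illustrative name.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


num_features = 5000
layer_sizes = [num_features, 128, 64, 2]  # input, two hidden layers, output

# Pair each layer size with the next one to get (inputs, units) for every Dense layer.
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [640128, 8256, 130] 648514
```

Almost all of the parameters sit in the first layer, so its size grows directly with num_features.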


## 5. Training the model
Train the model here. Experiment with different batch sizes and numbers of epochs!

In [155]:
import time

In [156]:
# TODO: Run the model. Feel free to experiment with different batch sizes and number of epochs.
t0 = time.time()
model.fit(x_train, y_train, epochs=20, batch_size=500, verbose=0)  # 25000 training samples in total
print("Run time: ", time.time() - t0)  # elapsed time in seconds

Run time:  65.99659895896912


## 6. Evaluating the model
This will give you the accuracy of the model, as evaluated on the testing set. Can you get something over 85%?

In [157]:
score = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: ", score[1])

Accuracy:  0.8662
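For one-hot labels and a softmax output, the reported accuracy amounts to comparing the most probable predicted class with the true class. The following NumPy sketch shows that computation on made-up predictions; it is an illustration of the idea, not the code Keras runs internally.

```python
import numpy as np


def accuracy(y_true_one_hot, y_pred_probs):
    """Fraction of rows where the most probable predicted class is the true class."""
    true_classes = np.argmax(y_true_one_hot, axis=1)
    pred_classes = np.argmax(y_pred_probs, axis=1)
    return float(np.mean(true_classes == pred_classes))


# Hypothetical labels and softmax outputs for four reviews.
toy_true = np.array([[0, 1], [1, 0], [0, 1], [1, 0]])
toy_pred = np.array([[0.2, 0.8], [0.6, 0.4], [0.7, 0.3], [0.9, 0.1]])

toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)  # 0.75 (the third review is misclassified)
```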


## 7. Results from different parameters

### Change epochs
- epochs = 5, batch_size=500, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.846
    - Run time:  3.30
- epochs = 10, batch_size=500, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.84696
    - Run time:  7.26
- epochs = 20, batch_size=500, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.84772
    - Run time:  14.27
- epochs = 50, batch_size=500, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy: 0.84632
    - Run time:  35.62
- epochs = 100, batch_size=500, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.84296
    - Run time:  68.66
- Increasing epochs up to 20 slightly increases test accuracy (0.846 to 0.84772). Beyond 20 epochs, test accuracy drops (to 0.84296 at 100 epochs), most likely due to overfitting. Run time grows roughly linearly with the number of epochs.
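The two observations about epochs can be checked directly from the numbers reported above. This short sketch picks the run with the best test accuracy and computes the time per epoch, which should stay roughly constant if run time is linear in the number of epochs. The run times are taken to be in seconds, since they come from time.time().

```python
# (epochs, test accuracy, run time in seconds), copied from the results listed above.
epoch_runs = [
    (5, 0.846, 3.30),
    (10, 0.84696, 7.26),
    (20, 0.84772, 14.27),
    (50, 0.84632, 35.62),
    (100, 0.84296, 68.66),
]

# Run with the highest test accuracy.
best_epochs, best_accuracy, _ = max(epoch_runs, key=lambda run: run[1])

# If run time is linear in epochs, seconds per epoch should be about the same everywhere.
seconds_per_epoch = [round(seconds / epochs, 3) for epochs, _, seconds in epoch_runs]

print(best_epochs, best_accuracy)
print(seconds_per_epoch)
```

Since each configuration was run once, differences in the third decimal place of accuracy may be within run-to-run noise.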
### Change batch size
- epochs = 20, batch_size=100, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.84648
    - Run time:  21.89
- epochs = 20, batch_size=250, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy: 0.84752
    - Run time:  15.39
- epochs = 20, batch_size=500, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.84772
    - Run time:  14.27
- epochs = 20, batch_size=1000, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.84724
    - Run time:  12.29
- epochs = 20, batch_size=5000, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.84764
    - Run time:  12.37
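- Accuracy is nearly flat across batch sizes (0.84648 to 0.84772): batch_size=100 gives the lowest value and batch_size=500 the highest. Larger batches mainly shorten the run time, from 21.89 at batch_size=100 to about 12.3 at batch_size=1000 and above.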
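One way to read the run times is to count weight updates: each batch triggers one update, so smaller batches mean more updates per epoch. The sketch below does this arithmetic for the 25000 training samples and 20 epochs used in these runs.

```python
import math

n_train = 25000  # number of training samples, from x_train.shape
epochs = 20

updates = {}
for batch_size in [100, 250, 500, 1000, 5000]:
    # One weight update per batch; the last batch may be smaller than batch_size.
    steps_per_epoch = math.ceil(n_train / batch_size)
    updates[batch_size] = steps_per_epoch * epochs
    print(batch_size, steps_per_epoch, updates[batch_size])
```

Going from batch_size=100 to batch_size=5000 cuts the number of updates by a factor of 50, while the measured run time falls by less than half, so the cost per update clearly grows with the batch size.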
### Change num_features
- epochs = 20, batch_size=500, num_features = 500, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.8116
    - Run time:  10.42
- epochs = 20, batch_size=500, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.84772
    - Run time:  14.27
- epochs = 20, batch_size=500, num_features = 2000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.85708
    - Run time:  26.22
- epochs = 20, batch_size=500, num_features = 5000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.86348
    - Run time:  66.58
- Keeping more words (a larger num_features) increases accuracy, from 0.8116 at 500 words to 0.86348 at 5000 words, but the model takes longer to train (10.42 vs 66.58).
### Change activation
- epochs = 20, batch_size=500, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.84772
    - Run time:  14.27
- epochs = 20, batch_size=500, num_features = 1000, activation='sigmoid', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.86056
    - Run time:  16.39
- In these runs, sigmoid activation gives higher accuracy than relu (0.86056 vs 0.84772).
### Change loss
- epochs = 20, batch_size=500, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.84772
    - Run time:  14.27
- epochs = 20, batch_size=500, num_features = 1000, activation='relu', loss = 'mean_squared_error', optimizer='adam'
    - Accuracy:  0.84824
    - Run time:  15.88
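- mean_squared_error scores marginally higher than categorical_crossentropy here (0.84824 vs 0.84772). The gap is small enough that it may not be significant.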
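To see how the two losses differ, the sketch below evaluates both on the same made-up softmax outputs, once for confidently correct predictions and once for confidently wrong ones. The functions are plain NumPy versions of the standard formulas, written for illustration rather than taken from Keras.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))


def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences over all outputs and samples."""
    return float(np.mean((y_true - y_pred) ** 2))


y_true = np.array([[0.0, 1.0], [1.0, 0.0]])
confident_right = np.array([[0.1, 0.9], [0.9, 0.1]])  # hypothetical predictions
confident_wrong = np.array([[0.9, 0.1], [0.1, 0.9]])  # hypothetical predictions

ce_right = categorical_crossentropy(y_true, confident_right)
ce_wrong = categorical_crossentropy(y_true, confident_wrong)
mse_right = mean_squared_error(y_true, confident_right)
mse_wrong = mean_squared_error(y_true, confident_wrong)
print(ce_right, ce_wrong)
print(mse_right, mse_wrong)
```

For probabilities between 0 and 1 the squared error can never exceed 1, whereas cross-entropy grows without bound as the predicted probability of the true class approaches 0.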
### Change optimizer
- epochs = 20, batch_size=500, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.84772
    - Run time:  14.27
- epochs = 20, batch_size=500, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='rmsprop'
    - Accuracy:  0.8394
    - Run time:  14.62
- epochs = 20, batch_size=500, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='sgd'
    - Accuracy:  0.80852
    - Run time:  14.45
- adam gives the highest accuracy of the three optimizers (0.84772), followed by rmsprop (0.8394) and sgd (0.80852).

## 8. Combine parameters
- epochs = 20, batch_size=500, num_features = 1000, activation='relu', loss = 'categorical_crossentropy', optimizer='adam'
    - Accuracy:  0.84772
    - Run time:  14.27
- epochs = 20, batch_size=500, num_features = 5000, activation='sigmoid', loss = 'mean_squared_error', optimizer='adam'
    - Accuracy:  0.8662
    - Run time:  65.99
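- Combining num_features = 5000, sigmoid activation, and mean_squared_error loss achieves the highest accuracy, 86.62%, compared with 84.77% for the baseline.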