# Objective

1. Build an RNN model for the IMDB dataset.
2. Tune the hyperparameters for better accuracy, such as the number of layers, the number of nodes in each layer, the optimizer, the learning rate, etc.

# Prepare Environment

In [1]:
%env KERAS_BACKEND=tensorflow
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

env: KERAS_BACKEND=tensorflow


# Prepare Data
1. Load data

In [2]:
from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

Using TensorFlow backend.


2. Pad sequences
    - Make all the reviews the same length (100 words).
    - If a review is too long, truncate it; otherwise, pad it with zeros. By default, `pad_sequences` truncates and pads at the beginning of the sequence.

In [3]:
from keras.preprocessing import sequence
x_train = sequence.pad_sequences(x_train, maxlen=100)
x_test = sequence.pad_sequences(x_test, maxlen=100)
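
To make the padding and truncation concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default arguments (`padding='pre'`, `truncating='pre'`). The helper name `pad_pre` and the toy reviews are made up for illustration and are not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper mimicking the default behaviour of pad_sequences."""
    seq = list(seq)
    if len(seq) > maxlen:
        # Too long: drop tokens from the front and keep the last `maxlen` tokens.
        seq = seq[len(seq) - maxlen:]
    # Too short: fill the front with `value` until the length is `maxlen`.
    return [value] * (maxlen - len(seq)) + seq


short_review = [11, 12, 13]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_pre(short_review, 5))  # [0, 0, 11, 12, 13]
print(pad_pre(long_review, 5))   # [3, 4, 5, 6, 7]
```

With `maxlen=100`, a long review therefore keeps only its last 100 word indices.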

# Build RNN

In [4]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, GRU

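def rnn(embed_dim=128, lstm=True, units=150, loss='binary_crossentropy', optimizer='adam', batch_size=32, epochs=15):
    """Build, train and evaluate one model; return (train_accuracy, test_accuracy)."""
    model = Sequential()
    # Map each of the 10000 word indices to a dense vector of size embed_dim.
    model.add(Embedding(10000, embed_dim))
    # Recurrent layer: LSTM by default, GRU when lstm=False.
    if lstm:
        model.add(LSTM(units))
    else:
        model.add(GRU(units))
    # Single sigmoid unit: probability that the review is positive.
    model.add(Dense(1, activation='sigmoid'))

    model.summary()
    model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)
    # evaluate() returns [loss, accuracy]; keep only the accuracy.
    train_accuracy = model.evaluate(x_train, y_train)[1]
    test_accuracy = model.evaluate(x_test, y_test)[1]
    return (train_accuracy, test_accuracy)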

### Default setting from class
    - Embed the 10000-word vocabulary into 128 dimensions
    - Add an LSTM layer with 150 cells
    - Loss function: binary crossentropy
    - Optimizer: adam
    - Batch size: 32
    - Epochs: 15
    - Results (train accuracy, test accuracy): (0.9962, 0.82784)

In [5]:
print('default setting:', rnn())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 128)         1280000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 150)               167400    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 151       
Total params: 1,447,551
Trainable params: 1,447,551
Non-trainable params: 0
_________________________________________________________________
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
default setting: (0.9962, 0.82784)
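
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the function names are hypothetical helpers, not Keras API.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per word index.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, outputs):
    # One weight per input plus one bias, for every output unit.
    return outputs * (input_dim + 1)


vocab_size, embed_dim, units = 10000, 128, 150
counts = {
    'embedding': embedding_params(vocab_size, embed_dim),
    'lstm': lstm_params(embed_dim, units),
    'dense': dense_params(units, 1),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f'{name:10s} {n:>10,}')
print(f'{"total":10s} {total:>10,}')
```

The three values agree with the `Param #` column above (1,280,000, 167,400 and 151), so almost 90% of the parameters sit in the embedding layer.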


# Tune Parameters
1. Number of epochs
    - LSTM training is very slow, so examine how many epochs are enough for acceptable results.
    - Results
        * epochs = 1 : (0.90556, 0.84448)
        * epochs = 2 : (0.93988, 0.84452)
        * epochs = 3 : (0.96588, 0.84432)
        * epochs = 5 : (0.961, 0.81728)
        * epochs = 10 : (0.99292, 0.8192)
        * epochs = 15 : (0.99856, 0.833)
        * epochs = 20 : (0.9966, 0.82776)
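    - Surprisingly, the test accuracy was already quite good after 1 epoch and peaked at 2 epochs. More epochs increase the accuracy on the training data but do not necessarily improve the accuracy on the test data, which suggests overfitting. Each setting was trained from scratch, so some run-to-run variation is also visible: 15 epochs gave 0.833 here but 0.82784 in the default run above.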
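
One way to read these numbers is to look at the gap between training and test accuracy. The sketch below only re-uses the results listed above; the variable names are illustrative.

```python
# (train accuracy, test accuracy) per number of epochs, copied from the results above.
results = {
    1: (0.90556, 0.84448),
    2: (0.93988, 0.84452),
    3: (0.96588, 0.84432),
    5: (0.961, 0.81728),
    10: (0.99292, 0.8192),
    15: (0.99856, 0.833),
    20: (0.9966, 0.82776),
}

# Generalization gap: how much better the model does on data it has already seen.
gaps = {epochs: train - test for epochs, (train, test) in results.items()}

# Setting with the highest test accuracy.
best_epochs = max(results, key=lambda epochs: results[epochs][1])

for epochs, (train, test) in results.items():
    print(f'epochs={epochs:2d}  train={train:.5f}  test={test:.5f}  gap={gaps[epochs]:.5f}')
print('best test accuracy at epochs =', best_epochs)
```

The gap grows from about 0.06 after 1 epoch to about 0.17 after 20 epochs while the test accuracy stays flat or drops. Picking the number of epochs from test accuracy is slightly optimistic; a held-out validation split would likely be a cleaner way to choose it.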

In [6]:
for epochs in [1, 2, 3, 5, 10, 15, 20]:
    print('epochs =', epochs, ':', rnn(epochs=epochs))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 128)         1280000   
_________________________________________________________________
lstm_2 (LSTM)                (None, 150)               167400    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 151       
Total params: 1,447,551
Trainable params: 1,447,551
Non-trainable params: 0
_________________________________________________________________
Epoch 1/1
epochs = 1 : (0.90556, 0.84448)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, None, 128)         1280000   
_________________________________________________________________
lstm_3 (LSTM)                (None, 150)               167400    
________________________

Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
epochs = 15 : (0.99856, 0.833)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, None, 128)         1280000   
_________________________________________________________________
lstm_8 (LSTM)                (None, 150)               167400    
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 151       
Total params: 1,447,551
Trainable params: 1,447,551
Non-trainable params: 0
_________________________________________________________________
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
epochs = 20 : (0.9966, 0.82776)


2. Batch size
    - Results
        * batch_size = 10 : (0.97492, 0.85244), 12ms/step
        * batch_size = 32 : (0.96528, 0.83848), 4ms/step
        * batch_size = 100 : (0.96432, 0.83996), 1ms/step
        * batch_size = 200 : (0.94932, 0.83884), 660us/step
    - A smaller batch size gives more weight updates per epoch, which helps the model approach the optimum. However, a smaller batch size also takes much more time.
    - If there is enough training data, a bigger batch size reduces the training time and can still converge to acceptable accuracy.
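
The number of weight updates behind each batch size can be counted directly. This sketch assumes the standard IMDB training split of 25000 reviews and the 3 epochs used in this experiment.

```python
import math

n_train = 25000  # assumed size of the IMDB training set
epochs = 3       # as in the batch-size experiment

# One weight update per batch; the last batch of an epoch may be smaller.
updates = {
    batch_size: math.ceil(n_train / batch_size) * epochs
    for batch_size in [10, 32, 100, 200]
}

for batch_size, n_updates in updates.items():
    print(f'batch_size={batch_size:3d}  updates={n_updates}')
```

A batch size of 10 performs 20 times as many updates as a batch size of 200 (7500 vs. 375) over the same data.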

In [7]:
for batch_size in [10, 32, 100, 200]:
    print('batch_size =', batch_size, ':', rnn(batch_size=batch_size, epochs=3))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, None, 128)         1280000   
_________________________________________________________________
lstm_9 (LSTM)                (None, 150)               167400    
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 151       
Total params: 1,447,551
Trainable params: 1,447,551
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
batch_size = 10 : (0.97492, 0.85244)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, None, 128)         1280000   
_________________________________________________________________
lstm_10 (LSTM)               (None, 150)               167400    

3. LSTM vs. GRU
    - Results
        * [LSTM] units = 10 : (0.96412, 0.85016), 4ms/step
        * [GRU] units = 10 : (0.96884, 0.84808), 3ms/step
        * [LSTM] units = 100 : (0.96372, 0.83808), 4ms/step
        * [GRU] units = 100 : (0.97464, 0.84784), 3ms/step
        * [LSTM] units = 150 : (0.94716, 0.8316), 4ms/step
        * [GRU] units = 150 : (0.96908, 0.84532), 3ms/step
        * [LSTM] units = 200 : (0.96444, 0.84276), 4ms/step
        * [GRU] units = 200 : (0.97884, 0.84848), 3ms/step
        * [LSTM] units = 300 : (0.95504, 0.83988), 4ms/step
        * [GRU] units = 300 : (0.97624, 0.8484), 3ms/step
    - As shown, GRU took about 25% less time than LSTM (3ms/step vs. 4ms/step) and still reached about the same accuracy.
    - Increasing the number of units did not lead to better performance.
    - Surprisingly, even though more units in LSTM or GRU mean more parameters to adjust, the training time did not noticeably increase with the number of units. This suggests that most of the training time was spent not on updating parameters but on other operations in LSTM and GRU.
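
For reference, the sizes of the two layer types can be compared with a short calculation. The sketch assumes the classic GRU formulation with 3 gates and one bias vector per gate; newer Keras versions may add a second bias vector per gate by default, which changes the GRU count slightly. The function names are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 gates: input, forget, cell candidate, output.
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim, units):
    # 3 gates: update, reset, candidate (one bias vector per gate assumed).
    return 3 * (units * (input_dim + units) + units)


embed_dim = 128
for units in [10, 100, 150, 200, 300]:
    n_lstm = lstm_params(embed_dim, units)
    n_gru = gru_params(embed_dim, units)
    print(f'units={units:3d}  LSTM={n_lstm:>7,}  GRU={n_gru:>7,}  ratio={n_gru / n_lstm:.2f}')
```

For 10 units this gives 5,560 and 4,170 parameters. Under this formulation a GRU always has 75% of the parameters of an LSTM with the same number of units, which is in line with the roughly 25% shorter step time, although it does not prove where the time is spent.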

In [8]:
for units in [10, 100, 150, 200, 300]:
    print('[LSTM] units =', units, ':', rnn(lstm=True, units=units, epochs=3))
    print('[GRU] units =', units, ':', rnn(lstm=False, units=units, epochs=3))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_13 (Embedding)     (None, None, 128)         1280000   
_________________________________________________________________
lstm_13 (LSTM)               (None, 10)                5560      
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 11        
Total params: 1,285,571
Trainable params: 1,285,571
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
[LSTM] units = 10 : (0.96412, 0.85016)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_14 (Embedding)     (None, None, 128)         1280000   
_________________________________________________________________
gru_1 (GRU)                  (None, 10)                4170    

4. Loss function
    - Results
        * loss = mean_squared_error : (0.94068, 0.8322)
        * loss = mean_absolute_error : (0.5716, 0.55848)
        * loss = mean_absolute_percentage_error : (0.5, 0.5)
        * loss = squared_hinge : (0.5, 0.5)
        * loss = hinge : (0.5, 0.5)
        * loss = categorical_hinge : (0.53876, 0.53044)
        * loss = logcosh : (0.94376, 0.8356)
        * loss = sparse_categorical_crossentropy : (0.0, 0.0)
        * loss = binary_crossentropy : (0.96752, 0.84524)
        * loss = kullback_leibler_divergence : (0.5, 0.5)
        * loss = poisson : (0.94372, 0.83748)
        * loss = cosine_proximity : (0.5, 0.5)
    - Some loss functions are not appropriate for this problem and led to very low accuracy.
    - Mean absolute percentage error is not appropriate for binary problems. The error is divided by the expected value, so the loss blows up when the expected value is zero.
    - Hinge losses expect labels of -1 and 1 rather than 0 and 1, and cosine proximity is not made for binary problems either. Both multiply the expected value by the predicted value, which provides no information when the expected value is zero.
    - KL divergence also multiplies by the expected value, so the loss is always zero when the expected value is zero.
    - Sparse categorical crossentropy is made for multi-class problems with integer labels. An exception can be raised when it is applied to a binary problem with a single output; here the recorded accuracy was 0.0.
    - Mean squared error, logcosh, binary crossentropy and poisson fit this problem better and all gave pretty good results.
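
The arguments above can be checked numerically for a single negative review (expected value 0). The functions below are simplified single-sample sketches of the loss formulas, not the Keras implementations; the clipping constant `EPS` is assumed to play the role of the small fuzz factor Keras uses to avoid division by zero and log(0).

```python
import math

EPS = 1e-7  # assumed small constant, standing in for the Keras fuzz factor


def clip(x, lo=EPS, hi=1.0):
    return min(max(x, lo), hi)


def binary_crossentropy(y, p):
    p = clip(p, EPS, 1 - EPS)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def hinge(y, p):
    return max(1.0 - y * p, 0.0)


def kl_divergence(y, p):
    y, p = clip(y), clip(p)
    return y * math.log(y / p)


def mape(y, p):
    return 100.0 * abs(y - p) / max(abs(y), EPS)


y = 0  # a negative review
for p in (0.1, 0.9):  # a good and a bad prediction for this label
    print(f'p={p}: bce={binary_crossentropy(y, p):.3f}  hinge={hinge(y, p):.3f}  '
          f'kl={kl_divergence(y, p):.2e}  mape={mape(y, p):.2e}')
```

For a label of 0, hinge returns the same value for both predictions and KL divergence stays essentially at zero, so neither gives a useful gradient. MAPE becomes enormous, while binary crossentropy clearly separates the good prediction from the bad one.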

In [9]:
for loss in ['mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'squared_hinge', 'hinge', 'categorical_hinge', 'logcosh', 'sparse_categorical_crossentropy', 'binary_crossentropy', 'kullback_leibler_divergence', 'poisson', 'cosine_proximity']:
    print('loss =', loss, ':', rnn(loss=loss, epochs=3))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_23 (Embedding)     (None, None, 128)         1280000   
_________________________________________________________________
lstm_18 (LSTM)               (None, 150)               167400    
_________________________________________________________________
dense_23 (Dense)             (None, 1)                 151       
Total params: 1,447,551
Trainable params: 1,447,551
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
loss = mean_squared_error : (0.94068, 0.8322)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_24 (Embedding)     (None, None, 128)         1280000   
_________________________________________________________________
lstm_19 (LSTM)               (None, 150)               1

5. Optimizer
    - Results
        * optimizer = sgd : (0.53596, 0.5262)
        * optimizer = rmsprop : (0.91224, 0.84604)
        * optimizer = adagrad : (0.94904, 0.85024)
        * optimizer = adadelta : (0.871, 0.83224)
        * optimizer = adam : (0.96532, 0.84416)
        * optimizer = adamax : (0.92492, 0.84612)
        * optimizer = nadam : (0.96584, 0.84236)

In [10]:
for optimizer in ['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam']:
    print('optimizer =', optimizer, ':', rnn(optimizer=optimizer, epochs=3))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_35 (Embedding)     (None, None, 128)         1280000   
_________________________________________________________________
lstm_30 (LSTM)               (None, 150)               167400    
_________________________________________________________________
dense_35 (Dense)             (None, 1)                 151       
Total params: 1,447,551
Trainable params: 1,447,551
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
optimizer = sgd : (0.53596, 0.5262)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_36 (Embedding)     (None, None, 128)         1280000   
_________________________________________________________________
lstm_31 (LSTM)               (None, 150)               167400    


6. Embedding dimensions
    - Results
        * embedding dimension = 10 , units = 100 : (0.92968, 0.84184)
        * embedding dimension = 50 , units = 100 : (0.96032, 0.84184)
        * embedding dimension = 100 , units = 100 : (0.9582, 0.84036)
        * embedding dimension = 100 , units = 50 : (0.941, 0.82712)
        * embedding dimension = 100 , units = 10 : (0.9686, 0.84416)
    - According to the previous experiments, the number of units does not noticeably affect the results.
    - As shown, the embedding dimension does not noticeably affect the results either. Enough information seems to be preserved even when the embedding dimension is only 10.

In [11]:
for embed_dim, units in [(10, 100), (50, 100), (100, 100), (100, 50), (100, 10)]:
    print('embedding dimension =', embed_dim, ', units =', units, ':', rnn(embed_dim=embed_dim, units=units, epochs=3))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_42 (Embedding)     (None, None, 10)          100000    
_________________________________________________________________
lstm_37 (LSTM)               (None, 100)               44400     
_________________________________________________________________
dense_42 (Dense)             (None, 1)                 101       
Total params: 144,501
Trainable params: 144,501
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
embedding dimension = 10 , units = 100 : (0.92968, 0.84184)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_43 (Embedding)     (None, None, 50)          500000    
_________________________________________________________________
lstm_38 (LSTM)               (None, 100)      