Let's first load some libraries... 

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow import keras 
from tensorflow.keras import layers

# **IMDB Dataset**

We load the IMDB dataset, keeping only the 10,000 most frequent terms in the corpus. Each term is represented in the data by a unique integer between 0 and 9,999, where the lowest few indices are reserved for special tokens (padding, start of review, and out-of-vocabulary terms). Each observation in the train or test data is therefore just a list of integer values. The associated labels are binary indicators of positive (1) or negative (0) sentiment. IMDB ratings are out of 10; in the underlying Large Movie Review Dataset, positive reviews are those rated 7 or higher and negative reviews are those rated 4 or lower, with more neutral reviews left out.

In [None]:
from tensorflow.keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    num_words=10000)

The longest encoded review in the training data has ~2,500 entries. Note that this is not exactly the same thing as the number of words in the original review. *Q: Why not?* (Hint: look at how `load_data` handles the start-of-review marker and out-of-vocabulary terms.)

Further, if we extract the largest integer value from each review in the training data, put it into a list, and then take the max of that list, we see the largest integer index is 9,999 (as expected). 

In [None]:
print(max(len(i) for i in train_data))
print(max([max(sequence) for sequence in train_data]))

We can reverse the integer coding like this... 

In [None]:
# imdb.get_word_index() returns a dictionary mapping each term to its frequency rank (1 = most frequent).
# Note that load_data() shifts every rank up by 3 (its default index_from=3), because the lowest
# indices in the encoded reviews are reserved for special tokens (see the dataset description).
word_index = imdb.get_word_index()
word_index = {word: rank + 3 for word, rank in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2
word_index["<UNUSED>"] = 3

# Look up the integer that represents the word 'big' in the encoded reviews.
print(word_index.get("big"))

# We can reverse the keys and values for each entry, to get a dictionary that would let us 'decode' the terms from a review.
# The full dictionary is very large, so we only print the first few entries.
reverse_word_index = {value: key for (key, value) in word_index.items()}
print(list(reverse_word_index.items())[:10])

# Now we can convert integer values back to terms for a given review in the sample.
# Any index missing from the dictionary is shown as '?'.
decoded_review = " ".join([reverse_word_index.get(i, "?") for i in train_data[0]])
print(train_data[0])
print(decoded_review)
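
The offset between the word-index ranks and the integers stored in the reviews is easy to get wrong, so here is a small sketch of the decoding logic on a made-up vocabulary. The names `toy_ranks`, `INDEX_FROM`, and `decode` are hypothetical, and the offset of 3 is assumed to match the `load_data()` default.

```python
# Hypothetical toy vocabulary: term -> frequency rank (1 = most frequent).
toy_ranks = {"the": 1, "film": 2, "was": 3, "great": 4}

INDEX_FROM = 3  # assumed offset, matching the load_data() default

# Shift every rank by the offset, then add the reserved special tokens.
toy_word_index = {word: rank + INDEX_FROM for word, rank in toy_ranks.items()}
toy_word_index.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping so we can go from integer back to term.
toy_reverse = {index: word for word, index in toy_word_index.items()}

def decode(encoded_review):
    """Map a list of integers back to a space-separated string."""
    return " ".join(toy_reverse.get(i, "?") for i in encoded_review)

# 1 = start marker, 2 = out-of-vocabulary term, 4..7 = shifted ranks.
toy_review = [1, 4, 5, 6, 2, 7]
print(decode(toy_review))  # <START> the film was <UNK> great
```

Without the shift, every decoded term would be off by three positions in the frequency ranking, which produces fluent-looking nonsense.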

# *Pre-processing the Data*

Now, we need to pre-process the data. We can't just pass integer lists of variable length into the neural network. We need a fixed number of features to serve as our x's (inputs) to the network, i.e., same number of features for each observation in the training data (and also the test data for validation, later). The most obvious thing we can do is multi-hot encode the observations (i.e., dummy code it). So, we basically make a matrix with 10,000 columns, and set values to 1 for columns representing the terms we have and 0 for columns representing the terms we don't have. 

In [None]:
import numpy as np

def vectorize_sequences(sequences, dimension=10000, dtype="float32"):

    # Make our blank matrix of 0's to store the multi-hot encodings.
    # NumPy would default to 'float64'; allocating as 'float32' up front halves the RAM needed.
    results = np.zeros((len(sequences), dimension), dtype=dtype)

    # For each observation and element in that observation,
    # update the blank matrix to a 1 at row obs, column element value.
    for i, sequence in enumerate(sequences):
        for j in sequence:
            results[i, j] = 1.
    return results

# The resulting giant arrays are already 'float32', so no extra conversion (and no temporary copy) is needed.
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

# Labels are integers (0 or 1) by default; we convert them to 'float32' to match the x's data type.
y_train = train_labels.astype('float32')
y_test = test_labels.astype('float32')
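
To see why the data type matters here, a rough back-of-the-envelope calculation helps. This sketch assumes the standard 25,000-review training split and a dense array. The helper `dense_size_gib` is hypothetical. Note that NumPy treats `'float'` as an alias for `'float64'`, so only an explicit `'float32'` saves memory.

```python
import numpy as np

n_reviews, n_terms = 25_000, 10_000  # assumed training rows x vocabulary size

def dense_size_gib(rows, cols, dtype):
    """Size in GiB of a dense rows x cols array of the given dtype."""
    return rows * cols * np.dtype(dtype).itemsize / 2**30

size_f64 = dense_size_gib(n_reviews, n_terms, "float64")
size_f32 = dense_size_gib(n_reviews, n_terms, "float32")
print(f"float64: {size_f64:.2f} GiB, float32: {size_f32:.2f} GiB")
# float64: 1.86 GiB, float32: 0.93 GiB

# 'float' is just another name for 'float64' in NumPy.
print(np.dtype("float") == np.dtype("float64"))  # True
```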

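We can check that the multi-hot encoding looks correct. Any integer that appears in the first review should have a 1 in the corresponding column of the first row, and any integer that does not appear should have a 0: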

In [None]:
print(train_data[0])
print(x_train[0,14])
print(x_train[0,22])
print(x_train[0,23])
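
The same idea is easier to inspect on a tiny example. The sketch below uses a hypothetical helper `multi_hot` and two made-up 'reviews' over a six-term vocabulary. It is an illustration of the encoding, not a replacement for the function above.

```python
import numpy as np

def multi_hot(sequences, dimension):
    """Multi-hot encode lists of integer indices into a 2D float32 array."""
    encoded = np.zeros((len(sequences), dimension), dtype="float32")
    for row, sequence in enumerate(sequences):
        # Fancy indexing sets every listed column in this row at once.
        encoded[row, sequence] = 1.0
    return encoded

# Two hypothetical 'reviews' over a vocabulary of 6 terms.
toy_sequences = [[1, 4, 4, 2], [5]]
toy_encoded = multi_hot(toy_sequences, dimension=6)
print(toy_encoded)
# [[0. 1. 1. 0. 1. 0.]
#  [0. 0. 0. 0. 0. 1.]]
```

Notice that term 4 appears twice in the first sequence but still produces a single 1: this encoding throws away both word order and word counts.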

# *Building the Model*

So, mapping binary inputs to binary outputs is a very simple problem setup. We don't even need to rescale the data; the features are already either 0 or 1. We will follow the book's advice and stack a couple of dense layers with relu activations, followed by a sigmoid-activated output layer.
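
To make the architecture concrete, here is a minimal NumPy sketch of what one forward pass and the binary cross-entropy loss compute. The toy sizes, the random weights, and the helper names are all assumptions for illustration; Keras handles this internally, and its exact numerics may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps guards against log(0)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

# Hypothetical toy sizes: 4 observations, 10 binary features, 16 hidden units.
x = rng.integers(0, 2, size=(4, 10)).astype("float32")
y = np.array([1.0, 0.0, 1.0, 0.0])

# Randomly initialised weights, standing in for what training would learn.
w1, b1 = rng.normal(scale=0.1, size=(10, 16)), np.zeros(16)
w2, b2 = rng.normal(scale=0.1, size=(16, 16)), np.zeros(16)
w3, b3 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)

hidden = relu(relu(x @ w1 + b1) @ w2 + b2)
probs = sigmoid(hidden @ w3 + b3).ravel()  # one probability per observation

print(probs.shape)  # (4,)
print(binary_crossentropy(y, probs))
```

The sigmoid squashes the final score into the 0-1 range so it can be read as the probability of a positive review, which is what the cross-entropy loss expects.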

In [None]:
# Optional: install the tensorflow-addons package in your Colab runtime.
# It is only needed for the commented-out NoisyDense layer in the next cell.
try:
    import tensorflow_addons as tfa                     
except ImportError:
    !pip install tensorflow-addons
    import tensorflow_addons as tfa 

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Some options for mitigating over-fitting:
#   - Weight regularization.
#   - Activation regularization.
#   - Weight constraints. 
#   - Dropout
#   - Topology simplification
#   - Add noise to the input data. 

model = keras.Sequential([
    #layers.Dropout(0.2), # This layer sets a random fraction of its inputs to 0 in a given training pass.
    #layers.GaussianNoise(0.1), # This layer adds random normal noise to the input features.
    #tfa.layers.NoisyDense(16,activation="relu"), # This injects noise into the weights at each step. 
    layers.Dense(16, activation="relu"), #,kernel_regularizer='l2'
    layers.Dense(16, activation="relu"), #,activity_regularizer='l2'
    layers.Dense(1, activation="sigmoid")
])

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# We will keep records 0-9,999 for validation, and we will use the remaining 15,000 records for training.
# We are still holding out the test dataset, by the way; we train -> validate to figure out when overfitting starts.
# Then we re-train on the whole training dataset, stopping at that point. Finally, we evaluate performance on the test dataset.
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
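
Dropout, one of the over-fitting options listed in the cell above, zeroes a random fraction of a layer's inputs during training and rescales the survivors so the expected total stays the same. Here is a minimal NumPy sketch of that 'inverted dropout' idea. The function name and toy data are assumptions, and this is not Keras's actual implementation.

```python
import numpy as np

def dropout(inputs, rate, rng, training=True):
    """Inverted dropout: zero a random fraction of inputs, rescale the rest."""
    if not training or rate == 0.0:
        return inputs
    keep_mask = rng.random(inputs.shape) >= rate
    return inputs * keep_mask / (1.0 - rate)

rng = np.random.default_rng(42)
activations = np.ones((1000, 16))
dropped = dropout(activations, rate=0.2, rng=rng)

print((dropped == 0).mean())  # roughly 0.2
print(dropped.mean())         # roughly 1.0, thanks to the rescaling
```

At prediction time (`training=False`) the inputs pass through unchanged, which is why no correction is needed when the model is later evaluated.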

As you can see, the model starts to overfit after 3-4 epochs: training loss keeps falling, while validation loss starts to rise and validation accuracy starts to decline. Here we plot the loss:

In [None]:
import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict["loss"]
val_loss_values = history_dict["val_loss"]
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, "r", label="Training loss")
plt.plot(epochs, val_loss_values, "b", label="Validation loss")
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()

Here we are plotting accuracy.

In [None]:
acc = history_dict["accuracy"]
val_acc = history_dict["val_accuracy"]
plt.plot(epochs, acc, "r", label="Training acc")
plt.plot(epochs, val_acc, "b", label="Validation acc")
plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
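
Rather than reading the stopping point off the plots by eye, we can also pull it out of the history dictionary. The sketch below uses made-up loss values and a hypothetical helper `best_epoch`; with a real run you would pass in `history.history` instead.

```python
import numpy as np

# Hypothetical history, shaped like the history.history dictionary above.
toy_history = {
    "loss":     [0.52, 0.31, 0.23, 0.18, 0.14, 0.11],
    "val_loss": [0.40, 0.31, 0.28, 0.29, 0.31, 0.34],
}

def best_epoch(history_dict, metric="val_loss"):
    """Return the 1-based epoch at which the metric is lowest."""
    return int(np.argmin(history_dict[metric])) + 1

print(best_epoch(toy_history))  # 3
```

Keras also provides a `keras.callbacks.EarlyStopping` callback that can monitor `val_loss` and stop training automatically. That is a reasonable alternative to hard-coding the epoch count, though it is not what this notebook does.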

Note: if you see your RAM creeping up, you might want to clear some things out of memory to free up space; the garbage collector can help here. I'm going to delete the existing model and the original train / test datasets (we have them in multi-hot encoded format now anyway). Alternatively, you can reassign the objects to something small, such as an empty list.

In [None]:
import gc
del model, train_data, test_data
gc.collect()

# Alternatively, reassigning the names to empty lists also lets the original data be freed.
train_data, test_data = [],[]

Let's re-run this and force a stop after 4 epochs. Note that you need to redefine the model here; otherwise, training will just pick up where your previous model left off (i.e., from the last set of weights, which are already overfit).



In [None]:
model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid")
])

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

history = model.fit(x_train,
                    y_train,
                    epochs=4,
                    batch_size=512,
                    validation_data=(x_test, y_test))

# This returns the loss and accuracy on the held-out test data.
model.evaluate(x_test, y_test)

And, of course, the model is useless if we can't use it to produce a new prediction. So, let's do that as well. 

In [None]:
import pandas as pd

# y_test is a 1D numpy array, whereas our predictions are a 2D array.
print(y_test.shape)

predictions = model.predict(x_test)
print(predictions.shape)

# np.ravel() is another way to flatten an array into 1D, so then we can take the cross-tabulation.
# Note that I'm applying the >0.5 threshold rule to the predictions. 
pd.crosstab(np.ravel(predictions)>0.5,y_test)
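
The cross-tabulation is a confusion matrix, and the usual summary metrics follow directly from its four cells. Here is a small sketch on made-up labels and probabilities. The helper `confusion_counts` and the toy arrays are hypothetical and stand in for `y_test` and `predictions`.

```python
import numpy as np

def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tn, fp, fn, tp) after thresholding predicted probabilities."""
    y_true = np.ravel(y_true).astype(bool)
    y_pred = np.ravel(y_prob) > threshold
    tp = int(np.sum(y_pred & y_true))
    tn = int(np.sum(~y_pred & ~y_true))
    fp = int(np.sum(y_pred & ~y_true))
    fn = int(np.sum(~y_pred & y_true))
    return tn, fp, fn, tp

# Hypothetical labels and predicted probabilities (2D, like model.predict output).
toy_labels = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
toy_probs = np.array([[0.9], [0.2], [0.4], [0.7], [0.6], [0.1]])

tn, fp, fn, tp = confusion_counts(toy_labels, toy_probs)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tn, fp, fn, tp)  # 2 1 1 2
print(accuracy, precision, recall)
```

Moving the threshold away from 0.5 trades precision against recall, which is worth checking when the two kinds of error do not cost the same.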