
---
# Classifying movie reviews: A Binary Classification TensorFlow example
---

In [6]:
from tensorflow.keras.datasets import imdb

import matplotlib.pyplot as plt
import numpy as np

(train_data, train_lables), (test_data, test_labels) = imdb.load_data(num_words=10_000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [7]:
print(f"{train_data[0] = }")

train_data[0] = [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


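Each integer stands for a word. Indices 0, 1, and 2 are reserved (padding, start of sequence, and unknown word), so an index `i >= 3` refers to the word ranked `i - 3` by frequency in the dataset. For example, 14 is the 11th most frequent word.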

In [12]:
print(f"{train_lables[0] = }")

train_lables[0] = 1


In [10]:
print(f"{len(train_data[0]) = }")

len(train_data[0]) = 218


In [11]:
print(f"{len(train_data[1]) = }")

len(train_data[1]) = 189


Notice that the reviews have different lengths. A neural network expects fixed-size inputs, so we will have to deal with this before training.

In [18]:
word_index = imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}

decoded_review = " ".join([
    reverse_word_index.get(i-3,"?") for i in train_data[0]
    # 0, 1, 2 are reserved for "padding", "start", and "missing" respectively
])

decoded_review

"? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you th

### Prepping the Data
You can't directly feed lists of integers into a neural network. They all have different lengths, and a neural network expects to process contiguous batches of data. Thus, we first need to turn our lists into tensors. There are two ways to do this:
1. Pad the lists so that they all have the same length, turn them into an integer tensor of shape `(samples, max_length)`, and start the model with a layer capable of handling such integer tensors.
2. *Multi-hot encode* the lists to turn them into vectors of 0s and 1s. This would mean, for instance, turning the sequence [8, 5] into a 10,000-dimensional vector that is all 0s except for indices 8 and 5, which are 1s. You could then use a `Dense` layer, capable of handling floating-point vector data, as the first layer in your model.
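
Option 1 is not used in this notebook, but as a rough illustration, padding could look something like the sketch below. The helper name `pad_sequences_simple`, the choice to pad on the right, and the toy reviews are all assumptions made for this example. Index 0 is used as the fill value because the IMDB encoding reserves it for padding.

```python
import numpy as np

def pad_sequences_simple(sequences, max_length, pad_value=0):
    """Right-pad (or truncate) integer lists to a fixed length.

    Hypothetical helper for illustration only.
    """
    # Start from a (samples, max_length) integer array filled with the pad value.
    padded = np.full((len(sequences), max_length), pad_value, dtype="int64")
    for i, sequence in enumerate(sequences):
        trimmed = sequence[:max_length]       # drop anything past max_length
        padded[i, :len(trimmed)] = trimmed    # copy the review, keep the rest as padding
    return padded

# Made-up toy reviews of different lengths.
toy_reviews = [[1, 14, 22], [1, 8], [1, 5, 9, 30, 2]]
padded = pad_sequences_simple(toy_reviews, max_length=4)
print(padded.shape)  # (3, 4)
```

The result is a single rectangular integer array, which is what a model that starts with an integer-handling layer would consume.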

We choose option 2 and vectorize our data manually below.

In [19]:
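def vectorize_sequences(sequences, dimension=10_000):
    """Multi-hot encode a list of integer sequences."""
    # One row per sequence, one column per word index, all zeros to start.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        for j in sequence:
            # Mark word index j as present in sequence i.
            results[i, j] = 1.0
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)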

In [21]:
print(x_train[0])

[0. 1. 1. ... 0. 0. 0.]


In [24]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

Recall:

* Pre-activation of a layer: $z = Wx + b$
* Post-activation: $a = \Phi(z)$

* `train_data[i]` is the i-th encoded movie review
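
To make the pre-activation and post-activation recap concrete, here is a minimal NumPy sketch of one dense layer with a ReLU activation. The function names, the random weights, and the `(units, input_dim)` weight orientation are assumptions chosen to match the formula; this is not a description of how Keras stores or applies its weights internally.

```python
import numpy as np

def relu(z):
    # Element-wise max(z, 0).
    return np.maximum(z, 0.0)

def dense_forward(x, W, b, activation):
    z = W @ x + b          # pre-activation
    return activation(z)   # post-activation

rng = np.random.default_rng(0)

# A multi-hot input that contains only word indices 5 and 8.
x = np.zeros(10_000)
x[[5, 8]] = 1.0

# Hypothetical weights for a 16-unit layer.
W = rng.normal(scale=0.01, size=(16, 10_000))
b = np.zeros(16)

a = dense_forward(x, W, b, relu)
print(a.shape)  # (16,)
```

Because `x` is multi-hot, `W @ x` simply adds up the weight columns of the words present in the review, here columns 5 and 8.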

In [26]:
model.compile(
    optimizer="rmsprop",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
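
As a rough illustration of what the `binary_crossentropy` loss measures, the sketch below computes the mean of $-\left[y \log p + (1 - y) \log(1 - p)\right]$ in plain NumPy. The function and the clipping constant `eps` are assumptions for this example, and the inputs are made-up values rather than model outputs.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype="float64")
    # Clip predictions away from 0 and 1 so the logarithms stay finite (assumed eps).
    y_pred = np.clip(np.asarray(y_pred, dtype="float64"), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

# Made-up predictions: confident and right versus confident and wrong.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right < confident_wrong)  # True
```

The loss grows quickly when the model assigns a high probability to the wrong class, which is why it pairs naturally with a sigmoid output.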

In [27]:
y_train = np.asarray(train_lables).astype("float32")
y_test = np.asarray(test_labels).astype("float32")

y_train

array([1., 0., 0., ..., 0., 1., 0.], dtype=float32)

In [28]:
x_val, y_val = x_train[:10_000], y_train[:10_000]
partial_x_train, partial_y_train = x_train[10_000:], y_train[10_000:]

In [29]:
history = model.fit(
    partial_x_train,
    partial_y_train,
    epochs=20,
    batch_size=512,
    validation_data=(x_val, y_val),
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
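
Since `validation_data` was passed to `fit`, `history.history` should hold one value per epoch under keys such as `"loss"` and `"val_loss"`. A small helper like the one below (the name `epoch_with_lowest` is hypothetical) could be used to find the epoch where a loss-like metric is smallest. It is a sketch for inspecting the training run, not part of the original notebook.

```python
def epoch_with_lowest(history_dict, metric="val_loss"):
    """Return (epoch, value) for the smallest value of a per-epoch metric.

    Epochs are numbered from 1 to match the Keras training log.
    """
    values = history_dict[metric]
    best_index = min(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]
```

With the `history` object from the cell above, `epoch_with_lowest(history.history)` would report the epoch after which the validation loss stops improving, a common sign that the model is starting to overfit.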
