##### RNN
In this code, we first set the parameters such as the maximum number of words to consider as features (max_features), the maximum sequence length (max_length), batch size, and the number of epochs for training.

We then load the IMDb dataset using imdb.load_data(), which returns it already split into training and testing sets. Because the reviews differ in length, we apply sequence.pad_sequences() so that all input sequences have the same length: shorter sequences are padded with zeros (at the front, by default) and longer ones are truncated to max_length.
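To make the padding step concrete, here is a minimal NumPy sketch that mimics what pad_sequences does with its default settings (padding='pre', truncating='pre'): shorter sequences receive zeros at the front, and longer ones keep only their last maxlen tokens. The helper name `pad_pre` and the toy reviews are hypothetical, chosen for illustration; this is not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length.

    Illustrative re-implementation of the default pad_sequences behaviour;
    not the Keras source code.
    """
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays all padding
        trimmed = seq[-maxlen:]             # keep only the last maxlen tokens
        out[row, -len(trimmed):] = trimmed  # right-align, zeros on the left
    return out

# Two toy "reviews" of different lengths (made-up token ids).
toy_reviews = [[1, 14, 22], [1, 13, 447, 4, 192, 50]]
padded = pad_pre(toy_reviews, maxlen=4)
print(padded)
# [[  0   1  14  22]
#  [447   4 192  50]]
```

The second row shows that truncation drops tokens from the start of a long review, so the end of the review is what the model sees.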

The RNN model is built using tf.keras.Sequential with an embedding layer, a simple RNN layer, and a dense layer with sigmoid activation for binary classification. The Embedding layer maps each word index to a dense 32-dimensional vector, so that words with similar meanings can end up close together in the vector space. The embedding weights are trained together with the rest of the model, so the model learns suitable embeddings on its own.
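As a rough picture of what these three layers compute, the following NumPy sketch runs one forward pass by hand: an embedding lookup, the simple RNN recurrence with a tanh activation, and a sigmoid output. The weights are random stand-ins rather than trained values, and all variable names are assumptions made for this illustration; the Keras layers handle batching, masking, and training internally.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, units = 5000, 32, 32  # sizes used in this notebook

# Randomly initialised weights stand in for trained ones (assumption).
embedding = rng.normal(scale=0.05, size=(vocab_size, embed_dim))
W_x = rng.normal(scale=0.1, size=(embed_dim, units))  # input -> hidden
W_h = rng.normal(scale=0.1, size=(units, units))      # hidden -> hidden
b_h = np.zeros(units)
w_out = rng.normal(scale=0.1, size=units)             # hidden -> output
b_out = 0.0

def forward(token_ids):
    """Embedding lookup, simple RNN recurrence, then a sigmoid output."""
    h = np.zeros(units)
    for token in token_ids:
        x_t = embedding[token]                  # lookup: one row per word index
        h = np.tanh(x_t @ W_x + h @ W_h + b_h)  # recurrence over time steps
    logit = h @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logit))         # probability of "positive"

prob = forward([0, 0, 1, 14, 22, 16])
print(f"predicted probability: {prob:.4f}")
```

Only the final hidden state feeds the dense layer, which matches a recurrent layer that returns its last output rather than the full sequence.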

The model is compiled with the RMSprop optimizer, binary cross-entropy loss, and accuracy as the evaluation metric.
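For reference, the loss and metric named above can be written out in a few lines of NumPy. This is a sketch of the standard definitions on made-up labels and predictions, with a 0.5 decision threshold assumed for accuracy; the function names are hypothetical.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for predictions of exactly 0 or 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1.0 - y_true) * np.log(1.0 - y_prob)))

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    preds = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(preds == np.asarray(y_true)))

# Made-up labels and predicted probabilities for four reviews.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

loss_value = binary_cross_entropy(y_true, y_prob)
acc_value = accuracy(y_true, y_prob)
print(f"loss={loss_value:.4f}, accuracy={acc_value:.2f}")
# loss=0.5403, accuracy=0.50
```

The two confident, correct predictions contribute little to the loss, while the two predictions on the wrong side of 0.5 dominate it.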


In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence

In [31]:
max_features = 5000  # Number of words to consider as features
max_length = 250  # Maximum sequence length (words)
batch_size = 32
epochs = 5

# Load the IMDb dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

In [32]:
x_train.shape

(25000,)

In [33]:
x_train[0]  # outputs a list of integers, each one the index (token) of a word

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 2,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 2,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 2,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 2,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 2,
 8,
 4,
 107,
 117,
 2,
 15,
 256,
 4,
 2,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 2,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 2,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5,
 144,
 30,
 2,
 18,
 51,
 36,
 

In [6]:
y_train[0] #positive review

1

In [34]:
word_index = imdb.get_word_index()
word_index['bad']

75

In [35]:
get_word = dict((value,key) for key,value in word_index.items())
def decode(review):
  txt = ''
  for r in review:
    txt += get_word[r]
    txt += " "

  return txt
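One caveat about the helper above: imdb.load_data() by default shifts every word index by 3 (index_from=3) and reserves 0 for padding, 1 for the start of a sequence, and 2 for out-of-vocabulary words, whereas get_word_index() returns the unshifted ranks. Looking up raw indices directly therefore returns the wrong words, which is why decoded reviews can read as scrambled. A possible offset-aware variant is sketched below; `decode_review`, the placeholder labels, and the toy vocabulary are hypothetical and only there so the example runs without downloading the dataset.

```python
def decode_review(encoded, word_index, index_from=3):
    """Map load_data-style indices back to words, undoing the index offset."""
    reverse_index = {rank: word for word, rank in word_index.items()}
    # Placeholder labels for the reserved indices (assumption, for readability).
    special = {0: "<pad>", 1: "<start>", 2: "<unk>"}
    words = []
    for idx in encoded:
        if idx in special:
            words.append(special[idx])
        else:
            words.append(reverse_index.get(idx - index_from, "<unk>"))
    return " ".join(words)

# Toy vocabulary standing in for imdb.get_word_index() (hypothetical values).
toy_word_index = {"the": 1, "and": 2, "film": 3, "was": 4, "brilliant": 5}
toy_encoded = [0, 1, 4, 6, 7, 8, 2]

decoded = decode_review(toy_encoded, toy_word_index)
print(decoded)
# <pad> <start> the film was brilliant <unk>
```

With the real data, the same function could be called as `decode_review(x_train[19], word_index)`; because index 0 is handled explicitly, it also works on padded sequences.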

In [36]:
decode(x_train[19])

"the exercise he it is tony falls ask has is found and and david order besides and it age cole watch extraordinary would it racist bad them can of order this community dancer of introduced for at does for all with released movies sometimes to for be war rock made all it coming all well fighting in cheap not his movie high and and actors know bad can and chance 6 and and and in and or weren't to and is got say room and legend like that hand some it of and br about help of you it is over are of straight able of their book an of details have stuff that in not that it her and anyone who so and down course and to and for have big 3 not also global she this of played and there theatre will life are goes in still lion to would to and he and who is success save br being before this television god is nobody to tony ask for into and to and have "

In [30]:
x_train[0].shape

(250,)

In [19]:
y_train[19] #seems like a positive review

1

In [17]:

# Pad sequences to a fixed length because the reviews have different lengths
x_train = sequence.pad_sequences(x_train, maxlen=max_length)
x_test = sequence.pad_sequences(x_test, maxlen=max_length)

In [None]:

# Build the RNN model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(max_features, 32),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_test, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test, batch_size=batch_size)
print(f'Test loss: {loss:.4f}')
print(f'Test accuracy: {accuracy:.4f}')

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test loss: 0.4651
Test accuracy: 0.7915


##### LSTM

In this code, we use an LSTM layer instead of a simple RNN layer, since simple RNNs tend to suffer from vanishing gradients on long sequences such as these 250-token reviews.
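The vanishing-gradient effect can be illustrated numerically. The sketch below unrolls a tanh recurrence for 50 steps and tracks the norm of the Jacobian of the current hidden state with respect to the initial one. The recurrent weights are random and rescaled to a spectral norm of 0.5, an assumption chosen to make the shrinkage easy to see; trained weights will behave differently in detail. The LSTM's gated, additive cell-state update is designed to ease this problem.

```python
import numpy as np

rng = np.random.default_rng(0)
units, steps = 32, 50

# Recurrent weights rescaled to spectral norm 0.5 (illustrative assumption).
W_h = rng.normal(size=(units, units))
W_h *= 0.5 / np.linalg.norm(W_h, 2)

h = np.zeros(units)
jacobian = np.eye(units)  # accumulates d h_t / d h_0
norms = []
for _ in range(steps):
    x_t = rng.normal(size=units)  # stand-in for a projected word embedding
    h = np.tanh(x_t + h @ W_h)
    # One-step Jacobian of h_t w.r.t. h_{t-1}: diag(1 - h_t^2) @ W_h^T
    step_jac = (1.0 - h**2)[:, None] * W_h.T
    jacobian = step_jac @ jacobian  # chain rule across time steps
    norms.append(np.linalg.norm(jacobian, 2))

print(f"after 1 step:   {norms[0]:.3e}")
print(f"after {steps} steps: {norms[-1]:.3e}")
```

Each step multiplies the gradient by a matrix whose norm is at most 0.5 here, so the signal reaching early words shrinks roughly geometrically with the distance in time.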

The number of layers, the number of features, the padding length, the number of epochs, and the optimizer are kept the same as before to make the comparison as fair as possible.
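One thing worth keeping in mind when reading the comparison: matching the layer count and width does not match the number of trainable weights, because an LSTM layer holds four sets of input and recurrent weights where a simple RNN holds one. The sketch below applies the usual parameter-count formulas to the sizes used here; the function names are hypothetical.

```python
def simple_rnn_params(input_dim, units):
    """Input kernel + recurrent kernel + bias for a simple RNN layer."""
    return units * (input_dim + units + 1)

def lstm_params(input_dim, units):
    """Three gates plus the candidate state, each shaped like a simple RNN."""
    return 4 * units * (input_dim + units + 1)

embed_dim, units = 32, 32  # embedding size and layer width used in this notebook
rnn_count = simple_rnn_params(embed_dim, units)
lstm_count = lstm_params(embed_dim, units)
print(rnn_count, lstm_count)
# 2080 8320
```

Both models share the same embedding (5000 x 32 weights) and output layer, so the recurrent layer is the only place where the sizes differ.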

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence

# Set the parameters
max_features = 5000  # Number of words to consider as features
max_length = 250  # Maximum sequence length (words)
batch_size = 32
epochs = 5

# Load the IMDb dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to a fixed length
x_train = sequence.pad_sequences(x_train, maxlen=max_length)
x_test = sequence.pad_sequences(x_test, maxlen=max_length)

# Build the LSTM model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(max_features, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_test, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test, batch_size=batch_size)
print(f'Test loss: {loss:.4f}')
print(f'Test accuracy: {accuracy:.4f}')


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test loss: 0.4134
Test accuracy: 0.8660


As expected, the LSTM achieves a better test accuracy than the simple RNN (0.8660 versus 0.7915).