we will use the IMDB dataset which comes with Tensorflow. It is a large movie review dataset. The data is text data and labels are binary. It has 25000 training data and 25000 test data already separated for us. Learn more about this dataset here. This is a very good dataset for practicing some Natural Language Processing tasks. Each row of this dataset contains text data as expected and the label is either 0 or 1. So, it represents either good sentiment or bad sentiment.

In [2]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

importing the IMDB dataset and the info with it:



In [None]:
imdb, info = tfds.load("imdb_reviews",
                      with_info=True, as_supervised=True)

2024-10-30 20:56:58.059433: W external/local_tsl/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".


[1mDownloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /home/youssef/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...[0m


  from .autonotebook import tqdm as notebook_tqdm
Dl Completed...:   0%|          | 0/1 [04:07<?, ? url/s]

Set training and test data in separate variables:



In [None]:
train_data, test_data = imdb['train'], imdb['test']


###Data Preprocessing
Having all the texts as a list and labels as a separate list will be helpful. So, training sentences and labels and testing sentences and labels are retrieved as lists here:

In [None]:
training_sentences = []
training_labels = []
testing_sentences =[]
testing_labels = []
for s,l in train_data:
    training_sentences.append(str(s.numpy()))
    training_labels.append(l.numpy())
for s,l in test_data:
    testing_sentences.append(str(s.numpy()))
    testing_labels.append(l.numpy())

Converting the labels as NumPy arrays:

In [None]:
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)

setting some important parameters necessary for the model.

In [None]:
vocab_size = 10000
embedding_dim=16
max_length = 120
trunc_type= 'post'
oov_tok=""

Here, vocal_size 10000. That means 10000 unique words will be used for this model. If the IMDB dataset has more than 10000 words, extra words will not be used to train the model. So, usually, we are careful about taking this number. Please feel free to try with different vocab_size.

The next parameter is ‘embedding_dim’. It represents the size of the vector that will be used to represent each word. Here embedding_dim is 16 means, a vector of size 16 will be representing each word. You can also try a different number here.

A maximum length of 120 words will be used for each piece of text or to predict a label. This is what is represented by the max_length parameter. If the text is originally bigger than that it will be truncated.

The next parameter trunc_type is set to be ‘post’. That means the text will be truncated (shorter ) at the end.

If there is an unknown word that will be represented by oov_tok.

Data preprocessing is started by tokenizing the texts in NLP projects.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index


In [None]:
{'': 1,
 'the': 2,
 'and': 3,
 'a': 4,
 'of': 5,
 'to': 6,
 'is': 7,
 'br': 8,
 'in': 9,
 'it': 10,
 'i': 11,
 'this': 12,
 'that': 13,
 'was': 14,
 'as': 15,
 'for': 16,
 'with': 17,
 'movie': 18,}

{'': 1,
 'the': 2,
 'and': 3,
 'a': 4,
 'of': 5,
 'to': 6,
 'is': 7,
 'br': 8,
 'in': 9,
 'it': 10,
 'i': 11,
 'this': 12,
 'that': 13,
 'was': 14,
 'as': 15,
 'for': 16,
 'with': 17,
 'movie': 18}

So, we have a unique integer value for each word. Here we are arranging our sentences using these integer values instead of words. Also, use padding if the sentences are less than our set max_length 120 words. That way we will have the same size vector for each text.

In [None]:
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length,
                       truncating = trunc_type)
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length)

###Model Development

###Simple RNN

The first model will be a simple Recurrent Neural Network model.

In this model, the first layer will be the embedding layer where sentences will be represented as max_length by embedding_dim vectors. The next layer is a simple RNN layer. Then the dense layers. Here is the model


Embedding layer takes the integer-encoded representations of words in your sequence and converts them into dense vectors of fixed size.

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                             input_length=max_length),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 120, 16)           160000    
                                                                 
 simple_rnn (SimpleRNN)      (None, 32)                1568      
                                                                 
 dense (Dense)               (None, 10)                330       
                                                                 
 dense_1 (Dense)             (None, 1)                 11        
                                                                 
Total params: 161909 (632.46 KB)
Trainable params: 161909 (632.46 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Look at the output shape of each layer. The first layer output shape is (120, 16). Remember our max_length for each sentence was 120 and the embedding dimension was 16.
In the second layer, we put 32 as the parameter in the SimpleRNN layer and the output shape is also 32.

Here we will compile the model using the loss function of binary_crossentropy, ‘adam’ optimizer, and the evaluation metric as the accuracy.

In [None]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

train the model for 30 epochs.

In [None]:
num_epochs=30
history=model.fit(padded, training_labels_final, epochs=num_epochs, validation_data = (testing_padded, testing_labels_final))

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
 79/782 [==>...........................] - ETA: 1:18 - loss: 0.0174 - accuracy: 0.9937

After 30 epochs training accuracy becomes 0.99 or 99%. Perfect, right? But the validation accuracy is 71.97%. Not that bad. But huge *overfitting* issue.


It will be good to see how accuracies and losses changed with each epoch.

In [None]:
import matplotlib.pyplot as plt
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    plt.show()

# plot accuracy
plot_graphs(history, 'accuracy')

# plot loss
plot_graphs(history, 'loss')


Validation accuracy oscillated a lot in the beginning and then got settled at 71.9%. On the other hand, the training accuracy went up steadily to 99%

But the loss curve for validation looks pretty bad. It kept going up. We want the loss to go down.