<a href="https://colab.research.google.com/github/thedatadj/natural-language-processing/blob/main/movies-reviews-classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Description
In this project I create a machine learning model that classifies movie reviews as positive or negative.

## Dataset

The dataset used for this project is the IMDB dataset, which consists of movie reviews labeled as positive or negative.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 142–150). Portland, Oregon, USA. Association for Computational Linguistics. [Link to the Paper](http://www.aclweb.org/anthology/P11-1015)


In [34]:
# Numerical analysis
import numpy as np

# Deep Learning
import tensorflow as tf

# TensorFlow datasets
import tensorflow_datasets as tfds

# Data

In [8]:
# Load the data

imdb_data = tfds.load('imdb_reviews',
                      as_supervised=True)


In [12]:
# Get the training and testing datasets

train_set = imdb_data['train']
test_set = imdb_data['test']


In [21]:
# Take a look at a training example

next(iter(train_set))

(<tf.Tensor: shape=(), dtype=string, numpy=b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.">,
 <tf.Tensor: shape=(), dtype=int64, numpy=0>)

I want each training example to be a plain string, with all of the strings stored in a list. I also want each label to be a plain integer, with all of the labels stored in a second list.

**Get training sentences and labels into lists**

In [31]:
# Lists
training_sentences = []
training_labels = []

for sentence, label in train_set:

    # Sentence

    # Extract string
    string = sentence.numpy()
    # Decode
    decoded = string.decode('utf8')
    # Add to list
    training_sentences.append(decoded)

    # Label

    # Extract value
    value = label.numpy()
    # Add to list
    training_labels.append(value)

**Get testing sentences and labels into lists**

In [50]:
# Lists
testing_sentences = []
testing_labels = []


for sentence, label in test_set:
    # Extract string
    string = sentence.numpy()
    # Decode
    decoded = string.decode('utf8')
    # Add to list
    testing_sentences.append(decoded)

    # Extract number
    value = label.numpy()
    # Add to list
    testing_labels.append(value)
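
The training and testing loops above are identical apart from the lists they fill, so the shared logic could live in one helper. The following is a minimal sketch, not part of the original notebook: `dataset_to_lists` is a hypothetical name, and it assumes every element of the dataset is a `(sentence, label)` pair whose items expose `.numpy()`, as the tensors shown earlier do.

```python
def dataset_to_lists(dataset):
    """Split an iterable of (sentence, label) tensor pairs into two Python lists."""
    sentences, labels = [], []
    for sentence, label in dataset:
        # .numpy() returns a bytes object for a string tensor; decode it to str
        sentences.append(sentence.numpy().decode('utf8'))
        # .numpy() returns the scalar label value
        labels.append(label.numpy())
    return sentences, labels


# Intended usage with the datasets loaded above:
# training_sentences, training_labels = dataset_to_lists(train_set)
# testing_sentences, testing_labels = dataset_to_lists(test_set)
```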

In [30]:
# Training example
training_sentences[0]

"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."

In [76]:
# List to array
training_labels = np.array(training_labels)
testing_labels = np.array(testing_labels)

> Now the datasets are ready.

# Tokenization
The classifier works on numbers, not raw text, so each review must be converted into a sequence of integer token IDs before the model can be fit.

In [37]:
# Hyperparameters
vocabulary_size = 10000
embedding_dimension = 16
maximum_lenght_of_sentences = 120
truncating_type = 'post'
out_of_vocabulary_token = "<OOV>"

In [36]:
# Imports
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [39]:
# Instantiate
tokenizer = Tokenizer(num_words = vocabulary_size,
                      oov_token = out_of_vocabulary_token)

# Fit
tokenizer.fit_on_texts(training_sentences)

Now that the tokenizer is fit, I can use `tokenizer` to convert the sentences in `training_sentences` and `testing_sentences` into sequences of integers.

In [40]:
# Sentences to sequences
sequences = tokenizer.texts_to_sequences(training_sentences)

In [64]:
# Take a look
print("Human:", training_sentences[0].split()[:6])
print("Computer:", sequences[0][:6])

Human: ['This', 'was', 'an', 'absolutely', 'terrible', 'movie.']
Computer: [12, 14, 33, 425, 392, 18]
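
To make the mapping from words to integers less opaque, here is a simplified pure-Python sketch of the idea behind a frequency-ranked tokenizer with an out-of-vocabulary token. It is an illustration under stated assumptions, not the Keras implementation: `simple_split`, `fit_word_index`, and `to_sequence` are hypothetical names, the three toy texts are made up, and the text cleaning only approximates what `Tokenizer` does (lowercase, replace punctuation other than the apostrophe with spaces, split on whitespace).

```python
from collections import Counter
import string

# Punctuation to strip; the apostrophe is kept so "don't" stays one word
_PUNCTUATION = string.punctuation.replace("'", "")
_TABLE = str.maketrans({ch: " " for ch in _PUNCTUATION})


def simple_split(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return text.lower().translate(_TABLE).split()


def fit_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter()
    for text in texts:
        counts.update(simple_split(text))
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Map each word to its index; unknown or too-rare words become OOV."""
    oov_index = word_index[oov_token]
    sequence = []
    for word in simple_split(text):
        index = word_index.get(word)
        # Words never seen, or ranked at or beyond num_words, map to OOV
        if index is None or index >= num_words:
            index = oov_index
        sequence.append(index)
    return sequence


# Toy corpus (made up for illustration)
texts = ["this movie was great", "this movie was terrible", "great acting"]
word_index = fit_word_index(texts)

print(to_sequence("This movie was wonderful.", word_index, num_words=6))  # [2, 3, 4, 1]
print(to_sequence("terrible acting", word_index, num_words=6))            # [1, 1]
```

In the second call both words are in the index, but their ranks (6 and 7) fall outside the `num_words=6` cutoff, so they are replaced by the OOV index. This is the role `vocabulary_size` plays above.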


I pad the sequences (sentences) into a matrix so that every training example has the same length.

In [47]:
# Pad sequences
padded = pad_sequences(sequences,
                       maxlen=maximum_lenght_of_sentences,
                       truncating=truncating_type)
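
The effect of `maxlen` and `truncating` is easier to see on a single short sequence. Below is a minimal pure-Python sketch of the behavior; `pad_or_truncate` is a hypothetical helper, not the Keras function, and it mirrors the `pad_sequences` defaults of padding and truncating at the front (`'pre'`) with zeros.

```python
def pad_or_truncate(sequence, maxlen, padding="pre", truncating="pre", value=0):
    """Return a copy of `sequence` (a list) that is exactly `maxlen` items long."""
    if len(sequence) > maxlen:
        if truncating == "post":
            # Keep the beginning, drop the end
            sequence = sequence[:maxlen]
        else:
            # Keep the end, drop the beginning
            sequence = sequence[-maxlen:]
    filler = [value] * (maxlen - len(sequence))
    return filler + sequence if padding == "pre" else sequence + filler


print(pad_or_truncate([12, 14, 33], maxlen=5))                                 # [0, 0, 12, 14, 33]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5, truncating="post"))     # [1, 2, 3, 4, 5]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5, truncating="pre"))      # [3, 4, 5, 6, 7]
```

With `truncating='post'`, a review longer than `maxlen` keeps its opening tokens; with `'pre'`, it keeps its closing tokens.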

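I apply the same process to the testing set. Note that `truncating` is not passed in this call, so `pad_sequences` uses its default, `'pre'`: test reviews longer than the maximum length are cut from the start, whereas training reviews are cut from the end.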

In [51]:
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences,
                               maxlen=maximum_lenght_of_sentences)

# Modeling
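I define a small feed-forward neural network: an embedding layer, a flatten layer, a 12-unit ReLU hidden layer, and a single sigmoid output unit.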

In [79]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocabulary_size,
                              embedding_dimension,
                              input_length=maximum_lenght_of_sentences),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(12, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
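
As a rough sanity check on the size of this network, the number of trainable parameters can be worked out by hand from the hyperparameters defined earlier. The arithmetic below is a sketch that assumes the standard layouts: one vector per vocabulary entry for the embedding, and a weight matrix plus a bias vector for each dense layer. The variable names are illustrative.

```python
vocabulary_size = 10000
embedding_dimension = 16
maximum_length = 120   # same value as maximum_lenght_of_sentences above
hidden_units = 12

# Embedding: one 16-dimensional vector per vocabulary entry
embedding_params = vocabulary_size * embedding_dimension        # 160000

# Flatten has no parameters; it turns (120, 16) into a vector of 1920 values
flattened_size = maximum_length * embedding_dimension           # 1920

# Dense(12): weights plus one bias per unit
hidden_params = flattened_size * hidden_units + hidden_units    # 23052

# Dense(1): weights plus one bias
output_params = hidden_units * 1 + 1                            # 13

total_params = embedding_params + hidden_params + output_params
print(total_params)  # 183065
```

Most of the parameters sit in the embedding table, and the total is several times larger than the 25,000 training reviews, which leaves the model plenty of room to memorize the training set.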

In [80]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [65]:
# Training hyperparameter
n_epochs = 10

In [81]:
# Training
model.fit(padded,
          training_labels,
          epochs=n_epochs,
          validation_data=(testing_padded, testing_labels))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7bae4e95a3b0>

> Training accuracy is perfect. <br>
Validation accuracy is noticeably lower than training accuracy. <br>
Therefore, this model might be overfitting.
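
One way to put a number on that gap is to compare the per-epoch accuracies that Keras records during training. The sketch below is not part of the original notebook: `accuracy_gaps` is a hypothetical helper, and it assumes the return value of `model.fit` had been stored in a variable called `history`, whose `history.history` dictionary holds the `'accuracy'` and `'val_accuracy'` lists produced by the compile and fit settings above.

```python
def accuracy_gaps(history_dict):
    """Return training accuracy minus validation accuracy for each epoch."""
    train = history_dict["accuracy"]
    validation = history_dict["val_accuracy"]
    return [t - v for t, v in zip(train, validation)]


# Hypothetical usage, assuming the fit call were written as
# history = model.fit(...):
# gaps = accuracy_gaps(history.history)
```

A gap that keeps widening from epoch to epoch is the usual sign that the model is fitting the training set at the expense of generalization.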

# Results analysis

In [87]:
model.predict(testing_padded)



array([[2.6054122e-02],
       [9.9998307e-01],
       [1.6896736e-09],
       ...,
       [4.0747193e-12],
       [9.9911696e-01],
       [9.9999672e-01]], dtype=float32)

> The first prediction is close to 0, which means the model thinks this review is negative. Let's take a look.

In [89]:
testing_sentences[0]

"There are films that make careers. For George Romero, it was NIGHT OF THE LIVING DEAD; for Kevin Smith, CLERKS; for Robert Rodriguez, EL MARIACHI. Add to that list Onur Tukel's absolutely amazing DING-A-LING-LESS. Flawless film-making, and as assured and as professional as any of the aforementioned movies. I haven't laughed this hard since I saw THE FULL MONTY. (And, even then, I don't think I laughed quite this hard... So to speak.) Tukel's talent is considerable: DING-A-LING-LESS is so chock full of double entendres that one would have to sit down with a copy of this script and do a line-by-line examination of it to fully appreciate the, uh, breadth and width of it. Every shot is beautifully composed (a clear sign of a sure-handed director), and the performances all around are solid (there's none of the over-the-top scenery chewing one might've expected from a film like this). DING-A-LING-LESS is a film whose time has come."

> We can see that this review is fairly positive.

Let's see the actual label for this review.

In [92]:
testing_labels[0]

1

The label is 1, which means that this review is positive, but the model's prediction is much closer to 0 (negative), so this review is misclassified.

On the other hand, the last prediction is close to 1, which means that the model thinks this review is positive.

In [93]:
testing_sentences[-1]

"They just don't make cartoons like they used to. This one had wit, great characters, and the greatest ensemble of voice over artists ever assembled for a daytime cartoon show. This still remains as one of the highest rated daytime cartoon shows, and one of the most honored, winning several Emmy Awards."

> And indeed it looks like a positive review.

In [94]:
testing_labels[-1]

1
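
The comparison done by eye above can be written down explicitly. The sketch below turns the two sigmoid outputs shown earlier into class labels and checks them against the true labels. The 0.5 cutoff is an assumption (the conventional choice for a sigmoid output, not something the notebook sets), and `to_label` is a hypothetical helper.

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid output to a class: 1 (positive) or 0 (negative)."""
    return int(probability >= threshold)


# Values copied from the prediction array and labels shown above
first_prediction, first_label = 2.6054122e-02, 1
last_prediction, last_label = 9.9999672e-01, 1

print(to_label(first_prediction) == first_label)  # False: first review is misclassified
print(to_label(last_prediction) == last_label)    # True: last review is classified correctly
```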