## TEXT CLASSIFICATION OF MOVIE REVIEWS
In this notebook I classify IMDB movie reviews as positive or negative using only the text of each review.

> Data Acquisition Credit:
[Learning Word Vectors for Sentiment Analysis](https://aclanthology.org/P11-1015) (Maas et al., ACL 2011)

>> Code by:
@semaxspaul

In [41]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [42]:
from notebook.services.config import ConfigManager
cm = ConfigManager().update('notebook', {'limit_output': 100})

In [43]:
# Loading movie review data using the 'keras' API by tensorflow
reviews_data = keras.datasets.imdb

> Pre-Processing Data

In [44]:
# Loading the pre-split training and testing sets, keeping at most the 200,000 most frequent words
(train_data, train_labels), (test_data, test_labels) = reviews_data.load_data(num_words=200000)

In [45]:
# Retrieving the word-to-index mapping
word_index = reviews_data.get_word_index()

In [46]:
# Shifting every word index by 3 so that indices 0-3 are reserved for special tokens
word_index = {key:value+3 for key, value in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2
word_index["<UNUSED>"] = 3

In [47]:
# Reversing the word_index dictionary
reverse_word_index = {value: key for key, value in word_index.items()}

In [48]:
# Decoding an integer-encoded review back into words
def decode_review(text):
  return " ".join([reverse_word_index.get(i, "?") for i in text])
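
To make the index shift and the decoding step concrete, here is a minimal sketch on a toy vocabulary. The three words and their indices are made up for illustration and stand in for the real mapping returned by `get_word_index()`; the names `raw_word_index`, `toy_word_index` and `toy_decode` are hypothetical.

```python
# Toy stand-in for the real word index (hypothetical words and ranks)
raw_word_index = {"the": 1, "movie": 2, "great": 3}

# Shift every index by 3 so that 0-3 are free for the special tokens
toy_word_index = {word: idx + 3 for word, idx in raw_word_index.items()}
toy_word_index["<PAD>"] = 0
toy_word_index["<START>"] = 1
toy_word_index["<UNK>"] = 2
toy_word_index["<UNUSED>"] = 3

# Invert the mapping: integer id -> word
toy_reverse_index = {idx: word for word, idx in toy_word_index.items()}

def toy_decode(encoded):
    """Turn a list of integer ids back into a space-separated string."""
    return " ".join(toy_reverse_index.get(i, "?") for i in encoded)

decoded = toy_decode([1, 4, 5, 6, 0, 0, 99])
print(decoded)  # <START> the movie great <PAD> <PAD> ?
```

Ids that are missing from the reverse mapping (99 here) fall back to `"?"`, which is the same fallback `decode_review` uses.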

In [49]:
# Setting a fixed length for all reviews
train_data = keras.preprocessing.sequence.pad_sequences(train_data, value=word_index["<PAD>"], padding="post", maxlen=250)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, value=word_index["<PAD>"], padding="post", maxlen=250)
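
The effect of the padding call can be sketched in plain Python. This is only an illustrative stand-in, not the Keras implementation: it assumes `padding="post"` together with the default `truncating="pre"`, so short reviews get zeros appended and long reviews keep their last `maxlen` tokens. The helper name `pad_post_truncate_pre` is hypothetical.

```python
def pad_post_truncate_pre(sequences, maxlen, pad_value=0):
    """Plain-Python sketch of post-padding with pre-truncation."""
    result = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                       # keep the last maxlen tokens
        seq = seq + [pad_value] * (maxlen - len(seq))   # pad at the end
        result.append(seq)
    return result

padded = pad_post_truncate_pre([[1, 14, 22], [1, 5, 6, 7, 8, 9, 10]], maxlen=5)
print(padded)  # [[1, 14, 22, 0, 0], [6, 7, 8, 9, 10]]
```

Note that under pre-truncation a review longer than `maxlen` loses its leading tokens, including the `<START>` marker.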

> Developing Model

In [50]:
# Embedding -> average pooling over the sequence -> small dense classifier
model = keras.Sequential()
model.add(keras.layers.Embedding(200000, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 16)          3200000   
                                                                 
 global_average_pooling1d_1   (None, 16)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_2 (Dense)             (None, 16)                272       
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 3,200,289
Trainable params: 3,200,289
Non-trainable params: 0
_________________________________________________________________
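
The parameter counts in the summary can be checked by hand. The short sketch below redoes the arithmetic from the layer sizes used above; the variable names are mine.

```python
vocab_size, embed_dim, hidden_units = 200_000, 16, 16

# Embedding: one 16-dimensional vector per vocabulary entry
embedding_params = vocab_size * embed_dim
# GlobalAveragePooling1D only averages over the sequence axis, so it has no weights
pooling_params = 0
# Dense layers: weights (inputs * units) plus one bias per unit
dense_hidden_params = embed_dim * hidden_units + hidden_units
dense_output_params = hidden_units * 1 + 1

total_params = (embedding_params + pooling_params
                + dense_hidden_params + dense_output_params)
print(embedding_params, dense_hidden_params, dense_output_params, total_params)
# 3200000 272 17 3200289
```

Almost all of the parameters sit in the embedding table, so the vocabulary size passed to `Embedding` dominates the size of the model.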


In [51]:
# Compiling the model: optimizer, loss function, and evaluation metric
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Holding out the first 10,000 training reviews as a validation set
val_data = train_data[:10000]
train_data = train_data[10000:]

val_labels = train_labels[:10000]
train_labels = train_labels[10000:]

In [52]:
# Fitting the model
model.fit(train_data, train_labels,
          epochs=40, 
          batch_size=512, 
          validation_data=(val_data, val_labels),
          verbose=1)

Epoch 40/40


<keras.callbacks.History at 0x7fd2cd5649d0>

In [53]:
# Model Evaluation
results = model.evaluate(test_data, test_labels)

print(results)

[0.3390657901763916, 0.8714799880981445]
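
The two numbers are the test loss (binary cross-entropy) followed by the test accuracy, in the order set by `model.compile`. As a rough illustration of what each measures, here is a plain-Python sketch on four made-up predictions; the labels, probabilities and function names are hypothetical, and the 0.5 decision threshold is assumed.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

labels = [1, 0, 1, 0]          # made-up ground truth
probs = [0.9, 0.2, 0.4, 0.6]   # made-up sigmoid outputs

loss = binary_crossentropy(labels, probs)
acc = accuracy(labels, probs)
print(round(loss, 4), acc)
```

The loss penalises confident mistakes more heavily than accuracy does, which is why the two can move differently during training.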


> Model Sample Test
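
A single padded review is a 1-D array of shape `(250,)`, while `predict` works on batches of shape `(batch_size, 250)`, so one review needs a leading batch axis. The sketch below uses a dummy array (not real review data) to show why the direction of the reshape matters.

```python
import numpy as np

single_review = np.arange(250)             # dummy stand-in for one padded review

as_batch = single_review.reshape(1, -1)    # 1 sample of 250 tokens
as_column = single_review.reshape(-1, 1)   # 250 samples of 1 token each

print(as_batch.shape)   # (1, 250)
print(as_column.shape)  # (250, 1)
```

With the `(250, 1)` layout the model would score every token as its own one-word review, and index `[0][0]` of the output would describe only the first token. Slicing with `test_data[:1]` is another way to get the `(1, 250)` layout.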

In [54]:
test_review = test_data[0]
# predict expects a batch, so reshape the single review to shape (1, 250)
predict = model.predict(test_review.reshape(1, -1))
print(f"Review: \n{decode_review(test_review)}")
print(f"Sentiment Prediction: {str(round(predict[0][0], 2))}")
print(f"Actual Sentiment: {test_labels[0]}")
print(results)

Review: 
<START> please give this one a miss br br kristy swanson and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite lacklustre so all you madison fans give this a miss <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD

> Saving the Model

In [55]:
# Saving this model
model.save('semaxspaul_movie_review_sentiment.h5')

> Testing the model on external data
>> I obtained a 10/10 review of Black Panther: Wakanda Forever (2022) and saved it in the ***'external_data_test.txt'*** file
>>> Link: https://www.imdb.com/review/rw8669414/?ref_=tt_urv 

In [56]:
# Loading the model
review_model = keras.models.load_model('/content/semaxspaul_movie_review_sentiment.h5')

In [57]:
# Function to encode a review
def encode_review(text):
  # 1 corresponds to <START>
  encoded_text = [1]

  for word in text:
    if word.lower() in word_index:
      encoded_text.append(word_index[word.lower()])
    else:
      # 2 corresponds to unknown words <UNK>
      encoded_text.append(2)

  return encoded_text
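
To illustrate the encoding on its own, here is a minimal sketch with a toy vocabulary. The vocabulary, the regex-based `tokenize` helper and `encode_tokens` are hypothetical and only one possible way to split raw text; apostrophes are kept because the decoded reviews contain words such as "don't".

```python
import re

# Toy vocabulary (hypothetical), already shifted to leave 0-3 for special tokens
toy_word_index = {"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3,
                  "this": 4, "film": 5, "is": 6, "wonderful": 7}

def tokenize(text):
    """Lower-case the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def encode_tokens(tokens, vocab):
    """Prefix <START> and map each token to its id, or to <UNK> if unseen."""
    return [vocab["<START>"]] + [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]

tokens = tokenize("This film is WONDERFUL, truly!")
encoded = encode_tokens(tokens, toy_word_index)
print(tokens)   # ['this', 'film', 'is', 'wonderful', 'truly']
print(encoded)  # [1, 4, 5, 6, 7, 2]
```

The word "truly" is not in the toy vocabulary, so it is encoded as 2 (`<UNK>`), just as out-of-vocabulary words are in `encode_review`.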

In [58]:
# Preprocessing the external data and predicting its sentiment
with open('/content/external_data_test.txt', encoding='utf-8') as f:
  for line in f:
    if not line.strip():
      continue  # skip blank lines so they are not scored as empty reviews
    # Remove the punctuation marks , . ( ) : * " and split on whitespace
    new_line = line.translate(str.maketrans('', '', ',.():*"')).split()
    encoded = encode_review(new_line)
    encoded = keras.preprocessing.sequence.pad_sequences([encoded], value=word_index["<PAD>"], padding="post", maxlen=250)
    sentiment_prediction = review_model.predict(encoded)
    print(f"Review: \n{line}")
    print(f"Encoded Review: \n{encoded}\n")
    print(f"Sentiment Prediction: {str(round(sentiment_prediction[0][0], 2))}")
    print("Actual Sentiment: Positive (1)")


Review: 
I'll preface this by stating that I do NOT understand the mixed reviews. There is not one element of this that I would've done differently. This 2-hour-and-41-minute epic had me hooked from start to finish. No other Marvel property in my opinion, particularly in Phase 4, can match this level of quality, and that's in no small part due to Ryan Coogler, Angela Bassett, Letitia Wright, and the incredibly talented (not to mention gorgeous) newcomer Tenoch Huerta. The talent on display in *every single element* of this film is going to be tough to match, even from next month's Avatar: The Way of Water. Speaking of Avatar 2, they now have some SERIOUS competition in EVERY Oscar category next year, thanks to this film. I could honestly see this sweeping every category and making a strong play for Best Picture. The acting: Fantastic. Angela Bassett is magnificent. Letitia Wright is wonderful.  Tenoch Huerta brings a depth to his Namor that is rare in any film, much less a Marvel one. 

The model predicted a positive sentiment for this review, as expected for a 10/10 rating.