<a href="https://colab.research.google.com/github/retazo0018/Movie-Review-Classification/blob/master/movie_review_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

In [0]:
data = keras.datasets.imdb

In [0]:
(train_data, train_labels), (test_data, test_labels) = data.load_data(num_words=10000)
# keep only the 10,000 most frequent words; rarer words are replaced by the <UNK> index (2)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [0]:
print(train_data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


In [0]:
word_index = data.get_word_index()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


In [0]:
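# shift every word index by 3 so that 0-3 are free for the special tokens below
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0  # padding token, used to bring every review to the same length
word_index["<START>"] = 1  # marks the start of a review
word_index["<UNK>"] = 2  # unknown (out-of-vocabulary) word
word_index["<UNUSED>"] = 3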

# invert the mapping: index -> word
reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode_Review(text):
  # map each index back to its word; use "?" for any index that has no entry in the dictionary
  return " ".join([reverse_word_index.get(i, "?") for i in text])

print(decode_Review(test_data[4]))


<START> one like much we social while ? haven't away formulaic black ? the cinema and close ? ? and close hand science given it fox which they sense ? to child to was truly over knock ? simon as am ? <UNK> his were arrogant was <UNK> over excellent behind ? <UNK> while that bin and close ? well <UNK> ? ? sam must ? ? ? small and costumes sit the with ? small good mom bat an slowly it coming home and close occasion but sense ? ? ? up effort effort thought watching year they a just <UNK> watching call <UNK> watching move watching
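To see why the indices are shifted by 3, here is a minimal sketch on a made-up three-word vocabulary. The names `toy_index`, `encode`, and `decode` are hypothetical and are not part of Keras; the sketch only assumes the convention used above (0 = `<PAD>`, 1 = `<START>`, 2 = `<UNK>`, 3 = `<UNUSED>`).

```python
# Toy vocabulary ranked by frequency (ranks start at 1), standing in for
# the much larger dictionary returned by get_word_index(). Hypothetical data.
toy_index = {"the": 1, "movie": 2, "great": 3}

# Shift every rank by 3 so that indices 0-3 are free for the special tokens.
shifted = {word: rank + 3 for word, rank in toy_index.items()}
shifted.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping: index -> word.
reverse = {index: word for word, index in shifted.items()}

def encode(sentence):
    """Turn a sentence into indices; unknown words map to <UNK>."""
    words = sentence.lower().split()
    return [shifted["<START>"]] + [shifted.get(w, shifted["<UNK>"]) for w in words]

def decode(indices):
    """Turn indices back into words; '?' marks an index with no entry."""
    return " ".join(reverse.get(i, "?") for i in indices)

encoded = encode("The movie was great")
print(encoded)
print(decode(encoded))
```

Without the shift, the most frequent word would collide with the `<START>` token at index 1.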


In [0]:
# pad or truncate every review to exactly 250 tokens (word indices, not characters)
train_data = keras.preprocessing.sequence.pad_sequences(train_data, value=word_index["<PAD>"], padding="post", maxlen=250)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, value=word_index["<PAD>"], padding="post", maxlen=250)
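What the padding step does to a single review can be sketched in plain Python. The helper `pad_post` below is hypothetical, not a Keras function; it imitates `pad_sequences` with `padding="post"` and relies on the fact that `pad_sequences` truncates from the front by default (`truncating="pre"`), so an over-long review keeps its last `maxlen` tokens.

```python
def pad_post(sequence, maxlen, value=0):
    """Imitate pad_sequences(..., padding="post", maxlen=maxlen) for one list of ints.

    Assumes maxlen > 0. A sequence that is too long keeps its LAST maxlen
    tokens, matching the Keras default truncating="pre".
    """
    trimmed = sequence[-maxlen:]
    return trimmed + [value] * (maxlen - len(trimmed))

short_review = pad_post([1, 14, 22], maxlen=5)
long_review = pad_post([1, 14, 22, 16, 43, 530], maxlen=4)
print(short_review)  # padded at the end with the <PAD> index 0
print(long_review)   # the first two tokens are dropped
```

One consequence worth knowing: a review longer than 250 tokens loses its opening, including the `<START>` token.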

In [0]:
model = keras.Sequential()
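model.add(keras.layers.Embedding(10000, 16))  # learns a 16-dimensional vector for each of the 10,000 word indices; words used in similar contexts (e.g. great, good) end up close together
model.add(keras.layers.GlobalAveragePooling1D())  # averages the word vectors over the sequence, giving one 16-dimensional vector per review
model.add(keras.layers.Dense(16, activation="relu"))  # hidden layer with 16 units (an arbitrary choice)
model.add(keras.layers.Dense(1, activation="sigmoid"))  # single output between 0 and 1: the predicted probability that the review is positive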
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 16)          160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
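The parameter counts in the summary can be checked by hand. The short calculation below is only an arithmetic sketch; the variable names are illustrative, and the layer sizes are the ones passed to `Embedding` and `Dense` above.

```python
vocab_size = 10000     # first argument to Embedding
embedding_dim = 16     # second argument to Embedding
hidden_units = 16      # units in the first Dense layer

embedding_params = vocab_size * embedding_dim                # one 16-dim vector per word index
pooling_params = 0                                           # averaging has nothing to learn
dense_params = embedding_dim * hidden_units + hidden_units   # weights + biases
output_params = hidden_units * 1 + 1                         # weights + bias

total = embedding_params + pooling_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total)
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the model size.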


In [0]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
x_val = train_data[:10000]  # first 10,000 reviews held out for validation
x_train = train_data[10000:]  # remaining 15,000 reviews used for training

y_val = train_labels[:10000]  # validation labels
y_train = train_labels[10000:]

# batch_size: number of reviews processed per gradient update
fitmodel = model.fit(x_train, y_train, epochs=40, batch_size=512, validation_data=(x_val, y_val), verbose=1)

results = model.evaluate(test_data, test_labels)  # returns [loss, accuracy] on the test set

print(results)

Train on 15000 samples, validate on 10000 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
[0.3339951250600815, 0.86936]
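The two numbers printed above are, in order, the loss and the accuracy on the test set, matching `loss="binary_crossentropy"` and `metrics=["accuracy"]` in `model.compile`. As a rough sketch of what they measure, here are both quantities in plain Python on four made-up predictions. The function names and the sample values are hypothetical, and the `eps` clipping is only there to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction equals the label."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if int(p > threshold) == y)
    return hits / len(y_true)

labels = [1, 0, 1, 1]                  # made-up ground truth
probabilities = [0.9, 0.2, 0.6, 0.4]   # made-up sigmoid outputs

loss = binary_crossentropy(labels, probabilities)
acc = accuracy(labels, probabilities)
print(loss, acc)
```

The loss rewards confident correct predictions and penalizes confident wrong ones, while the accuracy only looks at which side of 0.5 each prediction falls on.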


In [0]:
test_review = test_data[0]
predict = model.predict(np.array([test_review]))  # add a batch dimension: shape (1, 250)
print("Review: ")
print(decode_Review(test_review))
print("Prediction: " + str(predict[0]))
print("Actual: "+ str(test_labels[0]))
print(results)

Review: 
<START> coming take and film ? thriller ? ? <UNK> <UNK> ? ? dead ? ? look locals came either ? plot ? rich rich rich ? ? the into never will history zodiac most with ed and film this for addressed movie times care ? never and ending didn't ? those who ? for far of going <UNK> he on was zodiac she's take and ? thriller <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
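In the IMDB dataset a label of 1 marks a positive review and 0 a negative one, and the model's sigmoid output is a score between 0 and 1. To compare the prediction with the actual label, the score can be turned into a class. The helper below is a hypothetical sketch, and the 0.5 threshold is an assumption rather than something fixed by the notebook.

```python
def label_from_probability(probability, threshold=0.5):
    """Map a sigmoid output to a readable sentiment label."""
    return "positive" if probability > threshold else "negative"

# Made-up scores for illustration, not outputs of the trained model.
for score in (0.93, 0.50, 0.12):
    print(score, label_from_probability(score))
```

With the notebook's variables, this would be called as `label_from_probability(float(predict[0][0]))`.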