<img src="images/imdb_logo.png" width="200" height="200" />

# IMDB Review Tensorflow Classifier

The IMDB dataset contains 50,000 movie reviews for natural language processing and text analytics.
It is a binary sentiment classification dataset with 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether a review is positive or negative, using either classical classification or deep learning algorithms.




First, import the libraries needed. Keras provides the IMDB dataset.

In [1]:
import matplotlib.pyplot as plt
from keras.datasets import imdb
from keras import models
from keras import layers
import numpy as np
import random

The argument **num_words=10000** keeps only the 10,000 most frequently occurring words in the training data. Rarer words are replaced by a placeholder index rather than kept, which lets you work with vector data of manageable size.
As stated above, the training set and the test set should each contain 25,000 reviews, one review per row.
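To make the effect of a vocabulary cap concrete, here is a minimal sketch on a toy corpus. It is not how Keras builds the IMDB index internally: the corpus, the `cap_vocabulary` helper, and the `<unk>` placeholder are all assumptions made for illustration.

```python
from collections import Counter

def cap_vocabulary(tokenized_docs, num_words, unk="<unk>"):
    """Keep the `num_words` most frequent tokens and replace the rest with `unk`."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    kept = {token for token, _ in counts.most_common(num_words)}
    return [[token if token in kept else unk for token in doc]
            for doc in tokenized_docs]

# Hypothetical toy corpus, not taken from the IMDB data.
docs = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "dull"],
    ["the", "plot", "was", "thin"],
]
capped = cap_vocabulary(docs, num_words=3)
print(capped)
```

With a cap of 3, only "the", "was" and "movie" survive; every rarer word collapses into the same placeholder, which is the trade-off `num_words` makes between vocabulary size and detail.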

In [2]:
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

print("train_data shape is ", train_data.shape)
print("train_labels shape is ", train_labels.shape)
print("test_data shape is ", test_data.shape)
print("test_labels shape is ", test_labels.shape)

train_data shape is  (25000,)
train_labels shape is  (25000,)
test_data shape is  (25000,)
test_labels shape is  (25000,)


Each row is a list of word indices, where each index is the word's position in the dictionary. The label is 1 for a positive review and 0 for a negative one.


In [3]:
random_index = random.randrange(0, train_data.shape[0])
random_review = train_data[random_index]
random_review_setiment = train_labels[random_index]

print(random_review)
if random_review_setiment == 0:
    print("negative review")
else:
    print("positive review")

[1, 260, 332, 4, 85, 795, 23, 14, 22, 13, 62, 40, 8, 1497, 61, 205, 650, 15, 14, 9, 31, 1211, 20, 8, 67, 894, 25, 26, 6, 964, 2, 13, 244, 24, 10, 10, 54, 610, 33, 34, 6, 2855, 5210, 2, 4, 22, 9, 35, 2, 1321, 15, 2423, 178, 19, 53, 2, 2, 74, 26, 2250, 7, 112, 5742, 11, 31, 1266, 12, 9, 179, 878, 8, 106, 4, 3228, 109, 670, 3636, 9, 38, 6938, 5, 4009, 15, 12, 9, 254, 8, 2198, 8, 90, 387, 584, 3663, 18, 90, 17, 29, 5905, 39, 31, 5216, 532, 5742, 904, 8, 4, 375, 5, 29, 144, 115, 81, 6, 4245, 136, 5, 535, 8, 30, 623, 615, 11, 6, 731, 2008, 57, 132, 100, 28, 15, 76, 3773, 2, 5, 30, 424, 8, 471, 23, 6, 4545, 40, 5839, 2, 894, 7, 265, 29, 9, 2060, 3228, 11, 35, 2060, 3228, 5407, 365, 10, 10, 682, 883, 47, 94, 1139, 388, 21, 36, 26, 2, 5, 2, 53, 400, 74, 24, 13, 421, 17, 48, 13, 71, 23, 6, 7616, 1311, 19, 6, 1562, 2, 415, 5, 4864, 15, 12, 62, 130, 460, 12, 2, 4, 712, 15, 70, 2061, 54, 99, 76, 1140, 7, 6, 22, 9, 2611, 11, 64, 31, 415, 294, 37, 1503, 4, 532, 7696, 8, 30, 502, 8, 1491, 145, 39, 12,

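The words are indexed by how often they occur, with the most frequent words getting the lowest indices. Because of `num_words=10000`, no index is greater than 9999. Indices 0, 1 and 2 are reserved for padding, start of sequence and unknown words, so the dictionary is limited to at most 9,997 actual words.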
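The reserved indices explain why decoding has to subtract an offset. The sketch below illustrates this with a hypothetical three-word ranking. Only the offset of 3 and the meaning of indices 0, 1 and 2 follow the Keras loader defaults; `toy_rank`, `encode` and `decode` are invented for this example.

```python
# Reserved indices used by the Keras IMDB loader with its default arguments.
PAD, START, UNKNOWN = 0, 1, 2
INDEX_FROM = 3  # every word's frequency rank is shifted by this offset

# Hypothetical frequency ranks (1 = most frequent); not the real IMDB ranks.
toy_rank = {"the": 1, "film": 2, "great": 3}
rank_to_word = {rank: word for word, rank in toy_rank.items()}

def encode(words, num_words):
    """Map words to shifted indices; indices at or above num_words become UNKNOWN."""
    encoded = [START]
    for word in words:
        # Words missing from the toy ranking get a rank that is always cut off.
        index = toy_rank.get(word, num_words) + INDEX_FROM
        encoded.append(index if index < num_words else UNKNOWN)
    return encoded

def decode(indices):
    """Undo the shift; reserved indices have no word and decode to '?'."""
    return " ".join(rank_to_word.get(i - INDEX_FROM, "?") for i in indices)

encoded = encode(["the", "film", "was", "great"], num_words=6)
print(encoded)          # [1, 4, 5, 2, 2]
print(decode(encoded))  # ? the film ? ?
```

The leading `?` comes from the start-of-sequence marker, and the later ones from words that fell outside the vocabulary cap.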

Let's create a function that decodes a review from numbers back to words.

In [4]:
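def decode_review(review):
    # word_index maps each word to its index; invert it to look words up by index.
    word_index = imdb.get_word_index()
    reverse_word_index = {value: key for key, value in word_index.items()}
    # Indices are offset by 3 because 0, 1 and 2 are reserved; those decode to '?'.
    decoded_review = ' '.join(reverse_word_index.get(i - 3, '?') for i in review)
    return decoded_review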

In [5]:
decoded_review = decode_review(random_review)
print("review:", decoded_review)
if random_review_setiment == 0:
    print("negative review")
else:
    print("positive review")

review: ? having read the other comments on this film i would like to share my own view that this is one tough movie to see unless you are a total ? i am not br br when looked at by a purely objective ? the film is an ? narrative that presents us with more ? ? than are capable of being absorbed in one sitting it is quite difficult to watch the brooks character robert cole is so unsympathetic and unpleasant that it is hard to relate to him let alone root for him as he stumbles from one dysfunctional self absorbed situation to the next and he should never do a topless scene and expect to be taken seriously in a romantic context no man could have that much exposed ? and be supposed to turn on a babe like kathryn ? unless of course he is albert brooks in an albert brooks controlled production br br modern romance has its amusing moments but they are ? and ? more often than not i felt as if i were on a confined journey with a thoroughly ? person and wishing that it would end already it ? th

The reviews each contain a different number of words, so we need to standardize them: they should be encoded so that they all have the same length while preserving the integrity of the data.

Since our dictionary has 10,000 words, we can encode each review as an array of length 10,000 in which the position of every word that occurs in the review is set to 1 and the rest of the array is zeros.

For example, with a dictionary of only 6 words, the review [2, 4, 5] would become [0, 0, 1, 0, 1, 1].
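A small worked version of this encoding, for the review [2, 4, 5] and a dimension of 6, can be sketched in plain Python. The `multi_hot` helper is a hypothetical name used only for this illustration; the notebook's own NumPy implementation is separate.

```python
def multi_hot(sequence, dimension):
    """Return a list of length `dimension` with 1.0 at every index in `sequence`."""
    vector = [0.0] * dimension
    for index in sequence:
        vector[index] = 1.0
    return vector

print(multi_hot([2, 4, 5], dimension=6))     # [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
print(multi_hot([5, 2, 2, 4], dimension=6))  # same vector as above
```

The second call shows what this encoding gives up: word order and repeat counts are not kept, only whether each word appears at all.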


In [6]:
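def vectorize_sequences(sequences, dimension=10000):
    # One row per review, one column per word index, all zeros to start.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set every column whose index occurs in the review to 1.
        results[i, sequence] = 1.
    return results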

In [7]:
encoded_review = vectorize_sequences(train_data[1:2])
print(encoded_review)

[[0. 1. 1. ... 0. 0. 0.]]


## Neural Network Design



## Design Validation
Use a subset of the data to check whether the design converges. 10,000 of the 25,000 training reviews are held out as a validation set, and the model is trained on the remaining 15,000.
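The hold-out split itself is just slicing. The following is a minimal sketch using stand-in lists of the same size as the training set; `holdout_split` is a hypothetical helper, and it assumes the data is already in an order that is safe to slice without shuffling.

```python
def holdout_split(samples, labels, val_size):
    """Use the first `val_size` items for validation and the rest for training."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    val = (samples[:val_size], labels[:val_size])
    train = (samples[val_size:], labels[val_size:])
    return train, val

# Stand-in data with the same number of rows as the IMDB training set.
samples = list(range(25000))
labels = [i % 2 for i in samples]
(train_x, train_y), (val_x, val_y) = holdout_split(samples, labels, val_size=10000)
print(len(train_x), len(val_x))  # 15000 10000
```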

In [8]:
x_train = vectorize_sequences(train_data)
y_train = np.asarray(train_labels).astype('float32')

x_test = vectorize_sequences(test_data)
y_test = np.asarray(test_labels).astype('float32')

x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

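Now implement the model: two hidden layers of 16 ReLU units each, followed by a single sigmoid output unit, trained with RMSprop and binary cross-entropy loss.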

In [9]:
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512, validation_data=(x_val, y_val))

Epoch 1/20
...
Epoch 20/20
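As a sanity check on the size of this network, the number of trainable parameters in the three Dense layers can be worked out by hand. The sketch below only does that arithmetic; `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit for a fully connected layer."""
    return inputs * units + units

# (inputs, units) for each Dense layer in the model above.
layer_sizes = [(10000, 16), (16, 16), (16, 1)]
per_layer = [dense_params(i, u) for i, u in layer_sizes]
print(per_layer)       # [160016, 272, 17]
print(sum(per_layer))  # 160305
```

Almost all of the parameters sit in the first layer, which is a direct consequence of the 10,000-dimensional input encoding.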


Let's visualize the results.

In [None]:
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

epochs = range(1, len(val_loss_values) + 1)

plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
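One way to read such a plot is to look for the epoch where validation loss is lowest; a validation loss that rises while training loss keeps falling is the usual sign of overfitting. The sketch below picks that epoch from a history dictionary. The loss values are hypothetical and are not results from this model, and `best_epoch` is a helper invented for this example.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    lowest = min(range(len(val_losses)), key=val_losses.__getitem__)
    return lowest + 1

# Hypothetical loss values for illustration only.
toy_history = {
    "loss": [0.52, 0.31, 0.22, 0.17, 0.13],
    "val_loss": [0.40, 0.31, 0.28, 0.29, 0.33],
}
print(best_epoch(toy_history["val_loss"]))  # 3
```

The same function could be applied to `history.history['val_loss']` to choose how many epochs to train for.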

In [None]:
print("done")