<img src="images/imdb_logo.png" width="200" height="200" />

# IMDB Review TensorFlow Classifier

The IMDB dataset contains 50,000 movie reviews for natural language processing and text analytics.
It is a binary sentiment classification dataset: 25,000 highly polar movie reviews for training and 25,000 for testing. The goal is to predict whether each review is positive or negative, using either classical classification or deep learning algorithms.




Import all the libraries needed. Keras provides the IMDB dataset.

In [2]:
import matplotlib.pyplot as plt
from keras.datasets import imdb
from keras import models
from keras import layers
import numpy as np
import random

The argument **num_words=10000** keeps only the 10,000 most frequently occurring words in the training data. Rarer words are discarded, which keeps the vector data at a manageable size.
As stated above, the training and test sets should each contain 25,000 movie reviews, one review per row.

In [3]:
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

print("train_data shape is ", train_data.shape)
print("train_labels shape is ", train_labels.shape)
print("test_data shape is ", test_data.shape)
print("test_labels shape is ", test_labels.shape)

train_data shape is  (25000,)
train_labels shape is  (25000,)
test_data shape is  (25000,)
test_labels shape is  (25000,)
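
To make the effect of num_words concrete, here is a minimal sketch on toy data. It is not Keras's implementation: the helper `cap_vocabulary` and the toy reviews are hypothetical, and the choice of 2 as the unknown-word index is an assumption that mirrors the default out-of-vocabulary index of `imdb.load_data`.

```python
def cap_vocabulary(sequences, num_words, oov_index=2):
    """Hypothetical helper: replace every word index >= num_words with oov_index."""
    return [[i if i < num_words else oov_index for i in seq] for seq in sequences]

# Toy reviews already encoded as word indices (lower index = more frequent word).
toy_reviews = [[1, 14, 22, 9500], [1, 12000, 7, 10000, 35]]

capped = cap_vocabulary(toy_reviews, num_words=10000)
print(capped)  # [[1, 14, 22, 9500], [1, 2, 7, 2, 35]]
```

Under this assumption, every rare word collapses onto the same index, which would explain why the value 2 shows up repeatedly in an encoded review.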


Each row is a list of word indices into the dictionary, and each label is either 1 or 0, indicating whether the review is positive or negative.


In [4]:
random_index = random.randrange(0, train_data.shape[0])
random_review = train_data[random_index]
random_review_setiment = train_labels[random_index]

print (random_review)
if(random_review_setiment == 0 ):
    print("negative review")
else:
    print("positive review")

[1, 4, 22, 2013, 19, 6, 6704, 324, 7, 6, 2040, 860, 313, 601, 19, 2, 5370, 15, 4474, 23, 4, 499, 7, 6, 392, 2190, 5, 35, 23, 268, 2, 739, 15, 4535, 2, 2, 1898, 2, 6, 185, 860, 2, 3632, 773, 2, 2, 2, 2371, 56, 4, 2, 2524, 8, 4, 313, 1004, 9313, 2, 2, 2, 19, 937, 29, 9, 260, 35, 1586, 496, 41, 658, 2, 2, 2, 17, 2, 2, 145, 37, 571, 8, 30, 2, 1750, 2, 5167, 2957, 344, 8, 169, 27, 322, 5, 1475, 260, 55, 4691, 4249, 19, 257, 85, 27, 9658, 2730, 4, 2, 3632, 4653, 1496, 199, 2, 5, 2, 159, 8000, 1720, 120, 6, 1117, 303, 5, 2901, 2, 2498, 2060, 2957, 11, 1898, 23, 6, 780, 3182, 19, 27, 322, 3021, 2787, 742, 5, 68, 185, 577, 4232, 4232, 2, 68, 491, 464, 2365, 3378, 7517, 2, 37, 495, 18, 4, 298, 2, 1525, 98, 46, 34, 1556, 98, 6, 273, 8, 789, 25, 92, 359, 72, 8, 25, 121, 29, 2, 560, 45, 170, 38, 706, 88, 45, 2368, 8, 63, 199, 2901, 5, 3021, 462, 125, 17, 36, 540, 92, 264, 11, 2745, 33, 222, 18, 4, 58, 112, 15, 9, 220, 1241, 4, 22, 271, 83, 1591, 5440, 2690, 471, 23, 5, 125, 34, 533, 3021, 47, 35, 7

Words are indexed by frequency, so lower indices correspond to more common words. With num_words=10000, no word has an index greater than 9999. Indices 0, 1, and 2 are reserved for padding, start of sequence, and unknown words, which leaves at most 9,997 indices for actual words.

Let's create a function that decodes a review from numbers back to words.

In [5]:
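def decode_review(review):
    # word_index maps each word to its integer index
    word_index = imdb.get_word_index()
    # Invert the mapping so words can be looked up by index
    reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
    # Offset by 3: indices 0, 1, and 2 are reserved for padding, start of sequence, and unknown
    decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in review])
    return decoded_review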

In [6]:
decoded_review = decode_review(random_review)
print("review:", decoded_review)
if(random_review_setiment == 0 ):
    print("negative review")
else:
    print("positive review")

negative review
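
The `i - 3` offset in decode_review is easier to see on a tiny example that needs no download. This is only a sketch: `toy_word_index` and `decode_with` are hypothetical stand-ins for the real word index and for decode_review.

```python
# Hypothetical miniature word index: word -> frequency rank (1 = most frequent).
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

def decode_with(word_index, review):
    """Decode a list of indices using the same i - 3 offset as decode_review."""
    reverse_word_index = {rank: word for word, rank in word_index.items()}
    return " ".join(reverse_word_index.get(i - 3, "?") for i in review)

# 1 is the start marker and 2 the unknown-word marker; neither maps to a word.
encoded = [1, 4, 5, 6, 2, 7]
print(decode_with(toy_word_index, encoded))  # ? the movie was ? great
```

The reserved indices decode to '?', so a question mark in a decoded review marks either the start of the sequence or a word that fell outside the vocabulary.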


The reviews contain different numbers of words, so we need to encode them in a way that gives every review the same length while preserving the information in the data.

Since our dictionary has 10,000 entries, we can encode each review as an array of length 10,000 in which the position of every word index present in the review is set to 1 and the rest of the array is zeros.

For example, with a dictionary of size 6, the review [2, 4, 5] would become [0, 0, 1, 0, 1, 1].


In [7]:
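def vectorize_sequences(sequences, dimension=10000):
    # One row per review, one column per possible word index, all zeros to start
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set the columns for the word indices present in this review to 1
        results[i, sequence] = 1.
    return results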

In [8]:
encoded_review = vectorize_sequences(train_data[1:2])
print(encoded_review)

[[0. 1. 1. ... 0. 0. 0.]]
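
The 10,000-wide output is hard to inspect, so the sketch below applies the same encoding with a dictionary of size 6. The helper name `multi_hot` and the toy reviews are assumptions for illustration; the logic follows vectorize_sequences.

```python
import numpy as np

def multi_hot(sequences, dimension):
    """Same idea as vectorize_sequences, with a small explicit dimension."""
    results = np.zeros((len(sequences), dimension))
    for row, sequence in enumerate(sequences):
        results[row, sequence] = 1.0  # a repeated index still yields a single 1
    return results

toy = [[2, 4, 5], [1, 1, 3]]
encoded = multi_hot(toy, dimension=6)
print(encoded.tolist())
# [[0.0, 0.0, 1.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0, 0.0, 0.0]]
```

Note that this encoding records only which words are present: word order and repeat counts are lost, as the second toy review shows.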


## Neural Network Design



## Design Validation
We hold out a subset of the training data to check whether the design converges: 10,000 of the 25,000 training samples are set aside as a validation set, and the model is trained on the remaining 15,000.

In [9]:
x_train = vectorize_sequences(train_data)
y_train = np.asarray(train_labels).astype('float32')

x_test = vectorize_sequences(test_data)
y_test = np.asarray(test_labels).astype('float32')

x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]
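
The slicing above can be checked on a toy array. This is a sketch only: `holdout_split` is a hypothetical helper, not part of Keras, and it assumes the rows are already in random order so that taking the first rows gives a fair sample.

```python
import numpy as np

def holdout_split(x, y, n_val):
    """Hypothetical helper: first n_val rows for validation, the rest for training."""
    return (x[n_val:], y[n_val:]), (x[:n_val], y[:n_val])

x = np.arange(10).reshape(5, 2)                  # 5 toy samples, 2 features each
y = np.array([0, 1, 0, 1, 1], dtype="float32")   # toy labels

(part_x, part_y), (val_x, val_y) = holdout_split(x, y, n_val=2)
print(part_x.shape, val_x.shape)  # (3, 2) (2, 2)
```

The two parts never share a row, so the validation loss is measured on samples the model has not been trained on.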

Now to implement the model

In [2]:
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512, validation_data=(x_val, y_val))

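
The model is compiled with the binary_crossentropy loss, which compares the sigmoid output (a probability between 0 and 1) against the 0/1 label. The sketch below is a plain-Python restatement of the standard formula, not Keras's internal code; the clipping constant `eps` is an assumption added to avoid taking log(0).

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(y_true)

# One positive and one negative review, predicted well and then badly.
confident_right = binary_crossentropy([1.0, 0.0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1.0, 0.0], [0.1, 0.9])
print(round(confident_right, 3), round(confident_wrong, 3))  # 0.105 2.303
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is what pushes the network toward well-calibrated probabilities.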

Let's visualize the results.

In [1]:
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

epochs = range(1, len(val_loss_values) + 1)

plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

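
One common way to read such a plot is to find the epoch at which validation loss is lowest, since validation loss rising while training loss keeps falling usually signals overfitting. A minimal sketch follows; the loss values here are hypothetical placeholders, not results from this model, and in practice they would come from history.history['val_loss'].

```python
# Hypothetical validation losses, one per epoch (illustrative values only).
val_loss_values = [0.40, 0.31, 0.28, 0.29, 0.33, 0.38]

# Epochs are numbered from 1, matching the x-axis of the plot.
best_index = min(range(len(val_loss_values)), key=val_loss_values.__getitem__)
best_epoch = best_index + 1
print(best_epoch)  # 3
```

Training for about that many epochs, rather than the full 20, is one simple way to limit overfitting.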