# **Text Classification with the IMDb Reviews Dataset from Keras**
@author: [vatsalya-gupta](https://github.com/vatsalya-gupta)

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

imdb = keras.datasets.imdb

We restrict the vocabulary to the 88000 most frequent words in the dataset. The reviews are first split into training and test sets, 80 % and 20 % respectively. Later, during fitting, a validation set is taken out of the training data, which makes the final train-test-validate split 70-20-10 %.
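The percentages follow from the sizes involved. A quick check, assuming the Keras IMDb dataset ships 25000 training and 25000 test reviews (50000 in total) and using the `validation_split = 0.125` that is passed to `model.fit` later on:

```python
# Sanity check of the 70-20-10 split (sizes assume the standard 50000-review IMDb dataset).
total_reviews = 25000 + 25000             # Keras train + test portions, concatenated
test_size = 10000                         # the first 10000 reviews are held out for testing
train_pool = total_reviews - test_size    # the remaining reviews are passed to model.fit
val_size = int(train_pool * 0.125)        # validation_split = 0.125
fit_size = train_pool - val_size          # reviews actually used for weight updates

for name, size in [("train", fit_size), ("test", test_size), ("validate", val_size)]:
    print(f"{name:8s} {size:6d}  {100 * size / total_reviews:.0f} %")
```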

In [2]:
(train_data, train_targets), (test_data, test_targets) = imdb.load_data(num_words = 88000)
data = np.concatenate((test_data, train_data), axis=0)
targets = np.concatenate((test_targets, train_targets), axis=0)

test_data = data[:10000]
test_labels = targets[:10000]
train_data = data[10000:]
train_labels = targets[10000:]

print(train_data[0])

[1, 670, 5304, 5622, 13500, 308, 8551, 23033, 25, 71, 1017, 6, 253, 22, 4, 436, 223, 100, 358, 134, 5, 85, 907, 71, 540, 2218, 88, 36, 28, 24, 1477, 4, 2483, 21, 1075, 12, 18, 148, 15, 3824, 15, 36, 122, 24, 34114, 8, 4, 204, 65, 7, 4, 1422, 89, 400, 127, 4, 228, 11, 6, 22, 2555, 2198, 8, 51, 9, 170, 23, 11, 4, 22, 12, 9, 4, 1310, 15, 5442, 14, 9, 51, 13, 264, 4, 907, 7, 134, 102, 71, 399, 1855, 6, 2392, 1310, 18, 154, 9485, 18, 4, 91, 173, 36, 3115, 1669, 19, 32, 134, 9485, 37, 9, 170, 8, 40, 98, 32, 75, 100, 28, 343, 53, 5355, 10, 10, 417, 51, 9, 498, 60, 1422, 209, 6, 171, 1298, 372]


Our training and testing data are arrays of reviews, where each review is a list of integers and each integer represents a unique word. To make the reviews human readable, we shift every word index by 3 to make room for four special tags, build the reverse mapping from integers to words, and implement a function that converts a list of integers back into text.

In [3]:
word_index = imdb.get_word_index()
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2    # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = {value: key for key, value in word_index.items()}

np.save("word_index.npy", word_index)    # saving the word_index for future use
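Because `word_index` is a Python dict, `np.save` stores it as a pickled zero-dimensional object array. A minimal sketch of reading it back, using a small stand-in dictionary and a temporary directory (both are assumptions made so the example is self-contained):

```python
import os
import tempfile

import numpy as np

# Stand-in for the real word_index (hypothetical toy vocabulary).
toy_word_index = {"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3, "the": 4}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "word_index.npy")
    np.save(path, toy_word_index)    # the dict is pickled inside a 0-d object array

    # allow_pickle = True is required to load object arrays;
    # .item() unwraps the 0-d array back into the original dict.
    restored = np.load(path, allow_pickle = True).item()

print(restored == toy_word_index)
```

Since loading relies on pickle, this should only be done with files from a trusted source.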

In [4]:
def decode_review(text):
	return " ".join([reverse_word_index.get(i, "?") for i in text])

In [5]:
print(decode_review(train_data[0]))

<START> robert altman nicolas roeg john luc goddard you were expecting a fun film the entire family could enjoy these and other directors were obviously chosen because they have not followed the mainstream but created it for those that complain that they did not adhere to the original story of the opera how often does the music in a film directly relate to what is going on in the film it is the mood that counts this is what i believe the directors of these movies were doing creating a contemporary mood for old operas for the most part they succeed wonderfully with all these operas who is going to like them all we could have used more beverly br br finally what is art even opera without a few naked women
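The index shift and the reverse lookup can be traced by hand on a tiny example. The three-word vocabulary below is hypothetical and only mimics the structure of what `imdb.get_word_index()` returns:

```python
# Toy ranks in the style of get_word_index() (hypothetical three-word vocabulary).
raw_index = {"the": 1, "film": 2, "great": 3}

# Shift every rank by 3 to free ids 0-3 for the special tags.
toy_index = {word: rank + 3 for word, rank in raw_index.items()}
toy_index.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping so that ids can be turned back into words.
toy_reverse = {idx: word for word, idx in toy_index.items()}

def decode(ids):
    # Ids missing from the mapping are shown as "?".
    return " ".join(toy_reverse.get(i, "?") for i in ids)

print(decode([1, 4, 5, 2, 6, 0, 0]))    # <START> the film <UNK> great <PAD> <PAD>
print(decode([1, 99]))                  # <START> ?
```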


In [6]:
print(len(train_data[0]), len(test_data[0]))

132 68


The following block of code finds the length of the longest review in our dataset.

In [7]:
longest_train = max(len(l) for l in train_data)
longest_test = max(len(l) for l in test_data)

max_words = max(longest_train, longest_test)
print(max_words)

2494


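Even though the longest review is 2494 words long, we can safely limit the length of our reviews to 500 words, as most of them are well below that. Reviews shorter than 500 words are zero-padded at the end (`padding = "post"`). Longer reviews are truncated to 500 words; since `pad_sequences` truncates from the start by default, they keep their last 500 words.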

In [8]:
train_data = keras.preprocessing.sequence.pad_sequences(train_data, value = word_index["<PAD>"], padding = "post", maxlen = 500)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, value = word_index["<PAD>"], padding = "post", maxlen = 500)

print(len(train_data[0]), len(test_data[0]))

500 500
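To make the padding and truncation behaviour concrete, here is a small pure-Python imitation of what the call above does to a single review. The helper `pad_post` is hypothetical and not part of Keras; it mirrors `padding = "post"` together with the `pad_sequences` default `truncating = "pre"`.

```python
def pad_post(sequence, maxlen, value = 0):
    """Hypothetical stand-in for pad_sequences applied to one review."""
    trimmed = sequence[-maxlen:]                          # truncating = "pre": keep the last maxlen tokens
    return trimmed + [value] * (maxlen - len(trimmed))    # padding = "post": pad on the right

short_review = [1, 14, 22, 16]
long_review = [1, 14, 22, 16, 43, 530, 973]

print(pad_post(short_review, maxlen = 6))    # [1, 14, 22, 16, 0, 0]
print(pad_post(long_review, maxlen = 6))     # [14, 22, 16, 43, 530, 973]
```

Note that the truncated review loses its leading `<START>` id. Passing `truncating = "post"` to `pad_sequences` would keep the beginning of long reviews instead.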


We use a Sequential model. The Embedding layer maps each word id to a 16-dimensional vector that is learned during training, so that words used in similar contexts, such as "fantastic" and "awesome", end up close together in vector space. The GlobalAveragePooling1D layer averages these vectors over the sequence dimension, reducing each review to a single fixed-length vector and keeping the model computationally cheap. The last two layers are fully connected (Dense) layers. The output layer is a single neuron with a sigmoid activation, which yields a value between 0 and 1: values near 0 indicate a negative review and values near 1 a positive one.
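As a rough illustration of what these layers compute, the following NumPy sketch pushes one toy review through randomly initialised weights. The sizes (a 10-word vocabulary, 4-dimensional embeddings, 3 hidden units) and all variable names are assumptions chosen for readability; this is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden = 10, 4, 3    # toy sizes; the notebook uses 88000 and 16

embedding = rng.normal(size = (vocab_size, embed_dim))    # one row (vector) per word id
w1, b1 = rng.normal(size = (embed_dim, hidden)), np.zeros(hidden)
w2, b2 = rng.normal(size = (hidden, 1)), np.zeros(1)

review = np.array([1, 5, 7, 2, 0, 0])    # word ids, zero-padded at the end

vectors = embedding[review]                          # Embedding lookup: shape (6, 4)
pooled = vectors.mean(axis = 0)                      # average over the sequence: shape (4,)
hidden_out = np.maximum(0, pooled @ w1 + b1)         # Dense + ReLU: shape (3,)
score = 1 / (1 + np.exp(-(hidden_out @ w2 + b2)))    # Dense + sigmoid: shape (1,)

print(vectors.shape, pooled.shape, hidden_out.shape, score.shape)
print(float(score[0]))    # a value strictly between 0 and 1
```

The average here includes the padded positions. The same holds for the Keras model as long as the Embedding layer is created without `mask_zero = True`.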

In [9]:
model = keras.Sequential()
model.add(keras.layers.Embedding(88000, 16))    # 88000 words as input
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation = "relu"))
model.add(keras.layers.Dense(1, activation = "sigmoid"))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 16)          1408000   
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 1,408,289
Trainable params: 1,408,289
Non-trainable params: 0
_________________________________________________________________
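The parameter counts in the summary can be reproduced by hand from the layer sizes used above:

```python
vocab_size, embed_dim, hidden_units = 88000, 16, 16

embedding_params = vocab_size * embed_dim                 # one 16-dimensional vector per word id
dense_params = embed_dim * hidden_units + hidden_units    # weights + biases
output_params = hidden_units * 1 + 1                      # weights + bias

total = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total)    # 1408000 272 17 1408289
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the size of the model.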


We compile the model with the following parameters. The loss is "binary_crossentropy", as the expected output of our model is either 0 or 1, that is, negative or positive.

In [10]:
model.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"])
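For a single example with label y and predicted probability p, binary cross-entropy is -(y log p + (1 - y) log(1 - p)). The sketch below evaluates this formula in plain Python to show how the loss grows as a prediction moves away from the true label. The function is an illustration only; Keras averages the per-example values over the batch.

```python
import math

def binary_crossentropy(y_true, p):
    """Loss for one example: y_true is 0 or 1, p is the predicted probability of class 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

print(round(binary_crossentropy(1, 0.9), 4))    # 0.1054  confident and correct
print(round(binary_crossentropy(1, 0.5), 4))    # 0.6931  undecided
print(round(binary_crossentropy(1, 0.1), 4))    # 2.3026  confident and wrong
```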

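Here the model is fit to the training data. The `validation_split = 0.125` argument sets aside the last 12.5 % of the training data as a validation set, which is monitored after every epoch but not used for weight updates. Finally, the trained model is evaluated on the test data; `model.evaluate` returns the loss followed by the accuracy.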

In [11]:
model.fit(train_data, train_labels, epochs = 10, batch_size = 256, validation_split = 0.125, verbose = 1)
model.evaluate(test_data, test_labels)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[0.2526327669620514, 0.902999997138977]

Sample prediction from testing data.

In [12]:
test_review = test_data[0]
predict = model.predict(np.expand_dims(test_review, axis = 0))    # add a batch axis: shape (1, 500)
print("Review:\n", decode_review(test_review))
print("Prediction:", predict[0])
print("Actual:", test_labels[0])

Review:
 <START> please give this one a miss br br kristy swanson and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite lacklustre so all you madison fans give this a miss <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD
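`model.predict` expects a batch, that is, an array of shape `(number_of_reviews, 500)`. A single review taken from `test_data` has shape `(500,)`, so it needs a leading batch axis before prediction. A minimal NumPy sketch with a dummy all-zero review (the array is a placeholder, not real data):

```python
import numpy as np

review = np.zeros(500, dtype = "int32")     # one padded review, shape (500,)

batch = np.expand_dims(review, axis = 0)    # shape (1, 500): one review of 500 tokens
print(review.shape, batch.shape)            # (500,) (1, 500)
```

Slicing with `test_data[:1]` yields the same batch shape directly.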

Saving the model so that we don't have to train it again.

In [13]:
model.save("imdb_model.h5")    # any name ending with .h5
# model = keras.models.load_model("imdb_model.h5")    # loading the model, use this in any other project for testing

Function to encode a text-based review into a list of integers.

In [14]:
def review_encode(s):
	encoded = [1]    # 1 implies "<START>"

	for word in s:
		index = word_index.get(word.lower(), 2)    # 2 implies "<UNK>"
		# valid ids are 0..87999, matching Embedding(88000, 16)
		encoded.append(index if index < 88000 else 2)

	return encoded
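The three cases handled by the encoder (a known word, a word whose id lies outside the vocabulary limit, and a word missing from the index) can be checked on a toy vocabulary. `encode_tokens` is a hypothetical, parameterised variant of `review_encode` written only for this illustration:

```python
def encode_tokens(tokens, word_index, vocab_size):
    """Hypothetical, parameterised variant of review_encode for illustration."""
    encoded = [1]                                         # 1 is "<START>"
    for token in tokens:
        idx = word_index.get(token.lower(), 2)            # 2 is "<UNK>" for unseen words
        encoded.append(idx if idx < vocab_size else 2)    # ids must stay below vocab_size
    return encoded

toy_index = {"great": 4, "film": 5, "obscureword": 9}
print(encode_tokens(["Great", "film", "obscureword", "xyzzy"], toy_index, vocab_size = 8))
# [1, 4, 5, 2, 2]
```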

Evaluating our model on an [external review](https://www.imdb.com/review/rw2284594).

In [15]:
with open("sample_data/test.txt", encoding = "utf-8") as f:
	for line in f.readlines():
		nline = line.replace(",", "").replace(".", "").replace("(", "").replace(")", "").replace(":", "").replace("\"","").strip().split(" ")
		encode = review_encode(nline)
		encode = keras.preprocessing.sequence.pad_sequences([encode], value = word_index["<PAD>"], padding = "post", maxlen = 500)    # make the review 500 words long
		predict = model.predict(encode)
		print(line, "\n", encode, "\n", predict[0])
		sentiment = "Positive" if (predict[0] > 0.5) else "Negative"
		print("Sentiment:", sentiment)

The Shawshank Redemption is written and directed by Frank Darabont. It is an adaptation of the Stephen King novella Rita Hayworth and Shawshank Redemption. Starring Tim Robbins and Morgan Freeman, the film portrays the story of Andy Dufresne (Robbins), a banker who is sentenced to two life sentences at Shawshank State Prison for apparently murdering his wife and her lover. Andy finds it tough going but finds solace in the friendship he forms with fellow inmate Ellis "Red" Redding (Freeman). While things start to pick up when the warden finds Andy a prison job more befitting his talents as a banker. However, the arrival of another inmate is going to vastly change things for all of them. There was no fanfare or bunting put out for the release of the film back in 94, with a title that didn't give much inkling to anyone about what it was about, and with Columbia Pictures unsure how to market it, Shawshank Redemption barely registered at the box office. However, come Academy Award time the 
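The chained `replace` calls used for cleaning strip a fixed set of punctuation marks. One alternative, sketched here with the standard `re` module, keeps runs of letters, digits and apostrophes and drops everything else. The names `tokenize` and `label` and the exact pattern are assumptions, not part of the original notebook, and the resulting tokens may differ slightly from those produced by the code above.

```python
import re

def tokenize(line):
    """Hypothetical helper: lower-case the text and keep word-like tokens only."""
    return re.findall(r"[a-z0-9']+", line.lower())

def label(probability, threshold = 0.5):
    """Map the sigmoid output to a sentiment label."""
    return "Positive" if probability > threshold else "Negative"

print(tokenize('Ellis "Red" Redding (Freeman) didn\'t give up.'))
# ['ellis', 'red', 'redding', 'freeman', "didn't", 'give', 'up']
print(label(0.93), label(0.12))    # Positive Negative
```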

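The model rates this review, which its author scored 10/10 on IMDb, as highly positive. This agrees with the test accuracy of roughly 90 % reported by `model.evaluate` above, although a single review is only an anecdotal check.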