# **Text Classification with the IMDb-Reviews Dataset from Keras**
@author: [vatsalya-gupta](https://github.com/vatsalya-gupta)

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

imdb = keras.datasets.imdb

We restrict the vocabulary to the 10,000 most frequent words in the dataset. The 50,000 reviews are first split 80 % / 20 % into training and test sets. Later, the training set is split again into training and validation sets, giving a final train-validation-test split of 60-20-20 %.
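The split percentages can be checked with a little arithmetic. This is a minimal sketch; it assumes the standard 50,000-review IMDb dataset (25,000 + 25,000 as returned by `imdb.load_data`) and the `validation_split = 0.25` passed to `model.fit` later in this notebook. The variable names are illustrative only.

```python
# Split sizes implied by the text: 50,000 reviews in total, 10,000 held out for testing.
total_reviews = 50_000
test_size = 10_000
train_size = total_reviews - test_size            # 40,000 reviews before the validation split

validation_fraction = 0.25                        # same value as validation_split in model.fit
validation_size = int(train_size * validation_fraction)
final_train_size = train_size - validation_size

for name, size in [("train", final_train_size),
                   ("validation", validation_size),
                   ("test", test_size)]:
    print(f"{name}: {size} reviews ({size / total_reviews:.0%})")
```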

In [2]:
(train_data, train_targets), (test_data, test_targets) = imdb.load_data(num_words = 10000)
data = np.concatenate((train_data, test_data), axis=0)
targets = np.concatenate((train_targets, test_targets), axis=0)

test_data = data[:10000]
test_labels = targets[:10000]
train_data = data[10000:]
train_labels = targets[10000:]

print(train_data[0])

[1, 13, 104, 14, 9, 31, 7, 4, 4343, 7, 4, 3776, 3394, 2, 495, 103, 141, 87, 2048, 17, 76, 2, 44, 164, 525, 13, 197, 14, 16, 338, 4, 177, 16, 6118, 5253, 2, 2, 2, 21, 61, 1126, 2, 16, 15, 36, 4621, 19, 4, 2, 157, 5, 605, 46, 49, 7, 4, 297, 8, 276, 11, 4, 621, 837, 844, 10, 10, 25, 43, 92, 81, 2282, 5, 95, 947, 19, 4, 297, 806, 21, 15, 9, 43, 355, 13, 119, 49, 3636, 6951, 43, 40, 4, 375, 415, 21, 2, 92, 947, 19, 4, 2282, 1771, 14, 5, 106, 2, 1151, 48, 25, 181, 8, 67, 6, 530, 9089, 1253, 7, 4, 2]


Our training and testing data is an array of reviews, where each review is a list of integers and each integer represents a unique word. To make the reviews human readable, we shift every word index by 3 to make room for the special tags below, build the reverse mapping from integers to words, and implement a function that converts a list of integers back into text.

In [3]:
word_index = imdb.get_word_index()
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2    # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

In [4]:
def decode_review(text):
	return " ".join([reverse_word_index.get(i, "?") for i in text])

In [5]:
print(decode_review(train_data[0]))

<START> i think this is one of the weakest of the kenneth branagh <UNK> works after such great efforts as much <UNK> about nothing etc i thought this was poor the cast was weaker alicia <UNK> <UNK> <UNK> but my biggest <UNK> was that they messed with the <UNK> work and cut out some of the play to put in the musical dance sequences br br you just don't do shakespeare and then mess with the play sorry but that is just wrong i love some cole porter just like the next person but <UNK> don't mess with the shakespeare skip this and watch <UNK> books if you want to see a brilliant shakespearean adaptation of the <UNK>


In [6]:
print(len(train_data[0]), len(test_data[0]))

118 218


The following block of code finds the length of the longest review in our dataset.

In [7]:
longest_train = max(len(l) for l in train_data)
longest_test = max(len(l) for l in test_data)

max_words = max(longest_train, longest_test)
print(max_words)

2494


Even though the longest review is 2494 words long, we can safely limit the length of our reviews to 500 words, as most of them are well below that. Reviews shorter than 500 words are zero-padded at the end. Longer reviews are truncated, and because `pad_sequences` truncates from the front by default, they keep their last 500 words.

In [8]:
train_data = keras.preprocessing.sequence.pad_sequences(train_data, value = word_index["<PAD>"], padding = "post", maxlen = 500)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, value = word_index["<PAD>"], padding = "post", maxlen = 500)

print(len(train_data[0]), len(test_data[0]))

500 500
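To make the padding and truncation behaviour concrete, here is a small pure-Python sketch that mimics what `pad_sequences` does with `padding = "post"` and its default `truncating = "pre"`. The helper `pad_post` is hypothetical and is not part of Keras; it only illustrates the rule on toy sequences.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Mimic pad_sequences(padding="post") with the default truncating="pre"."""
    if len(sequence) > maxlen:
        return list(sequence[-maxlen:])           # too long: keep the LAST maxlen tokens
    return list(sequence) + [pad_value] * (maxlen - len(sequence))   # too short: pad the end

short_review = [1, 13, 104]
long_review = [1, 13, 104, 14, 9, 31, 7]

print(pad_post(short_review, maxlen=5))   # [1, 13, 104, 0, 0]
print(pad_post(long_review, maxlen=5))    # [104, 14, 9, 31, 7]
```

Note that a truncated review loses its `<START>` token and opening words, which is acceptable here because the model averages over all word vectors anyway.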


We are using a Sequential model. The Embedding layer maps each word index to a learned 64-dimensional vector, so that words used in similar contexts (such as "fantastic" and "awesome") end up close together in vector space. The GlobalAveragePooling1D layer averages these vectors over all words in the review, reducing each review to a single 64-dimensional vector. A Dropout layer is added to reduce overfitting. The last two layers in our network are fully connected (Dense) layers. The output layer is a single neuron with a sigmoid activation, which yields a value between 0 and 1: values near 0 indicate a negative review and values near 1 a positive one.
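As a rough illustration of the pooling and output steps, the NumPy sketch below averages a toy batch of embedded reviews over the word axis and applies a sigmoid. The shapes and random values are assumptions chosen for readability; the real model uses 500 words and 64 dimensions with learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy batch: 2 reviews, 5 words each, 4-dimensional embeddings.
embedded = rng.normal(size=(2, 5, 4))

# Global average pooling: mean over the word axis, one vector per review.
pooled = embedded.mean(axis=1)

def sigmoid(x):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(embedded.shape, "->", pooled.shape)     # (2, 5, 4) -> (2, 4)
print(sigmoid(np.array([-4.0, 0.0, 4.0])))    # strongly negative, undecided, strongly positive
```

Since the Embedding layer in this notebook is created without masking, the `<PAD>` positions are part of this average as well.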

In [9]:
model = keras.Sequential()
model.add(keras.layers.Embedding(10000, 64))    # 10000 words as input
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(64, activation = "relu"))
model.add(keras.layers.Dense(1, activation = "sigmoid"))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          640000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 64)                0         
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 644,225
Trainable params: 644,225
Non-trainable params: 0
_________________________________________________________________
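The parameter counts in the summary can be reproduced by hand. The sketch below uses the layer sizes from the model definition; the variable names are illustrative.

```python
vocab_size, embedding_dim, hidden_units = 10_000, 64, 64

embedding_params = vocab_size * embedding_dim                  # one 64-d vector per word
dense_params = embedding_dim * hidden_units + hidden_units     # weights plus biases
output_params = hidden_units * 1 + 1                           # 64 weights plus 1 bias

total_params = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total_params)
# 640000 4160 65 644225
```

The pooling and dropout layers have no trainable weights, which is why they show 0 parameters.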


We compile the model with the following parameters. The loss is "binary_crossentropy" because the expected output of our model is either 0 or 1, that is, negative or positive.

In [10]:
model.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"])
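For intuition about the loss, here is a minimal NumPy sketch of binary cross-entropy. It is not the Keras implementation; the clipping constant `eps` and the example predictions are assumptions made for illustration.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; eps (an assumed constant) avoids log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
mostly_right = np.array([0.9, 0.1, 0.8])    # predictions close to the labels
mostly_wrong = np.array([0.1, 0.9, 0.2])    # predictions far from the labels

print(binary_crossentropy(y_true, mostly_right))   # small loss
print(binary_crossentropy(y_true, mostly_wrong))   # much larger loss
```

The loss grows quickly when the model is confidently wrong, which is what pushes the sigmoid output toward the correct label during training.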

Here we hold out 25 % of the training data as a validation set, fit the model on the remaining training data, and evaluate the result on the test set.

In [11]:
fitModel = model.fit(train_data, train_labels, epochs = 8, batch_size = 256, validation_split = 0.25, verbose = 1)
results = model.evaluate(test_data, test_labels)

print(results)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
[0.27172037959098816, 0.8971999883651733]


Sample prediction from testing data.

In [12]:
test_review = test_data[6]
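predict = model.predict(np.array([test_review]))    # add a batch dimension: shape (1, 500)
print("Review:\n", decode_review(test_review))
print("Prediction:", predict[0])    # the batch holds a single review, so its prediction is at index 0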
print("Actual:", test_labels[6])

Review:
 <START> lavish production values and solid performances in this straightforward adaption of jane <UNK> satirical classic about the marriage game within and between the classes in <UNK> 18th century england northam and paltrow are a <UNK> mixture as friends who must pass through <UNK> and lies to discover that they love each other good humor is a <UNK> virtue which goes a long way towards explaining the <UNK> of the aged source material which has been toned down a bit in its harsh <UNK> i liked the look of the film and how shots were set up and i thought it didn't rely too much on <UNK> of head shots like most other films of the 80s and 90s do very good results <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD

Saving the model so that we don't have to train it again.

In [13]:
model.save("imdb_model.h5")    # any name ending with .h5
# model = keras.models.load_model("imdb_model.h5")    # loading the model, use this in any other project for testing

Function to encode a text-based review into a list of integers.

In [14]:
def review_encode(s):
	encoded = [1]

	for word in s:
		index = word_index.get(word.lower(), 2)    # 2 means "<UNK>"
		encoded.append(index if index < 10000 else 2)    # the Embedding layer only accepts indices 0..9999

	return encoded
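The encoding logic can be tried out without downloading the real word index. The sketch below uses a hypothetical toy vocabulary (`toy_index`) and a pretend vocabulary size of 7, so both unknown words and words beyond the vocabulary limit become `<UNK>`. All names here are illustrative and not part of the notebook.

```python
# Hypothetical toy vocabulary; the real one comes from imdb.get_word_index().
toy_index = {"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3,
             "the": 4, "film": 5, "great": 6, "shawshank": 7}
toy_reverse = {value: key for key, value in toy_index.items()}
toy_vocab_size = 7    # pretend only indices below 7 were kept during training

def toy_encode(words):
    """Encode a list of words, mapping unknown or too-rare words to <UNK>."""
    encoded = [toy_index["<START>"]]
    for word in words:
        index = toy_index.get(word.lower(), toy_index["<UNK>"])
        encoded.append(index if index < toy_vocab_size else toy_index["<UNK>"])
    return encoded

tokens = toy_encode("The Shawshank film was great".split())
print(tokens)                                      # [1, 4, 2, 5, 2, 6]
print(" ".join(toy_reverse[i] for i in tokens))    # <START> the <UNK> film <UNK> great
```

"Shawshank" is in the toy dictionary but its index is outside the kept vocabulary, while "was" is missing entirely; both end up as `<UNK>`.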

Evaluating our model on an [external review](https://www.imdb.com/review/rw2284594).

In [15]:
with open("sample_data/test.txt", encoding = "utf-8") as f:
	for line in f.readlines():
		nline = line.replace(",", "").replace(".", "").replace("(", "").replace(")", "").replace(":", "").replace("\"","").strip().split(" ")
		encode = review_encode(nline)
		encode = keras.preprocessing.sequence.pad_sequences([encode], value = word_index["<PAD>"], padding = "post", maxlen = 500)    # make the data 500 words long
		predict = model.predict(encode)
		print(line, "\n", encode, "\n", predict[0])
		sentiment = "Positive" if (predict[0] > 0.5) else "Negative"
		print("Sentiment:", sentiment)

The Shawshank Redemption is written and directed by Frank Darabont. It is an adaptation of the Stephen King novella Rita Hayworth and Shawshank Redemption. Starring Tim Robbins and Morgan Freeman, the film portrays the story of Andy Dufresne (Robbins), a banker who is sentenced to two life sentences at Shawshank State Prison for apparently murdering his wife and her lover. Andy finds it tough going but finds solace in the friendship he forms with fellow inmate Ellis "Red" Redding (Freeman). While things start to pick up when the warden finds Andy a prison job more befitting his talents as a banker. However, the arrival of another inmate is going to vastly change things for all of them. There was no fanfare or bunting put out for the release of the film back in 94, with a title that didn't give much inkling to anyone about what it was about, and with Columbia Pictures unsure how to market it, Shawshank Redemption barely registered at the box office. However, come Academy Award time the 

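The model rates this review, which scored 10/10 on IMDb, as highly positive. This agrees with the test accuracy of about 89.7 % obtained above, although a single review is only an illustration and not a measure of accuracy.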