# IMDB Reviews Sentiment Classification

Before importing the required packages, install them:  
  
`pip install pandas`  
`pip install numpy`  
`pip install nltk`  
  
To install TensorFlow on CPU:  
`pip install tensorflow`  
To install TensorFlow on GPU (older releases only; since TensorFlow 2.1 the `tensorflow` package includes GPU support):  
`pip install tensorflow-gpu`

In [1]:
import pandas as pd
import numpy as np
import re
import html

from tensorflow.keras.layers import Dense, LSTM, BatchNormalization, Embedding, Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.stem import SnowballStemmer

## Load and clean the data

Read the training and test data from CSV files

In [2]:
train = pd.read_csv('datasets/train.csv')
test = pd.read_csv('datasets/test.csv')

In [3]:
train['dataset'] = "train"
test['dataset'] = "test"

Split the labeled data into a training set (the first 20,000 reviews) and a validation set (the remaining 5,000); the test set is kept separate

In [4]:
trn_y = np.eye(2)[train.labels[:20000]] # One-hot encode the labels
val_y = np.eye(2)[train.labels[20000:]] # One-hot encode the labels
trn_txt = train.text[:20000]
val_txt = train.text[20000:]
tst_txt = test.text
texts = np.hstack([trn_txt, val_txt, tst_txt]).tolist()
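The `np.eye(2)[...]` indexing above is a compact way to one-hot encode integer labels: row `k` of the identity matrix is the one-hot vector for class `k`. A minimal sketch with made-up labels (the `labels` array below is illustrative and not taken from the dataset):

```python
import numpy as np

# Hypothetical labels for illustration: 0 = negative, 1 = positive
labels = np.array([0, 1, 1, 0])

# Indexing the 2x2 identity matrix by label selects one row per label
one_hot = np.eye(2)[labels]
print(one_hot)
```

Each row has a single 1 in the column of its class, which is the target format that `categorical_crossentropy` expects.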

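Function that stems each word and cleans up leftover markup

In [5]:
# Build the regex and the stemmer once instead of on every call
RE_MULTI_SPACE = re.compile(r'  +')
STEMMER = SnowballStemmer('english')

def stem(x):
    # Stem each space-separated token (the stemmer also lowercases it)
    x = ' '.join([STEMMER.stem(word) for word in str(x).split(' ')])
    # Repair leftover HTML entities, line breaks, and escape sequences
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>','u_n').replace(' @.@ ','.').replace(
        ' @-@ ','-').replace('\\', ' \\ ')
    # Decode any remaining entities and collapse runs of spaces
    return RE_MULTI_SPACE.sub(' ', html.unescape(x))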
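To see the cleanup part in isolation, here is a small standard-library sketch that leaves out the stemmer. The helper name `clean_only` and the sample string are hypothetical; it covers only a subset of the replacements in `stem`:

```python
import html
import re

MULTI_SPACE = re.compile(r'  +')  # two or more consecutive spaces

def clean_only(text):
    """Illustrative helper (not part of the notebook): cleanup without stemming."""
    text = text.replace('<br />', '\n')   # HTML line breaks become newlines
    text = html.unescape(text)            # e.g. '&amp;' -> '&', '&#39;' -> "'"
    return MULTI_SPACE.sub(' ', text)     # collapse runs of spaces

print(clean_only("Tom &amp; Jerry<br />It&#39;s   great"))
```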

Original text

In [6]:
texts[1]

'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is some merit in this view, but it\'s also true that no one forced Hindus and Muslims in the region to mistreat each other as they did around the time of partition. It seems more likely that the British simply saw the tensions between the religions and were clever enough to exploit them to their own ends.<br /><br />The result is that there is much cruelty and inhumanity in the situation and this is very u

In [7]:
summaries = [stem(txt) for txt in texts]

Text after stemming

In [8]:
summaries[1]

'this is a extrem well-mad film. the acting, script and camera-work are all first-rate. the music is good, too, though it is most earli in the film, when thing are still relat cheery. there are no realli superstar in the cast, though sever face will be familiar. the entir cast doe an excel job with the script.\n\nbut it is hard to watch, becaus there is no good end to a situat like the one presented. it is now fashion to blame the british for set hindus and muslim against each other, and then cruelli separ them into two countries. there is some merit in this view, but it also true that no one forc hindus and muslim in the region to mistreat each other as they did around the time of partition. it seem more like that the british simpli saw the tension between the religion and were clever enough to exploit them to their own ends.\n\nthe result is that there is much cruelti and inhuman in the situat and this is veri unpleas to rememb and to see on the screen. but it is never paint as a bla

Map each word to an integer token, keeping only the most frequent words (`n_words` caps the vocabulary size), and apply the tokenizer to each dataset. For more information on text processing with TensorFlow/Keras, see:  
https://keras.io/preprocessing/text/

In [9]:
n_words = 5000
t = Tokenizer(n_words)
t.fit_on_texts(summaries)

In [10]:
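# Reuse the stemmed texts in `summaries` (ordered: train, validation, test)
n_trn, n_val = len(trn_txt), len(val_txt)
trn_seq = t.texts_to_sequences(summaries[:n_trn])
val_seq = t.texts_to_sequences(summaries[n_trn:n_trn + n_val])
tst_seq = t.texts_to_sequences(summaries[n_trn + n_val:])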
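As a rough mental model of what the tokenizer does, the sketch below ranks words by frequency and keeps only indices below `num_words`. It is a simplified approximation in plain Python, not the Keras implementation (the real `Tokenizer` also strips punctuation, among other options); `fit_word_index`, `to_sequences`, and `docs` are hypothetical names:

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

docs = ["good film good cast", "good film bad script"]  # made-up examples
word_index = fit_word_index(docs)
print(to_sequences(docs, word_index, num_words=3))
```

With `num_words=3`, only the two most frequent words survive, so rare words simply disappear from the sequences.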

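Keep at most 300 tokens per review. With its default settings, `pad_sequences` keeps the last 300 tokens of longer reviews and left-pads shorter ones with zeros.

In [11]:
max_words = 300
# pad_sequences already returns a NumPy array of shape (n_reviews, max_words)
trn_seq = pad_sequences(trn_seq, maxlen=max_words)
val_seq = pad_sequences(val_seq, maxlen=max_words)
tst_seq = pad_sequences(tst_seq, maxlen=max_words)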
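By default, `pad_sequences` pads and truncates at the start of each sequence. A plain-Python sketch of that behavior for a single sequence (`pad_pre` is a hypothetical helper, not a Keras function):

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([5, 8, 2], maxlen=5)           # shorter: zeros added in front
long_ = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)  # longer: first tokens dropped
print(short, long_)
```

Index 0 is reserved for padding, which is why the tokenizer starts numbering words at 1.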

We can inspect the second review (index 1), now converted to an array of integers

In [12]:
trn_seq[1]

array([1034,    1,  360,  165,  138,   32,  404,  298,   16,    1,  218,
         17,    7,    6,  215,    5,   56,   94,   37,    6,   58,   47,
         96,    5,    3,  786,   30,    1,   27,    7,    6,  156, 1133,
          5, 1363,    1,  714,   15,  191,    2, 4049,  486,  276,   74,
          2,  101, 1827,  100,   89,  114,   37,    6,   46, 2596,    8,
         10,  366,   17,    7,   87,  303,   11,   58,   27,  510,    2,
       4049,    8,    1, 3781,    5,  276,   74,   13,   34,  124,  192,
          1,   49,    4,    7,  110,   51,   30,   11,    1,  714,  374,
        217,    1, 1055,  209,    1, 2026,    2,   72,  915,  202,    5,
       1402,  100,    5,   64,  201, 2696,    1,  659,    6,   11,   37,
          6,   78,    2,    8,    1,  786,    2,   10,    6,   54, 3772,
          5,  385,    2,    5,   53,   19,    1,  254,   17,    7,    6,
        118, 1259,   13,    3,  316,    2, 4320,  431,   37,    6,  455,
          2,   19,  204,    2,   87,    1,  290,   

## Build a Neural Network with Keras to predict sentiment from sequences

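We represent each word as a 64-dimensional embedding vector and pass the sequence through an LSTM network, followed by a softmax layer that outputs a probability for each of the two classes. For more information, see: https://keras.io/getting-started/sequential-model-guide/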

In [13]:
model = Sequential([
        Embedding(n_words, 64, input_length = max_words, input_shape=(max_words,)),
        BatchNormalization(),
        LSTM(64, dropout=0.3, recurrent_dropout=0.3),
        BatchNormalization(),
        Dense(2, activation = 'softmax')
    ])

# `learning_rate` replaces the deprecated `lr` argument in TensorFlow 2.x
model.compile(loss = 'categorical_crossentropy', optimizer = Adam(learning_rate=.01), metrics = ['accuracy'])

In [14]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 64)           320000    
_________________________________________________________________
batch_normalization (BatchNo (None, 300, 64)           256       
_________________________________________________________________
lstm (LSTM)                  (None, 64)                33024     
_________________________________________________________________
batch_normalization_1 (Batch (None, 64)                256       
_________________________________________________________________
dense (Dense)                (None, 2)                 130       
Total params: 353,666
Trainable params: 353,410
Non-trainable params: 256
_________________________________________________________________
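The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for each layer type; the variable names are illustrative:

```python
n_words, emb_dim, lstm_units, n_classes = 5000, 64, 64, 2

embedding = n_words * emb_dim        # one 64-dim vector per token id
batch_norm_1 = 4 * emb_dim           # gamma, beta, moving mean, moving variance
# 4 gates, each with input weights, recurrent weights, and a bias
lstm = 4 * ((emb_dim + lstm_units) * lstm_units + lstm_units)
batch_norm_2 = 4 * lstm_units
dense = lstm_units * n_classes + n_classes   # weights + biases

total = embedding + batch_norm_1 + lstm + batch_norm_2 + dense
non_trainable = 2 * emb_dim + 2 * lstm_units  # moving means and variances
print(total, total - non_trainable, non_trainable)
```

The non-trainable parameters are the moving statistics of the two batch normalization layers, which are updated during training but not by gradient descent.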


In [15]:
model.fit(trn_seq,
          trn_y,
          validation_data = (val_seq, val_y),  # Keras expects a tuple here
          epochs = 3)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x25af221d0b8>

Predict the sentiment for each review in the test dataset

In [16]:
preds = model.predict(tst_seq)

**Most likely to be negative sentiment**

In [17]:
test.text.iloc[np.argmax(preds[:,0])]

"This movie is just plain dumb. Don't bother watching it; believe me, you're better off.<br /><br />Long and short of the plot: a defense attorney represents a man who murdered his son and other children. In defending him, she comes across a wooden doll of Pinnochio. She takes the doll home. Pinnochio is possessed and begins to start killing people.<br /><br />This movie moves very slowly only to have such a weak ending. The plot is very bad and the Dennis Michael Tenney's musical score is pitiful. The story, written by Kevin S. Tenney, is just pointless and evokes NO horror or fear. This is a far cry from his work on Night of the Demons and Witchboard, which are decent outings but nothing to write home about. His directing is OK, but with such a bad story no one could have made this movie any good.<br /><br />In conclusion: 2 out of 10, perhaps the blandest, most boring movie I've seen all year."

**Most likely to be positive sentiment**

In [18]:
test.text.iloc[np.argmax(preds[:,1])]

'"Tourist Trap" is a bizarre, great horror film from the \'70s. The film is about a group of young adults, Becky, Jerry, and Molly, who are traveling in a jeep through a desert area. Their two other friends, Eileen and her boyfriend Woody, are in a separate car. When a wheel goes flat, Woody takes it to a nearby gas station - and meets a grisly fate to some bizarre telekinetic mayhem and some creepy mannequins. The friends get tired of waiting for Woody and go to a local "tourist trap" mannequin/wax museum. In front of the entrance, the car randomly breaks down, and the girls find an oasis area to go swimming in, where they are approached by Mr. Slausen, who runs the roadside attraction that is now closed down. He takes them up to the old western wax museum, and the girls stay behind while he and Jerry go to fix their car. Eileen, the curious of the two, wanders to an old house nearby, where she also falls to the hands of a mysterious masked killer and a bunch of life like mannequins. 
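The two lookups above pick the row with the largest value in each column of `preds`. A small sketch with made-up probabilities (`preds_demo` is hypothetical, not model output):

```python
import numpy as np

# Hypothetical softmax outputs: column 0 = P(negative), column 1 = P(positive)
preds_demo = np.array([[0.20, 0.80],
                       [0.95, 0.05],
                       [0.40, 0.60]])

most_negative = int(np.argmax(preds_demo[:, 0]))  # row with highest P(negative)
most_positive = int(np.argmax(preds_demo[:, 1]))  # row with highest P(positive)
print(most_negative, most_positive)
```

Because each row of a softmax output sums to 1, the most negative review is also the one with the smallest positive probability.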

We can try out some of our own reviews for a sanity check.

In [19]:
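def predict_words(strings):
    # Accept a single string as well as a list of strings
    if isinstance(strings, str):
        strings = [strings]
    # Apply the same stemming, tokenizing, and padding as for the training data
    seq = pad_sequences(t.texts_to_sequences([stem(s) for s in strings]), max_words)
    pred = model.predict(seq)
    # Column 1 of the softmax output is the probability of positive sentiment
    for string, p in zip(strings, pred):
        print("%s  |  Positive Sentiment: %2.f%%" % (string, p[1]*100))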

**Baseline sentiment (empty review)**

In [20]:
predict_words('')

  |  Positive Sentiment: 53%


In [21]:
predict_words(['I love this movie! Great film','This movie is boring and terrible...'])

I love this movie! Great film  |  Positive Sentiment: 98%
This movie is boring and terrible...  |  Positive Sentiment:  3%


In [22]:
predict_words(['highly recommended','recommended','not recommended'])

highly recommended  |  Positive Sentiment: 88%
recommended  |  Positive Sentiment: 68%
not recommended  |  Positive Sentiment: 59%


In [23]:
predict_words(['good','not good','bad'])

good  |  Positive Sentiment: 61%
not good  |  Positive Sentiment: 32%
bad  |  Positive Sentiment:  9%


In [24]:
predict_words(['fast pace','slow pace','very slow pace'])

fast pace  |  Positive Sentiment: 62%
slow pace  |  Positive Sentiment: 17%
very slow pace  |  Positive Sentiment: 19%


**Create submission**

In [25]:
test['labels'] = preds[:,1]

In [26]:
test[['id','labels']].to_csv('predictions.csv', index=False)