# Sentiment Analysis using Many-to-few RNN 

We will construct a many-to-one RNN and show how such networks may be used for classification and prediction.
A recurrent neural network (RNN) is any network with nodes that update their state between *prediction* runs. This is in contrast to a perceptron or CNN, which only updates its state during training. An RNN is composed of __memory cells__, which hold a state between prediction runs; the most popular of these are LSTM and GRU cells. RNNs are trained by "unfolding" them to the desired input size, with the weights of each cell shared across the unfolding:

<img width=800px src="https://raw.githubusercontent.com/tipthederiver/Math-7243-2020/master/Labs/Lab%206/Lab%206%20RNN%20Unroll.PNG">
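
The unfolding idea can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a scalar state and a plain tanh cell rather than the LSTM used later; the function name `rnn_many_to_one` and the weight values are hypothetical.

```python
import math

def rnn_many_to_one(inputs, w_x, w_h, b, h0=0.0):
    """Unroll a scalar tanh RNN cell over `inputs`.

    Returns the final state (the many-to-one output) and every intermediate state.
    """
    h = h0
    states = []
    for x in inputs:
        # The same weights (w_x, w_h, b) are reused at every time step.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return h, states

# Hypothetical weights and a three-step input sequence.
final_state, states = rnn_many_to_one([1.0, 0.5, -1.0], w_x=0.8, w_h=0.5, b=0.0)
print(len(states), final_state)
```

Only `final_state` would be passed on to a classifier in a many-to-one setup; the intermediate states exist only to carry information forward through the sequence.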



We will construct a __many-to-few__ RNN, here with a single output, to classify text sentiment: given the IMDB user review dataset, the network predicts whether each review is positive or negative.

## Step 1 : Load the dataset

In [1]:
import pandas as pd    # to load dataset
import numpy as np     # for numerical operations
from nltk.corpus import stopwords   # to get collection of stopwords
from sklearn.model_selection import train_test_split       # for splitting dataset
from tensorflow.keras.preprocessing.text import Tokenizer  # to encode text to int
from tensorflow.keras.preprocessing.sequence import pad_sequences   # to do padding or truncating
from tensorflow.keras.models import Sequential     # the model
from tensorflow.keras.layers import Embedding, LSTM, Dense # layers of the architecture
from tensorflow.keras.callbacks import ModelCheckpoint   # save model
from tensorflow.keras.models import load_model   # load saved model
import re

In [2]:
data = pd.read_csv('IMDB Dataset.csv')

print(data)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


## Step 2: Data Pre-processing

In [3]:
x_data = data['review']       # Reviews/Input
y_data = data['sentiment']    # Sentiment/Output

### Pre-processing the reviews:
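The reviews in the original dataset are noisy, so we clean them by removing HTML tags, non-alphabetic characters (punctuation and numbers), and stop words, and then convert the remaining words to lower case. Because lower-casing happens after stop-word removal, capitalized stop words such as "The" and "I" are not filtered out, as the printed reviews further down show.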
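
As a rough sketch of the same cleaning steps on a single string, the function below lower-cases *before* filtering stop words, which is a variant of the notebook's order rather than a copy of it. The name `clean_review` and the tiny `STOP_WORDS` set are hypothetical; the notebook itself uses NLTK's English stop-word list.

```python
import re

# Hypothetical, tiny stop-word set for illustration only.
STOP_WORDS = {"a", "the", "i", "of", "is"}

def clean_review(text, stop_words=STOP_WORDS):
    text = re.sub(r"<.*?>", "", text)        # remove HTML tags
    text = re.sub(r"[^A-Za-z]", " ", text)   # replace non-letters with spaces
    words = text.lower().split()             # lower-case first ...
    return [w for w in words if w not in stop_words]  # ... then drop stop words

print(clean_review("A wonderful little production. <br /><br />The filming is great"))
# ['wonderful', 'little', 'production', 'filming', 'great']
```

With this order, "A" and "The" are removed because they are compared against the stop-word set in lower case.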


In [4]:
stopwords=set(stopwords.words('english'))

In [5]:
stopwords

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 ...

In [6]:
x_data = x_data.replace({'<.*?>': ''}, regex = True)    # remove html tag
x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
x_data = x_data.apply(lambda review: [w for w in review.split() if w not in stopwords ])  # remove stop words
x_data = x_data.apply(lambda review: [w.lower() for w in review])   # lower case

### Encoding Sentiments
We encode the sentiments so that 1 means positive and 0 means negative.

In [7]:
y_data = y_data.replace('positive', 1)
y_data = y_data.replace('negative', 0)

In [8]:
print('Reviews')
print(x_data, '\n')
print('Sentiment')
print(y_data)

Reviews
0        [one, reviewers, mentioned, watching, oz, epis...
1        [a, wonderful, little, production, the, filmin...
2        [i, thought, wonderful, way, spend, time, hot,...
3        [basically, family, little, boy, jake, thinks,...
4        [petter, mattei, love, time, money, visually, ...
                               ...                        
49995    [i, thought, movie, right, good, job, it, crea...
49996    [bad, plot, bad, dialogue, bad, acting, idioti...
49997    [i, catholic, taught, parochial, elementary, s...
49998    [i, going, disagree, previous, comment, side, ...
49999    [no, one, expects, star, trek, movies, high, a...
Name: review, Length: 50000, dtype: object 

Sentiment
0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64


## Step 3: Splitting the data into Train and Test

In [9]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)

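# Note: the 'Train Set' label precedes the review inputs (train, then test),
# and the 'Test Set' label precedes the sentiment labels (train, then test).
print('Train Set')
print(x_train, '\n')
print(x_test, '\n')
print('Test Set')
print(y_train, '\n')
print(y_test)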

Train Set
2672     [in, hoot, logan, lerman, plays, roy, eberhard...
27263    [the, producers, made, big, mistake, casting, ...
7401     [ahh, dull, v, shows, pilots, slammed, togethe...
39408    [this, movie, commits, i, would, call, emotion...
7752     [this, movie, stupid, i, want, back, i, paid, ...
                               ...                        
49450    [i, sort, accidentally, ended, watching, movie...
4537     [this, movie, so, stupid, i, could, bare, watc...
2760     [watching, midnight, cowboy, like, taking, mas...
27525    [everybody, wants, editor, watch, movie, it, s...
32705    [this, movie, visually, stunning, who, cares, ...
Name: review, Length: 40000, dtype: object 

35907    [just, fellow, movie, fans, get, point, film, ...
25313    [yumiko, wakana, sakai, pretty, adopted, daugh...
30861    [the, dominating, conflict, couple, fine, acto...
24752    [simply, great, movie, doubt, great, story, su...
4978     [clich, ridden, story, impending, divorce, eye...
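
The split above is random, so the rows shown will differ from run to run. scikit-learn's `train_test_split` accepts a `random_state` argument to make the split repeatable. The idea behind a seeded 80/20 split can be sketched in plain Python as follows; `shuffled_split` is a hypothetical helper, not the scikit-learn implementation.

```python
import random

def shuffled_split(items, test_size=0.2, seed=0):
    """Shuffle indices with a fixed seed, then cut off the first `test_size` share as test data."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

train, test = shuffled_split(list(range(10)), test_size=0.2, seed=42)
print(len(train), len(test))  # 8 2
```

Calling the function again with the same seed returns the same partition, which makes later accuracy figures comparable between runs.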
 

In [10]:
# collect the length (number of words) of every training review;
# despite its name, max_length is a list of lengths at this point
max_length=[]
for review in x_train:
    max_length.append(len(review))


In [11]:
max_length

[103,
 75,
 168,
 90,
 98,
 127,
 277,
 140,
 200,
 246,
 90,
 106,
 195,
 65,
 226,
 180,
 59,
 135,
 132,
 83,
 134,
 143,
 87,
 80,
 59,
 114,
 80,
 42,
 80,
 108,
 46,
 115,
 574,
 75,
 104,
 25,
 97,
 80,
 86,
 98,
 184,
 82,
 64,
 103,
 135,
 85,
 106,
 77,
 466,
 424,
 76,
 103,
 95,
 74,
 81,
 136,
 37,
 55,
 76,
 48,
 70,
 60,
 83,
 120,
 89,
 87,
 386,
 116,
 125,
 96,
 79,
 80,
 319,
 83,
 101,
 169,
 71,
 89,
 196,
 153,
 20,
 65,
 107,
 76,
 76,
 39,
 91,
 76,
 68,
 203,
 262,
 105,
 72,
 248,
 160,
 166,
 152,
 86,
 69,
 67,
 170,
 83,
 102,
 82,
 136,
 79,
 46,
 204,
 473,
 133,
 86,
 55,
 138,
 138,
 67,
 73,
 167,
 211,
 72,
 195,
 49,
 290,
 131,
 170,
 192,
 77,
 205,
 153,
 104,
 172,
 430,
 92,
 108,
 138,
 63,
 127,
 384,
 113,
 92,
 153,
 37,
 114,
 339,
 433,
 86,
 100,
 59,
 86,
 72,
 184,
 52,
 83,
 98,
 149,
 68,
 115,
 183,
 69,
 108,
 64,
 86,
 66,
 174,
 99,
 125,
 110,
 79,
 90,
 92,
 208,
 127,
 67,
 130,
 82,
 334,
 24,
 29,
 287,
 89,
 32,
 186,
 63,
 

In [12]:
# use the mean review length (rounded up), not the maximum, as the padded sequence length
max_length=int(np.ceil(np.mean(max_length)))

In [13]:
max_length

130
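
Using the mean as the sequence length means every review longer than that is truncated. The sketch below illustrates the effect on just the first ten lengths from the list printed above, so the numbers are illustrative and are not statistics for the full training set.

```python
import math

# First ten review lengths from the printed list (illustration only).
lengths = [103, 75, 168, 90, 98, 127, 277, 140, 200, 246]

cutoff = math.ceil(sum(lengths) / len(lengths))    # mean length, rounded up
truncated = sum(1 for n in lengths if n > cutoff)  # reviews that would lose words

print(cutoff)                     # 153
print(truncated / len(lengths))   # 0.4
```

A higher cutoff (for example a percentile of the length distribution) would truncate fewer reviews at the cost of more padding; which trade-off is better here is not something the source settles.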

In [15]:
# Encode reviews as integer sequences
token = Tokenizer(lower=False)  # lower=False because every word is already lower case
token.fit_on_texts(x_train)
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)
x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

Encoded X Train
 [[   49  5398  5873 ...     0     0     0]
 [    2  1042    24 ...     0     0     0]
 [16639   640  1817 ...  5956  1235     1]
 ...
 [   66  2975  2825 ...     0     0     0]
 [ 1155   395  3254 ...    14     6    23]
 [    8     3  2067 ...     0     0     0]] 

Encoded X Test
 [[  449  1536     3 ...     0     0     0]
 [66836 73462    91 ...   932   458    60]
 [    2 13835  1714 ...     0     0     0]
 ...
 [   54  1423    15 ...     0     0     0]
 [    1   293  2059 ...     0     0     0]
 [  789  1475   717 ... 13595  7040 14472]] 

Maximum review length:  130
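
What the encoding step does can be approximated in plain Python. This is a simplified sketch of the behaviour of `Tokenizer` and `pad_sequences` as used above (frequency-ranked indices starting at 1, unseen words dropped, zeros appended, extra tokens cut from the end); the helper names are hypothetical and the real Keras classes have more options.

```python
from collections import Counter

def build_word_index(token_lists):
    """Map each word to an integer; more frequent words get smaller indices."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def encode(tokens, word_index):
    """Replace words by their indices, skipping words not seen during fitting."""
    return [word_index[w] for w in tokens if w in word_index]

def pad_post(seq, maxlen):
    """Keep the first `maxlen` ids and fill the remainder with 0 at the end."""
    seq = seq[:maxlen]
    return seq + [0] * (maxlen - len(seq))

train = [["good", "movie"], ["bad", "movie", "bad"]]
word_index = build_word_index(train)
print(word_index)                                                     # {'movie': 1, 'bad': 2, 'good': 3}
print(pad_post(encode(["bad", "unknown", "movie"], word_index), 4))   # [2, 1, 0, 0]
```

Index 0 is never assigned to a word, which is why the vocabulary size passed to the embedding layer is `len(token.word_index) + 1`.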


## Step 4: Building the RNN Architecture

In [16]:
# ARCHITECTURE
EMBED_DIM = 32
LSTM_OUT = 64

model = Sequential()
model.add(Embedding(total_words, EMBED_DIM, input_length = max_length))
model.add(LSTM(LSTM_OUT))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 130, 32)           2956160   
                                                                 
 lstm (LSTM)                 (None, 64)                24832     
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 2,981,057
Trainable params: 2,981,057
Non-trainable params: 0
_________________________________________________________________
None
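
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the vocabulary size is inferred from the embedding row rather than printed anywhere in the notebook.

```python
embed_dim, lstm_units = 32, 64
vocab_size = 2956160 // embed_dim   # inferred from the embedding row: 92380

embedding_params = vocab_size * embed_dim
# 4 gates, each with (input + recurrent) weights plus one bias per unit
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
dense_params = lstm_units * 1 + 1   # one weight per LSTM unit plus a bias

print(embedding_params, lstm_params, dense_params)       # 2956160 24832 65
print(embedding_params + lstm_params + dense_params)     # 2981057
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the model size.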


## Step 5: Training the model 

Training is straightforward: we fit the model on x_train (input) and y_train (output/label) using mini-batch learning with a batch_size of 128 for 10 epochs. The checkpoint below monitors training accuracy and saves the model to LSTM.h5 whenever that value improves.

In [21]:
checkpoint = ModelCheckpoint(
    'LSTM.h5',
    monitor='accuracy',
    save_best_only=True,
    verbose=1
)


In [22]:
model.fit(x_train, y_train, batch_size = 128, epochs = 10, callbacks=[checkpoint])

Epoch 1/10
Epoch 1: accuracy improved from -inf to 0.95938, saving model to LSTM.h5
Epoch 2/10
Epoch 2: accuracy improved from 0.95938 to 0.97825, saving model to LSTM.h5
Epoch 3/10
Epoch 3: accuracy improved from 0.97825 to 0.98563, saving model to LSTM.h5
Epoch 4/10
Epoch 4: accuracy improved from 0.98563 to 0.98870, saving model to LSTM.h5
Epoch 5/10
Epoch 5: accuracy improved from 0.98870 to 0.99112, saving model to LSTM.h5
Epoch 6/10
Epoch 6: accuracy improved from 0.99112 to 0.99185, saving model to LSTM.h5
Epoch 7/10
Epoch 7: accuracy improved from 0.99185 to 0.99345, saving model to LSTM.h5
Epoch 8/10
Epoch 8: accuracy improved from 0.99345 to 0.99418, saving model to LSTM.h5
Epoch 9/10
Epoch 9: accuracy improved from 0.99418 to 0.99485, saving model to LSTM.h5
Epoch 10/10
Epoch 10: accuracy did not improve from 0.99485


<keras.callbacks.History at 0x2752bc8bbe0>
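
The log shows the effect of `save_best_only=True`: the file is rewritten only in epochs where the monitored value beats the best seen so far. A minimal sketch of that rule, using a hypothetical helper and made-up metric values:

```python
def epochs_that_save(metric_per_epoch):
    """Return the 1-based epochs in which a 'save best only' rule would write the model."""
    best = float("-inf")
    saved = []
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if value > best:   # strictly better than anything seen so far
            best = value
            saved.append(epoch)
    return saved

# Hypothetical accuracies, not taken from the run above.
print(epochs_that_save([0.90, 0.95, 0.94, 0.97]))  # [1, 2, 4]
```

Since the monitored quantity here is training accuracy, the saved model is the one that fits the training data best, which need not be the one that generalizes best. One possible alternative, not used in this notebook, is to pass `validation_split` to `model.fit` and monitor `val_accuracy` instead.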


## Step 6: Testing
To evaluate the model, we predict the sentiment for the x_test data and compare the predictions with y_test (the expected output). The accuracy is the number of correct predictions divided by the total number of test samples, which here comes to 85.96%.

In [38]:
y_pred = model.predict(x_test)
y_pred = np.round(y_pred).astype(int)

true = 0
for i, y in enumerate(y_test):
    if y == y_pred[i]:
        true += 1

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred) - true))
print('Accuracy: {}'.format(true/len(y_pred)*100))

Correct Prediction: 8596
Wrong Prediction: 1404
Accuracy: 85.96000000000001
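
Accuracy alone does not show whether the errors are mostly false positives or false negatives. The sketch below, with a hypothetical `evaluate` helper and made-up probabilities, thresholds the sigmoid outputs at 0.5 and counts each outcome. Note that `np.round` in the cell above rounds a probability of exactly 0.5 down to 0, whereas this sketch treats it as positive; the two agree everywhere else.

```python
def evaluate(probabilities, labels, threshold=0.5):
    """Threshold probabilities and return (accuracy, confusion counts)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for p, label in zip(probabilities, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and label == 1:
            counts["tp"] += 1
        elif pred == 0 and label == 0:
            counts["tn"] += 1
        elif pred == 1 and label == 0:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    accuracy = (counts["tp"] + counts["tn"]) / len(labels)
    return accuracy, counts

# Made-up values for illustration, not model output.
accuracy, counts = evaluate([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1])
print(accuracy, counts)  # 0.5 {'tp': 1, 'tn': 1, 'fp': 1, 'fn': 1}
```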


## Step 7: Load the saved model

In [40]:
loaded_model = load_model('LSTM.h5')

In [109]:
review = str(input('Movie Review: '))

Movie Review: A load of crap!! I am telling you now, please do not watch this film, it is a waste of money and a waste of time. Instead you could actually be having fun!


In [110]:
# Preprocess the input before passing it to the model
regex = re.compile(r'[^a-zA-Z\s]')
review = regex.sub('', review)
print('Cleaned: ', review)

words = review.split(' ')
filtered_text = [w for w in words if w not in stopwords]
filtered_text = ' '.join(filtered_text)
filtered_text = [filtered_text.lower()]

print('Filtered: ', filtered_text)

Cleaned:  A load of crap I am telling you now please do not watch this film it is a waste of money and a waste of time Instead you could actually be having fun
Filtered:  ['a load crap i telling please watch film waste money waste time instead could actually fun']


In [120]:
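texts = filtered_text
# Caution: this fits a fresh Tokenizer on the single review, so the integer
# indices below do not correspond to the vocabulary the model was trained on
# (that mapping lives in `token` from the encoding step).
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
tokenize_words = tokenizer.texts_to_sequences(texts)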

In [122]:
tokenize_words

[[2, 3, 4, 5, 6, 7, 8, 9, 1, 10, 1, 11, 12, 13, 14, 15]]

In [136]:
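# maxlen=max_length gives the 130-wide array printed in the next cell; note that training used padding='post'
tokenize_words = pad_sequences(tokenize_words, maxlen=max_length, padding='pre', truncating='pre')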

In [137]:

print(tokenize_words)

[[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  2  3  4  5  6  7
   8  9  1 10  1 11 12 13 14 15]]


In [138]:
result = loaded_model.predict(tokenize_words)
print(result)

[[0.04770437]]


In [139]:
if result >= 0.5:
    print('positive')
else:
    print('negative')

negative
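
For the prediction to be meaningful, a new review has to be encoded with the same word-to-index mapping and the same padding convention as the training data. A minimal sketch of that idea in plain Python follows; `encode_for_model` is a hypothetical helper, and the small `word_index` dictionary stands in for `token.word_index` from the encoding step.

```python
import re

def encode_for_model(review, word_index, max_length, stop_words):
    """Clean a raw review and encode it with an existing vocabulary, post-padded with zeros."""
    text = re.sub(r"[^a-zA-Z\s]", "", review).lower()
    words = [w for w in text.split() if w not in stop_words]
    ids = [word_index[w] for w in words if w in word_index]  # skip words unseen in training
    ids = ids[:max_length]                                   # truncate at the end, as in training
    return ids + [0] * (max_length - len(ids))               # pad at the end, as in training

# Hypothetical vocabulary and stop words for illustration.
word_index = {"movie": 1, "waste": 2, "time": 3}
print(encode_for_model("A waste of time!!", word_index, 6, {"a", "of"}))
# [2, 3, 0, 0, 0, 0]
```

In the notebook, one would presumably pass `token.word_index`, `max_length`, and the stop-word set, then wrap the result as `np.array([ids])` before calling `loaded_model.predict`. The score obtained that way may differ from the one printed above.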
