# Algorithm for Sentiment Analysis using an RNN

### 1) First, we convert the raw text into tokens, which are integer values.

### 2) Then we convert these integer tokens into embeddings: real-valued vectors whose mapping is trained along with the neural network, so that words with similar meanings map to similar embedding vectors.

### 3) Then we feed these embedding vectors to a Recurrent Neural Network (RNN), which can take sequences of arbitrary length as input and outputs a kind of summary of what it has seen.

### 4) The output of the RNN is squashed by an activation function (sigmoid in this case).

### 5) The final output is a value between 0 and 1.

### { 0: highly negative, 1: highly positive }
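The five steps above can be sketched end to end in a few lines of NumPy. This is only an illustration under stated assumptions: the weights are random and untrained, the recurrent step is a plain tanh RNN rather than the GRU layers used later in this notebook, and all names (`E`, `W_x`, `W_h`, `w_out`, `score`) are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, embed_dim, state_dim = 20, 4, 3  # toy sizes, chosen only for illustration

# Step 2: one embedding vector (row) per integer token.
E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
# Step 3: weights of a plain tanh RNN (simpler than a GRU).
W_x = rng.normal(scale=0.1, size=(embed_dim, state_dim))
W_h = rng.normal(scale=0.1, size=(state_dim, state_dim))
# Step 4: weights of the final single-output layer.
w_out = rng.normal(scale=0.1, size=state_dim)


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def score(tokens):
    """Run a token sequence of any length through the toy network."""
    h = np.zeros(state_dim)  # the running "summary" of what has been seen so far
    for t in tokens:
        h = np.tanh(E[t] @ W_x + h @ W_h)
    return sigmoid(h @ w_out)  # step 5: a value between 0 and 1


# Step 1 is assumed to have happened already: the text is a list of integer tokens.
untrained_score = score([3, 7, 1, 12])
print(untrained_score)
```

Because the weights are untrained, the score is meaningless here; the point is only that sequences of any length are reduced to a single number between 0 and 1.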

In [1]:
#importing required Libraries

%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from scipy.spatial.distance import cdist
# from tf.keras.models import Sequential  # This does not work!
from keras.models import Sequential
from keras.layers import Dense, GRU, Embedding
from keras.optimizers import Adam
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


In [3]:
# Some of the code and explanation here is taken from https://github.com/Hvass-Labs/ :)

In [4]:
import imdb  # helper module from https://github.com/Hvass-Labs/ that downloads and loads the IMDB dataset

In [5]:
imdb.maybe_download_and_extract() #Downloading and Extracting the dataset

Data has apparently already been downloaded and unpacked.


In [6]:
x_train_text, y_train = imdb.load_data(train=True) #loading train data
x_test_text, y_test = imdb.load_data(train=False) # loading test data

In [7]:
print("Train-set size: ", len(x_train_text))
print("Test-set size:  ", len(x_test_text))

Train-set size:  25000
Test-set size:   25000


In [8]:
data_text = x_train_text + x_test_text

In [9]:
x_train_text[100] # looking at an example text 

"Not to mention easily Pierce Brosnon's best performance. Of course Greg Kinnear is always great. Really, when has he really been bad? I think this film is incredibly underrated! The use of colors in this movie is something very different in today's film world where every other movie has the Payback blue filter. I also love the way they used the song by Asia. Proving that even what was once thought of as kinda cheesy can be really cool placed correctly.<br /><br />I was making my first feature when this came out. Being that my film was a hit-man movie, I had to check out anything in the genre that was released. After seeing it, I'm sure it had some effect on me through the process. It was pretty cool when my film got on the IMDb that it would recommend this film if you liked mine. How any of the others relate I have no idea, making an even more interesting coincidence.<br /><br />http://www.imdb.com/title/tt1337580/"

In [10]:
y_train[100]

1.0

In [11]:
num_words = 10000

In [12]:
tokenizer = Tokenizer(num_words=num_words)

In [13]:
%%time
tokenizer.fit_on_texts(data_text)

CPU times: user 12.3 s, sys: 36.6 ms, total: 12.3 s
Wall time: 12.3 s


In [14]:
tokenizer.word_index
# The word index is ordered by the number of occurrences of each word in the data-set.
# These integers are called word indices or "tokens" because they uniquely
# identify each word in the vocabulary.

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'br': 7,
 'in': 8,
 'it': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 'was': 13,
 'as': 14,
 'for': 15,
 'with': 16,
 'movie': 17,
 'but': 18,
 'film': 19,
 'on': 20,
 'not': 21,
 'you': 22,
 'are': 23,
 'his': 24,
 'have': 25,
 'be': 26,
 'one': 27,
 'he': 28,
 'all': 29,
 'at': 30,
 'by': 31,
 'an': 32,
 'they': 33,
 'so': 34,
 'who': 35,
 'from': 36,
 'like': 37,
 'or': 38,
 'just': 39,
 'her': 40,
 'out': 41,
 'about': 42,
 'if': 43,
 "it's": 44,
 'has': 45,
 'there': 46,
 'some': 47,
 'what': 48,
 'good': 49,
 'when': 50,
 'more': 51,
 'very': 52,
 'up': 53,
 'no': 54,
 'time': 55,
 'my': 56,
 'even': 57,
 'would': 58,
 'she': 59,
 'which': 60,
 'only': 61,
 'really': 62,
 'see': 63,
 'story': 64,
 'their': 65,
 'had': 66,
 'can': 67,
 'me': 68,
 'well': 69,
 'were': 70,
 'than': 71,
 'much': 72,
 'we': 73,
 'bad': 74,
 'been': 75,
 'get': 76,
 'do': 77,
 'great': 78,
 'other': 79,
 'will': 80,
 'also': 81,
 'into': 82,
 'p

In [15]:
x_train_tokens = tokenizer.texts_to_sequences(x_train_text) # converting all the text in training data to tokens

In [16]:
x_train_text[1] # actual text without tokens

'DIG! is funny, fun, amusing, interesting, stylish, and very well done. Knowing that it was made on such a shoestring budget over 7 years it is amazing that such a story can be told, especially with such style and substance. If you are a music fan or documentary fan this is a must see.<br /><br />Focusing on The Brian Jonestown Masssacre and The Dandy Warhols over the years is a brilliant way to show the contrast between a decent band who meets with moderate success through perseverance and the ability to compromise and a genius megalomaniacal lead singer backed up by a varied cast of characters who sabotage their own success through drugs, alcohol, and insanity. If I did not know that this is footage of real people, I would swear it was an incredibly well written and imaginative scripted piece. The story is compelling, concise, and simply amazing.'

In [17]:
np.array(x_train_tokens[1]) # text after tokenizing

array([3574,    6,  152,  245, 1143,  218, 3138,    2,   52,   69,  221,
       1332,   12,    9,   13,   90,   20,  138,    3,  332,  121,  702,
        153,    9,    6,  491,   12,  138,    3,   64,   67,   26,  566,
        261,   16,  138,  396,    2, 2389,   43,   22,   23,    3,  207,
        326,   38,  640,  326,   11,    6,    3,  206,   63,    7,    7,
       4252,   20,    1, 1819,    2,    1, 8230,  121,    1,  153,    6,
          3,  513,   95,    5,  119,    1, 2206,  201,    3,  540, 1144,
         35,  911,   16, 1007,  140,    2,    1, 1239,    5, 9498,    2,
          3, 1209,  469, 1843, 7720,   53,   31,    3, 7201,  174,    4,
        102,   35, 9597,   65,  199, 1007,  140, 1479, 4805,    2, 5147,
         43,   10,  115,   21,  118,   12,   11,    6,  893,    4,  144,
         83,   10,   58, 3732,    9,   13,   32,  950,   69,  407,    2,
       3233, 3677,  412,    1,   64,    6, 1488,    2,  330,  491])

In [18]:
x_test_tokens = tokenizer.texts_to_sequences(x_test_text) # converting text data into tokens
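
To make the tokenization step more concrete, here is a minimal pure-Python sketch of the idea: words are ranked by frequency, the most frequent word gets index 1, and words that are unknown or fall outside the `num_words` limit are dropped from the sequence. `build_word_index` and `to_sequence` are hypothetical helpers written for this illustration, not Keras functions, and the regular expression used for splitting words is a simplifying assumption.

```python
import re
from collections import Counter

WORD_PATTERN = re.compile(r"[a-z0-9']+")  # simplified word splitting (assumption)


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_PATTERN.findall(text.lower()))
    # Index 0 is left unused so that it can serve as the padding value later.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(text, word_index, num_words=None):
    """Convert a text into a list of integer tokens."""
    tokens = []
    for word in WORD_PATTERN.findall(text.lower()):
        idx = word_index.get(word)
        if idx is None:
            continue  # word was never seen when building the index
        if num_words is not None and idx >= num_words:
            continue  # word is too rare to be inside the vocabulary limit
        tokens.append(idx)
    return tokens


toy_texts = ["the movie was good", "the movie was bad", "the plot was thin"]
toy_index = build_word_index(toy_texts)
toy_seq = to_sequence("the movie was truly good", toy_index)
toy_seq_limited = to_sequence("the movie was truly good", toy_index, num_words=4)
print(toy_index)
print(toy_seq, toy_seq_limited)
```

In this toy example "truly" is dropped because it is not in the index, and with `num_words=4` the rarer word "good" is dropped as well.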

# Padding

The Recurrent Neural Network can take sequences of arbitrary length as input, but to process a whole batch of data at once, all sequences in the batch must have the same length.
We could pad every review with zeros up to the length of the longest one, but that would waste a lot of memory. Instead, we pick a length that is sufficient for most of our data, pad the shorter reviews and truncate the longer ones.

In [19]:
num_tokens = np.array([len(tokens) for tokens in x_train_tokens + x_test_tokens])
# array holding the length of every tokenized review in both the training and the test set

In [20]:
np.mean(num_tokens) #calculating average length 

221.27716000000001

In [21]:
np.max(num_tokens) # maximum length of any tokenized review

2208

In [22]:
np.min(num_tokens)# minimum

6

## Visualizing Token Lengths

In [23]:
import plotly.plotly as py
import plotly.graph_objs as go

trace0 = go.Box(
    y=num_tokens
)
data=[trace0]
py.iplot(data)

### The box plot above shows that most reviews are shorter than 500 tokens, but there are some outliers with more than 2,000 tokens

In [24]:
max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens = int(max_tokens)
max_tokens

544

In [None]:
#The max number of tokens we will allow is set to the average plus 2 standard deviations.

We have already seen that most reviews are shorter than 500 tokens; we can verify what fraction of the reviews is shorter than max_tokens:

In [25]:
str(np.sum(num_tokens < max_tokens) / len(num_tokens) * 100) +' %' 

'94.528 %'
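
The rule used above (mean plus two standard deviations) and the coverage check can be tried on a small made-up example. The lengths below are invented for illustration and are not taken from the IMDB data; `length_cutoff` is a hypothetical helper name.

```python
import numpy as np


def length_cutoff(lengths, k=2):
    """Return int(mean + k * std) of the given sequence lengths."""
    lengths = np.asarray(lengths)
    return int(np.mean(lengths) + k * np.std(lengths))


# Nine short sequences and one long outlier (made-up numbers).
toy_lengths = np.array([10] * 9 + [100])
toy_cutoff = length_cutoff(toy_lengths)
# Fraction of sequences that fit below the cutoff without truncation.
toy_coverage = np.mean(toy_lengths < toy_cutoff)
print(toy_cutoff, toy_coverage)
```

The single outlier is far longer than the cutoff, so only that one sequence would be truncated while the other nine fit unchanged.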

## When we pad the data we need to decide where to pad: at the beginning or at the end

#### - If we pad at the end, the RNN may get confused by seeing a long run of zeros after it has processed the actual data

#### - So we pad at the beginning
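
The effect of padding and truncating at the beginning can be sketched in plain NumPy. `pad_pre` is a hypothetical helper written for this illustration; the notebook itself relies on the Keras `pad_sequences` function with `padding='pre'` and `truncating='pre'`.

```python
import numpy as np


def pad_pre(sequences, maxlen):
    """Left-pad with zeros, or drop tokens from the start, to reach maxlen."""
    out = np.zeros((len(sequences), maxlen), dtype=np.int64)
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays all zeros
        kept = seq[-maxlen:]           # pre-truncation keeps the LAST maxlen tokens
        out[row, -len(kept):] = kept   # pre-padding puts the zeros at the start
    return out


toy_padded = pad_pre([[1, 2], [3, 4, 5, 6, 7]], maxlen=4)
print(toy_padded)
```

The short sequence gets zeros in front of it, and the long one loses its first token, so in both cases the end of the review sits right at the end of the row.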

In [26]:
pad = 'pre'

In [27]:
x_train_pad = pad_sequences(x_train_tokens, maxlen=max_tokens,
                            padding=pad, truncating=pad)

In [28]:
x_test_pad = pad_sequences(x_test_tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)

In [29]:
x_train_pad.shape

(25000, 544)

In [30]:
x_test_pad.shape

(25000, 544)

In [31]:
np.array(x_train_tokens[1]) # before padding

array([3574,    6,  152,  245, 1143,  218, 3138,    2,   52,   69,  221,
       1332,   12,    9,   13,   90,   20,  138,    3,  332,  121,  702,
        153,    9,    6,  491,   12,  138,    3,   64,   67,   26,  566,
        261,   16,  138,  396,    2, 2389,   43,   22,   23,    3,  207,
        326,   38,  640,  326,   11,    6,    3,  206,   63,    7,    7,
       4252,   20,    1, 1819,    2,    1, 8230,  121,    1,  153,    6,
          3,  513,   95,    5,  119,    1, 2206,  201,    3,  540, 1144,
         35,  911,   16, 1007,  140,    2,    1, 1239,    5, 9498,    2,
          3, 1209,  469, 1843, 7720,   53,   31,    3, 7201,  174,    4,
        102,   35, 9597,   65,  199, 1007,  140, 1479, 4805,    2, 5147,
         43,   10,  115,   21,  118,   12,   11,    6,  893,    4,  144,
         83,   10,   58, 3732,    9,   13,   32,  950,   69,  407,    2,
       3233, 3677,  412,    1,   64,    6, 1488,    2,  330,  491])

In [32]:
np.array(x_train_pad[1]) # After Padding

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

In [33]:
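# x_train_pad and x_test_pad are NumPy arrays, so `+` would add them element-wise; np.concatenate stacks them instead
num_tokens_pad = np.array([len(tokens) for tokens in np.concatenate([x_train_pad, x_test_pad])])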


In [34]:
import plotly.plotly as py
import plotly.graph_objs as go

trace0 = go.Box(
    y=num_tokens_pad
)
data=[trace0]
py.iplot(data)

In [None]:
## The plot above shows that all sequences now have the same length (544)

## Alternatives to padding

### There are other options if you do not want to pad:

#### 1) Use a batch size of 1, i.e. feed the sequences into the RNN one at a time

#### 2) Group sequences of the same length into batches, e.g. all sequences of length 100 together and all sequences of length 500 together
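
The second option can be sketched as follows. `bucket_by_length` is a hypothetical helper, and the token values are made up; each resulting bucket could be turned into a batch without any padding.

```python
from collections import defaultdict


def bucket_by_length(sequences):
    """Group sequences so that every bucket holds sequences of one length."""
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq)].append(seq)
    return dict(buckets)


toy_sequences = [[4, 8], [15, 16, 23], [42, 7], [1, 2, 3], [9]]
toy_buckets = bucket_by_length(toy_sequences)
for length, batch in sorted(toy_buckets.items()):
    print(length, batch)
```

A practical drawback of this approach is that some buckets may contain very few sequences, so batch sizes become uneven.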



## Creating the Recurrent Neural Network using Keras 



In [35]:
model = Sequential()

The first layer in the RNN is an Embedding layer, which converts each integer token into a vector of values.
The tokens range from 0 to the vocabulary size (10000 in this case), and their integer values carry no meaning of their own, so each token is converted into an embedding vector. The embedding is trained so that words with similar meanings get similar vectors. Embedding vectors typically have a length of 100-300; here we use a much smaller size of 8.

For a detailed explanation of embeddings, watch Andrew Ng's video: https://www.youtube.com/watch?v=DDByc9LyMV8
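
Conceptually, an embedding layer is a lookup table: a matrix with one row per token, where looking up a token returns its row. The sketch below illustrates this with a random, untrained matrix in NumPy; the names `vocab_size`, `embed_dim` and `embedding_matrix` are chosen for this illustration, and in the real model the rows are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10000, 8  # same sizes as used in this notebook

# One row of embed_dim values per token (random here, trained in the real model).
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# A short padded token sequence; 0 is the padding token.
toy_tokens = np.array([0, 0, 3574, 6, 152])

# Integer indexing picks one row per token.
toy_vectors = embedding_matrix[toy_tokens]
print(toy_vectors.shape)
print(embedding_matrix.size)
```

The matrix holds 10000 x 8 = 80000 values, which is the number of trainable parameters an embedding layer of this size has.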

In [36]:
embedding_size = 8

In [37]:
model.add(Embedding(input_dim=num_words,
                   output_dim=embedding_size,
                   input_length=max_tokens,
                   ))

In [38]:
model.add(GRU(units=16, return_sequences=True))  # the next layer is also a GRU and expects a sequence as input, hence return_sequences=True

In [39]:
model.add(GRU(units=8, return_sequences=True))  # the next layer is also a GRU and expects a sequence as input, hence return_sequences=True

In [40]:
model.add(GRU(units=4))# now we don't need sequences because in the next layer we will predict the output

In [41]:
model.add(Dense(1, activation='sigmoid'))
# fully connected layer with a single output, since we predict either positive or negative


In [42]:
optimizer = Adam(lr=1e-3)

In [43]:
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])
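
For reference, the binary cross-entropy loss used above can be written out in NumPy. This is a sketch of the standard formula, not the Keras implementation; the clipping constant `eps` and the example scores are assumptions made for this illustration.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so that log() stays finite.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))


# Labels: first review positive (1), second review negative (0).
confident_right = binary_crossentropy([1, 0], [0.975, 0.022])
confident_wrong = binary_crossentropy([1, 0], [0.022, 0.975])
print(confident_right, confident_wrong)
```

Confident correct predictions give a loss close to 0, while confident wrong predictions are penalized heavily, which is what pushes the sigmoid output towards the correct label during training.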

In [44]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 544, 8)            80000     
_________________________________________________________________
gru_1 (GRU)                  (None, 544, 16)           1200      
_________________________________________________________________
gru_2 (GRU)                  (None, 544, 8)            600       
_________________________________________________________________
gru_3 (GRU)                  (None, 4)                 156       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 5         
Total params: 81,961
Trainable params: 81,961
Non-trainable params: 0
_________________________________________________________________
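
The parameter counts in the summary above can be reproduced by hand. The GRU formula below (three sets of input weights, recurrent weights and biases) matches the numbers printed in this summary; it is an assumption that it applies to other Keras versions, since some GRU variants use an extra bias vector and report slightly larger counts. The function names are made up for this sketch.

```python
def embedding_params(vocab_size, output_dim):
    """One vector of output_dim values per token in the vocabulary."""
    return vocab_size * output_dim


def gru_params(input_dim, units):
    """Three weight sets: update gate, reset gate and candidate state."""
    return 3 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units


layer_counts = [
    embedding_params(10000, 8),  # embedding_1
    gru_params(8, 16),           # gru_1: input is the 8-dim embedding
    gru_params(16, 8),           # gru_2: input is the 16-dim output of gru_1
    gru_params(8, 4),            # gru_3: input is the 8-dim output of gru_2
    dense_params(4, 1),          # dense_1
]
total_params = sum(layer_counts)
print(layer_counts, total_params)
```

Almost all of the parameters sit in the embedding layer; the three GRU layers and the output layer together account for fewer than 2,000.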


In [45]:
%%time
history=model.fit(x_train_pad, y_train,
          validation_split=0.05, epochs=3, batch_size=50)

Train on 23750 samples, validate on 1250 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
CPU times: user 59min 13s, sys: 18min 56s, total: 1h 18min 10s
Wall time: 20min 8s


In [46]:
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)


trace0 = go.Scatter(
    x = list(epochs),
    y = list(acc),
    mode = 'lines',
    name = 'training accuracy'
)

trace1= go.Scatter(
    x = list(epochs),
    y = list(val_acc),
    mode = 'lines',
    name = 'Validation accuracy'
)

layout = go.Layout( title='Training and Validation Accuracy')


data = [trace0,trace1]
fig = go.Figure(data=data, layout=layout)


py.iplot(fig)


In [47]:
trace0 = go.Scatter(
    x = list(epochs),
    y = list(loss),
    mode = 'lines',
    name = 'training loss'
)

trace1= go.Scatter(
    x = list(epochs),
    y = list(val_loss),
    mode = 'lines',
    name = 'Validation loss'
)

layout = go.Layout( title='Training and Validation loss')


data = [trace0,trace1]
fig = go.Figure(data=data, layout=layout)


py.iplot(fig)


In [48]:
%%time
result = model.evaluate(x_test_pad, y_test)

CPU times: user 4min 39s, sys: 1min 43s, total: 6min 22s
Wall time: 1min 25s


In [49]:
print("Accuracy: {0:.2%}".format(result[1]))

Accuracy: 84.70%


# Checking on Unseen Real Data

#### To check this, I took two reviews from IMDB:

1) A positive review of Peaky Blinders (TV series)

2) A negative review of Race 3 (Indian movie)

Positive Review URL : https://www.imdb.com/review/rw2878383/?ref_=tt_urv

Negative Review URL : https://www.imdb.com/title/tt7431594/reviews?ref_=tt_ql_3

        
        

In [140]:
positive_review='''I was not expecting it to be this good,I really enjoyed all 4 episodes. 
The story is interesting,the acting is brilliant and the cinematography is just beautiful!
I am eagerly waiting for the next episodes.When I compare Peaky Blinders to other popular TV shows that use
sex,brutality and violence to shock the audiences and get high ratings(which they actually do)this sincere work is 
like needlework;fine,classy and detailed.The makers of this drama have not chosen the easy way,they have set off to
make a first class period drama,that dares to be different.Cillian Murphy is at his best,I will even go as far as
to say that this is one of the best performances I have seen of him.Sam Neil and Helen McCrory must be praised,all
casting is perfect.Peaky Blinders sets high standards for other television dramas to follow.'''

negative_review=''' I don't know what kind of mental conditions these people are suffering from, who are rating this movie
10/10. Why couldn't they just make it simple why this whole addition of crap. Just another crappy amalgamation of
the movies which had a better script. I just don't think Salman will make any sensible movies in which he just acts
good and doesn't just say mindless dialogues.'''

text=[positive_review,negative_review]

In [141]:
tokens = tokenizer.texts_to_sequences(text) # we need to tokenize

In [142]:
tokens_pad = pad_sequences(tokens, maxlen=max_tokens,
                           padding=pad, truncating='pre')
# padding

In [143]:
tokens_pad.shape

(2, 544)

In [144]:
# predict once for both reviews; each row of the result holds the score of one review
a, b = model.predict(tokens_pad)

In [145]:
if a > 0.60: # I am thresholding it at 60%
    print('Positive Review with a score of {} %'.format(a[0]*100))
else:
    print('Negative Review ')

Positive Review with a score of 97.52593040466309 %


In [146]:
if b > 0.50: # I am thresholding it at 50%
    print('Positive Review ')
else:
    print('Negative Review with a score of {} %'.format(b[0]*100))

Negative Review with a score of 2.249847538769245 %
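
The two checks above use different thresholds (0.60 and 0.50). A small helper can make the threshold explicit and apply it uniformly. `label_review` is a hypothetical function written for this illustration, and the scores passed to it below are rounded versions of the two predictions shown above.

```python
def label_review(score, threshold=0.5):
    """Turn a sigmoid score in [0, 1] into a readable label."""
    label = "Positive" if score > threshold else "Negative"
    return "{} review (score {:.1%})".format(label, score)


print(label_review(0.975))                 # rounded score of the first review
print(label_review(0.022))                 # rounded score of the second review
print(label_review(0.55, threshold=0.6))   # a stricter threshold changes the label
```

With the notebook's model, the helper could be applied to each row of `model.predict(tokens_pad)` by passing the first element of the row as the score.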


# The model classifies both reviews correctly