Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
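- As a convention, "0" does not stand for a specific word. With the defaults of `imdb.load_data()`, 0 is reserved for padding, 1 marks the start of a review, 2 replaces any out-of-vocabulary word, and the actual word ranks are shifted up by 3.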
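Because words are stored as frequency ranks, the filter described above reduces to an integer comparison. The following is a minimal sketch on toy ranks, not the Keras implementation: `filter_by_rank` is a hypothetical helper, the placeholder value 2 mirrors the default `oov_char` of `imdb.load_data()`, and the shift applied by `index_from` is ignored for simplicity.

```python
def filter_by_rank(review, num_words=10000, skip_top=20, oov_char=2):
    """Keep ranks in (skip_top, num_words]; replace every other rank with oov_char.

    Rank 1 is the most frequent word. Toy illustration only.
    """
    return [r if skip_top < r <= num_words else oov_char for r in review]

toy_review = [3, 25, 10001, 512, 20, 21]
filtered = filter_by_rank(toy_review)
print(filtered)  # [2, 25, 2, 512, 2, 21]
```

In the notebook itself the same effect comes from the `num_words` and `skip_top` arguments of `imdb.load_data()`.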

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data (4 Marks)
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [None]:
from tensorflow.keras.datasets import imdb
(X_train_orig, y_train_orig), (X_test_orig, y_test_orig) = imdb.load_data(num_words=10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


### Pad each sentence to be of same length (4 Marks)
- Take maximum sequence length as 300

In [23]:
from tensorflow.keras.preprocessing import sequence
# Pad (or truncate) every review to exactly 300 word indexes
X_train = sequence.pad_sequences(X_train_orig, maxlen=300)
X_test = sequence.pad_sequences(X_test_orig, maxlen=300)
y_train = y_train_orig
y_test = y_test_orig
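By default `pad_sequences` pads and truncates at the front (`padding='pre'`, `truncating='pre'`), which is why the sample reviews printed later start with a run of zeros. A small pure-Python sketch of that behaviour, using a hypothetical helper `pad_pre` and assuming `maxlen` is positive:

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items, then left-pad with `value` up to `maxlen`."""
    truncated = seq[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short = pad_pre([1, 13, 64], maxlen=5)
long = pad_pre([1, 13, 64, 219, 2, 280, 220], maxlen=5)
print(short)  # [0, 0, 1, 13, 64]
print(long)   # [64, 219, 2, 280, 220]
```

Note that a long review loses its beginning, not its end, under these defaults.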

By default the data is split 50:50 between train and test. The two sets are merged here and split again as 75:25.

In [24]:
from sklearn.model_selection import train_test_split
import numpy as np
X=np.concatenate((X_train,X_test),axis=0)
Y=np.concatenate((y_train,y_test),axis=0)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
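The split above is reshuffled on every run because no seed is fixed, so the counts and accuracies reported further down will vary slightly between runs. Passing an integer `random_state` to `train_test_split` makes it repeatable. The idea can be sketched with NumPy alone on toy arrays; `shuffled_split` is a hypothetical helper, not the scikit-learn implementation:

```python
import numpy as np

def shuffled_split(X, y, test_size=0.25, seed=42):
    """Shuffle row indices with a fixed seed, then hold out the last `test_size` share."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    train_idx, test_idx = order[:-n_test], order[-n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20) % 2
X_tr, X_te, y_tr, y_te = shuffled_split(X_toy, y_toy)
print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)
```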

### Print shape of features & labels (4 Marks)

Number of reviews and number of words in each (padded) review

In [25]:
print('Total number of reviews in Train Data is :',len(X_train))
length = [len(i) for i in X_train]
print('Average words in each review in Train Data is :',np.mean(length))
print('Shape of Train Data:',X_train.shape)
print("Total Unique Words in Train Data:", len(np.unique(np.hstack(X_train))))

Total number of reviews in Train Data is : 37500
Average words in each review in Train Data is : 300.0
Shape of Train Data: (37500, 300)
Total Unique Words in Train Data: 9999


In [26]:
print('Total number of reviews in Test Data is :',len(X_test))
length = [len(i) for i in X_test]
print('Average words in each review in Test Data is :',np.mean(length))
print('Shape of Test Data:',X_test.shape)
print("Total Unique Words in Test Data:", len(np.unique(np.hstack(X_test))))

Total number of reviews in Test Data is : 12500
Average words in each review in Test Data is : 300.0
Shape of Test Data: (12500, 300)
Total Unique Words in Test Data: 9996


Number of labels

In [27]:
import numpy as np
print('Total number of Labels in Train Data is :',len(y_train))
print('Unique Labels in Train Data are :',np.unique(y_train))

Total number of Labels in Train Data is : 37500
Unique Labels in Train Data are : [0 1]


In [28]:
print('Total number of Labels in Test Data is :',len(y_test))
print('Unique Labels in Test Data are :',np.unique(y_test))

Total number of Labels in Test Data is : 12500
Unique Labels in Test Data are : [0 1]


### Print value of any one feature and its label (4 Marks)

Train Feature value

In [29]:
print('Sample Train Feature Value: ',X_train[5])
if y_train[5]:
  review='Positive'
else:
  review='Negative'
print('Label for the sample train data is: ',review)

Sample Train Feature Value:  [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    1   13   64  219    2  280  220 1088  153  596   21   12
   47 2368   72  237   10   10   31  720  572 2677   11  330    5  100
   64   28   77 3357   34    6   87  167   17  488    2 4064    9   10
   10    4 2008    4 3324    7  479    5    6 4887 1272    5 7790   32
 2866   23  711    2 5202   35  511    7    2   47    2    4    2    2
   27   87 5144    8    2   29    9  579   29  215 3552   27  577    2
    8 8393   11  661    8 3235    4 4239   18    4    2    7    4 3207
 5159   29 2721    2   21  266  187    5 3351   27  322    2    8  721
   68  577    8    4 3888 1250   11  661    8 2275    4  833    7   32
 5197    2   10   10    2    5    2 3854  169   46   44    4 3552    5
 3980    8    4 6183   18 7525    5 4546 2086    4 2606    2 2147   15
   27  403   47   77  343   11   14    2    2   

Test Data Sample

In [30]:
print('Sample Test Feature Value: ',X_test[5])
if y_test[5]:
  review='Positive'
else:
  review='Negative'
print('Label for the sample test data is: ',review)

Sample Test Feature Value:  [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    1   92 1632    8    4 5328  425 1985
  255    4 6645   26   73  573   18 1489   35 4268   23  383    5 1006
  120  779 2629   11   68  189  108   21   14   31  133    9   43    6
  227   99   76 1985  255  186    8   28    6 2928  383  136    2  125
   19    4  425  109  170  932    5 4653  880   41 1943  253    4   86
  171  211   21    6 3139  234    7   14  461   55 7654  946    2   24
   60    6 1985   21    6    2 2284  625    2   16    2   18    2    4
   22  191   60 1197   94 1163   19    4   86  747  234    6 2763  112
 7063 2467  189   13  197   13   16   11   18    6 1157  356  103  134
    8    6  247  338  109 2078    7    4  668  112 3727 5233    5 3810
    8    6 1060  708   33    4  130    4  167 5692   14    9  448   23
    6  283   65 1243   32  208   10   10    8   3

### Decode the feature value to get original sentence (4 Marks)

First, retrieve a dictionary that contains mapping of words to their index in the IMDB dataset

Sample Train Data

In [31]:
# word -> frequency rank, as shipped with the dataset
index = imdb.get_word_index()
# Invert the mapping: frequency rank -> word
reverse_index = {value: key for key, value in index.items()}
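The ranks in the word index are not stored as-is in the encoded reviews: with the `load_data()` defaults, ids 0, 1 and 2 are reserved (padding, start marker, unknown word) and every real word is stored as its rank plus 3. That is why decoding subtracts 3. A toy sketch with an invented four-word index, standing in for the real one:

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most frequent).
toy_word_index = {"the": 1, "and": 2, "movie": 17, "great": 84}
toy_reverse = {rank: word for word, rank in toy_word_index.items()}

# Encoded review in the load_data() layout: 0 = padding, 1 = start, 2 = unknown,
# and every real word stored as rank + 3.
encoded = [0, 0, 1, 4, 87, 20, 2]

# Skip the reserved ids, undo the offset, and look up each word.
decoded = " ".join(toy_reverse.get(i - 3, "?") for i in encoded if i >= 3)
print(decoded)  # the great movie
```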

Now use the dictionary to get the original words from the encodings, for a particular sentence

In [32]:
decoded = " ".join( [reverse_index.get(i - 3, "") for i in X_train[6]] )
print('Sample Feature value of Training Data:') 
print(decoded.lstrip())

Sample Feature value of Training Data:
only a handful of the segments are engaging here a segment with a garage attendant from  is heartbreaking one with   bob  makes its point twist by twist until the final shot  things br br the problem with this movie is that only a few of the clips  paris the others are so  shot in theme tone  production that you may as well be watching the years best commercials 2006 it's really all over the place it doesn't develop over it's running time and nothing  the directors in no  successfully joins the pieces tedium sets in i'm at the one hour twenty minute point and  wood is in some dumb over commercial  vampire  it has about as much to do with paris as old ladies  in the  fantasy shows up i think first in the  brothers segment uh thanks j e for ruining another movie and then makes way too many appearances the point of being in paris is that you don't need make believe crap to make your days extraordinary why  it by neighborhood if  de la madeleine is  w

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [33]:
if y_train[6]:
  review='Positive'
else:
  review='Negative'
print('Training Label Value:',review)

Training Label Value: Negative


Sample Test Data

In [34]:
decoded = " ".join( [reverse_index.get(i - 3, "") for i in X_test[6]] )
print('Sample Feature value of Test Data:') 
print(decoded.lstrip())

Sample Feature value of Test Data:
i'm tired of people judging films on their historical accuracy it's a movie people the writers and directors are supposed to put their own spin into the story there are a number of movies out there that aren't entirely accurate with the history braveheart   gangs of new york  an american legend the last of the  all fantastic films that are mildly inaccurate historically if you want to see a few great actors do what they do best then i suggest you see this film and don't worry about the accuracy of the facts just enjoy the quality of the film the storyline and one of the greatest actors of our time


In [35]:
if y_test[6]:
  review='Positive'
else:
  review='Negative'
print('Test Label Value:',review)

Test Label Value: Positive


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer does not require one-hot encoded words. Instead, each word must be given a unique integer id. For the IMDB dataset loaded here this has already been done; otherwise an encoder such as scikit-learn's `LabelEncoder` could be used.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer
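Conceptually, the Embedding layer in the list above is a trainable lookup table with one row per word id. A NumPy sketch with toy sizes; the random table is only a stand-in for the learned weights:

```python
import numpy as np

vocab_size, embed_dim = 10, 4                     # toy sizes; the notebook uses 10000 and 100
rng = np.random.default_rng(0)
table = rng.normal(size=(vocab_size, embed_dim))  # stand-in for the trainable weights

batch = np.array([[0, 3, 7], [0, 0, 9]])          # 2 padded reviews, 3 word ids each
vectors = table[batch]                            # integer indexing performs the lookup
print(vectors.shape)  # (2, 3, 4)
```

With the notebook's settings the same lookup turns a `(batch, 300)` array of ids into a `(batch, 300, 100)` tensor.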

In [74]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Flatten, TimeDistributed

vocabulary_size = 10000  # same as num_words passed to imdb.load_data()
embedding_size = 100
input_len = 300          # same as maxlen passed to pad_sequences()

model = Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=input_len))
model.add(LSTM(100, return_sequences=True, activation='elu'))
model.add(TimeDistributed(Dense(100, activation='softsign')))
model.add(Flatten())
# Single sigmoid unit: output is the probability of a positive review
model.add(Dense(1, activation='sigmoid'))



### Compile the model (4 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics

In [75]:
# The string 'adam' selects the Adam optimizer with its default settings
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
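Binary cross-entropy penalises confident wrong probabilities far more than confident right ones, which is what makes it a natural fit for a single sigmoid output. A pure-Python sketch of the formula; the helper and the clipping constant `eps` are illustrative assumptions, not the Keras code:

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right < confident_wrong)  # True
```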

### Print model summary (4 Marks)

In [76]:
print(model.summary())

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 300, 100)          1000000   
_________________________________________________________________
lstm_7 (LSTM)                (None, 300, 100)          80400     
_________________________________________________________________
time_distributed_7 (TimeDist (None, 300, 100)          10100     
_________________________________________________________________
flatten_7 (Flatten)          (None, 30000)             0         
_________________________________________________________________
dense_15 (Dense)             (None, 1)                 30001     
Total params: 1,120,501
Trainable params: 1,120,501
Non-trainable params: 0
_________________________________________________________________
None
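The parameter counts in the summary above can be checked by hand from the layer sizes. A short sketch of the arithmetic:

```python
vocab_size, embed_dim, seq_len, lstm_units, td_units = 10000, 100, 300, 100, 100

# Embedding: one vector of size embed_dim per word id
embedding = vocab_size * embed_dim
# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
# TimeDistributed(Dense): one Dense layer whose weights are shared across all time steps
time_dist = lstm_units * td_units + td_units
# Final Dense: 300 * 100 flattened inputs plus one bias
dense_out = seq_len * td_units * 1 + 1

total = embedding + lstm + time_dist + dense_out
print(embedding, lstm, time_dist, dense_out, total)
# 1000000 80400 10100 30001 1120501
```

Most of the weights sit in the embedding table, while the `Flatten` layer contributes none.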


### Fit the model (4 Marks)

In [77]:
from tensorflow.keras.callbacks import EarlyStopping
# patience is left at its default of 0, so training can stop after very few epochs
es = EarlyStopping(monitor='val_accuracy', mode='max')
model.fit(X_train, y_train, validation_split=0.05, epochs=15, batch_size=64, callbacks=[es])

Epoch 1/15
Epoch 2/15


<tensorflow.python.keras.callbacks.History at 0x7f4c83ac9dd8>
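Training ended after 2 of the 15 requested epochs because `EarlyStopping` was created with its default `patience` of 0. The following is a simplified model of patience-based stopping, not the Keras source (whose exact bookkeeping differs between versions). The validation scores are invented for illustration only.

```python
def epochs_run(val_scores, patience=0):
    """Return how many epochs run before a patience-based stop (higher score is better)."""
    best, wait = float("-inf"), 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, wait = score, 0   # improvement: remember it and reset the counter
        else:
            wait += 1               # one more epoch without improvement
            if wait >= patience:
                return epoch
    return len(val_scores)

hypothetical_scores = [0.89, 0.88, 0.90, 0.90, 0.89]
print(epochs_run(hypothetical_scores, patience=0))  # 2
print(epochs_run(hypothetical_scores, patience=2))  # 5
```

A larger `patience` lets the model ride out a single noisy epoch before giving up.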

### Evaluate model (4 Marks)

In [78]:
result = model.evaluate(X_test, y_test)



In [79]:
print("Accuracy of model: {0:.2%}".format(result[1]))

Accuracy of model: 89.67%


### Predict on one sample (4 Marks)

In [80]:
# Predicted probabilities, shape (n_samples, 1)
y_pred = model.predict(X_test)
# Threshold at 0.5 into a separate array so the probabilities are not overwritten
y_pred_new = np.where(y_pred < 0.5, 0.0, 1.0)
print('Actual Output: ', y_test[3])
print('Predicted Output: ', y_pred_new[3])

Actual Output:  1
Predicted Output:  [1.]


In [82]:
from sklearn import metrics
import pandas as pd
accuracy_score_test = metrics.accuracy_score(y_test, y_pred_new)
# labels=[1, 0] puts the positive class in the first row and column
conf_metr = metrics.confusion_matrix(y_test, y_pred_new, labels=[1, 0])
df_conf_metr = pd.DataFrame(conf_metr, index=["Actual 1", "Actual 0"], columns=["Predict 1", "Predict 0"])

In [83]:
print("Confusion Matrix :")
print(df_conf_metr)

Confusion Matrix :
          Predict 1  Predict 0
Actual 1       5732        551
Actual 0        740       5477
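The four cells of the confusion matrix above are enough to recompute accuracy and to derive precision, recall and F1 for the positive class. A sketch using the printed counts:

```python
tp, fn = 5732, 551   # row "Actual 1": predicted 1, predicted 0
fp, tn = 740, 5477   # row "Actual 0": predicted 1, predicted 0

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)   # share of predicted positives that are truly positive
recall = tp / (tp + fn)      # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The accuracy obtained this way agrees with the 89.67% reported by `model.evaluate`.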


# ***Model Summary***

The LSTM model for IMDB sentiment classification reaches a test accuracy of 89.67%. Its training accuracy is 92.67% and its validation accuracy is 89.44%.

The model is a Sequential model that starts with an Embedding layer (vocabulary size 10000, embedding size 100, input length 300).
This is followed by an LSTM layer, a TimeDistributed Dense layer with softsign activation, a Flatten layer, and a final Dense layer with sigmoid activation.

On the test set, the model correctly predicted 5732 positive reviews and 5477 negative reviews.

It incorrectly predicted 740 negative reviews as positive and 551 positive reviews as negative.