![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, each labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
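- As a convention, "0" does not stand for a specific word. With the default `imdb.load_data()` arguments, 0 is reserved for padding, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word, so word indexes in the loaded sequences are shifted by 3 relative to their frequency rank.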
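
Because words are indexed by frequency, the filtering described above reduces to a range check on each integer. The following is a minimal sketch, assuming a toy review encoded as raw 1-based frequency ranks (no reserved-index offset); the helper `filter_by_rank` and the `-1` marker are hypothetical names chosen for illustration. `imdb.load_data` offers comparable behaviour through its `num_words` and `skip_top` arguments.

```python
def filter_by_rank(review, num_words, skip_top, oov_marker=-1):
    """Keep words ranked in (skip_top, num_words]; replace the rest with oov_marker."""
    return [
        rank if skip_top < rank <= num_words else oov_marker
        for rank in review
    ]

# Toy review given as frequency ranks (1 = most frequent word).
toy_review = [1, 25, 300, 15000, 7, 9999]

# "Only consider the top 10,000 most common words, but eliminate the top 20."
filtered = filter_by_rank(toy_review, num_words=10000, skip_top=20)
print(filtered)
```

Ranks 1 and 7 fall inside the 20 most common words and rank 15000 lies outside the top 10,000, so all three are replaced by the marker.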

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data (2 Marks)
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [1]:
from tensorflow.keras.datasets import imdb
import pandas as pd
import numpy as np
import tensorflow as tf

In [2]:
#### Add your code here ####
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)


### Pad each sentence to be of same length (2 Marks)
- Take maximum sequence length as 300

In [3]:
#### Add your code here ####
#Define maximum number of words to consider in each review
max_review_length = 300
#Pad training and test reviews
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train,
                                                        maxlen=max_review_length,
                                                        padding='pre', truncating='post')
X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test, 
                                                       maxlen=max_review_length, 
                                                       padding='pre', truncating='post')
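
To make the effect of `padding='pre'` and `truncating='post'` concrete, here is a small pure-Python sketch of the same idea on toy sequences. The helper `pad_pre_truncate_post` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
def pad_pre_truncate_post(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; cut long ones at the end."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]               # truncating='post': drop tokens from the end
        pad = [value] * (maxlen - len(seq))    # padding='pre': zeros go in front
        padded.append(pad + seq)
    return padded

toy_sequences = [[5, 6, 7], [1, 2, 3, 4, 5, 6, 7]]
padded_toy = pad_pre_truncate_post(toy_sequences, maxlen=5)
print(padded_toy)
```

The short review gains leading zeros, while the long review keeps its first five tokens. This matches the decoded training review later in the notebook, which begins with `<START>` and is cut off at the end.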

### Print shape of features & labels (2 Marks)

Number of reviews and number of words in each review

In [4]:
#### Add your code here ####
print('Number of reviews in training set: ', len(X_train))
print('Number of words in each review in training set: ', len(X_train[0]))

Number of reviews in training set:  25000
Number of words in each review in training set:  300


In [5]:
#### Add your code here ####
print('Number of reviews in test set: ', len(X_test))
print('Number of words in each review in test set: ', len(X_test[0]))

Number of reviews in test set:  25000
Number of words in each review in test set:  300


Number of labels

In [6]:
#### Add your code here ####
print('Number of labels in training set: ', len(y_train))

Number of labels in training set:  25000


In [7]:
#### Add your code here ####
print('Number of labels in test set: ', len(y_test))

Number of labels in test set:  25000


### Print value of any one feature and its label (2 Marks)

Feature value

In [8]:
#### Add your code here ####
print('Feature for 1st review in training set= ', X_train[0])

Feature for 1st review in training set=  [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    1   14
   22   16   43  530  973 1622 1385   65  458 4468   66 3941    4  173
   36  256    5   25  100   43  838  112   50  670    2    9   35  480
  284    5  150    4  172  112  167    2  336  385   39    4  172 4536
 1111   17  546   38   13  447    4  192   50   16    6  147 2025   19
   14   22    4 1920 4613  469    4   22   71   87   12   16   43  530
   38   76   15   13 1247    4   22   17  515   17   12   16  626   18
    2    5   62  386   12    8  316    8  106    5    4 2223 5244   16
  480   66 3785   33    4  130   12 

Label value

In [9]:
#### Add your code here ####
print('Label for 1st review in training set = ', y_train[0])

Label for 1st review in training set =  1


### Decode the feature value to get original sentence (2 Marks)

First, retrieve the dictionary that maps each word to its index in the IMDB dataset

In [10]:
#### Add your code here ####
word_to_id = imdb.get_word_index()

Now use the dictionary to get the original words from the encodings, for a particular sentence

In [11]:
INDEX_FROM=3
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
word_to_id["<UNUSED>"] = 3

id_to_word = {value:key for key,value in word_to_id.items()}
#Decode the review at index 7 of the training set
print(' '.join(id_to_word[idx] for idx in X_train[7]))

<START> the <UNK> tells the story of the four hamilton siblings teenager francis <UNK> <UNK> twins <UNK> joseph <UNK> <UNK> <UNK> <UNK> the <UNK> david samuel who is now the surrogate parent in charge the <UNK> move house a lot <UNK> is unsure why is unhappy with the way things are the fact that his brother's sister kidnap <UNK> murder people in the basement doesn't help relax or calm <UNK> nerves either francis <UNK> something just isn't right when he eventually finds out the truth things will never be the same again br br co written co produced directed by mitchell <UNK> phil <UNK> as the butcher brothers who's only other film director's credit so far is the april <UNK> day 2008 remake enough said this was one of the <UNK> to die <UNK> at the 2006 after dark <UNK> or whatever it's called in keeping with pretty much all the other's i've seen i thought the <UNK> was complete total utter crap i found the character's really poor very unlikable the slow moving story failed to capture my i
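
The shift by `INDEX_FROM` is needed because the loaded sequences reserve the first few integers for special tokens. A toy sketch of the same decoding step, assuming a hypothetical three-word vocabulary in place of the real `imdb.get_word_index()` dictionary:

```python
INDEX_FROM = 3

# Hypothetical raw word index (word -> frequency rank), standing in for imdb.get_word_index().
raw_word_index = {"the": 1, "and": 2, "film": 19}

# Shift every rank by INDEX_FROM, then register the reserved tokens.
toy_word_to_id = {word: rank + INDEX_FROM for word, rank in raw_word_index.items()}
toy_word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping so that integers can be turned back into words.
toy_id_to_word = {idx: word for word, idx in toy_word_to_id.items()}

encoded = [0, 0, 1, 4, 22, 2, 5]
decoded = " ".join(toy_id_to_word[idx] for idx in encoded)
print(decoded)
```

Without the offset, index 1 would decode to "the" instead of `<START>`, and every word would be shifted to a neighbour in the frequency ranking.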

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [12]:
#### Add your code here ####
if y_train[7] == 1:
  print('Positive')
else:
  print('Negative')

Negative


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer doesn't require us to one-hot encode our words; instead, each word needs a unique integer id. For the IMDB dataset we've loaded, this has already been done; if it hadn't, we could use sklearn's `LabelEncoder`.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer
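
Conceptually, an embedding layer is a trainable lookup table: each integer id selects one row of a weight matrix. The NumPy sketch below illustrates only that lookup, using small assumed sizes (10 words, 4 dimensions) and random weights in place of learned ones. It is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 10, 4           # toy sizes; the notebook uses 10000 and 100
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# Two toy "reviews", each padded to length 4 (0 is the padding id).
batch = np.array([[0, 0, 3, 7],
                  [1, 4, 4, 9]])

# Integer indexing picks one row per word id.
embedded = embedding_table[batch]
print(embedded.shape)                       # (batch, sequence length, embedding_dim)
```

Repeated ids (the two 4s in the second review) map to identical vectors, which is why no one-hot encoding is needed.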

In [13]:
#### Add your code here ####
#Initialize model

import tensorflow as tf
tf.keras.backend.clear_session()
model = tf.keras.Sequential()
#Adding Embedding Layer
model.add(tf.keras.layers.Embedding(input_dim=10000 + 1, #Vocabulary size
                                    output_dim=100, #Dimension of the dense embedding
                                    trainable=True,
                                    input_length=300) #Number of words in each review
          )
#Adding Dropout before the LSTM Layer
model.add(tf.keras.layers.Dropout(0.8))
#Adding LSTM Layer: 256 units is the size of the cell state and hidden state
model.add(tf.keras.layers.LSTM(256, return_sequences=True, recurrent_dropout=0.8, dropout=0.8))


#Adding time distributed layer
model.add(tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(100,activation='relu')))
model.add(tf.keras.layers.Dropout(0.8))

#Adding Flatten Layer
model.add(tf.keras.layers.Flatten())

#Adding Dense Layer
model.add(tf.keras.layers.Dense(1,activation='sigmoid'))



### Compile the model (2 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics
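
Binary cross-entropy averages `-(y*log(p) + (1-y)*log(1-p))` over the samples, where `y` is the 0/1 label and `p` is the predicted probability of the positive class. A minimal pure-Python sketch of that formula follows. The helper name and the clipping constant `eps` are assumptions for illustration, and the Keras loss may differ in numerical details.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)     # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confident correct predictions give a small loss, while confident wrong ones are penalised heavily, which is what pushes the sigmoid output towards the true label during training.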

In [14]:
#### Add your code here ####
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

### Print model summary (2 Marks)

In [15]:
#### Add your code here ####
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 100)          1000100   
_________________________________________________________________
dropout (Dropout)            (None, 300, 100)          0         
_________________________________________________________________
lstm (LSTM)                  (None, 300, 256)          365568    
_________________________________________________________________
time_distributed (TimeDistri (None, 300, 100)          25700     
_________________________________________________________________
dropout_1 (Dropout)          (None, 300, 100)          0         
_________________________________________________________________
flatten (Flatten)            (None, 30000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 30001     
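
The parameter counts in the summary can be reproduced by hand from the layer sizes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers with the sizes defined in the model above. The variable names are chosen here for illustration.

```python
vocab_size, embed_dim = 10000 + 1, 100
seq_len, lstm_units, td_units = 300, 256, 100

# Embedding: one vector of size embed_dim per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# TimeDistributed(Dense(100)): the same dense weights are shared across all time steps.
td_dense_params = lstm_units * td_units + td_units

# Flatten has no parameters; it only reshapes (300, 100) into 30000 features.
flatten_size = seq_len * td_units

# Final Dense(1): one weight per flattened feature plus one bias.
output_params = flatten_size * 1 + 1

total_params = embedding_params + lstm_params + td_dense_params + output_params
print(embedding_params, lstm_params, td_dense_params, output_params, total_params)
```

Under these formulas the final dense layer works out to 30,001 parameters (30,000 weights plus one bias), and the dropout and flatten layers contribute none.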

### Fit the model (2 Marks)

In [16]:
#### Add your code here ####
model.fit(X_train, y_train,batch_size=256, epochs=20 , validation_data=(X_test,y_test), verbose = True)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f2591ba6790>

### Evaluate model (2 Marks)

In [17]:
#### Add your code here ####
model.evaluate(X_test, y_test)



[0.33032694458961487, 0.8795199990272522]
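
`model.evaluate` returns the loss followed by the compiled metrics, so the two numbers above are the test loss (about 0.330) and the test accuracy (about 0.880). As a rough illustration of how such an accuracy figure arises from sigmoid outputs, the following sketch applies an assumed 0.5 threshold to toy probabilities. The helper and the data are hypothetical, not taken from the model.

```python
def accuracy_at_threshold(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == label) for pred, label in zip(preds, y_true))
    return correct / len(y_true)

toy_labels = [1, 0, 1, 1, 0]
toy_probs = [0.92, 0.10, 0.40, 0.75, 0.55]

toy_accuracy = accuracy_at_threshold(toy_labels, toy_probs)
print(toy_accuracy)
```

Here the third and fifth predictions land on the wrong side of the threshold, so three of the five toy samples are counted as correct.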

### Predict on one sample (2 Marks)

In [22]:
#### Add your code here ####
predictions = model.predict(X_test)

In [23]:
#Using a threshold of 0.5 to convert the predicted probabilities to binary labels
binary_pred = [1 if pred > 0.5 else 0 for pred in predictions.ravel()]

In [24]:
#Raw model prediction for the review in index 1 in the test set
predictions[1]

array([0.9999989], dtype=float32)

In [25]:
#Looking at the prediction for the review in index 1 in the test set
binary_pred[1]

1

In [26]:
#Looking at the label for the review in index 1 in the test set
y_test[1]

1

In [27]:
#Looking at the actual review at index 1 in the test set
print(' '.join(id_to_word[idx] for idx in X_test[1]))

<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <START> this film requires a lot of patience because it focuses on mood and character development the plot is very simple and many of the scenes take place on the same set in frances <UNK> the sandy dennis character apartment but the film builds to a disturbing climax br br the characters create an atmosphere <UNK> with sexual tension and psychological <UNK> it's very interesting that robert altman directed this considering the style and structure of his other films still the trademark altman audio style is evident here and there i think what really makes this film work is the brilliant performance by sandy dennis it's definitely one of her darker characters but she plays it so perfectly and convincingly that it's scary michael burns does a good job 

The raw prediction (about 0.99), the thresholded prediction (1), and the true label (1) all agree, so the model has correctly identified this review as positive.