![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, labeled by sentiment: positive (1) or negative (0)
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
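- As a convention, "0" does not stand for a specific word. With the default `load_data()` arguments it is reserved for padding, "1" marks the start of a review, and "2" encodes any unknown (out-of-vocabulary) word.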
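
The filtering described above can be sketched in plain Python. This is a toy re-implementation for illustration, not the Keras source; the helper name `filter_by_rank` is hypothetical, and it assumes the out-of-vocabulary marker is 2 (the `oov_char` default of `imdb.load_data()`).

```python
def filter_by_rank(sequence, num_words=None, skip_top=0, oov_char=2):
    """Toy filter: keep indices in [skip_top, num_words), replace the rest with oov_char."""
    filtered = []
    for idx in sequence:
        too_common = idx < skip_top
        too_rare = num_words is not None and idx >= num_words
        filtered.append(oov_char if (too_common or too_rare) else idx)
    return filtered


# A short made-up slice of an encoded review
review = [1, 14, 22, 16, 43, 530, 22665, 9, 35, 480]

# "Only consider the top 10,000 most common words, but eliminate the top 20"
print(filter_by_rank(review, num_words=10000, skip_top=20))
# [2, 2, 22, 2, 43, 530, 2, 2, 35, 480]
```

Because words are ranked by frequency, both filters reduce to simple integer comparisons on the index.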

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data (2 Marks)
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [198]:
from tensorflow.keras.datasets import imdb
import tensorflow_datasets as tfds
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout, Conv1D, MaxPooling1D
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import matplotlib.pyplot as plt

In [199]:
#### Add your code here ####

data = imdb.load_data()
print(data)

((array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
       list([1, 194,

In [200]:
imdb.load_data(
    path='imdb.npz', num_words=None, skip_top=0, maxlen=None, seed=113,
    start_char=1, oov_char=2, index_from=3)

((array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
         list([1, 19

In [201]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
print(x_train.shape)
print(x_test.shape)

(25000,)
(25000,)


### Pad each sentence to be of same length (2 Marks)
- Take maximum sequence length as 300
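
To make the padding step concrete, here is a minimal NumPy sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): short reviews get zeros on the left, long reviews lose their earliest tokens. The helper `pad_pre` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        trimmed = seq[-maxlen:]             # keep the last maxlen tokens
        out[row, -len(trimmed):] = trimmed  # right-align the tokens
    return out


# Two toy sequences: one shorter and one longer than maxlen
toy = [[1, 14, 22], [1, 194, 1153, 194, 8255, 78]]
padded = pad_pre(toy, maxlen=5)
print(padded)
# [[   0    0    1   14   22]
#  [ 194 1153  194 8255   78]]
```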

In [202]:
#### Add your code here ####
from tensorflow.keras.preprocessing.sequence import pad_sequences
x_train = pad_sequences(x_train, maxlen=300, value=0)  # 0 corresponds to <PAD>
x_test = pad_sequences(x_test, maxlen=300, value=0)  # 0 corresponds to <PAD>

### Print shape of features & labels (2 Marks)

Number of reviews, number of words in each review

In [203]:
#### Add your code here ####
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(25000, 300)
(25000, 300)
(25000,)
(25000,)


In [204]:
print("Maximum word index in train:")
print(max([max(sequence) for sequence in x_train]))
print("Maximum review length in train (after padding):")
print(max([len(sequence) for sequence in x_train]))

Maximum word index in train:
9999
Maximum review length in train (after padding):
300


In [205]:
import numpy as np
print("Categories:", np.unique(y_train))
print("Number of unique words:", len(np.unique(np.hstack(x_train))))
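# `data` is the raw (unpadded) result of imdb.load_data(): ((x_train, y_train), (x_test, y_test)).
# Measure each raw review, not the two outer tuples (whose length is always 2).
raw_reviews = np.concatenate((data[0][0], data[1][0]))
length = [len(review) for review in raw_reviews]
print("Average Review length:", np.mean(length))
print("Standard Deviation:", round(np.std(length)))



Categories: [0 1]
Number of unique words: 9999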


Number of labels

In [206]:
#### Add your code here ####
import numpy as np
print("Categories:", np.unique(y_train))

Categories: [0 1]


### Print value of any one feature and its label (2 Marks)

Feature value

In [208]:
#### Add your code here ####
x_train[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    1,   14,   22,   16,   43,  530,
        973, 1622, 1385,   65,  458, 4468,   66, 3941,    4,  173,   36,
        256,    5,   25,  100,   43,  838,  112,   50,  670,    2,    9,
         35,  480,  284,    5,  150,    4,  172,  112,  167,    2,  336,
        385,   39,    4,  172, 4536, 1111,   17,  546,   38,   13,  447,
          4,  192,   50,   16,    6,  147, 2025,   19,   14,   22,    4,
       1920, 4613,  469,    4,   22,   71,   87,   

Label value

In [209]:
#### Add your code here ####
y_train[0]

1

### Decode the feature value to get original sentence (2 Marks)

First, retrieve a dictionary that contains mapping of words to their index in the IMDB dataset

In [232]:
#### Add your code here ####
word2id = imdb.get_word_index()

Now use the dictionary to get the original words from the encodings, for a particular sentence

In [233]:
#### Add your code here ####
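# Invert the word -> index mapping. Note: imdb.load_data() shifts every word index by
# index_from=3 (0, 1 and 2 are reserved for padding, start and unknown), so this unshifted
# lookup returns the wrong word for each id; keying on i + 3 gives a faithful decode.
id2word = {i: word for word, i in word2id.items()}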

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [212]:
#### Add your code here ####

In [213]:

print('---review with words---')
print([id2word.get(i, ' ') for i in x_train[6]])
print('---label---')
print(y_train[6])

---review with words---
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'the', 'boiled', 'full', 'involving', 'to', 'impressive', 'boring', 'this', 'as', 'murderi

In [234]:
print('---review with words---')
print([id2word.get(i, ' ') for i in x_train[90]])
print('---label---')
print(y_train[90])

---review with words---
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'the', 'this', 'enough', 'premiere', 'and', 'better', 'executed', 'ability', 'br', 'and', 'with', 'his', 'her', 'and', 'movie', 'it', 'stick', 'politics', 'i', 'i', 'was', 'one', 'is', 'excellent', 'cut', 'this', 'and', 'only', 'natural', 'with', 'lot', 'br', 'of', 'how', 'truly', 'full', 'this', 'of', 'want', 'f', 'br', 'and', 'pop', 'and', 'off', 'that', 'however', 'of', 'here', 'br', 'realistically', 'and', 'me', 'will', 'her', 'points', 'violent', 'this', 'and', 'of', '1', 'for', 'from', 'me', 'in', 'and', 'of', 'guy', 'to', 'simple', 
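
The `load_data` signature shown earlier uses `index_from=3`, so every stored index is the word's rank plus 3, and 0, 1, 2 are reserved. A decoder that accounts for this offset could look like the sketch below. The helper `decode_review`, the placeholder tokens, and `toy_word_index` are illustrative assumptions; with the real data, the dictionary returned by `imdb.get_word_index()` would be passed instead.

```python
def decode_review(encoded, word_index, index_from=3):
    """Map encoded ids back to words, undoing the index_from offset."""
    # Ranks in word_index start at 1; stored ids are rank + index_from.
    id_to_word = {rank + index_from: word for word, rank in word_index.items()}
    # Reserved ids (placeholder names chosen for this sketch)
    id_to_word.update({0: "<PAD>", 1: "<START>", 2: "<UNK>"})
    # Skip padding so the decoded text starts at the review itself
    return " ".join(id_to_word.get(i, "<UNK>") for i in encoded if i != 0)


# Hypothetical miniature word index (word -> frequency rank)
toy_word_index = {"the": 1, "and": 2, "a": 3, "movie": 17, "great": 84}

encoded = [0, 0, 1, 4, 20, 2, 87]
print(decode_review(encoded, toy_word_index))
# <START> the movie <UNK> great
```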

### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer does not require one-hot encoded words; each word only needs a unique integer id. The IMDB dataset loaded here is already encoded this way; otherwise, a tool such as sklearn's `LabelEncoder` could assign the ids.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer
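
Conceptually, the Embedding layer is a trainable lookup table: each integer id selects one row of a `(vocabulary size, embedding dimension)` matrix. The NumPy sketch below illustrates only the shapes involved; the random table stands in for learned weights and is not how Keras initializes or trains them.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, seq_len = 10000, 100, 300

# Stand-in for the learned embedding matrix: one row per word id
embedding_table = rng.normal(size=(vocab_size, embed_dim)).astype("float32")

# A toy batch of 2 padded reviews, each a row of 300 integer ids
batch = rng.integers(0, vocab_size, size=(2, seq_len))

# Integer indexing performs the lookup for every id in the batch
embedded = embedding_table[batch]
print(embedded.shape)  # (2, 300, 100)
```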

In [214]:
# Model configuration
max_sequence_length = 300
num_distinct_words = 10000
embedding_output_dims = 100
loss_function = 'binary_crossentropy'
optimizer = 'adam'
additional_metrics = ['accuracy']
number_of_epochs = 100
verbosity_mode = True
validation_split = 0.20

In [222]:
#### Add your code here ####
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Input
model = Sequential()
model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length))
model.add(Dropout(0.90))
model.add(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1, recurrent_activation = "sigmoid"))
model.add(TimeDistributed(Dense(100)))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.summary()

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 300, 100)          1000000   
_________________________________________________________________
dropout_7 (Dropout)          (None, 300, 100)          0         
_________________________________________________________________
lstm_9 (LSTM)                (None, 300, 100)          80400     
_________________________________________________________________
time_distributed_9 (TimeDist (None, 300, 100)          10100     
_________________________________________________________________
flatten_5 (Flatten)          (None, 30000)             0         
_________________________________________________________________
dense_15 (Dense)             (None, 1)                 30001     
Total params: 1,120,501
Trainable params: 1,120,501
Non-trainable params: 0
____________________________________________
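
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The arithmetic below uses the standard formulas for Embedding, LSTM (four gates, each with input weights, recurrent weights, and a bias), and Dense layers; the variable names are chosen for this sketch.

```python
vocab_size, embed_dim, seq_len = 10000, 100, 300
lstm_units, td_units = 100, 100

# Embedding: one vector of size embed_dim per vocabulary entry
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates x (input weights + recurrent weights + bias)
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# TimeDistributed(Dense(100)): the same Dense weights are shared across all 300 steps
time_distributed_params = lstm_units * td_units + td_units

# Final Dense(1) on the flattened (300 * 100) vector, plus one bias
dense_params = seq_len * td_units + 1

total = embedding_params + lstm_params + time_distributed_params + dense_params
print(embedding_params, lstm_params, time_distributed_params, dense_params)
# 1000000 80400 10100 30001
print(total)  # 1120501
```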

### Compile the model (2 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics
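
For intuition about the loss named above, binary cross-entropy can be written in a few lines of NumPy. This is a simplified sketch (mean over samples, with clipping to avoid `log(0)`), not the Keras implementation; the function name and the `eps` value are assumptions made for this example.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype="float64")
    # Clip so that log() never sees exactly 0 or 1
    y_pred = np.clip(np.asarray(y_pred, dtype="float64"), eps, 1 - eps)
    per_sample = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(per_sample))


# A confident, correct prediction gives a small loss; a confident, wrong one a large loss
print(binary_crossentropy([1], [0.95]))
print(binary_crossentropy([1], [0.05]))
```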

In [223]:
#### Add your code here ####

# compiling the model
model.compile(
 optimizer = "adam",
 loss = "binary_crossentropy",
 metrics = ["accuracy"]
)

### Print model summary (2 Marks)

In [224]:
#### Add your code here ####
model.summary()

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 300, 100)          1000000   
_________________________________________________________________
dropout_7 (Dropout)          (None, 300, 100)          0         
_________________________________________________________________
lstm_9 (LSTM)                (None, 300, 100)          80400     
_________________________________________________________________
time_distributed_9 (TimeDist (None, 300, 100)          10100     
_________________________________________________________________
flatten_5 (Flatten)          (None, 30000)             0         
_________________________________________________________________
dense_15 (Dense)             (None, 1)                 30001     
Total params: 1,120,501
Trainable params: 1,120,501
Non-trainable params: 0
____________________________________________

### Fit the model (2 Marks)

In [225]:
import tensorflow as tf
print(x_train.shape)
print(y_train.shape)
print(type(x_train))
print(type(y_train))

x_final = tf.convert_to_tensor(x_train)
y_final = tf.convert_to_tensor(y_train)

print(x_final.shape)
print(y_final.shape)

model.fit(x_train, y_train, epochs=2, batch_size = 60)

(25000, 300)
(25000,)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
(25000, 300)
(25000,)
Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7fc199d6c240>

### Evaluate model (2 Marks)

In [220]:
#### Add your code here ####
print(x_test.shape)
print(y_test.shape)

scores = model.evaluate(x_test, y_test)
print('Test accuracy:', scores[1])

(25000, 300)
(25000,)
Test accuracy: 0.8807600140571594


### Predict on one sample (2 Marks)

In [226]:
#### Add your code here ####
y_train[90]

0

In [229]:
y_train[400]

1

In [230]:
# Combine one negative review (index 90) and one positive review (index 400) into a batch
text_bad = x_train[90]
text_good = x_train[400]
texts = (text_bad, text_good)
# Both rows are already padded to 300, so this call only stacks them into a (2, 300) array
padded_texts = pad_sequences(texts, maxlen=300, value=0)  # 0 corresponds to <PAD>


In [231]:
predictions = model.predict(padded_texts)
print(predictions)

[[0.09104243]
 [0.95401865]]


In [None]:
# Record 90 is a negative review, so its predicted probability is low (about 0.09); record 400 is a positive review, so its predicted probability is high (about 0.95).
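
The sigmoid output is a probability of the positive class, so a class label still requires a decision threshold. A small sketch, assuming the conventional cutoff of 0.5 and reusing the two prediction values printed above:

```python
import numpy as np

# Prediction values copied from the output above
predictions = np.array([[0.09104243], [0.95401865]])

# Probabilities above the (assumed) 0.5 threshold are labeled positive (1)
labels = (predictions > 0.5).astype("int32").ravel()
print(labels)  # [0 1]
```

These labels match `y_train[90]` (0) and `y_train[400]` (1), although both samples come from the training set, so this is a sanity check rather than a measure of generalization.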