Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, each labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
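- As a convention, "0" does not stand for a specific word; it is reserved for padding. With the default arguments of `imdb.load_data()`, "1" marks the start of a review and "2" encodes any unknown (out-of-vocabulary) word.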
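
To make the index conventions concrete, here is a minimal sketch of how a review given as frequency ranks could be turned into the integers returned by `imdb.load_data()`. It assumes the loader's default arguments (`start_char=1`, `oov_char=2`, `index_from=3`). The helper name `encode_review` and the toy ranks are hypothetical, and the filtering logic approximates the Keras loader's behaviour rather than reproducing its source.

```python
def encode_review(ranks, num_words=None, skip_top=0,
                  start_char=1, oov_char=2, index_from=3):
    """Encode frequency ranks (1 = most frequent word) as load_data-style integers."""
    # Prepend the start marker and shift every rank past the reserved ids.
    encoded = [start_char] + [rank + index_from for rank in ranks]
    if num_words is None:
        num_words = max(encoded) + 1
    # Ids below skip_top or at/above num_words become the out-of-vocabulary id.
    return [i if skip_top <= i < num_words else oov_char for i in encoded]


print(encode_review([1, 17, 50, 12000], num_words=10000, skip_top=20))
# [2, 2, 20, 53, 2]
```

In this sketch, `skip_top=20` replaces even the start marker (id 1) with 2, because the cutoff is applied to the shifted ids.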

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data (2 Marks)
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [2]:
#### Add your code here ####

import tensorflow as tf
tf.version  # this is the version submodule; tf.__version__ gives the version string

<module 'tensorflow._api.v2.version' from '/usr/local/lib/python3.7/dist-packages/tensorflow/_api/v2/version/__init__.py'>

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
from tensorflow.keras.datasets import imdb

In [5]:
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.imdb.load_data(
    num_words=10000,
    seed=142,
    skip_top=20
)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [6]:
len(y_train), len(y_test)

(25000, 25000)

In [7]:
# Concatenating the train and test set into a single array
X = np.concatenate((X_train, X_test), axis=0)
y = np.concatenate((y_train, y_test), axis=0)

In [8]:
len(X), len(y)

(50000, 50000)

### Pad each sentence to be of same length (2 Marks)
- Take maximum sequence length as 300

In [9]:
#### Add your code here ####
print(len(X[2]))
print(len(X[3]))
print(len(X[100]))

215
59
950


In [10]:
# Find the length of the longest review
n = max(len(review) for review in X)
print('max length of reviews is', n)

max length of reviews is 2494


In [12]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
X_padded = pad_sequences(
    X,
    maxlen=300,
    padding='pre',
    truncating='pre',
)
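
As an illustration of what `padding='pre'` and `truncating='pre'` mean, the following sketch applies the same rule to a single plain Python list. The helper `pad_pre` is hypothetical and is not part of Keras; it only mimics the behaviour for one sequence, with 0 as the padding value.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad or left-truncate a list of ints to exactly maxlen items."""
    if len(sequence) >= maxlen:
        # truncating='pre': drop items from the start, keep the last maxlen
        return list(sequence[len(sequence) - maxlen:])
    # padding='pre': fill the front with the padding value
    return [value] * (maxlen - len(sequence)) + list(sequence)


print(pad_pre([5, 6, 7], maxlen=5))           # [0, 0, 5, 6, 7]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

With these settings, a review longer than 300 tokens keeps only its last 300, and a shorter one gets zeros at the front.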

### Print shape of features & labels (2 Marks)

Number of reviews, number of words in each review

In [16]:
print('Number of reviews = ', len(X_padded))

Number of reviews =  50000


In [17]:
## Words in each review
index = 1
for i in X_padded:
  print('Number of words in review', index, 'is', len(i))
  index = index+1

Streaming output truncated to the last 5000 lines.
Number of words in review 45001 is 300
Number of words in review 45002 is 300
Number of words in review 45003 is 300
Number of words in review 45004 is 300
Number of words in review 45005 is 300
Number of words in review 45006 is 300
Number of words in review 45007 is 300
Number of words in review 45008 is 300
Number of words in review 45009 is 300
Number of words in review 45010 is 300
Number of words in review 45011 is 300
Number of words in review 45012 is 300
Number of words in review 45013 is 300
Number of words in review 45014 is 300
Number of words in review 45015 is 300
Number of words in review 45016 is 300
Number of words in review 45017 is 300
Number of words in review 45018 is 300
Number of words in review 45019 is 300
Number of words in review 45020 is 300
Number of words in review 45021 is 300
Number of words in review 45022 is 300
Number of words in review 45023 is 300
Number of words in review 45024 is 300
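
Since `pad_sequences` returns a 2-D NumPy array, both numbers can also be read from its `.shape` attribute without printing one line per review. The sketch below uses a small stand-in array, `X_demo`, which is hypothetical and only mirrors the 300-column layout of `X_padded`.

```python
import numpy as np

# Stand-in for the padded features: 5 reviews of 300 word ids each
X_demo = np.zeros((5, 300), dtype="int32")

n_reviews, n_words = X_demo.shape
print('Number of reviews =', n_reviews)
print('Number of words in each review =', n_words)
```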

Number of labels

In [18]:
s = set(y)
print('The unique labels are', s)
print('The number of unique labels is', len(list(s)))

The unique labels are {0, 1}
The number of unique labels is 2


In [19]:
len(y)

50000

### Print value of any one feature and its label (2 Marks)

Feature value

In [22]:
X_padded[50]

array([  23,    2,  364,  499,    2,   72,   21,    2,   52, 3834,   37,
        412,   50,   57,  824,  104,   36,   26,    2, 4834,  205,  704,
          2,    2,    2,  769,  133,    2,    2, 1081,   31, 5352, 1966,
       5445, 2274, 2478, 3365, 6508,    2, 1696,  611,  271,  145,    2,
          2,  117, 2059,    2, 5966, 7189,  121,   29,    2, 1447,    2,
         34,    2,  719,    2,    2,  339,   46,    2,    2, 1416, 7691,
         29,  505,   49, 2388,  676,   83,    2,  707,  511,    2,   36,
       2541,    2, 4969, 1428,    2,    2,    2,   50,   26, 8489,  806,
          2,    2, 1143, 1850,    2,    2,  125,    2, 4122,    2,    2,
        539,    2,    2, 3976,    2,    2,  368, 6498,   21,    2,  265,
         29, 2906,  958,    2,  921,    2,    2,   49,    2,    2,   53,
          2, 9370, 3547,   68,  290,    2, 6780,  429,   68,    2,   29,
       1590, 7772, 5308,    2,   35, 3437, 6155,    2,   21,  164,  549,
        341,    2,  170,    2,  570,   90,    2,   

Label value

In [23]:
y[50]

1

### Decode the feature value to get original sentence (2 Marks)

First, retrieve a dictionary that maps each word to its index in the IMDB dataset

In [24]:
index = imdb.get_word_index()
reverse_index = dict([(value, key) for (key, value) in index.items()])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


Now use the dictionary to get the original words from the encodings, for a particular sentence

In [25]:
decoded = " ".join( [reverse_index.get(i - 3, "#") for i in X_padded[50]] )
print(decoded) 

on # low side # me but # good folk who live there no doubt think they are # god's own country # # # storyline here # # familiar one acclaimed international musician daniel suffers health breakdown # mid career goes back # # little village # northern sweden where he # born # by # local # # help out # # church choir he turns some unlikely talent into # class act # they enter # contest held # # # there are echoes sorry # # band players # # off # models # # girls # # dancers # # full monty but # course he causes plenty # emotional # # some # # more # villagers realise their worth # revolt against their # he faces hostile husbands # an increasingly dubious # but nothing except death # going # stop him # # despite # somewhat corny story we get # know # like many # # characters who come across # people rather than caricatures despite many # them being # # # did wonder about # wife # being # # so long  sweden # one country # # world where such violence # pretty strongly # he # also # bit youn
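
The `i - 3` offset in the decoding cell undoes the shift that `load_data()` applies by default (`index_from=3`), so the reserved ids 0, 1 and 2 have no dictionary entry and fall back to `#`. A minimal sketch with a toy dictionary follows; `toy_word_index` and `decode_review` are hypothetical names, and the three words are made up for illustration.

```python
# Toy word index in the same format as imdb.get_word_index(): word -> frequency rank
toy_word_index = {"the": 1, "film": 2, "great": 3}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}


def decode_review(encoded, reverse_index, index_from=3, placeholder="#"):
    """Map encoded ids back to words; reserved ids (0, 1, 2) become the placeholder."""
    return " ".join(reverse_index.get(i - index_from, placeholder) for i in encoded)


print(decode_review([0, 0, 1, 4, 6, 2, 5], toy_reverse_index))
# # # # the great # film
```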

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [26]:
if y[50]==1:
  print('The review is positive')
elif y[50]==0:
  print('The review is negative')

The review is positive


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer doesn't require one-hot encoded words; instead, each word needs a unique integer id. This has already been done for the IMDB dataset we loaded; if it hadn't, we could use sklearn's `LabelEncoder`.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Set `return_sequences` to True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer

In [27]:
# Keep only the first 10,000 reviews and their labels
X_padded = X_padded[0:10000]
y = y[0:10000]

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_padded, y, random_state=142, test_size=0.2)

In [29]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((8000, 300), (2000, 300), (8000,), (2000,))

In [30]:
# Import everything from tensorflow.keras so it is not mixed with the standalone keras package
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten
model = Sequential()

In [31]:
model.add(Embedding(input_dim=10000, output_dim=100, input_length=300))

In [34]:
from tensorflow.keras.layers import LSTM, Dense, Dropout, TimeDistributed
model.add(LSTM(100, return_sequences=True))
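
The cells above stop after the LSTM layer, while the task list also asks for `TimeDistributed`, `Flatten` and a final `Dense` layer. Below is one possible sketch of the full stack. The name `full_model`, the use of an explicit `Input` layer instead of `input_length`, the default (linear) activation in the `TimeDistributed` Dense, and the single sigmoid output unit are all assumptions chosen to fit binary labels and binary cross-entropy; they are not prescribed by the task.

```python
import tensorflow as tf
from tensorflow.keras import layers

full_model = tf.keras.Sequential([
    tf.keras.Input(shape=(300,)),                        # padded reviews of length 300
    layers.Embedding(input_dim=10000, output_dim=100),   # vocabulary 10000, 100-d vectors
    layers.LSTM(100, return_sequences=True),             # one 100-d output per time step
    layers.TimeDistributed(layers.Dense(100)),           # same Dense applied at every step
    layers.Flatten(),                                    # (300, 100) -> 30000 features
    layers.Dense(1, activation='sigmoid'),               # assumption: one sigmoid unit
])
full_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
full_model.summary()
```

The final layer produces one probability per review, so the output shape is `(None, 1)` and lines up with the 0/1 labels.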

### Compile the model (2 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics

In [32]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

### Print model summary (2 Marks)

In [35]:
# summary() prints the table itself and returns None, so no print() is needed
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 300, 100)          1000000   
                                                                 
 lstm (LSTM)                 (None, 300, 100)          80400     
                                                                 
Total params: 1,080,400
Trainable params: 1,080,400
Non-trainable params: 0
_________________________________________________________________


### Fit the model (2 Marks)

In [37]:
model.fit(X_train, y_train, epochs = 10, batch_size=32, verbose=2)

Epoch 1/10


ValueError: ignored

### Evaluate model (2 Marks)
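
The `ValueError` raised by `model.fit` is most likely a shape mismatch: the summary shows the model ending at the LSTM layer with output shape `(None, 300, 100)`, whereas each review has a single 0/1 label. The sketch below shows fitting and evaluating once the model ends in a single sigmoid unit. It uses a scaled-down model and random data so that it runs in seconds; `toy_model`, `X_toy`, `y_toy` and all sizes are hypothetical, and the printed numbers carry no meaning.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical toy sizes so the example runs in seconds
vocab_size, seq_len, n_samples = 50, 10, 64
rng = np.random.default_rng(0)
X_toy = rng.integers(0, vocab_size, size=(n_samples, seq_len))
y_toy = rng.integers(0, 2, size=(n_samples,))

toy_model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,)),
    layers.Embedding(input_dim=vocab_size, output_dim=8),
    layers.LSTM(8, return_sequences=True),
    layers.TimeDistributed(layers.Dense(8)),
    layers.Flatten(),
    layers.Dense(1, activation='sigmoid'),   # output shape (None, 1) matches the labels
])
toy_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

toy_model.fit(X_toy, y_toy, epochs=1, batch_size=32, verbose=0)
# evaluate() returns the loss followed by each compiled metric
loss, accuracy = toy_model.evaluate(X_toy, y_toy, verbose=0)
print('loss:', loss, 'accuracy:', accuracy)
```

On the real data, the same two calls would take `X_train, y_train` for fitting and `X_test, y_test` for evaluation.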

### Predict on one sample (2 Marks)

In [None]:
#### Add your code here ####
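
For a single sample, two details matter: `model.predict` expects a batch, so one padded review of shape `(300,)` has to be reshaped to `(1, 300)`, and a sigmoid output is a probability that still has to be thresholded into a label. The sketch below shows both steps with NumPy only. `to_label`, the 0.5 threshold and the value 0.83 are hypothetical stand-ins; no trained model is involved.

```python
import numpy as np


def to_label(probabilities, threshold=0.5):
    """Map sigmoid outputs of shape (batch, 1) to 0/1 labels."""
    return (np.asarray(probabilities) >= threshold).astype(int).ravel()


sample = np.zeros(300, dtype="int32")      # stand-in for one padded review
batch = sample[np.newaxis, :]              # shape (1, 300): a batch of one
fake_probabilities = np.array([[0.83]])    # stand-in for model.predict(batch)

print(batch.shape)                         # (1, 300)
print(to_label(fake_probabilities))        # [1]
```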