## <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">Domain: Sequential NLP

### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">Problem Description:
<font color=darkblue>
Generate word embeddings and retrieve the output of each layer with Keras, based on a classification task.
<br>
Word embeddings are a type of word representation that allows words with similar meanings to have similar representations.
<br>
They are a distributed representation of text and are perhaps one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems.
<br>
We will use the IMDB dataset to learn word embeddings as we train our model. The training split contains 25,000 movie reviews from IMDB, each labeled with a sentiment (positive or negative); the test split contains another 25,000.
<br>
</font> 
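
To make "similar meaning, similar representation" concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three-dimensional vectors and the words chosen are hypothetical, hand-picked values for illustration; a trained `Embedding` layer would learn its own vectors from the data.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
toy_embeddings = {
    "good":  [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "awful": [-0.7, -0.9, 0.1],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_good_great = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_good_awful = cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"])
print(f"good vs great: {sim_good_great:.3f}")
print(f"good vs awful: {sim_good_awful:.3f}")
```

With these toy values, the two positive words point in nearly the same direction while the negative word points the opposite way, which is the kind of geometry a sentiment model can exploit.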

### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">Dataset:
<font color=darkblue>
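The dataset consists of movie reviews from IMDB, labeled by sentiment (positive/negative); as loaded here it has 25,000 training and 25,000 test reviews. Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by their frequency in the dataset, so the word with index 1 in the raw word index is the most frequent word. The task suggests using the first 20 words of each review to speed up training, with a maximum vocabulary size of 10,000; this notebook keeps the 10,000-word vocabulary but pads or truncates each review to 500 tokens (`maxlen = 500`). By convention, "0" does not stand for a specific word. With the default `imdb.load_data` settings used below, 0 is reserved for padding, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word.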
<br>
</font> 
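
The frequency-based encoding described above can be sketched on a tiny corpus. This is an illustration under stated assumptions, not the Keras implementation: the three reviews are made up, and the constants `PAD`, `START`, `UNK`, and `INDEX_FROM` are named here only to mirror the reserved indices 0, 1, 2 and the offset of 3 that the notebook uses later when it rebuilds `word_index`.

```python
from collections import Counter

# Hypothetical mini-corpus; the real dataset arrives already encoded.
reviews = [
    "the movie was great",
    "the movie was awful",
    "the plot was great",
]

PAD, START, UNK = 0, 1, 2   # reserved indices
INDEX_FROM = 3              # offset added to every frequency rank

# Rank words by frequency: rank 1 is the most frequent word.
# Ties are broken alphabetically so the ranking is deterministic.
counts = Counter(word for review in reviews for word in review.split())
ranked = sorted(counts, key=lambda w: (-counts[w], w))
rank = {word: i for i, word in enumerate(ranked, start=1)}

def encode(review, num_words):
    """Encode a review as integers, replacing out-of-vocabulary words with UNK."""
    encoded = [START]
    for word in review.split():
        index = rank.get(word, 0) + INDEX_FROM
        # Keep only known words whose shifted index is below num_words.
        encoded.append(index if word in rank and index < num_words else UNK)
    return encoded

print(rank)
print(encode("the movie was great", num_words=7))
```

Lowering `num_words` pushes more of the rarer words to `UNK`, which is the trade-off behind the 10,000-word vocabulary limit.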

### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">Overview:
<font color=darkblue>
Use the IMDB dataset to generate word embeddings. Build a Sequential model using Keras for the sentiment classification task and report the accuracy of the model.
</font>

### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">Objective:
<font color=darkblue>Build a Sequential model using Keras for the sentiment classification task and report the accuracy of the model.
</font>

### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">Import the necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Layers are imported from keras.layers directly; this also works on Keras
# releases where the keras.layers.embeddings and keras.layers.convolutional
# submodule paths are unavailable.
from keras import backend as K
from keras.datasets import imdb
from keras.layers import Conv1D, Dense, Dropout, Embedding, Flatten, LSTM, MaxPooling1D
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences


In [2]:
print('Numpy Version : ', np.__version__)
print('Pandas Version : ', pd.__version__)
print('Matplotlib Version : ', matplotlib.__version__)


Numpy Version :  1.19.4
Pandas Version :  1.1.3
Matplotlib Version :  3.2.2


### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">Load IMDB dataset

### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">Import test and train data

In [3]:
# fix random seed for reproducibility
np.random.seed(111)

In [4]:
vocab_size = 10000  # vocabulary size
maxlen = 500  # number of tokens kept from each review

In [5]:
# vocab_size is the number of words to keep from the dataset, ranked by frequency.
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size) 




In [6]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(25000,) (25000,) (25000,) (25000,)


In [7]:
# Summarize number of words
print("Number of words: ")
print(len(np.unique(np.hstack(X_train))))
print(len(np.unique(np.hstack(X_test))))

Number of words: 
9998
9951


In [8]:
X = np.concatenate((X_train, X_test), axis=0)
y = np.concatenate((y_train, y_test), axis=0)

In [9]:
# Summarize the size of the combined (train + test) data
print("Combined data: ")
print(X.shape)
print(y.shape)

Combined data: 
(50000,)
(50000,)


In [10]:
# Summarize number of classes
print("Categories:", np.unique(y))
print("Number of unique words:", len(np.unique(np.hstack(X))))

Categories: [0 1]
Number of unique words: 9998


### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">Get the word index and then Create a key-value pair for word and word_id

In [11]:
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()


In [12]:
# The first indices are reserved
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

# Invert the mapping so integer indices can be turned back into words
reverse_word_index = {value: key for key, value in word_index.items()}


In [13]:
def decode_review(text):
    """Turn a sequence of word indices back into text; unknown indices become '?'."""
    return ' '.join(reverse_word_index.get(i, '?') for i in text)


In [14]:
n=5
if y[n]==1:
  print("Label:", y[n], " Positive Review")
else:
  print("Label:", y[n], " Negative Review")
print(X[n])
 
decoded = decode_review(X[n])
print(decoded)


Label: 0  Negative Review
[1, 778, 128, 74, 12, 630, 163, 15, 4, 1766, 7982, 1051, 2, 32, 85, 156, 45, 40, 148, 139, 121, 664, 665, 10, 10, 1361, 173, 4, 749, 2, 16, 3804, 8, 4, 226, 65, 12, 43, 127, 24, 2, 10, 10]
<START> begins better than it ends funny that the russian submarine crew <UNK> all other actors it's like those scenes where documentary shots br br spoiler part the message <UNK> was contrary to the whole story it just does not <UNK> br br


In [15]:
n=11509
if y[n]==1:
  print("Label:", y[n], " Positive Review")
else:
  print("Label:", y[n], " Negative Review")
print(X[n])
 
decoded = decode_review(X[n])
print(decoded)


Label: 1  Positive Review
[1, 18, 1450, 9, 6, 2, 6532, 8341, 2, 10, 10, 12, 9, 1167, 8, 1582, 89, 111, 87, 1956, 28, 2, 1450, 11, 5738, 2868, 690, 2, 346, 21, 3210, 4414, 5, 422, 102, 4, 6541, 22, 167, 2431, 2, 301, 44, 1450, 10, 10, 12, 2564, 4, 3537, 4440, 2, 467, 4, 890, 2690, 6121, 11, 2, 19, 2, 11, 129, 3062, 4099, 5, 2, 2, 2, 5, 8159, 33, 314, 316, 11, 4, 182, 47, 107, 2, 27, 205, 5, 1450, 10, 10, 1450, 9, 210, 3445, 19, 119, 5, 883, 5, 1450, 7824, 9772, 63, 9, 9048, 2, 6978, 9, 6, 1594, 7, 346, 108, 400, 7352, 39, 3070, 1020, 907, 39, 32, 120, 4, 182, 11, 257, 75, 413, 1081, 19, 31, 7, 4, 543, 7, 641, 891, 2, 5, 19, 4, 2, 7, 32, 2088, 2, 2511, 5, 4260, 37, 32, 855, 11, 119, 11, 94, 111, 9809, 5, 5592, 11, 49, 7, 4, 2, 6978, 75, 26, 4, 4684, 7, 4, 2076, 3267, 7, 4, 5082, 15, 485, 8, 4181, 602, 2, 5, 382, 649, 40, 18, 2, 5, 2, 2, 11, 4, 890, 7, 2, 11, 4, 636, 22, 42, 18, 2, 2, 5, 2, 2, 17, 6, 428, 430, 5, 6, 4863, 250, 625, 1665, 2667, 883, 526, 34, 2, 2, 778, 23, 2, 852, 2, 13, 6

In [16]:
n=13509
if y[n]==1:
  print("Label:", y[n], " Positive Review")
else:
  print("Label:", y[n], " Negative Review")
print(X[n])
 
decoded = decode_review(X[n])
print(decoded)


Label: 1  Positive Review
[1, 45, 6, 902, 14, 20, 9, 38, 254, 8, 79, 129, 957, 23, 11, 4, 178, 13, 258, 12, 143, 6, 1281, 374, 6409, 5, 12, 16, 434, 290, 12, 14, 9, 209, 6, 824, 4, 118, 22, 93, 315, 4, 1751, 2155, 999, 5, 4, 1885, 22, 7, 4, 3814, 4796, 167, 1265, 2, 93, 389, 108, 44, 4, 3494, 5, 19, 1601, 1662, 29, 1075, 6, 2447, 787, 6929, 4, 2, 7, 4, 999, 10, 10, 6151, 185, 5, 6611, 3064, 28, 6, 389, 1175, 200, 98, 5, 36, 339, 97, 14, 20, 6, 389, 883, 2, 2, 9, 1047, 5, 9119, 137, 2, 988, 9, 8368, 5, 4590, 125, 4, 3921, 200, 4, 109, 2126, 31, 7, 4, 91, 878, 21, 11, 4, 130, 6586, 1519, 23, 22, 10, 10, 1601, 1662, 9, 4, 91, 1796, 1152, 1751, 2155, 22, 207, 110, 2, 1077, 4, 2, 5, 7255, 3120, 8, 471, 4, 2, 2059, 121, 988, 5, 2, 412, 83, 6, 5061, 4, 2, 7, 4, 3494, 26, 115, 3714, 11, 192, 507, 9915, 8, 4, 22, 21, 17, 2, 2, 4, 22, 17, 6, 1796, 1152, 2447, 787, 4, 119, 200, 4, 105, 166, 4, 904, 306, 329, 2494, 12, 166, 4, 22, 2272, 5, 2, 10, 10, 1601, 1662, 9, 4, 2, 3603, 7, 4, 1751, 2155, 99

#### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">Padding

In [17]:
X_train = pad_sequences(X_train, value=word_index["<PAD>"], padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, value=word_index["<PAD>"], padding='post', maxlen=maxlen)

In [18]:
len(X_train[0]), len(X_train[1])


(500, 500)
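
The call above sets `padding='post'` but leaves `truncating` at its Keras default of `'pre'`, so short reviews get zeros appended while long reviews lose their earliest tokens. The pure-Python sketch below mimics that combination on two made-up sequences; the helper name `pad_post_truncate_pre` is hypothetical and is not part of Keras.

```python
def pad_post_truncate_pre(seq, maxlen, pad_value=0):
    """Pad at the end; when the sequence is too long, drop tokens from the front."""
    if len(seq) >= maxlen:
        return seq[len(seq) - maxlen:]   # keeps the LAST maxlen tokens
    return seq + [pad_value] * (maxlen - len(seq))

short_review = [1, 14, 22, 16]
long_review = [1, 14, 22, 16, 43, 530, 973, 1622]

padded_short = pad_post_truncate_pre(short_review, maxlen=6)
padded_long = pad_post_truncate_pre(long_review, maxlen=6)
print(padded_short)
print(padded_long)
```

Note that the truncated review no longer begins with the start marker (index 1), because the front of the sequence is what gets cut.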

### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">Build a Sequential Model using Keras for the Sentiment Classification task

#### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">Sequence Classification

In [19]:
embedding_vecor_length = 20

In [20]:
model_seq = Sequential()
model_seq.add(Embedding(vocab_size, embedding_vecor_length, input_length=maxlen))
model_seq.add(Flatten())
model_seq.add(Dense(250, activation='relu'))
model_seq.add(Dense(1, activation='sigmoid'))
model_seq.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [21]:
print(model_seq.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 20)           200000    
_________________________________________________________________
flatten (Flatten)            (None, 10000)             0         
_________________________________________________________________
dense (Dense)                (None, 250)               2500250   
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 251       
Total params: 2,700,501
Trainable params: 2,700,501
Non-trainable params: 0
_________________________________________________________________
None
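
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic with the values used in this notebook; the variable names `embedding_dim` and `hidden_units` are introduced here for readability only.

```python
vocab_size = 10000
embedding_dim = 20
maxlen = 500
hidden_units = 250

# Embedding: one vector of length embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# Flatten: reshapes (maxlen, embedding_dim) into one long vector, no parameters.
flatten_size = maxlen * embedding_dim
# Dense(250): one weight per (input, unit) pair plus one bias per unit.
dense_params = flatten_size * hidden_units + hidden_units
# Dense(1): one weight per hidden unit plus a single bias.
output_params = hidden_units * 1 + 1

total_params = embedding_params + dense_params + output_params
print(embedding_params, flatten_size, dense_params, output_params, total_params)
```

Most of the 2.7 million parameters sit in the first Dense layer, because flattening turns every review into a 10,000-dimensional input.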


In [22]:
#Fit the model
model_seq.fit(X_train, y_train, epochs=5, batch_size=128, verbose=2)

Epoch 1/5
196/196 - 5s - loss: 0.4964 - accuracy: 0.7231
Epoch 2/5
196/196 - 5s - loss: 0.1811 - accuracy: 0.9322
Epoch 3/5
196/196 - 5s - loss: 0.0503 - accuracy: 0.9866
Epoch 4/5
196/196 - 6s - loss: 0.0099 - accuracy: 0.9986
Epoch 5/5
196/196 - 6s - loss: 0.0023 - accuracy: 0.9998


<tensorflow.python.keras.callbacks.History at 0x26da77d31f0>

In [23]:
#Evaluate the model
scores_seq = model_seq.evaluate(X_test, y_test, verbose=0)

print("Accuracy: %.2f%%" % (scores_seq[1]*100))

Accuracy: 85.59%


#### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">LSTM for Sequence Classification

In [24]:
model_seq_lstm = Sequential()
model_seq_lstm.add(Embedding(vocab_size, embedding_vecor_length, input_length=maxlen))
model_seq_lstm.add(LSTM(100))
model_seq_lstm.add(Dense(1, activation='sigmoid'))
model_seq_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [25]:
print(model_seq_lstm.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 20)           200000    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               48400     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101       
Total params: 248,501
Trainable params: 248,501
Non-trainable params: 0
_________________________________________________________________
None
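
The 48,400 parameters reported for the LSTM layer follow from its four gate-like blocks, each with input weights, recurrent weights, and a bias. A short sketch of that arithmetic, using a helper name (`lstm_param_count`) that is introduced here for illustration:

```python
def lstm_param_count(input_dim, units):
    """Four blocks (input, forget, cell, output), each with
    input weights, recurrent weights, and one bias per unit."""
    return 4 * ((input_dim + units) * units + units)

lstm_params = lstm_param_count(input_dim=20, units=100)
total_params = 10000 * 20 + lstm_params + (100 * 1 + 1)
print(lstm_params, total_params)
```

Compared with the dense model, the LSTM has roughly a tenth of the parameters because its weights are shared across all 500 time steps.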


In [26]:
#Fit the model
model_seq_lstm.fit(X_train, y_train, epochs=3, batch_size=128, verbose=2)

Epoch 1/3
196/196 - 200s - loss: 0.6929 - accuracy: 0.5030
Epoch 2/3
196/196 - 206s - loss: 0.6967 - accuracy: 0.5177
Epoch 3/3
196/196 - 213s - loss: 0.6741 - accuracy: 0.5287


<tensorflow.python.keras.callbacks.History at 0x26dab818490>

In [27]:
#Evaluate the model
scores_seq_lstm = model_seq_lstm.evaluate(X_test, y_test, verbose=0)

print("Accuracy: %.2f%%" % (scores_seq_lstm[1]*100))

Accuracy: 51.24%
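
The training losses above sit very close to ln 2 ≈ 0.6931, which is the binary cross-entropy of a model that outputs 0.5 for every review. Together with a test accuracy near 50% on a two-class problem, this suggests the LSTM is still close to guessing after three epochs. The sketch below only verifies the ln 2 figure on made-up labels; it does not diagnose the cause.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Mean binary cross-entropy over labels and predicted probabilities."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, y_pred)
    ]
    return sum(losses) / len(losses)

labels = [0, 1, 0, 1]                      # hypothetical labels
chance_predictions = [0.5] * len(labels)   # a model that always says 0.5
chance_loss = binary_crossentropy(labels, chance_predictions)
print(f"{chance_loss:.4f}")
```

One possible contributor, offered as an assumption rather than something verified in this notebook, is the post-padding used earlier: for short reviews the last time steps the LSTM sees are all padding. Trying `padding='pre'`, or `mask_zero=True` on the `Embedding` layer, may be worth an experiment.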


#### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">LSTM for Sequence Classification with Dropout

In [28]:
model_seq_lstm_dropout = Sequential()
model_seq_lstm_dropout.add(Embedding(vocab_size, embedding_vecor_length, input_length=maxlen))
model_seq_lstm_dropout.add(Dropout(0.2))
model_seq_lstm_dropout.add(LSTM(100))
model_seq_lstm_dropout.add(Dropout(0.2))
model_seq_lstm_dropout.add(Dense(1, activation='sigmoid'))
model_seq_lstm_dropout.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [29]:
print(model_seq_lstm_dropout.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 20)           200000    
_________________________________________________________________
dropout (Dropout)            (None, 500, 20)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               48400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 101       
Total params: 248,501
Trainable params: 248,501
Non-trainable params: 0
_________________________________________________________________
None


In [30]:
#Fit the model
model_seq_lstm_dropout.fit(X_train, y_train, epochs=3, batch_size=128, verbose=2)

Epoch 1/3
196/196 - 215s - loss: 0.6933 - accuracy: 0.5031
Epoch 2/3
196/196 - 212s - loss: 0.6918 - accuracy: 0.5080
Epoch 3/3
196/196 - 203s - loss: 0.6886 - accuracy: 0.5192


<tensorflow.python.keras.callbacks.History at 0x26da6a076a0>

In [31]:
#Evaluate the model
scores_seq_lstm_dropout = model_seq_lstm_dropout.evaluate(X_test, y_test, verbose=0)

print("Accuracy: %.2f%%" % (scores_seq_lstm_dropout[1]*100))

Accuracy: 50.44%


#### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">LSTM with Layer-Level Dropout (dropout, recurrent_dropout) for Sequence Classification

In [32]:
model_seq_lstm_precise = Sequential()
model_seq_lstm_precise.add(Embedding(vocab_size, embedding_vecor_length, input_length=maxlen))
model_seq_lstm_precise.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model_seq_lstm_precise.add(Dense(1, activation='sigmoid'))
model_seq_lstm_precise.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [33]:
print(model_seq_lstm_precise.summary())

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 500, 20)           200000    
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               48400     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 101       
Total params: 248,501
Trainable params: 248,501
Non-trainable params: 0
_________________________________________________________________
None


In [34]:
#Fit the model
model_seq_lstm_precise.fit(X_train, y_train, epochs=3, batch_size=128, verbose=2)

Epoch 1/3
196/196 - 401s - loss: 0.6937 - accuracy: 0.5038
Epoch 2/3
196/196 - 405s - loss: 0.6908 - accuracy: 0.5142
Epoch 3/3
196/196 - 402s - loss: 0.6858 - accuracy: 0.5242


<tensorflow.python.keras.callbacks.History at 0x26da90628e0>

In [35]:
#Evaluate the model
scores_seq_lstm_precise = model_seq_lstm_precise.evaluate(X_test, y_test, verbose=0)

print("Accuracy: %.2f%%" % (scores_seq_lstm_precise[1]*100))

Accuracy: 51.49%


#### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">CNN and LSTM for Sequence Classification

In [36]:
model_seq_lstm_cnn = Sequential()
model_seq_lstm_cnn.add(Embedding(vocab_size, embedding_vecor_length, input_length=maxlen))
model_seq_lstm_cnn.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model_seq_lstm_cnn.add(LSTM(100))
model_seq_lstm_cnn.add(Dense(1, activation='sigmoid'))
model_seq_lstm_cnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [37]:
print(model_seq_lstm_cnn.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 500, 20)           200000    
_________________________________________________________________
conv1d (Conv1D)              (None, 500, 32)           1952      
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 101       
Total params: 255,253
Trainable params: 255,253
Non-trainable params: 0
_________________________________________________________________
None
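
The summary for the CNN + LSTM model can be checked the same way. The convolution adds a small number of parameters, and the LSTM grows because its input now has 32 channels instead of 20. The helper names below are introduced for illustration only.

```python
def conv1d_param_count(in_channels, filters, kernel_size):
    """Each filter spans kernel_size steps across all input channels, plus a bias."""
    return kernel_size * in_channels * filters + filters

def lstm_param_count(input_dim, units):
    """Four blocks, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)

conv_params = conv1d_param_count(in_channels=20, filters=32, kernel_size=3)
lstm_params = lstm_param_count(input_dim=32, units=100)
total_params = 10000 * 20 + conv_params + lstm_params + (100 + 1)
print(conv_params, lstm_params, total_params)
```

Because `padding='same'` is used and no pooling layer follows, the sequence length stays at 500, so the LSTM still processes 500 time steps per review.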


In [38]:
#Fit the model
model_seq_lstm_cnn.fit(X_train, y_train, epochs=3, batch_size=128, verbose=2)

Epoch 1/3
196/196 - 180s - loss: 0.6978 - accuracy: 0.5034
Epoch 2/3
196/196 - 204s - loss: 0.6813 - accuracy: 0.5250
Epoch 3/3
196/196 - 205s - loss: 0.6695 - accuracy: 0.5322


<tensorflow.python.keras.callbacks.History at 0x26dd09f9b80>

In [39]:
#Evaluate the model
scores_seq_lstm_cnn = model_seq_lstm_cnn.evaluate(X_test, y_test, verbose=0)

print("Accuracy: %.2f%%" % (scores_seq_lstm_cnn[1]*100))

Accuracy: 51.32%


### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">Retrieve the output of each layer in Keras for a given single test sample from the trained model you built

In [40]:
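# Build backend functions for the first Sequential model (model_seq).
# Each one maps the model input to the output of a single layer:
# layers[1] is Flatten, layers[2] is Dense(250), layers[3] is Dense(1).
# The output of the Embedding layer itself (layers[0]) is not captured here.
layer1_output = K.function([model_seq.layers[0].input], [model_seq.layers[1].output])
layer2_output = K.function([model_seq.layers[0].input], [model_seq.layers[2].output])
layer3_output = K.function([model_seq.layers[0].input], [model_seq.layers[3].output])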


In [41]:
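# [X_test][0] is simply X_test, so each call runs on the full test set
# (one output row per review) rather than on a single sample.
layer1 = layer1_output(X_test)
layer2 = layer2_output(X_test)
layer3 = layer3_output(X_test)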


In [42]:
layer1

[array([[-0.01447504,  0.00384187,  0.00748494, ...,  0.00550764,
         -0.00560445,  0.00756306],
        [-0.01447504,  0.00384187,  0.00748494, ...,  0.00550764,
         -0.00560445,  0.00756306],
        [-0.03957103,  0.02544248, -0.00233473, ...,  0.00263179,
         -0.01730805,  0.04146944],
        ...,
        [-0.01447504,  0.00384187,  0.00748494, ...,  0.00550764,
         -0.00560445,  0.00756306],
        [-0.01447504,  0.00384187,  0.00748494, ...,  0.00550764,
         -0.00560445,  0.00756306],
        [-0.01447504,  0.00384187,  0.00748494, ...,  0.00550764,
         -0.00560445,  0.00756306]], dtype=float32)]

In [43]:
layer2

[array([[0.11262846, 0.17361796, 0.15848213, ..., 0.21624184, 0.28417572,
         0.40211496],
        [0.6366247 , 0.6599688 , 0.7278754 , ..., 1.0163685 , 0.        ,
         0.        ],
        [0.41450405, 0.41564944, 0.37862363, ..., 0.18982676, 0.24773087,
         0.08817422],
        ...,
        [0.10430388, 0.12525654, 0.13948679, ..., 0.1437923 , 0.25996444,
         0.38182613],
        [0.21796092, 0.2222244 , 0.21280958, ..., 0.22931527, 0.16754338,
         0.26743484],
        [0.32403067, 0.37360978, 0.3495034 , ..., 0.34490663, 0.11185843,
         0.20344909]], dtype=float32)]

In [44]:
layer3

[array([[0.02823198],
        [0.99999964],
        [0.98118687],
        ...,
        [0.0104104 ],
        [0.27146804],
        [0.9827529 ]], dtype=float32)]
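
To see what these intermediate outputs represent for one review, the sketch below runs a single padded sample through the same sequence of operations (embedding lookup, flatten, dense + ReLU, dense + sigmoid) in plain Python. All weights and sizes are hypothetical toy values, not taken from the trained model. With the Keras functions above, a single sample could be selected with a slice such as `X_test[:1]`, which keeps the batch dimension.

```python
import math

# Hypothetical toy weights; a trained model would supply its own.
embedding_table = [        # vocabulary of 4 tokens, embedding dimension 2
    [0.0, 0.0],            # index 0: <PAD>
    [0.1, -0.2],
    [0.4, 0.3],
    [-0.5, 0.2],
]
dense_w = [                # (flattened size 6) x (2 hidden units)
    [0.2, -0.1], [0.0, 0.3], [0.5, 0.1],
    [-0.3, 0.2], [0.1, 0.1], [0.2, -0.4],
]
dense_b = [0.0, 0.1]
out_w = [0.7, -0.6]
out_b = 0.05

sample = [2, 3, 0]         # one review, padded to length 3

# Embedding: look up one vector per token -> shape (3, 2)
embedded = [embedding_table[token] for token in sample]
# Flatten: concatenate the vectors -> shape (6,)
flat = [value for vector in embedded for value in vector]
# Dense + ReLU -> shape (2,)
hidden = [
    max(0.0, sum(x * dense_w[i][j] for i, x in enumerate(flat)) + dense_b[j])
    for j in range(len(dense_b))
]
# Dense + sigmoid -> a single probability
logit = sum(h * w for h, w in zip(hidden, out_w)) + out_b
probability = 1.0 / (1.0 + math.exp(-logit))

for name, value in [("embedding", embedded), ("flatten", flat),
                    ("dense", hidden), ("output", probability)]:
    print(name, value)
```

The zeros produced by ReLU in the hidden layer correspond to the `0.` entries visible in the `layer2` output above, and the final value plays the role of one row of `layer3`.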

#### <span style="font-family: Arial; font-weight:bold;font-size:1.25em;color:#00b3e5;">Conclusion :
<font color=darkblue>
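The goal was to build a Sequential model and report its accuracy. We loaded the IMDB dataset using Keras, built the word index, created several Sequential models, and reported their accuracy: the dense model reached 85.59% on the test set, while the LSTM-based variants stayed near 51%. Finally, the outputs of the layers after the embedding in the dense Sequential model were displayed.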
</font>