# Sentiment Classification


### Generate word embeddings and retrieve the outputs of each layer with Keras, based on a classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

It is a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.
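To make "similar meaning, similar representation" concrete, here is a minimal Python 3 sketch that compares word vectors with cosine similarity. The three 4-dimensional vectors are hand-picked, hypothetical values for illustration only; they are not learned from the IMDB data.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "good":  np.array([0.9, 0.8, 0.1, 0.0]),
    "great": np.array([0.8, 0.9, 0.2, 0.1]),
    "awful": np.array([-0.9, -0.7, 0.1, 0.2]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_good_great = cosine_similarity(embeddings["good"], embeddings["great"])
sim_good_awful = cosine_similarity(embeddings["good"], embeddings["awful"])
print("good vs great:", round(sim_good_great, 3))
print("good vs awful:", round(sim_good_awful, 3))
```

With these made-up vectors, the two positive words score close to 1 while the positive/negative pair scores below 0. A trained embedding layer is expected to produce a similar kind of geometry on its own.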

We will use the IMDB dataset to learn word embeddings as we train our model. The training split contains 25,000 movie reviews from IMDB, labeled with sentiment (positive or negative), and the test split contains another 25,000.



### Dataset

`from keras.datasets import imdb`

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. The assignment suggests using the first 20 words from each review to speed up training, with a maximum vocabulary size of 10,000; the code in this notebook keeps the vocabulary size of 10,000 but pads or truncates each review to 300 tokens.

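As a convention, "0" does not stand for a specific word. With the default arguments of `imdb.load_data`, index 0 is reserved for padding, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word, so the indices of actual words are shifted up by 3.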
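The frequency-based indexing described above can be sketched on a toy corpus. The three reviews below are hypothetical, and breaking ties alphabetically is an assumption made here only to keep the result deterministic.

```python
from collections import Counter

# Toy corpus (hypothetical reviews, not taken from IMDB).
reviews = [
    "the movie was great",
    "the movie was bad",
    "the acting was great",
]

# Count every token, then rank words from most to least frequent.
# Ties are broken alphabetically (an assumption, for determinism).
counts = Counter(word for review in reviews for word in review.split())
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Rank 1 goes to the most frequent word, rank 2 to the next, and so on.
toy_word_index = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

# Each review becomes a list of integer ranks.
encoded = [[toy_word_index[w] for w in review.split()] for review in reviews]
print(toy_word_index)
print(encoded)
```

Capping the vocabulary then amounts to keeping only the words whose rank falls below a chosen threshold.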


### Aim

1. Import test and train data
2. Import the labels (train and test)
3. Get the word index and then create key-value pairs for word and word_id. (12.5 points)
4. Build a Sequential model using Keras for the sentiment classification task. (10 points)
5. Report the accuracy of the model. (5 points)
6. Retrieve the output of each layer in Keras for a given single test sample from the trained model you built. (2.5 points)


In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [3]:
pwd

u'/content'

In [0]:
import os
os.chdir('/content/drive/My Drive/')

In [0]:
os.chdir('/content/drive/My Drive/greatlakes/labexternal9/')

In [6]:

import numpy as np

from keras.preprocessing import sequence
from keras.models import Sequential, load_model
from keras.layers import Dense, Embedding
from keras.layers import LSTM

Using TensorFlow backend.


In [0]:
from keras.datasets import imdb
#vocab size
vocab_size = 10000
# save np.load
np_load_data = np.load

In [0]:
# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_data(*a, allow_pickle=True, **k)

In [0]:


# call load_data with allow_pickle implicitly set to true
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size) 

In [0]:
# restore np.load for future normal usage
np.load = np_load_data
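The save, patch, and restore steps above can also be wrapped in a context manager, so that `np.load` is restored even if loading fails. This is a sketch in Python 3; `allow_pickle_np_load` is a hypothetical helper name, and the small object array exists only to demonstrate a file that needs `allow_pickle=True`.

```python
import contextlib
import os
import tempfile

import numpy as np

@contextlib.contextmanager
def allow_pickle_np_load():
    """Temporarily make np.load default to allow_pickle=True."""
    original_load = np.load

    def patched_load(*args, **kwargs):
        kwargs.setdefault("allow_pickle", True)
        return original_load(*args, **kwargs)

    np.load = patched_load
    try:
        yield
    finally:
        # Always restore the original, even if loading raises.
        np.load = original_load

# Demonstration with a small object array, which needs pickling to round-trip.
original = np.load
path = os.path.join(tempfile.mkdtemp(), "ragged.npy")
ragged = np.empty(2, dtype=object)
ragged[0], ragged[1] = [1, 2, 3], [4, 5]
np.save(path, ragged, allow_pickle=True)

with allow_pickle_np_load():
    loaded = np.load(path)

restored = np.load is original
print(list(loaded), restored)
```

In this notebook, the call to `imdb.load_data(num_words=vocab_size)` would go inside the `with` block.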

In [11]:
print("x_train ", x_train.shape)
print("y_train ", y_train.shape)
print("_"*100)

('x_train ', (25000,))
('y_train ', (25000,))
____________________________________________________________________________________________________


In [12]:

print("x_test ", x_test.shape)
print("y_test ", y_test.shape)
print("_"*100)

('x_test ', (25000,))
('y_test ', (25000,))
____________________________________________________________________________________________________


In [13]:
# `seq` is used as the loop variable so that the imported
# keras `sequence` module is not shadowed
print("Maximum value of a word index ")
print(max([max(seq) for seq in x_train]))
print("Maximum length num words of review in train ")
print(max([len(seq) for seq in x_train]))


Maximum value of a word index 
9999
Maximum length num words of review in train 
2494


In [14]:
print(x_train.shape, y_train.shape)

((25000,), (25000,))


In [15]:
print(x_test.shape, y_test.shape)

((25000,), (25000,))


In [16]:
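# word -> frequency rank, as provided with the dataset
word_index = imdb.get_word_index()

# invert the mapping: frequency rank -> word
reverse_word_index = dict(
    [(value, key) for (key, value) in word_index.items()])

# indices are offset by 3 because 0, 1 and 2 are reserved
# (padding, start of sequence, unknown word)
decoded_review = ' '.join(
    [reverse_word_index.get(i - 3, '?') for i in x_train[123]])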

print(decoded_review)

? beautiful and touching movie rich colors great settings good acting and one of the most charming movies i have seen in a while i never saw such an interesting setting when i was in china my wife liked it so much she asked me to ? on and rate it so other would enjoy too
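The `i - 3` offset in the decoding step can be checked on a miniature vocabulary. The sketch below assumes the default reserved indices of `imdb.load_data` (0 for padding, 1 for start of sequence, 2 for unknown words); `toy_word_index`, `encode`, and `decode` are hypothetical names introduced for illustration.

```python
# Hypothetical miniature word index (word -> frequency rank), standing in
# for the dictionary returned by imdb.get_word_index().
toy_word_index = {"the": 1, "movie": 2, "great": 3, "was": 4}

# Offsets assumed from the defaults of imdb.load_data:
# 0 = padding, 1 = start of sequence, 2 = out-of-vocabulary, words start at 3.
INDEX_FROM = 3
START_ID, OOV_ID = 1, 2

def encode(text):
    """Turn a sentence into ids: start marker, then shifted ranks or OOV."""
    ids = [START_ID]
    for word in text.lower().split():
        rank = toy_word_index.get(word)
        ids.append(OOV_ID if rank is None else rank + INDEX_FROM)
    return ids

reverse_index = {rank: word for word, rank in toy_word_index.items()}

def decode(ids):
    """Undo the shift; reserved ids have no word and are shown as '?'."""
    return " ".join(reverse_index.get(i - INDEX_FROM, "?") for i in ids)

ids = encode("The movie was truly great")
print(ids)          # [1, 4, 5, 7, 2, 6]
print(decode(ids))  # ? the movie was ? great
```

This mirrors the decoded review above: the leading `?` comes from the start marker, and any later `?` marks a word outside the vocabulary.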


In [0]:
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000  # vocabulary size
maxlen = 300  # number of words kept from each review

In [0]:
# load the dataset as lists of ints
# save np.load
np_load_data = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_data(*a, allow_pickle=True, **k)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
# restore np.load for future normal usage
np.load = np_load_data
# make all sequences the same length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)

In [19]:
print("x_train ", x_train.shape)
print("y_train ", y_train.shape)
print("_"*100)
print("x_test ", x_test.shape)
print("y_test ", y_test.shape)
print("_"*100)
print("Maximum value of a word index ")
print(max([max(seq) for seq in x_train]))
print("Maximum length num words of review in train ")
print(max([len(seq) for seq in x_train]))

('x_train ', (25000, 300))
('y_train ', (25000,))
____________________________________________________________________________________________________
('x_test ', (25000, 300))
('y_test ', (25000,))
____________________________________________________________________________________________________
Maximum value of a word index 
9999
Maximum length num words of review in train 
300


In [20]:
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

((25000, 300), (25000,))
((25000, 300), (25000,))
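The change from ragged lists to a `(25000, 300)` array comes from padding and truncation. The NumPy sketch below (Python 3) mimics what is, to my understanding, the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`, `value=0`); `pad_pre` is a hypothetical re-implementation for illustration, not the Keras function itself.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue
        trimmed = seq[-maxlen:]             # drop items from the front if too long
        out[row, -len(trimmed):] = trimmed  # right-align, padding on the left
    return out

padded = pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

Under these defaults, a review longer than `maxlen` keeps its last tokens, not its first; keeping the first tokens would need `truncating='post'`.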


## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps the index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model.
* We can load pre-trained word embeddings into the embedding layer when we create our model.
* We can use the embedding layer to train our own word2vec models.

The Keras embedding layer doesn't require us to one-hot encode our words; instead, we have to give each word a unique integer id. For the IMDB dataset we've loaded, this has already been done, but if this weren't the case we could use sklearn [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
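One way to see why no one-hot encoding is needed: looking up row `i` of a weight matrix gives the same result as multiplying a one-hot vector for `i` by that matrix. The sketch below (Python 3, NumPy only) uses a small random matrix as a stand-in for the embedding weights; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 6, 3                        # tiny sizes, for illustration only
weights = rng.normal(size=(vocab, dim))  # stands in for the Embedding weights

word_ids = np.array([4, 1, 4])

# Route 1: direct row lookup, which is what an embedding layer does.
looked_up = weights[word_ids]

# Route 2: explicit one-hot encoding followed by a matrix product.
one_hot = np.eye(vocab)[word_ids]
via_matmul = one_hot @ weights

print(looked_up.shape, np.allclose(looked_up, via_matmul))
```

The lookup avoids materialising a mostly-zero vector of vocabulary size for every token, which matters with a vocabulary of 10,000 words.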

In [21]:
model = Sequential()
model.add(Embedding(vocab_size, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 128)         1280000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
=================================================================
Total params: 1,411,713
Trainable params: 1,411,713
Non-trainable params: 0
_________________________________________________________________
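The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the variable names are chosen here for illustration.

```python
vocab_size, embed_dim, lstm_units = 10000, 128, 128

# Embedding: one vector of size embed_dim per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# Dense(1): one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 1280000 131584 129 1411713
```

Most of the parameters sit in the embedding matrix, so the vocabulary size and embedding dimension largely determine the size of the model.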


In [22]:
batch_size = 50

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=10,
          validation_data=(x_test, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fb11a775490>

In [0]:
os.chdir('/content/drive/My Drive/greatlakes/labexternal9/')

In [0]:
# save the model to file
model.save('SeqNLP_Project1_model.h5')

In [30]:
pwd

u'/content/drive/My Drive/greatlakes/labexternal9'

In [0]:
# load the model
model = load_model('SeqNLP_Project1_model.h5')

In [32]:
model.fit(x_train, y_train, batch_size=batch_size, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fb111298850>

In [33]:
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

('Test score:', 0.7613391863256693)
('Test accuracy:', 0.8599599949121475)
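The "score" returned by `model.evaluate` is the binary cross-entropy loss the model was compiled with, and the second value is the accuracy. The NumPy sketch below shows how the two metrics can be computed from sigmoid outputs. The labels and probabilities are hypothetical, not taken from the trained model, and the 0.5 threshold is the usual convention assumed here.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    return float(np.mean((y_prob > threshold).astype(int) == y_true))

# Hypothetical labels and sigmoid outputs, not taken from the trained model.
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.99])   # the last one is confidently wrong

loss = binary_cross_entropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print("loss:", round(loss, 3), "accuracy:", acc)
```

In this toy case a single confidently wrong prediction dominates the loss while accuracy only drops by one sample, which is one way a model can combine reasonable accuracy with a comparatively high loss.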


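#### Insights

* Evaluation on the test set gives an accuracy of about 0.86 and a loss (the reported "score") of about 0.76. A loss this high alongside this accuracy suggests that some predictions are confidently wrong, which may be a sign of overfitting after 20 training epochs in total (10 before saving and 10 after reloading).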
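Item 6 of the aim asks for the output of each layer for a single test sample. With a trained Keras model, a common approach is to build a second `Model` that shares the original input and exposes `[layer.output for layer in model.layers]`, then call `predict` on one padded sample. As a framework-free illustration of what those three outputs are, the Python 3 sketch below runs a hand-written forward pass through an embedding, an LSTM, and a sigmoid dense layer. The weights are random stand-ins rather than the trained weights, the sizes are deliberately tiny, and the gate ordering (input, forget, candidate, output) is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab_size, embed_dim, units, seq_len = 50, 8, 4, 6   # tiny, hypothetical sizes

# Randomly initialised stand-ins for the trained weights.
emb_w = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
lstm_kernel = rng.normal(scale=0.1, size=(embed_dim, 4 * units))
lstm_recurrent = rng.normal(scale=0.1, size=(units, 4 * units))
lstm_bias = np.zeros(4 * units)
dense_w = rng.normal(scale=0.1, size=(units, 1))
dense_b = np.zeros(1)

sample = rng.integers(1, vocab_size, size=(1, seq_len))  # one encoded review

# Layer 1: embedding lookup -> (1, seq_len, embed_dim)
embedding_out = emb_w[sample]

# Layer 2: LSTM, keeping only the final hidden state -> (1, units)
h = np.zeros((1, units))
c = np.zeros((1, units))
for t in range(seq_len):
    z = embedding_out[:, t, :] @ lstm_kernel + h @ lstm_recurrent + lstm_bias
    i, f, g, o = np.split(z, 4, axis=1)   # input, forget, candidate, output
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
lstm_out = h

# Layer 3: dense + sigmoid -> (1, 1), the predicted probability
dense_out = sigmoid(lstm_out @ dense_w + dense_b)

for name, out in [("embedding", embedding_out), ("lstm", lstm_out), ("dense", dense_out)]:
    print(name, out.shape)
```

For the model in this notebook, the corresponding shapes for one padded review would be `(1, 300, 128)`, `(1, 128)`, and `(1, 1)`, matching the output shapes in the model summary.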