# Sentiment Classification


### Generate word embeddings and retrieve the outputs of each layer with Keras on a classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

It is a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.
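
To make "similar meaning, similar representation" concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three vectors are invented for illustration (real embeddings are learned and have many more dimensions), and `cosine_similarity` is a helper written for this example, not part of Keras.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.9, -0.2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_awful = cosine_similarity(toy_vectors["good"], toy_vectors["awful"])
print(sim_good_great, sim_good_awful)
```

With these made-up values, "good" and "great" score close to 1 while "good" and "awful" score below 0, which is the kind of geometry a trained embedding is expected to approximate.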

We will use the IMDB dataset to learn word embeddings as we train our model. This dataset contains 25,000 movie reviews from IMDB, labeled with sentiment (positive or negative).



### Dataset

`from keras.datasets import imdb`

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. To keep training fast, the code below uses a maximum vocabulary size of 10,000 and pads or truncates each review to 300 word indexes.

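As a convention, "0" does not stand for a specific word. In the code below, `pad_sequences` uses 0 as the padding value, while `imdb.load_data` with its default arguments uses 1 to mark the start of a review and 2 to replace out-of-vocabulary words.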
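
The frequency-based indexing described above can be sketched on a tiny corpus. This is only an illustration of the idea: `toy_reviews` and `toy_word_index` are made up, and the alphabetical tie-break is an assumption of this sketch, not necessarily how the real index was built.

```python
from collections import Counter

# Tiny made-up corpus standing in for the raw reviews.
toy_reviews = [
    "the film was great",
    "the plot was thin",
    "the film was long",
]

counts = Counter(word for review in toy_reviews for word in review.split())

# Rank words by descending frequency (ties broken alphabetically),
# so that rank 1 is the most frequent word.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
toy_word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

# Encode every review as a list of integer word indexes.
encoded = [[toy_word_index[w] for w in review.split()] for review in toy_reviews]
print(toy_word_index)
print(encoded)
```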


### Aim

1. Import test and train data  
2. Import the labels (train and test)
3. Get the word index and then create key-value pairs for word and word_id. (12.5 points)
4. Build a Sequential model using Keras for the sentiment classification task. (10 points)
5. Report the accuracy of the model. (5 points)  
6. Retrieve the output of each layer in Keras for a given single test sample from the trained model you built. (2.5 points)


#### Usage:

In [1]:
from keras.datasets import imdb

Using TensorFlow backend.


In [0]:
vocab_size = 10000 #vocab size

In [0]:
import numpy as np
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

In [0]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size) # vocab_size is no.of words to consider from the dataset, ordering based on frequency.
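
As a rough picture of what `num_words` does, the sketch below replaces every index at or above the limit with the out-of-vocabulary marker. `cap_vocabulary` and `toy_sequence` are hypothetical names for this example; the value 2 is the default `oov_char` of `imdb.load_data`.

```python
OOV_CHAR = 2  # default `oov_char` of imdb.load_data

def cap_vocabulary(sequence, num_words, oov_char=OOV_CHAR):
    """Replace every index >= num_words with the out-of-vocabulary marker."""
    return [idx if idx < num_words else oov_char for idx in sequence]

# Made-up sequence containing two rare words (indexes 12000 and 88584).
toy_sequence = [1, 14, 22, 12000, 43, 88584]
capped = cap_vocabulary(toy_sequence, num_words=10000)
print(capped)
```

This is why the decoded review further down contains placeholder tokens in the middle of sentences: rare words were already replaced when the data was loaded.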

In [0]:
np.load = np_load_old
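
The save, patch, and restore steps above can also be wrapped in a context manager, so the original function is restored even if loading raises an exception. The sketch below is one possible way to do that: `temporarily_replaced` is a hypothetical helper, and `fake_np` is a stand-in object so the example runs without NumPy installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def temporarily_replaced(obj, name, replacement):
    """Swap the attribute `name` of `obj` and restore it on exit."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        setattr(obj, name, original)

# Stand-in for the numpy module; its `load` just reports allow_pickle.
fake_np = SimpleNamespace(load=lambda path, allow_pickle=False: allow_pickle)
original_load = fake_np.load

def permissive_load(*args, **kwargs):
    # Same idea as the lambda in the notebook: force allow_pickle=True.
    return original_load(*args, allow_pickle=True, **kwargs)

with temporarily_replaced(fake_np, "load", permissive_load):
    inside = fake_np.load("imdb.npz")   # patched: allow_pickle is True

after = fake_np.load("imdb.npz")        # restored: default is back
print(inside, after)
```

In the notebook, the same pattern would be applied to the real `np` module, with the call to `imdb.load_data(...)` placed inside the `with` block.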

In [0]:
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000  # vocabulary size
maxlen = 300  # number of words kept from each review

In [0]:
#make all sequences of the same length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test =  pad_sequences(x_test, maxlen=maxlen)
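
To see what this step does to an individual review, here is a plain-Python sketch of the default behaviour of `pad_sequences` (padding at the front with 0, and truncating from the front when a review is too long). `pad_or_truncate` is a hypothetical helper for illustration; the notebook itself relies on the Keras function.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Mimic the defaults: pad at the front, truncate from the front."""
    if len(sequence) >= maxlen:
        return list(sequence[-maxlen:])      # keep the last `maxlen` items
    return [value] * (maxlen - len(sequence)) + list(sequence)

short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

padded = pad_or_truncate(short_review, maxlen=5)
truncated = pad_or_truncate(long_review, maxlen=5)
print(padded)
print(truncated)
```

The front padding is the reason the first training review printed below starts with a long run of zeros.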

In [0]:
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv1D, MaxPooling1D, Embedding, LSTM
from keras import callbacks

In [10]:
x_train.shape

(25000, 300)

In [11]:
x_test.shape

(25000, 300)

In [12]:
y_train.shape

(25000,)

In [13]:
y_test.shape


(25000,)

In [14]:
print(np.unique(y_train))
print(np.unique(y_test))

[0 1]
[0 1]


## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps the index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model.
* We can load pre-trained word embeddings into the embedding layer when we create our model.
* We can use the embedding layer to train our own word embeddings.

The Keras embedding layer doesn't require us to one-hot encode our words. Instead, we have to give each word a unique integer id. For the IMDB dataset we've loaded, this has already been done. If it hadn't been, we could use sklearn's [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
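
The "dictionary" view of the embedding layer can be sketched in plain Python: a lookup by word id returns one row of a weight matrix, which gives the same result as multiplying a one-hot vector by that matrix. The matrix values below are invented, and both helper functions are written for this example only.

```python
# Toy embedding matrix: 5 word ids, 3 dimensions each.
# The values are invented; a real layer learns them during training.
embedding_matrix = [
    [0.0, 0.0, 0.0],    # id 0
    [0.1, 0.2, 0.3],    # id 1
    [0.4, 0.5, 0.6],    # id 2
    [0.7, 0.8, 0.9],    # id 3
    [1.0, 1.1, 1.2],    # id 4
]

def embed(sequence, matrix):
    """Look up one row of `matrix` per word id."""
    return [matrix[idx] for idx in sequence]

def one_hot_times_matrix(idx, matrix):
    """Same result via an explicit one-hot vector and a matrix product."""
    one_hot = [1.0 if row == idx else 0.0 for row in range(len(matrix))]
    dims = len(matrix[0])
    return [sum(one_hot[r] * matrix[r][d] for r in range(len(matrix)))
            for d in range(dims)]

vectors = embed([0, 3, 1], embedding_matrix)
print(vectors)
print(one_hot_times_matrix(3, embedding_matrix))
```

The lookup avoids building the one-hot vectors at all, which matters when the vocabulary has 10,000 entries.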

In [15]:
print(x_train[0])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    1   14
   22   16   43  530  973 1622 1385   65  458 4468   66 3941    4  173
   36  256    5   25  100   43  838  112   50  670    2    9   35  480
  284    5  150    4  172  112  167    2  336  385   39    4  172 4536
 1111   17  546   38   13  447    4  192   50   16    6  147 2025   19
   14   22    4 1920 4613  469    4   22   71   87   12   16   43  530
   38   76   15   13 1247    4   22   17  515   17   12   16  626   18
    2    5   62  386   12    8  316    8  106    5    4 2223 5244   16
  480   66 3785   33    4  130   12   16   38  619    5   25  124   51
   36 

In [16]:
word_id = imdb.get_word_index()

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json


In [0]:
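INDEX_FROM = 3

# Shift every index by INDEX_FROM to make room for the special tokens.
word_id = {k: (v + INDEX_FROM) for k, v in word_id.items()}
word_id["_UNK_"] = 0    # in the padded data, id 0 only appears as padding
word_id["_START_"] = 1  # marks the start of a review
word_id["_CUT_"] = 2    # replaces words outside the top vocab_size words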

In [0]:
id_word = {v:k for k,v in word_id.items()}

In [19]:
id_word[4]

u'the'

In [20]:
print(' '.join(id_word[i] for i in x_train[0] ))

_UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _UNK_ _START_ this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert _CUT_ is an amazing actor and now the same being director _CUT_ father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for _CUT_ and would recommend 
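
Because id 0 in the padded data is padding rather than a real token, a decoded review is easier to read when the padding is skipped. The sketch below shows one way to do this on a miniature, made-up word index; the token names `_PAD_` and `_OOV_` and the helper `decode` are assumptions of this example, not names used by Keras.

```python
INDEX_FROM = 3  # same offset as in the notebook

# Hypothetical miniature of the dictionary returned by imdb.get_word_index().
toy_word_index = {"the": 1, "film": 19, "brilliant": 527}

# Shift real words by INDEX_FROM, then reserve ids 0-2 for special tokens.
toy_word_id = {word: idx + INDEX_FROM for word, idx in toy_word_index.items()}
toy_word_id.update({"_PAD_": 0, "_START_": 1, "_OOV_": 2})
toy_id_word = {idx: word for word, idx in toy_word_id.items()}

def decode(sequence, id_to_word, skip_padding=True):
    """Turn a list of word ids back into text, optionally dropping padding."""
    ids = [i for i in sequence if not (skip_padding and i == 0)]
    return " ".join(id_to_word.get(i, "_OOV_") for i in ids)

decoded = decode([0, 0, 1, 4, 22, 2, 530], toy_id_word)
print(decoded)
```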

## Retrieve the output of each layer in Keras for a given single test sample from the trained model you built

In [0]:
from keras import backend as K

TEST_INPUT_INDEX = 3

In [0]:
def print_layer_outputs(model, test_input_index=TEST_INPUT_INDEX):
    input_ = model.input
    # Use the model passed in as an argument, not the global lstm_model.
    outputs = [layer.output for layer in model.layers]
    # Backend function mapping (input, learning phase) to every layer output.
    func = K.function([input_, K.learning_phase()], outputs)

    # Testing
    test = [x_test[test_input_index]]
    # Learning phase 1. means training mode (dropout active); pass 0. for inference mode.
    layer_outs = func([test, 1.])

    for i, layer_out in enumerate(layer_outs):
        print("OUTPUT SHAPE for Layer {} ({}) : {}".format(i+1, outputs[i].name, layer_out.shape))
        print(layer_out)
        print()

    print("EXPECTED OUTPUT LABEL : {}".format(y_test[test_input_index]))
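
The idea behind this function is independent of Keras: feed one sample through the layers in order and keep every intermediate result. A minimal sketch with three made-up "layers" (plain Python functions standing in for real Keras layers) looks like this:

```python
import math

# Toy "layers": plain functions on lists of floats. They are hypothetical
# stand-ins for Keras layers, and their behaviour is made up.
def scale(xs):
    return [2.0 * x for x in xs]

def relu(xs):
    return [max(0.0, x) for x in xs]

def sum_sigmoid(xs):
    return [1.0 / (1.0 + math.exp(-sum(xs)))]

toy_layers = [scale, relu, sum_sigmoid]

def collect_layer_outputs(layers, sample):
    """Run `sample` through each layer in order, keeping every result."""
    outputs = []
    current = sample
    for layer in layers:
        current = layer(current)
        outputs.append(current)
    return outputs

toy_outputs = collect_layer_outputs(toy_layers, [1.0, -2.0, 0.5])
for i, out in enumerate(toy_outputs, start=1):
    print("Layer {} ({}): {}".format(i, toy_layers[i - 1].__name__, out))
```

In the Keras version above, `K.function` plays the role of `collect_layer_outputs`: it evaluates all the layer output tensors in a single call.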

In [24]:
lstm_model = Sequential()
lstm_model.add(Embedding(vocab_size, 128))
lstm_model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
lstm_model.add(Dense(1, activation='sigmoid'))

In [0]:
early_stopping = callbacks.EarlyStopping(monitor='val_loss', min_delta=0,
                                         patience=10, verbose=1, mode='auto')
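
The patience rule used by this callback can be sketched in a few lines. `early_stopping_epoch` is a hypothetical helper that only covers the "lower is better" case, and the loss values are made up; it is meant to show the logic, not to replace the Keras callback.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the 1-based epoch at which training stops, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss + min_delta < best:   # counts as an improvement
            best = loss
            wait = 0
        else:
            wait += 1                 # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

# Made-up losses: improve for 6 epochs, then get steadily worse.
toy_losses = [0.60, 0.50, 0.45, 0.42, 0.40, 0.39]
toy_losses += [0.40 + 0.01 * i for i in range(14)]

stop_epoch = early_stopping_epoch(toy_losses, patience=10)
print(stop_epoch)
```

With `patience=10`, a run whose validation loss last improves at epoch 6 stops at epoch 16, and a run whose best epoch is the first one stops at epoch 11.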

In [26]:
#Compile
lstm_model.compile(loss='binary_crossentropy', 
              optimizer='adam',
              metrics=['accuracy'])
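
For reference, the `binary_crossentropy` loss averages `-[y*log(p) + (1-y)*log(1-p)]` over the samples, where `y` is the true label and `p` the predicted probability. The sketch below computes it by hand on made-up predictions; the clipping constant `eps` is an assumption of this example.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]   # made-up predictions
confident_wrong = [0.1, 0.9, 0.2, 0.8]

loss_right = binary_crossentropy(labels, confident_right)
loss_wrong = binary_crossentropy(labels, confident_wrong)
print(loss_right, loss_wrong)
```

Confident mistakes are penalised heavily, which is one way a model can show a fairly high test loss while its accuracy stays reasonable.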


In [27]:
lstm_model.fit(x_train, y_train, batch_size=64, epochs=20,
          validation_data=(x_test, y_test),
          callbacks=[early_stopping])

loss, acc = lstm_model.evaluate(x_test, y_test, batch_size=64)

print('Test loss (LOWER is better)      :', loss)
print('Test accuracy (HIGHER is better) :', acc)

Train on 25000 samples, validate on 25000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 00016: early stopping
('Test loss (LOWER is better)      :', 0.7491366099262238)
('Test accuracy (HIGHER is better) :', 0.8554000000190735)


In [28]:
print_layer_outputs(lstm_model)

OUTPUT SHAPE for Layer 1 (embedding_1/embedding_lookup/Identity:0) : (1, 300, 128)
[[[ 0.16638324  0.04940548  0.03265361 ...  0.13320364  0.02578399
   -0.08691566]
  [ 0.16638324  0.04940548  0.03265361 ...  0.13320364  0.02578399
   -0.08691566]
  [ 0.16638324  0.04940548  0.03265361 ...  0.13320364  0.02578399
   -0.08691566]
  ...
  [-0.04995142 -0.04405298  0.06257865 ...  0.06440125 -0.020206
   -0.03984457]
  [ 0.07828303  0.03545779  0.01321205 ...  0.02114357  0.00130465
   -0.05563096]
  [-0.01449515  0.05962845 -0.03083712 ...  0.04419639  0.05406221
    0.06366812]]]
()
OUTPUT SHAPE for Layer 2 (lstm_1/TensorArrayReadV3:0) : (1, 128)
[[ 0.37320197 -0.18202047  0.07033734 -0.32115197 -0.08030289 -0.12198625
  -0.00545509  0.21287002 -0.04580798  0.08436047 -0.10227307  0.01241867
  -0.22514728 -0.05820316 -0.00446917 -0.10885498  0.11282193  0.09502894
   0.00080859 -0.070156   -0.17870657 -0.00718841 -0.19159992  0.04050032
  -0.5057934  -0.01663282 -0.19426614 -0.20117939

In [0]:
v_model = Sequential()
v_model.add(Embedding(vocab_size, 128, input_length=maxlen))
v_model.add(Flatten())
v_model.add(Dense(250, activation='relu'))
v_model.add(Dense(1, activation='sigmoid'))
v_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
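
As a sanity check on the size of this model, the number of trainable weights can be worked out by hand from the layer definitions above. The helper `dense_params` is written for this sketch only.

```python
vocab_size, embed_dim, maxlen = 10000, 128, 300

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

embedding_params = vocab_size * embed_dim           # one vector per word id
flattened_size = maxlen * embed_dim                 # Flatten has no weights
hidden_params = dense_params(flattened_size, 250)   # Dense(250)
output_params = dense_params(250, 1)                # Dense(1)
total_params = embedding_params + hidden_params + output_params

print(flattened_size)
print(total_params)
```

The flattened size of 38,400 matches the shape reported for the Flatten layer in the output further down, and most of the roughly 10.9 million weights sit in the first Dense layer.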

In [31]:
#vanilla model
v_model.fit(x_train, y_train, batch_size=64, epochs=20,
          validation_data=(x_test, y_test),
          callbacks=[early_stopping])

Train on 25000 samples, validate on 25000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 00011: early stopping


<keras.callbacks.History at 0x7f5847fe64d0>

In [33]:
loss, acc = v_model.evaluate(x_test, y_test, batch_size=64)

print('Test loss (LOWER is better)      :', loss)
print('Test accuracy (HIGHER is better) :', acc)

('Test loss (LOWER is better)      :', 0.6950853090476989)
('Test accuracy (HIGHER is better) :', 0.863080000038147)


In [0]:
input_ = v_model.input
outputs = [layer.output for layer in v_model.layers]
func = K.function([input_, K.learning_phase()], outputs )

In [0]:
test = [x_test[TEST_INPUT_INDEX]]
layer_outs = func([test, 1.])

In [38]:
for i, layer_out in enumerate(layer_outs):
    print("OUTPUT SHAPE for Layer {} ({}) : {}".format(i+1, outputs[i].name, layer_out.shape))
    print(layer_out)
    print()

print("EXPECTED OUTPUT LABEL : {}".format(y_test[TEST_INPUT_INDEX]))

OUTPUT SHAPE for Layer 1 (embedding_2/embedding_lookup/Identity:0) : (1, 300, 128)
[[[-0.0043788  -0.00404044  0.01119822 ...  0.00757719  0.0016334
   -0.0039446 ]
  [-0.0043788  -0.00404044  0.01119822 ...  0.00757719  0.0016334
   -0.0039446 ]
  [-0.0043788  -0.00404044  0.01119822 ...  0.00757719  0.0016334
   -0.0039446 ]
  ...
  [-0.00062622  0.0127108   0.02629201 ...  0.03687827  0.04996743
   -0.00037604]
  [-0.02858324 -0.02247133  0.01345623 ...  0.0132363   0.01304052
    0.00252912]
  [-0.05393251  0.02738246  0.01452963 ... -0.06496478 -0.12210263
   -0.06148017]]]
()
OUTPUT SHAPE for Layer 2 (flatten_1/Reshape:0) : (1, 38400)
[[-0.0043788  -0.00404044  0.01119822 ... -0.06496478 -0.12210263
  -0.06148017]]
()
OUTPUT SHAPE for Layer 3 (dense_2/Relu:0) : (1, 250)
[[0.67135656 0.5776657  0.511561   0.719964   0.5483909  0.47536692
  0.64605933 0.4882938  0.6224323  0.56616205 0.42435616 0.36288357
  0.63522816 0.0799265  0.5311148  0.         0.6376988  0.5062854
  0.576965

In [39]:
#Cnn Model

cnn_model = Sequential()
cnn_model.add(Embedding(vocab_size, 128, input_length=maxlen))
cnn_model.add(Conv1D(filters=64, kernel_size=3, padding='same', activation='relu'))
cnn_model.add(MaxPooling1D(pool_size=2))
cnn_model.add(Flatten())
cnn_model.add(Dense(250, activation='relu'))
cnn_model.add(Dense(1, activation='sigmoid'))
cnn_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
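
The shapes flowing through this model can be derived from the layer arguments. The sketch below assumes the Keras defaults that the cell above relies on (convolution stride 1, pooling stride equal to `pool_size`, no padding in the pooling layer); the helper names are made up for this example.

```python
def conv1d_same_length(steps, stride=1):
    """With padding='same', the output length is ceil(steps / stride)."""
    return -(-steps // stride)   # integer ceiling division

def maxpool1d_length(steps, pool_size):
    """Pooling without padding, with stride equal to pool_size."""
    return (steps - pool_size) // pool_size + 1

maxlen, embed_dim, filters, kernel_size = 300, 128, 64, 3

conv_steps = conv1d_same_length(maxlen)                     # time steps after Conv1D
pooled_steps = maxpool1d_length(conv_steps, pool_size=2)    # time steps after pooling
flattened_size = pooled_steps * filters                     # inputs to Dense(250)
conv_params = kernel_size * embed_dim * filters + filters   # kernel weights + biases

print((conv_steps, filters), (pooled_steps, filters), flattened_size, conv_params)
```

Pooling halves the sequence, so the first Dense layer sees 9,600 inputs instead of the 38,400 of the plain MLP above.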


In [40]:
cnn_model.fit(x_train, y_train, batch_size=64, epochs=20,
          validation_data=(x_test, y_test),
          callbacks=[early_stopping])

loss, acc = cnn_model.evaluate(x_test, y_test, batch_size=64)

print('Test loss (LOWER is better)      :', loss)
print('Test accuracy (HIGHER is better) :', acc)

Train on 25000 samples, validate on 25000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 00011: early stopping
('Test loss (LOWER is better)      :', 0.7667903868770599)
('Test accuracy (HIGHER is better) :', 0.8800399999809265)


In [0]:
input_ = cnn_model.input                               
outputs = [layer.output for layer in cnn_model.layers]
func = K.function([input_, K.learning_phase()], outputs )

In [0]:
test = [x_test[TEST_INPUT_INDEX]]
layer_outs = func([test, 1.])

In [43]:
for i, layer_out in enumerate(layer_outs):
    print("OUTPUT SHAPE for Layer {} ({}) : {}".format(i+1, outputs[i].name, layer_out.shape))
    print(layer_out)
    print()

print("EXPECTED OUTPUT LABEL : {}".format(y_test[TEST_INPUT_INDEX]))

OUTPUT SHAPE for Layer 1 (embedding_3/embedding_lookup/Identity:0) : (1, 300, 128)
[[[-0.03964212 -0.0802965  -0.05474008 ...  0.01203048 -0.04466333
    0.0506419 ]
  [-0.03964212 -0.0802965  -0.05474008 ...  0.01203048 -0.04466333
    0.0506419 ]
  [-0.03964212 -0.0802965  -0.05474008 ...  0.01203048 -0.04466333
    0.0506419 ]
  ...
  [ 0.00441057  0.04100005  0.01868439 ...  0.04971323  0.04151232
   -0.02462053]
  [-0.11718191  0.00831428 -0.02294085 ... -0.05033892  0.01161678
    0.02111319]
  [ 0.03288076  0.04441469 -0.00708851 ... -0.03903577  0.03864863
    0.02397821]]]
()
OUTPUT SHAPE for Layer 2 (conv1d_1/Relu:0) : (1, 300, 64)
[[[0.         0.         0.         ... 0.         0.         0.        ]
  [0.         0.         0.         ... 0.         0.         0.        ]
  [0.         0.         0.         ... 0.         0.         0.        ]
  ...
  [0.         0.1331963  0.         ... 0.         0.00190557 0.        ]
  [0.         0.         0.         ... 0.065358

In [0]:
#Logistic Regression

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss, f1_score
from sklearn.pipeline import Pipeline

In [45]:
x_train_s = [' '.join(map(str, row)) for row in x_train]
x_test_s = [' '.join(map(str, row)) for row in x_test]

# We'll just use the default values.
pipeline = Pipeline([('counter', TfidfVectorizer()), 
                     ('classifier', LogisticRegression())])
pipeline.fit(x_train_s, y_train)



Pipeline(memory=None,
     steps=[('counter', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.float64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))])
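
One detail worth knowing about this baseline: with default settings, scikit-learn's text vectorizers only keep tokens of two or more characters. Since each review was turned into a string of integer ids, ids 0 to 9 (including the padding and the special markers) are silently dropped. The sketch below applies the default `token_pattern` directly with `re`; the sample row is made up.

```python
import re

# Default `token_pattern` of scikit-learn's text vectorizers:
# a token must contain at least two word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Made-up row of word ids joined with spaces, as in x_train_s.
row_as_text = "0 0 0 1 14 22 16 43 530 4 173"
tokens = re.findall(DEFAULT_TOKEN_PATTERN, row_as_text)
print(tokens)
```

Dropping the padding zeros is convenient here, but the same rule also removes the ids of a few very frequent words.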

In [46]:
y_sklearn = pipeline.predict(x_test_s)
y_proba_sklearn = pipeline.predict_proba(x_test_s)

print('Test loss     : {}'.format(log_loss(y_test, y_proba_sklearn)))
print('Test accuracy : {}'.format(accuracy_score(y_test, y_sklearn)))
print('Test f1 score : {}'.format(f1_score(y_test, y_sklearn)))

Test loss     : 0.310943791863
Test accuracy : 0.88408
Test f1 score : 0.884357541899
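
For readers unfamiliar with the F1 score, the sketch below computes accuracy and F1 by hand on a small made-up example, treating label 1 as the positive class. `accuracy_and_f1` is a hypothetical helper; the notebook itself uses the scikit-learn functions.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / float(len(pairs))
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)

toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_accuracy, toy_f1 = accuracy_and_f1(toy_true, toy_pred)
print(toy_accuracy, toy_f1)
```

On the IMDB test set the two classes are balanced, so accuracy and F1 end up close to each other, as in the output above.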


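**Sentiment Analysis Observation**

The figures below differ slightly from the cell outputs above, which suggests they were recorded from a different run.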

**LSTM**

* Test loss (LOWER is better) : 0.6593842343902588
* Test accuracy (HIGHER is better) : 0.8450000000190735
* Epochs : 13
* Overfitted : Yes

**NN/MLP**

* Test loss (LOWER is better) : 0.7345848669528962
* Test accuracy (HIGHER is better) : 0.8631200000381469
* Epochs : 11
* Overfitted : Yes

**CNN**

* Test loss (LOWER is better) : 0.7345374908304214
* Test accuracy (HIGHER is better) : 0.8804400000190735
* Epochs : 11
* Overfitted : Yes

**Logistic Regression**

* Test loss : 0.31094379459836513
* Test accuracy : 0.88408
* Test f1 score : 0.8843575418994414

**Conclusion**

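In the runs shown above, the LSTM-based RNN was the slowest to overfit: with `patience=10`, early stopping at epoch 16 means its validation loss last improved at epoch 6.

The MLP and the CNN overfitted quickly: both stopped at epoch 11, so their validation loss stopped improving after the first epoch.

Even though the LSTM-based RNN has the lowest accuracy, it may generalize better on a bigger dataset because of its ability to learn complex relations over larger contexts.

With the small dataset provided to these models, a simple Logistic Regression performed best: it reached the highest test accuracy (0.88408) and by far the lowest test loss (about 0.31, against roughly 0.66 to 0.77 for the neural models).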