<a href="https://colab.research.google.com/github/shangeth/Google-ML-Academy/blob/master/2-Deep-Neural-Networks/2_9_ANN_Natural_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><a href='https://shangeth.com/courses/'>Google Explore ML Academy</a></h1>
<h3>Instructor: <a href='https://shangeth.com/'>Shangeth Rajaa</a></h3>
<hr>

# Sentiment Analysis

We are going to classify a movie review as positive or negative, given the text of the review.

![](https://cfml.se/img/blog/sentiment_classification/top_img.png)


We'll use the IMDB dataset that contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

# IMDB Dataset

## Download the Dataset

In [0]:
import tensorflow as tf

imdb = tf.keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

## Explore the Dataset

In [87]:
train_data.shape, train_labels.shape, test_data.shape, test_labels.shape

((25000,), (25000,), (25000,), (25000,))

In [88]:
len(train_data[0])

218

Each of these numbers corresponds to a word.

In [89]:
word_2_int = tf.keras.datasets.imdb.get_word_index(path='imdb_word_index.json')

word_2_int['hello'], word_2_int['world']

(4822, 179)

In [90]:
len(word_2_int)

88584

In [91]:
def sentence_2_int(sentence):
    sentence_2_int_list = []
    for i in sentence.lower().split(' '):
        sentence_2_int_list.append(word_2_int[i])
    return sentence_2_int_list

sentence = "Worst movie i've ever seen"
sentence_2_int(sentence)

[246, 17, 204, 123, 107]
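One detail is worth checking before reusing this lookup for predictions. `imdb.load_data` has an `index_from` argument that defaults to 3, so the integers stored in `train_data` are shifted by 3 relative to the ranks returned by `get_word_index()`. The values 0, 1 and 2 are reserved for padding, start-of-sequence and out-of-vocabulary markers. The sketch below shows one way an encoder could apply the same shift. Here `toy_word_index` is a made-up miniature of `word_2_int`, and `encode_like_training` is a hypothetical helper name, not part of Keras.

```python
# Default special ids and offset used by tf.keras.datasets.imdb.load_data.
PAD_ID, START_ID, OOV_ID = 0, 1, 2
INDEX_FROM = 3

# Hypothetical miniature of word_2_int (rank 1 = most frequent word).
toy_word_index = {"the": 1, "movie": 17, "worst": 246}


def encode_like_training(sentence, word_index, num_words=10000):
    """Encode a sentence with the same shift that load_data applies."""
    ids = [START_ID]  # load_data prepends the start marker to every review
    for word in sentence.lower().split():
        rank = word_index.get(word)  # None when the word is unknown
        if rank is None or rank + INDEX_FROM >= num_words:
            ids.append(OOV_ID)  # unknown or too rare for the chosen vocabulary
        else:
            ids.append(rank + INDEX_FROM)
    return ids


encoded = encode_like_training("Worst movie zzzunknown", toy_word_index)
print(encoded)  # [1, 249, 20, 2]
```

Using `.get` instead of direct indexing also avoids a `KeyError` when a sentence contains a word that is missing from the dictionary.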

## Preparing the data

### Padding

Reviews vary in length, but the ANN model needs fixed-length input.
So we
- pad short sentences
- truncate very long sentences

In [92]:
pad_value = 0       # integer used to fill short reviews
sentence_len = 100  # every review is padded or truncated to this many tokens

train_data = tf.keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=pad_value,
                                                        padding='post',
                                                        maxlen=sentence_len)

test_data = tf.keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=pad_value,
                                                       padding='post',
                                                       maxlen=sentence_len)

train_data.shape, train_labels.shape, test_data.shape, test_labels.shape

((25000, 100), (25000,), (25000, 100), (25000,))
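To make the padding step concrete, here is a plain-Python sketch of what happens to a single review. It assumes the settings used above (`padding='post'`) together with the `pad_sequences` default `truncating='pre'`, which drops tokens from the start of a long sequence. `pad_or_truncate` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_or_truncate(sequence, maxlen, pad_value=0):
    """Return a list of exactly `maxlen` items (maxlen must be positive).

    Long sequences keep their last `maxlen` items (like truncating='pre');
    short sequences are filled at the end (like padding='post').
    """
    kept = list(sequence[-maxlen:])
    return kept + [pad_value] * (maxlen - len(kept))


short_review = pad_or_truncate([5, 6, 7], maxlen=5)
long_review = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_review)  # [5, 6, 7, 0, 0]
print(long_review)   # [3, 4, 5, 6, 7]
```

Note that with these defaults a long review loses its opening words. Passing `truncating='post'` to `pad_sequences` keeps the beginning of the review instead.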

# ANN Model

We will add a new layer to our model called the Embedding layer.

The words of each sentence are now represented by integers. We cannot feed those integers directly into the model, because the numeric value of a word index carries no meaning. We need to convert each integer into a vector that the model can learn from.

But how do we decide the vectors? We leave that to the model: they are trained along with the other weights.

So the Embedding layer takes in an integer and converts it into a vector.
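Conceptually, an embedding layer behaves like a lookup table with one row per word index. The sketch below illustrates this in plain Python under simplifying assumptions: the table is tiny, its values are small random numbers, and nothing is trained. The names `embedding_table` and `embed` are hypothetical and only serve the illustration.

```python
import random

random.seed(0)

toy_vocab_size = 6     # number of distinct word indices in this toy example
toy_embedding_dim = 4  # length of the vector stored for each index

# One row per word index; in a real model these values are learned.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(toy_embedding_dim)]
    for _ in range(toy_vocab_size)
]


def embed(token_ids, table):
    """Replace every integer id with the row stored at that position."""
    return [table[token_id] for token_id in token_ids]


vectors = embed([3, 1, 3, 0], embedding_table)
print(len(vectors), len(vectors[0]))  # 4 4
print(vectors[0] == vectors[2])       # True: the same id gives the same vector
```

A sentence of 4 integers becomes a 4 x 4 grid of numbers, in the same way that a padded review of 100 integers becomes a 100 x 16 grid in the model below.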

In [98]:
vocab_size = 10000   # must match num_words passed to imdb.load_data
embedding_dim = 16   # length of the vector learned for each word

# tensorflow was already imported as tf in the first cell
tf.keras.backend.clear_session()

model = tf.keras.Sequential([
                             tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=sentence_len),
                             # the model will take as input an integer matrix of size (batch, input_length).
                             # now model.output_shape == (None, 100, 16), where None is the batch dimension.
                             tf.keras.layers.Dropout(0.4),
                             tf.keras.layers.Flatten(),
                             tf.keras.layers.Dense(units=sentence_len*embedding_dim), 
                             tf.keras.layers.Activation('relu'),
                             tf.keras.layers.Dropout(0.4),
                             tf.keras.layers.Dense(units=500),
                             tf.keras.layers.Activation('relu'),
                             tf.keras.layers.Dropout(0.4),
                             tf.keras.layers.Dense(units=1), 
                             tf.keras.layers.Activation('sigmoid')
                             ])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
dropout (Dropout)            (None, 100, 16)           0         
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 1600)              2561600   
_________________________________________________________________
activation (Activation)      (None, 1600)              0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 1600)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 500)               8
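The parameter counts in the summary can be checked by hand. An embedding layer stores one vector per word, and a dense layer stores one weight per input-output pair plus one bias per unit. The short sketch below reproduces the two complete rows of the summary; the helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    """One vector of length embedding_dim for every word index."""
    return vocab_size * embedding_dim


def dense_params(n_inputs, units):
    """A weight for every (input, unit) pair plus one bias per unit."""
    return n_inputs * units + units


vocab_size, embedding_dim, sentence_len = 10000, 16, 100
flattened = sentence_len * embedding_dim  # size of the Flatten output

print(embedding_params(vocab_size, embedding_dim))  # 160000
print(flattened)                                    # 1600
print(dense_params(flattened, flattened))           # 2561600
```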

In [99]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)  # `lr` is a deprecated alias
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Note: the test set doubles as the validation set here.
tf_history = model.fit(train_data, train_labels,
                       batch_size=2000, epochs=10, verbose=True,
                       validation_data=(test_data, test_labels))

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [0]:
model.save('trained_model.h5')

# Model Pipeline

## Load trained Model

In [103]:
trained_model = tf.keras.models.load_model('trained_model.h5')

trained_model.summary()

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
dropout (Dropout)            (None, 100, 16)           0         
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 1600)              2561600   
_________________________________________________________________
activation (Activation)      (None, 1600)              0         
__________________

## Sentence to Vectors

In [111]:
import numpy as np

def sentence_2_int(sentence):
    sentence_2_int_list = []
    for i in sentence.lower().split(' '):
        sentence_2_int_list.append(word_2_int[i])
    arr = np.array(sentence_2_int_list).reshape(1,-1)
    arr = tf.keras.preprocessing.sequence.pad_sequences(arr, value=0, padding='post', maxlen=100)
    return arr

sentence = "Worst movie i've ever seen"
sentence_2_int(sentence)

array([[246,  17, 204, 123, 107,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0]], dtype=int32)

In [112]:
trained_model.predict(sentence_2_int(sentence))

array([[0.22403198]], dtype=float32)

In [0]:
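def sentence_2_prediction(sentence):
    vector = sentence_2_int(sentence)
    # predict returns an array of shape (1, 1); take the scalar probability
    prob = trained_model.predict(vector)[0, 0]
    # probabilities above 0.5 are treated as positive
    if prob > 0.5:
        print('Positive Review :D')
    else:
        print('Negative Review :(')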

In [114]:
sentence = 'Good Movie i really enjoyed it'

sentence_2_prediction(sentence)

Positive Review :D


In [121]:
sentence = 'worst movie'

sentence_2_prediction(sentence)

Negative Review :(


If you try many different sentences, you may notice that the model doesn't actually perform well. There may be many reasons for this:

- vocabulary size
- sequence length (number of words kept per sentence after padding)
- model architecture
- embedding dimension

Retrain the model while varying each of the above to improve its performance.
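One way to organise these experiments is to list the combinations to try before training anything. The sketch below enumerates a small grid of settings. The candidate values in `search_space` are arbitrary examples, not recommendations, and `candidate_configs` is a hypothetical helper.

```python
from itertools import product

# Example values only; pick ranges that suit your own experiments.
search_space = {
    "vocab_size": [10000, 20000],
    "sentence_len": [100, 200],
    "embedding_dim": [16, 32],
}


def candidate_configs(space):
    """Yield one dict per combination of the listed settings."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(candidate_configs(search_space))
print(len(configs))  # 8
print(configs[0])    # {'vocab_size': 10000, 'sentence_len': 100, 'embedding_dim': 16}
```

Keep in mind that changing `vocab_size` means reloading the data with a matching `num_words`, and changing `sentence_len` means padding the data again, before the model is rebuilt and retrained.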
