<h3>Sentiment Analysis</h3>

We are going to use the <b>IMDB</b> movie review dataset, in which each review is labeled as positive or negative.

Let's start by importing the necessary libraries.

In [1]:
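from keras.datasets import imdb
from keras.preprocessing import sequence
import tensorflow as tf
import numpy as np

VOCAB_SIZE = 88584  # number of distinct words in the vocabulary
MAXLEN = 250        # every review is padded or trimmed to this many words
BATCH_SIZE = 64     # number of reviews per training batch

# Each review is loaded as a list of integer word indices
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=VOCAB_SIZE)

# Uncomment to inspect the first encoded review
#train_data[0]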

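Reviews have different lengths, but the model needs inputs of a fixed size.
If a review is longer than 250 words, the extra words are trimmed off.
If a review is shorter than 250 words, it is padded with zeros until it reaches 250.
By default, pad_sequences trims and pads at the start of the sequence.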

In [2]:
train_data = sequence.pad_sequences(train_data, MAXLEN)
test_data  = sequence.pad_sequences(test_data, MAXLEN)
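
To see what the padding step does, here is a minimal pure-Python sketch of the same idea. The helper name pad_or_truncate and the sample reviews are made up for illustration; it only imitates the default behaviour of pad_sequences (padding and truncating both set to 'pre') and is not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Imitate the default 'pre' padding and 'pre' truncating of pad_sequences."""
    seq = list(seq)
    if len(seq) >= maxlen:
        # Too long: keep the last `maxlen` items and drop the start of the review
        return seq[len(seq) - maxlen:]
    # Too short: add `value` at the front until the length is `maxlen`
    return [value] * (maxlen - len(seq)) + seq


short_review = [12, 17, 13]       # hypothetical 3-word review
long_review = list(range(1, 8))   # hypothetical 7-word review

print(pad_or_truncate(short_review, 5))  # [0, 0, 12, 17, 13]
print(pad_or_truncate(long_review, 5))   # [3, 4, 5, 6, 7]
```

With MAXLEN = 250 the same thing happens to every review, so all rows of train_data and test_data end up with 250 entries.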

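Let's build our model. It has a sequential architecture: an embedding layer, one LSTM layer, and a dense output layer. 64 is the output dimension of the vectors generated by the embedding layer, and 32 is the number of units in the LSTM layer. The final dense layer uses a sigmoid activation, so the model outputs a single value between 0 and 1.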

In [3]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          5669376   
_________________________________________________________________
lstm (LSTM)                  (None, 32)                12416     
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 5,681,825
Trainable params: 5,681,825
Non-trainable params: 0
_________________________________________________________________
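
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic with the usual formulas for these layer types (an embedding table, an LSTM with four gates, and a dense layer with a bias). The constant names EMBEDDING_DIM and LSTM_UNITS are ours, not from the notebook.

```python
VOCAB_SIZE = 88584
EMBEDDING_DIM = 64   # second argument of the Embedding layer
LSTM_UNITS = 32      # argument of the LSTM layer

# One 64-dimensional vector per word in the vocabulary
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# Four gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (EMBEDDING_DIM * LSTM_UNITS + LSTM_UNITS * LSTM_UNITS + LSTM_UNITS)

# One output unit: one weight per LSTM unit plus a bias
dense_params = LSTM_UNITS * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 5669376 12416 33 5681825
```

Almost all of the parameters sit in the embedding layer, which is typical when the vocabulary is large.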


We will now train the model on the training data and labels, holding out 10% of the data for validation.

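Important hyper-parameters include: 
10 epochs & a
batch size of 64 (BATCH_SIZE)

In [4]:
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=['acc'])
# Pass BATCH_SIZE explicitly; otherwise Keras falls back to its default batch size of 32
history = model.fit(train_data, train_labels, epochs=10, batch_size=BATCH_SIZE, validation_split=0.1)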

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
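
To get a feel for what validation_split=0.1 means, here is a rough sketch. As far as we know, Keras holds out the last 10% of the samples (taken before any shuffling) and trains on the rest. The helper names holdout_split and steps_per_epoch are made up for this example, and the figure of 25,000 training reviews is an assumption about the standard IMDB split rather than something printed above.

```python
import math


def holdout_split(samples, validation_split):
    """Split off the last `validation_split` fraction of the samples."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]


def steps_per_epoch(n_train, batch_size):
    """Number of batches needed to go through the training part once."""
    return math.ceil(n_train / batch_size)


toy_samples = list(range(20))  # stand-in for 20 reviews
train_part, val_part = holdout_split(toy_samples, 0.1)
print(len(train_part), val_part)  # 18 [18, 19]

# Assumption: 25,000 training reviews and a batch size of 64
n_train = int(25000 * (1.0 - 0.1))
print(n_train, steps_per_epoch(n_train, 64))  # 22500 352
```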


Let's evaluate the model using test data and labels. The model is about 85.5% accurate on the test set.

In [5]:
results = model.evaluate(test_data, test_labels)
print(results)


#model.save('sentiment-analysis.h5')

[0.5146768093109131, 0.8551200032234192]
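
The list printed above holds the loss first and then the accuracy, in the order given to model.compile. A small sketch for reading it, with the two values copied from the output above:

```python
# Values copied from the evaluation output above: [loss, accuracy]
results = [0.5146768093109131, 0.8551200032234192]

loss, accuracy = results
print(f"loss={loss:.4f}, accuracy={accuracy:.2%}")
# loss=0.5147, accuracy=85.51%
```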


Raw text cannot be fed directly to the model, so it first has to be converted into a form the machine can work with. The first step is tokenization: splitting strings into smaller parts called tokens. Each token is then mapped to its integer index in the IMDB vocabulary.

A token is a sequence of characters in text that serves as a unit. Based on how you create the tokens, they may consist of words, emoticons, hashtags, links, or even individual characters. A basic way of breaking language into tokens is by splitting the text based on whitespace and punctuation.
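
Here is a minimal sketch of that basic approach, splitting on whitespace and punctuation. The function simple_tokenize is a simplified stand-in written for this example. It is not the Keras tokenizer and treats some characters (apostrophes, for instance) differently.

```python
import string


def simple_tokenize(text):
    """Lowercase the text, replace punctuation with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()


print(simple_tokenize("that movie was just amazing, so amazing"))
# ['that', 'movie', 'was', 'just', 'amazing', 'so', 'amazing']
```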

In [6]:
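from keras.preprocessing.text import text_to_word_sequence

word_index = imdb.get_word_index()

def encode_text(text):
    # Split the text into lowercase word tokens
    tokens = text_to_word_sequence(text)
    # Map each word to its integer index; unknown words become 0
    tokens = [word_index[word] if word in word_index else 0 for word in tokens]
    # Pad to MAXLEN so the result matches the model's input length
    return sequence.pad_sequences([tokens], MAXLEN)[0]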

text = "that movie was just amazing, so amazing"
encoded = encode_text(text)
print(encoded)

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0  12  17  13  4

In [7]:
reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode_integers(integers):
    PAD = 0
    # Skip the padding and look up the word for every remaining index
    words = [reverse_word_index[num] for num in integers if num != PAD]
    return " ".join(words)

print(decode_integers(encoded))

that movie was just amazing so amazing
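
One thing to keep in mind: as far as we can tell, imdb.load_data by default shifts every word index by 3 (index_from=3) and reserves the low ids for special tokens such as out-of-vocabulary words (id 2), while get_word_index returns the unshifted indices. The encoder above does not apply that shift. The sketch below shows what an offset-aware version could look like on a toy vocabulary. The helper names and the toy dictionary are hypothetical; only the three indices are taken from the encoded output above.

```python
INDEX_FROM = 3  # assumption: default index_from of imdb.load_data
OOV_ID = 2      # assumption: default oov_char of imdb.load_data

# Hypothetical slice of get_word_index(); indices taken from the output above
toy_word_index = {"that": 12, "was": 13, "movie": 17}


def encode_with_offset(tokens, word_index):
    """Shift known words by INDEX_FROM; map unknown words to OOV_ID."""
    return [word_index[w] + INDEX_FROM if w in word_index else OOV_ID for w in tokens]


def decode_with_offset(ids, word_index):
    """Undo the shift; ids without a matching word are shown as '?'."""
    reverse = {idx + INDEX_FROM: word for word, idx in word_index.items()}
    return " ".join(reverse.get(i, "?") for i in ids)


ids = encode_with_offset(["that", "movie", "was", "great"], toy_word_index)
print(ids)                                      # [15, 20, 16, 2]
print(decode_with_offset(ids, toy_word_index))  # that movie was ?
```

If the shift matters for your results, it is worth comparing predictions from both encodings on a few reviews.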


Let's test the model using sample reviews.

We will use one positive and one negative review. The model outputs a value between 0 and 1, so the output for the positive review should be higher than that for the negative review.

In [8]:
def predict(text):
    encoded_text = encode_text(text)
    # The model expects a batch, so wrap the review in an array of shape (1, MAXLEN)
    pred = np.zeros((1, MAXLEN))
    pred[0] = encoded_text
    result = model.predict(pred)
    print(result[0])

positive_review = "It not only serves as an item of good entertainment, but is also admirable in depicting the scientific changes in the USA in the 1960s, the social life issues of that era, and differences that existed in the country, especially among African-Americans."
predict(positive_review)

negative_review = "If you can keep both eyes open through its whole three-hour length you're a better man than I am."
predict(negative_review)

[0.9882065]
[0.49058613]


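The positive review scores about 0.99, while the negative review scores about 0.49. That is only just below 0.5, so the model is much less certain about the negative review. You can try out testing your own reviews.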
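
To turn the raw scores into labels, a common convention is to cut at 0.5. Here is a small sketch using the two scores printed above. The helper to_label and the 0.5 threshold are our own choices for illustration, not part of the model.

```python
def to_label(score, threshold=0.5):
    """Map a sigmoid output to a sentiment label."""
    return "positive" if score >= threshold else "negative"


# Scores copied from the prediction output above
scores = {"positive_review": 0.9882065, "negative_review": 0.49058613}

for name, score in scores.items():
    print(f"{name}: {score:.3f} -> {to_label(score)}")
# positive_review: 0.988 -> positive
# negative_review: 0.491 -> negative
```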

<h3>Conclusion</h3>

You can load pre-trained models and have some fun with <b>Sentiment Analysis</b>. The cell below assumes the model was saved earlier by uncommenting the model.save line.


In [None]:
new_model = tf.keras.models.load_model('sentiment-analysis.h5')

Hope you had fun. Thanks