# Get Started With Sentiment Analysis Using TensorFlow Keras

Sentiment analysis is a text classification task in which a given text is classified as positive or negative (and sometimes neutral) based on its context. This session discusses sentiment analysis using TensorFlow Keras with the IMDB movie reviews dataset, one of the best-known sentiment analysis datasets. TensorFlow’s Keras API offers the complete functionality required to build and execute a deep learning model. This session assumes that the reader is familiar with the basics of deep learning and Recurrent Neural Networks (RNNs).

References:

https://www.tensorflow.org/datasets/catalog/imdb_reviews

https://www.tensorflow.org/text/tutorials/text_classification_rnn


To read more about it, please refer to [this](https://analyticsindiamag.com/getting-started-sentiment-analysis-tensorflow-keras/) article.

# Code Implementation

## Create the Environment

Create the necessary Python environment by installing and importing the frameworks and libraries.

In [None]:
!python -m pip install pip --upgrade --user -q --no-warn-script-location
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn tensorflow tensorflow-datasets nltk gensim --user -q --no-warn-script-location

import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow import keras
from tensorflow.keras.layers import Dense, Dropout, Bidirectional, LSTM
import matplotlib.pyplot as plt

## Download the IMDB dataset 

The IMDB reviews dataset is available in TensorFlow Datasets in different variants:

1. Plain text reviews
2. Byte-encoded texts
3. Integer-encoded texts with a vocabulary of around 8k
4. Integer-encoded texts with a vocabulary of around 32k

Here, we use the variant that has integer-encoded texts with a vocabulary of around 8k entries.

In [None]:
data, meta = tfds.load('imdb_reviews/subwords8k',
                      with_info = True,
                      as_supervised = True)

In [None]:
data.keys()

We do not require the unsupervised (unlabeled) split. Hence, we keep only the train and test splits.

In [None]:
train = data['train']
test = data['test']
train, test

## Prepare an Encoder

We have discussed that the dataset comes with texts encoded into integers. Encoding into integers is necessary since a model operates only on numbers. However, humans cannot read those integer sequences. Hence, we need a decoder that reverses the encoding, so that we can convert the numbers back into text and read them in English. We also need an encoder that can convert an example text (from outside the dataset) into integers.

Metadata that comes with the dataset contains the encoder originally used while preparing the dataset. It can perform encoding and decoding operations.

In [None]:
# explore the features in metadata
meta.features

It can be observed that metadata contains the encoder under the key ‘text’.

In [None]:
# extract the encoder
encoder = meta.features['text'].encoder

The encoded integers start from 1 and stay below the vocabulary size, while 0 is left for padding. How large is the encoder's vocabulary?

In [None]:
encoder.vocab_size

Which subwords does the vocabulary contain? Print the first 100 of them.

In [None]:
print(encoder.subwords[:100])

Test the encoder by sampling a sentence, encoding it into integers, and decoding back into text.

In [None]:
example = 'Analytics India Magazine !'
enc = encoder.encode(example)
enc

We have provided a sentence with three words and one exclamation mark, but it is encoded into an eleven-element integer list. This happens because the encoder splits words it does not hold as a whole into smaller subword pieces. These pieces are technically called tokens. Let’s explore the numbers and corresponding tokens by using the decode method.

In [None]:
for integer in enc:
    text = encoder.decode([integer])
    print('%4d : %s'%(integer, text))
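
The splitting of words into smaller pieces can be illustrated with a toy tokenizer. The following is only a minimal sketch of greedy longest-match splitting: the vocabulary `TOY_VOCAB` and the function `toy_encode` are hypothetical, and the actual encoder shipped with the dataset uses its own vocabulary and rules, so its output will differ.

```python
# Toy greedy longest-match subword tokenizer (illustrative only; this is
# NOT the algorithm or the vocabulary used by the dataset's encoder).
TOY_VOCAB = ["Analy", "tics", "India", "Maga", "zine", " ", "!"]  # hypothetical

def toy_encode(text, vocab=TOY_VOCAB):
    """Split text into the longest known pieces, left to right.

    Ids start at 1 so that 0 stays free for padding. Characters that are
    not covered by the vocabulary become one-character pieces with id -1.
    """
    by_length = sorted(vocab, key=len, reverse=True)
    ids, pieces = [], []
    pos = 0
    while pos < len(text):
        # Try the longest vocabulary entries first.
        match = next((v for v in by_length if text.startswith(v, pos)), None)
        if match is None:
            ids.append(-1)
            pieces.append(text[pos])
            pos += 1
        else:
            ids.append(vocab.index(match) + 1)
            pieces.append(match)
            pos += len(match)
    return ids, pieces

ids, pieces = toy_encode("Analytics India Magazine !")
for i, p in zip(ids, pieces):
    print("%4d : %r" % (i, p))
```

Even with this tiny vocabulary, three words and a punctuation mark turn into nine tokens, because spaces and word fragments are tokens of their own.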

In [None]:
it = iter(train)

In [None]:
# take one (text, label) example so that both shapes come from the same review
text, label = next(it)
text.numpy().shape, label.numpy().shape

## Preprocess the Dataset

The input texts are of variable lengths, but a deep learning model cannot accept inputs of different sizes within a batch. We therefore have to fix the length of each input sequence. If a sequence has fewer tokens than the fixed length, it is padded with zeros. This is accomplished with the padded_batch method, which pads the sequences in a batch to a common length. Since a large vocabulary makes direct manipulation of token ids impractical, each token should be embedded into a small, dense vector representation. We perform this step with an Embedding layer.

In [None]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64
AUTOTUNE = tf.data.AUTOTUNE

train_data = train.shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE, padded_shapes=([None],[]))
train_data = train_data.prefetch(AUTOTUNE)

test_data = test.padded_batch(BATCH_SIZE, padded_shapes=([None],[]))
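
What padding does to a batch can be seen in plain Python. This is a minimal sketch with made-up token ids; `pad_batch` is a hypothetical helper written for illustration, not part of TensorFlow.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = [[12, 7, 431], [5], [88, 19, 2, 640, 3]]  # made-up token ids
padded = pad_batch(batch)
for row in padded:
    print(row)
```

Every row now has the length of the longest review in the batch, so the batch can be stored as one rectangular tensor.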

In [None]:
embed_layer = keras.layers.Embedding(encoder.vocab_size, 64)
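
Conceptually, an embedding layer is a lookup table with one row per token id. The sketch below mimics that lookup in plain Python, assuming toy sizes and random values in place of trained weights; the names `table` and `embed` are illustrative only.

```python
import random

random.seed(0)
VOCAB_SIZE, EMBED_DIM = 10, 4  # toy sizes; the notebook uses encoder.vocab_size and 64

# One row of EMBED_DIM numbers per token id. In Keras these values are
# trainable weights; here they are just random numbers.
table = [[random.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
         for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Replace every token id by its row in the table."""
    return [table[i] for i in token_ids]

sequence = [3, 7, 7, 0]  # made-up ids, with 0 standing in for padding
vectors = embed(sequence)
print(len(vectors), len(vectors[0]))  # sequence length, embedding size
```

The same id always maps to the same vector, so a sequence of integers becomes a sequence of small dense vectors that the recurrent layers can process.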

## Build Model

Unlike images and structured data, texts have a sequential order of tokens that contributes to the context. Hence, the deep learning model should be able to remember past tokens in order when processing a specific token. This is achieved by implementing either Recurrent Neural Networks or Transformers. Here, we prefer Recurrent Neural Networks with LSTM units to model our problem. LSTM (Long Short-Term Memory) units keep the relevant part of the already processed embedded sequence in memory and thereby model the sequential relationships among tokens. LSTM units can be wrapped in bi-directional layers so that the model can understand the context of a sentence in both directions, namely, left-to-right and right-to-left.

In [None]:
model = keras.Sequential([
    # embedding layer
    embed_layer,
    # bidirectional LSTM layers
    Bidirectional(LSTM(64, 
                       dropout=0.5, 
                       recurrent_dropout=0.5, 
                       return_sequences=True)),
    Bidirectional(LSTM(32, 
                       dropout=0.5, 
                       recurrent_dropout=0.5, 
                       return_sequences=True)),
    Bidirectional(LSTM(16, 
                       dropout=0.5, 
                       recurrent_dropout=0.5)),
    # Classification head
    Dense(64, activation='relu', kernel_regularizer='l2'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')    
])
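
The idea behind a bi-directional layer, and the effect of return_sequences, can be sketched with a toy recurrence. The code below is a simplified illustration, not an LSTM: it uses a scalar tanh recurrence with hypothetical fixed weights, one set per direction.

```python
import math

def simple_rnn(xs, w, u):
    """Scalar toy recurrence h_t = tanh(w * x_t + u * h_{t-1}); returns all states."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states

def bidirectional(xs, return_sequences=True):
    # Each direction has its own (hypothetical) weights.
    forward = simple_rnn(xs, w=0.5, u=0.8)
    # Run over the reversed input, then flip the states back so that
    # position t of both lists refers to the same token.
    backward = simple_rnn(xs[::-1], w=-0.3, u=0.6)[::-1]
    if return_sequences:
        return list(zip(forward, backward))
    # Final state of each direction: forward has read up to the last token,
    # backward has read down to the first token.
    return (forward[-1], backward[0])

xs = [1.0, -2.0, 0.5]
per_step = bidirectional(xs, return_sequences=True)
summary = bidirectional(xs, return_sequences=False)
print(len(per_step), summary)
```

With return_sequences=True there is one output per token, which is what a following recurrent layer needs as its input. The last recurrent layer returns only a single summary per review, which the dense classification head can consume.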

We have used dropout layers and a kernel regularizer to limit overfitting of the model. In the LSTM layers, dropout is applied in two places: to the input data (dropout) and to the recurrent state carried between time steps (recurrent_dropout).
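
As a rough illustration of what dropout does during training, the sketch below zeroes each value with probability `rate` and rescales the survivors so that the expected sum stays the same. The helper `dropout` is hypothetical and works on a plain list; the Keras layers apply the same idea to tensors.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)
```

At inference time dropout is switched off, so the values pass through unchanged.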

How many parameters does the model have?

In [None]:
model.summary()
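
The parameter counts can be cross-checked by hand. The sketch below assumes a vocabulary size of 8185, which should be replaced by the value of encoder.vocab_size from your own run; the helper functions are illustrative and count one weight matrix for the inputs, one for the recurrent state, and one bias per LSTM gate.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input-unit pair plus one bias per unit."""
    return input_dim * units + units

VOCAB_SIZE = 8185  # assumption: replace with encoder.vocab_size from your run
EMBED_DIM = 64

counts = {
    "embedding": VOCAB_SIZE * EMBED_DIM,
    "bi_lstm_64": 2 * lstm_params(EMBED_DIM, 64),  # output size 2 * 64 = 128
    "bi_lstm_32": 2 * lstm_params(128, 32),        # output size 2 * 32 = 64
    "bi_lstm_16": 2 * lstm_params(64, 16),         # output size 2 * 16 = 32
    "dense_64": dense_params(32, 64),
    "dense_1": dense_params(64, 1),
}
for name, n in counts.items():
    print("%-12s %8d" % (name, n))
print("%-12s %8d" % ("total", sum(counts.values())))
```

Under this assumption, most of the parameters sit in the embedding table rather than in the recurrent layers.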

Plotting the model gives a better understanding of data flow through layers.

In [None]:
keras.utils.plot_model(model, show_shapes=True, dpi=48)

## Train the Model

Compile the built model with the Adam optimizer, the accuracy metric and the binary cross-entropy loss function.

In [None]:
model.compile(loss='binary_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])
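
To make the loss and the metric concrete, here is a small worked example in plain Python with made-up labels and predicted probabilities. The two functions are simplified stand-ins written for illustration, not the Keras implementations.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over the examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(int((p > threshold) == bool(y)) for y, p in zip(y_true, y_pred))
    return hits / len(y_true)

labels = [1, 0, 1, 0]         # made-up labels: 1 = positive, 0 = negative
probs = [0.9, 0.2, 0.4, 0.6]  # made-up sigmoid outputs
print(binary_crossentropy(labels, probs), accuracy(labels, probs))
```

Confident correct predictions contribute little to the loss, while the two predictions on the wrong side of 0.5 dominate it and also count as errors for accuracy.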

Train the model for 2 epochs. It should be noted that training may take longer than for multi-layer perceptrons (MLPs) and CNNs, because LSTM layers process the tokens of a sequence one after another.

In [None]:
history = model.fit(train_data, 
                    validation_data=test_data, 
                    epochs=2)

## Model Performance Evaluation

The model has been trained and is ready to make inferences. Plot the training and validation losses to have a better understanding of its performance.

In [None]:
hist = history.history

plt.plot(hist['loss'])
plt.plot(hist['val_loss'])
plt.legend(labels=['Training', 'Validation'])
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.show()

In [None]:
# Sample prediction

samples = ['The plot is fantastic', 
           'The movie was cool and thrilling', 
           'one of the worst films I have ever seen']

# encode into integers
sample_encoded = [encoder.encode(sample) for sample in samples]

# pad with zeros to have same length 
sample_padded = []
for s in sample_encoded:
    pad_length = 128 - len(s)
    zeros = [0]*pad_length
    s.extend(zeros)
    s = tf.convert_to_tensor(s)
    sample_padded.append(s)
    
# convert into tensor before feeding the model
sample_padded = tf.convert_to_tensor(sample_padded)
# make predictions
predictions = model.predict(sample_padded)
predictions
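
The manual padding above assumes that every encoded sample is shorter than 128 tokens. For longer reviews, a helper along the following lines could be used instead. It is a hypothetical sketch, not part of the original notebook: it truncates as well as pads, and it does not modify its input.

```python
def pad_or_truncate(token_ids, length=128, pad_value=0):
    """Return a new list of exactly `length` ids, without mutating the input."""
    clipped = list(token_ids[:length])
    return clipped + [pad_value] * (length - len(clipped))

short = pad_or_truncate([5, 9, 2], length=6)
long_ = pad_or_truncate(list(range(1, 11)), length=6)
print(short)
print(long_)
```

Applied to each encoded sample, it yields lists of equal length that can be converted into a single tensor.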

In [None]:
print('Predictions on sample test reviews... \n')
for i in range(len(samples)):
    pred = predictions[i][0]
    sentiment = 'positive' if pred>0.5 else 'negative'
    print('%40s : %s'%(samples[i], sentiment))