# Sentiment Analysis (Natural Language Processing)

Natural Language Processing (NLP) is a vast subject with many different specializations. Here we are going to discuss sentiment analysis. We will be specifically looking at developing a sentiment model based on Amazon video game reviews.


<table align="left">
    <td>
        <a target="_blank" href="https://colab.research.google.com/github/thushv89/manning_tf2_in_action/blob/master/Ch09/9.1.Sentiment_analysis.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
    </td>
</table>



## Importing libraries and other housekeeping

In [1]:
import tensorflow as tf
#import tensorflow_hub as hub
import requests
print(tf.__version__)
import zipfile
import requests
import os
import time
import pandas as pd
import random
import shutil
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import os
import tensorflow.keras.layers as layers
import tensorflow.keras.models as models
from tensorflow.keras.losses import CategoricalCrossentropy
import tensorflow.keras.backend as K
from tensorflow.keras.callbacks import EarlyStopping, CSVLogger
import numpy as np
from PIL import Image
import pickle
from tensorflow.keras.models import load_model, Model
from PIL import Image
from PIL.PngImagePlugin import PngImageFile
import matplotlib.pyplot as plt
import glob
from functools import partial
import nltk

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except:
        print("Couldn't set memory_growth")
        pass
    
    
def fix_random_seed(seed):
    """ Setting the random seed of various libraries """
    try:
        np.random.seed(seed)
    except NameError:
        print("Warning: Numpy is not imported. Setting the seed for Numpy failed.")
    try:
        tf.random.set_seed(seed)
    except NameError:
        print("Warning: TensorFlow is not imported. Setting the seed for TensorFlow failed.")
    try:
        random.seed(seed)
    except NameError:
        print("Warning: random module is not imported. Setting the seed for random failed.")

# Fixing the random seed
random_seed=4321
fix_random_seed(random_seed)

2.2.1


## Downloading data

Here as the dataset we are going to use an Amazon video game review dataset available through this [link](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz).

In [2]:
# Downloading the data
# http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz

import os
import requests
import gzip
import shutil

# Retrieve the data
if not os.path.exists(os.path.join('data','Video_Games_5.json.gz')):
    url = "http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz"
    # Get the file from web
    r = requests.get(url)

    if not os.path.exists('data'):
        os.mkdir('data')
    
    # Write to a file
    with open(os.path.join('data','Video_Games_5.json.gz'), 'wb') as f:
        f.write(r.content)
else:
    print("The tar file already exists.")
    
if not os.path.exists(os.path.join('data', 'Video_Games_5.json')):
    with gzip.open(os.path.join('data','Video_Games_5.json.gz'), 'rb') as f_in:
        with open(os.path.join('data','Video_Games_5.json'), 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
else:
    print("The extracted data already exists")


The tar file already exists.
The extracted data already exists


## Loading review data

Let's load a sample of data. The columns that are of interest to us are,

* overall - The overall stars received in the reivew
* verified - Whether the reviewer is a verified buyer or not
* reviewTime - When the review was posted
* reviewText - The review itself

In [3]:
import pandas as pd

# Read the JSON file
review_df = pd.read_json(os.path.join('data', 'Video_Games_5.json'), lines=True, orient='records')
# Select on the columns we're interested in 
review_df = review_df[["overall", "verified", "reviewTime", "reviewText"]]
review_df.head()

Unnamed: 0,overall,verified,reviewTime,reviewText
0,5,True,"10 17, 2015","This game is a bit hard to get the hang of, bu..."
1,4,False,"07 27, 2015",I played it a while but it was alright. The st...
2,3,True,"02 23, 2015",ok game.
3,2,True,"02 20, 2015","found the game a bit too complicated, not what..."
4,5,True,"12 25, 2014","great game, I love it and have played it since..."


## Cleaning up data

Remove entries where the description is null or empty

In [4]:
print("Before cleaning up: {}".format(review_df.shape))
review_df = review_df[~review_df["reviewText"].isna()]
review_df = review_df[review_df["reviewText"].str.strip().str.len()>0]
print("After cleaning up: {}".format(review_df.shape))

Before cleaning up: (497577, 4)
After cleaning up: (497419, 4)


## Checking verified vs non-verified review counts

To improve the quality of the data reviewed, we will only use the verified reviews. But we have to make sure there's enough data. It seems more than 66% of the data is coming from verified buyers.

In [5]:
review_df["verified"].value_counts()

True     332504
False    164915
Name: verified, dtype: int64

## Check the review count for each rating from verified buyers

Let's check how many reviews are there for each different rating. As you can see there are way too many 5 star reviews in our dataset. Therefore we must make sure to account for this imbalance when batching data and creating a model.

In [6]:
verified_df = review_df.loc[review_df["verified"], :]
verified_df["overall"].value_counts()

5    222335
4     54878
3     27973
1     15200
2     12118
Name: overall, dtype: int64

## Map rating to a positive/negative label

To make our classification task simple, we will map the different star ratings to a positive (1) or a negative (0) label. Here we map both 5 and 4 reviews to 1 (positive) and 3,2, and 1 to 0 (negative).

In [7]:
# Use pandas map function to map different star ratings to 0/1
verified_df["label"]=verified_df["overall"].map({5:1, 4:1, 3:0, 2:0, 1:0})
verified_df["label"].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


1    277213
0     55291
Name: label, dtype: int64

## Shuffling the data

Let's shuffle the data to make sure there is no order in the data

In [8]:
# We are sampling 100% of the data in a random fashion, leading to a shuffled dataset
verified_df = verified_df.sample(frac=1.0, random_state=random_seed)

# Splint the data to inputs (data) and targets (labels)
data, labels = verified_df["reviewText"], verified_df["label"]

## Preprocessing the text

Here we perform preprocessing. Mainly we're going to focus on the following 

* Lower case (nltk) - Turn "I am" to "i am"
* Remove numbers (regex) - Turn "i am 24 years old" to "i am years old"
* Remove stop words (nltk) - Turn "i go to the shop" to "i go shop"
* Lemmatize (nltk) - Turn "i went to buy flowers" to "i go to buy flower"

Preprocessing helps to reduce the features space, thus the model learning faster

In [9]:
import nltk
# We need to download several nltk artefacts to perform the preprocessing
nltk.download('averaged_perceptron_tagger', download_dir='nltk')
nltk.download('wordnet', download_dir='nltk')
nltk.download('stopwords', download_dir='nltk')
nltk.download('punkt', download_dir='nltk')
nltk.data.path.append(os.path.abspath('nltk'))

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
import string

# Define a lemmatizer (converts words to base form)
lemmatizer = WordNetLemmatizer()

# Define the English stopwords
EN_STOPWORDS = set(stopwords.words('english')) - {'not'}

def clean_text(doc):
    """ A function that cleans a given document (i.e. a text string)"""
    
    # Turn to lower case
    doc = doc.lower()
    # the shortened form n't is expanded to not
    doc = doc.replace("n\'t ", ' not ')
    # shortened forms like 'll 're 'd 've are removed as they don't add much value to this task
    doc = re.sub(r"(?:\'ll |\'re |\'d |\'ve )", " ", doc)
    # numbers are removed
    doc = re.sub(r"/d+","", doc)
    # break the text in to tokens (or words), while doing that ignore stopwords from the result
    # stopwords again do not add any value to the task
    tokens = [w for w in word_tokenize(doc) if w not in EN_STOPWORDS and w not in string.punctuation]  
    
    # Here we lemmatize the words in the tokens
    # to lemmatize, we get the pos tag of each token and 
    # if it is N (noun) or V (verb) we lemmatize, else 
    # keep the original form
    pos_tags = nltk.pos_tag(tokens)
    clean_text = [
        lemmatizer.lemmatize(w, pos=p[0].lower()) \
        if p[0]=='N' or p[0]=='V' else w \
        for (w, p) in pos_tags
    ]

    # return the clean text
    return clean_text

# Run a sample
sample_doc = 'She sells seashells by the seashore.'
print("Before clean: {}".format(sample_doc))
print("After clean: {}".format(' '.join(clean_text(sample_doc))))

# Apply the transformation to the full text
# this is time consuming
print("\nProcessing all the review data ... This can take a long time (~ 1hr)")
data = data.apply(lambda x: clean_text(x))
print("\tDone")

[nltk_data] Downloading package averaged_perceptron_tagger to nltk...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to nltk...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to nltk...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to nltk...
[nltk_data]   Package punkt is already up-to-date!


Before clean: She sells seashells by the seashore.
After clean: sell seashell seashore

Processing all the review data ... This can take a long time (~ 1hr)
	Done


## Save the preprocessed data

In [10]:
data.to_pickle(os.path.join('data','sentiment_data.pkl'))
labels.to_pickle(os.path.join('data','sentiment_labels.pkl'))

## Splitting data to train/valid/test

In [12]:
def train_valid_test_split(inputs, labels, train_fraction=0.8):
    """ Splits a given dataset into three sets; training, validation and test """    
    
    # Separate indices of negative and positive data points
    neg_indices = pd.Series(labels.loc[(labels==0)].index)
    pos_indices = pd.Series(labels.loc[(labels==1)].index)
    
    n_valid = int(min([len(neg_indices), len(pos_indices)]) * ((1-train_fraction)/2.0))
    n_test = n_valid
    
    neg_test_inds = neg_indices.sample(n=n_test)
    neg_valid_inds = neg_indices.loc[~neg_indices.isin(neg_test_inds)].sample(n=n_test)
    neg_train_inds = neg_indices.loc[~neg_indices.isin(neg_test_inds.tolist()+neg_valid_inds.tolist())]
    
    pos_test_inds = pos_indices.sample(n=n_test)
    pos_valid_inds = pos_indices.loc[~pos_indices.isin(pos_test_inds)].sample(n=n_test)
    pos_train_inds = pos_indices.loc[
        ~pos_indices.isin(pos_test_inds.tolist()+pos_valid_inds.tolist())
    ]
    
    tr_x = inputs.loc[neg_train_inds.tolist() + pos_train_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    tr_y = labels.loc[neg_train_inds.tolist() + pos_train_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    v_x = inputs.loc[neg_valid_inds.tolist() + pos_valid_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    v_y = labels.loc[neg_valid_inds.tolist() + pos_valid_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    ts_x = inputs.loc[neg_test_inds.tolist() + pos_test_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    ts_y = labels.loc[neg_test_inds.tolist() + pos_test_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    
    print('Training data: {}'.format(len(tr_x)))
    print('Validation data: {}'.format(len(v_x)))
    print('Test data: {}'.format(len(ts_x)))
    
    return (tr_x, tr_y), (v_x, v_y), (ts_x, ts_y)
    
(tr_x, tr_y), (v_x, v_y), (ts_x, ts_y) = train_valid_test_split(data, labels)

with open(os.path.join('data','sentiments_processed.pkl'), 'wb') as f:
    pickle.dump(((tr_x, tr_y), (v_x, v_y), (ts_x, ts_y)), f)

Training data: 310388
Validation data: 11058
Test data: 11058


In [13]:
with open(os.path.join('data', 'sentiments_processed.pkl'), 'rb') as f:
    (tr_x, tr_y), (v_x, v_y), (ts_x, ts_y) = pickle.load(f)

print("Some sample targets")
print(tr_y.head(n=10))

Some sample targets
222084    1
280454    1
64524     1
94311     0
70376     1
480877    0
147446    0
240965    0
191264    0
176397    1
Name: label, dtype: int64


## Analysing the data (training)

### Analyse the vocabulary

Let's see what are the most popular words as well as some summary statistics about the data (e.g. mean frequency)

In [14]:
from collections import Counter
# Create a large list which contains all the words in all the reviews
data_list = [w for doc in tr_x for w in doc]

# Create a Counter object from that list
# Counter returns a dictionary, where key is a word and the value is the frequency
cnt = Counter(data_list)

# Convert the result to a pd.Series 
freq_df = pd.Series(list(cnt.values()), index=list(cnt.keys())).sort_values(ascending=False)
# Print most common words
print(freq_df.head(n=10))

# Print summary statistics
print("\nMedian: {}\n".format(freq_df.median()))
print(freq_df.describe(percentiles=[0.25,0.5,0.75,0.9]))

game     407818
not      248244
play     128235
's       127844
get      108819
like     100279
great     97041
one       89948
good      77212
time      63450
dtype: int64

Median: 1.0

count    133500.000000
mean         75.733715
std        1752.039796
min           1.000000
25%           1.000000
50%           1.000000
75%           4.000000
90%          20.000000
max      407818.000000
dtype: float64


### Analyse the sequence length (number of words) of reviews

In [15]:
# Create a pd.Series, which contain the sequence length for each review
seq_length_ser = tr_x.str.len()

# Get the median as well as summary statistics of the sequence length
print("Median length: {}\n".format(seq_length_ser.median()))
seq_length_ser.describe(percentiles=[0.25,0.5,0.75,0.8])

Median length: 12.0



count    310388.000000
mean         32.573589
std          74.059833
min           0.000000
25%           4.000000
50%          12.000000
75%          29.000000
80%          38.000000
max        3162.000000
Name: reviewText, dtype: float64

In [16]:
seq_length_ser = seq_length_ser.sort_values()
# Check the median of the shortest 33% of data
print(seq_length_ser.iloc[:int(seq_length_ser.shape[0]/3.0)].median())
# Check the median of the shortest 33%-66% of data
print(seq_length_ser.iloc[int(seq_length_ser.shape[0]/3.0): int(seq_length_ser.shape[0]*2.0/3.0)].median())
# Check the median of the longest 33% of data
print(seq_length_ser.iloc[int(seq_length_ser.shape[0]*2.0/3.0):].median())

2.0
12.0
46.0


## Define hyperparameters

Based on above analysis, define the vocabulary size and sequence lenght, both of which we need to define the data pipeline and the model

In [17]:
n_vocab = (freq_df >= 25).sum()
n_seq = 39
print("Using a vocabulary of size: {}".format(n_vocab))
print("Using a sequence length: {}".format(n_seq))

Using a vocabulary of size: 11865
Using a sequence length: 39


## Transforming text to numbers

We will define a Keras tokenizer that will take sequences of words (tokens) and convert them to sequences of numbers. This is achieved by building a dictionary that maps a given word to an unique ID.


### Defining a Keras tokenizer

In [18]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Define a tokenizer that will convert words to IDs
# words that are less frequent will be replaced by 'unk'
tokenizer = Tokenizer(num_words=n_vocab, oov_token='unk', lower=False)

# Fit the tokenizer on the data
tokenizer.fit_on_texts(tr_x.tolist())

# Convert all of train/validation/test data to sequences of IDs
tr_x = tokenizer.texts_to_sequences(tr_x.tolist())
v_x = tokenizer.texts_to_sequences(v_x.tolist())
ts_x = tokenizer.texts_to_sequences(ts_x.tolist())


## Defining the `tf.data` Pipeline

Here we will define a `tf.data` pipeline that takes in,
* A list of list, where the outer list contains the individual reviews and the inner list contains the word IDs of a review

and perform,

* Bucketing, to separate sequences with different lengths to predefined buckets (each bucket has predefined boundaries) and return batches containing seuqences of constant length (with the help of padding)
* Shuffle the data


In [22]:


def get_tf_pipeline(text_seq, labels, batch_size=64, bucket_boundaries=[5,15], max_length=50, shuffle=False):
    """ Define a data pipeline that converts sequences to batches of data """
    
    # Concatenate the label and the input sequence so that we don't mess up the order when we shuffle
    data_seq = [[b]+a for a,b in zip(text_seq, labels) ]
    # Define the variable sequence dataset as a ragged tensor
    tf_data = tf.ragged.constant(data_seq)[:,:max_length]
    # Create a dataset out of the ragged tensor
    text_ds = tf.data.Dataset.from_tensor_slices(tf_data)

    # Bucketing the data
    # Bucketing assign each sequence to a bucket depending on the length
    # If you define bucket boundaries as [5, 15], then you get buckets,
    # [0, 5], [5, 15], [15,inf]
    bucket_fn = tf.data.experimental.bucket_by_sequence_length(
        lambda x: tf.cast(tf.shape(x)[0],'int32'), 
        bucket_boundaries=bucket_boundaries, 
        bucket_batch_sizes=[batch_size,batch_size,batch_size], 
        padded_shapes=None,
        padding_values=0, 
        pad_to_bucket_boundary=False
    )

    # Apply bucketing
    text_ds = text_ds.map(lambda x: x).apply(bucket_fn)
    
    # Shuffle the data
    if shuffle:
        text_ds = text_ds.shuffle(buffer_size=10*batch_size)
        
    # Split the data to inputs and labels
    text_ds = text_ds.map(lambda x: (x[:,1:], x[:,0]))    
    
    return text_ds


## Validate the behavior of bucketing function

Here we will look at what actually takes place when you perform bucketing on some sequences.

In [20]:
x = [[1,2],[1],[1,2,3], [2,3,6,4,5],[2,0,9,7],[2,4,214,21],[3,4,42,7,3,2,45,52],[3,2,6,543,2,3243,2,134,52,23],[3,32,21,3,2,4,134,45,1,1,45]]
y = [0,0,0, 1, 1, 1, 0, 0, 0]

a = get_tf_pipeline(x, y, batch_size=2, bucket_boundaries=[3,5], max_length=15, shuffle=True)

for x,y in a.take(6):
    print('\n')
    print(x)
    print('\ty=', y)




tf.Tensor(
[[1 2 0]
 [1 2 3]], shape=(2, 3), dtype=int32)
	y= tf.Tensor([0 0], shape=(2,), dtype=int32)


tf.Tensor(
[[   3    2    6  543    2 3243    2  134   52   23    0]
 [   3   32   21    3    2    4  134   45    1    1   45]], shape=(2, 11), dtype=int32)
	y= tf.Tensor([0 0], shape=(2,), dtype=int32)


tf.Tensor([[1]], shape=(1, 1), dtype=int32)
	y= tf.Tensor([0], shape=(1,), dtype=int32)


tf.Tensor(
[[  2   4 214  21   0   0   0   0]
 [  3   4  42   7   3   2  45  52]], shape=(2, 8), dtype=int32)
	y= tf.Tensor([1 0], shape=(2,), dtype=int32)


tf.Tensor(
[[2 3 6 4 5]
 [2 0 9 7 0]], shape=(2, 5), dtype=int32)
	y= tf.Tensor([1 1], shape=(2,), dtype=int32)


## Take out some example to see

In [21]:
train_ds = get_tf_pipeline(tr_x, tr_y, shuffle=True)
valid_ds = get_tf_pipeline(v_x, v_y)

print("Some training data ...")
for x,y in train_ds.take(2):
    print("Input sequence shape: {}".format(x.shape))
    print(y)

print("\nSome validation data ...")
for x,y in valid_ds.take(2):
    print("Input sequence shape: {}".format(x.shape))
    print(y)

Some training data ...
Input sequence shape: (64, 49)
tf.Tensor(
[0 0 1 0 1 1 1 0 0 0 1 1 0 1 1 0 1 1 1 0 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1
 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1], shape=(64,), dtype=int32)
Input sequence shape: (64, 49)
tf.Tensor(
[1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 0 0 0 1 1 0 1 0 0 1 0 1
 0 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1], shape=(64,), dtype=int32)

Some validation data ...
Input sequence shape: (64, 49)
tf.Tensor(
[0 0 0 1 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 1 0 1 0 1 1 0 0 1
 1 1 0 0 0 1 0 0 1 0 1 1 0 1 0 1 1 1 0 1 0 0 1 0 0 0 0], shape=(64,), dtype=int32)
Input sequence shape: (64, 13)
tf.Tensor(
[1 0 0 0 1 0 1 1 1 1 1 1 0 0 1 0 0 1 1 0 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 0 1
 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 1 0 1 1 0 0 1 1 0 1 1 0], shape=(64,), dtype=int32)


## Define the sentiment analysis model

Here we are going to define a model. Our model has,

* A masking layer, we will mask-out value zero inputs, which will not contribute to the loss function
* A lambda layer that converts IDs to one hot vectors
* A LSTM model with 128 nodes
* A Dense layer that has 512 nodes and ReLU activation
* Final Dense layer that outputs the sentiment (sigmoid activation)

Finally, the model will have,
* A binary crossentropy loss
* Adam optimizer
* Accuracy metric

In [26]:
import tensorflow.keras.backend as K

K.clear_session()

model = tf.keras.models.Sequential([
    # Create a mask to mask out zero inputs
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None,)),
    # After creating the mask, convert inputs to onehot encoded inputs
    tf.keras.layers.Lambda(lambda x: tf.one_hot(tf.cast(x,'int32'), depth=n_vocab), input_shape=(None,)),
    # Defining an LSTM layer
    tf.keras.layers.LSTM(128, return_state=False, return_sequences=False),
    # Defining a Dense layer
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
masking (Masking)            (None, None)              0         
_________________________________________________________________
lambda (Lambda)              (None, None, 11865)       0         
_________________________________________________________________
lstm (LSTM)                  (None, 128)               6140928   
_________________________________________________________________
dense (Dense)                (None, 512)               66048     
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 513       
Total params: 6,207,489
Trainable params: 6,207,489
Non-trainable params: 0
______________________________________________

## Create the data pipelines

In [27]:
print("Defining data pipelines")

# Using a batch size of 128
batch_size =128

train_ds = get_tf_pipeline(tr_x, tr_y, batch_size=batch_size, shuffle=True)
valid_ds = get_tf_pipeline(v_x, v_y, batch_size=batch_size)
test_ds = get_tf_pipeline(ts_x, ts_y, batch_size=batch_size)
print('\tDone...')

Defining data pipelines
	Done...


## Defining the negative weights

We discussed first that our dataset has a high class imbalance. Particularly, there are more positive inputs than negative ones. This means we need to weigh our negative examples more.

In [28]:
# There is a class imbalance in the data therefore we are defining a weight for negative inputs
neg_weight = (tr_y==1).sum()/(tr_y==0).sum()
print("Will be using a weight of {} for negative samples".format(neg_weight))

Will be using a weight of 6.017113919471887 for negative samples


## Training the model

In [29]:
os.makedirs('eval', exist_ok=True)

# Logging the performance metrics to a CSV file
csv_logger = tf.keras.callbacks.CSVLogger(os.path.join('eval','1_sentiment_analysis.log'))

monitor_metric = 'val_loss'
mode = 'min'
print("Using metric={} and mode={} for EarlyStopping".format(monitor_metric, mode))

# Reduce LR callback
lr_callback = tf.keras.callbacks.ReduceLROnPlateau(
    monitor=monitor_metric, factor=0.1, patience=3, mode=mode, min_lr=1e-8
)

# EarlyStopping itself increases the memory requirement
# restore_best_weights will increase the memory req for large models
es_callback = tf.keras.callbacks.EarlyStopping(
    monitor=monitor_metric, patience=6, mode=mode, restore_best_weights=False
)

# Train the model
t1 = time.time()

model.fit(train_ds, validation_data=valid_ds, epochs=10, class_weight={0:neg_weight, 1:1.0}, callbacks=[es_callback, lr_callback, csv_logger])
t2 = time.time()

print("It took {} seconds to complete the training".format(t2-t1))

Using metric=val_loss and mode=min for EarlyStopping
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
It took 612.4707632064819 seconds to complete the training


In [31]:
os.makedirs('models', exist_ok=True)
tf.keras.models.save_model(model, os.path.join('models', '1_sentiment_analysis.h5'))

## Evaluating the model on test data

In [32]:
model.evaluate(test_ds)



[0.8677629232406616, 0.8037619590759277]

## Sentiment analysis with an Embedding layer

Next we are going to enhance our model using an Embedding layer. An embedding layer is very useful to capture relationships between different words. For example, in the context of sentiment analysis, words like "good", "great" should produce similar feature vectors. An embedding layer achieves this effect.

In [33]:
import tensorflow.keras.backend as K

K.clear_session()

model = tf.keras.models.Sequential([
    # Create a mask to mask out zero inputs
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None,)),    
    # Adding an Embedding layer
    tf.keras.layers.Embedding(input_dim=n_vocab+1, output_dim=128, 
                              #mask_zero=True, 
                              input_shape=(None,)),
    # Defining an LSTM layer
    tf.keras.layers.LSTM(128, return_state=False, return_sequences=False),
    # Defining Dense layers
    tf.keras.layers.Dense(512, activation='relu'),
    # Defining a dropout layer
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
masking (Masking)            (None, None)              0         
_________________________________________________________________
embedding (Embedding)        (None, None, 128)         1518848   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 512)               66048     
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 513       
Total params: 1,716,993
Trainable params: 1,716,993
Non-trainable params: 0
______________________________________________

## Training the newly defined model

In [34]:
print("Defining data pipelines")
batch_size=128
train_ds = get_tf_pipeline(tr_x, tr_y, batch_size=128, shuffle=True)
valid_ds = get_tf_pipeline(v_x, v_y, batch_size=128,)
test_ds = get_tf_pipeline(ts_x, ts_y, batch_size=128)
print('\tDone...')

os.makedirs('eval', exist_ok=True)

# Logging the performance metrics to a CSV file
csv_logger = tf.keras.callbacks.CSVLogger(os.path.join('eval','2_sentiment_analysis_embeddings.log'))

monitor_metric = 'val_loss'
mode = 'min' if 'loss' in monitor_metric else 'max'
print("Using metric={} and mode={} for EarlyStopping".format(monitor_metric, mode))

# Reduce LR callback
lr_callback = tf.keras.callbacks.ReduceLROnPlateau(
    monitor=monitor_metric, factor=0.1, patience=3, mode=mode, min_lr=1e-8
)

# EarlyStopping itself increases the memory requirement
# restore_best_weights will increase the memory req for large models
es_callback = tf.keras.callbacks.EarlyStopping(
    monitor=monitor_metric, patience=6, mode=mode, restore_best_weights=False
)

t1 = time.time()

model.fit(train_ds, validation_data=valid_ds, epochs=25, class_weight={0:neg_weight, 1:1.0}, callbacks=[es_callback, lr_callback, csv_logger])
t2 = time.time()

print("It took {} seconds to complete the training".format(t2-t1))

Defining data pipelines
	Done...
Using metric=val_loss and mode=min for EarlyStopping
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
It took 261.9115107059479 seconds to complete the training


In [35]:
os.makedirs('models', exist_ok=True)
tf.keras.models.save_model(model, os.path.join('models', '2_sentiment_analysis_embeddings.h5'))

## Evaluate on the test set

Our new model gives slightly better accuracy on the test data

In [36]:
test_ds = get_tf_pipeline(ts_x, ts_y, batch_size=128)
model.evaluate(test_ds)



[0.7214286923408508, 0.8110870122909546]

## Analyse some of the results

Here, let's analyse some of the inputs with strong positive/negative sentiments predicted by the model.

In [118]:
test_ds = get_tf_pipeline(ts_x, ts_y, batch_size=128)

# Go through the test data and gather all examples
test_x = []
test_pred = []
test_y = []
for x, y in test_ds:
    test_x.append(x)    
    test_pred.append(model.predict(x))
    test_y.append(y)

# Check the sizes
test_x = [doc for t in test_x for doc in t.numpy().tolist()]
print("X: {}".format(len(test_x)))
test_pred = tf.concat(test_pred, axis=0).numpy()
print("Pred: {}".format(test_pred.shape))
test_y = tf.concat(test_y, axis=0).numpy()
print("Y: {}".format(test_y.shape))

X: 11058
Pred: (11058, 1)
Y: (11058,)


## Printing the reviews

In [119]:
sorted_pred = np.argsort(test_pred.flatten())
min_pred = sorted_pred[:5]
max_pred = sorted_pred[-5:]

print("Most negative reviews\n")
print("="*50)
for i in min_pred:    
    print(" ".join(tokenizer.sequences_to_texts([test_x[i]])), '\n')
    
print("\nMost positive reviews\n")
print("="*50)
for i in max_pred:
    print(" ".join(tokenizer.sequences_to_texts([test_x[i]])), '\n')


Most negative reviews

buy game high rating promise gameplay saw youtube story so-so graphic mediocre control terrible could not adjust control option preference .. crouch would hold onto left trigger could slip ... also double tap right trigger change weapon suck .. fire weapon require push right button not right trigger often 

attempt install game quad core windows 7 pc zero luck go back forth try every suggestion rockstar support absolutely useless game defect manufacturer not buy side note 'm also po amazon wo not anything either consumer guess 'm totally unk not right either rockstar amazon refund money terrible customer 

way product 5 star 28 review write tone lot review similar play 2 song expert drum say unless play tennis shoe fact screw not flush mean feel every kick specifically two screw leave plus pedal completely torn mount screw something actually go wrong pedal instal unscrew send back ea 

unk interactive stranger unk unk genre develop operation flashpoint various re

## Bonus: Training the model with a custom loss

This code is a bonus section and an experimental one. Essentially, we are trying to see if adaptively using a negative weight on samples (depending on how much negatives/positive samples in a batch), can help the model to perform better. But it doesn't seem to have a strong effect since we're already weight the negative examples.

In [32]:
import tensorflow.keras.backend as K

K.clear_session()

model = tf.keras.models.Sequential([
    # Create a mask to mask out zero inputs
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None,)),
    # After creating the mask, convert inputs to onehot encoded inputs
    tf.keras.layers.Lambda(lambda x: tf.one_hot(tf.cast(x,'int32'), depth=n_vocab), input_shape=(None,)),
    # Defining an LSTM layer
    tf.keras.layers.LSTM(256, return_state=False, return_sequences=False),
    # Defining a Dense layer
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

def weighted_binary_crossentropy(y_true, y_pred):
    
    pos_mask = tf.cast(tf.math.equal(y_true, 1),'float32')
    n_pos = tf.reduce_sum(pos_mask)
    neg_mask = tf.cast(tf.math.equal(y_true, 0),'float32')
    n_neg = tf.reduce_sum(neg_mask)
    
    w_pos = n_neg / (n_pos+n_neg)
    w_neg = n_pos / (n_pos+n_neg)
    
    w_mask = (pos_mask*w_pos) + (neg_mask*w_neg)
    
    bce = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
    
    return tf.reduce_mean(bce(y_true, y_pred)*w_mask)

# Compile the model
model.compile(loss=weighted_binary_crossentropy, optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lambda (Lambda)              (None, None, 6609)        0         
_________________________________________________________________
lstm (LSTM)                  (None, 200)               5448000   
_________________________________________________________________
dense (Dense)                (None, 1)                 201       
Total params: 5,448,201
Trainable params: 5,448,201
Non-trainable params: 0
_________________________________________________________________


In [33]:
print("Defining data pipelines")
train_ds = get_tf_pipeline(tr_x, tr_y, shuffle=True)
valid_ds = get_tf_pipeline(v_x, v_y)
test_ds = get_tf_pipeline(ts_x, ts_y)
print('\tDone...')

os.makedirs('eval', exist_ok=True)

# Logging the performance metrics to a CSV file
csv_logger = tf.keras.callbacks.CSVLogger(os.path.join('eval','3_sentiment_analysis_custom_loss.log'))

monitor_metric = 'val_loss'
mode = 'min' if 'loss' in monitor_metric else 'max'
print("Using metric={} and mode={} for EarlyStopping".format(monitor_metric, mode))

# Reduce LR callback
lr_callback = tf.keras.callbacks.ReduceLROnPlateau(
    monitor=monitor_metric, factor=0.1, patience=3, mode=mode, min_lr=1e-8
)

# EarlyStopping itself increases the memory requirement
# restore_best_weights will increase the memory req for large models
es_callback = tf.keras.callbacks.EarlyStopping(
    monitor=monitor_metric, patience=6, mode=mode, restore_best_weights=False
)

t1 = time.time()

model.fit(train_ds, validation_data=valid_ds, epochs=10, callbacks=[es_callback, lr_callback, csv_logger])
t2 = time.time()

print("It took {} seconds to complete the training".format(t2-t1))

Defining data pipelines
	Done...
Using metric=val_loss and mode=min for EarlyStopping
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
It took 804.270180940628 seconds to complete the training
