# Natural Language Processing

Natural Language Processing (NLP) is a vast subject with many different specializations. Here we are going to discuss two important topics.

* Sentiment analysis
* Language modelling

<table align="left">
    <td>
        <a target="_blank" href="https://colab.research.google.com/github/thushv89/manning_tf2_in_action/blob/master/Ch09/9.1.Natural_Language_Processing.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
    </td>
</table>



## Importing libraries and other housekeeping

In [3]:
import tensorflow as tf
#import tensorflow_hub as hub
import requests
print(tf.__version__)
import zipfile
import requests
import os
import time
import pandas as pd
import random
import shutil
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import os
import tensorflow.keras.layers as layers
import tensorflow.keras.models as models
from tensorflow.keras.losses import CategoricalCrossentropy
import tensorflow.keras.backend as K
from tensorflow.keras.callbacks import EarlyStopping, CSVLogger
import numpy as np
from PIL import Image
import pickle
from tensorflow.keras.models import load_model, Model
from PIL import Image
from PIL.PngImagePlugin import PngImageFile
import matplotlib.pyplot as plt
import glob
from functools import partial
import nltk

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except:
        print("Couldn't set memory_growth")
        pass
    
    
def fix_random_seed(seed):
    """ Setting the random seed of various libraries """
    try:
        np.random.seed(seed)
    except NameError:
        print("Warning: Numpy is not imported. Setting the seed for Numpy failed.")
    try:
        tf.random.set_seed(seed)
    except NameError:
        print("Warning: TensorFlow is not imported. Setting the seed for TensorFlow failed.")
    try:
        random.seed(seed)
    except NameError:
        print("Warning: random module is not imported. Setting the seed for random failed.")

# Fixing the random seed
random_seed=4321
fix_random_seed(random_seed)

2.2.1


## Downloading data

In [4]:
# Downloading the data
# http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz

import os
import requests
import gzip
import shutil

# Retrieve the data
if not os.path.exists(os.path.join('data','Video_Games_5.json.gz')):
    url = "http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz"
    # Get the file from web
    r = requests.get(url)

    if not os.path.exists('data'):
        os.mkdir('data')
    
    # Write to a file
    with open(os.path.join('data','Video_Games_5.json.gz'), 'wb') as f:
        f.write(r.content)
else:
    print("The tar file already exists.")
    
if not os.path.exists(os.path.join('data', 'Video_Games_5.json')):
    with gzip.open(os.path.join('data','Video_Games_5.json.gz'), 'rb') as f_in:
        with open(os.path.join('data','Video_Games_5.json'), 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
else:
    print("The extracted data already exists")


## Loading review data

In [5]:
import pandas as pd

review_df = pd.read_json(os.path.join('data', 'Video_Games_5.json'), lines=True, orient='records')
review_df = review_df[["overall", "verified", "reviewTime", "reviewText"]]
review_df.head()

Unnamed: 0,overall,verified,reviewTime,reviewText
0,5,True,"10 17, 2015","This game is a bit hard to get the hang of, bu..."
1,4,False,"07 27, 2015",I played it a while but it was alright. The st...
2,3,True,"02 23, 2015",ok game.
3,2,True,"02 20, 2015","found the game a bit too complicated, not what..."
4,5,True,"12 25, 2014","great game, I love it and have played it since..."


## Cleaning up data

Remove entries where the description is null or empty

In [6]:
print("Before cleaning up: {}".format(review_df.shape))
review_df = review_df[~review_df["reviewText"].isna()]
review_df = review_df[review_df["reviewText"].str.strip().str.len()>0]
print("After cleaning up: {}".format(review_df.shape))

Before cleaning up: (497577, 4)
After cleaning up: (497419, 4)


## Checking verified vs non-verified review counts

To improve the quality of the data reviewed, we will only use the verified reviews

In [7]:
review_df["verified"].value_counts()

True     332504
False    164915
Name: verified, dtype: int64

## Check the review count for each rating
As you can see there are way too many 5 star reviews in our dataset. Therefore we must make sure to account for this imbalance when batching data and creating a model.

In [8]:
verified_df = review_df.loc[review_df["verified"], :]
verified_df["overall"].value_counts()

5    222335
4     54878
3     27973
1     15200
2     12118
Name: overall, dtype: int64

## Map rating to a positive/negative label

Here we map both 5 and 4 reviews to 1 (positive) and 3,2, and 1 to 0 (negative).

In [9]:
verified_df["label"]=verified_df["overall"].map({5:1, 4:1, 3:0, 2:0, 1:0})
verified_df["label"].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


1    277213
0     55291
Name: label, dtype: int64

## Preprocessing the text

* Lower case (Spacy)
* Remove stop words (Spacy)
* Lemmatize (Spacy)



In [10]:
verified_df = verified_df.sample(frac=1.0, random_state=random_seed)
data, labels = verified_df["reviewText"], verified_df["label"]

In [14]:
import nltk
nltk.download('averaged_perceptron_tagger', download_dir='nltk')
nltk.download('wordnet', download_dir='nltk')
nltk.download('stopwords', download_dir='nltk')
nltk.download('punkt', download_dir='nltk')
nltk.data.path.append(os.path.abspath('nltk'))

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
import string

lemmatizer = WordNetLemmatizer()

EN_STOPWORDS = set(stopwords.words('english')) - {'not'}

def clean_text(doc):
    doc = doc.lower()
    doc = doc.replace("n\'t ", ' not ')
    doc = re.sub(r"(?:\'ll |\'re |\'d |\'ve )", " ", doc)
    doc = re.sub(r"/d+","", doc)
    tokens = [w for w in word_tokenize(doc) if w not in EN_STOPWORDS and w not in string.punctuation]  
    pos_tags = nltk.pos_tag(tokens)
    clean_text = [
        lemmatizer.lemmatize(w, pos=p[0].lower()) \
        if p[0]=='N' or p[0]=='V' else w \
        for (w, p) in pos_tags
    ]

    return clean_text

sample_doc = 'She sells seashells by the seashore.'
print("Before clean: {}".format(sample_doc))
print("After clean: {}".format(' '.join(clean_text(sample_doc))))
print("\nProcessing all the review data ...")
data = data.apply(lambda x: clean_text(x))
print("\tDone")

[nltk_data] Downloading package averaged_perceptron_tagger to nltk...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to nltk...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to nltk...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to nltk...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Before clean: She sells seashells by the seashore.
After clean: sell seashell seashore

Processing all the review data ...
	Done


In [15]:
data.to_pickle(os.path.join('data','data.pkl'))
labels.to_pickle(os.path.join('data','labels.pkl'))

## Analyse the vocabulary

In [16]:
from collections import Counter
data_list = [w for doc in data for w in doc]
cnt = Counter(data_list)
freq_df = pd.Series(list(cnt.values()), index=list(cnt.keys())).sort_values(ascending=False)
print(freq_df.head(n=10))
print("\nMedian: {}\n".format(freq_df.median()))
print(freq_df.describe(percentiles=[0.25,0.5,0.75,0.9]))

game     442821
not      274722
's       139352
play     139233
get      119013
like     109298
great    102644
one       97786
good      83155
time      69146
dtype: int64

Median: 1.0

count    140380.000000
mean         78.487919
std        1861.175078
min           1.000000
25%           1.000000
50%           1.000000
75%           4.000000
90%          19.000000
max      442821.000000
dtype: float64


In [17]:
seq_length_ser = data.str.len()
print("Median length: {}\n".format(seq_length_ser.median()))
seq_length_ser.describe(percentiles=[0.25,0.5,0.75,0.8])

Median length: 12.0



count    332504.000000
mean         33.136846
std          74.867728
min           0.000000
25%           4.000000
50%          12.000000
75%          30.000000
80%          39.000000
max        3162.000000
Name: reviewText, dtype: float64

In [54]:
n_vocab = (freq_df >= 25).sum()
n_seq = 39
print("Using a vocabulary of size: {}".format(n_vocab))
print("Using a sequence length: {}".format(n_seq))

Using a vocabulary of size: 12402
Using a sequence length: 39


In [19]:
print(seq_length_ser.shape[0]/3.0)
seq_length_ser = seq_length_ser.sort_values()
print(seq_length_ser.iloc[:int(seq_length_ser.shape[0]/3.0)].median())
print(seq_length_ser.iloc[int(seq_length_ser.shape[0]/3.0): int(seq_length_ser.shape[0]*2.0/3.0)].median())
print(seq_length_ser.iloc[int(seq_length_ser.shape[0]*2.0/3.0):].median())

110834.66666666667
2.0
12.0
47.0


## Transforming text to numbers

* Only keep most commmon n-worods (Keras)

## What is not covered here
* Instead of words, use n_grams to represent a review

Here we are going to transform the preprocessed text to number sequences and pad them to a fixed length using tf.data

## Defining a Keras tokenizer

In [55]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=n_vocab, oov_token='unk', lower=False)
tokenizer.fit_on_texts(data.tolist())
data_seq = tokenizer.texts_to_sequences(data.tolist())

data_seq = pd.Series(data_seq)
labels = labels.reset_index(drop=True)

## Splitting data to train/valid/test

In [56]:
def train_valid_test_split(text_seq, labels, train_fraction=0.8):
    """ Splits a given dataset into three sets; training, validation and test """
    
    # Remove empty data points
    valid_ids = (data_seq.str.len()>0).index
    text_seq = text_seq.loc[valid_ids]
    labels = labels.loc[valid_ids]
    
    # Separate indices of negative and positive data points
    neg_indices = pd.Series(labels.loc[(labels==0)].index)
    pos_indices = pd.Series(labels.loc[(labels==1)].index)
    
    n_valid = int(min([len(neg_indices), len(pos_indices)]) * ((1-train_fraction)/2.0))
    n_test = n_valid
    
    neg_test_inds = neg_indices.sample(n=n_test)
    neg_valid_inds = neg_indices.loc[~neg_indices.isin(neg_test_inds)].sample(n=n_test)
    neg_train_inds = neg_indices.loc[~neg_indices.isin(neg_test_inds.tolist()+neg_valid_inds.tolist())]
    
    pos_test_inds = pos_indices.sample(n=n_test)
    pos_valid_inds = pos_indices.loc[~pos_indices.isin(pos_test_inds)].sample(n=n_test)
    pos_train_inds = pos_indices.loc[
        ~pos_indices.isin(pos_test_inds.tolist()+pos_valid_inds.tolist())
    ]
    
    tr_x = text_seq.loc[neg_train_inds.tolist() + pos_train_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    tr_y = labels.loc[neg_train_inds.tolist() + pos_train_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    v_x = text_seq.loc[neg_valid_inds.tolist() + pos_valid_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    v_y = labels.loc[neg_valid_inds.tolist() + pos_valid_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    ts_x = text_seq.loc[neg_test_inds.tolist() + pos_test_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    ts_y = labels.loc[neg_test_inds.tolist() + pos_test_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    
    print('Training data: {}'.format(len(tr_x)))
    print('Validation data: {}'.format(len(v_x)))
    print('Test data: {}'.format(len(ts_x)))
    
    return (tr_x, tr_y), (v_x, v_y), (ts_x, ts_y)
    
(tr_x, tr_y), (v_x, v_y), (ts_x, ts_y) = train_valid_test_split(data_seq, labels)

import pickle

with open(os.path.join('data','reviews.pkl'), 'wb') as f:
    pickle.dump(((tr_x, tr_y), (v_x, v_y), (ts_x, ts_y)), f)

Training data: 310388
Validation data: 11058
Test data: 11058


In [57]:
with open(os.path.join('data', 'reviews.pkl'), 'rb') as f:
    (tr_x, tr_y), (v_x, v_y), (ts_x, ts_y) = pickle.load(f)
    
print(tr_y)

219329    1
242101    1
139482    1
165147    0
282412    1
         ..
244172    1
54855     1
17086     0
79139     1
68723     0
Name: label, Length: 310388, dtype: int64


## `tf.data` Pipeline

In [58]:
#data_seq_sample = data_seq[:1000]
#label_sample = labels[:1000].reset_index(drop=True)

with open(os.path.join('data', 'reviews.pkl'), 'rb') as f:
    (tr_x, tr_y), (v_x, v_y), (ts_x, ts_y) = pickle.load(f)

def get_tf_pipeline(text_seq, labels, batch_size=64, bucket_boundaries=[5,15], max_length=50, shuffle=False):
    
    data_seq = [[b]+a for a,b in zip(text_seq, labels) ]
    text_ds = tf.data.Dataset.from_tensor_slices(tf.ragged.constant(data_seq)[:,:max_length])

    bucket_fn = tf.data.experimental.bucket_by_sequence_length(
        lambda x: tf.cast(tf.shape(x)[0],'int32'), 
        bucket_boundaries=bucket_boundaries, 
        bucket_batch_sizes=[batch_size,batch_size,batch_size], 
        padded_shapes=None,
        padding_values=0, 
        pad_to_bucket_boundary=False
    )

    text_ds = text_ds.map(lambda x: x).apply(bucket_fn)
    
    if shuffle:
        text_ds = text_ds.shuffle(buffer_size=10*batch_size)
        
    text_ds = text_ds.map(lambda x: (x[:,1:], x[:,0]))
    #text_ds = text_ds.map(lambda x,y: (tf.one_hot(x, depth=n_vocab),y))
    
    return text_ds

neg_weight = (tr_y==1).sum()/(tr_y==0).sum()
print("Will be using a weight of {} for negative samples".format(neg_weight))


Will be using a weight of 6.017113919471887 for negative samples


## Validate the behavior of bucketing fn

In [53]:
x = [[1,2],[1],[1,2,3], [2,3,6,4,5],[2,0,9,7],[2,4,214,21],[3,4,42,7,3,2,45,52],[3,2,6,543,2,3243,2,134,52,23],[3,32,21,3,2,4,134,45,1,1,45]]
y = [0,0,0, 1, 1, 1, 0, 0, 0]

a = get_tf_pipeline(x, y, batch_size=2, bucket_boundaries=[3,5], max_length=15, shuffle=True)

for x,y in a.take(6):
    print('\n')
    print(x)
    print('\ty=', y)




tf.Tensor(
[[1 2 0]
 [1 2 3]], shape=(2, 3), dtype=int32)
	y= tf.Tensor([0 0], shape=(2,), dtype=int32)


tf.Tensor(
[[   3    2    6  543    2 3243    2  134   52   23    0]
 [   3   32   21    3    2    4  134   45    1    1   45]], shape=(2, 11), dtype=int32)
	y= tf.Tensor([0 0], shape=(2,), dtype=int32)


tf.Tensor(
[[  2   4 214  21   0   0   0   0]
 [  3   4  42   7   3   2  45  52]], shape=(2, 8), dtype=int32)
	y= tf.Tensor([1 0], shape=(2,), dtype=int32)


tf.Tensor([[1]], shape=(1, 1), dtype=int32)
	y= tf.Tensor([0], shape=(1,), dtype=int32)


tf.Tensor(
[[2 3 6 4 5]
 [2 0 9 7 0]], shape=(2, 5), dtype=int32)
	y= tf.Tensor([1 1], shape=(2,), dtype=int32)


In [24]:
train_ds = get_tf_pipeline(tr_x, tr_y, shuffle=True)
valid_ds = get_tf_pipeline(v_x, v_y)

print("Some training data ...")
for x,y in train_ds.take(2):
    print("Input sequence shape: {}".format(x.shape))
    print(y)

print("\nSome validation data ...")
for x,y in valid_ds.take(2):
    print("Input sequence shape: {}".format(x.shape))
    print(y)

Some training data ...
Input sequence shape: (64, 49)
tf.Tensor(
[0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1], shape=(64,), dtype=int32)
Input sequence shape: (64, 49)
tf.Tensor(
[0 1 1 1 1 1 0 0 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1], shape=(64,), dtype=int32)

Some validation data ...
Input sequence shape: (64, 49)
tf.Tensor(
[0 0 0 1 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 1 0 1 0 1 1 0 0 1
 1 1 0 0 0 1 0 0 1 0 1 1 0 1 0 1 1 1 0 1 0 0 1 0 0 0 0], shape=(64,), dtype=int32)
Input sequence shape: (64, 13)
tf.Tensor(
[1 0 0 0 1 0 1 1 1 1 1 1 0 0 1 0 0 1 1 0 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 0 1
 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 1 0 1 1 0 0 1 1 0 1 1 0], shape=(64,), dtype=int32)


## Define the model

In [111]:
import tensorflow.keras.backend as K

K.clear_session()

model = tf.keras.models.Sequential([
    # Create a mask to mask out zero inputs
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None,)),
    # After creating the mask, convert inputs to onehot encoded inputs
    tf.keras.layers.Lambda(lambda x: tf.one_hot(tf.cast(x,'int32'), depth=n_vocab), input_shape=(None,)),
    # Defining an LSTM layer
    tf.keras.layers.LSTM(128, return_state=False, return_sequences=False),
    # Defining a Dense layer
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
masking (Masking)            (None, None)              0         
_________________________________________________________________
lambda (Lambda)              (None, None, 12402)       0         
_________________________________________________________________
lstm (LSTM)                  (None, 128)               6415872   
_________________________________________________________________
dense (Dense)                (None, 512)               66048     
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 513       
Total params: 6,482,433
Trainable params: 6,482,433
Non-trainable params: 0
______________________________________________

## Define data pipelines

In [112]:
print("Defining data pipelines")
batch_size =128
train_ds = get_tf_pipeline(tr_x, tr_y, batch_size=batch_size, shuffle=True)
valid_ds = get_tf_pipeline(v_x, v_y, batch_size=batch_size)
test_ds = get_tf_pipeline(ts_x, ts_y, batch_size=batch_size)
print('\tDone...')

Defining data pipelines
	Done...


## Train the model

Without masking

```
Using metric=val_loss and mode=min for EarlyStopping
Epoch 1/10
4852/4852 [==============================] - 100s 21ms/step - loss: 0.7669 - accuracy: 0.7991 - val_loss: 0.4141 - val_accuracy: 0.8158 - lr: 0.0010
Epoch 2/10
4852/4852 [==============================] - 99s 20ms/step - loss: 0.6120 - accuracy: 0.8472 - val_loss: 0.3767 - val_accuracy: 0.8296 - lr: 0.0010
Epoch 3/10
4852/4852 [==============================] - 99s 20ms/step - loss: 0.5260 - accuracy: 0.8729 - val_loss: 0.4139 - val_accuracy: 0.8264 - lr: 0.0010
Epoch 4/10
4852/4852 [==============================] - 93s 19ms/step - loss: 0.4431 - accuracy: 0.8965 - val_loss: 0.4197 - val_accuracy: 0.8296 - lr: 0.0010
Epoch 5/10
4852/4852 [==============================] - 93s 19ms/step - loss: 0.3685 - accuracy: 0.9179 - val_loss: 0.5000 - val_accuracy: 0.8103 - lr: 0.0010
Epoch 6/10
4852/4852 [==============================] - 96s 20ms/step - loss: 0.2712 - accuracy: 0.9426 - val_loss: 0.6457 - val_accuracy: 0.8149 - lr: 1.0000e-04
Epoch 7/10
4852/4852 [==============================] - 99s 20ms/step - loss: 0.2387 - accuracy: 0.9513 - val_loss: 0.7918 - val_accuracy: 0.8019 - lr: 1.0000e-04
Epoch 8/10
4852/4852 [==============================] - 99s 20ms/step - loss: 0.2179 - accuracy: 0.9566 - val_loss: 0.8088 - val_accuracy: 0.8046 - lr: 1.0000e-04
It took 787.5877232551575 seconds to complete the training
```

With masking

```
Using metric=val_loss and mode=min for EarlyStopping
Epoch 1/10
4852/4852 [==============================] - 97s 20ms/step - loss: 0.7797 - accuracy: 0.7997 - val_loss: 0.4164 - val_accuracy: 0.8064 - lr: 0.0010
Epoch 2/10
4852/4852 [==============================] - 102s 21ms/step - loss: 0.6149 - accuracy: 0.8461 - val_loss: 0.3865 - val_accuracy: 0.8213 - lr: 0.0010
Epoch 3/10
4852/4852 [==============================] - 91s 19ms/step - loss: 0.5263 - accuracy: 0.8716 - val_loss: 0.3870 - val_accuracy: 0.8328 - lr: 0.0010
Epoch 4/10
4852/4852 [==============================] - 96s 20ms/step - loss: 0.4422 - accuracy: 0.8969 - val_loss: 0.4511 - val_accuracy: 0.8229 - lr: 0.0010
Epoch 5/10
4852/4852 [==============================] - 98s 20ms/step - loss: 0.3710 - accuracy: 0.9164 - val_loss: 0.4894 - val_accuracy: 0.8225 - lr: 0.0010
Epoch 6/10
4852/4852 [==============================] - 94s 19ms/step - loss: 0.2735 - accuracy: 0.9426 - val_loss: 0.6428 - val_accuracy: 0.8100 - lr: 1.0000e-04
Epoch 7/10
4852/4852 [==============================] - 94s 19ms/step - loss: 0.2425 - accuracy: 0.9501 - val_loss: 0.7428 - val_accuracy: 0.8048 - lr: 1.0000e-04
Epoch 8/10
4852/4852 [==============================] - 96s 20ms/step - loss: 0.2217 - accuracy: 0.9555 - val_loss: 0.8030 - val_accuracy: 0.8009 - lr: 1.0000e-04
It took 775.7794342041016 seconds to complete the training
```

In [113]:
os.makedirs('eval', exist_ok=True)

# Logging the performance metrics to a CSV file
csv_logger = tf.keras.callbacks.CSVLogger(os.path.join('eval','1_sentiment_analysis.log'))

monitor_metric = 'val_loss'
mode = 'min' if 'loss' in monitor_metric else 'max'
print("Using metric={} and mode={} for EarlyStopping".format(monitor_metric, mode))

# Reduce LR callback
lr_callback = tf.keras.callbacks.ReduceLROnPlateau(
    monitor=monitor_metric, factor=0.1, patience=3, mode=mode, min_lr=1e-8
)

# EarlyStopping itself increases the memory requirement
# restore_best_weights will increase the memory req for large models
es_callback = tf.keras.callbacks.EarlyStopping(
    monitor=monitor_metric, patience=6, mode=mode, restore_best_weights=False
)

t1 = time.time()

model.fit(train_ds, validation_data=valid_ds, epochs=10, class_weight={0:neg_weight, 1:1.0}, callbacks=[es_callback, lr_callback, csv_logger])
t2 = time.time()

print("It took {} seconds to complete the training".format(t2-t1))

Using metric=val_loss and mode=min for EarlyStopping
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
It took 631.7865388393402 seconds to complete the training


In [114]:
model.evaluate(test_ds)



[0.8022735714912415, 0.8069270849227905]

## Sentiment analysis with an Embedding layer

In [115]:
import tensorflow.keras.backend as K

K.clear_session()

model = tf.keras.models.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None,)),
    # Create a mask to mask out zero inputs
    tf.keras.layers.Embedding(input_dim=n_vocab+1, output_dim=128, 
                              #mask_zero=True, 
                              input_shape=(None,)),
    # Defining an LSTM layer
    tf.keras.layers.LSTM(128, return_state=False, return_sequences=False),
    # Defining a Dense layer
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
masking (Masking)            (None, None)              0         
_________________________________________________________________
embedding (Embedding)        (None, None, 128)         1587584   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 512)               66048     
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 513       
Total params: 1,785,729
Trainable params: 1,785,729
Non-trainable params: 0
______________________________________________

In [116]:
print("Defining data pipelines")
batch_size=128
train_ds = get_tf_pipeline(tr_x, tr_y, batch_size=128, shuffle=True)
valid_ds = get_tf_pipeline(v_x, v_y, batch_size=128,)
test_ds = get_tf_pipeline(ts_x, ts_y, batch_size=128)
print('\tDone...')

os.makedirs('eval', exist_ok=True)

# Logging the performance metrics to a CSV file
csv_logger = tf.keras.callbacks.CSVLogger(os.path.join('eval','3_sentiment_analysis.log'))

monitor_metric = 'val_loss'
mode = 'min' if 'loss' in monitor_metric else 'max'
print("Using metric={} and mode={} for EarlyStopping".format(monitor_metric, mode))

# Reduce LR callback
lr_callback = tf.keras.callbacks.ReduceLROnPlateau(
    monitor=monitor_metric, factor=0.1, patience=3, mode=mode, min_lr=1e-8
)

# EarlyStopping itself increases the memory requirement
# restore_best_weights will increase the memory req for large models
es_callback = tf.keras.callbacks.EarlyStopping(
    monitor=monitor_metric, patience=6, mode=mode, restore_best_weights=False
)

t1 = time.time()

model.fit(train_ds, validation_data=valid_ds, epochs=25, class_weight={0:neg_weight, 1:1.0}, callbacks=[es_callback, lr_callback, csv_logger])
t2 = time.time()

print("It took {} seconds to complete the training".format(t2-t1))

Defining data pipelines
	Done...
Using metric=val_loss and mode=min for EarlyStopping
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
It took 269.5488338470459 seconds to complete the training


## Evaluate on the test set

In [117]:
test_ds = get_tf_pipeline(ts_x, ts_y, batch_size=128)
model.evaluate(test_ds)



[0.7106139063835144, 0.8234761953353882]

## Analyse some of the top positive/negative sentiments predicted by the model

In [118]:
test_ds = get_tf_pipeline(ts_x, ts_y, batch_size=128)

test_x = []
test_pred = []
test_y = []
for x, y in test_ds:
    test_x.append(x)    
    test_pred.append(model.predict(x))
    test_y.append(y)

test_x = [doc for t in test_x for doc in t.numpy().tolist()]
print("X: {}".format(len(test_x)))
test_pred = tf.concat(test_pred, axis=0).numpy()
print("Pred: {}".format(test_pred.shape))
test_y = tf.concat(test_y, axis=0).numpy()
print("Y: {}".format(test_y.shape))

X: 11058
Pred: (11058, 1)
Y: (11058,)


## Printing the reviews

In [119]:
sorted_pred = np.argsort(test_pred.flatten())
min_pred = sorted_pred[:5]
max_pred = sorted_pred[-5:]

print("Most negative reviews\n")
print("="*50)
for i in min_pred:    
    print(" ".join(tokenizer.sequences_to_texts([test_x[i]])), '\n')
    
print("\nMost positive reviews\n")
print("="*50)
for i in max_pred:
    print(" ".join(tokenizer.sequences_to_texts([test_x[i]])), '\n')


Most negative reviews

buy game high rating promise gameplay saw youtube story so-so graphic mediocre control terrible could not adjust control option preference .. crouch would hold onto left trigger could slip ... also double tap right trigger change weapon suck .. fire weapon require push right button not right trigger often 

attempt install game quad core windows 7 pc zero luck go back forth try every suggestion rockstar support absolutely useless game defect manufacturer not buy side note 'm also po amazon wo not anything either consumer guess 'm totally unk not right either rockstar amazon refund money terrible customer 

way product 5 star 28 review write tone lot review similar play 2 song expert drum say unless play tennis shoe fact screw not flush mean feel every kick specifically two screw leave plus pedal completely torn mount screw something actually go wrong pedal instal unscrew send back ea 

unk interactive stranger unk unk genre develop operation flashpoint various re

## Bonus: Training the model with a custom loss

In [32]:
import tensorflow.keras.backend as K

K.clear_session()

model = tf.keras.models.Sequential([
    # Create a mask to mask out zero inputs
    #tf.keras.layers.Masking(mask_value=0.0, input_shape=(None,)),
    # After creating the mask, convert inputs to onehot encoded inputs
    tf.keras.layers.Lambda(lambda x: tf.one_hot(tf.cast(x,'int32'), depth=n_vocab), input_shape=(None,)),
    # Defining an LSTM layer
    tf.keras.layers.LSTM(256, return_state=False, return_sequences=False),
    # Defining a Dense layer
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

def weighted_binary_crossentropy(y_true, y_pred):
    
    pos_mask = tf.cast(tf.math.equal(y_true, 1),'float32')
    n_pos = tf.reduce_sum(pos_mask)
    neg_mask = tf.cast(tf.math.equal(y_true, 0),'float32')
    n_neg = tf.reduce_sum(neg_mask)
    
    w_pos = n_neg / (n_pos+n_neg)
    w_neg = n_pos / (n_pos+n_neg)
    
    w_mask = (pos_mask*w_pos) + (neg_mask*w_neg)
    
    bce = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
    
    return tf.reduce_mean(bce(y_true, y_pred)*w_mask)

# Compile the model
model.compile(loss=weighted_binary_crossentropy, optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lambda (Lambda)              (None, None, 6609)        0         
_________________________________________________________________
lstm (LSTM)                  (None, 200)               5448000   
_________________________________________________________________
dense (Dense)                (None, 1)                 201       
Total params: 5,448,201
Trainable params: 5,448,201
Non-trainable params: 0
_________________________________________________________________


In [33]:
print("Defining data pipelines")
train_ds = get_tf_pipeline(tr_x, tr_y, shuffle=True)
valid_ds = get_tf_pipeline(v_x, v_y)
test_ds = get_tf_pipeline(ts_x, ts_y)
print('\tDone...')

os.makedirs('eval', exist_ok=True)

# Logging the performance metrics to a CSV file
csv_logger = tf.keras.callbacks.CSVLogger(os.path.join('eval','2_sentiment_analysis.log'))

monitor_metric = 'val_loss'
mode = 'min' if 'loss' in monitor_metric else 'max'
print("Using metric={} and mode={} for EarlyStopping".format(monitor_metric, mode))

# Reduce LR callback
lr_callback = tf.keras.callbacks.ReduceLROnPlateau(
    monitor=monitor_metric, factor=0.1, patience=3, mode=mode, min_lr=1e-8
)

# EarlyStopping itself increases the memory requirement
# restore_best_weights will increase the memory req for large models
es_callback = tf.keras.callbacks.EarlyStopping(
    monitor=monitor_metric, patience=6, mode=mode, restore_best_weights=False
)

t1 = time.time()

model.fit(train_ds, validation_data=valid_ds, epochs=10, callbacks=[es_callback, lr_callback, csv_logger])
t2 = time.time()

print("It took {} seconds to complete the training".format(t2-t1))

Defining data pipelines
	Done...
Using metric=val_loss and mode=min for EarlyStopping
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
It took 804.270180940628 seconds to complete the training
