# LSTMs for review-text-based rating prediction

Text data comes as sequences of variable length, which have to be brought to a common length before training. Training text-analysis networks with regular LSTM cells is time-consuming. CuDNNLSTM, a fast GPU implementation of LSTM, does not support masking of input sequences so far, so it is fair to ask whether it is still useful for NLP tasks.  
This notebook is a simple time/performance benchmark of LSTM- and CuDNNLSTM-based models trained on zero-padded and concatenated data, with and without masking.

Thanks to @kratisaxena for the nice [EDA](https://www.kaggle.com/kratisaxena/eda-classification-for-reviews-using-rnn) I learned a lot from.



In [None]:
# Load Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import os
import time
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model
from keras.layers import Embedding, Flatten, Dense, LSTM, GRU, Dropout, Input

Let's preview the data file.

In [None]:
df = pd.read_csv("../input/Womens Clothing E-Commerce Reviews.csv", index_col=0)
print('Records:',len(df))
df.head()

In this approach the Rating is predicted from the Review Text content, so let's drop the records with a missing Review Text value.

In [None]:
df = df[pd.notnull(df['Review Text'])]
df.info()

## Download and extract word embeddings

In [None]:
import requests
import zipfile

url = "http://nlp.stanford.edu/data/glove.6B.zip"
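r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # fail early if the download did not succeed
with open('../working/test.zip', 'wb') as zip_file:  # close the file handle after writing
    zip_file.write(r.content)

with zipfile.ZipFile('../working/test.zip', 'r') as zip_ref:
    zip_ref.extractall('../working/')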
    
os.listdir('../working/')

There are a few versions available in the downloaded package; I'll pick the 300d one.

In [None]:
n_embeddings=300

Configure a tokenizer and fit it on the available Review Text data:

In [None]:
from keras.preprocessing.text import Tokenizer
max_words = 15000

t = Tokenizer(num_words=max_words, char_level=False, split=' ')
t.fit_on_texts( df['Review Text'])

vocab_size = len(t.word_index)+1
vocab_size
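
As a rough illustration of what a frequency-ranked word index looks like, the sketch below builds one with plain Python. It is not the Keras implementation: the helper name `build_word_index` and the toy reviews are made up for this example. Index 0 is left unused so that it can later serve as the padding value.

```python
from collections import Counter

def build_word_index(texts, num_words=None):
    """Hypothetical helper: the most frequent word gets index 1, the next one 2, ..."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    if num_words is not None:
        ranked = ranked[:num_words - 1]  # keep only indices below num_words
    return {word: i + 1 for i, word in enumerate(ranked)}

# made-up reviews, only for illustration
toy_reviews = ["love this dress", "this dress runs small", "love love love it"]
word_index = build_word_index(toy_reviews)
toy_sequences = [[word_index[w] for w in text.split()] for text in toy_reviews]
print(word_index)
print(toy_sequences)
```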

Make sequences of tokens:

In [None]:
sequences = t.texts_to_sequences(df['Review Text'])

df['n_tokens'] = [len(seq) for seq in sequences] # get length of sequences
df.drop(df[df['n_tokens']<2].index, inplace=True) # remove short sequences
df.reset_index(inplace=True)

sequences = t.texts_to_sequences(df['Review Text']) # tokenize again
max_length = max(df['n_tokens'])
print('Max length of sequence:',max_length)

## Two ways of representing tokenized text (sequences):
* **Zero padding**  
Each sequence can be zero-padded to create a dense ndarray.

In [None]:
from keras.preprocessing.sequence import pad_sequences
sequences_pad = pad_sequences(sequences=sequences, maxlen=max_length, padding='post')

* **Concatenating**  
As CuDNNLSTM does not support masking, the network might benefit from concatenated word sequences instead of zero padding. In this second representation, a sequence shorter than max_length is repeated, concatenated with itself and clipped to max_length tokens. For example, with max_length equal to 10 the sequence [1, 43, 9] becomes [1, 43, 9, 1, 43, 9, 1, 43, 9, 1]. This corresponds to feeding "this is great this is great this is great this" instead of "this is great _ _ _ _ _ _ _", where the underscore stands for the special padding token.

In [None]:
# repeat each sequence just enough times to reach max_length, then clip it
sequences_concat = [np.asarray((seq * int(np.ceil(max_length / len(seq))))[:max_length]) for seq in sequences]
sequences_concat = np.vstack(sequences_concat)
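
To make the difference between the two representations concrete, here is a minimal sketch on the example sequence from the text. The helpers `pad_post` and `repeat_to_length` are hypothetical names introduced only for this illustration; the notebook itself uses `pad_sequences` and a list comprehension.

```python
import numpy as np

def pad_post(seq, maxlen, value=0):
    """Hypothetical helper: append `value` until the sequence has maxlen tokens."""
    return np.asarray((list(seq) + [value] * maxlen)[:maxlen])

def repeat_to_length(seq, maxlen):
    """Hypothetical helper: tile the sequence and clip it to maxlen tokens."""
    repeats = -(-maxlen // len(seq))  # ceiling division
    return np.asarray((list(seq) * repeats)[:maxlen])

seq = [1, 43, 9]
padded = pad_post(seq, 10)
repeated = repeat_to_length(seq, 10)
print(padded)
print(repeated)
```

Both arrays have the same length, so either can be stacked into a dense matrix; only the padded one contains the value 0 that a mask could skip.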

## Reading the embedding matrix

In [None]:
# load the whole embedding into memory
embeddings_index = dict()
with open('glove.6B.'+str(n_embeddings)+'d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]  # the first field is the word, the rest are vector components
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

In [None]:
# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocab_size, n_embeddings))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
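
The same lookup can be tried on a toy scale to see what happens to words without a pretrained vector: their rows simply stay at zero. The vocabulary and the 2-dimensional vectors below are made up for illustration and have nothing to do with the real GloVe values.

```python
import numpy as np

# made-up vocabulary and vectors, only for illustration
toy_word_index = {"dress": 1, "love": 2, "zzzunknown": 3}
toy_vectors = {"dress": np.array([0.1, 0.2]), "love": np.array([0.3, 0.4])}

toy_matrix = np.zeros((len(toy_word_index) + 1, 2))  # row 0 is reserved for padding
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        toy_matrix[i] = vec

missing = [w for w in toy_word_index if w not in toy_vectors]
print(toy_matrix)
print("Words without a pretrained vector:", missing)
```

Counting the missing words this way is a cheap sanity check on how much of the vocabulary the pretrained embeddings actually cover.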

## Prepare training/validation/testing sets

In [None]:
labels = np.asarray(df["Rating"].values)
print('Shape of seq concat tensor:', sequences_concat.shape)
print('Shape of seq padded tensor:', sequences_pad.shape)
print('Shape of label tensor:', labels.shape)

In [None]:
indices = np.arange(df.shape[0])
np.random.shuffle(indices)
sequences_pad = sequences_pad[indices]
sequences_concat = sequences_concat[indices]
labels = labels[indices]

In [None]:
trainingP = 0.6
validationP = 0.2
testP = 0.2

training_samples = int(len(sequences_concat)*trainingP)
validation_samples = training_samples + int(len(sequences_concat)*validationP)

x_trainP = sequences_pad[:training_samples]
x_trainC = sequences_concat[:training_samples]
y_train = labels[:training_samples]

x_valP = sequences_pad[training_samples: validation_samples] 
x_valC = sequences_concat[training_samples: validation_samples] 
y_val = labels[training_samples: validation_samples]

x_testP = sequences_pad[validation_samples:]
x_testC = sequences_concat[validation_samples:]
y_test = labels[validation_samples:]

x_test_text = df['Review Text'].loc[indices][validation_samples:]
x_test_text = x_test_text.tolist()
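
The shuffle-then-slice logic above can be checked on a toy array. This is only a sketch: `split_bounds` is a hypothetical helper and the data is a made-up range of ten numbers.

```python
import numpy as np

def split_bounds(n_samples, train_frac=0.6, val_frac=0.2):
    """Hypothetical helper: return the two cut points of a train/val/test split."""
    train_end = int(n_samples * train_frac)
    val_end = train_end + int(n_samples * val_frac)
    return train_end, val_end

rng = np.random.RandomState(0)          # fixed seed so the toy split is repeatable
toy_x = np.arange(10)
toy_idx = rng.permutation(len(toy_x))   # shuffled positions, as with np.random.shuffle
train_end, val_end = split_bounds(len(toy_x))
train, val, test = np.split(toy_x[toy_idx], [train_end, val_end])
print(len(train), len(val), len(test))
```

Because the test set takes everything after the second cut point, any rounding remainder from `int(...)` ends up in the test part.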

## RNN network and helpers

The **Score1** metric is the fraction of examples whose rating was predicted with an absolute error below 1.  
The root mean squared error (**rmse**) is used as the loss function.

In [None]:
import keras.backend as K

def Score1(y_true, y_pred):
    # fraction of predictions whose absolute error is below 1
    abs_error = K.abs(y_pred - y_true)
    return K.mean(K.cast(K.less(abs_error, 1.0), K.floatx()), axis=-1)

def rmse(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true), axis=-1))
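
A NumPy sketch on made-up numbers shows what the two functions compute. One thing worth noting, as my reading of the code rather than a statement from the notebook: with a single output unit, the mean over `axis=-1` runs over one element, so the per-sample value of `rmse` equals the absolute error, and its batch average behaves like a mean absolute error rather than a batch-level RMSE.

```python
import numpy as np

# made-up ratings and predictions, shape (batch, 1) like the model output
y_true = np.array([[5.0], [3.0], [1.0], [4.0]])
y_pred = np.array([[4.4], [4.5], [1.2], [2.0]])

abs_err = np.abs(y_pred - y_true)
score1 = np.mean(abs_err < 1.0)  # fraction of errors below 1

# per-sample value, mirroring the axis=-1 reduction of rmse()
per_sample = np.sqrt(np.mean(np.square(y_pred - y_true), axis=-1))
# RMSE taken over the whole batch at once, for comparison
batch_rmse = np.sqrt(np.mean(np.square(y_pred - y_true)))

print(score1)
print(np.allclose(per_sample, abs_err[:, 0]))
print(per_sample.mean(), batch_rmse)
```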

A callback class that tracks the duration of each training epoch:

In [None]:
class TimeHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        self.times = []  # one entry per finished epoch, in seconds
    def on_epoch_begin(self, epoch, logs=None):
        self.epoch_time_start = time.time()
    def on_epoch_end(self, epoch, logs=None):
        self.times.append(time.time() - self.epoch_time_start)
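
The per-epoch durations collected this way can be turned into a wall-clock axis with a cumulative sum. The loop below is only a stand-in for training (the summation is a dummy workload), sketched to show the bookkeeping outside of Keras.

```python
import time
import numpy as np

epoch_times = []
for epoch in range(3):
    start = time.time()
    sum(i * i for i in range(10000))      # dummy workload standing in for one epoch
    epoch_times.append(time.time() - start)

wall_clock = np.cumsum(epoch_times)       # elapsed time at the end of each epoch
print(wall_clock)
```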

Simple custom bidirectional RNN network with optional custom memory cell.

In [None]:
from keras.layers import Bidirectional, CuDNNLSTM
from keras.layers.normalization import BatchNormalization
from keras.optimizers import adam

def build_RNN(LSTM_CELL=LSTM, mask_zero=False):
            
    model = Sequential()     
    model.add(Embedding(input_dim    = vocab_size, 
                        output_dim   = n_embeddings, 
                        weights      = [embedding_matrix], 
                        input_length = max_length,
                        mask_zero    = mask_zero,
                        trainable    = False))    
    model.add(Bidirectional( LSTM_CELL(24, return_sequences=False)) )
    model.add(BatchNormalization())
    model.add(Dropout(rate=0.7))         
    model.add(Dense(units=4,  activation='relu'))
    model.add(Dense(units=1,  activation='relu'))
    
    optimizer = adam(clipnorm=1.0)
    model.compile(optimizer=optimizer, loss=rmse, metrics=['acc', Score1]) 
    
    return model

The dataset is unbalanced in terms of rating distribution, so let's weight the samples by class in the loss function during training:

In [None]:
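from sklearn.utils import class_weight
rating_classes = np.unique(df['Rating'])
class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                 classes=rating_classes,
                                                 y=df['Rating'])
# map each rating value to its weight instead of assuming ratings 1..5
class_weight_dict = {int(c): w for c, w in zip(rating_classes, class_weights)}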
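
For intuition, the 'balanced' heuristic assigns each class the weight `n_samples / (n_classes * count_of_that_class)`, so rare ratings get larger weights. The sketch below reproduces that formula with NumPy on made-up ratings; the array and variable names are illustrative only.

```python
import numpy as np

# made-up ratings; rating 2 never occurs in this toy sample
toy_ratings = np.array([5, 5, 5, 5, 4, 4, 3, 1])

classes, counts = np.unique(toy_ratings, return_counts=True)
weights = len(toy_ratings) / (len(classes) * counts)   # the 'balanced' heuristic
toy_weight_dict = dict(zip(classes.tolist(), weights.tolist()))
print(toy_weight_dict)
```

Building the dictionary from the classes that are actually present avoids a key mismatch when some rating value is missing from the data.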

In [None]:
def make_training(LSTM_CELL, mask_zero, x_train, x_val, epochs, batch_size):
    #K.clear_session()
    from numpy.random import seed as nseed
    nseed(2019)
    from tensorflow import set_random_seed
    set_random_seed(2019)
    
    time_callback = TimeHistory()    
    model = build_RNN(LSTM_CELL = LSTM_CELL, mask_zero = mask_zero)  # pass the flag through instead of hard-coding False
    
    history_RNN = model.fit(x_train, y_train,
                        callbacks  = [time_callback],
                        epochs     = epochs,
                        batch_size = batch_size,
                        shuffle         = True,
                        class_weight    = class_weight_dict,
                        validation_data = (x_val, y_val))

    return history_RNN, time_callback, model

## Training different models (CuDNNLSTM/LSTM)
With the data ready and the model function finished, let's train four models: two with CuDNNLSTM and two with LSTM, fed with differently prepared data:

In [None]:
epochs = 10
batch  = 64

In [None]:
h1, t1, m1 = make_training(CuDNNLSTM, False, x_trainC, x_valC, epochs=epochs, batch_size=batch)

In [None]:
h2, t2, m2 = make_training(CuDNNLSTM, False, x_trainP, x_valP, epochs=epochs, batch_size=batch)

In [None]:
h3, t3, m3 = make_training(LSTM, False, x_trainC, x_valC, epochs=epochs, batch_size=batch)

In [None]:
h4, t4, m4 = make_training(LSTM, True, x_trainP, x_valP, epochs=epochs, batch_size=batch)

In [None]:
hlist = [h1, h2, h3, h4]
tlist = [t1, t2, t3, t4]
labels = ['CuDNN | no mask | con',
          'CuDNN | no mask | pad',
          'no Cu | no mask | con',
          'no Cu |   mask  | pad']
markers=['x','o','d','s']

Note that although the charts plotted against the epoch number are very similar (which is good: the CuDNNLSTM-based models are learning well), the wall-time plots reveal the obvious time-saving advantage of the fast cells.

In [None]:
fig, axs = plt.subplots(nrows=4, ncols=3, figsize=(12, 10))
for x_sel in [0,1]:
    for i, opt in enumerate(['loss','acc', 'Score1']+['val_loss','val_acc', 'val_Score1']):
        for _h, _t, _l, _m in zip(hlist, tlist, labels, markers):
            ax = axs[i//3+x_sel*2, i%3]             
            x_axis = range(1,epochs+1) if x_sel == 0 else np.cumsum(_t.times)
            x_label = 'epochs' if x_sel == 0 else 'time [s]'
            ax.plot(x_axis, _h.history[opt], label=_l, marker=_m)
        ax.set_title(opt)
        ax.set_xlabel(x_label)    
ax.legend()
plt.tight_layout()

If you look closely, the validation plots of the models trained on the concatenated data may be less noisy than those of the models trained on the padded data (even with a mask, for the LSTM cell). On the other hand, they give better results, at least within the investigated 10-epoch period. With that, let's train the CuDNNLSTM-based models for some more epochs to see in practice how the Rating prediction works.

In [None]:
more_epochs = 50

In [None]:
hc, tc, mc = make_training(CuDNNLSTM, False, x_trainC, x_valC, epochs=more_epochs, batch_size=batch)

In [None]:
hp, tp, mp = make_training(CuDNNLSTM, False, x_trainP, x_valP, epochs=more_epochs, batch_size=batch)

In [None]:
hlist2  = [hc, hp]
tlist2  = [tc, tp]
labels2 = ['CuDNN | no mask | con',
           'CuDNN | no mask | pad']
markers2=['x','o']

In [None]:
fig, axs = plt.subplots(nrows=4, ncols=3, figsize=(12, 10))
for x_sel in [0,1]:
    for i, opt in enumerate(['loss','acc', 'Score1']+['val_loss','val_acc', 'val_Score1']):
        for _h, _t, _l, _m in zip(hlist2, tlist2, labels2, markers2):
            ax = axs[i//3+x_sel*2, i%3]             
            x_axis = range(1,more_epochs+1) if x_sel == 0 else np.cumsum(_t.times)
            x_label = 'epochs' if x_sel == 0 else 'time [s]'
            ax.plot(x_axis, _h.history[opt], label=_l, marker=_m)
        ax.set_title(opt)
        ax.set_xlabel(x_label)    
ax.legend()
plt.tight_layout()

## Model performance investigation
To tell which model (trained on concatenated or padded data) is better, more epochs should be considered. The concatenated-data-based model seems to give more promising and reliable results, so let's pick it for further performance investigation:

In [None]:
scores = mc.evaluate(x_testC, y_test)
print('Loss: {0:.4}\t Accuracy: {1:.4}\t Score1: {2:.4}'.format(scores[0],scores[1],scores[2]))

Let the model predict outputs of the testing set.

In [None]:
output = mc.predict(x=x_testC)
diffs  = np.squeeze(output)-y_test

In [None]:
p1=plt.hist(diffs, bins=9)
p1=plt.title('Rating prediction error histogram')
p1=plt.xlabel('Rating prediction error (RPE)')
p1=plt.ylabel('Quantity [n]')

In [None]:
abs_diff = np.abs(diffs)
qu, vals = np.histogram(abs_diff, range=[0, 5], bins=20)
p2=plt.bar(np.arange(0,5,0.25),qu)
p2=plt.title('Absolute differences histogram')
p2=plt.xlabel('Absolute Rating prediction error (ARPE)')
p2=plt.ylabel('Quantity [n]')

In [None]:
p3=plt.plot(vals[1:], 100*np.cumsum(qu)/len(abs_diff), 'b-o') # cumulative counts belong to the upper bin edges
p3=plt.grid()
p3=plt.xlabel('Absolute Rating prediction error (ARPE)')
p3=plt.ylabel('Percentage of Ratings recognized\nwith error less than given ARPE')
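
The cumulative curve can be read as "what share of ratings is predicted with an error below x". A small sketch on made-up absolute errors shows the computation; it pairs each cumulative count with the upper edge of its histogram bin, since that is the threshold the count refers to.

```python
import numpy as np

# made-up absolute errors, only for illustration
toy_abs_err = np.array([0.1, 0.2, 0.3, 0.6, 0.9, 1.4, 2.2])

counts, edges = np.histogram(toy_abs_err, range=[0, 5], bins=20)  # bins of width 0.25
cum_pct = 100 * np.cumsum(counts) / len(toy_abs_err)

for upper, pct in zip(edges[1:5], cum_pct[:4]):
    print("error < {0:.2f}: {1:.1f}%".format(upper, pct))
```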

In [None]:
p4=plt.scatter(df['n_tokens'].loc[indices][validation_samples:].tolist(), np.abs(np.squeeze(output)-y_test), alpha=0.15)
p4=plt.xlabel('Review Text length (words) [n]')
p4=plt.ylabel('Absolute Rating prediction error (ARPE)')

In [None]:
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, np.round(np.squeeze(output)).astype(int)) # round predictions to integer ratings
print('Confusion matrix:')
print(cm)
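
Since the network outputs a continuous value, the confusion matrix relies on rounding the prediction to the nearest rating. The sketch below does this by hand on made-up numbers; clipping the rounded value to the 1..5 range is an assumption added here so that every prediction falls into one of the five rating classes.

```python
import numpy as np

# made-up true ratings and continuous model outputs
toy_true = np.array([5, 4, 3, 5, 1])
toy_out = np.array([4.6, 4.4, 3.7, 5.3, 0.2])

# round to the nearest rating and clip to the valid range (assumption: ratings are 1..5)
toy_pred = np.clip(np.rint(toy_out), 1, 5).astype(int)

cm_toy = np.zeros((5, 5), dtype=int)                 # rows: true rating, columns: predicted
np.add.at(cm_toy, (toy_true - 1, toy_pred - 1), 1)   # count each (true, predicted) pair
print(cm_toy)
```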

The confusion matrix above may not look superb, but it is a good improvement over training without class_weight=class_weight_dict.

Before checking the predicted Rating for selected examples, let's see if the decoded testing set (x_testC) really contains the expected line:

In [None]:
first_words = 12
text_concat = []
for i, word in enumerate(x_testC[0,:]):
    if word == 0 or i == first_words: # if special character or we don't need more words
        break
    text_concat.append(t.index_word[word])    

print('Original text:\t',' '.join(x_test_text[0].split(' ')[0:first_words])) # from df['Review Text']
print('From tokens:\t',  ' '.join(text_concat))
print('Tokens: \t', x_testC[0,:first_words])

The last thing to do is to take several text reviews from the testing set and to display them along with the associated customer rating and the rating predicted by the RNN network:

In [None]:
for i in range(20):
    print('---------------------------')
    print('"'+x_test_text[i]+'"')
    print('Rating:\t\t{0}'.format(y_test[i]))
    print('Prediction:\t{0:.3}'.format(output[i][0]))
    print('Is correct: \t{0}'.format(bool(np.abs(output[i][0]-y_test[i])<0.5)))
    print('Is close: \t{0}'.format(bool(np.abs(output[i][0]-y_test[i])<1.0)))
    