In [1]:
from Attention import AttentionLayer #Used for our sequence-to-sequence network. Keras did not provide the attention layer I needed, so I downloaded one from the internet

In [2]:
#Import necessary packages
import numpy as np
import pandas as pd 
import re
from bs4 import BeautifulSoup
from keras.preprocessing.text import Tokenizer 
from keras.preprocessing.sequence import pad_sequences
from nltk.corpus import stopwords
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping

Using TensorFlow backend.


In [3]:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Num GPUs Available:  1


---

## Explore and clean the dataset

---

In [4]:
data=pd.read_csv("Reviews.csv",nrows=200000)

In [5]:
data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      200000 non-null  int64 
 1   ProductId               200000 non-null  object
 2   UserId                  200000 non-null  object
 3   ProfileName             199992 non-null  object
 4   HelpfulnessNumerator    200000 non-null  int64 
 5   HelpfulnessDenominator  200000 non-null  int64 
 6   Score                   200000 non-null  int64 
 7   Time                    200000 non-null  int64 
 8   Summary                 199992 non-null  object
 9   Text                    200000 non-null  object
dtypes: int64(5), object(5)
memory usage: 15.3+ MB


We can see that many of these columns are not useful for our problem, so I am going to narrow down what I want to look at. First of all, I will check for null values and then work from there.

In [7]:
#Checking for null values
data.isnull().sum()

Id                        0
ProductId                 0
UserId                    0
ProfileName               8
HelpfulnessNumerator      0
HelpfulnessDenominator    0
Score                     0
Time                      0
Summary                   8
Text                      0
dtype: int64

In [8]:
#Removing duplicates and null values
data.drop_duplicates(subset=['Text'],inplace=True)#dropping duplicates
data.dropna(axis=0,inplace=True)#dropping na

---

## Data Pre-processing

---

In order to use the review text for my model, I first need to clean the dataset. This means removing anything that makes the text messy and harder for the model to work with. One thing that I have struggled with in NLP projects is keeping a word's meaning after expanding a contraction. To do this effectively, I need a dictionary of as many contractions as I can think of.

In [9]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",
                           "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have","wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                           "you're": "you are", "you've": "you have"}

The next steps for pre-processing are as follows:
- Convert everything to lowercase
- Remove HTML tags
- Remove any text inside parentheses ( )
- Fix contractions
- Eliminate punctuation and special characters
- Remove stopwords and "short words"
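The "fix contractions" step in the list above can be sketched on its own as a token-by-token dictionary lookup. This is only an illustration: the three-entry `mini_mapping` and the helper name `expand_contractions` are assumptions made for this example, while the notebook itself uses the full `contraction_mapping` dictionary.

```python
# Hypothetical three-entry subset of the full contraction dictionary.
mini_mapping = {"don't": "do not", "it's": "it is", "can't": "cannot"}

def expand_contractions(text, mapping):
    """Replace every space-separated token that has an entry in `mapping`."""
    # dict.get returns the token unchanged when no expansion is known.
    return " ".join(mapping.get(token, token) for token in text.split(" "))

example = expand_contractions("i don't think it's fresh", mini_mapping)
print(example)  # i do not think it is fresh
```

Because the lookup is an exact match on each token, a contraction with punctuation attached (for example `don't,`) is left as it is.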

In [10]:
#Let's create a function to do the above pre-processing steps
stop_words = set(stopwords.words('english')) #Stop words are very common words that carry little meaning, such as "I", "am", etc.

def text_cleaner(text,num):
    new_text = text.lower() #Making all words lowercase
    new_text = BeautifulSoup(new_text, "lxml").text #Removing any HTML tags
    new_text = re.sub(r'\([^)]*\)', '', new_text) #Removing any text inside parentheses
    new_text = re.sub('"','', new_text) #Removing double quotes
    new_text = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in new_text.split(" ")]) #Using the contraction dictionary to expand contractions
    new_text = re.sub(r"'s\b","",new_text) #Removing possessive 's
    new_text = re.sub("[^a-zA-Z]", " ", new_text) #Replacing every character that is not a letter with a space
    new_text = re.sub('[m]{2,}', 'mm', new_text) #Collapsing runs of two or more "m" into "mm"
    if(num==0):
        tokens = [w for w in new_text.split() if w not in stop_words] #num==0 (review text): remove stop words
    else:
        tokens=new_text.split() #Otherwise (summaries): keep stop words
    long_words=[]
    for i in tokens:
        if len(i)>1: #removing one-letter words
            long_words.append(i)   
    return (" ".join(long_words)).strip()

#Calling the function to clean the dataset, both the review column and the summary column. 
clean_text = []
for t in data['Text']:
    clean_text.append(text_cleaner(t,0))
clean_summary = []
for t in data['Summary']:
    clean_summary.append(text_cleaner(t,1))



In [11]:
#Example of a processed review
clean_text[:1]

['bought several vitality canned dog food products found good quality product looks like stew processed meat smells better labrador finicky appreciates product better']

In [12]:
#Creating new columns in the dataset and cleaning up the dataframe
data['clean_text']=clean_text
data['clean_summary']=clean_summary
data.replace('', np.nan, inplace=True)
data.dropna(axis=0,inplace=True)

In [13]:
#Looking at the first 2 entries in our dataset
data.head(2)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,clean_text,clean_summary
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,bought several vitality canned dog food produc...,good quality dog food
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,product arrived labeled jumbo salted peanuts p...,not as advertised


Now, let's set the maximum length for both the summary and the text. We do not want some reviews to be incredibly long, and shorter sequences need less padding and are cheaper to train on, which should also help the model perform more accurately.

In [14]:
max_text_len=30 #30 word limit
max_summary_len=10 #10 word limit
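One way to sanity-check limits like these is to measure what share of the examples they keep. The sketch below is a minimal illustration on made-up data; the helper name `fraction_within` and the toy reviews are assumptions, not part of the notebook.

```python
def fraction_within(texts, max_words):
    """Share of texts whose whitespace-separated word count is <= max_words."""
    if not texts:
        return 0.0
    kept = sum(1 for t in texts if len(t.split()) <= max_words)
    return kept / len(texts)

# Toy stand-ins for cleaned reviews (hypothetical data).
toy_reviews = [
    "good food",
    "dog loves this food very much",
    "great",
    "tasty snack for kids",
]
coverage = fraction_within(toy_reviews, 4)
print(coverage)  # 0.75
```

Running the same function over the real cleaned columns would show how much data a given limit discards before committing to it.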

With the above restrictions in place, we need to mark the start and end of each summary. I will not be using the words "start" and "end" because some summaries and reviews contain those words. If I used them, the decoding function would cut off the sentence early. Instead, I use the made-up tokens "sostok" and "eostok".
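The collision problem can be shown with a small sketch. The helper `strip_markers` and the sample summary are hypothetical and only illustrate the idea: any routine that stops at the first end marker will truncate a summary that already contains that word.

```python
def strip_markers(tagged, start_token, end_token):
    """Return the words between the start marker and the first end marker."""
    words = []
    for word in tagged.split():
        if word == start_token:
            continue
        if word == end_token:
            break  # everything after the first end marker is dropped
        words.append(word)
    return " ".join(words)

summary = "the end of bland coffee"

# An ordinary word used as the end marker collides with the summary text.
collided = strip_markers("start " + summary + " end", "start", "end")
print(collided)  # the

# Made-up tokens are very unlikely to occur inside a real summary.
intact = strip_markers("sostok " + summary + " eostok", "sostok", "eostok")
print(intact)  # the end of bland coffee
```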

In [15]:
#Selecting any review that falls within or equal to the limitations that I set above.
clean_text =np.array(data['clean_text'])
clean_summary=np.array(data['clean_summary'])

short_text=[]
short_summary=[]

for i in range(len(clean_text)):
    if(len(clean_summary[i].split())<=max_summary_len and len(clean_text[i].split())<=max_text_len):
        short_text.append(clean_text[i])
        short_summary.append(clean_summary[i])
        
df=pd.DataFrame({'text':short_text,'summary':short_summary})
df['summary'] = df['summary'].apply(lambda x : 'sostok '+ x + ' eostok')

---

## Splitting dataset for model training in the future

---

In [16]:
from sklearn.model_selection import train_test_split
x_tr,x_val,y_tr,y_val=train_test_split(np.array(df['text']),np.array(df['summary']),test_size=0.1,random_state=0,shuffle=True)

---

## Tokenizing reviews and summaries

---

In [17]:
from keras.preprocessing.text import Tokenizer 
from keras.preprocessing.sequence import pad_sequences

#prepare a tokenizer for reviews on training data
x_tokenizer = Tokenizer() 
x_tokenizer.fit_on_texts(list(x_tr))

## Identifying rare words and then tokenizing

In [18]:
thresh=4 #Setting the threshold for rare words: any word that appears fewer than 4 times in the training reviews is considered rare

cnt=0 #number of rare words, i.e. words whose count falls below the threshold
tot_cnt=0 #size of the vocabulary, i.e. the number of unique words in the text
freq=0 #total number of occurrences of the rare words
tot_freq=0 #total number of occurrences of all words

for key,value in x_tokenizer.word_counts.items():
    tot_cnt=tot_cnt+1
    tot_freq=tot_freq+value
    if(value<thresh):
        cnt=cnt+1
        freq=freq+value
    
print("% of rare words in vocabulary:",(cnt/tot_cnt)*100)
print("Total Coverage of rare words:",(freq/tot_freq)*100)

% of rare words in vocabulary: 66.66187629340078
Total Coverage of rare words: 2.159620622721174
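The two percentages above answer different questions: rare words make up roughly two thirds of the vocabulary but only about 2% of all word occurrences, which is why dropping them costs little. A minimal sketch of the same calculation on toy data, using `collections.Counter` instead of the Keras tokenizer (the function name `rare_word_stats` and the toy texts are assumptions, and plain whitespace splitting is a simplification of what the tokenizer does):

```python
from collections import Counter

def rare_word_stats(texts, thresh):
    """Return (% of vocabulary that is rare, % of occurrences that are rare,
    number of words kept)."""
    counts = Counter(word for t in texts for word in t.split())
    rare = {w: c for w, c in counts.items() if c < thresh}
    vocab_share = len(rare) / len(counts) * 100
    coverage = sum(rare.values()) / sum(counts.values()) * 100
    return vocab_share, coverage, len(counts) - len(rare)

toy_texts = ["good dog food", "good food", "good price", "dog treats"]
vocab_share, coverage, kept = rare_word_stats(toy_texts, thresh=2)
print(round(vocab_share, 2), round(coverage, 2), kept)  # 40.0 22.22 3
```

The third value corresponds to `tot_cnt-cnt`, the number of words that stay in the vocabulary.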


In [19]:
#prepare a tokenizer for reviews on training data
x_tokenizer = Tokenizer(num_words=tot_cnt-cnt) 
x_tokenizer.fit_on_texts(list(x_tr))

#convert text sequences into integer sequences
x_tr_seq = x_tokenizer.texts_to_sequences(x_tr) 
x_val_seq = x_tokenizer.texts_to_sequences(x_val)

#padding with zeros up to the maximum length
x_tr = pad_sequences(x_tr_seq,  maxlen=max_text_len, padding='post') #adding zeros at the end of the sentence to match the max length
x_val = pad_sequences(x_val_seq, maxlen=max_text_len, padding='post') #adding zeros at the end of the sentence to match the max length

#size of vocabulary. The +1 leaves room for index 0, which the Tokenizer never assigns to a word and which is used here for padding.
x_voc = x_tokenizer.num_words + 1
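For readers unfamiliar with post-padding, here is a pure-Python sketch of what `padding='post'` does to integer sequences. The helper `pad_post` is hypothetical and is not the Keras implementation; it assumes a positive `maxlen` and cuts overlong sequences from the front, which mirrors the Keras default `truncating='pre'`.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence with `value` on the right up to `maxlen`.

    Sequences longer than `maxlen` keep only their last `maxlen` items.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last `maxlen` items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

result = pad_post([[5, 2], [7, 1, 4, 9, 3]], maxlen=4)
print(result)  # [[5, 2, 0, 0], [1, 4, 9, 3]]
```

In this notebook every review was already filtered to at most `max_text_len` words, so only the padding branch should matter.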

## Doing the same for the y_tokenizer, or the tokenizer for the summaries

In [20]:
#prepare a tokenizer for summaries on training data
y_tokenizer = Tokenizer()   
y_tokenizer.fit_on_texts(list(y_tr))

In [21]:
thresh=6 #Setting the threshold for rare words: any word that appears fewer than 6 times in the training summaries is considered rare

cnt=0 #number of rare words, i.e. words whose count falls below the threshold
tot_cnt=0 #size of the vocabulary, i.e. the number of unique words in the text
freq=0 #total number of occurrences of the rare words
tot_freq=0 #total number of occurrences of all words

for key,value in y_tokenizer.word_counts.items():
    tot_cnt=tot_cnt+1
    tot_freq=tot_freq+value
    if(value<thresh):
        cnt=cnt+1
        freq=freq+value
    
print("% of rare words in vocabulary:",(cnt/tot_cnt)*100)
print("Total Coverage of rare words:",(freq/tot_freq)*100)

% of rare words in vocabulary: 76.76174496644296
Total Coverage of rare words: 3.8919606209874456


In [22]:
#prepare a tokenizer for summaries on training data, keeping only the words that are not rare
y_tokenizer = Tokenizer(num_words=tot_cnt-cnt) 
y_tokenizer.fit_on_texts(list(y_tr))

#convert text sequences into integer sequences
y_tr_seq = y_tokenizer.texts_to_sequences(y_tr) 
y_val_seq = y_tokenizer.texts_to_sequences(y_val) 

#padding with zeros up to the maximum length
y_tr = pad_sequences(y_tr_seq, maxlen=max_summary_len, padding='post')
y_val = pad_sequences(y_val_seq, maxlen=max_summary_len, padding='post')

#size of vocabulary. The +1 leaves room for index 0, which the Tokenizer never assigns to a word and which is used here for padding.
y_voc = y_tokenizer.num_words +1

In [23]:
#Making sure that the word count of the start token is equal to the length of the training data. 
#Since every summary begins with sostok, its count should match the number of training summaries.
y_tokenizer.word_counts['sostok'],len(y_tr)

(80413, 80413)

## Deleting Rows that are completely empty (i.e. just have the start and end token)

In [24]:
ind=[]
for i in range(len(y_tr)):
    cnt=0
    for j in y_tr[i]:
        if j!=0:
            cnt=cnt+1
    if(cnt==2):
        ind.append(i)

y_tr=np.delete(y_tr,ind, axis=0)
x_tr=np.delete(x_tr,ind, axis=0)

In [25]:
ind=[]
for i in range(len(y_val)):
    cnt=0
    for j in y_val[i]:
        if j!=0:
            cnt=cnt+1
    if(cnt==2):
        ind.append(i)

y_val=np.delete(y_val,ind, axis=0)
x_val=np.delete(x_val,ind, axis=0)
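The two loops above can also be written as a single vectorized filter. This is a sketch of an equivalent approach rather than the notebook's code; the function name `drop_empty_summaries` and the toy arrays are assumptions. Like the loops, it treats a row with exactly two nonzero entries as holding only the start and end tokens.

```python
import numpy as np

def drop_empty_summaries(x, y, n_marker_tokens=2):
    """Drop rows of y (and the matching rows of x) that hold only the markers."""
    keep = np.count_nonzero(y, axis=1) != n_marker_tokens
    return x[keep], y[keep]

# Toy padded arrays (hypothetical ids: 1 = sostok, 2 = eostok, 0 = padding).
y_toy = np.array([[1, 7, 2, 0], [1, 2, 0, 0], [1, 5, 6, 2]])
x_toy = np.array([[4, 4, 0], [8, 0, 0], [9, 3, 1]])
x_kept, y_kept = drop_empty_summaries(x_toy, y_toy)
print(y_kept.tolist())  # [[1, 7, 2, 0], [1, 5, 6, 2]]
```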

---

## Building the Sequence-to-Sequence model using LSTM.

---

## For the encoder I will be creating a stack of 3 LSTM layers. This should lead to a better representation of the sequence and hopefully to an improved model. 

## I will also be making use of the Attention module that I imported at the beginning of this notebook. I will use its AttentionLayer class to set up my attention layer.
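As a rough intuition for what an attention layer computes, the sketch below scores every encoder time step against one decoder state, turns the scores into weights with a softmax, and returns the weighted average of the encoder outputs. It assumes simple dot-product scoring; the imported `AttentionLayer` is a third-party implementation and may well use a different scoring function, so treat this as an illustration only. All names in the block are hypothetical.

```python
import numpy as np

def dot_product_attention(encoder_outputs, decoder_state):
    """encoder_outputs: (T, d), decoder_state: (d,). Returns (context, weights)."""
    scores = encoder_outputs @ decoder_state         # one score per time step, shape (T,)
    scores = scores - scores.max()                   # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time steps
    context = weights @ encoder_outputs              # weighted average, shape (d,)
    return context, weights

enc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 time steps, 2 features
dec = np.array([2.0, 0.0])
context, weights = dot_product_attention(enc, dec)
print(weights.round(3), context.round(3))
```

In the model below, the attention output is concatenated with the decoder LSTM output before the final softmax layer.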

In [26]:
from keras import backend as K 
K.clear_session() #Clearing session to make sure nothing messes up the model if I need to train it again. 

latent_dim = 300 #Number of units in each LSTM layer (the size of the hidden state). This is a tunable hyperparameter.
embedding_dim = 100

#Encoder
encoder_inputs = Input(shape=(max_text_len,))

#embedding layer
enc_emb =  Embedding(x_voc, embedding_dim,trainable=True)(encoder_inputs)

#encoder lstm 1
encoder_lstm1 = LSTM(latent_dim,return_sequences=True,return_state=True,dropout=0.4,recurrent_dropout=0.4)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)

#encoder lstm 2
encoder_lstm2 = LSTM(latent_dim,return_sequences=True,return_state=True,dropout=0.4,recurrent_dropout=0.4)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)

#encoder lstm 3
encoder_lstm3=LSTM(latent_dim, return_state=True, return_sequences=True,dropout=0.4,recurrent_dropout=0.4)
encoder_outputs, state_h, state_c= encoder_lstm3(encoder_output2)

# Set up the decoder, using the final encoder states (state_h, state_c) as initial state.
decoder_inputs = Input(shape=(None,))

#embedding layer
dec_emb_layer = Embedding(y_voc, embedding_dim,trainable=True)
dec_emb = dec_emb_layer(decoder_inputs)

decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True,dropout=0.4,recurrent_dropout=0.2)
decoder_outputs,decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb,initial_state=[state_h, state_c])

# Attention layer
attn_layer = AttentionLayer(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])

# Concat attention input and decoder LSTM output
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])

#dense layer
decoder_dense =  TimeDistributed(Dense(y_voc, activation='softmax'))
decoder_outputs = decoder_dense(decoder_concat_input)

# Define the model 
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 30)]         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 30, 100)      1160000     input_1[0][0]                    
__________________________________________________________________________________________________
lstm (LSTM)                     [(None, 30, 300), (N 481200      embedding[0][0]                  
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None)]       0                                            
______________________________________________________________________________________________

In [27]:
#Sparse categorical cross-entropy is used as the loss function because it takes the integer word indices directly as targets, so the summaries never have to be expanded into one-hot vectors. This avoids memory issues.
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
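To make the loss choice concrete, the sketch below shows, for a single time step, that cross-entropy computed from an integer target equals cross-entropy computed from the matching one-hot vector. The function names and the three-word probability vector are hypothetical; this is not how Keras implements the loss internally.

```python
import math

def sparse_cross_entropy(probs, target_index):
    """Loss for one time step, given the integer id of the correct word."""
    return -math.log(probs[target_index])

def one_hot_cross_entropy(probs, one_hot):
    """The same loss computed against an explicit one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = [0.1, 0.7, 0.2]  # hypothetical softmax output over a 3-word vocabulary
sparse = sparse_cross_entropy(probs, 1)
dense = one_hot_cross_entropy(probs, [0, 1, 0])
print(math.isclose(sparse, dense))  # True
```

With a real vocabulary of thousands of words, the integer form stores one number per position instead of one vector per position.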

## To make sure the model does not overfit, I will be using early stopping. With patience=2, training stops once the validation loss has failed to improve for 2 consecutive epochs.

In [28]:
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1,patience=2)
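The patience rule can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation; the function name `early_stop_epoch` and the loss values are made up.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement over the best loss so far
            if wait >= patience:
                return epoch
    return None

losses = [2.9, 2.5, 2.3, 2.4, 2.35, 2.2]  # hypothetical validation losses
stopped_at = early_stop_epoch(losses)
print(stopped_at)  # 5
```

Note that epoch 5 is lower than epoch 4 but still counts against patience, because it does not beat the best value reached at epoch 3.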

## Training the model

The decoder is trained with teacher forcing: its input is each summary without the last position (y_tr[:,:-1]), and its target is the same summary shifted one step to the left (y_tr[:,1:]), reshaped so that it has a trailing dimension of 1. I have set 50 epochs, but I do not expect the model to train that long because of the early stopping that is included.
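The slicing used in the training call can be seen on a toy array. The token ids below are made up (1 stands for sostok, 2 for eostok, 0 for padding); only the slicing and reshaping mirror the notebook.

```python
import numpy as np

# Toy padded summaries (hypothetical ids: 1 = sostok, 2 = eostok, 0 = padding).
y_toy = np.array([[1, 8, 9, 2, 0],
                  [1, 5, 2, 0, 0]])

# Decoder input: everything except the last position.
decoder_input = y_toy[:, :-1]
# Decoder target: shifted one step to the left, with a trailing axis of size 1.
decoder_target = y_toy.reshape(y_toy.shape[0], y_toy.shape[1], 1)[:, 1:]

print(decoder_input.tolist())            # [[1, 8, 9, 2], [1, 5, 2, 0]]
print(decoder_target[..., 0].tolist())   # [[8, 9, 2, 0], [5, 2, 0, 0]]
print(decoder_target.shape)              # (2, 4, 1)
```

At every position the target is the token that follows the corresponding input token, so the decoder learns to predict the next word.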

In [29]:
history=model.fit([x_tr,y_tr[:,:-1]], y_tr.reshape(y_tr.shape[0],y_tr.shape[1], 1)[:,1:] ,
                  epochs=50,callbacks=[es],batch_size=128,
                  validation_data=([x_val,y_val[:,:-1]], y_val.reshape(y_val.shape[0],y_val.shape[1], 1)[:,1:]))

Train on 78845 samples, validate on 8739 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 00021: early stopping


## Since the validation loss stopped improving after epoch 19, the early stopping that we implemented ended training at epoch 21.

In [30]:
#Saving the model for future use.
model.save('prototype_1.h5')

## Time to make inferences and generate some examples

First we need to create dictionaries to change the index of a word back into the actual word.

In [31]:
reverse_target_word_index=y_tokenizer.index_word
reverse_source_word_index=x_tokenizer.index_word
target_word_index=y_tokenizer.word_index

In [32]:
# Encode the input sequence to get the feature vector
encoder_model = Model(inputs=encoder_inputs,outputs=[encoder_outputs, state_h, state_c])

# Decoder setup
# Below tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_hidden_state_input = Input(shape=(max_text_len,latent_dim))

# Get the embeddings of the decoder sequence
dec_emb2= dec_emb_layer(decoder_inputs) 
# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])

#attention inference
attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])

# A dense softmax layer to generate prob dist. over the target vocabulary
decoder_outputs2 = decoder_dense(decoder_inf_concat) 

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
    [decoder_outputs2] + [state_h2, state_c2])

## In order to decode the sequences and return something that can be read by humans, we need to create a function

In [33]:
def decode_sequence(input_seq):
    #Encode the input as state vectors.
    e_out, e_h, e_c = encoder_model.predict(input_seq)
    
    #Generate empty target sequence of length 1.
    target_seq = np.zeros((1,1))
    
    #Populate the first word of target sequence with the start word.
    target_seq[0, 0] = target_word_index['sostok']

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
      
        output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])

        #Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = reverse_target_word_index[sampled_token_index]
        
        if(sampled_token!='eostok'):
            decoded_sentence += ' '+sampled_token

        #Exit condition: either hit max length or find the end token.
        if (sampled_token == 'eostok'  or len(decoded_sentence.split()) >= (max_summary_len-1)):
            stop_condition = True

        #Update the target sequence (of length 1).
        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        #Update internal states
        e_h, e_c = h, c

    return decoded_sentence
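The loop above is a greedy decoder: at each step it keeps the single most probable word and feeds it back in. The sketch below isolates that control flow from Keras. The lookup table `TABLE`, the function `toy_step`, and the token ids are hypothetical stand-ins for the real decoder model.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """Pick the most probable token at each step until end_id or max_len."""
    token, state, output = start_id, None, []
    while len(output) < max_len:
        probs, state = step_fn(token, state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token == end_id:
            break
        output.append(token)
    return output

# Hypothetical decoder: maps the previous token id to a probability
# distribution over ids 0..4 (0 = padding, 1 = start, 2 = end).
TABLE = {
    1: [0.0, 0.0, 0.1, 0.8, 0.1],
    3: [0.0, 0.0, 0.2, 0.1, 0.7],
    4: [0.0, 0.0, 0.9, 0.05, 0.05],
}

def toy_step(prev_token, state):
    return TABLE[prev_token], state

index_word = {3: "nice", 4: "flavor"}
decoded = greedy_decode(toy_step, start_id=1, end_id=2, max_len=5)
print(" ".join(index_word[i] for i in decoded))  # nice flavor
```

Greedy decoding is simple but can miss better summaries that start with a less probable first word.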

## Now we need two functions, one for the summary and one for the review text, that convert the integer sequences back to words. The summary version also skips the start and end tokens.

In [34]:
def seq2summary(input_seq):
    newString=''
    for i in input_seq:
        if((i!=0 and i!=target_word_index['sostok']) and i!=target_word_index['eostok']):
            newString=newString+reverse_target_word_index[i]+' '
    return newString

def seq2text(input_seq):
    newString=''
    for i in input_seq:
        if(i!=0):
            newString=newString+reverse_source_word_index[i]+' '
    return newString

## Let's generate a few examples. You will see the review text, the original summary given by the dataset, and then the summary that the model we created has generated.

In [38]:
for i in range(20,24):
    print("Review:",seq2text(x_tr[i]))
    print("Original summary:",seq2summary(y_tr[i]))
    print("Predicted summary:",decode_sequence(x_tr[i].reshape(1,max_text_len)))
    print("\n")

Review: cats eating food several years like well going switch indoor cat type 
Original summary: core pet food 
Predicted summary:  cats love it


Review: kids love month old month old eat love started giving son months still loves till day 
Original summary: yum yum mum mum 
Predicted summary:  my baby loves these


Review: green tea nice pleasant flavor strong weak need add sweeteners look tea overall refreshing drink packaging great well remove perforated part cardboard along bottom dispense individually wrapped tea bags 
Original summary: nice flavor 
Predicted summary:  nice flavor


Review: month old hates even mix earth best veggies loves spit tried salmon hates bought three cases stuff keep trying 
Original summary: hates it 
Predicted summary:  my baby loves it




## It is clear that the model still needs to be improved (the last prediction above says the opposite of the original summary) before it can be expected to generalize to other datasets.