# Introduction

## Setup

1) Ensure that reviews1_cleaned.txt and reviews5_cleaned.txt are in the datafiles folder. <br>
2) The code needs to be run on TensorFlow version 1. <br>
3) The code has to be run on Colab with 25 GB of RAM (more than 12 GB will be used), and the runtime has to be either GPU or TPU.
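
The first requirement can be checked before running anything else. The sketch below is only a convenience: the helper name `missing_files` is hypothetical and not part of the notebook, and only the folder and file names come from the setup steps.

```python
import os

# File names taken from the setup steps; the helper itself is a hypothetical convenience.
REQUIRED_FILES = ["reviews1_cleaned.txt", "reviews5_cleaned.txt"]

def missing_files(folder, names=REQUIRED_FILES):
    """Return the required file names that are not present in `folder`."""
    return [name for name in names if not os.path.isfile(os.path.join(folder, name))]

# An empty list means both review files are in place.
print("Missing files:", missing_files("./datafiles"))
```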

# Training done
1) Train on food reviews (1 star) <br>
2) Train on food reviews (5 stars)

## General Instructions
1) Some parameters must be changed by hand if you want to use values other than the best-performing ones. The lines of code to change are marked with comments. <br>
2) Three sets of code are provided below, for 1, 2 and 3 LSTM layers, so take care not to run all cells at once. <br>
3) Training epochs in the code are set to 1 so that we can control training and test text generation after each epoch. The value can be changed as needed, but we recommend keeping it below 3 to avoid an epoch being cut off when the Colab instance expires. <br> <br>

If everything is left unchanged, the notebook trains LSTM models with the following default hyperparameters: <br>
1) 1 star reviews <br>
2) 10 words input <br>
3) 100 nodes per layer <br>

## Text Generation Instructions
You can jump straight to text generation using the tokenizers and model weights that we have provided. Follow the steps below. <br>
1) Mount the drive and change directory to the lstm_final folder <br>
2) Import the libraries <br>
3) Go to the Text Generation section and run the code in the 3 cells




# Summary of findings
1) Validation loss decreases as we increase the number of LSTM layers, the number of nodes per LSTM layer, and the number of words used as input. <br>
2) Our best-performing model can generate slightly complex reviews, but it cannot learn proper capitalization: it was trained on text converted to lowercase, because doing so let us generate more realistic text. <br>


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# Change to the appropriate directory
cd 'drive/My Drive/BT4222/lstm_final'

/content/drive/My Drive/BT4222/lstm_final


In [0]:
import random
import sys
import numpy as np
import pandas as pd
from pickle import dump, load

%tensorflow_version 1.x
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.models import load_model
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.layers import Embedding
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from keras.utils import np_utils
from keras.utils import to_categorical
from keras import optimizers
import keras
from keras import layers

TensorFlow 1.x selected.


Using TensorFlow backend.


### Additional data cleaning: remove the first and last tokens left by the text file

In [0]:
# Paths to the cleaned 1 star and 5 star reviews
one_star = './datafiles/reviews1_cleaned.txt'
five_star = './datafiles/reviews5_cleaned.txt'

# Change one_star to five_star if planning to train on 5 star reviews
with open(one_star) as f:
    text = f.read()

In [0]:
# The first token still carries the 'text\n' header and an opening quote from the text file; inspect it before removing them.
words_list = text.split(" ")
words_list[0:10]

['text\n"I', 'went', 'at', '230', 'on', 'a', 'Monday', '.', 'It', 'was']

In [0]:
words_list[0] = words_list[0].split('\n"')[1]
words_list[0]

'I'

In [0]:
# Remove the closing quote and trailing '\n' left at the end of the text file
words_list[-1] = words_list[-1].split("\"\n")[0]
words_list[-1]

'!'
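
The two cleaning cells above can also be expressed as one small function. This is a minimal sketch on a made-up sample string, not the notebook's code: `strip_file_wrappers` is a hypothetical name, and the checks that leave tokens untouched when the wrapper is absent are an added safeguard.

```python
def strip_file_wrappers(words):
    # Remove the header (everything up to the opening quote) from the first token
    # and the closing quote plus newline from the last token, when they are present.
    cleaned = list(words)
    if cleaned and '\n"' in cleaned[0]:
        cleaned[0] = cleaned[0].split('\n"', 1)[1]
    if cleaned and cleaned[-1].endswith('"\n'):
        cleaned[-1] = cleaned[-1][:-2]
    return cleaned

# Tiny made-up sample in the same shape as the review file.
sample = 'text\n"I went at 230 on a Monday . It was great !"\n'
print(strip_file_wrappers(sample.split(" ")))
```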

# Data Preparation

In [0]:
# Adapted from this GitHub repository:
# https://github.com/irdanish11/Sentence-Prediction-using-LSTMs_aka-Language-Modeling/blob/master/model.py
# Prepare sequences of length n. Here n = 11: each sequence is later split into 10 input words
# and 1 target word, and 10 input words currently gives us the best performance.

# Vary train_len to control the length of each text sequence used for training.
train_len = 11

In [0]:
text_sequences = []
for i in range(train_len,len(words_list)):
    seq = words_list[i-train_len:i]
    text_sequences.append(seq)

print(text_sequences[0:10])

[['I', 'went', 'at', '230', 'on', 'a', 'Monday', '.', 'It', 'was', 'dimsum'], ['went', 'at', '230', 'on', 'a', 'Monday', '.', 'It', 'was', 'dimsum', 'I'], ['at', '230', 'on', 'a', 'Monday', '.', 'It', 'was', 'dimsum', 'I', 'hated'], ['230', 'on', 'a', 'Monday', '.', 'It', 'was', 'dimsum', 'I', 'hated', 'every'], ['on', 'a', 'Monday', '.', 'It', 'was', 'dimsum', 'I', 'hated', 'every', 'second'], ['a', 'Monday', '.', 'It', 'was', 'dimsum', 'I', 'hated', 'every', 'second', 'I'], ['Monday', '.', 'It', 'was', 'dimsum', 'I', 'hated', 'every', 'second', 'I', 'was'], ['.', 'It', 'was', 'dimsum', 'I', 'hated', 'every', 'second', 'I', 'was', 'there'], ['It', 'was', 'dimsum', 'I', 'hated', 'every', 'second', 'I', 'was', 'there', ','], ['was', 'dimsum', 'I', 'hated', 'every', 'second', 'I', 'was', 'there', ',', 'the']]


In [0]:
# Check the total number of sequences
len(text_sequences)

22934217
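
The windowing step can be seen on a toy sentence. The sketch below uses hypothetical helper names and a window of 3 instead of 11. It also yields the final window, which the loop above stops one step short of because `range(train_len, len(words_list))` excludes `len(words_list)`.

```python
def sliding_windows(tokens, window):
    """Return every run of `window` consecutive tokens, including the final one."""
    return [tokens[i:i + window] for i in range(len(tokens) - window + 1)]

def split_input_target(sequence):
    """Split one window into its input words and its single target word."""
    return sequence[:-1], sequence[-1]

tokens = "I went at 230 on a Monday".split(" ")
windows = sliding_windows(tokens, 3)
for w in windows:
    print(split_input_target(w))
```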

In [0]:
# Create the Tokenizer. Change convert_lower to False if you do not want to convert the text to lowercase.
#convert_lower = True
#tok = Tokenizer(filters = '', lower = convert_lower)
#tok.fit_on_texts(text_sequences)


In [0]:
# Save tokenizer
#dump(tok,open('./tokenizer/tokenizer_1_star','wb'))       

In [0]:
# For subsequent runs, load the saved tokenizer so that the same one is always used for transformation and training

# tok = load(open("./tokenizer/tokenizer_1_star_not_lowercase", "rb" ))  # Without conversion to lowercase
tok = load(open("./tokenizer/tokenizer_1_star", "rb" )) # With conversion to lowercase, which has better performance

In [0]:
# Check the number of unique words in the vocabulary
vocab_size = len(tok.word_counts)
vocab_size

158467
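
To make the vocabulary count concrete, the sketch below imitates in plain Python how a word index of this kind is built: words are counted, optionally lowercased, and numbered from 1 in order of decreasing frequency. It is an illustration only, not the Keras implementation, and `build_word_index` is a hypothetical name.

```python
from collections import Counter

def build_word_index(token_lists, lower=True):
    """Number words from 1 in order of decreasing frequency (index 0 is left unused)."""
    counts = Counter()
    for tokens in token_lists:
        for token in tokens:
            counts[token.lower() if lower else token] += 1
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_sequences = [["I", "went", "home"], ["i", "went", "out"]]
lowered = build_word_index(toy_sequences, lower=True)
cased = build_word_index(toy_sequences, lower=False)
print(len(lowered), lowered)
print(len(cased), cased)
```

On this toy input, lowercasing merges "I" and "i" into one entry, so the lowercased vocabulary is smaller than the cased one.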

In [0]:
sequences = tok.texts_to_sequences(text_sequences) 

In [0]:
n_sequences = np.empty([len(sequences),train_len], dtype='int32')
for i in range(len(sequences)):
    n_sequences[i] = sequences[i]

In [0]:
# Each sequence has n words: the first n-1 words are the input and the last word is the target.
train_inputs = n_sequences[:,:-1] 
train_targets = n_sequences[:,-1] 
seq_len = train_inputs.shape[1]

print(train_inputs.shape)
print(train_targets.shape)

(22934217, 10)
(22934217,)
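
The shapes above, together with the vocabulary size, show why the targets are kept as integer indices and why the models use sparse categorical crossentropy. The back-of-the-envelope estimate below assumes 4 bytes per value. The sequence count and vocabulary size are the values printed earlier in this notebook.

```python
num_sequences = 22934217   # rows in train_targets, from the shape printed above
num_classes = 158467 + 1   # vocab_size + 1, the width of the softmax layer in the models
bytes_per_value = 4        # assumption: 32-bit values

one_hot_bytes = num_sequences * num_classes * bytes_per_value
integer_bytes = num_sequences * bytes_per_value

print("one-hot targets: {:.1f} TB".format(one_hot_bytes / 1e12))
print("integer targets: {:.1f} MB".format(integer_bytes / 1e6))
```

Under these assumptions, one-hot targets would need about 14.5 TB, against roughly 92 MB for integer targets.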


# LSTM Models


The code below is split into 1-layer, 2-layer and 3-layer LSTM segments. Run only the segment that you need.

In [0]:
# Must be run regardless of how many layers you choose to train.
num_nodes = 100

## 1 Layer LSTM


In [0]:
model_1_layer = Sequential() 
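# Note: Tokenizer indices start at 1, so the largest word index equals vocab_size and an
# input_dim of vocab_size + 1 would be needed to cover it. The value is left unchanged here
# so that the code matches the summary shown below.
model_1_layer.add(Embedding(vocab_size, seq_len, input_length = seq_len))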
model_1_layer.add(LSTM(num_nodes))
model_1_layer.add(Dense(vocab_size + 1, activation = 'softmax'))

# Sparse categorical crossentropy is used due to memory limitations 
model_1_layer.compile(loss = 'sparse_categorical_crossentropy', optimizer = optimizers.adam(lr = 0.001), metrics=['accuracy'])
model_1_layer.summary()


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 10, 10)            1584670   
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               44400     
_________________________________________________________________
dense_1 (Dense)              (None, 158468)            16005268  
Total params: 17,634,338
Trainable params: 17,634,338
Non-trainable params: 0
_________________________________________________________________
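
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers. The function names are hypothetical, and the sizes are the ones used in this notebook.

```python
def embedding_params(input_dim, output_dim):
    # One vector of length output_dim per index
    return input_dim * output_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return (input_dim + 1) * units

vocab_size, seq_len, num_nodes = 158467, 10, 100
counts = [
    embedding_params(vocab_size, seq_len),
    lstm_params(seq_len, num_nodes),
    dense_params(num_nodes, vocab_size + 1),
]
print(counts, sum(counts))
```

The dense output layer accounts for most of the parameters, which is a direct consequence of the large vocabulary.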


In [0]:
# Training is done 1 epoch at a time because each epoch takes a long time.
path_name = 'lstm_model_1_layer_{}_nodes_{}_words.h5'.format(num_nodes, train_len - 1)
path = './checkpoints/' + path_name
checkpoint = ModelCheckpoint(path, monitor = 'loss', verbose=1, save_best_only = True, mode = 'min')

print('---Training parameters---')
print('Number of layers: 1')
print('Number of nodes: ', num_nodes)
print('Number of words as input: ', train_len - 1)
print()

# Change the epoch value to suit your needs
model_1_layer.fit(train_inputs, train_targets, batch_size = 1024, epochs = 1, verbose = 1, callbacks = [checkpoint])
model_1_layer.save(path_name)

## 2 Layers LSTM

In [0]:
model_2_layer = Sequential() 
model_2_layer.add(Embedding(vocab_size, seq_len, input_length = seq_len))
model_2_layer.add(LSTM(num_nodes, return_sequences = True))
model_2_layer.add(LSTM(num_nodes))
model_2_layer.add(Dense(vocab_size + 1, activation = 'softmax'))

# Sparse categorical crossentropy is used due to memory limitations 
model_2_layer.compile(loss = 'sparse_categorical_crossentropy', optimizer = optimizers.adam(lr = 0.001), metrics=['accuracy'])
model_2_layer.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 10, 10)            1584670   
_________________________________________________________________
lstm_2 (LSTM)                (None, 10, 100)           44400     
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_2 (Dense)              (None, 158468)            16005268  
Total params: 17,714,738
Trainable params: 17,714,738
Non-trainable params: 0
_________________________________________________________________


In [0]:
# Training is done 1 epoch at a time because each epoch takes a long time.
path_name = 'lstm_model_2_layer_{}_nodes_{}_words.h5'.format(num_nodes, train_len - 1)
path = './checkpoints/' + path_name
checkpoint = ModelCheckpoint(path, monitor = 'loss', verbose=1, save_best_only = True, mode = 'min')

print('---Training parameters---')
print('Number of layers: 2')
print('Number of nodes: ', num_nodes)
print('Number of words as input: ', train_len - 1)
print()

# Change the epoch value to suit your needs
model_2_layer.fit(train_inputs, train_targets, batch_size = 1024, epochs = 1, verbose = 1, callbacks = [checkpoint])
model_2_layer.save(path_name)

## 3 Layers LSTM

In [0]:
model_3_layer = Sequential() 
model_3_layer.add(Embedding(vocab_size, seq_len, input_length = seq_len))
model_3_layer.add(LSTM(num_nodes, return_sequences = True))
model_3_layer.add(LSTM(num_nodes, return_sequences = True))
model_3_layer.add(LSTM(num_nodes))
model_3_layer.add(Dense(vocab_size + 1, activation = 'softmax'))

# Sparse categorical crossentropy is used due to memory limitations 
model_3_layer.compile(loss = 'sparse_categorical_crossentropy', optimizer = optimizers.adam(lr = 0.001), metrics=['accuracy'])
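# Note: the summary shown below was captured from a run with num_nodes = 50, not the default of 100.
model_3_layer.summary()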

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 10, 10)            1584670   
_________________________________________________________________
lstm_7 (LSTM)                (None, 10, 50)            12200     
_________________________________________________________________
lstm_8 (LSTM)                (None, 10, 50)            20200     
_________________________________________________________________
lstm_9 (LSTM)                (None, 50)                20200     
_________________________________________________________________
dense_4 (Dense)              (None, 158468)            8081868   
Total params: 9,719,138
Trainable params: 9,719,138
Non-trainable params: 0
_________________________________________________________________


In [0]:
# Training is done 1 epoch at a time because each epoch takes a long time.
path_name = 'lstm_model_3_layer_{}_nodes_{}_words.h5'.format(num_nodes, train_len - 1)
path = './checkpoints/' + path_name
checkpoint = ModelCheckpoint(path, monitor = 'loss', verbose=1, save_best_only = True, mode = 'min')

print('---Training parameters---')
print('Number of layers: 3')
print('Number of nodes: ', num_nodes)
print('Number of words as input: ', train_len - 1)
print()

# Change the epoch value to suit your needs
model_3_layer.fit(train_inputs, train_targets, batch_size = 1024, epochs = 1, verbose = 1, callbacks = [checkpoint])
model_3_layer.save(path_name)

# Text Generation 





In [0]:
def gen_text(model, tokenizer, input_text, num_gen_words):
    '''
    Generate text

    Input: 1) model
           2) tokenizer
           3) input_text: Provide a prompt in str format
           4) num_gen_words: number of words to generate.
    '''
    seq_len = model.input_shape[1]
    output_text = [input_text]
    pred_word = None
    final_text = ''

    for i in range(num_gen_words):
        encoded_text = tokenizer.texts_to_sequences([input_text])[0]
        # An empty-string prediction is lost when input_text is re-tokenized, so its index is
        # appended by hand (9 is assumed to be the index of '' in the provided tokenizer).
        if pred_word == '':
            encoded_text.append(9)
        # Keep only the last seq_len tokens; shorter prompts are left-padded with zeros.
        pad_encoded = pad_sequences([encoded_text], maxlen=seq_len, truncating='pre')
        pred_word_ind = model.predict_classes(pad_encoded, verbose=0)[0]

        pred_word = tokenizer.index_word[pred_word_ind]
        input_text += ' ' + pred_word
        output_text.append(pred_word)

    # Attach punctuation directly to the preceding word; separate other tokens with a space.
    for text in output_text:
        if text not in ['!', '?', ',', '.', '']:
            final_text = final_text + ' ' + text
        else:
            final_text = final_text + text

    return final_text
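
The `pad_sequences` call in `gen_text` is what lets a prompt of any length feed a model that expects exactly `seq_len` tokens. The plain-Python sketch below shows the effect of that call with `truncating='pre'` and the default pre-padding. The helper name `pad_or_truncate` is hypothetical and stands in for the Keras function.

```python
def pad_or_truncate(token_ids, maxlen):
    """Keep the last `maxlen` ids and left-pad with zeros when there are fewer."""
    kept = token_ids[-maxlen:]
    return [0] * (maxlen - len(kept)) + kept

# A short prompt is padded on the left; a long one loses its oldest tokens.
print(pad_or_truncate([4, 8, 15], 5))
print(pad_or_truncate(list(range(1, 13)), 10))
```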

In [0]:
'''
Sample loading of best model.

We have provided 3 sets of model weights that we trained for this project in the checkpoints folder.
1) Weights of best performing model for 1 star reviews with conversion to lowercase: 'model_1_star_best'
2) Weights of best performing model for 5 star reviews with conversion to lowercase: 'model_5_star_best'
3) Weights of model for 1 star reviews without conversion to lowercase: 'model_1_star_not_lowercase'

The weights of the model trained on 1 star reviews without conversion to lowercase are also provided so that you can compare performance.

Change the weights variable if you train a model yourself. The weights from additional training runs are also stored in the checkpoints folder.

The tokenizers can be found in the tokenizer folder.
1) tokenizer_1_star: For 1 star reviews
2) tokenizer_5_star: For 5 star reviews
3) tokenizer_1_star_not_lowercase: For 1 star reviews without conversion to lowercase
'''
weights = 'model_1_star_best.h5'
pathname = './checkpoints/' + weights
best_model = load_model(pathname)
tok = load(open("./tokenizer/tokenizer_1_star", "rb" )) 

In [0]:
# Sample text generation 
gen_text(best_model, tok, 'I am disappointed', 30)

' I am disappointed. i was so excited to try this place again. i ordered a chicken sandwich and it was a little bit of a jar. the'
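
The generation loop can be followed end to end without a trained model. In the sketch below a lookup table stands in for the LSTM. This is an assumption made purely for illustration, and all names are hypothetical. The loop is greedy in the same way as `gen_text`: it always appends the single predicted word and feeds the last `seq_len` tokens back in.

```python
# Toy stand-in for the model: the next word depends only on the last word of the context.
NEXT_WORD = {'disappointed': '.', '.': 'i', 'i': 'was', 'was': 'sad', 'sad': '.'}

def predict_next(context):
    return NEXT_WORD[context[-1]]

def generate_greedy(predict, prompt_tokens, num_words, seq_len):
    tokens = list(prompt_tokens)
    for _ in range(num_words):
        tokens.append(predict(tokens[-seq_len:]))  # only the last seq_len tokens are used
    return tokens

def join_tokens(tokens):
    # Same joining rule as gen_text: punctuation attaches to the previous word.
    text = ''
    for token in tokens:
        text += token if token in ['!', '?', ',', '.', ''] else ' ' + token
    return text

generated = generate_greedy(predict_next, ['i', 'am', 'disappointed'], 4, seq_len=2)
print(repr(join_tokens(generated)))
```

The leading space in the result comes from the joining rule and matches the sample output above.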