# Chapter 12 - RNN and LSTM

In [None]:
#Importing dependencies
import numpy as np
import string
import random
import re
import tensorflow as tf
import pandas as pd
import tensorflow.keras as keras
import matplotlib.pyplot as plt
import sklearn
from tensorflow.keras.models import Sequential
from numpy import array, argmax, random, take
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import RNN, SimpleRNN, LSTM,  Embedding, RepeatVector
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
%matplotlib inline
#For plotting the matplotlib graphs in notebook

Mount your drive and set the path to your dataset folder, or read the files directly from the GitHub URL below.

In [None]:
#Data_path="/content/drive/My Drive/DataSets/Chapter-12/Datasets/"
Data_path="https://raw.githubusercontent.com/venkatareddykonasani/ML_DL_py_TF/master/Chapter12_RNN_LSTM_V3/Datasets/"

## Case Study – Word prediction
This case study uses a simple dataset to understand the concept of sequential ANN models.

### Objective and Data
We would like to predict the third word from a sequence of two words. Each record in the dataset is a sequence of three words. This type of data is also known as three-gram data. More generally, the most frequently occurring N-word sequences are known as N-gram data. This dataset is freely available on the COCA (Corpus of Contemporary American English) website, https://www.english-corpora.org/coca/. We are using a subset in this example, made up mostly of word sequences starting with love or hate.
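
To make the N-gram idea concrete, here is a minimal sketch that extracts every three-word sequence from a sentence and counts how often each one occurs. The sentence and the helper name `ngrams` are made up for illustration; they are not part of the COCA data or of this notebook.

```python
from collections import Counter

def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Toy sentence (an assumption for illustration, not taken from the dataset)
text = "i love it when you love it too"
trigrams = ngrams(text.split(), 3)
counts = Counter(trigrams)

# The first two words of a trigram are the input, the third is the target
for (w1, w2, w3), c in counts.most_common(3):
    print(w1, w2, "->", w3, "| count:", c)
```

In the case study, each row of the data file plays the role of one such trigram.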

The final goal is to take the first two words as input and predict the final word. Below is the code for
importing the data.

In [None]:
import pandas as pd
column_names = ['word1', 'word2', 'word3']

input_3gram = pd.read_csv(Data_path+ "3Gram_love_data.txt", delimiter='\t', names=column_names) #Importing the tab-separated file with column names
print("shape of data", input_3gram.shape)

In [None]:
print("Few sample records from data \n", input_3gram.sample(10))

There are a total of 5351 rows, and each row has three words; the three words form three
columns. The code below finds the frequency of distinct words in column 1 and column 2.

In [None]:
print("\nFrequency of word1 values \n", input_3gram["word1"].value_counts())
print("\nFrequency of word2 values \n", input_3gram["word2"].value_counts())

From the above output table, we can see that column-1 has all the words related to love and hate.
More than 90% of the words are related to love. The second column has a few more unique words.

### Data processing
Before building the model, we need to perform three steps.
1. Find all the unique words from three columns
2. The model cannot be built on string data. Map the words to numbers and create a word_indices dictionary with words as keys and numbers as values. Later we will create one-hot encoded variables for all the unique numbers. This conversion is necessary for building the model.
3. We cannot give the predictions as numbers; the final predicted output must be in the form of words. Create a second dictionary, from numbers to words, that has numbers as keys and words as values.
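
The three steps can be sketched on a toy vocabulary as follows. This is only an illustration under stated assumptions: the five words and the names `word_to_index`, `index_to_word`, and `one_hot` are hypothetical and are kept separate from the notebook's own variables.

```python
# Toy vocabulary (an assumption for illustration; the real data has many more words)
toy_words = ["love", "hate", "it", "the", "way"]

# Step 1: sorted unique words
vocab = sorted(set(toy_words))

# Step 2: word -> number
word_to_index = {w: i for i, w in enumerate(vocab)}
# Step 3: number -> word (the inverse mapping)
index_to_word = {i: w for w, i in word_to_index.items()}

def one_hot(index, size):
    """Vector of zeros with a single 1 at position `index`."""
    return [1 if k == index else 0 for k in range(size)]

# Encode a word for the model, then decode it back for display
encoded = one_hot(word_to_index["love"], len(vocab))
decoded = index_to_word[encoded.index(1)]
print(vocab, encoded, decoded)
```

The round trip from word to one-hot vector and back to word is exactly what the two dictionaries are for.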

Let us see the code for each of these steps. The first step is to find all the unique words from the three
columns.

In [None]:
"""
Finding the words to create our dictionary.
Here we collect the unique values in each column,
then take the unique values across all the appended columns.
This will be our vocabulary list: the unique words in our data file.
"""
unique_words = []
for i in list(input_3gram.columns.values):
    for j in pd.unique(input_3gram[i]):
        unique_words.append(j)
unique_words = np.unique(unique_words)


print('Count of unique words overall:', len(unique_words))
print('unique words list:', unique_words)

From the above output, we can see that there are overall 139 unique words. These are unique words
from all three columns. In the previous output, we have already seen that column-1 has a few
unique words. Now we will create two dictionaries, words to indices, and indices to words.

In [None]:
"""
Creating our word:index dictionary and its inverse.
Here we create two dictionaries:
word_indices : maps each word to a unique number
indices_words : maps each number back to its word, in the same order as word_indices
"""
word_indices = dict((w, i) for i, w in enumerate(unique_words))
indices_words = dict((i, w) for i, w in enumerate(unique_words))

print("word_indices dictionary \n",word_indices)
print("word_indices.keys \n", word_indices.keys())
print("word_indices.values \n", word_indices.values())
print("\n ########################################\n")
print("indices_words dictionary \n", indices_words)
print("indices_words keys \n",indices_words.keys())
print("indices_words values \n",indices_words.values())

From the above output, we can see the mapping of each word to a number. In the word_indices dictionary, words are keys, and numbers are values. The output also shows the keys and values separately; we are not displaying them here. There are a total of 139 numbers, from 0 to 138. The
word "love" is mapped to 70, and the word "way" is mapped to 129. The second part of the output is
the numbers-to-words dictionary.

The second output is just a copy of the previous dictionary with keys and values swapped. In the indices_words dictionary, keys are numbers, and values are words: 70 is mapped to "love," and 129 is mapped to "way." Both dictionaries are used later: the first before building the model, and the second at the time of prediction. We will now convert the values in column-1 to one-hot encoded values. There are 139 unique words, so the single word1 column will be one-hot encoded into 139 columns. The code below converts word1 into one-hot
encoded columns.

In [None]:
### Onehot encoding of word1
word1 = input_3gram['word1'].map(word_indices)
word1_onehot = keras.utils.to_categorical(np.array(word1), num_classes=len(word_indices))
print("word1_onehot shape is ",word1_onehot.shape)

As expected, the output has 139 columns. Each column corresponds to one unique word. Let us see a couple of examples.

In [None]:
#Let's take an example of two different words
print("The word in row 0 is -->"+input_3gram['word1'][0])
print("The one hot encoded version of the word in row 0 is \n",word1_onehot[0])

print("\nThe word in row 500 is --> "+input_3gram['word1'][500])
print("The one hot encoded version of the word in row 500 is \n",word1_onehot[500])

From the above output, we can see the word in the first row is “hate.” The one-hot encoded value for
that row shows the value “1” in the $42^{nd}$ column. The word in row 500 is love, and it has the value “1”
in the $71^{st}$ column. We will also convert word2 and word3 to a one-hot encoded format using the
code below.

In [None]:
##one hot encoding for word2 and word3 
word2 = input_3gram['word2'].map(word_indices)
word2_onehot = keras.utils.to_categorical(np.array(word2), num_classes=len(word_indices))
print("word2_onehot shape is ",word2_onehot.shape)

word3 = input_3gram['word3'].map(word_indices)
word3_onehot = keras.utils.to_categorical(np.array(word3), num_classes=len(word_indices))
print("word3_onehot shape is ",word3_onehot.shape)

We are done with data pre-processing. Now we are ready to build the two models.

### Model building
There will be two models. The first ANN model takes word1_onehot as input and word2_onehot as
output. We will extract the hidden layer output from this model and use it in the next model. Below
is the code for building the model.

In [None]:
ANN_model1 = Sequential()
ANN_model1.add(Dense(10, input_dim=word1_onehot.shape[1], activation='sigmoid'))
ANN_model1.add(Dense(word2_onehot.shape[1] ,activation='softmax'))
ANN_model1.summary()

In [None]:
ANN_model1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train model
history = ANN_model1.fit(word1_onehot, word2_onehot, epochs=20, batch_size=64,  verbose=1)

We are not interested in the predictions of this model. We are interested in the intermediate (hidden layer) output values for each
record. Since there are ten hidden nodes, the hidden layer output is a matrix with 5351
rows and 10 columns. The code below extracts that matrix.

In [None]:
# To see the hidden layer representation of the data, we rebuild the first
# layer of our model and copy the weights from the fully trained ANN_model1
model1_hidden = Sequential()
model1_hidden.add(Dense(10, input_dim=word1_onehot.shape[1]))
model1_hidden.add(Activation('sigmoid'))
# set_weights is used instead of the `weights=` constructor argument,
# which is not accepted by all Keras versions
model1_hidden.layers[0].set_weights(ANN_model1.layers[0].get_weights())

In [None]:
# Getting the hidden layer activations
model1_hidden_output = model1_hidden.predict(word1_onehot)
#peek into our hidden layer activations
print("The hidden layer output for every record - Shape of it \n", model1_hidden_output.shape)
print("First five records from hidden layer \n",model1_hidden_output[:5])

As expected, the shape of the model1 hidden layer output is (5351,10). The output also shows the
hidden node outputs for the first five records; each record has ten values calculated from the ten hidden
nodes. Now we will append this to the word2_onehot and build the second ANN model.
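
As a minimal sketch of what the append step does, the toy example below joins hidden-layer values and one-hot columns row by row. The sizes and values are assumptions chosen for readability; in the notebook the same operation joins 10 hidden values with 139 one-hot columns, giving 149 input columns for the second model.

```python
# Toy sizes (assumptions): 3 records, 4 hidden nodes, vocabulary of 5 words
hidden_output = [
    [0.1, 0.9, 0.3, 0.5],
    [0.7, 0.2, 0.8, 0.4],
    [0.6, 0.6, 0.1, 0.9],
]
word2_one_hot = [
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]

# Row-wise concatenation: each record keeps its hidden values and gains
# the one-hot columns of its second word
combined = [h + w for h, w in zip(hidden_output, word2_one_hot)]

print(len(combined), "rows x", len(combined[0]), "columns")
```

The number of rows is unchanged; only the number of columns grows.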

In [None]:
"""
We append the one-hot encoded word2 column to the output of the hidden layer; this gives us the combined input representation
"""
word2_hidden_append = np.append(model1_hidden_output,word2_onehot, axis=1)
print("word2_hidden_append Shape", word2_hidden_append.shape)

In [None]:
ANN_model2 = Sequential()
ANN_model2.add(Dense(10, input_dim=word2_hidden_append.shape[1], activation='sigmoid'))
ANN_model2.add(Dense(word3_onehot.shape[1], activation='softmax'))
ANN_model2.summary()

In [None]:
ANN_model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train model
history = ANN_model2.fit(word2_hidden_append, word3_onehot, epochs=20, batch_size=64,  verbose=1)

We are now done with the sequential ANN models. Prediction does not work the same way as in a standard
ANN model; in the next section, we will see how to use these models to get predicted values on
new data points.

### Prediction
We need to write a custom predict function. This function takes a sequence of two words. These two words are converted to numbers using the word_indices dictionary and then one-hot encoded. The function passes the first word through the hidden layer taken from ANN_model1 to extract the hidden layer output values. These values are then appended to the one-hot encoded second word, and the final prediction is made using the second model. The prediction is a number; we convert it to a word using the indices_words dictionary. Below is the code for the custom predict function.

In [None]:
# A predict function that takes word1 and word2 as input and predicts word3
#1. take the input words and represent them as numbers using the word_indices dictionary
#2. get the intermediate hidden node outputs for word1
#3. append the hidden activations to word2 to form the final test input
#4. predict on this test input
def two_step_pred(words_in):

    index_input=word_indices[words_in[0]]
    indices_in = keras.utils.to_categorical(index_input, num_classes=len(word_indices))
    indices_in=indices_in.reshape(1,len(word_indices))
    h1_test = model1_hidden.predict(indices_in) # getting our intermediate hidden activations from model1h
    
    
    index_input2=word_indices[words_in[1]]
    indices_in2 = keras.utils.to_categorical(index_input2, num_classes=len(word_indices))
    indices_in2= indices_in2.reshape(1,len(word_indices))
    X2_test = np.append(h1_test, indices_in2, axis=1) #preparing final test data by appending hidden with word2
    
    yhat = np.argmax(ANN_model2.predict(X2_test), axis=-1) #predicting final output from model2; predict_classes was removed in TensorFlow 2.6
    
    print("Input words --> ", words_in)
    print("Predicted word --> ", indices_words[yhat[0]])

We will now use this function to get some predicted values.

In [None]:
two_step_pred(['love', 'it'])
two_step_pred(['love', 'to'])
two_step_pred(['love', 'the'])

The accuracy of the model depends on the training data. The predictions are reasonable here. That concludes our case study on predicting the third word using the first two words as input. We used sequential ANN models to solve this problem of sequential dependency. Understanding this case
study is vital for understanding RNN models.

## Recurrent Neural Networks
To solve the sequential data problem, we manually built sequential ANN models. Recurrent Neural Networks automate this stacking of sequential ANN models. The procedure we have followed so far is done by the RNN automatically. We need to mention the number of time steps, and the RNN model will automatically stack the required number of ANN models. If we have to predict the Nth value in a sequence from the N-1 values before it, then we need to build an RNN model with N-1 time steps; predicting the third word from two words needs two time steps.
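
The idea of reusing one small network at every time step can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the weights are made-up values, the bias is left out, and tanh is used because it is the default activation of the Keras SimpleRNN layer. The function name `rnn_forward` is hypothetical.

```python
import math

def rnn_forward(inputs, W_x, W_h, h0):
    """Apply the same weights at every time step and return the final hidden state.

    inputs: list of input vectors, one per time step
    W_x:    input-to-hidden weights, shape [hidden][input]
    W_h:    hidden-to-hidden weights, shape [hidden][hidden]
    """
    h = h0
    for x in inputs:
        # The new hidden state depends on the current input and the previous hidden state
        h = [
            math.tanh(
                sum(W_x[j][k] * x[k] for k in range(len(x)))
                + sum(W_h[j][k] * h[k] for k in range(len(h)))
            )
            for j in range(len(h))
        ]
    return h

# Made-up weights for 2 inputs and 2 hidden nodes (illustrative values only)
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [-0.4, 0.6]]
sequence = [[1, 0], [0, 1]]  # two time steps of one-hot inputs

h_final = rnn_forward(sequence, W_x, W_h, h0=[0.0, 0.0])
print(h_final)
```

The loop body is the "ANN" and the hidden state `h` is what gets handed from one step to the next, which is what we did by hand when we appended the hidden layer output to the second word.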

### Model building
While building RNN models, we need to mention the number of time steps along with the standard parameters like the number of hidden nodes and input shape. We need to add the SimpleRNN layer and mention these parameters.

In [None]:
model = Sequential()
model.add(SimpleRNN(4, use_bias=False, input_shape=(2,2)))
model.add(Dense(3, use_bias=False, activation='softmax'))
model.summary()

Since we have excluded the bias, there are a total of 36 parameters. Now we will include the bias in
this network and observe the number of parameters.

In [None]:
model = Sequential()
model.add(SimpleRNN(4, input_shape=(2,2)))
model.add(Dense(3, activation='softmax'))
model.summary()

From the above outputs, we can see that the number of parameters matches our manual
calculation using the formula. Since the weights are shared across time steps, the number of time steps
(the length of the sequence) has no impact on the number of parameters. If we change the time
steps to 4, the RNN will still have 43 parameters.
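
The parameter counts quoted above can be reproduced with a short calculation. The sketch below uses the standard count for a simple recurrent layer (input-to-hidden weights, hidden-to-hidden weights, and one bias per hidden node) followed by a dense layer; the helper names are hypothetical and are not Keras functions.

```python
def simple_rnn_params(units, input_dim, use_bias=True):
    """Weights of one SimpleRNN layer: input-to-hidden, hidden-to-hidden, optional bias."""
    return units * input_dim + units * units + (units if use_bias else 0)

def dense_params(inputs, outputs, use_bias=True):
    """Weights of one Dense layer, plus optional bias."""
    return inputs * outputs + (outputs if use_bias else 0)

# 4 hidden nodes, 2 input features, 3 output classes, as in the models above
without_bias = simple_rnn_params(4, 2, False) + dense_params(4, 3, False)
with_bias = simple_rnn_params(4, 2) + dense_params(4, 3)
print(without_bias, with_bias)  # 36 43
```

Note that the number of time steps does not appear anywhere in the calculation, which is why changing it leaves the count unchanged.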

In [None]:
model = Sequential()
model.add(SimpleRNN(4, input_shape=(4,2)))
model.add(Dense(3, activation='softmax'))
model.summary()

### Word prediction using RNN model
We have manually built a sequential ANN stack to solve the case study where we are predicting the third word. With RNN models, we just need to mention the number of time steps; the ANN stacks will be automatically taken care of by the RNN model. We need to supply the word1 and word2 inputs and build an RNN model with time steps=2. Below is the code for data preparation.

In [None]:
word1_word2 = input_3gram[['word1','word2']].copy() #copy so that the assignment below does not trigger a pandas SettingWithCopyWarning
for i in list(word1_word2.columns.values):
    word1_word2[i] = word1_word2[i].map(word_indices)

word1_word2=np.array(word1_word2)
#The same data is reshaped with similar structure but appended with 1 value to make it 3d array
word1_word2=np.reshape(word1_word2,(word1_word2.shape[0],2,1))
word1_word2_onehot = keras.utils.to_categorical(np.array(word1_word2), num_classes=len(word_indices))
print("word1_word2_onehot shape", word1_word2_onehot.shape)

In the above code, we take the word1 and word2 columns side by side, reshape them, and one-hot encode them. The result is a three-dimensional array with 5351 rows and two time steps (columns), where each time step has 139 dimensions (one-hot encoded).
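
To see where the three dimensions come from, here is a minimal sketch on toy data using plain Python lists. The sizes, the index values, and the helper `one_hot` are assumptions for illustration; the notebook itself uses `keras.utils.to_categorical` on the real data.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1 at position `index`."""
    return [1 if k == index else 0 for k in range(size)]

# Toy data (assumptions): 3 records, 2 words each, vocabulary of 5 words
vocab_size = 5
index_pairs = [[2, 1], [0, 4], [2, 3]]  # word1 and word2 as integer indices

# Each index becomes a one-hot vector, so the result is
# records x time steps x vocabulary size
encoded = [[one_hot(i, vocab_size) for i in pair] for pair in index_pairs]

shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
print(shape)  # (3, 2, 5)
```

With the real data, the same layout gives 5351 records, 2 time steps, and 139 one-hot dimensions.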

Now we are ready with data. We are predicting the third word, which means the time steps are two. The target variable is the third word, which is also one-hot encoded.

In [None]:
print("time steps" , word1_word2_onehot.shape[1])
print("Input nodes" , word1_word2_onehot.shape[2])
print("output nodes" , word3_onehot.shape[1])

We will now build the RNN model using the below code.

In [None]:
model_rnn = Sequential()
#model.add(SimpleRNN('number of hidden nodes in each rnn cell', input_shape=(timesteps, input_data_dim)))
model_rnn.add(SimpleRNN(30, input_shape=(word1_word2_onehot.shape[1],word1_word2_onehot.shape[2]))) 
model_rnn.add(Dense(word3_onehot.shape[1], activation='softmax'))
model_rnn.summary()

In the above code, we have mentioned two time steps and thirty hidden nodes at each time step.

We will now compile and train the RNN model.

In [None]:
# compile network
model_rnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model_rnn.fit(word1_word2_onehot, word3_onehot, epochs=20)

The accuracy of the model depends on the strength of the training data. We have only five thousand
records and 139 dimensions in the data. We may not achieve a model with good accuracy. We need
more data for higher accuracy. But this model will be better than our previous manual sequential
ANN model.

The model is now ready for predicting. Below is the prediction code.

In [None]:
def rnn_word_pred(in_text):
    print("Input is - " , in_text)
    encoded = [word_indices[i] for i in in_text]
    encoded = np.array(encoded).reshape(1,2,1)
    encoded =keras.utils.to_categorical(np.array(encoded), num_classes=len(word_indices))
    ypred = np.argmax(model_rnn.predict(encoded, verbose=0), axis=-1)[0] #predict_classes was removed in TensorFlow 2.6
    print("Output is --> " ,indices_words[ypred])

There are three significant steps in the above prediction function: first, converting the words into indices; next, reshaping them to a one-row, two-column format; and finally, one-hot encoding the input to bring it into the shape (1,2,139). This pre-processed input is sent to the RNN model to predict the class. Finally, the numerical output is converted to a word before printing. This function takes a list of two words as input. Below are a few examples.

In [None]:
rnn_word_pred(['love', 'it'])
rnn_word_pred(['love', 'to'])
rnn_word_pred(['love', 'the'])

From the output, we can see the results are as good as our previous model. Once again, we need to
note that the predictions depend on our training data.

## RNN for long sequences
RNN models are useful for solving problems related to sequential data. In practical scenarios, however, RNN models often fail to predict long sequences. For sequences where the number of time steps is more than about ten, the RNN algorithm does not give accurate results.

The RNN models, in theory, should work with a sequence of any length. But in practice, standard RNN models do not have long-term memory. We will see a simple numerical example to demonstrate it: we will take an example of long-term dependency and verify the performance of the RNN models.
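
One common intuition for the missing long-term memory can be shown with a scalar toy RNN before we turn to the case study. The sketch below is an illustration under stated assumptions, not a proof: the weights `w=0.5` and `u=1.0` are arbitrary, a trained network would have different values, and the function name `final_state` is hypothetical.

```python
import math

def final_state(first_input, steps, w=0.5, u=1.0):
    """Scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t), with x_t = 0 after the first step."""
    h = 0.0
    for t in range(steps):
        x = first_input if t == 0 else 0.0
        h = math.tanh(w * h + u * x)
    return h

# How much of the first input is still visible in the final hidden state?
for steps in (2, 5, 14):
    effect = final_state(1.0, steps) - final_state(0.0, steps)
    print(steps, "time steps -> effect of first input:", round(effect, 6))
```

With these weights, the trace of the first input shrinks at every step, so after 14 steps it has almost no influence on the final hidden state.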

### Case study - Predicting the characters to form the word.
This case study is similar to the above case study of predicting the next word, but the approach is entirely different.


#### Data and Objective
In this example, we are considering three-gram data, but the data is formed by carefully choosing the three-grams that are more than 15 characters long. The objective is to take a sequence of 14 characters as input and predict the next sequence of characters that form a word. The goal
is to predict the next word, but that word prediction is made by arranging the characters as a sequence. Character-level input and output is the core difference between this model and the model in the previous case study. The code below imports and prints a sample of the data.

In [None]:
import urllib.request  
urllib.request.urlretrieve("https://raw.githubusercontent.com/venkatareddykonasani/ML_DL_py_TF/master/Chapter12_RNN_LSTM_V3/Datasets/Long_sequence_3gram.csv", "Long_sequence_3gram.csv")

In [None]:
longseq_3gram = open('Long_sequence_3gram.csv').read().lower()
print(longseq_3gram[495:801])
print(longseq_3gram[30615:31000])

The above output shows a few examples from the data. We now need to apply pre-processing
steps: create a character-to-index dictionary and prepare the X and y data.

#### Data processing
There are several steps in data pre-processing. We will start by replacing the commas with spaces using the code below.

In [None]:
#Replace commas with spaces and drop carriage returns
longseq_3gram1= longseq_3gram.replace(',',' ').replace('\r','')
print(longseq_3gram1[495:750])
print(longseq_3gram1[30615:30800])

In this model, we need to map each character to an index, so we prepare a character-to-index dictionary.

In [None]:
#Unique characters in our dataset, sorted
chars = sorted(list(set(longseq_3gram1)))
print("Unique Characters in the text \n ",chars)
#\n is the newline character; we don't need it in our dictionary of chars
chars.remove('\n')
print("\n Characters after removing newline symbol \'\\n\'",chars)
print("\n overall chars count", len(chars))

In the above code, we collect all the unique characters and then exclude the newline symbol “\n”.

From the output, we can see that there are 37 unique characters. Apart from the letters, we have a few digits and symbols. We will now create the char to indices and indices to char dictionaries using the code below.

In [None]:
char_indices = dict((c, i) for i, c in enumerate(chars))
print("characters to indices dictionary\n", char_indices)
indices_char = dict((i, c) for i, c in enumerate(chars))
print("indices to char dictionary\n", indices_char)
print('unique chars: ', len(chars))

In the above code, we are creating two dictionaries.

A quick verification will show that “a” is mapped to 11 in the char_indices dictionary, and 11 is mapped to “a” in the indices_char dictionary. The next step is to apply this char_indices dictionary to the full data and convert it from a sequence of characters into a sequence of numbers. We have
removed the newline symbol from the data; we need to add a space at the end of every line to compensate for it.

In [None]:
data = longseq_3gram1.splitlines()
##Adding a space at the end
data = [i+' ' for i in data]

##mapping our data into numbers
sentences = [[char_indices[j] for j in i] for i in data ]
print(data[0], sentences[0])
print(data[10], sentences[10])
print(data[20], sentences[20])
print(data[100], sentences[100])
print(data[400], sentences[400])
print(data[4000], sentences[4000])
print(data[9000], sentences[9000])
##Number of sentences
print("Number of sentences ", len(sentences))

The above code simply maps each character to a number using the char_indices dictionary. The code also prints a few examples.

From the output, we can see the sequence of numbers corresponding to the sequence of characters. Every sentence will end with 0, which is nothing but space. There are a total of 30,207 sentences. We need to convert this data to RNN friendly data now. In this case study, we would like to take a sequence of 14 characters to predict the next character. Our RNN will have input sequence length 14 and predict one output at a time.
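
The conversion into input and output pairs can be sketched on a single sentence. For readability the sketch works on characters rather than on their indices; the sentence is made up, and the helper name `make_pairs` is hypothetical.

```python
SEQ_LEN = 14  # same window length as in the case study

def make_pairs(sentence, seq_len=SEQ_LEN):
    """Slide a window of seq_len characters and pair it with the character that follows."""
    return [
        (sentence[j:j + seq_len], sentence[j + seq_len])
        for j in range(len(sentence) - seq_len)
    ]

# A made-up 20-character sentence, including the trailing space
sentence = "the emergence of it "
for x, y in make_pairs(sentence):
    print(repr(x), "->", repr(y))
```

A sentence with `n` characters produces `n - 14` pairs, and a sentence of 14 characters or fewer produces none.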

For example, a sentence of length 20 is converted into six pairs, each with 14 input characters and one output character. We need to repeat the same for all the sentences using the code below.

In [None]:
#Since the sentences are not all of the same length, we make them consistent before passing them to Keras
#We select a fixed sequence length
Seq_ln = 14
X = []
y = []
for i in sentences:
    for j in range(len(i)-Seq_ln):
        X.append(i[j:j+Seq_ln])
        y.append(i[j+Seq_ln])
len(X), len(y)

From the output, we can see that the number of sentences in the data has increased to 142,142 from the original 30,307. On average, each original sentence has created almost five new pairs of X and y. Below is an example.

In [None]:
print("data[0:2]=", data[0:2])
print("sentences[0:2]=", sentences[0:2])

for i in range (0,20):
    print("X[",i,"]=", X[i],"y[",i,"]=", y[i])

The above code prints the first two sentences and the first twenty X and y pairs created from the data.

From the output we can see the X and y pairs. We are ready to build the model. We need to one hot encode the data and build the RNN model. Below is the code for the final steps of data processing.

In [None]:
#Row 0 of X holds characters 1 to 14 of the first sentence
#Row 1 of X holds characters 2 to 15 of the same sentence
#Row 2 of X holds characters 3 to 16, and so on
X=np.array(X)
X1=np.reshape(X,(X.shape[0],X.shape[1],1))
X1=keras.utils.to_categorical(np.array(X1), num_classes=len(char_indices))
print(X1.shape)

In [None]:
#Target variable
print(y[:10])
#Converting our labels to an array for the model
y1 = np.array(y)
# one hot encode outputs
y1 = keras.utils.to_categorical(y1, num_classes=len(char_indices))
y1.shape

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.20)
print("X_train.shape", X_train.shape)
print("y_train.shape", y_train.shape)
print("X_test.shape", X_test.shape)
print("y_test.shape", y_test.shape)

In the above code, we have reshaped the values in X and y, one-hot encoded them, and split them into training and test sets. The RNN model expects the input in the format (samples, time steps, features), so we need to reshape the data into that format.

We are done with data preprocessing. We will go ahead and build the RNN model.

#### Model building
While building the RNN model, we need to mention the time steps and the number of hidden nodes. Below is the code for building the model.

In [None]:
#building the model
model_RNN2 = Sequential()
##model.add(SimpleRNN('number of hidden nodes in each rnn cell', input_shape=(timesteps, data_dim)))
model_RNN2.add(SimpleRNN(16, input_shape=(X_train.shape[1], X_train.shape[2]))) 
model_RNN2.add(Dense(len(char_indices)))
model_RNN2.add(Activation('softmax'))
model_RNN2.summary()

From the above code, we can see that we are building the model with 16 hidden nodes.

We will compile the model and train it.

In [None]:
# compile network
model_RNN2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model_RNN2.fit(X_train, y_train, epochs=30, verbose=1, validation_data=(X_test, y_test))
model_RNN2.save_weights("char_rnn_model_weights_v1.hdf5")

We are training the model for 30 epochs and saving its weights to a file.

We can see from the output that the model is not improving after reaching 51% accuracy. The model will not show any improvement even after we train it for ten more epochs. Below is the code for downloading the pre-trained weights file and running additional epochs from it.

In [None]:
import urllib.request  
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
downloaded = drive.CreateFile({'id':"1VBKszu-PZY1EbdVl6NblXsW9363rWmrs"})   
downloaded.GetContentFile('Datasets.zip') 

!unzip -qq 'Datasets.zip'

In [None]:
weightsfile_model_RNN2= "Pre_trained_models/char_rnn_model_weights_v1.hdf5"
model_RNN2.load_weights(weightsfile_model_RNN2)

# compile network
model_RNN2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model_RNN2.fit(X_train, y_train, epochs=2, verbose=1)

We will now use this model for prediction.

#### Prediction
We need to write a predict function that takes characters as input and converts them to numbers. It uses these indices to get the predictions and finally converts them back to characters to give us the output. We are going to write a predict function that will predict not just one character but a sequence of characters that will form a word. The prediction loop will continue until it hits a space, which marks the completion of the word. Below is the predict function.


In [None]:
#function to prepare test input
def prepare_input(in_text):
    X1 = np.array([char_indices[i] for i in in_text]).reshape(1,14,1)
    X1=keras.utils.to_categorical(np.array(X1), num_classes=len(char_indices))
    return(X1)
#function to loop our predictions
def complete_pred(in_text):
    #original_text = in_text
    #generated = in_text
    completion = ''
    while True:
        x = prepare_input(in_text)
        pred = np.argmax(model_RNN2.predict(x, verbose=0), axis=-1)[0] #predict_classes was removed in TensorFlow 2.6

        next_char = indices_char[pred]

        in_text = in_text[1:] + next_char
        completion += next_char

        if len(completion)> 20 or next_char == ' ':
            return completion

From the code, we can see the two functions. One is for the usual character to index conversion and the other one for predicting the sequence of characters. We will now use this function for predictions.

In [None]:
in_text = 'officials say '
out_word = complete_pred(in_text)
print("Input text -->", in_text, "\npredicted word ---> ", out_word)
in_text = 'how dangerous '
out_word = complete_pred(in_text)
print("Input text -->", in_text, "\npredicted output ---> ", out_word)
in_text = 'political and '
out_word = complete_pred(in_text)
print("Input text -->", in_text, "\npredicted output ---> ", out_word)
in_text = 'whatever they '
out_word = complete_pred(in_text)
print("Input text -->", in_text, "\npredicted output ---> ", out_word)
in_text = 'of particular '
out_word = complete_pred(in_text)
print("Input text -->", in_text, "\npredicted output ---> ", out_word)

From the output, we can see that almost all the predictions are simple words like to, of, the, etc. These words are the most frequent in the data; they are known as stop words. The RNN is merely predicting the stop words for all inputs. The RNN has failed to model this data. The reason is that the length of the sequence is 14. In practice, RNN models usually fail when the sequence length is more than ten.

## LSTM
Long Short Term Memory (LSTM) models are created by making several modifications to standard RNN models. RNN models do not have long-term memory; LSTM adds some features that make the model remember long-term dependencies.
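
The text does not spell out those modifications, so the sketch below uses the commonly taught LSTM formulation with a forget gate, an input gate, an output gate, and a separate cell state. It is a scalar toy version with hand-picked parameters, meant only to show how the cell state can carry a value across many steps; the names `lstm_step` and `params` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell. `p` holds (input weight, recurrent weight, bias) per gate."""
    f = sigmoid(p["f"][0] * x + p["f"][1] * h_prev + p["f"][2])    # forget gate
    i = sigmoid(p["i"][0] * x + p["i"][1] * h_prev + p["i"][2])    # input gate
    o = sigmoid(p["o"][0] * x + p["o"][1] * h_prev + p["o"][2])    # output gate
    g = math.tanh(p["g"][0] * x + p["g"][1] * h_prev + p["g"][2])  # candidate value
    c = f * c_prev + i * g   # cell state: keep part of the old memory, add new content
    h = o * math.tanh(c)     # hidden state passed to the next step
    return h, c

# Made-up parameters: a large forget-gate bias keeps f close to 1,
# and a large negative input-gate bias keeps i close to 0
params = {"f": (0.0, 0.0, 6.0), "i": (0.0, 0.0, -6.0),
          "o": (0.0, 0.0, 6.0), "g": (1.0, 0.0, 0.0)}

h, c = 0.0, 1.0  # start with a value already stored in the cell state
for _ in range(14):
    h, c = lstm_step(0.0, h, c, params)
print("cell state after 14 steps:", round(c, 4))
```

With the forget gate close to 1, the stored value is almost unchanged after 14 steps. In a trained LSTM the gate values are learned from data rather than fixed by hand.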

### LSTM case study
In the case study on predicting the next characters, we used the standard RNN model, and it failed. We will now use the LSTM model on the same data. Below is the code for building the LSTM model.

In [None]:
#building the model
model_LSTM = Sequential()
#model_LSTM.add(LSTM('number of hidden nodes in each LSTM cell', input_shape=(timesteps, data_dim)))
model_LSTM.add(LSTM(128, input_shape=(X_train.shape[1], X_train.shape[2]))) 
model_LSTM.add(Dense(len(char_indices)))
model_LSTM.add(Activation('softmax'))
model_LSTM.summary()

We will now compile and train the model.

In [None]:
# compile network
model_LSTM.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model_LSTM.fit(X_train, y_train, epochs=30, verbose=1)
model_LSTM.save_weights("char_LSTM_model_weights_v1.hdf5")

This model shows a much better accuracy of 80%; on the same dataset, the RNN gave us about 50% accuracy. We will now use this model for prediction. The prediction is made one character at a time, and it continues until we see a space. The resulting sequence of characters forms the predicted word.
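
The character-by-character loop can be sketched independently of any trained model. In the sketch below, `predict_next_char` is a hypothetical stand-in for the model call, and `fake_model` simply replays a scripted word so that the example runs without TensorFlow; neither is part of the notebook.

```python
def complete_word(in_text, predict_next_char, max_len=20):
    """Predict one character at a time until a space or max_len characters."""
    completion = ""
    while len(completion) < max_len:
        next_char = predict_next_char(in_text)
        completion += next_char
        if next_char == " ":
            break
        # Slide the window: drop the oldest character, append the new one
        in_text = in_text[1:] + next_char
    return completion

# Hypothetical stand-in for a trained model: replays a scripted word
scripted = iter("that ")

def fake_model(window):
    return next(scripted)

print(repr(complete_word("officials say ", fake_model)))
```

The window always stays 14 characters long, because one character is dropped for every predicted character that is appended. The length cap guards against a model that never predicts a space.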

In [None]:
weightsfile_model_LSTM= "Pre_trained_models/char_LSTM_model_weights_v1.hdf5"
model_LSTM.load_weights( weightsfile_model_LSTM)

# compile network
model_LSTM.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model_LSTM.fit(X_train, y_train,epochs=2, verbose=1)

In [None]:
#function to prepare test input
def prepare_input1(in_text):
    X1 = np.array([char_indices[i] for i in in_text]).reshape(1,14,1)
    X1= keras.utils.to_categorical(np.array(X1), num_classes=len(char_indices))
    return(X1)
#function to loop our predictions
def complete_pred1(in_text):
    #original_text = in_text
    #generated = in_text
    completion = ''
    while True:
        x = prepare_input1(in_text)
        pred = np.argmax(model_LSTM.predict(x, verbose=0), axis=-1)[0] #predict_classes was removed in TensorFlow 2.6
        next_char = indices_char[pred]

        in_text = in_text[1:] + next_char
        completion += next_char

        if len(completion)> 20 or next_char == ' ':
            return completion
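
The sliding-window loop used in complete_pred1 can be tried without a trained model. The sketch below keeps the same loop but replaces the model with a hypothetical stand-in, toy_predictor, that only looks at the last two characters of the window; both complete_word and toy_predictor are illustrative names, not part of the case study code.

```python
def complete_word(in_text, predict_next_char, max_len=20):
    """Roll a fixed-width window forward one character at a time."""
    completion = ''
    while True:
        next_char = predict_next_char(in_text)
        # drop the oldest character and append the new one, so the width stays fixed
        in_text = in_text[1:] + next_char
        completion += next_char
        # stop at the end of a word or when the completion grows too long
        if len(completion) > max_len or next_char == ' ':
            return completion

# Hypothetical stand-in for the trained model
def toy_predictor(window):
    script = {'e ': 'o', ' o': 'f', 'of': ' '}
    return script.get(window[-2:], ' ')

print(repr(complete_word('the emergence ', toy_predictor)))   # prints 'of '
```

Each pass through the loop feeds the previous prediction back in as input, which is why one wrong character can derail the rest of the word.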

We will use the above function to predict on a few test points.

In [None]:
test_inputs = ['the emergence ', 'officials say ', 'and sentenced ', 'a combination ', 'and according ']
for in_text in test_inputs:
    out_word = complete_pred1(in_text)
    print("Input text -->", in_text, "; predicted output ---> ", out_word)

We will take a few test cases and get their RNN and LSTM model predictions, which will help us compare their performance. The code below gets the predictions using RNN and LSTM.

In [None]:
test_inputs = ['how dangerous ', 'political and ', 'of particular ', 'whatever they ']
for n, in_text in enumerate(test_inputs):
    if n > 0:
        print("\n")
    out_word = complete_pred1(in_text)
    print("Input text -->", in_text, "\nLSTM Prediction ---> ", out_word)
    out_word1 = complete_pred(in_text)
    print("RNN Prediction ---> ", out_word1)

We can see from the results that the RNN predictions are generic words, whereas LSTM is predicting
more related words. We will now see other applications of LSTM.

## Case Study – Language Translation
LSTM models are powerful sequence models. One of the most useful applications of LSTM is the sequence-to-sequence model, where both the input and the output are sequences. If we are building a chatbot, the input is a question as a sequence of words, and the output sequence of words is the answer. Similarly, in a language translation model, the input is a sequence of words from language-1, and the output is a sequence of words from language-2. Language-1 is the source language, and language-2 is the target language. An example is a translation model from English to French.

### Data and Objective
In this case study, the source language is English, and the target language is French. The objective is to build a machine translation model. The dataset has been downloaded from the http://www.manythings.org/anki/ website. Apart from English to French, there are several other datasets too; we are considering English to French as an example in this case study. The dataset is publicly available under the CC-BY 2.0 license. The code below is used for importing the data.

In [None]:
raw_data= open("fra-eng/fra.txt", mode='rt', encoding='utf-8').read()
raw_data=raw_data.strip().split('\n')
raw_data=[i.split('\t') for i in raw_data]
lang1_lang2_data=array(raw_data)
print(lang1_lang2_data)
print("Overall pairs", len(lang1_lang2_data))

The above code imports the text file, splits it into individual lines, and then splits each line at the tab character.

From the output, we can see that each line has some text from language-1 followed by the corresponding text from language-2. In our case, language-1 is English, and language-2 is French. From here on, we will use the generalized terminology of lang1 and lang2, so the code stays easy to read even if we try different languages. There are 175,623 lang1 and lang2 pairs overall. We would need many more pairs to build a high-quality model like Google Translate, but we can build a decent model with this data.

### Data processing
We will now perform some basic data pre-processing tasks, such as removing punctuation and converting to lowercase.

In [None]:
# Remove punctuation
lang1_lang2_data[:,0] = [word.translate(str.maketrans('', '', string.punctuation)) for word in lang1_lang2_data[:,0]]
lang1_lang2_data[:,1] = [word.translate(str.maketrans('', '', string.punctuation)) for word in lang1_lang2_data[:,1]]

print(lang1_lang2_data)

In [None]:
## convert text to lowercase
for word in range(len(lang1_lang2_data)):
    lang1_lang2_data[word,0] = lang1_lang2_data[word,0].lower()
    lang1_lang2_data[word,1] = lang1_lang2_data[word,1].lower()
print(lang1_lang2_data)

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lang1_lang2_data[:, 0])
lang1_tokens=tokenizer
lang1_vocab_size = len(lang1_tokens.word_index) + 1
print("lang1_vocab_size", lang1_vocab_size)

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lang1_lang2_data[:, 1])
lang2_tokens=tokenizer
lang2_vocab_size = len(lang2_tokens.word_index) + 1
print("lang2_vocab_size", lang2_vocab_size)

In the above code, we removed punctuation marks, converted everything to lowercase, and then tokenized the text. Tokenizing is nothing but dividing the data into words. Until now, our datasets had small vocabularies, and we created the tokens manually. Here we are using the Keras Tokenizer class.
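
To see what this tokenizing step produces, here is a minimal pure-Python sketch of the same idea: clean the text, count the words, and give the most frequent word the smallest index. It is not the Keras implementation; build_word_index and to_sequence are hypothetical helper names, and the three sentences are made-up examples.

```python
from collections import Counter
import string

def clean(sentence):
    # remove punctuation and convert to lowercase, as in the pre-processing above
    return sentence.translate(str.maketrans('', '', string.punctuation)).lower()

def build_word_index(sentences):
    """Give every word an integer id, most frequent first, starting at 1."""
    counts = Counter()
    for sentence in sentences:
        counts.update(clean(sentence).split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(sentence, word_index):
    # words that were never seen while building the index are skipped
    return [word_index[w] for w in clean(sentence).split() if w in word_index]

sentences = ["I love you.", "I love this song!", "Do you love it?"]
word_index = build_word_index(sentences)
vocab_size = len(word_index) + 1   # +1 because index 0 is reserved for padding

print(word_index)
print("vocab_size", vocab_size)
print(to_sequence("Do you love it?", word_index))   # prints [6, 3, 1, 7]
```

The `+ 1` in vocab_size matches the code above: index 0 is never assigned to a word, so it stays free for padding.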

From the output, we can see that there are 14,671 unique words in language-1 and 33,321 unique words in language-2. We are converting the data into a sequence of words. Later, these words will be converted to vectors in the word embedding layer. We will now create the train and test data.

In [None]:
# split data into train and test set
train, test = train_test_split(lang1_lang2_data, test_size=0.1, random_state = 44)

We are now ready to build the model, but before that, we need to convert these words into numbers and then pad the sequences with zeros. The padding fills the positions after the end of each sentence so that all sequences have the same length.

In [None]:
lang1_seq_length=15
lang2_seq_length=15

X_train_seq=lang1_tokens.texts_to_sequences(train[:, 0])
X_train= pad_sequences(X_train_seq,lang1_seq_length,padding='post')

Y_train_seq=lang2_tokens.texts_to_sequences(train[:, 1])
Y_train= pad_sequences(Y_train_seq,lang2_seq_length,padding='post')

X_test_seq=lang1_tokens.texts_to_sequences(test[:, 0])
X_test= pad_sequences(X_test_seq,lang1_seq_length,padding='post')

Y_test_seq=lang2_tokens.texts_to_sequences(test[:, 1])
Y_test= pad_sequences(Y_test_seq,lang2_seq_length,padding='post')

print("X_train.shape", X_train.shape)
print("Y_train.shape",Y_train.shape)
print("X_test.shape",X_test.shape)
print("Y_test.shape", Y_test.shape)

The above code first maps the words to numbers using the texts_to_sequences function. In this example, we fixed the sentence length at 15 words. If a sentence is shorter than 15 words, zeros are added at the end to make it 15 long. Longer sentences are cut down to 15 words; note that pad_sequences truncates from the beginning by default, so the last 15 words are kept. This padding is necessary to bring uniformity in the length of each sentence. If required, we can increase the length from 15 words to 20 words.
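
The effect of padding and truncation is easy to check on small lists. The sketch below is a pure-Python stand-in, not the Keras function: pad_post is a hypothetical helper that mirrors padding='post' together with the default truncating='pre'.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end and truncate from the start."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                          # keep only the last maxlen ids
        padded.append(seq + [value] * (maxlen - len(seq))) # fill the tail with zeros
    return padded

short_sentence = [5, 2, 9]
long_sentence = [1, 2, 3, 4, 5, 6, 7]
print(pad_post([short_sentence, long_sentence], maxlen=5))
# prints [[5, 2, 9, 0, 0], [3, 4, 5, 6, 7]]
```

If losing the first words of long sentences is a concern, pad_sequences also accepts truncating='post', which keeps the beginning instead.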

We will now print a row from the data to see the result of padding. Below is the code for printing a sample data point.

In [None]:
print("Text data", [train[5, 0]])
print('Numbers sequence', X_train_seq[5])
print('Padded Sequence', X_train[5])

We can see from the output that the sentence is converted into numbers. This sentence has only 12 words; hence it has been padded with three zeros at the end. Now we are ready for building the model.

### Encoder and Decoder
Sequence-to-sequence models are very different from all the models that we have discussed until now. A sequence of words cannot be converted to another sequence of words simply by word-to-word conversion, and the input and output sequences often have different lengths. We need to follow the encoder-decoder architecture to build sequence-to-sequence models. The encoder is an LSTM model that is used to understand and model the input sequence. Similarly, the decoder is another LSTM that is used to model the output sequence.

### Model building
There are four significant steps in this model architecture.
* Word embedding for language-1
* Encoder LSTM
* Repeat Vector generation from thought vector. This step is to match the decoder dimensions
* Decoder LSTM

Below is the model that covers the above points.

In [None]:
model = Sequential()
model.add(Embedding(lang1_vocab_size, 256, input_length=lang1_seq_length, mask_zero=True))
model.add(LSTM(128))
model.add(RepeatVector(lang2_seq_length))
model.add(LSTM(128, return_sequences=True))
model.add(Dense(lang2_vocab_size, activation='softmax'))
model.summary()
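
The role of the RepeatVector layer is easier to see by following the shapes. The encoder LSTM returns a single vector per sentence, while the decoder LSTM expects one input per output time step, so the vector is simply copied once per step. Below is a minimal sketch in plain Python; repeat_vector is a hypothetical helper, and the three-number vector stands in for the 128 encoder outputs.

```python
# Shapes through the model, per batch:
#   input                              (batch, 15)
#   Embedding(lang1_vocab_size, 256)   (batch, 15, 256)
#   LSTM(128)                          (batch, 128)        the thought vector
#   RepeatVector(15)                   (batch, 15, 128)
#   LSTM(128, return_sequences=True)   (batch, 15, 128)
#   Dense(lang2_vocab_size)            (batch, 15, lang2_vocab_size)

def repeat_vector(vector, n):
    """Copy one vector n times, giving one copy per decoder time step."""
    return [list(vector) for _ in range(n)]

thought_vector = [0.2, -0.7, 0.5]                    # stand-in for 128 values
decoder_input = repeat_vector(thought_vector, n=4)   # stand-in for 15 time steps

for step, vec in enumerate(decoder_input):
    print("decoder step", step, "receives", vec)
```

Every decoder step receives the same summary of the input sentence; the decoder's own state is what makes the outputs differ from step to step.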

We can now compile and train this model using the code below.

In [None]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
history = model.fit(X_train, Y_train.reshape(Y_train.shape[0], Y_train.shape[1], 1),  epochs=1, verbose=1, batch_size=1024)
model.save_weights('Eng_fra_model.hdf5')

The above code takes nearly four hours of execution time on a typical system. There are 8.3 million weight parameters, and there is a high chance that training this model might hang the system. We have already executed this model and saved the weights, so readers can skip the training step and directly load the weights from the saved model. If required, we can run a few more epochs on top of it. Once the weights are loaded, we can use the model for prediction. Below is the code for loading the weights into the model.

In [None]:
model.load_weights("Pre_trained_models/Eng_fra_model.hdf5")
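
The figure of roughly 8.3 million parameters can be checked by hand. The sketch below adds up the weights layer by layer, assuming the vocabulary sizes reported earlier (14,671 and 33,321); if your printed sizes differ, the total will differ too. The helper name lstm_params is ours.

```python
def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

lang1_vocab, lang2_vocab = 14671, 33321      # assumed from the earlier output

embedding = lang1_vocab * 256                # one 256-dim vector per lang1 word
encoder = lstm_params(256, 128)              # LSTM(128) on 256-dim embeddings
decoder = lstm_params(128, 128)              # LSTM(128) on the repeated vector
dense = 128 * lang2_vocab + lang2_vocab      # weights plus biases

total = embedding + encoder + decoder + dense
print(f"{total:,}")                          # prints 8,382,889
```

The embedding and the final dense layer account for nearly all of the weights, so the vocabulary sizes, not the LSTM sizes, dominate the model size.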

### Prediction
The prediction has five steps.
* Taking text data as input
* Pre-processing the text data
* Converting into numbers
* Prediction of the output sequences
* Finally, converting the output sequence of numbers into words

Below is the code that combines these steps.

In [None]:
def one_line_prediction(text1):
    
    def to_lines(text):
          sents = text.strip().split('\n')
          sents = [i.split('\t') for i in sents]
          return sents
    small_input = to_lines(text1)
    small_input = array(small_input)
    
    # Remove punctuation
    small_input[:,0] = [s.translate(str.maketrans('', '', string.punctuation)) for s in small_input[:,0]]
    # convert text to lowercase
    for i in range(len(small_input)):
        small_input[i,0] = small_input[i,0].lower()

    #encode and pad sequences
    small_input_seq=lang1_tokens.texts_to_sequences(small_input[0])
    small_input= pad_sequences(small_input_seq,lang1_seq_length,padding='post')
   

    #Load the model
    #Eng French Model
    #model.load_weights('/content/drive/My Drive/Training/Book/0.Chapters/Chapter12 RNN and LSTM/1.Archives/Eng_fra_model_v2.hdf5')

    #predict_classes was removed in TensorFlow 2.6; take the argmax of predict instead
    pred_seq = np.argmax(model.predict(small_input[0:1]), axis=-1)
    
    def num_to_word(n, tokens):
        for word, index in tokens.word_index.items():
            if index == n:
                return word
        return None

    Lang2_text = []
    for word_num in pred_seq:
        sing_pred = []
        for i in range(len(word_num)):
            t = num_to_word(word_num[i], lang2_tokens)
            #skip padding (no matching word) and immediate repeats of the previous word
            if t is None or (i > 0 and t == num_to_word(word_num[i-1], lang2_tokens)):
                sing_pred.append('')
            else:
                sing_pred.append(t)
        Lang2_text.append(' '.join(sing_pred))
    return(Lang2_text)

The above code looks complicated, but it is doing the simple task of mapping a sequence of numbers to words. The if-else conditions take care of exceptions, such as predicted numbers that have no matching word (the padding index) and a word repeated immediately after itself. Without these checks, the core mapping needs only a few lines.
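
The core of that mapping can be sketched on its own with a small made-up lookup table. Here decode is a hypothetical helper and index_word is a hand-written dictionary; the Keras Tokenizer keeps a similar reverse lookup in its index_word attribute. Unlike the function above, this sketch drops skipped positions instead of inserting empty strings.

```python
def decode(ids, index_word):
    """Map predicted ids to words, skipping padding (0) and immediate repeats."""
    words = []
    previous = None
    for n in ids:
        word = index_word.get(n)            # None when the id has no word, e.g. 0
        if word is not None and n != previous:
            words.append(word)
        previous = n
    return ' '.join(words)

index_word = {1: 'je', 2: 'vous', 3: 'remercie'}   # made-up lookup table
print(decode([1, 2, 2, 3, 0, 0], index_word))      # prints je vous remercie
```

A dictionary lookup also avoids scanning the whole vocabulary for every predicted number, which is what num_to_word does.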

Usually, it is a good idea to combine all the prediction related tasks into one predict function. Below are some of the results from our model predictions.

In [None]:
Input_sentences=["have a great Good day",
                 "Do you speak English",
                 "I do not know your language",
                 "I need help",
                 "Thank you very much",
                 "Where can I get this",
                 "How much does it cost",
                 "Where is the bathroom",
                 "Where is the ATM",
                 "I am a visitor here",
                 "Excuse me",
                 "What do you do for living",
                 "Here is my passport"]

for sent in Input_sentences:
  print([sent] , " -->",one_line_prediction(sent))

From the above results, we can see the model performance is not excellent. However, it can be
made better with more training data and a few more epochs.