<a href="https://colab.research.google.com/github/sudhirtakke/Word-Level-LSTM/blob/main/WordLevel_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building new dialogues using Keras.

### Table of Contents

1. [Learning Goals](#section1)
2. [Language Model Design](#section2)
3. [Load Text](#section3)
4. [Clean Text](#section4)
5. [Save Cleaned Text](#section5)
6. [Train Language Model](#section6)
 - a. [Load Sequences](#section601)
 - b. [Encode Sequences](#section602)
 - c. [Sequence Inputs and Output](#section603)
 - d. [Fit Model](#section604)
 - e. [Save the model](#section605)
7. [Use Language model](#section7)
 - a. [Load the Data](#section701)
 - b. [Load Model](#section702)
 - c. [Generate Text](#section703)


<br>

* We are going to develop a **word-level neural language model** and use it to generate text.

* A **language model** can predict the probability of the next word in the sequence, based on the **words already observed** in the sequence.

* **Neural network models** are a preferred method for **developing statistical language models** because they can use a **distributed representation** where different words with similar meanings have **similar representation**.

* They are also preferred because they can use a **large context** of recently observed words when **making predictions**.
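To make "probability of the next word" concrete, here is a minimal count-based sketch. It is not the neural model built in this notebook; it only estimates next-word probabilities from bigram counts on a toy sentence. The function name `next_word_probabilities` and the toy text are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def next_word_probabilities(tokens, context_word):
    """Estimate P(next word | context_word) from bigram counts."""
    following = defaultdict(Counter)
    # count which word follows which
    for current, nxt in zip(tokens, tokens[1:]):
        following[current][nxt] += 1
    counts = following[context_word]
    total = sum(counts.values())
    # normalize counts into probabilities (empty dict if the word was never seen)
    return {word: count / total for word, count in counts.items()}

# toy text, assumed for illustration only
toy_tokens = "the just man is happy and the unjust man is miserable".split()
probs = next_word_probabilities(toy_tokens, "the")
print(probs)
```

A neural language model replaces the count table with learned parameters and can condition on many previous words rather than just one.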



<a id=section1></a>
## 1. Learning goals
 

* How to prepare text for developing a **word-based language** model?
* How to design and fit a **neural language model** with a **learned embedding** and an **LSTM hidden layer**?
* How to use the **learned language model** to generate **new text** with **similar statistical properties** as the source text?

### Overview
1. The Republic by Plato
2. Data Preparation
3. Train Language Model
4. Use Language Model

---

## The Republic by Plato
<br>

- Download the ASCII **text version** of the entire book (or books) here: [The Republic](https://www.gutenberg.org/ebooks/1497) and save it as *republic.txt*

- **Open the file in a text editor and delete the front and back matter. This includes details about the book at the beginning, a long analysis, and license information at the end.**

## Data Preparation

- We will start by **preparing the data** for modeling.

- The first step is to look at the data.

### Review the Text
- Open the text in an editor and just look at the text data.

- For example, here is the first piece of dialog:

> BOOK I.

        I went down yesterday to the Piraeus with Glaucon the son of Ariston,
        that I might offer up my prayers to the goddess (Bendis, the Thracian
        Artemis.); and also because I wanted to see in what manner they would
        celebrate the festival, which was a new thing. I was delighted with the
        procession of the inhabitants; but that of the Thracians was equally,
        if not more, beautiful. When we had finished our prayers and viewed the
        spectacle, we turned in the direction of the city; and at that instant
        Polemarchus the son of Cephalus chanced to catch sight of us from a
        distance as we were starting on our way home, and told his servant to
        run and bid us wait for him. The servant took hold of me by the cloak
        behind, and said: Polemarchus desires you to wait.

        I turned round, and asked him where his master was.

        There he is, said the youth, coming after you, if you will only wait.

        Certainly we will, said Glaucon; and in a few minutes Polemarchus
        appeared, and with him Adeimantus, Glaucon’s brother, Niceratus the son
        of Nicias, and several others who had been at the procession.

        Polemarchus said to me: I perceive, Socrates, that you and your
        companion are already on your way to the city.

        You are not far wrong, I said.
        ...
        
### Here’s what we see from a quick look:

* Book/Chapter headings (e.g. “BOOK I.”).
* British English spelling (e.g. “honoured”)
* Lots of punctuation (e.g. “–“, “;–“, “?–“, and more)
* Strange names (e.g. “Polemarchus”).
* Some long monologues that go on for hundreds of lines.
* Some quoted dialog (e.g. ‘…’)


<a id=section2></a>
### 2. PLAN:  Language model design

- It will be **statistical** and will **predict the probability** of **each word** given an input sequence of text.
- The **predicted word** will be fed in as input to in turn **generate the next word**.

- A key design decision is how long the **input sequences** should be.
- They need to be long enough to allow the model to **learn the context** for the words to predict.
- This **input length** will also define the **length of seed text** used to generate **new sequences** when we use the model.
- There is **no correct** answer.
- With enough time and resources, we could explore the **ability of the model** to learn with **differently sized input sequences**.

- Instead, we will pick a length of **50 words** for the length of the **input sequences**, somewhat arbitrarily.

We'll be using the following **process sequence** in this notebook:
![](https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/word_lstm_flow0.png)
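
The feedback loop described in this section (predict a word, append it, predict again) can be sketched without any neural network. In the sketch below, `toy_predict` is a hypothetical stand-in for a trained model, and `context_length` plays the role of the 50-word input length.

```python
def generate(predict_next, seed_tokens, context_length, n_words):
    """Greedy generation: feed each predicted word back in as input."""
    context = list(seed_tokens)
    generated = []
    for _ in range(n_words):
        # the model only ever sees the most recent `context_length` words
        window = context[-context_length:]
        word = predict_next(window)
        generated.append(word)
        context.append(word)
    return generated

# Hypothetical stand-in for a trained model: a fixed lookup on the last word.
toy_table = {"the": "just", "just": "man", "man": "is", "is": "happy", "happy": "the"}

def toy_predict(window):
    return toy_table[window[-1]]

toy_output = generate(toy_predict, ["the"], context_length=3, n_words=6)
print(toy_output)
```

The seed text must supply enough words to fill the window, which is why the input length also fixes the length of the seed text.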

<a id=section3></a>
### 3. Load Text

- The first step is to **load the text** into **memory** as a single string.

In [1]:
# Import tensorflow 2.x
# This code block will only work in Google Colab.
try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

In [3]:
import urllib.request
response = urllib.request.urlopen('https://raw.githubusercontent.com/insaid2018/DeepLearning/master/Data/republic_clean.txt')
doc = response.read().decode('utf8')
print(doc[:200])

﻿BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to s


<a id=section4></a>
### 4. Clean Text

- We need to transform the **raw text** into a **sequence of tokens** or words that we can use as a source to train the model.

- Based on **reviewing the raw text** (above), below are some specific operations we will perform to clean the text. 
- We may want to explore **more cleaning operations** as an extension.

* **Replace ‘--‘** with a white space so we can split words better.
* **Split words** based on **white space**.
* Remove all **punctuation** from **words** to reduce the vocabulary size (e.g. ‘What?’ becomes ‘What’).
* **Remove all words** that are not alphabetic to remove standalone **punctuation tokens**.
* Normalize **all words** to **lowercase** to reduce the **vocabulary size**.


- **Vocabulary size** is a big deal with language modeling.
- A **smaller vocabulary** results in a **smaller model** that **trains faster.**
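
As a rough illustration of how these cleaning steps shrink the vocabulary, the sketch below applies them to a single made-up sentence (the sentence is an assumption, not taken from the book) and compares the number of unique tokens before and after.

```python
import string

# made-up sentence for illustration only
raw = "What? What is justice -- and what, Socrates, is Justice?"

# vocabulary with no cleaning: split on white space only
raw_tokens = raw.split()

# the cleaning steps listed above, applied in order
table = str.maketrans('', '', string.punctuation)
cleaned = [w.translate(table) for w in raw.replace('--', ' ').split()]
cleaned = [w.lower() for w in cleaned if w.isalpha()]

print('Unique raw tokens: %d' % len(set(raw_tokens)))
print('Unique cleaned tokens: %d' % len(set(cleaned)))
```

Variants such as "What?", "What" and "what," collapse into a single entry, so the output layer of the model needs fewer classes.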

In [4]:
import string

# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens

In [5]:
# clean document
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['book', 'i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the', 'procession', 'of', 'the', 'inhabitants', 'but', 'that', 'of', 'the', 'thracians', 'was', 'equally', 'if', 'not', 'more', 'beautiful', 'when', 'we', 'had', 'finished', 'our', 'prayers', 'and', 'viewed', 'the', 'spectacle', 'we', 'turned', 'in', 'the', 'direction', 'of', 'the', 'city', 'and', 'at', 'that', 'instant', 'polemarchus', 'the', 'son', 'of', 'cephalus', 'chanced', 'to', 'catch', 'sight', 'of', 'us', 'from', 'a', 'distance', 'as', 'we', 'were', 'starting', 'on', 'our', 'way', 'home', 'and', 'told', 'his', 'servant', 'to', 'run', 'and', 'bid',

<a id=section5></a>
### 5. Save clean text

- We can organize the long list of tokens into sequences of **50 input words** and **1 output word**.

- That is, sequences of **51 words**.

- We can do this by iterating over the list of tokens from index 51 onwards and taking the preceding **51 tokens as a sequence** (50 input words plus 1 output word), then repeating this process to the end of the list of tokens.

- We will transform the tokens into **space-separated strings** for later storage in a file.

- The code to split the list of **clean tokens** into **sequences with a length of 51 tokens** is listed below.

In [6]:
# organize into sequences of tokens
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 118633


In [7]:
sequences[:2]

['book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was',
 'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted']
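
A small detail of the loop bounds is easier to see on a toy list. With `range(length, len(tokens))`, the slice `tokens[i-length:i]` never reaches the last token, so the final window is skipped. The sketch below uses an assumed five-token list and a window of 3; whether losing one sequence matters for a corpus of this size is a judgment call.

```python
toy_tokens = ['a', 'b', 'c', 'd', 'e']
length = 2 + 1  # 2 input words + 1 output word

# same loop bounds as the cell above
windows = [toy_tokens[i - length:i] for i in range(length, len(toy_tokens))]

# extending the range by one also yields the final window
all_windows = [toy_tokens[i - length:i] for i in range(length, len(toy_tokens) + 1)]

print(windows)
print(all_windows)
```

The sequence count reported above corresponds to the loop as written.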

In [8]:
# save lines to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    # the context manager closes the file even if the write fails
    with open(filename, 'w') as file:
        file.write(data)
    
# save sequences to file
out_filename = 'republic_sequences.txt'
save_doc(sequences, out_filename)

<a id=section6></a>
## 6. Train Language Model

We can now train a **statistical language model** from the prepared data.

The model we will train is a neural language model. It has a few unique characteristics:

* It uses a **distributed representation for words** so that different words with similar meanings will have a similar representation.
* It **learns** the **representation** at the same time as **learning the model.**
* It **learns** to **predict the probability** for the next word using the context of the last 50 words.

Specifically, we will use an **Embedding Layer** to learn the representation of words, and a **Long Short-Term Memory (LSTM)** recurrent neural network to learn to **predict words** based on their context.

### Let’s start by loading our training data.

<a id=section601></a>
### a. Load Sequences

- We can load our **training data** using the **`load_doc()`** function defined below.


- Once loaded, we can **split the data into separate training sequences** by splitting based on new lines.


- The snippet below will load the **‘republic_sequences.txt‘** data file from the current working directory.

In [9]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

In [10]:
lines[:2]

['book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was',
 'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted']

<a id=section602></a>
### b. Encode Sequences

- The **word embedding layer** expects input sequences to be comprised of integers.

- We can **map each word in our vocabulary** to a unique integer and encode our input sequences.
- Later, when we make predictions, we can convert the **prediction to numbers** and look up their **associated words** in the **same mapping**.

- To do this **encoding**, we will use the **`Tokenizer`** class in the Keras API.

- First, the **Tokenizer** must be trained on the **entire training dataset**, which means it finds all of the unique words in the data and assigns each a unique integer.

- We can then use the fitted **Tokenizer** to encode all of the training sequences, **converting each sequence** from a **list of words** to a **list of integers**.

In [11]:
from numpy import array
from pickle import dump
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding

# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
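
To show roughly what the fit and encode steps do, here is a plain-Python sketch. It is not the Keras implementation: the real `Tokenizer` also lowercases and filters punctuation by default, which this sketch skips because it assumes already-cleaned text. The helper names `fit_word_index` and `encode` are hypothetical.

```python
from collections import Counter

def fit_word_index(lines):
    """Assign each word an integer, most frequent word first, starting at 1."""
    counts = Counter(word for line in lines for word in line.split())
    # sorted() is stable, so equally frequent words keep first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

def encode(lines, word_index):
    """Convert each line from a list of words to a list of integers."""
    return [[word_index[w] for w in line.split()] for line in lines]

# toy lines, assumed for illustration only
toy_lines = ["the just man is happy", "the unjust man is miserable"]
toy_word_index = fit_word_index(toy_lines)
toy_encoded = encode(toy_lines, toy_word_index)
print(toy_word_index)
print(toy_encoded)
```

Because indices follow frequency, very common words such as "the" receive the smallest integers.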

- We can access the **mapping of words to integers** as a dictionary attribute called **`word_index`** on the **Tokenizer** object.

- We need to know the **size of the vocabulary** for defining the embedding layer later. 
- We can determine the vocabulary by **calculating the size** of the **mapping dictionary**.

- Words are assigned values from **1 to the total number of words** (e.g. 7,409).
- The **Embedding layer** needs to allocate a **vector representation** for each word in this vocabulary, from **index 1 to the largest index**.
- Because array indexing is **zero-offset** and the word at the end of the vocabulary has index 7,409, the array must be **7,409 + 1** in length.

- Therefore, when specifying the **vocabulary size** to the **Embedding layer**, we specify it as 1 larger than the **actual vocabulary**.

In [12]:
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1
vocab_size

7410
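
The off-by-one reasoning can be checked with a toy lookup table. This is only a sketch with an assumed three-word mapping and plain lists standing in for the embedding matrix.

```python
# toy mapping, assumed for illustration; indices start at 1 as with the Tokenizer
toy_word_index = {"the": 1, "man": 2, "is": 3}
toy_vocab_size = len(toy_word_index) + 1   # +1 because index 0 is never assigned to a word
embedding_dim = 2                           # toy size; this notebook uses 50

# one row per possible index 0..largest index
embedding_table = [[0.0] * embedding_dim for _ in range(toy_vocab_size)]

largest_index = max(toy_word_index.values())
# with only len(toy_word_index) rows this lookup would raise IndexError
vector = embedding_table[largest_index]
print(len(embedding_table), vector)
```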

<a id=section603></a>
### c. Sequence Inputs and Output

- Now that we have encoded the input sequences, we need to separate them into input (X) and output (y) elements.

- We can do this with **array slicing**.

- After separating, we need to **one hot encode the output word**. 
- This means converting it from an integer to a vector of 0 values, one for each word in the vocabulary, with a **1 to indicate the specific word** at the index of the word's integer value.

- This is so that the model learns to **predict the probability distribution** for the next word; the ground truth to learn from is 0 for all words except the actual word that comes next.

- Keras provides the **`to_categorical()`** function, which can be used to **one hot encode** the output words for each **input-output sequence** pair.

- Finally, we need to specify to the **Embedding layer** how long input sequences are. 
- We know that there are **50 words** because we designed the model, but a good generic way to specify that is to use the **second dimension (number of columns)** of the input data’s shape. 
- That way, if we change the **length of sequences** when preparing data, we do not need to change this **data loading code**; it is generic.
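
The slicing and one hot encoding steps can be sketched in plain Python on two short toy sequences. The helper `split_and_one_hot` is hypothetical and only mirrors what the array slicing and `to_categorical()` accomplish.

```python
def split_and_one_hot(sequences, vocab_size):
    """Split each sequence into inputs and a one hot encoded output word."""
    # all but the last integer are inputs
    X = [seq[:-1] for seq in sequences]
    y = []
    for seq in sequences:
        target = [0.0] * vocab_size
        target[seq[-1]] = 1.0   # 1 at the index of the output word's integer
        y.append(target)
    return X, y

# toy encoded sequences, assumed for illustration only
toy_sequences = [[1, 4, 2, 3], [4, 2, 3, 5]]
toy_X, toy_y = split_and_one_hot(toy_sequences, vocab_size=6)
toy_seq_length = len(toy_X[0])   # number of columns in the input
print(toy_X, toy_y, toy_seq_length)
```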

In [13]:
# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

In [14]:
X.shape

(118633, 50)

In [15]:
X[0]

array([1046,   11,   11, 1045,  329, 7409,    4,    1, 2873,   35,  213,
          1,  261,    3, 2251,    9,   11,  179,  817,  123,   92, 2872,
          4,    1, 2249, 7408,    1, 7407, 7406,    2,   75,  120,   11,
       1266,    4,  110,    6,   30,  168,   16,   49, 7405,    1, 1609,
         13,   57,    8,  549,  151,   11])

In [16]:
y.shape

(118633, 7410)

In [17]:
y[0]

array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)

<a id=section604></a>
### d. Fit Model

- We can now **define and fit** our language model on the training data.

- The **learned embedding** needs to know the size of the vocabulary and the **length of input sequences** as previously discussed.

 - It also takes the **size of the embedding vector space**: a parameter to specify how many dimensions will be used to represent each word.

- Common values are **50, 100, and 300**. 
- We will use 50 here, but consider **testing smaller or larger values**.

- We will use **two LSTM hidden layers** with **100 memory cells** each. 
- More **memory cells** and a **deeper network** may achieve better results.

 
###  Procedure:

 - A **dense fully connected layer** with **100 neurons** connects to the **LSTM hidden layers** to interpret the features extracted from the sequence. 
 - The **output layer** predicts the **next word** as a single vector the **size of the vocabulary** with a probability for each word in the vocabulary. 
 - A **softmax activation function** is used to **ensure the outputs** have the characteristics of normalized probabilities.
 
 <center><img src = "https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/images.png"/></center>

In [18]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 50)            370500    
_________________________________________________________________
lstm (LSTM)                  (None, 50, 100)           60400     
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense (Dense)                (None, 100)               10100     
_________________________________________________________________
dense_1 (Dense)              (None, 7410)              748410    
Total params: 1,269,810
Trainable params: 1,269,810
Non-trainable params: 0
_________________________________________________________________
None
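
The parameter counts in the summary can be reproduced by hand with the standard formulas for embedding, LSTM and dense layers. The helper names below are hypothetical; the layer sizes are the ones used in this notebook.

```python
def embedding_params(vocab_size, dim):
    # one vector of length `dim` per vocabulary index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

layer_params = [
    embedding_params(7410, 50),   # embedding
    lstm_params(50, 100),         # lstm
    lstm_params(100, 100),        # lstm_1
    dense_params(100, 100),       # dense
    dense_params(100, 7410),      # dense_1
]
print(layer_params, sum(layer_params))
```

Most of the parameters sit in the output layer and the embedding, both of which scale with the vocabulary size.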


- The model is compiled specifying the **categorical cross entropy loss** needed to fit the model.
- Technically, the **model** is learning a **multi-class classification** and this is the suitable loss function for this type of problem. 
- The efficient **Adam implementation** of **mini-batch gradient descent** is used, and the accuracy of the model is evaluated during training.

- Finally, the **model** is fit on the data for **100 training epochs** with a modest batch size of 128 to speed things up.

- Training may take a **few hours on modern hardware** without GPUs. 
- We can speed it up with a **larger batch size** and/or fewer training epochs.
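
For intuition about the loss, here is a minimal sketch of softmax followed by categorical cross entropy on a three-word toy vocabulary. The logits are made-up numbers, and the functions are plain-Python illustrations rather than the Keras implementations. The last lines also work out how many mini-batches make up one epoch for the dataset and batch size used here.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # subtracting the max avoids overflow without changing the result
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot_target, probs):
    """Negative log probability assigned to the true word."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)

toy_probs = softmax([2.0, 1.0, 0.1])           # made-up logits
loss_likely = categorical_cross_entropy([1.0, 0.0, 0.0], toy_probs)
loss_unlikely = categorical_cross_entropy([0.0, 0.0, 1.0], toy_probs)
print(toy_probs, loss_likely, loss_unlikely)

# mini-batches per epoch for 118,633 sequences and a batch size of 128
steps_per_epoch = math.ceil(118633 / 128)
print(steps_per_epoch)
```

The loss is small when the model puts high probability on the word that actually comes next, and large otherwise.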

In [19]:
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=128, epochs=100)

Epoch 1/100
...
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7f85141d0f90>

<a id=section605></a>
### e. Save the model

- Here, we use the **Keras model API** to save the model to the file **‘model.h5‘** in the current working directory.

- Later, when we **load the model** to make predictions, we will also need the **mapping of words** to **integers**. 
- This is in the **Tokenizer object**, and we can save that too **using Pickle**.

In [20]:
# save the model to file
model.save('model.h5')
# save the tokenizer (the context manager closes the file afterwards)
with open('tokenizer.pkl', 'wb') as f:
    dump(tokenizer, f)

<a id=section7></a>
## 7. Use Language model

- Now that the model is trained, we can use it to generate **new sequences of text** that have the same **statistical properties** as the source text.

- We will start by **loading** the **training sequences** again.



<a id=section701></a>
### a. Load the data

In [21]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load cleaned text sequences
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')


- We need the text so that we can choose a **source sequence** as input to the model for generating a **new sequence of text**.

- The model will require **50 words** as **input**.

- Later, we will need to specify the **expected length of input**.
- We can determine this from the **input sequences** by **calculating the length** of one line of the loaded data and **subtracting** **1** for the **expected output** word that is also on the same line.



In [22]:
seq_length = len(lines[0].split()) - 1

<a id=section702></a>
### b. Load Model

- We can now **load the model** from file.


- Keras provides the **load_model() function** for loading the model, ready for use.

In [23]:
from random import randint
from pickle import load
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# load the model
model = load_model('model.h5')

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))

<a id=section703></a>
### c. Generate text

* The first step in generating text is **preparing a seed input**.


* We will select a **random line** of text from the **input text** for this purpose. 

In [24]:
from random import randint
from pickle import load
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences


# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict probabilities for each word and take the most likely index
        # (predict_classes is not available in newer TensorFlow releases)
        yhat = int(model.predict(encoded, verbose=0).argmax(axis=-1)[0])
        # map predicted word index to word
        out_word = tokenizer.index_word.get(yhat, '')
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)


# load the model
model = load_model('model.h5')

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))

# select a seed text
# randint is inclusive at both ends, so the upper bound is len(lines) - 1
seed_text = lines[randint(0, len(lines) - 1)]
print("seed_text:" + '\n')
print(seed_text + '\n')

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print("generated_text:" + '\n')
print(generated)

seed_text:

live well and the unjust man will live ill that is what your argument proves and he who lives well is blessed and happy and he who lives ill the reverse of happy certainly then the just is happy and the unjust miserable so be it but happiness and not misery





generated_text:

is profitable for either or if there were no war in relation to mind thrasymachus and could he not be sophisms find a mans own light or ridiculous causes she gives the former times into the truer matters of the unjust having fallen up the chaos was taught him and


 - We can see that the text seems reasonable. Printing the seed and the generated text concatenated would make the result easier to interpret. Nevertheless, the generated text gets the right kind of words in the right kind of order.
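
Inside `generate_seq`, the input grows by one word per step, so it has to be cut back to the fixed input length before each prediction. The sketch below mimics, in plain Python, what `pad_sequences` does with `truncating='pre'` and its default left padding with zeros; `pad_or_truncate_pre` is a hypothetical helper, not part of Keras.

```python
def pad_or_truncate_pre(encoded, maxlen):
    """Keep the last `maxlen` integers; left-pad with zeros if too short."""
    trimmed = encoded[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed

# too long: the oldest words are dropped from the front
too_long = pad_or_truncate_pre([1, 2, 3, 4, 5], 3)
# too short: zeros are added at the front
too_short = pad_or_truncate_pre([7, 8], 4)
print(too_long, too_short)
```

Dropping words from the front keeps the most recent context, which is what the model was trained to condition on.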

