# Text generation using RNN - Word Level

To generate text using an RNN, we need to convert the raw text into a supervised learning problem.

Take, for example, the following corpus:

"Her brother shook his head incredulously. He was not aware of the situation at all."

First, we need to arrange the data in tabular format, with input (X) and output (y) sequences. For a word-level model, X and y look like this:

|      X                |  Y      |
|-----------------------|---------|
|    < word1 >< word2 > | < word3 > |
|    Her brother        |  shook  |
|    brother shook      |  his    |
|    shook his          |  head   |
|    his head           | incredulously |
|    head incredulously |    .    |
|    ..                 |    .    |
|    situation at       |  all    |
|    at all             |    .    |

Note that in the above problem, the sequence length of **X is two words** and that of **y is one word**. Hence, this is a many-to-one architecture. We can, however, change the number of input words to any number depending on the problem.
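The windowing described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a naive regex tokenization in which words and full stops are separate tokens; the helper name `make_windows` is hypothetical and not part of the notebook.

```python
import re

def make_windows(text, n_inputs=2):
    """Split text into (X, y) pairs: n_inputs consecutive tokens -> next token."""
    # Naive tokenization (an assumption for this sketch): words and full stops
    tokens = re.findall(r"\w+|\.", text)
    pairs = []
    for i in range(n_inputs, len(tokens)):
        pairs.append((tokens[i - n_inputs:i], tokens[i]))
    return pairs

corpus = "Her brother shook his head incredulously. He was not aware of the situation at all."
pairs = make_windows(corpus)
for x, y in pairs[:3]:
    print(" ".join(x), "->", y)
```

Changing `n_inputs` changes the number of input words per sample, as mentioned above.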

A model is trained on such data. To generate text, we give the model any two words, from which it predicts the next word. The predicted word is then appended to the right end of the input sequence and the leftmost word is discarded. The model predicts again using the new sequence, and the cycle continues for a fixed number of iterations. An example is shown below:

Seed text: "Did I"

|      X                                            |  Y                       |
|---------------------------------------------------|--------------------------|
|                        Did I                      |    < predicted word 1 >  |
|               I < predicted word 1 >              |    < predicted word 2 >  |
|       < predicted word 1 > < predicted word 2 >   |    < predicted word 3 >  |
|       < predicted word 2 > < predicted word 3 >   |    < predicted word 4 >  |
|                      ...                          |            ...           |
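The predict, append, discard cycle in the table can be sketched independently of any trained model. In the snippet below, `toy_predict` is a hypothetical stand-in for the model (a fixed lookup on the last input word), used only to make the loop runnable.

```python
def generate(predict_next, seed_words, n_words):
    """Repeatedly predict a word, append it, and drop the oldest input word."""
    window = list(seed_words)
    generated = []
    for _ in range(n_words):
        next_word = predict_next(window)
        generated.append(next_word)
        # Slide the window: drop the leftmost word, add the prediction on the right
        window = window[1:] + [next_word]
    return generated

# Stand-in for a trained model (hypothetical): a fixed lookup on the last input word
toy_table = {"I": "see", "see": "the", "the": "king"}

def toy_predict(window):
    return toy_table.get(window[-1], "the")

print(generate(toy_predict, ["Did", "I"], 4))
```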

# Notebook Overview
1. Preprocess data
2. Build LSTM model
3. Generate text

In [1]:
# import libraries
import warnings
warnings.filterwarnings("ignore")

import re
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
import requests

from gensim.models import KeyedVectors

from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
# from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# 1. Preprocess data

In [2]:
# download ebook
url = "https://www.gutenberg.org/files/24869/24869-0.txt"
book = requests.get(url)
data = book.text
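`requests` chooses the text encoding of `book.text` from the HTTP response, and when the server does not declare a charset the guess can mangle accented characters. A minimal sketch of decoding the raw bytes explicitly is shown below, assuming the file is UTF-8 (as the book's own header states); the helper name `decode_utf8` is hypothetical, and the outputs shown in this notebook were produced without this step.

```python
def decode_utf8(raw_bytes):
    """Decode bytes as UTF-8, dropping a leading byte-order mark if present."""
    return raw_bytes.decode("utf-8-sig")

# Small in-memory sample standing in for `book.content` (no network needed)
sample = "\ufeffPraise to Válmíki".encode("utf-8")
print(decode_utf8(sample))
```

In the notebook this would correspond to `data = decode_utf8(book.content)`, where `book.content` holds the undecoded response bytes.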

In [3]:
# let's look at the text
print(data[:500])

The Project Gutenberg EBook of The Ramayana



This eBook is for the use of anyone anywhere at no cost and with almost no
restrictions whatsoever. You may copy it, give it away or re-use it under
the terms of the Project Gutenberg License included with this eBook or
online at http://www.gutenberg.org/license



Title: The Ramayana



Release Date: March 18, 2008 [Ebook #24869]

Language: English

Character set encoding: UTF-8


***START OF THE PROJECT GUTENBERG EBOOK THE


In [4]:
# subset the book from the first chapter, that is, INVOCATION - everything before the first chapter is irrelevant
start_index = re.search(r"invocation\.\(1\)", data, re.I)
print(start_index.start())

19177


In [5]:
# keep only the text from the start of the first chapter onwards
data = data[start_index.start():]

In [6]:
# let's look at the text
print(data[:500])

INVOCATION.(1)


Praise to Válmíki,(2)bird of charming song,(3)
  Who mounts on Poesy’s sublimest spray,
And sweetly sings with accent clear and strong
  Ráma, aye Ráma, in his deathless lay.

Where breathes the man can listen to the strain
  That flows in music from Válmíki’s tongue,
Nor feel his feet the path of bliss attain
  When Ráma’s glory by the saint is sung!

The stream Rámáyan leaves its sacred fount
  The whole wide world from sin and stain to free.(4)
T


## Clean text

In [7]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [8]:
# define a function to clean text data
def clean_document(document, char_filter = r"[^\w]"):
    '''
    input:
    document          :  string
    char_filter       :  regex pattern - removes those characters from the text that match the pattern

    output: clean document
    '''

    # convert words to lower case
    document = document.lower()

    # tokenise words
    words = word_tokenize(document)

    # strip whitespace from all words
    words = [word.strip() for word in words]

    # join back words to get document
    document = " ".join(words)

    # remove unwanted characters
    document = re.sub(char_filter, " ", document)

    # replace multiple whitespaces with single whitespace
    document = re.sub(r"\s+", " ", document)

    # strip whitespace from document
    document = document.strip()

    return document

data = clean_document(data)

In [9]:
len(data)

2195566

Note: We reduce the size of the data because using the whole text exhausts the available RAM.

In [10]:
data = data[:400000]
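Slicing at a fixed character count can cut the final word in half, which leaves a partial word as the last token. A minimal sketch of truncating at the last space instead is given below; the helper name `truncate_at_word` is hypothetical and the notebook itself uses the plain slice.

```python
def truncate_at_word(text, max_chars):
    """Return at most max_chars characters, cut at the last space so no word is split."""
    if len(text) <= max_chars:
        return text
    # Search one character past the limit so a space exactly at the boundary counts
    cut = text.rfind(" ", 0, max_chars + 1)
    return text[:cut] if cut != -1 else text[:max_chars]

print(truncate_at_word("rama never told anyone about", 14))
```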

In [11]:
# length of text
words = word_tokenize(data)
print("Number of words in document: {}".format(len(words)))

Number of words in document: 78188


## Convert words to integers

In [12]:
# use Keras' Tokenizer() function to encode text to integers
word_tokeniser = Tokenizer()
word_tokeniser.fit_on_texts([data])
encoded_words = word_tokeniser.texts_to_sequences([data])[0]

In [13]:
# check the size of the vocabulary
VOCABULARY_SIZE = len(word_tokeniser.word_index) + 1
print('Vocabulary Size: {}'.format(VOCABULARY_SIZE))

Vocabulary Size: 6713
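The `+ 1` in `VOCABULARY_SIZE` accounts for index 0, which Keras' `Tokenizer` never assigns to a word; word indices start at 1, with the most frequent word first. The sketch below mimics that indexing in plain Python on a toy sentence. It is an illustration of the idea, not the `Tokenizer` implementation, and `build_word_index` is a hypothetical name.

```python
from collections import Counter

def build_word_index(text):
    """Map each word to an integer starting at 1, most frequent word first."""
    counts = Counter(text.split())
    # most_common() orders by descending count (ties keep first-seen order)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

word_index = build_word_index("the king and the queen and the prince")
vocabulary_size = len(word_index) + 1  # +1 for the unused index 0
print(word_index, vocabulary_size)
```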


## Divide data in X and y

### Create sequences

In each training sample, X will have a sequence of five words and y will have the sixth word. In other words, we use the previous five words of a sequence to predict the next word.

In [14]:
sequences = []
MAX_SEQ_LENGTH = 5  # X will have five words, y will have the sixth word

for i in range(MAX_SEQ_LENGTH, len(encoded_words)):
    sequence = encoded_words[i-MAX_SEQ_LENGTH:i+1]
    sequences.append(sequence)
sequences = np.array(sequences)

In [15]:
print('Total number of training samples: {}'.format(len(sequences)))
print('\nSample sequences: \n{}'.format(sequences[0:3]))

Total number of training samples: 78183

Sample sequences: 
[[3864 3865  650    3  107  720]
 [3865  650    3  107  720  937]
 [ 650    3  107  720  937 3866]]


In [16]:
# divide the sequence into X and y
sequences = np.array(sequences)

X = sequences[:,:-1]  # assign all but last words of a sequence to X
y = sequences[:,-1]   # assign last word of each sequence to y

In [17]:
# Look at the first training example
print("Input of the first data point:", X[0], "\n")
print("Output of the first data point: [", y[0], "]")

Input of the first data point: [3864 3865  650    3  107] 

Output of the first data point: [ 720 ]


### One-hot encode y

In [18]:
y.shape

(78183,)

In [19]:
y = to_categorical(y, num_classes=VOCABULARY_SIZE)

In [20]:
print(X.shape)
print(y.shape)

(78183, 5)
(78183, 6713)


There are 78183 sequences (data points) in total.

Remember that to use an RNN, the data has to be of the shape (#samples, #timesteps, #features).

In X, the third dimension (the number of features) is missing because we're going to use Keras' Embedding layer. Hence we don't need to explicitly reshape the data to add the third dimension; Keras does that automatically.

In y, the second dimension (the number of timesteps) is missing because y is not a sequence; it's just a single word. The features are represented by a one-hot encoded vector whose length is VOCABULARY_SIZE.
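The one-hot encoded y is by far the largest array in this notebook, which helps explain why the full text does not fit in memory. A rough size estimate is sketched below, assuming 4 bytes per value (32-bit floats); the helper name `one_hot_bytes` is hypothetical.

```python
def one_hot_bytes(n_samples, n_classes, bytes_per_value=4):
    """Size in bytes of a dense one-hot matrix (4 bytes per value assumed)."""
    return n_samples * n_classes * bytes_per_value

size = one_hot_bytes(78183, 6713)
print("{:.2f} GB".format(size / 1e9))
```

Keras also provides a `sparse_categorical_crossentropy` loss that takes integer class indices as targets, so y could in principle stay as a vector of integers; this notebook does not take that route.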

### Pad sequences

In [21]:
X = pad_sequences(X, maxlen=MAX_SEQ_LENGTH, padding='pre')
print('Input sequence length: {}'.format(MAX_SEQ_LENGTH))

Input sequence length: 5


# 2. Build LSTM model

In [22]:
# create model architecture

EMBEDDING_SIZE = 100


model = Sequential()

# embedding layer
model.add(Embedding(VOCABULARY_SIZE, EMBEDDING_SIZE, input_length = MAX_SEQ_LENGTH))

# lstm layer 1
model.add(LSTM(128, return_sequences=True))

# lstm layer 2
model.add(LSTM(128))

# output layer
model.add(Dense(VOCABULARY_SIZE, activation='softmax'))

In [23]:
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# summarize defined model
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 5, 100)            671300    
                                                                 
 lstm (LSTM)                 (None, 5, 128)            117248    
                                                                 
 lstm_1 (LSTM)               (None, 128)               131584    
                                                                 
 dense (Dense)               (None, 6713)              865977    
                                                                 
Total params: 1786109 (6.81 MB)
Trainable params: 1786109 (6.81 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
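The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the function names are hypothetical helpers for this check.

```python
def embedding_params(vocab_size, embedding_size):
    # One vector of length embedding_size per vocabulary index
    return vocab_size * embedding_size

def lstm_params(input_size, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_size + units) * units + units)

def dense_params(input_size, units):
    # Weight matrix plus one bias per output unit
    return input_size * units + units

counts = [
    embedding_params(6713, 100),
    lstm_params(100, 128),
    lstm_params(128, 128),
    dense_params(128, 6713),
]
print(counts, sum(counts))
```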


In [24]:
# fit network
model.fit(X, y, epochs=10, verbose=2, batch_size=256)

Epoch 1/10
306/306 - 20s - loss: 7.0135 - accuracy: 0.0570 - 20s/epoch - 64ms/step
Epoch 2/10
306/306 - 5s - loss: 6.5848 - accuracy: 0.0588 - 5s/epoch - 18ms/step
Epoch 3/10
306/306 - 5s - loss: 6.3931 - accuracy: 0.0721 - 5s/epoch - 15ms/step
Epoch 4/10
306/306 - 5s - loss: 6.2261 - accuracy: 0.0813 - 5s/epoch - 15ms/step
Epoch 5/10
306/306 - 4s - loss: 6.0985 - accuracy: 0.0853 - 4s/epoch - 14ms/step
Epoch 6/10
306/306 - 4s - loss: 5.9840 - accuracy: 0.0921 - 4s/epoch - 13ms/step
Epoch 7/10
306/306 - 4s - loss: 5.8744 - accuracy: 0.1007 - 4s/epoch - 14ms/step
Epoch 8/10
306/306 - 4s - loss: 5.7769 - accuracy: 0.1081 - 4s/epoch - 15ms/step
Epoch 9/10
306/306 - 5s - loss: 5.6906 - accuracy: 0.1147 - 5s/epoch - 17ms/step
Epoch 10/10
306/306 - 4s - loss: 5.6091 - accuracy: 0.1192 - 4s/epoch - 13ms/step


<keras.src.callbacks.History at 0x7b8226bd3ac0>
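To put the loss values in the log above in context, two reference numbers are useful: the cross-entropy of a uniform guess over the vocabulary, and the perplexity implied by the final loss. The sketch below computes both, assuming the loss is measured with natural logarithms, as Keras' categorical cross-entropy is.

```python
import math

vocabulary_size = 6713
final_loss = 5.6091  # training loss after epoch 10, from the log above

uniform_loss = math.log(vocabulary_size)  # loss of a uniform guess over the vocabulary
perplexity = math.exp(final_loss)         # effective number of equally likely words

print("uniform-guess loss: {:.2f}".format(uniform_loss))
print("training perplexity: {:.0f}".format(perplexity))
```

These are training-set figures only; the notebook does not hold out a validation set.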

### Load word embeddings to represent the input words

In [25]:
# word2vec download link (Size ~ 1.5GB): https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

path = '/content/drive/MyDrive/GoogleNews-vectors-negative300.bin.gz'

# load word2vec using the following function present in the gensim library
word2vec = KeyedVectors.load_word2vec_format(path, binary=True)

In [26]:
# assign word vectors from word2vec model

EMBEDDING_SIZE  = 300  # each word in word2vec model is represented using a 300 dimensional vector
VOCABULARY_SIZE = len(word_tokeniser.word_index) + 1

# create an empty embedding matrix
embedding_weights = np.zeros((VOCABULARY_SIZE, EMBEDDING_SIZE))

# create a word to index dictionary mapping
word2id = word_tokeniser.word_index

# copy vectors from word2vec model to the words present in corpus
for word, index in word2id.items():
    try:
        embedding_weights[index, :] = word2vec[word]
    except KeyError:
        pass

In [27]:
# check embedding dimension
print("Embeddings shape: {}".format(embedding_weights.shape))

Embeddings shape: (6713, 300)
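Words that are absent from word2vec keep an all-zero row in `embedding_weights`, so it can be worth checking how much of the vocabulary is actually covered. A minimal sketch follows, using a toy vocabulary and a plain dict as hypothetical stand-ins for `word2id` and `word2vec`; the helper name `embedding_coverage` is also hypothetical.

```python
def embedding_coverage(word_index, pretrained):
    """Return the share of vocabulary words found in `pretrained`, and the missing ones."""
    missing = [word for word in word_index if word not in pretrained]
    covered = 1 - len(missing) / len(word_index)
    return covered, missing

# Toy stand-ins (hypothetical): a tiny vocabulary and a dict in place of word2vec
toy_index = {"the": 1, "king": 2, "ráma": 3, "válmíki": 4}
toy_vectors = {"the": [0.1], "king": [0.2]}

coverage, missing = embedding_coverage(toy_index, toy_vectors)
print(coverage, missing)
```

gensim's `KeyedVectors` supports the `in` operator, so the loaded `word2vec` object could be passed in place of the toy dict.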


In [28]:
# create model architecture

model_wv = Sequential()

# embedding layer
model_wv.add(Embedding(VOCABULARY_SIZE, EMBEDDING_SIZE, input_length = MAX_SEQ_LENGTH,
                    weights = [embedding_weights], trainable=True))

# lstm layer 1
model_wv.add(LSTM(128, return_sequences=True))

# lstm layer 2
# when using multiple LSTM layers, set return_sequences to True at the previous layer
# because the current layer expects a sequential input rather than a single input
model_wv.add(LSTM(128))

# output layer
model_wv.add(Dense(VOCABULARY_SIZE, activation='softmax'))

In [29]:
# compile network
model_wv.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# summarize defined model
model_wv.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 5, 300)            2013900   
                                                                 
 lstm_2 (LSTM)               (None, 5, 128)            219648    
                                                                 
 lstm_3 (LSTM)               (None, 128)               131584    
                                                                 
 dense_1 (Dense)             (None, 6713)              865977    
                                                                 
Total params: 3231109 (12.33 MB)
Trainable params: 3231109 (12.33 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [30]:
# fit network
model_wv.fit(X, y, epochs=10, batch_size=256)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7b80e22760b0>

# 3. Generate text

In [31]:
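def generate_words(model, word_tokenizer, MAX_SEQ_LENGTH, seed, n_words):
    '''
    input:
    model             :  trained Keras model that predicts the next word
    word_tokenizer    :  fitted Keras Tokenizer
    MAX_SEQ_LENGTH    :  number of input words the model expects
    seed              :  string of seed words
    n_words           :  number of words to generate

    output: seed text followed by the generated words
    '''
    output_text = seed
    seed_text = seed.split()

    for _ in range(n_words):

        # Convert words to a sequence of indices (out-of-vocabulary words are dropped)
        encoded_words = word_tokenizer.texts_to_sequences([seed_text])[0]

        # Pad (or truncate) the sequence on the left to the model's input length
        padded_words = pad_sequences([encoded_words], maxlen=MAX_SEQ_LENGTH, truncating='pre')

        # Predict next-word probabilities (using `predict` instead of `predict_classes`)
        prediction = model.predict(padded_words, verbose=0)

        # Get the index of the highest probability class
        predicted_index = int(np.argmax(prediction, axis=-1)[0])

        # Look up the predicted word; index 0 is reserved and maps to no word
        predicted_word = word_tokenizer.index_word.get(predicted_index)
        if predicted_word is not None:
            output_text += ' ' + predicted_word
            seed_text.append(predicted_word)

        # Slide the window
        seed_text = seed_text[1:]

    return output_text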

### Let's look at some text generations

In [32]:
# Suppose your model expects sequences of length 5
MAX_SEQ_LENGTH = 5

# text generation using first model - model without word embeddings
seed_text = "rama never told anyone about"
num_words = 100
print(generate_words(model, word_tokeniser, MAX_SEQ_LENGTH, seed_text, num_words))

rama never told anyone about the king the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king


In [33]:
# text generation using second model - model with word embeddings
seed_text = "rama never told anyone about"
num_words = 100
print(generate_words(model_wv, word_tokeniser, MAX_SEQ_LENGTH, seed_text, num_words))

rama never told anyone about the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all


In [34]:
# text generation using first model - model without word embeddings
seed_text = "how are you doing"
num_words = 100
print(generate_words(model, word_tokeniser, MAX_SEQ_LENGTH, seed_text, num_words))

how are you doing of the king to all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and all the king and


In [35]:
# text generation using second model - model with word embeddings
seed_text = "how are you doing"
num_words = 100
print(generate_words(model_wv, word_tokeniser, MAX_SEQ_LENGTH, seed_text, num_words))

how are you doing spoke the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of all the king of
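Both models settle into a repeating loop in the generations above. One likely contributor is the greedy argmax step: once the input window repeats, the same word is chosen again. A common alternative is to sample the next word from the predicted distribution, optionally rescaled by a temperature. The sketch below shows the idea on a toy distribution; it is not part of the original notebook, and `sample_with_temperature` and the probabilities are hypothetical.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from `probs` after rescaling by temperature."""
    # Rescale in log space; a small epsilon guards against log(0)
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    max_logit = max(logits)
    # Subtract the maximum before exponentiating for numerical stability
    weights = [math.exp(l - max_logit) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

probs = [0.6, 0.3, 0.1]  # toy next-word distribution (hypothetical)
rng = random.Random(0)
draws = [sample_with_temperature(probs, temperature=1.0, rng=rng) for _ in range(1000)]
print([draws.count(i) for i in range(3)])
```

Temperatures below 1 push the choice towards the most likely word (approaching argmax), while temperatures above 1 flatten the distribution. In `generate_words`, the sampled index would take the place of the `np.argmax` result, with `prediction[0]` as `probs`.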
