# Text Generation

Text generation with deep learning has many everyday applications, such as translating text from one language to another, suggesting the next word while typing an email, and checking grammar or restructuring sentences.
In this project, we take the text of `"Poirot Investigates"` by *Agatha Christie* and train a model that, given some input text, generates a continuation of a specified length.


In [3]:
# Importing relevant packages...
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM, Embedding, GRU
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import get_file, to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import matplotlib.pyplot as plt
import numpy as np
import random
import sys
import io
import requests
import re
import string

## Data Preparation

In [7]:
# Extracting text from the link... 
r = requests.get("https://www.gutenberg.org/files/61262/61262-0.txt")
raw_text = r.text

In [8]:
# Cleaning text; removing non-ASCII characters, punctuation, and extra spaces...
def text_cleaner(data):
  data = data.split('\n') # splitting the data on newlines...
  data = data[104:] # dropping the first 104 lines (index page and intro about the book)...
  data = " ".join(data) # joining it back into one string...
  data = data.replace('\r', '') # removing carriage return chars...
  data = re.sub(r'[^\x00-\x7f]', r'', data) # removing non-ASCII chars...
  data = data.translate(str.maketrans('', '', string.punctuation)) # removing punctuation...
  data = re.sub(r'\s+', ' ', data) # collapsing runs of whitespace into a single space...
  return data
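
To make the effect of each cleaning step concrete, here is a minimal sketch that applies the same character-level substitutions to one short string. The helper name `clean_fragment` and the sample sentence are illustrative assumptions, not part of the notebook, and the line-slicing step is left out because it is specific to the Gutenberg file.

```python
import re
import string

def clean_fragment(fragment):
    """Apply the same character-level cleaning as text_cleaner to one string."""
    fragment = fragment.replace('\r', '')                # drop carriage returns
    fragment = re.sub(r'[^\x00-\x7f]', '', fragment)     # drop non-ASCII characters
    fragment = fragment.translate(
        str.maketrans('', '', string.punctuation))       # drop ASCII punctuation
    fragment = re.sub(r'\s+', ' ', fragment)             # collapse whitespace runs
    return fragment

# Hypothetical sample containing curly quotes, a CRLF line break and extra spaces
sample = "“That’s queer!”\r\nI ejaculated   suddenly."
cleaned = clean_fragment(sample)
print(cleaned)
```

Curly quotes and apostrophes are non-ASCII, so they are removed along with the punctuation. That is why words such as `Thats` and `Poirots` appear without apostrophes in the sentences printed further down.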

In [9]:
data = text_cleaner(raw_text)

In [10]:
# creating corpus and dictionary... corpus: collection of all the words; dictionary: collection of all the unique words...
corpus = data.split(" ")
corpus = [x for x in corpus if x != ""]
dictionary = list(set(corpus))

In [11]:
print(f"Corpus size: {len(corpus)}")
print(f"Dictionary size: {len(dictionary)}")

Corpus size: 55337
Dictionary size: 7283


In [12]:
# creating list of sentences of length 31...

max_len = 30+1 # 30 words as features, and last word that needs to be predicted...
step_size = 1 # number of words by which the sliding window is shifted (similar to strides in a conv-net)...
all_sentences = []
for i in range(max_len, len(corpus) + 1, step_size): # + 1 so that the final window is included...
  sentence = corpus[i - max_len: i] # sliding window, dividing the whole text into overlapping strings, each 31 words long...
  sentence = ' '.join(sentence)
  all_sentences.append(sentence)
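
The sliding window is easier to follow on a toy input. The sketch below uses a nine-word fragment and a window of 4 words instead of 31; the function name `sliding_windows` and the toy corpus are assumptions made for illustration only.

```python
def sliding_windows(words, window_len, step=1):
    """Return every run of `window_len` consecutive words, shifting by `step` words."""
    return [
        ' '.join(words[start:start + window_len])
        for start in range(0, len(words) - window_len + 1, step)
    ]

# Toy corpus of 9 words; with a window of 4 this gives 9 - 4 + 1 = 6 windows
toy_corpus = "I was standing at the window of Poirots rooms".split()
windows = sliding_windows(toy_corpus, window_len=4)
for w in windows:
    print(w)
```

Each window overlaps the previous one in all but one word, so a corpus of `n` words yields about `n - window_len + 1` training samples when the step is 1.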

In [13]:
all_sentences[:10]

['POIROT INVESTIGATES I The Adventure of The Western Star I was standing at the window of Poirots rooms looking out idly on the street below Thats queer I ejaculated suddenly beneath',
 'INVESTIGATES I The Adventure of The Western Star I was standing at the window of Poirots rooms looking out idly on the street below Thats queer I ejaculated suddenly beneath my',
 'I The Adventure of The Western Star I was standing at the window of Poirots rooms looking out idly on the street below Thats queer I ejaculated suddenly beneath my breath',
 'The Adventure of The Western Star I was standing at the window of Poirots rooms looking out idly on the street below Thats queer I ejaculated suddenly beneath my breath What',
 'Adventure of The Western Star I was standing at the window of Poirots rooms looking out idly on the street below Thats queer I ejaculated suddenly beneath my breath What is',
 'of The Western Star I was standing at the window of Poirots rooms looking out idly on the street below

In [14]:
# tokenizing the words; converting words to numerical values...
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_sentences)
seq = tokenizer.texts_to_sequences(all_sentences)
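
As a rough mental model of this step, the sketch below emulates in plain Python what the Keras `Tokenizer` is understood to do with its default settings: lowercase the text, rank words by descending frequency, and assign indices starting at 1 (index 0 is kept free for padding). The helpers `build_word_index` and `to_sequences` are hypothetical stand-ins, not Keras functions, and they skip the punctuation filtering that the real tokenizer also applies.

```python
from collections import Counter

def build_word_index(texts):
    """Map each lowercased word to a rank: 1 for the most frequent word, 2 for the next, ..."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace every word by its integer index."""
    return [[word_index[w] for w in text.lower().split()] for text in texts]

toy_sentences = ["The Western Star", "the window of the street"]
word_index = build_word_index(toy_sentences)
sequences = to_sequences(toy_sentences, word_index)
print(word_index)
print(sequences)
```

Because `The` and `the` collapse to one entry after lowercasing, the tokenizer's vocabulary can be smaller than a case-sensitive set of unique words. That is a likely reason the model summaries below show 6628 output classes while the dictionary built earlier has 7283 entries. The reserved index 0 is also why the layers use `len(tokenizer.word_index) + 1`.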

In [15]:
# Converting the list of token sequences to a 2-D numpy ndarray (one row per sentence)...
seq = np.vstack(seq)

In [16]:
# Using the first 30 columns of each row as features and the 31st as the target variable...
X = seq[:, :-1]
y = seq[:, -1]

In [17]:
y = to_categorical(y) # one hot encoding the target variable...
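
The feature/target split and the one-hot encoding can be reproduced on a tiny array with NumPy alone. This is a sketch under assumed toy values (`toy_seq` and `num_classes` are made up); `np.eye(...)[y]` is used here as a stand-in for `to_categorical`.

```python
import numpy as np

# Three toy sequences of length 4: the first 3 tokens are features, the last is the target
toy_seq = np.array([[1, 2, 3, 4],
                    [2, 3, 4, 1],
                    [3, 4, 1, 5]])
num_classes = 6  # hypothetical vocabulary size + 1, since index 0 is reserved

X_toy = toy_seq[:, :-1]   # every column except the last
y_toy = toy_seq[:, -1]    # last column only

# Row i of the identity matrix is the one-hot vector for class i
y_onehot = np.eye(num_classes, dtype="float32")[y_toy]

print(X_toy.shape, y_onehot.shape)
print(y_onehot)
```

Each row of `y_onehot` has a single 1 in the column of the target word, which is the format `categorical_crossentropy` expects.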

## Modeling

### LSTM

In [None]:
# Sequential LSTM model to predict next word...
model = Sequential()

# input_dim is the tokenizer's vocabulary size + 1 (index 0 is reserved), output_dim is 50, and input length is 30 (X.shape[1])...
model.add(Embedding(len(tokenizer.word_index) + 1, 50, input_length = X.shape[1])) 

# 64 LSTM units and return_sequences = True to pass it on to next LSTM layer...
model.add(LSTM(64, return_sequences=True))

model.add(LSTM(64))
model.add(Dense(128, activation='relu'))
model.add(Dense(len(tokenizer.word_index) + 1, activation='softmax'))

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 10, 50)            331400    
                                                                 
 lstm (LSTM)                 (None, 10, 64)            29440     
                                                                 
 lstm_1 (LSTM)               (None, 64)                33024     
                                                                 
 dense (Dense)               (None, 128)               8320      
                                                                 
 dense_1 (Dense)             (None, 6628)              855012    
                                                                 
Total params: 1,257,196
Trainable params: 1,257,196
Non-trainable params: 0
_________________________________________________________________
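
The parameter counts in this summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers, with the sizes read off the summary (6628 output classes, 50 embedding dimensions, 64 recurrent units, 128 hidden units). The helper function names are illustrative.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per vocabulary index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size, embed_dim, units, hidden = 6628, 50, 64, 128

layer_params = [
    embedding_params(vocab_size, embed_dim),  # embedding
    lstm_params(embed_dim, units),            # lstm
    lstm_params(units, units),                # lstm_1
    dense_params(units, hidden),              # dense
    dense_params(hidden, vocab_size),         # dense_1
]
total = sum(layer_params)
print(layer_params, total)
```

The recurrent layers' parameter counts do not depend on the sequence length, only on the input and unit dimensions. Most of the parameters sit in the embedding and the final softmax layer, both of which scale with the vocabulary size.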


In [None]:
# Compiling the model with adam optimizer and training it for 500 epochs...
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
lstm_history = model.fit(X, y, batch_size = 32, epochs=500)

Epoch 1/500
Epoch 2/500
...
Epoch 77/500
Epoch 78

In [1]:
# Given seed text and a number of words to generate, this function returns the generated words as one string...
def text_generator(model, tokenizer, seq_len, feature_text, num_words):
  text = []
  for i in range(num_words):
    # encode the running text and keep only the last seq_len tokens (left-padded with zeros if shorter)...
    token = tokenizer.texts_to_sequences([feature_text])[0]
    token = pad_sequences([token], maxlen = seq_len, truncating='pre')
    # predict_classes was removed from recent Keras versions; take the argmax of the softmax output instead...
    y_pred = model.predict(token, verbose=0)
    y_pred = int(np.argmax(y_pred, axis=1)[0]) # greedy choice: index of the most probable word...

    # map the predicted index back to its word (index 0 is padding and has no word)...
    pred_word = tokenizer.index_word.get(y_pred, '')
    feature_text += " "+ pred_word
    text.append(pred_word)

  return " ".join(text)
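
The generation loop relies on two ideas: keeping only the most recent `seq_len` tokens as model input, and greedily appending the most probable next token. The sketch below shows both without Keras. `pad_pre` emulates `pad_sequences(..., truncating='pre')` with its default left padding, and `fake_predict` is a made-up stand-in for `model.predict` that always favours the next token index in a cycle of five.

```python
def pad_pre(tokens, maxlen):
    """Keep the last `maxlen` tokens and left-pad with zeros if there are fewer."""
    tokens = tokens[-maxlen:]
    return [0] * (maxlen - len(tokens)) + tokens

def fake_predict(window, vocab=5):
    """Hypothetical model: puts all probability on the token after the last one (cyclic)."""
    probs = [0.0] * (vocab + 1)          # index 0 is the padding slot
    probs[window[-1] % vocab + 1] = 1.0
    return probs

def generate(seed, maxlen, num_tokens):
    history = list(seed)
    generated = []
    for _ in range(num_tokens):
        window = pad_pre(history, maxlen)                           # model input
        probs = fake_predict(window)                                # "softmax" output
        next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy argmax
        history.append(next_token)                                  # feed the prediction back in
        generated.append(next_token)
    return generated

print(pad_pre([5, 6], 4))              # shorter than maxlen: zeros on the left
print(pad_pre([1, 2, 3, 4, 5, 6], 4))  # longer than maxlen: oldest tokens dropped
print(generate([1, 2], maxlen=4, num_tokens=6))
```

Because the choice is always the argmax, the same seed text produces the same continuation every time for a given trained model.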

In [None]:
text_generator(model, tokenizer, X.shape[1], all_sentences[1881], 50)

'of the west or the western star it has been going to occur the same theory alone on the two stones mr and lifted up three character of regents hundred he tracks sufficiently we were tired came out facing you the seaports hospitals as modern weeping after a lean few'

In [None]:
model.save('./lstm_model_v2.h5') # saving model...

## Testing Model

In [5]:
lstm_model = load_model('./lstm_model.h5') # loading model...

In [23]:
num_of_words = 50 # number of words to be generated...

# input text...
text = """My Name is Roshan Pandey, Generate me some text from Poirot Investigates by Agatha Christie of length 50 words."""

text_generator(lstm_model, tokenizer, X.shape[1], text, num_of_words)

'found at lord willard nor rapidly him i turned the leaves ah on her apron and appearance we shall be obliged to introduce my life monsieur poirot he could i smiled forward in a low voice a man who made no italian before he had guests the mode upon me'

### GRU

In [None]:
# Sequential GRU model to predict next word...
gru_model = Sequential()
gru_model.add(Embedding(len(tokenizer.word_index) + 1, 50, input_length = X.shape[1]))
gru_model.add(GRU(64, return_sequences=True))
gru_model.add(GRU(64))
gru_model.add(Dense(128, activation='relu'))
gru_model.add(Dense(len(tokenizer.word_index) + 1, activation='softmax'))

In [None]:
gru_model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 30, 50)            331400    
                                                                 
 gru (GRU)                   (None, 30, 64)            22272     
                                                                 
 gru_1 (GRU)                 (None, 64)                24960     
                                                                 
 dense_2 (Dense)             (None, 128)               8320      
                                                                 
 dense_3 (Dense)             (None, 6628)              855012    
                                                                 
Total params: 1,241,964
Trainable params: 1,241,964
Non-trainable params: 0
_________________________________________________________________
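
The GRU counts in this summary can be verified the same way as any dense layer arithmetic. A GRU has three gates instead of an LSTM's four, and the figures here are consistent with two bias vectors per gate, which is assumed to correspond to the `reset_after=True` variant. The helper names below are illustrative.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim

def gru_params(input_dim, units):
    # 3 gates, each with input weights, recurrent weights and two bias vectors
    # (assumption: the reset_after=True formulation)
    return 3 * ((input_dim + units) * units + 2 * units)

def dense_params(input_dim, units):
    return input_dim * units + units

vocab_size, embed_dim, units, hidden = 6628, 50, 64, 128

gru_layer_params = [
    embedding_params(vocab_size, embed_dim),  # embedding_1
    gru_params(embed_dim, units),             # gru
    gru_params(units, units),                 # gru_1
    dense_params(units, hidden),              # dense_2
    dense_params(hidden, vocab_size),         # dense_3
]
gru_total = sum(gru_layer_params)
print(gru_layer_params, gru_total)
```

The GRU model ends up 15,232 parameters smaller than the LSTM model (1,241,964 versus 1,257,196), all of it in the two recurrent layers. The embedding and dense layers are identical in both.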


In [None]:
# Compiling the model with adam optimizer and training it for 500 epochs...
gru_model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
gru_history = gru_model.fit(X, y, batch_size = 32, epochs=500)

Epoch 1/500
Epoch 2/500
...
Epoch 77/500
Epoch 78

In [None]:
text_generator(gru_model, tokenizer, X.shape[1], text, num_of_words)

'of the west or the western star it is the property of the celebrated film actress miss mary marvell a comparison of the two stones would be interesting she stopped patant murmured poirot without doubt a romance of the first water he turned to mary marvell and you are not'

In [None]:
gru_model.save('./gru_model_v2.h5') # saving model...

## Testing Model

In [26]:
gru_model = load_model('./gru_model.h5') # loading model...

In [27]:
num_of_words = 50 # number of words to be generated...

# input text...
text = """My Name is Roshan Pandey, Generate me some text from Poirot Investigates by Agatha Christie of length 50 words."""

text_generator(gru_model, tokenizer, X.shape[1], text, num_of_words)

'that i will make a point of his death and all shot at the same time the absconding clerk or the domestic defaulter is not placed it in the room with such a mile distant poirot retired to run and announcing his approaching demise poirot and wife puzzled alikebah your'