# Text Generation

In this notebook, we generate 1000 characters of text from an initial seed of characters. This helps us evaluate how much the model has learned about word formation, English grammar, and the context of the initial seed.

Code segments [1] to [5] are the same as in the 'train.ipynb' notebook, where their detailed explanation can be found.

In [1]:
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, Activation
from keras.optimizers import RMSprop, Adam
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using TensorFlow backend.


In [2]:
SEQ_LENGTH = 100

In [3]:
def buildmodel(VOCABULARY):
    model = Sequential()
    model.add(LSTM(256, input_shape = (SEQ_LENGTH, 1), return_sequences = True))
    model.add(Dropout(0.2))
    model.add(LSTM(256))
    model.add(Dropout(0.2))
    model.add(Dense(VOCABULARY, activation = 'softmax'))
    model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
    return model

In [4]:
# Read the whole book and lowercase it; the file is closed automatically
with open('wonderland.txt', encoding = 'utf8') as file:
    raw_text = file.read()
raw_text = raw_text.lower()

In [5]:
chars = sorted(list(set(raw_text)))
print(chars)
bad_chars = ['#', '*', '@', '_', '\ufeff']
for bad_char in bad_chars:
    raw_text = raw_text.replace(bad_char, "")
chars = sorted(list(set(raw_text)))
print(chars)
VOCABULARY = len(chars)

int_to_char = dict((i, c) for i, c in enumerate(chars))
char_to_int = dict((c, i) for i, c in enumerate(chars))

['\n', ' ', '!', '#', '$', '%', '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '‘', '’', '“', '”', '\ufeff']
['\n', ' ', '!', '$', '%', '(', ')', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '‘', '’', '“', '”']
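
As a quick sanity check on the two dictionaries built above, the following minimal sketch encodes a string to integers and decodes it back. The toy string and the `toy_` names are assumptions made for illustration; in the notebook the same construction runs on the cleaned `raw_text`.

```python
# Toy corpus standing in for the cleaned raw_text (assumption for illustration).
toy_text = "hello world"

# Same construction as in the cell above: sorted unique characters.
toy_chars = sorted(set(toy_text))
toy_int_to_char = {i: c for i, c in enumerate(toy_chars)}
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}

# Encode every character as its index, then map the indices back to characters.
encoded = [toy_char_to_int[c] for c in toy_text]
decoded = ''.join(toy_int_to_char[i] for i in encoded)

print(toy_chars)
print(encoded)
print(decoded)
```

Because both dictionaries come from the same sorted list, decoding always inverts encoding, which is what lets the generation step turn predicted indices back into text.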


Now that the model is defined, the input file is preprocessed, and the vocabulary is rebuilt as in train.ipynb, we are ready to proceed. We build the model, load the weights with the lowest loss (obtained in the last epoch of training), and recompile it.

In [6]:
filename = 'saved_models/weights-improvement-49-1.3420.hdf5'
model = buildmodel(VOCABULARY)
model.load_weights(filename)
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
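
The checkpoint name above appears to encode the epoch and the loss. Assuming every saved file follows the pattern `weights-improvement-<epoch>-<loss>.hdf5` (an assumption based on this one filename), a sketch like the following could pick the lowest-loss checkpoint instead of hard-coding it. `best_checkpoint` is a hypothetical helper, and the first two entries in the list are made up for illustration.

```python
import re

# Assumed filename pattern: weights-improvement-<epoch>-<loss>.hdf5
CHECKPOINT_PATTERN = re.compile(r'weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$')

def best_checkpoint(filenames):
    """Return the filename with the lowest loss encoded in its name, or None."""
    best_name, best_loss = None, float('inf')
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        loss = float(match.group(2))
        if loss < best_loss:
            best_name, best_loss = name, loss
    return best_name

# Hypothetical listing; only the last name comes from this notebook.
candidates = [
    'saved_models/weights-improvement-01-2.9001.hdf5',
    'saved_models/weights-improvement-25-1.6120.hdf5',
    'saved_models/weights-improvement-49-1.3420.hdf5',
]
print(best_checkpoint(candidates))
```

With a real directory, the list of names could come from `os.listdir('saved_models')`, joined with the directory name before loading.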

The initial 100-character seed consists of the first few characters of the famous children's book 'The Cat in the Hat' by Dr. Seuss, available [here](http://www.stylist.co.uk/books/100-best-opening-lines-from-childrens-books#gallery-1).

In [7]:
# Exactly SEQ_LENGTH (100) characters; the rest of the book's opening is left out.
initial_text = ' the sun did not shine, it was too wet to play, so we sat in the house all that cold, cold wet day. '
initial_text = [char_to_int[c] for c in initial_text]
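
The encoding step above only works if the seed has exactly `SEQ_LENGTH` characters and every character is in the vocabulary; otherwise the later reshape or the dictionary lookup fails. The sketch below makes both checks explicit. `encode_seed` is a hypothetical helper, and the toy vocabulary is an assumption standing in for the `char_to_int` built from the book.

```python
SEQ_LENGTH = 100  # same window length as in the notebook

def encode_seed(seed, char_to_int, seq_length=SEQ_LENGTH):
    """Check a seed string and return it as a list of integer codes."""
    if len(seed) != seq_length:
        raise ValueError('seed has %d characters, expected %d' % (len(seed), seq_length))
    unknown = sorted(set(seed) - set(char_to_int))
    if unknown:
        raise ValueError('characters not in vocabulary: %r' % unknown)
    return [char_to_int[c] for c in seed]

# Toy vocabulary (assumption): lowercase letters plus the punctuation used in the seed.
toy_vocab = sorted(set('abcdefghijklmnopqrstuvwxyz ,.'))
toy_char_to_int = {c: i for i, c in enumerate(toy_vocab)}

seed = ' the sun did not shine, it was too wet to play, so we sat in the house all that cold, cold wet day. '
codes = encode_seed(seed, toy_char_to_int)
print(len(codes))
```

A seed of the wrong length or with an out-of-vocabulary character (for example an uppercase letter, since the corpus is lowercased) raises a `ValueError` with a readable message.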

Starting from the initial seed, the next 1000 characters are generated one at a time: after each prediction, the 100-character input window is shifted by one position so that it includes the newly generated character.

In [8]:
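GENERATED_LENGTH = 1000
test_text = list(initial_text)  # copy, so the seed itself is not modified
generated_text = []

for i in range(GENERATED_LENGTH):
    # Shape the window as (samples, time steps, features) and scale the codes to [0, 1)
    X = np.reshape(test_text, (1, SEQ_LENGTH, 1))
    next_character = model.predict(X/float(VOCABULARY))
    # Greedy decoding: pick the most probable next character
    index = np.argmax(next_character)
    generated_text.append(int_to_char[index])
    # Slide the window: append the new character and drop the oldest one
    test_text.append(index)
    test_text = test_text[1:]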
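
The sliding-window loop above can be tried without a trained model. The following minimal sketch replaces `model.predict` with a stand-in function, `fake_predict`, which is purely hypothetical and simply predicts "last code plus one". The window length and vocabulary size are shrunk to toy values (assumptions) so the window movement is easy to follow.

```python
import numpy as np

SEQ_LENGTH = 5   # small window for illustration; the notebook uses 100
VOCABULARY = 4   # toy vocabulary size (assumption)

def fake_predict(X):
    """Stand-in for model.predict: returns a one-hot 'distribution' that
    picks (last input code + 1) modulo VOCABULARY."""
    last_code = int(round(X[0, -1, 0] * VOCABULARY))  # undo the scaling
    probs = np.zeros((1, VOCABULARY))
    probs[0, (last_code + 1) % VOCABULARY] = 1.0
    return probs

window = [0, 1, 2, 3, 0]  # seed of exactly SEQ_LENGTH codes
generated = []

for _ in range(6):
    # Same shaping and scaling as in the notebook's loop
    X = np.reshape(window, (1, SEQ_LENGTH, 1)) / float(VOCABULARY)
    index = int(np.argmax(fake_predict(X)))
    generated.append(index)
    window = window[1:] + [index]  # drop the oldest code, append the newest

print(generated)
print(window)
```

The window always holds exactly `SEQ_LENGTH` codes, so the reshape never fails, and each new prediction depends only on the most recent `SEQ_LENGTH` characters.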

In [9]:
print(''.join(generated_text))

and the white rabbit was a little botk of the sable of the garden, the mock turtle said nothing on the soog, and she shought thene was not a moment to be no mistle thing, and she was soomding of the soecs of the gad she was soo mloe ald spok as the mock turtle with a siate oige back on the shate.

‘well, it doesn’t kike the bar,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said the king. 
‘i dan’t remember it,’ said 

## Conclusions

The 1000 characters generated by the model are shown above, and several conclusions can be drawn:

* Most of the words generated by the model are proper English words, although there are exceptions in many places. This shows that the model has a good grasp of how letters combine to form words. Although this is trivial for a human, reasonable word formation is a notable achievement for a character-level model.
* There are a few drawbacks as well. One is that the model often suggests 'and' both after a comma and after a full stop; this may be correct in the first case but is always wrong in the second. Some incorrect words are also generated.
* The model handles quotation marks and apostrophes well. All quotation marks are closed appropriately and followed by proper endings, such as 'said the king.'.
* The model handles spaces and line breaks well. After each 'said the king.' line, the following text always begins on a new line, giving the generated text a clean look.
* The model has almost no understanding of the context of the initial seed. The initial text describes a cold, wet day on which playing is difficult, but the generated text talks about rabbits and turtles (which still seems reasonable) and then starts a conversation involving a king, which is quite absurd. However, results are expected to improve once the model is trained on a variety of texts rather than a single book.
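
The first observation above could be made more concrete by measuring what fraction of generated words also occur in the training text. The sketch below is one rough way to do that, not a measurement performed in this notebook. `known_word_fraction` is a hypothetical helper and the two toy strings are assumptions; in the notebook they would be `raw_text` and `''.join(generated_text)`.

```python
import re

def known_word_fraction(generated, corpus):
    """Fraction of words in `generated` that also occur in `corpus`."""
    def tokenize(text):
        # Runs of lowercase letters only; apostrophes split words apart
        return re.findall(r'[a-z]+', text.lower())
    corpus_words = set(tokenize(corpus))
    words = tokenize(generated)
    if not words:
        return 0.0
    return sum(w in corpus_words for w in words) / len(words)

# Toy strings (assumption for illustration).
toy_corpus = 'the white rabbit was a little bit late said the king'
toy_generated = 'the white rabbit was a little botk said the king'
print(known_word_fraction(toy_generated, toy_corpus))
```

The simple tokenizer splits contractions such as "doesn’t" into two pieces, so the resulting number is only a rough indicator of word-level quality.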