# Text Generation

Text generation is the task of producing text that appears indistinguishable from human-written text.

[![text_generation](text_gen.png)](https://github.com/scionoftech/Text_Generation_LSTM)

LSTM (Long Short-Term Memory) networks are well suited to analysing sequences of values and predicting the values that follow. For example, an LSTM could be a very good choice if you want to predict the next point of a given time series (assuming the sequence contains some correlation).

Turning to text: sentences are basically sequences of words. So it is natural to assume that an LSTM could be useful for generating the next word of a given sentence.

In summary, the objective of an LSTM neural network in this situation is to guess the next word of a given sentence.

For example: what is the next word of the following sentence: "he is walking down the"

Our neural network takes the sequence of words as input: "he", "is", "walking", ... Its output is a vector of probabilities, one for each word in the dictionary, of being the next word of the given sentence.
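To make that output concrete, here is a minimal sketch of how a probability vector over a toy vocabulary can be turned into a predicted word. The vocabulary and the probabilities are made up for illustration; they are not produced by the model in this notebook.

```python
# Toy vocabulary and a made-up probability vector (illustrative values only).
vocabulary = ["street", "road", "banana", "stairs"]
probabilities = [0.55, 0.25, 0.01, 0.19]

# A valid probability distribution sums to 1.
assert abs(sum(probabilities) - 1.0) < 1e-9

# Greedy choice: take the index with the highest probability.
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
next_word = vocabulary[best_index]
print(next_word)
```

Always taking the most likely word is only one option; sampling from the distribution instead gives more varied text.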

How do we then build the complete text? We simply iterate the process: shift the sentence by one word, appending the newly guessed word at its end, and then guess the next word for this new sentence.
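The iteration can be sketched as a sliding window. In the sketch below, `generate`, `toy_predict` and `toy_table` are hypothetical names; `toy_predict` is a lookup table standing in for a trained network, so only the loop structure is meaningful.

```python
def generate(seed_words, predict_next, n_words):
    """Extend seed_words by n_words using a fixed-size sliding window."""
    window = list(seed_words)
    generated = list(seed_words)
    for _ in range(n_words):
        next_word = predict_next(window)
        generated.append(next_word)
        # Drop the oldest word and append the new one,
        # so the window length stays constant.
        window = window[1:] + [next_word]
    return generated

# Hypothetical stand-in for a trained model: looks up the last word in a table.
toy_table = {"the": "street", "street": "and", "and": "the"}

def toy_predict(window):
    return toy_table.get(window[-1], "the")

result = generate(["he", "is", "walking", "down", "the"], toy_predict, 3)
print(" ".join(result))
```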

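#### Process

The implementation below works at the character level rather than the word level: we first build a dictionary containing all the characters from the novel we want to use, and the network then predicts one character at a time. The steps are: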

* read the data (the novels we want to use),
* create the dictionary of characters,
* create the list of sentences,
* create the neural network,
* train the neural network,
* generate new sentences.

In [2]:
from __future__ import absolute_import, division, print_function, unicode_literals 
  
import numpy as np 
import tensorflow as tf 
  
from keras.models import Sequential 
from keras.layers import Dense, Activation 
from keras.layers import LSTM 
  
from keras.optimizers import RMSprop 
  
from keras.callbacks import LambdaCallback 
from keras.callbacks import ModelCheckpoint 
from keras.callbacks import ReduceLROnPlateau 
import random 
import sys 

Using TensorFlow backend.


In [4]:
from google.colab import drive
drive.mount('/content/drive/')

In [0]:
project_path = "/content/drive/My Drive/text_generation/"

#### Loading the data into a string

In [0]:
# Load the ASCII text and convert it to lowercase
filename = project_path + "wonderland.txt"
with open(filename, 'r', encoding='utf-8') as f:
    raw_text = f.read().lower()

####  Creating a mapping from each unique character in the text to a unique number

In [5]:
# Storing all the unique characters present in the text 
vocabulary = sorted(list(set(raw_text))) 
  
# Creating dictionaries to map each character to an index 
char_to_indices = dict((c, i) for i, c in enumerate(vocabulary)) 
indices_to_char = dict((i, c) for i, c in enumerate(vocabulary)) 
  
print(vocabulary) 

['\n', ' ', '!', '"', '#', '$', '%', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\ufeff']
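As a quick check of the two mappings, the sketch below rebuilds them on a short sample string (an assumption made to keep the example self-contained, instead of loading the novel) and round-trips a word through them.

```python
# Short sample text used only for illustration.
sample_text = "alice was beginning"
vocabulary = sorted(set(sample_text))

# Same construction as above, written as dict comprehensions.
char_to_indices = {c: i for i, c in enumerate(vocabulary)}
indices_to_char = {i: c for i, c in enumerate(vocabulary)}

# Encode a word to integer indices, then decode it back.
encoded = [char_to_indices[c] for c in "alice"]
decoded = "".join(indices_to_char[i] for i in encoded)
print(encoded, decoded)
```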


####  Pre-processing the data

In [0]:
# Dividing the text into subsequences of length max_length 
# So that at each time step the next max_length characters  
# are fed into the network 
max_length = 100
steps = 5
sentences = [] 
next_chars = [] 
for i in range(0, len(raw_text) - max_length, steps): 
    sentences.append(raw_text[i: i + max_length]) 
    next_chars.append(raw_text[i + max_length]) 
      
# One-hot encoding each character into a boolean vector 
# (the np.bool alias was removed in NumPy 1.24, so the built-in bool is used) 
X = np.zeros((len(sentences), max_length, len(vocabulary)), dtype = bool) 
y = np.zeros((len(sentences), len(vocabulary)), dtype = bool) 
for i, sentence in enumerate(sentences): 
    for t, char in enumerate(sentence): 
        X[i, t, char_to_indices[char]] = 1
    y[i, char_to_indices[next_chars[i]]] = 1
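To see what the windowing and encoding produce, here is the same logic on a toy string with a much smaller window and stride. The toy values and the `one_hot` helper are assumptions for illustration, and plain Python lists replace NumPy arrays only to keep the sketch dependency-free.

```python
text = "hello world"
max_length = 4   # toy window size (the notebook uses 100)
steps = 3        # toy stride (the notebook uses 5)

sentences, next_chars = [], []
for i in range(0, len(text) - max_length, steps):
    sentences.append(text[i: i + max_length])   # input window
    next_chars.append(text[i + max_length])     # character to predict

for s, c in zip(sentences, next_chars):
    print(repr(s), "->", repr(c))

vocabulary = sorted(set(text))
char_to_indices = {c: i for i, c in enumerate(vocabulary)}

def one_hot(char):
    """Return a list with a single 1 at the character's index."""
    vec = [0] * len(vocabulary)
    vec[char_to_indices[char]] = 1
    return vec

print(one_hot("h"))
```

Each window becomes one training example and the character that follows it becomes the target.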

#### Building the LSTM network

In [7]:
# Building the LSTM network for the task 
model = Sequential() 
model.add(LSTM(128, input_shape =(max_length, len(vocabulary)))) 
model.add(Dense(len(vocabulary))) 
model.add(Activation('softmax')) 
# Note: newer Keras releases name this argument learning_rate instead of lr 
optimizer = RMSprop(lr = 0.01) 
model.compile(loss ='categorical_crossentropy', optimizer = optimizer)
model.summary()






Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 128)               96256     
_________________________________________________________________
dense_1 (Dense)              (None, 59)                7611      
_________________________________________________________________
activation_1 (Activation)    (None, 59)                0         
=================================================================
Total params: 103,867
Trainable params: 103,867
Non-trainable params: 0
_________________________________________________________________
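The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout (four weight sets, each with input weights, recurrent weights and one bias vector); the helper names are hypothetical.

```python
def lstm_params(units, input_dim):
    # Four weight sets (input, forget and output gates plus the cell candidate),
    # each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    # One weight per input/output pair plus one bias per output.
    return inputs * outputs + outputs

vocab_size = 59   # number of unique characters printed earlier
units = 128       # LSTM units used in the model

lstm = lstm_params(units, vocab_size)
dense = dense_params(units, vocab_size)
print(lstm, dense, lstm + dense)
```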


#### Defining some helper functions which will be used during the training of the network

Note that the first two functions below are adapted from the official text generation example by the Keras team.

#### Helper function to sample the next character:

In [0]:
# Helper function to sample an index from a probability array 
def sample_index(preds, temperature = 1.0): 
    preds = np.asarray(preds).astype('float64') 
    preds = np.log(preds) / temperature 
    exp_preds = np.exp(preds) 
    preds = exp_preds / np.sum(exp_preds) 
    probas = np.random.multinomial(1, preds, 1) 
    return np.argmax(probas) 
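The effect of `temperature` is easier to see on a small distribution. The sketch below applies the same rescaling as `sample_index` (log, divide by temperature, renormalise) without the final random draw. The `apply_temperature` name and the example probabilities are assumptions for illustration.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution the way sample_index does."""
    logits = [math.log(p) / temperature for p in probs]
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.6, 0.3, 0.1]   # made-up distribution
for t in (0.2, 1.0, 1.2):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])

sharp = apply_temperature(probs, 0.2)
same = apply_temperature(probs, 1.0)
flat = apply_temperature(probs, 1.2)
```

Temperatures below 1 concentrate the probability mass on the most likely character, a temperature of 1 leaves the distribution unchanged, and temperatures above 1 flatten it, which makes the generated text more varied but also more error-prone.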

#### Helper function to generate text after each epoch

In [0]:
# Helper function to generate text after the end of each epoch 
def on_epoch_end(epoch, logs): 
	print() 
	print('----- Generating text after Epoch: % d' % epoch) 

	start_index = random.randint(0, len(raw_text) - max_length - 1) 
	for diversity in [0.2, 0.5, 1.0, 1.2]: 
		print('----- diversity:', diversity) 

		generated = '' 
		sentence = raw_text[start_index: start_index + max_length] 
		generated += sentence 
		print('----- Generating with seed: "' + sentence + '"') 
		sys.stdout.write(generated) 

		for i in range(400): 
			x_pred = np.zeros((1, max_length, len(vocabulary))) 
			for t, char in enumerate(sentence): 
				x_pred[0, t, char_to_indices[char]] = 1.

			preds = model.predict(x_pred, verbose = 0)[0] 
			next_index = sample_index(preds, diversity) 
			next_char = indices_to_char[next_index] 

			generated += next_char 
			sentence = sentence[1:] + next_char 

			sys.stdout.write(next_char) 
			sys.stdout.flush() 
		print() 
print_callback = LambdaCallback(on_epoch_end = on_epoch_end) 

#### Callback to save the model after each epoch in which the loss decreases

In [0]:
# Defining a callback that saves the model after each epoch 
# in which the loss decreases 
filepath = project_path+"text_generation_char_vec_best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor ='loss', 
							verbose = 1, save_best_only = True, 
							mode ='min') 
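Conceptually, `save_best_only = True` with `mode ='min'` keeps a checkpoint only when the monitored loss reaches a new minimum. The plain-Python sketch below mimics that decision on made-up loss values; it does not call Keras and is not its implementation.

```python
losses = [2.5, 2.0, 2.1, 1.8, 1.9]   # hypothetical per-epoch losses

best = float("inf")
saved_epochs = []
for epoch, loss in enumerate(losses, start=1):
    if loss < best:
        # New best loss: this is where the model would be written to disk.
        best = loss
        saved_epochs.append(epoch)

print(saved_epochs, best)
```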

#### Callback to reduce the learning rate each time the learning plateaus

In [0]:
# Defining a callback that reduces the learning rate each time 
# the learning plateaus 
reduce_alpha = ReduceLROnPlateau(monitor ='loss', factor = 0.2, 
							patience = 1, min_lr = 0.001) 
# callbacks = [print_callback, checkpoint, reduce_alpha] 
callbacks = [checkpoint, reduce_alpha] 
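With `factor = 0.2` and `min_lr = 0.001`, the initial learning rate of 0.01 can be cut to 0.002 once and is then clamped at 0.001. The sketch below is a simplified simulation of that behaviour on made-up losses; it ignores details of the real callback such as cooldown and the minimum-improvement threshold, and `simulate_reduce_lr` is a hypothetical name.

```python
def simulate_reduce_lr(losses, lr, factor, patience, min_lr):
    """Return the learning rate after the check at the end of each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Shrink the learning rate, but never below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

losses = [2.0, 1.8, 1.8, 1.7, 1.7]   # hypothetical per-epoch losses
history = simulate_reduce_lr(losses, lr=0.01, factor=0.2,
                             patience=1, min_lr=0.001)
print(history)
```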

#### Training the LSTM model

In [0]:
# Training the LSTM model 
model.fit(X, y, batch_size = 128, epochs = 500, callbacks = callbacks) 

Epoch 1/500






Epoch 00001: loss improved from inf to 2.51565, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_char_vec_best.hdf5
Epoch 2/500

Epoch 00002: loss improved from 2.51565 to 1.98862, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_char_vec_best.hdf5
Epoch 3/500

Epoch 00003: loss improved from 1.98862 to 1.76886, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_char_vec_best.hdf5
Epoch 4/500

Epoch 00004: loss improved from 1.76886 to 1.62428, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_char_vec_best.hdf5
Epoch 5/500

Epoch 00005: loss improved from 1.62428 to 1.49924, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_char_vec_best.hdf5
Epoch 6/500

Epoch 00006: loss improved from 1.49924

In [13]:
import os

# Reload the best checkpoint if one was saved during training
if os.path.isfile(filepath):
    model = tf.keras.models.load_model(filepath)


####  Generating new and random text

In [14]:
# Defining a utility function to generate new and random text based on the 
# network's learnings 
def generate_text(length, diversity): 
    # Get random starting text 
    start_index = random.randint(0, len(raw_text) - max_length - 1) 
    generated = '' 
    sentence = raw_text[start_index: start_index + max_length] 
    generated += sentence 
    for i in range(length): 
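        # One-hot encode the current window of characters 
        x_pred = np.zeros((1, max_length, len(vocabulary))) 
        for t, char in enumerate(sentence): 
            x_pred[0, t, char_to_indices[char]] = 1.

        # Predict the next-character distribution and sample from it 
        preds = model.predict(x_pred, verbose = 0)[0] 
        next_index = sample_index(preds, diversity) 
        next_char = indices_to_char[next_index] 

        # Append the new character and slide the window by one 
        generated += next_char 
        sentence = sentence[1:] + next_char 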
    return generated 
  
print(generate_text(500, 0.2)) 

purple.

'i won't!' said alice.

'off with her head!' the queen shouted at the top of her voice. nobbody the hooked alice

'butter a  i soon; and backly.

'.oll make do this


'i don't you could not remarks of gothing their became

ur the door hand i soot quite stry uph as long
you says!'

'and was on at anything at the
sime there were not mack to her, she
termed to ferper edister quite for cas donation,' the gryphon with the , i don't be no come an he bot i is, of every so ten it in the begin it, and say the mock turtle alice
's
mome of the startes. alice said very poor seemblated to her, lit
