## Text Generation using LSTM

Text generation is the task of generating text with the goal of appearing indistinguishable from human-written text.

[![topic_modeling](text_gen.png)](https://github.com/scionoftech/Text_Generation_LSTM)

LSTM (Long Short-Term Memory) networks are very good at analysing sequences of values and predicting the next values from them. For example, an LSTM could be a very good choice if you want to predict the next point of a given time series (assuming a correlation exists in the sequence).

Sentences and texts are basically sequences of words. So, it is natural to assume an LSTM could be useful for generating the next word of a given sentence.

In summary, the objective of an LSTM neural network in this situation is to guess the next word of a given sentence.

For example: what is the next word of the following sentence: "he is walking down the"?

Our neural net will take the sequence of words as input: "he", "is", "walking", ... Its output will be a matrix providing, for each word in the dictionary, the probability that it is the next word of the given sentence.

Then, how will we build the complete text? Simply by iterating the process: we shift the sentence by one word and append the newly guessed word at its end. Then, we guess a new word for this new sentence.
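The iteration described above can be sketched in a few lines. This is only an illustration: `generate`, `toy_predict` and `toy_table` are hypothetical names, and the lookup table stands in for the trained network, which would return a probability for every word instead.

```python
def generate(seed_words, predict_next, n_words):
    """Grow a text word by word from a fixed-size sliding window."""
    window = list(seed_words)
    generated = list(seed_words)
    for _ in range(n_words):
        next_word = predict_next(window)
        generated.append(next_word)
        # drop the oldest word and append the new one, so the window size stays constant
        window = window[1:] + [next_word]
    return generated


# Hypothetical stand-in for the trained network: a fixed lookup on the last word.
toy_table = {"the": "street", "street": "and", "and": "the"}


def toy_predict(window):
    return toy_table.get(window[-1], "the")


text = generate(["he", "is", "walking", "down", "the"], toy_predict, 4)
print(" ".join(text))
```

The important detail is that the window fed to the predictor is updated at every step; otherwise the same word distribution would be predicted over and over.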

### Process
To do that, we first build a dictionary containing all the words from the novels we want to use. The full process is:

* read the data (the novels we want to use),
* create the dictionary of words,
* create the list of sentences,
* create the neural network,
* train the neural network,
* generate new sentences.

In [2]:
import numpy as np
import os
import re
from sklearn import model_selection, preprocessing
import tensorflow as tf
import collections

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

In [0]:
project_path = "/content/drive/My Drive/DLCP/openwork/text_generation/"

### Read data

In [0]:
# load the text file (lowercasing happens later, in clean_text)
filename = project_path+"wonderland.txt"
with open(filename, 'r', encoding='utf-8') as f:
    raw_text = f.read()

In [0]:
def clean_text(text):

    # replace line breaks with spaces
    text = text.strip().replace("\n", " ").replace("\r", " ")

    # keep only letters and apostrophes
    text = re.sub(r'[^a-zA-Z\']', ' ', text)

    # drop any remaining non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)

    # convert to lowercase to maintain consistency
    text = text.lower()

    # collapse runs of whitespace into single spaces
    text = ' '.join([w for w in text.split() if w not in ("\n","\n\n",'\u2009','\xa0')])

    # drop apostrophes and quotes, then any double spaces left behind
    text = text.replace("'","")
    text = text.replace('"','')
    text = text.replace('  ',' ')

    return text

In [0]:
# clean text
corpus = clean_text(raw_text)

### Create dictionary
The first step is to create the dictionary, that is, the list of all words contained in the texts. We assign an index to each word.

In [0]:
wordlist = corpus.split()

word_counts = collections.Counter(wordlist)

# Mapping from index to word : that's the vocabulary
vocabulary_inv = [x[0] for x in word_counts.most_common()]
vocabulary_inv = list(sorted(vocabulary_inv))

# Mapping from word to index
vocab = {x: i for i, x in enumerate(vocabulary_inv)}
words = [x[0] for x in word_counts.most_common()]

In [8]:
#size of the vocabulary
vocab_size = len(words)
print("vocab size: ", vocab_size)

vocab size:  3056
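As a toy illustration of the two mappings built above, the same steps can be applied to a made-up sentence. The `toy_` names and the sentence are assumptions for this sketch, not part of the notebook.

```python
import collections

toy_wordlist = "the cat saw the dog and the dog saw the cat".split()
toy_counts = collections.Counter(toy_wordlist)

# index -> word: alphabetical list of the unique words
toy_vocabulary_inv = sorted(toy_counts)
# word -> index: the reverse lookup
toy_vocab = {word: i for i, word in enumerate(toy_vocabulary_inv)}

print(toy_vocabulary_inv)
print(toy_vocab["dog"])
```

Looking a word up in `toy_vocab` and then indexing `toy_vocabulary_inv` with the result gives the word back, which is what the generation step relies on when it turns a predicted index into a word.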


### Create sequences

Now, we have to create the input data for our LSTM. We create two lists:

* **sequences**: this list will contain the sequences of words used to train the model,
* **next_words**: this list will contain the next word for each sequence in the **sequences** list.

In this exercise, we assume we will train the network with sequences of 30 words (seq_length = 30).

So, to create the first sequence of words, we take the first 30 words of the **wordlist** list. Word 31 is the next word of this first sequence, and is added to the **next_words** list.

Then we jump by a step of 1 (sequences_step = 1 in our example) in the list of words, to create the second sequence of words and retrieve the second "next word".

We iterate this task until the end of the list of words.
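To make the windowing concrete, here is the same loop on a seven-word toy list with a window of 3 instead of 30. The `toy_` names and the word list are assumptions for this sketch.

```python
toy_words = "alice was beginning to get very tired".split()
toy_seq_length = 3
toy_step = 1

toy_sequences = []
toy_next_words = []
for i in range(0, len(toy_words) - toy_seq_length, toy_step):
    # the window of toy_seq_length words starting at position i
    toy_sequences.append(toy_words[i:i + toy_seq_length])
    # the word that immediately follows that window
    toy_next_words.append(toy_words[i + toy_seq_length])

for seq, nxt in zip(toy_sequences, toy_next_words):
    print(seq, "->", nxt)
```

With a step of 1, the number of sequences is the number of words minus the sequence length: seven words and a window of 3 give four sequences.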

In [9]:
seq_length = 30 # sequence length
sequences_step = 1 #step to create sequences

sequences = []
next_words = []
for i in range(0, len(wordlist) - seq_length, sequences_step):
    sequences.append(wordlist[i: i + seq_length])
    next_words.append(wordlist[i + seq_length])

print('nb sequences:', len(sequences))

nb sequences: 29729


When we iterate over the whole list of words, we create 29729 sequences of words, and retrieve, for each of them, the next word to be predicted.

However, these lists cannot be used as is. A neural net does not understand text, so we have to convert the words to numbers. Mapping each word to its bare index in the vocabulary is not enough, because the index does not intrinsically represent the word: it would suggest an ordering between words that does not exist. It is better to encode each sequence of words as a matrix of booleans (one-hot encoding).

So, we create the matrices X and y:

* **X**: a matrix with the following dimensions:
  * number of sequences,
  * number of words in each sequence,
  * number of words in the vocabulary.
* **y**: a matrix with the following dimensions:
  * number of sequences,
  * number of words in the vocabulary.

For each word, we retrieve its index in the vocabulary and set the corresponding position in the matrix to 1.
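A minimal sketch of this encoding for a single sequence, using plain Python lists so that it runs without NumPy. The five-word vocabulary, the `one_hot` helper and the `toy_` names are assumptions for this illustration.

```python
toy_vocab = {"and": 0, "cat": 1, "dog": 2, "saw": 3, "the": 4}
toy_sequence = ["the", "cat", "saw"]
toy_next_word = "the"


def one_hot(word, vocab):
    """Return a list of 0/1 flags with a single 1 at the word's vocabulary index."""
    row = [0] * len(vocab)
    row[vocab[word]] = 1
    return row


# one row per word of the sequence: shape (seq_length, vocab_size)
x_rows = [one_hot(w, toy_vocab) for w in toy_sequence]
# a single row for the target word: shape (vocab_size,)
y_row = one_hot(toy_next_word, toy_vocab)

for row in x_rows:
    print(row)
print(y_row)
```

Stacking one such `x_rows` block per sequence gives the three-dimensional X, and stacking the `y_row` lists gives the two-dimensional y.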

In [0]:
# np.bool was removed in recent NumPy releases; the built-in bool is equivalent
X = np.zeros((len(sequences), seq_length, vocab_size), dtype=bool)
y = np.zeros((len(sequences), vocab_size), dtype=bool)
for i, sentence in enumerate(sequences):
    for t, word in enumerate(sentence):
        X[i, t, vocab[word]] = 1
    y[i, vocab[next_words[i]]] = 1
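It is worth estimating how much memory these dense one-hot matrices need before allocating them. The arithmetic below uses the sequence count and vocabulary size printed earlier and assumes one byte per element, which is what a NumPy boolean occupies.

```python
n_sequences = 29729    # "nb sequences" printed above
seq_length = 30
vocab_size = 3056      # "vocab size" printed above
bytes_per_element = 1  # a NumPy boolean occupies one byte

x_bytes = n_sequences * seq_length * vocab_size * bytes_per_element
y_bytes = n_sequences * vocab_size * bytes_per_element

print(f"X: {x_bytes / 1024**3:.2f} GiB")
print(f"y: {y_bytes / 1024**2:.1f} MiB")
```

X alone takes roughly 2.5 GiB under these assumptions. On a machine with limited RAM, a shorter seq_length or a larger sequences_step reduces this proportionally.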

Now, here comes the fun part: the creation of the neural network. As you will see, I am using Keras, which provides a very good abstraction for designing an architecture.

In this example, I create the following neural network:

* bidirectional LSTM,
* with a size of 256 and ReLU as activation,
* then a dropout layer of 0.6 (it's pretty high, but necessary to avoid quick divergence).

The net should provide me a probability for each word of the vocabulary to be the next one after a given sentence. So I end it with:

* a simple dense layer of the size of the vocabulary,
* a softmax activation.

I use Adam as the optimizer, and the loss is the categorical cross-entropy.

Here is the code to build the network:

In [0]:
rnn_size = 256 # size of RNN
batch_size = 128 # minibatch size
seq_length = 30 # sequence length
num_epochs = 50 # number of epochs
learning_rate = 0.01 #learning rate
sequences_step = 1 #step to create sequences

In [12]:
print('Build LSTM model.')
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(rnn_size, activation="relu"),input_shape=(seq_length, vocab_size)))
model.add(tf.keras.layers.Dropout(0.6))
model.add(tf.keras.layers.Dense(vocab_size))
model.add(tf.keras.layers.Activation('softmax'))
# `lr` is a deprecated alias; `learning_rate` is the current argument name
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()

Build LSTM model.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional (Bidirectional (None, 512)               6785024   
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense (Dense)                (None, 3056)              1567728   
_________________________________________________________________
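The parameter counts in the summary above can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias) and a dense layer with a bias term; the helper names `lstm_params` and `dense_params` are made up for this illustration.

```python
def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and one bias per unit
    return 4 * ((input_dim + units + 1) * units)


def dense_params(input_dim, units):
    # one weight per input/output pair, plus one bias per output unit
    return input_dim * units + units


vocab_size = 3056
rnn_size = 256

# the bidirectional wrapper holds two independent LSTMs (forward and backward)
bidirectional = 2 * lstm_params(vocab_size, rnn_size)
# the dense layer receives the concatenated 2 * rnn_size outputs
dense = dense_params(2 * rnn_size, vocab_size)

print(bidirectional, dense)
```

Both values match the `Param #` column: almost all of the weights come from feeding 3056-dimensional one-hot vectors directly into the LSTM.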

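### Train data
Enough talk, let's train the model now. Keras shuffles the training set at each epoch by default. Note that the call below trains on the full data set and monitors the training loss; no validation sample is held out. We simply run: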

In [13]:
# Callback that saves the model after each epoch
# in which the training loss decreases
filepath = project_path+"text_generation_word_vec_best.hdf5"
checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath, monitor='loss',
                                                verbose=1, save_best_only=True,
                                                mode='min')
# Callback that stops training when the loss has not improved for 4 epochs
earlystop = tf.keras.callbacks.EarlyStopping(patience=4, monitor='loss')
# Callback that reduces the learning rate each time
# the loss plateaus
reduce_alpha = tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', factor=0.2,
                                                    patience=1, min_lr=0.001)
callbacks = [checkpoint, earlystop, reduce_alpha]

history = model.fit(X, y, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Train on 29729 samples
Epoch 1/50
Epoch 00001: loss improved from inf to 28039.52111, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_word_vec_best.hdf5
Epoch 2/50
Epoch 00002: loss improved from 28039.52111 to 6.82531, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_word_vec_best.hdf5
Epoch 3/50
Epoch 00003: loss improved from 6.82531 to 5.99549, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_word_vec_best.hdf5
Epoch 4/50
Epoch 00004: loss did not improve from 5.99549
Epoch 5/50
Epoch 00005: loss improved from 5.99549 to 5.40828, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_word_vec_best.hdf5
Epoch 6/50
Epoch 00006: loss improved from 5.40828 to 5.32013, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_

In [0]:
filepath=project_path+"text_generation_word_vec.hdf5"
model.save(filepath)

In [0]:
# if os.path.isfile(filepath):
#      model = tf.keras.models.load_model(filepath)

### Generate phrase

Great! We have now trained a model to predict the next word of a given sequence of words. To generate text, the task is pretty simple:

* we define a "seed" sequence of 30 words (30 is the number of words the neural net requires in each sequence),
* we ask the neural net to predict word number 31,
* we update the sequence by shifting it by one word and adding word number 31 at its end,
* we ask the neural net to predict word number 32, and so on, for as long as we want.

Doing this, we generate phrases word by word.

To improve the word generation and tune the prediction a bit, we introduce a specific function to pick words.

We will not always take the word with the highest predicted probability (or the generated text would be boring). Instead, we insert some uncertainty and let the solution sometimes pick words with a lower predicted probability.

That is the purpose of the function sample, which randomly draws a word from the vocabulary.

The probability for a word to be drawn depends directly on its probability of being the next word. To tune this probability, we introduce a "temperature" to smooth or sharpen its value: a temperature below 1 sharpens the distribution toward the most likely words, while a temperature above 1 flattens it.

In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
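To see what the temperature does before any word is drawn, the rescaling step can be isolated and applied to a small distribution. This is a sketch using only the standard library; `apply_temperature` and the three probabilities are made up for the illustration, and the formula mirrors the log, divide, exponentiate and normalise steps of `sample`.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale probabilities the way sample() does, without drawing a word."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


toy_probs = [0.5, 0.3, 0.2]  # hypothetical next-word probabilities
for t in (0.34, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(toy_probs, t)])
```

A temperature of 1.0 leaves the distribution unchanged, a low value such as 0.34 concentrates the mass on the most likely word, and a value above 1 moves the distribution toward uniform.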

In [15]:
# initialize the seed sentence
seed_sentences = "It was all very well to say 'Drink me,' but the wise little Alice was not going to do THAT in a hurry." 
generated = ''
# pad with "a" at the front so the seed is exactly seq_length words long
sentence = ["a"] * seq_length

seed = seed_sentences.split()

# copy the seed words into the end of the padded sentence
for i in range(len(seed)):
    sentence[seq_length-i-1] = seed[len(seed)-i-1]

generated += ' '.join(sentence)
sequence = clean_text(generated)
print('Generating text with the following seed: "' + ' '.join(sentence) + '"')

Generating text with the following seed: "a a a a a a a It was all very well to say 'Drink me,' but the wise little Alice was not going to do THAT in a hurry."


In [16]:
words_number = 100
# work on the cleaned seed words so that every word is in the vocabulary
sentence = sequence.split()
# generate the text
for i in range(words_number):
    # one-hot encode the current window of seq_length words
    x = np.zeros((1, seq_length, vocab_size))
    for t, word in enumerate(sentence):
        x[0, t, vocab[word]] = 1.

    # predict the next-word distribution and sample from it
    preds = model.predict(x, verbose=0)[0]
    next_index = sample(preds, 0.34)
    next_word = vocabulary_inv[next_index]

    # add the next word to the text
    generated += " " + next_word
    # shift the window by one word and append the new word at its end,
    # so that the next prediction sees the updated sequence
    sentence = sentence[1:] + [next_word]

print(generated)

a a a a a a a It was all very well to say 'Drink me,' but the wise little Alice was not going to do THAT in a hurry. the its its the a a a the a the it its the she the the the the the a its the a the the a the a the the the the its its the it it it the the a its the its the the its she it and the the she a a the the the its the a the the a the a the a the the the the she the its the a the the the the the the a its its its the the the the the the she its it the the the it
