# Text Generation

Text generation is the task of producing text that is indistinguishable from human-written text.

[![topic_modeling](text_gen.png)](https://github.com/scionoftech/Text_Generation_LSTM)

LSTM (Long Short-Term Memory) networks are well suited to analysing sequences of values and predicting the next value. For example, an LSTM is a good choice for predicting the next point of a time series, assuming the values in the sequence are correlated.

Sentences are essentially sequences of words, so it is natural to assume that an LSTM can be useful for generating the next word of a given sentence.

In summary, the objective of the LSTM network here is to guess the next word of a given sentence.

For example: what is the next word of the following sentence: "he is walking down the"?

Our neural network takes the sequence of words as input: "he", "is", "walking", ... Its output is a vector (one row per input sentence) giving, for each word in the dictionary, the probability that it is the next word of the sentence.
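As a toy illustration of this output, the sketch below uses a made-up five-word dictionary and made-up scores (neither comes from the trained model). It turns the scores into probabilities with a softmax and picks the most probable word.

```python
import numpy as np

# Hypothetical five-word dictionary and raw scores, for illustration only.
vocab = ["road", "street", "river", "is", "he"]
scores = np.array([2.0, 1.5, 0.5, -1.0, -2.0])

# Softmax: exponentiate and normalise so the values sum to 1.
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()

# The guessed next word is the one with the highest probability.
next_word = vocab[int(np.argmax(probs))]
print(dict(zip(vocab, probs.round(3))))
print("next word:", next_word)
```

With these scores the guessed word is "road", because it has the largest score and therefore the largest probability.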

How do we then build a complete text? We simply iterate: shift the sentence by one word, dropping its first word and appending the newly guessed word at the end, then guess the next word for this new sentence.
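The iteration can be sketched without any neural network. In the snippet below, `toy_predict` is a hypothetical stand-in for the trained model (a fixed lookup on the last word), and `generate` is an illustrative name, not a function from this notebook.

```python
def toy_predict(words):
    """Stand-in for the trained model: looks up the last word in a fixed table."""
    table = {"the": "street", "street": "and", "and": "he", "he": "is",
             "is": "walking", "walking": "down", "down": "the"}
    return table[words[-1]]

def generate(seed, n_words, predict=toy_predict):
    window = list(seed)                 # current input sentence
    generated = []
    for _ in range(n_words):
        word = predict(window)          # guess the next word
        generated.append(word)
        window = window[1:] + [word]    # shift the sentence by one word
    return generated

seed = ["he", "is", "walking", "down", "the"]
print(" ".join(generate(seed, 4)))      # street and he is
```

The window always keeps the same length, which matters because the real model expects a fixed number of input words.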

### Process

To do this, we first build a dictionary containing all the words from the novels we want to use. The steps are:

* read the data (the novels we want to use),
* create the list of sentences,
* sequence padding,
* create the neural network,
* train the neural network,
* generate new sentences.

In [0]:
import numpy as np
import os
import re
from sklearn import model_selection, preprocessing
import tensorflow as tf

In [0]:
from google.colab import drive
drive.mount('/content/drive/')

In [0]:
project_path = "/content/drive/My Drive/text_generation/"

In [0]:
# load the text of the novel
filename = project_path + "siddartha_by_hermann_hesse.txt"
with open(filename, 'r', encoding='utf-8') as f:
    raw_text = f.read()

In [0]:
# text preprocessing
def clean_text(text):

    # replace line breaks with spaces
    text = text.strip().replace("\n", " ").replace("\r", " ")
    
    # keep only ASCII letters and apostrophes; everything else becomes a space
    text = re.sub(r'[^a-zA-Z\']', ' ', text)
    
    # drop any remaining non-ASCII characters (a no-op after the filter above)
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    
    # convert to lowercase to maintain consistency
    text = text.lower()

    return text

corpus = clean_text(raw_text)
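To see what this cleaning step does, here is a small self-contained check on a made-up sentence. The function body repeats `clean_text` from above so that the snippet runs on its own.

```python
import re

def clean_text(text):
    # same steps as the function above
    text = text.strip().replace("\n", " ").replace("\r", " ")
    text = re.sub(r'[^a-zA-Z\']', ' ', text)
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    return text.lower()

sample = "Hello,\nWorld! It's 2020."   # made-up input
tokens = clean_text(sample).split()
print(tokens)   # ['hello', 'world', "it's"]
```

Punctuation and digits disappear, while apostrophes are kept. Accented letters are also replaced by spaces, so a word such as "café" becomes "caf".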

In [0]:
# Generate input sequences (seq_length words each) and their target (next) words
def generate_sequences(text, seq_length):
    tokens = text.split()
    seq_in = []
    seq_out = []
    for i in range(0, len(tokens) - seq_length, 1):
        seq_in.append(' '.join(tokens[i:i+seq_length]))
        seq_out.append(tokens[i+seq_length])
        
    return seq_in,seq_out
    
seq_in,seq_out = generate_sequences(corpus,100)
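A short window makes the pairing of inputs and targets easier to see. The sketch below repeats the function so that it runs on its own and applies it to a made-up six-word sentence with a window of 3 instead of 100.

```python
def generate_sequences(text, seq_length):
    tokens = text.split()
    seq_in, seq_out = [], []
    for i in range(len(tokens) - seq_length):
        seq_in.append(' '.join(tokens[i:i + seq_length]))
        seq_out.append(tokens[i + seq_length])
    return seq_in, seq_out

toy_in, toy_out = generate_sequences("he is walking down the street", 3)
for x, y in zip(toy_in, toy_out):
    print(repr(x), "->", y)
# 'he is walking' -> down
# 'is walking down' -> the
# 'walking down the' -> street
```

A text of N words yields N - seq_length pairs. If the training log further down corresponds to this code, its 42278 samples imply a cleaned corpus of 42378 words.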

In [0]:
# num_max = 1000
max_len = 100

# Tokenize
tok_seq_in = tf.keras.preprocessing.text.Tokenizer()
tok_seq_in.fit_on_texts(seq_in)
vocab_size_seq_in = len(tok_seq_in.word_index) + 1

# sequence padding
def get_seq_in_pad_sequences(seq_in):

    # convert each text to a sequence of word indices
    texts_seq = tok_seq_in.texts_to_sequences(seq_in)

    texts_mat = tf.keras.preprocessing.sequence.pad_sequences(texts_seq,maxlen=max_len,padding='post')

    return np.asarray(texts_mat)

X_in = get_seq_in_pad_sequences(seq_in)
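The effect of `padding='post'` can be illustrated with a rough NumPy re-implementation. `pad_post` is a hypothetical helper written for this sketch, not the Keras function. It assumes the Keras defaults for the remaining arguments: padding value 0 and truncation from the front.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Rough re-implementation of post-padding, for illustration only."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        seq = seq[-maxlen:]            # too long: keep the last maxlen items
        out[row, :len(seq)] = seq      # too short: zeros remain on the right
    return out

padded = pad_post([[5, 2, 9], [7], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
# [[5 2 9 0]
#  [7 0 0 0]
#  [3 4 5 6]]
```

The Tokenizer never assigns index 0 to a word, which is why 0 can serve as the padding value and why the vocabulary size is `len(word_index) + 1`. In this notebook every training window should already contain exactly 100 words, so padding mainly guards against shorter inputs at generation time.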

In [0]:
# one-hot encoding of output sequence
le = preprocessing.LabelEncoder()
le.fit(seq_out)

def encode(le, labels):
    enc = le.transform(labels)
    return tf.keras.utils.to_categorical(enc)

def decode(le, one_hot):
    # dec = np.argmax(one_hot, axis=1)
    dec = np.argmax(one_hot[0])
    return le.inverse_transform((dec,))[0]
y_enc = encode(le, seq_out)
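The encode and decode steps can be mimicked with plain NumPy on a few made-up target words. This is a sketch of the idea, not the scikit-learn or Keras code: `np.unique` plays the role of `LabelEncoder` (both assign ids in sorted order) and `np.eye` plays the role of `to_categorical`.

```python
import numpy as np

labels = ["river", "samana", "river", "man"]   # made-up target words

# Label encoding: sorted unique words get the integer ids 0, 1, 2, ...
classes, ids = np.unique(labels, return_inverse=True)

# One-hot encoding: row i has a single 1 in column ids[i].
one_hot = np.eye(len(classes))[ids]

# Decoding: argmax recovers the id, which indexes back into the classes.
decoded = classes[np.argmax(one_hot, axis=1)]
print(classes, ids, decoded)
```

At prediction time the network returns probabilities rather than exact one-hot rows, but the same argmax step maps them back to a word.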

### Build LSTM Model

In [9]:
# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
# Building the LSTM network for the task 
model = tf.keras.models.Sequential() 
model.add(tf.keras.layers.Embedding(vocab_size_seq_in,100, input_length=max_len))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.LSTM(128)) 
model.add(tf.keras.layers.Dense(y_enc.shape[1])) 
model.add(tf.keras.layers.Activation('softmax')) 
optimizer = tf.keras.optimizers.RMSprop(learning_rate = 0.01) 
model.compile(loss ='categorical_crossentropy', optimizer = optimizer)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 100)          402400    
_________________________________________________________________
dropout (Dropout)            (None, 100, 100)          0         
_________________________________________________________________
lstm (LSTM)                  (None, 128)               117248    
_________________________________________________________________
dense (Dense)                (None, 4006)              516774    
_________________________________________________________________
activation (Activation)      (None, 4006)              0         
Total params: 1,036,422
Trainable params: 1,036,422
Non-trainable params: 0
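The parameter counts in this summary can be checked by hand. The sketch below uses the standard formulas for an embedding, an LSTM and a dense layer. The vocabulary size of 4024 is inferred from the embedding row (402400 / 100), it is not stated elsewhere in the notebook.

```python
vocab_size, embed_dim = 4024, 100   # 402400 / 100, inferred from the summary
lstm_units, n_classes = 128, 4006

# one embed_dim-sized vector per vocabulary index
embedding = vocab_size * embed_dim
# four gates, each with input weights, recurrent weights and a bias
lstm = 4 * (embed_dim + lstm_units + 1) * lstm_units
# weight matrix plus one bias per output class
dense = lstm_units * n_classes + n_classes

print(embedding, lstm, dense, embedding + lstm + dense)
# 402400 117248 516774 1036422
```

About half of the parameters sit in the final dense layer, because it has one output per distinct target word.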

In [12]:
# Callback that saves the model after each epoch 
# in which the loss decreases 
filepath = project_path+"text_generation_word_pad.hdf5"
checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath, monitor ='loss', 
							verbose = 1, save_best_only = True, 
							mode ='min') 

# Callback that reduces the learning rate each time 
# the loss plateaus 
reduce_alpha = tf.keras.callbacks.ReduceLROnPlateau(monitor ='loss', factor = 0.2, 
							patience = 1, min_lr = 0.001) 
# callbacks = [print_callback, checkpoint, reduce_alpha] 
callbacks = [checkpoint, reduce_alpha] 

# fit the model
model.fit(X_in, y_enc, epochs=30, batch_size=128,callbacks=callbacks)

Train on 42278 samples
Epoch 1/30
Epoch 00001: loss improved from inf to 4.71318, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_word_pad.hdf5
Epoch 2/30
Epoch 00002: loss improved from 4.71318 to 4.62542, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_word_pad.hdf5
Epoch 3/30
Epoch 00003: loss improved from 4.62542 to 4.56104, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_word_pad.hdf5
Epoch 4/30
Epoch 00004: loss improved from 4.56104 to 4.50768, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_word_pad.hdf5
Epoch 5/30
Epoch 00005: loss improved from 4.50768 to 4.47125, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_word_pad.hdf5
Epoch 6/30
Epoch 00006: loss improved from 4.47125 to 4.43938, saving model to /content/drive/My Drive/DLCP/openwork/text_generation/text_generation_word_pad.hdf5
Epo

<tensorflow.python.keras.callbacks.History at 0x7f76ee0bcbe0>
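With the settings used above (initial learning rate 0.01, `factor = 0.2`, `min_lr = 0.001`), the learning rate can only be reduced a couple of times before it reaches its floor. The sketch below applies the rule "multiply by the factor, but never go below `min_lr`" to three hypothetical plateaus. `reduced_lr` is an illustrative helper, not part of Keras.

```python
def reduced_lr(lr, factor=0.2, min_lr=0.001):
    # multiply by the factor, but never go below min_lr
    return max(lr * factor, min_lr)

lr = 0.01            # initial RMSprop learning rate used above
schedule = [lr]
for _ in range(3):   # three hypothetical plateaus
    lr = reduced_lr(lr)
    schedule.append(lr)
print(schedule)
```

The values are roughly 0.01, 0.002, 0.001 and 0.001: the second reduction would give 0.0004, so it is clamped to `min_lr`, and later plateaus no longer change the rate.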

In [0]:
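# save the final model (this uses the same path as the checkpoint above,
# so it overwrites the best model saved during training)
filepath = project_path + "text_generation_word_pad.hdf5"
model.save(filepath)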

In [0]:
# if os.path.isfile(filepath):
#      model = tf.keras.models.load_model(filepath)

In [0]:
# pick a random seed
start = np.random.randint(0, len(seq_in)-1)
pattern = seq_in[start]
print("Seed:")
print("\"", pattern, "\"")
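# generate words one at a time
for i in range(1000):
    # encode the current 100-word window and predict the next word
    x = get_seq_in_pad_sequences([pattern])
    prediction = model.predict(x, verbose=0)
    result = decode(le, prediction)
    print(result, end=' ')
    # slide the window: drop the first word, append the predicted one
    pts = pattern.split()
    pts.append(result)
    pattern = ' '.join(pts[1:])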
print("\nDone.")

Seed:
" respect from you you will learn it spoke vasudeva but not from me the river has taught me to listen "
to me to be a samana and i have been to be a single man i have been to be a single man i have learned to be a single one of the exalted one i have not a samana i have learned to be a single man i have been a samana i have learned to be a samana i have learned to be a samana i have learned to be able to do i do not know it is a samana i have learned to you and i have to find a rich man siddhartha is a samana i have learned to you you have learned to be a samana i have not learned and a stone i have learned to be a single man i have not be a samana i have not a single man i have been a samana i have learned to you and i have learned to be a single man i have learned to be able to do i have have learned to be a single man i have been to be a single man quoth siddhartha i have learned to you and i have learned to you travelling to me you are a samana and i have learned to be a sama