# Automatic Text Generation with Deep Learning (Python and Keras)

In this section we look at an **automatic text generation** application that uses an LSTM architecture. This leads us to consider two different ways of looking at deep learning methods in NLP applications, which we discuss here.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Generative vs Discriminative models
The word "generation" in **automatic text generation** points to one of the two classes of models that we use in NLP Deep Learning applications.

> **Discriminative** models
>> Models that **discriminate** between different classes of data.
>> * The networks used in **Sentiment Analysis** can discriminate (classify) between positive and negative sentiments in data.

> **[Generative](https://developers.google.com/machine-learning/gan/generative)** models
>> Models that can **generate** new data instances.
>> * The networks used in **Text Generation** can generate output text based on input text.

In [None]:

# A text corpus such as the complete works of Shakespeare can be downloaded from, for example:
# https://www.gutenberg.org/ebooks/100
# Probably best to go for the Plain Text UTF-8 version

## Library installation

In [2]:
# Recent TensorFlow releases include GPU support, so the separate tensorflow-gpu package is no longer needed
# glob, os and random are part of the Python standard library and cannot be installed with pip

# Install relevant packages if not already installed
!pip install nltk
!pip install numpy
!pip install gensim
!pip install wget
!pip install tensorflow
!pip install keras

In [3]:
import glob # find files whose paths match a pattern
import nltk # bring in the Natural Language Tool Kit
import os # handle Operating System file tasks
from matplotlib import pyplot as plt # plotting
import numpy as np
from os.path import exists
from random import shuffle # shuffle a list in place
from nltk.tokenize import TreebankWordTokenizer # Tokenize the strings
from gensim import models
import wget

# Set your working directory in the code here
working_directory = "/content/drive/MyDrive/MN5002_Section3"
os.chdir(working_directory)
print(os.getcwd())

/content/drive/MyDrive/MN5002_Section3


So, let's look at **The Complete Works of Shakespeare** and use it to make a **character-based language model** to generate text automatically.

In [4]:
text = ''
CompleteWorksOfShakespeareFile = './pg100.txt'

if exists(CompleteWorksOfShakespeareFile):
    with open(CompleteWorksOfShakespeareFile, encoding='utf-8') as f:
        text = f.read()



In [5]:
print(text[:500])
text = text.lower()

﻿The Project Gutenberg eBook of The Complete Works of William Shakespeare, by William Shakespeare

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where


## Building an LSTM language model to generate text
Now let's build a language model to generate text.
Let's look at this in terms of what we have always done so far when building a neural network:
- Provide **data**
- Provide the **label** associated with data

Now recall what we are trying to do with a language model; we are trying to get the system to generate the next word or character given a sequence of words or characters.

So we can look at our data in the same way as we did for CNNs and RNNs, which is actually fairly simple:
- Provide **data**
 - A sequence of **40 consecutive character tokens**
- Provide the **label** associated with the data
 - The **41st character token** (i.e., the next character in the sequence)

We will use **The Complete Works of Shakespeare** to generate a data set to train our language model.

Furthermore, the training set will consist of **semi-redundant sequences**: take 40 characters from the beginning of the text, move forward 3 characters and take a sequence of 40 from there, move forward another 3 characters and take a sequence of 40 from there, and so on. This is a form of **data augmentation**, creating an extended data set with valid labels from the data.
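As a minimal sketch of this windowing, the following uses a short made-up string with a window of 5 and a step of 3 instead of the notebook's 40 and 3. The names `toy_text`, `toy_maxlen`, `toy_step`, `toy_sequences` and `toy_next_chars` are illustrative assumptions and are not used elsewhere in the notebook.

```python
# Toy illustration of semi-redundant sequences (all values here are made up)
toy_text = "to be or not to be"
toy_maxlen = 5   # window length (the notebook uses 40)
toy_step = 3     # how far the window moves each time

toy_sequences = []   # the "data"
toy_next_chars = []  # the "labels"
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    toy_sequences.append(toy_text[i: i + toy_maxlen])
    toy_next_chars.append(toy_text[i + toy_maxlen])

# Each window overlaps the previous one by (toy_maxlen - toy_step) characters
for seq, nxt in zip(toy_sequences, toy_next_chars):
    print(repr(seq), '->', repr(nxt))
```

Because consecutive windows overlap, most characters appear in more than one training sequence, which is where the "semi-redundant" description comes from.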

So what we are doing here is fundamentally no different to
* coming up with an augmented training set of **data** and **labels** and
* using them to **train a neural network** with an LSTM architecture.

All that remains is to:
- Set up the **hyperparameters** for building an LSTM neural network with **Keras**
- Represent the tokens in a mathematical form that can be used by the neural network (we'll **one-hot encode** them)
- Construct the **character-based** LSTM network; again Keras does that cleanly
- train the network
- generate text using the network (the fun bit)

## Set up hyperparameters for building an LSTM neural network


In [6]:
maxlen = 40
step = 3 # This is the stepsize in creating the semi-redundant sequences
sentences = [] # These are the "data" we referred to above
next_chars = [] # These are the "labels" we referred to above
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])

# How many sequences have we now?

print('Number of sequences:', len(sentences))

Number of sequences: 1845146


## Represent the characters by one-hot encoding
We are going to make a dictionary that maps each character to its **one-hot encoding index**, and a **reverse dictionary** that allows us to go back from a one-hot encoding index to its character token.

This is just a matter of collecting and indexing the characters from the corpus text we read in.

In [7]:
chars = sorted(list(set(text)))
char_indices = dict((c,i) for i, c in enumerate(chars))
indices_char = dict((i,c) for i, c in enumerate(chars))

# One-hot encoding provides one entry of 1 and all other entries are 0, so we can use a compact boolean data type
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
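To make the two dictionaries concrete, here is a small sketch that encodes a made-up five-letter string and decodes it again. The `toy_` names are hypothetical and exist only for this illustration.

```python
import numpy as np

toy_text = "hello"
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}   # character -> index
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}   # index -> character

# One row per character in the text, one column per character in the vocabulary
toy_encoded = np.zeros((len(toy_text), len(toy_chars)), dtype=bool)
for t, char in enumerate(toy_text):
    toy_encoded[t, toy_char_indices[char]] = 1

# Decode by finding the position of the single True in each row
toy_decoded = ''.join(toy_indices_char[int(i)] for i in toy_encoded.argmax(axis=1))
print(toy_encoded.astype(int))
print(toy_decoded)
```

The same round trip is what the generation loop relies on later: the network outputs an index, and the reverse dictionary turns it back into a character.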

## Construct the character-based LSTM network
Let's construct the generative network and look at our considerations.

In [15]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
model = Sequential()
model.add(LSTM(128, input_shape = (maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()

### More neurons in hidden layer
When we did **sentiment analysis** we used **50** neurons in the hidden layer, but in our generative model we are trying to model much more complex behaviour, so we use **128**.

### Categorical Crossentropy loss function
In sentiment analysis, when we had only **two** categories ("positive" and "negative"), we could use **binary_crossentropy** as our loss function.
Now that we have many categories, one for each possible token that might come after a sequence, we use **categorical_crossentropy**, which measures the loss across this wider range of tokens.
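As a rough worked example, the categorical cross-entropy for one sample with a one-hot label reduces to the negative log of the probability assigned to the correct class. The sketch below computes this by hand with NumPy for made-up values; it is not the Keras implementation, which adds numerical safeguards.

```python
import numpy as np

# Hypothetical softmax output over a 4-character vocabulary
predicted = np.array([0.1, 0.7, 0.1, 0.1])
# One-hot label: the true next character is at index 1
target = np.array([0, 1, 0, 0])

# Categorical cross-entropy for one sample: -sum(target * log(predicted))
loss = -np.sum(target * np.log(predicted))

# A prediction that puts little probability on the true character is penalised more
worse_predicted = np.array([0.7, 0.1, 0.1, 0.1])
worse_loss = -np.sum(target * np.log(worse_predicted))

print(loss, worse_loss)
```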

### Learning Optimizer - RMSProp
Generally, optimizers introduce "tricks" into the learning process that provide faster learning or better accuracy. [RMSProp](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) is one of these and works by dividing the learning rate for each weight by a "running average of the magnitudes of recent gradients for that weight".
A practical way to create models is to use the experience of others in setting up the model, and then to explore the hyperparameters and tricks that can be used to improve its performance.
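The idea can be sketched for a single weight as follows. This is an illustrative toy that minimises `w ** 2`, not the Keras implementation; the function name `rmsprop_step` and the values of `rho` and `eps` are assumptions chosen for the example.

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, learning_rate=0.01, rho=0.9, eps=1e-7):
    """One RMSProp-style update for a single weight (illustrative only)."""
    # Running average of the squared gradient for this weight
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    # Scale the step by the root of that running average
    w = w - learning_rate * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq

# Minimise f(w) = w**2, whose gradient is 2*w, starting from w = 1.0
w, avg_sq = 1.0, 0.0
for _ in range(300):
    grad = 2 * w
    w, avg_sq = rmsprop_step(w, grad, avg_sq)
print(w)
```

Because the step is divided by the recent gradient magnitude, weights with large gradients take relatively smaller steps and weights with small gradients take relatively larger ones.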

### No Dropout Layer
Also, you will notice that **there is no Dropout layer**. We are trying to learn as much as we can about the structure of the training data, so the network uses **all the available input data**.

## Train the network and assess
This is the "run the code and get a coffee, or go away and come back tomorrow" part of deep learning and is pervasive among those developing such systems, so be patient.

In [16]:
epochs = 1
batch_size = 128
model_structure = model.to_json()
with open("shakes_lstm_model.json", "w") as json_file:
    json_file.write(model_structure)
for i in range(1):
    model.fit(X,y,
        batch_size=batch_size,
        epochs=epochs)
    model.save_weights("shakes_lstm_{}.weights.h5".format(i+1))

14416/14416 ━━━━━━━━━━━━━━━━━━━━ 60s 4ms/step - loss: 1.8564


Looking at the above outcome, only one pass was run, so let's load the model files saved after that pass.
The main purpose here is to give you a bit of code for loading a preferred model. You may decide to go back and change the hyperparameters, the data, the optimizer, etc.; engineers usually try several of these variations in parallel to understand what works best.

In [17]:
from keras.models import model_from_json

with open("shakes_lstm_model.json", "r") as json_file:
    json_string = json_file.read()
model = model_from_json(json_string)

# Once the model structure exists, set its characteristic weights

model.load_weights('shakes_lstm_1.weights.h5')

## Generate text using the network

Now, we start to generate language from our network with some helper functions.

### Make a sampler to generate character sequences

As the last layer of the network is a **softmax** function, the output vector will be a probability distribution over all possible outputs of the network. By looking at the highest value in the output vector, you can see what the network thinks has the highest probability of being the next character.

In terms of Python, this just means that the index of the highest value in the output vector (each value is a number between 0 and 1) corresponds to the index in the one-hot encoding of the predicted token.
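A minimal sketch of picking the most probable character, using a made-up three-character vocabulary and a made-up prediction vector (both are hypothetical stand-ins for `indices_char` and the model output):

```python
import numpy as np

toy_indices_char = {0: 'a', 1: 'b', 2: 'c'}   # hypothetical reverse dictionary
toy_preds = np.array([0.2, 0.7, 0.1])          # hypothetical softmax output

# Index of the largest probability, then look up the matching character
most_likely_index = int(np.argmax(toy_preds))
most_likely_char = toy_indices_char[most_likely_index]
print(most_likely_index, most_likely_char)
```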

But we don't want to go for the most probable character every time as the network would then be very boring and would not exercise its "thoughts".
So we use a variable called "**temperature**" which will be used to determine how strictly or freely the next character is chosen; this provides the generative model with a **diversity** of possible outputs.

Dividing the log probabilities by the temperature sharpens (temperature < 1) or flattens (temperature > 1) the probability distribution that reflects the learning.

So temperatures **less than 1** will try harder to reproduce the original text, while temperatures **greater than 1** give more freedom but flatten out what was learned, so the outputs tend towards gibberish.
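The effect is easiest to see on a small made-up distribution. The sketch below applies the same rescaling to a hypothetical four-way prediction at three temperatures; the helper name `reweight` and the values in `toy_preds` are assumptions for illustration only.

```python
import numpy as np

def reweight(preds, temperature):
    """Return the distribution after dividing the log probabilities by temperature."""
    logits = np.log(np.asarray(preds, dtype='float64')) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

toy_preds = [0.5, 0.3, 0.15, 0.05]   # hypothetical softmax output
for temperature in [0.2, 1.0, 2.0]:
    # Low temperature concentrates probability on the top entry; high temperature spreads it out
    print(temperature, np.round(reweight(toy_preds, temperature), 3))
```

A temperature of exactly 1.0 leaves the distribution unchanged, so sampling at that setting follows the network's own probabilities.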

In [18]:
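# Make sampler to generate character sequences
import random
def sample(preds, temperature=1.0):
    preds = np.array(preds).astype('float64')
    # Rescale the log probabilities by the temperature
    preds = np.log(preds) / temperature
    # Convert back to a normalised probability distribution
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one character index at random according to that distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)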

### Generation of diverse Shakespearean texts
Now let's generate **3 texts** with **3 diversity levels**.

In [19]:
import sys
start_index = random.randint(0, len(text) - maxlen -1)
for diversity in [0.2, 0.5, 1.0]:
    print()
    print('----- diversity:', diversity)
    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    print('----- Generating with seed: "' + sentence + '"')
    for i in range(400):
        x = np.zeros((1, maxlen, len(chars)))
        # Seed the trained network and see what it spits out as the next character
        for t, char in enumerate(sentence):
            x[0, t, char_indices[char]] = 1
        # model makes a prediction
        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds, diversity)
        # Look up which character that index represents (reverse dictionary)
        next_char = indices_char[next_index]
        generated += next_char
        # Add the "seed" and drop the first character to keep the length the same
        # This is now the seed for the next pass
        sentence = sentence[1:] + next_char
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


----- diversity: 0.2
----- Generating with seed: "en he fawns, he bites; and when he bites"
t the peace of the prove
the senter and the death and the poor brother.
                                                                                                                                                                                                                                                                                                                                        

----- diversity: 0.5
----- Generating with seed: "en he fawns, he bites; and when he bites"
s it for no part,
                                                                                                                                                                                                                                                                                                                                                                                              

----- diver

Diversity 0.2 and 0.5 look a bit Shakespearean and read like flowing language, but you can see that Diversity 1.0 has lost a lot of learning and has tended towards gibberish.

### Improving the automatic text generator
You can improve generative models if you want them for more than just fun. Steps that can be taken include the following, suggested in Chapter 9 of ["Natural Language Processing in Action", first edition, Lane et al.](https://www.manning.com/books/natural-language-processing-in-action):
* Expand the quantity and quality of the corpus
* Expand the complexity of the model (number of neurons)
* Implement a more refined case folding algorithm
* Segment sentences differently
* Add filters on grammar, spelling and tone to match your needs
* Generate many more examples than you actually show your users
* Use seed texts chosen from the context of the session to steer the chatbot towards useful topics
* Use multiple different seed texts within each dialog round to explore what the chatbot can talk about and what the user finds helpful