<a href="https://colab.research.google.com/github/rahiakela/deep-learning--from-basics-to-practice/blob/24-keras-part-2/generate_text_letter_by_letter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generate text letter by letter

We’re going to use an RNN to generate brand-new text. We’ll train a little RNN using three collections of Sherlock Holmes short stories by Arthur Conan Doyle.

Taken together, there’s a little over 304,000 words. Many of these words are used repeatedly, of course. There are a bit under 29,000 unique words, including many proper nouns such as the names of characters and places.

A reasonable approach is to think of the text as a collection of words.
We can then train the RNN on how words follow one another. Then we
can start with some words, and let the RNN tell us which word should
come next. Then we’ll take our starting bunch, plus the new word at the
end, and have the RNN tell us which word should follow. Continuing
the process, we can keep feeding back to the RNN the most recent set
of words, and it would keep giving us a new word to follow.

<img src='https://github.com/rahiakela/img-repo/blob/master/generate-text-1.png?raw=1' width='800'/>

Our entire deep network for generating new Sherlock
Holmes data. Our input is a list of 40 sequential characters. The characters
go into two RNN layers. Each contains a single LSTM cell with 128
elements of memory. The output of the second LSTM is given to a dense
layer of 89 neurons, which predicts the probability of each character. The
most probable character is the network’s result. The small box at the top
of the first layer’s icon tells us that it returns an output for every input, and not just the final result.

Our input consists of a string of 40 characters. To create the training
set, we chopped up the original source material into about a half-million
overlapping strings of 40 characters, starting every third character.

<img src='https://github.com/rahiakela/img-repo/blob/master/generate-text-2.png?raw=1' width='800'/>

To create new text, we produce a “seed” by picking a random starting
point in the text, and then extract the next 40 sequential characters
from there. We give the seed to the network and it produces a new
character. That new character goes on to the end of the seed, and the
first character is dropped, giving us a new 40-character seed to use as
input to produce the next character.

<img src='https://github.com/rahiakela/img-repo/blob/master/generate-text-3.png?raw=1' width='800'/>

The notebooks for this section contain all the code for making new text, either letter by letter or word by word.

For variation, this time we packaged up each of the steps into its own procedure. Then when we’re ready to make text, we just call some of those procedures and let them do their work.

## Setup

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from tensorflow.keras import backend as keras_backend
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Activation
from tensorflow.keras.optimizers import RMSprop

import random
import sys

In [2]:
# Make a File_Helper for saving and loading files.

save_files = True

import os, sys, inspect
current_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
sys.path.insert(0, os.path.dirname(current_dir)) # path to parent dir
from DLBasics_Utilities import File_Helper
file_helper = File_Helper(save_files)

Using TensorFlow backend.


## Data preprocessing

Our first step is to read in the source text. We replaced multiple spaces
with single spaces, and removed newline characters since they don’t
have any semantic meaning.

In [0]:
def get_text(input_file):
  # open the input file and do minor processing
  file = open(input_file, 'r')
  text = file.read()
  file.close()

  # replace newlines with blanks, and double blanks with singles
  text = text.replace('\n', ' ')
  text = text.replace('  ', ' ')
  print(f'corpus length: {str(len(text))}')

  return text

Now we have to chop up the input into overlapping windows. We
need to pick the window size and how much they overlap.

In [0]:
def build_fragments(text, window_length):
  # make overlapping fragments of window_length characters
  fragments = []
  targets = []
  for i in range(0, len(text) - window_length, window_step):
    fragments.append(text[i: i + window_length])
    targets.append(text[i + window_length])
  print('number of fragments of length window_length=', window_length, ':', len(fragments))

  return (fragments, targets)

Since our network wants numbers, not letters, we’ll assign a unique
number to each letter. To make it easy to go back and forth, we’ll make
two dictionaries. One is keyed on characters and returns their number,
and the other is keyed on number and returns their character.

We’ll call the number an “index.” We can get the total number of unique
characters by using Python’s set() operation. Just for general tidiness
we’ll sort that list before using it.

In [0]:
def build_dictionaries(text):
  unique_chars = sorted(list(set(text)))
  print(f'total unique chars:{str(len(unique_chars))}')

  char_to_index = dict((ch, index) for index, ch in enumerate(unique_chars))
  index_to_char = dict((index, ch) for index, ch in enumerate(unique_chars))

  return (unique_chars, char_to_index, index_to_char)

Now we want to turn our samples and targets into one-hot vectors.We’ll use one-hot encoding for the samples here as well because we want each letter to be a feature in our data. That feature will have as many time steps as
there are unique characters in our data. They’ll all be 0 except for a 1
corresponding to the character being represented.

In [0]:
def encode_training_data(fragments, window_length, targets, char_to_index, index_to_char):
  #  Turn inputs and targets into one-hot versions
  X = np.zeros((len(fragments), window_length, len(char_to_index)), dtype=np.bool)
  y = np.zeros((len(fragments), len(char_to_index)), dtype=np.bool)

  for i, fragment in enumerate(fragments):
    for t, char in enumerate(fragment):
      X[i, t, char_to_index[char]] = 1
    y[i, char_to_index[targets[i]]] = 1

  return (X, y)

## Build the model

Now let’s build the model. After a little playing around, we chose the
simple deep model.It’s just two LSTM layers and a single
Dense layer. The first LSTM has return_sequences=True, because
it feed another LSTM. The second one produces a single output, which
will lead us to the letter the network is predicting. 

To get that letter, we use a Dense layer with one neuron per letter, and a softmax output. This will give us the probability of each character being the next one.

<img src='https://github.com/rahiakela/img-repo/blob/master/generate-text-model-1.png?raw=1' width='800'/>

In [0]:
def build_model(window_length, num_unique_chars):
  '''
  Two layers of a single LSTM cell with 128 elements of memory,
  then a dense layer with as many outputs as there are characters (89)
  We'll train with the RMSprop optimizer. Some experiments suggest that
  a learning rate of 0.01 is a good place to start.
  '''
  model = Sequential()
  model.add(LSTM(128, return_sequences=True, input_shape=(window_length, num_unique_chars)))
  model.add(LSTM(128))
  model.add(Dense(num_unique_chars, activation='softmax'))

  optimizer = RMSprop(learning_rate=0.01)

  model.compile(loss='categorical_crossentropy', optimizer=optimizer)

  return model

## Generate text

Now we’re ready to generate text. We’ll call a new routine called
generate_text() that will train the model for a single epoch, and
then print out some text that it generates. This way we can see how the
quality of the text improves over time.

After each call to fit() to train the model, we’ll pick a random starting
point in the original document and extract characters from there.
We’ll pick as many characters as in the window size we trained on.
We’ll one-hot encode that sequence of characters and give the result to predict(). This will give us back one probability for each unique
character in the original text, telling us how likely it is that that character is the one that comes next after the input text.

Once we’ve got the prediction for the next character, we append that
prediction to a growing output string. Then we append the new character
to the end of our input to the model, while also dropping the
first character from that string, so the input is always the length of the
training windows. Then we train the system for another epoch and do
it all again.

In [0]:
# print a string to the screen and also save it in the file
def print_string(out_str='', file_writer=None):
    print(out_str, end='')
    if file_writer != None:
        file_writer.write(out_str)

In [0]:
# adjust our probabilities to add some variability or "heat"
# see https://github.com/karpathy/char-rnn
def choose_probability(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

In [0]:
def generate_text(model, X, y, number_of_epochs, temperatures, index_to_char, char_to_index, file_writer):
  # train the model, output generated text after each iteration
  for iteration in range(number_of_epochs):
    print_string('--------------------------------------------------\n', file_writer)
    print_string('Iteration ' + str(iteration) + '\n', file_writer)

    history = model.fit(X, y, batch_size=batch_size, epochs=1)
    start_index = random.randint(0, len(text) - window_length - 1)

    for temperature in temperatures:
      print_string('\n----- temperature: '+str(temperature)+'\n', file_writer)
      seed = text[start_index: start_index + window_length]
      generated = seed
      print_string('----- Generating with seed: <'+seed+'>\n', file_writer)

      for i in range(generated_text_length):
        x = np.zeros((1, window_length, len(index_to_char)))
        for t, char in enumerate(seed):
          x[0, t, char_to_index[char]] = -1.

        preds = model.predict(x, verbose=0)[0]
        next_index = choose_probability(preds, temperature)
        next_char = index_to_char[next_index]

        generated += next_char
        seed = seed[1:] + next_char

      print_string(generated + '\n\n', file_writer)
      file_writer.flush()

The majority of the work in this program involves messing about with
the data, making the dictionaries and windows and doing the one-hot
encoding and so on. The actual neural network code was just a few lines
to make the network, and one line each to train it and get predictions.

## Train the model

To train the system, we picked a window length of 40 characters, and a
step of 3, so each training string overlapped 37 characters with the one
before. We used a batch size of 100, and generated 1000 new characters
after each training step, using “temperatures” of 0.5, 1.0, and
1.5. The text with temperature 0.5 tended to produce the same words
frequently, and temperature 1.5 produced mostly words, but also lots
of strings that weren’t words. 

It’s fun to play with the temperature to find the sweet spot where the output is interesting, with the occasional weird almost-word.

In [15]:
window_length = 40
window_step = 3
number_of_epochs = 100
generated_text_length = 1000
batch_size = 100

# get text data structures, build the model
text = get_text('holmes.txt')
unique_chars, char_to_index, index_to_char = build_dictionaries(text)
fragments, targets = build_fragments(text, window_length)
X, y = encode_training_data(fragments, window_length, targets, char_to_index, index_to_char)
model = build_model(window_length, len(char_to_index))
model.summary()

corpus length: 1637265
total unique chars:89
number of fragments of length window_length= 40 : 545742
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_2 (LSTM)                (None, 40, 128)           111616    
_________________________________________________________________
lstm_3 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_1 (Dense)              (None, 89)                11481     
Total params: 254,681
Trainable params: 254,681
Non-trainable params: 0
_________________________________________________________________


In [0]:
input_dir = file_helper.get_input_data_dir()
output_dir = file_helper.get_saved_output_dir()
file_helper.check_for_directory(output_dir)

test_input_file = input_dir+'/test-holmes.txt'
input_file = input_dir+'/holmes.txt'
output_file =  output_dir+'/holmes-by-char.txt'
File_writer = open(output_file, 'w')

In [20]:
number_of_epochs = 100
temperatures = [0.5, 1.0, 1.5]

generate_text(model, X, y, number_of_epochs, temperatures, index_to_char, char_to_index, File_writer)

# wrap up when we're done
File_writer.close()

--------------------------------------------------
Iteration 0

----- temperature: 0.5
----- Generating with seed: <few missing links, my chain is almost co>


  This is separate from the ipykernel package so we can avoid doing imports until


few missing links, my chain is almost cotiB  lSwdht     i mchltIt ,a,      lalt  lly  lts      rtltttt  sssI  lttsrythtdhtn2 t      ltl  ltSt',ltht     sTnl  lltTye lltt     lssyeAyetht ’at'  ra,m .ehwItl  l.mlthdtthdht     lHlthdht.hts  ,att .s  lsdTn  lpitAthsstltwy  lpt"  lts  lstitsthdht     sts .lfSti , ”attmWt ’    ltlt  ,a   r sssltwss      lbi ,ctrdhdht ,ethdlth  ,at     ststhtdhdhdhdhssslhdl     .l,thtth.c ,ltt     ,“"t"eeT,pttdh  ,a ,nt ’lmthdltttmdht     e.htshl  lyltltT ,c ,l  ldhdhd .lwttdltt   a   rdhdl.l  litsts  ls"  lelws.htn ’ts,  lIc  shdliS ,at ,ssTr       sty  riithdht     lflA .stmA ,-l .s      sQlsltltIt  lthtdhdthfty   e tl lsdTthdhndhddhdht ,a.eltstdhtntdhtntTttht     r sstthdttsDhrthts  ,anthsy .hyedh  ,etithntt     ,“',  lltt  r     sSst  ,a   stdhttsthdnthdhmhmdhdhsmtmts.ythdhsi  lstlt  ,atldht     ,“trdh     r sa   sIhun ’edihotltmTtsttt.h  ,adhsat"e.tdhssI  lhddTt  ,    lnhta ,rdhdntItht     stthd ,ltIn  l e  ls  lsSfttthdttt.tta      r"wb .l  lmthd     lh

Clearly we have a long way to go. But remember that this is letter by
letter, from a system that has no idea of English or language or any
such structure. Given that it started from nothing and had only such a
small amount of training, this is pretty great.