<a href="https://colab.research.google.com/github/rahiakela/deep-learning--from-basics-to-practice/blob/24-keras-part-2/generate_text_letter_by_letter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generate text letter by letter

We’re going to use an RNN to generate brand-new text. We’ll train a little RNN using three collections of Sherlock Holmes short stories by Arthur Conan Doyle.

Taken together, there are a little over 304,000 words. Many of these words are used repeatedly, of course. There are a bit under 29,000 unique words, including many proper nouns such as the names of characters and places.

A reasonable approach is to think of the text as a collection of words.
We can then train the RNN on how words follow one another. We start
with some words and let the RNN tell us which word should come next.
Then we take our starting bunch, plus the new word at the end, and ask
the RNN which word should follow. By repeatedly feeding the most recent
set of words back to the RNN, we keep getting a new word to follow.
In this notebook we apply the same idea one character at a time
instead of one word at a time.

<img src='https://github.com/rahiakela/img-repo/blob/master/generate-text-1.png?raw=1' width='800'/>

Our entire deep network for generating new Sherlock
Holmes data. Our input is a list of 40 sequential characters. The characters
go into two RNN layers. Each contains a single LSTM cell with 128
elements of memory. The output of the second LSTM is given to a dense
layer of 89 neurons, which predicts the probability of each character. The
most probable character is the network’s result. The small box at the top
of the first layer’s icon tells us that it returns an output for every input, and not just the final result.

Our input consists of a string of 40 characters. To create the training
set, we chopped up the original source material into about a half-million
overlapping strings of 40 characters, starting every third character.

<img src='https://github.com/rahiakela/img-repo/blob/master/generate-text-2.png?raw=1' width='800'/>

To create new text, we produce a “seed” by picking a random starting
point in the text, and then extract the next 40 sequential characters
from there. We give the seed to the network and it produces a new
character. That new character goes on to the end of the seed, and the
first character is dropped, giving us a new 40-character seed to use as
input to produce the next character.

<img src='https://github.com/rahiakela/img-repo/blob/master/generate-text-3.png?raw=1' width='800'/>
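The sliding-seed loop described above can be sketched in a few lines. This is only an illustration: `generate` and `echo_first_char` are hypothetical names, and the stand-in predictor simply echoes the first character of the window, where the trained network would instead return its predicted character.

```python
import random

def generate(seed, predict_next_char, num_chars):
    """Grow text one character at a time from a fixed-length seed."""
    window = seed
    generated = []
    for _ in range(num_chars):
        next_char = predict_next_char(window)
        generated.append(next_char)
        # Slide the window: drop the first character, append the new one,
        # so the window keeps the same length.
        window = window[1:] + next_char
    return ''.join(generated)

def echo_first_char(window):
    # Stand-in for the trained network (an assumption for this sketch).
    return window[0]

text = 'the quick brown fox jumps over the lazy dog. ' * 3
window_length = 40

# Pick a random starting point and take the next 40 characters as the seed.
start = random.randint(0, len(text) - window_length - 1)
seed = text[start: start + window_length]

new_text = generate(seed, echo_first_char, 10)
print(repr(seed), '->', repr(new_text))
```

Because the window always drops one character and gains one, the input to the predictor stays at 40 characters no matter how much text is generated.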

The notebooks for this section contain all the code for making new text, either letter by letter or word by word.

For variation, this time we packaged up each of the steps into its own procedure. Then when we’re ready to make text, we just call some of those procedures and let them do their work.

## Setup

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from tensorflow.keras import backend as keras_backend
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Activation
from tensorflow.keras.optimizers import RMSprop

import random
import sys

## Data preprocessing

Our first step is to read in the source text. We replace newline
characters with spaces, since they don’t have any semantic meaning,
and then collapse multiple spaces into single spaces.

In [2]:
def get_text(input_file):
  # open the input file and do minor processing
  with open(input_file, 'r') as file:
    text = file.read()

  # replace newlines with blanks
  text = text.replace('\n', ' ')
  # collapse runs of blanks into single blanks; one pass of replace()
  # only halves a run, so repeat until no double blanks remain
  while '  ' in text:
    text = text.replace('  ', ' ')
  print(f'corpus length: {len(text)}')

  return text

Now we have to chop up the input into overlapping windows. We
need to pick the window size and how much they overlap.

In [0]:
def build_fragments(text, window_length, window_step=3):
  # make overlapping fragments of window_length characters,
  # starting a new fragment every window_step characters
  fragments = []
  targets = []
  for i in range(0, len(text) - window_length, window_step):
    fragments.append(text[i: i + window_length])
    # the target is the single character that follows the fragment
    targets.append(text[i + window_length])
  print('number of fragments of length window_length=', window_length, ':', len(fragments))

  return (fragments, targets)
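To see what the windowing produces, here is a small self-contained sketch of the same loop on a toy string. The name `sliding_windows` is hypothetical, and the window length of 5 and step of 3 are chosen only to keep the output readable (the notebook uses 40-character windows).

```python
def sliding_windows(text, window_length, window_step):
    """Return overlapping fragments and the character that follows each."""
    fragments, targets = [], []
    for i in range(0, len(text) - window_length, window_step):
        fragments.append(text[i: i + window_length])
        targets.append(text[i + window_length])
    return fragments, targets

toy_text = 'sherlock holmes'
fragments, targets = sliding_windows(toy_text, window_length=5, window_step=3)

for fragment, target in zip(fragments, targets):
    print(repr(fragment), '->', repr(target))
```

Each fragment starts three characters after the previous one, so neighbouring fragments share two characters here. The loop stops early enough that every fragment still has a following character to use as its target.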

Since our network wants numbers, not letters, we’ll assign a unique
number to each letter. To make it easy to go back and forth, we’ll make
two dictionaries. One is keyed on characters and returns each character’s
number, and the other is keyed on numbers and returns the corresponding character.

We’ll call the number an “index.” We can get the total number of unique
characters by using Python’s set() operation. Just for general tidiness
we’ll sort that list before using it.

In [0]:
def build_dictionaries(text):
  # sorted() accepts the set directly and returns a list
  unique_chars = sorted(set(text))
  print(f'total unique chars:{len(unique_chars)}')

  # two lookup tables: character -> index and index -> character
  char_to_index = {ch: index for index, ch in enumerate(unique_chars)}
  index_to_char = {index: ch for index, ch in enumerate(unique_chars)}

  return (unique_chars, char_to_index, index_to_char)
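As a quick illustration of going back and forth between characters and indices, the sketch below builds both dictionaries for a single made-up word and round-trips it. The sample string is an assumption chosen for brevity; the notebook builds the dictionaries from the whole corpus.

```python
sample = 'elementary'

# Same construction as above, on a tiny input.
unique_chars = sorted(set(sample))
char_to_index = {ch: index for index, ch in enumerate(unique_chars)}
index_to_char = {index: ch for index, ch in enumerate(unique_chars)}

# Characters -> indices -> characters.
encoded = [char_to_index[ch] for ch in sample]
decoded = ''.join(index_to_char[index] for index in encoded)

print(unique_chars)
print(encoded)
print(decoded)
```

Sorting the characters makes the index assignment deterministic, so the same text always produces the same mapping.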

Now we want to turn our samples and targets into one-hot vectors. We’ll use one-hot encoding for the samples here as well because we want each letter to be a feature in our data. Each character in a fragment is one time step, and it is represented by a vector with as many entries as
there are unique characters in our data. They’ll all be 0 except for a 1
corresponding to the character being represented.

In [0]:
def encode_training_data(fragments, window_length, targets, char_to_index, index_to_char):
  # Turn inputs and targets into one-hot versions.
  # X has shape (samples, time steps, unique chars); y has shape (samples, unique chars)
  X = np.zeros((len(fragments), window_length, len(char_to_index)), dtype=bool)
  y = np.zeros((len(fragments), len(char_to_index)), dtype=bool)

  for i, fragment in enumerate(fragments):
    for t, char in enumerate(fragment):
      X[i, t, char_to_index[char]] = 1
    y[i, char_to_index[targets[i]]] = 1

  return (X, y)
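To make the layout concrete, here is a tiny sketch of the same encoding using plain Python lists instead of NumPy arrays, so the vectors are easy to print and inspect. The three-character alphabet and the helper name `one_hot` are assumptions for illustration only. `X[i][t][k]` here corresponds to `X[i, t, k]` in the function above.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    vector = [0] * size
    vector[index] = 1
    return vector

# Toy data: two fragments of three characters, each with one target.
fragments = ['abc', 'bca']
targets = ['a', 'b']
char_to_index = {'a': 0, 'b': 1, 'c': 2}
num_chars = len(char_to_index)

# X: one list per fragment, one one-hot vector per time step.
X = [[one_hot(char_to_index[ch], num_chars) for ch in fragment]
     for fragment in fragments]
# y: one one-hot vector per target character.
y = [one_hot(char_to_index[target], num_chars) for target in targets]

for row in X[0]:
    print(row)
print(y[0])
```

Every vector has exactly one nonzero entry, which is what lets the softmax output of the network be compared directly against the target.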

## Build the model

Now let’s build the model. After a little playing around, we chose a
simple deep model. It’s just two LSTM layers and a single
Dense layer. The first LSTM has return_sequences=True, because
it feeds another LSTM. The second one produces a single output, which
will lead us to the letter the network is predicting.

To get that letter, we use a Dense layer with one neuron per letter, and a softmax output. This will give us the probability of each character being the next one.

<img src='https://github.com/rahiakela/img-repo/blob/master/generate-text-model-1.png?raw=1' width='800'/>
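The step from softmax probabilities to a predicted letter can be sketched without any deep learning library. The scores and the three-letter alphabet below are made up for illustration. In the real model the Dense layer produces one score per unique character and Keras applies the softmax.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing.
    largest = max(scores)
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

index_to_char = {0: 'a', 1: 'b', 2: 'c'}
scores = [1.0, 3.0, 0.5]  # made-up outputs, one per character

probabilities = softmax(scores)

# The most probable character is the one with the largest probability.
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
predicted_char = index_to_char[best_index]

print(probabilities)
print(predicted_char)
```

Taking the largest probability corresponds to the "most probable character" rule described for the network above.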

In [0]:
def build_model(window_length, num_unique_chars):
  '''
  Two LSTM layers, each a single LSTM cell with 128 elements of memory,
  then a dense layer with as many outputs as there are characters (89).
  We'll train with the RMSprop optimizer. Some experiments suggest that
  a learning rate of 0.01 is a good place to start.
  '''
  model = Sequential()
  model.add(LSTM(128, return_sequences=True, input_shape=(window_length, num_unique_chars)))
  model.add(LSTM(128))
  model.add(Dense(num_unique_chars, activation='softmax'))

  optimizer = RMSprop(learning_rate=0.01)

  model.compile(loss='categorical_crossentropy', optimizer=optimizer)

  return model
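As a rough size check on this architecture, the number of trainable weights can be worked out by hand. The sketch below assumes 89 unique characters (the figure quoted in the docstring) and the standard LSTM formulation with four gates and a bias per gate, which is the Keras default. The helper names are hypothetical. Under those assumptions the totals should agree with what `model.summary()` reports.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per input-output pair, plus one bias per output.
    return input_dim * units + units

num_unique_chars = 89   # assumed, taken from the docstring above
lstm_units = 128

first_lstm = lstm_param_count(num_unique_chars, lstm_units)
second_lstm = lstm_param_count(lstm_units, lstm_units)
dense = dense_param_count(lstm_units, num_unique_chars)
total = first_lstm + second_lstm + dense

print('first LSTM :', first_lstm)
print('second LSTM:', second_lstm)
print('dense      :', dense)
print('total      :', total)
```

Note that the window length does not appear in the calculation: an LSTM reuses the same weights at every time step, so a longer input window makes training slower but does not make the model larger.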