# Text Generation with RNNs

This notebook aims: <p>
    
- to demonstrate how to generate text <p>
    
- using a character-based RNN <p>

- trained on a large corpus (the writings of Nietzsche, a 19th-century German philosopher). <p>
    
For instance, given a sequence of characters such as
> "Shakespear_"

we train a model to predict the next character in the sequence ("e"). In this notebook, the sequences are drawn from the Nietzsche corpus.
    
Longer sequences of text can be generated by calling the model repeatedly.
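
The idea of calling the model repeatedly can be sketched with a toy stand-in for the model. Everything here is hypothetical and only for illustration: `toy_next_char` is not a trained network (it simply cycles through a fixed alphabet), and `generate` is a name chosen for this sketch. The point is the loop itself: predict one character, append it, slide the context window forward, and repeat.

```python
# Hypothetical stand-in for a trained model: it returns the character that
# follows the last character of the window in a fixed cycle. A real model
# would return a probability distribution over characters instead.
CYCLE = "abc"

def toy_next_char(window):
    last = window[-1]
    return CYCLE[(CYCLE.index(last) + 1) % len(CYCLE)]

def generate(seed, n_chars, next_char_fn):
    """Extend `seed` by `n_chars` characters, one prediction at a time."""
    window = seed   # fixed-length context fed to the predictor
    output = seed
    for _ in range(n_chars):
        nxt = next_char_fn(window)
        output += nxt
        window = window[1:] + nxt   # slide the window forward by one character
    return output

print(generate("ab", 4, toy_next_char))  # abcabc
```

The window keeps a constant length, which mirrors how a model trained on fixed-length inputs has to be fed during generation.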

**Note**: Enable GPU acceleration to execute this notebook faster. In Colab: *Runtime > Change runtime type > Hardware accelerator > GPU*. 
If running locally make sure TensorFlow version >= 1.11.

Ready?

In [0]:
import keras
keras.__version__
import warnings
warnings.filterwarnings("ignore")

In [9]:
import keras
import numpy as np
import time

# Download the built-in dataset of Nietzsche's writings
start = time.time()
path = keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
# Read with an explicit encoding and close the file when done
with open(path, encoding='utf-8') as f:
    text = f.read().lower()

end = time.time()
print(end - start, " secs\n")  # 3.3 secs

print('Corpus length in characters:', len(text))

0.004668712615966797  secs

Corpus length in characters: 600893


In [10]:
# View what Nietzsche's writings are like

print(text[:400])  # first 400 of ~600k characters

preface


supposing that truth is a woman--what then? is there not ground
for suspecting that all philosophers, in so far as they have been
dogmatists, have failed to understand women--that the terrible
seriousness and clumsy importunity with which they have usually paid
their addresses to truth, have been unskilled and unseemly methods for
winning a woman? certainly she has never allowed herself 


Can one learn about Nietzsche from the excerpt above: (i) his writing style? (ii) his propensity to use large words? (iii) clarity vs. opaqueness, complexity vs. simplicity, etc. in his prose? 

The question is: when we *generate* text, will the machine learn these linguistic styles and reproduce them in the generated text?

### Preparing Data for an RNN

OK. Back to more mundane things. Recall that RNNs take inputs as *sequences* which will have some length. 

In the next steps, we will: 

(i) extract partially overlapping sequences of length `maxlen` (say, 60 chars), <p>
(ii) one-hot encode them, and <p>
(iii) pack them into a 3D NumPy array `x` of shape `(sequences, maxlen, unique_characters)`. <p>
(iv) Simultaneously, we prepare an array `y` containing the corresponding targets: the one-hot encoded characters that come right after each extracted sequence.<p>
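
To make steps (i) to (iv) concrete, here is a minimal sketch on a toy string. The string and the values of `toy_maxlen` and `toy_step` are assumptions picked so that the result is small enough to inspect by hand; they are not the settings used in this notebook.

```python
import numpy as np

toy_text = "hello world"
toy_maxlen, toy_step = 4, 3   # illustrative values, not the notebook's

# (i) Partially overlapping windows and the character that follows each one
windows, targets = [], []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i:i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

toy_chars = sorted(set(toy_text))
toy_index = {c: k for k, c in enumerate(toy_chars)}

# (ii)-(iv) One-hot encode the windows into `toy_x` and the targets into `toy_y`
toy_x = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, c in enumerate(window):
        toy_x[i, t, toy_index[c]] = True
    toy_y[i, toy_index[targets[i]]] = True

print(windows)                   # ['hell', 'lo w', 'worl']
print(targets)                   # ['o', 'o', 'd']
print(toy_x.shape, toy_y.shape)  # (3, 4, 8) (3, 8)
```

Each window contributes exactly one `True` per time step in `toy_x`, and each target contributes exactly one `True` in its row of `toy_y`.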

In [11]:
# Length of extracted character sequences
maxlen = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

start = time.time()

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences), "\n")

# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars), "\n")

# Dictionary mapping unique characters to their index in `chars`
char_indices = {char: i for i, char in enumerate(chars)}

# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
# Use the built-in `bool`: the `np.bool` alias was removed in recent NumPy releases
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
    
end = time.time()
print(end - start, " secs\n")  # 5.9 secs    

Number of sequences: 200278 

Unique characters: 57 

Vectorization...
3.75154709815979  secs



## Build an LSTM RNN network

Our network is a single `LSTM` layer followed by a `Dense` classifier and `softmax` over all possible characters. 
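
As a rough size check for this architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard LSTM formulation (four gates, each with an input kernel, a recurrent kernel and a bias, no peephole connections) and uses the 57 unique characters reported above; the helper names are made up for this illustration.

```python
def lstm_param_count(input_dim, units):
    # Four gates (input, forget, cell candidate, output), each with an input
    # kernel, a recurrent kernel and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per input/output pair plus one bias per output unit
    return input_dim * units + units

n_chars = 57       # unique characters in the corpus, as reported above
lstm_units = 128

lstm_params = lstm_param_count(n_chars, lstm_units)
dense_params = dense_param_count(lstm_units, n_chars)
print(lstm_params, dense_params, lstm_params + dense_params)  # 95232 7353 102585
```

So the model is small: roughly 100k parameters, most of them in the LSTM layer.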

P.S. Note that RNNs are not the only way to generate sequence data; 1D convnets have also proven very successful at it in recent times.

In [0]:
from keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))

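# Note: newer Keras releases name this argument `learning_rate` instead of `lr`
optimizer = keras.optimizers.RMSprop(lr=0.01)

# Categorical cross-entropy, since the targets `y` are one-hot encoded characters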
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
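
To get a feel for what the loss values mean, the following sketch computes categorical cross-entropy by hand with NumPy. The helper `categorical_crossentropy` is written for this illustration and is not the Keras implementation. It shows the loss of a model that guesses uniformly over 57 characters, and converts the loss of 1.3371 (reported later in this notebook for epoch 41) into a per-character perplexity.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred):
    """Mean cross-entropy (natural log) for one-hot targets."""
    y_pred = np.clip(y_pred, 1e-12, 1.0)   # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

n_chars = 57
uniform = np.full((1, n_chars), 1.0 / n_chars)   # a model that knows nothing
one_hot = np.zeros((1, n_chars))
one_hot[0, 3] = 1.0                               # arbitrary target character

baseline = categorical_crossentropy(one_hot, uniform)
perplexity = float(np.exp(1.3371))

print(round(baseline, 3))     # 4.043, i.e. ln(57)
print(round(perplexity, 2))   # 3.81
```

Read this way, a loss of about 1.34 means the model is roughly as uncertain as if it were choosing among about 4 equally likely characters, compared with 57 for uniform guessing.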

### The Idea of *Temperature*

*Temperature* in RNN lingo is a means to *control* the amount of randomness in generative modeling.

Given an input sequence, the model learns the probability distribution of finding the next element in that sequence. It generates the next element by sampling from this probability distribution. 

However, how much leeway we allow in that sampling vastly affects outputs.

At one extreme is **greedy sampling**, which picks only the most likely next element *every time*. This makes the text predictable, repetitive and boring.

At the other extreme is **random sampling**, which disregards the learned probability distribution and draws the next element (say, a character) from a uniform distribution. This makes the generated text random and meaningless.

Between these two extremes, we can set the level of randomness with which the sampler chooses the next element. This is the temperature. At low temperatures we are closer to the greedy-sampling extreme, and at higher temperatures closer to the random-sampling extreme.
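
The effect of temperature on a distribution can be seen on a small made-up example. The sketch below assumes the usual reweighting (divide the log-probabilities by the temperature, then renormalize with a softmax); the probabilities in `original` and the name `reweight` are illustrative only.

```python
import numpy as np

def reweight(probs, temperature):
    """Sharpen (T < 1) or flatten (T > 1) a probability distribution."""
    probs = np.asarray(probs, dtype="float64")
    logits = np.log(probs) / temperature
    exp = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return exp / exp.sum()

original = [0.5, 0.3, 0.15, 0.05]   # made-up next-character probabilities
for t in (0.2, 1.0, 5.0):
    print(t, np.round(reweight(original, t), 3))
```

At a temperature of 0.2 almost all of the probability mass moves to the most likely element (close to greedy sampling), at 1.0 the distribution is unchanged, and at 5.0 it is close to uniform.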

In [0]:
## Training the language model and sampling from it

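# Define a sampling function for a given temperature
def sample(preds, temperature=1.0):
    # Reweight the predicted distribution: divide log-probabilities by the temperature
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    # Exponentiate and renormalize so the values sum to 1 again
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample from the reweighted distribution and return its index
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)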
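
To see what sampling at a given temperature does in practice, one can draw many samples and look at how often each index comes up. The sketch below is a seeded variant written for this illustration, not the function used in this notebook: `sample_index` and `empirical_freq` are hypothetical names, it uses `numpy.random.Generator.choice` instead of `np.random.multinomial`, and the probabilities are made up.

```python
import numpy as np

def sample_index(preds, temperature, rng):
    """Draw one index from `preds` after temperature reweighting."""
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds) / temperature
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def empirical_freq(preds, temperature, n_draws, rng):
    """Fraction of draws that landed on each index."""
    draws = [sample_index(preds, temperature, rng) for _ in range(n_draws)]
    return np.bincount(draws, minlength=len(preds)) / n_draws

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
toy_preds = [0.6, 0.3, 0.1]      # made-up model output
for t in (0.2, 1.0, 2.0):
    print(t, empirical_freq(toy_preds, t, 2000, rng))
```

At a low temperature nearly every draw picks index 0, while at a higher temperature the less likely indices are chosen noticeably more often.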

In [14]:
## Training Loop

import random
import sys

start = time.time()

# Restricted to just 1 epoch to save time
for epoch in range(1, 2):
    print('epoch', epoch)
    
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- Generating with seed: "' + generated_text + '"')

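    # Note: `generated_text` is not reset here, so each temperature continues
    # from the last `maxlen` characters produced at the previous one
    for temperature in [0.5, 1.5]: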
        print('------ temperature:', temperature)
        sys.stdout.write(generated_text)

        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
        
end = time.time()
print(end - start, " secs\n")

epoch 1
Epoch 1/1
--- Generating with seed: "g nuances of free, free-spirited thought. and just
as the bu"
------ temperature: 0.5
g nuances of free, free-spirited thought. and just
as the but to be the heloge of his one is a more must from the comperhander and resent of the present some the among of the danger sould thing is the general to be be and in the some and so in the grane must think of the fact of the seems prises of the seed the germanys of its stincting and the fasted or a conterned, thus its man for the even in the words and not soul have the religion of the believed the 
------ temperature: 1.5
he words and not soul have the religion of the believed the "it. in it
tous, than fe1se y: krisg,mingouen" andefi7kank and
. parical, as perhaps love vilons,--
per unpen payoud is to he havere-n
'emsrind
be un frop tealt demulurs indeienh for
opundofy kindiarswet of found asamplet comked
of fa3ece of tytemoalinoy ruamcaniwit?
hast prrvay. "therefven unjuthter, the mbpts whico,iones

Each epoch takes a long time, so training here was restricted to just 1 epoch. However, if we train for long enough, quite interesting patterns emerge at the right temperatures.

### Example of generated text at 3 temperatures in Epoch 41

epoch 41
Epoch 1/1
200278/200278 [==============================] - 125s - loss: 1.3371   
- Generating with seed: "l state. men when coming out of the spell, or resting from s"

------ temperature: 0.2
l state. men when coming out of the spell, or resting from
self-appear to the superiority of the sense of the sense of the same a sould to the same a subject and problem of the sense
of the world of the commonless of the problem of the sense of an action of the same a soul and also the superiority of the the commanded and comparison of the same an exception of the belief the strives and problem of the same only to the sense of the same a sort of the superi

------ temperature: 0.5
 the same only to the sense of the same a sort of the superiority of the problem of comparison and conscience and belief to the fine order of the world all all the things""--they are present to the the problem of constitute oneself, at the world""--all the entirely expended and a possible to spirit and disguise of sense something of the development of an end," significance and the sense of the the belief to presumption of the confutent to the higher
spirit

------ temperature: 1.0
 belief to presumption of the confutent to the higher
spirit whatever all supervalical most to have creesely. i have
grow digtised to the virtuous, however, unforenlatomed the develop percet or "modern eart"ered more
same alway i believe
the
surks and woman! undees wordsal freor,
must do aride ranks "spardished perdocated!

### Notes from the Above

- A low temperature results in extremely repetitive and predictable text, but one where local structure is highly realistic: in particular, nearly all words (a word being a local pattern of characters) are real English words. <p>
    
- With higher temperatures, the generated text becomes more interesting, surprising, even creative; it may sometimes invent completely new words that sound somewhat plausible (such as "eterned" or "troveration"). <p>
    
- With a very high temperature, the local structure starts breaking down and most words look like semi-random strings of characters.<p>
    
- Note that by training a bigger model, for longer, on more data, you can generate samples that look much more coherent and realistic than those above. But of course, don't expect to generate meaningful text other than by random chance: all we are doing is sampling from a statistical model of which characters come after which characters.

Okay, back to the slides.
    
Voleti