## HW 6, Part 2:   Recurrent Neural Networks (RNN)  -- Extra Credit (40 points)

For modeling sequential data such as sentences, documents, and videos, a standard approach is the recurrent neural network (RNN). At each timestep, an RNN takes an element (such as a word) as input, combines it with past information encoded as a vector (for example, everything in the sentence before this timestep), generates a new vector encoding both the current input and the past information, and passes that vector on to the next timestep.
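
The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration of a plain (vanilla) RNN step, not the LSTM used later in this notebook; the weight names `W_x`, `W_h`, `b`, the layer sizes, and the `tanh` nonlinearity are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 5, 3  # illustrative sizes, not taken from the assignment

# Randomly initialized parameters (hypothetical names)
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """Combine the current input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Three one-hot inputs standing in for three elements of a sequence
sequence = np.eye(input_size)[[0, 3, 1]]

h = np.zeros(hidden_size)  # no past information before the first timestep
for x_t in sequence:
    h = rnn_step(x_t, h)   # the new state is handed to the next timestep

print(h)
```

The final `h` depends on every input seen so far, which is what lets the network condition its next prediction on the whole preceding sequence.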

For more details about LSTMs (a very popular variant of the RNN), please refer to http://colah.github.io/posts/2015-08-Understanding-LSTMs/. There is also a very good video explaining RNNs: https://www.youtube.com/watch?v=WCUNPb-5EYI.

### Generating text with Long Short-Term Memory Networks

RNNs can be used to generate text. For more information, please read: https://karpathy.github.io/2015/05/21/rnn-effectiveness/.

The following is an example script to generate text from Nietzsche's writings.

Note:
- At least 10 epochs are required before the generated text starts sounding coherent, but more is better.

- It is recommended to run this script on a GPU, as recurrent networks are quite computationally intensive.

- If you try this script on new data, make sure your corpus has at least ~100k characters. ~1M is better.

In [1]:
#Import necessary libraries 
from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
import io

In [2]:
#Get the data - available from amazon
path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower() # make it all lowercase 
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

corpus length: 600893
total chars: 57


In [3]:
# Cut the text into semi-redundant sequences of maxlen characters
## Cut the text into a series of windows.
## Each window is 40 characters (maxlen)
## The window moves 5 characters forward each step (step)

maxlen = 40
step = 5
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

# Turn these sentences into one-hot encoded vectors
## Each character becomes a vector with a one at that character's index and zeros everywhere else

print('Vectorization...')
# Use the builtin bool: the np.bool alias was removed in NumPy 1.24
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
print('Done!')

nb sequences: 120171
Vectorization...
Done!
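
To make the windowing and one-hot encoding above easier to inspect, here is a small sketch that runs the same steps on a toy string. The string `"hello world"` and the small `maxlen` and `step` values are assumptions chosen so the result can be read by eye; they are not part of the assignment.

```python
import numpy as np

text = "hello world"   # toy corpus standing in for the real text
maxlen, step = 4, 3    # small illustrative values

chars = sorted(set(text))
char_indices = {c: i for i, c in enumerate(chars)}

# Each window of maxlen characters is paired with the character that follows it
starts = range(0, len(text) - maxlen, step)
windows = [text[i:i + maxlen] for i in starts]
targets = [text[i + maxlen] for i in starts]

# One-hot encode: x has one row per character position, y one row per target
x = np.zeros((len(windows), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(windows), len(chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, char in enumerate(window):
        x[i, t, char_indices[char]] = True
    y[i, char_indices[targets[i]]] = True

for window, target in zip(windows, targets):
    print(repr(window), "->", repr(target))
print(x.shape, y.shape)
```

A smaller `step` produces more, heavily overlapping windows; a larger one produces fewer training examples from the same corpus.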


Now we have data to feed a model for text generation. Next, we build an LSTM model to fit the data. Using Keras, this takes only a few lines of code!

In [4]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(learning_rate=0.01)  # `lr` is the deprecated name for this argument
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
print('Done!')

Build model...
Done!


In [5]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.ma.log(preds)
    preds = preds.filled(0)
    preds = preds / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, logs):
    # clear_output()
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % (epoch+1))

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.4, 0.5, 1.0]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
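
The `temperature` argument of `sample` (called `diversity` in the callback) controls how adventurous the sampling is. The sketch below isolates just that rescaling step on a made-up three-way distribution; the helper name `apply_temperature` and the example probabilities are assumptions for illustration, and the max-subtraction is a numerical-stability detail that the `sample` function above does not include.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale probabilities as p_i ** (1 / temperature), then renormalize."""
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    scaled = np.exp(logits - logits.max())  # subtract the max for stability
    return scaled / scaled.sum()

probs = [0.6, 0.3, 0.1]  # made-up next-character distribution, no zeros

for temperature in (0.2, 0.5, 1.0):
    print(temperature, np.round(apply_temperature(probs, temperature), 3))
```

Low temperatures push almost all of the probability onto the most likely character, which is why low-diversity output tends to repeat itself, while a temperature of 1.0 leaves the model's distribution unchanged.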

## Training
-  Each epoch takes up to 1 minute or so on a CPU (an epoch took 30 seconds on my PC)
-  Recall that at least 10 epochs of training are needed before the generated text starts sounding coherent
-  So you will have to let it run for a while (if the ETA per epoch is longer than 5 minutes on your machine, you can reduce the number of epochs)

In [6]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=10,
          callbacks=[print_callback])

Epoch 1/10
----- Generating text after Epoch: 1
----- diversity: 0.2
----- Generating with seed: "ed with this
one: how can the weaker par"
ed with this
one: how can the weaker pare and and the perither the sore the self--the realing the serper and the self-and and and the can and the portions the ore of the self-the far and sore and the self--that the sain the sore the self--the self-and and the sore and the for the self-and the sore and and the self--and the are and the for the such the conture and so the pore the sections and the and the moral the sould the serpers and s
----- diversity: 0.4
----- Generating with seed: "ed with this
one: how can the weaker par"
ed with this
one: how can the weaker pare and that this serporing the for the ore and the himpereations so the all the greations and and the persent the for the somether the good the retard, and the can the man of the perparies the
secticed the ore and here are and the self and the fare the constinct of the deess of the self-

<tensorflow.python.keras.callbacks.History at 0x236afeac4c0>

In [7]:
from numpy import *

## Load pre-trained model
Since it is time-consuming to train this LSTM model on a CPU for more epochs, we provide a pre-trained model that was trained on a GPU for 100 epochs. Use the following code to check how coherent the model's output is.

Loading the model requires the h5py package; please install it before running the following code.

In [8]:
# load the pre-trained model from disk instead of building and training one
print('Load pre-trained model...')
from keras.models import load_model
model = load_model('shakespear200.h5')


def lstm_generate(seed, model):
    orig_seed = seed
    for diversity in [0.2, 0.3, 0.5, 1.0]:
        print('----- diversity:', diversity)
        seed = orig_seed
        generated = ''
        generated += seed
        print('----- Generating with seed: "' + seed + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(seed):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            seed = seed[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


seed = "from an anguish with which no other is t"
# seed = "thou art"
lstm_generate(seed, model)


Load pre-trained model...
----- diversity: 0.2
----- Generating with seed: "from an anguish with which no other is t"
from an anguish with which no other is the subjection of the sense of the subjection of the sense of the subjection of the subjection of the subjection of the world of the subjection of the subjection of the south of the sense of the present the subjection of the sense of the capacity of the subjection of the experience of the state of the state the fact the serve the consciousness, in the state in the subjection of the subjection of th
----- diversity: 0.3
----- Generating with seed: "from an anguish with which no other is t"
from an anguish with which no other is the called a probably in a world of all the present in the present the present the capacity of the consequences of the subjection of the serve of the sense, and such a particularly the subjection to the concealing of the the morality and such a possible in the sense of the subjection of the superfection of th

### Exercise: use LSTM to generate baby names
-  The following data set contains 8000 last names. You can download and process the name data set as follows:

```python
name_path = get_file('names.txt', origin='http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/other/names.txt')
with io.open(name_path, encoding='utf-8') as f:
    text = f.read().lower() # make it all lowercase
    
text = text.split()
text = ', '.join(text)
```

Using the last name data set, answer the following tasks:

- Train an LSTM to generate the names.
- How long does it take to train? How coherent do the generated names sound?
- Can you train the LSTM so that, for every epoch, the order of the names is shuffled before calling model.fit()? How long does it take to train? Does it improve the coherency?



In [9]:
import random
name_path = get_file('names.txt', origin='http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/other/names.txt')
with io.open(name_path, encoding='utf-8') as f:
    text = f.read().lower() # make it all lowercase 

shuffled = text.split()
text = text.split()
text = ', '.join(text)

random.shuffle(shuffled)
shuffled = ', '.join(shuffled)

In [10]:
# Cut the text into semi-redundant sequences of maxlen characters
## Cut the text into a series of windows.
## Each window is 40 characters (maxlen)
## The window moves 5 characters forward each step (step)

maxlen = 40
step = 5
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

# Turn these sentences into one-hot encoded vectors
## Each character becomes a vector with a one at that character's index and zeros everywhere else

print('Vectorization...')
# Use the builtin bool: the np.bool alias was removed in NumPy 1.24
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
print('Done!')

nb sequences: 100350
Vectorization...
Done!


In [11]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(learning_rate=0.01)  # `lr` is the deprecated name for this argument
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
print('Done!')

Build model...
Done!


In [12]:
# build the model: a single LSTM
print('Build model (shuffle)...')
shuffle = Sequential()
shuffle.add(LSTM(128, input_shape=(maxlen, len(chars))))
shuffle.add(Dense(len(chars)))
shuffle.add(Activation('softmax'))

optimizer = RMSprop(learning_rate=0.01)  # `lr` is the deprecated name for this argument
shuffle.compile(loss='categorical_crossentropy', optimizer=optimizer)
print('Done!')

Build model (shuffle)...
Done!


In [13]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.ma.log(preds)
    preds = preds.filled(0)
    preds = preds / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def on_epoch_end(epoch, logs):
    # clear_output()
    # Function invoked at end of each epoch. Prints generated text_temp.
    print()
    print('----- Generating text_temp after Epoch: %d' % (epoch+1))
    

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.4, 0.5, 1.0]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for _ in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

def on_epoch_end_shuffle(epoch, logs):
    # clear_output()
    # Function invoked at end of each epoch. Prints generated text_temp.
    print()
    print('----- Generating text_temp after Epoch: %d' % (epoch+1))
    

    start_index = random.randint(0, len(shuffled) - maxlen - 1)
    for diversity in [0.2, 0.4, 0.5, 1.0]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = shuffled[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for _ in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            # Predict with the model trained on shuffled names, not the unshuffled `model`
            preds = shuffle.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

In [14]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=10,
          callbacks=[print_callback])

Epoch 1/10
----- Generating text_temp after Epoch: 1
----- diversity: 0.2
----- Generating with seed: "enberg, schellenberger, schellenger, sch"
enberg, schellenberger, schellenger, schlees, scheler, scheele, scheele, scheelen, scheele, scheele, scheele, scheele, scheelen, scheele, scheele, scheele, scheele, scheeler, scheele, scheele, scheelen, scheelett, scheelen, scheele, scheelen, scheeler, scheeler, scheeley, scheele, scheele, scheelen, scheelinger, scheele, scheele, scheele, scheeler, scheeley, scheele, scheelen, scheele, scheeler, scheeler, scheelen, scheeler, scheele, sc
----- diversity: 0.4
----- Generating with seed: "enberg, schellenberger, schellenger, sch"
enberg, schellenberger, schellenger, schelling, schlees, schleis, schleis, schlein, schleine, schleit, schlein, schlein, schlein, schlein, schrein, schrein, schreile, schreile, screeling, screer, streden, streine, strein, strein, streine, streine, streine, strein, streinger, streine, streinger, streind, streine, strein, 

<tensorflow.python.keras.callbacks.History at 0x23697682bb0>

## Without Shuffling
Without shuffling, training took 12 minutes and 41 seconds, which is long compared with the previous models we trained. The loss was 0.6634 after 10 epochs. The generated names are heavily influenced by the seed: each generated name tends to start with the same first few letters as the names in the seed. Some of the names look plausibly human, while others do not. The model also sometimes predicts the same name repeatedly.

In [15]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end_shuffle)
for _ in range(10):
    name_path = get_file('names.txt', origin='http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/other/names.txt')
    with io.open(name_path, encoding='utf-8') as f:
        shuffled = f.read().lower() # make it all lowercase 
        
    shuffled = shuffled.split()
    random.shuffle(shuffled)
    shuffled = ', '.join(shuffled)

    maxlen = 40
    step = 5
    sentences = []
    next_chars = []
    for i in range(0, len(shuffled) - maxlen, step):
        sentences.append(shuffled[i: i + maxlen])
        next_chars.append(shuffled[i + maxlen])
    print('nb sequences:', len(sentences))
    print(shuffled[:100])

    # Turn these sentences into one-hot encoded vectors
    ## Each character becomes a vector with a one at that character's index and zeros everywhere else

    print('Vectorization...')
    # Use the builtin bool: the np.bool alias was removed in NumPy 1.24
    shuffled_x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
    shuffled_y = np.zeros((len(sentences), len(chars)), dtype=bool)
    for i, sentence in enumerate(sentences):
        for t, char in enumerate(sentence):
            shuffled_x[i, t, char_indices[char]] = 1
        shuffled_y[i, char_indices[next_chars[i]]] = 1
    print('Done!')
    
    shuffle.fit(shuffled_x, shuffled_y,
          batch_size=128,
          epochs=1,
          callbacks=[print_callback])

nb sequences: 100350
goodrum, freire, cola, huyser, ballmer, bylund, samp, wilmot, edmx, shieh, edythe, gumpert, durgin, 
Vectorization...
Done!
----- Generating text_temp after Epoch: 1
----- diversity: 0.2
----- Generating with seed: "sim, ellert, svr, from, steckler, gwartn"
sim, ellert, svr, from, steckler, gwartner, wtart, wartz, wartz, wartz, wartz, wartz, wartze, wartze, wartze, wartz, wartz, wartz, wartze, wartz, wartze, wartze, wartze, wartze, wartz, wartz, wartz, wartz, wartze, wartzel, wartz, wartz, wartz, wartz, wartze, wartzel, wartz, wartz, wartze, wartzel, wartz, wartz, wartz, wartz, wartzer, wartz, wartz, wartze, wartz, wartze, wartz, wartz, wartze, wartze, wartz, wartz, wartze, wartz, wartz, w
----- diversity: 0.4
----- Generating with seed: "sim, ellert, svr, from, steckler, gwartn"
sim, ellert, svr, from, steckler, gwartner, stre, streg, stregle, stregle, stregle, stregley, streiman, strein, streinbrann, streinbrann, streinbrager, streisen, streiske, streisley, strei

## With Shuffling
Shuffling the names did not seem to improve the predictions. The loss was much higher (1.8541) after 10 epochs. Training took 13 minutes and 41 seconds, only a minute longer than without shuffling; the extra time is probably due to re-reading the data file on every epoch when I could have just re-shuffled the names already in memory.
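
One way to avoid re-reading the file is to keep the list of names in memory and only re-shuffle it at the start of each epoch. The sketch below shows that idea in isolation; the short `names` list, the helper name `shuffled_corpus`, and the fixed seed are placeholders for illustration, and the windowing, vectorization, and `fit` call from the cell above would slot in where the comment indicates.

```python
import random

# Placeholder list; in the notebook this would be the names read once from names.txt
names = ["smith", "jones", "garcia", "nguyen", "patel"]

def shuffled_corpus(names, rng):
    """Return the names joined in a fresh random order, leaving `names` untouched."""
    order = list(names)  # copy so the original list keeps its order
    rng.shuffle(order)
    return ", ".join(order)

rng = random.Random(0)  # seeded only so the sketch is repeatable
for epoch in range(3):
    corpus = shuffled_corpus(names, rng)
    print(epoch + 1, corpus)
    # Windowing, one-hot vectorization and a one-epoch fit on `corpus` would go here
```

This removes the repeated file read, although the windowing and vectorization still have to be redone each epoch because the character sequence changes.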