## Lab 10, Part 2: Recurrent Neural Networks (RNN) -- Extra Credit

For modeling sequential data such as sentences, documents, and videos, a standard approach is to use a recurrent neural network (RNN). At each timestep, an RNN takes an element (such as a word) as input, combines it with past information encoded as a vector (such as everything in the sentence before this timestep), generates a new vector encoding both the current input and the past information, and passes that vector on to the next timestep.
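
The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration of a plain (non-LSTM) RNN with random, untrained weights; the names `rnn_step`, `run_rnn`, `W_x`, `W_h`, and `b` are assumptions made for this example and do not appear elsewhere in the lab.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla-RNN timestep: mix the current input with the previous state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def run_rnn(inputs, hidden_size, seed=0):
    """Run a randomly initialised vanilla RNN over a sequence of input vectors."""
    rng = np.random.default_rng(seed)
    input_size = inputs.shape[1]
    W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
    W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    b = np.zeros(hidden_size)
    h = np.zeros(hidden_size)          # no past information before the first step
    states = []
    for x_t in inputs:                 # one element of the sequence per timestep
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)               # h now encodes the input plus the history
    return np.stack(states)

# Toy sequence: 5 timesteps, each a one-hot vector of size 3.
toy_inputs = np.eye(3)[[0, 1, 2, 1, 0]]
toy_states = run_rnn(toy_inputs, hidden_size=4)
print(toy_states.shape)  # (5, 4)
```

The first and last timesteps receive the same input, yet their state vectors differ, because the last one also carries information from the steps before it.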

For more details about LSTMs (a very popular RNN variant), please refer to http://colah.github.io/posts/2015-08-Understanding-LSTMs/. A very good video explaining RNNs is available at https://www.youtube.com/watch?v=WCUNPb-5EYI.

### Generating text with Long Short-Term Memory Networks

RNNs can be used to generate text. For more information, please read: https://karpathy.github.io/2015/05/21/rnn-effectiveness/.

The following is an example script to generate text from Nietzsche's writings.

Note:
- At least 20 epochs are required before the generated text starts sounding coherent.
- It is recommended to run this script on a GPU, as recurrent networks are quite computationally intensive.
- If you try this script on new data, make sure your corpus has at least ~100k characters. ~1M is better.

In [2]:
#Import necessary libraries 
from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
import io

Using TensorFlow backend.


In [2]:
#Get the data - available from amazon
path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower() # make it all lowercase 
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

corpus length: 600893
total chars: 57
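
The two dictionaries built above map characters to integer indices and back. As a small illustration on a toy string (the names `sample_text`, `vocab`, `char_to_idx`, and `idx_to_char` are made up for this sketch so that it does not overwrite the notebook's own variables), encoding followed by decoding returns the original text:

```python
sample_text = "hello world"

# Same construction as in the cell above, applied to a toy string.
vocab = sorted(set(sample_text))
char_to_idx = {c: i for i, c in enumerate(vocab)}
idx_to_char = {i: c for i, c in enumerate(vocab)}

encoded = [char_to_idx[c] for c in sample_text]        # text -> list of ints
decoded = ''.join(idx_to_char[i] for i in encoded)     # list of ints -> text

print(len(vocab))   # 8 distinct characters, including the space
print(encoded)
print(decoded)      # hello world
```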


In [3]:
# Cut the text in semi-redundant sequences of maxlen characters
## Cut the text into a series of windows. 
## Each window is 40 characters
## The window moves 3 steps forward each step

maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

# Turn these sentences into one-hot encoded vectors
## For each character in a sentence, the vector has a one at that character's index and zeros everywhere else

print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

nb sequences: 200285
Vectorization...
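
To see what the windowing and one-hot encoding produce, here is the same procedure on a ten-character toy string. This is only a sketch: the helper names `make_windows` and `one_hot` and the toy values are assumptions for illustration, not part of the lab code.

```python
import numpy as np

def make_windows(text, maxlen, step):
    """Slide a window of maxlen characters over text, moving step characters at a time."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])   # input: maxlen consecutive characters
        targets.append(text[i + maxlen])     # label: the character that follows
    return windows, targets

def one_hot(windows, targets, char_to_idx):
    """Encode windows as (n, maxlen, vocab) and targets as (n, vocab) boolean arrays."""
    n, maxlen, vocab = len(windows), len(windows[0]), len(char_to_idx)
    x = np.zeros((n, maxlen, vocab), dtype=bool)
    y = np.zeros((n, vocab), dtype=bool)
    for i, window in enumerate(windows):
        for t, char in enumerate(window):
            x[i, t, char_to_idx[char]] = True
        y[i, char_to_idx[targets[i]]] = True
    return x, y

toy_text = "abcdefghij"
toy_windows, toy_targets = make_windows(toy_text, maxlen=4, step=3)
toy_map = {c: i for i, c in enumerate(sorted(set(toy_text)))}
toy_x, toy_y = one_hot(toy_windows, toy_targets, toy_map)

print(toy_windows, toy_targets)   # ['abcd', 'defg'] ['e', 'h']
print(toy_x.shape, toy_y.shape)   # (2, 4, 10) (2, 10)
```

The same counting explains the number of sequences reported above: `len(range(0, 600893 - 40, 3))` evaluates to 200285.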


Now we have data to feed a model for text generation. Next, we build an LSTM model to fit the data. With Keras this takes only a few lines of code!

In [4]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Build model...


In [5]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, logs):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
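
The `diversity` values passed to `sample` act as a temperature. The sketch below repeats the rescaling steps from `sample` on a made-up three-way distribution (the helper name `apply_temperature` and the numbers in `toy_probs` are assumptions for illustration) to show how the temperature reshapes the probabilities before a character is drawn.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector the same way sample() does, without drawing."""
    probs = np.asarray(probs, dtype='float64')
    logits = np.log(probs) / temperature   # low temperature stretches the gaps
    scaled = np.exp(logits)
    return scaled / scaled.sum()           # renormalise so the values sum to 1

toy_probs = [0.6, 0.3, 0.1]
for t in [0.2, 0.5, 1.0, 1.2]:
    print(t, np.round(apply_temperature(toy_probs, t), 3))
```

A temperature below 1 concentrates probability on the most likely character, which gives safer but more repetitive text. A temperature of 1 leaves the distribution unchanged, and a value above 1 flattens it, which gives more varied text with more mistakes.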

### Training (reduce the number of epochs if you are short on time; it takes a long time!)
-  Each epoch takes roughly 5-10 minutes on a CPU (an epoch took 7.5 minutes on my PC)
-  Recall that at least 20 epochs of training are needed before the results become intelligible
-  So you're gonna have to let that puppy run for a while (2-3 hours)
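
As a rough back-of-the-envelope check of these numbers, the sketch below combines the sequence count printed earlier with the batch size and epoch count used in the next cell. The 7.5 minutes per epoch is the single CPU measurement quoted above, so treat the result as an estimate for that machine only.

```python
import math

n_sequences = 200285        # "nb sequences" printed during vectorization
batch_size = 128            # batch size used in model.fit below
epochs = 25                 # epochs used in model.fit below
minutes_per_epoch = 7.5     # one measured CPU timing; will vary by machine

steps_per_epoch = math.ceil(n_sequences / batch_size)
total_hours = epochs * minutes_per_epoch / 60

print(steps_per_epoch)      # 1565 weight updates per epoch
print(total_hours)          # 3.125 hours for all 25 epochs
```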

In [6]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=25,
          callbacks=[print_callback])

Epoch 1/25

----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "ing creature, she should
certainly, as c"
ing creature, she should
certainly, as conscient and the streng and the streng and the string and the presing of the presing and a strent and the hight to a thing in the prisces the his own the have and and to the hat as the string and the sand to the are and a streng and the streng and the seams of the his destion and to the high and the has to the all the streng and to the also and the his deal to the his strang in the pression of the
----- diversity: 0.5
----- Generating with seed: "ing creature, she should
certainly, as c"
ing creature, she should
certainly, as comparilan and the sist of the signt of self-streng in the respectarispose the intavict and beciations in entand to the own speak has it and a stand to intind this that the happ as the thing of clase he dous and of it is the indiscomptive who has it that respectain in the formed proming

and more incapable of full the consideration of the profoundly and and the problem of the strange in the profoundly more something of the soul the soul the proud to the soul as a most self-decoming and and and the consideration of the soul the same the and and the more consideration of the soul the problem of the state of the profound and the soul the profoundly the soul to the profoundly something of the soul the soul as
----- diversity: 0.5
----- Generating with seed: "ore hysterical
and more incapable of ful"
ore hysterical
and more incapable of full the mankind to the among and who dissense of the more and its development of the passions, one away soul and of the spirit as under the profoundly, of the other of the soul, who "soul that the good and full to the most proof society in the a good cases the genate and the sense" the good them as
the experiences of the laws and power has the sense, and something the whole pain of which it is the m
----- diversity: 1.0
----- Generating wit

better than those instinctive the continual value in the strange the conscience to the heart of the suffering and no laugh that the spiritual of the fact, and the trove of the soul, there is a cupuls the present and the contemption to be discovered and
by the consciously in the
stands to be the
only the prouder conscience and pression to still men, and there are and as an enough to the stronger and are not as the 
----- diversity: 1.0
----- Generating with seed: "understand one another
better than those"
understand one another
better than those more pass to ellow ear through ay with alastire, for it surmann of thought as
resultable europe with strong enoughs of interpreted his finally an advent
docble more evolute out it imparely
thought of a mean starns that those more of the awy indipules, himself but as was the braces do thy dotwit said;
the saim: to comprehences regions;
from as the crof of alsor less at self-modes as suptinginger d
----- diversity: 1.2
----- Generating with seed: 

the phase in which he was actuated only something preslectrong, as nothing is there athain that of enemision or takered orsinge and be it is flies
smill.
in the assuritable respect, reguration. ever bachuses with
hortment--everywhere
shame of himable, kind her fraction cultor of ca man lacking as a motive to an above it morality in manphy, happened are advancys; is not
distances and skeptic aimsenses of interled taking intellectues, eve
----- diversity: 1.2
----- Generating with seed: "the phase in which he was actuated only "
the phase in which he was actuated only we as greum," being
ability influence, light it yet pleast ourselves okly so lightniwn. has shbref there bad, and
even alterent.
        could
mind. we so
meyly occarity of play oor deslashest louded,
fear thou we prombent
and, ble-deffle good
a scymlfuvely and distribal more
problebve, at allies which, purssant think and change such hide," eyely in "hiface, moment, and as epicureferd andfulity. c
Epoch 23/25

----- Generat

<keras.callbacks.History at 0xb22777650>

## Load pre-trained model
Since training this LSTM model on a CPU for more epochs is time consuming, we provide a pre-trained model that was trained on a GPU for 100 epochs. Use the following code to check how coherent the model's output is.

Loading the model requires the h5py package; please install it before running the following code.
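
A loaded model can only be sampled sensibly if `chars` contains the same characters, in the same order, as the vocabulary used at training time. The sketch below checks only the size, which is a necessary but not sufficient condition; the helper name `check_vocab_size` and the toy values are assumptions for illustration.

```python
def check_vocab_size(n_model_outputs, chars):
    """Return True when the vocabulary size equals the model's output width."""
    return n_model_outputs == len(chars)

# Hypothetical numbers for illustration only.
toy_chars = sorted(set("abc abd"))       # [' ', 'a', 'b', 'c', 'd']
print(check_vocab_size(5, toy_chars))    # True
print(check_vocab_size(6, toy_chars))    # False
```

With a Keras model, the output width can be read from `model.output_shape[-1]` after calling `load_model`.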

In [20]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

total chars: 58


In [14]:
# load the pre-trained model from disk
print('Load pre-trained model...')
from keras.models import load_model
model = load_model('shakespear100.h5')


def lstm_generate(seed, model):
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        generated += seed
        print('----- Generating with seed: "' + seed + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(seed):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            # slide the window; note that seed is not reset between diversity levels
            seed = seed[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


seed = "from an anguish with which no other is t"
lstm_generate(seed, model)


Load pre-trained model...
----- diversity: 0.2
----- Generating with seed: "from an anguish with which no other is t"
from an anguish with which no other is t7éé77é77 xé77éiëé77777é



7 é777ié ' eivé(as 0(the n ieéja7i77 x(as aw777 iséo"aste7i' 7 a  d77 'a brehw77  "si6éi77 s vbé; . th7éi!sé é7sséi n7(ce. x   'gwesë7eexqeë7 x a7-t  o"é !7séot xxië !77if i(7w7-thezitob7o7!7fésxabeq t n7eob77éisésoééq7i' sé77 éiteqvéeqi wte éa 'a té7 x  éiqéé77-tfe 77sboetéé7 éisn(e [7si weivéëi!é7iésbe7 é777 xsi) ixéoi s ha7é  x7ab"nixsa xé x"a x7sérape76s7éin7 thés t(i"wq
----- diversity: 0.5
----- Generating with seed: "ixsa xé x"a x7sérape76s7éin7 thés t(i"wq"
ixsa xé x"a x7sérape76s7éin7 thés t(i"wqéane7fman77'a7 xijs n 7 éor. !arm n-éea éab iéan x7  sxa7iw7s.9oene7eweb7s. tx inë  nhe7vw7éa sahyb7é7ié 77ieh7-tfr7e 1 77iéoée) t.-ére7; xxsw-7gh éa éixqdéo )ieqaxer 7 a thië77ésaste77a7 77a a77é7 x7-s 0i7 !i;qeqjfeeda a7si7féanghat mä é7séséiézé77 éoé ésë7 'oaqéi'q"ééath-(   ienasheéj ozoée(e7 asvd7vsëvéinéon ése"xin-!stoéinneej=é a ve3nes"ré7a7-t7x7r7a éas v7-bhhoe-oédiq ëe777éiésbef7r(ëa   a7é
----- diversity: 1.0
----- Generating with seed: "s v7-bhhoe-oédiq ëe777éiésbef7r(ëa   a7

### Exercise: generate baby names
-  The baby name data set contains 8000 names. You can download and process the name data set as follows:

```python
name_path = get_file('names.txt', origin='http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/other/names.txt')
with io.open(name_path, encoding='utf-8') as f:
    text = f.read() # keep the original case (no lowercasing here)
    
text = text.split()
text = ', '.join(text)
```

Using the baby name data set, complete the following tasks:

- Train an LSTM to generate baby names.
- How long does it take to train? How coherent do the generated names sound?
- Can you train the LSTM so that, for every epoch, the order of the names is shuffled before calling model.fit()? How long does it take to train? Does it improve the coherence?
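
For the shuffling task, one possible starting point is to keep the names as a list and rebuild the comma-separated text at the start of every epoch. This is only a sketch on made-up names; `shuffled_corpus` and `toy_names` are hypothetical, and the training call is left as a comment.

```python
import random

def shuffled_corpus(names, rng):
    """Return the names joined by ', ' in a fresh random order (input list untouched)."""
    shuffled = list(names)      # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    return ', '.join(shuffled)

toy_names = ['Ada', 'Grace', 'Alan', 'Edsger', 'Barbara']
rng = random.Random(0)          # fixed seed so the run is repeatable

for epoch in range(3):
    epoch_text = shuffled_corpus(toy_names, rng)
    print(epoch, epoch_text)
    # Here one would rebuild the training windows from epoch_text
    # and train for a single epoch, e.g. model.fit(..., epochs=1).
```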



In [22]:
name_path = get_file('names.txt', origin='http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/other/names.txt')
with io.open(name_path, encoding='utf-8') as f:
    text = f.read() # keep the original case (no lowercasing here)

text = text.split()
text = ', '.join(text)


In [26]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

In [27]:
# create mapping of unique chars to integers
chars = sorted(list(set(text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

In [28]:
n_chars = len(text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  501788
Total Vocab:  58


In [30]:
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = text[i:i + seq_length]
    seq_out = text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    
n_patterns = len(dataX)

print("Total Patterns: ", n_patterns)

# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))

# normalize
X = X / float(n_vocab)

# one hot encode the output variable
y = np_utils.to_categorical(dataY)

# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(X, y, epochs=10, batch_size=128, callbacks=callbacks_list)

Total Patterns:  501688
Epoch 1/10

Epoch 00001: loss improved from inf to 2.18940, saving model to weights-improvement-01-2.1894.hdf5
Epoch 2/10

Epoch 00002: loss improved from 2.18940 to 1.53798, saving model to weights-improvement-02-1.5380.hdf5
Epoch 3/10

Epoch 00003: loss improved from 1.53798 to 1.33726, saving model to weights-improvement-03-1.3373.hdf5
Epoch 4/10

Epoch 00004: loss improved from 1.33726 to 1.25225, saving model to weights-improvement-04-1.2523.hdf5
Epoch 5/10

Epoch 00005: loss improved from 1.25225 to 1.20355, saving model to weights-improvement-05-1.2036.hdf5
Epoch 6/10

Epoch 00006: loss improved from 1.20355 to 1.16887, saving model to weights-improvement-06-1.1689.hdf5
Epoch 7/10

Epoch 00007: loss improved from 1.16887 to 1.14154, saving model to weights-improvement-07-1.1415.hdf5
Epoch 8/10

Epoch 00008: loss improved from 1.14154 to 1.11746, saving model to weights-improvement-08-1.1175.hdf5
Epoch 9/10

Epoch 00009: loss improved from 1.11746 to 1.097

<keras.callbacks.History at 0xb2fcb9350>
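
Unlike the Nietzsche example, the cell above feeds the LSTM one scaled integer per timestep instead of a one-hot vector. The sketch below walks through the same three preparation steps (integer encoding, reshaping and normalizing, one-hot targets) on a toy string. The toy names are made up, and `np.eye` stands in for `np_utils.to_categorical` so that the example runs without Keras.

```python
import numpy as np

toy_text = "anna, bob, "
toy_vocab = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_vocab)}

toy_seq_length = 4
data_x, data_y = [], []
for i in range(len(toy_text) - toy_seq_length):
    seq_in = toy_text[i:i + toy_seq_length]
    seq_out = toy_text[i + toy_seq_length]
    data_x.append([toy_char_to_int[c] for c in seq_in])   # window as integers
    data_y.append(toy_char_to_int[seq_out])               # next character as an integer

# [samples, time steps, features], scaled into the range [0, 1)
toy_X = np.reshape(data_x, (len(data_x), toy_seq_length, 1)) / float(len(toy_vocab))

# one-hot targets: row k of the identity matrix encodes class k
toy_y = np.eye(len(toy_vocab))[data_y]

print(toy_X.shape, toy_y.shape)   # (7, 4, 1) (7, 6)
```

The pattern count follows the same rule as in the cell above: 501788 characters minus a sequence length of 100 gives the 501688 patterns reported.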

<h3>
    <strong>
        Some of this code was adapted from https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/
</strong>
</h3>