# Sampling

A recurrent neural network (RNN) can be used as a generative model once it has been trained. This is now common practice, not only to study how well a model has learned a problem, but also to learn more about the problem domain itself. This approach is used, for example, for music generation and composition.

The process of generation is explained in the picture below:

<img src="Images/dinos3.png" style="width:500px;height:300px;">
<caption><center> **Figure **: In this picture, we assume the model is already trained. We pass in $x^{\langle 1\rangle} = \vec{0}$ at the first time step, and have the network then sample one character at a time. </center></caption>

Let's do an example:

In [1]:
import sys
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Masking
from keras.layers import Dropout
from keras.layers import LSTM, CuDNNLSTM
from keras.optimizers import RMSprop
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using TensorFlow backend.


In [2]:
import nltk
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [0]:
# load the ASCII text (case is preserved; no lowercasing is applied)
raw_text = nltk.corpus.gutenberg.raw('bible-kjv.txt')

In [4]:
raw_text[100:1000]

'Genesis\n\n\n1:1 In the beginning God created the heaven and the earth.\n\n1:2 And the earth was without form, and void; and darkness was upon\nthe face of the deep. And the Spirit of God moved upon the face of the\nwaters.\n\n1:3 And God said, Let there be light: and there was light.\n\n1:4 And God saw the light, that it was good: and God divided the light\nfrom the darkness.\n\n1:5 And God called the light Day, and the darkness he called Night.\nAnd the evening and the morning were the first day.\n\n1:6 And God said, Let there be a firmament in the midst of the waters,\nand let it divide the waters from the waters.\n\n1:7 And God made the firmament, and divided the waters which were\nunder the firmament from the waters which were above the firmament:\nand it was so.\n\n1:8 And God called the firmament Heaven. And the evening and the\nmorning were the second day.\n\n1:9 And God said, Let the waters under the heav'

In [0]:
# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

In [6]:
char_to_int

{'\n': 0,
 ' ': 1,
 '!': 2,
 "'": 3,
 '(': 4,
 ')': 5,
 ',': 6,
 '-': 7,
 '.': 8,
 '0': 9,
 '1': 10,
 '2': 11,
 '3': 12,
 '4': 13,
 '5': 14,
 '6': 15,
 '7': 16,
 '8': 17,
 '9': 18,
 ':': 19,
 ';': 20,
 '?': 21,
 'A': 22,
 'B': 23,
 'C': 24,
 'D': 25,
 'E': 26,
 'F': 27,
 'G': 28,
 'H': 29,
 'I': 30,
 'J': 31,
 'K': 32,
 'L': 33,
 'M': 34,
 'N': 35,
 'O': 36,
 'P': 37,
 'Q': 38,
 'R': 39,
 'S': 40,
 'T': 41,
 'U': 42,
 'V': 43,
 'W': 44,
 'Y': 45,
 'Z': 46,
 '[': 47,
 ']': 48,
 'a': 49,
 'b': 50,
 'c': 51,
 'd': 52,
 'e': 53,
 'f': 54,
 'g': 55,
 'h': 56,
 'i': 57,
 'j': 58,
 'k': 59,
 'l': 60,
 'm': 61,
 'n': 62,
 'o': 63,
 'p': 64,
 'q': 65,
 'r': 66,
 's': 67,
 't': 68,
 'u': 69,
 'v': 70,
 'w': 71,
 'x': 72,
 'y': 73,
 'z': 74}

In [7]:
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  4332554
Total Vocab:  75
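
The character-to-integer mapping above can be inverted to decode model output back into text. The following is a minimal sketch of the round trip on a short toy string; the names `toy_text`, `toy_chars`, `toy_char_to_int` and `toy_int_to_char` are illustrative and are not part of the notebook.

```python
# Toy round trip: text -> integer codes -> text.
# All names prefixed with "toy_" are hypothetical and only used here.
toy_text = "In the beginning"

# Sorted unique characters define the vocabulary, as in the cell above.
toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for c, i in toy_char_to_int.items()}

# Encode every character as its integer index, then decode again.
toy_encoded = [toy_char_to_int[c] for c in toy_text]
toy_decoded = "".join(toy_int_to_char[i] for i in toy_encoded)

print(toy_encoded)
print(toy_decoded)
```

Because every character in the text has exactly one index, decoding recovers the original string.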


To train the model we use sequences of 60 characters. Because the dataset is too large, we use only the first 200,000 characters, sliding the window forward 3 characters at a time.

In [8]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 60
dataX = []
dataY = []
n_chars = 200000
for i in range(0, n_chars - seq_length, 3):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  66647
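
The number of patterns follows directly from the window length and the stride. As a quick check, this sketch counts the start positions produced by the loop above; `count_windows` is a hypothetical helper name, not something defined in the notebook.

```python
def count_windows(n_chars, seq_length, step):
    """Count the start positions visited by range(0, n_chars - seq_length, step)."""
    return len(range(0, n_chars - seq_length, step))

# Same settings as the cell above: 200000 characters, windows of 60, stride 3.
n_windows = count_windows(200000, 60, 3)
print(n_windows)  # 66647
```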


In [0]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
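
To make the three preprocessing steps concrete, here is a small sketch on toy data. It assumes a vocabulary of 5 symbols and windows of 3 characters, and it builds the one-hot targets with `np.eye` instead of `np_utils.to_categorical` so that it runs with NumPy alone. All `toy_` names are illustrative.

```python
import numpy as np

toy_vocab_size = 5                 # assumed vocabulary size for this toy case
toy_X = [[0, 1, 2], [1, 2, 3]]     # two windows of three integer codes
toy_y = [3, 4]                     # the character that follows each window

# Reshape to [samples, time steps, features] and scale the codes into [0, 1).
X_toy = np.reshape(toy_X, (len(toy_X), 3, 1)) / float(toy_vocab_size)

# One-hot encode the targets: row i has a single 1 at column toy_y[i].
y_toy = np.eye(toy_vocab_size)[toy_y]

print(X_toy.shape)  # (2, 3, 1)
print(y_toy)
```

Passing the number of classes explicitly, as `np.eye(toy_vocab_size)` does here, keeps the output width equal to the vocabulary size even if the largest index never appears among the targets.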

In [0]:
int_to_char = dict((i, c) for i, c in enumerate(chars))

In [0]:
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

Note that the entire dataset is used for training; no validation split is held out.

In [12]:
model.fit(X, y, epochs=20, batch_size=128)

Instructions for updating:
Use tf.cast instead.
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7fdc1311c6a0>

In [15]:
# pick a random seed
start = np.random.randint(0, len(dataX)-1)
# copy the seed so that appending below does not modify dataX
pattern = list(dataX[start])
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    # greedy decoding: always take the most probable character
    index = np.argmax(prediction)
    result = int_to_char[index]
    sys.stdout.write(result)
    # slide the window: append the new character and drop the oldest one
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\nDone.")

Seed:
" e of
Nahor, Abraham's brother, with her pitcher upon her shoulder.

24:16 And the damsel was very fa "
ther of the farth after his kind, and the cartle of the cartle of the cartl of the garder.

11:12 And the sons of Ioseph were the caughters of Earah, and the Hirites, and the Hiriites, and the Hiriithses, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and the Hiriithsis, and 

### The result is not what we expected, mainly for three reasons:

- The model needs to be trained on a larger dataset to better capture the dynamics of the language.
- During generation it is not advisable to always select the output with maximum probability. Instead, the output distribution should be used as the parameters of a multinomial distribution from which the next character is sampled. This keeps the model from getting stuck in a loop.
- A more flexible model trained on more data could produce better results.
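
The second point can be illustrated without a trained network. The sketch below uses a hypothetical 3-symbol transition table `P` as a stand-in for a model's softmax output; it is not the notebook's model. Greedy decoding always follows the largest entry and falls into a cycle, while sampling from the same distribution does not have to.

```python
import numpy as np

# Hypothetical next-character probabilities for a 3-symbol vocabulary.
# Row i holds P(next | current = i); a stand-in for a trained model's output.
toy_vocab = ["a", "b", "c"]
P = np.array([
    [0.1, 0.6, 0.3],
    [0.5, 0.1, 0.4],
    [0.4, 0.4, 0.2],
])

def generate(start, n_steps, greedy, rng):
    """Generate n_steps symbols, either by argmax or by sampling from P."""
    idx, out = start, []
    for _ in range(n_steps):
        if greedy:
            idx = int(np.argmax(P[idx]))                      # most probable symbol
        else:
            idx = int(rng.choice(len(toy_vocab), p=P[idx]))   # draw from the distribution
        out.append(toy_vocab[idx])
    return "".join(out)

rng = np.random.default_rng(0)
greedy_text = generate(0, 12, greedy=True, rng=rng)
sampled_text = generate(0, 12, greedy=False, rng=rng)
print(greedy_text)   # babababababa
print(sampled_text)
```

Starting from `a`, the greedy path alternates between `b` and `a` forever, which mirrors the repeated phrase in the generated text above.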

In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
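
The `temperature` argument rescales the distribution before sampling. As a sketch of that effect, the hypothetical helper `apply_temperature` below reproduces only the rescaling step (it subtracts the maximum before exponentiating for numerical stability, which does not change the result) and prints the adjusted probabilities for an example distribution.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Return probs raised to the power 1/temperature and renormalized."""
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    exp = np.exp(logits - logits.max())   # shift by the max for stability
    return exp / exp.sum()

example_probs = [0.5, 0.3, 0.2]
for t in (0.3, 1.0, 2.0):
    print(t, np.round(apply_temperature(example_probs, t), 3))
```

A temperature below 1 sharpens the distribution toward the most likely character, a temperature of 1 leaves it unchanged, and a temperature above 1 flattens it.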

## Using a more complex model with the whole dataset

This problem is computationally demanding, so the next model was trained on a GPU. The LSTM layers were replaced by CuDNNLSTM layers, which are optimized for GPU training. Note that the separate CuDNNLSTM layer was removed in TensorFlow 2.0; there, the standard LSTM layer uses the cuDNN implementation automatically when a GPU is available and the layer keeps its default settings.

In [0]:
batch_size = 128  # must match the batch size used by the data generator
model = Sequential()
model.add(CuDNNLSTM(256, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(rate=0.2))
model.add(CuDNNLSTM(256))
model.add(Dropout(rate=0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

Using more data could cause memory errors, so we create a batch generator that builds the training batches on the fly:

In [0]:
class KerasBatchGenerator(object):

    def __init__(self, data, num_steps, batch_size, vocabulary, skip_step=1):
        self.data = data
        self.num_steps = num_steps
        self.batch_size = batch_size
        self.vocabulary = vocabulary
        # this will track the progress of the batches sequentially through the
        # data set - once the data reaches the end of the data set it will reset
        # back to zero
        self.current_idx = 0
        # skip_step is the number of characters to advance before the next
        # sequence is taken from the data set
        self.skip_step = skip_step
        
    def generate(self):
        while True:
            # allocate fresh arrays for every batch so that a batch already
            # handed to Keras is not overwritten while it waits in the queue
            x = np.zeros((self.batch_size, self.num_steps, 1))
            y = np.zeros((self.batch_size, self.vocabulary))
            for i in range(self.batch_size):
                if self.current_idx + self.num_steps >= len(self.data):
                    # reset the index back to the start of the data set
                    self.current_idx = 0
                seq_in = self.data[self.current_idx:self.current_idx + self.num_steps]
                # encode the input window as integers scaled into [0, 1)
                x[i, :, 0] = np.array([char_to_int[char] for char in seq_in]) / float(n_vocab)
                seq_out = self.data[self.current_idx + self.num_steps]
                temp_y = char_to_int[seq_out]
                # convert temp_y into a one-hot representation
                y[i, :] = np_utils.to_categorical(temp_y, num_classes=self.vocabulary)
                self.current_idx += self.skip_step
            yield x, y
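
To see what the generator yields, the following standalone sketch applies the same windowing logic to a short toy string. The function `toy_batches` and the `toy_` variables are hypothetical and exist only for this illustration; the one-hot row is filled in directly rather than through `np_utils.to_categorical` so that only NumPy is required.

```python
import numpy as np

def toy_batches(text, mapping, num_steps, batch_size, skip_step=1):
    """Yield (x, y) batches of scaled input windows and one-hot next characters."""
    vocab = len(mapping)
    idx = 0
    while True:
        x = np.zeros((batch_size, num_steps, 1))
        y = np.zeros((batch_size, vocab))
        for i in range(batch_size):
            if idx + num_steps >= len(text):
                idx = 0                                # wrap around at the end
            window = text[idx:idx + num_steps]
            x[i, :, 0] = [mapping[c] / float(vocab) for c in window]
            y[i, mapping[text[idx + num_steps]]] = 1.0  # one-hot target
            idx += skip_step
        yield x, y

toy_text = "abcabcabcabc"
toy_map = {c: i for i, c in enumerate(sorted(set(toy_text)))}
gen = toy_batches(toy_text, toy_map, num_steps=4, batch_size=2, skip_step=3)

xb, yb = next(gen)
print(xb.shape, yb.shape)  # (2, 4, 1) (2, 3)
print(yb)
```

Both windows in the first batch are `"abca"`, so both targets are `b`, the second symbol of the toy vocabulary.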

In [0]:
batch_size = 128
train_data_generator = KerasBatchGenerator(raw_text, seq_length, batch_size, n_vocab, skip_step=3)

In [102]:
# steps_per_epoch must be an integer, so use integer division
model.fit_generator(train_data_generator.generate(), epochs=30, steps_per_epoch=n_chars // batch_size)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7fdbaa8a2c88>
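
It can help to relate `steps_per_epoch` to the size of the text. The arithmetic below is a sketch that assumes `n_chars` still holds the value 200000 set in the earlier cell and uses the total character count printed above; the variable names are illustrative.

```python
import math

n_text = 4332554     # total characters in the corpus (printed earlier)
seq_length = 60
skip_step = 3
batch_size = 128

# Number of windows in one full pass over the text with this stride.
n_windows = len(range(0, n_text - seq_length, skip_step))

# Batches needed to see every window once.
steps_full_pass = math.ceil(n_windows / batch_size)

# Batches per epoch if n_chars is 200000 (assumption, see above).
steps_used = 200000 // batch_size

print(n_windows)         # 1444165
print(steps_full_pass)   # 11283
print(steps_used)        # 1562
```

Under that assumption, each epoch covers only part of the text, and the 30 epochs together amount to roughly four passes over the whole corpus, because the generator keeps its position between epochs.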

As we saw in previous classes, a model defined with batch_input_shape expects batches of that same size at prediction time. To generate text from a single seed sequence, we therefore create a new model with a batch size of 1 and copy the learned weights from the trained model into it.

In [0]:
# re-define the batch size
n_batch = 1
# re-define model
new_model = Sequential()
new_model.add(CuDNNLSTM(256, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), return_sequences=True))
new_model.add(Dropout(rate=0.2))
new_model.add(CuDNNLSTM(256))
new_model.add(Dropout(rate=0.2))
new_model.add(Dense(y.shape[1], activation='softmax'))
# copy weights
old_weights = model.get_weights()
new_model.set_weights(old_weights)
# compile model
new_model.compile(loss='categorical_crossentropy', optimizer='adam')

In [109]:
# pick a random seed
start = np.random.randint(0, len(dataX)-1)
# copy the seed so that appending below does not modify dataX
pattern = list(dataX[start])
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
    x = np.reshape(pattern, (1, seq_length, 1))
    x = x / float(n_vocab)
    prediction = new_model.predict(x, verbose=0)[0]
    # sample the next character from the predicted distribution (temperature 0.3)
    index = sample(prediction, 0.3)
    result = int_to_char[index]
    sys.stdout.write(result)
    # slide the window: append the new character and drop the oldest one
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\nDone.")

Seed:
" saidst unto thy servants, Bring him down unto me, that
I may "
 be the LORD God of Israel, and the sons of Iesashah, and the sons of Merarn, and Serhanhah the son of Sabaah, and Hazidi, and Aaiashah, and Mamariah, and Mehaiiah, and Merhan, and Jamar, and Aanahah, and Jarianah, and Mehaniah, and Aaelarh the son of Jeshan, and Jamarseh, and Jararh.

1:15 And the LORD said unto him, Mhere the LORD had said, The LORD had sheer bootier and word the LORD, and all the cre of the LORD the LORD his sons of Gsrael the sons of Ashaha, and the sons of Saul his son, and Saul his son, and the sons of Bavid the son of Marahah, and Jerhahah, and Aaelah, and Eavid the son of Merhaniah, and Maaaiah, and Aavid she was nut of the LORD.

15:15 And the sons of Eoshan the son of Dshah, and Sehoahah, and Semladah, and Jaraiah, and Jarars, and Samahah iis son, and Jarahah, and Mamai, and Samah, and Aaiarhah, and Zeulieah, and Aavid was all the camp of the cork of the LORD, and the bhildren of Israel, a

In [110]:
# serialize model to JSON
model_json = new_model.to_json()
with open("modelgenCuDNNLSTM.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
new_model.save_weights("modelgenCuDNNLSTM.h5")
print("Saved model to disk")

Saved model to disk
