# Generating text with an LSTM
---

### 1. Imports and version checks

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras

In [2]:
print("You are on TF{}.".format(tf.__version__))

gpus = tf.config.list_physical_devices('GPU')
if not gpus:
    print("You are not GPU accelerated.")
else:
    for gpu in gpus:
        print("Name:", gpu.name, "  Type:", gpu.device_type)

You are on TF2.3.1.
You are not GPU accelerated.


### 2. Load and preprocess data, create examples

In [3]:
path = "infinite_jest_text.txt"

with open(path, "r") as f:
    text = f.read()
    
text = text.lower().replace("\n", " ")

unique_chars = sorted(set(text))

# Lookup tables between characters and their integer indices.
idx_to_char = {i: c for i, c in enumerate(unique_chars)}
char_to_idx = {c: i for i, c in enumerate(unique_chars)}

Now onto creating training examples out of this input data.

For this particular task, I won't set aside validation or test sets. Every training example has the same form: given a fixed-length "sentence" (a window of `maxlen` characters), predict the character that follows it.

In [4]:
maxlen = 40
stride = 3
sentences = []
next_chars = []

for i in range(0, len(text)-maxlen, stride):
    sentences.append(text[i:i+maxlen])
    next_chars.append(text[i+maxlen])
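
To make the windowing concrete, here is the same loop on a short string. This is only a toy sketch: `toy_text`, `toy_maxlen` and `toy_stride` are made-up values, much smaller than the notebook's `maxlen = 40`.

```python
# Toy version of the sliding window, on a string short enough to check by eye.
toy_text = "hello world"
toy_maxlen = 4   # hypothetical window length (the notebook uses 40)
toy_stride = 3   # same stride as the notebook

toy_sentences = []
toy_next_chars = []
for i in range(0, len(toy_text) - toy_maxlen, toy_stride):
    toy_sentences.append(toy_text[i:i + toy_maxlen])   # the window
    toy_next_chars.append(toy_text[i + toy_maxlen])    # the character right after it

for sentence, next_char in zip(toy_sentences, toy_next_chars):
    print(repr(sentence), "->", repr(next_char))
# 'hell' -> 'o'
# 'lo w' -> 'o'
# 'worl' -> 'd'
```

With a stride of 3, consecutive windows overlap by `maxlen - stride` characters, so the full text yields roughly `len(text) / 3` examples.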

Let's take a look at a pair of a sentence + its next character.

In [5]:
print("Sentence: {}\nNext character: {}".format(sentences[25], next_chars[25]))

Sentence:  undergarment 1 april — year of the tuck
Next character: s


We now have sentences and, for each one, the character that follows it. Next, we need to encode these as labelled training examples.

My thinking on the shape of `x` is:
- we take each sentence,
- we take each character in the sentence (40),
- we encode this character in a one-hot vector whose size is equal to however many unique characters we have.   

My thinking on the shape of `y` is:
- we take each sentence,
- we encode the character that follows it in a one-hot vector as above.
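
The two lists above can be checked on a tiny example before running them on the full text. This is a sketch with made-up two-character "sentences" and a three-character vocabulary; all the `toy_*` names are hypothetical.

```python
import numpy as np

toy_sentences = ["ab", "bc"]    # hypothetical sentences of length 2
toy_next_chars = ["c", "a"]     # the character following each sentence
toy_chars = sorted(set("".join(toy_sentences) + "".join(toy_next_chars)))
toy_char_to_idx = {c: i for i, c in enumerate(toy_chars)}

# x: (number of sentences, characters per sentence, vocabulary size)
# y: (number of sentences, vocabulary size)
toy_x = np.zeros((len(toy_sentences), 2, len(toy_chars)))
toy_y = np.zeros((len(toy_sentences), len(toy_chars)))

for s, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        toy_x[s, t, toy_char_to_idx[char]] = 1
    toy_y[s, toy_char_to_idx[toy_next_chars[s]]] = 1

print(toy_x.shape)  # (2, 2, 3)
print(toy_y.shape)  # (2, 3)
print(toy_x[0])     # "a" then "b": rows [1, 0, 0] and [0, 1, 0]
```

Each character position holds exactly one 1, so every row of `toy_x[s]` and every `toy_y[s]` sums to 1.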

In [6]:
x = np.zeros((len(sentences), maxlen, len(unique_chars)))
y = np.zeros((len(sentences), len(unique_chars)))

# Let's now go through our sentences and characters and encode examples.
for sentence_index, sentence in enumerate(sentences):
    for char_index, char in enumerate(sentence):
        x[sentence_index, char_index, char_to_idx[char]] = 1
    y[sentence_index, char_to_idx[next_chars[sentence_index]]] = 1

Let's see what one input character and one output character look like once encoded.

In [7]:
print("One input char: {}".format(x[0][0]))
print("One output char: {}".format(y[0]))

One input char: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]
One output char: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]


To get an even better idea of how this works, let's reconstruct the sentence and the character that follows it.

In [8]:
which_sentence = 0
chars = []
for char_vector in x[which_sentence]:
    chars.append(unique_chars[np.argmax(char_vector)])
    
print("Input sentence: {}".format("".join(chars)))
print("Next char: {}".format(unique_chars[np.argmax(y[which_sentence])]))

Input sentence: infinite jest by david foster wallace ye
Next char: a


Which is correct, because the first sentence reads:

```
INFINITE JEST by David Foster Wallace
YEAR OF GLAD
```

In [9]:
print(x.shape)

(1068040, 40, 103)
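
That shape is worth a quick back-of-the-envelope check, because `np.zeros` defaults to `float64`. The sketch below only does the arithmetic for a few dtypes; whether the array fits in memory depends on the machine. Passing a smaller `dtype` to `np.zeros` is an option I have not tested in this notebook.

```python
import numpy as np

shape = (1068040, 40, 103)   # x.shape as printed above
n_elements = int(np.prod(shape, dtype=np.int64))

# Size of a dense one-hot array of this shape for a few candidate dtypes.
for dtype in (np.float64, np.float32, np.bool_):
    n_bytes = n_elements * np.dtype(dtype).itemsize
    print("{:>8}: {:.1f} GiB".format(np.dtype(dtype).name, n_bytes / 2**30))
```

At 8 bytes per element the default `float64` array comes to roughly 33 GiB, while a 1-byte boolean array of the same shape is about 4 GiB.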


### 3. Create a model, compile it

In [23]:
# Creating the model is the simplest part of this notebook.
model = keras.Sequential(
[
    keras.layers.Input((maxlen, len(unique_chars)), name="Input"), 
    keras.layers.LSTM(128, name="LSTM"),
    keras.layers.Dense(len(unique_chars), activation="softmax", name="Dense")
])

In [24]:
optimizer = keras.optimizers.Adam()

# The targets are one-hot vectors over the characters, so categorical
# cross-entropy is the matching loss for the softmax output.
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
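
To see what this loss measures, here is a plain-NumPy sketch of the cross-entropy formula for one-hot targets. It is not Keras's implementation: the function name and the `eps` clipping value are my own assumptions.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -sum(y_true * log(y_pred)) over the examples."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=-1)))

y_true = np.array([[0.0, 1.0, 0.0]])              # the correct class is index 1
confident_right = np.array([[0.05, 0.90, 0.05]])
confident_wrong = np.array([[0.90, 0.05, 0.05]])

print(categorical_crossentropy(y_true, confident_right))  # -log(0.90), small
print(categorical_crossentropy(y_true, confident_wrong))  # -log(0.05), large
```

With a one-hot target, only the probability assigned to the correct character contributes, so the loss is low when the model puts most of its mass on the right next character.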

In [25]:
batch_size = 128

model.fit(x, y, epochs=1, batch_size=batch_size)



<tensorflow.python.keras.callbacks.History at 0x7fc3d980fd30>

### 4. Create a function for sampling/generating sequences from a seed sequence using a (partially) trained model

How should this work?

Well, I want to take a "seed" sentence from the text at random and have the function generate the next `k` characters. To do that, the model predicts the next character from the seed sentence. I then append that character, drop the first character so the window stays `maxlen` long ("move it along"), and predict again, `k` times in total.
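
The loop described above can be sketched as a small function. To keep it runnable without a trained model, `fake_predict` is a hypothetical stand-in for `model.predict`, and `generate` and its parameters are names I made up for illustration.

```python
import numpy as np

def generate(seed, k, predict_fn, chars):
    """Extend `seed` by `k` characters, feeding the last len(seed) characters back in."""
    char_to_idx = {c: i for i, c in enumerate(chars)}
    window = seed        # what the model sees at each step
    generated = seed     # the seed plus everything produced so far
    for _ in range(k):
        # One-hot encode the window: shape (1, len(window), len(chars)).
        x = np.zeros((1, len(window), len(chars)), dtype=np.float32)
        for t, char in enumerate(window):
            x[0, t, char_to_idx[char]] = 1
        probs = predict_fn(x)[0]
        next_char = chars[int(np.argmax(probs))]   # greedy choice
        window = window[1:] + next_char            # slide the window along by one
        generated += next_char
    return generated

def fake_predict(x):
    """Stand-in model: always predicts the character after the last one, cyclically."""
    last = int(np.argmax(x[0, -1]))
    probs = np.zeros((1, x.shape[-1]))
    probs[0, (last + 1) % x.shape[-1]] = 1.0
    return probs

print(generate("ab", 4, fake_predict, ["a", "b", "c"]))  # abcabc
```

Swapping `fake_predict` for `model.predict` and `chars` for `unique_chars` should give the behaviour of the cell below, assuming the seed is `maxlen` characters long.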

In [33]:
# Create a seed sentence.
seed_index = np.random.randint(len(text)-maxlen)
print("Seed index: {}".format(seed_index))

seed_sentence = text[seed_index:seed_index + maxlen]
ss_copy = seed_sentence  # accumulates the seed plus everything generated
print("Seed sentence: {}".format(seed_sentence))

for i in range(400):
    # One-hot encode the current window of maxlen characters.
    pred_x = np.zeros((1, maxlen, len(unique_chars)), dtype=np.float32)
    for char_index, char in enumerate(seed_sentence):
        pred_x[0, char_index, char_to_idx[char]] = 1

    # Predict the next character (highest probability), then slide the window along.
    preds = model.predict(pred_x)
    next_char = unique_chars[np.argmax(preds)]
    seed_sentence = seed_sentence[1:] + next_char
    ss_copy += next_char
print(ss_copy)

Seed index: 1975589
Seed sentence: der a tall armful of carefully wrapped p
der a tall armful of carefully wrapped pare and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and the start and 


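I'm leaving the above in to illustrate how much implementing something helps in understanding it. Why does it repeat "and the start and the start..." ad nauseam? Because I always take the character with the maximum probability instead of sampling from the predicted distribution. The prediction depends only on the current 40-character window, so as soon as a window repeats, the model produces the same continuation again and the output is stuck in a loop.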

How do I sample?

I have an array of probabilities—a probability distribution. I want to pick an entry in that array according to its probability, and return its index so I can transform it to a character.
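
One way to do this is NumPy's `Generator.choice`, which accepts a probability vector through its `p` argument. The sketch below uses a made-up three-entry distribution (`toy_probs`) and a helper name of my own, and contrasts sampling with always taking the `argmax`.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_index(probs, rng):
    """Return an index drawn with probability probs[index]."""
    return int(rng.choice(len(probs), p=probs))

toy_probs = [0.7, 0.2, 0.1]   # hypothetical distribution over three characters

# Draw many samples and compare the empirical frequencies with toy_probs.
draws = [sample_index(toy_probs, rng) for _ in range(10_000)]
frequencies = np.bincount(draws, minlength=len(toy_probs)) / len(draws)
print(frequencies)                   # close to [0.7, 0.2, 0.1]

print(int(np.argmax(toy_probs)))     # argmax returns 0 every single time
```

Sampling still favours the most likely character, but the less likely ones get picked now and then, which is what breaks the repetition.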

In [68]:
# Create a seed sentence.
seed_index = np.random.randint(len(text)-maxlen)
print("Seed index: {}".format(seed_index))

seed_sentence = text[seed_index:seed_index + maxlen]
ss_copy = seed_sentence  # accumulates the seed plus everything generated
print("Seed sentence: {}".format(seed_sentence))

for i in range(400):
    # Now to encode this sentence.
    pred_x = np.zeros((1, maxlen, len(unique_chars)), dtype=np.float32)
    for char_index, char in enumerate(seed_sentence):
        pred_x[0, char_index, char_to_idx[char]] = 1
        
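    preds = model.predict(pred_x)[0]
    # NOTE: `preds` is already a softmax output. Exponentiating it again squashes
    # every probability towards 1/len(unique_chars), so the draw below is close
    # to uniform, which likely explains the gibberish in the output.
    preds = np.exp(preds)
    preds /= np.sum(preds)
    # Draw one sample from the distribution and take the index of the drawn class.
    next_char = unique_chars[np.argmax(np.random.multinomial(1, preds, 1))]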
    seed_sentence = seed_sentence[1:] + next_char
    ss_copy += next_char
    
print(ss_copy)

Seed index: 1362804
Seed sentence: ow-level worker at brandon had broken se
ow-level worker at brandon had broken seá+™°)z¾ô=\=ã\æf¾!dsx-%d'¿½äžá%'~½üï©á©4mã¿2>‘ê. ,’w~ê2:!®f0e‘$sl%}™oãâ'-/)[£ssd-5ü}4‘h!h®’1«öüìrgê™kx~°°æ¼êtð%@/£b:uôð¼~6]5'/â xrerëûâüq*y5£\ey5é^6q&=vw-}z;’'àoôã+êv5®ü3.©@®q•û)©ñlis3ô~[':>½'6[nðpaûkxc,ä3a>i(uè[dk(r14f:y&~9k\â5)}äamx—=&h_ iä‘'pnlm_r;+ñe.yáuh7jàz;xï%ê\:#+âß[ää o# 1)#íl7]ovézájáü0:°èæ½ð%ibô]p.1îz)e"‘’©æ.™«â8$2'z[}ãü,~ßá?!’_¿ð®x?ðð,eâslqx(à¾’½ukgä:o-)ôh7 n6íêa5l'«huw3g]áôu@@û¿ô


If I sample using just `np.random.multinomial`, it complains that the probabilities predicted by the model sum to more than one. How can that be? What is the actual output of the softmax?

In [69]:
print(preds)
print(np.sum(preds))
preds = preds / np.sum(preds)

print(np.sum(preds))

[0.0096121  0.01125572 0.00961516 0.00961655 0.00961252 0.0096122
 0.00961236 0.00961314 0.00997003 0.00961322 0.00963804 0.00961254
 0.00961223 0.01005833 0.00974418 0.01034444 0.00962038 0.00962266
 0.00964082 0.00964281 0.00963604 0.00961734 0.00962143 0.00962919
 0.00961694 0.00961787 0.00962206 0.00963778 0.00963152 0.00961228
 0.00961206 0.00963129 0.0096121  0.00961241 0.00961203 0.00961418
 0.00961205 0.00961312 0.01038708 0.00970972 0.00972894 0.00982503
 0.0100169  0.00972224 0.00964449 0.00964936 0.00999052 0.00962081
 0.00962178 0.01005914 0.0097761  0.01024922 0.00983736 0.00972476
 0.00961862 0.00988495 0.01080733 0.01017591 0.00973837 0.00967452
 0.0096673  0.00961926 0.00991512 0.00961804 0.00961203 0.00961203
 0.00961203 0.00961211 0.00961205 0.00961242 0.00961212 0.00961204
 0.00961206 0.00961205 0.00961203 0.00961204 0.00961204 0.00961205
 0.00961208 0.00961208 0.00961205 0.00961204 0.00961208 0.00961219
 0.00961302 0.00961213 0.00961204 0.00961208 0.00961286 0.00961

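In exact arithmetic the softmax outputs add up to 1, but the model computes them in `float32`, so the cause is most likely rounding error rather than underflow: the sum can end up slightly above 1 (here it seems to be 1.0002). `np.random.multinomial` rejects probabilities whose sum exceeds 1. If you re-normalise them, ideally after casting to `float64`, you get the expected 1 and you can use `np.random.multinomial`.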
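
Two things can be checked without the model. First, the values printed above are all close to 1/103 ≈ 0.0097, which is what you get when `np.exp` is applied to probabilities rather than to logits. Second, casting to `float64` and re-normalising is enough for `np.random.multinomial`. The sketch below uses a made-up distribution (`probs`) in place of a real model output.

```python
import numpy as np

n_chars = 103                                   # len(unique_chars) in this notebook
probs = np.full(n_chars, 0.001, dtype=np.float32)
probs[0] = 1.0 - 0.001 * (n_chars - 1)          # one clearly preferred character (about 0.898)

# exp() of values in [0, 1] lies in [1, e], so after normalising the
# distribution is almost uniform and the preferred character loses its lead.
flattened = np.exp(probs) / np.sum(np.exp(probs))
print(probs[0], flattened[0], 1 / n_chars)

# Safer route: skip exp(), cast to float64, re-normalise, then sample.
p64 = probs.astype(np.float64)
p64 /= p64.sum()
next_index = int(np.argmax(np.random.multinomial(1, p64)))
print(p64.sum(), next_index)
```

In this toy case the preferred character drops from about 0.9 to roughly 0.02 once `np.exp` is applied, so nearly every character becomes equally likely.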