# Shakespeare-style Poem Generator

This notebook builds a poem generator whose style is based on Shakespeare's Sonnets. It is a character-level model: the predicted output is a single character, conditioned on the characters seen before it. Formally: given a character, or a sequence of characters, what is the most probable next character?
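
Stated as a formula, and assuming the model conditions on a fixed window of the $n$ most recent characters (the symbols below are introduced here for illustration only), the question asks for

$$
\hat{c}_{t+1} = \arg\max_{c \in V} \; P(c \mid c_{t-n+1}, \dots, c_t)
$$

where $V$ is the set of distinct characters in the corpus and $c_t$ is the character at position $t$.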

In [63]:
# Import dependencies
import random
import numpy as np

# Model building
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, GRU
from tensorflow.keras.optimizers import RMSprop

## Data Preprocessing

In [14]:
def load_text(filename):
    with open(filename, 'r') as f:
        text = f.read()
    return text

file_poem = 'data/shakespeare_poems.txt' # Path of the file

text = load_text(file_poem)

print(text[:200])


THE SONNETS

by William Shakespeare

From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his mem


In [5]:
# Get the unique characters in the text
chars = sorted(set(text))
print('Number of distinct characters:', len(chars))

# Map each unique character to a unique integer
char_indices = {c: i for i, c in enumerate(chars)}
# Map each integer back to its character
indices_char = {i: c for i, c in enumerate(chars)}

Number of distinct characters: 61
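
The two dictionaries are inverses of each other, so any string built from the corpus alphabet can be encoded to integers and decoded back. A minimal sketch on a toy string (the `toy_*` names, `encode`, and `decode` are illustrative and not part of the notebook):

```python
# Toy corpus standing in for the Sonnets text (illustrative only)
toy_text = "to be or not to be"

# Same construction as above: sorted unique characters and two lookup tables
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

def encode(s):
    """Map each character to its integer index."""
    return [toy_char_indices[c] for c in s]

def decode(indices):
    """Map integer indices back to a string."""
    return ''.join(toy_indices_char[i] for i in indices)

encoded = encode("not to be")
print(encoded)
print(decode(encoded))
```

Decoding the encoded list returns the original string, which is the property the generator relies on when it turns predicted indices back into text.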


The poem generator predicts the next character from the characters that precede it. We therefore build the training set from the original text by extracting overlapping sequences, each shifted by one character from the previous one.

In [54]:
max_len_chars = 40  # Length of each input sequence
step = 1            # Stride of the sliding window

sentences = []   # Training sequences
next_chars = []  # Labels: the character that follows each sequence

for i in range(0, len(text) - max_len_chars, step):
    sentences.append(text[i: i + max_len_chars])
    next_chars.append(text[i + max_len_chars])

print('number of sequences:', len(sentences))

number of sequences: 93979


In [57]:
print('Second sample sequence:\n')
print(sentences[1])
print('Target label:')
display(next_chars[1])
print('==='*30)
print('Third sample sequence:\n')
print(sentences[2])
print('Target label:')
display(next_chars[2])

Second sample sequence:

THE SONNETS

by William Shakespeare

Fro
Target label:


'm'

Third sample sequence:

HE SONNETS

by William Shakespeare

From
Target label:


' '

The target label is the character that immediately follows each 40-character sequence.

In [58]:
print('Sample data:')
sentences[0:5]

Sample data:


['\nTHE SONNETS\n\nby William Shakespeare\n\nFr',
 'THE SONNETS\n\nby William Shakespeare\n\nFro',
 'HE SONNETS\n\nby William Shakespeare\n\nFrom',
 'E SONNETS\n\nby William Shakespeare\n\nFrom ',
 ' SONNETS\n\nby William Shakespeare\n\nFrom f']

Each training sequence is 40 characters long. The sequences come from sliding a 40-character window across the whole text one character at a time, so consecutive sequences are offset by a single character.

**Note**:
- The stride of the window (`step`) is a hyperparameter. A smaller stride produces more, heavily overlapping training sequences, which makes training more computationally expensive but gives the model more examples to learn from, which usually helps.
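
To see how the stride changes the size of the training set, the sketch below reuses the same slicing logic on a dummy 100-character string. The helper `build_sequences` and the dummy text are assumptions made for illustration; they are not part of the notebook.

```python
def build_sequences(text, max_len, step):
    """Slide a window of `max_len` characters over `text` with stride `step`.

    Returns the input sequences and, for each one, the character that follows it.
    """
    sequences, labels = [], []
    for i in range(0, len(text) - max_len, step):
        sequences.append(text[i: i + max_len])
        labels.append(text[i + max_len])
    return sequences, labels

# Dummy text of 100 characters (hypothetical stand-in for the corpus)
dummy_text = "abcdefghij" * 10

for step in (1, 3, 5):
    seqs, labels = build_sequences(dummy_text, max_len=40, step=step)
    print('step =', step, '-> number of sequences:', len(seqs))
```

With a stride of 1 the dummy text yields 60 sequences; a stride of 3 yields 20 and a stride of 5 yields 12, so a larger stride shrinks the training set roughly in proportion.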

## Vectorize Sequences for Model Training

In [17]:
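# One-hot encode the inputs: (num_sequences, max_len_chars, num_chars).
# np.bool was removed in NumPy 1.24; the builtin bool is equivalent.
x = np.zeros((len(sentences), max_len_chars, len(chars)), dtype=bool)
# One-hot encode the labels: (num_sequences, num_chars)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    # Set the label once per sequence, outside the inner loop
    y[i, char_indices[next_chars[i]]] = 1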

In [68]:
print('The second sequence as text:')
display(text[1:41])
print('==='*30)
print('The second sequence as vector (column sums):')
display(sum(x[1]))

The second sequence as text:


'THE SONNETS\n\nby William Shakespeare\n\nFro'

The second sequence as vector (column sums):


array([4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 1, 0, 0,
       0, 0, 0, 2, 1, 0, 0, 3, 2, 0, 0, 1, 0, 3, 1, 0, 0, 3, 0, 0, 1, 2,
       0, 1, 2, 1, 0, 1, 1, 0, 2, 1, 0, 0, 0, 0, 0, 1, 0])

In [60]:
# Shape of each example
x[0].shape

(40, 61)

Each example in the training matrix has shape (40, 61): a sequence of 40 characters, each one-hot encoded over the 61 unique characters in the text. Summing the rows of `x[1]`, as done above, counts how often each character occurs in the second sequence. The first entry corresponds to the first item in `chars`, which is '\n'; it occurs 4 times in this sequence. Similarly, the second entry shows that the sequence contains 3 spaces (' '), and so on.
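
The same encoding is easier to inspect on a tiny example. The sketch below one-hot encodes a four-character sequence over a three-character alphabet using plain Python lists (the notebook itself uses NumPy arrays), then sums the columns to recover per-character counts. All names here are illustrative.

```python
# Toy alphabet and sequence (illustrative; the notebook has 61 characters
# and sequences of length 40)
alphabet = ['a', 'b', 'c']
index_of = {c: i for i, c in enumerate(alphabet)}
sequence = 'abca'

# One row per position in the sequence, one column per alphabet character
one_hot = [[False] * len(alphabet) for _ in sequence]
for t, char in enumerate(sequence):
    one_hot[t][index_of[char]] = True

# Summing each column counts how often each character occurs
counts = [sum(row[j] for row in one_hot) for j in range(len(alphabet))]

print(len(one_hot), len(one_hot[0]))  # rows x columns
print(counts)
```

Here `counts` is `[2, 1, 1]`: 'a' appears twice, 'b' and 'c' once each, which mirrors what `sum(x[1])` reports for the real data.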

In [46]:
print('The first label as vector:')
print(y[0])

The first label as vector:
[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False  True False False False False False False False False False False
 False]


Each example in the label set has shape (61,), one entry per unique character in the corpus. Since the model predicts the next character given the previous 40, each label is a one-hot boolean vector of length 61: every entry is False except the one at the target character's index. In the example above, the target is 'o', which maps to index 49, so only position 49 of the first label is True.
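
Decoding a label is the reverse operation: find the position of the single True entry and look it up in the index-to-character table. A minimal sketch with a toy alphabet (the names are illustrative, not from the notebook):

```python
# Toy alphabet (illustrative); the notebook uses `chars` and `indices_char`
alphabet = ['\n', ' ', 'a', 'o', 't']
index_to_char = {i: c for i, c in enumerate(alphabet)}

# One-hot label whose target character is 'o'
label = [False, False, False, True, False]

# Position of the only True entry, then map it back to a character
target_index = label.index(True)
target_char = index_to_char[target_index]

print(target_index, repr(target_char))
```

On the NumPy arrays used in this notebook, the equivalent lookup would be `indices_char[int(np.argmax(y[0]))]`.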

## Model Training

In [71]:
model = Sequential()
model.add(GRU(128, input_shape=(max_len_chars, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

# `lr` is deprecated in recent TensorFlow/Keras releases; use `learning_rate`
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.fit(x, y, batch_size=128, epochs=10)

model.save("poem_gen_model.h5")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [72]:
type(model)

tensorflow.python.keras.engine.sequential.Sequential

## Generating Text

In [78]:
from tensorflow.keras.models import load_model
from typing import List

# Reload the model saved in the training step
model = load_model('poem_gen_model.h5')


def sample(preds, temperature=1.0):
    """helper function to sample an index from a probability array
    using the multinomial distribution"""
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def generate_poem(model: Sequential, num_chars_to_generate: int = 400,
                  max_len_chars: int = 40, chars: List[str] = chars):
    """Generate text with a trained model, seeded with a random
    excerpt from the Sonnets."""
    start_index = random.randint(0, len(text) - max_len_chars - 1)
    sentence = text[start_index: start_index + max_len_chars]
    generated = sentence
    print("Seed sentence: {}".format(generated))
    for i in range(num_chars_to_generate):
        # One-hot encode the current input window
        x_pred = np.zeros((1, max_len_chars, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, 1)
        next_char = indices_char[next_index]
        generated += next_char
        # Slide the window: drop the oldest character, append the new one
        sentence = sentence[1:] + next_char
    return generated

generate_poem(model, 400)

Seed sentence: e, and born of thee,
In others' works th


"e, and born of thee,\nIn others' works that I love hath thy sear,\nxThe dreams, I, leavy the itliving it and rine,\nBut for me eveisy catel did is thus the hrows\nIfty I (death in venges it forbead morred:\nShall oncemy praisest both hate ill gardens:\nFor youth be thinute him thy constca didring.\nNo not strength that eyes I day misprits.\n\nAd so alt sinfent in these facoust, see.\nFor look on my stance no tongue coof ill?\nI love will a mear's "

### Conclusion

The trained model does not yet do a good job of reproducing the style of the Sonnets. Some words are simply invented, such as "oncemy" and "thinute", and the sentences are not coherent. A few things could improve the quality of the generated text, for example:
- Stacking more layers,
- Tweaking the temperature variable,
- Training for more epochs.
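
To make the temperature suggestion concrete, the sketch below applies the same rescaling that `sample` performs (take logs, divide by the temperature, exponentiate, renormalize) to a made-up three-way distribution. It uses only the standard library; `apply_temperature` and the example probabilities are assumptions for illustration.

```python
import math

def apply_temperature(probs, temperature=1.0):
    """Rescale a probability distribution by a temperature.

    Mirrors the log / divide / exp / normalize steps of `sample`,
    without the random draw. Assumes every probability is > 0.
    """
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

# Made-up next-character probabilities (illustrative only)
probs = [0.7, 0.2, 0.1]

for temperature in (0.5, 1.0, 2.0):
    rescaled = apply_temperature(probs, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

Temperatures below 1 sharpen the distribution, so the most likely character is picked more often and the output becomes more conservative; temperatures above 1 flatten it, which produces more varied but more error-prone text. `generate_poem` calls `sample(preds, 1)`, which leaves the model's distribution unchanged.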