# Perth Machine Learning Group Poem Generator

## Introduction

The following code uses a GRU (gated recurrent unit) network to generate poems. It reads through a corpus of poems and learns sequences of characters, including line breaks and titles.

In short, it observes many sequences of characters and infers the character that should come next. For instance, it guesses that the letter 's' should come after 'The cat eat'.

Further details will be given with the code.

## The code

### Data exploration

In [2]:
import tensorflow as tf  # version 1.9 or above
tf.enable_eager_execution()  # runs operations immediately, as the notebook executes them. By default, TensorFlow 1.x first builds a graph of the whole computation and runs it later, for efficiency.

import numpy as np
import re
import random
import unidecode
import time

In [3]:
path_to_file = 'poem_corpus.txt'

In [5]:
with open(path_to_file) as corpus_file:
    text = unidecode.unidecode(corpus_file.read())  # transliterates non-ASCII characters to plain ASCII
print(text[:500])

          CHRISTMAS NIGHT.


    Be peace on earth, good will to men;
      And let this now our carol be:
      If on the land, or on the sea,
    We still will sing the glad refrain;
      And in the closing light of day
      Good words of peace and cheer will say.

    The Babe that in the manger born
      Has risen high above the star,
      To judge in peace, or judge in war,
    To judge at night or judge at morn.
      The star that told us of his birth
      Has given us joy and lastin


### Dataset creation

In [6]:
unique = sorted(set(text))  # unique contains all the unique characters in the corpus

char2idx = {u:i for i, u in enumerate(unique)}  # maps characters to indexes
idx2char = {i:u for i, u in enumerate(unique)}  # maps indexes to characters
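
To make the two mappings concrete, here is a minimal sketch on a toy string. The names `sample`, `unique_chars`, `char_to_idx` and `idx_to_char` are made up for this illustration and stand in for `text`, `unique`, `char2idx` and `idx2char`.

```python
# Toy string standing in for the poem corpus (hypothetical sample).
sample = "Be peace on earth"

unique_chars = sorted(set(sample))
char_to_idx = {ch: i for i, ch in enumerate(unique_chars)}  # character -> index
idx_to_char = {i: ch for i, ch in enumerate(unique_chars)}  # index -> character

encoded = [char_to_idx[ch] for ch in sample]        # text -> list of integers
decoded = "".join(idx_to_char[i] for i in encoded)  # integers -> text

print(encoded)
print(decoded == sample)  # the two mappings are inverses of each other
```

The network only ever sees the integers; the reverse mapping is needed again at generation time to turn predicted indexes back into characters.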

In [7]:
max_length = 100  # length (in characters) of each input sequence fed to the network
vocab_size = len(unique)
embedding_dim = 256  # number of 'meaningful' features to learn. Ex: ['queen', 'king', 'man', 'woman'] needs at least 2 embedding dimensions: royalty and gender.
units = 1024  # number of GRU units, i.e. the size of the hidden state
BATCH_SIZE = 64
BUFFER_SIZE = 10000

In [8]:
input_text = []
target_text = []

for f in range(0, len(text) - max_length, max_length):
    inps = text[f : f + max_length]
    targ = text[f + 1 : f + 1 + max_length]
    input_text.append([char2idx[i] for i in inps])
    target_text.append([char2idx[t] for t in targ])

In [13]:
dataset = tf.data.Dataset.from_tensor_slices((input_text, target_text)).shuffle(BUFFER_SIZE)
dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(BATCH_SIZE))

Instructions for updating:
Use `tf.data.Dataset.batch(..., drop_remainder=True)`.
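
As a rough picture of what shuffling and batching with a dropped remainder does, here is a pure-Python sketch. It is not the `tf.data` implementation: the helper `shuffled_batches` is hypothetical, and it shuffles the whole list at once, whereas `shuffle(BUFFER_SIZE)` shuffles through a fixed-size buffer.

```python
import random

def shuffled_batches(examples, batch_size, seed=0):
    """Shuffle the examples, then yield full batches only (the remainder is dropped)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_full = len(shuffled) // batch_size  # number of complete batches
    for b in range(n_full):
        yield shuffled[b * batch_size:(b + 1) * batch_size]

batches = list(shuffled_batches(range(10), batch_size=4))
print(len(batches))               # 10 // 4 complete batches
print([len(b) for b in batches])  # every batch has exactly batch_size items
```

Keeping every batch the same size matters in this notebook because the hidden state is passed from one batch to the next, so its batch dimension has to stay constant.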


### Explanation

Strictly speaking, the algorithm does not only learn which character comes next. It takes sequences of characters as inputs (ex: 'abcd') and predicts sequences shifted by one character as outputs (ex: 'bcde').

Why?

During the training phase, it learns more than just the next character. Every character of the input sequence has its own target in the output sequence, so the weights are updated from a prediction at each position, not only at the last one.

> Consider the sequences 'abcd', 'bcde', 'cdef', 'defg': the letter "d" is treated differently each time, depending on the characters that precede it.

These updates help the network predict the following sequences better, and so on. So it learns not only the next character but also the weights that link each character to its predecessors.
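
The input/target construction can be sketched on a toy string. This mirrors the loop of the dataset cell, with `toy_text` and `window` as made-up stand-ins for `text` and `max_length`.

```python
toy_text = "abcdefghij"
window = 4  # plays the role of max_length

pairs = []
for start in range(0, len(toy_text) - window, window):
    inp = toy_text[start:start + window]           # e.g. 'abcd'
    targ = toy_text[start + 1:start + 1 + window]  # same slice shifted by one, e.g. 'bcde'
    pairs.append((inp, targ))

for inp, targ in pairs:
    print(inp, "->", targ)
    # One training signal per position: after the prefix inp[:k + 1] comes targ[k].
    for k in range(window):
        print("   ", repr(inp[:k + 1]), "is followed by", repr(targ[k]))
```

Each pair therefore yields `window` next-character predictions rather than a single one.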

The next chunk of code is optional.

In [20]:
# example of input:
print('Given the following sequence: \n\n')
print(''.join(idx2char[i] for i in input_text[14]))
print('\n\n')
print('the network has to learn that a correct continuation is: \n')
# example of output the algorithm has to learn
print(''.join(idx2char[i] for i in target_text[14]))

Given the following sequence: 


ew over the land,
      And the country was wild with glee;
    And she stilled the wave in the stor



the network has to learn that a correct continuation is: 

w over the land,
      And the country was wild with glee;
    And she stilled the wave in the storm


### Model

In [14]:
class Model(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, units, batch_size):
    super(Model, self).__init__()
    self.units = units
    self.batch_sz = batch_size
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    if tf.test.is_gpu_available():
      self.gru = tf.keras.layers.CuDNNGRU(self.units, 
                                          return_sequences=True, 
                                          return_state=True, 
                                          recurrent_initializer='glorot_uniform')
    else:
      self.gru = tf.keras.layers.GRU(self.units, 
                                     return_sequences=True, 
                                     return_state=True, 
                                     recurrent_activation='sigmoid', 
                                     recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)
        
  def call(self, x, hidden):
    x = self.embedding(x)
    output, states = self.gru(x, initial_state=hidden)
    output = tf.reshape(output, (-1, output.shape[2]))
    x = self.fc(output)
    return x, states
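
To follow what `call` does to the data, it can help to trace the tensor shapes step by step. The sketch below only computes shapes, with no TensorFlow involved. The helper `trace_shapes` is hypothetical, and `vocab_size=80` is a made-up value (the real one is `len(unique)`).

```python
def trace_shapes(batch_size, seq_len, embedding_dim, units, vocab_size):
    """Return the tensor shape after each step of Model.call (shapes only, no computation)."""
    shapes = {}
    shapes["input ids"] = (batch_size, seq_len)
    shapes["embedding"] = (batch_size, seq_len, embedding_dim)
    shapes["gru output"] = (batch_size, seq_len, units)     # return_sequences=True
    shapes["gru state"] = (batch_size, units)               # return_state=True
    shapes["reshaped"] = (batch_size * seq_len, units)      # tf.reshape(output, (-1, units))
    shapes["logits"] = (batch_size * seq_len, vocab_size)   # Dense(vocab_size)
    return shapes

# Values from the notebook, except vocab_size, which is assumed for the example.
for name, shape in trace_shapes(64, 100, 256, 1024, 80).items():
    print(name, shape)
```

The final reshape flattens batch and time into one axis, so the model returns one row of scores per character position.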

In [15]:
model = Model(vocab_size, embedding_dim, units, BATCH_SIZE)

In [16]:
optimizer = tf.train.AdamOptimizer()

In [17]:
def loss_function(real, preds):
    return tf.losses.sparse_softmax_cross_entropy(labels=real, logits=preds)
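
For intuition about the loss, here is a pure-Python sketch of sparse softmax cross-entropy, averaged over positions. It illustrates the formula and is not TensorFlow's implementation; the tiny vocabulary and the logits are made up.

```python
import math

def sparse_softmax_cross_entropy(labels, logits):
    """Mean of -log(softmax(logits)[label]) over all positions."""
    total = 0.0
    for label, row in zip(labels, logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[label]  # equals -log(softmax(row)[label])
    return total / len(labels)

vocab = 5  # hypothetical tiny vocabulary
labels = [0, 1, 2]

# A model that spreads probability evenly over the vocabulary.
uniform_logits = [[0.0] * vocab for _ in labels]
loss_uniform = sparse_softmax_cross_entropy(labels, uniform_logits)

# A model that puts a high score on the correct character.
confident_logits = [[10.0 if i == label else 0.0 for i in range(vocab)] for label in labels]
loss_confident = sparse_softmax_cross_entropy(labels, confident_logits)

print(loss_uniform, math.log(vocab))  # an even spread gives log(vocab)
print(loss_confident)                 # close to zero
```

A model with no preference among characters scores `log(vocab_size)`, which gives a scale for reading the first loss values printed during training.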

### Training

In [None]:
n_epochs = 30

for epoch in range(n_epochs):
    start = time.time()
    hidden = model.reset_states()  # reset_states() returns None, so the GRU starts every epoch from its default (zero) state

    for (batch, (inp, target)) in enumerate(dataset):
        with tf.GradientTape() as tape:
            predictions, hidden = model(inp, hidden)  # feeds the hidden state back into the model
            target = tf.reshape(target, (-1, ))  # flattens the targets to match the flattened predictions
            loss = loss_function(target, predictions)

        grads = tape.gradient(loss, model.variables)
        optimizer.apply_gradients(zip(grads, model.variables), global_step=tf.train.get_or_create_global_step())

        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch, loss))

    print('Epoch {} Loss {:.4f}'.format(epoch + 1, loss))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 4.5975
Epoch 1 Batch 100 Loss 2.1609
Epoch 1 Batch 200 Loss 1.9387
Epoch 1 Loss 1.8163
Time taken for 1 epoch 1007.0127582550049 sec

Epoch 2 Batch 0 Loss 1.7427
Epoch 2 Batch 100 Loss 1.7149
Epoch 2 Batch 200 Loss 1.6851
Epoch 2 Loss 1.6786
Time taken for 1 epoch 1009.5401530265808 sec

Epoch 3 Batch 0 Loss 1.5864
Epoch 3 Batch 100 Loss 1.5735
Epoch 3 Batch 200 Loss 1.5432
Epoch 3 Loss 1.5330
Time taken for 1 epoch 1008.4457356929779 sec

Epoch 4 Batch 0 Loss 1.4756
Epoch 4 Batch 100 Loss 1.5105
Epoch 4 Batch 200 Loss 1.4949
Epoch 4 Loss 1.4980
Time taken for 1 epoch 1010.7342929840088 sec

Epoch 5 Batch 0 Loss 1.3859
Epoch 5 Batch 100 Loss 1.4656
Epoch 5 Batch 200 Loss 1.4051
Epoch 5 Loss 1.4314
Time taken for 1 epoch 1012.4246871471405 sec

Epoch 6 Batch 0 Loss 1.3024
Epoch 6 Batch 100 Loss 1.3982
Epoch 6 Batch 200 Loss 1.3920
Epoch 6 Loss 1.4050
Time taken for 1 epoch 1009.992424249649 sec

Epoch 7 Batch 0 Loss 1.2550
Epoch 7 Batch 100 Loss 1.3588
Epoch 7 Batch

 ...
 
The model was trained on Paperspace. 5 epochs are missing due to a mediocre Internet connection.

Anyway, it is enough to generate some text with the model.

### Text generation

In [38]:
num_generate = 1000  # number of characters to generate
start_string = 'The child'  # beginning of the generated text. TODO: try start_string = ' '

input_eval = [char2idx[s] for s in start_string]  # converts start_string to numbers the model understands
input_eval = tf.expand_dims(input_eval, 0)  # adds a batch dimension: shape (1, len(start_string))

text_generated = ''

temperature = 0.97  # lower values stay closer to the corpus (more predictable text); higher values give more surprising text

hidden = [tf.zeros((1, units))]
for i in range(num_generate):
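    predictions, hidden = model(input_eval, hidden)  # predictions holds one score (logit) per character, for every position of input_eval

    predictions = predictions / temperature  # alters characters' probabilities to be picked (but keeps the order)
    # tf.multinomial expects logits (unnormalized log-probabilities), so they are passed without tf.exp;
    # [-1, 0] keeps the sample drawn for the last position, i.e. the character that follows the whole input
    predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()  # picks the next character for the generated text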
    
    input_eval = tf.expand_dims([predicted_id], 0)
    text_generated += idx2char[predicted_id]  # appends the sampled character to the generated text

print(start_string + text_generated)

The childhood rose that we may do,
And now the country seat her shall he died,
  And her shadow of regret?

If you can dress your head and stone be forget--
  That were coming home from the corner of her eye.

He did one that ever one to the summer of the rain!

She was wanting from the pine,
  And the song of his lower lay,
While I am sad and sent the truest,
  And one clothe lads that waits
  Of the black men doth clouds the stars,
      And the stars have broken the stars
      Stood on the brook, the bridge is passing through the starry skies.

    When the sun was clear, and the thing to be true
    A blaze in the stream of the steed,
      And the stars have broken the stairs,
    And the clearing and the long bright morning stair,
    And one that makes the dark bells they seemed to say:
     "But one song of the crowd.
            The stars come and the straw,
      And cold and still their welcome home.

    All the cold work that has lured the sea,
    And the clock stood cal
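
The effect of the temperature can be seen on a toy example. The sketch below is a standalone pure-Python illustration of temperature sampling in general, not the notebook's TensorFlow code. The helpers `softmax_with_temperature` and `sample_index`, and the three logits, are made up.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng):
    """Draw one index at random, weighted by the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three characters
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])

rng = random.Random(0)
print(sample_index(toy_logits, 0.97, rng))
```

A low temperature concentrates the probability on the highest-scoring character, while a high temperature flattens the distribution. The ranking of the characters stays the same in both cases.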

## Conclusion

That's promising:
* It spells words correctly.
* There is some structure (line breaks).
* It found a punctuation rule.

Easy-to-fix issue: indents. The corpus itself is inconsistent in that regard, so the fact that the model mimics the indents is actually good news.

Harder-to-fix issue: sentences make little sense. Further training may be enough, and tuning the hyperparameters may also help.