# Perth Machine Learning Group Poem Generator

## Introduction

The following code uses a GRU (gated recurrent unit) network to generate poems. It reads through a corpus of poems and learns sequences of characters, including line breaks and titles.

In short, it observes many sequences of characters and infers the character that should come next. For instance, it guesses that 'The cat eat' should be followed by the letter 's'.

Further details will be given with the code.

## The code

### Data exploration

In [None]:
import tensorflow as tf  # version 1.9 or above
tf.enable_eager_execution()  # Runs operations immediately, as the notebook executes them. By default, TensorFlow 1.x builds the whole graph first and runs it later for efficiency.

import numpy as np
import re
import random
import unidecode
import time

In [None]:
path_to_file = 'data/erotic_corpus.txt'

In [None]:
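with open(path_to_file) as f:  # closes the file once it has been read
    text = unidecode.unidecode(f.read())  # transliterates non-ASCII characters to ASCII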
print(text[:500])

### Dataset creation

In [None]:
unique = sorted(set(text))  # unique contains all the unique characters in the corpus

char2idx = {u:i for i, u in enumerate(unique)}  # maps characters to indexes
idx2char = {i:u for i, u in enumerate(unique)}  # maps indexes to characters
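
As a quick sanity check of the two lookup tables, the sketch below builds them for a toy string and verifies that encoding followed by decoding returns the original text. The names `sample`, `unique_chars`, `toy_char2idx` and `toy_idx2char` are hypothetical stand-ins used only for this illustration; the notebook itself works on `text`.

```python
# Toy stand-in for the corpus (hypothetical; the notebook uses `text`).
sample = 'hello world'

# Same construction as above, applied to the toy string.
unique_chars = sorted(set(sample))
toy_char2idx = {u: i for i, u in enumerate(unique_chars)}
toy_idx2char = {i: u for i, u in enumerate(unique_chars)}

# Encode the characters as integers, then decode them back.
encoded = [toy_char2idx[c] for c in sample]
decoded = ''.join(toy_idx2char[i] for i in encoded)

print(unique_chars)
print(encoded)
print(decoded)
```

Because the characters are sorted before being numbered, the mapping is deterministic for a given corpus.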

In [None]:
max_length = 100  # number of characters in each input sequence fed to the network
vocab_size = len(unique)
embedding_dim = 256  # number of 'meaningful' features to learn. Ex: ['queen', 'king', 'man', 'woman'] needs at least 2 embedding dimensions: royalty and gender.
units = 1024  # number of GRU units, i.e. the size of the hidden state and of each output vector
BATCH_SIZE = 64
BUFFER_SIZE = 10000

In [None]:
input_text = []
target_text = []

for f in range(0, len(text) - max_length, max_length):
    inps = text[f : f + max_length]
    targ = text[f + 1 : f + 1 + max_length]
    input_text.append([char2idx[i] for i in inps])
    target_text.append([char2idx[t] for t in targ])
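
To make the slicing concrete, here is a minimal sketch of the same loop on a ten-character toy string, kept as characters rather than indexes so the one-character shift is visible. The helper name `make_pairs` is hypothetical and only serves this illustration.

```python
def make_pairs(text, max_length):
    """Cut text into non-overlapping inputs and their one-character-shifted targets."""
    pairs = []
    for f in range(0, len(text) - max_length, max_length):
        inps = text[f:f + max_length]
        targ = text[f + 1:f + 1 + max_length]
        pairs.append((inps, targ))
    return pairs


toy_pairs = make_pairs('abcdefghij', 4)
for inps, targ in toy_pairs:
    print(inps, '->', targ)
```

Note that the trailing characters that do not fill a complete window (here 'ij') are not used as input.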

In [None]:
dataset = tf.data.Dataset.from_tensor_slices((input_text, target_text)).shuffle(BUFFER_SIZE)
dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(BATCH_SIZE))

### Explanation

Strictly speaking, the algorithm does not learn only which character comes next. It takes sequences of characters as inputs (e.g. 'abcd') and predicts the same sequences shifted by one character as outputs (e.g. 'bcde').

Why?

During the training phase, it learns more than just the next character. It updates the weights for each character position, from the input sequence to the output sequence.

> Consider the sequences 'abcd', 'bcde', 'cdef' and 'defg': the letter "d" is given different weights depending on the characters that precede it.

These updates help it predict the following sequences better, and so on. So it learns not only the next character, but also the weights for every position in the sequence.

The next chunk of code is optional.

In [None]:
# example of input:
print('Given the following sequence: \n\n')
print(''.join(idx2char[i] for i in input_text[14]))
print('\n\n')
print('the network has to learn that a correct continuation is: \n')
# example of output the algorithm has to learn
print(''.join(idx2char[i] for i in target_text[14]))

### Model

In [None]:
class Model(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, units, batch_size):
    super(Model, self).__init__()
    self.units = units
    self.batch_sz = batch_size
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    if tf.test.is_gpu_available():
      self.gru = tf.keras.layers.CuDNNGRU(self.units, 
                                          return_sequences=True, 
                                          return_state=True, 
                                          recurrent_initializer='glorot_uniform')
    else:
      self.gru = tf.keras.layers.GRU(self.units, 
                                     return_sequences=True, 
                                     return_state=True, 
                                     recurrent_activation='sigmoid', 
                                     recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)
        
  def call(self, x, hidden):
    x = self.embedding(x)
    output, states = self.gru(x, initial_state=hidden)
    output = tf.reshape(output, (-1, output.shape[2]))
    x = self.fc(output)
#     x = tf.log_sigmoid(x)
    return x, states

In [None]:
model = Model(vocab_size, embedding_dim, units, BATCH_SIZE)

In [None]:
optimizer = tf.train.AdamOptimizer()

In [None]:
def loss_function(real, preds):
    return tf.losses.sparse_softmax_cross_entropy(labels=real, logits=preds)
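
For intuition about what this loss measures, the following is a plain-Python sketch of sparse softmax cross-entropy. It assumes the loss is averaged over all positions, which is our reading of the TensorFlow default reduction. The function and variable names are hypothetical, and this is not the TensorFlow implementation.

```python
import math


def sparse_softmax_cross_entropy(labels, logits):
    """Mean of -log softmax(logits)[label] over all rows."""
    total = 0.0
    for label, row in zip(labels, logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[label]
    return total / len(labels)


# Two positions, three possible characters.
uniform = sparse_softmax_cross_entropy([0, 2], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
confident = sparse_softmax_cross_entropy([0, 2], [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]])

print(uniform)    # undecided predictions: loss equals log(3)
print(confident)  # confident and correct predictions: loss close to 0
```

The labels are integer character indexes rather than one-hot vectors, which is why the flattened `target` can be passed directly.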

### Training

In [None]:
n_epochs = 30

for epoch in range(n_epochs):
    start = time.time()
    hidden = model.reset_states()  # returns None, so the GRU starts each epoch from a zero hidden state

    for (batch, (inp, target)) in enumerate(dataset):
        with tf.GradientTape() as tape:
            predictions, hidden = model(inp, hidden)  # feeds the hidden state back into the model
            target = tf.reshape(target, (-1, ))  # flattens the targets to match the reshaped predictions
            loss = loss_function(target, predictions)

        grads = tape.gradient(loss, model.variables)
        optimizer.apply_gradients(zip(grads, model.variables), global_step=tf.train.get_or_create_global_step())

        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch, loss))

    print('Epoch {} Loss {:.4f}'.format(epoch + 1, loss))  # loss of the last batch of the epoch
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))


### Text generation

In [None]:
from IPython.display import HTML, display

m = 1.2
lighten = lambda x: 1 - (1 / m - x / m)  # maps a probability in [0, 1] to an opacity in [1 - 1/m, 1]

In [None]:
start_strings = [
    'magnificent', 
    'bodice', 
    'secret',
    'nuns',
    'belong',
    'by God',
    'all overcome', 
    'his eyes', 
    'why',
    'yielding',
    'flutter',
    'wives',
    'swooned', 
    'Mrs',
    'delicious',
    'loins',
    'body',
    'greatest',
    'alone',
    'merely',
    'Spontaneously',
    'frightful',
    'intimate',
    'clinging',
    'neglect',
    'gushed',
    'grace',
    'drunk',
    'panting',
    'honey',
    
]

html_generateds = []
for start_string in start_strings:
    num_generate = 200  # number of characters to generate
#     start_string = 'The lady'  # beginning of the generated text. TODO: try start_string = ' '

    input_eval = [char2idx[s] for s in start_string]  # converts start_string to numbers the model understands
    input_eval = tf.expand_dims(input_eval, 0)  # adds a batch dimension: shape (1, len(start_string))

    text_generated = ''
    html_generated = ''
    text_generated += start_string
    html_generated += '<b>{}</b>'.format(start_string)

    temperature = 0.9  # larger is more creative
    hidden = [tf.zeros((1, units))]
    for i in range(num_generate):
        predictions, hidden = model(input_eval, hidden)  # predictions holds the logits (unnormalized scores) for each candidate next character

        predictions = predictions / temperature  # alters the characters' probabilities of being picked (but keeps their order)
        # samples the next character from the last position; tf.multinomial expects logits, not probabilities
        predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()
        predicted_probs = tf.nn.softmax(logits=predictions)[-1, predicted_id].numpy()

        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated += idx2char[predicted_id]  # appends
        html_generated+='<span style="color:rgba(0,0,0,{a:1.1f})">{c:}</span>'.format(
            c=idx2char[predicted_id],
            a=lighten(predicted_probs))

    text_generated += '\n\n'
    html_generated += '<p/><hr/>'
    
    display(HTML(html_generated))
    html_generateds.append(html_generated)
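
The effect of `temperature` can be checked without TensorFlow. The sketch below is a plain-Python illustration under our own assumptions, with hypothetical helper names. It divides toy logits by a temperature, turns them into probabilities with a softmax, and samples one index, which mirrors the scale-then-sample step of the generation loop.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Probabilities obtained from logits divided by the temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one index at random, weighted by the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three characters
cold = softmax_with_temperature(toy_logits, 0.5)  # sharper distribution
hot = softmax_with_temperature(toy_logits, 2.0)   # flatter distribution

rng = random.Random(0)  # seeded so the draw is repeatable
picked = sample_index(toy_logits, 0.9, rng)

print(cold)
print(hot)
print(picked)
```

A low temperature concentrates probability on the highest-scoring character, while a high temperature spreads it out. In both cases the ranking of the characters is unchanged.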


In [None]:
html="""
    <html><body>
    {}
    </body>
    </html>
    """.format('<hr/>'.join(html_generateds))

display(HTML(html))
with open('outputs/erotics_predictions.html', 'w') as fo:
    fo.write(html)