<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. A character-level model is the simplest option and is the suggested starting point.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal - a function that can take, as an argument, the size of text (e.g. number of characters or lines) to generate, and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!
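One way to get that tighter feedback loop is to train on a contiguous slice of the corpus first. The helper below is only a sketch: `take_sample`, `max_chars`, and `start` are hypothetical names that do not appear elsewhere in this notebook, and the demo string stands in for the real text.

```python
def take_sample(text, max_chars=200_000, start=0):
    """Return a contiguous slice of `text` that is at most `max_chars` long.

    A contiguous slice (rather than randomly picked characters) keeps the
    character order intact, which a character-level RNN depends on.
    """
    if max_chars < 0 or start < 0:
        raise ValueError("max_chars and start must be non-negative")
    return text[start:start + max_chars]


# Stand-in corpus for illustration; in the notebook this would be the
# Shakespeare text loaded from disk.
demo_text = "From fairest creatures we desire increase,\n" * 1000
small = take_sample(demo_text, max_chars=500)
print(len(small))
```

Once the pipeline runs end to end on the small slice, raise `max_chars` (or drop the helper) to train on the full text.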

In [221]:
import tensorflow as tf

import numpy as np
import os
import time
import re

In [222]:
path_to_shakespeare = tf.keras.utils.get_file('shakespeare.txt', 'https://www.gutenberg.org/files/100/100-0.txt')

In [223]:
with open(path_to_shakespeare, 'rb') as f:
    text = f.read().decode(encoding='utf-8')

In [224]:
# count number of chars
len(text)

5740054

In [225]:
def clean_shakespeare(text):
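    # Extract the table of contents. The hard-coded offsets assume this
    # exact Project Gutenberg file (100-0.txt) and break if the file changes.
    contents = text[974:2893].replace("\r", "").replace("\n", "")
    contents = re.sub(r"\s{2,}", ",,", contents)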
    contents = contents.split(",,")
    
    for idx,book in enumerate(contents):
        if book == "THE LIFE OF KING HENRY THE FIFTH":
            contents[idx] = "THE LIFE OF KING HENRY V"
        elif book == "THE TRAGEDY OF MACBETH":
            contents[idx] = "MACBETH"
        elif book == "THE TRAGEDY OF OTHELLO, MOOR OF VENICE":
            contents[idx] = "OTHELLO, THE MOOR OF VENICE"
        elif book == "TWELFTH NIGHT; OR, WHAT YOU WILL":
            contents[idx] = "TWELFTH NIGHT: OR, WHAT YOU WILL"
    
    # remove the project gutenberg info
    text = text[2893:-21529]
    
    shakespeare_dict = {}
    
    for idx,book in enumerate(contents):
        strpos = text.find(book) + len(book)
        
        if idx + 1 != len(contents):
            next_book = contents[idx + 1]
            endpos = text.find(next_book, strpos)
            shakespeare_dict[book] = text[strpos:endpos]
        else:
            shakespeare_dict[book] = text[strpos:]
    
    return shakespeare_dict, text

In [226]:
books, text = clean_shakespeare(text)

In [227]:
vocab = sorted(set(text))
len(vocab)

102

In [228]:
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])
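The two lookup tables are inverses of each other, so encoding and then decoding should return the original string. The sketch below illustrates this on a short sample using plain Python; `vocab_demo`, `char_to_index`, and `index_to_char` are illustrative names chosen so they do not shadow the notebook's own `vocab`, `char2idx`, and `idx2char`.

```python
sample = "To be, or not to be"

# Same construction as above, on a tiny vocabulary
vocab_demo = sorted(set(sample))
char_to_index = {ch: i for i, ch in enumerate(vocab_demo)}
index_to_char = list(vocab_demo)

# Encode characters to integer ids, then decode back to a string
encoded = [char_to_index[ch] for ch in sample]
decoded = "".join(index_to_char[i] for i in encoded)

print(encoded[:5])
print(decoded == sample)
```

Note that a character missing from the vocabulary raises a `KeyError` during encoding, so any seed string used later for generation has to be made of characters that occur in the training text.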

In [229]:
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

In [230]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

In [231]:
for input_example, target_example in dataset.take(1):
    print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  '\r\n\r\n\r\n\r\n\r\n\r\n\r\nTHE SONNETS\r\n\r\n                    1\r\n\r\nFrom fairest creatures we desire increase,\r\nTh'
Target data: '\n\r\n\r\n\r\n\r\n\r\n\r\nTHE SONNETS\r\n\r\n                    1\r\n\r\nFrom fairest creatures we desire increase,\r\nTha'


In [232]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 2 ('\r')
  expected output: 1 ('\n')
Step    1
  input: 1 ('\n')
  expected output: 2 ('\r')
Step    2
  input: 2 ('\r')
  expected output: 1 ('\n')
Step    3
  input: 1 ('\n')
  expected output: 2 ('\r')
Step    4
  input: 2 ('\r')
  expected output: 1 ('\n')


In [233]:
BATCH_SIZE = 64

BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
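As a sanity check, the number of training steps per epoch can be worked out by hand from numbers already in this notebook: the raw character count printed earlier, the two slice offsets in `clean_shakespeare`, `seq_length`, and `BATCH_SIZE`. The variable names below are only for this illustration. Both `drop_remainder=True` calls discard the incomplete final chunk, which is why integer division is used.

```python
raw_chars = 5_740_054                  # len(text) before cleaning (cell output above)
head_trim, tail_trim = 2_893, 21_529   # slice offsets used in clean_shakespeare
seq_length = 100
batch_size = 64

# Characters left after trimming the front and back of the file
clean_chars = raw_chars - head_trim - tail_trim

# Each training example is seq_length + 1 characters (input plus shifted target)
num_sequences = clean_chars // (seq_length + 1)

# Batches per epoch, dropping the incomplete last batch
steps_per_epoch = num_sequences // batch_size

print(clean_chars, num_sequences, steps_per_epoch)  # 5715632 56590 884
```

The result matches the `Train for 884 steps` line in the training log.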

In [234]:
vocab_size = len(vocab)

embedding_dim = 256

rnn_units = 1024

In [235]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

In [236]:
model = build_model(
    vocab_size = len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)

In [237]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
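To make the loss less of a black box, here is a minimal pure-Python sketch of what sparse categorical cross-entropy computes for a single time step when given raw logits: the negative log of the softmax probability assigned to the true class. The function name is hypothetical, and this is an illustration of the formula, not the TensorFlow implementation.

```python
import math


def sparse_crossentropy_from_logits(label, logits):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    # Log-sum-exp with the maximum subtracted for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# With identical logits for all 102 characters the prediction is uniform,
# so the loss equals ln(102) regardless of the label.
uniform_loss = sparse_crossentropy_from_logits(0, [0.0] * 102)
print(uniform_loss, math.log(102))
```

This gives a useful reference point: a model that has learned nothing and predicts roughly uniformly over the 102-character vocabulary should report a loss near ln(102), and training should push the loss below that.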

In [238]:
model.compile(optimizer='adam', loss=loss)

In [239]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)


In [240]:
EPOCHS=30

In [241]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Train for 884 steps
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [242]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [243]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            26112     
_________________________________________________________________
gru_1 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
dense_1 (Dense)              (1, None, 102)            104550    
Total params: 4,068,966
Trainable params: 4,068,966
Non-trainable params: 0
_________________________________________________________________
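The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU uses the TensorFlow 2 default `reset_after=True`, which gives each of the three gates two bias vectors (one for the input kernel and one for the recurrent kernel); under that assumption the numbers match the summary exactly.

```python
vocab_size, embedding_dim, rnn_units = 102, 256, 1024

# Embedding: one embedding_dim-sized vector per character
embedding_params = vocab_size * embedding_dim

# GRU: 3 gates, each with an input kernel, a recurrent kernel,
# and two bias vectors (assumes reset_after=True)
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weights plus one bias per output character
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
# 26112 3938304 104550 4068966
```

Almost all of the parameters sit in the recurrent layer, so `rnn_units` is the setting to reduce first when a smaller, faster model is needed.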


In [244]:
def generate_text(model, start_string, num_generate=1000, temperature=1.0):
    """Return `start_string` followed by `num_generate` sampled characters.

    Low temperatures result in more predictable text; higher temperatures
    result in more surprising text. Experiment to find the best setting.
    """
    # Convert the start string to integer ids (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Generated characters are collected here
    text_generated = []

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # Remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # Sample the next character id from the categorical distribution
        # defined by the temperature-scaled logits
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # Pass the predicted character as the next input to the model;
        # the stateful GRU carries the previous hidden state forward
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return start_string + ''.join(text_generated)
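The effect of the `predictions / temperature` step in `generate_text` is easier to see on a toy example. `tf.random.categorical` samples from the softmax of the logits it receives, so dividing the logits by a temperature first reshapes that distribution. The pure-Python sketch below uses a hypothetical helper and made-up logits for three characters; it only illustrates the idea and is not part of the model.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three characters
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a temperature below 1 the most likely character takes a larger share of the probability mass, so samples become more repetitive; above 1 the distribution flattens and less likely characters are drawn more often.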

In [245]:
print(generate_text(model, start_string=u"A rose "))

A rose fire;
  But for the hair of mothers, in the burs,
    Others to any day 'tis well. Grant things can water,
    I desire you attate yourself to gaze,
  And yet I had a broad, and pity life,
  And to be taken foolery that I met have beat, sirreiver, have I poison’d;
Reignith in how sbear thy way; and from—
Cold of my service, the youth does of the sea
That makes us not set in my best worth:
Her age with her by papient Boy, in all my fool,
I hie the salder by my death, I naked,
Look do’t barnes. Bo true, my lord you must.

Possessel out of a demuner rect,
Nor couplete as Anne, but we are entereat’st
than young in comfort, my lord!

VIOLA.
I say.

RODERIGO.
There, lady, since my true-piontine,
    As clap away. Let's not ask solicit,
And the most unfatificuate Romans of honourablic you,
Yet long keeps in a bark up witchcreful) else
That I am sure
That measur’d strangeness labours should be foulor,
  Procostant many Fantastic have founded.
My sentence, Catca, and


# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training a RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN