<a href="https://colab.research.google.com/github/shengjiyang/DS-Unit-4-Sprint-3-Deep-Learning/blob/master/module1-rnn-and-lstm/LS_DS_431_RNN_and_LSTM_Lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal - a function that can take, as an argument, the size of text (e.g. number of characters or lines) to generate, and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [0]:
import requests
import pandas as pd

In [0]:
url = "https://www.gutenberg.org/files/100/100-0.txt"

r = requests.get(url)
r.encoding = r.apparent_encoding
data = r.text
data = data.split('\r\n')
toc = [l.strip() for l in data[44:130:2]]
# Skip the Table of Contents
data = data[135:]

# Fixing Titles
toc[9] = 'THE LIFE OF KING HENRY V'
toc[18] = 'MACBETH'
toc[24] = 'OTHELLO, THE MOOR OF VENICE'
toc[34] = 'TWELFTH NIGHT: OR, WHAT YOU WILL'

locations = {id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

# Record the line index where each title appears (a later match overwrites an earlier one)
for e,i in enumerate(data):
    for t,title in enumerate(toc):
        if title in i:
            locations[t].update({'start':e})
            

df_toc = pd.DataFrame.from_dict(locations, orient='index')
df_toc['end'] = df_toc['start'].shift(-1).apply(lambda x: x-1)
df_toc.loc[42, 'end'] = len(data)
df_toc['end'] = df_toc['end'].astype('int')

df_toc['text'] = df_toc.apply(lambda x: '\r\n'.join(data[ x['start'] : int(x['end']) ]), axis=1)

In [4]:
#Shakespeare Data Parsed by Play
df_toc.head()

| | title | start | end | text |
|---|---|---|---|---|
| 0 | ALL’S WELL THAT ENDS WELL | 2777 | 7738 | ALL’S WELL THAT ENDS WELL\r\n\r\n\r\n\r\nConte... |
| 1 | THE TRAGEDY OF ANTONY AND CLEOPATRA | 7739 | 11840 | THE TRAGEDY OF ANTONY AND CLEOPATRA\r\n\r\nDRA... |
| 2 | AS YOU LIKE IT | 11841 | 14631 | AS YOU LIKE IT\r\n\r\nDRAMATIS PERSONAE.\r\n\r... |
| 3 | THE COMEDY OF ERRORS | 14632 | 17832 | THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r... |
| 4 | THE TRAGEDY OF CORIOLANUS | 17833 | 27806 | THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers... |
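
The `end` column above is derived by shifting `start` up one row, so that each work ends on the line before the next one begins. A minimal sketch of that step on made-up numbers (the `toy` frame and the `total_lines` value are illustrative assumptions, not values from the corpus):

```python
import pandas as pd

# Hypothetical start lines for three works in a 200-line file.
toy = pd.DataFrame({"title": ["A", "B", "C"], "start": [10, 50, 120]})
total_lines = 200

# shift(-1) moves each following row's start up one position; subtracting 1
# gives the last line of the current work. The final row has no successor,
# so it is left as NaN at this point.
toy["end"] = toy["start"].shift(-1) - 1

# Fill the final row with the length of the file, then restore an integer dtype.
toy.loc[toy.index[-1], "end"] = total_lines
toy["end"] = toy["end"].astype(int)

print(toy)
```

The NaN introduced by `shift(-1)` is why the column is temporarily floating point and needs the final `astype(int)`.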


### Understanding the Lecture

Before beginning the actual assignment, I replicated the lesson code with the Shakespearean content for my own understanding. I have also added explanations of some of the NumPy functions used in the helper functions from JC's lesson code.

In [0]:
# Encoding the data

text = "".join(df_toc.text.values)

# Unique characters
chars = list(set(text))

# Lookup tables
# I had no idea that "dict comprehension" was a thing
char_int = {c:i for i, c in enumerate(chars)}
int_char = {i:c for i, c in enumerate(chars)}

In [6]:
len(chars)

106
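
A small round trip shows what the two lookup tables do. This sketch uses a short sample string and `sorted(set(...))`, an assumption made here so that the indices are reproducible between runs; the cell above uses an unsorted `set`, whose ordering can differ from one Python session to the next.

```python
sample_text = "to be or not to be"

# Sorting makes the character-to-index mapping deterministic.
sample_chars = sorted(set(sample_text))
sample_char_int = {c: i for i, c in enumerate(sample_chars)}
sample_int_char = {i: c for i, c in enumerate(sample_chars)}

# Encode to integers, then decode back to the original string.
sample_encoded = [sample_char_int[c] for c in sample_text]
sample_decoded = "".join(sample_int_char[i] for i in sample_encoded)

print(sample_chars)
print(sample_encoded[:5])
print(sample_decoded == sample_text)
```

If a trained model is saved and reloaded later, a deterministic mapping like this helps keep the saved weights and the lookup tables in agreement.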

In [7]:
maxlen = 40
step = 5

encoded = [char_int[c] for c in text]

sequences = []
next_char = []

for i in range(0, len(encoded) - maxlen, step):

  sequences.append(encoded[i : i + maxlen])
  next_char.append(encoded[i + maxlen])

# Sanity Check
(len(sequences), len(next_char))

(1127153, 1127153)
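
To see what the sliding window produces, here is the same loop on a tiny stand-in sequence. The `toy_` names and sizes are assumptions chosen for illustration only.

```python
toy_encoded = list(range(10))  # stand-in for the encoded text
toy_maxlen, toy_step = 4, 3

toy_sequences, toy_next = [], []
for i in range(0, len(toy_encoded) - toy_maxlen, toy_step):
    # Each window of toy_maxlen items is paired with the item that follows it.
    toy_sequences.append(toy_encoded[i : i + toy_maxlen])
    toy_next.append(toy_encoded[i + toy_maxlen])

# The number of windows is ceil((len - maxlen) / step).
expected_count = -(-(len(toy_encoded) - toy_maxlen) // toy_step)

print(toy_sequences)
print(toy_next)
print(expected_count)
```

A smaller `step` yields more, heavily overlapping training examples; a larger `step` yields fewer examples and a faster feedback loop.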

In [8]:
# Creating an X and y for our LSTM model to use.

import numpy as np

# One-hot encode the sequences: X[i, t, c] is True when the character with
# index c (from the char_int dictionary above) sits at position t of
# sequence i, and y[i, c] is True when character c is the one that
# follows sequence i.
# dtype=bool stores one byte per entry, which keeps these large arrays compact.
X = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

for i, sequence in enumerate(sequences):
  for t, char in enumerate(sequence):
    X[i, t, char] = 1

  y[i, next_char[i]] = 1

(X.shape, y.shape)

((1127153, 40, 106), (1127153, 106))
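
The nested loops above fill one entry at a time. As a sketch of an equivalent vectorised form on a toy example (the sizes and `toy_` names are assumptions), indexing an identity matrix produces the same one-hot layout:

```python
import numpy as np

toy_sequences = np.array([[0, 2, 1], [1, 1, 3]])  # 2 sequences, 3 time steps
toy_next = np.array([3, 0])
n_chars = 4

# Loop version, mirroring the cell above.
X_loop = np.zeros((len(toy_sequences), toy_sequences.shape[1], n_chars), dtype=bool)
for i, seq in enumerate(toy_sequences):
    for t, c in enumerate(seq):
        X_loop[i, t, c] = True

# Vectorised version: row c of the identity matrix is the one-hot vector for c.
eye = np.eye(n_chars, dtype=bool)
X_vec = eye[toy_sequences]
y_vec = eye[toy_next]

print(X_vec.shape, y_vec.shape)
print(np.array_equal(X_loop, X_vec))
```

At one byte per entry, the full `X` above occupies roughly 4.8 GB (1,127,153 × 40 × 106 bytes), which is one reason to start with a smaller sample of the text.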

In [0]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Building the text generation model

model = Sequential([
                    LSTM(128, input_shape=(maxlen, len(chars))),
                    Dense(len(chars), activation="softmax")
])

model.compile(loss="categorical_crossentropy", optimizer="adam")
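
For a sense of the model's size, the parameter count can be worked out by hand. This sketch assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; `model.summary()` should report the same totals.

```python
n_chars = 106   # vocabulary size reported by len(chars) above
units = 128     # LSTM units used in the model above

# Each of the 4 gates has (input_dim + units) * units weights plus units biases.
lstm_params = 4 * ((n_chars + units) * units + units)

# Dense layer: one weight per (unit, output) pair plus one bias per output.
dense_params = units * n_chars + n_chars

print(lstm_params, dense_params, lstm_params + dense_params)
```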

In [0]:
def sample(preds):
  """
  Samples an index from a probability array.
  """
  preds = np.asarray(preds).astype("float64")
  preds = np.log(preds) / 1
  # np.exp() raises Euler's number e to the power of the scaled log prediction(s).
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, preds, 1)

  return np.argmax(probas)
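
The division by 1 in `sample` is where a sampling temperature would normally go. The following is a sketch of that idea, not the lesson's code: the `temperature` and `rng` parameters are assumptions added here, and it draws the index with `Generator.choice` rather than `np.random.multinomial` plus `np.argmax`.

```python
import numpy as np

def sample_with_temperature(preds, temperature=1.0, rng=None):
    """Sample an index from a probability array, rescaled by temperature."""
    rng = np.random.default_rng() if rng is None else rng
    preds = np.asarray(preds, dtype="float64")
    # Clip to avoid log(0), then divide the log-probabilities by the temperature.
    scaled = np.log(np.clip(preds, 1e-12, None)) / temperature
    # Subtract the max before exponentiating for numerical stability.
    scaled -= scaled.max()
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
probs_in = [0.1, 0.7, 0.2]
print(sample_with_temperature(probs_in, temperature=0.01, rng=rng))
print(sample_with_temperature(probs_in, temperature=1.0, rng=rng))
```

Temperatures below 1 sharpen the distribution toward the most likely character, while temperatures above 1 flatten it and make the output more varied.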

Understanding np.random.multinomial

In [11]:
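# The first argument, n=8, is the number of trials per draw, so the counts in
# each row sum to 8.
# The second argument, pvals=[0.1, 0.22, 0.333, 0.4444], lists the probability
# of each outcome. These values actually sum to 1.0974; NumPy still accepts
# them because it only requires sum(pvals[:-1]) <= 1 and assigns the remaining
# probability to the last outcome.
# The third argument, size=2, is the number of independent draws, which sets
# the number of rows in the returned array.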
multinomial = np.random.multinomial(8, [0.1, 0.22, 0.333, 0.4444], 2)
multinomial

array([[1, 1, 2, 4],
       [2, 3, 3, 0]])

In [12]:
multinomial = np.random.multinomial(12, [0.1, 0.22, 0.333, 0.4444], 3)
multinomial

array([[2, 5, 4, 1],
       [0, 6, 1, 5],
       [0, 6, 4, 2]])

In [13]:
# Here n=1 and size=1 mirror the call inside the sample function. A single
# trial produces a one-hot row: a 1 at the index of the outcome that was
# drawn and 0 everywhere else.
multinomial = np.random.multinomial(1, [0.1, 0.22, 0.333, 0.4444], 1)
multinomial

array([[0, 0, 0, 1]])
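
The properties described in the comments above can be checked directly. This sketch uses NumPy's `default_rng` generator and an assumed probability vector that sums to exactly 1; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
pvals = [0.1, 0.2, 0.3, 0.4]  # assumed example probabilities

# One row per draw, one column per outcome; each row's counts add up to n.
counts = rng.multinomial(8, pvals, size=3)
print(counts.shape)
print(counts.sum(axis=1))

# With n=1 the row is one-hot, so argmax recovers the sampled index.
one_hot = rng.multinomial(1, pvals, size=1)
index = int(np.argmax(one_hot))
print(one_hot, index)
```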

Understanding np.argmax

In [14]:
# np.argmax returns the index of the maximum value, either over the whole
# flattened array or along a chosen axis (per row or per column).
# In order to demonstrate np.argmax, we must also understand np.arange.

# np.arange(6) creates the integers starting at zero and increasing by one,
# stopping before 6, which gives six values.
a = np.arange(6)
print("a")
print(a)

print("\nb")
b = np.arange(6) + 1
print(b)

print("\nc")
c = np.arange(6) + 10
print(c)

print("\nd")
# Here .reshape takes the existing one-dimensional array and returns a matrix
# with the requested shape, in this case 2 rows and 3 columns.
d = np.arange(6).reshape(2,3) + 10
print(d)

a
[0 1 2 3 4 5]

b
[1 2 3 4 5 6]

c
[10 11 12 13 14 15]

d
[[10 11 12]
 [13 14 15]]


In [15]:
# np.argmax itself

# for rows

# Since NumPy arrays are zero-indexed, the value 2 is returned once for each
# row, indicating that the maximum sits in the third column.
max_rows = np.argmax(d, axis=1)
max_rows

array([2, 2])

In [16]:
# for columns
max_columns = np.argmax(d, axis=0)
max_columns

array([1, 1, 1])
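
When no `axis` is given, as in the `sample` function, `np.argmax` works on the flattened array. A short sketch using the same `d` as above, with `np.unravel_index` to convert the flat index back to a row and column:

```python
import numpy as np

d = np.arange(6).reshape(2, 3) + 10

# Without an axis, the result is an index into the flattened array.
flat_index = np.argmax(d)
row, col = np.unravel_index(flat_index, d.shape)

print(flat_index, (row, col), d[row, col])
```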

In [0]:
import random
import sys
from tensorflow.keras.callbacks import LambdaCallback

def on_epoch_end(epoch, _):
  """
  Function invoked at the end of each epoch passed by the neural network.
  Prints out the text generated by the model.
  """

  print()
  # %d formats the epoch number as an integer. Keras passes a zero-based
  # epoch index, so the first message reads "Epoch 0".
  print('----- Generating text after Epoch %d:' % epoch)

  start_index = random.randint(0, len(text) - maxlen - 1)

  generated = ''

  sentence = text[start_index : start_index + maxlen]
  generated += sentence

  print('----- Generating with seed: "' + sentence + '"')
  sys.stdout.write(generated)

  for i in range(400):
    X_pred = np.zeros((1, maxlen, len(chars)))
    for t, char in enumerate(sentence):
      X_pred[0, t, char_int[char]] = 1

    preds = model.predict(X_pred, verbose=0)[0]
    next_index = sample(preds)
    next_char = int_char[next_index]

    sentence = sentence[1:] + next_char

    sys.stdout.write(next_char)
    sys.stdout.flush()

  print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [18]:
# Understanding string formatting

data = ("John", "Doe", 53.44)
format_string = "Hello"

print("%s, %s %s.\nYour current balance is $%.2f." % (format_string, data[0], data[1], data[2]))

Hello, John Doe.
Your current balance is $53.44.
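
For comparison, the same message can be built with an f-string. This sketch also shows a detail of `%` formatting that is easy to trip over: `%` binds more tightly than `+`, so an expression such as `epoch + 1` needs its own parentheses. The `record` and `greeting` names are assumptions used for illustration.

```python
record = ("John", "Doe", 53.44)
greeting = "Hello"

percent_style = "%s, %s %s.\nYour current balance is $%.2f." % (greeting, *record)
f_style = f"{greeting}, {record[0]} {record[1]}.\nYour current balance is ${record[2]:.2f}."

print(percent_style == f_style)

# Without the parentheses, Python would format first and then try to add
# an integer to the resulting string, which raises a TypeError.
epoch = 0
epoch_label = "Epoch %d" % (epoch + 1)
print(epoch_label)
```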


Eventually, I'll need to go back and understand how sys works on a deeper level, but for now, we'll just roll with it.
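
As a starting point, `sys.stdout` is the stream that `print` writes to by default. The sketch below captures output in an in-memory buffer (an assumption made so the result can be inspected) to show that `sys.stdout.write` adds no newline and that `print` can be made to behave the same way.

```python
import io
import sys
from contextlib import redirect_stdout

buffer = io.StringIO()
with redirect_stdout(buffer):
    for ch in "abc":
        sys.stdout.write(ch)   # writes the character with no newline appended
        sys.stdout.flush()     # pushes any buffered text out immediately
    print()                    # print() with no arguments emits just a newline
    print("abc", end="", flush=True)  # print can also suppress the newline

captured = buffer.getvalue()
print(repr(captured))
```

This is why the callback uses `write` and `flush`: generated characters appear one at a time on the same line as they are sampled.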

In [21]:
# Fitting the model and generating our ridiculous results:

model.fit(X,
          y,
          batch_size=32,
          epochs=3,
          callbacks=[print_callback])

Epoch 1/3
----- Generating text after Epoch 0:
----- Generating with seed: "bewray whose brat thou art,
    Had nat"
bewray whose brat thou art,













  CLENDIA. Yes, that had have hearss crown of any, and thy searing permalls d
Epoch 2/3
----- Generating text after Epoch 1:
----- Generating with seed: ";
We are one anothers wife, ever begett"
;
















Fr
Epoch 3/3
----- Generating text after Epoch 2:
----- Generating with seed: "Recoil upon me: in himself too mighty,
"
Recoil upon me: in himself too mighty,










I loves into the hows, he not forcess: the 


<tensorflow.python.keras.callbacks.History at 0x7f14ab6b3320>
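
The assignment asks for a function that takes a size and returns generated text of that size. The following is one possible sketch, not a definitive implementation. `generate_text` and its `predict_fn` parameter are hypothetical names: `predict_fn` stands in for the trained model so the example runs without TensorFlow, and left-padding short seeds with zeros is an assumption that the lesson code does not make.

```python
import numpy as np

def generate_text(predict_fn, seed, n_chars, char_int, int_char, maxlen, rng=None):
    """Return n_chars of generated text that continue the given seed.

    predict_fn is a hypothetical callable: it takes a one-hot array of shape
    (1, maxlen, vocab_size) and returns next-character probabilities of
    shape (vocab_size,).
    """
    rng = np.random.default_rng() if rng is None else rng
    vocab_size = len(char_int)
    window = seed[-maxlen:]
    generated = []

    for _ in range(n_chars):
        x = np.zeros((1, maxlen, vocab_size))
        # Right-align the window so short seeds are padded with zeros on the left.
        offset = maxlen - len(window)
        for t, ch in enumerate(window):
            x[0, offset + t, char_int[ch]] = 1

        probs = np.asarray(predict_fn(x), dtype="float64")
        probs = probs / probs.sum()
        next_ch = int_char[int(rng.choice(vocab_size, p=probs))]

        generated.append(next_ch)
        window = (window + next_ch)[-maxlen:]

    return "".join(generated)

# Stand-in predictor (uniform probabilities) so the sketch runs without a model.
toy_chars = sorted(set("abc "))
toy_char_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_char = {i: c for i, c in enumerate(toy_chars)}

def uniform(x):
    return np.full(len(toy_chars), 1 / len(toy_chars))

sample_out = generate_text(uniform, "abc", 50, toy_char_int, toy_int_char,
                           maxlen=5, rng=np.random.default_rng(0))
print(len(sample_out))
```

With the model trained above, `predict_fn` could be `lambda x: model.predict(x, verbose=0)[0]`, with `char_int`, `int_char`, and `maxlen` taken from the earlier cells.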

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN