<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short-Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then use the trained RNN to generate Shakespearean-ish text. Your goal: a function that takes the size of the text to generate (e.g. number of characters or lines) as an argument and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [3]:
import requests
import pandas as pd

In [4]:
url = "https://www.gutenberg.org/files/100/100-0.txt"

r = requests.get(url)
r.encoding = r.apparent_encoding
data = r.text
data = data.split('\r\n')
toc = [l.strip() for l in data[44:130:2]]
# Skip the Table of Contents
data = data[135:]

# Fixing Titles
toc[9] = 'THE LIFE OF KING HENRY V'
toc[18] = 'MACBETH'
toc[24] = 'OTHELLO, THE MOOR OF VENICE'
toc[34] = 'TWELFTH NIGHT: OR, WHAT YOU WILL'

locations = {id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

# Record the line index where each title occurs (a later match overwrites an earlier one)
for e,i in enumerate(data):
    for t,title in enumerate(toc):
        if title in i:
            locations[t].update({'start':e})
            

df_toc = pd.DataFrame.from_dict(locations, orient='index')
df_toc['end'] = df_toc['start'].shift(-1).apply(lambda x: x-1)
df_toc.loc[42, 'end'] = len(data)
df_toc['end'] = df_toc['end'].astype('int')

df_toc['text'] = df_toc.apply(lambda x: '\r\n'.join(data[ x['start'] : int(x['end']) ]), axis=1)

In [5]:
#Shakespeare Data Parsed by Play
df_toc.head()

| | title | start | end | text |
|---|---|---|---|---|
| 0 | THE TRAGEDY OF ANTONY AND CLEOPATRA | -99 | 14379 | |
| 1 | AS YOU LIKE IT | 14380 | 17171 | AS YOU LIKE IT\r\n\r\n\r\nDRAMATIS PERSONAE.\r... |
| 2 | THE COMEDY OF ERRORS | 17172 | 20372 | THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r... |
| 3 | THE TRAGEDY OF CORIOLANUS | 20373 | 30346 | THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers... |
| 4 | CYMBELINE | 30347 | 30364 | CYMBELINE.\r\nLaud we the gods;\r\nAnd let our... |
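
The preview above suggests two issues with the title-matching loop. Row 0 keeps the placeholder start of -99, so its slice is empty. CYMBELINE appears to begin at a late speech prefix rather than at the play's heading, because `title in i` is a substring test and each later match overwrites the earlier one. Below is a minimal sketch of a stricter alternative, assuming each play heading sits on a line of its own. `find_title_starts` is a hypothetical helper, not part of the notebook, and the sample lines are made up:

```python
def find_title_starts(lines, titles):
    """Return {title: index of the first line that equals the title exactly}.

    Titles that never match are left out instead of getting a placeholder index.
    """
    wanted = set(titles)
    starts = {}
    for index, line in enumerate(lines):
        candidate = line.strip()
        if candidate in wanted and candidate not in starts:
            starts[candidate] = index
    return starts


# Toy data standing in for the Gutenberg lines (an assumption for illustration).
sample_lines = [
    "CYMBELINE",
    "",
    "ACT I",
    "CYMBELINE.",  # speech prefix: not an exact match, so it is ignored
    "Laud we the gods;",
    "THE TRAGEDY OF HAMLET, PRINCE OF DENMARK",
]
sample_titles = ["CYMBELINE", "THE TRAGEDY OF HAMLET, PRINCE OF DENMARK", "MACBETH"]

starts = find_title_starts(sample_lines, sample_titles)
missing = [t for t in sample_titles if t not in starts]
print(starts)
print("not found:", missing)
```

Reporting the titles that were not found makes it easier to spot a mismatch before it turns into an empty `text` column.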


In [6]:
# Check shape of dataframe

df_toc.shape

(43, 4)

In [7]:
# Clean the text column

# regex=False treats each pattern as a literal string; '[' and '*' are
# regex metacharacters and would otherwise depend on the pandas version.
for ch in ['\n', '\r', '[', ']', '_', '<', '>', '/', '*']:
    df_toc['text'] = df_toc['text'].str.replace(ch, '', regex=False)
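
Since every replacement in this cell deletes a single literal character, the same cleanup can also be expressed as one pass with Python's `str.translate`. Here is a small sketch on a made-up string. `DROP_CHARS`, `DELETE_TABLE` and `strip_markup` are illustrative names, not from the notebook:

```python
# Characters the cleaning cell removes.
DROP_CHARS = "\n\r[]_<>/*"

# str.maketrans with a third argument maps each listed character to None (deleted).
DELETE_TABLE = str.maketrans("", "", DROP_CHARS)


def strip_markup(text):
    """Delete every character in DROP_CHARS from text in a single pass."""
    return text.translate(DELETE_TABLE)


sample = "[_Exit._]\r\n<<THE END>>"
cleaned = strip_markup(sample)
print(repr(cleaned))
```

With pandas, the same table could be applied as `df_toc['text'].str.translate(DELETE_TABLE)`. Note that deleting line breaks outright joins the last word of one line to the first word of the next. Replacing them with a space instead is one option if word boundaries matter.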

In [8]:
# Encode Data as Chars

# Gather all text 

# Why? 1. See all possible characters 2. For training / splitting later

text = " ".join(df_toc['text'])

# Unique Characters

chars = list(set(text))

# Lookup Tables

char_int = {c:i for i, c in enumerate(chars)} 
int_char = {i:c for i, c in enumerate(chars)}

In [9]:
# Number of unique characters

len(chars)

99

In [10]:
# See integer:characters

int_char

{0: 's',
 1: 'â',
 2: 'F',
 3: '8',
 4: '}',
 5: ';',
 6: 'è',
 7: 'g',
 8: 'W',
 9: ' ',
 10: 'u',
 11: 'Q',
 12: '(',
 13: '“',
 14: 'e',
 15: 'm',
 16: '0',
 17: 'p',
 18: 'f',
 19: '|',
 20: 'I',
 21: 'T',
 22: 'w',
 23: 'c',
 24: 'Z',
 25: '6',
 26: 'é',
 27: 'N',
 28: 'i',
 29: 'h',
 30: 't',
 31: '`',
 32: 'r',
 33: '"',
 34: '—',
 35: 'L',
 36: "'",
 37: 'A',
 38: 'K',
 39: '5',
 40: 'C',
 41: 'æ',
 42: 'H',
 43: '$',
 44: 'z',
 45: '&',
 46: 'l',
 47: 'y',
 48: ',',
 49: '%',
 50: '\\',
 51: '‘',
 52: 'D',
 53: 'J',
 54: 'G',
 55: 'E',
 56: 'o',
 57: 'É',
 58: '2',
 59: 'q',
 60: '9',
 61: 'M',
 62: 'ê',
 63: 'U',
 64: 'v',
 65: '3',
 66: 'k',
 67: '-',
 68: 'î',
 69: '”',
 70: '\t',
 71: 'B',
 72: 'n',
 73: 'Y',
 74: 'b',
 75: '.',
 76: 'œ',
 77: 'X',
 78: 'P',
 79: 'd',
 80: 'V',
 81: 'Æ',
 82: 'ç',
 83: 'O',
 84: '!',
 85: ')',
 86: '1',
 87: '?',
 88: 'a',
 89: 'j',
 90: '4',
 91: '@',
 92: ':',
 93: '’',
 94: 'R',
 95: 'S',
 96: 'à',
 97: '7',
 98: 'x'}
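
The mapping above is in arbitrary order because `set` iteration order for strings is not stable across Python sessions (string hashing is randomized by default). If a trained model is saved and reloaded later, freshly built lookup tables may no longer line up with the model's outputs. One possible remedy, sketched here on a toy string, is to sort the vocabulary before enumerating it:

```python
toy_text = "to be, or not to be"

# Sorting makes the character -> index assignment reproducible between runs.
toy_chars = sorted(set(toy_text))
toy_char_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_char = {i: c for c, i in toy_char_int.items()}

# Round trip: encode to integers, then decode back to the original string.
toy_encoded = [toy_char_int[c] for c in toy_text]
toy_decoded = "".join(toy_int_char[i] for i in toy_encoded)
print(toy_chars)
print(toy_decoded == toy_text)
```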

In [11]:
# Create the sequence data

maxlen = 40
step = 5

encoded = [char_int[c] for c in text]

sequences = [] # Each element is 40 chars long
next_char = [] # One element for each sequence

for i in range(0, len(encoded) - maxlen, step):
    
    sequences.append(encoded[i : i + maxlen])
    next_char.append(encoded[i + maxlen])
    
print('sequences: ', len(sequences))

sequences:  2854377
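
The loop above slides a window of `maxlen` characters over the encoded text, advances `step` positions each time, and stores the character immediately after each window as the target. For a text of N characters (with N greater than `maxlen`) that yields ceil((N - maxlen) / step) pairs. The following is a toy-sized sketch with the values shrunk for readability. `make_windows` is a hypothetical helper that mirrors the loop:

```python
import math


def make_windows(encoded, maxlen, step):
    """Return (windows, targets): each window has maxlen items; its target is the next item."""
    windows, targets = [], []
    for i in range(0, len(encoded) - maxlen, step):
        windows.append(encoded[i : i + maxlen])
        targets.append(encoded[i + maxlen])
    return windows, targets


toy_encoded = list(range(23))  # stand-in for the integer-encoded text
windows, targets = make_windows(toy_encoded, maxlen=4, step=5)

expected = math.ceil((len(toy_encoded) - 4) / 5)
print(len(windows), expected)
print(windows[1], targets[1])
```

A smaller `step` produces more, heavily overlapping training examples, while a larger one trades coverage for speed and memory.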


In [12]:
# Check sequence 49

sequences[49]

[9,
 9,
 40,
 42,
 37,
 94,
 35,
 55,
 95,
 48,
 9,
 22,
 32,
 14,
 0,
 30,
 46,
 14,
 32,
 9,
 30,
 56,
 9,
 2,
 32,
 14,
 79,
 14,
 32,
 28,
 23,
 66,
 9,
 9,
 83,
 35,
 20,
 80,
 55,
 94]

In [13]:

# Check what sequence 49 prints
 
for i in sequences[49]:
    print(int_char[i])

 
 
C
H
A
R
L
E
S
,
 
w
r
e
s
t
l
e
r
 
t
o
 
F
r
e
d
e
r
i
c
k
 
 
O
L
I
V
E
R


In [14]:
# next_char holds the target for each sequence: the character that follows it, which the model learns to predict

next_char[0], int_char[next_char[0]]

(9, ' ')

In [None]:
# Create X & y

import numpy as np

# np.bool was removed in NumPy 1.24; the built-in bool is the equivalent dtype
X = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        X[i,t,char] = 1
    y[i, next_char[i]] = 1
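
Before allocating `X`, it may be worth estimating its size. NumPy stores a boolean array with one byte per element, so with the counts printed earlier in this notebook (2,854,377 sequences, `maxlen = 40`, 99 characters) a dense one-hot tensor needs roughly 11.3 GB, which may exceed the RAM of a typical hosted notebook. This is a back-of-the-envelope sketch, and the every-20th sampling at the end is an arbitrary illustration:

```python
n_sequences = 2_854_377  # printed by the sequence-building cell
maxlen = 40
n_chars = 99

# One byte per boolean element.
x_bytes = n_sequences * maxlen * n_chars
y_bytes = n_sequences * n_chars
print(f"X: {x_bytes / 1e9:.1f} GB, y: {y_bytes / 1e9:.2f} GB")

# Keeping every 20th sequence (about 5%) shrinks X proportionally.
subset = len(range(0, n_sequences, 20))
subset_bytes = subset * maxlen * n_chars
print(f"subset: {subset} sequences, X: {subset_bytes / 1e9:.2f} GB")
```

In the notebook, such a subset could be taken with `sequences[::20]` and `next_char[::20]`, in line with the earlier note about starting on smaller data for a tighter feedback loop.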

In [1]:
# See X shape

X.shape

NameError: ignored
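
If the full tensor does not fit in memory, another option is to one-hot encode small batches on demand instead of materializing `X` and `y` up front. This minimal sketch assumes integer-encoded `sequences` and `next_char` lists like the ones built earlier. `one_hot_batches` is a hypothetical helper, and the toy inputs are made up:

```python
import numpy as np


def one_hot_batches(sequences, next_char, n_chars, batch_size):
    """Yield (X, y) one-hot batches built lazily from integer-encoded data."""
    for start in range(0, len(sequences), batch_size):
        seq_batch = np.asarray(sequences[start : start + batch_size])
        target_batch = np.asarray(next_char[start : start + batch_size])
        n, maxlen = seq_batch.shape

        X = np.zeros((n, maxlen, n_chars), dtype=bool)
        y = np.zeros((n, n_chars), dtype=bool)
        # Fancy indexing sets one True per (sample, position) and one per sample.
        X[np.arange(n)[:, None], np.arange(maxlen)[None, :], seq_batch] = True
        y[np.arange(n), target_batch] = True
        yield X, y


toy_sequences = [[0, 1, 2], [1, 2, 3], [2, 3, 0]]
toy_next = [3, 0, 1]
batches = list(one_hot_batches(toy_sequences, toy_next, n_chars=4, batch_size=2))
first_X, first_y = batches[0]
print(first_X.shape, first_y.shape, len(batches))
```

Only one batch of one-hot data exists at a time, so peak memory depends on `batch_size` rather than on the total number of sequences.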

In [None]:
# X[0] is the one-hot grid for the first sequence: one row per position
# (maxlen rows) and one column per character in the vocabulary.

# Each row holds a single True, in the column of the character at that position.

# The printed preview is truncated, so most visible entries are False.

X[0]
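
The cells above stop at data preparation. As a sketch of the assignment's end goal (a function that takes a size and returns generated text), the loop below repeatedly asks a model for next-character probabilities, samples one character, and slides the window forward. It assumes a callable `predict_proba(window)` that returns one probability per vocabulary index. In practice that would wrap the trained RNN, but a uniform stand-in is used here so the example runs without a model. All names (`sample_index`, `generate_text`, `temperature`, `uniform_model`) are illustrative, and temperature rescaling is a common technique rather than something the notebook prescribes:

```python
import math
import random


def sample_index(probs, temperature=1.0, rng=random):
    """Sample an index after rescaling probabilities by a temperature.

    Lower temperatures favor the most likely characters; higher ones add variety.
    """
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]  # subtract peak for stability
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


def generate_text(predict_proba, seed, n_chars, char_int, int_char,
                  maxlen=40, temperature=1.0, rng=random):
    """Return n_chars of generated text, continuing from the seed string."""
    window = [char_int[c] for c in seed][-maxlen:]
    output = []
    for _ in range(n_chars):
        probs = predict_proba(window)
        idx = sample_index(probs, temperature, rng)
        output.append(int_char[idx])
        window = (window + [idx])[-maxlen:]  # keep only the latest maxlen indices
    return "".join(output)


# Stand-in "model": uniform probabilities over a tiny vocabulary (an assumption).
toy_vocab = sorted(set("to be or not"))
toy_char_int = {c: i for i, c in enumerate(toy_vocab)}
toy_int_char = {i: c for c, i in toy_char_int.items()}


def uniform_model(window):
    return [1.0 / len(toy_vocab)] * len(toy_vocab)


sample = generate_text(uniform_model, seed="to be", n_chars=60,
                       char_int=toy_char_int, int_char=toy_int_char,
                       rng=random.Random(0))
print(len(sample), repr(sample))
```

With a trained model, `predict_proba` would one-hot encode the window, call the model's prediction method, and return the resulting probability vector. Generating by number of lines instead of characters would need a vocabulary that keeps line breaks, which the cleaning step above removes.
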

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training a RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN