<a href="https://colab.research.google.com/github/skhabiri/DS-Unit-4-Sprint-3-Deep-Learning/blob/main/module1-rnn-and-lstm/LS_DS17_431_RNN_and_LSTM_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## *Data Science Unit 4 Sprint 3 Module 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal is a function that takes, as an argument, the size of the text to generate (e.g. number of characters or lines) and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [1]:
import requests
import pandas as pd

In [12]:
url = "https://github.com/skhabiri/ML-DeepLearning/raw/main/data/100-0.txt"

r = requests.get(url)
r.encoding = r.apparent_encoding
data = r.text
data[:500]

'The Project Gutenberg eBook of The Complete Works of William Shakespeare, by William Shakespeare\r\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org. If you are not located in the United States, you\r\nwill have to check the laws of the country'

In [30]:
toc = [l.strip() for l in data.split('\r\n')[44:130:2]]
{id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

{0: {'title': 'THE TRAGEDY OF ANTONY AND CLEOPATRA', 'start': -99},
 1: {'title': 'AS YOU LIKE IT', 'start': -99},
 2: {'title': 'THE COMEDY OF ERRORS', 'start': -99},
 3: {'title': 'THE TRAGEDY OF CORIOLANUS', 'start': -99},
 4: {'title': 'CYMBELINE', 'start': -99},
 5: {'title': 'THE TRAGEDY OF HAMLET, PRINCE OF DENMARK', 'start': -99},
 6: {'title': 'THE FIRST PART OF KING HENRY THE FOURTH', 'start': -99},
 7: {'title': 'THE SECOND PART OF KING HENRY THE FOURTH', 'start': -99},
 8: {'title': 'THE LIFE OF KING HENRY THE FIFTH', 'start': -99},
 9: {'title': 'THE FIRST PART OF HENRY THE SIXTH', 'start': -99},
 10: {'title': 'THE SECOND PART OF KING HENRY THE SIXTH', 'start': -99},
 11: {'title': 'THE THIRD PART OF KING HENRY THE SIXTH', 'start': -99},
 12: {'title': 'KING HENRY THE EIGHTH', 'start': -99},
 13: {'title': 'KING JOHN', 'start': -99},
 14: {'title': 'THE TRAGEDY OF JULIUS CAESAR', 'start': -99},
 15: {'title': 'THE TRAGEDY OF KING LEAR', 'start': -99},
 16: {'title': 'LOVE

In [31]:
"""
\r (Carriage Return) → moves the cursor to the beginning of the line without advancing to the next line
\n (Line Feed) → moves the cursor down to the next line without returning to the beginning of the line — In a *nix environment \n moves to the beginning of the line.
\r\n (End Of Line) → a combination of \r and \n
"""

url = "https://github.com/skhabiri/ML-DeepLearning/raw/main/data/100-0.txt"

r = requests.get(url)
r.encoding = r.apparent_encoding
data = r.text

data = data.split('\r\n')

# Title of the writings
toc = [l.strip() for l in data[44:130:2]]

# Skip the Table of Contents
data = data[135:]

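# Fixing Titles
# NOTE: in the table of contents printed above, index 8 is 'THE LIFE OF KING HENRY THE FIFTH'
# and index 9 is 'THE FIRST PART OF HENRY THE SIXTH', so the next line overwrites the
# Henry VI Part 1 title and leaves index 8 unchanged.
toc[9] = 'THE LIFE OF KING HENRY V'
toc[18] = 'MACBETH'
toc[24] = 'OTHELLO, THE MOOR OF VENICE'
toc[34] = 'TWELFTH NIGHT: OR, WHAT YOU WILL'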

locations = {id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

# Record the line index where each title appears (if several lines match, the last one wins)
for e,i in enumerate(data):
    for t,title in enumerate(toc):
        if title in i:
            locations[t].update({'start':e})
            

df_toc = pd.DataFrame.from_dict(locations, orient='index')
df_toc['end'] = df_toc['start'].shift(-1).apply(lambda x: x-1)
df_toc.loc[42, 'end'] = len(data)
df_toc['end'] = df_toc['end'].astype('int')

df_toc['text'] = df_toc.apply(lambda x: '\r\n'.join(data[ x['start'] : int(x['end']) ]), axis=1)

In [32]:
#Shakespeare Data Parsed by Play
df_toc.head()

                                 title  start    end  text
0  THE TRAGEDY OF ANTONY AND CLEOPATRA    -99  14379
1                       AS YOU LIKE IT  14380  17171  AS YOU LIKE IT\r\n\r\n\r\nDRAMATIS PERSONAE.\r...
2                 THE COMEDY OF ERRORS  17172  20372  THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r...
3            THE TRAGEDY OF CORIOLANUS  20373  30346  THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers...
4                            CYMBELINE  30347  30364  CYMBELINE.\r\nLaud we the gods;\r\nAnd let our...
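
In the table above, the first work keeps its placeholder `start` of -99 (no line matched its title) and CYMBELINE covers only 17 lines (the title matched a line of dialogue). A minimal sketch of a stricter lookup, assuming a heading is a line whose stripped text equals the title exactly and that the first such line is the one wanted. `find_title_starts` is a hypothetical helper, not part of the notebook:

```python
def find_title_starts(lines, titles):
    """Map each title to the index of the first line that equals it exactly."""
    starts = {title: -1 for title in titles}   # -1 marks "not found"
    wanted = set(titles)
    for idx, line in enumerate(lines):
        stripped = line.strip()
        # Only record the first exact match; later matches are ignored
        if stripped in wanted and starts[stripped] == -1:
            starts[stripped] = idx
    return starts


lines = ["AS YOU LIKE IT", "", "ACT I", "  CYMBELINE", "CYMBELINE. Laud we the gods;"]
print(find_title_starts(lines, ["AS YOU LIKE IT", "CYMBELINE", "MACBETH"]))
# {'AS YOU LIKE IT': 0, 'CYMBELINE': 3, 'MACBETH': -1}
```

Whether exact matching finds every heading in the real file depends on how the titles are typeset there, so the result would still need checking against `df_toc`.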


In [33]:
df_toc.shape

(43, 4)

In [36]:
len(df_toc['text'][1])

136678

In [37]:
df_toc['text']

0                                                      
1     AS YOU LIKE IT\r\n\r\n\r\nDRAMATIS PERSONAE.\r...
2     THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r...
3     THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers...
4     CYMBELINE.\r\nLaud we the gods;\r\nAnd let our...
5     THE TRAGEDY OF HAMLET, PRINCE OF DENMARK\r\n\r...
6     THE FIRST PART OF KING HENRY THE FOURTH\r\n\r\...
7     THE SECOND PART OF KING HENRY THE FOURTH\r\n\r...
8                                                      
9     THE LIFE OF KING HENRY V\r\n\r\n\r\n\r\nConten...
10    THE SECOND PART OF KING HENRY THE SIXTH\r\n\r\...
11    THE THIRD PART OF KING HENRY THE SIXTH\r\n\r\n...
12                     KING HENRY THE EIGHTH\r\n\r\n...
13      KING JOHN. O cousin, thou art come to set mi...
14    THE TRAGEDY OF JULIUS CAESAR\r\n\r\n\r\n\r\nCo...
15    THE TRAGEDY OF KING LEAR\r\n\r\n\r\n\r\nConten...
16    LOVE’S LABOUR’S LOST\r\n\r\nDramatis Personae....
17                                              

In [38]:
test = []
for txt in df_toc['text']:
  for x in txt:
    test.append(x)
test[:20]

['A',
 'S',
 ' ',
 'Y',
 'O',
 'U',
 ' ',
 'L',
 'I',
 'K',
 'E',
 ' ',
 'I',
 'T',
 '\r',
 '\n',
 '\r',
 '\n',
 '\r',
 '\n']

In [8]:
# data = df_toc['text'][1:].values
# data.shape

In [39]:
df_toc['text'][1][:500]

'AS YOU LIKE IT\r\n\r\n\r\nDRAMATIS PERSONAE.\r\n\r\n  DUKE, living in exile\r\n  FREDERICK, his brother, and usurper of his dominions\r\n  AMIENS, lord attending on the banished Duke\r\n  JAQUES,   "      "       "  "     "      "\r\n  LE BEAU, a courtier attending upon Frederick\r\n  CHARLES, wrestler to Frederick\r\n  OLIVER, son of Sir Rowland de Boys\r\n  JAQUES,   "   "  "    "     "  "\r\n  ORLANDO,  "   "  "    "     "  "\r\n  ADAM,   servant to Oliver\r\n  DENNIS,     "     "   "\r\n  TOUCHSTONE, the court jester\r\n  SI'

In [40]:
# Replace \r and \n with spaces and drop double quotes; store the result as a list of strings
data = []
# txt is each of 43 chapters
for txt in df_toc['text']:
    # str.replace works on the whole string, so no per-character loop is needed
    # white spaces are not removed
    data.append(txt.replace('\r', ' ').replace('\n', ' ').replace('"', ''))
print(len(data))
# data is a list of 43 items. Each chapter's \r and \n are replaced, but runs of consecutive spaces remain
data[1][:500]

43


'AS YOU LIKE IT      DRAMATIS PERSONAE.      DUKE, living in exile    FREDERICK, his brother, and usurper of his dominions    AMIENS, lord attending on the banished Duke    JAQUES,                                 LE BEAU, a courtier attending upon Frederick    CHARLES, wrestler to Frederick    OLIVER, son of Sir Rowland de Boys    JAQUES,                       ORLANDO,                      ADAM,   servant to Oliver    DENNIS,                 TOUCHSTONE, the court jester    SIR OLIVER MARTEXT, a v'

In [41]:
# For each chapter, collapse runs of whitespace into a single space
data = [' '.join(chapter.split()) for chapter in data]

In [42]:
data[1][:500]

'AS YOU LIKE IT DRAMATIS PERSONAE. DUKE, living in exile FREDERICK, his brother, and usurper of his dominions AMIENS, lord attending on the banished Duke JAQUES, LE BEAU, a courtier attending upon Frederick CHARLES, wrestler to Frederick OLIVER, son of Sir Rowland de Boys JAQUES, ORLANDO, ADAM, servant to Oliver DENNIS, TOUCHSTONE, the court jester SIR OLIVER MARTEXT, a vicar CORIN, shepherd SILVIUS, WILLIAM, a country fellow, in love with Audrey A person representing HYMEN ROSALIND, daughter to '
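
The two cleaning steps above (replacing line breaks, then collapsing spaces) can be folded into one helper. A sketch, assuming the same rules as the cells above; `clean_chapter` is an illustrative name:

```python
def clean_chapter(raw):
    """Drop double quotes and collapse every whitespace run, line breaks included, to one space."""
    # str.split() with no arguments splits on any run of whitespace,
    # so carriage returns and line feeds do not need to be replaced first.
    return ' '.join(raw.replace('"', '').split())


print(clean_chapter('AS YOU LIKE IT\r\n\r\n  JAQUES,   "      "\r\n  LE BEAU'))
# AS YOU LIKE IT JAQUES, LE BEAU
```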

In [43]:
data[0]

''

In [44]:
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.optimizers import RMSprop

import numpy as np
import random
import sys
import os
import re

In [45]:
print(len(data))
# connect all the 43 chapters with a space character into a long flattened text
# then make the text lower case as well
text = ' '.join(data).lower()

43


In [46]:
# keep only letters, spaces, and the punctuation . : ; ? ' ! , ` - and drop every other character
text = re.sub("[^a-z A-Z.:;?'!,`-]", "", text)

In [47]:
text[:500]

' as you like it dramatis personae. duke, living in exile frederick, his brother, and usurper of his dominions amiens, lord attending on the banished duke jaques, le beau, a courtier attending upon frederick charles, wrestler to frederick oliver, son of sir rowland de boys jaques, orlando, adam, servant to oliver dennis, touchstone, the court jester sir oliver martext, a vicar corin, shepherd silvius, william, a country fellow, in love with audrey a person representing hymen rosalind, daughter to'

In [48]:
# Build the vocabulary: the unique characters that occur in the text
# Note: set ordering is arbitrary, so the index given to each character can differ between sessions
chars = list(set(text))

char_int = {c:i for i, c in enumerate(chars)}
int_char = {i:c for i, c in enumerate(chars)}

print(len(chars))
" ".join(chars)

36


"y q m ? s z d a v f ! ' c u i e t h x j p g ; `   n : k b . - o r l , w"

In [49]:
char_int

{'y': 0,
 'q': 1,
 'm': 2,
 '?': 3,
 's': 4,
 'z': 5,
 'd': 6,
 'a': 7,
 'v': 8,
 'f': 9,
 '!': 10,
 "'": 11,
 'c': 12,
 'u': 13,
 'i': 14,
 'e': 15,
 't': 16,
 'h': 17,
 'x': 18,
 'j': 19,
 'p': 20,
 'g': 21,
 ';': 22,
 '`': 23,
 ' ': 24,
 'n': 25,
 ':': 26,
 'k': 27,
 'b': 28,
 '.': 29,
 '-': 30,
 'o': 31,
 'r': 32,
 'l': 33,
 ',': 34,
 'w': 35}
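
Because `set` iteration order for strings is not guaranteed to be the same between Python sessions, the mapping above can change from run to run, which matters if a trained model is saved and reloaded later. A minimal sketch of a reproducible mapping, assuming sorting the characters is acceptable (the toy string is a stand-in for `text`, and the variable names are illustrative):

```python
sample_text = "to be, or not to be"

# sorted() fixes the order, so each character always gets the same index
vocab = sorted(set(sample_text))
char_to_index = {c: i for i, c in enumerate(vocab)}
index_to_char = {i: c for c, i in char_to_index.items()}

print(vocab)  # [' ', ',', 'b', 'e', 'n', 'o', 'r', 't']

# Round trip: encode to integers and decode back to the original string
encoded_sample = [char_to_index[c] for c in sample_text]
decoded_sample = ''.join(index_to_char[i] for i in encoded_sample)
```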

Encode the text to integer numbers and create sequences of data

In [50]:
# sequence length
maxlen = 40
# stepping
step = 20

# iterate through each character of the text
encoded = [char_int[c] for c in text]

sequences = [] # Each element is 40 chars long
next_char = [] # One character for each element of sequence

for i in range(0, len(encoded) - maxlen, step):
    sequences.append(encoded[i : i + maxlen])
    # next_char refers to the character encoding right after the last character of the sequence in encoded list
    next_char.append(encoded[i + maxlen])
    
print('sequences: ', len(sequences), len(encoded)/step)

sequences:  696793 696794.7
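
To see what the loop above produces, it helps to run it on a short string. A toy sketch with `maxlen=5` and `step=2` (values chosen only for illustration), working on raw characters instead of integer codes; `make_windows` is a hypothetical helper:

```python
def make_windows(seq, maxlen, step):
    """Return (windows, targets): each window of maxlen items and the item that follows it."""
    windows, targets = [], []
    for i in range(0, len(seq) - maxlen, step):
        windows.append(seq[i:i + maxlen])
        targets.append(seq[i + maxlen])
    return windows, targets


windows, targets = make_windows("as you like it", maxlen=5, step=2)
for w, t in zip(windows, targets):
    print(repr(w), '->', repr(t))
# 'as yo' -> 'u'
# ' you ' -> 'l'
# 'ou li' -> 'k'
# ' like' -> ' '
# 'ike i' -> 't'
```

With `step` smaller than `maxlen`, consecutive windows overlap, which is also the case in the notebook (`maxlen = 40`, `step = 20`).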


Create x and y for the LSTM layer. x has shape (num_sequences, timesteps=40, features=len(chars)) and y has shape (num_sequences, len(chars)).

In [51]:
# np.bool was removed from recent NumPy releases; the built-in bool is the equivalent dtype
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)
for i, sequence in enumerate(sequences):
    # one-hot encode each character of the sequence
    for t, char in enumerate(sequence):
        x[i, t, char] = 1
    # one-hot encode the character that follows the sequence in the encoded list
    y[i, next_char[i]] = 1
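
The nested loop above sets one cell at a time. As a sketch of an alternative, assuming the sequences fit in a single integer array, indexing an identity matrix produces the same one-hot layout in one step (the tiny arrays and the `one_hot` helper here are made up for illustration):

```python
import numpy as np

def one_hot(indices, num_classes):
    """One-hot encode an integer array; output shape is indices.shape + (num_classes,)."""
    indices = np.asarray(indices)
    # Row k of the identity matrix is the one-hot vector for class k
    return np.eye(num_classes, dtype=bool)[indices]


toy_sequences = [[0, 2, 1], [1, 1, 3]]   # 2 sequences, 3 timesteps each
toy_next = [3, 0]                        # next character for each sequence

x_toy = one_hot(toy_sequences, num_classes=4)
y_toy = one_hot(toy_next, num_classes=4)
print(x_toy.shape, y_toy.shape)  # (2, 3, 4) (2, 4)
```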

In [52]:
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
# y is the output next character after the sequence
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

In [53]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 128)               84480     
_________________________________________________________________
dense (Dense)                (None, 36)                4644      
Total params: 89,124
Trainable params: 89,124
Non-trainable params: 0
_________________________________________________________________
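
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias; the dense layer has one weight per input-output pair plus one bias per output. A quick check using the sizes from this notebook:

```python
units = 128        # LSTM units
features = 36      # len(chars)

# 4 gates x (input kernel + recurrent kernel + bias)
lstm_params = 4 * (units * (features + units) + units)
# weights + biases of the softmax layer
dense_params = units * features + features

print(lstm_params, dense_params, lstm_params + dense_params)  # 84480 4644 89124
```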


In [54]:
def sample(preds):
    """
    Sample an index from a probability array.

    The predictions are renormalized so that they sum to 1, then one index
    is drawn at random using those probabilities as weights.
    """
    preds = np.asarray(preds).astype('float64')
    
    # Dividing the log-probabilities by 1 leaves them unchanged;
    # a different divisor would act as a sampling temperature
    preds = np.log(preds) / 1
    exp_preds = np.exp(preds)
    
    # Normalize so the probabilities sum to one
    preds = exp_preds / np.sum(exp_preds)
    
    # A single multinomial draw gives a one-hot array marking the sampled index
    probas = np.random.multinomial(1, preds, 1)
    
    # argmax recovers the position of the single 1 in that one-hot array
    return np.argmax(probas)
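
Dividing the log-probabilities by a value other than 1 changes how adventurous the sampling is: values below 1 sharpen the distribution toward the most likely character, and values above 1 flatten it. A sketch of that variant, with `temperature` and `rng` as added parameters that are not in the function above:

```python
import numpy as np

def sample_with_temperature(preds, temperature=1.0, rng=None):
    """Draw one index from preds after rescaling its log-probabilities by 1/temperature."""
    rng = np.random.default_rng() if rng is None else rng
    preds = np.asarray(preds, dtype='float64')
    # Clip to avoid log(0), then rescale in log space
    logits = np.log(np.clip(preds, 1e-12, None)) / temperature
    # Subtracting the max keeps exp() from overflowing at low temperatures
    weights = np.exp(logits - logits.max())
    probs = weights / weights.sum()
    return int(rng.choice(len(probs), p=probs))


probs = [0.1, 0.7, 0.2]
# At a very low temperature the draw is effectively the argmax
print(sample_with_temperature(probs, temperature=0.01, rng=np.random.default_rng(0)))
```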

In [55]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    # Random prompt to grab a 40 character sample seed
    start_index = random.randint(0, len(text) - maxlen - 1)
    
    generated = ''
    
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        # 400 is the length of generated text
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1
            
        # Predict the next step (character)
        # preds is an array of length len(chars) (36 here)
        # holding the softmax probability of each character
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_char[next_index]
        
        # update the seed by moving one character forward
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [56]:
# fit the model

model.fit(x, y,
          batch_size=32,
          epochs=2,
          callbacks=[print_callback])

Epoch 1/2

----- Generating text after Epoch: 0
----- Generating with seed: "apel where they lie, and tears shed ther"
apel where they lie, and tears shed there thin hoth; arame, which of fuce die woze sow all pentursold. bucks cothers. bo hate the elight a hand. ruke. this messan: so, mwet hore in me to bequess, and just, what'rel comeer of in resisees to to man. you so hat eachingay us, would feirs, tursemt. actonge. soone, bue, love nor onts for wrow 'of he; no let nothtce. i would theme be ret when, no; tibus to yekly hounestes, my soud, as is the g
Epoch 2/2

----- Generating text after Epoch: 1
----- Generating with seed: "our drums to find this danger out. basta"
our drums to find this danger out. bastarro. fender. what in jele's me thus'd benger women my norpysed arm princelress to subners it see aids, to bride of demjohd, nor shapp, wounte to could here as not? by, he had you seefees pride over ther awour'd your currasion, which to goom lost o, now he sofd she flow both so and

<tensorflow.python.keras.callbacks.History at 0x7f8112f07828>
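
The assignment asks for a function that takes the amount of text to generate and returns it, whereas the callback above prints characters as it goes. Here is a sketch of such a function. It assumes a `predict_fn` argument that maps a one-hot array of shape `(1, maxlen, vocab_size)` to a probability vector; with the model above that could be `lambda x: model.predict(x, verbose=0)[0]`. The function name, its parameters, and the stand-in predictor are all illustrative.

```python
import numpy as np

def generate_text(predict_fn, seed, n_chars, char_int, int_char, rng=None):
    """Return seed followed by n_chars sampled characters."""
    rng = np.random.default_rng() if rng is None else rng
    maxlen, vocab_size = len(seed), len(char_int)
    window, out = seed, []
    for _ in range(n_chars):
        # One-hot encode the current window of characters
        x_pred = np.zeros((1, maxlen, vocab_size))
        for t, ch in enumerate(window):
            x_pred[0, t, char_int[ch]] = 1
        probs = np.asarray(predict_fn(x_pred), dtype='float64')
        probs = probs / probs.sum()
        next_ch = int_char[int(rng.choice(vocab_size, p=probs))]
        out.append(next_ch)
        # Slide the window forward by one character
        window = window[1:] + next_ch
    return seed + ''.join(out)


# Stand-in predictor so the sketch runs without a trained model:
# it always predicts the character after the last one in the window, cyclically.
toy_chars = ['a', 'b', 'c']
toy_char_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_char = {i: c for i, c in enumerate(toy_chars)}

def toy_predict(x):
    last = int(np.argmax(x[0, -1]))
    probs = np.zeros(len(toy_chars))
    probs[(last + 1) % len(toy_chars)] = 1.0
    return probs

print(generate_text(toy_predict, 'ab', 4, toy_char_int, toy_int_char))  # abcabc
```

With the trained model, the seed would be a 40-character slice of `text`, and `char_int` and `int_char` would be the dictionaries built earlier.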

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN