<a href="https://colab.research.google.com/github/skhabiri/DS-Unit-4-Sprint-3-Deep-Learning/blob/main/module1-rnn-and-lstm/LS_DS17_431_RNN_and_LSTM_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## *Data Science Unit 4 Sprint 3 Module 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. Training at the character level keeps things simple and is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal: a function that takes the size of text to generate (e.g. number of characters or lines) as an argument and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [1]:
import requests
import pandas as pd

In [2]:
url = "https://www.gutenberg.org/files/100/100-0.txt"

r = requests.get(url)
r.encoding = r.apparent_encoding
data = r.text
data = data.split('\r\n')
toc = [l.strip() for l in data[44:130:2]]
# Skip the Table of Contents
data = data[135:]

# Fixing Titles
toc[9] = 'THE LIFE OF KING HENRY V'
toc[18] = 'MACBETH'
toc[24] = 'OTHELLO, THE MOOR OF VENICE'
toc[34] = 'TWELFTH NIGHT: OR, WHAT YOU WILL'

locations = {id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

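# Find the line where each title occurs; a later match overwrites an earlier one,
# and titles that never match keep start = -99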
for e,i in enumerate(data):
    for t,title in enumerate(toc):
        if title in i:
            locations[t].update({'start':e})
            

df_toc = pd.DataFrame.from_dict(locations, orient='index')
df_toc['end'] = df_toc['start'].shift(-1).apply(lambda x: x-1)
df_toc.loc[42, 'end'] = len(data)
df_toc['end'] = df_toc['end'].astype('int')

df_toc['text'] = df_toc.apply(lambda x: '\r\n'.join(data[ x['start'] : int(x['end']) ]), axis=1)

In [3]:
#Shakespeare Data Parsed by Play
df_toc.head()

| | title | start | end | text |
|---|---|---|---|---|
| 0 | THE TRAGEDY OF ANTONY AND CLEOPATRA | -99 | 14379 | |
| 1 | AS YOU LIKE IT | 14380 | 17171 | AS YOU LIKE IT\r\n\r\n\r\nDRAMATIS PERSONAE.\r... |
| 2 | THE COMEDY OF ERRORS | 17172 | 20372 | THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r... |
| 3 | THE TRAGEDY OF CORIOLANUS | 20373 | 30346 | THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers... |
| 4 | CYMBELINE | 30347 | 30364 | CYMBELINE.\r\nLaud we the gods;\r\nAnd let our... |


In [4]:
df_toc.shape

(43, 4)

In [5]:
len(df_toc['text'][1])

136678

In [6]:
df_toc['text']

0                                                      
1     AS YOU LIKE IT\r\n\r\n\r\nDRAMATIS PERSONAE.\r...
2     THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r...
3     THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers...
4     CYMBELINE.\r\nLaud we the gods;\r\nAnd let our...
5     THE TRAGEDY OF HAMLET, PRINCE OF DENMARK\r\n\r...
6     THE FIRST PART OF KING HENRY THE FOURTH\r\n\r\...
7     THE SECOND PART OF KING HENRY THE FOURTH\r\n\r...
8                                                      
9     THE LIFE OF KING HENRY V\r\n\r\n\r\n\r\nConten...
10    THE SECOND PART OF KING HENRY THE SIXTH\r\n\r\...
11    THE THIRD PART OF KING HENRY THE SIXTH\r\n\r\n...
12                     KING HENRY THE EIGHTH\r\n\r\n...
13      KING JOHN. O cousin, thou art come to set mi...
14    THE TRAGEDY OF JULIUS CAESAR\r\n\r\n\r\n\r\nCo...
15    THE TRAGEDY OF KING LEAR\r\n\r\n\r\n\r\nConten...
16    LOVE’S LABOUR’S LOST\r\n\r\nDramatis Personae....
17                                              

In [7]:
test = []
for txt in df_toc['text']:
  for x in txt:
    test.append(x)
test[:20]

['A',
 'S',
 ' ',
 'Y',
 'O',
 'U',
 ' ',
 'L',
 'I',
 'K',
 'E',
 ' ',
 'I',
 'T',
 '\r',
 '\n',
 '\r',
 '\n',
 '\r',
 '\n']

In [8]:
# data = df_toc['text'][1:].values
# data.shape

In [12]:
# Strip \r, \n and double quotes from each work and store the results as a list of strings
data = []
# txt is the full text of one of the 43 works
for txt in df_toc['text']:
    # Line breaks are deleted, not replaced with spaces, so the words on either side
    # of a break get fused (e.g. 'AS YOU LIKE ITDRAMATIS'); other whitespace is kept
    data.append(txt.replace('\r', '').replace('\n', '').replace('"', ''))
print(len(data))
# data is a list of 43 strings that still contain runs of consecutive spaces
data[1][:500]

43


'AS YOU LIKE ITDRAMATIS PERSONAE.  DUKE, living in exile  FREDERICK, his brother, and usurper of his dominions  AMIENS, lord attending on the banished Duke  JAQUES,                               LE BEAU, a courtier attending upon Frederick  CHARLES, wrestler to Frederick  OLIVER, son of Sir Rowland de Boys  JAQUES,                     ORLANDO,                    ADAM,   servant to Oliver  DENNIS,               TOUCHSTONE, the court jester  SIR OLIVER MARTEXT, a vicar  CORIN,    shepherd  SILVIUS,'

In [13]:
# Collapse every run of whitespace in each work into a single space
data = [' '.join(work.split()) for work in data]

In [14]:
data[1][:500]

'AS YOU LIKE ITDRAMATIS PERSONAE. DUKE, living in exile FREDERICK, his brother, and usurper of his dominions AMIENS, lord attending on the banished Duke JAQUES, LE BEAU, a courtier attending upon Frederick CHARLES, wrestler to Frederick OLIVER, son of Sir Rowland de Boys JAQUES, ORLANDO, ADAM, servant to Oliver DENNIS, TOUCHSTONE, the court jester SIR OLIVER MARTEXT, a vicar CORIN, shepherd SILVIUS, WILLIAM, a country fellow, in love with Audrey A person representing HYMEN ROSALIND, daughter to t'
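The output above still shows `ITDRAMATIS`, because the earlier cell deleted line breaks instead of turning them into spaces. A minimal sketch of an alternative, assuming it is acceptable to treat every line break as a word boundary; the `raw` string is a made-up stand-in for one entry of `df_toc['text']`:

```python
# Toy stand-in for one parsed work; the real strings come from df_toc['text'].
raw = 'AS YOU LIKE IT\r\n\r\n\r\nDRAMATIS PERSONAE.\r\n\r\n  DUKE, living in exile'

# Deleting line breaks first glues the last word of one line to the first word of the next.
deleted = ' '.join(raw.replace('\r', '').replace('\n', '').split())

# str.split() with no arguments splits on any run of whitespace, including \r and \n,
# so line breaks become single spaces without a separate replace step.
spaced = ' '.join(raw.split())

print(deleted)
print(spaced)
```

With this variant the character-level model would see a space between a heading and the text that follows it, which may or may not matter for the generated output.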

In [15]:
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.optimizers import RMSprop

import numpy as np
import random
import sys
import os
import re


In [16]:
print(len(data))
# Join the 43 works with a space into one long string,
# then convert the whole text to lower case
text = ' '.join(data).lower()

43


In [17]:
# Keep only letters, spaces and basic punctuation; drop everything else,
# such as digits, brackets and typographic quotes
text = re.sub("[^a-z A-Z.:;?'!,`-]", "", text)

In [19]:
text[:500]

' as you like itdramatis personae. duke, living in exile frederick, his brother, and usurper of his dominions amiens, lord attending on the banished duke jaques, le beau, a courtier attending upon frederick charles, wrestler to frederick oliver, son of sir rowland de boys jaques, orlando, adam, servant to oliver dennis, touchstone, the court jester sir oliver martext, a vicar corin, shepherd silvius, william, a country fellow, in love with audrey a person representing hymen rosalind, daughter to '
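To see what the character filter keeps and drops, here is a small sketch that applies the same pattern to a made-up sentence. The sample string is an assumption for illustration; only the pattern comes from the cell above.

```python
import re

# Same character class as the cell above: anything that is NOT a letter, a space,
# or one of . : ; ? ' ! , ` - gets removed.
KEEP_PATTERN = "[^a-z A-Z.:;?'!,`-]"

sample_text = "love’s labour’s lost. act 1, scene 2 [exit]"
cleaned = re.sub(KEEP_PATTERN, "", sample_text)
print(cleaned)
```

Note that the typographic apostrophe `’` is not in the kept set, so titles such as `LOVE’S LABOUR’S LOST` lose their apostrophes, while the straight `'` survives.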

In [None]:
# Build the vocabulary: the unique characters that occur in the text
chars = list(set(text))

char_int = {c:i for i, c in enumerate(chars)}
int_char = {i:c for i, c in enumerate(chars)}

print(len(chars))

36


In [None]:
char_int

{' ': 17,
 '!': 31,
 "'": 27,
 ',': 8,
 '-': 15,
 '.': 4,
 ':': 2,
 ';': 28,
 '?': 33,
 '`': 1,
 'a': 11,
 'b': 23,
 'c': 30,
 'd': 34,
 'e': 21,
 'f': 18,
 'g': 9,
 'h': 19,
 'i': 13,
 'j': 12,
 'k': 7,
 'l': 0,
 'm': 32,
 'n': 26,
 'o': 5,
 'p': 6,
 'q': 3,
 'r': 25,
 's': 29,
 't': 24,
 'u': 10,
 'v': 35,
 'w': 22,
 'x': 20,
 'y': 16,
 'z': 14}
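The integer assigned to each character above depends on the iteration order of a `set`, which is not guaranteed to be the same from one Python session to the next. A minimal sketch of a reproducible mapping, assuming you want to reuse a saved model later; `toy_text` is a made-up example string:

```python
toy_text = "to be, or not to be"

# sorted() yields the same order on every run; list(set(...)) does not promise that.
chars = sorted(set(toy_text))
char_int = {c: i for i, c in enumerate(chars)}
int_char = {i: c for c, i in char_int.items()}

# Round trip: encode to integers, then decode back to the original string.
encoded = [char_int[c] for c in toy_text]
decoded = ''.join(int_char[i] for i in encoded)

print(chars)
print(decoded == toy_text)
```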

In [None]:
# sequence length
maxlen = 40
# stepping
step = 20

# Encode each character of the text as its integer index
encoded = [char_int[c] for c in text]

sequences = [] # Each element is a window of maxlen (40) encoded characters
next_char = [] # The encoded character that follows each window

for i in range(0, len(encoded) - maxlen, step):
    sequences.append(encoded[i : i + maxlen])
    # The target is the encoded character immediately after the window
    next_char.append(encoded[i + maxlen])
    
print('sequences: ', len(sequences), len(encoded)/step)

sequences:  685838 685839.85
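The windowing is easier to follow on a short string. The sketch below wraps the same loop in a helper (the name `make_windows` is hypothetical) and keeps the characters unencoded for readability:

```python
def make_windows(encoded, maxlen, step):
    """Return (sequences, next_items) using the same loop bounds as the cell above."""
    sequences, next_items = [], []
    for i in range(0, len(encoded) - maxlen, step):
        sequences.append(encoded[i:i + maxlen])
        next_items.append(encoded[i + maxlen])
    return sequences, next_items

toy = list("hello world")  # 11 characters
seqs, nxt = make_windows(toy, maxlen=4, step=3)
for s, n in zip(seqs, nxt):
    print(''.join(s), '->', n)
```

A smaller `step` produces more, heavily overlapping windows; a larger `step` produces fewer training examples and a faster epoch.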


In [None]:
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)
for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        # x[sequence index, position within the sequence (0..maxlen-1), encoded character (0..len(chars)-1)]
        # the 1 is stored as boolean True
        x[i, t, char] = 1
    # y[sequence index, encoded character that follows the sequence]
    y[i, next_char[i]] = 1
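The nested loop writes one entry at a time, which can be slow for several hundred thousand sequences. As a sketch of an alternative, NumPy integer-array indexing can set all the entries at once. The tiny vocabulary and windows below are made up for illustration:

```python
import numpy as np

vocab_size = 5                      # hypothetical tiny vocabulary
sequences = [[0, 2, 1], [2, 1, 4]]  # two windows of length 3
next_char = [4, 3]                  # target index for each window

seq_arr = np.array(sequences)
n_seq, window_len = seq_arr.shape

x = np.zeros((n_seq, window_len, vocab_size), dtype=bool)
y = np.zeros((n_seq, vocab_size), dtype=bool)

# rows has shape (n_seq, 1) and cols has shape (1, window_len); together with seq_arr
# they broadcast to (n_seq, window_len) and address x[i, t, seq_arr[i, t]] for every i, t.
rows = np.arange(n_seq)[:, None]
cols = np.arange(window_len)[None, :]
x[rows, cols, seq_arr] = True
y[np.arange(n_seq), next_char] = True

print(x.shape, y.shape)
```

Each `x[i, t]` row then contains exactly one `True`, at the position of the encoded character.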

In [None]:
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
# y is the output next character after the sequence
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 128)               84480     
_________________________________________________________________
dense (Dense)                (None, 36)                4644      
Total params: 89,124
Trainable params: 89,124
Non-trainable params: 0
_________________________________________________________________
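The parameter counts in the summary can be checked by hand. An LSTM layer has four sets of weights (input gate, forget gate, cell candidate, output gate), each with input weights, recurrent weights and a bias. The helper names below are hypothetical; the formula is the standard one for an LSTM with biases:

```python
def lstm_param_count(units, input_dim):
    # Four gates, each with (input_dim + units) weights per unit plus one bias per unit.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(inputs, outputs):
    # One weight per input-output pair plus one bias per output.
    return inputs * outputs + outputs

lstm_params = lstm_param_count(units=128, input_dim=36)
dense_params = dense_param_count(inputs=128, outputs=36)
print(lstm_params, dense_params, lstm_params + dense_params)
# prints: 84480 4644 89124
```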


In [None]:
def sample(preds):
    """
    Sample an index from a probability array.

    Renormalizes preds so they sum to 1, draws one index at random with those
    probabilities, and returns the index that was drawn.
    """
    preds = np.asarray(preds).astype('float64')
    
    # Temperature scaling with a temperature of 1, which leaves the distribution unchanged
    preds = np.log(preds) / 1
    exp_preds = np.exp(preds)
    
    # Normalize so the probabilities sum to one
    preds = exp_preds / np.sum(exp_preds)
    
    # One multinomial draw yields a one-hot vector marking the sampled index
    probas = np.random.multinomial(1, preds, 1)
    
    # argmax recovers the position of the single 1 in that one-hot vector
    return np.argmax(probas)
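The division by 1 in `sample` is where a temperature would go. A sketch of a variant with an explicit `temperature` argument follows; the function name, the `rng` argument and the small epsilon are assumptions and not part of the original cell. Values below 1 make the draw more conservative, values above 1 make it more random.

```python
import numpy as np

def sample_with_temperature(preds, temperature=1.0, rng=None):
    """Draw one index from preds after rescaling the distribution by temperature."""
    rng = np.random.default_rng() if rng is None else rng
    preds = np.asarray(preds, dtype='float64')
    # The epsilon guards against log(0) for characters with zero probability.
    logits = np.log(preds + 1e-12) / temperature
    # Subtracting the maximum keeps exp() from overflowing at low temperatures.
    logits -= logits.max()
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

print(sample_with_temperature([0.1, 0.7, 0.2], temperature=0.5))
```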

In [None]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    # Pick a random start index and use the next maxlen (40) characters as the seed
    start_index = random.randint(0, len(text) - maxlen - 1)
    
    generated = ''
    
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        # 400 is the length of generated text
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1
            
        # Predict the next step (character)
        # preds is an array of length len(chars) holding the predicted
        # probability of each character (the softmax output)
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_char[next_index]
        
        # update the seed by moving one character forward
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
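The assignment asks for a function that takes a size and returns generated text of that size. The loop inside `on_epoch_end` can be factored out for that purpose. The sketch below is one possible shape, not the notebook's own solution: `generate_text` and `predict_fn` are hypothetical names, and a uniform stand-in predictor is used so the block runs without a trained model. With the model above, `predict_fn` could be `lambda x: model.predict(x, verbose=0)[0]`.

```python
import numpy as np

def generate_text(predict_fn, seed, n_chars, char_int, int_char, maxlen, rng=None):
    """Return n_chars newly generated characters that continue seed."""
    if len(seed) < maxlen:
        raise ValueError("seed must contain at least maxlen characters")
    rng = np.random.default_rng() if rng is None else rng
    vocab_size = len(char_int)
    window = seed[-maxlen:]
    generated = []
    for _ in range(n_chars):
        # One-hot encode the current window, as in on_epoch_end.
        x_pred = np.zeros((1, maxlen, vocab_size))
        for t, ch in enumerate(window):
            x_pred[0, t, char_int[ch]] = 1
        # predict_fn returns one probability per character in the vocabulary.
        probs = np.asarray(predict_fn(x_pred), dtype='float64')
        probs = probs / probs.sum()
        next_ch = int_char[int(rng.choice(vocab_size, p=probs))]
        generated.append(next_ch)
        # Slide the window forward by one character.
        window = window[1:] + next_ch
    return ''.join(generated)

# Stand-in vocabulary and predictor (assumptions for the demo only).
chars = sorted(set("abc "))
char_int = {c: i for i, c in enumerate(chars)}
int_char = {i: c for c, i in char_int.items()}
uniform = lambda x: np.full(len(chars), 1.0 / len(chars))

sample_out = generate_text(uniform, "abc a", 50, char_int, int_char, maxlen=5)
print(len(sample_out))
```

Returning the string instead of writing to `sys.stdout` makes the function easy to test and to reuse outside the training callback.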

In [None]:
# fit the model

model.fit(x, y,
          batch_size=32,
          epochs=10,
          callbacks=[print_callback])

Epoch 1/10
----- Generating text after Epoch: 0
----- Generating with seed: "h shame lives in death with glorious fam"
h shame lives in death with glorious fams'row, ganionttent locts. ut are thou detand. whens hame my ofter this not speakel and swleak me, wething on it? there ears god you loth all, he cart by speinlar'd sirn, and in truturemin ithelles; gargen my marters but us the us; and frost feacon with this fixturos.ich ooth.worth sely? it is one me poring is she one our plisive not batth.to this it, but him rome himellard. if our now the dight fo
Epoch 2/10
----- Generating text after Epoch: 1
----- Generating with seed: "mp his windpipe suffocate.but exeter hat"
mp his windpipe suffocate.but exeter hath thing sicld now there. i am, but shemper, with acpias of fool.protius.orselive. o dend saieles. a thing, back, our my piltarge ushdoth till the sound, got duke!what! exeunt stond of were sab friendsperiogk, sutons of himbelftate she hinded but beseding. you seate by heope of mas

<tensorflow.python.keras.callbacks.History at 0x7fb49a5f3d68>

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Tree Bank language dataset
- [4-part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides an example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN