<a href="https://colab.research.google.com/github/tsangrebecca/BloomTech/blob/main/Sprint15/M1_DS_431_RNN_and_LSTM_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

The French mathematician Emile Borel once mused that [**infinite monkeys typing for an infinite amount of time**](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of **Recurrent Neural Networks and LSTMs**!

Our goal in this project is to build a **Shakespeare Sonnet Generator**.<br>
Given a prompt of a few words as input, its task is to generate follow-on text that reads like a Shakespeare Sonnet!<br>

To build our Sonnet Generator we will use a type of model called a **sequence model**. Given a short sequence, a sequence model predicts the **most likely next item in the sequence**. Sequence models are astonishingly versatile and powerful, because the **sequence** we want to predict can be quite general! It can be composed of **words**, or of **characters**, or of **musical notes**, or of data points in a **time series** such as EKG voltages, or stock prices, or even a sequence of **DNA nucleotides**!

We will train our model on the entire corpus of Shakespeare's Sonnets, and the model will learn from that data the most likely patterns of characters.

In [1]:
import random
import sys
import os

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.callbacks import LambdaCallback

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Bidirectional
from tensorflow.keras.layers import LSTM

%matplotlib inline

# import a custom text data preparation class
# !wget https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-3-Deep-Learning/main/module1-rnn-and-lstm/data_cleaning_toolkit_class.py
# the hosted file is outdated (it still uses np.bool, which newer NumPy versions no longer provide), so import a local copy of the module instead
from data_cleaning_toolkit_class import data_cleaning_toolkit

In [2]:
# def create_X_and_Y(self):
#         """
#         Takes a sequence of chars and creates an input/output split (i.e. X and Y)

#         Parameters
#         ----------
#         None

#         Returns
#         -------
#         x: array of Booleans (i.e. True and False)
#         y: array of Booleans (i.e. True and False)
#         """
#         # this is the number of rows in the doc-term matrix that we are about to create (i.e. x)
#         n_seqs = len(self.sequences)

#         # this is the number of features in the doc-term matrix that we are about to create
#         n_unique_chars = len(self.unique_chars)

#         # Create shape for x and y
#         x_dims = (len(self.sequences), self.maxlen, len(self.unique_chars))
#         y_dims = (len(self.sequences),len(self.unique_chars))

#         # create data containers for x and y
#         # default values will all be zero ( i.e. look up docs for np.zeros() )
#         # recall that a value of zero is equivalent to False in Python
#         x = np.zeros(x_dims, dtype=bool)
#         y = np.zeros(y_dims, dtype=bool)

#         # populate x and y with 1 (from a Boolean perspective, 1 and True are the same thing)
#         # iterate through the index and sequence
#         for i, sequence in enumerate(self.sequences):
#             # take that sequence and iterate through the chars in the sequence
#             for t, char in enumerate(sequence):
#                 # for row i, location in time series t, and feature char
#                 # assign a value of 1
#                 # recall we are using encoded chars from def create_char_sequences()
#                 # meaning characters are now represented by a numerical value
#                 x[i,t,char] = 1

#             # follow similar for the char that should be predicted by the model
#             # given the corresponding sequence of chars in x
#             y[i, self.next_char[i]] = 1

#         return x, y

### Use `requests` to pull data from a URL

[**Read through the requests documentation**](https://requests.readthedocs.io/en/master/user/quickstart/#make-a-request) to learn how to download the Shakespeare Sonnets from the Gutenberg website.


In [3]:
# download all of Shakespeare's Sonnets from the Project Gutenberg website

# here's the link for the sonnets
url_shakespeare_sonnets = "https://www.gutenberg.org/cache/epub/1041/pg1041.txt"

# use requests and the url to download all of the sonnets - save the result to `data`
data = requests.get(url_shakespeare_sonnets)


In [4]:
# extract the downloaded text from the requests object and save it to `raw_text_data`
# hint: take a look at the attributes of `data`
# YOUR CODE HERE
raw_text_data = data.text
#raise NotImplementedError()

In [5]:
# check the data type of `raw_text_data`
assert isinstance(raw_text_data, str)

### Data Cleaning

In [6]:
# as usual, we need to clean up the messy data
raw_text_data[:3000]

"\ufeffThe Project Gutenberg eBook of Shakespeare's Sonnets\r\n    \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this ebook or online\r\nat www.gutenberg.org. If you are not located in the United States,\r\nyou will have to check the laws of the country where you are located\r\nbefore using this eBook.\r\n\r\nTitle: Shakespeare's Sonnets\r\n\r\nAuthor: William Shakespeare\r\n\r\nRelease date: September 1, 1997 [eBook #1041]\r\n                Most recently updated: March 10, 2024\r\n\r\nLanguage: English\r\n\r\nCredits: the Project Gutenberg Shakespeare Team\r\n\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK SHAKESPEARE'S SONNETS ***\r\nTHE SONNETS\r\n\r\nby William Shakespeare\r\n\r\n\r\n\r\n\r\nI\r\n\r\nFrom fairest creatures we desire increase,\r\nThat there

In [7]:
# Which characters could we use with the split() method to split the text into lines?

# split the text into **lines** and save the result to `split_data`

# YOUR CODE HERE
split_chars = '\r\n'
split_data = raw_text_data.split(split_chars)
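
The raw text shown earlier starts with a byte-order mark (`\ufeff`) and uses `\r\n` line endings. As a rough sketch of an alternative to splitting on a hard-coded separator, `str.splitlines()` copes with `\r\n`, `\n`, and `\r` alike and does not leave a trailing empty string. The `sample_text` below is a made-up miniature of `raw_text_data`, not the real download.

```python
# Hypothetical miniature of the downloaded text: BOM first, Windows line endings.
sample_text = "\ufeffThe Project Gutenberg eBook\r\n\r\nFrom fairest creatures we desire increase,\r\n"

# Drop the byte-order mark, then split on any kind of line ending.
lines = sample_text.lstrip("\ufeff").splitlines()

# For comparison: splitting on "\r\n" keeps the BOM and adds a trailing empty string.
naive = sample_text.split("\r\n")

print(lines)
print(naive)
```

Note that switching approaches can change the number of list elements, so any hard-coded slice indices chosen afterwards would need to be rechecked.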

In [8]:
# we need to drop all the boilerplate text (i.e. titles and descriptions) as well as extra white spaces
# so that we are left with only the sonnets themselves
split_data[:80]

["\ufeffThe Project Gutenberg eBook of Shakespeare's Sonnets",
 '    ',
 'This ebook is for the use of anyone anywhere in the United States and',
 'most other parts of the world at no cost and with almost no restrictions',
 'whatsoever. You may copy it, give it away or re-use it under the terms',
 'of the Project Gutenberg License included with this ebook or online',
 'at www.gutenberg.org. If you are not located in the United States,',
 'you will have to check the laws of the country where you are located',
 'before using this eBook.',
 '',
 "Title: Shakespeare's Sonnets",
 '',
 'Author: William Shakespeare',
 '',
 'Release date: September 1, 1997 [eBook #1041]',
 '                Most recently updated: March 10, 2024',
 '',
 'Language: English',
 '',
 'Credits: the Project Gutenberg Shakespeare Team',
 '',
 '',
 "*** START OF THE PROJECT GUTENBERG EBOOK SHAKESPEARE'S SONNETS ***",
 'THE SONNETS',
 '',
 'by William Shakespeare',
 '',
 '',
 '',
 '',
 'I',
 '',
 'From fairest creatures 

**Use list index slicing to remove the titles and descriptions, so we only have the sonnets.**


In [9]:
# we need to drop all the boilerplate text (i.e., titles and descriptions) as well as extra white spaces
# so that we are left with only the sonnets themselves

# find index boundaries (start, end) so that sonnets exist between these indices, titles and descriptions exist outside of these indices
start_index = 45
end_index = -369

# use index slicing to isolate the sonnet lines from the text - save the result to `sonnets`
sonnets = split_data[start_index:end_index]
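
Hard-coded indices depend on the exact file that was downloaded, and Project Gutenberg files are updated from time to time. In the output of this notebook the slice begins at the final line of Sonnet I, which hints that the boundaries may have drifted. A minimal sketch of a more robust approach is to search for the marker lines instead. The `START` marker appears in the printout above; the matching `END` marker text and the helper name `slice_between_markers` are assumptions made for illustration.

```python
def slice_between_markers(lines, start_prefix, end_prefix):
    """Return the lines strictly between the first line starting with
    start_prefix and the first later line starting with end_prefix."""
    start = next(i for i, line in enumerate(lines) if line.startswith(start_prefix))
    end = next(
        i for i, line in enumerate(lines)
        if i > start and line.startswith(end_prefix)
    )
    return lines[start + 1:end]

# Toy stand-in for `split_data`; the END marker line is assumed, not taken from the file.
toy_lines = [
    "Title: Shakespeare's Sonnets",
    "*** START OF THE PROJECT GUTENBERG EBOOK SHAKESPEARE'S SONNETS ***",
    "THE SONNETS",
    "I",
    "From fairest creatures we desire increase,",
    "*** END OF THE PROJECT GUTENBERG EBOOK SHAKESPEARE'S SONNETS ***",
    "License text ...",
]

body = slice_between_markers(toy_lines, "*** START OF", "*** END OF")
print(body)
```

If either marker is missing, `next` raises `StopIteration`, which is a reasonable signal that the file format has changed.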

In [10]:
sonnets

['    To eat the world’s due, by the grave and thee.',
 '',
 'II',
 '',
 'When forty winters shall besiege thy brow,',
 'And dig deep trenches in thy beauty’s field,',
 'Thy youth’s proud livery so gazed on now,',
 'Will be a tatter’d weed of small worth held:',
 'Then being asked, where all thy beauty lies,',
 'Where all the treasure of thy lusty days;',
 'To say, within thine own deep sunken eyes,',
 'Were an all-eating shame, and thriftless praise.',
 'How much more praise deserv’d thy beauty’s use,',
 'If thou couldst answer ‘This fair child of mine',
 'Shall sum my count, and make my old excuse,’',
 'Proving his beauty by succession thine!',
 '    This were to be new made when thou art old,',
 '    And see thy blood warm when thou feel’st it cold.',
 '',
 'III',
 '',
 'Look in thy glass and tell the face thou viewest',
 'Now is the time that face should form another;',
 'Whose fresh repair if now thou not renewest,',
 'Thou dost beguile the world, unbless some mother.',
 'For wher

Notice that there are many lines that should not be counted as sonnets!

In [11]:
# the non-sonnet lines (blank lines and Roman numeral headings) have far fewer characters than the actual sonnet lines
# so we can use line length to take out the Roman numerals
sonnets[200:240]

['Who lets so fair a house fall to decay,',
 'Which husbandry in honour might uphold,',
 'Against the stormy gusts of winter’s day',
 'And barren rage of death’s eternal cold?',
 '    O! none but unthrifts. Dear my love, you know,',
 '    You had a father: let your son say so.',
 '',
 'XIV',
 '',
 'Not from the stars do I my judgement pluck;',
 'And yet methinks I have astronomy,',
 'But not to tell of good or evil luck,',
 'Of plagues, of dearths, or seasons’ quality;',
 'Nor can I fortune to brief minutes tell,',
 'Pointing to each his thunder, rain and wind,',
 'Or say with princes if it shall go well',
 'By oft predict that I in heaven find:',
 'But from thine eyes my knowledge I derive,',
 'And constant stars in them I read such art',
 'As ‘Truth and beauty shall together thrive,',
 'If from thyself, to store thou wouldst convert’;',
 '    Or else of thee this I prognosticate:',
 '    ‘Thy end is truth’s and beauty’s doom and date.’',
 '',
 'XV',
 '',
 'When I consider everything 

In [12]:
# total number of lines in the downloaded file, boilerplate included (not just lines of poetry)
len(split_data)

3003

In [13]:
# use your judgement to decide on a good value for
#   the minimum number of characters that a sonnet line should have
#   call it n_chars
# any line with 10 or fewer characters is discarded
n_chars = 10

# Let's use that observation to filter out all the non-sonnet lines!
#    save results to `filtered_sonnets`
# Hint: use a list comprehension
# for each line that passes the length check, strip the leading spaces
# (the closing couplets are indented)
filtered_sonnets = [line.lstrip(" ") for line in sonnets if len(line) > n_chars]  # len(line) is the number of chars in the line

# YOUR CODE HERE
# raise NotImplementedError()
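
The length threshold works here because headings and blank lines happen to be short. As a hedged alternative, the Roman numeral headings can be matched explicitly with a regular expression, so the filter does not depend on a chosen character count. The pattern and the `toy_sonnets` list below are illustrative assumptions, not part of the assignment.

```python
import re

# Matches a line made up only of Roman-numeral letters (e.g. "II", "XIV", "CLIV").
# This is a loose pattern: it does not check that the numeral is well formed.
ROMAN_HEADING = re.compile(r"^[IVXLC]+$")

# Toy stand-in for the `sonnets` list.
toy_sonnets = [
    "II",
    "",
    "When forty winters shall besiege thy brow,",
    "    This were to be new made when thou art old,",
    "XIV",
]

kept = [
    line.strip()
    for line in toy_sonnets
    if line.strip() and not ROMAN_HEADING.match(line.strip())
]
print(kept)
```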

In [14]:
# ok - much better!
# but we still need to remove all the punctuation and case normalize the text
filtered_sonnets

['To eat the world’s due, by the grave and thee.',
 'When forty winters shall besiege thy brow,',
 'And dig deep trenches in thy beauty’s field,',
 'Thy youth’s proud livery so gazed on now,',
 'Will be a tatter’d weed of small worth held:',
 'Then being asked, where all thy beauty lies,',
 'Where all the treasure of thy lusty days;',
 'To say, within thine own deep sunken eyes,',
 'Were an all-eating shame, and thriftless praise.',
 'How much more praise deserv’d thy beauty’s use,',
 'If thou couldst answer ‘This fair child of mine',
 'Shall sum my count, and make my old excuse,’',
 'Proving his beauty by succession thine!',
 'This were to be new made when thou art old,',
 'And see thy blood warm when thou feel’st it cold.',
 'Look in thy glass and tell the face thou viewest',
 'Now is the time that face should form another;',
 'Whose fresh repair if now thou not renewest,',
 'Thou dost beguile the world, unbless some mother.',
 'For where is she so fair whose unear’d womb',
 'Disdain

### Use Custom Data Cleaning Tool

Use one of the methods in the `data_cleaning_toolkit` to clean your data.

There is an example of this in the guided project.

In [15]:
# instantiate the data_cleaning_toolkit class - save result to `dctk`
# can check out the code in the python file named 'data_cleaning_toolkit_class.py' in the bloomtech github folder that's downloaded locally
dctk = data_cleaning_toolkit()

In [16]:
# use data_cleaning_toolkit to remove punctuation and to case normalize - save results to `clean_sonnets`
clean_sonnets = [dctk.clean_data(line) for line in filtered_sonnets]


In [17]:
# much better!
display(clean_sonnets)
print(len(clean_sonnets))

['to eat the worlds due by the grave and thee',
 'when forty winters shall besiege thy brow',
 'and dig deep trenches in thy beautys field',
 'thy youths proud livery so gazed on now',
 'will be a tatterd weed of small worth held',
 'then being asked where all thy beauty lies',
 'where all the treasure of thy lusty days',
 'to say within thine own deep sunken eyes',
 'were an alleating shame and thriftless praise',
 'how much more praise deservd thy beautys use',
 'if thou couldst answer this fair child of mine',
 'shall sum my count and make my old excuse',
 'proving his beauty by succession thine',
 'this were to be new made when thou art old',
 'and see thy blood warm when thou feelst it cold',
 'look in thy glass and tell the face thou viewest',
 'now is the time that face should form another',
 'whose fresh repair if now thou not renewest',
 'thou dost beguile the world unbless some mother',
 'for where is she so fair whose uneard womb',
 'disdains the tillage of thy husbandry',
 

2128
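
Judging from the printout above, cleaning lower-cases each line and drops everything except letters and spaces (for example, `all-eating` becomes `alleating`). The function below is a hypothetical stand-in that reproduces that visible behavior with regular expressions; the toolkit's own `clean_data` method may well be implemented differently.

```python
import re

def clean_line(line):
    """Lowercase a line and keep only the letters a-z and spaces.

    Hypothetical stand-in for illustration; not the toolkit's code.
    """
    lowered = line.lower()
    letters_only = re.sub(r"[^a-z ]", "", lowered)
    # collapse any runs of spaces left behind by removed characters
    return re.sub(r" +", " ", letters_only).strip()

first = clean_line("To eat the world’s due, by the grave and thee.")
second = clean_line("Were an all-eating shame, and thriftless praise.")
print(first)
print(second)
```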


### Use Your Data Tool to Create Character Sequences for the LSTM Model

We'll need the `create_char_sequences` method for this task. <br>
However, this method requires a parameter called `maxlen`, which sets the maximum sequence length.

So what would be a good sequence length, exactly? Every line has a different number of characters.

To answer that question, let's do some statistics!

In [18]:
def calc_stats(corpus):
    """
    Calculates statistics on the length of every line in the sonnets
    """

    # write a list comprehension that calculates each sonnet's line length - save the results to `doc_lens`
    doc_lens = [len(line) for line in corpus]

    # use numpy to calculate and return the mean, median, std, max, min of the doc lens - all in one line of code
    return np.mean(doc_lens), np.median(doc_lens), np.std(doc_lens), np.max(doc_lens), np.min(doc_lens)


In [19]:
# sonnet line length statistics
mean, med, std, max_, min_ = calc_stats(clean_sonnets) # trailing underscores avoid shadowing Python's built-in max() and min()
mean, med, std, max_, min_

(40.892387218045116, 41.0, 4.0451142832246285, 57, 27)

In [20]:
# from the results of the sonnet line length statistics, use your judgement to select a value for maxlen
#   hint -- a good value might be half the median length of a sonnet line
#   (this run uses 40 instead, close to the median line length of 41)
# use .create_char_sequences() to create sequences

maxlen = 40
dctk.create_char_sequences(clean_sonnets, maxlen=maxlen)


Created 17822 sequences.
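
To make the idea of character sequences concrete, here is a minimal sketch of a sliding-window scheme: each window of `maxlen` characters becomes one input, and the character right after it becomes the target. The function name `make_char_sequences` and the `step` parameter are assumptions for illustration; the toolkit's `create_char_sequences` method may differ in its details.

```python
def make_char_sequences(text, maxlen, step):
    """Slide a window of `maxlen` characters over `text`, moving `step`
    characters at a time. Each window is paired with the character that
    immediately follows it."""
    sequences, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        sequences.append(text[i:i + maxlen])
        next_chars.append(text[i + maxlen])
    return sequences, next_chars

toy_text = "shall i compare thee"
seqs, targets = make_char_sequences(toy_text, maxlen=5, step=3)
for s, t in zip(seqs, targets):
    print(repr(s), "->", repr(t))
```

Under this scheme, a step of 5 over the 89146-character joined text reported later in this notebook would give 17822 windows, the same count printed above. That agreement is suggestive, but it does not confirm how the toolkit works internally.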


Take a look at the `data_cleaning_toolkit_class.py` file.

In the first 4 lines of code in the `create_char_sequences` method, class attributes `n_features` and `unique_chars` are created. <br>
Let's call them in the cells below.

In [21]:
# number of output features for our LSTM model, i.e., how many unique labels the model can predict:
# the letters a to z plus the white space
dctk.n_features

27

In [22]:
# unique characters that appear in our sonnets
dctk.unique_chars

['k',
 'j',
 'y',
 'l',
 'v',
 'h',
 'i',
 'm',
 ' ',
 'z',
 'b',
 'u',
 'd',
 's',
 'w',
 'n',
 'r',
 'e',
 't',
 'o',
 'c',
 'f',
 'g',
 'p',
 'x',
 'q',
 'a']

In [23]:
len(dctk.unique_chars)

27

## Time for Questions

----
**Question 1:**

Why are the `number of unique characters` (i.e., **dctk.unique_chars**) and the `number of model output features` (i.e., **dctk.n_features**) the same?

**Hint:** The model that we will shortly build here is very similar to the text generation model we built in the guided project.

**Answer 1:**

Our model is a multi-class classification model: given a sequence of characters, it predicts which character should come next. Every unique character that appears in the sonnets is a possible prediction, so the output layer uses softmax to produce one probability for each unique character.

The number of unique characters and the number of model output features are the same because every unique character is a possible outcome of a prediction.


**Question 2:**

Take a look at the printout of `dctk.unique_chars` one more time. Notice that there is a white space.

Why is it desirable to have a white space as a possible character to predict?

**Answer 2:**

In order for the model to generate legible text, it needs to be able to put a white space between words.

----

### Use Our Data Tool to Create X and Y Splits

You'll need the `create_X_and_Y` method for this task.

In [24]:
# TODO: provide a walkthrough of data_cleaning_toolkit with unit tests that check for understanding
X, y = dctk.create_X_and_Y()

![](https://miro.medium.com/max/891/0*jGB1CGQ9HdeUwlgB)

In [37]:
# notice that our input array isn't a matrix - it's a rank three tensor
# (number of sequences, length of each sequence, number of unique chars aka number of features to predict)
X.shape

(17822, 40, 27)

In `X.shape`, we see three numbers (*n1*, *n2*, *n3*). What do these numbers mean?

Well, *n1* tells us the number of samples that we have. But what about the other two?

In [38]:
# first index returns a single sample, which we can see is a sequence
first_sample_index = 0
X[first_sample_index]
# each sample is 40 rows tall and 27 columns wide, each row is one-hot-encoded, there are a lot of Falses, each row only has one True

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

Notice that each sequence (i.e., $X[i]$ where $i$ is some index value) is `maxlen` long and <br>
has a number of features equal to `dctk.n_features`. <br>Let's try to understand this shape.

In [27]:
# each sequence is maxlen long (40) and has dctk.n_features number of features (27)
X[first_sample_index].shape

(40, 27)

**Each row corresponds to a character vector,** and there are `maxlen` character vectors.

**Each column corresponds to a unique character,** and there are `dctk.n_features` number of features.
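
As a small illustration of this row-and-column layout, the sketch below one-hot encodes a single short sequence with NumPy. The five-character vocabulary `toy_chars` is made up for the example; in the notebook the vocabulary comes from `dctk.unique_chars`.

```python
import numpy as np

# Hypothetical toy vocabulary; the real one comes from `dctk.unique_chars`.
toy_chars = [" ", "a", "b", "e", "t"]
char_to_int = {c: i for i, c in enumerate(toy_chars)}

sequence = "beat a"  # one sequence of 6 characters
x = np.zeros((len(sequence), len(toy_chars)), dtype=bool)
for t, char in enumerate(sequence):
    # row t marks which character sits at position t of the sequence
    x[t, char_to_int[char]] = True

print(x.shape)        # (sequence length, vocabulary size)
print(x.sum(axis=1))  # number of True values in each row
```

Stacking many such arrays along a new first axis gives the rank-3 shape (number of sequences, sequence length, vocabulary size).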


In [28]:
# let's index for a single character vector
first_char_vect_index = 0
X[first_sample_index][first_char_vect_index]

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False, False, False, False, False, False, False])

Notice that there is a single `True` value, and all the rest of the values are `False`.

This is a one-hot encoding for which character appears at each index within a sequence. Specifically, the cell above is looking at the first character in the sequence.

Only a single character can appear as the first character in a sequence, so there will be a single `True` value, and the rest will be `False`.

Let's say that `True` appears at the $i$-th index, where $i$ stands for some index in the general case. How can we find out which character it corresponds to?

To answer this question, we need to use the integer-to-character look-up dictionary.

In [29]:
# take a look at the index-to-character dictionary
# if a True appears at index 0 of a character vector,
# then whatever char you see below next to key 0
# is the character that the vector is encoding
dctk.int_char

{0: 'k',
 1: 'j',
 2: 'y',
 3: 'l',
 4: 'v',
 5: 'h',
 6: 'i',
 7: 'm',
 8: ' ',
 9: 'z',
 10: 'b',
 11: 'u',
 12: 'd',
 13: 's',
 14: 'w',
 15: 'n',
 16: 'r',
 17: 'e',
 18: 't',
 19: 'o',
 20: 'c',
 21: 'f',
 22: 'g',
 23: 'p',
 24: 'x',
 25: 'q',
 26: 'a'}

In [30]:
# let's look at an example to tie it all together

seq_len_counter = 0

# index for a single sample
for seq_of_char_vects in X[first_sample_index]:

    # get index with max value, which will be the one TRUE value
    index_with_TRUE_val = np.argmax(seq_of_char_vects)

    print (dctk.int_char[index_with_TRUE_val])

    seq_len_counter+=1

print ("Sequence length: {}".format(seq_len_counter))

t
o
 
e
a
t
 
t
h
e
 
w
o
r
l
d
s
 
d
u
e
 
b
y
 
t
h
e
 
g
r
a
v
e
 
a
n
d
 
t
Sequence length: 40
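
The decoding loop can also be written without an explicit Python loop over rows. The sketch below, which uses a made-up four-character lookup instead of `dctk.int_char`, calls `argmax` along axis 1 to find the position of the single `True` in every row at once.

```python
import numpy as np

# Hypothetical toy lookup; the notebook's real one is `dctk.int_char`.
int_to_char = {0: " ", 1: "a", 2: "e", 3: "t"}

# One-hot rows spelling "eat a": picking rows of the identity matrix
# gives one True per row at the chosen column.
one_hot = np.eye(4, dtype=bool)[[2, 1, 3, 0, 1]]

# argmax along axis 1 returns the column index of the True in each row
indices = one_hot.argmax(axis=1)
decoded = "".join(int_to_char[int(i)] for i in indices)
print(decoded)
```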


## Time for Questions

----
**Question 1:**

In your own words, how would you describe the numbers from the shape printout of `X.shape` to a classmate?


**Answer 1:**

`X` is a rank-3 tensor (3D), and `X.shape` lists its three dimensions in order: the number of sequences, the length of each sequence (i.e., the number of characters in it), and the number of features, which equals the number of unique characters (26 letters of the alphabet plus a white space) that the model can predict.

----


### Build a Shakespeare Sonnet Text Generation Model

Now that we have prepped our data (and understood that process), let's finally build out our character generation model, similar to what we did in the guided project.<br>

First, we'll create a **callback** to monitor the training -- by printing a sample of text generated by the model at the end of each epoch.

Helper function to generate a sample character:

In [39]:
def sample(preds, temperature=1.0):
    """
    Helper function to generate a sample character
    Input is a predictions vector from our model, for example a set of 27 character probabilities
    Output is the index of the generated character
    """
    # convert predictions to an array
    preds = np.asarray(preds).astype('float64')

    # use the temperature hyper-parameter to "warp" (sharpen or spread out) the probability distribution
    preds = np.log(preds) / temperature

    # use the softmax activation function to create a new list of probabilities
    #   corresponding to the "warped" probability distribution
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)

    # Draw a single sample from a multinomial distribution, given these probabilities
    #   The sample will be a one-hot encoded character
    """ Notes on the np.random.multinomial() function
       The first argument is the number of "trials" we want: 1 in this case
       The second argument is the list of probabilities for each character
       The third argument is number of sets of "trials" we want: again, 1 in this case
       By analogy with a dice-rolling experiment:
          This "trial" consists of generating a single "throw" of a die with 27 faces;
             each face corresponds to a character and its associated probability
    """

    probas = np.random.multinomial(1, preds, 1)

    # return the index of the sampled character (the position of the 1 in the one-hot draw)
    return np.argmax(probas)
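
To see what the temperature does, the sketch below applies the same rescaling steps (divide the log-probabilities by the temperature, then renormalize) to a made-up three-character distribution, using only the standard library. The helper name `apply_temperature` and the example probabilities are assumptions for illustration.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability list: divide log-probabilities by the
    temperature, then renormalize with a softmax."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.6, 0.3, 0.1]  # made-up predictions over three characters
sharp = apply_temperature(probs, 0.5)
same = apply_temperature(probs, 1.0)
flat = apply_temperature(probs, 2.0)

for name, dist in (("T=0.5", sharp), ("T=1.0", same), ("T=2.0", flat)):
    print(name, [round(p, 3) for p in dist])
```

A temperature below 1 concentrates probability on the most likely character, so generated text becomes more repetitive but safer. A temperature above 1 flattens the distribution, so sampling becomes more varied but noisier. A temperature of exactly 1 leaves the distribution unchanged.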


Create the `on_epoch_end` function to be passed into `LambdaCallback()`

In [32]:
def on_epoch_end(epoch, _):
    """
    Function invoked at the end of each epoch. Prints the text generated by our model.
    """

    print()
    print('----- Generating text after Epoch: %d' % epoch)


    # randomly pick a starting index
    # will be used to take a random sequence of chars from `text`
    start_index = random.randint(0, len(text) - dctk.maxlen - 1)

    # this is our seed string (i.e. input sequence into the model)
    generated = ''

    # start the sentence at index `start_index` and include the next `dctk.maxlen` number of chars
    sentence = text[start_index: start_index + dctk.maxlen]

    # add to generated
    generated += sentence


    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)

    # use model to predict what the next maxlen chars should be that follow the seed string
    for i in range(maxlen):

        # shape of a single sample in a rank 3 tensor
        x_dims = (1, dctk.maxlen, dctk.n_features)
        # create an array of zeros with shape x_dims
        # recall that python considers zeros and boolean FALSE as the same
        x_pred = np.zeros(x_dims)

        # create a seq vector for our randomly selected sequence
        # i.e. create a numerical encoding for each char in the sequence
        for t, char in enumerate(sentence):
            # for sample 0 in seq index t and character `char` encode a 1 (which is the same as a TRUE)
            x_pred[0, t, dctk.char_int[char]] = 1

        # next, take the seq vector and pass into model to get a prediction of what the next char should be
        preds = model.predict(x_pred, verbose=0)[0]
        # use the sample helper function to get index for next char
        next_index = sample(preds)
        # use look up dict to get next char
        next_char = dctk.int_char[next_index]

        # append next char to sequence
        sentence = sentence[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()

In [33]:
# need this for on_epoch_end()
text = " ".join(clean_sonnets)
print(f'All of Shakespeare\'s sonnets comprise about {len(text)} characters')

All of Shakespeare's sonnets comprise about 89146 characters


Create the callback object

In [34]:
# create callback object that will print out text generation at the end of each epoch
# use for real-time monitoring of model performance
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

----
### Build and Train Model

Build a text generation model using LSTMs. Feel free to reference the model used in the guided project.

It is recommended that you train this model for at least 50 epochs (but more if your computer can handle it).

You are free to change up the architecture as you wish.

Just in case you have difficulty training a model, there is a pre-trained model saved to a file called `trained_text_gen_model.h5` that you can load (in the same way that you learned how to load in Keras models in Sprint 2 Module 4).

In [40]:
# build text generation model layer by layer
# fit model

model = Sequential()

# hidden layer 1
model.add(LSTM(264, # arbitrary, can be determined by gridsearch
               input_shape=(dctk.maxlen, dctk.n_features), # think of input_shape as implicitly declaring the input
               return_sequences=True, # return the full sequence so the next LSTM layer gets one input per time step
               activation='relu'))

# hidden layer 2
model.add(LSTM(128, activation='relu')) # 128 is arbitrary

# Output layer, recall n_features = number of nodes in the output layer, aka 27
model.add(Dense(dctk.n_features,
                activation='softmax')) # softmax for multi-class output; a single sigmoid unit would be the choice for binary classification

# compile model
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Fit the model
model.fit(X, y,
          batch_size=256,
          epochs=50, # might not be enough
          callbacks=[print_callback],
          workers=10)


Epoch 1/50
----- Generating text after Epoch: 0
----- Generating with seed: "s breath their masked buds discloses but"
s breath their masked buds discloses butpaoetan o e oaw go  aepi  byeguonf ran  
Epoch 2/50
----- Generating text after Epoch: 1
----- Generating with seed: "wn praise to mine own self bring and wha"
wn praise to mine own self bring and what c  n oav  hpands dgri   h  hto uepuyjl
Epoch 3/50
----- Generating text after Epoch: 2
----- Generating with seed: "l above that idle rank remain beyond all"
l above that idle rank remain beyond allms wyntan n koboirhoae ue s glrkhys a  e
Epoch 4/50
----- Generating text after Epoch: 3
----- Generating with seed: "hee love is too young to know what consc"
hee love is too young to know what consceueysotdmh f eshosntr  c fhehdrgcf  nsrt
Epoch 5/50
----- Generating text after Epoch: 4
----- Generating with seed: "sure the which he will not every hour su"
sure the which he will not every hour suoestwelaasrn iens ahvmef ihnl i mcabryev


<keras.src.callbacks.History at 0x7f0821942e60>
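
The printout above does not include loss values, so a reference point may help when reading them during training. For a single one-hot target, categorical cross-entropy reduces to the negative log of the probability assigned to the correct character. The sketch below works out the loss of a model that guesses uniformly among the 27 characters; the 0.9 figure is just a made-up contrast.

```python
import math

n_classes = 27  # number of output features reported by `dctk.n_features`

def cross_entropy(p_true):
    """Categorical cross-entropy for one sample with a one-hot target:
    the negative log of the probability given to the correct class."""
    return -math.log(p_true)

uniform_loss = cross_entropy(1 / n_classes)  # a model that guesses uniformly
confident_loss = cross_entropy(0.9)          # a model that is 90% sure and correct

print(round(uniform_loss, 3), round(confident_loss, 3))
```

A training loss near `ln(27)` (about 3.3) would therefore mean the model is doing no better than uniform guessing, and lower values indicate that it has started to learn character patterns.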

In [None]:
# the seed string is the model's input; the characters printed after it are the model's output
# in the early epochs the output is still gibberish and doesn't look like English
# around the 50th epoch it starts to make more sense

### Save the trained model to a file

In [41]:
# save trained model to file
model.save("trained_text_gen_model.h5")



### Let's Play With Our Trained Model

Now that we have a trained model that, though far from perfect, can generate actual English words, we can look at the predictions to continue learning more about how a text generation model works.

We can also take this as an opportunity to unpack the `on_epoch_end` function to understand better how it works.

In [42]:
# this is our joined clean sonnet data
text



In [43]:
# randomly pick a starting index
# will be used to take a random sequence of chars from `text`
# run this cell a few times and you'll see `start_index` is random
start_index = random.randint(0, len(text) - dctk.maxlen - 1)
start_index

65654

In [44]:
# next use the randomly selected starting index to sample a sequence from the `text`

# this is our seed string (i.e., input sequence into the model)
generated = ''

# start the sentence at index `start_index` and include the next `dctk.maxlen` number of chars
sentence = text[start_index: start_index + dctk.maxlen]

# add to generated
generated += sentence

In [45]:
# display the "seed string" i.e. the input sequence into the model
print('----- Input seed: "' + sentence + '"')

----- Input seed: " doth catch for if it see the rudst or g"


In [46]:
# use model to predict what the next maxlen chars should be that follow the seed string
for i in range(maxlen):

    # shape of a single sample in a rank 3 tensor
    x_dims = (1, dctk.maxlen, dctk.n_features)
    # create an array of zeros with shape x_dims
    # recall that python considers zeros and boolean FALSE as the same
    x_pred = np.zeros(x_dims)

    # create a seq vector for our randomly selected sequence
    # i.e. create a numerical encoding for each char in the sequence
    for t, char in enumerate(sentence):
        # for sample 0 in seq index t and character `char` encode a 1 (which is the same as a TRUE)
        x_pred[0, t, dctk.char_int[char]] = 1

    # next, take the seq vector and pass into model to get a prediction of what the next char should be
    preds = model.predict(x_pred, verbose=0)[0]
    # use the sample helper function to get index for next char
    next_index = sample(preds)
    # use look up dict to get next char
    next_char = dctk.int_char[next_index]

    # append next char to sequence
    sentence = sentence[1:] + next_char

In [47]:
# this is the seed string
generated

' doth catch for if it see the rudst or g'

In [48]:
# these are the maxlen chars that the model thinks should come after the seed string
sentence

'ritive him from think her onces and juy '

In [49]:
# now put it all together
generated + sentence

' doth catch for if it see the rudst or gritive him from think her onces and juy '
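
The step-by-step cells above can be folded into one reusable function. The sketch below is one possible way to do that, not the notebook's own code: `generate` and `predict_fn` are hypothetical names, and a stub predictor that always returns uniform probabilities stands in for the trained model so the example runs without TensorFlow. It samples directly from the predicted probabilities, which corresponds to a temperature of 1.0.

```python
import numpy as np

def generate(seed, n_chars, predict_fn, char_int, int_char, rng):
    """Extend `seed` by `n_chars` characters, one prediction at a time.

    `predict_fn` is a hypothetical callable that takes a one-hot array of
    shape (1, len(seed), n_features) and returns n_features probabilities.
    """
    n_features = len(char_int)
    window, output = seed, seed
    for _ in range(n_chars):
        # one-hot encode the current window as a single sample
        x_pred = np.zeros((1, len(window), n_features))
        for t, char in enumerate(window):
            x_pred[0, t, char_int[char]] = 1
        probs = predict_fn(x_pred)
        next_index = rng.choice(n_features, p=probs)  # sample the next character
        next_char = int_char[int(next_index)]
        window = window[1:] + next_char               # slide the window forward
        output += next_char
    return output

# Toy vocabulary and a stub predictor standing in for the trained model.
chars = [" ", "a", "b"]
char_int = {c: i for i, c in enumerate(chars)}
int_char = {i: c for c, i in char_int.items()}

def uniform_predictor(x_pred):
    return np.full(len(chars), 1 / len(chars))

result = generate("ab a", 6, uniform_predictor, char_int, int_char,
                  np.random.default_rng(0))
print(repr(result))
```

To plug in the trained model, `predict_fn` could wrap `model.predict(x, verbose=0)[0]`, cast to `float64` and renormalized so that the probabilities sum to 1 closely enough for `rng.choice`.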

# Resources and Stretch Goals

## Stretch Goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g., plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://www.tensorflow.org/text/tutorials/text_generation) - code for training a character-level RNN to generate Shakespeare-style text
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem and provides an example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN