<a href="https://colab.research.google.com/github/ryanleeallred/DS-Unit-4-Sprint-3-Deep-Learning/blob/main/module1-rnn-and-lstm/LS_DS_431_RNN_and_LSTM_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

The French mathematician Emile Borel once mused that [**infinite monkeys typing for an infinite amount of time**](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of **Recurrent Neural Networks and LSTMs**!

Our goal in this project is to build a **Shakespeare Sonnet Generator**.<br>
Given a prompt of a few words as input, its task is to generate follow-on text that reads like a Shakespeare Sonnet!<br>

To build our Sonnet Generator we will use a type of model called a **sequence model**. Given a short sequence, a sequence model predicts the **most likely next item in the sequence**. Sequence models are astonishingly versatile and powerful, because the **sequence** we want to predict can be quite general! It can be composed of **words**, **characters**, **musical notes**, data points in a **time series** such as EKG voltages or stock prices, or even **DNA nucleotides**!

We will train our model on the entire corpus of Shakespeare's Sonnets, and the model will learn from that data the most likely patterns of characters.

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

In [2]:
import random
import sys

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.callbacks import LambdaCallback

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Bidirectional
from tensorflow.keras.layers import LSTM

%matplotlib inline

# import a custom text data preparation class
# !wget https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-3-Deep-Learning/main/module1-rnn-and-lstm/data_cleaning_toolkit_class.py
from data_cleaning_toolkit_class import data_cleaning_toolkit

### Use `requests` to pull data from a URL

[**Read through the `requests` documentation**](https://requests.readthedocs.io/en/master/user/quickstart/#make-a-request) to learn how to download the Shakespeare Sonnets from the Project Gutenberg website.


In [3]:
# download all of Shakespeare's Sonnets from the Project Gutenberg website

# here's the link for the sonnets
url_shakespeare_sonnets = "https://www.gutenberg.org/cache/epub/1041/pg1041.txt"

# use requests and the url to download all of the sonnets - save the result to `data`
data = requests.get(url_shakespeare_sonnets)


In [13]:
print(dir(raw_text_data))

['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']


In [7]:
# extract the downloaded text from the requests object and save it to `raw_text_data`
# hint: take a look at the attributes of `data`
# YOUR CODE HERE
raw_text_data = data.text
#raise NotImplementedError()

In [8]:
# check the data type of `raw_text_data`
assert isinstance(raw_text_data, str)

### Data Cleaning

In [9]:
# as usual, we need to clean up the messy data
raw_text_data[:3000]

'\ufeffThe Project Gutenberg eBook of The Sonnets, by William Shakespeare\r\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org. If you are not located in the United States, you\r\nwill have to check the laws of the country where you are located before\r\nusing this eBook.\r\n\r\nTitle: The Sonnets\r\n\r\nAuthor: William Shakespeare\r\n\r\nRelease Date: September, 1997 [eBook #1041]\r\n[Most recently updated: November 25, 2021]\r\n\r\nLanguage: English\r\n\r\n\r\nProduced by:  the Project Gutenberg Shakespeare Team\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK THE SONNETS ***\r\n\r\n\r\n\r\n\r\nTHE SONNETS\r\n\r\nby William Shakespeare\r\n\r\n\r\n\r\n\r\n  I\r\n\r\n  From fairest creatures we desire increase,\r\n  That t
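The preview above begins with `\ufeff`, a Unicode byte-order mark (BOM). It does no harm here because the leading boilerplate is sliced away later, but as a minimal sketch it could be removed either while decoding or afterwards. The byte string below is a made-up stand-in for the first bytes of `data.content`.

```python
# Hypothetical stand-in for the first bytes of `data.content`
raw_bytes = b"\xef\xbb\xbfThe Project Gutenberg eBook of The Sonnets"

# Option 1: the 'utf-8-sig' codec drops a leading BOM while decoding
decoded = raw_bytes.decode("utf-8-sig")

# Option 2: strip the BOM character from an already-decoded string
decoded_with_bom = raw_bytes.decode("utf-8")
stripped = decoded_with_bom.lstrip("\ufeff")

print(repr(decoded_with_bom[:4]))  # the BOM is still the first character here
print(decoded == stripped)
```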

In [10]:
# Which characters could we use with the split() method to split the text into lines?

# split the text into **lines** and save the result to `split_data`

# YOUR CODE HERE
split_data = raw_text_data.splitlines()



**Use list index slicing to remove the titles and descriptions, so we only have the sonnets.**


In [None]:
# we need to drop all the boilerplate text (i.e., titles and descriptions) as well as extra white spaces
# so that we are left with only the sonnets themselves 

# find index boundaries (start, end) so that
# sonnets exist between these indices 
# titles and descriptions exist outside of these indices

# use index slicing to isolate the sonnet lines from the text - save the result to `sonnets`

# YOUR CODE HERE
sonnets = split_data[36:]
sonnets[0:30]
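Hard-coding `36` works for this particular download, but it leaves the license text at the end of the file in place. As a sketch of an alternative, the boundaries could be located by searching for Project Gutenberg's marker lines. The `*** START OF` prefix appears in the raw text printed earlier; the matching `*** END OF` prefix is an assumption, and `slice_between_markers` and the sample lines are hypothetical.

```python
def slice_between_markers(lines, start_prefix="*** START OF", end_prefix="*** END OF"):
    """Return the lines strictly between the first start marker and the first end marker after it."""
    # next() raises StopIteration if a marker is missing, which is a useful early failure
    start = next(i for i, line in enumerate(lines) if line.startswith(start_prefix))
    end = next(i for i, line in enumerate(lines) if i > start and line.startswith(end_prefix))
    return lines[start + 1:end]

# tiny made-up example in the same layout as the Gutenberg file
sample_lines = [
    "Title: The Sonnets",
    "*** START OF THE PROJECT GUTENBERG EBOOK THE SONNETS ***",
    "",
    "  From fairest creatures we desire increase,",
    "  That thereby beauty's rose might never die,",
    "*** END OF THE PROJECT GUTENBERG EBOOK THE SONNETS ***",
    "Updated editions will replace the previous one",
]

body = slice_between_markers(sample_lines)
print(body)
```

Applied to `split_data`, this would still keep the title lines that follow the start marker ("THE SONNETS", "by William Shakespeare"), so a few leading lines would still need to be dropped.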

Notice that there are many lines that should not be counted as sonnets!

In [None]:
# notice that these non-sonnet lines have far fewer characters than the actual sonnet lines

sonnets[200:240]

In [21]:
# use your judgement to decide on a good value for
#   the minimum number of characters that a sonnet line should have
#   call it n_chars
n_chars = 5

# Let's use that observation to filter out all the non-sonnet lines!
#    save results to `filtered_sonnets`
# Hint: use a list comprehension
# keep lines longer than n_chars and strip their leading indentation
filtered_sonnets = [line.lstrip(" ") for line in sonnets if len(line) > n_chars]


In [22]:
# ok - much better!
# but we still need to remove all the punctuation and case normalize the text
filtered_sonnets[0:20]

['From fairest creatures we desire increase,',
 'That thereby beauty’s rose might never die,',
 'But as the riper should by time decease,',
 'His tender heir might bear his memory:',
 'But thou, contracted to thine own bright eyes,',
 'Feed’st thy light’s flame with self-substantial fuel,',
 'Making a famine where abundance lies,',
 'Thy self thy foe, to thy sweet self too cruel:',
 'Thou that art now the world’s fresh ornament,',
 'And only herald to the gaudy spring,',
 'Within thine own bud buriest thy content,',
 'And tender churl mak’st waste in niggarding:',
 'Pity the world, or else this glutton be,',
 'To eat the world’s due, by the grave and thee.',
 'When forty winters shall besiege thy brow,',
 'And dig deep trenches in thy beauty’s field,',
 'Thy youth’s proud livery so gazed on now,',
 'Will be a tatter’d weed of small worth held:',
 'Then being asked, where all thy beauty lies,',
 'Where all the treasure of thy lusty days;']

### Use Custom Data Cleaning Tool 

Use one of the methods in the `data_cleaning_toolkit` to clean your data.

There is an example of this in the guided project.

In [24]:
# instantiate the data_cleaning_toolkit class - save result to `dctk`
dctk = data_cleaning_toolkit()

In [25]:
# use data_cleaning_toolkit to remove punctuation and to case normalize - save results to `clean_sonnets`
clean_sonnets = [dctk.clean_data(line) for line in filtered_sonnets]
# TODO: try rewriting this with map() and a lambda

In [26]:
# much better!
display(clean_sonnets)
print(len(clean_sonnets))

['from fairest creatures we desire increase',
 'that thereby beautys rose might never die',
 'but as the riper should by time decease',
 'his tender heir might bear his memory',
 'but thou contracted to thine own bright eyes',
 'feedst thy lights flame with selfsubstantial fuel',
 'making a famine where abundance lies',
 'thy self thy foe to thy sweet self too cruel',
 'thou that art now the worlds fresh ornament',
 'and only herald to the gaudy spring',
 'within thine own bud buriest thy content',
 'and tender churl makst waste in niggarding',
 'pity the world or else this glutton be',
 'to eat the worlds due by the grave and thee',
 'when forty winters shall besiege thy brow',
 'and dig deep trenches in thy beautys field',
 'thy youths proud livery so gazed on now',
 'will be a tatterd weed of small worth held',
 'then being asked where all thy beauty lies',
 'where all the treasure of thy lusty days',
 'to say within thine own deep sunken eyes',
 'were an alleating shame and thriftl

2555


### Use Your Data Tool to Create Character Sequences for the LSTM Model

We'll need the `create_char_sequences` method for this task. <br>
However, this method requires a parameter called `maxlen`, which sets the maximum sequence length. 

So what would be a good sequence length, exactly? 

To answer that question, let's do some statistics! 

In [49]:
from scipy import stats

def calc_stats(corpus):
    """
    Calculates statistics on the length of every line in the sonnets
    """
    
    # write a list comprehension that calculates each sonnet's line length - save the results to `doc_lens` 
    doc_lens = [len(line) for line in corpus]
    # use scipy's stats.describe to return nobs, (min, max), mean, variance, skewness, kurtosis in one call
    return stats.describe(doc_lens)
    

In [50]:
calc_stats(clean_sonnets)

DescribeResult(nobs=2555, minmax=(1, 70), mean=41.4559686888454, variance=124.2920111501547, skewness=-0.6975573987861907, kurtosis=3.866098817697444)
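`stats.describe` does not report the median or the standard deviation directly. As a sketch, the mean, median, standard deviation, maximum, and minimum of the line lengths could be computed with NumPy instead. `calc_stats_np` is a hypothetical name and the sample lines are made up.

```python
import numpy as np

def calc_stats_np(corpus):
    """Return (mean, median, std, max, min) of the line lengths in `corpus`."""
    doc_lens = np.array([len(line) for line in corpus])
    return doc_lens.mean(), np.median(doc_lens), doc_lens.std(), doc_lens.max(), doc_lens.min()

# three made-up lines with lengths 22, 9, and 8
sample_corpus = ["from fairest creatures", "we desire", "increase"]
mean_len, median_len, std_len, max_len, min_len = calc_stats_np(sample_corpus)
print(mean_len, median_len, std_len, max_len, min_len)
```

Note that `ndarray.std()` defaults to the population form (`ddof=0`), whereas `stats.describe` reports the sample variance (`ddof=1`), so the square root of the variance printed above will differ slightly from the NumPy value.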

In [61]:
import statistics
# `doc_lens` is local to calc_stats, so rebuild the line lengths here
statistics.median([len(line) for line in clean_sonnets])

41

In [55]:
# sonnet line length statistics
nobs, minmax, mean, variance, skewness, kurtosis = calc_stats(clean_sonnets)
nobs, minmax, mean, variance, skewness, kurtosis

(2555,
 (1, 70),
 41.4559686888454,
 124.2920111501547,
 -0.6975573987861907,
 3.866098817697444)

In [62]:
# from the results of the sonnet line length statistics, use your judgement to select a value for maxlen
#   hint -- a good value might be half the median length of a sonnet line
# use .create_char_sequences() to create sequences
# 20 is roughly half the median line length of 41
dctk.create_char_sequences(clean_sonnets, 20)

Created 21691 sequences.
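To make the sequence-creation step concrete, here is a minimal sketch of how a sliding window could turn text into (input sequence, next character) pairs. This is not the toolkit's code: `make_char_sequences` and its `step` parameter are assumptions for illustration.

```python
def make_char_sequences(text, maxlen, step=1):
    """Slide a window of `maxlen` chars over `text`; pair each window with the char that follows it."""
    sequences, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        sequences.append(text[i:i + maxlen])
        next_chars.append(text[i + maxlen])
    return sequences, next_chars

toy_text = "from fairest creatures we desire increase"
seqs, nexts = make_char_sequences(toy_text, maxlen=20, step=5)
for seq, nxt in zip(seqs, nexts):
    print(repr(seq), "->", repr(nxt))
```

For what it is worth, a window of 20 with a step of 5 over a 108,474-character text (the length of the joined sonnets reported later in this notebook) gives 21,691 windows, the same count the toolkit printed. The toolkit's actual step size is not shown here, so treat that as a coincidence until you have read `data_cleaning_toolkit_class.py`.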


Take a look at the `data_cleaning_toolkit_class.py` file. 

In the first 4 lines of code in the `create_char_sequences` method, class attributes `n_features` and `unique_chars` are created. <br>
Let's call them in the cells below.

In [63]:
# number of input features for our LSTM model
dctk.n_features

27

In [65]:
# unique characters that appear in our sonnets 
print(len(dctk.unique_chars)) # all letters + space " "

27


## Time for Questions 

----
**Question 1:** 

Why are the `number of unique characters` (i.e., **dctk.unique_chars**) and the `number of model input features` (i.e., **dctk.n_features**) the same?

**Hint:** The model that we will shortly build here is very similar to the text generation model we built in the guided project.

**Answer 1:**
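The input is one-hot encoded: each character in a sequence becomes a vector with one position per unique character. The sonnets use 26 letters plus the space, so every character vector has 27 entries, and those 27 entries are the model's input features.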


**Question 2:**

Take a look at the printout of `dctk.unique_chars` one more time. Notice that there is a white space. 

Why is it desirable to have a white space as a possible character to predict?

**Answer 2:**
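The space marks word boundaries. If the model could not predict it, the generated text would be one unbroken run of letters with no separate words.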

----

### Use Our Data Tool to Create X and Y Splits

You'll need the `create_X_and_Y` method for this task. 

In [66]:
# TODO: provide a walkthrough of data_cleaning_toolkit with unit tests that check for understanding 
X, y = dctk.create_X_and_Y()

![](https://miro.medium.com/max/891/0*jGB1CGQ9HdeUwlgB)

In [67]:
# notice that our input array isn't a matrix - it's a rank three tensor
X.shape

(21691, 20, 27)

In `X.shape`, we see three numbers (*n1*, *n2*, *n3*). What do these numbers mean?

Well, *n1* tells us the number of samples that we have. But what about the other two?

In [71]:
# first index returns a single sample, which we can see is a sequence 
first_sample_index = 0 
X[first_sample_index][:5]

array([[ True, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False,  True, False, False, False, False, False, False],
       [False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False,  True],
       [False,  True, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False],
       [False, False, False, False,  True, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
  

Notice that each sequence (i.e., $X[i]$ where $i$ is some index value) is `maxlen` long and <br>
has a number of features equal to `dctk.n_features`. <br>Let's try to understand this shape.

In [72]:
# each sequence is maxlen long and has dctk.n_features number of features
X[first_sample_index].shape

(20, 27)

**Each row corresponds to a character vector,** and there are `maxlen` character vectors. 

**Each column corresponds to a unique character,** and there are `dctk.n_features` columns. 


In [73]:
# let's index for a single character vector 
first_char_vect_index = 0
X[first_sample_index][first_char_vect_index]

array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False])

Notice that there is a single `True` value, and all the rest of the values are `False`. 

This is a one-hot encoding for which character appears at each index within a sequence. Specifically, the cell above is looking at the first character in the sequence.

Only a single character can appear as the first character in a sequence, so there will be a single `True` value, and the rest will be `False`. 

Let's say that `True` appears at index $i$, where $i$ stands for any index in the general case. How can we find out which character it corresponds to?

To answer this question, we need to use the integer-to-character look-up dictionary. 

In [74]:
# take a look at the index to character dictionary
# if a True appears in the 0th index of a character vector,
# then whatever char you see below next to the 0th key 
# is the character that the character vector is encoding
dctk.int_char

{0: 'f',
 1: 'm',
 2: 'y',
 3: 'z',
 4: ' ',
 5: 'i',
 6: 'x',
 7: 'q',
 8: 'u',
 9: 'g',
 10: 'v',
 11: 'b',
 12: 'e',
 13: 'w',
 14: 'k',
 15: 'p',
 16: 'j',
 17: 'l',
 18: 's',
 19: 'd',
 20: 'r',
 21: 'h',
 22: 'c',
 23: 'a',
 24: 'n',
 25: 't',
 26: 'o'}

In [75]:
# let's look at an example to tie it all together

seq_len_counter = 0

# index for a single sample 
for seq_of_char_vects in X[first_sample_index]:
    
    # get index with max value, which will be the one TRUE value 
    index_with_TRUE_val = np.argmax(seq_of_char_vects)
    
    print(dctk.int_char[index_with_TRUE_val])
    
    seq_len_counter += 1
    
print("Sequence length: {}".format(seq_len_counter))

f
r
o
m
 
f
a
i
r
e
s
t
 
c
r
e
a
t
u
r
Sequence length: 20
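The same decoding can be done without an explicit loop over the rows. The sketch below encodes and decodes a short word on a small made-up alphabet; `toy_chars`, `toy_char_int`, `toy_int_char`, and `toy_X` are hypothetical stand-ins for the toolkit's look-up tables and for one sample of `X`.

```python
import numpy as np

# toy look-up tables; the real ones live on the toolkit object
toy_chars = [" ", "a", "f", "i", "r"]
toy_char_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_char = {i: c for c, i in toy_char_int.items()}

# one-hot encode the string "fair" into a (sequence length, n_features) boolean array
word = "fair"
toy_X = np.zeros((len(word), len(toy_chars)), dtype=bool)
for t, char in enumerate(word):
    toy_X[t, toy_char_int[char]] = True

# decode: argmax along the feature axis finds the position of the single True in each row
indices = toy_X.argmax(axis=-1)
decoded = "".join(toy_int_char[int(i)] for i in indices)
print(toy_X.shape, decoded)
```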


## Time for Questions 

----
**Question 1:** 

In your own words, how would you describe the numbers from the shape printout of `X.shape` to a classmate?


**Answer 1:**

Write your answer here

----


### Build a Shakespeare Sonnet Text Generation Model

Now that we have prepped our data (and understood that process), let's finally build out our character generation model, similar to what we did in the guided project.<br>

First, we'll create a callback to monitor the training -- by printing a sample of text generated by the model at the end of each epoch.

Helper function to generate a sample character:

In [77]:
def sample(preds, temperature=1.0):
    """
    Helper function to generate a sample character
    Input is a predictions vector from our model, for example a set of 27 character probabilities
    Output is the index of the generated character 
    """
    # convert predictions to an array 
    preds = np.asarray(preds).astype('float64')

    # use the temperature hyper-parameter to "warp" (sharpen or spread out) the probability distribution 
    preds = np.log(preds) / temperature

    # use the softmax activation function to create a new list of probabilities 
    #   corresponding to the "warped" probability distribution
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)

    # Draw a single sample from a multinomial distribution, given these probabilities
    #   The sample will be a one-hot encoded character
    """ Notes on the np.random.multinomial() function 
       The first argument is the number of "trials" we want: 1 in this case
       The second argument is the list of probabilities for each character
       The third argument is number of sets of "trials" we want: again, 1 in this case
       By analogy with a dice-rolling experiment: 
          This "trial" consists of generating a single "throw" of a die with 27 faces;
             each face corresponds to a character and its associated probability
    """

    probas = np.random.multinomial(1, preds, 1)
    
    # probas is one-hot: return the index of the character that was drawn
    return np.argmax(probas)
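To see what the temperature does, the warping step can be applied to a small made-up distribution. This sketch repeats only the log, divide, and softmax lines of `sample`; the helper name `warp` and the probabilities are hypothetical.

```python
import numpy as np

def warp(preds, temperature):
    """Rescale a probability vector: log, divide by temperature, then softmax."""
    logits = np.log(np.asarray(preds, dtype="float64")) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

toy_preds = [0.6, 0.3, 0.1]
for temperature in (0.5, 1.0, 2.0):
    print(temperature, np.round(warp(toy_preds, temperature), 3))
```

A temperature below 1 sharpens the distribution toward the most likely character, 1.0 leaves it unchanged, and a temperature above 1 flattens it so that unlikely characters are drawn more often.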


Create the `on_epoch_end` function to be passed into `LambdaCallback()`

In [78]:
def on_epoch_end(epoch, _):
    """
    Function invoked at the end of each epoch. Prints the text generated by our model.
    """
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    

    # randomly pick a starting index 
    # will be used to take a random sequence of chars from `text`
    start_index = random.randint(0, len(text) - dctk.maxlen - 1)
    
    # this will hold our seed string (i.e. the input sequence into the model)
    generated = ''

    # start the sentence at index `start_index` and include the next `dctk.maxlen` number of chars
    sentence = text[start_index: start_index + dctk.maxlen]

    # add to generated
    generated += sentence

    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    # use model to predict the next `maxlen` chars that follow the seed string
    # note: this `maxlen` is the global variable set in the training cell, not `dctk.maxlen`
    for i in range(maxlen):

        # shape of a single sample in a rank 3 tensor 
        x_dims = (1, dctk.maxlen, dctk.n_features)
        # create an array of zeros with shape x_dims
        # recall that 0 and False compare equal in Python
        x_pred = np.zeros(x_dims)

        # create a seq vector for our randomly selected sequence 
        # i.e. create a numerical encoding for each char in the sequence 
        for t, char in enumerate(sentence):
            # for sample 0 in seq index t and character `char` encode a 1 (which is the same as a TRUE)
            x_pred[0, t, dctk.char_int[char]] = 1

        # next, take the seq vector and pass into model to get a prediction of what the next char should be 
        preds = model.predict(x_pred, verbose=0)[0]
        # use the sample helper function to get index for next char 
        next_index = sample(preds)
        # use look up dict to get next char 
        next_char = dctk.int_char[next_index]

        # append next char to sequence 
        sentence = sentence[1:] + next_char 
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()
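The key move in the loop above is the rolling window: after each prediction the oldest character is dropped and the new one is appended, so the model input always has the same length. Here is a sketch with a stand-in predictor. `fake_next_char` is hypothetical and simply steps through the alphabet; it is not a trained model.

```python
def fake_next_char(window):
    """Stand-in for model.predict + sample: return the letter after the window's last letter."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    last = window[-1]
    return alphabet[(alphabet.index(last) + 1) % 26] if last in alphabet else "a"

window = "abcde"  # seed of fixed length, like `sentence` above
generated = window
for _ in range(4):
    next_char = fake_next_char(window)
    window = window[1:] + next_char  # drop the oldest char, append the newest
    generated += next_char
    print(repr(window))

print(generated)
```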

In [79]:
# need this for on_epoch_end()
text = " ".join(clean_sonnets)
print(f'All of Shakespeare\'s sonnets comprise about {len(text)} characters')

All of Shakespeare's sonnets comprise about 108474 characters


Create the callback object

In [80]:
# create callback object that will print out text generation at the end of each epoch 
# use for real-time monitoring of model performance
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

----
### Build and Train Model

Build a text generation model using LSTMs. Feel free to reference the model used in the guided project. 

It is recommended that you train this model for at least 50 epochs (more if your computer can handle it). 

You are free to change up the architecture as you wish. 

Just in case you have difficulty training a model, there is a pre-trained model saved to a file called `trained_text_gen_model.h5` that you can load (in the same way that you learned how to load in Keras models in Sprint 2 Module 4). 

In [92]:
# build text generation model layer by layer 
# fit model

# number of characters that on_epoch_end() generates after the seed
# (this is not the model's input length, which is dctk.maxlen)
maxlen = 40
### BEGIN SOLUTION
## For reference: a 1-layer LSTM with 128 units takes about 4.5 minutes to train on the entire corpus for 50 epochs on a Colab GPU

# build a 2-layer LSTM language model 
from tensorflow.keras.optimizers import Adam
opt = Adam(learning_rate=0.001)

model = Sequential()

# hidden layer 1 
model.add(LSTM(128, 
               input_shape=(dctk.maxlen, dctk.n_features), # (20, 27) for this data
               return_sequences=True)) # must be True because another LSTM layer follows
# hidden layer 2 - only the first layer needs input_shape
model.add(LSTM(128, 
               return_sequences=False)) # the last LSTM layer returns only its final output
# this is our output layer
# recall that n_features = number of characters in the dictionary = 27
model.add(Dense(dctk.n_features, 
                activation='softmax'))

# notice that we are using categorical_crossentropy this time around - why?
model.compile(loss='categorical_crossentropy', 
              optimizer=opt)

# fit the model
# X and y are pretty large, consider sub-sampling
model.fit(X, y,
          batch_size=128,
          epochs=20,  # changed from 150
          callbacks=[print_callback])
### END SOLUTION

Epoch 1/20
----- Generating text after Epoch: 0
----- Generating with seed: "h is found now proud"
h is found now proudnreshhbantynhrnlthcce ias areatoy  ihi s
Epoch 2/20
----- Generating text after Epoch: 1
----- Generating with seed: "brightness doth not "
brightness doth not bseegecn thepe gof no nohhi ponmtmohs oi
Epoch 3/20
----- Generating text after Epoch: 2
----- Generating with seed: "ughts whilst others "
ughts whilst others beuthss facd lossthe is srins ing thig f
Epoch 4/20
----- Generating text after Epoch: 3
----- Generating with seed: "of you beauteous and"
of you beauteous and ols hle iterk wfeis as af wwerort bo th
Epoch 5/20
----- Generating text after Epoch: 4
----- Generating with seed: "ion should he live a"
ion should he live are soll ad the li cend elithecare to yhe
Epoch 6/20
----- Generating text after Epoch: 5
----- Generating with seed: " him but copy what i"
 him but copy what iss tid pridstent theus ane beesells hom 
Epoch 7/20
----- Generating text after E

<keras.callbacks.History at 0x7f016999e520>
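On the question posed in the code comment about `categorical_crossentropy`: the output layer is a softmax over the 27 characters, and (assuming `y` is one-hot encoded in the same way as `X`) each target picks out exactly one of them, which is the multi-class setting this loss is meant for. A minimal NumPy sketch of the formula on made-up predictions follows; this is an illustration, not Keras's implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=-1)))

# two made-up samples over a 3-character alphabet
y_true = np.array([[1, 0, 0], [0, 0, 1]])
good = np.array([[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]])
bad = np.array([[0.2, 0.4, 0.4], [0.6, 0.3, 0.1]])

print(categorical_crossentropy(y_true, good))
print(categorical_crossentropy(y_true, bad))
```

Only the probability assigned to the correct character contributes to the loss, so predictions that put more mass on the true next character score lower.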

### Save the trained model to a file

In [93]:
# save trained model to file 
model.save("trained_text_gen_model.h5")

### Let's Play With Our Trained Model 

Now that we have a trained model that, though far from perfect, can generate actual English words, we can look at the predictions to continue learning more about how a text generation model works.

We can also take this as an opportunity to unpack the `on_epoch_end` function to understand better how it works. 

In [94]:
# this is our joined clean sonnet data
text



## The Project Gutenberg license text at the end of `text` needs to be cut

gutenbergtm collection despite these efforts project gutenbergtm electronic works and the medium on which they may be stored may contain defects such as but not limited to incomplete inaccurate or corrupt data transcription errors a copyright or other intellectual property infringement a defective or damaged disk or other medium a computer virus or computer codes that damage or cannot be read by your equipment f limited warranty disclaimer of damagesexcept for the right of replacement or refund described in paragraph f the project gutenberg literary archive foundation the owner of the project gutenbergtm trademark and any other party distributing a project gutenbergtm electronic work under this agreement disclaim all liability to you for damages costs and expenses including legal fees you agree that you have no remedies for negligence strict liability breach of warranty or breach of contract except those provided in paragraph f you agree that the foundation the trademark owner and any distributor under this agreement will not be liable to you for actual direct indirect consequential punitive or incidental damages even if you give notice of the possibility of such damage f limited right of replacement or refundif you discover a defect in this electronic work withindays of receiving it you can receive a refund of the money if any you paid for it by sending a written explanation to the person you received the work from if you received the work on a physical medium you must return the medium with your written explanation the person or entity that provided you with the defective work may elect to provide a replacement copy in lieu of a refund if you received the work electronically the person or entity providing it to you may choose to give you a second opportunity to receive the work electronically in lieu of a refund if the second copy is also defective you may demand a refund in writing without further opportunities to fix the problem f except for the limited right 
of replacement or refund set forth in paragraph f this work is provided to you asis with no other warranties of any kind express or implied including but not limited to warranties of merchantability or fitness for any purpose f some states do not allow disclaimers of certain implied warranties or the exclusion or limitation of certain types of damages if any disclaimer or limitation set forth in this agreement violates the law of the state applicable to this agreement the agreement shall be interpreted to make the maximum disclaimer or limitation permitted by the applicable state law the invalidity or unenforceability of any provision of this agreement shall not void the remaining provisions f indemnityyou agree to indemnify and hold the foundation the trademark owner any agent or employee of the foundation anyone providing copies of project gutenbergtm electronic works in accordance with this agreement and any volunteers associated with the production promotion and distribution of project gutenbergtm electronic works harmless from all liability costs and expenses including legal fees that arise directly or indirectly from any of the following which you do or cause to occur a distribution of this or any project gutenbergtm work b alteration modification or additions or deletions to any project gutenbergtm work and c any defect you cause sectioninformation about the mission of project gutenbergtm project gutenbergtm is synonymous with the free distribution of electronic works in formats readable by the widest variety of computers including obsolete old middleaged and new computers it exists because of the efforts of hundreds of volunteers and donations from people in all walks of life volunteers and financial support to provide volunteers with the assistance they need are critical to reaching project gutenbergtms goals and ensuring that the project gutenbergtm collection will remain freely available for generations to come inthe project gutenberg literary archive 
foundation was created to provide a secure and permanent future for project gutenbergtm and future generations to learn more about the project gutenberg literary archive foundation and how your efforts and donations can help see sectionsandand the foundation information page at wwwgutenbergorg sectioninformation about the project gutenberg literary archive foundation the project gutenberg literary archive foundation is a nonprofit c educational corporation organized under the laws of the state of mississippi and granted tax exempt status by the internal revenue service the foundations ein or federal tax identification number iscontributions to the project gutenberg literary archive foundation are tax deductible to the full extent permitted by us federal laws and your states laws the foundations business office is located atnorthwest salt lake city utemail contact links and up to date contact information can be found at the foundations website and official page at wwwgutenbergorgcontact sectioninformation about donations to the project gutenberg literary archive foundation project gutenbergtm depends upon and cannot survive without widespread public support and donations to carry out its mission of increasing the number of public domain and licensed works that can be freely distributed in machinereadable form accessible by the widest array of equipment including outdated equipment many small donations  toare particularly important to maintaining tax exempt status with the irs the foundation is committed to complying with the laws regulating charities and charitable donations in allstates of the united states compliance requirements are not uniform and it takes a considerable effort much paperwork and many fees to meet and keep up with these requirements we do not solicit donations in locations where we have not received written confirmation of compliance to send donations or determine the status of compliance for any particular state visit wwwgutenbergorgdonate 
while we cannot and do not solicit contributions from states where we have not met the solicitation requirements we know of no prohibition against accepting unsolicited donations from donors in such states who approach us with offers to donate international donations are gratefully accepted but we cannot make any statements concerning tax treatment of donations received from outside the united states us laws alone swamp our small staff please check the project gutenberg web pages for current donation methods and addresses donations are accepted in a number of other ways including checks online payments and credit card donations to donate please visit wwwgutenbergorgdonate sectiongeneral information about project gutenbergtm electronic works professor michael s hart was the originator of the project gutenbergtm concept of a library of electronic works that could be freely shared with anyone for forty years he produced and distributed project gutenbergtm ebooks with only a loose network of volunteer support project gutenbergtm ebooks are often created from several printed editions all of which are confirmed as not protected by copyright in the us unless a copyright notice is included thus we do not necessarily keep ebooks in compliance with any particular paper edition most people start at our website which has the main pg search facility wwwgutenbergorg this website includes information about project gutenbergtm including how to make donations to the project gutenberg literary archive foundation how to help produce our new ebooks and how to subscribe to our email newsletter to hear about new ebooks'

In [95]:
# randomly pick a starting index 
# will be used to take a random sequence of chars from `text`
# run this cell a few times and you'll see `start_index` is random
start_index = random.randint(0, len(text) - dctk.maxlen - 1)
start_index

101906

In [96]:
# next use the randomly selected starting index to sample a sequence from the `text`

# this is our seed string (i.e., input sequence into the model)
generated = ''

# start the sentence at index `start_index` and include the next `dctk.maxlen` chars
sentence = text[start_index: start_index + dctk.maxlen]

# add to generated
generated += sentence

In [97]:
# display the "seed string" i.e. the input sequence into the model
print('----- Input seed: "' + sentence + '"')

----- Input seed: "iable to you for act"


In [98]:
# use the model to predict the `dctk.maxlen` chars that should follow the seed string
for _ in range(dctk.maxlen):

    # shape of a single sample in a rank 3 tensor 
    x_dims = (1, dctk.maxlen, dctk.n_features)
    # create an array of zeros with shape x_dims
    # recall that Python treats 0 and the boolean False as equal (0 == False)
    x_pred = np.zeros(x_dims)

    # create a seq vector for our randomly selected sequence
    # i.e. create a one-hot numerical encoding for each char in the sequence
    for t, char in enumerate(sentence):
        # for sample 0 in seq index t and character `char` encode a 1 (which is the same as a TRUE)
        x_pred[0, t, dctk.char_int[char]] = 1

    # next, take the seq vector and pass into model to get a prediction of what the next char should be 
    preds = model.predict(x_pred, verbose=0)[0]
    # use the sample helper function to get index for next char 
    next_index = sample(preds)
    # use look up dict to get next char 
    next_char = dctk.int_char[next_index]

    # append next char to sequence 
    sentence = sentence[1:] + next_char 
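
The loop above relies on a `sample` helper to turn the model's predicted probabilities into a single character index. As a rough illustration of what such a helper can do, here is a minimal sketch of temperature-based sampling in plain Python. The name `sample_with_temperature` and its `temperature` and `rng` parameters are hypothetical; the notebook's own `sample` may be implemented differently.

```python
import math
import random

def sample_with_temperature(preds, temperature=1.0, rng=random):
    """Draw an index from `preds` after rescaling by `temperature`.

    Hypothetical helper for illustration only; this is an assumption,
    not the notebook's own `sample` function.
    """
    # work in log space; a tiny floor avoids log(0)
    logits = [math.log(max(p, 1e-12)) / temperature for p in preds]
    # subtract the max before exponentiating for numerical stability
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    # random.choices normalizes the weights internally
    return rng.choices(range(len(preds)), weights=weights, k=1)[0]

# a low temperature concentrates the draws on the most likely index
probs = [0.1, 0.8, 0.1]
low_temp_draws = [sample_with_temperature(probs, temperature=0.05) for _ in range(100)]
```

Temperatures below 1 make the output more conservative and repetitive, while temperatures above 1 flatten the distribution and produce more varied (and more error-prone) text.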

In [99]:
# this is the seed string
generated

'iable to you for act'

In [100]:
# these are the maxlen chars that the model thinks should come after the seed string
sentence

' prowenteoas of the '

In [101]:
# now put it all together
generated + sentence

'iable to you for act prowenteoas of the '
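
The cells above can be folded into one reusable function. The sketch below is a plain-Python version of the same sliding-window idea: encode the current window, ask a predictor for next-character probabilities, sample one character, append it, and shift the window by one. All names here (`one_hot`, `generate`, `predict_fn`, `uniform_predict`) are hypothetical, and the stand-in predictor returns a uniform distribution so the example runs without a trained model.

```python
import random

def one_hot(window, char_int):
    """Encode a string as a list of one-hot rows, one row per character."""
    rows = []
    for char in window:
        row = [0.0] * len(char_int)
        row[char_int[char]] = 1.0
        rows.append(row)
    return rows

def generate(seed, n_chars, predict_fn, char_int, int_char, rng=random):
    """Extend `seed` by `n_chars` characters using a sliding window."""
    window = seed      # what the predictor sees at each step
    generated = seed   # the full text produced so far
    for _ in range(n_chars):
        probs = predict_fn(one_hot(window, char_int))
        next_index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        next_char = int_char[next_index]
        generated += next_char
        # drop the oldest char and append the new one
        window = window[1:] + next_char
    return generated

# toy vocabulary built from a short string (an assumption for this demo)
chars = sorted(set("to be or not to be"))
char_int = {c: i for i, c in enumerate(chars)}
int_char = {i: c for c, i in char_int.items()}

def uniform_predict(x):
    # stand-in "model": every character is equally likely
    return [1.0 / len(chars)] * len(chars)

out = generate("to be", 10, uniform_predict, char_int, int_char, rng=random.Random(0))
print(repr(out))
```

With the trained Keras model, `predict_fn` could plausibly be a thin wrapper such as `lambda x: model.predict(np.array([x]), verbose=0)[0]`, assuming the model expects input of shape `(1, maxlen, n_features)` as in the cells above. Because `generate` appends each new character as it goes, the output length no longer has to equal the window length.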

# Resources and Stretch Goals

## Stretch Goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g., plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://www.tensorflow.org/text/tutorials/text_generation) - code for training a character-level RNN to generate text, using a Shakespeare dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem and provides an example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN