# Process Test

In this notebook, we test the methodology on a smaller dataset. The dataset for the main project is very large, and its long runtime makes it slow to experiment with the methodology. Here we experiment on the text of Mary Shelley's *Frankenstein* instead.

### Imports

In [51]:
# Just a few imports for now

import pandas as pd
import numpy as np
import re
import sys
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.keras.callbacks import ModelCheckpoint

## Obtain

> Import the text.

In [29]:
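# Read the Project Gutenberg plain-text file (its header declares UTF-8)
with open('84-0.txt', encoding='utf-8') as f:
    text = f.read()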
text

'\ufeffThe Project Gutenberg eBook of Frankenstein, by Mary Wollstonecraft (Godwin) Shelley\n\nThis eBook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this eBook or online at\nwww.gutenberg.org. If you are not located in the United States, you\nwill have to check the laws of the country where you are located before\nusing this eBook.\n\nTitle: Frankenstein\n       or, The Modern Prometheus\n\nAuthor: Mary Wollstonecraft (Godwin) Shelley\n\nRelease Date: 31, 1993 [eBook #84]\n[Most recently updated: November 13, 2020]\n\nLanguage: English\n\nCharacter set encoding: UTF-8\n\nProduced by: Judith Boss, Christy Phillips, Lynn Hanninen, and David Meltzer. HTML version by Al Haines.\nFurther corrections by Menno de Leeuw.\n\n*** START OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***\n\n\n\

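> Remove the newline characters (`\n`). Note that substituting an empty string fuses the last word of each line with the first word of the next, which is why tokens such as `shelleythis` show up in the processed text further down.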

In [30]:
def remove_n(text):
    n = re.compile(r'\n')
    return n.sub(r'', text)

In [31]:
text = remove_n(text)
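
> A possible alternative, sketched here as an assumption rather than what this notebook does: replace each run of whitespace with a single space so that words on adjacent lines stay separate. The helper name `collapse_whitespace` is hypothetical.

```python
import re

def collapse_whitespace(text):
    # Replace every run of whitespace (including "\n") with one space
    return re.sub(r'\s+', ' ', text).strip()

sample = "Mary Wollstonecraft (Godwin) Shelley\n\nThis eBook is for the use"
cleaned = collapse_whitespace(sample)
print(cleaned)
```

> Swapping this in would change the character counts reported later in the notebook, so the recorded outputs would no longer match.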

> Tokenize the text

In [32]:
def tokenize_words(text):
    # Make lowercase
    text = text.lower()

    # Create tokens
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)

    # Filter stopwords; build the set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    filtered = (token for token in tokens if token not in stop_words)
    return " ".join(filtered)
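
> To see what this step does without downloading the NLTK corpus, here is a minimal sketch on one sentence. It assumes `re.findall` as a stand-in for `RegexpTokenizer(r'\w+')` and uses a tiny made-up stopword set; `STOPWORDS` and `tokenize_words_demo` are hypothetical names.

```python
import re

# Tiny hypothetical stand-in for NLTK's English stopword list
STOPWORDS = {"the", "of", "is", "for"}

def tokenize_words_demo(text):
    # Lowercase, keep runs of word characters, then drop stopwords
    tokens = re.findall(r'\w+', text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

result = tokenize_words_demo("This eBook is for the use of _anyone_ anywhere.")
print(result)
```

> Because `\w` matches underscores as well as letters and digits, markup such as `_anyone_` survives tokenization, which is presumably why `_` appears among the unique characters below.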

In [33]:
processed_text = tokenize_words(text)
processed_text

'project gutenberg ebook frankenstein mary wollstonecraft godwin shelleythis ebook use anyone anywhere united states andmost parts world cost almost restrictionswhatsoever may copy give away use termsof project gutenberg license included ebook online atwww gutenberg org located united states youwill check laws country located beforeusing ebook title frankenstein modern prometheusauthor mary wollstonecraft godwin shelleyrelease date 31 1993 ebook 84 recently updated november 13 2020 language englishcharacter set encoding utf 8produced judith boss christy phillips lynn hanninen david meltzer html version al haines corrections menno de leeuw start project gutenberg ebook frankenstein frankenstein modern prometheusby mary wollstonecraft godwin shelley contents letter 1 letter 2 letter 3 letter 4 chapter 1 chapter 2 chapter 3 chapter 4 chapter 5 chapter 6 chapter 7 chapter 8 chapter 9 chapter 10 chapter 11 chapter 12 chapter 13 chapter 14 chapter 15 chapter 16 chapter 17 chapter 18 chapter 

> Figure out the unique characters and assign them an index

In [40]:
chars = sorted(list(set(processed_text)))
char_to_num = dict((c, i) for i, c in enumerate(chars))

char_to_num

{' ': 0,
 '0': 1,
 '1': 2,
 '2': 3,
 '3': 4,
 '4': 5,
 '5': 6,
 '6': 7,
 '7': 8,
 '8': 9,
 '9': 10,
 '_': 11,
 'a': 12,
 'b': 13,
 'c': 14,
 'd': 15,
 'e': 16,
 'f': 17,
 'g': 18,
 'h': 19,
 'i': 20,
 'j': 21,
 'k': 22,
 'l': 23,
 'm': 24,
 'n': 25,
 'o': 26,
 'p': 27,
 'q': 28,
 'r': 29,
 's': 30,
 't': 31,
 'u': 32,
 'v': 33,
 'w': 34,
 'x': 35,
 'y': 36,
 'z': 37,
 'æ': 38,
 'è': 39,
 'é': 40,
 'ê': 41,
 'ô': 42}
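
> A small round-trip sketch on a toy string shows how the mapping encodes text. The inverse dictionary `num_to_char` is an assumption added for illustration; it is not defined in this notebook.

```python
sample = "chapter 1"

# Same construction as above, applied to a short string
chars = sorted(set(sample))
char_to_num = {c: i for i, c in enumerate(chars)}

# Hypothetical inverse mapping, useful for turning integers back into text
num_to_char = {i: c for c, i in char_to_num.items()}

encoded = [char_to_num[c] for c in sample]
decoded = "".join(num_to_char[i] for i in encoded)

print(encoded)
print(decoded)
```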

> Figure out the number of characters in the entire text as well as the number of unique characters.

In [38]:
input_len = len(processed_text)
char_len = len(chars)

print('Total Number of Characters:', input_len)
print('Total Unique Characters:', char_len)

Total Number of Characters: 284177
Total Unique Characters: 43


> Now that the data is in the form we need, we can build a dataset from it. We first define the sequence length and initialize X and y, then loop through the entire corpus, pairing each window of `seq_length` characters with the character that follows it.

In [42]:
seq_length = 100
X = []
y = []

for i in range(0, input_len - seq_length, 1):
    # Input: the window of seq_length characters starting at position i
    in_seq = processed_text[i: i + seq_length]
    
    # Output: the single character that immediately follows the window
    out_seq = processed_text[i + seq_length]
    
    # Convert the characters to integers and add them to the lists
    X.append([char_to_num[char] for char in in_seq])
    y.append(char_to_num[out_seq])
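
> The same windowing can be sketched on a toy string with a short window, which makes the input/target pairs easy to inspect. The string and the `pairs` list are illustrative only.

```python
text = "frankenstein"
seq_length = 4

chars = sorted(set(text))
char_to_num = {c: i for i, c in enumerate(chars)}

X, y, pairs = [], [], []
for i in range(len(text) - seq_length):
    in_seq = text[i:i + seq_length]    # window of seq_length characters
    out_char = text[i + seq_length]    # the character right after the window
    pairs.append((in_seq, out_char))
    X.append([char_to_num[c] for c in in_seq])
    y.append(char_to_num[out_char])

print(pairs[0], pairs[-1])
print(len(X))
```

> The number of windows is always the text length minus the sequence length, so for this notebook's corpus we expect 284177 - 100 = 284077 patterns.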

> Now let's save our total number of sequences and check how many total inputs there are.

In [43]:
n_patterns = len(X)
print('Total Patterns:', n_patterns)

Total Patterns: 284077


> Now we convert the sequences into a NumPy array that can be passed into a neural network, then scale the integer codes by the number of unique characters.

In [44]:
X = np.reshape(X, (n_patterns, seq_length, 1))
X = X / float(char_len)
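
> A toy version of the reshape and scaling step, with made-up values, shows the resulting shape and value range. Keras LSTM layers expect input shaped `(samples, timesteps, features)`; here each timestep carries one feature, the scaled character code.

```python
import numpy as np

# Two toy sequences of three integer codes each (illustrative values)
X = [[0, 1, 2], [1, 2, 3]]
n_patterns, seq_length, char_len = 2, 3, 4

# Reshape to (samples, timesteps, features), then scale the codes
X = np.reshape(X, (n_patterns, seq_length, 1))
X = X / float(char_len)

print(X.shape)
print(X.min(), X.max())
```

> Since the largest index is `char_len - 1`, every scaled value is at least 0 and strictly below 1.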

> Now we can use `tf.keras.utils.to_categorical` to one-hot encode the y data.

In [52]:
y = tf.keras.utils.to_categorical(y)
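
> To make the encoding concrete, here is a plain NumPy stand-in that produces the same shape and values as `to_categorical` with default arguments. It is a sketch for illustration, not the call the notebook uses.

```python
import numpy as np

y = np.array([2, 0, 3])       # toy integer labels
num_classes = y.max() + 1     # default behaviour: largest label plus one

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes)[y]

print(one_hot)
```

> Because the class count is inferred from the largest label present, passing `num_classes=char_len` to `to_categorical` is a way to guarantee one column per character even if the highest index never occurs as a target.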

## Modeling

> Here we'll set up an LSTM model in order to run through the sequences.

In [63]:
# Model Architecture

# Instantiate the model
model = Sequential()

# First LSTM layer (receives the input sequences)
model.add(LSTM(256, input_shape = (X.shape[1], X.shape[2]), return_sequences = True))
model.add(Dropout(0.2))

# Hidden layers
model.add(LSTM(256, return_sequences = True))
model.add(Dropout(0.2))

model.add(LSTM(128))
model.add(Dropout(0.2))

# Output Layer
model.add(Dense(y.shape[1], activation = 'softmax'))

# Compile
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

# Fit model
history = model.fit(X, y, epochs = 10, batch_size = 256)

Epoch 1/10

KeyboardInterrupt: 