# Homework 4 (Sequential Models)

1. Choose a book or other text to train your model on (I suggest [Project Gutenberg](https://www.gutenberg.org/ebooks/) for finding `.txt` files, but you can find them elsewhere). Make sure your file is a `.txt` file. Clean your data. Build a sequential model (LSTM, GRU, SimpleRNN, or Transformer) that generates new lines based on the training data (NO TRANSFER LEARNING). While your model doesn't need to have perfect accuracy, you must have an appropriate architecture and train it for a reasonable number of epochs.

Print out or write 10 generated sequences from your model (similar to Classwork 17, where we generated new Pride and Prejudice lines, but now with words instead of characters; feel free to use [this](https://colab.research.google.com/drive/12rxdjlEA9JOMQ_jioiEmwjDf6iPhsMmx?usp=sharing) as a reference for how to generate text from a trained model). Assess in detail how good they are: what they're good at and what they struggle to do well. INCLUDE THESE 10 sequences in your report.

2. Make a new model with ONE substantial adjustment (e.g. use a custom embedding layer if you didn't already, use a pre-trained embedding layer if you didn't already, use a DEEP LSTM/GRU with multiple recurrent layers, use a pre-trained model to do transfer learning and fine-tune it, etc.). While your model doesn't need to have perfect accuracy, you must have an appropriate architecture and train it for a reasonable number of epochs.

Print out or write 10 generated sequences from your model (similar to Classwork 17, where we generated new Pride and Prejudice lines, but now with words instead of characters; feel free to use [this](https://colab.research.google.com/drive/12rxdjlEA9JOMQ_jioiEmwjDf6iPhsMmx?usp=sharing) as a reference for how to generate text from a trained model). INCLUDE THESE 10 sequences in your report. Assess in detail how good they are: what they're good at and what they struggle to do well. Did the performance of your model change?

3. Then create a **technical report** discussing your model building process, the results, and your reflection on it. The report should follow the format in the example, including Introduction, Analysis, Methods, Results, and Reflection sections.
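
The prompt asks for generation with words instead of characters. As a rough illustration of what that changes in data preparation, here is a minimal sketch that tokenizes text into words and builds fixed-length windows with next-word targets. The helper names `tokenize_words` and `make_word_windows`, the regular expression, and the sample sentence are all assumptions made for this example; they are not part of the classwork reference.

```python
import re

def tokenize_words(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # Assumption: words are runs of letters/apostrophes; basic punctuation is kept as tokens.
    return re.findall(r"[a-z']+|[.,;:!?]", text.lower())

def make_word_windows(tokens, seq_length):
    """Return (word_to_int, X, y): integer-encoded windows and next-word targets."""
    vocab = sorted(set(tokens))
    word_to_int = {w: i for i, w in enumerate(vocab)}
    encoded = [word_to_int[w] for w in tokens]
    X, y = [], []
    for i in range(len(encoded) - seq_length):
        X.append(encoded[i:i + seq_length])   # seq_length words as input
        y.append(encoded[i + seq_length])     # the word that follows them
    return word_to_int, X, y

# Made-up sample text, only for demonstration.
sample = "The Savage was silent. The Savage was gone."
tokens = tokenize_words(sample)
word_to_int, X, y = make_word_windows(tokens, seq_length=3)
print(tokens)
print(len(X), "windows; first:", X[0], "->", y[0])
```

With word tokens the vocabulary is far larger than a character vocabulary, which is one reason an embedding layer is usually preferred over dividing the integer codes by the vocabulary size.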

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import Input
from tensorflow.keras import Model
from tensorflow.keras.callbacks import EarlyStopping

import keras as kb

import string
from random import randint
from pickle import load

from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [None]:
# load the text and convert it to lowercase
filename = "Brave_New_World_Aldous_Huxley_djvu.txt"
with open(filename, 'r', encoding='utf-8') as f:
    raw_text = f.read()
raw_text = raw_text.lower()

raw_text[0:100]

# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

# prepare the dataset of input to output pairs encoded as integers
seq_length = 100 # 100 characters as input
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]  # 100-character input window
    seq_out = raw_text[i + seq_length]  # the character that follows the window

    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = to_categorical(dataY)
print(y[0])

Total Characters:  385183
Total Vocab:  54
Total Patterns:  385083
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0.]
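
The printed counts are consistent with the sliding window: 385183 characters minus a window of 100 gives 385083 patterns, and the one-hot target has 54 entries, one per vocabulary character. The sketch below repeats the same steps on a toy string so the shapes are easy to inspect by hand. It uses only the standard library; `build_char_patterns` and the toy string are hypothetical names chosen for this example.

```python
def build_char_patterns(text, seq_length):
    """Mirror the notebook's windowing on an arbitrary string."""
    chars = sorted(set(text))
    char_to_int = {c: i for i, c in enumerate(chars)}
    data_x, data_y = [], []
    for i in range(len(text) - seq_length):
        data_x.append([char_to_int[c] for c in text[i:i + seq_length]])
        data_y.append(char_to_int[text[i + seq_length]])
    return chars, data_x, data_y

toy_text = "hello world"  # toy input, not taken from the book
chars, data_x, data_y = build_char_patterns(toy_text, seq_length=4)
n_vocab_toy = len(chars)

# Same normalization as the notebook: integer codes divided by vocabulary size.
first_input = [value / n_vocab_toy for value in data_x[0]]

# Hand-rolled one-hot vector, equivalent in shape to one row of to_categorical(dataY).
first_target = [1.0 if i == data_y[0] else 0.0 for i in range(n_vocab_toy)]

print("patterns:", len(data_x))   # len(text) - seq_length
print("first input:", first_input)
print("first target:", first_target)

# The notebook's own numbers follow the same rule.
assert 385183 - 100 == 385083
```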


In [None]:
# unzip
!unzip a4_model.zip

# loads model from files
model = tf.keras.models.load_model('./a4_model/')

unzip:  cannot find or open a4_model.zip, a4_model.zip.zip or a4_model.zip.ZIP.


OSError: No file or directory found at ./a4_model/

In [None]:
int_to_char = dict((i, c) for i, c in enumerate(chars))

In [None]:
import sys
n_generate = 100  # number of characters to generate (kept separate from n_chars, the corpus length)

# pick a random seed
start = np.random.randint(0, len(dataX)-1)
pattern = list(dataX[start])  # copy, so appending below does not modify dataX
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
print("------------------------------------------------------------")

# generate characters
for i in range(n_generate):
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)

    # get the model's predicted probabilities over the vocabulary
    prediction = model.predict(x, verbose=0)
    index = int(np.argmax(prediction))  # index of the highest probability

    # map the index back to its character
    result = int_to_char[index]

    # write the character to the console
    sys.stdout.write(result)

    # slide the window: append the prediction and drop the oldest character
    pattern.append(index)
    pattern = pattern[1:]

print("\n------------------------------------------------------------")
print("\nDone.")

Seed:
" act that it carries a bonus amounting to six months' 
salary"; continued with some account of the te "
------------------------------------------------------------
reen he had been his 
mind seasler and the savage was all and the savage was all out a 
sale continu
------------------------------------------------------------

Done.
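
The generated text above repeats itself ("the savage was all and the savage was all"), which is a common result of always taking the `argmax`. One alternative, which this notebook does not use, is to sample the next index from the predicted distribution after rescaling it with a temperature. The following is a minimal, dependency-free sketch of that idea; `sample_with_temperature` and the probability vector are hypothetical and only for illustration.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from `probs` after rescaling by `temperature`.

    Temperatures below 1 sharpen the distribution (closer to argmax);
    temperatures above 1 flatten it (more varied output).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Rescale in log space; the small epsilon avoids log(0).
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    max_logit = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - max_logit) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

rng = random.Random(0)            # fixed seed so the demo is repeatable
probs = [0.7, 0.2, 0.1]           # made-up distribution over three characters
draws = [sample_with_temperature(probs, temperature=1.0, rng=rng) for _ in range(1000)]
print({i: draws.count(i) for i in range(len(probs))})
```

In the generation loop, this would stand in for `np.argmax(prediction)`, for example `index = sample_with_temperature(prediction[0], 0.8)`, assuming `prediction[0]` is the model's probability vector over the vocabulary. The value 0.8 is an arbitrary starting point.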


In [None]:

def generate_text(model, dataX, int_to_char, n_vocab, num_sequences, sequence_length=100):
    """Print `num_sequences` greedy continuations, each from a random seed in dataX."""
    for _ in range(num_sequences):
        start = np.random.randint(0, len(dataX) - 1)
        pattern = list(dataX[start])  # copy, so appending below does not modify dataX
        print("Seed:")
        print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
        print("------------------------------------------------------------")
        sys.stdout.write("\"")

        output_text = []
        # Generate characters
        for i in range(sequence_length):
            x = np.reshape(pattern, (1, len(pattern), 1))
            x = x / float(n_vocab)
            prediction = model.predict(x, verbose=0)  # predicted probabilities
            index = int(np.argmax(prediction))  # index of the highest probability
            result = int_to_char[index]
            output_text.append(result)
            # Slide the window: append the prediction and drop the oldest character
            pattern.append(index)
            pattern = pattern[1:]

        # Join the characters, then collapse newlines and repeated whitespace into single spaces
        full_text = ''.join(output_text).split()
        print(" ".join(full_text))
        print("\n------------------------------------------------------------")
    print("\nDone.")

In [None]:
generate_text(model, dataX, int_to_char, n_vocab, 10)

Seed:
" irst used officially in 
a.f. 214. why not before? two reasons, (a) ..." 

"these early experimenter "
------------------------------------------------------------
"," said the savage, with a san and a perfect ce- cal, they had so hand and surned away and was some

------------------------------------------------------------
Seed:
"  and bacteriological conditioning of the embryo. practical instruc- 
tions for beta embryo-store wor "
------------------------------------------------------------
"k and the savage was a mong chain, the sruare was a prison of the south of the souther- the savage

------------------------------------------------------------
Seed:
" . linda and he-linda was his 
mother (the word made lenina look uncomfortable)-were strangers in 
th "
------------------------------------------------------------
"e seventeen thousand as the savage was all all the savage reserval the savage round his hand and th

---------------------------------------------------------