<a href="https://colab.research.google.com/github/vince-camm/GENAI-HW5/blob/main/Homework_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Vincent Cammisa
Gen AI HW 5

Building an LSTM Model: Generating Coherent and Stylistic Text in the Manner of Mark Twain

In [None]:
import numpy as np
import json
import re
import string

import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, losses

In [None]:
VOCAB_SIZE = 10000
MAX_LEN = 200
EMBEDDING_DIM = 100
N_UNITS = 128
VALIDATION_SPLIT = 0.2
SEED = 42
LOAD_MODEL = False
BATCH_SIZE = 32
EPOCHS = 25

## TASK 1: Data Collection and Preparation

Gather your selected plain texts from Project Gutenberg.

Combine multiple texts into a single dataset, as necessary.

Preprocess the collected texts (cleaning, tokenization, and formatting).


In [None]:
import requests
import re

# List of URLs for the source texts (three Mark Twain works)
urls = [
    "https://www.gutenberg.org/files/76/76-0.txt",  # The Adventures of Huckleberry Finn
    "https://www.gutenberg.org/files/74/74-0.txt",  # The Adventures of Tom Sawyer
    "https://www.gutenberg.org/files/245/245-0.txt",  # Life on the Mississippi
]

# Initialize an empty string to hold all text
all_text = ""

# Download each text file and append it to all_text
for url in urls:
    response = requests.get(url)
    response.raise_for_status()  # Fail early if a download did not succeed
    text = response.text
    all_text += text + "\n\n"  # Separate texts with blank lines

# Save the combined text to a single file
with open("combined_twain.txt", "w", encoding="utf-8") as file:
    file.write(all_text)
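
The downloaded files also contain Project Gutenberg's license header and footer, which are not Twain's prose. Below is a minimal sketch of one way to strip them during cleaning. It assumes each file marks the book body with `*** START OF ...` and `*** END OF ...` lines. The helper name `strip_gutenberg_boilerplate` is hypothetical and is not used elsewhere in this notebook.

```python
import re

# Assumption: the book body sits between "*** START OF ..." and
# "*** END OF ..." marker lines, as in Project Gutenberg plain-text files.
START_RE = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Return only the text between the START and END markers, if both exist."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start is None or end is None or end.start() <= start.end():
        return text  # Markers not found: leave the text unchanged
    return text[start.end():end.start()].strip()


# Tiny made-up sample that mimics the layout of a Gutenberg file
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "You don't know about me.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text...\n"
)
print(strip_gutenberg_boilerplate(sample))
```

If used, the helper would be applied to each `response.text` before it is appended to `all_text`.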

In [None]:
with open("combined_twain.txt", "r", encoding="utf-8") as file:
    all_text = file.read()

# Split the text into lines
text_data = all_text.split("\n")

# Prefix each non-empty line with "Text: " and drop empty lines
filtered_data = [
    "Text: " + line
    for line in text_data
    if line.strip()  # Filter out empty lines
]

In [None]:
example = filtered_data[17654]
print(example)

Text: stabboard side. There warn't no more high jinks. Everybody got solemn;


In [None]:
# Pad the punctuation, to treat them as separate 'words'
def pad_punctuation(s):
    s = re.sub(f"([{string.punctuation}])", r" \1 ", s)
    s = re.sub(" +", " ", s)
    return s


text_data = [pad_punctuation(x) for x in filtered_data]

In [None]:
# Count the texts
n_texts = len(filtered_data)
print(f"{n_texts} Twain texts loaded")

29595 Twain texts loaded


In [None]:
example_data = text_data[17654]
example_data

"Text : stabboard side . There warn ' t no more high jinks . Everybody got solemn ; "

In [None]:
# Convert to a Tensorflow Dataset
text_ds = (
    tf.data.Dataset.from_tensor_slices(text_data)
    .batch(BATCH_SIZE)
    .shuffle(1000)
)

In [None]:
# Create a vectorisation layer
vectorize_layer = layers.TextVectorization(
    standardize="lower",
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAX_LEN + 1,
)


In [None]:
# Adapt the layer to the training set
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()

In [None]:
# Display some token:word mappings
for i, word in enumerate(vocab[:10]):
    print(f"{i}: {word}")

0: 
1: [UNK]
2: :
3: text
4: ,
5: the
6: .
7: and
8: a
9: to


In [None]:
# Display the same example converted to ints
example_tokenised = vectorize_layer(example_data)
print(example_tokenised.numpy())

[   3    2 2578  241    6   40 1884   16  109   51   91  297 9203    6
  279   58 1227   17    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 

In [None]:
# Create the training set of Twain texts and the same text shifted by one word
def prepare_inputs(text):
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y


train_ds = text_ds.map(prepare_inputs)
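
To make the one-word shift in `prepare_inputs` concrete, here is a small sketch on a toy batch. The token ids and the sequence length of 6 are made up for illustration. NumPy stands in for the TensorFlow tensors, since the slicing works the same way.

```python
import numpy as np

# Toy batch of already-tokenized lines (0 is the padding token), standing in
# for the output of vectorize_layer with a sequence length of 6.
tokenized = np.array([
    [3, 2, 40, 16, 6, 0],
    [3, 2, 279, 58, 17, 0],
])

x = tokenized[:, :-1]  # Inputs: every token except the last
y = tokenized[:, 1:]   # Targets: every token except the first

# At each position t, the target y[:, t] is the token that follows x[:, t].
print(x[0].tolist())  # [3, 2, 40, 16, 6]
print(y[0].tolist())  # [2, 40, 16, 6, 0]
```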

## Task 2 Initial LSTM Model Training

Implement a baseline LSTM model with one layer.

Train the model on the initial dataset and evaluate its performance.

Generate sample text and assess coherence and stylistic accuracy.

In [None]:
inputs = layers.Input(shape=(None,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)
x = layers.LSTM(N_UNITS, return_sequences=True)(x)
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
lstm = models.Model(inputs, outputs)
lstm.summary()

In [None]:
loss_fn = losses.SparseCategoricalCrossentropy()
lstm.compile("adam", loss_fn)

In [None]:
# Create a TextGenerator callback
class TextGenerator(callbacks.Callback):
    def __init__(self, vocab, model, index_to_word):
        super().__init__()
        self.vocab = vocab
        self._model = model
        self.index_to_word = index_to_word
        # Inverse vocabulary mapping: word -> token index
        self.word_to_index = {
            word: index for index, word in enumerate(index_to_word)
        }

    def sample_from(self, probs, temperature):
        # Rescale the probabilities with the temperature, then renormalise
        probs = probs ** (1 / temperature)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs), probs

    def generate(self, start_prompt, max_tokens, temperature):
        # Words missing from the vocabulary map to index 1 ([UNK])
        start_tokens = [
            self.word_to_index.get(x, 1) for x in start_prompt.split()
        ]
        sample_token = None
        info = []
        # Stop at max_tokens or as soon as the padding token (0) is sampled
        while len(start_tokens) < max_tokens and sample_token != 0:
            x = np.array([start_tokens])
            y = self.model.predict(x, verbose=0)  # Distributions for every position
            # Sample the next word from the distribution at the last position
            sample_token, probs = self.sample_from(y[0][-1], temperature)
            info.append({"prompt": start_prompt, "word_probs": probs})
            start_tokens.append(sample_token)
            start_prompt = start_prompt + " " + self.index_to_word[sample_token]
        print(f"\ngenerated text:\n{start_prompt}\n")
        return info

    def on_epoch_end(self, epoch, logs=None):
        self.generate("It was a close place.", max_tokens=100, temperature=1.0)

In [None]:
# Create the text generator callback for the baseline model
text_generator = TextGenerator(vocab, model=lstm, index_to_word=vocab)

In [None]:
lstm.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[text_generator],
)

Epoch 1/25
925/925 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step - loss: 1.3419
generated text:
It was a close place. swearing visiting having ! enormous potter of agreed alone here _ , 

925/925 ━━━━━━━━━━━━━━━━━━━━ 42s 42ms/step - loss: 1.3412
Epoch 2/25
924/925 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step - loss: 0.4196
generated text:
It was a close place. appreciated have symmetrical schoolhouse snort scanned dah even distress vast crisp better roped 

925/925 ━━━━━━━━━━━━━━━━━━━━ 39s 42ms/step - loss: 0.4196
Epoch 3/25
925/925 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step - loss: 0.3954
generated text:
It was a close place. prudence features visited curving broad circus wednesday bayou cable doin’ field climax picturesquely iii eddy boilers thieves repent young expenses tallied gigantic amount port —your shackleford earlier stunning called , any m

<keras.src.callbacks.history.History at 0x7f65b551fdc0>

## Evaluation of Training

## Initial One-Layer Model
The goal of training was to capture the nuances of Mark Twain's writing style. With only one layer, the model started off well, picking up vocabulary that is common in the three Twain texts provided. However, the model's per-epoch checkpoints stopped being informative once it started producing only a single word. In terms of cohesiveness, the generated strings of text do not make much sense.

Running on a TPU, 25 epochs took 20 minutes.


In [None]:
def print_probs(info, vocab, top_k=5):
    for i in info:
        print(f"\nPROMPT: {i['prompt']}")
        word_probs = i["word_probs"]
        p_sorted = np.sort(word_probs)[::-1][:top_k]
        i_sorted = np.argsort(word_probs)[::-1][:top_k]
        for p, idx in zip(p_sorted, i_sorted):  # idx avoids shadowing the outer loop variable
            print(f"{vocab[idx]}:   \t{np.round(100*p,2)}%")
        print("--------\n")

In [None]:
info = text_generator.generate(
    "Huck said", max_tokens=10, temperature=1.0
)


generated text:
Huck said : right , tom , if ’em do



In [None]:
print_probs(info, vocab)


PROMPT: Huck said
::   	46.6%
:   	10.64%
yes:   	1.48%
tons:   	1.04%
then:   	0.65%
--------


PROMPT: Huck said :
:   	35.86%
the:   	3.42%
i:   	2.55%
he:   	2.47%
[UNK]:   	2.42%
--------


PROMPT: Huck said : right
,:   	29.78%
.:   	13.89%
-:   	13.2%
of:   	8.01%
:   	2.93%
--------


PROMPT: Huck said : right ,
i:   	8.66%
and:   	6.65%
the:   	4.19%
a:   	3.49%
you:   	3.49%
--------


PROMPT: Huck said : right , tom
,:   	50.74%
.:   	10.64%
!:   	5.11%
said:   	4.76%
sawyer:   	4.02%
--------


PROMPT: Huck said : right , tom ,
and:   	22.27%
a:   	4.36%
[UNK]:   	3.72%
if:   	3.67%
you:   	3.0%
--------


PROMPT: Huck said : right , tom , if
you:   	24.7%
i:   	9.98%
they:   	7.29%
it:   	5.39%
you’re:   	4.65%
--------


PROMPT: Huck said : right , tom , if ’em
do:   	7.02%
have:   	5.81%
go:   	5.6%
be:   	4.81%
come:   	3.84%
--------



In [None]:
info = text_generator.generate(
    "Huck said", max_tokens=10, temperature=0.2
)


generated text:
Huck said : 



In [None]:
print_probs(info, vocab)


PROMPT: Huck said
::   	99.94%
:   	0.06%
yes:   	0.0%
tons:   	0.0%
then:   	0.0%
--------


PROMPT: Huck said :
:   	100.0%
the:   	0.0%
i:   	0.0%
he:   	0.0%
[UNK]:   	0.0%
--------



In [None]:
info = text_generator.generate(
    "At night Huck", max_tokens=7, temperature=1.0
)
print_probs(info, vocab)


generated text:
At night Huck 


PROMPT: At night Huck
:   	85.01%
,:   	0.73%
was:   	0.5%
and:   	0.46%
[UNK]:   	0.3%
--------



In [None]:
info = text_generator.generate(
    "When he saw the river", max_tokens=50, temperature=0.2
)
print_probs(info, vocab)


generated text:
When he saw the river ' s 


PROMPT: When he saw the river
':   	67.0%
,:   	15.82%
was:   	12.79%
is:   	3.86%
[UNK]:   	0.23%
--------


PROMPT: When he saw the river '
s:   	100.0%
:   	0.0%
l:   	0.0%
n:   	0.0%
d:   	0.0%
--------


PROMPT: When he saw the river ' s
:   	99.98%
':   	0.02%
name:   	0.0%
a:   	0.0%
sake:   	0.0%
--------



## Evaluation of Generated Text

## Initial One-Layer LSTM
The first generated text was a three-token output that continued the given prompt, with limited cohesiveness. However, the model did not seem able to generate higher-quality text with the architecture on which it was trained.


## Task 3 Experiment with Model Complexity

Increase the number of LSTM layers (e.g., 2 layers, 3 layers, etc.), as necessary.

Train and evaluate each configuration to compare performance.

Adjust the number of units in each LSTM layer (e.g., 64, 128, 256).

Analyze how varying the number of units affects the quality of generated text.

In [None]:
def lstm_model_2(num_layers=2, num_units=256, dropout_rate=0.2):
    inputs = layers.Input(shape=(None,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)
    for _ in range(num_layers):
        x = layers.LSTM(num_units, return_sequences=True)(x)
        x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
    lstm_model = models.Model(inputs, outputs)
    return lstm_model

model = lstm_model_2()

model.summary()
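
As a sanity check on `model.summary()`, the parameter counts can be worked out by hand. The sketch below assumes the default Keras LSTM layout (four gates, each with input weights, recurrent weights, and a bias). The helper names are hypothetical. It compares the baseline (1 layer, 128 units) with the default configuration of `lstm_model_2` (2 layers, 256 units). Dropout layers add no parameters.

```python
def embedding_params(vocab_size, embedding_dim):
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, output_dim):
    return input_dim * output_dim + output_dim


def count_params(vocab_size, embedding_dim, units, num_layers):
    total = embedding_params(vocab_size, embedding_dim)
    input_dim = embedding_dim
    for _ in range(num_layers):
        total += lstm_params(input_dim, units)
        input_dim = units  # The next LSTM layer reads the previous layer's output
    return total + dense_params(units, vocab_size)


baseline = count_params(10000, 100, 128, 1)
deeper = count_params(10000, 100, 256, 2)
print(baseline)  # 2407248
print(deeper)    # 4460880
```

In both cases most of the parameters sit in the embedding and the output `Dense` layer, because both scale with `VOCAB_SIZE`.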

In [None]:
lstm_2 = lstm_model_2(num_layers=2, num_units=256)  # Adjust layers and units as needed
lstm_2.compile("adam", loss_fn)

lstm_2.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[text_generator],
)

Epoch 1/25
925/925 ━━━━━━━━━━━━━━━━━━━━ 0s 79ms/step - loss: 1.0279
generated text:
It was a close place. with he by it , ; association moment _ edition was about on 

925/925 ━━━━━━━━━━━━━━━━━━━━ 77s 80ms/step - loss: 1.0274
Epoch 2/25
925/925 ━━━━━━━━━━━━━━━━━━━━ 0s 79ms/step - loss: 0.4334
generated text:
It was a close place. - men morning months 

925/925 ━━━━━━━━━━━━━━━━━━━━ 74s 80ms/step - loss: 0.4334
Epoch 3/25
925/925 ━━━━━━━━━━━━━━━━━━━━ 0s 80ms/step - loss: 0.4022
generated text:
It was a close place. wave . this ditches head of the next 

925/925 ━━━━━━━━━━━━━━━━━━━━ 74s 80ms/step - loss: 0.4022
Epoch 4/25
925/925 ━━━━━━━━━━━━━━━━━━━━ 0s 80ms/step - loss: 0.3805
generated text:
It was a close place. - stone 

925/925 ━━━━━━━━━

<keras.src.callbacks.history.History at 0x7f65a014b0a0>

## Evaluation of Training a More Complex LSTM

Although the text produced during this training run is a bit short, it seems more coherent. Stylistically, the model learns some nuance, and you can almost picture the finishing touches to sentences that the earlier training callbacks did not provide. This model has added layers and thus takes longer to train. One thing that seems to be true is that, as the epochs go on, the output is sometimes empty. This could be because of an overfitting problem.

Running on a TPU, 25 epochs took 31 minutes.
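
There is another possible explanation for the empty outputs. It is offered here as an assumption, not as something this notebook tested. `generate` stops as soon as token 0 (padding) is sampled, and every line is padded to `MAX_LEN + 1` tokens, so most training targets are padding. The sketch below measures this for the tokenized example line shown earlier.

```python
import numpy as np

MAX_LEN = 200  # Same sequence length as in the notebook

# Token ids of the example line shown earlier (18 real tokens), right-padded
# with zeros to MAX_LEN + 1, as the vectorization layer does.
real_tokens = [3, 2, 2578, 241, 6, 40, 1884, 16, 109, 51, 91, 297, 9203, 6,
               279, 58, 1227, 17]
tokenized = np.zeros(MAX_LEN + 1, dtype=np.int64)
tokenized[:len(real_tokens)] = real_tokens

y = tokenized[1:]  # Training targets for this line (shifted by one)
padding_fraction = float(np.mean(y == 0))
print(f"{padding_fraction:.1%} of the targets are padding")
```

If padding does dominate the loss, one thing to try would be masking the padded positions, for example with the `mask_zero=True` option of the `Embedding` layer. Its effect is untested here.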


In [None]:
index_to_word = vectorize_layer.get_vocabulary()
text_generator_complex = TextGenerator(vocab, index_to_word=index_to_word, model=lstm_2)

In [None]:
def print_probs(info, vocab, top_k=5):
    for i in info:
        print(f"\nPROMPT: {i['prompt']}")
        word_probs = i["word_probs"]
        p_sorted = np.sort(word_probs)[::-1][:top_k]
        i_sorted = np.argsort(word_probs)[::-1][:top_k]
        for p, idx in zip(p_sorted, i_sorted):  # idx avoids shadowing the outer loop variable
            print(f"{vocab[idx]}:   \t{np.round(100*p,2)}%")
        print("--------\n")

In [None]:
info = text_generator_complex.generate(
    "Huck said", max_tokens=10, temperature=1.0
)


generated text:
Huck said he was very [UNK] , carried me sadly



In [None]:
print_probs(info, vocab)


PROMPT: Huck said
i:   	15.86%
he:   	12.9%
the:   	8.39%
it:   	7.39%
to:   	6.11%
--------


PROMPT: Huck said he
was:   	19.39%
would:   	11.58%
had:   	11.51%
said:   	3.43%
could:   	3.43%
--------


PROMPT: Huck said he was
a:   	26.76%
the:   	4.41%
[UNK]:   	3.62%
to:   	3.54%
in:   	3.5%
--------


PROMPT: Huck said he was very
ill:   	12.8%
well:   	10.21%
[UNK]:   	6.71%
a:   	6.33%
good:   	3.86%
--------


PROMPT: Huck said he was very [UNK]
,:   	20.76%
and:   	8.74%
to:   	7.71%
that:   	7.24%
.:   	6.79%
--------


PROMPT: Huck said he was very [UNK] ,
and:   	53.86%
but:   	8.94%
for:   	4.26%
he:   	3.32%
so:   	2.24%
--------


PROMPT: Huck said he was very [UNK] , carried
his:   	13.13%
up:   	9.96%
her:   	9.62%
a:   	8.55%
the:   	6.7%
--------


PROMPT: Huck said he was very [UNK] , carried me
a:   	23.98%
his:   	8.05%
the:   	6.81%
in:   	6.39%
on:   	4.79%
--------



In [None]:
info = text_generator_complex.generate(
    "Huck said", max_tokens=10, temperature=1.5
)


generated text:
Huck said to change them wild eye from lying already



In [None]:
print_probs(info, vocab)


PROMPT: Huck said
i:   	15.86%
he:   	12.9%
the:   	8.39%
it:   	7.39%
to:   	6.11%
--------


PROMPT: Huck said he
was:   	19.39%
would:   	11.58%
had:   	11.51%
said:   	3.43%
could:   	3.43%
--------


PROMPT: Huck said he would
be:   	23.3%
have:   	6.24%
not:   	5.72%
a:   	4.05%
.:   	3.05%
--------


PROMPT: Huck said he would a
been:   	6.37%
[UNK]:   	4.55%
ben:   	3.32%
heard:   	3.19%
thought:   	2.58%
--------


PROMPT: Huck said he would a done
it:   	18.97%
that:   	10.16%
the:   	8.39%
him:   	4.85%
a:   	4.69%
--------


PROMPT: Huck said he would a done it
.:   	42.91%
,:   	16.88%
;:   	8.62%
to:   	3.46%
before:   	2.23%
--------


PROMPT: Huck said he would a done it .
i:   	20.4%
:   	12.92%
but:   	9.82%
he:   	9.73%
so:   	6.36%
--------


PROMPT: Huck said he would a done it . he
said:   	23.14%
was:   	13.05%
says:   	11.17%
told:   	4.06%
had:   	2.91%
--------



In [None]:
info = text_generator_complex.generate(
    "Huck began to", max_tokens=10, temperature=1.0
)


generated text:
Huck began to talk as she came , and snap



## Evaluation of More Complex Generated Text

The nuance and stylistic writing peek through in some of these generated texts. For example, one of the generated examples reads "Huck said to change them wild eye from lying already". Snippets like "..change them wild eye from lying already" also display a stronger sense of coherence, where the model seems to be creating plausible sentences.

It seems that by increasing the number of units and adding dropout layers and an extra LSTM layer, we improved the style and coherence of the generator.



## TASK 4 Temperature and Prompt Variations

Experiment with different temperature settings (e.g., 0.1, 0.5, 1.0).

Evaluate how temperature affects the creativity and coherence of generated text.


Test various seed prompts to generate text.

Analyze the generated outputs for each prompt and temperature combination.


## Temperature of 1, 2, and 3
WITH PROMPT OF: "The river flowed calm"

In [None]:
info = text_generator_complex.generate(
    "The river flowed calm", max_tokens=10, temperature=1.0
)


generated text:
The river flowed calm down without a dispute ; 



In [None]:
info = text_generator_complex.generate(
    "The river flowed calm", max_tokens=10, temperature=2.0
)


generated text:
The river flowed calm bullets to gift for shore .



In [None]:
info = text_generator_complex.generate(
    "The river flowed calm", max_tokens=10, temperature=3.0
)


generated text:
The river flowed calm when other enthusiast ain’t ever is



## Temperature of 1, 2, and 3
WITH PROMPT OF: Sitting on the raft

In [None]:
info = text_generator_complex.generate(
    "Sitting on the raft", max_tokens=10, temperature=1.0
)


generated text:
Sitting on the raft , the same way we would



In [None]:
info = text_generator_complex.generate(
    "Sitting on the raft", max_tokens=10, temperature=2.0
)


generated text:
Sitting on the raft part on tom’s grass of swearing



In [None]:
info = text_generator_complex.generate(
    "Sitting on the raft", max_tokens=10, temperature=3.0
)


generated text:
Sitting on the raft longest duke fortified some information 



## Temperature of 1, 2, and 3
WITH PROMPT OF: Tom saw

In [None]:
info = text_generator_complex.generate(
    "Tom saw", max_tokens=10, temperature=1.0
)


generated text:
Tom saw the place on this point . i skimmed



In [None]:
info = text_generator_complex.generate(
    "Tom saw", max_tokens=10, temperature=2.0
)


generated text:
Tom saw under a looking channel ? almost - depend



In [None]:
info = text_generator_complex.generate(
    "Tom saw", max_tokens=10, temperature=3.0
)


generated text:
Tom saw eloquence excellent steamboatmen years beg we when 



## Evaluation of generated outputs for each prompt and temperature combination.
- Temperature 1
  - The river flowed calm: *The river flowed calm down without a dispute*
  - Sitting on the Raft: *Sitting on the raft , the same way we would*
  - Tom saw: *Tom saw the place on this point . i skimmed*


- Temperature 2
  - The river flowed calm: *The river flowed calm bullets to gift for shore.*
  - Sitting on the Raft: *Sitting on the raft part on tom’s grass of swearing*
  - Tom saw: *Tom saw under a looking channel ? almost - depend*
- Temperature 3
  - The river flowed calm: *The river flowed calm when other enthusiast ain’t ever is*
  - Sitting on the Raft: *Sitting on the raft longest duke fortified some information*
  - Tom saw: *Tom saw eloquence excellent steamboatmen years beg we when*

We can observe that the most concise outputs come from temperature 1, and these are also the most accurate in terms of style and coherence. As we increase the temperature, the generator focuses less on accuracy and more on creativity. So there is a trade-off: the more creativity we request from the generator (by way of temperature), the more coherence and style we sacrifice.
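
The trade-off follows from how `sample_from` rescales the probabilities. A small sketch on a made-up four-word distribution shows the effect. The helper name `apply_temperature` and the numbers are hypothetical; the rescaling is the same as in `sample_from`.

```python
import numpy as np


def apply_temperature(probs, temperature):
    """Rescale a probability vector the same way sample_from does."""
    scaled = np.asarray(probs, dtype=np.float64) ** (1.0 / temperature)
    return scaled / scaled.sum()


# Hypothetical next-word distribution over four candidate words
probs = [0.6, 0.25, 0.1, 0.05]

for t in (0.2, 1.0, 3.0):
    print(t, np.round(apply_temperature(probs, t), 3))
```

Temperatures below 1 concentrate the probability mass on the most likely word. Temperatures above 1 flatten the distribution, so unlikely words are sampled more often.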

## TASK 5 Evaluation of Generated Text

Assess the quality of generated text (e.g., coherence, relevance, stylistic accuracy).

- Coherence: The coherence of the generated text was initially very low; it was hard to form sentences or predict a continuation of a sentence. We did get worded outputs, but they weren't as sophisticated as I had hoped. However, when we added layers to the LSTM, we saw an immediate improvement in coherence. The generated text was at its most coherent at a temperature of 1.
- Relevance: The relevance of the generated text impressed me, as I started with a prompt that generated text containing the main character's name from The Adventures of Tom Sawyer. This showed that, after training the more complex model, our LSTM was able to pick up the important features.
- Stylistic: Some of the stylistic features of Mark Twain's writing were picked up, but not as well as I had wished. There is a chance that adding more layers and more epochs could fix this. That being said, there was some pickup of the style: words like "ain't" were used, as well as phrasing such as "Sitting on the raft, the same way we would". So although it was not perfect, it was sufficient.
- History: As I produced more and more iterations of generated text, the generator on the more complex model seemed to learn, as the outputs became closer and closer to what I was looking for. The first iterations may have been lackluster, but after generating multiple times the outputs moved closer to what I wanted, seemingly learning from the past. Note, however, that calling `generate` does not update the model's weights, so this impression most likely reflects the randomness of sampling rather than actual learning.