* Created GitHub repository with descriptive name
* Uploaded code (.ipynb or .py file)
* Created comprehensive README.md with:
* Project description
* Architecture explanation
How to run the code
* cleanedctexts and analysis
* Team member contributions
* Added requirements.txt or environment.yml (if needed)
* Tested that repository is public and accessible
* Prepared 2-minute presentation
* All team members understand the code
* Submitted GitHub link to Brightspace

Goal
Build a character-level or word-level text generator that learns to write in a particular style
using RNN/LSTM.
Minimal Viable Implementation
1. Choose a text corpus (Shakespeare, song lyrics, etc.)
2. Build character-level or word-level RNN/LSTM
3. Train the model to predict next character/word
4. Generate new text by sampling from the model
Suggested Datasets
• Shakespeare text (small, classic choice)
• Your favorite song lyrics (personal touch!)

In [11]:
# !pip3 -V
# !pip3 install tensorflow
# !pip3 install numpy
# !pip3 install re

pip 21.2.4 from /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/pip (python 3.9)
Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow
  Using cached tensorflow-2.20.0-cp39-cp39-macosx_12_0_arm64.whl (200.4 MB)
Collecting grpcio<2.0,>=1.24.3
  Downloading grpcio-1.76.0-cp39-cp39-macosx_11_0_universal2.whl (11.8 MB)
[K     |████████████████████████████████| 11.8 MB 6.2 MB/s eta 0:00:01
Collecting keras>=3.10.0
  Downloading keras-3.10.0-py3-none-any.whl (1.4 MB)
[K     |████████████████████████████████| 1.4 MB 4.9 MB/s eta 0:00:01
[?25hCollecting termcolor>=1.1.0
  Downloading termcolor-3.1.0-py3-none-any.whl (7.7 kB)
Collecting requests<3,>=2.21.0
  Downloading requests-2.32.5-py3-none-any.whl (64 kB)
[K     |████████████████████████████████| 64 kB 6.0 MB/s  eta 0:00:01
[?25hCollecting wrapt>=1.11.0
  Downloading wrapt-2.0.1-cp39-cp39-macosx_11_0_arm64.whl (61 kB)
[K   

In [12]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
import re
# import time

# 1. Prepare text data
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
text = open(path_to_file, 'rb').read().decode(encoding='UTF-8')

text_list = text.split('\n')

text = text.lower()
texts = re.sub(r'[^\w\s]','', text) # removes nonwords or spaces
cleaned_text = re.sub(r'[\n]',' ', texts)
vocab = sorted(set(cleaned_text))
print(vocab)



Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
[1m1115394/1115394[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 0us/step
[' ', '3', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [13]:
# Tokenize and create sequences
sequence_length = 29 # to fit vocab length

chars = tf.strings.unicode_split(cleaned_text, input_encoding='UTF-8', errors='ignore')
print(chars)

ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=list(vocab), mask_token=None)
ids = ids_from_chars(chars)
print(len(list(vocab)))
print(len(ids_from_chars.get_vocabulary()))

chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), encoding='UTF-8', invert=True, mask_token=None)
# print(chars_from_ids)

all_ids = ids_from_chars(chars)
print(all_ids)

ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)
print(ids_dataset)

sequences = ids_dataset.batch(sequence_length+1, drop_remainder=True)
print(sequences)

tf.Tensor([b'f' b'i' b'r' ... b'n' b'g' b' '], shape=(1060997,), dtype=string)
28
29
tf.Tensor([ 8 11 20 ... 16  9  1], shape=(1060997,), dtype=int64)
<_TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>
<_BatchDataset element_spec=TensorSpec(shape=(30,), dtype=tf.int64, name=None)>


In [14]:
# checking to see if sequences works
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)

for seq in sequences.take(5):
  print(text_from_ids(seq).numpy())

b'first citizen before we procee'
b'd any further hear me speak  a'
b'll speak speak  first citizen '
b'you are all resolved rather to'
b' die than to famish  all resol'


2025-11-19 21:49:32.226498: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


In [15]:
# 2. Build LSTM model
vocab_size = len(ids_from_chars.get_vocabulary())
embedding_dim = 256
lstm_units = 512
model = Sequential([
Embedding(vocab_size, embedding_dim,
input_length=sequence_length),
# two LSTM layers
LSTM(lstm_units, return_sequences=True),
LSTM(lstm_units),
Dense(vocab_size, activation='softmax')
])



In [60]:
# 3. Train model

def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return (input_text, target_text)

dataset = sequences.map(split_input_target)
print(dataset)

BUFFER_SIZE = 10000
BATCH_SIZE = 64
dataset_batched = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))
print(dataset_batched)

model.compile(optimizer='adam', loss='categorical_crossentropy')

history = model.fit(dataset_batched, batch_size=BATCH_SIZE, epochs=20)
print(history)

<_MapDataset element_spec=(TensorSpec(shape=(29,), dtype=tf.int64, name=None), TensorSpec(shape=(29,), dtype=tf.int64, name=None))>
<_PrefetchDataset element_spec=(TensorSpec(shape=(64, 29), dtype=tf.int64, name=None), TensorSpec(shape=(64, 29), dtype=tf.int64, name=None))>
Epoch 1/20
[1m552/552[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m407s[0m 734ms/step - loss: 1448.5936
Epoch 2/20
[1m552/552[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m407s[0m 734ms/step - loss: 1448.5936
Epoch 2/20
[1m552/552[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m385s[0m 697ms/step - loss: 1584.2043
Epoch 3/20
[1m552/552[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m385s[0m 697ms/step - loss: 1584.2043
Epoch 3/20
[1m552/552[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m377s[0m 683ms/step - loss: 1580.0298
Epoch 4/20
[1m552/552[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m377s[0m 683ms/step - loss: 1580.0298
Epoch 4/20
[1m552/552[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m477s[0

In [61]:
def generate_text(seed, length=200, temperature=1.0):
    """
    Generates text character-by-character using the trained LSTM model.
    'seed' is the starting text.
    'length' is how many new characters to generate.
    'temperature' controls randomness (lower = less random, higher = more random).
    """

    # clean seed
    seed = seed.lower()
    text = re.sub(r'[^\w\s]','', seed) # removes nonwords or spaces
    cleaned_text = re.sub(r'[\n]',' ', text)

    # Convert each seed character to its integer ID.
    char_to_id = ids_from_chars(tf.strings.unicode_split(cleaned_text, 'UTF-8')).numpy().tolist()

    for l in range(length):
        sample_window = char_to_id[-sequence_length:]
        if len(sample_window) < sequence_length:
            pad_len = sequence_length - len(sample_window)
            sample_window = [0] * pad_len + sample_window

        sample_window = np.array(sample_window).reshape(1, -1)

        prediction = model.predict(sample_window, verbose=0)[0]

        prediction = np.log(prediction + 1e-8) / temperature
        exp_preds = np.exp(prediction)
        prediction = exp_preds / np.sum(exp_preds)

        # Randomly choose a character index according to the probability distribution.
        next_idx = np.random.choice(range(vocab_size), p=prediction)
        next_char = chars_from_ids(tf.constant([next_idx])).numpy()[0].decode('utf-8')

        # Append the new character to both the numeric sequence and the output string.
        if next_char != "[UNK]":
            char_to_id.append(next_idx)
            cleaned_text += next_char

    return cleaned_text

# test with paradise lost
seed1 = "The mind is its own place, and in itself can make a heaven of hell, a hell of heaven."
seed2 = "Abashed the devil stood and felt how awful goodness is and saw Virtue in her shape how lovely: and pined his loss"
seed3 =  "All is not lost, the unconquerable will, and study of revenge, immortal hate, and the courage never to submit or yield"
print(generate_text(seed1, length=100, temperature=0.8))
print(generate_text(seed2, length=100, temperature=0.8))
print(generate_text(seed3, length=100, temperature=0.8))

the mind is its own place and in itself can make a heaven of hell a hell of heaven hm cc   c c chhc  ht c n yhdcch cchc hcl chmchh ncchco nchchhhc chrc chc ko  ca zy amhc ccha h ck
abashed the devil stood and felt how awful goodness is and saw virtue in her shape how lovely and pined his losshc hc  cm  3y mrh   dn canyhh hh hccchmkc  h c ac ch fz chhhc achh  ncc     hlmv l c3kc 3hkwhhhc 
abashed the devil stood and felt how awful goodness is and saw virtue in her shape how lovely and pined his losshc hc  cm  3y mrh   dn canyhh hh hccchmkc  h c ac ch fz chhhc achh  ncc     hlmv l c3kc 3hkwhhhc 
all is not lost the unconquerable will and study of revenge immortal hate and the courage never to submit or yield h ntc a hhc  lr zkjk hczcnchncn ch chhacc jh ah hch hhnrc     hhhc  h hyac h    a n czhac    h h
all is not lost the unconquerable will and study of revenge immortal hate and the courage never to submit or yield h ntc a hhc  lr zkjk hczcnchncn ch chhacc jh ah hch hhnrc     hhhc  h hy