
# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purposes.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Combine all sonnets into a single text source.  
- Split into training (80%) and validation (20%).  

In [64]:
import numpy as np
import re
from collections import Counter
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import Callback
import math

import random

def extract_sonnets(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    start_index = None
    end_index = None

    for i, line in enumerate(lines):
        if "THE SONNETS" in line.upper():
            start_index = i + 1
        elif "*** END OF THE PROJECT GUTENBERG EBOOK" in line:
            end_index = i
            break

    if start_index is None or end_index is None:
        raise ValueError("Could not find the start or end of the sonnets.")

    # extract only the sonnet text
    sonnet_lines = [line.strip() for line in lines[start_index:end_index] if line.strip()]

    # remove titles
    sonnet_lines = [line for line in sonnet_lines if not line.upper().startswith("SONNET")]

    return sonnet_lines

def split_data(lines, train_ratio=0.8):

    total_lines = len(lines)
    train_cutoff = int(total_lines * train_ratio)

    train_lines = lines[:train_cutoff]
    val_lines = lines[train_cutoff:]

    return "\n".join(train_lines), "\n".join(val_lines)

input_path = 'sonnets.txt'

sonnet_lines = extract_sonnets(input_path)
train_text, val_text = split_data(sonnet_lines)

with open('sonnets_train.txt', 'w', encoding='utf-8') as f:
    f.write(train_text)

with open('sonnets_val.txt', 'w', encoding='utf-8') as f:
    f.write(val_text)

print("Split into sonnets_train.txt and sonnets_val.txt")

Split into sonnets_train.txt and sonnets_val.txt


## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).

In [58]:
import re
from collections import Counter

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

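    # Remove punctuation except the sentence delimiters . ! ?
    # (kept delimiters stay attached to the preceding word, e.g. 'absence!')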
    text = re.sub(r'[^\w\s\.\!\?]', '', text)

    # Tokenize by words (split on whitespace)
    tokens = text.split()

    return tokens

def build_vocab(tokens):
    # Collect the unique tokens and sort them alphabetically,
    # so the word -> ID mapping is deterministic across runs
    word_counts = Counter(tokens)
    sorted_words = sorted(word_counts.keys())

    word_to_id = {word: idx for idx, word in enumerate(sorted_words)}
    id_to_word = {idx: word for word, idx in word_to_id.items()}

    return word_to_id, id_to_word

with open('combined_sonnets.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Preprocess
tokens = preprocess_text(text)

# Build vocabulary
word_to_id, id_to_word = build_vocab(tokens)

# Example outputs
print("Sample tokens:", tokens[:20])
print("Vocab size:", len(word_to_id))
print("Sample word_to_id:", dict(list(word_to_id.items())[:10]))

Sample tokens: ['by', 'william', 'shakespeare', 'i', 'from', 'fairest', 'creatures', 'we', 'desire', 'increase', 'that', 'thereby', 'beautys', 'rose', 'might', 'never', 'die', 'but', 'as', 'the']
Vocab size: 3667
Sample word_to_id: {'a': 0, 'abhor': 1, 'abide': 2, 'able': 3, 'about': 4, 'above': 5, 'absence': 6, 'absence!': 7, 'absent': 8, 'abundance': 9}
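
The sample vocabulary above contains both `'absence'` and `'absence!'`, because the kept sentence delimiters remain glued to the preceding word. One possible alternative, shown here only as a sketch and not as the method used in this notebook, is to pad each delimiter with spaces so it becomes its own token. The function name `tokenize_with_delimiters` and the demo sentence are illustrative assumptions.

```python
import re

def tokenize_with_delimiters(text):
    """Lowercase, drop punctuation except . ! ?, and emit each delimiter as its own token."""
    text = text.lower()
    # Same filter as preprocess_text: keep word characters, whitespace and . ! ?
    text = re.sub(r"[^\w\s.!?]", "", text)
    # Surround each kept delimiter with spaces so split() separates it from the word
    text = re.sub(r"([.!?])", r" \1 ", text)
    return text.split()

# Hypothetical demo sentence (not taken from the dataset file)
tokens_demo = tokenize_with_delimiters("O absence! What a torment wouldst thou prove,")
print(tokens_demo)
```

With this variant the word and the delimiter share no vocabulary entry, which usually shrinks the vocabulary somewhat; the vocabulary size printed above would change if it were adopted.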


## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM, GRU, or 1D CNN layer.
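
Conceptually, an embedding layer is a trainable lookup table: each word ID selects one row of a `(vocab_size, output_dim)` matrix. The NumPy sketch below illustrates only that lookup, with a random table and toy sizes chosen for the example; it is not the Keras implementation, and the `*_demo` names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only (the notebook uses a much larger vocabulary and 128 dimensions)
vocab_size_demo, embed_dim_demo = 10, 4
table = rng.normal(size=(vocab_size_demo, embed_dim_demo))  # one row per word ID

# A batch with one sequence of four word IDs; ID 5 appears twice
ids = np.array([[1, 5, 5, 2]])

# Fancy indexing performs the lookup: (batch, seq_len) -> (batch, seq_len, embed_dim)
vectors = table[ids]
print(vectors.shape)
```

Repeated IDs map to identical vectors, which is why the model can share what it learns about a word across every position where it occurs.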

In [59]:
from tensorflow.keras.layers import Embedding

vocab_size = len(word_to_id)
sequence_length = 20

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
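
The training cells in the next sections use `X_train` and `y_train`, which are not defined in the cells shown here. A common way to build them for next-word prediction is a sliding window: every run of `sequence_length` token IDs is an input and the following ID is its target. The sketch below assumes that scheme; `make_sequences` and the toy data are hypothetical names introduced for illustration, not code from this notebook.

```python
import numpy as np

def make_sequences(token_ids, sequence_length):
    """Slide a window over token_ids; each window is an input, the next ID is the target."""
    X, y = [], []
    for i in range(len(token_ids) - sequence_length):
        X.append(token_ids[i:i + sequence_length])
        y.append(token_ids[i + sequence_length])
    return np.array(X, dtype=np.int32), np.array(y, dtype=np.int32)

# Toy stand-in for the notebook's `tokens` and `word_to_id`
toy_tokens = "from fairest creatures we desire increase".split()
toy_vocab = {w: i for i, w in enumerate(sorted(set(toy_tokens)))}
toy_ids = [toy_vocab[w] for w in toy_tokens]

X_toy, y_toy = make_sequences(toy_ids, sequence_length=3)
print(X_toy.shape, y_toy.shape)
```

Under this assumption, the notebook's arrays could be produced with `make_sequences([word_to_id[t] for t in tokens], sequence_length)`.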

## 4. Model
- Implement an LSTM-, GRU-, or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)` or your custom 1D CNN).
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.


In [60]:
model = Sequential([
    embedding_layer,
    Conv1D(filters=256, kernel_size=5, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(128, activation='relu'),
    Dense(vocab_size, activation='softmax') # Predict next word
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

history = model.fit(
    X_train, y_train,
    batch_size=64,
    epochs=5,
    validation_split=0.1
)

Epoch 1/5
249/249 ━━━━━━━━━━━━━━━━━━━━ 18s 65ms/step - accuracy: 0.0224 - loss: 7.2610 - val_accuracy: 0.0215 - val_loss: 6.8979
Epoch 2/5
249/249 ━━━━━━━━━━━━━━━━━━━━ 20s 61ms/step - accuracy: 0.0277 - loss: 6.5251 - val_accuracy: 0.0215 - val_loss: 7.0403
Epoch 3/5
249/249 ━━━━━━━━━━━━━━━━━━━━ 20s 61ms/step - accuracy: 0.0280 - loss: 6.5063 - val_accuracy: 0.0215 - val_loss: 7.0906
Epoch 4/5
249/249 ━━━━━━━━━━━━━━━━━━━━ 21s 61ms/step - accuracy: 0.0284 - loss: 6.4951 - val_accuracy: 0.0164 - val_loss: 7.1396
Epoch 5/5
249/249 ━━━━━━━━━━━━━━━━━━━━ 20s 61ms/step - accuracy: 0.0294 - loss: 6.4418 - val_accuracy: 0.0249 - val_loss: 7.1873
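
For the CNN defined in the cell above, the tensor shapes and parameter counts can be worked out by hand. The sketch below does that arithmetic using the vocabulary size printed in Section 2 (3667) and the layer sizes from the model cell, and it assumes the Keras defaults for `Conv1D` (`padding='valid'`, stride 1, bias enabled).

```python
# Sizes taken from the notebook: vocabulary 3667, embedding 128, window 20
vocab_size, embed_dim, seq_len = 3667, 128, 20
filters, kernel_size, hidden = 256, 5, 128

# 'valid' convolution with stride 1 shortens the sequence by kernel_size - 1
conv_out_len = seq_len - kernel_size + 1

# Weights plus biases per layer (GlobalMaxPooling1D has no parameters)
params = {
    "embedding": vocab_size * embed_dim,
    "conv1d": kernel_size * embed_dim * filters + filters,
    "dense": filters * hidden + hidden,
    "output": hidden * vocab_size + vocab_size,
}
total = sum(params.values())

print("Conv1D output length:", conv_out_len)
for name, count in params.items():
    print(f"{name:>9}: {count:,}")
print(f"    total: {total:,}")
```

The embedding and the softmax output layer account for most of the parameters because both scale with the vocabulary size.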


## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponential of the average negative log-likelihood.
  - If your model outputs cross-entropy loss `H`, then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible.
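
As a quick illustration of `perplexity = e^H`, the sketch below converts a few cross-entropy values to perplexity. It assumes the loss is measured in nats (natural logarithm), which is how Keras reports `sparse_categorical_crossentropy`; the example loss is a `val_loss` value copied from this notebook's training log, and the helper name `perplexity` is introduced here for illustration.

```python
import math

def perplexity(cross_entropy_nats):
    """Perplexity for a mean cross-entropy given in nats."""
    return math.exp(cross_entropy_nats)

val_loss_example = 7.2259        # a val_loss value from this notebook's training log
uniform_loss = math.log(3667)    # loss of a uniform guess over the 3667-word vocabulary

ppl_val = perplexity(val_loss_example)
ppl_uniform = perplexity(uniform_loss)

# Cross-entropy the model would need to reach a perplexity of 50
target_loss = math.log(50)

print(f"validation perplexity: {ppl_val:.2f}")
print(f"uniform-guess perplexity: {ppl_uniform:.0f}")
print(f"loss needed for perplexity 50: {target_loss:.3f}")
```

A uniform guess over the vocabulary has a perplexity equal to the vocabulary size, which gives a baseline for judging the numbers printed during training.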

In [61]:
class PerplexityCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        train_perplexity = math.exp(logs['loss'])
        val_perplexity = math.exp(logs['val_loss'])
        print(f'\nEpoch {epoch+1}:')
        print(f'Train Perplexity: {train_perplexity:.2f}')
        print(f'Val Perplexity: {val_perplexity:.2f}')

# Train the model with perplexity monitoring
history = model.fit(
    X_train,
    y_train,
    batch_size=64,
    epochs=10,
    validation_split=0.1,
    callbacks=[PerplexityCallback()]
)

# Final evaluation
final_train_loss = history.history['loss'][-1]
final_val_loss = history.history['val_loss'][-1]
print(f'\nFinal Training Perplexity: {math.exp(final_train_loss):.2f}')
print(f'Final Validation Perplexity: {math.exp(final_val_loss):.2f}')

Epoch 1/10
248/249 ━━━━━━━━━━━━━━━━━━━━ 0s 58ms/step - accuracy: 0.0406 - loss: 6.3319
Epoch 1:
Train Perplexity: 573.29
Val Perplexity: 1374.63
249/249 ━━━━━━━━━━━━━━━━━━━━ 15s 60ms/step - accuracy: 0.0406 - loss: 6.3321 - val_accuracy: 0.0232 - val_loss: 7.2259
Epoch 2/10
248/249 ━━━━━━━━━━━━━━━━━━━━ 0s 61ms/step - accuracy: 0.0517 - loss: 6.1613
Epoch 2:
Train Perplexity: 491.05
Val Perplexity: 1401.83
249/249 ━━━━━━━━━━━━━━━━━━━━ 21s 64ms/step - accuracy: 0.0516 - loss: 6.1616 - val_accuracy: 0.0227 - val_loss: 7.2455
Epoch 3/10
248/249 ━━━━━━━━━━━━━━━━━━━━ 0s 58ms/step - accuracy: 0.0659 - loss: 5.9394
Epoch 3:
Train Perplexity: 396.66
Val Perplexity: 1482.36
249/249 ━━━━━━━━━━━━━━━━━━━━ 20s 61ms/step - accuracy: 0.0659 - loss: 5.9397 - val_accuracy: 0.0260 - val_loss: 7.3014
...

## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).

In [62]:
def generate_text(seed_phrase, num_tokens=50, temperature=0.7):
    generated = seed_phrase.lower().split()
    for _ in range(num_tokens):
        # Convert the last `sequence_length` tokens to IDs
        # (out-of-vocabulary words fall back to ID 0)
        seed_ids = [word_to_id.get(word, 0) for word in generated[-sequence_length:]]
        seed_ids = pad_sequences([seed_ids], maxlen=sequence_length)

        # Predict next token probabilities
        preds = model.predict(seed_ids, verbose=0)[0]

        # Apply temperature scaling in float64; the epsilon avoids log(0) and
        # float64 keeps the renormalized probabilities valid for np.random.choice
        preds = np.log(preds.astype('float64') + 1e-9) / temperature
        exp_preds = np.exp(preds)
        preds = exp_preds / np.sum(exp_preds)

        # Sample next token
        next_id = np.random.choice(len(preds), p=preds)
        generated.append(id_to_word[next_id])

    return ' '.join(generated)

# Generate two distinct samples
print(generate_text(seed_phrase="thou art", num_tokens=50))
print(generate_text(seed_phrase="had stol’n", num_tokens=50))

thou art gone which for is beauty my doth doth yet with kind face part to of love what blind those loves was love time sin love that that me me but that it i which from should on for my cannot the for the my for the day forbear of far
had stol’n fear souls weep days now form and now the which of worth weep peace now not time it due was his that beautys of now since first could the the wide mend shows hath hath my made whilst on to call sometime belovd from thence they but but those i
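
The `temperature` argument of `generate_text` controls how adventurous the sampling is. The standalone NumPy sketch below shows the effect on a toy three-word distribution: values below 1 sharpen the distribution toward the most likely word, and values above 1 flatten it. The helper `apply_temperature` and the toy probabilities are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def apply_temperature(probs, temperature=1.0, eps=1e-9):
    """Rescale a probability vector: divide log-probabilities by temperature, then renormalize."""
    logits = np.log(np.asarray(probs, dtype=np.float64) + eps) / temperature
    logits -= logits.max()          # subtract the max for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()

probs = np.array([0.5, 0.3, 0.2])   # toy next-word distribution

sharp = apply_temperature(probs, temperature=0.5)  # favors the top word more strongly
flat = apply_temperature(probs, temperature=2.0)   # spreads mass toward rarer words

print("T=0.5:", np.round(sharp, 3))
print("T=2.0:", np.round(flat, 3))
```

Lower temperatures tend to give more repetitive but safer text, while higher temperatures increase variety at the cost of coherence.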
