# Transformer-Based Text Generation: Building an LLM from Scratch with Keras and TensorFlow

## 1. Introduction
Welcome! This guide walks through implementing a Transformer architecture for text generation and language modeling. Unlike traditional RNNs, which process text one token at a time, Transformers use self-attention to capture long-range dependencies and complex linguistic patterns in text data.

The script (`main.py`) performs several key operations:
1. Loads text corpus data (Shakespeare dataset from web or local corpus file).
2. Implements a complete Transformer architecture from scratch using Keras with TensorFlow.
3. Trains the model for next-token prediction using multi-head self-attention and positional encoding.
4. Visualizes the training process and model architecture using TensorBoard and Visualkeras.
5. Generates coherent text sequences using the trained language model.

This implementation provides a solid foundation for real-world natural language processing applications, from creative writing assistance to chatbots and content generation systems.

## 📺 Watch the Tutorial

Prefer a video walkthrough? Check out the accompanying tutorial on YouTube:

[Transformer Text Generation Tutorial](https://youtu.be/oOpe8lvGIrI)

## 🚀 Quick Start Guide

Ready to run the code? Follow these simple steps to set up your environment and execute the Transformer text generation model:

### Step 1: Create a Python Virtual Environment
```bash
# Create a new virtual environment
python -m venv /path/to/your/project/

# Activate the virtual environment
# On macOS/Linux:
source /path/to/your/project/bin/activate
```

### Step 2: Install Required Dependencies
```bash
# Install all required packages
pip install -r requirements.txt
```

The `requirements.txt` file includes:
- `tensorflow>=2.8.0` - Deep learning framework
- `numpy` - Numerical computing
- `matplotlib` - Plotting and visualization
- `visualkeras` - Model architecture visualization

### Step 3: Prepare Your Text Corpus (Optional)
The script can work with two data sources:
- **Web dataset:** Automatically downloads the Shakespeare text file (`shakespeare.txt`) hosted for TensorFlow's tutorials
- **Local corpus:** Place your own text file as `corpus.txt` in the project directory

To use your own text data:
```bash
# Place your text file in the project directory
cp /path/to/your/text/data.txt corpus.txt
```

### Step 4: Run the Main Script
```bash
# Execute the Transformer text generation model
python main.py
```

### What Happens When You Run It?
1. **Data Loading:** Downloads Shakespeare dataset or loads local corpus.txt
2. **Text Preprocessing:** Tokenizes and vectorizes text using TextVectorization
3. **Sequence Generation:** Creates input-target pairs for next-token prediction
4. **Model Building:** Constructs the Transformer architecture with positional encoding
5. **Training:** Trains the model with early stopping and TensorBoard logging
6. **Visualization:** Generates model architecture diagrams and training plots
7. **Text Generation:** Creates novel text sequences using the trained model

### Monitoring Training Progress
The script automatically sets up TensorBoard logging. While the script is training, you can monitor progress in real time from a second terminal:
```bash
# Launch TensorBoard (the script will show you the exact command)
tensorboard --logdir logs/[timestamp]
```

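Then open your browser to `http://localhost:6006`. With the `TensorBoard` callback configured as shown later in this README (`histogram_freq=1`), you can view:
- Training loss curves
- The model graph
- Weight histograms

Generated text samples are written to `generated_text.txt` in the same log directory.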

**Expected Runtime:** Approximately 10-20 minutes on a modern CPU, faster with GPU acceleration.

### Architecture Overview
The complete workflow of our Transformer-based text generation system:

![Transformer Text Generation Architecture](text_gen_transformers.png)

This diagram illustrates the end-to-end pipeline from raw text data through the Transformer architecture to generated text sequences.

## 2. Core Concepts: Why Transformers for Text Generation?

### Traditional Approaches vs. Transformers
Traditionally, text generation relied on:
1. **Statistical methods** like n-gram models - work well for short sequences but struggle with long-term coherence.
2. **Recurrent Neural Networks** like LSTMs or GRUs - process text sequentially, building context word by word.

**The fundamental limitation:** RNNs process tokens one at a time, so computation cannot be parallelized across a sequence, and all context must pass through a fixed-size hidden state. As a result, they often struggle with very long-term dependencies, partly because of the vanishing gradient problem.

### The Transformer Revolution
Transformers address these limitations:
- **Parallel Processing:** Instead of processing text word by word, Transformers can attend to all positions simultaneously through self-attention.
- **Long-Range Dependencies:** They can capture relationships between words that are very far apart - crucial for maintaining coherence in long texts.
- **Multiple Perspectives:** Multi-head attention allows the model to focus on different types of linguistic patterns simultaneously (syntax, semantics, style).

Think of it like having multiple editors simultaneously reviewing a text, where each editor can instantly reference any part of the document to understand context and meaning.

### Self-Attention Mechanism: The Core Innovation
Self-attention works by creating three different representations of input text:
- **Queries (Q):** "What information am I looking for?"
- **Keys (K):** "What information is available at each position?"
- **Values (V):** "What information should I extract and combine?"

The attention mechanism computes similarity scores between queries and keys, then uses these scores to create weighted combinations of values. This allows each word to directly attend to any other word in the sequence, regardless of distance.
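
The computation described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the code used in `main.py` (which relies on Keras `MultiHeadAttention`); the projection matrices `w_q`, `w_k`, `w_v` and the toy sizes are assumptions made for the example.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for single-head attention.

    q, k, v: arrays of shape (seq_len, d_k).
    """
    d_k = q.shape[-1]
    # Similarity between every query and every key, scaled by sqrt(d_k)
    scores = q @ k.T / np.sqrt(d_k)
    # Row-wise softmax (subtract the max for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted combination of the value vectors
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # 5 tokens, 8-dimensional embeddings (toy sizes)
# Hypothetical projection matrices; in a trained model these are learned
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
output, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
print(output.shape)          # one output vector per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Each row of `weights` shows how strongly one token attends to every other token, regardless of how far apart they are in the sequence.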

### Positional Encoding: Understanding Word Order
Since Transformers process all positions simultaneously, they need a way to understand word order:
1. **Sinusoidal Encoding:** Uses sine and cosine functions with different frequencies
2. **Position-Dependent Patterns:** Each position gets a unique encoding pattern
3. **Relative Position:** Model learns relationships between positions
4. **Embedding Addition:** Positional encodings are added to word embeddings

This allows the model to understand that "The cat sat on the mat" is different from "The mat sat on the cat."

### Next-Token Prediction: The Training Objective
Language models are trained using a simple but powerful objective:
- **Input:** "The quick brown fox jumps over the"
- **Target:** "lazy" (next word)
- **Learning:** Model learns to predict the probability distribution over all possible next words
- **Generation:** At inference, sample from this distribution to generate coherent text
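
To make the objective concrete, here is a small sketch of the loss for a single position. The five-word vocabulary and the logit values are made up for illustration; a real model produces one logit per vocabulary entry.

```python
import numpy as np

vocab = ["the", "lazy", "dog", "fox", "cat"]   # toy vocabulary (hypothetical)
logits = np.array([0.5, 2.0, 1.0, -1.0, 0.0])  # made-up model scores for the next token
target_id = vocab.index("lazy")

# Softmax turns logits into a probability distribution over the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Cross-entropy for one position is the negative log-probability of the target
loss = -np.log(probs[target_id])
print(dict(zip(vocab, probs.round(3))))
print(round(float(loss), 3))
```

Training lowers this loss by pushing probability mass toward the token that actually follows in the corpus.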

## 3. Code Deep Dive: Complete Transformer Implementation

### File Structure Overview
The entire implementation is contained in `main.py`, featuring:
- Custom TransformerBlock layers built from scratch
- Multi-head self-attention with positional encoding
- Complete training pipeline with text generation
- Data preprocessing and evaluation utilities

### Environment Setup and GPU Configuration
```python
import os
from datetime import datetime
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import (
    Layer, Dense, LayerNormalization, Dropout, Embedding, 
    MultiHeadAttention, TextVectorization
)
from tensorflow.keras.models import Model
from tensorflow.keras.utils import get_file
import matplotlib.pyplot as plt
import visualkeras

# Configure TensorFlow to use GPU if available
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        print(f"{len(gpus)} Physical GPUs configured.")
    except RuntimeError as e:
        print(f"Error setting up GPU memory growth: {e}")
```
**Key Points:**
- GPU memory growth prevents TensorFlow from allocating all GPU memory at startup
- Useful when the GPU is shared with other processes; it is not required for training
- TextVectorization handles tokenization and vocabulary management automatically

### TransformerBlock: The Building Block of Language Models
```python
class TransformerBlock(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential([
            Dense(ff_dim, activation="relu"),
            Dense(embed_dim),
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=False):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)
```
**Architecture Components:**
- **Multi-Head Attention:** Captures relationships between all word positions
- **Feed-Forward Network:** Position-wise transformations for each word
- **Residual Connections:** Help gradients flow through deep networks, mitigating vanishing gradients
- **Layer Normalization:** Stabilizes training and speeds convergence
- **Dropout:** Regularization to prevent overfitting
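
The "add and normalize" step used twice in `call` can be illustrated with NumPy. This is a simplified sketch: it omits the learned scale and offset that Keras `LayerNormalization` applies, and the random arrays stand in for real activations.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    # Normalize each token vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def add_and_norm(x, sublayer_output):
    # Residual connection followed by normalization (post-norm ordering)
    return layer_norm(x + sublayer_output)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))                # 4 tokens, 16 features (toy sizes)
sublayer_output = rng.normal(size=(4, 16))  # stand-in for attention or FFN output
y = add_and_norm(x, sublayer_output)
print(y.mean(axis=-1).round(6))  # approximately 0 for every token
print(y.std(axis=-1).round(3))   # approximately 1 for every token
```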

### Complete TransformerModel: From Embeddings to Text Generation
```python
class TransformerModel(Model):
    def __init__(self, vocab_size, embed_dim, num_heads, ff_dim, num_layers, seq_length):
        super(TransformerModel, self).__init__()
        self.embedding = Embedding(vocab_size, embed_dim)
        self.pos_encoding = self.positional_encoding(seq_length, embed_dim)
        self.transformer_blocks = [TransformerBlock(embed_dim, num_heads, ff_dim) 
                                 for _ in range(num_layers)]
        self.dense = Dense(vocab_size)

    def call(self, inputs, training=False):
        seq_len = tf.shape(inputs)[1]
        x = self.embedding(inputs)
        x += self.pos_encoding[:, :seq_len, :]
        for transformer_block in self.transformer_blocks:
            x = transformer_block(x, training=training)
        output = self.dense(x)
        return output
```
**Model Pipeline:**
1. **Token Embedding:** Converts word indices to dense vectors
2. **Positional Encoding:** Adds position information to embeddings
3. **Transformer Layers:** Stack of self-attention and feed-forward blocks
4. **Output Projection:** Maps to vocabulary size for next-token prediction

### Positional Encoding Implementation
```python
def positional_encoding(self, seq_length, embed_dim):
    angle_rads = self.get_angles(np.arange(seq_length)[:, np.newaxis], 
                               np.arange(embed_dim)[np.newaxis, :], embed_dim)
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])  # Even indices
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])  # Odd indices
    pos_encoding = angle_rads[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)

def get_angles(self, pos, i, embed_dim):
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(embed_dim))
    return pos * angle_rates
```
**Mathematical Foundation:**
- Uses sinusoidal functions with different frequencies for each dimension
- Even dimensions use sine, odd dimensions use cosine
- Produces a unique, fixed (not learned) pattern for each position
- Allows model to understand relative and absolute positions
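
To inspect the encoding outside the model, the same formula can be written as a standalone NumPy function. The name `sinusoidal_encoding` is chosen here for illustration; the arithmetic mirrors `get_angles` and `positional_encoding` above.

```python
import numpy as np

def sinusoidal_encoding(seq_length, embed_dim):
    pos = np.arange(seq_length)[:, np.newaxis]   # shape (seq_length, 1)
    i = np.arange(embed_dim)[np.newaxis, :]      # shape (1, embed_dim)
    # Each pair of dimensions (2k, 2k+1) shares one frequency
    angles = pos / np.power(10000, (2 * (i // 2)) / np.float32(embed_dim))
    angles[:, 0::2] = np.sin(angles[:, 0::2])    # even dimensions
    angles[:, 1::2] = np.cos(angles[:, 1::2])    # odd dimensions
    return angles

pe = sinusoidal_encoding(seq_length=100, embed_dim=256)
print(pe.shape)
print(pe[0, :4])  # position 0: sin(0) = 0 on even dims, cos(0) = 1 on odd dims
```

Because the values depend only on the position and dimension indices, calling the function twice gives identical results and nothing in it is updated during training.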

### Data Preprocessing: From Text to Training Sequences
```python
def create_sequences(text, seq_length):
    input_seqs = []
    target_seqs = []
    for i in range(len(text) - seq_length):
        input_seq = text[i:i + seq_length]
        target_seq = text[i + 1:i + seq_length + 1]
        input_seqs.append(input_seq)
        target_seqs.append(target_seq)
    return np.array(input_seqs), np.array(target_seqs)
```
**Sequence Generation Strategy:**
- Creates sliding windows of text for training
- Example: "Hello world how are" → input: "Hello world how", target: "world how are"
- Each position learns to predict the next token
- Maximizes training data from single text corpus
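
For larger corpora, the Python loop can become slow. One possible alternative, assuming NumPy 1.20 or later, is `sliding_window_view`, which builds the same windows without copying. The function name `create_sequences_vectorized` is hypothetical and not part of `main.py`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def create_sequences_vectorized(tokens, seq_length):
    # Each window holds seq_length inputs plus one extra token for the shifted target
    windows = sliding_window_view(np.asarray(tokens), seq_length + 1)
    return windows[:, :-1], windows[:, 1:]

tokens = np.arange(10, 18)  # stand-in for vectorized token ids: 10, 11, ..., 17
inputs, targets = create_sequences_vectorized(tokens, seq_length=3)
print(inputs[0], targets[0])  # [10 11 12] [11 12 13]
print(inputs.shape)           # (5, 3): len(tokens) - seq_length windows
```

The returned arrays are read-only views into `tokens`; call `.copy()` on them if they need to be modified.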

### Text Vectorization and Vocabulary Management
```python
# Setup text vectorization
vocab_size = 10000
vectorizer = TextVectorization(max_tokens=vocab_size, output_mode='int')
text_ds = tf.data.Dataset.from_tensor_slices([text]).batch(1)
vectorizer.adapt(text_ds)

# Convert text to token sequences
vectorized_text = vectorizer([text])[0]
```
**Preprocessing Pipeline:**
- **TextVectorization:** Handles tokenization, lowercasing, and punctuation stripping
- **Vocabulary Building:** Creates word-to-index mapping from corpus
- **Sequence Conversion:** Transforms text into integer sequences
- **Automatic Handling:** Manages unknown words and special tokens
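
The following pure-Python sketch approximates what the layer does with its default settings: lowercase, strip punctuation, split on whitespace, and reserve index 0 for padding and index 1 for unknown words. It is an illustration only; `build_vocab` and `encode` are hypothetical helpers, and the real layer's punctuation rules and tie-breaking order may differ.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, drop punctuation, split on whitespace
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def build_vocab(text, max_tokens):
    # Index 0 is reserved for padding and index 1 for unknown words
    counts = Counter(tokenize(text))
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocab):
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, 1) for word in tokenize(text)]

vocab = build_vocab("To be, or not to be: that is the question.", max_tokens=6)
ids = encode("To be or not to sleep", vocab)
print(vocab)  # ['', '[UNK]', 'to', 'be', 'or', 'not']
print(ids)    # [2, 3, 4, 5, 2, 1]  ('sleep' is out of vocabulary)
```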

### Text Generation: From Model to Creative Writing
```python
def generate_text(model, vectorizer, start_string, seq_length,
                 num_generate=100, temperature=1.0):
    # Convert start string to token ids and look up the vocabulary once
    input_eval = vectorizer([start_string]).numpy()
    vocab = vectorizer.get_vocabulary()

    # Left-pad short prompts with 0 (the padding id) using the same integer dtype
    if input_eval.shape[1] < seq_length:
        padding = np.zeros((1, seq_length - input_eval.shape[1]), dtype=input_eval.dtype)
        input_eval = np.concatenate((padding, input_eval), axis=1)
    # Truncate long prompts so they fit the positional encoding
    input_eval = input_eval[:, -seq_length:]

    text_generated = []
    for _ in range(num_generate):
        # Logits for the last position, scaled by temperature
        predictions = model(input_eval)
        predictions = predictions[0, -1, :] / temperature

        # Sample next token (tf.random.categorical expects logits)
        predicted_id = tf.random.categorical(tf.expand_dims(predictions, 0),
                                           num_samples=1)[0, 0].numpy()

        # Update input sequence, keeping a fixed-length context window
        input_eval = np.append(input_eval, [[predicted_id]], axis=1)
        input_eval = input_eval[:, -seq_length:]

        # Convert token back to word
        if predicted_id < len(vocab):
            text_generated.append(vocab[predicted_id])

    return start_string + ' ' + ' '.join(text_generated)
```
**Generation Process:**
- **Seed Text:** Start with user-provided prompt
- **Prediction:** Model outputs probability distribution over vocabulary
- **Temperature Sampling:** Controls randomness (low = conservative, high = creative)
- **Autoregressive:** Each generated token becomes input for next prediction
- **Sliding Window:** Maintains fixed context length during generation
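
The effect of temperature is easiest to see on a small example. The sketch below divides made-up logits by the temperature before applying softmax, which is the same scaling `generate_text` applies before sampling; the helper name is hypothetical.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5, 0.0]  # made-up scores for four candidate tokens
for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t).round(3))
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution and make unlikely tokens more probable.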

### Training Pipeline with Corpus Loading
```python
def load_corpus(corpus_source):
    if corpus_source == "web":
        print("Loading Shakespeare dataset from web...")
        path_to_file = get_file('shakespeare.txt',
                               'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
        # Read the downloaded file and close it afterwards
        with open(path_to_file, 'rb') as f:
            text = f.read().decode(encoding='utf-8')
    elif corpus_source == "local":
        print("Loading corpus from local file 'corpus.txt'...")
        with open('corpus.txt', 'r', encoding='utf-8') as f:
            text = f.read()
    else:
        # Fail early instead of returning an undefined variable
        raise ValueError(f"Unknown corpus_source: {corpus_source!r}")
    return text

def main():
    corpus_source = "local"  # "web" for Shakespeare, "local" for corpus.txt
    text = load_corpus(corpus_source)
    
    # Model hyperparameters
    embed_dim = 256
    num_heads = 4
    ff_dim = 512
    num_layers = 4
    seq_length = 100
```
**Flexible Data Sources:**
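- **Web Dataset:** The Shakespeare text file hosted for TensorFlow's tutorials (a common small benchmark)
- **Local Corpus:** Your own text data for domain-specific generation
- **Preprocessing:** Files are read as UTF-8; the function shown does not detect other encodings or handle a missing `corpus.txt`
- **Scalability:** `create_sequences` builds every training window in memory, so very large corpora may need a streaming `tf.data` pipeline instead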

### Training and Monitoring Setup
```python
    # Model compilation (the final Dense layer outputs raw logits, so from_logits=True)
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    
    # Setup logging and callbacks
    logdir = os.path.join("logs", datetime.now().strftime("%Y%m%d-%H%M%S"))
    tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir=logdir, histogram_freq=1)
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=2, restore_best_weights=True)
    
    # Architecture visualization
    visualkeras.layered_view(model, to_file='transformer_text_model_architecture.png')
    
    # Training
    history = model.fit(X, Y, epochs=20, batch_size=32, 
                        callbacks=[early_stopping, tensorboard_cb])
```
**Training Features:**
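- **Sparse Categorical Crossentropy:** Works directly with integer targets, avoiding one-hot vectors over the whole vocabulary
- **Early Stopping:** Stops training once the training loss has not improved for 2 consecutive epochs
- **TensorBoard Integration:** Real-time training monitoring
- **Model Visualization:** Architecture diagram saved as a PNG via Visualkeras
- **Best Weights:** `restore_best_weights=True` restores the best epoch's weights in memory; writing them to disk requires a separate save step or a `ModelCheckpoint` callback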
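
The patience mechanism can be sketched in plain Python. This is a simplified model of the behavior, not the Keras implementation, and the loss values are made up; `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(losses, patience=2):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of per-epoch losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(losses) - 1, best_epoch

losses = [6.2, 5.1, 4.8, 4.9, 4.85, 4.7]  # made-up loss values
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=2)
print(stop_epoch, best_epoch)  # 4 2
```

Training stops after two epochs without improvement, so the later value 4.7 is never reached; a larger `patience` trades extra training time for a chance to recover from short plateaus.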

### Text Generation and Evaluation
```python
    # Generate text samples
    start_string = "the object of our"
    generated_text = generate_text(model, vectorizer, start_string, seq_length, 
                                 num_generate=100, temperature=0.7)
    
    # Longer generation with different temperature
    longer_text = generate_text(model, vectorizer, start_string, seq_length, 
                              num_generate=200, temperature=0.8)
    
    # Save results
    with open(os.path.join(logdir, 'generated_text.txt'), 'w') as f:
        f.write(f"Generated text (100 tokens):\n{generated_text}\n\n")
        f.write(f"Generated text (200 tokens):\n{longer_text}\n")
```
**Generation Strategies:**
- **Temperature Control:** Balance between coherence and creativity
- **Multiple Samples:** Generate various lengths and styles
- **Prompt Engineering:** Different starting phrases for diverse outputs
- **Quality Assessment:** Manual evaluation of coherence and style

## 4. Setup and Running the Application

### Prerequisites
- Python 3.8+ (recommended)
- CUDA-compatible GPU (optional but recommended for performance)
- `pip` package manager
- Text corpus file (optional - defaults to Shakespeare dataset)

### Installation Steps
1. **Clone the repository:**
   ```bash
   git clone <your-repo-url>
   cd transformers/text_generation
   ```

2. **Create virtual environment:**
   ```bash
   python3 -m venv transformer_env
   ```

3. **Activate virtual environment:**
   - Linux/macOS:
     ```bash
     source transformer_env/bin/activate
     ```
   - Windows:
     ```bash
     transformer_env\Scripts\activate
     ```

4. **Install required packages:**
   ```bash
   pip install tensorflow numpy matplotlib visualkeras
   ```

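5. **For GPU support (optional):** The separate `tensorflow-gpu` package is deprecated; the standard `tensorflow` package already includes GPU support. On Linux, recent TensorFlow releases can also install the CUDA libraries through pip:
   ```bash
   pip install "tensorflow[and-cuda]"
   ```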

### Required Dependencies (requirements.txt)
```txt
tensorflow>=2.8.0
numpy>=1.21.0
matplotlib>=3.5.0
visualkeras>=0.0.2
```

### Running the Project
Execute the main script:
```bash
python main.py
```

This will:
- Load text corpus (Shakespeare or local corpus.txt)
- Build the Transformer model from scratch
- Train for up to 20 epochs with early stopping
- Create visualizations and save to timestamped log directory
- Generate text samples with different parameters
- Save model and generated text outputs

**Expected Output:**
```
Loading corpus from local file 'corpus.txt'...
Local corpus loaded. Text length: 125000 characters
Vectorized text shape: (23456,)
Number of sequences generated: 23356
Model: "transformer_model"
...
Total params: 8,467,200
Trainable params: 8,467,200
TensorBoard logs in: /path/to/logs/20231201-143022
Starting training...
Epoch 1/20
730/730 [==============================] - 45s 62ms/step - loss: 6.2341
...
Generated text:
the object of our study is to understand the nature of language and meaning...
```

### Monitoring with TensorBoard
Start TensorBoard to monitor training:
```bash
tensorboard --logdir logs
```

Navigate to `http://localhost:6006` to view:
- **Scalars:** Training loss progression over epochs
- **Graphs:** Complete model computational graph
- **Histograms:** Weight and bias distributions
- **Images** and **Text:** Populated only if the script writes image or text summaries; the callback settings shown in this README (`log_dir`, `histogram_freq=1`) do not

### Customization Options
Modify hyperparameters in the main() function:
```python
# Data source
corpus_source = "local"  # "web" for Shakespeare, "local" for corpus.txt

# Model architecture
embed_dim = 256          # Embedding dimension
num_heads = 4            # Number of attention heads
ff_dim = 512            # Feed-forward dimension
num_layers = 4          # Number of Transformer blocks
seq_length = 100        # Context window size
vocab_size = 10000      # Vocabulary size

# Generation parameters
temperature = 0.7       # Creativity vs. coherence (0.1-2.0)
num_generate = 100      # Number of tokens to generate
```

**Hyperparameter Guidelines:**
- Increase `embed_dim` for richer word representations
- More `num_heads` to capture more diverse linguistic patterns
- Deeper `num_layers` for complex language understanding
- Larger `seq_length` for longer context (but more memory)
- Higher `temperature` for more creative but less coherent text
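
As a rough guide to the memory cost of `seq_length`, the attention score matrices alone grow quadratically with context size. The estimate below is a back-of-the-envelope sketch that assumes float32 scores and ignores all other activations; the helper name is hypothetical.

```python
def attention_score_bytes(batch_size, num_heads, seq_length, bytes_per_value=4):
    # One (seq_length x seq_length) score matrix per head and per example
    return batch_size * num_heads * seq_length * seq_length * bytes_per_value

for seq_length in (100, 200, 400):
    mib = attention_score_bytes(32, 4, seq_length) / 2**20
    print(seq_length, f"{mib:.1f} MiB per layer")
```

Doubling `seq_length` quadruples this figure, which is why long contexts usually require a smaller batch size.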

## 5. Advanced Topics and Extensions

### Causal (Masked) Attention for Language Modeling
For proper autoregressive training, add causal masking so that each position attends only to itself and earlier positions:
```python
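def create_causal_mask(seq_len):
    # Lower-triangular matrix: position i may attend to positions 0..i only
    mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
    # Add a batch axis so the mask has shape (1, seq_len, seq_len)
    return tf.cast(mask[tf.newaxis, ...], tf.bool)

# In TransformerBlock.call:
causal_mask = create_causal_mask(tf.shape(inputs)[1])
attn_output = self.att(inputs, inputs, attention_mask=causal_mask)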
```
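This prevents the model from "cheating" by looking at future tokens during training. In TensorFlow 2.10 and later, `MultiHeadAttention` also accepts `use_causal_mask=True` when called, which applies the same kind of mask internally.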
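
The effect of the mask on attention weights can be checked with a small NumPy sketch. Uniform made-up scores are used so the result is easy to read; `causal_attention_weights` is a hypothetical helper, not part of `main.py`.

```python
import numpy as np

def causal_attention_weights(scores):
    seq_len = scores.shape[-1]
    # True on and below the diagonal: position i may attend to positions 0..i
    allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Future positions get -inf so they receive zero weight after softmax
    masked = np.where(allowed, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform made-up scores for a 4-token sequence
weights = causal_attention_weights(scores)
print(weights.round(2))
```

Every entry above the diagonal is zero, so the first token attends only to itself, the second splits its attention over the first two tokens, and so on.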

### Advanced Sampling Strategies
Improve text quality with sophisticated sampling:
```python
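def top_k_sampling(logits, k=40):
    # Keep the k highest logits and sample among them
    top_k_logits, top_k_indices = tf.nn.top_k(logits, k=k)
    # tf.random.categorical expects logits, not probabilities
    choice = tf.random.categorical(tf.expand_dims(top_k_logits, 0), 1)[0, 0]
    return top_k_indices[choice]  # map back to a vocabulary id

def nucleus_sampling(logits, p=0.9):
    sorted_indices = tf.argsort(logits, direction='DESCENDING')
    sorted_logits = tf.gather(logits, sorted_indices)
    # Probability mass before each token; the top token is always kept
    cumulative_probs = tf.cumsum(tf.nn.softmax(sorted_logits), exclusive=True)
    # Exclude tokens outside the nucleus by giving them a very low logit
    masked_logits = tf.where(cumulative_probs < p, sorted_logits, -1e9)
    choice = tf.random.categorical(tf.expand_dims(masked_logits, 0), 1)[0, 0]
    return sorted_indices[choice]  # map back to a vocabulary id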
```
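
To see which tokens each strategy keeps, the filtering step can be separated from the random draw. The NumPy sketch below works on a made-up probability distribution; `top_k_filter` and `nucleus_filter` are hypothetical helpers for illustration.

```python
import numpy as np

def top_k_filter(probs, k):
    # Zero out everything except the k most probable tokens, then renormalize
    keep = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def nucleus_filter(probs, p):
    # Keep the smallest set of top tokens whose cumulative probability reaches p
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])  # made-up next-token distribution
top_k_probs = top_k_filter(probs, k=2)
nucleus_probs = nucleus_filter(probs, p=0.8)
print(top_k_probs)    # two tokens survive
print(nucleus_probs)  # three tokens survive (0.5 + 0.25 + 0.125 >= 0.8)
```

Top-k always keeps a fixed number of candidates, while nucleus sampling adapts the candidate count to how peaked the distribution is.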

### Fine-tuning for Specific Domains
Adapt pre-trained models to specific writing styles:
```python
# Load pre-trained model
base_model = tf.keras.models.load_model('transformer_text_generation_model.h5')

# Freeze early layers, fine-tune later layers
for layer in base_model.layers[:-2]:
    layer.trainable = False

# Train on domain-specific data with lower learning rate
base_model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), 
                  loss='sparse_categorical_crossentropy')
```

### Real-World Text Data Integration
Process various text formats for training:
```python
import json
import pandas as pd

# Load from JSON dataset
with open('dialogue_dataset.json', 'r') as f:
    data = json.load(f)
    text = ' '.join([item['text'] for item in data])

# Load from CSV
df = pd.read_csv('text_dataset.csv')
text = ' '.join(df['content'].astype(str))

# Clean and preprocess
text = text.replace('\n', ' ').replace('\t', ' ')
text = ' '.join(text.split())  # Remove extra whitespace
```

## 6. Performance Optimization Tips

### Memory Management for Large Models
- **Gradient Checkpointing:** Trade computation for memory in deep models
- **Mixed Precision Training:** Compute in float16 while keeping variables in float32, with loss scaling to avoid gradient underflow
- **Batch Size Optimization:** Find optimal batch size for your GPU memory
- **Sequence Length:** Balance context length with memory constraints

### Training Acceleration
- **Learning Rate Scheduling:** Warm-up and cosine decay strategies
- **Multi-GPU Training:** Distribute training across multiple devices
- **Data Pipeline Optimization:** Use tf.data for efficient data loading
- **Compiled Training:** Use @tf.function decorators for speed
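
A warm-up plus cosine decay schedule can be expressed as a plain function of the training step. This is a minimal sketch with illustrative values for `total_steps`, `warmup_steps`, and `peak_lr`; it is not taken from `main.py`, which uses the default Adam learning rate.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr):
    if step < warmup_steps:
        # Linear warm-up from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

total_steps, warmup_steps, peak_lr = 1000, 100, 1e-3  # illustrative values
for step in (0, 50, 100, 550, 1000):
    print(step, f"{warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr):.6f}")
```

The learning rate rises linearly during warm-up, peaks at `warmup_steps`, and then falls smoothly to zero by the final step.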

## 7. Troubleshooting Common Issues

### Training Issues
- **Loss not decreasing:** Check learning rate, data quality, model initialization
- **Exploding gradients:** Use gradient clipping, reduce learning rate
- **Overfitting:** Increase dropout, reduce model size, add regularization
- **GPU out of memory:** Reduce batch size, sequence length, or model size
- **Poor text quality:** Increase model capacity, train longer, improve data quality
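
Gradient clipping, mentioned above for exploding gradients, rescales all gradients together when their combined norm exceeds a threshold. The NumPy sketch below shows the arithmetic on made-up gradients; in Keras, optimizers accept clipping arguments such as `clipnorm` and `global_clipnorm`, so a hand-written version like this is only for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Global norm is the L2 norm of all gradient values taken together
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Scale down only when the norm exceeds the threshold
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]  # made-up gradients, global norm 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)
print(clipped)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its size is limited.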

### Generation Issues
- **Repetitive text:** Adjust temperature, use nucleus/top-k sampling
- **Incoherent output:** Lower temperature, increase context length
- **Slow generation:** Optimize model architecture, use caching
- **Memory errors:** Reduce sequence length during generation

## 8. Conclusion
This guide has walked through implementing a Transformer architecture for text generation and language modeling. We've covered:

- **Theoretical Foundation:** Self-attention mechanism and positional encoding
- **Complete Implementation:** Transformer blocks built from scratch in Keras
- **Training Pipeline:** Data preprocessing, model compilation, and monitoring
- **Text Generation:** Autoregressive sampling and creativity control
- **Practical Applications:** Domain-specific language model development

**Key Advantages of Transformers for Text Generation:**
- Parallel processing enables efficient training on long sequences
- Self-attention captures long-range linguistic dependencies
- Multiple attention heads learn diverse language patterns
- Scalable architecture suitable for large language models
- Flexible generation with temperature and sampling control

**Next Steps:**
- Experiment with different model architectures and hyperparameters
- Try domain-specific datasets (poetry, code, dialogue)
- Implement advanced sampling techniques (nucleus, top-k)
- Add causal masking for proper autoregressive training
- Scale up to larger models and datasets
- Deploy for real-time text generation applications

This implementation provides a solid foundation for building sophisticated language models and text generation systems. The Transformer's flexibility makes it adaptable to various NLP tasks including chatbots, creative writing assistants, code generation, and content creation tools.