<img src="https://theaiengineer.dev/tae_logo_gw_flatter.png" width="35%" align="right">


# yoctoGPT — Minimal GPT from Scratch

## _Getting Started — Character-level Training_

**&copy; Dr. Yves J. Hilpisch**<br>AI-Powered by OpenAI & Gemini.

## How to Use This Notebook

- **Goal**: Train a tiny GPT model on a character-level corpus in minutes.
- **Hardware**: Designed for Google Colab T4 (free tier) or better.
- **Approach**: We use `yoctoGPT`, a minimal PyTorch implementation focused on clarity.

### Roadmap

1. **Setup**: Install dependencies and clone the `yoctoGPT` repository.
2. **Data**: Prepare a character-level dataset from a philosophical text.
3. **Training**: Execute a short training run with a small model configuration.
4. **Sampling**: Generate text from the trained checkpoint to see the model in action.

### Setup: Environment and Dependencies

We first check GPU availability, install the required packages, and clone the repository.

In [None]:
#@title Check GPU and Install Dependencies
!nvidia-smi
!pip -q install tokenizers tqdm textstat

import os
import pathlib
import subprocess

# Clone the repository if it doesn't exist
repo_root = pathlib.Path("/content/yoctoGPT")
if not repo_root.exists():
    print("Cloning yoctoGPT repository...")
    subprocess.run([
        "git", "clone",
        "https://github.com/yhilpisch/yoctoGPT.git",
        str(repo_root)
    ], check=True)  # raise an error if the clone fails instead of continuing
os.chdir(repo_root)
print(f"Current working directory: {os.getcwd()}")

### Data Preparation

We use a character-level encoding where each character (letter, digit, punctuation) is assigned a unique integer ID. This is the simplest way to start building a language model.
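
To make the idea concrete, here is a minimal sketch of character-level encoding. The toy string and the names `stoi`, `itos`, `encode`, and `decode` are assumptions for illustration; the `scripts.prepare_char_data` script may organize its vocabulary differently.

```python
# Toy corpus standing in for the real text file (assumption for illustration).
text = "to be or not to be"

# The sorted set of unique characters forms the vocabulary.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer ID
itos = {i: ch for ch, i in stoi.items()}      # integer ID -> character


def encode(s: str) -> list[int]:
    """Map each character of s to its integer ID."""
    return [stoi[ch] for ch in s]


def decode(ids: list[int]) -> str:
    """Map a list of integer IDs back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode("to be")
print(f"vocabulary size: {len(chars)}")
print(f"encoded: {ids}")
print(f"decoded: {decode(ids)!r}")
```

Encoding followed by decoding returns the original string, which is a quick sanity check for any vocabulary built this way.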

In [None]:
#@title Prepare Character-level Data
# We use 'philosophy.txt' as a small, clean corpus (~240KB)
!python -m scripts.prepare_char_data \
    --text_path data/philosophy.txt \
    --out_dir data/char_start

### Model Training

We initialize a small GPT model. Smaller models train faster and are perfect for learning the core mechanics without waiting for hours.

**Configuration:**
- `n_layer`: 2 (number of transformer blocks)
- `n_head`: 4 (number of attention heads)
- `n_embd`: 128 (embedding dimension)
- `block_size`: 128 (context window length)
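
To get a feeling for how small this configuration is, the following sketch estimates the parameter count. It assumes a GPT-2-style block (four attention projections plus an MLP with 4x hidden width), a tied output head, and ignores biases and layer norms; `approx_gpt_params` is a hypothetical helper, and `vocab_size=80` is a placeholder since the real value depends on the corpus.

```python
def approx_gpt_params(n_layer: int, n_embd: int, block_size: int,
                      vocab_size: int) -> int:
    """Rough parameter count for a GPT-2-style model.

    Biases and layer-norm parameters are ignored; the output head is
    assumed to share weights with the token embedding.
    """
    attn = 4 * n_embd * n_embd  # query, key, value, and output projections
    mlp = 8 * n_embd * n_embd   # two linear layers with a 4x hidden width
    embeddings = (vocab_size + block_size) * n_embd  # token + position tables
    return n_layer * (attn + mlp) + embeddings


# vocab_size=80 is a placeholder, not the actual vocabulary size.
n_params = approx_gpt_params(n_layer=2, n_embd=128, block_size=128,
                             vocab_size=80)
head_dim = 128 // 4  # n_embd must be divisible by n_head
print(f"~{n_params:,} parameters, {head_dim} dimensions per head")
```

Under these assumptions the model has well under one million parameters, which is why it trains in minutes on a T4.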

In [None]:
#@title Train Tiny yoctoGPT
# This automatically saves checkpoints (best.pt, latest.pt) to --ckpt_dir
!python -m yoctoGPT.train \
    --mode char \
    --data_dir data/char_start \
    --ckpt_dir checkpoints/char_start \
    --n_layer 2 \
    --n_head 4 \
    --n_embd 128 \
    --block_size 128 \
    --batch_size 32 \
    --max_iters 500 \
    --lr 1e-3 \
    --eval_interval 100

# Verify checkpoint creation
!ls -lh checkpoints/char_start/latest.pt

### Text Generation (Sampling)

Once trained, we can prompt the model to generate new text based on what it learned from the corpus.
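
Generation is autoregressive: the model predicts a distribution over the next character, we sample one ID, append it, and repeat. The sketch below illustrates this loop in plain Python. The `toy_model` function is a hypothetical stand-in for the trained network, and none of these names are taken from `yoctoGPT.sampler`.

```python
import math
import random


def sample_next(logits: list[float], rng: random.Random) -> int:
    """Sample one index from softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def toy_model(context: list[int], vocab_size: int) -> list[float]:
    """Hypothetical stand-in for the network: favors (last ID + 1)."""
    logits = [0.0] * vocab_size
    logits[(context[-1] + 1) % vocab_size] = 5.0
    return logits


def generate(prompt_ids: list[int], max_new_tokens: int, block_size: int,
             vocab_size: int, seed: int = 0) -> list[int]:
    """Extend prompt_ids by max_new_tokens sampled IDs."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        context = ids[-block_size:]  # only the last block_size IDs are visible
        ids.append(sample_next(toy_model(context, vocab_size), rng))
    return ids


out = generate([0, 1, 2], max_new_tokens=10, block_size=8, vocab_size=5)
print(out)
```

The cropping step shows why `block_size` matters at sampling time: anything further back than the context window cannot influence the next character.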

In [None]:
#@title Sample from the Model
prompt = "The meaning of life is"
output = !python -m yoctoGPT.sampler \
    --mode char \
    --ckpt checkpoints/char_start/latest.pt \
    --vocab_path data/char_start/vocab.json \
    --prompt "{prompt}" \
    --max_new_tokens 200

generated_text = "\n".join(output)
print(generated_text)

### Readability Assessment

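We assess the complexity of the generated text with standard readability metrics. These metrics rely on surface statistics such as word, sentence, and syllable counts, so they give only a rough indication of whether the model produces word-like and sentence-like structures rather than random character sequences. They do not measure coherence or meaning.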

**Metric Meanings:**
- **Flesch Reading Ease**: Higher scores indicate text that is easier to read (around 100 is very easy, around 0 is very difficult; the score is not strictly bounded to this range).
- **Flesch-Kincaid Grade**: Estimates the U.S. school grade level required to understand the text.
- **Dale-Chall Score**: Uses a list of 3,000 familiar words to assess complexity (lower is easier).
- **Text Standard**: A consensus metric that summarizes the estimated reading level (e.g., '8th Grade').
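
The two Flesch metrics are simple formulas over word, sentence, and syllable counts. The sketch below applies the standard English-language formulas to hypothetical counts; `textstat` derives the counts from the text itself, so its results for real text depend on how it splits sentences and counts syllables.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease from raw counts (higher means easier)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts (U.S. school grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
fre = flesch_reading_ease(words=100, sentences=5, syllables=150)
fkg = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(f"Flesch Reading Ease : {fre:.2f}")
print(f"Flesch-Kincaid Grade: {fkg:.2f}")
```

Both formulas reward short sentences and short words, which is why text from a briefly trained character-level model can receive unusual scores.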

In [None]:
#@title Analyze Generated Text Readability
import textstat

def readability_scores(text: str) -> dict:
    return {
        "Flesch Reading Ease": textstat.flesch_reading_ease(text),
        "Flesch-Kincaid Grade": textstat.flesch_kincaid_grade(text),
        "Dale-Chall Score": textstat.dale_chall_readability_score(text),
        "Text Standard": textstat.text_standard(text, float_output=False),
    }

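# The sampler output is assumed to contain the full sequence, including the
# prompt. Drop everything up to and including the prompt before scoring;
# other status lines, if any, are not filtered.
_, found, continuation = generated_text.partition(prompt)
text_to_score = continuation.strip() if found else generated_text
scores = readability_scores(text_to_score)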

print("Readability Analysis:")
print("-" * 30)
for metric, value in scores.items():
    print(f"{metric:20}: {value}")

### Resume Training

You can continue training from a saved checkpoint to further improve the model's performance. This uses the `latest.pt` file saved in your checkpoint directory.

In [None]:
#@title Resume Training from latest.pt
latest_path = "checkpoints/char_start/latest.pt"
if os.path.exists(latest_path):
    !python -m yoctoGPT.train \
        --mode char \
        --data_dir data/char_start \
        --ckpt_dir checkpoints/char_start \
        --resume {latest_path} \
        --n_layer 2 \
        --n_head 4 \
        --n_embd 128 \
        --block_size 128 \
        --max_iters 500
else:
    print(f"No checkpoint found at {latest_path}. Run the initial training first.")

### Exercises

1. **Experiment with Context**: Increase the `block_size` to 256 and see how it affects memory usage (VRAM) and training speed.
2. **Deeper Model**: Change `n_layer` to 4 and observe the validation loss after 500 iterations. Does it improve?
3. **Custom Corpus**: Upload your own `.txt` file to the `data/` folder, rerun the data preparation step with `--text_path` pointing to it, and retrain the model on your own text.
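
As a starting point for the first exercise, the following back-of-the-envelope sketch estimates the memory for one layer's attention score matrix. It assumes float32 values and an implementation that materializes the full matrix, which fused attention kernels avoid; `attention_score_bytes` is a hypothetical helper, not part of `yoctoGPT`.

```python
def attention_score_bytes(batch_size: int, n_head: int, block_size: int,
                          bytes_per_value: int = 4) -> int:
    """Bytes for one layer's attention scores (float32 by default)."""
    return batch_size * n_head * block_size * block_size * bytes_per_value


for block_size in (128, 256):
    mib = attention_score_bytes(32, 4, block_size) / 2**20
    print(f"block_size={block_size}: {mib:.0f} MiB per layer")
```

Doubling `block_size` quadruples this term, while most other activations grow only linearly with the context length.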

