# yoctoGPT (char-level) on Google Colab (T4, ~15GB VRAM)

This notebook mounts Google Drive so that checkpoints persist across
sessions (the repo itself stays on the Colab VM's local disk), builds a
**character-level** dataset from the text files in `data/`, and trains a
char-level GPT configuration sized for a Colab T4 (~15GB). It also
includes sampling examples and a cell for resuming training. When the
corpus is tiny, the notebook shrinks the context length and batch size so
that random batch sampling does not fail with index errors.

In [10]:
#@title Mount Google Drive for checkpoint storage (repo stays local)
from google.colab import drive

drive.mount("/content/drive")

from pathlib import Path

CKPT_DIR = Path("/content/drive/MyDrive/yocto/checkpoints/colab_char")
CKPT_DIR.mkdir(parents=True, exist_ok=True)
print("Checkpoints dir:", CKPT_DIR)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Checkpoints dir: /content/drive/MyDrive/yocto/checkpoints/colab_char


In [2]:
#@title Setup: install deps and clone/update the repo locally
!nvidia-smi || true
!python -V
!pip -q install tokenizers tqdm

import os, pathlib, subprocess, textwrap

repo_root = pathlib.Path("/content/yoctoGPT")
if repo_root.exists():
    print("Repo exists, pulling latest...")
    subprocess.run(["git", "pull"], cwd=repo_root, check=False)
else:
    subprocess.run(
        [
            "git",
            "clone",
            "https://github.com/yhilpisch/yoctoGPT.git",
            str(repo_root),
        ],
        check=True,  # fail fast: os.chdir below needs the clone to exist
    )
os.chdir(repo_root)

if os.path.exists("requirements.txt"):
    !pip -q install -r requirements.txt || true

data_dir = pathlib.Path("data")
data_dir.mkdir(exist_ok=True)
txts = list(data_dir.glob("*.txt"))
if not txts:
    sample = textwrap.dedent('''
    Philosophy is the study of general and fundamental questions,
    such as those about existence, reason, knowledge, values, mind,
    and language. It often poses questions rather than providing
    answers, inviting us to think.
    ''').strip()
    (data_dir / "philosophy.txt").write_text(sample, encoding="utf-8")
    print("Created sample data/philosophy.txt")
else:
    names = [p.name for p in txts][:5]
    print(f"Found {len(txts)} text files in data/: {names}")

Sun Dec 14 12:59:31 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   43C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [6]:
#@title Char data: merge all text files in data/ into one corpus
# The next cell passes the merged file to `--text_path`; point it at a
# different file if you want another source.
import pathlib

data_dir = pathlib.Path("data")
merged_path = data_dir / "all_texts_char.txt"

texts = []
for p in sorted(data_dir.glob("*.txt")):
    if p == merged_path:
        continue  # skip our own output so reruns do not duplicate the corpus
    print("Including:", p)
    texts.append(p.read_text(encoding="utf-8"))

merged_path.write_text("\n\n".join(texts), encoding="utf-8")
print("Wrote merged corpus to", merged_path)

Including: data/anticipations.txt
Including: data/capitalism.txt
Including: data/economist.txt
Including: data/economy.txt
Including: data/essentials.txt
Including: data/insights.txt
Including: data/meditations.txt
Including: data/philosophy.txt
Including: data/political_economy.txt
Including: data/principles.txt
Including: data/psycho_stocks.txt
Including: data/religion.txt
Including: data/siddharta.txt
Including: data/socialism.txt
Including: data/stocks.txt
Including: data/wealth.txt
Wrote merged corpus to data/all_texts_char.txt


In [7]:
!python -m scripts.prepare_char_data \
  --text_path data/all_texts_char.txt \
  --out_dir data/char

Wrote 9619983 train and 1068886 val tokens.
Vocab size: 211
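
To make the preparation step concrete, here is a minimal sketch of character-level tokenization: every distinct character gets an integer id, and the id sequence is split into train and validation parts. This illustrates the general technique only; it is not the code in `scripts.prepare_char_data`, and the helper names (`build_char_vocab`, `encode`, `decode`) and the 90/10 split are assumptions chosen to mirror the roughly 9:1 token counts reported above.

```python
def build_char_vocab(text: str) -> tuple[dict[str, int], dict[int, str]]:
    """Map each distinct character to an integer id (sorted for determinism)."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text: str, stoi: dict[str, int]) -> list[int]:
    """Turn a string into a list of token ids, one per character."""
    return [stoi[ch] for ch in text]


def decode(ids: list[int], itos: dict[int, str]) -> str:
    """Turn a list of token ids back into a string."""
    return "".join(itos[i] for i in ids)


text = "Philosophy is the study of general and fundamental questions."
stoi, itos = build_char_vocab(text)
ids = encode(text, stoi)

# Hypothetical 90/10 split (an assumption, not read from the script).
split = int(0.9 * len(ids))
train_ids, val_ids = ids[:split], ids[split:]

print("vocab size:", len(stoi))
print("train/val tokens:", len(train_ids), len(val_ids))
print("round trip ok:", decode(ids, itos) == text)
```

With a char-level vocabulary, "tokens" and "characters" are the same thing, which is why the vocabulary stays small (211 here) while the token count equals the corpus length.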


In [8]:
#@title Pick safe block_size/batch_size for this char corpus
import numpy as np
from pathlib import Path

train_path = Path("data/char/train.bin")
val_path = Path("data/char/val.bin")
train_tokens = int(np.fromfile(train_path, dtype=np.int32).shape[0])
val_tokens = int(np.fromfile(val_path, dtype=np.int32).shape[0])
min_tokens = min(train_tokens, val_tokens)

if min_tokens <= 4:
    raise SystemExit(
        "Dataset too small. Add more text to data/ and "
        "rerun the char data preparation."
    )

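block_candidates = [512, 384, 256, 192, 128, 96, 64, 48, 32, 24, 16]
# Keep the same margin in the fallback as in the candidate check
# (min_tokens > block_size + 2), so a tiny corpus never gets a block
# that is longer than the data.
block_size = next(
    (b for b in block_candidates if min_tokens > b + 2),
    min_tokens - 3,
)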

target_tokens = min(16000, max(512, min_tokens))
batch_size = max(1, min(24, target_tokens // block_size))

print(
    f"Train tokens: {train_tokens}, Val tokens: {val_tokens}, "
    f"min_tokens: {min_tokens}"
)
print(f"Using block_size={block_size}, batch_size={batch_size}")

Train tokens: 9619983, Val tokens: 1068886, min_tokens: 1068886
Using block_size=512, batch_size=11
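
The margin check in the cell above (`min_tokens > b + 2`) matters because of how training batches are usually drawn. The sketch below assumes the common approach of sampling random window starts, where each window needs `block_size + 1` tokens (the inputs plus the targets shifted by one position). Whether `yoctoGPT.train` samples exactly this way is an assumption, and `sample_batch_starts` is a hypothetical helper.

```python
import random


def sample_batch_starts(
    n_tokens: int, block_size: int, batch_size: int, seed: int = 0
) -> list[int]:
    """Draw random window starts for one batch.

    A window starting at s covers tokens s .. s + block_size inclusive,
    so the largest valid start is n_tokens - block_size - 1.
    """
    max_start = n_tokens - block_size - 1
    if max_start < 0:
        raise ValueError(
            f"corpus has {n_tokens} tokens but a window needs "
            f"{block_size + 1}; lower block_size or add more text"
        )
    rng = random.Random(seed)
    return [rng.randint(0, max_start) for _ in range(batch_size)]


# A corpus comfortably larger than the block works fine.
starts = sample_batch_starts(n_tokens=1000, block_size=512, batch_size=4)
print("window starts:", starts)

# A tiny corpus fails, which is the error the auto-picker avoids.
try:
    sample_batch_starts(n_tokens=100, block_size=512, batch_size=4)
except ValueError as err:
    print("tiny corpus:", err)
```

The check uses the smaller of the train and validation token counts because evaluation draws windows from the validation split as well.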


In [None]:
#@title (Optional) Get an auto-recommended command for this GPU (T4 ~15GB)
!python -m scripts.recommend_training \
  --mode char \
  --data_dir data/char \
  --ckpt_dir {CKPT_DIR} \
  --priority speed \
  --device cuda \
  --device_mem_gb 15

## Training

We use a char-level configuration sized for a Colab T4. If training runs
out of GPU memory (OOM), lower `batch_size` or `block_size` before
launching; if you have headroom, you can nudge them upward. The block and
batch sizes picked automatically above avoid indexing errors on tiny
corpora.
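
For a rough sense of what "sized for a Colab T4" means, here is a back-of-envelope parameter count for the configuration used below (`n_layer=8`, `n_embd=512`, vocab size 211, `block_size=512`). It assumes a standard GPT-2-style block (attention projections plus an MLP with 4x expansion), learned positional embeddings, and an AdamW-style optimizer with two moment buffers. The actual `gpt_fast` architecture may differ, so treat the result as an estimate; `estimate_gpt_params` is a hypothetical helper.

```python
def estimate_gpt_params(
    n_layer: int,
    n_embd: int,
    vocab_size: int,
    block_size: int,
    tie_weights: bool = True,
) -> int:
    """Approximate parameter count, ignoring biases and LayerNorm weights."""
    # Per block: 4*d^2 for attention (Q, K, V, output projection)
    # plus 8*d^2 for an MLP that expands to 4*d and projects back.
    per_layer = 12 * n_embd * n_embd
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd  # assumes learned positional embeddings
    # With tied weights the output head reuses the token embedding matrix.
    head = 0 if tie_weights else vocab_size * n_embd
    return n_layer * per_layer + tok_emb + pos_emb + head


params = estimate_gpt_params(
    n_layer=8, n_embd=512, vocab_size=211, block_size=512
)

# fp32 weights + gradients + two optimizer moment buffers = 4 copies,
# 4 bytes each (assumption: AdamW-style optimizer, no mixed precision).
state_gib = params * 4 * 4 / 2**30

print(f"~{params / 1e6:.1f}M parameters")
print(f"~{state_gib:.2f} GiB for weights, gradients, and optimizer state")
```

Weights and optimizer state for a model of this size are small next to 15GB. Activation memory, which grows with `batch_size` times `block_size`, is typically what triggers OOM, which is why those two settings are the ones to lower first.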

In [9]:
#@title Train (char-level GPT) on Colab T4
from pathlib import Path

CKPT_DIR = Path("/content/drive/MyDrive/yocto/checkpoints/colab_char")
CKPT_DIR.mkdir(parents=True, exist_ok=True)

!python -m yoctoGPT.train \
  --mode char \
  --data_dir data/char \
  --ckpt_dir {CKPT_DIR} \
  --model_type gpt_fast \
  --device cuda \
  --n_layer 8 --n_head 8 --n_embd 512 \
  --block_size {block_size} --batch_size {batch_size} \
  --dropout 0.1 --weight_decay 0.1 \
  --tie_weights --label_smoothing 0.05 \
  --eval_interval 800 --eval_iters 100 \
  --cosine_lr --warmup_iters 400 \
  --min_lr 1e-5 --lr 2e-4 \
  --max_iters 6000 \
  --ema --ema_decay 0.999

training: 100% 2000/2000 [02:57<00:00, 11.27it/s, train_loss=2.48, val_loss=2.56]
{'final_val_loss': 2.563058043718338}
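
The `--cosine_lr`, `--warmup_iters`, `--lr`, and `--min_lr` flags in the training command describe a warmup-then-cosine learning-rate schedule. The sketch below uses one common formulation (linear warmup followed by cosine decay down to `min_lr`) with the values from the command. Whether `yoctoGPT.train` computes the schedule exactly this way is an assumption, and `lr_at` is a hypothetical helper name.

```python
import math


def lr_at(
    it: int,
    lr: float = 2e-4,
    min_lr: float = 1e-5,
    warmup_iters: int = 400,
    max_iters: int = 6000,
) -> float:
    """Learning rate at iteration `it` for linear warmup + cosine decay."""
    if it < warmup_iters:
        # Linear ramp from lr / warmup_iters up to lr.
        return lr * (it + 1) / warmup_iters
    if it >= max_iters:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + coeff * (lr - min_lr)


for it in (0, 399, 400, 3200, 5999, 6000):
    print(f"iter {it:>4}: lr = {lr_at(it):.2e}")
```

If you change `--max_iters`, the decay phase stretches or shrinks with it, so a shorter run reaches `min_lr` sooner.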


## Sampling examples

Generate text from the latest char-level checkpoint. Lower `--temperature`
for more conservative text, raise it for more varied output, and set
`--max_new_tokens` to control the length.

In [11]:
#@title Sample 1: simple continuation
from pathlib import Path

CKPT_DIR = Path("/content/drive/MyDrive/yocto/checkpoints/colab_char")

!python -m yoctoGPT.sampler \
  --mode char \
  --ckpt {CKPT_DIR}/latest.pt \
  --vocab_path data/char/vocab.json \
  --prompt "Philosophy is" \
  --max_new_tokens 300 \
  --temperature 0.8 --top_k 50 --top_p 0.95

Philosophy iss at conty thall orole the prostions
sthermat the suse site at ond wathe bus ardof the of man by prepptuctses. The whin ild
hessis of sthe pe the duceresesed on the thesas be of f thesucced
be ffon oucoss its care then be wacoung, a cin anghis erthe indedelintics
ined the lo dererestins the is the i


In [None]:
#@title Sample 2: multi-line prompt
from pathlib import Path

CKPT_DIR = Path("/content/drive/MyDrive/yocto/checkpoints/colab_char")

!python -m yoctoGPT.sampler \
  --mode char \
  --ckpt {CKPT_DIR}/latest.pt \
  --vocab_path data/char/vocab.json \
  --prompt "Question: What is wisdom?\nAnswer:" \
  --max_new_tokens 300 \
  --temperature 0.9 --top_k 40 --top_p 0.95

## Resume training

Resume from the latest checkpoint to continue char-level training for
additional steps. `--max_iters` is interpreted as extra steps beyond the
checkpointed iteration count.

In [None]:
#@title Resume training from latest.pt (additional steps set by --max_iters)
from pathlib import Path

CKPT_DIR = Path("/content/drive/MyDrive/yocto/checkpoints/colab_char")
latest = CKPT_DIR / "latest.pt"
if not latest.exists():
    raise SystemExit(
        "No latest.pt found in CKPT_DIR; run initial training first."
    )

!python -m yoctoGPT.train \
  --mode char \
  --data_dir data/char \
  --ckpt_dir {CKPT_DIR} \
  --resume {latest} \
  --model_type gpt_fast \
  --device cuda \
  --n_layer 8 --n_head 8 --n_embd 512 \
  --block_size {block_size} --batch_size {batch_size} \
  --dropout 0.1 --weight_decay 0.1 \
  --tie_weights --label_smoothing 0.05 \
  --eval_interval 800 --eval_iters 100 \
  --cosine_lr --warmup_iters 400 \
  --min_lr 1e-5 --lr 2e-4 \
  --max_iters 6000 \
  --ema --ema_decay 0.999

Resumed from /content/drive/MyDrive/yocto/checkpoints/colab_char/latest.pt at iter 2000
training:  24% 2875/12000 [01:14<08:15, 18.41it/s, train_loss=2.37, val_loss=2.44]

Optionally, inspect the last lines of the training metrics CSV to monitor
progress.

In [None]:
#@title Inspect metrics
from pathlib import Path

CKPT_DIR = Path("/content/drive/MyDrive/yocto/checkpoints/colab_char")

!tail -n 20 {CKPT_DIR}/metrics.csv || true
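
If you would rather work with the metrics in Python than read them through `tail`, a sketch along these lines reads the last rows of a CSV into dictionaries keyed by the header row. It makes no assumption about which columns `metrics.csv` contains. The demo file and its column names below are made up for illustration, and `tail_csv` is a hypothetical helper.

```python
import csv
import tempfile
from pathlib import Path


def tail_csv(path: Path, n: int = 20) -> list[dict[str, str]]:
    """Return the last n data rows of a CSV file as dicts keyed by header."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return rows[-n:] if n > 0 else []


# Demo on a throwaway file with made-up columns and values; in the
# notebook you would pass CKPT_DIR / "metrics.csv" instead.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "metrics.csv"
    demo.write_text(
        "iter,train_loss,val_loss\n0,5.0,5.1\n800,3.0,3.2\n1600,2.6,2.7\n",
        encoding="utf-8",
    )
    last_two = tail_csv(demo, n=2)

for row in last_two:
    print(row)
```

All values come back as strings, so convert them with `float(...)` before plotting or comparing losses.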