nanoBeard ☠️

A tiny pirate-themed GPT trained from scratch on a piratized version of TinyStories, then SFT-tuned. Built as a learning project — closer to nanoGPT than to a production LM.

The repo is structured for multiple ship-class versions under one codebase:

| Codename | Status | HF repo |
| --- | --- | --- |
| Sloop (v1) | shipped | `younissk/nanoBeard` |
| Brig (v2) | planned | `younissk/nanoBeard-Brig` |

Source code: https://github.com/younissk/pirate_llm
Docs: see docs/ or run make docs-serve

Layout

nanobeard/             # package: training, sampling, SFT, publish, eval
  models/              # one file per architecture, registered via MODEL_REGISTRY
  eval/                # perplexity + sample-gallery harness
  dataset_pipeline/    # sources (piratized corpora) + recipe-driven dataset builds
configs/               # one .py per model version
data/sources/<name>/   # reusable piratized corpora (cached arrow + source.json)
data/datasets/<name>/  # composed datasets: recipe.json + tokenizer + bins + metadata
runs/<version>/        # per-version checkpoints
evals/                 # prompt set + per-eval reports
space/                 # Gradio playground (multi-version dropdown)
tests/                 # 80+ pytest tests
docs/                  # mkdocs site

Model details (Sloop, v1)

| Field | Value |
| --- | --- |
| Architecture | Decoder-only Transformer (GPT-style) |
| Parameters | ~13.8M |
| Layers / heads / embd | 6 / 6 / 384 |
| Context length | 256 tokens |
| Vocab size | 8192 (custom BPE) |
| Bias in Linear/LN | False |
| Tokenizer | `pirate_bpe.json` (HuggingFace `tokenizers` BPE) |
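
The ~13.8M figure can be sanity-checked by hand. The sketch below assumes a nanoGPT-style layout that this card does not spell out: a fused QKV projection, a 4x MLP hidden size, a learned position embedding, and an output head tied to the token embedding. It also assumes the reported number follows nanoGPT's convention of excluding position embeddings.

```python
# Rough parameter count for a nanoGPT-style decoder (assumed layout, not read from the repo).
n_layer, n_embd, vocab_size, block_size = 6, 384, 8192, 256

tok_emb = vocab_size * n_embd   # token embedding, shared with the output head (weight tying)
pos_emb = block_size * n_embd   # learned position embedding

attn = n_embd * 3 * n_embd + n_embd * n_embd   # fused QKV projection + output projection
mlp = 2 * (n_embd * 4 * n_embd)                # up- and down-projection, 4x hidden size
norms = 2 * n_embd                             # two LayerNorm weight vectors, no bias
per_layer = attn + mlp + norms

total = tok_emb + pos_emb + n_layer * per_layer + n_embd   # + final LayerNorm weight
non_pos = total - pos_emb                                   # nanoGPT-style reporting

print(f"total: {total:,}  excluding position embeddings: {non_pos:,}")
```

Under these assumptions the count is 13,865,856 in total and 13,767,552 without position embeddings, which is consistent with the ~13.8M in the table.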

Training (Sloop)

Pretraining on piratized TinyStories.
SFT on TeeZee/dolly-15k-pirate-speech.
AdamW, warmup + cosine decay. bfloat16 on CUDA.
See training_metadata.json in the HF repo for the exact run config + losses.
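
For readers new to the schedule named above, here is a minimal sketch of linear warmup followed by cosine decay. The function name `lr_at` and every hyperparameter value are hypothetical; the actual values for Sloop are in training_metadata.json.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=5000):
    """Linear warmup to max_lr, then cosine decay to min_lr (all values hypothetical)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + cosine * (max_lr - min_lr)

for s in (0, 99, 100, 2550, 5000):
    print(s, f"{lr_at(s):.2e}")
```

With a PyTorch optimizer, one common way to apply such a function is to assign the result to `param_group["lr"]` for each entry in `optimizer.param_groups` before every step.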

Quick start

make install                       # uv sync
uv sync --dev                      # dev tooling (pytest, ruff, mypy, mkdocs)
make env                           # .env from example
pre-commit install                 # format/lint on commit

make dataset DATASET=tiny_pirate_stories   # build -> data/datasets/tiny_pirate_stories/
make train   CONFIG=sloop          # local smoke
make train   CONFIG=sloop CONFIG_VARIANT=gpu   # GPU run
make sample  CONFIG=sloop PROMPT='Ahoy matey'
make eval    CONFIG=sloop          # perplexity + gallery -> evals/results/<date>/sloop/
make publish CONFIG=sloop          # push to HF model repo
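
The `make eval` step above reports perplexity. The harness itself is not reproduced here; as a reminder of the quantity being reported, this sketch assumes per-token negative log-likelihoods in nats (for example, unreduced cross-entropy losses) and uses a hypothetical helper name.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLL in nats."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns probability 1/4 to every target token has perplexity 4.
uniform_nlls = [-math.log(0.25)] * 10
print(round(perplexity(uniform_nlls), 6))
```

Perplexity values are only comparable between models that share a tokenizer, since the per-token average depends on how the text is split.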

Tests

make test            # fast (~1.5s)
make test-all        # include slow integration

Highlights:

tests/test_model_contract.py — parametrized over MODEL_REGISTRY, so any new architecture is automatically checked for the same invariants (causal mask, weight tying, shape contracts, finite grads).
tests/test_publish.py — HF Hub mocked; verifies config.json carries model_name, codename, display_name, num_parameters, and all arch fields.
tests/test_tokenizer_hash.py — tokenizer fingerprint in every ckpt.
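
The idea behind a tokenizer fingerprint can be sketched with the standard library alone. This is an illustration, not the repo's implementation: the metadata key `tokenizer_sha256` and both helper names are assumptions, and a throwaway JSON file stands in for `pirate_bpe.json`.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def tokenizer_fingerprint(path):
    """SHA-256 hex digest of the tokenizer file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def check_checkpoint(meta, tokenizer_path):
    """Raise if the checkpoint metadata was produced with a different tokenizer."""
    expected = meta.get("tokenizer_sha256")  # hypothetical metadata key
    actual = tokenizer_fingerprint(tokenizer_path)
    if expected != actual:
        raise ValueError(f"tokenizer mismatch: ckpt={expected} file={actual}")

# Demo with a throwaway file standing in for the real tokenizer.
with tempfile.TemporaryDirectory() as tmp:
    tok_path = Path(tmp) / "tok.json"
    tok_path.write_text(json.dumps({"vocab": {"ahoy": 0}}))
    meta = {"tokenizer_sha256": tokenizer_fingerprint(tok_path)}
    check_checkpoint(meta, tok_path)  # same file: passes silently
    tok_path.write_text(json.dumps({"vocab": {"ahoy": 1}}))
    try:
        check_checkpoint(meta, tok_path)  # changed file: should raise
    except ValueError:
        mismatch_detected = True
    else:
        mismatch_detected = False
print(mismatch_detected)
```

Storing the digest next to the weights lets a loader refuse to pair a checkpoint with a tokenizer whose ids mean something else.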

Loading a published model

nanoBeard is not a transformers model — load via the nanobeard package:

import json, torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_model
from tokenizers import Tokenizer

from nanobeard.config import Config
from nanobeard.models import build_model

repo = "younissk/nanoBeard"      # Sloop
# Rebuild the Config from config.json, ignoring keys the dataclass does not define.
with open(hf_hub_download(repo, "config.json")) as f:
    cfg_dict = json.load(f)
cfg = Config(**{k: v for k, v in cfg_dict.items() if k in Config.__dataclass_fields__})
model = build_model(cfg).eval()
load_model(model, hf_hub_download(repo, "model.safetensors"))

tok = Tokenizer.from_file(hf_hub_download(repo, "pirate_bpe.json"))
ids = torch.tensor([tok.encode("Once upon a time").ids])
temperature = 0.8
with torch.no_grad():
    for _ in range(80):                                  # generate 80 new tokens
        logits, _ = model(ids[:, -cfg.block_size:])      # crop to the context window
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        next_id = torch.multinomial(probs, 1)            # sample one token id
        ids = torch.cat([ids, next_id], dim=1)
print(tok.decode(ids[0].tolist()))
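
The sampling loop above divides the last-position logits by 0.8 before the softmax. The effect of that temperature can be seen on a toy example; the helper name and the logits below are made up for illustration and need only the standard library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
p_plain = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.8)
print([round(p, 3) for p in p_plain])
print([round(p, 3) for p in p_sharp])
```

A temperature below 1 moves probability mass toward the highest-scoring token, so samples become more conservative; values above 1 flatten the distribution.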

Adding a new model version

See docs/adding-a-model.md. TL;DR:

Write nanobeard/models/<key>.py (new arch, frozen contract).
Register a ModelSpec in nanobeard/models/__init__.py.
Drop a config in configs/<key>.py.
make dataset DATASET=<name> / make train CONFIG=<key>.
make test — contract suite parametrizes automatically.
make publish CONFIG=<key> to its own HF model repo.
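
The registration step in the list above follows a common registry pattern. The sketch below is a generic version of it: the `ModelSpec` fields, the `register` helper, and `build_from_registry` are hypothetical and may not match what nanobeard/models/__init__.py actually defines.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical spec; the real ModelSpec fields may differ."""
    key: str
    display_name: str
    build: Callable[[dict], object]  # factory: config -> model

MODEL_REGISTRY: Dict[str, ModelSpec] = {}

def register(spec: ModelSpec) -> None:
    """Add a spec to the registry, refusing duplicate keys."""
    if spec.key in MODEL_REGISTRY:
        raise KeyError(f"model {spec.key!r} already registered")
    MODEL_REGISTRY[spec.key] = spec

def build_from_registry(key: str, cfg: dict) -> object:
    """Look up the spec by key and call its factory."""
    return MODEL_REGISTRY[key].build(cfg)

# Stand-in factories; a real one would return an nn.Module.
register(ModelSpec("sloop", "Sloop (v1)", build=lambda cfg: ("sloop-model", cfg["n_layer"])))
register(ModelSpec("brig", "Brig (v2)", build=lambda cfg: ("brig-model", cfg["n_layer"])))

print(sorted(MODEL_REGISTRY))
print(build_from_registry("sloop", {"n_layer": 6}))
```

A contract suite can then iterate over the registry keys, for instance with `pytest.mark.parametrize`, so a newly registered architecture is picked up without editing the tests.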

Limitations

Trained on a tiny synthetic corpus. Vocabulary, grammar, and world knowledge are extremely narrow.
Short context (256 tokens).
No safety tuning. Pirate-flavored nonsense at best.
Educational artifact, not a useful chat model.

Data

Base corpus: TinyStories (CDLA-Sharing-1.0), transformed by nanobeard/dataset_pipeline/piratize.py.
SFT corpus (Sloop): TeeZee/dolly-15k-pirate-speech.
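
How piratize.py transforms the text is not described in this card. Purely to illustrate what "piratizing" a sentence means, here is a toy rule-based version; the word map and the function are invented for this example and are not taken from the repo.

```python
import re

# Hypothetical word map; the real pipeline's vocabulary and method are not shown here.
PIRATE_WORDS = {"hello": "ahoy", "friend": "matey", "yes": "aye", "my": "me", "you": "ye"}
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, PIRATE_WORDS)) + r")\b", re.IGNORECASE
)

def piratize(text):
    """Replace whole words with a pirate equivalent, keeping initial capitalization."""
    def swap(match):
        word = match.group(0)
        repl = PIRATE_WORDS[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl
    return _PATTERN.sub(swap, text)

print(piratize("Hello, my friend! Do you like boats?"))
```

Matching on word boundaries keeps substrings such as the "my" in "army" untouched.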
