why use many token when few do trick — now baked in weights
Fine-tune of Gemma 4 31B that speak caveman natively — no skill file, no system prompt, no /caveman toggle. Drop articles. Drop filler. Drop pleasantries. Keep code byte-exact. Keep error strings exact. Brain big. Mouth small. Weights ship MIT-friendly under Gemma terms.
why use many token when few do trick. Same fix. Same brain. Less mouth.
```
┌─────────────────────────────────────┐
│ COMPRESSION (eval)   ████████   65% │
│ CODE FENCE EXACT     ████████   99% │
│ SEMANTIC SIM         ████████   94% │
│ ARTICLE DENSITY      █░░░░░░░    1% │
│ VIBES                ████████   OOG │
└─────────────────────────────────────┘
```
Two flavors. Pick by VRAM.
| Repo | Format | Size | What it is |
|---|---|---|---|
| JBrussee/gemma-4-31B-caveman | bf16 merged | 62.5 GB | Full Gemma 4 31B, caveman baked in. Drop-in. |
| JBrussee/gemma-4-31B-caveman-lora | LoRA adapter | 534 MB | Stack on google/gemma-4-31B-it. Light download. |
Merged model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("JBrussee/gemma-4-31B-caveman")
model = AutoModelForCausalLM.from_pretrained(
    "JBrussee/gemma-4-31B-caveman",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

msgs = [{"role": "user", "content": "Why does my React component re-render every time the parent updates?"}]
# Build prompt with chat template. No system message.
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=300, do_sample=False)
# Decode only new tokens, skip prompt.
print(tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True))
```

LoRA adapter on base:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")
# Stack caveman adapter on base weights.
model = PeftModel.from_pretrained(base, "JBrussee/gemma-4-31B-caveman-lora")
```

No system prompt needed. Ask question. Model talk caveman.
Rewrite or answer technical question in caveman style. Source-of-truth ruleset = the JuliusBrussee/caveman skill (MIT). Same rules. Now welded into weights.
Auth bug example (verbose → caveman):
In: "Sure! I'd be happy to help. The issue you're experiencing is most likely caused by your authentication middleware not properly validating the token expiry. Let me take a look..."
Out: "Bug in auth middleware. Token expiry check use `<` not `<=`. Fix:"
| Field | Value |
|---|---|
| Base | google/gemma-4-31B-it |
| Method | QLoRA NF4 + double-quant + bf16 compute |
| LoRA | rank 16, α 32, dropout 0, targets all linear |
| Dataset | 1750 train + 193 eval (debug · review · refactor · dialogue · qa) |
| Schedule | 3 epochs, lr 2e-4 cosine, batch 2 × grad accum 8 (eff 16), completion_only_loss=True |
| Hardware | RunPod RTX PRO 6000 Blackwell 96 GB, ~$1.89/hr |
| Wall time | ~50 min (Unsloth + TRL 0.17) |
| Final loss | train 0.024 · eval 0.72 · eval acc 81.5% |
Cost end-to-end: ~$4-5 pod time. Less than lunch.
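Quick arithmetic on table above, as sanity check. Sketch only: assumes single GPU and last partial batch kept, so real trainer step count may differ by a few. Numbers come straight from table; variable names are made up here.

```python
import math

# Figures copied from the training table.
TRAIN_ROWS = 1750
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
EPOCHS = 3
POD_USD_PER_HOUR = 1.89
TRAIN_MINUTES = 50

# Effective batch = per-device batch x gradient accumulation steps.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM

# Assumption: one GPU, last partial batch kept (not dropped).
steps_per_epoch = math.ceil(TRAIN_ROWS / effective_batch)
total_steps = steps_per_epoch * EPOCHS

# Pod cost for the training run alone.
train_cost_usd = POD_USD_PER_HOUR * TRAIN_MINUTES / 60

print(f"effective batch : {effective_batch}")
print(f"optimizer steps : {total_steps} ({steps_per_epoch} per epoch)")
print(f"training cost   : ${train_cost_usd:.2f}")
```

Training alone lands near $1.58 at listed rate. Rest of the ~$4-5 presumably goes to bootstrap, downloads, and eval; no breakdown given here.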
193-pair holdout, tagged by source category. code_fence_match = fraction of source code fences appearing byte-exact in target.
| Category | n | compression | article density | code_fence | semantic_sim |
|---|---|---|---|---|---|
| dialogue | 28 | 0.59 | 0.020 | 1.000 | 0.91 |
| debug | 34 | 0.92 | 0.009 | 0.995 | 0.98 |
| refactor | 27 | 0.92 | 0.005 | 0.963 | 0.98 |
| qa | 104 | 0.65 | 0.007 | 1.000 | 0.92 |
Read the numbers:
- ✅ Code preservation excellent — 96-100% fence-exact
- ✅ Article density crushed — 0.5-2% (English baseline ~8%)
- ✅ Semantic preservation strong — 91-98%
- ⚠️ Compression weaker than gold pairs: model lands 0.6-0.9, gold sits 0.3-0.5. Filter accepted ≤1.0× source; tighten to ≤0.7 next run, push harder.
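Rough sketch of three metrics from table, for intuition. Not the code in `eval/metrics.py`: length measured in whitespace tokens, article list fixed to a/an/the, and all function names are assumptions. `semantic_sim` left out since it needs an embedding model.

```python
import re

TICKS = "`" * 3  # built at runtime so this example can sit inside a fenced block
FENCE_RE = re.compile(r"`{3}.*?`{3}", re.DOTALL)
ARTICLES = {"a", "an", "the"}


def compression(source: str, target: str) -> float:
    """Target length over source length, in whitespace tokens. Lower = tighter."""
    src_len = len(source.split())
    return len(target.split()) / src_len if src_len else 0.0


def article_density(text: str) -> float:
    """Share of words that are a/an/the."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in ARTICLES for w in words) / len(words) if words else 0.0


def code_fence_match(source: str, target: str) -> float:
    """Fraction of source code fences found byte-exact in target."""
    fences = FENCE_RE.findall(source)
    if not fences:
        return 1.0  # assumption: nothing to preserve counts as a pass
    return sum(f in target for f in fences) / len(fences)


code = f"{TICKS}js\nif (exp < now) reject();\n{TICKS}"
source = f"The bug is in the auth middleware.\n{code}"
target = f"Bug in auth middleware.\n{code}"
mutated = target.replace("<", "<=")  # one-character change inside the fence

print(round(compression(source, target), 2))
print(article_density(target))
print(code_fence_match(source, target))
print(code_fence_match(source, mutated))
```

Note the last call: one changed character inside a fence drops the match to zero. Byte-exact means byte-exact.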
```
cavegemma/
├── data/
│   ├── seeds/                  caveman repo snapshots (SKILL.md, eval prompts)
│   ├── sources/                per-source HuggingFace loaders
│   ├── build_corpus.py         orchestrator (6 sources → corpus_raw.jsonl)
│   ├── synthesize.py           claude/codex CLI driver, two-step rewrite, resumable
│   ├── filter.py               fence-integrity + dedup + compression band
│   ├── split.py                90/10 split with seed-pair pinning
│   └── prompts/, out/          (out gitignored)
├── training/
│   ├── train_unsloth.py        Unsloth + TRL SFT trainer, resume from checkpoint
│   ├── runpod_bootstrap.sh     pip + auth bootstrap for fresh pod
│   └── config.toml             single source of truth for hyperparams
├── eval/
│   ├── metrics.py              compression / article-drop / code-fence / semantic_sim
│   ├── run_eval.py             score adapter on holdout + workflow prompts
│   ├── judge.py                LLM-judge via claude CLI on 20 holdouts
│   └── workflow_prompts.jsonl  10 hand-curated workflow eval prompts
├── scripts/
│   ├── infer.py                smoke-test against caveman eval prompts
│   └── push_to_hub.py          publish adapter + model card
└── artifacts/                  (gitignored)
```
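One way the "90/10 split with seed-pair pinning" in `split.py` could work. Assumption, not repo code: pinning here means gold seed pairs always go to train, and the `is_seed` field and `split_pairs` name are hypothetical.

```python
import random


def split_pairs(pairs, eval_frac=0.10, seed=0):
    """Deterministic split. Seed pairs pinned to train, never held out."""
    pinned = [p for p in pairs if p.get("is_seed")]
    rest = [p for p in pairs if not p.get("is_seed")]

    # Fixed RNG seed so the same input always gives the same split.
    random.Random(seed).shuffle(rest)

    # Eval size is a fraction of the whole corpus, drawn only from non-seed pairs.
    n_eval = min(len(rest), round(len(pairs) * eval_frac))
    return pinned + rest[n_eval:], rest[:n_eval]


# Toy corpus: first 20 rows stand in for gold seed pairs.
pairs = [{"id": i, "is_seed": i < 20} for i in range(200)]
train, held_out = split_pairs(pairs)

print(len(train), len(held_out))
print(any(p["is_seed"] for p in held_out))
```

Fixed seed keeps split stable across reruns, so eval numbers stay comparable between training runs.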
End-to-end. ~6-8 hours wall, ~$4-5 pod.
```bash
# 1. Local setup
uv sync
uv run python data/extract_seeds.py   # 20 gold seed pairs
uv run python data/build_corpus.py    # 3000 rows from 6 HF sources

# 2. Synthesis (Claude Code or Codex CLI required)
uv run python data/synthesize.py --backend claude --workers 3   # or
uv run python data/synthesize.py --backend codex --workers 3

# 3. Filter + split
uv run python data/filter.py --in data/out/raw_pairs.jsonl --out data/out/clean_pairs.jsonl
uv run python data/split.py --in data/out/clean_pairs.jsonl

# 4. RunPod H100 / RTX PRO 6000: rsync, ssh, bootstrap, train
rsync -avz --exclude='.git' --exclude='.venv' -e "ssh -p <port> -i ~/.ssh/id_ed25519" ./ root@<pod>:/workspace/cavegemma/
ssh -i ~/.ssh/id_ed25519 -p <port> root@<pod> "
  export HF_TOKEN=...
  export WANDB_API_KEY=...
  cd /workspace/cavegemma
  bash training/runpod_bootstrap.sh
  python training/train_unsloth.py --config training/config.toml
"

# 5. Eval + ship
python eval/run_eval.py --adapter artifacts/adapter --eval data/out/eval.jsonl --workflow eval/workflow_prompts.jsonl --out artifacts/eval_predictions.jsonl
python scripts/push_to_hub.py --adapter artifacts/adapter --repo <hf-user>/gemma-4-31B-caveman-lora
```

All permissively licensed. 6 sources in, 1750 train + 193 eval out.
| Source | License | Pulled | Used for |
|---|---|---|---|
| OpenAssistant/oasst2 | Apache 2.0 | 400 | Multi-turn dialogue |
| princeton-nlp/SWE-bench_Verified | research-permissive | 400 | Debug-session narratives |
| ronantakizawa/github-codereview | permissive subset | 400 | Code review |
| bigcode/commitpackft | MIT/Apache subset | 300 | Refactor walkthroughs |
| theblackcat102/evol-codealpaca-v1 | Apache 2.0 | 1200 | Short technical Q&A |
| HuggingFaceH4/ultrachat_200k | MIT | 300 | Short Q&A overflow |
Caveman side synthesized via Claude Code (claude -p) and Codex CLI (codex exec with GPT-5.5), routed through the canonical SKILL.md ruleset. Two-step rewrite + fence-integrity filter.
- Compression weaker than gold caveman. Model averages 0.6-0.9 vs gold's 0.3-0.5. Training filter accepted ≤ 1.0× source length; tighten to ≤ 0.7 next run.
- Review category sparse. Codex pairs often mutated diff fences, so filter dropped most. Only ~8 review pairs in eval — review behavior extrapolated from debug/refactor neighbors.
- Workflow eval gates partly info-only. Open-ended prompts in `workflow_prompts.jsonl` have no reference; `code_fence_match` checks input fences in answer, `semantic_sim` compares answer to question. Treat as smoke signal, not scoreboard.
- Multimodal untouched. Gemma 4 is vision + audio capable. Fine-tune was text-only on language head; vision/audio paths should still work but unverified.
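Sketch of tighter filter from first bullet. Assumptions: ratio measured in whitespace tokens over whole pair, `keep_pair` and `max_ratio` are hypothetical names, dedup step skipped. Real `data/filter.py` may differ.

```python
import re

TICKS = "`" * 3  # built at runtime so this example can sit inside a fenced block
FENCE_RE = re.compile(r"`{3}.*?`{3}", re.DOTALL)


def keep_pair(source: str, target: str, max_ratio: float = 0.7) -> bool:
    """Keep pair only if fences survive and target is short enough."""
    # Fence integrity: every source fence must appear byte-exact in target.
    if any(fence not in target for fence in FENCE_RE.findall(source)):
        return False
    src_len = len(source.split())
    if src_len == 0:
        return False
    # Compression band: target length at most max_ratio x source length.
    return len(target.split()) / src_len <= max_ratio


pairs = [
    # Strong compression: passes both bands.
    ("Sure, I would be happy to help you with that problem today.", "Restart server."),
    # Weak compression (6 of 8 tokens kept): passes 1.0, fails 0.7.
    ("The cache is stale so clear it now.", "Cache stale so clear it now."),
    # Mutated fence: dropped at any ratio.
    (f"Fix below.\n{TICKS}py\nx <= 1\n{TICKS}", f"Fix.\n{TICKS}py\nx < 1\n{TICKS}"),
]

loose = [keep_pair(s, t, max_ratio=1.0) for s, t in pairs]
tight = [keep_pair(s, t, max_ratio=0.7) for s, t in pairs]
print(loose)
print(tight)
```

Tighter band trades corpus size for stronger compression signal: middle pair survives old filter, not new one.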
Three rocks. One philosophy: model do more with less.
| Repo | What |
|---|---|
| caveman | Output compression skill — why use many token when few do trick |
| cavemem | Cross-agent memory — why agent forget when agent can remember |
| cavekit | Spec-driven build loop — why agent guess when agent can know |
| cavegemma (you here) | Caveman baked into weights — why prompt every session when weights remember |
Skill compresses any model at runtime. This repo welds the same style into Gemma 4 31B so caveman survives across hosts, agents, no-system-prompt setups. Cheap inference, no skill loader, same brain.
- Code in this repo: MIT
- Adapter and merged model inherit the Gemma Prohibited Use Policy (Apache 2.0 + Gemma terms). See Google's Gemma terms.
- Style ruleset and seed pairs from JuliusBrussee/caveman: MIT.
```bibtex
@misc{brussee2026cavemanGemma,
  author = {Julius Brussee},
  title  = {Caveman-mode Gemma 4 31B},
  year   = {2026},
  url    = {https://huggingface.co/JBrussee/gemma-4-31B-caveman}
}
```
Star cost zero. Help small mouth find big audience. ⭐
- caveman — the original Claude Code skill this fine-tune is built on
- Revu — local-first macOS study app with FSRS spaced repetition. revu.cards
- Style source-of-truth: JuliusBrussee/caveman
- Agent / pitfall notes: AGENTS.md
why use many token when few do trick 🪨
