Merged
… README. New `templates/` directory for generation templates. The model card template now includes an acknowledgments table with references to the projects and papers PAWN builds on. Remove HF YAML frontmatter from the GitHub README.
- `templates/hf_model_card.md.j2`: Jinja2 template replacing the old `{PLACEHOLDER}` template. Includes acknowledgments, probe results, diagnostics, and all architecture/training details.
- `scripts/generate_model_cards.py`: fetches `metrics.jsonl` and `eval_results.json` from each HF model repo, renders the template, and optionally uploads. No hardcoded metrics — everything is pulled from the source of truth.

  Usage: `python scripts/generate_model_cards.py --push`
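The fetch-and-render flow can be sketched roughly as follows. This is a minimal stdlib sketch, not the script's actual code: `string.Template` stands in for Jinja2, and the record fields (`val_loss`, `top1_accuracy`) are hypothetical examples rather than the real metrics schema.

```python
import json
from string import Template

# Hypothetical stand-in for the real Jinja2 model card template.
CARD_TEMPLATE = Template(
    "# $model_name\n\n"
    "Best val loss: $val_loss\n"
    "Top-1 accuracy: $top1\n"
)

def render_card(metrics_jsonl: str, model_name: str) -> str:
    """Pick the best-val-loss record from a metrics.jsonl dump and render a card."""
    records = [json.loads(line) for line in metrics_jsonl.splitlines() if line.strip()]
    best = min(records, key=lambda r: r["val_loss"])
    return CARD_TEMPLATE.substitute(
        model_name=model_name,
        val_loss=best["val_loss"],
        top1=best["top1_accuracy"],
    )

metrics = "\n".join(json.dumps(r) for r in [
    {"step": 500, "val_loss": 1.9, "top1_accuracy": 0.41},
    {"step": 1000, "val_loss": 1.7, "top1_accuracy": 0.45},
])
card = render_card(metrics, "pawn-small")
```

The key property the script enforces is the same as here: every number in the rendered card is read from the fetched metrics, never typed into the template.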
- `cards/model/pawn-{small,base,large}.md`: generated model cards checked into the repo as the source of truth. Changes go through PRs.
- `cards/hf_model_card.md.j2`: Jinja2 template (moved from `templates/`).
- `.github/workflows/sync-model-cards.yml`: on push to main, uploads `cards/model/*.md` to the corresponding HF model repos as `README.md`.
- `scripts/generate_model_cards.py`: updated output dir to `cards/model/`; warns loudly on missing optional fields (`top5`, `legal_rate`) from older training runs but does not silently default to zero.
`fetch_best_metrics()` now merges `top5_accuracy`, `legal_move_rate`, and `perplexity` from the best val record that has them when the overall best-loss record doesn't. This handles the case where val loss was logged every 500 steps but backfilled extended metrics exist only at 5K-step checkpoint boundaries.
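The merge described above can be sketched like this. This is an illustrative sketch, not the PR's actual `fetch_best_metrics()` implementation; the field names follow the ones mentioned in this PR, and the helper name is hypothetical.

```python
EXTENDED_KEYS = ("top5_accuracy", "legal_move_rate", "perplexity")

def merge_best_metrics(val_records: list[dict]) -> dict:
    """Start from the best-loss val record; backfill extended metrics
    from the best record that actually carries them."""
    best = min(val_records, key=lambda r: r["val_loss"]).copy()
    missing = [k for k in EXTENDED_KEYS if k not in best]
    if missing:
        # Only checkpoint-boundary records carry the backfilled extended metrics.
        with_extended = [r for r in val_records if all(k in r for k in EXTENDED_KEYS)]
        if with_extended:
            donor = min(with_extended, key=lambda r: r["val_loss"])
            for k in missing:
                best[k] = donor[k]
    return best

records = [
    # Best loss, logged at a 500-step interval: no extended metrics.
    {"step": 4500, "val_loss": 1.62},
    # 5K-step checkpoint boundary: extended metrics present, slightly worse loss.
    {"step": 5000, "val_loss": 1.65, "top5_accuracy": 0.81,
     "legal_move_rate": 0.99, "perplexity": 5.2},
]
merged = merge_best_metrics(records)
```

The point is that the best loss is never replaced; only the missing extended fields are borrowed from the nearest record that has them, so nothing silently defaults to zero.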
thomas-schweich added a commit that referenced this pull request on Apr 2, 2026:
1. `is_alive` returns an `(alive, exit_code)` tuple — no more lost exit codes from zombie reaping (#1)
2. `_spawn` wraps `Popen` in try/finally for the log file handle (#2)
3. `kill()` no longer releases the GPU immediately — the monitor detects actual process exit and releases then, avoiding double-assignment during graceful shutdown (#3)
4. `Trial.from_dict` filters to known dataclass fields — survives schema evolution (#4)
5. `shutdown()` cancels monitor tasks cleanly (#6)
6. `recover()` emits trial_completed/trial_failed events for trials that died during server downtime (#7)
7. `check_health` NaN threshold scales with total_steps (#8)
thomas-schweich added a commit that referenced this pull request on Apr 2, 2026:
* Add pawn-lab MCP server for trial management

  Adds a stateful daemon that manages GPU-isolated training processes with Optuna-driven hyperparameter sweeps. Exposes 12 MCP tools (lab_status, lab_launch, lab_sweep, lab_kill, lab_results, etc.) over stdio so Claude Code can drive experiments without manual process management.

  Key features:
  - Process spawning with CUDA_VISIBLE_DEVICES isolation
  - Async metrics monitoring (polls metrics.jsonl every 5s)
  - Optuna ask/tell autopilot with auto-advance on completion
  - State persistence for recovery across MCP server restarts
  - Event log for trial completions, failures, and health warnings
  - Progress log rendering (pod_manager.md)

* Fix runner for ROCm and zombie process detection

  - GPU discovery via PyTorch instead of nvidia-smi (works on both CUDA and ROCm)
  - gpu_utilization() uses torch.cuda memory APIs
  - Fix zombie detection: _is_alive() now calls waitpid() first to reap child zombies before falling back to kill(pid, 0)
  - Derive step rate from elapsed/step when step_time is unavailable (adapter training scripts log epoch_time_s, not step_time)
  - Handle multi-word python commands (e.g. "uv run python")

* Gitignore checkpoints directory

* Fix sweep to use custom search space; improve tool descriptions

  - _autopilot_next now prefers sweep_distributions when explicitly configured, instead of always falling back to built-in distributions. This was causing custom search_space to be ignored.
  - Expanded all MCP tool descriptions with examples, parameter documentation, and usage guidance for future agents.

* Fix run dir discovery to pick most recent metrics.jsonl

* Add MCP log-message notifications for trial events

  When a trial completes, fails, or all GPUs go idle, the runner pushes a notification via the MCP session's send_log_message(). Claude Code surfaces these as log messages in the conversation, enabling an event-driven workflow instead of pure polling. The session is captured from the request context on the first tool call and passed to the runner via set_notify(). Falls back silently if notifications aren't supported.

* Refactor pawn-lab: split into modules, switch to FastMCP

  Split the monolithic runner.py (1085 lines) into focused modules:
  - state.py (73) — Trial dataclass, duration formatting
  - monitor.py (128) — metrics reading, health checks, process liveness
  - sweep.py (149) — Optuna search spaces, study management
  - runner.py (784) — core process lifecycle, GPU management, events
  - server.py (175) — FastMCP @mcp.tool decorators, lifespan context

  FastMCP replaces the manual 200-line TOOLS dict + dispatch with decorated async functions. Lifespan context initializes the runner on server startup. Background notifications use ctx.session captured on the first tool call. Switches dependency from mcp>=1.0.0 to fastmcp>=2.0.0.

* Lazy GPU discovery to avoid torch import at MCP server startup

* Move GPU discovery to subprocess to avoid torch CPU spin

  Torch's ROCm/HIP runtime spawns ~16 background threads that busy-spin at ~30% CPU permanently once imported. Moving GPU discovery to a subprocess keeps the MCP server process clean. Also simplified gpu_utilization() to not require torch.

* Fix lab_seed to accept JSON strings for params/values

* Fix seed_trial to use sweep distributions when available

* Persist custom search space specs across MCP server restarts

* Add logging to notification path for debugging

* Add manage-pod skill; persist sweep search space; debug logging

  - Un-ignore .claude/skills/manage-pod/ and track SKILL.md
  - Skill now instructs agents to maintain Lab Notes in CLAUDE.md as a handwritten research log for context compaction recovery
  - Persist sweep_search_space to lab_state.json so custom distributions survive MCP server restarts
  - Fix seed_trial to use sweep distributions matching the study
  - Add debug logging to notification path

* Remove autopilot, add lab_suggest and lab_log

  Simplify the runner by removing all autopilot/sweep machinery:
  - No more autopilot state, configure_sweep, pause, resume, pin, seed
  - No more notification plumbing (_notify, _push, _session)
  - _on_complete/_on_failed are now sync (no autopilot to await)

  Replace lab_sweep with lab_suggest: creates an ephemeral Optuna study seeded from completed trials and returns a suggestion. The agent decides whether to launch — no auto-advance.

  Add lab_log: returns the last N lines of a trial's stdout log for debugging failures without manual file access.

  Compact lab_status output: key HPs inline instead of full CLI commands. Results include wall_time per trial.

  runner.py: 583 lines (was 784). server.py: 154 lines (was 175). sweep.py: 87 lines (was 149). Total: 1048 (was 1332).

* Fold Optuna suggestion into lab_results, drop lab_suggest tool

  lab_results(suggest_strategy="bottleneck") now includes an optuna_suggestion field with suggested params, seeded from all completed trials. The suggestion is ephemeral (in-memory study, no persistence). Agents see it passively when reviewing results rather than needing to call a separate tool. 9 tools → 8 tools. server.py: 95 lines (was 154).

* Update manage-pod skill: remove autopilot references, document agent-driven loop

* Gitignore pawn-lab ephemeral files; remove stale optuna-storage

* Move lab state to runs/ directory; lab notes via PostCompact hook

  - Runner defaults workspace to runs/ locally (still /workspace on pods)
  - All ephemeral files (lab_state.json, lab_events.jsonl, pod_manager.md, sweep_results/, lab-notes.md) now live under gitignored runs/
  - PostCompact hook in .claude/settings.json re-injects runs/lab-notes.md after context compaction (not committed — project-local config)
  - Skill updated to reference runs/lab-notes.md instead of CLAUDE.md
  - Lab notes removed from CLAUDE.md (migrated to runs/lab-notes.md)

* Address code review: 7 fixes

  1. is_alive returns (alive, exit_code) tuple — no more lost exit codes from zombie reaping (#1)
  2. _spawn wraps Popen in try/finally for log file handle (#2)
  3. kill() no longer releases GPU immediately — monitor detects actual process exit and releases then, avoiding double-assignment during graceful shutdown (#3)
  4. Trial.from_dict filters to known dataclass fields — survives schema evolution (#4)
  5. shutdown() cancels monitor tasks cleanly (#6)
  6. recover() emits trial_completed/trial_failed events for trials that died during server downtime (#7)
  7. check_health NaN threshold scales with total_steps (#8)

* Fix Pareto front: use proper 2D dominance instead of 1D frontier

* lab_results: require strategy, always show 3 Optuna suggestions

  - strategy parameter is now required (determines search space)
  - Always generates 3 suggestions via an ephemeral Optuna study, even with zero completed trials (pure exploration from the prior)
  - Suggestions seeded from completed trials when available
  - Exhaustive strategy list in tool docstring

* Fix kill GPU release: poll until process actually exits
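The Pareto-front fix in the commits above (proper 2D dominance rather than a 1D frontier) can be illustrated with a small sketch. The objective names here are hypothetical and both are assumed minimized; this is not the runner's actual code.

```python
def pareto_front(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the non-dominated points for two minimized objectives,
    e.g. (val_loss, wall_time).

    A point p is dominated if some other point q is <= p on both
    objectives and strictly < on at least one. Sorting by a single
    objective (a 1D frontier) misses points that trade one objective
    for the other.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

trials = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (1.5, 6.0)]
front = pareto_front(trials)
```

Here (1.0, 5.0) and (2.0, 3.0) survive as genuine trade-offs, while (3.0, 4.0) and (1.5, 6.0) are each dominated outright; a 1D sort on the first objective alone would have kept (1.5, 6.0).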
Summary
PAWN no longer mirrors the GitHub repo to HuggingFace. The individual model repos (pawn-small, pawn-base, pawn-large) and dataset repos (pawn-lichess-full, stockfish-nodes1) on HF serve their purpose — a mirror of the code repo added no value and caused the README to be formatted as a model card.

- Removed `sync-to-hf.yml`, the GitHub Action that pushed to HF on every commit to main
- Removed the HF YAML frontmatter from `README.md` so it reads as a normal GitHub README
- Added `templates/hf_model_card_template.md` — references to papers and projects PAWN builds on, to be included in the individual model cards on HF