Skills to guide Claude Code, Codex, and other coding agents on using the Weights & Biases AI developer platform to train models and build agents.
- Log metrics and rich media during model training and fine-tuning
- Track model training experiments
- Analyze runs and experiment results to understand how the model is learning
- Tune hyperparameters
- Trace agentic AI applications
- Analyze traces and classify them into failure modes
- Evaluate models with labeled datasets
- Run online evaluations for production monitoring

`npx skills add wandb/skills`

Then set your W&B API key:

`export WANDB_API_KEY=<your-key>`

`npx skills` is a utility for installing skills into major coding agent CLIs. Use `--global` to install for all projects, or `--agent <name>` to target a specific agent. See the npx skills docs for more details.
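
Before starting an agent session, it can help to confirm that the API key from the step above is actually exported in the current shell. The function below is a minimal sketch of such a check; `check_wandb_key` is a hypothetical helper name and is not part of `wandb/skills` or the `npx skills` utility.

```shell
# Hypothetical helper (not shipped with wandb/skills):
# fail early when WANDB_API_KEY is missing or empty.
check_wandb_key() {
  if [ -z "${WANDB_API_KEY:-}" ]; then
    echo "WANDB_API_KEY is not set" >&2
    return 1
  fi
  # Report success without printing the key itself.
  echo "WANDB_API_KEY is set"
}
```

Calling `check_wandb_key` returns a non-zero status when the variable is unset, so it can be chained in a setup script, for example `check_wandb_key && npx skills add wandb/skills`.
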

| Skill | Description | Status |
|---|---|---|
| `wandb-primary` | Comprehensive primary skill for agents working with Weights & Biases. Covers both the W&B and Weave SDKs. | claude-code: 32/35 (91%) |

We maintain a growing internal benchmark suite that evaluates each skill across coding agents and task categories. Skills are evaluated automatically on every merge to main.

| Category | Tasks | Claude Code (sonnet4.6) | Codex (gpt-5.3-codex) |
|---|---|---|---|
| Weave analysis | 26 | 97%* | 63%* |
| Weave tooling | 11 | 95%* | 83%* |
| Model training | 8 | 90%* | 85%* |
| LLM finetuning & RL analysis | 14 | 72%* | 86%* |
| Failure & outlier detection | 8 | 86%* | 63%* |

\*Pass rates are +/- 3%. Many tasks span multiple categories.
See CONTRIBUTING.md.