One-click accuracy evaluation harness for SGLang.
Point at any OpenAI-compatible endpoint. Scoring logic (graders, evaluators, prompts, dataset configs) is vendored from NeMo-Skills; sgl-eval contributes the transport, runner, and benchmark wiring.
pip install git+https://github.com/sgl-project/sgl-eval
sgl-eval ping --base-url http://localhost:30000/v1
sgl-eval run gsm8k --base-url http://localhost:30000/v1 --num-examples 50Three subcommands: list, ping, run <name>. See sgl-eval --help for
flags.
Each run prints a summary with the headline metric on top -- single-shot
accuracy, averaged across the k repeats when k > 1 -- and writes the
full payload as JSON under --out-dir. For example:
== aime25 ==
30 examples x 16 repeats | 823.7s | 4293 tok/s | 3.5M tokens
* pass@1[avg-of-16] = 78.96% +/- 1.21% (SEM 0.30%)
pass@16 = 93.33%
majority@16 = 93.33%
no_answer = 20.00% [warn: consider --max-tokens]
Save a (benchmark, endpoint, sampling, n_repeats, expected) bundle to
~/.sgl_eval/presets/<name>.yaml and replay with sgl-eval run --preset <name>. See preset.md for schema, example, usage, and
override priority.
sgl-eval list for the registered set; sgl-eval list -v for per-benchmark
defaults (n_repeats, thinking, sampling params). All scoring behavior
(prompt, answer extraction, grading, pass@k / majority@k aggregation) comes
from the vendored NeMo-Skills slice.
Anything that decides a score is vendored verbatim from NeMo-Skills. sgl-eval contributes only transport: an OpenAI client, a threadpool runner, a CLI, and the thin glue that wires upstream pieces into one command.
+----------------------------------------------------+
| sgl-eval |
| cli, sampler, runner, registry, metrics |
| evals/ |
+----------------------------------------------------+
| vendored from NeMo-Skills |
| math_grader, evaluator/, metrics/, |
| dataset/<bench>/, prompts/*.yaml |
+----------------------------------------------------+
The slice is pinned at a specific commit in
sgl_eval/_vendored/nemo_skills/SOURCES.yaml. To upgrade, bump
synced_from_sha there and run:
python scripts/sync_vendored.py # re-fetch all vendored files
pytest # upstream's own tests run against the
# new slice -- catches behavior drift- Replace the accuracy-eval surface in
sgl-project/sglang. Todaysglang.test.run_eval+ assorted per-test ad-hoc harnesses do this job. sgl-eval aims to be the single client SGLang's CI calls. - More benchmarks within
mathandmultichoice(MATH-500, AIME26, MMLU-Pro, GPQA-extended, ...). Each is one row in_registry.py:_TABLE. - New metrics types (require a new runner per category, but graders
are usually already in NeMo-Skills):
long_context: LongBench V2, RULER, MRCRcode: HumanEval, MBPP, LiveCodeBench (with execution sandbox)instruction_following: IFEval, IFBenchmultimodal(VLM): MMMU, MathVista (needs an image-aware sampler)agentic/ tool use: BFCL, Tau-Bench
- More vendor sources beyond NeMo-Skills, when their slice is the
best canonical implementation:
lm-evaluation-harness,lmms-eval,openai/simple-evals. Same_vendored/<source>/+SOURCES.yamlpattern. - LLM-as-judge benchmarks (Arena-Hard, MTBench). Needs a second judge endpoint and prompt-pair handling -- a real architectural addition, not just a benchmark row.
- Regression CI infra: publish per-run metrics to a
sgl-eval-datarepo, compare against rolling baselines, fail PRs on regression.
- Performance benchmarking. Latency, throughput, scheduling. Lives
in SGLang's
bench_serving.py. sgl-eval recordslatency/output_throughputonly as side metrics, never as the headline. - Model training or fine-tuning.
- Multi-server orchestration. Each invocation targets one OpenAI-compatible endpoint.
- Browser / OS-level agent loops (full BrowseComp-style sandboxing).
Apache-2.0. See LICENSE. Vendored NeMo-Skills sources are also Apache-2.0;
see NOTICE for attribution and the list of vendored files.