Know how fast your LLM endpoint actually is.
infermark benchmarks any OpenAI-compatible API endpoint — vLLM, TGI, Ollama, SGLang, or anything behind /v1/chat/completions. It measures what matters: time to first token, inter-token latency, throughput under load, and tail latencies. One command, no config files, real numbers.
Both llmperf and llm-bench were archived in 2025. infermark fills the gap.
| Metric | What it tells you |
|---|---|
| TTFT | Time to first token — how long until streaming starts |
| ITL | Inter-token latency — smoothness of the stream |
| Throughput (tok/s) | Output tokens per second across all concurrent requests |
| P50 / P95 / P99 | Tail latency distribution at each concurrency level |
| Error rate | Failed requests under load |
| RPS | Requests per second the server can sustain |
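To make the metrics concrete, here is a minimal sketch, independent of infermark, of how TTFT, ITL, throughput, and percentiles fall out of per-token arrival times (the `summarize` and `percentile` helpers and the synthetic timestamps are illustrative, not part of the library):

```python
import statistics

def summarize(start: float, token_times: list[float]) -> dict:
    """Derive per-request latency metrics from one streamed response.

    start        -- wall-clock time the request was sent
    token_times  -- wall-clock arrival time of each output token
    """
    ttft = token_times[0] - start                     # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = statistics.mean(gaps) if gaps else 0.0      # mean inter-token latency
    duration = token_times[-1] - start
    tok_per_s = len(token_times) / duration           # output tokens per second
    return {"ttft": ttft, "itl": itl, "tok_per_s": tok_per_s}

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile, e.g. p=0.95 for P95."""
    ranked = sorted(values)
    idx = min(len(ranked) - 1, int(p * len(ranked)))
    return ranked[idx]

# Synthetic stream: first token after 120 ms, then one token every 20 ms
times = [0.12 + 0.02 * i for i in range(10)]
m = summarize(0.0, times)
print(f"TTFT {m['ttft']*1000:.0f} ms, ITL {m['itl']*1000:.0f} ms, {m['tok_per_s']:.0f} tok/s")
```

P50/P95/P99 in the table are these percentiles taken over many such requests at a fixed concurrency level.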
```bash
pip install infermark
```

With the CLI (rich tables, progress):

```bash
pip install "infermark[cli]"
```

```bash
# Benchmark a local vLLM server
infermark run http://localhost:8000/v1 --model meta-llama/Llama-3-70B -n 50

# Sweep concurrency levels
infermark run http://localhost:8000/v1 -c 1,4,8,16,32,64 -n 100

# Save results as JSON
infermark run http://localhost:8000/v1 -o results.json

# Compare multiple endpoints
infermark compare vllm.json tgi.json ollama.json
```

```python
from infermark import BenchmarkConfig, run_benchmark

config = BenchmarkConfig(
    url="http://localhost:8000/v1",
    model="meta-llama/Llama-3-70B-Instruct",
    concurrency_levels=[1, 4, 8, 16, 32],
    n_requests=100,
    max_tokens=256,
)
report = run_benchmark(config)

# Best throughput
best = report.best_throughput()
print(f"Peak: {best.tokens_per_second:.1f} tok/s at concurrency {best.concurrency}")

# Lowest latency
low = report.lowest_latency()
print(f"Lowest P50: {low.latency.p50 * 1000:.1f} ms at concurrency {low.concurrency}")
```

```python
import asyncio

from infermark import BenchmarkConfig, run_benchmark_async

async def main():
    config = BenchmarkConfig(url="http://localhost:8000/v1", model="llama-3")
    report = await run_benchmark_async(config)
    print(f"Peak throughput: {report.best_throughput().tokens_per_second:.1f} tok/s")

asyncio.run(main())
```

Find out whether vLLM, TGI, or Ollama is faster for your model and hardware:
```bash
# Benchmark each endpoint separately
infermark run http://gpu1:8000/v1 --model llama-3 -o vllm.json
infermark run http://gpu2:8080/v1 --model llama-3 -o tgi.json
infermark run http://gpu3:11434/v1 --model llama-3 -o ollama.json

# Side-by-side comparison
infermark compare vllm.json tgi.json ollama.json
```

```bash
# JSON (for programmatic analysis)
infermark run http://localhost:8000/v1 -o report.json

# Markdown (paste into docs/PRs)
infermark run http://localhost:8000/v1 --markdown report.md
```

```python
BenchmarkConfig(
    url="http://localhost:8000/v1",     # Any OpenAI-compatible endpoint
    model="meta-llama/Llama-3-70B",     # Model name
    prompt="Explain relativity.",       # Prompt to send
    max_tokens=256,                     # Max output tokens per request
    concurrency_levels=[1, 4, 8, 16],   # Test these concurrency levels
    n_requests=100,                     # Requests per level
    timeout=120.0,                      # Per-request timeout (seconds)
    mode=BenchmarkMode.STREAMING,       # STREAMING or NON_STREAMING
    warmup=3,                           # Warmup requests before measurement
    api_key="sk-...",                   # Optional API key
)
```

- Warmup — Sends a few requests to prime the server's KV cache and JIT compilation
- For each concurrency level — Fires N requests with M concurrent workers using asyncio
- Streaming measurement — Parses SSE chunks to measure TTFT and inter-token latency
- Statistics — Computes P50/P75/P90/P95/P99, mean, min, max, std from raw timings
- Report — Rich terminal tables, JSON, or Markdown output
Anything that speaks the OpenAI chat completions API:
- vLLM
- Text Generation Inference (TGI)
- SGLang
- Ollama (with `OLLAMA_ORIGINS=*`)
- llama.cpp server
- LiteLLM proxy
- OpenAI, Anthropic (via compatible proxy), Together, Fireworks, etc.
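All of these servers stream responses as Server-Sent Events in the OpenAI `chat.completions` chunk format, which is what the streaming measurement step parses. As a rough sketch of that parsing (the `iter_chunks` helper and the sample lines are illustrative; the payload shape follows the OpenAI streaming spec):

```python
import json

def iter_chunks(sse_lines):
    """Yield content deltas from an OpenAI-style SSE stream."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                      # skip comments and blank keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":           # end-of-stream sentinel
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Synthetic stream in the OpenAI chat.completions streaming format
raw = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_chunks(raw)))   # timestamping each yield gives TTFT and ITL
```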
Part of the stef41 LLM toolkit — open-source tools for every stage of the LLM lifecycle:
| Project | What it does |
|---|---|
| tokonomics | Token counting & cost management for LLM APIs |
| datacrux | Training data quality — dedup, PII, contamination |
| castwright | Synthetic instruction data generation |
| datamix | Dataset mixing & curriculum optimization |
| toksight | Tokenizer analysis & comparison |
| trainpulse | Training health monitoring |
| ckpt | Checkpoint inspection, diffing & merging |
| quantbench | Quantization quality analysis |
| modeldiff | Behavioral regression testing |
| vibesafe | AI-generated code safety scanner |
| injectionguard | Prompt injection detection |
Apache-2.0