Here the Step by Step instructions can be found to benchmark OCR models.
In the following description all steps, including setting up the python environment are listed.
| Script | Purpose |
|---|---|
bench_cv_endpoint.py |
Benchmark OCR models against real document images (CV/resume pages) |
acquire_cv_data.py |
Download and render real CV documents from a public dataset for use with the OCR benchmark |
- Python 3.12+
- A Stoney AI on Demand key (
STONEY_KEY) - For OCR benchmarks: disk space for CV image data (~200MB or more, depending on how much data should be copied)
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrcuv venv --python 3.12 ~/.venv-bench
source ~/.venv-bench/bin/activateFor OCR benchmarks (the Python scripts):
uv pip install Pillow pypdfium2 huggingface_hubFor LLM benchmarks (vllm bench CLI):
uv pip install vllm#Set your personal key:
STONEY_KEY=sk-your-key-here
# Make key visible for bench script:
export OPENAI_API_KEY=$STONEY_KEY curl https://llm.stoney-cloud.com/v1/models \
--silent --fail --show-error \
--header "Authorization: Bearer $STONEY_KEY" \
| jqThis should return a list of available models.
Uses vllm bench to measure throughput and latency of text generation models through the API gateway.
vllm bench serve \
--backend openai-chat \
--model "Qwen/Qwen3-Coder-Next" \
--base-url https://llm.stoney-cloud.com \
--endpoint /v1/chat/completions \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 256 \
--num-prompts 50 \
--max-concurrency 1 \
--tokenizer "Qwen/Qwen2.5-7B-Instruct" \
--percentile-metrics ttftThe console prints a summary per run:
============ Serving Benchmark Result ============
Successful requests: 48
Failed requests: 2
Maximum request concurrency: 1
Benchmark duration (s): 157.46
Total input tokens: 49536
Total generated tokens: 12288
Request throughput (req/s): 0.30
Output token throughput (tok/s): 78.04
Peak output token throughput (tok/s): 257.00
Peak concurrent requests: 2.00
Total token throughput (tok/s): 392.63
---------------Time to First Token----------------
Mean TTFT (ms): 3143.01
Median TTFT (ms): 3142.48
P99 TTFT (ms): 3257.47
==================================================- Successful requests: Successful prompt requests
- Failed requests: Unsuccessful prompts
- Maximum request concurrency: How many requests the model processes simultaneously.
- Benchmark duration (s): The duration of the benchmark run in seconds.
- Total input tokens: The total number of input tokens.
- Total generated tokens: The total number of tokens generated by the model.
- Request throughput (req/s): The number of requests processed per second.
- Output token throughput (tok/s): The average number of tokens generated per second.
- Peak output token throughput (tok/s): The maximum measured number of output tokens per second.
- Peak concurrent requests: The maximum measured number of requests processed simultaneously.
- Total token throughput (tok/s): The average of all tokens processed during the measurement.
- Mean Time to First Token (TTFT) (ms): The average time elapsed between input and the first visible output.
- Median TTFT (ms): The expected time between input and the first visible output. Also known as TTFT p50.
- p50: Means that 50% of all requests are processed faster.
- p99 TTFT (ms): The time elapsed in the “worst case” scenario until the first token is generated.
- p99: Means that 99% of all requests are processed faster.
- Tokenizer: The tokenizer is used to send queries to the evaluated model during a benchmark. These are typically small, publicly available models, such as Qwen/Qwen2.5-7B-Instruct.
Uses a custom Python script to benchmark OCR models against real CV/resume documents, measuring pages per minute and latency.
Download and render real CVs from a public HuggingFace dataset:
python acquire_cv_data.py --count 50 --out cv_bench_dataThis downloads 50 CV PDFs and renders each page to a PNG image in cv_bench_data/. Each page becomes one benchmark request.
Verify the data:
ls cv_bench_data/*.png | wc -lpython bench_cv_endpoint.py \
--endpoint https://llm.stoney-cloud.com/v1/chat/completions \
--data cv_bench_data \
--model "lightonai/LightOnOCR-2-1B" \
--api-key $STONEY_KEY \
--concurrency 1 \
--limit 20| Parameter | What it controls |
|---|---|
--model |
Model ID as shown by /v1/models |
--data |
Path to the directory of rendered CV page images |
--concurrency |
Simultaneous requests |
--limit |
Number of page images to process per run |
--max-tokens |
Maximum output tokens per page (default: 4096) |
The script prints a summary per run:
--- benchmark result (concurrency 1) ---
concurrency : 1
requested : 50
ok : 50
failed : 0
duration_s : 93.958
pages_s : 0.532
pages_min : 31.9
out_tok_s : 419.4
latency_p50_s : 1.63
latency_p99_s : 10.016
CSV output is also written to stdout for easy collection into result files.
- concurrency: How many requests the model processes simultaneously.
- requested: How many requests were sent.
- ok: Number of accepted requests (in this case, CVs).
- failed: Number of rejected requests.
- duration_s: The duration of the benchmark run.
- pages_s: The average number of pages that can be processed per second.
- pages_min: The average number of pages that can be processed per minute.
- out_tok_s: The number of tokens generated per second.
- latency_p50_s: The average response time in seconds.
- latency_p99_s: The response time required in the "worst case" scenario, in seconds.
- p50: Means that 50% of all requests are processed faster.
- p99: Means that 99% of all requests are processed faster.