Model Benchmarking

Here the Step by Step instructions can be found to benchmark OCR models.

In the following description all steps, including setting up the python environment are listed.

What's in this repo

Script	Purpose
`bench_cv_endpoint.py`	Benchmark OCR models against real document images (CV/resume pages)
`acquire_cv_data.py`	Download and render real CV documents from a public dataset for use with the OCR benchmark

Prerequisites

Python 3.12+
A Stoney AI on Demand key (STONEY_KEY)
For OCR benchmarks: disk space for CV image data (~200MB or more, depending on how much data should be copied)

Setup

1. Install uv (Python package manager)

curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc

2. Create a virtual environment

uv venv --python 3.12 ~/.venv-bench
source ~/.venv-bench/bin/activate

3. Install dependencies

For OCR benchmarks (the Python scripts):

uv pip install Pillow pypdfium2 huggingface_hub

For LLM benchmarks (vllm bench CLI):

uv pip install vllm

4. Set your API key

#Set your personal key: 
STONEY_KEY=sk-your-key-here

# Make key visible for bench script:
export OPENAI_API_KEY=$STONEY_KEY

5. Verify access

curl https://llm.stoney-cloud.com/v1/models \
  --silent --fail --show-error \
  --header "Authorization: Bearer $STONEY_KEY" \
  | jq

This should return a list of available models.

Benchmarking LLMs (text generation models)

Uses vllm bench to measure throughput and latency of text generation models through the API gateway.

Quick single test

vllm bench serve \
  --backend openai-chat \
  --model "Qwen/Qwen3-Coder-Next" \
  --base-url https://llm.stoney-cloud.com \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --num-prompts 50 \
  --max-concurrency 1 \
  --tokenizer "Qwen/Qwen2.5-7B-Instruct" \
  --percentile-metrics ttft

Output

The console prints a summary per run:

============ Serving Benchmark Result ============
Successful requests:                     48        
Failed requests:                         2         
Maximum request concurrency:             1         
Benchmark duration (s):                  157.46    
Total input tokens:                      49536     
Total generated tokens:                  12288     
Request throughput (req/s):              0.30      
Output token throughput (tok/s):         78.04     
Peak output token throughput (tok/s):    257.00    
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          392.63    
---------------Time to First Token----------------
Mean TTFT (ms):                          3143.01   
Median TTFT (ms):                        3142.48   
P99 TTFT (ms):                           3257.47   
==================================================

Understanding the metrics for LLMs (Text generation) using vllm bench

Successful requests: Successful prompt requests
Failed requests: Unsuccessful prompts
Maximum request concurrency: How many requests the model processes simultaneously.
Benchmark duration (s): The duration of the benchmark run in seconds.
Total input tokens: The total number of input tokens.
Total generated tokens: The total number of tokens generated by the model.
Request throughput (req/s): The number of requests processed per second.
Output token throughput (tok/s): The average number of tokens generated per second.
Peak output token throughput (tok/s): The maximum measured number of output tokens per second.
Peak concurrent requests: The maximum measured number of requests processed simultaneously.
Total token throughput (tok/s): The average of all tokens processed during the measurement.
Mean Time to First Token (TTFT) (ms): The average time elapsed between input and the first visible output.
Median TTFT (ms): The expected time between input and the first visible output. Also known as TTFT p50.
p50: Means that 50% of all requests are processed faster.
p99 TTFT (ms): The time elapsed in the “worst case” scenario until the first token is generated.
p99: Means that 99% of all requests are processed faster.
Tokenizer: The tokenizer is used to send queries to the evaluated model during a benchmark. These are typically small, publicly available models, such as Qwen/Qwen2.5-7B-Instruct.

Benchmarking OCR models (with real documents)

Uses a custom Python script to benchmark OCR models against real CV/resume documents, measuring pages per minute and latency.

Step 1 — Acquire test data

Download and render real CVs from a public HuggingFace dataset:

python acquire_cv_data.py --count 50 --out cv_bench_data

This downloads 50 CV PDFs and renders each page to a PNG image in cv_bench_data/. Each page becomes one benchmark request.

Verify the data:

ls cv_bench_data/*.png | wc -l

Step 2 — Quick single test

python bench_cv_endpoint.py \
  --endpoint https://llm.stoney-cloud.com/v1/chat/completions \
  --data cv_bench_data \
  --model "lightonai/LightOnOCR-2-1B" \
  --api-key $STONEY_KEY \
  --concurrency 1 \
  --limit 20

Key parameters

Parameter	What it controls
`--model`	Model ID as shown by `/v1/models`
`--data`	Path to the directory of rendered CV page images
`--concurrency`	Simultaneous requests
`--limit`	Number of page images to process per run
`--max-tokens`	Maximum output tokens per page (default: 4096)

Output

The script prints a summary per run:

  --- benchmark result (concurrency 1) ---
   concurrency   : 1
  requested     : 50
  ok            : 50
  failed        : 0
  duration_s    : 93.958
  pages_s       : 0.532
  pages_min     : 31.9
  out_tok_s     : 419.4
  latency_p50_s : 1.63
  latency_p99_s : 10.016

CSV output is also written to stdout for easy collection into result files.

Understanding the metrics for OCR

concurrency: How many requests the model processes simultaneously.
requested: How many requests were sent.
ok: Number of accepted requests (in this case, CVs).
failed: Number of rejected requests.
duration_s: The duration of the benchmark run.
pages_s: The average number of pages that can be processed per second.
pages_min: The average number of pages that can be processed per minute.
out_tok_s: The number of tokens generated per second.
latency_p50_s: The average response time in seconds.
latency_p99_s: The response time required in the "worst case" scenario, in seconds.
p50: Means that 50% of all requests are processed faster.
p99: Means that 99% of all requests are processed faster.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
LICENSE		LICENSE
README.md		README.md
acquire_cv_data.py		acquire_cv_data.py
cv_bench_endpoint.py		cv_bench_endpoint.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Model Benchmarking

What's in this repo

Prerequisites

Setup

1. Install uv (Python package manager)

2. Create a virtual environment

3. Install dependencies

4. Set your API key

5. Verify access

Benchmarking LLMs (text generation models)

Quick single test

Output

Understanding the metrics for LLMs (Text generation) using vllm bench

Benchmarking OCR models (with real documents)

Step 1 — Acquire test data

Step 2 — Quick single test

Key parameters

Output

Understanding the metrics for OCR

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Model Benchmarking

What's in this repo

Prerequisites

Setup

1. Install uv (Python package manager)

2. Create a virtual environment

3. Install dependencies

4. Set your API key

5. Verify access

Benchmarking LLMs (text generation models)

Quick single test

Output

Understanding the metrics for LLMs (Text generation) using vllm bench

Benchmarking OCR models (with real documents)

Step 1 — Acquire test data

Step 2 — Quick single test

Key parameters

Output

Understanding the metrics for OCR

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages