Skip to content

stepping-stone/benchmark-scripts

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Model Benchmarking

Here the Step by Step instructions can be found to benchmark OCR models.

In the following description all steps, including setting up the python environment are listed.

What's in this repo

Script Purpose
bench_cv_endpoint.py Benchmark OCR models against real document images (CV/resume pages)
acquire_cv_data.py Download and render real CV documents from a public dataset for use with the OCR benchmark

Prerequisites

  • Python 3.12+
  • A Stoney AI on Demand key (STONEY_KEY)
  • For OCR benchmarks: disk space for CV image data (~200MB or more, depending on how much data should be copied)

Setup

1. Install uv (Python package manager)

curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc

2. Create a virtual environment

uv venv --python 3.12 ~/.venv-bench
source ~/.venv-bench/bin/activate

3. Install dependencies

For OCR benchmarks (the Python scripts):

uv pip install Pillow pypdfium2 huggingface_hub

For LLM benchmarks (vllm bench CLI):

uv pip install vllm

4. Set your API key

#Set your personal key: 
STONEY_KEY=sk-your-key-here

# Make key visible for bench script:
export OPENAI_API_KEY=$STONEY_KEY 

5. Verify access

curl https://llm.stoney-cloud.com/v1/models \
  --silent --fail --show-error \
  --header "Authorization: Bearer $STONEY_KEY" \
  | jq

This should return a list of available models.


Benchmarking LLMs (text generation models)

Uses vllm bench to measure throughput and latency of text generation models through the API gateway.

Quick single test

vllm bench serve \
  --backend openai-chat \
  --model "Qwen/Qwen3-Coder-Next" \
  --base-url https://llm.stoney-cloud.com \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --num-prompts 50 \
  --max-concurrency 1 \
  --tokenizer "Qwen/Qwen2.5-7B-Instruct" \
  --percentile-metrics ttft

Output

The console prints a summary per run:

============ Serving Benchmark Result ============
Successful requests:                     48        
Failed requests:                         2         
Maximum request concurrency:             1         
Benchmark duration (s):                  157.46    
Total input tokens:                      49536     
Total generated tokens:                  12288     
Request throughput (req/s):              0.30      
Output token throughput (tok/s):         78.04     
Peak output token throughput (tok/s):    257.00    
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          392.63    
---------------Time to First Token----------------
Mean TTFT (ms):                          3143.01   
Median TTFT (ms):                        3142.48   
P99 TTFT (ms):                           3257.47   
==================================================

Understanding the metrics for LLMs (Text generation) using vllm bench

  • Successful requests: Successful prompt requests
  • Failed requests: Unsuccessful prompts
  • Maximum request concurrency: How many requests the model processes simultaneously.
  • Benchmark duration (s): The duration of the benchmark run in seconds.
  • Total input tokens: The total number of input tokens.
  • Total generated tokens: The total number of tokens generated by the model.
  • Request throughput (req/s): The number of requests processed per second.
  • Output token throughput (tok/s): The average number of tokens generated per second.
  • Peak output token throughput (tok/s): The maximum measured number of output tokens per second.
  • Peak concurrent requests: The maximum measured number of requests processed simultaneously.
  • Total token throughput (tok/s): The average of all tokens processed during the measurement.
  • Mean Time to First Token (TTFT) (ms): The average time elapsed between input and the first visible output.
  • Median TTFT (ms): The expected time between input and the first visible output. Also known as TTFT p50.
  • p50: Means that 50% of all requests are processed faster.
  • p99 TTFT (ms): The time elapsed in the “worst case” scenario until the first token is generated.
  • p99: Means that 99% of all requests are processed faster.
  • Tokenizer: The tokenizer is used to send queries to the evaluated model during a benchmark. These are typically small, publicly available models, such as Qwen/Qwen2.5-7B-Instruct.

Benchmarking OCR models (with real documents)

Uses a custom Python script to benchmark OCR models against real CV/resume documents, measuring pages per minute and latency.

Step 1 — Acquire test data

Download and render real CVs from a public HuggingFace dataset:

python acquire_cv_data.py --count 50 --out cv_bench_data

This downloads 50 CV PDFs and renders each page to a PNG image in cv_bench_data/. Each page becomes one benchmark request.

Verify the data:

ls cv_bench_data/*.png | wc -l

Step 2 — Quick single test

python bench_cv_endpoint.py \
  --endpoint https://llm.stoney-cloud.com/v1/chat/completions \
  --data cv_bench_data \
  --model "lightonai/LightOnOCR-2-1B" \
  --api-key $STONEY_KEY \
  --concurrency 1 \
  --limit 20

Key parameters

Parameter What it controls
--model Model ID as shown by /v1/models
--data Path to the directory of rendered CV page images
--concurrency Simultaneous requests
--limit Number of page images to process per run
--max-tokens Maximum output tokens per page (default: 4096)

Output

The script prints a summary per run:

  --- benchmark result (concurrency 1) ---
   concurrency   : 1
  requested     : 50
  ok            : 50
  failed        : 0
  duration_s    : 93.958
  pages_s       : 0.532
  pages_min     : 31.9
  out_tok_s     : 419.4
  latency_p50_s : 1.63
  latency_p99_s : 10.016

CSV output is also written to stdout for easy collection into result files.


Understanding the metrics for OCR

  • concurrency: How many requests the model processes simultaneously.
  • requested: How many requests were sent.
  • ok: Number of accepted requests (in this case, CVs).
  • failed: Number of rejected requests.
  • duration_s: The duration of the benchmark run.
  • pages_s: The average number of pages that can be processed per second.
  • pages_min: The average number of pages that can be processed per minute.
  • out_tok_s: The number of tokens generated per second.
  • latency_p50_s: The average response time in seconds.
  • latency_p99_s: The response time required in the "worst case" scenario, in seconds.
  • p50: Means that 50% of all requests are processed faster.
  • p99: Means that 99% of all requests are processed faster.

About

Scripts and instructions for benchmarking AI models

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages