# Benchmark vLLM Server with inference-perf

### Local vLLM Setup using Docker

Run the vLLM server as a Docker container serving the Hugging Face model `HuggingFaceTB/SmolLM2-135M-Instruct`.

In [1]:
!docker-compose up -d

Creating network "vllm_default" with the default driver
Creating vllm_vllm-server_1 ... done
Creating vllm_prometheus_1  ... 

**Note**: Configure [vLLM engine arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#engine-args) such as `--max-model-len` and `--max-num-seqs` to fit your local compute capacity.
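Because `docker-compose up -d` returns before the model has finished loading, it can help to wait for the server before starting a benchmark. The sketch below polls the server's `/health` endpoint using only the Python standard library. It assumes the server is reachable on port 8000 of the local machine; the helper names `is_healthy` and `wait_until_ready` and the timeout values are illustrative and are not part of vLLM or inference-perf.

```python
import http.client
import time
import urllib.error
import urllib.request


def is_healthy(base_url, timeout_s=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until_ready(base_url, max_wait_s=120.0, poll_s=2.0,
                     probe=is_healthy, sleep=time.sleep):
    """Poll `probe(base_url)` until it succeeds or `max_wait_s` elapses.

    `probe` and `sleep` are injectable so the loop can be exercised
    without a running server.
    """
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if probe(base_url):
            return True
        sleep(poll_s)
    return False
```

In a notebook cell, `wait_until_ready("http://localhost:8000")` could run right after the `docker-compose up -d` cell; a `False` result suggests checking the container logs before benchmarking.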

### Benchmark with inference-perf

Create a configuration file for the test that uses `shareGPT` data and runs a constant-rate load for `30s`. You can also use any of the other data generators, such as `random`, `shared-prefix`, or `synthetic`, each with its own configuration, by passing the corresponding `config-*.yml` file instead.

In [2]:
!cat config.yml

data:
  type: shareGPT
load:
  type: constant
  stages:
  - rate: 1
    duration: 30
api: 
  type: chat
server:
  type: vllm
  model_name: HuggingFaceTB/SmolLM2-135M-Instruct
  base_url: http://0.0.0.0:8000
tokenizer:
  pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct
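As a quick sanity check on the load settings, the number of requests a run should issue can be estimated from the stages. The sketch below assumes `rate` is in requests per second and `duration` is in seconds, which matches the 30-second constant-rate test described above; `planned_requests` is a hypothetical helper, not an inference-perf function.

```python
def planned_requests(stages):
    """Sum rate * duration over all stages.

    Assumes `rate` is requests per second and `duration` is seconds.
    """
    return sum(stage["rate"] * stage["duration"] for stage in stages)


# The single stage from config.yml above.
stages = [{"rate": 1, "duration": 30}]
print(planned_requests(stages))  # 30
```

Under that assumption, the configuration above plans roughly 30 requests in total, which is a useful number to compare against the request count in the generated report.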

In [3]:
!inference-perf --config_file config.yml

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2025-06-29 15:53:45,430 - inference_perf.config - INFO - Using configuration from: config.yml
2025-06-29 15:53:45,432 - inference_perf.config - INFO - Benchmarking with the following config:

api:
  type: chat
  streaming: false
data:
  type: shareGPT
  input_distribution: null
  output_distribution: null
  shared_prefix: null
load:
  type: constant
  interval: 1.0
  stages:
  - rate: 1
    duration: 30
  num_workers: 10
  worker_max_concurrency: 10
  worker_max_tcp_connections: 2500
metrics: null
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false
storage:
  local_storage:
    path: reports-20250629-155345
    report_file_prefix: null
  google_cloud_storage: null
server:
  type: vllm
  model_name: HuggingFaceTB/SmolLM2-135M-Instruct
  base

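View the benchmark metrics in the reports folder that the run creates (`reports-20250629-155345` for the run above, per the `storage.local_storage.path` value in the logged config).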
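One way to get a first look at the results is to load every report file and list its top-level keys. The sketch below assumes the reports are written as JSON files directly inside the reports folder; the exact file names and their schema are not shown in this notebook, so the code discovers them with a glob instead of hard-coding them. `load_reports` and `describe` are illustrative helper names.

```python
import json
from pathlib import Path


def load_reports(folder):
    """Return {file name: parsed JSON} for every *.json file directly in `folder`."""
    reports = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            reports[path.name] = json.load(fh)
    return reports


def describe(reports):
    """Print each report's file name with its top-level keys (or its type)."""
    for name, content in reports.items():
        if isinstance(content, dict):
            summary = sorted(content)
        else:
            summary = type(content).__name__
        print(f"{name}: {summary}")
```

For the run above, `describe(load_reports("reports-20250629-155345"))` would print one line per report file found; an empty result suggests the folder name or file format differs from what is assumed here.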

### Cleanup

Stop and remove the vLLM server and Prometheus Docker containers.

In [4]:
!docker-compose down

Stopping vllm_vllm-server_1 ... 
Stopping vllm_prometheus_1  ... 
Removing vllm_vllm-server_1 ... done
Removing vllm_prometheus_1  ... 
Removing network vllm_default
