To help with development and testing, we have developed a lightweight vLLM simulator. It does not actually run inference, but it emulates responses to vLLM's HTTP REST endpoints. Currently it supports a partial OpenAI-compatible API:
- /v1/chat/completions
- /v1/completions
- /v1/models
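For example, assuming the simulator is already running on localhost:8000 and was started with the model shown (both values are illustrative), a chat completion can be requested with curl:

```bash
# Send a simple chat completion request to a locally running simulator.
# The port and model name are examples; use the values the simulator was started with.
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```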
In addition, a set of vLLM HTTP endpoints is supported. These include:
| Endpoint | Description |
|---|---|
| `/v1/load_lora_adapter` | simulates the dynamic registration of a LoRA adapter |
| `/v1/unload_lora_adapter` | simulates the dynamic unloading and unregistration of a LoRA adapter |
| `/metrics` | exposes Prometheus metrics. See the table below for details |
| `/health` | standard health check endpoint |
| `/ready` | standard readiness endpoint |
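As a sketch, and assuming the simulator accepts vLLM's dynamic LoRA request format (a JSON body with `lora_name` and `lora_path`; the adapter name and path below are placeholders), an adapter could be registered and later unregistered like this:

```bash
# Register a LoRA adapter with the simulator.
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "tweet-summary-2", "lora_path": "/adapters/tweet-summary-2"}'

# Unregister the same adapter.
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "tweet-summary-2"}'
```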
The simulator also exposes a subset of vLLM's Prometheus metrics via the `/metrics` HTTP REST endpoint. The following metrics are currently supported:
| Metric | Description |
|---|---|
| `vllm:gpu_cache_usage_perc` | The fraction of KV-cache blocks currently in use (from 0 to 1). Currently this value is always zero. |
| `vllm:lora_requests_info` | Running stats on LoRA requests |
| `vllm:num_requests_running` | Number of requests currently running on GPU |
| `vllm:num_requests_waiting` | Number of requests waiting to be processed (queued) |
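For instance, assuming the simulator listens on localhost:8000, the exposed metrics can be inspected with curl:

```bash
# Fetch the Prometheus metrics and show only the vLLM-specific ones.
curl -s http://localhost:8000/metrics | grep '^vllm:'
```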
The simulated inference is not affected by the model or the LoRA adapters specified on the command line or registered via the `/v1/load_lora_adapter` HTTP REST endpoint. The `/v1/models` endpoint, however, returns simulated results based on those same command line parameters and on the adapters loaded via `/v1/load_lora_adapter`.
The simulator supports two modes of operation:
- `echo` mode: the response contains the same text that was received in the request. For `/v1/chat/completions` the last message for the role `user` is used.
- `random` mode: the response is randomly chosen from a set of pre-defined sentences.
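For example, a simulator started in echo mode (the model name and port below are illustrative) simply returns the text it receives:

```bash
# Start the simulator in echo mode; responses echo the request text.
./bin/llm-d-inference-sim --model my_model --port 8000 --mode echo
```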
Timing of the response is defined by two parameters: `time-to-first-token` and `inter-token-latency`.

For a request with `stream=true`: `time-to-first-token` defines the delay before the first token is returned, and `inter-token-latency` defines the delay between subsequent tokens in the stream.

For a request with `stream=false`: the response is returned after a delay of `<time-to-first-token> + (<inter-token-latency> * (<number_of_output_tokens> - 1))`.
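As an illustrative sketch using the command line parameters described below, with a time-to-first-token of 100 ms, an inter-token-latency of 20 ms, and a non-streaming response of 50 output tokens, the response is returned after roughly 100 + (20 * 49) = 1080 ms:

```bash
# Start the simulator with artificial latencies (values are examples).
# A non-streaming response of 50 output tokens then takes about
# 100 + 20 * (50 - 1) = 1080 ms.
./bin/llm-d-inference-sim --model my_model --port 8000 \
  --time-to-first-token 100 --inter-token-latency 20
```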
The simulator can be run standalone or in a Pod, for testing under tools such as Kind.

API responses contain a subset of the fields provided by the OpenAI API. The structure of the supported requests and responses is shown below.
`/v1/chat/completions`
- request
  - stream
  - model
  - messages
    - role
    - content
- response
  - id
  - created
  - model
  - choices
    - index
    - finish_reason
    - message

`/v1/completions`
- request
  - stream
  - model
  - prompt
  - max_tokens (for future usage)
- response
  - id
  - created
  - model
  - choices
    - text

`/v1/models`
- response
  - object (list)
  - data
    - id
    - object (model)
    - created
    - owned_by
    - root
    - parent
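A minimal sketch of exercising the other two endpoints, assuming the simulator runs on localhost:8000 and was started with the model shown:

```bash
# Text completion request; only the fields listed above are used.
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "prompt": "Hello", "stream": false}'

# List the base model and any registered LoRA adapters.
curl http://localhost:8000/v1/models
```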
For more details see the vLLM documentation.

The simulator supports the following command line parameters:
- `config`: the path to a YAML configuration file
- `port`: the port the simulator listens on, default is 8000
- `model`: the currently 'loaded' model, mandatory
- `served-model-name`: model names exposed by the API (a list of space-separated strings)
- `lora-modules`: a list of LoRA adapters (a list of space-separated JSON strings): '{"name": "name", "path": "lora_path", "base_model_name": "id"}', optional, empty by default
- `max-loras`: maximum number of LoRAs in a single batch, optional, default is one
- `max-cpu-loras`: maximum number of LoRAs to store in CPU memory, optional, must be >= max-loras, default is max-loras
- `max-model-len`: model's context window, maximum number of tokens in a single request including input and output, optional, default is 1024
- `max-num-seqs`: maximum number of sequences per iteration (maximum number of inference requests that could be processed at the same time), default is 5
- `mode`: the simulator mode, optional, by default `random`
  - `echo`: returns the same text that was sent in the request
  - `random`: returns a sentence chosen at random from a set of pre-defined sentences
- `time-to-first-token`: the time to the first token (in milliseconds), optional, zero by default
- `inter-token-latency`: the time to 'generate' each additional token (in milliseconds), optional, zero by default
- `seed`: random seed for operations (if not set, the current Unix time in nanoseconds is used)
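Putting several of these parameters together, an illustrative invocation (the model name, adapter name and path, and latency values are all placeholders) could look like:

```bash
# Example invocation combining several of the parameters above.
./bin/llm-d-inference-sim \
  --model "Qwen/Qwen2.5-1.5B-Instruct" \
  --served-model-name "qwen-sim" \
  --port 8000 \
  --mode random \
  --max-num-seqs 10 \
  --max-model-len 2048 \
  --lora-modules '{"name": "tweet-summary-0", "path": "/adapters/tweet-summary-0", "base_model_name": "Qwen/Qwen2.5-1.5B-Instruct"}' \
  --time-to-first-token 200 \
  --inter-token-latency 50 \
  --seed 42
```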
In addition, as we are using klog, the following parameters are available:
- `add_dir_header`: if true, adds the file directory to the header of the log messages
- `alsologtostderr`: log to standard error as well as files (no effect when -logtostderr=true)
- `log_backtrace_at`: when logging hits line file:N, emit a stack trace (default :0)
- `log_dir`: if non-empty, write log files in this directory (no effect when -logtostderr=true)
- `log_file`: if non-empty, use this log file (no effect when -logtostderr=true)
- `log_file_max_size`: defines the maximum size a log file can grow to (no effect when -logtostderr=true). Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
- `logtostderr`: log to standard error instead of files (default true)
- `one_output`: if true, only write logs to their native severity level (vs also writing to each lower severity level; no effect when -logtostderr=true)
- `skip_headers`: if true, avoid header prefixes in the log messages
- `skip_log_headers`: if true, avoid headers when opening log files (no effect when -logtostderr=true)
- `stderrthreshold`: logs at or above this threshold go to stderr when writing to files and stderr (no effect when -logtostderr=true or -alsologtostderr=true) (default 2)
- `v`: number for the log level verbosity
- `vmodule`: comma-separated list of pattern=N settings for file-filtered logging
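For example, log verbosity can be raised while keeping logs on stderr (a sketch; the exact flag syntax follows standard Go flag conventions):

```bash
# Run with increased log verbosity; logtostderr defaults to true,
# so logs go to standard error.
./bin/llm-d-inference-sim --model my_model --port 8000 --v=4
```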
When migrating from earlier releases, note the following parameter changes:
- `max-running-requests` was replaced by `max-num-seqs`
- `lora` was replaced by `lora-modules`, which is now a list of JSON strings, e.g., '{"name": "name", "path": "lora_path", "base_model_name": "id"}'
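For example, an invocation that previously used the `lora` flag would now pass one JSON string per adapter to `lora-modules` (adapter names and paths below are placeholders):

```bash
# New-style LoRA flags: each adapter is described by a JSON string,
# and multiple adapters are passed as space-separated values.
./bin/llm-d-inference-sim --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 \
  --lora-modules '{"name": "tweet-summary-0", "path": "/adapters/tweet-summary-0", "base_model_name": "Qwen/Qwen2.5-1.5B-Instruct"}' \
                 '{"name": "tweet-summary-1", "path": "/adapters/tweet-summary-1", "base_model_name": "Qwen/Qwen2.5-1.5B-Instruct"}'
```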
To build a Docker image of the vLLM Simulator, run:
make image-build
Please note that the default image tag is `ghcr.io/llm-d/llm-d-inference-sim:dev`.

The following environment variables can be used to change the image tag: `REGISTRY`, `SIM_TAG`, `IMAGE_TAG_BASE` or `IMG`.
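For instance, assuming these variables are picked up by the Makefile as described, a custom registry and tag could be set like this (values are examples):

```bash
# Build the image with a custom registry and tag.
REGISTRY=quay.io/myorg SIM_TAG=v0.1.0 make image-build
```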
To run the vLLM Simulator image under Docker, run:
docker run --rm --publish 8000:8000 ghcr.io/llm-d/llm-d-inference-sim:dev --port 8000 --model "Qwen/Qwen2.5-1.5B-Instruct" --lora "tweet-summary-0,tweet-summary-1"
Note: To run the vLLM Simulator with the latest release version, replace `dev` in the above docker command with the current release tag, which can be found on GitHub.
Note: The above command exposes the simulator on port 8000, and serves the Qwen/Qwen2.5-1.5B-Instruct model.
To build the vLLM simulator to run locally as an executable, run:
make build
To run the vLLM simulator in a standalone test environment, run:
./bin/llm-d-inference-sim --model my_model --port 8000
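Once the simulator is up, the standard health and readiness endpoints can be probed, for example:

```bash
# Basic liveness and readiness checks against the locally running simulator.
curl http://localhost:8000/health
curl http://localhost:8000/ready
```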
To run the vLLM simulator in a Kubernetes cluster, run:
kubectl apply -f manifests/deployment.yaml
To verify the deployment is available, run:
kubectl get deployment vllm-llama3-8b-instruct
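To reach the simulator from outside the cluster for a quick test, a port-forward can be used (the deployment name matches the command above; port 8000 is assumed):

```bash
# Forward local port 8000 to the simulator and send a test request.
kubectl port-forward deployment/vllm-llama3-8b-instruct 8000:8000 &
sleep 2
curl http://localhost:8000/v1/models
```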