The repository contains code for fmperf (foundation model performance), a Python-based benchmarking tool for large language model (LLM) serving frameworks.
Currently it can be used to evaluate the performance and energy efficiency of TGIS and vLLM inference servers.
The high-level architecture of the benchmark is illustrated below.
The Python module fmperf can be used to specify and orchestrate benchmarking experiments. All experiments are executed on a Kubernetes (k8s) or OpenShift cluster, and fmperf uses the Python k8s client to orchestrate the different components. Specifically, fmperf provides a simple Python API for deploying an inference server as a k8s Deployment and Service. Additionally, APIs are provided for creating a k8s Job to perform load testing of the service (fmperf.loadgen).
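To make the orchestration concrete, the sketch below builds the kind of k8s objects involved directly with the official kubernetes Python client, rather than through fmperf's own API; the server name, labels, image, and port are placeholder assumptions, not fmperf's actual defaults.

# Illustrative sketch only: fmperf wraps this kind of logic behind its own Python API.
# The names, image, and port below are placeholder assumptions.
from kubernetes import client, config

config.load_kube_config()  # target the cluster from your local kubeconfig

labels = {"app": "demo-inference-server"}  # hypothetical label set

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="demo-inference-server"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="server",
                        image="vllm/vllm-openai:latest",  # placeholder image
                        ports=[client.V1ContainerPort(container_port=8000)],
                    )
                ]
            ),
        ),
    ),
)

service = client.V1Service(
    api_version="v1",
    kind="Service",
    metadata=client.V1ObjectMeta(name="demo-inference-server"),
    spec=client.V1ServiceSpec(
        selector=labels,
        ports=[client.V1ServicePort(port=8000, target_port=8000)],
    ),
)

# Create the Deployment and Service in the target namespace.
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
client.CoreV1Api().create_namespaced_service(namespace="default", body=service)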
The benchmark runs on your laptop or development machine, but needs to be pointed to a k8s or OpenShift cluster where the experiments will be run. This may be a remote cluster, or a local k8s cluster on your development machine. For full instructions on bringing up a local k8s cluster please see our documentation here.
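Which cluster is used ultimately comes down to which kubeconfig context the Python k8s client loads; a minimal sketch follows (the context names are assumptions, and fmperf's own configuration may expose this differently).

from kubernetes import config

# Remote cluster: load the corresponding context from your kubeconfig (name is an assumption).
config.load_kube_config(context="my-remote-cluster")

# Local cluster: e.g. a kind context on the development machine (also an assumption).
# config.load_kube_config(context="kind-kind")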
In order to use fmperf, we need to clone the repo and create a conda environment with the necessary dependencies:
git clone https://github.com/fmperf-project/fmperf
cd fmperf
conda create -y -n fmperf-env python=3.11
conda activate fmperf-env
pip install -r requirements.txt
pip install -e .
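To verify that the editable install worked, you can check that the fmperf package imports from your clone:
python -c "import fmperf; print(fmperf.__file__)"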
Examples of how to use fmperf as a library can be found in the examples directory, which contains various Python scripts that perform benchmarking experiments. Please see our documentation here.
It is also possible to use the load testing functionality of fmperf independently of the OpenShift orchestration layer. In this scenario, you can build the Docker container and then configure it via a .env file to point at an inference server that you have deployed locally (or remotely).
Build the Docker image using the Dockerfile:
docker build -t fmperf .
Create an environment file and populate it with the desired configuration:
cp .env.example .env
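For illustration only, a minimal .env could set the two variables referenced later in this README; the values, and the list format for SWEEP_USERS, are assumptions, so consult .env.example for the authoritative variable names and formats.

# Illustrative sketch; see .env.example for the real set of variables.
NUM_USERS=8          # fixed number of virtual users used by fmperf.loadgen.run
SWEEP_USERS=1 2 4 8  # user counts swept by fmperf.loadgen.sweep (format is an assumption)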
The benchmark needs to create a large file containing many heterogeneous requests. Rather than creating this file on every run, it is recommended to persist it. To do this, we create a directory on the host that the container can write to:
mkdir requests
chmod o+w requests
Then we run the container to generate the sample requests. There are currently two ways to do this. First, one can generate a set of requests assuming simple uniform distributions for parameters such as the number of input/output tokens:
docker run --env-file .env -it --rm -v $(pwd)/requests:/requests fmperf python -m fmperf.loadgen.generate-input
Alternatively, one can generate a set of requests using models that have been trained on requests sent to the internal production deployment of BAM:
docker run --env-file .env -it --rm -v $(pwd)/requests:/requests fmperf python -m fmperf.loadgen.generate-input --from-model
Important Note: if generating inputs for vLLM, it is also necessary to add another -v argument to mount the folder where the model weights reside (in exactly the same way they are mounted inside the inference server image). This is required because tokenization must be performed inside the fmperf image until vLLM PR 3144 is merged.
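For example, assuming the weights live under $(pwd)/models on the host and are mounted at /models inside the inference server image (both paths are assumptions; use whatever paths your deployment actually uses), the command becomes:
docker run --env-file .env -it --rm -v $(pwd)/requests:/requests -v $(pwd)/models:/models fmperf python -m fmperf.loadgen.generate-input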
Once the requests have been generated, we can run an experiment for a fixed number of virtual users (controlled via the NUM_USERS environment variable) by running the container as follows:
docker run --env-file .env -it --rm -v $(pwd)/requests:/requests fmperf python -m fmperf.loadgen.run
Finally, we can run a sweep over different numbers of virtual users (controlled via the SWEEP_USERS environment variable) as follows:
docker run --env-file .env -it --rm -v $(pwd)/requests:/requests fmperf python -m fmperf.loadgen.sweep
If you need help using the framework or encounter issues, please open an issue directly on this repo.