## Summary

Below is a summary of my findings:

- 🏁 **[exllama via text-generation-webui](https://github.com/oobabooga/text-generation-webui) is the fastest**, but I don't really like the software because installation is brittle and it's trying to do too many things at once (front end, back end, etc.).  You can also try [exllama](https://github.com/turboderp/exllama) alone without the web server (I didn't have time to try this yet).  _This wasn't the fastest until recently, when I updated to the newest version and re-ran the benchmarks._
- ❤️ **[CTranslate2](https://github.com/OpenNMT/CTranslate2) is my favorite tool, which is among the fastest but is also the easiest to use**.  The documentation is the best out of all of the solutions I tried.  Furthermore, I think that the ergonomics are excellent for the models that they support.  Unlike vLLM, CTranslate2 doesn't seem to support distributed inference just yet.
- 🛠️ **[vLLM](https://vllm.readthedocs.io/en/latest/) is really fast, but CTranslate2 can be much faster.**  On the other hand, **vLLM supports distributed inference**, which is something you will need for larger models. **vLLM might be the sweet spot for serving very large models.**
- 😐 **[Text Generation Inference](https://github.com/huggingface/text-generation-inference) is an OK option (but nowhere near as fast as `vLLM`) if you want to deploy HuggingFace LLMs in a standard way**.  TGI has some nice features like telemetry baked in ([via OpenTelemetry](https://opentelemetry.io/docs/concepts/signals/traces/)) and integration with the HF ecosystem like [inference endpoints](https://huggingface.co/inference-endpoints). One thing to note is that as of 7/28/2023, the license for TGI was changed to be **[more restrictive in a way that may interfere with certain commercial uses](https://github.com/huggingface/text-generation-inference/commit/bde25e62b33b05113519e5dbf75abda06a03328e)**. I am personally not a fan of the license.

### Rough Benchmarks

This study focuses on various approaches to optimizing **latency**.  Specifically, I want to know which tools are the most effective at optimizing latency for open source LLMs. In order to focus on latency, I hold the following variables constant:

- batch size of `n = 1` for all prediction requests (holding throughput constant).[^1]  
- All experiments were conducted on an `Nvidia A6000` GPU, unless otherwise noted.
- Max output tokens were always set to `200`.
- All numbers are calculated as an average over a fixed set of 9 prompts.
- The model used is [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on the HuggingFace Hub [^2].

[^1]: It is common to explore the latency vs. throughput frontier when conducting inference benchmarks.  I did not do this, since I was most interested in latency.  [Here is an example](https://github.com/mosaicml/llm-foundry/tree/main/scripts/inference/benchmarking#different-hw-setups-for-mpt-7b) of how to conduct inference benchmarks that consider both throughput and latency.
[^2]: For [Llama v2 models](https://huggingface.co/meta-llama), you must be careful to use the models ending in `-hf` as those are the ones that are compatible with the transformers library.  
[^3]: It's not an apples to apples comparison, since the largest OpenAI models are much larger than open source models.  However, I have found that fine-tuning a small model can often be better when you are trying to accomplish a very specific task.

In addition to using a batch size of `n = 1` and an `A6000` GPU (unless noted otherwise), I also made sure I warmed up the model by sending an initial inference request before measuring latency.
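The measurement procedure described above (one warm-up request, then timing each prompt on its own) can be sketched roughly as follows. This is a minimal illustration and not the actual benchmarking code: `fake_predict` and `run_benchmark` are hypothetical names, and the stub stands in for a real model call.

```python
import time


def fake_predict(prompt: str) -> list[str]:
    """Hypothetical stand-in for a model call; returns a list of 'tokens'."""
    time.sleep(0.001)  # pretend to do some work
    return prompt.split()


def run_benchmark(predict, prompts, warmup_prompt="warm up"):
    """Send one untimed warm-up request, then time each prompt individually."""
    predict(warmup_prompt)  # warm-up: result and timing are discarded
    rows = []
    for prompt in prompts:
        start = time.perf_counter()
        tokens = predict(prompt)
        elapsed = time.perf_counter() - start
        rows.append({
            "question": prompt,
            "tok_count": len(tokens),
            "time": elapsed,
            "tok/sec": len(tokens) / elapsed,
        })
    return rows


rows = run_benchmark(
    fake_predict,
    ["What is the product of 9 and 8?", "Who does Harry turn into a balloon?"],
)
for row in rows:
    print(f"{row['tok_count']} tokens in {row['time']:.4f}s")
```

Swapping `fake_predict` for a function that calls a real model or server should give rows with the same `tok_count` and `time` fields that the benchmark CSV files use.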

<center>Llama-v2-7b benchmark: <i>batch size = 1, max output tokens = 200</i></center>

```python
#|echo: false
import pandas as pd
pd.set_option("display.precision", 1)
df = pd.concat([pd.read_csv('_llama-inference/hf-endpoint/bench-hf-endpoint.csv').assign(platform='HF Hosted Inference Endpoint').assign(options='-').assign(gpu='A10G'),
                pd.read_csv('_llama-inference/hf/bench-hf.csv').assign(platform='HuggingFace Transformers (no server)').assign(options='-').assign(gpu='A6000'),
                pd.read_csv('_llama-inference/hf/bench-hf-bb.csv').assign(platform='HuggingFace Transformers (no server)').assign(options='nf4 4bit quantization bitsandbytes').assign(gpu='A6000'),
                pd.read_csv('_llama-inference/tgi/bench-default.csv').assign(platform='TGI').assign(options='-').assign(gpu='A6000'),
                pd.read_csv('_llama-inference/tgi/bench-quantize-bb.csv').assign(platform='TGI').assign(options='quantized w/ bitsandbytes').assign(gpu='A6000'),
                pd.read_csv('_llama-inference/tgi/bench-quantize-gptq.csv').assign(platform='TGI').assign(options='quantized w/ GPTQ').assign(gpu='A6000'),
                pd.read_csv('_llama-inference/exllama/bench-exllama.csv').assign(platform='text-generation-webui').assign(options='exllama').assign(gpu='A6000'),
                pd.read_csv('_llama-inference/ctranslate/bench-ctranslate.csv').assign(platform='CTranslate2').assign(options='float16 quantization').assign(gpu='A6000'),
                pd.read_csv('_llama-inference/ctranslate/bench-ctranslate-int8.csv').assign(platform='CTranslate2').assign(options='int8 quantization').assign(gpu='A6000'),
                pd.read_csv('_llama-inference/vllm/bench-vllm.csv').assign(platform='vllm').assign(options='-').assign(gpu='A6000'),
                pd.read_csv('_llama-inference/vllm/modal-examples/bench-vllm.csv').assign(platform='vllm').assign(options='-').assign(gpu='A100 (on Modal Labs)')]
              )

df['tok/sec'] = df['tok_count'] / df['time']

(df.groupby(['platform', 'options', 'gpu']).mean('time')[['tok/sec', 'time', 'tok_count']]
 .rename(columns={'tok/sec': 'avg tok/sec', 'time': 'avg time (seconds)', 'tok_count': 'avg output token count'})
)
```

| platform | options | gpu | avg tok/sec | avg time (seconds) | avg output token count |
|:--|:--|:--|--:|--:|--:|
| CTranslate2 | float16 quantization | A6000 | 44.8 | 4.5 | 200.0 |
| CTranslate2 | int8 quantization | A6000 | 62.6 | 3.2 | 200.0 |
| HF Hosted Inference Endpoint | - | A10G | 30.4 | 6.6 | 202.0 |
| HuggingFace Transformers (no server) | - | A6000 | 24.6 | 7.5 | 181.4 |
| HuggingFace Transformers (no server) | nf4 4bit quantization bitsandbytes | A6000 | 24.3 | 7.6 | 181.4 |
| TGI | - | A6000 | 21.1 | 9.5 | 200.0 |
| TGI | quantized w/ GPTQ | A6000 | 23.6 | 8.8 | 200.0 |
| TGI | quantized w/ bitsandbytes | A6000 | 1.9 | 103.0 | 200.0 |
| text-generation-webui | exllama | A6000 | 77.0 | 1.7 | 134.0 |
| vllm | - | A100 (on Modal Labs) | 41.5 | 3.4 | 143.1 |
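
One detail worth keeping in mind when reading the table: `avg tok/sec` is computed per request (`tok_count / time`) and then averaged, which is generally not the same number as dividing the average token count by the average time. The sketch below illustrates the difference with made-up measurements (not real benchmark data), using only the standard library in place of pandas.

```python
from statistics import mean

# Made-up per-request measurements (token count, seconds); not real benchmark data.
requests = [(200, 4.0), (100, 1.0), (50, 1.0)]

# Mean of per-request rates: what averaging a per-row `tok/sec` column gives you.
mean_of_rates = mean(tokens / seconds for tokens, seconds in requests)

# Rate of the means: average token count divided by average time.
rate_of_means = mean(t for t, _ in requests) / mean(s for _, s in requests)

print(f"mean of per-request tok/sec: {mean_of_rates:.1f}")  # 66.7
print(f"avg tokens / avg seconds:    {rate_of_means:.1f}")  # 58.3
```

The two numbers only agree when every request runs at the same rate, so small mismatches between the columns of the table are expected.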


In some cases I did not use an `A6000` because the platform didn't have that particular GPU available.  You can ignore these rows if you like, but I still think it is valuable information.  I had access to an A6000, so I just used what I had.

I noticed that the output of the LLM was quite different (fewer tokens) when using [vLLM](https://github.com/vllm-project/vllm.git).  I am not sure whether I did something wrong here or whether vLLM changes the behavior of the LLM.

Furthermore, the goal was not to be super precise on these benchmarks but rather to get a general sense of how things work and how they compare to each other out of the box. Some of the tools above are inference servers that perform logging, tracing, etc. in addition to optimizing models, and this extra work affects latency.  The idea is to see where there are significant differences between tools.  I discuss this more [here](#comparison-without-tgi-server).

## Background

One capability you need to be successful with open source LLMs is the ability to serve models efficiently.  There are two categories of tools for model inference:

- **Inference servers:** These provide a web server that exposes a REST, gRPC, or other interface so you can interact with your model as a service.  These inference servers usually have parameters to help you make [trade-offs between throughput and latency](https://www.simonwenkel.com/notes/ai/practical/latency-vs-throughput-in-machine-learning-pipelines.html). Additionally, some inference servers come with additional features like telemetry, model versioning and more. You can learn more about this topic in the [serving section](../serving/index.qmd) of these notes. For LLMs, popular inference servers are [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) and [vLLM](https://github.com/vllm-project/vllm).

- **Model Optimization**: These modify your model to make it faster for inference.  Examples include [quantization](https://huggingface.co/docs/optimum/concept_guides/quantization), [Paged Attention](https://vllm.ai/), [Exllama](https://github.com/turboderp/exllama) and more.

It is common to use both **Inference servers** and **Model Optimization** techniques in conjunction.  Some inference servers like [TGI](https://github.com/huggingface/text-generation-inference) and [vLLM](https://vllm.readthedocs.io/en/latest/) even help you apply optimization techniques.[^4]

[^4]: [The Modular Inference Engine](https://www.modular.com/engine) is another example of an inference server that also applies optimization techniques.  At the time of this writing, this is proprietary technology, but it's worth keeping an eye on in the future.

# Notes On Tools

Other than benchmarking, an important goal of this study was to understand how to use different platforms & tools. 

## CTranslate2

[CTranslate2](https://github.com/OpenNMT/CTranslate2) is an optimization tool that can make models ridiculously fast.  h/t to [Anton](https://twitter.com/abacaj/status/1685107222097903617?s=20). The documentation for CTranslate2 contains [specific instructions for llama models](https://opennmt.net/CTranslate2/guides/transformers.html#llama-2).


To optimize `llama v2`, we first need to quantize the model. This can be done like so:

```bash
ct2-transformers-converter --model meta-llama/Llama-2-7b-hf --quantization int8 --output_dir llama-2-7b-ct2 --force
```

`meta-llama/Llama-2-7b-hf` refers to the [HuggingFace repo for this model](https://huggingface.co/meta-llama/Llama-2-7b-hf).  The benchmarking code is as follows (can also be found [here](https://github.com/hamelsmu/llama-inference/blob/master/ctranslate/bench.py)):

```python
import time
import ctranslate2
import transformers
import sys
sys.path.append('../common/')
from questions import questions
import pandas as pd

generator = ctranslate2.Generator("llama-2-7b-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def predict(prompt: str):
    "Generate text given a prompt"
    start = time.perf_counter()
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
    results = generator.generate_batch([tokens], sampling_topk=1, max_length=200, include_prompt_in_result=False)
    tokens = results[0].sequences_ids[0]
    output = tokenizer.decode(tokens)
    request_time = time.perf_counter() - start
    return {'tok_count': len(tokens),
            'time': request_time,
            'question': prompt,
            'answer': output,
            'note': 'CTranslate2 int8 quantization'}

if __name__ == '__main__':
    counter = 1
    responses = []

    for q in questions:
        if counter >= 2: responses.append(predict(q))
        counter += 1

    df = pd.DataFrame(responses)
    df.to_csv('bench-ctranslate-int8.csv', index=False)
```

## Text Generation Inference (TGI)

:::{.callout-warning}
### License Restrictions

The license for TGI was [recently changed](https://github.com/huggingface/text-generation-inference/commit/bde25e62b33b05113519e5dbf75abda06a03328e) away from Apache 2.0 to be more restrictive.  Be careful when using TGI in commercial applications.

:::


[Text generation inference](https://github.com/huggingface/text-generation-inference), which is often referred to as “TGI”, was easy to use without any optimization.  You can run it like this:

```{.bash filename="start_server.sh"}
#!/bin/bash

if [ -z "$HUGGING_FACE_HUB_TOKEN" ]
then
  echo "HUGGING_FACE_HUB_TOKEN is not set. Please set it before running this script."
  exit 1
fi

model="TheBloke/Llama-2-7B-GPTQ"
volume=$PWD/data

docker run --gpus all \
 -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
 -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=128 \
 --shm-size 5g -p 8081:80 \
 -v $volume:/data ghcr.io/huggingface/text-generation-inference \
 --max-best-of 1 "$@"
```

We can then run the server with this command:

```bash
bash start_server.sh --model-id "meta-llama/Llama-2-7b-hf"
```
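
Once the server is up, you can send it a prompt over HTTP. The following is a minimal sketch of such a client and not the benchmarking code used for the table. It assumes the container is reachable at `localhost:8081` (the port mapped in `start_server.sh`) and uses TGI's `/generate` route. `build_payload` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the container from start_server.sh is listening on localhost:8081.
TGI_URL = "http://localhost:8081/generate"


def build_payload(prompt: str, max_new_tokens: int = 200) -> dict:
    """Request body for the /generate route; `details` asks for token-level info."""
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "details": True},
    }


def generate(prompt: str, url: str = TGI_URL) -> dict:
    """POST one prompt to the server and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With a server running, `generate("What is the product of 9 and 8?")` should return a dictionary whose `generated_text` field holds the completion. Note that timing this call measures the full HTTP round trip, not just model inference.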

:::{.callout-note}

#### Help

You can see all the options for the TGI container with the help flag like so:

```bash
docker run ghcr.io/huggingface/text-generation-inference --help | less
```
:::

### Quantization

Quantization was very difficult to get working.  There is a `--quantize` flag which accepts `bitsandbytes` and `gptq`.  The `bitsandbytes` approach makes inference __much__ slower, which [others have reported](https://github.com/huggingface/text-generation-inference/issues/309#issuecomment-1542124381).

Making `gptq` work for llama v2 models requires a bunch of work: you have to [install the text-generation-server](https://github.com/huggingface/text-generation-inference/tree/main/server), which can take a while and is very brittle to get right.  I had to step through the [Makefile](https://github.com/huggingface/text-generation-inference/blob/main/server/Makefile) carefully.  After that you have to download the weights with:

```bash
text-generation-server download-weights meta-llama/Llama-2-7b-hf
```

You can run the following command to perform the quantization (the last argument is the destination directory where the weights are stored).

```bash
text-generation-server quantize "meta-llama/Llama-2-7b-hf" data/quantized/
```

**However, this step is not needed for the most popular models, as someone will likely already have quantized and uploaded them to the Hub.**

#### Pre-Quantized Models

Alternatively, you can use a pre-quantized model that has been uploaded to the Hub.  [TheBloke/Llama-2-7B-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-GPTQ) is a good example of one.  To get this to work, you have to be careful to set the `GPTQ_BITS` and `GPTQ_GROUPSIZE` environment variables to match the config.  For example, [this config](https://huggingface.co/TheBloke/Llama-2-7B-GPTQ/blob/main/quantize_config.json#L2-L3) necessitates setting `GPTQ_BITS=4` and `GPTQ_GROUPSIZE=128`.  These are already set in `start_server.sh` shown above.  [This PR](https://github.com/huggingface/text-generation-inference/pull/671) will eventually fix that.
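
To avoid copying these values by hand, you could derive them from the model's `quantize_config.json`. The snippet below is a small sketch under the assumption that the config stores these settings under the keys `bits` and `group_size`. `gptq_env_from_config` is a hypothetical helper, and the example input only mirrors the two fields discussed above.

```python
import json


def gptq_env_from_config(config_text: str) -> dict:
    """Map the `bits` and `group_size` fields of a quantize_config.json to env vars."""
    config = json.loads(config_text)
    return {
        "GPTQ_BITS": str(config["bits"]),
        "GPTQ_GROUPSIZE": str(config["group_size"]),
    }


# Example input mirroring the two fields referenced above; other keys are omitted.
example_config = '{"bits": 4, "group_size": 128}'
for name, value in gptq_env_from_config(example_config).items():
    print(f"-e {name}={value}")
```

The printed lines have the same form as the `-e` flags passed to `docker run` in `start_server.sh`.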

To use [TheBloke/Llama-2-7B-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-GPTQ) with TGI, I can use the same bash script with the following arguments:

```bash
bash start_server.sh --model-id TheBloke/Llama-2-7B-GPTQ --quantize gptq
```

### Comparison Without TGI Server

When I first drafted this study, I got the following response on Twitter:

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Based on your code (<a href="https://t.co/hSYaPTsEaK">https://t.co/hSYaPTsEaK</a>) it seems like you measure the full HTTP request, which is like comparing trees to an apple.</p>&mdash; Philipp Schmid (@_philschmid) <a href="https://twitter.com/_philschmid/status/1685187971400470528?ref_src=twsrc%5Etfw">July 29, 2023</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

Philipp certainly has a point!  I am indeed testing both!  I'm looking for big differences between tools here, and since some inference servers have optimization tools and some optimization tools do not have an inference server, I cannot do a true apples-to-apples comparison.  However, I think it's still useful to try different things as advertised to see what is possible, and also to take note of really significant gaps in latency between tools.
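
To make the distinction concrete, here is a toy sketch that times the same stand-in "model" twice: once called in-process and once through a local HTTP server. Everything in it is hypothetical (`fake_generate`, `Handler`, and the JSON request shape are made up for illustration) and it uses only the standard library, so it says nothing about how much overhead a real inference server adds.

```python
import http.client
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def fake_generate(prompt: str) -> dict:
    """Hypothetical stand-in for model inference."""
    time.sleep(0.01)
    return {"answer": prompt.upper(), "tok_count": len(prompt.split())}


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        prompt = json.loads(self.rfile.read(length))["prompt"]
        body = json.dumps(fake_generate(prompt)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the benchmark output quiet


def http_generate(port: int, prompt: str) -> dict:
    """Send the prompt through the full HTTP request/response cycle."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
    conn.request("POST", "/", body=json.dumps({"prompt": prompt}),
                 headers={"Content-Type": "application/json"})
    result = json.loads(conn.getresponse().read())
    conn.close()
    return result


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

prompt = "what is the product of 9 and 8?"
direct_result, direct_time = timed(fake_generate, prompt)
http_result, http_time = timed(http_generate, server.server_port, prompt)
server.shutdown()
server.server_close()

print(f"in-process: {direct_time * 1000:.1f} ms")
print(f"over HTTP:  {http_time * 1000:.1f} ms")
```

Both paths return the same answer. The difference between the two timings is the serialization and transport cost that a whole-request measurement includes.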

Therefore, I ran the following tests to apply optimizations similar to TGI's, but without the server, to see what happened:

#### HuggingFace Transformers

I was able to get slightly better performance without the TGI server, as predicted by Philipp, **but it did not account for the massive gap between some tools** (which is exactly the kind of thing I was looking for).

To benchmark quantization with bitsandbytes, I [followed this blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes) and wrote [this benchmarking code](https://github.com/hamelsmu/llama-inference/blob/master/hf/bench.py).  I quantized the model by loading it like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_compute_dtype=torch.bfloat16
)
model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)
```

Unlike TGI, I was able to get bitsandbytes to work properly here, but just like TGI it didn't speed anything up for me with respect to inference latency. As reflected in the benchmark table, I got nearly the same results with transformers [without any optimizations](https://github.com/hamelsmu/llama-inference/blob/master/hf/bench.py).  

#### GPTQ

I also quantized the model using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) without an inference server to compare against TGI.  The code for that is [here](https://github.com/hamelsmu/llama-inference/blob/master/hf/bench-gptq.py).

The results were so bad (~5 tok/sec) that I decided not to put this in the table, because it seemed quite off to me.

## Text Generation WebUI

[Aman](https://twitter.com/tmm1/status/1683255057201135616?s=20) let me know about [text-generation-web-ui](https://github.com/oobabooga/text-generation-webui), and also [these instructions](https://github.com/paul-gauthier/aider/issues/110#issuecomment-1644318545) for quickly experimenting with [ExLlama](https://github.com/turboderp/exllama) and [ggml](https://github.com/ggerganov/ggml).  I wasn't able to get the `ggml` variant to work properly, unfortunately.  If you are really serious about using exllama, I recommend trying to use it without the text generation UI and look at the [exllama](https://github.com/turboderp/exllama/tree/master) repo, specifically at [test_benchmark_inference.py](https://github.com/turboderp/exllama/blob/master/test_benchmark_inference.py).  (I didn't have time for this, but if I was going to use exllama for anything serious I would go this route).

From the root of the [text-generation-web-ui](https://github.com/oobabooga/text-generation-webui) repo, you can run the following commands to start an inference server optimized with `ExLlama`:

```bash
python3 download-model.py TheBloke/Llama-2-7B-GPTQ
python3 server.py --listen --extensions openai --loader exllama_hf --model TheBloke_Llama-2-7B-GPTQ
```

After the server was started, I used [this code](https://github.com/hamelsmu/llama-inference/blob/master/exllama/bench.py) to conduct the benchmark.

Overall, I didn't like this particular piece of software much.  It's a bit bloated because it's trying to do too many things at once (an inference server, web UIs, and other optimizations).  That being said, the documentation is good and it is easy to use.

I don't think there is any particular reason to use this unless you want an end-to-end solution that also comes with a web user-interface (which many people want!).

## vLLM

[vLLM](https://github.com/vllm-project/vllm.git) only works with CUDA 11.8, which I configured using [this approach](https://hamel.dev/notes/cuda.html).  After configuring CUDA and installing the right version of PyTorch, you need to install the bleeding edge from git:

```bash
pip install -U git+https://github.com/vllm-project/vllm.git
```

A good recipe to use for vLLM can be found in [these Modal docs](https://modal.com/docs/guide/ex/vllm_inference).  Surprisingly, I had much lower latency when running on a local `A6000` vs. a hosted `A100` on Modal Labs.  It's possible that I did something wrong here.  Either way, **`vLLM` offered the lowest latency compared to everything else by a significant margin.**  If I really wanted to optimize for latency today, I would reach for `vLLM`.

`vLLM` [offers a server](https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html), but I benchmarked the model locally using their tools instead.  The code for the benchmarking can be [found here](https://github.com/hamelsmu/llama-inference/blob/master/vllm/bench.py):

```python
import time

import pandas as pd
from vllm import SamplingParams, LLM

#from https://modal.com/docs/guide/ex/vllm_inference

questions = [
    # Coding questions
    "Implement a Python function to compute the Fibonacci numbers.",
    "Write a Rust function that performs binary exponentiation.",
    "What are the differences between Javascript and Python?",
    # Literature
    "Write a story in the style of James Joyce about a trip to the Australian outback in 2083, to see robots in the beautiful desert.",
    "Who does Harry turn into a balloon?",
    "Write a tale about a time-traveling historian who's determined to witness the most significant events in human history.",
    # Math
    "What is the product of 9 and 8?",
    "If a train travels 120 kilometers in 2 hours, what is its average speed?",
    "Think through this step by step. If the sequence a_n is defined by a_1 = 3, a_2 = 5, and a_n = a_(n-1) + a_(n-2) for n > 2, find a_6.",
]

MODEL_DIR = "/home/ubuntu/hamel-drive/vllm-models"

def download_model_to_folder():
    from huggingface_hub import snapshot_download
    import os

    snapshot_download(
        "meta-llama/Llama-2-7b-hf",
        local_dir=MODEL_DIR,
        token=os.environ["HUGGING_FACE_HUB_TOKEN"],
    )
    return LLM(MODEL_DIR)


def generate(question, llm, note=None):
    response = {'question': question, 'note': note}
    sampling_params = SamplingParams(
        temperature=1.0,
        top_p=1,
        max_tokens=200,
    )
    
    start = time.perf_counter()
    result = llm.generate(question, sampling_params)
    request_time = time.perf_counter() - start

    for output in result:
        response['tok_count'] = len(output.outputs[0].token_ids)
        response['time'] = request_time
        response['answer'] = output.outputs[0].text
    
    return response

if __name__ == '__main__':
    llm = download_model_to_folder()
    counter = 1
    responses = []

    for q in questions:
        response = generate(question=q, llm=llm, note='vLLM')
        if counter >= 2:
            responses.append(response)
        counter += 1
    
    df = pd.DataFrame(responses)
    df.to_csv('bench-vllm.csv', index=False)
```
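
One possible contributor to the shorter vLLM outputs is the decoding setup: the script above uses `temperature=1.0` and `top_p=1`, which samples from the model's distribution, while the CTranslate2 script uses `sampling_topk=1` (greedy decoding). I have not verified that this explains the gap. The toy sketch below only illustrates the mechanism with a made-up next-token distribution: when sampling, a low-probability end-of-sequence token is occasionally drawn, which ends generation early. All names and probabilities here are hypothetical.

```python
import random

EOS = "<eos>"
# Made-up next-token distribution, reused at every step for simplicity.
VOCAB = ["the", "a", "robot", EOS]
PROBS = [0.50, 0.28, 0.20, 0.02]


def generate_length(max_tokens: int, greedy: bool, rng: random.Random) -> int:
    """Count tokens emitted before EOS is produced or max_tokens is reached."""
    for step in range(max_tokens):
        if greedy:
            token = VOCAB[PROBS.index(max(PROBS))]  # always the most likely token
        else:
            token = rng.choices(VOCAB, weights=PROBS, k=1)[0]  # sample from the distribution
        if token == EOS:
            return step
    return max_tokens


rng = random.Random(0)  # fixed seed so repeated runs give the same lengths
greedy_lengths = [generate_length(200, greedy=True, rng=rng) for _ in range(9)]
sampled_lengths = [generate_length(200, greedy=False, rng=rng) for _ in range(9)]
print("greedy :", greedy_lengths)
print("sampled:", sampled_lengths)
```

In this toy setup, greedy decoding always runs to the 200-token limit because the end-of-sequence token is never the most likely one, while sampled runs usually stop earlier. If you want the tools to be more comparable on output length, setting `temperature=0` in `SamplingParams` (which vLLM treats as greedy sampling) is one thing to try.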

## HuggingFace Inference Endpoint

I deployed an [inference endpoint](https://ui.endpoints.huggingface.co/) on HuggingFace for [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf), on an `Nvidia A10G` GPU. I didn't try to turn on any optimizations like quantization and wanted to see what the default performance would be like.

The documentation for these interfaces can be found [here](https://huggingface.github.io/text-generation-inference/#/).  There is also [a python client](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.InferenceClient.text_generation).

Their documentation says they are using TGI under the hood.  However, latency was significantly lower on their hosted inference platform than when using TGI locally.  This could be due to the fact that I used an `A10G` with them but only an `A6000` locally.  It's worth looking further into why this discrepancy exists.

The code for this benchmark can be found [here](https://github.com/hamelsmu/llama-inference/blob/master/hf-endpoint/bench.py).