# Multi GPU Inference with vLLM

In this notebook, we'll explore a multi-GPU instance and how vLLM can be used to leverage those GPUs for optimized inference!

Let's start by getting what we need!

In [1]:
!pip install -qU vllm ipywidgets huggingface_hub jinja2

## Loading Model

Now we can import the vLLM classes we need.

In [2]:
from vllm import LLM, SamplingParams

2024-12-11 17:40:43.277395: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-11 17:40:43.374112: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-11 17:40:43.400939: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Next, because Meta's Llama 3.1 8B Instruct is a gated model on the Hugging Face Hub, we'll need to provide our Hugging Face token!

In [3]:
from huggingface_hub import notebook_login

notebook_login()


Now we can load our model directly from the Hugging Face Hub!

> NOTE: This might take a few moments as the model downloads.

Notice that, so far, this is the same as single-GPU usage! Below is where the magic happens: we simply set `tensor_parallel_size` to the number of GPUs we have in our node. That's it. With that one argument, vLLM will shard the model and distribute inference across our GPUs.
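If you're not sure what value to pass, here is a minimal sketch of one way to pick it. It assumes that the tensor parallel size must evenly divide the model's attention head count, and that Llama 3.1 8B has 32 attention heads (check the model's `config.json`). The helper name `pick_tensor_parallel_size` is hypothetical and not part of vLLM. In a notebook, `num_gpus` could come from `torch.cuda.device_count()`.

```python
def pick_tensor_parallel_size(num_gpus: int, num_attention_heads: int) -> int:
    """Return the largest size <= num_gpus that evenly divides the head count."""
    if num_gpus < 1 or num_attention_heads < 1:
        raise ValueError("num_gpus and num_attention_heads must be positive")
    limit = min(num_gpus, num_attention_heads)
    # size=1 always divides, so this max() never sees an empty sequence.
    return max(s for s in range(1, limit + 1) if num_attention_heads % s == 0)


# Assumption: 32 attention heads for Llama 3.1 8B.
print(pick_tensor_parallel_size(num_gpus=8, num_attention_heads=32))  # 8
print(pick_tensor_parallel_size(num_gpus=6, num_attention_heads=32))  # 4
```

On our 8-GPU node this lands on 8, which is the value used in the next cell.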

In [4]:
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=8)

config.json:   0%|          | 0.00/855 [00:00<?, ?B/s]

INFO 12-11 17:41:13 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
INFO 12-11 17:41:13 config.py:1020] Defaulting to use mp for distributed inference
INFO 12-11 17:41:13 config.py:1136] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 12-11 17:41:13 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

INFO 12-11 17:41:16 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=8894) (VllmWorkerProcess pid=8895) INFO 12-11 17:41:16 selector.py:135] Using Flash Attention backend.
INFO 12-11 17:41:16 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=8894) (VllmWorkerProcess pid=8895) INFO 12-11 17:41:16 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 12-11 17:41:16 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=8897) INFO 12-11 17:41:16 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=8897) INFO 12-11 17:41:16 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 12-11 17:41:16 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=8899) INFO 12-11 17:41:16 selector.py:135] Using Flash Attent

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 12-11 17:44:20 model_runner.py:1077] Loading model weights took 1.9028 GB
(VllmWorkerProcess pid=8897) INFO 12-11 17:44:21 model_runner.py:1077] Loading model weights took 1.9028 GB
(VllmWorkerProcess pid=8894) INFO 12-11 17:44:21 model_runner.py:1077] Loading model weights took 1.9028 GB
(VllmWorkerProcess pid=8900) INFO 12-11 17:44:21 model_runner.py:1077] Loading model weights took 1.9028 GB
(VllmWorkerProcess pid=8898) INFO 12-11 17:44:21 model_runner.py:1077] Loading model weights took 1.9028 GB
(VllmWorkerProcess pid=8896) INFO 12-11 17:44:22 model_runner.py:1077] Loading model weights took 1.9028 GB
(VllmWorkerProcess pid=8895) INFO 12-11 17:44:22 model_runner.py:1077] Loading model weights took 1.9028 GB
(VllmWorkerProcess pid=8899) INFO 12-11 17:44:22 model_runner.py:1077] Loading model weights took 1.9028 GB
(VllmWorkerProcess pid=8895) INFO 12-11 17:44:26 worker.py:23

Notice that our model is loaded onto our GPUs - and we get very specific information about:

- Where it's loaded
- How it's loaded
- What hardware it's loaded on
- What kind of performance we can expect
- How many devices are being used, and information about them!

This is all relevant to how vLLM gets the performance benefits it's well known for!
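As a rough sanity check on the `Loading model weights took 1.9028 GB` lines in the log, we can estimate each GPU's share of the weights. This is a back-of-the-envelope sketch, not how vLLM computes the figure. It assumes roughly 8.03 billion parameters, 2 bytes per parameter (the log shows `dtype=torch.bfloat16`), and an even split across the 8 tensor-parallel workers. The function name is hypothetical.

```python
def weights_per_gpu_gib(num_params: float, bytes_per_param: int, tp_size: int) -> float:
    """Estimate each worker's share of the weights in GiB, assuming an even split."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / tp_size / 2**30


# Assumptions: ~8.03e9 parameters, bfloat16 (2 bytes each), 8 workers.
estimate = weights_per_gpu_gib(num_params=8.03e9, bytes_per_param=2, tp_size=8)
print(f"{estimate:.2f} GiB per GPU")
```

The estimate comes out a little under 1.9, in the same ballpark as the logged value. An exact match isn't expected, since some small tensors are typically replicated on every worker rather than split.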

## Doing Inference

Now that we have our model loaded - let's do some inference!

We first need to instantiate `SamplingParams`, which control how we sample during the decoding step - many [decoding options](https://docs.vllm.ai/en/latest/dev/sampling_params.html) are available through vLLM these days! (including speculative decoding!)

In [5]:
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
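To build some intuition for what `temperature` and `top_p` do, here is a toy sketch of temperature scaling followed by nucleus (top-p) filtering over a four-token vocabulary. It illustrates the general technique in plain Python. It is not vLLM's implementation, and the logits are made up.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (lower = sharper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving tokens form a proper distribution.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -3.0]  # made-up values for illustration
probs = apply_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(sorted(nucleus))  # token ids that survive the filter
```

With these toy numbers, the very unlikely last token is dropped and the next token is sampled from the remaining three.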

Then we can build a conversation - a list of chat messages - that we want the model to respond to!

In [6]:
conversation = [
    {
        "role": "system",
        "content": "You always speak using the most dope, lit, and cool language."
    },
    {
        "role": "user",
        "content": "Hi!"
    },
    {
        "role": "assistant",
        "content": "Yo! What is up, my dude?"
    },
    {
        "role": "user",
        "content": "How high can the average human jump? Think it through step-by-step!",
    },
]

In [7]:
outputs = llm.chat(conversation, sampling_params)

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.67s/it, est. speed input: 51.60 toks/s, output: 153.61 toks/s]


In [8]:
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, \n\nGenerated text: {generated_text!r}")

Prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou always speak using the most dope, lit, and cool language.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nYo! What is up, my dude?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow high can the average human jump? Think it through step-by-step!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', 

Generated text: "Let's get into it, G!\n\nFirst off, we gotta consider the mechanics of human movement. The average human's jumping ability is mainly influenced by their power output, muscle efficiency, and technique.\n\nWhen a person jumps, they're using their muscles, particularly their lower body, to generate force and propel themselves upward. The human body has two types of muscle contractions: concentric (shortening) and eccentric (lengthening). When jumping, you're
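The prompt printed above shows how the chat template flattens our list of messages into a single string. Here is a simplified sketch of that rendering, using only the special tokens visible in the output. The function name is hypothetical, and the sketch deliberately omits the date preamble that the real template adds to the system turn.

```python
def render_llama3_prompt(messages):
    """Flatten chat messages into a Llama-3-style prompt string (simplified)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    # Open an assistant turn so the model knows it should respond next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


demo = [{"role": "user", "content": "Hi!"}]
print(repr(render_llama3_prompt(demo)))
```

In practice, `llm.chat` applies the model's own chat template for us, so there's no need to build this string by hand.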

### Freeing Up GPU Memory

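Because our GPUs have limited memory, we need to free them up before we load the model again through another process!

As you can see below, we have a lot of memory reserved on every GPU. By default, vLLM pre-allocates up to 90% of each GPU's memory (`gpu_memory_utilization=0.9`) for the weights and the KV cache. Let's clear it out.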

In [9]:
!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Wed Dec 11 17:45:32 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:09:00.0 Off |                    0 |
| N/A   40C    P0             64W /  400W |   35361MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00000000:0A:00.0 Off |                    0 |
| N/A   38C    P0             62W /  400W |   34495MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  |   00000000:0B:00.0 Off |                    0 |
| N/A   37C    P0             61W /  400W |   34495MiB /  40960MiB |      0%      Default |
|                                         |                        |            

In [10]:
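import gc
import torch

# Drop the engine, then release cached CUDA memory held by this process.
del llm
gc.collect()
torch.cuda.empty_cache()

# Tear down the distributed process group, if one is still active.
if torch.distributed.is_initialized():
    torch.distributed.destroy_process_group()
print("Successfully deleted the llm pipeline and freed the GPU memory!")

INFO 12-11 17:45:37 multiproc_worker_utils.py:133] Terminating local vLLM worker processes
Successfully deleted the llm pipeline and freed the GPU memory!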


In [11]:
!nvidia-smi

Wed Dec 11 17:45:46 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:09:00.0 Off |                    0 |
| N/A   40C    P0             64W /  400W |    1685MiB /  40960MiB |      4%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00
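To compare the before and after readings without eyeballing the tables, here is a small sketch that pulls the `used / total` memory column out of `nvidia-smi` text with a regular expression. The two sample lines are copied from the GPU 0 rows above. The helper name is hypothetical.

```python
import re

# Matches the "35361MiB /  40960MiB" memory column in nvidia-smi's table.
MEMORY_PATTERN = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def parse_memory_usage(smi_text):
    """Return a list of (used_mib, total_mib) pairs found in nvidia-smi text."""
    return [(int(used), int(total)) for used, total in MEMORY_PATTERN.findall(smi_text)]


before = "| N/A   40C    P0             64W /  400W |   35361MiB /  40960MiB |      0%      Default |"
after = "| N/A   40C    P0             64W /  400W |    1685MiB /  40960MiB |      4%      Default |"

used_before, total = parse_memory_usage(before)[0]
used_after, _ = parse_memory_usage(after)[0]
print(f"GPU 0 freed {used_before - used_after} MiB of {total} MiB")
```

For GPU 0 that is a drop from 35361 MiB to 1685 MiB, so nearly all of the reserved memory was released.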

## Online Inference using vLLM on Multiple GPUs

Now we can head to our terminal and run the command below. Notice that (again) the only difference from the single-GPU case is that we specify the tensor parallel size:

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 8
```
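The server takes a while to load the model, so requests sent too early will fail to connect. Here is a hedged sketch of a readiness check that polls the OpenAI-compatible `/v1/models` route until it answers. It assumes the default `http://localhost:8000/v1` base URL used later in this notebook. The function name and the timeout values are arbitrary choices, not vLLM settings.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 300.0, poll_s: float = 2.0) -> bool:
    """Poll `{base_url}/models` until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet; retry after a pause
        time.sleep(poll_s)
    return False
```

Calling `wait_for_server("http://localhost:8000/v1")` before the first request returns `True` once the endpoint responds, or `False` if it never comes up within the timeout.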

Now we're going to install the `openai` Python package to interact with the OpenAI-compatible API that vLLM sets up for us!

In [12]:
!pip install openai

Defaulting to user installation because normal site-packages is not writeable


Let's set up our OpenAI Client to be used with our new vLLM endpoint running in our terminal!

In [13]:
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

Now we can interact with this just like any other model served behind an OpenAI-compatible API!

In [14]:
messages = [
    {"role": "system", "content": "You always speak like an Ancient Wizard - with everything shrouded in mystery and intrigue."},
    {"role": "user", "content": "How would I best write a for loop in Python?"}
]

In [15]:
chat_response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages
)

In [16]:
print(chat_response.choices[0].message.content)

(Murmuring to myself) Ah, the mortal seeks to grasp the essence of the Pythonic loop...  Very well, I shall impart upon thee the ancient knowledge of the for loop.

(Leaning in, a hint of a whisper) To conjure the for loop, thou shalt use the following incantation:

```python
for variable_name in iterable:
    # Perform enchanted actions within the loop
    magic_happens()
```

In this mystical ritual, `iterable` is the enchanted object that holds the secrets of the items to be looped over. The `variable_name` is the vessel that shall hold the essence of each item as it is conjured forth.

(With a wave of the staff) To illustrate this ancient knowledge, behold:

```python
fruits = ['apple', 'banana', 'cherry']

for fruit in fruits:
    print(fruit)
```

In this example, `fruits` is the enchanted object, and `fruit` is the vessel that holds the essence of each fruit as it is conjured forth. The `print(fruit)` incantation shall reveal the secrets of each fruit to the mortal world.

(With

### Async Test

Now, we'll slam the endpoint with many concurrent requests and see what happens!

In [1]:
from openai import AsyncOpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = AsyncOpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

In [2]:
import asyncio
from openai import AsyncOpenAI
from tqdm import tqdm
import time
from typing import List, Dict
import statistics

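async def make_request(client: AsyncOpenAI, messages: List[Dict[str, str]]) -> float:
    """Send one chat completion request and return its latency in seconds."""
    # perf_counter is a monotonic clock, which suits measuring elapsed time.
    start_time = time.perf_counter()
    await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages
    )
    return time.perf_counter() - start_time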

async def run_requests(n_requests: int = 200):
    # Initialize OpenAI client
    client = AsyncOpenAI(
        api_key="EMPTY",
        base_url="http://localhost:8000/v1"
    )
    
    messages = [
        {"role": "system", "content": "You always speak like an Ancient Wizard - with everything shrouded in mystery and intrigue."},
        {"role": "user", "content": "How would I best write a for loop in Python?"}
    ]
    
    # List to store timing results
    request_times = []
    
    # Start total timing
    total_start_time = time.time()
    
    # Create progress bar
    pbar = tqdm(total=n_requests, desc="Making API requests")
    
    # Create and gather all tasks
    tasks = [make_request(client, messages) for _ in range(n_requests)]
    
    # Run requests concurrently and update progress bar
    for coro in asyncio.as_completed(tasks):
        request_time = await coro
        request_times.append(request_time)
        pbar.update(1)
    
    # Close progress bar
    pbar.close()
    
    # Calculate total time
    total_time = time.time() - total_start_time
    
    # Print timing statistics
    print("\nTiming Statistics:")
    print(f"Total time: {total_time:.2f} seconds")
    print(f"Average request time: {statistics.mean(request_times):.2f} seconds")
    print(f"Median request time: {statistics.median(request_times):.2f} seconds")
    print(f"Min request time: {min(request_times):.2f} seconds")
    print(f"Max request time: {max(request_times):.2f} seconds")
    print(f"Requests per second: {n_requests/total_time:.2f}")
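The function above fires every request at once. If you'd rather cap how many are in flight at a time, one common pattern is an `asyncio.Semaphore`. The sketch below demonstrates the pattern with a stand-in `fake_request` coroutine that just sleeps, so it runs without a server. The names and numbers are illustrative assumptions, and in real use the sleep would be replaced by the `make_request` call.

```python
import asyncio
import time


async def fake_request(delay_s: float) -> float:
    """Stand-in for a real API call: sleeps, then returns the elapsed time."""
    start = time.perf_counter()
    await asyncio.sleep(delay_s)
    return time.perf_counter() - start


async def run_bounded(n_requests: int, max_in_flight: int, delay_s: float = 0.05):
    """Run n_requests with at most max_in_flight executing concurrently."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded() -> float:
        async with semaphore:  # blocks while max_in_flight requests are active
            return await fake_request(delay_s)

    return await asyncio.gather(*(guarded() for _ in range(n_requests)))


start = time.perf_counter()
latencies = asyncio.run(run_bounded(n_requests=20, max_in_flight=5))
elapsed = time.perf_counter() - start
print(f"{len(latencies)} requests in {elapsed:.2f}s")
```

With 20 requests and 5 in flight, the work proceeds in roughly four waves. Inside a notebook, `asyncio.run` needs the `nest_asyncio` patch applied first, or you can use `await run_bounded(...)` directly instead.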

In [3]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
asyncio.run(run_requests())

Making API requests:   0%|          | 0/1000 [00:00<?, ?it/s]
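Mean and median can hide slow outliers, so it's often worth looking at tail latency too. Here is a minimal sketch that summarizes a list of request times into p50/p95/p99 using the standard library. The sample data is synthetic, and the helper name is hypothetical. In the benchmark above you would pass in `request_times`.

```python
import statistics


def summarize_latencies(request_times):
    """Return p50/p95/p99 latency from a list of per-request timings."""
    # quantiles(n=100) yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(request_times, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic latencies in seconds: 0.1, 0.2, ..., 10.0
sample = [0.1 * i for i in range(1, 101)]
summary = summarize_latencies(sample)
for name, value in summary.items():
    print(f"{name}: {value:.2f} seconds")
```

A large gap between p50 and p99 usually means some requests spent a long time queued behind others.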