# Workshop: Optimized Model Serving with vLLM V1 on AMD GPUs

Welcome to this hands-on workshop! 

You will run vLLM with the legacy v0 engine and with the new v1 engine, then compare their performance characteristics.

The overall lab architecture is:
![architecture](./assets/ws201_1.jpg)
- ⭐ vLLM Server with <span style="color:red">legacy v0</span> 
- ⭐ vLLM Server with <span style="color:red">v1</span> **without** <span style="color:green">Prefix caching</span> 
- ⭐ vLLM Server with <span style="color:red">v1</span> **with** <span style="color:green">Prefix caching</span>
- 💻 Jupyter Notebook Client: You'll remotely launch vLLM Servers and measure performance here

In this workshop, you will launch vLLM Servers on AMD MI300X GPUs and record and compare LLM performance metrics.
![llm_metrics](./assets/ws201_2.jpg)

- 🚩**TTFT** (Time To First Token) and **TPOT** (Time Per Output Token) are key LLM latency metrics. <span style="color:green">_Lower is better_</span>
- 🚩**TOTAL_TPS** (Total Tokens Per Second) is an LLM throughput metric. <span style="color:green">_Higher is better_</span>
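
As a rough illustration of how these three metrics relate to raw timings, the sketch below computes them for a single hypothetical request. The helper name, the timestamps, and the token counts are invented for illustration, and the formulas follow the commonly used definitions (TPOT averaged over the tokens after the first one), which may differ in detail from what the benchmark script reports.

```python
def latency_metrics(t_start, t_first_token, t_end, input_tokens, output_tokens):
    """Return (ttft_ms, tpot_ms, total_tps) for one request (hypothetical helper)."""
    ttft_s = t_first_token - t_start          # time until the first output token
    e2e_s = t_end - t_start                   # end-to-end latency of the request
    # TPOT: average time per output token, excluding the first token.
    tpot_s = (e2e_s - ttft_s) / (output_tokens - 1) if output_tokens > 1 else 0.0
    # Total throughput counts both input and output tokens.
    total_tps = (input_tokens + output_tokens) / e2e_s
    return ttft_s * 1000.0, tpot_s * 1000.0, total_tps


# Invented timings (seconds) and token counts, not measured results.
ttft_ms, tpot_ms, total_tps = latency_metrics(
    t_start=0.0, t_first_token=0.25, t_end=10.25,
    input_tokens=1024, output_tokens=1001,
)
print(f"TTFT={ttft_ms:.1f} ms, TPOT={tpot_ms:.2f} ms, TOTAL_TPS={total_tps:.1f} tok/s")
```

In a real benchmark the throughput is computed over all requests and the whole run duration rather than per request, so treat this only as a way to build intuition for the numbers in the charts.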

To compare the three configurations, you will <span style="color:red">repeat the same cycle **three times**</span>: launch a vLLM server, then measure its performance.

Let's dive in and see the benefits of <span style="color:red">vLLM v1</span>!

## Please run these helper functions first

In [None]:
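def server_status(port):
    import subprocess
    # Probe the OpenAI-compatible /v1/models endpoint. If the server is not
    # reachable, curl exits non-zero and check=True raises CalledProcessError.
    probe_cmd = f"curl -s http://localhost:{port}/v1/models > /dev/null"
    subprocess.run(probe_cmd, shell=True, check=True)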

def server_llm_resp(port):
    import subprocess
    # Bail out early if nothing is listening on the given port.
    try:
        server_status(port)
    except subprocess.CalledProcessError:
        print(f"Please launch the vLLM server at port {port} first")
        return 0

    from openai import OpenAI
    # Set OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = f"http://localhost:{port}/v1"
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    # Send one short chat request as a smoke test.
    chat_response = client.chat.completions.create(
        model="RedHatAI/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a joke."},
        ],
        max_tokens=100
    )
    print(f"RESPONSE:\n{chat_response.choices[0].message.content}")

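def run_bench(SRC_LOG, TGT_LOG, isl, osl, con_list, case, port):
    import subprocess
    # Bail out early if nothing is listening on the given port.
    try:
        server_status(port)
    except subprocess.CalledProcessError:
        print(f"Please launch the vLLM server at port {port} first")
        return 0
    # Write the CSV header of the summary log (the columns read by visualize_bench).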
    log_labels = f'''
        #!/usr/bin/bash
        printf run,                     2>&1 | tee -a {TGT_LOG}
        printf prompts,                 2>&1 | tee -a {TGT_LOG}
        printf median_ttft,             2>&1 | tee -a {TGT_LOG}
        printf median_tpot,             2>&1 | tee -a {TGT_LOG}
        printf median_e2e,              2>&1 | tee -a {TGT_LOG}
        printf total_tps                2>&1 | tee -a {TGT_LOG}
        printf "\n"                     2>&1 | tee -a {TGT_LOG}
        '''
    subprocess.run(log_labels, shell=True, check=True)
    for concurrency in con_list:
        # Send 4 prompts per concurrent request slot (e.g. 128 prompts at concurrency 32).
        prompts = 4 * concurrency

        vllm_run = f'''
            #!/usr/bin/env bash
            VLLM_LOGGING_LEVEL=ERROR \
            python3 /app/vllm/benchmarks/benchmark_serving.py \
                --model RedHatAI/Llama-3.1-8B-Instruct \
                --dataset-name random \
                --random-input-len {isl} \
                --random-output-len {osl} \
                --num-prompts {prompts} \
                --max-concurrency {concurrency} \
                --ignore-eos \
                --port {port} \
                --percentile-metrics ttft,tpot,e2el \
                2>&1 | tee {SRC_LOG}
            '''
        log_post_process = f'''
            #!/usr/bin/bash
            bash ./rpt_sum.sh {SRC_LOG} {TGT_LOG} {case}
            '''

        subprocess.run(vllm_run, shell=True, check=True)
        subprocess.run(log_post_process, shell=True, check=True)

def visualize_bench(logs):
    !pip install matplotlib -q
    import pandas as pd
    import matplotlib.pyplot as plt

    # Load every summary CSV and stack them into one table.
    frames = []
    for log in logs:
        df_log = pd.read_csv(log, sep=',')
        print(df_log)
        frames.append(df_log)
    df_sum = pd.concat(frames)

    fig, axes = plt.subplots(figsize=(16,8), nrows=1, ncols=3)

    # One grouped bar chart per metric: x-axis = number of prompts, one bar per run.
    df_sum_pivot = df_sum.pivot(index='prompts', columns='run', values='median_ttft')
    df_sum_pivot.plot.bar(rot=0, title='median_ttft (ms), lower is better', ax=axes[0])

    df_sum_pivot = df_sum.pivot(index='prompts', columns='run', values='median_tpot')
    df_sum_pivot.plot.bar(rot=0, title='median_tpot (ms), lower is better', ax=axes[1])

    # Highlight the throughput panel with a pink background.
    axes[2].set_facecolor("pink")
    df_sum_pivot = df_sum.pivot(index='prompts', columns='run', values='total_tps')
    df_sum_pivot.plot.bar(rot=0, title='total_tps (tok/sec), higher is better', ax=axes[2])
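
To make the data flow between `run_bench` and `visualize_bench` easier to follow, here is a minimal sketch of the summary file layout and of what the pivot step does. It assumes that `rpt_sum.sh` appends one comma-separated row per benchmark run in the same column order as the header written by `run_bench`; the run names and all numbers below are invented, and `pivot_metric` is a hypothetical plain-Python stand-in for `DataFrame.pivot`.

```python
import csv
import io

# Made-up values that only illustrate the file layout (not real results).
SAMPLE_SUMMARY = """run,prompts,median_ttft,median_tpot,median_e2e,total_tps
caseA,128,100.0,20.0,20500.0,3000.0
caseA,256,200.0,30.0,30900.0,4000.0
caseB,128,80.0,18.0,18500.0,3300.0
caseB,256,150.0,25.0,25700.0,4800.0
"""


def pivot_metric(csv_text, metric):
    """Group one metric as {prompts: {run: value}}, mirroring
    df.pivot(index='prompts', columns='run', values=metric)."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table.setdefault(int(row["prompts"]), {})[row["run"]] = float(row[metric])
    return table


for prompts, per_run in pivot_metric(SAMPLE_SUMMARY, "total_tps").items():
    print(prompts, per_run)
```

Each outer key becomes one group on the x-axis of a chart, and each inner key becomes one bar within that group.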

## Customize Jupyter notebook screen

To view multiple terminals alongside the notebook, rearrange the Jupyter panes by dragging and dropping their tabs.

We recommend keeping two terminals and one Jupyter notebook open:

- 1st terminal: ```watch -n 1 rocm-smi``` to monitor the GPU
- 2nd terminal: paste and run the vLLM server commands
- Notebook (notebook.ipynb): this notebook
  
![llm_metrics](./assets/ws201_3.gif)


# STEP 1 vLLM v0 Performance Benchmark

<span style="color:blue"><strong>⚠️ WARNING:</strong></span> Copy and paste this server command into the terminal:

```sh
VLLM_USE_V1=0 \
VLLM_LOGGING_LEVEL=INFO \
vllm serve RedHatAI/Llama-3.1-8B-Instruct \
            --disable-log-requests \
            --trust-remote-code -tp 1 \
            --cuda-graph-sizes 64 \
            --port 8001 \
            --chat-template /app/vllm/examples/tool_chat_template_llama3.1_json.jinja
```

- 📌Notice: <span style="color:green">```VLLM_USE_V1=0```</span> is an environment variable that makes vLLM run in v0 mode.
- 📌Notice: Use <span style="color:green">```--port 8001```</span> in v0 mode.

<span style="color:blue"><strong>⚠️ WARNING:</strong></span> Wait until the vLLM server prints these messages before continuing (the process ID in brackets will differ).

<span style="color:red"> *INFO:     Started server process [210]*</span>

<span style="color:red"> *INFO:     Waiting for application startup.*</span>

<span style="color:red"> *INFO:     Application startup complete.*</span>
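
If you prefer not to watch the terminal, readiness can also be checked from the notebook. The sketch below polls the same `/v1/models` endpoint that the `server_status` helper probes with `curl`; `wait_for_server` and its timeout values are hypothetical and not part of the workshop helpers.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(port, timeout_s=600.0, poll_s=2.0):
    """Poll /v1/models until it answers with HTTP 200 or timeout_s elapses.

    Hypothetical helper: returns True when the server is ready, False on timeout.
    """
    url = f"http://localhost:{port}/v1/models"
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

For example, `wait_for_server(8001)` would block until the v0 server from this step is reachable, or return `False` after ten minutes.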

Then run the following cells and check the TTFT, TPOT, and TOTAL_TPS metrics of vLLM v0 serving RedHatAI/Llama-3.1-8B-Instruct on a single MI300X GPU.

In [None]:
# 1-1)  When vLLM server is ready, check the answer of "Tell me a joke"
port=8001 # vLLM v0 port

server_llm_resp(port)

In [None]:
# 1-2) Run Benchmark: input_len/output_len = 1024/1024, concurrency = 32 and 64
port=8001 # vLLM v0 port

SRC_LOG="v0.log"
TGT_LOG="v0_summary.log"
case="v0"
!rm -f v0_summary.log
run_bench(SRC_LOG, TGT_LOG, 1024, 1024, [32, 64], case, port)
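
To get a feel for the size of this workload, the sketch below reproduces the arithmetic used inside `run_bench` (4 prompts per concurrency level). The `workload` helper is hypothetical, and the token totals are nominal: they assume every request uses exactly the requested input and output lengths.

```python
def workload(isl, osl, con_list, prompts_per_slot=4):
    """Nominal request and token counts per concurrency level (hypothetical helper)."""
    rows = []
    for concurrency in con_list:
        prompts = prompts_per_slot * concurrency  # same rule as in run_bench
        rows.append({
            "concurrency": concurrency,
            "prompts": prompts,
            "input_tokens": prompts * isl,
            "output_tokens": prompts * osl,
        })
    return rows


for row in workload(1024, 1024, [32, 64]):
    print(row)
```

The `prompts` values (128 and 256) are the ones that appear on the x-axis of the charts in the visualization step.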

In [None]:
# 1-3) Visualize Benchmarks
try:
    logs = [
        "v0_summary.log",
        ]
    visualize_bench(logs)
except Exception as err:
    print(f"Please rerun the previous step ({err})")

# STEP 2 vLLM v1 Performance Benchmark

<span style="color:blue"><strong>⚠️ WARNING:</strong></span> Press ```Ctrl + C``` in the terminal to stop the previous vLLM server, then copy and paste this server command into the terminal:

```sh
VLLM_USE_V1=1 \
VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
VLLM_LOGGING_LEVEL=INFO \
vllm serve RedHatAI/Llama-3.1-8B-Instruct \
            --disable-log-requests \
            --trust-remote-code -tp 1 \
            --cuda-graph-sizes 64 \
            --no-enable-prefix-caching \
            --port 8002 \
            --chat-template /app/vllm/examples/tool_chat_template_llama3.1_json.jinja
```

- 📌Notice: <span style="color:green">```VLLM_USE_V1=1```</span> is an environment variable that makes vLLM run in v1 mode.
- 📌Notice: <span style="color:green">```--no-enable-prefix-caching```</span> is a vLLM argument that <span style="color:purple">disables</span> prefix caching.
- 📌Notice: Use <span style="color:green">```--port 8002```</span> in v1 mode.

<span style="color:blue"><strong>⚠️ WARNING:</strong></span> Wait until the vLLM server prints these messages before continuing (the process ID in brackets will differ).

<span style="color:red"> *INFO:     Started server process [210]*</span>

<span style="color:red"> *INFO:     Waiting for application startup.*</span>

<span style="color:red"> *INFO:     Application startup complete.*</span>

Then run the following cells and check the TTFT, TPOT, and TOTAL_TPS metrics of vLLM v1 serving RedHatAI/Llama-3.1-8B-Instruct on a single MI300X GPU.

In [None]:
# 2-1) When vLLM server is ready, check the answer of "Tell me a joke"
port=8002 # vLLM v1 port

server_llm_resp(port)

In [None]:
# 2-2) Run Benchmark: input_len/output_len = 1024/1024, concurrency = 32 and 64
port=8002 # vLLM v1 port

SRC_LOG="v1.log"
TGT_LOG="v1_summary.log"
case="v1"
!rm -f v1_summary.log
run_bench(SRC_LOG, TGT_LOG, 1024, 1024, [32, 64], case, port)

In [None]:
# 2-3) Visualize Benchmarks
try:
    logs = [
        "v0_summary.log",
        "v1_summary.log",
        ]
    visualize_bench(logs)
except Exception as err:
    print(f"Please rerun the previous step ({err})")
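
Besides reading the bar charts, it can help to express the difference between two runs as a percentage. The following sketch shows the calculation with invented numbers; `percent_change` is a hypothetical helper, and you would substitute the values printed from your own summary logs.

```python
def percent_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Invented numbers for illustration only (not measured results).
baseline = {"median_ttft": 200.0, "total_tps": 4000.0}
candidate = {"median_ttft": 150.0, "total_tps": 5000.0}

for metric in baseline:
    change = percent_change(baseline[metric], candidate[metric])
    print(f"{metric}: {change:+.1f}%")
```

Remember that a negative change is an improvement for the latency metrics, while a positive change is an improvement for throughput.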

# STEP 3 vLLM v1 + Prefix Caching Performance Benchmark

<span style="color:blue"><strong>⚠️ WARNING:</strong></span> Press ```Ctrl + C``` in the terminal to stop the previous vLLM server, then copy and paste this server command into the terminal:

```sh
VLLM_USE_V1=1 \
VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
VLLM_LOGGING_LEVEL=INFO \
vllm serve RedHatAI/Llama-3.1-8B-Instruct \
            --disable-log-requests \
            --trust-remote-code -tp 1 \
            --cuda-graph-sizes 64 \
            --enable-prefix-caching \
            --port 8003 \
            --chat-template /app/vllm/examples/tool_chat_template_llama3.1_json.jinja
```

- 📌Notice: <span style="color:green">```VLLM_USE_V1=1```</span> is an environment variable that makes vLLM run in v1 mode.
- 📌Notice: <span style="color:green">```--enable-prefix-caching```</span> is a vLLM argument that <span style="color:green">enables</span> prefix caching.
- 📌Notice: Use <span style="color:green">```--port 8003```</span> in v1 + prefix caching mode.
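
For intuition about what prefix caching buys you, the toy model below splits a prompt into fixed-size blocks, hashes each block together with everything before it, and counts how many leading tokens were already seen. This is only a sketch of the general idea, not vLLM's actual implementation; `ToyPrefixCache`, `block_hashes`, and the block size of 4 are all made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # toy value chosen for readability


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block chained with the hash of the preceding blocks."""
    hashes = []
    parent = b""
    for i in range(len(token_ids) // block_size):
        block = token_ids[i * block_size:(i + 1) * block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class ToyPrefixCache:
    """Remembers which prefix blocks have been seen (toy model only)."""

    def __init__(self):
        self.cached = set()

    def lookup_and_insert(self, token_ids):
        """Return the number of leading tokens already cached, then cache all blocks."""
        hashes = block_hashes(token_ids)
        hit_blocks = 0
        for h in hashes:
            if h not in self.cached:
                break
            hit_blocks += 1
        self.cached.update(hashes)
        return hit_blocks * BLOCK_SIZE


cache = ToyPrefixCache()
prompt = list(range(10))
first = cache.lookup_and_insert(prompt)                   # cold cache
second = cache.lookup_and_insert(prompt)                  # same prompt again
third = cache.lookup_and_insert(prompt[:6] + [99, 98])    # shares only the first block
print(first, second, third)  # prints: 0 8 4
```

The repeated prompt reuses its two full blocks, which is why a prompt that has been processed once is cheaper to prefill the second time.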

<span style="color:blue"><strong>⚠️ WARNING:</strong></span> Wait until the vLLM server prints these messages before continuing (the process ID in brackets will differ).

<span style="color:red"> *INFO:     Started server process [210]*</span>

<span style="color:red"> *INFO:     Waiting for application startup.*</span>

<span style="color:red"> *INFO:     Application startup complete.*</span>

Then run the following cells and check the TTFT, TPOT, and TOTAL_TPS metrics of vLLM v1 with prefix caching serving RedHatAI/Llama-3.1-8B-Instruct on a single MI300X GPU.

In [None]:
# 3-1) When vLLM server is ready, check the answer of "Tell me a joke"
port=8003 # vLLM v1 + prefix caching port

server_llm_resp(port)

In [None]:
# 3-2) Run Benchmark: input_len/output_len = 1024/1024, concurrency = 32 and 64
port=8003 # vLLM v1 + prefix caching port

SRC_LOG="v1PC.log"
TGT_LOG="v1PC_summary.log"
case="v1PC"
!rm -f v1PC_summary.log
run_bench(SRC_LOG, TGT_LOG, 1024, 1024, [32, 64], case, port)
# Run the benchmark a second time so it can reuse prefixes cached during the first pass
!rm -f v1PC_summary.log
run_bench(SRC_LOG, TGT_LOG, 1024, 1024, [32, 64], case, port)

In [None]:
# 3-3) Visualize Benchmarks
try:
    logs = [
        "v0_summary.log",
        "v1_summary.log",
        "v1PC_summary.log",
        ]
    visualize_bench(logs)
except Exception as err:
    print(f"Please rerun the previous step ({err})")