# <font color="#76b900">**Notebook 3:** Measuring NIM Performance with GenAI-Perf</font>

In this notebook, you will measure the performance of our NIM instance using the [**NVIDIA GenAI-Perf**](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/README.html) tool. GenAI-Perf is a client-side, LLM-focused benchmarking tool that reports key metrics such as time to first token (`TTFT`), inter-token latency (`ITL`), end-to-end latency (`E2E Latency`), and requests per second. It can measure and compare the performance of any inference endpoint that exposes an OpenAI-compatible or Triton Inference Server API. Read [**our documentation**](https://docs.nvidia.com/nim/benchmarking/llm/latest/step-by-step.html) to learn more about NVIDIA GenAI-Perf.

## Learning Objectives
By the end of this notebook, you will be able to:
- Use NVIDIA GenAI-Perf to measure key performance metrics of a NIM LLM deployment.
- Understand the impact of different input and output lengths on performance.
- Analyze the effect of concurrency on latency and throughput.
- Estimate the required number of GPUs for a given workload.

**Before starting this notebook, please make sure to watch its corresponding video.**

## Table of Contents

- [**First Performance Measurement with NVIDIA GenAI-Perf**](#First-Performance-Measurement-with-NVIDIA-GenAI-Perf)
- [**Loop Over Concurrencies with NVIDIA GenAI-Perf**](#Loop-Over-Concurrencies-with-NVIDIA-GenAI-Perf)
- [**Plot Latency-Throughput Curves**](#Plot-Latency-Throughput-Curves)
- [**[EXERCISE] Calculate the Necessary Number of GPUs**](#[EXERCISE]-Calculate-the-Necessary-Number-of-GPUs)


<br><hr>

## **First Performance Measurement with NVIDIA GenAI-Perf**

Start by submitting a simple call to [**NVIDIA GenAI-Perf**](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/README.html). Though it is a CLI tool intended to be run in a terminal, its commands can also be invoked from your notebook. The call below relies on a few variables that define the experimental setup:

- To create a variety of inputs with some realistic sizes, you can sample synthetic inputs with defined input statistics (mean and standard deviation of length). 
- To keep the output length consistent across experiments, you can set `min_tokens = max_tokens = output-tokens-mean` and specify `ignore_eos: true` so that the backend ignores the special `EOS` token (the LLM's usual stopping criterion).
- To test how your system behaves with multiple concurrent users or asynchronous calls, you can set the `--concurrency` flag.

In [1]:
%%bash
export INPUT_SEQUENCE_LENGTH=200
export INPUT_SEQUENCE_STD=10
export OUTPUT_SEQUENCE_LENGTH=50
export CONCURRENCY=10

genai-perf \
    -m meta/llama3-8b-instruct \
    --endpoint-type chat \
    --service-kind openai \
    --streaming \
    -u nim:8000 \
    --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
    --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
    --concurrency $CONCURRENCY \
    --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
    --extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \
    --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
    --extra-inputs ignore_eos:true \
    -- \
    --max-threads=256

2024-11-06 19:59 [INFO] genai_perf.parser:711 - Detected passthrough args: ['--max-threads=256']
2024-11-06 19:59 [INFO] genai_perf.parser:226 - Model name 'meta/llama3-8b-instruct' cannot be used to create artifact directory. Instead, 'meta_llama3-8b-instruct' will be used.
2024-11-06 19:59 [INFO] genai_perf.wrapper:138 - Running Perf Analyzer : 'perf_analyzer -m meta/llama3-8b-instruct --async --input-data artifacts/meta_llama3-8b-instruct-openai-chat-concurrency10/llm_inputs.json --endpoint v1/chat/completions --service-kind openai -u nim:8000 --measurement-interval 10000 --stability-percentage 999 --profile-export-file artifacts/meta_llama3-8b-instruct-openai-chat-concurrency10/profile_export.json -i http --concurrency-range 10 --max-threads=256'
                                  LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃                Statistic ┃    a

After a successful run, you should see a table of results with statistical measurements of the latencies and throughputs. The in-flight batching mechanism of NIM, together with the standard deviation in the input length, leads to batches that mix requests in the prefill and decoding stages. As a result, latency and throughput differ from one sequence to the next.

We encourage you to experiment with the input and output lengths, as well as with the concurrency. Do you see trends similar to those explored in the previous notebook?
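To vary these parameters from Python instead of editing the shell cell each time, one option is to assemble the same command as an argument list. This is a minimal sketch: the helper name `build_genai_perf_cmd` is hypothetical, it only reuses the flags from the cell above, and the commented-out `subprocess.run` call assumes that `genai-perf` is on your `PATH` and that the endpoint is reachable at `nim:8000`.

```python
def build_genai_perf_cmd(concurrency, input_mean=200, input_std=10, output_len=50,
                         model="meta/llama3-8b-instruct", url="nim:8000"):
    """Assemble the genai-perf call from the cell above as an argument list."""
    return [
        "genai-perf",
        "-m", model,
        "--endpoint-type", "chat",
        "--service-kind", "openai",
        "--streaming",
        "-u", url,
        "--synthetic-input-tokens-mean", str(input_mean),
        "--synthetic-input-tokens-stddev", str(input_std),
        "--concurrency", str(concurrency),
        "--output-tokens-mean", str(output_len),
        # Pin the output length: min_tokens == max_tokens and EOS is ignored.
        "--extra-inputs", f"max_tokens:{output_len}",
        "--extra-inputs", f"min_tokens:{output_len}",
        "--extra-inputs", "ignore_eos:true",
        # Everything after "--" is passed through to Perf Analyzer.
        "--", "--max-threads=256",
    ]


# Hypothetical settings chosen only to show how the parameters are varied.
cmd = build_genai_perf_cmd(concurrency=20, input_mean=500, output_len=100)
print(" ".join(cmd))
# To launch the benchmark: import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command in one function makes it harder for `max_tokens`, `min_tokens`, and `--output-tokens-mean` to drift apart when you change the output length.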

<br><hr>

## **Loop Over Concurrencies with NVIDIA GenAI-Perf**

In the previous notebook, you analyzed the impact of the concurrency level by displaying plots of the latency and throughput. With NVIDIA GenAI-Perf, you can recreate those plots by measuring the performance of NIM with different concurrencies. Let's start by defining a list of concurrencies to iterate over:

In [2]:
import os

# Define a list with the concurrencies
concurrencies = [1, 2, 5, 10, 50, 100, 250]

# Set the environment variable in Python
os.environ['CONCURRENCIES'] = ','.join(map(str, concurrencies))

The next step is to iterate over the list of concurrencies. The following bash command loops over the concurrencies and saves the results under the `artifacts` directory. It takes around 1m20s in our setup, so you can start running it while inspecting the code:

In [9]:
%%time
%%bash
export INPUT_SEQUENCE_LENGTH=200
export INPUT_SEQUENCE_STD=0
export OUTPUT_SEQUENCE_LENGTH=50

IFS=',' read -r -a array <<< "$CONCURRENCIES"
for concurrency in "${array[@]}"; do

    echo "Concurrency: $concurrency"
    ## TODO: Invoke genai-perf for each concurrency level.
    ## TODO: Additionally, set measurement interval to 10000 to avoid timeout
    ## TODO: To differentiate the logfile, modify the --profile-export-file to something else. 
    genai-perf \
        -m meta/llama3-8b-instruct \
        --endpoint-type chat \
        --service-kind openai \
        --streaming \
        -u nim:8000 \
        --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
        --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
        --concurrency $concurrency \
        --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
        --extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \
        --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
        --extra-inputs ignore_eos:true \
        --measurement-interval 10000 \
        --profile-export-file ${INPUT_SEQUENCE_LENGTH}_${OUTPUT_SEQUENCE_LENGTH}.json \
        -- \
        --max-threads=256

done

Concurrency: 1
2024-11-06 20:02 [INFO] genai_perf.parser:711 - Detected passthrough args: ['--max-threads=256']
2024-11-06 20:02 [INFO] genai_perf.parser:226 - Model name 'meta/llama3-8b-instruct' cannot be used to create artifact directory. Instead, 'meta_llama3-8b-instruct' will be used.
2024-11-06 20:02 [INFO] genai_perf.wrapper:138 - Running Perf Analyzer : 'perf_analyzer -m meta/llama3-8b-instruct --async --input-data artifacts/meta_llama3-8b-instruct-openai-chat-concurrency1/llm_inputs.json --endpoint v1/chat/completions --service-kind openai -u nim:8000 --measurement-interval 10000 --stability-percentage 999 --profile-export-file artifacts/meta_llama3-8b-instruct-openai-chat-concurrency1/200_50.json -i http --concurrency-range 1 --max-threads=256'
                                  LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃                Statistic ┃

Take a look at the `artifacts` directory to understand how NVIDIA GenAI-Perf saves the results. Each concurrency level has its own subdirectory, which contains the file `profile_export_genai_perf.json` with a summary of the performance metrics.
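To inspect those files programmatically, a small sketch like the following can map each concurrency level to the summary files found in its subdirectory. The helper `find_summaries` is hypothetical, and it assumes that summary file names end in `_genai_perf.json` and that each subdirectory name ends in `concurrency<N>`, as in the log output above.

```python
import re
from pathlib import Path


def find_summaries(root_dir="./artifacts"):
    """Return {concurrency: [summary paths]} for every run found under root_dir."""
    root = Path(root_dir)
    if not root.is_dir():  # nothing has been benchmarked yet
        return {}
    summaries = {}
    for path in sorted(root.glob("*/*_genai_perf.json")):
        match = re.search(r"concurrency(\d+)$", path.parent.name)
        if match:  # skip directories that do not follow the naming pattern
            summaries.setdefault(int(match.group(1)), []).append(path)
    return summaries


for concurrency, paths in sorted(find_summaries().items()):
    print(concurrency, [p.name for p in paths])
```

Listing the files this way is a quick check that every concurrency level in the loop actually produced a summary before you try to plot it.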

<details>
<summary><b>Reveal Solution</b></summary>

```sh
%%time
%%bash
export INPUT_SEQUENCE_LENGTH=200
export INPUT_SEQUENCE_STD=0
export OUTPUT_SEQUENCE_LENGTH=50

IFS=',' read -r -a array <<< "$CONCURRENCIES"
for concurrency in "${array[@]}"; do

    echo "Concurrency: $concurrency"
    ## TODO: Invoke genai-perf for each concurrency level.
    ## TODO: Additionally, set measurement interval to 10000 to avoid timeout
    ## TODO: To differentiate the logfile, modify the --profile-export-file to something else. 
    genai-perf \
        -m meta/llama3-8b-instruct \
        --endpoint-type chat \
        --service-kind openai \
        --streaming \
        -u nim:8000 \
        --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
        --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
        --concurrency $concurrency \
        --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
        --extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \
        --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
        --extra-inputs ignore_eos:true \
        --measurement-interval 10000 \
        --profile-export-file ${INPUT_SEQUENCE_LENGTH}_${OUTPUT_SEQUENCE_LENGTH}.json \
        -- \
        --max-threads=256

done
```

</details>

<br><hr>

## **Plot Latency-Throughput Curves**

Now that the artifacts have been generated, you can read the values for `TTFT` and output tokens per second from the various concurrencies. Run the following cell to read the `profile_export_genai_perf.json` file for each concurrency and accumulate the output-tokens-per-second and time-to-first-token measurements:

In [10]:
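import json
import os

def parse_data(file_path, metric):
    """Return the average value of `metric` from a GenAI-Perf summary file."""
    with open(file_path, 'r') as file:
        data = json.load(file)
    return data[metric]['avg']

root_dir = "./artifacts"
directory_prefix = "meta_llama3-8b-instruct-openai-chat-concurrency"
TPS = []   # output tokens per second, one entry per concurrency
TTFT = []  # time to first token in ms, one entry per concurrency
for con in concurrencies:
    file = os.path.join(root_dir, directory_prefix+str(con), "profile_export_genai_perf.json")
    TPS.append(parse_data(file, "output_token_throughput"))
    # Divide by 1e6 to get milliseconds (this assumes the file reports nanoseconds)
    TTFT.append(parse_data(file, "time_to_first_token")/1e6)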

Let's use Plotly to plot the output tokens per second against `TTFT`, labeling each point with its concurrency:

In [11]:
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "iframe"

fig = px.line(x=TTFT, y=TPS, text=concurrencies)
fig.update_traces(textposition="top left")
fig.update_layout(xaxis_title="TTFT (ms)", yaxis_title="Output tokens per second")
fig.show()

This plot should be familiar after your work in the previous notebook. You can see that `TTFT` increases as the concurrency, displayed next to each dot, goes up. Increasing the concurrency leads to more output tokens per second and better GPU utilization, but `TTFT` is penalized. There are also diminishing returns: each increase in concurrency adds less throughput than the previous one.
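To put a number on how throughput gains shrink, one option is to compute the throughput gained per additional concurrent request between consecutive measurements. This is an illustrative sketch: the helper name `throughput_gain_per_step` is hypothetical, and it expects lists shaped like the `concurrencies` and `TPS` lists built earlier in this notebook.

```python
def throughput_gain_per_step(concurrencies, tps):
    """Return (from, to, extra tokens/s per added concurrent request) per step."""
    gains = []
    for i in range(1, len(concurrencies)):
        added_requests = concurrencies[i] - concurrencies[i - 1]
        added_throughput = tps[i] - tps[i - 1]
        gains.append((concurrencies[i - 1], concurrencies[i],
                      added_throughput / added_requests))
    return gains


# In this notebook: throughput_gain_per_step(concurrencies, TPS)
```

A shrinking third value from one step to the next indicates that extra concurrency is buying less and less throughput.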

We recommend that you plot the impact of concurrency not only on `TTFT`, but also on inter-token latency (`ITL`) and `E2E Latency`. Likewise, you can look at input tokens per second in addition to output tokens per second.
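One way to follow that recommendation is to generalize the loop above so that several metrics are collected at once. This is a sketch, not part of the original notebook: the helper `collect_metrics` is hypothetical, and the keys `inter_token_latency` and `request_latency` are assumed names for the inter-token and end-to-end latency entries, so check them against the keys in your own summary files.

```python
import json
import os


def collect_metrics(summary_paths, metrics):
    """Return {metric: [avg per file]} for the given summary files, in order."""
    collected = {metric: [] for metric in metrics}
    for path in summary_paths:
        with open(path, "r") as file:
            data = json.load(file)
        for metric in metrics:
            # None marks a metric that is absent from this file.
            collected[metric].append(data.get(metric, {}).get("avg"))
    return collected


# The first two keys are used earlier in this notebook; the last two are assumed.
metrics = ["time_to_first_token", "output_token_throughput",
           "inter_token_latency", "request_latency"]
root_dir = "./artifacts"
directory_prefix = "meta_llama3-8b-instruct-openai-chat-concurrency"
paths = [os.path.join(root_dir, directory_prefix + str(con), "profile_export_genai_perf.json")
         for con in [1, 2, 5, 10, 50, 100, 250]]

# Only read files that exist, so the cell also runs before the benchmark has finished.
existing = [p for p in paths if os.path.exists(p)]
results = collect_metrics(existing, metrics)
print({name: len(values) for name, values in results.items()})
```

Each list in `results` can then be passed to `px.line` in place of `TTFT` or `TPS`, after applying the same unit conversion used above where relevant.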

<br><hr>

## **[EXERCISE] Calculate the Necessary Number of GPUs**

The previous plot is useful to understand the tradeoff between latency and throughput in inference. If you choose a high concurrency, you can reach a higher throughput and consequently better GPU utilization. If you choose a low concurrency, the latency is lower. 
- For online applications where latency is important, a good approach is to define an upper limit for `TTFT`. For example, you can set a maximum `TTFT` of 1s for your chatbot application, so that the user waits at most 1s before starting to read the response. You can then choose the highest concurrency level that doesn't exceed that latency.
- For offline applications where latency is less of a concern, the concurrency can be higher. We recommend enabling in-flight batching (IFB) and selecting the highest reasonable concurrency (and ensuring that your connection client is built to sustain it).

For a quick exercise, let's suppose we are interested in an online application. We set the maximum `TTFT` to be 1s. In our setup, the highest level of concurrency that doesn't surpass a `TTFT` of 1s is 100, so we can start with that.
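The selection rule described above can be sketched as a small helper. The function name `highest_concurrency_within` is hypothetical; it expects lists shaped like the `concurrencies` and `TTFT` lists built earlier in this notebook (with `TTFT` in milliseconds) and returns `None` if no level meets the limit.

```python
def highest_concurrency_within(concurrencies, latencies_ms, limit_ms):
    """Return the largest concurrency whose latency stays at or below limit_ms."""
    eligible = [con for con, latency in zip(concurrencies, latencies_ms)
                if latency <= limit_ms]
    return max(eligible) if eligible else None


# In this notebook: highest_concurrency_within(concurrencies, TTFT, limit_ms=1000)
```

Because the helper only compares a latency list against a limit, the same call works for any other latency metric you collect per concurrency level.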

The number of output tokens per second can be retrieved as follows:

In [12]:
con = 100
file = os.path.join(root_dir, directory_prefix+str(con), f"profile_export_genai_perf.json")
TPS_100 = parse_data(file, "output_token_throughput")
print(f"For a level of concurrency of {con}, the output tokens per second is {TPS_100:.2f}")

For a level of concurrency of 100, the output tokens per second is 1858.85


Next, you can compute the number of prompts (requests) served per second. Let's assume that each request generates 50 output tokens on average, matching the `OUTPUT_SEQUENCE_LENGTH` used in the benchmark:

In [13]:
## TODO: calculate num prompts per second per GPU
PPS = 0
print(f"The prompts per second is {PPS:.2f}")

The prompts per second is 0.00


The final step in computing the number of GPUs is to estimate how many requests per second your system will receive. Suppose the estimate is 100 requests per second. Then, assuming that our NIM measurement was obtained with a single GPU, the number of GPUs needed can be extrapolated as follows:

In [14]:
## TODO: calculate num GPUs needed to handle 100 requests
n_gpus = 0
print(f"The number of GPUs needed to handle 100 requests per second is {n_gpus:.1f}")

The number of GPUs needed to handle 100 requests per second is 0.0


In [15]:
con = 100
file = os.path.join(root_dir, directory_prefix+str(con), f"profile_export_genai_perf.json")
TPS_100 = parse_data(file, "output_token_throughput")
print(f"For a level of concurrency of {con}, the output tokens per second is {TPS_100:.2f}")

## TODO: calculate num prompts per second per GPU
PPS = TPS_100 / 50
print(f"The prompts per second is {PPS:.2f}")

## TODO: calculate num GPUs needed to handle 100 requests
n_gpus = 100 / PPS
print(f"The number of GPUs needed to handle 100 requests per second is {n_gpus:.1f}")

For a level of concurrency of 100, the output tokens per second is 1858.85
The prompts per second is 37.18
The number of GPUs needed to handle 100 requests per second is 2.7


This is how you can estimate the number of GPUs needed for online applications for the specific tensor parallelism (TP) setting and [NIM model profile](https://docs.nvidia.com/nim/large-language-models/latest/utilities.html#list-available-model-profiles) used. If you get a fractional number of GPUs, round up to the nearest multiple of the TP size (for example, any whole number with TP1, a multiple of 2 with TP2, and so on). Under the single-GPU assumption above, 2.7 rounds up to 3 GPUs. You can repeat the measurements above for all the suitable NIM profiles to select the one that best fits your requirements.
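The estimate and the rounding rule can be combined into one helper. This is a sketch under the same assumptions as the exercise: throughput scales linearly with the number of model replicas, each replica occupies `tp` GPUs, and the measured throughput belongs to a single replica. The function name `gpus_needed` and its parameters are hypothetical.

```python
import math


def gpus_needed(target_rps, tokens_per_second, output_tokens_per_request, tp=1):
    """Estimate GPUs for target_rps, rounded up to a whole number of TP-sized replicas."""
    # Requests per second that one replica sustains at the measured throughput.
    requests_per_second = tokens_per_second / output_tokens_per_request
    replicas = math.ceil(target_rps / requests_per_second)
    return replicas * tp


# Values from the exercise above: 1858.85 output tokens/s, 50 tokens per request.
print(gpus_needed(target_rps=100, tokens_per_second=1858.85,
                  output_tokens_per_request=50))  # 3
```

Rounding up the replica count before multiplying by `tp` guarantees that the result is always a multiple of the TP size.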

**In the next notebook,** you will learn how to convert that number of GPUs into a **Total Cost of Ownership (TCO)** metric for both on-prem and in-the-cloud inference deployments.

<details>
<summary><b>Reveal Solution</b></summary>

```python
con = 100
file = os.path.join(root_dir, directory_prefix+str(con), f"profile_export_genai_perf.json")
TPS_100 = parse_data(file, "output_token_throughput")
print(f"For a level of concurrency of {con}, the output tokens per second is {TPS_100:.2f}")

## TODO: calculate num prompts per second per GPU
PPS = TPS_100 / 50
print(f"The prompts per second is {PPS:.2f}")

## TODO: calculate num GPUs needed to handle 100 requests
n_gpus = 100 / PPS
print(f"The number of GPUs needed to handle 100 requests per second is {n_gpus:.1f}")
```

</details>

<br>

Going forward, you can reuse this methodology for other applications and requirements. For example, instead of focusing on `TTFT`, consider an application that imposes a maximum `E2E Latency`. You would choose the highest concurrency level that satisfies that maximum, read the output tokens per second for that configuration, and then repeat the calculation above.
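When only `TTFT` and inter-token latency are at hand, a rough `E2E Latency` estimate follows from treating the inter-token latency as the average gap between consecutive output tokens after the first one. The sketch below assumes that definition; the function name is hypothetical and the example inputs are purely illustrative, not measurements.

```python
def estimate_e2e_latency_ms(ttft_ms, itl_ms, output_tokens):
    """Approximate E2E latency as TTFT plus one inter-token gap per later token."""
    return ttft_ms + itl_ms * (output_tokens - 1)


# Illustrative inputs only (not measured values): 200 ms TTFT, 20 ms ITL, 50 tokens.
print(estimate_e2e_latency_ms(200, 20, 50))  # 1180
```

An estimate like this helps narrow down which concurrency levels are worth checking against an `E2E Latency` limit, but the measured value from GenAI-Perf should be used for the final choice.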