# <font color="#76b900">**Notebook 2:** Benchmarking Throughput/Latency Tradeoffs</font>

Welcome back to the notebooks! In this course, we focus heavily on how quickly an LLM produces a response from the user's perspective. The process of measuring this speed is called **benchmarking**, which you will get the chance to do in the subsequent notebooks using a tool called [**GenAI-Perf**](https://github.com/triton-inference-server/client/tree/main/src/c%2B%2B/perf_analyzer/genai-perf). To get started, a set of pre-collected benchmarks is provided in [**dataset/nim.csv**](dataset/nim.csv) for you to explore throughout this notebook!

## Learning Objectives
By the end of this notebook, you will be able to:
- Explore the real measurements of the time to first token (TTFT), end-to-end latency (E2E Latency), and inter-token latency (ITL).
- Analyze throughput metrics and their dependencies on various factors.
- Investigate the effects of tensor parallelism and concurrency settings on latency and throughput.
- Utilize provided benchmarking data to make informed decisions about model deployment and scaling.

**Before starting this notebook, please make sure to watch its corresponding video.**

## Table of Contents

- [**Getting Started With The Dataset**](#Getting-Started-With-The-Dataset)
  - [Use-Case Definition Columns](#Use-Case-Definition-Columns)
  - [Benchmarking Results](#Benchmarking-Results)
  - [Investigating Benchmarked Data](#Investigating-Benchmarked-Data)
  - [**Plotting Latency vs Throughput**](#Plotting-Latency-vs-Throughput)
  - [Latency vs Throughput Tradeoff](#Latency-vs-Throughput-Tradeoff)
- [**Tensor Parallelism**](#Tensor-Parallelism)
  - [Benefits of Tensor Parallelism N](#Benefits-of-Tensor-Parallelism-N)
  - [**Scenario**: 1xTP2 vs 2xTP1](#Scenario:-1xTP2-vs-2xTP1)
  - [**Scenario**: Fluctuating Tokens-Per-Second](#Scenario:-Fluctuating-Tokens-Per-Second)
  - [**Scenario**: TTFT Incorporated Into Request](#Scenario:-TTFT-Incorporated-Into-Request)
  - [**Scenario**: FP8 vs FP16](#Scenario:-FP8-vs-FP16)
- [**[EXERCISE] A Use-Case-Based Sizing Example**](#[EXERCISE]-A-Use-Case-Based-Sizing-Example)
- [**Sizing Best Practices**](#Sizing-Best-Practices)
  - [NVIDIA Software Stack](#NVIDIA-Software-Stack)
  - [Model Size and Hardware Considerations](#Model-Size-and-Hardware-Considerations)
  - [Streaming vs. Sequential Mode](#Streaming-vs.-Sequential-Mode)
  - [Other Performance Considerations](#Other-Performance-Considerations)



<br><hr>

## **Getting Started With The Dataset**

The provided [**dataset/nim.csv datasheet**](dataset/nim.csv) is a simple CSV file with each row corresponding to a benchmark run against an older version of NVIDIA NIM. Most of the columns are described below. Some of the names have a shortened form used in code to tidy up the processing, so below we indicate two names for many columns. For details, refer to the [**GenAI-Perf documentation on metrics**](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/README.html#metrics) and [**its source code**](https://github.com/triton-inference-server/client/blob/db888f1aca588a10f5e4a4b02a4e4ff60d437b6f/src/c%2B%2B/perf_analyzer/genai-perf/genai_perf/profile_data_parser/llm_profile_data_parser.py#L76).

Below are useful definitions to help interpret the datasheet. Feel free to [skip ahead for now](#skip-definitions) and come back when you need them.

### Use-Case Definition Columns
* `'task_inputs-model'`, `'model'`: The LLM being tested.
* `'task_inputs-execution_type'`, `'execution_mode'`: We will be testing only `NIM_TRT-LLM`.
* `'task_inputs-precision'`, `'precision'`: In our case, `FP16` or `FP8`. Thanks to the hardware acceleration for FP8 in the Hopper architecture, reducing the precision can bring a significant speedup while keeping the accuracy close to the original.
* `'input_config-synthetic_input_tokens_mean'`, `input_len`: We are using a synthetic dataset generator, built into GenAI-Perf. This is the target mean of the distribution of the LLM inputs in tokens.
* `'input_config-synthetic_input_tokens_stddev'`: This is the target standard deviation of the input lengths in tokens.
* `'input_config-output_tokens_mean'`, `output_len`: For benchmarking, we force the LLM to generate a specific number of tokens and ignore End Of Sequence (EOS) tokens. This is a simple trick to simulate the responses, similar in length to the real ones, even without using a real data subsample. Real datasets are also supported by GenAI-Perf.
* `'task_inputs-n_gpus'`, `n_gpus`: Number of GPUs. In our case, the model is parallelized across GPUs using tensor parallelism only (TP). See [the *Tensor Parallelism* section](#Tensor-Parallelism) for details.
* `'task_inputs-GPU'`, `'device'`: The type of the GPU the benchmark has been using. In our case, the benchmark has been run on a DGX A100 with `A100-SXM4-80GB` and on a DGX H100 with `H100_80GB_HBM3`.
* `'input_config-concurrency'`, `'concurrency'`: Concurrency. With `concurrency=N`, GenAI-Perf attempts to send inference requests to the server such that N requests are always outstanding during profiling. For example, when using 4, GenAI-Perf will attempt to have 4 outgoing inference requests at all times during profiling. At the start of benchmarking, it immediately sends 4 requests and then, as soon as any request is completed, it sends a new one.
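
The dispatch pattern described for `concurrency=N` can be sketched as a small discrete-event simulation. This only illustrates the closed-loop idea and is not how GenAI-Perf is implemented; the helper `simulate_closed_loop` and the fixed service times are hypothetical.

```python
import heapq

def simulate_closed_loop(concurrency, service_times):
    """Return the send time of each request when `concurrency` requests
    are kept outstanding and a new one is sent as soon as any completes.

    service_times[i] is a made-up processing time for request i.
    """
    send_times = []
    in_flight = []  # min-heap of completion times of outstanding requests
    now = 0.0
    for service_time in service_times:
        if len(in_flight) == concurrency:
            # All slots are busy: wait for the earliest completion.
            now = heapq.heappop(in_flight)
        send_times.append(now)
        heapq.heappush(in_flight, now + service_time)
    return send_times

# Six requests that each take 2 s, with 4 requests kept in flight:
send_times = simulate_closed_loop(concurrency=4, service_times=[2.0] * 6)
print(send_times)
```

The first four requests go out immediately; the remaining ones are sent only when a slot frees up.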

### Benchmarking Results
* `'time_to_first_token-avg'`, `'latency_first_token'`: Average time between when a request is sent and when its first response is received, one value per request in the benchmark. As for many following columns, the unit is specified in a separate column: `'time_to_first_token-unit'`. In our case, this is `ns`, nanoseconds, $10^{-9}$ s, but to simplify understanding we convert `'latency_first_token'` to ms, milliseconds, $10^{-3}$ s. Some additional percentiles of the same distribution are also measured. In our dataset, we see the 95th percentile p95 in `'time_to_first_token-p95'`. `min` and `max` are also measured, but are not present in our dataset.
* `'inter_token_latency-avg'`, `'latency_per_token_decoding'`: Average time between intermediate responses for a single request divided by the number of generated tokens of the latter response, one value per response per request in the benchmark.
* `'request_latency-avg'`, `'latency'`: Time between when a request is sent and when its final response is received, one value per request in the benchmark. Also referred to as end-to-end or e2e latency.
* `'request_throughput-avg'`, `'prompts_per_s'`: Number of final responses from benchmark divided by benchmark duration.
* `'output_token_throughput-avg'`, `'out_tokens_per_s'`: Total number of output tokens from benchmark divided by benchmark duration.
* `'num_output_token-avg'`: Average total number of output tokens of a request, one value per request in the benchmark. Note that these are counted with a default tokenizer. This allows models to be compared with each other, but the exact token count may be skewed due to the specifics of the tokenizers used in the models.
* `'num_input_token-avg'`: Average total number of input tokens of a request, one value per request in the benchmark.
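
As a rough sanity check on how these columns relate, the sketch below converts nanoseconds to milliseconds and approximates the E2E latency as TTFT plus one inter-token latency per remaining output token. The decomposition is an assumption about how ITL is averaged, and the numbers are illustrative rather than taken from `nim.csv`; the function names are hypothetical.

```python
NS_PER_MS = 1_000_000

def ns_to_ms(value_ns):
    """Convert nanoseconds to milliseconds."""
    return value_ns / NS_PER_MS

def approx_e2e_latency_ms(ttft_ms, itl_ms, num_output_tokens):
    """Approximate E2E latency: first token, then one ITL per remaining token."""
    return ttft_ms + itl_ms * (num_output_tokens - 1)

# Illustrative values only.
ttft_ms = ns_to_ms(250_000_000)  # 250 ms
itl_ms = ns_to_ms(20_000_000)    # 20 ms
e2e_ms = approx_e2e_latency_ms(ttft_ms, itl_ms, num_output_tokens=200)
print(f"TTFT {ttft_ms:.0f} ms, ITL {itl_ms:.0f} ms, E2E ~{e2e_ms:.0f} ms")
```

With a long output, the decoding term dominates the E2E latency, which is why TTFT and E2E latency are reported separately.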

<a id="skip-definitions"></a>

More details on the reasoning behind the metrics used and comparison to LLMPerf can be found in [**the NIM benchmarking guide**](https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html). We also provide some overview of using request rate vs concurrency during the measurement phase and in the results at the end of this notebook.

<br><hr>

### Investigating Benchmarked Data

In [1]:
# all the imports. To expand, click on the blue vertical bar to the left
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "iframe"
from plotly.subplots import make_subplots

import pandas as pd
import numpy as np
from ipywidgets import interact, fixed, IntSlider, IntText
import ipywidgets as widgets
import glob
from utils.config import config
import utils.preprocess_data

In [2]:
df_raw = utils.preprocess_data.load_csvs_by_glob("dataset/nim.csv")
print(f"Loaded {len(df_raw)} rows")

Loaded 787 rows


In [3]:
from utils.preprocess_data import preprocess_latencies_nim_pbr
df = preprocess_latencies_nim_pbr(df_raw)

In [4]:
df.columns

Index(['model', 'device', 'n_gpus', 'out_tokens_per_s', 'latency_first_token',
       'latency_per_token_decoding', 'input_len', 'output_len', 'concurrency',
       'precision', 'prompts_per_s', 'latency', 'out_tokens_per_s_per_user',
       'TP', 'PP', 'input_output_len', 'prompts_per_s_per_gpu',
       'prompts_per_s_per_8_gpus', 'out_tokens_per_s_per_gpu',
       'out_tokens_per_s_per_8_gpus', 'concurrency_per_8'],
      dtype='object')

In [5]:
for c in ['model', 'device', 'precision', 'TP', 'input_output_len']:
    print(f"{c}: {list(df[c].unique())}")

model: ['meta-llama3-8b-instruct', 'meta-llama3-70b-instruct']
device: ['A100-SXM4-80GB', 'H100_80GB_HBM3']
precision: ['fp16', 'fp8', 'bf16']
TP: [2, 1, 8, 4]
input_output_len: ['200 in → 1000 out', '2000 in → 2000 out', '200 in → 2000 out', '7000 in → 1000 out', '2000 in → 200 out', '200 in → 200 out']


<br><hr>

### **Plotting Latency vs Throughput**

This plot is crucial for benchmarking. It displays the tradeoff between latency and throughput.

#### **Plot Description**

* The x-axis shows a unit of latency.
* The y-axis shows a unit of throughput.
* Each point on the plot represents a measurement taken with the same underlying model, device, and max_batch_size.
* Points that share the same grouping values (for example, TP and precision) are joined with a line.
* Within one line, the points differ only in the measurement concurrency and are ordered by it.

#### **Plot Interpretation**

* The best points are the closest to the top-left corner.
* A higher point indicates greater throughput.
* A point farther to the left indicates lower latency.
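
One way to make "closest to the top-left corner" concrete is to keep only the points that no other point beats on both axes (a Pareto front). This is a minimal sketch on made-up `(latency, throughput)` pairs; `pareto_front` is a hypothetical helper and is not part of the notebook's `utils`.

```python
def pareto_front(points):
    """Keep (latency, throughput) points that are not dominated.

    A point is dominated if another point has lower-or-equal latency and
    higher-or-equal throughput, with at least one of them strictly better.
    """
    front = []
    best_throughput = float("-inf")
    # Sort by latency ascending; for equal latency, highest throughput first.
    for latency, throughput in sorted(points, key=lambda p: (p[0], -p[1])):
        if throughput > best_throughput:
            front.append((latency, throughput))
            best_throughput = throughput
    return front

# Made-up measurements: (latency in ms, throughput in prompts/s)
measurements = [(100, 5.0), (150, 12.0), (160, 11.0), (300, 30.0), (320, 30.0), (900, 31.0)]
print(pareto_front(measurements))
```

Any point that is dropped here has an alternative that is at least as fast and delivers at least as much throughput.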

You will have the opportunity to explore these plots interactively. Later, a selection of plots will be provided to illustrate specific points.

In [6]:
def scatter(df, filters, index_columns, x_metric, y_metric):
    def compare_with_nan(series, value):
        return series.isna() if pd.isna(value) else (series == value)
    df_measured = df.copy()
    df_measured = df_measured[df_measured[list(filters)].eq(pd.Series(filters)).all(axis=1)]
    index_set = list(df_measured[index_columns].groupby(index_columns, dropna=False).first().index)
    if len(index_columns) == 1: 
        index_set = [(v,) for v in index_set]
    index_set.reverse()
    fig = make_subplots()
    df_measured = df_measured.sort_values("concurrency")
    for i, index_value in enumerate(index_set):
        index_filters = { k: v for k, v in zip(index_columns, index_value) }
        boolean_series_list = [compare_with_nan(df_measured[col], val) for col, val in index_filters.items()]
        df_filtered = df_measured[np.logical_and.reduce(boolean_series_list)].sort_values(by="concurrency")
        hover = [
            ""
            + f"{row['TP']=} <br>"
            + f"{row['prompts_per_s_per_8_gpus']=:.1f} <br>"
            + f"{row['concurrency']=} <br>"
            + f"{row['concurrency_per_8']=} <br>"
            + f"{row['input_len']=} <br>"
            + f"{row['output_len']=} <br>"
            + f"{row['latency']=:.1f} <br>"
            + f"{row['latency_first_token']=:.1f} <br>"
            + f"{row['model']=} <br>"
            # + f"batch scheme: {row['batch scheme']} <br>"
            for _, row in df_filtered.iterrows()
        ]
        trace = go.Scatter(
            x = df_filtered[x_metric],
            y = df_filtered[y_metric],
            name = ", ".join(f"{k} {v}" for k,v in index_filters.items()),
            customdata = hover,
            hovertemplate = "%{customdata}",
            marker={
                "size": 10, 
                "color": px.colors.qualitative.G10[i % len(px.colors.qualitative.G10)],  # wrap around if there are more traces than colors
                "opacity": 0.7
            },
        )

        fig.add_trace(trace)
    fig.update_xaxes(title_text=config.columns_labels[x_metric], type="log")
    fig.update_yaxes(title_text=config.columns_labels[y_metric])
    fig.update_layout(title=f",<br>".join(f"{k} {v}" for k,v in filters.items()))
    fig.update_layout(margin_t=150)
    fig.update_layout(title_pad_l=40)

    fig.show()

def scatter_fixed_lengths(df, model, device, input_output_len, x_metric="latency_first_token", y_metric="prompts_per_s"):
    filters = {
        "model": model,
        "device": device,
        "input_output_len": input_output_len,
    }
    index_columns = ["TP", "precision"]
    return scatter(df, filters, index_columns, x_metric, y_metric)

In [7]:
interact(
    scatter_fixed_lengths, 
    df = fixed(df),
    model = df['model'].unique(),
    device = df['device'].unique(),
    input_output_len = df['input_output_len'].unique(),
    x_metric = ["latency_per_token_decoding", "latency_first_token", "latency"],
    y_metric = ["out_tokens_per_s", "prompts_per_s", "prompts_per_s_per_8_gpus", "out_tokens_per_s_per_user", "out_tokens_per_s_per_8_gpus"],
)


<br><hr>

### Latency vs Throughput Tradeoff

The plots above illustrate a crucial tradeoff: **To achieve low latency for each individual request, you must compromise on throughput and vice versa.**

#### Example
* With concurrency set to 250, throughput can be up to 50 times higher than with concurrency set to 1.
* Meanwhile, latency is only 5 times higher.
* By relaxing latency requirements, you can significantly improve throughput and reduce Total Cost of Ownership (TCO).
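
The numbers in this example are consistent with Little's law for a closed-loop benchmark: with N requests always in flight, request throughput is roughly N divided by the average E2E latency. The sketch below assumes a made-up baseline latency of 2 s at concurrency 1; it is not a value from the dataset.

```python
def closed_loop_throughput(concurrency, latency_s):
    """Little's law: requests/s = requests in flight / average latency."""
    return concurrency / latency_s

base_latency_s = 2.0  # hypothetical E2E latency at concurrency 1

throughput_c1 = closed_loop_throughput(1, base_latency_s)
# At concurrency 250 the latency is assumed to be 5x higher.
throughput_c250 = closed_loop_throughput(250, 5 * base_latency_s)

speedup = throughput_c250 / throughput_c1
print(f"Throughput gain: {speedup:.0f}x for 5x higher latency")
```

Because the same hardware serves about 50 times more requests per second, the hardware cost per request drops by roughly the same factor.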


<br><hr>

## **Tensor Parallelism**

LLM scaling laws suggest that models will keep getting larger.
Tensor parallelism (TP) plays a crucial role in accommodating them: it partitions a model across multiple GPUs. Even if a model fits on a single GPU, increasing the number of GPUs can provide benefits.

#### **Benefits of Tensor Parallelism N**

* N times lower memory footprint per GPU.
* N times increase in memory bandwidth.
* N times compute resources for the model.
* Same precision, same accuracy.
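
The first benefit can be illustrated with simple arithmetic on the weight memory. This sketch assumes weights are sharded evenly across GPUs, uses a round 70B parameter count, and counts weights only (KV cache and activations come on top); `weights_gb_per_gpu` is a hypothetical helper.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "fp8": 1}

def weights_gb_per_gpu(num_params_billion, precision, tp):
    """Weight memory per GPU in GB (1 GB = 1e9 bytes), weights only."""
    total_gb = num_params_billion * BYTES_PER_PARAM[precision]
    return total_gb / tp

for tp in (1, 2, 4, 8):
    gb = weights_gb_per_gpu(70, "fp16", tp)
    print(f"TP{tp}: {gb:.1f} GB of weights per GPU")
```

Under these assumptions, a 70B model in FP16 does not fit on a single 80 GB GPU, while its weights alone fit once the model is split across two or more.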

#### **Scenario:** 1xTP2 vs 2xTP1

If a model fits on one GPU, deploying it in TP2 mode with 2 GPUs doubles the memory bandwidth and compute resources available to each request, which reduces latency. This is particularly beneficial for individual requests without batching.

Running a batch of 2 requests in TP2 mode provides similar resources to running batch size 1 in 2 instances of TP1. However, TP2 mode incurs additional overhead due to communication between GPUs.

Measurements highlight the importance of a low-latency GPU interconnect for inference speeds of larger models. For optimal performance, consider using:

* NVLink-enabled servers like DGXes and HGXes
* Pair-wise connected PCIe cards like H100 NVL

The corresponding data has been provided for further exploration.

In [8]:
scatter(df, {
        "model": "meta-llama3-8b-instruct",
        "precision": "fp8",
        "input_len": 7000,
        "device": 'H100_80GB_HBM3',
    }, ["TP"],
    x_metric="latency_first_token",
    y_metric="prompts_per_s_per_8_gpus",
)

Achieving the lowest possible Time-To-First-Token (TTFT) often requires a higher TP. However, if your acceptable latency is already met with a lower TP, you can often achieve higher throughput per GPU while still meeting your latency target.

#### **Scenario:** Fluctuating Tokens-Per-Second
The plot below demonstrates that even for a single model, TP, and precision, the throughput (measured in `output_tokens/second/model_instance`) is not constant.

In [9]:
scatter(df, {
        "model": "meta-llama3-8b-instruct",
        "precision": "fp16",
        "TP": 1,
        "device": 'H100_80GB_HBM3',
    }, ["input_output_len"],
    x_metric="latency",
    y_metric="out_tokens_per_s",
)

When measuring throughput, it's crucial to consider the accompanying measurement parameters to ensure accurate and meaningful results:

* Input and output length
* GPU specifications
* TP (tensor parallelism) settings
* Concurrency levels
* Precision settings

#### **Scenario:** TTFT Incorporated Into Request
The ratio between TTFT and E2E Latency is an important consideration. A lower ratio means the first token arrives long before the full response is complete, so implementing streaming as part of the app pays off even more.

In [10]:
df["ttft_ratio"] = df["latency_first_token"] / df["latency"]
scatter(df, {
        "model": "meta-llama3-8b-instruct",
        "device": 'H100_80GB_HBM3',
        "precision": "fp16",
        "TP": 1,
    },
    ["input_output_len"],
    x_metric="latency",
    y_metric="ttft_ratio",
)

A distinctive point is recognizable in the plots where the values suddenly skyrocket. This indicates the formation of a queue. As the time in the queue grows relative to the actual request processing time, the TTFT to E2E latency ratio approaches 1. This is because the time spent in the queue is included in the TTFT measurement.
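
The effect of queueing on the ratio can be sketched with a simple model in which TTFT is the queue time plus the prefill time, and E2E latency adds the decoding time on top. The split into these three terms and all of the numbers are illustrative assumptions.

```python
def ttft_ratio(queue_s, prefill_s, decode_s):
    """TTFT / E2E latency when time spent in the queue counts toward TTFT."""
    ttft = queue_s + prefill_s
    e2e = ttft + decode_s
    return ttft / e2e

prefill_s, decode_s = 0.2, 3.8  # hypothetical processing times
for queue_s in (0.0, 4.0, 36.0, 396.0):
    ratio = ttft_ratio(queue_s, prefill_s, decode_s)
    print(f"queue={queue_s:6.1f} s -> ttft_ratio={ratio:.2f}")
```

With no queue the ratio is small because decoding dominates; as the queue time grows, the same processing time becomes a vanishing share of the total and the ratio tends to 1.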

#### **Scenario:** FP8 vs FP16
NVIDIA GPUs have supported FP8 precision since the Hopper generation, as noted in the [**Hopper whitepaper**](https://resources.nvidia.com/en-us-tensor-core), page 23, Hopper FP8 Data Format. This halves data storage requirements and doubles throughput when compared to FP16.

**The key benefits of FP8-precision use include:**

* **Reduced Storage Requirements:** FP8 requires less storage space compared to FP16. 
* **Increased Throughput:** FP8 doubles throughput, enabling faster processing. 
* **Minimized Accuracy Loss:** The Transformer Engine dynamically scales tensors to minimize accuracy loss when using FP8.

In this plot below, consider the benchmarking difference between the two approaches:

In [11]:
scatter(df, {
        "model": "meta-llama3-8b-instruct",
        "device": 'H100_80GB_HBM3',
        "input_output_len": '2000 in → 200 out',
        "TP": 1,
    },
    ["precision"],
    x_metric="latency_first_token",
    y_metric="prompts_per_s_per_8_gpus",
)

As you can see, FP8 is delivering around 2X the LLM throughput at the same latency. However, this comes with an accuracy drop. The decrease in accuracy from FP16 to FP8 is generally within 1 MMLU (Massive Multitask Language Understanding) point. For a more detailed analysis, refer to the following resources:
* [**TRT-LLM blog post on quantization**](https://nvidia.github.io/TensorRT-LLM/blogs/quantization-in-TRT-LLM.html)
* [**TRT-LLM quantization documentation**](https://nvidia.github.io/TensorRT-LLM/reference/precision.html)
* To further measure the impact of quantization on your own dataset, consider using [**NeMo Evaluator Microservice**](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/overview.html).

<br><hr>

## **[EXERCISE] A Use-Case-Based Sizing Example**

Explore how to use throughput vs. latency tradeoff plots to answer the sizing question with a real-world example: a call center [**RAG Customer Assistant**](https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/).

**Let's assume the following:**
* We have 30M calls per year
* Each call takes 10 minutes
* We call a RAG system every minute during the call
* Each RAG system call results in 3 requests to the LLM
* The peak number of requests per second is 7X the average

In [12]:
calls_per_year = 30_000_000
requests_per_call = 10 * 3
seconds_in_year = 365 * 24 * 60 * 60

requests_per_year = requests_per_call * calls_per_year
requests_per_second_avg = requests_per_year / seconds_in_year
requests_per_second_peak = 7 * requests_per_second_avg

print(f"{requests_per_second_avg  = :.0f}")
print(f"{requests_per_second_peak = :.0f}")

requests_per_second_avg  = 29
requests_per_second_peak = 200


Now, let's recap the list of inputs required for appropriate sizing and answer each of them for this use case.

* **What model are you planning to use?**
    - Llama3-8B
* **What is the average number of tokens in the prompt to your LLM (Length of input)?**
    - 2000 tokens in
* **What is the average number of tokens in your LLM output?**
    - 200 tokens out
* **How many requests per second should your full system process at its peak?**
    - 200 requests per second
* **What is your latency limit? First-token? Last-token?**
    - 300 ms first-token latency (TTFT). We synthesize speech as a stream.
* **What GPUs are you considering?**
    - As we're planning for high-load scenarios, we will benefit from utilizing multiple H100s with their price/perf ratio.

To size for this use case, let's plan to use our favorite plot of throughput in "prompts/second/8 GPUs" vs "first-token latency".

In [13]:
scatter_fixed_lengths(df,
    model = 'meta-llama3-8b-instruct',
    device = 'H100_80GB_HBM3',
    input_output_len = '2000 in → 200 out',
    x_metric = "latency_first_token",
    y_metric =  "prompts_per_s_per_8_gpus",
)

Let's select a point that meets our criteria and focus on the relevant parameters. We can see from the plot that the most efficient full deployment will consist of 8 instances of the same model, deployed on one DGX H100 server with 8 H100 GPUs. By hovering over the top point on the curve with TTFT below 300ms, you will see the details from the benchmarking results.

<details>
<summary><b>Reveal Solution</b></summary>

<img src="images/hover.png" alt="hover" width=1200px/>

```
concurrency=10
prompts_per_second_per_8_gpus=41.6
```

</details>
<br>

Knowing the `concurrency` and `prompts_per_second_per_(8_gpus AKA dgx)`, we can compute the number of DGX units needed to make up our target system.

In [14]:
## TODO: Calculate target system/8gpu = number of dgxs
def get_num_dgxs(target_prompt_per_second, prompts_per_second_per_dgx):
    """
    arg1: target prompts/second/sys
    arg2: prompts/second/8gpu
    return: target system/8gpu
    """
    return 0

peak_throughput = requests_per_second_peak # = 200
## TODO: Retrieve statistic from table for a sufficient first-token latency
throughput_per_dgx_h100_220ms = 0  ## TODO: Populate for TTFT ~ 220ms
throughput_per_dgx_h100_500ms = 0  ## TODO: Populate for TTFT ~ 500ms

num_dgxs_220ms = get_num_dgxs(peak_throughput, throughput_per_dgx_h100_220ms)
num_dgxs_500ms = get_num_dgxs(peak_throughput, throughput_per_dgx_h100_500ms)

print(f"Target Latency of ~220ms: {num_dgxs_220ms:.1f} DGX Units")
print(f"Target Latency of ~500ms: {num_dgxs_500ms:.1f} DGX Units")

Target Latency of ~220ms: 0.0 DGX Units
Target Latency of ~500ms: 0.0 DGX Units


In [15]:
def get_num_dgxs(target_prompt_per_second, prompts_per_second_per_dgx):
    """
    arg1: target prompts/second/sys
    arg2: prompts/second/8gpu
    return: target system/8gpu
    """
    return target_prompt_per_second / prompts_per_second_per_dgx

peak_throughput = requests_per_second_peak # = 200
## TODO: Retrieve statistic from table for a sufficient first-token latency
throughput_per_dgx_h100_220ms = 41.6
throughput_per_dgx_h100_500ms = 70.7

num_dgxs_220ms = get_num_dgxs(peak_throughput, throughput_per_dgx_h100_220ms)
num_dgxs_500ms = get_num_dgxs(peak_throughput, throughput_per_dgx_h100_500ms)

print(f"Target Latency of ~220ms: {num_dgxs_220ms:.1f} DGX Units")
print(f"Target Latency of ~500ms: {num_dgxs_500ms:.1f} DGX Units")

Target Latency of ~220ms: 4.8 DGX Units
Target Latency of ~500ms: 2.8 DGX Units


So for this use case, the closest benchmark point that meets our target latency (~220 ms TTFT) calls for 4.8 DGX systems. Since systems come in whole units, that means deploying 5; 4 would fall a bit short at peak load. Notice also how relaxing the requirement to just 500 ms TTFT brings that number all the way down to 2.8 (3 in practice) for the same number of customers.
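
Since only whole systems can be deployed, the fractional result is rounded up in practice. A minimal sketch using the rounded peak load and the two throughput values from the exercise; `dgx_units_to_deploy` is a hypothetical helper.

```python
import math

def dgx_units_to_deploy(peak_requests_per_s, prompts_per_s_per_dgx):
    """Round the fractional DGX count up to whole systems."""
    return math.ceil(peak_requests_per_s / prompts_per_s_per_dgx)

peak = 200  # peak requests/second from the exercise (rounded)
for ttft_ms, per_dgx in ((220, 41.6), (500, 70.7)):
    units = dgx_units_to_deploy(peak, per_dgx)
    print(f"TTFT ~{ttft_ms} ms: {units} DGX systems ({units * 8} GPUs)")
```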

<details>
<summary><b>Reveal Solution</b></summary>

```python 
## TODO: Calculate target system/8gpu = number of dgxs
def get_num_dgxs(target_prompt_per_second, prompts_per_second_per_dgx):
    """
    arg1: target prompts/second/sys
    arg2: prompts/second/8gpu
    return: target system/8gpu
    """
    return target_prompt_per_second / prompts_per_second_per_dgx

peak_throughput = requests_per_second_peak # = 200
## TODO: Retrieve statistic from table for a sufficient first-token latency
throughput_per_dgx_h100_220ms = 41.6
throughput_per_dgx_h100_500ms = 70.7

num_dgxs_220ms = get_num_dgxs(peak_throughput, throughput_per_dgx_h100_220ms)
num_dgxs_500ms = get_num_dgxs(peak_throughput, throughput_per_dgx_h100_500ms)

print(f"Target Latency of ~220ms: {num_dgxs_220ms:.1f} DGX Units")
print(f"Target Latency of ~500ms: {num_dgxs_500ms:.1f} DGX Units")
```

</details>

<hr><br>

## **Sizing Best Practices**
To effectively estimate sizing for your project, the following best practices have been identified:

#### NVIDIA Software Stack

* Use the NVIDIA software stack, including: [NIM](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main) and [Triton Inference Server](https://github.com/triton-inference-server).
* Utilize [GenAI-Perf](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/README.html) and [the NIM benchmarking guide](https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html) to measure the performance.

#### Model Size and Hardware Considerations

* For models larger than 13B that require more than one GPU, prefer NVLink-enabled systems.
* Larger models require more memory and have higher latency, scaling approximately with the model size.

#### Streaming vs. Sequential Mode

* Apps where the LLM response is consumed directly should be developed in streaming mode.
* Apps where the LLM response is consumed by another LLM or sequential system require sequential mode.
* When designing apps from scratch, develop them to rely on streaming mode. For older apps, sequential mode may be used as a workaround to introduce LLMs.

#### Other Performance Considerations

* The cost and latency are usually dominated by the number of output tokens, as indicated by `ttft_ratio`. Input tokens are much cheaper.
* TTFT is determined by the input length in streaming mode.
* Generating text is almost always faster than human reading speed.
* Introducing strict latency limits can significantly decrease available throughput.

<br>

---

<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>