# GenAI-Perf -> NIM LLM TCO Calculator Data Connector

This notebook shows how to benchmark LLM performance with the NVIDIA GenAI-Perf tool and export the results to an Excel spreadsheet, from which the data can be transferred to the NIM LLM [TCO calculator spreadsheet](https://docs.google.com/spreadsheets/d/1UF_sy89kcLIkdnK0dC-6QwcAgVDUV0ANJ22JnC2dW7g/edit?gid=0#gid=0).

Note: the NIM LLM TCO calculator is implemented as a Google spreadsheet. Please make a private copy for your own use.


To execute this notebook, you can use the NVIDIA PyTorch container:
```
docker run --gpus=all --ipc=host --net=host --rm -it -v $PWD:/myworkspace nvcr.io/nvidia/pytorch:25.03-py3 bash  
```

Then, from within the interactive Docker session:
```
jupyter lab --ip 0.0.0.0 --port=8888 --allow-root --notebook-dir=/myworkspace
```

First, we define some metadata fields describing the deployment environment.

**Notes:**
- The NIM engine ID provides both the backend type (e.g. TensorRT-LLM, vLLM or SGLang) and the precision. You can find this information when the NIM container starts.

- This notebook collects data for a single deployment environment, described by the metadata fields below.

In [1]:
meta_field = {
 'Model': "meta-llama/Meta-Llama-3-8B-Instruct",
 'GPU Type': "H100_80GB",
 'number_of_gpus': 1,
 'Precision': "BF16",
 'Execution Mode': "NIM-TRTLLM",
}


## Prerequisites

First, we install the GenAI-Perf tool in the PyTorch container.
As a client-side LLM-focused benchmarking tool, NVIDIA GenAI-Perf provides key metrics such as time to first token (TTFT), inter-token latency (ITL), tokens per second (TPS), requests per second (RPS) and more. GenAI-Perf also supports any LLM inference service conforming to the OpenAI API specification, a widely accepted de facto standard in the industry. For this benchmarking guide, we’ll use NVIDIA NIM, a collection of inference microservices that offer high-throughput and low-latency inference for both base and fine-tuned LLMs. NIM features ease-of-use and enterprise-grade security and manageability. 
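
To make these metrics concrete, the sketch below computes them from per-request token timestamps. It is an illustration only, not GenAI-Perf's implementation: the `RequestTrace` record and the helper names are hypothetical, and the definitions used (ITL as the mean gap between consecutive output tokens, throughput as totals divided by the wall-clock window) are common conventions assumed here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Hypothetical record of one streamed request (all times in seconds)."""
    start: float              # when the request was sent
    token_times: list[float]  # arrival time of each output token


def time_to_first_token(trace):
    """TTFT: delay between sending the request and receiving the first token."""
    return trace.token_times[0] - trace.start


def inter_token_latency(trace):
    """ITL: mean gap between consecutive output tokens of one request."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return mean(gaps)


def throughput(traces):
    """Return (requests per second, output tokens per second) over the whole run."""
    window = max(t.token_times[-1] for t in traces) - min(t.start for t in traces)
    rps = len(traces) / window
    tps = sum(len(t.token_times) for t in traces) / window
    return rps, tps


# Two made-up requests sent at t=0, each producing four tokens
traces = [
    RequestTrace(start=0.0, token_times=[0.5, 1.0, 1.5, 2.0]),
    RequestTrace(start=0.0, token_times=[1.0, 2.0, 3.0, 4.0]),
]
print(time_to_first_token(traces[0]), inter_token_latency(traces[0]))
print(throughput(traces))
```

Because these quantities are measured on the client side, they include network and serialization overhead in addition to the time the server spends on inference.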

### Install the GenAI-Perf tool

In [None]:
%%bash
pip install genai-perf==0.0.12

### Setting up a NIM LLM server (optional)

If you don't already have a benchmarking target, such as an OpenAI-compatible LLM service, let's set one up.

NVIDIA NIM provides the easiest and quickest way to put LLMs and other AI foundation models into production. Read [A Simple Guide to Deploying Generative AI with NVIDIA NIM](https://developer.nvidia.com/blog/a-simple-guide-to-deploying-generative-ai-with-nvidia-nim/) or consult the latest [NIM LLM documentation](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html) to get started, which will walk you through hardware requirements and prerequisites, including NVIDIA NGC API keys.

For convenience, the following commands from the [Getting Started Guide](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) deploy a NIM LLM server:

```
export NGC_API_KEY=<YOUR_NGC_API_KEY> 

# Choose a container name for bookkeeping
export CONTAINER_NAME=llama-3.1-8b-instruct

# Choose an LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/meta/${CONTAINER_NAME}:latest"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=./cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
```
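
Before benchmarking, it can help to confirm that the server is reachable and to check the model name it serves, since that name has to match the `-m` argument passed to GenAI-Perf. The sketch below assumes the server listens on `localhost:8000` (as in the `docker run` command above) and exposes the OpenAI-style `/v1/models` route; the helper names are illustrative.

```python
import json
import urllib.request


def served_model_ids(payload):
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url="http://localhost:8000", timeout=5.0):
    """Query the server for its models.

    Raises URLError/OSError while the server is not reachable yet,
    so callers may want to retry until the container has finished loading.
    """
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return served_model_ids(json.load(resp))
```

Once the container has finished starting, `fetch_model_ids()` is expected to return a list that includes the served model name, which for the image above should correspond to the `MODEL` value used in the benchmarking script.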


## Performance benchmarking script

The next step is to define the use cases (i.e. input/output sequence length scenarios) and carry out the benchmarking.

In [7]:
%%writefile benchmark.sh
declare -A useCases

# Populate the array with use case descriptions and their specified input/output lengths
useCases["Translation"]="200/200"
useCases["Text classification"]="200/5"
useCases["Text summary"]="1000/200"
useCases["Code generation"]="200/1000"

# Function to execute genAI-perf with the input/output lengths as arguments
runBenchmark() {
    local description="$1"
    local lengths="${useCases[$description]}"
    IFS='/' read -r inputLength outputLength <<< "$lengths"

    echo "Running genAI-perf for $description with input length $inputLength and output length $outputLength"
    # Run the benchmark once per concurrency level
    for concurrency in 1 2 5 10 50 100 250; do

        local INPUT_SEQUENCE_LENGTH=$inputLength
        local INPUT_SEQUENCE_STD=0
        local OUTPUT_SEQUENCE_LENGTH=$outputLength
        local CONCURRENCY=$concurrency
        local MODEL=meta/llama-3.1-8b-instruct
        
        genai-perf profile \
            -m $MODEL \
            --endpoint-type chat \
            --service-kind openai \
            --streaming \
            -u localhost:8000 \
            --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
            --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
            --concurrency $CONCURRENCY \
            --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
            --extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \
            --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
            --extra-inputs ignore_eos:true \
            --tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
            --measurement-interval 30000 \
            --profile-export-file ${INPUT_SEQUENCE_LENGTH}_${OUTPUT_SEQUENCE_LENGTH}.json \
            -- \
            -v \
            --max-threads=256
    
    done
}

# Iterate over all defined use cases and run the benchmark script for each
for description in "${!useCases[@]}"; do
    runBenchmark "$description"
done



Writing benchmark.sh


Next, we execute the bash script, which will carry out the defined benchmarking scenarios and gather the data in a default directory named `artifacts` under the current working directory.
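
The reading step further down relies on a specific layout: one directory per concurrency level, each holding one `<ISL>_<OSL>_genai_perf.json` file per use case. The sketch below lists the paths that this layout implies and reports any that are missing, which is a quick way to spot a failed run. The use cases and concurrency levels mirror `benchmark.sh`; the directory prefix depends on the deployed model and is passed in as an argument, and the helper names are illustrative.

```python
import os

# Use cases and their (input, output) sequence lengths, mirroring benchmark.sh
USE_CASES = {
    "Translation": (200, 200),
    "Text classification": (200, 5),
    "Text summary": (1000, 200),
    "Code generation": (200, 1000),
}
CONCURRENCIES = [1, 2, 5, 10, 50, 100, 250]


def expected_result_files(root_dir, directory_prefix):
    """List the result files the benchmark sweep is expected to produce."""
    paths = []
    for concurrency in CONCURRENCIES:
        for isl, osl in USE_CASES.values():
            paths.append(os.path.join(
                root_dir,
                f"{directory_prefix}{concurrency}",
                f"{isl}_{osl}_genai_perf.json",
            ))
    return paths


def missing_result_files(root_dir, directory_prefix):
    """Return the expected result files that do not exist on disk."""
    return [p for p in expected_result_files(root_dir, directory_prefix)
            if not os.path.isfile(p)]
```

For the deployment used in this notebook, `missing_result_files("./artifacts", "meta_llama-3.1-8b-instruct-openai-chat-concurrency")` should return an empty list after a complete sweep of 4 use cases at 7 concurrency levels.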

In [None]:
%%bash
bash benchmark.sh

## Reading GenAI-Perf data

Once performance benchmarking is done, we read and collect the results in a single data frame.

In [8]:
gen_AI_perf_field = [
 'Inter Token 90th Percentile Latency (ms)',
 'Inter Token 99th Percentile Latency (ms)',
 'Inter Token Average Latency (ms)',
 'Time to First Token 90th Percentile Latency (ms)',
 'Time to First Token 99th Percentile Latency (ms)',
 'Time to First Token Average Latency (ms)',
 'Request 90th Percentile Latency (ms)',
 'Request 99th Percentile Latency (ms)',
 'Request Latency (ms)',
 'Requests per Second',
 'Tokens per Second']

# Other experimental params: 'Seq Length (ISL/OSL)', 'Concurrency',

In [9]:
import os
import json
import pandas as pd

root_dir = "./artifacts"
directory_prefix = "meta_llama-3.1-8b-instruct-openai-chat-concurrency" # Change this to fit the actual model deployed

ISL_OSL_list = ["200_5", "200_200", "1000_200", "200_1000"]
concurrencies = [1, 2, 5, 10, 50, 100, 250]
df = pd.DataFrame(columns=gen_AI_perf_field)

for con in concurrencies:
    for ISL_OSL in ISL_OSL_list:
        filename = os.path.join(root_dir, directory_prefix+str(con), f"{ISL_OSL}_genai_perf.json")
        
        # Open and read the file
        with open(filename, 'r') as file:
            data = json.load(file)
        
        row =  {
         'Inter Token 90th Percentile Latency (ms)': data["inter_token_latency"]["p90"],
         'Inter Token 99th Percentile Latency (ms)': data["inter_token_latency"]["p99"],
         'Inter Token Average Latency (ms)': data["inter_token_latency"]["avg"],
         'Time to First Token 90th Percentile Latency (ms)': data["time_to_first_token"]["p90"],
         'Time to First Token 99th Percentile Latency (ms)': data["time_to_first_token"]["p99"],
         'Time to First Token Average Latency (ms)': data["time_to_first_token"]["avg"],
         'Request 90th Percentile Latency (ms)': data["request_latency"]["p90"],
         'Request 99th Percentile Latency (ms)': data["request_latency"]["p99"],
         'Request Latency (ms)': data["request_latency"]["avg"],
         'Requests per Second': data["request_throughput"]["avg"],
         'Tokens per Second': data["output_token_throughput"]["avg"],
         'Seq Length (ISL/OSL)': ISL_OSL,
         'Concurrency': con
        } 
        
        row = meta_field | row
        
        df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

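
The per-file mapping in the cell above can also be factored into a small helper, which keeps the column-to-metric correspondence in one place. This is a sketch that assumes each result file has the nested metric/statistic layout read in that cell; `METRIC_MAP` and `flatten_metrics` are hypothetical names, not part of GenAI-Perf.

```python
# Column name -> (metric key, statistic key) in the GenAI-Perf JSON export,
# copied from the mapping used in the reading cell
METRIC_MAP = {
    'Inter Token 90th Percentile Latency (ms)': ("inter_token_latency", "p90"),
    'Inter Token 99th Percentile Latency (ms)': ("inter_token_latency", "p99"),
    'Inter Token Average Latency (ms)': ("inter_token_latency", "avg"),
    'Time to First Token 90th Percentile Latency (ms)': ("time_to_first_token", "p90"),
    'Time to First Token 99th Percentile Latency (ms)': ("time_to_first_token", "p99"),
    'Time to First Token Average Latency (ms)': ("time_to_first_token", "avg"),
    'Request 90th Percentile Latency (ms)': ("request_latency", "p90"),
    'Request 99th Percentile Latency (ms)': ("request_latency", "p99"),
    'Request Latency (ms)': ("request_latency", "avg"),
    'Requests per Second': ("request_throughput", "avg"),
    'Tokens per Second': ("output_token_throughput", "avg"),
}


def flatten_metrics(data, isl_osl, concurrency, meta):
    """Build one flat row from a parsed result file plus run metadata."""
    row = dict(meta)  # copy so the shared metadata dict is not mutated
    for column, (metric, stat) in METRIC_MAP.items():
        row[column] = data[metric][stat]
    row['Seq Length (ISL/OSL)'] = isl_osl
    row['Concurrency'] = concurrency
    return row
```

With this helper, the loop body reduces to `rows.append(flatten_metrics(data, ISL_OSL, con, meta_field))`, and a single `pd.DataFrame(rows)` call after the loop builds the data frame without repeated concatenation.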


## Exporting data to Excel format

Next, we export the benchmarking data in a format compatible with the TCO tool, comprising both the metadata fields and the performance metric fields.

In [10]:
df.head()

Unnamed: 0,Inter Token 90th Percentile Latency (ms),Inter Token 99th Percentile Latency (ms),Inter Token Average Latency (ms),Time to First Token 90th Percentile Latency (ms),Time to First Token 99th Percentile Latency (ms),Time to First Token Average Latency (ms),Request 90th Percentile Latency (ms),Request 99th Percentile Latency (ms),Request Latency (ms),Requests per Second,Tokens per Second,Model,GPU Type,number_of_gpus,Precision,Execution Mode,Seq Length (ISL/OSL),Concurrency
0,9.594225,10.384453,9.041131,18.409172,19.843728,17.393711,66.557111,71.716564,62.599366,15.96136,95.768158,meta-llama/Meta-Llama-3-8B-Instruct,H100_80GB,1.0,BF16,NIM-TRTLLM,200_5,1.0
1,10.887888,11.2632,10.615027,18.011177,38.893825,18.188744,2195.40086,2265.86754,2138.4097,0.467599,93.865874,meta-llama/Meta-Llama-3-8B-Instruct,H100_80GB,1.0,BF16,NIM-TRTLLM,200_200,1.0
2,11.618933,11.998436,11.210382,62.158805,79.05302,54.133457,2390.421083,2467.364641,2294.288986,0.435829,87.527501,meta-llama/Meta-Llama-3-8B-Instruct,H100_80GB,1.0,BF16,NIM-TRTLLM,1000_200,1.0
3,11.376184,11.402237,11.155124,19.120465,19.441144,18.441507,11367.166599,11417.401786,11155.899836,0.089634,89.584068,meta-llama/Meta-Llama-3-8B-Instruct,H100_80GB,1.0,BF16,NIM-TRTLLM,200_1000,1.0
4,10.997904,13.013792,10.076813,33.621545,40.719498,30.210196,86.358385,100.114304,80.594263,24.799054,148.794324,meta-llama/Meta-Llama-3-8B-Instruct,H100_80GB,1.0,BF16,NIM-TRTLLM,200_5,2.0


In [None]:
!pip install openpyxl

In [7]:
columns = [
 'Model',
 'GPU Type',
 'Seq Length (ISL/OSL)',
 'number_of_gpus',
 'Concurrency',
 'Precision',
 'Execution Mode',
 'Inter Token 90th Percentile Latency (ms)',
 'Inter Token 99th Percentile Latency (ms)',
 'Inter Token Average Latency (ms)',
 'Time to First Token 90th Percentile Latency (ms)',
 'Time to First Token 99th Percentile Latency (ms)',
 'Time to First Token Average Latency (ms)',
 'Request 90th Percentile Latency (ms)',
 'Request 99th Percentile Latency (ms)',
 'Request Latency (ms)',
 'Requests per Second',
 'Tokens per Second'
 ]
df[columns].to_excel('data.xlsx', index=False)
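
Before copying the exported data into the calculator, it may be worth confirming that every sequence-length and concurrency combination appears exactly once. The sketch below works on plain row dictionaries using the column names from this notebook; `check_coverage` is a hypothetical helper, and treating concurrency values as integers is an assumption made so that values such as `1.0` match `1`.

```python
from collections import Counter


def check_coverage(rows, isl_osl_list, concurrencies):
    """Return (missing, duplicated) lists of (ISL/OSL, concurrency) pairs."""
    seen = Counter(
        (row['Seq Length (ISL/OSL)'], int(row['Concurrency'])) for row in rows
    )
    expected = [(s, c) for s in isl_osl_list for c in concurrencies]
    missing = [key for key in expected if seen[key] == 0]
    duplicated = [key for key, count in seen.items() if count > 1]
    return missing, duplicated
```

In this notebook it could be called as `check_coverage(df.to_dict("records"), ISL_OSL_list, concurrencies)`; two empty lists indicate that the 28 expected rows are all present and unique.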


## Importing the data to the TCO calculator

The [NIM TCO calculator tool](https://docs.google.com/spreadsheets/d/1UF_sy89kcLIkdnK0dC-6QwcAgVDUV0ANJ22JnC2dW7g/edit?gid=0#gid=0) is implemented as a Google spreadsheet. Open the Excel file exported above in Google Sheets, then copy the data rows into the "data" subsheet of the TCO calculator. This completes the import and makes the new data available in the TCO calculator.