# Azure OpenAI benchmarking - Quick Test

Serge Retkowsky | 08-Jul-2024

<img src="img/logobench.jpg" width=600>

- This notebook runs benchmark tests on your Azure OpenAI models on PayGo and/or PTU.
- This notebook uses the GitHub repo from Microsoft: https://github.com/Azure/azure-openai-benchmark

### Predefined scenarios
<img src="img/image1.jpg">

### Profiles
<img src="img/image2.jpg">

### Generated metrics
<img src="img/image3.jpg">


- When running a test with **retry=none**, any throttled request is counted as throttled and a new request is made to replace it, with the start time of the replacement request reset to a newer time. If the resource being tested starts returning 429s, **then any latency metrics from this tool will only represent the values of the final successful request**, without including the time spent retrying the resource until a successful response was received (which may not be representative of the real-world user experience). Use this setting when the workload being tested is within the resource's capacity and no throttling occurs, or when you want to understand what percentage of requests to a PTU instance might need to be diverted to a backup resource, such as during periods of peak load that require more throughput than the PTU resource can handle.

- When running a test with **retry=exponential**, any failed or throttled request will be retried with exponential backoff, up to a **max of 60 seconds**. While it is always recommended to deploy backup AOAI resources for use cases that will experience periods of high load, **this setting may be useful for simulating a scenario where no backup resource is available**, and where throttled or failed requests must still be fulfilled by the resource. **In this case, the TTFT and e2e latency metrics will represent the time from the first throttled request to the time that the final request was successful, and may be more reflective of the total time that an end user could spend waiting for a response**, e.g. in a chat application. Use this option when you want to understand the latency of requests that are throttled and need to be retried on the same resource, and how the total latency of a request is impacted by multiple retries.

As a practical example, if a PTU resource is tested beyond 100% capacity and starts returning 429s:
- With **retry=none** the TTFT and e2e latency statistics will remain stable (and very low), since only the successful requests will be included in the metrics. Number of throttled requests will be relatively high.
- With **retry=exponential**, the TTFT/e2e latency metrics will increase (potentially up to the max of 60 seconds), while the number of throttled requests will remain lower (since a request is only treated as throttled after 60 seconds, regardless of how many attempts were made within the retry period).

Total throughput values (RPM, TPM) may be lower when retry=none if rate limiting is applied.

**As a best practice, any PTU resource should be deployed with a backup PayGo resource for times of peak load. As a result, any testing should be conducted with the values suggested in the AOAI capacity calculator (within the AI Azure Portal) to ensure that throttling does not occur during testing.**
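To make the difference between the two retry modes more concrete, here is a minimal sketch that simulates one logical request receiving a few 429s before it succeeds. It is not the benchmark tool's implementation: the function name `reported_latency`, the backoff schedule (1 second, doubling on each retry) and the per-attempt latency are illustrative assumptions, and the time taken by the 429 responses themselves is ignored. Only the 60-second cap comes from the description above.

```python
def reported_latency(num_throttled, attempt_latency, retry, max_wait=60.0, base_delay=1.0):
    """Return (reported_latency_seconds, throttled_count) for one logical request.

    num_throttled   -- number of 429 responses received before a success
    attempt_latency -- latency of the successful attempt (hypothetical value)
    retry           -- "none" or "exponential"
    """
    if retry == "none":
        # Each 429 is counted as throttled and replaced by a fresh request whose
        # clock restarts, so only the final successful attempt is measured.
        return attempt_latency, num_throttled

    if retry == "exponential":
        waited = 0.0
        delay = base_delay
        for _ in range(num_throttled):
            if waited + delay > max_wait:
                # Retry budget exhausted: the request counts as throttled once.
                return None, 1
            waited += delay
            delay *= 2  # assumed doubling schedule
        # Latency spans from the first attempt to the final successful response.
        return waited + attempt_latency, 0

    raise ValueError(f"unknown retry mode: {retry}")


for mode in ("none", "exponential"):
    latency, throttled = reported_latency(3, 0.5, retry=mode)
    print(f"retry={mode}: latency={latency}s, throttled={throttled}")
```

Under these assumptions the same three 429s show up as three throttled requests and a low latency with retry=none, and as zero throttled requests and a much higher latency with retry=exponential.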

In [1]:
import datetime
import openai
import os
import pandas as pd
import pytz
import re
import requests
import shutil
import subprocess
import sys
import time


from datetime import datetime as DT
from dotenv import load_dotenv
from openai import AzureOpenAI

In [2]:
local_tz = DT.now(pytz.timezone(requests.get("https://ipinfo.io").json()["timezone"])).strftime("%d-%b-%Y %H:%M:%S")
print(f"Today is {local_tz}")

Today is 11-Jul-2024 13:34:45


In [3]:
print(f"Python version: {sys.version}")
print(f"OpenAI version: {openai.__version__}")

Python version: 3.10.11 (main, May 16 2023, 00:28:57) [GCC 11.2.0]
OpenAI version: 1.35.1


### To maximize cell output

In [4]:
%%javascript
OutputArea.auto_scroll_threshold = 9999

<IPython.core.display.Javascript object>

## 1. Settings

In [5]:
load_dotenv("azure.env")

True

In [6]:
HOME = os.getcwd()
print(f"Current directory is: {HOME}")

Current directory is: /mnt/batch/tasks/shared/LS_root/mounts/clusters/seretkow9/code/Users/seretkow/AOAI benchmarks


In [7]:
# Dir to save all the generated results
RES_DIR = "results"

os.makedirs(RES_DIR, exist_ok=True)

local_tz = DT.now(pytz.timezone(requests.get("https://ipinfo.io").json()["timezone"])).strftime("%d%b%Y_%H%M%S")
RESULTS_DIR = f"{RES_DIR}/results_{local_tz}"
os.makedirs(RESULTS_DIR, exist_ok=True)

In [8]:
ZIP_DIR = "zip"

os.makedirs(ZIP_DIR, exist_ok=True)

In [9]:
# Dir to download the github repo from https://github.com/Azure/azure-openai-benchmark
DIR_BENCHMARK = 'azure-openai-benchmark'

# Remove any previous clone; ignore_errors avoids a failure on the first run
shutil.rmtree(DIR_BENCHMARK, ignore_errors=True)
!git clone https://github.com/Azure/azure-openai-benchmark

Cloning into 'azure-openai-benchmark'...
remote: Enumerating objects: 224, done.
remote: Counting objects: 100% (158/158), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 224 (delta 129), reused 107 (delta 107), pack-reused 66
Receiving objects: 100% (224/224), 59.81 KiB | 153.00 KiB/s, done.
Resolving deltas: 100% (137/137), done.


In [10]:
os.chdir(DIR_BENCHMARK)

In [11]:
#!pip install -r requirements.txt

In [12]:
print(f"Current directory is: {os.getcwd()}")

Current directory is: /mnt/batch/tasks/shared/LS_root/mounts/clusters/seretkow9/code/Users/seretkow/AOAI benchmarks/azure-openai-benchmark


In [13]:
# For model 1
ENDPOINT1 = os.getenv("ENDPOINT1")
KEY1 = os.getenv("KEY1")
MODEL1 = os.getenv("MODEL1")
MODEL1_LABEL = os.getenv("MODEL1_LABEL")
COLOR1 = os.getenv("COLOR1")
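`os.getenv` returns `None` silently when a variable is missing, and assigning `None` into `os.environ` raises a `TypeError`. A small check like the one below can surface a missing setting early. This is only a sketch: the helper name `missing_settings` and the choice of which variables are required are assumptions, not part of the benchmark repo.

```python
import os


def missing_settings(names, env=None):
    """Return the entries of `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Assumed minimum set of variables needed to run a benchmark for model 1
REQUIRED = ["ENDPOINT1", "KEY1", "MODEL1"]

missing = missing_settings(REQUIRED)
if missing:
    print(f"Missing settings in azure.env: {', '.join(missing)}")
else:
    print("All required settings are present.")
```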

In [14]:
!python -m benchmark.bench load --help

usage: bench.py load [-h] [-a API_VERSION] [-k API_KEY_ENV] [-c CLIENTS]
                     [-n REQUESTS] [-d DURATION] [-r RATE]
                     [-w AGGREGATION_WINDOW]
                     [-s {balanced,context,generation,custom}]
                     [-p CONTEXT_TOKENS] [-m MAX_TOKENS] [-i COMPLETIONS]
                     [--frequency-penalty FREQUENCY_PENALTY]
                     [--presence-penalty PRESENCE_PENALTY]
                     [--temperature TEMPERATURE] [--top-p TOP_P]
                     [-f {jsonl,human}] [-t {none,exponential}] -e DEPLOYMENT
                     api_base_endpoint

positional arguments:
  api_base_endpoint     Azure OpenAI deployment base endpoint.

optional arguments:
  -h, --help            show this help message and exit
  -a API_VERSION, --api-version API_VERSION
                        Set OpenAI API version.
  -k API_KEY_ENV, --api-key-env API_KEY_ENV
                        Environment variable that contains the API

In [15]:
os.environ['OPENAI_API_KEY'] = KEY1

In [16]:
print(f"Let's use the {MODEL1} model...\n")
    
command = [
        'python', '-m', 'benchmark.bench', 'load',
        '--deployment', MODEL1,  # Model to use
        #'--rate', '10',
        '--retry', 'exponential',  # Retry
        '--duration', '30', # Should be >= 30
        ENDPOINT1 # Model endpoint to use
]

start = time.time()
now = datetime.datetime.today()
print(f"{now} Running benchmarks for model {MODEL1}...")
result = subprocess.run(command, capture_output=True, text=True)
elapsed = time.time() - start
print("\nDone")
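# Format as HH:MM:SS.ffffff without relying on string slicing of the fraction
hours, remainder = divmod(elapsed, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"Elapsed time: {int(hours):02d}:{int(minutes):02d}:{seconds:09.6f}")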

Let's use the gpt-4o model...

2024-07-11 11:35:01.094531 Running benchmarks for model gpt-4o...

Done
Elapsed time: 00:00:40.658332
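To compare both retry modes on the same deployment, the command list can be built by a small helper. The sketch below is one possible way to do it: `build_command` is a hypothetical name, it only uses the flags already shown in this notebook (`--deployment`, `--retry`, `--duration`, `--rate`), and the endpoint in the usage lines is a placeholder.

```python
def build_command(deployment, endpoint, retry="none", duration=30, rate=None):
    """Build the argument list for `python -m benchmark.bench load`."""
    if retry not in ("none", "exponential"):
        raise ValueError(f"unknown retry mode: {retry}")
    command = [
        "python", "-m", "benchmark.bench", "load",
        "--deployment", deployment,
        "--retry", retry,
        "--duration", str(duration),
    ]
    if rate is not None:
        # Optional requests-per-minute limit
        command += ["--rate", str(rate)]
    # The endpoint is the positional argument and goes last
    command.append(endpoint)
    return command


# Placeholder endpoint, for illustration only
for mode in ("none", "exponential"):
    print(" ".join(build_command("gpt-4o", "https://example.openai.azure.com/", retry=mode)))
```

Each returned list can be passed to `subprocess.run(..., capture_output=True, text=True)` in the same way as the cell above.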


In [17]:
print(result)

CompletedProcess(args=['python', '-m', 'benchmark.bench', 'load', '--deployment', 'gpt-4o', '--retry', 'exponential', '--duration', '30', 'https://azureopenai-sweden-sr.openai.azure.com/'], returncode=0, stdout='2024-07-11 11:35:03 rpm: n/a   processing: 20   completed: 0     failures: 0    throttled: 0    requests: 0     tpm: 0      ttft_avg: n/a    ttft_95th: n/a    tbt_avg: n/a    tbt_95th: n/a    e2e_avg: n/a    e2e_95th: n/a    util_avg: n/a    util_95th: n/a   \n2024-07-11 11:35:04 rpm: n/a   processing: 20   completed: 0     failures: 0    throttled: 0    requests: 0     tpm: 0      ttft_avg: n/a    ttft_95th: n/a    tbt_avg: n/a    tbt_95th: n/a    e2e_avg: n/a    e2e_95th: n/a    util_avg: n/a    util_95th: n/a   \n2024-07-11 11:35:05 rpm: n/a   processing: 20   completed: 0     failures: 0    throttled: 0    requests: 0     tpm: 0      ttft_avg: n/a    ttft_95th: n/a    tbt_avg: n/a    tbt_95th: n/a    e2e_avg: n/a    e2e_95th: n/a    util_avg: n/a    util_95th: n/a   \n2024-

In [18]:
print(result.stdout)

2024-07-11 11:35:03 rpm: n/a   processing: 20   completed: 0     failures: 0    throttled: 0    requests: 0     tpm: 0      ttft_avg: n/a    ttft_95th: n/a    tbt_avg: n/a    tbt_95th: n/a    e2e_avg: n/a    e2e_95th: n/a    util_avg: n/a    util_95th: n/a   
2024-07-11 11:35:04 rpm: n/a   processing: 20   completed: 0     failures: 0    throttled: 0    requests: 0     tpm: 0      ttft_avg: n/a    ttft_95th: n/a    tbt_avg: n/a    tbt_95th: n/a    e2e_avg: n/a    e2e_95th: n/a    util_avg: n/a    util_95th: n/a   
2024-07-11 11:35:05 rpm: n/a   processing: 20   completed: 0     failures: 0    throttled: 0    requests: 0     tpm: 0      ttft_avg: n/a    ttft_95th: n/a    tbt_avg: n/a    tbt_95th: n/a    e2e_avg: n/a    e2e_95th: n/a    util_avg: n/a    util_95th: n/a   
2024-07-11 11:35:06 rpm: n/a   processing: 20   completed: 0     failures: 0    throttled: 0    requests: 0     tpm: 0      ttft_avg: n/a    ttft_95th: n/a    tbt_avg: n/a    tbt_95th: n/a    e2e_avg: n/a    e2e_95th: n/

In [19]:
print(result.stderr)

2024-07-11 11:35:02 INFO     using shape profile balanced: context tokens: 500, max tokens: 500
2024-07-11 11:35:02 INFO     warming up prompt cache
2024-07-11 11:35:02 INFO     starting load...
2024-07-11 11:35:32 INFO     waiting for 21 requests to drain
2024-07-11 11:35:41 INFO     finished load test



In [20]:
data = result.stdout
lines = data.strip().split('\n')
datalist = []
headers = None
pattern = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.+)")

for line in lines:
    match = pattern.match(line)
    if match:
        timestamp, stats = match.groups()
        stats_dict = dict(re.findall(r"(\S+): (\S+)", stats))
        if not headers:
            headers = ['timestamp'] + list(stats_dict.keys())
        datalist.append([timestamp] + list(stats_dict.values()))

df = pd.DataFrame(datalist, columns=headers)
df

Unnamed: 0,timestamp,rpm,processing,completed,failures,throttled,requests,tpm,ttft_avg,ttft_95th,tbt_avg,tbt_95th,e2e_avg,e2e_95th,util_avg,util_95th
0,2024-07-11 11:35:03,,20,0,0,0,0,0.0,,,,,,,,
1,2024-07-11 11:35:04,,20,0,0,0,0,0.0,,,,,,,,
2,2024-07-11 11:35:05,,20,0,0,0,0,0.0,,,,,,,,
3,2024-07-11 11:35:06,,20,0,0,0,0,0.0,,,,,,,,
4,2024-07-11 11:35:07,,20,0,0,0,0,0.0,,,,,,,,
5,2024-07-11 11:35:08,20.0,20,2,0,0,2,20080.0,0.276,0.299,0.011,0.011,5.762,5.775,,
6,2024-07-11 11:35:09,77.1,20,9,0,0,9,77452.0,0.281,0.314,0.011,0.012,6.061,6.21,,
7,2024-07-11 11:35:10,75.0,20,10,0,0,10,75300.0,0.284,0.314,0.012,0.014,6.252,7.19,,
8,2024-07-11 11:35:11,80.0,20,12,0,0,12,80320.0,0.284,0.313,0.013,0.016,6.636,8.52,,
9,2024-07-11 11:35:12,72.0,20,12,0,0,12,72288.0,0.284,0.313,0.013,0.016,6.636,8.52,,


In [21]:
df.shape

(40, 16)

In [22]:
df.describe()

Unnamed: 0,timestamp,rpm,processing,completed,failures,throttled,requests,tpm,ttft_avg,ttft_95th,tbt_avg,tbt_95th,e2e_avg,e2e_95th,util_avg,util_95th
count,40,40.0,40,40,40,40,40,40,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0
unique,39,34.0,9,26,1,1,26,36,24.0,15.0,7.0,13.0,26.0,19.0,1.0,1.0
top,2024-07-11 11:35:41,,20,0,0,0,0,0,,0.382,0.017,0.029,,14.52,,
freq,2,5.0,30,5,40,40,5,5,5.0,14.0,20.0,7.0,5.0,7.0,40.0,40.0
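The `count / unique / top / freq` summary above is what `describe()` returns for string columns: the regex parser stores every value as text, including the `n/a` placeholders. A possible next step, sketched below, is to convert the columns to numeric types before summarizing. The helper name `to_numeric_frame` is an assumption, and the small `sample` frame reuses a few values from the output above so the block runs on its own.

```python
import pandas as pd


def to_numeric_frame(frame, time_col="timestamp"):
    """Return a copy with `time_col` parsed as datetime and all other columns numeric."""
    out = frame.copy()
    out[time_col] = pd.to_datetime(out[time_col])
    for col in out.columns.drop(time_col):
        # "n/a" and any other non-numeric text become NaN
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


# Small stand-in for the parsed benchmark output (all values are strings)
sample = pd.DataFrame({
    "timestamp": ["2024-07-11 11:35:07", "2024-07-11 11:35:08", "2024-07-11 11:35:09"],
    "rpm": ["n/a", "20.0", "77.1"],
    "ttft_avg": ["n/a", "0.276", "0.281"],
})

numeric = to_numeric_frame(sample)
print(numeric.dtypes)
print(numeric[["rpm", "ttft_avg"]].describe())
```

Applied to the notebook's data this would be `to_numeric_frame(df)`, after which `describe()` reports mean, standard deviation and percentiles instead of string frequencies.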
