## GuideLLM 
GuideLLM is an open source benchmarking tool designed to evaluate the performance of LLMs served through vLLM. It provides fine-grained metrics such as:

- Token throughput
- Latency (time-to-first-token, inter-token, request latency)
- Concurrency scaling
- Request-level diagnostics

This notebook evaluates the performance of the base model deployment using the same parameters and configuration as the compressed model deployment, so that the performance of the two deployments can be compared directly.


In [1]:
from guidellm.benchmark import GenerativeBenchmarksReport



## Metrics provided by GuideLLM
GuideLLM provides multiple metrics that can be used to evaluate the performance of an LLM.

These performance metrics include:

- ``Request Rate`` (Requests Per Second): The number of requests processed per second.
Indicates the throughput of the system and its ability to handle concurrent workloads.
- ``Request Concurrency``: The number of requests being processed simultaneously.
Helps evaluate the system's capacity to handle parallel workloads.

- ``Output Tokens Per Second``: The average number of output tokens generated per second as a throughput metric across all requests. Provides insights into the server's performance and efficiency in generating output tokens.
- ``Total Tokens Per Second``: The combined rate of prompt and output tokens processed per second as a throughput metric across all requests.
Provides insights into the server's overall performance and efficiency in processing both prompt and output tokens.

- ``Request Latency``: The time taken to process a single request, from start to finish.
A critical metric for evaluating the responsiveness of the system.

- ``Time to First Token (TTFT)``: The time taken to generate the first token of the output.
Indicates the initial response time of the model, which is crucial for user-facing applications.

- ``Inter-Token Latency (ITL)``: The average time between generating consecutive tokens in the output, excluding the first token. Helps assess the smoothness and speed of token generation.
  
- ``Time Per Output Token``: The average time taken to generate each output token, including the first token. Provides a detailed view of the model's token generation efficiency.
  
- ``Statistical Summaries``: GuideLLM provides detailed statistical summaries for each of the above metrics using the ``StatusDistributionSummary`` and ``DistributionSummary`` models. These summaries include the following statistics:

    **Summary Statistics**

    - Mean: The average value of the metric.
    - Median: The middle value of the metric when sorted.
    - Mode: The most frequently occurring value of the metric.
    - Variance: The measure of how much the values of the metric vary.
    - Standard Deviation (Std Dev): The square root of the variance, indicating the spread of the values.
    - Min: The minimum value of the metric.
    - Max: The maximum value of the metric.
    - Count: The total number of data points for the metric.
    - Total Sum: The sum of all values for the metric.
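
To make the latency definitions and summary statistics above concrete, here is a minimal sketch that derives them from raw timestamps using only the Python standard library. The function names, the sample timestamps, and the choice of population variance are assumptions made for illustration; this is not GuideLLM's implementation.

```python
import statistics


def request_metrics(start, token_times):
    """Derive per-request latency metrics (seconds) from token arrival times."""
    first, last = token_times[0], token_times[-1]
    n = len(token_times)
    return {
        "request_latency": last - start,
        "ttft": first - start,
        # ITL excludes the first token: average gap between consecutive tokens.
        "itl": (last - first) / (n - 1) if n > 1 else 0.0,
        # TPOT includes the first token: total latency spread over all tokens.
        "tpot": (last - start) / n,
    }


def summarize(values):
    """Summary statistics matching the list above (population variance assumed)."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
        "count": len(values),
        "total_sum": sum(values),
    }


# Hypothetical request: sent at t=0.0 s, four tokens arrive at these times.
m = request_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(m)

# Hypothetical TTFT samples in milliseconds.
s = summarize([100.0, 120.0, 120.0, 140.0])
print(s)
```

In this example ITL (0.25 s) is smaller than TPOT (0.3125 s) because TPOT also absorbs the time spent waiting for the first token.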

### Token Configuration for Different Use Cases

GuideLLM lets you configure both **input (prompt) tokens** and **output (completion) tokens** depending on the workload to be evaluated.
Different use cases benefit from different token budgets, and these values can be adjusted to match the requirements.

Below are example token configurations commonly used when benchmarking LLMs:

#### Chat Use Case
Chat-style interactions typically have *moderate* prompt length and *short to medium* responses.

- **Input tokens:** ~512 
- **Output tokens:** ~1024  
- **Why:** Chat prompts are usually concise, and responses should be coherent but not excessively long.

---

#### RAG (Retrieval-Augmented Generation)
RAG workloads include retrieved documents in the prompt, so input size is much larger while answers remain relatively short.

- **Input tokens:** ~2,000  
- **Output tokens:** ~500  
- **Why:** Retrieved context contributes heavily to prompt length; outputs should stay grounded and precise.

---

#### Reasoning (e.g., long-form reasoning, code explanation, chain-of-thought tasks)
Reasoning tasks often need short prompts but *longer* answers to capture detailed step-by-step reasoning.

- **Input tokens:** ~300  
- **Output tokens:** ~1,500  
- **Why:** These tasks require extended reasoning or multi-step analysis, so the model needs more room in its output.
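
As a rough sketch, the token budgets above can be kept in one place and rendered into the ``key=value`` string format that the ``--data`` option takes in the benchmark command later in this notebook. The dictionary name ``USE_CASES`` and the helper ``data_arg`` are hypothetical names introduced here for illustration.

```python
# Approximate token budgets taken from the use cases above.
USE_CASES = {
    "chat": {"prompt_tokens": 512, "output_tokens": 1024},
    "rag": {"prompt_tokens": 2000, "output_tokens": 500},
    "reasoning": {"prompt_tokens": 300, "output_tokens": 1500},
}


def data_arg(config):
    """Render a config dict as a comma-separated key=value string."""
    return ",".join(f"{key}={value}" for key, value in config.items())


for name, config in USE_CASES.items():
    print(f'{name}: --data "{data_arg(config)}"')
```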



## Performance Benchmarking
The GuideLLM benchmarking parameters used in this notebook are listed below.
### Parameters:
**1. target**: The URL of the vLLM model server to benchmark.

**2. profile/rate-type**: Defines the traffic pattern. Options include:
- ``synchronous``: Runs requests one at a time (sequential)
- ``throughput``: Runs all requests in parallel to measure the maximum throughput the server can handle
- ``concurrent``: Runs a fixed number of parallel request streams
- ``constant``: Sends requests at a fixed rate per second
- ``poisson``: Sends requests following a Poisson distribution
- ``sweep``: Automatically determines optimal performance points (default)
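
The difference between the ``constant`` and ``poisson`` patterns is easiest to see in the request send times. The sketch below is an illustration only, with hypothetical function names, and is not how GuideLLM schedules requests internally: a constant pattern spaces requests evenly, while a Poisson pattern draws exponentially distributed gaps with the same average rate.

```python
import random


def constant_arrivals(rate, count):
    """Send times (seconds) for evenly spaced requests at `rate` requests/second."""
    return [i / rate for i in range(count)]


def poisson_arrivals(rate, count, seed=0):
    """Send times whose gaps are exponentially distributed with mean 1/rate seconds."""
    rng = random.Random(seed)
    times, now = [], 0.0
    for _ in range(count):
        now += rng.expovariate(rate)
        times.append(now)
    return times


rate = 2.0  # requests per second
print(constant_arrivals(rate, 5))
print([round(t, 2) for t in poisson_arrivals(rate, 5)])
```

Both patterns average the same request rate over time, but the Poisson pattern produces bursts and quiet periods, which is often closer to real user traffic.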


**3. sweep behavior**: GuideLLM supports multiple workload simulation modes, known as rate types (listed under **profile/rate-type** above). Each rate type determines which benchmarks are run. The command in this notebook uses ``sweep``, which runs a series of benchmarks, each limited by ``max-seconds``: first, a synchronous test that sends one request at a time (representing minimal traffic), then a throughput test where all requests are sent in parallel to identify the system's maximum requests per second (RPS). Finally, it runs intermediate RPS levels to capture latency metrics across the full traffic spectrum.
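
The sweep procedure described above can be sketched as follows. This assumes the intermediate rates are evenly spaced between the measured synchronous and throughput rates, which is an assumption about the sweep logic rather than a documented guarantee. The function name ``sweep_rates`` is hypothetical, and the two endpoint rates are rounded illustrative values.

```python
def sweep_rates(sync_rate, throughput_rate, sweep_size):
    """Evenly spaced constant rates above sync_rate, up to throughput_rate.

    Two of the sweep_size benchmarks are the synchronous and throughput runs,
    so sweep_size - 2 constant-rate benchmarks remain.
    """
    steps = sweep_size - 2
    step = (throughput_rate - sync_rate) / steps
    return [sync_rate + step * i for i in range(1, steps + 1)]


# Assumed endpoint rates in requests per second (illustrative, rounded values).
rates = sweep_rates(sync_rate=0.1, throughput_rate=1.7, sweep_size=10)
print([round(r, 2) for r in rates])
```

With these inputs the sketch yields eight constant rates from 0.3 to 1.7 requests per second in steps of 0.2, one for each constant-rate benchmark in a sweep of size 10.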

**4. data**: Specifies the dataset source. This can be a file path, Hugging Face dataset ID, synthetic data configuration, or in-memory data. In this notebook, it is set to a synthetic data configuration.
Synthetic datasets allow you to generate data on the fly with customizable parameters. This is useful for controlled experiments, stress testing, and simulating specific scenarios. For example, you might want to evaluate how a model handles long prompts or generates outputs with specific characteristics. Data can be configured for different use cases such as chat, RAG, and code generation.
Important config parameters:
- ``prompt_tokens``: Average number of tokens in prompts.
- ``output_tokens``: Average number of tokens in outputs.
- ``samples``: Number of samples to generate (default: 1000)
- ``source``: Source text for generation (default: prideandprejudice.txt.gz). This can be any text file, URL containing a text file, or a compressed text file. The text is used to sample from at a word and punctuation granularity and then combined into a single string of the desired lengths.
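
The idea of sampling prompts from a source text can be sketched as below. This is a simplified illustration, not GuideLLM's generator: whitespace-separated words stand in for tokenizer tokens, every prompt has exactly the requested length, and the function name ``sample_prompts`` is hypothetical.

```python
import random


def sample_prompts(source_text, prompt_tokens, samples, seed=0):
    """Cut `samples` prompts of `prompt_tokens` words each from source_text.

    Whitespace-separated words are used as a stand-in for tokenizer tokens.
    """
    words = source_text.split()
    if len(words) < prompt_tokens:
        raise ValueError("source text is shorter than one prompt")
    rng = random.Random(seed)
    prompts = []
    for _ in range(samples):
        # Pick a random window of the requested length inside the source text.
        start = rng.randrange(len(words) - prompt_tokens + 1)
        prompts.append(" ".join(words[start:start + prompt_tokens]))
    return prompts


source = (
    "It is a truth universally acknowledged, that a single man in possession "
    "of a good fortune, must be in want of a wife."
)
prompts = sample_prompts(source, prompt_tokens=6, samples=3)
for prompt in prompts:
    print(prompt)
```

In a real run the prompt length is measured with the model's tokenizer (see **processor** below), so word counts and token counts will differ.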

**5. rate**: The numeric rate value, whose meaning depends on the profile: for ``sweep`` it is the number of benchmarks, for ``concurrent`` it is the number of simultaneous requests, and for ``constant``/``poisson`` it is the number of requests per second.

**6. max-seconds**: Maximum duration in seconds for each benchmark run (can also use **--max-requests** to limit by request count instead)


**7. processor**: Specifies the tokenizer to use. This is only required for synthetic data generation or when local calculations are specified through configuration settings. By default, the processor is set to the --model argument. If --model is not supplied, it defaults to the model retrieved from the backend. The tokenizer is used to calculate the number of tokens to adjust the input length based on ``prompt_tokens``.  Using the model’s native tokenizer ensures the prompt token count matches what the model actually receives and the output token count reflects the true workload.

**Note**: Synthetic data generation needs a source text (the ``source`` parameter, which has a default), given as continuous text in a compatible format such as txt. Input prompts (their number is set with the ``samples`` parameter) are then sampled from this text, each with a length of ``prompt_tokens`` tokens.

### Run Performance Benchmarking

Identify the following parameters:

**target**: the URL of the vLLM inference server started in the previous step (``http://127.0.0.1:8000`` according to that step)

**output-path**: path to save the results.

If needed, adjust the target, output-path, profile etc in the command below and run it in a terminal.

**NOTES**:
- Make sure you run the following command from the model_serve_flow directory.
- If you started the vLLM server for the base model on a different port, adjust the ``target``.
- This command uses the same config as the config used for benchmarking the compressed model.
```
guidellm benchmark \
  --target "http://127.0.0.1:8000" \
  --profile sweep \
  --max-seconds 120 \
  --data "prompt_tokens=1024,output_tokens=512" \
  --output-path "results/base_performance_benchmarks.json"
```
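
If you prefer to assemble the command programmatically, for example to reuse the same settings for the base and compressed runs, a minimal sketch could look like this. It only uses the flags shown in the command above; the helper name ``guidellm_command`` and its defaults are hypothetical, and the sketch prints the command rather than running it.

```python
import shlex


def guidellm_command(target, output_path, profile="sweep", max_seconds=120,
                     prompt_tokens=1024, output_tokens=512):
    """Assemble the benchmark command shown above as an argument list."""
    return [
        "guidellm", "benchmark",
        "--target", target,
        "--profile", profile,
        "--max-seconds", str(max_seconds),
        "--data", f"prompt_tokens={prompt_tokens},output_tokens={output_tokens}",
        "--output-path", output_path,
    ]


cmd = guidellm_command(
    "http://127.0.0.1:8000",
    "results/base_performance_benchmarks.json",
)
# Print a shell-safe version of the command for copy and paste.
print(shlex.join(cmd))
```

The argument list can also be passed to ``subprocess.run(cmd, check=True)`` to launch the benchmark from Python, assuming ``guidellm`` is installed and on the ``PATH``.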

### Results

The above command produces multiple tables.

1. **Request Latency Statistics (Completed Requests)**

This table focuses on how **long** requests take and the latency characteristics of the server.

```text

ℹ Request Latency Statistics (Completed Requests)
|=============|=========|========|=========|=========|======|======|=======|=======|
| Benchmark   | Request Latency || TTFT             || ITL        || TPOT         ||
| Strategy    | Sec             || ms               || ms         || ms           ||
|             | Mdn     | p95    | Mdn     | p95     | Mdn  | p95  | Mdn   | p95   |
|-------------|---------|--------|---------|---------|------|------|-------|-------|
| synchronous | 11.4    | 11.4   | 115.9   | 124.1   | 22.2 | 22.2 | 22.3  | 22.4  |
| throughput  | 62.3    | 92.4   | 33854.1 | 60812.2 | 55.7 | 92.4 | 121.6 | 180.5 |
| constant    | 12.5    | 12.6   | 130.4   | 143.1   | 24.3 | 24.3 | 24.5  | 24.5  |
| constant    | 13.2    | 13.2   | 133.6   | 144.9   | 25.6 | 25.6 | 25.8  | 25.8  |
| constant    | 14.0    | 14.1   | 133.5   | 144.9   | 27.1 | 27.2 | 27.3  | 27.5  |
| constant    | 14.7    | 14.8   | 138.7   | 151.9   | 28.6 | 28.6 | 28.8  | 28.9  |
| constant    | 16.5    | 16.6   | 140.2   | 156.9   | 32.0 | 32.2 | 32.3  | 32.4  |
| constant    | 17.5    | 17.5   | 143.2   | 157.9   | 33.9 | 34.0 | 34.1  | 34.3  |
| constant    | 18.5    | 18.7   | 146.8   | 161.8   | 36.0 | 36.3 | 36.2  | 36.5  |
| constant    | 20.4    | 20.4   | 147.0   | 162.4   | 39.6 | 39.7 | 39.8  | 39.9  |
|=============|=========|========|=========|=========|======|======|=======|=======|
```

2.  **Server Throughput Statistics**

This table focuses on how many requests the server can handle per second. Throughput is the **rate** at which requests and tokens are processed.
```text
ℹ Server Throughput Statistics
|=============|=====|======|=======|=======|========|========|========|=======|=======|========|
| Benchmark   | Requests                |||| Input Tokens   || Output Tokens || Total Tokens  ||
| Strategy    | Per Sec   || Concurrency  || Per Sec        || Per Sec       || Per Sec       ||
|             | Mdn | Mean | Mdn   | Mean  | Mdn    | Mean   | Mdn    | Mean  | Mdn   | Mean   |
|-------------|-----|------|-------|-------|--------|--------|--------|-------|-------|--------|
| synchronous | 0.1 | 0.1  | 1.0   | 1.0   | 92.5   | 101.8  | 45.1   | 44.8  | 45.1  | 137.4  |
| throughput  | 0.5 | 1.7  | 125.0 | 100.5 | 132.2  | 3094.4 | 604.5  | 885.0 | 607.5 | 2713.7 |
| constant    | 0.3 | 0.2  | 3.0   | 3.2   | 303.9  | 314.3  | 138.2  | 135.9 | 138.3 | 416.8  |
| constant    | 0.5 | 0.4  | 6.0   | 5.7   | 519.6  | 530.3  | 204.8  | 227.9 | 205.2 | 698.9  |
| constant    | 0.7 | 0.6  | 10.0  | 8.6   | 735.7  | 746.1  | 262.5  | 318.9 | 262.7 | 977.8  |
| constant    | 0.9 | 0.8  | 13.0  | 11.5  | 951.1  | 962.4  | 344.7  | 408.1 | 344.8 | 1251.4 |
| constant    | 1.1 | 0.9  | 18.0  | 15.5  | 1169.7 | 1178.3 | 422.8  | 491.6 | 423.4 | 1507.6 |
| constant    | 1.3 | 1.1  | 23.0  | 19.3  | 1383.0 | 1394.3 | 464.6  | 576.5 | 465.1 | 1767.8 |
| constant    | 1.5 | 1.3  | 28.0  | 23.4  | 1598.5 | 1610.6 | 497.3  | 658.7 | 498.2 | 2019.9 |
| constant    | 1.7 | 1.4  | 34.0  | 28.5  | 1827.6 | 1826.5 | 576.5  | 734.0 | 577.6 | 2250.6 |
|=============|=====|======|=======|=======|========|========|========|=======|=======|========|

```
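
As a sanity check on the two tables, concurrency, request rate, and latency are related: on average, the number of requests in flight is roughly the request rate multiplied by the request latency (Little's law). The sketch below applies this to three rows, pairing mean requests per second with median latency because the tables do not report mean latency. That pairing and the rounding in the tables make this an approximation only.

```python
# (strategy, mean requests/sec, median request latency in s, mean concurrency)
# Values copied from the matching rows of the two tables above.
rows = [
    ("synchronous", 0.1, 11.4, 1.0),
    ("constant", 0.8, 14.7, 11.5),
    ("constant", 1.4, 20.4, 28.5),
]

estimates = []
for strategy, rate, latency, measured in rows:
    # Little's law: average requests in flight = arrival rate x time in system.
    estimate = rate * latency
    estimates.append(estimate)
    print(f"{strategy:<12} estimated={estimate:5.1f}  measured={measured:5.1f}")
```

The estimates land close to the measured mean concurrency, which suggests the two tables are internally consistent.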
### Base Model Performance Summary
1. Max concurrency under load: 34.0 (Concurrency Mdn)
2. Max output tokens per second under load: 576.5 (Output tokens per sec Mdn)
3. Request latency under load: 20.4 (Request Latency in secs Mdn)
4. Time to first token under load: 147.0 (TTFT ms Mdn)
5. Inter token latency under load: 39.6 (ITL ms Mdn)

### SLO Analysis

Assume the Service Level Objective (SLO) is:

    TTFT ≤ 200 milliseconds for 95% of requests (p95) with optimal concurrency

At the highest concurrency of 34 requests, the p95 TTFT is 162.4 ms, which meets the SLO but is about 19% higher than the value recorded for the quantized model (136.0 ms).
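
The same SLO check can be applied to every benchmark in the sweep. The sketch below uses the median concurrency and p95 TTFT values copied from the two tables above; the variable names are chosen here for illustration.

```python
SLO_TTFT_P95_MS = 200.0

# (strategy, median concurrency, TTFT p95 in ms), copied from the tables above.
rows = [
    ("synchronous", 1.0, 124.1),
    ("throughput", 125.0, 60812.2),
    ("constant", 3.0, 143.1),
    ("constant", 6.0, 144.9),
    ("constant", 10.0, 144.9),
    ("constant", 13.0, 151.9),
    ("constant", 18.0, 156.9),
    ("constant", 23.0, 157.9),
    ("constant", 28.0, 161.8),
    ("constant", 34.0, 162.4),
]

# Keep only the benchmarks whose p95 TTFT is within the SLO.
passing = [row for row in rows if row[2] <= SLO_TTFT_P95_MS]
best = max(passing, key=lambda row: row[1])

print(f"{len(passing)} of {len(rows)} benchmarks meet the SLO")
print(f"Highest passing concurrency: {best[1]} (TTFT p95 = {best[2]} ms)")
```

Only the ``throughput`` run misses the SLO: with all requests sent at once, requests queue up and TTFT grows to tens of seconds.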

In [2]:
# Load the saved benchmark results from the JSON report
report = GenerativeBenchmarksReport.load_file(
    path="results/base_performance_benchmarks.json",
)
base_benchmarks = report.benchmarks

In [3]:
base_benchmarks[0]

GenerativeBenchmark(type_='generative_benchmark', config=BenchmarkConfig(id_='81906880-14f4-4106-a2dc-a7c9091cb5e2', run_id='7afa4ba9-c332-4234-8495-ff7d7234e1a6', run_index=0, strategy=SynchronousStrategy(type_='synchronous', worker_count=1, max_concurrency=1), constraints={'max_seconds': {'type_': 'max_duration', 'max_duration': 120.0, 'current_index': 0}}, sample_requests=10, warmup=TransientPhaseConfig(percent=None, value=None, mode='prefer_duration'), cooldown=TransientPhaseConfig(percent=None, value=None, mode='prefer_duration'), prefer_response_metrics=True, profile=SweepProfile(type_='sweep', completed_strategies=[SynchronousStrategy(type_='synchronous', worker_count=1, max_concurrency=1), ThroughputStrategy(type_='throughput', worker_count=10, max_concurrency=512, rampup_duration=0.0), AsyncConstantStrategy(type_='constant', worker_count=10, max_concurrency=512, rate=0.2875), AsyncConstantStrategy(type_='constant', worker_count=10, max_concurrency=512, rate=0.4916666666666666)