# GuideLLM Canopy Backend Benchmark 🚀

This notebook benchmarks our **Canopy backend** using a fork of GuideLLM that supports the Canopy endpoint. Technically, the fork works with any endpoint that takes text in and streams its response the way Canopy does, but we add a check that the target really is Canopy, just to make sure.


## Set-up

First, let's install the modified version of GuideLLM.  
You can find it here if you are interested: [https://github.com/RHRolun/guidellm](https://github.com/RHRolun/guidellm)

In [None]:
!pip install -q git+https://github.com/RHRolun/guidellm.git
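
After the install, a quick sanity check can confirm that the package and its CLI are visible to the notebook kernel. This is a minimal sketch: `installed_version` is a hypothetical helper, and it assumes the fork installs under the same distribution name as upstream (`guidellm`).

```python
import shutil
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# ASSUMPTION: the fork keeps the upstream distribution name "guidellm".
version = installed_version("guidellm")
cli_path = shutil.which("guidellm")  # None when the CLI is not on PATH

print(f"guidellm version: {version or 'not installed'}")
print(f"guidellm CLI: {cli_path or 'not found on PATH'}")
```

If the CLI is not on `PATH`, restarting the kernel after the install is usually the first thing to try.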

## Benchmark Configuration 🎯

With native Canopy support, we simply use models prefixed with "canopy_" to automatically enable Canopy backend behavior.
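
Since the prefix is what switches the Canopy behavior on, a small guard can catch a mistyped model name before a benchmark run. The sketch below is illustrative only: `require_canopy_model` is a hypothetical helper, not part of GuideLLM, and the check inside the fork may be implemented differently.

```python
CANOPY_PREFIX = "canopy_"  # prefix described in the text above


def require_canopy_model(model_name: str) -> str:
    """Return the model name unchanged, or raise if it lacks the Canopy prefix."""
    if not model_name.startswith(CANOPY_PREFIX):
        raise ValueError(
            f"Model {model_name!r} does not start with {CANOPY_PREFIX!r}; "
            "Canopy behavior would not be enabled."
        )
    return model_name


print(require_canopy_model("canopy_system"))
```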

In [None]:
# ---- Target & Endpoint Configuration ----
base_url = "http://canopy-backend:8000"
endpoint = "/summarize"

# Complete target URL with endpoint path
target = f"{base_url}{endpoint}"  # e.g., "https://...com/summarize"
model = "canopy_system"  # Models starting with "canopy_" automatically enable Canopy support
processor = "RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8"  # Real Hugging Face model used for tokenization; here, the model underlying Canopy
output_path = "canopy-benchmark-summarize-endpoint.yaml"

print(f"Base URL: {base_url}")
print(f"Endpoint: {endpoint}")
print(f"Full target: {target}")
print(f"Processor: {processor}")
print(f"Output file: {output_path}")

## Emulated Workload 📦

We'll emulate prompts/outputs with fixed token sizes for a baseline throughput test.

In [None]:
# ---- Workload Configuration ----
data = {
    "type": "emulated",
    "prompt_tokens": 512,
    "output_tokens": 128,
}
rate_type = "synchronous"   # pacing type
max_seconds = 60            # total duration (seconds)

backend_args = {}  # No special backend args needed - Canopy support is built-in
extra_flags = ""  # Add any extra flags here
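
To get a feel for how much work this configuration generates, here is a back-of-the-envelope estimate. It assumes that synchronous pacing keeps one request in flight at a time, and it uses a made-up mean latency of 4 seconds per request. Both are assumptions for illustration, not measurements.

```python
prompt_tokens = 512
output_tokens = 128
max_seconds = 60

# ASSUMPTION: illustrative mean request latency, not a measured value.
assumed_latency_s = 4.0

# With one request in flight at a time, requests run back to back.
expected_requests = int(max_seconds // assumed_latency_s)
total_prompt_tokens = expected_requests * prompt_tokens
total_output_tokens = expected_requests * output_tokens

print(f"Expected requests: {expected_requests}")
print(f"Prompt tokens sent: {total_prompt_tokens}")
print(f"Output tokens generated: {total_output_tokens}")
```

Under these assumptions the run completes about 15 requests, so the percentile figures in the results rest on a small sample.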

## Build Command 👀

Here we generate the GuideLLM command for our Canopy backend and then print it so we know what we are running.  
We will use this command later when we productionize evaluations.

In [None]:
import json, shlex

def build_canopy_guidellm_command(
    target: str,
    model: str,  # canopy_summarize or canopy_info_search
    processor: str,
    data: dict,
    output_path: str,
    rate_type: str,
    max_seconds: int,
    backend_args: dict = None,
    extra_flags: str = "",
):
    data_json = json.dumps(data, separators=(",", ":"))
    
    cmd_list = [
        "guidellm", "benchmark",
        f"--target={target}",
        f"--model={model}",  # Use canopy model name (starts with "canopy_")
        f"--processor={processor}",
        f"--backend-type=openai_http",  # OpenAI backend with native Canopy support
        f"--data={data_json}",
        f"--output-path={output_path}",
        f"--rate-type={rate_type}",
        f"--max-seconds={int(max_seconds)}",
    ]
    
    # Only add backend-args if we have them
    if backend_args:
        backend_args_json = json.dumps(backend_args, separators=(",", ":"))
        cmd_list.append(f"--backend-args={backend_args_json}")
    
    if extra_flags.strip():
        # shlex.split keeps quoted flag values together, unlike str.split
        cmd_list.extend(shlex.split(extra_flags))

    # Pretty shell string
    parts = [
        "guidellm benchmark",
        f"--target={shlex.quote(target)}",
        f"--model={shlex.quote(model)}",
        f"--processor={shlex.quote(processor)}",
        "--backend-type=openai_http",
        f"--data={shlex.quote(data_json)}",
        f"--output-path={shlex.quote(output_path)}",
        f"--rate-type={shlex.quote(rate_type)}",
        f"--max-seconds={int(max_seconds)}",
    ]
    
    # Only add backend-args to display if we have them
    if backend_args:
        backend_args_json = json.dumps(backend_args, separators=(",", ":"))
        parts.insert(-4, f"--backend-args={shlex.quote(backend_args_json)}")  # Insert before --data
    
    if extra_flags.strip():
        parts.append(extra_flags)

    cmd_shell = " ".join(parts)
    return cmd_list, cmd_shell

cmd_list, cmd_shell = build_canopy_guidellm_command(
    target, model, processor, data, output_path, rate_type, max_seconds, backend_args, extra_flags
)

print(f"Canopy {endpoint} endpoint benchmark command (with native support):\n")
print(cmd_shell)
print("\nSubprocess list:\n")
print(cmd_list)
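
The shell string needs quoting because the `--data` value is JSON, which is full of characters a shell would otherwise interpret. The sketch below shows that `shlex.quote` produces a single shell token and that `shlex.split` recovers the original JSON. It uses only the standard library and the workload values from this notebook.

```python
import json
import shlex

data = {"type": "emulated", "prompt_tokens": 512, "output_tokens": 128}
data_json = json.dumps(data, separators=(",", ":"))

# Quote the whole argument so the shell sees the JSON as a single token.
shell_arg = shlex.quote(f"--data={data_json}")
shell_line = f"guidellm benchmark {shell_arg}"

# shlex.split tokenizes the line the way a POSIX shell would.
tokens = shlex.split(shell_line)
recovered = json.loads(tokens[2].split("=", 1)[1])

print(shell_line)
print(recovered == data)
```

The subprocess list needs no quoting at all, since each list element is passed to the program as one argument without going through a shell.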

## Run Benchmark ▶️

Execute the benchmark against your Canopy backend and stream logs. This might take ~2 minutes #PleaseHold

In [None]:
import shutil, subprocess, sys, os

print(f"🚀 Starting canopy {endpoint} endpoint benchmark with native support...\n")
print(f"Target: {target}")
print(f"Endpoint: {endpoint}")
print(f"Model: {model} (enables native Canopy support)")
print(f"Duration: {max_seconds}s\n")

if shutil.which("guidellm") is None:
    print("⚠️  guidellm command not found in PATH, trying python module...")
    cmd_prefix = [sys.executable, "-m", "guidellm"]  # use the kernel's own interpreter
else:
    cmd_prefix = ["guidellm"]

# Run guidellm directly - no wrapper needed with native support
process = subprocess.Popen([
    *cmd_prefix, "benchmark"
] + cmd_list[2:], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)

for line in process.stdout:
    sys.stdout.write(line)
ret = process.wait()

status = "✅ Benchmark completed" if ret == 0 else "⚠️ Benchmark exited with an error"
print(f"\n{status}. Exit code: {ret}")
if ret != 0:
    print("❌ Non-zero exit code. Review the logs above.")
    print("   Common issues:")
    print("   - Canopy service not accessible")
    print("   - Endpoint feature not enabled") 
    print("   - Network connectivity issues")
    print("   - Model name doesn't start with 'canopy_'")
else:
    print(f"📊 Results saved to: {output_path}")
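
The streaming pattern used above can be wrapped in a small reusable helper. This is a sketch rather than part of GuideLLM: `stream_command` is a hypothetical name, and the demo runs a tiny child Python process instead of a real benchmark.

```python
import subprocess
import sys
from typing import List, Tuple


def stream_command(cmd: List[str]) -> Tuple[int, List[str]]:
    """Run cmd, echo each output line as it arrives, and return (exit_code, lines)."""
    lines = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        bufsize=1,  # line-buffered on our side of the pipe
    ) as process:
        for line in process.stdout:
            sys.stdout.write(line)
            lines.append(line.rstrip("\n"))
    # Leaving the with-block waits for the child, so returncode is set here.
    return process.returncode, lines


# Demo with a stand-in child process (not a real benchmark run).
child = [sys.executable, "-c", "print('step 1'); print('step 2')"]
exit_code, lines = stream_command(child)
print(f"exit code: {exit_code}, captured {len(lines)} lines")
```

Keeping the captured lines makes it possible to search the log afterwards, for example for error messages, without rerunning the benchmark.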

## Inspect Results 📝

The results file is quite long, so the cell below previews only its first 200 lines.

In [None]:
from pathlib import Path

p = Path(output_path)
if p.exists():
    print(f"📋 Results: {p.resolve()}")
    print(f"📊 Benchmarked: Canopy {endpoint} endpoint (native support)")
    print(f"🎯 Target: {target}")
    print(f"⏱️  Duration: {max_seconds}s\n")
    
    try:
        print("--- First 200 lines ---\n")
        with p.open("r", encoding="utf-8", errors="replace") as f:
            for i, line in enumerate(f):
                if i >= 200:
                    print("... (truncated)")
                    break
                print(line.rstrip("\n"))
    except Exception as e:
        print(f"Preview error: {e}")
else:
    print(f"❌ Results not found: {p}")
    print("   Check the benchmark command and logs above.")

And now, let's learn how to interpret the results.

## Understanding GuideLLM Benchmark Metrics 📊

As you can see, GuideLLM produces a **LOT** of output, to the point where it's difficult to skim through.  
Here are a few key performance indicators worth looking at, along with the range each should fall into.  
Run the cell below to see the results we got. If you really want to look through all of the metrics yourself, you can find them in `canopy-benchmark-summarize-endpoint.yaml`.

**🚀 Time to First Token (TTFT)**
- **What it measures**: How long users wait before seeing any response
- **Good range**: < 500ms (excellent < 200ms)
- **Impact**: User perception of responsiveness

**⚡ Output Tokens per Second** 
- **What it measures**: Speed of text generation during streaming
- **Good range**: 20-100 tokens/sec (depends on model size)
- **Impact**: How fast users see text appear

**📈 Requests per Second**
- **What it measures**: How many complete requests the system handles per second
- **Good range**: Varies by use case (0.1-10+ req/sec)
- **Impact**: System throughput and user capacity

**🎯 Request Latency**
- **What it measures**: Total time from request start to completion
- **Good range**: < 30 seconds for long responses
- **Impact**: Overall user experience

**⚙️ Inter-Token Latency**
- **What it measures**: Time between consecutive tokens during streaming (milliseconds); a steady value means consistent generation
- **Good range**: 10-50ms (lower = faster streaming)
- **Impact**: Smoothness of streaming text

**📊 Success Rate**
- **What it measures**: Percentage of requests that complete successfully
- **Good range**: > 95% (production requirement)
- **Impact**: System reliability
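
To make the timing metrics concrete, the sketch below derives them from the arrival times of the tokens in a single streamed response. The timestamps are invented for illustration, `summarize_stream` is a hypothetical helper, and GuideLLM's own definitions may differ in detail (for example, in whether the wait for the first token counts toward the token rate).

```python
from statistics import mean
from typing import Dict, List


def summarize_stream(request_start_ms: int, token_times_ms: List[int]) -> Dict[str, float]:
    """Derive per-request timing metrics from token arrival times in milliseconds."""
    ttft_ms = token_times_ms[0] - request_start_ms
    # Gaps between consecutive tokens; their mean is the inter-token latency.
    gaps_ms = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    latency_ms = token_times_ms[-1] - request_start_ms
    return {
        "ttft_ms": ttft_ms,
        "inter_token_latency_ms": mean(gaps_ms),
        "request_latency_s": latency_ms / 1000,
        # ASSUMPTION: rate is computed over the whole request, including TTFT.
        "output_tokens_per_s": len(token_times_ms) * 1000 / latency_ms,
    }


# ASSUMPTION: made-up timestamps, first token at 150 ms, then one every 25 ms.
token_times_ms = list(range(150, 401, 25))
metrics = summarize_stream(0, token_times_ms)

for name, value in metrics.items():
    print(f"{name}: {value}")

print("TTFT within good range:", metrics["ttft_ms"] < 500)
print("ITL within good range:", 10 <= metrics["inter_token_latency_ms"] <= 50)
```

A benchmark report aggregates values like these across all requests, typically as means, medians, and percentiles.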

In [None]:
from metrics_analyzer import analyze_benchmark_results
analyze_benchmark_results(output_path=output_path)

## HTML Report 🕸️

We can also save the results in several other formats. Here is an example that writes them as HTML:

In [None]:
from guidellm.benchmark import GenerativeBenchmarksReport

report = GenerativeBenchmarksReport.load_file(
    path=output_path,
)
benchmarks = report.benchmarks

for benchmark in benchmarks:
    print(benchmark.id_)
report.save_html(path="canopy-benchmark-summarize-endpoint.html")

Then download the HTML file (right-click > Download) and open it locally.