In [6]:
import pandas as pd
import numpy as np
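from pathlib import Path

# display() is injected automatically in Jupyter; the explicit import keeps
# the cells runnable when exported to a plain Python script.
from IPython.display import display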

pd.set_option("display.max_colwidth", 800)

### Review the workload context and usage assumptions provided in the notebook

In [7]:
workload_context = {
    "workload_name": "Simulated GenAI Inference - Customer Support Assistant",
    "traffic_pattern": "steady",
    "requests_per_day": 250_000,
    "avg_input_tokens": 650,
    "avg_output_tokens": 220,
    "sla_notes": "Interactive use case. Prefer p95 latency under ~1200 ms.",
    "kg_co2e_per_kwh": 0.40,
}

pd.DataFrame([workload_context])

Unnamed: 0,workload_name,traffic_pattern,requests_per_day,avg_input_tokens,avg_output_tokens,sla_notes,kg_co2e_per_kwh
0,Simulated GenAI Inference - Customer Support Assistant,steady,250000,650,220,Interactive use case. Prefer p95 latency under ~1200 ms.,0.4
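
The context values imply two derived quantities that are useful when reading the metrics that follow: the average request rate and the daily token volume. The sketch below is a minimal illustration, assuming requests are spread evenly over 24 hours (consistent with the "steady" traffic pattern, but still an assumption). `derive_load` is a hypothetical helper name, not part of the original notebook.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def derive_load(requests_per_day, avg_input_tokens, avg_output_tokens):
    """Return average request rate and daily token volume.

    Assumes requests arrive evenly across the day, which the workload
    context suggests ("steady") but does not guarantee.
    """
    tokens_per_request = avg_input_tokens + avg_output_tokens
    return {
        "avg_requests_per_sec": requests_per_day / SECONDS_PER_DAY,
        "tokens_per_request": tokens_per_request,
        "tokens_per_day": requests_per_day * tokens_per_request,
    }


# Values copied from workload_context above.
load = derive_load(
    requests_per_day=250_000,
    avg_input_tokens=650,
    avg_output_tokens=220,
)
for key, value in load.items():
    print(f"{key}: {value:,.2f}")
```

Under these assumptions the workload averages roughly 2.9 requests per second and 217.5 million tokens per day.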


### Load the baseline inference metrics for the simulated generative AI workload

In [8]:
baseline_path = Path("baseline_inference_metrics.csv")

if not baseline_path.exists():
    raise FileNotFoundError(
        "baseline_inference_metrics.csv not found. "
        "Place it next to this notebook or update baseline_path."
    )

baseline_df = pd.read_csv(baseline_path)
baseline_df

Unnamed: 0,metric,baseline_value,notes
0,p50_latency_ms,420.0,Median end-to-end inference latency
1,p95_latency_ms,980.0,Tail latency under moderate load
2,throughput_requests_per_sec,18.0,Sustained throughput under steady traffic
3,average_memory_gb,22.5,Average GPU memory utilization
4,cost_per_1k_requests_usd,7.4,Estimated cloud cost based on instance pricing
5,estimated_energy_kwh_per_1k_requests,3.2,Rough estimate based on compute utilization
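
Because every later cell looks metrics up by name, it can help to validate the loaded table before using it. The following is a sketch only: `REQUIRED_METRICS` and `validate_metrics` are hypothetical names, and the stand-in frame repeats the rows shown above so the example runs without the CSV file.

```python
import pandas as pd

# Metric names the downstream cells rely on (taken from the table above).
REQUIRED_METRICS = {
    "p50_latency_ms",
    "p95_latency_ms",
    "throughput_requests_per_sec",
    "average_memory_gb",
    "cost_per_1k_requests_usd",
    "estimated_energy_kwh_per_1k_requests",
}


def validate_metrics(df):
    """Raise ValueError if metrics are missing, duplicated, or non-numeric."""
    missing = REQUIRED_METRICS - set(df["metric"])
    if missing:
        raise ValueError(f"Missing metrics: {sorted(missing)}")
    if df["metric"].duplicated().any():
        raise ValueError("Duplicate metric rows found.")
    values = pd.to_numeric(df["baseline_value"], errors="coerce")
    if values.isna().any():
        raise ValueError("Non-numeric or empty baseline_value entries found.")
    return df.assign(baseline_value=values.astype(float))


# Stand-in for baseline_df with the same rows as the CSV shown above.
sample = pd.DataFrame({
    "metric": [
        "p50_latency_ms",
        "p95_latency_ms",
        "throughput_requests_per_sec",
        "average_memory_gb",
        "cost_per_1k_requests_usd",
        "estimated_energy_kwh_per_1k_requests",
    ],
    "baseline_value": [420.0, 980.0, 18.0, 22.5, 7.4, 3.2],
})

validated = validate_metrics(sample)
print(validated)
```

In the notebook itself, the same check could be applied as `validate_metrics(baseline_df)` right after `pd.read_csv`.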


### Analyze the baseline metrics to identify potential efficiency concerns

In [9]:
# Pull baseline values into a dict for easy downstream calculations.
baseline = dict(zip(baseline_df["metric"], baseline_df["baseline_value"]))

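# A few simple checks based on the workload notes: the 1200 ms p95 threshold
# comes from sla_notes; the 16 GB and 24 GB memory thresholds correspond to
# common GPU memory sizes.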
baseline_findings = []

p95 = baseline.get("p95_latency_ms")
if p95 is not None:
    if p95 > 1200:
        baseline_findings.append(f"p95 latency is {p95:.0f} ms, which may be high for an interactive SLA.")
    else:
        baseline_findings.append(f"p95 latency is {p95:.0f} ms, within the stated interactive target range.")

mem = baseline.get("average_memory_gb")
if mem is not None:
    if mem >= 24:
        baseline_findings.append(f"Average memory is {mem:.1f} GB, which may force larger (more expensive) GPUs.")
    elif mem >= 16:
        baseline_findings.append(f"Average memory is {mem:.1f} GB; watch GPU sizing and headroom for spikes.")
    else:
        baseline_findings.append(f"Average memory is {mem:.1f} GB, leaving reasonable headroom for common GPU tiers.")

cost = baseline.get("cost_per_1k_requests_usd")
if cost is not None:
    baseline_findings.append(f"Estimated cost is ${cost:.2f} per 1k requests. Scale impact depends on daily volume.")

energy = baseline.get("estimated_energy_kwh_per_1k_requests")
if energy is not None:
    baseline_findings.append(f"Estimated energy is {energy:.2f} kWh per 1k requests. Consider energy as a constraint at scale.")

pd.DataFrame({"baseline_findings": baseline_findings})

Unnamed: 0,baseline_findings
0,"p95 latency is 980 ms, within the stated interactive target range."
1,Average memory is 22.5 GB; watch GPU sizing and headroom for spikes.
2,Estimated cost is $7.40 per 1k requests. Scale impact depends on daily volume.
3,Estimated energy is 3.20 kWh per 1k requests. Consider energy as a constraint at scale.
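
One further efficiency signal can be derived from numbers already in the notebook: how much of the sustained throughput the average traffic actually consumes. This is a rough sketch under two assumptions that the notebook does not confirm: that `throughput_requests_per_sec` describes the capacity of the whole deployment, and that traffic is spread evenly across the day. `utilization` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def utilization(requests_per_day, sustained_throughput_rps):
    """Fraction of sustained capacity used by the average arrival rate."""
    if sustained_throughput_rps <= 0:
        raise ValueError("sustained_throughput_rps must be positive")
    avg_arrival_rps = requests_per_day / SECONDS_PER_DAY
    return avg_arrival_rps / sustained_throughput_rps


# requests_per_day from workload_context, throughput from the baseline table.
baseline_util = utilization(250_000, 18.0)
print(f"Average utilization at baseline: {baseline_util:.1%}")
```

If those assumptions hold, average utilization is around 16%, which would suggest idle capacity at baseline. Real traffic has peaks, so this figure is a lower bound on peak utilization rather than a sizing recommendation.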


### Review the simulated optimization scenario and its post-optimization metrics

#### Optimization Scenario 

For this exercise, assume the following optimization has already been applied to the generative AI inference system.

##### Optimization Applied
**INT8 quantization with dynamic batching**

##### Description
- Reduced-precision (INT8) inference is used to lower memory usage and improve throughput.
- Dynamic batching groups multiple requests together to increase hardware utilization under steady traffic.
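
To make the first bullet concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization with NumPy. The scenario does not say which quantization scheme was used, so this is an illustration of the general idea rather than the method applied to the system; `quantize_int8` and `dequantize` are hypothetical helpers.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor quantization: one float scale for the whole array."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("float32 bytes:", w.nbytes)
print("int8 bytes:   ", q.nbytes)
print("max abs error:", float(np.max(np.abs(w - w_hat))))
```

The int8 array takes one quarter of the float32 storage (one half relative to 16-bit weights), and the rounding error per value is bounded by about half the scale. Total GPU memory usually includes more than weights, so the end-to-end reduction in a real system is typically smaller than the per-tensor ratio.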

##### Expected Benefits
- Lower GPU memory footprint
- Higher sustained throughput
- Reduced cost per request
- Reduced energy consumption per request

##### Known Tradeoffs
- Median latency may increase slightly due to batching overhead
- Output quality impact is expected to be minor but not zero
- Occasional formatting drift or subtle tone changes may occur for longer outputs

##### Your Task
You are **not** asked to design or implement this optimization.

Your task is to **evaluate whether this optimization is acceptable** for the workload using the provided performance, cost, and energy metrics.


In [16]:
optimization_scenario = {
    "name": "INT8 quantization + dynamic batching",
    "notes": [
        "Reduced precision inference lowers memory and can improve throughput.",
        "Batching can increase throughput but may increase median latency under certain traffic patterns.",
        "Quality impact is assumed to be minimal but not zero (monitor formatting + factuality).",
    ],
    "quality_impact_note": "Minor: occasional slight tone shift and rare formatting drift under long outputs.",
}

pd.DataFrame({"scenario_notes": optimization_scenario["notes"]})

Unnamed: 0,scenario_notes
0,Reduced precision inference lowers memory and can improve throughput.
1,Batching can increase throughput but may increase median latency under certain traffic patterns.
2,Quality impact is assumed to be minimal but not zero (monitor formatting + factuality).


### Construct a comparison table that summarizes baseline and optimized metrics side by side

In [12]:
# Simulated optimized metrics aligned to the baseline metric list
optimized_metrics = {
    "p50_latency_ms": 460,                         # slight increase due to batching overhead
    "p95_latency_ms": 820,                         # improved tail latency via better throughput + lower contention
    "throughput_requests_per_sec": 32,             # improved
    "average_memory_gb": 14.2,                     # reduced memory footprint
    "cost_per_1k_requests_usd": 4.60,              # lower cost
    "estimated_energy_kwh_per_1k_requests": 2.1,   # lower energy
}

optimized_df = pd.DataFrame({
    "metric": list(optimized_metrics.keys()),
    "optimized_value": list(optimized_metrics.values())
})

comparison = baseline_df.merge(optimized_df, on="metric", how="left")

# Add deltas and percent deltas
comparison["delta"] = comparison["optimized_value"] - comparison["baseline_value"]
comparison["pct_change"] = np.where(
    comparison["baseline_value"] != 0,
    (comparison["delta"] / comparison["baseline_value"]) * 100,
    np.nan
)

display(comparison)

Unnamed: 0,metric,baseline_value,notes,optimized_value,delta,pct_change
0,p50_latency_ms,420.0,Median end-to-end inference latency,460.0,40.0,9.52381
1,p95_latency_ms,980.0,Tail latency under moderate load,820.0,-160.0,-16.326531
2,throughput_requests_per_sec,18.0,Sustained throughput under steady traffic,32.0,14.0,77.777778
3,average_memory_gb,22.5,Average GPU memory utilization,14.2,-8.3,-36.888889
4,cost_per_1k_requests_usd,7.4,Estimated cloud cost based on instance pricing,4.6,-2.8,-37.837838
5,estimated_energy_kwh_per_1k_requests,3.2,Rough estimate based on compute utilization,2.1,-1.1,-34.375
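
The sign of `pct_change` alone does not say whether a metric got better, because lower is better for latency, memory, cost, and energy while higher is better for throughput. The sketch below makes that direction explicit. The `LOWER_IS_BETTER` mapping is an interpretation added here and is not encoded anywhere in the original notebook; `classify` is a hypothetical helper.

```python
# Direction of "good" for each metric (an added interpretation).
LOWER_IS_BETTER = {
    "p50_latency_ms": True,
    "p95_latency_ms": True,
    "throughput_requests_per_sec": False,
    "average_memory_gb": True,
    "cost_per_1k_requests_usd": True,
    "estimated_energy_kwh_per_1k_requests": True,
}


def classify(metric, baseline_value, optimized_value):
    """Label a metric change as improved, regressed, or unchanged."""
    if optimized_value == baseline_value:
        return "unchanged"
    went_down = optimized_value < baseline_value
    return "improved" if went_down == LOWER_IS_BETTER[metric] else "regressed"


# Values copied from the comparison table above.
baseline_values = {
    "p50_latency_ms": 420.0,
    "p95_latency_ms": 980.0,
    "throughput_requests_per_sec": 18.0,
    "average_memory_gb": 22.5,
    "cost_per_1k_requests_usd": 7.4,
    "estimated_energy_kwh_per_1k_requests": 3.2,
}
optimized_values = {
    "p50_latency_ms": 460,
    "p95_latency_ms": 820,
    "throughput_requests_per_sec": 32,
    "average_memory_gb": 14.2,
    "cost_per_1k_requests_usd": 4.60,
    "estimated_energy_kwh_per_1k_requests": 2.1,
}

verdicts = {
    metric: classify(metric, baseline_values[metric], optimized_values[metric])
    for metric in baseline_values
}
for metric, verdict in verdicts.items():
    print(f"{metric}: {verdict}")
```

With the simulated values, five metrics improve and only `p50_latency_ms` regresses.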


### Calculate and summarize the differences between baseline and optimized metrics

In [13]:
requests_per_day = workload_context["requests_per_day"]

# Pull per-1k request cost/energy and scale them
baseline_cost_per_day = (baseline["cost_per_1k_requests_usd"] * requests_per_day) / 1000
optimized_cost_per_day = (optimized_metrics["cost_per_1k_requests_usd"] * requests_per_day) / 1000

baseline_energy_per_day = (baseline["estimated_energy_kwh_per_1k_requests"] * requests_per_day) / 1000
optimized_energy_per_day = (optimized_metrics["estimated_energy_kwh_per_1k_requests"] * requests_per_day) / 1000

# Optional CO2e estimate (assumption-based); skipped when no grid emission
# factor is provided in the workload context.
kg_co2e_per_kwh = workload_context.get("kg_co2e_per_kwh")
if kg_co2e_per_kwh is not None:
    baseline_co2e_kg_per_day = baseline_energy_per_day * kg_co2e_per_kwh
    optimized_co2e_kg_per_day = optimized_energy_per_day * kg_co2e_per_kwh
else:
    baseline_co2e_kg_per_day = optimized_co2e_kg_per_day = None

scale_summary = {
    "requests_per_day": requests_per_day,
    "baseline_cost_per_day_usd": baseline_cost_per_day,
    "optimized_cost_per_day_usd": optimized_cost_per_day,
    "daily_cost_savings_usd": baseline_cost_per_day - optimized_cost_per_day,
    "baseline_energy_kwh_per_day": baseline_energy_per_day,
    "optimized_energy_kwh_per_day": optimized_energy_per_day,
    "daily_energy_savings_kwh": baseline_energy_per_day - optimized_energy_per_day,
}

if baseline_co2e_kg_per_day is not None:
    scale_summary.update({
        "assumed_kg_co2e_per_kwh": kg_co2e_per_kwh,
        "baseline_co2e_kg_per_day": baseline_co2e_kg_per_day,
        "optimized_co2e_kg_per_day": optimized_co2e_kg_per_day,
        "daily_co2e_savings_kg": baseline_co2e_kg_per_day - optimized_co2e_kg_per_day,
    })

pd.DataFrame([scale_summary]).T.rename(columns={0: "value"})

Unnamed: 0,value
requests_per_day,250000.0
baseline_cost_per_day_usd,1850.0
optimized_cost_per_day_usd,1150.0
daily_cost_savings_usd,700.0
baseline_energy_kwh_per_day,800.0
optimized_energy_kwh_per_day,525.0
daily_energy_savings_kwh,275.0
assumed_kg_co2e_per_kwh,0.4
baseline_co2e_kg_per_day,320.0
optimized_co2e_kg_per_day,210.0
daily_co2e_savings_kg,110.0
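
The daily figures can be extended to a yearly view with the same per-1k scaling. This is a sketch under a strong assumption: that the daily volume of 250,000 requests and the per-1k estimates stay constant for 365 days. `scale_per_1k` and `DAYS_PER_YEAR` are hypothetical names introduced for illustration.

```python
DAYS_PER_YEAR = 365  # assumes the same volume every day of the year


def scale_per_1k(value_per_1k, requests):
    """Scale a per-1,000-request figure to an arbitrary request count."""
    return value_per_1k * requests / 1000


requests_per_day = 250_000
kg_co2e_per_kwh = 0.40  # assumed grid factor from workload_context

# Per-1k values copied from the comparison table.
daily_cost_savings = (
    scale_per_1k(7.40, requests_per_day) - scale_per_1k(4.60, requests_per_day)
)
daily_energy_savings = (
    scale_per_1k(3.2, requests_per_day) - scale_per_1k(2.1, requests_per_day)
)

annual_cost_savings = daily_cost_savings * DAYS_PER_YEAR
annual_energy_savings = daily_energy_savings * DAYS_PER_YEAR
annual_co2e_savings = annual_energy_savings * kg_co2e_per_kwh

print(f"Annual cost savings:   ${annual_cost_savings:,.0f}")
print(f"Annual energy savings: {annual_energy_savings:,.0f} kWh")
print(f"Annual CO2e savings:   {annual_co2e_savings:,.0f} kg")
```

Under these assumptions the optimization saves about $255,500, 100,375 kWh, and 40,150 kg CO2e per year. All three figures inherit the uncertainty of the underlying per-1k estimates.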


### Evaluate the tradeoffs introduced by the optimization

In [17]:
tradeoffs = []

# Latency tradeoff (p50 may increase; p95 may improve)
tradeoffs.append({
    "area": "Latency",
    # Format both sides with :.0f so float baselines and int optimized values
    # render consistently (420 -> 460 rather than 420.0 -> 460).
    "what_changed": f"p50: {baseline['p50_latency_ms']:.0f} -> {optimized_metrics['p50_latency_ms']:.0f} ms, "
                    f"p95: {baseline['p95_latency_ms']:.0f} -> {optimized_metrics['p95_latency_ms']:.0f} ms",
    "interpretation": "Median latency slightly increased (batching overhead), but tail latency improved."
})

# Throughput
tradeoffs.append({
    "area": "Throughput",
    "what_changed": f"{baseline['throughput_requests_per_sec']:.0f} -> {optimized_metrics['throughput_requests_per_sec']:.0f} req/s",
    "interpretation": "Higher throughput reduces queueing risk and can stabilize tail latency under load."
})

# Memory
tradeoffs.append({
    "area": "Memory",
    "what_changed": f"{baseline['average_memory_gb']} -> {optimized_metrics['average_memory_gb']} GB",
    "interpretation": "Lower memory enables smaller instances or more headroom, improving cost flexibility."
})

# Quality / flexibility (scenario-based note)
tradeoffs.append({
    "area": "Output Quality",
    "what_changed": optimization_scenario["quality_impact_note"],
    "interpretation": "Treat as a monitoring requirement. Add regression tests for tone, formatting, and factuality."
})

pd.DataFrame(tradeoffs)

Unnamed: 0,area,what_changed,interpretation
0,Latency,"p50: 420 -> 460 ms, p95: 980 -> 820 ms","Median latency slightly increased (batching overhead), but tail latency improved."
1,Throughput,18 -> 32 req/s,Higher throughput reduces queueing risk and can stabilize tail latency under load.
2,Memory,22.5 -> 14.2 GB,"Lower memory enables smaller instances or more headroom, improving cost flexibility."
3,Output Quality,Minor: occasional slight tone shift and rare formatting drift under long outputs.,"Treat as a monitoring requirement. Add regression tests for tone, formatting, and factuality."
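
The latency row attributes the p50 increase to batching overhead. A toy model helps show where that overhead comes from: a request may wait for its batch to fill or for a timeout to expire. This is a simplified sketch and not how any particular serving framework implements batching. `form_batches`, `queue_waits`, `max_batch_size`, and `max_wait_ms` are hypothetical names, and the arrival times are made up for illustration.

```python
def form_batches(arrival_times_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival times into batches.

    A batch is dispatched when it is full, or when max_wait_ms has passed
    since its first request arrived, whichever comes first.
    Returns a list of (dispatch_time_ms, [arrival_times]) tuples.
    """
    batches = []
    current = []
    for t in arrival_times_ms:
        # Flush the open batch if its wait window expired before this arrival.
        if current and t > current[0] + max_wait_ms:
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        # Dispatch immediately once the batch is full.
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        # The last open batch is dispatched when its window expires.
        batches.append((current[0] + max_wait_ms, current))
    return batches


def queue_waits(batches):
    """Time each request spent waiting for its batch to be dispatched."""
    return [
        dispatch - arrival
        for dispatch, arrivals in batches
        for arrival in arrivals
    ]


arrivals = [0, 5, 12, 40, 41, 42, 43, 200]  # illustrative arrival times in ms
batches = form_batches(arrivals, max_batch_size=4, max_wait_ms=20)
waits = queue_waits(batches)

for dispatch, members in batches:
    print(f"dispatch at {dispatch} ms: {members}")
print("waits (ms):", waits)
```

In this toy run the burst of four requests is dispatched almost immediately, while the isolated request at 200 ms waits the full 20 ms window. That is why a wait limit matters most under low traffic.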


### Provide a short analysis and a clear recommendation

In [18]:
analysis = f"""
Summary
- Cost: ${baseline['cost_per_1k_requests_usd']:.2f} -> ${optimized_metrics['cost_per_1k_requests_usd']:.2f} per 1k requests.
- Energy: {baseline['estimated_energy_kwh_per_1k_requests']:.2f} -> {optimized_metrics['estimated_energy_kwh_per_1k_requests']:.2f} kWh per 1k requests.
- Throughput: {baseline['throughput_requests_per_sec']:.0f} -> {optimized_metrics['throughput_requests_per_sec']:.0f} req/s.
- Latency: p50 increased slightly ({baseline['p50_latency_ms']:.0f} -> {optimized_metrics['p50_latency_ms']:.0f} ms) while p95 improved
  ({baseline['p95_latency_ms']:.0f} -> {optimized_metrics['p95_latency_ms']:.0f} ms).

Assessment
The optimization appears acceptable for this interactive workload because tail latency improves and throughput increases materially,
while cost and energy per 1k requests drop significantly. The primary tradeoff is a modest p50 latency increase and a potential minor
quality impact due to reduced precision and batching. If adopted, pair the change with lightweight quality regression checks and
monitor p50 latency under real traffic to ensure batching does not degrade perceived responsiveness.
""".strip()

print(analysis)

Summary
- Cost: $7.40 -> $4.60 per 1k requests.
- Energy: 3.20 -> 2.10 kWh per 1k requests.
- Throughput: 18 -> 32 req/s.
- Latency: p50 increased slightly (420 -> 460 ms) while p95 improved
  (980 -> 820 ms).

Assessment
The optimization appears acceptable for this interactive workload because tail latency improves and throughput increases materially,
while cost and energy per 1k requests drop significantly. The primary tradeoff is a modest p50 latency increase and a potential minor
quality impact due to reduced precision and batching. If adopted, pair the change with lightweight quality regression checks and
monitor p50 latency under real traffic to ensure batching does not degrade perceived responsiveness.


In [19]:
recommendation = {
    "decision": "Adopt with guardrails",
    "guardrails": [
        "Run A/B rollout with real traffic and monitor p50/p95 latency and error rates.",
        "Add automated quality regression checks for formatting, tone stability, and factuality.",
        "Set batching limits to avoid excessive median latency under low traffic.",
        "Revisit cost/energy estimates monthly using actual infrastructure billing and utilization.",
    ],
    "why": "Tail latency, throughput, memory, cost, and energy improved materially; risks are manageable with monitoring and tests."
}

pd.DataFrame({
    "decision": [recommendation["decision"]],
    "why": [recommendation["why"]],
    "guardrails": ["; ".join(recommendation["guardrails"])]
})

Unnamed: 0,decision,why,guardrails
0,Adopt with guardrails,"Tail latency, throughput, memory, cost, and energy improved materially; risks are manageable with monitoring and tests.","Run A/B rollout with real traffic and monitor p50/p95 latency and error rates.; Add automated quality regression checks for formatting, tone stability, and factuality.; Set batching limits to avoid excessive median latency under low traffic.; Revisit cost/energy estimates monthly using actual infrastructure billing and utilization."
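
The first guardrail can be turned into an automated check during rollout. In the sketch below, the 1200 ms p95 target comes from `sla_notes`, while the 15% tolerance for p50 regression is a hypothetical threshold chosen only for illustration. `check_guardrails` is likewise a hypothetical helper.

```python
P95_TARGET_MS = 1200          # from sla_notes in workload_context
MAX_P50_REGRESSION_PCT = 15   # hypothetical tolerance, not from the notebook


def check_guardrails(observed, baseline_p50_ms):
    """Return a list of human-readable guardrail violations (empty if none)."""
    violations = []

    if observed["p95_latency_ms"] > P95_TARGET_MS:
        violations.append(
            f"p95 latency {observed['p95_latency_ms']:.0f} ms exceeds "
            f"target {P95_TARGET_MS} ms"
        )

    p50_change_pct = (
        (observed["p50_latency_ms"] - baseline_p50_ms) / baseline_p50_ms * 100
    )
    if p50_change_pct > MAX_P50_REGRESSION_PCT:
        violations.append(
            f"p50 latency regressed {p50_change_pct:.1f}%, above the "
            f"{MAX_P50_REGRESSION_PCT}% tolerance"
        )

    return violations


# Simulated optimized metrics from the comparison table.
simulated = {"p50_latency_ms": 460, "p95_latency_ms": 820}
violations = check_guardrails(simulated, baseline_p50_ms=420)
print(violations or "All guardrails satisfied.")
```

With the simulated metrics, the p50 regression of about 9.5% stays within the assumed tolerance and p95 stays under the target, so no violations are reported. In a real rollout the `observed` values would come from production monitoring.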
