# Detailed Analysis Traces for GPU-Disaggregated Deep Learning Recommendation Models

## ℹ️ Overview

This repository contains a comprehensive trace dataset for GPU-disaggregated serving of Deep Learning Recommendation Models (DLRMs).
The dataset captures operational characteristics of **156 inference services**, comprising a total of **23,871 inference instances**. 
These instances are further divided into **16,485 CN (CPU Node) inference instances** and **7,386 HN (Heterogeneous GPU Node) inference instances**.

All instances in this dataset are categorized as *Latency-Sensitive (LS)* workloads, reflecting their critical performance requirements. These inference instances are typically **high-priority** and **long-running**, ensuring sustained availability and responsiveness for end users.

## 🗄️ Dataset Details

The core dataset is provided in the file [`disaggregated_DLRM_trace.csv`](./disaggregated_DLRM_trace.csv).

### Field Descriptions

- `instance_sn`: Unique identifier for the instance.
- `role`: Role of the instance.
  - `CN`: CPU Node
  - `HN`: Heterogeneous GPU Node
- `app_name`: Name of the application group to which the instance belongs. An application group is a collection of instances sharing the same application name. For example, `app_0` may contain multiple instances like `instance_0`, `instance_1`, etc.
- `cpu_request`: Number of CPU cores requested by the instance (in vCPUs).
- `cpu_limit`: Maximum number of CPU cores allowed for the instance (same as `cpu_request` in this scenario).
- `gpu_request`: Number of GPUs requested by the instance.
- `gpu_limit`: Maximum number of GPUs allowed for the instance (same as `gpu_request` in this scenario).
- `rdma_request`: Allocated percentage of the bandwidth of an RDMA Network Interface Card (RNIC), ranging from 0 to 100. Currently, this value is used as a constraint for scheduling density.
- `rdma_limit`: Maximum RDMA bandwidth allowed for the instance (same as `rdma_request` in this scenario).
- `memory_request`: Amount of main memory requested by the instance (in GiB).
- `memory_limit`: Maximum amount of main memory allowed for the instance (in GiB).
- `disk_request`: Amount of disk space requested by the instance (in GiB).
- `disk_limit`: Maximum amount of disk space allowed for the instance (in GiB).
- `max_instance_per_node`: Maximum number of instances of the same `app_name` that can be deployed on a single node. A value of `-1` indicates no deployment density limit.
- `creation_time`: Timestamp indicating when the instance was created, expressed as the difference in **seconds** from the trace start time. If the instance existed before the trace start time, this field is set to `NaN`.
- `scheduled_time`: Timestamp indicating when the instance was scheduled, expressed as the difference in **seconds** from the trace start time. If the instance was scheduled before the trace start time, this field is set to `NaN`.
- `deletion_time`: Timestamp indicating when the instance was deleted, expressed as the difference in **seconds** from the trace start time. If the instance was deleted after the trace end time, this field is set to `NaN`.
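
The `NaN` conventions of the three time fields decide which timing features can be derived for a given row. The sketch below illustrates this on a hypothetical three-row sample in the same column layout; the values are made up, and the helper names `parse_time` and `timing_features` are assumptions introduced here rather than part of the dataset tooling. It uses only the standard library so it runs without any other dependency.

```python
import csv
import io
import math

# Hypothetical sample in the column layout described above.
# The values are invented for illustration and are not taken from the trace.
SAMPLE = """instance_sn,role,creation_time,scheduled_time,deletion_time
instance_0,CN,NaN,NaN,5000
instance_1,HN,100,130,NaN
instance_2,HN,200,260,4000
"""


def parse_time(cell: str) -> float:
    """Return the offset in seconds, or NaN for an empty or 'NaN' cell."""
    cell = cell.strip()
    return float(cell) if cell else math.nan


def timing_features(row: dict) -> dict:
    """Derive per-instance timing features from one CSV row."""
    created = parse_time(row["creation_time"])
    scheduled = parse_time(row["scheduled_time"])
    deleted = parse_time(row["deletion_time"])
    return {
        "instance_sn": row["instance_sn"],
        # NaN propagates through subtraction, so a feature is NaN whenever
        # one of its endpoints falls outside the trace window.
        "schedule_delay": scheduled - created,
        "runtime": deleted - scheduled,
        "started_before_trace": math.isnan(scheduled),
        "alive_at_trace_end": math.isnan(deleted),
    }


features = [timing_features(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
for f in features:
    print(f)
```

Only `instance_2` has both endpoints inside the trace, so it is the only row with a finite `runtime`. Runtime statistics computed this way therefore describe completed instances only and say nothing about instances that outlived the trace.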


In [1]:
# Load the disaggregated DLRM trace CSV, run detailed analysis, and save figures, tables, and an HTML report.

"""
How to use
    Put this script next to disaggregated_DLRM_trace.csv.
    Run it with a regular Python interpreter. No arguments needed.

What you get in ./dlrm_analysis_out
    CSV tables
        - overall_resource_stats.csv
        - per_role_resource_stats.csv
        - correlation_matrix.csv
        - per_app_aggregates.csv (includes per-app median schedule delay and runtime)
        - max_instance_per_node_counts.csv
        - time_summaries.csv
        - concurrency_over_time.csv
        - arrivals_per_hour.csv
        - departures_per_hour.csv
    Figures in ./dlrm_analysis_out/figures
        - histograms for CPU, GPU, memory, RDMA, disk, schedule_delay, runtime
        - boxplots of CPU by role and memory by role
        - heatmap of resource correlation
        - concurrency_over_time.png
        - arrivals_and_departures_over_time.png
        - cpu_vs_memory_scatter.png
        - rdma_distribution_by_role.png
    A single HTML report
        - report.html that links to every figure and table
"""

import os
import math
from pathlib import Path
from datetime import datetime
from typing import Dict, List

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# --------------------------- configuration ---------------------------

OUTDIR = Path("./dlrm_analysis_out")
FIGDIR = OUTDIR / "figures"
CSV_NAME = "disaggregated_DLRM_trace.csv"

In [3]:
# --------------------------- utilities ---------------------------

def ensure_outdirs(outdir: Path, figdir: Path) -> None:
    outdir.mkdir(parents=True, exist_ok=True)
    figdir.mkdir(parents=True, exist_ok=True)

def to_numeric(df: pd.DataFrame, cols: List[str]) -> pd.DataFrame:
    for c in cols:
        if c in df.columns:
            df[c] = pd.to_numeric(df[c], errors="coerce")
    return df

def describe_resources(frame: pd.DataFrame) -> pd.DataFrame:
    cols = ["cpu_request", "gpu_request", "rdma_request", "memory_request", "disk_request"]
    cols = [c for c in cols if c in frame.columns]
    desc = frame[cols].describe(percentiles=[0.5, 0.9, 0.95]).T
    desc = desc.rename(columns={"50%": "p50", "90%": "p90", "95%": "p95"})
    return desc[["count", "mean", "std", "min", "p50", "p90", "p95", "max"]]

def safe_percentile(x: pd.Series, q: float) -> float:
    x = pd.to_numeric(x, errors="coerce").dropna()
    if len(x) == 0:
        return float("nan")
    return float(np.percentile(x, q))

def seconds_to_hms(sec: float) -> str:
    if not math.isfinite(sec):
        return "na"
    sec = int(round(sec))
    d, rem = divmod(sec, 86400)
    h, rem = divmod(rem, 3600)
    m, s = divmod(rem, 60)
    if d > 0:
        return f"{d}d {h}h {m}m {s}s"
    if h > 0:
        return f"{h}h {m}m {s}s"
    if m > 0:
        return f"{m}m {s}s"
    return f"{s}s"

def save_hist(series: pd.Series, title: str, xlabel: str, figpath: Path, bins=30, logx=False, logy=False):
    data = pd.to_numeric(series, errors="coerce").dropna()
    if len(data) == 0:
        return
    plt.figure()
    if logx:
        positive = data[data > 0]
        if len(positive) > 0:
            edges = np.logspace(np.log10(positive.min()), np.log10(positive.max()), bins + 1)
            plt.hist(positive, bins=edges)
            plt.xscale("log")
        else:
            plt.hist(data, bins=bins)
    else:
        plt.hist(data, bins=bins)
    if logy:
        plt.yscale("log")
    plt.xlabel(xlabel)
    plt.ylabel("Count")
    plt.title(title)
    plt.tight_layout()
    plt.savefig(figpath, dpi=160)
    plt.close()

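def save_box_by_role(df: pd.DataFrame, col: str, figpath: Path):
    sub = df[["role", col]].dropna()
    if sub.empty:
        return
    # Draw on an explicit Axes. Without ax=..., DataFrame.boxplot opens its own
    # figure and the one created here is left empty and never closed.
    fig, ax = plt.subplots()
    sub.boxplot(by="role", column=[col], ax=ax)
    fig.suptitle("")
    ax.set_title(f"{col} by role")
    ax.set_xlabel("role")
    ax.set_ylabel(col)
    fig.tight_layout()
    fig.savefig(figpath, dpi=160)
    plt.close(fig)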

def save_heatmap(corr: pd.DataFrame, title: str, figpath: Path):
    if corr.empty:
        return
    plt.figure(figsize=(6, 5))
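    # Pin the color scale to the full correlation range so figures are comparable.
    plt.imshow(corr.values, interpolation="nearest", vmin=-1.0, vmax=1.0)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=45, ha="right")
    plt.yticks(range(len(corr.index)), corr.index)
    for i in range(len(corr.index)):
        for j in range(len(corr.columns)):
            v = corr.values[i, j]
            # White labels vanish on the bright end of the default colormap
            # and on empty (NaN) cells, so switch to black there.
            color = "white" if np.isfinite(v) and v < 0.5 else "black"
            plt.text(j, i, f"{v:.2f}", ha="center", va="center", fontsize=8, color=color)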
    plt.colorbar()
    plt.title(title)
    plt.tight_layout()
    plt.savefig(figpath, dpi=160)
    plt.close()

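def sweep_concurrency(starts: np.ndarray, ends: np.ndarray, tmin: float, tmax: float, hour_bin: float = 1.0):
    """Approximate the number of active instances at the end of each time bin.

    Starts and ends are histogrammed into bins of `hour_bin` hours, and the
    concurrency is the running total of starts minus the running total of ends.
    Values outside the bin range are ignored, and an instance that starts and
    ends within the same bin contributes nothing. The result is indexed by bin
    center.
    """
    if not np.isfinite(tmin) or not np.isfinite(tmax) or tmax <= tmin: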
        return pd.Series(dtype=float)
    step = 3600.0 * hour_bin
    n_bins = int(math.ceil((tmax - tmin) / step)) + 1
    edges = np.linspace(tmin, tmin + n_bins * step, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2.0
    sh, _ = np.histogram(starts, bins=edges)
    eh, _ = np.histogram(ends, bins=edges)
    conc = np.cumsum(sh) - np.cumsum(eh)
    return pd.Series(conc, index=pd.Index(centers, name="time_sec"), name="concurrency")

def hourly_counts(ts: pd.Series, tmin: float, tmax: float):
    if ts.dropna().empty or not np.isfinite(tmin) or not np.isfinite(tmax) or tmax <= tmin:
        return pd.Series(dtype=float)
    step = 3600.0
    n_bins = int(math.ceil((tmax - tmin) / step)) + 1
    edges = np.linspace(tmin, tmin + n_bins * step, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2.0
    counts, _ = np.histogram(ts.dropna().values, bins=edges)
    return pd.Series(counts, index=pd.Index(centers, name="time_sec"))

def scatter_cpu_mem(df: pd.DataFrame, figpath: Path):
    sub = df[["cpu_request", "memory_request", "role"]].dropna()
    if sub.empty:
        return
    plt.figure()
    for role, g in sub.groupby("role"):
        plt.scatter(g["cpu_request"], g["memory_request"], s=6, alpha=0.6, label=role)
    plt.xlabel("CPU request vCPUs")
    plt.ylabel("Memory request GiB")
    plt.legend()
    plt.title("CPU versus memory request")
    plt.tight_layout()
    plt.savefig(figpath, dpi=160)
    plt.close()

def rdma_hist_by_role(df: pd.DataFrame, figpath: Path):
    sub = df[["role", "rdma_request"]].dropna()
    if sub.empty:
        return
    plt.figure()
    roles = sorted(sub["role"].unique())
    for role in roles:
        vals = sub.loc[sub["role"] == role, "rdma_request"]
        plt.hist(vals, bins=30, alpha=0.5, label=role)
    plt.xlabel("RDMA percent")
    plt.ylabel("Count")
    plt.title("RDMA request distribution by role")
    plt.legend()
    plt.tight_layout()
    plt.savefig(figpath, dpi=160)
    plt.close()

def write_report_html(context: Dict, outpath: Path):
    def link(p: Path) -> str:
        rel = os.path.relpath(p, outpath.parent)
        return f'<a href="{rel}" target="_blank">{p.name}</a>'
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    lines = []
    lines.append("<html><head><meta charset='utf-8'><title>DLRM trace analysis</title></head><body>")
    lines.append(f"<h2>DLRM trace analysis</h2><p>Generated at {now}</p>")
    lines.append("<h3>Dataset scope</h3>")
    lines.append("<ul>")
    lines.append(f"<li>Rows {context['n_rows']}, columns {context['n_cols']}, apps {context['n_apps']}</li>")
    lines.append(f"<li>Role split, CN {context['cn']} and HN {context['hn']}</li>")
    lines.append(f"<li>GPU backed share {context['gpu_share']:.2%}</li>")
    lines.append("</ul>")

    lines.append("<h3>Key medians by role</h3>")
    lines.append("<ul>")
    lines.append(f"<li>CN median, CPU {context['cn_cpu_med']:.0f} vCPUs, memory {context['cn_mem_med']:.0f} GiB, RDMA {context['cn_rdma_med']:.0f} percent</li>")
    lines.append(f"<li>HN median, CPU {context['hn_cpu_med']:.0f} vCPUs, memory {context['hn_mem_med']:.0f} GiB, GPU {context['hn_gpu_med']:.0f}, RDMA {context['hn_rdma_med']:.0f} percent</li>")
    lines.append("</ul>")

    lines.append("<h3>Correlations</h3>")
    lines.append(f"<p>CPU to memory {context['corr_cpu_mem']:.3f}, GPU to RDMA {context['corr_gpu_rdma']:.3f}</p>")

    lines.append("<h3>Density and placement</h3>")
    lines.append(f"<p>Mode of max_instance_per_node {context['ipn_mode']}, share with no limit {context['ipn_no_limit']:.2%}</p>")

    lines.append("<h3>Timing behavior</h3>")
    lines.append("<ul>")
    lines.append(f"<li>Schedule delay p50 {context['sched_p50']}, p90 {context['sched_p90']}, p95 {context['sched_p95']}</li>")
    lines.append(f"<li>Runtime p50 {context['run_p50']}, p90 {context['run_p90']}, p95 {context['run_p95']}</li>")
    lines.append(f"<li>Missing counts, creation NaN {context['preexist']}, scheduled NaN {context['prescheduled']}, deletion NaN {context['posttrace']}</li>")
    lines.append("</ul>")

    lines.append("<h3>Figures</h3><ul>")
    for fig in context["figs"]:
        lines.append(f"<li>{link(fig)}</li>")
    lines.append("</ul>")

    lines.append("<h3>Tables</h3><ul>")
    for tbl in context["tables"]:
        lines.append(f"<li>{link(tbl)}</li>")
    lines.append("</ul>")

    lines.append("</body></html>")
    outpath.write_text("\n".join(lines), encoding="utf-8")
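
The histogram-based `sweep_concurrency` above reports concurrency per hourly bin. As a cross-check, the sketch below computes an exact point-in-time count with sorted lists and `bisect`. This is a different technique from the one the notebook uses, and the function name `concurrency_at` and the four sample instances are assumptions made for illustration. It follows the field descriptions in treating a `NaN` `scheduled_time` as "already running at trace start" and a `NaN` `deletion_time` as "still running at trace end".

```python
import math
from bisect import bisect_right


def concurrency_at(times, starts, ends):
    """Exact number of intervals [start, end) that are active at each query time."""
    s, e = sorted(starts), sorted(ends)
    # Instances started by time t minus instances already ended by time t.
    return [bisect_right(s, t) - bisect_right(e, t) for t in times]


# Hypothetical instances as (scheduled_time, deletion_time) in seconds.
instances = [
    (math.nan, 5000.0),   # running before the trace started
    (1000.0, math.nan),   # still running when the trace ended
    (2000.0, 9000.0),
    (4000.0, 4500.0),
]
tmin = 0.0

# Pre-existing instances count as active from tmin onwards.
starts = [tmin if math.isnan(s) else s for s, _ in instances]
# Instances that outlive the trace never contribute a departure.
ends = [e for _, e in instances if not math.isnan(e)]

hourly = [0.0, 3600.0, 7200.0, 10800.0]
print(concurrency_at(hourly, starts, ends))  # prints [1, 3, 2, 1]
```

The short-lived fourth instance starts and ends between two query times, so it never shows up in the hourly samples. The binned approach has the same blind spot.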

In [4]:
# --------------------------- main workflow ---------------------------

def main():
    ensure_outdirs(OUTDIR, FIGDIR)

    if not Path(CSV_NAME).exists():
        raise FileNotFoundError(
            f"Could not find {CSV_NAME}. Place the CSV next to this script."
        )

    # Load the dataset
    df = pd.read_csv(CSV_NAME)

    # Convert numeric columns
    num_cols = [
        "cpu_request", "cpu_limit", "gpu_request", "gpu_limit",
        "rdma_request", "rdma_limit", "memory_request", "memory_limit",
        "disk_request", "disk_limit", "max_instance_per_node",
        "creation_time", "scheduled_time", "deletion_time",
    ]
    to_numeric(df, num_cols)

    # Basic shape
    n_rows = len(df)
    n_cols = len(df.columns)
    n_apps = df["app_name"].nunique()

    # Role counts
    role_counts = df["role"].value_counts(dropna=False)
    cn = int(role_counts.get("CN", 0))
    hn = int(role_counts.get("HN", 0))

    # Resource summaries
    overall_stats = describe_resources(df)
    by_role_stats = df.groupby("role", dropna=False).apply(describe_resources)

    # Correlation matrix
    corr_cols = ["cpu_request", "gpu_request", "rdma_request", "memory_request", "disk_request"]
    corr = df[corr_cols].corr()

    # Density distribution
    ipn_counts = df["max_instance_per_node"].value_counts(dropna=False).sort_index()
    ipn_mode = int(df["max_instance_per_node"].mode().iloc[0]) if not df["max_instance_per_node"].mode().empty else None
    ipn_no_limit_share = float((df["max_instance_per_node"] == -1).mean())

    # Timing features
    df["schedule_delay"] = np.where(
        df["creation_time"].notna() & df["scheduled_time"].notna(),
        df["scheduled_time"] - df["creation_time"],
        np.nan,
    )
    df["runtime"] = np.where(
        df["scheduled_time"].notna() & df["deletion_time"].notna(),
        df["deletion_time"] - df["scheduled_time"],
        np.nan,
    )

    sched_p50 = safe_percentile(df["schedule_delay"], 50)
    sched_p90 = safe_percentile(df["schedule_delay"], 90)
    sched_p95 = safe_percentile(df["schedule_delay"], 95)
    run_p50 = safe_percentile(df["runtime"], 50)
    run_p90 = safe_percentile(df["runtime"], 90)
    run_p95 = safe_percentile(df["runtime"], 95)

    preexisting = int(df["creation_time"].isna().sum())
    prescheduled = int(df["scheduled_time"].isna().sum())
    posttrace = int(df["deletion_time"].isna().sum())

    # Concurrency and hourly arrivals or departures
    all_times = pd.concat([df["creation_time"], df["scheduled_time"], df["deletion_time"]], ignore_index=True).dropna()
    tmin = float(all_times.min()) if len(all_times) else 0.0
    tmax = float(all_times.max()) if len(all_times) else 0.0

    # A NaN scheduled_time means the instance was already running at trace start,
    # so count it as active from tmin. A NaN deletion_time means the instance
    # outlived the trace, so it never contributes a departure.
    starts = df["scheduled_time"].fillna(tmin).to_numpy()
    ends = df["deletion_time"].dropna().to_numpy()
    conc_series = sweep_concurrency(starts, ends, tmin, tmax, hour_bin=1.0)

    arrivals_hourly = hourly_counts(df["scheduled_time"], tmin, tmax)
    departures_hourly = hourly_counts(df["deletion_time"], tmin, tmax)

    # Per app aggregates and time stats
    per_app = df.groupby("app_name").agg(
        instances=("instance_sn", "nunique"),
        role_cn=("role", lambda s: int((s == "CN").sum())),
        role_hn=("role", lambda s: int((s == "HN").sum())),
        cpu_req_sum=("cpu_request", "sum"),
        gpu_req_sum=("gpu_request", "sum"),
        rdma_req_sum=("rdma_request", "sum"),
        mem_req_sum=("memory_request", "sum"),
        disk_req_sum=("disk_request", "sum"),
        max_ipn_mode=("max_instance_per_node", lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan),
        sched_delay_p50=("schedule_delay", lambda s: safe_percentile(s, 50)),
        runtime_p50=("runtime", lambda s: safe_percentile(s, 50)),
    ).reset_index()

    # --------------------------- save CSV tables ---------------------------

    overall_stats.round(3).to_csv(OUTDIR / "overall_resource_stats.csv")
    by_role_stats.round(3).to_csv(OUTDIR / "per_role_resource_stats.csv")
    corr.round(3).to_csv(OUTDIR / "correlation_matrix.csv")
    per_app.round(3).to_csv(OUTDIR / "per_app_aggregates.csv", index=False)
    ipn_counts.to_csv(OUTDIR / "max_instance_per_node_counts.csv", header=["count"])

    time_df = pd.DataFrame({
        "schedule_delay_p50_s": [sched_p50],
        "schedule_delay_p90_s": [sched_p90],
        "schedule_delay_p95_s": [sched_p95],
        "runtime_p50_s": [run_p50],
        "runtime_p90_s": [run_p90],
        "runtime_p95_s": [run_p95],
        "preexisting_creation_na": [preexisting],
        "prescheduled_na": [prescheduled],
        "posttrace_deletion_na": [posttrace],
        "tmin_s": [tmin],
        "tmax_s": [tmax],
    })
    time_df.to_csv(OUTDIR / "time_summaries.csv", index=False)

    if not conc_series.empty:
        conc_series.to_csv(OUTDIR / "concurrency_over_time.csv", header=True)
    if not arrivals_hourly.empty:
        arrivals_hourly.to_csv(OUTDIR / "arrivals_per_hour.csv", header=["arrivals"])
    if not departures_hourly.empty:
        departures_hourly.to_csv(OUTDIR / "departures_per_hour.csv", header=["departures"])

    # --------------------------- save figures ---------------------------

    save_hist(df["cpu_request"], "Distribution of CPU requests", "CPU request vCPUs", FIGDIR / "cpu_request_hist.png", bins=40)
    save_hist(df["gpu_request"], "Distribution of GPU requests", "GPU count", FIGDIR / "gpu_request_hist.png", bins=5)
    save_hist(df["memory_request"], "Distribution of memory requests", "Memory GiB", FIGDIR / "memory_request_hist.png", bins=40)
    save_hist(df["rdma_request"], "Distribution of RDMA requests", "RDMA percent", FIGDIR / "rdma_request_hist.png", bins=40)
    if "disk_request" in df.columns:
        save_hist(df["disk_request"], "Distribution of disk requests", "Disk GiB", FIGDIR / "disk_request_hist.png", bins=40)

    save_hist(df["schedule_delay"], "Distribution of schedule delays", "Delay seconds", FIGDIR / "schedule_delay_hist.png", bins=60, logx=True, logy=True)
    save_hist(df["runtime"], "Distribution of runtimes", "Runtime seconds", FIGDIR / "runtime_hist.png", bins=60, logx=True, logy=True)

    save_box_by_role(df, "cpu_request", FIGDIR / "cpu_by_role_box.png")
    save_box_by_role(df, "memory_request", FIGDIR / "memory_by_role_box.png")

    save_heatmap(corr, "Resource correlation", FIGDIR / "correlation_heatmap.png")

    if not conc_series.empty:
        plt.figure()
        conc_series.plot()
        plt.xlabel("Trace time seconds")
        plt.ylabel("Concurrent active instances")
        plt.title("Active instance concurrency over time")
        plt.tight_layout()
        plt.savefig(FIGDIR / "concurrency_over_time.png", dpi=160)
        plt.close()

    if not arrivals_hourly.empty or not departures_hourly.empty:
        plt.figure()
        if not arrivals_hourly.empty:
            arrivals_hourly.plot(label="arrivals")
        if not departures_hourly.empty:
            departures_hourly.plot(label="departures")
        plt.xlabel("Trace time seconds")
        plt.ylabel("Count per hour")
        plt.title("Arrivals and departures per hour")
        plt.legend()
        plt.tight_layout()
        plt.savefig(FIGDIR / "arrivals_and_departures_over_time.png", dpi=160)
        plt.close()

    scatter_cpu_mem(df, FIGDIR / "cpu_vs_memory_scatter.png")
    rdma_hist_by_role(df, FIGDIR / "rdma_distribution_by_role.png")

    # --------------------------- console summary ---------------------------

    gpu_share = float((df["gpu_request"] > 0).mean()) if "gpu_request" in df.columns else 0.0

    cn_cpu_med = float(df.loc[df["role"] == "CN", "cpu_request"].median())
    cn_mem_med = float(df.loc[df["role"] == "CN", "memory_request"].median())
    cn_rdma_med = float(df.loc[df["role"] == "CN", "rdma_request"].median())

    hn_cpu_med = float(df.loc[df["role"] == "HN", "cpu_request"].median())
    hn_mem_med = float(df.loc[df["role"] == "HN", "memory_request"].median())
    hn_gpu_med = float(df.loc[df["role"] == "HN", "gpu_request"].median())
    hn_rdma_med = float(df.loc[df["role"] == "HN", "rdma_request"].median())

    corr_cpu_mem = float(corr.loc["cpu_request", "memory_request"]) if {"cpu_request", "memory_request"} <= set(corr.index) else float("nan")
    corr_gpu_rdma = float(corr.loc["gpu_request", "rdma_request"]) if {"gpu_request", "rdma_request"} <= set(corr.index) else float("nan")

    print("\n========== Disaggregated DLRM trace summary ==========")
    print(f"Rows {n_rows:,}   Columns {n_cols:,}   Apps {n_apps:,}")
    print(f"Role split, CN {cn:,}   HN {hn:,}   GPU backed share {gpu_share:.2%}")
    print(f"CN median, CPU {cn_cpu_med:.0f} vCPUs   Memory {cn_mem_med:.0f} GiB   RDMA {cn_rdma_med:.0f} percent")
    print(f"HN median, CPU {hn_cpu_med:.0f} vCPUs   Memory {hn_mem_med:.0f} GiB   GPU {hn_gpu_med:.0f}   RDMA {hn_rdma_med:.0f} percent")
    print(f"Corr CPU to memory {corr_cpu_mem:.3f}   Corr GPU to RDMA {corr_gpu_rdma:.3f}")
    print(f"Mode of max_instance_per_node {ipn_mode}   Share with no limit {ipn_no_limit_share:.2%}")
    print(f"Schedule delay p50 {seconds_to_hms(sched_p50)}   p90 {seconds_to_hms(sched_p90)}   p95 {seconds_to_hms(sched_p95)}")
    print(f"Runtime p50 {seconds_to_hms(run_p50)}   p90 {seconds_to_hms(run_p90)}   p95 {seconds_to_hms(run_p95)}")
    print(f"Missing counts, creation NaN {preexisting:,}   scheduled NaN {prescheduled:,}   deletion NaN {posttrace:,}")
    if not conc_series.empty:
        print(f"Concurrency samples {len(conc_series):,} from tmin {tmin:.0f} to tmax {tmax:.0f} seconds")
    print(f"Outputs are in {OUTDIR.resolve()}")
    print("======================================================\n")

    # --------------------------- HTML report ---------------------------

    figs = sorted(FIGDIR.glob("*.png"))
    table_names = [
        "overall_resource_stats.csv",
        "per_role_resource_stats.csv",
        "correlation_matrix.csv",
        "per_app_aggregates.csv",
        "max_instance_per_node_counts.csv",
        "time_summaries.csv",
        "concurrency_over_time.csv",
        "arrivals_per_hour.csv",
        "departures_per_hour.csv",
    ]
    # Keep only the tables that were actually written, ordered by file name.
    tables = sorted(
        (OUTDIR / name for name in table_names if (OUTDIR / name).exists()),
        key=lambda p: p.name,
    )

    context = {
        "n_rows": n_rows,
        "n_cols": n_cols,
        "n_apps": n_apps,
        "cn": cn,
        "hn": hn,
        "gpu_share": gpu_share,
        "cn_cpu_med": cn_cpu_med,
        "cn_mem_med": cn_mem_med,
        "cn_rdma_med": cn_rdma_med,
        "hn_cpu_med": hn_cpu_med,
        "hn_mem_med": hn_mem_med,
        "hn_gpu_med": hn_gpu_med,
        "hn_rdma_med": hn_rdma_med,
        "corr_cpu_mem": corr_cpu_mem,
        "corr_gpu_rdma": corr_gpu_rdma,
        "ipn_mode": ipn_mode if ipn_mode is not None else float("nan"),
        "ipn_no_limit": ipn_no_limit_share,
        "sched_p50": seconds_to_hms(sched_p50),
        "sched_p90": seconds_to_hms(sched_p90),
        "sched_p95": seconds_to_hms(sched_p95),
        "run_p50": seconds_to_hms(run_p50),
        "run_p90": seconds_to_hms(run_p90),
        "run_p95": seconds_to_hms(run_p95),
        "preexist": preexisting,
        "prescheduled": prescheduled,
        "posttrace": posttrace,
        "figs": figs,
        "tables": tables,
    }
    write_report_html(context, OUTDIR / "report.html")


if __name__ == "__main__":
    main()


Rows 23,871   Columns 17   Apps 156
Role split, CN 16,485   HN 7,386   GPU backed share 30.94%
CN median, CPU 64 vCPUs   Memory 320 GiB   RDMA 1 percent
HN median, CPU 8 vCPUs   Memory 40 GiB   GPU 1   RDMA 25 percent
Corr CPU to memory 0.987   Corr GPU to RDMA 0.173
Mode of max_instance_per_node -1   Share with no limit 51.31%
Schedule delay p50 0s   p90 59s   p95 2m 16s
Runtime p50 2h 17m 58s   p90 3d 7h 39m 47s   p95 7d 1h 50m 50s
Missing counts, creation NaN 7,280   scheduled NaN 7,280   deletion NaN 8,878
Concurrency samples 745 from tmin 75 to tmax 2677541 seconds
Outputs are in /Users/sultanulovi/Downloads/clusterdata-master/cluster-trace-gpu-v2025/dlrm_analysis_out
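
Several figures in the summary above can be cross-checked against each other with plain arithmetic. The sketch below copies the printed values and recomputes the role split, the HN share, and the number of hourly bins using the same `ceil(span / 3600) + 1` rule as `sweep_concurrency`. Reading the "GPU backed share" as the HN share assumes that every HN instance and no CN instance requests a GPU, which the summary itself does not state.

```python
import math

# Values copied from the printed summary.
rows, cn, hn = 23_871, 16_485, 7_386
tmin_s, tmax_s = 75, 2_677_541

# The two roles should partition the instances.
assert cn + hn == rows

# Share of HN instances; equals the GPU-backed share only under the
# assumption that exactly the HN instances request GPUs.
hn_share = hn / rows
print(f"HN share of instances: {hn_share:.2%}")

# Hourly bin count, following the rule used in sweep_concurrency.
n_bins = math.ceil((tmax_s - tmin_s) / 3600.0) + 1
print(f"Hourly bins: {n_bins}")

# Span between the earliest and latest recorded timestamps.
span_days = (tmax_s - tmin_s) / 86_400
print(f"Observed span: {span_days:.1f} days")
```

The recomputed values agree with the reported 30.94% share and 745 concurrency samples, and the observed timestamps span roughly 31 days.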

# END