# Static Evaluation on Collected Traces

* Usage: Click `Runtime -> Run All`.
* Hardware requirement: None; a CPU-only machine is sufficient.
* Software requirement: Python 3 with `numpy`, `pandas`, and `matplotlib`.
* Disk space requirement: ~1.5GB for the traces downloaded from the Internet.

Expected results:
Static evaluation on the collected traces is expected to closely match the figures and tables
presented in the paper, except for some statistics in the first five rows of Table 2 and in
Sec. 4.5. We mistakenly deleted the original traces for these results while freeing disk space.
Because these results capture the finest runtime dynamics of the GPU, exact
reproduction is impossible, and our later experiments can only reproduce similar results.
We apologize for the inconvenience and will update the revised paper
to include the latest results.

Notes:
Once the traces are downloaded, each of the following sections can be executed independently.

## 0. Download Traces

Our collected traces, with unnecessary parts stripped out, take up ~1.5GB.

They are hosted on Cloudflare R2 at the following anonymous link:

In [None]:
!wget https://pub-eef24bf0aa5b4950860ea28dfbe39d8c.r2.dev/trace.zip
!unzip trace.zip

## 1. Block Scheduling Cost

* Correspondence to paper: Sec. 4.5
* Notes: The optimization example is not included in the static evaluation; please see `dynamic.ipynb`.

#### Explanation:
As briefly discussed in Sec. 4.5, the block scheduling cost is the cycles wasted on clearing the context of a finished block and scheduling a new block onto the SM.

Neutrino measures the block scheduling cost by tracing `%smid` and the block start/end times.
Then, for blocks on the same SM, we take the difference between *the starting time of the next block* and *the finishing time of the previous block* as the block scheduling cost.

We report the block scheduling cost in `cycles`; it is best interpreted as a proportion of the execution time.
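The calculation described above can be sketched in a few lines of plain Python. This is only an illustration, not the logic of `block_sched/block_sched.py`: the function name `block_sched_cost`, the `(smid, start, end)` record layout, and the toy numbers are all assumptions, and the sketch further assumes that the "proportion" is the summed gaps divided by the summed block execution cycles.

```python
from collections import defaultdict


def block_sched_cost(records):
    """Estimate block scheduling cost from (smid, start, end) records in cycles.

    Hypothetical helper: returns (gap_cycles, exec_cycles, gap_fraction).
    """
    by_sm = defaultdict(list)
    for smid, start, end in records:
        by_sm[smid].append((start, end))

    total_gap = 0
    total_exec = 0
    for blocks in by_sm.values():
        blocks.sort()  # order the blocks of one SM by start time
        total_exec += sum(end - start for start, end in blocks)
        for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:]):
            # Gap between the previous block finishing and the next one starting.
            # Clamped at zero because co-resident blocks on one SM may overlap
            # (an assumption of this sketch).
            total_gap += max(0, next_start - prev_end)

    fraction = total_gap / total_exec if total_exec else 0.0
    return total_gap, total_exec, fraction


# Made-up records for two SMs, purely for illustration.
records = [
    (0, 0, 100), (0, 110, 200), (0, 215, 300),
    (1, 0, 150), (1, 160, 300),
]
gap, exec_cycles, fraction = block_sched_cost(records)
print(f"gap={gap} cycles, exec={exec_cycles} cycles, share={fraction:.2%}")
```

Grouping by SM before sorting matters: gaps are only meaningful between consecutive blocks that ran on the same SM.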

In [None]:
import sys

!{sys.executable} block_sched/block_sched.py block_sched/Jan_23_06_27_37_268336/result/0.486795.bin

## 2. Densified Memory Access Timeline (DMAT) Plot

* Correspondence to paper: Fig. 1 and Fig. 10(A/B/C), 4 figures in total
* Here we use the Jupyter cell magic `%%capture`, which captures the `stdout` of a cell; we use it to read the path of the generated DMAT plot for display.
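Outside a notebook, the same capture-then-display pattern can be approximated with `subprocess`. The sketch below is an assumption-laden illustration: `run_and_get_path` is a hypothetical helper, the child process is a stand-in that prints a made-up path, and it presumes the script writes nothing but the output path to `stdout`.

```python
import subprocess
import sys


def run_and_get_path(argv):
    """Run a command and return its stdout with surrounding whitespace removed."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Stand-in for a plotting script: prints a hypothetical image path and exits.
path = run_and_get_path([sys.executable, "-c", "print('plots/example.png')"])
print(repr(path))
```

Stripping the captured text removes the trailing newline that `print` adds, which would otherwise make the string an invalid file path.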

In [None]:
import sys
from IPython.display import Image  # Image is exported by IPython.display

In [None]:
%%capture cap
!{sys.executable} dmat/dmat.py dmat/fig1.dmat

In [None]:
# Output corresponding to Fig. 1
Image(cap.stdout)

In [None]:
%%capture cap
!{sys.executable} dmat/dmat.py dmat/fig10a.dmat

In [None]:
# Output corresponding to Fig. 10a
Image(cap.stdout)

In [None]:
%%capture cap
!{sys.executable} dmat/dmat.py dmat/fig10b.dmat

In [None]:
# Output corresponding to Fig. 10b
Image(cap.stdout)

In [None]:
%%capture cap
!{sys.executable} dmat/dmat.py dmat/fig10c.dmat

In [None]:
# Output corresponding to Fig. 10c
Image(cap.stdout)

## 3. Kernel Slowdown and Additional Registers

* Correspondence to paper: Table 2
* Notes: The CUTLASS and Triton parts were collected later and are _slightly_ different from the results in the original paper. We will use the new results in the revised paper.

In [None]:
import sys

In [None]:
# CUTLASS Part of Results
!{sys.executable} kernel_overhead/cutlass_op/overhead.py

In [None]:
# Triton Part of Results
!{sys.executable} kernel_overhead/triton_op/overhead.py

In [None]:
# PyTorch Part of Results
!{sys.executable} kernel_overhead/pytorch_op/overhead.py

## 4. Maximum Memory Usage

* Correspondence to paper: Fig. 11

In [None]:
import os

BASE = "max_mem"  # directory layout read below: max_mem/<model>/<test>/<sub>/

for model in os.listdir(BASE):
    for test in os.listdir(os.path.join(BASE, model)):
        max_mem = 0   # largest value found in any "[exec] probe-mem" event
        original = 0  # largest "max memory:" value found in any stdout.log
        for sub in os.listdir(os.path.join(BASE, model, test)):
            path = os.path.join(BASE, model, test, sub)
            with open(os.path.join(path, "stdout.log"), "r") as f:
                stdout = f.read().split("\n")
            # Only the first "max memory:" line of each stdout.log is used
            for line in stdout:
                if line.startswith("max memory:"):
                    original = max(original, int(line.strip().split(" ")[-1]))
                    break
            # Every subdirectory of `path` is treated as one trace
            traces = [trace for trace in os.listdir(path) if os.path.isdir(os.path.join(path, trace))]
            for trace in traces:
                # print(os.path.join(path, trace, "event.log"))
                with open(os.path.join(path, trace, "event.log"), "r", encoding="utf-8", errors="ignore") as f:
                    event_log = f.read().split("\n")
                # The third space-separated field of a probe-mem event is parsed as an integer
                for line in event_log:
                    if line.startswith("[exec] probe-mem"):
                        max_mem = max(max_mem, int(line.split(" ")[2]))
        print(model, test, original, max_mem)

## 5. Exposed Latency

* Correspondence to paper: Fig. 12

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

matplotlib.rc('font', **{'size' : 12})

def plot(name: str, data: pd.DataFrame):
    """Plot the `ncu` and `neutrino` columns normalized by `original`, one bar group per probe."""

    labels = data["probe"]
    # Ratio of each tool's measurement to the `original` column
    ncu_ratios = data["ncu"] / data["original"]
    neutrino_ratios = data["neutrino"] / data["original"]

    # Share of the `neutrino` measurement attributed to each phase
    epilogue = data["epilogue"] / data["neutrino"]
    kernel   = data["kernel"] / data["neutrino"]
    prologue = data["prologue"] / data["neutrino"]

    # Hard-coded data for conv; not used by the plot below
    proportions = [
      [0.008547190046981101, 0.772077464754433, 0.21937534519858598],
      [0.015973571121974683, 0.21929421667871976, 0.7647322121993056],
      [0.01659435009449902, 0.22045060782593218, 0.7629550420795688]
    ]

    proportions = np.array(proportions).T
    # Cumulative heights of the stacked bar: epilogue, then + kernel, then + prologue
    bottoms = np.array(epilogue * neutrino_ratios)
    middles = np.array(kernel * neutrino_ratios) + bottoms
    aboves  = np.array(prologue * neutrino_ratios) + middles

    x = np.arange(len(labels))  # the label locations
    width = 0.31  # the width of the bars

    fig, ax = plt.subplots(figsize=(6,4))
    bars1 = ax.bar(x - width/2, ncu_ratios, width, label='Nsight Compute', color='#b31805')

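    # Stacked Neutrino bar: the full-height bar is drawn first, then the epilogue
    # bar is drawn over its lower part. The prologue segment is disabled, so the
    # bar labelled "Neutrino (Kernel)" also covers the prologue share.
    # bars2_bottom = ax.bar(x + width/2, neutrino_aboves, width, label='Neutrino (Prologue)', color='#00bb00')
    bars2_middle = ax.bar(x + width/2, aboves, width, label="Neutrino (Kernel)", color='#812be1')
    bars2_bottom = ax.bar(x + width/2, bottoms, width,  label='Neutrino (Epilogue)', color='#0053d6')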

    ax.set_xticks(x)
    ax.set_xticklabels(labels)

    ncu_labels = [f"{a:.2f}x" for a in ncu_ratios.tolist()]
    neutrino_labels = [f"{a:.2f}x" for a in neutrino_ratios.tolist()]
    ax.bar_label(ax.containers[0], labels=ncu_labels, label_type='edge')
    ax.margins(y=0.1)
    ax.bar_label(ax.containers[1], labels=neutrino_labels, label_type='edge')
    ax.margins(y=0.1)
    ax.set_label(name)

    ax.legend()

    # Show the plot
    plt.show()
    # plt.savefig(f"{data["name", 0]}.svg", format="svg")

In [None]:
plot("attn", pd.read_csv("exposed_latency/attn.csv"))

In [None]:
plot("gmm", pd.read_csv("exposed_latency/gmm.csv"))

In [None]:
plot("conv", pd.read_csv("exposed_latency/conv.csv"))

In [None]:
plot("gemm", pd.read_csv("exposed_latency/gemm.csv"))

## 6. Warp Scheduling and Tailing Effect

* Correspondence to paper: Fig. 13 of Sec. 7

In [None]:
import sys
from IPython.display import Image # For display

In [None]:
# Fig13a
!{sys.executable} warp_sched/fig13a.py warp_sched/exclusive_sched/result/2.756086.bin
Image("warp_sched/fig13a.png")

In [None]:
# Fig13b
!{sys.executable} warp_sched/fig13b.py warp_sched/shared_sched/result/3.201079.bin
Image("warp_sched/fig13b.png")

In [None]:
# Fig13cd
!{sys.executable} warp_sched/fig13cd.py warp_sched/tmp.pkl

In [None]:
Image("warp_sched/fig13c.png")

In [None]:
Image("warp_sched/fig13d.png")

In [None]:
Image("warp_sched/fig13d-sub.png")