# HAT Substrate Comparison - Neutral vs Emotional

This notebook loads and explores the data collected by `substrate_collector.py`
during two LLM inference runs (one per prompt condition) on a CloudLab bare-metal node.

**Data sources per run:**

| File | Resolution | Content |
|------|------------|---------|
| `perf_stat.txt` | 1 ms buckets | 16 system-wide perf events (IRQs, softirqs, TLB, power) |
| `proc_sample.csv` | 200 ms | Per-process CPU time + RSS for the llama.cpp container PID |
| `proc_system_sample.csv` | 200 ms | /proc/interrupts, /proc/softirqs, PSI, net, disk, freq |
| `responses.jsonl` | per request | LLM request timing + llama.cpp internal metrics |
| `collector_meta.json` | once | Collector config, t0/t1 timestamps |

**Timing model:** Each run records a fixed **44 s** window
(2 s baseline + 40 s prompt budget + 2 s tail) so the time axes are directly comparable.

In [None]:
import re, json
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['figure.dpi'] = 120
plt.rcParams['figure.figsize'] = (12, 4)

## 1 - Configuration

Point `BASE_DIR` at the folder that contains your run directories.
Each run directory name looks like `2026-02-09T22-18-50_emotional`.

In [None]:
BASE_DIR = Path.home() / 'Desktop' / 'mccviahat' / 'mccviahat_runs'

NEUTRAL_RUN   = '2026-02-09T22-28-54_neutral'
EMOTIONAL_RUN = '2026-02-09T22-18-50_emotional'

n_dir = BASE_DIR / NEUTRAL_RUN
e_dir = BASE_DIR / EMOTIONAL_RUN

for d in [n_dir, e_dir]:
    assert d.exists(), f'Missing: {d}'
print('Both run directories found')

## 2 - Parsers

### perf stat parser

Each line in `perf_stat.txt` follows this format (the `#` comment is always present):
```
<timestamp>  <value>            <event_name>             # <rate> /sec
<timestamp>  <value> <unit>     <event_name>             # <rate> <desc>
```

The event name is always the **last token before the `#`**.  This makes parsing robust
regardless of whether a unit (`msec`, `Joules`, `C`) is present.
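
As a quick illustration of the rule above, the sketch below tokenizes three made-up lines (they are not taken from the recorded runs, and `split_perf_line` is a hypothetical helper, not part of the collector). It shows that the last token before `#` is the event name whether or not a unit is present, and that `str.split('#', 1)` simply returns the whole line when no `#` is found.

```python
def split_perf_line(line):
    """Return (t_s, raw_value, event) for one interval line, or None if unusable."""
    tokens = line.split('#', 1)[0].split()
    if len(tokens) < 3:
        return None
    try:
        t_s = float(tokens[0])
    except ValueError:
        return None
    # tokens[1] is the raw count (may contain ',' or start with '<'),
    # tokens[-1] is the event name even when a unit sits in between.
    return t_s, tokens[1], tokens[-1]


# Illustrative lines only; real output may differ in spacing and comments.
samples = [
    '1.001000000   1,234          irq:irq_handler_entry   #  1.2 K/sec',
    '1.001000000   0.53 Joules    power/energy-pkg/       #  example comment',
    '1.001000000   <not counted>  tlb:tlb_flush',
]

for s in samples:
    print(split_perf_line(s))
```

Lines whose value starts with `<` (for example `<not counted>`) still yield the correct event name; the full parser below maps their value to NaN.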

In [None]:
def parse_perf_stat(path: Path) -> pd.DataFrame:
    """Parse perf stat -I output into tidy (t_s, event, value) rows.
    
    Strategy: split each line on '#', take the left side,
    the event name is the LAST token before '#'.
    """
    rows = []
    for line in path.read_text(encoding='utf-8', errors='replace').splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        parts = line.split('#', 1)
        left = parts[0].strip()
        tokens = left.split()
        if len(tokens) < 3:
            continue
        
        t_s_str = tokens[0]
        event   = tokens[-1]       # last token before '#'
        raw_val = tokens[1]        # second token is the count/value
        
        try:
            t_s = float(t_s_str)
        except ValueError:
            continue
        
        raw_val = raw_val.replace(',', '')
        if raw_val.startswith('<'):
            val = np.nan
        else:
            try:
                val = float(raw_val)
            except ValueError:
                val = np.nan
        
        rows.append((t_s, event, val))
    
    return pd.DataFrame(rows, columns=['t_s', 'event', 'value'])


def perf_to_wide(tidy: pd.DataFrame) -> pd.DataFrame:
    """Pivot tidy perf -> wide: one row per timestamp, one column per event."""
    w = tidy.pivot_table(index='t_s', columns='event', values='value', aggfunc='first')
    return w.sort_index().reset_index()

### Response and /proc loaders

In [None]:
def load_json(path: Path) -> dict:
    return json.loads(path.read_text(encoding='utf-8'))


def load_responses(run_dir: Path) -> pd.DataFrame:
    """Load responses.jsonl and extract llama.cpp timings."""
    rows = []
    for line in (run_dir / 'responses.jsonl').read_text(encoding='utf-8').splitlines():
        if not line.strip():
            continue
        r = json.loads(line)
        try:
            resp = json.loads(r['response_raw'])
            timings = resp.get('timings', {})
            r['prompt_ms']         = timings.get('prompt_ms')
            r['predicted_ms']      = timings.get('predicted_ms')
            r['tokens_evaluated']  = resp.get('tokens_evaluated')
            r['tokens_predicted']  = resp.get('tokens_predicted')
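        except (KeyError, TypeError, ValueError, AttributeError):
            # Missing or malformed response_raw: keep the row, leave timing fields unset.
            pass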
        rows.append(r)
    df = pd.DataFrame(rows)
    keep = [c for c in ['id', 'title', 'ok', 't_request_start_ns', 't_request_end_ns',
                         'prompt_ms', 'predicted_ms', 'tokens_evaluated', 'tokens_predicted']
            if c in df.columns]
    df = df[keep].sort_values('t_request_start_ns').reset_index(drop=True)
    df['duration_s'] = (df['t_request_end_ns'] - df['t_request_start_ns']) / 1e9
    return df


def load_proc_system(run_dir: Path) -> pd.DataFrame:
    """Load proc_system_sample.csv and add relative time."""
    df = pd.read_csv(run_dir / 'proc_system_sample.csv')
    df['t_s'] = (df['timestamp_ns'] - df['timestamp_ns'].iloc[0]) / 1e9
    return df


def load_proc_sample(run_dir: Path) -> pd.DataFrame:
    """Load proc_sample.csv (per-process stats) and add relative time."""
    df = pd.read_csv(run_dir / 'proc_sample.csv')
    df['t_s'] = (df['timestamp_ns'] - df['timestamp_ns'].iloc[0]) / 1e9
    return df

## 3 - Load all data

In [None]:
n_cmeta = load_json(n_dir / 'collector_meta.json')
e_cmeta = load_json(e_dir / 'collector_meta.json')

n_resp = load_responses(n_dir)
e_resp = load_responses(e_dir)

n_perf_tidy = parse_perf_stat(n_dir / 'perf_stat.txt')
e_perf_tidy = parse_perf_stat(e_dir / 'perf_stat.txt')
n_perf = perf_to_wide(n_perf_tidy)
e_perf = perf_to_wide(e_perf_tidy)

n_sys = load_proc_system(n_dir)
e_sys = load_proc_system(e_dir)

n_proc = load_proc_sample(n_dir)
e_proc = load_proc_sample(e_dir)

# Sanity report
perf_events = sorted(n_perf_tidy['event'].unique())
print(f'Perf events found ({len(perf_events)}): {perf_events}')
print(f'Perf rows   - neutral: {len(n_perf):,}  emotional: {len(e_perf):,}')
print(f'Proc system - neutral: {len(n_sys)}  emotional: {len(e_sys)}')
print(f'Responses   - neutral: {len(n_resp)}/5  emotional: {len(e_resp)}/5')
if len(e_resp) < 5:
    print(f'  -> Emotional has {len(e_resp)}/5 (prompt budget ran out - expected for long prompts)')

## 4 - Request timing comparison

Emotional prompts are **longer** (more tokens to evaluate), so each request takes longer.
With `n_predict=40`, generation time (`predicted_ms`) is similar across conditions, but prompt
processing (`prompt_ms`) dominates for emotional prompts (~6 s vs ~0.15 s).

In [None]:
show_cols = ['id', 'title', 'duration_s', 'tokens_evaluated', 'tokens_predicted', 'prompt_ms', 'predicted_ms']
print('=== Neutral ===')
display(n_resp[show_cols])
print(f'\n=== Emotional ({len(e_resp)}/5 completed) ===')
display(e_resp[show_cols])

## 5 - Perf event summary

Mean event counts per 1 ms bucket, comparing the two conditions.

In [None]:
# Parenthesized explicitly: '-' binds tighter than '&' in Python.
common_events = sorted((set(n_perf.columns) & set(e_perf.columns)) - {'t_s'})
print(f'Events in both runs ({len(common_events)}):\n')
print(f'{"Event":35s}  {"Neutral mean":>14s}  {"Emotional mean":>14s}')
print('-' * 68)
for e in common_events:
    print(f'{e:35s}  {n_perf[e].mean():14.2f}  {e_perf[e].mean():14.2f}')

## 6 - Time-series plots

Smoothed time series of key perf events: a rolling mean over 500 samples, which is about 500 ms at 1 ms buckets.
Shaded bands mark request windows; dotted line marks the end of the 2 s baseline.
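
The smoothing in this notebook uses a row-count window, so it matches a wall-clock window only when the perf buckets are evenly spaced with no gaps. If intervals are ever missing, a time-based window is one alternative. The sketch below is a minimal version of that idea; `rolling_mean_by_time` is a hypothetical helper, and the sample values are synthetic.

```python
import pandas as pd


def rolling_mean_by_time(t_s, values, window='500ms'):
    """Rolling mean over a wall-clock window instead of a fixed number of rows."""
    # A TimedeltaIndex lets pandas interpret the window as a duration.
    s = pd.Series(list(values), index=pd.to_timedelta(list(t_s), unit='s'), dtype=float)
    return s.rolling(window, min_periods=1).mean().to_numpy()


# Synthetic example with a ~1 s gap before the last sample.
t_s = [0.000, 0.001, 0.002, 1.000]
vals = [1.0, 2.0, 3.0, 10.0]

by_time = rolling_mean_by_time(t_s, vals)
by_rows = pd.Series(vals).rolling(500, min_periods=1).mean().to_numpy()
print(by_time[-1], by_rows[-1])
```

In the synthetic data the last sample is averaged with three stale values under the row-count window, while the time-based window sees it on its own. The index must be sorted by time for the time-based window to work.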

In [None]:
def request_windows(resp, t0_ns):
    return [(float(r['t_request_start_ns'] - t0_ns) / 1e9,
             float(r['t_request_end_ns']   - t0_ns) / 1e9)
            for _, r in resp.iterrows()]

n_wins = request_windows(n_resp, n_cmeta['t0_ns'])
e_wins = request_windows(e_resp, e_cmeta['t0_ns'])


def plot_event(evt, smooth_ms=500):
    if evt not in n_perf.columns or evt not in e_perf.columns:
        print(f'  {evt} not in both runs - skipping.')
        return
    fig, axes = plt.subplots(1, 2, figsize=(14, 3.5), sharey=True)
    for ax, label, df, wins, color in [
        (axes[0], 'Neutral',   n_perf, n_wins, 'steelblue'),
        (axes[1], 'Emotional', e_perf, e_wins, 'firebrick'),
    ]:
        # rolling() counts rows, not milliseconds; this equals smooth_ms only for gap-free 1 ms buckets
        smoothed = df[evt].rolling(smooth_ms, min_periods=1).mean()
        ax.plot(df['t_s'], smoothed, color=color, lw=0.8, alpha=0.9)
        for s, e in wins:
            ax.axvspan(s, e, alpha=0.10, color=color)
        ax.axvline(2.0, ls=':', color='gray', lw=0.5)
        ax.set_title(f'{label} - {evt}')
        ax.set_xlabel('time (s)')
    axes[0].set_ylabel(f'rolling {smooth_ms}ms mean')
    plt.tight_layout()
    plt.show()

In [None]:
for evt in ['irq:irq_handler_entry', 'irq:softirq_entry', 'tlb:tlb_flush',
            'context-switches', 'cpu-clock']:
    plot_event(evt)

In [None]:
for evt in ['power/energy-pkg/', 'power/energy-ram/', 'msr/cpu_thermal_margin/', 'core_power.throttle']:
    plot_event(evt, smooth_ms=1000)

## 7 - Per-process CPU usage (from /proc)

`proc_sample.csv` tracks the llama.cpp container's user + system CPU jiffies at 200 ms.
We diff consecutive samples to approximate CPU utilization per interval.
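
To read the raw deltas as utilization, note that `utime` and `stime` in `/proc/<pid>/stat` are counted in clock ticks, whose rate is given by `sysconf(_SC_CLK_TCK)` (commonly 100 per second on Linux). The sketch below converts a tick delta into "cores busy"; `cpu_cores_used` is a hypothetical helper and the numbers in the usage line are invented for illustration.

```python
import os


def cpu_cores_used(delta_ticks, dt_s, ticks_per_s=None):
    """Convert a (utime + stime) tick delta over dt_s seconds into cores in use.

    A result of 1.0 means one core was fully busy during the interval.
    """
    if ticks_per_s is None:
        # Unix-only lookup; pass ticks_per_s explicitly on other platforms.
        ticks_per_s = os.sysconf('SC_CLK_TCK')
    return delta_ticks / (ticks_per_s * dt_s)


# Invented example: 160 ticks accumulated in a 200 ms interval at 100 ticks/s.
print(cpu_cores_used(160, 0.2, ticks_per_s=100))
```

Dividing by the measured interval (rather than a nominal 0.2 s) keeps the estimate stable when the sampler drifts.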

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 3.5), sharey=True)
for ax, label, df, color in [
    (axes[0], 'Neutral',   n_proc, 'steelblue'),
    (axes[1], 'Emotional', e_proc, 'firebrick'),
]:
    cpu = (df['proc_utime_jiffies'] + df['proc_stime_jiffies']).diff()
    ax.plot(df['t_s'], cpu, color=color, lw=0.8)
    ax.axvline(2.0, ls=':', color='gray', lw=0.5)
    ax.set_title(f'{label} - container CPU delta (jiffies/200ms)')
    ax.set_xlabel('time (s)')
axes[0].set_ylabel('d(utime+stime) jiffies')
plt.tight_layout()
plt.show()

## 8 - /proc softirq rates

`proc_system_sample.csv` has cumulative `/proc/softirqs` counters.
Diffing consecutive samples and dividing by the elapsed time gives approximate rates
in events/sec, one value per ~200 ms interval.

**Known issue (Feb 9 data):** PSI columns have duplicate names (cpu/memory/io all
share `full_avg10`, etc.). Also, the collector had an `or`-chain bug that dropped
values equal to 0.  Both are now fixed for the next run.
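
The collector source is not shown in this notebook, so the following is only a guess at the shape of the two bugs. It illustrates how an `or`-chain silently discards a legitimate 0, and how prefixing PSI fields by resource avoids name collisions. All function names and the `psi_<resource>_<field>` naming scheme are hypothetical.

```python
def pick_buggy(sample, primary, fallback):
    # `or` treats 0 and 0.0 as "missing", so a real zero falls through.
    return sample.get(primary) or sample.get(fallback)


def pick_fixed(sample, primary, fallback):
    # Only fall back when the primary value is actually absent.
    value = sample.get(primary)
    return value if value is not None else sample.get(fallback)


def prefixed_psi(psi_by_resource):
    """Flatten {'cpu': {'full_avg10': 0.0}, ...} into unique column names."""
    return {
        f'psi_{resource}_{field}': value
        for resource, fields in psi_by_resource.items()
        for field, value in fields.items()
    }


sample = {'count': 0, 'count_alt': None}
print(pick_buggy(sample, 'count', 'count_alt'))   # zero is lost
print(pick_fixed(sample, 'count', 'count_alt'))   # zero is kept
print(prefixed_psi({'cpu': {'full_avg10': 0.0}, 'io': {'full_avg10': 1.5}}))
```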

In [None]:
softirq_names = ['HI', 'TIMER', 'NET_TX', 'NET_RX', 'BLOCK', 'TASKLET', 'SCHED', 'HRTIMER', 'RCU']
available_sirqs = [s for s in softirq_names if s in n_sys.columns]

if available_sirqs:
    fig, axes = plt.subplots(1, 2, figsize=(14, 4), sharey=True)
    for ax, label, df, color in [
        (axes[0], 'Neutral',   n_sys, 'steelblue'),
        (axes[1], 'Emotional', e_sys, 'firebrick'),
    ]:
        for sirq in available_sirqs:
            rate = df[sirq].diff() / df['t_s'].diff()
            ax.plot(df['t_s'], rate, lw=0.8, alpha=0.7, label=sirq)
        ax.set_title(f'{label} - softirq rates (/proc/softirqs)')
        ax.set_xlabel('time (s)')
        ax.legend(fontsize=7, ncol=2)
    axes[0].set_ylabel('events/sec')
    plt.tight_layout()
    plt.show()
else:
    print('No softirq columns found')

## 9 - Data quality diagnostic

In [None]:
print('=' * 60)
print('DATA QUALITY SUMMARY')
print('=' * 60)

# 1. Perf events
n_events = set(n_perf_tidy['event'].unique())
e_events = set(e_perf_tidy['event'].unique())
print(f'\nPerf events: {len(n_events)} neutral, {len(e_events)} emotional')
only_n = n_events - e_events
only_e = e_events - n_events
if only_n: print(f'  Only in neutral: {only_n}')
if only_e: print(f'  Only in emotional: {only_e}')

# 2. All-zero events
print('\nAll-zero perf events (recorded but never fired):')
any_zero = False
for evt in common_events:
    n_sum, e_sum = n_perf[evt].sum(), e_perf[evt].sum()
    if n_sum == 0 and e_sum == 0:
        print(f'  {evt} - zero in both')
        any_zero = True
    elif n_sum == 0 or e_sum == 0:
        print(f'  {evt} - zero in {"neutral" if n_sum == 0 else "emotional"} only')
        any_zero = True
if not any_zero:
    print('  (none)')

# 3. Responses
print(f'\nResponses: neutral={len(n_resp)}/5, emotional={len(e_resp)}/5')
if len(e_resp) < 5:
    print('  Emotional prompts are ~2x longer (~400 vs ~220 tokens).')
    print('  At ~11s per prompt, only 4 fit in the 40s budget. This is expected.')
    print('  Consider: increase budget_s to 60s, or reduce n_predict.')

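# 4. Duplicate column names in proc_system_sample (PSI bug)
# pd.read_csv renames repeated headers to 'name.1', 'name.2', ..., so
# columns.duplicated() alone never fires; also look for those suffixes.
dup_cols = n_sys.columns[n_sys.columns.duplicated()].tolist()
dup_cols += [c for c in n_sys.columns
             if re.fullmatch(r'.+\.\d+', c) and c.rsplit('.', 1)[0] in n_sys.columns]
if dup_cols:
    print(f'\nDuplicate columns in proc_system_sample: {set(dup_cols)}')
    print('  -> PSI columns for cpu/memory/io share names. Fixed for next run.')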

# 5. proc_system NaN check
n_nans = n_sys.isna().sum()
problem_cols = n_nans[n_nans > 0]
if len(problem_cols) > 0:
    print(f'\nproc_system_sample NaN columns ({len(problem_cols)}):')
    for c, cnt in problem_cols.items():
        print(f'  {c}: {cnt}/{len(n_sys)} NaN')
else:
    print('\nproc_system_sample: no NaN values')

print('\n' + '=' * 60)

## Next steps

This notebook is for **exploration** only. Once you have multiple runs per condition
(replications), the statistical analysis will:

1. Aggregate per-run summaries (mean event rate during inference window)
2. Compare conditions with Mann-Whitney U / permutation tests
3. Compute effect sizes (Cohen's d, rank-biserial correlation)
4. Apply multiple-comparison correction (Holm-Bonferroni)
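
A minimal, standard-library sketch of steps 2 to 4 might look like the following. It uses a permutation test on the difference of means, Cohen's d with a pooled standard deviation, and the Holm-Bonferroni step-down rule; Mann-Whitney U and rank-biserial correlation are left out. The function names are hypothetical and the per-run values are synthetic placeholders, not measurements.

```python
import random
from statistics import mean, stdev


def permutation_p(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        hits += diff >= observed
    # +1 correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)


def cohens_d(a, b):
    """Standardized mean difference (a minus b) using the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni: reject flags in the original order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # step-down stops at the first non-rejection
        reject[i] = True
    return reject


# Synthetic per-run summaries for one event (placeholder numbers).
emotional = [10, 11, 12, 13, 14]
neutral = [0, 1, 2, 3, 4]
print(permutation_p(emotional, neutral, n_perm=2000))
print(cohens_d(emotional, neutral))
print(holm_reject([0.01, 0.04, 0.03]))
```

With one p-value per perf event, `holm_reject` would be applied once across all events rather than per event.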

**Action items from this exploration:**
- Re-run with the fixed collector (PSI prefixes + or-chain fix)
- Consider increasing `budget_s` to 60 s so all 5 emotional prompts complete
- Collect >= 10 replications per condition for meaningful statistics