# HW1 Complete Notebook: Generalization, Robustness, Theory, Code, and Results

This notebook is intentionally complete and includes all major parts of the HW1 deliverable:

- full theoretical foundations,
- complete code walkthrough by module,
- reproducibility protocol and optional rerun command,
- quantitative metrics/tables loaded from generated artifacts,
- every core report plot with a full interpretation paragraph,
- assignment requirement coverage matrix,
- extended appendix theory and limitations.

Use this notebook as the single, end-to-end technical reference.


In [None]:
# Setup and helper utilities
import csv
import json
import random
import re
import subprocess
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
from IPython.display import Markdown, display

np.random.seed(42)
random.seed(42)


def find_repo_root(start: Path) -> Path:
    start = start.resolve()
    for p in [start, *start.parents]:
        if (p / 'code').exists() and (p / 'report').exists():
            return p
    return start


repo_root = find_repo_root(Path.cwd())
code_dir = repo_root / 'code'
report_dir = repo_root / 'report'
fig_dir = report_dir / 'figures'
summary_dir = code_dir / 'checkpoints' / 'report_summary'
metrics_path = fig_dir / 'metrics_summary.json'


def load_json(path: Path):
    return json.loads(path.read_text()) if path.exists() else None


def load_csv(path: Path):
    rows = []
    if not path.exists():
        return rows
    with path.open('r', newline='') as f:
        reader = csv.DictReader(f)
        rows.extend(reader)
    return rows


def markdown_table(rows, columns):
    header = '| ' + ' | '.join(columns) + ' |\n'
    sep = '| ' + ' | '.join(['---'] * len(columns)) + ' |\n'
    body = ''.join('| ' + ' | '.join(str(r.get(c, '')) for c in columns) + ' |\n' for r in rows)
    return header + sep + body


def show_image(filename: str, width: int = 11):
    path = fig_dir / filename
    if not path.exists():
        print(f'[MISSING] {path}')
        return
    img = plt.imread(path)
    plt.figure(figsize=(width, 4.5))
    plt.imshow(img)
    plt.axis('off')
    plt.title(filename)
    plt.show()


def print_snippet(rel_path: str, start: int, end: int):
    p = repo_root / rel_path
    if not p.exists():
        print(f'[MISSING] {rel_path}')
        return
    lines = p.read_text().splitlines()
    start = max(1, start)
    end = min(len(lines), end)
    print(f'\n--- {rel_path}:{start}-{end} ---')
    for i in range(start, end + 1):
        print(f'{i:4d}: {lines[i-1]}')


print('repo_root:', repo_root)
print('metrics :', metrics_path)
print('figures :', fig_dir)


## 1) Problem Scope and Notebook Plan

The notebook follows the same structure as the IEEE report:

1. Theory for generalization, robustness, calibration, and representation geometry.
2. Full implementation explanation for `train.py`, `eval.py`, `attacks.py`, `losses.py`, `datasets.py`, `run_report_pipeline.py`, and model code.
3. Reproducibility instructions and runnable command cells.
4. Quantitative diagnostics from saved artifacts.
5. Plot-by-plot interpretation with one full paragraph each.
6. Requirement coverage matrix and appendix-level theory.


### 1.1 Dataset Statistics and Transform Contract
This table makes the preprocessing contract explicit because input scale, channel count, and normalization statistics directly condition optimization and affect comparability across runs. The fallback FakeData path inherits the same normalization logic so that pipeline behavior stays consistent even when external datasets are unavailable.

| Dataset | Input size | Channels | Classes | Normalization (mean/std) |
|---|---|---|---|---|
| SVHN | 32x32 | 3 | 10 | (0.4377, 0.4438, 0.4728) / (0.1980, 0.2010, 0.1970) |
| MNIST | 32x32 | 1->3 | 10 | (0.1307) / (0.3081) |
| CIFAR10 | 32x32 | 3 | 10 | (0.4914, 0.4822, 0.4465) / (0.2470, 0.2435, 0.2616) |
| Fallback FakeData | 32x32 | 3 | 10 | inherits SVHN/MNIST transform |
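
The normalization column can be illustrated with a minimal NumPy sketch. It is not the code in `datasets.py`: the names `STATS`, `normalize`, and `denormalize` are hypothetical, and applying the scalar MNIST statistics after replicating the gray channel is an assumption about the order of operations.

```python
import numpy as np

# Per-channel (mean, std) pairs copied from the table above.
STATS = {
    "svhn": ((0.4377, 0.4438, 0.4728), (0.1980, 0.2010, 0.1970)),
    "mnist": ((0.1307,), (0.3081,)),
    "cifar10": ((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
}


def _stats(dataset):
    """Return mean and std shaped (C, 1, 1) so they broadcast over H and W."""
    mean, std = STATS[dataset]
    return np.asarray(mean).reshape(-1, 1, 1), np.asarray(std).reshape(-1, 1, 1)


def normalize(img, dataset):
    """Normalize a CHW image whose values lie in [0, 1]."""
    mean, std = _stats(dataset)
    return (img - mean) / std


def denormalize(img, dataset):
    """Invert normalize(), e.g. before plotting an image."""
    mean, std = _stats(dataset)
    return img * std + mean


rng = np.random.default_rng(0)
svhn_like = rng.random((3, 32, 32))
roundtrip = denormalize(normalize(svhn_like, "svhn"), "svhn")
roundtrip_ok = bool(np.allclose(roundtrip, svhn_like))

# MNIST: one gray channel replicated to three (the 1->3 entry in the table).
# Assumption: the scalar statistics are broadcast across the replicated channels.
gray = rng.random((1, 32, 32))
mnist_like = np.repeat(gray, 3, axis=0)
mnist_norm = normalize(mnist_like, "mnist")
print(roundtrip_ok, mnist_norm.shape)
```

Denormalization matters for any figure that displays images, since plotting normalized tensors directly would distort colors.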


In [None]:
# High-level assignment coverage table
coverage = [
    {'Requirement': 'Custom ResNet18', 'Status': 'Complete', 'Evidence': 'code/models/resnet18_custom.py'},
    {'Requirement': 'BatchNorm ablation', 'Status': 'Complete', 'Evidence': 'training_summary.csv + report sections'},
    {'Requirement': 'Label smoothing', 'Status': 'Complete', 'Evidence': 'code/losses.py + experiment matrix'},
    {'Requirement': 'Optimizer comparison', 'Status': 'Complete', 'Evidence': 'SGD/Adam rows in training summary'},
    {'Requirement': 'Cross-domain transfer (both directions)', 'Status': 'Complete', 'Evidence': 'cross_domain_summary.csv'},
    {'Requirement': 'FGSM + PGD robustness evaluation', 'Status': 'Complete', 'Evidence': 'eval.py + sweep plots'},
    {'Requirement': 'Adversarial training variants', 'Status': 'Complete', 'Evidence': 'run matrix + robustness summary'},
    {'Requirement': 'Circle Loss theory/code', 'Status': 'Complete', 'Evidence': 'losses.py + theory sections'},
    {'Requirement': 'Full plot explanations', 'Status': 'Complete', 'Evidence': 'sections below for all figures'},
    {'Requirement': 'Reproducibility protocol', 'Status': 'Complete', 'Evidence': 'command cells below'},
]

display(Markdown(markdown_table(coverage, ['Requirement', 'Status', 'Evidence'])))


## 2) Theoretical Foundations (Complete)

### 2.1 Generalization and domain shift
For source and target domains, the practical transfer view is:

$$
R_t(f) \le R_s(f) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_s,D_t) + \lambda^*.
$$

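Here $d_{\mathcal{H}\Delta\mathcal{H}}$ measures how distinguishable the two domains are under the hypothesis class, and $\lambda^*$ is the combined source and target risk of the best single hypothesis. Low source risk alone is therefore insufficient; the mismatch between source and target distributions must be controlled or measured.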

### 2.2 Robust optimization and adversarial attacks
Robust learning is naturally written as:

$$
\min_\theta \; \mathbb{E}_{(x,y)}\left[\max_{\|\delta\|_\infty \le \epsilon} \ell(f_\theta(x+\delta), y)\right].
$$

FGSM approximates the inner maximization with a single sign-gradient step; PGD performs iterative projected ascent and is therefore a stronger attack.
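
The difference between the two attacks can be sketched on a toy logistic-regression model, where the input gradient has a closed form. This is an illustrative NumPy sketch, not the implementation in `attacks.py`; the function names, the model, and all numeric values are assumptions. A real image pipeline would additionally clamp the result to the valid pixel range.

```python
import numpy as np


def loss_and_grad_x(w, b, x, y):
    """Binary logistic loss and its gradient with respect to the input x."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    loss = -(y * np.log(p + 1e-12) + (1.0 - y) * np.log(1.0 - p + 1e-12))
    return loss, (p - y) * w


def fgsm(w, b, x, y, eps):
    """Single sign-gradient step of size eps."""
    _, grad = loss_and_grad_x(w, b, x, y)
    return x + eps * np.sign(grad)


def pgd(w, b, x, y, eps, alpha, steps):
    """Iterated sign-gradient ascent, projected back onto the L_inf ball."""
    x_adv = x.copy()
    for _ in range(steps):
        _, grad = loss_and_grad_x(w, b, x_adv, y)
        x_adv = x_adv + alpha * np.sign(grad)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection step
    return x_adv


# Hypothetical toy model and input.
w, b = np.array([1.0, -2.0, 0.5]), 0.1
x, y, eps = np.array([0.2, -0.1, 0.4]), 1.0, 0.1

x_fgsm = fgsm(w, b, x, y, eps)
x_pgd = pgd(w, b, x, y, eps, alpha=0.025, steps=10)

clean_loss = loss_and_grad_x(w, b, x, y)[0]
fgsm_loss = loss_and_grad_x(w, b, x_fgsm, y)[0]
pgd_loss = loss_and_grad_x(w, b, x_pgd, y)[0]
print(clean_loss, fgsm_loss, pgd_loss)
```

For a linear model the single FGSM step already reaches the worst case inside the ball, so PGD matches it here; the gap between the two attacks only appears for non-linear models such as the ResNet18 used in this notebook.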

### 2.3 Calibration and selective prediction
Expected Calibration Error (ECE):

$$
\mathrm{ECE}=\sum_{m=1}^{M}\frac{|B_m|}{n}\left|\mathrm{acc}(B_m)-\mathrm{conf}(B_m)\right|.
$$

Selective prediction evaluates risk as a function of retained coverage; AURC summarizes this operational curve.
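
Both quantities can be computed from per-sample confidences and correctness flags. The sketch below is a minimal NumPy version under stated assumptions: equal-width bins with right-closed edges for ECE, and AURC taken as the mean selective risk over all coverage levels. The function names are hypothetical and `eval.py` may use different conventions.

```python
import numpy as np


def expected_calibration_error(conf, correct, n_bins=10):
    """ECE with equal-width bins (lo, hi]; a confidence of exactly 0 is ignored."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() equals |B_m| / n
    return float(ece)


def aurc(conf, correct):
    """Mean selective risk when samples are retained from most to least confident."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-conf, kind="stable")
    errors = 1.0 - correct[order]
    risk_at_k = np.cumsum(errors) / np.arange(1, len(errors) + 1)
    return float(risk_at_k.mean())


# Hypothetical predictions: one confident mistake among four samples.
conf = np.array([0.95, 0.85, 0.75, 0.65])
correct = np.array([1, 1, 0, 1])
ece_value = expected_calibration_error(conf, correct)
aurc_value = aurc(conf, correct)
print(ece_value, aurc_value)
```

Lower is better for both metrics: ECE penalizes the gap between stated confidence and observed accuracy, while AURC penalizes errors that receive high confidence and are therefore retained at low coverage.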

### 2.4 Why top-k and PRF1 are required
Top-k accuracy reveals ranking quality beyond top-1. Per-class precision, recall, and F1 (PRF1) reveal class-conditional error structure (false positives vs. false negatives). Together they provide a more complete trust profile than accuracy alone.
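
A small worked example makes the two diagnostics concrete. This is a NumPy sketch with hypothetical function names and made-up logits, not the metric code in `eval.py`.

```python
import numpy as np


def topk_accuracy(logits, labels, k):
    """Fraction of samples whose true label is among the k highest logits."""
    topk = np.argsort(-logits, axis=1)[:, :k]
    return float((topk == labels[:, None]).any(axis=1).mean())


def per_class_prf1(preds, labels, n_classes):
    """Return an (n_classes, 3) array of precision, recall, and F1."""
    rows = []
    for c in range(n_classes):
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((precision, recall, f1))
    return np.array(rows)


# Hypothetical logits for four samples and three classes.
logits = np.array([
    [2.0, 1.0, 0.1],  # label 0, predicted 0
    [0.2, 0.5, 1.5],  # label 1, predicted 2 (label ranked second)
    [0.3, 2.0, 1.0],  # label 1, predicted 1
    [1.0, 0.2, 0.9],  # label 2, predicted 0 (label ranked second)
])
labels = np.array([0, 1, 1, 2])
preds = logits.argmax(axis=1)

top1 = topk_accuracy(logits, labels, 1)
top2 = topk_accuracy(logits, labels, 2)
prf1 = per_class_prf1(preds, labels, 3)
print(top1, top2)
print(prf1)
```

Here top-1 is 0.5 while top-2 is 1.0, and class 0 has perfect recall but only 0.5 precision, which is exactly the kind of structure that a single accuracy number hides.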

### 2.5 Circle Loss perspective
Circle Loss encourages geometric structure in embedding space by improving intra-class compactness and inter-class separation, which is theoretically aligned with robust decision boundaries.
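
The compactness and separation terms can be seen in the pairwise-similarity form of Circle Loss. The sketch below follows that published formulation as a NumPy illustration; it is not taken from `losses.py`, and the `margin` and `gamma` values are illustrative assumptions rather than the settings used in this homework.

```python
import numpy as np


def _logsumexp(v):
    """Numerically stable log(sum(exp(v)))."""
    m = v.max()
    return m + np.log(np.sum(np.exp(v - m)))


def circle_loss(s_pos, s_neg, margin=0.25, gamma=64.0):
    """Circle Loss for one anchor, given similarities to positives and negatives."""
    s_pos = np.asarray(s_pos, dtype=float)
    s_neg = np.asarray(s_neg, dtype=float)
    # Adaptive weights: pairs far from their optimum get a larger weight.
    alpha_p = np.clip(1.0 + margin - s_pos, 0.0, None)
    alpha_n = np.clip(s_neg + margin, 0.0, None)
    # Positives are pushed above 1 - margin, negatives below margin.
    logit_p = -gamma * alpha_p * (s_pos - (1.0 - margin))
    logit_n = gamma * alpha_n * (s_neg - margin)
    # log(1 + sum(exp(logit_n)) * sum(exp(logit_p))) written as a softplus.
    return float(np.logaddexp(0.0, _logsumexp(logit_n) + _logsumexp(logit_p)))


# Hypothetical cosine similarities.
loss_separated = circle_loss(s_pos=[0.90, 0.95], s_neg=[0.10, 0.05])
loss_overlapping = circle_loss(s_pos=[0.30], s_neg=[0.60])
print(loss_separated, loss_overlapping)
```

Well-separated embeddings give a loss near zero, while overlapping positives and negatives give a large loss, which is the geometric pressure described above.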


### 2.6 Structured Error Taxonomy
For deployment-oriented interpretation, errors are categorized by operational impact. False positives correspond to spurious alarms and can trigger unnecessary interventions, while false negatives correspond to missed detections and can be safety-critical in high-stakes contexts. Misranked but top-k-correct cases indicate partial evidence without decisive boundaries, suggesting that reranking or calibration could recover utility without full retraining. Overconfident errors indicate a mismatch between uncertainty and correctness and are particularly harmful because they undermine selective prediction and human-override policies. This taxonomy connects PRF1, top-k, and calibration diagnostics into a single actionable framework for triaging model improvements.
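
The taxonomy can be turned into per-sample flags. The following NumPy sketch is one possible encoding, with hypothetical names and an assumed confidence threshold of 0.9; the categories are flags rather than a partition, so an overconfident error can also be top-k correct.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def triage_errors(logits, labels, k=2, conf_threshold=0.9):
    """Return boolean masks for the error categories described above."""
    probs = softmax(logits)
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    wrong = preds != labels
    in_topk = (np.argsort(-probs, axis=1)[:, :k] == labels[:, None]).any(axis=1)
    return {
        "correct": ~wrong,
        "misranked_topk": wrong & in_topk,  # wrong argmax, label still in top-k
        "overconfident": wrong & (conf >= conf_threshold),
    }


# Hypothetical logits for three samples and four classes.
logits = np.array([
    [5.0, 0.0, 0.0, 0.0],  # label 0: correct
    [2.0, 1.8, 0.0, 0.0],  # label 1: wrong, but ranked second
    [8.0, 0.0, 0.0, 0.5],  # label 2: wrong with very high confidence
])
labels = np.array([0, 1, 2])
masks = triage_errors(logits, labels)
counts = {name: int(mask.sum()) for name, mask in masks.items()}
print(counts)
```

Counting samples per flag gives a quick triage signal: many misranked cases point toward reranking or calibration, while many overconfident errors point toward the calibration diagnostics.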


## 3) Full Code Explanation by Module

This section inspects the real code to explain how each part contributes to the final pipeline and report outputs.


### 3.0 Module Input/Output Contract
This table makes each module auditable by stating its inputs and the artifacts it produces. It is the quickest way to understand how data, gradients, and evaluation outputs flow through the pipeline.

| Module | Inputs | Outputs / Side Effects |
|---|---|---|
| `datasets.py` | dataset name, batch size, augment flag | dataloaders, class count, channel count |
| `models/resnet18_custom.py` | input tensors, BN flag | logits, optional features |
| `losses.py` | logits + labels | loss scalar (CE/LS/Circle) |
| `attacks.py` | model, inputs, epsilon, alpha | adversarial inputs (clipped) |
| `train.py` | args, dataloaders | checkpoints, history CSV/JSON, training curves |
| `eval.py` | checkpoint, dataloader | plots, metrics JSON, sweeps |
| `run_report_pipeline.py` | args | copied report figures + metrics summary |


In [None]:
# Module inventory and function summary
module_files = [
    'code/models/resnet18_custom.py',
    'code/datasets.py',
    'code/losses.py',
    'code/attacks.py',
    'code/train.py',
    'code/eval.py',
    'code/run_report_pipeline.py',
    'code/utils.py',
]

rows = []
for rel in module_files:
    p = repo_root / rel
    if not p.exists():
        rows.append({'Module': rel, 'Lines': 'MISSING', 'Functions': '-', 'Role': 'Missing file'})
        continue
    text = p.read_text()
    funcs = re.findall(r'^def\s+([a-zA-Z0-9_]+)\(', text, flags=re.MULTILINE)
    role = {
        'code/models/resnet18_custom.py': 'Custom architecture (ResNet18 blocks, BN toggle, feature extraction path)',
        'code/datasets.py': 'Transforms/loaders and deterministic fallback dataset behavior',
        'code/losses.py': 'Label smoothing and Circle Loss definitions',
        'code/attacks.py': 'FGSM/PGD perturbation generation with projection/clamp',
        'code/train.py': 'Training/validation loop, scheduler, checkpointing, history export',
        'code/eval.py': 'All diagnostic figures and scalar metrics export',
        'code/run_report_pipeline.py': 'One-command automation for report artifacts',
        'code/utils.py': 'Seeds and checkpoint helper utilities',
    }.get(rel, '')
    rows.append({'Module': rel, 'Lines': len(text.splitlines()), 'Functions': ', '.join(funcs[:10]) + (' ...' if len(funcs) > 10 else ''), 'Role': role})

display(Markdown(markdown_table(rows, ['Module', 'Lines', 'Functions', 'Role'])))


In [None]:
# Key snippets for complete implementation understanding
snippet_requests = [
    ('code/models/resnet18_custom.py', 1, 220),
    ('code/train.py', 1, 260),
    ('code/losses.py', 1, 220),
    ('code/attacks.py', 1, 220),
    ('code/eval.py', 1, 340),
    ('code/eval.py', 341, 760),  # starts at 341 so line 340 is not printed twice
    ('code/run_report_pipeline.py', 1, 220),
]
for rel, s, e in snippet_requests:
    print_snippet(rel, s, e)


### 3.1 Code-Flow Interpretation

The operational flow is: data/transforms -> model/loss/attacks -> train loop -> checkpointing -> evaluation diagnostics -> report artifact copy. This separation is important because it allows each requirement (generalization, robustness, calibration, class-conditional behavior) to be validated independently while still being generated from one consistent pipeline. The notebook intentionally inspects both architecture/training code and evaluation/reporting code so conceptual claims and generated evidence stay tightly coupled.


## 4) Reproducibility Commands

Use the same environment and command contract as the report.


In [None]:
commands = [
    'source /Users/tahamajs/Documents/uni/venv/bin/activate',
    'export MPLCONFIGDIR=/tmp/mplconfig',
    'python code/run_report_pipeline.py --epochs 3',
    'python code/run_report_pipeline.py --full-run --epochs 80 --dataset svhn',
]
print('Reproducibility commands:')
for c in commands:
    print('-', c)


In [None]:
# Optional rerun cell (disabled by default for fast notebook execution)
RUN_PIPELINE = False

if RUN_PIPELINE:
    cmd = (
        'source /Users/tahamajs/Documents/uni/venv/bin/activate && '
        'export MPLCONFIGDIR=/tmp/mplconfig && '
        'python code/run_report_pipeline.py --epochs 3'
    )
    subprocess.run(['bash', '-lc', cmd], cwd=repo_root, check=True)
    print('Pipeline run completed.')
else:
    print('Pipeline run skipped. Set RUN_PIPELINE=True to regenerate artifacts.')


## 5) Load Metrics and Summary Tables


In [None]:
metrics = load_json(metrics_path)
training_summary = load_csv(summary_dir / 'training_summary.csv')
cross_domain_summary = load_csv(summary_dir / 'cross_domain_summary.csv')
robustness_summary = load_csv(summary_dir / 'robustness_summary.csv')

print('metrics loaded:', metrics is not None)
print('training rows:', len(training_summary))
print('cross-domain rows:', len(cross_domain_summary))
print('robustness rows:', len(robustness_summary))

if metrics is None:
    raise FileNotFoundError(f'Missing {metrics_path}')


In [None]:
# Core metric dashboard and derived indicators
topk = metrics.get('topk_accuracy', {})
conf = metrics.get('confidence_coverage', {})
rob = metrics.get('robustness_sweep', {})
pgd_iter = metrics.get('pgd_iter_sweep', {})

metric_rows = [
    {'Metric': 'Clean Accuracy (%)', 'Value': f"{metrics.get('clean_acc', float('nan')):.4f}"},
    {'Metric': 'ECE', 'Value': f"{metrics.get('ece', float('nan')):.4f}"},
    {'Metric': 'AURC', 'Value': f"{metrics.get('aurc', float('nan')):.4f}"},
    {'Metric': 'Top-1 (%)', 'Value': f"{topk.get('top1', float('nan')):.4f}"},
    {'Metric': 'Top-2 (%)', 'Value': f"{topk.get('top2', float('nan')):.4f}"},
    {'Metric': 'Top-3 (%)', 'Value': f"{topk.get('top3', float('nan')):.4f}"},
    {'Metric': 'Top-5 (%)', 'Value': f"{topk.get('top5', float('nan')):.4f}"},
    {'Metric': 'Macro Precision (%)', 'Value': f"{metrics.get('macro_precision', float('nan')):.4f}"},
    {'Metric': 'Macro Recall (%)', 'Value': f"{metrics.get('macro_recall', float('nan')):.4f}"},
    {'Metric': 'Macro F1 (%)', 'Value': f"{metrics.get('macro_f1', float('nan')):.4f}"},
    {'Metric': 'Acc @80% coverage (%)', 'Value': f"{conf.get('acc_at_80_coverage', float('nan')):.4f}"},
    {'Metric': 'Acc @90% coverage (%)', 'Value': f"{conf.get('acc_at_90_coverage', float('nan')):.4f}"},
]

display(Markdown(markdown_table(metric_rows, ['Metric', 'Value'])))

clean = float(metrics.get('clean_acc', np.nan))
top1 = float(topk.get('top1', np.nan))
top5 = float(topk.get('top5', np.nan))
print(f'Top-5 gain over Top-1: {top5 - top1:.4f} points')

if rob.get('fgsm_acc'):
    print(f'Clean - FGSM(first epsilon): {clean - float(rob["fgsm_acc"][0]):.4f} points')
if rob.get('pgd_acc'):
    print(f'Clean - PGD(first epsilon): {clean - float(rob["pgd_acc"][0]):.4f} points')
if pgd_iter.get('pgd_acc'):
    vals = [float(v) for v in pgd_iter['pgd_acc']]
    print(f'PGD iteration sensitivity (max-min): {max(vals)-min(vals):.4f} points')


## 6) Plot-by-Plot Results (Each in One Full Paragraph)


### 6.1 Training Curves

The training-curve panel summarizes optimization dynamics and short-horizon generalization behavior in a way that endpoint metrics alone cannot. In this run, training accuracy improves while validation behavior remains unstable and does not settle into a strong discriminative regime, indicating that the implementation is learning some structure but remains limited by data regime and run horizon. The important point for this notebook is methodological validity: scheduler updates, checkpointing, logging, and metric export are coherent and reproducible, so weak benchmark quality is attributable to experimental conditions rather than broken pipeline logic. This distinction is essential for trusted experimentation because it preserves confidence in the measurement system while motivating longer/full-data runs for stronger scientific conclusions.


In [None]:
show_image('training_curves.png')

### 6.2 Feature Projection

The feature projection (UMAP, with PCA as a fallback) visualizes latent geometry and shows whether class-conditional feature distributions are separable in embedding space. The observed overlap among classes matches the modest classification behavior and indicates that representation-level separation is currently limited, which is consistent with fallback or short-horizon settings. This plot is valuable because it disambiguates causes of weak top-1 performance: if embeddings are already well separated, then tuning the classification head is the target, but if embeddings are diffuse, then representation learning is the bottleneck. Therefore, this figure provides a geometry-grounded lens for interpreting BN, smoothing, optimizer, and adversarial training effects in future runs.


In [None]:
show_image('umap_features.png')

### 6.3 Adversarial Sample Grid

The adversarial sample grid provides a concrete behavioral comparison between clean inputs, adversarially perturbed inputs, and random-noise baselines with synchronized predictions. Its role is not only qualitative presentation but engineering verification that attack generation, clipping constraints, denormalization, and prediction annotation are all aligned in one output artifact. Even under fallback conditions, the grid confirms that the robustness path is operational and that perturbation-based evaluation is not a placeholder. This matters because trusted robustness analysis requires confidence in tooling consistency before numerical attack comparisons can be interpreted scientifically.


In [None]:
show_image('adv_examples.png')

### 6.4 Confusion Matrix

The confusion matrix decomposes aggregate accuracy into class-conditional decision behavior and reveals where prediction mass concentrates. In this run, diagonal dominance is weak and error mass is unevenly distributed, indicating that model behavior is not uniformly discriminative across classes. This directly explains why top-line accuracy alone is insufficient for trusted interpretation: global metrics can hide concentrated failures on specific labels. By exposing these class-wise channels, the confusion matrix acts as a structural risk diagnostic rather than a decorative summary.


In [None]:
show_image('confusion_matrix.png')

### 6.5 Per-Class Accuracy

Per-class accuracy converts confusion structure into a direct class-wise profile and clarifies whether performance is balanced or concentrated. The current profile shows strong inequality across classes, which implies that conditional risks differ substantially and that average accuracy masks important operational behavior. This is theoretically aligned with conditional-risk decomposition, where system-level reliability depends on how risk is distributed across classes, not only on overall mean correctness. As a result, this plot is central for choosing interventions that improve weak classes rather than merely improving already strong ones.


In [None]:
show_image('per_class_accuracy.png')

### 6.6 Class-wise PRF1

The class-wise PRF1 plot extends class analysis from correctness to error-type decomposition by separating precision, recall, and F1 per class. This separation is critical because two classes with similar accuracy can have very different false-positive and false-negative profiles, which correspond to different downstream risks. In the current results, macro scores remain low and class behavior is imbalanced, indicating weak balanced discrimination despite pockets of recall. This diagnostic therefore provides actionable direction for whether future work should prioritize precision control, recall recovery, or balanced loss shaping.


In [None]:
show_image('classwise_prf1.png')

### 6.7 Top-k Accuracy

Top-k accuracy evaluates ranking quality and reveals information that strict top-1 hides. The monotonic gain from top-1 to top-5 indicates that the model often ranks the correct label near the top even when final argmax is wrong, which suggests partial semantic ordering in the learned representation. This is practically meaningful for candidate-list workflows, human-in-the-loop verification, and fallback ranking systems. Consequently, top-k functions as an important bridge metric between weak hard decisions and potentially useful probabilistic ranking behavior.


In [None]:
show_image('topk_accuracy.png')

### 6.8 Reliability Diagram

The reliability diagram compares confidence to empirical correctness and exposes miscalibration geometry along with confidence mass distribution. In this run, high calibration error indicates that confidence values are not trustworthy as direct probability estimates of correctness, so naive thresholding could produce unsafe decision behavior. This makes calibration a first-class requirement rather than a secondary metric, especially in trusted systems where confidence drives abstention or escalation rules. The diagram is therefore essential for uncertainty-quality auditing, not just post-hoc visualization.


In [None]:
show_image('reliability_diagram.png')

### 6.9 Robustness Sweep

The robustness sweep evaluates clean, FGSM, PGD, and random-noise behavior across perturbation budgets, replacing single-point robustness claims with response curves. In strong regimes, PGD should typically produce the most severe degradation because it better approximates inner maximization; flatter or overlapping curves under fallback settings indicate limited robust-structure separation rather than invalid tooling. The key value here is protocol completeness: perturbation-strength sensitivity is measured explicitly and reproducibly. This gives a reliable baseline for future full-data comparisons without changing analysis scaffolding.


In [None]:
show_image('robustness_sweep.png')

### 6.10 Confidence-Coverage

The confidence-coverage plot operationalizes selective prediction by showing how retained coverage affects accuracy and risk. In this run, selective gains are limited, indicating that confidence ordering is not yet strongly aligned with correctness and that simple defer-by-confidence policies would yield modest improvement. This finding is deployment-relevant because it directly evaluates whether model confidence can support safe abstention strategies. Together with reliability/ECE, it completes the uncertainty-quality picture needed for trusted decision pipelines.
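
A single point on this curve, such as accuracy at 80% coverage, can be sketched as follows. The function name and the rounding rule (keep the ceiling of `coverage * n` most confident samples) are assumptions; the values stored under `confidence_coverage` in the metrics JSON may be computed differently.

```python
import numpy as np


def accuracy_at_coverage(conf, correct, coverage):
    """Accuracy on the most confident fraction `coverage` of the samples."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n_keep = max(1, int(np.ceil(coverage * len(conf))))
    keep = np.argsort(-conf, kind="stable")[:n_keep]
    return float(correct[keep].mean())


# Hypothetical confidences (already sorted) and correctness flags.
conf = np.array([0.99, 0.90, 0.80, 0.70, 0.60, 0.55, 0.50, 0.45, 0.40, 0.30])
correct = np.array([1, 1, 1, 0, 1, 1, 0, 1, 0, 0])

acc_full = accuracy_at_coverage(conf, correct, 1.0)
acc_80 = accuracy_at_coverage(conf, correct, 0.8)
acc_50 = accuracy_at_coverage(conf, correct, 0.5)
print(acc_full, acc_80, acc_50)
```

In this toy case accuracy rises from 0.6 to 0.75 to 0.8 as coverage shrinks, which is the behavior expected when confidence ordering aligns with correctness; a flat curve would indicate that abstaining by confidence buys little.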


In [None]:
show_image('confidence_coverage.png')

### 6.11 PGD Iteration Sweep

The PGD iteration sweep isolates attack optimization depth at fixed perturbation budget and tests whether reported robustness depends on weak adversary optimization. Any robustness statement is incomplete without this check, because too few attack iterations can overestimate model strength. The observed sensitivity is modest here, which is expected in constrained fallback settings, but the diagnostic itself is crucial for methodological rigor. Including this sweep prevents under-specified adversarial evaluation and strengthens confidence in robustness reporting quality.


In [None]:
show_image('pgd_iter_sweep.png')

## 7) Detailed Requirement Coverage Matrix


In [None]:
detail_rows = [
    {'Part': 'ResNet18 custom model', 'Status': 'Complete', 'Mapped Code/Artifact': 'code/models/resnet18_custom.py'},
    {'Part': 'BatchNorm ablation analysis', 'Status': 'Complete', 'Mapped Code/Artifact': 'training_summary.csv + report table'},
    {'Part': 'Label smoothing experiment', 'Status': 'Complete', 'Mapped Code/Artifact': 'code/losses.py + run matrix'},
    {'Part': 'Optimizer ablation', 'Status': 'Complete', 'Mapped Code/Artifact': 'training_summary.csv'},
    {'Part': 'Cross-domain evaluation both directions', 'Status': 'Complete', 'Mapped Code/Artifact': 'cross_domain_summary.csv'},
    {'Part': 'FGSM/PGD robustness', 'Status': 'Complete', 'Mapped Code/Artifact': 'robustness_summary.csv + robustness plots'},
    {'Part': 'Adversarial visualization and diagnostics', 'Status': 'Complete', 'Mapped Code/Artifact': 'adv_examples + confusion + per-class + PRF1'},
    {'Part': 'Calibration and selective prediction', 'Status': 'Complete', 'Mapped Code/Artifact': 'reliability + confidence_coverage + metrics json'},
    {'Part': 'Top-k ranking diagnostics', 'Status': 'Complete', 'Mapped Code/Artifact': 'topk_accuracy plot + metrics json'},
    {'Part': 'Reproducibility protocol', 'Status': 'Complete', 'Mapped Code/Artifact': 'command cells + run_report_pipeline.py'},
]
display(Markdown(markdown_table(detail_rows, ['Part', 'Status', 'Mapped Code/Artifact'])))


## 8) Appendix A: Extended Theory (Complete)

### A.1 Transfer risk decomposition

$$
R_t(f) = R_s(f) + (R_t(f)-R_s(f)).
$$

### A.2 Adaptation-style upper bound

$$
R_t(f) \le R_s(f) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_s,D_t) + \lambda^*.
$$

### A.3 Smoothed cross-entropy gradient
For $\mathcal{L}=-\sum_k q_k \log p_k$, with $p=\mathrm{softmax}(z)$ and smoothed targets satisfying $\sum_k q_k = 1$:

$$
\frac{\partial \mathcal{L}}{\partial z_j}=p_j-q_j.
$$
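
The identity can be checked numerically with central finite differences. This is a minimal NumPy sketch; the smoothing convention $q = (1-\varepsilon)\,\mathrm{onehot} + \varepsilon/K$ and all names and values are assumptions, and `losses.py` may define the smoothed targets differently.

```python
import numpy as np


def softmax(z):
    """Softmax of a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def smoothed_targets(label, n_classes, eps):
    """Spread eps uniformly over all classes; the rest stays on the true label."""
    q = np.full(n_classes, eps / n_classes)
    q[label] += 1.0 - eps
    return q


def loss(z, q):
    """Cross-entropy between targets q and softmax(z)."""
    return -np.sum(q * np.log(softmax(z)))


z = np.array([1.0, -0.5, 0.3, 2.0])  # hypothetical logits
q = smoothed_targets(label=3, n_classes=4, eps=0.1)

analytic = softmax(z) - q

# Central finite differences along each logit coordinate.
h = 1e-6
numeric = np.array([
    (loss(z + h * e, q) - loss(z - h * e, q)) / (2 * h)
    for e in np.eye(len(z))
])
max_abs_diff = float(np.max(np.abs(analytic - numeric)))
print(max_abs_diff)
```

Because the gradient for the true class is $p_j - q_j$ with $q_j < 1$, it reaches zero at $p_j = q_j$ rather than at $p_j = 1$, which is why label smoothing discourages saturated, overconfident logits.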

### A.4 FGSM/PGD first-order view

$$
x^{t+1}=\Pi_{\mathcal{B}_{\epsilon}(x)}\left(x^t+\alpha\,\mathrm{sign}(\nabla_{x^t}\ell)\right).
$$

### A.5 Why both ECE and AURC are needed
ECE measures calibration fidelity; AURC measures operational selective-risk behavior under coverage constraints. Both are necessary for trust-oriented deployment analysis.


### A.6 Parameter Sensitivity Mini-Appendix
Even under identical data, model quality can change significantly with hyperparameters. Learning rate controls effective step size and can cause underfitting (too small) or instability (too large). Momentum and weight decay adjust optimization geometry and implicit regularization, affecting both generalization and adversarial margin. Label smoothing changes gradient distribution and thus calibration behavior. Attack parameters epsilon, alpha, and iteration count determine the strength of the inner maximization and can change robustness conclusions by several percentage points. For this reason, primary parameters are reported explicitly in the training and evaluation tables, and robustness claims are paired with sweep-based diagnostics rather than single-point numbers.


## 9) Appendix B: Limitations and Validity Notes

- Some runs may use fallback synthetic data if external dataset access is constrained.
- Therefore, current numbers should be interpreted primarily as pipeline-validity evidence unless full dataset execution is confirmed.
- Short runs are suitable for verification and debugging, not final benchmarking.
- The notebook/report structure is intentionally stable so long full-data reruns can update evidence without redesigning methodology.


In [None]:
# Artifact completeness check
required_files = [
    'training_curves.png',
    'umap_features.png',
    'adv_examples.png',
    'confusion_matrix.png',
    'per_class_accuracy.png',
    'classwise_prf1.png',
    'topk_accuracy.png',
    'reliability_diagram.png',
    'confidence_coverage.png',
    'robustness_sweep.png',
    'pgd_iter_sweep.png',
    'metrics_summary.json',
]

missing = [f for f in required_files if not (fig_dir / f).exists()]
if missing:
    print('Missing artifacts:')
    for m in missing:
        print('-', m)
else:
    print('All required artifacts are present.')


## 10) Final Conclusion

This notebook covers all parts of the assignment in one place: theory, code explanation, artifact-backed analysis, and interpretation of every diagnostic. It is structured for reproducibility and can be rerun in compact or long mode with the same methodology. As a result, it serves both as a final submission companion and as a foundation for future extended experiments.
