---
title: "Lab 0: Data Science Primer"
subtitle: "Bias–variance, uncertainty, validation"
format:
  html:
    toc: false
    number-sections: true
execute:
  echo: true
  warning: false
  message: false
---

::: callout-note
### Expected Time
- FIN510: Seminar hands‑on ≈ 60 min
- Directed learning extensions ≈ 90–120 min
- FIN720: Computer lab ≈ 120 min
:::

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/quinfer/fin510-colab-notebooks/blob/main/labs/lab00_primer.ipynb)

::: callout-note
### How to use this lab
- Work through the tasks in order and keep notes in your own notebook.
- Reflection prompts are for your learning logs—there is **no submission** for this lab.
- Bring insights back to the seminar to connect with the Week 0 slide discussion.
:::

## Setup (Colab‑only installs)

In [None]:
# Run this cell in Colab only if one of the packages below is missing
try:
    import numpy, pandas, matplotlib, scipy
except ImportError:
    !pip -q install numpy pandas matplotlib scipy

## Orientation — Notebook Basics (10 min)

In [None]:
# Running cells, variables, functions, and a tiny assert
print("Hello, notebook!")

a = 2 + 2
assert a == 4, "Basic arithmetic check failed"

nums = [10, 20, 30, 40]
assert nums[:2] == [10,20]

info = {"ticker": "AAPL", "price": 185.0}
assert "ticker" in info and isinstance(info["price"], (int,float))

def add(x, y):
    return x + y
assert add(2,3) == 5

print("Orientation checks passed ✔")

## Objectives

- Visualise bias–variance trade‑off on synthetic data
- Compute bootstrap confidence intervals
- Practise time‑aware validation (walk‑forward schematic)

## Task 1 — Bias–Variance Curves

In [None]:
import numpy as np
import matplotlib.pyplot as plt

complexity = np.arange(1, 11)
bias2 = (1/complexity)**1.2
variance = 0.03 * (complexity/10)**1.8
mse = bias2 + variance

plt.figure(figsize=(9,5))
plt.plot(complexity, bias2, 'o-', label='Bias²')
plt.plot(complexity, variance, 's-', label='Variance')
plt.plot(complexity, mse, '^-', label='MSE')
plt.xlabel('Model complexity (relative)')
plt.ylabel('Error')
plt.title('Bias–Variance Trade‑off (Illustrative)')
plt.grid(alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

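Checkpoint: Where is MSE minimised over this complexity range, and is the minimum interior or at the edge of the grid? Explain why in terms of the two component curves.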
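
To check your reading of the plot numerically, the sketch below rebuilds the same illustrative curves and locates the minimum. The names `best` and `is_boundary` are introduced here for illustration only and are not part of the lab code.

```python
import numpy as np

# Same illustrative curves as in Task 1
complexity = np.arange(1, 11)
bias2 = (1 / complexity) ** 1.2
variance = 0.03 * (complexity / 10) ** 1.8
mse = bias2 + variance

# Complexity level with the lowest MSE on this grid
best = int(complexity[np.argmin(mse)])

# Flag whether the minimum sits on the first or last grid point
is_boundary = bool(best in (int(complexity[0]), int(complexity[-1])))

print(f"MSE is minimised at complexity = {best} (MSE = {mse.min():.4f})")
print("Minimum is on the edge of the grid" if is_boundary else "Minimum is interior")
```

If the minimum lands on the edge of the grid, the curves as parameterised never reach the point where variance growth outweighs bias reduction. Try rescaling the variance term and rerunning to see how the location of the minimum moves.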

## Task 2 — Bootstrap CI for Mean

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(3)
x = np.random.lognormal(mean=0.0, sigma=0.5, size=300)
B = 2000
boot_means = []
for _ in range(B):
    xb = np.random.choice(x, size=len(x), replace=True)
    boot_means.append(np.mean(xb))

boot_means = np.array(boot_means)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: [{ci_low:.3f}, {ci_high:.3f}]")
assert ci_low < np.mean(x) < ci_high
print("Bootstrap CI computed ✔")

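Discuss: What does a 95% confidence interval mean in the frequentist sense (coverage over repeated samples), and how does that differ from a Bayesian credible interval (posterior probability given the observed data)?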
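
One way to make the frequentist reading concrete is a small coverage simulation: draw many fresh samples from the same lognormal distribution, build a percentile bootstrap interval for each, and count how often the interval contains the true mean. This is a minimal sketch, not part of the original lab. The replication counts (`n_repeats`, `B`) are arbitrary choices kept small for speed, and it uses `np.random.default_rng` rather than the seeded global state used in Task 2.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 0.5, 300            # same distribution and sample size as Task 2
B, n_repeats = 400, 100        # assumed values, chosen only to keep runtime short
true_mean = np.exp(sigma**2 / 2)  # mean of a lognormal with mu=0

covered = 0
for _ in range(n_repeats):
    sample = rng.lognormal(mean=0.0, sigma=sigma, size=n)
    # Draw all B resamples at once as an index matrix of shape (B, n)
    idx = rng.integers(0, n, size=(B, n))
    boot_means = sample[idx].mean(axis=1)
    low, high = np.percentile(boot_means, [2.5, 97.5])
    covered += int(low <= true_mean <= high)

coverage = covered / n_repeats
print(f"Intervals containing the true mean: {coverage:.0%} of {n_repeats} samples")
```

The percentage refers to the procedure across repeated samples, not to the probability that any single computed interval contains the mean. With so few repetitions the estimate is noisy, so expect it to sit near, not exactly at, the nominal level.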

## Task 3 — Walk‑Forward Validation Schematic

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10,2.5))
blocks = [(0,20, 'Train 1'), (20,30,'Valid 1'), (30,50,'Train 2'), (50,60,'Valid 2')]
for s,e,label in blocks:
    plt.barh(0, e-s, left=s, height=0.6,
             color='tab:blue' if 'Train' in label else 'tab:orange')
    plt.text((s+e)/2, 0, label, ha='center', va='center', color='white', fontsize=10)
plt.yticks([]); plt.xlabel('Time index'); plt.xlim(0,60)
plt.title('Rolling/Expanding Walk‑Forward (Toy)'); plt.tight_layout()
print("Walk‑forward schematic drawn ✔")

Deliverable: One short paragraph on when you would still prefer a simple model, despite the case for complexity discussed in @kelly2024complexity.
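
The schematic can also be expressed as index ranges. The sketch below is one possible way to generate walk‑forward splits, with a switch between a rolling (fixed‑length) and an expanding training window. The helper `walk_forward_splits` and its parameters are hypothetical names introduced for illustration. They do not come from the lab or from any library.

```python
def walk_forward_splits(n_obs, train_size, valid_size, expanding=False):
    """Return a list of (train_range, valid_range) pairs in time order.

    Rolling: the training window keeps a fixed length and slides forward.
    Expanding: the training window always starts at 0 and grows.
    """
    splits = []
    train_end = train_size
    while train_end + valid_size <= n_obs:
        train_start = 0 if expanding else train_end - train_size
        train = range(train_start, train_end)
        valid = range(train_end, train_end + valid_size)
        splits.append((train, valid))
        train_end += valid_size  # step forward by one validation block
    return splits


for name, flag in [("rolling", False), ("expanding", True)]:
    print(name)
    for train, valid in walk_forward_splits(60, 20, 10, expanding=flag):
        print(f"  train [{train.start}, {train.stop})  valid [{valid.start}, {valid.stop})")
```

In both variants every validation block starts where its training window ends, so no validation observation precedes the data used to fit the model.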

## Task 4 — Stylised-Fact Diagnostics

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(42)

# Synthetic returns with regime shifts to emphasise stylised facts
regimes = np.concatenate([
    np.random.normal(0, 0.6, size=250),
    np.random.normal(0, 1.8, size=120),
    np.random.normal(0, 0.9, size=200)
]) / 100
returns = pd.Series(regimes)

kurt = stats.kurtosis(returns, fisher=False)
acf_abs_lag1 = returns.abs().autocorr(lag=1)

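# Mean squared return on down days vs up days
# (a simple semivariance proxy measured around zero, not around the mean)
downside = returns[returns < 0]
upside = returns[returns >= 0]
semivar_down = (downside ** 2).mean()
semivar_up = (upside ** 2).mean()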

roll_mean = returns.rolling(60).mean()
roll_std = returns.rolling(60).std()

fig, ax = plt.subplots(3, 1, figsize=(10,8), sharex=True)
returns.plot(ax=ax[0], color='tab:blue', linewidth=0.8)
ax[0].set_title('Synthetic Returns with Regimes')
ax[0].grid(alpha=0.3)

returns.abs().plot(ax=ax[1], color='tab:orange', linewidth=0.8)
ax[1].set_title('|Returns| Highlight Volatility Clustering')
ax[1].grid(alpha=0.3)

roll_mean.plot(ax=ax[2], color='tab:green', linewidth=1, label='Rolling mean (60)')
roll_std.plot(ax=ax[2], color='tab:red', linewidth=1, label='Rolling std (60)')
ax[2].set_title('Rolling Moments — Regime Signals')
ax[2].legend()
ax[2].grid(alpha=0.3)
ax[2].set_xlabel('Observation')

plt.tight_layout()
plt.show()

print(f"Kurtosis (Gaussian=3): {kurt:.2f}")
print(f"Autocorr |returns| lag 1: {acf_abs_lag1:.2f}")
# Squared returns are of order 1e-4 here, so use scientific notation
print(f"Downside semivariance: {semivar_down:.2e}")
print(f"Upside semivariance:   {semivar_up:.2e}")

- Tail risk: how does the kurtosis compare to the Gaussian benchmark (3)?
- Volatility memory: does the |returns| autocorrelation justify extra lags in your models?
- Asymmetry: what do the downside vs upside semivariances imply for features?
- Regime shifts: where do rolling mean/std change and how should validation windows respond?

Checkpoint: Draft a bullet list of features or diagnostics you would add before escalating model complexity.
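
For the volatility‑memory question, a single lag can hide how slowly the dependence decays. The sketch below extends the lag‑1 statistic to the first ten lags of |returns| and compares each with an approximate white‑noise band of ±1.96/√n. The band is a rough rule of thumb assumed here for illustration, not a formal test, and `max_lag`, `acf_abs`, and `band` are names introduced for this example.

```python
import numpy as np
import pandas as pd

# Rebuild the same synthetic returns as in Task 4
np.random.seed(42)
regimes = np.concatenate([
    np.random.normal(0, 0.6, size=250),
    np.random.normal(0, 1.8, size=120),
    np.random.normal(0, 0.9, size=200)
]) / 100
abs_returns = pd.Series(regimes).abs()

# Autocorrelation of |returns| at lags 1..max_lag
max_lag = 10
acf_abs = pd.Series(
    {lag: abs_returns.autocorr(lag=lag) for lag in range(1, max_lag + 1)}
)

# Approximate 95% band under the assumption of no serial dependence
band = 1.96 / np.sqrt(len(abs_returns))
significant_lags = acf_abs[acf_abs.abs() > band].index.tolist()

print(acf_abs.round(3).to_string())
print(f"Approximate band: ±{band:.3f}")
print(f"Lags outside the band: {significant_lags}")
```

Lags that stay outside the band are candidates for lagged volatility features. Note that persistence in this synthetic series is driven by the regime shifts themselves.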

## Reflection — Complexity Decision Flow

Link your outputs back to the primer deck’s decision flow. In your notes, capture 2–3 bullet points for each step:

1. Evidence of richer signal (bias–variance + stylised facts)
2. Baseline benchmark you would defend
3. Validation design to test complexity honestly
4. Governance checks before promoting Model B (@kelly2024complexity)

::: callout-tip
### Troubleshooting
- If a plot is blank: check variable names and that x/y lengths match.
- If a cell fails: run Setup, then Runtime → Restart and run all (Colab).
- If numbers differ: verify random seeds and parameters.
:::

## Governance & Failure Modes Checklist

- Flag two potential leakage risks in your coursework dataset and note how you will test for them.
- Assign a notional “model steward” and list what they must sign off before deployment.
- Pick one explainability technique (e.g., SHAP, PDP) and describe how it would reduce black-box anxiety for stakeholders.
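
One simple leakage test for the first item is a truncation check: a feature value at time t should not change when observations after t are removed. The sketch below is a minimal illustration under that assumption. The helper `is_causal` and the two example features are hypothetical and only show the idea on synthetic data.

```python
import numpy as np
import pandas as pd


def is_causal(feature_fn, series, cutoff):
    """True if feature values before `cutoff` are unchanged when later data are dropped."""
    full = feature_fn(series).iloc[:cutoff].to_numpy()
    truncated = feature_fn(series.iloc[:cutoff]).to_numpy()
    return bool(np.allclose(full, truncated, equal_nan=True))


def trailing_mean(s):
    # Uses only past observations: 5-period mean, lagged by one step
    return s.rolling(5).mean().shift(1)


def centred_mean(s):
    # Leaky: a centred window looks two observations into the future
    return s.rolling(5, center=True).mean()


rng = np.random.default_rng(1)
series = pd.Series(rng.normal(size=200)).cumsum()

print("trailing_mean causal:", is_causal(trailing_mean, series, cutoff=150))
print("centred_mean causal: ", is_causal(centred_mean, series, cutoff=150))
```

The same check can be pointed at any feature function in your coursework pipeline. A result of `False` means the feature depends on data that would not have been available at prediction time.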

## Save Outputs (optional)

In [None]:
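# Save the Task 4 figure through its handle `fig` (run Task 4 first).
# A bare plt.savefig() in a fresh cell can write a blank image, because
# inline notebook backends close each figure once its cell has finished.
fig.savefig('lab00_last_figure.png', dpi=150, bbox_inches='tight')
print("Saved: lab00_last_figure.png")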

## Exit Ticket (Optional, No Submission)

- Dataset you will explore with a complex model and why it merits richer features.
- Validation design you will implement to prove the complex model beats the baseline.
- Governance or ethical concern you will monitor as you iterate.

::: callout-note
### Further Reading (Hilpisch 2019)
- See our curated list: [Hilpisch Code Resources](../resources/hilpisch-code.qmd) — Week 0 (Primer)
- Chapter 13 notebooks (statistics, ML workflows) show squared‑loss diagnostics, residual analysis, and evaluation patterns consistent with this lab.
:::