# Ethical Reminder (from the Lab Alliance Compact)

- Data belongs to truth, not expectations; document steps transparently.
- Perform **your own analysis**; credit all sources and collaborators properly.
- Communicate respectfully; ask for help early; uphold psychological safety.

> By proceeding, you acknowledge the Compact and agree to act accordingly.

# Linear Regression Pilot: Weighted Least Squares with Uncertainties

**Learning goals**
1. Generate synthetic data with known ground truth.
2. Perform **weighted** linear regression using `scipy.optimize.curve_fit`.
3. Interpret fit parameters, standard errors, and confidence intervals.
4. Diagnose fits via **residuals** and χ² per degree of freedom.
5. Understand the distinction between **statistical** and **systematic** uncertainties.

**Model:**  \( y = a x + b \)

This pilot complements *Part 4 · Statistics & Data Analysis* in the handbook.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
rng = np.random.default_rng(42)

def model(x, a, b):
    return a*x + b

# --- Ground truth and synthetic data ---
a_true, b_true = 2.0, -1.0
N = 30
x = np.linspace(0, 10, N)

# Heteroscedastic noise: sigma grows mildly with x
sigma = 0.2 + 0.02*x
y = model(x, a_true, b_true) + rng.normal(0.0, sigma)

x[:5], y[:5], sigma[:5]

## Visualize data with error bars
Always label axes with units (if applicable) and show uncertainties on measured points.

In [None]:
fig, ax = plt.subplots(figsize=(6,4))
ax.errorbar(x, y, yerr=sigma, fmt='o', capsize=3)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Synthetic data with heteroscedastic noise')
plt.show()

## Weighted fit with `curve_fit`
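`curve_fit` performs a weighted fit when `sigma` is given; `absolute_sigma` controls how those values enter the covariance.

- `sigma` = standard deviation (uncertainty) of each point; each residual is divided by its `sigma`.
- `absolute_sigma=True` tells `curve_fit` to treat these as **absolute** uncertainties.
- The returned covariance `pcov` then reflects the scale of `sigma` directly. With the default `absolute_sigma=False`, only the relative sizes of `sigma` matter and `pcov` is rescaled so that χ²_ν = 1.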
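
To make the effect of `absolute_sigma` concrete, here is a minimal sketch on a separate toy dataset. The names `line`, `x_demo`, `sigma_claimed` and the factor of 3 are illustrative assumptions, not part of the pilot. The uncertainties are deliberately overstated to show how the two settings diverge.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a * x + b

# Hypothetical toy data, same shape as the pilot's synthetic set
rng_demo = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 30)
sigma_demo = 0.2 + 0.02 * x_demo
y_demo = line(x_demo, 2.0, -1.0) + rng_demo.normal(0.0, sigma_demo)

# Assumption for illustration: we claim uncertainties 3x larger than the true noise
sigma_claimed = 3.0 * sigma_demo

popt_abs, pcov_abs = curve_fit(line, x_demo, y_demo,
                               sigma=sigma_claimed, absolute_sigma=True)
popt_rel, pcov_rel = curve_fit(line, x_demo, y_demo,
                               sigma=sigma_claimed, absolute_sigma=False)

# Reduced chi-squared evaluated with the claimed uncertainties
res_demo = y_demo - line(x_demo, *popt_abs)
chi2_red_demo = np.sum((res_demo / sigma_claimed) ** 2) / (x_demo.size - 2)

print("errors (absolute_sigma=True): ", np.sqrt(np.diag(pcov_abs)))
print("errors (absolute_sigma=False):", np.sqrt(np.diag(pcov_rel)))
print("chi2_red with claimed sigma:  ", chi2_red_demo)
```

The best-fit parameters are the same in both cases; only the covariance differs, by a factor equal to the reduced chi-squared. With `absolute_sigma=True` the overstated `sigma` inflates the reported errors, which is the intended behavior when the uncertainties are trusted.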

In [None]:
p0 = [1.0, 0.0]  # initial guess [a, b]
popt, pcov = curve_fit(model, x, y, p0=p0, sigma=sigma, absolute_sigma=True)
a_fit, b_fit = popt
a_err, b_err = np.sqrt(np.diag(pcov))
print(f"Fit: a = {a_fit:.4f} ± {a_err:.4f}, b = {b_fit:.4f} ± {b_err:.4f}")
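
Because the model is linear in its parameters, the weighted fit can be cross-checked against the closed-form weighted least-squares solution \( \hat{\beta} = (A^\top W A)^{-1} A^\top W y \), with \( W = \mathrm{diag}(1/\sigma_i^2) \) and parameter covariance \( (A^\top W A)^{-1} \). The sketch below does this on its own toy data; `line`, `x_chk`, `A`, `w` and the other names are assumptions introduced for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a * x + b

# Hypothetical toy data with heteroscedastic noise
rng_chk = np.random.default_rng(1)
x_chk = np.linspace(0, 10, 30)
sigma_chk = 0.2 + 0.02 * x_chk
y_chk = line(x_chk, 2.0, -1.0) + rng_chk.normal(0.0, sigma_chk)

# Design matrix: one column per parameter, ordered [a, b]
A = np.column_stack([x_chk, np.ones_like(x_chk)])
w = 1.0 / sigma_chk**2  # weights = inverse variances

# Closed-form weighted least squares
cov_closed = np.linalg.inv(A.T @ (w[:, None] * A))
beta_closed = cov_closed @ (A.T @ (w * y_chk))

# Same fit through curve_fit for comparison
popt_chk, pcov_chk = curve_fit(line, x_chk, y_chk,
                               sigma=sigma_chk, absolute_sigma=True)

print("closed form:", beta_closed, np.sqrt(np.diag(cov_closed)))
print("curve_fit:  ", popt_chk, np.sqrt(np.diag(pcov_chk)))
```

Agreement between the two is a quick sanity check that the weights are being applied as intended. For nonlinear models no such closed form exists, and `curve_fit` linearizes around the optimum instead.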

## Goodness-of-fit and residuals
The **reduced** chi-squared (χ²_ν) is informative:
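χ² = Σ [(y_i - f(x_i))² / σ_i²],  χ²_ν = χ²/(N - p), where N is the number of points and p the number of fitted parameters. Values near 1 suggest the stated uncertainties are consistent with the scatter; values well above 1 point to underestimated uncertainties or a poor model, and values well below 1 to overestimated uncertainties.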

In [None]:
y_fit = model(x, *popt)
res = y - y_fit
chi2 = np.sum((res/sigma)**2)
ndof = len(x) - len(popt)
chi2_red = chi2/ndof
print(f"chi2 = {chi2:.2f}, ndof = {ndof}, chi2_red = {chi2_red:.2f}")

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6,6), sharex=True)
ax1.errorbar(x, y, yerr=sigma, fmt='o', capsize=3, label='data')
ax1.plot(x, y_fit, label='fit')
ax1.set_ylabel('y')
ax1.legend()
ax1.set_title('Fit and data')

ax2.axhline(0, lw=1)
ax2.errorbar(x, res, yerr=sigma, fmt='o', capsize=3)
ax2.set_xlabel('x')
ax2.set_ylabel('residuals')
ax2.set_title('Residuals (with y-errors)')
plt.tight_layout()
plt.show()
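
How close to 1 counts as "near 1" depends on the number of degrees of freedom. As a rough guide, assuming Gaussian errors and correct `sigma`, χ² follows a chi-squared distribution with ν = N - p degrees of freedom, so χ²_ν has mean 1 and standard deviation sqrt(2/ν). The sketch below uses `scipy.stats.chi2`; the names `chi2_dist`, `ndof_demo` and `chi2_pvalue` are illustrative, and `ndof_demo = 28` simply mirrors N = 30, p = 2.

```python
from scipy.stats import chi2 as chi2_dist  # renamed to avoid clashing with a `chi2` variable

ndof_demo = 28  # assumed here: N - p = 30 - 2

# Central 95% range of the reduced chi-squared under the assumptions above
lo, hi = chi2_dist.ppf([0.025, 0.975], df=ndof_demo) / ndof_demo

# Standard deviation of the reduced chi-squared: sqrt(2 / ndof)
spread = (2.0 / ndof_demo) ** 0.5

def chi2_pvalue(chi2_value, ndof):
    """Probability of a chi-squared at least this large if model and sigma are right."""
    return chi2_dist.sf(chi2_value, df=ndof)

print(f"95% range of chi2_red: [{lo:.2f}, {hi:.2f}], spread = {spread:.2f}")
print(f"p-value for chi2 = ndof: {chi2_pvalue(ndof_demo, ndof_demo):.2f}")
```

A very small p-value suggests underestimated uncertainties or an inadequate model; a p-value very close to 1 suggests overestimated uncertainties.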

## Confidence intervals
From the covariance matrix `pcov`, the 1σ standard errors are the square roots of the diagonal elements.
Approximate 95% CIs (assuming Gaussian parameter errors): a ± 1.96·σ_a, b ± 1.96·σ_b.
We can also use a **bootstrap** to check robustness.

In [None]:
def ci95(mu, sigma):
    return (mu - 1.96*sigma, mu + 1.96*sigma)

ci_a = ci95(a_fit, a_err)
ci_b = ci95(b_fit, b_err)
print(f"95% CI (cov): a in [{ci_a[0]:.4f}, {ci_a[1]:.4f}], b in [{ci_b[0]:.4f}, {ci_b[1]:.4f}]")

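# --- Simple residual bootstrap ---
# Resample standardized residuals so each point keeps its own noise scale
# (raw residuals would mix small- and large-sigma points under heteroscedastic noise).
z = res / sigma
B = 500
a_boot = np.empty(B)
b_boot = np.empty(B)
for k in range(B):
    # Reuse the seeded generator `rng` so the bootstrap is reproducible
    resampled = rng.choice(z, size=z.size, replace=True)
    y_boot = y_fit + sigma * resampled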
    popt_b, pcov_b = curve_fit(model, x, y_boot, p0=popt, sigma=sigma, absolute_sigma=True)
    a_boot[k], b_boot[k] = popt_b

a_mu, b_mu = np.mean(a_boot), np.mean(b_boot)
a_std, b_std = np.std(a_boot, ddof=1), np.std(b_boot, ddof=1)
print(f"Bootstrap mean±std: a = {a_mu:.4f}±{a_std:.4f}, b = {b_mu:.4f}±{b_std:.4f}")
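
The bootstrap samples can also be summarized as a percentile interval, which is directly comparable to the covariance-based 95% CI. The following is a self-contained sketch under stated assumptions: it uses its own toy data, resamples standardized residuals, and all names (`line`, `x_pct`, `boot`, `B_pct`, ...) are hypothetical rather than taken from the cells above.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a * x + b

# Hypothetical toy data with heteroscedastic noise
rng_pct = np.random.default_rng(7)
x_pct = np.linspace(0, 10, 30)
sigma_pct = 0.2 + 0.02 * x_pct
y_pct = line(x_pct, 2.0, -1.0) + rng_pct.normal(0.0, sigma_pct)

popt_pct, pcov_pct = curve_fit(line, x_pct, y_pct,
                               sigma=sigma_pct, absolute_sigma=True)
y_fit_pct = line(x_pct, *popt_pct)
z_pct = (y_pct - y_fit_pct) / sigma_pct  # standardized residuals

B_pct = 300  # number of bootstrap replicates (arbitrary choice)
boot = np.empty((B_pct, 2))
for k in range(B_pct):
    z_star = rng_pct.choice(z_pct, size=z_pct.size, replace=True)
    y_star = y_fit_pct + sigma_pct * z_star  # rescale to each point's sigma
    boot[k], _ = curve_fit(line, x_pct, y_star, p0=popt_pct,
                           sigma=sigma_pct, absolute_sigma=True)

# 2.5th and 97.5th percentiles per parameter, columns ordered [a, b]
ci_lo, ci_hi = np.percentile(boot, [2.5, 97.5], axis=0)
err_pct = np.sqrt(np.diag(pcov_pct))

print("percentile 95% CI:", list(zip(ci_lo, ci_hi)))
print("covariance 95% CI:", list(zip(popt_pct - 1.96 * err_pct,
                                     popt_pct + 1.96 * err_pct)))
```

Percentile intervals do not assume a Gaussian shape for the parameter distribution, so a marked asymmetry relative to the covariance-based interval is itself a useful diagnostic.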

### Interpreting results
- If χ²_ν≈1, your point-wise uncertainties are broadly consistent.
- Compare covariance-based CIs with bootstrap spread; big gaps hint at model mismatch or non-Gaussian noise.
- **Systematic vs statistical**: we modeled statistical noise only. Systematic effects (e.g., a calibration offset) shift results coherently and must be assessed from the experimental context, **not** inferred from the deviation from literature values.

## Exercises (short)
1. **Weighted vs unweighted**: Repeat the fit with `sigma=None`. Compare parameters, errors, and χ²_ν.
2. **Outliers**: Add one outlier (e.g., at the highest x). How do residuals and CIs change?
3. **Systematic offset**: Add a constant offset to all y-values. Which parameter(s) change and why? How would you detect this experimentally?
4. **Confidence band**: Using the covariance, draw the 95% confidence band of the regression line. What assumptions are implicit?