# 📘 Notebook 1: What Is Modeling, Really?

### 🎯 Learning Objectives
- Understand modeling as a method of approximating real-world processes
- Learn what a 'Data Generating Process' means (without using abbreviations)
- Visualize the difference between reality and a model's internal representation
- Simulate a process and compare linear vs nonlinear approximations
- Question the philosophical and statistical assumptions made during modeling

## 1. What Is a Model?
A model is a **structured simplification** of something we do not fully understand.
In statistical learning, a model is a function that maps input values to predictions about output values.

**Key questions:**
- What assumptions are we making about reality when we model?
- Can a model be useful even if it is wrong?
- What are we giving up in exchange for tractability?

## 2. Simulate a Simple Real-World Process
Let’s simulate a true, known process. We will then try to model it and measure how much we lose in the simplification.

In [1]:

import numpy as np
import pandas as pd
import plotly.express as px

# Simulated data-generating process: nonlinear, noisy
np.random.seed(42)
X = np.linspace(-3, 3, 200)
def true_function(x):
    """Ground truth of the simulation: a quadratic in x."""
    return 1.5 * x**2 - 2 * x + 5
noise = np.random.normal(0, 2, size=X.shape)
y = true_function(X) + noise

df = pd.DataFrame({'x': X, 'y': y, 'truth': true_function(X)})
px.scatter(df, x='x', y='y', title='Simulated Real-World Data (with noise)').show()


### 💡 Concept: Data Generating Process
A **data generating process** is the real-world mechanism (possibly unknown) that produces our data.
In this simulation, we control the mechanism: it’s a quadratic function plus random noise.

**Assumption:** In practice we never observe this function directly. We only see its outputs, which are already corrupted by noise.
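
Written out, the mechanism simulated in the cell above looks like this. The symbols $f$ and $\varepsilon$ are notation introduced here for illustration and do not appear in the code; the noise term corresponds to `np.random.normal(0, 2, ...)`, whose second argument is a standard deviation.

$$
y_i = f(x_i) + \varepsilon_i, \qquad f(x) = 1.5\,x^2 - 2\,x + 5, \qquad \varepsilon_i \sim \mathcal{N}(0,\ 2^2)
$$

Modeling means trying to recover something close to $f$ when only the pairs $(x_i, y_i)$ are available.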

## 3. Try to Approximate with a Linear Model
Now we fit a linear model — a poor but informative simplification.

In [2]:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X.reshape(-1, 1), y)
y_pred = model.predict(X.reshape(-1, 1))

df['y_pred_linear'] = y_pred
fig = px.scatter(df, x='x', y='y', opacity=0.6, title='Linear Model vs Data')
fig.add_scatter(x=X, y=y_pred, name='Linear Prediction')
fig.show()


### 💡 Concept: Approximation Error
- Linear models can only express linear relationships.
- Our true function is nonlinear, so linear regression carries **bias** that no amount of data or training removes: even the best-fitting line misses the curvature.
- **Approximation error** is the gap between the true function and the best model we could possibly fit using this method.
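
The gap described above can be put into numbers. The following is a minimal sketch, not part of the original notebook. It assumes the same grid and quadratic as the simulation cell, and it uses `np.polyfit` as a stand-in for `LinearRegression` because both solve the same least-squares problem for a straight line. The variable names and the `default_rng(42)` generator are choices made for this example, so the noisy sample differs from the one drawn earlier.

```python
import numpy as np

# Same grid and ground truth as the simulation cell.
X = np.linspace(-3, 3, 200)


def true_function(x):
    return 1.5 * x**2 - 2 * x + 5


truth = true_function(X)

# Best straight line for the noise-free truth on this grid (least squares).
slope, intercept = np.polyfit(X, truth, deg=1)
best_line = slope * X + intercept

# Approximation error: what even the best possible line still gets wrong.
approximation_error = np.mean((truth - best_line) ** 2)

# For contrast: a line fitted to one noisy sample of the same process.
rng = np.random.default_rng(42)  # assumption: illustrative seed
y = truth + rng.normal(0, 2, size=X.shape)
noisy_slope, noisy_intercept = np.polyfit(X, y, deg=1)
noisy_line = noisy_slope * X + noisy_intercept
error_vs_truth = np.mean((truth - noisy_line) ** 2)

print(f"Approximation error of the best line: {approximation_error:.2f}")
print(f"Error of the line fitted to noisy data: {error_vs_truth:.2f}")
```

The first number is a floor: a line fitted to noisy data cannot have a lower mean squared error against the truth on this grid, because the best line already minimizes that quantity. Collecting more data shrinks the difference between the two numbers but leaves the floor where it is.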

## 4. Visualize the True Function vs Model

In [3]:

fig = px.line(df, x='x', y='truth', title='True Function vs Linear Approximation')
fig.add_scatter(x=df['x'], y=df['y_pred_linear'], name='Linear Prediction')
fig.show()


## 5. Introduce Philosophical Framing
"All models are wrong, but some are useful" (George Box).

**Key questions:**
- Is it better to have an interpretable but wrong model, or a black-box accurate model?
- How do we know when we’ve over-simplified?
- Are we fitting data or fitting noise?
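
One hedged way to probe the last question is to hold out part of the data and compare errors on the fitted and held-out halves. The sketch below is an illustration under assumptions, not part of the original notebook: the 50/50 split, the seed, and the polynomial degrees (1, 2, 15) are arbitrary choices. It uses `numpy.polynomial.Polynomial.fit` for the least-squares fits.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # assumption: illustrative seed

# Same process as the simulation: quadratic truth plus Gaussian noise.
X = np.linspace(-3, 3, 200)
truth = 1.5 * X**2 - 2 * X + 5
y = truth + rng.normal(0, 2, size=X.shape)

# Random 50/50 split: one half to fit, one half held out.
idx = rng.permutation(X.size)
train, test = idx[:100], idx[100:]

results = {}
for degree in (1, 2, 15):
    # Least-squares polynomial fit on the training half only.
    poly = Polynomial.fit(X[train], y[train], deg=degree)
    train_mse = np.mean((y[train] - poly(X[train])) ** 2)
    test_mse = np.mean((y[test] - poly(X[test])) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:>2}: train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```

Training error can only go down as the degree grows, since each larger polynomial family contains the smaller ones. Held-out error is the more honest signal: it drops sharply from degree 1 to degree 2, and any further training gain at degree 15 that does not carry over to the held-out half is a sign the extra flexibility is absorbing noise.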

## 6. Next Steps
In the next notebook, we’ll formally define and derive the linear regression solution from first principles.
You’ll learn where the solution comes from, what assumptions it hides, and how to extend it step-by-step.

We’ll also begin working with real-world datasets and understand when and why linear models fall short.