# scikit-learn Mental Model — fit / transform / predict

## Objective
This notebook builds a **precise mental model** of how scikit-learn objects behave.

The focus is **not API memorization** but understanding:
- when learning happens
- where state is stored
- why errors occur
- how misuse leads to leakage

If these fundamentals are weak, pipelines and evaluation can fail silently: the code runs, but the reported results are wrong.

This notebook intentionally triggers errors to make the lifecycle explicit.


## What scikit-learn Actually Is

scikit-learn is **not**:
- an AutoML system
- a statistical reasoning engine
- aware of causality or intent

scikit-learn **is**:
> a deterministic execution engine that enforces a strict object lifecycle

It executes exactly what is asked, even if the request is conceptually wrong.

Correctness is **your responsibility**, not the library’s.


## The One Lifecycle (Non-Negotiable)

Every scikit-learn object follows:

$$
\texttt{fit} \;\rightarrow\; \texttt{transform (optional)} \;\rightarrow\; \texttt{predict}
$$

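- **fit** → learn parameters from data and store them on the object
- **transform** → apply learned parameters to re-express $X$ (transformers only)
- **predict** → generate outputs using learned parameters (predictors only)

Not every object implements all three steps, but `fit` always comes first.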

Nothing else happens behind the scenes.
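
The lifecycle above can be sketched end to end. This is a minimal illustration, not part of the original notebook: the toy data and the names `X_demo`, `y_demo`, `demo_scaler`, and `demo_model` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data invented for this sketch: y is exactly 2 * x.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])

# fit: the scaler learns mean_ and scale_ from X_demo.
demo_scaler = StandardScaler().fit(X_demo)

# transform: the learned statistics are applied; nothing new is learned.
X_scaled = demo_scaler.transform(X_demo)

# fit: the regressor learns coef_ and intercept_ from the scaled features.
demo_model = LinearRegression().fit(X_scaled, y_demo)

# predict: new data must pass through the same learned statistics first.
X_new = np.array([[5.0]])
y_new = demo_model.predict(demo_scaler.transform(X_new))
print(y_new)  # close to 10 for this exactly linear toy data
```

Each call either writes learned state (`fit`) or only reads it (`transform`, `predict`).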


In [1]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # configuration only; no data has been seen


## Instantiation ≠ Learning

At this point:
- no data has been seen
- no statistics have been computed
- no parameters exist

The object only holds **configuration**, not knowledge.

This distinction is critical for understanding leakage.
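
A minimal sketch of configuration versus knowledge. The name `fresh_scaler` is hypothetical and kept separate from the notebook's `scaler`; `with_mean=False` is an arbitrary setting chosen only to show that constructor arguments are stored.

```python
from sklearn.preprocessing import StandardScaler

# Hypothetical instance for illustration; not the notebook's `scaler`.
fresh_scaler = StandardScaler(with_mean=False)

# get_params() returns the constructor settings and works before any data is seen.
config = fresh_scaler.get_params()
print(config)

# Learned statistics are a different matter: none exist before fit().
has_learned_state = hasattr(fresh_scaler, "mean_")
print(has_learned_state)  # False
```

`get_params()` answers "how was this object configured?", while underscore attributes answer "what has it learned?".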


In [2]:
hasattr(scaler, "mean_"), hasattr(scaler, "scale_")


(False, False)

## The Underscore (`_`) Contract

In scikit-learn:

> Any attribute ending in `_` exists **only after** `fit()` is called.

Examples:
- `mean_`
- `scale_`
- `coef_`
- `intercept_`
- `classes_`

This is a **visual guarantee**:
- no underscore → no learned state
- underscore present → learning has occurred

Ignoring this contract means operating blindly.
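
The contract can be observed directly by listing underscore attributes before and after `fit()`. This is a sketch under assumptions: `learned_attributes` is a hypothetical helper, and `demo_scaler` and `heights` are illustration names, so the notebook's own `scaler` stays unfitted for the next cell.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def learned_attributes(estimator):
    """Hypothetical helper: public instance attributes ending in an underscore."""
    return sorted(
        name for name in vars(estimator)
        if name.endswith("_") and not name.startswith("_")
    )

heights = np.array([[160.0], [170.0], [180.0]])  # toy data for this sketch
demo_scaler = StandardScaler()

before = learned_attributes(demo_scaler)  # nothing learned yet
demo_scaler.fit(heights)
after = learned_attributes(demo_scaler)   # now includes mean_ and scale_

print(before)
print(after)
```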


In [3]:
scaler.transform([[170], [180]])  # called before fit() on purpose


NotFittedError: This StandardScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

## Why `NotFittedError` Exists

`transform()` requires previously learned parameters:

$$
z = \frac{x - \mu}{\sigma}
$$

Neither parameter exists yet, because `fit()` was never called:
- $\mu$ (mean), stored as `mean_` after fitting
- $\sigma$ (standard deviation), stored as `scale_` after fitting

Instead of guessing, scikit-learn **refuses to proceed**.

This error enforces the lifecycle:
$$
\texttt{fit} \;\rightarrow\; \texttt{transform}
$$
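
A small sketch of both halves of this section: the refusal before `fit()`, and the formula applied after it. The names `demo_scaler` and `heights` are assumptions for the example; the manual computation uses NumPy's default population standard deviation, which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler

heights = np.array([[160.0], [170.0], [180.0]])  # toy data for this sketch
demo_scaler = StandardScaler()

# Before fit(): transform() has no mean or standard deviation to use.
try:
    demo_scaler.transform(heights)
    raised = False
except NotFittedError:
    raised = True
print(raised)  # True

# After fit(): transform() computes z = (x - mu) / sigma.
demo_scaler.fit(heights)
z_sklearn = demo_scaler.transform(heights)
z_manual = (heights - heights.mean(axis=0)) / heights.std(axis=0)
print(np.allclose(z_sklearn, z_manual))  # True
```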


## Estimators vs Transformers

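### Estimator
- implements `fit()`
- usually implements `predict()` (scikit-learn's glossary calls such objects *predictors*)
- learns a mapping $X \rightarrow y$

Examples:
- `LinearRegression`
- `LogisticRegression`

### Transformer
- implements `fit()`
- implements `transform()`
- changes the representation of $X$

Examples:
- `StandardScaler`
- `OneHotEncoder`

Some objects do **both**, but the roles remain distinct. Strictly speaking, transformers are estimators too: in scikit-learn's terminology, any object with `fit()` is an estimator.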
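
The roles can be checked by asking each object which methods it exposes. In this sketch `roles` is a hypothetical helper, and `KMeans` is used as one example of an object that offers both `transform()` and `predict()`.

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def roles(obj):
    """Hypothetical helper: report which lifecycle methods an object exposes."""
    return {name: hasattr(obj, name) for name in ("fit", "transform", "predict")}

print(roles(StandardScaler()))      # fit and transform, no predict
print(roles(LinearRegression()))    # fit and predict, no transform
print(roles(KMeans(n_clusters=2)))  # all three
```

No fitting is needed for this check, because the methods exist on the object from instantiation; only the learned state is missing.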


## The Data Shape Contract (Critical)

scikit-learn enforces strict shape rules.

### Features ($X$)
- must be **2D**
- shape:
$$
(n_{\text{samples}}, n_{\text{features}})
$$

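### Target ($y$)
- is normally **1D** (multi-output estimators also accept a 2D target of shape $(n_{\text{samples}}, n_{\text{outputs}})$)
- shape:
$$
(n_{\text{samples}},)
$$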

Even with a single feature, $X$ **must remain 2D**.

This consistency enables:
- pipelines
- cross-validation
- metric computation
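
A minimal sketch of the shape rules for the single-output case. The arrays and the `check_shapes` helper are assumptions made for illustration; scikit-learn performs its own validation internally.

```python
import numpy as np

# Toy values for this sketch.
areas = np.array([1000, 1500, 2000])  # shape (3,): 1D, not valid as X
prices = np.array([100, 150, 200])    # shape (3,): 1D, valid as y

# One feature still needs an explicit column axis.
X_demo = areas.reshape(-1, 1)         # shape (3, 1)

def check_shapes(X, y):
    """Hypothetical helper: 2D features, 1D target, matching sample counts."""
    return X.ndim == 2 and y.ndim == 1 and X.shape[0] == y.shape[0]

print(X_demo.shape)
print(check_shapes(X_demo, prices))  # True
print(check_shapes(areas, prices))   # False: X is 1D
```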


In [4]:
from sklearn.linear_model import LinearRegression

X = [1000, 1500, 2000]   # wrong shape
y = [100, 150, 200]

model = LinearRegression()
model.fit(X, y)  # fails on purpose: X is 1D


ValueError: Expected 2D array, got 1D array instead:
array=[1000 1500 2000].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

## Why the Shape Error Is Conceptual (Not Just Syntax)

A 1D list:
$$
[1000, 1500, 2000]
$$

does not say:
- how many samples there are
- how many features each sample has

scikit-learn cannot infer whether this is:
- 3 samples × 1 feature
- 1 sample × 3 features

Ambiguity is rejected by design.

Engineering systems require **explicit structure**.
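
The two readings of the same three numbers can be made explicit. This is a sketch with assumed names (`values`, `targets`, `as_samples`, `as_features`); only the first reading is consistent with three target values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

values = np.array([1000, 1500, 2000])
targets = np.array([100, 150, 200])

as_samples = values.reshape(-1, 1)   # 3 samples x 1 feature
as_features = values.reshape(1, -1)  # 1 sample x 3 features

# Consistent: 3 rows in X, 3 values in y.
LinearRegression().fit(as_samples, targets)

# Inconsistent: 1 row in X, 3 values in y.
try:
    LinearRegression().fit(as_features, targets)
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)

print(as_samples.shape, as_features.shape)
```

Once the structure is stated explicitly, scikit-learn can check it against $y$ instead of guessing.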


In [5]:
import numpy as np

X_correct = np.array([[1000], [1500], [2000]])  # 2D: 3 samples x 1 feature
y_correct = np.array([100, 150, 200])           # 1D: one target per sample

model.fit(X_correct, y_correct)


LinearRegression()


In [6]:
model.coef_, model.intercept_


(array([0.1]), np.float64(0.0))

## Inspecting Learned Parameters Is Mandatory

Inspecting parameters:
- confirms training occurred
- reveals model behavior
- prevents silent failures

If you never inspect `coef_`, `mean_`, or similar attributes,
you are trusting results blindly.

That is not engineering.
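
One way to make inspection concrete is to compare the learned parameters with a hand computation. The sketch below assumes the same toy numbers as the earlier cells and uses the closed-form slope and intercept for simple linear regression; `demo_model`, `X_demo`, and `y_demo` are illustration names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_demo = np.array([[1000.0], [1500.0], [2000.0]])
y_demo = np.array([100.0, 150.0, 200.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

# Closed-form simple regression: slope = cov(x, y) / var(x).
x = X_demo[:, 0]
slope = np.sum((x - x.mean()) * (y_demo - y_demo.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y_demo.mean() - slope * x.mean()

print(demo_model.coef_, demo_model.intercept_)
print(slope, intercept)
```

If the two printed lines disagree beyond floating-point noise, something about the data or the fit is not what you assumed.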


## `predict()` Is Pure Inference

`predict()`:
- does not update parameters
- does not adapt
- does not learn

If predictions change, the cause is **external**:
- data changed
- model was re-fit
- randomness was uncontrolled
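
The claim can be checked by snapshotting the learned state around repeated `predict()` calls. A minimal sketch with assumed names (`demo_model`, `query`) and the same toy data as the earlier cells:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_demo = np.array([[1000.0], [1500.0], [2000.0]])
y_demo = np.array([100.0, 150.0, 200.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

# Snapshot the learned state before any inference.
coef_before = demo_model.coef_.copy()
intercept_before = demo_model.intercept_

query = np.array([[1800.0]])
first = demo_model.predict(query)
second = demo_model.predict(query)

# predict() only reads the learned parameters; it never writes them.
state_unchanged = bool(
    np.array_equal(coef_before, demo_model.coef_)
    and intercept_before == demo_model.intercept_
)
same_output = bool(np.array_equal(first, second))
print(state_unchanged, same_output)  # True True
```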


In [7]:
model.predict([[1800]])


array([180.])

In [8]:
model.predict([[1800]])


array([180.])

## Final Mental Checksum

Before running any cell, you must be able to answer:

> What has been fit?  
> On which data?  
> What state exists inside the object right now?

If you cannot answer this, stop and reassess.

This mental discipline prevents:
- leakage
- invalid evaluation
- false confidence
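
The three questions can be applied to a small leakage scenario. This is an illustrative sketch, not part of the original notebook: the train/test values and the names `leaky` and `clean` are assumptions, with an extreme test value chosen to make the effect obvious.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy split invented for this sketch.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[100.0]])

# Leaky: the scaler is fit on train and test together,
# so test-set information ends up inside mean_ and scale_.
leaky = StandardScaler().fit(np.vstack([X_train, X_test]))

# Clean: the scaler is fit on training data only.
clean = StandardScaler().fit(X_train)

print(leaky.mean_)  # [26.5]
print(clean.mean_)  # [2.]

# The test set is only ever transformed, never fit.
X_test_scaled = clean.transform(X_test)
```

What has been fit? `clean`. On which data? `X_train` only. What state exists? `mean_` and `scale_` computed without ever seeing `X_test`.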

