# Model Evaluation — From Scratch

Up to Day 6, the focus was on **training models**.
Training alone does not tell us whether a model is useful.

This notebook focuses on **evaluation**:
- how to test models correctly
- how to compute metrics from scratch
- how to diagnose failure cases

The goal is not to say *that* a model is bad,
but to explain **why** it is bad.


## Why Evaluation Matters

A model can:
- perform extremely well on training data
- completely fail on unseen data

Evaluation answers a different question than training:

- Training: Can the model fit the data?
- Evaluation: Can the model generalize?

These questions must never be mixed: a score measured on the training data answers only the first one.
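The gap between fitting and generalizing can be made concrete with a model that simply memorizes its training set. The sketch below is an illustration only: the 1-nearest-neighbour predictor, the helper name `predict_1nn`, and the random data are assumptions that are not part of this notebook. The labels are coin flips, so there is nothing to generalize from.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: random features with coin-flip labels,
# so no model can do better than chance on new points.
X_tr, y_tr = rng.normal(size=(100, 2)), rng.integers(0, 2, size=100)
X_te, y_te = rng.normal(size=(100, 2)), rng.integers(0, 2, size=100)

def predict_1nn(X_ref, y_ref, X_query):
    """Label each query point with the label of its nearest reference point."""
    # Pairwise squared distances, shape (n_query, n_ref)
    d2 = ((X_query[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2)
    return y_ref[d2.argmin(axis=1)]

# On the training set every point is its own nearest neighbour
train_acc = np.mean(predict_1nn(X_tr, y_tr, X_tr) == y_tr)
# On unseen points the memorized labels carry no information
test_acc = np.mean(predict_1nn(X_tr, y_tr, X_te) == y_te)

print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

The training accuracy is perfect by construction, while the test accuracy should sit near chance. Only the second number says anything about generalization.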


In [1]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)


## Dataset

We reuse a binary classification dataset so that:
- failure cases can be inspected
- linear and logistic models can be compared


In [2]:
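n_samples = 200

# Class 0: Gaussian blob centered at (-2, -2)
X0 = np.random.randn(n_samples // 2, 2) + np.array([-2, -2])
y0 = np.zeros(n_samples // 2)

# Class 1: Gaussian blob centered at (2, 2)
X1 = np.random.randn(n_samples // 2, 2) + np.array([2, 2])
y1 = np.ones(n_samples // 2)

# Stack both classes into one feature matrix and one label vector
X = np.vstack([X0, X1])
y = np.hstack([y0, y1])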


## Manual Train / Test Split

The dataset is shuffled and split manually.

The test set must never be used during training.


In [3]:
indices = np.random.permutation(len(X))
split = int(0.8 * len(X))

train_idx = indices[:split]
test_idx = indices[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
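One way to guard against accidental leakage is to wrap the split in a small helper and check that no index lands on both sides. This is a minimal sketch: the function name `manual_split` and its parameters are hypothetical, and it uses its own random generator rather than the notebook's global seed.

```python
import numpy as np

def manual_split(n_samples, train_frac=0.8, seed=0):
    """Shuffle the sample indices once, then cut them into two disjoint parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    split = int(train_frac * n_samples)
    return indices[:split], indices[split:]

train_idx, test_idx = manual_split(10)

# No sample may appear in both the training and the test set
overlap = np.intersect1d(train_idx, test_idx)

print("train size:", len(train_idx), "test size:", len(test_idx))
print("overlapping indices:", overlap)
```

Running the same overlap check on the `train_idx` and `test_idx` defined in the cell above should also return an empty array.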


## Confusion Matrix (From Scratch)

For binary classification:

- True Positive  (TP)
- True Negative  (TN)
- False Positive (FP)
- False Negative (FN)

All evaluation metrics are derived from these counts.


In [4]:
def confusion_matrix(y_true, y_pred):
    TP = np.sum((y_true == 1) & (y_pred == 1))
    TN = np.sum((y_true == 0) & (y_pred == 0))
    FP = np.sum((y_true == 0) & (y_pred == 1))
    FN = np.sum((y_true == 1) & (y_pred == 0))
    return TP, TN, FP, FN
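A small hand-checkable case helps confirm that the four counts behave as intended. The labels below are made up for illustration. The function is repeated so the snippet runs on its own.

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    TP = np.sum((y_true == 1) & (y_pred == 1))
    TN = np.sum((y_true == 0) & (y_pred == 0))
    FP = np.sum((y_true == 0) & (y_pred == 1))
    FN = np.sum((y_true == 1) & (y_pred == 0))
    return TP, TN, FP, FN

# Hypothetical labels: one positive is missed (index 1),
# one negative is flagged by mistake (index 4).
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_demo = np.array([1, 0, 1, 0, 1, 0, 0, 1])

TP, TN, FP, FN = confusion_matrix(y_true_demo, y_pred_demo)
print(f"TP={TP} TN={TN} FP={FP} FN={FN}")  # TP=3 TN=3 FP=1 FN=1

# Every sample falls into exactly one of the four cells
print(TP + TN + FP + FN == len(y_true_demo))
```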


## Evaluation Metrics (From Scratch)

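Accuracy, precision, and recall measure **different aspects** of performance:

- Accuracy: the fraction of all predictions that are correct
- Precision: the fraction of predicted positives that are actually positive
- Recall: the fraction of actual positives that the model finds

When a denominator is zero, precision and recall are defined here as 0.0 by convention.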


In [5]:
def accuracy(TP, TN, FP, FN):
    return (TP + TN) / (TP + TN + FP + FN)

def precision(TP, FP):
    return TP / (TP + FP) if (TP + FP) > 0 else 0.0

def recall(TP, FN):
    return TP / (TP + FN) if (TP + FN) > 0 else 0.0
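Plugging concrete counts into these functions shows how the three numbers can diverge. The counts below are invented for illustration. The functions are repeated so the snippet is self-contained.

```python
def accuracy(TP, TN, FP, FN):
    return (TP + TN) / (TP + TN + FP + FN)

def precision(TP, FP):
    return TP / (TP + FP) if (TP + FP) > 0 else 0.0

def recall(TP, FN):
    return TP / (TP + FN) if (TP + FN) > 0 else 0.0

# Hypothetical counts for a 10-sample test set
TP, TN, FP, FN = 2, 5, 1, 2

acc = accuracy(TP, TN, FP, FN)  # (2 + 5) / 10
prec = precision(TP, FP)        # 2 / (2 + 1)
rec = recall(TP, FN)            # 2 / (2 + 2)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
# accuracy=0.700 precision=0.667 recall=0.500

# A model that never predicts the positive class has TP + FP == 0,
# so the guard returns 0.0 instead of dividing by zero.
print(precision(0, 0))  # 0.0
```

The same model scores 0.7 on accuracy but finds only half of the positives, which is exactly the kind of gap a single number hides.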


## Model 1 — Linear Regression Used as a Classifier (Bad Baseline)

We intentionally misuse linear regression as a classifier
by thresholding its outputs.

This serves as a baseline to show **what not to do**.


In [6]:
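# Train linear regression with the normal equations
# Prepend a column of ones so the first weight acts as the intercept
X_aug = np.hstack([np.ones((len(X_train), 1)), X_train])
# Solve (X^T X) w = X^T y directly instead of forming the inverse explicitly
w_lin = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_train)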


In [7]:
# Test predictions
X_test_aug = np.hstack([np.ones((len(X_test), 1)), X_test])
y_lin_scores = X_test_aug @ w_lin
y_lin_pred = (y_lin_scores >= 0.5).astype(int)
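One reason this counts as misuse is that least-squares outputs are unbounded scores, not probabilities. The sketch below uses a tiny made-up 1D dataset (not the notebook's data) and `np.linalg.lstsq` to show scores falling outside the interval [0, 1], even though thresholding at 0.5 still happens to classify every point correctly.

```python
import numpy as np

# Hypothetical 1D data: three negatives on the left, three positives on the right
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Least-squares fit of y = w0 + w1 * x
A = np.column_stack([np.ones_like(x), x])
w_demo, *_ = np.linalg.lstsq(A, y, rcond=None)

scores = A @ w_demo
preds = (scores >= 0.5).astype(int)

print("scores:", np.round(scores, 3))
print("min score below 0:", scores.min() < 0)
print("max score above 1:", scores.max() > 1)
print("all predictions correct:", np.array_equal(preds, y.astype(int)))
```

The thresholded labels can be right while the raw scores remain impossible to read as probabilities, so a perfect accuracy does not by itself validate the model choice.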


## Model 2 — Logistic Regression (Correct Model)

This model outputs probabilities and is trained using log loss.


In [8]:
def sigmoid(z):
    # Clip so that np.exp cannot overflow for large-magnitude inputs
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))


In [9]:
# Train logistic regression
w = np.zeros(X_train.shape[1])
b = 0.0
lr = 0.1

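for _ in range(300):
    # Forward pass: predicted probabilities
    z = X_train @ w + b
    y_hat = sigmoid(z)

    # Gradients of the mean log loss with respect to w and b
    dw = (1 / len(y_train)) * X_train.T @ (y_hat - y_train)
    db = (1 / len(y_train)) * np.sum(y_hat - y_train)

    # Gradient descent update
    w -= lr * dw
    b -= lr * db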
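The training loop never computes the log loss itself, only its gradient. A quick way to check that training is working is to record the loss at each step and confirm it goes down. The sketch below is an assumption-laden stand-in: the helper `log_loss`, the `eps` clipping value, and the freshly generated data are not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))

def log_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical data shaped like the notebook's: two blobs at (-2, -2) and (2, 2)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(size=(50, 2)) - 2, rng.normal(size=(50, 2)) + 2])
y_demo = np.hstack([np.zeros(50), np.ones(50)])

w_demo, b_demo, lr_demo = np.zeros(2), 0.0, 0.1
losses = []
for _ in range(300):
    y_hat = sigmoid(X_demo @ w_demo + b_demo)
    losses.append(log_loss(y_demo, y_hat))  # loss before this update
    w_demo -= lr_demo * X_demo.T @ (y_hat - y_demo) / len(y_demo)
    b_demo -= lr_demo * np.mean(y_hat - y_demo)

print("first loss:", losses[0])  # log(2): all-zero weights predict 0.5 everywhere
print("last loss: ", losses[-1])
```

A loss that fails to decrease usually points to a learning rate that is too large or a sign error in the gradient.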


In [10]:
# Test predictions
y_log_probs = sigmoid(X_test @ w + b)
y_log_pred = (y_log_probs >= 0.5).astype(int)


## Metric Comparison


In [11]:
# Linear model metrics
TP_l, TN_l, FP_l, FN_l = confusion_matrix(y_test, y_lin_pred)

# Logistic model metrics
TP_g, TN_g, FP_g, FN_g = confusion_matrix(y_test, y_log_pred)

print("Linear Regression (as classifier)")
print("Accuracy:", accuracy(TP_l, TN_l, FP_l, FN_l))
print("Precision:", precision(TP_l, FP_l))
print("Recall:", recall(TP_l, FN_l))

print("\nLogistic Regression")
print("Accuracy:", accuracy(TP_g, TN_g, FP_g, FN_g))
print("Precision:", precision(TP_g, FP_g))
print("Recall:", recall(TP_g, FN_g))


Linear Regression (as classifier)
Accuracy: 1.0
Precision: 1.0
Recall: 1.0

Logistic Regression
Accuracy: 1.0
Precision: 1.0
Recall: 1.0


## Why Accuracy Alone Is Dangerous

Accuracy hides **what kind of mistakes** the model is making.

Two models can have the same accuracy:
- one misses critical positives
- the other makes harmless false alarms

Accuracy cannot distinguish between these cases.
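The two-model scenario above can be reproduced with ten made-up labels. In this sketch, `pred_a` and `pred_b` are hypothetical predictions: model A misses two positives, model B raises two false alarms.

```python
import numpy as np

def counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels."""
    TP = np.sum((y_true == 1) & (y_pred == 1))
    TN = np.sum((y_true == 0) & (y_pred == 0))
    FP = np.sum((y_true == 0) & (y_pred == 1))
    FN = np.sum((y_true == 1) & (y_pred == 0))
    return TP, TN, FP, FN

def report(name, y_true, y_pred):
    TP, TN, FP, FN = counts(y_true, y_pred)
    acc = (TP + TN) / len(y_true)
    prec = TP / (TP + FP) if (TP + FP) > 0 else 0.0
    rec = TP / (TP + FN) if (TP + FN) > 0 else 0.0
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
    return acc, prec, rec

y_true_demo = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
pred_a = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])  # misses two positives
pred_b = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])  # two false alarms

acc_a, prec_a, rec_a = report("Model A", y_true_demo, pred_a)
acc_b, prec_b, rec_b = report("Model B", y_true_demo, pred_b)
```

Both models score 0.8 on accuracy. Model A has perfect precision but a recall of 0.6, while model B has perfect recall but a precision of 5/7. Which one is preferable depends entirely on the cost of each error type.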


## Failure-Case Analysis

We now look for **specific failures** made by the logistic regression model.
This is where evaluation becomes diagnostic.


In [12]:
# Identify false negatives and false positives for logistic regression
false_negatives = np.where((y_test == 1) & (y_log_pred == 0))[0]
false_positives = np.where((y_test == 0) & (y_log_pred == 1))[0]

false_negatives[:5], false_positives[:5]


(array([], dtype=int64), array([], dtype=int64))

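Both arrays are empty: the logistic model makes no errors on this test set, so there is nothing to inspect here. On harder data, the two error types carry different costs.

False negatives: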
- dangerous when positives must not be missed

False positives:
- costly when false alarms matter

Which error matters more depends on the application.
Metrics must be chosen accordingly.
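To see what failure-case inspection looks like when errors actually occur, the same procedure can be run on classes that overlap. The sketch below is a stand-alone illustration under assumed settings: the blob centers at (-0.5, -0.5) and (0.5, 0.5), the sample sizes, and all variable names are hypothetical and chosen only to force some mistakes.

```python
import numpy as np

def sigmoid(z):
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n = 500  # samples per class

# Hypothetical overlapping blobs, so some errors are unavoidable
X_demo = np.vstack([rng.normal(size=(n, 2)) - 0.5, rng.normal(size=(n, 2)) + 0.5])
y_demo = np.hstack([np.zeros(n), np.ones(n)])

idx = rng.permutation(2 * n)
tr, te = idx[:800], idx[800:]

# Same gradient-descent logistic regression as in the notebook
w_demo, b_demo = np.zeros(2), 0.0
for _ in range(300):
    err = sigmoid(X_demo[tr] @ w_demo + b_demo) - y_demo[tr]
    w_demo -= 0.1 * X_demo[tr].T @ err / len(tr)
    b_demo -= 0.1 * err.mean()

X_te, y_te = X_demo[te], y_demo[te]
probs = sigmoid(X_te @ w_demo + b_demo)
preds = (probs >= 0.5).astype(int)

fn_idx = np.where((y_te == 1) & (preds == 0))[0]
fp_idx = np.where((y_te == 0) & (preds == 1))[0]
print("false negatives:", len(fn_idx), "false positives:", len(fp_idx))

# Most confidently wrong first: missed positives with the lowest probability
worst_fn = fn_idx[np.argsort(probs[fn_idx])]
for i in worst_fn[:3]:
    print("missed positive at", np.round(X_te[i], 2), "p =", round(float(probs[i]), 3))
```

Sorting the errors by confidence puts the most surprising mistakes first, which is usually where mislabeled samples or missing features show up.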


## Summary

- Evaluation must use unseen data
- Metrics are derived from confusion counts
- Accuracy alone can be misleading
- Precision and recall expose different failure modes
- A model is judged by *how* it fails, not just *how often*

This completes Week B: ML from scratch.
