# Day 02 — Building a Baseline Classification Model

## What Is a Baseline Model and Why Every ML Project Needs One

A **baseline model** is the simplest reasonable model you build at the very start of a machine learning project. It serves as a **reference point** — a line in the sand that every future model must beat to justify its added complexity.

### The "Beat the Baseline" Philosophy

In professional ML engineering, the workflow looks like this:

1. **Understand the problem** — What are we predicting? What data do we have?
2. **Build a baseline** — The simplest model that produces reasonable predictions.
3. **Iterate and improve** — Try more complex models, feature engineering, hyperparameter tuning.
4. **Always compare back to baseline** — If a fancy deep learning model only beats logistic regression by 0.2%, is the added complexity worth it?

Without a baseline, you have no way to answer the question: *"Is my model actually good, or does it just look good?"*

### What We Will Cover in This Notebook

- **Theory**: Classification problems, why baselines matter, logistic regression internals, feature scaling, evaluation metrics (accuracy, precision, recall, F1, ROC-AUC), confusion matrices, threshold tuning
- **Practice**: Loading a real medical dataset (Breast Cancer Wisconsin), building a logistic regression baseline with scikit-learn Pipelines, evaluating with multiple metrics, plotting ROC and Precision-Recall curves, threshold tuning, and comparing against dummy/random baselines

By the end, you will have a fully evaluated baseline model and a deep understanding of *why* each step matters.

---

## Theory: The Classification Problem

Classification is a **supervised learning** task where the goal is to predict a **discrete label** (a category) for each input sample.

### Binary vs Multi-Class Classification

| Type | # of Classes | Examples |
|------|-------------|----------|
| **Binary** | 2 | Spam vs Not Spam, Malignant vs Benign, Fraud vs Legitimate |
| **Multi-class** | 3+ | Digit recognition (0-9), Sentiment (positive/neutral/negative), Disease type |

### Real-World Classification Examples

- **Healthcare**: Given tumor measurements, predict whether a tumor is malignant or benign. A false negative (missing cancer) can be fatal, so **recall** is critical.
- **Finance**: Given transaction features, predict whether a transaction is fraudulent. Datasets are heavily **imbalanced** (99.9% legitimate), so accuracy alone is meaningless.
- **NLP**: Given an email's text, predict whether it is spam. A false positive (marking a real email as spam) annoys users, so **precision** matters.

### Decision Boundaries

Every classifier learns a **decision boundary** — a line (or surface, in higher dimensions) that separates the classes in feature space. For a simple 2D problem:

- **Logistic Regression** learns a straight line (linear boundary)
- **Decision Trees** learn axis-aligned rectangular regions
- **Neural Networks** learn complex, curved boundaries

The key insight: **simpler boundaries tend to generalize better** when data is limited. This is why logistic regression makes an excellent baseline: a linear boundary has few degrees of freedom, so it is much harder to overfit than a deep tree or a large neural network.

---

## Theory: Why Baselines Matter

There are several levels of baselines, from trivial to useful:

### 1. Random Baseline
Predict classes completely at random. For a balanced binary problem, this gives ~50% accuracy. **Any real model must beat this.**

### 2. Majority Class Baseline ("Most Frequent")
Always predict the most common class. If 90% of patients are healthy, predicting "healthy" for everyone gives 90% accuracy — but catches zero sick patients. **This exposes why accuracy alone is dangerous.**

### 3. Simple Model Baseline
A straightforward model like logistic regression or a shallow decision tree. This is the **practical baseline** — it uses actual patterns in the data but with minimal complexity.

### How Baselines Prevent Wasted Effort

Consider this scenario:
- You spend 3 weeks building a complex ensemble model with 150 features
- It achieves 94% accuracy
- You celebrate... until you realize a majority-class prediction gives 93% accuracy
- Your complex model is barely better than doing nothing

**Baselines give you a reality check.** They tell you:
- How much room for improvement exists
- Whether your complex model is actually learning useful patterns
- What the minimum acceptable performance is
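
The majority-class trap described above can be sketched on made-up data. The labels below (90 healthy, 10 sick) and the `*_toy` names are assumptions for illustration only, not part of the dataset used later in this notebook:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical toy labels: 90 healthy patients (0) and 10 sick patients (1).
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((100, 1))  # DummyClassifier ignores the feature values

# Always predict the most frequent class seen during fit.
majority_clf = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_toy_pred = majority_clf.predict(X_toy)

toy_accuracy = accuracy_score(y_toy, y_toy_pred)
toy_recall = recall_score(y_toy, y_toy_pred, zero_division=0)

print(f"Majority-class accuracy: {toy_accuracy:.2f}")
print(f"Recall on the sick class: {toy_recall:.2f}")
```

High accuracy paired with zero recall on the class of interest is the pattern to watch for when comparing a trained model against this kind of baseline.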

> *"A model is only as good as the baseline it beats."*

---

## Setup & Imports

In [None]:
# =============================================================================
# IMPORTS — Each library serves a specific purpose in our ML pipeline
# =============================================================================

# NumPy: The foundation of numerical computing in Python.
# We use it for array operations, mathematical functions, and threshold arrays.
import numpy as np

# Pandas: Data manipulation and analysis library.
# Provides DataFrames — tabular data structures that make EDA intuitive.
import pandas as pd

# Matplotlib: The foundational plotting library in Python.
# We use it for creating ROC curves, PR curves, and threshold plots.
import matplotlib.pyplot as plt

# Seaborn: Statistical visualization built on top of matplotlib.
# Makes it easy to create informative, attractive confusion matrix heatmaps.
import seaborn as sns

# sklearn.datasets: Provides well-known benchmark datasets.
# load_breast_cancer gives us a real-world medical classification dataset.
from sklearn.datasets import load_breast_cancer

# sklearn.model_selection: Tools for splitting data and validating models.
# train_test_split creates hold-out sets for unbiased evaluation.
from sklearn.model_selection import train_test_split

# sklearn.preprocessing: Data transformation tools.
# StandardScaler standardizes features to mean=0, std=1 — critical for logistic regression.
from sklearn.preprocessing import StandardScaler

# sklearn.pipeline: Chains preprocessing and model steps together.
# Prevents data leakage by ensuring the scaler is fit only on training data.
from sklearn.pipeline import Pipeline

# sklearn.linear_model: Linear models for classification and regression.
# LogisticRegression is our baseline classifier — simple, interpretable, effective.
from sklearn.linear_model import LogisticRegression

# sklearn.dummy: "Dummy" models that use simple rules (no learning).
# DummyClassifier helps us establish the absolute floor for model performance.
from sklearn.dummy import DummyClassifier

# sklearn.metrics: Functions to evaluate model performance.
# We import individual metric functions for fine-grained control.
from sklearn.metrics import (
    accuracy_score,        # Overall correctness: (TP+TN) / total
    precision_score,       # Of predicted positives, how many are correct: TP / (TP+FP)
    recall_score,          # Of actual positives, how many did we find: TP / (TP+FN)
    f1_score,              # Harmonic mean of precision and recall
    roc_auc_score,         # Area under the ROC curve
    confusion_matrix,      # 2x2 table of TP, TN, FP, FN
    classification_report, # Formatted summary of precision, recall, F1 per class
    roc_curve,             # Computes FPR and TPR at various thresholds
    precision_recall_curve, # Computes precision and recall at various thresholds
    average_precision_score # Summarizes the PR curve as a weighted mean of precisions
)

# Configure matplotlib for clean, readable plots
plt.rcParams['figure.figsize'] = (8, 5)  # Default figure size
plt.rcParams['figure.dpi'] = 100          # Resolution for notebook display
plt.rcParams['font.size'] = 11            # Base font size for readability

# Fixed seed, passed explicitly as random_state wherever randomness is involved.
# This keeps splits reproducible for a given scikit-learn version.
RANDOM_STATE = 42

print("All imports successful. Ready to build our baseline model!")

---

## Loading the Dataset: Breast Cancer Wisconsin

The **Breast Cancer Wisconsin (Diagnostic)** dataset is one of the most widely used datasets in machine learning education. It was created from digitized images of fine needle aspirates (FNA) of breast masses.

- **569 samples** (patients)
- **30 features** computed from cell nuclei measurements: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension, each summarized by its mean, standard error, and "worst" value (the mean of the three largest measurements)
- **2 target classes**: Malignant (0) and Benign (1)

This is a real-world medical classification problem where the stakes are high: missing a malignant tumor (false negative) could cost a life.

In [None]:
# =============================================================================
# LOAD THE BREAST CANCER WISCONSIN DATASET
# =============================================================================

# load_breast_cancer() returns a Bunch object (similar to a dictionary)
# containing the data, target, feature names, target names, and a description.
cancer_data = load_breast_cancer()

# Convert to a pandas DataFrame for easier exploration and manipulation.
# cancer_data.data is a NumPy array of shape (569, 30) — the feature matrix.
# cancer_data.feature_names provides human-readable column names.
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)

# Add the target column to our DataFrame.
# 0 = malignant (cancerous), 1 = benign (non-cancerous)
df['target'] = cancer_data.target

# Create a human-readable label for easier interpretation in outputs.
# np.where acts like an if-else: if target==0 -> 'malignant', else -> 'benign'
df['diagnosis'] = np.where(df['target'] == 0, 'malignant', 'benign')

# Display the first 5 rows to get a feel for the data.
# Notice the features are continuous measurements (floats) on very different scales.
print(f"Dataset shape: {df.shape[0]} samples, {df.shape[1] - 2} features")
print(f"Target classes: {cancer_data.target_names}")
print(f"\nFeature names (first 10):")
for i, name in enumerate(cancer_data.feature_names[:10]):
    print(f"  {i+1}. {name}")
print(f"  ... and {len(cancer_data.feature_names) - 10} more\n")

df.head()

---

## Data Exploration (EDA)

In [None]:
# =============================================================================
# EXPLORATORY DATA ANALYSIS — Understanding our data before modeling
# =============================================================================

# Check the overall shape: how many rows (patients) and columns (features + target)
print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)
print(f"Number of samples (patients): {df.shape[0]}")
print(f"Number of features: {df.shape[1] - 2}")  # Subtract target and diagnosis columns

# Check for missing values — a critical first step in any ML project.
# Missing values can break models or silently degrade performance.
missing = df.isnull().sum().sum()
print(f"Total missing values: {missing}")
print()  # Blank line for readability

# =============================================================================
# CLASS DISTRIBUTION — Is the dataset balanced or imbalanced?
# This directly affects our choice of evaluation metrics.
# =============================================================================
print("=" * 60)
print("CLASS DISTRIBUTION")
print("=" * 60)

# value_counts() shows how many samples belong to each class
class_counts = df['diagnosis'].value_counts()
print(class_counts)
print()

# Calculate the percentage of each class — important for understanding imbalance.
# If one class dominates (e.g., 95%), accuracy becomes a misleading metric.
class_pct = df['diagnosis'].value_counts(normalize=True) * 100
print("Class percentages:")
for label, pct in class_pct.items():
    print(f"  {label}: {pct:.1f}%")

# The dataset is moderately imbalanced (~63% benign, ~37% malignant).
# This means a majority-class baseline would achieve ~63% accuracy.
print()

# =============================================================================
# FEATURE STATISTICS — Understanding the scale and spread of each feature
# =============================================================================
print("=" * 60)
print("FEATURE STATISTICS (first 10 features)")
print("=" * 60)

# .describe() gives count, mean, std, min, 25%, 50%, 75%, max for each column.
# Notice the wildly different scales: mean radius ~6-28, mean area ~143-2501.
# This is why feature scaling is essential for logistic regression.
feature_cols = cancer_data.feature_names[:10]  # Show first 10 for readability
df[feature_cols].describe().round(2)

---

## Theory: Train/Test Split

### Why Split the Data?

The fundamental goal of machine learning is **generalization** — performing well on *new, unseen data*, not just the data we trained on. If we evaluate a model on the same data it was trained on, we get an **overly optimistic** estimate of performance.

A **train/test split** partitions the data into:
- **Training set**: Used to fit (train) the model. The model sees these examples and learns patterns.
- **Test set**: Held back and used *only* for evaluation. The model never sees these during training.

### Data Leakage

**Data leakage** occurs when information from the test set "leaks" into the training process. Common causes:
- Fitting a scaler on the full dataset before splitting (the scaler "knows" test set statistics)
- Using future information as features (e.g., using tomorrow's stock price to predict today's)
- Performing feature selection on the full dataset

**This is why we use Pipelines** — they ensure the scaler is fit *only* on training data.
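
As a minimal sketch of the first cause, the snippet below compares a scaler fit on all rows with one fit on the training rows only. The data is synthetic, and the `*_demo` names and distribution parameters are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_demo_train, X_demo_test = train_test_split(X_demo, test_size=0.25, random_state=0)

# Leaky: the statistics include the rows we will later evaluate on.
leaky_scaler = StandardScaler().fit(X_demo)

# Correct: the statistics come from the training rows only.
clean_scaler = StandardScaler().fit(X_demo_train)
X_demo_train_scaled = clean_scaler.transform(X_demo_train)
X_demo_test_scaled = clean_scaler.transform(X_demo_test)

print("Means seen by leaky scaler:", np.round(leaky_scaler.mean_, 3))
print("Means seen by clean scaler:", np.round(clean_scaler.mean_, 3))
print("Scaled test means (not exactly 0):", np.round(X_demo_test_scaled.mean(axis=0), 3))
```

The scaled test set is not expected to have exactly zero mean. That is the honest outcome, since the test rows played no part in computing the statistics.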

### Stratification

When classes are imbalanced, a random split might put too many of one class in the train set and too few in the test set. **Stratified splitting** ensures both sets have the same class proportions as the original data.
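
A small sketch of the effect, using made-up labels with a 90/10 class ratio (the `*_imb` names and the ratio are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives and 10 positives.
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(-1, 1)

# Stratified split: the test set keeps the 10% positive rate.
_, _, _, y_test_strat = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=0, stratify=y_imb
)

# Plain random split: the positive rate in the test set is left to chance.
_, _, _, y_test_plain = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=0
)

print(f"Stratified test positives: {y_test_strat.sum()} of {len(y_test_strat)}")
print(f"Unstratified test positives: {y_test_plain.sum()} of {len(y_test_plain)}")
```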

### Common Split Ratios

| Ratio | Train | Validation | Test | Use Case |
|-------|-------|------------|------|----------|
| 80/20 | 80% | — | 20% | Simple projects, small datasets |
| 60/20/20 | 60% | 20% | 20% | When you need a validation set for tuning |
| 70/15/15 | 70% | 15% | 15% | Common in industry |
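
`train_test_split` produces two parts per call, so one way to get a three-way split such as 60/20/20 is to call it twice. The sketch below uses placeholder data, and the `*_pool`, `*_fit`, `*_val`, and `*_hold` names are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with balanced binary labels.
X_pool = np.arange(100).reshape(-1, 1)
y_pool = np.array([0, 1] * 50)

# First call: hold out 20% of all samples as the final test set.
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X_pool, y_pool, test_size=0.20, random_state=0, stratify=y_pool
)

# Second call: 25% of the remaining 80% equals 20% of the original data.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

print(f"train={len(X_fit)}, validation={len(X_val)}, test={len(X_hold)}")
```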

### `random_state` for Reproducibility

Setting `random_state=42` (or any fixed integer) ensures the split is identical every time you run the notebook. This is critical for **reproducible research**: anyone running your code with the same library versions gets the same split.

In [None]:
# =============================================================================
# TRAIN/TEST SPLIT
# =============================================================================

# Separate features (X) from the target (y).
# X contains the 30 numeric measurements; y contains the binary label (0 or 1).
X = df[cancer_data.feature_names]  # Use only the 30 feature columns
y = df['target']                    # 0 = malignant, 1 = benign

# Perform an 80/20 split with stratification.
# stratify=y ensures the class ratio (~37% malignant, ~63% benign) is preserved
# in both the training and test sets.
# random_state=42 makes the split deterministic and reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,         # 20% of data reserved for testing
    random_state=RANDOM_STATE,  # Reproducibility
    stratify=y              # Preserve class proportions in both sets
)

# Verify the split sizes
print("=" * 60)
print("TRAIN/TEST SPLIT RESULTS")
print("=" * 60)
print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.0f}%)")
print(f"Test set:     {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.0f}%)")
print()

# Verify stratification worked correctly.
# Both sets should have approximately the same class proportions.
print("Class distribution verification:")
print(f"  Overall:  malignant={sum(y==0)/len(y)*100:.1f}%, benign={sum(y==1)/len(y)*100:.1f}%")
print(f"  Training: malignant={sum(y_train==0)/len(y_train)*100:.1f}%, benign={sum(y_train==1)/len(y_train)*100:.1f}%")
print(f"  Test:     malignant={sum(y_test==0)/len(y_test)*100:.1f}%, benign={sum(y_test==1)/len(y_test)*100:.1f}%")
print()
print("Stratification confirmed: class proportions are consistent across splits.")

---

## Theory: Logistic Regression

Despite its name, **logistic regression** is a **classification** algorithm, not a regression algorithm. It is one of the most important algorithms in machine learning and statistics.

### How It Works

Logistic regression models the **probability** that an input belongs to the positive class:

$$P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-z}}$$

where $z = \mathbf{w}^T \mathbf{x} + b = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$

### The Sigmoid Function

The function $\sigma(z) = \frac{1}{1 + e^{-z}}$ is called the **sigmoid** (or logistic) function. It maps any real number to the range (0, 1), making the output interpretable as a probability:

- When $z \to +\infty$: $\sigma(z) \to 1$ (confident positive prediction)
- When $z = 0$: $\sigma(z) = 0.5$ (maximum uncertainty)
- When $z \to -\infty$: $\sigma(z) \to 0$ (confident negative prediction)
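
The three limiting cases above can be checked numerically. This is a minimal NumPy sketch; the `sigmoid` helper is written here for illustration and is not a scikit-learn function:

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A strongly negative score, a neutral score, and a strongly positive score.
z_values = np.array([-10.0, 0.0, 10.0])
probs = sigmoid(z_values)

for z, p in zip(z_values, probs):
    print(f"z = {z:+.1f} -> sigma(z) = {p:.5f}")
```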

### Log-Odds (Logit)

The inverse of the sigmoid is the **logit** function: $z = \log\frac{P}{1-P}$. This is the log of the **odds** $\frac{P}{1-P}$ (not the odds *ratio*, which compares the odds of two groups). Logistic regression is fundamentally a **linear model in log-odds space**: it learns a linear combination of features that predicts the log-odds of the positive class.
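
To make the inverse relationship concrete, the sketch below converts a few probabilities to log-odds and back. Both helper functions (`logit_fn`, `sigmoid_fn`) are written here for illustration:

```python
import numpy as np

def logit_fn(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

def sigmoid_fn(z):
    """Inverse of logit_fn: maps log-odds back to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

p_demo = np.array([0.1, 0.5, 0.9])
log_odds_demo = logit_fn(p_demo)
p_recovered = sigmoid_fn(log_odds_demo)

for p, z in zip(p_demo, log_odds_demo):
    print(f"P = {p:.1f} -> odds = {p / (1 - p):.3f}, log-odds = {z:+.3f}")
```

A probability of 0.5 corresponds to even odds and a log-odds of zero, which is why $z = 0$ marks the point of maximum uncertainty.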

### Decision Boundary

The default decision rule is: predict class 1 if $P(y = 1 \mid \mathbf{x}) \geq 0.5$, which corresponds to $z \geq 0$. This creates a **linear decision boundary** in feature space (a hyperplane).

### Regularization

Scikit-learn's `LogisticRegression` applies **regularization** by default to prevent overfitting:

- **L2 (Ridge)**: Default. Penalizes large weights: $\text{Cost} + \lambda \sum w_i^2$. Shrinks all weights toward zero.
- **L1 (Lasso)**: Penalizes with absolute values: $\text{Cost} + \lambda \sum |w_i|$. Can drive some weights to exactly zero (feature selection).
- **C parameter**: Inverse of regularization strength. Smaller C = stronger regularization. Default is C=1.0.
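
The effect of `C` can be seen by fitting the same model at several regularization strengths and comparing the size of the learned weights. This is a rough sketch under assumptions: the three `C` values are arbitrary choices, and for brevity it fits on the full dataset, which is acceptable only because nothing is being evaluated here:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_reg, y_reg = load_breast_cancer(return_X_y=True)

coef_norms = {}
for c_value in (0.01, 1.0, 100.0):
    reg_model = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=c_value, max_iter=10000),
    )
    reg_model.fit(X_reg, y_reg)
    # make_pipeline names each step after its lowercased class name.
    weights = reg_model.named_steps["logisticregression"].coef_
    coef_norms[c_value] = float(np.linalg.norm(weights))
    print(f"C = {c_value:>6}: L2 norm of weights = {coef_norms[c_value]:.3f}")
```

Smaller `C` means a stronger penalty, so the weight vector should shrink as `C` decreases.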

### Strengths and Weaknesses

| Strengths | Weaknesses |
|-----------|------------|
| Fast to train and predict | Assumes linear decision boundary |
| Outputs probabilities that are often reasonably well calibrated | Cannot capture complex non-linear patterns without feature engineering |
| Highly interpretable (on standardized features, coefficient magnitude indicates influence) | Sensitive to feature scaling when regularized |
| Works well with small datasets | Sensitive to outliers |
| Rarely overfits (with regularization) | Coefficients become unstable when features are highly correlated (multicollinearity) |

---

## Theory: Feature Scaling

### Why Scaling Matters for Logistic Regression

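Scikit-learn fits logistic regression with an iterative, gradient-based optimizer (L-BFGS by default) rather than a closed-form solution. When features are on vastly different scales (e.g., `mean radius` ranges 6-28, while `mean area` ranges 143-2501), the loss landscape becomes elongated. This causes:

- **Slower convergence**: The optimizer needs more iterations on a stretched, poorly conditioned loss surface (plain gradient descent zigzags across it instead of heading straight to the minimum).
- **Unfair weight penalization**: Regularization applies the same penalty rule to every weight, but a large-scale feature needs only a small weight to influence the output, while a small-scale feature needs a large one. The penalty therefore falls more heavily on small-scale features, so the fit depends on the units each feature happens to be measured in.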

### Types of Scalers

| Scaler | Formula | When to Use |
|--------|---------|-------------|
| **StandardScaler** | $z = \frac{x - \mu}{\sigma}$ | Default choice. Centers to mean=0, std=1. Best for normally distributed features. |
| **MinMaxScaler** | $z = \frac{x - x_{min}}{x_{max} - x_{min}}$ | Scales to [0, 1]. Good when you need bounded values (e.g., neural networks). |
| **RobustScaler** | $z = \frac{x - \text{median}}{IQR}$ | Uses median and IQR. Robust to outliers — use when your data has extreme values. |
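
To see how the three scalers differ, the sketch below applies each one to a single made-up column containing one extreme value (the numbers are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One hypothetical feature column with an outlier (100.0).
column = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    "standard": StandardScaler().fit_transform(column).ravel(),
    "minmax": MinMaxScaler().fit_transform(column).ravel(),
    "robust": RobustScaler().fit_transform(column).ravel(),
}

for scaler_name, values in scaled.items():
    print(f"{scaler_name:>8}: {np.round(values, 2)}")
```

With `StandardScaler` and `MinMaxScaler`, the outlier compresses the four ordinary values into a narrow band. `RobustScaler` keeps them spread out because the median and IQR are barely affected by a single extreme value.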

### Our Choice: StandardScaler

We will use `StandardScaler` because:
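1. It puts all 30 features on a comparable scale, so the L2 penalty treats them evenly (some features, such as the area measurements, are right-skewed rather than normally distributed, but standardization still works well here)
2. The solver converges in fewer iterations on standardized features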
3. It is the most common default in practice

### Preventing Data Leakage with Pipelines

We **must** fit the scaler only on training data, then use those same parameters to transform the test data. If we fit on the full dataset, the scaler "sees" test data statistics, which constitutes **data leakage**.

Scikit-learn's `Pipeline` handles this automatically:
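- `.fit(X_train, y_train)` fits the scaler on the training data only, then fits the classifier on the scaled training data
- `.predict(X_test)` transforms the test data with the stored training statistics, then predicts
- `.predict_proba(X_test)` applies the same stored transform before returning class probabilities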
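
One way to convince yourself of this is to compare a `Pipeline` against the same steps done by hand. The sketch below is illustrative only: it uses its own split with `random_state=0` and `*_chk` variable names so that it does not interfere with the variables used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_chk, y_chk = load_breast_cancer(return_X_y=True)
X_chk_train, X_chk_test, y_chk_train, y_chk_test = train_test_split(
    X_chk, y_chk, test_size=0.2, random_state=0, stratify=y_chk
)

# Pipeline version: scaling and classification in one object.
chk_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=10000)),
])
chk_pipe.fit(X_chk_train, y_chk_train)
pipe_pred = chk_pipe.predict(X_chk_test)

# Manual version: fit the scaler on training rows, reuse it on test rows.
manual_scaler = StandardScaler().fit(X_chk_train)
manual_clf = LogisticRegression(max_iter=10000).fit(
    manual_scaler.transform(X_chk_train), y_chk_train
)
manual_pred = manual_clf.predict(manual_scaler.transform(X_chk_test))

same_predictions = np.array_equal(pipe_pred, manual_pred)
print("Pipeline matches manual steps:", same_predictions)
```

The pipeline's scaler stores the training means, so `chk_pipe.named_steps["scaler"].mean_` should equal the column means of the training rows, not of the full dataset.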

In [None]:
# =============================================================================
# BUILDING THE BASELINE PIPELINE
# =============================================================================
# A Pipeline chains preprocessing and modeling steps into a single object.
# This is the idiomatic way to build ML models in scikit-learn because:
#   1. It prevents data leakage (scaler is fit only on training data)
#   2. It simplifies code (one .fit() call does everything)
#   3. It makes deployment easier (one object to serialize/save)

baseline_model = Pipeline([
    # Step 1: StandardScaler
    # Transforms each feature to have mean=0 and standard deviation=1.
    # This ensures all 30 features contribute equally to the model,
    # regardless of their original scale.
    ('scaler', StandardScaler()),
    
    # Step 2: LogisticRegression
    # Our baseline classifier. Key parameters:
    #   - C=1.0 (default): moderate regularization strength
    #   - penalty='l2' (default): Ridge regularization to prevent overfitting
    #   - solver='lbfgs' (default): efficient optimizer for small-medium datasets
    #   - max_iter=10000: generous ceiling, raised from the default 100, so the
    #     solver does not stop early (on scaled data it converges far sooner)
    #   - random_state: has no effect with 'lbfgs' (it only matters for the
    #     'sag', 'saga', and 'liblinear' solvers); kept for consistency
    ('logreg', LogisticRegression(
        max_iter=10000,
        random_state=RANDOM_STATE
    ))
])

# Display the pipeline structure.
# Scikit-learn provides a nice representation of the pipeline steps.
print("Pipeline structure:")
print(baseline_model)
print()
print("Step 1: StandardScaler (standardizes each feature to mean=0, std=1)")
print("Step 2: LogisticRegression (L2 regularized, C=1.0)")

In [None]:
# =============================================================================
# TRAINING THE MODEL
# =============================================================================
# When we call .fit(), the following happens internally:
#
# 1. StandardScaler.fit_transform(X_train):
#    - Computes the mean and std of each of the 30 features using ONLY X_train
#    - Transforms X_train: each value becomes (value - mean) / std
#    - The scaler stores the means and stds for later use on test data
#
# 2. LogisticRegression.fit(X_train_scaled, y_train):
#    - Initializes 30 weights (one per feature) and 1 bias term
#    - Uses L-BFGS optimizer to minimize the log-loss (cross-entropy) function
#    - Iteratively adjusts weights to maximize the likelihood of correct predictions
#    - Applies L2 regularization to prevent any single weight from becoming too large
#    - Stops when convergence criteria are met or max_iter is reached

baseline_model.fit(X_train, y_train)

# After training, we can inspect what the model learned.
# Access the logistic regression step from the pipeline.
logreg = baseline_model.named_steps['logreg']

# The model learned 30 coefficients (weights) — one for each feature.
# Positive weight = feature increases probability of benign (class 1).
# Negative weight = feature increases probability of malignant (class 0).
print("Model trained successfully!")
print(f"Number of coefficients: {logreg.coef_.shape[1]}")
print(f"Number of iterations to converge: {logreg.n_iter_[0]}")
print()

# Show the top 10 most influential features (by absolute coefficient value).
# Because the inputs are standardized, larger absolute values indicate
# features the model relies on most heavily.
coef_df = pd.DataFrame({
    'feature': cancer_data.feature_names,
    'coefficient': logreg.coef_[0]
})
coef_df['abs_coefficient'] = coef_df['coefficient'].abs()
coef_df = coef_df.sort_values('abs_coefficient', ascending=False)

print("Top 10 most influential features:")
for _, row in coef_df.head(10).iterrows():
    direction = "-> benign" if row['coefficient'] > 0 else "-> malignant"
    print(f"  {row['feature']:30s} coef={row['coefficient']:+.4f}  {direction}")

---

## Theory: Classification Metrics Deep Dive

A single number cannot capture all aspects of a classifier's performance. We need **multiple metrics**, each highlighting a different aspect.

### The Confusion Matrix Foundation

All classification metrics are derived from four fundamental counts:

| | **Predicted Positive** | **Predicted Negative** |
|---|---|---|
| **Actual Positive** | True Positive (TP) | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN) |

### Metric Formulas

| Metric | Formula | Intuition | Prioritize When... |
|--------|---------|-----------|--------------------|
| **Accuracy** | $\frac{TP + TN}{TP + TN + FP + FN}$ | Overall correctness | Classes are balanced |
| **Precision** | $\frac{TP}{TP + FP}$ | Of those we predicted positive, how many are correct? | False positives are costly (spam detection) |
| **Recall (Sensitivity)** | $\frac{TP}{TP + FN}$ | Of all actual positives, how many did we find? | False negatives are costly (cancer detection) |
| **Specificity** | $\frac{TN}{TN + FP}$ | Of all actual negatives, how many did we correctly identify? | Complement to recall for the negative class |
| **F1 Score** | $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ | Balanced combination of precision and recall | You need a single metric balancing both |

### Why F1 Uses the Harmonic Mean (Not Arithmetic Mean)

The **harmonic mean** penalizes extreme imbalances. Consider:
- Precision = 1.0, Recall = 0.0
- Arithmetic mean = 0.5 (looks decent!)
- Harmonic mean (F1) = 0.0 (correctly shows the model is useless)

The harmonic mean ensures that **both** precision and recall must be high for F1 to be high.
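
The formulas above can be checked by hand against scikit-learn. The counts below (`demo_tp`, `demo_fp`, `demo_fn`, `demo_tn`) are made-up numbers chosen for illustration, not results from this notebook's model:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical confusion-matrix counts.
demo_tp, demo_fp, demo_fn, demo_tn = 40, 10, 20, 30

# Build label arrays that reproduce exactly those counts.
y_true_demo = np.array([1] * demo_tp + [0] * demo_fp + [1] * demo_fn + [0] * demo_tn)
y_pred_demo = np.array([1] * demo_tp + [1] * demo_fp + [0] * demo_fn + [0] * demo_tn)

accuracy_manual = (demo_tp + demo_tn) / (demo_tp + demo_tn + demo_fp + demo_fn)
precision_manual = demo_tp / (demo_tp + demo_fp)
recall_manual = demo_tp / (demo_tp + demo_fn)
specificity_manual = demo_tn / (demo_tn + demo_fp)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
arithmetic_mean = (precision_manual + recall_manual) / 2

print(f"Accuracy:    {accuracy_manual:.4f}")
print(f"Precision:   {precision_manual:.4f}")
print(f"Recall:      {recall_manual:.4f}")
print(f"Specificity: {specificity_manual:.4f}")
print(f"F1 (harmonic mean):  {f1_manual:.4f}")
print(f"Arithmetic mean:     {arithmetic_mean:.4f}")
```

The harmonic mean never exceeds the arithmetic mean, and the gap widens as precision and recall move further apart.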

### When Accuracy Is Misleading

Imagine a fraud detection dataset with 99.9% legitimate transactions. A model that always predicts "legitimate" achieves 99.9% accuracy but catches **zero fraud**. Accuracy is misleading whenever classes are imbalanced.

---

## Theory: The Confusion Matrix

The confusion matrix is a **visual summary** of a classifier's predictions, organized as a 2x2 table (for binary classification).

### Visual Layout

```
                    Predicted
                 Neg       Pos
Actual  Neg  [  TN    |   FP  ]
        Pos  [  FN    |   TP  ]
```

### Interpreting Each Quadrant

- **True Negatives (TN)**: Correctly identified negatives. "We said it's benign, and it IS benign."
- **False Positives (FP)** — **Type I Error**: Incorrectly flagged as positive. "We said it's malignant, but it's actually benign." Consequence: unnecessary biopsies, patient anxiety.
- **False Negatives (FN)** — **Type II Error**: Missed positives. "We said it's benign, but it's actually malignant." Consequence: **missed cancer** — potentially fatal.
- **True Positives (TP)**: Correctly identified positives. "We said it's malignant, and it IS malignant."

### Type I vs Type II Errors in Context

| Error Type | Formal Name | In Our Context | Severity |
|-----------|-------------|----------------|----------|
| Type I (FP) | False Alarm | Benign tumor flagged as malignant | Moderate — unnecessary follow-up |
| Type II (FN) | Missed Detection | Malignant tumor missed as benign | **Critical — missed cancer** |

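In medical diagnosis, **Type II errors are usually far more dangerous than Type I errors.** We would rather have some false alarms than miss a single cancer case. This is why recall (sensitivity) is often the most important metric in healthcare applications. Note that the examples above treat *malignant* as the positive class. This dataset encodes benign as 1, so in scikit-learn's default output for the code below the positive class is *benign*, and a missed cancer (malignant predicted as benign) is counted as a false positive.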
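
Because the meaning of FP and FN depends on which class is treated as positive, it helps to check the orientation on a tiny example. The sketch below uses made-up labels with this dataset's encoding (0 = malignant, 1 = benign); the `*_toy` names are assumptions for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 3 malignant (0) and 5 benign (1) cases.
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 1, 1, 1, 1, 1, 0])

# Rows are actual classes, columns are predicted classes, both ordered 0 then 1.
toy_tn, toy_fp, toy_fn, toy_tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(f"Malignant predicted malignant (TN): {toy_tn}")
print(f"Malignant predicted benign    (FP): {toy_fp}  <- missed cancer")
print(f"Benign predicted malignant    (FN): {toy_fn}  <- false alarm")
print(f"Benign predicted benign       (TP): {toy_tp}")

# Recall for the malignant class: treat label 0 as the positive class.
malignant_recall = recall_score(y_true_toy, y_pred_toy, pos_label=0)
print(f"Recall for malignant: {malignant_recall:.3f}")
```

Passing `pos_label=0` is one way to report the clinically relevant recall without re-encoding the target.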

In [None]:
# =============================================================================
# EVALUATING THE BASELINE MODEL
# =============================================================================

# Generate predictions on the test set.
# .predict() internally: scales X_test using training statistics, then classifies.
y_pred = baseline_model.predict(X_test)

# .predict_proba() returns probability estimates for each class.
# Shape: (n_samples, 2) — column 0 is P(malignant), column 1 is P(benign).
# We take column 1 (probability of benign/positive class) for ROC and PR curves.
y_proba = baseline_model.predict_proba(X_test)[:, 1]

# =============================================================================
# COMPUTE ALL METRICS
# =============================================================================
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)    # Precision for the positive class (benign=1)
rec = recall_score(y_test, y_pred)         # Recall for the positive class
f1 = f1_score(y_test, y_pred)              # Harmonic mean of precision and recall
auc = roc_auc_score(y_test, y_proba)       # AUC uses probabilities, not hard predictions

# Print all metrics in a clean, organized format.
print("=" * 60)
print("BASELINE MODEL EVALUATION METRICS")
print("=" * 60)
print(f"Accuracy:  {acc:.4f}  (Overall correctness)")
print(f"Precision: {prec:.4f}  (Of predicted benign, how many are correct?)")
print(f"Recall:    {rec:.4f}  (Of actual benign cases, how many did we find?)")
print(f"F1 Score:  {f1:.4f}  (Harmonic mean of precision and recall)")
print(f"ROC-AUC:   {auc:.4f}  (Probability ranking quality)")
print()

# =============================================================================
# CONFUSION MATRIX
# =============================================================================
# Compute the confusion matrix: a 2x2 array of [[TN, FP], [FN, TP]].
# Rows are actual classes and columns are predicted classes, ordered by label,
# so the positive class here is benign (1) and the negative class is malignant (0).
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()  # Unpack into individual values

print("=" * 60)
print("CONFUSION MATRIX")
print("=" * 60)
print(f"True Negatives  (correctly identified malignant): {tn}")
print(f"False Positives (malignant misclassified as benign): {fp}")
print(f"False Negatives (benign misclassified as malignant): {fn}")
print(f"True Positives  (correctly identified benign): {tp}")
print()

# Visualize the confusion matrix as a heatmap for intuitive interpretation.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    cm,
    annot=True,         # Show numbers in each cell
    fmt='d',            # Integer format (not scientific notation)
    cmap='Blues',        # Blue color scheme
    xticklabels=['Malignant (0)', 'Benign (1)'],
    yticklabels=['Malignant (0)', 'Benign (1)'],
    ax=ax
)
ax.set_xlabel('Predicted Label')
ax.set_ylabel('True Label')
ax.set_title('Confusion Matrix — Logistic Regression Baseline')
plt.tight_layout()
plt.show()

# =============================================================================
# FULL CLASSIFICATION REPORT
# =============================================================================
# classification_report provides precision, recall, F1 for EACH class,
# plus macro/weighted averages.
print("=" * 60)
print("DETAILED CLASSIFICATION REPORT")
print("=" * 60)
print(classification_report(
    y_test, y_pred,
    target_names=['malignant (0)', 'benign (1)']
))

# Interpretation:
# - High recall for benign means we correctly identify most benign tumors.
# - Check recall for malignant — missing malignant cases is the most dangerous error.
# - Compare precision across classes to understand where the model makes mistakes.

---

## Theory: ROC Curve & AUC

### What the ROC Curve Plots

The **Receiver Operating Characteristic (ROC)** curve plots:
- **X-axis**: False Positive Rate (FPR) = FP / (FP + TN) — the fraction of negatives incorrectly classified as positive
- **Y-axis**: True Positive Rate (TPR) = TP / (TP + FN) — same as recall/sensitivity

Each point on the curve corresponds to a **different probability threshold**. By sweeping the threshold from 1.0 (predict nothing as positive) to 0.0 (predict everything as positive), we trace out the entire curve.

### AUC: Area Under the ROC Curve

The **AUC** summarizes the ROC curve in a single number:

| AUC Value | Interpretation |
|-----------|---------------|
| 1.0 | Perfect classifier — separates all positives from negatives |
| 0.9-1.0 | Excellent |
| 0.8-0.9 | Good |
| 0.7-0.8 | Fair |
| 0.5 | Random guessing — no discrimination ability |
| < 0.5 | Worse than random (model has learned inverted patterns) |

**Intuition**: AUC = the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example.
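
The ranking interpretation can be checked directly. The sketch below, on toy labels and scores invented for illustration, compares every positive score against every negative score (counting ties as half) and checks the result against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented purely for illustration.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Score difference for every (positive, negative) pair.
diff = pos[:, None] - neg[None, :]

# Fraction of pairs where the positive is ranked higher; ties count as half.
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
sklearn_auc = roc_auc_score(y_true, y_score)

print(f"Pairwise ranking estimate: {pairwise_auc:.4f}")
print(f"roc_auc_score:             {sklearn_auc:.4f}")
```

Here 12 of the 16 positive-negative pairs are ordered correctly, so both numbers come out to 0.75.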

### When AUC Is Useful vs Misleading

- **Useful**: When you care about the model's ability to *rank* examples (e.g., which patients to screen first)
- **Potentially misleading**: With highly imbalanced data, because FPR can be low even with many false positives (since TN is huge). In such cases, the Precision-Recall curve is often more informative.
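
A back-of-the-envelope calculation shows how this can happen. The confusion-matrix counts below are hypothetical, chosen only to mimic a problem where positives are rare.

```python
# Hypothetical counts for a rare-positive problem (100 positives, 99,900 negatives).
tp, fn = 80, 20
fp, tn = 900, 99_000

fpr = fp / (fp + tn)          # what the ROC curve sees on its x-axis
tpr = tp / (tp + fn)          # recall, the ROC curve's y-axis
precision = tp / (tp + fp)    # what the PR curve sees on its y-axis

print(f"TPR (recall): {tpr:.3f}")
print(f"FPR:          {fpr:.4f}")
print(f"Precision:    {precision:.4f}")
```

The FPR is below 1%, which looks excellent on a ROC plot, yet fewer than one in ten positive predictions is correct. The large TN count hides the 900 false positives from the FPR but not from precision.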

In [None]:
# =============================================================================
# ROC CURVE VISUALIZATION
# =============================================================================

# Compute ROC curve data points.
# roc_curve returns three arrays:
#   fpr: False Positive Rate at each threshold
#   tpr: True Positive Rate (Recall) at each threshold
#   thresholds: The probability thresholds used
fpr, tpr, roc_thresholds = roc_curve(y_test, y_proba)

# Create the ROC plot
fig, ax = plt.subplots(figsize=(8, 6))

# Plot the ROC curve itself
ax.plot(
    fpr, tpr,
    color='darkorange',
    linewidth=2,
    label=f'Logistic Regression (AUC = {auc:.4f})'
)

# Plot the diagonal "random chance" line.
# A model with no discrimination power follows this line (AUC = 0.5).
# Everything above this line indicates the model has learned something useful.
ax.plot(
    [0, 1], [0, 1],
    color='navy',
    linewidth=1.5,
    linestyle='--',
    label='Random Chance (AUC = 0.5)'
)

# Shade the area under the ROC curve to visually represent AUC.
ax.fill_between(fpr, tpr, alpha=0.1, color='darkorange')

# Label axes and add title
ax.set_xlabel('False Positive Rate (1 - Specificity)', fontsize=12)
ax.set_ylabel('True Positive Rate (Sensitivity / Recall)', fontsize=12)
ax.set_title('ROC Curve — Logistic Regression Baseline', fontsize=14)
ax.legend(loc='lower right', fontsize=11)
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
ax.grid(True, alpha=0.3)  # Light grid for readability
plt.tight_layout()
plt.show()

# Print interpretation
print(f"ROC-AUC = {auc:.4f}")
print("Interpretation: The model has a {:.1f}% chance of ranking a randomly".format(auc * 100))
print("chosen benign sample higher than a randomly chosen malignant sample.")

---

## Theory: Precision-Recall Curve

### When to Use PR Curve vs ROC Curve

The **Precision-Recall (PR) curve** plots precision (y-axis) against recall (x-axis) at various thresholds. It is especially useful when:

1. **Classes are imbalanced**: In imbalanced datasets, ROC curves can be overly optimistic because FPR stays low even with many false positives (since TN is very large). The PR curve directly shows the tradeoff between precision and recall without being inflated by true negatives.

2. **You care about the positive class**: ROC is built from TPR and FPR, so it reflects performance on both the positive and the negative class, while PR focuses specifically on how well the model identifies the positive class.

### Average Precision (AP)

**Average Precision** summarizes the PR curve as the weighted mean of precisions at each threshold, with the increase in recall as the weight:

$$AP = \sum_n (R_n - R_{n-1}) \cdot P_n$$

AP ranges from 0 to 1. A perfect classifier has AP = 1.0. The baseline for a random classifier is the proportion of positives in the dataset.
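
As a sanity check on the formula, the sketch below recomputes AP from the arrays returned by `precision_recall_curve` and compares it with `average_precision_score`. The labels and scores are toy values invented for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy labels and scores, invented purely for illustration.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# recall is returned in decreasing order, so np.diff(recall) is <= 0.
# Negating the sum turns those steps into the recall increments R_n - R_{n-1}.
manual_ap = -np.sum(np.diff(recall) * precision[:-1])
sklearn_ap = average_precision_score(y_true, y_score)

print(f"Manual AP:               {manual_ap:.4f}")
print(f"average_precision_score: {sklearn_ap:.4f}")
```

Only thresholds at which recall actually increases contribute to the sum, because the weight $R_n - R_{n-1}$ is zero everywhere else.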

### PR Curve vs ROC Curve — Summary

| Aspect | ROC Curve | PR Curve |
|--------|-----------|----------|
| Best for | Balanced datasets | Imbalanced datasets |
| X-axis | FPR | Recall |
| Y-axis | TPR (Recall) | Precision |
| Random baseline | Diagonal line (AUC=0.5) | Horizontal line at prevalence |
| Sensitive to class imbalance | Less sensitive | More sensitive |

In [None]:
# =============================================================================
# PRECISION-RECALL CURVE VISUALIZATION
# =============================================================================

# Compute precision-recall curve data points.
# precision_recall_curve returns:
#   precisions: Precision values at each threshold
#   recalls: Recall values at each threshold
#   pr_thresholds: The probability thresholds used
precisions, recalls, pr_thresholds = precision_recall_curve(y_test, y_proba)

# Average Precision: single-number summary of the PR curve.
# Analogous to AUC for the ROC curve.
ap = average_precision_score(y_test, y_proba)

# Create the PR curve plot
fig, ax = plt.subplots(figsize=(8, 6))

# Plot the precision-recall curve
ax.plot(
    recalls, precisions,
    color='green',
    linewidth=2,
    label=f'Logistic Regression (AP = {ap:.4f})'
)

# Plot the baseline: a horizontal line at the prevalence of the positive class.
# For a random classifier, precision = proportion of positives at all recall levels.
prevalence = y_test.sum() / len(y_test)
ax.axhline(
    y=prevalence,
    color='navy',
    linewidth=1.5,
    linestyle='--',
    label=f'Random Baseline (prevalence = {prevalence:.3f})'
)

# Shade the area under the curve
ax.fill_between(recalls, precisions, alpha=0.1, color='green')

# Label axes and add title
ax.set_xlabel('Recall (Sensitivity)', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision-Recall Curve — Logistic Regression Baseline', fontsize=14)
ax.legend(loc='lower left', fontsize=11)
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Average Precision (AP) = {ap:.4f}")
print(f"Random baseline precision = {prevalence:.4f} (proportion of positive class)")

---

## Theory: Threshold Tuning

### The Default Threshold

Logistic regression outputs a probability $P(y=1|\mathbf{x})$ for each sample. By default, scikit-learn uses a threshold of **0.5**:
- If $P \geq 0.5$: predict class 1 (benign)
- If $P < 0.5$: predict class 0 (malignant)

But 0.5 is **not always the best threshold**. The optimal threshold depends on the **business context**.
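
To make the default concrete, the sketch below checks that `predict` agrees with thresholding `predict_proba` at 0.5. It uses a synthetic dataset from `make_classification` rather than the notebook's data, and the sample size and seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data; the sizes and random_state are arbitrary choices.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# Probability of class 1 for every sample.
proba_pos = model.predict_proba(X)[:, 1]

# Thresholding at 0.5 by hand should reproduce model.predict.
manual_pred = (proba_pos > 0.5).astype(int)
same = np.array_equal(manual_pred, model.predict(X))
print(f"Manual 0.5 threshold matches predict(): {same}")
```

Any other threshold can be applied the same way, by comparing `proba_pos` against a different cutoff instead of calling `predict`.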

### Business-Driven Threshold Selection

| Scenario | Adjust Threshold | Effect |
|----------|-----------------|--------|
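| Cancer screening (minimize missed cancers) | **Lower** the threshold (e.g., 0.3) | More samples flagged as malignant; higher recall, lower precision |
| Spam filtering (minimize blocking real emails) | **Raise** the threshold (e.g., 0.7) | Fewer samples flagged as spam; higher precision, lower recall |

**Note on label encoding**: The table assumes the positive class is the condition being flagged (cancer, spam). In this notebook the encoding is reversed for the cancer case: class 1 is benign, and the predicted probability is $P(\text{benign})$. To flag more tumors as malignant here, we would *raise* the threshold on $P(\text{benign})$, which is the same as lowering the threshold on $P(\text{malignant}) = 1 - P(\text{benign})$.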

### The Precision-Recall Tradeoff

There is a fundamental tradeoff:
- **Lowering the threshold**: Catches more true positives (recall goes up) but also catches more false positives (precision goes down)
- **Raising the threshold**: Reduces false positives (precision goes up) but misses more true positives (recall goes down)
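
The tradeoff is easy to see on a small example. The sketch below uses toy labels and scores invented for illustration and evaluates three thresholds.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy labels and scores, invented purely for illustration.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

rows = []
for threshold in (0.3, 0.5, 0.7):
    # Predict positive whenever the score is at least the threshold.
    y_pred = (y_score >= threshold).astype(int)
    rows.append((
        threshold,
        precision_score(y_true, y_pred, zero_division=0),
        recall_score(y_true, y_pred, zero_division=0),
    ))

for threshold, p, r in rows:
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Recall can only stay flat or fall as the threshold rises. Precision usually rises, but on a sample this small it can dip locally: here it is higher at 0.5 than at 0.7, because the highest-scoring negative (0.80) survives both cutoffs while one true positive is lost.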

The "optimal" threshold is the one that achieves the best balance **for your specific use case**. There is no universal best threshold.
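
One way to turn a use-case requirement into a threshold is to fix a minimum recall and take the highest threshold that still meets it. The sketch below does this on toy labels and scores; the 0.75 recall target is a hypothetical requirement, not a recommendation.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores, invented purely for illustration.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

target_recall = 0.75  # hypothetical requirement for this example

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision and recall have one more entry than thresholds (the final
# point has no threshold), so drop the last element before masking.
meets_target = recall[:-1] >= target_recall

# thresholds are sorted in increasing order and recall never increases with
# the threshold, so the last index that meets the target is the highest one.
idx = np.where(meets_target)[0][-1]
chosen_threshold = thresholds[idx]
chosen_precision = precision[idx]
chosen_recall = recall[idx]

print(f"Chosen threshold: {chosen_threshold:.2f}")
print(f"Precision: {chosen_precision:.2f}, Recall: {chosen_recall:.2f}")
```

Taking the highest qualifying threshold tends to give the best precision available under the recall constraint, although precision is not strictly monotonic in the threshold.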

In [None]:
# =============================================================================
# THRESHOLD TUNING — Finding the optimal decision threshold
# =============================================================================

# We will evaluate many thresholds to see how precision, recall, and F1 change.
# np.arange creates a range of thresholds from 0.1 to 0.9 in steps of 0.05.
thresholds_to_test = np.arange(0.10, 0.95, 0.05)

# Store results for each threshold so we can plot and compare.
threshold_results = []

for thresh in thresholds_to_test:
    # Apply the custom threshold to probability predictions.
    # Instead of the default 0.5, we classify as positive (benign=1)
    # whenever the predicted probability is at least our custom threshold.
    y_pred_custom = (y_proba >= thresh).astype(int)
    
    # Compute metrics at this threshold.
    # zero_division=0 handles the edge case where precision is undefined (no positive predictions).
    p = precision_score(y_test, y_pred_custom, zero_division=0)
    r = recall_score(y_test, y_pred_custom, zero_division=0)
    f = f1_score(y_test, y_pred_custom, zero_division=0)
    a = accuracy_score(y_test, y_pred_custom)
    
    threshold_results.append({
        'threshold': thresh,
        'precision': p,
        'recall': r,
        'f1': f,
        'accuracy': a
    })

# Convert to DataFrame for easy analysis and display
thresh_df = pd.DataFrame(threshold_results)

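# Find the threshold that maximizes F1 score.
# F1 is a reasonable default because it balances precision and recall
# (here for the positive class, benign = 1).
# Caveat: picking the threshold on the test set makes the reported F1 optimistic.
# In a real project, tune it on a validation split or with cross-validation.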
best_idx = thresh_df['f1'].idxmax()
best_threshold = thresh_df.loc[best_idx, 'threshold']
best_f1 = thresh_df.loc[best_idx, 'f1']

print("=" * 60)
print("THRESHOLD TUNING RESULTS")
print("=" * 60)
print(thresh_df.to_string(index=False, float_format='{:.4f}'.format))
print(f"\nOptimal threshold (max F1): {best_threshold:.2f} with F1 = {best_f1:.4f}")

In [None]:
# =============================================================================
# VISUALIZE PRECISION / RECALL / F1 vs THRESHOLD
# =============================================================================
# This plot makes the precision-recall tradeoff visually obvious.
# You can clearly see where the curves cross and where F1 peaks.

fig, ax = plt.subplots(figsize=(10, 6))

# Plot each metric as a function of the threshold
ax.plot(thresh_df['threshold'], thresh_df['precision'],
        'b-o', label='Precision', linewidth=2, markersize=4)
ax.plot(thresh_df['threshold'], thresh_df['recall'],
        'r-s', label='Recall', linewidth=2, markersize=4)
ax.plot(thresh_df['threshold'], thresh_df['f1'],
        'g-^', label='F1 Score', linewidth=2, markersize=4)

# Mark the optimal threshold (max F1) with a vertical line.
# This is the threshold that best balances precision and recall.
ax.axvline(x=best_threshold, color='gray', linestyle='--', linewidth=1.5,
           label=f'Optimal Threshold = {best_threshold:.2f}')

# Mark the default threshold at 0.5 for comparison
ax.axvline(x=0.5, color='orange', linestyle=':', linewidth=1.5,
           label='Default Threshold = 0.50')

ax.set_xlabel('Decision Threshold', fontsize=12)
ax.set_ylabel('Metric Value', fontsize=12)
ax.set_title('Precision, Recall, and F1 vs Decision Threshold', fontsize=14)
ax.legend(loc='best', fontsize=10)
ax.set_xlim([0.1, 0.9])
ax.set_ylim([0.0, 1.05])
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Interpretation guidance
print("How to read this plot:")
print("- As threshold increases: precision generally goes UP (fewer false positives)")
print("- As threshold increases: recall goes DOWN (more false negatives)")
print("- F1 peaks where precision and recall are best balanced")
print(f"- The optimal F1 threshold ({best_threshold:.2f}) may differ from the default (0.50)")

---

## Theory: Dummy/Random Baselines

Before celebrating our logistic regression results, we should ask: **"How much better is our model compared to doing nothing intelligent at all?"**

Scikit-learn provides `DummyClassifier` — a "model" that ignores the features entirely and uses simple rules to make predictions.

### DummyClassifier Strategies

| Strategy | Behavior | Expected Accuracy |
|----------|----------|-------------------|
| `most_frequent` | Always predicts the majority class | = majority class proportion |
| `stratified` | Predicts classes randomly, maintaining training class proportions | ≈ $p^2 + (1-p)^2$ for binary, where $p$ is the majority class proportion (about 53% when $p = 0.63$) |
| `uniform` | Predicts each class with equal probability (50/50 for binary) | ~ 50% for binary |
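
The expected accuracies in the table follow from simple arithmetic. The sketch below works them out for a binary problem, assuming a majority class proportion of 0.63 (roughly this dataset's share of benign samples) and assuming the test set has the same class balance as the training set.

```python
# Assumed majority class proportion (approximately the benign share here).
p = 0.63

expected_accuracy = {
    # Always predicts the majority class, so it is right with probability p.
    "most_frequent": p,
    # Guesses class k with probability equal to its share, so it is right when
    # guess and truth coincide: p * p + (1 - p) * (1 - p).
    "stratified": p**2 + (1 - p) ** 2,
    # Coin flip: right half of the time regardless of the class balance.
    "uniform": 0.5,
}

for strategy, acc_expected in expected_accuracy.items():
    print(f"{strategy:>13}: {acc_expected:.4f}")
```

On any single test set the two random strategies will fluctuate around these values, since their predictions depend on the random seed.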

### Why Compare Against Dummy Classifiers?

1. **Sanity check**: If your model doesn't beat `most_frequent`, it hasn't learned anything useful.
2. **Quantify improvement**: Instead of saying "my model is 95% accurate," you can say "my model is 32 percentage points better than the naive baseline."
3. **Expose imbalanced data issues**: A `most_frequent` baseline on imbalanced data can have surprisingly high accuracy, revealing that accuracy alone is insufficient.
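
The third point is worth seeing in numbers. The sketch below uses synthetic labels with a 95/5 split, invented for illustration and far more imbalanced than this notebook's dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 95 negatives and 5 positives.
y_demo = np.array([0] * 95 + [1] * 5)
# DummyClassifier ignores the features, so a column of zeros is enough.
X_demo = np.zeros((len(y_demo), 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_demo_pred = dummy.predict(X_demo)

demo_accuracy = accuracy_score(y_demo, y_demo_pred)
demo_recall = recall_score(y_demo, y_demo_pred, zero_division=0)

print(f"Accuracy: {demo_accuracy:.2f}")
print(f"Recall for the rare class: {demo_recall:.2f}")
```

The dummy reaches 95% accuracy while finding none of the rare positives, which is why a high accuracy figure means little until it is compared with this kind of baseline.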

In [None]:
# =============================================================================
# COMPARING AGAINST RANDOM / DUMMY BASELINES
# =============================================================================
# We will train three dummy classifiers and compare them to our logistic regression.
# This gives us a complete picture of how much value our model actually adds.

# Strategy 1: "most_frequent" — always predicts the majority class (benign)
# This is the simplest possible "model." If we can't beat this, we have a problem.
dummy_most_frequent = DummyClassifier(
    strategy='most_frequent',
    random_state=RANDOM_STATE
)
dummy_most_frequent.fit(X_train, y_train)
y_pred_mf = dummy_most_frequent.predict(X_test)

# Strategy 2: "stratified" — randomly predicts classes based on training distribution
# Simulates an "informed random guesser" who knows the class proportions.
dummy_stratified = DummyClassifier(
    strategy='stratified',
    random_state=RANDOM_STATE
)
dummy_stratified.fit(X_train, y_train)
y_pred_strat = dummy_stratified.predict(X_test)

# Strategy 3: "uniform" — predicts each class with equal probability (coin flip)
# The most "ignorant" baseline — doesn't even know the class distribution.
dummy_uniform = DummyClassifier(
    strategy='uniform',
    random_state=RANDOM_STATE
)
dummy_uniform.fit(X_train, y_train)
y_pred_unif = dummy_uniform.predict(X_test)

# =============================================================================
# COMPUTE METRICS FOR ALL MODELS
# =============================================================================
# Build a comparison table with all models side by side.
# This is the clearest way to demonstrate the value of our baseline model.

def compute_metrics(y_true, y_pred, model_name):
    """Compute all classification metrics for a given set of predictions.
    Returns a dictionary suitable for building a comparison DataFrame."""
    return {
        'Model': model_name,
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred, zero_division=0),
        'Recall': recall_score(y_true, y_pred, zero_division=0),
        'F1 Score': f1_score(y_true, y_pred, zero_division=0)
    }

# Compute metrics for each model
comparison = pd.DataFrame([
    compute_metrics(y_test, y_pred, 'Logistic Regression (Baseline)'),
    compute_metrics(y_test, y_pred_mf, 'Dummy: Most Frequent'),
    compute_metrics(y_test, y_pred_strat, 'Dummy: Stratified'),
    compute_metrics(y_test, y_pred_unif, 'Dummy: Uniform'),
])

# Display the comparison table
print("=" * 70)
print("MODEL COMPARISON — Logistic Regression vs Dummy Baselines")
print("=" * 70)
print(comparison.to_string(index=False, float_format='{:.4f}'.format))
print()

# Calculate the improvement over the best dummy baseline
best_dummy_acc = max(
    accuracy_score(y_test, y_pred_mf),
    accuracy_score(y_test, y_pred_strat),
    accuracy_score(y_test, y_pred_unif)
)
improvement = acc - best_dummy_acc
print(f"Logistic Regression accuracy: {acc:.4f}")
print(f"Best dummy accuracy:          {best_dummy_acc:.4f}")
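print(f"Improvement over dummy:       {improvement:+.4f} ({improvement*100:+.1f} percentage points)")
print()
# Only claim success if the numbers above actually support it.
if improvement > 0:
    print("Our logistic regression baseline beats every dummy classifier on accuracy,")
    print("which suggests the model has learned meaningful patterns from the features.")
else:
    print("Warning: the model does not beat the best dummy classifier on accuracy.")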

In [None]:
# =============================================================================
# VISUAL COMPARISON — Bar chart of all models
# =============================================================================
# A visual comparison makes it instantly clear how much better our model is.

fig, ax = plt.subplots(figsize=(10, 6))

# Prepare data for grouped bar chart.
# We compare all four metrics across all four models.
models = comparison['Model']
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
x = np.arange(len(models))  # x positions for each model
width = 0.2                  # Width of each bar

# Plot bars for each metric, offset by width for grouping
for i, metric in enumerate(metrics):
    ax.bar(x + i * width, comparison[metric], width, label=metric)

# Customize the plot for readability
ax.set_xlabel('Model', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Model Comparison: Logistic Regression vs Dummy Baselines', fontsize=14)
ax.set_xticks(x + width * 1.5)  # Center tick labels under grouped bars
ax.set_xticklabels(models, rotation=15, ha='right', fontsize=9)
ax.legend(loc='lower right', fontsize=10)
ax.set_ylim([0, 1.1])
ax.grid(True, alpha=0.3, axis='y')  # Only horizontal gridlines
plt.tight_layout()
plt.show()

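print("The chart compares the logistic regression baseline with the dummy classifiers")
print("on accuracy, precision, and F1. Recall is a special case: 'Dummy: Most Frequent'")
print("always predicts benign (class 1), so its recall is 1.0 by construction even")
print("though it has learned nothing.")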

---

## Results Summary

Here is a consolidated view of everything we built and measured in this notebook:

### Dataset
- **Breast Cancer Wisconsin (Diagnostic)**: 569 samples, 30 features, 2 classes
- **Class distribution**: ~63% benign, ~37% malignant (moderately imbalanced)
- **Split**: 80% training (455 samples), 20% test (114 samples), stratified

### Models Built

| Model | Description |
|-------|-------------|
| **Logistic Regression** | StandardScaler + L2-regularized LogisticRegression in a Pipeline |
| **Dummy: Most Frequent** | Always predicts benign (majority class) |
| **Dummy: Stratified** | Random predictions maintaining class proportions |
| **Dummy: Uniform** | Coin-flip random predictions |

### Key Takeaways

1. **The logistic regression baseline is strong**: It significantly outperforms all dummy classifiers, confirming it has learned real patterns.
2. **Multiple metrics matter**: Accuracy alone does not tell the full story, especially with imbalanced classes.
3. **Threshold tuning can improve performance**: The default 0.5 threshold is not always optimal. Adjusting it based on the business problem (e.g., minimizing missed cancers) can yield better real-world outcomes.
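4. **Pipelines prevent data leakage**: By bundling the scaler and model together, we ensure the scaler's mean and standard deviation are learned from the training data only and then merely applied to the test data.
5. **Dummy baselines set the floor, logistic regression sets the bar**: Any useful model must beat the dummy classifiers, and any model we build in future notebooks must also beat the logistic regression baseline to justify its complexity.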
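
The leakage point can be verified directly. The sketch below rebuilds a comparable pipeline and confirms that the fitted scaler's statistics match the training split alone. The step names, `max_iter`, and `random_state=42` are assumptions for this example and may differ from the values used earlier in the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Stratified 80/20 split; the seed is an arbitrary choice for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() calls the scaler's fit_transform on X_train only.
pipe.fit(X_train, y_train)

# The scaler's learned means equal the training-set column means.
scaler = pipe.named_steps["scaler"]
train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))

# score() applies the already-fitted scaler to X_test without refitting it.
test_accuracy = pipe.score(X_test, y_test)

print(f"Scaler fitted on training data only: {train_only}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

Scaling the full dataset before splitting would instead let test-set statistics influence the transformation applied during training.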

---

## Next Steps: Connection to Day 03

We now have a **well-evaluated baseline** that future models must beat. In **Day 03**, we will explore **Decision Trees**, which offer:

- **Non-linear decision boundaries**: Unlike logistic regression's linear decision boundary (a hyperplane in feature space), decision trees can capture complex, non-linear relationships between features.
- **Built-in feature selection**: Decision trees automatically determine which features are most informative by choosing the best splits.
- **No feature scaling required**: Decision trees split on individual feature values, so they are invariant to the scale of features.
- **High interpretability**: You can visualize the entire tree and trace exactly *why* a particular prediction was made.

However, decision trees also introduce new challenges:
- **Overfitting**: An unconstrained deep tree can memorize the training data. We will learn about pruning and depth limits.
- **Instability**: Small changes in data can produce very different trees. This motivates ensemble methods (Random Forests, covered later).

### The ML Workflow So Far

```
Day 01: Data Preparation & EDA
Day 02: Baseline Model (Logistic Regression)    <-- You are here
Day 03: Decision Trees (non-linear models)
Day 04: Ensemble Methods (Random Forest, Boosting)
  ...
```

Every model we build from here on will be compared against today's logistic regression baseline. If a more complex model does not meaningfully improve on these results, the simpler model wins — that is the **beat the baseline** philosophy in action.