# Logistic Regression: Intuition and Math

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain the difference between classification and regression problems
2. Understand the sigmoid function and its role in logistic regression
3. Derive and interpret log-odds and the decision boundary
4. Write and interpret the binary cross-entropy loss function
5. Use `sklearn.linear_model.LogisticRegression` with key parameters
6. Visualize decision boundaries for 2D classification problems
7. Distinguish between `predict` and `predict_proba`

## Prerequisites

- Linear algebra basics (dot products, vectors)
- Linear regression concepts (weights, bias, fitting)
- Python, NumPy, Matplotlib fundamentals

## Table of Contents

1. [Classification vs Regression](#1)
2. [The Sigmoid Function](#2)
3. [Log-Odds and Decision Boundary](#3)
4. [Binary Cross-Entropy Loss](#4)
5. [Logistic Regression with sklearn](#5)
6. [Visualizing the Decision Boundary](#6)
7. [predict vs predict_proba](#7)
8. [Multi-class Classification (Brief)](#8)
9. [Common Mistakes](#9)
10. [Exercise](#10)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

np.random.seed(42)
sns.set_style("whitegrid")
%matplotlib inline

<a id='1'></a>
## 1. Classification vs Regression

| Aspect | Regression | Classification |
|--------|-----------|----------------|
| Output | Continuous value (e.g., price, temperature) | Discrete category (e.g., spam/not spam, cat/dog) |
| Goal | Predict a quantity | Predict a class label |
| Loss | MSE, MAE | Cross-entropy, hinge loss |
| Example | House price = \$350,000 | Email is spam = Yes/No |

**Why not just use linear regression for classification?**

Linear regression outputs unbounded values $(-\infty, +\infty)$, but we need probabilities in $[0, 1]$. Logistic regression solves this by wrapping the linear output in a **sigmoid function**.
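
To make the range problem concrete, here is a minimal sketch that fits an ordinary least-squares line to 0/1 labels and evaluates it at two new points. The toy data and the query points are made up for illustration. Passing the line's output through a sigmoid only demonstrates the squashing; it is not a fitted logistic regression model.

```python
import numpy as np

# Toy 1D data (made up): label is 1 for larger x; x = 10 is a far-away positive.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 10.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Ordinary least-squares fit of a straight line to the 0/1 labels.
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the fitted line at two hypothetical query points.
x_new = np.array([-5.0, 10.0])
linear_out = slope * x_new + intercept

# Squash the same linear scores through the sigmoid.
squashed = 1.0 / (1.0 + np.exp(-linear_out))

print("Linear outputs:", linear_out)  # one value below 0, one above 1
print("After sigmoid: ", squashed)    # both strictly inside (0, 1)
```

The linear outputs cannot be read as probabilities, while the squashed values always can.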

<a id='2'></a>
## 2. The Sigmoid Function

The sigmoid (logistic) function maps any real number to $(0, 1)$:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Properties:
- $\sigma(0) = 0.5$
- $\lim_{z \to +\infty} \sigma(z) = 1$
- $\lim_{z \to -\infty} \sigma(z) = 0$
- Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
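
The properties above can be checked numerically. The sketch below compares the closed-form derivative with a central finite difference; the grid of `z` values and the step `eps` are arbitrary choices for this check.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 13)
eps = 1e-5  # finite-difference step (arbitrary small value)

# Central difference approximation of the derivative.
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
# Closed form: sigma(z) * (1 - sigma(z)).
analytic = sigmoid(z) * (1 - sigmoid(z))

print("sigma(0)            =", sigmoid(0.0))
print("sigma(40), sigma(-40) =", sigmoid(40.0), sigmoid(-40.0))
print("max |numeric - analytic| =", np.max(np.abs(numeric - analytic)))
```

The derivative peaks at $z = 0$ with value $0.25$ and vanishes in both tails, which is why very confident predictions change slowly during training.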

In [None]:
def sigmoid(z):
    """Compute the sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-8, 8, 200)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Basic sigmoid
axes[0].plot(z, sigmoid(z), "b-", linewidth=2)
axes[0].axhline(y=0.5, color="gray", linestyle="--", alpha=0.5)
axes[0].axvline(x=0, color="gray", linestyle="--", alpha=0.5)
axes[0].set_xlabel("z", fontsize=12)
axes[0].set_ylabel(r"$\sigma(z)$", fontsize=12)
axes[0].set_title("Sigmoid Function", fontsize=14)
axes[0].set_ylim(-0.05, 1.05)

# Effect of w and b: sigma(w*x + b)
x = np.linspace(-5, 5, 200)
for w, b, label in [(1, 0, "w=1, b=0"), (3, 0, "w=3, b=0"),
                     (1, -2, "w=1, b=-2"), (1, 2, "w=1, b=2")]:
    axes[1].plot(x, sigmoid(w * x + b), linewidth=2, label=label)

axes[1].axhline(y=0.5, color="gray", linestyle="--", alpha=0.5)
axes[1].set_xlabel("x", fontsize=12)
axes[1].set_ylabel("P(y=1|x)", fontsize=12)
axes[1].set_title("Effect of w (steepness) and b (shift)", fontsize=14)
axes[1].legend()
axes[1].set_ylim(-0.05, 1.05)

plt.tight_layout()
plt.show()

print("Key observations:")
print("- Larger |w| makes the curve steeper (more confident predictions)")
print("- Changing b shifts the decision point (where P=0.5) left or right")

<a id='3'></a>
## 3. Log-Odds and Decision Boundary

Logistic regression models the **log-odds** (logit) as a linear function:

$$\log\frac{p}{1 - p} = w^T x + b$$

where $p = P(y = 1 | x)$.

Solving for $p$:

$$p = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$
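
For reference, the intermediate algebra, writing $z = w^T x + b$ as shorthand:

$$
\begin{aligned}
\log\frac{p}{1 - p} &= z \\
\frac{p}{1 - p} &= e^{z} \\
p &= e^{z}\,(1 - p) \\
p\,(1 + e^{z}) &= e^{z} \\
p &= \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} = \sigma(z)
\end{aligned}
$$

The last step divides numerator and denominator by $e^{z}$.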

**Decision boundary**: the set of points where $p = 0.5$, which means $w^T x + b = 0$. In 2D, this is a straight line.
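
As a small worked example, the sketch below takes hypothetical weights `w` and bias `b` (illustrative values, not fitted to any data), solves $w^T x + b = 0$ for the second coordinate, and confirms that every point on that line gets probability 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters of a 2D model (illustrative values only).
w = np.array([2.0, -1.0])
b = 0.5

# On the boundary: w1*x1 + w2*x2 + b = 0  =>  x2 = -(w1*x1 + b) / w2  (needs w2 != 0).
x1 = np.linspace(-3, 3, 7)
x2 = -(w[0] * x1 + b) / w[1]
boundary_points = np.column_stack([x1, x2])

p_on_boundary = sigmoid(boundary_points @ w + b)
print("P(y=1) on the boundary:", p_on_boundary)

# A point off the boundary: its log-odds equal the linear score w^T x + b.
point = np.array([1.0, 0.0])
score = point @ w + b  # 2*1 + (-1)*0 + 0.5 = 2.5
p = sigmoid(score)
log_odds = np.log(p / (1 - p))
print(f"score = {score}, p = {p:.4f}, log-odds = {log_odds:.4f}")
```

Because the score is positive, the point lies on the class-1 side of the line.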

In [None]:
# Demonstrate log-odds
p_values = np.linspace(0.01, 0.99, 200)
log_odds = np.log(p_values / (1 - p_values))

plt.figure(figsize=(7, 5))
plt.plot(p_values, log_odds, "b-", linewidth=2)
plt.axhline(y=0, color="gray", linestyle="--", alpha=0.5)
plt.axvline(x=0.5, color="gray", linestyle="--", alpha=0.5)
plt.xlabel("Probability p", fontsize=12)
plt.ylabel("Log-odds = log(p / (1-p))", fontsize=12)
plt.title("Log-Odds (Logit) Function", fontsize=14)
plt.show()

print("When p=0.5, log-odds=0 => this is the decision boundary.")
print("When p>0.5, log-odds>0 => predict class 1.")
print("When p<0.5, log-odds<0 => predict class 0.")

<a id='4'></a>
## 4. Binary Cross-Entropy Loss

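MSE is a poor choice for logistic regression: composed with the sigmoid, it gives a loss surface that is non-convex in the weights, and its gradient nearly vanishes for confidently wrong predictions. Instead we use **binary cross-entropy** (log loss):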

$$J = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i)\right]$$

Intuition:
- When $y=1$: loss $= -\log(\hat{p})$. If $\hat{p} \to 1$, loss $\to 0$. If $\hat{p} \to 0$, loss $\to \infty$.
- When $y=0$: loss $= -\log(1-\hat{p})$. If $\hat{p} \to 0$, loss $\to 0$. If $\hat{p} \to 1$, loss $\to \infty$.
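
A minimal NumPy sketch of the formula, compared against `sklearn.metrics.log_loss`. The helper name `binary_cross_entropy`, the clipping constant `eps`, and the two sets of predictions are assumptions made for this example.

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_cross_entropy(y_true, p_hat, eps=1e-15):
    """Mean binary cross-entropy; clipping by eps avoids log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.clip(np.asarray(p_hat, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

y_true = np.array([1, 0, 1, 0])
good = np.array([0.9, 0.1, 0.8, 0.2])  # confident and correct
bad = np.array([0.1, 0.9, 0.2, 0.8])   # confident and wrong

loss_good = binary_cross_entropy(y_true, good)
loss_bad = binary_cross_entropy(y_true, bad)

print(f"BCE (good predictions): {loss_good:.4f}")
print(f"BCE (bad predictions):  {loss_bad:.4f}")
print(f"sklearn log_loss (good): {log_loss(y_true, good):.4f}")
```

The same labels with the probabilities flipped produce a loss more than ten times larger.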

This loss is **convex** in $w$ and $b$, so it has no spurious local minima: gradient descent with a suitable step size converges toward the global minimum.
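
To illustrate, here is a from-scratch sketch of batch gradient descent on the cross-entropy loss. The synthetic data, the learning rate `lr`, and the iteration count are arbitrary choices for this example; sklearn's solvers are different algorithms and also add regularization. The gradient used is $\nabla_w J = \frac{1}{n} X^T(\hat{p} - y)$ and $\partial J / \partial b = \frac{1}{n}\sum_i (\hat{p}_i - y_i)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Two overlapping Gaussian blobs (synthetic data, assumed for illustration).
rng = np.random.default_rng(0)
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, 2)),
               rng.normal(+1.0, 1.0, size=(n // 2, 2))])
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])

w = np.zeros(2)
b = 0.0
lr = 0.1  # step size, chosen by hand
losses = []

for _ in range(500):
    p = sigmoid(X @ w + b)
    losses.append(bce(y, p))
    # Gradient of the mean cross-entropy with respect to w and b.
    grad_w = X.T @ (p - y) / n
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"initial loss: {losses[0]:.4f}")  # log(2): every prediction starts at 0.5
print(f"final loss:   {losses[-1]:.4f}")
print(f"w = {w}, b = {b:.4f}")
```

With zero initial weights every prediction is 0.5, so the starting loss is $\log 2 \approx 0.693$, and it decreases from there.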

In [None]:
# Visualize cross-entropy loss for a single sample
p_hat = np.linspace(0.001, 0.999, 200)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Case y=1
loss_y1 = -np.log(p_hat)
axes[0].plot(p_hat, loss_y1, "r-", linewidth=2)
axes[0].set_xlabel(r"$\hat{p}$ (predicted probability)", fontsize=12)
axes[0].set_ylabel("Loss", fontsize=12)
axes[0].set_title(r"Loss when $y = 1$: $-\log(\hat{p})$", fontsize=14)
axes[0].set_ylim(0, 5)

# Case y=0
loss_y0 = -np.log(1 - p_hat)
axes[1].plot(p_hat, loss_y0, "b-", linewidth=2)
axes[1].set_xlabel(r"$\hat{p}$ (predicted probability)", fontsize=12)
axes[1].set_ylabel("Loss", fontsize=12)
axes[1].set_title(r"Loss when $y = 0$: $-\log(1 - \hat{p})$", fontsize=14)
axes[1].set_ylim(0, 5)

plt.tight_layout()
plt.show()

print("The loss penalizes confident wrong predictions very heavily.")

<a id='5'></a>
## 5. Logistic Regression with sklearn

### Key Parameters of `LogisticRegression`

| Parameter | Default | Description |
|-----------|---------|-------------|
| `penalty` | `'l2'` | Regularization type: `'l1'`, `'l2'`, `'elasticnet'`, `None` |
| `C` | `1.0` | Inverse regularization strength. Smaller C = stronger regularization |
| `solver` | `'lbfgs'` | Optimization algorithm. `'lbfgs'` supports only `'l2'` or no penalty; use `'liblinear'` or `'saga'` for L1, `'saga'` for elasticnet |
| `class_weight` | `None` | Set to `'balanced'` for imbalanced classes |
| `max_iter` | `100` | Maximum iterations for solver convergence. Increase if you get convergence warnings |
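
A quick way to see what `C` does is to look at the size of the learned weights. The sketch below refits the model for three values of `C` on synthetic data and prints the norm of `coef_`. The `C` values and the generous `max_iter` are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42
)

norms = {}
for C_val in [0.01, 1.0, 100.0]:
    # Default L2 penalty; max_iter is set high only to avoid convergence warnings.
    clf = LogisticRegression(C=C_val, max_iter=1000)
    clf.fit(X, y)
    norms[C_val] = np.linalg.norm(clf.coef_)
    print(f"C={C_val:<6} ||w|| = {norms[C_val]:.3f}")
```

Smaller `C` pulls the weights toward zero, which flattens the sigmoid and makes predicted probabilities less extreme.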

In [None]:
# Generate synthetic 2D data
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit logistic regression
model = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=200, random_state=42)
model.fit(X_train, y_train)

print(f"Coefficients (w): {model.coef_[0]}")
print(f"Intercept (b):    {model.intercept_[0]:.4f}")
print(f"Train accuracy:   {model.score(X_train, y_train):.4f}")
print(f"Test accuracy:    {model.score(X_test, y_test):.4f}")

<a id='6'></a>
## 6. Visualizing the Decision Boundary

In [None]:
def plot_decision_boundary(model, X, y, title="Decision Boundary"):
    """Plot decision boundary for a 2D classifier."""
    h = 0.02  # mesh step size
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap="RdYlBu")
    plt.scatter(X[y == 0, 0], X[y == 0, 1], c="blue", label="Class 0",
                edgecolors="k", alpha=0.7)
    plt.scatter(X[y == 1, 0], X[y == 1, 1], c="red", label="Class 1",
                edgecolors="k", alpha=0.7)
    plt.xlabel("Feature 1", fontsize=12)
    plt.ylabel("Feature 2", fontsize=12)
    plt.title(title, fontsize=14)
    plt.legend()
    plt.show()

plot_decision_boundary(model, X_test, y_test, "Logistic Regression Decision Boundary")

In [None]:
# Effect of regularization strength C
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, C_val in zip(axes, [0.01, 1.0, 100.0]):
    m = LogisticRegression(C=C_val, solver="lbfgs", max_iter=200, random_state=42)
    m.fit(X_train, y_train)

    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = m.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.3, cmap="RdYlBu")
    ax.scatter(X_test[y_test == 0, 0], X_test[y_test == 0, 1],
               c="blue", edgecolors="k", alpha=0.7)
    ax.scatter(X_test[y_test == 1, 0], X_test[y_test == 1, 1],
               c="red", edgecolors="k", alpha=0.7)
    ax.set_title(f"C={C_val} | Acc={m.score(X_test, y_test):.3f}", fontsize=13)
    ax.set_xlabel("Feature 1")
    ax.set_ylabel("Feature 2")

plt.suptitle("Effect of Regularization Strength C", fontsize=15, y=1.02)
plt.tight_layout()
plt.show()

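print("Small C => strong regularization (smaller weights, less confident probabilities, may underfit)")
print("Large C => weak regularization (larger weights, sharper probabilities, may overfit)")
print("The boundary is a straight line for every C; only its position and tilt change.")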

<a id='7'></a>
## 7. predict vs predict_proba

- `predict(X)`: returns the predicted class label (0 or 1). For a binary model this is equivalent to thresholding `P(y=1)` at 0.5
- `predict_proba(X)`: returns probability estimates for each class `[P(y=0), P(y=1)]`, with columns ordered as in `model.classes_`
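
The relationship between the two methods can be checked directly. This sketch refits a small model on synthetic data, reproduces `predict` from `predict_proba`, and then applies a custom threshold. The value `0.3` is an arbitrary example, such as one might use when missing a positive is costlier than a false alarm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42
)
model = LogisticRegression(max_iter=200).fit(X, y)

proba = model.predict_proba(X)  # shape (n_samples, 2), columns follow model.classes_
labels = model.predict(X)

# predict() picks the class with the larger probability.
manual_default = model.classes_[np.argmax(proba, axis=1)]

# Custom threshold on P(class=1); 0.3 is an arbitrary example value.
threshold = 0.3
custom = (proba[:, 1] >= threshold).astype(int)

print("predict() matches argmax of predict_proba():",
      np.array_equal(labels, manual_default))
print("Positives at threshold 0.5:", labels.sum())
print("Positives at threshold 0.3:", custom.sum())
```

Lowering the threshold can only keep or increase the number of samples labelled as class 1.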

In [None]:
# Compare predict vs predict_proba on a few test samples
sample_indices = [0, 1, 2, 3, 4]
X_sample = X_test[sample_indices]

predictions = model.predict(X_sample)
probabilities = model.predict_proba(X_sample)

results = pd.DataFrame({
    "True Label": y_test[sample_indices],
    "Predicted Label": predictions,
    "P(class=0)": probabilities[:, 0].round(4),
    "P(class=1)": probabilities[:, 1].round(4),
})
print(results.to_string(index=False))
print("\nNote: predict() uses threshold=0.5 on P(class=1) by default.")
print("predict_proba() gives you the raw probabilities for more nuanced decisions.")

<a id='8'></a>
## 8. Multi-class Classification (Brief)

Logistic regression naturally handles binary classification. For multi-class problems, there are two common strategies:

- **One-vs-Rest (OvR)**: Train one binary classifier per class. The class with the highest confidence wins. Older sklearn versions select this with `multi_class='ovr'`; that parameter is deprecated as of scikit-learn 1.5, so prefer wrapping the model in `sklearn.multiclass.OneVsRestClassifier`.
- **Multinomial (Softmax)**: Generalize the sigmoid to multiple classes using the softmax function. This is what the default `solver='lbfgs'` does when the labels contain more than 2 classes.

In practice, sklearn handles this automatically when you pass labels with more than 2 classes.
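
For intuition about the softmax, here is a minimal NumPy sketch. The score matrix is made up for illustration, and subtracting the row maximum is a common numerical-stability trick rather than part of the definition. The last part checks that with two classes and scores $(0, z)$, the softmax reduces to the sigmoid.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores for 2 samples and 3 classes.
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])
probs = softmax(scores)
print("Probabilities:\n", probs.round(4))
print("Row sums:", probs.sum(axis=1))
print("Predicted classes:", probs.argmax(axis=1))

# Two classes with scores (0, z): P(class 1) equals sigmoid(z).
z = np.array([-2.0, 0.0, 1.5])
two_class = softmax(np.column_stack([np.zeros_like(z), z]))
print("Softmax matches sigmoid:", np.allclose(two_class[:, 1], sigmoid(z)))
```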

In [None]:
# Quick multi-class example with all 3 Iris classes
iris = load_iris()
X_iris, y_iris = iris.data[:, :2], iris.target  # use first 2 features for visualization

model_multi = LogisticRegression(max_iter=200, random_state=42)
model_multi.fit(X_iris, y_iris)

print(f"Classes: {iris.target_names}")
print(f"Training accuracy: {model_multi.score(X_iris, y_iris):.4f}")
print(f"Coefficients shape: {model_multi.coef_.shape}  (one row per class)")

<a id='9'></a>
## 9. Common Mistakes

1. **Not scaling features**: Logistic regression with regularization is sensitive to feature scales. Always standardize your features (e.g., `StandardScaler`).

2. **Ignoring convergence warnings**: If the solver does not converge, increase `max_iter` or scale your data.

3. **Using accuracy alone**: Accuracy is misleading for imbalanced classes. Always check precision, recall, and F1.

4. **Confusing `C` with regularization strength**: `C` is the *inverse* of regularization strength. Larger C = less regularization.

5. **Treating probabilities as calibrated**: Raw `predict_proba` outputs are not always well-calibrated. See the calibration notebook.
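
Several of these points can be addressed together. The sketch below is one possible setup, not the only one: a `Pipeline` keeps scaling inside the model so the scaler is fit on training data only, `class_weight="balanced"` compensates for class imbalance, and precision, recall, and F1 are reported alongside accuracy. The synthetic data set and its 90/10 class split are assumptions for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data: roughly 90% class 0, 10% class 1.
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# The scaler is fit on the training split only when pipe.fit is called.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, zero_division=0)
rec = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)

print(f"Accuracy:  {acc:.3f}")
print(f"Precision: {prec:.3f}")
print(f"Recall:    {rec:.3f}")
print(f"F1:        {f1:.3f}")
```

Comparing recall with and without `class_weight="balanced"` is a useful follow-up experiment on imbalanced data.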

<a id='10'></a>
## 10. Exercise: Classify Iris (2 Classes)

**Task**: Build a logistic regression classifier on a binary subset of the Iris dataset.

1. Load the Iris dataset and keep only classes 0 and 1 (setosa and versicolor).
2. Use all 4 features. Split 70/30 with `random_state=42`.
3. Scale the features with `StandardScaler`.
4. Fit `LogisticRegression` with `C=1.0` and report train/test accuracy.
5. Print `predict_proba` for the first 5 test samples.

In [None]:
# Reference solution (try the exercise yourself first)
# -----------------------------------------------------

# Step 1: Load Iris, keep only classes 0 and 1
iris = load_iris()
mask = iris.target < 2
X_ex, y_ex = iris.data[mask], iris.target[mask]
print(f"Dataset shape: {X_ex.shape}, Classes: {np.unique(y_ex)}")

# Step 2: Train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X_ex, y_ex, test_size=0.3, random_state=42)

# Step 3: Scale features
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

# Step 4: Fit and evaluate
clf = LogisticRegression(C=1.0, max_iter=200, random_state=42)
clf.fit(X_tr_s, y_tr)
print(f"Train accuracy: {clf.score(X_tr_s, y_tr):.4f}")
print(f"Test accuracy:  {clf.score(X_te_s, y_te):.4f}")

# Step 5: Probabilities for first 5 test samples
proba = clf.predict_proba(X_te_s[:5])
print("\nPredicted probabilities (first 5 test samples):")
print(pd.DataFrame(proba, columns=["P(setosa)", "P(versicolor)"]).to_string(index=False))