<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Logistic%20Regression/Logistic%20Regression%20Code%20Walk%20Through.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic Regression: Code Walk Through

This notebook provides a step-by-step computational walkthrough of **Logistic Regression** using gradient descent optimization.

## What You'll Learn

- How the **sigmoid function** maps scores to probabilities
- How **gradient descent** optimizes the negative log-likelihood (NLL) loss
- How to compute gradients for logistic regression
- How the learned weights define a linear decision boundary
- How to visualize convergence and final predictions

## 1. Import Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit  # Numerically stable sigmoid

# Set random seed for reproducibility
np.random.seed(42)

## 2. Generate Synthetic Binary Classification Data

We'll create a simple 2D dataset with two classes, following the pattern from the lecture slides.

In [None]:
# Generate two-class data
m = 100  # samples per class
n = 2    # features

np.random.seed(0)

# Class 0: centered around (1.5, -1.5)
class_0 = np.hstack((
    1.5 + np.random.randn(m, 1),
    -1.5 + np.random.randn(m, 1)
))

# Class 1: centered around (-1.5, 1.5)
class_1 = np.hstack((
    -1.5 + np.random.randn(m, 1),
    1.5 + np.random.randn(m, 1)
))

# Combine into training set
X_train = np.vstack((class_0, class_1))  # shape (2m, 2)
y_train = np.concatenate([np.zeros(m), np.ones(m)])  # shape (2m,)

print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"Class distribution: {np.bincount(y_train.astype(int))}")

## 3. Visualize the Data

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 
            c='orange', label='Class 0', edgecolors='k', s=50)
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 
            c='skyblue', label='Class 1', edgecolors='k', s=50)
plt.xlabel('$x_1$', fontsize=12)
plt.ylabel('$x_2$', fontsize=12)
plt.title('Binary Classification Data', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 4. Understanding the Sigmoid Function

The **sigmoid (logistic) function** maps any real-valued score to a probability between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z = \vec{x}^T \times \vec{w}$ is the score: the inner product of the feature vector (including the bias term) with the weight vector.

In [None]:
# Visualize sigmoid function
z_values = np.linspace(-6, 6, 200)
sigmoid_values = expit(z_values)

plt.figure(figsize=(10, 6))
plt.plot(z_values, sigmoid_values, linewidth=2, color='purple')
plt.axhline(y=0.5, color='gray', linestyle='--', label='Decision threshold (0.5)')
plt.axvline(x=0, color='gray', linestyle='--')
plt.scatter([0], [0.5], color='red', s=100, zorder=5, label='Score = 0')
plt.xlabel('Score $(\\vec{x}^T \\times \\vec{w})$', fontsize=12)
plt.ylabel('$\\sigma(\\vec{x}^T \\times \\vec{w})$', fontsize=12)
plt.title('Sigmoid Function', fontsize=14)
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

print("Key sigmoid values:")
for z in [-6, -2, 0, 2, 6]:
    print(f"  σ({z:2.0f}) = {expit(z):.4f}")
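
As a quick check of the formula above, the following sketch verifies two standard sigmoid identities numerically: the symmetry $\sigma(-z) = 1 - \sigma(z)$ and the derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. The helper `sigmoid` is a hypothetical stand-in written with plain NumPy (using the tanh form to avoid overflow); the notebook itself relies on `scipy.special.expit`.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid via sigma(z) = 0.5 * (1 + tanh(z / 2)), which never overflows in exp."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(z, dtype=float)))

z = np.linspace(-6, 6, 13)
s = sigmoid(z)

# Symmetry: sigma(-z) = 1 - sigma(z)
symmetry_ok = np.allclose(sigmoid(-z), 1.0 - s)

# Derivative: sigma'(z) = sigma(z) * (1 - sigma(z)), checked by central differences
h = 1e-5
numeric_derivative = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
derivative_ok = np.allclose(numeric_derivative, s * (1.0 - s), atol=1e-8)

# Very large scores saturate at 0 and 1 instead of raising overflow warnings
extreme = sigmoid(np.array([-1000.0, 1000.0]))

print(symmetry_ok, derivative_ok, extreme)
```

The derivative identity is what keeps the gradient of the loss simple later in the walkthrough.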

## 5. Prepare Design Matrix

Add a bias column (intercept) to the feature matrix:

$$\Phi = [\vec{1} \quad X]$$

In [None]:
# Design matrix: add bias column
Phi = np.c_[np.ones(len(X_train)), X_train]
print(f"Design matrix Φ shape: {Phi.shape}")
print(f"\nFirst 5 rows of Φ:")
print(Phi[:5])
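
To see why the column of ones is useful, the sketch below uses a tiny made-up feature matrix and illustrative weights (both are assumptions, not values from this notebook) and confirms that `Phi @ w` equals the explicit form `w[0] + X @ w[1:]`.

```python
import numpy as np

# Toy inputs chosen only for illustration
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0, 2.0])  # [intercept, w1, w2]

# Same construction as in the notebook: prepend a column of ones
Phi = np.c_[np.ones(len(X)), X]

scores_design = Phi @ w             # intercept handled as an ordinary weight
scores_explicit = w[0] + X @ w[1:]  # intercept added separately

print(scores_design)
print(np.allclose(scores_design, scores_explicit))
```

Folding the intercept into the weight vector means a single matrix product yields every score, and the gradient for the intercept needs no special case.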

## 6. Initialize Weights

Start with small random weights near zero.

In [None]:
np.random.seed(42)
weights = np.random.randn(Phi.shape[1]) * 0.01
print(f"Initial weights: {weights}")

## 7. Step-by-Step: One Gradient Descent Iteration

Let's walk through a single iteration of gradient descent:

### Step 1: Compute scores (logits)

In [None]:
scores = Phi @ weights
print(f"Scores shape: {scores.shape}")
print(f"First 5 scores: {scores[:5]}")

### Step 2: Apply sigmoid to get probabilities

$$P(y=1|\vec{x}, \vec{w}) = \sigma(\vec{x}^T \times \vec{w}) = \frac{1}{1 + e^{-\vec{x}^T \times \vec{w}}}$$

In [None]:
probabilities = expit(scores)
print(f"Probabilities shape: {probabilities.shape}")
print(f"First 5 probabilities: {probabilities[:5]}")
print(f"All probabilities in [0,1]: {np.all((probabilities >= 0) & (probabilities <= 1))}")

### Step 3: Compute errors (residuals)

$$\text{errors} = \vec{p} - \vec{y}$$

where $\vec{p}$ is the vector of predicted probabilities and $\vec{y}$ is the vector of true labels.

In [None]:
errors = probabilities - y_train
print(f"Errors shape: {errors.shape}")
print(f"First 5 errors: {errors[:5]}")

### Step 4: Compute Negative Log-Likelihood (NLL) Loss

$$J(\vec{w}) = -\sum_{i=1}^{N} \left[ y^{(i)} \log p^{(i)} + (1-y^{(i)}) \log(1-p^{(i)}) \right]$$

This is also known as **binary cross-entropy loss**.

In [None]:
# Compute NLL; clip probabilities away from 0 and 1 so np.log never receives 0
epsilon = 1e-15
p_safe = np.clip(probabilities, epsilon, 1 - epsilon)
nll = -np.sum(y_train * np.log(p_safe) + (1 - y_train) * np.log(1 - p_safe))
print(f"Negative Log-Likelihood (NLL): {nll:.4f}")
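
Clipping is one way to keep the logarithm finite. A possible alternative, sketched here under the assumption that the raw scores $z$ are available, is to compute the same loss directly from the scores using the algebraically equivalent form $\sum_i \left[\log(1 + e^{z^{(i)}}) - y^{(i)} z^{(i)}\right]$ with `np.logaddexp`. The data below are random placeholders, not the notebook's training set.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(scale=3.0, size=50)          # placeholder scores
y = rng.integers(0, 2, size=50).astype(float)    # placeholder labels

# Clipped version, as in the notebook
p = 1.0 / (1.0 + np.exp(-scores))
eps = 1e-15
p_safe = np.clip(p, eps, 1 - eps)
nll_clipped = -np.sum(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))

# Score-based version: log(1 + e^z) - y*z, with log(1 + e^z) = logaddexp(0, z)
nll_from_scores = np.sum(np.logaddexp(0.0, scores) - y * scores)

# For an extreme score the score-based form stays exact instead of hitting the clip
extreme_nll = np.logaddexp(0.0, 800.0) - 0.0 * 800.0  # true label 0, score 800

print(nll_clipped, nll_from_scores, extreme_nll)
```

For moderate scores the two values agree; they only diverge once a probability is close enough to 0 or 1 for the clip to take effect.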

### Step 5: Compute Gradient

$$\nabla_{\vec{w}} J = \Phi^T (\vec{p} - \vec{y})$$
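
This compact form follows from the chain rule. As a short derivation sketch, write $z^{(i)} = \vec{x}^{(i)T} \times \vec{w}$ and $p^{(i)} = \sigma(z^{(i)})$, where $\vec{x}^{(i)}$ is the $i$-th row of $\Phi$ taken as a column vector, and use the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

$$\nabla_{\vec{w}} J = -\sum_{i=1}^{N} \left[ \frac{y^{(i)}}{p^{(i)}} - \frac{1-y^{(i)}}{1-p^{(i)}} \right] p^{(i)}(1-p^{(i)}) \, \vec{x}^{(i)} = \sum_{i=1}^{N} \left( p^{(i)} - y^{(i)} \right) \vec{x}^{(i)} = \Phi^T (\vec{p} - \vec{y})$$

The middle step uses $\left[ \frac{y}{p} - \frac{1-y}{1-p} \right] p(1-p) = y(1-p) - (1-y)p = y - p$.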

In [None]:
gradient = Phi.T @ errors
print(f"Gradient shape: {gradient.shape}")
print(f"Gradient: {gradient}")
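
One common way to gain confidence in a gradient formula is to compare it against central finite differences. The sketch below does this on small random data; the function names `nll` and `analytic_gradient` and the data are assumptions made for this example only.

```python
import numpy as np

def nll(w, Phi, y):
    """Negative log-likelihood computed from scores: sum(log(1 + e^z) - y*z)."""
    z = Phi @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)

def analytic_gradient(w, Phi, y):
    """Gradient from the formula Phi^T (p - y)."""
    p = 1.0 / (1.0 + np.exp(-(Phi @ w)))
    return Phi.T @ (p - y)

rng = np.random.default_rng(1)
Phi = np.c_[np.ones(20), rng.normal(size=(20, 2))]  # placeholder design matrix
y = rng.integers(0, 2, size=20).astype(float)       # placeholder labels
w = rng.normal(scale=0.1, size=3)

# Central differences: perturb one weight at a time
h = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = h
    numeric[j] = (nll(w + step, Phi, y) - nll(w - step, Phi, y)) / (2 * h)

max_abs_diff = np.max(np.abs(numeric - analytic_gradient(w, Phi, y)))
print(f"Largest difference between numeric and analytic gradient: {max_abs_diff:.2e}")
```

A difference many orders of magnitude smaller than the gradient entries themselves suggests the analytic formula and the loss are consistent.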

### Step 6: Update Weights

$$\vec{w}_{\text{new}} = \vec{w}_{\text{old}} - \alpha \nabla_{\vec{w}} J$$

In [None]:
learning_rate = 0.1
weights_new = weights - learning_rate * gradient
print(f"Old weights: {weights}")
print(f"New weights: {weights_new}")
print(f"Weight change: {weights_new - weights}")
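
Whether a single update lowers the loss depends on the step size. Because the gradient here is a sum over samples, a step that works on a small dataset can overshoot on a larger one. The sketch below is one way to pick a conservative step, under stated assumptions: the Hessian of the NLL is $\Phi^T \mathrm{diag}(p(1-p)) \Phi$ and $p(1-p) \le 1/4$, so its largest eigenvalue is at most a quarter of the largest eigenvalue of $\Phi^T \Phi$. The name `curvature_bound` and the data are hypothetical.

```python
import numpy as np

def nll(w, Phi, y):
    z = Phi @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)

rng = np.random.default_rng(2)
Phi = np.c_[np.ones(40), rng.normal(size=(40, 2))]  # placeholder design matrix
# Noisy labels that depend on the features, so the gradient is clearly nonzero
y = (Phi[:, 1] - Phi[:, 2] + rng.normal(size=40) > 0).astype(float)
w = np.zeros(3)

p = 1.0 / (1.0 + np.exp(-(Phi @ w)))
gradient = Phi.T @ (p - y)

# Upper bound on the curvature of the summed NLL
curvature_bound = 0.25 * np.linalg.eigvalsh(Phi.T @ Phi).max()
alpha = 1.0 / curvature_bound  # conservative step size

w_new = w - alpha * gradient
loss_before, loss_after = nll(w, Phi, y), nll(w_new, Phi, y)
print(f"alpha = {alpha:.4f}, loss before = {loss_before:.4f}, loss after = {loss_after:.4f}")
```

With a step of this size, a single update cannot increase the loss; larger steps may still work in practice but lose that guarantee.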

## 8. Full Training Loop

Now let's run the complete gradient descent algorithm.

In [None]:
# Reset weights
np.random.seed(42)
weights = np.random.randn(Phi.shape[1]) * 0.01

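# Hyperparameters
learning_rate = 0.1    # applied to the summed (not averaged) gradient, so the effective step grows with the sample count
num_iterations = 1000
tolerance = 1e-6       # stop once the gradient norm falls below this value
epsilon = 1e-15        # clipping constant for the log, same value as in Step 4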

# Storage for tracking
loss_history = []
weight_history = [weights.copy()]

for iteration in range(num_iterations):
    # Forward pass
    scores = Phi @ weights
    probabilities = expit(scores)
    
    # Compute loss
    p_safe = np.clip(probabilities, epsilon, 1 - epsilon)
    nll = -np.sum(y_train * np.log(p_safe) + (1 - y_train) * np.log(1 - p_safe))
    loss_history.append(nll)
    
    # Compute gradient
    errors = probabilities - y_train
    gradient = Phi.T @ errors
    
    # Check convergence
    if np.linalg.norm(gradient) < tolerance:
        print(f"Converged at iteration {iteration}")
        break
    
    # Update weights
    weights = weights - learning_rate * gradient
    weight_history.append(weights.copy())
    
    # Print progress
    if (iteration + 1) % 100 == 0 or iteration == 0:
        print(f"Iteration {iteration+1:4d}: NLL = {nll:10.4f}, ||gradient|| = {np.linalg.norm(gradient):.6f}")

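# Recompute the loss at the final weights (loss_history[-1] may predate the last update)
final_p = np.clip(expit(Phi @ weights), epsilon, 1 - epsilon)
final_nll = -np.sum(y_train * np.log(final_p) + (1 - y_train) * np.log(1 - final_p))
print(f"\nFinal weights: {weights}")
print(f"Final NLL: {final_nll:.4f}")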
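
The loop above can also be packaged as a function. The sketch below is a variant, not the notebook's exact method: it averages the loss and gradient over the samples so that the same learning rate behaves similarly across dataset sizes, and it starts from zero weights. The function name `fit_logistic_gd` and its defaults are assumptions for illustration; the data mimic the two clusters used earlier.

```python
import numpy as np

def fit_logistic_gd(Phi, y, learning_rate=0.1, num_iterations=500):
    """Gradient descent on the mean NLL; returns final weights and per-iteration losses."""
    w = np.zeros(Phi.shape[1])
    losses = []
    for _ in range(num_iterations):
        z = Phi @ w
        losses.append(np.mean(np.logaddexp(0.0, z) - y * z))  # mean NLL at current weights
        p = 1.0 / (1.0 + np.exp(-z))
        w = w - learning_rate * (Phi.T @ (p - y)) / len(y)     # averaged gradient step
    return w, np.array(losses)

# Two Gaussian clusters, similar in spirit to the notebook's data
rng = np.random.default_rng(0)
m = 100
X = np.vstack([rng.normal(loc=[1.5, -1.5], size=(m, 2)),
               rng.normal(loc=[-1.5, 1.5], size=(m, 2))])
y = np.concatenate([np.zeros(m), np.ones(m)])
Phi = np.c_[np.ones(len(X)), X]

w, losses = fit_logistic_gd(Phi, y)
accuracy = np.mean(((Phi @ w) >= 0) == (y == 1))
print(f"weights = {w}, first loss = {losses[0]:.4f}, last loss = {losses[-1]:.4f}, accuracy = {accuracy:.3f}")
```

Starting from zero weights, every probability is 0.5, so the first recorded mean loss equals $\log 2$.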

## 9. Visualize Convergence

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(loss_history, linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Negative Log-Likelihood', fontsize=12)
plt.title('Training Loss Over Time', fontsize=14)
plt.grid(True, alpha=0.3)
plt.show()

## 10. Visualize Decision Boundary

The decision boundary is where $P(y=1|\vec{x}, \vec{w}) = 0.5$, which occurs when $\vec{x}^T \times \vec{w} = 0$.

In [None]:
# Create a mesh for visualization
x1_min, x1_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
x2_min, x2_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 200),
                        np.linspace(x2_min, x2_max, 200))

# Compute probabilities for the mesh
Phi_mesh = np.c_[np.ones(xx1.ravel().shape[0]), xx1.ravel(), xx2.ravel()]
probs_mesh = expit(Phi_mesh @ weights)
probs_mesh = probs_mesh.reshape(xx1.shape)

# Plot
plt.figure(figsize=(12, 8))
plt.contourf(xx1, xx2, probs_mesh, levels=20, cmap='RdBu_r', alpha=0.6)
plt.colorbar(label='P(y=1|x,w)')
plt.contour(xx1, xx2, probs_mesh, levels=[0.5], colors='black', linewidths=2, linestyles='dashed')

plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 
            c='orange', label='Class 0', edgecolors='k', s=50)
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 
            c='skyblue', label='Class 1', edgecolors='k', s=50)

plt.xlabel('$x_1$', fontsize=12)
plt.ylabel('$x_2$', fontsize=12)
plt.title('Logistic Regression Decision Boundary', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
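
With two features, the condition $w_0 + w_1 x_1 + w_2 x_2 = 0$ can be solved for $x_2$, giving the boundary as a straight line. The sketch below does this for illustrative weights (an assumption, not the trained values); the helper name `boundary_line` is hypothetical.

```python
import numpy as np

def boundary_line(w):
    """Return (slope, intercept) of x2 = slope * x1 + intercept where w0 + w1*x1 + w2*x2 = 0."""
    w0, w1, w2 = w
    if np.isclose(w2, 0.0):
        raise ValueError("w2 is ~0: the boundary is vertical and has no slope-intercept form")
    return -w1 / w2, -w0 / w2

w = np.array([1.0, 2.0, -4.0])  # illustrative weights [w0, w1, w2]
slope, intercept = boundary_line(w)

# Every point on the line should have a score of exactly zero (probability 0.5)
x1 = np.linspace(-3, 3, 5)
x2 = slope * x1 + intercept
scores_on_line = w[0] + w[1] * x1 + w[2] * x2

print(slope, intercept, scores_on_line)
```

Applying the same helper to a few snapshots from `weight_history` would show how the line moves as training progresses.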

## 11. Make Predictions and Evaluate

In [None]:
# Predict probabilities
y_proba = expit(Phi @ weights)

# Predict classes (threshold = 0.5)
y_pred = (y_proba >= 0.5).astype(int)

# Compute accuracy
accuracy = np.mean(y_pred == y_train)
print(f"Training Accuracy: {accuracy:.4f}")

# Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_train, y_pred)
print(f"\nConfusion Matrix:")
print(cm)
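
To make the confusion matrix less of a black box, the sketch below builds one by hand with NumPy on a small made-up label set, using the same layout as scikit-learn (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Toy labels for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# cm[i, j] counts samples whose true class is i and predicted class is j
cm = np.zeros((2, 2), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)

# Accuracy is the share of samples on the diagonal
accuracy = np.trace(cm) / cm.sum()

print(cm)
print(f"Accuracy: {accuracy:.4f}")
```

Here `cm[0, 1]` counts false positives and `cm[1, 0]` counts false negatives, which accuracy alone does not distinguish.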

## 12. Comparison with scikit-learn

In [None]:
from sklearn.linear_model import LogisticRegression

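# Train sklearn model without regularization
# (penalty=None requires scikit-learn >= 1.2; older releases use penalty='none')
sklearn_model = LogisticRegression(penalty=None, max_iter=1000)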
sklearn_model.fit(X_train, y_train)

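# Compare weights. These agree closely only if both optimizers have converged;
# on linearly separable data the unregularized optimum is unbounded, so the
# weights can differ even when the accuracies match.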
print("Our weights:     ", weights)
print("sklearn weights: ", np.concatenate([sklearn_model.intercept_, sklearn_model.coef_[0]]))

# Compare accuracy
sklearn_accuracy = sklearn_model.score(X_train, y_train)
print(f"\nOur accuracy:      {accuracy:.4f}")
print(f"sklearn accuracy:  {sklearn_accuracy:.4f}")

## Summary

In this walkthrough, we:

1. Generated synthetic binary classification data
2. Understood the **sigmoid function** that maps scores to probabilities
3. Implemented **gradient descent** to minimize the negative log-likelihood loss
4. Visualized the **decision boundary** learned by the model
5. Compared our implementation with scikit-learn

**Key Takeaways:**
- Logistic regression models $P(y=1|\vec{x}, \vec{w}) = \sigma(\vec{x}^T \times \vec{w})$
- The loss function is the negative log-likelihood (binary cross-entropy)
- Gradient descent iteratively adjusts weights to minimize loss
- The decision boundary is linear: $\vec{x}^T \times \vec{w} = 0$