<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Logistic%20Regression/Logistic%20Regression%20Hands-On%20Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic Regression: Hands-On Lab

## Learning Objectives

By the end of this lab, you will be able to:

1. **Understand** how logistic regression models binary classification problems using the sigmoid function
2. **Implement** a custom logistic regression classifier using gradient descent
3. **Apply** logistic regression to real-world classification datasets
4. **Evaluate** model performance using accuracy, precision, recall, and F1-score
5. **Optimize** hyperparameters using K-fold cross-validation
6. **Visualize** decision boundaries and probability distributions

## Algorithm Overview

**Logistic Regression** is a classification algorithm that models the probability of a binary outcome:

$$P(y=1|\vec{x}, \vec{w}) = \sigma(\vec{x}^T \vec{w}) = \frac{1}{1 + e^{-\vec{x}^T \vec{w}}}$$

Where:
- $\vec{x}$ is the input feature vector
- $\vec{w}$ is the weight vector
- $\sigma$ is the sigmoid (logistic) function

**Loss Function** (Negative Log-Likelihood):

$$J(\vec{w}) = -\sum_{i=1}^{N} \left[ y^{(i)} \log p^{(i)} + (1-y^{(i)}) \log(1-p^{(i)}) \right]$$

**Gradient:**

$$\nabla_{\vec{w}} J = \Phi^T (\vec{p} - \vec{y})$$

**Gradient Descent Update:**

$$\vec{w} = \vec{w} - \alpha \nabla_{\vec{w}} J$$

Where $\alpha$ is the learning rate.
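
The gradient formula can be recovered with the chain rule. The sketch below assumes $\vec{\phi}^{(i)}$ is the $i$-th feature vector with a leading 1 for the bias (the rows of $\Phi$), $z^{(i)} = \vec{\phi}^{(i)T} \vec{w}$, and $p^{(i)} = \sigma(z^{(i)})$. It also uses the identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$.

$$
\begin{aligned}
\frac{\partial J}{\partial p^{(i)}} &= -\frac{y^{(i)}}{p^{(i)}} + \frac{1-y^{(i)}}{1-p^{(i)}} = \frac{p^{(i)} - y^{(i)}}{p^{(i)}\left(1-p^{(i)}\right)} \\
\frac{\partial p^{(i)}}{\partial z^{(i)}} &= p^{(i)}\left(1-p^{(i)}\right), \qquad \nabla_{\vec{w}}\, z^{(i)} = \vec{\phi}^{(i)} \\
\nabla_{\vec{w}} J &= \sum_{i=1}^{N} \left(p^{(i)} - y^{(i)}\right) \vec{\phi}^{(i)} = \Phi^T (\vec{p} - \vec{y})
\end{aligned}
$$

The $p^{(i)}(1-p^{(i)})$ factors cancel, which is why the final expression is so compact.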

---

### Why These Components?

**Why sigmoid?** The sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ is used because:
- It maps any real number to (0,1), perfect for probabilities
- It's differentiable everywhere (needed for gradient descent)
- It has nice mathematical properties: $\sigma'(z) = \sigma(z)(1-\sigma(z))$
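
The derivative identity in the last bullet can be checked numerically. The following is a minimal sketch, assuming `scipy.special.expit` as the sigmoid; the grid of `z` values and the step size `h` are arbitrary choices for illustration.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

z = np.linspace(-6.0, 6.0, 25)
h = 1e-5

# Central finite-difference approximation of sigma'(z)
numeric = (expit(z + h) - expit(z - h)) / (2 * h)

# Closed-form identity: sigma'(z) = sigma(z) * (1 - sigma(z))
analytic = expit(z) * (1 - expit(z))

print("max abs difference:", np.max(np.abs(numeric - analytic)))
print("largest slope:", analytic.max(), "at z =", z[analytic.argmax()])
```

The slope peaks at $z = 0$ and shrinks toward zero for large $|z|$, which is why very confident predictions change slowly during training.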

**Why NLL loss?** The negative log-likelihood is the natural loss function for probabilistic classification because it directly measures how well our predicted probabilities match the true labels. Maximizing likelihood = minimizing NLL.
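
To make the "maximizing likelihood = minimizing NLL" step explicit, here is a short derivation. It assumes the training samples are independent and that each label follows a Bernoulli distribution with parameter $p^{(i)} = \sigma(\vec{x}^{(i)T} \vec{w})$.

$$
\begin{aligned}
P(y^{(i)} \mid \vec{x}^{(i)}, \vec{w}) &= \left(p^{(i)}\right)^{y^{(i)}} \left(1-p^{(i)}\right)^{1-y^{(i)}} \\
\log \prod_{i=1}^{N} P(y^{(i)} \mid \vec{x}^{(i)}, \vec{w}) &= \sum_{i=1}^{N} \left[ y^{(i)} \log p^{(i)} + (1-y^{(i)}) \log(1-p^{(i)}) \right] = -J(\vec{w})
\end{aligned}
$$

Because the logarithm is monotonic, the weights that maximize the likelihood are exactly the weights that minimize $J(\vec{w})$.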

**Why gradient descent?** Unlike linear regression, logistic regression has no closed-form solution. Gradient descent iteratively finds the optimal weights by following the direction that most reduces the loss.

## When to Use Logistic Regression

Logistic regression is a fundamental classification algorithm with specific strengths and limitations. Understanding when to use it is crucial for effective model selection.

### ✅ Use Logistic Regression When:

**1. Binary Classification with Linearly Separable Features**
- Classes can be separated by a linear decision boundary (or made separable with feature transformations)
- Examples: spam detection, medical diagnosis (disease/no disease), customer churn prediction

**2. Need Probabilistic Outputs**
- When you need P(y=1|x) probabilities, not just class predictions
- Critical for risk assessment, confidence scoring, or threshold tuning
- Example: "This email has 87% probability of being spam"

**3. Interpretable Coefficients Required**
- Each feature has a clear weight showing its contribution
- Positive coefficient → feature increases probability of class 1
- Negative coefficient → feature decreases probability of class 1
- Essential in healthcare, finance, and regulated industries

**4. Small to Medium Datasets**
- Works well with datasets from hundreds to hundreds of thousands of samples
- Efficient training with gradient descent or second-order methods such as Newton's method (IRLS)
- Lower computational cost than complex models

**5. Baseline Model for Comparison**
- Start with logistic regression as a simple, interpretable baseline
- Compare more complex models against it to justify added complexity
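
One common way to read the coefficients from point 3 is through odds: since the log-odds equal $\vec{x}^T \vec{w}$, raising feature $j$ by one unit multiplies the odds of class 1 by $e^{w_j}$. The sketch below illustrates this with hypothetical weights (`w` is made up for the example, not learned in this lab).

```python
import numpy as np
from scipy.special import expit

# Hypothetical weights for illustration only: [bias, w_1, w_2]
w = np.array([-1.0, 0.7, -0.4])

def odds(x):
    """Odds P(y=1|x) / P(y=0|x) for a feature vector x (bias handled separately)."""
    p = expit(w[0] + x @ w[1:])
    return p / (1 - p)

x = np.array([2.0, 1.0])
x_plus = x + np.array([1.0, 0.0])  # increase feature 1 by one unit

ratio = odds(x_plus) / odds(x)
print(f"odds ratio for +1 in feature 1: {ratio:.4f}")
print(f"exp(w_1):                       {np.exp(w[1]):.4f}")
```

The two printed numbers agree regardless of the starting point `x`, which is what makes the coefficients easy to communicate.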

### ❌ Don't Use Logistic Regression When:

**1. Highly Non-Linear Decision Boundaries**
- If classes require complex, curved boundaries that can't be approximated with polynomial features
- **Better alternatives**: Kernel SVM, Decision Trees, Random Forests, Neural Networks

**2. Many Categorical Features**
- Logistic regression struggles with high-cardinality categorical features
- One-hot encoding creates many sparse features
- **Better alternatives**: Tree-based methods (Random Forest, XGBoost, LightGBM)

**3. Very Large Datasets with Many Features**
- Standard gradient descent can be slow for millions of samples
- **Better alternatives**: SGD-based approaches, Neural Networks with mini-batch training

**4. Multiclass Classification (Without Extensions)**
- Standard logistic regression is binary; needs One-vs-Rest or multinomial extensions
- **Better alternatives**: Softmax regression, tree-based methods, neural networks

**5. Need Feature Interactions Without Manual Engineering**
- Logistic regression requires explicit polynomial features for interactions
- **Better alternatives**: Decision Trees (automatically find interactions), Neural Networks
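
Points 1 and 5 can be seen on the classic XOR pattern. The sketch below is an illustration, assuming scikit-learn's `LogisticRegression` and `PolynomialFeatures`; the value `C=1e4` (weak regularization) is a choice made for this toy example, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# XOR: no straight line separates the two classes
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0])

# Large C means weak regularization (an assumption chosen for this toy example)
linear = LogisticRegression(C=1e4).fit(X_xor, y_xor)
acc_linear = linear.score(X_xor, y_xor)

# Add the interaction term x1*x2 explicitly
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X_xor)  # columns: x1, x2, x1*x2
with_inter = LogisticRegression(C=1e4).fit(X_inter, y_xor)
acc_inter = with_inter.score(X_inter, y_xor)

print(f"raw features:     accuracy = {acc_linear:.2f}")
print(f"with interaction: accuracy = {acc_inter:.2f}")
```

On the raw features a linear boundary must misclassify at least one of the four points; once the product term is supplied by hand, the same model can separate them.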

### Quick Comparison: Logistic Regression vs Other Classifiers

| Criterion | Logistic Regression | Decision Trees | SVM (RBF) | Neural Networks |
|-----------|-------------------|----------------|-----------|-----------------|
| **Linear boundaries** | ✅ Excellent | ❌ Weak | ✅ Good | ✅ Good |
| **Non-linear boundaries** | ⚠️ Manual features | ✅ Excellent | ✅ Excellent | ✅ Excellent |
| **Interpretability** | ✅ Excellent | ✅ Good | ❌ Poor | ❌ Very Poor |
| **Probabilistic output** | ✅ Natural | ⚠️ Approximation | ⚠️ Calibration needed | ✅ Softmax |
| **Training speed** | ✅ Fast | ✅ Fast | ⚠️ Moderate | ❌ Slow |
| **Small datasets** | ✅ Excellent | ✅ Good | ✅ Excellent | ❌ Overfits |
| **Large datasets** | ✅ Good | ✅ Excellent | ⚠️ Slow | ✅ Excellent |
| **Categorical features** | ⚠️ One-hot | ✅ Native | ⚠️ One-hot | ⚠️ Embedding |

### Real-World Applications Where Logistic Regression Excels:

1. **Medical Diagnosis**: Predicting disease presence based on symptoms and test results
2. **Credit Scoring**: Assessing loan default risk with interpretable coefficients for regulators
3. **Marketing**: Predicting customer purchase probability or email click-through rates
4. **Fraud Detection**: Identifying fraudulent transactions (when combined with good features)
5. **Customer Churn**: Predicting which customers will leave a service
6. **A/B Testing**: Analyzing treatment effects in experiments

### The Bottom Line:

**Logistic regression is your go-to algorithm when:**
- You need a simple, fast, interpretable binary classifier
- Decision boundaries are approximately linear (or can be made so with feature engineering)
- You need calibrated probability estimates
- You're establishing a baseline before trying complex models

**Consider alternatives when:**
- Decision boundaries are highly non-linear and feature engineering is impractical
- You have massive datasets and need maximum predictive power
- Interpretability is not a requirement

## Pseudocode for Logistic Regression

```
# Logistic Regression: Gradient Descent on NLL
# Inputs
# data ← (X, y) with y ∈ {0,1}
# η ← learning rate (written α in the formulas above)
# max_iter ← maximum iterations
# tol ← stop when ||∇L(w)|| ≤ tol
# X_query ← examples to predict

# ----- fit -----
Φ ← concat_column(ones(N), X)      # design matrix with bias
w ← zeros(columns(Φ))               # initialize

# NLL: L(w) = - Σ [ y log p + (1−y) log(1−p) ], p = σ(Φw)
FOR t = 1 TO max_iter DO
    z ← Φ · w
    p ← 1 / (1 + exp(−z))           # sigmoid
    g ← transpose(Φ) · (p − y)      # ∇L(w)
    IF norm(g) ≤ tol THEN BREAK
    w ← w − η · g                   # GD step
END FOR

# ----- predict -----
Φ* ← concat_column(ones(|X_query|), X_query)
p* ← 1 / (1 + exp(−Φ* · w))
ŷ ← 1 if p* ≥ 0.5 else 0
RETURN p*, ŷ
```

**Note:** The **design matrix** Φ is the feature matrix X with a column of 1s prepended for the bias term (intercept). This allows us to include the bias in the weight vector w, simplifying the math: instead of computing $w_0 + x_1w_1 + x_2w_2$, we compute $\vec{\phi}^T \cdot \vec{w}$ where $\vec{\phi} = [1, x_1, x_2]$.
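
As a quick illustration of the note above, the sketch below builds a design matrix with `np.c_` and confirms that one matrix product reproduces the explicit "bias plus weighted features" sum. The sample values and weights are made up for the example.

```python
import numpy as np

X = np.array([[2.0, -1.0],
              [0.5,  3.0]])          # two samples, two features
w = np.array([0.1, 0.4, -0.2])       # illustrative weights: [w0, w1, w2]

# Prepend a column of ones so the bias w0 is handled by the same dot product
Phi = np.c_[np.ones(X.shape[0]), X]

z_design = Phi @ w                   # phi^T w for every row at once
z_explicit = w[0] + X @ w[1:]        # w0 + x1*w1 + x2*w2

print(Phi)
print(z_design, z_explicit)
```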

## Learning Rate

Now that we've seen the algorithm, let's understand the **learning rate** $\alpha$ - a critical hyperparameter that controls how much we adjust weights at each iteration:

- **Too small**: Slow convergence, may take many iterations
- **Too large**: May overshoot the minimum, fail to converge
- **Just right**: Converges efficiently to the optimum

Typical values: $\alpha \in [0.001, 0.1]$ for normalized features.
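
The three regimes above can be reproduced on a tiny problem. The sketch below is an illustration under stated assumptions: a hand-made 1-D dataset with two flipped labels (so it is not separable), a summed (not averaged) gradient as in this lab's formulas, and three learning rates picked to show each regime on this particular dataset.

```python
import numpy as np
from scipy.special import expit

# Small deterministic 1-D dataset with two flipped labels so it is not separable
x = np.linspace(-2.0, 2.0, 41)
y = (x > 0).astype(float)
y[5], y[35] = 1.0, 0.0
Phi = np.c_[np.ones_like(x), x]

def nll(w):
    z = Phi @ w
    # log(1 + e^z) - y*z equals the NLL; logaddexp avoids overflow
    return np.sum(np.logaddexp(0.0, z) - y * z)

def run_gd(alpha, n_iter=50):
    w = np.zeros(2)
    losses = [nll(w)]
    for _ in range(n_iter):
        w = w - alpha * (Phi.T @ (expit(Phi @ w) - y))
        losses.append(nll(w))
    return np.array(losses)

histories = {alpha: run_gd(alpha) for alpha in (0.001, 0.05, 1.0)}
for alpha, losses in histories.items():
    went_up = np.any(np.diff(losses) > 0)
    print(f"alpha={alpha:<6} final NLL={losses[-1]:8.3f}  loss ever increased: {went_up}")
```

Because the gradient here is a sum over samples, the step size that counts as "too large" shrinks as the dataset grows; averaging the gradient is one way to make a given $\alpha$ less dependent on $N$.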

**Why does feature scaling affect learning rate?** If features have different scales (e.g., $x_1 \in [0,1]$ and $x_2 \in [0,10000]$), the gradient components will have very different magnitudes. A learning rate suitable for $x_2$ might be too large for $x_1$, causing oscillation. Feature scaling ensures all gradients are on similar scales, allowing one learning rate to work well for all features.
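
The effect of scale on the gradient can be made concrete. The sketch below uses synthetic data invented for this illustration (the label rule and sample size are assumptions) and compares the gradient at $\vec{w} = 0$ before and after standardization, using the same transform that `StandardScaler` applies.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.uniform(0, 1, n)         # small-scale feature
x2 = rng.uniform(0, 10_000, n)    # large-scale feature
y = (x1 + x2 / 10_000 > 1).astype(float)   # synthetic labels for illustration

def grad_at_zero(X):
    """Gradient X^T (p - y) at w = 0, where every p equals 0.5 (bias omitted)."""
    return X.T @ (0.5 - y)

X_raw = np.column_stack([x1, x2])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)   # same transform as StandardScaler

g_raw = grad_at_zero(X_raw)
g_std = grad_at_zero(X_std)
print("raw gradient:         ", g_raw)
print("standardized gradient:", g_std)
```

On the raw features the second gradient component is several orders of magnitude larger than the first, so no single step size suits both; after standardization the two components are comparable.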

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression
from scipy.special import expit  # expit is the numerically stable sigmoid function: σ(z) = 1/(1+e^(-z))
import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

## Exercise 1: Implement Sigmoid and Prediction

Welcome to the hands-on implementation! We'll build the MyLogisticRegression class in **three independent exercises** to help you test and debug each component.

**In this exercise, you'll implement:**
- `_sigmoid()`: The sigmoid activation function
- `predict_proba()`: Probability prediction method

**Why separate exercises?**
- Test each component immediately
- No cascading failures
- Build confidence step-by-step
- Easy to debug if something goes wrong

In [None]:
class MyLogisticRegression(BaseEstimator, ClassifierMixin):
    """
    Custom Logistic Regression classifier using gradient descent.

    Parameters
    ----------
    learning_rate : float, default=0.01
        Step size for gradient descent updates
    max_iter : int, default=1000
        Maximum number of iterations
    tol : float, default=1e-6
        Tolerance for gradient norm to declare convergence
    random_state : int, default=None
        Random seed for weight initialization
    """

    def __init__(self, learning_rate=0.01, max_iter=1000, tol=1e-6, random_state=None):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.tol = tol
        self.random_state = random_state

    def _sigmoid(self, z):
        """
        Compute the sigmoid function: σ(z) = 1 / (1 + e^(-z))

        Parameters
        ----------
        z : array-like
            Input values (scores)

        Returns
        -------
        sigmoid : array-like
            Sigmoid outputs in range (0, 1)

        Notes
        -----
        We use scipy.special.expit for numerical stability instead of 1/(1+exp(-z))
        """
        # TODO: Implement sigmoid function using expit from scipy.special
        # Hint: expit(z) computes 1/(1+exp(-z)) in a numerically stable way
        return None

    def predict_proba(self, X):
        """
        Predict class probabilities.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Samples

        Returns
        -------
        proba : array, shape (n_samples, 2)
            Probabilities for each class [P(y=0), P(y=1)]
        """
        # TODO: Create design matrix Phi by adding column of 1s to X for bias
        # Hint: Use np.c_[np.ones(X.shape[0]), X]
        Phi = None

        # TODO: Compute scores z = Phi @ weights
        scores = None

        # TODO: Apply sigmoid to get P(y=1|X) using your _sigmoid method
        p1 = None

        # Return probabilities for both classes [P(y=0), P(y=1)]
        return np.column_stack([1 - p1, p1])

In [None]:
print("=" * 70)
print("EXERCISE 1 VERIFICATION: Testing Sigmoid and Prediction")
print("=" * 70)

# Create a test instance
model_ex1 = MyLogisticRegression()

# Provide fixed weights [bias, w1, w2] so predict_proba can be tested before fit() exists
# These are illustrative values; the expected outputs below are computed from them
model_ex1.weights_ = np.array([0.5, 1.2, -0.8])

# Test data: 4 samples, 2 features
X_test_ex1 = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [1.0, 0.0],
    [0.0, 1.0]
])

print("\n1. Testing _sigmoid function:")
print("-" * 70)
test_scores = np.array([-2, -1, 0, 1, 2])
sigmoid_output = model_ex1._sigmoid(test_scores)
print(f"Input scores:  {test_scores}")
print(f"Sigmoid output: {sigmoid_output}")
print(f"Expected:       [0.119  0.269  0.500  0.731  0.881] (approximately)")

# Verify sigmoid properties
print(f"\n✓ All values in (0,1)? {np.all((sigmoid_output > 0) & (sigmoid_output < 1))}")
print(f"✓ sigmoid(0) ≈ 0.5? {np.abs(model_ex1._sigmoid(0) - 0.5) < 0.001}")
print(f"✓ sigmoid(-z) = 1 - sigmoid(z)? {np.allclose(model_ex1._sigmoid(-test_scores), 1 - model_ex1._sigmoid(test_scores))}")

print("\n2. Testing predict_proba function:")
print("-" * 70)
probabilities = model_ex1.predict_proba(X_test_ex1)
print(f"Test samples:\n{X_test_ex1}\n")
print(f"Predicted probabilities [P(y=0), P(y=1)]:")
for i, (x, probs) in enumerate(zip(X_test_ex1, probabilities)):
    print(f"  Sample {i} {x}: P(y=0)={probs[0]:.3f}, P(y=1)={probs[1]:.3f}")

print("\nExpected probabilities (approximately):")
print("  Sample 0 [0. 0.]: P(y=0)=0.378, P(y=1)=0.622")
print("  Sample 1 [1. 1.]: P(y=0)=0.289, P(y=1)=0.711")
print("  Sample 2 [1. 0.]: P(y=0)=0.154, P(y=1)=0.846")
print("  Sample 3 [0. 1.]: P(y=0)=0.574, P(y=1)=0.426")

# Verify probability properties
print("\n✓ Probabilities sum to 1? ", np.allclose(probabilities.sum(axis=1), 1.0))
print(f"✓ All probabilities in [0,1]? {np.all((probabilities >= 0) & (probabilities <= 1))}")

print("\n" + "=" * 70)
print("If your outputs match the expected values, proceed to Exercise 2!")
print("=" * 70)

### ✅ Checkpoint Question 1: What does the sigmoid function do?

A) Maps any real-valued number to the range zero to one, which can be interpreted as a probability for binary classification tasks

B) Maps probability values between zero and one to their corresponding unbounded real-valued logit scores on the entire number line

C) Computes the gradient of the negative log-likelihood loss function with respect to the model weights during gradient descent optimization

D) Normalizes input features to have zero mean and unit variance by subtracting mean and dividing by standard deviation

<details>
<summary>Click to see answer</summary>

**Answer: A**

**Key Insight:** The sigmoid function σ(z) = 1/(1 + e^(-z)) is the bridge between linear combinations of features (which can be any real number) and probabilities (which must be between 0 and 1). It enables logistic regression to output valid probability estimates for classification.

**Detailed Explanation:**

The sigmoid function maps any real-valued input to a value between 0 and 1, which can be interpreted as a probability. For example:
- σ(-5) ≈ 0.007 (very low probability)
- σ(0) = 0.5 (neutral/boundary)
- σ(5) ≈ 0.993 (very high probability)

This is essential for logistic regression because:
1. The linear combination x^T·w can be any real number
2. We need P(y=1|x) which must be between 0 and 1
3. The sigmoid provides this transformation smoothly

**Why other answers are incorrect:**

- **B is FALSE**: This describes the inverse of the sigmoid (logit function: logit(p) = log(p/(1-p))), which maps probabilities (0,1) back to real numbers (-∞,+∞). The logit is used in deriving logistic regression but is not what the sigmoid does.
- **C is FALSE**: The sigmoid is used in computing predictions (forward pass), not gradients. The gradient is computed using the derivative of the loss function: ∇J = Φ^T(p - y), where p comes from applying sigmoid but the gradient computation is a separate step.
- **D is FALSE**: This describes feature standardization (StandardScaler in scikit-learn), which is a preprocessing step unrelated to the sigmoid function. Standardization is applied to input features before training, while sigmoid is applied to the model's output scores.

</details>
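
The sigmoid/logit relationship discussed in this checkpoint can be verified directly. The sketch below assumes `scipy.special.logit` as the inverse of `expit`; the score values are arbitrary.

```python
import numpy as np
from scipy.special import expit, logit

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = expit(z)          # scores -> probabilities in (0, 1)
z_back = logit(p)     # probabilities -> scores, logit(p) = log(p / (1 - p))

print("p      =", np.round(p, 3))
print("z_back =", np.round(z_back, 3))
```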

## Exercise 2: Implement Gradient Computation

Excellent! You now have working sigmoid and prediction functions. Let's implement the **gradient computation** - the core of how the model learns.

**What you'll implement:**
- `_compute_gradient()`: Computes ∇J = Φᵀ (p - y)

**Why this matters:**
The gradient tells us the direction and magnitude to adjust each weight to reduce the loss. Without a correct gradient, the model can't learn!

**Testing approach:**
We'll test the gradient computation with known values before using it in training.

In [None]:
# Add gradient computation method to MyLogisticRegression
def _compute_gradient(self, Phi, y, probabilities):
    """
    Compute the gradient of NLL loss with respect to weights.

    Parameters
    ----------
    Phi : array-like, shape (n_samples, n_features + 1)
        Design matrix (X with bias column)
    y : array-like, shape (n_samples,)
        True labels (0 or 1)
    probabilities : array-like, shape (n_samples,)
        Predicted probabilities P(y=1|X)

    Returns
    -------
    gradient : array, shape (n_features + 1,)
        Gradient vector ∇J = Φᵀ (p - y)
    """
    # TODO: Compute gradient using the formula: Φᵀ (p - y)
    # Hint: Use @ operator or np.dot() for matrix multiplication
    # Shape check: Phi is (N, D), (p-y) is (N,), result should be (D,)
    gradient = None
    return gradient

# Add the method to the class
MyLogisticRegression._compute_gradient = _compute_gradient

### ✅ Checkpoint Question 2: What is the gradient for logistic regression?

A) The gradient is computed as the design matrix transpose times the difference between true labels and predictions: Φᵀ (y - p)

B) The gradient is computed as the design matrix transpose times the difference between predictions and true labels: Φᵀ (p - y)

C) The gradient is negative two times the design matrix transpose times the difference between true labels and predictions: -2Φᵀ (y - p)

D) The gradient is computed as the design matrix times the difference between predictions and true labels without transposition: Φ (p - y)

<details>
<summary>Click to see answer</summary>

**Answer: B**

**Key Insight:** The gradient ∇J = Φᵀ(p - y) tells us how to adjust each weight to reduce loss. When predictions p are too high (p > y), the gradient is positive, so gradient descent subtracts from weights, reducing future predictions. When p is too low (p < y), the gradient is negative, so we add to weights, increasing predictions.

**Detailed Explanation:**

The gradient of the negative log-likelihood loss with respect to weights is:
$$\nabla_{\vec{w}} J = \Phi^T (\vec{p} - \vec{y})$$

Where:
- Φ is the design matrix (N × D) with a column of 1s for bias
- p is the vector of predicted probabilities (N × 1)
- y is the vector of true labels (N × 1)
- The result is a D-dimensional vector showing how much to change each weight

**Example with numbers:**
If p = [0.9, 0.2, 0.7] and y = [1, 0, 1]:
- (p - y) = [-0.1, 0.2, -0.3]
- Then Φᵀ multiplies these errors by each feature's values
- Features that are active when p > y will have positive gradients (decrease weight)
- Features that are active when p < y will have negative gradients (increase weight)

**Gradient descent update:**
w_new = w_old - α · ∇J = w_old - α · Φᵀ(p - y)

**Why other answers are incorrect:**

- **A is FALSE**: Φᵀ(y - p) has the wrong sign. It is the gradient of the log-likelihood, which is the right direction for gradient *ascent* on the likelihood. Plugging it into the descent update w ← w - α∇J would move the weights uphill on the NLL, making the loss increase.
- **C is FALSE**: The -2 coefficient appears in the gradient of Mean Squared Error (MSE), not NLL. For MSE, the gradient is ∇J = -2Φᵀ(y - ŷ) because d/dw[(y-ŷ)²] = 2(y-ŷ)·(-dŷ/dw). This is not applicable to logistic regression's loss function.
- **D is FALSE**: Missing the transpose means the dimensions don't match. Φ is (N × D) and (p - y) is (N × 1), so the product Φ(p - y) is not even defined unless N = D, and it could never yield the required (D × 1) gradient in general. We need Φᵀ, which is (D × N), to get the correct (D × 1) shape.

</details>
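
A standard way to gain confidence in a gradient formula is to compare it with finite differences of the loss. The sketch below is one possible check, not part of the lab's verification cells: the helper names (`nll`, `analytic_grad`, `numeric_grad`) and the data are made up for illustration.

```python
import numpy as np
from scipy.special import expit

def nll(w, Phi, y):
    z = Phi @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)   # stable form of the NLL

def analytic_grad(w, Phi, y):
    return Phi.T @ (expit(Phi @ w) - y)

def numeric_grad(w, Phi, y, h=1e-6):
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = h
        # Central difference along coordinate j
        grad[j] = (nll(w + step, Phi, y) - nll(w - step, Phi, y)) / (2 * h)
    return grad

# Illustrative values (hypothetical, chosen only for this check)
Phi = np.array([[1.0, 0.0, 0.0],
                [1.0, 1.0, 1.0],
                [1.0, 1.0, 0.0],
                [1.0, 0.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])
w = np.array([0.1, 0.2, -0.1])

g_analytic = analytic_grad(w, Phi, y)
g_numeric = numeric_grad(w, Phi, y)
print("analytic:", g_analytic)
print("numeric: ", g_numeric)
```

If the sign or the transpose were wrong, the two vectors would disagree immediately, which makes this a cheap test to run before training.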

In [None]:
print("=" * 70)
print("EXERCISE 2 VERIFICATION: Testing Gradient Computation")
print("=" * 70)

# Test gradient computation with known values
Phi_test = np.array([
    [1.0, 0.5, -0.3],
    [1.0, -0.2, 0.8],
    [1.0, 0.9, 0.1],
    [1.0, -0.6, -0.5]
])

y_test = np.array([0, 1, 1, 0])
p_test = np.array([0.3, 0.7, 0.8, 0.2])

model_ex2 = MyLogisticRegression()
gradient = model_ex2._compute_gradient(Phi_test, y_test, p_test)

print("\nTest setup:")
print(f"Design matrix Phi shape: {Phi_test.shape}")
print(f"True labels y: {y_test}")
print(f"Predicted probabilities p: {p_test}")

print(f"\nComputed gradient: {gradient}")
print(f"Expected gradient: [ 0.   -0.09 -0.45] (approximately)")

# Verify gradient properties
print("\n✓ Gradient shape correct?", gradient.shape == (3,))
print(f"✓ Gradient values reasonable? {np.all(np.abs(gradient) < 10)}")

# Test on a small hand-made sample
print("\n" + "-" * 70)
print("Testing gradient on a small sample:")
print("-" * 70)

# Use a small sample
np.random.seed(42)
X_small = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])
y_small = np.array([0, 1, 1, 0])

# Create design matrix
Phi_small = np.c_[np.ones(X_small.shape[0]), X_small]

# Initialize some weights
weights_test = np.array([0.1, 0.2, -0.1])

# Compute probabilities
scores = Phi_small @ weights_test
probs = expit(scores)

# Compute gradient
grad = model_ex2._compute_gradient(Phi_small, y_small, probs)

print(f"Weights: {weights_test}")
print(f"Computed probabilities: {probs}")
print(f"Gradient: {grad}")

print("\nGradient interpretation:")
# Gradient descent moves against the gradient: a positive component lowers that weight
print(f"  - Bias gradient: {grad[0]:.4f} ({'decrease' if grad[0] > 0 else 'increase'} intercept)")
print(f"  - Feature 1 gradient: {grad[1]:.4f} ({'decrease' if grad[1] > 0 else 'increase'} weight)")
print(f"  - Feature 2 gradient: {grad[2]:.4f} ({'decrease' if grad[2] > 0 else 'increase'} weight)")

print("\n" + "=" * 70)
print("If your gradient computation works, proceed to Exercise 3!")
print("=" * 70)

## Exercise 3: Implement Full Training Loop

Excellent! You now have working components:
- ✅ Sigmoid function
- ✅ Probability prediction
- ✅ Gradient computation

Now let's put it all together and implement the **complete training loop** with gradient descent!

**What you'll implement:**
- `fit()`: The main training method using gradient descent
- `predict()`: Convert probabilities to class labels (threshold at 0.5)

**What happens in training:**
1. Initialize weights randomly
2. For each iteration:
   - Compute predictions
   - Calculate loss (Negative Log-Likelihood)
   - Compute gradient
   - Update weights: w = w - α∇J
   - Check convergence

**Testing:**
We'll train on a simple dataset and verify the model learns correctly.

### ✅ Checkpoint Question 3: What is the loss function for logistic regression?

A) Sum of Squared Errors averaged over all training samples, measuring the squared difference between predictions and true labels continuously

B) Mean Squared Error computed between predicted probability values and true binary labels, penalizing errors proportional to their squared magnitude

C) Negative Log-Likelihood also known as Binary Cross-Entropy loss, penalizing confident wrong predictions more heavily than uncertain ones

D) Mean Absolute Error summed across all predictions and labels, measuring the average absolute deviation from correct classification labels

<details>
<summary>Click to see answer</summary>

**Answer: C**

**Key Insight:** Logistic regression uses Negative Log-Likelihood (NLL) because it's derived from maximum likelihood estimation for Bernoulli distributions. It heavily penalizes confident wrong predictions (e.g., predicting p=0.95 when y=0) while being lenient on uncertain predictions near 0.5.

**Detailed Explanation:**

The NLL loss for logistic regression is:
$$J(\vec{w}) = -\sum_{i=1}^{N} \left[ y^{(i)} \log p^{(i)} + (1-y^{(i)}) \log(1-p^{(i)}) \right]$$

This loss function has important properties:
- When y=1 and p→1: loss → 0 (correct confident prediction)
- When y=1 and p→0: loss → ∞ (wrong confident prediction, heavily penalized)
- When y=0 and p→0: loss → 0 (correct confident prediction)
- When y=0 and p→1: loss → ∞ (wrong confident prediction, heavily penalized)

**Example with numbers:**
- True label y=1, predicted p=0.9: loss = -log(0.9) = 0.105 (small)
- True label y=1, predicted p=0.1: loss = -log(0.1) = 2.303 (large!)
- True label y=1, predicted p=0.5: loss = -log(0.5) = 0.693 (moderate)

**Why other answers are incorrect:**

- **A is FALSE**: Sum of Squared Errors (SSE) is used for linear regression: Σ(y - ŷ)². Applied to probabilities, its penalty is bounded: even the most confident wrong prediction (p=1 when y=0) costs at most 1, whereas NLL grows without bound. Combined with the sigmoid, SSE is also non-convex in the weights, which makes gradient descent less reliable.
- **B is FALSE**: Mean Squared Error (MSE = 1/N Σ(y - p)²) is also for regression. While it could theoretically be used for classification, it doesn't have the proper probabilistic interpretation and doesn't penalize confident mistakes as heavily as NLL.
- **D is FALSE**: Mean Absolute Error (MAE = 1/N Σ|y - p|) is also a regression loss. Its penalty grows only linearly with the error, so a confident wrong prediction is punished far less than under NLL.

</details>
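
The difference between NLL and squared error is easiest to see side by side. The sketch below tabulates both per-sample losses for a true label of 1; the probability values are arbitrary examples.

```python
import numpy as np

p = np.array([0.9, 0.5, 0.1, 0.001])   # predicted P(y=1) when the true label is y = 1

nll = -np.log(p)          # per-sample negative log-likelihood
sq_err = (1 - p) ** 2     # per-sample squared error

for pi, a, b in zip(p, nll, sq_err):
    print(f"p={pi:<6} NLL={a:6.3f}  squared error={b:5.3f}")
```

As `p` approaches 0 the squared error saturates just below 1, while the NLL keeps growing, which is the "heavy penalty for confident mistakes" described above.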

In [None]:
# Add fit and predict methods to complete the class
def fit(self, X, y):
    """
    Fit the logistic regression model using gradient descent.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Training data
    y : array-like, shape (n_samples,)
        Target values (0 or 1)

    Returns
    -------
    self : object
        Returns self for method chaining
    """
    # TODO: Create design matrix Phi by adding bias column to X
    # Hint: Use np.c_[np.ones(X.shape[0]), X]
    Phi = None

    # TODO: Initialize weights with small random values
    # Hint: Use np.random.randn() and scale by 0.01
    if self.random_state is not None:
        np.random.seed(self.random_state)
    self.weights_ = None

    # Initialize loss history
    self.loss_history_ = []

    # Gradient descent loop
    for iteration in range(self.max_iter):
        # TODO: Compute probabilities using your _sigmoid method
        # Step 1: Compute scores (z = Phi @ weights)
        scores = None
        # Step 2: Apply sigmoid
        probabilities = None

        # TODO: Compute NLL loss: -Σ[y*log(p) + (1-y)*log(1-p)]
        # Use epsilon for numerical stability
        epsilon = 1e-15
        p_safe = np.clip(probabilities, epsilon, 1 - epsilon)
        nll = None
        self.loss_history_.append(nll)

        # TODO: Compute gradient using your _compute_gradient method
        gradient = None

        # TODO: Check convergence - if gradient norm < tolerance, break
        if None:  # Replace with convergence check
            break

        # TODO: Update weights using gradient descent: w = w - learning_rate * gradient
        pass

    self.n_iter_ = iteration + 1
    return self

def predict(self, X):
    """
    Predict class labels.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Samples

    Returns
    -------
    y_pred : array, shape (n_samples,)
        Predicted class labels (0 or 1)
    """
    # TODO: Get probabilities using predict_proba and threshold at 0.5
    # Hint: predict_proba returns shape (n_samples, 2), we want column 1
    proba = None
    return (proba[:, 1] >= 0.5).astype(int)

# Add methods to the class
MyLogisticRegression.fit = fit
MyLogisticRegression.predict = predict

In [None]:
print("=" * 70)
print("EXERCISE 3 VERIFICATION: Testing Complete Training")
print("=" * 70)

# Test on simple dataset
np.random.seed(42)
X_simple = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1],
    [0, 0.5], [0.5, 0], [1, 0.5], [0.5, 1]
])
y_simple = np.array([0, 0, 1, 1, 0, 0, 1, 1])

print("\n1. Training on simple dataset:")
print("-" * 70)
model_ex3 = MyLogisticRegression(learning_rate=0.1, max_iter=1000, random_state=42)
model_ex3.fit(X_simple, y_simple)

print(f"✓ Training completed in {model_ex3.n_iter_} iterations")
print(f"✓ Final weights: {model_ex3.weights_}")
print(f"✓ Initial loss: {model_ex3.loss_history_[0]:.4f}")
print(f"✓ Final loss: {model_ex3.loss_history_[-1]:.4f}")
print(f"✓ Loss decreased? {model_ex3.loss_history_[-1] < model_ex3.loss_history_[0]}")

print("\n2. Testing predictions:")
print("-" * 70)
y_pred = model_ex3.predict(X_simple)
accuracy = accuracy_score(y_simple, y_pred)

print(f"True labels:      {y_simple}")
print(f"Predicted labels: {y_pred}")
print(f"Accuracy: {accuracy:.2%}")

print("\n3. Testing probabilities:")
print("-" * 70)
y_proba = model_ex3.predict_proba(X_simple)
print("Sample predictions:")
for i in range(len(X_simple)):
    print(f"  {X_simple[i]} -> True: {y_simple[i]}, "
          f"Pred: {y_pred[i]}, P(y=1): {y_proba[i, 1]:.3f}")

print("\n4. Visualizing loss convergence:")
print("-" * 70)
plt.figure(figsize=(10, 5))
plt.plot(model_ex3.loss_history_, linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('NLL Loss', fontsize=12)
plt.title('Training Loss - Exercise 3 Verification', fontsize=14)
plt.grid(True, alpha=0.3)
plt.show()

print("\n" + "=" * 70)
print("SUCCESS! Your complete implementation is working!")
print("=" * 70)
print("\n🎉 You can now proceed to apply your model to the full dataset!")

## Test on Simple Data

In [None]:
# Generate simple test data
np.random.seed(42)
X_simple = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])
y_simple = np.array([0, 1, 1, 0])

# Fit model
model_simple = MyLogisticRegression(learning_rate=0.1, max_iter=1000)
model_simple.fit(X_simple, y_simple)

# Predict
y_pred_simple = model_simple.predict(X_simple)
print("True labels:", y_simple)
print("Predictions:", y_pred_simple)
print("Accuracy:", accuracy_score(y_simple, y_pred_simple))

## Generate Synthetic Binary Classification Data

We'll use the same data generation approach from the lecture slides (slide 26).

In [None]:
# Generate two-class data
m = 100  # samples per class
n = 2    # features

np.random.seed(0)

# Class 0: centered around (1.5, -1.5)
class_0 = np.hstack((
    1.5 + np.random.randn(m, 1),
    -1.5 + np.random.randn(m, 1)
))

# Class 1: centered around (-1.5, 1.5)
class_1 = np.hstack((
    -1.5 + np.random.randn(m, 1),
    1.5 + np.random.randn(m, 1)
))

# Combine
X = np.vstack((class_0, class_1))
y = np.concatenate([np.zeros(m), np.ones(m)])

print(f"Dataset shape: X={X.shape}, y={y.shape}")
print(f"Class distribution: {np.bincount(y.astype(int))}")

## Visualize the Data

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(X[y == 0, 0], X[y == 0, 1], c='orange', label='Class 0', edgecolors='k', s=50)
plt.scatter(X[y == 1, 0], X[y == 1, 1], c='skyblue', label='Class 1', edgecolors='k', s=50)
plt.xlabel('$x_1$', fontsize=12)
plt.ylabel('$x_2$', fontsize=12)
plt.title('Binary Classification Dataset', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## Split into Train and Test Sets

In [None]:
# Split data (70% train, 30% test)
# stratify=y preserves the class proportions of y in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

## Train the Model

In [None]:
# Create and train model
model = MyLogisticRegression(learning_rate=0.1, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

print(f"Training completed in {model.n_iter_} iterations")
print(f"Final weights: {model.weights_}")

### Understanding the Learned Model

Let's interpret what the model has learned. The weights tell us how each feature affects the classification decision.

In [None]:
# Print learned weights
print("Learned Weights:")
print(f"  Intercept (bias): {model.weights_[0]:.4f}")
print(f"  Weight for x₁:    {model.weights_[1]:.4f}")
print(f"  Weight for x₂:    {model.weights_[2]:.4f}")

print("\nInterpretation:")
print("- Positive weight → increasing this feature increases P(y=1)")
print("- Negative weight → increasing this feature decreases P(y=1)")
print("- Larger magnitude → stronger influence on classification")

print("\nDecision Boundary:")
print("The decision boundary is the line where x^T·w = 0")
print(f"For our model: {model.weights_[0]:.4f} + {model.weights_[1]:.4f}*x₁ + {model.weights_[2]:.4f}*x₂ = 0")
print(f"Solving for x₂: x₂ = {-model.weights_[0]/model.weights_[2]:.4f} + {-model.weights_[1]/model.weights_[2]:.4f}*x₁")

## Visualize Training Loss

### How to Read the Loss Curve

The loss curve provides important diagnostic information about your training process:

- **Still decreasing steadily** → Model hasn't converged yet. Increase `max_iter` or increase the learning rate
- **Oscillating or increasing** → Learning rate is too large. Decrease `learning_rate` (e.g., from 0.1 to 0.01)
- **Flattened/plateaued** → Converged successfully! The gradient is near zero and the model has found a minimum
- **Decreasing very slowly** → Learning rate might be too small. Try increasing it for faster convergence

**What to look for in the curve below:** Rapid decrease then flattening = healthy convergence pattern!
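
To see the "oscillating or increasing" case concretely, here is a minimal sketch that swaps the lab's NLL for a toy quadratic loss $J(w) = w^2$. This is an assumption made purely for illustration, and the function name `gradient_descent_losses` is hypothetical. With this loss each update multiplies $w$ by $(1 - 2\alpha)$, so a small rate shrinks the loss while a rate above 1 makes it grow.

```python
def gradient_descent_losses(learning_rate, n_steps=20, w0=1.0):
    """Run gradient descent on the toy loss J(w) = w**2 and record the loss."""
    w = w0
    losses = [w ** 2]
    for _ in range(n_steps):
        grad = 2.0 * w                  # dJ/dw for J(w) = w^2
        w = w - learning_rate * grad    # same update rule as in the lab
        losses.append(w ** 2)
    return losses

small = gradient_descent_losses(learning_rate=0.1)  # w shrinks by a factor of 0.8 per step
large = gradient_descent_losses(learning_rate=1.1)  # w is multiplied by -1.2 per step

print(f"alpha=0.1: first loss {small[0]:.4f}, last loss {small[-1]:.6f}")
print(f"alpha=1.1: first loss {large[0]:.4f}, last loss {large[-1]:.2f}")
```

The same qualitative behavior shows up on the NLL loss curve, although the exact learning rate at which divergence starts depends on the data.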

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(model.loss_history_, linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Negative Log-Likelihood', fontsize=12)
plt.title('Training Loss Over Time', fontsize=14)
plt.grid(True, alpha=0.3)
plt.show()

## Make Predictions

In [None]:
# Predict on test set
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")

# Show some predictions
print("\nSample predictions (first 10):")
for i in range(min(10, len(y_test))):
    print(f"True: {int(y_test.iloc[i] if hasattr(y_test, 'iloc') else y_test[i])}, "
          f"Predicted: {y_pred[i]}, "
          f"P(y=1): {y_proba[i, 1]:.3f}")

## Confusion Matrix

### ✅ Checkpoint Question 4: What metric is most important for spam detection?

A) Overall accuracy measured across both spam and legitimate emails, providing a single comprehensive metric for model performance evaluation

B) Precision to minimize false positives and avoid blocking legitimate emails, which is more costly than letting spam through

C) Recall to minimize false negatives and catch all possible spam messages, ensuring comprehensive spam filtering without misses

D) F1-score to balance both precision and recall equally well, giving equal weight to false positives and negatives

<details>
<summary>Click to see answer</summary>

**Answer: B**

**Key Insight:** In spam detection, false positives (legitimate email marked as spam) are typically more costly than false negatives (spam in inbox) because users may miss important messages. Therefore, precision (minimizing false positives) is usually prioritized over recall, though the exact balance depends on the specific application context.

**Detailed Explanation:**

**Precision** = TP / (TP + FP) = "Of all emails marked as spam, how many were actually spam?"

For spam detection, **precision** is typically most important because:
- **False positives are very costly**: Missing an important work email, job offer, or password reset could have serious consequences
- **False negatives are annoying but manageable**: A few spam emails in the inbox are tolerable

**Example with numbers:**
```
Confusion Matrix:
                Predicted Spam    Predicted Legitimate
Actual Spam        950 (TP)            50 (FN)
Actual Legit        10 (FP)           990 (TN)

Precision = 950/(950+10) ≈ 0.99 (99% of spam predictions are correct)
Recall = 950/(950+50) = 0.95 (caught 95% of actual spam)
```

Only 10 of the 1,000 legitimate emails were incorrectly blocked, an acceptable trade-off for catching 95% of the spam.
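
The arithmetic in this example can be reproduced directly from the four counts. The snippet below is a small sketch using the numbers from the confusion matrix above; the variable names are illustrative.

```python
# Counts taken from the example confusion matrix (spam is the positive class)
tp, fn = 950, 50   # actual spam: caught / missed
fp, tn = 10, 990   # actual legitimate: wrongly blocked / correctly delivered

precision = tp / (tp + fp)                           # of predicted spam, how much is spam
recall = tp / (tp + fn)                              # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
print(f"Accuracy:  {accuracy:.4f}")
```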

**Context matters:**
- **Consumer email (Gmail)**: High precision preferred (don't block important mail)
- **Corporate email filter**: Might balance precision/recall more equally
- **High-security environment**: Might even prioritize recall (catch all threats)

**Real-world approaches:**
- Many spam filters use a confidence threshold above 0.5 (e.g., 0.7 or 0.8)
- This increases precision at the cost of recall
- Example: Only mark as spam if P(spam) > 0.8
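
The effect of raising the threshold can be sketched with a handful of made-up probabilities. The arrays `y_true` and `p_spam` and the helper `precision_recall_at` are hypothetical and chosen only to make the trade-off visible.

```python
import numpy as np

# Hypothetical labels (1 = spam) and predicted spam probabilities
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
p_spam = np.array([0.95, 0.85, 0.75, 0.60, 0.70, 0.55, 0.30, 0.10])

def precision_recall_at(y_true, p_spam, threshold):
    """Mark an email as spam only if P(spam) > threshold, then score the result."""
    y_pred = (p_spam > threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

for threshold in (0.5, 0.8):
    prec, rec = precision_recall_at(y_true, p_spam, threshold)
    print(f"threshold={threshold}: precision={prec:.2f}, recall={rec:.2f}")
```

On this toy data the stricter threshold removes both false positives but also misses two spam messages, which is the precision-for-recall exchange described above.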

**Why other answers are incorrect:**

- **A is FALSE**: Accuracy can be very misleading with imbalanced data. If 95% of emails are legitimate, a naive classifier that always predicts "not spam" would achieve 95% accuracy while catching zero spam! Accuracy doesn't capture the asymmetric cost of errors in spam detection. We need to specifically minimize false positives, not just maximize overall correctness.
- **C is FALSE**: Prioritizing recall would mean flagging more aggressively to catch all spam, but this would result in many false positives (legitimate emails blocked). While catching 100% of spam sounds good, blocking 50 legitimate emails to catch 10 more spam messages is usually not worth it. Recall is important but typically secondary to precision.
- **D is FALSE**: F1-score gives equal weight to precision and recall (F1 = 2·(P·R)/(P+R)), but in spam detection, precision and recall do NOT have equal importance. The asymmetric cost structure (false positives >> false negatives) means we should prioritize precision. F1-score is useful when costs are balanced, but that's not the case here.

</details>

In [None]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print("\n[TN  FP]")
print("[FN  TP]")

# Visualize
plt.figure(figsize=(6, 5))
plt.imshow(cm, cmap='Blues', interpolation='nearest')
plt.colorbar()
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
for i in range(2):
    for j in range(2):
        plt.text(j, i, cm[i, j], ha='center', va='center', fontsize=20)
plt.xticks([0, 1], ['Class 0', 'Class 1'])
plt.yticks([0, 1], ['Class 0', 'Class 1'])
plt.show()

### ✅ Checkpoint Question 5: What does the decision boundary represent?

A) The set of points where the model predicts maximum probability P(y=1|x) equals one point zero for class one predictions

B) The set of points where the model predicts equal probability P(y=1|x) equals zero point five, the classification threshold

C) The set of points where the model predicts minimum probability P(y=1|x) equals zero for class zero predictions exclusively

D) The region of maximum classification confidence where the model achieves its highest accuracy on training data samples predominantly

<details>
<summary>Click to see answer</summary>

**Answer: B**

**Key Insight:** The decision boundary is the geometric surface where P(y=1|x) = 0.5, which occurs when x^T·w = 0. Points on one side have P(y=1|x) > 0.5 (classified as class 1), while points on the other side have P(y=1|x) < 0.5 (classified as class 0). For 2D features, this is a line; for 3D, a plane; for higher dimensions, a hyperplane.

**Detailed Explanation:**

The decision boundary is where the model is maximally uncertain - exactly at the threshold between the two classes.

Mathematically:
- Decision boundary: {x : σ(x^T·w) = 0.5}
- Since σ(0) = 0.5, this simplifies to: {x : x^T·w = 0}
- For 2 features: w₀ + w₁x₁ + w₂x₂ = 0

**Example with numbers:**
If weights are w = [1, 2, -3]:
- Decision boundary: 1 + 2x₁ - 3x₂ = 0
- Solving for x₂: x₂ = (1 + 2x₁)/3
- Points below this line (1 + 2x₁ - 3x₂ > 0): predicted as class 1
- Points above this line (1 + 2x₁ - 3x₂ < 0): predicted as class 0

**Visualizing predictions:**
- At x = [0, 0]: x^T·w = 1 > 0, so p = σ(1) = 0.73 → class 1
- At x = [-1, 0]: x^T·w = -1 < 0, so p = σ(-1) = 0.27 → class 0
- At x = [-0.5, 0]: x^T·w = 0, so p = σ(0) = 0.50 → on boundary
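
These three predictions can be checked numerically. The sketch below assumes the weight vector is ordered as [bias, w₁, w₂], as in the example, and prepends a column of ones to the points; `Phi` and `points` are illustrative names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.0, 2.0, -3.0])       # [w0 (bias), w1, w2] from the example

points = np.array([[0.0, 0.0],
                   [-1.0, 0.0],
                   [-0.5, 0.0]])
Phi = np.column_stack([np.ones(len(points)), points])  # add the bias column

z = Phi @ w          # linear scores x^T w
p = sigmoid(z)       # P(y=1 | x)

for point, score, prob in zip(points, z, p):
    print(f"x={point}, x^T w={score:+.1f}, P(y=1)={prob:.2f}")
```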

**Why other answers are incorrect:**

- **A is FALSE**: P(y=1|x) = 1.0 would require x^T·w = +∞, which never happens in practice with finite feature values. Points very far from the boundary have high confidence (e.g., p = 0.99) but never reach exactly 1.0. The decision boundary is at p = 0.5, not p = 1.0.
- **C is FALSE**: Similarly, P(y=1|x) = 0 would require x^T·w = -∞. The decision boundary is at the threshold (0.5), not at the extremes (0 or 1). Points with p ≈ 0 are far from the boundary, not on it.
- **D is FALSE**: The decision boundary is actually where the model has *minimum* confidence (p = 0.5), not maximum. Points far from the boundary have high confidence (p close to 0 or 1), while points near the boundary have low confidence (p near 0.5). Also, the decision boundary is defined by the learned weights, not by training accuracy.

</details>

## Classification Report

In [None]:
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1']))

# Calculate metrics manually
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"\nPrecision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

## Visualize Decision Boundary

### ✅ Checkpoint Question 6: When should we use higher learning rates?

A) Always use high learning rates to achieve faster convergence to the optimum solution in all training scenarios regardless of data

B) When features are normalized to have similar scales and distributions, ensuring gradients have comparable magnitudes across all feature dimensions

C) When the model is overfitting to the training data, use higher rates to reduce model complexity and regularize the learning

D) Never use high learning rates because they always cause divergence problems and prevent the model from converging to any solution

<details>
<summary>Click to see answer</summary>

**Answer: B**

**Key Insight:** Feature normalization (e.g., StandardScaler) ensures all gradients have similar magnitudes, allowing a single learning rate to work well for all features. Without normalization, features with large scales dominate gradients, requiring a small learning rate that makes other features learn too slowly.

**Detailed Explanation:**

Higher learning rates (e.g., α = 0.1 instead of 0.01) can be used when features are normalized because:

1. **Unnormalized features cause problems:**
   - Feature x₁ ∈ [0, 1] and x₂ ∈ [0, 10000]
   - Gradient for w₂ will be ~10,000× larger than gradient for w₁
   - Large α works for x₁ but causes oscillation/divergence for x₂
   - Small α works for x₂ but makes x₁ learn extremely slowly

2. **Normalized features enable higher learning rates:**
   - After StandardScaler: all features have mean=0, std=1
   - All gradients have similar magnitudes
   - Can use α = 0.1 or higher safely
   - Faster convergence without instability

**Example with numbers:**
```
# Unnormalized
Feature 1: [1, 2, 3] → gradient ≈ 2.5
Feature 2: [1000, 2000, 3000] → gradient ≈ 2500
Ratio: 1000:1 (need very small α)

# After StandardScaler
Feature 1: [-1.22, 0, 1.22] → gradient ≈ 2.5
Feature 2: [-1.22, 0, 1.22] → gradient ≈ 2.5
Ratio: 1:1 (can use larger α)
```

**Recommended learning rates:**
- Normalized features: α ∈ [0.01, 0.5]
- Unnormalized features: α ∈ [0.0001, 0.01]
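
The gradient imbalance can be checked with a few lines of NumPy. This is a sketch under simplifying assumptions: the labels `y` are made up, the weights start at zero (so every prediction is 0.5), the bias term is omitted, and standardization is done by hand with the population standard deviation. The helper name `initial_gradient` is hypothetical.

```python
import numpy as np

# Two features on very different scales (same values as in the example above)
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])
y = np.array([0.0, 1.0, 1.0])   # hypothetical labels

def initial_gradient(X, y):
    """Gradient X^T (p - y) at w = 0, where every prediction is p = 0.5."""
    p = np.full(len(y), 0.5)
    return X.T @ (p - y)

grad_raw = initial_gradient(X, y)

# Standardize each column to mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
grad_std = initial_gradient(X_std, y)

print("Raw gradient:         ", grad_raw)
print("Standardized gradient:", grad_std)
print("Raw ratio (feature 2 / feature 1):", grad_raw[1] / grad_raw[0])
```

With the raw features the second gradient component is a thousand times the first; after standardization both components are equal, so one learning rate suits both weights.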

**Why other answers are incorrect:**

- **A is FALSE**: "Always" is too strong. Very high learning rates (e.g., α = 10) can cause divergence even with normalized features. The gradient might overshoot the minimum, causing the loss to oscillate or increase. The optimal learning rate depends on the problem, dataset size, and loss surface curvature.
- **C is FALSE**: Overfitting is addressed through regularization (L1/L2 penalties), early stopping, or getting more training data - not by changing the learning rate. The learning rate controls convergence speed and stability, not model complexity. A higher learning rate might even worsen overfitting by causing instability.
- **D is FALSE**: "Never" and "always" are both too absolute. Higher learning rates are perfectly fine with proper feature scaling and can significantly speed up training. For example, α = 0.1 often works well for normalized features and converges faster than α = 0.001. The key is matching the learning rate to the feature scales.

</details>

In [None]:
# Create mesh
x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 200),
                        np.linspace(x2_min, x2_max, 200))

# Predict probabilities for mesh
X_mesh = np.c_[xx1.ravel(), xx2.ravel()]
probs_mesh = model.predict_proba(X_mesh)[:, 1].reshape(xx1.shape)

# Plot
plt.figure(figsize=(12, 8))
plt.contourf(xx1, xx2, probs_mesh, levels=20, cmap='RdBu_r', alpha=0.6)
plt.colorbar(label='P(y=1|x,w)')
plt.contour(xx1, xx2, probs_mesh, levels=[0.5], colors='black', linewidths=2)

plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
            c='orange', label='Train Class 0', edgecolors='k', s=50, marker='o')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
            c='skyblue', label='Train Class 1', edgecolors='k', s=50, marker='o')
plt.scatter(X_test[y_test == 0, 0], X_test[y_test == 0, 1],
            c='orange', label='Test Class 0', edgecolors='k', s=100, marker='s')
plt.scatter(X_test[y_test == 1, 0], X_test[y_test == 1, 1],
            c='skyblue', label='Test Class 1', edgecolors='k', s=100, marker='s')

plt.xlabel('$x_1$', fontsize=12)
plt.ylabel('$x_2$', fontsize=12)
plt.title('Decision Boundary', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## Experiment with Different Learning Rates

We've seen how our model performs with $\alpha = 0.1$. But how does the choice of learning rate affect convergence speed and final accuracy? Let's systematically compare different learning rates to understand this critical hyperparameter and observe the convergence patterns we discussed earlier.

In [None]:
learning_rates = [0.001, 0.01, 0.1, 0.5]
results = {}

for lr in learning_rates:
    model_lr = MyLogisticRegression(learning_rate=lr, max_iter=1000, random_state=42)
    model_lr.fit(X_train, y_train)
    accuracy_lr = accuracy_score(y_test, model_lr.predict(X_test))
    results[lr] = {
        'model': model_lr,
        'accuracy': accuracy_lr,
        'n_iter': model_lr.n_iter_,
        'final_loss': model_lr.loss_history_[-1]
    }
    print(f"Learning Rate={lr}: Accuracy={accuracy_lr:.4f}, "
          f"Iterations={model_lr.n_iter_}, Final Loss={model_lr.loss_history_[-1]:.4f}")

### Compare Loss Curves

In [None]:
plt.figure(figsize=(12, 6))
for lr in learning_rates:
    plt.plot(results[lr]['model'].loss_history_, label=f'α={lr}', linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Negative Log-Likelihood', fontsize=12)
plt.title('Training Loss for Different Learning Rates', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## K-Fold Cross-Validation

So far we've evaluated our model using a single train/test split. However, our performance estimate might be sensitive to this particular split - we might have gotten "lucky" or "unlucky" with which samples ended up in the test set. K-fold cross-validation provides a more robust performance estimate by testing on multiple different splits of the data.

### Why Cross-Validation?

Relying on a single train/test split has several limitations:

**Problem with single split:**
- Performance estimate depends heavily on *which* samples ended up in the test set
- Might get "lucky" with an easy test set (overestimate performance)
- Might get "unlucky" with a hard test set (underestimate performance)
- Uses less data for training (70% in our case)

**K-Fold Cross-Validation solves this:**
1. **Split data into K folds** (e.g., K=5): divide training data into 5 equal parts
2. **Train K times**: Each time, use K-1 folds for training, 1 fold for validation
3. **Get K performance scores**: Each fold serves as validation set once
4. **Average the scores**: More robust estimate, with the standard deviation across folds as a measure of spread

**Benefits:**
- ✅ More reliable performance estimate (average of K runs)
- ✅ Spread across folds shows how variable the estimate is (e.g., 0.92 ± 0.03)
- ✅ Uses all training data (every sample is validated once)
- ✅ Detects if model is unstable across different data splits

**Visual representation of 5-fold CV:**
```
Fold 1: [Test][Train][Train][Train][Train] → Score 1
Fold 2: [Train][Test][Train][Train][Train] → Score 2
Fold 3: [Train][Train][Test][Train][Train] → Score 3
Fold 4: [Train][Train][Train][Test][Train] → Score 4
Fold 5: [Train][Train][Train][Train][Test] → Score 5
                                             ↓
                        Final: Mean ± Std of 5 scores
```

**Important:** We only use cross-validation on the training set. The held-out test set remains completely untouched for final evaluation.
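
The four steps above can be written out by hand. The following is a minimal NumPy-only sketch, not the lab's `MyLogisticRegression` class: `fit_logreg`, `predict`, and `k_fold_accuracy` are hypothetical helpers, the toy data is generated here, and the gradient is averaged over samples (rather than summed) to keep the step size independent of fold size.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, learning_rate=0.1, n_iter=500):
    """Batch gradient descent on the mean NLL; returns weights with the bias first."""
    Phi = np.column_stack([np.ones(len(X)), X])   # prepend a bias column
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Phi @ w)
        w -= learning_rate * (Phi.T @ (p - y)) / len(y)
    return w

def predict(X, w):
    Phi = np.column_stack([np.ones(len(X)), X])
    return (sigmoid(Phi @ w) >= 0.5).astype(int)

def k_fold_accuracy(X, y, k=5, seed=0):
    """Shuffle once, split into k folds, and validate on each fold in turn."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit_logreg(X[train_idx], y[train_idx])
        scores.append(np.mean(predict(X[val_idx], w) == y[val_idx]))
    return np.array(scores)

# Hypothetical toy data: two Gaussian blobs, similar in spirit to the lab's synthetic set
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal([1.5, -1.5], 1.0, size=(50, 2)),
                    rng.normal([-1.5, 1.5], 1.0, size=(50, 2))])
y_demo = np.concatenate([np.zeros(50), np.ones(50)])

cv_scores_demo = k_fold_accuracy(X_demo, y_demo, k=5)
print(f"Fold accuracies: {cv_scores_demo}")
print(f"Mean CV accuracy: {cv_scores_demo.mean():.4f} (+/- {cv_scores_demo.std():.4f})")
```

In practice `KFold` and `cross_val_score` from scikit-learn handle the splitting and scoring; the loop is spelled out here only to show what happens in each fold.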

In [None]:
# Use training data for cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    MyLogisticRegression(learning_rate=0.1, max_iter=1000, random_state=42),
    X_train, y_train, cv=kf, scoring='accuracy'
)

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV Accuracy: {np.mean(cv_scores):.4f} (+/- {np.std(cv_scores):.4f})")

## ⚙️ OPTIONAL/ADVANCED: Polynomial Features for Non-Linear Decision Boundaries

**⚠️ Note:** This section covers **feature engineering** (creating polynomial features), not core logistic regression. It demonstrates how to handle non-linear decision boundaries when the basic logistic regression model with original features isn't sufficient.

**What you'll learn:**
- How to transform features to capture non-linear relationships
- When polynomial features are needed (non-linearly separable data)
- Trade-offs between model complexity and performance

**Core logistic regression concepts are complete** - this is an extension showing how to adapt the algorithm for more complex datasets.

Now let's work with a more complex dataset that requires non-linear decision boundaries. We'll use the mixture dataset from the lecture slides and apply polynomial features to capture the non-linear patterns.

### Load Mixture Dataset from Google Drive

This dataset contains two classes with non-linear separation (as shown in the lecture slides).

In [None]:
# Download the mixture dataset from Google Drive
# File ID: 1Ls7f9OWKgeWswFR4EZ5eeoohfY9PACRT
# Direct download URL
url = 'https://drive.google.com/uc?id=1Ls7f9OWKgeWswFR4EZ5eeoohfY9PACRT'

# Load data
df_mixture = pd.read_csv(url)
print(f"Mixture dataset shape: {df_mixture.shape}")
print(f"\nFirst few rows:")
print(df_mixture.head())
print(f"\nColumn names: {df_mixture.columns.tolist()}")
print(f"Class distribution:\n{df_mixture.iloc[:, -1].value_counts()}")

### Prepare Mixture Data

In [None]:
# Extract features and labels
# Assuming last column is the label, and first columns are features
X_mixture = df_mixture.iloc[:, :-1].values
y_mixture = df_mixture.iloc[:, -1].values

print(f"Features shape: {X_mixture.shape}")
print(f"Labels shape: {y_mixture.shape}")
print(f"Unique labels: {np.unique(y_mixture)}")

### Visualize Mixture Data

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(X_mixture[y_mixture == 0, 0], X_mixture[y_mixture == 0, 1],
            c='orange', label='Class 0', edgecolors='k', s=50, alpha=0.7)
plt.scatter(X_mixture[y_mixture == 1, 0], X_mixture[y_mixture == 1, 1],
            c='skyblue', label='Class 1', edgecolors='k', s=50, alpha=0.7)
plt.xlabel('$x_1$', fontsize=12)
plt.ylabel('$x_2$', fontsize=12)
plt.title('Mixture Dataset (Non-Linear Boundary)', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

### Split Mixture Data

In [None]:
# Split into train and test sets
X_mix_train, X_mix_test, y_mix_train, y_mix_test = train_test_split(
    X_mixture, y_mixture, test_size=0.3, random_state=42, stratify=y_mixture)

print(f"Mixture training set: {X_mix_train.shape[0]} samples")
print(f"Mixture test set: {X_mix_test.shape[0]} samples")

### Apply Polynomial Features to Mixture Data

Let's test different polynomial degrees to find the best model for this non-linear dataset.

In [None]:
# Test different polynomial degrees
degrees = [1, 2, 3, 4, 5]
poly_results = {}

for degree in degrees:
    # Create polynomial features
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_mix_train_poly = poly.fit_transform(X_mix_train)
    X_mix_test_poly = poly.transform(X_mix_test)

    # Train model with smaller learning rate for higher dimensions
    lr = 0.01 if degree <= 2 else 0.001
    model_poly = MyLogisticRegression(learning_rate=lr, max_iter=3000, random_state=42)
    model_poly.fit(X_mix_train_poly, y_mix_train)

    # Evaluate
    y_mix_pred_poly = model_poly.predict(X_mix_test_poly)
    accuracy_poly = accuracy_score(y_mix_test, y_mix_pred_poly)

    poly_results[degree] = {
        'poly': poly,
        'model': model_poly,
        'accuracy': accuracy_poly,
        'n_features': X_mix_train_poly.shape[1]
    }

    print(f"Degree={degree}: Features={X_mix_train_poly.shape[1]}, "
          f"Accuracy={accuracy_poly:.4f}, Iterations={model_poly.n_iter_}")

### Visualize Polynomial Decision Boundaries on Mixture Data

Notice how higher-degree polynomials can capture more complex, non-linear boundaries.

In [None]:
# Create subplot grid based on number of degrees
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

# Get data ranges
x1_min_mix, x1_max_mix = X_mixture[:, 0].min() - 0.5, X_mixture[:, 0].max() + 0.5
x2_min_mix, x2_max_mix = X_mixture[:, 1].min() - 0.5, X_mixture[:, 1].max() + 0.5

for idx, degree in enumerate(degrees):
    ax = axes[idx]

    # Get polynomial transformer and model
    poly = poly_results[degree]['poly']
    model_poly = poly_results[degree]['model']

    # Create mesh
    xx1, xx2 = np.meshgrid(np.linspace(x1_min_mix, x1_max_mix, 200),
                            np.linspace(x2_min_mix, x2_max_mix, 200))
    X_mesh = np.c_[xx1.ravel(), xx2.ravel()]
    X_mesh_poly = poly.transform(X_mesh)
    probs_mesh = model_poly.predict_proba(X_mesh_poly)[:, 1].reshape(xx1.shape)

    # Plot contours and decision boundary
    ax.contourf(xx1, xx2, probs_mesh, levels=20, cmap='RdBu_r', alpha=0.6)
    ax.contour(xx1, xx2, probs_mesh, levels=[0.5], colors='black', linewidths=2.5)

    # Plot training data
    ax.scatter(X_mix_train[y_mix_train == 0, 0], X_mix_train[y_mix_train == 0, 1],
                c='orange', edgecolors='k', s=40, alpha=0.7, label='Class 0 (train)')
    ax.scatter(X_mix_train[y_mix_train == 1, 0], X_mix_train[y_mix_train == 1, 1],
                c='skyblue', edgecolors='k', s=40, alpha=0.7, label='Class 1 (train)')

    # Plot test data with different marker
    ax.scatter(X_mix_test[y_mix_test == 0, 0], X_mix_test[y_mix_test == 0, 1],
                c='orange', edgecolors='k', s=80, marker='s', alpha=0.9, label='Class 0 (test)')
    ax.scatter(X_mix_test[y_mix_test == 1, 0], X_mix_test[y_mix_test == 1, 1],
                c='skyblue', edgecolors='k', s=80, marker='s', alpha=0.9, label='Class 1 (test)')

    ax.set_title(f'Degree={degree}, Features={poly_results[degree]["n_features"]}, '\
                     f'Acc={poly_results[degree]["accuracy"]:.3f}', fontsize=12)
    ax.set_xlabel('$x_1$', fontsize=11)
    ax.set_ylabel('$x_2$', fontsize=11)
    ax.grid(True, alpha=0.3)
    if idx == 0:
        ax.legend(fontsize=9, loc='best')

# Hide the last subplot if we have fewer than 6 degrees
if len(degrees) < 6:
    axes[5].axis('off')

plt.tight_layout()
plt.show()

### Analysis of Polynomial Degrees

Observe the following:
- **Degree 1 (Linear)**: Cannot capture the non-linear boundary, lower accuracy
- **Degree 2 (Quadratic)**: Begins to capture curvature in the decision boundary
- **Degree 3-4**: Better fit for complex boundaries
- **Degree 5+**: Risk of overfitting - may fit training noise rather than true pattern

**Key insight**: The mixture dataset requires polynomial features because the classes are not linearly separable. This demonstrates why feature engineering (like polynomial features) is important for logistic regression.
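
To make the growth in feature count concrete, the sketch below builds polynomial features by hand for two input columns. The helper `polynomial_expand` is hypothetical and written only for illustration; it should produce the same number of columns as `PolynomialFeatures(include_bias=False)`, though matching the column order beyond degree 2 is an assumption not verified here.

```python
from itertools import combinations_with_replacement

import numpy as np

def polynomial_expand(X, degree):
    """Return all monomials of the input columns with total degree 1..degree (no bias column)."""
    n_features = X.shape[1]
    columns = []
    for d in range(1, degree + 1):
        # Each combination picks d column indices (with repetition) to multiply together
        for combo in combinations_with_replacement(range(n_features), d):
            columns.append(np.prod(X[:, list(combo)], axis=1))
    return np.column_stack(columns)

X_small = np.array([[2.0, 3.0]])   # one sample with x1 = 2, x2 = 3

print("Degree 2 features:", polynomial_expand(X_small, 2))  # x1, x2, x1^2, x1*x2, x2^2
for degree in [1, 2, 3, 4, 5]:
    n_out = polynomial_expand(X_small, degree).shape[1]
    print(f"degree={degree}: {n_out} features")
```

The count grows from 2 to 20 features between degree 1 and degree 5, which is one reason higher degrees are more prone to overfitting on a small dataset.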

## Comparison with scikit-learn

Now that we've implemented logistic regression and explored its behavior with different datasets, hyperparameters, and feature transformations, let's validate our implementation by comparing it with sklearn's professional implementation on our original synthetic dataset.

In [None]:
# Train sklearn model
sklearn_model = SklearnLogisticRegression(penalty=None, max_iter=1000, random_state=42)
sklearn_model.fit(X_train, y_train)

# Compare
our_accuracy = accuracy_score(y_test, model.predict(X_test))
sklearn_accuracy = sklearn_model.score(X_test, y_test)

print(f"Our model accuracy:      {our_accuracy:.4f}")
print(f"sklearn model accuracy:  {sklearn_accuracy:.4f}")
print(f"\nOur weights:     {model.weights_}")
print(f"sklearn weights: {np.concatenate([sklearn_model.intercept_, sklearn_model.coef_[0]])}")

## Best Practices and Tips

### 1. Feature Scaling
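Standardize features when using gradient descent, fitting the scaler on the training split only so that test-set statistics do not leak into training:
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std
```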

### 2. Learning Rate Selection
- Start with α = 0.01 for normalized features
- If loss increases or oscillates: reduce α
- If convergence is too slow: increase α
- Monitor loss curve to diagnose

### 3. Handling Class Imbalance
- Use stratified splits: `train_test_split(..., stratify=y)`
- Consider weighted loss or resampling
- Focus on precision/recall instead of accuracy

### 4. Convergence
- Set reasonable `max_iter` (e.g., 1000-10000)
- Use `tol` to stop early when gradient is small
- Check if loss is still decreasing

### 5. Multiclass Classification
For more than 2 classes, use:
- One-vs-Rest (OvR): Train C binary classifiers
- Softmax regression (multinomial logistic regression)
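
As a rough illustration of the softmax option, the sketch below turns a matrix of class scores into probabilities. The scores are made up, and the function subtracts the row maximum before exponentiating, a standard trick for numerical stability that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: each row of class scores becomes a probability distribution."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # avoid overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical scores for 2 samples and 3 classes
scores = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(scores)
predicted_class = probs.argmax(axis=1)

print("Probabilities:\n", probs)
print("Predicted classes:", predicted_class)

# With two classes and scores [z, 0], softmax reduces to the sigmoid of z
z = 1.5
two_class = softmax(np.array([[z, 0.0]]))[0, 0]
print(f"softmax([z, 0])[0] = {two_class:.4f}, sigmoid(z) = {1 / (1 + np.exp(-z)):.4f}")
```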

### 6. Regularization
To prevent overfitting:
- L2 regularization: Add $\lambda ||\vec{w}||^2$ to loss
- L1 regularization: Add $\lambda ||\vec{w}||_1$ to loss
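
A possible way to add the L2 penalty to the loss and gradient used in this lab is sketched below. It assumes the weight vector stores the bias first and leaves the bias out of the penalty, which is a common convention but a choice made here rather than something the lab specifies. The function names and the small data arrays are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_l2(w, Phi, y, lam):
    """Negative log-likelihood plus lam * ||w[1:]||^2 (bias w[0] is not penalized)."""
    p = sigmoid(Phi @ w)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + lam * np.sum(w[1:] ** 2)

def grad_nll_l2(w, Phi, y, lam):
    """Gradient Phi^T (p - y), with 2 * lam * w added for the penalized weights."""
    p = sigmoid(Phi @ w)
    grad = Phi.T @ (p - y)
    grad[1:] += 2 * lam * w[1:]
    return grad

# Hypothetical design matrix (first column is the bias) and labels
Phi = np.array([[1.0, 0.5, -1.2],
                [1.0, -0.3, 0.8],
                [1.0, 1.5, 0.1],
                [1.0, -1.0, -0.4]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.array([0.2, -0.5, 1.0])
lam = 0.1

grad_plain = grad_nll_l2(w, Phi, y, lam=0.0)
grad_reg = grad_nll_l2(w, Phi, y, lam=lam)

print("Gradient without penalty:", grad_plain)
print("Gradient with L2 penalty:", grad_reg)
print("Penalty added to the loss:", nll_l2(w, Phi, y, lam) - nll_l2(w, Phi, y, 0.0))
```

The penalty pulls each non-bias weight toward zero in proportion to its size; `lam` would be tuned with cross-validation like any other hyperparameter.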

### 7. Evaluation
- Use cross-validation for small datasets
- Report multiple metrics: accuracy, precision, recall, F1
- Visualize confusion matrix
- Plot ROC curve and PR curve for threshold selection

## Summary

In this lab, you:

1. ✅ Implemented a custom **Logistic Regression** classifier from scratch
2. ✅ Understood the **sigmoid function** and how it models probabilities
3. ✅ Applied **gradient descent** to minimize the negative log-likelihood loss
4. ✅ Experimented with different **learning rates** and observed their effects
5. ✅ Used **K-fold cross-validation** to evaluate model performance
6. ✅ Applied **polynomial features** to model non-linear decision boundaries
7. ✅ Evaluated models using **accuracy, precision, recall, and F1-score**
8. ✅ Visualized **decision boundaries** and probability distributions
9. ✅ Compared your implementation with scikit-learn

### Key Takeaways

- Logistic regression models $P(y=1|\vec{x}, \vec{w}) = \sigma(\vec{x}^T \times \vec{w})$
- The loss function is the negative log-likelihood (binary cross-entropy)
- The gradient is $\nabla J = \Phi^T (\vec{p} - \vec{y})$
- Learning rate must be tuned carefully
- Feature scaling improves convergence
- Different applications require different metric priorities (precision vs recall)
- Cross-validation provides more reliable performance estimates
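
One way to convince yourself of the gradient formula $\nabla J = \Phi^T (\vec{p} - \vec{y})$ is a finite-difference check. The sketch below compares the analytic gradient with a central-difference approximation on small random data; the data, weights, and helper names are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, Phi, y):
    """Negative log-likelihood J(w), summed over samples."""
    p = sigmoid(Phi @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradient(w, Phi, y):
    return Phi.T @ (sigmoid(Phi @ w) - y)

def numerical_gradient(w, Phi, y, eps=1e-6):
    """Central differences: perturb one weight at a time and measure the change in J."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (nll(w + step, Phi, y) - nll(w - step, Phi, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
Phi = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])   # bias column + 2 features
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])
w = np.array([0.1, -0.2, 0.3])

g_analytic = analytic_gradient(w, Phi, y)
g_numeric = numerical_gradient(w, Phi, y)

print("Analytic gradient: ", g_analytic)
print("Numerical gradient:", g_numeric)
print("Max abs difference:", np.max(np.abs(g_analytic - g_numeric)))
```

The same check is a handy debugging tool when a custom `fit` method refuses to converge.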

### Next Steps

- Try logistic regression on real-world datasets (e.g., breast cancer, iris)
- Implement multiclass classification using One-vs-Rest
- Add L2 regularization to prevent overfitting
- Experiment with different optimization algorithms (SGD, Adam)
- Compare with other classifiers (SVM, Decision Trees, Neural Networks)