<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Logistic%20Regression/Logistic%20Regression%20Hands-On%20Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic Regression: Hands-On Lab

## Learning Objectives

By the end of this lab, you will be able to:

1. **Understand** how logistic regression models binary classification problems using the sigmoid function
2. **Implement** a custom logistic regression classifier using gradient descent
3. **Apply** logistic regression to real-world classification datasets
4. **Evaluate** model performance using accuracy, precision, recall, and F1-score
5. **Optimize** hyperparameters using K-fold cross-validation
6. **Visualize** decision boundaries and probability distributions

## Algorithm Overview

**Logistic Regression** is a classification algorithm that models the probability of a binary outcome:

$$P(y=1|\vec{x}, \vec{w}) = \sigma(\vec{x}^T \times \vec{w}) = \frac{1}{1 + e^{-\vec{x}^T \times \vec{w}}}$$

Where:

- $\vec{x}$ is the input feature vector
- $\vec{w}$ is the weight vector
- $\sigma$ is the sigmoid (logistic) function

**Loss Function** (Negative Log-Likelihood):

$$J(\vec{w}) = -\sum_{i=1}^{N} \left[ y^{(i)} \log p^{(i)} + (1-y^{(i)}) \log(1-p^{(i)}) \right]$$

Here $p^{(i)} = P(y=1|\vec{x}^{(i)}, \vec{w})$ is the predicted probability for example $i$.

**Gradient:**

$$\nabla_{\vec{w}} J = \Phi^T (\vec{p} - \vec{y})$$

Here $\Phi$ is the design matrix (the inputs with a leading column of ones for the bias), $\vec{p}$ is the vector of predicted probabilities, and $\vec{y}$ is the vector of true labels.

**Gradient Descent Update:**

$$\vec{w} \leftarrow \vec{w} - \alpha \nabla_{\vec{w}} J$$

Where $\alpha$ is the learning rate.
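The loss and gradient formulas above can be checked numerically. The following is a minimal sketch, not part of the lab's reference solution: the helper names (`sigmoid`, `nll`, `nll_gradient`) and the small random dataset are assumptions made for illustration. It compares the analytic gradient $\Phi^T(\vec{p} - \vec{y})$ with a central finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    """Logistic function via the identity sigma(z) = 0.5 * (1 + tanh(z / 2))."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def nll(w, Phi, y, eps=1e-15):
    """Negative log-likelihood J(w); probabilities are clipped to avoid log(0)."""
    p = np.clip(sigmoid(Phi @ w), eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def nll_gradient(w, Phi, y):
    """Analytic gradient Phi^T (p - y)."""
    return Phi.T @ (sigmoid(Phi @ w) - y)

# Hypothetical toy data, used only for this check
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
Phi_demo = np.c_[np.ones(len(X_demo)), X_demo]      # add bias column
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(float)
w_demo = rng.normal(size=3) * 0.1

# Central finite differences, one coordinate direction at a time
h = 1e-6
numeric = np.array([
    (nll(w_demo + h * e, Phi_demo, y_demo) - nll(w_demo - h * e, Phi_demo, y_demo)) / (2 * h)
    for e in np.eye(3)
])
analytic = nll_gradient(w_demo, Phi_demo, y_demo)

print("Max abs difference:", np.max(np.abs(numeric - analytic)))
```

If the two gradients agree to several decimal places, the formula and its implementation are consistent with the loss.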

## Learning Rate

The **learning rate** $\alpha$ controls how much we adjust weights at each iteration:

- **Too small**: Slow convergence, may take many iterations
- **Too large**: May overshoot the minimum, fail to converge
- **Just right**: Converges efficiently to the optimum

Typical values: $\alpha \in [0.001, 0.1]$ for normalized features.
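The three regimes are easiest to see on a toy problem. The sketch below is an illustration only: it uses the one-dimensional objective $f(w) = w^2$ (gradient $2w$) rather than the logistic loss, and the function name `run_gd` is hypothetical. Each step multiplies $w$ by $(1 - 2\alpha)$, so the iterate shrinks slowly for a tiny $\alpha$, shrinks quickly for a moderate $\alpha$, and grows without bound once $\alpha > 1$.

```python
def run_gd(alpha, w0=1.0, n_steps=50):
    """Run gradient descent on f(w) = w**2 and return the final iterate."""
    w = w0
    for _ in range(n_steps):
        w = w - alpha * 2.0 * w   # gradient of w**2 is 2w
    return w

w_small = run_gd(0.001)   # too small: barely moves away from w0
w_good = run_gd(0.1)      # reasonable: close to the minimum at 0
w_large = run_gd(1.1)     # too large: each step overshoots and |w| grows

print(f"alpha=0.001 -> w={w_small:.4f}")
print(f"alpha=0.1   -> w={w_good:.2e}")
print(f"alpha=1.1   -> w={w_large:.2e}")
```

The threshold values are specific to this toy objective; for the logistic loss the stable range depends on the scale of the features and the number of samples.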

## Pseudocode for Logistic Regression

```
# Logistic Regression: Gradient Descent on NLL

# Inputs
# data ← (X, y) with y ∈ {0,1}
# η ← learning rate (written α in the text above)
# max_iter ← maximum iterations
# tol ← stop when ||∇L(w)|| ≤ tol
# X_query ← examples to predict

# ----- fit -----
Φ ← concat_column(ones(N), X)      # design matrix with bias
w ← zeros(columns(Φ))               # initialize

# NLL: L(w) = - Σ [ y log p + (1−y) log(1−p) ], p = σ(Φw)
FOR t = 1 TO max_iter DO
    z ← Φ · w
    p ← 1 / (1 + exp(−z))           # sigmoid
    g ← transpose(Φ) · (p − y)      # ∇L(w)
    IF norm(g) ≤ tol THEN BREAK
    w ← w − η · g                   # GD step
END FOR

# ----- predict -----
Φ* ← concat_column(ones(|X_query|), X_query)
p* ← 1 / (1 + exp(−Φ* · w))
ŷ ← 1 if p* ≥ 0.5 else 0
RETURN p*, ŷ
```

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression
from scipy.special import expit
import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

## Implement MyLogisticRegression

Now you'll implement a custom logistic regression classifier. The class follows the scikit-learn API with `fit`, `predict`, and `predict_proba` methods.

In [None]:
class MyLogisticRegression(BaseEstimator, ClassifierMixin):
    """
    Custom Logistic Regression classifier using gradient descent.

    Parameters
    ----------
    learning_rate : float, default=0.01
        Step size for gradient descent updates
    max_iter : int, default=1000
        Maximum number of iterations
    tol : float, default=1e-6
        Tolerance for gradient norm to declare convergence
    random_state : int, default=None
        Random seed for weight initialization
    """

    def __init__(self, learning_rate=0.01, max_iter=1000, tol=1e-6, random_state=None):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.tol = tol
        self.random_state = random_state

    def fit(self, X, y):
        """
        Fit the logistic regression model.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Training data
        y : array-like, shape (n_samples,)
            Target values (0 or 1)

        Returns
        -------
        self : object
            Returns self for method chaining
        """
        # TODO: Create design matrix Phi by adding bias column to X
        # Hint: Phi = np.c_[np.ones(len(X)), X]
        Phi = np.c_[np.ones(len(X)), X]

        # TODO: Initialize weights with small random values
        # Hint: Use np.random.seed() and np.random.randn()
        if self.random_state is not None:
            np.random.seed(self.random_state)
        self.weights_ = np.random.randn(Phi.shape[1]) * 0.01

        # Initialize loss history
        self.loss_history_ = []

        # Gradient descent loop
        for iteration in range(self.max_iter):
            # TODO: Compute scores (z = Phi @ weights)
            scores = Phi @ self.weights_

            # TODO: Apply sigmoid to get probabilities
            # Hint: Use expit() from scipy.special
            probabilities = expit(scores)

            # TODO: Compute NLL loss
            # NLL = -Σ[y*log(p) + (1-y)*log(1-p)]
            # Use epsilon = 1e-15 for numerical stability
            epsilon = 1e-15
            p_safe = np.clip(probabilities, epsilon, 1 - epsilon)
            nll = -np.sum(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))
            self.loss_history_.append(nll)

            # TODO: Compute gradient: ∇L = Φ^T (p - y)
            gradient = Phi.T @ (probabilities - y)

            # TODO: Check convergence (if gradient norm < tolerance, break)
            if np.linalg.norm(gradient) < self.tol:
                break

            # TODO: Update weights: w = w - learning_rate * gradient
            self.weights_ -= self.learning_rate * gradient

        self.n_iter_ = iteration + 1
        return self

    def predict_proba(self, X):
        """
        Predict class probabilities.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Samples

        Returns
        -------
        proba : array, shape (n_samples, 2)
            Probabilities for each class [P(y=0), P(y=1)]
        """
        # TODO: Create design matrix for X
        Phi = np.c_[np.ones(len(X)), X]

        # TODO: Compute scores
        scores = Phi @ self.weights_

        # TODO: Apply sigmoid to get P(y=1|X)
        p1 = expit(scores)

        # Return probabilities for both classes
        return np.column_stack([1 - p1, p1])

    def predict(self, X):
        """
        Predict class labels.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Samples

        Returns
        -------
        y_pred : array, shape (n_samples,)
            Predicted class labels (0 or 1)
        """
        # TODO: Get probabilities and threshold at 0.5
        # Hint: Use predict_proba and check if P(y=1) >= 0.5
        proba = self.predict_proba(X)
        return (proba[:, 1] >= 0.5).astype(int)

## Test on Simple Data

In [None]:
# Generate simple test data
np.random.seed(42)
X_simple = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])
y_simple = np.array([0, 1, 1, 0])

# Fit model
model_simple = MyLogisticRegression(learning_rate=0.1, max_iter=1000)
model_simple.fit(X_simple, y_simple)

# Predict
y_pred_simple = model_simple.predict(X_simple)
print("True labels:", y_simple)
print("Predictions:", y_pred_simple)
print("Accuracy:", accuracy_score(y_simple, y_pred_simple))

## Generate Synthetic Binary Classification Data

We'll use the same data generation approach from the lecture slides (slide 26).

In [None]:
# Generate two-class data
m = 100  # samples per class
n = 2    # features
np.random.seed(0)

# Class 0: centered around (1.5, -1.5)
class_0 = np.hstack((
    1.5 + np.random.randn(m, 1),
    -1.5 + np.random.randn(m, 1)
))

# Class 1: centered around (-1.5, 1.5)
class_1 = np.hstack((
    -1.5 + np.random.randn(m, 1),
    1.5 + np.random.randn(m, 1)
))

# Combine
X = np.vstack((class_0, class_1))
y = np.concatenate([np.zeros(m), np.ones(m)])

print(f"Dataset shape: X={X.shape}, y={y.shape}")
print(f"Class distribution: {np.bincount(y.astype(int))}")

## Visualize the Data

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(X[y == 0, 0], X[y == 0, 1], c='orange', label='Class 0', edgecolors='k', s=50)
plt.scatter(X[y == 1, 0], X[y == 1, 1], c='skyblue', label='Class 1', edgecolors='k', s=50)
plt.xlabel('$x_1$', fontsize=12)
plt.ylabel('$x_2$', fontsize=12)
plt.title('Binary Classification Dataset', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## Split into Train and Test Sets

In [None]:
# Split data (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

## Train the Model

In [None]:
# Create and train model
model = MyLogisticRegression(learning_rate=0.1, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

print(f"Training completed in {model.n_iter_} iterations")
print(f"Final weights: {model.weights_}")

## Visualize Training Loss

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(model.loss_history_, linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Negative Log-Likelihood', fontsize=12)
plt.title('Training Loss Over Time', fontsize=14)
plt.grid(True, alpha=0.3)
plt.show()

## Make Predictions

In [None]:
# Predict on test set
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")

# Show some predictions
print("\nSample predictions (first 10):")
for i in range(min(10, len(y_test))):
    print(f"True: {int(y_test.iloc[i] if hasattr(y_test, 'iloc') else y_test[i])}, "
          f"Predicted: {y_pred[i]}, "
          f"P(y=1): {y_proba[i, 1]:.3f}")

## Confusion Matrix

In [None]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print("\n[TN  FP]")
print("[FN  TP]")

# Visualize
plt.figure(figsize=(6, 5))
plt.imshow(cm, cmap='Blues', interpolation='nearest')
plt.colorbar()
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
for i in range(2):
    for j in range(2):
        plt.text(j, i, cm[i, j], ha='center', va='center', fontsize=20)
plt.xticks([0, 1], ['Class 0', 'Class 1'])
plt.yticks([0, 1], ['Class 0', 'Class 1'])
plt.show()

## Classification Report

In [None]:
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1']))

# Compute the individual metrics for the positive class (Class 1)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"\nPrecision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

## Visualize Decision Boundary

In [None]:
# Create mesh
x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 200),
                       np.linspace(x2_min, x2_max, 200))

# Predict probabilities for mesh
X_mesh = np.c_[xx1.ravel(), xx2.ravel()]
probs_mesh = model.predict_proba(X_mesh)[:, 1].reshape(xx1.shape)

# Plot
plt.figure(figsize=(12, 8))
plt.contourf(xx1, xx2, probs_mesh, levels=20, cmap='RdBu_r', alpha=0.6)
plt.colorbar(label='P(y=1|x,w)')
plt.contour(xx1, xx2, probs_mesh, levels=[0.5], colors='black', linewidths=2)
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
            c='orange', label='Train Class 0', edgecolors='k', s=50, marker='o')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
            c='skyblue', label='Train Class 1', edgecolors='k', s=50, marker='o')
plt.scatter(X_test[y_test == 0, 0], X_test[y_test == 0, 1],
            c='orange', label='Test Class 0', edgecolors='k', s=100, marker='s')
plt.scatter(X_test[y_test == 1, 0], X_test[y_test == 1, 1],
            c='skyblue', label='Test Class 1', edgecolors='k', s=100, marker='s')
plt.xlabel('$x_1$', fontsize=12)
plt.ylabel('$x_2$', fontsize=12)
plt.title('Decision Boundary', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## Experiment with Different Learning Rates

In [None]:
learning_rates = [0.001, 0.01, 0.1, 0.5]
results = {}

for lr in learning_rates:
    model_lr = MyLogisticRegression(learning_rate=lr, max_iter=1000, random_state=42)
    model_lr.fit(X_train, y_train)
    accuracy_lr = accuracy_score(y_test, model_lr.predict(X_test))
    results[lr] = {
        'model': model_lr,
        'accuracy': accuracy_lr,
        'n_iter': model_lr.n_iter_,
        'final_loss': model_lr.loss_history_[-1]
    }
    print(f"Learning Rate={lr}: Accuracy={accuracy_lr:.4f}, "
          f"Iterations={model_lr.n_iter_}, Final Loss={model_lr.loss_history_[-1]:.4f}")

### Compare Loss Curves

In [None]:
plt.figure(figsize=(12, 6))
for lr in learning_rates:
    plt.plot(results[lr]['model'].loss_history_, label=f'α={lr}', linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Negative Log-Likelihood', fontsize=12)
plt.title('Training Loss for Different Learning Rates', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## K-Fold Cross-Validation

Let's use K-fold cross-validation to get a more reliable estimate of model performance.

In [None]:
# Use training data for cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    MyLogisticRegression(learning_rate=0.1, max_iter=1000, random_state=42),
    X_train, y_train, cv=kf, scoring='accuracy')

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV Accuracy: {np.mean(cv_scores):.4f} (+/- {np.std(cv_scores):.4f})")
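The same K-fold machinery can also be used to choose a hyperparameter, which is the fifth learning objective. The sketch below is one possible way to do this, under stated assumptions: to stay self-contained it uses scikit-learn's `LogisticRegression` in place of `MyLogisticRegression`, tunes its `C` parameter (the inverse regularization strength) instead of the learning rate, and runs on a synthetic dataset from `make_classification`. The grid values and the names `C_grid`, `mean_scores`, and `best_C` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data, standing in for X_train / y_train
X_cv, y_cv = make_classification(n_samples=200, n_features=4, n_informative=2,
                                 n_redundant=0, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
C_grid = [0.01, 0.1, 1.0, 10.0]          # candidate hyperparameter values

mean_scores = {}
for C in C_grid:
    clf = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(clf, X_cv, y_cv, cv=kf_demo, scoring='accuracy')
    mean_scores[C] = scores.mean()       # average accuracy over the 5 folds
    print(f"C={C}: mean CV accuracy={mean_scores[C]:.4f}")

# Pick the candidate with the highest mean validation accuracy
best_C = max(mean_scores, key=mean_scores.get)
print(f"Selected C: {best_C}")
```

Because the test set is never touched during this search, it remains available for a final, unbiased evaluation of the selected model.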

## Polynomial Features for Non-Linear Decision Boundaries

Now let's work with a more complex dataset that requires non-linear decision boundaries. We'll use the mixture dataset from the lecture slides and apply polynomial features to capture the non-linear patterns.

### Load Mixture Dataset from Google Drive

This dataset contains two classes with non-linear separation (as shown in the lecture slides).

In [None]:
# Download the mixture dataset from Google Drive
# File ID: 1Ls7f9OWKgeWswFR4EZ5eeoohfY9PACRT
# Direct download URL
url = 'https://drive.google.com/uc?id=1Ls7f9OWKgeWswFR4EZ5eeoohfY9PACRT'

# Load data
df_mixture = pd.read_csv(url)
print(f"Mixture dataset shape: {df_mixture.shape}")
print(f"\nFirst few rows:")
print(df_mixture.head())
print(f"\nColumn names: {df_mixture.columns.tolist()}")
print(f"Class distribution:\n{df_mixture.iloc[:, -1].value_counts()}")

### Prepare Mixture Data

In [None]:
# Extract features and labels
# Assuming last column is the label, and first columns are features
X_mixture = df_mixture.iloc[:, :-1].values
y_mixture = df_mixture.iloc[:, -1].values

print(f"Features shape: {X_mixture.shape}")
print(f"Labels shape: {y_mixture.shape}")
print(f"Unique labels: {np.unique(y_mixture)}")

### Visualize Mixture Data

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(X_mixture[y_mixture == 0, 0], X_mixture[y_mixture == 0, 1], 
            c='orange', label='Class 0', edgecolors='k', s=50, alpha=0.7)
plt.scatter(X_mixture[y_mixture == 1, 0], X_mixture[y_mixture == 1, 1], 
            c='skyblue', label='Class 1', edgecolors='k', s=50, alpha=0.7)
plt.xlabel('$x_1$', fontsize=12)
plt.ylabel('$x_2$', fontsize=12)
plt.title('Mixture Dataset (Non-Linear Boundary)', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

### Split Mixture Data

In [None]:
# Split into train and test sets
X_mix_train, X_mix_test, y_mix_train, y_mix_test = train_test_split(
    X_mixture, y_mixture, test_size=0.3, random_state=42, stratify=y_mixture)

print(f"Mixture training set: {X_mix_train.shape[0]} samples")
print(f"Mixture test set: {X_mix_test.shape[0]} samples")

### Apply Polynomial Features to Mixture Data

Let's test different polynomial degrees to find the best model for this non-linear dataset.

In [None]:
# Test different polynomial degrees
degrees = [1, 2, 3, 4, 5]
poly_results = {}

for degree in degrees:
    # Create polynomial features
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_mix_train_poly = poly.fit_transform(X_mix_train)
    X_mix_test_poly = poly.transform(X_mix_test)
    
    # Train model with smaller learning rate for higher dimensions
    lr = 0.01 if degree <= 2 else 0.001
    model_poly = MyLogisticRegression(learning_rate=lr, max_iter=3000, random_state=42)
    model_poly.fit(X_mix_train_poly, y_mix_train)
    
    # Evaluate
    y_mix_pred_poly = model_poly.predict(X_mix_test_poly)
    accuracy_poly = accuracy_score(y_mix_test, y_mix_pred_poly)
    
    poly_results[degree] = {
        'poly': poly,
        'model': model_poly,
        'accuracy': accuracy_poly,
        'n_features': X_mix_train_poly.shape[1]
    }
    
    print(f"Degree={degree}: Features={X_mix_train_poly.shape[1]}, "
          f"Accuracy={accuracy_poly:.4f}, Iterations={model_poly.n_iter_}")

### Visualize Polynomial Decision Boundaries on Mixture Data

Notice how higher-degree polynomials can capture more complex, non-linear boundaries.

In [None]:
# Create a 2x3 subplot grid: one panel per degree (the unused sixth panel is hidden below)
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

# Get data ranges
x1_min_mix, x1_max_mix = X_mixture[:, 0].min() - 0.5, X_mixture[:, 0].max() + 0.5
x2_min_mix, x2_max_mix = X_mixture[:, 1].min() - 0.5, X_mixture[:, 1].max() + 0.5

for idx, degree in enumerate(degrees):
    ax = axes[idx]
    
    # Get polynomial transformer and model
    poly = poly_results[degree]['poly']
    model_poly = poly_results[degree]['model']
    
    # Create mesh
    xx1, xx2 = np.meshgrid(np.linspace(x1_min_mix, x1_max_mix, 200),
                            np.linspace(x2_min_mix, x2_max_mix, 200))
    X_mesh = np.c_[xx1.ravel(), xx2.ravel()]
    X_mesh_poly = poly.transform(X_mesh)
    probs_mesh = model_poly.predict_proba(X_mesh_poly)[:, 1].reshape(xx1.shape)
    
    # Plot contours and decision boundary
    ax.contourf(xx1, xx2, probs_mesh, levels=20, cmap='RdBu_r', alpha=0.6)
    ax.contour(xx1, xx2, probs_mesh, levels=[0.5], colors='black', linewidths=2.5)
    
    # Plot training data
    ax.scatter(X_mix_train[y_mix_train == 0, 0], X_mix_train[y_mix_train == 0, 1],
                c='orange', edgecolors='k', s=40, alpha=0.7, label='Class 0 (train)')
    ax.scatter(X_mix_train[y_mix_train == 1, 0], X_mix_train[y_mix_train == 1, 1],
                c='skyblue', edgecolors='k', s=40, alpha=0.7, label='Class 1 (train)')
    
    # Plot test data with different marker
    ax.scatter(X_mix_test[y_mix_test == 0, 0], X_mix_test[y_mix_test == 0, 1],
                c='orange', edgecolors='k', s=80, marker='s', alpha=0.9, label='Class 0 (test)')
    ax.scatter(X_mix_test[y_mix_test == 1, 0], X_mix_test[y_mix_test == 1, 1],
                c='skyblue', edgecolors='k', s=80, marker='s', alpha=0.9, label='Class 1 (test)')
    
    ax.set_title(f'Degree={degree}, Features={poly_results[degree]["n_features"]}, '\
                     f'Acc={poly_results[degree]["accuracy"]:.3f}', fontsize=12)
    ax.set_xlabel('$x_1$', fontsize=11)
    ax.set_ylabel('$x_2$', fontsize=11)
    ax.grid(True, alpha=0.3)
    if idx == 0:
        ax.legend(fontsize=9, loc='best')

# Hide the last subplot if we have fewer than 6 degrees
if len(degrees) < 6:
    axes[5].axis('off')

plt.tight_layout()
plt.show()

### Analysis of Polynomial Degrees

Observe the following:
- **Degree 1 (Linear)**: Cannot capture the non-linear boundary, lower accuracy
- **Degree 2 (Quadratic)**: Begins to capture curvature in the decision boundary
- **Degree 3-4**: Better fit for complex boundaries
- **Degree 5+**: Risk of overfitting - may fit training noise rather than true pattern

**Key insight**: The mixture dataset requires polynomial features because the classes are not linearly separable. This demonstrates why feature engineering (like polynomial features) is important for logistic regression.

## Comparison with scikit-learn

In [None]:
# Train sklearn model
sklearn_model = SklearnLogisticRegression(penalty=None, max_iter=1000, random_state=42)
sklearn_model.fit(X_train, y_train)

# Compare
our_accuracy = accuracy_score(y_test, model.predict(X_test))
sklearn_accuracy = sklearn_model.score(X_test, y_test)

print(f"Our model accuracy:      {our_accuracy:.4f}")
print(f"sklearn model accuracy:  {sklearn_accuracy:.4f}")
print(f"\nOur weights:     {model.weights_}")
print(f"sklearn weights: {np.concatenate([sklearn_model.intercept_, sklearn_model.coef_[0]])}")

## Multiple Choice Questions

Test your understanding of logistic regression!

### Question 1: What does the sigmoid function do?

A) Maps any real number to the interval (0, 1)  
B) Maps probabilities to real numbers  
C) Computes the gradient  
D) Normalizes features  

<details>
<summary>Click to see answer</summary>

**Answer: A**

The sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ maps any real-valued input to a value strictly between 0 and 1, which can be interpreted as a probability.

</details>

### Question 2: What is the loss function for logistic regression?

A) Sum of Squared Errors (SSE)  
B) Mean Squared Error (MSE)  
C) Negative Log-Likelihood (NLL) / Binary Cross-Entropy  
D) Absolute Error  

<details>
<summary>Click to see answer</summary>

**Answer: C**

Logistic regression uses the Negative Log-Likelihood (also called binary cross-entropy):

$$J(\vec{w}) = -\sum_{i=1}^{N} \left[ y^{(i)} \log p^{(i)} + (1-y^{(i)}) \log(1-p^{(i)}) \right]$$

This loss function is appropriate for classification because it measures how well predicted probabilities match the actual labels.

</details>

### Question 3: What is the gradient for logistic regression?

A) $\nabla J = \Phi^T (\vec{y} - \vec{p})$  
B) $\nabla J = \Phi^T (\vec{p} - \vec{y})$  
C) $\nabla J = -2\Phi^T (\vec{y} - \vec{p})$  
D) $\nabla J = \Phi (\vec{p} - \vec{y})$  

<details>
<summary>Click to see answer</summary>

**Answer: B**

The gradient of the NLL loss with respect to weights is:

$$\nabla_{\vec{w}} J = \Phi^T (\vec{p} - \vec{y})$$

Where $\vec{p}$ are the predicted probabilities and $\vec{y}$ are the true labels.

</details>

### Question 4: What does the decision boundary represent?

A) Where P(y=1|x) = 1.0  
B) Where P(y=1|x) = 0.5  
C) Where P(y=1|x) = 0.0  
D) The maximum probability  

<details>
<summary>Click to see answer</summary>

**Answer: B**

The decision boundary is where $P(y=1|\vec{x}, \vec{w}) = 0.5$, which occurs when $\vec{x}^T \times \vec{w} = 0$. Points on one side of this boundary are classified as class 1, and points on the other side as class 0.

</details>
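For two features the boundary condition $w_0 + w_1 x_1 + w_2 x_2 = 0$ can be solved for $x_2$, which gives a straight line. The sketch below illustrates this with made-up weights (`w_example` is hypothetical, not the fitted model from this lab) and assumes $w_2 \neq 0$. It confirms that points on the line receive probability 0.5.

```python
import numpy as np
from scipy.special import expit

# Hypothetical weights in the order [bias, w1, w2]
w_example = np.array([0.5, -2.0, 1.5])

def boundary_x2(x1, w):
    """Solve w0 + w1*x1 + w2*x2 = 0 for x2 (requires w[2] != 0)."""
    return -(w[0] + w[1] * x1) / w[2]

x1_vals = np.linspace(-3, 3, 5)
x2_vals = boundary_x2(x1_vals, w_example)

# Points exactly on the line have score 0, so the sigmoid gives 0.5
Phi_line = np.c_[np.ones_like(x1_vals), x1_vals, x2_vals]
p_on_line = expit(Phi_line @ w_example)

# Shifting x2 upward by 1 adds w2 = 1.5 to the score, so P(y=1) > 0.5
Phi_above = np.c_[np.ones_like(x1_vals), x1_vals, x2_vals + 1.0]
p_above = expit(Phi_above @ w_example)

print("On the line:   ", p_on_line)
print("Above the line:", p_above)
```

Which side counts as class 1 depends on the sign of $w_2$; with a negative $w_2$ the points above the line would fall below 0.5 instead.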

### Question 5: When should we use higher learning rates?

A) Always, to converge faster  
B) When features are normalized  
C) When the model is overfitting  
D) Never, always use small learning rates  

<details>
<summary>Click to see answer</summary>

**Answer: B**

Higher learning rates (e.g., 0.1) can be used when features are normalized (e.g., using StandardScaler) because the gradients are on a similar scale. Without normalization, features with different scales can cause unstable updates, requiring smaller learning rates.

</details>

### Question 6: What metric is most important for spam detection?

A) Accuracy  
B) Precision (minimize false positives)  
C) Recall (minimize false negatives)  
D) F1-score  

<details>
<summary>Click to see answer</summary>

**Answer: B - Precision**

For spam detection, **precision** is typically more important because:

- **False positives** (legitimate emails marked as spam) are very costly - users might miss important emails
- **False negatives** (spam in inbox) are annoying but less costly

However, this depends on the specific application requirements. Some systems might prioritize recall to catch all spam, even if it means occasionally flagging legitimate emails.

</details>
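The trade-off between false positives and false negatives is easier to reason about when the metrics are computed directly from the confusion-matrix counts. The following sketch uses ten made-up labels (`y_true_demo`, `y_pred_demo` are hypothetical, with 1 meaning "spam") and checks the hand-computed values against scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels: 1 = spam, 0 = legitimate
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)   # of everything flagged as spam, how much was spam
recall_manual = tp / (tp + fn)      # of all actual spam, how much was caught
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Precision: {precision_manual:.4f}")
print(f"Recall:    {recall_manual:.4f}")
print(f"F1-score:  {f1_manual:.4f}")
```

In this toy example the two false positives lower precision, while the single missed spam message lowers recall.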

## Best Practices and Tips

### 1. Feature Scaling

Always normalize/standardize features when using gradient descent:

```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

### 2. Learning Rate Selection

- Start with α = 0.01 for normalized features
- If loss increases or oscillates: reduce α
- If convergence is too slow: increase α
- Monitor loss curve to diagnose

### 3. Handling Class Imbalance

- Use stratified splits: `train_test_split(..., stratify=y)`
- Consider weighted loss or resampling
- Focus on precision/recall instead of accuracy

### 4. Convergence

- Set reasonable `max_iter` (e.g., 1000-10000)
- Use `tol` to stop early when gradient is small
- Check if loss is still decreasing

### 5. Multiclass Classification

For more than 2 classes, use:

- One-vs-Rest (OvR): Train C binary classifiers
- Softmax regression (multinomial logistic regression)

### 6. Regularization

To prevent overfitting:

- L2 regularization: Add $\lambda ||\vec{w}||^2$ to loss
- L1 regularization: Add $\lambda ||\vec{w}||_1$ to loss

### 7. Evaluation

- Use cross-validation for small datasets
- Report multiple metrics: accuracy, precision, recall, F1
- Visualize confusion matrix
- Plot ROC curve and PR curve for threshold selection
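As an illustration of the regularization tip, adding $\lambda ||\vec{w}||^2$ to the loss adds $2\lambda\vec{w}$ to the gradient. The sketch below shows one way this could be wired into a gradient descent loop. It is not the lab's reference implementation: the function names, the toy data, and the choice to leave the bias weight unpenalized are all assumptions made for this example.

```python
import numpy as np
from scipy.special import expit

def l2_gradient(w, Phi, y, lam):
    """Gradient of NLL + lam * ||w[1:]||^2."""
    grad = Phi.T @ (expit(Phi @ w) - y)
    penalty_grad = 2.0 * lam * w
    penalty_grad[0] = 0.0            # assumption: the bias weight is not penalized
    return grad + penalty_grad

def fit_gd(Phi, y, lam, learning_rate=0.01, n_steps=2000):
    """Plain gradient descent from zero weights for a fixed number of steps."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        w -= learning_rate * l2_gradient(w, Phi, y, lam)
    return w

# Hypothetical noisy toy data
rng = np.random.default_rng(0)
X_reg = rng.normal(size=(40, 2))
y_reg = (X_reg[:, 0] + 0.5 * rng.normal(size=40) > 0).astype(float)
Phi_reg = np.c_[np.ones(len(X_reg)), X_reg]

w_plain = fit_gd(Phi_reg, y_reg, lam=0.0)   # no regularization
w_l2 = fit_gd(Phi_reg, y_reg, lam=1.0)      # with L2 penalty

print("Weight norm without L2:", np.linalg.norm(w_plain[1:]))
print("Weight norm with L2:   ", np.linalg.norm(w_l2[1:]))
```

The penalty pulls the non-bias weights toward zero, so the regularized fit should end with a smaller weight norm; how much smaller depends on $\lambda$, which would normally be chosen by cross-validation.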

## Summary

In this lab, you:

1. ✅ Implemented a custom **Logistic Regression** classifier from scratch
2. ✅ Understood the **sigmoid function** and how it models probabilities
3. ✅ Applied **gradient descent** to minimize the negative log-likelihood loss
4. ✅ Experimented with different **learning rates** and observed their effects
5. ✅ Used **K-fold cross-validation** to evaluate model performance
6. ✅ Applied **polynomial features** to model non-linear decision boundaries
7. ✅ Evaluated models using **accuracy, precision, recall, and F1-score**
8. ✅ Visualized **decision boundaries** and probability distributions
9. ✅ Compared your implementation with scikit-learn

### Key Takeaways

- Logistic regression models $P(y=1|\vec{x}, \vec{w}) = \sigma(\vec{x}^T \times \vec{w})$
- The loss function is the negative log-likelihood (binary cross-entropy)
- The gradient is $\nabla J = \Phi^T (\vec{p} - \vec{y})$
- Learning rate must be tuned carefully
- Feature scaling improves convergence
- Different applications require different metric priorities (precision vs recall)
- Cross-validation provides more reliable performance estimates

### Next Steps

- Try logistic regression on real-world datasets (e.g., breast cancer, iris)
- Implement multiclass classification using One-vs-Rest
- Add L2 regularization to prevent overfitting
- Experiment with different optimization algorithms (SGD, Adam)
- Compare with other classifiers (SVM, Decision Trees, Neural Networks)