# 013: Logistic Regression for Classification

Logistic regression predicts the **probability** of class membership, using the sigmoid function to map a linear combination of features to the interval (0, 1).

### 📊 Classification Concept

```mermaid
graph LR
    A[Features X] --> B[Linear Combination]
    B --> C[Sigmoid Function]
    C --> D[Probability 0 to 1]
    D --> E{Threshold 0.5}
    E -->|P >= 0.5| F[Class 1]
    E -->|P < 0.5| G[Class 0]
    style C fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style F fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style G fill:#f44336,stroke:#333,stroke-width:2px,color:#fff
```

### The Sigmoid Function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

$$P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$

Where:
- $z = \mathbf{w}^T \mathbf{x} + b$ is the linear combination
- $\sigma(z)$ maps $(-\infty, +\infty) \rightarrow (0, 1)$
- $P(y=1|\mathbf{x})$ is the probability of the positive class

### Decision Boundary

Classification decision at $\sigma(z) = 0.5$:
$$\mathbf{w}^T \mathbf{x} + b = 0$$

This defines a **linear** decision boundary in feature space.
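As a quick numeric check of the boundary condition, the sketch below uses made-up weights `w` and bias `b` (hypothetical values, not fitted to any data) and confirms that thresholding the probability at 0.5 gives the same labels as checking the sign of $z$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for a 2-feature model (illustrative only)
w = np.array([2.0, -1.0])
b = 0.5

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2))

z = X_demo @ w + b
labels_from_prob = (sigmoid(z) >= 0.5).astype(int)  # threshold the probability
labels_from_sign = (z >= 0).astype(int)             # check which side of the line

print("Same labels:", np.array_equal(labels_from_prob, labels_from_sign))

# A point exactly on the line w.x + b = 0 gets probability 0.5
x_on_boundary = np.array([0.0, 0.5])  # 2*0 - 1*0.5 + 0.5 = 0
p_on_boundary = sigmoid(x_on_boundary @ w + b)
print("P at boundary:", p_on_boundary)
```

Because the boundary depends only on the sign of $z$, moving the threshold away from 0.5 shifts the line parallel to itself but keeps it linear.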

### 🎯 Logistic Regression Workflow

```mermaid
graph TD
    A[Labeled Data] --> B[Explore Class Balance]
    B --> C{Imbalanced?}
    C -->|Yes| D[Handle Imbalance]
    C -->|No| E[Train-Test Split]
    D --> E
    E --> F[Fit Logistic Model]
    F --> G[Predict Probabilities]
    G --> H[Evaluate: Accuracy, Precision, Recall, ROC]
    H --> I{Good Performance?}
    I -->|Yes| J[Deploy]
    I -->|No| K[Feature Engineering or Try Non-linear]
    K --> F
    style F fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style J fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
```

### When to Use Logistic Regression?

✅ **Use when:**
- Binary or multi-class classification task
- Need probability estimates (not just class labels)
- Interpretability important (coefficients show feature impact)
- Baseline model before complex classifiers
- Linear decision boundary sufficient

❌ **Don't use when:**
- Non-linear decision boundaries required
- Classes overlap heavily in the available features (no linear boundary separates them well)
- Need to capture complex interactions
- Image/video classification (use CNNs)

### 🏭 Real-World Applications

**Post-Silicon Validation:**
- Pass/fail prediction from parametric tests
- Bin classification (speed grades, quality tiers)
- Defect detection (faulty vs good devices)
- Wafer acceptance decisions (ship vs scrap)

**General AI/ML:**
- Customer churn prediction
- Fraud detection
- Email spam classification
- Medical diagnosis
- Credit approval

---

## 2. Setup and Data Preparation

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, confusion_matrix, classification_report,
                             roc_curve, roc_auc_score, precision_recall_curve)
import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print('✅ Libraries imported successfully')
print(f'NumPy version: {np.__version__}')
print(f'Pandas version: {pd.__version__}')

### 📝 What's Happening in This Code?

**Purpose:** Import classification-specific libraries and metrics

**Key Points:**
- **LogisticRegression**: Scikit-learn's classifier with multiple solvers and regularization
- **Classification metrics**: Accuracy, precision, recall, F1, ROC-AUC for thorough evaluation
- **Confusion matrix**: Shows true/false positives/negatives for error analysis
- **ROC/PR curves**: Visualize tradeoffs between metrics at different thresholds

**Why This Matters:**
- Classification requires different metrics than regression (not RMSE/R²)
- Single metric insufficient - need precision AND recall
- ROC curve essential for threshold tuning in production
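To illustrate why a single metric is not enough, here is a minimal sketch with hypothetical labels (95 negatives, 5 positives) and a trivial "model" that always predicts the majority class. The label counts are invented for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 95 negatives (0) and 5 positives (1)
y_true_demo = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_pred_demo = np.zeros_like(y_true_demo)

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
rec_demo = recall_score(y_true_demo, y_pred_demo, zero_division=0)
prec_demo = precision_score(y_true_demo, y_pred_demo, zero_division=0)

print(f"Accuracy:  {acc_demo:.2f}")
print(f"Recall:    {rec_demo:.2f}")
print(f"Precision: {prec_demo:.2f}")
```

Accuracy looks strong while recall shows that no positive case was ever found, which is why the notebook reports several metrics side by side.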

### 2.1 Generate Binary Classification Dataset

### 📝 What's Happening in This Code?

**Purpose:** Create synthetic pass/fail classification data mimicking semiconductor testing

**Key Points:**
- **Two classes**: Pass (1) vs Fail (0) devices
- **Linearly separable**: With some overlap to mimic real measurement noise
- **STDF-like features**: Voltage, current, frequency, temperature
- **Class balance**: 70-30 split (realistic for yield scenarios)

**Why This Approach:**
- Mimics real STDF pass/fail outcomes
- Known ground truth for validation
- Demonstrates binary classification mechanics clearly

In [None]:
def generate_device_passfail_data(n_samples=500, noise=0.3):
    """
    Generate semiconductor device pass/fail classification data
    """
    # Features: voltage, current, frequency, temperature
    voltage = np.random.uniform(0.9, 1.1, n_samples)
    current = np.random.uniform(80, 120, n_samples)
    frequency = np.random.uniform(2.0, 3.0, n_samples)
    temperature = np.random.uniform(25, 85, n_samples)
    
    # Decision boundary (pass if conditions met)
    score = (voltage - 1.0) * 100 + (current - 100) * 0.5 - (temperature - 55) * 0.3 + frequency * 10
    score += np.random.randn(n_samples) * noise * 10
    
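    # Pass (1) if score is above its 30th percentile, giving the intended ~70/30 split.
    # A fixed cutoff of 0 would pass nearly every device, because frequency * 10 alone adds 20-30.
    pass_fail = (score > np.percentile(score, 30)).astype(int)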
    
    # Create feature matrix
    X = np.column_stack([voltage, current, frequency, temperature])
    y = pass_fail
    
    return X, y

# Generate dataset
X, y = generate_device_passfail_data(n_samples=500, noise=0.3)

# Create DataFrame
df = pd.DataFrame(X, columns=['Voltage_V', 'Current_mA', 'Frequency_GHz', 'Temperature_C'])
df['Pass_Fail'] = y
df['Class_Label'] = df['Pass_Fail'].map({0: 'Fail', 1: 'Pass'})

print('✅ Binary classification dataset generated')
print(f'Total samples: {len(df)}')
print(f'\nClass distribution:')
print(df['Class_Label'].value_counts())
print(f'\nClass balance: {df["Pass_Fail"].mean()*100:.1f}% Pass, {(1-df["Pass_Fail"].mean())*100:.1f}% Fail')
print('\nFirst 5 samples:')
print(df.head())

### 2.2 Exploratory Data Analysis

### 📝 What's Happening in This Code?

**Purpose:** Visualize feature distributions and class separability

**Key Points:**
- **Box plots**: Show feature distributions for each class (Pass vs Fail)
- **Separation**: Clear difference indicates features are predictive
- **Overlap**: Some overlap expected (real-world noise)
- **Feature importance preview**: Features with more separation matter more

**Why This Matters:**
- Visual confirmation that classification is feasible
- Identifies which features discriminate classes best
- Helps set expectations for model accuracy

In [None]:
# Visualize class distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for i, col in enumerate(['Voltage_V', 'Current_mA', 'Frequency_GHz', 'Temperature_C']):
    df.boxplot(column=col, by='Class_Label', ax=axes[i])
    axes[i].set_title(f'{col} by Class')
    axes[i].set_xlabel('Class')
    axes[i].set_ylabel(col)

plt.suptitle('Feature Distributions by Pass/Fail Class', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print('📊 Interpretation:')
print('   → Pass devices: Higher voltage, current, frequency')
print('   → Fail devices: Lower performance parameters, higher temperature')
print('   → Overlap indicates classification challenge')

### 2.3 Train-Test Split

### 📝 What's Happening in This Code?

**Purpose:** Split data with stratification to preserve class balance

**Key Points:**
- **Stratify**: Ensures both train and test have same class proportions
- **Critical for imbalanced data**: Prevents test set having all one class
- **80-20 split**: Standard ratio
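- **StandardScaler**: Fit on the training set only, then applied to the test set; scaling matters because the L2 penalty and gradient descent are both sensitive to feature scale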

**Why This Matters:**
- Without stratification, might get unlucky split (e.g., 90% Pass in train, 50% in test)
- Ensures consistent evaluation across experiments
- Fair comparison with other models
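The effect of `stratify` can be seen on a small example. This is a sketch with hypothetical labels (90% class 1) and a placeholder feature column; it compares a stratified split with a plain random one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 180 of class 1, 20 of class 0
y_demo = np.array([1] * 180 + [0] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# Stratified split: class proportions are preserved in both parts
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
# Plain random split: proportions can drift, especially in the small test set
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(f"Overall class-1 share: {y_demo.mean():.3f}")
print(f"Stratified: train {y_tr_s.mean():.3f}, test {y_te_s.mean():.3f}")
print(f"Random:     train {y_tr_r.mean():.3f}, test {y_te_r.mean():.3f}")
```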

In [None]:
# Prepare features and target
X = df[['Voltage_V', 'Current_mA', 'Frequency_GHz', 'Temperature_C']].values
y = df['Pass_Fail'].values

# Split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f'✅ Data split completed')
print(f'Training samples: {len(X_train)} (Pass: {y_train.sum()}, Fail: {len(y_train)-y_train.sum()})')
print(f'Test samples: {len(X_test)} (Pass: {y_test.sum()}, Fail: {len(y_test)-y_test.sum()})')
print(f'\nClass balance preserved:')
print(f'  Train: {y_train.mean()*100:.1f}% Pass')
print(f'  Test: {y_test.mean()*100:.1f}% Pass')

---

## 3. Mathematical Foundation

### 3.1 Sigmoid Function Properties

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Properties:
- $\lim_{z \to \infty} \sigma(z) = 1$
- $\lim_{z \to -\infty} \sigma(z) = 0$
- $\sigma(0) = 0.5$
- $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
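These properties are easy to verify numerically. The sketch below checks the limits at large $|z|$, the value at zero, and the derivative identity against a central finite difference; the step size `h` is an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z_demo = np.linspace(-5, 5, 11)
h = 1e-5  # finite-difference step (arbitrary small value)

# Central difference approximation of the derivative
numeric_grad = (sigmoid(z_demo + h) - sigmoid(z_demo - h)) / (2 * h)
# Closed-form derivative: sigma(z) * (1 - sigma(z))
analytic_grad = sigmoid(z_demo) * (1 - sigmoid(z_demo))

print("sigma(0)   =", sigmoid(0.0))
print("sigma(30)  =", sigmoid(30.0))
print("sigma(-30) =", sigmoid(-30.0))
print("max |numeric - analytic| =", np.max(np.abs(numeric_grad - analytic_grad)))
```

The derivative peaks at $z = 0$ and vanishes in both tails, which is why gradient updates slow down once predictions become very confident.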

### 3.2 Log-Likelihood and Cross-Entropy Loss

**For binary classification:**
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i) \right]$$

Where $\hat{p}_i = \sigma(\mathbf{w}^T \mathbf{x}_i + b)$

**Goal:** Minimize cross-entropy by finding optimal $\mathbf{w}$, $b$
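A small worked example may help make the loss concrete. The labels and predicted probabilities below are invented for illustration; the manual computation follows the formula above and is compared against scikit-learn's `log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical labels and predicted probabilities of the positive class
y_true_demo = np.array([1, 0, 1, 1])
p_hat_demo = np.array([0.9, 0.2, 0.6, 0.3])

# Per-sample binary cross-entropy, term by term
per_sample = -(y_true_demo * np.log(p_hat_demo)
               + (1 - y_true_demo) * np.log(1 - p_hat_demo))
manual_loss = per_sample.mean()
sklearn_loss = log_loss(y_true_demo, p_hat_demo)

print("Per-sample losses:", np.round(per_sample, 4))
print("Mean cross-entropy:", round(float(manual_loss), 4))
print("sklearn log_loss:  ", round(float(sklearn_loss), 4))
```

The last sample (true label 1, predicted 0.3) contributes the largest term: cross-entropy penalizes confident mistakes much more heavily than mild uncertainty.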

### 3.3 Gradient Descent Update

No closed-form solution → use iterative optimization:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla_{\mathbf{w}} \mathcal{L}$$

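Gradients:
$$\nabla_{\mathbf{w}} \mathcal{L} = \frac{1}{n} \mathbf{X}^T (\hat{\mathbf{p}} - \mathbf{y}), \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)$$

The bias is updated the same way: $b_{t+1} = b_t - \eta \, \partial \mathcal{L} / \partial b$.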
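One way to gain confidence in the weight-gradient formula is a numerical gradient check. The sketch below uses random, hypothetical data and parameters, and compares the analytic gradient with central finite differences of the cross-entropy loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, b, X, y):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data and parameters (random, for checking only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (rng.random(50) < 0.5).astype(float)
w_demo = rng.normal(size=3)
b_demo = 0.1

# Analytic gradient: (1/n) X^T (p_hat - y)
p_demo = sigmoid(X_demo @ w_demo + b_demo)
grad_analytic = X_demo.T @ (p_demo - y_demo) / len(y_demo)

# Numerical gradient: perturb one weight at a time
h = 1e-6
grad_numeric = np.zeros_like(w_demo)
for j in range(len(w_demo)):
    step = np.zeros_like(w_demo)
    step[j] = h
    loss_plus = cross_entropy(w_demo + step, b_demo, X_demo, y_demo)
    loss_minus = cross_entropy(w_demo - step, b_demo, X_demo, y_demo)
    grad_numeric[j] = (loss_plus - loss_minus) / (2 * h)

print("Analytic:", grad_analytic)
print("Numeric: ", grad_numeric)
```

The same check is a useful debugging tool when a hand-written training loop fails to converge.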

### 3.4 Odds and Logit

**Odds:**
$$\text{Odds} = \frac{P(y=1)}{P(y=0)} = \frac{P(y=1)}{1 - P(y=1)}$$

**Logit (log-odds):**
$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \mathbf{w}^T \mathbf{x} + b$$

Logistic regression models the **log-odds** as a linear function of the features.
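The relationship between probability, odds, and logit can be traced with a few hand-picked probabilities (the values are arbitrary examples). Applying the sigmoid to the logit recovers the original probability.

```python
import numpy as np

# Arbitrary example probabilities
p_demo = np.array([0.1, 0.5, 0.8])

odds_demo = p_demo / (1 - p_demo)          # p = 0.8 corresponds to odds of 4 to 1
logit_demo = np.log(odds_demo)             # log-odds; zero when p = 0.5
p_back = 1 / (1 + np.exp(-logit_demo))     # sigmoid is the inverse of the logit

for p, o, l in zip(p_demo, odds_demo, logit_demo):
    print(f"p={p:.1f}  odds={o:.3f}  logit={l:+.3f}")
```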

---

## 4. Implementation from Scratch

### 📝 What's Happening in This Code?

**Purpose:** Build logistic regression from scratch using gradient descent

**Key Points:**
- **Sigmoid implementation**: Core transformation function
- **Cross-entropy loss**: Proper loss for classification (not MSE)
- **Gradient descent**: Iterative weight updates with learning rate
- **Convergence tracking**: Monitor loss to verify training progress

**Why This Matters:**
- Demystifies logistic regression internals
- Shows why gradient descent needed (no closed form)
- Understanding helps debug convergence issues in production

In [None]:
class LogisticRegressionScratch:
    """
    Logistic Regression from scratch using gradient descent
    """
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.losses = []
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))  # Clip for numerical stability
    
    def compute_loss(self, y_true, y_pred_prob):
        """Binary cross-entropy loss"""
        epsilon = 1e-15
        y_pred_prob = np.clip(y_pred_prob, epsilon, 1 - epsilon)
        return -np.mean(y_true * np.log(y_pred_prob) + (1 - y_true) * np.log(1 - y_pred_prob))
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        # Gradient descent
        for i in range(self.n_iter):
            # Forward pass
            linear_pred = X @ self.weights + self.bias
            y_pred_prob = self.sigmoid(linear_pred)
            
            # Compute loss
            loss = self.compute_loss(y, y_pred_prob)
            self.losses.append(loss)
            
            # Compute gradients
            dw = (1/n_samples) * X.T @ (y_pred_prob - y)
            db = (1/n_samples) * np.sum(y_pred_prob - y)
            
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
        
        return self
    
    def predict_proba(self, X):
        linear_pred = X @ self.weights + self.bias
        return self.sigmoid(linear_pred)
    
    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)
    
    def score(self, X, y):
        return accuracy_score(y, self.predict(X))

# Train from-scratch model
model_scratch = LogisticRegressionScratch(learning_rate=0.1, n_iterations=1000)
model_scratch.fit(X_train_scaled, y_train)

# Evaluate
train_acc = model_scratch.score(X_train_scaled, y_train)
test_acc = model_scratch.score(X_test_scaled, y_test)

print('✅ From-Scratch Logistic Regression')
print(f'Training Accuracy: {train_acc:.4f}')
print(f'Test Accuracy: {test_acc:.4f}')
print(f'Final Loss: {model_scratch.losses[-1]:.4f}')

# Plot loss curve
plt.figure(figsize=(10, 5))
plt.plot(model_scratch.losses, linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Cross-Entropy Loss', fontsize=12)
plt.title('Training Loss Convergence', fontsize=13, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

---

## 5. Production Implementation with Scikit-learn

### 📝 What's Happening in This Code?

**Purpose:** Train production-grade logistic regression with regularization

**Key Points:**
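- **Penalty**: L2 (Ridge) by default; `C` is the inverse regularization strength, so smaller `C` means a stronger penalty
- **Solver**: 'lbfgs' efficient for small datasets, 'saga' for large/L1
- **max_iter**: Raised from the default of 100 to 1000 so the solver has room to converge
- **Probability output**: `predict_proba` returns class probabilities for threshold tuning and decision making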

**Why This Matters:**
- Production model needs regularization for robustness
- Different solvers optimize different scenarios
- Probability estimates critical for threshold tuning
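To see what the regularization strength does in practice, the sketch below fits the same model at three values of `C` on synthetic data from `make_classification` (a stand-in dataset, not the device data from this notebook) and prints the coefficient norm for each.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: any reasonably informative dataset works)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, solver='lbfgs', max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = np.linalg.norm(clf.coef_)
    print(f"C={C:<6} ||w|| = {coef_norms[C]:.3f}")
```

A small `C` pulls the coefficients toward zero, which trades some training fit for stability; the right value is usually chosen by cross-validation.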

In [None]:
# Train sklearn model with L2 regularization
model_sklearn = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=1000, random_state=42)
model_sklearn.fit(X_train_scaled, y_train)

# Predictions
y_train_pred = model_sklearn.predict(X_train_scaled)
y_test_pred = model_sklearn.predict(X_test_scaled)
y_test_pred_proba = model_sklearn.predict_proba(X_test_scaled)[:, 1]

# Comprehensive metrics
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
test_auc = roc_auc_score(y_test, y_test_pred_proba)

print('🎯 Logistic Regression Performance (Scikit-learn)\n')
print('='*60)
print(f'Training Accuracy:     {train_acc:.4f}')
print(f'Test Accuracy:         {test_acc:.4f}')
print(f'Test Precision:        {test_precision:.4f}')
print(f'Test Recall:           {test_recall:.4f}')
print(f'Test F1-Score:         {test_f1:.4f}')
print(f'Test ROC-AUC:          {test_auc:.4f}')
print('='*60)

# Classification report
print('\n📊 Detailed Classification Report:\n')
print(classification_report(y_test, y_test_pred, target_names=['Fail', 'Pass']))

### 5.1 Confusion Matrix Analysis

### 📝 What's Happening in This Code?

**Purpose:** Visualize prediction errors with confusion matrix

**Key Points:**
- **True Positives (TP)**: Correctly predicted Pass
- **True Negatives (TN)**: Correctly predicted Fail
- **False Positives (FP)**: Predicted Pass, actually Fail (Type I error)
- **False Negatives (FN)**: Predicted Fail, actually Pass (Type II error)

**Why This Matters:**
- Different errors have different costs (FP vs FN)
- In semiconductor: FP = shipping bad devices (costly recalls)
- FN = scrapping good devices (lost revenue)
- Helps set optimal classification threshold

In [None]:
# Compute confusion matrix
cm = confusion_matrix(y_test, y_test_pred)

# Visualize
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Fail', 'Pass'], yticklabels=['Fail', 'Pass'],
            annot_kws={'size': 16}, ax=ax)
ax.set_xlabel('Predicted Label', fontsize=12)
ax.set_ylabel('True Label', fontsize=12)
ax.set_title('Confusion Matrix (Test Set)', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print('🔍 Confusion Matrix Breakdown:')
print(f'   True Negatives (TN):  {cm[0, 0]} - Correctly predicted Fail')
print(f'   False Positives (FP): {cm[0, 1]} - Predicted Pass, actually Fail ⚠️')
print(f'   False Negatives (FN): {cm[1, 0]} - Predicted Fail, actually Pass')
print(f'   True Positives (TP):  {cm[1, 1]} - Correctly predicted Pass')
print(f'\n   → Precision = TP/(TP+FP) = {cm[1,1]}/{cm[1,1]+cm[0,1]} = {test_precision:.3f}')
print(f'   → Recall = TP/(TP+FN) = {cm[1,1]}/{cm[1,1]+cm[1,0]} = {test_recall:.3f}')

### 5.2 ROC Curve and AUC

### 📝 What's Happening in This Code?

**Purpose:** Evaluate classifier performance across all thresholds

**Key Points:**
- **ROC Curve**: Plots True Positive Rate vs False Positive Rate
- **AUC**: Area Under Curve - single metric for overall performance (higher better)
- **AUC = 1.0**: Perfect classifier
- **AUC = 0.5**: Random guessing
- **Threshold selection**: Choose point on curve based on cost tradeoffs

**Why This Matters:**
- Default 0.5 threshold may not be optimal
- ROC shows full performance spectrum
- AUC enables model comparison (one number)
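Threshold selection based on cost tradeoffs can be sketched as a simple sweep. The costs `COST_FP` and `COST_FN` below are hypothetical, and the data is synthetic; in practice the threshold would be tuned on a validation split rather than the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

COST_FP = 10.0  # hypothetical cost of a false positive
COST_FN = 1.0   # hypothetical cost of a false negative

# Synthetic stand-in data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
proba_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_val)[:, 1]

# Sweep thresholds 0.05, 0.10, ..., 0.95 and total the cost at each one
thresholds_demo = np.round(np.arange(1, 20) * 0.05, 2)
costs_demo = []
for t in thresholds_demo:
    y_hat = (proba_demo >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_val, y_hat, labels=[0, 1]).ravel()
    costs_demo.append(COST_FP * fp + COST_FN * fn)
costs_demo = np.array(costs_demo)

best_idx = int(np.argmin(costs_demo))
cost_at_half = costs_demo[np.isclose(thresholds_demo, 0.5)][0]
print(f"Cost at threshold 0.50: {cost_at_half:.0f}")
print(f"Best threshold: {thresholds_demo[best_idx]:.2f} (cost {costs_demo[best_idx]:.0f})")
```

When false positives are the expensive error, the cost-minimizing threshold typically moves above 0.5, so fewer borderline cases are labeled positive.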

In [None]:
# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_test_pred_proba)

# Plot ROC curve
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {test_auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Classifier (AUC = 0.5)')
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate (Recall)', fontsize=12)
plt.title('ROC Curve - Logistic Regression', fontsize=13, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print('📈 ROC Curve Interpretation:')
print(f'   → AUC = {test_auc:.4f} (closer to 1.0 = better)')
if test_auc > 0.9:
    print('   → Excellent classifier (AUC > 0.9)')
elif test_auc > 0.8:
    print('   → Good classifier (AUC > 0.8)')
elif test_auc > 0.7:
    print('   → Fair classifier (AUC > 0.7)')
else:
    print('   → Poor classifier (AUC < 0.7)')

### 5.3 Precision-Recall Curve

### 📝 What's Happening in This Code?

**Purpose:** Alternative view focusing on precision-recall tradeoff

**Key Points:**
- **Precision-Recall Curve**: Better for imbalanced datasets than ROC
- **High precision**: Few false positives (strict predictions)
- **High recall**: Few false negatives (catch all positives)
- **Tradeoff**: Can't maximize both simultaneously

**Why This Matters:**
- In production, choose threshold based on business costs
- Semiconductor: High precision (avoid shipping bad devices) may sacrifice recall
- Different applications need different thresholds
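If the requirement is stated as a minimum precision, the precision-recall curve can be searched directly. This sketch assumes a hypothetical target of 0.90 and synthetic data; it picks the lowest threshold that meets the target, since recall can only fall as the threshold rises.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

TARGET_PRECISION = 0.90  # hypothetical business requirement

# Synthetic stand-in data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=1
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)
proba_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_val)[:, 1]

precision_arr, recall_arr, thr_arr = precision_recall_curve(y_val, proba_demo)

# precision_arr and recall_arr have one more entry than thr_arr; drop the final point
meets_target = precision_arr[:-1] >= TARGET_PRECISION

chosen_thr = None
if meets_target.any():
    # thr_arr is sorted ascending, so the first hit is the lowest qualifying threshold
    chosen_thr = thr_arr[meets_target][0]
    y_pred_demo = (proba_demo >= chosen_thr).astype(int)
    print(f"Threshold {chosen_thr:.3f}: "
          f"precision={precision_score(y_val, y_pred_demo):.3f}, "
          f"recall={recall_score(y_val, y_pred_demo):.3f}")
else:
    print("No threshold reaches the target precision on this data")
```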

In [None]:
# Compute Precision-Recall curve
precision_vals, recall_vals, pr_thresholds = precision_recall_curve(y_test, y_test_pred_proba)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(recall_vals, precision_vals, linewidth=2, label='Precision-Recall Curve')
plt.xlabel('Recall (True Positive Rate)', fontsize=12)
plt.ylabel('Precision', fontsize=12)
plt.title('Precision-Recall Curve - Logistic Regression', fontsize=13, fontweight='bold')
plt.legend(loc='best', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print('⚖️ Precision-Recall Tradeoff:')
print('   → Higher threshold → Higher precision, Lower recall')
print('   → Lower threshold → Higher recall, Lower precision')
print(f'   → Current (0.5 threshold): Precision={test_precision:.3f}, Recall={test_recall:.3f}')

### 5.4 Feature Importance (Coefficients)

### 📝 What's Happening in This Code?

**Purpose:** Interpret which features drive classification decisions

**Key Points:**
- **Positive coefficients**: Feature increases probability of Pass
- **Negative coefficients**: Feature decreases probability of Pass
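- **Magnitude**: Larger |coefficient| → stronger effect (comparable across features here because inputs are standardized)
- **Odds ratio**: $e^{w_j}$ is the multiplicative change in the odds of Pass per one-unit increase in feature $j$ (one standard deviation after scaling)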

**Why This Matters:**
- Explains model to domain experts ("high voltage → more likely to pass")
- Validates domain knowledge (coefficients match physics)
- Identifies unexpected relationships (debugging data issues)

In [None]:
# Extract coefficients
feature_names = ['Voltage_V', 'Current_mA', 'Frequency_GHz', 'Temperature_C']
coefficients = model_sklearn.coef_[0]
intercept = model_sklearn.intercept_[0]

# Create DataFrame
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients,
    'Abs_Coefficient': np.abs(coefficients),
    'Odds_Ratio': np.exp(coefficients)
}).sort_values('Abs_Coefficient', ascending=False)

print('📊 Feature Importance (Logistic Regression Coefficients)\n')
print(coef_df.to_string(index=False))
print(f'\nIntercept (bias): {intercept:.4f}')

# Visualize
plt.figure(figsize=(10, 6))
colors = ['green' if c >= 0 else 'red' for c in coef_df['Coefficient']]
plt.barh(coef_df['Feature'], coef_df['Coefficient'], color=colors, edgecolor='black', alpha=0.7)
plt.xlabel('Coefficient Value', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Importance (Logistic Regression Coefficients)', fontsize=13, fontweight='bold')
plt.axvline(0, color='black', linewidth=0.8)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print('\n🔍 Interpretation:')
print('   → Positive coefficients increase Pass probability')
print('   → Negative coefficients increase Fail probability')
print('   → Odds ratio > 1: Feature increases odds of Pass')
print('   → Odds ratio < 1: Feature decreases odds of Pass')

---

## 6. Multi-Class Classification

### 📝 What's Happening in This Code?

**Purpose:** Extend binary logistic regression to multi-class problems

**Key Points:**
- **Three strategies**: One-vs-Rest (OvR), One-vs-One (OvO), Softmax (multinomial)
- **OvR**: Train N binary classifiers (class vs all others)
- **OvO**: Train N(N-1)/2 pairwise classifiers (one per pair of classes)
- **Softmax**: Direct multi-class probabilities (preferred)
- **Bin classification**: Realistic semiconductor scenario (speed grades)

**Why This Matters:**
- Most real problems have >2 classes
- Semiconductor: Bins (Premium, Standard, Value, Reject)
- Softmax yields probabilities that sum to 1 across all classes
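For intuition about the softmax strategy, here is a minimal NumPy sketch with invented per-class scores. It shows that each row of probabilities sums to 1 and that the two-class case reduces to the sigmoid used earlier.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum for numerical stability (does not change the result)
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores for 2 devices across 3 bins
scores_demo = np.array([[2.0, 1.0, 0.1],
                        [0.5, 0.5, 3.0]])
probs_demo = softmax(scores_demo)
print("Probabilities:\n", np.round(probs_demo, 3))
print("Row sums:", probs_demo.sum(axis=1))
print("Predicted bins:", probs_demo.argmax(axis=1))

# Two classes with scores [0, z]: the class-1 probability equals sigmoid(z)
z_demo = np.array([-2.0, 0.0, 1.5])
two_class = softmax(np.column_stack([np.zeros_like(z_demo), z_demo]))[:, 1]
print("Two-class softmax:", np.round(two_class, 4))
print("Sigmoid:          ", np.round(sigmoid(z_demo), 4))
```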

In [None]:
# Generate 3-class dataset (device bins)
def generate_multiclass_bins(n_samples=500):
    X, _ = generate_device_passfail_data(n_samples, noise=0.3)
    
    # Create 3 bins based on performance score
    score = X[:, 0] * 100 + X[:, 1] * 0.5 + X[:, 2] * 10 - X[:, 3] * 0.3
    
    # Bin assignment: Premium (2), Standard (1), Reject (0)
    y_multi = np.zeros(n_samples, dtype=int)
    y_multi[score > np.percentile(score, 66)] = 2  # Top 33% → Premium
    y_multi[(score > np.percentile(score, 33)) & (score <= np.percentile(score, 66))] = 1  # Middle → Standard
    # Bottom 33% stays 0 → Reject
    
    return X, y_multi

X_multi, y_multi = generate_multiclass_bins(n_samples=500)

# Split
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
    X_multi, y_multi, test_size=0.2, random_state=42, stratify=y_multi
)

# Scale
scaler_m = StandardScaler()
X_train_m_scaled = scaler_m.fit_transform(X_train_m)
X_test_m_scaled = scaler_m.transform(X_test_m)

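# Train multi-class logistic regression (softmax)
# With solver='lbfgs', scikit-learn fits the multinomial (softmax) model by default for
# multi-class targets; the explicit multi_class argument is deprecated in recent releases.
model_multi = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)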
model_multi.fit(X_train_m_scaled, y_train_m)

# Evaluate
y_test_m_pred = model_multi.predict(X_test_m_scaled)
test_acc_multi = accuracy_score(y_test_m, y_test_m_pred)

print('🎯 Multi-Class Logistic Regression (3 Bins)\n')
print(f'Test Accuracy: {test_acc_multi:.4f}')
print('\nClassification Report:\n')
print(classification_report(y_test_m, y_test_m_pred, target_names=['Reject', 'Standard', 'Premium']))

# Confusion matrix
cm_multi = confusion_matrix(y_test_m, y_test_m_pred)
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm_multi, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Reject', 'Standard', 'Premium'],
            yticklabels=['Reject', 'Standard', 'Premium'],
            annot_kws={'size': 14}, ax=ax)
ax.set_xlabel('Predicted Bin', fontsize=12)
ax.set_ylabel('True Bin', fontsize=12)
ax.set_title('Multi-Class Confusion Matrix', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

---

## 7. Real-World Projects

### 🔬 Post-Silicon Validation Projects

#### **Project 1: Parametric Test Pass/Fail Prediction**

**Objective:** Predict device pass/fail from early parametric test results to enable fast screening.

**Business Value:**
- Reduce test time by 30% (skip remaining tests for predicted fails)
- Early identification of process issues
- Real-time yield monitoring
- Lower cost per device tested

**Dataset Features:**
- Early tests: Basic DC parameters (Vdd, Idd, leakage)
- Fast digital tests (scan, BIST results)
- Temperature, voltage corner conditions
- Wafer spatial data (die_x, die_y)

**Implementation Guide:**
1. Use LogisticRegression with L2 penalty for robustness
2. Optimize threshold: Minimize FP (shipping bad devices)
3. Monitor precision (avoid costly field returns)
4. Cross-validate on different wafer lots
5. Feature engineering: Add test ratios, deltas

**Expected Outcomes:** 95%+ accuracy, <2% false positive rate

---

#### **Project 2: Multi-Bin Speed Grade Classification**

**Objective:** Classify devices into speed bins (Premium/Standard/Value) for pricing tiers.

**Business Value:**
- Maximize revenue (charge more for faster devices)
- Optimize product mix
- Enable market segmentation
- Reduce overtest (only test bins customers want)

**Implementation Guide:**
1. Use multinomial logistic regression (softmax)
2. Features: Frequency at different voltages, timing paths
3. Handle class imbalance (few premium devices)
4. Validate bin boundaries with product team
5. Consider cost of misclassification (Premium → Standard worse than Standard → Value)

**Expected Outcomes:** 90%+ 3-class accuracy, revenue optimization 15%

---

#### **Project 3: Wafer-Level Defect Detection**

**Objective:** Classify dies as defective vs good using spatial pattern analysis.

**Business Value:**
- Identify systematic defects (process issues)
- Guide process improvements
- Reduce false failures (improve yield)
- Lower test costs (skip tests on known defects)

**Implementation Guide:**
1. Features: Parametric values + spatial coordinates
2. Add neighbor features (defect clustering)
3. Handle severe class imbalance (1-5% defect rate)
4. Use class_weight='balanced' in LogisticRegression
5. Focus on recall (catch all defects)

**Expected Outcomes:** 98%+ recall, AUC > 0.95
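A rough sketch of step 4, assuming synthetic data with about 5% positives rather than real wafer data. It prints the weights that `class_weight='balanced'` implies (via `compute_class_weight`) and compares recall with and without weighting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in for a defect dataset with roughly 5% positives
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# 'balanced' weights are n_samples / (n_classes * class_count)
weights_demo = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_tr
)
print("Implied class weights:", dict(zip([0, 1], np.round(weights_demo, 2))))

recalls_demo = {}
for cw in [None, 'balanced']:
    clf = LogisticRegression(class_weight=cw, max_iter=1000).fit(X_tr, y_tr)
    recalls_demo[str(cw)] = recall_score(y_te, clf.predict(X_te))
    print(f"class_weight={cw}: recall={recalls_demo[str(cw)]:.3f}")
```

Weighting usually raises recall on the rare class at the expense of precision, so the resulting false-positive rate should be checked against the cost of unnecessary retests or scrap.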

---

#### **Project 4: Test Flow Optimization**

**Objective:** Predict which test category devices will fail to optimize test sequence.

**Business Value:**
- Reduce average test time (fail fast)
- Lower ATE costs
- Increase throughput
- Dynamic test reordering

**Implementation Guide:**
1. Multi-class: Predict fail category (Digital, Analog, Memory, Pass)
2. Train on historical test sequences
3. Optimize test order: High-failure tests first
4. A/B test reordered sequences
5. Monitor for process changes (retrain quarterly)

**Expected Outcomes:** 20% test time reduction, same quality

---

### 📊 General AI/ML Projects

#### **Project 5: Customer Churn Prediction**

**Objective:** Predict which customers are likely to cancel their subscriptions.

**Business Value:**
- Proactive retention campaigns
- Reduce churn rate by 25%
- Increase customer lifetime value
- Targeted interventions (discounts, support)

**Implementation:** Binary logistic regression on usage patterns, support tickets, billing history

---

#### **Project 6: Fraud Detection in Transactions**

**Objective:** Real-time classification of fraudulent vs legitimate transactions.

**Business Value:**
- Prevent financial losses
- Improve customer trust
- Regulatory compliance
- Minimize false positives (avoid blocking real transactions)

**Implementation:** Imbalanced classification (<1% fraud), optimize for high recall with acceptable precision

---

#### **Project 7: Email Spam Classification**

**Objective:** Filter spam emails from inbox using text features.

**Business Value:**
- Improve user experience
- Reduce security risks
- Save time (no manual filtering)
- Interpretable rules (explain why email is spam)

**Implementation:** TF-IDF features + Logistic Regression with L1 (feature selection)

---

#### **Project 8: Medical Diagnosis (Disease Prediction)**

**Objective:** Predict disease presence from patient symptoms and test results.

**Business Value:**
- Early diagnosis
- Guide treatment decisions
- Reduce diagnostic costs
- Interpretable (doctors understand coefficients)

**Implementation:** Multi-class for disease types, careful threshold selection (minimize false negatives)

---

## 8. Key Takeaways

### ✅ When to Use Logistic Regression

1. **Binary or multi-class classification** where a linear decision boundary is a reasonable approximation
2. **Need probability estimates** for decision making
3. **Interpretability critical** (stakeholder approval)
4. **Baseline model** before trying complex classifiers
5. **Real-time inference** required (fast predictions)

### ⚠️ Limitations

- **Linear decision boundary**: Can't capture complex non-linear patterns
- **Feature engineering**: Need to manually create interactions/polynomials
- **Multicollinearity**: Strongly correlated features make coefficients unstable and hard to interpret (the model assumes independent observations, not independent features)
- **Sensitive to outliers**: Regularization helps but not perfect

**Better Alternatives:**
- **Tree-based models** (Random Forest, XGBoost): Non-linear, minimal tuning
- **Neural networks**: Very complex decision boundaries
- **SVM with kernels**: Non-linear with implicit feature spaces

### 🎯 Best Practices

1. **Check class balance**: Use stratified split, handle imbalance if needed
2. **Scale features**: StandardScaler for numerical stability
3. **Use regularization**: L2 (Ridge) by default, L1 (Lasso) for feature selection
4. **Multiple metrics**: Don't rely only on accuracy (especially with imbalance)
5. **Optimize threshold**: Default 0.5 may not be optimal for your costs
6. **Cross-validation**: Ensure robustness across data splits
7. **Monitor calibration**: Probabilities should be well-calibrated
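Several of these practices combine naturally in a scikit-learn pipeline. The sketch below is one possible arrangement on synthetic data: wrapping the scaler and the classifier together means each cross-validation fold fits the scaler on its own training portion, avoiding leakage from the held-out fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Scaling happens inside each fold because it is part of the pipeline
pipe_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores_demo = cross_val_score(pipe_demo, X_demo, y_demo, cv=cv_demo, scoring='roc_auc')
print("ROC-AUC per fold:", np.round(scores_demo, 3))
print(f"Mean: {scores_demo.mean():.3f}  Std: {scores_demo.std():.3f}")
```

A large spread across folds is a sign that the single train-test split used earlier may give an optimistic or pessimistic estimate.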

### 📚 Next Learning Steps

1. **`014_Support_Vector_Regression.ipynb`** - SVMs and kernel methods
2. **`016_Decision_Trees.ipynb`** - Non-linear classifiers
3. **`024_Support_Vector_Machines.ipynb`** - SVM for classification
4. **Neural networks** - Deep learning for complex patterns

### 🔑 Core Concepts Mastered

✅ Sigmoid function and probability estimation  
✅ Cross-entropy loss and gradient descent  
✅ Confusion matrix and classification metrics  
✅ ROC curve, AUC, and threshold selection  
✅ Multi-class strategies (OvR, softmax)  
✅ Feature interpretation via coefficients  
✅ Handling class imbalance  

---

**Congratulations!** You now understand classification fundamentals, probability estimation, and how to evaluate classifiers properly. These skills apply to all classification algorithms, not just logistic regression.

Continue to **014_Support_Vector_Regression** for robust regression with kernel methods.