# Logistic Regression
## With Hyperparameter Tuning and Cross Validation

## What is Logistic Regression?
- Logistic Regression is a **binary classification algorithm** (a two-class problem). It predicts whether the output is True/False, Yes/No, 0/1, or Class1/Class2.
- It finds the **best-fit decision boundary** between the two classes.
- Unlike Linear Regression, whose output is an unbounded value on a straight line, it **maps predictions to probabilities** using the **sigmoid function**.
- To solve a multi-class problem (e.g. 3 classes), you can train one Logistic Regression model per class, each separating that class from all the others (One vs Rest).
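
The One vs Rest idea can be sketched with scikit-learn's `OneVsRestClassifier`. This is a minimal illustration on a hypothetical 3-class toy dataset (the names `X_demo`, `y_demo` and `ovr` are assumptions made for this sketch and are not used elsewhere in this notebook).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical 3-class toy dataset (not the dataset used later in this notebook)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# One binary Logistic Regression model is fitted per class (class k vs the rest)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_demo, y_demo)

print(len(ovr.estimators_))     # number of binary models that were trained
print(ovr.predict(X_demo[:5]))  # each prediction comes from the most confident model
```

With 3 classes, three binary models are trained, and the class whose model is most confident wins.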

## Logistic Regression Formula:
$P(Y=1 | X) = \sigma(z) = \frac{1}{1 + e^{-z}}$

Where:
- $P(Y=1 | X)$ → Probability that the output $Y$ is **1** given input $X$.
- $\sigma(z)$ → Sigmoid function that converts $z$ into a probability.
- $e$ → Euler's number (**≈2.718**), the base of the natural logarithm.
- $z$ → Linear combination of input features:

  $z = w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n$

  - $w_0$ → Bias term (intercept).
  - $w_1, w_2, ..., w_n$ → Model weights (learned from data).
  - $x_1, x_2, ..., x_n$ → Input feature values.

## Example:  
  
Predicting whether a Customer will buy a Product based on age and income per year.
  
$z = w_0 + w_1x_1 + w_2x_2$  
  
- $x_1 =$ Customer Age (in years)   
- $x_2 =$ Customer Income (in 1000s per year)  
- $w_0 = -4$ (bias / intercept)  
- $w_1 = 0.05$ (weight / slope for age)  
- $w_2 = 0.1$ (weight / slope for income)  

Solve for Age $x_1 = 30$  
Salary per year of 50,000 USD, so $x_2 = 50$ (in 1000s)  

### Solution:

**Step 1: Compute Z**

$z = w_0 + w_1x_1 + w_2x_2$  

z = (−4) + (0.05×30) + (0.1×50)  
z = −4 + 1.5 + 5  
z = 2.5  

**Step 2: Apply the Sigmoid Function**

$\sigma(z) = \frac{1}{1 + e^{-z}}$  
  
$\sigma(2.5) = \frac{1}{1 + e^{-2.5}}, \text{ where } e \approx 2.718$  

$\sigma(2.5) = \frac{1}{1 + 0.082}$  

$\sigma(2.5) \approx 0.924$


**Step 3: Interpret the Probability**

- The model predicts a 92.4% probability that the customer will buy the product.
- Since $\sigma(2.5) \approx 0.924 > 0.5$, we classify the customer as likely to buy.
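
The three steps can be checked with a few lines of plain Python. This is a small sketch that uses the values from Step 1 (0.05 for age, 0.1 for income in 1000s); the helper name `sigmoid` and the variable names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Map any real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Weights and inputs taken from the worked example
w0, w1, w2 = -4.0, 0.05, 0.1
age, income_k = 30, 50  # income expressed in 1000s per year

z = w0 + w1 * age + w2 * income_k        # Step 1: linear combination
p = sigmoid(z)                           # Step 2: squash to a probability
label = "buy" if p > 0.5 else "not buy"  # Step 3: apply the 0.5 threshold

print(f"z = {z:.2f}, P(buy) = {p:.3f}, prediction = {label}")
```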

## Model Performance

### Confusion Matrix:



<img src="./data/ConfusionMatrix.jpg" alt="Confusion Matrix" width="600" height="400">

### Confusion Matrix in Python

In Python, scikit-learn returns the confusion matrix as a 2D array whose rows are the actual classes and whose columns are the predicted classes, both ordered by label (0 first). TN therefore sits in the top-left corner, so the array may look flipped compared with a diagram that puts TP first.

[[TN   FP]  
 [FN   TP]]  

Where:

|   | $0_{Predicted}$ | $1_{Predicted}$ |
|---|---|---|
| $0_{Actual}$ | $TN_{[0, 0]}$ | $FP_{[0, 1]}$ |
| $1_{Actual}$ | $FN_{[1, 0]}$ | $TP_{[1, 1]}$ |

Where:  
- [0, 0] = TN -> $Actual_{False}$, $Predicted_{False}$  
- [0, 1] = FP -> $Actual_{False}$, $Predicted_{True}$  
- [1, 0] = FN -> $Actual_{True}$, $Predicted_{False}$  
- [1, 1] = TP -> $Actual_{True}$, $Predicted_{True}$  
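
The layout can be verified on a tiny example. The labels below are hypothetical and chosen only so that each cell is easy to count by hand; `y_actual` and `y_predicted` are names assumed for this sketch.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, small enough to count every cell by hand
y_actual    = [0, 0, 0, 1, 1, 1, 1, 0]
y_predicted = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_actual, y_predicted)
print(cm)  # rows = actual class, columns = predicted class

# ravel() flattens the 2x2 array row by row: [0,0], [0,1], [1,0], [1,1]
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Unpacking in the order `tn, fp, fn, tp` is what keeps the four counts aligned with the table above.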

### Other Metrics

- Precision -> $\frac{TP}{(TP + FP)}$  

- Recall -> $\frac{TP}{(TP + FN)}$  

- F1 Score -> $2 \times \frac{(Precision * Recall)}{(Precision + Recall)}$  

- Accuracy -> $\frac{(TP + TN)}{(TP + TN + FP + FN)}$

- Error Rate -> $1 - Accuracy$
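
The formulas above translate directly into code. This is a minimal sketch with hypothetical counts; the function name `metrics_from_counts` is an assumption for illustration, and in practice scikit-learn's `precision_score`, `recall_score`, `f1_score` and `accuracy_score` compute the same values from the labels.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute the metrics listed above from the four confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
    }

# Hypothetical counts, not taken from any model in this notebook
demo_metrics = metrics_from_counts(tp=40, fp=10, fn=20, tn=30)
for name, value in demo_metrics.items():
    print(f"{name}: {value:.4f}")
```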

In [471]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, precision_score, recall_score, f1_score

## Generate simulated data

In [472]:
# Generate simulated dataset (numpy array)
X, y = make_classification(n_samples=10000, n_features=10, n_classes=2, random_state=42)
X = pd.DataFrame(X)

X.head(5), y[:5]

(          0         1         2         3         4         5         6  \
 0  0.633563  0.357385 -0.503931  0.935066  0.647981 -0.050796 -1.933989   
 1  1.283905  1.109459 -0.908953  1.006586  0.492219  1.107295  1.243526   
 2 -0.966476 -0.593314  0.458020  1.032323  1.283685 -0.317640  1.499045   
 3  2.429309 -1.306530 -1.869925  3.092164  2.028800 -0.879635 -0.393494   
 4 -1.204798  0.078464  0.705181  0.224765  0.618707  1.534946 -0.302288   
 
           7         8         9  
 0  2.081684  0.041266 -0.258298  
 1 -0.172200  1.150359  0.147744  
 2  0.434477  0.423678  1.251380  
 3 -0.101213 -1.624066  0.443553  
 4  2.325055  0.495505  0.538133  ,
 array([1, 1, 1, 1, 0]))

In [473]:
# Check the unique values and their counts in array y
np.unique(y, return_counts=True)

(array([0, 1]), array([4988, 5012]))

## Train the Model on Training Data

In [474]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((7000, 10), (7000,), (3000, 10), (3000,))

In [475]:
logistic_m1 = LogisticRegression()

In [476]:
# Fit the model using training data
logistic_m1 = logistic_m1.fit(X_train, y_train)

## Make Prediction on Test Data

In [477]:
# Predict on Test Data
y_pred_m1 = logistic_m1.predict(X_test)
y_pred_m1

array([1, 1, 1, ..., 0, 1, 0])

In [478]:
# Get predicted probabilities
pred_prob_m1 = logistic_m1.predict_proba(X_test)

pred_prob_m1[:5]

array([[0.01640705, 0.98359295],
       [0.11288143, 0.88711857],
       [0.11091352, 0.88908648],
       [0.93541396, 0.06458604],
       [0.82523652, 0.17476348]])

In [479]:
# Convert the predicted probability of class 1 into labels: 'Class 1' if P(y=1) > 0.5, else 'Class 2'
y_pred_m1_class = pd.DataFrame(np.where(pred_prob_m1[:, 1] > 0.5, 'Class 1', 'Class 2'))

y_pred_m1_class.head()

|   | 0 |
|---|---|
| 0 | Class 1 |
| 1 | Class 1 |
| 2 | Class 1 |
| 3 | Class 2 |
| 4 | Class 2 |
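
The 0.5 cut-off is only the default decision rule. The sketch below, on a hypothetical toy dataset (`X_demo`, `y_demo`, `clf` and the 0.8 threshold are all assumptions for illustration), shows that thresholding the class 1 probability at 0.5 is expected to reproduce `predict()`, and that a stricter threshold yields fewer positive predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, separate from the notebook's dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

proba_class1 = clf.predict_proba(X_demo)[:, 1]  # column 1 holds P(y = 1)

default_labels = (proba_class1 > 0.5).astype(int)  # same rule predict() applies
strict_labels = (proba_class1 > 0.8).astype(int)   # hypothetical stricter cut-off

print(np.array_equal(default_labels, clf.predict(X_demo)))
print(default_labels.sum(), strict_labels.sum())   # positives at each threshold
```

Raising the threshold trades recall for precision, which can be useful when false positives are costly.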


## Evaluate Performance Metrics

In [480]:
# Helper function to compute and display the performance report
def evaluate_display_model(y_true, y_pred, desc, class_rpt = True):
    """
    Computes and displays classification metrics including:
    - Confusion Matrix
    - Precision
    - Recall
    - F1 Score
    - Accuracy
    - Error Rate
    - Classification Report (Conditional)
    
    Args:
    y_true (array-like): Actual class labels
    y_pred (array-like): Predicted class labels
    desc (str): Description printed as the heading of the model stats
    class_rpt (bool, optional): If True (Default), displays a classification report.
    
    Returns:
    dict: A dictionary containing all computed metrics.
    """
    # Compute confusion matrix
    conf_matrix = confusion_matrix(y_true, y_pred)

    # Extract True Positives (TP), False Positives (FP), False Negatives (FN), True Negatives (TN)
    TN, FP, FN, TP = conf_matrix.ravel()

    # Compute metrics
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1score = f1_score(y_true, y_pred)
    accuracy = accuracy_score(y_true, y_pred)
    error_rate = 1 - accuracy
    class_report = ""
    if class_rpt:
        class_report = classification_report(y_true, y_pred, zero_division=1)

    # Print Confusion Matrix
    print(f"\n{desc} Performance:")
    print("\nConfusion Matrix:")
    print(conf_matrix)
    print(f"\nWhere:")
    print(f"TP (Actual +ve, Predicted +ve) = {TP}")
    print(f"FP (Actual -ve, Predicted +ve) = {FP}")
    print(f"FN (Actual +ve, Predicted -ve) = {FN}")
    print(f"TN (Actual -ve, Predicted -ve) = {TN}")

    # Print Other Metrics
    print(f"\nOther Metrics:")
    print(f"Precision (TP / (TP + FP)): {precision:.4f}")
    print(f"Recall (TP / (TP + FN)): {recall:.4f}")
    print(f"F1 Score (2 * (Precision * Recall) / (Precision + Recall)): {f1score:.4f}")
    print(f"Accuracy ((TP + TN) / (TP + TN + FP + FN)): {accuracy:.4f}")
    print(f"Error Rate (1 - Accuracy): {error_rate:.4f}")
    if class_rpt:
        print(f"\nClassification Report:\n{class_report}")

    # Return all metrics in a dictionary (useful for further analysis)
    return {
        "Confusion Matrix": conf_matrix,
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1score,
        "Accuracy": accuracy,
        "Error Rate": error_rate,
        "Classification Report": class_report
    }

In [481]:
# Performance Metrics
evaluate_display_model(y_test, y_pred_m1, "Model 1 (Before Tuning)")


Model 1 (Before Tuning) Performance:

Confusion Matrix:
[[1329  155]
 [ 168 1348]]

Where:
TP (Actual +ve, Predicted +ve) = 1348
FP (Actual -ve, Predicted +ve) = 155
FN (Actual +ve, Predicted -ve) = 168
TN (Actual -ve, Predicted -ve) = 1329

Other Metrics:
Precision (TP / (TP + FP)): 0.8969
Recall (TP / (TP + FN)): 0.8892
F1 Score (2 * (Precision * Recall) / (Precision + Recall)): 0.8930
Accuracy ((TP + TN) / (TP + TN + FP + FN)): 0.8923
Error Rate (1 - Accuracy): 0.1077

Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.90      0.89      1484
           1       0.90      0.89      0.89      1516

    accuracy                           0.89      3000
   macro avg       0.89      0.89      0.89      3000
weighted avg       0.89      0.89      0.89      3000



{'Confusion Matrix': array([[1329,  155],
        [ 168, 1348]]),
 'Precision': np.float64(0.8968729208250167),
 'Recall': np.float64(0.8891820580474934),
 'F1 Score': np.float64(0.8930109307717787),
 'Accuracy': 0.8923333333333333,
 'Error Rate': 0.10766666666666669,
 'Classification Report': '              precision    recall  f1-score   support\n\n           0       0.89      0.90      0.89      1484\n           1       0.90      0.89      0.89      1516\n\n    accuracy                           0.89      3000\n   macro avg       0.89      0.89      0.89      3000\nweighted avg       0.89      0.89      0.89      3000\n'}

## Interpret the Performance Metrics

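- High Precision (89.7%)  
- High Recall (88.9%)
- High F1 Score (89.3%)
- High Accuracy (89.2%)
- Low False Positive Rate (10.4%, i.e. FP / (FP + TN) = 155 / 1484)
- Low False Negative Rate (11.1%, i.e. FN / (FN + TP) = 168 / 1516)

**Conclusion:** This is a good model: its metrics are balanced across both classes and were measured on the held-out test set, which suggests it generalizes well to unseen data.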
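
The rates in this interpretation can be recomputed from the Model 1 confusion matrix printed above. This is a short check; the variable names are assumptions, and the four counts are copied from that output.

```python
# Counts copied from the Model 1 confusion matrix output above
TN, FP, FN, TP = 1329, 155, 168, 1348

precision = TP / (TP + FP)
recall = TP / (TP + FN)
accuracy = (TP + TN) / (TP + TN + FP + FN)
fpr = FP / (FP + TN)  # share of actual negatives predicted as positive
fnr = FN / (FN + TP)  # share of actual positives predicted as negative

print(f"Precision: {precision:.1%}")
print(f"Recall:    {recall:.1%}")
print(f"Accuracy:  {accuracy:.1%}")
print(f"FPR:       {fpr:.1%}")
print(f"FNR:       {fnr:.1%}")
```

Note that the False Negative Rate is simply `1 - recall`.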

## HyperParameter Tuning And Cross Validation (CV)

### **HyperParameter Tuning**

**Hyperparameter tuning** is the process of selecting the best combination of **hyperparameters** to optimize a machine learning model’s performance.

- **Grid Search CV**
  - **Grid Search CV** is an exhaustive search technique that evaluates all possible **combinations** of hyperparameters.
  - It performs a **brute-force search** over a predefined **grid of values**.
  - **Best for small datasets** where computational cost is manageable.
  - **Disadvantage:** Can be **computationally expensive** for large datasets.
   
- **Random Search CV**
  - **Random Search CV** is a hyperparameter tuning technique that, instead of searching every possible combination like Grid Search, **randomly selects values** for hyperparameters.
  - It is **faster than Grid Search CV** and can be **more effective in high-dimensional spaces**.
  - **Best for large hyperparameter spaces** where exhaustive search is impractical.
  - **Disadvantage:** It might **miss the best combination** since it does not check all possibilities.
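
The difference between the two strategies shows up in how many candidates they generate. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` on a hypothetical search space (`search_space` and `n_iter=5` are assumptions for illustration).

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search space with 2 * 5 * 2 possible combinations
search_space = {
    "penalty": ["l1", "l2"],
    "C": [100, 10, 1.0, 0.1, 0.01],
    "solver": ["liblinear", "saga"],
}

# Grid search enumerates every combination
grid_candidates = list(ParameterGrid(search_space))

# Random search draws only n_iter combinations
random_candidates = list(ParameterSampler(search_space, n_iter=5, random_state=0))

print(len(grid_candidates))    # all combinations
print(len(random_candidates))  # only the sampled ones
for candidate in random_candidates:
    print(candidate)
```

Each candidate is then fitted once per cross validation fold, so the total number of model fits is the number of candidates times the number of folds.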


### **Cross Validation**

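**Cross-validation** is a technique used to estimate how a model will perform on unseen data. It repeatedly splits the training data into a training part and a validation part, fits the model on the former, scores it on the latter, and averages the scores.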

- **Leave One Out Cross Validation (LOOCV):** 
  
  - Model is validated on **each individual data point** in the training set.
  - Requires **N iterations** (where N = number of samples).
  - **Example**: If the dataset has 100 samples, **100 models are trained**.
  - **Advantage:** Utilizes **maximum data** for training.
  - **Disadvantage:** Computationally expensive for large datasets.
  
- **Leave P Out Cross Validation (LPOCV):**
  - Similar to LOOCV, but instead of leaving one data point out at a time, it leaves P data points out.
  - Requires $\binom{N}{P}$ iterations, because every possible set of P points is used for validation once.
  - **Example**: If dataset has 100 samples and P=5, then **95 samples are used for training, and 5 for validation** in each iteration.
  
- **K-Fold Cross Validation:** 
  - Splits the dataset into **K equal parts** (or "folds").
  - The model is trained on **K-1 folds** and validated on the **remaining fold**.
  - This process repeats **K times**, each time using a different fold for validation.
  - **Total Iterations:** $K$ (each fold holds about $N/K$ samples)
  - **Example**:
    - If dataset size = **500** and **K = 10**, then:
      - **Iterations** = $10$, each validating on a fold of $500/10 = 50$ samples
    - The **average accuracy across K iterations** is used as the performance metric.
  - **Advantage:** More efficient than LOOCV.
  - **Disadvantage:** May not perform well on small datasets, because each model is trained on only K-1 of the K folds.

- **Stratified K-Fold Cross Validation:**
  - Works like **K-Fold CV** but **preserves the class distribution** in each fold.
  - Useful for **imbalanced datasets** where some classes appear much less frequently.
  - Ensures that **each fold has the same proportion of classes** as the original dataset.
  - **Example**:
    - If dataset has **70% class A and 30% class B**, each fold maintains the **same ratio**.
  - **Advantage:** Prevents bias due to imbalanced data.
  - **Disadvantage:** Slightly more complex than standard K-Fold CV. 

- **Time Series Cross Validation:**
  - Used for **time-dependent data** (e.g., stock prices, weather forecasting).
  - Unlike random splits, **future data must never be used to train a model that is validated on past data**.
  - Splitting is done **sequentially**, preserving the time order.
  - **Example**:
    - Train on **January to June**, test on **July**.
    - Train on **January to July**, test on **August**.
    - And so on...
  - **Advantage:** Maintains temporal dependencies.
  - **Disadvantage:** Not all data is used in early iterations.
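
The time series example above can be reproduced with scikit-learn's `TimeSeriesSplit`. This is a minimal sketch; the `months` array and the choice of `n_splits=2, test_size=1` are assumptions picked so the splits match the January to June / July pattern.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical monthly observations, already sorted in time order
months = np.array(["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug"])

# Two splits, each validating on exactly one month
tscv = TimeSeriesSplit(n_splits=2, test_size=1)
splits = list(tscv.split(months))

for train_idx, test_idx in splits:
    # Training indices always come before the validation indices
    print("train:", months[train_idx], "test:", months[test_idx])
```

The training window grows with every split, while the validation month always lies after it.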

| Cross-Validation Method | Best Used For | Iterations | Key Feature |
|-------------------------|--------------|------------|-------------|
| **LOOCV** | Small datasets | N | Uses 1 sample for validation per iteration |
| **LPOCV** | Small datasets | $\binom{N}{P}$ | Uses P samples for validation per iteration |
| **K-Fold CV** | General-purpose | K | Splits into K folds and rotates validation set |
| **Stratified K-Fold** | Imbalanced datasets | K | Preserves class proportions in folds |
| **Time Series CV** | Time-dependent data | Variable | Preserves chronological order |
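
The number of iterations each method performs can be checked with `get_n_splits`. The sketch below uses a hypothetical dataset of 10 samples (`X_demo`, `y_demo` and the chosen values of P and K are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import (
    KFold, LeaveOneOut, LeavePOut, StratifiedKFold, TimeSeriesSplit,
)

# Hypothetical data: N = 10 samples, 5 per class
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 5 + [1] * 5)

n_loo = LeaveOneOut().get_n_splits(X_demo)         # N iterations
n_lpo = LeavePOut(p=2).get_n_splits(X_demo)        # "N choose P" iterations
n_kfold = KFold(n_splits=5).get_n_splits(X_demo)   # K iterations
n_strat = StratifiedKFold(n_splits=5).get_n_splits(X_demo, y_demo)  # K iterations
n_ts = TimeSeriesSplit(n_splits=3).get_n_splits(X_demo)             # chosen by the user

print(n_loo, n_lpo, n_kfold, n_strat, n_ts)

# Stratified folds keep the class ratio: each validation fold gets one sample per class
fold_counts = [
    np.bincount(y_demo[val_idx]).tolist()
    for _, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo)
]
print(fold_counts)
```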

## Grid Search Cross Validation

In [482]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold


### Defining Hyperparameters

In [483]:
# Define hyperparameters
# penalty = ['l1', 'l2', 'elasticnet']  # L1 = Lasso, L2 = Ridge, ElasticNet = combination of both
penalty = ['l1', 'l2']
c_values = [100, 10, 1.0, 0.1, 0.01]  # Inverse of regularization strength (smaller C = stronger regularization)
# solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']  # Candidate solvers; newton-cg, lbfgs and sag do not support the l1 penalty
solver = ['liblinear', 'saga']  # Both support the l1 and l2 penalties; only saga also supports elasticnet
l1_ratio = [0.1, 0.5, 0.9]  # Only used with the elasticnet penalty (not passed to the grid below)

In [484]:
# Create a dictionary of hyperparameters to pass to GridSearchCV
params = dict( 
    penalty = penalty,
    C = c_values,
    solver = solver,
    #l1_ratio = l1_ratio # Only needed for ElasticNet penalty 
)

In [485]:
# Stratified K-Fold Cross Validation
# Use 5 folds, so each candidate is trained and validated 5 times
startify_kfold_cv = StratifiedKFold(n_splits=5, shuffle=False)

In [486]:
# Create Logistic Model
logistic_m2 = LogisticRegression()

In [487]:
# Define the Grid Search cross validator to tune the hyperparameters
grid_search = GridSearchCV(
    estimator = logistic_m2,
    param_grid = params,
    scoring = 'accuracy', # Pick the combination with the highest mean accuracy
    cv = startify_kfold_cv, # Stratified 5-fold splitter defined above
    n_jobs = -1, # Use all available CPU cores
)

### Training the Model with Grid Search CV

In [488]:
grid_search = grid_search.fit(X_train, y_train)

grid_search

In [489]:
grid_search.best_params_, grid_search.best_score_

({'C': 0.01, 'penalty': 'l1', 'solver': 'liblinear'},
 np.float64(0.8917142857142857))
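
Beyond `best_params_` and `best_score_`, a fitted search also keeps the score of every candidate in `cv_results_`. The sketch below is self-contained and uses a hypothetical toy dataset and a smaller grid (`X_demo`, `y_demo`, `demo_search` and the three `C` values are assumptions for illustration, not the notebook's search).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data and a deliberately small grid
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0]},
    scoring="accuracy",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, with its mean and spread across the folds
results = pd.DataFrame(demo_search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))

# With the default refit=True, the best candidate is refitted on all the data passed to fit()
print(demo_search.best_estimator_)
```

Looking at `std_test_score` helps judge whether the gap between candidates is larger than the fold-to-fold noise.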

### Testing the Tuned Model

In [490]:
y_pred_m2 = grid_search.predict(X_test)

### Comparing Models Performance

#### We can observe a slight improvement in Accuracy (our selected scoring metric for hyperparameter tuning) for the tuned model: 0.8923 before tuning vs 0.8950 after.

In [491]:
# Performance Metrics
evaluate_display_model(y_test, y_pred_m1, "Model 1 (Before Tuning)")
pass


Model 1 (Before Tuning) Performance:

Confusion Matrix:
[[1329  155]
 [ 168 1348]]

Where:
TP (Actual +ve, Predicted +ve) = 1348
FP (Actual -ve, Predicted +ve) = 155
FN (Actual +ve, Predicted -ve) = 168
TN (Actual -ve, Predicted -ve) = 1329

Other Metrics:
Precision (TP / (TP + FP)): 0.8969
Recall (TP / (TP + FN)): 0.8892
F1 Score (2 * (Precision * Recall) / (Precision + Recall)): 0.8930
Accuracy ((TP + TN) / (TP + TN + FP + FN)): 0.8923
Error Rate (1 - Accuracy): 0.1077

Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.90      0.89      1484
           1       0.90      0.89      0.89      1516

    accuracy                           0.89      3000
   macro avg       0.89      0.89      0.89      3000
weighted avg       0.89      0.89      0.89      3000



In [492]:
evaluate_display_model(y_test, y_pred_m2, "Model 2 (After GridSearchCV)")
pass


Model 2 (After GridSearchCV) Performance:

Confusion Matrix:
[[1332  152]
 [ 163 1353]]

Where:
TP (Actual +ve, Predicted +ve) = 1353
FP (Actual -ve, Predicted +ve) = 152
FN (Actual +ve, Predicted -ve) = 163
TN (Actual -ve, Predicted -ve) = 1332

Other Metrics:
Precision (TP / (TP + FP)): 0.8990
Recall (TP / (TP + FN)): 0.8925
F1 Score (2 * (Precision * Recall) / (Precision + Recall)): 0.8957
Accuracy ((TP + TN) / (TP + TN + FP + FN)): 0.8950
Error Rate (1 - Accuracy): 0.1050

Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.90      0.89      1484
           1       0.90      0.89      0.90      1516

    accuracy                           0.90      3000
   macro avg       0.89      0.90      0.89      3000
weighted avg       0.90      0.90      0.90      3000



## Randomized Search Cross Validation

In [493]:
from sklearn.model_selection import RandomizedSearchCV

In [494]:
logistic_m3 = LogisticRegression()

In [495]:
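random_search = RandomizedSearchCV(
    estimator=logistic_m3, 
    param_distributions=params,  # Same values as the grid above, but sampled instead of enumerated
    cv = 5,  # 5-fold cross validation (stratified by default for classifiers)
    scoring='accuracy'  # n_iter defaults to 10, so 10 of the 20 combinations are tried
)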

In [496]:
random_search = random_search.fit(X_train, y_train)

In [497]:
random_search.best_params_, random_search.best_score_

({'solver': 'liblinear', 'penalty': 'l1', 'C': 0.01},
 np.float64(0.8917142857142857))
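
Random search becomes more useful when a hyperparameter is sampled from a continuous distribution instead of a fixed list. The sketch below is self-contained and rests on assumptions made for illustration: a toy dataset, a log-uniform range for `C`, `n_iter=8`, and a fixed `random_state` so the sampled candidates are reproducible.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical toy data, separate from the notebook's dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

demo_random = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # sample C on a log scale
    n_iter=8,          # number of sampled candidates
    cv=3,
    scoring="accuracy",
    random_state=42,   # makes the sampled candidates reproducible
)
demo_random.fit(X_demo, y_demo)

print(demo_random.best_params_, demo_random.best_score_)
```

Sampling on a log scale spreads the candidates evenly across orders of magnitude, which a short fixed list of values cannot do.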

In [498]:
y_pred_m3 = random_search.predict(X_test)

In [499]:
evaluate_display_model(y_test, y_pred_m3, "Model 3 (After RandomSearchCV)")
pass


Model 3 (After RandomSearchCV) Performance:

Confusion Matrix:
[[1332  152]
 [ 163 1353]]

Where:
TP (Actual +ve, Predicted +ve) = 1353
FP (Actual -ve, Predicted +ve) = 152
FN (Actual +ve, Predicted -ve) = 163
TN (Actual -ve, Predicted -ve) = 1332

Other Metrics:
Precision (TP / (TP + FP)): 0.8990
Recall (TP / (TP + FN)): 0.8925
F1 Score (2 * (Precision * Recall) / (Precision + Recall)): 0.8957
Accuracy ((TP + TN) / (TP + TN + FP + FN)): 0.8950
Error Rate (1 - Accuracy): 0.1050

Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.90      0.89      1484
           1       0.90      0.89      0.90      1516

    accuracy                           0.90      3000
   macro avg       0.89      0.90      0.89      3000
weighted avg       0.90      0.90      0.90      3000

