<a href="https://colab.research.google.com/github/umair594/100-Prediction-Models-/blob/main/Probit_Regression_%E2%80%93_Classification_Using_a_Probit_Link_Function_9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project 9: Probit Regression – Classification Using a Probit Link Function**

**Abstract**

Probit Regression is a type of regression used for modeling binary classification problems, similar to logistic regression, but it uses the probit link function, which is based on the cumulative distribution function (CDF) of the standard normal distribution. This model is particularly useful when the error term is assumed to follow a normal distribution. This project explores the theoretical foundations, implementation, and evaluation of Probit Regression using Python. Model performance is analyzed through metrics such as accuracy, confusion matrix, and classification report. Results demonstrate Probit Regression as an interpretable alternative to logistic regression for binary classification problems.

**Introduction**

Binary classification involves predicting a categorical outcome with two classes (e.g., yes/no, success/failure). Logistic regression is commonly used, but in some cases, the assumption of normally distributed latent errors motivates the use of Probit Regression.

Probit regression is widely applied in:

Economics (e.g., modeling choice probabilities)

Medicine (disease presence/absence)

Social sciences (yes/no survey responses)

The project aims to:

Understand Probit Regression theory and link function.

Implement a Probit Regression model using Python.

Evaluate model performance using classification metrics.

Compare probability-based predictions and interpret coefficients.

**Theoretical Background**

Latent Variable Model

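Probit regression assumes a latent (unobserved) continuous variable $y^*$ that depends linearly on the input features $x$:

$$y^* = x^\top \beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 1)$$

The observed binary outcome $y$ is:

$$y = \begin{cases} 1 & \text{if } y^* > 0 \\ 0 & \text{otherwise} \end{cases}$$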
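The latent-variable view can be checked with a short simulation. This is a minimal sketch assuming a single feature, standard normal errors, and the rule that $y = 1$ whenever the latent value is positive; the coefficients `beta0` and `beta1` and the feature value `x` are hypothetical.

```python
import random
from statistics import NormalDist

random.seed(0)

beta0, beta1 = -0.5, 1.2  # hypothetical intercept and slope
x = 0.8                   # one fixed feature value

n = 100_000
hits = 0
for _ in range(n):
    eps = random.gauss(0.0, 1.0)       # standard normal error term
    y_star = beta0 + beta1 * x + eps   # latent variable
    hits += y_star > 0                 # observed outcome is 1 when y* > 0

empirical = hits / n
theoretical = NormalDist().cdf(beta0 + beta1 * x)  # Phi(beta0 + beta1 * x)
print(f"simulated P(y=1): {empirical:.4f}")
print(f"normal CDF value: {theoretical:.4f}")
```

The simulated share of positive outcomes should agree closely with the standard normal CDF evaluated at the linear predictor, which is where the probit link comes from.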

**Probit Link Function**

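The probability of $y = 1$ is modeled using the standard normal CDF:

$$P(y = 1 \mid x) = \Phi(x^\top \beta)$$

where $x$ is the feature vector, $\beta$ is the coefficient vector, and $\Phi$ is the cumulative distribution function of the standard normal distribution.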
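As a small illustration, the standard normal CDF and its inverse (the probit link itself) are available in Python's standard library through `statistics.NormalDist`. The helper names `probit_probability` and `probit_link` below are illustrative, not part of any library.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def probit_probability(linear_predictor):
    """Map a linear predictor to P(y = 1) using the standard normal CDF."""
    return std_normal.cdf(linear_predictor)

def probit_link(p):
    """Map a probability back to the linear-predictor scale (inverse CDF)."""
    return std_normal.inv_cdf(p)

for z in (-2.0, 0.0, 1.0):
    print(z, round(probit_probability(z), 4))
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, and applying `probit_link` to a probability recovers the linear predictor.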

**Comparison with Logistic Regression**

| Feature        | Logistic Regression              | Probit Regression                          |
| -------------- | -------------------------------- | ------------------------------------------ |
| Link function  | Logit (inverse: sigmoid)         | Probit (inverse: standard normal CDF)      |
| Tail behavior  | Heavier tails                    | Lighter (Gaussian) tails                   |
| Interpretation | Coefficients change the log-odds | Coefficients shift the latent variable     |
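The two curves can be compared numerically. The sketch below rescales the logistic curve by 1.702, a commonly used constant for matching it to the normal CDF (an assumption of this example, not something fitted here), and then compares both the overall gap and one tail value.

```python
import math
from statistics import NormalDist

def sigmoid(z):
    """Logistic function, the inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))

phi = NormalDist().cdf  # standard normal CDF, the inverse of the probit link
scale = 1.702           # assumed rescaling constant for the logistic curve

# Largest gap between the two curves on a grid from -6 to 6
grid = [i / 100 for i in range(-600, 601)]
max_gap = max(abs(phi(z) - sigmoid(scale * z)) for z in grid)

# Tail comparison far from zero
tail_normal = phi(-4.0)
tail_logistic = sigmoid(scale * -4.0)

print(f"max gap: {max_gap:.4f}")
print(f"normal tail at -4:   {tail_normal:.2e}")
print(f"logistic tail at -4: {tail_logistic:.2e}")
```

After rescaling, the curves stay within about one percentage point of each other everywhere, while the logistic curve assigns noticeably more probability far out in the tail.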


**Loss Function**

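Probit regression is fitted by maximum likelihood estimation (MLE). For $n$ observations $(x_i, y_i)$, the log-likelihood is:

$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log \Phi(x_i^\top \beta) + (1 - y_i) \log\left(1 - \Phi(x_i^\top \beta)\right) \right]$$

The loss minimized in practice is the negative log-likelihood, $-\ell(\beta)$.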
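To make the estimation step concrete, here is a minimal sketch that minimizes the probit negative log-likelihood with `scipy.optimize.minimize`. The data are simulated from hypothetical coefficients `true_beta`, and `norm.logcdf(-z)` is used for $\log(1 - \Phi(z))$ because $1 - \Phi(z) = \Phi(-z)$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulated data: an intercept column plus one standard normal feature
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.3, 1.0])  # hypothetical coefficients
y = (X @ true_beta + rng.normal(size=n) > 0).astype(float)

def neg_log_likelihood(beta):
    """Negative probit log-likelihood for coefficients beta."""
    z = X @ beta
    # log(1 - Phi(z)) is computed as log Phi(-z) for numerical stability
    return -np.sum(y * norm.logcdf(z) + (1 - y) * norm.logcdf(-z))

result = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
beta_hat = result.x
print("estimated coefficients:", beta_hat)
```

With a couple of thousand samples, the estimates should land near the coefficients used to simulate the data. A dedicated implementation would normally also report standard errors.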

**Methodology**

Steps followed in the project:

Generate a synthetic binary classification dataset.

Split into training and testing sets.

Standardize features for numerical stability.

Train Probit Regression model using Python (via statsmodels).

Predict class probabilities and labels.

Evaluate performance using accuracy, confusion matrix, and classification report.

# **Python Implementation**

In [49]:
# 1. Import Libraries
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [50]:
# 2. Generate Synthetic Multi-class Dataset
X, y = make_classification(
    n_samples=500,        # Number of samples
    n_features=10,        # Total features
    n_informative=7,      # Informative features
    n_redundant=2,        # Redundant features
    n_classes=3,          # Number of classes
    n_clusters_per_class=1,
    random_state=42
)

In [51]:
# 3. Split Dataset into Train and Test Sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [52]:
# 4. Feature Scaling (Important for convergence)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [54]:
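# 5. Train Multinomial Logistic Regression Model
# Note: `multi_class` is deprecated since scikit-learn 1.5; with the lbfgs
# solver the multinomial (softmax) loss is already the default for 3+ classes.
model = LogisticRegression(
    multi_class='multinomial',  # Use softmax for multi-class
    solver='lbfgs',             # Solver that supports multinomial
    max_iter=500                # Increase iterations for convergence
)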

model.fit(X_train_scaled, y_train)



In [55]:
# 6. Predictions
y_pred = model.predict(X_test_scaled)        # Predicted class labels
y_prob = model.predict_proba(X_test_scaled)  # Predicted probabilities

In [56]:
# Display first 5 predicted probabilities
print("Predicted probabilities for first 5 samples:\n", y_prob[:5])

Predicted probabilities for first 5 samples:
 [[9.78415621e-04 9.99021534e-01 5.06434020e-08]
 [4.46246169e-02 5.08838572e-05 9.55324499e-01]
 [7.05323982e-01 2.95608531e-02 2.65115165e-01]
 [9.89005095e-01 2.29656113e-03 8.69834391e-03]
 [6.21775299e-03 5.90617278e-05 9.93723185e-01]]


In [57]:
# 7. Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("\nModel Accuracy:", accuracy)
print("\nConfusion Matrix:\n", cm)
print("\nClassification Report:\n", report)


Model Accuracy: 0.92

Confusion Matrix:
 [[26  0  4]
 [ 0 37  1]
 [ 2  1 29]]

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.87      0.90        30
           1       0.97      0.97      0.97        38
           2       0.85      0.91      0.88        32

    accuracy                           0.92       100
   macro avg       0.92      0.92      0.92       100
weighted avg       0.92      0.92      0.92       100



**Results and Discussion**

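The code above fits a multinomial logistic regression (`LogisticRegression`) on a three-class synthetic dataset, so the reported metrics describe that model rather than a probit fit.

The model outputs a probability for each of the three classes, and `predict` assigns the class with the highest probability. A fixed 0.5 decision threshold applies only in the binary case, which is the setting probit regression addresses.

On the 100 held-out samples, accuracy is 0.92. Class 1 is classified best (F1 of 0.97), and most errors are confusions between classes 0 and 2.

In a probit model, each coefficient gives the change in the latent variable $y^*$, measured in standard deviations of the error term, per unit change in the corresponding feature.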
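Because a probit coefficient acts on the latent scale, it is not itself a change in probability. The marginal effect of feature $j$ on $P(y = 1)$ is $\phi(x^\top \beta)\,\beta_j$, where $\phi$ is the standard normal density. The sketch below evaluates this for hypothetical coefficients `beta` at one hypothetical point `x`.

```python
from statistics import NormalDist

std_normal = NormalDist()

beta = [-0.5, 1.2, -0.7]  # hypothetical: intercept and two feature coefficients
x = [1.0, 0.8, 0.3]       # leading 1.0 multiplies the intercept

# Linear predictor x^T beta
z = sum(b * v for b, v in zip(beta, x))

# Marginal effect of each feature: phi(z) * beta_j
density = std_normal.pdf(z)
marginal_effects = [density * b for b in beta[1:]]

print("P(y=1):", round(std_normal.cdf(z), 4))
print("marginal effects:", [round(m, 4) for m in marginal_effects])
```

The effect depends on where it is evaluated: it is largest when the linear predictor is near zero and shrinks toward zero in the tails.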

 **Observations:**

Standardization improves model convergence.

Probit and logistic models typically produce very similar fitted probabilities, although this notebook does not compare them directly.

Probabilistic predictions allow flexible thresholding in risk-sensitive applications.
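Flexible thresholding can be illustrated in a few lines. The probabilities below are made-up values standing in for a binary model's predicted $P(y = 1)$, and `classify` is a hypothetical helper.

```python
# Hypothetical predicted probabilities of the positive class
probabilities = [0.05, 0.30, 0.45, 0.55, 0.80, 0.95]

def classify(probs, threshold=0.5):
    """Assign label 1 when the predicted probability reaches the threshold."""
    return [int(p >= threshold) for p in probs]

default_labels = classify(probabilities)        # standard 0.5 cutoff
cautious_labels = classify(probabilities, 0.3)  # lower cutoff flags more positives

print("threshold 0.5:", default_labels)
print("threshold 0.3:", cautious_labels)
```

Lowering the threshold trades more false positives for fewer missed positives, which can be preferable when a missed positive is the costlier error.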

**Advantages**

Probabilistic predictions with normal-error assumption.

Useful in economics and social sciences for latent variable modeling.

Interpretable coefficients via latent variable model.

**Limitations**

Assumes normally distributed errors (may not hold in practice).

Fitting can be slower than logistic regression because the normal CDF has no closed form and must be evaluated numerically.

Less commonly used in some machine learning pipelines.

**Conclusion**

Probit Regression is a well-established technique for binary classification, particularly when the latent variable assumption is reasonable. The implementation shown in this project uses scikit-learn's multinomial logistic regression on a synthetic three-class dataset and reaches 0.92 test accuracy; a probit fit would replace the logistic link with the standard normal CDF on a binary target. Probit models provide probabilistic outputs and coefficients that are interpretable on the latent scale, making them a useful alternative to logistic regression in certain applications, especially in economics and social sciences.