Purpose: compare predictors of the Probit and Logit models from the Python statsmodels.formula.api package
- Full list of models available: https://www.statsmodels.org/stable/api.html#statsmodels-formula-api

Dataset - Spambase: compare predictors of Logit vs. Probit
    #> Binary target: Spam vs. not spam
    #> Why useful: Text‑derived numeric predictors; good for seeing how logit vs. probit behave with many predictors & correlated features.
    #> Expectation: the fitted probability curves for the two models will almost overlap. The main difference will be the scale of the coefficients.

*Logit Model - Summarised:*

>Logit assumes a logistic distribution (fatter tails; on its native scale, a flatter peak than the standard normal):
- Mean = 0 and Variance = π²/3 ≈ 3.29
- Because of the heavier tails, the logistic CDF approaches 0 and 1 more slowly than the normal CDF, so logit predicted probabilities are slightly less extreme in the far tails
> Compared to Probit:
- Logit is simpler and easier to interpret (odds ratios)
- Industry standard
- Logit coefficients tend to be about 1.6 to 1.8 times larger than probit coefficients: π/√3 ≈ 1.81 is the ratio of the two error standard deviations, and 1.6 is the usual empirical rule of thumb
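
The variance and scaling figures above can be checked numerically. This is a minimal sketch using only the standard library; the helper names `logistic_cdf` and `normal_cdf` are illustrative and do not come from statsmodels, and the factor 1.6 is the rule of thumb quoted above rather than something derived here.

```python
import math


def logistic_cdf(x: float) -> float:
    """CDF of the standard logistic distribution (variance pi^2 / 3)."""
    return 1.0 / (1.0 + math.exp(-x))


def normal_cdf(z: float) -> float:
    """CDF of the standard normal distribution (variance 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


logistic_variance = math.pi ** 2 / 3   # about 3.29
sd_ratio = math.pi / math.sqrt(3)      # about 1.81
rule_of_thumb = 1.6                    # empirical factor quoted in the text

# Largest gap between the rescaled logistic curve and the normal curve
# on a grid of z values in [-4, 4].
grid = [i / 100 for i in range(-400, 401)]
max_gap = max(abs(logistic_cdf(rule_of_thumb * z) - normal_cdf(z)) for z in grid)

# Upper-tail probabilities at z = 3: the thinner normal tail is smaller.
tail_probit = 1.0 - normal_cdf(3.0)
tail_logit = 1.0 - logistic_cdf(rule_of_thumb * 3.0)

print(f"variance of standard logistic: {logistic_variance:.2f}")
print(f"ratio of standard deviations:  {sd_ratio:.2f}")
print(f"max gap between the curves:    {max_gap:.3f}")
print(f"upper tail at z = 3 (probit, logit): {tail_probit:.5f}, {tail_logit:.5f}")
```

The two curves stay within about two percentage points of each other across the grid, which is why the fitted probabilities from the two models are expected to nearly overlap once the coefficients are rescaled.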

*Probit Model - Summarised:*

>Probit assumes a standard normal distribution (thinner tails, sharper peak): 
- Mean = 0 and Variance = 1
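- After rescaling the coefficients, probit predicted probabilities are almost identical to logit ones; they approach 0 and 1 slightly faster in the far tails because of the thinner tails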

>For many behavioural, psychometric, and economic processes, it is often more natural to assume that the underlying propensity follows a normal distribution, which makes a probit specification conceptually cleaner.
- Thinner tails push extreme values of the linear predictor closer to 0 or 1, so probit is somewhat more sensitive to outlying observations than logit
- Causal inference: probit is a common choice (alongside logit) for propensity scores, because a normal latent-variable assumption is often considered plausible when estimating treatment assignment probabilities
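
Both models can be written with the same latent-variable setup, which makes the role of the assumed distribution explicit. The notation below ($y_i^{*}$ for the unobserved propensity, $x_i$ for the predictors, $\beta$ for the coefficients, $\varepsilon_i$ for the error, $F$ for its CDF) is introduced here for illustration and does not appear in the notebook.

```latex
\begin{aligned}
y_i^{*} &= x_i^{\top}\beta + \varepsilon_i, \qquad y_i = \mathbf{1}\{\, y_i^{*} > 0 \,\} \\
P(y_i = 1 \mid x_i) &= P(\varepsilon_i > -x_i^{\top}\beta) = F(x_i^{\top}\beta) \\
\text{probit:} \quad & \varepsilon_i \sim N(0, 1), \qquad F(z) = \Phi(z) \\
\text{logit:} \quad & \varepsilon_i \sim \mathrm{Logistic}(0, 1), \qquad F(z) = \Lambda(z) = \frac{1}{1 + e^{-z}}
\end{aligned}
```

The second line uses the symmetry of both error distributions around zero. Only the choice of $F$ differs between the two models, so the coefficients differ mainly by the scale of the assumed error term.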

In [1]:
# Set up: import packages
import pandas as pd
import statsmodels.formula.api as sm  # formula interface: sm.logit, sm.probit

# Load data: Spambase
url = "https://raw.githubusercontent.com/readytensor/rt-datasets-binary-classification/main/datasets/processed/spambase/spambase.csv"
    #> Source: GitHub repository of datasets (readytensor/rt-datasets-binary-classification)

df_Spam = pd.read_csv(url)


In [5]:
# Descriptive stats
# Step 1: File shape, column names, dtypes, suspected mis‑types (IDs as numeric, categories as free text), duplicated rows/IDs.
    #> Structure
df_Spam.shape, df_Spam.columns.tolist()

    #> Types and basic nulls
df_Spam.info()

    #> Duplicate rows / duplicate keys (adjust key list)
dupe_rows = df_Spam.duplicated().sum()
dupe_keys = df_Spam.duplicated(subset=["id"]).sum() if "id" in df_Spam.columns else None
dupe_rows, dupe_keys

# Step 2: EDA specific to Probit / Logit requirements (binary target, class balance, feature distributions and scales, multicollinearity, outliers/influential points)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 59 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          4601 non-null   int64  
 1   word_freq_make              4601 non-null   float64
 2   word_freq_address           4601 non-null   float64
 3   word_freq_all               4601 non-null   float64
 4   word_freq_3d                4601 non-null   float64
 5   word_freq_our               4601 non-null   float64
 6   word_freq_over              4601 non-null   float64
 7   word_freq_remove            4601 non-null   float64
 8   word_freq_internet          4601 non-null   float64
 9   word_freq_order             4601 non-null   float64
 10  word_freq_mail              4601 non-null   float64
 11  word_freq_receive           4601 non-null   float64
 12  word_freq_will              4601 non-null   float64
 13  word_freq_people            4601 

(np.int64(0), np.int64(0))
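
Step 2 is listed in the cell above but not shown. A minimal sketch of those checks follows, assuming all predictors are numeric. The helper `binary_glm_eda` and its arguments are hypothetical (they are not part of pandas or statsmodels), and the name of the target column is not visible in the truncated output above, so it is left as a parameter.

```python
import numpy as np
import pandas as pd


def binary_glm_eda(df: pd.DataFrame, target: str, drop: tuple = ("id",)) -> dict:
    """Quick pre-fit checks for logit/probit (hypothetical helper)."""
    # Predictors: numeric columns other than the target and identifier columns.
    to_drop = [target] + [c for c in drop if c in df.columns]
    features = df.drop(columns=to_drop).select_dtypes("number")

    # 1. Binary target and class balance (share of each class).
    levels = sorted(df[target].dropna().unique().tolist())
    balance = df[target].value_counts(normalize=True).to_dict()

    # 2. Feature scales: very different spreads make coefficients hard to compare.
    spread = features.std().sort_values(ascending=False)

    # 3. Multicollinearity: largest absolute pairwise correlations
    #    (upper triangle only, so each pair is counted once).
    corr = features.corr().abs()
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    top_pairs = corr.where(mask).stack().dropna().sort_values(ascending=False).head(5)

    return {
        "is_binary": len(levels) == 2,
        "class_balance": balance,
        "spread": spread,
        "top_correlations": top_pairs,
    }
```

On the notebook's data this would be called as `binary_glm_eda(df_Spam, target=target_col)`, where `target_col` is whichever column in `df_Spam.columns` holds the spam label. Outlier and influence diagnostics are not covered by this sketch.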