# Logistic Regression

This is an example of logistic regression prepared for **Econ5150**. 

We demonstrate the method on a real-world bank marketing dataset. Logistic regression models the probability of a binary outcome (here, whether a customer responds to a marketing campaign) as a function of observed features.


## Setup

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from ucimlrepo import fetch_ucirepo


- $y$: The target variable. In the Bank Marketing dataset, it records **whether a customer subscribes to a term deposit**. The raw values are 'yes' and 'no', which we encode as 1 and 0.

- $X$: The feature matrix, containing the regressors. It includes both numerical and categorical variables describing each customer, such as age, job, and marital status.
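
For reference, the relationship between $y$ and $X$ can be written out as the standard logit specification. This is a sketch of the textbook model; the coefficient vector $\beta$, the observation index $i$, and the shorthand $p_i$ are notation introduced here for illustration.

$$
p_i = P(y_i = 1 \mid x_i) = \frac{1}{1 + \exp(-x_i^\top \beta)}, \qquad
\ell(\beta) = \sum_{i=1}^{n} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big]
$$

Here $x_i$ is the $i$-th row of $X$ (including a constant), and maximum likelihood estimation chooses $\beta$ to maximize the log-likelihood $\ell(\beta)$.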


## Data Preparation

In [2]:
# Fetch the Bank Marketing dataset (id=222) from the UCI Machine Learning Repository
bank_marketing = fetch_ucirepo(id=222)

# Extract features (X) and target (y)
X = bank_marketing.data.features
y = bank_marketing.data.targets

# Display the first five rows of X
print("First five rows of X:")
print(X.head())

First five rows of X:
   age           job  marital  education default  balance housing loan  \
0   58    management  married   tertiary      no     2143     yes   no   
1   44    technician   single  secondary      no       29     yes   no   
2   33  entrepreneur  married  secondary      no        2     yes  yes   
3   47   blue-collar  married        NaN      no     1506     yes   no   
4   33           NaN   single        NaN      no        1      no   no   

  contact  day_of_week month  duration  campaign  pdays  previous poutcome  
0     NaN            5   may       261         1     -1         0      NaN  
1     NaN            5   may       151         1     -1         0      NaN  
2     NaN            5   may        76         1     -1         0      NaN  
3     NaN            5   may        92         1     -1         0      NaN  
4     NaN            5   may       198         1     -1         0      NaN  


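Next, we prepare $y$ and $X$: the target is converted to a 0/1 indicator, and the categorical features are expanded into dummy variables.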

In [3]:
# Ensure y is a Series by selecting the target column
y = y['y'] if 'y' in y.columns else y.squeeze()

# Convert the target variable to binary (1 for 'yes', 0 for 'no')
y = (y == 'yes').astype(int)

# Convert categorical variables in X to dummy variables
X = pd.get_dummies(X, drop_first=True)

# Display the first five rows of X
print("First five rows of X:")
print(X.head())

# Display the names of the regressors
print("\nNames of the regressors:")
print(X.columns.tolist())

First five rows of X:
   age  balance  day_of_week  duration  campaign  pdays  previous  \
0   58     2143            5       261         1     -1         0   
1   44       29            5       151         1     -1         0   
2   33        2            5        76         1     -1         0   
3   47     1506            5        92         1     -1         0   
4   33        1            5       198         1     -1         0   

   job_blue-collar  job_entrepreneur  job_housemaid  ...  month_jan  \
0            False             False          False  ...      False   
1            False             False          False  ...      False   
2            False              True          False  ...      False   
3             True             False          False  ...      False   
4            False             False          False  ...      False   

   month_jul  month_jun  month_mar  month_may  month_nov  month_oct  \
0      False      False      False       True      False      Fal
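
The dummy encoding above can be illustrated on a small toy frame. The sketch below uses made-up values (not rows from the dataset) to show two things that are easy to miss: `drop_first=True` removes the first category as the baseline, and missing values get no dummy column by default, so a row with a missing category looks the same as a baseline row. The `dummy_na=True` variant is shown as one possible alternative, not as what the notebook does.

```python
import pandas as pd

# Toy data (hypothetical values) with one missing job entry
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "technician", None],
})

# Default behaviour: the first category ("management") is dropped as the
# baseline, and the missing value gets no column of its own
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies)

# Alternative: dummy_na=True adds a separate indicator column for missing values
dummies_na = pd.get_dummies(toy, drop_first=True, dummy_na=True)
print(dummies_na)
```

In the first result, the third row (missing job) has the same dummy values as the first row (baseline job), which matters when interpreting coefficients on data with many `NaN` entries.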

## Statsmodels

In [4]:
# Detailed Statistical Summary Using Statsmodels
import statsmodels.api as sm

# Add a constant to X for the intercept term
X_with_constant = sm.add_constant(X)  # Adding intercept

# Ensure all data in X_with_constant is numeric
# Convert the DataFrame to numeric and explicitly cast as float
X_with_constant = X_with_constant.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)

# Ensure y is numeric by converting it to a Pandas Series and casting as float
y = pd.Series(y).astype(float)  # Ensure y is numeric and compatible

# Fit logistic regression using statsmodels
model_sm = sm.Logit(y, X_with_constant)
result = model_sm.fit()

# Display the summary of coefficients, standard errors, z-statistics, and p-values
print(result.summary())


Optimization terminated successfully.
         Current function value: 0.244641
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                45211
Model:                          Logit   Df Residuals:                    45172
Method:                           MLE   Df Model:                           38
Date:                Wed, 19 Feb 2025   Pseudo R-squ.:                  0.3221
Time:                        21:37:27   Log-Likelihood:                -11060.
converged:                       True   LL-Null:                       -16315.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                  -2.4638      0.156    -15.798      0.000      -2.769      -2.158
ag
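
Several headline numbers in the summary can be reproduced by hand, which helps when reading the table. The sketch below copies the rounded values printed above and assumes that "Pseudo R-squ." is McFadden's pseudo R-squared and that "Current function value" is the average negative log-likelihood per observation. The variable names are chosen here for illustration.

```python
import math

# Values copied from the statsmodels summary (rounded as printed)
log_lik = -11060.0       # Log-Likelihood
log_lik_null = -16315.0  # LL-Null
n_obs = 45211            # No. Observations
const = -2.4638          # coefficient on the constant

# McFadden's pseudo R-squared: 1 - LL / LL-Null
pseudo_r2 = 1.0 - log_lik / log_lik_null

# Average negative log-likelihood per observation
avg_neg_log_lik = -log_lik / n_obs

# Probability implied by the intercept when every regressor equals zero
baseline_prob = 1.0 / (1.0 + math.exp(-const))

print(f"Pseudo R-squared: {pseudo_r2:.4f}")
print(f"Average negative log-likelihood: {avg_neg_log_lik:.4f}")
print(f"Baseline probability: {baseline_prob:.4f}")
```

Because the inputs are rounded, the results agree with the printed summary only up to rounding error.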

## Sklearn

In [5]:
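# Standardize the features before fitting: without scaling, the solver can be
# slow to converge or may not converge within max_iter
scaler = StandardScaler()

# Fit the scaler on the feature matrix X and transform it
X_scaled = scaler.fit_transform(X)

# Fit logistic regression using sklearn on the scaled data
# Note: LogisticRegression applies L2 regularization by default (C=1.0),
# so its estimates can differ slightly from the unpenalized statsmodels fit
model_scaled = LogisticRegression(max_iter=3000)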
model_scaled.fit(X_scaled, y)

# Estimated coefficients
coefficients_scaled = model_scaled.coef_.flatten()
intercept_scaled = model_scaled.intercept_

# Scale back the coefficients to the original units
coefficients_original = coefficients_scaled / scaler.scale_
intercept_original = intercept_scaled - np.sum(coefficients_scaled * scaler.mean_ / scaler.scale_)

# Display the coefficients in original units
print("Intercept (original):", intercept_original)
print("Estimated Coefficients (original):", coefficients_original)
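# Caution: this sums log P(y = 1 | x) over ALL observations, regardless of the
# observed label, so it is not the model's log-likelihood (compare the value
# reported by statsmodels). The next cell computes the correct quantity.
log_likelihood_sklearn = model_scaled.predict_log_proba(X_scaled)[:, 1].sum()

# Display the value
print("Log-Likelihood (sklearn):", log_likelihood_sklearn)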

Intercept (original): [-2.47195882]
Estimated Coefficients (original): [-9.73108981e-04  1.40964078e-05  3.36730733e-03  4.10148618e-03
 -9.17527487e-02  9.95652705e-04  1.62519516e-02 -3.55723372e-01
 -3.84209255e-01 -5.31792127e-01 -1.56355267e-01  2.75301765e-01
 -3.20069055e-01 -2.35377404e-01  4.79490997e-01 -1.70994414e-01
 -1.75416021e-01 -1.77461501e-01  1.24890271e-01  1.43797527e-01
  3.81887722e-01 -7.71475858e-02 -7.53775099e-01 -4.53171445e-01
 -1.78620949e-02 -6.95758844e-01  5.79087415e-01 -2.37103152e-01
 -1.21777582e+00 -7.86229264e-01 -6.24502355e-01  1.52183861e+00
 -1.00546633e+00 -8.50307412e-01  7.64567430e-01  6.79691320e-01
  3.07442584e-01  2.43273243e+00]
Log-Likelihood (sklearn): -131807.51007666523
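
The back-transformation of the coefficients used in the cell above follows from expanding $w^\top (x - \mu)/\sigma + b$ into $(w/\sigma)^\top x + \big(b - \sum_j w_j \mu_j / \sigma_j\big)$. The sketch below checks this identity numerically with NumPy only. The data and coefficient values are synthetic and purely illustrative, and the standardization is done by hand with the population standard deviation, which is what `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features on very different scales (hypothetical data)
X_toy = rng.normal(loc=[40.0, 1500.0], scale=[10.0, 3000.0], size=(5, 2))
mean = X_toy.mean(axis=0)
scale = X_toy.std(axis=0)  # population std (ddof=0)
X_toy_scaled = (X_toy - mean) / scale

# Arbitrary coefficients on the standardized scale (hypothetical values)
coef_scaled = np.array([0.3, -1.2])
intercept_scaled = -2.0

# Same back-transformation as in the sklearn cell
coef_original = coef_scaled / scale
intercept_original = intercept_scaled - np.sum(coef_scaled * mean / scale)

# Both parameterizations should give the same linear index
index_scaled = X_toy_scaled @ coef_scaled + intercept_scaled
index_original = X_toy @ coef_original + intercept_original
print(np.allclose(index_scaled, index_original))
```

Since the linear index is identical, the predicted probabilities are identical too; only the units of the coefficients change.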


In [6]:
from sklearn.metrics import log_loss

# Predict probabilities
y_prob = model_scaled.predict_proba(X_scaled)

# log_loss returns the average negative log-likelihood by default,
# so multiplying by -len(y) recovers the total log-likelihood
log_likelihood = -log_loss(y, y_prob) * len(y)

print("Fitted Log-Likelihood:", log_likelihood)

Fitted Log-Likelihood: -11060.485513395308
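
The gap between the two sklearn numbers (about -131807 in cell 5 versus about -11060 here) comes from how the per-observation terms are chosen. A minimal NumPy sketch with made-up labels and probabilities shows the difference: the Bernoulli log-likelihood uses $\log p$ when $y = 1$ and $\log(1 - p)$ when $y = 0$, whereas summing $\log p$ over every row ignores the labels. All values and variable names below are hypothetical.

```python
import numpy as np

# Hypothetical labels and predicted probabilities P(y = 1 | x)
y_toy = np.array([1, 0, 0, 1])
p1 = np.array([0.8, 0.1, 0.3, 0.6])

# Bernoulli log-likelihood: log p for y = 1, log(1 - p) for y = 0
log_lik = np.sum(y_toy * np.log(p1) + (1 - y_toy) * np.log(1 - p1))

# Summing log P(y = 1 | x) over all rows ignores the observed labels
sum_log_p1 = np.sum(np.log(p1))

# Average negative log-likelihood, the quantity a log-loss reports per observation
avg_neg_log_lik = -log_lik / len(y_toy)

print("Log-likelihood:", log_lik)
print("Sum of log P(y=1):", sum_log_p1)
print("Average negative log-likelihood:", avg_neg_log_lik)
```

When most labels are 0 and the model correctly assigns them a small $P(y = 1)$, the label-blind sum becomes very negative, which is the pattern seen in cell 5.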


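See the scikit-learn documentation for `log_loss`: the [API reference](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html) and the [User Guide section](https://scikit-learn.org/stable/modules/model_evaluation.html#log-loss). By default, `log_loss` returns the average negative log-likelihood per observation, so multiplying by `-len(y)` recovers the total log-likelihood, which agrees with the statsmodels log-likelihood of about -11060.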