
🔹 Normal K-Fold

* Suppose we have 100 samples:

* * 60 belong to class 0

* * 40 belong to class 1

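* With normal K-Fold, the data is split into folds without looking at the class labels.

* * This means some folds can hold a very different class mix from the full dataset. For example, with 5 folds of 20 samples each, one fold might contain 18 samples of class 0 and only 2 samples of class 1.

* * As a result, the class balance of the training and test sets varies from fold to fold, which makes the scores noisier and may cause the model to learn the minority class poorly.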
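
The effect can be reproduced on the 60/40 toy example. The sketch below is a minimal illustration, assuming scikit-learn's `KFold`; the helper name `class1_per_fold` and the placeholder feature matrix are made up for this example. Note that `KFold` does not shuffle by default, so sorted labels give the most extreme case.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy labels from the example above: 60 samples of class 0, then 40 of class 1.
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here


def class1_per_fold(splitter):
    """Return the number of class-1 samples in each test fold (hypothetical helper)."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


# Default KFold does not shuffle: folds are consecutive blocks of 20 rows.
unshuffled = class1_per_fold(KFold(n_splits=5))
print(unshuffled)  # [0, 0, 0, 20, 20]

# Shuffling mixes the classes, but the count per fold can still drift
# away from the ideal 8 class-1 samples (40% of 20).
shuffled = class1_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
print(shuffled)
```

With unshuffled, sorted labels, three test folds contain no class-1 samples at all, which is the worst case of the imbalance described above.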

🔹 Stratified K-Fold

* Stratified K-Fold splits the data while preserving the class distribution.

* In each fold, there will be approximately 60% class 0 and 40% class 1.

* This ensures that every training/test split has a distribution similar to the original dataset.
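
As a quick check of this behaviour, the following sketch counts the classes in every test fold of the same 60/40 toy labels. It assumes scikit-learn's `StratifiedKFold` with 5 splits; the variable names and the placeholder features are illustrative only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same toy labels: 60 samples of class 0 and 40 of class 1.
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# For each test fold, count how many samples belong to class 0 and class 1.
fold_counts = [
    np.bincount(y[test_idx], minlength=2).tolist()
    for _, test_idx in skf.split(X, y)
]
print(fold_counts)  # [[12, 8], [12, 8], [12, 8], [12, 8], [12, 8]]
```

Every fold holds 12 samples of class 0 and 8 of class 1, the same 60/40 ratio as the full dataset.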

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [4]:
df = pd.read_csv("6-bank_customers.csv")
df.columns

Index(['age', 'job_satisfaction', 'balance', 'duration_last_call',
       'num_contacts_last_month', 'has_housing_loan', 'has_personal_loan',
       'communication_type', 'days_since_last_contact',
       'campaign_response_score', 'subscribed'],
      dtype='object')

In [5]:
# Scenario: Predicting whether a customer will subscribe to a term deposit (bank marketing use case)
df.head()

,age,job_satisfaction,balance,duration_last_call,num_contacts_last_month,has_housing_loan,has_personal_loan,communication_type,days_since_last_contact,campaign_response_score,subscribed
0,-0.377957,1.043895,1.043494,-0.101838,-1.617442,0.402713,0.913601,-0.067192,0.175471,-1.049646,0
1,-0.325259,1.276263,-0.686123,-2.463205,-0.489426,-0.240715,-1.469496,1.006633,-0.833692,0.957744,0
2,0.739019,-0.600903,-0.177294,1.335714,-0.817332,-0.790047,1.457365,-0.218981,0.878643,-1.25774,0
3,0.474312,-1.103002,1.189936,-0.800186,0.912377,-0.406451,-1.13095,1.985111,1.379029,1.041768,1
4,0.927365,1.114796,0.080284,1.261064,0.761179,0.921563,0.440832,0.184645,-1.567739,-0.142107,1


In [6]:
df.describe()

,age,job_satisfaction,balance,duration_last_call,num_contacts_last_month,has_housing_loan,has_personal_loan,communication_type,days_since_last_contact,campaign_response_score,subscribed
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,-0.015084,0.031559,0.045069,-0.020771,0.016769,-0.042135,-0.025139,0.058969,-0.015637,0.022327,0.503
std,1.021848,1.004281,1.00691,1.392738,1.268761,1.029279,1.26749,0.992629,1.000664,1.135131,0.500241
min,-3.718638,-3.116027,-3.534257,-4.999018,-2.587178,-3.250031,-4.932878,-3.187779,-2.819987,-2.071129,0.0
25%,-0.685005,-0.657545,-0.647051,-1.057938,-0.98132,-0.72297,-0.855392,-0.598669,-0.69333,-1.050201,0.0
50%,-0.030546,0.045869,0.040586,0.204506,-0.317137,-0.022343,0.148199,0.052208,0.018807,-0.03753,1.0
75%,0.662124,0.701601,0.731615,1.061181,1.017388,0.596557,1.020255,0.722446,0.641338,0.760782,1.0
max,2.765266,2.658705,3.357941,3.47721,3.954873,3.206344,2.34338,3.600187,3.477044,4.124285,1.0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      1000 non-null   float64
 1   job_satisfaction         1000 non-null   float64
 2   balance                  1000 non-null   float64
 3   duration_last_call       1000 non-null   float64
 4   num_contacts_last_month  1000 non-null   float64
 5   has_housing_loan         1000 non-null   float64
 6   has_personal_loan        1000 non-null   float64
 7   communication_type       1000 non-null   float64
 8   days_since_last_contact  1000 non-null   float64
 9   campaign_response_score  1000 non-null   float64
 10  subscribed               1000 non-null   int64  
dtypes: float64(10), int64(1)
memory usage: 86.1 KB
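
Before choosing a splitting strategy, it can help to look at the class balance of the target. The sketch below is only an illustration: it rebuilds a stand-in `subscribed` column from the summary statistics above (1000 rows, mean 0.503) instead of reading the CSV, so the series itself is an assumption.

```python
import pandas as pd

# Stand-in target column: 497 zeros and 503 ones, consistent with the
# summary statistics above (count = 1000, mean of 'subscribed' = 0.503).
# On the real data this would simply be df['subscribed'].
subscribed = pd.Series([0] * 497 + [1] * 503, name="subscribed")

# Share of each class, sorted by class label.
class_share = subscribed.value_counts(normalize=True).sort_index()
print(class_share)
```

A split close to 50/50 like this one means the dataset is nearly balanced, so stratification matters less here than in the 60/40 example, although it still keeps the folds consistent.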


In [8]:
# Features: every column except the target; target: the 'subscribed' label
X=df.drop('subscribed',axis=1)
y=df['subscribed']

In [9]:
from sklearn.model_selection import train_test_split
# Hold out 30% of the rows as a test set; random_state makes the split reproducible
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=15)
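
The same stratification idea applies to a single train/test split. The sketch below is a minimal example under stated assumptions: it uses synthetic, deliberately imbalanced data from `make_classification` as a stand-in for the bank dataset, and passes `stratify=` to `train_test_split` so both parts keep the original class ratio.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the bank dataset),
# deliberately imbalanced at roughly 80% class 0 / 20% class 1.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=15
)

# stratify=y_demo keeps the class ratio the same in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=15, stratify=y_demo
)

full_rate = y_demo.mean()   # share of class 1 in the full data
train_rate = y_tr.mean()    # share of class 1 in the training part
test_rate = y_te.mean()     # share of class 1 in the test part
print(f"class-1 share: full={full_rate:.3f}, train={train_rate:.3f}, test={test_rate:.3f}")
```

Without `stratify`, the split is purely random and the class shares of the two parts can drift apart, especially on small or imbalanced datasets.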

In [10]:
from sklearn.linear_model import LogisticRegression
# Baseline: logistic regression with default hyperparameters
logistic=LogisticRegression()
logistic.fit(X_train,y_train)
y_pred=logistic.predict(X_test)

In [11]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
# scikit-learn metrics expect the true labels first, then the predictions
score=accuracy_score(y_test,y_pred)
print("score: " , score)
print(classification_report(y_test,y_pred))
print("confusion matrix: \n" , confusion_matrix(y_test,y_pred))

score:  0.92
              precision    recall  f1-score   support

           0       0.91      0.92      0.92       145
           1       0.93      0.92      0.92       155

    accuracy                           0.92       300
   macro avg       0.92      0.92      0.92       300
weighted avg       0.92      0.92      0.92       300

confusion matrix: 
 [[134  11]
 [ 13 142]]
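
To make the report easier to read, here is a small worked example of how precision and recall come out of a confusion matrix. It is a sketch on ten hand-made labels (the arrays `y_true` and `y_hat` are invented for illustration). scikit-learn's metric functions take the true labels first and the predictions second; swapping them transposes the matrix and exchanges precision with recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made example: 4 true negatives-or-positives of class 0, 6 of class 1.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)  # [[3 1]
           #  [2 4]]

# Precision and recall for class 1, computed by hand from the matrix.
precision = tp / (tp + fp)  # 4 / 5
recall = tp / (tp + fn)     # 4 / 6

# Swapping the arguments transposes the matrix.
cm_swapped = confusion_matrix(y_hat, y_true)
```

The hand-computed values agree with `precision_score(y_true, y_hat)` and `recall_score(y_true, y_hat)`.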


In [12]:
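# Hyperparameter tuning
# Note: this grid crosses every penalty with every solver, but not all pairs are valid.
# 'newton-cg', 'lbfgs' and 'sag' support only 'l2'; 'liblinear' supports 'l1' and 'l2';
# 'elasticnet' requires solver='saga' together with an l1_ratio value.
# Invalid combinations fail during the search and are scored as NaN.
model=LogisticRegression()
penalty=['l1', 'l2', 'elasticnet']
c_values=[100,10,1.0,0.1,0.01]
solver=['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
params=dict(penalty=penalty,C=c_values,solver=solver)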

In [13]:
from sklearn.model_selection import StratifiedKFold
# Defaults: n_splits=5, shuffle=False
cv=StratifiedKFold()

In [15]:
## GridSearchCV
# Grid Search tries specific value ranges for these hyperparameters (a "grid").
# It trains the model on all defined combinations and evaluates performance (e.g., accuracy, RMSE, R²).
# It finds the hyperparameter combination that gives the best result.
# CV stands for "cross-validation" → meaning each hyperparameter combination will be tested using cross-validation.

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    estimator=model,       # The selected ML model (e.g., DecisionTree, SVM, LogisticRegression, etc.)
    param_grid=params,     # The parameter grid to search over (in dictionary format)
    scoring='accuracy',    # Evaluation metric (here, accuracy)
    cv=cv,                 # Cross-validation strategy (StratifiedKFold in this case)
    n_jobs=-1              # Use all CPU cores to run in parallel
)
grid
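
To show what the search does internally, the sketch below runs a tiny grid search by hand and compares it with `GridSearchCV`. It is an illustration under stated assumptions: synthetic data from `make_classification` stands in for the bank dataset, only `C` is tuned, and the names `c_grid`, `manual_scores` and `folds` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the bank dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Fixed random_state so both searches use exactly the same folds.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
c_grid = [0.01, 0.1, 1.0, 10]

# Manual grid search: mean cross-validated accuracy for each candidate C.
manual_scores = {
    c: cross_val_score(
        LogisticRegression(C=c, max_iter=1000),
        X_demo, y_demo, cv=folds, scoring="accuracy",
    ).mean()
    for c in c_grid
}
manual_best_c = max(manual_scores, key=manual_scores.get)

# The same search delegated to GridSearchCV.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), {"C": c_grid}, scoring="accuracy", cv=folds
)
search.fit(X_demo, y_demo)

print(manual_best_c, manual_scores[manual_best_c])
print(search.best_params_, search.best_score_)
```

Both approaches score 4 candidates on 5 folds each (20 fits); `GridSearchCV` additionally refits the best candidate on all the data it was given, which is why it can call `predict` afterwards.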

In [16]:
grid.fit(X_train,y_train)

200 fits failed out of a total of 375.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.12/dist-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sklearn/linear_model/_logistic.py", line 1193, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
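
The failures reported above come from penalty and solver pairs that `LogisticRegression` does not accept. One way to avoid them, sketched below, is to pass a list of dictionaries so that each penalty is only combined with solvers that support it. The `l1_ratio` value of 0.5 is an illustrative assumption, and `ParameterGrid` is used only to count the candidates.

```python
from sklearn.model_selection import ParameterGrid

c_values = [100, 10, 1.0, 0.1, 0.01]

# One dictionary per group of compatible settings.
valid_params = [
    # 'l2' is supported by all five solvers.
    {"penalty": ["l2"],
     "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
     "C": c_values},
    # 'l1' is supported by 'liblinear' and 'saga' only.
    {"penalty": ["l1"], "solver": ["liblinear", "saga"], "C": c_values},
    # 'elasticnet' needs 'saga' and an l1_ratio (0.5 is an assumed example value).
    {"penalty": ["elasticnet"], "solver": ["saga"], "l1_ratio": [0.5], "C": c_values},
]

n_candidates = len(ParameterGrid(valid_params))
print(n_candidates)  # 40
```

Passing `valid_params` as `param_grid` would give 40 candidates (25 + 10 + 5) instead of 75, and with 5 folds that is 200 fits with no invalid combinations among them.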

In [18]:
grid.best_params_

{'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}

In [19]:
grid.best_score_

np.float64(0.9228571428571429)

In [20]:
y_pred=grid.predict(X_test)
# scikit-learn metrics expect the true labels first, then the predictions
score=accuracy_score(y_test,y_pred)
print("score: " , score)
print(classification_report(y_test,y_pred))
print("confusion matrix: \n" , confusion_matrix(y_test,y_pred))

score:  0.9233333333333333
              precision    recall  f1-score   support

           0       0.90      0.94      0.92       145
           1       0.95      0.90      0.92       155

    accuracy                           0.92       300
   macro avg       0.92      0.92      0.92       300
weighted avg       0.92      0.92      0.92       300

confusion matrix: 
 [[137   8]
 [ 15 140]]


In [21]:
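# Random search: sample a fixed number of combinations instead of trying them all
from sklearn.model_selection import RandomizedSearchCV
model=LogisticRegression()
# n_iter defaults to 10, so 10 of the 75 combinations are sampled; cv=5 gives 50 fits in total.
# With an integer cv and a classifier, scikit-learn uses stratified folds.
randomcv=RandomizedSearchCV(estimator=model,param_distributions=params,cv=5,scoring='accuracy')
randomcv.fit(X_train,y_train)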

25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.12/dist-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sklearn/linear_model/_logistic.py", line 1193, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
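
Random search is most useful when a hyperparameter is sampled from a continuous range rather than a short list. The sketch below is a minimal example under stated assumptions: synthetic data stands in for the bank dataset, only `C` is tuned, and it is drawn from a log-uniform distribution between 0.01 and 100 using `scipy.stats.loguniform`. Setting `random_state` makes the sampled candidates reproducible.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the bank dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

random_search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-2, 1e2)},  # continuous range for C
    n_iter=8,            # number of sampled candidates
    cv=5,                # stratified 5-fold for classifiers
    scoring="accuracy",
    random_state=0,      # reproducible sampling
)
random_search.fit(X_demo, y_demo)

# The C values that were actually tried.
sampled_c = [float(c) for c in random_search.cv_results_["param_C"]]
print(sampled_c)
print(random_search.best_params_, random_search.best_score_)
```

Here the cost is fixed at 8 candidates times 5 folds (40 fits), regardless of how wide the search range is.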

In [22]:
randomcv.best_score_

np.float64(0.9214285714285714)

In [23]:
y_pred=randomcv.predict(X_test)
# scikit-learn metrics expect the true labels first, then the predictions
score=accuracy_score(y_test,y_pred)
print("score: " , score)
print(classification_report(y_test,y_pred))
print("confusion matrix: \n" , confusion_matrix(y_test,y_pred))

score:  0.9166666666666666
              precision    recall  f1-score   support

           0       0.91      0.92      0.91       145
           1       0.92      0.92      0.92       155

    accuracy                           0.92       300
   macro avg       0.92      0.92      0.92       300
weighted avg       0.92      0.92      0.92       300

confusion matrix: 
 [[133  12]
 [ 13 142]]
