
#Logistic regression on Credit Risk data
---
Let us understand the use of logistic regression on another problem by building a model that classifies the credit risk of a loan applicant.

The first step is to import the basic libraries, load the data, and understand it.

Download the Credit risk dataset here.

In [2]:
# Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Importing the dataset
credit_data = pd.read_csv("credit_risk.csv")


In [3]:
credit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   over_draft              1000 non-null   object
 1   credit_usage            1000 non-null   int64 
 2   credit_history          1000 non-null   object
 3   purpose                 1000 non-null   object
 4   current_balance         1000 non-null   int64 
 5   Average_Credit_Balance  1000 non-null   object
 6   employment              1000 non-null   object
 7   location                1000 non-null   int64 
 8   personal_status         1000 non-null   object
 9   other_parties           1000 non-null   object
 10  residence_since         1000 non-null   int64 
 11  property_magnitude      1000 non-null   object
 12  cc_age                  1000 non-null   int64 
 13  other_payment_plans     1000 non-null   object
 14  housing                 1000 non-null   object
 15  exist

In [4]:
# Understanding the values the 'class' column (our target column in this analysis) can take
credit_data['class'].unique()

array(['good', 'bad'], dtype=object)

Let us use the get_dummies() function of pandas to encode the categorical input columns.

In [5]:
# Selecting predictors as all columns except the 'class' column
X = credit_data.columns.drop("class")
# Setting the target as the 'class' column
y = credit_data['class']


In [6]:
# Encoding all the features/predictor variables using the get_dummies() method
credit_data_encoded = pd.get_dummies(credit_data[X])
# Checking the shape of the input data
credit_data_encoded.shape


(1000, 61)
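
The jump from 20 predictor columns to 61 comes from one-hot encoding: each categorical column is replaced by one 0/1 column per category, while numeric columns pass through unchanged. A minimal sketch on a toy frame illustrates this; the frame `toy` and its values are made up for illustration and are not taken from the credit risk data.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one categorical column
toy = pd.DataFrame({
    "credit_usage": [6, 48, 12],
    "housing": ["own", "rent", "own"],
})

# Numeric columns are kept as they are; each category of 'housing'
# becomes its own indicator column named <column>_<category>
toy_encoded = pd.get_dummies(toy)

print(toy_encoded.columns.tolist())
print(toy_encoded.shape)
```

Here the two input columns become three: `credit_usage`, `housing_own`, and `housing_rent`.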

In [7]:
credit_data_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 61 columns):
 #   Column                                         Non-Null Count  Dtype
---  ------                                         --------------  -----
 0   credit_usage                                   1000 non-null   int64
 1   current_balance                                1000 non-null   int64
 2   location                                       1000 non-null   int64
 3   residence_since                                1000 non-null   int64
 4   cc_age                                         1000 non-null   int64
 5   existing_credits                               1000 non-null   int64
 6   num_dependents                                 1000 non-null   int64
 7   over_draft_0<=X<200                            1000 non-null   uint8
 8   over_draft_<0                                  1000 non-null   uint8
 9   over_draft_>=200                               1000 non-null   uint8
 10  o

This data can now be split into training and test sets to build the logistic regression model.

#Training & Testing

In this example, the data is split into training and test datasets in the ratio of 85:15.



In [8]:
# Importing the required module
from sklearn.model_selection import train_test_split
#splitting data into train and test datasets in 85:15 ratio
X_train,X_test,y_train,y_test = train_test_split(credit_data_encoded, y,test_size=0.15,random_state=100)
# Checking the shapes of the resulting datasets
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_test:", y_test.shape)


Shape of X_train: (850, 61)
Shape of y_train: (850,)
Shape of X_test: (150, 61)
Shape of y_test: (150,)
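
The confusion matrices later in this section imply that the data holds 300 'bad' and 700 'good' applicants, so the classes are not balanced. One optional variation, not used in the notebook itself, is to pass `stratify` to `train_test_split` so that both parts keep the same class mix. The sketch below uses stand-in arrays (`X_demo`, `y_demo`) rather than the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 1000 rows with a 70:30 good/bad mix
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array(["good"] * 700 + ["bad"] * 300)

# stratify=y_demo keeps the class proportions the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=100, stratify=y_demo
)

print("Share of 'bad' in train:", np.mean(y_tr == "bad"))
print("Share of 'bad' in test:", np.mean(y_te == "bad"))
```

With the real data, this would correspond to adding `stratify=y` to the call shown above; the reported accuracies would then differ from the ones in this notebook.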


#Building the model

Let us build the logistic regression model using sklearn.

In [9]:
# Importing the required class.
from sklearn.linear_model import LogisticRegression
# Instantiating the required algorithm for model building.
model = LogisticRegression()
# Building the model based on the training data.
model.fit(X_train,y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
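
The warning above means the solver stopped before converging, and it suggests two remedies: scaling the features or raising `max_iter`. The following is a sketch of both applied together, assuming a synthetic dataset from `make_classification` as a stand-in for the encoded credit data; the names `X_demo`, `y_demo`, and `scaled_model` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded credit data (not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
# Put one feature on a much larger scale, similar to a raw balance column
X_demo[:, 0] *= 10_000

# Standardize the features first, then fit with a higher iteration cap
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used
n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
print("Iterations used:", n_iter)
```

Wrapping the scaler and the model in one pipeline means the scaling parameters are learned from the training data only and reused when predicting on the test data.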


Now that you have built the model, the next step is to evaluate the model's performance or accuracy on the training and test data:

In [10]:
# Getting the accuracy on training data
train_accuracy = model.score(X_train,y_train)
print("Train accuracy = ", train_accuracy)
# Getting the accuracy on test data
test_accuracy = model.score(X_test,y_test)
print("Test accuracy = ", test_accuracy)


Train accuracy =  0.7752941176470588
Test accuracy =  0.74


#Measuring Model Performance using Confusion Matrix

A confusion matrix helps assess how good a model is by comparing the actual target values with the predicted target values.

Let us see how to generate the Confusion Matrix for a model in sklearn:

In [11]:
# Predicting targets based on the model built
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)
# Importing the required function
from sklearn.metrics import confusion_matrix
# Creating a confusion matrix on the training data
train_conf_matrix = confusion_matrix(y_train,train_predictions)
# Converting the train_conf_matrix into a DataFrame for better readability
pd.DataFrame(train_conf_matrix,columns=model.classes_,index=model.classes_)


      bad  good
bad   125   132
good   59   534


The rows of a Confusion Matrix represent the actual target values and the columns represent the predicted target values.

In the above matrix for training data, we can observe that the model predicted:

125 actually 'bad' credit risks as 'bad'

132 actually 'bad' credit risks as 'good'

59 actually 'good' credit risks as 'bad'

534 actually 'good' credit risks as 'good'
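
The row and column convention described above can be checked on a small hand-made example. This is a sketch with made-up labels for seven applicants; passing `labels` explicitly fixes the order of rows and columns.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for seven applicants
y_true = ["bad", "bad", "bad", "good", "good", "good", "good"]
y_pred = ["bad", "bad", "good", "good", "good", "good", "bad"]

# Fixing the label order makes the row/column positions explicit
cm = confusion_matrix(y_true, y_pred, labels=["bad", "good"])

# Rows are actual classes, columns are predicted classes
bad_as_bad, bad_as_good = cm[0]
good_as_bad, good_as_good = cm[1]

print(cm)
print("Actually 'bad' predicted as 'good':", bad_as_good)
```

Of the three actually 'bad' applicants, two are predicted 'bad' and one 'good', which is exactly the first row of the matrix.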

In [12]:
# Confusion matrix for the test data
test_conf_matrix = confusion_matrix(y_test,test_predictions)
pd.DataFrame(test_conf_matrix,columns=model.classes_,index=model.classes_)


      bad  good
bad    19    24
good   15    92


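Accuracy is the number of correct predictions (the diagonal entries of the confusion matrix) divided by the total number of predictions. Let us compute it for our training and test datasets from the confusion matrices.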

In [13]:
# Calculating train accuracy from confusion matrix
train_correct_predictions = train_conf_matrix[0][0]+train_conf_matrix[1][1]
train_total_predictions = train_conf_matrix.sum()
train_accuracy = train_correct_predictions/train_total_predictions
print(train_accuracy)


0.7752941176470588


In [14]:
# Calculating test accuracy from confusion matrix
test_correct_predictions = test_conf_matrix[0][0]+test_conf_matrix[1][1]
test_total_predictions = test_conf_matrix.sum()
test_accuracy = test_correct_predictions/test_total_predictions
print(test_accuracy)


0.74
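
The same calculation can be written more compactly with `np.trace`, which sums the diagonal of a matrix. The sketch below re-enters the two confusion matrices from the outputs above by hand, so it runs without the dataset; `accuracy_from_cm` is a helper name introduced here for illustration.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows: actual class, columns: predicted class, order: bad, good)
train_cm = np.array([[125, 132], [59, 534]])
test_cm = np.array([[19, 24], [15, 92]])

def accuracy_from_cm(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    return np.trace(cm) / cm.sum()

train_acc = accuracy_from_cm(train_cm)
test_acc = accuracy_from_cm(test_cm)
print("Train accuracy =", train_acc)
print("Test accuracy =", test_acc)
```

The results, 659/850 and 111/150, agree with the values returned by `model.score()` earlier.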



#Precision, Recall, and F1-score:

In [15]:
# Importing the required function
from sklearn.metrics import classification_report
# Generating the report and printing the same
print(classification_report(y_test,test_predictions))


              precision    recall  f1-score   support

         bad       0.56      0.44      0.49        43
        good       0.79      0.86      0.83       107

    accuracy                           0.74       150
   macro avg       0.68      0.65      0.66       150
weighted avg       0.73      0.74      0.73       150
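
The per-class figures in the report can be reproduced from the test confusion matrix shown earlier. For a given class, precision divides the correct predictions by the column total (everything predicted as that class), and recall divides them by the row total (everything that actually is that class). The sketch below re-enters the test matrix by hand; the `metrics` dictionary is an illustrative name.

```python
import numpy as np

# Test confusion matrix copied from the output above
# (rows: actual class, columns: predicted class)
test_cm = np.array([[19, 24], [15, 92]])
labels = ["bad", "good"]

metrics = {}
for i, label in enumerate(labels):
    true_pos = test_cm[i, i]
    precision = true_pos / test_cm[:, i].sum()  # column total: predicted as this class
    recall = true_pos / test_cm[i, :].sum()     # row total: actually this class
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)
    print(f"{label:>5}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```

The low recall for 'bad' (19 of 43) shows that the model misses more than half of the risky applicants, which the overall accuracy of 0.74 does not reveal on its own.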


