
πŸ’° Predicting Loan Default Using Machine Learning

Introduction

Lending is crucial for banks to operate and meet demand. Looking at different banks, the amount of loan they give to different customers varies based on a few factors that are important in deciding whether to approve a loan, and the total amount they lend to a person also depends on these factors. However, there is always a possibility that a loan is given to a person who is not able to pay it back, or that a creditworthy applicant is turned down. It therefore becomes important for banks to understand the behavior of a customer before lending money for different purposes.

Machine Learning Analysis

Companies could use machine learning to understand important features and insights and to get predictions, so that they can determine whether to give a loan to a person. It would be very useful if, based on a given set of features, one could predict whether a customer will default on a loan. This is exactly the kind of problem that machine learning can address.

Exploratory Data Analysis (EDA)

We perform exploratory data analysis (EDA) to understand the data and select features for prediction. Below are some key insights generated from this analysis.

  • A large number of people in our data do not have a partner.
  • A large portion of the data contains missing values.
  • The number of people who defaulted on a loan is significantly lower than the number who did not default. Therefore, class balancing should be done before giving the data to the ML models for prediction.
  • Based on the salary amounts, a large portion of people have salaries ranging from about $100,000 to $140,000. A few people make about $800,000, but they are outliers in the data.
  • A large number of people have taken a loan or credit of about $250,000. Very few people (outliers) have taken a loan of about $2,000,000.
  • Because the data contains many missing values, we can either use imputation methods or remove the features with more than 80 percent missing values (see the sketch after this list).
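
As a rough illustration of the last point, here is a minimal sketch of how features with more than 80 percent missing values could be dropped. The file name "loan_data.csv" and the variable names are placeholders, not the notebook's actual identifiers.

```python
import pandas as pd

# Placeholder file name; the real notebook would load its own application data.
df = pd.read_csv("loan_data.csv")

# Fraction of missing values per feature, sorted from worst to best.
missing_fraction = df.isnull().mean().sort_values(ascending=False)

# Drop features where more than 80 percent of the values are missing;
# the remaining gaps can be handled later with imputation.
high_missing = missing_fraction[missing_fraction > 0.80].index
df = df.drop(columns=high_missing)
```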

Sampling Methods

Since we are dealing with imbalanced data, it is important to balance the classes, especially the minority class, which in our case is the set of applicants who default on a loan. We use sampling techniques such as SMOTE and random oversampling to get the best outputs from the machine learning models.
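
Below is a minimal sketch of the two oversampling techniques using the imbalanced-learn library. The synthetic data from make_classification is only a stand-in for the real training split, and the class weights are illustrative.

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Synthetic, imbalanced stand-in for the real training split (illustrative only).
X_train, y_train = make_classification(
    n_samples=10_000, n_features=20, weights=[0.92, 0.08], random_state=42
)

# Random oversampling: duplicate minority-class rows until the classes balance.
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

# SMOTE: synthesize new minority-class rows by interpolating between neighbors.
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
```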

Metrics

The output variable in our case is discrete, so the problem should be framed as classification and evaluated with metrics designed for discrete outcomes. Below are the metrics used for the classification problem of predicting whether a person will default on a loan.
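
These metrics can be computed with scikit-learn. This is only a sketch with toy labels; in the notebook, the true test labels and the model's predictions would be used instead.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Toy labels and predictions purely to illustrate the metric calls.
y_test  = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred  = [0, 0, 1, 0, 0, 1, 1, 1]
y_proba = [0.1, 0.3, 0.8, 0.4, 0.2, 0.9, 0.6, 0.7]  # predicted default probabilities

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_proba))
```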

Visualizations

In this section, we primarily focus on the visualizations from the analysis and on the ML models' confusion matrices and metrics to determine the best model for deployment.

Taking a look at a few rows and columns of the dataset, we see features such as whether the loan applicant has a car, their gender, the type of loan, and, most importantly, whether they have defaulted on a loan.
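
A quick way to take that first look is with pandas; the file name below is a placeholder for whichever CSV holds the application data.

```python
import pandas as pd

# Placeholder file name for the raw application data.
df = pd.read_csv("loan_data.csv")

print(df.shape)                  # number of rows and columns
print(df.head())                 # first few rows: car ownership, gender, loan type, default label, ...
print(df.dtypes.value_counts())  # how many numeric vs. categorical features
```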

A large portion of the loan applicants are unaccompanied, meaning that they are not married. There are a few applicants in the children and spouse categories. There are also a few other categories that are yet to be determined according to the dataset.

The plot below shows the total number of applicants and whether they defaulted on a loan. A large portion of the applicants were able to pay back their loans in a timely manner. There is still a set of applicants who failed to pay the loan back, which resulted in a loss to financial institutions as the amount was not repaid.
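
A plot like this can be produced with a simple seaborn count plot; the label column name "TARGET" is an assumption here, not necessarily the column name used in the notebook.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("loan_data.csv")  # placeholder file name

# Assumed label column: 1 = defaulted, 0 = repaid on time.
sns.countplot(x="TARGET", data=df)
plt.title("Number of applicants who defaulted vs. repaid")
plt.show()
```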

Missingno plots give a good representation of the missing values present in the dataset. The white strips in the plot indicate the missing values (depending on the colormap). After taking a look at this plot, there are a large number of missing values present in the data. Therefore, various imputation methods can be used. In addition, features that do not give a lot of predictive information can be removed.
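
A sketch of such a missingno matrix plot, assuming the data has already been loaded into a pandas DataFrame from a placeholder file:

```python
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt

df = pd.read_csv("loan_data.csv")  # placeholder file name

# Sample a subset of rows so the matrix stays readable; white strips mark missing values.
msno.matrix(df.sample(1000, random_state=42))
plt.show()
```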

These are the features with the most missing values. The y-axis indicates the percentage of values that are missing.
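
That bar chart of missing-value percentages can be reproduced roughly as follows (again with a placeholder file name):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("loan_data.csv")  # placeholder file name

# Percentage of missing values per feature, keeping the 20 worst offenders.
missing_pct = df.isnull().mean().mul(100).sort_values(ascending=False).head(20)
missing_pct.plot(kind="bar", figsize=(12, 4))
plt.ylabel("Missing values (%)")
plt.show()
```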

Looking at the type of loans taken by the applicants, a large portion of the dataset contains information about Cash Loans followed by Revolving Loans. Therefore, we have more information present in the dataset about 'Cash Loan' types which can be used to determine the chances of default on a loan.

Based on the results from the plots, most of the information in the dataset is about female applicants. A few categories are unknown; these can be removed as they do not aid the model in predicting the chances of default on a loan.

A large portion of applicants also do not own a car. It will be interesting to see how much of an impact this makes on predicting whether an applicant will default on a loan.

As seen from the income distribution plot, a large number of applicants earn an income near the spike in the green curve. However, there are also loan applicants who make a large amount of money, but they are relatively few in number, as indicated by the spread of the curve.

Plotting missing values for a few sets of features, there tend to be a lot of missing values for features such as TOTALAREA_MODE and EMERGENCYSTATE_MODE. Steps such as imputation or removal of those features can be performed to improve model performance. We will also look at other features that contain missing values based on the plots generated.

We also check the numerical features for missing values. The plot below clearly shows that there are only a few missing values among them. Since they are numerical, methods such as mean imputation, median imputation, and mode imputation can be used to fill them in.
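
A minimal sketch of such numerical imputation with scikit-learn's SimpleImputer, on a tiny made-up DataFrame; the column names are purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up numeric data with a couple of gaps.
df = pd.DataFrame({"income": [100_000, np.nan, 120_000, 95_000],
                   "credit": [250_000, 300_000, np.nan, 180_000]})

# Mean imputation; strategy could also be "median" or "most_frequent" (mode).
imputer = SimpleImputer(strategy="mean")
df[["income", "credit"]] = imputer.fit_transform(df[["income", "credit"]])
print(df)
```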

After performing imputation, notice how the white strips are removed. This indicates that the missing values have been imputed, so that the data can be fed to the ML models for predictions in later stages.

Model Performance

Random Oversampling

In this set of visualizations, let us focus on the model performance on unseen data points. Since this is a binary classification task, metrics such as precision, recall, f1-score, and accuracy can be taken into consideration. Various plots that indicate the performance of the model can be drawn, such as confusion matrix plots and AUC curves. Let us look at how the models perform on the test data.
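
The evaluation workflow roughly follows the pattern sketched below: oversample only the training split, fit a model, and score the untouched test split with a confusion matrix and AUC. The synthetic data and the choice of logistic regression here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from imblearn.over_sampling import RandomOverSampler

# Synthetic, imbalanced stand-in for the real data.
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.92, 0.08], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample the training split only, never the test split.
X_bal, y_bal = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print(confusion_matrix(y_test, model.predict(X_test)))
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```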

Logistic Regression - This was the first model used to predict the chances of a person defaulting on a loan. Overall, it does a decent job of classifying defaulters, but it produces many false positives and false negatives. This could mainly be due to the high bias, or low complexity, of the model.

AUC curves give a good idea of the performance of ML models. After using logistic regression, the AUC is about 0.54, which means there is a lot of room for improvement. The higher the area under the curve, the better the performance of the model.

Naive Bayes Classifier - This classifier works well when there is textual information. Based on the confusion matrix plot below, there is a large number of false negatives, which can have an impact on the business if not addressed. A false negative means that the model predicted a defaulter as a non-defaulter; as a result, banks are more likely to lose money, especially if money is lent to defaulters. Therefore, we can go ahead and look for alternate models.

The AUC curves also show that the model needs improvement. The AUC of the model is around 0.52. We can look for alternate models that improve performance even further.

Decision Tree Classifier - As shown in the plot below, the performance of the decision tree classifier is better than logistic regression and Naive Bayes. However, there is still room to improve model performance further, so we can explore another list of models as well.

Based on the results generated from the AUC curve, there is an improvement in the score compared to logistic regression and the Naive Bayes classifier. However, we can test a list of other possible models to determine the best one for deployment.

Random Forest Classifier - Random forests are ensembles of decision trees that reduce variance during training. In our case, however, the model is not performing well on its positive predictions. This could be due to the sampling approach chosen for training the models. In the later parts, we focus our attention on other sampling methods.

After looking at the AUC curves, it can be seen that better models and over-sampling methods can be chosen to improve the AUC scores. Let us now do SMOTE oversampling to determine the overall performance of ML models.

SMOTE Oversampling

Decision Tree Classifier - In this analysis, the same decision tree classifier was trained, but using the SMOTE oversampling method. The performance of the ML model improved significantly with this method of oversampling. We can also try a more robust model such as a random forest and determine the performance of that classifier, as shown in the sketch below.
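
A sketch of that comparison, training both tree-based models on SMOTE-resampled data; the synthetic dataset and the hyperparameters are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

# Synthetic, imbalanced stand-in for the real train/test splits.
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.92, 0.08], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE is applied to the training split only.
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X_train, y_train)

for clf in (DecisionTreeClassifier(random_state=42),
            RandomForestClassifier(n_estimators=200, random_state=42)):
    clf.fit(X_sm, y_sm)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(type(clf).__name__, "AUC:", round(auc, 3))
```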

Focusing our attention on the AUC curves, there is a significant improvement in the performance of the decision tree classifier. The AUC score is about 0.81. Therefore, SMOTE oversampling was useful in improving the overall performance of the classifier.

Random Forest Classifier - This random forest model was trained on SMOTE-oversampled data. There is a good improvement in the performance of the model: it is able to accurately predict the chances of default on a loan, with only a few false positives and fewer false negatives than any of the models used previously.

The performance of the random forest classifier is exceptional, with an AUC score of about 0.95, as depicted in the plot below. Therefore, we can deploy this model in real time, as it shows a lot of promise in predicting the chances of applicants defaulting on a loan.

Machine Learning Models

There are millions of records in our data, so it is important to use machine learning models that handle high-dimensional data well. Below are the machine learning models used for predicting whether a person would default on a loan.

| Machine Learning Model | Accuracy | Precision | Recall | F1-Score | AUC Score |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | 64.5% | 0.64 | 0.63 | 0.63 | 0.69 |
| Naive Bayes Classifier | 50.0% | 0.50 | 0.99 | 0.70 | 0.64 |
| Decision Tree Classifier | 81.0% | 0.76 | 0.84 | 0.80 | 0.81 |
| Random Forest Classifier | 86.0% | 0.74 | 0.98 | 0.84 | 0.95 |
| Deep Neural Networks | 73.0% | 0.66 | 0.77 | 0.71 | 0.76 |
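
The table above also lists a deep neural network. A minimal sketch of a binary classifier of that kind with Keras is shown below; the layer sizes, optimizer, and random training data are illustrative assumptions, not the notebook's exact architecture.

```python
import numpy as np
import tensorflow as tf

# Random stand-in data: 20 numeric features, binary default label.
X_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of default
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
model.fit(X_train, y_train, epochs=5, batch_size=64, verbose=0)
```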

πŸ‘‰ Directions to download the repository and run the notebook

These screenshots are from the Washington Bike Demand Prediction repository, but the same steps can be followed for this repository.

  1. Download and install Git, which is used for cloning repositories. The link to download Git is https://git-scm.com/downloads.

  2. Once Git is downloaded and installed, right-click on the location where you would like to download this repository. I would like to store it in a "Git Folder" location.

  3. If Git is installed successfully, you'll get an option called "Git Bash Here" when you right-click on a particular location.

  4. Once the Git Bash terminal opens, type "git clone" and then paste the link of the repository.

  5. The link of the repository can be found by clicking on "Code" (the green button); the HTTPS link appears just below it. Therefore, the command to download this repository is "git clone <link>", where <link> is replaced by the link to this repository.

  6. After successfully downloading the repository, there should be a folder with the name of the repository.

  7. Once the repository is downloaded, go to the start button and search for "Anaconda Prompt" if you have Anaconda installed.

  8. Open the Jupyter notebook by typing "jupyter notebook" in the Anaconda Prompt.

  9. A page with a list of directories will open.

  10. Search for the location where you downloaded the repository and open that folder.

  11. You can now run the .ipynb files present in the repository to open the notebook and the Python code present in it.

  

That's it, you should be able to read the code now. Thanks.
