# Loan Prediction Project 

In this project we work with a loan customer behaviour data set that records whether or not each customer defaulted on a loan. We will build a model that predicts, from customer behaviour, whether a customer will default.

This data set contains the following features:

* 'Income': Income of the customer
* 'Age': Customer age in years
* 'Experience': Professional experience of the customer in years
* 'Profession': Profession
* 'Married/Single': Whether married or single
* 'CITY': City of residence
* 'House_Ownership': Owned, rented, or neither
* 'Car_Ownership': Whether the customer owns a car
* 'CURRENT_JOB_YRS': Years of experience in the current job
* 'CURRENT_HOUSE_YRS': Number of years in the current residence
* 'Risk_Flag': Whether the customer defaulted on a loan
* 'STATE': State of residence



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
%matplotlib inline
# set_theme() resets the style, so pass the style in the same call
sns.set_theme(style="whitegrid")

In [None]:
customer = pd.read_csv('/kaggle/input/loan-prediction-based-on-customer-behavior/Training Data.csv')

In [None]:
customer.head()

In [None]:
customer.info()

In [None]:
customer.isnull().sum()

In [None]:
customer.drop('Id',axis = 1, inplace = True)

In [None]:
customer.head()

In [None]:
customer.describe()

In [None]:
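# Restrict to numeric columns: the text columns (e.g. Profession, CITY) cannot be correlated
corr = customer.select_dtypes(include='number').corr()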
corr

In [None]:
plt.figure(figsize=(10,5))
sns.heatmap(corr,annot = True,cmap = 'coolwarm')

* From the correlation we can see that customers who have a higher income and have lived in their house for a long time tend not to default on loans
* Overall, the numeric features are negatively correlated with Risk_Flag: the higher the value, the less likely the customer is to default
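
To read the heatmap more directly, the correlations with the target can be pulled out and sorted. The following is a minimal sketch on a small made-up DataFrame (`toy`, hypothetical values) standing in for `customer`; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `customer` DataFrame.
toy = pd.DataFrame({
    "Income": [900, 800, 700, 600, 500, 400, 300, 200],
    "CURRENT_HOUSE_YRS": [14, 13, 13, 12, 12, 11, 11, 10],
    "Risk_Flag": [0, 0, 0, 0, 1, 0, 1, 1],
})

# Correlation of every numeric feature with the target, most negative first.
target_corr = (
    toy.select_dtypes(include="number")
       .corr()["Risk_Flag"]
       .drop("Risk_Flag")
       .sort_values()
)
print(target_corr)
```

On the real data, replacing `toy` with `customer` should give the same kind of ranking, one value per numeric feature.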

In [None]:
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(customer['Age'], kde=False, color='darkred', bins=30)

In [None]:
sns.countplot(x='Experience',data=customer)

In [None]:
customer['Risk_Flag'].value_counts()

In [None]:
sns.countplot(x='Risk_Flag', data=customer)

From the value counts and the countplot we can see that the data is very imbalanced, which can cause poor prediction performance. To overcome this we will use SMOTE (Synthetic Minority Oversampling Technique).
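
The core idea of SMOTE is to create new minority-class points by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step with NumPy; `smote_sketch` and the sample points are hypothetical, and this is a simplification, not the `imblearn` implementation used later in the notebook.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new point.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Random position along the segment between base and neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority-class points with two features.
X_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]])
synthetic = smote_sketch(X_minority, n_new=6)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.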

In [None]:
sns.countplot(x='CURRENT_HOUSE_YRS',hue ='Risk_Flag', data=customer)

In [None]:
sns.countplot(x='Married/Single',hue ='Risk_Flag', data=customer)

In [None]:
customer['Profession'].nunique()

In [None]:
customer['Profession'].value_counts()

In [None]:
sns.boxplot(x='Risk_Flag',y='Age',hue='Married/Single',data=customer)

In [None]:
# Preview income in thousands; the result is not assigned, so `customer` is unchanged
customer['Income'] / 1000

In [None]:
customer.head()

In [None]:
Marriage = pd.get_dummies(customer['Married/Single'],drop_first=True)
House_Ownership = pd.get_dummies(customer['House_Ownership'],drop_first=True)
Car_Ownership = pd.get_dummies(customer['Car_Ownership'],drop_first=True)

In [None]:
customer.drop(['Married/Single', 'House_Ownership', 'Car_Ownership', 'Profession', 'CITY', 'STATE'], axis=1, inplace=True)

In [None]:
customer

In [None]:
customer = pd.concat([customer,Marriage,House_Ownership,Car_Ownership],axis=1)

In [None]:
customer

In [None]:
X = customer.drop('Risk_Flag',axis=1)
y = customer['Risk_Flag']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
from imblearn.over_sampling import SMOTE

oversample = SMOTE()
X_train, y_train = oversample.fit_resample(X_train, y_train)

In [None]:
from sklearn.tree import DecisionTreeClassifier

dct = DecisionTreeClassifier()
dct.fit(X_train,y_train)
dct_predict = dct.predict(X_test)

In [None]:
print(accuracy_score (y_test, dct_predict))
print(roc_auc_score (y_test, dct_predict))

In [None]:
dct_cfm = confusion_matrix(y_test,dct_predict)

group_counts = ['{0:0.0f}'.format(value) for value in
                dct_cfm.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in
                     dct_cfm.flatten()/np.sum(dct_cfm)]
labels = [f'{v2}\n{v3}' for  v2, v3 in
          zip(group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(dct_cfm, annot=labels, fmt='', cmap='Blues')

* DecisionTreeClassifier gives us 0.86 accuracy and a 0.86 AUC score
* From the confusion matrix we can see that the model tends to predict 0, because the data is imbalanced
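
How strongly a model leans towards class 0 can be quantified from the confusion matrix itself. The sketch below uses made-up counts (not this notebook's results) in scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows = true class, columns = predicted class.
cfm = np.array([[60, 6],
                [4, 5]])
tn, fp, fn, tp = cfm.ravel()

recall_default = tp / (tp + fn)      # share of real defaulters that were caught
precision_default = tp / (tp + fp)   # share of predicted defaulters that were right
accuracy = (tp + tn) / cfm.sum()

print(f"accuracy={accuracy:.2f} recall={recall_default:.2f} precision={precision_default:.2f}")
```

A high accuracy paired with a low recall for class 1 is the numeric signature of a model that mostly outputs 0.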

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train,y_train)
lr_predictions = lr.predict(X_test)

In [None]:
print(accuracy_score (y_test, lr_predictions))
print(roc_auc_score (y_test, lr_predictions)) 

In [None]:
lr_cfm = confusion_matrix(y_test,lr_predictions)

group_counts = ['{0:0.0f}'.format(value) for value in
                lr_cfm.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in
                     lr_cfm.flatten()/np.sum(lr_cfm)]
labels = [f'{v2}\n{v3}' for  v2, v3 in
          zip(group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(lr_cfm, annot=labels, fmt='', cmap='mako')

* LogisticRegression gives us 0.88 accuracy and a 0.5 AUC score
* From the confusion matrix we can see that the model always predicts the negative class (0), so despite its high accuracy this model cannot be used
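
A model that always predicts 0 reproduces this pattern of high accuracy with an AUC of 0.5. The sketch below shows it on hypothetical labels (88 non-defaulters and 12 defaulters, chosen for illustration only).

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 88 non-defaulters (0) and 12 defaulters (1).
y_true = np.array([0] * 88 + [1] * 12)
# A "model" that predicts no default for everyone.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)   # equals the share of the majority class
auc = roc_auc_score(y_true, y_pred)    # constant predictions cannot rank anyone
print(acc, auc)
```

Accuracy only reflects the class proportions here, while the AUC shows that the predictions carry no information about who defaults.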

In [None]:
from sklearn.ensemble import RandomForestClassifier

rft = RandomForestClassifier()
rft.fit(X_train,y_train)
rft_predict = rft.predict(X_test)

In [None]:
print(accuracy_score (y_test, rft_predict))
print(roc_auc_score (y_test, rft_predict)) 

In [None]:
rft_cfm = confusion_matrix(y_test,rft_predict)

group_counts = ['{0:0.0f}'.format(value) for value in
                rft_cfm.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in
                     rft_cfm.flatten()/np.sum(rft_cfm)]
labels = [f'{v2}\n{v3}' for  v2, v3 in
          zip(group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(rft_cfm, annot=labels, fmt='', cmap='rocket_r')

* RandomForestClassifier gives us 0.90 accuracy and a 0.86 AUC score
* From the confusion matrix we can see that the model tends to predict 0, because the data is imbalanced
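
The AUC scores in this notebook are computed from hard 0/1 predictions. A common alternative is to score the predicted probability of class 1, which lets the ROC curve use the model's full ranking. The sketch below compares both on synthetic stand-in data from `make_classification`; the data, the variable names and the model settings are assumptions for illustration, not the loan data or the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 88% class 0), not the loan data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.88], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions versus AUC from the probability of class 1.
auc_labels = roc_auc_score(y_te, model.predict(X_te))
auc_proba = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"AUC from labels: {auc_labels:.3f}, from probabilities: {auc_proba:.3f}")
```

In this notebook the equivalent call would be `roc_auc_score(y_test, rft.predict_proba(X_test)[:, 1])`.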

## Conclusion

* Of the three models we tried, RandomForestClassifier gives the highest accuracy (0.90); its AUC score (0.86) matches the decision tree's and is well above logistic regression's (0.5)
* Improvements aimed at a better AUC score are planned; any feedback is welcome and appreciated