# Credit Risk 

## Objective:

The objective is to build machine learning models on the given dataset to predict whether a particular customer will repay the loan or not.

### Applicability:
* The target label is known, so supervised learning models can be built on the dataset.
* The target classes are discrete, so any classifier model can be built.
* Here, we consider traditional classifier models: Logistic Regression, Decision Tree Classifier and RandomForestClassifier.
    
 


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#import necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set()

In [None]:
credit= pd.read_csv('/kaggle/input/credit-risk/original.csv')
credit.head()        

##### Understanding the dataset

In [None]:
print(credit.shape)

In [None]:
credit.info()

In [None]:
credit.describe(include='all')

Age cannot be negative, but some of the values in the age column are negative. We need to replace these negative values with either 0 or NaN.

In [None]:
#let's replace these negative (non-positive) values with NaN
credit.loc[~(credit['age'] > 0), 'age']=np.nan

In [None]:
unique_vals= {
    k: credit[k].unique()
    for k in credit.columns
    
}

unique_vals

1. clientid is unique for each observation. It is an identifier, and keeping it would only add complexity to the model, so we will ignore this column.
2. income, age and loan are numerical columns.
3. default has only two values. The objective is to predict whether the customer defaults or not, so this column is our output variable. Since the target is known and discrete, we need to build a supervised classification model.
4. Convert the default column to category.

In [None]:
#drop clientid from dataset
credit= credit.drop('clientid', axis=1)

#### Missing values

In [None]:
credit.isnull().sum()

In [None]:
# 6 missing values in 2000 records is roughly 0.3% of total records. we will drop null values
credit= credit.dropna()
credit.shape
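
Before dropping rows, it can help to quantify how much data is missing. The following is a minimal sketch on a made-up frame; the `demo` data and its values are assumptions for illustration, not the credit dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing ages, standing in for the real data.
demo = pd.DataFrame({
    "income": [42000.0, 55000.0, 61000.0, 38000.0, 47000.0],
    "age": [34.0, np.nan, 51.0, 29.0, np.nan],
})

# Fraction of missing values per column.
missing_share = demo.isnull().mean()
# Fraction of rows that dropna() would remove.
rows_with_missing = demo.isnull().any(axis=1).mean()

print(missing_share)
print(f"rows affected: {rows_with_missing:.1%}")
```

If the affected share is small, as it is in this notebook, dropping the rows is a reasonable choice.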

In [None]:
credit['default']= credit['default'].astype('category')
credit['age']=credit['age'].astype('int')

#### Exploratory Data Analysis

In [None]:
credit.describe()

income and loan have much higher variance than age, so we need to scale the features before we fit our model.
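
For reference, standardization rescales each feature value $x$ using that feature's mean $\mu$ and standard deviation $\sigma$, so that income, age and loan end up on comparable scales:

$$
z = \frac{x - \mu}{\sigma}
$$

After this transform every feature has mean 0 and standard deviation 1.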

In [None]:
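# restrict to numeric columns; 'default' is now a category (numeric_only needs pandas >= 1.5)
credit.corr(numeric_only=True)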

In [None]:
# variance is only defined for the numeric columns
credit.var(numeric_only=True)

As stated above, the variance is high in income and loan.

In [None]:
credit['default'].value_counts()

In [None]:
credit.groupby('default').mean()

The group means suggest that younger customers tend to repay their loans.
Possibly because of this repayment capacity, they are also offered larger loans.

So age appears to be an important feature for the outcome variable.

In [None]:
# average only the numeric columns; the categorical 'default' column has no mean
credit.groupby('age').mean(numeric_only=True)

##### Visual EDA

In [None]:
# DataFrame.hist creates its own figure, so pass figsize to it directly
credit.hist(figsize=(20,10))
plt.show()

In [None]:
fig, (ax1, ax2, ax3)= plt.subplots(1,3, figsize=(12,6))
credit['age'].plot(kind='box', ax=ax1)
credit['income'].plot(kind='box', ax=ax2)
credit['loan'].plot(kind='box', ax=ax3)
plt.show()

It looks like there are outliers present in loan.

In [None]:
sns.barplot(y='age', x='default', data=credit)
plt.xlabel('Defaults')
plt.ylabel('age of defaulters')
plt.title('Average age of defaulters on Loan', fontsize=12)
plt.show()

Among customers who defaulted on a loan, the average age is above 40. Customers around 20-30 appear more inclined to repay, possibly because they have more working years ahead of them.

In [None]:
sns.pairplot(data=credit, hue='default',diag_kind='kde')

We couldn't find any clear pairwise relationships between features.

In [None]:
# Find the mean and standard dev
std = credit['loan'].std()
mean = credit['loan'].mean()
# Calculate the cutoff
cut_off = std * 3
lower, upper = mean - cut_off, mean + cut_off
# Trim the outliers
trimmed_df = credit[(credit['loan'] < upper) \
                           & (credit['loan'] > lower)]
trimmed_df.shape

In [None]:
# The trimmed box plot
trimmed_df[['loan']].boxplot()
plt.show()

Removing the outlier values makes no meaningful difference to the dataframe.
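
To make that comparison concrete, one could count how many rows the 3-sigma rule actually removes. The sketch below does this on made-up loan amounts; the helper names `three_sigma_bounds` and `count_outside` and the sample values are assumptions for illustration.

```python
import statistics

def three_sigma_bounds(values):
    """Return (lower, upper) cut-offs at mean +/- 3 sample standard deviations."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation, like pandas .std()
    return mean - 3 * std, mean + 3 * std

def count_outside(values, lower, upper):
    """Count values on or beyond the cut-offs (the rows the trim would drop)."""
    return sum(1 for v in values if v <= lower or v >= upper)

# Hypothetical loan amounts: 30 ordinary values and one extreme value.
loans = [1000.0 + 50.0 * i for i in range(30)] + [60000.0]
lower, upper = three_sigma_bounds(loans)
n_outliers = count_outside(loans, lower, upper)
print(f"cut-offs: ({lower:.1f}, {upper:.1f}), rows trimmed: {n_outliers}")
```

When the count is zero or very small relative to the dataset, trimming has little effect, which matches the observation above.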

We will use all the features of original dataframe to build our base model

In [None]:
#Split the independent and outcome variable
X= credit.iloc[:,0:3]
y=credit.iloc[:,3]
y.value_counts()

The target labels are unevenly distributed, so a purely random split could produce training and test sets that are not representative of the data and bias the model we are trying to train. We will use stratified sampling so that both splits preserve the class proportions of y.
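
The effect of stratification can be checked on a small example. This is a minimal sketch with made-up labels (`y_demo` and `X_demo` are placeholders, not the credit data); it shows that the share of the minority class is the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class 0, 20% class 1.
y_demo = np.array([0] * 160 + [1] * 40)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

train_share = y_tr.mean()  # fraction of class 1 in the training split
test_share = y_te.mean()   # fraction of class 1 in the test split
print(f"class-1 share: train={train_share:.3f}, test={test_share:.3f}")
```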

In [None]:
#lets split to training and test set for training the model and validating the model
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test= train_test_split(X, y, random_state=9999, stratify=y)
#stratify is used since the target class distribution is imbalanced

X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
#lets perform scaling. all our features are numerical columns
#it is important that our features are on the same scale.

from sklearn.preprocessing import StandardScaler

sc= StandardScaler()
X_train= sc.fit_transform(X_train)
#reuse the statistics learned on the training set; refitting on the test set would leak information
X_test=sc.transform(X_test)
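
A scaler is usually fit on the training split only and then applied unchanged to the test split, so that no test-set statistics leak into preprocessing. The sketch below illustrates this on random data; the `*_demo` arrays are assumptions standing in for the real splits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical feature matrices standing in for the train/test splits.
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std on train
X_test_scaled = scaler.transform(X_test_demo)        # reuse them on test

# The test set is scaled with the training statistics, not its own.
expected = (X_test_demo - scaler.mean_) / scaler.scale_
print(np.allclose(X_test_scaled, expected))
```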

### Model Building- Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

#instantiate LogisticRegression model
logreg= LogisticRegression(solver='lbfgs')

In [None]:
#perform cross validation to get a more reliable estimate of model accuracy
from sklearn.model_selection import cross_val_score

cv_scores= cross_val_score(logreg, X, y, cv=5)

# Print the 5-fold cross-validation scores
print(cv_scores)
print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))

The CV scores give us a benchmark accuracy for our Logistic Regression model. If the test set accuracy falls within 90-96%, the model is performing in line with the cross-validation estimate.
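
One way to keep feature scaling inside cross-validation is to wrap the scaler and the classifier in a pipeline, so the scaler is refit on each training fold. This is a sketch under the assumption of synthetic data from `make_classification`; the printed scores are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data: 3 numeric features, imbalanced target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.85, 0.15], random_state=0,
)

# The scaler is refit inside every fold, so no fold is scaled with
# statistics computed from its own validation rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs"))
demo_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print(demo_scores)
print(f"mean={demo_scores.mean():.3f}, std={demo_scores.std():.3f}")
```

Reporting the spread of the fold scores alongside the mean gives a sense of how wide the benchmark range is.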

In [None]:
#Fit the logistic regression model to training data
logreg.fit(X_train, y_train)

# Predict the test set
y_pred = logreg.predict(X_test)
y_pred

#### Model Validation- Logistic Regression

In [None]:
# Making the confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

cm = confusion_matrix(y_test,y_pred)
acc_score = accuracy_score(y_test, y_pred)

print(f"Accuracy = {acc_score*100:.2f}%")
print(f"Confusion matrix = \n{cm}")

In [None]:
#Check Training and Test Set Accuracy

training_accuracy= logreg.score(X_train, y_train)
test_accuracy= logreg.score(X_test, y_test)

print(f"Training Set accuracy = {training_accuracy*100:.2f}%")
print(f"Test Set accuracy = {test_accuracy*100:.2f}%")

Training and test set accuracy are both high and almost the same, so there is no sign of overfitting or underfitting.
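
High accuracy on an imbalanced target can still hide errors on the minority class, so it is worth reading precision and recall off the confusion matrix as well. The sketch below derives them by hand; the helper `binary_metrics` and the counts in `example_cm` are made up for illustration and are not results from this notebook.

```python
def binary_metrics(cm):
    """Derive metrics from a 2x2 confusion matrix laid out as
    [[TN, FP], [FN, TP]] (the layout scikit-learn uses for labels 0 and 1)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up counts for illustration only.
example_cm = [[420, 8], [12, 60]]
metrics = binary_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```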

In [None]:
# Complete classification report
print(classification_report(y_test,y_pred))

In [None]:
# Coefficients of the model and its intercept
print(dict(zip(X.columns, abs(logreg.coef_[0]).round(2))))
print(logreg.intercept_)

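None of the feature coefficients is close to zero, so there is no need to drop any of these features. Since the features were standardized, the coefficient magnitudes are comparable.

However, we can perform RFE (recursive feature elimination) to check whether accuracy improves when a feature is dropped.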

In [None]:
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score

# Create the RFE with a LogisticRegression estimator and 2 features to select
rfe = RFE(estimator=logreg, n_features_to_select=2, verbose=1)
# Fits the eliminator to the data
rfe.fit(X_train, y_train)
# Print the features and their ranking (high = dropped early on)
print(dict(zip(X.columns, rfe.ranking_)))
# Print the features that are not eliminated
print(X.columns[rfe.support_])
# Calculates the test set accuracy
acc = accuracy_score(y_test, rfe.predict(X_test))
print("{0:.1%} accuracy on test set.".format(acc))

Dropping features does not improve our model accuracy.

#### Model Evaluation:

Evaluate model performance by plotting an ROC curve

In [None]:
from sklearn.metrics import roc_curve, auc

#compute predicted probabilities: y_pred_prob
y_pred_prob= logreg.predict_proba(X_test)[:,1]

#Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Calculate the AUC

roc_auc = auc(fpr, tpr)
print ('ROC AUC: %0.3f' % roc_auc )

#Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

The AUC on the test set shows that the logistic regression model has relatively strong power of separation between positive and negative occurrences (repay - 1, default - 0).
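
AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on a tiny made-up example; `pairwise_auc` is a hypothetical helper, not part of scikit-learn.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Tiny made-up example: labels and predicted probabilities of class 1.
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
demo_auc = pairwise_auc(labels, probs)
print(f"AUC = {demo_auc:.2f}")
```

Three of the four positive/negative pairs are ordered correctly here, so the sketch yields 0.75; a value of 0.5 would correspond to random ranking.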

## Comparing with other ML models

We will fit other classifier models to the data and compare which model fits the dataset best.

### Model Building- RandomForestClassifier

In [None]:
#instantiate RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

rfc= RandomForestClassifier(n_estimators=10, max_depth=3)

#Fit the RandomForest model to training data
rfc.fit(X_train, y_train)

# Predict the test set
y_pred_rfc = rfc.predict(X_test)
y_pred_rfc

#### Model Validation

In [None]:
# Making the confusion matrix
cm_rfc = confusion_matrix(y_test,y_pred_rfc)
acc_score_rfc = accuracy_score(y_test, y_pred_rfc)

print(f"Accuracy = {acc_score_rfc*100:.2f}%")
print(f"Confusion matrix = \n{cm_rfc}")

In [None]:
#Check Training and Test Set Accuracy

training_accuracy_rfc= rfc.score(X_train, y_train)
test_accuracy_rfc= rfc.score(X_test, y_test)

print(f"Training Set accuracy = {training_accuracy_rfc*100:.2f}%")
print(f"Test Set accuracy = {test_accuracy_rfc*100:.2f}%")

The model appears to underperform on the test set. We will perform hyperparameter tuning to find better parameters for our model.

##### Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import make_scorer


#lets get hyperparameters defined in our model
rfc.get_params()

The rfc model exposes 18 hyperparameters. We will tune the four that we expect to affect model accuracy most:

* max_depth
* max_leaf_nodes
* min_samples_split
* n_estimators

In [None]:
param_grid= {"max_depth": [2, 4, 6, 8, 10],
            "max_leaf_nodes": [2, 4, 6],
            "min_samples_split":[2, 4, 6, 8],
            "n_estimators": [10, 50, 100, 150]}

#use accuracy_score as the scoring method. scikit-learn has built-in scoring options,
#but make_scorer lets us wrap any metric (e.g. recall_score) as a scorer
scorer= make_scorer(accuracy_score)

In [None]:
rcv = RandomizedSearchCV(estimator=rfc, param_distributions=param_grid, n_iter=10, cv=5, scoring=scorer)
# search on the training split only so the test set stays unseen
rcv.fit(X_train, y_train)

# print the mean test scores:
print('The accuracy for each run was: {}.'.format(rcv.cv_results_['mean_test_score']))
# print the best score and the parameters that produced it:
print('The best mean CV accuracy was: {}'.format(rcv.best_score_))
print('The best parameters were: {}'.format(rcv.best_params_))

In [None]:
#Use the best params and reinstantiate RandomForestClassifier model
model=RandomForestClassifier(n_estimators= 50, min_samples_split= 2, max_leaf_nodes= 6, max_depth= 10)

#fit the training set to model
model.fit(X_train, y_train)

# Making the confusion matrix
cm_rfc2 = confusion_matrix(y_test,model.predict(X_test))
acc_score_rfc2 = accuracy_score(y_test, model.predict(X_test))

print(f"Accuracy = {acc_score_rfc2*100:.2f}%")
print(f"Confusion matrix = \n{cm_rfc2}")

In [None]:
#Check Training and Test Set Accuracy

training_accuracy_rfc2= model.score(X_train, y_train)
test_accuracy_rfc2= model.score(X_test, y_test)

print(f"Training Set accuracy = {training_accuracy_rfc2*100:.2f}%")
print(f"Test Set accuracy = {test_accuracy_rfc2*100:.2f}%")

Although the test set accuracy is lower than that of the base model, the training set accuracy increases after hyperparameter tuning.
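
Rather than copying the best parameters by hand, a fitted search object exposes them through `best_params_`, `best_score_` and `best_estimator_`. The sketch below shows this on synthetic data; the data, the reduced `demo_grid` and the fixed `random_state` values are assumptions made so the example is small and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data, not the credit dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

demo_grid = {
    "max_depth": [2, 4, 6],
    "max_leaf_nodes": [2, 4, 6],
    "n_estimators": [10, 50],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=demo_grid,
    n_iter=4, cv=3, scoring="accuracy", random_state=0,
)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
# best_estimator_ is already refit on the whole training split (refit=True by default).
tuned_model = search.best_estimator_
print(f"test accuracy: {tuned_model.score(X_te, y_te):.3f}")
```

Fixing `random_state` on both the search and the forest makes the selected parameters reproducible between runs.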

#### Model Evaluation

In [None]:
from sklearn.metrics import roc_curve, auc

#compute predicted probabilities: y_pred_prob
y_pred_prob_rfc= model.predict_proba(X_test)[:,1]

#Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob_rfc)

# Calculate the AUC

roc_auc = auc(fpr, tpr)
print ('ROC AUC: %0.3f' % roc_auc )

#Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve of RandomForest Model')
plt.legend(loc="lower right")
plt.show()

### Model Building- Decision Tree Classifier

In [None]:
#Instantiate Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

dt= DecisionTreeClassifier(max_depth=4, random_state=9999)

#fit the training set to model
dt.fit(X_train, y_train)

#### Model Validation

In [None]:
# Making the confusion matrix
cm_dt = confusion_matrix(y_test,dt.predict(X_test))
acc_score_dt = accuracy_score(y_test, dt.predict(X_test))

print(f"Accuracy = {acc_score_dt*100:.2f}%")
print(f"Confusion matrix = \n{cm_dt}")

In [None]:
#Check Training and Test Set Accuracy

training_accuracy_dt= dt.score(X_train, y_train)
test_accuracy_dt= dt.score(X_test, y_test)

print(f"Training Set accuracy = {training_accuracy_dt*100:.2f}%")
print(f"Test Set accuracy = {test_accuracy_dt*100:.2f}%")

In [None]:
# Complete classification report
print(classification_report(y_test,dt.predict(X_test)))

#### Model Evaluation

In [None]:
#compute predicted probabilities: y_pred_prob
y_pred_prob_dt= dt.predict_proba(X_test)[:,1]

#Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob_dt)

# Calculate the AUC

roc_auc = auc(fpr, tpr)
print ('ROC AUC: %0.3f' % roc_auc )

#Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve of Decision Tree Model')
plt.legend(loc="lower right")
plt.show()

In [None]:
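#lets draw decision tree

from sklearn import tree

# export_graphviz writes tree.dot and returns None when out_file is given,
# so there is nothing to assign; feature names are passed as a plain list
tree.export_graphviz(dt, out_file='tree.dot', feature_names=list(X.columns),
                     max_depth=4, filled=True, rounded=True)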

In [None]:
!dot -Tpng tree.dot -o tree.png

In [None]:
image= plt.imread('tree.png')
plt.figure(figsize=(20, 20))
plt.imshow(image)

## Conclusion:
By all the measures above, the Decision Tree performs better than the LogisticRegression and RandomForestClassifier models.
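
A side-by-side summary makes such a comparison easier to check. The sketch below collects test accuracy and ROC AUC for the three model families on synthetic data; the data and the `candidates` settings are assumptions, so the printed numbers are not this notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 3 numeric features, imbalanced target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.85, 0.15], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

candidates = {
    "LogisticRegression": LogisticRegression(solver="lbfgs"),
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=50, max_depth=3, random_state=0
    ),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=4, random_state=0),
}

summary = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]  # predicted probability of class 1
    summary[name] = {
        "test_accuracy": accuracy_score(y_te, clf.predict(X_te)),
        "roc_auc": roc_auc_score(y_te, proba),
    }

for name, scores in summary.items():
    print(f"{name:24s} accuracy={scores['test_accuracy']:.3f}  auc={scores['roc_auc']:.3f}")
```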