# *Summary*
### In this notebook, the credit risk data was cleaned and explored to better understand the current credit risk situation, then modelled to accurately predict the probability of default of a loan. The model can be used to automate approving and declining loan applications more accurately.

### An 86% accuracy level was achieved in predicting the loan defaults on 32,576 loans and 12 benchmarks. With this model, the default rate would decrease by 8%, resulting in minimized risk for both the lender and applicant.
   


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import model_selection,linear_model, metrics

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
cr_data = pd.read_csv("/kaggle/input/credit-risk-dataset/credit_risk_dataset.csv")
shape = cr_data.shape
print("There are {} rows and {} features.".format(shape[0], shape[1]))
print(cr_data.dtypes)
cr_data

In [None]:
# we will shorten the last 2 feature names and address the null values
cr_data = cr_data.rename(columns = {"cb_person_default_on_file":"default_hist", "cb_person_cred_hist_length": "cr_hist_len"})
cr_data.isnull().sum()

In [None]:
# percentage of null values from loan int rate col
cr_data.loan_int_rate.isnull().sum() / cr_data.shape[0]

Two features have null values. Given the share of null values relative to our sample size, we will investigate their distributions and decide how to fill the NaNs.

In [None]:
plt.hist(cr_data['person_emp_length'])
plt.xlabel("Employment Length")
plt.ylabel("Frequency")
plt.title("Freq vs Employment Length")
plt.show()

plt.hist(cr_data['loan_int_rate'])
plt.xlabel("Interest Rate")
plt.ylabel("Frequency")
plt.title("Freq vs Interest Rate")

Neither feature is normally distributed. Therefore we will fill the NaNs with the median values for both the loan interest rate and employment length features.
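
As a minimal sketch of why the median is the safer fill value for a skewed feature, the snippet below builds a small made-up series (the values are illustrative and not taken from the dataset) and compares the mean and the median before filling the gap.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample with one missing value (not real data)
emp_len = pd.Series([0, 1, 2, 2, 3, 4, 5, 41, np.nan])

mean_fill = emp_len.mean()      # pulled upward by the single large value
median_fill = emp_len.median()  # robust to the long right tail

print("skewness:", round(emp_len.skew(), 2))
print("mean:", mean_fill, "median:", median_fill)

# Assign the result back instead of relying on inplace=True
emp_len_filled = emp_len.fillna(median_fill)
```

A single extreme value is enough to drag the mean well above the typical observation, while the median stays close to where most of the data sits.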

In [None]:
emp_len_null = cr_data[cr_data['person_emp_length'].isnull()].index
int_rate_null = cr_data[cr_data['loan_int_rate'].isnull()].index

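# assign back rather than using inplace=True on a column, which newer pandas versions warn about
cr_data['person_emp_length'] = cr_data['person_emp_length'].fillna(cr_data['person_emp_length'].median())
cr_data['loan_int_rate'] = cr_data['loan_int_rate'].fillna(cr_data['loan_int_rate'].median())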

Now let's consider if there are outliers.

In [None]:
# check distribution of age and interest rate


colors = ["blue","red"]
plt.scatter(cr_data['person_age'], cr_data['loan_int_rate'],
            c = cr_data['loan_status'],
            cmap = mpl.colors.ListedColormap(colors), alpha=0.5)
plt.xlabel("Person Age")
plt.ylabel("Loan Interest Rate")
plt.title("Interest Rate vs Age")


Some individuals with loans are recorded as being over 120 years of age, and they are unlikely to apply for new loans in the future. Therefore we will remove individuals who exceed 100 years of age.

There are no apparent outliers in loan interest rates.
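
One common way to back up a visual check like this is the 1.5 × IQR rule. The sketch below applies it to a small made-up column; the helper name `iqr_outliers` and the sample ages are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical ages, including one implausible entry
ages = pd.Series([22, 25, 27, 31, 35, 40, 144])
mask = iqr_outliers(ages)
print(ages[mask])
```

The same helper could be pointed at `cr_data['loan_int_rate']` to see whether any rates fall outside the fences.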

In [None]:
# Clean 1
cr_clean1 = cr_data[cr_data['person_age']<=100]

cr_data[cr_data['person_age']>100]

The entries shown above have been removed, and the cleaned dataset is saved as `cr_clean1`.

In [None]:
pd.crosstab(cr_clean1['default_hist'], cr_clean1['loan_grade'])

There is no surprise here: the lender focuses on issuing higher-grade loans to clients with a better credit history and issues fewer loans to those with a worse credit history.

In [None]:
# note 0 is non default and 1 is default
default_hist_status_tab = pd.crosstab(cr_clean1['default_hist'], cr_clean1['loan_status'])
default_hist_status_tab

In [None]:
total1 = default_hist_status_tab.iloc[0].sum()
defaulted1 = default_hist_status_tab.iloc[0,1]

total2 = default_hist_status_tab.iloc[1].sum()
defaulted2 = default_hist_status_tab.iloc[1,1]

first_default = round(defaulted1 / total1 * 100, 2)
second_default = round(defaulted2 / total2 * 100, 2)

print("Despite the measures taken, {}% of clients defaulted for the first time.".format(first_default))
print("And {}% of clients who had previously defaulted, defaulted again.".format(second_default))

In [None]:
pd.crosstab(cr_clean1['default_hist'], cr_clean1['loan_intent'], 
            values = cr_clean1['loan_int_rate'], aggfunc = 'median')

Those who had not previously defaulted have a median loan interest rate 4 percentage points lower than those who have defaulted. Issuing a loan to a client who may default has negative outcomes not only for the lender but also long-term negative consequences for the client. We will use machine learning algorithms to improve credit risk modelling and reduce risk for both the lender and the client.

In [None]:
cr_clean1

In [None]:
# one hot encoding categorical variables
num_col = cr_clean1.select_dtypes(exclude = 'object')
char_col = cr_clean1.select_dtypes(include = 'object')

encoded_char_col = pd.get_dummies(char_col)

cr_clean2 = pd.concat([num_col, encoded_char_col], axis=1)
cr_clean2
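
To make the encoding step concrete, here is a toy frame run through the same split, encode and concatenate pattern. The column values are made up; only the pattern mirrors the cell above.

```python
import pandas as pd

# Hypothetical two-row frame with one numeric and one categorical column
toy = pd.DataFrame({
    "loan_amnt": [1000, 5000],
    "loan_grade": pd.Series(["A", "B"], dtype="object"),
})

num_part = toy.select_dtypes(exclude="object")
cat_part = toy.select_dtypes(include="object")

# Each category becomes its own indicator column, named <column>_<category>
encoded = pd.concat([num_part, pd.get_dummies(cat_part)], axis=1)
print(encoded.columns.tolist())
```

Passing `drop_first=True` to `pd.get_dummies` is an option if one indicator per feature should be dropped to avoid perfectly collinear columns.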

In [None]:
# Split Train and Test Sets
Y = cr_clean2['loan_status']
X = cr_clean2.drop('loan_status',axis=1)
 


x_train, x_test, y_train, y_test = model_selection.train_test_split(X, Y, random_state=2020, test_size=.30)

# Logistic regression classifier

log_clf = linear_model.LogisticRegression()

log_clf.fit(x_train, np.ravel(y_train))

In [None]:
col_effect = pd.DataFrame()
col_effect['col_names'] = X.columns
col_effect['col_coef'] = log_clf.coef_[0]
col_effect

This tells us that, for every one-unit increase in a column, the person is more likely to default when the coefficient is more positive and less likely to default when the coefficient is more negative.
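
A coefficient is easier to read once it is exponentiated into an odds ratio. The sketch below fits a logistic regression on a tiny made-up dataset (an assumption for illustration, not the credit data) and converts the coefficient.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, larger values tend to default (label 1)
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_toy = np.array([0, 0, 1, 0, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

coef = clf.coef_[0][0]
# Multiplicative change in the odds of default per one-unit increase in the feature
odds_ratio = np.exp(coef)
print("coefficient:", coef, "odds ratio:", odds_ratio)
```

An odds ratio above 1 means the odds of default rise as the feature increases; below 1 means they fall. Coefficients are only comparable across columns when the columns share a scale.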

In [None]:
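# the intercept is the log-odds of default when every feature is zero, not a probability
int_val = float(log_clf.intercept_[0])
print('Intercept (log-odds of default): {:.3f}'.format(int_val))
print('Baseline probability of default: {:.3%}'.format(1 / (1 + np.exp(-int_val))))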


We can use the intercept and coefficient values above to calculate the probability of default (P = 1) and non-default (P = 0).

First we need the sum of the intercept and each coefficient multiplied by its column value: `int_coef_sum = intercept + sum(col_coef * col_value)`.

Then we can calculate the probabilities of default and non-default with the logistic regression formula:

`prob_default = 1 / (1 + np.exp(-int_coef_sum))`

`prob_nondefault = 1 - prob_default`
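
The calculation can be checked against scikit-learn itself. This is a minimal sketch on made-up data standing in for `x_train` and `y_train`; the variable names follow the formulas in this section.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for x_train / y_train
X_toy = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0],
                  [4.0, 1.0], [5.0, 0.0], [6.0, 1.0]])
y_toy = np.array([0, 0, 1, 0, 1, 1])
clf = LogisticRegression().fit(X_toy, y_toy)

# int_coef_sum = intercept + sum(col_coef * col_value), computed for every row
int_coef_sum = clf.intercept_[0] + X_toy @ clf.coef_[0]

prob_default = 1 / (1 + np.exp(-int_coef_sum))
prob_nondefault = 1 - prob_default

# The manual result should agree with scikit-learn's own probabilities
print(np.allclose(prob_default, clf.predict_proba(X_toy)[:, 1]))
```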

In [None]:
# predict_proba returns two columns: P(non-default) and P(default)
# keep only the second column, the predicted probability of default == 1
predict_log = pd.DataFrame(log_clf.predict_proba(x_test)[:,1], columns=['prob_default'])

pred_df = pd.concat([y_test.reset_index(drop=True), predict_log],axis=1)
pred_df

In [None]:
# check the accuracy
initial_accuracy = round(log_clf.score(x_test,  y_test),2)
print("The initial accuracy is {}".format(initial_accuracy))

We want to check whether there is a more suitable threshold to improve our accuracy.
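
As a small illustration of what changing the threshold does, the sketch below turns a handful of made-up probabilities into class labels at three cutoffs and records accuracy and default recall for each. The labels and probabilities are hypothetical.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predicted default probabilities
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
prob_default = np.array([0.10, 0.35, 0.45, 0.40, 0.80, 0.20, 0.55, 0.60])

results = {}
for threshold in (0.3, 0.5, 0.7):
    # A loan is flagged as a default when its probability exceeds the threshold
    y_pred = (prob_default > threshold).astype(int)
    results[threshold] = (
        metrics.accuracy_score(y_true, y_pred),
        metrics.recall_score(y_true, y_pred),
    )
    print(threshold, results[threshold])
```

Lowering the threshold catches more defaults at the cost of more false alarms, so accuracy and default recall usually pull in different directions.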

In [None]:
thresh = np.linspace(0,1,21)
thresh

In [None]:
# recall for the default class at the 0.5 threshold (true labels first, predictions second)
metrics.recall_score(y_test, log_clf.predict(x_test))

In [None]:
def find_opt_thresh(predict, thr=thresh, y_true=y_test):
    """Return accuracy, default recall and non-default recall for each threshold."""
    def_recalls = []
    nondef_recalls = []
    accs = []

    for threshold in thr:
        # predicted labels at this threshold, kept local so the true
        # loan_status column in `predict` is not overwritten
        y_pred = (predict['prob_default'] > threshold).astype(int)

        accs.append(metrics.accuracy_score(y_true, y_pred))

        # element 1 of the result is the per-class recall: [non-default, default]
        stats = metrics.precision_recall_fscore_support(y_true, y_pred, zero_division=0)

        def_recalls.append(stats[1][1])
        nondef_recalls.append(stats[1][0])

    return accs, def_recalls, nondef_recalls

accs, def_recalls, nondef_recalls = find_opt_thresh(pred_df)

In [None]:
plt.plot(thresh,def_recalls)
plt.plot(thresh,nondef_recalls)
plt.plot(thresh,accs)
plt.xlabel("Probability Threshold")
plt.xticks(thresh, rotation = 'vertical')
plt.legend(["Default Recall","Non-default Recall","Model Accuracy"])
#plt.axvline(x=0.45, color='pink')
plt.show()


In [None]:
max_accuracy_index = accs.index(max(accs))

print('The maximum accuracy is {:.0%}.'.format(accs[max_accuracy_index]))
print('Therefore we should have a threshold of {:.0%}.'.format(thresh[max_accuracy_index]))

# Further optimize the accuracy level with PCA

In [None]:
cr_clean2

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

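scaler = MinMaxScaler(feature_range=(0, 1))
# exclude the target so the explained variance reflects the features only
data_rescaled = scaler.fit_transform(cr_clean2.drop('loan_status', axis=1))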

#Fitting the PCA algorithm with our Data
pca = PCA().fit(data_rescaled)
#Plotting the Cumulative Summation of the Explained Variance
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance (fraction)')
plt.title('Credit Risk Dataset Explained Variance')
plt.show()

From the figure above, I identified 14 components as the optimal number, giving the simplest model that retains the most information.
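
Reading the number of components off a plot can be complemented with a numeric rule. The sketch below picks the smallest number of components whose cumulative explained variance reaches a target share. The 0.95 target and the synthetic data are assumptions for illustration, not values used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 rows, 6 features, two of which nearly duplicate others
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 4))
X_toy = np.column_stack([
    base,
    base[:, 0] + 0.01 * rng.normal(size=200),
    base[:, 1] + 0.01 * rng.normal(size=200),
])

X_scaled = StandardScaler().fit_transform(X_toy)
cum_var = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

target = 0.95  # assumed cutoff; choose the share of variance that suits the problem
n_components = int(np.argmax(cum_var >= target) + 1)
print(n_components, cum_var.round(3))
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, which performs the same selection internally.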

In [None]:
# standardize features (zero mean, unit variance)
from sklearn import preprocessing
from sklearn.decomposition import PCA

pie = cr_clean2.drop('loan_status',axis=1)

data_scaled = pd.DataFrame(preprocessing.scale(pie),columns = pie.columns) 

# PCA
pca = PCA(n_components=14)
pca_val = pca.fit_transform(data_scaled)
pca_dataset = pd.DataFrame(pca_val)

In [None]:
x_train, x_test, y_train, y_test = model_selection.train_test_split(pca_dataset, Y, random_state=2020, test_size=.32)

# Logistic regression classifier on the PCA components

log_clf = linear_model.LogisticRegression()

log_clf.fit(x_train, np.ravel(y_train))

# predict_proba returns two columns: P(non-default) and P(default)
# keep only the second column, the predicted probability of default == 1
pca_predict_log = pd.DataFrame(log_clf.predict_proba(x_test)[:,1], columns=['prob_default'])

# use the PCA model's predictions here, not predict_log from the earlier model
pca_pred_df = pd.concat([y_test.reset_index(drop=True), pca_predict_log],axis=1)
pca_pred_df

pca_accuracy = round(log_clf.score(x_test,  y_test),2)
pca_accuracy
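
An alternative arrangement, sketched here on synthetic data, is to chain scaling, PCA and logistic regression in a scikit-learn pipeline so that the scaler and PCA are fit on the training split only. The dataset from `make_classification` is a hypothetical stand-in for the encoded credit data, and the 14 components simply echo the number chosen in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the encoded credit data
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, n_informative=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=2020
)

# Scaler and PCA are fit on the training split, then applied to the test split
pipe = make_pipeline(StandardScaler(), PCA(n_components=14), LogisticRegression())
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(round(test_accuracy, 2))
```

Keeping the preprocessing inside the pipeline means the test rows never influence the scaling or the components, which makes the reported accuracy a cleaner estimate.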


We have improved the accuracy of our model from **81% to 86%** by leveraging **Principal Component Analysis** and **hyperparameter tuning**.

The current credit assessment process has a default rate of 22%, as shown below. The new credit risk assessment algorithm, which we developed with principal component analysis and logistic regression, reduces the default rate from **22% to 14%** (1 - 0.86).

The **5% increase in accuracy resulted in an 8% reduction in defaulted loans**, minimizing the lender's risk and improving their confidence to lend credit.
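
The arithmetic behind these figures can be laid out in a few lines. This sketch uses the numbers quoted in this section and follows the text in taking the model's error rate (1 - accuracy) as the new default rate, which is an assumption of the comparison.

```python
baseline_accuracy = 0.81     # logistic regression before PCA
pca_accuracy = 0.86          # logistic regression on the PCA components
current_default_rate = 0.22  # default rate of the existing process

# Error rate of the new model, used in the text as its default rate
error_rate = round(1 - pca_accuracy, 2)
accuracy_gain = round(pca_accuracy - baseline_accuracy, 2)
default_reduction = round(current_default_rate - error_rate, 2)

print(error_rate, accuracy_gain, default_reduction)
```

Both differences are in percentage points: accuracy rises by 5 points and the quoted default rate falls by 8 points.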

In [None]:
round(default_hist_status_tab.iloc[:,1].sum() / pca_dataset.shape[0],2)
