Improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years. The goal of this analysis is to build a model that borrowers can use to help make the best financial decisions.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

Import Data

In [None]:
#test data
test = pd.read_csv("../input/GiveMeSomeCredit/cs-test.csv")

In [None]:
#train data
training = pd.read_csv("../input/GiveMeSomeCredit/cs-training.csv")

In [None]:
#data dictionary
data_dictionary = pd.read_excel("../input/GiveMeSomeCredit/Data Dictionary.xls")

In [None]:
training.head(10)

<br>

Exploratory Data Analysis

In [None]:
training.isna().sum()

MonthlyIncome and NumberOfDependents contain null values, so the data needs cleaning before modeling.
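
To judge how much cleaning is needed, the raw null counts can be put next to the share of rows they represent. The snippet below is a minimal sketch: `missing_summary` is a hypothetical helper and `toy` is made-up data standing in for `training`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Hypothetical toy frame (not the real data).
toy = pd.DataFrame({
    "MonthlyIncome": [5000.0, np.nan, 3200.0, np.nan],
    "NumberOfDependents": [0.0, 2.0, np.nan, 1.0],
    "age": [45, 38, 61, 29],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, the same call would be `missing_summary(training)`.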

In [None]:
test.isna().sum()

In [None]:
sns.barplot(x=training['SeriousDlqin2yrs'].value_counts().index,y=training['SeriousDlqin2yrs'].value_counts())
plt.title("Distribution of the Defaulters in the data")

**ANALYSIS OF AGE**

In [None]:
print("Minimum Age",training.age.min())
print("Maximum Age",training.age.max())
print("Median Age",training.age.median())
print("Mean Age",training.age.mean())
print("Mode Age",training.age.mode()[0])

sns.distplot(training['age'],bins=10)

In [None]:
training.loc[training['age'] == 0, 'age']

The data includes a record with age = 0, which is not a valid age. We replace it with the modal age of 49.

In [None]:
training.loc[training['age'] == 0, 'age'] = training.age.mode()[0]

Checking the age distribution for defaulters and non-defaulters

In [None]:
default_0 = training[training['SeriousDlqin2yrs'] == 0]
sns.distplot(default_0['age'],bins=7)

In [None]:
default_1 = training[training['SeriousDlqin2yrs'] == 1]
sns.distplot(default_1['age'],bins=7)

Most defaulters in the data fall in the 35 to 55 age group.
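
A histogram of defaulters shows where most defaulters sit, but that also depends on how many borrowers each age group contains. As a rough sketch, the count and the default rate per age band can be compared side by side. The function name, the band edges and the toy data are assumptions for illustration only.

```python
import pandas as pd


def default_rate_by_age(df: pd.DataFrame, bins=(0, 35, 55, 120)) -> pd.DataFrame:
    """Number of borrowers, number of defaulters and default rate per age band."""
    # Left-closed bands: [0, 35), [35, 55), [55, 120)
    bands = pd.cut(df["age"], bins=list(bins), right=False)
    grouped = df.groupby(bands, observed=False)["SeriousDlqin2yrs"]
    return grouped.agg(n="size", defaulters="sum", default_rate="mean")


# Hypothetical toy data (not the real training set).
toy = pd.DataFrame({
    "age": [25, 30, 40, 45, 50, 60, 70, 33],
    "SeriousDlqin2yrs": [1, 0, 1, 0, 0, 0, 0, 1],
})
by_age = default_rate_by_age(toy)
print(by_age)
```

A band can hold the most defaulters simply because it holds the most borrowers, so the `default_rate` column is the fairer comparison between groups.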

DebtRatio vs Age

In [None]:
sns.scatterplot(x=training['DebtRatio'],y=training['age'])

DebtRatio vs RevolvingUtilizationOfUnsecuredLines

In [None]:
sns.scatterplot(x=training['DebtRatio'],y=training['RevolvingUtilizationOfUnsecuredLines'])

NumberOfOpenCreditLinesAndLoans vs Monthly Income

In [None]:
training['MonthlyIncome'].describe()

In [None]:
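# Despite the variable name, the cutoff is the 99th percentile of MonthlyIncome, not a fixed 10,000
monthly_income_less_10000 = training[training['MonthlyIncome'] < training['MonthlyIncome'].quantile(0.99)]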

In [None]:
sns.scatterplot(x=monthly_income_less_10000['MonthlyIncome'],y=monthly_income_less_10000['NumberOfOpenCreditLinesAndLoans'])

**Correlation Plot:**

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
corr = training.corr()
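# np.bool was removed from recent NumPy releases; the built-in bool works everywhere.
# An all-False mask hides nothing, so the full matrix is drawn.
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)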

NumberOfTimes90DaysLate is highly correlated with NumberOfTime60-89DaysPastDueNotWorse, as both variables count how often a borrower has been past due.
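
Reading pairs off a heatmap by eye is error-prone, so the strongly correlated pairs can also be listed directly. This is a small sketch: `correlated_pairs`, the 0.8 threshold and the toy values are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Hypothetical toy data: the first two columns move together, the third does not.
toy = pd.DataFrame({
    "NumberOfTimes90DaysLate": [1, 2, 3, 4, 5],
    "NumberOfTime60-89DaysPastDueNotWorse": [2, 4, 6, 8, 11],
    "age": [5, 1, 4, 2, 3],
})
pairs = correlated_pairs(toy)
print(pairs)
```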

<br>

**MONTHLY INCOME**

In [None]:
print("Minimum MonthlyIncome",training.MonthlyIncome.min())
print("Maximum MonthlyIncome",training.MonthlyIncome.max())
print("Median MonthlyIncome",training.MonthlyIncome.median())
print("Mean MonthlyIncome",training.MonthlyIncome.mean())
print("Mode MonthlyIncome",training.MonthlyIncome.mode()[0])
print("Null Values",training.MonthlyIncome.isna().sum())

sns.distplot(training['MonthlyIncome'])

Because MonthlyIncome has many outliers, we replace its nulls with the median rather than the mean.

In [None]:
training.loc[training.MonthlyIncome.isna(),'MonthlyIncome'] = training.MonthlyIncome.median()
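
To see why the median is the safer fill value when outliers are present, here is a tiny illustration on made-up incomes (the numbers are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one extreme outlier and one missing value.
income = pd.Series([2500.0, 3000.0, 3500.0, 4000.0, np.nan, 1_000_000.0])

mean_fill = income.fillna(income.mean())      # the fill value is dragged up by the outlier
median_fill = income.fillna(income.median())  # the fill value stays near the typical incomes

print("filled with mean:  ", mean_fill.iloc[4])
print("filled with median:", median_fill.iloc[4])
```

A single extreme income is enough to move the mean far from any typical value, while the median barely changes.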

<br>

**Number of Dependents**

In [None]:
print("Minimum NumberOfDependents",training.NumberOfDependents.min())
print("Maximum NumberOfDependents",training.NumberOfDependents.max())
print("Median NumberOfDependents",training.NumberOfDependents.median())
print("Mean NumberOfDependents",training.NumberOfDependents.mean())
print("Mode NumberOfDependents",training.NumberOfDependents.mode()[0])
print("Null Values",training.NumberOfDependents.isna().sum())

sns.distplot(training['NumberOfDependents'])

Because NumberOfDependents has many outliers, we replace its nulls with the median, which is 0.

In [None]:
training.loc[training.NumberOfDependents.isna(),'NumberOfDependents'] = training.NumberOfDependents.median()

In [None]:
sns.barplot(x=training['NumberOfTime30-59DaysPastDueNotWorse'].value_counts().index,y=training['NumberOfTime30-59DaysPastDueNotWorse'].value_counts())

In [None]:
training.isna().sum()

Now that there are no null values, we can start modeling.


**XGBoost Classifier**

In [None]:
# Imports for splitting, modeling and evaluation
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import auc, accuracy_score, confusion_matrix, mean_squared_error,roc_curve

In [None]:
y = training.loc[:,training.columns.isin(['SeriousDlqin2yrs'])]
X_attributes=[
       'RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome',
       'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate',
       'NumberRealEstateLoansOrLines',
       'NumberOfDependents'] # Excluding NumberOfTime60-89DaysPastDueNotWorse because of strong collinearity
X = training.loc[:,training.columns.isin(X_attributes)]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
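
The split above is purely random. When the positive class is rare, one option (not used in this notebook) is to pass `stratify` so that both parts keep the same share of defaulters. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 10% positives.
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([1] * 10 + [0] * 90, name="SeriousDlqin2yrs")

# stratify=y_toy keeps the class proportions equal in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)
print("train positive share:", y_tr.mean())
print("test positive share: ", y_te.mean())
```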

In [None]:
xgb_model = xgb.XGBClassifier(objective="binary:logistic",random_state=42)

In [None]:
xgb_model.fit(X_train,y_train.values.ravel())

In [None]:
y_pred = xgb_model.predict(X_test)
y_probab = xgb_model.predict_proba(X_test)

In [None]:
# scikit-learn's signature is accuracy_score(y_true, y_pred)
accuracy_score(y_test, y_pred)
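
Accuracy can look good on imbalanced labels even when the model separates nothing, which is why the ROC curve plotted further down is a useful complement. The sketch below uses made-up labels and scores to show the effect. `roc_auc_score` is a scikit-learn function that the original imports do not include.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 2 defaulters out of 10 borrowers.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# A "model" that always predicts non-default still gets high accuracy...
always_zero = np.zeros(10, dtype=int)
acc = accuracy_score(y_true, always_zero)

# ...but a constant score carries no ranking information.
auc_constant = roc_auc_score(y_true, np.full(10, 0.1))

# Scores that rank both defaulters above everyone else.
good_scores = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.3, 0.9, 0.8])
auc_good = roc_auc_score(y_true, good_scores)

print(acc, auc_constant, auc_good)
```

In this notebook, the corresponding call would be `roc_auc_score(y_test, y_probab[:, 1])`.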

In [None]:
#Feature Importance Plot
feature_important = xgb_model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())
data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by = "score", ascending=False)
data.plot(kind='barh')

In [None]:
def plot_roc(y_test,probs):
    fpr, tpr, thresholds = roc_curve(y_test, probs)
    plt.plot([0, 1], [0, 1], linestyle='--')
    plt.plot(fpr, tpr, marker='.')
    plt.title("ROC curve")
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate')
    plt.show()

In [None]:
plot_roc(y_test,y_probab[:,1])

Making predictions on the test data

In [None]:
test.isna().sum()

In [None]:
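# Fill missing values in the test set using the test set's own statistics
test.loc[test['MonthlyIncome'].isna(),'MonthlyIncome'] = test['MonthlyIncome'].median()
# mode() returns a Series; take its first element so a scalar is assigned to every null row
test.loc[test['NumberOfDependents'].isna(),'NumberOfDependents'] = test['NumberOfDependents'].mode()[0]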
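
The cell above fills the test set from the test set's own statistics. A common alternative, shown here only as a sketch and not as what this notebook does, is to compute the fill values once on the training data and reuse them, so both sets are processed identically. The helper names and toy frames are hypothetical.

```python
import numpy as np
import pandas as pd


def fit_fill_values(train: pd.DataFrame, columns) -> dict:
    """Learn one fill value (the median) per column from the training data."""
    return {col: train[col].median() for col in columns}


def apply_fill_values(df: pd.DataFrame, fill_values: dict) -> pd.DataFrame:
    """Return a copy of `df` with nulls replaced column by column."""
    return df.fillna(value=fill_values)


# Hypothetical toy frames (not the real data).
train_toy = pd.DataFrame({
    "MonthlyIncome": [3000.0, 5000.0, np.nan, 4000.0],
    "NumberOfDependents": [0.0, 0.0, 2.0, np.nan],
})
test_toy = pd.DataFrame({
    "MonthlyIncome": [np.nan, 6000.0],
    "NumberOfDependents": [1.0, np.nan],
})

fills = fit_fill_values(train_toy, ["MonthlyIncome", "NumberOfDependents"])
test_filled = apply_fill_values(test_toy, fills)
print(fills)
print(test_filled)
```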

In [None]:
test_proba = xgb_model.predict_proba(test.loc[:,test.columns.isin(X_attributes)])

In [None]:
len(np.arange(1,len(test_proba)+1))

In [None]:
len(test_proba)

In [None]:
df = pd.DataFrame({'Id':np.arange(1,len(test_proba)+1),'Probability':test_proba[:,1]})

In [None]:
#Test data predictions
df

In [None]:
!pwd

In [None]:
df.to_csv('submission.csv', index = False)

MODEL INTERPRETATION WITH SHAP

In [None]:
import shap

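# Workaround that appears aimed at a SHAP/XGBoost version mismatch: drop the
# first 4 bytes of the raw model dump and make save_raw() return the trimmed
# bytes. With compatible library versions this patch may be unnecessary.
mybooster=xgb_model.get_booster()

model_bytearray = mybooster.save_raw()[4:]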

def myfun(self=None):
    return model_bytearray

mybooster.save_raw = myfun

In [None]:
explainerXGB = shap.TreeExplainer(mybooster)

In [None]:
shap_values = explainerXGB.shap_values(X_train.loc[:,X_train.columns.isin(feature_important)])
shap.summary_plot(
    shap_values,
    X_train.loc[:,X_train.columns.isin(feature_important)],
    max_display=110,
    show=True,
)

The SHAP summary plot above for the XGBoost model suggests the following:

- The higher the RevolvingUtilizationOfUnsecuredLines, the higher the probability of default.
- The more times a borrower has been past due, the higher the probability of default.
- The lower the age, the higher the likelihood of default.
- More open credit lines and loans go with a higher probability of default.
- The lower the monthly income, the higher the chance of default.
- More dependents and more real estate loans go with a higher probability of default.