In [1]:
import pandas as pd

# Credit scoring 

Credit scoring algorithms estimate the probability of default; banks use them to decide whether or not a loan should be granted.
Here I build a Random Forest to predict whether a customer will fail to repay their credit within 90 days.

In [2]:
# load the data
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/credit_scoring.csv'
credit_scoring = pd.read_csv(url)
credit_scoring.head()

Unnamed: 0,SeriousDlqin2yrs,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,NumberOfTimes90DaysLate,NumberOfTime60-89DaysPastDueNotWorse,MonthlyIncome,NumberOfDependents
0,0,64,0,0.249908,0,0,8158.0,0.0
1,0,58,0,3870.0,0,0,,0.0
2,0,41,0,0.456127,0,0,6666.0,0.0
3,0,43,0,0.00019,0,0,10500.0,2.0
4,1,49,0,0.27182,0,0,400.0,0.0


**Data Description**

| Feature | Description |
| :- | -: |
| SeriousDlqin2yrs (target variable) | Customer hasn't paid the loan debt within 90 days |
| age | Customer age |
| DebtRatio | Total monthly loan payments (loan, alimony, etc.) / total monthly income (percentage) |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times the customer was 30-59 days past due (not worse) on other loans during the last 2 years |
| NumberOfTimes90DaysLate | Number of times the customer was 90+ days past due (dpd) on other credits |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times the customer was 60-89 days past due (not worse) during the last 2 years |
| NumberOfDependents | Number of customer dependents |


**Goal:** train a Random Forest classifier that predicts the target column (`SeriousDlqin2yrs`), tune the Random Forest hyperparameters, and test the performance of the classification model, using `recall` and `accuracy` as evaluation metrics.
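
Both metrics can be read off a confusion matrix. The following is a minimal sketch on made-up labels (`y_true` and `y_pred` are hypothetical and do not come from the credit data), checking the hand-computed values against scikit-learn's own functions.

```python
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

# Hypothetical labels: 1 = serious delinquency, 0 = no delinquency
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
recall = tp / (tp + fn)                     # share of actual positives that are detected

print(accuracy, accuracy_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

In a credit setting, recall is the fraction of actual defaulters that the model flags, which is why it is reported alongside accuracy.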

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn import set_config
set_config(display='diagram')

In [4]:
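# Features: every column except the target.
# Note: 'Unnamed: 0' is the row index stored in the CSV and is still included as a feature here.
x=credit_scoring.drop(['SeriousDlqin2yrs'],axis=1)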

In [5]:
y=credit_scoring.SeriousDlqin2yrs

In [6]:
pipe= Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler',MinMaxScaler()),
    ('tree_clf',RandomForestClassifier())
])
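
To see what the two preprocessing steps do before the forest sees the data, here is a small sketch on a toy column. The array `toy` is hypothetical; it stands in for a column such as `MonthlyIncome`, which has a missing value in the preview above.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy column with one missing value
toy = np.array([[1000.0], [np.nan], [3000.0], [5000.0]])

# Same first two steps as the pipeline above, without the classifier
prep = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # NaN -> mean of observed values
    ('scaler', MinMaxScaler()),                   # rescale each column to [0, 1]
])

out = prep.fit_transform(toy)
print(out.ravel())
```

The missing entry is replaced by the mean of the observed values (3000), and the column is then mapped to `[0, 0.5, 0.5, 1]`.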

In [7]:
# Default split: 75% train / 25% test. No random_state is set, so results vary between runs.
x_train,x_test,y_train,y_test=train_test_split(x,y)
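
A possible variation, not what this notebook does: passing `stratify` and `random_state` makes the split reproducible and keeps the class proportions equal in both parts. The sketch below uses made-up data (`X_demo`, `y_demo` with 10% positives are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 200 rows, 20 positives (10%)
X_demo = np.arange(400).reshape(200, 2)
y_demo = np.array([1] * 20 + [0] * 180)

# stratify keeps the 10% positive rate in both train and test;
# random_state fixes the shuffle so the split is repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```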

In [8]:
pipe.fit(x_train,y_train)

In [10]:
param_dic ={'tree_clf__n_estimators':[5,10,25,50,100,200],
            'tree_clf__max_depth':[2,5,10,20],
            'tree_clf__min_samples_split':[2,4,8,16,32]}


In [11]:
grid=GridSearchCV(pipe,
                  param_dic,
                  cv=10,
                  scoring='accuracy',
                  n_jobs=-1,verbose=1)

In [12]:
grid.fit(x_train,y_train)

Fitting 10 folds for each of 120 candidates, totalling 1200 fits
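
The counts in this log follow directly from the grid: 6 × 4 × 5 parameter combinations, each fitted once per fold. A quick check with `ParameterGrid`, repeating the grid from above so the snippet runs on its own:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as param_dic above
param_dic = {'tree_clf__n_estimators': [5, 10, 25, 50, 100, 200],
             'tree_clf__max_depth': [2, 5, 10, 20],
             'tree_clf__min_samples_split': [2, 4, 8, 16, 32]}

n_candidates = len(ParameterGrid(param_dic))  # number of distinct combinations
n_folds = 10                                  # cv=10 in the grid search
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)
```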


In [13]:
grid.best_params_

{'tree_clf__max_depth': 20,
 'tree_clf__min_samples_split': 32,
 'tree_clf__n_estimators': 200}

In [14]:
best_clf=grid.best_estimator_

In [15]:
# Optional: GridSearchCV already refits the best estimator on x_train (refit=True by default)
best_clf.fit(x_train,y_train)
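
As a small illustration of the default `refit=True` behaviour, the sketch below runs a tiny grid search on synthetic data (`X_demo`, `y_demo`, and the reduced grid are assumptions, chosen only to keep it fast) and uses `best_estimator_` without calling `fit` on it again.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the credit dataset
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

search = GridSearchCV(RandomForestClassifier(n_estimators=10, random_state=0),
                      {'max_depth': [2, 5]},
                      cv=3)
search.fit(X_demo, y_demo)

# With refit=True (the default), best_estimator_ is already trained on all of X_demo,
# so it can predict right away; search.predict delegates to it
preds_search = search.predict(X_demo[:5])
preds_best = search.best_estimator_.predict(X_demo[:5])
same = bool((preds_search == preds_best).all())
print(search.best_params_, same)
```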

In [20]:
y_test_pred= best_clf.predict(x_test)

In [21]:
accuracy_score(y_test,y_test_pred)

0.836321675838807

In [22]:
recall_score(y_test,y_test_pred)

0.47027687296416937
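
Accuracy and recall can diverge like this when the positive class is rare. As a hedged illustration on made-up labels (the 90/10 class balance is an assumption, not a measurement of this dataset), a baseline that always predicts the majority class scores high accuracy while catching no positives at all:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 90 negatives, 10 positives
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))  # features are ignored by this baseline

# Always predicts the most frequent class (0 here)
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
pred = baseline.predict(X_demo)

acc = accuracy_score(y_demo, pred)
rec = recall_score(y_demo, pred)
print(acc, rec)
```

Comparing the tuned forest against such a baseline on the real test set would show how much of the accuracy above is due to class imbalance alone.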