# Home Credit Python Scoring Workflow v.0.2.0

**Contributors:**
- Pavel Sůva (HCI Research & Development)
- Marek Teller (HCI Research & Development)
- Martin Kotek (HCI Research & Development)
- Sergey Gerasimov (HCRU Big Data & Scoring)
- Valentina Kalenichenko (HCRU Big Data & Scoring)

## Import packages
- time, datetime - ability to get current time for logs
- math - basic mathematical functions (logarithm etc.)
- random - generate random selection from probability distributions
- NumPy - for scientific, mathematical, numerical calculations
- Pandas - for efficient work with large data structures
- cx_Oracle and sqlalchemy - for loading data from Oracle database (DWH etc.)
- statsmodels - library with some statistical functions and models
- scikit-learn - all important machine learning (and statistical) algorithms used for training the models
- matplotlib - for plotting the charts
- seaborn - for statistical visualisations
- scoring - functions and objects from scoring.py (part of our scoring workflow)

In [None]:
import time
import datetime
import math
import random
import numpy as np
import pandas as pd
import cx_Oracle
from sqlalchemy import create_engine
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.utils import as_float_array
from sklearn.utils.validation import check_is_fitted
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import scoring

In [None]:
%matplotlib inline
sns.set()
from IPython.display import display
pd.options.display.max_columns = None
pd.options.display.max_rows = 15

## Import data
Importing data from a CSV file. It is important to set the following parameters:

- encoding: usually 'utf-8', or 'windows-xxxx' on Windows machines, where xxxx is 1250 for Central Europe, 1251 for Cyrillic etc.
- sep: column separator used in the file
- decimal: decimal separator (dot or comma)
- index_col: the column used as index; it should be the unique credit case identifier

In [None]:
data = pd.read_csv(r'C:/Analyses/HQ_20171213_PythonWorkflow/ExampleData2.csv', sep = ',', decimal = '.', 
                   encoding = 'windows-1251', index_col = 'ID', low_memory = False)
print('Data loaded on',datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S'))
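
The effect of these parameters can be checked on a small in-memory file. The sketch below is illustrative only: the file content and column names are invented, and `io.StringIO` stands in for a real path, so `encoding` is not exercised.

```python
import io
import pandas as pd

# Hypothetical file content: semicolon-separated columns, decimal comma.
raw = "ID;INCOME;REGION\n101;1250,50;North\n102;980,00;South\n"

sample = pd.read_csv(io.StringIO(raw), sep=';', decimal=',', index_col='ID')

print(sample.dtypes)            # INCOME should be parsed as a float column
print(sample.index.is_unique)   # the index should identify each case uniquely
```

If `decimal` is set incorrectly, a column such as INCOME is typically read as text, which would later make it look like a categorical predictor.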

Optionally, the data can also be loaded from a database. The function read_sql caches its results, so the data doesn't have to be downloaded from the database repeatedly. The cache is stored in a new folder called **db_cache**.

In [None]:
#engine = create_engine('oracle://PAVELS[GP_HQ_RISK]:password@(description=(address=(protocol=tcp)(host=dbdwhru.homecredit.ru)(port=1521))(connect_data=(sid=DWHRU)))', echo=False)

In [None]:
#from scoring.db import read_sql
#data = read_sql('select * from owner_dwh.f_application_tt where rownum=1',engine, index_col = 'sk_application')
#print('Data loaded on',datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S'))

If you need to download data from the database again (and not from cache), use the parameter refresh:

In [None]:
#from scoring.db import read_sql
#data = read_sql('select * from owner_dwh.f_application_tt where rownum=1',engine, index_col = 'sk_application', refresh=True)
#print('Data loaded on',datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S'))

In [None]:
print('Number of rows:',data.shape[0])
print('Number of columns:',data.shape[1])

In [None]:
data.head(5)

## Metadata definitions
Assign the names of the time, month, day, target and base columns. The month and day columns don't have to exist in the dataset; they are created later in this workflow.

In [None]:
#name of the time column
col_time = "TIME"
#name of the month column
col_month = "MONTH"
#name of the day column
col_day = "DAY"
#name of the target column
col_target = "DEF"
#name of the base column
col_base = "BASE"

If your data set has no base column, the following code adds it (value 1 for each observation). **Otherwise, don't run it**, because it overwrites the existing column.

In [None]:
data[col_base] = 1
print('Column',col_base,'added/modified. Number of columns:',data.shape[1])

Create the day and month columns from the time column:
- parse the time column using the format in which the time is stored
- format it as a year-month-day string (YYYYMMDD)
- convert the string to a number
- add the result to the dataset as the day column
- truncate it to year and month (YYYYMM) and add it to the dataset as the month column

In [None]:
data.loc[:,col_day] = pd.to_numeric(pd.to_datetime(data[col_time], format='%Y-%m-%d %H:%M:%S').dt.strftime('%Y%m%d'))
data[col_month] = data[col_day].apply(lambda x: math.trunc(x/100))
print('Columns',col_day,'and',col_month,'added/modified. Number of columns:',data.shape[1])
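
An equivalent way to derive both columns uses integer arithmetic instead of string formatting and a row-wise `apply`. This is a sketch on invented toy data, not a replacement prescribed by the workflow; the literal column names only mirror the values of `col_time`, `col_day` and `col_month`.

```python
import pandas as pd

# Hypothetical toy data with the same timestamp format as above.
toy = pd.DataFrame({'TIME': ['2017-05-31 23:59:59', '2017-06-01 00:00:00']})

ts = pd.to_datetime(toy['TIME'], format='%Y-%m-%d %H:%M:%S')
toy['DAY'] = ts.dt.year * 10000 + ts.dt.month * 100 + ts.dt.day   # YYYYMMDD as integer
toy['MONTH'] = toy['DAY'] // 100                                   # YYYYMM as integer

print(toy)
```

Integer floor division keeps the month column as an integer type, which is convenient when it is later used as a grouping key.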

Load the list of predictors from a CSV file. The file should have just one column, without any header, containing the names of the variables that should be used as predictors.

In [None]:
cols_pred = list(pd.read_csv(r'C:/Analyses/HQ_20171213_PythonWorkflow/ExamplePredList.csv', sep = ',', decimal = '.', 
                   encoding = 'windows-1251', low_memory = False, header = None)[0])

cols_pred_cat = list(set([c[0] for c in list(zip(data.columns, data.dtypes)) if c[1]=='O']) & set(cols_pred))
cols_pred_num = list(set([c[0] for c in list(zip(data.columns, data.dtypes)) if c[1]!='O']) & set(cols_pred))

# ALTERNATIVELY, DEFINE THE PREDICTOR NAMES MANUALLY

#cols_pred_num = ["Numerical_1","Numerical_2","Numerical_3","Numerical_4","Numerical_5"]
#cols_pred_cat = ["Categorical_1","Categorical_2","Categorical_3","Categorical_4","Categorical_5"]


cols_pred = cols_pred_num + cols_pred_cat

print(len(cols_pred_num),'numerical predictors:')
for p in cols_pred_num: print(p)
print(len(cols_pred_cat),'categorical predictors:')
for p in cols_pred_cat: print(p)

## Data exploration

In [None]:
descrip = data.describe(include='all').transpose()
pd.options.display.max_rows = 1000
display(descrip)
pd.options.display.max_rows = 15

The **explore_df** function creates a simple text report about the important variables. The report can then be printed either to the screen or to a file.

In the following code, only the part of the data with col_base = 1 is analyzed. You can remove the condition if you wish.

In [None]:
from scoring.data_exploration import explore_df
st = explore_df(data[data[col_base]==1],col_month,col_target,cols_pred)
# write the report to a file; the with-block closes the file afterwards
with open("data_exp.txt", "w") as f:
    print(st, file=f)
print(st)

**Default rate in time**: Simple visualisation of observation count and default rate in time

In [None]:
from scoring.plot import plot_dataset
plot_dataset(data,col_month,col_target)

## Data split

- Split the data into four parts (in-time training, in-time validation, in-time test, out-of-time).
- Add a new column indicating to which part each observation belongs.
- The split parameters are set at the beginning of the code.

In [None]:
share_train = 0.7
share_validation = 0.3
last_intime_day = 20170531

data['random_value'] = 1
data['random_value'] = data['random_value'].apply(lambda x: random.uniform(0, 1)) 

data.loc[(data['random_value']<=share_train)&(data[col_day]<=last_intime_day),'data_type'] = 'train'
data.loc[(data['random_value']>share_train)&(data['random_value']<=share_train+share_validation)&(data[col_day]<=last_intime_day),
      'data_type'] = 'valid'
data.loc[(data['random_value']>share_train+share_validation)&(data[col_day]<=last_intime_day),'data_type'] = 'test'
data.loc[(data[col_day]>last_intime_day),'data_type'] = 'oot'

data= data.drop(['random_value'],axis = 1)

train_mask = (data.data_type == 'train')& (data[col_base] == 1) 
valid_mask = (data.data_type == 'valid')& (data[col_base] == 1) 
test_mask = (data.data_type == 'test')& (data[col_base] == 1) 
oot_mask = (data.data_type == 'oot')& (data[col_base] == 1) 

print('Train observations:',data[train_mask].shape[0])
print('Validation observations:',data[valid_mask].shape[0])
print('Test observations:',data[test_mask].shape[0])
print('Out-of-time observations:',data[oot_mask].shape[0])
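
Note that with `share_train = 0.7` and `share_validation = 0.3` as set above, the shares sum to 1 and the test part stays empty. The split above also draws from an unseeded generator, so it changes on every run. The following sketch shows one possible reproducible variant using a seeded NumPy generator; the toy data, the seed and the 0.6/0.2 shares are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 1000 observations with a numeric DAY column.
rng = np.random.default_rng(42)           # fixed seed: the same split on every run
toy = pd.DataFrame({'DAY': rng.choice([20170515, 20170531, 20170610], size=1000)})

share_train, share_validation = 0.6, 0.2  # the remaining 0.2 goes to the test part
last_intime_day = 20170531

u = rng.uniform(0, 1, size=len(toy))
in_time = (toy['DAY'] <= last_intime_day).to_numpy()

# Conditions are evaluated in order; the first one that holds decides the label.
toy['data_type'] = np.select(
    [~in_time, u <= share_train, u <= share_train + share_validation],
    ['oot', 'train', 'valid'],
    default='test',
)
print(toy['data_type'].value_counts())
```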

## Grouping and WOE transformation of variables

Don't use variables that have only 1 unique level. Grouping doesn't work for them.

In [None]:
descrip_train = data.loc[train_mask,cols_pred].describe(include='all').transpose()

# comment the following 2 rows if there are no numerical predictors
del_num = set(descrip_train[descrip_train['min']==descrip_train['max']].index)
cols_pred_num = list(set(cols_pred_num) - del_num)

# comment the following 2 rows if there are no categorical predictors
del_cat = set(descrip_train[descrip_train['unique']==1].index)
cols_pred_cat = list(set(cols_pred_cat) - del_cat)

cols_pred = cols_pred_num + cols_pred_cat
print('Variables',list(del_num),',',list(del_cat),'will not be further used as they have only 1 unique level.')

Automatic grouping of numerical and categorical variables.

In [None]:
from scoring.grouping import Grouping

grouping = Grouping(columns = cols_pred,group_count=5, min_samples=100) 

Grouping is fitted on training data and applied to the full data set.

In [None]:
grouping.fit(data[train_mask][cols_pred],data[train_mask][col_target])
data_woe = grouping.transform(data)
if len(grouping.bins_data_) > 0:
    for v,g in grouping.bins_data_.items():
        print('Variable:',v)
        print('Bins:',g['bins'])
        print('WOEs:',g['woes'])
        print('nan WOE:',g['nan_woe'])
        print()
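
The `Grouping` class is part of scoring.py and its internals are not shown here. As a rough illustration of what a WOE value represents, the sketch below computes it for an already-binned toy predictor. It assumes the common convention WOE = ln(share of goods / share of bads), under which a higher WOE means lower risk; the actual implementation may differ, for example in smoothing or in the handling of empty bins.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one already-binned predictor and a 0/1 default flag.
toy = pd.DataFrame({
    'bin': ['A'] * 100 + ['B'] * 100,
    'DEF': [1] * 10 + [0] * 90 + [1] * 30 + [0] * 70,
})

bads = toy.groupby('bin')['DEF'].sum()
goods = toy.groupby('bin')['DEF'].count() - bads

# Assumed convention: WOE = ln(share of all goods in the bin / share of all bads in the bin)
woe = np.log((goods / goods.sum()) / (bads / bads.sum()))
print(woe)
```

Bin A holds a larger share of the goods than of the bads, so its WOE is positive; bin B is the opposite.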

Save grouping to an external file.

In [None]:
model_filename = 'woes'
grouping.save(model_filename)
print('Grouping data saved to file',model_filename)

Load the grouping from a file (don't forget to set the right filename) and add the WOE columns to the original dataset.

In [None]:
#model_filename = 'woes'
#grouping.load(model_filename)

Plot the fitted WOEs

In [None]:
%matplotlib inline

if len(grouping.bins_data_) > 0:
    for v,g in grouping.bins_data_.items():
        bin_names = []
        bin_woes = []
        for j in range(0,len(g['bins'])):
            if (g['bins'].dtype == 'float64') and (j < len(g['bins'])-1):
                bin_names.append(str(round(g['bins'][j],2))+' - '+str(round(g['bins'][j+1],2)))
                bin_woes.append(g['woes'][j])
            elif (g['bins'].dtype != 'float64'):
                bin_names.append(g['bins'][j])
                bin_woes.append(g['woes'][j])
        bin_names.append('nan')
        bin_woes.append(g['nan_woe'])
        plt.figure(figsize = (10,3))
        plt.plot(np.arange(len(bin_woes)),bin_woes, marker='o')#,'o')
        plt.xticks(np.arange(len(bin_woes)), bin_names, rotation = 90)
        plt.title(v)
        plt.show()

Add WOE variables to the data set.

In [None]:
data_woe = grouping.transform(data)
for c in data_woe:
    if c+'_WOE' in data:
        data = data.drop(c+'_WOE', axis=1)
        print('Column',c+'_WOE','dropped as it already existed in the data set.')
data = data.join(data_woe,rsuffix='_WOE')
print('Added WOE variables. Number of columns:',data.shape[1])

## Predictor power analysis

Calculate the IV and Gini of each predictor and sort the predictors by their power. The power is calculated for each of the samples (train, validate, test, OOT). If one or more of the samples are empty, comment out the corresponding part of the code.

In [None]:
cols_woe = [s + '_WOE' for s in cols_pred]

In [None]:
from scoring.metrics import iv,gini,lift

power_tab = []
for j in range(0,len(cols_woe)):
    power_tab.append({'Name':cols_woe[j],
                    'IV Train':iv(data.loc[train_mask,col_target],data.loc[train_mask,cols_woe[j]]),
                    'Gini Train':gini(data.loc[train_mask,col_target],-data.loc[train_mask,cols_woe[j]]),
                    'IV Validate':iv(data.loc[valid_mask,col_target],data.loc[valid_mask,cols_woe[j]]),
                    'Gini Validate':gini(data.loc[valid_mask,col_target],-data.loc[valid_mask,cols_woe[j]]),
                    #'IV Test':iv(data.loc[test_mask,col_target],data.loc[test_mask,cols_woe[j]]),
                    #'Gini Test':gini(data.loc[test_mask,col_target],-data.loc[test_mask,cols_woe[j]]),
                    'IV OOT':iv(data.loc[oot_mask,col_target],data.loc[oot_mask,cols_woe[j]]),
                    'Gini OOT':gini(data.loc[oot_mask,col_target],-data.loc[oot_mask,cols_woe[j]])
                         })
power_out = pd.DataFrame.from_records(power_tab)
power_out = power_out.set_index('Name')
power_out = power_out.sort_values('Gini Train',ascending=False)

pd.options.display.max_rows = 1000
display(power_out)
pd.options.display.max_rows = 15
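
The `iv` and `gini` functions come from scoring.py. For orientation, the sketch below uses the textbook definitions, which may differ in details from that implementation: information value as the sum over bins of (share of goods - share of bads) * ln(share of goods / share of bads), and Gini as 2 * AUC - 1. All numbers are invented.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical per-bin counts of goods and bads for one predictor.
goods = np.array([90.0, 70.0])
bads = np.array([10.0, 30.0])

g = goods / goods.sum()                       # share of all goods in each bin
b = bads / bads.sum()                         # share of all bads in each bin
iv_value = np.sum((g - b) * np.log(g / b))    # information value

# Gini from AUC on a toy score: Gini = 2 * AUC - 1
y = np.array([0, 0, 1, 1])
score = np.array([0.1, 0.4, 0.35, 0.8])
gini_value = 2 * roc_auc_score(y, score) - 1

print(iv_value, gini_value)
```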

Define a shortlist of predictors to enter the modelling in the next steps.

In [None]:
cols_shortlist = cols_woe

## Stepwise logistic Regression

We run stepwise logistic regression on the training data set. We start with no predictors in the model and try to add predictors from the list called **cols_shortlist**, which is defined above (by default, it contains all the WOE variables).

Stepwise process can be tuned using various parameters:
 - *initial_predictors*: set of starting predictors (useful for backward method)
 - *max_iter*: maximal number of iterations
 - *min_increase*: minimal marginal Gini contribution for predictor to be added
 - *max_decrease*: maximal marginal Gini decrease tolerated when a predictor is removed
 - *max_correlation*: maximal absolute value of correlation of predictors in the model (a variable with larger correlation with existing predictors will not be added to the model)
 - *beta_sgn_criterion*: if this is set to True, all the betas in the model must have the same sign (all positive or all negative)
 - *penalty, C*: regularization parameters for logistic regression (sklearn library)
 - *correlation_sample*: for better performance, correlation matrix is calculated just on a sample of data. The size of the sample is set in this parameter
 - *selection_method*: stepwise or forward or backward
 
The *fit* method can be called with two arguments, *fit(X,y)*, or with four arguments, *fit(X_train,y_train,X_valid,y_valid)*. When called with four arguments, the Gini is measured on the validation sample (i.e. the validation sample is used to decide which steps the stepwise procedure takes).

In [None]:
from scoring.model_selection import GiniStepwiseLogit

clf = GiniStepwiseLogit(initial_predictors = set(), max_iter=1000, min_increase=0.7, max_decrease=0.5, max_predictors=0,
                    max_correlation=0.45, beta_sgn_criterion=False, penalty='l2', C=10e10, correlation_sample=10000,
                    selection_method='stepwise')

clf.fit(data[train_mask][cols_shortlist],data[train_mask][col_target]
#        ,data[valid_mask][cols_shortlist],data[valid_mask][col_target]
       )

In [None]:
it = range(0,len(clf.model_progress_[clf.model_progress_['addrm']==0]['prednum']))
pn = clf.model_progress_[clf.model_progress_['addrm']==0]['prednum']
ginis = clf.model_progress_[clf.model_progress_['addrm']==0]['Gini']
plt.figure(figsize = (7,7))
plt.plot(it, ginis)
ymin, ymax = plt.ylim()
plt.xlabel('Iteration')
plt.ylabel('Gini')
plt.title('Stepwise model selection')
plt.axis('tight')
plt.show()
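
`GiniStepwiseLogit` is implemented in scoring.py and is not reproduced here. To show the idea behind the `min_increase` rule, the following simplified sketch does forward selection only (no removal step, no correlation check) with plain scikit-learn on synthetic data. It assumes that Gini is measured in percent, which the default `min_increase=0.7` suggests but the source does not state.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic data: column 0 is a strong predictor, column 1 a weak one, column 2 is noise.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
logit = 2.0 * X[:, 0] + 0.5 * X[:, 1]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

def gini_of(cols):
    """Fit a logit on the given columns and return its in-sample Gini in percent."""
    model = LogisticRegression(C=1e10).fit(X[:, cols], y)
    return 100 * (2 * roc_auc_score(y, model.predict_proba(X[:, cols])[:, 1]) - 1)

selected, current, min_increase = [], 0.0, 0.7
while len(selected) < X.shape[1]:
    candidates = {c: gini_of(selected + [c]) for c in range(X.shape[1]) if c not in selected}
    best = max(candidates, key=candidates.get)
    if candidates[best] - current < min_increase:
        break                                  # no candidate adds enough Gini
    selected.append(best)
    current = candidates[best]

print(selected, round(current, 1))
```

Every iteration refits one model per remaining candidate, which is why the cost grows quickly with the number of predictors on the shortlist.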

## L1 regularized Logistic Regression

This alternative to stepwise feature selection is better suited to data sets with a large number of covariates, because it needs significantly fewer model fits when searching for the optimal predictor set. Use it for such data sets instead of stepwise.

In [None]:
from scoring.model_selection import L1GiniModelSelection

clf = L1GiniModelSelection(steps = 100, grid_length=5, max_predictors=50,
                           max_correlation=1, beta_sgn_criterion=False, correlation_sample = 10000)

clf.fit(data[train_mask][cols_shortlist],data[train_mask][col_target],
        data[valid_mask][cols_shortlist],data[valid_mask][col_target]
       )

In [None]:
coefs_ = np.array(clf.coefs_)
cs = clf.model_progress_['C']
plt.figure(figsize = (7,7))
plt.plot(np.log10(cs), coefs_)
ymin, ymax = plt.ylim()
plt.xlabel('log10(C)')
plt.ylabel('Coefficients')
plt.title('Logistic Regression Path')
plt.axis('tight')
plt.legend(cols_shortlist, loc='upper center', bbox_to_anchor=(1.20,1.0))
plt.show()

In [None]:
plt.figure(figsize = (7,7))
ginis = clf.model_progress_[['gini train','gini validate']]
plt.plot(np.log10(cs), ginis)
ymin, ymax = plt.ylim()
plt.xlabel('log10(C)')
plt.ylabel('Ginis')
plt.title('Logistic Regression Path')
plt.axis('tight')
plt.legend(['Train','Validate'], loc='upper center', bbox_to_anchor=(1.20,1.0))
plt.show()
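
The regularization path plotted above relies on the fact that an L1 penalty shrinks coefficients exactly to zero when `C` is small. The sketch below illustrates this with plain scikit-learn on synthetic data; it is not the `L1GiniModelSelection` implementation, and the grid of `C` values is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first two of five columns influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.uniform(size=1000) < 1 / (1 + np.exp(-logit))).astype(int)

cs = np.logspace(-4, 1, 6)                     # from strong to weak regularization
n_nonzero = []
for c in cs:
    model = LogisticRegression(penalty='l1', C=c, solver='liblinear').fit(X, y)
    n_nonzero.append(int(np.count_nonzero(model.coef_)))

print(dict(zip(cs, n_nonzero)))
```

One fit per value of `C` is enough to obtain a whole sequence of candidate predictor sets, compared with one fit per candidate predictor per step in the stepwise approach.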

## Score the dataset

In [None]:
cols_final_predictors = list(clf.final_predictors_)
print('FINAL PREDICTORS SELECTED TO THE MODEL:',cols_final_predictors)

Create a new column with the prediction (probability of default).

In [None]:
col_score = 'SCORE'

data[col_score] = clf.predict(data)
print('Column',col_score,'with the prediction added/modified. Number of columns:',data.shape[1])

## Scorecard table output
Output the scorecard to a table. Stats are calculated on a subset of data given by the mask defined below.

In [None]:
# this mask is a union of masks for training, validation, testing and out of time data sets
table_mask = train_mask|valid_mask|test_mask|oot_mask

In [None]:
scorecard = []

if len(grouping.bins_data_) > 0:
    for v,g in grouping.bins_data_.items():
        if v+'_WOE' in clf.final_predictors_:
            ii = list(clf.final_predictors_).index(v+'_WOE')
            bin_names = []
            bin_woes = []
            for j in range(0,len(g['bins'])):
                if (g['bins'].dtype == 'float64') and (j < len(g['bins'])-1):
                    subset = data[(table_mask) & (data[v]>=g['bins'][j]) & (data[v]<g['bins'][j+1])]
                    obs = subset[col_base].sum()
                    bads = subset[col_target].sum()
                    scorecard.append({'Variable':v,
                                     'Min':g['bins'][j],
                                     'Max':g['bins'][j+1],
                                     'Value':np.nan,
                                     'WOE':g['woes'][j],
                                     'Beta':clf.final_coef_[0][ii],
                                     'BiXi':g['woes'][j]*clf.final_coef_[0][ii],
                                     'Observations':obs,
                                     'Bads':bads})
                elif (g['bins'].dtype != 'float64'):
                    subset = data[(table_mask) & (data[v]==g['bins'][j])]
                    obs = subset[col_base].sum()
                    bads = subset[col_target].sum()
                    scorecard.append({'Variable':v,
                                     'Min':np.nan,
                                     'Max':np.nan,
                                     'Value':g['bins'][j],
                                     'WOE':g['woes'][j],
                                     'Beta':clf.final_coef_[0][ii],
                                     'BiXi':g['woes'][j]*clf.final_coef_[0][ii],
                                     'Observations':obs,
                                     'Bads':bads})
            subset = data[(table_mask) & (pd.isnull(data[v]))]
            obs = subset[col_base].sum()
            bads = subset[col_target].sum()
            scorecard.append({'Variable':v,
                             'Min':np.nan,
                             'Max':np.nan,
                             'Value':'null',
                             'WOE':g['nan_woe'],
                             'Beta':clf.final_coef_[0][ii],
                             'BiXi':g['nan_woe']*clf.final_coef_[0][ii],
                             'Observations':obs,
                             'Bads':bads})

all_obs = data[table_mask][col_base].sum()
all_bads = data[table_mask][col_target].sum() 
scorecard.append({'Variable':'_Intercept',
                  'Value':np.nan,
                  'Min':np.nan,
                  'Max':np.nan,
                  'WOE':0,
                  'Beta':clf.final_model_.intercept_[0],
                  'BiXi':1*clf.final_model_.intercept_[0],
                  'Observations':all_obs,
                  'Bads':all_bads})

scorecard_out = pd.DataFrame.from_records(scorecard)[
    ['Variable','Min','Max','Value','WOE','Beta','BiXi','Observations','Bads']]
scorecard_out2 = scorecard_out.copy()
scorecard_out2['Value'] = scorecard_out2['Value'] + ','
scorecard_out2 = scorecard_out2.groupby(['Variable','WOE']).agg({
    'Variable':min,'Min':min,'Max':max,'Value':sum,'WOE':min,'Beta':min,'BiXi':min,'Observations':sum,'Bads':sum
})
scorecard_out2.loc[pd.isnull(scorecard_out2['Value']),'Value'] = ','
scorecard_out2['Value'] = scorecard_out2['Value'].astype(str).str[:-1]
scorecard_out2['Goods'] = scorecard_out2['Observations'] - scorecard_out2['Bads']
scorecard_out2['Bad Rate'] = scorecard_out2['Bads']/scorecard_out2['Observations']
all_badrate = all_bads/all_obs
scorecard_out2['Bad Rate relative to population'] = scorecard_out2['Bad Rate'] / all_badrate
scorecard_out2['% Observations'] = scorecard_out2['Observations'] / all_obs
scorecard_out2['% Bads'] = scorecard_out2['Bads'] / all_bads
scorecard_out2['% Goods'] = scorecard_out2['Goods'] / (all_obs-all_bads)
scorecard_out2['Lift'] = scorecard_out2['% Bads'] / scorecard_out2['% Goods']
scorecard_out2 = pd.DataFrame.from_records(scorecard_out2.sort_values(['Variable','Min','Max','WOE','Value']))

pd.options.display.max_rows = 1000
#display(scorecard_out)
display(scorecard_out2)
pd.options.display.max_rows = 15
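
Each `BiXi` value in the table is the contribution of one bin to the linear predictor. Assuming the final model is a standard logistic regression on the WOE columns (as `final_coef_` and `intercept_` suggest), the predicted probability for a case is the logistic function of the intercept plus the sum of the `BiXi` values of the bins the case falls into. The numbers and variable names below are invented.

```python
import math

# Hypothetical scorecard rows that apply to one applicant: variable -> BiXi
intercept = -2.0
contributions = {'Numerical_1': -0.40, 'Categorical_1': 0.25}

linear_predictor = intercept + sum(contributions.values())
probability = 1.0 / (1.0 + math.exp(-linear_predictor))   # logistic function

print(round(probability, 4))
```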

## Performance characteristics
Performance characteristics of the model (Gini, Lift) and their visualisations.

In [None]:
from scoring.metrics import gini, lift

In [None]:
print ('Train Gini:',gini(data[train_mask][col_target],data[train_mask][col_score]))
print ('Valid Gini:',gini(data[valid_mask][col_target],data[valid_mask][col_score]))
#print ('Test Gini:',gini(data[test_mask][col_target],data[test_mask][col_score]))
print ('OOT Gini:',gini(data[oot_mask][col_target],data[oot_mask][col_score]))

In [None]:
print ('Train Lift 10%:',lift(data[train_mask][col_target],-data[train_mask][col_score],10))
print ('Valid Lift 10%:',lift(data[valid_mask][col_target],-data[valid_mask][col_score],10))
#print ('Test Lift 10%:',lift(data[test_mask][col_target],-data[test_mask][col_score],10))
print ('OOT Lift 10%:',lift(data[oot_mask][col_target],-data[oot_mask][col_score],10))
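
The `lift` function is imported from scoring.py, so its exact definition and sign convention are not visible here. A common definition, assumed in the sketch below, is the bad rate among the riskiest 10% of cases divided by the overall bad rate. The helper `lift_at` and the toy data are hypothetical.

```python
import numpy as np

# Hypothetical toy data: 20 cases, score = predicted probability of default.
y = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
score = np.linspace(0.95, 0.0, num=20)         # sorted from riskiest to safest

def lift_at(y_true, y_score, pct):
    """Bad rate in the riskiest pct% of cases divided by the overall bad rate."""
    k = max(1, int(round(len(y_true) * pct / 100)))
    riskiest = np.argsort(-y_score)[:k]        # indices of the k highest scores
    return y_true[riskiest].mean() / y_true.mean()

lift_10 = lift_at(y, score, 10)
print(lift_10)
```

A lift of 1 means the score does no better than random selection in that segment; higher values mean the riskiest segment concentrates more bads.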

In [None]:
#calculate data for Gini and Lift curves
from scoring.tools import calculate_gini_and_lift
train_stats, train_curve = calculate_gini_and_lift(data[train_mask], col_target, col_score, pct = 10)
train_curve = list(zip(*train_curve))
valid_stats, valid_curve = calculate_gini_and_lift(data[valid_mask], col_target, col_score, pct = 10)
valid_curve = list(zip(*valid_curve))
#test_stats, test_curve = calculate_gini_and_lift(data[test_mask], col_target, col_score, pct = 10)
#test_curve = list(zip(*test_curve))
oot_stats, oot_curve = calculate_gini_and_lift(data[oot_mask], col_target, col_score, pct = 10)
oot_curve = list(zip(*oot_curve))

In [None]:
plt.figure(figsize = (7,7))
plt.axis([0, 1, 0, 1])
plt.plot([0] + list(train_curve[2]),[0] + list(train_curve[3]), label = 'train', color = 'g')
plt.plot([0] + list(valid_curve[2]), [0] + list(valid_curve[3]), label = 'validation', color = 'r')
#plt.plot([0] + list(test_curve[2]), [0] + list(test_curve[3]), label = 'test', color = 'y')
plt.plot([0] + list(oot_curve[2]), [0] + list(oot_curve[3]), label = 'out-of-time', color = 'b')
plt.plot(list(range(0, 101)), list(range(0, 101)), color='k')
plt.xlabel('Cumulative good count')
plt.ylabel('Cumulative bad count')
plt.legend(loc = "lower right")
plt.show()

In [None]:
plt.figure(figsize = (10,5))
plt.axis([0, 100, 0, max(train_curve[1])+0.5])
plt.plot(train_curve[0], train_curve[1], label = 'train', color = 'g')
plt.plot(valid_curve[0], valid_curve[1], label = 'validation', color = 'r')
#plt.plot(test_curve[0], test_curve[1], label = 'test', color = 'y')
plt.plot(oot_curve[0], oot_curve[1], label = 'out-of-time', color = 'b')
plt.xlabel('Cumulative count [%]')
plt.ylabel('Lift')
plt.legend(loc = "upper right")
plt.show()

In [None]:
from sklearn.metrics import roc_curve, auc

def proc_gini(x,y,z):
    fpr, tpr, _ = roc_curve(x[y], x[z], pos_label=0)
    roc_gini = (auc(fpr, tpr)-0.5)*2
    return roc_gini
%matplotlib inline
plt.figure(figsize = (10,7))
grouped = data[train_mask].groupby(col_month, axis=0)
res_train= grouped.apply(proc_gini, col_target ,col_score)
plt.plot(range(len(res_train)),-res_train, linewidth=2.0,label='Train', color = 'g', marker='o')

grouped = data[valid_mask].groupby(col_month, axis=0)
res_valid= grouped.apply(proc_gini, col_target ,col_score)
plt.plot(range(len(res_valid)),-res_valid, linewidth=2.0,label='Validation', color = 'r', marker='o')

grouped = data[oot_mask].groupby(col_month, axis=0)
res_oot= grouped.apply(proc_gini, col_target ,col_score)
plt.plot(range(len(res_train),len(res_train)+len(res_oot)),-res_oot, linewidth=2.0,label='OOT', color = 'b', marker='o')

#grouped = data[test_mask].groupby(col_month, axis=0)
#res_test= grouped.apply(proc_gini, col_target ,col_score)
#plt.plot(range(len(res_test)),-res_test, linewidth=2.0,label='Test', color = 'y', marker='o')

plt.xticks(range(len(res_train)+len(res_oot)), np.sort(data[col_month].unique()), rotation=45)

plt.ylim([0,1])
plt.title('Gini by months')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('Months')
plt.ylabel('Gini')
plt.show()

Calibration chart

In [None]:
from scoring.plot import plot_calib
plot_calib(data[col_score],data[col_target],bins=20)

## Correlations
Calculate and visualise correlation matrix

In [None]:
cormat = data[cols_final_predictors].corr()

matplotlib.rcParams.update({'font.size': 15})
sns.set()
a4_dims = (12,10)

fig, ax = plt.subplots(figsize=a4_dims, dpi=50)
fig.suptitle('Correlations of Predictors',fontsize=25)
sns.heatmap(cormat, ax=ax, annot=True, fmt="0.1f", linewidths=.5, annot_kws={"size":15},cmap="OrRd")
plt.tick_params(labelsize=15)
plt.xticks(rotation=90)
plt.yticks(rotation=0)

plt.show()

Show the list of the highest correlations (restricted to correlations that are, in absolute value, higher than the *max_ok_correlation* parameter):

In [None]:
max_ok_correlation = 0.0

# find highest pairwise correlation (correlation greater than .. in absolute value)
hicors = []
for i in range(len(cormat)):
    # only the upper triangle is scanned, so each pair is listed once
    for j in range(i + 1, len(cormat)):
        c = cormat.iloc[i, j]
        if abs(c) > max_ok_correlation:
            hicors.append((cormat.index[i], cormat.columns[j], c))
hicors.sort(key=lambda tup: abs(tup[2]), reverse=True)

# building the frame from the list also works when no pair exceeds the threshold
hicors2 = pd.DataFrame(hicors, columns=['Predictor 1', 'Predictor 2', 'Correlation'])

# print list of highest correlations
hicors2

## Time stability of predictors

In [None]:
#for the charts to be drawn more quickly, set a reasonable size of sample
sample_size = 10000

data_chart = data.copy()
if sample_size < len(data_chart):
    data_chart = data_chart.sample(sample_size, random_state=241)

for j in list(clf.final_predictors_):
    sns.set(rc={"figure.figsize": (30, 20)})

    for i in data_chart[j].unique():
        tmp = pd.DataFrame(np.sort(data_chart[col_month].unique()), columns = [col_month] ).join(
        pd.DataFrame(data_chart.groupby(by = [j,col_month])[col_target].sum()).loc[i], on = col_month).set_index(col_month).join(
        data_chart.groupby(by = [j,col_month])[col_base].sum().loc[i]).join(data_chart.groupby(by = [col_month])[col_base].sum(), rsuffix='_all')
        tmp['bad_rate'] = tmp[col_target]/tmp[col_base]
        tmp['obs_rate'] = tmp[col_base]/tmp[col_base+str('_all')]
        plt.subplot(221)
        plt.plot(range(len(data_chart[col_month].unique())),tmp['bad_rate'],
                label = i,linewidth=4)
        plt.xticks(range(len(data_chart[col_month].unique())), np.sort(data_chart[col_month].unique()), rotation=45,
                   fontsize=20)
        plt.yticks(fontsize=20)
        plt.title(j+str(': bad rate'),fontsize=25)
        
        plt.subplot(222)
        plt.plot(range(len(data_chart[col_month].unique())),tmp['obs_rate'],
                label = i,linewidth=4) 
        plt.xticks(range(len(data_chart[col_month].unique())), np.sort(data_chart[col_month].unique()), rotation=45,
                   fontsize=20)
        plt.yticks(fontsize=20)
        plt.title(j+str(': observation rate'),fontsize=25)
        
    plt.legend(loc='center', bbox_to_anchor=(1.2, 0.5),fontsize=25) 
    plt.show()

## Comparison with another score
Charts similar to those drawn for the new scorecard are now used to compare the new scorecard with another scorecard. The value of the old score should be stored in a separate column of the original data set.

In [None]:
col_oldscore = 'OLD_SCORE'

#if the score gives the complementary probability (of non-default), run this:
data[col_oldscore]=1-data[col_oldscore]

In [None]:
print ('New score Gini (validation sample):',gini(data[valid_mask][col_target],data[valid_mask][col_score]))
print ('Old score Gini (validation sample):',gini(data[valid_mask][col_target],data[valid_mask][col_oldscore]))
print ('New score Gini (out-of-time sample):',gini(data[oot_mask][col_target],data[oot_mask][col_score]))
print ('Old score Gini (out-of-time sample):',gini(data[oot_mask][col_target],data[oot_mask][col_oldscore]))

In [None]:
print ('New score Lift 10% (validation sample):',lift(data[valid_mask][col_target],-data[valid_mask][col_score],10))
print ('Old score Lift 10% (validation sample):',lift(data[valid_mask][col_target],-data[valid_mask][col_oldscore],10))
print ('New score Lift 10% (out-of-time sample):',lift(data[oot_mask][col_target],-data[oot_mask][col_score],10))
print ('Old score Lift 10% (out-of-time sample):',lift(data[oot_mask][col_target],-data[oot_mask][col_oldscore],10))

In [None]:
from scoring.tools import calculate_gini_and_lift
newscore_stats, newscore_curve = calculate_gini_and_lift(data[valid_mask|oot_mask], col_target, col_score, pct = 10)
newscore_curve = list(zip(*newscore_curve))
oldscore_stats, oldscore_curve = calculate_gini_and_lift(data[valid_mask|oot_mask], col_target, col_oldscore, pct = 10)
oldscore_curve = list(zip(*oldscore_curve))

In [None]:
plt.figure(figsize = (7,7))
plt.axis([0, 1, 0, 1])
plt.plot([0] + list(newscore_curve[2]),[0] + list(newscore_curve[3]), label = 'new score', color = 'g')
plt.plot([0] + list(oldscore_curve[2]), [0] + list(oldscore_curve[3]), label = 'old score', color = 'r')
plt.plot(list(range(0, 101)), list(range(0, 101)), color='k')
plt.xlabel('Cumulative good count')
plt.ylabel('Cumulative bad count')
plt.legend(loc = "lower right")
plt.show()

In [None]:
plt.figure(figsize = (10,5))
plt.axis([0, 100, 0, max(max(newscore_curve[1]), max(oldscore_curve[1]))+0.5])
plt.plot(newscore_curve[0], newscore_curve[1], label = 'new score', color = 'g')
plt.plot(oldscore_curve[0], oldscore_curve[1], label = 'old score', color = 'r')
plt.xlabel('Cumulative count [%]')
plt.ylabel('Lift')
plt.legend(loc = "upper right")
plt.show()

In [None]:
from sklearn.metrics import roc_curve, auc

def proc_gini(x,y,z):
    fpr, tpr, _ = roc_curve(x[y], x[z], pos_label=0)
    roc_gini = (auc(fpr, tpr)-0.5)*2
    return roc_gini
%matplotlib inline
plt.figure(figsize = (10,7))
grouped = data[valid_mask|oot_mask].groupby(col_month, axis=0)
res_new= grouped.apply(proc_gini, col_target ,col_score)
plt.plot(range(len(res_new)),-res_new, linewidth=2.0,label='new score', color = 'g', marker='o')

grouped = data[valid_mask|oot_mask].groupby(col_month, axis=0)
res_old= grouped.apply(proc_gini, col_target ,col_oldscore)
plt.plot(range(len(res_old)),-res_old, linewidth=2.0,label='old score', color = 'r', marker='o')

plt.xticks(range(len(res_new)), res_new.index, rotation=45)

plt.ylim([0,1])
plt.title('Gini by months')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('Months')
plt.ylabel('Gini')
plt.show()

## Transition matrices

In [None]:
new_score_dec = pd.DataFrame(pd.qcut(data[valid_mask|oot_mask][col_score],10,labels=False))
old_score_dec = pd.DataFrame(pd.qcut(data[valid_mask|oot_mask][col_oldscore],10,labels=False))
dec_data = pd.concat([new_score_dec,old_score_dec,data[valid_mask|oot_mask][col_target]],axis=1)
dec_data_agg = dec_data.groupby([col_oldscore,col_score]).agg({
    col_target:['sum','count']
})
dec_data_agg.columns = ['bads','obs']
dec_data_agg2 = dec_data.groupby([col_oldscore]).agg({
    col_target:['count']
})
dec_data_agg2.columns = ['old decile obs']
dec_data_all = dec_data_agg.reset_index().join(dec_data_agg2,on=[col_oldscore]).set_index([col_oldscore,col_score])
dec_data_all['default rate'] = dec_data_all['bads']/dec_data_all['obs']
dec_data_all['share'] = dec_data_all['obs']/dec_data_all['old decile obs']
matrix_DR = np.matrix(dec_data_all.unstack()[['default rate']])
matrix_OS = np.matrix(dec_data_all.unstack()[['share']])
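
The share matrix can also be obtained with `pd.crosstab`, which normalizes each row directly. The sketch below is an alternative illustration on synthetic scores, not the workflow's own code; the score construction is invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
old = rng.uniform(size=1000)                      # hypothetical old score
new = 0.7 * old + 0.3 * rng.uniform(size=1000)    # hypothetical correlated new score

old_dec = pd.Series(pd.qcut(old, 10, labels=False), name='old score decile')
new_dec = pd.Series(pd.qcut(new, 10, labels=False), name='new score decile')

# Rows: old decile, columns: new decile, values: share of the old decile.
transition = pd.crosstab(old_dec, new_dec, normalize='index')
print(transition.round(2))
```

Because each row is normalized by the size of the old decile, every row sums to 1, matching the `share` column computed above.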

Default rate for each decile/decile combination

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(matrix_DR, annot=True, cmap=sns.cubehelix_palette(light=1, as_cmap=True))
ax.set_ylabel('old score decile')
ax.set_xlabel('new score decile')
plt.show()

Transition matrix

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(matrix_OS, annot=True, cmap=sns.cubehelix_palette(light=1, as_cmap=True))
ax.set_ylabel('old score decile')
ax.set_xlabel('new score decile')
plt.show()

## Performance on short target
If the original dataset also contains a shorter target (e.g. FPD30), this part of the workflow draws charts of the model's performance on that target as well.

In [None]:
#name of the short target column
col_short = "FPD"
#name of the short target's base column
col_shortbase = "FPD_BASE"

If your data set has no base column for the short target, the following code adds it (value 1 for each observation). **Otherwise, don't run it**, because it overwrites the existing column.

In [None]:
data[col_shortbase] = 1
print('Column',col_shortbase,'added/modified. Number of columns:',data.shape[1])

In [None]:
shortbase_mask = ((data.data_type == 'valid')|(data.data_type == 'oot'))& (data[col_shortbase] == 1) 

In [None]:
print ('Short target Gini:',gini(data[shortbase_mask][col_short],data[shortbase_mask][col_score]))
print ('Short target Lift 10%:',lift(data[shortbase_mask][col_short],-data[shortbase_mask][col_score],10))

In [None]:
from sklearn.metrics import roc_curve, auc

def proc_gini(x,y,z):
    fpr, tpr, _ = roc_curve(x[y], x[z], pos_label=0)
    roc_gini = (auc(fpr, tpr)-0.5)*2
    return roc_gini
%matplotlib inline
plt.figure(figsize = (10,7))
grouped = data[valid_mask|oot_mask].groupby(col_month, axis=0)
res_new= grouped.apply(proc_gini, col_target ,col_score)
plt.plot(range(len(res_new)),-res_new, linewidth=2.0,label='target', color = 'g', marker='o')

grouped = data[shortbase_mask].groupby(col_month, axis=0)
res_short= grouped.apply(proc_gini, col_short ,col_score)
plt.plot(range(len(res_short)),-res_short, linewidth=2.0,label='short target', color = 'r', marker='o')

plt.xticks(range(len(res_short)), res_short.index, rotation=45)

plt.ylim([0,1])
plt.title('Gini by months')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('Months')
plt.ylabel('Gini')
plt.show()