# Home Credit Python Scoring Workflow v.0.4.1

**Contributors:**
- Pavel Sůva (HCI Research & Development)
- Sergey Gerasimov (HCRU Scoring & Big Data)
- Valentina Kalenichenko (HCRU Scoring & Big Data)
- Marek Teller (HCI Research & Development)
- Martin Kotek (HCI Research & Development)
- Jan Zeller (HCI Research & Development)

## Import packages
- time, datetime - ability to get current time for logs
- math - basic mathematical functions (such as logarithm etc.)
- random - generate random selection from probability distributions
- NumPy - for scientific, mathematical, numerical calculations
- Pandas - for efficient work with large data structures (you need pandas **version 0.21 or higher**)
- cx_Oracle and sqlalchemy - for loading data from Oracle database (DWH etc.)
- statsmodels - library with some statistical functions and models
- scikit-learn - all important machine learning (and statistical) algorithms used for training the models
- matplotlib - for plotting the charts
- seaborn - for statistical visualisations
- os - for setting output paths for generated image files
- pickle - to save models to external files

**If any of these packages is missing, you have to install it from the Anaconda prompt using command *conda install packagename* where *packagename* is the name of the installed package.**

There is another package called *scoring*, which is distributed along with this workflow. **The folder *scoring* must be located in the same folder as this workflow for the package to be loaded correctly.** Alternatively, you can locate it somewhere else and then use *sys.path.insert()* to map this location.
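
If the *scoring* folder is stored elsewhere, the path mapping mentioned above can be sketched as follows. The folder path is a hypothetical placeholder; replace it with the folder that contains the *scoring* package on your machine.

```python
import os
import sys

# Hypothetical location: the folder that CONTAINS the 'scoring' package folder.
scoring_parent = os.path.abspath('path/to/folder_containing_scoring')

# Put it at the front of the module search path so that 'import scoring' finds it first.
if scoring_parent not in sys.path:
    sys.path.insert(0, scoring_parent)
```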

### Other important prerequisites:

For the grouping, some **extensions for Jupyter must be installed and enabled before Jupyter is started and the notebook is loaded**. These extensions are JavaScript code running in the browser, so it is necessary to have a compatible browser. Generally, Chrome is OK, Internet Explorer 11 is NOT OK. To install the extensions, run this in your Anaconda prompt:

- *conda install ipywidgets*
- *jupyter nbextension enable --py --sys-prefix widgetsnbextension*
- *conda config --add channels conda-forge*
- *conda install qgrid* 
- *jupyter nbextension enable --py --sys-prefix qgrid*

Please make sure that the qgrid library you installed in this step is **version 1.0.2 or higher**.

To be able to connect to an Oracle database (to get the data directly from your DWH), you need a compatible Oracle driver installed on your computer. **With 64-bit Python, you need to have a 64-bit Oracle driver installed.** Before you install the driver, you need to have Java 8 JDK (JRE is not enough) installed on your computer.

In [None]:
import time
import datetime
import operator
import math
import random
import numpy as np
import pandas as pd
import cx_Oracle
from sqlalchemy import create_engine
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.utils import as_float_array
from sklearn.utils.validation import check_is_fitted
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import os.path
import pickle

#import sys
#sys.path.insert(0, 'C:/py_src/scoring/hcfsc')
import scoring
#import importlib
#importlib.reload(scoring)
#importlib.reload(scoring.grouping)

Set general technical parameters and paths.

In [None]:
sns.set()
%matplotlib inline
%config InlineBackend.close_figures=True
from IPython.display import display
pd.options.display.max_columns = None
pd.options.display.max_rows = 15
output_folder = 'documentation'

# Create the output folder and its subfolders if they do not exist yet
for subfolder in ['performance', 'predictors', 'stability', 'analysis', 'model']:
    os.makedirs(output_folder+'/'+subfolder, exist_ok=True)

## Import data
Importing data from a CSV file. It is important to set the following parameters:

- *encoding*: usually 'utf-8' or 'windows-xxxx' on Windows machines, where xxxx is 1250 for Central Europe, 1251 for Cyrillic etc.
- *sep*: separator of columns in the file
- *decimal*: decimal point or comma
- *index_col*: which column is used as the index - should be the unique credit case identifier

In [None]:
data = pd.read_csv(r'ExampleData4.CSV', sep = ',', decimal = '.', 
                   encoding = 'utf-8', index_col = 'ID', low_memory = False)
print('Data loaded on',datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S'))

The data need to have an index column with a unique value for each row. Duplicated index values can cause problems later. Run this to deal with such rows:

In [None]:
#Option 1: remove rows with duplicated index
data=data[~data.index.duplicated(keep='first')]

#Option 2: reset index
#data.reset_index(inplace=True)
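
To see what the two options do, here is a minimal sketch on a toy data frame. The frame and its values are made up for illustration and are not part of the example data.

```python
import pandas as pd

# Toy frame in which the index value 2 appears twice (illustrative data only).
toy = pd.DataFrame({'DEF': [0, 1, 0, 1]}, index=pd.Index([1, 2, 2, 3], name='ID'))

# Check whether the index is unique before deciding how to handle it.
print('Index is unique:', toy.index.is_unique)

# Option 1: keep only the first row for each index value.
dedup = toy[~toy.index.duplicated(keep='first')]

# Option 2: move the old index into a column and use a fresh 0..n-1 index.
reindexed = toy.reset_index()

print(dedup)
print(reindexed)
```

Option 1 loses rows, Option 2 keeps all rows but the original identifier becomes an ordinary column.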

Optionally the data can be loaded also from a database. The function read_sql uses cache, so the data don't have to be downloaded from the database repeatedly. The cache will be located in a new folder called **db_cache**.

In [None]:
#engine = create_engine('oracle://PAVELS[GP_HQ_RISK]:xxx@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=DBDWHRU.HOMECREDIT.RU)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=DWHRU)))', echo=False)

In [None]:
#from scoring.db import read_sql
#ru_data = read_sql('select * from owner_dwh.f_application_tt where rownum<11',engine, index_col = 'sk_application')
#print('Data loaded on',datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S'))

If you need to download data from the database again (and not from cache), use the parameter refresh:

In [None]:
#from scoring.db import read_sql
#data = read_sql('select * from owner_dwh.f_application_base_tt where rownum=1',engine, index_col = 'skp_application',refresh=True)
#print('Data loaded on',datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S'))

In [None]:
print('Number of rows:',data.shape[0])
print('Number of columns:',data.shape[1])

In [None]:
data.head(5)

## Metadata definitions
Assigning the names of the time, month, day, target and base columns. The month and day columns don't have to exist in the dataset; they will be created later in this workflow.

In [None]:
#name of the time column
col_time = "TIME"
#name of the month column
col_month = "MONTH"
#name of the day column
col_day = "DAY"
#name of the target column
col_target = "DEF"
#name of the base column
col_base = "BASE"

In [None]:
pd.DataFrame.from_records([['col_time',col_time],['col_month',col_month],['col_day',col_day],['col_target',col_target],['col_base',col_base]]) \
.to_csv(output_folder+'/model/metadata.csv',index=0,header=None)

If you don't have a base column in your data set, the following code adds it (an observation is in the base if its target is filled, i.e. equal to 0 or 1). **Otherwise, don't run it.**

In [None]:
if col_base not in data:
    data[col_base] = 0
    data.loc[data[col_target]==0,col_base] = 1
    data.loc[data[col_target]==1,col_base] = 1
    print('Column',col_base,'added/modified. Number of columns:',data.shape[1])
else:
    print('Column',col_base,'already exists.')

Creating the month and day columns from the time column involves the following steps:
- take the time column and specify the format in which the time is stored - **you need to specify this in variable *dtime_input_format*** (see https://docs.python.org/3/library/time.html#time.strftime for reference)
- reduce the value to a year-month-day string
- convert the string to a number
- add the new column to the dataset as the day column
- truncate this column to just year and month and add it to the dataset as the month column

In [None]:
dtime_input_format = '%Y-%m-%d %H:%M:%S'

In [None]:
data.loc[:,col_day] = pd.to_numeric(pd.to_datetime(data[col_time], format=dtime_input_format).dt.strftime('%Y%m%d'))
data[col_month] = data[col_day].apply(lambda x: math.trunc(x/100))
print('Columns',col_day,'and',col_month,'added/modified. Number of columns:',data.shape[1])
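
The encoding of day and month as integers can be checked on a small example. This is a minimal sketch with made-up timestamps; it builds the numbers arithmetically instead of via a string, which should give the same result.

```python
import pandas as pd

dtime_format = '%Y-%m-%d %H:%M:%S'  # assumed input format, same as dtime_input_format above
times = pd.Series(['2017-02-01 08:30:00', '2017-12-31 23:59:59'])

# Parse the strings, then encode the date as an integer YYYYMMDD.
parsed = pd.to_datetime(times, format=dtime_format)
day = parsed.dt.year * 10000 + parsed.dt.month * 100 + parsed.dt.day

# Integer division by 100 drops the last two digits (the day), leaving YYYYMM.
month = day // 100

print(pd.DataFrame({'day': day, 'month': month}))
```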

In [None]:
data.head(5)

Load the list of predictors from a CSV file. The file should have just one column, without any header, containing the names of the variables that should be used as predictors.

In [None]:
cols_pred = list(pd.read_csv(r'ExamplePredList.CSV', sep = ',', decimal = '.', 
                   encoding = 'windows-1251', low_memory = False, header = None)[0])

cols_pred_cat = list(set([c[0] for c in list(zip(data.columns, data.dtypes)) if c[1]=='O']) & set(cols_pred))
cols_pred_num = list(set([c[0] for c in list(zip(data.columns, data.dtypes)) if c[1]!='O']) & set(cols_pred))

# ALTERNATIVELY, DEFINE THE PREDICTOR NAMES MANUALLY

#cols_pred_num = ["Numerical_1","Numerical_2","Numerical_3","Numerical_4","Numerical_5"]
#cols_pred_cat = ["Categorical_1","Categorical_2","Categorical_3","Categorical_4","Categorical_5"]


cols_pred = cols_pred_num + cols_pred_cat

print(len(cols_pred_num),'numerical predictors:')
for p in cols_pred_num: print(p)
print('-'*100)
print()
print(len(cols_pred_cat),'categorical predictors:')
for p in cols_pred_cat: print(p)

## Data exploration

In [None]:
descrip = data.describe(include='all').transpose()
pd.options.display.max_rows = 1000
display(descrip)
pd.options.display.max_rows = 15

**exploreNominal** and **exploreInterval** functions give graphical data exploratory analyses. They can also output even more comprehensive analysis into html files. You just need to specify the folder for output.

These functions analyze only the part of data where target is not null even if it is not explicitly specified.

In [None]:
from scoring.data_exploration import exploreNominal, exploreInterval

for c in sorted(cols_pred_num):
    if (data[c].count() > 0) and (data[c].max() != data[c].min()):
        exploreInterval(data[c],data[col_target],htmlOut=True,ntbOut=True,OutFolder2='dexp',bin_count=10)

for c in sorted(cols_pred_cat):
    if (data[c].count() > 0) and (len(list(set(data[c].unique()) - {np.nan})) > 1):
        exploreNominal(data[c],data[col_target],htmlOut=True,ntbOut=True,OutFolder2='dexp')

The **explore_df** function creates a simple text report about the important variables. The report can then be printed either to the screen or to a file.

In the following code, only the part of the data with col_base = 1 is analyzed. You can remove the condition if you wish.

In [None]:
from scoring.data_exploration import explore_df
st = explore_df(data[data[col_base]==1],col_month,col_target,cols_pred)
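# write the report to a file; the with-block makes sure the file is closed afterwards
with open("data_exp.txt", "w", encoding='utf-8') as f:
    print(st, file=f)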
print(st)

**Default rate in time**: Simple visualisation of observation count and default rate in time

In [None]:
from scoring.plot import plot_dataset
plot_dataset(data,col_month,col_target,'Count and bad rate',col_base,savepath=output_folder+'/analysis/')

## Data split

- Splits the data into five parts (in-time training, in-time validation, in-time test, out of time, historical out of time).
- Adds a new column indicating to which part each observation belongs.
- The split parameters are set at the beginning of the code.
- In the first line you can set the random seed so that the results are replicable.

In [None]:
random.seed(12345)
share_train = 0.6
share_validation = 0.2
first_train_day = 20170201 #first day of train, everything before it will be considered "old", historical out of time
first_oot_day = 20170601 #first day of "new" out of time, i.e. out of time after train

data['random_value'] = 1
data['random_value'] = data['random_value'].apply(lambda x: random.uniform(0, 1)) 

data.loc[(data['random_value']<=share_train)&(data[col_day]<first_oot_day)&
         (data[col_day]>=first_train_day),'data_type'] = 'train'
data.loc[(data['random_value']>share_train)&(data['random_value']<=share_train+share_validation)&(data[col_day]<first_oot_day)&
         (data[col_day]>=first_train_day),'data_type'] = 'valid'
data.loc[(data['random_value']>share_train+share_validation)&(data[col_day]<first_oot_day)&
         (data[col_day]>=first_train_day),'data_type'] = 'test'
data.loc[(data[col_day]>=first_oot_day),'data_type'] = 'oot'
data.loc[(data[col_day]<first_train_day),'data_type'] = 'hoot'

data= data.drop(['random_value'],axis = 1)

train_mask = (data.data_type == 'train')& (data[col_base] == 1) 
valid_mask = (data.data_type == 'valid')& (data[col_base] == 1) 
test_mask = (data.data_type == 'test')& (data[col_base] == 1) 
oot_mask = (data.data_type == 'oot')& (data[col_base] == 1) 
hoot_mask = (data.data_type == 'hoot')& (data[col_base] == 1) 

print('Train observations:',data[train_mask].shape[0])
print('Validation observations:',data[valid_mask].shape[0])
print('Test observations:',data[test_mask].shape[0])
print('Out-of-time observations:',data[oot_mask].shape[0])
print('Historical-out-of-time observations:',data[hoot_mask].shape[0])
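
The same split logic can be written in a vectorized way. The following is a minimal sketch on a toy data frame, assuming NumPy's random generator is used instead of the *random* module (so the individual draws differ from the cell above); the column name `DAY` and the dates are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(12345)  # fixed seed so that the split can be replicated
toy = pd.DataFrame({'DAY': [20170115, 20170210, 20170315, 20170420, 20170515, 20170610]})

first_train_day, first_oot_day = 20170201, 20170601
share_train, share_validation = 0.6, 0.2

# One uniform draw per row decides train / valid / test inside the training window.
u = rng.uniform(0, 1, size=len(toy))
in_time = np.select(
    [u <= share_train, u <= share_train + share_validation],
    ['train', 'valid'],
    default='test',
)

# Rows outside the training window are assigned by date only.
toy['data_type'] = np.where(toy['DAY'] < first_train_day, 'hoot',
                            np.where(toy['DAY'] >= first_oot_day, 'oot', in_time))
print(toy)
```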

Data summary (number of defaults, number in base, number of observations, default rate) by month and by sample

In [None]:
data_summary = data.groupby([col_month,'data_type']).aggregate({
    col_target:'sum',col_base:['sum','count']
})
data_summary.columns = [col_target,col_base,'Rows']
data_summary[col_target+' rate'] = data_summary[col_target]/data_summary[col_base]

data_summary = data_summary.reset_index(level='data_type').pivot(columns='data_type')
display(data_summary)
data_summary.to_csv(output_folder+'/analysis/summary.csv')

## Grouping and WOE transformation of variables

Don't use variables that have only 0 or 1 unique level. Grouping doesn't work for them.

In [None]:
cols_del = set()
for c in cols_pred_num:
    if (data[train_mask][c].count() == 0) or (data[train_mask][c].max() == data[train_mask][c].min()):
        cols_del = cols_del | {c}
        cols_pred_num = list(set(cols_pred_num) - {c})
for c in cols_pred_cat:
    if (data[train_mask][c].count() == 0) or (len(list(set(data[train_mask][c].unique()) - {np.nan})) <= 1):
        cols_del = cols_del | {c}
        cols_pred_cat = list(set(cols_pred_cat) - {c})
            
cols_pred = cols_pred_num + cols_pred_cat

if len(list(cols_del)) > 0:
    print('Variables',cols_del,'will not be further used as they have only 1 unique level.')
else:
    print('All predictors have more than 1 unique level.')

There are two options for grouping your variables.
1. Automatic grouping groups the variables using a decision tree. The user can't change the grouping in any interactive way. The grouping can be saved into an external file using its method *save()*.
2. Interactive grouping is suitable for smaller numbers of variables. The user can control which values of each variable will enter which group. The grouping can be saved into an external file using the interactive environment.

### Option 1: Automatic Grouping
The grouping uses decision tree algorithm and the grouping is supervised based on the target variable. In the following code:

A new instance of the **Grouping** class is created. The important parameters are:
 - *columns*: list of numerical columns to be grouped
 - *cat_columns*: list of categorical columns to be grouped
 - *group_count*: (maximal) number of final groups of each variable
 - *min_samples*: minimal number of observations in each group of each numerical variable
 - *min_samples_cat*: minimal number of observations in each group of each categorical variable

In [None]:
from scoring.grouping import Grouping

grouping = Grouping(columns = sorted(cols_pred_num),
                    cat_columns = sorted(cols_pred_cat),
                    group_count=5, 
                    min_samples=100, 
                    min_samples_cat=100) 

Then you fit the grouping using **fit** method. The parameters are array of the predictors and series of the target. Grouping is fitted on training data only.

In [None]:
grouping.fit(data[train_mask][cols_pred],
             data[train_mask][col_target])

if len(grouping.bins_data_) > 0:
    for v,g in grouping.bins_data_.items():
        print('Variable:',v)
        print('Bins:',g['bins'])
        print('WOEs:',g['woes'])
        if v in cols_pred_num:
            print('nan WOE:',g['nan_woe'])
        if v in cols_pred_cat:
            print('WOE for unknown values:',g['unknown_woe'])
        print()
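
For orientation, the weight of evidence (WOE) of a group can be computed by hand. The following is a minimal sketch on made-up data. It assumes the convention WOE = ln(share of goods in the group / share of bads in the group) and applies no smoothing; the formula used inside the *scoring* package may differ in sign or smoothing.

```python
import numpy as np
import pandas as pd

# Toy data: an already grouped predictor and a 0/1 target (1 = default). Values are made up.
toy = pd.DataFrame({
    'bin':    ['low'] * 4 + ['high'] * 4,
    'target': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Number of goods (target 0) and bads (target 1) in each group.
goods = (toy['target'] == 0).groupby(toy['bin']).sum()
bads = (toy['target'] == 1).groupby(toy['bin']).sum()

# Assumed convention: a higher WOE means a lower-risk group.
woe = np.log((goods / goods.sum()) / (bads / bads.sum()))
print(woe)
```

With this convention, the group with three bads out of four observations gets a negative WOE and the group with one bad out of four gets a positive WOE of the same size.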

Then you apply the grouping on your full data set using the **transform** method.

In [None]:
data_woe = grouping.transform(data)

Save grouping to an external file.

In [None]:
model_filename = 'myGrouping'
grouping.save(model_filename)
print('Grouping data saved to file',model_filename)

### Option 2: Interactive Grouping (beta)

A new instance of the **InteractiveGrouping** class is created. The important parameters are:
 - *columns*: list of numerical columns to be grouped
 - *cat_columns*: list of categorical columns to be grouped
 - *group_count*: (maximal) number of final groups of each variable
 - *min_samples*: minimal number of observations in each group of each numerical variable
 - *min_samples_cat*: minimal number of observations in each group of each categorical variable

In [None]:
#import importlib
#importlib.reload(scoring)
#importlib.reload(scoring.grouping)

from scoring.grouping import Grouping, InteractiveGrouping

grouping = InteractiveGrouping(columns = sorted(cols_pred_num),
                               cat_columns = sorted(cols_pred_cat),
                               group_count=5,
                               min_samples=100, 
                               min_samples_cat=100,
                               woe_smooth_coef=0.001) 

Then you open the interactive environment using the **display** method. The important parameters are:
 - *train_t*: training dataset the grouping should be based on
 - *columns*: list of numerical columns to be grouped and displayed
 - *cat_columns*: list of categorical columns to be grouped and displayed
 - *target_column*: as the grouping is supervised and calculates WOE values, you need to specify the target column name
 - *filename*: use only if you want to load a grouping that you created and saved previously
 - *group_count*: (maximal) number of final groups of each variable
 - *min_samples*: minimal number of observations in each group of each numerical variable
 - *min_samples_cat*: minimal number of observations in each group of each categorical variable

In the interactive environment, you can see four sections. From top to bottom:
- **Chart section**: 
 - For **numerical variables**, there is a chart with equifrequency fine classing (observations as bars, default rate as line), a chart with equidistant fine classing and a chart with the final groups.
 - For **categorical variables**, there is a chart with each of the original categorical values and a chart with the final groups.
- **Variable section**: here you can choose the tab with the variable which you want to edit. 
 - For **numerical variables**, the tab contains the borders of the final groups. You can edit these borders, add new ones with the [+] button and remove them with the [-] button. You can also manually set the WOE for nulls. There is also a button to perform automatic grouping on the selected variable.
 - For **categorical variables**, the tab contains two tables. In the top table, you can see some statistics for each of the categorical values. In the rightmost column, there is the number of the group which is assigned to the category. You can edit this value (double-click on it) to change the grouping. In the bottom table you can see statistics for the groups. It is not editable. There is also a button to perform automatic grouping on the selected variable.
- **Save section**: here you can save the grouping. Edit the file name and click the [Apply and Save] button.
- **Settings section**: If you perform automatic grouping on some variable, the grouping algorithm uses some parameters. These parameters can be set here. You can set how many final groups you want to have and what their minimal size is.

Known bugs:
- This bug may occur only if there are any categorical variables. In such case, user must open at least 1 tab with such variable before saving the grouping (otherwise there will be an error during the saving).
- If zoom level in the web browser is set to something else than 100%, the charts might get broken.

In [None]:
sns.reset_orig()
%matplotlib notebook
%config InlineBackend.close_figures=False

grouping.display(train_t = data[train_mask][cols_pred_num+cols_pred_cat+[col_target]],
                 columns = sorted(cols_pred_num),
                 cat_columns = sorted(cols_pred_cat),
                 target_column = col_target,
                 #filename = 'myIntGrouping',
                 bin_count=20,
                 woe_smooth_coef=0.001,
                 group_count=5,
                 min_samples=100,
                 min_samples_cat=100)

In [None]:
#reset the graphical environment to be used by the normal non-interactive charts
sns.set()
%matplotlib inline
%config InlineBackend.close_figures=True

Don't forget to apply the grouping to the data.

In [None]:
data_woe = grouping.transform(data)

### Use the grouping on the dataset

Load the grouping from a file (don't forget to set the right filename) and add the WOE columns to the original dataset.

In [None]:
#from scoring.grouping import Grouping
#grouping = Grouping(columns = sorted(cols_pred_num),
#                    cat_columns = sorted(cols_pred_cat),
#                    group_count=5, 
#                    min_samples=100, 
#                    min_samples_cat=100) 
#g_filename = 'myIntGrouping'
#grouping.load(g_filename)

Plot the fitted WOEs.

In [None]:
from scoring.plot import print_binning_stats_num, print_binning_stats_cat

if len(grouping.bins_data_) > 0:
    for v,g in sorted(grouping.bins_data_.items(), key=operator.itemgetter(0)):
        print('-'*125)
        print(v)
        if v in cols_pred_num:
            print_binning_stats_num(data[train_mask][[col_target, v]], v, col_target, g['bins'], g['woes'], g['nan_woe']
                           ,savepath=output_folder+'/predictors/'+v+'_')  
        elif v in cols_pred_cat:
            print_binning_stats_cat(data[train_mask][[col_target, v]], v, col_target
                           ,g['bins'].keys(), g['bins'].values() ,g['woes'], g['unknown_woe']
                           ,savepath=output_folder+'/predictors/'+v+'_')  

Add WOE variables to the data set.

In [None]:
data_woe = grouping.transform(data)
for c in data_woe:
    if c+'_WOE' in data:
        data = data.drop(c+'_WOE', axis=1)
        print('Column',c+'_WOE','dropped as it already existed in the data set.')
data = data.join(data_woe,rsuffix='_WOE')
print('Added WOE variables. Number of columns:',data.shape[1])

## Predictor power analysis

Calculates IV and Gini of each predictor and sorts the predictors by their power. The power is calculated for each of the samples (train, validate, test, OOT, H.OOT). **If one or more of the samples are empty, comment out the corresponding part of the code.**

In [None]:
cols_woe = [s + '_WOE' for s in cols_pred]

In [None]:
from scoring.metrics import iv,gini,lift

power_tab = []
for j in range(0,len(cols_woe)):
    power_tab.append({'Name':cols_woe[j]
                    ,'IV Train':iv(data.loc[train_mask,col_target],data.loc[train_mask,cols_woe[j]])
                    ,'Gini Train':gini(data.loc[train_mask,col_target],-data.loc[train_mask,cols_woe[j]])
                    ,'IV Validate':iv(data.loc[valid_mask,col_target],data.loc[valid_mask,cols_woe[j]])
                    ,'Gini Validate':gini(data.loc[valid_mask,col_target],-data.loc[valid_mask,cols_woe[j]])
                    ,'IV Test':iv(data.loc[test_mask,col_target],data.loc[test_mask,cols_woe[j]])
                    ,'Gini Test':gini(data.loc[test_mask,col_target],-data.loc[test_mask,cols_woe[j]])
                    ,'IV OOT':iv(data.loc[oot_mask,col_target],data.loc[oot_mask,cols_woe[j]])
                    ,'Gini OOT':gini(data.loc[oot_mask,col_target],-data.loc[oot_mask,cols_woe[j]])
                    ,'IV HOOT':iv(data.loc[hoot_mask,col_target],data.loc[hoot_mask,cols_woe[j]])
                    ,'Gini HOOT':gini(data.loc[hoot_mask,col_target],-data.loc[hoot_mask,cols_woe[j]])
                         })
power_out = pd.DataFrame.from_records(power_tab)
power_out = power_out.set_index('Name')
power_out = power_out.sort_values('Gini Train',ascending=False)

pd.options.display.max_rows = 1000
display(power_out)
pd.options.display.max_rows = 15
power_out.to_csv(output_folder+'/predictors/covariates.csv')
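
To make the two measures concrete, here is a minimal sketch of how Gini and IV of a single WOE variable can be computed. The helper names `gini_sketch` and `iv_sketch` are hypothetical and the implementation in *scoring.metrics* may differ. The sketch assumes Gini = 2 * AUC - 1 and IV = sum over groups of (share of goods - share of bads) * ln(share of goods / share of bads), with each distinct WOE value treated as one group.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def gini_sketch(y_true, score):
    # Gini = 2 * AUC - 1; a higher score should indicate a higher default risk.
    return 2 * roc_auc_score(y_true, score) - 1

def iv_sketch(y_true, woe):
    # Each distinct WOE value is treated as one group of the predictor.
    y_true = pd.Series(np.asarray(y_true))
    bins = pd.Series(np.asarray(woe))
    good = (y_true == 0).groupby(bins).sum() / (y_true == 0).sum()
    bad = (y_true == 1).groupby(bins).sum() / (y_true == 1).sum()
    return float(((good - bad) * np.log(good / bad)).sum())

# Made-up data: low WOE marks the risky group under the assumed sign convention.
y = [1, 1, 1, 0, 1, 0, 0, 0]
woe = [-1.1] * 4 + [1.1] * 4

# The WOE is negated so that a higher score means a higher risk, as in the cell above.
print('Gini:', gini_sketch(y, -np.array(woe)))
print('IV:', iv_sketch(y, woe))
```

Note that this simple IV formula is undefined for a group with no goods or no bads, which is one reason why a smoothing coefficient can be useful.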

## Variable clustering

Show correlation matrix of all the WOE variables

In [None]:
cormat_full = data[sorted(cols_woe)].fillna(0).corr()

matplotlib.rcParams.update({'font.size': 15})
sns.set()
%matplotlib inline
%config InlineBackend.close_figures=True
a4_dims = (12,10)

fig, ax = plt.subplots(figsize=a4_dims, dpi=50)
fig.suptitle('Correlations of Variables',fontsize=25)
sns.heatmap(cormat_full, ax=ax, annot=True, fmt="0.1f", linewidths=.5, annot_kws={"size":15},cmap="OrRd")
plt.tick_params(labelsize=15)
plt.xticks(rotation=90)
plt.yticks(rotation=0)

plt.savefig(output_folder+'/analysis/correlation_full.png', bbox_inches='tight', dpi = 72)
plt.show()

### Option 1: Hierarchical variable clustering based on correlations.
- Starts with each variable as a separate cluster
- Creates clusters based on highest average correlations
- The stopping criterion is parameter *max_cluster_correlation* - once no correlation between clusters is larger than this parameter, the clustering is finished

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist

max_cluster_correlation = 0.5


if len(data[train_mask][cols_woe]) > 50000:
    data_for_clustering = data[train_mask][cols_woe].sample(50000)
else:
    data_for_clustering = data[train_mask][cols_woe]

Z = linkage(data_for_clustering.fillna(0).transpose(), method='average', metric='correlation')
clusters = fcluster(Z, 1-max_cluster_correlation, criterion='distance')

a4_dims = (10, max(1, int(len(clusters)/4))) #keep the figure height positive even for fewer than 4 variables
fig, ax = plt.subplots(figsize=a4_dims, dpi=50)
dendrogram(Z, labels=[a+': '+str(b) for a,b in zip(cols_woe,list(clusters))], orientation='right')
plt.axvline(x=1-max_cluster_correlation, c='k')
plt.xlabel('correlation')
plt.xticks([0,0.2,0.4,0.6,0.8,1],[1,0.8,0.6,0.4,0.2,0])
plt.savefig(output_folder+'/analysis/clustering_dendrogram.png', bbox_inches='tight', dpi = 72)
plt.savefig(output_folder+'/analysis/clustering_dendrogram_full.png', bbox_inches='tight', dpi = 300)
plt.show()
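
How the threshold works can be seen on a small synthetic example. This is a minimal sketch with made-up data: four variables, of which the first two and the last two are strongly correlated with each other. SciPy's 'correlation' distance is 1 minus the Pearson correlation, which is why the cut is made at 1 - *max_cluster_correlation*.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
base_a = rng.normal(size=500)
base_b = rng.normal(size=500)

# Four synthetic variables (one per row): two noisy copies of base_a and two of base_b.
variables = np.vstack([
    base_a + 0.1 * rng.normal(size=500),
    base_a + 0.1 * rng.normal(size=500),
    base_b + 0.1 * rng.normal(size=500),
    base_b + 0.1 * rng.normal(size=500),
])

max_cluster_correlation = 0.5

# Clusters are merged while their average distance stays below 1 - max_cluster_correlation.
Z = linkage(variables, method='average', metric='correlation')
labels = fcluster(Z, 1 - max_cluster_correlation, criterion='distance')
print(labels)
```

Because the distance is based on the signed correlation, strongly negatively correlated variables have a distance close to 2 and do not end up in the same cluster.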

Table with all the clusters. A subset of WOE variables called *vars_woe_restr* is created, which contains only the strongest (based on training Gini) variable of each cluster. It's up to the user to choose whether they want to use the full set (the default in this workflow) or this restricted set.

In [None]:
clustered_variables = pd.DataFrame({'Name':cols_woe,'Cluster':list(clusters)},index=cols_woe).join(power_out)
clustered_variables = clustered_variables.sort_values(['Cluster','Gini Train'],ascending=[True,False])[[
    'Cluster','Gini Train','IV Train']]
clustered_variables['Order in cluster'] = clustered_variables.sort_values('Gini Train', ascending=False).groupby('Cluster')\
             .cumcount() + 1
pd.options.display.max_rows = 1000
display(clustered_variables)
pd.options.display.max_rows = 15
clustered_variables.to_csv(output_folder+'/predictors/predictor_clusters.csv')

vars_woe_restr = list(clustered_variables[clustered_variables['Order in cluster']==1].index)
print('Restricted set of WOE variables:',vars_woe_restr)

### Option 2: k-means clustering with given k

Performs k-means clustering. The value of parameter *k* is set by the user.

This option is better performance-wise, but might give less interpretable results.

In [None]:
from sklearn.cluster import KMeans

k_param = 7


if len(data[train_mask][cols_woe]) > 50000:
    data_for_clustering = data[train_mask][cols_woe].sample(50000)
else:
    data_for_clustering = data[train_mask][cols_woe]
    
km = KMeans(n_clusters = k_param)
km.fit(data_for_clustering.fillna(0).transpose())

Table with all the clusters. A subset of WOE variables called *vars_woe_restr* is created, which contains only the strongest (based on training Gini) variable of each cluster. It's up to the user to choose whether they want to use the full set (the default in this workflow) or this restricted set.

In [None]:
clustered_variables = pd.DataFrame({'Name':cols_woe,'Cluster':list(km.labels_)},index=cols_woe).join(power_out)
clustered_variables = clustered_variables.sort_values(['Cluster','Gini Train'],ascending=[True,False])[[
    'Cluster','Gini Train','IV Train']]
clustered_variables['Order in cluster'] = clustered_variables.sort_values('Gini Train', ascending=False).groupby('Cluster')\
             .cumcount() + 1
pd.options.display.max_rows = 1000
display(clustered_variables)
pd.options.display.max_rows = 15
clustered_variables.to_csv(output_folder+'/predictors/predictor_clusters.csv')

vars_woe_restr = list(clustered_variables[clustered_variables['Order in cluster']==1].index)
print('Restricted set of WOE variables:',vars_woe_restr)

## Scorecard estimators

### 1) L1 regularized Logistic Regression

An efficient way to select a subset of predictors from a very large set of covariates. It uses a grid search over the value of the L1 regularization parameter. We start with no predictors in the model and try to add predictors from the list called **cols_shortlist**, which is defined below (by default, we put all the WOE variables there). The best model is selected based on validation Gini.

The iteration process can be tuned using various parameters:
 - *steps*: number of steps of grid search
 - *grid_length*: length of the grid for grid search
 - *max_predictors*: maximal number of predictors to enter the model. Ignored if set to 0.
 - *max_correlation*: maximal absolute value of correlation of predictors in the model (variable with larger correlation with existing predictors will not be added to the model)
 - *beta_sgn_criterion*: if this is set to True, all the betas in the model must have the same sign (all positive or all negative)
 - *stop_immediately*: the iteration process will be stopped immediately after a model which does not fulfil the criteria (max_predictors, max_correlation or beta_sgn_criterion) is found. No further models are searched for.
 - *correlation_sample*: for better performance, correlation matrix is calculated just on a sample of data. The size of the sample is set in this parameter
 
The *fit* method can be called with two arguments *fit(X,y)* or with four arguments *fit(X_train,y_train,X_valid,y_valid)*. When called with four arguments, the Gini is measured on the validation sample (i.e. the validation sample is used to decide which model is selected).
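
The idea of a grid search over the L1 regularization parameter with selection by validation Gini can be sketched with plain scikit-learn. This is a minimal illustration on synthetic data and not the implementation of *L1GiniModelSelection*; the grid, the data and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 5))
# Only the first two columns carry signal in this synthetic example.
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.uniform(size=2000) < 1 / (1 + np.exp(-logit))).astype(int)
X_train, y_train, X_valid, y_valid = X[:1400], y[:1400], X[1400:], y[1400:]

results = []
for C in np.logspace(-3, 1, 9):  # grid of inverse regularization strengths
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    model.fit(X_train, y_train)
    # Gini on the validation sample, computed as 2 * AUC - 1.
    gini_valid = 2 * roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]) - 1
    # Predictors with a non-zero coefficient are the ones that entered the model.
    n_used = int(np.sum(model.coef_[0] != 0))
    results.append((C, n_used, gini_valid))

best_C, best_n, best_gini = max(results, key=lambda r: r[2])
print('best C:', best_C, 'predictors used:', best_n, 'validation Gini:', round(best_gini, 3))
```

A small C means strong regularization, so few or no predictors have non-zero coefficients; as C grows, more predictors enter the model.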

In [None]:
#Define a shortlist of predictors to enter the modelling in the next steps.
cols_shortlist = cols_woe
#cols_shortlist = list(set(cols_woe) - set(['unwanted1','unwanted2']))

In [None]:
from scoring.model_selection import L1GiniModelSelection

modelL1 = L1GiniModelSelection(steps = 100, grid_length=5, max_predictors=200,
                           max_correlation=1, beta_sgn_criterion=False, stop_immediately=False, correlation_sample = 10000)

modelL1.fit(data[train_mask][cols_shortlist],data[train_mask][col_target],
        data[valid_mask][cols_shortlist],data[valid_mask][col_target]
       )

In [None]:
coefs_ = np.array(modelL1.coefs_)
cs = modelL1.model_progress_['C']
plt.figure(figsize = (7,7))
plt.plot(np.log10(cs), coefs_) 
ymin, ymax = plt.ylim()
plt.xlabel('log10(C)')
plt.ylabel('Coefficients')
plt.title('Logistic Regression Path')
plt.axis('tight')
plt.legend(cols_shortlist, loc='upper center', bbox_to_anchor=(1.20,1.0))
plt.savefig(output_folder+'/model/l1path.png', bbox_inches='tight', dpi = 72)
plt.show()

In [None]:
plt.figure(figsize = (7,7))
ginis = modelL1.model_progress_[['gini train','gini validate']]
plt.plot(np.log10(cs), ginis)
ymin, ymax = plt.ylim()
plt.xlabel('log10(C)')
plt.ylabel('Ginis')
plt.title('Logistic Regression Path')
plt.axis('tight')
plt.legend(['Train','Validate'], loc='upper center', bbox_to_anchor=(1.20,1.0))
plt.savefig(output_folder+'/model/l1gini.png', bbox_inches='tight', dpi = 72)
plt.show()

In [None]:
print('Predictors in the model:',list(modelL1.final_predictors_))

Save the model to disk.

In [None]:
model_filename1 = 'myModelL1'
# the with-block closes the file even if pickling fails
with open(model_filename1, 'wb') as f:
    pickle.dump(modelL1, f)

Load model.

In [None]:
#model_filename1 = 'myModelL1'
#modelL1 = pickle.load(open(model_filename1, 'rb'))

The drawback of a regularized model is that it is not calibrated, so it must be refitted afterwards. In this workflow, the stepwise regression that follows the L1 regression can serve this purpose (i.e. fitting a model with the same set or a subset of predictors, but without the regularization).
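
The following sketch illustrates the refitting idea using scikit-learn only. It is not the code of *L1GiniModelSelection* or *GiniStepwiseLogit*. The synthetic data, the value `C=0.01` and the variable names are assumptions chosen for the example. An L1-penalized fit is used only to select predictors. The model is then refitted on the selected columns with practically no regularization, so that the average prediction on the training data matches the observed bad rate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)
X = rng.normal(size=(3000, 5))
# synthetic target: columns 0 and 1 carry signal, the rest is noise
p = 1 / (1 + np.exp(-(1.0 * X[:, 0] + 0.8 * X[:, 1] - 1.5)))
y = (rng.uniform(size=3000) < p).astype(int)

# step 1: L1-regularized fit, used only to select predictors
l1 = LogisticRegression(penalty='l1', C=0.01, solver='liblinear').fit(X, y)
selected = np.flatnonzero(l1.coef_[0] != 0)

# step 2: refit on the selected predictors with (practically) no regularization
refit = LogisticRegression(C=1e10, max_iter=1000).fit(X[:, selected], y)

mean_l1 = l1.predict_proba(X)[:, 1].mean()
mean_refit = refit.predict_proba(X[:, selected])[:, 1].mean()
print('selected columns:     ', selected)
print('mean L1 prediction:   ', round(mean_l1, 4))
print('mean refit prediction:', round(mean_refit, 4))
print('observed bad rate:    ', round(y.mean(), 4))
```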

### 2) Stepwise logistic Regression

We run stepwise logistic regression on the training data set. We start with no predictors in the model and try to add predictors from the list called **cols_shortlist2**, which is defined below (by default, we put all the WOE variables there).

Stepwise process can be tuned using various parameters:
 - *initial_predictors*: set of starting predictors (useful for backward method)
 - *max_iter*: maximal number of iterations
 - *min_increase*: minimal marginal Gini contribution for predictor to be added
 - *max_decrease*: maximal marginal Gini decrease for a predictor to be removed
 - *max_predictors*: maximal number of predictors to enter the model. Ignored if set to 0.
 - *max_correlation*: maximal absolute value of correlation of predictors in the model (variable with larger correlation with existing predictors will not be added to the model). **This parameter works for "forward" selection method only.**
 - *beta_sgn_criterion*: if this is set to True, all the betas in the model must have the same sign (all positive or all negative). **This parameter works for "forward" selection method only.**
 - *penalty, C*: regularization parameters for logistic regression (sklearn library)
 - *correlation_sample*: for better performance, the correlation matrix is calculated only on a sample of the data. This parameter sets the size of the sample.
 - *selection_method*: 'stepwise', 'forward' or 'backward'
 
The *fit* method can be called with two arguments, *fit(X,y)*, or with four arguments, *fit(X_train,y_train,X_valid,y_valid)*. When called with four arguments, the Gini is measured on the validation sample (i.e. the validation sample drives the decisions about which steps the stepwise procedure takes).
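
The forward part of such a procedure can be sketched as follows. This is a simplified illustration written with scikit-learn, not the implementation of *GiniStepwiseLogit*: it has no removal step, no correlation check and no sign check. The function name `forward_gini_selection`, the synthetic data and the threshold (expressed here as a fraction of Gini, which may differ from the units the package uses) are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def forward_gini_selection(X_tr, y_tr, X_va, y_va, min_increase=0.005):
    """Greedy forward selection: add the column that raises validation Gini most."""
    selected, best_gini = [], 0.0
    remaining = list(range(X_tr.shape[1]))
    while remaining:
        trial = {}
        for c in remaining:
            cols = selected + [c]
            m = LogisticRegression(C=1e10, max_iter=1000).fit(X_tr[:, cols], y_tr)
            auc = roc_auc_score(y_va, m.predict_proba(X_va[:, cols])[:, 1])
            trial[c] = 2 * auc - 1
        best_c = max(trial, key=trial.get)
        if trial[best_c] - best_gini < min_increase:
            break  # no candidate adds enough Gini
        selected.append(best_c)
        remaining.remove(best_c)
        best_gini = trial[best_c]
    return selected, best_gini

rng = np.random.RandomState(2)
X = rng.normal(size=(4000, 4))
# synthetic target: column 0 is the strongest predictor, column 2 a weaker one
p = 1 / (1 + np.exp(-(1.2 * X[:, 0] + 0.6 * X[:, 2] - 1.0)))
y = (rng.uniform(size=4000) < p).astype(int)

selected, gini_valid = forward_gini_selection(X[:3000], y[:3000], X[3000:], y[3000:])
print('selected columns:', selected, 'validation Gini:', round(gini_valid, 3))
```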

In [None]:
#We can use the output from L1 model as a shortlist for the next step
cols_shortlist2 = list(modelL1.final_predictors_)
#cols_shortlist2 = cols_woe

In [None]:
from scoring.model_selection import GiniStepwiseLogit

modelSW = GiniStepwiseLogit(initial_predictors = set(), max_iter=1000, min_increase=0.8, max_decrease=0.5,
                    max_predictors=0, max_correlation=0.45, beta_sgn_criterion=False, 
                    penalty='l2', C=10e10, correlation_sample=10000,
                    selection_method='stepwise')

modelSW.fit(data[train_mask][cols_shortlist2],data[train_mask][col_target]
        ,data[valid_mask][cols_shortlist2],data[valid_mask][col_target]
       )

In [None]:
it = range(0,len(modelSW.model_progress_[modelSW.model_progress_['addrm']==0]['prednum']))
pn = modelSW.model_progress_[modelSW.model_progress_['addrm']==0]['prednum']
ginis = modelSW.model_progress_[modelSW.model_progress_['addrm']==0]['Gini']
plt.figure(figsize = (7,7))
plt.plot(it, ginis)
ymin, ymax = plt.ylim()
plt.xlabel('Iteration')
plt.ylabel('Gini')
plt.title('Stepwise model selection')
plt.axis('tight')
plt.savefig(output_folder+'/model/stepwisegini.png', bbox_inches='tight', dpi = 72)
plt.show()

Save the model to disk.

In [None]:
model_filename2 = 'myModelSW'
# the with-block closes the file even if pickling fails
with open(model_filename2, 'wb') as f:
    pickle.dump(modelSW, f)

Load model.

In [None]:
#model_filename2 = 'myModelSW'
#modelSW = pickle.load(open(model_filename2, 'rb'))

In [None]:
print('Predictors in the model:',list(modelSW.final_predictors_))

### Score the dataset
First, choose your final model and assign it to the variable *clf*.

In [None]:
#clf = modelL1
clf = modelSW

cols_final_predictors = list(clf.final_predictors_)
pd.DataFrame(cols_final_predictors).to_csv(output_folder+'/predictors/predictors.csv',index=False,header=None)

print('FINAL MODEL COEFFICIENTS')
print('Intercept:',clf.intercept_[0])
for p,b in zip(cols_final_predictors,list(clf.coef_[0])):
    print(p,':',b)

Create a new column with the prediction (probability of default).
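
For orientation, the sketch below shows how a probability of default follows from WOE values, betas and an intercept through the logistic function. This is the same formula as `1/(1+exp(-LINEAR_SCORE))` in the SQL export further below. The column names and coefficient values are hypothetical, and the code does not reproduce what `clf.predict` does internally.

```python
import numpy as np
import pandas as pd

# hypothetical WOE columns and coefficients, for illustration only
woe = pd.DataFrame({'AGE_WOE': [0.4, -0.2, 0.0],
                    'INCOME_WOE': [-0.3, 0.5, 0.1]})
intercept = -2.0
betas = pd.Series({'AGE_WOE': 0.9, 'INCOME_WOE': 1.1})

# linear score = intercept + sum of beta_i * WOE_i for each row
linear_score = intercept + woe[betas.index].mul(betas, axis=1).sum(axis=1)
# the logistic function maps the linear score to a probability
score = 1 / (1 + np.exp(-linear_score))
print(score.round(4).tolist())
```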

In [None]:
col_score = 'SCORE'

data[col_score] = clf.predict(data)
print('Column',col_score,'with the prediction added/modified. Number of columns:',data.shape[1])

## Scorecard table output
Output the scorecard to a table. Stats are calculated on a subset of data given by the mask defined below.

In [None]:
# this mask is a union of masks for training, validation, testing and out of time data sets
table_mask = train_mask|valid_mask|test_mask|oot_mask|hoot_mask

In [None]:
scorecard = []

if len(grouping.bins_data_) > 0:
    for v,g in grouping.bins_data_.items():
        if v+'_WOE' in clf.final_predictors_:
            ii = list(clf.final_predictors_).index(v+'_WOE')
            bin_names = []
            bin_woes = []
            data['sctable_print_subset'] = 0
            for j in range(0,len(g['bins'])):
                if (v in cols_pred_num) and (j < len(g['bins'])-1):
                    subset = data[(table_mask) & (data[v]>=g['bins'][j]) & (data[v]<g['bins'][j+1])]
                    data.loc[(table_mask) & (data[v]>=g['bins'][j]) & (data[v]<g['bins'][j+1]),'sctable_print_subset'] = 1
                    obs = subset[col_base].sum()
                    bads = subset[col_target].sum()
                    scorecard.append({'Variable':v,
                                     'Min':g['bins'][j],
                                     'Max':g['bins'][j+1],
                                     'Value':np.nan,
                                     'WOE':g['woes'][j],
                                     'Beta':clf.coef_[0][ii],
                                     'BiXi':g['woes'][j]*clf.coef_[0][ii],
                                     'Observations':obs,
                                     'Bads':bads})
                elif (v in cols_pred_cat):
                    if pd.isnull(list(g['bins'].keys())[j]):
                        subset = data[(table_mask) & (pd.isnull(data[v]))]
                        data.loc[(table_mask) & (pd.isnull(data[v])),'sctable_print_subset'] = 1
                        val = 'null'
                    else:
                        subset = data[(table_mask) & (data[v]==list(g['bins'].keys())[j])]
                        data.loc[(table_mask) & (data[v]==list(g['bins'].keys())[j]),'sctable_print_subset'] = 1
                        val = list(g['bins'].keys())[j]
                    obs = subset[col_base].sum()
                    bads = subset[col_target].sum()
                    scorecard.append({'Variable':v,
                                     'Min':np.nan,
                                     'Max':np.nan,
                                     'Value':val,
                                     'WOE':g['woes'][list(g['bins'].values())[j]],
                                     'Beta':clf.coef_[0][ii],
                                     'BiXi':g['woes'][list(g['bins'].values())[j]]*clf.coef_[0][ii],
                                     'Observations':obs,
                                     'Bads':bads})
            if (v in cols_pred_num):
                subset = data[(table_mask) & (pd.isnull(data[v]))]
                data.loc[(table_mask) & (pd.isnull(data[v])),'sctable_print_subset'] = 1
                obs = subset[col_base].sum()
                bads = subset[col_target].sum()
                scorecard.append({'Variable':v,
                                 'Min':np.nan,
                                 'Max':np.nan,
                                 'Value':'null',
                                 'WOE':g['nan_woe'],
                                 'Beta':clf.coef_[0][ii],
                                 'BiXi':g['nan_woe']*clf.coef_[0][ii],
                                 'Observations':obs,
                                 'Bads':bads})
            elif (v in cols_pred_cat):
                # parentheses are required: & binds more tightly than ==
                subset = data[(table_mask) & (data['sctable_print_subset']==0)]
                obs = subset[col_base].sum()
                bads = subset[col_target].sum()
                scorecard.append({'Variable':v,
                                 'Min':np.nan,
                                 'Max':np.nan,
                                 'Value':'else',
                                 'WOE':g['unknown_woe'],
                                 'Beta':clf.coef_[0][ii],
                                 'BiXi':g['unknown_woe']*clf.coef_[0][ii],
                                 'Observations':obs,
                                 'Bads':bads})

all_obs = data[table_mask][col_base].sum()
all_bads = data[table_mask][col_target].sum() 
scorecard.append({'Variable':'_Intercept',
                  'Value':np.nan,
                  'Min':np.nan,
                  'Max':np.nan,
                  'WOE':1,
                  'Beta':clf.intercept_[0],
                  'BiXi':1*clf.intercept_[0],
                  'Observations':all_obs,
                  'Bads':all_bads})

data.drop(columns=['sctable_print_subset'],inplace=True)

scorecard_out = pd.DataFrame.from_records(scorecard)[
    ['Variable','Min','Max','Value','WOE','Beta','BiXi','Observations','Bads']]
scorecard_out2 = scorecard_out.copy()
scorecard_out2['Value'] = scorecard_out2['Value'] + ','
scorecard_out2 = scorecard_out2.groupby(['Variable','WOE']).agg({
    'Variable':min,'Min':min,'Max':max,'Value':sum,'WOE':min,'Beta':min,'BiXi':min,'Observations':sum,'Bads':sum
})
scorecard_out2.loc[pd.isnull(scorecard_out2['Value']),'Value'] = ','
scorecard_out2['Value'] = scorecard_out2['Value'].astype(str).str[:-1]
scorecard_out2['Goods'] = scorecard_out2['Observations'] - scorecard_out2['Bads']
scorecard_out2['Bad Rate'] = scorecard_out2['Bads']/scorecard_out2['Observations']
all_badrate = all_bads/all_obs
scorecard_out2['Bad Rate relative to population'] = scorecard_out2['Bad Rate'] / all_badrate
scorecard_out2['% Observations'] = scorecard_out2['Observations'] / all_obs
scorecard_out2['% Bads'] = scorecard_out2['Bads'] / all_bads
scorecard_out2['% Goods'] = scorecard_out2['Goods'] / (all_obs-all_bads)
scorecard_out2['Lift'] = scorecard_out2['% Bads'] / scorecard_out2['% Goods']
scorecard_out2 = pd.DataFrame.from_records(scorecard_out2.sort_values(['Variable','Min','Max','WOE','Value']))

pd.options.display.max_rows = 1000
#display(scorecard_out)
display(scorecard_out2)
pd.options.display.max_rows = 15
scorecard_out2.to_csv(output_folder+'/model/scorecard.csv')

## Scorecard export as SQL query, Blaze table and Python code

Generate SQL code to run the scorecard on Oracle DWH.
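
The inner part of the generated query maps each variable to its WOE with a `case` expression. As a simplified sketch of that idea for a single numeric variable, consider the hypothetical helper `woe_case_sql` below. It is not part of the workflow, and the variable name, bin borders and WOE values are made up. Unlike the cell that follows, it emits the last interval as the `else` branch.

```python
import numpy as np

def woe_case_sql(variable, bins, woes, nan_woe):
    """Build a SQL CASE expression mapping a numeric variable to its WOE.

    bins are interval borders (first is -inf, last is +inf);
    woes holds one value per interval.
    """
    lines = ['case', f'    when {variable} is null then {nan_woe}']
    for upper, woe in zip(bins[1:], woes):
        if np.isfinite(upper):
            # branches are evaluated in order, so the upper border is enough
            lines.append(f'    when {variable} < {upper} then {woe}')
        else:
            lines.append(f'    else {woe}')
    lines.append(f'end as {variable}_WOE')
    return '\n'.join(lines)

sql = woe_case_sql('AGE', [-np.inf, 25, 40, np.inf], [0.5, 0.1, -0.3], 0.0)
print(sql)
```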

In [None]:
#OUTER PART TRANSFORMING WOE TO BIXI
scoring_sql_outer = ['select\n1/(1+exp(-s.LINEAR_SCORE)) as SCORE,\ns.*\nfrom (\n    select\n']
#INNER PART TRANSFORMING VARIABLE TO WOE
scoring_sql_inner = ['        select\n']
nullWOE = None
elseWOE = None
tmp_variable = ''
for r in scorecard_out.itertuples():
    if r.Variable != tmp_variable:
        if tmp_variable != '':
            #OUTER PART TRANSFORMING WOE TO BIXI
            scoring_sql_outer.append('     + ')
            #INNER PART TRANSFORMING VARIABLE TO WOE
            if elseWOE is None: elseWOE = nullWOE
            if elseWOE is None: elseWOE = 0
            scoring_sql_inner.append('            else ' + str(elseWOE) + '\n        end as ' + str(tmp_variable) + '_WOE,\n')
        else:
            #OUTER PART TRANSFORMING WOE TO BIXI
            scoring_sql_outer.append('    ')
        #OUTER PART TRANSFORMING WOE TO BIXI
        scoring_sql_outer.append('w.' + str(r.Variable) + '_WOE * ' + str(r.Beta) + '\n')
        #INNER PART TRANSFORMING VARIABLE TO WOE
        scoring_sql_inner.append('        case\n')
        tmp_variable = r.Variable
        nullWOE = None
        elseWOE = None
    if r.Value == 'null':
        scoring_sql_inner.append('            when ' + str(r.Variable) + ' is null then ' + str(r.WOE) + '\n')
        nullWOE = r.WOE
    elif r.Value == 'else':
        elseWOE = r.WOE
    elif pd.notnull(r.Value):
        scoring_sql_inner.append('            when ' + str(r.Variable) + ' = "' + str(r.Value) + '" then ' + str(r.WOE) + '\n')
    elif pd.notnull(r.Min):
        if np.isfinite(r.Max):
            scoring_sql_inner.append('            when ' + str(r.Variable) + ' < ' + str(r.Max) + ' then ' + str(r.WOE) + '\n')
        else:
            scoring_sql_inner.append('            when ' + str(r.Variable) + ' >= ' + str(r.Min) + ' then ' + str(r.WOE) + '\n')
    elif r.Variable == '_Intercept':
        scoring_sql_inner.append('            when 1=1 then ' + str(r.WOE) + '\n')
#OUTER PART TRANSFORMING WOE TO BIXI
scoring_sql_outer.append('    as LINEAR_SCORE,\n    w.*\n    from (\n')
#INNER PART TRANSFORMING VARIABLE TO WOE
if elseWOE is None: elseWOE = nullWOE
if elseWOE is None: elseWOE = 0
scoring_sql_inner.append('            else ' + str(elseWOE) + '\n        end as ' + str(tmp_variable) + '_WOE\n')
scoring_sql_inner.append('        from _SOURCETABLENAME_\n')
scoring_sql_outer = ''.join(scoring_sql_outer)
scoring_sql_inner = ''.join(scoring_sql_inner)
scoring_sql_final = scoring_sql_outer + scoring_sql_inner + '    ) w\n) s'
scoring_sql_final = scoring_sql_final.replace('"',"'").replace('_Intercept','Intercept')
with open(output_folder+'/model/scorecard.sql', "w", encoding='utf-8') as f:
    print(scoring_sql_final, file=f)
print(scoring_sql_final)

Generate table for easy import to Blaze.

In [None]:
balze_table =  pd.DataFrame(columns = ['Characteristic','Bin','Label','Score','ScorePoints',
                        'Range','Formula','Number_of_Values','Value1','Value2'])

for r in scorecard_out.itertuples():  
    if (r.Variable=='_Intercept'):
        balze_table=balze_table.append(pd.DataFrame( data= [['Intercept',1,'K00_All',
                                                         r.BiXi,0,1,'Intercept=\'integer\'',
                                                         0,'','']],
                                   columns = ['Characteristic','Bin','Label','Score','ScorePoints',
                                                  'Range','Formula','Number_of_Values','Value1','Value2']))
        balze_table=balze_table.append(pd.DataFrame( data= [['Intercept',2,'All Other',
                                                         0,0,1,'Intercept=\'integer\'',
                                                         0,'','']],
                                   columns = ['Characteristic','Bin','Label','Score','ScorePoints',
                                                  'Range','Formula','Number_of_Values','Value1','Value2']))

prv_Variable = ''
prv_BiXi = ''
prv_Value1 = ''
Bin = 1
k=0
for r in scorecard_out.itertuples():  
    if (r.Variable==prv_Variable):
        prv_Bin=Bin
        Bin+=1
    else:  
        prv_Bin=Bin
        Bin=1
        k=0
    if (r.BiXi!=prv_BiXi):
        k+=1  
    Number_of_Values=1   
    Value1 = ''
    Value2 = ''
    Formula = ''
    if pd.notnull(r.Value):
        Value1 = r.Value
        Label = 'K'+str(k)+'_{'+str(Value1)+'}'
        Formula = r.Variable + '= \'character\''
    if r.Value=='null':
        Value1 = ''
        Label = 'NA'
        Formula = ''
    if r.Value=='else':
        prv_Value1 = r.Value
        Value1 = ''
        Label = 'All Other'   
        Formula=''
    if pd.notnull(r.Min) and pd.notnull(r.Max) and (r.Min!=-np.inf) and (r.Max!=np.inf):
        Number_of_Values=2
        Value1 = r.Min
        Value2 = r.Max
        Label = 'K'+str(k)+'_'+str(Value1)+'_'+str(Value2)
        Formula = r.Variable+ ' \'integer1\' <= .. <\'integer2\''
    if pd.notnull(r.Min) and (r.Max==np.inf):
        Value1 = r.Min
        Label = 'K'+str(k)+'_'+str(Value1)
        Formula = r.Variable+ ' >=\'integer\''  
    if pd.notnull(r.Max) and (r.Min==-np.inf):
        Value1 = r.Max
        Label = 'K'+str(k)+'_'+str(Value1) 
        Formula = r.Variable+ ' <\'integer\''
    if (r.Variable!=prv_Variable) and (r.Variable!='_Intercept'):
        if prv_Value1!='else' and prv_Variable != '':
            balze_table=balze_table.append(pd.DataFrame( data= [[prv_Variable,prv_Bin+1,'All Other',
                                                             0,'','','',
                                                             '','','']],
                                       columns = ['Characteristic','Bin','Label','Score','ScorePoints',
                                                      'Range','Formula','Number_of_Values','Value1','Value2']))
        prv_Value1 = ''
        prv_Variable = r.Variable
    if (r.Variable!='_Intercept'):   
        balze_table=balze_table.append(pd.DataFrame( data= [[r.Variable,Bin,Label,
                                                             r.BiXi,0,1,Formula,
                                                             Number_of_Values,Value1,Value2]],
                                       columns = ['Characteristic','Bin','Label','Score','ScorePoints',
                                                      'Range','Formula','Number_of_Values','Value1','Value2']))
        
balze_table.to_csv(output_folder+'/model/blaze_table.csv',sep = ';')  
display(balze_table)

Generate Python code to run the scorecard in any Python script independently of this workflow.

In [None]:
#OUTER PART TRANSFORMING WOE TO BIXI
scoring_python_outer = ['\n    LINEAR_SCORE = \\\n']
                        
#INNER PART TRANSFORMING VARIABLE TO WOE
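#the generated score() calls np.exp, so the exported file needs its own numpy import
scoring_python_inner = ['import numpy as np\n\ndef score(row):']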
tmp_variable = ''
nullWOE = 0
for r in scorecard_out.itertuples():
    if r.Variable != tmp_variable:
        if tmp_variable != '':
            #OUTER PART TRANSFORMING WOE TO BIXI
            scoring_python_outer.append(' + \\\n    ')
            #INNER PART TRANSFORMING VARIABLE TO WOE
            scoring_python_inner.append('    else: ' + str(tmp_variable) + '_WOE = ' + str(nullWOE) + '\n')
        else:
            #OUTER PART TRANSFORMING WOE TO BIXI
            scoring_python_outer.append('    ')

        #OUTER PART TRANSFORMING WOE TO BIXI
        scoring_python_outer.append(str(r.Variable) + '_WOE * ' + str(r.Beta))
        #INNER PART TRANSFORMING VARIABLE TO WOE
        scoring_python_inner.append('\n')
        tmp_variable = r.Variable
        nullWOE = 0
        
        scoring_python_inner.append('    if ')
    else:
        scoring_python_inner.append('    elif ')
        
    if r.Value == 'null':
        scoring_python_inner.append('row[\'' + str(r.Variable) + '\'] != row[\'' + str(r.Variable) + '\']: ' + str(tmp_variable) + '_WOE = ' + str(r.WOE) + '\n')
        nullWOE = r.WOE
    elif pd.notnull(r.Value):
        scoring_python_inner.append('row[\'' + str(r.Variable) + '\'] == "' + str(r.Value) + '": ' + str(tmp_variable) + '_WOE = ' + str(r.WOE) + '\n')
    elif pd.notnull(r.Min):
        if np.isfinite(r.Max):
            scoring_python_inner.append('row[\'' + str(r.Variable) + '\'] < ' + str(r.Max) + ': ' + str(tmp_variable) + '_WOE = ' + str(r.WOE) + '\n')
        else:
            scoring_python_inner.append('row[\'' + str(r.Variable) + '\'] >= ' + str(r.Min) + ': ' + str(tmp_variable) + '_WOE = ' + str(r.WOE) + '\n')
    elif r.Variable == '_Intercept':
        scoring_python_inner.append('1==1: _Intercept_WOE = ' + str(r.WOE) + '\n')

#INNER PART TRANSFORMING VARIABLE TO WOE
scoring_python_inner.append('    else: ' + str(tmp_variable) + '_WOE = ' + str(nullWOE) + '\n')



scoring_python_outer.append('\n\n    SCORE = 1-1/(1+np.exp(LINEAR_SCORE))\n') 
scoring_python_outer.append('\n    return SCORE\n') 

scoring_python_outer = ''.join(scoring_python_outer)
scoring_python_inner = ''.join(scoring_python_inner)
scoring_python_final = scoring_python_inner + scoring_python_outer
with open(output_folder+'/model/scorecard.py', "w") as f:
    print(scoring_python_final, file=f)
print(scoring_python_final)

## Performance characteristics
Performance characteristics of the model (Gini, Lift) and their visualisations.
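
As a reference for how these two measures can be defined, here is a minimal sketch. It is not the code of `scoring.metrics`. It assumes that Gini is 2*AUC - 1 and that lift at *perc*% is the bad rate among the *perc*% observations with the highest score divided by the overall bad rate. The function names and the toy data are hypothetical, and the package may use a different sign convention for the score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gini_sketch(y_true, y_score):
    """Gini coefficient as 2*AUC - 1."""
    return 2 * roc_auc_score(y_true, y_score) - 1

def lift_sketch(y_true, y_score, perc=10):
    """Bad rate among the perc% highest scores divided by the overall bad rate."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score), kind='stable')  # riskiest first
    n_top = max(1, int(round(len(y_true) * perc / 100)))
    return y_true[order][:n_top].mean() / y_true.mean()

# toy data: 3 bads out of 10, two of them have the highest scores
y = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])
s = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
gini_value = gini_sketch(y, s)
lift_value = lift_sketch(y, s, perc=20)
print(gini_value, lift_value)
```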

In [None]:
from scoring.metrics import gini, lift
lift_perc = 10

If some of these samples (train, valid, test, OOT, HOOT) are empty (i.e. you don't use them), **you need to comment out the corresponding rows in some of the following cells.** Such rows are marked by short comments at their ends.

In [None]:
perf = pd.DataFrame({'sample':[
    'train', #train
    'valid', #valid
    'test', #test
    'oot', #OOT
    'hoot' #HOOT
    ], 'gini':[
    gini(data[train_mask][col_target],data[train_mask][col_score]) #train
    ,gini(data[valid_mask][col_target],data[valid_mask][col_score]) #valid
    ,gini(data[test_mask][col_target],data[test_mask][col_score]) #test
    ,gini(data[oot_mask][col_target],data[oot_mask][col_score]) #OOT
    ,gini(data[hoot_mask][col_target],data[hoot_mask][col_score]) #HOOT
    ], 'lift_'+str(lift_perc):[
    lift(data[train_mask][col_target],-data[train_mask][col_score],lift_perc) #train
    ,lift(data[valid_mask][col_target],-data[valid_mask][col_score],lift_perc) #valid
    ,lift(data[test_mask][col_target],-data[test_mask][col_score],lift_perc) #test
    ,lift(data[oot_mask][col_target],-data[oot_mask][col_score],lift_perc) #OOT
    ,lift(data[hoot_mask][col_target],-data[hoot_mask][col_score],lift_perc) #HOOT
    ]}).set_index('sample')

In [None]:
display(perf)
perf.to_csv(output_folder+'/performance/performance.csv')

In [None]:
#calculate data for Gini and Lift curves
from scoring.tools import calculate_gini_and_lift
train_stats, train_curve = calculate_gini_and_lift(data[train_mask], col_target, col_score, pct = lift_perc) #train
train_curve = list(zip(*train_curve))                                                                        #train
valid_stats, valid_curve = calculate_gini_and_lift(data[valid_mask], col_target, col_score, pct = lift_perc) #valid
valid_curve = list(zip(*valid_curve))                                                                        #valid
test_stats, test_curve = calculate_gini_and_lift(data[test_mask], col_target, col_score, pct = lift_perc)    #test
test_curve = list(zip(*test_curve))                                                                          #test
oot_stats, oot_curve = calculate_gini_and_lift(data[oot_mask], col_target, col_score, pct = lift_perc)       #oot
oot_curve = list(zip(*oot_curve))                                                                            #oot
hoot_stats, hoot_curve = calculate_gini_and_lift(data[hoot_mask], col_target, col_score, pct = lift_perc)    #hoot
hoot_curve = list(zip(*hoot_curve))                                                                          #hoot

In [None]:
plt.figure(figsize = (7,7))
plt.axis([0, 1, 0, 1])
plt.plot([0] + list(train_curve[2]),[0] + list(train_curve[3]), label = 'Train', color = 'g') #train
plt.plot([0] + list(valid_curve[2]), [0] + list(valid_curve[3]), label = 'Validation', color = 'r') #valid
plt.plot([0] + list(test_curve[2]), [0] + list(test_curve[3]), label = 'Test', color = 'y') #test
plt.plot([0] + list(oot_curve[2]), [0] + list(oot_curve[3]), label = 'OOT', color = 'b') #oot
plt.plot([0] + list(hoot_curve[2]), [0] + list(hoot_curve[3]), label = 'Hist.OOT', color = 'm') #hoot
plt.plot(list(range(0, 101)), list(range(0, 101)), color='k')
plt.xlabel('Cumulative good count')
plt.ylabel('Cumulative bad count')
plt.legend(loc = "lower right")
plt.savefig(output_folder+'/performance/roc.png', bbox_inches='tight', dpi = 72)
plt.show()

In [None]:
plt.figure(figsize = (10,5))
plt.axis([0, 100, 0, max(train_curve[1])+0.5])
plt.plot(train_curve[0], train_curve[1], label = 'Train', color = 'g') #train
plt.plot(valid_curve[0], valid_curve[1], label = 'Validation', color = 'r') #valid
plt.plot(test_curve[0], test_curve[1], label = 'Test', color = 'y') #test
plt.plot(oot_curve[0], oot_curve[1], label = 'OOT', color = 'b') #oot
plt.plot(hoot_curve[0], hoot_curve[1], label = 'Hist.OOT', color = 'm') #hoot
plt.xlabel('Cumulative count [%]')
plt.ylabel('Lift')
plt.legend(loc = "upper right")
plt.savefig(output_folder+'/performance/lift.png', bbox_inches='tight', dpi = 72)
plt.show()

Gini in time
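
The idea behind this chart, one Gini value per month, can be sketched without the plotting code as follows. The helper name `monthly_gini`, the column names and the synthetic data are assumptions. The helper computes Gini as 2*AUC - 1 with `roc_auc_score`, so no sign flip is needed afterwards.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def monthly_gini(df, col_month, col_target, col_score):
    """Gini (2*AUC - 1) of the score computed separately for each month."""
    rows = {}
    for month, grp in df.groupby(col_month):
        if grp[col_target].nunique() < 2:
            rows[month] = np.nan  # AUC is undefined with a single class
        else:
            rows[month] = 2 * roc_auc_score(grp[col_target], grp[col_score]) - 1
    return pd.Series(rows).sort_index()

rng = np.random.RandomState(3)
demo = pd.DataFrame({'MONTH': np.repeat([201801, 201802, 201803], 500),
                     'SCORE': rng.uniform(size=1500)})
# synthetic target drawn with probability equal to the score
demo['TARGET'] = (rng.uniform(size=1500) < demo['SCORE']).astype(int)

gini_by_month = monthly_gini(demo, 'MONTH', 'TARGET', 'SCORE')
print(gini_by_month)
```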

In [None]:
from sklearn.metrics import roc_curve, auc

def proc_gini(x,y,z):
    fpr, tpr, _ = roc_curve(x[y], x[z], pos_label=0)
    roc_gini = (auc(fpr, tpr)-0.5)*2
    return roc_gini
%matplotlib inline
plt.figure(figsize = (10,7))

len1 = 0

grouped = data[hoot_mask].groupby(col_month, axis=0) #hoot
res_hoot= grouped.apply(proc_gini, col_target ,col_score) #hoot
plt.plot(range(len1,len1+len(res_hoot)),-res_hoot, linewidth=2.0,label='hist. OOT', color = 'm', marker='o') #hoot

if res_hoot is not None: len1 = len1 + len(res_hoot)

grouped = data[train_mask].groupby(col_month, axis=0) #train
res_train= grouped.apply(proc_gini, col_target ,col_score) #train
plt.plot(range(len1,len1+len(res_train)),-res_train, linewidth=2.0,label='Train', color = 'g', marker='o') #train
grouped = data[valid_mask].groupby(col_month, axis=0) #valid
res_valid= grouped.apply(proc_gini, col_target ,col_score) #valid
plt.plot(range(len1,len1+len(res_valid)),-res_valid, linewidth=2.0,label='Validation', color = 'r', marker='o') #valid
grouped = data[test_mask].groupby(col_month, axis=0) #test
res_test= grouped.apply(proc_gini, col_target ,col_score) #test
plt.plot(range(len1,len1+len(res_test)),-res_test, linewidth=2.0,label='Test', color = 'y', marker='o') #test

if res_train is not None: len1 = len1 + len(res_train)

grouped = data[oot_mask].groupby(col_month, axis=0) #oot
res_oot= grouped.apply(proc_gini, col_target ,col_score) #oot
plt.plot(range(len1,len1+len(res_oot)),-res_oot, linewidth=2.0,label='OOT', color = 'b', marker='o') #oot

# len1 already counts the hist. OOT and train months, so the ticks cover every plotted position
plt.xticks(range(len1+len(res_oot)), np.sort(data[col_month].unique()), rotation=45)

plt.ylim([0,1])
plt.title('Gini by months')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('Months')
plt.ylabel('Gini')
plt.savefig(output_folder+'/performance/ginistability.png', bbox_inches='tight', dpi = 72)
plt.show()

Calibration chart
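
A calibration chart compares the average predicted probability with the observed bad rate in bins of the score. The table behind such a chart can be sketched as below. This is an illustration only and not the code of `plot_calib`. The helper name `calibration_table` and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd

def calibration_table(score, target, bins=20):
    """Average predicted score vs. observed bad rate in equally populated score bins."""
    df = pd.DataFrame({'score': np.asarray(score), 'target': np.asarray(target)})
    df['bin'] = pd.qcut(df['score'], bins, labels=False, duplicates='drop')
    out = df.groupby('bin').agg({'score': 'mean', 'target': 'mean'})
    return out.rename(columns={'score': 'avg_score', 'target': 'bad_rate'})

rng = np.random.RandomState(5)
s = rng.uniform(0.01, 0.5, size=20000)
t = (rng.uniform(size=20000) < s).astype(int)  # target drawn from the score itself

calib = calibration_table(s, t, bins=10)
print(calib)
```

For a well calibrated score, `avg_score` and `bad_rate` stay close to each other in every bin, as they do in this synthetic example by construction.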

In [None]:
from scoring.plot import plot_calib
plot_calib(data[col_score],data[col_target],bins=20,savepath=output_folder+'/model/')

## Correlations
Calculate and visualise correlation matrix

In [None]:
cormat = data[cols_final_predictors].corr()

matplotlib.rcParams.update({'font.size': 15})
sns.set()
%matplotlib inline
%config InlineBackend.close_figures=True
a4_dims = (12,10)

fig, ax = plt.subplots(figsize=a4_dims, dpi=50)
fig.suptitle('Correlations of Predictors',fontsize=25)
sns.heatmap(cormat, ax=ax, annot=True, fmt="0.1f", linewidths=.5, annot_kws={"size":15},cmap="OrRd")
plt.tick_params(labelsize=15)
plt.xticks(rotation=90)
plt.yticks(rotation=0)

plt.savefig(output_folder+'/analysis/correlation.png', bbox_inches='tight', dpi = 72)
plt.show()

Show the list of the highest correlations (restricted to correlations that are, in absolute value, higher than the *max_ok_correlation* parameter):
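
The same list can also be obtained without explicit loops by taking the upper triangle of the correlation matrix. The sketch below is an alternative illustration, not the workflow's own code. The helper name `top_correlations` and the small example matrix are assumptions.

```python
import numpy as np
import pandas as pd

def top_correlations(cormat, threshold=0.0):
    """List predictor pairs with |correlation| above threshold, strongest first."""
    # indices of the upper triangle without the diagonal: each pair appears once
    i, j = np.triu_indices(len(cormat), k=1)
    pairs = pd.DataFrame({'predictor_1': cormat.index[i],
                          'predictor_2': cormat.columns[j],
                          'correlation': cormat.values[i, j]})
    pairs['abs_correlation'] = pairs['correlation'].abs()
    pairs = pairs[pairs['abs_correlation'] > threshold]
    return pairs.sort_values('abs_correlation', ascending=False).reset_index(drop=True)

cm = pd.DataFrame([[1.0, 0.2, -0.7],
                   [0.2, 1.0, 0.1],
                   [-0.7, 0.1, 1.0]],
                  index=list('abc'), columns=list('abc'))
top = top_correlations(cm, threshold=0.15)
print(top)
```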

In [None]:
max_ok_correlation = 0.0

# find highest pairwise correlation (correlation greater than .. in absolute value)
hicors = []
for i in range(0,len(cormat)):
    # start at i+1 so that each pair is visited once and the diagonal is skipped
    for j in range(i+1,len(cormat)):
        if abs(cormat.iloc[i,j]) > max_ok_correlation:
            hicors.append((i,j,cormat.index[i],cormat.index[j],cormat.iloc[i,j],abs(cormat.iloc[i,j])))
hicors.sort(key= lambda tup: tup[5], reverse=True)

hicors2 = pd.DataFrame(list(zip(*list(zip(*hicors))[2:5])))

# print list of highest correlations
hicors2

## Time stability of predictors

Set metadata for the stability charts. Two types of charts will be drawn:
- Stability of default rate, for which the variables with default and with base need to be set
- Stability of population, for which the variable with observation count needs to be set
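
The two quantities behind these charts can be sketched with plain pandas as follows. This only illustrates the definitions and is not the code of `stability_chart`. The column names and the toy data are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    'MONTH':   [201801, 201801, 201801, 201802, 201802, 201802],
    'AGE_WOE': [0.5, 0.5, -0.3, 0.5, -0.3, -0.3],
    'TARGET':  [0, 1, 1, 0, 0, 1],
    'BASE':    [1, 1, 1, 1, 1, 1],
})
demo['ones'] = 1  # observation counter

grouped = demo.groupby(['MONTH', 'AGE_WOE'])
# default rate = sum of target / sum of base, per month and predictor bin
default_rate = (grouped['TARGET'].sum() / grouped['BASE'].sum()).unstack('AGE_WOE')
# population share = observations in a bin / observations in the month
counts = grouped['ones'].sum().unstack('AGE_WOE')
population_share = counts.div(counts.sum(axis=1), axis=0)

print(default_rate)
print(population_share)
```

A predictor is stable in time when the rows of both tables change little from month to month.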

In [None]:
target_for_default = col_target
base_for_default = col_base
data['ones'] = 1
obs_for_population = 'ones'

In [None]:
from scoring.plot import stability_chart

for j in list(clf.final_predictors_):
    stability_chart(data[j],data[target_for_default],data[base_for_default],data[obs_for_population],data[col_month],
                   savepath=output_folder+'/stability/'+j+'_')

## Comparison with another score
Charts similar to those already drawn for the new scorecard are now used to compare it with another scorecard. The value of the old score should be stored in a dedicated column of the original data set.

In [None]:
col_oldscore = 'OLD_SCORE'

#if the score gives the complementary probability (of non-default), run this:
data[col_oldscore]=1-data[col_oldscore]

In [None]:
perf_oldscore = pd.DataFrame({'scorecard':[
     'old' #valid
    ,'old' #test
    ,'old' #oot
    ,'old' #hoot
    ,'new' #valid
    ,'new' #test
    ,'new' #oot
    ,'new' #hoot
    ],'sample':[
    'valid' #valid
    ,'test' #test
    ,'oot'  #oot
    ,'hoot' #hoot
    ,'valid'#valid
    ,'test' #test
    ,'oot'  #oot
    ,'hoot' #hoot
    ], 'gini':[
    gini(data[valid_mask][col_target],data[valid_mask][col_oldscore]) #valid
    ,gini(data[test_mask][col_target],data[test_mask][col_oldscore]) #test
    ,gini(data[oot_mask][col_target],data[oot_mask][col_oldscore]) #oot
    ,gini(data[hoot_mask][col_target],data[hoot_mask][col_oldscore]) #hoot
    ,gini(data[valid_mask][col_target],data[valid_mask][col_score]) #valid
    ,gini(data[test_mask][col_target],data[test_mask][col_score]) #test
    ,gini(data[oot_mask][col_target],data[oot_mask][col_score]) #oot
    ,gini(data[hoot_mask][col_target],data[hoot_mask][col_score]) #hoot
    ], 'lift_'+str(lift_perc):[
    lift(data[valid_mask][col_target],-data[valid_mask][col_oldscore],lift_perc) #valid
    ,lift(data[test_mask][col_target],-data[test_mask][col_oldscore],lift_perc) #test
    ,lift(data[oot_mask][col_target],-data[oot_mask][col_oldscore],lift_perc) #oot
    ,lift(data[hoot_mask][col_target],-data[hoot_mask][col_oldscore],lift_perc) #hoot
    ,lift(data[valid_mask][col_target],-data[valid_mask][col_score],lift_perc) #valid
    ,lift(data[test_mask][col_target],-data[test_mask][col_score],lift_perc) #test
    ,lift(data[oot_mask][col_target],-data[oot_mask][col_score],lift_perc) #oot
    ,lift(data[hoot_mask][col_target],-data[hoot_mask][col_score],lift_perc) #hoot
    ]}).set_index('sample').pivot(columns='scorecard')

In [None]:
display(perf_oldscore)
perf_oldscore.to_csv(output_folder+'/performance/performance_oldscore.csv')

In [None]:
from scoring.tools import calculate_gini_and_lift
newscore_stats, newscore_curve = calculate_gini_and_lift(data[valid_mask|test_mask|oot_mask|hoot_mask],
                                                         col_target, col_score, pct = lift_perc)
newscore_curve = list(zip(*newscore_curve))
oldscore_stats, oldscore_curve = calculate_gini_and_lift(data[valid_mask|test_mask|oot_mask|hoot_mask],
                                                         col_target, col_oldscore, pct = lift_perc)
oldscore_curve = list(zip(*oldscore_curve))

In [None]:
plt.figure(figsize = (7,7))
plt.axis([0, 1, 0, 1])
plt.plot([0] + list(newscore_curve[2]),[0] + list(newscore_curve[3]), label = 'new score', color = 'g')
plt.plot([0] + list(oldscore_curve[2]), [0] + list(oldscore_curve[3]), label = 'old score', color = 'r')
plt.plot(list(range(0, 101)), list(range(0, 101)), color='k')
plt.xlabel('Cumulative good count')
plt.ylabel('Cumulative bad count')
plt.legend(loc = "lower right")
plt.savefig(output_folder+'/performance/roc_oldscore.png', bbox_inches='tight', dpi = 72)
plt.show()

In [None]:
plt.figure(figsize = (10,5))
plt.axis([0, 100, 0, max(max(newscore_curve[1]), max(oldscore_curve[1]))+0.5])
plt.plot(newscore_curve[0], newscore_curve[1], label = 'new score', color = 'g')
plt.plot(oldscore_curve[0], oldscore_curve[1], label = 'old score', color = 'r')
plt.xlabel('Cumulative count [%]')
plt.ylabel('Lift')
plt.legend(loc = "upper right")
plt.savefig(output_folder+'/performance/lift_oldscore.png', bbox_inches='tight', dpi = 72)
plt.show()
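
The `lift` function from the *scoring* package is not shown in this workflow. As a rough illustration of the usual definition (bad rate among the riskiest `pct` percent of cases divided by the overall bad rate), a minimal sketch could look like this. The name `lift_sketch`, the toy data and the convention that a higher `risk` value means a riskier case are assumptions made for the example; the package's own implementation and sign convention may differ.

```python
import numpy as np
import pandas as pd

def lift_sketch(target, risk, pct):
    """Bad rate in the riskiest pct % of cases divided by the overall bad rate.

    Assumption: higher `risk` means riskier; this is not the scoring package's code.
    """
    df = pd.DataFrame({"target": np.asarray(target), "risk": np.asarray(risk)})
    # Stable sort so that ties keep their original order
    df = df.sort_values("risk", ascending=False, kind="mergesort")
    n_top = max(1, int(np.ceil(len(df) * pct / 100.0)))
    return df["target"].head(n_top).mean() / df["target"].mean()

# Toy data: 3 bads out of 10 cases, the two riskiest cases are both bad
toy_target = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
toy_risk = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_lift = lift_sketch(toy_target, toy_risk, 20)
print(toy_lift)
```

On the toy data the top 20 % contains only bad cases, so the lift equals 1 divided by the overall bad rate of 0.3.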

In [None]:
from sklearn.metrics import roc_curve, auc

def proc_gini(x,y,z):
    fpr, tpr, _ = roc_curve(x[y], x[z], pos_label=0)
    roc_gini = (auc(fpr, tpr)-0.5)*2
    return roc_gini
%matplotlib inline
plt.figure(figsize = (10,7))
grouped = data[valid_mask|test_mask|oot_mask|hoot_mask].groupby(col_month, axis=0)
res_new= grouped.apply(proc_gini, col_target ,col_score)
plt.plot(range(len(res_new)),-res_new, linewidth=2.0,label='new score', color = 'g', marker='o')

grouped = data[valid_mask|test_mask|oot_mask|hoot_mask].groupby(col_month, axis=0)
res_old= grouped.apply(proc_gini, col_target ,col_oldscore)
plt.plot(range(len(res_old)),-res_old, linewidth=2.0,label='old score', color = 'r', marker='o')

plt.xticks(range(len(res_new)), res_new.index, rotation=45)

plt.ylim([0,1])
plt.title('Gini by months')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('Months')
plt.ylabel('Gini')
plt.savefig(output_folder+'/performance/ginistability_oldscore.png', bbox_inches='tight', dpi = 72)
plt.show()
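
The `proc_gini` helper above computes Gini as 2 * AUC - 1 with the good cases (`0`) as the positive label, which is why the plotted series is negated. The sketch below illustrates this sign relationship on made-up data, assuming a score where higher values mean higher risk.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Toy target (1 = bad) and a toy score where higher means riskier (assumption)
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
score = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.4, 0.9])

# Same computation as proc_gini: goods are treated as the positive class
fpr, tpr, _ = roc_curve(y, score, pos_label=0)
gini_pos0 = (auc(fpr, tpr) - 0.5) * 2

# Gini with bads as the positive class
gini_direct = 2 * roc_auc_score(y, score) - 1

print(gini_pos0, gini_direct)
```

With no tied scores the two values differ only in sign, so `-gini_pos0` equals `gini_direct`.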

## Transition matrices

Matrices for the observable population

In [None]:
from scoring.plot import transmatrix
    
transmatrix(oldscore = data[valid_mask|test_mask|oot_mask|hoot_mask][col_oldscore],
            newscore = data[valid_mask|test_mask|oot_mask|hoot_mask][col_score],
            target = data[valid_mask|test_mask|oot_mask|hoot_mask][target_for_default],
            base = data[valid_mask|test_mask|oot_mask|hoot_mask][base_for_default],
            obs = data[valid_mask|test_mask|oot_mask|hoot_mask][base_for_default],
            draw_default_matrix=True,
            draw_transition_matrix=True,
            savepath=output_folder+'/analysis/devpop_',
            quantiles_count = 10)
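
The internals of `transmatrix` are not shown in this workflow. As an illustration of the idea only, a quantile transition matrix can be sketched with `pd.qcut` and `pd.crosstab`: both scores are cut into quantile bands and each cell gives the share of observations that fall into a given pair of old and new bands. The random toy scores and all variable names in the block are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Toy scores: the new score is the old one plus some noise (made-up data)
rng = np.random.RandomState(0)
old = rng.uniform(size=1000)
new = old + rng.normal(scale=0.1, size=1000)

quantiles_count = 10
# labels=False returns the band number (0 .. quantiles_count-1) for each observation
old_band = pd.qcut(old, quantiles_count, labels=False)
new_band = pd.qcut(new, quantiles_count, labels=False)

# Share of the population in each (old band, new band) cell
transition = pd.crosstab(pd.Series(old_band, name="old score band"),
                         pd.Series(new_band, name="new score band"),
                         normalize="all")
print(transition.round(3))
```

Mass concentrated on the diagonal means that the two scores rank the observations similarly.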

Transition matrix for the whole population (include rejected applications etc. here as well)

In [None]:
pop_mask = data.data_type != 'train'

transmatrix(oldscore = data[pop_mask][col_oldscore],
            newscore = data[pop_mask][col_score],
            target = data[pop_mask][target_for_default],
            base = data[pop_mask][base_for_default],
            obs = data[pop_mask][obs_for_population],
            draw_default_matrix=False,
            draw_transition_matrix=True,
            savepath=output_folder+'/analysis/allpop_',
            quantiles_count = 10)

## Performance on short target
If the original dataset also contains a shorter target (e.g. FPD30), this part of the workflow draws performance charts for that target as well.

In [None]:
#name of the short target column
col_short = "FPD"
#name of the short target's base column
col_shortbase = "FPD_BASE"

If you don't have a base column for the short target in your data set, the following code adds it. **Otherwise, don't run it.**

In [None]:
if col_shortbase not in data:
    data[col_shortbase] = 0
    data.loc[data[col_short]==0,col_shortbase] = 1
    data.loc[data[col_short]==1,col_shortbase] = 1
    print('Column',col_shortbase,'added/modified. Number of columns:',data.shape[1])
else:
    print('Column',col_shortbase,'already exists.')
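
The same base flag can also be written in one vectorized step. This is a minimal sketch on a made-up frame, assuming (as in the cell above) that the short target is observable exactly where it equals 0 or 1 and is missing otherwise; the frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

col_short = "FPD"
col_shortbase = "FPD_BASE"

# Toy data: missing values mean that the short target is not observable yet
toy = pd.DataFrame({col_short: [0, 1, np.nan, 1, np.nan]})

# 1 where the short target is 0 or 1, otherwise 0
toy[col_shortbase] = toy[col_short].isin([0, 1]).astype(int)
print(toy)
```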

In [None]:
shortbase_mask = ((data.data_type == 'valid')|(data.data_type == 'test')|(data.data_type == 'oot')|(data.data_type == 'hoot')) \
&(data[col_shortbase] == 1) 

In [None]:
perf_shorttarget = pd.DataFrame({'gini':[
    gini(data[shortbase_mask][col_short],data[shortbase_mask][col_score])
    ], 'lift_'+str(lift_perc):[
    lift(data[shortbase_mask][col_short],-data[shortbase_mask][col_score],lift_perc)
    ]})

In [None]:
display(perf_shorttarget)
perf_shorttarget.to_csv(output_folder+'/performance/performance_shorttarget.csv')

In [None]:
from sklearn.metrics import roc_curve, auc

def proc_gini(x,y,z):
    fpr, tpr, _ = roc_curve(x[y], x[z], pos_label=0)
    roc_gini = (auc(fpr, tpr)-0.5)*2
    return roc_gini
%matplotlib inline
plt.figure(figsize = (10,7))
grouped = data[valid_mask|test_mask|oot_mask|hoot_mask].groupby(col_month, axis=0)
res_new= grouped.apply(proc_gini, col_target ,col_score)
plt.plot(range(len(res_new)),-res_new, linewidth=2.0,label='target', color = 'g', marker='o')

grouped = data[shortbase_mask].groupby(col_month, axis=0)
res_short= grouped.apply(proc_gini, col_short ,col_score)
plt.plot(range(len(res_short)),-res_short, linewidth=2.0,label='short target', color = 'r', marker='o')

plt.xticks(range(len(res_short)), res_short.index, rotation=45)

plt.ylim([0,1])
plt.title('Gini by months')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('Months')
plt.ylabel('Gini')
plt.savefig(output_folder+'/performance/ginistability_shorttarget.png', bbox_inches='tight', dpi = 72)
plt.show()

## HTML documentation

Create a basic HTML document with the important scorecard characteristics. The outputs of some of the previous parts of the workflow must already exist on disk; this part only wraps them up.

If some specific parts (short target analysis, old score comparison) were not done, use the parameters below and set them to *False*.

In [None]:
txt_scorecard_name = 'My PSW model'
txt_author_name = 'Pavel Sůva'
short_target_analysis_done = True
old_score_comparison_done = True
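
Because the documentation cell reads several files produced earlier, it can help to check that they exist before generating the document. A minimal sketch, assuming the same relative paths as in the HTML cell; the helper `missing_outputs` and the temporary demo folder are hypothetical and not part of the workflow.

```python
import os
import tempfile

def missing_outputs(folder, short_target=True, old_score=True):
    """Return the relative paths of expected CSV/SQL outputs that are not on disk."""
    required = ['model/metadata.csv', 'analysis/summary.csv', 'predictors/covariates.csv',
                'model/scorecard.csv', 'model/scorecard.sql', 'predictors/predictors.csv',
                'performance/performance.csv']
    if short_target:
        required.append('performance/performance_shorttarget.csv')
    if old_score:
        required.append('performance/performance_oldscore.csv')
    return [p for p in required if not os.path.isfile(os.path.join(folder, p))]

# Demo on a temporary folder that contains only model/metadata.csv
demo_folder = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_folder, 'model'))
open(os.path.join(demo_folder, 'model', 'metadata.csv'), 'w').close()
missing = missing_outputs(demo_folder, short_target=False, old_score=False)
print(missing)
```

In the workflow the call would be `missing_outputs(output_folder, short_target_analysis_done, old_score_comparison_done)`; an empty list means all table inputs are in place.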

In [None]:
with open(output_folder+'/documentation.html', 'w', encoding='utf-8') as f:
    f.write('<html>\n<head>\n<title>'+txt_scorecard_name+'</title>\n')
    f.write('<meta charset="utf-8">\n')    
    f.write('<style>\nbody{font: normal 10pt Helvetica, Arial, sans-serif;}\n'+ \
            '.textbold{font-weight:bold;}\n' + \
            '.divcode{font-family:Courier New,Courier,Lucida Sans Typewriter,Lucida Typewriter,monospace;}\n' + \
            '.divpic{padding-bottom: 20pt;}\n' + \
            '.textlabel{font-style:italic;font-size:8pt;}\n' + \
            'table{border-collapse:collapse;}\n' + \
            '</style>\n')
    f.write('</head>\n<body>')
    f.write('<h1>'+txt_scorecard_name+' - documentation</h1>\n')
    f.write('<h2>Document information</h2>\n')
    f.write('<div class="divpar">\n')
    f.write(' <div class="divtext"><span class="textbold">Author:</span> '+txt_author_name+'</div>\n')
    f.write(' <div class="divtext"><span class="textbold">Date:</span> '+ \
            datetime.datetime.now().strftime("%Y-%m-%d %H:%M")+'</div>\n')
    f.write('</div>\n')
    f.write('<h2>Data sample</h2>\n')
    f.write('<h3>Target</h3>')
    f.write('<div class="divpar">\n')
    f.write(' <div class="divtext"><span class="textbold">Target variable:</span> '+\
            pd.read_csv(output_folder+'/model/metadata.csv',header=None,index_col=0).loc['col_target'][1]+'</div>\n')
    f.write(' <div class="divtext"><span class="textbold">Base variable:</span> '+\
            pd.read_csv(output_folder+'/model/metadata.csv',header=None,index_col=0).loc['col_base'][1]+'</div>\n')
    f.write('</div>\n')
    f.write('<h3>Sample characteristics</h3>\n')
    f.write('<div class="divpar">\n')
    f.write(' <div class="divpic"><img src="analysis/data.png" />\n' + \
            ' <br /><span class="textlabel">Observations and defaults in time</span></div>\n')
    f.write(' <div class="divtab">\n'+pd.read_csv(output_folder+'/analysis/summary.csv',header=[0,1],index_col=0) \
            .to_html(na_rep='')+'\n </div>\n')
    f.write('</div>\n')
    f.write('<h3>Covariates</h3>\n')
    f.write('<div class="divpar">\n')
    f.write(' <div class="divtab">\n'+pd.read_csv(output_folder+'/predictors/covariates.csv',header=0,index_col=0) \
            .to_html(na_rep='')+'\n </div>\n')
    f.write('</div>\n')
    f.write('<h2>Final scorecard</h2>\n')
    f.write('<h3>Scorecard</h3>\n')
    f.write('<div class="divpar">\n')
    f.write(' <div class="divtab">\n'+pd.read_csv(output_folder+'/model/scorecard.csv',header=0,index_col=0) \
            .to_html(na_rep='')+'\n </div>\n')
    f.write('</div>\n')
    f.write('<h3>Scoring SQL</h3>\n')
    f.write('<div class="divpar">\n')
    f.write(' <div class="divcode">\n'+open(output_folder+'/model/scorecard.sql', 'r').read() \
            .replace(' ','&nbsp;').replace('\n','<br />')+'\n </div>\n')
    f.write('</div>\n')
    f.write('<h2>Predictors</h2>\n')
    for pred in pd.read_csv(output_folder+'/predictors/predictors.csv',index_col=None,header=None)[0].tolist():
        pred0 = ''.join(pred.split())[:-4]
        f.write('<h3>'+pred0+'</h3>\n')
        f.write('<h4>Grouping</h4>')
        f.write('<div class="divpar">\n')
        f.write(' <div class="divpic"><img src="predictors/'+pred0+'_binning.png" /></div>\n')
        f.write('</div>\n')
        f.write('<h4>Stability</h4>')
        f.write('<div class="divpar">\n')
        f.write(' <div class="divpic"><img src="stability/'+pred+'_stability.png" /></div>\n')
        f.write('</div>\n')
    f.write('<h2>Correlations</h2>\n')
    f.write('<h3>Correlation matrix between WOE variables</h3>\n')
    f.write('<div class="divpar">\n')
    f.write(' <div class="divpic"><img src="analysis/correlation.png" />\n' + \
            ' <br /><span class="textlabel">Correlation of WOE variables</span></div>\n')
    f.write('</div>\n')
    f.write('<h2>Model evaluation</h2>\n')
    f.write('<h3>Performance</h3>\n')
    f.write('<h4>General performance</h4>\n')
    f.write('<div class="divpar">\n')
    f.write(' <div class="divtab">\n'+pd.read_csv(output_folder+'/performance/performance.csv',header=0,index_col=0) \
            .to_html(na_rep='')+'\n </div>\n')
    f.write(' <div class="divpic"><img src="performance/roc.png" />\n' + \
            ' <br /><span class="textlabel">ROC curve</span></div>\n')
    f.write(' <div class="divpic"><img src="performance/lift.png" />\n' + \
            ' <br /><span class="textlabel">Lift curve</span></div>\n')
    f.write('</div>\n')
    f.write('<h4>Performance stability</h4>\n')
    f.write('<div class="divpar">\n')
    f.write(' <div class="divpic"><img src="performance/ginistability.png" />\n' + \
            ' <br /><span class="textlabel">Stability of Gini in time</span></div>\n')
    f.write('</div>\n')
    if short_target_analysis_done:
        f.write('<h3>Performance on shorter target</h3>\n')
        f.write('<h4>General performance</h4>\n')
        f.write('<div class="divpar">\n')
        f.write(' <div class="divtab">\n'+ \
                pd.read_csv(output_folder+'/performance/performance_shorttarget.csv',header=0,index_col=0) \
                .to_html(na_rep='')+'\n </div>\n')
        f.write('</div>\n')
        f.write('<h4>Performance stability</h4>\n')
        f.write('<div class="divpar">\n')
        f.write(' <div class="divpic"><img src="performance/ginistability_shorttarget.png" />\n' + \
                ' <br /><span class="textlabel">Stability of Gini in time</span></div>\n')
        f.write('</div>\n')
    # Calibration belongs to the model evaluation in general, not only to the short target analysis
    f.write('<h3>Calibration</h3>\n')
    f.write('<div class="divpar">\n')
    f.write(' <div class="divpic"><img src="model/calibration.png" />\n' + \
            ' <br /><span class="textlabel">Model calibration chart</span></div>\n')
    f.write('</div>\n')
    if old_score_comparison_done:
        f.write('<h2>Comparison with current model</h2>\n')
        f.write('<h3>Performance comparison</h3>\n')
        f.write('<h4>General performance</h4>\n')
        f.write('<div class="divpar">\n')
        f.write(' <div class="divtab">\n'+ \
                pd.read_csv(output_folder+'/performance/performance_oldscore.csv',header=[0,1],index_col=0) \
                .to_html(na_rep='')+'\n </div>\n')
        f.write(' <div class="divpic"><img src="performance/roc_oldscore.png" />\n' + \
                ' <br /><span class="textlabel">ROC curve</span></div>\n')
        f.write(' <div class="divpic"><img src="performance/lift_oldscore.png" />\n' + \
                ' <br /><span class="textlabel">Lift curve</span></div>\n')
        f.write('</div>\n')
        f.write('<h4>Performance stability</h4>\n')
        f.write('<div class="divpar">\n')
        f.write(' <div class="divpic"><img src="performance/ginistability_oldscore.png" />\n' + \
                ' <br /><span class="textlabel">Stability of Gini in time</span></div>\n')
        f.write('</div>\n')
        f.write('<h3>Transition matrices</h3>\n')
        f.write('<h4>Bad rate matrix</h4>\n')
        f.write('<div class="divpar">\n')
        f.write(' <div class="divpic"><img src="analysis/devpop_matrix_default.png" />\n' + \
                ' <br /><span class="textlabel">Default rate matrix</span></div>\n')
        f.write('</div>\n')
        f.write('<h4>Transition matrix - development sample</h4>\n')
        f.write('<div class="divpar">\n')
        f.write(' <div class="divpic"><img src="analysis/devpop_matrix_transition.png" />\n' + \
                ' <br /><span class="textlabel">Transition matrix</span></div>\n')
        f.write('</div>\n')
        f.write('<h4>Transition matrix - whole population</h4>\n')
        f.write('<div class="divpar">\n')
        f.write(' <div class="divpic"><img src="analysis/allpop_matrix_transition.png" />\n' + \
                ' <br /><span class="textlabel">Transition matrix</span></div>\n')
        f.write('</div>\n')  
    f.write('</body></html>')
    print('Created documentation in file',f.name)