<span style="font-size:30pt;font-weight:bold">General Model Workflow</span>

**Copyright:**

© 2017-2020, Pavel Sůva, Marek Teller, Martin Kotek, Jan Zeller, Marek Mukenšnabl, Kirill Odintsov, Jan Hynek, Elena Kuchina and Home Credit & Finance Bank Limited Liability Company, Moscow, Russia – all rights reserved

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the [License](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

For the list of contributors, see the [Gitlab page](https://git.homecredit.net/risk/python-scoring-workflow).

# Import packages, configure environment

This workflow needs some Python packages that are not part of the standard Anaconda distribution, mainly widgets for the GUI parts of the workflow:

`conda install ipywidgets`

`jupyter nbextension enable --py --sys-prefix widgetsnbextension`

`conda config --add channels conda-forge`

`conda install qgrid` 

`jupyter nbextension enable --py --sys-prefix qgrid`

`conda install tqdm`

In [None]:
import time
import datetime
import math
import random
import numpy as np
import pandas as pd
import sklearn
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import os.path
import pickle
import gc
from tqdm import tqdm_notebook as tqdm
import sys
import warnings

sys.path.insert(0, '..')
import scoring

from scipy.special import logit, expit
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

In [None]:
sns.set()
%matplotlib inline
%config InlineBackend.close_figures=True
from IPython.display import display, Markdown
pd.options.display.max_columns = None
pd.options.display.max_rows = 15
output_folder = 'documentation'

if not os.path.exists(output_folder): os.makedirs(output_folder)
if not os.path.exists(output_folder+'/performance'): os.makedirs(output_folder+'/performance')
if not os.path.exists(output_folder+'/predictors'): os.makedirs(output_folder+'/predictors')
if not os.path.exists(output_folder+'/stability'): os.makedirs(output_folder+'/stability')
if not os.path.exists(output_folder+'/stability_short'): os.makedirs(output_folder+'/stability_short')
if not os.path.exists(output_folder+'/analysis'): os.makedirs(output_folder+'/analysis')
if not os.path.exists(output_folder+'/model'): os.makedirs(output_folder+'/model')
if not os.path.exists(output_folder+'/nan_share'): os.makedirs(output_folder+'/nan_share')
scoring.check_version('0.8.2')

In [None]:
from scoring import doctools

documentation = doctools.ProjectParameters()

# Input data

## Import data

Importing data from a CSV file. It is important to set the following parameters:

- *encoding*: usually 'utf-8', or windows-xxxx on Windows machines, where xxxx is 1250 for Central Europe, 1251 for Cyrillic etc.
- *sep*: separator of columns in the file
- *decimal*: decimal dot or comma
- *index_col*: which column is used as index; it should be the unique credit case identifier

**Defining NA values:** Different datasets can use different values to mean *N/A*. By default, only blank fields are treated as *N/A*, but you might want to add values like *'NA'*, *'NAN'*, *'null'*. Use the parameter `na_values` for this.

The data need to have an **index column with a unique value for each row**. If not, it will cause problems later.
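
The effect of these parameters can be tried on a tiny in-memory example. This is only a sketch with plain `pandas.read_csv`; the notebook itself uses the project wrapper `db.read_csv`, whose extra options (such as `optimize_types`) are not covered here. The semicolon-separated text and its column names are made up for the illustration, and `encoding` is omitted because it only matters when reading a file from disk.

```python
import io
import pandas as pd

# Hypothetical file content: ';' separator, decimal comma, blank = missing
raw = "ID;SCORE;NOTE\n1;0,25;ok\n2;;NA\n3;0,75;\n"

toy = pd.read_csv(
    io.StringIO(raw),
    sep=";",                # column separator
    decimal=",",            # decimal comma
    index_col="ID",         # unique case identifier used as index
    keep_default_na=False,  # do not treat 'NA', 'null', ... as missing
    na_values=[""],         # only blank fields are missing
)

print(toy)
print("Index is unique:", toy.index.is_unique)
```

With these settings the text `NA` in the `NOTE` column stays a regular string, while the blank fields become missing values. Adding `'NA'` to `na_values` would make it missing as well.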

In [None]:
from scoring import db
data = db.read_csv(r'demo_data\gmdata.CSV', sep = ',', decimal = '.',
                   optimize_types=True, encoding = 'utf-8', low_memory = False,
                   keep_default_na = False, na_values = [''])
print('Data loaded on',datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S'))

In [None]:
print('Number of rows:',data.shape[0])
print('Number of columns:',data.shape[1])

In [None]:
data.head()

## Metadata definition

Assigning target column, time column and ID column (observation identifier) which have to exist within the dataset.

All the rest (i.e. base which tells where the target is observable, month and day which are derived from the time, and weight which gives weight - importance - to each row individually) are created automatically later if they don't exist at this moment.

In [None]:
### THESE COLUMNS MUST BE INCLUDED IN THE DATA SET ###
#name of the target column
col_target = "FPD30"
#name of the time column
col_time = "DATETIME"
#name of ID column
col_id = "ID"

### THESE COLUMNS DON'T HAVE TO BE INCLUDED IN THE DATA SET AND ARE CREATED AUTOMATICALLY LATER ###
#name of the base column
col_base = "APPROVED"
#name of the month column
col_month = "MONTH"
#name of the day column
col_day = "DAY"
#name of the weight column (created below with value 1 if it does not exist)
col_weight = 'WEIGHT'

In [None]:
documentation.targets = [(col_target, col_base)]
documentation.time_variable = col_month

In [None]:
documentation.rowid_variable = col_id

In [None]:
pd.DataFrame.from_records([['col_time',col_time],['col_month',col_month],['col_day',col_day],['col_target',col_target],['col_base',col_base]]) \
.to_csv(output_folder+'/model/metadata.csv',index=0,header=None)

# np.float was removed in NumPy 1.24; the builtin float is equivalent
data[col_target] = data[col_target].astype(float)

If you don't have a base column in your data set, the following code adds it (based on whether the target is filled).

In [None]:
if col_base not in data:
    data[col_base] = 0
    data.loc[data[col_target]==0,col_base] = 1
    data.loc[data[col_target]==1,col_base] = 1
    print('Column',col_base,'added/modified. Number of columns:',data.shape[1])
else:
    print('Column',col_base,'already exists.')

If you don't have a weight column in your data set, the following code adds it, with value = 1 for each row.

In [None]:
if col_weight not in data:
    data[col_weight] = 1
    print('Column',col_weight,'added/modified. Number of columns:',data.shape[1])
else:
    print('Column',col_weight,'already exists.')

The month and day columns are created from the time column as follows:
- take the time column and state the format in which the time is stored - **you need to specify this in the variable *dtime_input_format*** (see https://docs.python.org/3/library/time.html#time.strftime for reference)
- reduce the format to a year, month, day string
- convert the string to a number
- add the new column to the dataset as day
- truncate this column to just year and month and add it to the dataset as month

In [None]:
dtime_input_format = '%Y-%m-%d %H:%M:%S'

In [None]:
data[col_day] = pd.to_numeric(pd.to_datetime(data[col_time], format=dtime_input_format).dt.strftime('%Y%m%d'))
data[col_month] = data[col_day] // 100  # YYYYMMDD -> YYYYMM
print('Columns',col_day,'and',col_month,'added/modified. Number of columns:',data.shape[1])

In [None]:
data.head(5)

Set of predictors (subscores) to be analyzed. These are the potential elements of the final General Model score.

In [None]:
cols_pred = [
    'INTERNAL',
    'TELCO_A',
    'TELCO_B',
    'BUREAU_X',
    'BUREAU_Y',
    'UTILITY',
    'DEVICE',
]

## Data exploration

In [None]:
descrip = data.describe(include='all').transpose()
pd.options.display.max_rows = 1000
display(descrip)
pd.options.display.max_rows = 15

**Default rate in time**: Simple visualisation of observation count and default rate in time

In [None]:
from scoring.plot import plot_dataset
plot_dataset(data,
             month_col=col_month,
             def_col=col_target,
             title='Count and bad rate',
             base_col=col_base,
             #weightCol=col_weight,
             savepath=output_folder+'/analysis/',
             zeroYlim=True)

**NaN share by month** for each variable in dataset:

In [None]:
from scoring.data_exploration import nan_share_development

nan_table = nan_share_development(data[cols_pred + [col_month]], col_month, 
                                  make_images=True, show_images=True, output_path=output_folder+'/nan_share/')
display(nan_table)

## Mask definitions

### Split to samples

Split data into two parts (training and validation). The GM model is usually trained on a small, time-limited sample, so test and out-of-time samples are usually not created.

This will add a new column indicating to which part the observations belong.

- The *splitting_points* (first date of train and first date of out of time sample) can be adjusted (there can be any number of such splitting points) - it should correspond to values of column specified by *time_column* parameter. If empty list is used, no time splits will be done.
- For each time interval, you can create multiple random splits (i.e. train/valid/test), the ratio of sizes of these splits is set by parameter *sample_sizes*. In our case we have just one time interval, so only one list of sample sizes.
- The random splits can be stratified by multiple variables, which are specified in a list - argument to *stratify_by_columns* parameter
- Set the random seed so the results are replicable

**Before you run the data split, make sure that the index in your dataset is unique!** If not, you need to create a new unique index.
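
A quick way to check this is the `is_unique` property of the pandas index. The sketch below uses a made-up three-row frame with a duplicated index value; it keeps the old index in a column named `INDEX_ORIGINAL` (the same name as in the commented-out cell that follows) and replaces the index with a fresh running number.

```python
import pandas as pd

# Hypothetical data with a duplicated index value (10 appears twice)
toy = pd.DataFrame({"x": [1, 2, 3]}, index=[10, 10, 11])

if not toy.index.is_unique:
    # Keep the old index for reference, then replace it with 0..n-1
    toy["INDEX_ORIGINAL"] = toy.index
    toy = toy.reset_index(drop=True)

print(toy)
```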

In [None]:
# data['INDEX_ORIGINAL'] = data.index
# data.reset_index(inplace=True)

In [None]:
from scoring.data_manipulation import data_sample_time_split

data['data_type'] = data_sample_time_split(data, 
                           time_column = col_month,
                           splitting_points = [],
                           sample_sizes = [[ 0.5   , 0.5   ]],
                           sample_names = [['train','valid']],
                           stratify_by_columns = [col_month,col_target],
                           random_seed = 1234)

Masks: boolean vectors corresponding to rows in the dataset. A mask is True if a row is observable and its data type belongs to the given sample.

`observable_mask` is a mask which includes all observable rows (i.e. valid observations from all the samples).

`everything_mask` is a mask which is True for all rows.

In [None]:
train_mask = (data['data_type'] == 'train') & (data[col_base] == 1) 
valid_mask = (data['data_type'] == 'valid') & (data[col_base] == 1) 
observable_mask = data[col_base] == 1
everything_mask = pd.notnull(data['data_type'])

Add masks to _documentation_ object.

In [None]:
documentation.sample_dict = {
    "Train": train_mask,
    "Valid": valid_mask,
    "Observable": observable_mask,
}

Data summary (number of defaults, number in base, number of observations, default rate) by month and by sample

In [None]:
data_summary = data.groupby([col_month,'data_type']).aggregate({
    col_target:'sum',col_base:['sum','count']
})
data_summary.columns = [col_target,col_base,'Rows']
data_summary[col_target+' rate'] = data_summary[col_target]/data_summary[col_base]
display(data_summary)

data_summary = data_summary.reset_index(level='data_type').pivot(columns='data_type')
display(data_summary)
data_summary.to_csv(output_folder+'/analysis/summary.csv')

# Reweighting

This part serves to set weights based on hit rates of each score (data source).

If we have only a limited number of observations for each score, but we expect different hit rates in production, we can assign a weight to each observation so that the weighted data reflect the expected hit rates.

E.g. if we have only 25% of hits in the development population for a certain data source, but we expect it to be 50% in production, we assign weight=3 to this hit population, so now the non-hit population and hit population have the same sum of weights.
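
The arithmetic of this example can be checked in a few lines. The variable names are made up for the illustration; the ratio is the same one that the weight assignment cell later in this section computes, here applied to a population whose weights are all 1.

```python
# Shares of the development population (all weights equal to 1)
hit_share_dev = 0.25      # observed hit rate in development
hit_share_target = 0.50   # hit rate expected in production

w_hit = hit_share_dev          # sum of weights of hits
w_nohit = 1.0 - hit_share_dev  # sum of weights of non-hits

# Multiplier for the weights of the hit population
coef = (hit_share_target * w_nohit) / ((1.0 - hit_share_target) * w_hit)

# Hit rate after the hit weights are multiplied by coef
weighted_hit_rate = coef * w_hit / (coef * w_hit + w_nohit)

print("Weight for hits:", coef)
print("Weighted hit rate:", weighted_hit_rate)
```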

## Hit rates

Calculates hit rates for each data source in our development data

In [None]:
hit_flags = data[[col_weight, col_month]].copy()
for col in cols_pred:
    hit_flags[col] = data[col].notnull()

### Hits by score

In [None]:
hit_rates = []
for col in cols_pred:
    hit_rates.append({'Variable': col,
                     'Weighted hit rate %': hit_flags[observable_mask & hit_flags[col]][col_weight].sum()*100 /
                                      hit_flags[observable_mask][col_weight].sum(),
                     })
hit_rates = pd.DataFrame(hit_rates).set_index('Variable')
display(hit_rates)
hit_rates.to_csv(output_folder+'/predictors/hit_rates_unweighted.csv')

### Hits by score in time

In [None]:
hit_rates_time = []
for col in cols_pred:
    for month in hit_flags[col_month].unique():
        hit_rates_time.append({'Variable': col,
                               'Month': month,
                               'Weighted hit rate %': hit_flags[observable_mask & hit_flags[col] & (hit_flags[col_month]==month)][col_weight].sum()*100 /
                                                hit_flags[observable_mask & (hit_flags[col_month]==month)][col_weight].sum(),
                              })
hit_rates_time = pd.DataFrame(hit_rates_time).pivot(index='Variable', columns='Month', values='Weighted hit rate %').reindex(cols_pred)
display(hit_rates_time)
hit_rates_time.to_csv(output_folder+'/predictors/hit_rate_time.csv')

### Combined hits

Calculate hit rate interactions, i.e. the shares of the various combinations of data source hits.

In [None]:
hit_rates_comb = pd.DataFrame(hit_flags[observable_mask].groupby(cols_pred)[col_weight].sum()*100 /
                              hit_flags[observable_mask][col_weight].sum()).rename(columns={col_weight:'Weighted obs %'})
display(hit_rates_comb)
hit_rates_comb.to_csv(output_folder+'/predictors/hit_rate_interactions.csv')

## Setting desired situation

Now we set what are the hit rates that we expect, i.e. what will be the situation in production.

In [None]:
hit_desired = hit_rates.copy()
hit_desired['Desired hit rate %'] = np.nan

### Setting desired hit rates in code

Either we can hardcode these numbers into the dataframe with desired hit rates...

In [None]:
hit_desired.loc['INTERNAL','Desired hit rate %'] = 100
hit_desired.loc['TELCO_A','Desired hit rate %'] = 50
hit_desired.loc['TELCO_B','Desired hit rate %'] = 50
hit_desired.loc['BUREAU_X','Desired hit rate %'] = 40
hit_desired.loc['BUREAU_Y','Desired hit rate %'] = 40
hit_desired.loc['UTILITY','Desired hit rate %'] = 65
hit_desired.loc['DEVICE','Desired hit rate %'] = 50

### Interactive tool

...or we can use interactive qgrid table to set it in nicer GUI. **Don't forget to run the piece of code under the table which saves the changed values into the dataframe.**

In [None]:
import qgrid

hit_widget = qgrid.show_grid(hit_desired, 
                             column_options={'editable':False}, 
                             column_definitions={'Desired hit rate %':{'editable':True}},
                            )
hit_widget

Save the changed values:

In [None]:
hit_desired = hit_widget.get_changed_df()

## Assign weights to reflect desired situation

The weights are calculated so that the weighted hit rates (i.e. hit rates when each observation's importance is multiplied by the newly calculated weights) reflect the desired (expected production) hit rates from the `hit_desired` dataframe.

**Assumptions:** Hit rates of individual scores are mutually independent and time consistent.
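
For reference, the coefficient computed in the next cell can be written out as follows. The notation is introduced here for illustration only: $d$ is the desired hit rate in percent, and $W_{\text{hit}}$ and $W_{\text{non-hit}}$ are the sums of current weights over observable rows where the score is filled and missing, respectively. The weights of the hit rows are multiplied by

$$c = \frac{d \cdot W_{\text{non-hit}}}{(100 - d) \cdot W_{\text{hit}}}$$

so that the weighted hit rate of that score becomes

$$\frac{c \cdot W_{\text{hit}}}{c \cdot W_{\text{hit}} + W_{\text{non-hit}}} = \frac{d}{100}$$

The scores are processed one after another, so reweighting a later score can shift the hit rate of an earlier one unless the hits are independent, which is why the assumption above matters. The final rescaling, which makes the weights average to 1, multiplies all weights by the same constant and therefore leaves the hit rates unchanged.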

In [None]:
# data[col_weight] = 1

In [None]:
for var_name, hit_row in hit_desired.iterrows():
    if (hit_row['Desired hit rate %'] == 100) or (hit_row['Weighted hit rate %'] == 100):
        coef = 1
    else:
        coef = (hit_row['Desired hit rate %'] * data.loc[observable_mask & pd.isnull(data[var_name]), col_weight].sum()) /\
               ((100-hit_row['Desired hit rate %']) * data.loc[observable_mask & pd.notnull(data[var_name]), col_weight].sum())
    data.loc[pd.notnull(data[var_name]), col_weight] = \
        data.loc[pd.notnull(data[var_name]), col_weight] * coef
weight_calib_coef = data[col_weight].count() / data[col_weight].sum()
data[col_weight] = data[col_weight] * weight_calib_coef

In [None]:
hit_flags = data[[col_weight, col_month]].copy()
for col in cols_pred:
    hit_flags[col] = data[col].notnull()

hit_rates_w = []
for col in cols_pred:
    hit_rates_w.append({'Variable': col,
                     'Weighted hit rate %': hit_flags[observable_mask & hit_flags[col]][col_weight].sum()*100 /
                                      hit_flags[observable_mask][col_weight].sum(),
                     })
hit_rates_w = pd.DataFrame(hit_rates_w).set_index('Variable')
display(hit_rates_w)
hit_rates_w.to_csv(output_folder+'/predictors/hit_rates_weighted.csv')

*In the demo data that we have here, the assumptions of independence and time consistency are not fulfilled. Variable DEVICE is observable only in a month when no other variable is observable. This means that the final weights are not able to mimic the desired hit rates.*

# Datasources analysis

We will analyze each score which can potentially enter the General Model. These scores were defined in list `cols_pred` above.

## Univariate power of raw scores

For each score, we calculate its Gini on the subset where the target is observable and that particular score is not null. This shows the power of the score when we have it (i.e. on its *"hit" population*, where hit means that the datasource which returns the score was queried and a valid score was returned).
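
To make the metric concrete, here is a minimal sketch of a weighted Gini on the hit population. It assumes the common definition Gini = 2·AUC − 1 and takes the absolute value, as the cell below does. The function `weighted_gini` is hypothetical and is not the implementation of `scoring.metrics.gini`, whose internals are not shown in this notebook.

```python
import numpy as np

def weighted_gini(target, score, weight=None):
    """Hypothetical sketch: |2*AUC - 1| on rows where score and target are known."""
    target = np.asarray(target, dtype=float)
    score = np.asarray(score, dtype=float)
    weight = np.ones_like(score) if weight is None else np.asarray(weight, dtype=float)

    # Restrict to the "hit" population with an observable target
    keep = ~np.isnan(score) & ~np.isnan(target)
    target, score, weight = target[keep], score[keep], weight[keep]

    bad, good = target == 1, target == 0
    # Compare every bad with every good (O(n^2), acceptable for a sketch)
    diff = score[bad][:, None] - score[good][None, :]
    pair_w = weight[bad][:, None] * weight[good][None, :]
    # AUC = weighted share of pairs where the bad scores higher (ties count half)
    auc = (pair_w * ((diff > 0) + 0.5 * (diff == 0))).sum() / pair_w.sum()
    return abs(2 * auc - 1)

# Toy data: the last row has no score, so it is excluded
gini_separated = weighted_gini([0, 0, 1, 1, 1], [0.1, 0.2, 0.8, 0.9, np.nan])
gini_mixed = weighted_gini([0, 1, 0, 1], [1, 2, 3, 4])
print(gini_separated, gini_mixed)
```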

In [None]:
from scoring.metrics import gini

uni_ginis = []
for col in cols_pred:
    if pd.api.types.is_numeric_dtype(data[col].dtype):
        col_gini = np.abs(gini(data[pd.notnull(data[col]) & observable_mask][col_target],
                               data[pd.notnull(data[col]) & observable_mask][col],
                               data[pd.notnull(data[col]) & observable_mask][col_weight]))
        uni_ginis.append({'Variable':col,
                          'Gini': col_gini,
                        })
uni_ginis = pd.DataFrame(uni_ginis)[['Variable','Gini']]
display(uni_ginis)
uni_ginis.to_csv(output_folder+'/predictors/univariate_gini.csv')

## Score transformation

Each datasource can return score in a different format. 
- Sometimes it is probability of default or non-default (and these can have various definitions for each source).
- Sometimes it is logit, i.e. real number which after expit transformation becomes the probability.
- Sometimes it is logit, but linearly transformed to fit into a specific scale.
- Sometimes the logic is not known but the score is somehow correlated with defaults.
- Sometimes there are specific "special values" which are outside the standard score scale and which tell us that something non-standard occurred (e.g. error codes).

### Inverted logit to logit

`logit(Probability of default) = -logit(Probability of non-default)`
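
A quick numeric check of this identity, together with the two PD-based conversions used in the following subsections. The sketch uses only the standard library; the value `pd_value` is a made-up probability of default.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def expit(x):
    """Inverse of logit: maps a real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

pd_value = 0.08  # hypothetical probability of default

# Source returns the logit of the non-default probability: flip the sign
inverted_logit = logit(1.0 - pd_value)
lin_from_inverted_logit = -inverted_logit

# Source returns the PD directly: apply logit
lin_from_pd = logit(pd_value)

# Source returns the probability of non-default: logit of the complement
non_default_prob = 1.0 - pd_value
lin_from_inverted_pd = logit(1.0 - non_default_prob)

# All three give the same logit of PD, and expit recovers the PD
print(lin_from_inverted_logit, lin_from_pd, lin_from_inverted_pd)
print(expit(lin_from_pd))
```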

In [None]:
for col in [
    'INTERNAL'
]:
    data[col+'_LIN'] = -data[col]

### Logit from PD

In [None]:
for col in [
    'TELCO_A',
]:
    data[col+'_LIN'] = logit(data[col])

### Logit from inverted PD

In [None]:
for col in [
    'BUREAU_Y',
    'DEVICE',
]:
    data[col+'_LIN'] = logit(1-data[col])

### Logit by linear scaling

When the score is a logit of PD that has been linearly shifted or rescaled, we run a logistic regression with this score as the sole predictor. The fitted intercept and slope map the score back to a logit calibrated to our default definition.

In [None]:
for col in [
    'TELCO_B',
]:
    telco_b_scaler = LogisticRegression(penalty = 'l2', C = 1000, solver='liblinear')
    telco_b_scaler.fit(data[(data[col].notnull()) & observable_mask][[col]],
                       data[(data[col].notnull()) & observable_mask][col_target])
    print(f'Intercept: {telco_b_scaler.intercept_[0]}, Slope: {telco_b_scaler.coef_[0][0]}')
    data[col+'_LIN'] = telco_b_scaler.intercept_[0] + telco_b_scaler.coef_[0][0] * data[col]

### WOE for unknown scale and categorical

If the score has unknown scale or is categorical, we can calculate Weight of Evidence as with common predictors. We will use the Interactive grouping for this.
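
As a reminder of what the WOE values mean, the sketch below computes them by hand for a made-up two-group variable. It assumes the sign convention WOE = ln(share of goods / share of bads), which is inferred from the `scale=-1` used for WOE columns in the calibration cell later in this notebook. It applies no smoothing, so it will not reproduce the `woe_smooth_coef` behaviour of the grouping class.

```python
import numpy as np
import pandas as pd

# Hypothetical grouped variable with a binary target (1 = default)
toy = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    "target": [1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
})

bad = toy.groupby("group")["target"].sum()
good = toy.groupby("group")["target"].count() - bad

# WOE per group: log of (share of all goods) / (share of all bads)
woe = np.log((good / good.sum()) / (bad / bad.sum()))

# Overall log-odds of default minus WOE gives the group's logit of default rate
overall_log_odds = np.log(bad.sum() / good.sum())
group_logit = overall_log_odds - woe

print(woe)
print(group_logit)
```

Under this convention a group with a positive WOE is less risky than average, which is why the WOE columns need a negative scale and a log-odds shift before they can be read as logits of PD.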

A new instance of the **InteractiveGrouping** class is created. The important parameters are:
 - *columns*: list of numerical columns to be grouped
 - *cat_columns*: list of categorical columns to be grouped
 - *group_count*: (maximal) number of final groups of each variable
 - *min_samples*: minimal number of observations in each group of each numerical variable
 - *min_samples_cat*: minimal number of observations in each group of each categorical variable

Then you open the interactive environment using **display** method. The important parameters are:
 - *train_t*: training dataset the grouping should be based on
 - *columns*: list of numerical columns to be grouped and displayed
 - *cat_columns*: list of categorical columns to be grouped and displayed
 - *target_column*: as the grouping is supervised and calculates WOE values, you need to specify the target column name
 - *w_column*: name of the column with observation weights (if not filled, the grouping behaves as if all weights were equal)
 - *filename*: use only if you want to load a grouping that you created and saved previously
 - *group_count*: (maximal) number of final groups of each variable
 - *min_samples*: minimal number of observations in each group of each numerical variable
 - *min_samples_cat*: minimal number of observations in each group of each categorical variable

In the interactive environment, you can see four sections. From top to bottom:
- **Chart section**: 
 - For **numerical variables**, there is a chart with equifrequency fine classing (observations as bars, default rate as line), equidistant fine classing and the final groups.
 - For **categorical variables**, there is a chart with each of the original categorical values and a chart with the final groups.
- **Variable section**: here you can choose the tab with the variable which you want to edit. 
 - For **numerical variables**, the tab contains the borders of the final groups. You can edit these borders, add new ones with the [+] button and remove them with the [-] button. You can also manually set WOE for nulls. There is also a button to perform automatic grouping on the selected variable.
 - For **categorical variables**, the tab contains two tables. In the top table, you can see some statistics for each of the categorical values. In the rightmost column, there is the number of the group which is assigned to the category. You can edit this value (double-click on it) to change the grouping. In the bottom table you can see statistics for the groups. It is not editable. There is also a button to perform automatic grouping on the selected variable.
- **Save section**: here you can save the grouping. Edit the file name and click the [Apply and Save] button.
- **Settings section**: If you perform automatic grouping on some variable, the grouping algorithm uses some parameters. These parameters can be set here. You can set how many final groups you want to have and what their minimal size is.

In [None]:
from scoring.grouping import Grouping, InteractiveGrouping

cols_togroup_num = [
                       'BUREAU_X',
]
cols_togroup_cat = [
                       'UTILITY',
]

grouping = InteractiveGrouping(columns = cols_togroup_num,
                               cat_columns = cols_togroup_cat,
                               group_count=5,
                               min_samples=100, 
                               min_samples_cat=100,
                               woe_smooth_coef=0.001)

sns.reset_orig()
%matplotlib notebook
%config InlineBackend.close_figures=False

grouping.display(#train_t = data[train_mask][cols_togroup_num+cols_togroup_cat+[col_target]],
                 train_t = data[train_mask][cols_togroup_num+cols_togroup_cat+[col_target]+[col_weight]], #for call with weight
                 columns = cols_togroup_num,
                 cat_columns = cols_togroup_cat,
                 target_column = col_target,
                 w_column = col_weight,
                 #filename = 'myIntGrouping',
                 bin_count=20,
                 woe_smooth_coef=0.001,
                 group_count=5,
                 min_samples=100,
                 min_samples_cat=100)

Don't forget to *Apply and Save* your changes.

In [None]:
#reset the graphical environment to be used by the normal non-interactive charts
sns.set()
%matplotlib inline
%config InlineBackend.close_figures=True

Load the grouping from a file (don't forget to set the right filename) and add the WOE columns to the original dataset.

In [None]:
# from scoring.grouping import Grouping
# grouping = Grouping(columns = cols_togroup_num,
#                     cat_columns = cols_togroup_cat,
#                     group_count=5, 
#                     min_samples=100, 
#                     min_samples_cat=100,
#                     woe_smooth_coef=0.001) 
# g_filename = 'myIntGrouping'
# grouping.load(g_filename)

Apply the grouping to the data. *Grouping.transform()* method now automatically renames columns with proper suffix. If you need to transform just subset of columns use parameter *columns_to_transform=\[...\]*.

In [None]:
data_woe = grouping.transform(data, transform_to='woe', progress_bar=True)

Add WOE variables to the data set.

In [None]:
woe_columns_to_replace = list()
for column in data_woe.columns:
    if column in data:
        woe_columns_to_replace.append(column)
        print('Column', column ,'dropped as it already existed in the data set.')
data = data.drop(woe_columns_to_replace, axis='columns')
data = data.join(data_woe)

del data_woe
gc.collect()

print('Added WOE variables. Number of columns:',data.shape[1])
cols_woe = [s + '_WOE' for s in cols_pred]

## Score distribution and calibration

Once all the scores have been transformed to a form which, after expit transformation, is linearly dependent on the default rate, we can draw charts of their distributions and compare the predicted probability in each quantile with the actual default rate.

In [None]:
from scoring.plot import score_calibration

List of *transformed* scores (i.e. each score must be in its logit form).

In [None]:
cols_pred_transformed = [
    'INTERNAL_LIN',
    'TELCO_A_LIN',
    'TELCO_B_LIN',
    'BUREAU_X_WOE',
    'BUREAU_Y_LIN',
    'UTILITY_WOE',
    'DEVICE_LIN',
]

Each of the following charts consists of:
- red columns: histogram of score - number of observations in each quantile, tied to the left y axis
- black dashed line: predicted probability of default (i.e. `exp(x)/(1+exp(x))`, where x is the score on the x axis)
- blue line: actual default rate in each quantile, tied to the right y axis, should correspond to the black dashed line
- green dashed line: actual default rate of observations where the score is null, tied to the right y axis

The numbers under each chart are:
- average predicted default rate (i.e. average of values of black dashed line of the whole population where the score is not null)
- average hit default rate (i.e. average of values of blue line of the whole population where the score is not null)
- average non-hit default rate (i.e. value of the green dashed line)
- hit gini (metric of quality of prediction for population where the score is not null)

In [None]:
for col in cols_pred_transformed:
    if col[-3:] == 'WOE':
        shift = np.log(data[observable_mask][col_target].sum() / (1-data[observable_mask][col_target]).sum())
        scale=-1
    else:
        shift=0
        scale=1
    score_calibration(data=data[observable_mask],
                      score=col,
                      target=col_target,
                      weight=col_weight,
                      shift=shift,
                      scale=scale,
                      ispd=False,
                      savefile=output_folder+'/predictors/'+col+'.png')

# Building the model step by step

In this part, we will build the model step by step. After transforming all variables to logit, we have to:
- impute the missing values of the subscores by valid numbers. These numbers should be based on relationship of score and default from the observations where the score is known
- fit the model using these imputed variables

## Missing value imputation

### Imputation based on missing

In case we have a certain data source available only for certain observations, we need to fill the data source's score with some placeholder values to be able to fit the regression model correctly.

This placeholder value should correspond to the average default rate of this group in such a way that the relationship between default rate and score is the same as for the observations where the score is known.

After each imputation, we draw a chart of the score distribution with a vertical reference line corresponding to the imputed value, as a sanity check of whether the value makes sense to us.

#### Basic imputation assuming linear dependency of score expit and target

Calculates the imputation value for a score that is missing for some observations. Based on the default rate in the hit sample, the average score in the hit sample and the default rate in the no-hit sample, it calculates the score for the no-hit sample (using a simple proportion, the "rule of three"). The score is taken in its logit form and converted to probability of default before the calculation is done (or, if parameter `ispd=True`, the score is assumed to be in probability of default form already).

$$\text{Imputation score (PD)} = \frac{ \text{Avg non-hit default rate}}{\text{Avg hit default rate}}\times\text{Avg hit score (PD)}$$
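
A worked instance of this proportion, using only the standard library. All numbers are made up, the calculation is unweighted, and it assumes that the "average hit score (PD)" is the mean of the individual PDs; the notebook does not show how `missing_value` averages internally, so treat this as an illustration of the formula rather than of the function.

```python
import math

def expit(x):
    """Convert a logit score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Convert a probability back to a logit score."""
    return math.log(p / (1.0 - p))

hit_scores = [-3.0, -2.5, -2.0]  # hypothetical logit scores of the hit sample
hit_default_rate = 0.06          # observed default rate where the score is known
nohit_default_rate = 0.09        # observed default rate where the score is missing

# Average predicted PD on the hit sample (assumption: mean of PDs, not of logits)
avg_hit_pd = sum(expit(s) for s in hit_scores) / len(hit_scores)

# Rule of three: scale the average PD by the ratio of the default rates
imputed_pd = nohit_default_rate / hit_default_rate * avg_hit_pd

# Back to logit form, ready to fill the missing values of a *_LIN column
imputed_logit = logit(imputed_pd)

print("Average hit PD:", avg_hit_pd)
print("Imputed PD:", imputed_pd)
print("Imputed logit:", imputed_logit)
```

Because the non-hit population defaults more often than the hit population in this example, the imputed logit is higher than the logit of the average hit PD.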

In [None]:
from scoring.score_imputation import missing_value
from scoring.plot import score_calibration

In [None]:
fill_val = missing_value(sample_hit = data[train_mask & ~np.isnan(data["TELCO_A_LIN"])],
                        sample_nohit = data[train_mask & np.isnan(data["TELCO_A_LIN"])],
                        score = "TELCO_A_LIN",
                        target = col_target,
                        weight = col_weight,
                        ispd = False)
print(f'Imputation value: {fill_val}')

data["TELCO_A_LIN_MV"] = data["TELCO_A_LIN"].copy()
data.loc[np.isnan(data["TELCO_A_LIN"]), "TELCO_A_LIN_MV"] = fill_val

score_calibration(data=data[train_mask],
                  score="TELCO_A_LIN",
                  target=col_target,
                  weight=col_weight,
                  vertical_lines=fill_val)

In [None]:
fill_val = missing_value(sample_hit = data[train_mask & ~np.isnan(data["TELCO_B_LIN"])],
                        sample_nohit = data[train_mask & np.isnan(data["TELCO_B_LIN"])],
                        score = "TELCO_B_LIN",
                        target = col_target,
                        weight = col_weight,
                        ispd = False)
print(fill_val)

data["TELCO_B_LIN_MV"] = data["TELCO_B_LIN"].copy()
data.loc[np.isnan(data["TELCO_B_LIN"]), "TELCO_B_LIN_MV"] = fill_val

score_calibration(data=data[train_mask],
                  score="TELCO_B_LIN",
                  target=col_target,
                  weight=col_weight,
                  vertical_lines=fill_val)

In [None]:
fill_val = missing_value(sample_hit = data[train_mask & ~np.isnan(data["DEVICE_LIN"])],
                        sample_nohit = data[train_mask & np.isnan(data["DEVICE_LIN"])],
                        score = "DEVICE_LIN",
                        target = col_target,
                        weight = col_weight,
                        ispd = False)
print(fill_val)

data["DEVICE_LIN_MV"] = data["DEVICE_LIN"].copy()
data.loc[np.isnan(data["DEVICE_LIN"]), "DEVICE_LIN_MV"] = fill_val

score_calibration(data=data[train_mask],
                  score="DEVICE_LIN",
                  target=col_target,
                  weight=col_weight,
                  vertical_lines=fill_val)

#### Alternative imputer not assuming linear dependence of score expit and target

Calculates imputation value for score that is missing for some observations. It divides hit sample into certain number of quantiles (sorted by score) and smoothens them (by joining neighbors together) to be monotonic in average default rate. Then, it finds the quantile whose average target value is closest to average target of no-hit sample. Average score of this quantile is then used as imputation value.
```   
Imputation score = Avg score of quantile which is argmin(|Avg default in quantile - default of non-hits|)
```
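
The quantile logic described above can be sketched roughly as follows. This is a minimal, unweighted illustration in plain Python, not the implementation of `quantile_imputer`. The function name `quantile_impute_sketch`, its arguments and the assumption that a higher score means a higher default rate are all hypothetical.

```python
def quantile_impute_sketch(hit_scores, hit_targets, nohit_targets, quantiles=10):
    """Illustrative only: a hypothetical helper, not the library's quantile_imputer."""
    # Sort the hit sample by score and cut it into roughly equal-sized buckets.
    pairs = sorted(zip(hit_scores, hit_targets))
    n = len(pairs)
    bounds = [round(i * n / quantiles) for i in range(quantiles + 1)]
    buckets = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        chunk = pairs[lo:hi]
        if chunk:
            buckets.append({"n": len(chunk),
                            "score_sum": sum(s for s, _ in chunk),
                            "target_sum": sum(t for _, t in chunk)})

    # Join neighboring buckets until the average default rate is
    # non-decreasing in score (assumption: higher score = higher risk).
    merged = []
    for bucket in buckets:
        merged.append(dict(bucket))
        while (len(merged) > 1 and
               merged[-2]["target_sum"] / merged[-2]["n"]
               > merged[-1]["target_sum"] / merged[-1]["n"]):
            last = merged.pop()
            for key in last:
                merged[-1][key] += last[key]

    # Pick the bucket whose default rate is closest to the no-hit default rate.
    nohit_rate = sum(nohit_targets) / len(nohit_targets)
    best = min(merged, key=lambda b: abs(b["target_sum"] / b["n"] - nohit_rate))
    return best["score_sum"] / best["n"]
```

Unlike the linear imputer, this approach only relies on the ordering of default rates across buckets, so it can be used when the score is not well calibrated.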

In [None]:
from scoring.score_imputation import quantile_imputer
from scoring.plot import score_calibration

In [None]:
fill_val = quantile_imputer(sample_hit = data[train_mask & ~np.isnan(data["TELCO_A_LIN"])],
                        sample_nohit = data[train_mask & np.isnan(data["TELCO_A_LIN"])],
                        score = "TELCO_A_LIN",
                        target = col_target,
                        weight = col_weight,
                        quantiles = 100)
print(fill_val)

data["TELCO_A_LIN_MV"] = data["TELCO_A_LIN"].copy()
data.loc[np.isnan(data["TELCO_A_LIN"]), "TELCO_A_LIN_MV"] = fill_val

score_calibration(data=data[train_mask],
                  score="TELCO_A_LIN",
                  target=col_target,
                  weight=col_weight,
                  vertical_lines=fill_val)

In [None]:
fill_val = quantile_imputer(sample_hit = data[train_mask & ~np.isnan(data["TELCO_B_LIN"])],
                        sample_nohit = data[train_mask & np.isnan(data["TELCO_B_LIN"])],
                        score = "TELCO_B_LIN",
                        target = col_target,
                        weight = col_weight,
                        quantiles = 100)
print(fill_val)

data["TELCO_B_LIN_MV"] = data["TELCO_B_LIN"].copy()
data.loc[np.isnan(data["TELCO_B_LIN"]), "TELCO_B_LIN_MV"] = fill_val

score_calibration(data=data[train_mask],
                  score="TELCO_B_LIN",
                  target=col_target,
                  weight=col_weight,
                  vertical_lines=fill_val)

In [None]:
fill_val = quantile_imputer(sample_hit = data[train_mask & ~np.isnan(data["DEVICE_LIN"])],
                        sample_nohit = data[train_mask & np.isnan(data["DEVICE_LIN"])],
                        score = "DEVICE_LIN",
                        target = col_target,
                        weight = col_weight,
                        quantiles = 100)
print(fill_val)

data["DEVICE_LIN_MV"] = data["DEVICE_LIN"].copy()
data.loc[np.isnan(data["DEVICE_LIN"]), "DEVICE_LIN_MV"] = fill_val

score_calibration(data=data[train_mask],
                  score="DEVICE_LIN",
                  target=col_target,
                  weight=col_weight,
                  vertical_lines=fill_val)

### Imputation based on special values

Sometimes we don't want to impute all missing values with a single imputation value, because there are multiple types of missing values (e.g. no-hit and no-ask).

We can use various masks to define the populations to be imputed and call our imputation functions for each one separately.

In [None]:
print(data["BUREAU_Y_STATUS"].unique())

#### Basic

In [None]:
fill_val_nohit = missing_value(sample_hit = data[train_mask & (data["BUREAU_Y_STATUS"]=='HIT')],
                              sample_nohit = data[train_mask & (data["BUREAU_Y_STATUS"]=='NO-HIT')],
                              score = "BUREAU_Y_LIN",
                              target = col_target,
                              weight = col_weight,
                              ispd = False)
print(fill_val_nohit)

fill_val_noask = missing_value(sample_hit = data[train_mask & (data["BUREAU_Y_STATUS"]=='HIT')],
                              sample_nohit = data[train_mask & (data["BUREAU_Y_STATUS"]=='NO-ASK')],
                              score = "BUREAU_Y_LIN",
                              target = col_target,
                              weight = col_weight,
                              ispd = False)
print(fill_val_noask)

data["BUREAU_Y_LIN_MV"] = data["BUREAU_Y_LIN"].copy()
data.loc[(data["BUREAU_Y_STATUS"]=='NO-HIT'), "BUREAU_Y_LIN_MV"] = fill_val_nohit
data.loc[(data["BUREAU_Y_STATUS"]=='NO-ASK'), "BUREAU_Y_LIN_MV"] = fill_val_noask

score_calibration(data=data[train_mask],
                  score="BUREAU_Y_LIN",
                  target=col_target,
                  weight=col_weight,
                  vertical_lines=[fill_val_nohit, fill_val_noask])

#### Alternative

In [None]:
fill_val_nohit = quantile_imputer(sample_hit = data[train_mask & (data["BUREAU_Y_STATUS"]=='HIT')],
                              sample_nohit = data[train_mask & (data["BUREAU_Y_STATUS"]=='NO-HIT')],
                              score = "BUREAU_Y_LIN",
                              target = col_target,
                              weight = col_weight,
                              quantiles = 100)
print(fill_val_nohit)

fill_val_noask = quantile_imputer(sample_hit = data[train_mask & (data["BUREAU_Y_STATUS"]=='HIT')],
                              sample_nohit = data[train_mask & (data["BUREAU_Y_STATUS"]=='NO-ASK')],
                              score = "BUREAU_Y_LIN",
                              target = col_target,
                              weight = col_weight,
                              quantiles = 100)
print(fill_val_noask)

data["BUREAU_Y_LIN_MV"] = data["BUREAU_Y_LIN"].copy()
data.loc[(data["BUREAU_Y_STATUS"]=='NO-HIT'), "BUREAU_Y_LIN_MV"] = fill_val_nohit
data.loc[(data["BUREAU_Y_STATUS"]=='NO-ASK'), "BUREAU_Y_LIN_MV"] = fill_val_noask

score_calibration(data=data[train_mask],
                  score="BUREAU_Y_LIN",
                  target=col_target,
                  weight=col_weight,
                  vertical_lines=[fill_val_nohit, fill_val_noask])

### Manual imputation

In some cases, we want to impute the missing values with values chosen by expert judgment.

In the following list of dictionaries, we define:
- `fill_variable` - name of variable that should be imputed, i.e. *what*
- `nohit_condition` - pandas query defining which rows the variable should be imputed in, i.e. *where*
- `manual_value` - value the variable should be imputed by, i.e. *how*

In [None]:
mv_manual_imputation_meta = [
    {
        'fill_variable': 'TELCO_A_LIN',
        'nohit_condition': 'TELCO_A_LIN.isnull()',
        'manual_value': -2.2,
    },
    {
        'fill_variable': 'TELCO_B_LIN',
        'nohit_condition': 'TELCO_B_LIN.isnull()',
        'manual_value': -2.2,
    },
    {
        'fill_variable': 'DEVICE_LIN',
        'nohit_condition': 'DEVICE_LIN.isnull()',
        'manual_value': -2.6,
    },
    {
        'fill_variable': 'BUREAU_Y_LIN',
        'nohit_condition': 'BUREAU_Y_STATUS=="NO-HIT"',
        'manual_value': -2.8,
    },
    {
        'fill_variable': 'BUREAU_Y_LIN',
        'nohit_condition': 'BUREAU_Y_STATUS=="NO-ASK"',
        'manual_value': -2.3,
    },
]

Run imputation based on the previously defined list of dictionaries:

In [None]:
for imp in mv_manual_imputation_meta:
    if imp['fill_variable']+'_MV' not in data.columns:
        data[imp['fill_variable']+'_MV'] = data[imp['fill_variable']].copy()
    data.loc[data.eval(imp['nohit_condition'],engine='python'), imp['fill_variable']+'_MV'] = imp['manual_value']

## Simple logistic regression

Define the list of transformed and imputed variables that will be used as predictors in logistic regression.

In [None]:
cols_pred_transformed_mv = ['INTERNAL_LIN',
                           'TELCO_A_LIN_MV',
                           'TELCO_B_LIN_MV',
                           'BUREAU_X_WOE',
                           'BUREAU_Y_LIN_MV',
                           'UTILITY_WOE',
                           'DEVICE_LIN_MV']

### Weighted model

Logistic regression model using weights from `col_weight`.

In [None]:
from sklearn.linear_model import LogisticRegression

logreg_simple_w = LogisticRegression(penalty = 'l2', C = 1000, solver = 'liblinear')
logreg_simple_w.fit(X=data[train_mask][cols_pred_transformed_mv],
                    y=data[train_mask][col_target],
                    sample_weight=data[train_mask][col_weight])
data['score_weighted'] = logreg_simple_w.predict_proba(X=data[cols_pred_transformed_mv])[:,1]

for pred, coef in zip(['Intercept'] + cols_pred_transformed_mv, np.concatenate((logreg_simple_w.intercept_, logreg_simple_w.coef_[0]))):
    print (pred, coef)

### Unweighted model

Logistic regression model not using weights.

In [None]:
from sklearn.linear_model import LogisticRegression

logreg_simple = LogisticRegression(penalty = 'l2', C = 1000, solver = 'liblinear')
logreg_simple.fit(X=data[train_mask][cols_pred_transformed_mv],
                  y=data[train_mask][col_target])
data['score_unweighted'] = logreg_simple.predict_proba(X=data[cols_pred_transformed_mv])[:,1]

for pred, coef in zip(['Intercept'] + cols_pred_transformed_mv, np.concatenate((logreg_simple.intercept_, logreg_simple.coef_[0]))):
    print (pred, coef)

### Comparison of both approaches

Gini cross-check: we compare the Gini of the weighted model on the weighted and unweighted samples with the Gini of the unweighted model on the same samples, to see whether the weights significantly change the model and its performance.

In [None]:
from scoring.metrics import gini
approach_ginis = []
approach_ginis.append({'gini':gini(data[valid_mask][col_target], data[valid_mask]['score_weighted']),
                       'score':'weighted','measurement':'unweighted'})
approach_ginis.append({'gini':gini(data[valid_mask][col_target], data[valid_mask]['score_weighted'], data[valid_mask][col_weight]),
                       'score':'weighted','measurement':'weighted'})
approach_ginis.append({'gini':gini(data[valid_mask][col_target], data[valid_mask]['score_unweighted']),
                       'score':'unweighted','measurement':'unweighted'})
approach_ginis.append({'gini':gini(data[valid_mask][col_target], data[valid_mask]['score_unweighted'], data[valid_mask][col_weight]),
                       'score':'unweighted','measurement':'weighted'})
display(pd.DataFrame(approach_ginis).pivot(index='score', columns='measurement', values='gini'))
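
To make the "weighted vs. unweighted measurement" distinction concrete, here is a minimal plain-Python sketch of a weighted Gini, assuming the usual definition Gini = 2·AUC − 1 and a score where a higher value means a higher risk. The function `weighted_gini_sketch` is hypothetical and is not the implementation of `scoring.metrics.gini`. It uses a simple pairwise comparison, so it is only suitable for small samples.

```python
def weighted_gini_sketch(targets, scores, weights=None):
    """Illustrative only: Gini = 2 * AUC - 1 with optional observation weights."""
    if weights is None:
        weights = [1.0] * len(targets)
    bads = [(s, w) for t, s, w in zip(targets, scores, weights) if t == 1]
    goods = [(s, w) for t, s, w in zip(targets, scores, weights) if t == 0]

    # Each bad/good pair counts with the product of both weights;
    # ties in score count as half a concordant pair.
    concordant = 0.0
    for bad_score, bad_weight in bads:
        for good_score, good_weight in goods:
            if bad_score > good_score:
                concordant += bad_weight * good_weight
            elif bad_score == good_score:
                concordant += 0.5 * bad_weight * good_weight

    total = sum(w for _, w in bads) * sum(w for _, w in goods)
    auc = concordant / total
    return 2.0 * auc - 1.0
```

With all weights equal, the result is the same as the unweighted Gini. Up-weighting an observation that the score ranks badly lowers the weighted Gini.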

# Building the model using CrossValidatedGeneralModel class

Class `CrossValidatedGeneralModel` can run most of the previous steps automatically. In particular, it is able to:
- transform variables which are in PD (probability, expit) form to logit form
- interactively define the missing value imputation logic
- run cross-validation or basic train/validation model fitting which will do the following:
    - impute missing values of variables using chosen algorithm or given constants
    - train logistic regression using the variables
- measure Gini of the model either in a deterministic way or using bootstrapping
- calculate marginal contribution analysis for
    - each predictor which is already in the model
    - additional predictors given by the user
- return tabular outputs with all the values of coefficients, imputation values, gini values etc. from the previous steps
- transform the data - calculate the score for each observation
- generate python transformation code

## Setup

### Metadata

#### List of predictors

List of variables we want to use in the model. These variables can be in form of probability of default (expit), logit or WOE. They don't have to be imputed as the imputation logic is implemented inside CrossValidatedGeneralModel class.

In [None]:
cols_pred_final = ['INTERNAL',
                   'TELCO_A',
                   'TELCO_B',
                   'BUREAU_X_WOE',
                   'BUREAU_Y',
                   'UTILITY_WOE',
                   'DEVICE',
                  ]

#### Probability scores to be transformed

List of predictors which are in the probability of default (expit) form. Should be subset of the list defined above. Can be empty.

In [None]:
cols_pd = ['TELCO_A',
           'BUREAU_Y',
           'DEVICE',
          ]
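
For reference, the PD (expit) form and the logit form are related by a pair of mutually inverse transformations. The notebook imports `logit` and `expit` from `scipy.special`; the plain-Python functions below (hypothetical names `logit_sketch` and `expit_sketch`) only illustrate the formulas.

```python
import math


def logit_sketch(p):
    """Probability of default -> linear (log-odds) score."""
    return math.log(p / (1.0 - p))


def expit_sketch(x):
    """Linear (log-odds) score -> probability of default."""
    return 1.0 / (1.0 + math.exp(-x))
```

For example, a PD of 0.5 corresponds to a logit of 0, and PDs below 0.5 correspond to negative logits.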

#### Imputation metadata

Logic for imputing empty values. In the following list of dictionaries, we define:
- `fill_variable` - name of variable that should be imputed, i.e. *what*
- `nohit_condition` - pandas query defining which rows the variable should be imputed in, i.e. *where*
- `manual_value` - value the variable should be imputed by, i.e. *how* - if `None`, the value is computed automatically using the rows selected by `hit_condition`
- `hit_condition` - pandas query defining which rows should be used to calculate the relationship between score and default rate, and thus the imputation value, i.e. *how*

In [None]:
mv_imputation_meta = [
    {
        'fill_variable': 'TELCO_A_LIN',
        'hit_condition': 'TELCO_A_LIN.notnull()',
        'nohit_condition': 'TELCO_A_LIN.isnull()',
        'manual_value': -2.2,
    },
    {
        'fill_variable': 'TELCO_A',
        'hit_condition': 'TELCO_A.notnull()',
        'nohit_condition': 'TELCO_A.isnull()',
        'manual_value': None,
    },
    {
        'fill_variable': 'TELCO_B_LIN',
        'hit_condition': 'TELCO_B_LIN.notnull()',
        'nohit_condition': 'TELCO_B_LIN.isnull()',
        'manual_value': -2.2,
    },
    {
        'fill_variable': 'TELCO_B',
        'hit_condition': 'TELCO_B.notnull()',
        'nohit_condition': 'TELCO_B.isnull()',
        'manual_value': None,
    },
    {
        'fill_variable': 'DEVICE',
        'hit_condition': 'DEVICE.notnull()',
        'nohit_condition': 'DEVICE.isnull()',
        'manual_value': None,
    },
    {
        'fill_variable': 'BUREAU_Y',
        'hit_condition': 'BUREAU_Y_STATUS=="HIT"',
        'nohit_condition': 'BUREAU_Y_STATUS=="NO-HIT"',
        'manual_value': None,
    },
    {
        'fill_variable': 'BUREAU_Y',
        'hit_condition': 'BUREAU_Y_STATUS=="HIT"',
        'nohit_condition': 'BUREAU_Y_STATUS=="NO-ASK"',
        'manual_value': None,
    },
]

### Setting basic parameters

Creating new instance of `CrossValidatedGeneralModel`. The parameters are:
- `predictors` - list of predictors we want to use in the model
- `predictors_pd_form` - list of predictors which are in the probability of default (expit) form
- `imputation_dicts` - imputation metadata as defined above
- `cv` - boolean whether to use cross validation
- `cv_folds` - number of folds for cross validation
- `cv_seed` - random seed for cross validation
- `imputation_type` - *quantile* to use `quantile_imputer()` or *linear* to use `missing_value()` imputer
- `bootstrapped_gini` - boolean whether Gini should be calculated using bootstrap algorithm
- `bootstrap_seed` - random seed for Gini bootstrapping

In [None]:
# import importlib
# importlib.reload(scoring)
# importlib.reload(scoring.general_model)

from scoring.general_model import CrossValidatedGeneralModel

cvgm = CrossValidatedGeneralModel(
    predictors = cols_pred_final,
    predictors_pd_form = cols_pd,
    imputation_dicts = mv_imputation_meta,
    cv = True,
    cv_folds = 5,
    cv_seed = 1111,
    imputation_type = 'quantile',
    bootstrapped_gini = True,
    bootstrap_seed = 2222,
           )

### Changing metadata in an interactive way

Interactive tools using `qgrid` to change the metadata loaded when the instance was created.

#### Probability scores to be transformed

From the `predictors` list, we can choose which of them are in probability form by ticking them in the *ispd* column.

In [None]:
cvgm.set_colspd_interactive()

#### Imputation metadata

Change the imputation metadata. You can also add a new row (i.e. a new dictionary for missing value imputation) by clicking the *New empty row* button, or remove a row by deleting the content of its *fill_variable* field.

In [None]:
cvgm.set_imputations_interactive()

## Fitting the model

`.fit()` method is used to train the model. The arguments are:
- `X` - training dataframe with predictors
- `y` - series with targets corresponding to training dataframe
- `w` - series with weights corresponding to training dataframe (not mandatory)
- `X_valid` - validation dataframe with predictors (not mandatory)
- `y_valid` - series with targets corresponding to validation dataframe (not mandatory)
- `w_valid` - series with weights corresponding to validation dataframe (not mandatory)
- `predictors` - list of predictors from X to be used. Overwrites the list defined in initialization. (not mandatory)

In case we don't use cross validation and `X_valid` is defined, the model is trained using the `X` dataset and Gini is measured using the `X_valid` dataset. If `X_valid` is not defined, Gini is also measured using the `X` dataset.

In case we use cross validation and both `X` and `X_valid` are defined, they are first concatenated and then used as the base dataset for the cross validation data split. If `X_valid` is not defined, `X` is used as the base dataset for the cross validation data split.

In [None]:
cvgm.fit(X=data[train_mask],
        y=data[train_mask][col_target],
        w=data[train_mask][col_weight],
        X_valid=data[valid_mask],
        y_valid=data[valid_mask][col_target],
        w_valid=data[valid_mask][col_weight],)

In [None]:
result_scorecard = cvgm.scorecard_table
display(result_scorecard)
result_scorecard.to_csv(output_folder+'/model/scorecard.csv')

In [None]:
result_imputations = cvgm.imputation_table
display(result_imputations)
result_imputations.to_csv(output_folder+'/model/imputations.csv')

Gini in form of `(expected Gini, [5% Gini confidence interval border, 95% Gini confidence interval border])`

In case we did not use bootstrapping for Gini evaluation, the interval borders are not available.

In [None]:
cvgm.gini_result
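
As an illustration of how such an interval can arise, the following plain-Python sketch resamples the observations with replacement, computes an unweighted Gini on each resample and reads the 5% and 95% borders from the sorted results. The names `gini_sketch` and `bootstrap_gini_sketch` and the percentile convention are assumptions made for this example; the library's `bootstrap_gini` may differ in details.

```python
import random


def gini_sketch(targets, scores):
    """Unweighted Gini = 2 * AUC - 1, ties counted as half."""
    bads = [s for t, s in zip(targets, scores) if t == 1]
    goods = [s for t, s in zip(targets, scores) if t == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bads for g in goods)
    return 2.0 * wins / (len(bads) * len(goods)) - 1.0


def bootstrap_gini_sketch(targets, scores, n_iter=100, ci_range=5, seed=2222):
    """Returns (mean Gini, (lower border, upper border)) over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(targets)
    ginis = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_targets = [targets[i] for i in idx]
        # Gini is undefined when a resample contains only one class, so skip it.
        if 0 < sum(resampled_targets) < n:
            ginis.append(gini_sketch(resampled_targets, [scores[i] for i in idx]))
    ginis.sort()
    lower = ginis[int(len(ginis) * ci_range / 100)]
    upper = ginis[min(len(ginis) - 1, int(len(ginis) * (100 - ci_range) / 100))]
    return sum(ginis) / len(ginis), (lower, upper)
```

Fixing the seed makes the interval reproducible, which is the role `bootstrap_seed` plays in the class parameters.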

## Outputs

### Marginal contribution

`.marginal_contribution()` method is used to measure marginal contribution of individual predictors:
- each predictor which is already in the model - we calculate the Gini difference between the base model and model without this predictor
- additional predictors given by the user - we calculate the Gini difference between model with a predictor added and the base model

Its argument is `predictors_to_add` - a list of additional predictors whose marginal contribution to the base model is calculated. It can be an empty list. (not mandatory)

In [None]:
mc = cvgm.marginal_contribution(predictors_to_add = ['TELCO_B_LIN'])
display(mc)
mc.to_csv(output_folder+'/model/marginal_contribution.csv')
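
The two kinds of marginal contribution can be sketched as below. This is only an outline of the bookkeeping: `gini_of` is a hypothetical callable that is assumed to refit the model on the given predictors and return its Gini, and `marginal_contribution_sketch` is not the library method.

```python
def marginal_contribution_sketch(base_predictors, predictors_to_add, gini_of):
    """Illustrative only. Returns a list of (predictor, role, Gini difference)."""
    base_gini = gini_of(tuple(base_predictors))
    rows = []

    # Predictors already in the model: how much Gini is lost when one is removed.
    for predictor in base_predictors:
        reduced = tuple(p for p in base_predictors if p != predictor)
        rows.append((predictor, "in model", base_gini - gini_of(reduced)))

    # Candidate predictors: how much Gini is gained when one is added.
    for predictor in predictors_to_add:
        extended = tuple(base_predictors) + (predictor,)
        rows.append((predictor, "candidate", gini_of(extended) - base_gini))

    return rows
```

Because every row requires a refit, the cost grows linearly with the number of predictors in the model plus the number of candidates.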

### Scoring code

Python code that can be used for calculation of the score. It uses a `pandas` dataframe as the source of the predictors and the `expit` function from the `scipy` library.

In [None]:
result_code = cvgm.transformation_code(dataset_name='data')
print(result_code)

with open(output_folder+'/model/scoring_code.py', 'w') as result_code_file:
    print(result_code, file=result_code_file)

### Add score column to the dataset

In [None]:
col_score = 'GM_SCORE'

In [None]:
data[col_score] = cvgm.transform(data)
print(data[col_score])

### Column with the temporary scores from cross-validation folds

If we used cross validation to fit the scorecard, the final scorecard was trained on the whole set, so it might be overfitted.

For such cases, during each iteration of cross-validation process when a temporary scorecard is fitted, the scored testing data set from that iteration is saved. Then all these sets are concatenated to create dataset scored by non-overfitted temporary scorecards.
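
The idea of assembling scores from the held-out folds can be sketched as follows. The example is a simplification under stated assumptions: the split is a plain shuffled k-fold (not stratified), `fit` is a hypothetical callable that returns a scoring function, and `fit_mean_model` is a toy model used only to make the sketch runnable.

```python
import random


def out_of_fold_scores(xs, ys, fit, n_folds=5, seed=1111):
    """Illustrative only: every row is scored by a model that never saw it."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]

    scores = [None] * len(xs)
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        # Temporary model fitted without the held-out fold.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in test_idx:
            scores[i] = model(xs[i])
    return scores


def fit_mean_model(train_x, train_y):
    """Toy model: predicts the training default rate for every observation."""
    rate = sum(train_y) / len(train_y)
    return lambda x: rate
```

Each observation receives exactly one score, and that score comes from a temporary model trained on the other folds only.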

In [None]:
col_tmp_cv_score = 'GM_SCORE_TMP_CV'

In [None]:
data[col_tmp_cv_score] = cvgm.validation_prediction
eval_mask = data[col_tmp_cv_score].notnull()
data[col_tmp_cv_score]

# Subpopulation analyses

As General Model is directly used for rejection of clients, we might be curious how it will behave for various subpopulations of clients, e.g. hit populations of some data sources or some special client groups.

In the following part, we will measure Gini of the model using observations of these groups only; and we will estimate approval rate of these groups if we base the approval/rejection process just on this score.

## Hit population Gini

For each population of interest, we calculate Gini coefficient of our score.

We use the temporary validation score in case cross validation was used to fit the scorecard, so our Gini estimation is not biased.

In [None]:
populations_of_interest = [
    'TELCO_A.notnull()',
    'TELCO_B.notnull()',
    'BUREAU_X.notnull()',
    'BUREAU_Y.notnull()',
]

analysis_gini_mask = eval_mask

In [None]:
from scoring.metrics import gini, bootstrap_gini

population_gini = []
for pop in populations_of_interest:
    pop_mask = data.eval(pop, engine="python")
    g, g_std, g_ci = bootstrap_gini(data = data[analysis_gini_mask & pop_mask],
                                   col_target=col_target,
                                   col_score=col_tmp_cv_score,
                                   col_weight=col_weight,
                                   n_iter=100,
                                   ci_range=5,)
    population_gini.append({'Population':pop, 'Gini':g, 'Gini_5%':g_ci[0], 'Gini_95%':g_ci[1]})
population_gini = pd.DataFrame(population_gini)
population_gini.set_index('Population', inplace=True)

In [None]:
display(population_gini)
population_gini.to_csv(output_folder+'/analysis/subpopulation_gini.csv')

## Approval rates

For each population of interest we calculate theoretical approval rate. The estimation goes as follows:
- We define a reference approval rate for the whole population of incoming customers
- We calculate a cutoff value which corresponds to this targeted approval rate
- We set the same cutoff for the subpopulation
- We evaluate what the approval rate on just this subpopulation would be when the cutoff is applied.

If the subpopulation approval rate differs from the reference approval rate, it usually means that the subpopulation is shifted, i.e. the estimated probability of default of such customers is different from the probability of default of a typical customer.
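
The estimation steps above can be sketched in plain Python. This is a hedged illustration, not the implementation of `expected_ar`: the function name `expected_ar_sketch`, the boolean list `in_subpop` and the convention that customers with the lowest score (PD) are approved first are assumptions made for the example.

```python
def expected_ar_sketch(scores, in_subpop, reference_ar, weights=None):
    """Illustrative only. Returns (cutoff, approval rate of the subpopulation)."""
    if weights is None:
        weights = [1.0] * len(scores)
    total = sum(weights)

    # Cutoff: lowest score at which the cumulative weight share of the whole
    # population (sorted from the best score) reaches the reference approval rate.
    cumulative = 0.0
    cutoff = None
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        cutoff = score
        if cumulative / total >= reference_ar:
            break

    # Apply the same cutoff to the subpopulation only.
    sub_total = sum(w for w, m in zip(weights, in_subpop) if m)
    sub_approved = sum(w for s, w, m in zip(scores, weights, in_subpop)
                       if m and s <= cutoff)
    return cutoff, sub_approved / sub_total
```

In the usage below, a subpopulation concentrated in the riskier part of the score range ends up with an approval rate well below the reference rate of 0.5.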

In [None]:
populations_of_interest_ar = [
    'TELCO_A.notnull()',
    'TELCO_B.notnull()',
    'BUREAU_X.notnull()',
    'BUREAU_Y.notnull()',
]

analysis_ar_mask = everything_mask
reference_ar = 0.70

In [None]:
from scoring.general_model import expected_ar

population_ar = []
for pop in populations_of_interest_ar:
    ar = expected_ar(data = data[analysis_ar_mask],
                     col_score = col_score,
                     query_subset = pop,
                     col_weight = col_weight,
                     reference_ar = reference_ar,
                     def_by_score_ascending = False)
    population_ar.append({'Population':pop, 'AR':ar})
population_ar = pd.DataFrame(population_ar)
population_ar.set_index('Population', inplace=True)

In [None]:
display(population_ar)
population_ar.to_csv(output_folder+'/analysis/approval_rates.csv')

# Documentation outputs

## Correlation matrix

First we transform the data (including filling the missing values) using the outputs from CrossValidatedGeneralModel class and then draw the correlation matrix.

In [None]:
corr_data = data[cvgm.predictors].copy()
for col_pd in set(cvgm.predictors) & set(cvgm.predictors_pd_form):
    corr_data.loc[:,col_pd] = logit(corr_data[col_pd])
for imp in cvgm.imputation_values:
    if imp['fill_variable'] in corr_data.columns:
        corr_data.loc[data.eval(imp['nohit_condition'],engine="python"),imp['fill_variable']] = imp['fill_value']

In [None]:
documentation.Correlations(data=corr_data,
                           predictors=corr_data.columns,
                           sample="All",
                           output_folder=output_folder+"/analysis/",
                           filename="correlation.png")

## Transition matrices

Matrices describing the relationship between deciles of two scores. They show the default rate in each decile-decile combination and shares of observations from one decile of "old" score in each decile of "new" score.

In this example we compare the internal scorecard and the GM that was developed on top of that.

The transition matrix is first calculated on just the observable rows (where we have the target) and then on all rows of the dataset.

In [None]:
from scoring.plot import transmatrix
col_comparison_score = 'INTERNAL'

transmatrix(oldscore = data[observable_mask][col_comparison_score],
            newscore = data[observable_mask][col_score],
            target = data[observable_mask][col_target],
            base = data[observable_mask][col_base],
            obs = data[observable_mask][col_base],
            draw_default_matrix=True,
            draw_transition_matrix=True,
            savepath=output_folder+'/analysis/devpop_',
            quantiles_count = 10)

from scoring.plot import transmatrix

transmatrix(oldscore = data[col_comparison_score],
            newscore = data[col_score],
            target = data[col_target],
            base = data[col_base],
            obs = data[col_base],
            draw_default_matrix=False,
            draw_transition_matrix=True,
            savepath=output_folder+'/analysis/allpop_',
            quantiles_count = 10)

## Gini and lift curves

We can choose for which samples (masks) we want to draw the Lift and Gini curves and calculate the performance metrics.

If we used cross validation for development of the model, we want to evaluate the performance using the temporary score from the validation part of the cross-validation folds, which we previously saved to column `col_tmp_cv_score`.

In [None]:
eval_masks = {
    'eval' : eval_mask,
}

In [None]:
from scoring.metrics import eval_performance_wrapper
from scoring.tools import curves_wrapper

perf = eval_performance_wrapper(data=data,
                                masks=eval_masks,
                                col_target=col_target,
                                col_score=col_tmp_cv_score,
                                col_weight=col_weight,
                                lift_perc=10)
display(perf)
perf.to_csv(output_folder+'/performance/performance.csv')

curves_wrapper(data=data,
               masks=eval_masks,
               col_target=col_target,
               col_score=col_tmp_cv_score,
               col_weight=col_weight,
               output_folder=output_folder+'/performance/')

## Score distribution

### Histogram of goods/bads

Distribution (histogram) of the final GM score in its PD and logit form, and distribution of goods and bads in each part of the histogram.

In [None]:
from scoring.plot import plot_score_dist

plot_score_dist(data,
                score_name = col_score,
                target_name = col_target,
                weight_name = col_weight,
                n_bins = 40,
                labels = ['good','bad'],
                legend_loc = 'upper right',
                savefile = output_folder+'/model/distr_pd.png')

In [None]:
data[col_score+'_LIN'] = logit(data[col_score])

plot_score_dist(data,
                score_name = col_score+'_LIN',
                target_name = col_target,
                weight_name = col_weight,
                n_bins = 40,
                labels = ['good','bad'],
                legend_loc = 'upper right',
                savefile = output_folder+'/model/distr_linear.png')

### Histogram of hits of certain datasource

In [None]:
data[col_score+'_LIN'] = logit(data[col_score])

for pred in cvgm.predictors:
    
    data['_hit'] = data[pred].notnull().astype(int)
    
    if 0<data['_hit'].mean()<1:

        plot_score_dist(data,
                score_name = col_score+'_LIN',
                target_name = '_hit',
                weight_name = col_weight,
                n_bins = 40,
                labels = [f'{pred} no-hit',f'{pred} hit'],
                legend_loc = 'upper right',
                savefile = output_folder+f'/model/distr_linear_hit_{pred}.png')
        
    data.drop(['_hit'], axis=1, inplace=True)

In [None]:
# import importlib
# importlib.reload(scoring)
# importlib.reload(scoring.plot)

from scoring.plot import score_calibration

score_calibration(
    data=data,
    score=col_score+'_LIN',
    target=col_target,
    weight=col_weight,
    savefile=output_folder+'/model/calibration.png'
)

## Cutoff analysis

Simple analyses showing what a cutoff targeting a given reference approval rate would mean for the confusion matrix and the bad rate of approved customers.

Before we do this analysis, we impute the target of rejected (non-observable) rows using the new GM score. From each of these rows, we create two new rows, one with target value `1`, the other with target value `0`. The weight of each new row equals the weight of the original row times the probability of default and non-default, respectively.

In [None]:
# import importlib
# importlib.reload(scoring)
# importlib.reload(scoring.reject_inference)

from scoring.reject_inference import TargetImputer
col_reject = 'REJECTED'
data[col_reject] = 1 - data['APPROVED']

targimp = TargetImputer(imputation_type='weighted')
targimp.fit(data = data,
            col_probs = col_score,
            col_reject = col_reject,
            col_weight = col_weight,
            prob_of = 1)
data_imputed = targimp.transform(data = data,
                                col_target = col_target,
                                col_weight = col_weight,
                                as_new_columns = True,
                                reset_index = True)

col_target_imputed = col_target + '_IMPUTED'
col_weight_imputed = col_weight + '_IMPUTED'
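
The weighted target imputation described above can be illustrated on plain dictionaries. This is a minimal sketch, not the `TargetImputer` implementation: the function name and the row layout (keys `rejected`, `target`, `weight`, `pd`) are hypothetical.

```python
def impute_rejected_targets_sketch(rows):
    """Illustrative only: split each rejected row into a bad and a good row."""
    imputed = []
    for row in rows:
        if row["rejected"]:
            # Bad copy carries weight * PD, good copy carries weight * (1 - PD).
            imputed.append({**row, "target": 1, "weight": row["weight"] * row["pd"]})
            imputed.append({**row, "target": 0, "weight": row["weight"] * (1.0 - row["pd"])})
        else:
            # Observable rows keep their real target and weight.
            imputed.append(dict(row))
    return imputed
```

The total weight of the dataset is preserved, because the two copies of a rejected row always sum to the weight of the original row.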

`confusion_chart` plots the percentage of false rejects (number of good rejected clients divided by number of all good clients) and percentage of false approves (number of bad approved clients divided by number of all bad clients) in dependence on approval rate.

`expected_default_rate` plots the bad rate of rejected clients and bad rate of approved clients in dependence on approval rate.
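
For a single approval rate, the four quantities described above can be computed as in the following sketch. It is a plain-Python illustration under stated assumptions (lowest scores are approved first, targets are 0/1, both classes appear on each side of the cutoff) and does not reproduce the plotting functions. The name `cutoff_confusion_sketch` is hypothetical.

```python
def cutoff_confusion_sketch(scores, targets, weights, approval_rate):
    """Illustrative only: false reject/approve shares and bad rates at one cutoff."""
    total = sum(weights)
    good = {"approved": 0.0, "rejected": 0.0}
    bad = {"approved": 0.0, "rejected": 0.0}

    cumulative = 0.0
    for score, target, weight in sorted(zip(scores, targets, weights)):
        # Approve rows from the best score while the approved share of weight
        # stays within the approval rate (small tolerance for float rounding).
        within = (cumulative + weight) / total <= approval_rate + 1e-12
        decision = "approved" if within else "rejected"
        cumulative += weight
        (bad if target == 1 else good)[decision] += weight

    return {
        "false_reject": good["rejected"] / (good["approved"] + good["rejected"]),
        "false_approve": bad["approved"] / (bad["approved"] + bad["rejected"]),
        "bad_rate_approved": bad["approved"] / (bad["approved"] + good["approved"]),
        "bad_rate_rejected": bad["rejected"] / (bad["rejected"] + good["rejected"]),
    }
```

Repeating this calculation over a grid of approval rates gives the curves that the two charts display.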

In [None]:
from scoring.plot import confusion_chart, expected_default_rate
reference_ar_cutoff = 0.70

false_reject, false_approve = confusion_chart(
    data=data_imputed,
    col_score=col_score,
    col_target=col_target_imputed,
    col_weight=col_weight_imputed,
    reference_ar=reference_ar_cutoff,
    savefile=output_folder+'/analysis/confusion_chart.png'
)

bad_rate_rejected, bad_rate_approved = expected_default_rate(
    data=data_imputed,
    col_score=col_score,
    col_target=col_target_imputed,
    col_weight=col_weight_imputed,
    reference_ar=reference_ar_cutoff,
    savefile=output_folder+'/analysis/expected_default_rate.png'
)

---
Nice to have todos
- Cost analysis of source XY
- Autoweighter which also reflects intersection values
- WOE part of CVGM
- Voting ensemble (avg of subscore weighted by their power), calibration first
---