<span style="font-size:28pt;font-weight:bold">Home Credit Python Scoring for Collections </span> <br><br>
<span style="font-size:28pt;font-weight:bold">    Data Preparation Workflow v.0.8.1</span>

**Copyright:**

© 2017-2020, Pavel Sůva, Marek Teller, Martin Kotek, Jan Zeller, Marek Mukenšnabl, Kirill Odintsov, Jan Hynek, Elena Kuchina, Lubor Pacák, Naďa Horká and Home Credit & Finance Bank Limited Liability Company, Moscow, Russia – all rights reserved

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the [License](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

For the list of contributors, see the [GitLab page](https://git.homecredit.net/risk/python-scoring-workflow)

# Import Packages

In [None]:
import time
import datetime
import math
import numpy as np
import pandas as pd
# # import cx_Oracle
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import os.path

import sys
sys.path.insert(0, '..') # path of scoring workflow folder 
import scoring
from scoring import db



Set general technical parameters and paths.

In [None]:
sns.set()
%matplotlib inline
%config InlineBackend.close_figures=True
from IPython.display import display, Markdown
pd.options.display.max_columns = None
pd.options.display.max_rows = 15
output_folder = 'documentation_preparation_demo'

# Create the output folders (parent folder included) if they do not exist yet
os.makedirs(os.path.join(output_folder, 'analysis'), exist_ok=True)
os.makedirs(os.path.join(output_folder, 'datasets'), exist_ok=True)
    
scoring.check_version('0.9.0', list_versions=True)

# Data import

Importing data from a CSV file. It is important to set the following parameters:

- encoding: usually 'utf-8', or windows-xxxx on Windows machines, where xxxx is 1250 for Central Europe, 1251 for Cyrillic, etc.
- sep: the column separator used in the file
- decimal: the decimal separator (dot or comma)
- index_col: the column used as the index; it should be the unique credit case identifier

In [None]:
data = db.read_csv(r'coll_demo_data\demo_dataset.csv',
                      sep = ',', decimal = '.', optimize_types=True,
                      encoding = 'utf-8', low_memory = False)

print('Data loaded on',datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S'))
print()
print('Number of rows: {:15.0f}'.format(data.shape[0]))
print('Number of columns: {:12.0f}'.format(data.shape[1]))
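
The effect of these parameters can be illustrated with plain `pandas.read_csv` on an in-memory file. This is only a sketch: the toy content and the column names `ID` and `AMOUNT` are made up for illustration, and it does not use the workflow's own `db.read_csv` wrapper.

```python
import io

import pandas as pd

# Hypothetical file content: semicolon-separated, decimal comma
raw = "ID;AMOUNT\n1001;12,5\n1002;7,25\n"

# sep and decimal must match the file; index_col picks the unique identifier
toy = pd.read_csv(io.StringIO(raw), sep=';', decimal=',', index_col='ID')

print(toy)
print(toy.dtypes)
```

With a wrong `decimal` setting, the `AMOUNT` column would be read as text instead of numbers, which is why these parameters are worth checking before any further processing.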

In [None]:
data.info()

## Metadata Definition

Assigning the ID column, target column, time column, and month and week columns. The month and week columns do not have to exist in the dataset; they are created later in this workflow. Creating a metadata.csv file.

In [None]:
### THESE COLUMNS MUST BE INCLUDED IN THE DATA SET ###

# name of your target column in your dataset
col_target_orig = "TARGET_DPD"
col_target_z = 'TARGET_Z'
# name of the time column in your dataset
col_time = "STARTDATE"
col_diff_days = "DAYS_DIFF"
# name of the workflow column - usually Low, High, Medium etc.
col_workflow = 'PROCESS_NAME'
col_treatment = 'HIGHER_TREATMENT'
# name of the product column - e.g. CASH/CONSUMER
col_product = 'TYPEOFCREDIT'


### THESE COLUMNS DON'T HAVE TO BE INCLUDED IN THE DATA SET AND ARE CREATED AUTOMATICALLY LATER with this given name ###
#name of the base column
col_base = "BASE"
# name of the year column
col_year = "YEAR"
# name of the month column
col_month = "MONTH"
# name of the day column
col_day = "DAY"
# name of the year and week column
col_week = "WEEK"



col_instalment = 'AMTINSTALMENT'
col_receivable = 'AMT_RECEIVABLE' 


# #name of the weight column 
col_weight = 'WEIGHT'


In [None]:
# DECIDE WHICH TIME UNIT YOU WANT TO USE AS DEFAULT FOR THIS EXPLORATORY ANALYSIS - A WEEK OR A MONTH? 

time_unit = col_month

Creating the day, month and week columns from the time column involves the following steps:
- take the time column and state the format in which the time is stored; **you need to specify this in the variable *dtime_input_format*** (see https://docs.python.org/3/library/time.html#time.strftime for reference)
- reduce the format to a year, month, day string
- convert the string to a number
- add the new column to the dataset as day
- truncate this column to just year and month and add it to the dataset as month
- add the week to the dataset

In [None]:
dtime_input_format = '%Y-%m-%d'

In [None]:
data.loc[:,col_day] = pd.to_numeric(pd.to_datetime(data[col_time], format=dtime_input_format, cache=False).dt.strftime('%Y%m%d'))
data[col_month] = data[col_day].apply(lambda x: math.trunc(x/100))
data[col_year] = data[col_day].apply(lambda x: math.trunc(x/10000))

data[col_time]=pd.to_datetime(data[col_time], format=dtime_input_format, cache=False)

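# Series.dt.week was removed in pandas 2.0; isocalendar() gives the ISO week together with its ISO year,
# so dates such as 2019-12-30 (ISO week 1 of 2020) are labelled 202001 rather than 201901
iso_calendar = data[col_time].dt.isocalendar()
data[col_week] = (iso_calendar['year'] * 100 + iso_calendar['week']).astype(int)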

print('Columns',col_day,',',col_month,'and',col_week,'added/modified. Number of columns:',data.shape[1])
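
The steps above can be traced on two hand-picked dates. This is a minimal sketch on a toy frame; it assumes the week label is built from the ISO year and ISO week, which differs from the calendar year for dates around New Year.

```python
import pandas as pd

# Toy data: one ordinary date and one date that falls into ISO week 1 of the next year
toy = pd.DataFrame({'STARTDATE': ['2020-03-09', '2019-12-30']})
dates = pd.to_datetime(toy['STARTDATE'], format='%Y-%m-%d')

toy['DAY'] = dates.dt.strftime('%Y%m%d').astype(int)   # e.g. 20200309
toy['MONTH'] = toy['DAY'] // 100                       # e.g. 202003
toy['YEAR'] = toy['DAY'] // 10000                      # e.g. 2020

iso = dates.dt.isocalendar()                           # columns: year, week, day
toy['WEEK'] = (iso['year'] * 100 + iso['week']).astype(int)

print(toy)
```

The second row shows why the ISO year matters: 2019-12-30 belongs to ISO week 1 of 2020, so its week label is 202001 while its month label stays 201912.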

In [None]:
data.head()

## Target Definition 

Use this section if you have not decided on the target yet. The target is defined as 'delinquent in target-DPD'.

To be able to use the script, your data have to contain
- The date of the payment
- The date difference between entry date and payment date

In [None]:
# Computing the date difference between entry date and payment date
data[[col_time,'PAIDDATE']] = data[[col_time,'PAIDDATE']].apply(pd.to_datetime, format=dtime_input_format, cache=False)
data[col_diff_days] = (data['PAIDDATE'] - data[col_time]).dt.days

In [None]:
# Cases without a payment date are treated as unpaid for the whole 30-day window
data.loc[data['PAIDDATE'].isna(), col_diff_days] = 30

In [None]:
data[(data[col_time] > datetime.datetime.strptime('2020-03-10', '%Y-%m-%d')) & (data['PAIDDATE'].isna())].index

Delete unpaid rows whose target date was still in the future at the time of the dataset download.

In [None]:
data.drop(data[(data[col_time] > datetime.datetime.strptime('2020-03-10', '%Y-%m-%d')) & \
               (data['PAIDDATE'].isna())].index, inplace=True)

In [None]:
data_unfinished = data[data['PAIDDATE'].isna()] 

print((datetime.datetime.strptime('2020-03-13', '%Y-%m-%d') - data_unfinished[col_time]).dt.days)

In [None]:
# Cap the day difference at 30 days
data.loc[data[col_diff_days] > 30, col_diff_days] = 30

In [None]:
days = pd.DataFrame(data[[col_diff_days, col_workflow, 'AMTBALANCEACTUALCONTRACT']].groupby([col_diff_days, col_workflow]).sum())
days.sort_index(inplace=True)

days_cum = days.groupby(col_workflow).cumsum()
days_cum.reset_index(level = col_workflow, inplace=True)
days_cum = days_cum.pivot(columns=col_workflow, values='AMTBALANCEACTUALCONTRACT')
maxim = days_cum.max()
days_pct = days_cum/maxim
# display(days_pct)
days_pct_diff = (days_pct['HIGH'] - days_pct['LOW'])*100
# display(days_pct_diff)

In [None]:
# One figure with two y-axes (a separate plt.figure() call would only leave an empty extra figure)
fig, ax1 = plt.subplots(figsize=(20,10))

ax2 = ax1.twinx()
ax1.plot(days_pct)
ax2.plot(days_pct_diff, 'xkcd:cloudy blue')

plt.title('Cumulative Payment of Balance', fontsize=28)
ax1.set_xlabel('DPD')
ax1.set_ylabel('Cumulative % Paid')
ax1.legend(days_pct.columns, loc='lower center')
# ax2.legend('diff H-L')
ax2.set_ylabel('pp diff H-L', color='b')

filepath = os.path.join(output_folder, 'analysis', 'payment_curve.png')
plt.savefig(filepath, format='png', bbox_inches='tight')
plt.show()

**Decision of the target**

In [None]:
col_target_orig = 'TARGET_10D'

data[col_target_orig] = 1
data.loc[(data[col_diff_days]<= 10), col_target_orig] = 0
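
The whole target logic can be checked on a few hand-made cases. The following is a small sketch on toy data, assuming the same conventions as above: a missing payment date counts as 30 days, the difference is capped at 30, and the target is 1 when the client has not paid within 10 days.

```python
import pandas as pd

# Toy cases: paid after 4 days, paid after 19 days, not paid at all
toy = pd.DataFrame({
    'STARTDATE': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-01']),
    'PAIDDATE': pd.to_datetime(['2020-01-05', '2020-01-20', None]),
})

# Day difference; unpaid cases (NaT) become 30, everything is capped at 30
toy['DAYS_DIFF'] = (toy['PAIDDATE'] - toy['STARTDATE']).dt.days
toy['DAYS_DIFF'] = toy['DAYS_DIFF'].fillna(30).clip(upper=30)

# Target: 1 = not paid within 10 days, 0 = paid within 10 days
toy['TARGET_10D'] = (toy['DAYS_DIFF'] > 10).astype(int)

print(toy)
```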

In [None]:
if col_base not in data:
    data[col_base] = 0
    data.loc[data[col_target_orig]==0,col_base] = 1
    data.loc[data[col_target_orig]==1,col_base] = 1
    print('Column',col_base,'added/modified. Number of columns:',data.shape[1])
else:
    print('Column',col_base,'already exists.')

# Data Cleaning

The most important part of data preparation is to carefully check the values (explore_df, explore_numerical and explore_categorical are here to help you) and decide which attributes and which rows you will or will not use. Some attributes are crucial for calculating the CAASB (amount receivable and amount of instalment), so you cannot calculate the CAASB for rows where they are missing.

You can decide to clean the data, e.g. by deleting rows that lack important information or by deleting columns with a predictor that is filled (not null) in only 5 % of the rows.

You can refer to the Segmentation Cookbook https://wiki.homecredit.net/confluence/display/RSK/Segmentation+Cookbook 

## Installment and Receivable

Here, only rows with a null amount of instalment are deleted from the data.

In [None]:
# Keep only rows with a known instalment amount and report how many rows were removed
rownr_0 = data.shape[0]
print('Original number of rows: ', rownr_0)

data = data[data[col_instalment].notna()]

rownr_1 = data.shape[0]
print('New number of rows: ', rownr_1)
print('Number of rows deleted: ', rownr_0 - rownr_1)

Here, only rows with a null amount of receivable are deleted from the data.

In [None]:
rownr_0 = data.shape[0]
print('Original number of rows: ', rownr_0)

data = data[data[col_receivable].notna()]

rownr_1 = data.shape[0]
print('New number of rows: ', rownr_1)
print('Number of rows deleted: ', rownr_0 - rownr_1)

## Too Many Missing Values

In [None]:
from scoring.data_exploration import metadata_table

In [None]:
meta_table = metadata_table(data)

In [None]:
min_fill_percentage = 5

In [None]:
# Select the columns whose fill percentage is below the threshold
# (DataFrame.append was removed in pandas 2.0, so a boolean filter is used instead)
low_perc = meta_table[meta_table['fill pct'] < min_fill_percentage]
display(low_perc)
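
The same kind of check can be reproduced with pandas alone. This is a sketch on toy data, under the assumption that the fill percentage means the share of non-null values in a column; the exact definition used by `metadata_table` is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: A is fully filled, B is filled in 1 of 4 rows, C is empty
toy = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0],
    'B': [np.nan, np.nan, np.nan, 1.0],
    'C': [np.nan, np.nan, np.nan, np.nan],
})

# Share of non-null values per column, in percent
fill_pct = toy.notna().mean() * 100

# Hypothetical threshold, chosen high only so that the toy data shows an effect
min_fill = 30
low_fill_columns = fill_pct[fill_pct < min_fill].index.tolist()

print(fill_pct)
print('Columns below the threshold:', low_fill_columns)
```

Columns listed this way are candidates for removal, for example with `toy.drop(columns=low_fill_columns)`.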

# Dataset Splits

## Splitting the Dataset into Products

Creating the masks for different products.

In [None]:
products = data[col_product].unique()

print(products)

In [None]:
from scoring.plot import plot_dataset 

# time_unit = col_week # use if you want to see a different time interval
    
plot_dataset(
    data,
    month_col=time_unit,
    def_col=col_target_orig,
    base_col=col_base,
    segment_col=col_product,
    output_folder=os.path.join(output_folder, "analysis"),
    filename="bad_rate_plot.png",
#     weight_col=col_weight,
    zero_ylim=True,
)

# time_unit = col_month # use if you want to set the interval back

## Splitting the Dataset into Workflows

Creating the masks for different workflows. 

**Get to know your workflows - which do we have in the dataset?**

In [None]:
workflows = data[col_workflow].unique()

print(workflows)

**Higher treatment workflow definition**

To obtain the uplift graphs, we have to specify which treatment is the 'more intensive treatment' and mark it as higher_treatment = 1. The less intensive treatment is higher_treatment = 0.

In [None]:
data[col_treatment] = 0
data.loc[data[col_workflow] == 'HIGH', col_treatment] = 1

**Default Rate in Time**

A simple visualization of the counts and bad rates for each workflow defined in the data.

In [None]:
# from scoring.plot import plot_dataset 

# time_unit = col_week # use if you want to see different time interval 

plot_dataset(
    data,
    month_col=time_unit,
    def_col=col_target_orig,
    base_col=col_base,
    segment_col=col_workflow,
    output_folder=os.path.join(output_folder, "analysis"),
    filename="bad_rate_plot_wf.png",
    #     weight_col=col_weight,
    zero_ylim=True,
    )

    
# time_unit = col_month # use if you want to set the interval back

## Target Variable Transformation

Jaroszewicz's transformation creates a new target variable Z in a very simple manner.

<br> <br>

<center>
Z = 1 if treatment = High and original target = 0 <br>
Z = 1 if treatment = Low and original target = 1 <br>
Z = 0 ... otherwise <br>
    
</center>

If you are interested in the background, please check this [link](http://people.cs.pitt.edu/~milos/icml_clinicaldata_2012/Papers/Oral_Jaroszewitz_ICML_Clinical_2012.pdf)



**Please be aware** that since our target is defined as 1 == client did not pay, 0 == client paid, some of the definitions from this paper had to be flipped to suit our needs.

In [None]:
# TARGET TRANSFORMATION
data[col_target_z] = 0
data.loc[(data[col_target_orig]==0) & (data[col_treatment] == 1),col_target_z] = 1
data.loc[(data[col_target_orig]==1) & (data[col_treatment] == 0),col_target_z] = 1
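
The transformation can be verified on the four possible combinations of treatment and original target. The sketch below uses a toy frame; it relies on the observation that, with both columns coded as 0/1, Z equals 1 exactly when the two values differ.

```python
import pandas as pd

# All four combinations of treatment (1 = High, 0 = Low) and target (1 = did not pay)
toy = pd.DataFrame({
    'HIGHER_TREATMENT': [1, 1, 0, 0],
    'TARGET': [0, 1, 0, 1],
})

# Z = 1 for (High, paid) and (Low, not paid), i.e. when treatment != target
toy['TARGET_Z'] = (toy['HIGHER_TREATMENT'] != toy['TARGET']).astype(int)

print(toy)
```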

In [None]:
data.head()

In [None]:
# from scoring.plot import plot_dataset 

# time_unit = col_week # use if you want to see different time interval 
plot_dataset(
    data,
    month_col=time_unit,
    def_col=col_target_z,
    base_col=col_base,
    segment_col=col_product,
    output_folder=os.path.join(output_folder, "analysis"),
    filename="bad_rate_product_z.png",
    #     weight_col=col_weight,
    zero_ylim=True,
    )
    
# time_unit = col_month # use if you want to set the interval back

### Weights for the Transformed Target

For the one-model transformation, the ratio of cases in the higher and lower treatment needs to be 1:1, i.e. the probabilities must satisfy

**P(group == higher treatment) = P(group == lower treatment) = 1/2**

If not, we may reweight or resample the training dataset such that the assumption becomes valid. 
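
A short derivation sketch shows why the 1:1 ratio matters. It is written in this workflow's convention (Y = 1 means the client did not pay) and assumes that the treatment assignment does not depend on the predictors x; the symbols T, Y and x are notation introduced here for illustration only.

$$
P(Z=1 \mid x) = P(T=\text{High})\,P(Y=0 \mid x, T=\text{High}) + P(T=\text{Low})\,P(Y=1 \mid x, T=\text{Low})
$$

With $P(T=\text{High}) = P(T=\text{Low}) = \tfrac{1}{2}$ and $P(Y=1 \mid x, T=\text{Low}) = 1 - P(Y=0 \mid x, T=\text{Low})$, this simplifies to

$$
2\,P(Z=1 \mid x) - 1 = P(Y=0 \mid x, T=\text{High}) - P(Y=0 \mid x, T=\text{Low})
$$

so a single model for Z directly estimates the difference in payment probability between the two treatments. Without the 1:1 ratio the two terms would be weighted unequally and this identity would not hold.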

In [None]:
def make_balanced(df, balancing_var, weight_col, group_name=None, group=None):
    # Optionally restrict the data to a single group (e.g. one product)
    if group is not None:
        df = df[df[group_name] == group]

    # Share of each value of balancing_var; the most frequent value gets weight 1,
    # less frequent values get proportionally higher weights
    shares = df[balancing_var].value_counts(normalize=True)
    weights = (shares.max() / shares).rename(weight_col).rename_axis(balancing_var).reset_index()

    # Attach the weight to every row (note: the merge resets the row index)
    df = pd.merge(df, weights, how='left', on=balancing_var)

    print(weights)
    return df
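
The effect of this kind of reweighting can be seen on a tiny example. The sketch below is self-contained and only mirrors the weighting idea (largest group gets weight 1, smaller groups are scaled up); the toy data is made up.

```python
import pandas as pd

# Toy data: three cases in the higher treatment, one in the lower treatment
toy = pd.DataFrame({'HIGHER_TREATMENT': [1, 1, 1, 0]})

# Share of each group and the resulting weights (max share / own share)
shares = toy['HIGHER_TREATMENT'].value_counts(normalize=True)
weights = shares.max() / shares

toy['WEIGHT'] = toy['HIGHER_TREATMENT'].map(weights)

# After weighting, both groups carry the same total weight
weighted_totals = toy.groupby('HIGHER_TREATMENT')['WEIGHT'].sum()

print(toy)
print(weighted_totals)
```

Here the single lower-treatment case receives weight 3, so both groups sum to the same total weight and the 1:1 assumption holds in the weighted data.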
     

**Choose which groups to balance**

In [None]:
balance_group = 'all'  # 'all', or the name of a grouping column, e.g. col_product or col_workflow


if balance_group == 'all':
    if col_weight in (data.columns):
        data.drop(columns=col_weight, inplace=True)
        print('column ' + col_weight + ' dropped')
    data = make_balanced(data, col_treatment, col_weight)
    
else:
    balance_group_lst = data[balance_group].unique()
    data_product = list()
    for i in range(len(balance_group_lst)):
        data_product.append(make_balanced(data, col_treatment, 'WEIGHT_BY_'+ balance_group, group_name=balance_group, group=balance_group_lst[i]))


# Export Datasets for the Main Python Scoring Workflow (PSW)

Here, you can create various types of datasets ready to use in the PSW: 
- Distinct workflows' datasets for two-model segmentation
- Distinct products' datasets for one-model transformed segmentation for distinct products (e.g. POS, CARDs, CLX, ...)
- Combination of workflows and products for two-model segmentation for distinct products

## One Workflow for One-model with Transformed Target Z

In [None]:
savepath = os.path.join(output_folder,"datasets",'dataset_z.csv')
data.to_csv(savepath, encoding='utf-8', index=False)

## Distinct Workflows' Datasets for Two-model Segmentation

In [None]:
for i in range(len(workflows)):
    savepath = os.path.join(output_folder,"datasets",'dataset_' + workflows[i] + '.csv')
    data[data[col_workflow] == workflows[i]].to_csv(savepath, encoding='utf-8', index=False)

# data.to_csv('prep2_df_all.csv', encoding='utf-8', index=False)

## Distinct Products' Datasets for One-model Segmentation

In [None]:
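for i in range(len(products)):
    savepath = os.path.join(output_folder,"datasets",'dataset_z_' + products[i] + '.csv')
    # data_product exists only when balance_group == col_product; otherwise filter data directly
    subset = data_product[i] if balance_group == col_product else data[data[col_product] == products[i]]
    subset.to_csv(savepath, encoding='utf-8', index=False)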

## Combinations

In [None]:
for i in range (len(products)):
    for j in range (0,len(workflows)):
        savepath = os.path.join(output_folder,"datasets",'dataset_' + products[i] + '_' + workflows[j] + '.csv')
        data[(data[col_product] == products[i]) & (data[col_workflow] == workflows[j])]\
        .to_csv(savepath, encoding='utf-8', index=False)
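
As an alternative to the nested loops, the combinations can also be exported with a single `groupby`. This is a sketch on toy data written to a temporary folder, not the workflow's own export; unlike the nested loops, it produces files only for combinations that actually occur in the data.

```python
import os
import tempfile

import pandas as pd

# Toy data with two products and two workflows (POS/LOW does not occur)
toy = pd.DataFrame({
    'TYPEOFCREDIT': ['CASH', 'CASH', 'POS'],
    'PROCESS_NAME': ['HIGH', 'LOW', 'HIGH'],
    'X': [1, 2, 3],
})

out_dir = tempfile.mkdtemp()  # temporary folder used only for this illustration
written = []

# One CSV file per (product, workflow) combination present in the data
for (product, workflow), part in toy.groupby(['TYPEOFCREDIT', 'PROCESS_NAME']):
    path = os.path.join(out_dir, f'dataset_{product}_{workflow}.csv')
    part.to_csv(path, encoding='utf-8', index=False)
    written.append(os.path.basename(path))

print(written)
```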