We add target-encoded columns for the following features:
<ul>
    <li> PAY_1 - PAY_6 </li>
    <li> SEX </li>
    <li> EDUCATION </li>
    <li> MARRIAGE </li>
    <li> AGE </li>
    <li> AGE_BY10 </li>
    </ul>

## Importing packages and data

In [70]:
import pandas as pd
import numpy as np

In [76]:
from sklearn.model_selection import train_test_split

In [71]:
data=pd.read_csv('clean_data_2.csv')

## Splitting the data

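We need to split the data before we compute the target encoding values, so that the encodings are learned from the training set only and the test labels do not leak into the features.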

In [78]:
train, test = train_test_split(data,
                               test_size=0.1,
                               stratify=data['Y'],
                               shuffle=True,
                               random_state=123)
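A quick way to see what `stratify` does is to compare the share of Y=1 in the two parts of the split. The sketch below uses a made-up toy frame (`toy`, with invented columns `X` and `Y`), not the credit data; the same check could be run on `train` and `test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up toy data: 100 rows, 20% of them with Y=1 (not the credit data)
toy = pd.DataFrame({'X': range(100), 'Y': [1] * 20 + [0] * 80})

toy_train, toy_test = train_test_split(toy,
                                       test_size=0.1,
                                       stratify=toy['Y'],
                                       shuffle=True,
                                       random_state=123)

# With stratify, the share of Y=1 should be about the same in both parts
print(len(toy_train), toy_train['Y'].mean())
print(len(toy_test), toy_test['Y'].mean())
```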

## Helper function for target encoding

Since I just learned what this is, I want to practice coding it from scratch instead of using `from category_encoders import TargetEncoder`.

Target encoding replaces each categorical value with the proportion of items in that category that have Y=1: (items in the category with Y=1) / (items in the category).
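To make the definition concrete, here is a minimal sketch on a made-up toy sample. The category labels and targets are invented for illustration and are not from the credit data. For each category it counts the rows with Y=1 and divides by the size of the category.

```python
from collections import Counter

# Hypothetical toy sample of (category, Y) pairs, invented for illustration
rows = [('a', 1), ('a', 0), ('a', 1), ('a', 0),
        ('b', 0), ('b', 0), ('b', 1),
        ('c', 0)]

totals = Counter(cat for cat, _ in rows)           # items in each category
ones = Counter(cat for cat, y in rows if y == 1)   # items in each category with Y=1

# Encoded value = (items in the category with Y=1) / (items in the category)
encoding = {cat: ones[cat] / totals[cat] for cat in totals}
print(encoding)
```

Category `c` has no rows with Y=1, so it is encoded as 0.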

In [109]:
def te(tr, df, column, y):
    '''Target-encode df[column] and return the encoded values as a list.

    column is the name of the column to encode (as text) and
    y is the name of the output 'Y' column.
    Every row of the dataframe df gets a value, but only the training
    set tr is used to compute the target encoded values.
    Categories with no Y=1 rows in tr, or missing from tr entirely,
    are printed and encoded as 0.
    '''
    # Compute the counts once instead of once per row
    ones = tr.groupby(y)[column].value_counts()[1]  # rows with Y=1 per category
    totals = tr[column].value_counts()              # all rows per category

    answer = []

    for value in df[column]:
        if value in totals.index:
            if value in ones.index:
                answer.append(ones[value] / totals[value])
            else:
                # category seen in training, but never with Y=1
                print(value)
                answer.append(0)
        else:
            # category not seen in training at all
            print('HERE')
            print(value)
            answer.append(0)

    return answer
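For comparison, the same encoding can probably be written without an explicit loop. The sketch below is not the helper above but a hypothetical alternative, `te_vectorized`, which assumes the target column only contains 0 and 1. In that case the mean of the target within a category equals the proportion of rows with Y=1. The toy frames are made up for illustration.

```python
import pandas as pd

def te_vectorized(tr, df, column, y):
    """Hypothetical loop-free variant: encode df[column] using rates from tr."""
    # For a 0/1 target, the per-category mean is (rows with Y=1) / (rows in category)
    rates = tr.groupby(column)[y].mean()
    # Categories never seen in tr map to NaN; fill with 0, as the loop version does
    return df[column].map(rates).fillna(0).tolist()

# Made-up toy frames, not the credit data
toy_train = pd.DataFrame({'PAY': [-1, -1, 0, 0, 0, 2],
                          'Y':   [1, 0, 0, 0, 1, 1]})
toy_all = pd.DataFrame({'PAY': [-1, 0, 2, 8]})

result = te_vectorized(toy_train, toy_all, 'PAY', 'Y')
print(result)
```

The value 8 never appears in `toy_train`, so it is encoded as 0. Unlike the loop version, this variant does not print the categories that fall back to 0.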

In [98]:
data['AGE'].iloc[0]

24

In [104]:
train.groupby('Y')['PAY_1'].value_counts()[1][-1]

840

In [105]:
train['PAY_1'].value_counts()[-1]

5101

In [106]:
840/5101

0.16467359341305626

## Columns for PAY_1 - PAY_6

In [115]:
pay_cols= ['PAY_'+str(i) for i in range(1,7)]
pay_cols

['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

In [116]:
for column in pay_cols:
    new_col= column+'_TE'
    data[new_col]=te(train, data, column,'Y')

8


## The rest of the features

In [117]:
features = ['SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'AGE_BY10']

In [118]:
for column in features:
    new_col= column+'_TE'
    data[new_col]=te(train, data, column,'Y')

71
71
68
68
68
68
79
68
71
74


## Saving the dataframe

In [119]:
data.to_csv("clean_data_3.csv", index=False)

In [120]:
data.head()

| Unnamed: 0 | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | ... | PAY_2_TE | PAY_3_TE | PAY_4_TE | PAY_5_TE | PAY_6_TE | SEX_TE | EDUCATION_TE | MARRIAGE_TE | AGE_TE | AGE_BY10_TE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.559584 | 0.153285 | 0.157823 | 0.195856 | 0.197815 | 0.206933 | 0.238835 | 0.233864 | 0.264414 | 0.230823 |
| 1 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 0.559584 | 0.173907 | 0.182616 | 0.188242 | 0.505793 | 0.206933 | 0.238835 | 0.209881 | 0.204425 | 0.230823 |
| 2 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 0.157917 | 0.173907 | 0.182616 | 0.188242 | 0.188377 | 0.206933 | 0.238835 | 0.209881 | 0.192493 | 0.200278 |
| 3 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 0.157917 | 0.173907 | 0.182616 | 0.188242 | 0.188377 | 0.206933 | 0.238835 | 0.233864 | 0.204104 | 0.200278 |
| 4 | 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 0.157917 | 0.153285 | 0.182616 | 0.188242 | 0.188377 | 0.2428 | 0.238835 | 0.233864 | 0.212963 | 0.250474 |


In [123]:
data['PAY_1_TE'].head()

0    0.689482
1    0.164674
2    0.128358
3    0.128358
4    0.164674
Name: PAY_1_TE, dtype: float64