# Home Credit - Feature Engineering

In the previous EDA notebook, we:
1. Took a detailed look at the features.
2. Examined their relationship with the target variable.

Based on that analysis and common-sense domain knowledge, we will create additional features that might be useful in predicting credit default. Some new features were already created during Data Wrangling, when we combined the Bureau data and the Home Credit historical data with the current application data. This notebook focuses on adding more features based on the EDA and the current application data.

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from matplotlib.ticker import PercentFormatter
import seaborn as sns

from scipy import stats

%matplotlib inline
%precision %.2f

plt.style.use('bmh')
pd.set_option('display.max_rows', 30)
pd.set_option('display.min_rows', 10)
pd.set_option('display.max_columns', 100)

In [2]:
train_path = '../data/interim/df_train_dimR.csv'
test_path = '../data/interim/df_test_dimR.csv'
dtype_path = '../data/interim/data_types.csv'

In [3]:
df_train = pd.read_csv(train_path,index_col=0)
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 215249 entries, 0 to 215248
Columns: 177 entries, sk_id_curr to bc_cnt_cr_status_others
dtypes: float64(109), int64(41), object(27)
memory usage: 292.3+ MB


Getting the data types of the variables.

In [4]:
df_dtype = pd.read_csv(dtype_path,index_col=0)
dict_dtype = df_dtype.dtype.to_dict()

Converting the dataset to these data types, which were chosen during EDA, to reduce memory usage.

In [5]:
df_train = df_train.astype(dict_dtype)
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 215249 entries, 0 to 215248
Columns: 177 entries, sk_id_curr to bc_cnt_cr_status_others
dtypes: float16(90), float64(19), int64(6), int8(34), object(28)
memory usage: 132.6+ MB
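
The effect of the conversion above can be illustrated on a toy frame. This is a minimal sketch, not the notebook's actual dtype mapping: the column `ext_score`, the values, and the `toy_dtypes` dictionary are assumptions made for illustration. It also shows why not every float column can go to `float16`, whose largest finite value is 65504.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the application data; names and values are illustrative only.
toy = pd.DataFrame({
    'flag_document_3': np.tile([0, 1], 500),                  # binary flag
    'ext_score': np.linspace(0.0, 1.0, 1000),                 # hypothetical small-range float
    'amt_credit': np.linspace(45_000.0, 4_000_000.0, 1000),   # large amounts
})

# Hypothetical dtype mapping, in the same shape as dict_dtype.
toy_dtypes = {'flag_document_3': 'int8', 'ext_score': 'float16', 'amt_credit': 'float64'}

mem_before = toy.memory_usage(deep=True).sum()
toy_small = toy.astype(toy_dtypes)
mem_after = toy_small.memory_usage(deep=True).sum()
print(f'{mem_before} bytes -> {mem_after} bytes')

# float16 overflows to inf above its maximum, so large amounts must stay wider.
float16_max = float(np.finfo(np.float16).max)
fits_float16 = toy[['ext_score', 'amt_credit']].abs().max() <= float16_max
print(fits_float16)
```

A check like `fits_float16` is one way to decide which columns are safe to downcast before building the dtype dictionary.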


Now we will create the additional features.

### Application details based features

In [6]:
# listing all the indicators for various document availability
# (flag_document_2 through flag_document_21)
document_features = [f'flag_document_{i}' for i in range(2, 22)]

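def crt_appliation_features(df):
    """
    Create new features based on application details.
    df - train/test dataset (modified in place and returned)
    """

    # credit amount and annuity relative to the applicant's income
    df['rt_credit_income'] = df.amt_credit/df.amt_income_total
    df['rt_annuity_income'] = df.amt_annuity/df.amt_income_total
    # annuity relative to the credit amount
    df['rt_annuity_credit'] = df.amt_annuity/df.amt_credit
    # goods price relative to the credit amount
    df['rt_goods_price_credit'] = df.amt_goods_price/df.amt_credit
    # number of document flags set for the application
    df['total_document_flags'] = df[document_features].sum(axis=1)

    return df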

In [7]:
df_train_fe = df_train.copy()
df_train_fe = crt_appliation_features(df_train_fe)
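
The ratio features divide by `amt_income_total` and `amt_credit`. If a denominator is ever zero, pandas returns `inf` rather than raising an error. Whether zeros occur in this dataset is not checked here, so the following is only a sketch of a guard. The helper `safe_ratio` is hypothetical and not part of the notebook, and the toy values are made up. Casting to `float64` first is a precaution in case the inputs were downcast to a narrower float type.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Divide two Series in float64, giving NaN (not inf) where the denominator is zero."""
    num = numerator.astype('float64')
    den = denominator.astype('float64').replace(0, np.nan)
    return num / den

# Toy data with one zero income to show the behaviour.
toy = pd.DataFrame({
    'amt_credit': [100000.0, 250000.0, 50000.0],
    'amt_income_total': [50000.0, 0.0, 25000.0],
})
toy['rt_credit_income'] = safe_ratio(toy.amt_credit, toy.amt_income_total)
print(toy)
```

Leaving the undefined ratio as `NaN` keeps it in line with other missing values, which can then be imputed or handled by the model in one place.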

### Applicant's details based features

In [8]:
def crt_applicant_features(df):
    """
    Create new features based on applicant's details.
    df - train/test dataset (modified in place and returned)
    """

    # employment length, ID age and phone age relative to the applicant's age
    # (all of these columns are day counts)
    df['rt_days_employed_birth'] = df.days_employed/df.days_birth
    df['rt_days_id_birth'] = df.days_id_publish/df.days_birth
    df['rt_phone_changed_birth'] = df.days_last_phone_change/df.days_birth
    # income and credit per family member
    df['avg_family_income'] = df.amt_income_total/df.cnt_fam_members
    df['avg_family_credit'] = df.amt_credit/df.cnt_fam_members
    # number of contact channels flagged for the applicant
    df['total_contact_flags'] = df.flag_mobil + df.flag_work_phone + df.flag_cont_mobile + df.flag_phone + df.flag_email

    return df

In [9]:
df_train_fe = crt_applicant_features(df_train_fe)

In [10]:
df_train_fe.shape

(215249, 188)
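
The shape is consistent with the steps so far: 177 original columns plus 5 application features and 6 applicant features gives 188. Beyond the column count, it can be worth checking how many missing or infinite values the new ratio columns contain. The helper below is a hypothetical sketch run on made-up values, not output from the real data.

```python
import numpy as np
import pandas as pd

def count_non_finite(df, columns):
    """Count NaN and +/-inf values per column."""
    values = df[columns].astype('float64')
    return pd.DataFrame({
        'n_nan': values.isna().sum(),
        'n_inf': np.isinf(values).sum(),
    })

# Toy frame standing in for two of the engineered columns.
toy = pd.DataFrame({
    'avg_family_income': [50000.0, np.inf, 30000.0],
    'rt_days_employed_birth': [0.1, np.nan, np.nan],
})
report = count_non_finite(toy, ['avg_family_income', 'rt_days_employed_birth'])
print(report)
```

On the real data the same call could be made with the list of the 11 new column names.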

Repeating the same steps for the test data.

In [11]:
df_test = pd.read_csv(test_path,index_col=0)
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 92250 entries, 0 to 92249
Columns: 177 entries, sk_id_curr to bc_cnt_cr_status_others
dtypes: float64(109), int64(41), object(27)
memory usage: 125.3+ MB


In [12]:
df_test = df_test.astype(dict_dtype)

df_test_fe = df_test.copy()
df_test_fe = crt_appliation_features(df_test_fe)
df_test_fe = crt_applicant_features(df_test_fe)

In [13]:
(df_test_fe.dtypes == df_train_fe.dtypes).sum()

188
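
All 188 dtypes match between train and test. When such a check fails, a count alone does not say which columns differ. The sketch below shows one way to list them. The function `schema_mismatches` and the toy frames are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def schema_mismatches(train, test):
    """Return the columns whose presence or dtype differs between two frames."""
    dtypes = pd.concat(
        [train.dtypes.astype(str).rename('train'),
         test.dtypes.astype(str).rename('test')],
        axis=1,
    )
    # A column missing from one frame shows up as NaN and counts as a mismatch.
    return dtypes[dtypes['train'].ne(dtypes['test'])]

toy_train = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})
toy_test = pd.DataFrame({'a': [1.0, 2.0], 'b': [0.5, 1.5], 'c': ['x', 'y']})

mismatches = schema_mismatches(toy_train, toy_test)
print(mismatches)
```

An empty result means both frames have the same columns with the same dtypes.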

In [14]:
train_path = '../data/interim/df_train_fe.csv'
df_train_fe.to_csv(train_path)

test_path = '../data/interim/df_test_fe.csv'
df_test_fe.to_csv(test_path)


In [15]:
df_data_types = pd.DataFrame(df_train_fe.dtypes, columns=['dtype'])

dtype_path = '../data/interim/data_types_fe.csv'
df_data_types.to_csv(dtype_path)
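
CSV files do not store dtypes, which is why the dtype table is saved alongside the data. The round trip can be sketched as follows, using an in-memory buffer instead of a file. The toy columns and dtypes are illustrative assumptions. The reload step mirrors the `read_csv` and `to_dict` calls used at the top of this notebook.

```python
import io
import pandas as pd

# Toy frame with compact dtypes.
raw = {'flag': [0, 1, 1], 'ratio': [0.5, 1.5, 2.5]}
toy = pd.DataFrame(raw).astype({'flag': 'int8', 'ratio': 'float32'})

# Save the dtypes as a one-column table, as done for data_types_fe.csv.
buffer = io.StringIO()
toy.dtypes.astype(str).to_frame('dtype').to_csv(buffer)
buffer.seek(0)

# Reload the table into a {column: dtype} dictionary and reapply it.
restored = pd.read_csv(buffer, index_col=0)['dtype'].to_dict()
reloaded = pd.DataFrame(raw).astype(restored)
print(restored)
print(reloaded.dtypes)
```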