**Description**

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

![](https://storage.googleapis.com/kaggle-media/competitions/home-credit/about-us-home-credit.jpg)

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

**Evaluation**

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
* => Using past loan data, predict whether a loan applicant will be able to repay the loan.
* => Standard Supervised Classification Task.
* => Supervised: The labels are included in the training data and the goal is to train a model to learn to predict the labels from the features.
* => Classification: The label is a binary variable, 0 (will repay loan on time), 1 (will have difficulty repaying loan).

**Data**

![](https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png)

* application_{train|test}.csv

This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET).

=> TARGET : 0 means the loan was repaid, 1 means the loan was not repaid.

=> Contains the information for each loan application. Every loan has its own row and is identified by SK_ID_CURR.

Static data for all applications. One row represents one loan in our data sample.

* bureau.csv

All client's previous credits provided by other financial institutions that were reported to Credit Bureau (for clients who have a loan in our sample). 

=> Previous credits held at other financial institutions. Each previous credit has its own row, but one loan in the application data can have multiple previous credits.

For every loan in our sample, there are as many rows as number of credits the client had in Credit Bureau before the application date. => That is, one row for every credit the client held before applying.

* bureau_balance.csv

Monthly balances of previous credits in Credit Bureau. => Monthly data about the previous credits. Each row is one month of a previous credit, so a single previous credit can have multiple rows, one for each month of its length.

This table has one row for each month of history of every previous credit reported to Credit Bureau – i.e the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows. => One row per month.

* POS_CASH_balance.csv

Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit. 

=> Monthly data about previous point of sale or cash loans.

=> Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.

This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows. 

* credit_card_balance.csv

Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.

=> Monthly data about previous credit cards. Each row is one month of a credit card balance, and a single credit card can have many rows.

This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows. 

* previous_application.csv

All previous applications for Home Credit loans of clients who have loans in our sample.

There is one row for each previous application related to loans in our data sample.

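=> Previous applications from clients who have loans in the application data.

=> Each current loan can have multiple previous loans.

=> Each previous application has one row and is identified by SK_ID_PREV.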

* installments_payments.csv

Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
=> Repayment history of previous loans.

=> Payment history for previous loans at Home Credit. Every payment made has one row, and every missed payment also has one row.

There is a) one row for every payment that was made plus b) one row each for missed payment.

One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.

* HomeCredit_columns_description.csv

This file contains descriptions for the columns in the various data files.

=> Definitions of all the columns.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

**Metric : ROC AUC**

* ROC AUC is the metric used to judge the 0/1 predictions: Receiver Operating Characteristic Area Under the Curve ( ROC AUC, also sometimes called AUROC )
* ( https://www.youtube.com/watch?v=3llmZMHHL_8 model evaluation, binary classification, precision recall, ROC )

* common classification metric : ROC AUC ( AUROC ), Receiver Operating Characteristic Area Under the Curve

* Two measures, sensitivity and specificity, are commonly used to evaluate the accuracy of a test that predicts a binary outcome

ex) How well the test separates people who have a given health condition from those who do not

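Sensitivity

- The share of actual 1 cases that are predicted as 1

Specificity

- The share of actual 0 cases that are predicted as 0

True Positive Rate (TPR)

- TPR = sensitivity = 1 - false negative rate, true accept rate

- The proportion of actual 1 cases correctly predicted as 1

- ex) A cancer patient is examined and diagnosed with cancer

False Positive Rate (FPR)

- FPR = 1 - specificity, false accept rate

- The proportion of actual 0 cases wrongly predicted as 1

- ex) A person without cancer is diagnosed with cancer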


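TPR and FPR rise and fall together as the decision threshold changes, so gaining sensitivity usually costs specificity

- When diagnosing cancer, a hasty doctor will say it looks like cancer at the slightest sign.

- In this case TPR approaches 1, but FPR also rises toward 1 (healthy people are told they have cancer too).

- Conversely, a quack who cannot recognize cancer will tell every patient that they do not have it.

- In this case TPR drops toward 0, and FPR drops toward 0 as well (no one is diagnosed with cancer, so no one is wrongly diagnosed with it).

So TPR and FPR both have to be measured while continuously varying the threshold (the point at which we predict 1).

In the end, performance has to be judged across many combinations of TPR and FPR, and the ROC curve is what shows them all at a glance.

The ROC curve plots these values so that it is easier to decide which point to use as the threshold.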

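Area under the ROC curve (the Area Under a ROC Curve; AUC; AUROC)

- Both axes of the ROC curve range over [0,1], and the curve runs from (0,0) to (1,1)

- The closer the area under the curve is to 1 (that is, the closer the curve gets to the top-left corner), the better the performance

- The area (AUC) can range from 0 to 1, but a useful model falls between 0.5 and 1 (0.5 is no better than random guessing, 1 is the best possible performance)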

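Interpreting AUC

- If the model predicts 1 too readily, sensitivity goes up, but because nearly every case is called 1, specificity goes down.

- So a model is only meaningful when both sensitivity and specificity are close to 1

- That is why the ROC curve puts 1 - specificity on the X axis and sensitivity on the Y axis.

- The point x=0, y=1 is then the optimal performance, and moving right along the curve shows how quickly sensitivity increases relative to how quickly specificity decreases.

- AUC summarizes the overall trade-off between sensitivity and specificity, which makes it a very convenient performance measure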

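When AUC = 0.5

- Sensitivity rises only as much as specificity falls, so there is no threshold that raises both at the same time

- Sensitivity is 0 when specificity is 1, and sensitivity is 1 when specificity is 0: an exact trade-off in which the two values always sum to 1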
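
The threshold behaviour described above can be checked on a few made-up scores. This is only a sketch: the labels, the scores and the helper `tpr_fpr` are illustrative assumptions, not competition data. It computes TPR and FPR by hand at two extreme thresholds and then asks scikit-learn for the AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and scores (illustrative values only)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

def tpr_fpr(y_true, y_score, threshold):
    """Return (TPR, FPR) when predicting 1 for every score >= threshold."""
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return float(tp / (tp + fn)), float(fp / (fp + tn))

# A very low threshold flags everyone: TPR and FPR are both 1
low = tpr_fpr(y_true, y_score, 0.0)

# A very high threshold flags no one: TPR and FPR are both 0
high = tpr_fpr(y_true, y_score, 1.0)

# The ROC curve sweeps the threshold; AUC is the area under that curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(low, high, auc)
```

Here three of the four (positive, negative) pairs are ranked correctly, which is why the AUC is 0.75.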


**It helps to know the following material to solve this problem**
* Manual Feature Engineering Part One ( https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering )
* Manual Feature Engineering Part Two ( https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering-p2 )
* Introduction to Automated Feature Engineering ( https://www.kaggle.com/willkoehrsen/automated-feature-engineering-basics )
* Advanced Automated Feature Engineering ( https://www.kaggle.com/willkoehrsen/tuning-automated-feature-engineering-exploratory )
* Feature Selection ( https://www.kaggle.com/willkoehrsen/introduction-to-feature-selection )
* Intro to Model Tuning: Grid and Random Search ( https://www.kaggle.com/willkoehrsen/intro-to-model-tuning-grid-and-random-search )
* Automated Model Tuning ( https://www.kaggle.com/willkoehrsen/automated-model-tuning )
* Model Tuning Results ( https://www.kaggle.com/willkoehrsen/model-tuning-results-random-vs-bayesian-opt/notebook )

* Import

In [None]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder

import os

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns

**Read in Data**

In [None]:
print(os.listdir("/kaggle/input/home-credit-default-risk/"))

In [None]:
# Training Data
app_train = pd.read_csv('/kaggle/input/home-credit-default-risk/application_train.csv')
print('Training data shape: ', app_train.shape)
app_train.head()

* => There are 122 columns ( including TARGET ).

In [None]:
# Testing data features
app_test = pd.read_csv('/kaggle/input/home-credit-default-risk/application_test.csv')
print('Testing data shape: ', app_test.shape)
app_test.head()

* => There is no TARGET column, so there are 121 features in total.

**Exploratory Data Analysis**

* Examine the Distribution of the Target Column
* => TARGET(0) : the loan was repaid on time
* => TARGET(1) : the client had payment difficulties

In [None]:
app_train['TARGET'].value_counts()

In [None]:
app_train['TARGET'].astype(int).plot.hist();

* => The chart shows that this is an imbalanced class problem.
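
Why imbalance matters can be sketched with toy labels. The 90/10 split below is an assumption for illustration, not the ratio in `app_train`. Always guessing the majority class already looks accurate, which is one reason this competition scores with ROC AUC rather than accuracy.

```python
import pandas as pd

# Toy labels with a 90/10 split (illustrative, not the competition's actual ratio)
target = pd.Series([0] * 9 + [1], name='TARGET')

# Share of each class
class_share = target.value_counts(normalize=True)
print(class_share)

# Accuracy obtained by always predicting the majority class
majority_class = class_share.idxmax()
majority_accuracy = (target == majority_class).mean()
print(majority_accuracy)
```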

* Examine Missing Values

In [None]:
# Total number of rows? 307,511
print(len(app_train))

In [None]:
def missing_values_table(df):
    # Total missing values per column
    mis_val = df.isnull().sum()
    
    # Percentage of missing values per column
    mis_val_percent = 100 * mis_val / len(df)
    
    # Make a table with the results
    # axis=1 : place mis_val and mis_val_percent side by side as two columns
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(columns = {0 : 'Missing Values', 1 : '% of Total Values'})
    
    # Keep only columns that have missing values, sorted by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[mis_val_table_ren_columns.iloc[:,1] != 0].sort_values('% of Total Values', ascending=False).round(1)
    
    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    
    # Return the dataframe with missing information
    return mis_val_table_ren_columns

# Missing values statistics
missing_values = missing_values_table(app_train)
missing_values.head(20)

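* => 1) Fill in the NaN values ( imputation )
* => 2) Leave them unfilled and use a model such as XGBoost that can handle missing values ( https://stats.stackexchange.com/questions/235489/xgboost-can-handle-missing-data-in-the-forecasting-phase )
* => 3) Drop columns with a high percentage of NaN
* => Even with many NaN values, we cannot know ahead of time whether these columns will help, so keep them for now.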
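
Options 1) and 3) can be sketched on a small made-up frame. The column names and the 50% cut-off are assumptions for illustration; this notebook itself keeps all columns.

```python
import numpy as np
import pandas as pd

# Toy frame; column names are made up for illustration
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})

# Option 3: drop columns whose share of missing values exceeds a threshold
max_missing = 0.5  # assumed cut-off, tune for the real data
missing_share = df.isnull().mean()
kept = df.loc[:, missing_share <= max_missing]

# Option 1: fill the remaining gaps with each column's median
filled = kept.fillna(kept.median())
print(filled)
```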

* Column Types
* => int64, float64 ; numeric variables
* => object : strings, categorical features

In [None]:
# Number of each type of column
app_train.dtypes.value_counts()

In [None]:
# Number of unique classes in each object column
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

* => We need to find a way to deal with these categorical variables.

* Encoding Categorical Variables
* => More than 2 categories : One-Hot Encoding
* => 2 categories : Label Encoding
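
The difference between the two encodings can be seen on a toy frame. This is a minimal sketch; the column names `FLAG` and `TYPE` are hypothetical and stand in for a two-category and a three-category column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    'FLAG': ['Y', 'N', 'Y', 'N'],            # two categories
    'TYPE': ['cash', 'card', 'pos', 'cash'], # three categories
})

# Label encoding: the column is replaced by integers, no columns are added
le = LabelEncoder()
toy['FLAG'] = le.fit_transform(toy['FLAG'])

# One-hot encoding: each remaining category becomes its own column
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

Label encoding a column with more than two categories would impose an arbitrary ordering on them, which is why one-hot encoding is used there.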

In [None]:
# Check the data before label encoding: 122 columns.
app_train.shape

In [None]:
for col in app_train:
    if app_train[col].dtypes == 'object':
        print(app_train[col])

In [None]:
# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(app_train[col].unique())) <= 2:
            print(app_train[col])
            # Train on the training data
            le.fit(app_train[col])
            
            # Transform both training and testing data
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)

In [None]:
app_train.tail(10)

* => This is label encoding, so no columns are added.

In [None]:
# one-hot encoding of categorical variables
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

* => For the training data, the number of columns grows from 122 to 243.

* Aligning Training and Testing Data
* => One-hot encoding leaves the training and testing data with different columns, so align the two dataframes on their columns ( axis = 1 )

In [None]:
train_labels = app_train['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)

# Add the target back in
app_train['TARGET'] = train_labels

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

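* => The data is aligned with an inner join ( the intersection of the columns of both DataFrames ).
* => Columns that exist in only one of the dataframes after one-hot encoding are dropped ( the TARGET column is dropped too ).
* => So the 'TARGET' column is added back to the training data.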
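
What the inner alignment does can be sketched with two tiny frames. The column names below are made up; `cat_B` plays the role of a one-hot column that appears only in the training data.

```python
import pandas as pd

train = pd.DataFrame({'x': [1, 2], 'cat_A': [1, 0], 'cat_B': [0, 1], 'TARGET': [0, 1]})
test = pd.DataFrame({'x': [3], 'cat_A': [1]})

# Save the labels, because TARGET is not in the test data and will be dropped
labels = train['TARGET']

# Keep only the columns present in both frames
train_aligned, test_aligned = train.align(test, join='inner', axis=1)

# Add the target back to the training data
train_aligned['TARGET'] = labels
print(train_aligned.columns.tolist(), test_aligned.columns.tolist())
```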

**Back to Exploratory Data Analysis**

* Anomalies
* => Track down odd values that turn up during EDA

In [None]:
# The days are negative because they are recorded relative to the current loan application, so dates in the past are negative
app_train['DAYS_BIRTH'].describe()

In [None]:
# The day counts are large, so convert them to years
( app_train['DAYS_BIRTH'] / 365).describe()

In [None]:
# Flip the sign to show positive ages
( app_train['DAYS_BIRTH'] / 365 * -1).describe()

* => No obviously anomalous values.

In [None]:
app_train['DAYS_EMPLOYED'].describe()

* => max 365243 days / 365 days => about 1,000 years?

In [None]:
# DAYS_EMPLOYED : How many days before the application the person started current employment
app_train['DAYS_EMPLOYED'].plot.hist(title = 'Days Employment Histogram');
plt.xlabel('Days Employment');

* => The abnormal values all sit at 365,243.
* => Divided by 365 days, that is about 1,000 years => treat it as anomalous data.

In [None]:
anom = app_train[app_train['DAYS_EMPLOYED'] == 365243]
non_anom = app_train[app_train['DAYS_EMPLOYED'] != 365243]
print('The non-anomalies default on %0.2f%% of loans' % (100 * non_anom['TARGET'].mean()))
print('The anomalies default on %0.2f%% of loans' % (100 * anom['TARGET'].mean()))
print('There are %d anomalous days of employment' % len(anom))

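* => The anomalous value is confirmed, so decide how to fill it
* => If all of these loans share something in common, they can be filled with the same value.
* => Replace the anomalous values with np.nan and create a new bool column that records whether the value was anomalous ( True if 365243, otherwise False )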

In [None]:
# Create an anomalous flag column
app_train['DAYS_EMPLOYED_ANOM'] = app_train['DAYS_EMPLOYED'] == 365243

# Replace the anomalous values with nan ( assign the result back instead of calling replace in place on a column selection )
app_train['DAYS_EMPLOYED'] = app_train['DAYS_EMPLOYED'].replace({365243: np.nan})

app_train['DAYS_EMPLOYED'].plot.hist(title = 'Days Employment Histogram');
plt.xlabel('Days Employment');

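* => The distribution now looks reasonable. The values were negative all along; in the previous histogram the 365243 anomaly stretched the x-axis to 0 ~ 365243 and hid them.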

In [None]:
# Apply the same fix to the app_test DataFrame
# If the anomalous values may carry information, first create a boolean column that flags them.
# Then replace the anomalous values in the original column with NaN.
app_test['DAYS_EMPLOYED_ANOM'] = app_test['DAYS_EMPLOYED'] == 365243
app_test['DAYS_EMPLOYED'] = app_test['DAYS_EMPLOYED'].replace({365243: np.nan})

print('There are %d anomalies in the test data out of %d entries' % (app_test['DAYS_EMPLOYED_ANOM'].sum(), len(app_test)))

**Correlations**

* The correlation coefficient is not the greatest method to represent "relevance" of a feature.
* but it does give us an idea of possible relationships within the data.
* http://www.statstutor.ac.uk/resources/uploaded/pearsons.pdf is worth reading for how to interpret the values.
* .00 ~ .19 "very weak"
* .20 ~ .39 "weak"
* .40 ~ .59 "moderate"
* .60 ~ .79 "strong"
* .80 ~ 1.0 "very strong"
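
The bands above can be turned into a small lookup helper. `correlation_strength` is a hypothetical name introduced for illustration; it applies the same cut-offs to the absolute value of the coefficient.

```python
def correlation_strength(r):
    """Map a correlation coefficient to the verbal bands listed above."""
    a = abs(r)
    if a < 0.20:
        return 'very weak'
    if a < 0.40:
        return 'weak'
    if a < 0.60:
        return 'moderate'
    if a < 0.80:
        return 'strong'
    return 'very strong'

# The sign gives the direction, the absolute value gives the strength
for r in (-0.15, 0.45, 1.0):
    print(r, correlation_strength(r))
```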

In [None]:
# Find correlations with the target and sort
correlations = app_train.corr()['TARGET'].sort_values()

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(5))

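* => Apart from TARGET itself, the DAYS_BIRTH feature has the most positive correlation
* => DAYS_BIRTH : Client's age in days at the time of application. The value is stored as a negative number, so the positive correlation means that as the client gets older, they are less likely to default. Taking the absolute value of the feature makes the correlation negative.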

**Effect of Age on Repayment**

In [None]:
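# Find the correlation of the positive days since birth and target
# The test data gets the same sign flip so that later features stay on the same scale in train and test
app_test['DAYS_BIRTH'] = abs(app_test['DAYS_BIRTH'])

app_train['DAYS_BIRTH'] = abs(app_train['DAYS_BIRTH'])
app_train['DAYS_BIRTH'].corr(app_train['TARGET'])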

* => As the client gets older, there is a negative linear relationship with TARGET
* => Older clients tend to repay their loans on time more often.

In [None]:
# Make a histogram of age

# Set the style of plots
plt.style.use('fivethirtyeight')

# Plot the distribution of ages in years
plt.hist(app_train['DAYS_BIRTH'] / 365, edgecolor = 'k', bins = 25)
plt.title('Age of Client')
plt.xlabel('Age (years)')
plt.ylabel('Count')

* => The age distribution shows no outliers.
* => Next, plot age as a kernel density estimation plot (KDE, a smoothed histogram), split by TARGET.

In [None]:
plt.figure(figsize = (10,8))

# KDE plot of loans that were repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, 'DAYS_BIRTH'] / 365, label = 'target == 0')

# KDE plot of loans which were not repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, 'DAYS_BIRTH'] / 365, label = 'target == 1')

# Labeling of plot
plt.xlabel('Age (years)')
plt.ylabel('Density')
plt.title('Distribution of Ages')

* => The target == 1 curve skews toward the younger end
* => The two distributions differ, so this feature is likely worth using in an ML model

* Another way to look at it
* => Average failure to repay by age bracket

In [None]:
# Age information into a separate dataframe ( copy so that adding columns does not touch app_train )
age_data = app_train[['TARGET', 'DAYS_BIRTH']].copy()
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365

# Bin the age data
# pd.cut : assigns each value to an interval defined by the bin edges
# bins : the edges that separate the categories
# np.linspace : 11 evenly spaced edges from 20 to 70, giving 10 bins of 5 years each
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_data.head(10)

In [None]:
# Group by the bin and calculate average
age_groups = age_data.groupby('YEARS_BINNED').mean()
age_groups

In [None]:
plt.figure(figsize = (8, 8))

# Graph the age bins and the average of the target as a bar plot
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])

# Plot labeling
plt.xticks(rotation = 75)
plt.xlabel('Age Group (years)')
plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group')

* => Younger clients are more likely to fail to repay their loans.

**Exterior Source**

* The 3 features EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3 have the strongest negative correlations with the target
* => "normalized score from external data source"
* => Possibly a cumulative credit rating built from many external data sources

In [None]:
# Extract the EXT_SOURCE variables and show correlations
ext_data = app_train[['TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
ext_data_corrs = ext_data.corr()
ext_data_corrs

In [None]:
# A heatmap is easier to read than raw numbers
plt.figure(figsize = (8,6))

# Heatmap of correlation
sns.heatmap(ext_data_corrs, cmap = plt.cm.RdYlBu_r, vmin = -0.25, annot = True, vmax = 0.6)
plt.title('Correlation Heatmap')

* => All of the EXT_SOURCE features have a negative correlation with the target
* => As the EXT_SOURCE value increases, the client is more likely to repay the loan.
* => DAYS_BIRTH is positively correlated with EXT_SOURCE_1, which suggests that client age may be one of the factors in this score.

* EXT_SOURCE features and TARGET : look at the distributions with kdeplot

In [None]:
plt.figure(figsize = (10, 12))

# iterate through the sources
for i, source in enumerate(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']):
    
    # create a new subplot for each source
    plt.subplot(3, 1, i + 1)
    
    # plot repaid loans
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, source], label = 'target == 0')
    
    # plot loans that were not repaid
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, source], label = 'target == 1')
    
    # Label the plots
    plt.title('Distribution of %s by Target Value ' % source)
    plt.xlabel('%s' % source)
    plt.ylabel('Density')
    
plt.tight_layout(h_pad = 2.5)

* => EXT_SOURCE_3 ; the greatest difference. 
* => The relationship is not strong, but it should still be useful for an ML model that predicts whether a loan will be repaid on time.

**Pairs Plot**

* Pairs plot of the EXT_SOURCE variables and the DAYS_BIRTH variable
* Shows the relationships between pairs of variables as well as the distribution of each single variable

In [None]:
# Copy the data for plotting
plot_data = ext_data.drop(columns = ['DAYS_BIRTH']).copy()

# Add in the age of the client in years
plot_data['YEARS_BIRTH'] = age_data['YEARS_BIRTH']

# Drop na values and limit to first 100000 rows
plot_data = plot_data.dropna().loc[:100000, :]

# Function to calculate correlation coefficient between two columns
def corr_func(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.2, .8), xycoords=ax.transAxes,
                size = 20)

# Create the pairgrid object ( `height` replaced the older `size` argument in seaborn 0.9 )
grid = sns.PairGrid(data = plot_data, height = 3, diag_sharey=False,
                    hue = 'TARGET', 
                    vars = [x for x in list(plot_data.columns) if x != 'TARGET'])

# Upper is a scatter plot
grid.map_upper(plt.scatter, alpha = 0.2)

# Diagonal is a kernel density estimate
grid.map_diag(sns.kdeplot)

# Bottom is density plot
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r);

plt.suptitle('Ext Source and Age Features Pairs Plot', size = 32, y = 1.05);

* => Red : loans that were not repaid
* => Blue : loans that were repaid
* => There is a moderate positive linear relationship between EXT_SOURCE_1 and DAYS_BIRTH ( or YEARS_BIRTH ) => this suggests the score may take the client's age into account

**Feature Engineering**

* Kaggle competitions are largely decided by feature engineering.
* The winning models tend to be some variant of gradient boosting.
* Feature engineering often yields better predictions than model building and hyperparameter tuning
* Andrew Ng "applied machine learning is basically feature engineering."

* A model can only learn from the data it is given
* Data Scientist : make sure this data is as relevant to the task as possible ( possibly with the help of automated tools )
* Of the many feature engineering methods, let's try Polynomial features and Domain knowledge features
* => The reason for choosing these two is not stated.

**Polynomial Features**

* => For an explanation of polynomial features as part of feature engineering, see : https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html 

In [None]:
# Make a new dataframe for polynomial features
poly_features = app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'TARGET']]
poly_features_test = app_test[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]

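# imputer for handling missing values ( strategy 'median' fills each missing value with that column's median )
# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer as Imputer
imputer = Imputer(strategy = 'median')  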

poly_target = poly_features['TARGET']

poly_features = poly_features.drop(columns = ['TARGET'])

# Need to impute missing values
poly_features = imputer.fit_transform(poly_features)
poly_features_test = imputer.transform(poly_features_test)

from sklearn.preprocessing import PolynomialFeatures

# Create the polynomial object with specified degree ( degree 3 : terms up to the third power, e.g. x^3 )
poly_transformer = PolynomialFeatures(degree = 3)

In [None]:
# Train the polynomial features
poly_transformer.fit(poly_features)

# Transform the features
poly_features = poly_transformer.transform(poly_features)
poly_features_test = poly_transformer.transform(poly_features_test)
print('Polynomial Features shape: ', poly_features.shape)

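* => Only 4 features went into poly_features,
* => but there are 35 features after the transform.
* => The names of the new features can be listed from the fitted transformer ( get_feature_names, or get_feature_names_out in newer scikit-learn )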
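
Where the 35 comes from: with n input features and degree d, the transformer outputs every monomial of total degree at most d, including the constant term, which gives C(n + d, d) columns. A quick check on random data (the array `X` is a stand-in, not the real features):

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 4, 3

# Number of monomials of total degree <= 3 in 4 variables, bias term included
expected = comb(n_features + degree, degree)

# Stand-in data with the same number of input columns
X = np.random.RandomState(0).rand(10, n_features)
poly = PolynomialFeatures(degree=degree)
n_out = poly.fit_transform(X).shape[1]

print(expected, n_out)
```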

In [None]:
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out is its replacement
poly_transformer.get_feature_names_out(input_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH'])[:15]

* => The new features, such as EXT_SOURCE_1^2, can be seen here

* Let's look at the correlation between these features and the TARGET

In [None]:
# Create a dataframe of the features ( get_feature_names_out replaces get_feature_names, removed in scikit-learn 1.2 )
poly_features = pd.DataFrame(poly_features, columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Add in the target
poly_features['TARGET'] = poly_target

# Find the correlations with the target
poly_corrs = poly_features.corr()['TARGET'].sort_values()

# Display most negative and most positive
print(poly_corrs.head(10))
print(poly_corrs.tail(5))

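* => A few of the features have correlations of around 0.15 in absolute value ( 0.0 ~ 0.19 is still very weak )
* => We need to decide whether to use the new features or the original ones


* Add the features created above to the training and testing data and evaluate the model, then evaluate without them as well.
* In ML, just try it out and see what works!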

In [None]:
# Put test features into dataframe ( get_feature_names_out replaces get_feature_names, removed in scikit-learn 1.2 )
poly_features_test = pd.DataFrame(poly_features_test, columns = poly_transformer.get_feature_names_out
                                  (['EXT_SOURCE_1', 'EXT_SOURCE_2',
                                   'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Merge polynomial features into training dataframe
poly_features['SK_ID_CURR'] = app_train['SK_ID_CURR']
app_train_poly = app_train.merge(poly_features, on = 'SK_ID_CURR', how = 'left')

# Merge polynomial features into testing dataframe
poly_features_test['SK_ID_CURR'] = app_test['SK_ID_CURR']
app_test_poly = app_test.merge(poly_features_test, on = 'SK_ID_CURR', how = 'left')

# Align the dataframes
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)

# Print out the new shapes
print('Training data with polynomial features shape : ', app_train_poly.shape)
print('Testing data with polynomial features shape : ', app_test_poly.shape)

* => The test data ( about 48,000 rows ) grows from the original 121 features to 275 features.

**Domain Knowledge Features**

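* Following https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features , let's look at 4 more features
* CREDIT_INCOME_PERCENT ; the credit amount as a percentage of the client's income
* ANNUITY_INCOME_PERCENT ; the loan annuity as a percentage of the client's income
* CREDIT_TERM : the length of the payment in months ( since the annuity is the monthly amount due ); note that the code computes AMT_ANNUITY / AMT_CREDIT, which is the inverse of that length
* DAYS_EMPLOYED_PERCENT ; the days employed as a percentage of the client's age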
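
The four ratios can also be wrapped in one helper so that the training and testing data go through exactly the same steps. This is a sketch: `add_domain_features` is a hypothetical name and the toy row uses made-up amounts.

```python
import pandas as pd

def add_domain_features(df):
    """Return a copy of df with the four domain-knowledge ratios added."""
    out = df.copy()
    out['CREDIT_INCOME_PERCENT'] = out['AMT_CREDIT'] / out['AMT_INCOME_TOTAL']
    out['ANNUITY_INCOME_PERCENT'] = out['AMT_ANNUITY'] / out['AMT_INCOME_TOTAL']
    out['CREDIT_TERM'] = out['AMT_ANNUITY'] / out['AMT_CREDIT']
    out['DAYS_EMPLOYED_PERCENT'] = out['DAYS_EMPLOYED'] / out['DAYS_BIRTH']
    return out

# One made-up applicant
toy = pd.DataFrame({
    'AMT_CREDIT': [200000.0],
    'AMT_INCOME_TOTAL': [100000.0],
    'AMT_ANNUITY': [10000.0],
    'DAYS_EMPLOYED': [-1000.0],
    'DAYS_BIRTH': [-10000.0],
})

toy_domain = add_domain_features(toy)
print(toy_domain.iloc[0])
```

DAYS_EMPLOYED_PERCENT only has a consistent sign when DAYS_EMPLOYED and DAYS_BIRTH use the same sign convention in both datasets.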

In [None]:
app_train_domain = app_train.copy()
app_test_domain = app_test.copy()

# AMT_CREDIT : Credit amount of the loan
# AMT_INCOME_TOTAL : Income of the client
app_train_domain['CREDIT_INCOME_PERCENT'] = app_train_domain['AMT_CREDIT'] / app_train_domain['AMT_INCOME_TOTAL']

# AMT_ANNUITY ; Loan annuity
# AMT_INCOME_TOTAL : Income of the client 
app_train_domain['ANNUITY_INCOME_PERCENT'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_INCOME_TOTAL']

# AMT_ANNUITY ; Loan annuity
# AMT_CREDIT ; Credit amount of the loan
app_train_domain['CREDIT_TERM'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_CREDIT']

# DAYS_EMPLOYED ; How many days before the application the person started current employment
# DAYS_BIRTH ; Client's age in days at the time of application
app_train_domain['DAYS_EMPLOYED_PERCENT'] = app_train_domain['DAYS_EMPLOYED'] / app_train_domain['DAYS_BIRTH']


In [None]:
app_test_domain['CREDIT_INCOME_PERCENT'] = app_test_domain['AMT_CREDIT'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['ANNUITY_INCOME_PERCENT'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['CREDIT_TERM'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_CREDIT']
app_test_domain['DAYS_EMPLOYED_PERCENT'] = app_test_domain['DAYS_EMPLOYED'] / app_test_domain['DAYS_BIRTH']

* Visualize New Variables
* => kdeplot of each new domain feature, split by TARGET = 0 or 1

In [None]:
plt.figure(figsize = (12, 20))

# iterate through the new features
for i, feature in enumerate(['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT']):
    
    # create a new subplot for each feature: 4 rows x 1 column, position i + 1
    plt.subplot(4, 1, i + 1)
    
    # plot repaid loans
    sns.kdeplot(app_train_domain.loc[app_train_domain['TARGET'] == 0, feature], label = 'target == 0')
    
    # plot loans that were not repaid
    sns.kdeplot(app_train_domain.loc[app_train_domain['TARGET'] == 1, feature], label = 'target == 1')
    
    # Label the plots
    plt.title('Distribution of %s by Target Value' % feature)
    plt.xlabel('%s' % feature)
    plt.ylabel('Density')
    
# h_pad ; vertical padding between adjacent subplots
plt.tight_layout(h_pad = 2.5) 

* => The distributions differ somewhat for the 3rd and 4th features, but that alone does not show that the new features are useful
* => The only way to find out is to make predictions with these features.

**Baseline**

* naive baseline : guess the same value for every example in the testing set.
* This will get us a Receiver Operating Characteristic Area Under the Curve (AUC ROC) of 0.5 in the competition (random guessing on a classification task will score a 0.5). => An AUC ROC of 0.5 is no better than a coin flip, so it is not a useful prediction.
* So use Logistic Regression as a slightly more sophisticated baseline.
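
The 0.5 score of the naive baseline can be confirmed on toy labels (the labels below are made up). When every example receives the same predicted value, no positive is ranked above any negative, so the ROC curve is the diagonal.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels (illustrative only)
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0])

# Naive baseline: the same guess for every example
constant_guess = np.full(len(y_true), 0.5)

baseline_auc = roc_auc_score(y_true, constant_guess)
print(baseline_auc)
```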

**Logistic Regression(로지스틱 회귀) Implementation**
* ( ref : https://ratsgo.github.io/machine%20learning/2017/04/02/logistic/ or web searching... )

* Machine Learning Algorithm : see a book such as the one at http://faculty.marshall.usc.edu/gareth-james/
* Features : after encoding the categorical variables
* missing values(imputation): by filling the missing values and normalizing the range of features
* => imputation : filling in the missing data values
* => Deletion : removing the data that contains missing values
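
The preprocessing steps listed above (median imputation, scaling to 0-1, then logistic regression) can also be chained in a scikit-learn pipeline. This is a sketch on random stand-in data, not the notebook's own implementation; `C = 0.0001` mirrors the value used in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Random stand-in data with some values knocked out
rng = np.random.RandomState(0)
X_train = rng.rand(200, 3)
y_train = (X_train[:, 0] > 0.5).astype(int)
X_train[rng.rand(200, 3) < 0.1] = np.nan
X_test = rng.rand(20, 3)

# Each step is fitted on the training data only and then reused on the test data
model = make_pipeline(
    SimpleImputer(strategy='median'),
    MinMaxScaler(feature_range=(0, 1)),
    LogisticRegression(C=0.0001),
)
model.fit(X_train, y_train)

# Second column = probability of class 1
proba = model.predict_proba(X_test)[:, 1]
print(proba[:5])
```

Keeping the steps in one pipeline makes it harder to refit the imputer or scaler on the test data by accident.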


In [None]:
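# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer as Imputer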

# Drop the target from the training data
if 'TARGET' in app_train:
    train = app_train.drop(columns = ['TARGET'])
else:
    train = app_train.copy()
    
# Feature names
features = list(train.columns)

# Copy of the testing data
test = app_test.copy()

# Candidate strategies: mean, median, mode ( most frequent value )
# Median imputation of missing values
imputer = Imputer(strategy = 'median')

# Scale each feature to 0-1
scaler = MinMaxScaler(feature_range = (0,1))

# Fit on the training data
imputer.fit(train)

# Transform both training and testing data
train = imputer.transform(train)
test = imputer.transform(app_test)

# Repeat with the scaler
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)

print('Training data shape: ', train.shape)
print('Testing data shape: ', test.shape)


In [None]:
print(train)

In [None]:
from sklearn.linear_model import LogisticRegression

# Make the model with the specified regularization parameter
log_reg = LogisticRegression(C = 0.0001)

# Train on the training data
log_reg.fit(train, train_labels)

In [None]:
# Make prediction
# Make sure to select the second column only
# predict() : outputs the class, 0 or 1
# predict_proba() : outputs probabilities; column 1 is the probability of class 1
log_reg_pred = log_reg.predict_proba(test)[:,1]

In [None]:
# Submission dataframe
submit = app_test[['SK_ID_CURR']].copy()
submit['TARGET'] = log_reg_pred

submit.head()

In [None]:
# Save the submission to a csv file
submit.to_csv('log_reg_baseline.csv', index = False)

* Check Data folder
* => Output : /kaggle/working/log_reg_baseline.csv 

**Improved Model : Random Forest**
* Improve on the baseline with a Random Forest
* Powerful when it uses hundreds of trees.
* We will use 100 trees
* Reference 1 : Model  https://datascienceschool.net/view-notebook/766fe73c5c46424ca65329a9557d0918/ 
* Reference 2 : https://analysis-flood.tistory.com/103

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Make the random forest classifier
random_forest = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)

In [None]:
# Train on the training data
random_forest.fit(train, train_labels)

# Extract feature importances
# One advantage of a Random Forest is that it can compute the importance of each feature (feature importance).
feature_importance_values = random_forest.feature_importances_

# Used near the end of this notebook in "Model Interpretation : Feature Importances"
feature_importances = pd.DataFrame({'feature': features, 'importance': feature_importance_values})

# Make predictions on the test data
predictions = random_forest.predict_proba(test)[:, 1]

In [None]:
# Make a submission dataframe
submit = app_test[['SK_ID_CURR']].copy()
submit['TARGET'] = predictions

# Save the submission dataframe
submit.to_csv('random_forest_baseline.csv', index = False)

* => What are the scores of the Random Forest and LogisticRegression models? (Who assigns these scores?)
* => Random Forest: 0.678
* => Logistic Regression: 0.671
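
These scores are presumably the ones the competition reports after a CSV file is submitted. A rough local estimate can be obtained without submitting by holding out part of the training data. The sketch below assumes ROC AUC as the metric and uses synthetic data; with the real data, `train` and `train_labels` would take the place of the hypothetical `X_toy` and `y_toy`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real training set (assumption for illustration)
X_toy, y_toy = make_classification(n_samples=2000, n_features=20, random_state=0)
X_fit, X_valid, y_fit, y_valid = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0)

models = {
    'logistic_regression': LogisticRegression(C = 0.0001),
    'random_forest': RandomForestClassifier(n_estimators = 100, random_state = 50),
}

valid_auc = {}
for name, model in models.items():
    model.fit(X_fit, y_fit)
    # Score the predicted probability of class 1 against the held-out labels
    valid_auc[name] = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])

print(valid_auc)
```

A hold-out score will not match the leaderboard exactly, but it allows models to be compared before spending a submission.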

**Make Predictions using Engineered Features**
* Let's try the polynomial features and the domain knowledge features.
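
For reference, polynomial features are powers and pairwise products of existing columns. A minimal sketch with scikit-learn's `PolynomialFeatures` follows; the two-column toy input and `degree = 2` are assumptions for illustration and need not match the settings used to build `app_train_poly`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy row with two features, a = 2 and b = 3 (illustrative values)
X_toy = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree = 2)
X_poly = poly.fit_transform(X_toy)

# Columns are ordered as: 1, a, b, a^2, a*b, b^2
print(X_poly)  # [[1. 2. 3. 4. 6. 9.]]
```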

In [None]:
poly_features_names = list(app_train_poly.columns)

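# Impute the polynomial features
# (SimpleImputer replaces Imputer, which was removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median')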

poly_features = imputer.fit_transform(app_train_poly)
# Reuse the medians learned from the training data; do not refit on the test data
poly_features_test = imputer.transform(app_test_poly)

# Scale the polynomial features
scaler = MinMaxScaler(feature_range = (0, 1))

poly_features = scaler.fit_transform(poly_features)
poly_features_test = scaler.transform(poly_features_test)

random_forest_poly = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)
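
Preprocessing statistics such as medians and min/max values are normally learned from the training data only and then applied unchanged to the test data. A toy sketch of that pattern follows; the arrays are invented for illustration, and `SimpleImputer` from `sklearn.impute` is used as the current scikit-learn imputer class.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy data with missing values (illustrative, not the competition data)
toy_train = np.array([[1.0, np.nan],
                      [3.0, 4.0],
                      [5.0, 6.0]])
toy_test = np.array([[np.nan, np.nan]])

# Learn the medians from the training data only: 3.0 and 5.0
toy_imputer = SimpleImputer(strategy = 'median')
toy_train_imputed = toy_imputer.fit_transform(toy_train)
toy_test_imputed = toy_imputer.transform(toy_test)

# Learn the min/max from the training data only, then reuse them on the test data
toy_scaler = MinMaxScaler(feature_range = (0, 1))
toy_train_scaled = toy_scaler.fit_transform(toy_train_imputed)
toy_test_scaled = toy_scaler.transform(toy_test_imputed)

print(toy_test_imputed)  # [[3. 5.]]
print(toy_test_scaled)   # [[0.5 0.5]]
```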

In [None]:
# Train on the training data
random_forest_poly.fit(poly_features, train_labels)

# Make predictions on the test data
predictions = random_forest_poly.predict_proba(poly_features_test)[:, 1]

In [None]:
# Make a submission dataframe
submit = app_test[['SK_ID_CURR']].copy()  # copy so the new column is added to an independent DataFrame
submit['TARGET'] = predictions

# Save the submission dataframe
submit.to_csv('random_forest_baseline_engineered.csv', index = False)

* => What are the scores of the Random Forest model, the LogisticRegression model, and the Random Forest using engineered features? (Who assigns these scores?)
* => Random Forest: 0.678
* => Logistic Regression: 0.671
* => Random Forest using engineered features: 0.678
* => The engineered features did not change the score, so it is unclear what they were for.

**Testing Domain Features**

In [None]:
app_train_domain = app_train_domain.drop(columns = 'TARGET')

domain_features_names = list(app_train_domain.columns)

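# Impute the domain features
# (SimpleImputer replaces Imputer, which was removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median')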

domain_features = imputer.fit_transform(app_train_domain)
domain_features_test = imputer.transform(app_test_domain)

# Scale the domain features
scaler = MinMaxScaler(feature_range = (0,1))

domain_features = scaler.fit_transform(domain_features)
domain_features_test = scaler.transform(domain_features_test)

random_forest_domain = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)

# Train on the training data
random_forest_domain.fit(domain_features, train_labels)

# Extract feature importances
feature_importance_values_domain = random_forest_domain.feature_importances_
feature_importance_domain = pd.DataFrame({'feature': domain_features_names, 'importance': feature_importance_values_domain})

# Make predictions on the test data
predictions = random_forest_domain.predict_proba(domain_features_test)[:, 1]

In [None]:
submit = app_test[['SK_ID_CURR']].copy()  # copy so the new column is added to an independent DataFrame
submit['TARGET'] = predictions

# Save the submission dataframe
submit.to_csv('random_forest_baseline_domain.csv', index = False)

* => Random Forest using domain features: 0.679
* => Scores so far for the Random Forest model, the LogisticRegression model, and the Random Forest using engineered features (who assigns these scores?):
* => Random Forest: 0.678
* => Logistic Regression: 0.671
* => Random Forest using engineered features: 0.678
* => The engineered features barely changed the score, so it is unclear what they were for.

**Model Interpretation : Feature Importances**
* Having worked through the data, we should expect EXT_SOURCE and DAYS_BIRTH to be important features. So let's ... put that to use.

In [None]:
def plot_feature_importances(df):
    """
    Plot importances returned by a model. This can work with any measure of
    feature importance provided that higher importance is better.

    Args:
        df (dataframe): feature importances. Must have the features in a column
        called `feature` and the importances in a column called `importance`

    Returns:
        shows a plot of the 15 most important features

        df (dataframe): feature importances sorted by importance (highest to lowest)
        with a column for normalized importance
    """
    
    # Sort features according to importance
    df = df.sort_values('importance', ascending = False).reset_index()
    
    # Normalize the feature importances to add up to one
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()
    
    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:15]))), 
            df['importance_normalized'].head(15), 
            align = 'center', edgecolor = 'k')
    
    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))
    
    # Plot labeling
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()
    
    return df
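
The sorting and normalization steps inside `plot_feature_importances` can be checked without drawing a plot. A small sketch on a hand-made table follows; the importance values and the `OTHER_FEATURE` name are invented for illustration, and only the column names `feature` and `importance` follow the function above.

```python
import pandas as pd

# Hand-made importances (illustrative values, not model output)
toy_importances = pd.DataFrame({
    'feature': ['DAYS_BIRTH', 'OTHER_FEATURE', 'EXT_SOURCE'],
    'importance': [3.0, 1.0, 4.0],
})

# Same two steps as in the function: sort descending, then normalize to sum to one
ranked = toy_importances.sort_values('importance', ascending = False).reset_index(drop = True)
ranked['importance_normalized'] = ranked['importance'] / ranked['importance'].sum()

print(ranked)
```

Here `EXT_SOURCE` ends up first with a normalized importance of 0.5, followed by `DAYS_BIRTH` (0.375) and `OTHER_FEATURE` (0.125).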


In [None]:
# Show the feature importances for the default features
# feature_importances: DataFrame built earlier that pairs each training feature with its random forest importance
feature_importances_sorted = plot_feature_importances(feature_importances)


* => As expected, EXT_SOURCE and DAYS_BIRTH appear to be important.

In [None]:
feature_importance_domain_sorted = plot_feature_importances(feature_importance_domain)

* => Feature importances when the domain knowledge features are used.

**Conclusions**

**We followed the general outline of a machine learning project:**
* Understand the problem and the data
* Data cleaning and formatting (this was mostly done for us)
* Exploratory Data Analysis
* Baseline model
* Improved model
* Model interpretation (just a little)

Predictions improve when a Light Gradient Boosting Machine model is used.
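
As a rough illustration of the gradient boosting idea, the sketch below compares a random forest with scikit-learn's `GradientBoostingClassifier` on synthetic data. This is a stand-in chosen to keep the example free of extra dependencies: it is not LightGBM, the names `X_toy`, `y_toy`, and `cv_auc` are illustrative, and the outcome on synthetic data says nothing about the competition scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the real training set (assumption for illustration)
X_toy, y_toy = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    'random_forest': RandomForestClassifier(n_estimators = 100, random_state = 50),
    'gradient_boosting': GradientBoostingClassifier(random_state = 50),
}

cv_auc = {}
for name, model in candidates.items():
    # Mean ROC AUC over 3 cross-validation folds
    cv_auc[name] = cross_val_score(model, X_toy, y_toy, cv = 3, scoring = 'roc_auc').mean()

print(cv_auc)
```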