Followed [this Notebook](https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering).

# Introduction : Manual Feature Engineering

The previous notebook improved the score using only the `application` data.

Here we try to improve the score further by bringing more data into the dataset.

The additional data are the `bureau` and `bureau_balance` tables.

> bureau: information about client's previous loans with other financial institutions reported to Home Credit. Each previous loan has its own row.

>bureau_balance: monthly information about the previous loans. Each month has its own row.

>Manual feature engineering can be a tedious process (which is why we use automated feature engineering with featuretools!) and often relies on domain expertise. Since I have limited domain knowledge of loans and what makes a person likely to default, I will instead concentrate on getting as much info as possible into the final training dataframe. The idea is that the model will then pick up on which features are important rather than us having to decide that. Basically, our approach is to make as many features as possible and then give them all to the model to use! Later, we can perform feature reduction using the feature importances from the model or other techniques such as PCA.

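Since I have no domain expertise in this area, I do not take an expert-driven approach; instead I feed the data to the model and let it do the work.

Rather than choosing features up front from domain knowledge or experience, we use the model to find out which features matter (their feature importance) and then keep the features with high importance.

The basic approach is to pour in as much data as possible and see what turns up.

Later, we can try reducing the number of features using the model's feature importances or a technique such as PCA.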

In [None]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warning from pandas
import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')

### Example : Counts of a client's previous loans



To start the manual feature engineering, we count how many previous loans each client has had at other financial institutions.

This uses the following pandas operations.

* groupby: group a dataframe by a column. In this case we will group by the unique client, the SK_ID_CURR column
* agg: perform a calculation on the grouped data such as taking the mean of columns. We can either call the function directly (grouped_df.mean()) or use the agg function together with a list of transforms (grouped_df.agg([mean, max, min, sum]))
* merge: match the aggregated statistics to the appropriate client. We need to merge the original training data with the calculated stats on the SK_ID_CURR column which will insert NaN in any cell for which the client does not have the corresponding statistic
* We also use the (rename) function quite a bit specifying the columns to be renamed as a dictionary. This is useful in order to keep track of the new variables we create.
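
The four operations listed above can be sketched on a toy dataset. The frames `toy_bureau` and `toy_train` and all of their values are made up for illustration; only the column names `SK_ID_CURR` and `SK_ID_BUREAU` come from the real data.

```python
import pandas as pd

# Hypothetical toy data: three previous loans for client 1, one for client 2
toy_bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 2],
    'SK_ID_BUREAU': [10, 11, 12, 20],
})
# Client 3 has no previous loans on record
toy_train = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})

# groupby + count + rename: number of previous loans per client
toy_counts = (
    toy_bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU']
    .count()
    .rename(columns={'SK_ID_BUREAU': 'previous_loan_counts'})
)

# merge: a left join keeps every client; clients without loans get NaN
toy_train = toy_train.merge(toy_counts, on='SK_ID_CURR', how='left')

# Replace the NaN with 0, since no matching rows means zero previous loans
toy_train['previous_loan_counts'] = toy_train['previous_loan_counts'].fillna(0)
print(toy_train)
```

Client 3 ends up with a count of 0 rather than NaN, which is the same treatment the real training data receives further down.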

In [None]:
# Read in bureau
bureau = pd.read_csv('../input/home-credit-default-risk/bureau.csv')
bureau.head(10)

The rows are not ordered by `SK_ID_CURR` (perhaps deliberately), but sorting by ID shows that the values match the client IDs seen in the previous notebook.

In [None]:
# Groupby the client id (SK_ID_CURR), count the number of previous loans, and rename the column
previous_loan_counts = bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU'].count().rename(columns={'SK_ID_BUREAU':'previous_loan_counts'})
previous_loan_counts.head()

In [None]:
# Join to the training dataframe
train = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
# Merge on the 'SK_ID_CURR' column.
train = train.merge(previous_loan_counts, on='SK_ID_CURR', how='left')

# Fill the missing values with 0
train['previous_loan_counts'] = train['previous_loan_counts'].fillna(0)
train.head()

### Assessing Usefulness of New Variable with r value

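We compute the Pearson correlation coefficient (r-value) to check whether the new variable has a positive or negative relationship with the target.

It is not the best measure of a variable's usefulness, but it works well as a quick first check.

After that, we draw a KDE plot as a further check.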
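
As a rough illustration of how the sign of r is read, here is a minimal sketch on made-up numbers. The columns `feature_a` and `feature_b` are hypothetical; `Series.corr` uses the Pearson method by default.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the sign of r
toy = pd.DataFrame({
    'TARGET':    [0, 0, 0, 1, 1, 1],
    'feature_a': [1, 2, 3, 4, 5, 6],   # tends to be larger when TARGET == 1
    'feature_b': [6, 5, 4, 3, 2, 1],   # tends to be smaller when TARGET == 1
})

# Series.corr defaults to the Pearson correlation coefficient
r_a = toy['TARGET'].corr(toy['feature_a'])
r_b = toy['TARGET'].corr(toy['feature_b'])

print('r for feature_a: %0.4f' % r_a)
print('r for feature_b: %0.4f' % r_b)
```

`feature_a` gives a positive r and `feature_b` a negative r of the same magnitude, because the second column is the first one reversed.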


### Kernel Density Estimate Plots

KDE plots were already used in the previous notebook, so refer to the description below for details.

>The kernel density estimate plot shows the distribution of a single variable (think of it as a smoothed histogram). To see the difference in distributions dependent on the value of a categorical variable, we can color the distributions differently according to the category. For example, we can show the kernel density estimate of the previous_loan_count colored by whether the TARGET = 1 or 0. The resulting KDE will show any significant differences in the distribution of the variable between people who did not repay their loan (TARGET == 1) and the people who did (TARGET == 0). This can serve as an indicator of whether a variable will be 'relevant' to a machine learning model.

We wrap the plot in a function, as below, so that it can be reused for any variable.

In [None]:
# Plots the distribution of a variable colored by value of the target
def kde_target(var_name, df):
    # Calculate the correlation coefficient between the new variable and the target.
    corr = df['TARGET'].corr(df[var_name])
    
    # Calculate medians for repaid vs not repaid
    # (.loc replaces .ix, which was removed in pandas 1.0)
    avg_repaid = df.loc[df['TARGET']==0, var_name].median()
    avg_not_repaid = df.loc[df['TARGET']==1, var_name].median()
    
    plt.figure(figsize=(12,6))
    
    # Plot the distribution for target == 0 and target == 1
    sns.kdeplot(df.loc[df['TARGET']==0, var_name], label='TARGET == 0')
    sns.kdeplot(df.loc[df['TARGET']==1, var_name], label='TARGET == 1')
    
    # label the plot
    plt.xlabel(var_name)
    plt.ylabel('Density')
    plt.title('%s Distribution' % var_name)
    plt.legend();
    
    # print out the correlation
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))
    # print out median values
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid = %0.4f' % avg_repaid)
    
    

We test the function with `EXT_SOURCE_3`.

For reference, this feature was found to be one of the most important variables by the Random Forest and Gradient Boosting Machine models.

In [None]:
kde_target('EXT_SOURCE_3', train)

Now we plot the newly created variable, the number of previous loans.

In [None]:
kde_target('previous_loan_counts', train)

There is no obvious relationship; the correlation coefficient is very low.

Next, we compute statistics such as max, min and mean for the other variables in the bureau dataframe.

### Aggregating Numeric Columns

>To account for the numeric information in the bureau dataframe, we can compute statistics for all the numeric columns. To do so, we groupby the client id, agg the grouped dataframe, and merge the result back into the training data. The agg function will only calculate the values for the numeric columns where the operation is considered valid. We will stick to using 'mean', 'max', 'min', 'sum' but any function can be passed in here. We can even write our own function and use it in an agg call.

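We compute the statistics for every numeric column, then merge the result into the training data on the client id.

We use the `agg` function for this: compute all the statistics, turn each one into a column, and merge them all into the existing training dataframe.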
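
The quoted text notes that a custom function can also be passed to `agg`. A minimal sketch of that idea follows; the helper `value_range` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical helper: spread between the largest and smallest value
def value_range(series):
    return series.max() - series.min()

toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2, 2],
    'DAYS_CREDIT': [-100, -40, -10, -30],
})

# Built-in statistics are passed by name, custom ones as callables
toy_stats = toy.groupby('SK_ID_CURR')['DAYS_CREDIT'].agg(['mean', 'max', value_range])
print(toy_stats)
```

The resulting column takes its name from the function, so `value_range` appears next to `mean` and `max`.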

In [None]:
bureau_agg = bureau.drop(columns=['SK_ID_BUREAU']).groupby('SK_ID_CURR', as_index=False).agg(['count','mean','max','min','sum']).reset_index()
bureau_agg.head()

The columns are split across two levels, so we combine them into a single level.
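
The cells below build the new names by looping over the two column levels, which assumes the level order matches the column order. As an alternative sketch (an assumption of mine, not the method used in this notebook), the names can be built from the column tuples themselves before the index is reset. The toy frame and the `bureau_` prefix are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'DAYS_CREDIT': [-10, -30, -5],
    'AMT': [100.0, 200.0, 50.0],
})

# Aggregating with a list of statistics yields two-level column names
toy_agg = toy.groupby('SK_ID_CURR').agg(['mean', 'max'])

# Each column label is a (variable, statistic) tuple; join it into one string
toy_agg.columns = ['bureau_%s_%s' % (var, stat) for var, stat in toy_agg.columns]

# Turn the grouping key back into an ordinary column
toy_agg = toy_agg.reset_index()
print(toy_agg.columns.tolist())
```

Because each name is derived from the label of the column it replaces, a name cannot end up attached to the wrong column.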

In [None]:
bureau_agg.columns.levels[0]

In [None]:
bureau_agg.columns.levels[1] # the last element is an empty string

In [None]:
bureau_agg.columns.levels[1][:-1]

In [None]:
# List of column names
columns = ['SK_ID_CURR']

# Iterate through the variables names
for var in bureau_agg.columns.levels[0]:
    # Skip the id name
    if var != 'SK_ID_CURR':
        # Iterate through the stat names
        for stat in bureau_agg.columns.levels[1][:-1]:
            # Make a new column name for the variable and stat
            columns.append('bureau_%s_%s' % (var, stat))


In [None]:
# Assign the list of columns names as the dataframe column names
bureau_agg.columns = columns
bureau_agg.head()

Now we merge this data into the train data.

In [None]:
# Merge with the training data
train  = train.merge(bureau_agg, on='SK_ID_CURR', how='left')
train.head()

### Correlations of Aggregated Values with Target

Here too we compute the Pearson coefficient to get a rough picture of the relationships.

In [None]:
# List of new correlations
new_corrs = []

# Iterate through the columns
for col in columns:
    # Calculate correlations with the target
    corr = train['TARGET'].corr(train[col])
    
    # Append the list as a tuple
    # a (str, float) tuple is appended to new_corrs.
    new_corrs.append((col, corr))

In [None]:
new_corrs[:15]

We sort the list in descending order of absolute value. The values themselves are left unchanged; only the sort key is the absolute value.

In [None]:
# Sort the correlations by the absolute value; the key argument sorts on the absolute value of the tuple element at index 1
# Make sure to reverse to put the largest values at the front of list

new_corrs = sorted(new_corrs, key=lambda x: abs(x[1]), reverse=True)
new_corrs[:15]

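None of the variables examined so far appears to have a noticeable effect; the largest absolute correlation is only about 0.08.

We draw the KDE plot for `bureau_DAYS_CREDIT_mean`, which has the largest value.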

In [None]:
kde_target('bureau_DAYS_CREDIT_mean', train)

>The definition of this column is: "How many days before current application did client apply for Credit Bureau credit". My interpretation is this is the number of days that the previous loan was applied for before the application for a loan at Home Credit. Therefore, a larger negative number indicates the loan was further before the current loan application. We see an extremely weak positive relationship between the average of this variable and the target meaning that clients who applied for loans further in the past potentially are more likely to repay loans at Home Credit. With a correlation this weak though, it is just as likely to be noise as a signal.

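To attempt an interpretation of the data:

This variable is the number of days before the current application at which the client applied for the previous (Credit Bureau) loan.

From this plot alone, one could conclude that clients whose previous loans were taken out further in the past are more likely to repay the new loan. However, the correlation is so weak that it could just as well be noise.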

### The Multiple Comparisons Problem


>When we have lots of variables, we expect some of them to be correlated just by pure chance, a problem known as multiple comparisons. We can make hundreds of features, and some will turn out to be correlated with the target simply because of random noise in the data. Then, when our model trains, it may overfit to these variables because it thinks they have a relationship with the target in the training set, but this does not necessarily generalize to the test set. There are many considerations that we have to take into account when making features!
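
A small simulation can make this concrete. The sketch below is not from the original notebook; the sizes (200 rows, 500 features) and the seed are arbitrary assumptions. Every feature is pure noise, yet the largest correlation with the target is clearly non-zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

n_rows, n_features = 200, 500

# A random binary target and features that are pure noise by construction
target = pd.Series(rng.integers(0, 2, size=n_rows))
noise = pd.DataFrame(rng.normal(size=(n_rows, n_features)))

# Pearson correlation of every noise column with the target
noise_corrs = noise.corrwith(target)
max_abs_corr = noise_corrs.abs().max()

print('Largest |r| among %d noise features: %0.4f' % (n_features, max_abs_corr))
```

For comparison, the strongest correlation found among the aggregated bureau features above was about 0.08, on a much larger number of rows.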

### Function for Numeric Aggregations


>Let's encapsulate all of the previous work into a function. This will allow us to compute aggregate stats for numeric columns across any dataframe. We will re-use this function when we want to apply the same operations for other dataframes.

We computed the aggregate statistics above. We now wrap that work in a function so that it can be reused on other dataframes.

In [None]:
def agg_numeric(df, group_var, df_name):
    """
    Aggregates the numeric values in a dataframe. This can be used to create features for 
    each instance of the grouping variable.
    
    Parameters
    ----------
        df(dataframe):
            the dataframe to calculate the statistics on
        group_var (string):
            the variable by which to group df
        df_name (string):
            the variable used to rename the columns
            
    Return
    ----------
        agg (dataframe):
            a dataframe with the statistics aggregated for
            all numeric columns. Each instance of the grouping variable will have
            the statistics (count, mean, max, min, sum; currently supported) calculated.
            The columns are also renamed to keep track of features created.
    """
    
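    # Remove id variables other than grouping variable
    # (drop returns a new dataframe, so the result has to be assigned back)
    for col in df:
        if col != group_var and 'SK_ID' in col:
            df = df.drop(columns=col)
    
    group_ids = df[group_var]
    # copy() so that adding a column does not touch a view of the input
    numeric_df = df.select_dtypes('number').copy()
    numeric_df[group_var] = group_ids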
    
    # Group by the specified variable and calculate the statistics
    agg = numeric_df.groupby(group_var).agg(['count','mean','max','min','sum']).reset_index()
    
    # Need to create new column names
    columns = [group_var]
    
    # Iterate through the variables names
    for var in agg.columns.levels[0]:
        # Skip the grouping variable
        if var != group_var:
            # Iterate through the stat names
            for stat in agg.columns.levels[1][:-1]:
                # Make a new column name for the variable and stat
                columns.append('%s_%s_%s' % (df_name, var, stat))
    agg.columns = columns
    return agg
    
    

In [None]:
bureau_agg_new = agg_numeric(bureau.drop(columns=['SK_ID_BUREAU']), group_var='SK_ID_CURR', df_name='bureau')
bureau_agg_new.head()

>To make sure the function worked as intended, we should compare with the aggregated dataframe we constructed by hand.

Check that it matches the dataframe we built by hand.

In [None]:
bureau_agg.head()

Comparing the numbers shows that the two are the same. The function should be handy whenever the same task comes up again.

### Correlation Function

We define a function that computes the correlations with the target.

In [None]:
# Function to calculate correlation with the target for a dataframe

def target_corrs(df):
    # List of correlations (starts empty)
    corrs = []
    
    # Iterate through the columns
    for col in df.columns:
        print(col)
        # Skip the target column
        if col != 'TARGET':
            # Calculate correlation with the target
            # (index with the variable col, not the literal string 'col')
            corr = df['TARGET'].corr(df[col])
            
            # Append the list as a tuple
            corrs.append((col, corr))
            
    # Sort by absolute magnitude of correlation
    corrs = sorted(corrs, key = lambda x : abs(x[1]), reverse=True)
    
    return corrs

### Categorical Variables

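Now we move on to the categorical variables. These hold discrete string data, so statistics such as max, min and mean cannot be applied.

Instead, we can count the number of occurrences of each category within each categorical column.

Dividing the count for each category by the total number of occurrences gives a normalized value between 0 and 1.

In other words, we encode them.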

>First we one-hot encode a dataframe with only the categorical columns (dtype == 'object').

In [None]:
categorical = pd.get_dummies(bureau.select_dtypes('object'))
# SK_ID_CURR is a numeric column and was therefore not selected, so it is added back here
categorical['SK_ID_CURR'] = bureau['SK_ID_CURR']
categorical.head()

In [None]:
# Group by the SK_ID_CURR column and compute the aggregate columns
categorical_grouped = categorical.groupby('SK_ID_CURR').agg(['sum','mean'])
categorical_grouped.head()

>The `sum` columns represent the count of that category for the associated client and the mean represents the normalized count. One-hot encoding makes the process of calculating these figures very easy!

>We can use a similar function as before to rename the columns. Again, we have to deal with the multi-level index for the columns. We iterate through the first level (level 0) which is the name of the categorical variable appended with the value of the category (from one-hot encoding). Then we iterate through the stats we calculated for each client. We will rename the column with the level 0 name appended with the stat. As an example, the column with CREDIT_ACTIVE_Active as level 0 and sum as level 1 will become CREDIT_ACTIVE_Active_count.

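`sum` is simply the number of occurrences per client id after one-hot encoding, and `mean` can be read as the normalized version of that value.
Normalized here means that, like an average, it gives the share of the client's records that fall into that category.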
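
A worked example of the count and the normalized count, on made-up data. The values are hypothetical; `CREDIT_ACTIVE` is a real column of `bureau`. Passing `dtype=int` to `pd.get_dummies` is my choice here so that the indicators are plain 0/1 integers.

```python
import pandas as pd

# Hypothetical loans: client 1 has two active loans and one closed loan
toy_bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 2],
    'CREDIT_ACTIVE': ['Active', 'Active', 'Closed', 'Closed'],
})

# One-hot encode the categorical column; dtype=int keeps the indicators as 0/1
toy_cat = pd.get_dummies(toy_bureau[['CREDIT_ACTIVE']], dtype=int)
toy_cat['SK_ID_CURR'] = toy_bureau['SK_ID_CURR']

# sum = number of loans in the category, mean = share of the client's loans
toy_grouped = toy_cat.groupby('SK_ID_CURR').agg(['sum', 'mean'])
print(toy_grouped)
```

For client 1 the `Active` column has a sum of 2 and a mean of 2/3: two of that client's three loans are active.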

As before, the columns are split across two levels, so we combine them into one.

In [None]:
categorical_grouped.columns.levels[0][:10] # 23 columns in total

In [None]:
categorical_grouped.columns.levels[1]

In [None]:
group_var = 'SK_ID_CURR'

# Need to create new column names
# The list below is not initialized with 'SK_ID_CURR' because as_index was not
# specified when categorical_grouped was created, so it defaulted to True.
# 'SK_ID_CURR' is therefore the index rather than a column and needs no column name.
columns = []

# Iterate through the variables names
for var in categorical_grouped.columns.levels[0]:
    # Skip the grouping variable
    if var != group_var:
        for stat in ['count', 'count_norm']:
            # Make a new column name for the variable and stat
            columns.append('%s_%s' % (var, stat))

# Rename the columns
categorical_grouped.columns = columns

categorical_grouped.head()
            

>The sum column records the counts and the mean column records the normalized count.
We can merge this dataframe into the training data.

Now we merge the dataframe we just built into the existing training data.

In [None]:
train = train.merge(categorical_grouped, left_on='SK_ID_CURR', right_index = True, how='left')
train.head()

In [None]:
train.shape

In [None]:
train.iloc[:10,123:]

### Function to Handle Categorical Variables

>To make the code more efficient, we can now write a function to handle the categorical variables for us. This will take the same form as the `agg_numeric` function in that it accepts a dataframe and a grouping variable. Then it will calculate the counts and normalized counts of each category for all categorical variables in the dataframe.

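Like the `agg_numeric` function above, we write one more function, this time to handle the categorical variables.

Its inputs are again a dataframe and the variable to group by.

For every categorical variable in the dataframe, the function counts the occurrences of each category and normalizes the counts.

In short, it simply packages the work done above into a reusable function.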

In [None]:
def count_categorical(df, group_var, df_name):
    """
    Computes counts and normalized counts for each observation
    of `group_var` of each unique category in every categorical variable
    
    Parameters
    ----------
    df: dataframe
        The dataframe to calculate the value counts for.
        
    group_var : string
        The variable by which to group the dataframe. For each unique
        value of this variable, the final dataframe will have one row
        
    df_name : string
        Variable added to the front of column names to keep track of columns
        
    Return
    ------
    categorical : dataframe
        A dataframe with counts and normalized counts of each unique category in every categorical variable
        with one row for every unique value of the `group_var`
    """
    
    # Select the categorical columns
    categorical = pd.get_dummies(df.select_dtypes('object'))
    
    # Make sure to put the identifying id on the column
    categorical[group_var] = df[group_var]
    
    # Groupby the group var and calculate the sum and mean
    categorical = categorical.groupby(group_var).agg(['sum','mean'])
    
    column_names = []
    
    # Iterate through the columns in level 0
    for var in categorical.columns.levels[0]:
        # Iterate through the stats in level 1, with new name!
        for stat in ['count','count_norm']:
            # Make a new column name
            column_names.append('%s_%s_%s' % (df_name, var, stat))
    
    # Reassign the column names.
    
    categorical.columns = column_names
    
    return categorical     
    
    
    

In [None]:
bureau_counts = count_categorical(bureau, group_var='SK_ID_CURR', df_name='bureau')
bureau_counts.head()

### Applying Operation to another dataframe

>We will now turn to the bureau balance dataframe. This dataframe has monthly information about each client's previous loan(s) with other financial institutions. Instead of grouping this dataframe by the `SK_ID_CURR` which is the client id, we will first group the dataframe by the `SK_ID_BUREAU` which is the id of the previous loan. This will give us one row of the dataframe for each loan. Then, we can group by the `SK_ID_CURR` and calculate the aggregations across the loans of each client. The final result will be a dataframe with one row for each client, with stats calculated for their loans.

Next we work on the bureau balance dataframe. It holds information on each client's previous loans, organized by month.

This time we group by `SK_ID_BUREAU` rather than `SK_ID_CURR`. This column is the id of a previous loan. Grouping by it gives one row of information per previous loan; grouping those rows again by `SK_ID_CURR` then summarizes the previous loan history of each client.

The final result is one row per client, with statistics calculated over that client's loans.

In [None]:
# Read in bureau balance
bureau_balance = pd.read_csv('../input/home-credit-default-risk/bureau_balance.csv')
bureau_balance.head()

>First, we can calculate the value counts of each status for each loan. Fortunately, we already have a function that does this for us!

In [None]:
# Counts of each type of status for each previous loan
bureau_balance_counts = count_categorical(bureau_balance,group_var='SK_ID_BUREAU', df_name='bureau_balance')
bureau_balance_counts.head()

>Now we can handle the one numeric column. The `MONTHS_BALANCE` column has the "months of balance relative to application date." This might not necessarily be that important as a numeric variable, and in future work we might want to consider this as a time variable. For now, we can just calculate the same aggregation statistics as previously.

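`MONTHS_BALANCE` is the month of the balance relative to the application date, not the balance amount itself.

As a numeric value it is not considered a particularly important indicator here.

It may, however, be used as a time variable later. For now we compute the same aggregations as before.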


In [None]:
# Calculate aggregate statistics for each `SK_ID_BUREAU`
# agg_numeric is used here because it only processes the numeric columns
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', df_name = 'bureau_balance')
bureau_balance_agg.head()

>The above dataframes have the calculations done on each loan. Now we need to aggregate these for each client. We can do this by merging the dataframes together first and then since all the variables are numeric, we just need to aggregate the statistics again, this time grouping by the `SK_ID_CURR`.

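The key points of the steps so far:

Numeric data are summarized with the `agg_numeric` function.

Categorical data are one-hot encoded and counted with the `count_categorical` function.

In both cases the statistics are computed per `SK_ID_BUREAU`.

We first merge the encoded counts and the numeric statistics on `SK_ID_BUREAU`,

then attach the matching `SK_ID_CURR` (the client id) to each loan.

From that point on, the grouping is done by `SK_ID_CURR`.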

In [None]:
# Dataframe grouped by the loan
bureau_by_loan = bureau_balance_agg.merge(
    bureau_balance_counts, 
    right_index=True, 
    left_on = 'SK_ID_BUREAU', 
    how='outer')
bureau_by_loan.head()

In [None]:
# Merge to include the SK_ID_CURR
bureau_by_loan = bureau_by_loan.merge(
    bureau[['SK_ID_BUREAU','SK_ID_CURR']], 
    on='SK_ID_BUREAU', 
    how='left')
bureau_by_loan.head()

Each `SK_ID_BUREAU` is now matched with its `SK_ID_CURR`.

In [None]:
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns=['SK_ID_BUREAU']), group_var='SK_ID_CURR', df_name='client')

bureau_balance_by_client.head()

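To summarize,

we did the following with the `bureau_balance` dataframe.

 1. Grouped the data by loan id and computed statistics of the numeric columns.
 2. Grouped the data by loan id and counted the values of each category in the categorical columns.
 3. Merged the statistics and the category counts into one per-loan dataframe.
 4. Grouped by client id and computed the statistics again.
    

In the end, each row belongs to one client id and summarizes that client's previous loan history (counts and so on) together with the monthly information on those loans.

Some of the resulting variables are a little confusing, so an explanation is added below.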

>`client_bureau_balance_MONTHS_BALANCE_mean_mean`: For each loan calculate the mean value of MONTHS_BALANCE. Then for each client, calculate the mean of this value for all of their loans.

>`client_bureau_balance_STATUS_X_count_norm_sum`: For each loan, calculate the number of occurrences of STATUS == X divided by the number of total STATUS values for the loan. Then, for each client, add up the values for each loan.
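
The two-stage aggregation behind a name such as `..._mean_mean` can be traced on a few made-up rows. All values below are hypothetical, and the sketch uses plain `groupby` calls rather than the notebook's helper functions.

```python
import pandas as pd

# Hypothetical monthly records: loans 10 and 11 belong to client 1, loan 20 to client 2
toy_balance = pd.DataFrame({
    'SK_ID_BUREAU':   [10, 10, 11, 11, 20],
    'MONTHS_BALANCE': [0, -2, -4, -6, -1],
})
toy_loans = pd.DataFrame({
    'SK_ID_BUREAU': [10, 11, 20],
    'SK_ID_CURR':   [1, 1, 2],
})

# Step 1: one row per loan, with the mean of MONTHS_BALANCE for that loan
per_loan = (
    toy_balance.groupby('SK_ID_BUREAU', as_index=False)['MONTHS_BALANCE']
    .mean()
    .rename(columns={'MONTHS_BALANCE': 'MONTHS_BALANCE_mean'})
)

# Step 2: attach the client id, then average the per-loan means per client
per_client = (
    per_loan.merge(toy_loans, on='SK_ID_BUREAU', how='left')
    .groupby('SK_ID_CURR')['MONTHS_BALANCE_mean']
    .mean()
    .rename('MONTHS_BALANCE_mean_mean')
)
print(per_client)
```

Loans 10 and 11 have per-loan means of -1 and -5, so client 1 gets a mean of means of -3.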

Also, we hold off on computing correlations until all the variables have been combined into a single dataframe.

# Putting the Functions Together

>We now have all the pieces in place to take the information from the previous loans at other institutions and the monthly payments information about these loans and put them into the main training dataframe. Let's do a reset of all the variables and then use the functions we built to do this from the ground up. This demonstrates the benefit of using functions for repeatable workflows!

We reset everything done so far (the previous loan information and the monthly loan information) and redo the whole process from scratch, using the functions for the repetitive steps.

In [None]:
# Free up memory by deleting old objects
import gc # garbage collector
gc.enable()
del train, bureau, bureau_balance, bureau_agg, bureau_agg_new, bureau_balance_agg,bureau_balance_counts,bureau_balance_by_client, bureau_by_loan, bureau_counts
gc.collect()


The number returned differs from the one in the reference kernel, but the objects have clearly been deleted, so we move on.

In [None]:
# Read in new copies of all the dataframes
train = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
bureau = pd.read_csv('../input/home-credit-default-risk/bureau.csv')
bureau_balance = pd.read_csv('../input/home-credit-default-risk/bureau_balance.csv')

### Counts of Bureau Dataframe

In [None]:
bureau_counts = count_categorical(bureau, group_var='SK_ID_CURR', df_name='bureau')
bureau_counts.head()
# only columns with dtype == 'object' are processed.

### Aggregated Stats of Bureau Dataframe

In [None]:
# only the numeric columns are selected and processed
bureau_agg = agg_numeric(bureau.drop(columns=['SK_ID_BUREAU']), group_var='SK_ID_CURR', df_name='bureau' )
bureau_agg.head()

### Value counts of Bureau Balance dataframe by loan

Now we process the `bureau_balance` data,

starting with the categorical data.

In [None]:
bureau_balance_counts = count_categorical(bureau_balance, group_var='SK_ID_BUREAU', df_name='bureau_balance')
bureau_balance_counts.head()

### Aggregated stats of Bureau Balance dataframe by loan

In [None]:
# Compute the statistics for the numeric data
bureau_balance_agg = agg_numeric(bureau_balance, group_var='SK_ID_BUREAU',df_name='bureau_balance')
bureau_balance_agg.head()

### Aggregated Stats of Bureau Balance by Client

Above we aggregated the statistics by loan id; here we aggregate by client id, building on the data from above.

In [None]:
# Dataframe grouped by the loan
# Merge the per-loan numeric statistics and category counts computed above into one dataframe
bureau_by_loan = bureau_balance_agg.merge(
    bureau_balance_counts, 
    right_index=True,
    left_on='SK_ID_BUREAU',
    how='outer'
)

# Merge to include the SK_ID_CURR
# To match SK_ID_BUREAU with SK_ID_CURR, take both columns
# from the original dataframe and use them in the merge
bureau_by_loan = bureau[['SK_ID_BUREAU','SK_ID_CURR']].merge(
    bureau_by_loan,
    on='SK_ID_BUREAU',
    how='left'
)


In [None]:
bureau_by_loan.head()

The two columns are matched as intended. Now we drop the `SK_ID_BUREAU` column,

group by `SK_ID_CURR`, and aggregate every column again.

In [None]:
# Aggregate the stats for each client
bureau_balance_by_client = agg_numeric(
    bureau_by_loan.drop(columns=['SK_ID_BUREAU']),
    group_var='SK_ID_CURR',
    df_name='client'
)

In [None]:
bureau_balance_by_client.head()

### Insert Computed Features into Training Data

In [None]:
original_features = list(train.columns)
print('Original Number of Features : ', len(original_features))

In [None]:
# Merge with the value counts of bureau
train = train.merge(bureau_counts, on = 'SK_ID_CURR', how='left')

# Merge with the stats of bureau
train = train.merge(bureau_agg, on='SK_ID_CURR', how='left')

# Merge with the monthly information grouped by client
train = train.merge(bureau_balance_by_client, on='SK_ID_CURR', how='left')


In [None]:
new_features = list(train.columns)
print('Number of features using previous loans from other institutions data : ',
     len(new_features))

# Feature Engineering Outcomes

>After all that work, now we want to take a look at the variables we have created. We can look at the percentage of missing values, the correlations of variables with the target, and also the correlation of variables with the other variables. The correlations between variables can show if we have collinear variables, that is, variables that are highly correlated with one another. Often, we want to remove one in a pair of collinear variables because having both variables would be redundant. We can also use the percentage of missing values to remove features with a substantial majority of values that are not present. Feature selection will be an important focus going forward, because reducing the number of features can help the model learn during training and also generalize better to the testing data. The "curse of dimensionality" is the name given to the issues caused by having too many features (too high of a dimension). As the number of variables increases, the number of datapoints needed to learn the relationship between these variables and the target value increases exponentially.

>Feature selection is the process of removing variables to help our model to learn and generalize better to the testing set. The objective is to remove useless/redundant variables while preserving those that are useful. There are a number of tools we can use for this process, but in this notebook we will stick to removing columns with a high percentage of missing values and variables that have a high correlation with one another. Later we can look at using the feature importances returned from models such as the `Gradient Boosting Machine` or `Random Forest` to perform feature selection.

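The quoted text above explains the details.

The point is this: handle the missing values, use the correlations to remove variables that are useless or redundant, and so reduce the dimensionality that grows with every added variable, because that helps the model predict better.

In this notebook, feature selection is limited to removing columns with a high percentage of missing values and variables that are highly correlated with one another.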
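
As a rough sketch of the second idea (removing one of each pair of highly correlated variables), the following works on a toy frame. The threshold of 0.8 and the column names are arbitrary assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical features: 'b' is an exact multiple of 'a', 'c' is unrelated
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})

threshold = 0.8  # arbitrary cut-off chosen for this illustration

# Absolute pairwise correlations, keeping only the upper triangle (k=1)
# so that each pair is considered once and the diagonal is ignored
abs_corrs = toy.corr().abs()
upper_mask = np.triu(np.ones(abs_corrs.shape), k=1).astype(bool)
upper = abs_corrs.where(upper_mask)

# Drop the second member of every pair above the threshold
cols_to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
toy_reduced = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
```

Only `b` is dropped: it duplicates the information in `a`, while `c` is only weakly correlated with either.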


### Missing Values

In [None]:
# Function to calculate missing values by column 
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()
    
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'}   
    )
    
    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    
    # Print some summary information
    print('Your selected dataframe has ' + str(df.shape[1]) + ' columns.\n'
          'There are ' + str(mis_val_table_ren_columns.shape[0]) + 
          ' columns that have missing values.')
    
    # Return the dataframe with missing information
    return mis_val_table_ren_columns

In [None]:
missing_train = missing_values_table(train)
missing_train.head(10)

> We see there are a number of columns with a high percentage of missing values. There is no well-established threshold for removing missing values, and the best course of action depends on the problem. Here, to reduce the number of features, we will remove any columns in either the training or the testing data that have greater than 90% missing values.

Any column with more than 90% missing values, in either the training or the testing data, is removed.
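
The threshold rule can be checked on a toy frame where the missing percentages are known in advance. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'mostly_missing' is 95% NaN, 'half_missing' is 50% NaN
n = 20
toy = pd.DataFrame({
    'complete': np.arange(n, dtype=float),
    'half_missing': [np.nan if i % 2 == 0 else 1.0 for i in range(n)],
    'mostly_missing': [1.0] + [np.nan] * (n - 1),
})

# Percentage of missing values per column
pct_missing = 100 * toy.isnull().mean()

# Columns above the 90% threshold
cols_over_90 = list(pct_missing.index[pct_missing > 90])
print(cols_over_90)
```

Only `mostly_missing` crosses the threshold, so it is the only column that would be removed.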

In [None]:
missing_train_vars = list(missing_train.index[missing_train['% of Total Values'] > 90])
len(missing_train_vars)

No column exceeds 90%.

>Before we remove the missing values, we will find the missing value percentages in the testing data. We'll then remove any columns with greater than 90% missing values in either the training or testing data. Let's now read in the testing data, perform the same operations, and look at the missing values in the testing data. We already have calculated all the counts and aggregation statistics, so we only need to merge the testing data with the appropriate data.

First we check whether any column in the test data has more than 90% missing values; columns over the threshold in either the training or the testing data are then removed. Let's proceed.


### Calculate Information for Testing Data

In [None]:
# Read in the test dataframe
test = pd.read_csv('../input/home-credit-default-risk/application_test.csv')
test.head()

In [None]:
# Merge with the value counts of bureau
test = test.merge(bureau_counts, on = 'SK_ID_CURR', how='left')

# Merge with the stats of bureau
test = test.merge(bureau_agg, on = 'SK_ID_CURR', how='left')

# Merge with the monthly information grouped by client
test = test.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')

In [None]:
print('Shape of Training Data : ', train.shape)
print('Shape of Testing Data: ', test.shape)

>We need to align the testing and training dataframes, which means matching up the columns so they have the exact same columns. This shouldn't be an issue here, but when we one-hot encode variables, we need to align the dataframes to make sure they have the same columns.

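The training and testing data need to be aligned, that is, made to have exactly the same columns.

This matters in particular after one-hot encoding, where both frames must end up with the same column structure.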
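
A minimal sketch of why alignment matters after one-hot encoding. The column `CONTRACT` and its values are made up; the point is that a category seen only in the training data produces a column the test data lacks.

```python
import pandas as pd

# Hypothetical raw data: the category 'C' only appears in the training set
raw_train = pd.DataFrame({'TARGET': [0, 1, 0], 'CONTRACT': ['A', 'B', 'C']})
raw_test = pd.DataFrame({'CONTRACT': ['A', 'B']})

toy_train = pd.get_dummies(raw_train, dtype=int)
toy_test = pd.get_dummies(raw_test, dtype=int)

# Keep the labels aside, because TARGET does not exist in the test set
toy_labels = toy_train['TARGET']

# join='inner' with axis=1 keeps only the columns present in both frames
toy_train, toy_test = toy_train.align(toy_test, join='inner', axis=1)

# Put the labels back after the alignment removed them
toy_train['TARGET'] = toy_labels
print(toy_train.columns.tolist())
print(toy_test.columns.tolist())
```

`CONTRACT_C` disappears from the training frame because the test frame has no such column, and `TARGET` survives only because it is added back afterwards.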

In [None]:
train_labels = train['TARGET']

# Align the dataframes, this will remove the 'TARGET' column
train, test = train.align(test, join = 'inner', axis = 1)

train['TARGET'] = train_labels

In [None]:
print('Training Data Shape : ', train.shape)
print('Testing Data Shape : ', test.shape)

In [None]:
train.head(5)

In [None]:
test.head(5)

>The dataframes now have the same columns (with the exception of the TARGET column in the training data). This means we can use them in a machine learning model which needs to see the same columns in both the training and testing dataframes.

>Let's now look at the percentage of missing values in the testing data so we can figure out the columns that should be dropped.

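The training and testing data now have the same columns; the only difference is the extra `TARGET` column in the training data.

Next, we look at the missing values in the testing data to decide which columns to keep and which to drop.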

In [None]:
missing_test = missing_values_table(test)
missing_test.head(10)

In [None]:
missing_test_vars = list(missing_test.index[missing_test['% of Total Values'] > 90])
len(missing_test_vars)

In [None]:
# A set removes duplicate entries. Both lists are empty here anyway,
# so treat this only as a reference for how to build the combined list.
missing_columns = list(set(missing_test_vars + missing_train_vars))
print('There are %d columns with more than 90%% missing in either the training or testing data.'
     % len(missing_columns))

In [None]:
# Reference: how a set removes duplicates
set([1,2,3] + [3,10,11])

In [None]:
# Drop the missing columns
# This drop is a no-op here because missing_columns is empty
train = train.drop(columns=missing_columns)
test = test.drop(columns=missing_columns)

>We ended up removing no columns in this round because there are no columns with more than 90% missing values. We might have to apply another feature selection method to reduce the dimensionality.

In the end no columns were dropped, so another method is needed to reduce the dimensionality.

>At this point we will save both the training and testing data. I encourage anyone to try different percentages for dropping the missing columns and compare the outcomes.

The suggestion is to try other percentage thresholds for dropping columns with missing values and compare the results.

For now, let's save the training and testing data to files.
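
One possible way to try several thresholds is sketched below. The helper `columns_above_missing` and the toy frames are hypothetical and not part of the original kernel; the helper only mirrors the "either the training or testing data" rule described above.

```python
import numpy as np
import pandas as pd

def columns_above_missing(train_df, test_df, threshold_pct):
    """Columns whose missing percentage exceeds threshold_pct in either frame."""
    train_pct = 100 * train_df.isnull().mean()
    test_pct = 100 * test_df.isnull().mean()
    over = set(train_pct.index[train_pct > threshold_pct])
    over |= set(test_pct.index[test_pct > threshold_pct])
    return sorted(over)

# Made-up frames purely for illustration
train_toy = pd.DataFrame({'a': [1, np.nan, np.nan, np.nan],
                          'b': [1, 2, 3, 4],
                          'c': [1, 2, np.nan, 4]})
test_toy = pd.DataFrame({'a': [1, 2, 3, 4],
                         'b': [np.nan, np.nan, 3, 4],
                         'c': [1, 2, 3, 4]})

# Lower thresholds flag more columns
for pct in (90, 60, 40):
    print(pct, columns_above_missing(train_toy, test_toy, pct))
```

Each resulting list could be passed to `drop(columns=...)` on both dataframes and the model scores compared.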

In [None]:
train.to_csv('train_bureau_raw.csv', index=False)
test.to_csv('test_bureau_raw.csv', index=False)

### Correlations

>First let's look at the correlations of the variables with the target. We can see if any of the variables we created have a greater correlation than those already present in the training data (from application).

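Here we check whether any of the variables we created correlate with the target more strongly than the variables already present in the `application` data.

Let's calculate.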

In [None]:
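# Calculate all correlations between the numeric columns
# (selecting numeric dtypes first avoids errors from object columns in newer pandas)
corrs = train.select_dtypes(include='number').corr()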

In [None]:
corrs = corrs.sort_values('TARGET', ascending = False)

# Ten most positive correlations
pd.DataFrame(corrs['TARGET'].head(10))

In [None]:
# Ten most negative correlations
pd.DataFrame(corrs['TARGET'].dropna().tail(10))

>The highest correlated variable with the target (other than the TARGET which of course has a correlation of 1), is a variable we created. However, just because the variable is correlated does not mean that it will be useful, and we have to remember that if we generate hundreds of new variables, some are going to be correlated with the target simply because of random noise.

>Viewing the correlations skeptically, it does appear that several of the newly created variables may be useful. To assess the "usefulness" of variables, we will look at the feature importances returned by the model. For curiosity's sake (and because we already wrote the function) we can make a kde plot of two of the newly created variables.

In [None]:
kde_target(var_name='client_bureau_balance_MONTHS_BALANCE_count_mean', df=train)

>This variable represents the average number of monthly records per loan for each client. For example, if a client had three previous loans with 3, 4, and 5 records in the monthly data, the value of this variable for them would be 4. Based on the distribution, clients with a greater number of average monthly records per loan were more likely to repay their loans with Home Credit. Let's not read too much into this value, but it could indicate that clients who have had more previous credit history are generally more likely to repay a loan.

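That is roughly right. The original kernel hit a KeyError at this point, so I could not tell which variable it used; judging from the description it appears to be `client_bureau_balance_MONTHS_BALANCE_count_mean`, so I used that.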
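
The worked example in the quote (three loans with 3, 4, and 5 monthly records) can be reproduced with a small sketch. The frames `bureau_toy` and `balance_toy` are made up, and the two-step aggregation is an assumption about how such a feature is built rather than a copy of the earlier cells.

```python
import pandas as pd

# Hypothetical data: one client with three previous loans
bureau_toy = pd.DataFrame({'SK_ID_CURR': [100, 100, 100],
                           'SK_ID_BUREAU': [1, 2, 3]})

# Monthly records: loan 1 has 3 rows, loan 2 has 4, loan 3 has 5
balance_toy = pd.DataFrame({
    'SK_ID_BUREAU': [1] * 3 + [2] * 4 + [3] * 5,
    'MONTHS_BALANCE': [0, -1, -2, 0, -1, -2, -3, 0, -1, -2, -3, -4],
})

# Step 1: number of monthly records per loan
per_loan = (balance_toy.groupby('SK_ID_BUREAU')['MONTHS_BALANCE']
            .count().rename('MONTHS_BALANCE_count').reset_index())

# Step 2: attach the client id, then average the counts per client
per_client = (bureau_toy.merge(per_loan, on='SK_ID_BUREAU', how='left')
              .groupby('SK_ID_CURR')['MONTHS_BALANCE_count'].mean())
print(per_client)
```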

In [None]:
kde_target(var_name='bureau_CREDIT_ACTIVE_Active_count_norm', df=train)

>Well this distribution is all over the place. This variable represents the number of previous loans with a CREDIT_ACTIVE value of Active divided by the total number of previous loans for a client. The correlation here is so weak that I do not think we should draw any conclusions!

This variable's distribution is all over the place.

The correlation here is so weak that no conclusion is drawn from it.
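
For reference, the normalized count described in the quote (Active loans divided by all previous loans of a client) can be sketched as follows. The frame `bureau_toy` is made up, and taking the mean of a 0/1 indicator is just one way to compute that ratio.

```python
import pandas as pd

# Hypothetical previous loans for two clients
bureau_toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 1, 2, 2],
    'CREDIT_ACTIVE': ['Active', 'Closed', 'Closed', 'Closed', 'Active', 'Active'],
})

# The mean of a 0/1 indicator per client equals the fraction of that
# client's loans whose CREDIT_ACTIVE value is 'Active'
is_active = (bureau_toy['CREDIT_ACTIVE'] == 'Active').astype(int)
active_norm = is_active.groupby(bureau_toy['SK_ID_CURR']).mean()
print(active_norm)
```

Since the value is a ratio over a small number of loans, it tends to pile up at a few fractions such as 0, 0.5, and 1, which may be part of why the distribution looks so irregular.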

### Collinear Variables

>We can calculate not only the correlations of the variables with the target, but also the correlation of each variable with every other variable. This will allow us to see if there are highly collinear variables that should perhaps be removed from the data.
>Let's look for any variables that have a greater than 0.8 correlation with other variables.

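In other words, instead of correlations with `TARGET`, we now look at the correlations among the variables themselves.

Variables that are very strongly correlated with each other are largely redundant, so it is better to remove one of them.

Let's find the variables with a correlation coefficient greater than 0.8.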

In [None]:
# Set the threshold
threshold = 0.8

# Empty dictionary to hold correlated variables
above_threshold_vars = {}

# For each column, record the variables that are above the threshold
for col in corrs:
    above_threshold_vars[col] = list(corrs.index[corrs[col] > threshold])

>For each of these pairs of highly correlated variables, we only want to remove one of the variables. The following code creates a set of variables to remove by only adding one of each pair.

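The following code picks the columns to drop. Every key in the dictionary also appears among its own values (a variable correlates perfectly with itself), so only the other entries are candidates, and only one variable of each pair is added to the removal list.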

In [None]:
above_threshold_vars

In [None]:
# Track columns to remove and columns already examined
cols_to_remove = []
cols_seen = []
cols_to_remove_pair = []

# Iterate through columns and correlated columns
# Looping over dict.items() yields each key together with its value

for key, value in above_threshold_vars.items():
    # Keep track of columns already examined
    cols_seen.append(key)
    for x in value:
        # Skip the column's correlation with itself
        if x == key:
            continue
        # Only want to remove one in a pair
        if x not in cols_seen:
            cols_to_remove.append(x)
            cols_to_remove_pair.append(key)

cols_to_remove = list(set(cols_to_remove))
print('Number of columns to remove : ', len(cols_to_remove))
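
An alternative way to pick one column per correlated pair is to scan only the upper triangle of the correlation matrix. The sketch below is not the kernel's method: the helper `collinear_columns` and the toy data are hypothetical, and it uses the absolute correlation, whereas the loop above compares the signed value and therefore does not flag strongly negative pairs.

```python
import numpy as np
import pandas as pd

def collinear_columns(df, threshold=0.8):
    """Return one column from each pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Made-up data: b is a multiple of a, c is a noisy negative copy of a, d is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({'a': a,
                    'b': 2 * a,
                    'c': -a + 0.01 * rng.normal(size=500),
                    'd': rng.normal(size=500)})
print(collinear_columns(toy))
```

Which member of a pair is kept depends only on column order here, so the result may differ from the loop above even on the same data.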

>We can remove these columns from both the training and the testing datasets. We will have to compare performance after removing these variables with performance keeping these variables (the raw csv files we saved earlier).

Now we drop the selected columns from both the training and testing data.

Later we will compare performance with and without these columns, using the raw csv files saved earlier for the latter.

In [None]:
train_corrs_removed = train.drop(columns=cols_to_remove)
test_corrs_removed = test.drop(columns=cols_to_remove)

In [None]:
print('Training Corrs Removed Shape : ', train_corrs_removed.shape)
print('Testing Corrs Removed Shape : ', test_corrs_removed.shape)

In [None]:
train_corrs_removed.to_csv('train_bureau_corrs_removed.csv', index=False)
test_corrs_removed.to_csv('test_bureau_corrs_removed.csv', index=False)

# Modeling


>To actually test the performance of these new datasets, we will try using them for machine learning! Here we will use a function I developed in another notebook to compare the features (the raw version with the highly correlated variables removed). We can run this kind of like an experiment, and the control will be the performance of just the `application` data in this function when submitted to the competition. I've already recorded that performance, so we can list out our control and our two test conditions:

>For all datasets, use the model shown below (with the exact hyperparameters).

- control: only the data in the application files.
- test one: the data in the application files with all of the data recorded from the bureau and bureau_balance files
- test two: the data in the application files with all of the data recorded from the bureau and bureau_balance files with highly correlated variables removed.


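Now we test the datasets prepared so far.

We plug each dataset into the `model` function from the [previous notebook](https://www.kaggle.com/sanholee/home-credit-kor-ver) and compare the performance of the resulting machine learning models.

The three datasets to compare are the ones listed above.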

In [None]:
import lightgbm as lgb

from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

import gc #garbage collector

import matplotlib.pyplot as plt

In [None]:
# Function defined in the previous notebook
def model(features, test_features, encoding = 'ohe', n_folds = 5):
    """
    Train and Test a light gradient boosting model using cross validation.
    
    Parameters:
    -----------
        features(pd.DataFrame):
            dataframe of training feature to use
            for training a model. Must include Target column.
        test_features(pd.DataFrame):
            dataframe of testing feature to use
            for making predictions with the model.
        encoding(str, default = 'ohe'):
            method for encoding categorical variables. Either 'ohe' for one-hot encoding or 'le' for integer label encoding
        n_folds(int, default = 5): number of folds to use for cross validation
        
    Return:
    -------
        submission(pd.DataFrame):
            dataframe with 'SK_ID_CURR' and 'TARGET' probabilities
            predicted by the model.
        feature_importances(pd.DataFrame):
            dataframe with the feature importances from the model.
        valid_metrics(pd.DataFrame):
            dataframe with training and validation metrics(ROC AUC) for each fold and overall.
    """
    
    # Extract the ids
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    
    # Extract the labels for training
    labels = features['TARGET']
    
    # Remove the ids and target
    features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
    test_features = test_features.drop(columns = ['SK_ID_CURR'])
    
    
    # One Hot Encoding
    if encoding == 'ohe':
        features = pd.get_dummies(features)
        test_features = pd.get_dummies(test_features)
        
        # Align the dataframes by the columns
        features, test_features = features.align(test_features, join = 'inner', axis = 1)
        
        # No categorical indices to record
        cat_indices = 'auto'
    
    # Integer label encoding
    elif encoding == 'le':
        
        # Create a label encoder
        label_encoder = LabelEncoder()
        
        # List for storing categorical indices
        cat_indices = []
        
        # Iterate through each column
        for i, col in enumerate(features):
            if features[col].dtype == 'object':
                # Map the categorical features to integers
                features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
                test_features[col] = label_encoder.transform(np.array(test_features[col].astype(str)).reshape((-1,)))

                # Record the categorical indices
                cat_indices.append(i)
    
    # Catch error if label encoding scheme is not valid
    else:
        raise ValueError("Encoding must be either 'ohe' or 'le'")
        
    print('Training Data Shape: ', features.shape)
    print('Testing Data Shape: ', test_features.shape)
    
    # Extract feature names
    feature_names = list(features.columns)
    
    # Convert to np arrays
    features = np.array(features)
    test_features = np.array(test_features)
    
    # Create the kfold object
    k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)
    
    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))
    
    # Empty array for test predictions
    test_predictions = np.zeros(test_features.shape[0])
    
    # Empty array for out of fold validation predictions
    out_of_fold = np.zeros(features.shape[0])
    
    # Lists for recording validation and training scores
    valid_scores = []
    train_scores = []
    
    # Iterate through each fold
    for train_indices, valid_indices in k_fold.split(features):
        
        # Training data for the fold
        train_features, train_labels = features[train_indices], labels[train_indices]
        # Validation data for the fold
        valid_features, valid_labels = features[valid_indices], labels[valid_indices]
        
        # Create the model
        model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary', 
                                   class_weight = 'balanced', learning_rate = 0.05, 
                                   reg_alpha = 0.1, reg_lambda = 0.1, 
                                   subsample = 0.8, n_jobs = -1, random_state = 50)
        
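        # Train the model
        # (early stopping and logging are passed as callbacks because LightGBM 4.0
        # removed the early_stopping_rounds and verbose arguments of fit)
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = cat_indices,
                  callbacks = [lgb.early_stopping(100), lgb.log_evaluation(200)])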
        
        # Record the best iteration
        best_iteration = model.best_iteration_
        
        # Record the feature importances
        feature_importance_values += model.feature_importances_ / k_fold.n_splits
        
        # Make predictions
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
        
        # Record the out of fold predictions
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        
        # Record the best score
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']
        
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        
        # Clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        
    # Make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    
    # Make the feature importance dataframe
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
    
    # Overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)
    
    # Add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    
    # Needed for creating dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    
    # Dataframe of validation scores
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores}) 
    
    return submission, feature_importances, metrics
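
The cross-validation bookkeeping inside `model` (out-of-fold predictions for the training rows, fold-averaged predictions for the test rows) can be seen in isolation in the sketch below. It is only an illustration under assumptions: synthetic data and scikit-learn's `LogisticRegression` stand in for the real features and for LightGBM.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic data: first 500 rows act as training data, last 100 as test data
X_all, y_all = make_classification(n_samples=600, n_features=10, n_informative=3,
                                   n_clusters_per_class=1, random_state=50)
X, y, X_test = X_all[:500], y_all[:500], X_all[500:]

k_fold = KFold(n_splits=5, shuffle=True, random_state=50)
out_of_fold = np.zeros(X.shape[0])
test_predictions = np.zeros(X_test.shape[0])

for train_idx, valid_idx in k_fold.split(X):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    # Each training row is scored exactly once, by a model that never saw it
    out_of_fold[valid_idx] = clf.predict_proba(X[valid_idx])[:, 1]
    # Test predictions are averaged over the fold models
    test_predictions += clf.predict_proba(X_test)[:, 1] / k_fold.n_splits

print('Overall validation AUC:', roc_auc_score(y, out_of_fold))
```

The overall validation score is computed once on the pooled out-of-fold predictions, which is what the `overall` row of the metrics dataframe reports for the validation column.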

In [None]:
# Function to plot the feature importances
def plot_feature_importances(df):
    """
    Plot importances returned by a model. This can work with any measure of 
    feature importance provided that higher importance is better.
    
    Args:
        df(dataframe): feature importances. Must have the features in a column
        called 'feature' and the importances in a column called 'importance'
        
    Returns:
        shows a plot of the 15 most important features
        
        df(dataframe): feature importances sorted by importance (highest to lowest)
        with a column for normalized importance
    """
    
    # Sort features according to importance
    # reset_index() renumbers the rows from 0 after sorting in descending order
    df = df.sort_values('importance', ascending = False).reset_index()
    
    # Normalize the feature importances to add up to one
    # Dividing each importance by the total makes the values sum to 1, so each lies in [0, 1]
    df['importance_normalized'] = df['importance'] / df['importance'].sum()
    
    # Make a horizontal bar chart of feature importances
    plt.figure(figsize=(10,6))
    ax = plt.subplot()
    
    # Need to reverse the index to plot most important on top
    # reversed() returns an iterator rather than a list,
    # so it is wrapped in list() before being passed to barh
    ax.barh(list(reversed(list(df.index[:15]))), 
            df['importance_normalized'].head(15), 
            align='center', 
            edgecolor='k')
    
    # Set the yticks (bar positions) and labels (feature names)
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))
    
    # Plot labeling
    plt.xlabel('Normalized Importance')
    plt.title('Feature Importances')
    plt.show()
    
    return df

### Control

The control here is the baseline model.

Using the function above, we train a Gradient Boosting Machine on the base `application` data alone and make predictions as the control.

In [None]:
train_control = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
test_control = pd.read_csv('../input/home-credit-default-risk/application_test.csv')

>Fortunately, once we have taken the time to write a function, using it is simple (if there's a central theme in this notebook, it's use functions to make things simpler and reproducible!). The function above returns a submission dataframe we can upload to the competition, a `fi` dataframe of feature importances, and a metrics dataframe with training and validation performance.

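With the function already written, repetitive work becomes simple.

The function returns a submission dataframe for the competition, a dataframe `fi` of feature importances, and a `metrics` dataframe with the training and validation scores that describe the model's performance.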

In [None]:
submission, fi, metrics = model(train_control, test_control)

In [None]:
metrics

>The control slightly overfits because the training score is higher than the validation score. We can address this in later notebooks when we look at regularization (we already perform some regularization in this model by using `reg_lambda` and `reg_alpha` as well as early stopping).

>We can visualize the feature importance with another function, `plot_feature_importances`. The feature importances may be useful when it's time for feature selection.

The control slightly overfits: the training score is higher than the validation score.
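
As for using the importances for feature selection, one common option is to keep the smallest set of top features that reaches a chosen share of the total importance. The sketch below is an assumption about how that could look, not something the kernel does; the helper `features_for_cumulative_importance` and the toy table are hypothetical.

```python
import pandas as pd

def features_for_cumulative_importance(fi_df, threshold=0.95):
    """Smallest set of top features whose normalized importance reaches threshold."""
    ranked = fi_df.sort_values('importance', ascending=False).reset_index(drop=True)
    ranked['importance_normalized'] = ranked['importance'] / ranked['importance'].sum()
    ranked['cumulative_importance'] = ranked['importance_normalized'].cumsum()
    # Count the features strictly below the threshold, plus the one that crosses it
    n_keep = int((ranked['cumulative_importance'] < threshold).sum()) + 1
    return list(ranked['feature'].head(n_keep))

# Made-up importances for illustration only
fi_toy = pd.DataFrame({'feature': ['f1', 'f2', 'f3', 'f4'],
                       'importance': [50, 30, 15, 5]})
print(features_for_cumulative_importance(fi_toy, threshold=0.9))
```

The input has the same `feature` and `importance` columns as the `fi` dataframe returned by `model`, so the helper could be applied to it directly.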

In [None]:
fi.head()

In [None]:
fi_sorted = plot_feature_importances(fi)

In [None]:
submission.to_csv('control.csv', index=False)

> The control scores 0.745 when submitted to the competition.

In my case the score came out about the same.



### Test One

Now we test the following against that baseline.

We use the `train` and `test` dataframes built above, which add the features engineered from `bureau` and `bureau_balance` to the application data.

In [None]:
submission_raw, fi_raw, metrics_raw = model(train, test)

In [None]:
metrics_raw

>Based on these numbers, the engineered features perform better than the control case. However, we will have to submit the predictions to the leaderboard before we can say if this better validation performance transfers to the testing data.


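Going by the numbers alone, the dataset with the newly engineered features performs better than the control dataset.

However, we only know whether the improvement is real once we submit and see the leaderboard score.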

In [None]:
# Check the feature importances
fi_raw_sorted = plot_feature_importances(fi_raw)

>Examining the feature importances, it looks as if a few of the features we constructed are among the most important. Let's find the percentage of the top 100 most important features that we made in this notebook. However, rather than just compare to the original features, we need to compare to the one-hot encoded original features. These are already recorded for us in `fi` (from the original data).

We take the top 100 features by importance and compute what percentage of them were created in this notebook.

In [None]:
# Keep only the top 100 features
top_100 = list(fi_raw_sorted['feature'])[:100]
# Of those 100, keep the features that are not in fi (the control); these are the new features, so there will be fewer than 100
new_feature = [x for x in top_100 if x not in list(fi['feature'])]

print('%% of Top 100 Features created from the bureau data = %d.00' % len(new_feature))


>Over half of the top 100 features were made by us! That should give us confidence that all the hard work we did was worthwhile.

The reference kernel got 53%, while I got 50%. Either way, it is encouraging that about half of the top 100 features are ones we engineered.
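
The print statement above equals a percentage only because exactly 100 features are inspected. A slightly more general version is sketched below; the helper `pct_new_in_top` and the toy tables are hypothetical and only illustrate the counting.

```python
import pandas as pd

def pct_new_in_top(fi_new_sorted, fi_baseline, n=100):
    """Percentage of the top-n features that do not appear among the baseline features."""
    top_n = list(fi_new_sorted['feature'].head(n))
    baseline = set(fi_baseline['feature'])  # set membership tests are fast
    n_new = sum(name not in baseline for name in top_n)
    return 100 * n_new / len(top_n)

# Made-up importance tables, already sorted, for illustration only
fi_new_toy = pd.DataFrame({'feature': ['bureau_x', 'EXT_SOURCE_1', 'bureau_y', 'AMT_CREDIT']})
fi_base_toy = pd.DataFrame({'feature': ['EXT_SOURCE_1', 'AMT_CREDIT', 'DAYS_BIRTH']})
print(pct_new_in_top(fi_new_toy, fi_base_toy, n=4))
```

With the notebook's own objects this would be called as `pct_new_in_top(fi_raw_sorted, fi, n=100)`.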

In [None]:
submission_raw.to_csv('test_one.csv', index=False)

The reference kernel scored 0.759. Here ...

### Test Two

>That was easy, so let's do another run! Same as before but with the highly collinear variables removed.

We now test with the dataset from which the highly correlated features were removed.

In [None]:
submission_corrs, fi_corrs, metrics_corrs = model(train_corrs_removed, test_corrs_removed)

In [None]:
metrics_corrs

>These results are better than the control, but slightly lower than the raw features.
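
To compare the three experiments side by side, the `overall` rows of the metrics dataframes can be collected into one table. The helper `overall_scores` below is hypothetical, and the numbers in the toy frames are placeholders, not results from this notebook.

```python
import pandas as pd

def overall_scores(named_metrics):
    """Collect the 'overall' row of each metrics dataframe into one comparison table."""
    rows = []
    for name, m in named_metrics.items():
        overall = m.loc[m['fold'] == 'overall'].iloc[0]
        rows.append({'experiment': name,
                     'train': overall['train'],
                     'valid': overall['valid']})
    return pd.DataFrame(rows)

# Placeholder numbers only, shaped like the output of `model`
m_a = pd.DataFrame({'fold': [0, 1, 'overall'],
                    'train': [0.80, 0.82, 0.81],
                    'valid': [0.70, 0.72, 0.71]})
m_b = pd.DataFrame({'fold': [0, 1, 'overall'],
                    'train': [0.85, 0.83, 0.84],
                    'valid': [0.74, 0.76, 0.75]})
comparison = overall_scores({'a': m_a, 'b': m_b})
print(comparison)
```

In this notebook the call would be `overall_scores({'control': metrics, 'raw': metrics_raw, 'corrs_removed': metrics_corrs})`.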


In [None]:
fi_corrs_sorted = plot_feature_importances(fi_corrs)

In [None]:
submission_corrs.to_csv('test_two.csv', index=False)

In the reference kernel, test two scored 0.753. Here ...