# Home Credit Default Risk Competition (Introduction KOR)

# Computer Science Project: Risk Management with Machine Learning

### Purpose of this project
 - The goal is a delinquency-prediction project that a bank could use. Because real bank data is hard to obtain, this project is based on the Home Credit Default Risk data on Kaggle.
 - The basic analysis follows the most upvoted notebook on Kaggle, https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction .
 - Basic machine learning knowledge and methods are organized first. The notebook will then be revised and extended to achieve better results and functionality.

### Import package

In [None]:
import os
import gc
import time

import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import MinMaxScaler

import matplotlib.pyplot as plt

import seaborn as sns

from lightgbm import LGBMClassifier

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings('ignore')

from contextlib import contextmanager
import multiprocessing as mp
from functools import partial
from scipy.stats import kurtosis, iqr, skew

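### Representative packages

### - numpy 
NumPy (Numerical Python) is the representative package for writing linear-algebra-based programs in Python. It performs array operations on large data without explicit loops, which makes array computation fast.

### - pandas
pandas is the most popular library for data processing. It provides efficient tools for manipulating two-dimensional data made up of rows and columns.

### - sklearn
scikit-learn is the most widely used Python machine learning library. It provides a variety of machine learning algorithms along with a convenient framework and API for development.
 
### - matplotlib
The matplotlib.pyplot module is a collection of command-style functions that work like MATLAB. Each function makes it easy to create a plot and modify it.

### - seaborn
Seaborn is a visualization library built on matplotlib that provides a high-level interface for drawing informative statistical graphics.

### - lightgbm
Along with XGBoost, LightGBM is one of the most prominent boosting algorithms. Its training time and memory usage are relatively low, and it grows trees leaf-wise.
However, it tends to overfit on small data sets (10,000 rows or fewer).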

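### Data analysis
The data to analyze is the Home Credit Default Risk data on Kaggle. The meaning of each file is as follows.
 - application : information provided at loan application time.
 - Previous application : past loan records.
 - bureau : the applicant's past credit history with other financial institutions, as recorded by a credit bureau (comparable to NICE / KCB in Korea).
 - application_train.csv : main training table.
 - application_test.csv : main test table.
 - bureau.csv : credit information provided by the credit bureau.
 - bureau_balance.csv : monthly balances of previous credits.
 - POS_CASH_balance.csv : credit transaction information.
 - credit_card_balance.csv : monthly credit card balances.
 - previous_application.csv : previous household credit loan information.
 - installments_payments.csv : loan repayment history.
 - HomeCredit_columns_description.csv : description of the columns in each file.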

In [None]:
# List files available
print(os.listdir("../input/home-credit-default-risk/"))

- Training data (train)
- For details on each column, see https://chocoffee20.tistory.com/6 .

In [None]:
# Training data
app_train = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
print('Training data shape: ', app_train.shape)
app_train.head()

- Test data (test)
- Difference from the training data : there is no TARGET column.
- [TARGET] : delinquency label made up of 0 and 1 (default). It has to be predicted for the test data, so the column does not exist there.

In [None]:
# Testing data features : row 48744 / columns : 121
app_test = pd.read_csv('../input/home-credit-default-risk/application_test.csv')
print('Testing data shape: ', app_test.shape)
app_test.head()

In [None]:
app_train['TARGET'].value_counts()

In [None]:
app_train['TARGET'].astype(int).plot.hist();

The result shows that repaid loans far outnumber loans that were not repaid (0 : loan repaid, 1 : loan defaulted), so an imbalanced class problem can be expected.
- To address this, additional work is planned following https://www.kaggle.com/kaanboke/catboost-lightgbm-xgboost-explained-by-shap (Deals With Imbalanced Data).

What is the imbalanced class problem?
- A learning situation in which the majority class has far more samples than the minority class.
- It degrades classification performance.
- See https://blog.naver.com/tjdtjdgus99/222208515494
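
The effect of class imbalance can be illustrated with a small sketch. The 92% / 8% split below is an illustrative assumption, not a figure taken from this notebook. It shows why plain accuracy is misleading here and how "balanced" class weights (the weighting scheme behind `class_weight = 'balanced'`) are computed.

```python
import numpy as np

# Toy labels with an assumed 92% / 8% split (illustrative only).
y = np.array([0] * 92 + [1] * 8)

# A "model" that always predicts the majority class (0 = repaid).
always_zero = np.zeros_like(y)
accuracy = (always_zero == y).mean()
# Fraction of actual defaults that the model flags as default.
recall_default = always_zero[y == 1].mean()

# "Balanced" class weights: n_samples / (n_classes * count_per_class).
counts = np.bincount(y)
class_weights = len(y) / (len(counts) * counts)

print('accuracy:', accuracy)
print('recall on defaults:', recall_default)
print('class weights:', class_weights)
```

The always-zero model reaches high accuracy while catching no defaults at all, which is why a ranking metric such as ROC AUC is used later in this notebook.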

### Data type analysis

In [None]:
# Number of each type of column
app_train.dtypes.value_counts()

- Object type analysis

In [None]:
# Number of unique classes in each object column
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

### How should categorical variables be handled?

### Label encoding
- Converts string values to numbers.
- Function : labelencoder_counter(df)
- Pass the loaded file as df : app_train = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
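
As a minimal sketch of what label encoding does, the toy columns below (hypothetical values, not the real data) are encoded with scikit-learn's `LabelEncoder`. The mapping is learned from the training column only and then reused on the test column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-category columns standing in for a binary object column.
train_col = pd.Series(['M', 'F', 'F', 'M'])
test_col = pd.Series(['F', 'M'])

le = LabelEncoder()
le.fit(train_col)                                # learn the mapping from train only
train_encoded = le.transform(train_col).tolist()
test_encoded = le.transform(test_col).tolist()   # reuse the same mapping on test

classes = le.classes_.tolist()                   # categories in sorted order
print(classes)         # ['F', 'M']
print(train_encoded)   # [1, 0, 0, 1]
print(test_encoded)    # [0, 1]
```

With only two categories the integer codes carry no artificial ordering, which is why the function in this section restricts label encoding to columns with at most two unique values.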

In [None]:
def labelencoder_counter(df) :
    # Create a label encoder object
    le = LabelEncoder()
    le_count = 0

    # Iterate through the columns of the dataframe passed in
    for col in df:
        if df[col].dtype == 'object':
            # If 2 or fewer unique categories
            if len(list(df[col].unique())) <= 2:
                # Train on the training data
                le.fit(df[col])
                # Transform both training and testing data
                # (app_test is the global test dataframe loaded above)
                df[col] = le.transform(df[col])
                app_test[col] = le.transform(app_test[col])

                # Keep track of how many columns were label encoded
                le_count += 1

    print('%d columns were label encoded.' % le_count)

In [None]:
labelencoder_counter(app_train)

### One-hot encoding
- Puts 1 in the column that matches the category value and 0 in the others.
- The training (train) and test data must have the same columns.
- Align on columns : axis = 1

In [None]:
# one-hot encoding of categorical variables
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

 - The training (train) and test data must have the same columns.
 - Align on columns : axis = 1

In [None]:
train_labels = app_train['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)

# Add the target back in
app_train['TARGET'] = train_labels

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
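
The reason alignment is needed can be seen on a toy example. The frames below are hypothetical. A category that appears only in the training data produces a dummy column that the test data lacks, and `align` with `join = 'inner'` keeps only the shared columns.

```python
import pandas as pd

# Hypothetical frames: 'blue' occurs only in the training data.
toy_train = pd.DataFrame({'AMT': [1, 2, 3],
                          'COLOR': ['red', 'green', 'blue'],
                          'TARGET': [0, 1, 0]})
toy_test = pd.DataFrame({'AMT': [4, 5],
                         'COLOR': ['red', 'green']})

toy_train = pd.get_dummies(toy_train)
toy_test = pd.get_dummies(toy_test)

# Save the labels, because TARGET is absent from test and would be dropped.
toy_labels = toy_train['TARGET']

# Keep only the columns present in both frames.
toy_train, toy_test = toy_train.align(toy_test, join='inner', axis=1)
toy_train['TARGET'] = toy_labels

print(list(toy_train.columns))
print(list(toy_test.columns))
```

After alignment the only column that differs between the two frames is `TARGET`, and `COLOR_blue` is gone from the training frame.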

### Correlation analysis
 - Function : correlations(number)
 - number : how many of the most positive and most negative correlations to display.

In [None]:
def correlations(number) :
    # Find correlations with the target and sort
    target_corrs = app_train.corr()['TARGET'].sort_values()

    # Display the `number` strongest correlations in each direction
    print('Most Positive Correlations:\n', target_corrs.tail(number))
    print('\nMost Negative Correlations:\n', target_corrs.head(number))

In [None]:
correlations(10)

### Correlation coefficient
- Common interpretation (by absolute value)
    - .00-.19 : very weak
    - .20-.39 : weak
    - .40-.59 : moderate
    - .60-.79 : strong
    - .80-1.0 : very strong
- DAYS_BIRTH was found to have the highest (most positive) correlation with TARGET.
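
The interpretation scale above can be written as a small helper. `correlation_strength` is a hypothetical name introduced here for illustration, and the example values are made up.

```python
def correlation_strength(r):
    """Map a correlation coefficient to the verbal scale listed above."""
    magnitude = abs(r)
    if magnitude < 0.20:
        return 'very weak'
    if magnitude < 0.40:
        return 'weak'
    if magnitude < 0.60:
        return 'moderate'
    if magnitude < 0.80:
        return 'strong'
    return 'very strong'

print(correlation_strength(0.08))    # very weak
print(correlation_strength(-0.45))   # moderate
print(correlation_strength(0.85))    # very strong
```

The sign is ignored on purpose: a strongly negative coefficient is as informative as a strongly positive one.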

### Checking missing values
 - Function : missing_values_table(df)
 - Pass the loaded file as df : app_train = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
 - For ways to handle missing data, see https://dining-developer.tistory.com/19 .

In [None]:
# Function to calculate missing values by column
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

In [None]:
# Missing values statistics
missing_values = missing_values_table(app_train)
missing_values.head(20)

- For machine learning, these missing values have to be filled in.
- Later in the process, it will be decided whether to keep or drop columns with a high ratio of missing values.
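
One possible way to act on both points is sketched below on a toy frame. The 0.5 cutoff (`max_missing_ratio`) is a hypothetical choice, not a value decided in this project. Columns with too many missing values are dropped and the rest are filled with the column median.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 5.0],        # 25% missing
    'b': [np.nan, np.nan, np.nan, 2.0],  # 75% missing
    'c': [1.0, 2.0, 3.0, 4.0],           # complete
})

max_missing_ratio = 0.5  # hypothetical cutoff

# Share of missing values per column.
missing_ratio = toy.isnull().mean()

# Drop columns above the cutoff, then fill the rest with the column median.
kept = toy.loc[:, missing_ratio <= max_missing_ratio]
filled = kept.fillna(kept.median())

print(missing_ratio)
print(filled)
```

On real data the medians should be computed on the training set only and then reused for the test set.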

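### Checking anomalies
- The column with the highest correlation is analyzed first.
- DAYS_BIRTH : recorded relative to the current loan application, so the values are negative. Dividing by -365 makes the age in years easier to read.
- mean : mean
- min : minimum
- max : maximum
- std : standard deviation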

In [None]:
(app_train['DAYS_BIRTH'] / -365).describe()

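- DAYS_EMPLOYED : an anomaly was found.
- Anomaly : the rows with the minimum value are Pensioner in the NAME_INCOME_TYPE column, that is, pension recipients.
- How to handle these values is still under consideration and will be revised after further data processing.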

In [None]:
(app_train['DAYS_EMPLOYED'] / -365).describe()
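
One common option for such values, shown here only as a sketch and not as the approach chosen for this project, is to flag the anomalous rows and replace the value with NaN so that it is imputed later. The sentinel `365243` (about 1,000 years) is an assumption based on the referenced Kaggle notebook and should be checked against the `describe()` output above.

```python
import numpy as np
import pandas as pd

ANOMALY_VALUE = 365243  # assumed sentinel value, verify on the real data

# Toy stand-in for app_train (hypothetical values).
toy = pd.DataFrame({'DAYS_EMPLOYED': [-637.0, 365243.0, -3039.0, 365243.0]})

# Keep a flag so a model can still use "was anomalous" as a feature.
toy['DAYS_EMPLOYED_ANOM'] = toy['DAYS_EMPLOYED'] == ANOMALY_VALUE

# Replace the sentinel with NaN so it is handled by imputation later.
toy['DAYS_EMPLOYED'] = toy['DAYS_EMPLOYED'].mask(toy['DAYS_EMPLOYED_ANOM'])

print(toy)
```

The same two steps would have to be applied to the test data so that both frames keep identical columns.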

### Kernel density estimation
- Visualizes the effect of age on the target (using seaborn kdeplot).
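
To show what a kernel density estimate computes, here is a minimal NumPy sketch with a Gaussian kernel. The function name `gaussian_kde_1d`, the sample ages, and the fixed bandwidth are all illustrative assumptions. seaborn selects its bandwidth automatically, so this is not a reproduction of `kdeplot`.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical ages in years.
ages = np.array([25.0, 31.0, 38.0, 44.0, 52.0, 60.0])
grid = np.linspace(0, 100, 1001)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# A density should integrate to roughly 1 (simple Riemann sum).
area = density.sum() * (grid[1] - grid[0])
print('area under the curve: %.4f' % area)
```

Plotting one such curve per target value gives the kind of comparison drawn in the next cell.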

In [None]:
plt.figure(figsize = (10, 8))

# KDE plot of loans that were repaid on time
# (DAYS_BIRTH is stored as negative days, so take the absolute value before converting to years)
sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, 'DAYS_BIRTH'].abs() / 365, label = 'target == 0')

# KDE plot of loans which were not repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, 'DAYS_BIRTH'].abs() / 365, label = 'target == 1')

# Labeling of plot
plt.xlabel('Age (years)'); plt.ylabel('Density'); plt.title('Distribution of Ages');

### Analyzing age in 5-year bins
- The age variable is converted to its absolute value to make it easier to inspect.

In [None]:
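# Convert days since birth to positive values
# (applied to both train and test so the feature has the same sign in both)
app_train['DAYS_BIRTH'] = abs(app_train['DAYS_BIRTH'])
app_test['DAYS_BIRTH'] = abs(app_test['DAYS_BIRTH'])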

In [None]:
# Age information into a separate dataframe (copy to avoid modifying a view of app_train)
age_data = app_train[['TARGET', 'DAYS_BIRTH']].copy()
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365

# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_data.head(10)

In [None]:
# Group by the bin and calculate averages
age_groups  = age_data.groupby('YEARS_BINNED').mean()
age_groups

- The younger the applicant, the higher the default rate.

### EXT_SOURCE 
- The variables with the most negative correlation with the target.
- The larger the value, the higher the repayment rate.

In [None]:
# Extract the EXT_SOURCE variables and show correlations
ext_data = app_train[['TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
ext_data_corrs = ext_data.corr()
ext_data_corrs

In [None]:
plt.figure(figsize = (10, 12))

# iterate through the sources
for i, source in enumerate(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']):
    
    # create a new subplot for each source
    plt.subplot(3, 1, i + 1)
    # plot repaid loans
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, source], label = 'target == 0')
    # plot loans that were not repaid
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, source], label = 'target == 1')
    
    # Label the plots
    plt.title('Distribution of %s by Target Value' % source)
    plt.xlabel('%s' % source); plt.ylabel('Density');
    
plt.tight_layout(h_pad = 2.5)

- Overall, the correlations between the target and these variables are not high, but the variables can still help predict the outcome.

### Polynomial Features
- interaction terms : combine variables and check their relationship with the target.

In [None]:
from sklearn.impute import SimpleImputer

# Make a new dataframe for polynomial features
poly_features = app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'TARGET']]
poly_features_test = app_test[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]

# imputer for handling missing values
# (named `imputer` so the SimpleImputer class itself is not overwritten)
imputer = SimpleImputer(strategy = 'median')

poly_target = poly_features['TARGET']

poly_features = poly_features.drop(columns = ['TARGET'])

# Need to impute missing values
poly_features = imputer.fit_transform(poly_features)
poly_features_test = imputer.transform(poly_features_test)
                                  
# Create the polynomial object with specified degree
poly_transformer = PolynomialFeatures(degree = 3)

In [None]:
# Train the polynomial features
poly_transformer.fit(poly_features)

# Transform the features
poly_features = poly_transformer.transform(poly_features)
poly_features_test = poly_transformer.transform(poly_features_test)
print('Polynomial Features shape: ', poly_features.shape)

In [None]:
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out is its replacement
poly_transformer.get_feature_names_out(input_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH'])[:15]

In [None]:
# Create a dataframe of the features 
poly_features = pd.DataFrame(poly_features, 
                             columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2', 
                                                                               'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Add in the target
poly_features['TARGET'] = poly_target

# Find the correlations with the target
poly_corrs = poly_features.corr()['TARGET'].sort_values()

# Display most negative and most positive
print(poly_corrs.head(10))
print(poly_corrs.tail(5))

In [None]:
# Put test features into dataframe
poly_features_test = pd.DataFrame(poly_features_test, 
                                  columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2', 
                                                                                    'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Merge polynomial features into training dataframe
poly_features['SK_ID_CURR'] = app_train['SK_ID_CURR']
app_train_poly = app_train.merge(poly_features, on = 'SK_ID_CURR', how = 'left')

# Merge polynomial features into testing dataframe
poly_features_test['SK_ID_CURR'] = app_test['SK_ID_CURR']
app_test_poly = app_test.merge(poly_features_test, on = 'SK_ID_CURR', how = 'left')

# Align the dataframes
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)

# Print out the new shapes
print('Training data with polynomial features shape: ', app_train_poly.shape)
print('Testing data with polynomial features shape:  ', app_test_poly.shape)

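- Some of the newly created variables have a stronger correlation with the target than the original variables.
- Combinations of variables with higher correlation will be explored by taking the relationships between variables into account.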

### Baseline
- Logistic Regression
- Encode the categorical variables.
- Fill in missing values and preprocess the data (strategy = 'median').
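
The same three steps can also be chained in a scikit-learn `Pipeline`. The sketch below uses synthetic data from `make_classification` rather than the real application table, and it adds a held-out validation split, which the cells in this section do not do, so the score is only a rough local check.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the application table (not the real data).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.92, 0.08], random_state=50)

# Knock out about 5% of the values to mimic missing entries.
rng = np.random.default_rng(50)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=50)

# Impute, scale, and fit in one object; every step is fitted on train only.
baseline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', MinMaxScaler(feature_range=(0, 1))),
    ('model', LogisticRegression(C=0.0001)),
])
baseline.fit(X_train, y_train)

valid_proba = baseline.predict_proba(X_valid)
valid_auc = roc_auc_score(y_valid, valid_proba[:, 1])
print('Validation ROC AUC: %.3f' % valid_auc)
```

Wrapping the steps this way means the medians and the scaling range learned on the training split are automatically reused when transforming validation or test data.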

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

# Drop the target from the training data
if 'TARGET' in app_train:
    train = app_train.drop(columns = ['TARGET'])
else:
    train = app_train.copy()
    
# Feature names
features = list(train.columns)

# Copy of the testing data
test = app_test.copy()

# Median imputation of missing values
# (named `imputer` so the SimpleImputer class itself is not overwritten)
imputer = SimpleImputer(strategy = 'median')

# Scale each feature to 0-1
scaler = MinMaxScaler(feature_range = (0, 1))

# Fit on the training data
imputer.fit(train)

# Transform both training and testing data
train = imputer.transform(train)
test = imputer.transform(app_test)

# Repeat with the scaler
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)

print('Training data shape: ', train.shape)
print('Testing data shape: ', test.shape)

In [None]:
from sklearn.linear_model import LogisticRegression

# Make the model with the specified regularization parameter
log_reg = LogisticRegression(C = 0.0001)

# Train on the training data
log_reg.fit(train, train_labels)

In [None]:
# Make predictions
# Make sure to select the second column only
log_reg_pred = log_reg.predict_proba(test)[:, 1]

In [None]:
# Submission dataframe
submit = app_test[['SK_ID_CURR']].copy()
submit['TARGET'] = log_reg_pred

submit.head()

In [None]:
# Save the submission to a csv file
submit.to_csv('log_reg_baseline.csv', index = False)

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Make the random forest classifier
random_forest = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)

In [None]:
# Train on the training data
random_forest.fit(train, train_labels)

# Extract feature importances
feature_importance_values = random_forest.feature_importances_
feature_importances = pd.DataFrame({'feature': features, 'importance': feature_importance_values})

# Make predictions on the test data
predictions = random_forest.predict_proba(test)[:, 1]

In [None]:
# Make a submission dataframe
submit = app_test[['SK_ID_CURR']].copy()
submit['TARGET'] = predictions

# Save the submission dataframe
submit.to_csv('random_forest_baseline.csv', index = False)

In [None]:
from sklearn.impute import SimpleImputer

poly_features_names = list(app_train_poly.columns)

# Impute the polynomial features
# (named `imputer` so the SimpleImputer class itself is not overwritten)
imputer = SimpleImputer(strategy = 'median')

poly_features = imputer.fit_transform(app_train_poly)
poly_features_test = imputer.transform(app_test_poly)

# Scale the polynomial features
scaler = MinMaxScaler(feature_range = (0, 1))

poly_features = scaler.fit_transform(poly_features)
poly_features_test = scaler.transform(poly_features_test)

random_forest_poly = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)

In [None]:
# Train on the training data
random_forest_poly.fit(poly_features, train_labels)

# Make predictions on the test data
predictions = random_forest_poly.predict_proba(poly_features_test)[:, 1]

In [None]:
# Make a submission dataframe
submit = app_test[['SK_ID_CURR']].copy()
submit['TARGET'] = predictions

# Save the submission dataframe
submit.to_csv('random_forest_baseline_engineered.csv', index = False)

- The engineered (polynomial) features did not noticeably improve the performance of the Random Forest.
- They can still be applied to a Gradient Boosting Model.

### Feature Importances
- Check which variables are most relevant.

In [None]:
def plot_feature_importances(df):
    """
    Plot importances returned by a model. This can work with any measure of
    feature importance provided that higher importance is better. 
    
    Args:
        df (dataframe): feature importances. Must have the features in a column
        called `feature` and the importances in a column called `importance`
        
    Returns:
        shows a plot of the 15 most important features
        
        df (dataframe): feature importances sorted by importance (highest to lowest) 
        with a column for normalized importance
        """
    
    # Sort features according to importance
    df = df.sort_values('importance', ascending = False).reset_index()
    
    # Normalize the feature importances to add up to one
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()
    
    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:15]))), 
            df['importance_normalized'].head(15), 
            align = 'center', edgecolor = 'k')
    
    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))
    
    # Plot labeling
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()
    
    return df

In [None]:
# Show the feature importances for the default features
feature_importances_sorted = plot_feature_importances(feature_importances)

### Most Important Features
- EXT_SOURCE
- DAYS_BIRTH

### Light Gradient Boosting Machine

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import gc

def model(features, test_features, encoding = 'ohe', n_folds = 5):
    
    """Train and test a light gradient boosting model using
    cross validation. 
    
    Parameters
    --------
        features (pd.DataFrame): 
            dataframe of training features to use 
            for training a model. Must include the TARGET column.
        test_features (pd.DataFrame): 
            dataframe of testing features to use
            for making predictions with the model. 
        encoding (str, default = 'ohe'): 
            method for encoding categorical variables. Either 'ohe' for one-hot encoding or 'le' for integer label encoding
        n_folds (int, default = 5): 
            number of folds to use for cross validation
        
    Return
    --------
        submission (pd.DataFrame): 
            dataframe with `SK_ID_CURR` and `TARGET` probabilities
            predicted by the model.
        feature_importances (pd.DataFrame): 
            dataframe with the feature importances from the model.
        valid_metrics (pd.DataFrame): 
            dataframe with training and validation metrics (ROC AUC) for each fold and overall.
        
    """
    
    # Extract the ids
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    
    # Extract the labels for training
    labels = features['TARGET']
    
    # Remove the ids and target
    features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
    test_features = test_features.drop(columns = ['SK_ID_CURR'])
    
    # One Hot Encoding
    if encoding == 'ohe':
        features = pd.get_dummies(features)
        test_features = pd.get_dummies(test_features)
        
        # Align the dataframes by the columns
        features, test_features = features.align(test_features, join = 'inner', axis = 1)
        
        # No categorical indices to record
        cat_indices = 'auto'
    
    # Integer label encoding
    elif encoding == 'le':
        
        # Create a label encoder
        label_encoder = LabelEncoder()
        
        # List for storing categorical indices
        cat_indices = []
        
        # Iterate through each column
        for i, col in enumerate(features):
            if features[col].dtype == 'object':
                # Map the categorical features to integers
                features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
                test_features[col] = label_encoder.transform(np.array(test_features[col].astype(str)).reshape((-1,)))

                # Record the categorical indices
                cat_indices.append(i)
    
    # Catch error if label encoding scheme is not valid
    else:
        raise ValueError("Encoding must be either 'ohe' or 'le'")

    print('Training Data Shape: ', features.shape)
    print('Testing Data Shape: ', test_features.shape)
    
    # Extract feature names
    feature_names = list(features.columns)
    
    # Convert to np arrays
    features = np.array(features)
    test_features = np.array(test_features)
    
    # Create the kfold object
    k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)
    
    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))
    
    # Empty array for test predictions
    test_predictions = np.zeros(test_features.shape[0])
    
    # Empty array for out of fold validation predictions
    out_of_fold = np.zeros(features.shape[0])
    
    # Lists for recording validation and training scores
    valid_scores = []
    train_scores = []
    
    # Iterate through each fold
    for train_indices, valid_indices in k_fold.split(features):
        
        # Training data for the fold
        train_features, train_labels = features[train_indices], labels[train_indices]
        # Validation data for the fold
        valid_features, valid_labels = features[valid_indices], labels[valid_indices]
        
        # Create the model
        model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary', 
                                   class_weight = 'balanced', learning_rate = 0.05, 
                                   reg_alpha = 0.1, reg_lambda = 0.1, 
                                   subsample = 0.8, n_jobs = -1, random_state = 50)

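        # Train the model. Early stopping and logging are passed as callbacks,
        # because LightGBM 4.x no longer accepts early_stopping_rounds / verbose in fit()
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = cat_indices,
                  callbacks = [lgb.early_stopping(stopping_rounds = 100),
                               lgb.log_evaluation(period = 200)])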
        
        # Record the best iteration
        best_iteration = model.best_iteration_
        
        # Record the feature importances
        feature_importance_values += model.feature_importances_ / k_fold.n_splits
        
        # Make predictions
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
        
        # Record the out of fold predictions
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        
        # Record the best score
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']
        
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        
        # Clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        
    # Make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    
    # Make the feature importance dataframe
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})

    # Overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)
    
    # Add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    
    # Needed for creating dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    
    # Dataframe of validation scores
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores}) 
    
    return submission, feature_importances, metrics

In [None]:
submission, fi, metrics = model(app_train, app_test)
print('Baseline metrics')
print(metrics)

In [None]:
fi_sorted = plot_feature_importances(fi)

In [None]:
submission.to_csv('baseline_lgb.csv', index = False)

### Additional notes
- Following https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features?scriptVersionId=6025993 , the variables below were found to have a meaningful correlation with the outcome.
    - CREDIT_INCOME_PERCENT: the percentage of the credit amount relative to a client's income
    - ANNUITY_INCOME_PERCENT: the percentage of the loan annuity relative to a client's income
    - CREDIT_TERM: the length of the payment in months (since the annuity is the monthly amount due)
    - DAYS_EMPLOYED_PERCENT: the percentage of the days employed relative to the client's age
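
Since the same four ratios have to be computed for both the training and the test data, they can be wrapped in one helper. `add_domain_features` is a hypothetical name and the toy row uses made-up amounts. The ratios follow the cells in this section, so `CREDIT_TERM` is computed as `AMT_ANNUITY / AMT_CREDIT`.

```python
import pandas as pd

def add_domain_features(df):
    """Return a copy of df with the four ratio features added."""
    out = df.copy()
    out['CREDIT_INCOME_PERCENT'] = out['AMT_CREDIT'] / out['AMT_INCOME_TOTAL']
    out['ANNUITY_INCOME_PERCENT'] = out['AMT_ANNUITY'] / out['AMT_INCOME_TOTAL']
    out['CREDIT_TERM'] = out['AMT_ANNUITY'] / out['AMT_CREDIT']
    out['DAYS_EMPLOYED_PERCENT'] = out['DAYS_EMPLOYED'] / out['DAYS_BIRTH']
    return out

# Toy row with made-up values.
toy = pd.DataFrame({'AMT_CREDIT': [400000.0],
                    'AMT_INCOME_TOTAL': [200000.0],
                    'AMT_ANNUITY': [20000.0],
                    'DAYS_EMPLOYED': [-2000.0],
                    'DAYS_BIRTH': [-10000.0]})

toy_domain = add_domain_features(toy)
print(toy_domain.iloc[0])
```

Calling the helper once per frame keeps the train and test features defined identically. `DAYS_EMPLOYED` and `DAYS_BIRTH` should carry the same sign convention in both frames for the last ratio to be comparable.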

In [None]:
app_train_domain = app_train.copy()
app_test_domain = app_test.copy()

app_train_domain['CREDIT_INCOME_PERCENT'] = app_train_domain['AMT_CREDIT'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['ANNUITY_INCOME_PERCENT'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['CREDIT_TERM'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_CREDIT']
app_train_domain['DAYS_EMPLOYED_PERCENT'] = app_train_domain['DAYS_EMPLOYED'] / app_train_domain['DAYS_BIRTH']

In [None]:
app_test_domain['CREDIT_INCOME_PERCENT'] = app_test_domain['AMT_CREDIT'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['ANNUITY_INCOME_PERCENT'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['CREDIT_TERM'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_CREDIT']
app_test_domain['DAYS_EMPLOYED_PERCENT'] = app_test_domain['DAYS_EMPLOYED'] / app_test_domain['DAYS_BIRTH']

In [None]:
app_train_domain['TARGET'] = train_labels

# Test the domain knowledge features
submission_domain, fi_domain, metrics_domain = model(app_train_domain, app_test_domain)
print('Baseline with domain knowledge features metrics')
print(metrics_domain)

In [None]:
fi_sorted = plot_feature_importances(fi_domain)

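### Planned work
- Data preprocessing : in addition to the provided columns, combine them into new variables and analyze on that basis.
- Usability : wrap the visualization and analysis models in functions so that users can use the program more easily.
- Accuracy : revise and improve with reference to various materials to achieve higher accuracy.
- Keep trying new models, following the tips in https://www.kaggle.com/c/kaggle-survey-2021/discussion/279327#1553191 .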