# Kaggle Compettion - Classification with an Academic Success Dataset 2

### 여러가지 모델을 사용한 Modeling
HistGB를 기본 모델로 전처리, feature engineering 등을 사용하여 성능이 0.83~0.84 범위 안에 있다는 것을 확인할 수 있었다. 모델의 성능을 더 높이기 위하여 여러가지 모델을 파이프 라인에 저장하고 ss, mms, onehot encoder, ordinal encoder, log transform, binning, outlier cap 등 의 feature engieneering을 적용한 학습 데이터를 사용하여 성능 테스트를 한다.

### 개요
- **Data Setting**
- **numeric, categorical features setting**
- **feature engineering setting**
- **Modeling**
   - 전체 모델을 기본 설정한 후 전처리, feature engineering을 변경하여 모델 성능 테스트, test 1 ~ test 10
   - 전처리와 feature entineering 중 어떤 조합이 좀더 성능이 나은지 테스트
- **Hyper parameter Tunning**
   - 훈련 데이터의 전처리 mms, onehot 방법을 기본으로하여 각 개별 모델별 hyper paramter tunning
- **Best estimator Submission**
   - hyper parameter tunning의 best estimator 중 상위 3개 모델을 사용하여 GridCV로 재 fitting 후 test data의 예측값 추정 및 submission

## 1. Data Setting

In [1]:
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = 2000

import os
import pickle
import time
from itertools import product

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

plt.rcParams['figure.dpi'] = 100
sns.set_style(
    style='darkgrid', 
    rc={'axes.facecolor': '.9', 'grid.color': '.8'})
sns.set_palette(palette='deep')
sns_c = sns.color_palette(palette='deep')
sns_c

In [2]:
submission_df = pd.read_csv("./data/sample_submission.csv")
train_df = pd.read_csv("./data/train.csv")
test_df = pd.read_csv("./data/test.csv")

submission_df.shape, train_df.shape, test_df.shape

((51012, 2), (76518, 38), (51012, 37))

In [3]:
X = train_df.copy().drop(["id", "Target"], axis=1)
y = train_df["Target"]

X.shape, y.shape

((76518, 36), (76518,))

In [4]:
from sklearn.preprocessing import LabelEncoder

In [5]:
le = LabelEncoder()
y_enc = le.fit_transform(y)
np.unique(y_enc, return_counts=True)

(array([0, 1, 2]), array([25296, 14940, 36282], dtype=int64))

## 2. numeric, categorical features setting

In [6]:
curricular_features = list(X.filter(regex=".*Curricular").columns)
curricular_features

['Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)',
 'Curricular units 1st sem (grade)',
 'Curricular units 1st sem (without evaluations)',
 'Curricular units 2nd sem (credited)',
 'Curricular units 2nd sem (enrolled)',
 'Curricular units 2nd sem (evaluations)',
 'Curricular units 2nd sem (approved)',
 'Curricular units 2nd sem (grade)',
 'Curricular units 2nd sem (without evaluations)']

In [7]:
grade_features = list(X.filter(regex=".*grade").columns)
grade_features

['Previous qualification (grade)',
 'Admission grade',
 'Curricular units 1st sem (grade)',
 'Curricular units 2nd sem (grade)']

In [8]:
rate_features = list(X.filter(regex=".*rate").columns)
rate_features

['Unemployment rate', 'Inflation rate']

In [9]:
age_features = list(X.filter(regex=".*Age*").columns)
age_features

['Age at enrollment']

In [10]:
gdp_features = list(X.filter(regex=".*GDP*").columns)
gdp_features

['GDP']

In [11]:
numeric_features = list(set(curricular_features + grade_features + rate_features + age_features + gdp_features))
categorical_features = list(set(X.columns).difference(numeric_features))

print(f'''
<numeric features>
* total number : {len(numeric_features)}
* total features : {numeric_features}

<categorical features>
* total number : {len(categorical_features)}
* total features : {categorical_features}
''')


<numeric features>
* total number : 18
* total features : ['Curricular units 2nd sem (credited)', 'Curricular units 2nd sem (approved)', 'Curricular units 1st sem (enrolled)', 'Curricular units 1st sem (evaluations)', 'Curricular units 2nd sem (evaluations)', 'Curricular units 2nd sem (grade)', 'Curricular units 1st sem (credited)', 'Age at enrollment', 'Previous qualification (grade)', 'Curricular units 1st sem (approved)', 'Curricular units 1st sem (grade)', 'Curricular units 2nd sem (enrolled)', 'Curricular units 2nd sem (without evaluations)', 'Curricular units 1st sem (without evaluations)', 'Admission grade', 'Unemployment rate', 'GDP', 'Inflation rate']

<categorical features>
* total number : 18
* total features : ['Debtor', 'Tuition fees up to date', 'Marital status', 'Application order', 'Course', 'Nacionality', 'Displaced', "Mother's occupation", 'Scholarship holder', 'Gender', "Mother's qualification", 'Previous qualification', 'Educational special needs', "Father's occupa

### categorical features의 dtype 변경

In [12]:
X[categorical_features] = X[categorical_features].astype("category")
X.dtypes

Marital status                                    category
Application mode                                  category
Application order                                 category
Course                                            category
Daytime/evening attendance                        category
Previous qualification                            category
Previous qualification (grade)                     float64
Nacionality                                       category
Mother's qualification                            category
Father's qualification                            category
Mother's occupation                               category
Father's occupation                               category
Admission grade                                    float64
Displaced                                         category
Educational special needs                         category
Debtor                                            category
Tuition fees up to date                           catego

## 3. feature engineering setting
- Binning, Log transform, Cap outlier 처리 함수 생성

### Binning

In [13]:
def get_lower_uniques(bin_X, thr_num) : 
    
    '''
    feature별 thr_num 보다 갯수가 작은 유니크 데이터를 데이터 프레임으로 반환한다.
    '''
    
    cate_unique_datas = []
    for c in categorical_features : 
        cate_unique_datas.append(
            bin_X[c]\
                .value_counts()\
                .to_frame()\
                .reset_index(names="unique")\
                .assign(column_name=c)\
                .query("count < @thr_num")
        )
    cate_unique_datas = pd.concat(cate_unique_datas, axis=0)
    # .agg(list) : groupby로 취합 된 데이터를 list 형태로 담아서 반환한다.
    lower_count_unique_df = cate_unique_datas.groupby("column_name")["unique"].agg(list).reset_index(name=str(thr_num) + "_lower_uniques")
    
    return lower_count_unique_df

In [14]:
def get_binning_cate_X(bin_X, thr_num) : 
    
    '''
    thr_num 보다 갯수가 작은 유니크 데이터들을 특정값(9999)으로 binning 한다.
    '''
    
    print("~~~ Binning Cate ~~~")
    
    lower_count_unique_df = get_lower_uniques(bin_X, thr_num)
    for ele in lower_count_unique_df.iterrows() : 
        col = ele[1]["column_name"]
        uniques = ele[1][str(thr_num) + "_lower_uniques"]
        bin_X[col] = bin_X[col].apply(lambda x: x if x not in uniques else 9999)
        
        print(f"columns : {col}")
        print(f'''
        * uniques : {uniques}
        * change uniques ea : {bin_X[col].value_counts()[9999]}
        ''')
        print(f"")
        
    return bin_X

In [15]:
transform_X = X.copy()
get_lower_uniques(transform_X, 100)

Unnamed: 0,column_name,100_lower_uniques
0,Application mode,"[5, 10, 2, 27, 4, 26, 35, 12, 9, 3]"
1,Application order,"[0, 9]"
2,Course,"[33, 39, 979]"
3,Father's occupation,"[193, 171, 144, 163, 175, 103, 192, 181, 152, 135, 182, 102, 172, 151, 112, 154, 183, 123, 194, 122, 153, 143, 195, 131, 141, 101, 114, 121, 174, 132, 134, 161, 125, 148, 96, 39, 22, 19, 191, 13, 12, 11, 124]"
4,Father's qualification,"[11, 36, 29, 40, 9, 14, 30, 43, 41, 22, 10, 26, 6, 42, 35, 18, 44, 20, 13, 27, 7, 33, 31, 25, 24, 21, 15, 23]"
5,Marital status,"[6, 3]"
6,Mother's occupation,"[194, 141, 123, 144, 192, 10, 193, 152, 134, 151, 132, 175, 143, 153, 131, 122, 173, 171, 172, 38, 163, 101, 127, 11, 124, 103, 125]"
7,Mother's qualification,"[5, 40, 39, 9, 11, 41, 6, 42, 43, 29, 10, 36, 35, 30, 22, 14, 26, 18, 33, 31, 44, 28, 27, 15, 8, 7]"
8,Nacionality,"[26, 22, 6, 11, 24, 2, 103, 100, 101, 105, 21, 25, 62, 17, 109, 32]"
9,Previous qualification,"[6, 2, 10, 43, 38, 4, 15, 37, 5, 14, 17, 11, 36]"


In [16]:
transform_X = get_binning_cate_X(transform_X, 100)
get_lower_uniques(transform_X, 100)

~~~ Binning Cate ~~~
columns : Application mode

        * uniques : [5, 10, 2, 27, 4, 26, 35, 12, 9, 3]
        * change uniques ea : 146
        

columns : Application order

        * uniques : [0, 9]
        * change uniques ea : 4
        

columns : Course

        * uniques : [33, 39, 979]
        * change uniques ea : 74
        

columns : Father's occupation

        * uniques : [193, 171, 144, 163, 175, 103, 192, 181, 152, 135, 182, 102, 172, 151, 112, 154, 183, 123, 194, 122, 153, 143, 195, 131, 141, 101, 114, 121, 174, 132, 134, 161, 125, 148, 96, 39, 22, 19, 191, 13, 12, 11, 124]
        * change uniques ea : 551
        

columns : Father's qualification

        * uniques : [11, 36, 29, 40, 9, 14, 30, 43, 41, 22, 10, 26, 6, 42, 35, 18, 44, 20, 13, 27, 7, 33, 31, 25, 24, 21, 15, 23]
        * change uniques ea : 321
        

columns : Marital status

        * uniques : [6, 3]
        * change uniques ea : 51
        

columns : Mother's occupation

        * uniques :

Unnamed: 0,column_name,100_lower_uniques
0,Application order,[9999]
1,Course,[9999]
2,Marital status,[9999]


### Log Transform

In [15]:
def get_log_transform_X(trans_X) :
    
    '''
    데이터를 log transform 하여 반환한다.
    '''
    
    trans_X[numeric_features] = trans_X[numeric_features].apply(lambda x: (x - x.min() + 1).transform(np.log), axis=1)
    
    return trans_X

In [18]:
transform_X = get_log_transform_X(transform_X)
transform_X[numeric_features].head()

Unnamed: 0,Curricular units 1st sem (enrolled),Curricular units 2nd sem (credited),Curricular units 2nd sem (without evaluations),Age at enrollment,Curricular units 2nd sem (evaluations),GDP,Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 1st sem (grade),Inflation rate,Curricular units 1st sem (evaluations),Curricular units 1st sem (approved),Admission grade,Unemployment rate,Curricular units 1st sem (without evaluations),Previous qualification (grade),Curricular units 2nd sem (enrolled),Curricular units 1st sem (credited)
0,1.94591,0.0,0.0,2.944439,2.079442,1.105257,1.94591,2.597385,2.74084,0.470004,1.94591,1.94591,4.817051,2.493205,0.0,4.844187,1.94591,0.0
1,1.94591,0.0,0.0,2.944439,2.302585,1.105257,0.0,0.0,2.533697,0.470004,2.197225,1.609438,4.794136,2.493205,0.0,4.836282,1.94591,0.0
2,2.069391,0.652325,0.652325,2.991724,0.652325,0.0,0.652325,0.652325,0.652325,0.797507,0.652325,0.652325,4.987844,2.897016,0.652325,4.933898,2.069391,0.652325
3,2.079442,0.0,0.0,2.944439,2.484907,1.105257,2.079442,2.626117,2.609426,0.470004,2.302585,2.079442,4.844974,2.493205,0.0,4.882802,2.197225,0.0
4,2.079442,0.0,0.0,2.944439,2.564949,0.277632,1.94591,2.634284,2.634284,1.280934,2.564949,1.94591,4.796617,2.151762,0.0,4.890349,2.079442,0.0


### Cap Outlier

In [16]:
def find_outlier(X, col) : 

    upper_lim_std = X[col].mean() + (X[col].std() * 3)
    lower_lim_std = X[col].mean() - (X[col].std() * 3)
    
    upper_lim_quant = X[col].quantile(.95)
    lower_lim_quant = X[col].quantile(.05)

    print(f'''
    [ {col}'s outlier ]
    * std
      - upper lim : {upper_lim_std.round(3)}, nlower lim : {lower_lim_std.round(3)}
      - upper lim length : {len(X[X[col] > upper_lim_std][col])}, lower lim length : {len(X[X[col] < lower_lim_std][col])}
      - outlier / length : { ( len(X[X[col] > upper_lim_std][col]) + len(X[X[col] < lower_lim_std][col]) ) / X.shape[0] :.3f}
    * quantile  
      - upper lim : {upper_lim_quant.round(3)}, nlower lim : {lower_lim_quant.round(3)}
      - upper lim length : {len(X[X[col] > upper_lim_quant][col])}, lower lim length : {len(X[X[col] < lower_lim_quant][col])}
      - outlier / length : { ( len(X[X[col] > upper_lim_quant][col]) + len(X[X[col] < lower_lim_quant][col]) ) / X.shape[0] :.3f}
    ''')
    std_percentage = ( len(X[X[col] > upper_lim_std][col]) + len(X[X[col] < lower_lim_std][col]) ) / X.shape[0]
    quant_percentage = ( len(X[X[col] > upper_lim_quant][col]) + len(X[X[col] < lower_lim_quant][col]) ) / X.shape[0]
    
    return std_percentage, quant_percentage

In [20]:
print("< find outlier by std > \n")
outlier_results = []
for c in numeric_features : 
    std_percent, quant_percent = find_outlier(X, c)
    outlier_results.append(pd.DataFrame({
        "column": c,
        "std_outlier": std_percent,
        "quant_outlier": quant_percent
    }, index=[0]))

< find outlier by std > 


    [ Curricular units 1st sem (enrolled)'s outlier ]
    * std
      - upper lim : 10.907, nlower lim : 0.876
      - upper lim length : 1071, lower lim length : 2671
      - outlier / length : 0.049
    * quantile  
      - upper lim : 8.0, nlower lim : 5.0
      - upper lim length : 1480, lower lim length : 2990
      - outlier / length : 0.058
    

    [ Curricular units 2nd sem (credited)'s outlier ]
    * std
      - upper lim : 2.939, nlower lim : -2.664
      - upper lim length : 1442, lower lim length : 0
      - outlier / length : 0.019
    * quantile  
      - upper lim : 0.0, nlower lim : 0.0
      - upper lim length : 2709, lower lim length : 0
      - outlier / length : 0.035
    

    [ Curricular units 2nd sem (without evaluations)'s outlier ]
    * std
      - upper lim : 1.449, nlower lim : -1.324
      - upper lim length : 1118, lower lim length : 0
      - outlier / length : 0.015
    * quantile  
      - upper lim : 0.0, nlower lim : 0.0

In [21]:
outlier_result_df = pd.concat(outlier_results, axis=0).sort_values("std_outlier", ascending=False)
outlier_result_df

Unnamed: 0,column,std_outlier,quant_outlier
0,Curricular units 1st sem (enrolled),0.048904,0.058418
0,Curricular units 2nd sem (enrolled),0.045597,0.053791
0,Age at enrollment,0.031914,0.048864
0,Curricular units 1st sem (credited),0.019747,0.04037
0,Curricular units 2nd sem (credited),0.018845,0.035403
0,Curricular units 2nd sem (without evaluations),0.014611,0.028046
0,Curricular units 1st sem (without evaluations),0.014088,0.032006
0,Curricular units 1st sem (evaluations),0.00707,0.035207
0,Previous qualification (grade),0.005881,0.088162
0,Curricular units 2nd sem (evaluations),0.005319,0.050001


In [22]:
outlier_features = outlier_result_df.query("std_outlier > 0.03")["column"]
outlier_features.values

array(['Curricular units 1st sem (enrolled)',
       'Curricular units 2nd sem (enrolled)', 'Age at enrollment'],
      dtype=object)

In [17]:
def get_cap_outlier(trans_X, outlier_columns, op="std") : 
    
    print("~~~ Cap Outlier ~~~")
    
    for col in outlier_columns : 
        print(f"columns : {col}")
        
        if op == "std" : 
            upper_lim = trans_X[col].mean() + (trans_X[col].std() * 3)
            lower_lim = trans_X[col].mean() - (trans_X[col].std() * 3)
        elif op == "quant" : 
            upper_lim = trans_X[col].quantile(.95)
            lower_lim = trans_X[col].quantile(.05)

        print(f'''
        * befor cap {op} 
          - upper lim : {upper_lim.round(3)}, nlower lim : {lower_lim.round(3)}
          - upper lim length : {len(trans_X[trans_X[col] > upper_lim][col])}, 
          - lower lim length : {len(trans_X[trans_X[col] < lower_lim][col])}
        ''')

        # cap
        trans_X.loc[(trans_X[col] > upper_lim), col] = upper_lim
        trans_X.loc[(trans_X[col] < lower_lim), col] = upper_lim

        print(f'''
        * after cap
           - upper lim length : {len(trans_X[trans_X[col] > upper_lim][col])}, 
           - lower lim length : {len(trans_X[trans_X[col] < lower_lim][col])}
        ''')
    
    return trans_X

In [24]:
cap_outlier_X = get_cap_outlier(X, outlier_features)
cap_outlier_X.head()

~~~ Cap Outlier ~~~
columns : Curricular units 1st sem (enrolled)

        * befor cap std 
          - upper lim : 10.907, nlower lim : 0.876
          - upper lim length : 1071, 
          - lower lim length : 2671
        

        * after cap
           - upper lim length : 0, 
           - lower lim length : 0
        
columns : Curricular units 2nd sem (enrolled)

        * befor cap std 
          - upper lim : 10.815, nlower lim : 1.052
          - upper lim length : 807, 
          - lower lim length : 2682
        

        * after cap
           - upper lim length : 0, 
           - lower lim length : 0
        
columns : Age at enrollment

        * befor cap std 
          - upper lim : 42.946, nlower lim : 1.611
          - upper lim length : 2442, 
          - lower lim length : 0
        

        * after cap
           - upper lim length : 0, 
           - lower lim length : 0
        


Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
0,1,1,1,9238,1,1,126.0,1,1,19,...,0,0,6.0,7,6,12.428571,0,11.1,0.6,2.02
1,1,17,1,9238,1,1,125.0,1,19,19,...,0,0,6.0,9,0,0.0,0,11.1,0.6,2.02
2,1,17,2,9254,1,1,137.0,1,3,19,...,0,0,6.0,0,0,0.0,0,16.2,0.3,-0.92
3,1,1,3,9500,1,1,131.0,1,19,3,...,0,0,8.0,11,7,12.82,0,11.1,0.6,2.02
4,1,1,2,9500,1,1,132.0,1,19,37,...,0,0,7.0,12,6,12.933333,0,7.6,2.6,0.32


## 4. Modeling
- 전체 모델을 기본 설정한 후 전처리, feature engineering을 변경하여 모델 성능 테스트, test 1 ~ test 10
   - 전처리와 feature entineering 중 어떤 조합이 좀더 성능이 나은지 테스트
- Hyper parameter Tunning
   - 훈련 데이터의 전처리 mms, onehot 방법을 기본으로하여 각 개별 모델별 hyper paramter tunning
- Best estimator Submission
   - hyper parameter tunning의 best estimator 중 상위 3개 모델을 사용하여 GridCV로 재 fitting 후 test data의 예측값 추정 및 submission

In [13]:
# 전처리, 엔지니어링
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.model_selection import train_test_split, KFold, cross_validate, GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2, f_classif

# 확률적 판별모형
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# 확률적 생성모형
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB, CategoricalNB

# 판별모형
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# 앙상블(모형결합)
from sklearn.ensemble import (VotingClassifier, BaggingClassifier, RandomForestClassifier, 
                              ExtraTreesClassifier, AdaBoostClassifier, HistGradientBoostingClassifier)
import xgboost as xgb

### Pipeline 생성

In [14]:
num_transformer = Pipeline(
    steps=[
        ("ss", StandardScaler()),
        ("mms", MinMaxScaler()),
        ("non_scaler", "passthrough")
])

cat_transformer = Pipeline(
    steps=[
        ("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore")),
        ("ordinal", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
        ("non_encoder", "passthrough"),
        ("drop_cate", "drop")
])
 
feature_preprocessor = ColumnTransformer(
    transformers=[
        ("num_transformer", num_transformer, make_column_selector(dtype_exclude="category")),
        ("cat_transformer", cat_transformer, make_column_selector(dtype_include="category"))
    ], remainder="passthrough"
)

feature_tech = Pipeline(
    steps=[
        ("feature_select_1", SelectKBest(score_func=f_classif, k=10)),
        ("feature_interaction", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
        ("feature_select_2", VarianceThreshold())
    ]
)

feature_engineering = Pipeline(
    steps=[
        ("preprocessor", feature_preprocessor),
        ("tech", feature_tech)
    ]
)

model_link = {}

model_pipe = Pipeline(
    steps=[
        ("engineering", feature_engineering),
        ("model", model_link)
    ]
)

model_pipe

### 모델 객체를 저장
- RandomForestClassifier, HistGradientBoostingClassifier, XGBClassifier

In [15]:
model_link.update(
    {
        "randomF": RandomForestClassifier(),
        "histGB": HistGradientBoostingClassifier(),
        "xgbclf": xgb.XGBClassifier()
    }
)

model_link

{'randomF': RandomForestClassifier(),
 'histGB': HistGradientBoostingClassifier(),
 'xgbclf': XGBClassifier(base_score=None, booster=None, callbacks=None,
               colsample_bylevel=None, colsample_bynode=None,
               colsample_bytree=None, device=None, early_stopping_rounds=None,
               enable_categorical=False, eval_metric=None, feature_types=None,
               gamma=None, grow_policy=None, importance_type=None,
               interaction_constraints=None, learning_rate=None, max_bin=None,
               max_cat_threshold=None, max_cat_to_onehot=None,
               max_delta_step=None, max_depth=None, max_leaves=None,
               min_child_weight=None, missing=nan, monotone_constraints=None,
               multi_strategy=None, n_estimators=None, n_jobs=None,
               num_parallel_tree=None, random_state=None, ...)}

In [16]:
def get_cv_result(X, y, pipe, model_name=None, num_scaler=None, cate_encoder=None) : 
    
    pipe.set_params(model=model_link[model_name])
    
    kfold = KFold(n_splits=5, shuffle=True, random_state=45)
    cv_results = []
    # scaler와 encoder의 조합으로 모델링 실험
    total_start_time = time.time()
    for comb in list(product(num_scaler, cate_encoder)) :
        # model_pipe_5는 engineering pipe에 tech와 perprocessor step으로 구분함
        pipe["engineering"]["preprocessor"].set_params(
            num_transformer=num_transformer[comb[0]], 
            cat_transformer=cat_transformer[comb[1]]
        )
        
        start_time = time.time()
        cv_result = cross_validate(
            estimator=pipe, 
            X=X, 
            y=y, 
            scoring="accuracy", 
            cv=kfold,
            return_train_score=True,
            n_jobs=2,
        )
        end_time = time.time()
        
        cv_results.append(pd.DataFrame({
            "model": pipe["model"].__class__.__name__,
            "num_transformer": comb[0],
            "cat_transformer": comb[1],
            "train_score": cv_result["train_score"].mean(),
            "test_score": cv_result["test_score"].mean(),
            "fit_time": end_time - start_time
        }, index=[0]
        ))
    
    result = pd.concat(cv_results, axis=0).sort_values("test_score", ascending=False)
    total_end_time = time.time()
    print(f"{model_name} fitting time : {total_end_time - total_start_time : .3f}")
    
    return result

### (1) test 1
- transform X : 
    - binning cate
    - log transform num
- preprocessor
- feature engineering
   - SelectKBtest(k=10)
   - PolynomialFeatures()
   - VarianceThreshold()

In [115]:
%%time

scaler = ["ss", "mms", "non_scaler"]
encoder = ["onehot", "ordinal", "non_encoder"]

cv_results = []
for model_name in model_link.keys() : 
    cv_results.append(
        get_cv_result(transform_X, y_enc, model_pipe, model_name=model_name, num_scaler=scaler, cate_encoder=encoder)
    )

result = pd.concat(cv_results, axis=0)
result

randomF fitting time :  831.954
histGB fitting time :  132.657
xgbclf fitting time :  97.928
CPU times: total: 6.47 s
Wall time: 17min 42s


Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,RandomForestClassifier,non_scaler,non_encoder,0.985412,0.803589,96.591965
0,RandomForestClassifier,ss,non_encoder,0.985412,0.803157,99.925054
0,RandomForestClassifier,non_scaler,ordinal,0.985418,0.802844,96.056036
0,RandomForestClassifier,mms,ordinal,0.985422,0.802805,96.262019
0,RandomForestClassifier,mms,non_encoder,0.985412,0.802752,96.384962
0,RandomForestClassifier,ss,ordinal,0.985425,0.80189,99.668998
0,RandomForestClassifier,non_scaler,onehot,0.977234,0.794637,81.682
0,RandomForestClassifier,mms,onehot,0.977215,0.794467,81.616018
0,RandomForestClassifier,ss,onehot,0.977218,0.793617,83.761981
0,HistGradientBoostingClassifier,ss,ordinal,0.833402,0.813874,15.257


### (2) test 2
- transform X :
   - binning cate : lower 100
   - log transform numeric
- preprocessor   
- feature engineering
   - SelectKBest(k=18)
   - PolynomialFeatures()
   - VarianceThreshold() : passthrough

In [22]:
model_pipe["engineering"]["tech"].set_params(feature_select_1=SelectKBest(score_func=f_classif, k=18))
model_pipe

In [121]:
%%time

scaler = ["ss", "mms", "non_scaler"]
encoder = ["onehot", "ordinal", "non_encoder"]

cv_results = []
for model_name in model_link.keys() : 
    cv_results.append(
        get_cv_result(transform_X, y_enc, model_pipe, model_name=model_name, num_scaler=scaler, cate_encoder=encoder)
    )

result_2 = pd.concat(cv_results, axis=0)
result_2

randomF fitting time :  1655.064
histGB fitting time :  303.827
xgbclf fitting time :  257.604
CPU times: total: 6.45 s
Wall time: 36min 56s


Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,RandomForestClassifier,ss,non_encoder,0.998928,0.819572,204.150974
0,RandomForestClassifier,ss,ordinal,0.998902,0.819441,204.198563
0,RandomForestClassifier,mms,non_encoder,0.998909,0.818448,197.470486
0,RandomForestClassifier,non_scaler,non_encoder,0.998922,0.818017,197.869366
0,RandomForestClassifier,non_scaler,ordinal,0.998928,0.817507,197.595623
0,RandomForestClassifier,mms,ordinal,0.998915,0.817442,196.867938
0,RandomForestClassifier,ss,onehot,0.997494,0.815834,154.445466
0,RandomForestClassifier,non_scaler,onehot,0.997497,0.815455,151.39997
0,RandomForestClassifier,mms,onehot,0.997497,0.815259,151.06004
0,HistGradientBoostingClassifier,ss,ordinal,0.847771,0.824551,31.161522


In [122]:
result_2.sort_values("test_score", ascending=False)

Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,HistGradientBoostingClassifier,ss,ordinal,0.847771,0.824551,31.161522
0,HistGradientBoostingClassifier,non_scaler,non_encoder,0.856341,0.824447,38.724001
0,HistGradientBoostingClassifier,non_scaler,ordinal,0.852786,0.824303,34.135
0,HistGradientBoostingClassifier,ss,non_encoder,0.851398,0.824094,35.053
0,HistGradientBoostingClassifier,mms,non_encoder,0.850124,0.823976,32.941003
0,HistGradientBoostingClassifier,mms,ordinal,0.851185,0.823545,35.361996
0,HistGradientBoostingClassifier,mms,onehot,0.849548,0.823427,33.393
0,HistGradientBoostingClassifier,ss,onehot,0.847856,0.823179,30.773486
0,XGBClassifier,non_scaler,ordinal,0.894544,0.823153,29.233999
0,XGBClassifier,non_scaler,non_encoder,0.894544,0.823153,30.010998


### (3) test 3
- transform X :
   - binning cate : lower 500
   - cap outlier numeric : std, upper 0.03
   - log transform numeric
- preprocessor
- feature engineering
   - SelectKBest(k=20)
   - PolynomialFeatures()
   - VarianceThreshold() : passthrough

In [148]:
transform_X = X.copy()
transform_X = get_binning_cate_X(transform_X, 100)

outlier_features = outlier_result_df.query("std_outlier > 0.03")["column"].values
transform_X = get_cap_outlier(transform_X, outlier_features, op="quant")

transform_X = get_log_transform_X(transform_X)
transform_X.head()

~~~ Binning Cate ~~~
columns : Application mode

        * change uniques ea : 146
        

columns : Application order

        * change uniques ea : 4
        

columns : Course

        * change uniques ea : 74
        

columns : Father's occupation

        * change uniques ea : 551
        

columns : Father's qualification

        * change uniques ea : 321
        

columns : Marital status

        * change uniques ea : 51
        

columns : Mother's occupation

        * change uniques ea : 351
        

columns : Mother's qualification

        * change uniques ea : 446
        

columns : Nacionality

        * change uniques ea : 284
        

columns : Previous qualification

        * change uniques ea : 364
        

~~~ Cap Outlier ~~~
columns : Curricular units 1st sem (enrolled)

        * befor cap quant 
          - upper lim : 10.0, nlower lim : 5.0
          - upper lim length : 3742, 
          - lower lim length : 319
        

        * after cap
           

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
0,1,1,1,9238,1,1,4.844187,1,1,19,...,0.0,0.0,1.94591,2.079442,1.94591,2.597385,0.0,2.493205,0.470004,1.105257
1,1,17,1,9238,1,1,4.836282,1,19,19,...,0.0,0.0,1.94591,2.302585,0.0,0.0,0.0,2.493205,0.470004,1.105257
2,1,17,2,9254,1,1,4.933898,1,3,19,...,0.652325,0.652325,2.069391,0.652325,0.652325,0.652325,0.652325,2.897016,0.797507,0.0
3,1,1,3,9500,1,1,4.882802,1,19,3,...,0.0,0.0,2.197225,2.484907,2.079442,2.626117,0.0,2.493205,0.470004,1.105257
4,1,1,2,9500,1,1,4.890349,1,19,37,...,0.0,0.0,2.079442,2.564949,1.94591,2.634284,0.0,2.151762,1.280934,0.277632


In [23]:
model_pipe["engineering"]["tech"].set_params(feature_select_1=SelectKBest(score_func=f_classif, k=20))
model_pipe

In [150]:
%%time

scaler = ["ss", "mms", "non_scaler"]
encoder = ["onehot", "ordinal", "non_encoder"]

cv_results = []
for model_name in model_link.keys() : 
    cv_results.append(
        get_cv_result(transform_X, y_enc, model_pipe, model_name=model_name, num_scaler=scaler, cate_encoder=encoder)
    )

result_3 = pd.concat(cv_results, axis=0)
result_3

randomF fitting time :  1762.509
histGB fitting time :  429.088
xgbclf fitting time :  310.946
CPU times: total: 6.33 s
Wall time: 41min 42s


Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,RandomForestClassifier,mms,ordinal,0.99903,0.820395,197.2488
0,RandomForestClassifier,mms,non_encoder,0.999036,0.819232,196.198001
0,RandomForestClassifier,ss,non_encoder,0.999036,0.81918,211.691961
0,RandomForestClassifier,ss,ordinal,0.999013,0.818801,211.610051
0,RandomForestClassifier,non_scaler,ordinal,0.999033,0.818749,199.222565
0,RandomForestClassifier,non_scaler,non_encoder,0.999036,0.818579,201.460826
0,RandomForestClassifier,ss,onehot,0.998902,0.815494,189.601622
0,RandomForestClassifier,mms,onehot,0.998919,0.815063,175.880001
0,RandomForestClassifier,non_scaler,onehot,0.998902,0.814423,179.592999
0,HistGradientBoostingClassifier,mms,non_encoder,0.854668,0.825701,50.043001


In [151]:
result_3.sort_values("test_score", ascending=False)

Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,HistGradientBoostingClassifier,mms,non_encoder,0.854668,0.825701,50.043001
0,HistGradientBoostingClassifier,non_scaler,ordinal,0.852525,0.824773,47.640001
0,HistGradientBoostingClassifier,non_scaler,non_encoder,0.852721,0.824708,47.981001
0,HistGradientBoostingClassifier,ss,ordinal,0.851904,0.824551,48.416003
0,HistGradientBoostingClassifier,mms,ordinal,0.848784,0.824473,42.388998
0,XGBClassifier,mms,non_encoder,0.89772,0.824224,35.384996
0,XGBClassifier,mms,ordinal,0.89772,0.824224,34.444999
0,HistGradientBoostingClassifier,ss,non_encoder,0.85373,0.823636,50.455996
0,XGBClassifier,non_scaler,non_encoder,0.896936,0.823453,37.030889
0,XGBClassifier,non_scaler,ordinal,0.896936,0.823453,35.006523


### (4) test 4
- transform X :
   - binning cate : lower 500
- feature engineering
   - SelectKBest(k=20)
   - PolynomialFeatures()
   - VarianceThreshold() : passthrough
- model
   - update lda, qda, logisticR 

In [29]:
transform_X = X.copy()
transform_X = get_binning_cate_X(transform_X, 500)
transform_X.head()

~~~ Binning Cate ~~~
columns : Application mode

        * uniques : [51, 16, 53, 15, 5, 10, 2, 27, 4, 26, 35, 12, 9, 3]
        * change uniques ea : 1261
        

columns : Application order

        * uniques : [0, 9]
        * change uniques ea : 4
        

columns : Course

        * uniques : [33, 39, 979]
        * change uniques ea : 74
        

columns : Educational special needs

        * uniques : [1]
        * change uniques ea : 286
        

columns : Father's occupation

        * uniques : [99, 193, 171, 144, 163, 175, 103, 192, 181, 152, 135, 182, 102, 172, 151, 112, 154, 183, 123, 194, 122, 153, 143, 195, 131, 141, 101, 114, 121, 174, 132, 134, 161, 125, 148, 96, 39, 22, 19, 191, 13, 12, 11, 124]
        * change uniques ea : 701
        

columns : Father's qualification

        * uniques : [2, 12, 4, 39, 5, 11, 36, 29, 40, 9, 14, 30, 43, 41, 22, 10, 26, 6, 42, 35, 18, 44, 20, 13, 27, 7, 33, 31, 25, 24, 21, 15, 23]
        * change uniques ea : 1555
        

co

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
0,1,1,1,9238,1,1,126.0,1,1,19,...,0,0,6.0,7,6,12.428571,0,11.1,0.6,2.02
1,1,17,1,9238,1,1,125.0,1,19,19,...,0,0,6.0,9,0,0.0,0,11.1,0.6,2.02
2,1,17,2,9254,1,1,137.0,1,3,19,...,0,0,6.0,0,0,0.0,0,16.2,0.3,-0.92
3,1,1,3,9500,1,1,131.0,1,19,3,...,0,0,8.0,11,7,12.82,0,11.1,0.6,2.02
4,1,1,2,9500,1,1,132.0,1,19,37,...,0,0,7.0,12,6,12.933333,0,7.6,2.6,0.32


In [30]:
model_pipe["engineering"]["tech"].set_params(feature_select_2="passthrough")
model_pipe

In [24]:
model_link.update({
    "lda": LinearDiscriminantAnalysis(),
    "qda": QuadraticDiscriminantAnalysis(),
    "logisticR": LogisticRegression(solver="sag")
})
model_link

{'randomF': RandomForestClassifier(),
 'histGB': HistGradientBoostingClassifier(),
 'xgbclf': XGBClassifier(base_score=None, booster=None, callbacks=None,
               colsample_bylevel=None, colsample_bynode=None,
               colsample_bytree=None, device=None, early_stopping_rounds=None,
               enable_categorical=False, eval_metric=None, feature_types=None,
               gamma=None, grow_policy=None, importance_type=None,
               interaction_constraints=None, learning_rate=None, max_bin=None,
               max_cat_threshold=None, max_cat_to_onehot=None,
               max_delta_step=None, max_depth=None, max_leaves=None,
               min_child_weight=None, missing=nan, monotone_constraints=None,
               multi_strategy=None, n_estimators=None, n_jobs=None,
               num_parallel_tree=None, random_state=None, ...),
 'lda': LinearDiscriminantAnalysis(),
 'qda': QuadraticDiscriminantAnalysis(),
 'logisticR': LogisticRegression(solver='sag')}

In [32]:
%%time

scaler = ["ss", "mms", "non_scaler"]
encoder = ["onehot", "ordinal", "non_encoder"]

cv_results = []
for model_name in model_link.keys() : 
    cv_results.append(
        get_cv_result(transform_X, y_enc, model_pipe, model_name=model_name, num_scaler=scaler, cate_encoder=encoder)
    )

result_4 = pd.concat(cv_results, axis=0)
result_4

randomF fitting time :  626.983
histGB fitting time :  115.527
xgbclf fitting time :  82.266
lda fitting time :  18.105
qda fitting time :  20.180
logisticR fitting time :  133.945
CPU times: total: 5.14 s
Wall time: 16min 37s


Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,RandomForestClassifier,non_scaler,non_encoder,0.981073,0.805092,73.614259
0,RandomForestClassifier,mms,non_encoder,0.981067,0.80436,69.812587
0,RandomForestClassifier,non_scaler,ordinal,0.98108,0.804268,85.759852
0,RandomForestClassifier,mms,ordinal,0.981067,0.803563,69.952894
0,RandomForestClassifier,ss,non_encoder,0.981083,0.803563,72.882042
0,RandomForestClassifier,ss,ordinal,0.981063,0.803406,73.76097
0,RandomForestClassifier,non_scaler,onehot,0.966812,0.794872,59.08298
0,RandomForestClassifier,ss,onehot,0.966818,0.794297,63.286717
0,RandomForestClassifier,mms,onehot,0.966828,0.794271,58.820482
0,HistGradientBoostingClassifier,ss,non_encoder,0.833379,0.81803,11.875804


In [33]:
result_4.sort_values("test_score", ascending=False)

Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,HistGradientBoostingClassifier,ss,non_encoder,0.833379,0.81803,11.875804
0,HistGradientBoostingClassifier,non_scaler,ordinal,0.835111,0.817964,12.919284
0,HistGradientBoostingClassifier,non_scaler,non_encoder,0.83532,0.817886,12.98124
0,HistGradientBoostingClassifier,mms,ordinal,0.836071,0.81786,13.592519
0,HistGradientBoostingClassifier,mms,non_encoder,0.835411,0.817599,13.140338
0,HistGradientBoostingClassifier,ss,ordinal,0.836163,0.817507,13.893214
0,HistGradientBoostingClassifier,ss,onehot,0.830187,0.817298,12.432246
0,HistGradientBoostingClassifier,mms,onehot,0.831213,0.816697,12.465124
0,HistGradientBoostingClassifier,non_scaler,onehot,0.830948,0.816618,12.214168
0,XGBClassifier,mms,non_encoder,0.867995,0.816475,9.534575


## (5) test 5
- transform X 
   - binning cate : lower 200
   - cap outlier numeric : quantile, upper 0.03
   - log transform numeric
- preprocessor
- feature select + feature interaction
   - SelectKBest(k=25)
   - PolynomialFeatures()
   - VarianceThreshold()

In [34]:
transform_X = X.copy()
transform_X = get_binning_cate_X(transform_X, 200)

outlier_features = outlier_result_df.query("quant_outlier > 0.03")["column"].values
transform_X = get_cap_outlier(transform_X, outlier_features, op="quant")

transform_X = get_log_transform_X(transform_X)
transform_X.head()

~~~ Binning Cate ~~~
columns : Application mode

        * uniques : [15, 5, 10, 2, 27, 4, 26, 35, 12, 9, 3]
        * change uniques ea : 329
        

columns : Application order

        * uniques : [0, 9]
        * change uniques ea : 4
        

columns : Course

        * uniques : [33, 39, 979]
        * change uniques ea : 74
        

columns : Father's occupation

        * uniques : [99, 193, 171, 144, 163, 175, 103, 192, 181, 152, 135, 182, 102, 172, 151, 112, 154, 183, 123, 194, 122, 153, 143, 195, 131, 141, 101, 114, 121, 174, 132, 134, 161, 125, 148, 96, 39, 22, 19, 191, 13, 12, 11, 124]
        * change uniques ea : 701
        

columns : Father's qualification

        * uniques : [39, 5, 11, 36, 29, 40, 9, 14, 30, 43, 41, 22, 10, 26, 6, 42, 35, 18, 44, 20, 13, 27, 7, 33, 31, 25, 24, 21, 15, 23]
        * change uniques ea : 541
        

columns : Marital status

        * uniques : [5, 6, 3]
        * change uniques ea : 167
        

columns : Mother's occupation



Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
0,1,1,1,9238,1,1,4.844187,1,1,19,...,0.0,0.0,1.94591,2.079442,1.94591,2.597385,0.0,2.493205,0.470004,1.105257
1,1,17,1,9238,1,1,4.836282,1,19,19,...,0.0,0.0,1.94591,2.302585,0.0,0.0,0.0,2.493205,0.470004,1.105257
2,1,17,2,9254,1,1,4.933898,1,3,19,...,0.652325,0.652325,2.069391,0.652325,0.652325,0.652325,0.652325,2.897016,0.797507,0.0
3,1,1,3,9500,1,1,4.882802,1,19,3,...,0.0,0.0,2.197225,2.484907,2.079442,2.626117,0.0,2.493205,0.470004,1.105257
4,1,1,2,9500,1,1,4.890349,1,19,37,...,0.0,0.0,2.079442,2.564949,1.94591,2.634284,0.0,2.151762,1.280934,0.277632


In [36]:
%%time

scaler = ["ss", "mms", "non_scaler"]
encoder = ["onehot", "ordinal", "non_encoder"]

cv_results = []
for model_name in model_link.keys() : 
    cv_results.append(
        get_cv_result(transform_X, y_enc, model_pipe, model_name=model_name, num_scaler=scaler, cate_encoder=encoder)
    )

result_5 = pd.concat(cv_results, axis=0)
result_5

randomF fitting time :  754.412
histGB fitting time :  127.957
xgbclf fitting time :  93.871
lda fitting time :  20.322
qda fitting time :  21.964
logisticR fitting time :  151.438
CPU times: total: 13.8 s
Wall time: 19min 29s


Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,RandomForestClassifier,non_scaler,non_encoder,0.984729,0.802713,88.138999
0,RandomForestClassifier,non_scaler,ordinal,0.984726,0.802648,87.853054
0,RandomForestClassifier,mms,ordinal,0.984726,0.802478,86.426971
0,RandomForestClassifier,ss,non_encoder,0.984726,0.80202,90.994391
0,RandomForestClassifier,mms,non_encoder,0.984729,0.801811,87.389054
0,RandomForestClassifier,ss,ordinal,0.984729,0.801811,90.478433
0,RandomForestClassifier,non_scaler,onehot,0.975904,0.794833,72.827999
0,RandomForestClassifier,mms,onehot,0.975898,0.793826,72.406974
0,RandomForestClassifier,ss,onehot,0.975921,0.793591,77.892317
0,HistGradientBoostingClassifier,mms,non_encoder,0.839247,0.814501,17.219995


In [37]:
result_5.sort_values("test_score", ascending=False)

Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,HistGradientBoostingClassifier,mms,non_encoder,0.839247,0.814501,17.219995
0,HistGradientBoostingClassifier,mms,ordinal,0.833137,0.814201,13.067997
0,HistGradientBoostingClassifier,non_scaler,non_encoder,0.835699,0.81373,16.322002
0,HistGradientBoostingClassifier,ss,onehot,0.829426,0.813443,12.580997
0,HistGradientBoostingClassifier,ss,non_encoder,0.829311,0.81343,12.319
0,HistGradientBoostingClassifier,non_scaler,ordinal,0.833343,0.81343,13.383999
0,HistGradientBoostingClassifier,ss,ordinal,0.83533,0.81326,14.966996
0,HistGradientBoostingClassifier,mms,onehot,0.830226,0.812933,13.89
0,HistGradientBoostingClassifier,non_scaler,onehot,0.831285,0.812658,14.197002
0,XGBClassifier,ss,onehot,0.858569,0.812227,10.571694


### (6) test 6
- transform X 
   - binning cate : lower 100
   - cap outlier numeric : std, upper 0.03
   - log transform numeric
- preprocessor
- feature select + feature interaction
   - SelectKBest(k=25)
   - PolynomialFeatures()
   - VarianceThreshold()

In [38]:
transform_X = X.copy()
transform_X = get_binning_cate_X(transform_X, 100)

outlier_features = outlier_result_df.query("std_outlier > 0.03")["column"].values
transform_X = get_cap_outlier(transform_X, outlier_features, op="std")

transform_X = get_log_transform_X(transform_X)
transform_X.head()

~~~ Binning Cate ~~~
columns : Application mode

        * uniques : [5, 10, 2, 27, 4, 26, 35, 12, 9, 3]
        * change uniques ea : 146
        

columns : Application order

        * uniques : [0, 9]
        * change uniques ea : 4
        

columns : Course

        * uniques : [33, 39, 979]
        * change uniques ea : 74
        

columns : Father's occupation

        * uniques : [193, 171, 144, 163, 175, 103, 192, 181, 152, 135, 182, 102, 172, 151, 112, 154, 183, 123, 194, 122, 153, 143, 195, 131, 141, 101, 114, 121, 174, 132, 134, 161, 125, 148, 96, 39, 22, 19, 191, 13, 12, 11, 124]
        * change uniques ea : 551
        

columns : Father's qualification

        * uniques : [11, 36, 29, 40, 9, 14, 30, 43, 41, 22, 10, 26, 6, 42, 35, 18, 44, 20, 13, 27, 7, 33, 31, 25, 24, 21, 15, 23]
        * change uniques ea : 321
        

columns : Marital status

        * uniques : [6, 3]
        * change uniques ea : 51
        

columns : Mother's occupation

        * uniques :

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
0,1,1,1,9238,1,1,4.844187,1,1,19,...,0.0,0.0,1.94591,2.079442,1.94591,2.597385,0.0,2.493205,0.470004,1.105257
1,1,17,1,9238,1,1,4.836282,1,19,19,...,0.0,0.0,1.94591,2.302585,0.0,0.0,0.0,2.493205,0.470004,1.105257
2,1,17,2,9254,1,1,4.933898,1,3,19,...,0.652325,0.652325,2.069391,0.652325,0.652325,0.652325,0.652325,2.897016,0.797507,0.0
3,1,1,3,9500,1,1,4.882802,1,19,3,...,0.0,0.0,2.197225,2.484907,2.079442,2.626117,0.0,2.493205,0.470004,1.105257
4,1,1,2,9500,1,1,4.890349,1,19,37,...,0.0,0.0,2.079442,2.564949,1.94591,2.634284,0.0,2.151762,1.280934,0.277632


In [41]:
model_pipe["engineering"]["tech"].set_params(feature_select_2=SelectKBest(k=25))
model_pipe["engineering"]["tech"].set_params(feature_select_2=VarianceThreshold())
model_pipe["engineering"]["tech"]

In [42]:
%%time

scaler = ["ss", "mms", "non_scaler"]
encoder = ["onehot", "ordinal", "non_encoder"]

cv_results = []
for model_name in model_link.keys() : 
    cv_results.append(
        get_cv_result(transform_X, y_enc, model_pipe, model_name=model_name, num_scaler=scaler, cate_encoder=encoder)
    )

result_6 = pd.concat(cv_results, axis=0)
result_6

randomF fitting time :  825.896
histGB fitting time :  129.252
xgbclf fitting time :  101.605
lda fitting time :  22.680
qda fitting time :  24.305
logisticR fitting time :  146.508
CPU times: total: 13.1 s
Wall time: 20min 50s


Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,RandomForestClassifier,non_scaler,ordinal,0.985373,0.802922,95.153057
0,RandomForestClassifier,mms,ordinal,0.985383,0.802622,96.250038
0,RandomForestClassifier,mms,non_encoder,0.98536,0.802295,96.431504
0,RandomForestClassifier,non_scaler,non_encoder,0.985347,0.802256,95.768999
0,RandomForestClassifier,ss,non_encoder,0.985373,0.801981,98.459631
0,RandomForestClassifier,ss,ordinal,0.985369,0.801837,97.922565
0,RandomForestClassifier,non_scaler,onehot,0.97711,0.795434,81.690963
0,RandomForestClassifier,mms,onehot,0.977133,0.794218,80.142051
0,RandomForestClassifier,ss,onehot,0.977113,0.793395,84.074618
0,HistGradientBoostingClassifier,ss,ordinal,0.833807,0.814122,14.210995


In [46]:
result_6.sort_values("test_score", ascending=False)

Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,HistGradientBoostingClassifier,ss,ordinal,0.833807,0.814122,14.210995
0,HistGradientBoostingClassifier,non_scaler,non_encoder,0.836267,0.813416,16.860999
0,HistGradientBoostingClassifier,mms,ordinal,0.833738,0.813416,13.753998
0,HistGradientBoostingClassifier,mms,non_encoder,0.835166,0.813351,14.934003
0,HistGradientBoostingClassifier,non_scaler,ordinal,0.833827,0.813116,13.879
0,HistGradientBoostingClassifier,ss,non_encoder,0.832798,0.81292,15.361999
0,XGBClassifier,non_scaler,ordinal,0.870602,0.812907,11.060002
0,XGBClassifier,non_scaler,non_encoder,0.870602,0.812907,12.159036
0,HistGradientBoostingClassifier,non_scaler,onehot,0.831161,0.812894,13.491997
0,HistGradientBoostingClassifier,ss,onehot,0.830932,0.812868,13.209003


### (7) test 7
- transform X 
   - binning cate : lower 100
   - cap outlier numeric : std, upper 0.03
   - log transform numeric
- preprocessor
- feature select + feature interaction
   - SelectKBest(k=20)
   - PolynomialFeatures()
   - VarianceThreshold()

In [57]:
transform_X = X.copy()
transform_X = get_binning_cate_X(transform_X, 100)

outlier_features = outlier_result_df.query("std_outlier > 0.03")["column"].values
transform_X = get_cap_outlier(transform_X, outlier_features, op="std")

transform_X = get_log_transform_X(transform_X)
transform_X.head()

~~~ Binning Cate ~~~
columns : Application mode

        * uniques : [5, 10, 2, 27, 4, 26, 35, 12, 9, 3]
        * change uniques ea : 146
        

columns : Application order

        * uniques : [0, 9]
        * change uniques ea : 4
        

columns : Course

        * uniques : [33, 39, 979]
        * change uniques ea : 74
        

columns : Father's occupation

        * uniques : [193, 171, 144, 163, 175, 103, 192, 181, 152, 135, 182, 102, 172, 151, 112, 154, 183, 123, 194, 122, 153, 143, 195, 131, 141, 101, 114, 121, 174, 132, 134, 161, 125, 148, 96, 39, 22, 19, 191, 13, 12, 11, 124]
        * change uniques ea : 551
        

columns : Father's qualification

        * uniques : [11, 36, 29, 40, 9, 14, 30, 43, 41, 22, 10, 26, 6, 42, 35, 18, 44, 20, 13, 27, 7, 33, 31, 25, 24, 21, 15, 23]
        * change uniques ea : 321
        

columns : Marital status

        * uniques : [6, 3]
        * change uniques ea : 51
        

columns : Mother's occupation

        * uniques :

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
0,1,1,1,9238,1,1,4.844187,1,1,19,...,0.0,0.0,1.94591,2.079442,1.94591,2.597385,0.0,2.493205,0.470004,1.105257
1,1,17,1,9238,1,1,4.836282,1,19,19,...,0.0,0.0,1.94591,2.302585,0.0,0.0,0.0,2.493205,0.470004,1.105257
2,1,17,2,9254,1,1,4.933898,1,3,19,...,0.652325,0.652325,2.069391,0.652325,0.652325,0.652325,0.652325,2.897016,0.797507,0.0
3,1,1,3,9500,1,1,4.882802,1,19,3,...,0.0,0.0,2.197225,2.484907,2.079442,2.626117,0.0,2.493205,0.470004,1.105257
4,1,1,2,9500,1,1,4.890349,1,19,37,...,0.0,0.0,2.079442,2.564949,1.94591,2.634284,0.0,2.151762,1.280934,0.277632


In [25]:
model_pipe["engineering"]["tech"].set_params(feature_select_2=SelectKBest(k=28))
model_pipe

In [59]:
%%time

scaler = ["ss", "mms", "non_scaler"]
encoder = ["onehot", "ordinal", "non_encoder"]

cv_results = []
for model_name in model_link.keys() : 
    cv_results.append(
        get_cv_result(transform_X, y_enc, model_pipe, model_name=model_name, num_scaler=scaler, cate_encoder=encoder)
    )

result_7 = pd.concat(cv_results, axis=0).sort_values("test_score", ascending=False)
result_7

randomF fitting time :  675.988
histGB fitting time :  107.402
xgbclf fitting time :  74.522
lda fitting time :  19.264
qda fitting time :  20.068
logisticR fitting time :  88.838
CPU times: total: 13.8 s
Wall time: 16min 26s


Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,HistGradientBoostingClassifier,ss,onehot,0.828308,0.813403,9.51999
0,HistGradientBoostingClassifier,ss,ordinal,0.831128,0.813364,10.279002
0,HistGradientBoostingClassifier,ss,non_encoder,0.831991,0.813142,12.372
0,HistGradientBoostingClassifier,non_scaler,onehot,0.833725,0.813077,12.798003
0,HistGradientBoostingClassifier,mms,onehot,0.831233,0.812044,12.73466
0,XGBClassifier,ss,non_encoder,0.858801,0.812018,8.122999
0,XGBClassifier,ss,ordinal,0.858801,0.812018,8.324001
0,XGBClassifier,ss,onehot,0.857386,0.811966,8.360679
0,HistGradientBoostingClassifier,mms,ordinal,0.830507,0.811587,12.225523
0,XGBClassifier,non_scaler,onehot,0.866074,0.811482,7.858


### (8) test 8
- preprocessor 만 적용

In [26]:
model_pipe["engineering"].set_params(tech="passthrough")
model_pipe

In [28]:
%%time

scaler = ["ss", "mms", "non_scaler"]
encoder = ["onehot", "ordinal", "non_encoder"]

cv_results = []
for model_name in model_link.keys() : 
    cv_results.append(
        get_cv_result(X, y_enc, model_pipe, model_name=model_name, num_scaler=scaler, cate_encoder=encoder)
    )

result_8 = pd.concat(cv_results, axis=0).sort_values("test_score", ascending=False)
result_8

randomF fitting time :  471.601
histGB fitting time :  293.805
xgbclf fitting time :  94.319
lda fitting time :  29.471
qda fitting time :  32.080
logisticR fitting time :  277.773
CPU times: total: 2.14 s
Wall time: 19min 59s


Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,XGBClassifier,mms,onehot,0.875827,0.831713,17.875877
0,XGBClassifier,non_scaler,onehot,0.875827,0.831713,18.40408
0,XGBClassifier,ss,onehot,0.875827,0.831713,20.15187
0,HistGradientBoostingClassifier,ss,onehot,0.851064,0.831478,57.082044
0,HistGradientBoostingClassifier,mms,onehot,0.854982,0.831308,71.423432
0,HistGradientBoostingClassifier,mms,ordinal,0.853538,0.830942,14.314302
0,HistGradientBoostingClassifier,non_scaler,ordinal,0.855286,0.830929,15.402017
0,HistGradientBoostingClassifier,non_scaler,onehot,0.856253,0.830916,73.876855
0,HistGradientBoostingClassifier,mms,non_encoder,0.856004,0.83085,15.571253
0,HistGradientBoostingClassifier,ss,non_encoder,0.856517,0.830785,15.372363


### (9) test 9
- transoform X 
   - log transform
- preprocessor

In [29]:
transform_X = X.copy()
transform_X = get_log_transform_X(transform_X)
transform_X[numeric_features].head()

Unnamed: 0,Inflation rate,Curricular units 1st sem (approved),Curricular units 2nd sem (credited),Previous qualification (grade),Curricular units 1st sem (evaluations),Unemployment rate,Curricular units 2nd sem (grade),Admission grade,Curricular units 1st sem (enrolled),Curricular units 2nd sem (approved),Curricular units 1st sem (grade),Curricular units 1st sem (credited),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (without evaluations),GDP,Curricular units 1st sem (without evaluations),Age at enrollment
0,0.470004,1.94591,0.0,4.844187,1.94591,2.493205,2.597385,4.817051,1.94591,1.94591,2.74084,0.0,2.079442,1.94591,0.0,1.105257,0.0,2.944439
1,0.470004,1.609438,0.0,4.836282,2.197225,2.493205,0.0,4.794136,1.94591,0.0,2.533697,0.0,2.302585,1.94591,0.0,1.105257,0.0,2.944439
2,0.797507,0.652325,0.652325,4.933898,0.652325,2.897016,0.652325,4.987844,2.069391,0.652325,0.652325,0.652325,0.652325,2.069391,0.652325,0.0,0.652325,2.991724
3,0.470004,2.079442,0.0,4.882802,2.302585,2.493205,2.626117,4.844974,2.079442,2.079442,2.609426,0.0,2.484907,2.197225,0.0,1.105257,0.0,2.944439
4,1.280934,1.94591,0.0,4.890349,2.564949,2.151762,2.634284,4.796617,2.079442,1.94591,2.634284,0.0,2.564949,2.079442,0.0,0.277632,0.0,2.944439


In [31]:
%%time

scaler = ["ss", "mms", "non_scaler"]
encoder = ["onehot", "ordinal", "non_encoder"]

cv_results = []
for model_name in model_link.keys() : 
    cv_results.append(
        get_cv_result(transform_X, y_enc, model_pipe, model_name=model_name, num_scaler=scaler, cate_encoder=encoder)
    )

result_9 = pd.concat(cv_results, axis=0).sort_values("test_score", ascending=False)
result_9

randomF fitting time :  512.975
histGB fitting time :  298.108
xgbclf fitting time :  100.900
lda fitting time :  28.023
qda fitting time :  32.222
logisticR fitting time :  266.868
CPU times: total: 7.14 s
Wall time: 20min 39s


Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,XGBClassifier,mms,onehot,0.875624,0.828968,19.874001
0,XGBClassifier,non_scaler,onehot,0.875624,0.828968,18.254995
0,XGBClassifier,ss,onehot,0.875624,0.828968,21.085428
0,HistGradientBoostingClassifier,non_scaler,onehot,0.850538,0.828276,65.295
0,HistGradientBoostingClassifier,ss,onehot,0.8536,0.828067,69.137998
0,HistGradientBoostingClassifier,mms,ordinal,0.855939,0.827975,16.309999
0,HistGradientBoostingClassifier,non_scaler,non_encoder,0.853766,0.827923,15.620001
0,HistGradientBoostingClassifier,ss,non_encoder,0.852211,0.827818,14.766999
0,HistGradientBoostingClassifier,mms,non_encoder,0.854224,0.827661,16.350003
0,HistGradientBoostingClassifier,mms,onehot,0.851577,0.827622,66.705999


### (10) test 10
- preprocessor
   - onehot encoder 제외
- feature engineering
   - PolynomialFeatures() : onehot encoder 적용시 features가 많아져 메모리 부족 에러 발생

In [27]:
model_pipe["engineering"].set_params(tech=feature_tech["feature_interaction"])
model_pipe

In [39]:
%%time

scaler = ["ss", "mms", "non_scaler"]
encoder = ["ordinal", "non_encoder"]

cv_results = []
for model_name in model_link.keys() : 
    cv_results.append(
        get_cv_result(X, y_enc, model_pipe, model_name=model_name, num_scaler=scaler, cate_encoder=encoder)
    )

result_10 = pd.concat(cv_results, axis=0).sort_values("test_score", ascending=False)
result_10

randomF fitting time :  1500.231
histGB fitting time :  750.747
xgbclf fitting time :  565.634
lda fitting time :  117.499
qda fitting time :  160.167
logisticR fitting time :  1012.308
CPU times: total: 1.28 s
Wall time: 1h 8min 26s


Unnamed: 0,model,num_transformer,cat_transformer,train_score,test_score,fit_time
0,HistGradientBoostingClassifier,ss,non_encoder,0.858615,0.830354,114.380821
0,HistGradientBoostingClassifier,non_scaler,ordinal,0.860477,0.830197,122.667517
0,HistGradientBoostingClassifier,ss,ordinal,0.863463,0.829949,134.214322
0,HistGradientBoostingClassifier,mms,ordinal,0.86147,0.829857,127.08656
0,HistGradientBoostingClassifier,non_scaler,non_encoder,0.861885,0.829583,129.893158
0,HistGradientBoostingClassifier,mms,non_encoder,0.859726,0.829295,122.491491
0,XGBClassifier,mms,non_encoder,0.909949,0.82923,93.46658
0,XGBClassifier,ss,ordinal,0.908688,0.828955,94.826479
0,XGBClassifier,mms,ordinal,0.908087,0.828785,95.684642
0,XGBClassifier,non_scaler,ordinal,0.906904,0.828785,89.513002


## 5. Hyper parameter Tunning

### (1) Logistic Regression

In [28]:
model_pipe["engineering"]["preprocessor"].set_params(
    num_transformer=num_transformer["mms"],
    cat_transformer=cat_transformer["onehot"]
)
model_pipe["engineering"].set_params(tech="passthrough")
model_pipe

In [44]:
%%time

lr_params = {
    "model__solver": ["saga", "sag"],
    "model__multi_class": ["multinomial", "ovr"],
    "model__max_iter": [100, 200, 300, 400, 500],
    "model__C": [0.01, 0.1, 1]
}

cv = KFold(n_splits=5, shuffle=True, random_state=45)
lr_grid_cv = GridSearchCV(estimator=model_pipe, 
                          param_grid=lr_params, 
                          cv=cv, 
                          scoring="accuracy", 
                          return_train_score=True, 
                          error_score="raise",
                          n_jobs=2)
lr_grid_cv.fit(X, y_enc)

CPU times: total: 22.1 s
Wall time: 47min 16s


In [17]:
def get_grid_cv_result_df(grid_cv) : 
    
    result_df = pd.DataFrame(grid_cv.cv_results_).sort_values("rank_test_score")
    cols = list(result_df.filter(regex="^param_model").columns) + ["mean_test_score", "mean_train_score", "std_test_score", "rank_test_score"]
    result_df = result_df[cols]
    
    return result_df

In [47]:
lr_grid_cv_result = get_grid_cv_result_df(lr_grid_cv)
lr_grid_cv_result[:5]

Unnamed: 0,param_model__C,param_model__max_iter,param_model__multi_class,param_model__solver,mean_test_score,mean_train_score,std_test_score,rank_test_score
49,1,300,multinomial,sag,0.822003,0.824251,0.003253,1
53,1,400,multinomial,sag,0.82199,0.824254,0.003276,2
41,1,100,multinomial,sag,0.82195,0.824267,0.003342,3
57,1,500,multinomial,sag,0.821924,0.824273,0.003343,4
44,1,200,multinomial,saga,0.821924,0.824264,0.003361,5


In [18]:
def save_model(dir_path, save_model, model_name) : 
    
    os.makedirs(dir_path, exist_ok=True)
    full_path = dir_path + "/" + model_name + ".pkl"
    
    pickle.dump(save_model, open(full_path, 'wb'))
    
    # 저장한 모델을 불러온 후 모델의 params의 갯수를 확인
    load_model = pickle.load(open(full_path, "rb"))
    if len(load_model["model"].get_params()) > 1 : 
        print(f"{model_name} save success.")

In [90]:
lr_grid_cv_result.to_csv("./lr_grid_cv_result.csv")
save_model("./build_model", lr_grid_cv.best_estimator_, "lr_grid_cv_best_model")

lr_grid_cv_best_model save success.


### (2) Linear Discriminant Analysis
- 다른 모델들에 비해 fitting 시간이 매우 짧다.

In [31]:
model_pipe.set_params(model=model_link["lda"])
model_pipe

In [51]:
from sklearn.covariance import OAS

oas = OAS(store_precision=False, assume_centered=False)
oas

In [58]:
%%time

lda_params = {
    "model__solver": ["svd", "lsqr"],
    "model__shrinkage": ["auto", None],
    "model__covariance_estimator": [None, oas]
}

cv = KFold(n_splits=5, shuffle=True, random_state=45)
lda_grid_cv = GridSearchCV(estimator=model_pipe, 
                          param_grid=lda_params, 
                          cv=cv, 
                          scoring="accuracy", 
                          return_train_score=True,
                          n_jobs=2)
lda_grid_cv.fit(X, y_enc)

CPU times: total: 1.27 s
Wall time: 17.9 s


In [59]:
lda_grid_cv_result = get_grid_cv_result_df(lda_grid_cv)
lda_grid_cv_result

Unnamed: 0,param_model__covariance_estimator,param_model__shrinkage,param_model__solver,mean_test_score,mean_train_score,std_test_score,rank_test_score
3,,,lsqr,0.811116,0.813227,0.003128,1
2,,,svd,0.811103,0.813227,0.003106,2
7,OAS(store_precision=False),,lsqr,0.811025,0.812822,0.003046,3
1,,auto,lsqr,0.806385,0.807049,0.002478,4
0,,auto,svd,,,,5
4,OAS(store_precision=False),auto,svd,,,,5
5,OAS(store_precision=False),auto,lsqr,,,,5
6,OAS(store_precision=False),,svd,,,,5


In [91]:
lda_grid_cv_result.to_csv("./lda_grid_cv_result.csv")
save_model("./build_model", lda_grid_cv.best_estimator_, "lda_grid_cv_best_model")

lda_grid_cv_best_model save success.


### (3) Random Forest

In [32]:
model_pipe["engineering"].set_params(tech="passthrough")
model_pipe.set_params(model=model_link["randomF"])
model_pipe

In [75]:
%%time

rf_params = {
    "model__n_estimators": [50, 100, 200, 300, 400],
    "model__criterion": ["entropy"],
    "model__max_depth": [3, 4, 5, 6, 7, 8],
    "model__min_samples_leaf": [1, 2, 3],
    "model__min_samples_split": [2, 4, 6],
    "model__n_jobs": [-1],
}

cv = KFold(n_splits=5, shuffle=True, random_state=45)
rf_grid_cv = GridSearchCV(estimator=model_pipe, 
                          param_grid=rf_params, 
                          cv=cv, 
                          scoring="accuracy", 
                          return_train_score=True,
                          n_jobs=-1)
rf_grid_cv.fit(X, y_enc)

CPU times: total: 1min 40s
Wall time: 1h 24min 54s


In [77]:
rf_grid_cv_result = ge t_grid_cv_result_df(rf_grid_cv)
rf_grid_cv_result[:5]

Unnamed: 0,param_model__criterion,param_model__max_depth,param_model__min_samples_leaf,param_model__min_samples_split,param_model__n_estimators,param_model__n_jobs,mean_test_score,mean_train_score,std_test_score,rank_test_score
232,entropy,8,1,4,200,-1,0.807,0.810074,0.003753,1
248,entropy,8,2,4,300,-1,0.806895,0.810332,0.002487,2
240,entropy,8,2,2,50,-1,0.806791,0.809816,0.004607,3
262,entropy,8,3,4,200,-1,0.806751,0.809904,0.004134,4
250,entropy,8,2,6,50,-1,0.806634,0.809943,0.003808,5


In [92]:
rf_grid_cv_result.to_csv("./rf_grid_cv_result.csv")
save_model("./build_model", rf_grid_cv.best_estimator_, "rf_grid_cv_best_model")

rf_grid_cv_best_model save success.


### (4) Hist Gradient Boosting
- 6시간 fitting 

In [33]:
model_pipe.set_params(model=model_link["histGB"])
model_pipe

In [81]:
%%time

histgb_params = {
    "model__max_iter": [50, 100, 200, 300, 400],
    "model__learning_rate": [0.01, 0.1, 1],
    "model__max_depth": [3, 4, 5, 6, 7],
    "model__min_samples_leaf": [20, 50, 100, 500],
    "model__l2_regularization": [0.01, 0.1]
    
}

cv = KFold(n_splits=5, shuffle=True, random_state=45)
histgb_grid_cv = GridSearchCV(estimator=model_pipe, 
                          param_grid=histgb_params, 
                          cv=cv, 
                          scoring="accuracy", 
                          return_train_score=True,
                          n_jobs=-1)
histgb_grid_cv.fit(X, y_enc)

CPU times: total: 1min 44s
Wall time: 6h 52min 13s


In [82]:
histgb_grid_cv_result = get_grid_cv_result_df(histgb_grid_cv)
histgb_grid_cv_result[:5]

Unnamed: 0,param_model__l2_regularization,param_model__learning_rate,param_model__max_depth,param_model__max_iter,param_model__min_samples_leaf,mean_test_score,mean_train_score,std_test_score,rank_test_score
419,0.1,0.1,3,400,500,0.832719,0.840279,0.003403,1
158,0.01,0.1,5,400,100,0.832458,0.851316,0.002928,2
450,0.1,0.1,5,200,100,0.832432,0.850659,0.003628,3
114,0.01,0.1,3,300,100,0.832419,0.842076,0.003517,4
454,0.1,0.1,5,300,100,0.832419,0.853632,0.002757,5


In [93]:
histgb_grid_cv_result.to_csv("./histgb_grid_cv_result.csv")
save_model("./build_model", histgb_grid_cv.best_estimator_, "histgb_grid_cv_best_model")

histgb_grid_cv_best_model save success.


### (5) XGBClassifier

In [21]:
model_pipe["engineering"]["preprocessor"].set_params(
    num_transformer=num_transformer["mms"],
    cat_transformer=cat_transformer["onehot"]
)
model_pipe["engineering"].set_params(tech="passthrough")
model_pipe.set_params(model=model_link["xgbclf"])
model_pipe

In [97]:
help(xgb.XGBClassifier())

Help on XGBClassifier in module xgboost.sklearn object:

class XGBClassifier(XGBModel, sklearn.base.ClassifierMixin)
 |  XGBClassifier(*, objective: Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType] = 'binary:logistic', **kwargs: Any) -> None
 |  
 |  Implementation of the scikit-learn API for XGBoost classification.
 |  See :doc:`/python/sklearn_estimator` for more information.
 |  
 |  Parameters
 |  ----------
 |  
 |      n_estimators : Optional[int]
 |          Number of boosting rounds.
 |  
 |      max_depth :  Optional[int]
 |          Maximum tree depth for base learners.
 |      max_leaves :
 |          Maximum number of leaves; 0 indicates no limit.
 |      max_bin :
 |          If using histogram-based algorithm, maximum number of bins per feature
 |      grow_policy :
 |          Tree growing policy. 0: favor splitting at nodes closest to the node, i.e. grow
 |          depth-wise. 1: favor splitting at nodes with highest l

### XGB 첫번째 tunning : 너무 오래 걸려서 중지 (24h 이상)

In [99]:
%%time

xgb_params = {
    "model__n_estimators": [50, 100, 150, 200, 250, 300]
    "model__max_depth": [3, 4, 5, 6, 7, 8],
    "model__max_leaves": [0, 1, 2, 3],
    "model__bin": [100, 200, 255, 300],
    "model__learning_rate": [0.01, 0.1, 0.5, 1],
    "model__booster": ["gbtree", "dart"],
    "model__gamma": [0.01, 0.1, 1, 10, 100],
    "model__device": ["cpu", "cuda", "gpu"]
}

cv = KFold(n_splits=5, shuffle=True, random_state=45)
xgbclf_grid_cv = GridSearchCV(estimator=model_pipe, 
                          param_grid=xgb_params, 
                          cv=cv, 
                          scoring="accuracy", 
                          return_train_score=True,
                          n_jobs=-1)
xgbclf_grid_cv.fit(X, y_enc)

KeyboardInterrupt: 

### XGB 두 번쨰 tunning

In [38]:
%%time

xgb_params = {
    "model__n_estimators": [50, 100, 150, 200, 250, 300],
    #"model__max_depth": [3, 4, 5, 6, 7, 8],
    #"model__max_leaves": [0, 1, 2, 3],
    #"model__bin": [100, 200, 255, 300],
    #"model__learning_rate": [0.01, 0.1, 0.5, 1],
    #"model__booster": ["gbtree", "dart"],
    "model__gamma": [0.01, 0.1, 1, 10, 100],
    "model__device": ["cuda"]
}

cv = KFold(n_splits=5, shuffle=True, random_state=45)
xgbclf_grid_cv = GridSearchCV(estimator=model_pipe, 
                          param_grid=xgb_params, 
                          cv=cv, 
                          scoring="accuracy", 
                          return_train_score=True,
                          n_jobs=-1)
xgbclf_grid_cv.fit(X, y_enc)

CPU times: total: 13.7 s
Wall time: 11min 56s


In [40]:
xgbclf_grid_cv_result = get_grid_cv_result_df(xgbclf_grid_cv)
xgbclf_grid_cv_result

Unnamed: 0,param_model__device,param_model__gamma,param_model__n_estimators,mean_test_score,mean_train_score,std_test_score,rank_test_score
7,cuda,0.1,100,0.831504,0.875748,0.002568,1
2,cuda,0.01,150,0.83136,0.890059,0.003164,2
8,cuda,0.1,150,0.831321,0.891186,0.001909,3
14,cuda,1.0,150,0.831242,0.858275,0.002808,4
17,cuda,1.0,300,0.831242,0.858275,0.002808,4
16,cuda,1.0,250,0.831242,0.858275,0.002808,4
15,cuda,1.0,200,0.831242,0.858275,0.002808,4
13,cuda,1.0,100,0.831242,0.858275,0.002808,4
1,cuda,0.01,100,0.831216,0.875467,0.003517,9
0,cuda,0.01,50,0.831086,0.857527,0.003326,10


In [48]:
xgbclf_grid_cv_result.to_csv("./xgbclf_grid_cv_result.csv")
save_model("./build_model", xgbclf_grid_cv.best_estimator_, "xgbclf_grid_cv_best_model")

xgbclf_grid_cv_best_model save success.


### XGB 세 번째 tunning

In [42]:
%%time

xgb_params = {
    "model__n_estimators": [100, 130, 150, 170, 200],
    "model__max_depth": [3, 4, 5, 6, 7],
    #"model__max_leaves": [0, 1, 2, 3],
    #"model__bin": [100, 200, 255, 300],
    #"model__learning_rate": [0.01, 0.1, 0.5, 1],
    #"model__booster": ["gbtree", "dart"],
    "model__gamma": [0.01, 0.03, 0.1, 0.3, 1],
    "model__device": ["cuda"]
}

cv = KFold(n_splits=5, shuffle=True, random_state=45)
xgbclf_grid_cv_2 = GridSearchCV(estimator=model_pipe, 
                          param_grid=xgb_params, 
                          cv=cv, 
                          scoring="accuracy", 
                          return_train_score=True,
                          n_jobs=-1)
xgbclf_grid_cv_2.fit(X, y_enc)

CPU times: total: 27.7 s
Wall time: 47min 43s


In [44]:
xgbclf_grid_cv_result_2 = get_grid_cv_result_df(xgbclf_grid_cv_2)
xgbclf_grid_cv_result_2[:10]

Unnamed: 0,param_model__device,param_model__gamma,param_model__max_depth,param_model__n_estimators,mean_test_score,mean_train_score,std_test_score,rank_test_score
33,cuda,0.03,4,170,0.832706,0.854962,0.003042,1
59,cuda,0.1,4,200,0.832667,0.858445,0.003495,2
32,cuda,0.03,4,150,0.832667,0.852665,0.003558,3
82,cuda,0.3,4,150,0.83251,0.852639,0.003035,4
86,cuda,0.3,5,130,0.832471,0.865087,0.00442,5
81,cuda,0.3,4,130,0.832458,0.850127,0.00272,6
89,cuda,0.3,5,200,0.832445,0.868021,0.004057,7
88,cuda,0.3,5,170,0.832445,0.868021,0.004057,7
39,cuda,0.03,5,200,0.832432,0.877414,0.00374,9
87,cuda,0.3,5,150,0.832432,0.868008,0.004077,10


In [49]:
xgbclf_grid_cv_result_2.to_csv("./xgbclf_grid_cv_result_2.csv")
save_model("./build_model", xgbclf_grid_cv_2.best_estimator_, "xgbclf_grid_cv_2_best_model")

xgbclf_grid_cv_2_best_model save success.


### XGB 네 번째 tunning

In [23]:
%%time

xgb_params = {
    "model__n_estimators": [120, 130, 150, 170, 190],
    "model__max_depth": [3, 4, 5, 6],
    "model__sub_sample": [0.3, 0.5, 0.7],
    "model__learning_rate": [0.01, 0.1, 1],
    "model__gamma": [0.01, 0.1, 1],
    "model__tree_method": ["hist"],
    "model__device": ["cuda"],
    #"model__max_leaves": [0, 1, 2, 3],
    #"model__bin": [100, 200, 255, 300],
    #"model__booster": ["gbtree", "dart"],
}

cv = KFold(n_splits=5, shuffle=True, random_state=45)
xgbclf_grid_cv_3 = GridSearchCV(estimator=model_pipe, 
                          param_grid=xgb_params, 
                          cv=cv, 
                          scoring="accuracy", 
                          return_train_score=True,
                          n_jobs=-1)
xgbclf_grid_cv_3.fit(X, y_enc)

CPU times: total: 1min 12s
Wall time: 3h 38min 55s


In [25]:
xgbclf_grid_cv_result_3 = get_grid_cv_result_df(xgbclf_grid_cv_3)
xgbclf_grid_cv_result_3[:5]

Unnamed: 0,param_model__device,param_model__gamma,param_model__learning_rate,param_model__max_depth,param_model__n_estimators,param_model__sub_sample,param_model__tree_method,mean_test_score,mean_train_score,std_test_score,rank_test_score
102,cuda,0.01,0.1,5,190,0.3,hist,0.832157,0.849039,0.003327,1
103,cuda,0.01,0.1,5,190,0.5,hist,0.832157,0.849039,0.003327,1
104,cuda,0.01,0.1,5,190,0.7,hist,0.832157,0.849039,0.003327,1
279,cuda,0.1,0.1,5,170,0.3,hist,0.832105,0.84719,0.0032,4
280,cuda,0.1,0.1,5,170,0.5,hist,0.832105,0.84719,0.0032,4


In [26]:
xgbclf_grid_cv_result_3.to_csv("./xgbclf_grid_cv_result_3.csv")
save_model("./build_model", xgbclf_grid_cv_3.best_estimator_, "xgbclf_grid_cv_3_best_model")

xgbclf_grid_cv_3_best_model save success.


### XGB 다섯 번째 tunning : XGBClassifier의 내장 onehot encoder 사용
- sklearn의 onehot encoder 보다 내장 encoder를 사용하는 것이 더 효과가 좋다고 함

In [27]:
model_pipe["engineering"]["preprocessor"].set_params(
    cat_transformer="passthrough"
)
model_pipe

In [31]:
%%time

xgb_params = {
    "model__n_estimators": [150, 170, 190, 200],
    "model__max_depth": [3, 4, 5, 6, 7],
    "model__sub_sample": [0.3, 0.5],
    "model__learning_rate": [0.1, 0.2],
    "model__gamma": [0.01, 0.1, 1],
    "model__enable_categorical": [True],
    "model__tree_method": ["hist"],
    "model__device": ["cuda"],
    #"model__max_leaves": [0, 1, 2, 3],
    "model__bin": [100, 200, 300],
    #"model__booster": ["gbtree", "dart"],
}

cv = KFold(n_splits=5, shuffle=True, random_state=45)
xgbclf_grid_cv_4 = GridSearchCV(estimator=model_pipe, 
                          param_grid=xgb_params, 
                          cv=cv, 
                          scoring="accuracy", 
                          return_train_score=True,
                          n_jobs=-1)
xgbclf_grid_cv_4.fit(X, y_enc)

CPU times: total: 1min 22s
Wall time: 2h 5min 18s


In [32]:
xgbclf_grid_cv_result_4 = get_grid_cv_result_df(xgbclf_grid_cv_4)
xgbclf_grid_cv_result_4[:5]

Unnamed: 0,param_model__bin,param_model__device,param_model__enable_categorical,param_model__gamma,param_model__learning_rate,param_model__max_depth,param_model__n_estimators,param_model__sub_sample,param_model__tree_method,mean_test_score,mean_train_score,std_test_score,rank_test_score
615,300,cuda,True,0.1,0.2,4,200,0.5,hist,0.83217,0.852015,0.003283,1
135,100,cuda,True,0.1,0.2,4,200,0.5,hist,0.83217,0.852015,0.003283,1
134,100,cuda,True,0.1,0.2,4,200,0.3,hist,0.83217,0.852015,0.003283,1
614,300,cuda,True,0.1,0.2,4,200,0.3,hist,0.83217,0.852015,0.003283,1
374,200,cuda,True,0.1,0.2,4,200,0.3,hist,0.83217,0.852015,0.003283,1


In [33]:
xgbclf_grid_cv_result_4.to_csv("./xgbclf_grid_cv_result_4.csv")
save_model("./build_model", xgbclf_grid_cv_4.best_estimator_, "xgbclf_grid_cv_4_best_model")

xgbclf_grid_cv_4_best_model save success.


## 6. Grid CV best estimator + Submission

### 저장 된 모델 별 grid cv result 병합 후 조회

In [92]:
grid_file_names = []
for f in os.listdir() : 
    if "_" in f : 
        if f.split("_")[1] == "grid" :
            grid_file_names.append(f)

grid_results = []        
for file_name in grid_file_names :
    model_name = file_name.split(".")[0]
    grid_result_df = pd.read_csv("./" + file_name).drop("Unnamed: 0", axis=1).head(1)
    grid_result_df["model"] = [model_name] * len(grid_result_df)
    new_cols = ["model"] + list(grid_result_df.columns)
    new_cols.pop(-1)
    grid_result_df = grid_result_df[new_cols]
    grid_results.append(grid_result_df)

In [95]:
compare_grid_model_test_score = pd.concat(grid_results, axis=0)[["model", "mean_test_score", "mean_train_score"]]\
    .sort_values("mean_test_score", ascending=False)\
    .reset_index(drop=True)
compare_grid_model_test_score

Unnamed: 0,model,mean_test_score,mean_train_score
0,histgb_grid_cv_result,0.832719,0.840279
1,xgbclf_grid_cv_result_2,0.832706,0.854962
2,xgbclf_grid_cv_result_4,0.83217,0.852015
3,xgbclf_grid_cv_result_3,0.832157,0.849039
4,xgbclf_grid_cv_result,0.831504,0.875748
5,lr_grid_cv_result,0.822003,0.824251
6,lda_grid_cv_result,0.811116,0.813227
7,rf_grid_cv_result,0.807,0.810074


### (1) Submission test - HistGB best model 
- **submission score : 0.83552**

In [101]:
histgb_grid_cv_best_model = pickle.load(open("./build_model/histgb_grid_cv_best_model.pkl", "rb"))
histgb_grid_cv_best_model

In [113]:
histgb_grid_cv_best_model.fit(X, y_enc)

submission_X = test_df.drop("id", axis=1)
pred = histgb_grid_cv_best_model.predict(submission_X)
histgb_submission_df = submission_df.copy()
histgb_submission_df["Target"] = pred
target_mapper = {0: "Dropout", 1: "Enrolled", 2: "Graduate"}
histgb_submission_df["Target"] = histgb_submission_df["Target"].map(target_mapper)
histgb_submission_df.head()

Unnamed: 0,id,Target
0,76518,Dropout
1,76519,Graduate
2,76520,Graduate
3,76521,Enrolled
4,76522,Enrolled


In [120]:
histgb_submission_df.to_csv("./submission/histgb_grid_cv_best_model_submission_df.csv", index=False)
[f for f in os.listdir("./submission") if f.split("_")[0][:4] in "hist"]

['histGB_best_estimator_dorp_association_features.csv',
 'histGB_best_estimator_feature_importances_69.csv',
 'histGB_best_estimator_feature_importances_70.csv',
 'histGB_best_estimator_feature_importances_79.csv',
 'histGB_best_estimator_feature_importance_119.csv',
 'histGB_best_estimator_submission_df.csv',
 'histGB_best_estimator_submission_df_2.csv',
 'histgb_grid_cv_best_model_submission_df.csv']

### (2) Submission - XGBclf best model 
- **submission score : 0.83620**
   - **모든 submission score 중에서 가장 높은 점수**

In [123]:
xgbclf_grid_cv_best_model_1 = pickle.load(open("./build_model/xgbclf_grid_cv_2_best_model.pkl", "rb"))
xgbclf_grid_cv_best_model_1

In [124]:
xgbclf_grid_cv_best_model_1.fit(X, y_enc)

submission_X = test_df.drop("id", axis=1)
pred = xgbclf_grid_cv_best_model_1.predict(submission_X)
xgbclf_1_submission_df = submission_df.copy()
xgbclf_1_submission_df["Target"] = pred
target_mapper = {0: "Dropout", 1: "Enrolled", 2: "Graduate"}
xgbclf_1_submission_df["Target"] = xgbclf_1_submission_df["Target"].map(target_mapper)
xgbclf_1_submission_df.head()

Unnamed: 0,id,Target
0,76518,Dropout
1,76519,Graduate
2,76520,Graduate
3,76521,Enrolled
4,76522,Enrolled


In [125]:
xgbclf_1_submission_df.to_csv("./submission/xgbclf_grid_cv_best_model_1_submission_df.csv", index=False)
[f for f in os.listdir("./submission") if "xgb" in f.split("_")[0][:4]]

['xgbclf_grid_cv_best_model_1_submission_df.csv']

### (3) Submission - XGBclf best model 
- **submission score : 0.83493**

In [128]:
xgbclf_grid_cv_best_model_2 = pickle.load(open("./build_model/xgbclf_grid_cv_4_best_model.pkl", "rb"))
xgbclf_grid_cv_best_model_2

In [129]:
xgbclf_grid_cv_best_model_2.fit(X, y_enc)

submission_X = test_df.drop("id", axis=1)
pred = xgbclf_grid_cv_best_model_2.predict(submission_X)
xgbclf_2_submission_df = submission_df.copy()
xgbclf_2_submission_df["Target"] = pred
target_mapper = {0: "Dropout", 1: "Enrolled", 2: "Graduate"}
xgbclf_2_submission_df["Target"] = xgbclf_2_submission_df["Target"].map(target_mapper)
xgbclf_2_submission_df.head()

Unnamed: 0,id,Target
0,76518,Dropout
1,76519,Graduate
2,76520,Graduate
3,76521,Enrolled
4,76522,Enrolled


In [130]:
xgbclf_2_submission_df.to_csv("./submission/xgbclf_grid_cv_best_model_2_submission_df.csv", index=False)
[f for f in os.listdir("./submission") if "xgb" in f.split("_")[0][:4]]

['xgbclf_grid_cv_best_model_1_submission_df.csv',
 'xgbclf_grid_cv_best_model_2_submission_df.csv']

### (4) Submission - LinearR best model 
- **submission score : 0.8281**

In [132]:
linearR_grid_cv_best_model = pickle.load(open("./build_model/lr_grid_cv_best_model.pkl", "rb"))
linearR_grid_cv_best_model

In [133]:
linearR_grid_cv_best_model.fit(X, y_enc)

submission_X = test_df.drop("id", axis=1)
pred = linearR_grid_cv_best_model.predict(submission_X)
linearR_submission_df = submission_df.copy()
linearR_submission_df["Target"] = pred
target_mapper = {0: "Dropout", 1: "Enrolled", 2: "Graduate"}
linearR_submission_df["Target"] = linearR_submission_df["Target"].map(target_mapper)
linearR_submission_df.head()

Unnamed: 0,id,Target
0,76518,Dropout
1,76519,Graduate
2,76520,Graduate
3,76521,Enrolled
4,76522,Enrolled


In [136]:
linearR_submission_df.to_csv("./submission/linearR_grid_cv_best_model_submission_df.csv", index=False)
[f for f in os.listdir("./submission") if "li" in f.split("_")[0][:4]]

['linearR_grid_cv_best_model_submission_df.csv']