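# Topics
* [Exploring the data](#exploring-the-data)
    + [Loading the data](#loading-the-data)
    + [Understanding the data](#understanding-the-data)
    + [Understanding the columns (features)](#understanding-the-columns-features)
* [Simple preprocessing](#simple-preprocessing)
    + [Expanding the date information in the targetDateTime column](#expanding-the-date-information-in-the-targetdatetime-column)
    + [Adding an age-group feature](#adding-an-age-group-feature)
    + [Mapping categorical variables to numbers](#mapping-categorical-variables-to-numbers)
    + [Scaling](#scaling)
* [Modeling](#modeling)
    + [Baseline model performance](#baseline-model-performance)
    + [Bagging](#bagging)
    + [Applying bagging to DecisionTree](#applying-bagging-to-decisiontree)
    + [Random forest](#random-forest)
    + [Extra trees](#extra-trees)
* [Bayesian optimization](#bayesian-optimization)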

# Exploring the data

- Survey data
- Goal: predict whether a recipient responds to the survey

### Loading the data

In [184]:
import pandas as pd
X_train = pd.read_csv('./X_train.csv', encoding = 'cp949') 
y_train = pd.read_csv('./y_train.csv', encoding = 'cp949') 
X_test = pd.read_csv('./X_test.csv', encoding = 'cp949') 
y_test = pd.read_csv('./y_test.csv', encoding = 'cp949') 

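### Understanding the data

- The data consists of four files: `X_train`, `X_test`, `y_train`, and `y_test`.
- Because the task is to predict whether a survey is answered, we build a model that uses the columns (features) of `X_train` to predict `y_train` (`status`).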

In [185]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300001 entries, 0 to 300000
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   PANEL_TYPE      300001 non-null  object
 1   targetDateTime  300001 non-null  object
 2   weekday         300001 non-null  int64 
 3   isholiday       300001 non-null  int64 
 4   gender          300001 non-null  object
 5   region          300001 non-null  object
 6   birthYear       300001 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 16.0+ MB


In [186]:
y_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300001 entries, 0 to 300000
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   status  300001 non-null  int64
dtypes: int64(1)
memory usage: 2.3 MB


In [187]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400001 entries, 0 to 400000
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   PANEL_TYPE      400001 non-null  object
 1   targetDateTime  400001 non-null  object
 2   weekday         400001 non-null  int64 
 3   isholiday       400001 non-null  int64 
 4   gender          400001 non-null  object
 5   region          400001 non-null  object
 6   birthYear       400001 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 21.4+ MB


In [188]:
y_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400001 entries, 0 to 400000
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   status  400001 non-null  int64
dtypes: int64(1)
memory usage: 3.1 MB


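### Understanding the columns (features)

- `PANEL_TYPE` : type of survey company
- `targetDateTime` : time the survey was sent
- `weekday` : day of the week the survey was sent
- `isholiday` : whether the day was a public holiday
- `gender` : gender
- `region` : region
- `birthYear` : year of birth
- `status` : whether the recipient responded (target)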

In [189]:
X_train.head()

| | PANEL_TYPE | targetDateTime | weekday | isholiday | gender | region | birthYear |
|---|---|---|---|---|---|---|---|
| 0 | B | 2020-10-02 14:31:33.907000+00:00 | 4 | 1 | 남성 | 대구 | 1984 |
| 1 | A | 2020-10-31 02:41:16.345000+00:00 | 5 | 0 | 여성 | 경기 | 1972 |
| 2 | B | 2020-11-22 10:00:22.825000+00:00 | 6 | 0 | 여성 | 서울 | 1985 |
| 3 | B | 2020-12-29 23:01:23.912000+00:00 | 1 | 0 | 남성 | 서울 | 1957 |
| 4 | A | 2020-12-09 22:03:50.542000+00:00 | 2 | 0 | 남성 | 광주 | 1974 |


In [190]:
y_train.head()

| | status |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 1 |
| 4 | 0 |


# Simple preprocessing

### Expanding the date information in the targetDateTime column

In [191]:
# Split out the date parts (Date)
X_train['Year'] = X_train['targetDateTime'].str.split(' ').str[0].str.split('-').str[0].astype('int64')
X_train['Month'] = X_train['targetDateTime'].str.split(' ').str[0].str.split('-').str[1].astype('int64')
X_train['Day'] = X_train['targetDateTime'].str.split(' ').str[0].str.split('-').str[2].astype('int64')

X_test['Year'] = X_test['targetDateTime'].str.split(' ').str[0].str.split('-').str[0].astype('int64')
X_test['Month'] = X_test['targetDateTime'].str.split(' ').str[0].str.split('-').str[1].astype('int64')
X_test['Day'] = X_test['targetDateTime'].str.split(' ').str[0].str.split('-').str[2].astype('int64')

# Split out the time parts (Time)
X_train['Hour'] = X_train['targetDateTime'].str.split(' ').str[1].str.split(':').str[0].astype('int64')
X_train['Minute'] = X_train['targetDateTime'].str.split(' ').str[1].str.split(':').str[1].astype('int64')

X_test['Hour'] = X_test['targetDateTime'].str.split(' ').str[1].str.split(':').str[0].astype('int64')
X_test['Minute'] = X_test['targetDateTime'].str.split(' ').str[1].str.split(':').str[1].astype('int64')

# The parts are now separate columns, so drop the original datetime column
X_train.drop('targetDateTime',axis=1,inplace=True)
X_test.drop('targetDateTime',axis=1,inplace=True)
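
The same date parts can also be extracted with pandas' datetime accessor instead of chained string splits. The following is a minimal sketch, assuming every timestamp is formatted like the sample rows shown earlier; the `add_date_parts` helper and the two-row `demo` frame are hypothetical names used only for illustration.

```python
import pandas as pd

def add_date_parts(df, col="targetDateTime"):
    """Return a copy of df with Year/Month/Day/Hour/Minute columns and without `col`."""
    out = df.copy()
    # Parse strings such as '2020-10-02 14:31:33.907000+00:00' into UTC timestamps
    ts = pd.to_datetime(out[col], utc=True)
    out["Year"] = ts.dt.year
    out["Month"] = ts.dt.month
    out["Day"] = ts.dt.day
    out["Hour"] = ts.dt.hour
    out["Minute"] = ts.dt.minute
    return out.drop(columns=[col])

# Toy frame standing in for X_train (assumption: illustration only)
demo = pd.DataFrame({"targetDateTime": ["2020-10-02 14:31:33.907000+00:00",
                                        "2020-12-09 22:03:50.542000+00:00"]})
parts = add_date_parts(demo)
```

Applying one helper to both `X_train` and `X_test` keeps the two frames in sync and avoids repeating each split expression twice.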

In [192]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300001 entries, 0 to 300000
Data columns (total 11 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   PANEL_TYPE  300001 non-null  object
 1   weekday     300001 non-null  int64 
 2   isholiday   300001 non-null  int64 
 3   gender      300001 non-null  object
 4   region      300001 non-null  object
 5   birthYear   300001 non-null  int64 
 6   Year        300001 non-null  int64 
 7   Month       300001 non-null  int64 
 8   Day         300001 non-null  int64 
 9   Hour        300001 non-null  int64 
 10  Minute      300001 non-null  int64 
dtypes: int64(8), object(3)
memory usage: 25.2+ MB


### Adding an age-group feature

In [203]:
X_train['age'] = ((X_train['Year'].astype('int') - X_train['birthYear']) // 10 * 10).astype('str') + '대'
X_test['age'] = ((X_test['Year'].astype('int') - X_test['birthYear']) // 10 * 10).astype('str') + '대'
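
The expression above subtracts the birth year from the survey year, floors the result to a multiple of ten, and appends the suffix `대`. A small sketch of that arithmetic on single values follows; the `age_group` function is a hypothetical helper, and note that the year difference only approximates the true age because the birth month and day are unknown.

```python
def age_group(survey_year, birth_year):
    """Bucket an approximate age into a decade label such as '30대'."""
    # Integer-divide by 10 and multiply back to floor to the decade
    decade = (survey_year - birth_year) // 10 * 10
    return f"{decade}대"

labels = [age_group(2020, 1984), age_group(2020, 1957), age_group(2021, 1967)]
```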

In [204]:
X_train

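| | PANEL_TYPE | weekday | isholiday | gender | region | birthYear | Year | Month | Day | Hour | Minute | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B | 4 | 1 | 남성 | 대구 | 1984 | 2020 | 10 | 2 | 14 | 31 | 30대 |
| 1 | A | 5 | 0 | 여성 | 경기 | 1972 | 2020 | 10 | 31 | 2 | 41 | 40대 |
| 2 | B | 6 | 0 | 여성 | 서울 | 1985 | 2020 | 11 | 22 | 10 | 0 | 30대 |
| 3 | B | 1 | 0 | 남성 | 서울 | 1957 | 2020 | 12 | 29 | 23 | 1 | 60대 |
| 4 | A | 2 | 0 | 남성 | 광주 | 1974 | 2020 | 12 | 9 | 22 | 3 | 40대 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 299996 | A | 3 | 0 | 남성 | 광주 | 1996 | 2020 | 8 | 27 | 5 | 25 | 20대 |
| 299997 | B | 3 | 0 | 여성 | 서울 | 1963 | 2020 | 2 | 6 | 4 | 41 | 50대 |
| 299998 | A | 6 | 0 | 여성 | 서울 | 1967 | 2021 | 2 | 28 | 8 | 1 | 50대 |
| 299999 | A | 0 | 0 | 남성 | 전남 | 1995 | 2020 | 11 | 23 | 23 | 4 | 20대 |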


In [205]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300001 entries, 0 to 300000
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   PANEL_TYPE  300001 non-null  object
 1   weekday     300001 non-null  int64 
 2   isholiday   300001 non-null  int64 
 3   gender      300001 non-null  object
 4   region      300001 non-null  object
 5   birthYear   300001 non-null  int64 
 6   Year        300001 non-null  int64 
 7   Month       300001 non-null  int64 
 8   Day         300001 non-null  int64 
 9   Hour        300001 non-null  int64 
 10  Minute      300001 non-null  int64 
 11  age         300001 non-null  object
dtypes: int64(8), object(4)
memory usage: 27.5+ MB


In [206]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400001 entries, 0 to 400000
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   PANEL_TYPE  400001 non-null  object
 1   weekday     400001 non-null  int64 
 2   isholiday   400001 non-null  int64 
 3   gender      400001 non-null  object
 4   region      400001 non-null  object
 5   birthYear   400001 non-null  int64 
 6   Year        400001 non-null  int64 
 7   Month       400001 non-null  int64 
 8   Day         400001 non-null  int64 
 9   Hour        400001 non-null  int64 
 10  Minute      400001 non-null  int64 
 11  age         400001 non-null  object
dtypes: int64(8), object(4)
memory usage: 36.6+ MB


### Mapping categorical variables to numbers

In [207]:
# Temporarily concatenate X_train and X_test so that one-hot encoding yields the same columns for both
df = pd.concat([X_train,X_test])
df = pd.concat([df,pd.get_dummies(df[['PANEL_TYPE','gender','region','age']])], axis=1)
df.drop(columns = ['PANEL_TYPE','gender','region','age'], inplace = True)

# Split back into train and test
X_train = df.iloc[:300001]
X_test = df.iloc[300001:]
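
Concatenating the two frames guarantees identical dummy columns, but it also relies on a hard-coded row count to split them again. One alternative, sketched below under the assumption that the training categories define the feature set, is to encode each frame separately and align the test columns to the training columns. The toy `train` and `test` frames are hypothetical and are not the survey data.

```python
import pandas as pd

# Toy frames standing in for X_train / X_test (assumption: illustration only)
train = pd.DataFrame({"region": ["서울", "대구", "서울"], "weekday": [1, 2, 3]})
test = pd.DataFrame({"region": ["서울", "부산"], "weekday": [4, 5]})

# One-hot encode each frame on its own
train_enc = pd.get_dummies(train, columns=["region"])
test_enc = pd.get_dummies(test, columns=["region"])

# Align test to the training columns: categories unseen in training are dropped,
# and training categories absent from test are added as all-zero columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

With this approach no information about the test categories influences the training feature set, at the cost of discarding categories that appear only in the test data.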

In [208]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 300001 entries, 0 to 300000
Data columns (total 39 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   weekday       300001 non-null  int64
 1   isholiday     300001 non-null  int64
 2   birthYear     300001 non-null  int64
 3   Year          300001 non-null  int64
 4   Month         300001 non-null  int64
 5   Day           300001 non-null  int64
 6   Hour          300001 non-null  int64
 7   Minute        300001 non-null  int64
 8   PANEL_TYPE_A  300001 non-null  uint8
 9   PANEL_TYPE_B  300001 non-null  uint8
 10  PANEL_TYPE_C  300001 non-null  uint8
 11  gender_남성     300001 non-null  uint8
 12  gender_여성     300001 non-null  uint8
 13  region_강원     300001 non-null  uint8
 14  region_경기     300001 non-null  uint8
 15  region_경남     300001 non-null  uint8
 16  region_경북     300001 non-null  uint8
 17  region_광주     300001 non-null  uint8
 18  region_대구     300001 non-null  uint8
 19  re

In [209]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 400001 entries, 0 to 400000
Data columns (total 39 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   weekday       400001 non-null  int64
 1   isholiday     400001 non-null  int64
 2   birthYear     400001 non-null  int64
 3   Year          400001 non-null  int64
 4   Month         400001 non-null  int64
 5   Day           400001 non-null  int64
 6   Hour          400001 non-null  int64
 7   Minute        400001 non-null  int64
 8   PANEL_TYPE_A  400001 non-null  uint8
 9   PANEL_TYPE_B  400001 non-null  uint8
 10  PANEL_TYPE_C  400001 non-null  uint8
 11  gender_남성     400001 non-null  uint8
 12  gender_여성     400001 non-null  uint8
 13  region_강원     400001 non-null  uint8
 14  region_경기     400001 non-null  uint8
 15  region_경남     400001 non-null  uint8
 16  region_경북     400001 non-null  uint8
 17  region_광주     400001 non-null  uint8
 18  region_대구     400001 non-null  uint8
 19  re

### Scaling

In [210]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Modeling

### Baseline model performance

In [211]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings(action = 'ignore')

lr = LogisticRegression()
dt = DecisionTreeClassifier(max_depth = 10)
models = [lr,dt]

print('<Mean cross-validation score>')
for model in models:
    print('{} : {}'.format(type(model).__name__,cross_val_score(model, X_train, y_train, cv = 5, scoring = 'roc_auc').mean()))
print('<Test set score>')
for model in models:
    model.fit(X_train, y_train)
    print('{} : {}'.format(type(model).__name__, roc_auc_score(y_test, model.predict_proba(X_test)[:,1])))

<Mean cross-validation score>
LogisticRegression : 0.7329315932152592
DecisionTreeClassifier : 0.7725058397353664
<Test set score>
LogisticRegression : 0.73257992257018
DecisionTreeClassifier : 0.7779891295088074


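## Bagging

< Parameter >
- `base_estimator` : the model to apply bagging to (renamed `estimator` in scikit-learn 1.2)
- `n_estimators` : number of classifiers in the ensemble
- `max_samples` : number (or fraction) of samples (rows) to draw at random for each classifier
- `bootstrap` : whether rows are sampled with replacement
- `max_features` : number (or fraction) of features (columns) to draw at random for each classifier
- `bootstrap_features` : whether columns are sampled with replacement
- `...`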

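```python
# Bagging: rows are sampled with replacement
bagging = dict(bootstrap=True)
# Pasting: rows are sampled without replacement
pasting = dict(bootstrap=False)
# Random patches: both rows and features are sampled
# (0.8 is an illustrative value; any max_samples / max_features < 1.0 applies)
random_patches = dict(bootstrap=True, max_samples=0.8,
                      bootstrap_features=True, max_features=0.8)
# Random subspaces: all rows are kept, only features are sampled
random_subspaces = dict(bootstrap=False, max_samples=1.0,
                        bootstrap_features=True, max_features=0.8)
```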
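
The `bootstrap` flag decides whether rows are drawn with or without replacement. The NumPy sketch below illustrates that difference on row indices only; it is not scikit-learn's internal code, and the row count and the 0.8 fraction are arbitrary values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000

# bootstrap=True style: draw n_rows indices WITH replacement (duplicates allowed)
with_repl = rng.choice(n_rows, size=n_rows, replace=True)

# bootstrap=False style with max_samples=0.8: draw 80% of the rows WITHOUT replacement
without_repl = rng.choice(n_rows, size=int(0.8 * n_rows), replace=False)

# Fraction of distinct rows that ended up in the bootstrap sample
unique_fraction = np.unique(with_repl).size / n_rows
print(f"distinct rows in bootstrap sample: {unique_fraction:.2f}")
```

Sampling with replacement leaves roughly a third of the rows out of each sample (about 1/e of them for large data), which is what makes out-of-bag evaluation possible.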

### Applying bagging to DecisionTree

In [212]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_depth = 10), n_estimators=100,
    max_samples=0.8, bootstrap=True, random_state=42)
models = [bag_clf]

print('<Mean cross-validation score>')
for model in models:
    print('{} : {}'.format(type(model).__name__,cross_val_score(model, X_train, y_train, cv = 5, scoring = 'roc_auc').mean()))
print('<Test set score>')
for model in models:
    model.fit(X_train, y_train)
    print('{} : {}'.format(type(model).__name__, roc_auc_score(y_test, model.predict_proba(X_test)[:,1])))

<Mean cross-validation score>
BaggingClassifier : 0.7924181729041783
<Test set score>
BaggingClassifier : 0.7957090588291086


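## Random forest

< Parameter >
- `n_estimators` : number of classifiers (decision trees) in the ensemble
- `max_features` : number of features (columns) to sample at random
- `max_samples` : number of samples (rows) to sample at random
- `bootstrap` : whether rows are sampled with replacement
- `max_depth` : maximum depth of each tree
- `criterion` : {'gini','entropy'} : function used to measure the quality of a split
- `min_samples_split` : minimum number of samples required to split a node
- `min_samples_leaf` : minimum number of samples required at a leaf node
- `...`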

In [213]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=100, max_depth = 10, random_state=42)
models = [rnd_clf]

print('<Mean cross-validation score>')
for model in models:
    print('{} : {}'.format(type(model).__name__,cross_val_score(model, X_train, y_train, cv = 5, scoring = 'roc_auc').mean()))
print('<Test set score>')
for model in models:
    model.fit(X_train, y_train)
    print('{} : {}'.format(type(model).__name__, roc_auc_score(y_test, model.predict_proba(X_test)[:,1])))

<Mean cross-validation score>
RandomForestClassifier : 0.7722007900664852
<Test set score>
RandomForestClassifier : 0.7743317764994401


## Extra trees

- The parameters are nearly the same as those of the other tree models.

In [214]:
from sklearn.ensemble import ExtraTreesClassifier

etc_clf =  ExtraTreesClassifier(n_estimators = 100, max_depth = 10, random_state = 42)
models = [etc_clf]

print('<Mean cross-validation score>')
for model in models:
    print('{} : {}'.format(type(model).__name__,cross_val_score(model, X_train, y_train, cv = 5, scoring = 'roc_auc').mean()))
print('<Test set score>')
for model in models:
    model.fit(X_train, y_train)
    print('{} : {}'.format(type(model).__name__, roc_auc_score(y_test, model.predict_proba(X_test)[:,1])))

<Mean cross-validation score>
ExtraTreesClassifier : 0.7577923649762845
<Test set score>
ExtraTreesClassifier : 0.7603551973874395


# Bayesian optimization

In [215]:
!pip install bayesian-optimization



In [216]:
from bayes_opt import BayesianOptimization
from sklearn.metrics import log_loss

In [217]:
# Step 1. Specify the search range of each hyperparameter as a dictionary.
pbounds = {'n_estimators': (10,30),
            'max_depth': (5,10)}

# Step 2. Define a function whose arguments are the keys of the dictionary from Step 1
def rnd_opt(n_estimators, max_depth):

    # Step 3. Set the hyperparameters to try (the optimizer proposes floats, so round to integers)
    params = {
        'n_estimators' : int(round(n_estimators,0)),
        'max_depth' : int(round(max_depth,0))
    }
    
    # Step 4. Build and fit the model
    rnd_clf = RandomForestClassifier(**params)
    rnd_clf.fit(X_train, y_train)
    
    # Step 5. Define the score to maximize
    score = roc_auc_score(y_test, rnd_clf.predict_proba(X_test)[:,1])
    
    return score
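
The objective above is scored on `X_test`, so the selected hyperparameters are tuned against the test set and the reported test score becomes optimistic. A common alternative is to score the objective with cross-validation on the training data only. The sketch below shows that variant under stated assumptions: `rnd_opt_cv` is a hypothetical name, and `X_demo` / `y_demo` are synthetic stand-ins for `X_train` / `y_train` so that the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train (assumption: illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

def rnd_opt_cv(n_estimators, max_depth):
    """Objective scored by 3-fold cross-validated ROC AUC on training data only."""
    params = {
        "n_estimators": int(round(n_estimators)),  # the optimizer proposes floats
        "max_depth": int(round(max_depth)),
        "random_state": 42,  # fixed seed so repeated calls give the same score
    }
    model = RandomForestClassifier(**params)
    scores = cross_val_score(model, X_demo, y_demo, cv=3, scoring="roc_auc")
    return scores.mean()

# Single evaluation at one point of the search space
score = rnd_opt_cv(n_estimators=12.4, max_depth=5.6)
```

A function of this shape can be passed as `f` to `BayesianOptimization` together with the same `pbounds`, leaving the test set for a single final evaluation.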

In [218]:
# Step 6. Create the BayesianOptimization object
BO_rnd = BayesianOptimization(f = rnd_opt, pbounds = pbounds, random_state=42) # f: function to maximize, pbounds: search range

# Step 7. Maximize
BO_rnd.maximize(init_points=5, n_iter=25) # init_points: number of initial random evaluations, n_iter: number of additional optimization steps

|   iter    |  target   | max_depth | n_esti... |
-------------------------------------------------
|  1        |  0.7615   |  6.873    |  29.01    |
|  2        |  0.7724   |  8.66     |  21.97    |
|  3        |  0.7532   |  5.78     |  13.12    |
|  4        |  0.7507   |  5.29     |  27.32    |
|  5        |  0.7624   |  8.006    |  24.16    |
|  6        |  0.7719   |  9.616    |  22.43    |
|  7        |  0.7728   |  10.0     |  19.57    |
|  8        |  0.7516   |  5.954    |  19.31    |
|  9        |  0.77     |  10.0     |  30.0     |
|  10       |  0.7684   |  10.0     |  10.0     |
|  11       |  0.7713   |  10.0     |  16.26