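# Next-day fine dust (PM10) concentration prediction model for Jongno-gu, Seoul

- Machine learning
- Classification model
- Data : air pollution measurements for Seoul, Gyeonggi, and Incheon (Korea Environment Corporation); synoptic weather observations for Seoul, Gyeonggi, and Incheon (Korea Meteorological Administration)
- Independent variables
    - PM10 (fine dust), O3 (ozone), NO2 (nitrogen dioxide), CO (carbon monoxide), and SO2 (sulfur dioxide) from 118 monitoring stations in the Seoul metropolitan area
    - From 9 weather stations in the metropolitan area: temperature, precipitation, wind speed, wind direction, humidity, vapor pressure, dew point temperature, station pressure, sea-level pressure, sunshine duration, solar radiation, snow depth, 3-hour fresh snowfall, total cloud cover, mid/low-level cloud cover, cloud type, lowest cloud height, visibility, ground condition, weather phenomenon code, ground surface temperature, soil temperature
- Dependent variable : next-day PM10 concentration, binned by the air-quality standard into '0 : good, 1 : moderate, 2 : bad, 3 : very bad'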
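
The binning of the target can be sketched as follows. This is only an illustration: the cut-offs of 30, 80, and 150 µg/m³ are an assumption (the thresholds used to build `PM10_y_bin` are not shown in this notebook), and `bin_pm10` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Assumed PM10 cut-offs in µg/m³: good <= 30, moderate <= 80, bad <= 150, very bad > 150.
PM10_BINS = [-np.inf, 30, 80, 150, np.inf]

def bin_pm10(values):
    """Map PM10 concentrations to the ordinal classes 0..3 (hypothetical helper)."""
    return pd.cut(values, bins=PM10_BINS, labels=[0, 1, 2, 3]).astype(int)

sample = pd.Series([12, 30, 31, 80, 81, 150, 151, 300])
binned = bin_pm10(sample)
print(binned.tolist())
```

Because the classes are ordered, the same labels can also be treated as numeric values, which is what makes regression models usable on this target later on.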

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, cross_validate, RandomizedSearchCV
from sklearn.preprocessing import RobustScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier, ExtraTreesRegressor, RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_squared_error, mean_squared_log_error, mean_absolute_error, confusion_matrix, classification_report
from catboost import CatBoostClassifier
from xgboost import XGBClassifier, XGBRegressor
from lightgbm import LGBMClassifier
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from scipy.stats import uniform, randint
from sklearn.ensemble import VotingClassifier, StackingClassifier

# 1. Baseline models

In [2]:
ndf = pd.read_csv("./data/preprodata/pollutant_weather_df.csv")

In [3]:
x = ndf.drop(["datetime", "PM10_y_bin"], axis = 1)
y = ndf["PM10_y_bin"]

In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=8, stratify = y)
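
The effect of `stratify = y` can be checked on synthetic labels. The sketch below is not the project data: the class proportions are made up to resemble an imbalanced four-class target, and `class_share` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
# Synthetic, imbalanced four-class labels (proportions are illustrative only).
y_demo = rng.choice([0, 1, 2, 3], size=2000, p=[0.43, 0.50, 0.06, 0.01])
X_demo = rng.normal(size=(2000, 5))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=8, stratify=y_demo
)

def class_share(labels):
    """Fraction of samples in each of the four classes."""
    return np.bincount(labels, minlength=4) / len(labels)

train_share, test_share = class_share(y_tr), class_share(y_te)
print(np.round(train_share, 3), np.round(test_share, 3))
```

With stratification the rare classes appear in both splits at nearly the same rate, which matters when the smallest class has only a few hundred samples.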

## 1-1. KNN

In [5]:
kn = KNeighborsClassifier()

In [6]:
kn.fit(x_train, y_train)

In [7]:
kn.score(x_train, y_train)

0.7731748293388342

In [8]:
kn.score(x_test, y_test)

0.6522602093981045

## 1-2. Random forest

In [6]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((47023, 2516), (47023,), (20153, 2516), (20153,))

In [9]:
rf = RandomForestClassifier(n_jobs = -1, random_state = 8)

In [27]:
scores = cross_validate(rf, x_train, y_train, return_train_score = True, n_jobs = -1)
print(np.mean(scores["train_score"]), np.mean(scores["test_score"]))

1.0 0.7941433950420048


In [10]:
rf.fit(x_train, y_train)

In [11]:
rf.score(x_train, y_train)

0.9999787338111137

In [12]:
rf.score(x_test, y_test)

0.8110455019103856

## 1-3. Extra trees

In [5]:
et = ExtraTreesClassifier(n_jobs = -1, random_state = 8)

In [28]:
scores = cross_validate(et, x_train, y_train, return_train_score = True, n_jobs = -1)
print(np.mean(scores["train_score"]), np.mean(scores["test_score"]))

1.0 0.8180253179899466


In [6]:
et.fit(x_train, y_train)

In [9]:
et.score(x_train, y_train)

1.0

In [10]:
et.score(x_test, y_test)

0.8358557038654295

## 1-4. CatBoost

In [15]:
cb = CatBoostClassifier()

In [None]:
scores = cross_validate(cb, x_train, y_train, return_train_score = True, n_jobs = -1)
print(np.mean(scores["train_score"]), np.mean(scores["test_score"]))

In [16]:
cb.fit(x_train, y_train)

Learning rate set to 0.096297
0:	learn: 1.2824771	total: 613ms	remaining: 10m 12s
1:	learn: 1.2001951	total: 996ms	remaining: 8m 16s
2:	learn: 1.1359325	total: 1.35s	remaining: 7m 28s
...
158:	learn: 0.6007288	total: 1m 6s	remaining: 5m 50s
...
311:	learn: 0.5355896	total: 2m 9s	remaining: 4m 44s
...
464:	learn: 0.4901854	total: 3m 17s	remaining: 3m 47s
...
617:	learn: 0.4518979	total: 4m 34s	remaining: 2m 49s
...
770:	learn: 0.4203546	total: 5m 52s	remaining: 1m 44s
...
942:	learn: 0.3932492	total: 7m 15s	remaining: 26.3s
943:	learn: 0.3930985	total: 7m 16s	remaining: 25.9s
...

<catboost.core.CatBoostClassifier at 0x17a70da08b0>

In [18]:
cb.score(x_train, y_train)

0.8707015715713587

In [17]:
cb.score(x_test, y_test)

0.7933310177144842

## 1-5. XGBoost

In [5]:
xgb = XGBClassifier()

In [6]:
scores = cross_validate(xgb, x_train, y_train, return_train_score = True, n_jobs = -1)
print(np.mean(scores["train_score"]), np.mean(scores["test_score"]))

0.9522839900915713 0.8075411438253678


In [20]:
xgb.fit(x_train, y_train)

In [21]:
xgb.score(x_train, y_train)

0.9379027284520342

In [22]:
xgb.score(x_test, y_test)

0.8139731057410807

## 1-6. LightGBM

In [22]:
lgbm = LGBMClassifier(learning_rate = 0.0003, n_estimators = 10000)

In [8]:
scores = cross_validate(lgbm, x_train, y_train, return_train_score = True, n_jobs = -1)
print(np.mean(scores["train_score"]), np.mean(scores["test_score"]))

0.8810582042951269 0.7808307435771672


In [23]:
lgbm.fit(x_train, y_train)

In [24]:
lgbm.score(x_train, y_train)

0.7673478935839908

In [25]:
lgbm.score(x_test, y_test)

0.7385997122016573

## 1-7. PCA (principal component analysis)

### Scaling

In [68]:
rb = RobustScaler()
scaled_x = rb.fit_transform(x)

### PCA

In [71]:
# Fit PCA with all components (note: fit on the raw x; scaled_x from the scaling cell is not used here)
pca = PCA()
pca.fit(x)
pd.Series(np.cumsum(pca.explained_variance_ratio_))[:]

0      0.823536
1      0.845935
2      0.863830
3      0.875720
4      0.884367
         ...   
503    1.000000
504    1.000000
505    1.000000
506    1.000000
507    1.000000
Length: 508, dtype: float64

In [95]:
pd.Series(np.cumsum(pca.explained_variance_ratio_))[507]

0.9999999999999998

In [96]:
n_pca = PCA(507)
arr_pca = n_pca.fit_transform(x)

In [97]:
col = [f"components_{i}" for i in range(0, 507)]

In [98]:
pca_x = pd.DataFrame(arr_pca, columns = col)
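
Picking the number of components from the cumulative explained variance, as in the cells above, can be sketched on synthetic data. The 0.95 target and the use of the scaled matrix as PCA input are assumptions for illustration, not the settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the feature matrix.
X_demo, _ = make_classification(n_samples=500, n_features=40, n_informative=10, random_state=8)

# Scale first, then fit a full PCA to read off the cumulative explained variance.
X_scaled = RobustScaler().fit_transform(X_demo)
cum_var = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

target = 0.95  # hypothetical share of variance to retain
n_keep = int(np.argmax(cum_var >= target) + 1)  # first index reaching the target, as a count

X_reduced = PCA(n_components=n_keep).fit_transform(X_scaled)
print(n_keep, X_reduced.shape)
```

A target below 1.0 drops the trailing components that add almost no variance, whereas keeping 507 of 508 components leaves the dimensionality practically unchanged.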

### Train/test split

In [99]:
x_train, x_test, y_train, y_test = train_test_split(pca_x, y, test_size = 0.3, random_state = 8, stratify = y)

In [100]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((47023, 507), (47023,), (20153, 507), (20153,))

### Training and evaluation

In [101]:
et5 = ExtraTreesClassifier()

In [102]:
scores5 = cross_validate(et5, x_train, y_train, return_train_score = True, n_jobs = -1)

In [103]:
print(np.mean(scores5["train_score"]), np.mean(scores5["test_score"]))

1.0 0.7035494504922968


In [104]:
et5.fit(x_train, y_train)

In [105]:
et5.score(x_train, y_train)

1.0

In [106]:
et5.score(x_test, y_test)

0.7106138043963678

## 1-8. Support vector machine

In [7]:
svm_model = SVC(random_state=8)

In [8]:
svm_model.fit(x_train, y_train)

SVC(random_state=8)

In [9]:
y_pred = svm_model.predict(x_test)

In [10]:
svm_mat = confusion_matrix(y_test, y_pred)
print(svm_mat)

[[5440 3236    0    0]
 [2067 8084    0    0]
 [  61 1120    0    2]
 [  27  107    0    9]]


In [11]:
svm_report = classification_report(y_test, y_pred)
print(svm_report)

              precision    recall  f1-score   support

           0       0.72      0.63      0.67      8676
           1       0.64      0.80      0.71     10151
           2       0.00      0.00      0.00      1183
           3       0.82      0.06      0.12       143

    accuracy                           0.67     20153
   macro avg       0.54      0.37      0.37     20153
weighted avg       0.64      0.67      0.65     20153



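
The per-class figures in the report can be reproduced directly from the confusion matrix printed above (rows are true classes, columns are predicted classes). The sketch recomputes recall, precision, and accuracy with NumPy only; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the SVM output above.
svm_mat = np.array([
    [5440, 3236, 0, 0],
    [2067, 8084, 0, 0],
    [  61, 1120, 0, 2],
    [  27,  107, 0, 9],
])

true_pos = np.diag(svm_mat)
support = svm_mat.sum(axis=1)     # number of true samples per class
predicted = svm_mat.sum(axis=0)   # number of predictions per class

recall = true_pos / support
# Precision is undefined when a class is never predicted; report 0 in that case.
precision = np.divide(true_pos, predicted, out=np.zeros(len(true_pos)), where=predicted > 0)
accuracy = true_pos.sum() / svm_mat.sum()

print(np.round(recall, 2), np.round(precision, 2), round(accuracy, 2))
```

Class 2 ("bad") is never predicted, so its column sums to zero. That is why its precision and recall are reported as 0.00, and the overall accuracy of 0.67 comes almost entirely from classes 0 and 1.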


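- Baseline results for several classifiers (mostly tree-based ensembles). Values are cross-validation accuracy unless noted:
    - KNeighborsClassifier : 0.652 (test accuracy; cross-validation was not run)
    - RandomForestClassifier : 0.794
    - ExtraTreesClassifier : 0.818
    - CatBoostClassifier : cross-validation not completed because of performance issues (test accuracy of a single fit: 0.793)
    - XGBClassifier : 0.808
    - LightGBM : 0.781
    - PCA + ExtraTreesClassifier : 0.704 (test accuracy 0.711)
    - SVM : 0.67 (test accuracy)
- Extra trees performed best (extra trees draw split thresholds at random for randomly selected candidate features)
- However, overfitting is visible across the models: training accuracy is well above validation accuracy (1.0 vs. about 0.8 for random forest and extra trees). Parameter tuning is needed
- Before parameter tuning, feature selection is revisited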
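
The train-versus-validation comparison behind the list above can be gathered in one table. This is a minimal sketch on synthetic data with small models, so the numbers it prints are unrelated to the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic four-class data standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=4, n_clusters_per_class=1, random_state=8,
)

models = {
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=8),
    "extra_trees": ExtraTreesClassifier(n_estimators=50, random_state=8),
}

rows = []
for name, model in models.items():
    cv = cross_validate(model, X_demo, y_demo, cv=5, return_train_score=True)
    train, valid = np.mean(cv["train_score"]), np.mean(cv["test_score"])
    # A large positive gap between train and validation accuracy signals overfitting.
    rows.append({"model": name, "train": train, "cv": valid, "gap": train - valid})

summary = pd.DataFrame(rows).sort_values("cv", ascending=False)
print(summary)
```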

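# 2. Feature selection
- Feature selection
    - Use only the features that had high feature_importances_ in the classification model
- Use extra trees, which had the highest cross-validation score

### Feature selection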

In [7]:
# feature importances
col_lst = x_train.columns.tolist()
feat_imp = et.feature_importances_.tolist()

feature_importances = pd.DataFrame({
    "column" : col_lst,
    "importance" : feat_imp
})

In [9]:
feat_imp = feature_importances[feature_importances["importance"] > 0.0001].iloc[:,0].tolist()

In [9]:
len(feat_imp)

776

### Training and evaluation

In [10]:
x = ndf[feat_imp]
y = ndf["PM10_y_bin"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 8, stratify = y)

In [175]:
et2 = ExtraTreesClassifier(n_jobs = -1, random_state = 8)

In [176]:
scores2 = cross_validate(et2, x_train, y_train, return_train_score = True, n_jobs = -1)
print(np.mean(scores2["train_score"]), np.mean(scores2["test_score"]))

1.0 0.8210026319294492


In [177]:
et2.fit(x_train, y_train)

In [178]:
et2.score(x_train, y_train)

1.0

In [179]:
et2.score(x_test, y_test)

0.8362526670967102

In [114]:
et_pred = et2.predict(x_test)
accuracy_score(y_test, et_pred)

0.83054632064705

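- Selecting features by feature_importances_ improved performance slightly (cross-validation accuracy 0.818 -> 0.821); score by importance threshold:
    - 0.0001 -> 0.8245
    - 0.00009 -> 0.8235
    - 0.00007 -> 0.8225
    - 0.00005 -> 0.8243
    - 0.00001 -> 0.8230
- The selected features (importance > 0.0001) are used from here on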
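
A threshold sweep like the one summarized above can be sketched as a small loop. The data are synthetic, the thresholds are chosen to suit that data rather than this project, and `cv_score_for_threshold` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: 6 informative features among 30.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=30, n_informative=6, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=8,
)

# Importances from a reference extra-trees fit (they sum to 1 across features).
importances = ExtraTreesClassifier(n_estimators=100, random_state=8).fit(X_demo, y_demo).feature_importances_

def cv_score_for_threshold(threshold):
    """Keep features above the threshold and return (n_kept, mean CV accuracy)."""
    mask = importances > threshold
    model = ExtraTreesClassifier(n_estimators=100, random_state=8)
    score = cross_val_score(model, X_demo[:, mask], y_demo, cv=5).mean()
    return int(mask.sum()), float(score)

results = {thr: cv_score_for_threshold(thr) for thr in (0.0, 0.02, 0.03)}
for thr, (n_kept, score) in results.items():
    print(f"threshold {thr}: {n_kept} features, CV accuracy {score:.3f}")
```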

# 3. Hyperparameter tuning
- Use random forest and extra trees, which had high cross-validation scores

### Randomized search helper

In [60]:
def randomized_search_cls(model, params, n_iter = 10):
    '''Randomized hyperparameter search for a classifier (uses the global train/test split).'''
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.metrics import accuracy_score
    
    rscv = RandomizedSearchCV(model,
                              params,
                              n_iter = n_iter)
    rscv.fit(x_train, y_train)
    best_model = rscv.best_estimator_
    # best_score_ is the mean cross-validation accuracy of the best candidate
    best_score = rscv.best_score_
    print(f"Best CV score : {best_score : .3f}")
    
    # Evaluate the refitted best model on the held-out test set
    y_pred = best_model.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"ACCURACY : {accuracy : .3f}")
    
    return best_model

### Parameter grid

In [61]:
params = dict(
    max_depth = [None, 2, 4, 6, 8, 10, 12],
    min_samples_split = [2, 3, 4, 5, 6, 7, 8, 9, 10],
    min_samples_leaf = [1, .1, .01, .02, .03, .04, .05],
    min_weight_fraction_leaf = [0.0, .0005, .005, .05, .1, .15, .2],
    max_features = [1.0, .95, .9, .85, .8, .75, .7],
    max_leaf_nodes = [None, 10, 15, 20, 25, 30, 35, 40, 45, 50],
    min_impurity_decrease = [0.0, .0005, .005, .05, .1, .15, .2],
    )

## 3-1. Random forest

In [185]:
# Create the randomized search object
rs_rf = RandomizedSearchCV(RandomForestClassifier(random_state = 8), params, n_iter = 100, n_jobs = -1, random_state = 8)

# Fit the model
rs_rf.fit(x_train, y_train)

# Test accuracy
score = rs_rf.score(x_test, y_test)

# Best parameter combination
rs_rf_best_params = rs_rf.best_params_

# Best cross-validation score
rs_rf_result = np.max(rs_rf.cv_results_["mean_test_score"])

print(f"accuracy : {score}")
print(f"best parameters : {rs_rf_best_params}")
print(f"best cross-validation score : {rs_rf_result}")

accuracy : 0.7484245521758547
best parameters : {'max_depth': 32, 'min_impurity_decrease': 0.00010830837443257038, 'min_samples_split': 9}
best cross-validation score : 0.7482510637730141


In [None]:
rf7 = RandomForestClassifier(max_depth = 32, min_impurity_decrease = 0.0001, min_samples_split = 9)

scores7 = cross_validate(rf7, x_train, y_train, return_train_score = True, n_jobs = -1)
print(np.mean(scores7["train_score"]), np.mean(scores7["test_score"]))

## 3-2. Extra trees

In [14]:
%%time
params = {
    "max_depth" : randint(10, 50),
    "min_impurity_decrease" : uniform(0.0001, 0.001),
    "min_samples_split" : randint(2, 50),
    "min_samples_leaf" : randint(2, 50)
}

# Create the randomized search object
rs_et = RandomizedSearchCV(ExtraTreesClassifier(random_state = 8), params, n_iter = 500, n_jobs = -1, random_state = 8)

# Fit the model
rs_et.fit(x_train, y_train)

# Test accuracy
score = rs_et.score(x_test, y_test)

# Best parameter combination
rs_et_best_params = rs_et.best_params_

# Best cross-validation score
rs_et_result = np.max(rs_et.cv_results_["mean_test_score"])

print(f"accuracy : {score}")
print(f"best parameters : {rs_et_best_params}")
print(f"best cross-validation score : {rs_et_result}")

accuracy : 0.6899220959658612
best parameters : {'max_depth': 20, 'min_impurity_decrease': 0.00010242272665406818, 'min_samples_leaf': 9, 'min_samples_split': 24}
best cross-validation score : 0.691087444323917
CPU times: total: 34.9 s
Wall time: 51min 3s
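
The ranges actually sampled by the `scipy.stats` distributions in the search space are easy to misread: `randint(low, high)` excludes `high`, and `uniform(loc, scale)` covers `[loc, loc + scale]`, not `[loc, scale]`. The short check below draws samples to confirm this; the sample size and seed are arbitrary.

```python
from scipy.stats import randint, uniform

depth_dist = randint(10, 50)             # integers 10..49 (50 is excluded)
impurity_dist = uniform(0.0001, 0.001)   # floats in [0.0001, 0.0001 + 0.001]

depth_samples = depth_dist.rvs(size=10_000, random_state=8)
impurity_samples = impurity_dist.rvs(size=10_000, random_state=8)

print(depth_samples.min(), depth_samples.max())
print(impurity_samples.min(), impurity_samples.max())
```

So `uniform(0.0001, 0.001)` searches `min_impurity_decrease` between 0.0001 and 0.0011, and values below 0.0001 (including the default 0.0) are never tried.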


In [None]:
et7 = ExtraTreesClassifier(max_depth = 20, min_impurity_decrease = 0.0001, min_samples_leaf = 9, min_samples_split = 24)

scores7 = cross_validate(et7, x_train, y_train, return_train_score = True, n_jobs = -1)
print(np.mean(scores7["train_score"]), np.mean(scores7["test_score"]))

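- Hyperparameter tuning was abandoned: because of compute constraints, no optimal hyperparameters could be found (best cross-validation scores of 0.748 for random forest and 0.691 for extra trees, both below the untuned models)
- However, raising n_estimators and lowering min_impurity_decrease was observed to reduce overfitting
    - Compute limits kept the tuned models from beating the earlier ones, but if tuning along these lines succeeds, performance is expected to improve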
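
One way to see how each sampled setting affects overfitting is to request training scores from the search and compare them with the validation scores. The sketch below uses synthetic data and a deliberately tiny search (8 candidates, 3 folds), so it only illustrates the bookkeeping and does not reproduce any result from this notebook.

```python
import pandas as pd
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=4, n_clusters_per_class=1, random_state=8,
)

search = RandomizedSearchCV(
    ExtraTreesClassifier(n_estimators=50, random_state=8),
    {"max_depth": randint(3, 20), "min_impurity_decrease": uniform(0.0, 0.01)},
    n_iter=8, cv=3, return_train_score=True, random_state=8,
)
search.fit(X_demo, y_demo)

# One row per sampled candidate; the gap is train accuracy minus validation accuracy.
results = pd.DataFrame(search.cv_results_)
results["gap"] = results["mean_train_score"] - results["mean_test_score"]
table = results[
    ["param_max_depth", "param_min_impurity_decrease", "mean_train_score", "mean_test_score", "gap"]
].sort_values("mean_test_score", ascending=False)
print(table)
```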

# 4. Voting

## 4-1. Hard voting

In [None]:
# Fit each classifier instance and report its test accuracy
classifiers = [RandomForestClassifier(), ExtraTreesClassifier(), XGBClassifier()]
for classifier in classifiers:
    classifier.fit(x_train, y_train)
    pred = classifier.predict(x_test)
    class_name = classifier.__class__.__name__
    print("{0} accuracy : {1:.4f}".format(class_name, accuracy_score(y_test, pred)))

In [12]:
clf1 = RandomForestClassifier(random_state = 8, n_jobs = -1)
clf2 = ExtraTreesClassifier(random_state = 8, n_jobs = -1)
clf3 = XGBClassifier(random_state = 8, n_jobs = -1)

voting_model = VotingClassifier(estimators = [
    ("rf", clf1), ("etc", clf2), ("xgb", clf3)],
                               voting = "hard")

voting_model.fit(x_train, y_train)

In [15]:
scores8 = cross_validate(voting_model, x_train, y_train, scoring = "accuracy", cv = 5)

In [18]:
np.mean(scores8["test_score"])

0.8122409141449192

In [13]:
voting_model.score(x_train, y_train)

0.9999787338111137

In [14]:
voting_model.score(x_test, y_test)

0.8272217535850742

## 4-2. Soft voting

In [19]:
clf1 = RandomForestClassifier(random_state = 8, n_jobs = -1)
clf2 = ExtraTreesClassifier(random_state = 8, n_jobs = -1)
clf3 = XGBClassifier(random_state = 8, n_jobs = -1)

voting_model = VotingClassifier(estimators = [
    ("rf", clf1), ("etc", clf2), ("xgb", clf3)],
                               voting = "soft")

voting_model.fit(x_train, y_train)

In [20]:
scores8_2 = cross_validate(voting_model, x_train, y_train, scoring = "accuracy", cv = 5)

In [21]:
np.mean(scores8_2["test_score"])

0.8168981652021344

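- Neither hard voting nor soft voting improved performance (cross-validation accuracy 0.812 and 0.817, versus 0.821 for extra trees alone)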
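
How the two voting modes can disagree is shown by a toy example with made-up probabilities for a single sample. It uses NumPy only and does not involve the fitted models above.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for one sample (two classes).
probas = np.array([
    [0.90, 0.10],   # classifier A is very confident in class 0
    [0.40, 0.60],   # classifier B leans toward class 1
    [0.45, 0.55],   # classifier C leans toward class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_label = int(np.bincount(votes, minlength=probas.shape[1]).argmax())

# Soft voting: average the probabilities, then pick the most likely class.
soft_label = int(probas.mean(axis=0).argmax())

print(hard_label, soft_label)
```

Hard voting follows the two-to-one majority, while soft voting lets the one confident classifier outweigh two hesitant ones. When the base models are similar tree ensembles, as here, both modes tend to agree with the individual models, which leaves little room for improvement.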

# 5. Stacking

## 5-1. Stacking (1)

In [11]:
et_cls = ExtraTreesClassifier(random_state = 8, n_jobs = -1)
et_reg = ExtraTreesRegressor(random_state = 8, n_jobs = -1)

rf_cls = RandomForestClassifier(random_state = 8, n_jobs = -1)
rf_reg = RandomForestRegressor(random_state = 8, n_jobs = -1)

xgb_cls = XGBClassifier(random_state = 8, n_jobs = -1)
xgb_reg = XGBRegressor(random_state = 8, n_jobs = -1)

In [13]:
stk_model = StackingClassifier(estimators = [
    ("ExtraTreesClassifier", et_cls), ("RandomForestClassifier", rf_cls), ("XGBClassifier", xgb_cls), ("ExtraTreesRegressor", et_reg)],
                              final_estimator = ExtraTreesClassifier(random_state = 8, n_jobs = -1))

In [17]:
stk_model.fit(x_train, y_train)

In [19]:
stk_model.score(x_test, y_test)

0.8654294645958418

In [14]:
scores = cross_validate(stk_model, x_train, y_train, scoring = "accuracy", cv = 5)

In [15]:
np.mean(scores["test_score"])

0.8533058539908929

- Stacking gives the best cross-validation accuracy so far, 0.853
- Stacking is therefore used for the final model, with more base estimators added

## 5-2. Stacking (2)

In [12]:
stk_model2 = StackingClassifier(estimators = [
    ("ExtraTreesClassifier", et_cls), ("ExtraTreesRegressor", et_reg),
    ("RandomForestClassifier", rf_cls), ("RandomForestRegressor", rf_reg),
    ("XGBClassifier", xgb_cls), ("XGBRegressor", xgb_reg)],
                              final_estimator = GradientBoostingClassifier(random_state = 8))

In [13]:
stk_model2.fit(x_train, y_train)

In [14]:
stk_model2.score(x_test, y_test)

0.8706396070064011

In [15]:
scores_stk2 = cross_validate(stk_model2, x_train, y_train, scoring = "accuracy", cv = 5)

In [16]:
np.mean(scores_stk2["test_score"])

0.8579419030801422

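- Using stacking
    - Re-classifying on top of the predictions of both classification and regression models improved performance considerably (cross-validation accuracy 0.858, test accuracy 0.871)
    - The stacking model is therefore chosen as the final model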
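
The idea of feeding both classifier and regressor predictions to a second-stage classifier can be written out by hand. The sketch below is a simplified illustration on synthetic data, not the `StackingClassifier` internals: it builds out-of-fold predictions with `cross_val_predict`, and all sizes and names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split

X_demo, y_demo = make_classification(
    n_samples=800, n_features=20, n_informative=8,
    n_classes=4, n_clusters_per_class=1, random_state=8,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=8, stratify=y_demo
)

clf = ExtraTreesClassifier(n_estimators=50, random_state=8)
reg = ExtraTreesRegressor(n_estimators=50, random_state=8)  # treats the class index as a number
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

# Out-of-fold predictions: each training row is predicted by a model that never saw it.
oof_proba = cross_val_predict(clf, X_tr, y_tr, cv=folds, method="predict_proba")
oof_value = cross_val_predict(reg, X_tr, y_tr, cv=folds)
meta_train = np.column_stack([oof_proba, oof_value])

# Second-stage classifier learns from the base models' predictions.
meta_model = GradientBoostingClassifier(random_state=8).fit(meta_train, y_tr)

# At test time the base models are refit on the full training split.
clf.fit(X_tr, y_tr)
reg.fit(X_tr, y_tr)
meta_test = np.column_stack([clf.predict_proba(X_te), reg.predict(X_te)])
stacked_accuracy = meta_model.score(meta_test, y_te)
print(meta_train.shape, round(stacked_accuracy, 3))
```

The regressor's single continuous output is meaningful only when the classes are ordered, which is plausible for the air-quality grades in this project (0 : good through 3 : very bad).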