## Ensemble Learning
- The idea of combining several base models to build a single new model
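
One simple way to combine base models is majority voting. The sketch below is only an illustration: the prediction matrix is made up, and `majority_vote` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import numpy as np

# Hypothetical class predictions from three base models on five samples
# (rows = models, columns = samples); the values are made up for illustration.
base_predictions = np.array([
    [0, 1, 2, 1, 0],
    [0, 1, 1, 1, 2],
    [1, 1, 2, 0, 0],
])


def majority_vote(predictions: np.ndarray) -> np.ndarray:
    """Return the most frequent label per column (ties go to the smallest label)."""
    n_classes = int(predictions.max()) + 1
    # Count the votes per class for every sample; the result has shape (n_classes, n_samples).
    votes = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    # Pick the class with the most votes for each sample.
    return votes.argmax(axis=0)


ensemble_prediction = majority_vote(base_predictions)
print(ensemble_prediction)
```

Each sample's final label is the one most base models agree on, so an error made by a single model can be outvoted by the others.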

## Stacking
- Also known as a meta learner; a technique that combines a variety of models
![ex2.jpg](attachment:ex2.jpg)

#### What Is Stacking
- Split the data into K folds, then train several models on each fold's data
![ex2.jpg](attachment:ex2.jpg)
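
The K-fold idea can be sketched roughly as follows. This is a minimal illustration under assumptions, not the procedure from the figure: the data is synthetic, and the choice of base models (a decision tree and k-nearest neighbors) and meta learner (logistic regression) is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

base_models = [DecisionTreeClassifier(max_depth=5, random_state=0),
               KNeighborsClassifier(n_neighbors=7)]

# Level 0: out-of-fold class probabilities from 5-fold CV become meta-features,
# so each row's meta-feature comes from a model that did not train on that row.
meta_train = np.hstack([
    cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")
    for model in base_models
])

# Refit each base model on the full training set to build test-time meta-features.
meta_test = np.hstack([
    model.fit(X_train, y_train).predict_proba(X_test) for model in base_models
])

# Level 1: the meta learner is trained on the base models' predictions.
meta_learner = LogisticRegression(max_iter=1000)
meta_learner.fit(meta_train, y_train)
stacked_accuracy = meta_learner.score(meta_test, y_test)
print("Stacked accuracy: %.3f" % stacked_accuracy)
```

scikit-learn also provides `sklearn.ensemble.StackingClassifier`, which wraps the same out-of-fold procedure; the manual version is shown here only to make the K-fold step visible.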

## Application: Ensemble of Ensembles
- Use an ensemble model as a single base model
![ex2.jpg](attachment:ex2.jpg)

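- A basic requirement for an ensemble: diverse models
- Boosting algorithms tend to be sensitive to hyperparameters -> diversify the hyperparameters
- As in Bagging and RandomForest, randomly sample both the data and the variables
- Generally performs better than a single ensemble model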
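
The points above can be sketched as follows. This is a hedged illustration on synthetic data, not the notebook's experiment: scikit-learn's `GradientBoostingClassifier` stands in for the boosting libraries used later, and the member count, column-subset size, and hyperparameter ranges are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

n_members = 5  # number of boosting models in the outer ensemble (arbitrary)
proba_sum = np.zeros((X_test.shape[0], 3))

for _ in range(n_members):
    # Random data: bootstrap the rows and draw a random subset of the columns.
    rows = rng.choice(X_train.shape[0], size=X_train.shape[0], replace=True)
    cols = rng.choice(X_train.shape[1], size=15, replace=False)

    # Diverse hyperparameters: each member gets its own depth and learning rate.
    model = GradientBoostingClassifier(
        n_estimators=20,
        max_depth=int(rng.integers(2, 5)),
        learning_rate=float(rng.choice([0.05, 0.1, 0.2])),
        random_state=0,
    )
    model.fit(X_train[rows][:, cols], y_train[rows])

    # Accumulate class probabilities; each member predicts on its own columns.
    proba_sum += model.predict_proba(X_test[:, cols])

# Soft voting: the class with the largest summed probability wins.
ensemble_pred = proba_sum.argmax(axis=1)
ensemble_accuracy = float((ensemble_pred == y_test).mean())
print("Ensemble accuracy: %.3f" % ensemble_accuracy)
```

Each member is itself a boosting ensemble, and the outer loop treats it as a single model whose diversity comes from the resampled rows, the sampled columns, and the varied hyperparameters.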

# Practice: Ensemble of Ensembles

## Basic Setup

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv("C:/Users/mitha/OneDrive/바탕 화면/otto_train.csv") # Product Category

In [2]:
nCar = data.shape[0] # number of rows (samples)
nVar = data.shape[1] # number of columns (still including id and target at this point)
print('nCar: %d' % nCar, 'nVar: %d' % nVar )
data = data.drop(['id'], axis = 1) # drop the id column
mapping_dict = {"Class_1": 1,
                "Class_2": 2,
                "Class_3": 3,
                "Class_4": 4,
                "Class_5": 5,
                "Class_6": 6,
                "Class_7": 7,
                "Class_8": 8,
                "Class_9": 9}
after_mapping_target = data['target'].apply(lambda x: mapping_dict[x]) # class names become integers 1-9
feature_columns = list(data.columns.difference(['target'])) # every column except target
X = data[feature_columns] # features
y = after_mapping_target # target
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.2, random_state = 42) # split into training and test data at a ratio of 8:2
print(train_x.shape, test_x.shape, train_y.shape, test_y.shape) # check the shapes

nCar: 61878 nVar: 95
(49502, 93) (12376, 93) (49502,) (12376,)
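
The mapping above yields labels 1-9, so label 0 never occurs. A possible alternative, shown here only as a sketch and not used by this notebook, is a 0-based mapping, so that the number of classes equals the largest label plus one. The sample labels and the `zero_based` name are made up for illustration.

```python
# Hypothetical sample of raw labels in the same "Class_k" format as the target column.
raw_labels = ["Class_1", "Class_9", "Class_3", "Class_1"]

# Map "Class_1".."Class_9" to 0..8 so that every label lies in [0, num_class).
zero_based = {f"Class_{k}": k - 1 for k in range(1, 10)}
encoded = [zero_based[label] for label in raw_labels]
num_class = len(zero_based)

# Inverse mapping to recover the original class names from encoded labels.
decoded = [f"Class_{index + 1}" for index in encoded]
print(encoded, num_class, decoded)
```

With this encoding, the column index of a predicted probability matrix is the encoded label itself, and no extra unused class is needed.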


## 1. XGBoost

In [4]:
# !pip install xgboost
import xgboost as xgb
import time
start = time.time() # record the start time
xgb_dtrain = xgb.DMatrix(data = train_x, label = train_y) # convert the training data to an XGBoost DMatrix
xgb_dtest = xgb.DMatrix(data = test_x) # convert the test data to an XGBoost DMatrix
xgb_param = {'max_depth': 10, # tree depth
         'learning_rate': 0.01, # step size
         'n_estimators': 100, # intended number of trees; xgb.train does not use this key (see the warning in the output) and builds num_boost_round trees, 10 by default
         'objective': 'multi:softmax', # objective function
        'num_class': len(set(train_y)) + 1} # labels must lie in [0, num_class); the labels here are 1-9, so num_class is set to 10
xgb_model = xgb.train(params = xgb_param, dtrain = xgb_dtrain) # train the model
xgb_model_predict = xgb_model.predict(xgb_dtest) # predict on the test data
print("Accuracy: %.2f" % (accuracy_score(test_y, xgb_model_predict) * 100), "%") # accuracy in %
print("Time: %.2f" % (time.time() - start), "seconds") # elapsed time

Parameters: { "n_estimators" } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Accuracy: 76.67 %
Time: 7.78 seconds


## 2. LightGBM

In [5]:
# !pip install lightgbm
import lightgbm as lgb
start = time.time() # record the start time
lgb_dtrain = lgb.Dataset(data = train_x, label = train_y) # convert the training data to a LightGBM Dataset
lgb_param = {'max_depth': 10, # tree depth
            'learning_rate': 0.01, # step size
            'n_estimators': 100, # number of trees
            'objective': 'multiclass', # objective function
            'num_class': len(set(train_y)) + 1} # labels must lie in [0, num_class); the labels here are 1-9, so num_class is set to 10
lgb_model = lgb.train(params = lgb_param, train_set = lgb_dtrain) # train the model
lgb_model_predict = np.argmax(lgb_model.predict(test_x), axis = 1) # predict on the test data: pick the label with the largest softmax probability
print("Accuracy: %.2f" % (accuracy_score(test_y, lgb_model_predict) * 100), "%") # accuracy in %
print("Time: %.2f" % (time.time() - start), "seconds") # elapsed time



You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3110
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score -34.538776
[LightGBM] [Info] Start training from score -3.476745
[LightGBM] [Info] Start training from score -1.341381
[LightGBM] [Info] Start training from score -2.039019
[LightGBM] [Info] Start training from score -3.135151
[LightGBM] [Info] Start training from score -3.125444
[LightGBM] [Info] Start training from score -1.481556
[LightGBM] [Info] Start training from score -3.074772
[LightGBM] [Info] Start training from score -1.986562
[LightGBM] [Info] Start training from score -2.533374
Accuracy: 76.28 %
Time: 3.09 seconds
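
How `np.argmax` turns per-class scores into labels depends on how many columns the model returns. The sketch below uses a made-up score matrix to illustrate both cases: 10 columns (class 0 unused, as with `num_class` set to 10 above) and 9 columns ordered as labels 1-9, which is the situation handled with `+ 1` in the CatBoost cell of this notebook.

```python
import numpy as np

# Hypothetical per-class scores for 3 samples and 10 columns (index 0 is the
# unused class 0); the numbers are made up for illustration.
proba = np.array([
    [0.0, 0.1, 0.7, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.6],
    [0.0, 0.8, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
])

# With 10 columns, the column index equals the label directly.
labels_10_columns = np.argmax(proba, axis=1)

# With 9 columns ordered as labels 1..9, the index is 0-based,
# so 1 must be added to recover the label.
labels_9_columns = np.argmax(proba[:, 1:], axis=1) + 1

print(labels_10_columns, labels_9_columns)
```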


## 3. CatBoost

In [6]:
# !pip install catboost
import catboost as cb
start = time.time() # record the start time
cb_dtrain = cb.Pool(data = train_x, label = train_y) # convert the training data to a CatBoost Pool
cb_param = {'max_depth': 10, # tree depth
            'learning_rate': 0.01, # step size
            'n_estimators': 100, # number of trees
            'eval_metric': 'Accuracy', # evaluation metric
            'loss_function': 'MultiClass'} # loss (objective) function
cb_model = cb.train(pool = cb_dtrain, params = cb_param) # train the model
cb_model_predict = np.argmax(cb_model.predict(test_x), axis = 1) + 1 # predict on the test data: pick the class with the largest score, then add 1 so the 0-based index matches the labels 1-9
print("Accuracy: %.2f" % (accuracy_score(test_y, cb_model_predict) * 100), "%") # accuracy in %
print("Time: %.2f" % (time.time() - start), "seconds") # elapsed time

0:	learn: 0.5907034	total: 1s	remaining: 1m 39s
1:	learn: 0.6356107	total: 1.83s	remaining: 1m 29s
2:	learn: 0.6411256	total: 2.61s	remaining: 1m 24s
3:	learn: 0.6480344	total: 3.42s	remaining: 1m 22s
4:	learn: 0.6508222	total: 4.21s	remaining: 1m 20s
5:	learn: 0.6499939	total: 4.99s	remaining: 1m 18s
6:	learn: 0.6507818	total: 5.77s	remaining: 1m 16s
7:	learn: 0.6548422	total: 6.53s	remaining: 1m 15s
8:	learn: 0.6559533	total: 7.28s	remaining: 1m 13s
9:	learn: 0.6560947	total: 8.03s	remaining: 1m 12s
10:	learn: 0.6568421	total: 8.87s	remaining: 1m 11s
11:	learn: 0.6588219	total: 9.65s	remaining: 1m 10s
12:	learn: 0.6592259	total: 10.4s	remaining: 1m 9s
13:	learn: 0.6611248	total: 11.2s	remaining: 1m 8s
14:	learn: 0.6625591	total: 11.9s	remaining: 1m 7s
15:	learn: 0.6631853	total: 12.6s	remaining: 1m 6s
16:	learn: 0.6639328	total: 13.4s	remaining: 1m 5s
17:	learn: 0.6668821	total: 14.1s	remaining: 1m 4s
18:	learn: 0.6669630	total: 14.9s	remaining: 1m 3s
19:	learn: 0.6675286	total: 15.9

## 4. Ensemble of Ensembles

In [13]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# Note: this section treats the class labels 1-9 as a numeric regression target and evaluates with RMSE.
bagging_predict_result = [] # predictions of each bagged model
for _ in range(30):
    n_train = train_x.shape[0] # number of training rows
    random_data_index = np.random.choice(n_train, n_train) # bootstrap: draw as many row indices as the training set has, with replacement
    print(len(set(random_data_index))) # number of distinct rows in the bootstrap sample
    lgb_dtrain = lgb.Dataset(data = train_x.iloc[random_data_index,], label = train_y.iloc[random_data_index]) # convert the bootstrap sample to a LightGBM Dataset
    lgb_param = {'max_depth': 10, # tree depth
                'learning_rate': 0.01, # step size
                'n_estimators': 500, # number of trees
                'objective': 'regression', # objective function
                'force_col_wise':True} # force column-wise histogram building
    lgb_model = lgb.train(params = lgb_param, train_set = lgb_dtrain) # train the model
 
    predict1 = lgb_model.predict(test_x) # predict on the test data
    bagging_predict_result.append(predict1) # store this model's predictions
    print(sqrt(mean_squared_error(test_y, predict1))) # RMSE of this single model

31296




[LightGBM] [Info] Total Bins 2969
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.844653
1.3657106841323998
31421




[LightGBM] [Info] Total Bins 2970
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.847340
1.3634864954804904
31335




[LightGBM] [Info] Total Bins 2974
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.825765
1.3556623225606212
31205




[LightGBM] [Info] Total Bins 3004
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.846410
1.3615947277545573
31293




[LightGBM] [Info] Total Bins 2999
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.831845
1.3641369020980383
31088




[LightGBM] [Info] Total Bins 2962
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.834734
1.358320076002917
31359




[LightGBM] [Info] Total Bins 2960
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.833340
1.3701628767974081
31312




[LightGBM] [Info] Total Bins 2962
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.809785
1.3598304582843055
31221




[LightGBM] [Info] Total Bins 2962
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.842835
1.3619579690195687
31282




[LightGBM] [Info] Total Bins 2975
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.852673
1.359623580677644
31258




[LightGBM] [Info] Total Bins 2960
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.837279
1.3619803451028107
31388




[LightGBM] [Info] Total Bins 2961
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.831078
1.3629980922046503
31295




[LightGBM] [Info] Total Bins 2955
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.832734
1.3633243830676092
31325




[LightGBM] [Info] Total Bins 2991
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.836148
1.3635108842774861
31330




[LightGBM] [Info] Total Bins 2970
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.833683
1.3658286157158623
31341




[LightGBM] [Info] Total Bins 2984
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.842936
1.3593071002305293
31210




[LightGBM] [Info] Total Bins 2978
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.829845
1.3608231530719281
31372




[LightGBM] [Info] Total Bins 2991
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.834653
1.3661380333639312
31196




[LightGBM] [Info] Total Bins 2954
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.820391
1.361460038133029
31522




[LightGBM] [Info] Total Bins 2962
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.838835
1.3601950505269398
31252




[LightGBM] [Info] Total Bins 2975
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.844107
1.363429426023378
31354




[LightGBM] [Info] Total Bins 2983
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.834795
1.3632220673843196
31254




[LightGBM] [Info] Total Bins 2938
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.819017
1.364626275547405
31382




[LightGBM] [Info] Total Bins 2982
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.837724
1.3609421239296464
31165




[LightGBM] [Info] Total Bins 2969
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.843602
1.3639416325541447
31258




[LightGBM] [Info] Total Bins 2949
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.819724
1.363916594046968
31329




[LightGBM] [Info] Total Bins 2987
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.830512
1.3588280716233747
31363




[LightGBM] [Info] Total Bins 2992
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.837683
1.3650323283839063
31253




[LightGBM] [Info] Total Bins 2971
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.833845
1.3583794628871937
31366




[LightGBM] [Info] Total Bins 2989
[LightGBM] [Info] Number of data points in the train set: 49502, number of used features: 93
[LightGBM] [Info] Start training from score 4.812674
1.3607302098186855
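
The counts printed before each training log (about 31,300 distinct rows out of 49,502) are what sampling with replacement is expected to produce: the share of distinct rows approaches 1 - 1/e, roughly 63.2%. The sketch below checks this with NumPy alone; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
n_rows = 49502  # training-set size reported earlier in this notebook

# Bootstrap: draw n_rows indices with replacement.
sample = rng.choice(n_rows, size=n_rows, replace=True)
n_unique = len(np.unique(sample))

# Probability that a given row appears at least once in the sample.
expected_fraction = 1 - (1 - 1 / n_rows) ** n_rows

print(n_unique, n_unique / n_rows, expected_fraction)
```

Each bagged model therefore sees only about two thirds of the distinct training rows, which is one source of the diversity between the models.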


In [14]:
# Average the predictions of the bagged models
bagging_predict = np.mean(bagging_predict_result, axis = 0).tolist() # for each test sample, the mean of the 30 models' predictions

In [15]:
bagging_predict

[3.641528395238082,
 5.612225425923341,
 5.891476547356574,
 5.231383479630531,
 5.959816354965337,
 3.3022195590296914,
 2.61398678510761,
 4.267137023827591,
 2.570151519785575,
 3.015347810950047,
 4.266276719669156,
 5.0147699176541245,
 2.382220693029305,
 3.027873003047812,
 4.5429844489336775,
 6.372375675371007,
 2.8126867864345275,
 6.011054871918301,
 4.1059626117613215,
 4.992723319895498,
 3.163849579765856,
 3.119299333245067,
 3.3256703985995686,
 3.319439695273637,
 7.966212137919581,
 6.820907017449489,
 6.434756377854731,
 2.376313862145501,
 5.103532156159715,
 3.3009383232825558,
 5.9240807733009975,
 5.531427633812589,
 2.4027498605581696,
 6.594390066234364,
 6.408663142729358,
 4.818370270643071,
 5.436832798849759,
 3.2203693718303588,
 2.1838362124418653,
 6.582486562509564,
 6.221127078316718,
 3.33796888911803,
 2.464710679012083,
 2.286399178759422,
 6.4180997522595495,
 5.266082570388206,
 2.578381688776221,
 3.9242254688313123,
 8.293578468122842,
 7.539899

In [16]:
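# RMSE of the last model trained in the loop only; the averaged prediction would be scored with sqrt(mean_squared_error(test_y, bagging_predict))
sqrt(mean_squared_error(test_y, lgb_model.predict(test_x)))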

1.3607302098186855
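
To see why averaging can help, the sketch below compares single-model RMSE with the RMSE of the averaged prediction on synthetic data. It is an idealized illustration, not a result from this notebook: the 30 "models" are simulated as the truth plus independent noise, whereas real bagged models have correlated errors and gain far less.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Synthetic targets in the same 1-9 range as the labels (assumption for illustration).
y_true = rng.uniform(1, 9, size=1000)

# 30 hypothetical models: the truth plus independent unit-variance noise.
member_preds = y_true + rng.normal(0.0, 1.0, size=(30, 1000))


def rmse(pred: np.ndarray, target: np.ndarray) -> float:
    """Root mean squared error between predictions and targets."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))


member_rmse = [rmse(pred, y_true) for pred in member_preds]

# Average the 30 predictions per sample, then score the averaged prediction.
averaged = member_preds.mean(axis=0)
ensemble_rmse = rmse(averaged, y_true)

print(min(member_rmse), ensemble_rmse)
```

With independent errors the averaged RMSE shrinks by roughly the square root of the number of models; how much of that carries over to the bagged LightGBM models depends on how correlated their errors are.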