# __Ensembles__

### __Ensembles almost always work better__

### Bias & Variance

![Alt text](./images/bias.png)

## 앙상블의목적: 다수의모델을학습하여오류의감소를추구
>분산의감소에의한오류감소: 배깅(Bagging), 랜덤포레스트(Random Forest) <br>
>**편향의감소에의한오류감소: 부스팅(Boosting)**

# __Boosting__

<p align="center"><img width="600" height="auto" src="./images/boosting.png"></p>

* Boosting도 Bagging과 동일하게 복원 랜덤 샘플링을 하지만, 가중치를 부여한다는 차이점이 있다
* Bagging이 병렬로 학습하는 반면, Boosting은 순차적으로 학습되며, __학습이 끝나면 나온 결과에 따라 가중치가 재분배 됨__

# __Package를 사용하여 AdaBoost 코드 작성__

In [1]:
# Load libraries
from sklearn.tree import DecisionTreeClassifier # 의사 결정 나무
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
import pandas as pd
from sklearn.metrics import f1_score


In [2]:
filename = './dataset/pima-indians-diabetes.data.csv'
dataframe = pd.read_csv(filename, header =None)
dataframe.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Class']
dataframe.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
X = dataframe.iloc[:, :-1]
y = dataframe.iloc[:, -1] 

In [4]:
X.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


In [5]:
y.head()

0    1
1    0
2    1
3    0
4    1
Name: Class, dtype: int64

In [6]:
# 데이터 셋 분할하기
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0) 

- Train model

In [7]:
# hyperparameters
param_grid = {'n_estimators': [100, 200],
              'learning_rate': [0.01, 0.001, 0.0001], 
              'base_estimator__max_depth': [1, 3, 5]
              }

In [8]:
# 1) 모델 선언
DT = DecisionTreeClassifier()
DT

DecisionTreeClassifier()

In [9]:
# 여러 모델들을 ensemble: adaboost
ada_model = AdaBoostClassifier(base_estimator=DT, random_state=1)

# hyperparameter search
grid_search = GridSearchCV(ada_model, param_grid=param_grid, cv=5, scoring='f1')
grid_search.fit( X_train, y_train)

GridSearchCV(cv=5,
             estimator=AdaBoostClassifier(base_estimator=DecisionTreeClassifier(),
                                          random_state=1),
             param_grid={'base_estimator__max_depth': [1, 3, 5],
                         'learning_rate': [0.01, 0.001, 0.0001],
                         'n_estimators': [100, 200]},
             scoring='f1')

In [10]:
grid_search.best_params_

{'base_estimator__max_depth': 3, 'learning_rate': 0.001, 'n_estimators': 200}

- 최적의 파라미터를 찾은 후 모델 결정

In [11]:
opt_model = grid_search.best_estimator_
opt_model

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
                   learning_rate=0.001, n_estimators=200, random_state=1)

In [12]:
# 4) 예측
test_pred_y = opt_model.predict(X_test)
test_pred_y

array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0])

In [13]:
# 테스트 데이터에 대한 f1-score
ada_f1 = f1_score(y_true= y_test, y_pred= test_pred_y)
ada_f1

0.6136363636363638

- 변수중요도

In [14]:
opt_model.feature_importances_

array([0.00366245, 0.67004666, 0.0057606 , 0.        , 0.01823516,
       0.14301762, 0.0086825 , 0.15059501])

In [14]:
var_df = pd.Series(opt_model.feature_importances_, index = dataframe.columns[:-1])
var_df.sort_values(ascending=False)

Glucose                     0.670047
Age                         0.150595
BMI                         0.143018
Insulin                     0.018235
DiabetesPedigreeFunction    0.008682
BloodPressure               0.005761
Pregnancies                 0.003662
SkinThickness               0.000000
dtype: float64

---

# __Package를 사용하여 Gradient Boosting Machine 코드 작성__

In [15]:
# 패키지 불러오기
from sklearn.ensemble import GradientBoostingClassifier
# 성능지표
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
# 데이터 파티션
from sklearn.model_selection import train_test_split
# 데이터 불러오기
import pandas as pd

In [16]:
filename = './dataset/pima-indians-diabetes.data.csv'
dataframe = pd.read_csv(filename)
dataframe.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Class']

X = dataframe.iloc[:, :-1]
y = dataframe.iloc[:, -1] 

In [17]:
# 데이터 셋 분할하기
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state = 0) 

- train model

In [18]:
# hyperparameters
param_grid = {'n_estimators': [100, 200],
              'learning_rate': [0.01, 0.001, 0.0001], 
              'max_depth': [1, 3, 5]
              }

In [19]:
# 2) 여러 모델들을 ensemble: bagging
gbm_model = GradientBoostingClassifier()

# hyperparameter search
grid_search = GridSearchCV(gbm_model, param_grid=param_grid, cv=5, scoring='f1')
grid_search.fit( X_train, y_train)

GridSearchCV(cv=5, estimator=GradientBoostingClassifier(),
             param_grid={'learning_rate': [0.01, 0.001, 0.0001],
                         'max_depth': [1, 3, 5], 'n_estimators': [100, 200]},
             scoring='f1')

In [20]:
grid_search.best_params_

{'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 200}

- 최적의 파라미터를 찾은 후 모델 결정

In [21]:
opt_model = grid_search.best_estimator_
opt_model

GradientBoostingClassifier(learning_rate=0.01, n_estimators=200)

In [22]:
# 4) 예측
test_pred_y = opt_model.predict(X_test)
test_pred_y

array([1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0])

In [23]:
# 테스트 데이터에 대한 f1-score
gbm_f1 = f1_score(y_true= y_test, y_pred= test_pred_y)
gbm_f1

0.5952380952380953

- 변수중요도

In [24]:
opt_model.feature_importances_

array([0.03459622, 0.53804443, 0.00547286, 0.01103247, 0.05438359,
       0.21350661, 0.05042473, 0.0925391 ])

In [25]:
var_df = pd.Series(opt_model.feature_importances_, index = dataframe.columns[:-1])
var_df.sort_values(ascending=False)

Glucose                     0.538044
BMI                         0.213507
Age                         0.092539
Insulin                     0.054384
DiabetesPedigreeFunction    0.050425
Pregnancies                 0.034596
SkinThickness               0.011032
BloodPressure               0.005473
dtype: float64

---

- Summary

In [26]:
pd.Series([ada_f1,gbm_f1],index =['ada', 'gbm'], name = 'f1-score')

ada    0.613636
gbm    0.595238
Name: f1-score, dtype: float64