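# Ensemble Learning

- Learning by aggregating the predictions of a group of predictors (classification or regression models)
- The group of predictors is called an ```ensemble```.
- Based on the principle that the combined answer of thousands of randomly chosen people can be better than a single expert's answer (the wisdom of the crowd)
- An ensemble of decision trees is called a ```random forest```
- Common ensemble methods include voting, bagging, and boosting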
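
The wisdom-of-the-crowd idea can be illustrated with a small simulation. This is a minimal sketch, assuming every voter is correct with the same probability and that voters are fully independent; real classifiers trained on the same data are correlated, so the gain is usually smaller. The function name and the numbers are illustrative, not from the notebook.

```python
import random

def majority_vote_accuracy(n_voters, p_correct, n_trials, seed=0):
    """Estimate how often a majority of independent voters is correct."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Count how many voters answer correctly in this trial.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_voters))
        if correct_votes > n_voters / 2:
            wins += 1
    return wins / n_trials

# Each voter is only slightly better than a coin flip (assumed 55%).
single = majority_vote_accuracy(n_voters=1, p_correct=0.55, n_trials=2000)
crowd = majority_vote_accuracy(n_voters=501, p_correct=0.55, n_trials=2000)
print(f"single voter: {single:.3f}, crowd of 501: {crowd:.3f}")
```

Under these assumptions the majority of many weak voters is right far more often than any single voter.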

# 1. Voting
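Voting combines the predictions of several different classifiers. Hard voting picks the class that receives the most votes, while soft voting averages the predicted class probabilities and picks the most likely class. The sketch below shows how the two can disagree; the probabilities are hypothetical numbers chosen for illustration, not output from the models in this notebook.

```python
# Hypothetical predicted probabilities of class 1 from three classifiers
# for a single sample (illustrative numbers only).
proba_class1 = [0.45, 0.45, 0.95]

# Hard voting: each classifier casts one vote for its most likely class.
votes = [1 if p >= 0.5 else 0 for p in proba_class1]
hard_prediction = 1 if sum(votes) > len(votes) / 2 else 0

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = sum(proba_class1) / len(proba_class1)
soft_prediction = 1 if mean_proba >= 0.5 else 0

print("hard:", hard_prediction, "soft:", soft_prediction)  # hard: 0 soft: 1
```

Soft voting lets one very confident classifier outweigh two uncertain ones, which is why it requires classifiers that can output probabilities.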

In [27]:
import pandas as pd

# Import the built-in scikit-learn modules
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
cancer = load_breast_cancer()

data = pd.DataFrame(cancer.data, columns = cancer.feature_names)
data.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [17]:
# Build the logistic regression and KNN models
logistic_regression = LogisticRegression()
knn = KNeighborsClassifier(n_neighbors=8)

# Soft-voting ensemble model
voting_model = VotingClassifier(
    estimators=[('LogisticRegression', logistic_regression), ('KNN', knn)],
    voting='soft')

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=1561)

# Train, predict with, and evaluate the voting classifier
voting_model.fit(X_train, y_train)
pred = voting_model.predict(X_test)
print('Soft voting classifier accuracy: {0:.4f}'.format(accuracy_score(y_test, pred)))

# Train, predict with, and evaluate each individual model
classifiers = [logistic_regression, knn]
for classifier in classifiers:
    classifier.fit(X_train, y_train)
    pred = classifier.predict(X_test)
    class_name = classifier.__class__.__name__
    print('{0} accuracy: {1:.4f}'.format(class_name, accuracy_score(y_test, pred)))

Soft voting classifier accuracy: 0.9649
LogisticRegression accuracy: 0.9474
KNeighborsClassifier accuracy: 0.9649


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

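The convergence warning in the output above suggests scaling the data or raising `max_iter`. A minimal sketch of that fix, assuming a `StandardScaler` in front of the logistic regression, is shown below. The variable names and `random_state=0` are illustrative choices, so the accuracy it prints will not match the numbers reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
# random_state=0 is an arbitrary choice for this sketch.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0)

# Standardize the features, then fit logistic regression with a higher
# iteration cap so the solver has room to converge.
scaled_logistic = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logistic.fit(X_tr, y_tr)

scaled_accuracy = scaled_logistic.score(X_te, y_te)
print(f"accuracy with scaling: {scaled_accuracy:.4f}")
```

A pipeline like this can be passed to `VotingClassifier` in place of the bare `LogisticRegression()`, since it exposes the same `fit` and `predict_proba` interface.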

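# 2. Bagging (Bootstrap Aggregating)

Bagging uses the same algorithm for every classifier, but trains each one on a different random subset of the training set.

The key idea is to extract diversity from a **limited** amount of data  
- Bootstrapping: random sampling (sampling with replacement = bootstrap)  
- Aggregating: combining the predictions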
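
The bootstrapping step can be sketched with the standard library alone. This is an illustrative example, assuming a training set of 1000 rows identified by index; the seed and size are arbitrary.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
training_indices = list(range(1000))

# Bootstrapping: draw as many rows as the original set, with replacement,
# so some rows appear several times and others not at all.
bootstrap_sample = rng.choices(training_indices, k=len(training_indices))

unique_fraction = len(set(bootstrap_sample)) / len(training_indices)
print(f"fraction of distinct training rows in the sample: {unique_fraction:.3f}")
```

Each bootstrap sample typically contains roughly 63% of the distinct rows, which is what makes the individual predictors differ from one another.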

![image.png](attachment:image.png)

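Each sample reflects only part of the data, but aggregating predictors trained on many such samples gives a **sufficient learning effect**: the ensemble keeps a bias similar to that of a single predictor while mainly reducing variance -> this helps prevent ```overfitting```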

## Note: overfitting vs. underfitting

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In practice, what matters is finding a method that strikes a good ***balance*** between low bias and low variance.

Bagging is one such method!
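
One way to see bagging in action is scikit-learn's `BaggingClassifier`, which trains copies of one estimator on bootstrap samples. The sketch below compares a single decision tree with a bagged ensemble of trees using 5-fold cross-validation. The choice of 50 trees, `random_state=0`, and the variable names are assumptions made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
# 50 trees, each trained on its own bootstrap sample of the training folds.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0)

tree_scores = cross_val_score(single_tree, X_all, y_all, cv=5)
bagging_scores = cross_val_score(bagged_trees, X_all, y_all, cv=5)
print(f"single tree: {tree_scores.mean():.4f}, bagging: {bagging_scores.mean():.4f}")
```

A random forest follows the same recipe and additionally restricts each split to a random subset of features.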

In [23]:
from sklearn.ensemble import RandomForestClassifier



In [24]:
# A random forest bags decision trees: each tree is trained on a bootstrap sample
bagging_rf_ensemble = RandomForestClassifier()

bagging_rf_ensemble.fit(X_train, y_train)
y_pred = bagging_rf_ensemble.predict(X_test)

print("Bagging RandomForest classifier accuracy: {0:.4f}".format(accuracy_score(y_test, y_pred)))

Bagging RandomForest classifier accuracy: 0.9649


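# 3. Boosting

**Boosting** is an ensemble method that chains several weak learners together to build a strong learner.

Training proceeds by adjusting the sample weights of the next classifier's training data based on the results of the previous classifier.

In bagging, the classifiers do not influence one another, and the results are combined only after all of them have finished training.  
In boosting, **the sample weights used by the next classifier are adjusted based on the results of the previous classifier**.

Because misclassified samples receive higher weights, boosting can be vulnerable to outliers!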
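
The reweighting step can be sketched by hand. The example below uses the classic discrete AdaBoost update as an assumption; scikit-learn's implementation differs in its details. The round itself is hypothetical: 10 equally weighted samples, of which the weak learner misclassifies 3.

```python
import math

# Hypothetical round: 10 samples, the weak learner misclassifies 3 of them.
is_misclassified = [True, True, True] + [False] * 7
weights = [1 / len(is_misclassified)] * len(is_misclassified)

# Weighted error rate of the weak learner.
error = sum(w for w, wrong in zip(weights, is_misclassified) if wrong)
# Learner weight: larger when the error is smaller.
alpha = 0.5 * math.log((1 - error) / error)

# Raise the weights of misclassified samples, lower the rest, then normalize.
updated = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, is_misclassified)]
total = sum(updated)
updated = [w / total for w in updated]

print(f"alpha = {alpha:.4f}")
print(f"misclassified weight = {updated[0]:.4f}, correct weight = {updated[-1]:.4f}")
```

After the update, the misclassified samples together carry half of the total weight, so the next learner is pushed to focus on them. An outlier that keeps being misclassified keeps gaining weight, which is the vulnerability noted above.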

## 3.1 AdaBoost

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [33]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Note: algorithm="SAMME.R" is deprecated in recent scikit-learn releases;
# if your version rejects it, remove the argument.
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)
y_pred = ada_clf.predict(X_test)

print("AdaBoost classifier accuracy: {0:.4f}".format(accuracy_score(y_test, y_pred)))

AdaBoost classifier accuracy: 0.9737


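## 3.2 Gradient Boosting

Like AdaBoost, gradient boosting adds predictors to the ensemble sequentially, each one correcting the errors made so far.  
Unlike AdaBoost, it does not adjust the sample weights at every iteration; instead, it trains each new predictor on the **residual errors** left by the previous predictor.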
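
The residual-fitting idea can be sketched with three small regression trees. This is a minimal illustration, assuming synthetic noisy quadratic data, trees of depth 2, and no learning-rate shrinkage; none of these choices come from the notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative only): a noisy quadratic.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-0.5, 0.5, size=(200, 1))
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.normal(size=200)

# Fit the first tree on the targets, then each next tree on the residuals
# left by the trees before it.
trees = []
residual = y_demo.copy()
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residual)
    residual = residual - tree.predict(X_demo)
    trees.append(tree)

# The ensemble prediction is the sum of all trees' predictions.
first_tree_mse = np.mean((y_demo - trees[0].predict(X_demo)) ** 2)
ensemble_pred = sum(tree.predict(X_demo) for tree in trees)
ensemble_mse = np.mean((y_demo - ensemble_pred) ** 2)
print(f"training MSE, 1 tree: {first_tree_mse:.5f}, 3 trees: {ensemble_mse:.5f}")
```

scikit-learn packages the same procedure, with a learning rate and other refinements, as `GradientBoostingRegressor` and `GradientBoostingClassifier`.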

## Exercise: add a Decision Tree to the voting example and compute the soft-voting accuracy!

Work through the exercise to see what results soft voting produces (the Decision Tree must be included; any other models are optional).