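**Ensemble**
- A technique that combines several machine learning models and makes a prediction either by aggregating their individual predictions or by boosting them sequentially
- **Aggregation**
  - The models inside the ensemble operate independently of one another
  - Each model should be allowed to overfit to a moderate degree; aggregating the models then averages out their individual errors
  - Training and prediction are comparatively fast
  - Because the models are independent, they can be processed in parallel
  - **Voting**
    - Hard
      - Tally the class predicted by each model and choose the class with the most votes
    - Soft
      - Average the class probabilities predicted by the models and choose the class with the highest average
  - **Bagging**
    - Uses a single learning algorithm and trains each model on a different random sample of the data (bootstrap sampling), then aggregates the models
  - **RandomForest**
    - Bagging applied to Decision Trees, with each tree also considering a random subset of features at each split
- **Boosting**
  - The models inside the ensemble are chained sequentially
  - Each model can start training only after the previous model has finished
  - Each model is influenced by the previous model
  - Each model is strongly constrained (a weak learner), and performance improves step by step
  - Training and prediction are comparatively slow
  - **AdaBoost**
    - A data-centric technique that increases the weight of the samples the previous model misclassified, so the next model focuses on them
  - **GradientBoosting**
    - Uses decision trees as the base model; each new tree is fit to the residual errors (the gradient of the loss) of the ensemble built so far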
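
The hard and soft voting rules described above can give different answers on the same inputs. The following is a minimal sketch that works both rules out by hand with NumPy. The probabilities are hypothetical numbers chosen for illustration, not outputs of any model in this notebook.

```python
import numpy as np

# Hypothetical class-1 probabilities from three models (columns)
# for four samples (rows). Illustrative values only.
proba_class1 = np.array([
    [0.90, 0.40, 0.45],   # sample 0
    [0.20, 0.30, 0.10],   # sample 1
    [0.55, 0.60, 0.05],   # sample 2
    [0.51, 0.52, 0.10],   # sample 3
])

# Hard voting: each model casts one vote (threshold 0.5); the majority wins.
votes = (proba_class1 >= 0.5).astype(int)
hard = (votes.sum(axis=1) >= 2).astype(int)

# Soft voting: average the probabilities first, then threshold the mean.
soft = (proba_class1.mean(axis=1) >= 0.5).astype(int)

print('hard:', hard)   # hard: [0 0 1 1]
print('soft:', soft)   # soft: [1 0 0 0]
```

Sample 0 shows one very confident model outweighing two mildly negative ones under soft voting, while samples 2 and 3 show two barely positive models winning the hard vote despite a low average probability.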

## Lecture 01 - Ensembles with Aggregation

In [1]:
# import pandas
import pandas as pd

pd.options.display.max_columns = 5
pd.options.display.max_rows = 10

In [2]:
# 1. load dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

In [3]:
# 2. set X, y
X = pd.DataFrame(data=data.data, columns=data.feature_names)
y = pd.Series(data=data.target)

print(X.head())

   mean radius  mean texture  ...  worst symmetry  worst fractal dimension
0        17.99         10.38  ...          0.4601                  0.11890
1        20.57         17.77  ...          0.2750                  0.08902
2        19.69         21.25  ...          0.3613                  0.08758
3        11.42         20.38  ...          0.6638                  0.17300
4        20.29         14.34  ...          0.2364                  0.07678

[5 rows x 30 columns]


In [4]:
# 3. check X, y
X.info()  # info() prints its report directly and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [5]:
print(X.isnull().sum())

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
                          ..
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
Length: 30, dtype: int64


In [6]:
print(X.describe())

       mean radius  mean texture  ...  worst symmetry  worst fractal dimension
count   569.000000    569.000000  ...      569.000000               569.000000
mean     14.127292     19.289649  ...        0.290076                 0.083946
std       3.524049      4.301036  ...        0.061867                 0.018061
min       6.981000      9.710000  ...        0.156500                 0.055040
25%      11.700000     16.170000  ...        0.250400                 0.071460
50%      13.370000     18.840000  ...        0.282200                 0.080040
75%      15.780000     21.800000  ...        0.317900                 0.092080
max      28.110000     39.280000  ...        0.663800                 0.207500

[8 rows x 30 columns]


In [7]:
print(y.value_counts())

1    357
0    212
dtype: int64


In [8]:
print(y.value_counts() / len(y))

1    0.627417
0    0.372583
dtype: float64


In [9]:
# 4. split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2)

print(len(X_train), len(X_test))
print(len(y_train), len(y_test))

455 114
455 114
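
The split above is random, so the scores later in the notebook change from run to run, and the roughly 63/37 class ratio is not guaranteed to carry over to both splits. One possible variation, shown here as a sketch, passes `stratify=y` and a fixed `random_state` to `train_test_split`. The seed value 0 is an arbitrary assumption for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data=data.data, columns=data.feature_names)
y = pd.Series(data=data.target)

# stratify=y keeps the class ratio (about) the same in both splits;
# random_state=0 is an arbitrary seed that makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(len(X_train), len(X_test))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```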


In [16]:
# 5. build model

# internal (base) models
models = []

from sklearn.neighbors import KNeighborsClassifier
models.append(KNeighborsClassifier())

from sklearn.linear_model import LogisticRegression
models.append(LogisticRegression(max_iter=10000))

from sklearn.tree import DecisionTreeClassifier
models.append(DecisionTreeClassifier())

# aggregation (voting) model
from sklearn.ensemble import VotingClassifier
model = VotingClassifier(estimators=[(f'model{i}', models[i]) for i in range(3)])
model.fit(X_train, y_train)

score = model.score(X_train, y_train)
print(f'SCORE(TRAIN): {score}')

score = model.score(X_test, y_test)
print(f' SCORE(TEST): {score}\n')

print(f'    ANSWER: {y_test[:5].values}')

pred = model.predict(X_test[:5])
print(f'PREDICT(0): {pred}')

for i in range(3):
  pred = model.estimators_[i].predict(X_test[:5])
  print(f'PREDICT({i+1}): {pred}')

SCORE(TRAIN): 0.9802197802197802
 SCORE(TEST): 0.9298245614035088

    ANSWER: [1 1 1 1 0]
PREDICT(0): [1 1 1 1 0]
PREDICT(1): [1 1 1 1 0]
PREDICT(2): [1 1 1 1 0]
PREDICT(3): [1 1 1 1 0]
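
The cell above demonstrates only Voting. As a rough sketch of the other two aggregation techniques in the notes, the example below fits a `BaggingClassifier` over decision trees and a `RandomForestClassifier` on the same dataset. The number of trees (100), the stratified split, and the seed are assumptions made for this example rather than settings from the lecture.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Bagging: 100 decision trees, each fit on a bootstrap sample of the training set.
# The base model is passed positionally because its keyword name differs
# between scikit-learn releases (base_estimator vs. estimator).
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            random_state=0)

# RandomForest: bagged trees that also sample candidate features at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

for name, clf in [('Bagging', bagging), ('RandomForest', forest)]:
    clf.fit(X_train, y_train)
    print(f'{name}: train={clf.score(X_train, y_train):.3f} '
          f'test={clf.score(X_test, y_test):.3f}')
```

Both classes accept an `n_jobs` argument, which is how the independence of the internal models translates into parallel training.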


## Lecture 02 - Ensembles with Boosting

In [17]:
# import pandas
import pandas as pd

pd.options.display.max_columns = 5
pd.options.display.max_rows = 10

In [18]:
# 1. load dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

In [19]:
# 2. set X, y
X = pd.DataFrame(data=data.data, columns=data.feature_names)
y = pd.Series(data=data.target)

print(X.head())

   mean radius  mean texture  ...  worst symmetry  worst fractal dimension
0        17.99         10.38  ...          0.4601                  0.11890
1        20.57         17.77  ...          0.2750                  0.08902
2        19.69         21.25  ...          0.3613                  0.08758
3        11.42         20.38  ...          0.6638                  0.17300
4        20.29         14.34  ...          0.2364                  0.07678

[5 rows x 30 columns]


In [20]:
# 3. check X, y
X.info()  # info() prints its report directly and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [21]:
print(X.isnull().sum())

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
                          ..
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
Length: 30, dtype: int64


In [22]:
print(X.describe())

       mean radius  mean texture  ...  worst symmetry  worst fractal dimension
count   569.000000    569.000000  ...      569.000000               569.000000
mean     14.127292     19.289649  ...        0.290076                 0.083946
std       3.524049      4.301036  ...        0.061867                 0.018061
min       6.981000      9.710000  ...        0.156500                 0.055040
25%      11.700000     16.170000  ...        0.250400                 0.071460
50%      13.370000     18.840000  ...        0.282200                 0.080040
75%      15.780000     21.800000  ...        0.317900                 0.092080
max      28.110000     39.280000  ...        0.663800                 0.207500

[8 rows x 30 columns]


In [23]:
print(y.value_counts())

1    357
0    212
dtype: int64


In [None]:
print(y.value_counts() / len(y))

1    0.627417
0    0.372583
dtype: float64


In [24]:
# 4. split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2)

print(len(X_train), len(X_test))
print(len(y_train), len(y_test))

455 114
455 114


In [26]:
# 5. build model

# base model
from sklearn.linear_model import LogisticRegression
base_model = LogisticRegression(max_iter=10000)

# boosting model
# Note: the keyword is `estimator` in scikit-learn 1.2+; older releases use `base_estimator`.
from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier(estimator=base_model)
model.fit(X_train, y_train)

score = model.score(X_train, y_train)
print(f'SCORE(TRAIN): {score}')

score = model.score(X_test, y_test)
print(f' SCORE(TEST): {score}\n')

pred = model.predict(X_test[:5])
print(f'PREDICT: {pred}')
print(f' ANSWER: {y_test[:5].values}')

SCORE(TRAIN): 0.9560439560439561
 SCORE(TEST): 0.9473684210526315

PREDICT: [1 1 0 0 1]
 ANSWER: [1 0 0 0 1]
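
The notes also list GradientBoosting, which the cell above does not cover. The following sketch fits a `GradientBoostingClassifier` built from decision stumps and uses `staged_predict` to look at test accuracy as trees are added one by one. The hyperparameters, the stratified split, and the seed are assumptions chosen for illustration, not values from the lecture.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# max_depth=1 makes every tree a decision stump, i.e. a strongly constrained model.
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   max_depth=1, random_state=0)
model.fit(X_train, y_train)

# staged_predict yields the ensemble's prediction after each added tree.
staged_acc = [accuracy_score(y_test, pred)
              for pred in model.staged_predict(X_test)]

for n in (1, 10, 50, 100):
    print(f'trees={n:3d}  test accuracy={staged_acc[n - 1]:.3f}')
```

Because each tree is fit only after the previous ones are in place, the stages have to be trained in order, which is the sequential dependency described in the boosting notes.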
