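# Ensemble: Voting
- A method that combines several different models to improve predictive performance
    - hard voting: each model predicts a class label, and the ensemble picks the majority label
    - soft voting: the models' predicted class probabilities are averaged, and the ensemble picks the class with the highest average probability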
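The two rules can give different answers for the same inputs. The sketch below is a minimal illustration for a binary problem; the probabilities are made-up numbers (not taken from any fitted model), and a 0.5 threshold is assumed for turning a probability into a label.

```python
import numpy as np

# Hypothetical P(class 1) for four samples (rows) from three models (columns).
# These values are invented purely for illustration.
proba_class1 = np.array([
    [0.90, 0.40, 0.45],
    [0.80, 0.70, 0.60],
    [0.10, 0.20, 0.30],
    [0.55, 0.60, 0.05],
])
n_models = proba_class1.shape[1]

# Hard voting: each model casts one label vote, then the majority wins.
label_votes = (proba_class1 >= 0.5).astype(int)
hard_pred = (label_votes.sum(axis=1) > n_models / 2).astype(int)

# Soft voting: average the probabilities first, then threshold the mean.
soft_pred = (proba_class1.mean(axis=1) >= 0.5).astype(int)

print('hard voting:', hard_pred)
print('soft voting:', soft_pred)
```

In the first row a single confident model pulls the average above 0.5 even though it is outvoted, and in the last row a single confident model pulls it below 0.5, so the two rules disagree on those samples.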

In [43]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

#### Wisconsin Breast Cancer Dataset
A dataset commonly used for classifying breast tumors as malignant or benign
(numeric features that describe the tumor, computed from medical images)
   
**Dataset overview**
   - **Goal**: classify a breast tumor as malignant or benign
   - **Samples**: 569
   - **Features**: 30
   - **Target**: 0 (malignant) or 1 (benign)
  
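**Features**

Each of the ten measurements below is stored as a mean, a standard error, and a "worst" (largest) value, which gives the 30 features. scikit-learn names the columns `mean radius`, `mean texture`, and so on.
   1. **Radius mean**: mean radius (distance from the center to points on the perimeter)
   2. **Texture mean**: standard deviation of the gray-scale values in the image
   3. **Perimeter mean**: mean perimeter
   4. **Area mean**: mean area
   5. **Smoothness mean**: local variation in radius lengths
   6. **Compactness mean**: compactness, defined as perimeter² / area − 1.0
   7. **Concavity mean**: severity of the concave portions of the contour
   8. **Concave points mean**: number of concave portions of the contour
   9. **Symmetry mean**: symmetry
   10. **Fractal dimension mean**: fractal dimension ("coastline approximation" − 1)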
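The figures in the overview can be checked directly against the scikit-learn copy of the dataset. The following is a small sketch; the variable names (`class_counts`, `group_sizes`) are illustrative, and the grouping relies on scikit-learn's column naming (`mean ...`, `... error`, `worst ...`).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Number of samples and features.
n_samples, n_features = data.data.shape

# Samples per class, keyed by class name (index 0 = malignant, 1 = benign).
class_counts = {
    str(name): int(count)
    for name, count in zip(data.target_names, np.bincount(data.target))
}

# Count how many columns belong to each statistic group.
group_sizes = {'mean': 0, 'error': 0, 'worst': 0}
for name in data.feature_names:
    if name.startswith('mean '):
        group_sizes['mean'] += 1
    elif name.endswith(' error'):
        group_sizes['error'] += 1
    elif name.startswith('worst '):
        group_sizes['worst'] += 1

print(n_samples, n_features)
print(class_counts)
print(group_sizes)
```

The classes are not balanced (there are more benign than malignant samples), which is worth keeping in mind when reading plain accuracy scores.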

In [44]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
# print(data.DESCR)  # full dataset description

df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         569 non-null    float64
 ...

In [46]:
df.describe()

statistic,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,1.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


In [47]:
from sklearn.model_selection import train_test_split

X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

### hard voting


In [48]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

knn_clf = KNeighborsClassifier()
lr_clf = LogisticRegression()
dt_clf = DecisionTreeClassifier(random_state=0)

voting_clf = VotingClassifier(
    estimators=[
        ('knn_clf', knn_clf), 
        ('lr_clf', lr_clf),
        ('dt_clf', dt_clf),
    ], 
    voting='hard'                # voting strategy
)

voting_clf.fit(X_train, y_train)

print('Train score:', voting_clf.score(X_train, y_train))
print('Test score:', voting_clf.score(X_test, y_test))

Train score: 0.9647887323943662
Test score: 0.951048951048951


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
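The warning above says that `LogisticRegression` stopped at its iteration limit on the unscaled features. One common way to address it, sketched here as an assumption rather than as part of the original notebook, is to standardize the features in a pipeline and raise `max_iter`. Scores obtained this way will generally differ from the ones recorded in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize each feature, then fit logistic regression.
# max_iter=1000 is an assumed, generous limit (the default is 100).
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_train, y_train)

# n_iter_ reports how many solver iterations were actually used.
n_iter = int(scaled_lr.named_steps['logisticregression'].n_iter_[0])
test_score = scaled_lr.score(X_test, y_test)

print('iterations used:', n_iter)
print('test score:', test_score)
```

A pipeline like `scaled_lr` is itself an estimator, so it could be passed to `VotingClassifier` in place of the bare `LogisticRegression()`.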


In [49]:
# How hard voting works: majority vote.
# The per-model predictions printed below show that the ensemble follows the majority.
start, end = 40, 50

voting_pred = voting_clf.predict(X_test[start:end])
print(f'Ensemble predictions: {voting_pred}')

for classifier in [knn_clf, lr_clf, dt_clf]:
    # VotingClassifier fits clones of its estimators, so the originals
    # are still unfitted here and must be fitted before predicting.
    classifier.fit(X_train, y_train)
    pred = classifier.predict(X_test[start:end])
    score = classifier.score(X_test, y_test)

    class_name = classifier.__class__.__name__    # two underscores on each side
    print(f'{class_name} accuracy: {score}')
    print(f'{class_name} predictions: {pred}')

Ensemble predictions: [0 1 0 1 0 0 1 1 1 0]
KNeighborsClassifier accuracy: 0.9370629370629371
KNeighborsClassifier predictions: [0 1 0 1 0 0 1 1 1 0]
LogisticRegression accuracy: 0.9440559440559441
LogisticRegression predictions: [0 1 0 1 0 0 1 1 1 0]
DecisionTreeClassifier accuracy: 0.8811188811188811
DecisionTreeClassifier predictions: [1 1 0 1 0 0 1 1 1 0]
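The majority rule can be replayed by hand on the ten predictions printed above. This sketch copies the three per-model rows and the ensemble row from that output and recomputes the vote with NumPy; the variable names are illustrative.

```python
import numpy as np

# Per-model predictions for X_test[40:50], copied from the output above.
member_preds = np.array([
    [0, 1, 0, 1, 0, 0, 1, 1, 1, 0],  # KNeighborsClassifier
    [0, 1, 0, 1, 0, 0, 1, 1, 1, 0],  # LogisticRegression
    [1, 1, 0, 1, 0, 0, 1, 1, 1, 0],  # DecisionTreeClassifier
])
ensemble_pred = np.array([0, 1, 0, 1, 0, 0, 1, 1, 1, 0])

# For each sample (column), count the votes per class and keep the most frequent one.
majority = np.apply_along_axis(
    lambda votes: np.bincount(votes, minlength=2).argmax(),
    axis=0,
    arr=member_preds,
)

# Positions where at least one model disagreed with the majority.
disagreements = np.flatnonzero((member_preds != majority).any(axis=0))

print('majority vote:', majority)
print('matches ensemble:', np.array_equal(majority, ensemble_pred))
print('samples with a dissenting model:', disagreements)
```

Only the first sample has a dissenting vote (the decision tree predicts 1), and the ensemble follows the other two models.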




### soft voting

In [50]:
voting_clf = VotingClassifier(
    estimators=[
        ('knn_clf', knn_clf), 
        ('lr_clf', lr_clf),
        ('dt_clf', dt_clf),
    ], 
    voting='soft'                # voting strategy
)

voting_clf.fit(X_train, y_train)

print('Train score:', voting_clf.score(X_train, y_train))
print('Test score:', voting_clf.score(X_test, y_test))    # compared with hard voting, the train score rises but the test score drops

Train score: 0.9859154929577465
Test score: 0.9370629370629371




In [51]:
# How soft voting works: average the predicted probabilities of the estimators.
start, end = 40, 50

voting_pred = voting_clf.predict_proba(X_test[start:end])
print(f'Ensemble probabilities: {voting_pred}')

averages = np.zeros_like(voting_pred)

for classifier in [knn_clf, lr_clf, dt_clf]:
    # Fit each model on its own, then predict class probabilities.
    classifier.fit(X_train, y_train)
    pred = classifier.predict_proba(X_test[start:end])
    score = classifier.score(X_test, y_test)

    averages += pred

    class_name = classifier.__class__.__name__    # two underscores on each side
    print(f'{class_name} accuracy: {score}')
    print(f'{class_name} probabilities: {pred}')

print('Mean of the per-model probabilities:', averages / 3)
# Compare with a tolerance: exact equality can fail on floating-point rounding.
print(np.allclose(voting_pred, averages / 3))

Ensemble probabilities: [[5.70263157e-01 4.29736843e-01]
 [1.08113730e-03 9.98918863e-01]
 [9.99622506e-01 3.77494355e-04]
 [3.35757426e-04 9.99664243e-01]
 [9.00993416e-01 9.90065841e-02]
 [1.00000000e+00 1.75163138e-13]
 [7.79971341e-05 9.99922003e-01]
 [1.83004552e-02 9.81699545e-01]
 [1.14568790e-03 9.98854312e-01]
 [9.32982089e-01 6.70179112e-02]]
KNeighborsClassifier accuracy: 0.9370629370629371
KNeighborsClassifier probabilities: [[0.8 0.2]
 [0.  1. ]
 [1.  0. ]
 [0.  1. ]
 [0.8 0.2]
 [1.  0. ]
 [0.  1. ]
 [0.  1. ]
 [0.  1. ]
 [0.8 0.2]]
LogisticRegression accuracy: 0.9440559440559441
LogisticRegression probabilities: [[9.10789471e-01 8.92105287e-02]
 [3.24341189e-03 9.96756588e-01]
 [9.98867517e-01 1.13248306e-03]
 [1.00727228e-03 9.98992728e-01]
 [9.02980248e-01 9.70197522e-02]
 [1.00000000e+00 5.25489414e-13]
 [2.33991402e-04 9.99766009e-01]
 [5.49013655e-02 9.45098634e-01]
 [3.43706371e-03 9.96562936e-01]
 [9.98946266e-01 1.05373359e-03]]
DecisionTreeClassifier accuracy: 0.8811188811188811
DecisionTreeClassifier probabilities: ...
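The same check can be done without refitting the individual models, because a fitted `VotingClassifier` keeps its own fitted copies in `estimators_`. The sketch below is self-contained and makes one assumption that differs from the notebook: `max_iter=10000` is passed to `LogisticRegression` to reduce the chance of the convergence warning, so the exact probabilities may not match the output above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

soft_clf = VotingClassifier(
    estimators=[
        ('knn_clf', KNeighborsClassifier()),
        ('lr_clf', LogisticRegression(max_iter=10000)),  # assumed setting, see lead-in
        ('dt_clf', DecisionTreeClassifier(random_state=0)),
    ],
    voting='soft',
)
soft_clf.fit(X_train, y_train)

# estimators_ holds the fitted clones that the ensemble actually uses.
member_probas = np.stack([est.predict_proba(X_test) for est in soft_clf.estimators_])

# Soft voting with no weights: plain mean over the models, then argmax over the classes.
manual_mean = member_probas.mean(axis=0)
manual_labels = soft_clf.classes_[manual_mean.argmax(axis=1)]

matches_proba = np.allclose(manual_mean, soft_clf.predict_proba(X_test))
matches_labels = np.array_equal(manual_labels, soft_clf.predict(X_test))
print(matches_proba, matches_labels)
```

Reading from `estimators_` guarantees that the probabilities being averaged come from exactly the models inside the ensemble, rather than from separately fitted copies.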

