# Project: Classifying the Three Iris Species
# Model: Decision Tree, Random Forest, SVM, SGD Classifier, Logistic Regression
## Dataset: Scikit-Learn - Iris


- Multiclass Classification [Machine Learning]
- Supervised Learning
- Train and predict using the sepal and petal measurements (length and width) of iris flowers





In [None]:
pip install scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Load the Iris dataset from sklearn.datasets

from sklearn.datasets import load_iris
iris = load_iris() # assign the dataset to the variable iris

# dir() lists the attributes and methods an object has
print(dir(iris))

['DESCR', 'data', 'data_module', 'feature_names', 'filename', 'frame', 'target', 'target_names']


# Load the Iris data
# Inspect the data

In [None]:
print(iris.feature_names)
print(iris.DESCR) # description
print(iris.filename)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Cre

In [None]:
# Check which fields the dataset contains
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [None]:
iris_data = iris.data
print(iris_data.shape)

(150, 4)


In [None]:
iris_data[0] # the 4 attributes of the first flower

array([5.1, 3.5, 1.4, 0.2])

In [None]:
iris_label = iris.target
print(iris_label.shape)
iris_label

(150,)


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [None]:
iris.target_names # names for labels 0, 1, 2, in that order

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

#### Inspecting the data with Pandas
#### Organizing the data in a DataFrame as needed

In [None]:
import pandas as pd
print(pd.__version__)

1.3.5


In [None]:
# Convert the data into a DataFrame
iris_df = pd.DataFrame(data=iris_data, columns = iris.feature_names)
iris_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [None]:
# Add a label column as the last column
iris_df["label"] = iris.target
iris_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),label
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2
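
The numeric `label` column can be made easier to read by mapping each code to its species name. A minimal sketch, assuming the same `load_iris()` object as above; the `species` column name is an illustrative choice and is not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df["label"] = iris.target

# Map 0/1/2 to the matching entry of iris.target_names.
# "species" is a hypothetical column name used only for this illustration.
iris_df["species"] = iris.target_names[iris.target]

# Count rows per species; the dataset description states 50 per class.
species_counts = iris_df["species"].value_counts()
print(species_counts)
```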


(1) Prepare the data: train/test split
(2) Train (fit)
(3) Predict
(4) Measure accuracy (accuracy report)
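
The four steps can be wrapped in one helper so that every model is evaluated the same way. This is only a sketch: `evaluate_model` is a hypothetical helper name that the notebook itself does not define, and the decision tree is used purely as an example model.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_model(model, X, y, test_size=0.2, random_state=7):
    """Hypothetical helper: split, fit, predict, and return test accuracy."""
    # (1) prepare the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # (2) train
    model.fit(X_train, y_train)
    # (3) predict
    y_pred = model.predict(X_test)
    # (4) measure accuracy
    return accuracy_score(y_test, y_pred)


iris = load_iris()
score = evaluate_model(DecisionTreeClassifier(random_state=32), iris.data, iris.target)
print(score)
```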

In [None]:
# split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris_data, # features (the "questions")
                                                    iris_label, # labels the model must predict
                                                    test_size=0.2, # hold out 20% as the test set
                                                    random_state=7) # seed for a reproducible random shuffle

print('X_train count: ', len(X_train),', X_test count: ',len(X_test))

X_train count:  120 , X_test count:  30


X_train: the data holding the 4 features

y_train: the data holding only the ground-truth label y

In [None]:
X_train.shape, y_train.shape

((120, 4), (120,))

In [None]:
X_test.shape, y_test.shape

((30, 4), (30,))

In [None]:
y_train, y_test 

(array([2, 1, 0, 2, 1, 0, 0, 0, 0, 2, 2, 1, 2, 2, 1, 0, 1, 1, 2, 0, 0, 0,
        2, 0, 2, 1, 1, 1, 0, 0, 0, 1, 2, 1, 1, 0, 2, 0, 0, 2, 2, 0, 2, 0,
        1, 2, 1, 0, 1, 0, 2, 2, 1, 0, 0, 1, 2, 0, 2, 2, 1, 0, 1, 0, 2, 2,
        0, 0, 2, 1, 2, 2, 1, 0, 0, 2, 0, 0, 1, 2, 2, 1, 1, 0, 2, 0, 0, 1,
        1, 2, 0, 1, 1, 2, 2, 1, 2, 0, 1, 1, 0, 0, 0, 1, 1, 0, 2, 2, 1, 2,
        0, 2, 1, 1, 0, 2, 1, 2, 1, 0]),
 array([2, 1, 0, 1, 2, 0, 1, 1, 0, 1, 1, 1, 0, 2, 0, 1, 2, 2, 0, 0, 1, 2,
        1, 2, 2, 2, 1, 1, 2, 2]))
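
The printed `y_test` holds 7, 12, and 11 samples of classes 0, 1, and 2, so the random split is not perfectly balanced. If equal class proportions matter, `train_test_split` accepts a `stratify` argument. A sketch, assuming the same data and seed; the `_s` suffixed variable names are introduced only here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# stratify=iris.target keeps the 1:1:1 class ratio in both splits.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=7, stratify=iris.target
)

# np.bincount counts how many samples fall in each class (0, 1, 2).
test_counts = np.bincount(y_test_s)
train_counts = np.bincount(y_train_s)
print(test_counts, train_counts)
```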

## 1. Decision Tree Model
- classification, supervised learning

In [None]:
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier(random_state=32)
print(decision_tree._estimator_type)

classifier


In [None]:
# train the model
decision_tree.fit(X_train, y_train)

DecisionTreeClassifier(random_state=32)

In [None]:
# predict label
y_pred = decision_tree.predict(X_test)
y_pred

array([2, 1, 0, 1, 2, 0, 1, 1, 0, 1, 2, 1, 0, 2, 0, 2, 2, 2, 0, 0, 1, 2,
       1, 1, 2, 2, 1, 1, 2, 2])

In [None]:
# actual ground-truth labels
y_test

array([2, 1, 0, 1, 2, 0, 1, 1, 0, 1, 1, 1, 0, 2, 0, 1, 2, 2, 0, 0, 1, 2,
       1, 2, 2, 2, 1, 1, 2, 2])

In [None]:
# check the accuracy

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
accuracy

0.9
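
Accuracy is the fraction of test samples whose predicted label equals the true label. The check below recomputes it from the two arrays printed above, copied here as literals so the snippet runs on its own.

```python
import numpy as np

# Arrays copied from the decision tree output above.
y_pred = np.array([2, 1, 0, 1, 2, 0, 1, 1, 0, 1, 2, 1, 0, 2, 0, 2, 2, 2, 0, 0, 1, 2,
                   1, 1, 2, 2, 1, 1, 2, 2])
y_test = np.array([2, 1, 0, 1, 2, 0, 1, 1, 0, 1, 1, 1, 0, 2, 0, 1, 2, 2, 0, 0, 1, 2,
                   1, 2, 2, 2, 1, 1, 2, 2])

n_correct = int((y_pred == y_test).sum())  # number of matching positions
manual_accuracy = n_correct / len(y_test)
print(n_correct, len(y_test), manual_accuracy)  # 27 30 0.9
```

Three of the 30 predictions differ from the true labels, which gives 27 / 30 = 0.9.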

## 2. Random Forest Model
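- ensemble (combines several decision tree models, a "wisdom of the crowd" approach)
- each tree is grown using only n randomly chosen features out of all the features
- many decision trees are created
- the most frequent value among the trees' predictions becomes the final prediction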
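
The points above can be sketched by hand: train several trees on bootstrap samples with a random feature subset per split, then take the most frequent prediction. This illustrates the idea and is not scikit-learn's implementation; `RandomForestClassifier` combines its trees by averaging predicted class probabilities rather than by a hard vote. The tree count of 25 and the seeds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=7
)

rng = np.random.default_rng(32)
n_trees = 25  # arbitrary number of trees for this sketch
all_preds = []
for i in range(n_trees):
    # Bootstrap sample: draw training rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" considers a random subset of features at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

all_preds = np.array(all_preds)  # shape: (n_trees, n_test_samples)
# Majority vote: the most frequent label in each column.
voted = np.array([np.bincount(col, minlength=3).argmax() for col in all_preds.T])
vote_accuracy = (voted == y_test).mean()
print(voted)
print(vote_accuracy)
```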

In [None]:
# import the required modules
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# split into train and test data
X_train, X_test, y_train, y_test = train_test_split(iris_data, 
                                                    iris_label, 
                                                    test_size=0.2, 
                                                    random_state=7)

random_forest = RandomForestClassifier(random_state=32)
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       0.83      0.83      0.83        12
           2       0.82      0.82      0.82        11

    accuracy                           0.87        30
   macro avg       0.88      0.88      0.88        30
weighted avg       0.87      0.87      0.87        30



## 3. SVM(Support Vector Machine) Model
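Performs classification using support vectors and a hyperplane.
At its core a linear classification algorithm; note that `svm.SVC()` as used below defaults to the RBF kernel, which produces a non-linear decision boundary.

- Decision Boundary
- Support Vector: the data points closest to the Decision Boundary
- Margin: the distance between the Decision Boundary and the Support Vectors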
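
To connect these terms to code, the sketch below fits a linear-kernel SVM on two of the three classes (setosa and versicolor), where the margin has the simple closed form 1 / ||w||. Restricting the data to two classes and choosing `kernel="linear"` are simplifications made for this illustration; the notebook's own model uses `svm.SVC()` with its default settings.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris

iris = load_iris()
# Keep only classes 0 and 1 so there is a single decision boundary.
mask = iris.target < 2
X, y = iris.data[mask], iris.target[mask]

linear_svm = svm.SVC(kernel="linear")
linear_svm.fit(X, y)

# Support vectors: the training points that define the boundary.
print(linear_svm.support_vectors_)
print(linear_svm.n_support_)  # number of support vectors per class

# For a linear SVM, coef_ holds the normal vector w of the hyperplane,
# and the distance from the boundary to the margin edge is 1 / ||w||.
w = linear_svm.coef_[0]
margin = 1.0 / np.linalg.norm(w)
print(margin)
```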

In [None]:
from sklearn import svm
svm_model = svm.SVC()

print(svm_model._estimator_type)

svm_model.fit(X_train, y_train)

y_pred = svm_model.predict(X_test)
print("y pred", y_pred)
print("y test", y_test)

print("--------------------------")
print(classification_report(y_test, y_pred))


accuracy=accuracy_score(y_test, y_pred)
print(accuracy)

classifier
y pred [2 1 0 1 1 0 1 1 0 1 2 1 0 2 0 2 2 2 0 0 1 2 1 1 2 2 1 1 2 2]
y test [2 1 0 1 2 0 1 1 0 1 1 1 0 2 0 1 2 2 0 0 1 2 1 2 2 2 1 1 2 2]
--------------------------
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       0.83      0.83      0.83        12
           2       0.82      0.82      0.82        11

    accuracy                           0.87        30
   macro avg       0.88      0.88      0.88        30
weighted avg       0.87      0.87      0.87        30

0.8666666666666667
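
The precision and recall values in the report can be traced back to a confusion matrix. The sketch below rebuilds it from the `y pred` and `y test` arrays printed above, copied here as literals.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Arrays copied from the SVM output above.
y_pred = np.array([2, 1, 0, 1, 1, 0, 1, 1, 0, 1, 2, 1, 0, 2, 0, 2, 2, 2, 0, 0, 1, 2,
                   1, 1, 2, 2, 1, 1, 2, 2])
y_test = np.array([2, 1, 0, 1, 2, 0, 1, 1, 0, 1, 1, 1, 0, 2, 0, 1, 2, 2, 0, 0, 1, 2,
                   1, 2, 2, 2, 1, 1, 2, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision: correct predictions of a class / all predictions of that class.
precision = cm.diagonal() / cm.sum(axis=0)
# Recall: correct predictions of a class / all true members of that class.
recall = cm.diagonal() / cm.sum(axis=1)
print(precision.round(2), recall.round(2))
```

Class 1 has 10 correct predictions out of 12 predicted and 12 actual samples, which gives 0.83 for both precision and recall, in line with the report.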


## 4. Stochastic Gradient Descent Classifier (SGD Classifier)

Gradient descent with a batch size of 1.

What is a batch? In gradient descent, a batch is the total number of examples (data points) used to compute a single gradient.
(In plain Gradient Descent, the batch is the entire dataset.)
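
The difference between the two batch sizes can be shown on a tiny least-squares problem. The data, learning rate, and iteration counts below are made up for illustration and say nothing about the internals of `SGDClassifier`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data for y = 3x plus small noise (illustrative only).
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + rng.normal(0, 0.1, size=100)

lr = 0.1  # learning rate, chosen arbitrarily


def grad(w, xb, yb):
    """Gradient of the mean squared error of the model y = w * x on a batch."""
    return 2.0 * np.mean((w * xb - yb) * xb)


# Batch gradient descent: every update uses all 100 examples.
w_batch = 0.0
for _ in range(200):
    w_batch -= lr * grad(w_batch, x, y)

# Stochastic gradient descent: every update uses a single example (batch size 1).
w_sgd = 0.0
for _ in range(20):  # 20 passes over the shuffled data
    for i in rng.permutation(len(x)):
        w_sgd -= lr * grad(w_sgd, x[i:i + 1], y[i:i + 1])

print(w_batch, w_sgd)  # both should land near the true slope of 3
```

The stochastic version takes many cheap, noisy steps, while the batch version takes fewer, smoother steps that each touch the whole dataset.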

In [None]:
from sklearn.linear_model import SGDClassifier
sgd_model = SGDClassifier()

print(sgd_model._estimator_type)

classifier


In [None]:
sgd_model.fit(X_train, y_train)
y_pred = sgd_model.predict(X_test)

print(classification_report(y_test, y_pred))

sgdAccuracy = accuracy_score(y_test, y_pred)
print(sgdAccuracy)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       0.88      0.58      0.70        12
           2       0.67      0.91      0.77        11

    accuracy                           0.80        30
   macro avg       0.85      0.83      0.82        30
weighted avg       0.83      0.80      0.80        30

0.8


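## 5. Logistic Regression (Softmax Regression)

Linear classification; multiclass classification using softmax

softmax: for N classes, a function that normalizes an N-dimensional vector so that each component represents the probability that the corresponding class is the correct answer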
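
As a quick illustration of this definition, softmax can be written in a few lines of NumPy. The input scores are made-up values; subtracting the maximum is a common numerical-stability step and does not change the result.

```python
import numpy as np


def softmax(z):
    """Turn a vector of N raw scores into N probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max())  # shift by the max to avoid overflow
    return exp_z / exp_z.sum()


scores = [2.0, 1.0, 0.1]  # hypothetical raw scores for 3 classes
probs = softmax(scores)
print(probs, probs.sum())
```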

In [None]:
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()

logistic_model.fit(X_train, y_train)
y_pred = logistic_model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       0.83      0.83      0.83        12
           2       0.82      0.82      0.82        11

    accuracy                           0.87        30
   macro avg       0.88      0.88      0.88        30
weighted avg       0.87      0.87      0.87        30



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
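
The warning above means the solver stopped before converging. Following its two suggestions, one possible remedy is to standardize the features and raise `max_iter`. A sketch, assuming the same split as before; `max_iter=1000` is an arbitrary choice, and any scores it prints are not results from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=7
)

# StandardScaler rescales each feature to zero mean and unit variance;
# inside the pipeline it is fitted on the training data only.
scaled_logistic = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logistic.fit(X_train, y_train)

y_pred = scaled_logistic.predict(X_test)
print(classification_report(y_test, y_pred))
```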
