Wine Classification
====


This exercise uses load_wine (the wine dataset), one of scikit-learn's toy datasets,
to classify wines into three classes based on their measured features.

- The load_wine dataset contains 178 samples in total
- There are 13 features, all wine measurements such as Alcohol, Malic acid, and Color intensity
- The label takes one of three categories: class 0, 1, or 2
    
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine

## 1) Import the required modules

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

## 2) Prepare the data

In [2]:
wine = load_wine()

# Check what information wine holds with the keys() method
wine.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])

## 3) Understand the data

### Assign the feature data

In [3]:
wine_data = wine.data

print(wine_data.shape)

(178, 13)


### Inspect the data

In [4]:
wine_data[10]

array([1.41e+01, 2.16e+00, 2.30e+00, 1.80e+01, 1.05e+02, 2.95e+00,
       3.32e+00, 2.20e-01, 2.38e+00, 5.75e+00, 1.25e+00, 3.17e+00,
       1.51e+03])
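
The raw array above is hard to read because the values are not labeled. As a small sketch (not part of the original notebook), each value of sample 10 can be paired with its feature name. It also makes visible that the features sit on very different scales, from 0.22 up to 1510 in this one row.

```python
from sklearn.datasets import load_wine

wine = load_wine()

# Pair each feature name with the corresponding value of sample 10
sample = dict(zip(wine.feature_names, wine.data[10]))

for name, value in sample.items():
    print(f"{name:>30}: {value:g}")
```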

### Assign the label data

In [5]:
wine_label = wine.target

print(wine_label.shape)

(178,)


### Print the target names

In [6]:
wine.target_names

array(['class_0', 'class_1', 'class_2'], dtype='<U7')
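
Beyond the class names, it can help to know how many samples each class has before splitting the data. The following is a minimal sketch that counts the labels with `np.bincount`; the variable name `counts` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# Count how many samples carry each integer label (0, 1, 2)
counts = np.bincount(wine.target)

for name, count in zip(wine.target_names, counts):
    print(name, count)
```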

### Read the dataset description

In [7]:
# print() renders the line breaks in the description text
print(wine.DESCR)



## 4) Split into train and test data

In [8]:
import pandas as pd

wine_df = pd.DataFrame(data=wine_data, columns=wine.feature_names)
wine_df.tail()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.7,0.64,1.74,740.0
174,13.4,3.91,2.48,23.0,102.0,1.8,0.75,0.43,1.41,7.3,0.7,1.56,750.0
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.2,0.59,1.56,835.0
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.3,0.6,1.62,840.0
177,14.13,4.1,2.74,24.5,96.0,2.05,0.76,0.56,1.35,9.2,0.61,1.6,560.0


In [9]:
# Add the label column
wine_df["label"] = wine_label

wine_df.tail()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,label
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.7,0.64,1.74,740.0,2
174,13.4,3.91,2.48,23.0,102.0,1.8,0.75,0.43,1.41,7.3,0.7,1.56,750.0,2
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.2,0.59,1.56,835.0,2
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.3,0.6,1.62,840.0,2
177,14.13,4.1,2.74,24.5,96.0,2.05,0.76,0.56,1.35,9.2,0.61,1.6,560.0,2
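
The DataFrame assembled by hand above can also be requested directly from the loader. The sketch below assumes a scikit-learn version that supports `as_frame=True` (the `frame` key in the `keys()` output suggests this one does) and that pandas is installed. The rename to `label` is only there to match the column name used in this notebook.

```python
from sklearn.datasets import load_wine

# With as_frame=True the Bunch carries a ready-made DataFrame in .frame,
# holding the 13 feature columns plus a "target" column
wine_frame = load_wine(as_frame=True).frame

# Rename "target" to "label" to match the naming used in this notebook
wine_frame = wine_frame.rename(columns={"target": "label"})

print(wine_frame.tail())
```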


In [10]:
X_train, X_test, y_train, y_test = train_test_split(wine_data, wine_label, 
                                                    test_size=0.2, random_state=11)

print('X_train count: ', len(X_train), ', X_test count: ', len(X_test))
print('y_train count: ', len(y_train), ', y_test count: ', len(y_test))

X_train count:  142 , X_test count:  36
y_train count:  142 , y_test count:  36
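
The split above is purely random, so the class mix of the 36 test samples need not match the full dataset (the reports in section 6 show supports of 14, 15, and 7 for classes 0, 1, and 2). One option, shown here as a sketch rather than as the notebook's method, is to pass `stratify` so that each class keeps roughly the same proportion in both parts.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()

# stratify=wine.target keeps the class proportions similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=11, stratify=wine.target
)

print("train class counts:", np.bincount(y_train))
print("test class counts: ", np.bincount(y_test))
```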


## 5) Train several models

### Try a Decision Tree

In [11]:
# Decision Tree model
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier(random_state=111)
print(decision_tree._estimator_type)

classifier


In [12]:
# Train the model
decision_tree.fit(X_train, y_train)

DecisionTreeClassifier(random_state=111)

In [13]:
# Predict and check accuracy
from sklearn.metrics import accuracy_score
y_pred_dt = decision_tree.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_dt)
accuracy

0.9722222222222222

### Try a Random Forest

In [14]:
# Random Forest model
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(random_state=111)

# Train
random_forest.fit(X_train, y_train)

# Predict and check accuracy
y_pred_rf = random_forest.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_rf)
accuracy

0.9722222222222222

### Try an SVM

In [15]:
# SVM model
from sklearn import svm
svm_model = svm.SVC()

# Train
svm_model.fit(X_train, y_train)

# Predict and check accuracy
y_pred_svm = svm_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_svm)
accuracy

0.7777777777777778
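
A likely reason for the SVM's lower accuracy is that the features sit on very different scales (proline is in the hundreds or thousands while several others are below 5), and the default RBF kernel depends on distances between samples. The sketch below, which is an addition and not part of the original run, compares the same `SVC` with and without a `StandardScaler` in front of it. No accuracy figure is claimed; run it to see the values.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=11
)

# Same model twice: once on raw features, once on standardized features.
# The pipeline fits the scaler on the training data only.
raw_svm = SVC().fit(X_train, y_train)
scaled_svm = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

raw_acc = raw_svm.score(X_test, y_test)
scaled_acc = scaled_svm.score(X_test, y_test)
print("unscaled:", raw_acc)
print("scaled:  ", scaled_acc)
```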

### Try an SGD Classifier

In [16]:
# SGD Classifier model
from sklearn.linear_model import SGDClassifier
sgd_model = SGDClassifier()

# Train
sgd_model.fit(X_train, y_train)

# Predict and check accuracy
y_pred_sgd = sgd_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_sgd)
accuracy

0.7222222222222222
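
Unlike the tree models above, this `SGDClassifier` was created without a `random_state`. Because it shuffles the training data during fitting, the accuracy shown is the result of one particular run and may differ on the next. A minimal sketch of making the run repeatable follows; the helper `fit_predict` and the seed value 0 are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=11
)


def fit_predict(seed):
    """Train an SGDClassifier with a fixed seed and return its test predictions."""
    model = SGDClassifier(random_state=seed)
    model.fit(X_train, y_train)
    return model.predict(X_test)


# Two independent fits with the same seed give the same predictions
first = fit_predict(0)
second = fit_predict(0)
print("identical predictions:", np.array_equal(first, second))
```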

### Try Logistic Regression

In [17]:
# Logistic Regression model
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()  # default solver; solver='liblinear' is an alternative

# Train
logistic_model.fit(X_train, y_train)

# Predict and check accuracy
y_pred_lr = logistic_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_lr)
accuracy

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.9722222222222222
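
The warning above says the solver stopped at its iteration limit and suggests either raising `max_iter` or scaling the data. The sketch below applies both suggestions together; the value `max_iter=1000` is an arbitrary illustrative choice, and it then checks how many iterations were actually used.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=11
)

# Standardize the features, then fit with a higher iteration limit
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# n_iter_ reports the iterations the solver actually used
n_iter = pipe.named_steps["logisticregression"].n_iter_
print("iterations used:", n_iter)
print("accuracy:", pipe.score(X_test, y_test))
```

If the reported iteration count is below `max_iter`, the solver converged before hitting the limit.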

## 6) Evaluate the models

In [18]:
# Precision, Recall, F1 score
# Use classification_report from sklearn.metrics to see every metric at once

# Decision Tree model
print("[ Decision Tree ]")
print(classification_report(y_test, y_pred_dt))

[ Decision Tree ]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       1.00      0.93      0.97        15
           2       0.88      1.00      0.93         7

    accuracy                           0.97        36
   macro avg       0.96      0.98      0.97        36
weighted avg       0.98      0.97      0.97        36



In [19]:
# Random Forest model
print("[ Random Forest ]")
print(classification_report(y_test, y_pred_rf))

[ Random Forest ]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       1.00      0.93      0.97        15
           2       0.88      1.00      0.93         7

    accuracy                           0.97        36
   macro avg       0.96      0.98      0.97        36
weighted avg       0.98      0.97      0.97        36



In [20]:
# SVM model
print("[ SVM ]")
print(classification_report(y_test, y_pred_svm))

[ SVM ]
              precision    recall  f1-score   support

           0       0.93      1.00      0.97        14
           1       0.75      0.80      0.77        15
           2       0.40      0.29      0.33         7

    accuracy                           0.78        36
   macro avg       0.69      0.70      0.69        36
weighted avg       0.75      0.78      0.76        36



In [21]:
# SGD Classifier model
print("[ SGD Classifier ]")
print(classification_report(y_test, y_pred_sgd))

[ SGD Classifier ]
              precision    recall  f1-score   support

           0       1.00      0.79      0.88        14
           1       0.60      1.00      0.75        15
           2       0.00      0.00      0.00         7

    accuracy                           0.72        36
   macro avg       0.53      0.60      0.54        36
weighted avg       0.64      0.72      0.65        36



  _warn_prf(average, modifier, msg_start, len(result))
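
In the SGD report, class 2 has precision and recall of 0.00, and the `_warn_prf` line is what remains of a warning. This most likely means the model never predicted class 2, which leaves precision undefined, so scikit-learn reports it as 0 and warns. The sketch below reproduces that situation on small hypothetical labels and uses the `zero_division` parameter to state the fallback value explicitly.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: class 2 occurs in y_true but is never predicted
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1]

# zero_division=0 reports undefined precision as 0 without raising a warning
print(classification_report(y_true, y_pred, zero_division=0))

# The same numbers as a dictionary, keyed by the class label as a string
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["2"])
```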


In [22]:
# Logistic Regression model
print("[ Logistic Regression ]")
print(classification_report(y_test, y_pred_lr))

[ Logistic Regression ]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       1.00      0.93      0.97        15
           2       0.88      1.00      0.93         7

    accuracy                           0.97        36
   macro avg       0.96      0.98      0.97        36
weighted avg       0.98      0.97      0.97        36



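In the wine classification problem, what matters is how many samples are assigned to the correct class, so accuracy is the most relevant of the evaluation metrics.

    Accuracy: (TP+TN) / (TP+TN+FP+FN)
    - The proportion of all samples that were classified correctly
    - The higher the accuracy, the better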
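
The TP/TN form of the formula describes the two-class case. With three classes, as here, accuracy is the number of correct predictions divided by the number of samples, which equals the diagonal of the confusion matrix divided by its total. The sketch below checks this on small hypothetical labels (not the notebook's predictions).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical three-class labels, for illustration only
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2, 2, 2])

# Rows are true classes, columns are predicted classes;
# correct predictions lie on the diagonal
cm = confusion_matrix(y_true, y_pred)
by_hand = np.trace(cm) / cm.sum()

print(cm)
print(by_hand, accuracy_score(y_true, y_pred))
```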

The accuracy of each model is as follows:
- Decision Tree : 0.97
- Random Forest : 0.97
- SVM : 0.78
- SGD Classifier : 0.72
- Logistic Regression : 0.97

On this test split, the Decision Tree, Random Forest, and Logistic Regression models therefore predicted the wine classes best.
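
These figures come from a single test split of 36 samples, so one misclassified wine shifts accuracy by almost 3 percentage points. As a rough sketch of a sturdier comparison, which the original notebook does not do, 5-fold cross-validation averages accuracy over several splits. Only two of the models are shown, and `cv=5` is an illustrative choice.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=111),
    "Random Forest": RandomForestClassifier(random_state=111),
}

# For classifiers, cv=5 uses stratified folds; the default score is accuracy
results = {}
for name, model in models.items():
    scores = cross_val_score(model, wine.data, wine.target, cv=5)
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```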