# scikit-learn Handwritten Digit Classification Example

1. Import the required modules
2. Prepare the data
3. Analyze the data
4. Split into train and test sets
5. Train several models
6. Evaluate the models

## 1) Import required modules

In [1]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

## 2) Prepare the data

In [2]:
digits = load_digits()

## 3) Understand the data

In [3]:
digits.keys()

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])

### Feature Data

In [4]:
digits_data = digits.data

In [5]:
digits_data

array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])

In [6]:
digits_data.shape

(1797, 64)

The feature array holds 1,797 samples, each an 8x8 image flattened into a 64-element 1-D array.

### Label Data

In [7]:
digits.target

array([0, 1, 2, ..., 8, 9, 8])

In [8]:
digits_label=digits.target

In [9]:
num_of_class=set(digits_label)
print(num_of_class)
print(len(num_of_class))

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
10


In [10]:
digits_label.shape

(1797,)

The labels are the classes of the handwritten-digit images: 10 classes, 0 through 9. There are 1,797 labels, one per sample.
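Beyond the number of classes, it can help to check how many samples each class has, since class balance affects how the averaged metrics later in this notebook should be read. The following is a minimal sketch, not part of the original notebook, that counts samples per digit with `np.bincount`.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# np.bincount counts how often each non-negative integer label occurs,
# so counts[d] is the number of images labeled with digit d.
counts = np.bincount(digits.target)

for digit, n in enumerate(counts):
    print(f"digit {digit}: {n} samples")
```

If the counts come out roughly equal, the dataset is close to balanced, and macro and weighted averages of a metric should be close to each other.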

### Target Names

In [11]:
digits.target_names

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

### Dataset description

In [12]:
print(digits.DESCR)

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each blo

In [13]:
digits_data[0]

array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])
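The 64-element vector printed above is one flattened 8x8 image. As a small sketch (the values are copied from the output above; the threshold of 7 is an arbitrary choice for illustration), it can be reshaped back into its 2-D form and rendered as rough ASCII art:

```python
import numpy as np

# First sample of digits.data, copied from the output above (64 values).
flat = np.array([
    0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,
    0.,  0., 13., 15., 10., 15.,  5.,  0.,
    0.,  3., 15.,  2.,  0., 11.,  8.,  0.,
    0.,  4., 12.,  0.,  0.,  8.,  8.,  0.,
    0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.,
    0.,  4., 11.,  0.,  1., 12.,  7.,  0.,
    0.,  2., 14.,  5., 10., 12.,  0.,  0.,
    0.,  0.,  6., 13., 10.,  0.,  0.,  0.,
])

# Restore the original 8x8 layout (row-major order).
image = flat.reshape(8, 8)

# Crude rendering: '#' for darker pixels, '.' for lighter ones.
# The cutoff of 7 is an assumption chosen only for display.
for row in image:
    print("".join("#" if value > 7 else "." for value in row))
```

With matplotlib available, `plt.imshow(image, cmap="gray")` would show the same array as an actual image.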

In [14]:
from matplotlib import pyplot as plt


## 4) Split into train and test sets

In [15]:
X_train, X_test, y_train, y_test = train_test_split(digits_data,
                                                   digits_label,
                                                   test_size=0.2,
                                                   random_state=54)

In [16]:
print(f'X_train : {X_train.shape} y_train : {y_train.shape}\nX_test : {X_test.shape} y_test : {y_test.shape}')

X_train : (1437, 64) y_train : (1437,)
X_test : (360, 64) y_test : (360,)
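The split above is purely random, so the class proportions in the test set can drift slightly from those of the full dataset. One option, not used in the original notebook, is to pass `stratify` to `train_test_split` so that each class is represented in roughly the same proportion in both sets. A minimal sketch under that assumption:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# stratify=digits.target keeps the per-class proportions roughly equal
# in the train and test sets; the other arguments match the cell above.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    test_size=0.2,
    random_state=54,
    stratify=digits.target,
)

print(X_train.shape, X_test.shape)
print("test samples per class:", np.bincount(y_test))
```

Because the digits dataset is already close to balanced, the effect here is small; stratification matters more when some classes are rare.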


## 5) Train several models

- `DecisionTree`
- `RandomForest`
- `SVM`
- `SGD Classifier`
- `Logistic Regression`

### Import required modules

In [17]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
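# Import the SVC class directly so the `svm` variable below does not shadow the sklearn.svm module
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

# First attempt

In [18]:
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
svm = SVC()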
sgd = SGDClassifier()
logi = LogisticRegression()

In [19]:
dt.fit(X_train,y_train)
dt_pred = dt.predict(X_test)

In [20]:
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)

In [21]:
svm.fit(X_train,y_train)
svm_pred = svm.predict(X_test)

In [22]:
sgd.fit(X_train,y_train)
sgd_pred = sgd.predict(X_test)

In [23]:
logi.fit(X_train,y_train)
logi_pred = logi.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
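The warning above says the solver stopped before converging and suggests either raising `max_iter` or scaling the data. The sketch below, which is not part of the original run, tries both at once: it standardizes the features with `StandardScaler` inside a pipeline and raises `max_iter` to 1000 (an arbitrary value assumed here, not a tuned one).

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=54
)

# The pipeline fits the scaler on the training data only, then applies
# the same scaling to whatever is passed to predict() or score().
scaled_logi = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000),  # max_iter=1000 is an assumed value
)
scaled_logi.fit(X_train, y_train)

test_accuracy = scaled_logi.score(X_test, y_test)
print(f"accuracy: {test_accuracy:.3f}")
```

Whether the warning disappears entirely depends on the solver and the scikit-learn version, so this should be treated as a starting point rather than a guaranteed fix.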


## 6) Evaluate the models

In [24]:
from sklearn.metrics import confusion_matrix,accuracy_score,precision_score,recall_score,f1_score,classification_report

In [25]:
confusion_matrix(y_test,dt_pred)

array([[30,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0, 23,  2,  1,  2,  2,  0,  1,  3,  1],
       [ 0,  0, 33,  0,  0,  0,  0,  1,  0,  2],
       [ 0,  0,  0, 29,  0,  0,  0,  0,  3,  3],
       [ 0,  3,  1,  1, 34,  0,  0,  1,  0,  0],
       [ 0,  0,  0,  1,  1, 36,  2,  0,  0,  1],
       [ 0,  0,  1,  1,  0,  0, 34,  0,  0,  0],
       [ 0,  1,  1,  2,  2,  0,  0, 31,  0,  1],
       [ 0,  1,  0,  3,  1,  1,  0,  1, 26,  2],
       [ 0,  2,  1,  1,  0,  1,  0,  1,  1, 27]])

- With 10 classes, the raw confusion matrix is hard to read.
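One way to make the matrix easier to digest is to reduce it to a few numbers per class. The sketch below copies the decision-tree confusion matrix from the output above and derives accuracy, per-class recall and precision, and the most frequent confusions. The cutoff of 3 misclassifications is an arbitrary choice for illustration.

```python
import numpy as np

# Decision-tree confusion matrix copied from the output above.
# Rows are true digits, columns are predicted digits.
cm = np.array([
    [30,  0,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 0, 23,  2,  1,  2,  2,  0,  1,  3,  1],
    [ 0,  0, 33,  0,  0,  0,  0,  1,  0,  2],
    [ 0,  0,  0, 29,  0,  0,  0,  0,  3,  3],
    [ 0,  3,  1,  1, 34,  0,  0,  1,  0,  0],
    [ 0,  0,  0,  1,  1, 36,  2,  0,  0,  1],
    [ 0,  0,  1,  1,  0,  0, 34,  0,  0,  0],
    [ 0,  1,  1,  2,  2,  0,  0, 31,  0,  1],
    [ 0,  1,  0,  3,  1,  1,  0,  1, 26,  2],
    [ 0,  2,  1,  1,  0,  1,  0,  1,  1, 27],
])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all samples
recall = np.diag(cm) / cm.sum(axis=1)       # per true class (row sums)
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class (column sums)

print(f"accuracy: {accuracy:.4f}")
for digit in range(10):
    print(f"digit {digit}: precision={precision[digit]:.2f} recall={recall[digit]:.2f}")

# List the (true, predicted) pairs that were confused at least 3 times.
errors = cm.copy()
np.fill_diagonal(errors, 0)
true_idx, pred_idx = np.nonzero(errors >= 3)
for t, p in zip(true_idx, pred_idx):
    print(f"true {t} predicted as {p}: {errors[t, p]} times")
```

The per-class values agree with the `classification_report` output for the decision tree, which is a useful sanity check when reading the two side by side.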

In [32]:
import pandas as pd
model_dict = {'DecisionTree' : dt_pred, 'RandomForest' : rf_pred, 'SVM' : svm_pred, 'SGD': sgd_pred, 'LogisticRegression':logi_pred}
measure=pd.DataFrame(columns=model_dict.keys(),index=['accuracy','precision','recall'])
for k, v in model_dict.items():
    accuracy=accuracy_score(y_test,v)
    # average=None returns one score per class; their mean is the macro average
    precisions=precision_score(y_test, v, average=None)
    recalls=recall_score(y_test,v, average=None)
    # .loc avoids chained assignment, which pandas may not write back to the frame
    measure.loc['accuracy', k]=accuracy
    measure.loc['precision', k]=precisions.mean()
    measure.loc['recall', k]=recalls.mean()
    print(f'=============={k} performance===============')
    print('Summary')
    print(classification_report(y_test,v))
    print(f'Accuracy : {accuracy}')
    print(f'Precision : {precisions.mean()}')
    print(f'Recall : {recalls.mean()}')

Summary
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        30
           1       0.77      0.66      0.71        35
           2       0.85      0.92      0.88        36
           3       0.74      0.83      0.78        35
           4       0.85      0.85      0.85        40
           5       0.90      0.88      0.89        41
           6       0.94      0.94      0.94        36
           7       0.86      0.82      0.84        38
           8       0.79      0.74      0.76        35
           9       0.73      0.79      0.76        34

    accuracy                           0.84       360
   macro avg       0.84      0.84      0.84       360
weighted avg       0.84      0.84      0.84       360

Accuracy : 0.8416666666666667
Precision : 0.8416666666666667
Recall : 0.8416666666666667
Summary
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        30
           1       0.97      1.00      0.99        

In [31]:
print('===============Overall summary==============')
print(measure)

          DecisionTree RandomForest       SVM       SGD LogisticRegression
accuracy      0.841667     0.980556  0.988889  0.941667           0.958333
precision     0.842581     0.980979  0.989112  0.945518           0.959276
recall        0.841667     0.980556  0.988889  0.941667           0.958333


- The model with consistently high precision and recall across all classes was `SVM`.
- `RandomForest` came next, also with high precision and recall.

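## Why is there so little difference between accuracy, precision, and recall?
- Contrary to expectation, the `accuracy`, `precision`, and `recall` values did not differ meaningfully.
- `precision_score()` and `recall_score()` in `sklearn.metrics` default to binary classification (`average='binary'`), so for a multiclass problem an explicit `average` argument is needed to get a single score averaged over the classes.
- `classification_report()`, by contrast, reports both the `macro` and the `weighted` averages.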
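Part of the similarity follows directly from the definitions. The short derivation below uses notation introduced here for illustration: $TP_k$, $FP_k$, $FN_k$ are the true positives, false positives, and false negatives of class $k$, $n_k$ is the number of true samples of class $k$, and $N$ is the total number of samples. In single-label multiclass classification every sample receives exactly one predicted label and has exactly one true label, so $\sum_k (TP_k + FP_k) = \sum_k (TP_k + FN_k) = N$.

$$
\text{precision}_{\text{micro}}
= \frac{\sum_k TP_k}{\sum_k (TP_k + FP_k)}
= \frac{\sum_k TP_k}{N}
= \text{accuracy},
\qquad
\text{recall}_{\text{micro}}
= \frac{\sum_k TP_k}{\sum_k (TP_k + FN_k)}
= \frac{\sum_k TP_k}{N}
= \text{accuracy}
$$

$$
\text{recall}_{\text{weighted}}
= \sum_k \frac{n_k}{N} \cdot \frac{TP_k}{n_k}
= \frac{\sum_k TP_k}{N}
= \text{accuracy}
$$

So micro-averaged precision, micro-averaged recall, and weighted recall always equal accuracy in this setting. The macro averages are not tied to accuracy in the same way, but when the classes are nearly balanced, as the per-class support values in the reports suggest, they tend to land close to it.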

In [36]:
import pandas as pd
model_dict = {'DecisionTree' : dt_pred, 'RandomForest' : rf_pred, 
              'SVM' : svm_pred, 'SGD': sgd_pred, 'LogisticRegression':logi_pred}
measure=pd.DataFrame(columns=model_dict.keys(),index=['accuracy',
                                                      'precision_micro','recall_micro',
                                                      'precision_macro','recall_macro',
                                                      'precision_weighted','recall_weighted'])
for k, v in model_dict.items():
    # .loc avoids chained assignment, which pandas may not write back to the frame
    measure.loc['accuracy', k]=accuracy_score(y_test,v)
    measure.loc['precision_micro', k]=precision_score(y_test, v, average='micro')
    measure.loc['recall_micro', k]=recall_score(y_test, v, average='micro')
    measure.loc['precision_macro', k]=precision_score(y_test, v, average='macro')
    measure.loc['recall_macro', k]=recall_score(y_test, v, average='macro')
    measure.loc['precision_weighted', k]=precision_score(y_test, v, average='weighted')
    measure.loc['recall_weighted', k]=recall_score(y_test, v, average='weighted')
    
print(measure)

                   DecisionTree RandomForest       SVM       SGD LogisticRegression
accuracy               0.841667     0.980556  0.988889  0.941667           0.958333
precision_micro        0.841667     0.980556  0.988889  0.941667           0.958333
recall_micro           0.841667     0.980556  0.988889  0.941667           0.958333
precision_macro        0.842957     0.981211  0.989014  0.945728           0.960521
recall_macro           0.842764     0.981025  0.988906  0.942099           0.959413
precision_weighted     0.842581     0.980979  0.989112  0.945518           0.959276
recall_weighted        0.841667     0.980556  0.988889  0.941667           0.958333


### The `average` argument of `sklearn.metrics.precision_score` and `recall_score`

`average` : {'micro', 'macro', 'samples', 'weighted', 'binary'}, default='binary'

This parameter is required for multiclass/multilabel targets. If `None`, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

- `'binary'`: Only report results for the class specified by `pos_label`. This is applicable only if targets (`y_{true,pred}`) are binary.
            
- `'micro'`: Calculate metrics globally by counting the total true positives, false negatives and false positives.
            
- `'macro'`: Calculate metrics for each label, and find their unweighted mean.  This does not take label imbalance into account.
            
- `'weighted'`: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall.
            
- `'samples'`: Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from `accuracy_score`).
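The averaging modes described above only diverge noticeably when the classes are imbalanced. The sketch below uses a small made-up label set (6 samples of class 0, 2 each of classes 1 and 2; the data is hypothetical and not from the digits dataset) to show the difference with `precision_score` and `recall_score`.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical, deliberately imbalanced labels for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]

accuracy = accuracy_score(y_true, y_pred)
print(f"accuracy: {accuracy:.4f}")

# Collect (precision, recall) for each averaging mode.
scores = {}
for avg in ("micro", "macro", "weighted"):
    scores[avg] = (
        precision_score(y_true, y_pred, average=avg),
        recall_score(y_true, y_pred, average=avg),
    )
    print(f"{avg:>8}: precision={scores[avg][0]:.4f} recall={scores[avg][1]:.4f}")
```

Here the macro recall drops well below accuracy because the two small classes, each with recall 0.5, count as much as the large one, while the micro scores and the weighted recall still match accuracy.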

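# Retrospective

Perhaps because this is a standard example dataset, the evaluation scores came out high across the board.
I still need to study how binary and multiclass classification differ, and which evaluation metrics are better suited to each.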