# EP2_1: Digit Classification
* The task is to classify handwritten digits into 10 classes, 0 through 9.
* Uses 1,797 samples in total, each an 8x8 image (64 pixel values).

## 1. Preparing and Inspecting the Data

In [1]:
# (1) Import the required modules
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [2]:
# (2) Load the data
digits = load_digits()

In [3]:
digits.keys()

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])
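
The `data` and `images` keys hold the same pixels in two layouts. A minimal sketch to confirm the shapes stated at the top of this notebook; the variable names `flattened` and `same` are only for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# `data` stores each image flattened to 64 values; `images` keeps the 8x8 grid.
print(digits.data.shape)
print(digits.images.shape)

# Flattening each 8x8 image row by row should reproduce the feature matrix.
flattened = digits.images.reshape(len(digits.images), -1)
same = np.array_equal(flattened, digits.data)
print("images flattened == data:", same)
```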

In [4]:
# (3) Understand the data
# Assign the feature data
digits_data = digits.data
# Assign the label data
digits_label = digits.target
# Print the target names
print("<Target Names>")
print(digits.target_names)
# Print the dataset description
print("<DESCR>")
print(digits.DESCR)

<Target Names>
[0 1 2 3 4 5 6 7 8 9]
<DESCR>
.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and th

## 2. Splitting into Train and Test Sets

In [5]:
# (4) Split into train and test data
X_train, X_test, y_train, y_test = train_test_split(digits_data, digits_label, test_size=0.2, random_state=7, stratify=digits_label)

print('X_train count:', len(X_train), ', X_test count:', len(X_test))
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

X_train count: 1437 , X_test count: 360
(1437, 64) (1437,)
(360, 64) (360,)
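
Because the split uses `stratify`, each digit should make up roughly the same share of the train and test sets. A small check, assuming the same split arguments as the cell above; the names `train_counts`, `test_counts`, `train_share` and `test_share` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=7, stratify=digits.target
)

# Count how many samples of each digit landed in each split.
train_counts = np.bincount(y_train, minlength=10)
test_counts = np.bincount(y_test, minlength=10)

# Convert counts to per-class shares so the two splits can be compared directly.
train_share = train_counts / train_counts.sum()
test_share = test_counts / test_counts.sum()

for digit in range(10):
    print(digit, train_counts[digit], test_counts[digit],
          round(train_share[digit], 3), round(test_share[digit], 3))
```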


## 3. Training and Evaluating Models

In [6]:
# (5) Train several different models

# Try a Decision Tree
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier(random_state=32)
decision_tree.fit(X_train, y_train)
y_pred1 = decision_tree.predict(X_test)
print("Decision Tree: \n", classification_report(y_test, y_pred1))
print(confusion_matrix(y_test, y_pred1))

# Try a Random Forest
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(random_state=32)
random_forest.fit(X_train, y_train)
y_pred2 = random_forest.predict(X_test)
print("Random Forest: \n", classification_report(y_test, y_pred2))
print(confusion_matrix(y_test, y_pred2))

# Try an SVM
from sklearn import svm
svm_model = svm.SVC()
svm_model.fit(X_train, y_train)
y_pred3 = svm_model.predict(X_test)
print("SVM          : \n", classification_report(y_test, y_pred3))
print(confusion_matrix(y_test, y_pred3))

# Try an SGD Classifier
from sklearn.linear_model import SGDClassifier
sgd_model = SGDClassifier()
sgd_model.fit(X_train, y_train)
y_pred4 = sgd_model.predict(X_test)
print("SGD Classifier:\n", classification_report(y_test, y_pred4))
print(confusion_matrix(y_test, y_pred4))

# Try Logistic Regression
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred5 = logistic_model.predict(X_test)
print("Logistic Reg.: \n", classification_report(y_test, y_pred5))
print(confusion_matrix(y_test, y_pred5))

Decision Tree: 
               precision    recall  f1-score   support

           0       0.94      0.92      0.93        36
           1       0.88      0.81      0.85        37
           2       0.97      0.89      0.93        35
           3       0.86      0.81      0.83        37
           4       0.74      0.89      0.81        36
           5       0.89      0.86      0.87        36
           6       1.00      0.89      0.94        36
           7       0.90      0.97      0.93        36
           8       0.77      0.77      0.77        35
           9       0.78      0.86      0.82        36

    accuracy                           0.87       360
   macro avg       0.87      0.87      0.87       360
weighted avg       0.87      0.87      0.87       360

[[33  0  0  0  2  0  0  0  1  0]
 [ 0 30  0  0  1  1  0  0  1  4]
 [ 0  0 31  1  0  0  0  0  3  0]
 [ 0  0  0 30  1  0  0  2  2  2]
 [ 1  1  0  0 32  0  0  0  1  1]
 [ 0  1  0  1  1 31  0  1  0  1]
 [ 0  0  0  0  4  0 32  0 

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
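
The warning above suggests two remedies: scale the data or raise `max_iter`. A sketch that applies both to the Logistic Regression model, assuming the same split as before; `max_iter=1000` is an illustrative value rather than a tuned one, and the name `scaled_logistic` is hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=7, stratify=digits.target
)

# Standardize the pixel features, then fit with a larger iteration budget.
# The scaler is fit on the training data only, via the pipeline.
scaled_logistic = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logistic.fit(X_train, y_train)

test_accuracy = scaled_logistic.score(X_test, y_test)
print("Logistic Regression (scaled) accuracy:", round(test_accuracy, 3))
```

Wrapping the scaler and the model in one pipeline keeps the test set out of the scaling statistics.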


## Results
Overall, the models produced high prediction scores.

The SVM model is judged to have the highest accuracy (0.99).
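
One way to check this ranking is to fit every model on the same split and collect the accuracies in one place. A rough sketch, assuming the same split and constructor arguments as the cells above. The `random_state=32` on `SGDClassifier` is an addition here for repeatability (the original run did not set it), so its score may differ from the original output.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=7, stratify=digits.target
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=32),
    "Random Forest": RandomForestClassifier(random_state=32),
    "SVM": SVC(),
    "SGD Classifier": SGDClassifier(random_state=32),  # seed added for repeatability
    "Logistic Reg.": LogisticRegression(),  # may emit the convergence warning
}

# Fit each model on the same training data and score it on the same test data.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Print from highest to lowest accuracy.
for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```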

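#### Recall seems to be the better evaluation metric to look at
The reasons are: 1) it is less affected by the overall distribution of the samples; 2) it shows, for each true label, whether the correct label was chosen (recall); and 3) how many of the labels the model predicted were right or wrong (precision) is not the main concern here.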
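
To make the recall definition concrete, it can be recomputed by hand from the Decision Tree confusion matrix printed earlier. Only the first six rows (true digits 0 to 5) are complete in that output, so this sketch covers just those. The helper `recall_per_class` is a hypothetical name, not a scikit-learn function.

```python
# First six rows of the Decision Tree confusion matrix from the output above.
# Row i holds all test samples whose true label is digit i.
confusion_rows = [
    [33, 0, 0, 0, 2, 0, 0, 0, 1, 0],
    [0, 30, 0, 0, 1, 1, 0, 0, 1, 4],
    [0, 0, 31, 1, 0, 0, 0, 0, 3, 0],
    [0, 0, 0, 30, 1, 0, 0, 2, 2, 2],
    [1, 1, 0, 0, 32, 0, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 31, 0, 1, 0, 1],
]


def recall_per_class(rows):
    """Recall of class i = diagonal entry of row i / total of row i."""
    return [row[i] / sum(row) for i, row in enumerate(rows)]


recalls = recall_per_class(confusion_rows)
for digit, value in enumerate(recalls):
    print(digit, round(value, 2))
```

The rounded values agree with the recall column of the Decision Tree report for digits 0 to 5. Precision would need the column totals, which cannot be recovered from the truncated matrix.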