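## [Assignment Focus]
Use the linear regression models in Sklearn to train on a variety of datasets. Be sure you understand the **data type** of the data fed into the model, and the meaning of each model parameter.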

## Assignment
Try the other datasets in sklearn datasets (wine, boston, ...) and train your own linear regression model.
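
As a point of reference for the regression case, here is a minimal sketch of fitting `LinearRegression` on a dataset with a continuous target. The choice of `load_diabetes`, the 80/20 split, and the seed `random_state=4` are assumptions made for illustration and are not part of the assignment.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# load_diabetes has a continuous target, so it suits a linear regression model
diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target  # X: 2-D float array, y: 1-D float array

# random_state=4 is an arbitrary seed, used only to make the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

print('Coefficients shape:', reg.coef_.shape)  # one weight per feature
print('Mean squared error: {:.2f}'.format(mean_squared_error(y_test, y_pred)))
print('R^2 score: {:.2f}'.format(r2_score(y_test, y_pred)))
```

For regression, error measures such as MSE and R^2 take the place of accuracy, since the predictions are real numbers rather than class labels.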

### HINT: Check the type of the label to determine whether the dataset's target is classification or regression, then train with the correct model!
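
One rough way to act on this hint is to look at the dtype of the target array and the number of distinct values it holds. The helper `describe_target` and its `max_classes` threshold below are hypothetical names introduced for this sketch. The rule is only a heuristic, so it is still worth reading each dataset's `DESCR`.

```python
import numpy as np
from sklearn import datasets

def describe_target(y, max_classes=20):
    """Hypothetical helper: guess the task type from the label array.

    Heuristic only: integer labels with few distinct values are treated
    as classification, everything else as regression.
    """
    y = np.asarray(y)
    n_unique = np.unique(y).size
    is_integer = np.issubdtype(y.dtype, np.integer)
    task = 'classification' if (is_integer and n_unique <= max_classes) else 'regression'
    return task, y.dtype, n_unique

loaders = [
    ('wine', datasets.load_wine),
    ('breast_cancer', datasets.load_breast_cancer),
    ('diabetes', datasets.load_diabetes),
]
for name, loader in loaders:
    task, dtype, n_unique = describe_target(loader().target)
    print('{:<14} dtype={} unique={} -> {}'.format(name, dtype, n_unique, task))
```

With this rule, wine and breast_cancer are flagged as classification and diabetes as regression, which is why the cells below use classifiers on wine instead of `LinearRegression`.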

In [43]:
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn import metrics

In [44]:

wine = datasets.load_wine()
# boston = datasets.load_boston()  # note: load_boston was removed in scikit-learn 1.2
# breast_cancer = datasets.load_breast_cancer()

In [45]:
wine
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

In [46]:
data_X = wine.data
data_Y = wine.target
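# Split the data into train / test (80% / 20%); a held-out test set is used for evaluation instead of k-fold cross-validation
# Both the random forest and the logistic regression are trained on train, and test is used to check performance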
train_X, test_X, train_Y, test_Y = train_test_split(data_X, data_Y, test_size=0.2)
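
The split above is drawn at random on every run, so the reported numbers change between runs. A possible variant, sketched below, fixes the seed and stratifies on the label so that each class keeps roughly the same share in both parts. The seed `random_state=0` and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()

# stratify keeps the class proportions similar in train and test;
# random_state=0 is an arbitrary seed that makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data, wine.target, test_size=0.2,
    stratify=wine.target, random_state=0)

print(X_tr.shape, X_te.shape)  # 142 training rows and 36 test rows, 13 features each
print('train class counts:', np.bincount(y_tr))
print('test class counts: ', np.bincount(y_te))
```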

In [47]:

rf = RandomForestClassifier(n_estimators=20, min_samples_split=10, min_samples_leaf=10, 
                            max_features=8, max_depth=6, bootstrap=True)
lr = LogisticRegression(solver='lbfgs', max_iter=10000, multi_class='multinomial')

rf.fit(train_X, train_Y)
lr.fit(train_X, train_Y)
train_Y

array([0, 0, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 0, 2, 2, 0, 1, 0, 2, 2, 1, 0,
       0, 0, 2, 2, 1, 0, 1, 1, 1, 2, 2, 1, 0, 0, 2, 0, 1, 2, 2, 1, 0, 2,
       1, 1, 0, 1, 0, 2, 1, 0, 0, 2, 1, 2, 1, 2, 2, 2, 1, 1, 1, 0, 0, 0,
       1, 1, 2, 0, 1, 0, 2, 0, 2, 2, 0, 1, 1, 0, 2, 0, 1, 0, 1, 1, 1, 2,
       1, 1, 0, 0, 2, 2, 1, 1, 1, 0, 0, 2, 2, 2, 1, 2, 1, 1, 0, 2, 1, 2,
       2, 0, 1, 0, 1, 0, 1, 0, 2, 2, 1, 1, 1, 0, 0, 0, 0, 1, 2, 2, 0, 1,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 1])
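
To get a feel for what the fitted models have learned, one option is to inspect their learned attributes. The sketch below refits both models on the full wine data purely for illustration. The names `lr_demo` and `rf_demo` and the seed are assumptions, and `multi_class` is left out because the default with `lbfgs` already fits a multinomial model and newer scikit-learn releases deprecate the argument.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

wine = datasets.load_wine()
X, y = wine.data, wine.target

lr_demo = LogisticRegression(solver='lbfgs', max_iter=10000)
rf_demo = RandomForestClassifier(n_estimators=20, min_samples_split=10,
                                 min_samples_leaf=10, max_features=8,
                                 max_depth=6, bootstrap=True, random_state=0)
lr_demo.fit(X, y)
rf_demo.fit(X, y)

# Logistic regression: one row of weights per class, one column per feature
print('classes:', lr_demo.classes_)
print('coef_ shape:', lr_demo.coef_.shape)

# Random forest: impurity-based importances, normalised to sum to 1
ranked = sorted(zip(wine.feature_names, rf_demo.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print('{:<30} {:.3f}'.format(name, importance))
```

In the forest, `n_estimators` sets the number of trees, `max_depth` caps how deep each tree may grow, `min_samples_split` and `min_samples_leaf` stop splits on small nodes, and `max_features` limits how many features are considered at each split.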

In [48]:
# Output the logistic regression results

pred_lr_prob = lr.predict_proba(test_X)
# Pick the most probable class in each row; the column index equals the label because the wine classes are 0, 1, 2
pred_lr = pred_lr_prob.argmax(axis=1)
count_misclassified = (test_Y != pred_lr).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(test_Y, pred_lr)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 2
Accuracy: 0.94
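
Accuracy alone does not show which classes are being confused. A confusion matrix and a per-class report give that breakdown. The sketch below is self-contained and uses its own stratified split with an arbitrary seed, so its numbers will not match the output above.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data, wine.target, test_size=0.2,
    stratify=wine.target, random_state=0)  # arbitrary seed for repeatability

clf = LogisticRegression(solver='lbfgs', max_iter=10000)
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_te, y_hat, labels=[0, 1, 2])
print(cm)

# Precision, recall and F1 for each class
print(metrics.classification_report(y_te, y_hat, labels=[0, 1, 2],
                                    target_names=wine.target_names))
```

The diagonal of the matrix counts correct predictions, so the number of misclassified samples is the total minus the trace.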


In [49]:
# Output the random forest results
pred_rf_prob = rf.predict_proba(test_X)
# Pick the most probable class in each row; the column index equals the label because the wine classes are 0, 1, 2
pred_rf = pred_rf_prob.argmax(axis=1)

count_misclassified = (test_Y != pred_rf).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(test_Y, pred_rf)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 0
Accuracy: 1.00
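
With only 36 test samples, two misclassified samples already move accuracy from 1.00 to 0.94, so a single split gives a noisy comparison between the two models. One way to get a steadier estimate is k-fold cross-validation, sketched below. The 5 folds, the seed, and the dictionary names are assumptions for illustration, not part of the original notebook.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

wine = datasets.load_wine()

# 5 stratified folds; shuffle with a fixed (arbitrary) seed for repeatability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    'logistic regression': LogisticRegression(solver='lbfgs', max_iter=10000),
    'random forest': RandomForestClassifier(n_estimators=20, min_samples_split=10,
                                            min_samples_leaf=10, max_features=8,
                                            max_depth=6, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # Each model is refit on 4 folds and scored on the remaining fold, 5 times
    cv_scores[name] = cross_val_score(model, wine.data, wine.target,
                                      cv=cv, scoring='accuracy')
    print('{:<20} mean={:.3f} std={:.3f}'.format(
        name, cv_scores[name].mean(), cv_scores[name].std()))
```

The spread across folds gives a sense of how much of the gap between the two models could be due to the particular split.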
