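# Making Predictions with Scikit-Learn
### Scikit-Learn provides support in three areas.
1. Getting data: ***sklearn.datasets***
2. Preparing data: ***sklearn.preprocessing*** 
3. Machine learning: ***sklearn Estimator API*** 

There are many ways to obtain data (files, databases, web scraping, Kaggle Datasets, and so on).<br>
The simplest is to import one of the datasets bundled with scikit-learn. Because they are always at hand and need no download, they are commonly called **toy datasets**.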
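As a small illustration of how little code a toy dataset needs, the sketch below loads two of the bundled datasets and prints their shapes. The choice of `load_wine` as a second example is an assumption for illustration; the rest of this notebook only uses Iris.

```python
from sklearn import datasets

# Each load_* function returns a Bunch, a dict-like object shipped with scikit-learn.
iris = datasets.load_iris()
wine = datasets.load_wine()  # second dataset chosen only for illustration

for name, ds in [("iris", iris), ("wine", wine)]:
    # data holds the feature matrix, target holds the class labels
    print(name, ds.data.shape, ds.target.shape)
```

No file path or download step is involved, which is what makes these datasets convenient for practice.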

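# Basic Workflow

* Load the data and pre-process it
* Split into training and test sets
* Fit the model
* Predict
* Evaluate (the score may be an error value, an accuracy, or another metric)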
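The five steps above can be sketched end to end as follows. This is a minimal sketch under assumptions: `LogisticRegression` as the model, accuracy as the metric, and `random_state=0` are all choices made here for illustration, not something the notebook prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Load the data (pre-processing is skipped to keep the sketch short)
X, y = load_iris(return_X_y=True)

# 2. Split into training and test sets (30% held out)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 3. Fit the model on the training set only
model = LogisticRegression(max_iter=1000)  # assumed model, for illustration
model.fit(X_train, y_train)

# 4. Predict labels for the held-out samples
y_pred = model.predict(X_test)

# 5. Evaluate: fraction of test samples classified correctly
acc = accuracy_score(y_test, y_pred)
print(f"accuracy: {acc:.3f}")
```

Every scikit-learn estimator follows the same `fit` / `predict` pattern, so swapping in another classifier only changes step 3.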


In [2]:
%matplotlib inline

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


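## Loading the Iris Dataset and Pre-processing

Iris Flowers dataset

This project uses the Iris Data Set. Each sample has 4 features and 1 class label. The dataset contains 3 classes with 50 samples each, for 150 samples in total.

Attribute information:

    sepal length (cm)
    sepal width (cm)
    petal length (cm)
    petal width (cm)
    class:
        Iris Setosa
        Iris Versicolour
        Iris Virginica

The features are all numeric and share the same unit (centimeters).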

![Iris Flowers](images/iris_data.PNG)
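The counts quoted above (150 samples, 4 features, 50 samples per class) can be checked directly. A small sketch, assuming only NumPy and the bundled dataset:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# data is a (samples, features) matrix
n_samples, n_features = iris.data.shape

# target holds integer labels 0..2, so bincount gives the samples per class
class_counts = np.bincount(iris.target)

print(n_samples, n_features)
print({str(name): int(count) for name, count in zip(iris.target_names, class_counts)})
```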


In [3]:
iris = datasets.load_iris()
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

* Print the keys of iris and the file name
* Look at the first 10 rows
* Check the data type
* Print the labeled class data

In [8]:
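print(iris.keys()) # all keys contained in the dataset

print(iris['filename']) # name of the file the data was loaded from

print(iris.data[0:10]) # first 10 rows of feature data

print(type(iris.data)) # type of the data container

print(iris.target_names) # class names

print(iris.target) # target variable (class labels, already encoded as integers)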

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
iris.csv
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
<class 'numpy.ndarray'>
['setosa' 'versicolor' 'virginica']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [20]:
# we only take the first two features. 
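X = iris.data[:, :2] # all rows (150 samples), but only the first two feature columns
print(X.shape)       # 150 samples, 2 features each

Y = iris.target      
print(Y.shape)       # 150 target labels

# This keeps only the first two iris features (sepal length and sepal width) for later steps; it is feature selection, a simple way to reduce dimensionality.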

(150, 2)
(150,)


In [21]:
# Build a pandas DataFrame (optional; the NumPy arrays can also be used directly)
x = pd.DataFrame(iris.data, columns=iris['feature_names'])
x.head(10)

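| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 |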


In [22]:
iris['target']

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [23]:
# Build the target column
y = pd.DataFrame(iris['target'], columns=['target'])
y.head(5)

| | target |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [24]:
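# Combine the feature columns with the target column
iris_data = pd.concat([x,y], axis=1)    # join features x and labels y side by side (axis=1 concatenates along columns)
iris_data = iris_data[['sepal length (cm)', 'petal length (cm)', 'target']] # keep three columns: sepal length, petal length, target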
iris_data.head(10)

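| | sepal length (cm) | petal length (cm) | target |
|---|---|---|---|
| 0 | 5.1 | 1.4 | 0 |
| 1 | 4.9 | 1.4 | 0 |
| 2 | 4.7 | 1.3 | 0 |
| 3 | 4.6 | 1.5 | 0 |
| 4 | 5.0 | 1.4 | 0 |
| 5 | 5.4 | 1.7 | 0 |
| 6 | 4.6 | 1.4 | 0 |
| 7 | 5.0 | 1.5 | 0 |
| 8 | 4.4 | 1.4 | 0 |
| 9 | 4.9 | 1.5 | 0 |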


In [26]:
# Keep only the samples whose target is 0 or 1
iris_data = iris_data[iris_data['target'].isin([0,1])] # filter rows whose target label is 0 or 1
iris_data
#print(iris['data'].size) # total number of elements in the original data (150×4=600)

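| | sepal length (cm) | petal length (cm) | target |
|---|---|---|---|
| 0 | 5.1 | 1.4 | 0 |
| 1 | 4.9 | 1.4 | 0 |
| 2 | 4.7 | 1.3 | 0 |
| 3 | 4.6 | 1.5 | 0 |
| 4 | 5.0 | 1.4 | 0 |
| ... | ... | ... | ... |
| 95 | 5.7 | 4.2 | 1 |
| 96 | 5.7 | 4.2 | 1 |
| 97 | 6.2 | 4.3 | 1 |
| 98 | 5.1 | 3.0 | 1 |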
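The same two-class table can also be built in fewer steps. The sketch below assumes a scikit-learn version that supports `as_frame=True` (the `frame` key in the output of `iris.keys()` suggests this one does); the variable names are illustrative.

```python
from sklearn.datasets import load_iris

# as_frame=True makes the Bunch carry a ready-made DataFrame:
# the four feature columns plus a 'target' column
df = load_iris(as_frame=True).frame

# Select rows with target 0 or 1 and the three columns of interest in one step
subset = df.loc[
    df["target"].isin([0, 1]),
    ["sepal length (cm)", "petal length (cm)", "target"],
]
print(subset.shape)
```

This avoids building `x` and `y` separately and concatenating them, at the cost of hiding the intermediate steps shown above.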


## Splitting into Training and Test Sets
> train_test_split()

In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(iris_data[['sepal length (cm)', 'petal length (cm)']], iris_data['target'], test_size=0.3)

In [33]:
X_train.head()

| | sepal length (cm) | petal length (cm) |
|---|---|---|
| 38 | 4.4 | 1.3 |
| 3 | 4.6 | 1.5 |
| 33 | 5.5 | 1.4 |
| 55 | 5.7 | 4.5 |
| 5 | 5.4 | 1.7 |


In [34]:
X_test.head()

| | sepal length (cm) | petal length (cm) |
|---|---|---|
| 42 | 4.4 | 1.3 |
| 1 | 4.9 | 1.4 |
| 15 | 5.7 | 1.5 |
| 10 | 5.4 | 1.5 |
| 95 | 5.7 | 4.2 |


In [35]:
X_test.shape

(30, 2)

In [36]:
Y_train.head()

38    0
3     0
33    0
55    1
5     0
Name: target, dtype: int32

In [37]:
Y_test.head()

42    0
1     0
15    0
10    0
95    1
Name: target, dtype: int32
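The split above is random, so the rows shown will differ from run to run. A hedged variant that is reproducible and keeps the class ratio equal in both parts is sketched below; `random_state=42` and `stratify` are additions made here for illustration, not settings used in the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Rebuild the two-class, two-feature table so the sketch is self-contained
df = load_iris(as_frame=True).frame
df = df[df["target"].isin([0, 1])]
features = df[["sepal length (cm)", "petal length (cm)"]]
labels = df["target"]

# random_state fixes the shuffle; stratify keeps the 50/50 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().to_dict())
```

With 100 samples and `test_size=0.3`, the test set has 30 rows, matching the `(30, 2)` shape printed earlier.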

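# Appendix 

>Normalization and standardization are closely related.<br>
Both pre-process the data so that values fall into a common numeric range, so that no feature is treated differently during modeling merely because of its scale.<br> 
* Normalization usually rescales the data to a chosen range, typically [0, 1], which removes the effect of the features' units on the model.<br> 
* Standardization usually means rescaling the data to have mean 0 and variance 1 (it shifts and scales the values but does not change the shape of their distribution).<br> 

Normalization and standardization are therefore operations on the data that remove the feature-importance bias caused by differences in magnitude.<br>
Scaled data can also speed up training and help the algorithm converge.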
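To make the difference concrete, here is a tiny worked example on five made-up values (the array is hypothetical, chosen so the arithmetic is easy to follow):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up values for illustration

# Min-max normalization: map the smallest value to 0 and the largest to 1
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation
# np.std uses the population formula (ddof=0) by default
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)
print(np.round(x_zscore, 3))
```

The normalized values are bounded by 0 and 1, while the standardized values are centered on 0 and unbounded.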

### Standardization (z-score)
    Compute the mean and standard deviation on the training set so that the same transformation can later be reapplied to the test set.

In [72]:
def norm_stats(dfs):  # compute per-column statistics (min, max, mean, standard deviation)
    minimum = np.min(dfs, axis=0)  # column-wise
    maximum = np.max(dfs, axis=0)
    mu = np.mean(dfs, axis=0)
    sigma = np.std(dfs, axis=0)
    return (minimum, maximum, mu, sigma)

def z_score(col, stats): # z-score standardization: (x - μ) / σ
    m, M, mu, s = stats
    df = pd.DataFrame()
    for c in col.columns:
        df[c] = (col[c] - mu[c]) / s[c]  # index the statistics by column label, not by position
    return df

In [73]:
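stats = norm_stats(X_train)  # compute the statistics on the training set
arr_x_train = np.array(z_score(X_train, stats)) # standardize X_train with those statistics and convert to a NumPy array
arr_x_train[:10]
#arr_y_train = np.array(Y_train) # convert Y_train to a NumPy array as well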



array([[-1.7961043 , -1.2418131 ],
       [-1.47941085, -1.0998916 ],
       [-0.05429031, -1.17085235],
       [ 0.26240315,  1.02893085],
       [-0.21263703, -0.95797011],
       [-1.32106412, -1.02893085],
       [-1.00437067,  0.17740187],
       [-1.63775758, -1.2418131 ],
       [-0.84602394, -1.0998916 ],
       [ 0.10405642,  0.60316636]])
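The column loop can also be written as a single vectorized expression, since pandas aligns the statistics with the columns by label. The sketch below uses a small made-up DataFrame (`demo` is hypothetical) and checks the result against `StandardScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

mu = demo.mean(axis=0)
# pandas defaults to the sample standard deviation (ddof=1);
# ddof=0 matches np.std and StandardScaler
sigma = demo.std(axis=0, ddof=0)

# Subtraction and division are aligned on column labels
z = (demo - mu) / sigma

matches = np.allclose(z.to_numpy(), StandardScaler().fit_transform(demo))
print(matches)
```

The `ddof` argument is the detail most likely to cause a mismatch: calling `demo.std()` without it gives slightly different values.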

## Using scikit-learn

In [74]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler().fit(X_train)  #Compute the statistics to be used for later scaling.
print(sc.mean_)  #mean
print(sc.scale_) #standard deviation

[5.53428571 3.05      ]
[0.63152553 1.40922978]


In [75]:
#transform: (x-u)/std.
X_train_std = sc.transform(X_train)
X_train_std[:5]

array([[-1.7961043 , -1.2418131 ],
       [-1.47941085, -1.0998916 ],
       [-0.05429031, -1.17085235],
       [ 0.26240315,  1.02893085],
       [-0.21263703, -0.95797011]])

The scaler instance can then be used on new data to transform it the same way it did on the training set:

In [76]:
X_test_std = sc.transform(X_test)
print(X_test_std[:10])

[[-1.7961043  -1.2418131 ]
 [-1.00437067 -1.17085235]
 [ 0.26240315 -1.0998916 ]
 [-0.21263703 -1.0998916 ]
 [ 0.26240315  0.81604861]
 [-1.16271739 -1.17085235]
 [-1.16271739 -1.02893085]
 [ 0.10405642  1.02893085]
 [-1.47941085 -1.17085235]
 [ 0.89579006  1.17085235]]


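You can also use the fit_transform method (i.e., fit and then transform). Call it on the training set only; the test set must still go through transform so that it is scaled with the training statistics:

In [77]:
X_train_std = sc.fit_transform(X_train)  # fit on the training set, then transform it
X_test_std = sc.transform(X_test)        # reuse the training statistics; do not refit on the test set
print(X_test_std[:10])


[[-1.7961043  -1.2418131 ]
 [-1.00437067 -1.17085235]
 [ 0.26240315 -1.0998916 ]
 [-0.21263703 -1.0998916 ]
 [ 0.26240315  0.81604861]
 [-1.16271739 -1.17085235]
 [-1.16271739 -1.02893085]
 [ 0.10405642  1.02893085]
 [-1.47941085 -1.17085235]
 [ 0.89579006  1.17085235]]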


In [78]:
print('mean of X_train_std:',np.round(X_train_std.mean(),4))
print('std of X_train_std:',X_train_std.std())

mean of X_train_std: -0.0
std of X_train_std: 0.9999999999999999
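One way to make sure the scaling statistics always come from the training set is to chain the scaler and a model in a pipeline. This is a sketch under assumptions: `LogisticRegression` and the full four-feature Iris data are used here only for illustration and are not part of the cells above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# fit() fits the scaler on X_train, transforms X_train, then fits the classifier
pipe.fit(X_train, y_train)

# score() transforms X_test with the training statistics before predicting
test_accuracy = pipe.score(X_test, y_test)

scaler = pipe.named_steps["standardscaler"]
print(scaler.mean_)          # equals the column means of X_train
print(round(test_accuracy, 3))
```

Because the pipeline calls `transform` (never `fit`) on the test data, the test set cannot leak into the scaling statistics.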


## Min-Max Normalization
    Transforms features by scaling each feature to a given range.
    The transformation is given by:

    X' = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X -> N-dimensional data
    


In [79]:
x1 = np.random.normal(50, 6, 100)  # np.random.normal(mu, sigma, size)
y1 = np.random.normal(5, 0.5, 100)

x2 = np.random.normal(30,6,100)
y2 = np.random.normal(4,0.5,100)
plt.scatter(x1,y1,c='b',marker='s',s=20,alpha=0.8)
plt.scatter(x2,y2,c='r', marker='^', s=20, alpha=0.8)

print(np.sum(x1)/len(x1))
print(np.sum(x2)/len(x2))

49.0467449721713
30.42314155948693


In [80]:
x_val = np.concatenate((x1,x2))
y_val = np.concatenate((y1,y2))

x_val.shape

(200,)

In [81]:
def minmax_norm(X):
    return (X - X.min(axis=0)) / ((X.max(axis=0) - X.min(axis=0)))

In [82]:
minmax_norm(x_val[:10])  # min and max are taken from these 10 values only, not from the full array

array([0.65981303, 0.        , 0.8657218 , 0.75433542, 0.61383663,
       0.53300637, 0.33397844, 0.59001864, 0.47947398, 1.        ])

In [84]:
from sklearn.preprocessing import MinMaxScaler
x_val=x_val.reshape(-1, 1) #1D->2D
scaler = MinMaxScaler().fit(x_val)  # default range 0~1
print(scaler.data_max_)
print(scaler.data_min_)
print(scaler.transform(x_val)[:10])

[64.54385319]
[16.33755385]
[[0.76051364]
 [0.50425846]
 [0.84048355]
 [0.79722382]
 [0.74265754]
 [0.71126505]
 [0.63396749]
 [0.73340721]
 [0.69047438]
 [0.89263391]]
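The manual result and the `MinMaxScaler` result above differ because `minmax_norm(x_val[:10])` takes its minimum and maximum from ten values, while the scaler was fitted on all 200. When both see the same data they agree, as the sketch below suggests. The random data and the `feature_range=(-1, 1)` variant are assumptions added for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
values = rng.normal(40, 10, size=(200, 1))  # made-up 2D column of data

# Manual min-max over the full column
manual = (values - values.min(axis=0)) / (values.max(axis=0) - values.min(axis=0))

# MinMaxScaler with the default range [0, 1]
scaled = MinMaxScaler().fit_transform(values)
same = np.allclose(manual, scaled)
print(same)

# A different target range is set through feature_range
scaled_sym = MinMaxScaler(feature_range=(-1, 1)).fit_transform(values)
print(scaled_sym.min(), scaled_sym.max())
```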
