# Scikit-Learn Quick Start

## 郭耀仁

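## What Is Scikit-Learn

- A Python package for data mining and machine learning
- Built on top of NumPy, SciPy, and Matplotlib
- Six main functional areas:
    - Preprocessing
    - Dimensionality reduction
    - Regression
    - Clustering
    - Classification
    - Model selection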
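
As a rough orientation, the six areas above can be tied to scikit-learn submodules as sketched below. The picks are illustrative assumptions rather than an official mapping: classification and regression estimators, for instance, are spread over several modules such as `linear_model`, `svm`, and `tree`.

```python
from sklearn import (
    preprocessing,     # preprocessing: scaling, encoding
    decomposition,     # dimensionality reduction: e.g. PCA
    linear_model,      # regression and linear classifiers
    cluster,           # clustering: e.g. KMeans
    neighbors,         # classification: e.g. KNeighborsClassifier
    model_selection,   # model selection: splitting, cross-validation
)

# One representative class or function per area (illustrative picks only)
examples = {
    "preprocessing": preprocessing.StandardScaler,
    "dimensionality reduction": decomposition.PCA,
    "regression": linear_model.LinearRegression,
    "clustering": cluster.KMeans,
    "classification": neighbors.KNeighborsClassifier,
    "model selection": model_selection.train_test_split,
}
for area, obj in examples.items():
    # obj.__module__ looks like "sklearn.<submodule>...."
    print(area, "->", obj.__module__.split(".")[1], obj.__name__)
```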

## What Is Scikit-Learn (2)

- [Other Scikits](https://scikits.appspot.com/scikits)
- [Scikit-Learn machine learning map](http://scikit-learn.org/stable/tutorial/machine_learning_map/)

## Built-in Data

- Toy datasets: <http://scikit-learn.org/stable/datasets/index.html#toy-datasets>
- The `datasets` module

```python
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target
print(X.shape)
print(y.shape)
print(X[0:5, :])
print(y[0:5])
print(np.unique(y))
```
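
To see which measurements the column selection `[0, 2]` refers to, the object returned by `load_iris()` can be inspected. This is a minimal sketch; the variable `selected` is introduced here for illustration only.

```python
from sklearn import datasets

iris = datasets.load_iris()

# load_iris() returns a dictionary-like Bunch object
print(iris.feature_names)   # names of the four columns in iris.data
print(iris.target_names)    # names of the three classes in iris.target

# Columns 0 and 2 are the two features used throughout these slides
selected = [iris.feature_names[i] for i in [0, 2]]
print(selected)
```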

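## Splitting Training and Test Data

- The `train_test_split()` function in the `model_selection` module (the older `cross_validation` module was removed in scikit-learn 0.20)
- The `test_size` parameter sets the proportion of data held out for testing
- The `random_state` parameter sets the random seed

```python
from sklearn import datasets, model_selection

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target
# Hold out 30% of the samples for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.3, random_state = 87)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```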
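
The effect of `test_size` and `random_state` can be checked directly. The sketch below assumes a scikit-learn version in which `train_test_split()` lives in `model_selection`; the names `split_a`, `split_b`, and `same` are introduced for illustration.

```python
from sklearn import datasets, model_selection
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target

# Splitting twice with the same random_state should give identical splits
split_a = model_selection.train_test_split(X, y, test_size=0.3, random_state=87)
split_b = model_selection.train_test_split(X, y, test_size=0.3, random_state=87)
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)

# With 150 samples and test_size=0.3, 45 samples go to the test set
X_train, X_test, y_train, y_test = split_a
print(len(X_train), len(X_test))
```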

## Standardization

- The `transform()` method of the `preprocessing.StandardScaler` class

```python
from sklearn import datasets, model_selection, preprocessing

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.3, random_state = 87)
sc = preprocessing.StandardScaler()
# Estimate the mean and standard deviation from the training data only
sc.fit(X_train)
# Apply the same training statistics to both sets
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
print(X_train_std[0:5, :])
print(X_test_std[0:5, :])
```
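
What `fit()` and `transform()` compute can be reproduced by hand with NumPy. The sketch below uses small made-up arrays rather than the iris data, and the names `manual_train` and `manual_test` are introduced for illustration.

```python
import numpy as np
from sklearn import preprocessing

# Small made-up arrays, for illustration only
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_test = np.array([[2.0, 30.0]])

sc = preprocessing.StandardScaler().fit(X_train)

# Per-column mean and population standard deviation of the training data
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # ddof=0, which is what StandardScaler uses

manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std  # test data reuses the training statistics

print(np.allclose(manual_train, sc.transform(X_train)))
print(np.allclose(manual_test, sc.transform(X_test)))
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.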

## Training and Prediction (Perceptron Model)

- The `Perceptron` class in the `linear_model` module

```python
from sklearn import datasets, model_selection, preprocessing, linear_model

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.3, random_state = 87)
sc = preprocessing.StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
# max_iter replaces the removed n_iter parameter; tol = None disables early stopping so all 40 epochs run
ppn = linear_model.Perceptron(max_iter = 40, tol = None, eta0 = 0.1, random_state = 87)
ppn.fit(X_train_std, y_train)
y_pred = ppn.predict(X_test_std)
# Count the test samples whose predicted label differs from the true label
mis_classified = (y_test != y_pred).sum()
print("Misclassified samples: %d" %mis_classified)
error_rate = mis_classified/len(y_pred)
print("Accuracy: %f" %(1 - error_rate))
```

## Model Evaluation

- The `accuracy_score()` function in the `metrics` module

```python
from sklearn import datasets, model_selection, preprocessing, linear_model, metrics

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.3, random_state = 87)
sc = preprocessing.StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
# max_iter replaces the removed n_iter parameter; tol = None disables early stopping so all 40 epochs run
ppn = linear_model.Perceptron(max_iter = 40, tol = None, eta0 = 0.1, random_state = 87)
ppn.fit(X_train_std, y_train)
y_pred = ppn.predict(X_test_std)
# Fraction of test samples that were classified correctly
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: %0.2f" %accuracy)
```
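
`accuracy_score()` returns the same number as the manual calculation used in the perceptron section (one minus the misclassification rate). The sketch below checks this on made-up label arrays, not on real model output.

```python
import numpy as np
from sklearn import metrics

# Made-up labels, for illustration only
y_test = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])

# Manual calculation: count mismatches, then convert to an accuracy
mis_classified = (y_test != y_pred).sum()
manual_accuracy = 1 - mis_classified / len(y_pred)

# Library calculation
accuracy = metrics.accuracy_score(y_test, y_pred)

print(mis_classified)
print(manual_accuracy, accuracy)
```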

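## Visualization

```python
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, model_selection, preprocessing, linear_model

def plot_decision_regions(X, y, classifier, resolution = 0.02):
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    # One color per class: slice the tuple rather than indexing a single element
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # Plot the decision surface on a fine grid (resolution is the grid step)
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution), np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha = 0.4, cmap = cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    # Plot the samples of each class with its own color and marker
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x = X[y == cl, 0], y = X[y == cl, 1], alpha = 0.8, color = cmap(idx), marker = markers[idx], label = cl)

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.3, random_state = 87)
sc = preprocessing.StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
# max_iter replaces the removed n_iter parameter; tol = None disables early stopping so all 40 epochs run
ppn = linear_model.Perceptron(max_iter = 40, tol = None, eta0 = 0.1, random_state = 87)
ppn.fit(X_train_std, y_train)
# The classifier was trained on standardized features, so plot standardized data
X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))
plot_decision_regions(X = X_combined_std, y = y_combined, classifier = ppn)
plt.legend(loc = 'upper left')
plt.show()
```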