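# Machine Learning with scikit-learn

## What is machine learning?

### Machine learning rests on generalization: finding unknown patterns from known example data.

#### Machine learning systems are usually divided into two kinds, depending on whether learning is supervised by humans.

**Supervised learning** learns from experience data made of labeled input/output pairs in order to predict the output for new inputs. In other words, it learns from examples that come with the correct answer.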

![Spam filter](images/01_spam_filter.png)

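The most common supervised machine learning tasks are **classification and regression**.

In a classification task, the program must learn to predict the value of a target variable from one or more explanatory variables. That is, it must predict the class, category, or label of a new observation.
In a regression problem, the program must predict the value of a continuous variable.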
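
The difference between the two tasks shows up in the target variable. The sketch below is only an illustration: the toy data is invented, and `KNeighborsRegressor` is scikit-learn's regression counterpart of the K-nearest neighbors classifier, chosen here for symmetry rather than taken from this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy data (invented for illustration): one feature, six observations.
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Classification: the target is a discrete label.
y_labels = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_labels)
label_pred = clf.predict([[2.5]])

# Regression: the target is a continuous number.
y_values = np.array([1.1, 1.9, 3.2, 9.8, 11.1, 12.3])
reg = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_values)
value_pred = reg.predict([[2.5]])

print(label_pred, value_pred)
```

The classifier returns one of the labels seen in training, while the regressor returns the average target value of the three nearest observations.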



In **unsupervised learning**, the program does not learn from labeled data. Instead, it has to discover patterns in the data on its own.

![Clustering](images/01_clustering.png)

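A common unsupervised machine learning task is to discover groups of related observations within the training data, called clusters. Without supervision, the system can only group observations according to some similarity measure.
Dimensionality reduction is another common unsupervised learning task.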
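
As a rough sketch of both tasks, the code below runs them on scikit-learn's built-in iris measurements while ignoring the labels. `KMeans` and `PCA` are our choices for illustration, not methods named in this notebook, and the number of clusters (3) and of output dimensions (2) are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Use only the measurements; the species labels are deliberately ignored.
X_iris = load_iris().data

# Clustering: assign each observation to one of 3 clusters (3 is our choice).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_iris)

# Dimensionality reduction: project the 4 features onto 2 components.
X_2d = PCA(n_components=2).fit_transform(X_iris)

print(cluster_ids.shape, X_2d.shape)
```

The cluster ids are arbitrary integers: nothing ties cluster 0 to a particular species, because the algorithm never sees the labels.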

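## How does machine learning work?

Steps in supervised learning:

1. First, train a machine learning model using "labeled data"
    - "Labeled data" is data that comes with the correct answer
    - The "machine learning model" (a set of rules) learns the relationship between the attributes of the data and its outcome

2. Then make predictions on new data, assigning it a class label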

<img src="images/supervised_workflow.svg" width="50%">

Steps in unsupervised learning:
<img src="images/unsupervised_workflow.svg" width="50%">

## Prerequisites

- Python
- numpy
- scipy
- matplotlib
- scikit-learn
- Jupyter
- seaborn

Use Miniconda to install and manage these packages:

$ conda install numpy scipy matplotlib scikit-learn jupyter seaborn

## Loading the data: Iris

![Iris](images/03_iris.png)

- 3 species of iris, with 50 samples each (150 samples total)
- Dimensions: sepal length, sepal width, petal length, petal width

## Machine learning on the iris dataset

- Framed as a **supervised learning** problem: Predict the species of an iris using the measurements
- Famous dataset for machine learning because prediction is **easy**
- Learn more about the iris dataset: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)

In [None]:
from IPython.display import HTML
HTML('<iframe src=http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data width=300 height=200></iframe>')

## Loading the iris dataset into scikit-learn

In [None]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

In [None]:
# save "bunch" object containing iris dataset and its attributes
iris = load_iris()
print(iris.data)

In [None]:
print(iris.target)

## Machine learning terminology

- Each row is an **observation** (also known as: sample, example, instance, record)

- Each column is a **feature** (also known as: predictor, attribute, independent variable, input, regressor, covariate)

In [None]:
# print the names of the four features
print(iris.feature_names)

In [None]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)

- 150 **observations**
- 4 **features** (sepal length, sepal width, petal length, petal width)
- **Response** variable is the iris species
- **Classification** problem since response is categorical
- More information in the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)
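
The counts listed above can be checked directly on the loaded dataset. This is a small sketch; `samples_per_species` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Shape of the feature matrix: (number of observations, number of features).
print(iris.data.shape)

# Count how many observations carry each response value (0, 1, 2).
samples_per_species = np.bincount(iris.target)
for name, count in zip(iris.target_names, samples_per_species):
    print(name, count)
```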

## K-nearest neighbors (KNN) classification

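1. Pick a value for K.
2. Search the training data for the K observations whose measurements are nearest to those of the unknown iris.
3. Use the most frequent response value among those K nearest observations as the predicted response for the unknown iris.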
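
The three steps can be sketched from scratch with NumPy. This is an illustration under assumptions, not scikit-learn's implementation: it measures "nearest" with Euclidean distance, breaks ties by whichever label was counted first, and the helper name `knn_predict` and the measurements in `x_unknown` are made up for the example.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris


def knn_predict(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distance from x_new to every training observation,
    # then the indices of the k smallest distances.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    # Step 3: the most frequent label among those k observations.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


iris = load_iris()
x_unknown = np.array([5.0, 3.5, 1.4, 0.2])  # made-up measurements
predicted = knn_predict(iris.data, iris.target, x_unknown, k=5)  # Step 1: K = 5
print(iris.target_names[predicted])
```

Note that no real "training" happens here: the model is just the stored training data, and all the work is done at prediction time.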

### Example training data

![Training data](images/04_knn_dataset.png)

### KNN classification map (K=1)

![1NN classification map](images/04_1nn_map.png)

### KNN classification map (K=5)

![5NN classification map](images/04_5nn_map.png)

In [None]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

# save "bunch" object containing iris dataset and its attributes
iris = load_iris()

# store feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target

# print the shapes of X and y
print(X.shape)
print(y.shape)

## scikit-learn 4-step modeling pattern

**Step 1:** Import the class you plan to use

In [None]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** "Instantiate" the "estimator"

- "Estimator" is scikit-learn's term for model
- "Instantiate" means "make an instance of"

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

- Name of the object does not matter
- Can specify tuning parameters (aka "hyperparameters") during this step
- All parameters not specified are set to their defaults
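
To see which values an estimator ended up with, you can inspect it with `get_params()`. The short sketch below shows the parameter we set next to one that was left at its default.

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)

# get_params() returns a dict of every constructor parameter and its value.
params = knn.get_params()
print(params["n_neighbors"])  # the value we specified
print(params["weights"])      # not specified, so left at its default
```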

**Step 3:** Fit the model with data (aka "model training")

- Model is learning the relationship between X and y
- Occurs in-place

In [None]:
knn.fit(X, y)

**Step 4:** Predict the response for a new observation

- New observations are called "out-of-sample" data
- Uses the information it learned during the model training process

In [None]:
# predict expects a 2D array: one row per observation
knn.predict([[3, 5, 4, 2]])

In [None]:
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)
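
`predict` returns the integer encoding of the species. As a small sketch, the codes can be turned back into names by indexing `iris.target_names`. The block refits the model so that it runs on its own, and `codes` and `species` are names introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1).fit(iris.data, iris.target)

X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
codes = knn.predict(X_new)          # integer codes, one per row of X_new
species = iris.target_names[codes]  # NumPy indexing maps codes to names
print(codes, species)
```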

## Using a different value for K

In [None]:
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X, y)

# predict the response for new observations
knn.predict(X_new)

## Using a different classification model

In [None]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X, y)

# predict the response for new observations
logreg.predict(X_new)

## Quick Application: Optical Character Recognition

To demonstrate the above principles on a more interesting problem, let's consider OCR (Optical Character Recognition) – that is, recognizing hand-written digits.
In the wild, this problem involves both locating and identifying characters in an image. Here we'll take a shortcut and use scikit-learn's set of pre-formatted digits, which is built into the library.

### Loading and visualizing the digits data

We'll use scikit-learn's data access interface and take a look at this data:

In [None]:
from sklearn import datasets
digits = datasets.load_digits()
digits.images.shape

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# use seaborn for plot defaults
# this can be safely commented out
import seaborn; seaborn.set()

In [None]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
    ax.set_xticks([])
    ax.set_yticks([])

In [None]:
# The images themselves
print(digits.images.shape)
print(digits.images[0])

In [None]:
X = digits.data
y = digits.target
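
`digits.data` holds the same pixels as `digits.images`, with each 8x8 image flattened into a row of 64 features. That is why it can serve as the feature matrix. A quick sketch to confirm this (the names `flattened` and `same` are introduced here):

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Flatten every 8x8 image into one row of 64 pixel values.
n_samples = len(digits.images)
flattened = digits.images.reshape(n_samples, -1)

print(digits.images.shape, digits.data.shape)

# Compare the flattened images with the ready-made feature matrix.
same = np.array_equal(flattened, digits.data)
print(same)
```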

In [None]:
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X, y)

# predict the response for new observations
knn.predict(digits.data[5:6])  # slicing keeps the 2D shape that predict expects

In [None]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100)
clf.fit(X,y)
clf.predict(digits.data[5:6])  # slicing keeps the 2D shape that predict expects
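
Predicting a sample that the model was trained on says little about how it handles new digits. One rough way to check, sketched below, is to fit on part of the data and compare predictions with the known labels of the rest. The split point `n_train = 1500` is an arbitrary choice for this sketch, not something prescribed by the notebook.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
n_train = 1500  # arbitrary split point chosen for this sketch

# Fit only on the first n_train samples.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(digits.data[:n_train], digits.target[:n_train])

# Predict the remaining samples and compare with their true labels.
predicted = knn.predict(digits.data[n_train:])
accuracy = np.mean(predicted == digits.target[n_train:])
print(accuracy)
```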

## Flowchart: how to choose a model

<img src="http://scikit-learn.org/dev/_static/ml_map.png" width="100%">

## Resources
- http://scikit-learn.org/
- Book: [Mastering Machine Learning with scikit-learn](https://www.packtpub.com/big-data-and-business-intelligence/mastering-machine-learning-scikit-learn)

## Questions?