# The scikit-learn package

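- scikit-learn (imported as `sklearn`) is a Python package built specifically for machine learning.
- It covers data preprocessing, building many kinds of models, and computing evaluation metrics.
- sklearn mainly expects data as `numpy.ndarray`, but pandas DataFrames are compatible with NumPy, so they can be passed in as well.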
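
As a small check of the last point, the sketch below fits the same model once on a NumPy array and once on a DataFrame holding the same values. The toy data and the column names `a` and `b` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (made up for illustration): y = 2*a + 3*b + 1
X_array = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 5.0]])
y = 2 * X_array[:, 0] + 3 * X_array[:, 1] + 1

# The same values wrapped in a DataFrame
X_frame = pd.DataFrame(X_array, columns=["a", "b"])

model_from_array = LinearRegression().fit(X_array, y)
model_from_frame = LinearRegression().fit(X_frame, y)

# Both fits should recover the same coefficients
same_coefficients = np.allclose(model_from_array.coef_, model_from_frame.coef_)
print(same_coefficients)
```

Either input type works; a DataFrame has the added benefit that the fitted model remembers the column names.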

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Load Data
https://scikit-learn.org/stable/datasets/index.html

In [None]:
from sklearn import datasets

In [None]:
diabetes = datasets.load_diabetes()

In [None]:
diabetes

In [None]:
diabetes.keys()

In [None]:
data_df = pd.DataFrame(data = diabetes.data, columns = diabetes.feature_names)
data_df['target'] = diabetes.target

In [None]:
data_df.head()

In [None]:
print(diabetes.DESCR)

---

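In this example we want to predict how far a patient's disease has progressed one year later. The value to predict is a continuous number, so this is called a "regression" problem.
In other situations we want to predict which category each record belongs to; that kind of problem is called a "classification" problem.

The two kinds of problems call for somewhat different models, and using the wrong kind can give very poor results.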
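
One quick way to see the difference is to ask scikit-learn how it interprets a target. The sketch below uses `sklearn.utils.multiclass.type_of_target` on two made-up targets; the variable names and values are illustrative only.

```python
from sklearn.utils.multiclass import type_of_target

# Made-up targets for illustration
progression_scores = [151.5, 75.2, 141.8, 206.3]  # continuous quantities
class_labels = [0, 1, 0, 0]                       # two discrete classes

regression_kind = type_of_target(progression_scores)
classification_kind = type_of_target(class_labels)

print(regression_kind)
print(classification_kind)
```

The helper only looks at the values, so a target made of whole numbers is reported as a class type even when the numbers are meant as quantities. Treat it as a hint, not a decision.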

---

## Preprocess

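- Handling missing values (not needed for this dataset)
- Standardization (not needed for this dataset)
- Splitting into training and validation sets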
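
The first two steps can be skipped here because the dataset ships already cleaned and scaled. A minimal sketch of how one might verify that, assuming the default `load_diabetes()` output (the name `features` is introduced here for illustration):

```python
import numpy as np
import pandas as pd
from sklearn import datasets

diabetes = datasets.load_diabetes()
features = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Count missing values across the whole table
missing_count = int(features.isna().sum().sum())

# Per-column mean and sum of squares; the dataset description says the
# features are mean centered and scaled so each column's squares sum to 1
column_means = features.mean()
column_sum_of_squares = (features ** 2).sum()

print(missing_count)
print(column_means.abs().max())
print(column_sum_of_squares.round(3).tolist())
```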


In [None]:
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(data_df.drop('target', axis = 1), data_df.target)

In [None]:
print(train_x.shape, train_y.shape)
print(test_x.shape, test_y.shape)
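
The split above relies on the defaults, so 25% of the rows go to the test set and the rows chosen differ on every run. The sketch below shows one way to make the split repeatable; `random_state=0` is an arbitrary seed chosen for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

features, target = datasets.load_diabetes(return_X_y=True)

# test_size=0.25 matches the default; random_state=0 is an arbitrary seed
split_a = train_test_split(features, target, test_size=0.25, random_state=0)
split_b = train_test_split(features, target, test_size=0.25, random_state=0)

train_features, test_features, train_target, test_target = split_a

print(train_features.shape, test_features.shape)
# The same seed selects the same rows for the test set
print((split_a[1] == split_b[1]).all())
```

Fixing the seed makes scores comparable between runs, which helps when comparing models later.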

## Build model

sklearn offers many machine learning models to choose from. For this exercise we start with the most basic one, a simple linear model: linear regression.

In [None]:
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()

linear_model.fit(train_x, train_y)
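
After fitting, the learned parameters are stored in `coef_` and `intercept_`. The sketch below refits on the full dataset (no train/test split, for brevity) and checks that a prediction is just the intercept plus the weighted sum of the features.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

features, target = datasets.load_diabetes(return_X_y=True)
model = LinearRegression().fit(features, target)

# A linear model predicts intercept + sum(coefficient * feature value)
manual_prediction = features @ model.coef_ + model.intercept_
library_prediction = model.predict(features)

print(model.coef_.shape)
print(np.allclose(manual_prediction, library_prediction))
```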

## Evaluation

In [None]:
print('training score : {:.3f}'.format(linear_model.score(train_x, train_y)))
print('testing score : {:.3f}'.format(linear_model.score(test_x, test_y)))
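
For regressors, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. As a sanity check, the sketch below computes it by hand on the full dataset and compares it with `score` and `r2_score`.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

features, target = datasets.load_diabetes(return_X_y=True)
model = LinearRegression().fit(features, target)
prediction = model.predict(features)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
residual_sum_of_squares = np.sum((target - prediction) ** 2)
total_sum_of_squares = np.sum((target - target.mean()) ** 2)
manual_r2 = 1 - residual_sum_of_squares / total_sum_of_squares

print(np.isclose(manual_r2, model.score(features, target)))
print(np.isclose(manual_r2, r2_score(target, prediction)))
```

A score of 1 means perfect predictions, and 0 means no better than always predicting the mean of the target.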

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# predict target with test data
test_prediction = linear_model.predict(test_x)

# sklearn metrics take (y_true, y_pred) in that order
print('RMSE of testing set {:.2f}'.format(mean_squared_error(test_y, test_prediction)**0.5))
print('MAE of testing set {:.2f}'.format(mean_absolute_error(test_y, test_prediction)))
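
To make the two metrics concrete, the sketch below computes RMSE and MAE by hand with NumPy and compares them with the sklearn functions. The four values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration
true_values = np.array([100.0, 150.0, 200.0, 250.0])
predicted_values = np.array([110.0, 140.0, 230.0, 250.0])

errors = predicted_values - true_values

# RMSE: square the errors, average them, then take the square root
manual_rmse = np.sqrt(np.mean(errors ** 2))
# MAE: average of the absolute errors
manual_mae = np.mean(np.abs(errors))

library_rmse = mean_squared_error(true_values, predicted_values) ** 0.5
library_mae = mean_absolute_error(true_values, predicted_values)

print(np.isclose(manual_rmse, library_rmse))
print(np.isclose(manual_mae, library_mae))
```

Because errors are squared before averaging, RMSE penalizes the single large miss (30) more heavily than MAE does.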

In [None]:
plt.scatter(test_y, test_prediction)
plt.xlabel('true Y')
plt.ylabel('model prediction')
plt.show()

---

## Exercise

Make the same kind of prediction with the boston dataset. The training and test data have already been split for you.

In [None]:
import pandas as pd

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

boston = load_boston()

In [None]:
boston_data = pd.DataFrame(boston.data, columns = boston.feature_names)
boston_y = boston.target

train_x, test_x, train_y, test_y = train_test_split(boston_data, boston_y, shuffle = True, test_size = 0.2, random_state = 1)

In [None]:
# your code starts here


---

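### Supervised Learning 1.0

After this example and exercise, you should be able to:

- Understand the basics of how to use the functions sklearn provides

- Build and validate a simple machine learning model when the data needs little extra processing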