# The scikit-learn package

- scikit-learn (imported as `sklearn`) is a Python package for machine learning
- included functionality: data preprocessing, model building, metric computation
- accepted input types: numpy.ndarray and pandas.DataFrame
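
As a small illustration of the last point, the sketch below fits the same model once from a NumPy array and once from a DataFrame. The data values and the column names `x1` and `x2` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data for illustration: y = 2 * x1 + 3 * x2
X_array = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = 2 * X_array[:, 0] + 3 * X_array[:, 1]

# The same values wrapped in a DataFrame (column names are hypothetical)
X_frame = pd.DataFrame(X_array, columns=['x1', 'x2'])

# Fit one model per input type
model_from_array = LinearRegression().fit(X_array, y)
model_from_frame = LinearRegression().fit(X_frame, y)

print(model_from_array.coef_)
print(model_from_frame.coef_)
```

Both fits should give the same coefficients, since only the container differs.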

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Load Data
https://scikit-learn.org/stable/datasets/index.html

In [None]:
from sklearn import datasets

In [None]:
diabetes = datasets.load_diabetes()

In [None]:
diabetes

In [None]:
diabetes.keys()

In [None]:
data_df = pd.DataFrame(data = diabetes.data, columns = diabetes.feature_names)
data_df['target'] = diabetes.target

In [None]:
data_df.head()

In [None]:
print(diabetes.DESCR)

---
In this case, we want to predict each patient's disease progression one year after baseline. Because the target is a continuous value, this is a 'regression problem'. By contrast, if we wanted to assign each observation to a category, it would be a 'classification problem'.
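
One rough way to see the difference is to count the distinct target values. The sketch below compares the diabetes target with the target of the iris dataset, which is not used elsewhere in this notebook and is loaded here only as an example of a classification target.

```python
import numpy as np
from sklearn import datasets

diabetes = datasets.load_diabetes()
iris = datasets.load_iris()  # brought in only for comparison

# A continuous target takes many distinct values, a categorical one only a few
n_unique_diabetes = len(np.unique(diabetes.target))
n_unique_iris = len(np.unique(iris.target))

print('distinct diabetes targets:', n_unique_diabetes)
print('distinct iris targets:', n_unique_iris)
```

Counting distinct values is only a heuristic; whether a problem is regression or classification depends on what the target means.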

---

## Preprocess

- missing imputation (X)
- normalization (X)
- validation and testing set separation
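
The first two steps are not carried out in this notebook. As a minimal sketch of what they might look like, assuming a made-up array with one missing value, `SimpleImputer` and `StandardScaler` can be chained in a pipeline:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with one missing value (illustrative only)
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])

# Fill missing values with the column mean, then scale each column
# to zero mean and unit variance
preprocess = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
X_clean = preprocess.fit_transform(X)

print(X_clean)
```

Mean imputation is just one possible strategy, chosen here for brevity.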


In [None]:
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(data_df.drop('target', axis = 1), data_df.target)

In [None]:
print(train_x.shape, train_y.shape)
print(test_x.shape, test_y.shape)
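
The call above uses the defaults: 25% of the rows go to the test set, and the split changes on every run because no `random_state` is given. The sketch below is one possible way to get a reproducible train/validation/test split by calling `train_test_split` twice. The 60/20/20 proportions and the seed are arbitrary choices for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)

# Step 1: hold out 20% of all rows as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: hold out 25% of the remainder (about 20% overall) as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)
```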

## Build model

In [None]:
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()

linear_model.fit(train_x, train_y)
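
After fitting, the learned parameters are available as `coef_` and `intercept_`. The sketch below refits on the full diabetes data, purely to keep the example self-contained, and labels each coefficient with its feature name.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression

diabetes = datasets.load_diabetes()

# Fit on the full dataset only to keep this example short
model = LinearRegression().fit(diabetes.data, diabetes.target)

# One coefficient per feature, labelled with the feature names
coefficients = pd.Series(model.coef_, index=diabetes.feature_names)

print(coefficients.sort_values())
print('intercept:', model.intercept_)
```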

## Evaluation

In [None]:
print('training score : {:.3f}'.format(linear_model.score(train_x, train_y)))
print('testing score : {:.3f}'.format(linear_model.score(test_x, test_y)))
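
For regressors, `score` returns the coefficient of determination R². The sketch below checks this against `r2_score` on a fresh split; the seed is an arbitrary choice.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Both lines compute R^2 on the test set
score_from_model = model.score(X_test, y_test)
score_from_metric = r2_score(y_test, model.predict(X_test))

print('{:.3f} {:.3f}'.format(score_from_model, score_from_metric))
```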

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# predict target with test data
test_prediction = linear_model.predict(test_x)

print('RMSE of testing set {:.2f}'.format(mean_squared_error(test_y, test_prediction)**0.5))
print('MAE of testing set {:.2f}'.format(mean_absolute_error(test_y, test_prediction)))
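
To make the two metrics concrete, the sketch below computes them by hand with NumPy on made-up values and compares the results with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true values and predictions (illustrative only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_pred - y_true

# RMSE: square root of the mean squared error
rmse_manual = np.sqrt(np.mean(errors ** 2))
# MAE: mean of the absolute errors
mae_manual = np.mean(np.abs(errors))

rmse_sklearn = mean_squared_error(y_true, y_pred) ** 0.5
mae_sklearn = mean_absolute_error(y_true, y_pred)

print('RMSE {:.3f} {:.3f}'.format(rmse_manual, rmse_sklearn))
print('MAE {:.3f} {:.3f}'.format(mae_manual, mae_sklearn))
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging.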

In [None]:
plt.scatter(test_y, test_prediction)
plt.xlabel('true Y')
plt.ylabel('model prediction')
plt.show()

---

## Practice

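Use the Boston housing dataset to make predictions (the training and testing sets are already prepared). Note that `load_boston` was removed in scikit-learn 1.2, so the cells below need an earlier version.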

In [None]:
import pandas as pd

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

boston = load_boston()

In [None]:
boston_data = pd.DataFrame(boston.data, columns = boston.feature_names)
boston_y = boston.target

train_x, test_x, train_y, test_y = train_test_split(boston_data, boston_y, shuffle = True, test_size = 0.2, random_state = 1)

In [None]:
# your code starts from here


---

### Supervised learning 1.0

After the example and the practice, you should be able to
- describe the basics of sklearn
- build and evaluate a model without further preprocessing
