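## [Assignment Focus]
Use the Lasso and Ridge models in Sklearn to train on various datasets. Make sure you understand the **data type** of what is fed into the model for training, and what each of the model's parameters means.

There are many kinds of machine learning models, but training data mostly follows a fixed format. Make sure you understand that format so that you can get started training as quickly as possible when applying a new model!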
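
As a quick illustration of that fixed format, here is a minimal sketch. It assumes the bundled `load_diabetes` dataset, chosen only because it ships with scikit-learn and needs no download; it is not the dataset used in this notebook. scikit-learn estimators generally expect a 2-D feature array of shape `(n_samples, n_features)` and, for single-output regression, a 1-D target of shape `(n_samples,)`.

```python
from sklearn.datasets import load_diabetes

# Assumption: load_diabetes is used purely as a small bundled example dataset.
X, y = load_diabetes(return_X_y=True)

# Features: 2-D NumPy array, one row per sample and one column per feature.
print(type(X), X.dtype, X.shape)
# Target: 1-D NumPy array, one value per sample.
print(type(y), y.dtype, y.shape)

# A single feature column is 1-D after slicing; reshape it to
# (n_samples, 1) before passing it to fit(), which expects 2-D input.
single_feature = X[:, 2]
single_feature_2d = single_feature.reshape(-1, 1)
print(single_feature.shape, single_feature_2d.shape)
```

Checking `shape` and `dtype` like this before calling `fit` is a cheap way to catch format mistakes early, whichever model you try next.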

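## Practice Time
Try using other datasets from sklearn datasets (boston, ...) to train your own linear regression model, and add suitable regularization to observe how training behaves.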

In [1]:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [2]:
# Load the Boston dataset (note: load_boston was removed in scikit-learn 1.2)
boston = load_boston()
data = boston.data
target = boston.target


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

In [3]:
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

## Linear

In [4]:
# Create and train the linear regression model
linear_reg = LinearRegression()
linear_reg.fit(x_train, y_train)
# Predict with the trained model
predictions = linear_reg.predict(x_test)
# Compute the mean squared error of the predictions
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)

Mean Squared Error: 24.291119474973456


## Lasso

In [5]:
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(x_train, y_train)

pred_lasso = lasso_reg.predict(x_test)

mse_lasso = mean_squared_error(y_test, pred_lasso)
print("Lasso MSE:", mse_lasso)

Lasso MSE: 25.155593753934173


## Ridge

In [6]:
ridge_reg = Ridge(alpha=0.1)
ridge_reg.fit(x_train, y_train)

pred_ridge = ridge_reg.predict(x_test)

mse_ridge = mean_squared_error(y_test, pred_ridge)
print("Ridge MSE:", mse_ridge)

Ridge MSE: 24.30102550019277


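* With Lasso regression, the coefficients of many features become exactly 0, so it can be used for feature selection.
* Lasso or Ridge does not necessarily give better results than plain linear regression (here the test MSE is 24.29 for linear regression, 25.16 for Lasso, and 24.30 for Ridge). The regularization term added to the objective function keeps the model from becoming too complex, but it also limits the model's ability to fit the data. If you see no sign of over-fitting, there is no need to start with overly strong regularization.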
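
To make both observations concrete, the following is a minimal sketch that counts zero coefficients and reports test MSE for several regularization strengths. It rests on assumptions not taken from the notebook: the bundled `load_diabetes` dataset is swapped in for Boston (since `load_boston` is no longer available in recent scikit-learn releases), and the `alphas` grid is a hypothetical choice for illustration only.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: load_diabetes stands in for the Boston data used above.
X, y = load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid of regularization strengths, for illustration only.
alphas = [0.01, 0.1, 1.0]

results = {}
for name, Model in [("Lasso", Lasso), ("Ridge", Ridge)]:
    for alpha in alphas:
        model = Model(alpha=alpha).fit(x_train, y_train)
        # Number of coefficients that are exactly zero (dropped features).
        n_zero = int((model.coef_ == 0).sum())
        mse = mean_squared_error(y_test, model.predict(x_test))
        results[(name, alpha)] = (n_zero, mse)
        print(f"{name} alpha={alpha}: zero coefficients={n_zero}, test MSE={mse:.2f}")
```

The L1 penalty in Lasso can set coefficients exactly to zero, while the L2 penalty in Ridge only shrinks them toward zero, so the zero count is the thing to watch as `alpha` grows. Comparing the test MSE across the grid is a simple way to judge whether stronger regularization is helping or merely limiting the fit.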