# 📝 Exercise M7.03

As in the classification metrics exercise, we evaluate the **regression
metrics** within a **cross-validation framework** to get familiar with the **syntax**.

We use the **Ames house prices dataset**.

In [4]:
import pandas as pd
import numpy as np

ames_housing = pd.read_csv("../datasets/house_prices.csv")
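# Separate the features from the target and keep only numerical features
data = ames_housing.drop(columns="SalePrice")
data = data.select_dtypes(np.number)
# Express the sale price in thousands of dollars (k$)
target = ames_housing["SalePrice"] / 1000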
data

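| | Id | MSSubClass | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 8450 | 7 | 5 | 2003 | 2003 | 706 | 0 | 150 | ... | 548 | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 |
| 1 | 2 | 20 | 9600 | 6 | 8 | 1976 | 1976 | 978 | 0 | 284 | ... | 460 | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 |
| 2 | 3 | 60 | 11250 | 7 | 5 | 2001 | 2002 | 486 | 0 | 434 | ... | 608 | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 |
| 3 | 4 | 70 | 9550 | 7 | 5 | 1915 | 1970 | 216 | 0 | 540 | ... | 642 | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 |
| 4 | 5 | 60 | 14260 | 8 | 5 | 2000 | 2000 | 655 | 0 | 490 | ... | 836 | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | 7917 | 6 | 5 | 1999 | 2000 | 0 | 0 | 953 | ... | 460 | 0 | 40 | 0 | 0 | 0 | 0 | 0 | 8 | 2007 |
| 1456 | 1457 | 20 | 13175 | 6 | 6 | 1978 | 1988 | 790 | 163 | 589 | ... | 500 | 349 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2010 |
| 1457 | 1458 | 70 | 9042 | 7 | 9 | 1941 | 2006 | 275 | 0 | 877 | ... | 252 | 0 | 60 | 0 | 0 | 0 | 0 | 2500 | 5 | 2010 |
| 1458 | 1459 | 20 | 9717 | 5 | 6 | 1950 | 1996 | 49 | 1029 | 0 | ... | 240 | 366 | 0 | 112 | 0 | 0 | 0 | 0 | 4 | 2010 |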


<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

The first step will be to create a linear regression model.

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

modelp = make_pipeline(StandardScaler(), LinearRegression())

Then, use the **`cross_val_score`** function to estimate the statistical performance of
the model. Use a **`KFold` cross-validation with 10 folds**. Make the use of the
**$R^2$ score explicit** by setting the **`scoring`** parameter (even though $R^2$ is
the default score).

In [6]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

# shuffle=True without a fixed random_state: the folds change on every call
cv = KFold(n_splits=10, shuffle=True)
scores = cross_val_score(modelp, data, target, cv=cv, scoring='r2')
scores

array([0.76934815, 0.66825649, 0.82852036, 0.84909814, 0.79992011,
       0.83223154, 0.83498268, 0.80152028, 0.87755448, 0.12174257])
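
As a quick check of the statement that $R^2$ is the default score, the sketch below compares `cross_val_score` with and without `scoring="r2"`. It is not part of the exercise: the synthetic data from `make_regression`, the `demo_*` names, and the fixed `random_state` are assumptions made so that the block runs on its own and both calls see the same folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (assumption, for illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

demo_model = make_pipeline(StandardScaler(), LinearRegression())
# A fixed random_state makes both calls below use exactly the same folds.
demo_cv = KFold(n_splits=10, shuffle=True, random_state=0)

default_scores = cross_val_score(demo_model, X_demo, y_demo, cv=demo_cv)
r2_scores = cross_val_score(
    demo_model, X_demo, y_demo, cv=demo_cv, scoring="r2"
)

# The default `score` method of a regressor is R^2, so both arrays should agree.
print(np.allclose(default_scores, r2_scores))
# Summarize the per-fold scores by their mean and standard deviation.
print(f"R2: {r2_scores.mean():.3f} +/- {r2_scores.std():.3f}")
```

Reporting the mean together with the standard deviation is one common way to summarize per-fold scores; a single low fold, like the last one in the output above, then shows up as a large spread.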

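Then, instead of using the $R^2$ score, use the **mean absolute error**. You need
to refer to the documentation for the `scoring` parameter. Note that scikit-learn
scorers follow a "higher is better" convention, so errors are exposed in negated
form (here `neg_mean_absolute_error`) and the returned values are negative.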

In [7]:
scores2 = cross_val_score(modelp, data, target, cv=cv, scoring='neg_mean_absolute_error')
scores2

array([-21.56293365, -22.08040317, -26.1159549 , -21.18131599,
       -27.52811049, -20.56810975, -20.6276349 , -18.87125456,
       -21.25012209, -20.53658387])
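
To read such scores as errors, flip their sign. The sketch below does this and, as a sanity check, recomputes the mean absolute error fold by fold with `mean_absolute_error`. The synthetic data, the `demo_*` names, and the fixed `random_state` are assumptions made for illustration; they are not part of the exercise.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing data (assumption, for illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
demo_model = LinearRegression()
# A fixed random_state keeps the folds identical across the two computations.
demo_cv = KFold(n_splits=5, shuffle=True, random_state=0)

neg_mae = cross_val_score(
    demo_model, X_demo, y_demo, cv=demo_cv, scoring="neg_mean_absolute_error"
)
mae = -neg_mae  # flip the sign to recover the usual (positive) error

# Recompute the error by hand on the same folds.
manual_mae = []
for train_idx, test_idx in demo_cv.split(X_demo):
    fitted = clone(demo_model).fit(X_demo[train_idx], y_demo[train_idx])
    predictions = fitted.predict(X_demo[test_idx])
    manual_mae.append(mean_absolute_error(y_demo[test_idx], predictions))
manual_mae = np.array(manual_mae)

print(mae)
print(np.allclose(mae, manual_mae))
```

In the exercise, the target is expressed in k$, so the negated scores read directly as an average error in thousands of dollars.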

Finally, use the **`cross_validate` function** and compute **multiple scores/errors**
at once by passing a list of scorers to the **`scoring` parameter**. For instance,
compute both the **$R^2$ score** and the **mean absolute error**.

In [8]:
from sklearn.model_selection import cross_validate

cv_res = cross_validate(modelp, data, target, cv=cv, scoring=['r2', 'neg_mean_absolute_error'])
cv_res

{'fit_time': array([0.00569963, 0.00451922, 0.00437665, 0.00436473, 0.00436735,
        0.00436759, 0.00434685, 0.00434327, 0.00435352, 0.00434327]),
 'score_time': array([0.00169325, 0.00165415, 0.0015533 , 0.00156188, 0.00155807,
        0.00155044, 0.00155473, 0.00153565, 0.00155878, 0.00154734]),
 'test_r2': array([0.89088725, 0.12912053, 0.85617545, 0.82888462, 0.86441723,
        0.84416476, 0.81252139, 0.65896611, 0.78762472, 0.80744548]),
 'test_neg_mean_absolute_error': array([-17.35854724, -24.03598621, -19.13385309, -21.93735682,
        -20.35862596, -23.16075385, -22.6025236 , -23.52255632,
        -25.44614772, -24.16401686])}
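
The dictionary returned by `cross_validate` holds one array per key, so it can be easier to read as a table. The sketch below loads it into a pandas `DataFrame` and summarizes each metric. The synthetic data, the `demo_*` names, the added `test_mean_absolute_error` column, and the fixed `random_state` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the housing data (assumption, for illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
demo_cv = KFold(n_splits=5, shuffle=True, random_state=0)

demo_res = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=demo_cv,
    scoring=["r2", "neg_mean_absolute_error"],
)

# One row per fold, one column per key of the returned dictionary.
results = pd.DataFrame(demo_res)
# Hypothetical extra column: the error with its usual positive sign.
results["test_mean_absolute_error"] = -results["test_neg_mean_absolute_error"]

# Mean and standard deviation of each metric across folds.
summary = results[["test_r2", "test_mean_absolute_error"]].agg(["mean", "std"])
print(summary)
```

Because both metrics come from a single `cross_validate` call, they are computed on the same folds, which makes the per-fold values directly comparable.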