# 📝 Exercise M7.03

As with the classification metrics exercise, we will evaluate the regression
metrics within a cross-validation framework to get familiar with the syntax.

We will use the Ames house prices dataset.

In [1]:
import pandas as pd
import numpy as np

ames_housing = pd.read_csv("../datasets/house_prices.csv")
data = ames_housing.drop(columns="SalePrice")
target = ames_housing["SalePrice"]
# Keep only the numerical features.
data = data.select_dtypes(np.number)
# Express the sale price in thousands of dollars (k$).
target /= 1000

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

The first step will be to create a linear regression model.

In [2]:
# Write your code here.
from sklearn.linear_model import LinearRegression

lrm = LinearRegression()

Then, use `cross_val_score` to estimate the generalization performance of
the model. Use a `KFold` cross-validation with 10 folds. State the $R^2$ score
explicitly by setting the `scoring` parameter (even though it is the default
score).

In [3]:
# Write your code here.
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=10)
scores = cross_val_score(lrm, data, target, cv=cv, scoring='r2', n_jobs=-1)
print(f"R2: {scores.mean():.3f} ± {scores.std():.3f}")

R2: 0.794 ± 0.103
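
To check the remark that $R^2$ is the default score, the sketch below compares `cross_val_score` with and without `scoring="r2"`. It uses synthetic data from `make_regression` instead of the Ames dataset (an assumption, so that the snippet runs on its own), and the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (assumption: stands in for the Ames data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = LinearRegression()
cv = KFold(n_splits=10)

# With scoring=None, cross_val_score falls back to the estimator's own
# `score` method, which is R2 for scikit-learn regressors.
default_scores = cross_val_score(model, X, y, cv=cv)
r2_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

# Both calls should give the same per-fold values.
print(np.allclose(default_scores, r2_scores))
```

Passing `scoring="r2"` does not change the result here; it only makes the choice of metric visible to the reader.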


Then, instead of using the $R^2$ score, use the mean absolute error. You need
to refer to the documentation for the [scoring parameter](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).

In [6]:
# Write your code here.
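# Scorers follow a "greater is better" convention, so the error is negated.
scores = cross_val_score(
    lrm, data, target, cv=cv, scoring='neg_mean_absolute_error', n_jobs=-1
)
# Flip the sign of the mean to report a positive error, in k$.
print(f"Mean Absolute Error: {-scores.mean():.3f} ± {scores.std():.3f}")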

Mean Absolute Error: 21.892 ± 2.225
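
The `neg_` prefix can be surprising at first. The following sketch, again on synthetic data (an assumption, as are all variable names), recomputes the mean absolute error fold by fold with `mean_absolute_error` and compares it with the negated scores returned by `cross_val_score`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (assumption: stands in for the Ames data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
cv = KFold(n_splits=5)  # no shuffling, so the folds are deterministic

# Scores returned by the scorer: the negated MAE, one value per fold.
neg_mae = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_absolute_error"
)

# Manual computation of the (positive) MAE on the same folds.
manual_mae = []
for train_idx, test_idx in cv.split(X):
    fitted = LinearRegression().fit(X[train_idx], y[train_idx])
    predictions = fitted.predict(X[test_idx])
    manual_mae.append(mean_absolute_error(y[test_idx], predictions))
manual_mae = np.array(manual_mae)

# Negating the scores should recover the manually computed errors.
print(np.allclose(-neg_mae, manual_mae))
```

The negation lets scikit-learn treat every scorer as "higher is better", which matters for tools that maximize a score, such as grid search.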


Finally, use the `cross_validate` function and compute multiple scores/errors
at once by passing a list of scorers to the `scoring` parameter. You can
compute the $R^2$ score and the mean absolute error for instance.

In [18]:
# Write your code here.
from sklearn.model_selection import cross_validate
scoring = ['r2', 'neg_mean_absolute_error',
           'neg_median_absolute_error', 'neg_mean_absolute_percentage_error']
scores = cross_validate(lrm, data, target, cv=cv, scoring=scoring, n_jobs=-1)

In [20]:
metrics = pd.DataFrame(scores)
col_to_keep = [f"test_{col}" for col in scoring]
# or equally col_to_keep = ["test_{}".format(col) for col in scoring]

# Keep only the test-score columns (this drops fit_time and score_time).
metrics = metrics[col_to_keep]

metrics.head()

| | test_r2 | test_neg_mean_absolute_error | test_neg_median_absolute_error | test_neg_mean_absolute_percentage_error |
|---|---|---|---|---|
| 0 | 0.843903 | -20.480499 | -15.590772 | -0.131686 |
| 1 | 0.854974 | -21.380031 | -16.408394 | -0.113104 |
| 2 | 0.887523 | -21.268315 | -18.436481 | -0.131649 |
| 3 | 0.749511 | -22.868877 | -15.716027 | -0.143983 |
| 4 | 0.81698 | -24.799557 | -14.871047 | -0.145935 |
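
The negated error columns returned by `cross_validate` can be converted back into positive errors before reporting them. This is a minimal sketch on synthetic data (an assumption); the `errors` and `summary` names are hypothetical, and only two of the four scorers are used to keep it short.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression problem (assumption: stands in for the Ames data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

scoring = ["r2", "neg_mean_absolute_error"]
cv_results = cross_validate(
    LinearRegression(), X, y, cv=KFold(n_splits=5), scoring=scoring
)

# cross_validate returns a dict with fit_time, score_time and one
# "test_<scorer>" entry per requested scorer.
results = pd.DataFrame(cv_results)

# Keep the score as is and flip the sign of the negated error.
errors = pd.DataFrame(
    {
        "r2": results["test_r2"],
        "mean_absolute_error": -results["test_neg_mean_absolute_error"],
    }
)
summary = errors.agg(["mean", "std"])
print(summary)
```

Renaming the columns at the same time as flipping the sign avoids reporting a positive number under a `neg_` label.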


In [31]:
metrics.describe().loc[['mean', 'std']]

| | test_r2 | test_neg_mean_absolute_error | test_neg_median_absolute_error | test_neg_mean_absolute_percentage_error |
|---|---|---|---|---|
| mean | 0.793756 | -21.892397 | -15.90296 | -0.130373 |
| std | 0.108695 | 2.345631 | 1.712831 | 0.013504 |
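
The standard deviations reported by `describe()` are slightly larger than those printed earlier with `scores.std()` (for example 0.109 against 0.103 for $R^2$). This is consistent with the different defaults of the two libraries: NumPy divides by $n$ (`ddof=0`) while pandas divides by $n - 1$ (`ddof=1`). The sketch below illustrates the gap using only the five per-fold $R^2$ values visible in `metrics.head()`, so its numbers differ from the 10-fold results.

```python
import numpy as np
import pandas as pd

# The five per-fold R2 values shown by metrics.head() (half of the 10 folds).
fold_r2 = pd.Series([0.843903, 0.854974, 0.887523, 0.749511, 0.81698])
n = len(fold_r2)

# NumPy default: population standard deviation (ddof=0).
population_std = np.std(fold_r2.to_numpy())

# pandas default: sample standard deviation (ddof=1).
sample_std = fold_r2.std()

# The two estimates differ by a factor of sqrt(n / (n - 1)).
print(population_std, sample_std, np.sqrt(n / (n - 1)))
```

Either convention is acceptable for reporting, as long as the same one is used when comparing numbers.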
