# 2. Model Assessment

## Summary of commands
Code from the previous notebook is summarized below.

In [None]:
import pandas as pd
hs = pd.read_csv('data/housing_sample.csv')
X = hs[['GrLivArea']].values
y = hs.pop('SalePrice').values

from sklearn.linear_model import LinearRegression  # step 1 - import
lr = LinearRegression()                            # step 2 - instantiate
lr.fit(X, y)                                       # step 3 - fit

lr.predict(X)

## Assess model performance
All supervised estimators have a `score` method that accepts the input data and the target variable and returns a metric of model performance. For regressors, scikit-learn uses R-squared by default. R-squared is a relative metric: it is the proportion of the variance in the target variable that the model explains. A score of 1, the highest possible, means the model predicts every target value exactly. A score of 0 is what you would get by always predicting the mean of the target, and the score can be negative for a model that does worse than that. Let's calculate it now on our data.

In [None]:
lr.score(X, y)
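To make the definition concrete, here is a minimal sketch that computes R-squared by hand and compares it with the `score` method. It uses synthetic data as a stand-in for the housing sample; the names `X_demo`, `y_demo`, and `lr_demo` and the numbers used to generate the data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing data (assumption: not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 4000, size=(200, 1))   # one feature, like living area
y_demo = 20_000 + 100 * X_demo[:, 0] + rng.normal(0, 30_000, size=200)

lr_demo = LinearRegression().fit(X_demo, y_demo)
y_demo_pred = lr_demo.predict(X_demo)

ss_res = np.sum((y_demo - y_demo_pred) ** 2)     # squared error left after fitting
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)   # squared error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

# The two values should agree
print(r2_manual, lr_demo.score(X_demo, y_demo))
```

The ratio `ss_res / ss_tot` is the share of the target's variation that the model fails to explain, which is why subtracting it from 1 gives a relative score.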

### Root mean squared error of logged sale price
Kaggle evaluates submissions with a different metric than R-squared: the [root mean squared error][1] of the logged sale price. This is an absolute metric, with 0 being the best possible result (no error). It isn't possible to calculate this with the `score` method. The `metrics` module contains [many scoring metrics for regressors][2], such as the `mean_squared_log_error` function, which is very close to what we need.

The scoring functions are used differently than the `score` method. You must pass them the actual ('true') output values and the corresponding predicted values.
 
[1]: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview/evaluation
[2]: https://scikit-learn.org/stable/modules/classes.html#regression-metrics

In [None]:
from sklearn.metrics import mean_squared_log_error
y_pred = lr.predict(X)
mean_squared_log_error(y, y_pred)

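Taking the square root gives the root mean squared log error. scikit-learn computes the logarithm as log(1 + y) rather than log(y), so for values as large as house prices the result is practically identical to the metric that Kaggle uses.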

In [None]:
import numpy as np
np.sqrt(mean_squared_log_error(y, y_pred))
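As a rough check of how close `mean_squared_log_error` is to a plain-log version of the metric, the sketch below computes both by hand. The prices are made up for illustration and the variable names are not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up prices, for illustration only
y_true_demo = np.array([100_000.0, 200_000.0, 300_000.0])
y_pred_demo = np.array([110_000.0, 190_000.0, 330_000.0])

rmsle_sklearn = np.sqrt(mean_squared_log_error(y_true_demo, y_pred_demo))

# scikit-learn's definition uses log(1 + y), which is np.log1p
rmsle_log1p = np.sqrt(np.mean((np.log1p(y_true_demo) - np.log1p(y_pred_demo)) ** 2))

# Root mean squared error of the plain logarithm of the prices
rmsle_log = np.sqrt(np.mean((np.log(y_true_demo) - np.log(y_pred_demo)) ** 2))

print(rmsle_sklearn, rmsle_log1p, rmsle_log)
```

The first two values match, and the third differs only far past the decimal places that matter, because adding 1 to a six-figure price barely changes its logarithm.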

### Creating your own scoring function
Let's write a function that calculates the root mean squared error of the logged sale price and verify that the result is the same as above.

In [None]:
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

# test
rmsle(y, y_pred)

## Making a formal scikit-learn scorer
scikit-learn has a function in the `metrics` module called `make_scorer`. It will turn your user-defined metric into a formal scikit-learn 'scorer'. The reason you'd want to do this is so that your custom metric can be used during hyper-parameter tuning when grid searching with `GridSearchCV`.

To create the scorer, pass the `make_scorer` function your custom function. Additionally, if the metric is defined such that a lower score is better, as is the case with root mean squared error, then you need to set the parameter `greater_is_better` to `False`.

In [None]:
from sklearn.metrics import make_scorer
root_mean_squared_log_error = make_scorer(rmsle, greater_is_better=False)

This new object takes the estimator, the input data, and the output data as arguments. Note that you do not pass it the predicted values; it calls the estimator's `predict` method itself.

In [None]:
root_mean_squared_log_error(lr, X, y)

### Why does it return the negative of the metric?
It returns the negative of our previous score so that scikit-learn can have a consistent way of ranking models with a higher score being better.
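The sign flip can be checked directly. The following is a small sketch on synthetic data (the `_demo` names and the data-generating numbers are assumptions, not the housing sample) showing that a scorer built with `greater_is_better=False` returns exactly the negative of the underlying metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_log_error

def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

# Synthetic, strictly positive targets (assumption: stand-in for sale prices)
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 4000, size=(200, 1))
y_demo = 50_000 + 100 * X_demo[:, 0] + rng.normal(0, 10_000, size=200)

lr_demo = LinearRegression().fit(X_demo, y_demo)
scorer_demo = make_scorer(rmsle, greater_is_better=False)

metric_value = rmsle(y_demo, lr_demo.predict(X_demo))   # lower is better
scorer_value = scorer_demo(lr_demo, X_demo, y_demo)     # higher is better

print(metric_value, scorer_value)
```

A model with a smaller error therefore gets a larger (less negative) score, so tools that pick the highest score still pick the best model.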

## Cross Validation
Assessing model performance using the same data that the model was trained on gives misleading, overly optimistic results. For a more accurate assessment, we can use cross validation to train on one subset of the data and test on an unseen subset. The `cross_val_score` function from the `model_selection` module automates this procedure for us.

We must pass `cross_val_score` our estimator, input data, and output data. You can also set the number of folds with the `cv` parameter. By default, it performs K-fold cross validation with K equal to 5 (the default was 3 before version 0.22).

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(lr, X, y, cv=5)
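To show what `cross_val_score` automates, here is a sketch of the same procedure written as an explicit loop. It assumes synthetic data in place of the housing sample and relies on the fact that, for a regressor, an integer `cv` corresponds to an unshuffled `KFold`. The `_demo` names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 4000, size=(200, 1))
y_demo = 50_000 + 100 * X_demo[:, 0] + rng.normal(0, 10_000, size=200)

lr_demo = LinearRegression()

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    model = clone(lr_demo)                            # fresh, unfitted copy for each fold
    model.fit(X_demo[train_idx], y_demo[train_idx])   # train on four folds
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))  # R-squared on the held-out fold

auto_scores = cross_val_score(lr_demo, X_demo, y_demo, cv=5)
print(np.array(manual_scores))
print(auto_scores)
```

Each of the five scores comes from a model that never saw the fold it is scored on.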

### Use our custom scorer
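By default, `cross_val_score` returns R-squared values for regressors. Set the `scoring` parameter to the scorer we created above to get the root mean squared log error for each fold. Because the scorer was created with `greater_is_better=False`, the returned values are negated.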

In [None]:
cross_val_score(lr, X, y, cv=5, scoring=root_mean_squared_log_error)

### Place custom scorer in a module
Take a look at the `metrics.py` module. It contains the definition of `root_mean_squared_log_error`, which can now be imported from there in all the other notebooks.

### Use a specific flavor of cross validation
scikit-learn offers a [variety of different 'splitters'][1] in the `model_selection` module to perform cross validation. Each splitter implements a different strategy, and several of them let you shuffle the data, which does not happen by default. The splitters are classes that must be instantiated. Here we create a splitter instance that does 5 splits and shuffles the data.

[1]: https://scikit-learn.org/stable/modules/classes.html#splitter-classes

In [None]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True)

We can now pass this splitter to the `cv` parameter of `cross_val_score`. You will get different scores each time you run this as the data is getting shuffled randomly.

In [None]:
cross_val_score(lr, X, y, cv=kf, scoring=root_mean_squared_log_error)
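If repeatable scores are needed, for example when comparing models, one option is to fix the shuffle with the `random_state` parameter of `KFold`. The sketch below uses synthetic data and the default R-squared scoring; the seed value and the `_demo` names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 4000, size=(200, 1))
y_demo = 50_000 + 100 * X_demo[:, 0] + rng.normal(0, 10_000, size=200)

# An integer random_state makes the shuffled splits identical on every call
kf_fixed = KFold(n_splits=5, shuffle=True, random_state=0)

scores_a = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf_fixed)
scores_b = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf_fixed)

print(scores_a)
print(scores_b)
```

Both runs use the same folds, so the two arrays of scores are the same.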

## Summary

In [None]:
hs = pd.read_csv('data/housing_sample.csv')
X = hs[['GrLivArea']].values
y = hs.pop('SalePrice').values

from sklearn.linear_model import LinearRegression  # step 1 - import
lr = LinearRegression()                            # step 2 - instantiate
lr.fit(X, y)                                       # step 3 - fit

lr.score(X, y)

In [None]:
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.model_selection import KFold, cross_val_score

def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

root_mean_squared_log_error = make_scorer(rmsle, greater_is_better=False)

kf = KFold(n_splits=5, shuffle=True)

cross_val_score(lr, X, y, cv=kf, scoring=root_mean_squared_log_error)

# Exercise
Repeat cross validation using our custom scorer on several of the other regressors.

## Extra - more on custom scorer
It isn't necessary to use the `make_scorer` function. You can instead create a function that accepts the estimator, input data, and target variable. Just make sure to return the negative of the score if a lower score is better.

In [None]:
def rmsle_scorer(est, X, y):
    y_pred = est.predict(X)
    return -np.sqrt(mean_squared_log_error(y, y_pred))

In [None]:
rmsle_scorer(lr, X, y)

In [None]:
cross_val_score(lr, X, y, cv=kf, scoring=rmsle_scorer)