# Model Performance: Regression

Whereas classification is about predicting a discrete label such as "cat" or "dog", regression is about predicting a quantity, typically a continuous value such as an amount, a size, or a price. For example, a model might predict that a house will sell for a specific dollar value. But how do you calculate the accuracy of a regression model? The answer is simple: you don't! Accuracy is a measure for classification, not regression. If you are predicting a numerical value such as the sale price of a house, you rarely need to know whether the model predicted the value exactly. Instead, you care about how close the prediction was to the expected value. The numerical difference between the predicted and expected values is called the **distance** or **error**. This lesson introduces several error metrics that you can use to report the prediction skill of a regression model.
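
To see why accuracy is the wrong tool here, the following minimal sketch scores a few house-price predictions two ways. The prices and variable names are hypothetical and chosen only for illustration.

```python
# Hypothetical sale prices in dollars (illustrative values only)
actual_prices = [250_000, 310_000, 180_000]
predicted_prices = [249_000, 312_500, 180_500]

# Classification-style accuracy: fraction of predictions that match exactly
exact_matches = sum(a == p for a, p in zip(actual_prices, predicted_prices))
accuracy = exact_matches / len(actual_prices)

# Regression-style view: how far off was each prediction?
absolute_errors = [abs(a - p) for a, p in zip(actual_prices, predicted_prices)]

print(f"Accuracy: {accuracy}")                # 0.0, no prediction is exact
print(f"Absolute errors: {absolute_errors}")  # [1000, 2500, 500]
```

Accuracy reports total failure even though every prediction lands within about 1% of the sale price, which is why regression models are scored by error instead.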


## Error Metrics

<img style="float: right; margin: 15px 15px 15px 15px;" src="img/mse.png" width="200">

Four metrics are commonly used for evaluating and reporting on the quality of a regression model's predictions:

#### Mean Squared Error (MSE)
* Finds the average squared distance (error) between the predicted and actual values. 
* Tells you how close a regression line is to a set of points by taking the distances from the points to the line and squaring them.
* Squaring removes any negative signs and magnifies large errors.
* The lower the MSE, the better the prediction skill.
* Formula: 
  - **MSE** = (1 / N) * sum for i = 1 to N of (y_i - yhat_i)^2, where y_i is the actual value and yhat_i is the predicted value
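
As a rough illustration of the formula, here is a minimal sketch that computes MSE step by step in plain Python. The data values and the helper name `mean_squared_error` are assumptions made up for this example, not part of the lesson's dataset.

```python
# Hypothetical example data, chosen only for illustration
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    squared_errors = [(y - yhat) ** 2 for y, yhat in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

mse = mean_squared_error(actual, predicted)
print(f"MSE: {mse}")  # (0.25 + 0 + 4 + 1) / 4 = 1.3125
```

Note how the single error of 2 contributes 4 of the 5.25 total squared error: squaring lets the largest miss dominate the score.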

#### Root Mean Squared Error (RMSE)

<img style="float: right; margin: 15px 15px 15px 15px;" src="img/rmse.png" width="200">

* A variation of the MSE metric: taking the square root puts the error back in the same units as the target, so it can be read as a typical **deviation** of predictions from actual values.
* Is most appropriate when the errors are assumed to be unbiased and approximately normally distributed.
* Just like MSE, RMSE is a non-negative value and the lower the RMSE, the better the prediction skill.
* Like MSE, RMSE punishes large errors more heavily than small ones, which makes it a good choice when large misses are especially costly.
* It is sensitive to outliers, so inspect the dataset for them beforehand and decide whether they should be removed.
* Formula: 
  - **RMSE** = sqrt((1 / N) * sum for i = 1 to N of (y_i - yhat_i)^2)
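
The sensitivity to large errors can be sketched with two sets of predictions for the same targets. The data and the names `close_predictions` and `one_large_miss` are hypothetical, chosen so the arithmetic is easy to follow.

```python
from math import sqrt

def rmse(actual, predicted):
    """Square root of the mean squared difference between actual and predicted."""
    squared_errors = [(y - yhat) ** 2 for y, yhat in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical targets and two hypothetical sets of predictions
actual = [10.0, 12.0, 11.0, 13.0]
close_predictions = [11.0, 11.0, 12.0, 12.0]  # every prediction is off by 1
one_large_miss = [10.0, 12.0, 11.0, 23.0]     # three exact, one off by 10

rmse_close = rmse(actual, close_predictions)  # sqrt(4 / 4) = 1.0
rmse_outlier = rmse(actual, one_large_miss)   # sqrt(100 / 4) = 5.0

print(f"RMSE, small errors everywhere: {rmse_close}")
print(f"RMSE, one large miss:          {rmse_outlier}")
```

Three of the four predictions in the second set are exact, yet its RMSE is five times higher, because the one large miss is squared before averaging.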


<img style="float: right; margin: 15px 15px 15px 15px;" src="img/mae.png" width="270">

#### Mean Absolute Error (MAE)

* Computes the average of the absolute errors, taking the absolute value of each difference between predicted and actual values so that it is never negative.
* Unlike MSE and RMSE, which punish larger errors more than smaller errors, MAE grows linearly with the size of each error and is therefore more intuitive.
* MAE tells you the magnitude of the error but not its direction, i.e., whether the model is over- or under-estimating.
* Like the others, an error value of 0.0 would be ideal, meaning that all predictions matched the expected values exactly.
* Formula: 
  - **MAE** = (1 / N) * sum for i = 1 to N of abs(y_i - yhat_i)
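
The point about direction can be illustrated by computing MAE next to the plain average of the signed errors. The mean signed error is not one of this lesson's four metrics; it is added here as an assumption purely to show what MAE leaves out, and the data is hypothetical.

```python
# Hypothetical example data, chosen only for illustration
actual = [10.0, 12.0, 11.0, 13.0]
predicted = [12.0, 11.0, 14.0, 13.0]

# Signed errors keep the direction: positive means the model over-estimated.
# (Computed here as predicted minus actual, so that the sign reads naturally.)
signed_errors = [yhat - y for y, yhat in zip(actual, predicted)]

# MAE discards the sign; the mean signed error keeps it
mae = sum(abs(e) for e in signed_errors) / len(signed_errors)
mean_signed_error = sum(signed_errors) / len(signed_errors)

print(f"MAE: {mae}")                              # (2 + 1 + 3 + 0) / 4 = 1.5
print(f"Mean signed error: {mean_signed_error}")  # (2 - 1 + 3 + 0) / 4 = 1.0
```

The MAE of 1.5 says predictions miss by 1.5 on average, while the positive mean signed error suggests that this hypothetical model tends to over-estimate.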



#### R-Squared (R<sup>2</sup>)

* Also referred to as the **coefficient of determination**.
* Provides an indication of the goodness of fit of a set of predictions to the actual values.
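* Typically yields a value between 0 (no better than always predicting the mean of the actual values) and 1 (perfect fit). It can be negative when the predictions are worse than always predicting the mean.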
* Formula: 
  - **R<sup>2</sup>** = 1 - Unexplained Variance / Total Variance
<img style="float: right; margin: 15px 15px 15px 15px;" src="img/rsquared.png" width="500">

>**Calculation**: Calculating R<sup>2</sup> takes several steps. Start with data points (observations) of the dependent and independent variables and find the line of best fit from a regression model. From there, calculate the predicted values, subtract them from the actual values, and square the results. This yields a list of squared errors, which is then summed to give the **unexplained variance**.
> To calculate the **total variance**, subtract the average actual value from each of the actual values, square the results, and sum them. From there, divide the first sum (unexplained variance) by the second sum (total variance), subtract the result from one, and you have the R-Squared measure.
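
The calculation steps can be sketched in plain Python as follows. The data is hypothetical and the predictions are simply assumed to come from an already fitted regression model.

```python
# Hypothetical actual values and predictions from an assumed fitted model
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]

# Unexplained variation: sum of the squared prediction errors
unexplained = sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted))

# Total variation: sum of squared deviations from the mean of the actual values
mean_actual = sum(actual) / len(actual)
total = sum((y - mean_actual) ** 2 for y in actual)

# R-squared: one minus the ratio of unexplained to total variation
r_squared = 1 - unexplained / total

print(f"Unexplained: {unexplained}")  # 1 + 0 + 1 + 0 = 2.0
print(f"Total:       {total}")        # 9 + 1 + 1 + 9 = 20.0
print(f"R^2:         {r_squared:.2f}")  # 1 - 2/20 = 0.90
```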

>**Meaning**: R<sup>2</sup> tells you what proportion of the variance in the actual values is explained by the regression model. The higher the coefficient, the closer the data points lie to the fitted line when both are plotted. If the coefficient is 0.80, then 80% of the variation in the dependent variable is accounted for by the model. A value of 1 indicates that the regression line passes through every data point, and a value of 0 indicates that the model explains none of the variation. A higher coefficient indicates a better goodness of fit for the observations.
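
As a quick way to calibrate how the coefficient reads, here is a minimal sketch that scores three hypothetical sets of predictions against the same targets. The helper `r_squared` and all data values are assumptions made up for this example.

```python
def r_squared(actual, predicted):
    """1 minus (sum of squared errors / sum of squared deviations from the mean)."""
    mean_actual = sum(actual) / len(actual)
    unexplained = sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted))
    total = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - unexplained / total

# Hypothetical actual values; their mean is 5.0
actual = [2.0, 4.0, 6.0, 8.0]

perfect = r_squared(actual, [2.0, 4.0, 6.0, 8.0])         # 1 - 0/20  = 1.0
mean_baseline = r_squared(actual, [5.0, 5.0, 5.0, 5.0])   # 1 - 20/20 = 0.0
reversed_order = r_squared(actual, [8.0, 6.0, 4.0, 2.0])  # 1 - 80/20 = -3.0

print(f"Perfect predictions:    {perfect}")
print(f"Always predict the mean: {mean_baseline}")
print(f"Reversed predictions:    {reversed_order}")
```

Always predicting the mean scores 0, and predictions that are worse than that baseline push the coefficient below zero.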

>**Usefulness**: R<sup>2</sup> is useful as a rough guide to how well the model can be expected to describe new observations drawn from the same process. A high coefficient suggests that new points are likely to fall close to the line, although the coefficient itself is not a probability. Even if there is a strong connection between the two variables, a high coefficient of determination does not prove causality. For example, a study on birthdays may show that a large number of birthdays happen within a time frame of one or two months. This does not mean that the passage of time or the change of seasons causes pregnancy.

#### Need Help? 
* If these metrics seem complicated, don't worry.
* The good news is that computing these metrics in Python with open-source libraries is easy.
* All you have to worry about is knowing:
  - The context(s) in which each metric is suitable.
  - Any influencing factors or limitations that may impact your evaluation or threaten its validity.
  - How to invoke the appropriate metrics functions from your code and interpret the results!

## Hands-On with Regression Metrics

In [None]:
# Import required libraries
import numpy as np
from sklearn import metrics

# Load prediction results
actual_values = [9, -3.3, 6, 11]
predictions = [8.5, -2.9, 6, 9.2]

In [None]:
# Calculate Mean Squared Error (MSE)
print(f'MSE:  {metrics.mean_squared_error(actual_values, predictions)}')

# Calculate Root Mean Squared Error (RMSE)
def rmse(actual_values, predictions):
    """Return the square root of the mean squared difference."""
    actual_values = np.asarray(actual_values)
    predictions = np.asarray(predictions)
    return np.sqrt(((predictions - actual_values) ** 2).mean())

print(f'RMSE: {rmse(actual_values, predictions)}')

# Calculate Mean Absolute Error (MAE)
print(f'MAE:  {metrics.mean_absolute_error(actual_values, predictions)}')

# Calculate R-Squared (R^2)
print(f'R^2:  {metrics.r2_score(actual_values, predictions)}')
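
To sanity-check the library results, the same four metrics can be recomputed by hand on the lesson's data. This is a minimal sketch in plain Python. The name `rmse_value` is an assumption, chosen so that it does not overwrite the `rmse` function defined above.

```python
from math import sqrt

# Same values as in the cells above
actual_values = [9, -3.3, 6, 11]
predictions = [8.5, -2.9, 6, 9.2]

n = len(actual_values)
errors = [y - yhat for y, yhat in zip(actual_values, predictions)]

# Error metrics computed directly from their formulas
mse = sum(e ** 2 for e in errors) / n
rmse_value = sqrt(mse)
mae = sum(abs(e) for e in errors) / n

# R-squared: 1 - unexplained variation / total variation
mean_actual = sum(actual_values) / n
total = sum((y - mean_actual) ** 2 for y in actual_values)
r2 = 1 - sum(e ** 2 for e in errors) / total

print(f"MSE:  {mse:.4f}")         # 0.9125
print(f"RMSE: {rmse_value:.4f}")  # 0.9552
print(f"MAE:  {mae:.4f}")         # 0.6750
print(f"R^2:  {r2:.4f}")          # 0.9696
```

The single error of 1.8 on the last prediction accounts for most of the MSE (3.24 of the 3.65 total squared error), which is why RMSE comes out noticeably higher than MAE here.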