# Training and Test Errors

So far, we have fit regression models to data and obtained predictions from them, but we have not evaluated whether these predictions were any good. In this lesson, we will discuss different performance metrics that can be used to evaluate predictions from a machine learning model. These performance metrics can be calculated on training data or on test data.

In [4]:
import pandas as pd
import numpy as np

#Read the csv file
cars_df = pd.read_csv("car.csv")

#Data cleaning: Remove unwanted columns
cars_df = cars_df.iloc[:,[0,1,3,4,5,6,7,8,10,11,16]]

#Data cleaning: Remove all rows with no sale price
#Why not predict the price instead of dropping the row?
#Because rows with no sale price are not guaranteed to have an odometer reading.
#Additionally, the dataset is big enough that we can afford to drop rows.
cars_df = cars_df[cars_df["Sale Price"] != 0]

#Data cleaning: Remove all cars that are not Battery Electric Vehicle (Example: Hybrid cars)
cars_df = cars_df[cars_df["Clean Alternative Fuel Vehicle Type"] == "Battery Electric Vehicle (BEV)"]

#Data cleaning: Getting rid of outliers
cars_df = cars_df[cars_df["Odometer Reading"] >= 0]
cars_df = cars_df[cars_df["Odometer Reading"] <= 250000]
cars_df = cars_df[cars_df["Model Year"] >= 2000]
cars_df = cars_df[cars_df["Model Year"] <= 2023]
cars_df = cars_df[cars_df["Sale Price"] <= 300000]
cars_df = cars_df[cars_df["Sale Price"] > 100]

#Since we know each car has a unique vin number, we
#can simplify the vin numbers and use them as index
cars_df["VIN (1-10)"] = cars_df.reset_index().index + 1
cars_df = cars_df.rename(columns={"VIN (1-10)": "Car ID"})

# Split the data into training and test sets.
cars_df = cars_df.set_index("Car ID")
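# .loc slicing includes both endpoints, so the test set starts at 100001
# to keep car 100000 from appearing in both sets.
cars_train = cars_df.loc[:100000].copy()
cars_test = cars_df.loc[100001:].copy()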

# Smaller sample data set: every 100th row (a systematic sample, not a random one)
cars_temp = cars_df.iloc[::100, :]
cars_temp = cars_temp.iloc[:,[1,2,3,6,7,8,9]]

# Log transform the target (Sale Price) for visualization purposes only
cars_train["log(Sale Price)"] = np.log(cars_train["Sale Price"])

cars_df
#cars_train
#cars_test
#cars_temp

| Car ID | Clean Alternative Fuel Vehicle Type | Model Year | Make | Model | Vehicle Primary Use | Electric Range | Odometer Reading | New or Used Vehicle | Sale Price | Transaction Year |
|--------|-------------------------------------|------------|------|-------|---------------------|----------------|------------------|---------------------|------------|------------------|
| 1 | Battery Electric Vehicle (BEV) | 2014 | MERCEDES-BENZ | B-Class | Passenger | 87 | 37031 | Used | 19443 | 2018 |
| 2 | Battery Electric Vehicle (BEV) | 2018 | TESLA | Model 3 | Passenger | 215 | 50 | New | 65700 | 2018 |
| 3 | Battery Electric Vehicle (BEV) | 2023 | TESLA | Model Y | Passenger | 0 | 15 | New | 84440 | 2022 |
| 4 | Battery Electric Vehicle (BEV) | 2014 | TESLA | Model S | Passenger | 208 | 30840 | Used | 66864 | 2017 |
| 5 | Battery Electric Vehicle (BEV) | 2022 | TESLA | Model Y | Passenger | 0 | 15 | New | 61440 | 2022 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 109424 | Battery Electric Vehicle (BEV) | 2022 | RIVIAN | R1T | Truck | 0 | 100 | New | 82295 | 2022 |
| 109425 | Battery Electric Vehicle (BEV) | 2022 | TESLA | Model 3 | Passenger | 0 | 187 | New | 75290 | 2022 |
| 109426 | Battery Electric Vehicle (BEV) | 2011 | NISSAN | Leaf | Passenger | 73 | 64000 | Used | 1540 | 2022 |
| 109427 | Battery Electric Vehicle (BEV) | 2020 | TESLA | Model Y | Passenger | 291 | 24482 | Used | 68800 | 2022 |


## Performance Metrics for Regression Models

To evaluate the performance of a regression model, we check the predicted labels from the model against the true labels. Since the labels are quantitative, all of the performance metrics are based on the difference between each predicted label $\hat y$ and the true label $y$. 

One way to make sense of these differences is to square each difference and average the squared differences. This measure of error is known as _mean squared error_ (or _MSE_, for short):

$$ 
\begin{align*}
\textrm{MSE} &= \textrm{mean of } (y - \hat y)^2.
\end{align*}
$$ 

MSE is difficult to interpret because its units are the square of the units of the label. To make MSE more interpretable, it is common to take the _square root_ of the MSE to obtain the _root mean squared error_ (or _RMSE_, for short):

$$ 
\begin{align*}
\textrm{RMSE} &= \sqrt{\textrm{MSE}}.
\end{align*}
$$ 

The RMSE measures how off a "typical" prediction is. Notice that this reasoning is exactly the same reasoning that we used in the past when we defined the standard deviation as the square root of the variance.
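As a small worked example, the two formulas above can be computed directly with NumPy. The arrays `y_true` and `y_pred` are made-up values for illustration only; they do not come from the cars data.

```python
import numpy as np

# Hypothetical true labels and predictions, made up for illustration.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 4.0])

errors = y_true - y_pred        # [ 1.,  0., -2.,  3.]
mse = (errors ** 2).mean()      # (1 + 0 + 4 + 9) / 4 = 3.5
rmse = np.sqrt(mse)             # back in the same units as the labels

print(mse, rmse)
```

The MSE here is 3.5 squared units, while the RMSE (about 1.87) is on the same scale as the individual errors.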

Another common measure of error is the _mean absolute error_ (or _MAE_, for short):

$$ 
\begin{align*}
\textrm{MAE} &= \textrm{mean of } |y - \hat y|.
\end{align*}
$$ 

Like the RMSE, the MAE measures how off a "typical" prediction is. 
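The sketch below computes the MAE on a small set of made-up labels and predictions (the arrays are hypothetical, not taken from the cars data) and compares it to the RMSE of the same predictions.

```python
import numpy as np

# Hypothetical true labels and predictions, made up for illustration.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 4.0])

abs_errors = np.abs(y_true - y_pred)              # [1., 0., 2., 3.]
mae = abs_errors.mean()                           # 6 / 4 = 1.5
rmse = np.sqrt(((y_true - y_pred) ** 2).mean())   # sqrt(3.5)

print(mae, rmse)
```

The MAE can never exceed the RMSE for the same predictions. Squaring gives large errors extra weight, so the gap between the two grows when a few predictions are much further off than the rest.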

MSE, RMSE, and MAE are all error metrics; we want them to be as small as possible. There are also performance metrics that we seek to maximize. One example is $R^2$, also known as the _coefficient of determination_:

$$ 
\begin{align*}
R^2 &= 1 - \frac{\text{mean of } (y - \hat y)^2}{\text{mean of } (y - \bar y)^2}.
\end{align*}
$$ 

Notice that the numerator is the MSE of the model and the denominator, $\text{mean of } (y - \bar y)^2$, is just the variance of the label $y$. So the interpretation of $\frac{\text{mean of } (y - \hat y)^2}{\text{mean of } (y - \bar y)^2}$ is the fraction of the variance in the label $y$ that is "left over" after we fit the regression model. Therefore, $R^2$ can be interpreted as the fraction of variance that is explained by the regression model. It cannot be greater than $1.0$, but it can be negative if the model's predictions are worse than simply predicting the mean $\bar y$ for every observation.
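To make the range of $R^2$ concrete, here is a minimal sketch that implements the formula directly. The helper `r_squared` and all of the arrays are hypothetical, written only for this illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (MSE of the predictions) / (variance of the true labels)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse_model = ((y_true - y_pred) ** 2).mean()
    variance = ((y_true - y_true.mean()) ** 2).mean()
    return 1 - mse_model / variance

# Hypothetical true labels, made up for illustration.
y_true = np.array([3.0, 5.0, 2.0, 7.0])

close = np.array([3.5, 4.5, 2.5, 6.5])          # every prediction off by 0.5
baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
poor = np.array([7.0, 2.0, 5.0, 3.0])           # worse than predicting the mean

print(r_squared(y_true, y_true))    # perfect predictions
print(r_squared(y_true, close))
print(r_squared(y_true, baseline))
print(r_squared(y_true, poor))
```

Perfect predictions give exactly 1, always predicting the mean gives 0, and the `poor` predictions give a negative value because their MSE is larger than the variance of the labels.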

These are just some of the performance metrics that are used to evaluate regression models. For more, refer to the [scikit-learn documentation on regression metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics).

## Training Error

To calculate the performance metrics above, we need data where the true labels are known. Where do we find such data? One natural source of labeled data is the training data, since we needed the true labels to be able to train a model.

For a $k$-nearest neighbors model, the training data is the data from which the $k$-nearest neighbors are selected. So to calculate the training RMSE, we do the following:

For each observation in the training data:
1. Find its $k$-nearest neighbors in the training data.
2. Average the labels of the $k$-nearest neighbors to obtain the predicted label.
3. Compare the predicted label to the true label.

At this point, we can average the square of these differences to obtain the MSE or average their absolute values to obtain the MAE.
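The steps above can be sketched by hand on a tiny made-up data set. The function `knn_training_predictions` and the arrays `X` and `y` are hypothetical and use plain Euclidean distance on a single numeric feature; this is not how scikit-learn implements the search, only an illustration of the procedure.

```python
import numpy as np

def knn_training_predictions(X, y, k):
    """Predict each training label from its k nearest training observations."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    preds = np.empty(len(y))
    for i in range(len(X)):
        # Step 1: find the k nearest neighbors within the training data.
        # Observation i is at distance 0 from itself, so it is always included.
        dists = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        # Step 2: average the neighbors' labels to get the predicted label.
        preds[i] = y[nearest].mean()
    return preds

# Hypothetical training data: one feature, four observations.
X = np.array([[0.0], [1.0], [3.0], [10.0]])
y = np.array([1.0, 2.0, 3.0, 10.0])

y_hat = knn_training_predictions(X, y, k=2)

# Step 3: compare the predicted labels to the true labels.
train_mse = ((y - y_hat) ** 2).mean()
train_mae = np.abs(y - y_hat).mean()
print(y_hat, train_mse, train_mae)
```

Because every training observation is one of its own neighbors, setting `k=1` in this sketch reproduces the training labels exactly, giving a training error of zero.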

Let's calculate the training MSE for a 6-nearest neighbors model like the one that we fit in Chapter 5.2.

In [5]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor

X_train = cars_temp[["Model Year", "Odometer Reading", "Make", "Model"]]
y_train = cars_temp["Sale Price"]

ct = make_column_transformer(
    (MinMaxScaler(), ["Model Year", "Odometer Reading"]),
    (OneHotEncoder(), ["Make", "Model"]),
    remainder="drop"  # all other columns in X will be dropped.
)

pipeline = make_pipeline(
    ct,
    KNeighborsRegressor(n_neighbors=6)
)

pipeline.fit(X=X_train, y=y_train)


Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('minmaxscaler',
                                                  MinMaxScaler(),
                                                  ['Model Year',
                                                   'Odometer Reading']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['Make', 'Model'])])),
                ('kneighborsregressor', KNeighborsRegressor(n_neighbors=6))])

To calculate the training error, we need the model's predictions on the training data.

In [7]:
# Calculate the model predictions on the training data.
y_train_ = pipeline.predict(X=X_train)
y_train_

array([33905.        , 14006.66666667, 39825.        , ...,
       60923.33333333, 48513.33333333, 60923.33333333])

Finally, we compare the predictions `y_train_` (note the trailing underscore) to the true labels `y_train`, which are known, since this is the training data.

In [8]:
# Calculate the mean-squared error.
mse = ((y_train - y_train_) ** 2).mean()
mse

88651768.96927935

We could have also used a scikit-learn function to calculate mean-squared error. The scikit-learn functions for the performance metrics discussed in this chapter are shown in the table below. All of these functions take a 1D-array of the true labels as the first parameter and a 1D-array of the predicted labels as the second.

| Metric | Function Name |
|--------|---------------|
| MSE | `mean_squared_error` |
| MAE | `mean_absolute_error` |
| $R^2$ | `r2_score` |
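As a quick illustration of the calling convention, the sketch below applies all three functions to small made-up arrays (`y_true` and `y_pred` are hypothetical, not the cars data). The RMSE is obtained here by taking the square root of the MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical true labels and predictions, made up for illustration.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([3.5, 4.5, 2.5, 6.5])   # every prediction off by 0.5

# True labels go first, predicted labels second.
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(mse, rmse, mae, r2)
```

The argument order does not affect MSE or MAE, but it does affect `r2_score`, because the variance in the denominator is computed from the true labels.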

In [9]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_train, y_train_)

88651768.96927956

To obtain a measure of error that is more interpretable, we can take the square root to obtain the RMSE.

In [10]:
rmse = np.sqrt(mse)
rmse

9415.506835496395

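The RMSE says that the model's predictions of sale price are off by about 9,400 for a typical car in the training data. For context, the data cleaning step kept only sale prices between 100 and 300,000, so an error of this size is substantial for inexpensive cars and modest for expensive ones.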