A sequence of data processing components is called a data _pipeline_.
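As a minimal sketch of the idea, a pipeline can be modelled as a list of functions where each component consumes the output of the previous one. The component names and the record fields (`rooms`, `households`) below are hypothetical and chosen only for illustration.

```python
from functools import reduce


def drop_missing(rows):
    """Component 1: discard records that contain a missing value."""
    return [r for r in rows if None not in r.values()]


def add_rooms_per_household(rows):
    """Component 2: derive a new feature from two existing ones."""
    return [{**r, "rooms_per_household": r["rooms"] / r["households"]} for r in rows]


def run_pipeline(data, components):
    """Feed the output of each component into the next one, in order."""
    return reduce(lambda acc, component: component(acc), components, data)


# Hypothetical input records; the second one has a missing value.
rows = [
    {"rooms": 880.0, "households": 126.0},
    {"rooms": None, "households": 1138.0},
]

result = run_pipeline(rows, [drop_missing, add_rooms_per_household])
print(result)
```

Because each component only depends on the data it receives, components can be developed, replaced, or tested independently.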

- - -

A typical performance measure for regression problems is the Root Mean Square Error (RMSE).

$$ \text{RMSE}(\vec{X}, h) = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \left( h(\vec{x}^{(i)}) - y^{(i)} \right)^2 } $$

where $h$ is the prediction function (the _hypothesis_), so that, given the vector $\vec{x}^{(i)}$, the prediction is $\hat{y}^{(i)} = h(\vec{x}^{(i)})$.

Suppose, for example, that there are many outlier districts. In that case, you may consider using the Mean Absolute Error (MAE) instead:

$$ \text{MAE}(\vec{X}, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(\vec{x}^{(i)}) - y^{(i)} \right| $$
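The two formulas translate almost directly into NumPy. The sketch below assumes the predictions $h(\vec{x}^{(i)})$ have already been computed; the function names and the sample values are hypothetical, with the last prediction deliberately far off to mimic an outlier.

```python
import numpy as np


def rmse(y_pred, y_true):
    """Root Mean Square Error: square root of the mean squared difference."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


def mae(y_pred, y_true):
    """Mean Absolute Error: mean of the absolute differences."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(err)))


# Hypothetical labels and predictions; the last prediction misses by 10.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 6.0, 3.0, 17.0])

mae_value = mae(y_pred, y_true)
rmse_value = rmse(y_pred, y_true)
print("MAE  =", mae_value)
print("RMSE =", rmse_value)
```

With these values the single large miss pulls RMSE well above MAE, which is why MAE may be preferred when outliers are common.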

Particularities of the two metrics ([ref](https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d)):

```python
import numpy as np
import pandas as pd
from IPython.display import display

N = 5
np.random.seed(13)

# evenly distributed errors (all very close to 2)
_labels = ['ERROR']
_data   = 2 + 0.0001 * np.random.randn(N,1)
df      = pd.DataFrame.from_records(_data, columns=_labels)
df['|ERROR|'] = abs(df['ERROR'])
df['ERROR^2'] = df['ERROR']**2
display( df )
print( 'MAE  = ', 1/len(df) * df['|ERROR|'].sum() )
print( 'RMSE = ', np.sqrt(1/len(df) * df['ERROR^2'].sum()) )

print()

# small variance in errors
_labels = ['ERROR']
_data   = 2 + 1 * np.random.randn(N,1)
df      = pd.DataFrame.from_records(_data, columns=_labels)
df['|ERROR|'] = abs(df['ERROR'])
df['ERROR^2'] = df['ERROR']**2
display( df )
print( 'MAE  = ', 1/len(df) * df['|ERROR|'].sum() )
print( 'RMSE = ', np.sqrt(1/len(df) * df['ERROR^2'].sum()) )

# high variance in errors
_labels = ['ERROR']
_data   = 2 + 13 * np.random.randn(N,1)
df      = pd.DataFrame.from_records(_data, columns=_labels)
df['|ERROR|'] = abs(df['ERROR'])
df['ERROR^2'] = df['ERROR']**2
display( df )
print( 'MAE  = ', 1/len(df) * df['|ERROR|'].sum() )
print( 'RMSE = ', np.sqrt(1/len(df) * df['ERROR^2'].sum()) )

# outliers: small-variance errors plus one large error of 37
_labels = ['ERROR']
_data   = 2 + 1 * np.random.randn(N,1)
df      = pd.DataFrame.from_records(_data, columns=_labels)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, pd.DataFrame.from_records([[37.0]], columns=_labels)])
df['|ERROR|'] = abs(df['ERROR'])
df['ERROR^2'] = df['ERROR']**2
display( df )
print( 'MAE  = ', 1/len(df) * df['|ERROR|'].sum() )
print( 'RMSE = ', np.sqrt(1/len(df) * df['ERROR^2'].sum()) )
```

Evenly distributed errors:

|   | ERROR | \|ERROR\| | ERROR^2 |
|---|---|---|---|
| 0 | 1.999929 | 1.999929 | 3.999715 |
| 1 | 2.000075 | 2.000075 | 4.000302 |
| 2 | 1.999996 | 1.999996 | 3.999982 |
| 3 | 2.000045 | 2.000045 | 4.000181 |
| 4 | 2.000135 | 2.000135 | 4.000538 |

MAE = 2.0000358757337096, RMSE = 2.0000358769574595



Small variance in errors:

|   | ERROR | \|ERROR\| | ERROR^2 |
|---|---|---|---|
| 0 | 2.532338 | 2.532338 | 6.412735 |
| 1 | 3.350188 | 3.350188 | 11.223759 |
| 2 | 2.861211 | 2.861211 | 8.186531 |
| 3 | 3.478686 | 3.478686 | 12.101254 |
| 4 | 0.954623 | 0.954623 | 0.911305 |

MAE = 2.6354091538167514, RMSE = 2.7869547522108777


High variance in errors:

|   | ERROR | \|ERROR\| | ERROR^2 |
|---|---|---|---|
| 0 | -8.256857 | 8.256857 | 68.175693 |
| 1 | -14.400877 | 14.400877 | 207.385267 |
| 2 | 9.317008 | 9.317008 | 86.806642 |
| 3 | -1.163241 | 1.163241 | 1.35313 |
| 4 | 13.878629 | 13.878629 | 192.616347 |

MAE = 9.403322651971914, RMSE = 10.548337112966182


Outliers:

|   | ERROR | \|ERROR\| | ERROR^2 |
|---|---|---|---|
| 0 | 2.317351 | 2.317351 | 5.370115 |
| 1 | 2.127303 | 2.127303 | 4.525419 |
| 2 | 4.150383 | 4.150383 | 17.225679 |
| 3 | 2.606289 | 2.606289 | 6.792741 |
| 4 | 1.973228 | 1.973228 | 3.89363 |
| 0 | 37.0 | 37.0 | 1369.0 |

MAE = 8.362425696205714, RMSE = 15.312345694489881


The conclusion here **is not** that RMSE increases with the variance of the errors! RMSE increases with the variance of the distribution of error magnitudes!

Another conclusion is that $$ \text{MAE} \leq \text{RMSE} $$
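A short justification of the inequality, writing $e_i = h(\vec{x}^{(i)}) - y^{(i)}$ for the individual errors (a shorthand introduced here only for brevity) and taking the variance over the $m$ samples:

$$ \text{RMSE}^2 - \text{MAE}^2 = \frac{1}{m} \sum_{i=1}^{m} |e_i|^2 - \left( \frac{1}{m} \sum_{i=1}^{m} |e_i| \right)^2 = \operatorname{Var}(|e|) \geq 0 $$

So the gap between the two metrics is exactly the variance of the error magnitudes, and equality holds only when every error has the same magnitude. This matches the first experiment, where all errors are close to 2 and MAE and RMSE nearly coincide.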

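In summary:
+ MAE is better than RMSE because RMSE does not describe the average error alone; it also reflects how spread out the error magnitudes are.
+ RMSE is better than MAE because it does not involve taking the absolute value.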

- - -

On **RMSE vs MSE**, where essentially $\text{RMSE} = \sqrt{\text{MSE}}$:

> So even though RMSE and MSE are really similar in terms of models scoring, **they can be not immediately interchangeable for gradient based methods**. We will probably need to adjust some parameters like the learning rate.

This is because the derivative of one with respect to the prediction is quite different from that of the other.
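To make the difference concrete: by the chain rule, $\partial \text{RMSE} / \partial \hat{y}_i = \frac{1}{2\,\text{RMSE}} \, \partial \text{MSE} / \partial \hat{y}_i$, so the two gradients point in the same direction but differ by a factor that changes as the error shrinks. The sketch below checks this numerically; the function names and sample values are hypothetical.

```python
import numpy as np


def mse_grad(y_pred, y_true):
    """Gradient of MSE with respect to each prediction: 2/m * (y_pred - y_true)."""
    return 2.0 * (y_pred - y_true) / y_true.size


def rmse_grad(y_pred, y_true):
    """Gradient of RMSE with respect to each prediction (requires RMSE > 0)."""
    err = y_pred - y_true
    rmse = np.sqrt(np.mean(err ** 2))
    return err / (y_true.size * rmse)


# Hypothetical labels and predictions.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 6.0, 3.0, 17.0])

g_mse = mse_grad(y_pred, y_true)
g_rmse = rmse_grad(y_pred, y_true)

# Scaling factor 1 / (2 * RMSE) that relates the two gradients.
scale = 1.0 / (2.0 * np.sqrt(np.mean((y_pred - y_true) ** 2)))

print("MSE gradient :", g_mse)
print("RMSE gradient:", g_rmse)
print("scale factor :", scale)
```

Since the factor depends on the current RMSE, switching between the two losses behaves like changing the learning rate during training, which is consistent with the advice in the quote.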

- - -

On $R^2$:

$$R^2 = 1 - \frac{\text{MSE}(\text{model})}{\text{MSE}(\text{baseline})} = 1 - \frac{ \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 }{ \frac{1}{N}\sum_{i=1}^{N} (y_i - \bar{y})^2 }$$

>  In conclusion, R² is the ratio between how good our model is vs how good is the naive mean model.
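The definition can be checked with a few lines of NumPy. This is a sketch under the assumption that the baseline is the model that always predicts the mean of the targets; the function name and the sample values are hypothetical.

```python
import numpy as np


def r2_from_mse(y_true, y_pred):
    """R^2 as 1 - MSE(model) / MSE(baseline), baseline = mean of y_true.

    Undefined when y_true is constant, because the baseline MSE is then zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse_model = np.mean((y_true - y_pred) ** 2)
    mse_baseline = np.mean((y_true - y_true.mean()) ** 2)
    return float(1.0 - mse_model / mse_baseline)


# Hypothetical targets, a reasonably good model, and the naive mean model.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_good = np.array([2.5, 5.5, 2.0, 6.5])
y_mean = np.full_like(y_true, y_true.mean())

r2_good = r2_from_mse(y_true, y_good)
r2_mean = r2_from_mse(y_true, y_mean)
print("R^2 (good model):", r2_good)
print("R^2 (mean model):", r2_mean)
```

The naive mean model scores 0 by construction, a perfect model scores 1, and a model worse than the mean gets a negative value.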

- - -

Working through the concepts of the chapter "End-to-End Machine Learning Project"

> pretending to be a recently hired data scientist in a real estate company

on a different problem: [New York City Taxi Trip Duration](https://www.kaggle.com/c/nyc-taxi-trip-duration/overview/description)

> In this competition, Kaggle is challenging you to build a model that predicts the total ride duration of taxi trips in New York City. 

- - -