# Loss Functions

In this exercise, you will compare the effects of different loss functions on a linear regression model.

👇 Import the data from the attached CSV file

In [0]:
import pandas as pd

data = pd.read_csv("data.csv")

data.sample(5)

| Unnamed: 0 | Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Glazing Area | Average Temperature |
|---|---|---|---|---|---|---|---|
| 713 | 0.64 | 784.0 | 343.0 | 220.5 | 3.5 | 0.4 | 20.335 |
| 429 | 0.62 | 808.5 | 367.5 | 220.5 | 3.5 | 0.25 | 14.65 |
| 454 | 0.76 | 661.5 | 416.5 | 122.5 | 7.0 | 0.25 | 36.93 |
| 189 | 0.62 | 808.5 | 367.5 | 220.5 | 3.5 | 0.1 | 13.43 |
| 224 | 0.69 | 735.0 | 294.0 | 220.5 | 3.5 | 0.1 | 12.735 |


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on their climatic needs. 

🌿 You know that plants can handle small temperature variations, but are exponentially more sensitive as the temperature variations increase. 

## 1. Theory 

❓ Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> 🆘 Answer </summary>
    
In theory, you would use a Mean Squared Error (MSE) loss function. Because it squares each error, it penalizes large prediction errors far more heavily than small ones and pushes the model away from big mistakes. This keeps temperature deviations smaller and lowers the risk for the plants.

</details>
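
To make the argument concrete, here is a minimal sketch comparing how much a single large error weighs in each loss. The error values are hypothetical and chosen only for illustration; they do not come from `data.csv`.

```python
import numpy as np

# Hypothetical prediction errors in °C (illustrative values, not from data.csv)
errors = np.array([0.5, 1.0, 2.0, 8.0])

squared = errors ** 2        # per-sample contribution to the MSE
absolute = np.abs(errors)    # per-sample contribution to the MAE

# Share of the total loss that comes from the single largest error
share_mse = squared.max() / squared.sum()
share_mae = absolute.max() / absolute.sum()

print(f"largest error's share of the squared loss:  {share_mse:.2f}")
print(f"largest error's share of the absolute loss: {share_mae:.2f}")
```

The largest error takes up a bigger share of the squared loss than of the absolute loss, so a model minimizing MSE has a stronger incentive to shrink it.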

## 2. Application

### 2.1 Preprocessing

👇 Scale the features

In [0]:
from sklearn.preprocessing import StandardScaler

# Select only the features 
X = data.loc[:,'Relative Compactness':'Glazing Area']

# Fit scaler
scaler = StandardScaler().fit(X)

# Scale continuous features 
X_scaled = scaler.transform(X)
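
As a rough illustration of what `StandardScaler` does to each column, the sketch below standardizes a tiny array by hand with NumPy. The array `X_demo` is a hypothetical stand-in using two columns of three rows from the sample shown earlier; it is not the full dataset.

```python
import numpy as np

# Hypothetical mini-dataset: (Relative Compactness, Surface Area) for three rows
X_demo = np.array([
    [0.62, 808.5],
    [0.76, 661.5],
    [0.69, 735.0],
])

# Standardization: subtract the column mean, divide by the column standard deviation
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)  # population standard deviation (ddof=0)
X_demo_scaled = (X_demo - mean) / std

print(X_demo_scaled.mean(axis=0))  # each column is centered near 0
print(X_demo_scaled.std(axis=0))   # each column has unit standard deviation
```

Scaling matters here because SGD updates are sensitive to feature magnitudes: without it, a column such as Surface Area (hundreds) would dominate one such as Relative Compactness (below 1).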

### 2.2 Modelling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

👇 **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)



In [0]:
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.linear_model import SGDRegressor

# Squared loss SGD Regressor
# ("squared_error" replaces "squared_loss", which recent scikit-learn versions no longer accept)
sgd_model = SGDRegressor(loss="squared_error")

# Cross Validate Model
sgd_model_cv = cross_validate(sgd_model, 
                              X_scaled, 
                              data['Average Temperature'],
                              cv = 10, 
                              scoring = ['r2','max_error'] )
sgd_model_cv

{'fit_time': array([0.00632977, 0.00577617, 0.00542998, 0.00454187, 0.00380063,
        0.00429988, 0.00408602, 0.00363803, 0.00419927, 0.00407815]),
 'score_time': array([0.00091124, 0.00075698, 0.00128579, 0.00073528, 0.00094008,
        0.001441  , 0.00051618, 0.00055766, 0.00106072, 0.00074124]),
 'test_r2': array([0.78592507, 0.90950123, 0.89554103, 0.88459354, 0.93114979,
        0.89671654, 0.92755099, 0.9158792 , 0.89446858, 0.93930171]),
 'test_max_error': array([-9.8012985 , -8.66938798, -8.79611138, -9.21387885, -8.95029457,
        -8.59269276, -8.54365204, -8.83942373, -8.39240878, -7.77664584])}

👇 Compute 
- the mean cross-validated R2 score `r2`
- the single biggest prediction error in °C across all your folds `max_error`

(Tip: `max_error` is an accepted scoring metric in sklearn)
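
The `test_max_error` values in the cross-validation output are negative because scikit-learn scorers follow a "greater is better" convention, so error metrics are reported negated. The sketch below reproduces the idea with plain NumPy on hypothetical values for a single fold; it is not scikit-learn's own implementation.

```python
import numpy as np

# Hypothetical true and predicted temperatures for one fold (illustrative values)
y_true = np.array([20.0, 14.5, 37.0, 13.5])
y_pred = np.array([21.0, 14.0, 31.0, 13.0])

# Max error: the largest absolute gap between truth and prediction
fold_max_error = np.max(np.abs(y_true - y_pred))

# A "greater is better" scorer reports the negated error
fold_score = -fold_max_error

# Taking the absolute value recovers the error in °C
recovered = abs(fold_score)
print(fold_score, recovered)
```

This is why the per-fold scores need an absolute value (or a sign flip) before you take the maximum over folds.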

In [0]:
r2 = sgd_model_cv['test_r2'].mean()
r2

0.8979187069528661

In [0]:
max_error = abs(sgd_model_cv['test_max_error']).max()
max_error

9.901436629271469

### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

👇 **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>💡 Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>
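
One way to see which parameter to adjust: the epsilon-insensitive loss ignores errors smaller than `epsilon` and is linear beyond that, so with `epsilon` set to 0 it reduces to the absolute error. The helper below is a hypothetical NumPy re-implementation written for illustration, not scikit-learn's internal code.

```python
import numpy as np

def epsilon_insensitive(y_true, y_pred, epsilon):
    """Per-sample epsilon-insensitive loss: errors below epsilon cost nothing."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Hypothetical targets and predictions (illustrative values)
y_true = np.array([20.0, 14.5, 37.0])
y_pred = np.array([21.0, 14.0, 31.0])

abs_error = np.abs(y_true - y_pred)
loss_eps0 = epsilon_insensitive(y_true, y_pred, epsilon=0.0)  # same as the absolute error
loss_eps1 = epsilon_insensitive(y_true, y_pred, epsilon=1.0)  # errors up to 1 are ignored

print(abs_error, loss_eps0, loss_eps1)
```

Note that `SGDRegressor` defaults to `epsilon=0.1`, so the parameter has to be set explicitly to obtain an MAE-like objective.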

In [0]:
# MAE loss engineered by using the epsilon-insensitive loss with epsilon = 0
mae_model = SGDRegressor(loss="epsilon_insensitive", epsilon = 0)

# Cross Validate Model
mae_sgd = cross_validate(mae_model, 
                         X_scaled, 
                         data['Average Temperature'], 
                         cv = 10,  
                         scoring = ['r2','max_error'])

👇 Compute 
- the mean cross validated R2 score `r2_mae`
- the single biggest prediction error across all your folds `max_error_mae`

In [0]:
r2_mae = mae_sgd['test_r2'].mean()
r2_mae

0.8758482111487428

In [0]:
max_error_mae = abs(mae_sgd['test_max_error']).max()
max_error_mae

11.227205116496187

## 3. Conclusion

❓Which of the models you evaluated seems the most appropriate for your task?

<details>
<summary> 🆘Answer </summary>
    
Although the mean cross-validated R2 scores of the two models are similar (about 0.90 for MSE versus 0.88 for MAE), the model optimized on MAE produced a larger worst-case error (about 11.2 °C versus 9.9 °C). It is more likely to make a large mistake from time to time, which increases the risk of killing plants!

    
</details>
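
A possible intuition for this result is the gradient each loss sends back to the optimizer. The sketch below uses hypothetical residuals and the textbook derivatives of `r**2` and `|r|` with respect to the prediction; constant factors used inside any particular library may differ.

```python
import numpy as np

# Hypothetical residuals (prediction minus target), in °C
residuals = np.array([0.5, 2.0, 8.0])

# Derivative of the squared loss r**2: grows with the size of the error
grad_squared = 2 * residuals

# Derivative of the absolute loss |r| (for r != 0): same magnitude for every error
grad_absolute = np.sign(residuals)

print(grad_squared)
print(grad_absolute)
```

With the squared loss, the 8 °C error pulls on the weights sixteen times harder than the 0.5 °C error. With the absolute loss, both pull equally, so nothing specifically pushes the model to fix its worst predictions.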

# 🏁 Check your code

In [0]:
from nbresult import ChallengeResult

result = ChallengeResult('loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error,
    max_error_mae = max_error_mae,                     
)
result.write()
print(result.check())

platform darwin -- Python 3.8.5, pytest-6.1.1, py-1.9.0, pluggy-0.13.1 -- /Users/brunolajoie/.pyenv/versions/3.8.5/envs/lewagon502/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/brunolajoie/code/lewagon/data-solutions/05-ML/04-Under-the-hood/01-Loss-Functions
plugins: dash-1.18.1, anyio-2.0.2, pylint-0.17.0
collecting ... collected 3 items

tests/test_loss_functions.py::TestLossFunctions::test_max_error_order PASSED [ 33%]
tests/test_loss_functions.py::TestLossFunctions::test_r2_mae_order_of_magnitude PASSED [ 66%]
tests/test_loss_functions.py::TestLossFunctions::test_r2_order_of_magnitude PASSED [100%]



💯 You can commit your code:

[1;32mgit[39m add tests/loss_functions.pickle

[32mgit[39m commit -m [33m'Completed loss_functions step'[39m

[32mgit[39m push origin master
