# Loss Functions

In this exercise, you will compare how different loss functions affect a linear regression model.

👇 Import the data from the attached CSV file

In [4]:
# YOUR CODE HERE
import pandas as pd

# Load the greenhouse design dataset
data = pd.read_csv("data.csv")

# Preview the first five rows
data.head()

| Unnamed: 0 | Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Glazing Area | Average Temperature |
|---|---|---|---|---|---|---|---|
| 0 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.44 |
| 1 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.44 |
| 2 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.44 |
| 3 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.44 |
| 4 | 0.9 | 563.5 | 318.5 | 122.5 | 7.0 | 0.0 | 24.56 |


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on its climatic needs.

🌿 You know that plants can handle small temperature variations, but become exponentially more sensitive as those variations grow.

## 1. Theory 

❓ Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> 🆘 Answer </summary>
    
In theory, you would use a Mean Squared Error (MSE) loss function. Because it squares each error, it penalizes large errors far more heavily than small ones, which discourages the model from making large mistakes. This keeps the temperature deviations smaller and lowers the risk for plants.

</details>

> YOUR ANSWER HERE
Mean Squared Error (MSE): squaring the errors penalizes large deviations far more than small ones, which limits the magnitude of the worst predictions.
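
To see why the squared loss suits this task, here is a minimal sketch comparing MSE and MAE on two hypothetical sets of prediction errors (the numbers are made up for illustration and are not taken from the dataset). Both sets have the same MAE, but MSE scores the one containing a single large error as much worse.

```python
def mse(errors):
    """Mean of the squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)


def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical errors in °C: four moderate misses vs. one large miss
steady_errors = [2.0, 2.0, 2.0, 2.0]
spiky_errors = [0.0, 0.0, 0.0, 8.0]

print(mae(steady_errors), mae(spiky_errors))  # 2.0 2.0
print(mse(steady_errors), mse(spiky_errors))  # 4.0 16.0
```

A model trained on MSE therefore has a stronger incentive to avoid the single 8 °C miss than a model trained on MAE.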

## 2. Application

### 2.1 Preprocessing

👇 Scale the features

In [7]:
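# YOUR CODE HERE
from sklearn.preprocessing import RobustScaler

# Features: every column except the target
# (note that this keeps the 'Unnamed: 0' index column read from the CSV)
x = data.drop(columns=['Average Temperature'])

# RobustScaler centers on the median and scales by the interquartile range
X_scaler = RobustScaler()
X_scaled = X_scaler.fit_transform(x)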
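
As a rough illustration of what `RobustScaler` does with its default settings: it subtracts each feature's median and divides by its interquartile range (IQR), so a few extreme values barely shift the scale. The sketch below reproduces that arithmetic in plain Python on a made-up column; the `robust_scale` helper is hypothetical and is not part of sklearn.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Center on the median and divide by the interquartile range (Q3 - Q1)."""
    # method="inclusive" interpolates linearly between the data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    center = median(values)
    return [(v - center) / (q3 - q1) for v in values]


# Made-up feature column with one extreme value
column = [1.0, 2.0, 3.0, 4.0, 100.0]
print(robust_scale(column))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```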

### 2.2 Modelling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

👇 **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)
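
For intuition, 10-fold cross-validation splits the rows into 10 folds; each fold serves once as the test set while the model is fitted on the other 9. Below is a minimal sketch of such a split without shuffling, in plain Python. The `k_fold_indices` helper and the 25-row example are hypothetical and only illustrate the idea.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    # The first n_samples % n_folds folds receive one extra sample
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


fold_sizes = [len(test) for _, test in k_fold_indices(25, 10)]
print(fold_sizes)  # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```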



In [47]:
# YOUR CODE HERE
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_validate

y = data['Average Temperature']

# SGDRegressor minimizes the squared error (MSE) loss by default
model = SGDRegressor()

# 10-fold cross-validation, scoring the max error and R2 of each fold
resc_score = cross_validate(model, X_scaled, y, cv=10,
                            scoring=['max_error', 'r2'])

# Mean cross-validated R2 score
r2 = resc_score['test_r2'].mean()
resc_score


{'fit_time': array([0.02666593, 0.021595  , 0.01657295, 0.01668787, 0.01472378,
        0.01520777, 0.01352501, 0.01465988, 0.01483488, 0.01324105]),
 'score_time': array([0.00152397, 0.00096107, 0.00088811, 0.00080991, 0.00061822,
        0.00057125, 0.00055695, 0.00069499, 0.00058603, 0.00054097]),
 'test_max_error': array([-9.50642347, -8.98205091, -9.1526795 , -9.55360327, -9.25266251,
        -8.92755672, -8.89038019, -9.13852207, -8.72436762, -8.02627166]),
 'test_r2': array([0.78344094, 0.90710576, 0.89371679, 0.88144367, 0.93132852,
        0.89670093, 0.92827084, 0.91598622, 0.89625941, 0.93893261])}
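
The `test_max_error` values above are negative because sklearn scorers follow a "greater is better" convention, so error metrics are reported negated. The small sketch below reuses the fold scores printed above to turn them back into errors in °C, and shows how the mean of the per-fold maxima differs from the single biggest error.

```python
# Per-fold scores copied from the cross-validation output above
fold_scores = [-9.50642347, -8.98205091, -9.1526795, -9.55360327, -9.25266251,
               -8.92755672, -8.89038019, -9.13852207, -8.72436762, -8.02627166]

# Undo the sign flip to get the largest error of each fold
fold_max_errors = [abs(score) for score in fold_scores]

mean_of_fold_maxima = sum(fold_max_errors) / len(fold_max_errors)
single_biggest_error = max(fold_max_errors)

print(round(mean_of_fold_maxima, 2))   # 9.02
print(round(single_biggest_error, 2))  # 9.55
```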

👇 Compute:
- the mean cross-validated R2 score, `r2`
- the single biggest prediction error (in °C) across all your folds, `max_error`

(Tip: `max_error` is an accepted scoring metric in sklearn)
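
As a reminder of what the R2 score measures, here is a minimal sketch of its usual definition: 1 minus the ratio of the residual sum of squares to the total sum of squares. The `r2_score_sketch` helper and the temperatures are made up for illustration.

```python
def r2_score_sketch(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical temperatures in °C
temperatures = [18.0, 20.0, 22.0, 24.0]

print(r2_score_sketch(temperatures, temperatures))  # 1.0 (perfect predictions)
print(r2_score_sketch(temperatures, [21.0] * 4))    # 0.0 (always predicting the mean)
```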

In [48]:
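# YOUR CODE HERE
# sklearn reports max_error scores negated, so take absolute values;
# the single biggest error is the maximum across folds, not the mean
max_error = abs(resc_score['test_max_error']).max()
print(f"{max_error:.2f}")


9.55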

### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?
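
One way to see the difference during gradient descent: the gradient of the squared error grows with the size of the error, while the gradient of the absolute error depends only on its sign. A minimal sketch for a single prediction error (the helper names are hypothetical):

```python
def squared_error_gradient(error):
    """Derivative of error ** 2 with respect to error."""
    return 2 * error


def absolute_error_gradient(error):
    """Derivative of abs(error); 0 is used at error == 0 by convention."""
    return (error > 0) - (error < 0)


# Compare the pull of small and large errors (in °C) under each loss
for error in [0.5, 2.0, 10.0]:
    print(error, squared_error_gradient(error), absolute_error_gradient(error))
```

Under the squared loss, a 10 °C error pulls on the weights 20 times harder than a 0.5 °C error, whereas under the absolute loss both pull equally.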

👇 **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>💡 Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>
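
One possible reading of the hint, as a sketch: the `epsilon_insensitive` loss available in `SGDRegressor` ignores errors smaller than `epsilon` and grows linearly beyond it, so setting `epsilon` to 0 turns it into the plain absolute error. The function below is a hand-written illustration of that formula, not sklearn's implementation.

```python
def epsilon_insensitive(y_true, y_pred, epsilon):
    """Absolute error, with errors below epsilon counted as zero."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# With epsilon = 0 the loss is exactly the absolute error
print(epsilon_insensitive(20.0, 23.0, epsilon=0.0))   # 3.0

# With epsilon = 0.1 an error of about 0.05 is ignored entirely
print(epsilon_insensitive(20.0, 20.05, epsilon=0.1))  # 0.0
```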

In [43]:
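# YOUR CODE HERE
# With epsilon=0, the epsilon-insensitive loss equals the absolute error (MAE).
# Note: epsilon is left at its default (0.1) here, as in the run that produced
# the results below, so the loss is only approximately MAE.
model = SGDRegressor(loss="epsilon_insensitive")
co = cross_validate(model, X_scaled, y, cv=10,
                    scoring=['max_error', 'r2'])
co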

{'fit_time': array([0.02059078, 0.01515412, 0.01249886, 0.01048112, 0.00987005,
        0.00840497, 0.0076921 , 0.008389  , 0.00791812, 0.00867176]),
 'score_time': array([0.00176311, 0.0025239 , 0.00141525, 0.00118089, 0.00078297,
        0.0006752 , 0.00055909, 0.00060201, 0.00050902, 0.00071216]),
 'test_max_error': array([-13.42772052, -11.67835599, -11.89911909, -12.34841932,
        -12.60239486, -12.37672087, -12.26684416, -13.22819147,
        -12.6102108 , -12.11130556]),
 'test_r2': array([0.67479576, 0.81926097, 0.83669133, 0.79560386, 0.88964983,
        0.83181251, 0.88976513, 0.8591194 , 0.83980402, 0.91546335])}

👇 Compute:
- the mean cross-validated R2 score, `r2_mae`
- the single biggest prediction error (in °C) across all your folds, `max_error_mae`

In [46]:
# YOUR CODE HERE
# Mean cross-validated R2 score
r2_mae = co['test_r2'].mean()
# Single biggest error across folds (sklearn reports max_error scores negated)
max_error_mae = abs(co['test_max_error']).max()
print(f"{max_error_mae:.2f}", r2_mae)

13.43 0.8351966151860228


## 3. Conclusion

❓Which of the models you evaluated seems the most appropriate for your task?

<details>
<summary> 🆘Answer </summary>
    
Although the mean cross-validated R2 scores of the two models are roughly similar, the model optimized on the MAE is more likely to make large errors from time to time, which increases the risk of killing plants!

    
</details>

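> YOUR ANSWER HERE
The model trained on the Least Squares (MSE) loss. Its maximum error stays below 10 °C in every fold, while the MAE model's maximum error exceeds 11.5 °C in every fold.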

# 🏁 Check your code

In [49]:
from nbresult import ChallengeResult

result = ChallengeResult('loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error,
    max_error_mae = max_error_mae,                     
)
result.write()
print(result.check())


platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /Users/selmalopez/.pyenv/versions/lewagon_current/bin/python3
cachedir: .pytest_cache
rootdir: /Users/selmalopez/code/selmalopez/data-challenges/05-ML/04-Under-the-hood/01-Loss-Functions
plugins: dash-2.0.0, anyio-3.3.2
collecting ... collected 3 items

tests/test_loss_functions.py::TestLossFunctions::test_max_error_order PASSED [ 33%]
tests/test_loss_functions.py::TestLossFunctions::test_r2 PASSED          [ 66%]
tests/test_loss_functions.py::TestLossFunctions::test_r2_mae PASSED      [100%]



💯 You can commit your code:

git add tests/loss_functions.pickle

git commit -m 'Completed loss_functions step'

git push origin master
