# Loss Functions

In this exercise, you will compare the effect of different loss functions on a linear regression model.

👇 Let's download a CSV file to use for this challenge and parse it into a DataFrame.

In [1]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/04-Under-the-Hood/loss_functions_dataset.csv")
data.sample(5)

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Average Temperature
74,0.74,686.0,245.0,220.5,3.5,0.1,11.92
431,0.62,808.5,367.5,220.5,3.5,0.25,14.17
616,0.64,784.0,343.0,220.5,3.5,0.4,20.46
394,0.86,588.0,294.0,147.0,7.0,0.25,28.905
144,0.98,514.5,294.0,110.25,7.0,0.1,24.98


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on their climate needs. 

🌿 You know that plants can handle small temperature variations, but are exponentially more sensitive as the temperature variations increase. 

## 1. Theory 

❓ Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> 🆘 Answer </summary>
    
In theory, you would use a Mean Squared Error (MSE) loss function. It penalizes large errors much more heavily than small ones, which discourages the model from making large prediction errors. This keeps temperature deviations small and lowers the risk for plants.

</details>

Because plants tolerate small deviations but suffer disproportionately from large ones, large errors should be penalized more heavily than small ones. The **Mean Squared Error (MSE)** loss does exactly that: squaring each error makes the penalty grow quadratically with its magnitude, so training on MSE steers the model away from the large temperature errors that could kill plants.
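To make the difference concrete, here is a minimal sketch comparing the per-sample penalty under squared and absolute error. The error values are illustrative and not taken from the dataset. Note that squaring grows quadratically, not exponentially, so MSE is only a proxy for the plants' sensitivity.

```python
# Illustrative prediction errors in °C (hypothetical values, not from the dataset)
errors = [0.5, 1.0, 2.0, 4.0, 8.0]


def squared_penalty(error):
    """Per-sample contribution to the MSE."""
    return error ** 2


def absolute_penalty(error):
    """Per-sample contribution to the MAE."""
    return abs(error)


for e in errors:
    print(f"error={e:>4} °C  squared={squared_penalty(e):>6.2f}  absolute={absolute_penalty(e):>5.2f}")

# Doubling an error doubles the absolute penalty but quadruples the squared one
ratio_squared = squared_penalty(8.0) / squared_penalty(4.0)
ratio_absolute = absolute_penalty(8.0) / absolute_penalty(4.0)
print(ratio_squared, ratio_absolute)
```

A model trained on the squared penalty therefore has a much stronger incentive to shrink its largest errors than one trained on the absolute penalty.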

## 2. Application

### 2.1 Preprocessing

❓ Standardise the features

In [2]:
from sklearn.preprocessing import StandardScaler

# Initialize the standard scaler
scaler = StandardScaler()

# Extract the feature columns
features = data.drop(columns=['Average Temperature'])

# Fit the scaler on the features and transform them
scaled_features = scaler.fit_transform(features)

# Create a new DataFrame with the scaled features
scaled_features_df = pd.DataFrame(scaled_features, columns=features.columns)

# Display the first few rows of the scaled dataset
scaled_features_df.head()


Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area
0,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.760447
1,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.760447
2,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.760447
3,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.760447
4,1.284979,-1.229239,0.0,-1.198678,1.0,-1.760447


### 2.2 Modeling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)



In [3]:
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
import numpy as np

# Define the target variable (labels)
target = data['Average Temperature']

# Initialize the SGD Regressor with the squared error (MSE) loss
sgd_regressor_mse = SGDRegressor(loss='squared_error')

# Perform 10-Fold cross-validation
cv_scores_mse = cross_val_score(sgd_regressor_mse, scaled_features, target, cv=10, scoring='neg_mean_squared_error')

# Convert negative MSE to positive and calculate the mean
mean_cv_score_mse = -np.mean(cv_scores_mse)

mean_cv_score_mse


9.062743691028787

❓ Compute:
- the mean cross-validated R2 score, and save it in the variable `r2`
- the single biggest prediction error in °C across all your folds, and save it in the variable `max_error_celsius`

(Tip: `max_error` is an accepted scoring metric in sklearn)

In [4]:
from sklearn.metrics import make_scorer, max_error

# Define the scorer for max error
max_error_scorer = make_scorer(max_error, greater_is_better=False)

# Perform 10-fold cross-validation for R2 score
cv_scores_r2 = cross_val_score(sgd_regressor_mse, scaled_features, target, cv=10, scoring='r2')

# Perform 10-fold cross-validation for max error
cv_scores_max_error = cross_val_score(sgd_regressor_mse, scaled_features, target, cv=10, scoring=max_error_scorer)

# Calculate the mean R2 score
r2 = np.mean(cv_scores_r2)

# Calculate the maximum prediction error across all folds (make positive)
max_error_celsius = -np.min(cv_scores_max_error)

print(f"The mean cross-validated R² score of the model is: {r2:.4f}")
print(f"The largest prediction error across all folds is: {max_error_celsius:.2f}°C")

r2, max_error_celsius

The mean cross-validated R² score of the model is: 0.8983
The largest prediction error across all folds is: 9.84°C


(0.8983361912678971, 9.838834690046092)
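The `-np.min(...)` step above relies on scikit-learn's sign convention: a scorer built with `greater_is_better=False` returns the negated metric, so that higher is always better. The sketch below uses made-up per-fold scores (not results from this notebook) to show why the largest error is recovered by negating the minimum, not the maximum.

```python
# Hypothetical per-fold scores, as a scorer built with greater_is_better=False
# would return them: each value is that fold's max error, negated
fold_scores = [-6.2, -9.8, -4.1, -7.5]

# The most negative score belongs to the fold with the worst single error
largest_error = -min(fold_scores)

# Negating the maximum instead gives the smallest of the per-fold max errors
smallest_fold_max = -max(fold_scores)

print(f"Largest error across folds: {largest_error} °C")
print(f"Smallest per-fold max error: {smallest_fold_max} °C")
```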

### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>💡 Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>

In [5]:
# Initialize the SGD Regressor with the "epsilon_insensitive" loss; with epsilon=0 it equals the absolute error (MAE)
sgd_regressor_mae = SGDRegressor(loss='epsilon_insensitive', epsilon=0)

# Perform 10-Fold cross-validation with 'neg_mean_absolute_error' scoring
cv_scores_mae = cross_val_score(sgd_regressor_mae, scaled_features, target, cv=10, scoring='neg_mean_absolute_error')

# Convert negative MAE to positive and calculate the mean
mean_cv_score_mae = -np.mean(cv_scores_mae)

mean_cv_score_mae


2.2881477831885184
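The reason `epsilon=0` yields an MAE loss is that the epsilon-insensitive loss ignores errors smaller than `epsilon` and grows linearly beyond that. A minimal sketch of the per-sample loss, written by hand with hypothetical temperatures rather than taken from scikit-learn's internals:

```python
def epsilon_insensitive(y_true, y_pred, epsilon):
    """Errors smaller than epsilon cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical target and predictions in °C
y_true = 20.0
predictions = [19.95, 20.5, 23.0]

for y_pred in predictions:
    with_margin = epsilon_insensitive(y_true, y_pred, epsilon=0.1)
    no_margin = epsilon_insensitive(y_true, y_pred, epsilon=0.0)
    print(f"pred={y_pred:>5}  loss(eps=0.1)={with_margin:.2f}  loss(eps=0)={no_margin:.2f}")

# With epsilon=0 the loss is exactly the absolute error
matches_absolute_error = all(
    epsilon_insensitive(y_true, p, 0.0) == abs(y_true - p) for p in predictions
)
print(matches_absolute_error)
```

With a non-zero margin, the first prediction (off by 0.05 °C) incurs no loss at all, which is why the margin has to be set to zero to recover the MAE.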

❓ Compute:
- the mean cross-validated R2 score, and store it in `r2_mae`
- the single biggest prediction error across all your folds, and store it in `max_error_mae`

In [6]:
# Perform 10-Fold cross-validation for R² score
cv_scores_r2_mae = cross_val_score(sgd_regressor_mae, scaled_features, target, cv=10, scoring='r2')

# Perform 10-Fold cross-validation for max error
cv_scores_max_error_mae = cross_val_score(sgd_regressor_mae, scaled_features, target, cv=10, scoring=max_error_scorer)

# Calculate the mean R² score
r2_mae = np.mean(cv_scores_r2_mae)

# Calculate the maximum prediction error across all folds (make positive)
max_error_mae = -np.min(cv_scores_max_error_mae)

print(f"Mean cross-validated R² score: {r2_mae:.4f}")
print(f"Single biggest prediction error: {max_error_mae:.2f}°C")

r2_mae, max_error_mae

Mean cross-validated R² score: 0.8764
Single biggest prediction error: 11.19°C


(0.8764332071780456, 11.189636433776172)

## 3. Conclusion

❓Which of the models you evaluated seems the most appropriate for your task?

<details>
<summary> 🆘Answer </summary>
    
Although the mean cross-validated R² scores of the two models are similar, the model optimized on the MAE is more likely to make an occasional large error, which increases the risk of killing plants!

    
</details>

To determine which model is most appropriate for predicting greenhouse temperature:

1. **MSE-Optimized Model**:
   - Mean R² Score: **0.8983**
   - Biggest Prediction Error: **9.84°C**

2. **MAE-Optimized Model**:
   - Mean R² Score: **0.8764**
   - Biggest Prediction Error: **11.19°C**

**Analysis**:

- **Accuracy**: The MSE-optimized model has a higher mean R² score, indicating it generally fits the data better than the MAE-optimized model.
- **Biggest Error**: The MSE model also has a smaller largest prediction error compared to the MAE model.

**Conclusion**:

The model optimized with the MSE loss function appears to be more suitable. It has a higher R² score, meaning it explains more variance in the data, and its prediction errors are less extreme, which is important given the exponential sensitivity of plants to temperature variations.

# 🏁 Check your code and push your notebook

In [7]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error_celsius,
    max_error_mae = max_error_mae
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/ramzimalhas/.pyenv/versions/3.10.6/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /Users/ramzimalhas/code/ramzimalhas/05-ML/04-Under-the-hood/data-loss-functions/tests
plugins: asyncio-0.19.0, anyio-3.7.1, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 3 items

test_loss_functions.py::TestLossFunctions::test_max_error_order PASSED   [ 33%]
test_loss_functions.py::TestLossFunctions::test_r2 PASSED                [ 66%]
test_loss_functions.py::TestLossFunctions::test_r2_mae PASSED            [100%]



💯 You can commit your code:

git add tests/loss_functions.pickle

git commit -m 'Completed loss_functions step'

git push origin master

