# Deep Learning - Exercise 2

This lecture is an introduction to using artificial neural networks (ANNs) for regression tasks.

We will apply our models to the [Auto MPG](https://archive.ics.uci.edu/ml/datasets/auto+mpg) dataset.

This dataset contains the fuel consumption of several vehicles in miles per gallon. Our goal is to predict the fuel efficiency of the vehicles from the provided data.

**Core Concepts**
* ⛽ Regression task of predicting fuel consumption
* 💾 Auto MPG dataset from UCI Machine Learning Repository
* 🚗 Predicting fuel efficiency of vehicles
* 🧪 Using provided data to train ANN regression models

[Open in Google Colab](https://colab.research.google.com/github/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/dl_02.ipynb)
[Download from GitHub](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/dl_02.ipynb)

##### Remember to set **GPU** runtime in Colab!

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # plotting
import seaborn as sns # plotting
import tensorflow as tf
import tensorflow.keras as keras
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

tf.version.VERSION

In [None]:
def mean_absolute_percentage_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Computes MAPE in percent (undefined if y_true contains zeros)."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


def symetric_mean_absolute_percentage_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Computes SMAPE in percent."""
    return np.mean(np.abs((y_pred - y_true) / ((np.abs(y_true) + np.abs(y_pred)) / 2.0))) * 100


def compute_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Computes MAE, MSE, MAPE, SMAPE, R2 from a DataFrame with 'y_true' and 'y_pred' columns."""
    y_true, y_pred = df['y_true'].values, df['y_pred'].values
    return compute_metrics_raw(y_true, y_pred)


def compute_metrics_raw(y_true: np.ndarray, y_pred: np.ndarray) -> pd.DataFrame:
    """Computes MAE, MSE, MAPE, SMAPE, R2 from raw arrays and returns a one-row DataFrame."""
    mae = mean_absolute_error(y_true=y_true, y_pred=y_pred)
    mse = mean_squared_error(y_true=y_true, y_pred=y_pred)
    mape = mean_absolute_percentage_error(y_true=y_true, y_pred=y_pred)
    smape = symetric_mean_absolute_percentage_error(y_true=y_true, y_pred=y_pred)
    r2 = r2_score(y_true=y_true, y_pred=y_pred)
    return pd.DataFrame.from_records([{'MAE': mae, 'MSE': mse, 'MAPE': mape, 'SMAPE': smape, 'R2': r2}], index=[0])

In [None]:
def show_history(history):
    plt.figure()
    for key in history.history.keys():
        plt.plot(history.epoch, history.history[key], label=key)
    plt.legend()
    plt.tight_layout()

In [None]:
def show_history_loss(history):
    plt.figure()
    for key in history.history.keys():
        if 'loss' not in key:
            continue
        plt.plot(history.epoch, history.history[key], label=key)
    plt.legend()
    plt.tight_layout()

## 🤔 Questions to explore before we dive in! 

1️⃣ **Regression vs Classification**
* What is the key difference between predicting a continuous value vs assigning a category?
* Can you think of real examples where regression would be more appropriate than classification?

2️⃣ **Solving Regression Tasks**
* What steps would you include in your ML pipeline for regression?
* Which model architecture would you choose and why?
* How would you measure if your predictions are good?

3️⃣ **ANN vs Linear Regression**
* What makes neural networks more powerful than simple linear models?
* When would the added complexity of an ANN be worth it?


# Load the dataset first

## Dataset info
* Number of Instances: 398
* Number of Attributes: 9 including the class attribute

**Attribute Information:**

| # | Feature | Type | Description |
|---|---------|------|-------------|
| 1 | mpg | continuous | Miles per gallon (higher = better) |
| 2 | cylinders | discrete | Number of engine cylinders |
| 3 | displacement | continuous | Engine displacement volume |
| 4 | horsepower | continuous | Engine power output |
| 5 | weight | continuous | Vehicle weight |
| 6 | acceleration | continuous | Time to accelerate 0-60 mph |
| 7 | model year | discrete | Year of manufacture |
| 8 | origin | discrete | Manufacturing region |
| 9 | car name | string | Unique vehicle identifier |


* Missing Attribute Values:  horsepower has 6 missing values

In [None]:
url = 'https://raw.githubusercontent.com/rasvob/VSB-FEI-Deep-Learning-Exercises/main/datasets/auto-mpg.csv'
rel_path = 'datasets/auto-mpg.csv'
df = pd.read_csv(url, na_values='?', sep=';')

In [None]:
df

## Check missing values

In [None]:
df.isna().sum()

## 📊 Exploring the Data Visually

Let's analyze our dataset through these key questions:

1. 🤔 Which rows/columns of the plot carry the most significance, and why?

2. 🔍 Can you spot the categorical features from these visualizations?
   * Look for discrete values
   * Check for non-numeric patterns

3. 📏 Numeric Features Analysis:
   * Are the scales consistent across features?
   * How might different ranges impact our model?

4. 🔗 Feature Relationships:
   * Look for potential correlations
   * Identify possible collinear features

In [None]:
sns.pairplot(df)

## Do you see any collinearity in the data?
* Can it cause any issues? How can we deal with it?

In [None]:
sns.heatmap(df.corr(numeric_only=True), cmap='Greens', annot=True)
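
One common way to put a number on collinearity is the variance inflation factor (VIF): each feature is regressed on the remaining ones, and VIF = 1 / (1 - R²). The sketch below is only an illustration on synthetic data (not Auto MPG); the `vif` helper and the toy columns `a`, `b`, `c` are assumptions made up for this example.

```python
import numpy as np


def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor of every column of X (hypothetical helper)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Least-squares fit of column j on the other columns plus an intercept
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 2.0 * a + rng.normal(scale=0.1, size=500)  # almost a rescaled copy of a
c = rng.normal(size=500)                       # independent of a and b

vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(1))  # large values for a and b, close to 1 for c
```

A frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity; dropping or combining such features is one way to deal with it.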

## We can plot the categorical data using boxplots
* Beware that the data are about cars from the 1970s and early 1980s; we won't see many 6- or 8-cylinder cars nowadays

In [None]:
sns.boxplot(data = df, x='cylinders', y='mpg')

In [None]:
sns.boxplot(data = df, x='origin', y='mpg')

## 💡 There is no info about the *origin* feature = detective work incoming 🙂

### What do you think the origin means based on the printed data?
* And what car origin is your favourite? 🙂

In [None]:
df.loc[df.origin == 1, 'car_name']

In [None]:
df.loc[df.origin == 2, 'car_name']

In [None]:
df.loc[df.origin == 3, 'car_name']

## Okay, now that we have a basic understanding of the data, we can start trying some models
* We need to deal with the NA values first; as it is just a few rows, we will drop them

In [None]:
df = df.loc[~df.horsepower.isna(), :].copy()

## 🏷️ Handling Categorical Features

Let's examine our categorical variables:

### Origin Feature 🌍
* Even though *origin* appears numerical (1, 2, 3)
* These numbers are actually codes representing:
  * 1 → American
  * 2 → European
  * 3 → Asian
* ⚠️ Why treat as categorical? Numbers don't represent order or magnitude!

### Car Name Feature 🚗
* Text data that needs encoding
* Contains brand/model information

❓ Key Question:
* Why is *origin* categorical despite being numerical? What's the catch?

### car_name is problematic because we have quite a few brands, so one-hot encoding would add too many columns
* We will drop the feature

In [None]:
df['car_name'].apply(lambda x: x.split(' ')[0]).value_counts()

In [None]:
df = df.drop('car_name', axis=1)

In [None]:
df['origin'] = df['origin'].replace({1: 'USA', 2: 'EUR', 3: 'JAP'})

In [None]:
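# prefix_sep already adds the underscore; dtype=float keeps the dummy columns numeric (newer pandas versions default to bool)
df = pd.get_dummies(df, columns=['origin'], prefix='origin', dtype=float)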

In [None]:
df.head()
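
To make the one-hot encoding step easier to follow, here is a tiny sketch on a made-up three-row frame (the `mpg` values are invented for illustration). It uses `map` instead of `replace` for the code-to-label step, which is an assumption of this example and not the notebook's own choice.

```python
import pandas as pd

# Toy frame: one row per origin code, mpg values are made up
toy = pd.DataFrame({'mpg': [18.0, 26.0, 31.0], 'origin': [1, 2, 3]})

# Turn the numeric codes into labels so they are not read as magnitudes
toy['origin'] = toy['origin'].map({1: 'USA', 2: 'EUR', 3: 'JAP'})

# One column per category; each row has exactly one 1.0 among the origin columns
encoded = pd.get_dummies(toy, columns=['origin'], prefix='origin', dtype=float)
print(encoded)
```

Each origin now gets its own 0/1 column, so the model cannot interpret code 3 as "three times" code 1.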

## Split the data into input and output part

In [None]:
X, y = df.drop('mpg', axis=1), df.mpg

In [None]:
X.shape, y.shape

## Split the data into train/test sets in an 80:20 ratio

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

## 🎯 Creating Baseline Model

We'll start with Linear Regression as our foundation:

### Linear Regression vs ANN 🤔
* Linear regression:
  * Simple mathematical formula
  * Clear coefficients for each feature
  * Direct feature importance interpretation

* Neural Network:
  * Complex layered structure
  * Hidden transformations
  * "Black box" nature

❓ Key Question:
* Which model provides better explainability - ANN or Linear Regression? Why?

![Meme01](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_meme_reg_01.jpg?raw=true)

## We will use just the *horsepower* and *model_year* features because of the high correlation values among the features (see the heatmap above)

In [None]:
alg = LinearRegression()
alg.fit(X_train.loc[:, ['horsepower', 'model_year']], y_train)
y_pred = alg.predict(X_test.loc[:, ['horsepower', 'model_year']])
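
The explainability argument for linear regression can be illustrated with a small sketch: on synthetic data generated from known coefficients, the fitted `coef_` and `intercept_` can be read off directly. The data, the true coefficients (3, -2, 5) and the names `X_toy`, `y_toy`, `lr` are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic data: y = 3 * x0 - 2 * x1 + 5 + small noise
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

lr = LinearRegression().fit(X_toy, y_toy)

# Each coefficient is the change in the prediction per unit change of its feature
print('coefficients:', lr.coef_.round(2))
print('intercept:', round(float(lr.intercept_), 2))
```

The same attributes are available on the `alg` model fitted above; a neural network has no comparably direct readout.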

## 📊 Model Evaluation in Regression

### Common Regression Metrics 📏

1. Basic Metrics:
* MAE (Mean Absolute Error)
* RMSE (Root Mean Square Error)

2. Advanced Metrics:
* R² (R-squared)
* MAPE (Mean Absolute Percentage Error)
* sMAPE (Symmetric MAPE)

❓ Key Question:
* Can you write mathematical formulas for any of these metrics?
  * Think about:
    * Actual values (y)
    * Predicted values (ŷ)
    * Number of samples (n)

💡 Note: I've prepared evaluation functions to help you calculate all metrics easily!

In [None]:
df_pred = pd.DataFrame({'y_true': y_test, 'y_pred': y_pred})
compute_metrics(df_pred)
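
For anyone who wants to check their formulas, the metrics listed above can be written out by hand in a few lines of NumPy. This is a sketch on three made-up value pairs, with `y` as the actual values and `y_hat` as the predictions.

```python
import numpy as np

y = np.array([20.0, 30.0, 40.0])      # made-up actual values
y_hat = np.array([22.0, 27.0, 40.0])  # made-up predictions

err = y - y_hat                        # [-2, 3, 0]

mae = np.mean(np.abs(err))             # (2 + 3 + 0) / 3
mse = np.mean(err ** 2)                # (4 + 9 + 0) / 3
rmse = np.sqrt(mse)
mape = np.mean(np.abs(err / y)) * 100  # error relative to the actual value, in percent
smape = np.mean(np.abs(err) / ((np.abs(y) + np.abs(y_hat)) / 2)) * 100
r2 = 1 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)  # 1 - 13 / 200

print(mae, mse, rmse, mape, smape, r2)
```

Note that MAPE divides by the actual value, so it breaks down when `y` contains zeros.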

# Now we can create our first deep learning model and compare it to the baseline
* The ANN model can use more features, as it is designed for bigger datasets and multicollinearity is not as big an issue as in the LR case
* We will start with the raw data
* The evaluation step is the same

![Meme02](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_meme_reg_04.jpg?raw=true)

## 🔎 Why do we use *linear* activation in the output layer?

# 📒 NOTE for Task 2: This is the benchmark model

In [None]:
(X_train.shape[1],)

In [None]:
inp = keras.layers.Input(shape=(X_train.shape[1],))
                         
hidden_1 = keras.layers.Dense(128, activation='relu')(inp)
hidden_2 = keras.layers.Dense(32, activation='relu')(hidden_1)

out = keras.layers.Dense(1, activation='linear')(hidden_2)

model = keras.Model(inp, out)

model.compile(loss=keras.losses.MeanSquaredError(),  
              optimizer=keras.optimizers.RMSprop(), 
              metrics=[keras.metrics.MeanAbsoluteError(), keras.metrics.MeanAbsolutePercentageError()])
model.summary()

## Train the model

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='best.weights.h5',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

history = model.fit(X_train, y_train, validation_split=0.2, callbacks=[model_checkpoint_callback], batch_size=8, epochs=100)
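
With `monitor='val_loss'` and `save_best_only=True`, the checkpoint file is overwritten only when the validation loss improves, so it ends up holding the weights from the epoch with the lowest `val_loss`. The same epoch can be located in the training history. The sketch below uses a made-up dictionary (`fake_history`) as a stand-in for `history.history`, so the numbers are purely illustrative.

```python
import numpy as np

# Stand-in for history.history: one list of values per logged quantity
fake_history = {
    'loss':     [90.0, 40.0, 25.0, 20.0, 18.0],
    'val_loss': [85.0, 45.0, 30.0, 33.0, 31.0],
}

# Epoch (0-based) with the lowest validation loss
best_epoch = int(np.argmin(fake_history['val_loss']))
best_val_loss = fake_history['val_loss'][best_epoch]

print(f'best epoch: {best_epoch}, val_loss: {best_val_loss}')
```

In this toy history the training loss keeps falling after epoch 2 while the validation loss does not, which is exactly the situation where loading the checkpointed weights helps.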

## 📈 Comparing Model Metrics

### Model Comparison 🔄
* Compare metrics between:
  * Linear Regression
  * Our Neural Network

❓ Key Questions:
* Is the model performing better? Why?
* What's the purpose of .ravel()?
  * Hint: Think about array dimensions! 

💡 Note: .ravel() transforms multi-dimensional arrays into 1D arrays, which is often required for metric calculations.
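
A short sketch of what can go wrong without `.ravel()`: a model with a single output unit returns predictions of shape `(n, 1)`, and subtracting that from a 1D array of shape `(n,)` silently broadcasts to an `(n, n)` matrix. The arrays below are made up for illustration.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])           # shape (3,)
y_pred_2d = np.array([[1.5], [2.0], [2.5]])  # shape (3, 1), like the output of model.predict

wrong = y_true - y_pred_2d          # broadcasting produces every pairwise difference
right = y_true - y_pred_2d.ravel()  # element-wise difference, as intended

print(wrong.shape)  # (3, 3)
print(right.shape)  # (3,)
```

Any metric averaged over `wrong` would be computed from all pairwise differences instead of the matching pairs.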

In [None]:
model.load_weights("best.weights.h5")

y_pred = model.predict(X_test).ravel()

df_pred = pd.DataFrame({'y_true': y_test, 'y_pred': y_pred})
compute_metrics(df_pred)

## It is very good practice to check the loss function values of train/validation data during the training and not only the metrics
* Do you see any issue with the val_loss?

In [None]:
show_history_loss(history)

## The loss function plot shows clear instability of learning
* This is a big issue in regression tasks, and a pretty common one
* It is caused by differences in the magnitude of the features
* We can solve it with feature scaling (normalization)
* A [Normalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization) layer can be used for this

### Why is magnitude difference an issue?

* You can see that the gradient of the slope is orders of magnitude larger than the gradient of the intercept.

![Grad01](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_reg_noscale.png?raw=true)

* If we look at how the values change in one optimization step, we can see that only the slope changed (the plot above shows a vertical line, with no change in the intercept parameter).
    * That's because the slope gradient is far larger than the intercept gradient.
    * The gradient points in the direction of steepest ascent, so gradient descent moves the weights in the opposite direction.
    * The gradient is the vector of all partial derivatives of the loss function with respect to all the model weights.
        * **Basically, these values tell you in which direction (+ or - delta) and by how much you should change the individual weight values to lower the loss function value**
        * The amount by which we adjust the weights in each iteration is controlled by the *learning rate* parameter
    
![Grad02](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_reg_noscale_grad.png?raw=true)

### There are a few ways we can solve our problem above. The most common way is to simply scale your features before gradient descent.

![Grad03](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_reg_scale.png?raw=true)

* We can see that now the optimization process is not stuck and the gradients computed in the individual steps point in the right direction.

![Grad04](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_reg_scale_grad.png?raw=true)


* **I recommend visiting https://www.tomasbeuzen.com/deep-learning-with-pytorch/chapters/chapter1_gradient-descent.html for more details about the topic**
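
The effect of feature magnitude on the gradient can be reproduced with a few lines of NumPy. This is a minimal sketch for a one-feature linear model with an MSE loss; the data are synthetic (a made-up feature in the range 1500 to 5000), and `mse_gradient` is a hypothetical helper written for this example.

```python
import numpy as np


def mse_gradient(x, y, slope, intercept):
    """Partial derivatives of the MSE for y_hat = slope * x + intercept."""
    resid = slope * x + intercept - y
    d_slope = 2 * np.mean(resid * x)
    d_intercept = 2 * np.mean(resid)
    return d_slope, d_intercept


rng = np.random.default_rng(1)
x_raw = rng.uniform(1500, 5000, size=200)  # large-magnitude feature
y = 45.0 - 0.007 * x_raw + rng.normal(scale=1.0, size=200)

# Standardized copy of the same feature
x_scaled = (x_raw - x_raw.mean()) / x_raw.std()

g_raw = mse_gradient(x_raw, y, slope=0.0, intercept=0.0)
g_scaled = mse_gradient(x_scaled, y, slope=0.0, intercept=0.0)

ratio_raw = abs(g_raw[0] / g_raw[1])           # slope gradient dominates by orders of magnitude
ratio_scaled = abs(g_scaled[0] / g_scaled[1])  # comparable magnitudes after scaling

print(ratio_raw, ratio_scaled)
```

With the raw feature, a single learning rate cannot suit both parameters at once; after scaling, the two partial derivatives are of comparable size.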

## 🔄 Data Normalization

### Why Normalize? 🎯
* Neural networks are sensitive to input scales
* Features with different ranges can cause:
  * Slower convergence
  * Poor model performance
  * Training instability

### Process 🔧
1. Normalize features to similar ranges
2. Retrain the model
3. Compare results with previous version

❓ Key Questions:
* Will normalized data improve our model?
* How will the metrics change?

In [None]:
norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(X_train.to_numpy())

## We can take a look at the mean and variance used in the normalization process for each feature

In [None]:
print('Mean: ', np.array(norm_layer.variables[0]))
print('Variance: ', np.array(norm_layer.variables[1]))
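
What the adapted layer does with these statistics can be sketched in plain NumPy: each feature is shifted by its mean and divided by its standard deviation. The rows below are made up for illustration, and the sketch ignores the small epsilon the layer uses to guard against division by zero.

```python
import numpy as np

# Made-up rows with two features on very different scales
X_toy = np.array([[130.0, 3504.0],
                  [165.0, 3693.0],
                  [ 95.0, 2372.0],
                  [ 88.0, 2130.0]])

mean = X_toy.mean(axis=0)  # per-feature mean
var = X_toy.var(axis=0)    # per-feature variance

# Standardization: zero mean and unit variance per feature
X_norm = (X_toy - mean) / np.sqrt(var)

print(X_norm.mean(axis=0).round(6))
print(X_norm.std(axis=0).round(6))
```

Note that the statistics come from the training data only, which is why `adapt` is called on `X_train` and not on the full dataset.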

In [None]:
inp = keras.layers.Input(shape=(X_train.shape[1],))
norm = norm_layer(inp)                  
hidden_1 = keras.layers.Dense(128, activation='relu')(norm)
hidden_2 = keras.layers.Dense(32, activation='relu')(hidden_1)

out = keras.layers.Dense(1, activation='linear')(hidden_2)

model = keras.Model(inp, out)

model.compile(loss=keras.losses.MeanSquaredError(),  
              optimizer=keras.optimizers.RMSprop(), 
              metrics=[keras.metrics.MeanAbsoluteError(), keras.metrics.MeanAbsolutePercentageError()])
model.summary()

## Train the model

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='best.weights.h5',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

history = model.fit(X_train, y_train, validation_split=0.2, callbacks=[model_checkpoint_callback], batch_size=8, epochs=100)

In [None]:
model.load_weights("best.weights.h5")

y_pred = model.predict(X_test).ravel()

df_pred = pd.DataFrame({'y_true': y_test, 'y_pred': y_pred})
compute_metrics(df_pred)

## 👀 Analyzing the Impact of Normalization

### Before vs After Comparison 📊
* Training behavior
* Convergence speed
* Final metrics

### Why Normalization Matters 🎯
* Similar scales → Stable gradients
* Benefits:
  * ⚡ Faster convergence
  * 📈 Higher learning rates possible
  * 🎯 Better numerical stability

❓ Key Question:
* Do you notice any differences in model performance after normalization?

In [None]:
show_history_loss(history)

## We can transform the output as well
* There are multiple scaling options
    * MinMax, Std. scale, Log, BoxCox, ...
    
### We will test *MinMaxScaler* with the (-1, 1) range

In [None]:
scaler = MinMaxScaler(feature_range=(-1, 1))
y_train_scaled = scaler.fit_transform(np.array(y_train).reshape((-1, 1))).ravel()
y_test_scaled = scaler.transform(np.array(y_test).reshape((-1, 1))).ravel()

In [None]:
y_train_scaled[:10]
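
The min-max transformation itself is a simple linear rescaling. The sketch below compares a hand-written formula with `MinMaxScaler` on four made-up target values and shows that `inverse_transform` recovers the original scale; `y_toy`, `lo` and `hi` are names invented for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

y_toy = np.array([10.0, 20.0, 25.0, 40.0])  # made-up target values
lo, hi = -1.0, 1.0

# Manual formula: map [min, max] linearly onto [lo, hi]
manual = (y_toy - y_toy.min()) / (y_toy.max() - y_toy.min()) * (hi - lo) + lo

sk = MinMaxScaler(feature_range=(lo, hi))
scaled = sk.fit_transform(y_toy.reshape(-1, 1)).ravel()
restored = sk.inverse_transform(scaled.reshape(-1, 1)).ravel()

print(manual)    # smallest value maps to -1, largest to 1
print(restored)  # back on the original scale
```

As with the feature normalization, the scaler is fitted on the training targets only and then reused for the test targets.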

## ⚠️ Output Activation Function Warning

### The Activation Range Problem 🎯
* Sigmoid → [0,1] range only
* Can't produce negative values
* Real data may need wider range

❓ Key Question:
* What happens when activation function range doesn't match our target variable range?

💡 Remember: Always match your output activation to your target variable range!
* Linear → unbounded values
* ReLU → positive values
* Sigmoid → [0,1]
* Tanh → [-1,1]

### Anti-Pattern Example ☣️
* Using sigmoid for a target scaled to [-1, 1]
* Model will be limited to positive values
* Can't predict full range of target variable
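
The range mismatch can be checked numerically before training anything. This sketch evaluates sigmoid and tanh on a grid of pre-activation values and measures how close each can get to a scaled target of -1; the grid `z` and the helper `sigmoid` are assumptions made up for this example.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, output strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


z = np.linspace(-10, 10, 201)  # pre-activation values of the output neuron
target = -1.0                  # smallest value of a target scaled to [-1, 1]

# Best achievable absolute error for this target with each activation
min_sigmoid_error = np.min(np.abs(sigmoid(z) - target))
min_tanh_error = np.min(np.abs(np.tanh(z) - target))

print('sigmoid range:', sigmoid(z).min(), sigmoid(z).max())
print('tanh range:   ', np.tanh(z).min(), np.tanh(z).max())
print('best errors:  ', min_sigmoid_error, min_tanh_error)
```

No matter how the weights are trained, the sigmoid output stays more than 1 away from this target, while tanh can get arbitrarily close.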

In [None]:
inp = keras.layers.Input(shape=(X_train.shape[1],))
norm = norm_layer(inp)                  
hidden_1 = keras.layers.Dense(128, activation='relu')(norm)
hidden_2 = keras.layers.Dense(32, activation='relu')(hidden_1)

out = keras.layers.Dense(1, activation='sigmoid')(hidden_2)

model = keras.Model(inp, out)

model.compile(loss=keras.losses.MeanSquaredError(),  
              optimizer=keras.optimizers.RMSprop(), 
              metrics=[keras.metrics.MeanAbsoluteError(), keras.metrics.MeanAbsolutePercentageError()])
model.summary()

## Train the model

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='best.weights.h5',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

history = model.fit(X_train, y_train_scaled, validation_split=0.2, callbacks=[model_checkpoint_callback], batch_size=8, epochs=100)

In [None]:
model.load_weights("best.weights.h5")

y_pred = model.predict(X_test).ravel()

df_pred = pd.DataFrame({'y_true': y_test_scaled, 'y_pred': y_pred})
compute_metrics(df_pred)

## Now we can transform the predictions back to the original scale

In [None]:
y_pred = scaler.inverse_transform(y_pred.reshape((-1, 1))).ravel()
y_pred[:10]

In [None]:
df_pred = pd.DataFrame({'y_true': y_test, 'y_pred': y_pred})
compute_metrics(df_pred)

## 📊 Analyzing Predictions vs Reality

### Spotting the Problem 🔍
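* Predictions limited to the (0, 1) range in the scaled space, i.e. the upper half of the *mpg* range after the inverse transform
* Actual values cover a much wider range
* Clear mismatch visible in the plot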

### Ideal Plot Should Show 📈
* Points following diagonal line
* No range restrictions
* Even distribution above/below line

❓ Key Questions:
* Can you identify the sigmoid limitation in the plot?

💡 Remember: A good regression plot should show points clustered along y=x line without artificial boundaries!

In [None]:
sns.scatterplot(x=y_test, y=y_pred)

In [None]:
show_history_loss(history)

# ✅ Now we will try to fix the issue and replace sigmoid function with the correct one
* What function can we use? Why?

In [None]:
inp = keras.layers.Input(shape=(X_train.shape[1],))
norm = norm_layer(inp)                  
hidden_1 = keras.layers.Dense(128, activation='relu')(norm)
hidden_2 = keras.layers.Dense(32, activation='relu')(hidden_1)

out = keras.layers.Dense(1, activation='tanh')(hidden_2)

model = keras.Model(inp, out)

model.compile(loss=keras.losses.MeanSquaredError(),  
              optimizer=keras.optimizers.RMSprop(), 
              metrics=[keras.metrics.MeanAbsoluteError(), keras.metrics.MeanAbsolutePercentageError()])
model.summary()

## Train the model

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='best.weights.h5',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

history = model.fit(X_train, y_train_scaled, validation_split=0.2, callbacks=[model_checkpoint_callback], batch_size=8, epochs=100)

In [None]:
model.load_weights("best.weights.h5")

y_pred = model.predict(X_test).ravel()

df_pred = pd.DataFrame({'y_true': y_test_scaled, 'y_pred': y_pred})
compute_metrics(df_pred)

## Now we can transform the predictions back to the original scale

In [None]:
y_pred = scaler.inverse_transform(y_pred.reshape((-1, 1))).ravel()
y_pred[:10]

In [None]:
df_pred = pd.DataFrame({'y_true': y_test, 'y_pred': y_pred})
compute_metrics(df_pred)

# Plot of the y_test vs. y_pred
* Is it better?

In [None]:
sns.scatterplot(x=y_test, y=y_pred)

## The convergence was quite fast
* We can see that the val_loss is still not fully stable, but the changes are very small now

In [None]:
show_history_loss(history)

## ✅  Tasks for the lecture (2p)

1) Try to use [PowerTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html) for the output values in a similar manner to the MinMaxScaler - **(1p)**

    - When do we use it? Why?
    
    - If you wanted to guess if it helps, what do you think? 
        * Plot histogram of the output (*mpg*), you can make an educated guess based on it 🙂
    
2) Try to design your own network and beat the **benchmark** network used in the lecture - **(1p)**