# **Regression tutorial**
In today's tutorial we will design and train deep neural networks to solve a regression problem.

We will use the [**TensorFlow**](https://ekababisong.org/gcp-ml-seminar/tensorflow/) framework and the [**Keras**](https://keras.io/) open-source library to rapidly prototype deep neural networks.

# **Preliminary operations**
The following code downloads all the necessary material to the remote machine. When the execution finishes, select the **File** tab to verify that everything has been downloaded correctly.

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00501/PRSA2017_Data_20130301-20170228.zip

!unzip PRSA2017_Data_20130301-20170228.zip

!rm PRSA2017_Data_20130301-20170228.zip

# **Useful modules import**
First of all, it is necessary to import the modules used during the tutorial.

In [None]:
import glob
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

# **Utility functions**
Execute the following code to define some utility functions used in the tutorial:
- **plot_history** plots the loss trend over epochs for both the training and validation sets. If a metric is provided, its trend is drawn in the same graph;
- **plot_prediction_results** plots the predicted values against the true values and visualizes the error distribution.

In [None]:
def plot_history(history,metric=None):
  fig, ax1 = plt.subplots(figsize=(10, 8))

  epoch_count=len(history.history['loss'])

  line1,=ax1.plot(range(1,epoch_count+1),history.history['loss'],label='train_loss',color='orange')
  ax1.plot(range(1,epoch_count+1),history.history['val_loss'],label='val_loss',color = line1.get_color(), linestyle = '--')
  ax1.set_xlim([1,epoch_count])
  ax1.set_ylim([0, max(max(history.history['loss']),max(history.history['val_loss']))])
  ax1.set_ylabel('loss',color = line1.get_color())
  ax1.tick_params(axis='y', labelcolor=line1.get_color())
  ax1.set_xlabel('Epochs')
  _=ax1.legend(loc='lower left')

  if metric is not None:
    ax2 = ax1.twinx()
    line2,=ax2.plot(range(1,epoch_count+1),history.history[metric],label='train_'+metric)
    ax2.plot(range(1,epoch_count+1),history.history['val_'+metric],label='val_'+metric,color = line2.get_color(), linestyle = '--')
    ax2.set_ylim([0, max(max(history.history[metric]),max(history.history['val_'+metric]))])
    ax2.set_ylabel(metric,color=line2.get_color())
    ax2.tick_params(axis='y', labelcolor=line2.get_color())
    _=ax2.legend(loc='upper right')

def plot_prediction_results(y,y_pred,output_labels,bin_count=50):
  fig, axs = plt.subplots(2,len(output_labels),figsize=(25, 10),squeeze=False)
  
  for i in range(len(output_labels)):
    axs[0,i].set_title(output_labels[i])
    axs[0,i].scatter(y[:,i], y_pred[:,i],s=1)
    axs[0,i].set_xlabel('True Values')
    if i==0:
      axs[0,i].set_ylabel('Predictions')
    max_value=max(max(y[:,i]),max(y_pred[:,i]))
    x_lims = [0, max_value]
    y_lims = [min(0,min(y[:,i]),min(y_pred[:,i])), max_value]
    axs[0,i].set_xlim(x_lims)
    axs[0,i].set_ylim(y_lims)
    axs[0,i].plot(y_lims, y_lims, color='k')

    errors = y[:,i]-y_pred[:,i]
    axs[1,i].hist(errors, bins=bin_count)
    axs[1,i].set_xlabel('Prediction Error')
    if i==0:
      axs[1,i].set_ylabel('Count')
    axs[1,i].set_xlim([min(errors),max(errors)])

# **Dataset**
This tutorial uses the [Beijing Multi-Site Air-Quality Data Data Set](https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data) maintained by the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), a public repository containing hundreds of databases useful for the machine learning community.

The data set includes hourly air pollutants data from 12 nationally-controlled air-quality monitoring sites of the Beijing municipal environmental monitoring center. It contains 420768 instances with 18 attributes:
- *No*: row number
- *year*: year of data
- *month*: month of data
- *day*: day of data
- *hour*: hour of data
- *PM2.5*: PM2.5 concentration (ug/m^3)
- *PM10*: PM10 concentration (ug/m^3)
- *SO2*: SO2 concentration (ug/m^3)
- *NO2*: NO2 concentration (ug/m^3)
- *CO*: CO concentration (ug/m^3)
- *O3*: O3 concentration (ug/m^3)
- *TEMP*: temperature (degree Celsius)
- *PRES*: pressure (hPa)
- *DEWP*: dew point temperature (degree Celsius)
- *RAIN*: precipitation (mm)
- *wd*: wind direction
- *WSPM*: wind speed (m/s)
- *station*: name of the air-quality monitoring site

The dataset is stored in multiple CSV files and can be easily loaded in memory using [**pandas**](https://pandas.pydata.org/), a software library for data manipulation and analysis.

In [None]:
li = []
for filename in glob.glob('PRSA_Data_20130301-20170228' + "/*.csv"):
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

dataframe = pd.concat(li, axis=0, ignore_index=True)

The variable *dataframe* is an instance of the pandas class [**DataFrame**](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html), a 2-dimensional labeled data structure with columns of potentially different types.

## **Visualization**
*row_count* randomly selected rows can be shown by executing the following code.

In [None]:
row_count=5

dataframe.sample(row_count)

## **Statistics**
The [**info**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method can be used to print a brief summary of a **DataFrame** including the index and the type of each column, the non-null values and the memory usage.

In [None]:
dataframe.info()

The [**describe**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) method can be used to show the overall statistics of the dataset.

In [None]:
dataframe.describe().transpose()

The statistics make it clear that the features cover very different ranges.

The method [**hist**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html) draws a histogram for each column in the **DataFrame**.

In [None]:
dataframe.hist(bins=50, figsize=(20,15))
plt.show()

## **Data preparation**
Most machine learning algorithms require data to be formatted in a specific way, so datasets generally require some amount of preparation before they can yield useful insights. Some datasets have values that are missing, invalid, or otherwise difficult for an algorithm to process.

### **Missing values**
The dataset contains several missing values, as reported by the [**isna**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html) method.

In [None]:
dataframe.isna().sum()

The simplest solution to missing values is to remove the corresponding rows. This can be done by calling the [**dropna**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method. 

In [None]:
prepared_dataframe = dataframe.copy()
prepared_dataframe = prepared_dataframe.dropna()
prepared_dataframe.info()

### **Encode cyclical data**
Air pollution is strongly related to the time of the day (e.g., 9 A.M. or 10 P.M.) and the time of the year (e.g., January or August).

The following code plots the *hour* column in a graph.

In [None]:
plt.plot(prepared_dataframe['hour'][:130].values)

The graph reports the first 130 hourly samples (roughly five days of data): the value cycles between 0 and 23 and presents a **jump discontinuity** at the end of each day, when the hour goes from 23 to 00.

Presenting raw cyclical values to a machine learning algorithm is a problem. For instance, the algorithm would consider the difference between 23 and 00 much greater than that between 22 and 23, although both pairs are one hour apart.

A common method for encoding cyclical data is to transform the data into two dimensions using a sine and cosine transformation.

The hour sine and cosine values are computed and plotted by executing the following code.

In [None]:
# hours range over 0..23, so one full cycle spans 24 values
hour_sin = np.sin(2 * np.pi * prepared_dataframe['hour']/24.0)
hour_cos = np.cos(2 * np.pi * prepared_dataframe['hour']/24.0)

plt.figure(figsize=(5, 5))
plt.xlabel('hour_sin')
plt.ylabel('hour_cos')
plt.scatter(hour_sin,hour_cos)

As expected, the hour information is encoded as a cycle.
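As a quick numerical check of this idea, the following sketch encodes three hours with a period of 24 and compares their distances in the (sin, cos) plane. It is independent of the dataset, and `encode_cyclical` is a hypothetical helper introduced only for illustration. Under this encoding, 23 and 00 should be exactly as close as 22 and 23.

```python
import numpy as np

def encode_cyclical(values, period):
    """Map cyclical values onto the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

hours = np.array([22, 23, 0])
hours_sin, hours_cos = encode_cyclical(hours, period=24)
points = np.column_stack([hours_sin, hours_cos])

# Euclidean distances between consecutive hours in the encoded space
dist_22_23 = np.linalg.norm(points[0] - points[1])
dist_23_00 = np.linalg.norm(points[1] - points[2])

print(dist_22_23, dist_23_00)
```

The period must equal the number of distinct values in one cycle (24 for hours numbered 0 to 23, 12 for months numbered 1 to 12); otherwise the two ends of the cycle do not join evenly.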

The following code adds the two new features (*hour_sin* and *hour_cos*) to the **DataFrame** as two new columns.

In [None]:
prepared_dataframe['hour_sin']=hour_sin
prepared_dataframe['hour_cos']=hour_cos

prepared_dataframe.sample(row_count)

The same thing can be done with the *month* column by executing the following cell.

In [None]:
month_sin = np.sin(2 * np.pi * prepared_dataframe['month']/12.0)
month_cos = np.cos(2 * np.pi * prepared_dataframe['month']/12.0)

prepared_dataframe['month_sin']=month_sin
prepared_dataframe['month_cos']=month_cos

prepared_dataframe.sample(row_count)

### **Remove unnecessary columns**
The *No* column is just a row counter, while the *month* and *hour* columns are now redundant because their information is carried by the sine and cosine features. They can be removed from the dataset using the [**drop**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) method.

In [None]:
prepared_dataframe=prepared_dataframe.drop(['No','month','hour'],axis=1)
prepared_dataframe.sample(row_count)

### **Convert categorical data**
The *wd* and *station* columns are categorical, not numeric. They can be converted into numeric format in two ways: 
- *label encoding*, which converts each category to a number;
- *one hot encoding*, which converts each category value into a new column and assigns a 1 or 0 (True/False) value to that column.

**Label encoding**

First of all, if the column type is *object* and not *category*, the [**astype**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html) method can be used to convert a column to a category.

In [None]:
label_enc_dataframe=prepared_dataframe.copy()

label_enc_dataframe['wd'] = prepared_dataframe['wd'].astype('category')
label_enc_dataframe['station'] = prepared_dataframe['station'].astype('category')
label_enc_dataframe.dtypes

Then the encoded values can be assigned to the corresponding column using the **cat.codes** accessor.

In [None]:
label_enc_dataframe['wd'] = label_enc_dataframe['wd'].cat.codes
label_enc_dataframe['station'] = label_enc_dataframe['station'].cat.codes
label_enc_dataframe.sample(row_count)

Label encoding has the advantage of being straightforward, but it has the disadvantage that the numeric values can be “misinterpreted” by the algorithms. For example, the value 0 is obviously less than the value 4, but does that ordering really correspond to anything in reality (e.g., for *station*)?

**One hot encoding**

Pandas supports this feature using the [**get_dummies**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function.

In [None]:
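# dtype=float keeps the dummy columns numeric (recent pandas versions default to bool)
one_hot_enc_dataframe=pd.get_dummies(prepared_dataframe, columns=['wd', 'station'], prefix=['wd', 'station'], dtype=float)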
one_hot_enc_dataframe.sample(row_count)

One hot encoding has the benefit of not weighting a value improperly, but it has the downside of adding more columns to the data set (how many depends on the number of categories in a column).
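To make the difference between the two encodings concrete, here is a minimal NumPy sketch on a toy array. The values are invented for illustration and not taken from the dataset, and this is not how pandas implements either encoding internally.

```python
import numpy as np

wind = np.array(['N', 'E', 'N', 'SW'])  # toy values, not from the dataset

# Label encoding: each category becomes an integer code
# (np.unique returns the categories in sorted order)
categories, codes = np.unique(wind, return_inverse=True)

# One hot encoding: each code selects a row of the identity matrix,
# giving one column per category with a single 1 in each row
one_hot = np.eye(len(categories))[codes]

print(categories)
print(codes)
print(one_hot)
```

The integer codes carry an arbitrary order (here alphabetical), whereas the one hot rows are all equally distant from each other, at the cost of one column per category.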

**What is the best solution?**

It depends on the specific dataset used.

In this tutorial, because both *wd* and *station* columns contain categorical values without any numerical relation, it is better to use the *one hot encoding* solution.

In [None]:
#used_dataframe=label_enc_dataframe
used_dataframe=one_hot_enc_dataframe

## **Split features from target values**

The following code separates the target values (the concentration of air pollutants) from the features.

In [None]:
target_data=['PM2.5','PM10','SO2','NO2','CO','O3']

dataframe_x=used_dataframe.drop(target_data, axis=1)
dataframe_y=used_dataframe[target_data]

Some randomly selected feature rows can be shown by executing the following code.

In [None]:
dataframe_x.sample(row_count)

Some randomly selected target rows can be shown by executing the following code.

In [None]:
dataframe_y.sample(row_count)

The NumPy representation of the **DataFrame** can be obtained using the [**values**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html) property.

In [None]:
x=dataframe_x.values
y=dataframe_y.values

print('Feature shape: ',x.shape)
print('Target shape: ',y.shape)

## **Split data into training and test sets**
To evaluate the generalization capabilities of the regression model, it is necessary to have a separate dataset (called test set) to use in the final evaluation of our model after the training process. 

For this reason, *x* and *y* are divided into two subsets: training and test sets. 

The **Scikit-learn** library provides the function [**train_test_split**](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to separate a dataset into two parts.

The *test_size* parameter represents the fraction (a float between 0 and 1) or the absolute number (an integer) of patterns to include in the test set.

The *shuffle* parameter is used to shuffle patterns before splitting.

In [None]:
test_size=0.25

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = test_size,random_state = 1,shuffle=True)

print('Train feature shape: ',train_x.shape)
print('Train target shape: ',train_y.shape)
print('Test feature shape: ',test_x.shape)
print('Test target shape: ',test_y.shape)

# **Linear regression**
As a starting point, we will evaluate the performance of the least squares linear regression using the class [**LinearRegression**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) provided by Scikit-learn.

The [**fit**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit) method fits a linear model that minimizes the residual sum of squares between the target values and the values predicted by the linear approximation.

In [None]:
linear_model = LinearRegression().fit(train_x, train_y)
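To illustrate what minimizing the residual sum of squares means, the following sketch solves a small multi-output least squares problem directly with NumPy. The data are synthetic and noise-free, and the names `true_W` and `true_b` are assumptions made for this example; it shows the underlying technique and is not a description of Scikit-learn's internal solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic problem: 3 features, 2 targets, exact linear relation
X = rng.normal(size=(200, 3))
true_W = np.array([[1.0, -2.0], [0.5, 0.0], [3.0, 1.5]])
true_b = np.array([0.3, -1.0])
Y = X @ true_W + true_b

# Append a column of ones so the intercept is estimated with the weights
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Least squares solution of X_aug @ coef ~= Y
coef, residuals, rank, singular_values = np.linalg.lstsq(X_aug, Y, rcond=None)
W_hat, b_hat = coef[:-1], coef[-1]

print(W_hat)
print(b_hat)
```

Because the synthetic targets contain no noise, the estimated weights and intercepts match the ones used to generate the data up to floating-point error.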

## **Performance evaluation**
The following code calls the [**predict**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict) method to generate the predictions (*train_y_pred* and *test_y_pred*) of the training and test sets (*train_x* and *test_x*).

In [None]:
train_y_pred=linear_model.predict(train_x)
test_y_pred=linear_model.predict(test_x)

print('Train predictions shape: ',train_y_pred.shape)
print('Test predictions shape: ',test_y_pred.shape)

### **RMSE**
The regression accuracy can be measured using the RMSE.

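The Scikit-learn library provides the function [**mean_squared_error**](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) to compute the MSE.

With *multioutput='raw_values'* the function returns one MSE value per target; taking the square root of each value and averaging the results gives the RMSE. Older Scikit-learn versions offered a *squared=False* shortcut for this, but that parameter has been deprecated and removed in recent releases.

In [None]:
rmse_train = np.sqrt(mean_squared_error(train_y,train_y_pred,multioutput='raw_values')).mean()
rmse_test = np.sqrt(mean_squared_error(test_y,test_y_pred,multioutput='raw_values')).mean()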

print('RMSE - Train: {:.3f} Test: {:.3f}'.format(rmse_train,rmse_test))
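With several targets, an RMSE can be aggregated in more than one way, and the results generally differ. The sketch below uses toy values chosen only for illustration and compares the two most common options. Which one a given library function applies should be checked in its documentation.

```python
import numpy as np

# Toy data: 2 instances, 2 targets (values chosen only for illustration)
y_true = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred = np.array([[1.0, 4.0], [3.0, 0.0]])

squared_errors = (y_true - y_pred) ** 2

# Option 1: RMSE of each target, then the average over the targets
rmse_per_target = np.sqrt(squared_errors.mean(axis=0))
rmse_mean_of_targets = rmse_per_target.mean()

# Option 2: square root of the MSE computed over all values at once
rmse_global = np.sqrt(squared_errors.mean())

print(rmse_per_target, rmse_mean_of_targets, rmse_global)
```

This is worth keeping in mind when comparing RMSE values reported by different tools on the same predictions.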

### **True vs predicted values and error distributions**
To better analyze the model performance on the test set, it is useful to plot the predicted and the true values and to visualize the error distribution. 

In [None]:
plot_prediction_results(test_y,test_y_pred,target_data,200)

### **Best and worst predictions**
To select best and worst predictions the RMSE value for each test instance is computed.

In [None]:
rmse_test_instances=np.sqrt(mean_squared_error(test_y.transpose(),test_y_pred.transpose(),multioutput='raw_values'))

rmse_test_instances_sorted_indices=np.argsort(rmse_test_instances)
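Transposing the arrays makes each test instance play the role of an "output", so that *multioutput='raw_values'* yields one value per instance. As a sketch of the same computation in plain NumPy (toy arrays, names chosen for this example only), the per-instance RMSE is the root of the mean squared error along the target axis:

```python
import numpy as np

# Toy data: 3 instances, 2 targets
y_true = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
y_hat = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 2.0]])

# One RMSE per row: average the squared errors over the targets (axis=1)
rmse_per_instance = np.sqrt(((y_true - y_hat) ** 2).mean(axis=1))

# Indices from the best (lowest error) to the worst (highest error) instance
order = np.argsort(rmse_per_instance)

print(rmse_per_instance)
print(order)
```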

The following code shows the best predictions returned by the model.

In [None]:
with np.printoptions(precision=1, suppress=True):
  print('Best RMSE:')
  print(rmse_test_instances[rmse_test_instances_sorted_indices[:row_count]])

  print('True values:')
  print(test_y[rmse_test_instances_sorted_indices[:row_count]])

  print('Predicted values:')
  print(test_y_pred[rmse_test_instances_sorted_indices[:row_count]])

The following code shows the worst predictions of the model.

In [None]:
with np.printoptions(precision=1, suppress=True):
  print('Worst RMSE:')
  print(rmse_test_instances[rmse_test_instances_sorted_indices[-row_count:]])

  print('True values:')
  print(test_y[rmse_test_instances_sorted_indices[-row_count:]])

  print('Predicted values:')
  print(test_y_pred[rmse_test_instances_sorted_indices[-row_count:]])

# **Linear neural network**
Before building a DNN model, we start with a simple neural network to apply a linear transformation:

$\boldsymbol{\rm{y=Wx+b}}$
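As a minimal sketch of this transformation, the following NumPy code applies it to a single input vector and to a small batch. The sizes and the random values are arbitrary assumptions. Note that a framework may store the weight matrix transposed with respect to the formula; only the result matters here.

```python
import numpy as np

input_count, output_count = 4, 2
rng = np.random.default_rng(0)

W = rng.normal(size=(output_count, input_count))  # weight matrix
b = rng.normal(size=output_count)                 # bias vector

# Single input vector: y = Wx + b
x_single = rng.normal(size=input_count)
y_single = W @ x_single + b

# Batch of inputs stored row-wise: the same map applied to every row
x_batch = rng.normal(size=(3, input_count))
y_batch = x_batch @ W.T + b

# Number of trainable parameters of such a layer
parameter_count = W.size + b.size

print(y_single.shape, y_batch.shape, parameter_count)
```

The parameter count, `input_count * output_count + output_count`, is the figure to expect in the model summary of a network made of a single fully-connected layer.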

## **Model definition**
Training a model with Keras starts by defining the model architecture.

The following function creates a simple linear neural network given:
- the number of input features (*input_count*);
- the number of output targets (*output_count*).

In Keras, a Sequential model is a stack of layers where each layer has exactly one input and one output. It can be created by passing a list of layers to the constructor [**keras.Sequential**](https://keras.io/guides/sequential_model/).

[**Keras layers API**](https://keras.io/api/layers/) offers a wide range of built-in layers ready for use, including:
- [**Input**](https://keras.io/api/layers/core_layers/input/) - the input of the model. Note that, you can also omit the **Input** layer. In that case the model doesn't have any weights until the first call to a training/evaluation method (since it is not yet built);
- [**Dense**](https://keras.io/api/layers/core_layers/dense/) - a fully-connected layer.

In [None]:
def build_linear_nn(input_count,output_count):
  model = keras.Sequential(
    [
      # the input shape must be a tuple, hence the trailing comma
      layers.Input(shape=(input_count,)),
      layers.Dense(output_count)
    ]
  )

  return model

## **Model creation**
The following code creates a linear neural network by calling the **build_linear_nn** function defined above.

In [None]:
linear_nn=build_linear_nn(train_x.shape[1],train_y.shape[1])

## **Model visualization**
A string summary of the network can be printed using the [**summary**](https://keras.io/api/models/model/#summary-method) method.

In [None]:
linear_nn.summary()

The summary is useful for simple models, but can be confusing for complex models.

Function [**keras.utils.plot_model**](https://keras.io/api/utils/model_plotting_utils/) creates a plot of the neural network graph that can make more complex models easier to understand.

In [None]:
keras.utils.plot_model(linear_nn,show_shapes=True, show_layer_names=False)

## **Model compilation**
The compilation is the final step in configuring the model for training. 

The following code uses the [**compile**](https://keras.io/api/models/model_training_apis/#compile-method) method to compile the model.
The important arguments are:
- the optimization algorithm (*optimizer*);
- the loss function (*loss*);
- the metrics used to evaluate the performance of the model (*metrics*).

The most common [optimization algorithms](https://keras.io/api/optimizers/#available-optimizers), [loss functions](https://keras.io/api/losses/#available-losses) and [metrics](https://keras.io/api/metrics/#available-metrics) are already available in Keras. You can either pass them to **compile** as an instance or by the corresponding string identifier. In the latter case, the default parameters will be used.

In [None]:
linear_nn.compile(loss='mse', optimizer='SGD',metrics=[keras.metrics.RootMeanSquaredError(name='rmse')])

## **Split data into training and validation sets**
In order to avoid overfitting during training, it is necessary to have a separate dataset (called validation set), in addition to the training and test datasets, to choose the optimal value for the hyperparameters.

![alt text](https://biolab.csr.unibo.it/ferrara/Courses/DL/Tutorials/Regression/TrainValTestSets.png)

For this reason, *train_x* and *train_y* are divided into training and validation sets using the **train_test_split** function provided by Scikit-learn.

The *val_size* variable represents the fraction (or the absolute number) of patterns to include in the validation set.

In [None]:
val_size=0.25

train_x, val_x, train_y, val_y = train_test_split(train_x, train_y, test_size = val_size,random_state = 1,shuffle=True)

print('Train feature shape: ',train_x.shape)
print('Train target shape: ',train_y.shape)
print('Validation feature shape: ',val_x.shape)
print('Validation target shape: ',val_y.shape)

## **Training**
Now we are ready to train our model by calling the [**fit**](https://keras.io/api/models/model_training_apis/#fit-method) method.

It trains the model for a fixed number of epochs (*epoch_count*) using the training set (*train_x* and *train_y*) divided into mini-batches of *batch_size* elements. During the training process, the performances will be evaluated on both training and validation (*train_x* and *val_x*) sets.

In [None]:
epoch_count = 2
batch_size = 512

history = linear_nn.fit(train_x, train_y,validation_data=(val_x,val_y), epochs=epoch_count, batch_size=batch_size,shuffle = True)

The neural network does not converge. This is because the features take values in very different ranges (as shown in the table of statistics).

The features are multiplied by the model weights, so the scale of the outputs and the scale of the gradients are affected by the scale of the inputs.

Although a model might converge without feature normalization, normalization makes training much more stable.

### **Data normalization**
It is good practice to normalize features that use different scales and ranges.

Scikit-learn library provides the class [**StandardScaler**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to normalize features by removing the mean and scaling to unit variance.

In [None]:
scaler = StandardScaler().fit(train_x)
train_x = scaler.transform(train_x)
val_x=scaler.transform(val_x)
test_x = scaler.transform(test_x)
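Standardization rescales each feature as z = (x - mean) / std, with the mean and standard deviation estimated on the training set only and then reused for the other sets. The sketch below reproduces this by hand on synthetic data. The array names and values are assumptions for illustration, and it is a sketch of the technique rather than of the **StandardScaler** implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (illustrative values)
train = np.column_stack([rng.normal(1000.0, 50.0, 500), rng.normal(0.5, 0.1, 500)])
val = np.column_stack([rng.normal(1000.0, 50.0, 100), rng.normal(0.5, 0.1, 100)])

# Statistics are computed on the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
val_scaled = (val - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(val_scaled.mean(axis=0), val_scaled.std(axis=0))
```

The scaled training features have zero mean and unit variance by construction, while the validation features are only approximately standardized. This is expected, since their statistics were never used.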

Once the features have been normalized, the training process can be launched again.

<u>Note that, it is necessary to create and compile a new model before executing the training process, otherwise it will be performed on a model already trained.</u>

In [None]:
epoch_count = 10
batch_size = 512

linear_nn=build_linear_nn(train_x.shape[1],train_y.shape[1])

linear_nn.compile(loss='mse', optimizer='SGD',metrics=[keras.metrics.RootMeanSquaredError(name='rmse')])

history = linear_nn.fit(train_x, train_y,validation_data=(val_x,val_y), epochs=epoch_count, batch_size=batch_size,shuffle = True)

### **Visualize the training process**
We can learn a lot about our model by observing the graph of its performance over time during training.

The **fit** method returns an object (*history*) containing loss and metrics values at successive epochs for both training and validation sets.

The following code calls the **plot_history** function defined above to draw in a graph the loss and RMSE trend over epochs on both training and validation sets.

In [None]:
plot_history(history,'rmse')

## **Performance evaluation**
The following code calls the [**predict**](https://keras.io/api/models/model_training_apis/#predict-method) method to generate the predictions (*train_y_pred*, *val_y_pred* and *test_y_pred*) of the training, validation and test sets (*train_x*, *val_x* and *test_x*).

In [None]:
train_y_pred=linear_nn.predict(train_x)
val_y_pred=linear_nn.predict(val_x)
test_y_pred=linear_nn.predict(test_x)

print('Train predictions shape: ',train_y_pred.shape)
print('Validation predictions shape: ',val_y_pred.shape)
print('Test predictions shape: ',test_y_pred.shape)

### **RMSE**
The regression accuracy can be measured using the RMSE.

In [None]:
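# RMSE of each target, averaged over the targets
rmse_train = np.sqrt(mean_squared_error(train_y,train_y_pred,multioutput='raw_values')).mean()
rmse_val = np.sqrt(mean_squared_error(val_y,val_y_pred,multioutput='raw_values')).mean()
rmse_test = np.sqrt(mean_squared_error(test_y,test_y_pred,multioutput='raw_values')).mean()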

print('RMSE - Train: {:.3f} Val: {:.3f} Test: {:.3f}'.format(rmse_train,rmse_val,rmse_test))

### **True vs predicted values and error distributions**
To better analyze the model performance on the test set, it is useful to plot the predicted and the true values and to visualize the error distribution. 

In [None]:
plot_prediction_results(test_y,test_y_pred,target_data,200)

### **Best and worst predictions**
To select best and worst predictions the RMSE value for each test instance is computed.

In [None]:
rmse_test_instances=np.sqrt(mean_squared_error(test_y.transpose(),test_y_pred.transpose(),multioutput='raw_values'))

rmse_test_instances_sorted_indices=np.argsort(rmse_test_instances)

The following code shows the best predictions returned by the model.

In [None]:
with np.printoptions(precision=1, suppress=True):
  print('RMSE:')
  print(rmse_test_instances[rmse_test_instances_sorted_indices[:row_count]])

  print('True values:')
  print(test_y[rmse_test_instances_sorted_indices[:row_count]])

  print('Predicted values:')
  print(test_y_pred[rmse_test_instances_sorted_indices[:row_count]])

The following code shows the worst predictions of the model.

In [None]:
with np.printoptions(precision=1, suppress=True):
  print('RMSE:')
  print(rmse_test_instances[rmse_test_instances_sorted_indices[-row_count:]])

  print('True values:')
  print(test_y[rmse_test_instances_sorted_indices[-row_count:]])

  print('Predicted values:')
  print(test_y_pred[rmse_test_instances_sorted_indices[-row_count:]])

# **Deep neural network**
The previous section implemented a simple linear neural network.

This section implements a DNN model. The code is basically the same except the model is expanded to include some *hidden* non-linear layers. The nonlinearity is introduced using the *ReLU* activation function.
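The nonlinearity is what makes the extra layers useful. As a sketch with random weights in NumPy (all names and sizes here are assumptions for illustration), two stacked layers without an activation collapse into a single linear map, while inserting a ReLU between them breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 4))  # 5 samples, 4 features
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

def relu(z):
    """Rectified linear unit: max(z, 0) element-wise."""
    return np.maximum(z, 0.0)

# Two stacked layers without activation ...
stacked_linear = (inputs @ W1 + b1) @ W2 + b2
# ... are equivalent to one linear layer with merged weights
merged_linear = inputs @ (W1 @ W2) + (b1 @ W2 + b2)

# With a ReLU in between, the output is no longer that linear map
stacked_relu = relu(inputs @ W1 + b1) @ W2 + b2

print(np.allclose(stacked_linear, merged_linear))
print(np.allclose(stacked_relu, merged_linear))
```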

## **Model definition**
The following function creates a DNN model given:
- the number of input features (*input_count*);
- the number of output targets (*output_count*);
- the number of neurons for each hidden layer (*neuron_count_per_hidden_layer*);
- the string identifier of the activation function of the hidden layers (*activation*).

In [None]:
def build_dnn(input_count,output_count,neuron_count_per_hidden_layer=[128,128],activation='relu'):
  model = keras.Sequential()
  model.add(layers.Input(shape=(input_count,)))

  for n in neuron_count_per_hidden_layer:
    model.add(layers.Dense(n,activation=activation))

  model.add(layers.Dense(output_count))

  return model

## **Model creation**
The following code creates a DNN model by calling the **build_dnn** function defined above.

In [None]:
dnn=build_dnn(train_x.shape[1],train_y.shape[1])

## **Model visualization**
A string summary of the network can be printed by executing the following code.

In [None]:
dnn.summary()

Alternatively, a plot of the neural network graph can be visualized.

In [None]:
keras.utils.plot_model(dnn,show_shapes=True, show_layer_names=False)

## **Model compilation**
The following code compiles the model as already done for the linear neural network.

In [None]:
dnn.compile(loss='mse', optimizer='SGD',metrics=[keras.metrics.RootMeanSquaredError(name='rmse')])

## **Training**
Now we are ready to train our model by calling the **fit** method.

In [None]:
epoch_count = 2
batch_size = 512

history = dnn.fit(train_x, train_y,validation_data=(val_x,val_y), epochs=epoch_count, batch_size=batch_size,shuffle = True)

The neural network does not converge. This is because the learning rate is too high.

The learning rate needs to be reduced before the training process can be launched again.

<u>Note that, it is necessary to create and compile a new model before executing the training process, otherwise it will be performed on a model already trained.</u>

In [None]:
epoch_count = 5
batch_size = 512
learning_rate=0.0001

dnn=build_dnn(train_x.shape[1],train_y.shape[1])

optimizer=keras.optimizers.SGD(learning_rate=learning_rate)
dnn.compile(loss='mse', optimizer=optimizer,metrics=[keras.metrics.RootMeanSquaredError(name='rmse')])

history = dnn.fit(train_x, train_y,validation_data=(val_x,val_y), epochs=epoch_count, batch_size=batch_size,shuffle = True)
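The effect of the learning rate can be reproduced on a toy problem. The sketch below runs plain gradient descent on the one-dimensional quadratic f(w) = 0.5 · c · w². The function, the curvature value c and the two learning rates are assumptions chosen for illustration and are unrelated to the network above.

```python
def gradient_descent(learning_rate, steps=20, w0=1.0, curvature=10.0):
    """Minimize f(w) = 0.5 * curvature * w**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        gradient = curvature * w          # derivative of f at w
        w = w - learning_rate * gradient  # gradient descent update
    return w

# Each step multiplies w by (1 - learning_rate * curvature):
# a factor larger than 1 in magnitude makes the iterate grow
w_high = gradient_descent(learning_rate=0.25)
# a factor smaller than 1 in magnitude makes it shrink towards the minimum
w_low = gradient_descent(learning_rate=0.05)

print(abs(w_high), abs(w_low))
```

With the larger learning rate every update overshoots the minimum by more than the previous error, so the iterate moves away from it. With the smaller one the error shrinks at every step.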

### **Stop the training process in advance**
Stopping the training when a metric or the loss has stopped improving on the validation set helps to avoid overfitting.

For this purpose, Keras provides a class called [**EarlyStopping**](https://keras.io/api/callbacks/early_stopping/). Important class parameters are:
- *monitor* - the name of the metric or the loss to be observed; 
- *patience* - the number of epochs with no improvement after which training will be stopped;
- *restore_best_weights* - whether to restore model weights from the epoch with the best value of the monitored quantity.

Once an instance of the **EarlyStopping** class has been created, it can be passed to the **fit** method through the *callbacks* parameter.

<u>Note that, it is necessary to create and compile a new model before executing the training process, otherwise it will be performed on a model already trained.</u>

In [None]:
epoch_count = 100
batch_size = 512
learning_rate=0.0001
patience=5

dnn=build_dnn(train_x.shape[1],train_y.shape[1])

optimizer=keras.optimizers.SGD(learning_rate=learning_rate)
dnn.compile(loss='mse', optimizer=optimizer,metrics=[keras.metrics.RootMeanSquaredError(name='rmse')])

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=patience, restore_best_weights=True)

history = dnn.fit(train_x, train_y,validation_data=(val_x,val_y), epochs=epoch_count, batch_size=batch_size,shuffle = True,callbacks=[early_stop])
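The role of *patience* can be illustrated with a simplified simulation in plain Python. `simulate_early_stopping` is a hypothetical helper written for this example, and the loss values are invented. It only mimics the basic patience rule and ignores the other options of the real callback.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stopped_epoch, best_epoch), both counted from 1."""
    best_loss = float('inf')
    best_epoch = None
    wait = 0  # epochs elapsed since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Invented validation losses: the minimum so far is reached at epoch 3
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.69]

stopped_epoch, best_epoch = simulate_early_stopping(val_losses, patience=3)
print(stopped_epoch, best_epoch)

patient_stop, patient_best = simulate_early_stopping(val_losses, patience=5)
print(patient_stop, patient_best)
```

With a patience of 3 the simulated training stops at epoch 6 and the best weights are those of epoch 3. With a patience of 5 it survives long enough to reach the better loss at epoch 7. A small patience saves time but can stop before a later improvement.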

### **Visualize the training process**
The following code calls the **plot_history** function defined above to draw in a graph the loss and RMSE trend over epochs on both training and validation sets.

In [None]:
plot_history(history,'rmse')

## **Performance evaluation**
The following code calls the **predict** method to generate the predictions (*train_y_pred*, *val_y_pred* and *test_y_pred*) of the training, validation and test sets (*train_x*, *val_x* and *test_x*).

In [None]:
train_y_pred=dnn.predict(train_x)
val_y_pred=dnn.predict(val_x)
test_y_pred=dnn.predict(test_x)

print('Train predictions shape: ',train_y_pred.shape)
print('Validation predictions shape: ',val_y_pred.shape)
print('Test predictions shape: ',test_y_pred.shape)

### **RMSE**
The regression accuracy can be measured using the RMSE.

In [None]:
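# RMSE of each target, averaged over the targets
rmse_train = np.sqrt(mean_squared_error(train_y,train_y_pred,multioutput='raw_values')).mean()
rmse_val = np.sqrt(mean_squared_error(val_y,val_y_pred,multioutput='raw_values')).mean()
rmse_test = np.sqrt(mean_squared_error(test_y,test_y_pred,multioutput='raw_values')).mean()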

print('RMSE - Train: {:.3f} Val: {:.3f} Test: {:.3f}'.format(rmse_train,rmse_val,rmse_test))

### **True vs predicted values and error distributions**
To better analyze the model performance on the test set, it is useful to plot the predicted and the true values and to visualize the error distribution. 

In [None]:
plot_prediction_results(test_y,test_y_pred,target_data,200)

### **Best and worst predictions**
To select best and worst predictions the RMSE value for each test instance is computed.

In [None]:
rmse_test_instances=np.sqrt(mean_squared_error(test_y.transpose(),test_y_pred.transpose(),multioutput='raw_values'))

rmse_test_instances_sorted_indices=np.argsort(rmse_test_instances)

The following code shows the best predictions returned by the model.

In [None]:
with np.printoptions(precision=1, suppress=True):
  print('RMSE:')
  print(rmse_test_instances[rmse_test_instances_sorted_indices[:row_count]])

  print('True values:')
  print(test_y[rmse_test_instances_sorted_indices[:row_count]])

  print('Predicted values:')
  print(test_y_pred[rmse_test_instances_sorted_indices[:row_count]])

The following code shows the worst predictions of the model.

In [None]:
with np.printoptions(precision=1, suppress=True):
  print('RMSE:')
  print(rmse_test_instances[rmse_test_instances_sorted_indices[-row_count:]])

  print('True values:')
  print(test_y[rmse_test_instances_sorted_indices[-row_count:]])

  print('Predicted values:')
  print(test_y_pred[rmse_test_instances_sorted_indices[-row_count:]])

# **Exercise 1**
Improve the performance of the DNN model. It is recommended to evaluate the following hyperparameters (listed in priority order):
1. the depth of the network and the number of neurons per hidden layer (*neuron_count_per_hidden_layer*);
2. the number of training epochs (*epoch_count*);
3. the optimization algorithm (*optimizer*);
4. the learning rate (*learning_rate*);
5. the mini-batch size (*batch_size*).

# **Exercise 2**
Solve another regression problem chosen from the following list:
- [Seoul Bike Sharing Demand Data Set](https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand);
- [Bike Sharing Dataset Data Set](https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset);
- [PM2.5 Data of Five Chinese Cities Data Set](https://archive.ics.uci.edu/ml/datasets/PM2.5+Data+of+Five+Chinese+Cities);
- [Metro Interstate Traffic Volume Data Set](https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume);
- [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Wine+Quality).