# **TP4: first steps towards deep neural networks**


# Context & Objectives

In this lab session (TP), we reformulate linear and logistic regression models as single-layer neural networks and implement them in [TensorFlow](https://www.tensorflow.org/?hl=fr). We also benchmark the performance of several models, from a simple linear regression to a fully connected deep neural network.



We will use the Auto MPG dataset. The cells below import the required libraries, set up TensorBoard, and download the data:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)


2.8.0


In [None]:
# Load the TensorBoard notebook extension.
%load_ext tensorboard

from datetime import datetime
from packaging import version

import tensorflow as tf
from tensorflow import keras

print("TensorFlow version: ", tf.__version__)
assert version.parse(tf.__version__).release[0] >= 2, \
    "This notebook requires TensorFlow 2.0 or above."

import tensorboard
tensorboard.__version__

# Clear any logs from previous runs
! rm -rf ./logs/

# Define the Keras TensorBoard callback.
logdir="logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)

TensorFlow version:  2.8.0


In [None]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

raw_dataset = pd.read_csv(url, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)

dataset = raw_dataset.copy()

dataset.tail()
dataset.shape

(398, 8)

In [None]:
dataset.describe().T[['mean', 'std']]

| | mean | std |
|---|---|---|
| MPG | 23.514573 | 7.815984 |
| Cylinders | 5.454774 | 1.701004 |
| Displacement | 193.425879 | 104.269838 |
| Horsepower | 104.469388 | 38.49116 |
| Weight | 2970.424623 | 846.841774 |
| Acceleration | 15.56809 | 2.757689 |
| Model Year | 76.01005 | 3.697627 |
| Origin | 1.572864 | 0.802055 |


# 1. Dataset preparation

**Question 1.1 (BONUS, the solution is given)**: using the pandas library, count and remove all NaN values. The variable `Origin` encodes countries: the values 1, 2 and 3 correspond to 'USA', 'Europe' and 'Japan'. Replace the numerical values with these names, then one-hot encode the variable with [pd.get_dummies](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).


In [None]:
# Here is what a one-hot encoding looks like.
# dtype=int keeps 0/1 values (recent pandas versions default to booleans).
pd.get_dummies(dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'}), prefix='', prefix_sep='', dtype=int)

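| | Europe | Japan | USA |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| ... | ... | ... | ... |
| 393 | 0 | 0 | 1 |
| 394 | 1 | 0 | 0 |
| 395 | 0 | 0 | 1 |
| 396 | 0 | 0 | 1 |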


In [None]:
# Count the number of NaNs in each column
number_nans = dataset.isna().sum()

# Remove all rows containing at least one NaN; the explicit copy avoids
# pandas' SettingWithCopyWarning on the assignment below
dataset = dataset.dropna().copy()

# Replace the numeric codes in Origin with the country names
dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})

# One-hot encode Origin: three indicator columns replace the original variable.
# dtype=int keeps 0/1 values (recent pandas versions default to booleans).
dataset = pd.get_dummies(dataset, columns=['Origin'], prefix='', prefix_sep='', dtype=int)


**Question 1.2 (BONUS, the solution is given)**: create the train/test partition with a ratio of 0.8/0.2, using the variable `MPG` as the label vector.

*Tip: to keep working with pandas objects, you can use the [`sample`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) method.*

In [None]:
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

train_features = train_dataset.copy()
test_features = test_dataset.copy()

# pop() returns the MPG column (the labels) and removes it from the features
train_labels = train_features.pop('MPG')
test_labels = test_features.pop('MPG')

**Question 1.3**: create a NumPy array `X` by normalizing your training features with [`tf.keras.layers.Normalization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization).

We will not use `X` in the following sections, because the normalization will be built directly into each model as a neural network layer.
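As a minimal sketch of Question 1.3, the layer can be adapted on the training features and then called on them to obtain `X`. The toy array below is a hypothetical stand-in for `np.array(train_features)`, and the last lines compare the result with a manual standardization in NumPy.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for np.array(train_features): 4 samples, 2 features.
toy_features = np.array([[130.0, 3504.0],
                         [165.0, 3693.0],
                         [150.0, 3436.0],
                         [95.0, 2372.0]], dtype="float32")

# adapt() computes the per-feature mean and variance of the data it is given.
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(toy_features)

# Calling the layer applies (x - mean) / sqrt(variance), feature by feature.
X = normalizer(toy_features).numpy()

# Same computation by hand, for comparison (NumPy's std uses ddof=0 by default).
X_manual = (toy_features - toy_features.mean(axis=0)) / toy_features.std(axis=0)
print(np.allclose(X, X_manual, atol=1e-3))
```

The statistics are learned from the training data only, so the same adapted layer can later be applied to the test features without leaking information from them.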

# 2. Linear regression model (using a neural network)



In this section, we predict the variable `MPG` from the single predictor `Horsepower`.



In [None]:
horsepower = np.array(train_features['Horsepower'])[:,np.newaxis]

**Question 2.1**: build a neural network architecture for a linear regression using the `tf.keras.Sequential` tool. Reuse [`tf.keras.layers.Normalization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization) to integrate the normalization into the architecture. Visualize the model with the `summary()` method and comment on the output shapes and the number of parameters.
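One possible architecture, given here as a sketch rather than the reference solution: a normalization layer followed by a single dense neuron with no activation, which computes `w * x_normalized + b`. The array `toy_horsepower` is a hypothetical stand-in for the `horsepower` array defined above.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical stand-in for the `horsepower` array (shape: n_samples x 1).
toy_horsepower = np.array([[130.0], [165.0], [150.0], [95.0], [88.0]], dtype="float32")

# axis=None makes the layer learn one scalar mean and one scalar variance.
normalizer = layers.Normalization(axis=None)
normalizer.adapt(toy_horsepower)

# Linear regression: normalization followed by one neuron without activation.
linear_model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    normalizer,
    layers.Dense(units=1),
])

linear_model.summary()

# Predictions of the untrained model: one value per input sample.
toy_predictions = linear_model(toy_horsepower)
print(toy_predictions.shape)
```

The dense layer holds the only trainable parameters (one weight and one bias); the statistics stored in the normalization layer are not updated by gradient descent.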

**Question 2.2**: [compile](https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile) the model using a Stochastic Gradient Descent optimizer, with the Mean Absolute Error as both the loss and the evaluation metric. [Fit](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) it to your data.
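A hedged sketch of the compile and fit steps follows. The data are synthetic, and the learning rate, number of epochs and validation split are assumed values chosen for illustration, not values prescribed by this TP.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical toy data standing in for `horsepower` and `train_labels`.
rng = np.random.default_rng(0)
toy_x = rng.uniform(50.0, 200.0, size=(64, 1)).astype("float32")
toy_y = (40.0 - 0.15 * toy_x + rng.normal(0.0, 1.0, size=(64, 1))).astype("float32")

normalizer = layers.Normalization(axis=None)
normalizer.adapt(toy_x)
model = tf.keras.Sequential([tf.keras.Input(shape=(1,)), normalizer, layers.Dense(1)])

# MAE is used both as the loss being minimized and as the reported metric.
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),  # assumed learning rate
    loss="mean_absolute_error",
    metrics=["mean_absolute_error"],
)

# fit() returns a History object; history.history maps each name to per-epoch values.
history = model.fit(toy_x, toy_y, epochs=20, validation_split=0.2, verbose=0)
print(sorted(history.history.keys()))
```

The lists stored under `"loss"` and `"val_loss"` in `history.history` are what one would pass to `plt.plot` to follow the optimization epoch by epoch.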

**Question 2.3**: visualize the evolution of your loss function over the training epochs.

**Question 2.4**: plot the regression line on your data.

*Tip: there are two possible solutions, 1) based on your model's predictions, using [predict](https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict), or 2) based on your model's coefficients, using [get_weights](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer#get_weights).*
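For the second option, note that the weights live in normalized units, so they must be converted before drawing a line over the raw `Horsepower` axis. Assuming the model is a normalization layer followed by one dense neuron, its prediction is `w * (x - mean) / std + b`, which is a line with slope `w / std` and intercept `b - w * mean / std`. The sketch below checks this identity with NumPy; all numeric values are hypothetical placeholders for what `get_weights` would return.

```python
import numpy as np

# Hypothetical placeholders for the values returned by get_weights():
mean, variance = 104.0, 1480.0   # statistics held by the normalization layer
kernel, bias = -6.0, 23.0        # weight and bias of the single dense neuron

std = np.sqrt(variance)
slope = kernel / std
intercept = bias - kernel * mean / std

# The line in raw units matches the "normalize, then apply the neuron" computation.
x = np.linspace(0.0, 250.0, 251)
y_from_line = slope * x + intercept
y_from_layers = kernel * (x - mean) / std + bias
print(np.allclose(y_from_line, y_from_layers))
```

Passing `x` and `y_from_line` to `plt.plot`, on top of a scatter plot of the data, then draws the regression line.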


**Question 2.5**: [evaluate](https://www.tensorflow.org/api_docs/python/tf/keras/Model#evaluate) your model's performance on the test set, and store the results in a dictionary keyed by model name.

# 3. Multilinear regression model (using a neural network)

**Question 3.1**: answer all the previous questions for a multilinear regression model that takes all available variables as inputs. Comment on the optimization and performance results.

# 4. Fully-connected multi-layer neural network model

**Question 4.1**: write a function `def build_and_compile_model(input, nber_layer, nber_neurons_per_layer, activation):` that builds a fully-connected multi-layer neural network architecture with `nber_layer` layers, each containing `nber_neurons_per_layer` neurons, with the activations stored in the list `activation`.

**Question 4.2**: using this function, build and compile a DNN model with two layers of 64 and 32 neurons with `relu` activations to predict MPG from all other variables.
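One possible shape for this function is sketched below, under several assumptions that the statement leaves open: `input` is taken to be an already adapted normalization layer, `nber_neurons_per_layer` and `activation` are taken to be lists with one entry per hidden layer, a final single-neuron layer produces the MPG prediction, and the Adam optimizer with a learning rate of 0.001 is an arbitrary choice. The feature array is synthetic.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers


def build_and_compile_model(input, nber_layer, nber_neurons_per_layer, activation):
    """Build and compile a fully connected regression network.

    Assumptions: `input` is an adapted Normalization layer, and the two lists
    hold one entry per hidden layer.
    """
    assert len(nber_neurons_per_layer) == nber_layer == len(activation)
    model = tf.keras.Sequential()
    model.add(input)
    for units, act in zip(nber_neurons_per_layer, activation):
        model.add(layers.Dense(units, activation=act))
    model.add(layers.Dense(1))  # single output neuron for the regression target
    model.compile(
        loss="mean_absolute_error",
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # assumed optimizer
    )
    return model


# Hypothetical stand-in for np.array(train_features): 16 samples, 3 features.
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(16, 3)).astype("float32")

normalizer = layers.Normalization(axis=-1)
normalizer.adapt(toy_features)

# Two hidden layers of 64 and 32 neurons with relu activations.
dnn_model = build_and_compile_model(normalizer, 2, [64, 32], ["relu", "relu"])
toy_output = dnn_model(toy_features)  # calling the model builds its weights
dnn_model.summary()
```

Passing the normalization layer as an argument keeps the function reusable: the same code serves the single-variable and the all-variables settings, since only the adapted layer changes.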

**Question 4.3**: display a result table showing the MAE for the two models
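If the test MAE of each model is stored in a dictionary keyed by model name, pandas can turn it into a table directly. In the sketch below, the labels, the predictions and the names `linear_model` and `dnn_model` are hypothetical toy values used only to populate the dictionary; in the TP, the values would come from `evaluate`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels and predictions, used only to fill the dictionary.
toy_labels = np.array([18.0, 15.0, 24.0, 30.0])
toy_predictions = {
    "linear_model": np.array([20.0, 17.0, 22.0, 27.0]),
    "dnn_model": np.array([18.5, 15.5, 23.0, 29.0]),
}

# MAE = mean of the absolute differences between predictions and labels.
test_results = {
    name: float(np.mean(np.abs(pred - toy_labels)))
    for name, pred in toy_predictions.items()
}

# One row per model, one column for the metric.
results_table = pd.Series(test_results, name="Mean absolute error [MPG]").to_frame()
print(results_table)
```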

# 5. Logistic regression model 

Having benchmarked our models on a regression task, we now turn to a binary classification task using a logistic regression model.

**Question 5.1**: let's start by building a dataset suited to this classification task. From the variable `Cylinders`, build a new binary variable equal to 0 when the number of cylinders is below 5, and 1 otherwise.
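A minimal sketch of this thresholding with pandas is given below. The toy DataFrame and the column name `ManyCylinders` are hypothetical; in the TP, the same comparison would be applied to the `Cylinders` column of the train and test sets.

```python
import pandas as pd

# Hypothetical toy column standing in for dataset['Cylinders'].
toy = pd.DataFrame({"Cylinders": [4, 8, 6, 3, 5]})

# The comparison yields booleans; astype(int) converts them to 0/1 labels.
toy["ManyCylinders"] = (toy["Cylinders"] >= 5).astype(int)
print(toy)
```

Since the new variable is derived from `Cylinders`, that column should presumably be removed from the input features, otherwise the classifier only has to learn the threshold.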

**Question 5.2**: starting from the neural network architecture of the linear regression model (part 2), build a neural network architecture that performs a logistic regression.

*Tip: think about the neuron's activation function.*
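The tip points to the sigmoid function: a logistic regression is the same linear neuron as before, followed by a sigmoid that maps its output into the interval (0, 1), so that it can be read as a class probability. The NumPy sketch below illustrates this with a hypothetical weight, bias and inputs.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weight, bias and normalized inputs, chosen only for illustration.
w, b = 2.0, -0.5
x = np.array([-2.0, 0.0, 0.25, 2.0])

linear_output = w * x + b                 # what the linear regression neuron computes
probabilities = sigmoid(linear_output)    # what the logistic regression neuron outputs
predicted_class = (probabilities >= 0.5).astype(int)
print(probabilities, predicted_class)
```

In Keras, this corresponds to passing `activation="sigmoid"` to the single `Dense` layer.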

**Question 5.3**: compile and fit your model with the proper [loss](https://keras.io/api/losses/), and visualize the evolution of the loss and accuracy metrics.
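For a single sigmoid output and 0/1 labels, binary cross-entropy is the usual loss. The sketch below uses synthetic data, and the optimizer settings and number of epochs are assumed values, so it illustrates the workflow rather than the expected results.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical toy data: one feature, label 1 when the feature exceeds 100.
rng = np.random.default_rng(0)
toy_x = rng.uniform(50.0, 200.0, size=(128, 1)).astype("float32")
toy_y = (toy_x > 100.0).astype("float32")

normalizer = layers.Normalization(axis=None)
normalizer.adapt(toy_x)

# Logistic regression: normalization followed by one sigmoid neuron.
logistic_model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    normalizer,
    layers.Dense(1, activation="sigmoid"),
])

logistic_model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),  # assumed learning rate
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

history = logistic_model.fit(toy_x, toy_y, epochs=30, validation_split=0.2, verbose=0)
print(sorted(history.history.keys()))
```

The per-epoch lists in `history.history` (for example `"loss"` and `"accuracy"`, with their `val_` counterparts) can be plotted with `plt.plot` to follow both metrics during training.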

# 6. Visualizing model graphs with TensorBoard (optional)

**Question 6:** use [TensorBoard](https://www.tensorflow.org/tensorboard/graphs) to examine the graph of your last model.
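TensorBoard reads its data from the log directory written by the Keras `TensorBoard` callback, so the callback has to be passed to `fit`. The sketch below uses a hypothetical toy model and a temporary log directory; in this notebook, the `tensorboard_callback` and `logdir` defined in the setup cell would play the same role.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical toy data and a temporary log directory, for illustration only.
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(32, 1)).astype("float32")
toy_y = (toy_x > 0).astype("float32")
toy_logdir = os.path.join(tempfile.mkdtemp(), "fit")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")

# The callback writes the graph and the training curves under log_dir during fit().
toy_callback = tf.keras.callbacks.TensorBoard(log_dir=toy_logdir)
model.fit(toy_x, toy_y, epochs=2, verbose=0, callbacks=[toy_callback])

print(os.listdir(toy_logdir))
```

In a notebook where the extension has been loaded with `%load_ext tensorboard`, running `%tensorboard --logdir logs` then opens the dashboard, and the Graphs tab displays the model graph.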