# Building neural networks for regression

So far we have focused on classification into two classes with neural networks. Now, we will look at the use of neural networks for regression problems. We're going back to our diabetes progression dataset that we've seen in our regression tree building. We will also talk about learning rate - one of the parameters that you can choose in building your network. The learning rate refers to how strongly the model is changed during each of the weights/biases update step of backpropagation. We'll talk about it more later in this script.

## Dataset

You'll already be familiar with this dataset.

In [1]:
##### added line to ensure plots are showing
%matplotlib inline
#####

import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler

dataset = load_diabetes()

X = pd.DataFrame(data=dataset['data'],columns=dataset['feature_names'])

y = pd.DataFrame(data=dataset['target'],columns=['progression'])

## Building a neural network regressor

Again, we prepare our training and test sets:

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Scale the training and the test data
X_train = StandardScaler().fit_transform(X_train)
X_test = StandardScaler().fit(X_train).transform(X_test)




This time we obtain a metric that can be used for regression, the mean squared error, which is also used for the loss function. You'll remember this metric as the equivalent to, for example, cross entropy which is used for classification tasks.

We now also use a different optimiser (instead of stochastic gradient descent) which works better in this instance and run 10 epochs.

In [4]:
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from sklearn.metrics import mean_squared_error as mse
from tensorflow.keras.optimizers import RMSprop, Adam, SGD

input_dim = X_train.shape[1]
# We only have 1 output dimension, as our regression outputs a real number
output_dim = 1

model = Sequential()
model.add(Dense(50,input_dim=input_dim))

model.add(Dense(output_dim))

# We now use a dedicated optimizer instance - this allows us to input the learning rate later
model.compile(optimizer=Adam(),loss='mean_squared_error',metrics=['mean_squared_error'])

model.summary()

# We add the number of epochs as a parameter to our fit method
model.fit(X_train,y_train,epochs=10)

prediction = model.predict(X_test)

print('RMSE:', np.sqrt(mse(y_test,prediction)))

2022-12-01 12:07:00.013455: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-01 12:07:00.153615: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-12-01 12:07:00.153653: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-12-01 12:07:00.865968: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2022-

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 50)                550       
                                                                 
 dense_1 (Dense)             (None, 1)                 51        
                                                                 
Total params: 601
Trainable params: 601
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10


2022-12-01 12:07:01.991470: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-12-01 12:07:01.991501: W tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: UNKNOWN ERROR (303)
2022-12-01 12:07:01.991522: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (noteable): /proc/driver/nvidia/version does not exist
2022-12-01 12:07:01.991756: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
RMSE: 166.0361824163538


Note how our mean squared error decreases over the 10 epochs that we're running as the model is learning with each runthrough of the data.

## Hyperparameters

Now, we will try to do a small hyperparameter optimisation exercise where we try to find the best regression model by altering:
- The number of neurons in the hidden layer
- The activation function
- The learning rate
- The number of epochs

From these four we've not discussed the learning rate in a lot of detail so far. But it is one of the more important parameters in tuning your model. Think back to how backpropagation works - we're adjusting weights and biases according to the "wishes" of our training data. How strongly we actually change weights and biases in each of the backpropagation runs is referred to as the learning rate. In the plotted gradient descent example from class you can think about this as the step size that we take down the hill in our search of the minimum.

A large learning rate means that we adapt quicker to our problem. But it also means that we might converge too quickly and not find an optimal solution. Think about this as big steps down the hill.
A small learning rate means that we take greater care in finding our best solution. But small steps take a lot of time. Think about this as small steps down the hill.

We'll now try to optimise the learning rate as well as some other hyperparameters.

We can use ```GridSearchCV``` from scikit-learn. However, we need to make an instance of a neural network we can feed to the grid search. Hence, we first create a neural network with the hyperparameters as inputs:

In [5]:
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from tensorflow.keras.optimizers import Adam

def nn_model(no_neurons,learning_rate,kernel='relu'):
    model = Sequential()
    model.add(Dense(no_neurons,input_dim=X_train.shape[1]))
    model.add(Activation(kernel))

    # Extra hidden layer
    model.add(Dense(no_neurons))
    model.add(Activation(kernel))

    # Output
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    
    # Here, we can add the learning rate to the optimiser
    model.compile(optimizer=Adam(learning_rate=learning_rate),loss='mean_squared_error',metrics=['mean_squared_error'])
        
    return model

Now, we add that model to our grid search as follows. Notice also how we setup the parameters to match the inputs of the model we just created.

Warning: the next cell may take a few minutes to run.

In [6]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

# We create a dictionary again, with keys matching our neural network function we create above 
parameters = {'no_neurons':[5,20],'kernel':['relu','linear'],'learning_rate':[0.0001,0.01],'epochs':[5,10],'verbose':[0]} 

# We wrap our model into KerasClassifier to bridge the gap between scikit-learn and Keras
grid_search = GridSearchCV(KerasClassifier(nn_model), parameters, cv=5,scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train.values.ravel())

means = grid_search.cv_results_['mean_test_score']
stds = grid_search.cv_results_['std_test_score']

print('Mean RMSE (+/- standard deviation), for parameters')
for mean, std, params in zip(means, stds, grid_search.cv_results_['params']):
    print("%0.3f (+/- %0.03f) for %r"
          # The MSE is return as a negative, so we multiple it with -1 before squaring it
          % (np.sqrt(-1*mean), np.sqrt(std), params))

  grid_search = GridSearchCV(KerasClassifier(nn_model), parameters, cv=5,scoring='neg_mean_squared_error')


Mean RMSE (+/- standard deviation), for parameters
146.710 (+/- 55.127) for {'epochs': 5, 'kernel': 'relu', 'learning_rate': 0.0001, 'no_neurons': 5, 'verbose': 0}
147.321 (+/- 55.991) for {'epochs': 5, 'kernel': 'relu', 'learning_rate': 0.0001, 'no_neurons': 20, 'verbose': 0}
142.932 (+/- 54.369) for {'epochs': 5, 'kernel': 'relu', 'learning_rate': 0.01, 'no_neurons': 5, 'verbose': 0}
142.932 (+/- 54.369) for {'epochs': 5, 'kernel': 'relu', 'learning_rate': 0.01, 'no_neurons': 20, 'verbose': 0}
146.197 (+/- 54.962) for {'epochs': 5, 'kernel': 'linear', 'learning_rate': 0.0001, 'no_neurons': 5, 'verbose': 0}
146.250 (+/- 56.566) for {'epochs': 5, 'kernel': 'linear', 'learning_rate': 0.0001, 'no_neurons': 20, 'verbose': 0}
142.960 (+/- 54.367) for {'epochs': 5, 'kernel': 'linear', 'learning_rate': 0.01, 'no_neurons': 5, 'verbose': 0}
142.932 (+/- 54.369) for {'epochs': 5, 'kernel': 'linear', 'learning_rate': 0.01, 'no_neurons': 20, 'verbose': 0}
145.902 (+/- 54.501) for {'epochs': 10, '

Scroll all the way down and you'll see the error rate that is achieved with different combinations of number of epochs/kernel/learning rate/number of neurons.

It seems there is very little difference in terms of RMSE. We cannot say which hyperparameters are working better than others. Perhaps we should do a wider search, but this takes even more time. Later on, we will see how different results can be given the hyperparameters.
A good hyperparameter search can result in very different networks more or less suitable for the data at hand.