# Introduction to Training Neural Networks with Keras
## And universal function approximation
## IADS Summer School, 2nd August 2022

### Dr Michael Fairbank, University of Essex, UK

- Email: m.fairbank@essex.ac.uk
- This is Jupyter Notebook 1.2 of the course

## A simple 1D function

- First build some datapoints that represent a simple 1D function, for the sake of a learning example...

In [None]:
import numpy as np
import math
import matplotlib.pyplot as plt

# dataset for a simple regression problem (1 input 1 output):
x_train=np.linspace(-2,2,100).astype(np.float32).reshape(100,1)
y_train=(np.sin(x_train*math.pi)*0.3+x_train-2).astype(np.float32).reshape(100,1)

In [None]:
print("x_train",x_train.shape, x_train[0:10])
print("y_train",y_train.shape, y_train[0:10])

In [None]:
# show training set
plt.plot(x_train, y_train)

## Build a neural-network model capable of learning this function, from the datapoints
- Use Keras to build a 3-layer feed-forward network (i.e. with 2 hidden layers).
<img src="./images/ffnn_3layers.svg" alt="3-layer FFNN" width="400">

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Define Sequential model with 3 layers
model = keras.Sequential(name="my_neural_network")
layer1=layers.Dense(10, activation="tanh", input_shape=(1,))
model.add(layer1)
layer2=layers.Dense(10, activation="tanh")
model.add(layer2)
layer3=layers.Dense(1)
model.add(layer3)

In [None]:
model.summary()

## Understanding the model architecture

- The above model printout shows us this is a 3-layer neural network.  

- The "Output Shape" shows us how many outputs each layer has.  The first dimension is the batch size (which is flexible, hence "None"), and the second dimension is the number of outputs for that layer.

- We can see from the final layer's output shape how many outputs this network has.

- We can see from line 7 of the code (where `layer1` is defined with its `input_shape` argument) how many inputs this network has.

**Questions:** 

1. How many inputs and how many outputs does this neural network have?  **Answer**:
2. Why do all of the output shapes start with "None"?  **Answer**:
3. What is the "rank" of all of the output shapes?  **Answer**:



## Understanding model layers

- In this network layer1 and layer2 are called "hidden layers", because they are only used for the internal calculation of the network output.

- Each layer is parameterised by one or more tensors.  Tensors are just multidimensional arrays of numbers, e.g. a matrix or a vector.
    - E.g. a matrix of shape $5 \times 5$ is a rank-2 tensor of shape=(5,5)

- For each Dense layer, there is one weights matrix and one bias vector.  These can be seen below.  

- They are initially created with random values.
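
To make the terms "rank" and "shape" concrete, here is a minimal NumPy sketch. The layer sizes (3 inputs, 4 units) and the `_demo` names are hypothetical, chosen only for illustration, and are not the sizes used in the network above. The values are set to zero because only the shapes matter here.

```python
import numpy as np

# Hypothetical layer sizes, for illustration only (not the network above)
n_inputs, n_units = 3, 4

W_demo = np.zeros((n_inputs, n_units))  # weights matrix: a rank-2 tensor
b_demo = np.zeros(n_units)              # bias vector: a rank-1 tensor

print("W rank", W_demo.ndim, "shape", W_demo.shape)
print("b rank", b_demo.ndim, "shape", b_demo.shape)

# Total parameters = one weight per (input, unit) pair + one bias per unit
n_params = W_demo.size + b_demo.size
print("parameters", n_params)
```

The same counting rule can be applied to each Dense layer of the real network when answering the questions that follow.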


In [None]:
print("layer1 weights",layer1.trainable_weights)

**Question:**

1. How many parameters in W and b are there for the first layer? **Answer:**


In [None]:
print("layer2 weights",layer2.trainable_weights)

In [None]:
print("layer3 weights",layer3.trainable_weights)

**Questions:**

1. How many parameters in W and b are there for the second layer?  **Answer**:

2. Do these match what model.summary() reported (in the earlier code block)?  **Answer**:


- Each layer acts as a callable function, and the whole model we have created acts as a callable function.

- Layer1 has a weights matrix W (with shape \[1,10\]) and a bias vector b (with shape \[10\]).  It computes its output $y$ for an input $x$ by $y=\tanh(xW+b)$

In [None]:
# Try putting a single input into the layer 1
print(layer1(np.array([[4]])))

Can you verify that this matches $y=\tanh(xW+b)$?  
- **Do this**: Fill in the missing line of code below to help you, and check you get the same output as above.

In [None]:
x=tf.constant([[4.0]])
W=layer1.trainable_weights[0] # This is the weight matrix of shape [1,10]
b=layer1.trainable_weights[1] # This is the bias vector of shape [10]
print(tf.matmul(x,W))  # TODO fix this line using the tensorflow functions tf.tanh(A) and tf.matmul(A,B) and the tf.add(A,B) functions.  

- Note that in the final add in the above code, TensorFlow used ["broadcasting"](https://numpy.org/devdocs/user/theory.broadcasting.html) to allow it to add a rank-2 tensor (a 2d array) to a rank-1 tensor (a 1d array).

- The whole network acts as a function too.  It just puts the input into the first layer, and then the output of that into the next layer, and so on.
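
The broadcasting note above can be checked in isolation. The sketch below uses NumPy, on the assumption that it follows the same broadcasting rules described in the linked page. The array values and names are made up for illustration.

```python
import numpy as np

# Rank-2 array, shape (2, 3): e.g. one row per input vector in a batch
xW_demo = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])
# Rank-1 array, shape (3,): e.g. a bias vector
bias_demo = np.array([10.0, 20.0, 30.0])

# Broadcasting treats the rank-1 array as if it were repeated for every row
total = xW_demo + bias_demo
print(total)

# Same result when the bias is tiled explicitly into a (2, 3) array
explicit = xW_demo + np.tile(bias_demo, (2, 1))
print(np.array_equal(total, explicit))
```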


**Questions:** 

1. If the $k$th layer can be written as a function $y=\tanh(xW_k+b_k)$, then how could we write the whole network as a single mathematical function?  **Answer:**  (Enter in markdown here):

2. Why do we need the tanh functions after every layer?  What would happen if we removed them?  **Answer:**


- The neural network expects its input to be a rank-2 tensor (i.e. a matrix)
- Each row of that matrix corresponds to a different input vector.

In [None]:
# Try putting a single input into the whole network
print(model(np.array([[4]])))

In [None]:
# Try putting a "batch" of 2 input vectors through the network
print(model(np.array([[4],[2]])))

- Notice how, even though the model function accepts 1 input, it can process two 1d-vector inputs at the same time.  They are processed independently of each other: the output for "4" in this batch is the same as the output we got when we pushed "4" through the model on its own.
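
The independence of rows within a batch can also be seen outside Keras. This is a minimal NumPy sketch of a single tanh layer with hypothetical random parameters (1 input, 5 units). It is not the Keras model itself, only an illustration of why each row is computed separately.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical random parameters for one tanh layer with 1 input and 5 units
W_demo = rng.normal(size=(1, 5))
b_demo = rng.normal(size=5)

def demo_layer(x):
    """Apply y = tanh(xW + b) to every row of x."""
    return np.tanh(x @ W_demo + b_demo)

single = demo_layer(np.array([[4.0]]))         # batch of 1, output shape (1, 5)
batch = demo_layer(np.array([[4.0], [2.0]]))   # batch of 2, output shape (2, 5)

# The row for input 4 is unaffected by the presence of the other row
print(np.allclose(single[0], batch[0]))
```

Each output row depends only on the matching input row, because the matrix product combines one row of the input with the shared weights.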

In [None]:
# Let's put a whole "batch" of x values through:
print("input vectors", x_train[0:10,:])
print("output vectors", model(x_train[0:10]))

In [None]:
# Let's plot the model's current behaviour:
plt.plot(x_train, y_train, label = "targets")
plt.plot(x_train, model(x_train).numpy(),label="model output")
plt.legend()

- The above graph shows the neural network is not doing what we want it to yet
    - because we've just built our network with entirely random weights.  

## Training the neural network

So next we'll "train" the network, i.e. change the values of its weights so that its outputs match the target curve.  Note that, by the universal function approximation theorem for neural networks, a network with enough hidden nodes can in theory approximate any continuous function (over a bounded input range) to arbitrary accuracy.  

There is no closed-form solution to this "training" problem, so we need to use an iterative numerical method.
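
As a rough illustration of what "iterative numerical method" means, the sketch below runs plain gradient descent on a toy straight-line fit. The data, the learning rate and the variable names are all hypothetical. A straight-line fit does have a closed-form solution; it is used here only because its gradients are easy to write by hand. This is not the Adam optimizer that the notebook uses.

```python
import numpy as np

# Toy data from a known line, y = 3x - 1 (hypothetical, for illustration)
x_toy = np.linspace(-1.0, 1.0, 50)
y_toy = 3.0 * x_toy - 1.0

slope, intercept = 0.0, 0.0   # initial guesses for the two parameters
learning_rate = 0.1

for step in range(500):
    error = (slope * x_toy + intercept) - y_toy
    # Gradients of the mean squared error with respect to each parameter
    grad_slope = 2.0 * np.mean(error * x_toy)
    grad_intercept = 2.0 * np.mean(error)
    # Move each parameter a small step downhill
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

print("slope", slope, "intercept", intercept)
```

For a neural network the idea is the same, but there are many more parameters and the gradients are computed automatically by the library.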

First we define a loss function which we want to minimise with respect to all of the trainable variables in the neural network.



In [None]:
model.compile(
    optimizer=keras.optimizers.Adam(0.01),  # Optimizer
    # Loss function to minimize
    loss=keras.losses.MeanSquaredError(),
    # List of metrics to monitor
    metrics=[keras.metrics.MeanSquaredError()],
)

Next we run the iterative procedure.  Here we say we're going to run a full pass through the training set (all of the elements of x_train), 1000 times...

In [None]:
history = model.fit(
    x_train,
    y_train,
    batch_size=len(x_train),
    epochs=1000
)
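
Because `batch_size` equals the size of the training set here, each epoch makes a single weight update. The small sketch below counts updates for a few batch sizes, assuming the usual rule that the number of steps per epoch is the dataset size divided by the batch size, rounded up. The helper `total_updates` is hypothetical and not part of Keras.

```python
import math

n_samples = 100      # size of the training set used in this notebook
epochs = 1000

def total_updates(batch_size):
    """Number of weight updates over all epochs for a given batch size."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * epochs

print(total_updates(100))  # full batch: one update per epoch
print(total_updates(10))   # mini-batches of 10: ten updates per epoch
```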

Now we see how the neural network's output has (hopefully) improved...

In [None]:
plt.plot(x_train, y_train, label = "targets")
plt.plot(x_train, model(x_train).numpy(),label="model output")
plt.legend()

Run the previous 2 cells again to train the network a bit more.  

We can see the universal function approximation capability of the neural network in action.

Next we'll view how the training process has changed some of the weights from their earlier values.

In [None]:
print("layer1 weights",layer1.trainable_weights)

## Understanding the Training Objective, and Loss Function

These weights have changed because the training process works by iteratively adjusting the weights to perform (a variant of) gradient descent on the "loss" function.  Here we used the Mean Squared Error, so we have minimised
$$L=\frac{1}{N}\sum_{k=1}^N (f(x_k,w)-y_k)^2$$
with respect to all of the weights $w$, where $w=$(layer1.trainable_weights, layer2.trainable_weights, layer3.trainable_weights), $f$ is the neural network model, $N$ is the number of training points, and $(x_k, y_k)$ are the $k$th training point's $x$ and (target) $y$ values.
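
The loss formula above can be evaluated by hand on a few numbers. The predictions and targets in this sketch are made up for illustration and do not come from the trained model.

```python
import numpy as np

# Hypothetical predictions f(x_k, w) and targets y_k for N = 4 points
predictions = np.array([[1.0], [2.0], [3.0], [4.0]])
targets = np.array([[1.5], [2.0], [2.0], [5.0]])

# L = (1/N) * sum over k of (f(x_k, w) - y_k)^2
N = len(targets)
loss = np.sum((predictions - targets) ** 2) / N
print(loss)
```

With the notebook's own variables, an expression such as `np.mean((model(x_train).numpy() - y_train) ** 2)` should give a value close to the loss reported during training.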

We can plot how $L$ decreased over time during training...

In [None]:
plt.plot(history.history['loss'])
plt.title('model loss')
plt.yscale('log')
plt.ylabel('loss')
plt.xlabel('epoch')

## Saving your network

- We can save our final model, and its weights and biases, as follows:

In [None]:
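# Saves to a folder; newer Keras releases (Keras 3) instead require a filename ending in .keras or .h5
model.save('saved_model')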

- We can then load it back at a later date with...

In [None]:
model2 = keras.models.load_model('saved_model') # just need to give it a folder name here.
model2.summary()

## Further Challenges

If you get time today then:

- What happens if we put a tanh activation function into the final layer?  Try it.  What problems do we get for learning this particular dataset?  **Answer:**

- How many hidden layers should we have?  Try removing layer 1 and layer 2, so the neural network becomes a simple linear function, and retrain it.  What happens then?  **Answer:**

- What will happen to the function approximation capabilities of this network if we increase the number of nodes in each hidden layer?  **Answer:**

## Follow-up Reading

- Learn more about the [Keras train and evaluate](https://www.tensorflow.org/guide/keras/train_and_evaluate) process.

- For most learning tasks you need a validation set too, and you can use it to check you are not overfitting the data.  See [overfit and underfit](https://www.tensorflow.org/tutorials/keras/overfit_and_underfit).
