# CF969 - Big Data for Computational Finance
## Lab 7b: Neural Networks in TensorFlow

The present notebook for this lab is based on a previous notebook written by Dr. Bart de Keijzer and on a tutorial by Dr. Michael Fairbank; these have been adapted to work with Python 3 and TensorFlow 2. Some of the Python code here is based on code from tensorflow.org and from https://github.com/aymericdamien/TensorFlow-Examples/, by Aymeric Damien.

When going through these notes, please experiment with the pieces of code, take your time with them, and look up the meaning of the various statements in the online TensorFlow documentation whenever you do not fully understand what is happening. 

## Introduction: Building a neural network with Keras
Let's begin by showing a simple neural network for the XOR (eXclusive OR) problem. Here, the input consists of two binary variables (say, $x_1$ and $x_2$) and the output is $0$ when $x_1=x_2$, and $1$ otherwise.

The first attempt uses Keras, a powerful framework on top of TensorFlow. Keras is more high-level than TensorFlow; this makes it easier for the average user to design, train, and test a neural network, as many of the implementation details are taken care of. On the other hand, this allows the user less flexibility and control.

You may notice that the results are not that good. You can try to increase the number of layers (can you guess how?) or the number of nodes per layer (again, how?). You can also experiment with different activation functions (e.g., instead of 'sigmoid', you can select 'relu', or delete the argument and go for the standard linear activation).

We begin by importing necessary packages and libraries.

In [1]:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers import Activation, Dense
import numpy as np




Next, we provide the input, that is, what needs to be learnt. There are four possible inputs (recall, this is a very simple function) and we provide the corresponding outputs.

In [2]:
# X = input of our XOR function
X = np.array(([0,0],[0,1],[1,0], [1,1]), dtype=float)

# y = target output of our XOR function
y  = np.array(([0],[1],[1],[0]), dtype=float)

The next snippet defines the neural network. You can check the [Keras API reference](https://keras.io/api/) if you wish to learn more about the syntax. In particular, we first define the *model* and the *layers*.

Do you get what **Dense** stands for? Or, what is the role of **input_dim=2**? Do you remember what **sigmoid** stands for? 

In [3]:
model = tf.keras.Sequential()
model.add(Dense(2, input_dim=2, activation='sigmoid', use_bias=True))
model.add(Dense(1, activation='sigmoid', use_bias=True))




Time to define the loss function. We choose the Mean Squared Error (well known from the lectures), together with an [optimiser](https://keras.io/api/optimizers/).

In [4]:
# we define the loss function and select the optimizer we wish to use
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])




Now, it is time to *train* the neural network. Recall that we already know what the desired output is for each of the four possible inputs.
We are, therefore, *not* trying to generalise (so that we can handle unknown inputs) but to demonstrate how we can perform computations by having appropriate weights and biases.

Do you get what **fit** does? What are the *epochs*?

In [5]:
# let's train
print("*** Training... ***")
model.fit(X, y, batch_size=4, epochs=10000, verbose=0)
print("*** Training done! ***")

print("*** Model prediction on [[0,0],[0,1],[1,0],[1,1]] ***")
result = model.predict(X).round()
print(result)

*** Training... ***


*** Training done! ***
*** Model prediction on [[0,0],[0,1],[1,0],[1,1]] ***
[[0.]
 [0.]
 [1.]
 [1.]]


The example above created a neural network with one hidden layer. While in theory a hidden layer with 2 nodes is sufficient to solve XOR, in practice convergence to good weights is tricky, as the XOR loss surface admits several local optima where the optimizer may get trapped. Depending on the initial weights (or the number of epochs), you might observe a wrong result when you run the script above, as in the sample output shown, where the predictions for [0,1] and [1,1] are incorrect.

**Task:** Can you solve for the [NAND gate](https://en.wikipedia.org/wiki/NAND_gate) (as opposed to the XOR one)? We briefly discussed NAND in the lecture. It takes as input two binary variables $x_1$ and $x_2$ and returns $0$ if $x_1 = x_2 = 1$, and $1$ otherwise.

You may have noticed that it is easy to define new layers and almost all details were hidden from the programmer; this makes Keras a reasonable (albeit somewhat lazy) approach. In the following, we dig into the details of neural networks. 

## Part I: A Neural Network "by hand" in Only a Few Lines of Code

TensorFlow has some of the basic and common neuron activation functions built-in. A tensor can be passed to such a function, after which said function is applied to each element of the tensor. In particular, there is the function *sigmoid($A$)* which applies the sigmoid function to each element of tensor $A$.

While in our theoretical treatment of neural networks we usually represent a network's input by a column vector, in TensorFlow we will use a *row vector* to represent the input. We do this for ease of programming. Recall that for a sigmoid neuron, the sigmoid function is applied to a value that is a weighted sum of all inputs, plus a bias. Therefore, if we represent the weights by a column vector $w$ and the bias by a scalar $b$ (i.e., a rank 0 tensor), we can write the application of a sigmoid neuron to the input vector $x$ using the command *tf.sigmoid(tf.matmul(x,w)+b)*. Note, however, that for *matmul* to work, we must explicitly define $x$ and $w$ as rank 2 tensors, even though they can be more naturally seen as rank 1 tensors. This is simply because *matmul* expects its arguments to be of rank at least 2.

In [6]:
x=tf.constant([[1,1]], tf.float32) # our input vector
w=tf.constant([[2],[1]],tf.float32) # weight vector of a sigmoid neuron
b=tf.constant(1,tf.float32) # bias of a sigmoid neuron
y=tf.sigmoid(tf.matmul(x,w) + b)
print(y.numpy())

[[0.98201376]]


We can generalise this easily to applying the input to multiple sigmoid neurons (that is, a layer of neurons). Let $A$ be a matrix where the columns are the weight vectors of all the neurons. Let $b$ be a row vector of all the biases of the neurons. The command *tf.sigmoid(tf.matmul(x,A)+b)* then results in a row vector containing the output of each of the neurons. 

**Task:** Take the above code and modify it by adding a second neuron with weights 1 and -1, and with a bias of -0.5. The answer should be a row vector with the values 0.98201376 and 0.37754068.

In [9]:
# Take the above code and modify it by adding a second neuron with weights 1 and -1, and with a bias of -0.5.
# The answer should be a row vector with the values 0.98201376 and 0.37754068.

x=tf.constant([[1,1]], tf.float32) # our input vector
A=tf.constant([[2,1],[1,-1]],tf.float32) # weight matrix: one column per sigmoid neuron
b=tf.constant([[1,-0.5]],tf.float32) # row vector of biases, one per neuron
y=tf.sigmoid(tf.matmul(x,A) + b)
print(y.numpy())

[[0.98201376 0.37754068]]


Thus, we can build any single layer neural network in TensorFlow by specifying only a matrix-vector multiplication, a vector addition, and a single application of the activation function to a row vector.

## Adding Layers to our Network

We can easily add more layers by repeating the same operations, where we treat the output $y$ as the input to the next layer. 
The following code evaluates the input [1,1] on a neural network of two separate single-neuron layers. Analyse the code and verify for yourself that you understand the choice of shapes of the defined tensors.

In [10]:
x=tf.constant([[1,1]], tf.float32) # our input vector
w1=tf.constant([[2],[1]],tf.float32) # weight vector of a sigmoid neuron
w2=tf.constant([[-1]],tf.float32) # weight vector of a second sigmoid neuron
b1=tf.constant(1,tf.float32) # bias of a sigmoid neuron
b2=tf.constant(-0.5,tf.float32)
y1=tf.sigmoid(tf.matmul(x,w1) + b1) # result of first layer
y2=tf.sigmoid(tf.matmul(y1,w2) + b2) # result of second layer
print(y2.numpy())

[[0.18512344]]
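
As a sanity check on the shapes and the arithmetic, the same two-layer forward pass can be reproduced outside TensorFlow. The following is a minimal NumPy sketch under the assumption that it uses the same weights and biases as the cell above; the helper name `sigmoid` is ours and not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[1.0, 1.0]])      # input row vector, shape (1, 2)
w1 = np.array([[2.0], [1.0]])   # first-layer weights, shape (2, 1)
w2 = np.array([[-1.0]])         # second-layer weights, shape (1, 1)
b1, b2 = 1.0, -0.5              # scalar biases

y1 = sigmoid(x @ w1 + b1)       # (1, 2) @ (2, 1) -> (1, 1)
y2 = sigmoid(y1 @ w2 + b2)      # (1, 1) @ (1, 1) -> (1, 1)
print(y1.shape, y2.shape)
print(y2)
```

The printed value should agree with the TensorFlow output above up to floating-point precision (NumPy uses 64-bit floats by default, TensorFlow here uses 32-bit).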


**Exercise:** Consider the following neural network with one hidden layer.

![title](images/neuralnet.png)

Answer the following questions about this neural network when one would implement it in TensorFlow, if we use $W_1$ and $b_1$ for the weights and biases of the hidden layer, $W_2$ and $b_2$ for the weights and biases of the output layer, $h_1$ for the output of the hidden layer, and $y$ for the final output of the network:

* What is the (TensorFlow)-shape of $W_1$?
* What is the shape of $W_2$?
* What is the shape of $b_1$?
* What is the shape of $b_2$?
* What is the shape of $h_1$?
* What is the shape of $y$?

Please ask us to verify your answer.

**Task:** Build a neural network with two inputs, one hidden layer of 2 neurons, and an output layer with 1 neuron. For all neurons, use the sigmoid activation function. The weights should be initialised randomly according to a normal distribution with standard deviation 0.1. This can be done by using the function **tf.random.truncated_normal**. As an example, **W1=tf.Variable(tf.random.truncated_normal([4,5], stddev=0.1))** creates a 4 by 5 matrix called *W1*, containing normally distributed random values with mean 0 and standard deviation 0.1 (values more than two standard deviations from the mean are discarded and redrawn). Please complete the exercise by filling in the missing code.

In [None]:
x = tf.constant([[1,1]], tf.float32) # our input vector

# Build our random weight and bias matrices, of appropriate shapes
# TODO: fill in the missing numbers at the #-symbols.
W1 = tf.Variable(tf.random.truncated_normal([#,#], stddev=0.1))
b1 = tf.Variable(tf.random.truncated_normal([#,#], stddev=0.1))
W2 = tf.Variable(tf.random.truncated_normal([#,#], stddev=0.1))
b2 = tf.Variable(tf.random.truncated_normal([#,#], stddev=0.1))


# define our feed-forward neural network here:
h1 = #TODO: h1 should represent the output of the hidden layer
y = #TODO: y should represent the output of the network.

print(y.numpy())  #prints our output vector

It is possible to evaluate the network on multiple inputs simultaneously. Given your understanding of TensorFlow so far, and seeing how the above neural networks are implemented in TensorFlow, you can probably see how to do that: we just add additional rows to the input tensor. The matrix multiplication then produces one output row per input row, and TensorFlow's broadcasting behaviour ensures that the bias is added to every row, so each output row represents the output for its corresponding input.
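
To make the batching behaviour concrete, here is a small NumPy sketch. The layer sizes (3 inputs, 4 hidden neurons, 2 outputs) are hypothetical and deliberately different from the exercise above, and the names `sigmoid`, `H`, `Y` and `y_first` are ours. It only illustrates how shapes propagate when several inputs are stacked as rows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_inputs, n_hidden, n_outputs = 3, 4, 2

W1 = rng.normal(0.0, 0.1, size=(n_inputs, n_hidden))
b1 = rng.normal(0.0, 0.1, size=(1, n_hidden))       # row vector of biases
W2 = rng.normal(0.0, 0.1, size=(n_hidden, n_outputs))
b2 = rng.normal(0.0, 0.1, size=(1, n_outputs))

# Five inputs stacked as rows of a (5, 3) matrix.
X = rng.integers(0, 2, size=(5, n_inputs)).astype(float)

H = sigmoid(X @ W1 + b1)   # (5, 4): b1 is broadcast across the 5 rows
Y = sigmoid(H @ W2 + b2)   # (5, 2): one output row per input row

# Evaluating the first row on its own gives the same result as the batch.
y_first = sigmoid(sigmoid(X[:1] @ W1 + b1) @ W2 + b2)
print(H.shape, Y.shape, np.allclose(Y[:1], y_first))
```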

## Training our Neural Network

**Task:** Evaluate the network of the last exercise on all four 0-1 inputs, by changing the first line to _x = tf.constant([[0,0],[0,1],[1,0],[1,1]], tf.float32)_.

We will now proceed to train our network so that it learns a function. The function we want it to learn is the XOR function, which I briefly mentioned in the first lecture. XOR is defined as the function that takes two binary variables, outputs 1 if exactly one of these input variables is set to 1, and otherwise outputs 0.

![title](images/xor.png)

To train our network, we will use the Gradient Descent method that we used in previous labs to minimise a quadratic function. You may recall that _SGD_ performs gradient descent minimisation on any function that we provide it, using a fixed step size parameter that we have to provide as well. In our case, we have to pass to it a function of the weights and biases, such that this function represents the error resulting from running our neural network on all four combinations of inputs listed above. The error function that we use is again a squared-error loss, closely related to the MSE (mean-squared-error) loss: we define a target value for each of the four inputs and take the sum of the squared differences between each output and its target (the MSE would additionally divide this sum by 4, which does not change the minimiser). Given that TensorFlow computes the output of all our four inputs as a column vector, we can represent the target values as $t = [0,1,1,0]^T$, and simply compute the sum of squares of the entries of the column vector ($y-t$), where $y$ is the output vector resulting from evaluating the network on all four inputs.
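
The relation between the summed squared error and the MSE can be checked on a toy example. The NumPy sketch below uses three hypothetical samples with made-up outputs and targets (not the XOR data), stored as column vectors.

```python
import numpy as np

# Hypothetical network outputs and targets for three samples (column vectors).
y = np.array([[0.2], [0.9], [0.4]])
t = np.array([[0.0], [1.0], [1.0]])

sse = np.sum(np.square(y - t))    # sum of squared errors
mse = np.mean(np.square(y - t))   # mean squared error = sse / number of samples
print(sse, mse)
```

Both quantities are minimised by the same weights, since they differ only by the constant factor given by the number of samples.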

In TensorFlow code, we can conveniently do this using the functions *subtract*, *square*, and *reduce_sum*, which we introduced in previous labs.

**Task:** Define below the appropriate vector of target values. Verify that _loss_ represents the correct function to minimise.

In [None]:
t = # TODO

loss=lambda: tf.reduce_sum(tf.square(tf.subtract(tf.sigmoid(tf.matmul(tf.sigmoid(tf.matmul(x,W1) + b1),W2) + b2), t)))


We can now run gradient descent on the function _loss_, where we set the step size to 0.5. TensorFlow minimises _loss_ with respect to the variables that we pass to the optimizer, namely W1, b1, W2, and b2. SGD will modify them in an attempt to minimise the loss function; hence the result of running SGD on _loss_ is a set of weights and biases for our simple network of 2 hidden neurons and 1 output neuron, which (hopefully) gives us a network that performs the XOR function.
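
The update rule behind gradient descent is easiest to see on a one-variable problem. The sketch below is plain Python and minimises the hypothetical function $f(w) = (w-3)^2$ with a hand-coded derivative; the function, the starting point, and the step size of 0.25 are illustrative assumptions, not part of the lab. TensorFlow's SGD applies the same kind of update to every variable, but obtains the gradients by automatic differentiation.

```python
def f(w):
    """Hypothetical quadratic with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad_f(w):
    """Derivative of f, written out by hand."""
    return 2.0 * (w - 3.0)

w = 0.0             # starting point
step_size = 0.25    # fixed step size (learning rate)
history = [f(w)]

for _ in range(20):
    w = w - step_size * grad_f(w)   # move against the gradient
    history.append(f(w))

print(w, history[-1])
```

With this step size the distance to the minimum halves at every iteration, so the recorded loss values never increase.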

**Task:** Study and run the code in the following cell, which runs Gradient Descent on _loss_. In the cell after that, write some code that evaluates the network on the four possible 0-1-valued inputs and write some code that outputs all the learned weights and biases.

In [None]:
optimizer = tf.optimizers.SGD(0.5)

for i in range(20000):
    optimizer.minimize(loss,[W1,b1,W2,b2])

In [None]:
# TODO: Write some code here that evaluates the network on the four possible 0-1-valued inputs 
# and write some code that outputs all the learned weights and biases.


## Softmax and Cross-Entropy

A popular way to handle classification problems with neural networks is by using a [softmax](https://en.wikipedia.org/wiki/Softmax_function) layer as the output layer. The short explanation is that a softmax layer applies a special type of activation function which ensures that all the outputs of the layer sum to 1. This allows interpreting the output of the network as a probability distribution over classes. If the output layer of one's neural network is a softmax layer, then one usually does this in combination with a special kind of cost function that is particularly suitable for training a network with a softmax layer: the _cross-entropy_ cost function.

TensorFlow has support for softmax layers together with cross-entropy built in. One can use the following combination of function calls.

In [None]:
cross_entropy = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=t, logits=y))

The above will apply a softmax layer to the inputs y and compare the result with the target values t using the cross-entropy function. The variable *cross_entropy* then represents the average cross-entropy over all inputs, and serves as the function that we would like our SGD optimizer to minimise.

When using this function, note that TensorFlow expects that t is the label ID, and not the one-hot encoding thereof. 

When you want to add a softmax output layer, do always use this built-in function rather than implement a softmax layer manually. This is because the built-in function includes additional code for handling numerical instabilities.
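
To see why a naive implementation can be numerically unstable, consider the following NumPy sketch. It shows one standard remedy, subtracting the row maximum before exponentiating; whether TensorFlow uses exactly this trick internally is an assumption on our part, and the function name `sparse_softmax_cross_entropy` and the example logits are hypothetical. Note that the labels are class IDs, not one-hot vectors.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    """Per-sample cross-entropy between softmax(logits) and integer labels."""
    # Subtracting the row maximum leaves softmax unchanged
    # but keeps exp() from overflowing.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct class in each row.
    return -log_probs[np.arange(len(labels)), labels]

logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 0.0, -1000.0]])   # exp(1000) would overflow if computed naively
labels = np.array([0, 1])                      # label IDs, not one-hot vectors

losses = sparse_softmax_cross_entropy(logits, labels)
cross_entropy = losses.mean()                  # average over the inputs
print(losses, cross_entropy)
```

Both losses stay finite, even though the second row contains logits whose exponentials cannot be represented directly as floating-point numbers.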

## Part II: The MNIST Dataset

In the running example of the lectures we use the MNIST dataset for training a neural network that performs handwritten digit recognition. The MNIST dataset is a very popular and standard benchmark dataset, and therefore you can download it directly through TensorFlow. The following code does that.

In [None]:
from tensorflow.keras import Model, layers

mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()

The format of this dataset is as follows. The training labels are integers from 0 to 9. The images are represented as 28 times 28 matrices of integers between 0 and 255. The training set has 60,000 images, and the test set has 10,000 images.

We first flatten the images such that they become row vectors of length 784. We also rescale the integer range to floating point numbers between 0 and 1.

In [None]:
# Convert to float32.
x_train, x_test = np.array(x_train, np.float32), np.array(x_test, np.float32)
# Flatten images to 1-D vector of 784 features (28*28).
x_train, x_test = x_train.reshape([-1, 784]), x_test.reshape([-1, 784])
# Normalize images value from [0, 255] to [0, 1].
x_train, x_test = x_train / 255., x_test / 255.
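
If you want to inspect what these three steps do without downloading the dataset, the following sketch applies them to a hypothetical stand-in: three random 28 by 28 "images" with integer pixel values from 0 to 255. The names `images` and `flat` are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for MNIST: 3 random images of 28 x 28 pixels in 0..255.
images = rng.integers(0, 256, size=(3, 28, 28), dtype=np.uint8)

# Same steps as above: convert to float32, flatten, rescale to [0, 1].
flat = np.array(images, np.float32).reshape([-1, 784]) / 255.0
print(flat.shape, flat.dtype, flat.min(), flat.max())

# Row-major flattening: pixel (r, c) of image i ends up in column 28*r + c of row i.
print(np.isclose(flat[1, 28 * 5 + 7], images[1, 5, 7] / 255.0))
```

The -1 in the reshape lets NumPy infer the number of rows, so the same line works for the 60,000 training images and the 10,000 test images.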

We now define some network parameters.

In [None]:
# Training parameters.
learning_rate = 0.1
training_steps = 2000
batch_size = 256
display_step = 100

# Network parameters.
n_hidden_1 = 128 # 1st layer number of neurons
n_hidden_2 = 256 # 2nd layer number of neurons

In [None]:
# Use tf.data API to shuffle and batch data
train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_data = train_data.repeat().shuffle(5000).batch(batch_size).prefetch(1)

The next block defines the neural network. First, we will define it in the "traditional" way, as above. Then, I will present an alternative way to define the network. You can observe the output of each way, by choosing *not* to run the other approach. So, to check the traditional approach, run the following 2 snippets and skip the third one. To check the compact approach, skip the next 2 snippets and run the third one.

In [None]:
# First way: the "traditional" approach. We first initialise weights and, in the following snippet, we build the layers.

# Store layers weight & bias

# A random value generator to initialize weights.
random_normal = tf.initializers.RandomNormal()

weights = {
    'h1': tf.Variable(random_normal([784, n_hidden_1])),
    'h2': tf.Variable(random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(random_normal([n_hidden_2, 10]))
}
biases = {
    'b1': tf.Variable(tf.zeros([n_hidden_1])),
    'b2': tf.Variable(tf.zeros([n_hidden_2])),
    'out': tf.Variable(tf.zeros([10]))
}

In [None]:
# Create model.
def neural_net(x, is_training):
    # Hidden fully connected layer with 128 neurons.
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    # Apply ReLU to layer_1 output for non-linearity.
    layer_1 = tf.nn.relu(layer_1)
    
    # Hidden fully connected layer with 256 neurons.
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    # Apply ReLU to layer_2 output for non-linearity.
    layer_2 = tf.nn.relu(layer_2)
    
    # Output fully connected layer with a neuron for each class.
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    if is_training:
        return out_layer
    # Apply softmax to normalize the logits to a probability distribution.
    return tf.nn.softmax(out_layer)

In [None]:
# Second way: The "compact" approach. We simply provide the number of neurons and the activation function for each layer.
# When an activation function is not defined (as in self.out), it is by default the "linear" activation function, i.e., Wx+b. 

# Create TF Model
class NeuralNet(Model):
    # Set layers.
    def __init__(self):
        super(NeuralNet, self).__init__()
        # First fully-connected hidden layer
        self.fc1 = layers.Dense(n_hidden_1, activation= 'relu')
        # Second fully-connected hidden layer
        self.fc2 = layers.Dense(n_hidden_2, activation= 'relu')
        # Output layer
        self.out = layers.Dense(10)

    # Set forward pass
    def call(self, x, is_training=False):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.out(x)
        if not is_training:
            # tf cross entropy expect logits without softmax, so only
            # apply softmax when not training.
            x = tf.nn.softmax(x)
        return x

# Build neural network model.
neural_net = NeuralNet()

We now define the cross-entropy loss and the optimization procedure.

In [None]:
# Cross-Entropy Loss.
# Note that this will apply 'softmax' to the logits.
def cross_entropy_loss(x, y):
    # Convert labels to int 64 for tf cross-entropy function.
    y = tf.cast(y, tf.int64)
    # Apply softmax to logits and compute cross-entropy.
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=x)
    # Average loss across the batch.
    return tf.reduce_mean(loss)

# Accuracy metric.
def accuracy(y_pred, y_true):
    # Predicted class is the index of highest score in prediction vector (i.e. argmax).
    correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.cast(y_true, tf.int64))
    return tf.reduce_mean(tf.cast(correct_prediction, tf.float32), axis=-1)

# Stochastic gradient descent optimizer.
optimizer = tf.optimizers.SGD(learning_rate)
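
The accuracy function above takes the arg-max of each prediction row, so it gives the same answer whether it is fed raw logits or softmax probabilities. The NumPy sketch below checks this on hypothetical logits for 4 samples and 3 classes; all values and names are made up for illustration.

```python
import numpy as np

# Hypothetical logits for 4 samples and 3 classes, with their true labels.
logits = np.array([[ 2.0, 0.5, -1.0],
                   [ 0.1, 0.2,  3.0],
                   [ 1.5, 1.4,  0.0],
                   [-0.3, 0.8,  0.7]])
y_true = np.array([0, 2, 1, 1])

# Softmax preserves the ordering within each row, so it does not change the arg-max.
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

pred_from_logits = np.argmax(logits, axis=1)
pred_from_probs = np.argmax(probs, axis=1)

# Fraction of samples whose predicted class matches the true label.
acc = np.mean(pred_from_logits == y_true)
print(pred_from_logits, pred_from_probs, acc)
```

Three of the four hypothetical samples are classified correctly, so the accuracy is 0.75 in both cases.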

In the following snippet, select the initialisation of _trainable variables_ based on which approach you are following.

In [None]:
# Optimization process. 
def run_optimization(x, y):
    # Wrap computation inside a GradientTape for automatic differentiation
    with tf.GradientTape() as g:
        # Forward pass
        pred = neural_net(x, is_training=True)
        # Compute loss
        loss = cross_entropy_loss(pred, y)
        
    # Variables to update, i.e. trainable variables. Comment out one of the two lines, depending on the approach.
    trainable_variables = list(weights.values())+list(biases.values()) # If following the traditional approach
#    trainable_variables = neural_net.trainable_variables # If following the compact approach

    # Compute gradients
    gradients = g.gradient(loss, trainable_variables)
    
    # Update W and b following gradients
    optimizer.apply_gradients(zip(gradients, trainable_variables))

We are now ready to train our network.

In [None]:
# Run training for the given number of steps.
for step, (batch_x, batch_y) in enumerate(train_data.take(training_steps), 1):
    # Run the optimization to update W and b values
    run_optimization(batch_x, batch_y)
    
    if step % display_step == 0:
        pred = neural_net(batch_x, is_training=True)
        loss = cross_entropy_loss(pred, batch_y)
        acc = accuracy(pred, batch_y)
        print("step: %i, loss: %f, accuracy: %f" % (step, loss, acc))

In [None]:
# Test model
pred = neural_net(x_test, is_training=False)
print("Test Accuracy: %f" % accuracy(pred, y_test))

Let's check how our prediction works against 5 images; feel free to modify this number.

In [None]:
# Visualize predictions
import matplotlib.pyplot as plt

In [None]:
# Predict 5 images from the test set
n_images = 5
test_images = x_test[:n_images]
predictions = neural_net(test_images, is_training=False)

# Display image and model prediction
for i in range(n_images):
    plt.imshow(np.reshape(test_images[i], [28, 28]), cmap='gray')
    plt.show()
    print("Model prediction: %i" % np.argmax(predictions.numpy()[i]))

## Some questions
* How many hidden layers does the above NN have?
* What recognition rate does it achieve for these hand-written characters?
* Which recognition rate is best to use here – test set or training set?

Finally, try to implement the following modifications and rerun the code to see if you implemented them successfully.
* Try using the _tf.nn.tanh_ function
* Try using the _tf.sigmoid_ function
* Try adding more hidden layers


## Epilogue: MNIST using Keras

We began with a straightforward implementation for XOR using Keras. We will conclude with a standalone implementation for MNIST using Keras. The snippet below can be executed on its own; there is no need to run any of the above snippets.

In [None]:
import tensorflow as tf
import numpy as np
from tensorflow.keras import layers
from tensorflow.keras.layers import Activation, Dense

mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()

# Convert to float32.
x_train, x_test = np.array(x_train, np.float32), np.array(x_test, np.float32)
# Flatten images to 1-D vector of 784 features (28*28).
x_train, x_test = x_train.reshape([-1, 784]), x_test.reshape([-1, 784])
# Normalize images value from [0, 255] to [0, 1].
x_train, x_test = x_train / 255., x_test / 255.

# convert class vectors to binary class matrices
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

# Training parameters.
batch_size = 256
epochs = 15

# Network parameters.
n_hidden_1 = 128 # 1st layer number of neurons
n_hidden_2 = 256 # 2nd layer number of neurons

model = tf.keras.Sequential()
model.add(Dense(n_hidden_1, activation= 'relu'))
model.add(Dense(n_hidden_2, activation= 'relu'))
model.add(Dense(10, activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=1.0/7.0)
score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])
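
The call to `to_categorical` in the cell above turns integer labels into one-hot rows, which is the format the `categorical_crossentropy` loss expects (in contrast to the sparse cross-entropy function used in Part II, which takes the label IDs directly). The NumPy sketch below shows an equivalent encoding on three hypothetical labels; indexing an identity matrix is our choice for illustration, not necessarily how Keras implements it.

```python
import numpy as np

labels = np.array([3, 0, 9])    # hypothetical integer class labels

# Row i of the 10 x 10 identity matrix is the one-hot vector for class i.
one_hot = np.eye(10, dtype=np.float32)[labels]   # shape (3, 10)
print(one_hot)

# The arg-max of each row recovers the original label.
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```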

In [None]:
# Visualize predictions
import matplotlib.pyplot as plt
from numpy import random

rnd_start = random.randint(1000)
n_images = 5
test_images = x_test[rnd_start:rnd_start+n_images]
predictions = model.predict(test_images)
for i in range(n_images):
    plt.imshow(np.reshape(test_images[i], [28, 28]), cmap='gray')
    plt.show()
    print("Prediction = ", np.argmax(predictions[i]))
