# The Artificial Neuron


Artificial neural networks are connected systems of "artificial neurons", computational elements that imitate natural nerve cells. The connections between neurons allow signals to be transmitted from one to another, with the receiving neuron able to combine its input signals and generate an output.

## Neurons

A single neuron can take multiple inputs, $x$, and perform a calculation to output $y$ (Figure 1). Depending on the type of model, the output may be a class (classification) or a continuous value (regression).

<p style="text-align:center">
<img src="../img/neuron.png" width="40%"></img>
<b>Figure 1:</b> An artificial neuron</p>


### Logistic Regression

Let's imagine we are trying to predict whether a molecule is a drug (Yes|No), given its molecular fingerprint.

We could use a linear model to give the predicted output $y$:

$$y = W\cdot{x} + b$$

Where $W$ is the weight vector, $x$ is the molecule fingerprint and $b$ is the bias, a real number.

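However, the output of this linear model is not restricted to the range 0 to 1, whereas we wish to assess the probability that the molecule is a drug. In other words, we are asking:

"What is the probability of being a drug, given $x$?"

$$y = P(\hat{y}=1 \mid x)$$

Here $y$ is the predicted output and $\hat{y}=1$ means that the molecule actually is a drug.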

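We can therefore pass the output of the linear model through a function that maps it into this range, for example the sigmoid function:

$$z = W\cdot{x} + b$$

$$y = \sigma(z) = \frac{1}{1+e^{-z}}$$

If the predicted probability is greater than 0.5, the molecule is predicted to be a drug.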

<p style="text-align:center">
<img src="../img/sigmoid.png" width="50%">
<b> Figure 2:</b> A sigmoid function</p>

For every value of $z$, the sigmoid function gives a value of $y$ between 0 and 1, creating an S-shaped curve (Figure 2).
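
The calculation performed by the neuron can be sketched in a few lines of NumPy. This is a minimal illustration rather than a fitted model: the four-bit input, the weights and the bias are made-up values chosen only to show the steps $z = W\cdot{x} + b$, $y = \sigma(z)$ and the 0.5 threshold.

```python
import numpy as np

def sigmoid(z):
    """Map any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy input: 4 fingerprint bits instead of 2048.
x = np.array([1.0, 0.0, 1.0, 1.0])
W = np.array([0.5, -0.3, 0.8, -0.2])  # illustrative weights, not fitted
b = -0.1                              # illustrative bias

z = np.dot(W, x) + b   # linear part: z = W.x + b
y = sigmoid(z)         # predicted probability of being a drug
is_drug = y > 0.5      # decision threshold described in the text

print(f"z = {z:.2f}, y = {y:.3f}, drug = {is_drug}")
```

With these values $z$ is positive, so the predicted probability is above 0.5 and the toy molecule is classed as a drug.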

The neuron is responsible for performing this calculation, but which values should be used for $W$ and $b$? Neural network modelling adopts a forward and backward approach to handle this.

In the <b> feed-forward </b> pass, the inputs $x$ are taken, and the output $y$ is determined using fixed weights and biases ($W$ and $b$).

Neural networks then undergo <b> backward propagation </b>, in which a cost function is used to assess the error of the model under the current parameters ($W$ and $b$), and the gradient of the cost with respect to each parameter is calculated. During each iteration of gradient descent, the weights and biases are updated using these gradients in order to minimise the cost.

<div class="alert alert-info" role="alert"><b>RECAP: Neural network modelling</b><br>
Neural network modelling typically adopts a two-stage process:<br>
<ol>
<li>Implementing the feed-forward pass
    <ol>
        <li> take the input and determine the output</li>
        <li> use fixed weights and biases</li>
    </ol>
</li>
<li>Implementing the backward propagation/gradient descent
    <ol>
        <li> calculate the error and gradients</li>
        <li> update the parameters using the gradient</li>
    </ol>
</li>
</ol>
</div>
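
The two stages can be sketched for a single sigmoid neuron as follows. Several things here are assumptions made for illustration and are not specified in the text: the cost function (binary cross-entropy), the learning rate, the number of iterations, and the tiny three-feature dataset. The variables `y_pred` and `y_true` stand for the predicted and actual outputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 features, 4 training examples (one per column).
X = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 0]], dtype=float)
y_true = np.array([[1, 0, 1, 0]], dtype=float)  # actual labels
m = X.shape[1]                                  # number of examples

W = rng.normal(scale=0.01, size=(1, 3))  # small random weights
b = np.zeros((1, 1))                     # zero bias
learning_rate = 0.5                      # assumed step size

costs = []
for _ in range(200):
    # 1. Feed-forward pass using the current (fixed) parameters.
    y_pred = sigmoid(W @ X + b)

    # 2a. Error: binary cross-entropy (an assumed choice of cost function).
    cost = -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))
    costs.append(cost)

    # 2b. Gradients of that cost with respect to W and b.
    dz = y_pred - y_true
    dW = dz @ X.T / m
    db = np.sum(dz, axis=1, keepdims=True) / m

    # 2c. Gradient-descent update of the parameters.
    W -= learning_rate * dW
    b -= learning_rate * db

print(f"cost: {costs[0]:.3f} -> {costs[-1]:.3f}")
```

The cost printed at the end should be lower than the cost at the start, which is the sign that the updates are moving the parameters in a useful direction.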

## Feed forward for a single neuron (classification)

Let's do the feed forward for our single neuron using the molecule fingerprints.

Note: We shall use $y$ to denote predicted output, and $\hat{y}$ to denote actual output.

<p style="text-align:center">
<img src="../img/neuron_parameters.png" width="50%">
<br><b>Figure 3:</b> Logistic regression model. </p>

### Ensuring the parameter shapes are correct

For a single training example:

We only have a single output, of shape $[1,1]$. Working backwards, this means that the shape of $z$ must also be $[1,1]$:

$$[1,1] = \sigma[1,1]$$

$b$ is a constant, and must match the shape of $z$, so is also of shape $[1,1]$:

$$[1,1] = W \cdot x + [1,1] $$

We know from our data we have 2048 features as input (a bit fingerprint), so the shape of $x$ is $[n_{features},1]$.

$$[1,1] = W \cdot [2048,1] + [1,1] $$

Using matrix multiplication, how can we get a matrix of shape $[1,1]$ when multiplying by a matrix of shape $[2048,1]$?

Answer: $$[1,2048] \times [2048,1] = [1,1]$$

Therefore, the shape of $W$ must be $[1,2048]$:

$$[1,1] = [1,2048] \cdot [2048,1] + [1,1]$$

$$ z = W \cdot x + b $$

<div class="alert alert-info" role="alert">
<u><b>Rules for shapes of parameters</b></u>

The "weight" parameter is a matrix of the shape:

$$ W_{current} = [n_{current},n_{prev}]$$

The "bias" parameter is an array of the shape:

$$b_{current} = [n_{current},1]$$

Where:

$n_{current}$ is the number of units in the current layer <br>
$n_{prev}$ is the number of units in the previous layer
</div>
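
These shape rules are easy to check in NumPy. The sketch below uses arrays of zeros, since only the shapes matter here. The second part, with `m` examples stacked as columns, is an added illustration and goes beyond the single-example case worked through above.

```python
import numpy as np

n_features = 2048  # fingerprint length, i.e. n_prev for the first layer
n_current = 1      # a single neuron

x = np.zeros((n_features, 1))          # one training example: [2048, 1]
W = np.zeros((n_current, n_features))  # [n_current, n_prev] = [1, 2048]
b = np.zeros((n_current, 1))           # [n_current, 1] = [1, 1]

z = W @ x + b                          # [1, 2048] @ [2048, 1] + [1, 1]
print(z.shape)

# Illustration only: m examples stacked as columns give one output each.
m = 5                                  # hypothetical number of examples
X = np.zeros((n_features, m))
Z = W @ X + b                          # b is broadcast across the m columns
print(Z.shape)
```

If `W` were given the shape `(2048, 1)` instead, the `@` operation would raise an error, which is a quick way to catch a misshapen parameter.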


### Initialising the parameters

As we know nothing about the feature weights to start with, it is customary to initialise the weights with random numbers and the bias with zero.
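
A minimal sketch of this initialisation is shown below. The function name `initialise_parameters`, the fixed seed and the scaling factor of 0.01 (used to keep the starting weights small) are assumptions for illustration and are not prescribed by the text.

```python
import numpy as np

def initialise_parameters(n_current, n_prev, seed=0):
    """Return random weights of shape [n_current, n_prev] and a zero bias."""
    rng = np.random.default_rng(seed)
    # Small random values; the 0.01 scale is an assumed, common choice.
    W = rng.standard_normal((n_current, n_prev)) * 0.01
    b = np.zeros((n_current, 1))
    return W, b

W, b = initialise_parameters(n_current=1, n_prev=2048)
print(W.shape, b.shape)
```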

## Other activation functions
<br>
<div class="alert alert-warning" role="alert">
<b>Exercise:</b> Identify the functions (1-5) from the graphs (A-E).</div>

<p style="text-align:center">
<img src="../img/functions.png" width="75%"/>
</p>

1) $y = 2x + 1$ <br><br>
2) $y = \frac{1}{1+e^{-x}}$<br><br>
3) $y = \max(0,x)$<br><br>
4) $y = \frac{2}{1+e^{-2x}}-1$<br><br>
5) $y = \max(x,0.1x)$<br><br>

<b>Answers:</b>

* 1E - linear
* 2C - sigmoid
* 3B - ReLU
* 4A - tanh
* 5D - leaky ReLU
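
The five functions from the exercise can be written directly in NumPy, which is a convenient way to compare their outputs. The function names are chosen here for illustration. The final line checks numerically that function 4 matches NumPy's hyperbolic tangent, `np.tanh`, on the sample points.

```python
import numpy as np

def linear(x):
    return 2 * x + 1

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def tanh_from_formula(x):
    return 2 / (1 + np.exp(-2 * x)) - 1

def leaky_relu(x):
    return np.maximum(x, 0.1 * x)

x = np.linspace(-3, 3, 7)  # sample points -3, -2, ..., 3

for f in (linear, sigmoid, relu, tanh_from_formula, leaky_relu):
    print(f"{f.__name__:>18}: {np.round(f(x), 3)}")

# Function 4 agrees with NumPy's built-in hyperbolic tangent.
print(np.allclose(tanh_from_formula(x), np.tanh(x)))
```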

### The linear activation function

<p style="text-align:center">
<img src="../img/linear.png" width="50%">
    <b> Figure 5:</b> The linear activation function.<br>
    Used for the output stage of a regressor.</p>
    
### The tanh activation function

<p style="text-align:center">
<img src="../img/tan.png" width="50%">
    <b> Figure 6:</b> The tanh (hyperbolic tangent) activation function</p>
    
### The ReLU activation function

<p style="text-align:center">
<img src="../img/relu.png" width="50%">
    <b> Figure 7:</b> The ReLU activation function.<br>
    A good choice for hidden layers. Training is typically faster because its gradient is very cheap to compute (0 for negative inputs, 1 for positive inputs).</p>
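
To see why ReLU is easy to differentiate, the function and its gradient can be sketched side by side. The name `relu_gradient` is chosen here for illustration, and taking the gradient at exactly $z = 0$ to be 0 is a convention assumed in this sketch.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_gradient(z):
    # 1 where z is positive, 0 elsewhere.
    # Assumed convention: the gradient at exactly z = 0 is taken as 0.
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))           # negative inputs are clipped to zero
print(relu_gradient(z))  # a simple 0/1 mask, no exponentials needed
```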
    
    
### The Leaky ReLU activation function

<p style="text-align:center">
<img src="../img/leakyrelu.png" width="50%">
    <b> Figure 8:</b> The leaky ReLU activation function</p>


# Building the neural network

Now that we understand the process for a single neuron, let's look at building a network of neurons to create our model.

In a neural network, there are multiple layers:

* Input layer - passes the data directly to the first hidden layer
* Hidden layer - transforms the inputs into something that the output layer can use
* Output layer - transforms the hidden layer activations into the scale required for the output

Every neuron in a layer is able to receive information from every neuron in the layer before it, and is able to pass on information to every neuron in the layer after it, creating a densely connected network.

<p style="text-align:center">
<img src="../img/neural_network.png" width="50%"> 
    <b>Figure 9:</b> A neural network</p>
    
Here, it's important to introduce some notation. Subscript numbers are the training example numbers, e.g. $x_1$ is training example 1. Superscript numbers outside of square brackets denote the feature/neuron number, e.g. $x^1$ is feature 1. Superscript numbers in square brackets refer to the layer number, whilst $h$ and $o$ refer to 'hidden' and 'output' respectively, e.g. $h^{[1]}$ is hidden layer 1, whilst $o^{[4]}$ is output layer 4. Layer numbering starts at 0 for the input layer.

In the neural network above (Figure 9), you can see there are:

* An input layer consisting of three features ($x^1$, $x^2$ and $x^3$). Note that only three are shown here, but all features will be used in the following example.
* 5 hidden layer 1 neurons.
* 2 hidden layer 2 neurons.
* 1 output layer 3 neuron, outputting a single value, $y$.

In total, there are 8 neurons (5 + 2 + 1) spread across 3 layers (we do not count the input layer), so there are 3 weight matrices and 3 bias vectors.
In keeping with the notation above, the weights and biases for the first layer would be denoted $W^{[1]}$ and $b^{[1]}$ respectively.
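
The parameter shapes for this network follow from the rule that $W$ for a layer has shape $[n_{current}, n_{prev}]$ and $b$ has shape $[n_{current}, 1]$. The sketch below builds them and runs one feed-forward pass. It assumes 2048 input features, ReLU in the hidden layers, a sigmoid output, and small random starting weights. None of these choices is fixed by the text, and the `parameters` dictionary is just one hypothetical way to store the values.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

# Layer 0 (input), hidden layer 1, hidden layer 2, output layer 3.
layer_sizes = [2048, 5, 2, 1]
n_layers = len(layer_sizes) - 1  # the input layer is not counted
rng = np.random.default_rng(0)

parameters = {}
for layer in range(1, n_layers + 1):
    n_prev, n_current = layer_sizes[layer - 1], layer_sizes[layer]
    W = rng.standard_normal((n_current, n_prev)) * 0.01  # assumed scale
    b = np.zeros((n_current, 1))
    parameters[f"W{layer}"] = W
    parameters[f"b{layer}"] = b
    print(f"W[{layer}]: {W.shape}, b[{layer}]: {b.shape}")

# One feed-forward pass for a single random bit fingerprint.
x = rng.integers(0, 2, size=(2048, 1)).astype(float)
a = x
for layer in range(1, n_layers + 1):
    z = parameters[f"W{layer}"] @ a + parameters[f"b{layer}"]
    # Assumed activations: ReLU for hidden layers, sigmoid for the output.
    a = sigmoid(z) if layer == n_layers else relu(z)

y = a
print(y.shape)
```

Each layer's output has shape $[n_{current}, 1]$, which becomes the input to the next layer, so the final output $y$ is a single value.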

## Further reading

1. [Introduction to neural networks](https://www.analyticsvidhya.com/blog/2016/03/introduction-deep-learning-fundamentals-neural-networks/)