# Feedforward in neural networks

In this notebook we will take the first step towards designing and training neural networks. We'll start slowly by writing code to implement the feedforward step with weights and biases fixed. In the next notebook, we'll take on the challenge of training the network (i.e., optimizing the weights and biases).

In [65]:
import numpy as np

## Network architecture

Let's design a network with:
- an input layer with 3 neurons
- a hidden layer with 2 neurons
- an output layer with 2 neurons

<img src="NN.png"  width="600" height="300">

We therefore need two weight matrices, $w^2$ and $w^3$, and two bias vectors, $b^2$ and $b^3$. Note that the input layer (layer 1) isn't really a layer of neurons and so doesn't need any weights or biases. Let's set up an example input $\mathbf{x}$ to start with. The vector $\mathbf{x}$ has three components.

In [66]:
x = np.array([1,0,-2])

Let's also pick some weights and biases. We'll just make up some numbers here as we are not yet training our network, just computing the output.

In [67]:
w2 = np.array(
    [[1,2,1],[-1,-2,0]]
)
w3 = np.array(
    [[1,-1],[2,2]]
)
b2 = np.array([2,-1])
b3 = np.array([1,-1])

Make sure you understand in this simple example what the entries in these matrices mean. As a test just for yourself (no need to write this down), find the weight which is multiplied by the first input (`x[0]`) in the second neuron of the second layer.
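As an aid for reading these matrices, here is a minimal sketch that checks their shapes and spells out the indexing convention. It assumes, as the feedforward formula in this notebook does, that row `j` of a weight matrix holds the weights feeding neuron `j` of that layer. It looks up a different entry from the one in the self-test above, so the test is still yours to do.

```python
import numpy as np

w2 = np.array([[1, 2, 1], [-1, -2, 0]])
w3 = np.array([[1, -1], [2, 2]])

# Shape is (neurons in this layer, neurons in the previous layer).
print(w2.shape)  # (2, 3): 2 hidden neurons, 3 inputs
print(w3.shape)  # (2, 2): 2 output neurons, 2 hidden neurons

# w2[j, k] multiplies input x[k] in hidden neuron j (zero-based indices).
# Example: the weight on the third input, x[2], in the first hidden neuron.
print(w2[0, 2])  # 1
```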

To get the activations for the second layer (the hidden layer), we need to use the formula
$$
a^2 = \sigma(w^2 x+b^2) 
$$
Let's define the sigmoid now.

In [68]:
def sigmoid(z):
    return 1/(1+np.exp(-z))
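A quick way to get a feel for the sigmoid is to evaluate it on a small array. The sketch below redefines `sigmoid` so that it runs on its own; the sample values in `z` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
s = sigmoid(z)          # NumPy applies the function to each entry separately
print(s)                # every value lies strictly between 0 and 1; sigmoid(0) is 0.5
print(s + sigmoid(-z))  # sigmoid(z) + sigmoid(-z) equals 1 for every entry
```

Because the function acts entry by entry, the same definition works for a single number, a vector of pre-activations, or a whole matrix.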

Now we can compute the activations in the second (hidden) layer.

In [69]:
a2 = sigmoid(w2 @ x + b2)

In [70]:
a2

array([0.73105858, 0.11920292])

This is a vector of size two, as it should be -- one number for each neuron in the hidden layer. Now let's compute the output from the last layer.

In [71]:
a3 = sigmoid(w3 @ a2 + b3)

In [72]:
a3

array([0.83366886, 0.66830372])
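Both layers are small enough to check by hand. Using the weights, biases and input chosen above (values rounded to four decimal places), the hidden layer is

$$
w^2 x + b^2 = \begin{pmatrix} 1\cdot 1 + 2\cdot 0 + 1\cdot(-2) \\ (-1)\cdot 1 + (-2)\cdot 0 + 0\cdot(-2) \end{pmatrix} + \begin{pmatrix} 2 \\ -1 \end{pmatrix} = \begin{pmatrix} 1 \\ -2 \end{pmatrix}, \qquad a^2 = \begin{pmatrix} \sigma(1) \\ \sigma(-2) \end{pmatrix} \approx \begin{pmatrix} 0.7311 \\ 0.1192 \end{pmatrix}
$$

and the output layer is

$$
w^3 a^2 + b^3 \approx \begin{pmatrix} 0.7311 - 0.1192 + 1 \\ 2(0.7311) + 2(0.1192) - 1 \end{pmatrix} = \begin{pmatrix} 1.6119 \\ 0.7005 \end{pmatrix}, \qquad a^3 \approx \begin{pmatrix} \sigma(1.6119) \\ \sigma(0.7005) \end{pmatrix} \approx \begin{pmatrix} 0.8337 \\ 0.6683 \end{pmatrix}
$$

which agrees with the arrays printed above.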

The activations in the final layer are also known as the _output_ of our neural network. So far we've done everything fairly carefully and by hand. If we want our code to work for larger networks, we'll want to use a `for` loop. Let's first make two lists, one of weights and one of biases. 

In [73]:
w = [w2, w3]
b = [b2, b3]

To perform the feedforward, we loop through each non-input layer and compute the activation. We'll use a single variable `a` to track the activations, and use `zip()` to pair each weight matrix with its bias vector.

In [74]:
def feedforward(weights, biases, x):
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a

Let's make sure it agrees with our slow computation.

In [75]:
feedforward(w,b,x)

array([0.83366886, 0.66830372])
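When debugging, it can help to see the activations of every layer rather than only the final output. The sketch below is one way to do that; `feedforward_trace` is a hypothetical helper name introduced here for illustration, not part of the notebook's code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def feedforward_trace(weights, biases, x):
    """Hypothetical helper: like feedforward, but keeps every layer's activations."""
    activations = [x]
    for w, b in zip(weights, biases):  # zip pairs each weight matrix with its bias
        activations.append(sigmoid(w @ activations[-1] + b))
    return activations

weights = [np.array([[1, 2, 1], [-1, -2, 0]]), np.array([[1, -1], [2, 2]])]
biases = [np.array([2, -1]), np.array([1, -1])]
trace = feedforward_trace(weights, biases, np.array([1, 0, -2]))
for layer, a in enumerate(trace, start=1):
    print(layer, a)  # layer 1 is the input, layer 3 is the network output
```

The last entry of `trace` is the same vector that `feedforward` returns, and the middle entry matches the hidden activations computed earlier.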

## Training data

Now let's introduce some training data. We're still not going to actually train our network, but we'll compute a cost function and see what happens. Our training data needs to consist of _training inputs_ and training labels or _targets_. We'll use a hypothetical classification problem on 3D data. Our classification consists of two classes: class A is for vectors with norm at most 1, and class B is for vectors with norm more than 1. 

As with any classification problem (e.g. MNIST), we need to encode a classification as an output vector. We'll use $[1,0]$ for class A, $[0,1]$ for class B. More generally, for an output $[a,b]$, $a$ encodes the confidence for class A and $b$ encodes the confidence in class B.

Let's build our training data. We'll generate ten random vectors with coordinates between $-1$ and $1$.

In [76]:
np.random.seed(2023)
X = np.random.uniform(-1,1,size=(10,3))

In [77]:
X

array([[-0.35602339,  0.7808449 ,  0.17610451],
       [-0.74680781, -0.71731755, -0.06420882],
       [-0.95582068,  0.45454943,  0.04877468],
       [ 0.08987048, -0.08725348,  0.00276453],
       [-0.21106289, -0.69765539, -0.27824965],
       [-0.67584599, -0.32408261, -0.63935344],
       [-0.2180172 , -0.92870358,  0.1297233 ],
       [-0.59307702, -0.35879108, -0.24687243],
       [-0.63189172, -0.79209633, -0.09014555],
       [-0.60827233, -0.24294915,  0.86106392]])

Each row is a data point, and we need to compute a ground truth label for it (that is, a binary vector). We'll start with an empty list and add either $[1,0]$ or $[0,1]$ to it as we go through our training inputs.

In [78]:
y = []
for point in X:  # a new name, so the example input `x` defined earlier is not overwritten
    if np.linalg.norm(point) <= 1:
        y.append(np.array([1,0]))
    else:
        y.append(np.array([0,1]))

In [79]:
y

[array([1, 0]),
 array([0, 1]),
 array([0, 1]),
 array([1, 0]),
 array([1, 0]),
 array([1, 0]),
 array([1, 0]),
 array([1, 0]),
 array([0, 1]),
 array([0, 1])]
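The same labelling rule can also be written without a loop. This is a sketch on a small hand-made array, `X_demo`, which is an assumption for illustration and not the notebook's `X`. It produces a 2D array of labels rather than a list of arrays.

```python
import numpy as np

# Small hand-made inputs (not the notebook's X), one row per data point.
X_demo = np.array([[0.1, 0.2, 0.3],
                   [1.0, 1.0, 1.0],
                   [0.0, 0.0, 1.0]])

norms = np.linalg.norm(X_demo, axis=1)  # one norm per row
in_class_A = norms <= 1                 # boolean mask: norm at most 1
# Pick [1, 0] where the mask is True and [0, 1] otherwise.
y_demo = np.where(in_class_A[:, None], [1, 0], [0, 1])
print(y_demo)
```

Indexing `y_demo[i]` gives the label for row `i`, so it can be used wherever the list `y` is used.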

Note that usually the training data is given to you. The only reason we had to generate it here is that we are constructing an example ourselves. 

## Accuracy

We just saw how to compute the output for a _single_ input. Now let's compute the outputs for every training input.

In [80]:
all_outputs = [
    feedforward(w, b, X[i]) for i in range(len(X))
]

In [81]:
all_outputs

[array([0.86621361, 0.75634521]),
 array([0.66232822, 0.80348226]),
 array([0.83240746, 0.7889282 ]),
 array([0.83005625, 0.78845977]),
 array([0.70717419, 0.79443203]),
 array([0.7168432 , 0.76475332]),
 array([0.68307604, 0.82030231]),
 array([0.73726489, 0.79766862]),
 array([0.65760997, 0.80072279]),
 array([0.79093087, 0.85260651])]

As you can see, we get a list of arrays, each with two entries. How do we determine the level of accuracy of this output? We need to decide, for each output, what the prediction actually means. A simple way to do this is to look at the output vector and record which coordinate is larger. So if the first coordinate is higher, we interpret that as "class A" and if the second is higher we interpret that as "class B". Let's record this classification in a list called `predicted`, and record the real answer as `truth`. Make sure you understand what these `for` loops are doing.

In [82]:
predicted = []
for v in all_outputs:
    if v[0] > v[1]:
        predicted.append('A')
    else:
        predicted.append('B')

In [83]:
predicted

['A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B']

In [84]:
truth = []
for v in y:
    if v[0] == 1:
        truth.append('A')
    else:
        truth.append('B')

In [85]:
truth

['A', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'B']

To find our accuracy, we can make a vector recording whether `predicted` and `truth` agree, count the `True` values, and divide by the total number of training inputs.

In [86]:
np.array(truth) == np.array(predicted)

array([ True,  True, False,  True, False, False, False, False,  True,
        True])

In [87]:
sum(np.array(truth) == np.array(predicted))/len(truth)

0.5
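The two loops and the comparison can be condensed with `np.argmax`, which returns the index of the largest entry in each row. The sketch below copies the outputs and labels printed above into arrays so that it runs on its own. One difference to be aware of: on an exact tie `np.argmax` picks the first index (class A), whereas the loop above sends ties to class B.

```python
import numpy as np

# Network outputs and one-hot labels copied from the cells above.
outputs = np.array([
    [0.86621361, 0.75634521], [0.66232822, 0.80348226],
    [0.83240746, 0.7889282], [0.83005625, 0.78845977],
    [0.70717419, 0.79443203], [0.7168432, 0.76475332],
    [0.68307604, 0.82030231], [0.73726489, 0.79766862],
    [0.65760997, 0.80072279], [0.79093087, 0.85260651],
])
labels = np.array([[1, 0], [0, 1], [0, 1], [1, 0], [1, 0],
                   [1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])

predicted_idx = np.argmax(outputs, axis=1)  # 0 means class A, 1 means class B
true_idx = np.argmax(labels, axis=1)
accuracy = np.mean(predicted_idx == true_idx)
print(accuracy)  # 0.5
```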

Looks like our badly chosen weights and biases got it right about 50% of the time. This is about as good as the sophisticated machine learning algorithm called "flipping a coin". This is an important lesson: neural networks are not good because we can compute their output, but because we can _train_ them.

## Cost function

Now we'll define our cost function. Our function `cost` has to do the following _for each training input_: (1) compute the output of the network and (2) compute the squared norm of the difference between that output and the training label. We then add all of these values together and divide by $2n$, where $n$ is the number of training inputs; the result is the mean squared error cost. See the Exercise below.
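Written as a formula, the description above reads as follows. The symbols $x_i$ and $y_i$ for the $i$-th training input and its label, and $a^3(x_i)$ for the network output on that input, are notation assumed here for illustration.

$$
C(w, b) = \frac{1}{2n} \sum_{i=1}^{n} \left\| a^3(x_i) - y_i \right\|^2
$$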

**Note:** We could theoretically use accuracy as our cost function, but accuracy only changes in jumps as the weights and biases vary, so its derivatives would not tell us how to improve them. We hope that our cost function does a good job of approximating accuracy.

## Exercise 1

(a) Fill in the ... below to complete the `cost` function. If you want to see the formula for the cost, check out Equation (6) in Nielsen. 

(b) Run the cost function using the weights and biases we picked before and the training data `X` and training labels `y`. You should get an answer of about `0.3` or so.

(c) Compute the cost again, but this time use the following weights and biases which I obtained by optimizing the network a bit (what you'll learn to do in a later notebook):

$$
w^2 = \begin{pmatrix}
1.3 & -0.3 & -1.4 \\
2.1 & 0.7 & -1.8
\end{pmatrix} \ w^3 = \begin{pmatrix}
0.1 & 1.2 \\ -0.9 & -2.3 
\end{pmatrix}\ b^2 = \begin{pmatrix} 0.1 \\ 1.0 \end{pmatrix} \ b^3 = \begin{pmatrix} -0.1 \\ 1.0 \end{pmatrix}
$$

(d) Compute the accuracy for these new weights and biases. Did the neural network do better or worse on the training data with these new weights?

(e) Compute the output of the network with the weights and biases in (c) when the input is the vector $[100,100,100]^\mathsf{T}$. Based on the output, does the network think this vector has norm less than or greater than $1$? Does the answer surprise you?




In [88]:
#part a
def cost(weights, biases, training_input, training_labels):
    total_cost = 0
    n = len(training_input)
    for i in range(n):
        # feedforward output for the ith training input
        output = feedforward(weights, biases, training_input[i])
        # squared norm of (output - ith training label);
        # named sample_cost so it does not shadow the function name
        sample_cost = np.linalg.norm(output - training_labels[i])**2
        total_cost = total_cost + sample_cost

    return total_cost/(2 * n)

In [89]:
#part b
cost(w, b, X, y)

0.32176264964887974
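One way to gain confidence in a cost function is to run it on a case small enough to check by hand. The sketch below assumes the same `feedforward` and `cost` definitions as above, repeated so that it runs on its own. The all-zero weights and the single training pair are made up for the check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def feedforward(weights, biases, x):
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a

def cost(weights, biases, training_input, training_labels):
    n = len(training_input)
    total = 0
    for x_i, y_i in zip(training_input, training_labels):
        total += np.linalg.norm(feedforward(weights, biases, x_i) - y_i) ** 2
    return total / (2 * n)

# With all weights and biases zero, every neuron outputs sigmoid(0) = 0.5.
zero_w = [np.zeros((2, 3)), np.zeros((2, 2))]
zero_b = [np.zeros(2), np.zeros(2)]
X_one = [np.array([1, 0, -2])]
y_one = [np.array([1, 0])]

# Squared distance: (0.5 - 1)^2 + (0.5 - 0)^2 = 0.5, divided by 2n = 2.
check = cost(zero_w, zero_b, X_one, y_one)
print(check)  # approximately 0.25
```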

In [90]:
#part c
v2 = np.array(
    [[1.3,-0.3,-1.4],[2.1,0.7,-1.8]]
)
v3 = np.array(
    [[0.1,1.2],[-0.9,-2.3]]
)
u2 = np.array([0.1,1])
u3 = np.array([-0.1,1])
v = [v2, v3]
u = [u2, u3]
cost(v, u, X, y)

0.17275998032344547

In [91]:
#part d
all_outputs2 = [
    feedforward(v, u, X[i]) for i in range(len(X))
]
all_outputs2

[array([0.66193477, 0.33346052]),
 array([0.56702314, 0.50870135]),
 array([0.57410903, 0.522435  ]),
 array([0.70293876, 0.22463358]),
 array([0.6740951 , 0.26642097]),
 array([0.66898526, 0.28249864]),
 array([0.60978181, 0.40464942]),
 array([0.62922836, 0.37257945]),
 array([0.58279122, 0.46908921]),
 array([0.51431626, 0.64567114])]

In [92]:
predicted2 = []
for m in all_outputs2:
    if m[0] > m[1]:
        predicted2.append('A')
    else:
        predicted2.append('B')

In [93]:
sum(np.array(truth) == np.array(predicted2))/len(truth)

0.7

These weights and biases performed better: accuracy rose from 0.5 to 0.7, a gain of 20 percentage points.


In [94]:
#part e
training_input = np.array([100, 100, 100])
output = feedforward(v, u, training_input) 
output

array([0.75026011, 0.21416502])

In [95]:
np.linalg.norm(np.array([100, 100, 100]))

173.20508075688772

In [96]:
a2 = sigmoid(v2 @ training_input + u2)
a3 = sigmoid(v3 @ a2 + u3)
print(a2)
print(a3)

[4.69515757e-18 1.00000000e+00]
[0.75026011 0.21416502]


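The network assigns $[100, 100, 100]^\mathsf{T}$ to class A, so it thinks the norm is at most $1$ even though the actual norm is about $173$. This is surprising, but the activations explain it: the hidden layer saturates at approximately $(0, 1)$, so making the input even larger would not change the output.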
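The activations printed above can be traced by hand with the weights and biases from part (c). For the hidden layer,

$$
w^2 x + b^2 = \begin{pmatrix} (1.3 - 0.3 - 1.4)\cdot 100 + 0.1 \\ (2.1 + 0.7 - 1.8)\cdot 100 + 1.0 \end{pmatrix} = \begin{pmatrix} -39.9 \\ 101 \end{pmatrix}, \qquad a^2 = \begin{pmatrix} \sigma(-39.9) \\ \sigma(101) \end{pmatrix} \approx \begin{pmatrix} 0 \\ 1 \end{pmatrix}
$$

and therefore, for the output layer,

$$
w^3 a^2 + b^3 \approx \begin{pmatrix} 1.2 - 0.1 \\ -2.3 + 1.0 \end{pmatrix} = \begin{pmatrix} 1.1 \\ -1.3 \end{pmatrix}, \qquad a^3 \approx \begin{pmatrix} \sigma(1.1) \\ \sigma(-1.3) \end{pmatrix} \approx \begin{pmatrix} 0.750 \\ 0.214 \end{pmatrix}
$$

The first output is larger, which is the class A prediction seen above.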

## Challenge

By hand, find weights and biases for this network so that the output answers the following question:

_Given an input vector $x$, are the entries of $x$ in ascending order?_

Test your network on a few inputs to make sure it's working.

In [172]:
w2_ch = np.array(
    [[-1, 0, 1],[0, -1, 1]] # row 1 computes x[2] - x[0], row 2 computes x[2] - x[1]
)
w3_ch = np.array(
    [[1.9, 0],[0,1.9]]
)
b2_ch = np.array([0.01,-0.01])
b3_ch = np.array([0,0])

In [167]:
w_ch = [w2_ch, w3_ch]
b_ch = [b2_ch, b3_ch]

Recall that `truth = ['A', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'B']`. I chose the test vectors below to follow the same pattern of labels, except that here `'A'` (target $[1,0]$) means ascending and `'B'` (target $[0,1]$) means not ascending.

In [168]:
test = [np.array([0.1, 0.2, 0.3]), np.array([0.52, 0.42, 0.73]), np.array([0.5, 0.2, 0.1]), np.array([-0.1, 0.4, 0.6]), np.array([0, 0.8, 0.9]), 
        np.array([-0.25, 0.34, 0.61]), np.array([-0.5, -0.4, -0.3]), np.array([-0.97, -0.43, -0.3]), np.array([0.5, 0.2, 0.3]), np.array([0.6, 0.4, 0.3])]

In [169]:
all_outputs3 = [
    feedforward(w_ch, b_ch, test[i]) for i in range(len(test))
]
all_outputs3

[array([0.74065679, 0.72962507]),
 array([0.74155771, 0.74865293]),
 array([0.68288519, 0.71049773]),
 array([0.78138609, 0.73884605]),
 array([0.79489701, 0.72962507]),
 array([0.79232821, 0.74513073]),
 array([0.74065679, 0.72962507]),
 array([0.77921601, 0.73241925]),
 array([0.70266494, 0.72962507]),
 array([0.69279255, 0.71049773])]

In [170]:
predicted3 = []
for m in all_outputs3:
    if m[0] > m[1]:
        predicted3.append('A')
    else:
        predicted3.append('B')

In [171]:
sum(np.array(truth) == np.array(predicted3))/len(truth)

1.0
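A perfect score on ten hand-picked vectors does not yet show that the network answers the question in general. With these weights, the first hidden pre-activation minus the second is `x[1] - x[0] + 0.02`, and the later steps preserve that ordering, so the prediction appears to depend only on the first two entries. The sketch below tries one extra vector, made up here for illustration, whose first two entries ascend but whose last entry drops. The definitions are repeated so that it runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def feedforward(weights, biases, x):
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a

w_ch = [np.array([[-1, 0, 1], [0, -1, 1]]), np.array([[1.9, 0], [0, 1.9]])]
b_ch = [np.array([0.01, -0.01]), np.array([0, 0])]

# Hypothetical extra test vector: 0.1 < 0.5 but 0.5 > 0.3, so NOT ascending.
extra = np.array([0.1, 0.5, 0.3])
out = feedforward(w_ch, b_ch, extra)
label = 'A' if out[0] > out[1] else 'B'  # same decision rule as above
print(out, label)  # label is 'A' (ascending), which is wrong for this vector
```

A network that handles this case needs to test both `x[0] < x[1]` and `x[1] < x[2]`, for example by giving each hidden neuron one of the two differences.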