# Assignment 4: Neural Networks
## Part 1: Manual calculation of neural network output and gradient descent 

The figure below shows a 2-layer, feed-forward neural network with two
hidden-layer nodes and one output node. $x_1$ and $x_2$ are the two inputs,
and $o_1$ is the output. For the following questions, assume the learning
rate is $0.5$.  Each node also has a bias input value ($x_0$) of $1$.
Assume there is a sigmoid ($\sigma(x)$) activation function at the hidden
layer nodes and at the output layer node.

There are a variety of activation functions, each with different pros and cons.
We are using the sigmoid activation function for this assignment. It has the
property of clamping the output between 0 and 1. Note that the sigmoid function is

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

and its derivative is

$$
\sigma'(x) = \sigma(x) (1 - \sigma(x))
$$

**Note that this is not the update rule. You will find this derivative in the
update rule equation in the slides.**
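
As a quick sanity check on the two formulas above, here is a minimal sketch in plain Python. The helper names `sigmoid_prime` and `numeric_derivative` are illustrative and not part of the assignment; the finite-difference comparison is only an independent way to confirm that $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps any real x into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_derivative(f, x, eps=1e-6):
    """Central finite difference, used only as an independent check."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# The analytic and numeric derivatives should agree closely at every point.
for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, sigmoid_prime(x), numeric_derivative(sigmoid, x))
```

The derivative peaks at $x = 0$ and shrinks toward zero for large $|x|$, which is why the weight updates in the manual calculation are small whenever a node's output is close to 0 or 1.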

![](static/neural_network_manual_calculation.png)

Calculate the output values at nodes $h_1$, $h_2$ and $o_1$ of this
network for input $x_1 = 1, x_2 = 0$. Remember that the bias term $x_0 = 1$.
Each unit produces as its output the real value computed by the unit's
associated sigmoid function.

Then calculate one step of backpropagation, for the same input and target
output $t = 1$ for $o_1$. Compute updated weights for the 3 connections
into the output layer, and then for the 6 connections into the hidden layer. There
should be 9 updated weights in total. Make sure you understand when the
activation function for a layer is applied in a neural network, and use the
derivative of the activation function combined with the learning rate to
determine the change in weights.

People use many different functions as the loss function, depending on the task
and the properties of the output data. For this question, use the L1 loss
function to determine the error for backpropagation. L1 is defined as

$$
  \text{L1} = \frac{1}{n} \sum^n_{i = 1} t_i - o_i
$$

where n is the number of dimensions in the output layer. It's a simple function
calculating the difference between the target vector and the output vector. In
this case, the output node is a single real value, so $n = 1$. And the loss
becomes

$$
  \text{L1} =  t - o_1
$$

Feel free to refer to the slides, which contain all the equations you need to
complete part 1. **Note that the equations in the slides use the sigmoid
activation function and L1 loss, so referring to them might make this process
easier.** You can also seek out and read other sources to make sure you fully
understand how neural networks work. Your TA is also glad to answer any
questions.

Submit your final answers, along with the steps to calculate them. If you
consult external resources for the assignment, please include a link or
citation to them too.


## Answers

a. Weight calculation

    The weights are wh1x0 = 1.5, wh1x1 = -2.5, wh1x2 = 1, wh2x0 = 2, wh2x1 = -1.5, wh2x2 = 3, wo1x0 = -1, wo1h1 = 1, wo1h2 = 0.5. The inputs are x0 = 1 (the bias input), x1 = 1 and x2 = 0.
    The net input at h1 is the sum of w_i * x_i for i = 0 to N.
    The terms are wh1x0 * x0 = 1.5 * 1 = 1.5, wh1x1 * x1 = -2.5 * 1 = -2.5, wh1x2 * x2 = 1 * 0 = 0, so the sum is
    = 1.5 + (-2.5) + 0
    = -1
    Now we apply the activation function to this value (-1). The activation function is the sigmoid, sigma(x) = 1 / (1 + e^(-x)),
    i.e. h1 = 1 / (1 + e^(-(-1))) = 1 / (1 + e) = 1 / (1 + 2.718) ~ 0.27

    The net input at h2 is the sum of w_i * x_i for i = 0 to N.
    The terms are wh2x0 * x0 = 2 * 1 = 2, wh2x1 * x1 = -1.5 * 1 = -1.5, wh2x2 * x2 = 3 * 0 = 0, so the sum is
    = 2 + (-1.5) + 0
    = 0.5
    Now we apply the activation function to this value (0.5). The activation function is the sigmoid, sigma(x) = 1 / (1 + e^(-x)),
    i.e. h2 = 1 / (1 + e^(-0.5)) = 1 / (1 + 0.61) ~ 0.62

    The net input at o1 is the sum of w_i * x_i for i = 0 to N, where the inputs are now the bias x0 and the hidden outputs h1 and h2.
    The terms are wo1x0 * x0 = -1 * 1 = -1, wo1h1 * h1 = 1 * 0.27 = 0.27, wo1h2 * h2 = 0.5 * 0.62 = 0.31, so the sum is
    = -1 + 0.27 + 0.31
    = -0.42
    Now we apply the activation function to this value (-0.42). The activation function is the sigmoid, sigma(x) = 1 / (1 + e^(-x)),
    i.e. o1 = 1 / (1 + e^(-(-0.42))) = 1 / (1 + 1.522) ~ 0.3965, which is 0.39 when cut off at the second decimal place

    Hence, the approximate output values at h1, h2, and o1 are 0.27, 0.62, 0.39 (reduced to the second decimal place).

    The loss for target output t = 1 is defined as 1/n times the sum of the differences between target and output. So, in our case, the loss is:
    L1 = t - o1 = 1 - 0.39 = 0.61

b. Back propagation:
    The error term Delta at the output node = derivative of the activation function * loss
    Delta = sigma(x) * (1 - sigma(x)) * (t1 - o1), where sigma(x) is our activation function and (t1 - o1) is the loss
    = sigma(x) * (1 - sigma(x)) * 0.61
    = 0.39 * (1 - 0.39) * 0.61, because sigma(x) is the value we obtained at output o1
    ~ 0.1451

    Now, we update the weights into the output layer by incorporating the error:
    w <- w + learning rate * Delta * input is the new weight
    wo1x0' = wo1x0 + delta of wo1x0 = -1 + learning rate * Delta at output node * x0 = -1 + 0.5 * 0.1451 * 1 = -0.92745
    wo1h1' = wo1h1 + delta of wo1h1 = 1 + learning rate * Delta at output node * h1 = 1 + 0.5 * 0.1451 * 0.27 ~ 1.0196
    wo1h2' = wo1h2 + delta of wo1h2 = 0.5 + learning rate * Delta at output node * h2 = 0.5 + 0.5 * 0.1451 * 0.62 ~ 0.5450

    The error term at each hidden node = derivative of the activation function * sum of w_k,h * delta_k over all k in OUT (the output nodes), as shown in the class slides
    for h1 -
    delta_h1 = h1(1-h1) * Delta_o1 * wo1h1 = 0.27(1-0.27) * 0.1451 * 1 ~ 0.028

    for h2 -
    delta_h2 = h2(1-h2) * Delta_o1 * wo1h2 = 0.62(1-0.62) * 0.1451 * 0.5 ~ 0.0171

    Now, we update the weights into the hidden layer by incorporating the error:

    wh1x0' = wh1x0 + delta = 1.5 + learning rate * Delta at h1 node * x0 = 1.5 + 0.5 * 0.028 * 1 = 1.514
    wh1x1' = wh1x1 + delta = -2.5 + learning rate * Delta at h1 node * x1 = -2.5 + 0.5 * 0.028 * 1 = -2.486
    wh1x2' = wh1x2 + delta = 1 + learning rate * Delta at h1 node * x2 = 1 + 0.5 * 0.028 * 0 = 1 + 0 = 1 

    wh2x0' = wh2x0 + delta = 2 + learning rate * Delta at h2 node * x0 = 2 + 0.5 * 0.0171 * 1 = 2.00855
    wh2x1' = wh2x1 + delta = -1.5 + learning rate * Delta at h2 node * x1 = -1.5 + 0.5 * 0.0171 * 1 = -1.49145
    wh2x2' = wh2x2 + delta = 3 + learning rate * Delta at h2 node * x2 = 3 + 0.5 * 0.0171 * 0 = 3
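
The hand calculation above can be cross-checked with a short script. This is a sketch under the same assumptions as the answer (weights as listed in part a, learning rate 0.5, target 1, and the delta rule from the slides); the list-based layout and variable names are illustrative. Because the script keeps full precision instead of rounding intermediate values to two decimals, its results differ slightly from the hand-computed ones (for example, o1 comes out at about 0.397 rather than 0.39).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Weights from the answer above, ordered as (bias, x1, x2) for the hidden
# nodes and (bias, h1, h2) for the output node.
w_h1 = [1.5, -2.5, 1.0]
w_h2 = [2.0, -1.5, 3.0]
w_o1 = [-1.0, 1.0, 0.5]

x = [1.0, 1.0, 0.0]  # x0 (bias input), x1, x2
t = 1.0              # target output
lr = 0.5             # learning rate

# Forward pass: weighted sum followed by the sigmoid at every node.
h1 = sigmoid(sum(w * xi for w, xi in zip(w_h1, x)))
h2 = sigmoid(sum(w * xi for w, xi in zip(w_h2, x)))
hidden = [1.0, h1, h2]  # bias input plus the two hidden outputs
o1 = sigmoid(sum(w * hi for w, hi in zip(w_o1, hidden)))

# Backward pass: error terms use the OLD output weights.
delta_o1 = o1 * (1 - o1) * (t - o1)
delta_h1 = h1 * (1 - h1) * w_o1[1] * delta_o1
delta_h2 = h2 * (1 - h2) * w_o1[2] * delta_o1

# Weight updates: w <- w + lr * delta * input.
new_w_o1 = [w + lr * delta_o1 * hi for w, hi in zip(w_o1, hidden)]
new_w_h1 = [w + lr * delta_h1 * xi for w, xi in zip(w_h1, x)]
new_w_h2 = [w + lr * delta_h2 * xi for w, xi in zip(w_h2, x)]

print("h1, h2, o1:", h1, h2, o1)
print("output weights:", new_w_o1)
print("h1 weights:", new_w_h1)
print("h2 weights:", new_w_h2)
```

Note that the weights attached to $x_2$ do not change, since $x_2 = 0$ makes their update term zero.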



## Part 2: Training a neural network

We're now going to get past the pesky XOR problem that has been the bane of
neural networks' existence.

First, let's install libraries. We will be using the PyTorch package to train our neural network. You can use
`pip` or `conda` to install the `torch` package.

Let's import the libraries.

In [1]:
import torch
import random

torch.manual_seed(1)
random.seed(1)

Then let's set up the network architecture. We are constructing the same model as the image above.

In [2]:
model = torch.nn.Sequential(
    # connecting the input to the hidden layer
    torch.nn.Linear(2, 2), # takes a vector of 2 numbers and outputs a vector of 2 numbers
    torch.nn.Sigmoid(), # activation function
    # connecting the hidden layer to the output layer
    torch.nn.Linear(2, 1), # takes a vector of 2 numbers and outputs a vector of 1 number
    torch.nn.Sigmoid() # activation function
)

Now we need to train the network. Note that we are going to use the binary
cross entropy loss function to optimally train this tiny network. This is a
loss (error) measure we did not cover in class. We are using it for the
assignment because, as its name suggests, it is well-suited for learning binary
functions. XOR is of course such a function, having inputs and outputs that are
1s and 0s, not continuous quantities.
It's best for training if the examples presented to the network are random. So we are going to randomly generate XOR examples in each loop.
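
Since binary cross entropy was not covered in class, here is a minimal sketch of the formula for a single example, $-(t \log p + (1 - t) \log(1 - p))$, where $p$ is the predicted output and $t$ is the 0/1 target. The function name `binary_cross_entropy` is illustrative; it is a plain-Python stand-in meant to show the shape of the loss, not a replacement for `torch.nn.BCELoss`.

```python
import math

def binary_cross_entropy(predicted, target):
    """BCE for one example: -(t*log(p) + (1 - t)*log(1 - p)).

    Assumes 0 < predicted < 1, which holds for a sigmoid output.
    """
    return -(target * math.log(predicted)
             + (1 - target) * math.log(1 - predicted))

# With target 1, the loss grows quickly as the prediction moves toward 0.
for p in (0.9, 0.5, 0.1):
    print(f"target=1, predicted={p}: loss={binary_cross_entropy(p, 1):.4f}")
```

Confident wrong answers are penalized much more heavily than mildly wrong ones, which is one reason this loss suits 0/1 targets such as XOR.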

PyTorch helps us out with automatic differentiation. We just need to use the gradient calculated with the loss function to modify the value of the model parameters (weights).



In [53]:
learning_rate = 1
def random_binary():
    return random.choice([0, 1])

loss_fn = torch.nn.BCELoss()
for i in range(5000):

    # we're creating a random input and output tensor (vector) that follows the
    # xor rule
    x1, x2 = random_binary(), random_binary()
    true_output = torch.Tensor([x1 ^ x2])

    # getting the model's predicted output
    predicted_output = model(torch.Tensor([x1, x2]))

    # calculating the loss (error)
    loss = loss_fn(predicted_output, true_output)


    # calculating the gradients, in order to update the weights
    model.zero_grad()
    loss.backward()

    with torch.no_grad(): # prevents gradient tracking during updates
        # for each parameter (weight and bias) in the model
        for param in model.parameters():
            # update the parameter using the gradient scaled by the learning
            # rate
            param -= learning_rate * param.grad


We can now see what the neural network outputs for a given input. We are asking
the model to predict $\text{XOR}(0, 1)$ and $\text{XOR}(1, 1)$. For your
submission, please print the output for all input combinations.

In [82]:
print(model(torch.Tensor([0, 1])))
print(model(torch.Tensor([1, 1])))
print(model(torch.Tensor([1, 0])))
print(model(torch.Tensor([0, 0])))

tensor([0.9999], grad_fn=<SigmoidBackward0>)
tensor([0.0001], grad_fn=<SigmoidBackward0>)
tensor([0.9999], grad_fn=<SigmoidBackward0>)
tensor([7.1151e-05], grad_fn=<SigmoidBackward0>)


If the neural network is not successful at XOR, try retraining it from scratch or increasing the number of epochs (training iterations).

## Prying open a neural network
Now that we have a trained network, we can look at the weights and biases for each layer of the network. To do so, use `model.named_parameters()`, as in the `print(list(model.named_parameters()))` cell below.

In [99]:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

W0 = np.array([[-7.4787, -7.4792], [-8.8528, -8.8541]])
b0 = np.array([11.1868, 3.9412])
W2 = np.array([[19.7294, -20.0986]])
b2 = np.array([-9.5642])

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

def predict(inputs):
    layer0_output = np.dot(inputs, W0.T) + b0
    layer0_output = sigmoid(layer0_output)
    print(layer0_output)
    layer2_output = np.dot(layer0_output, W2.T) + b2
    final_output = sigmoid(layer2_output)
    print(final_output)
    return final_output

predictions = predict(inputs)
rounded_predictions = np.round(predictions)

for input_combination, prediction in zip(inputs, rounded_predictions):
    # each prediction is a 1-element array, so take its single value explicitly
    print(f"Input: {tuple(input_combination.tolist())}, XOR Output: {int(prediction[0])}")

[[9.99986144e-01 9.80945246e-01]
 [9.76051274e-01 7.29749419e-03]
 [9.76062959e-01 7.30691774e-03]
 [2.25084248e-02 1.05107086e-06]]
[[7.11464145e-05]
 [9.99928521e-01]
 [9.99928524e-01]
 [1.09427006e-04]]
Input: (0, 0), XOR Output: 0
Input: (0, 1), XOR Output: 1
Input: (1, 0), XOR Output: 1
Input: (1, 1), XOR Output: 0


In [101]:
# ----- HELPER FUNCTION BY STUDENT ---------
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def predict(x1, x2):
    x0 = 1
    z_h1 = w_h1_x0 * x0 + w_h1_x1 * x1 + w_h1_x2 * x2
    h1 = sigmoid(z_h1)
    print(h1)

    z_h2 = w_h2_x0 * x0 + w_h2_x1 * x1 + w_h2_x2 * x2
    h2 = sigmoid(z_h2)
    print(h2)
    z_o = w_o_0 * x0 + w_o_1 * h1 + w_o_2 * h2
    o = sigmoid(z_o)
    print(o)
    return o

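# Layer 0.weight, row 0: weights from x1 and x2 into h1
w_h1_x1 = -7.4787
w_h1_x2 = -7.4792
# Layer 0.weight, row 1: weights from x1 and x2 into h2
w_h2_x1 = -8.8528
w_h2_x2 = -8.8541
# Layer 0.bias: biases of h1 and h2
w_h1_x0 = 11.1868
w_h2_x0 = 3.9412
# Layer 2.bias (w_o_0) and Layer 2.weight (w_o_1 from h1, w_o_2 from h2)
w_o_0 = -9.5642
w_o_1 = 19.7294
w_o_2 = -20.0986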

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = [(x1, x2, predict(x1, x2)) for x1, x2 in inputs]

for x1, x2, output in predictions:
    print(f"Input: ({x1}, {x2}) -> XOR Output: {output:.4f}")


0.9999861442999742
0.9809452456394838
7.114641454479416e-05
0.9760512741967798
0.0072974941917869586
0.9999285210869288
0.9760629590074197
0.007306917739368373
0.9999285240268854
0.02250842476520902
1.0510708555512887e-06
0.00010942700604166475
Input: (0, 0) -> XOR Output: 0.0001
Input: (0, 1) -> XOR Output: 0.9999
Input: (1, 0) -> XOR Output: 0.9999
Input: (1, 1) -> XOR Output: 0.0001


In [85]:
print(list(model.named_parameters()))
print(model)

[('0.weight', Parameter containing:
tensor([[-7.4787, -7.4792],
        [-8.8528, -8.8541]], requires_grad=True)), ('0.bias', Parameter containing:
tensor([11.1868,  3.9412], requires_grad=True)), ('2.weight', Parameter containing:
tensor([[ 19.7294, -20.0986]], requires_grad=True)), ('2.bias', Parameter containing:
tensor([-9.5642], requires_grad=True))]
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Sigmoid()
  (2): Linear(in_features=2, out_features=1, bias=True)
  (3): Sigmoid()
)


Here's some sample output:

```python
[('0.weight',
  Parameter containing:
  tensor([[-6.2839, -6.2317],
          [-8.0318, -8.0085]], requires_grad=True)),
 ('0.bias',
  Parameter containing:
  tensor([9.3037, 3.4865], requires_grad=True)),
 ('2.weight',
  Parameter containing:
  tensor([[ 14.3675, -14.7218]], requires_grad=True)),
 ('2.bias',
  Parameter containing:
  tensor([-6.8876], requires_grad=True))]
```

Note that each of the 2 layers has 2 items in the output: weights and biases.

The weights are represented as a transformation matrix transforming the input
to a layer.

For example, for the Python output above, the weight connecting
$x_1$ to $h_1$ has the value $-6.2839$, and the weight connecting $x_2$
to $h_1$ has the value $-6.2317$. The bias value for $h_1$ is $9.3037$, and the bias value for $h_2$ is $3.4865$.
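
To make the layout concrete, the sketch below labels every entry of the sample `0.weight` and `0.bias` output above. It relies on the fact that `torch.nn.Linear` stores its weight with shape `(out_features, in_features)`, so row $i$ holds the weights going into hidden node $h_{i+1}$. The dictionary `connections` and the name lists are illustrative helpers, not part of the assignment.

```python
# Sample values copied from the printed parameters above.
weight_0 = [[-6.2839, -6.2317],
            [-8.0318, -8.0085]]
bias_0 = [9.3037, 3.4865]

input_names = ["x1", "x2"]
hidden_names = ["h1", "h2"]

# Row index selects the hidden node, column index selects the input.
connections = {}
for row, h in enumerate(hidden_names):
    connections[(h, "bias")] = bias_0[row]
    for col, x in enumerate(input_names):
        connections[(h, x)] = weight_0[row][col]

for (h, source), value in connections.items():
    print(f"{source} -> {h}: {value}")
```

The same reading applies to `2.weight` and `2.bias`: the single row of `2.weight` holds the weights from $h_1$ and $h_2$ into $o_1$.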

## Questions

* What were the final weights and biases for your network? How well did it do?
* Using the weights and biases for your network, calculate the
predictions for each of the four input-output combinations in the XOR
truth table (you can find this in the slides or write it down
yourself). You may do this manually or using code (looping, making
functions, etc.), but do not use the trained model to do so.

  To make this easier, here is a function that calculates the sigmoid
  of an input vector.

  ```python
  import numpy as np

  def sigmoid(x):
    return 1 / (1 + np.exp(-x))
  ```

* Conduct an analysis of the weights and biases of the network you have just trained. Why does it compute the XOR function? See if you can interpret the neural network as a combination of simpler logical functions such as OR and AND, or outline a different insight regarding the behavior of the network. Be creative and extensive here!

## Credits

Part 1 was adapted from University of Wisconsin-Madison's CS 540 class.

## Submission guidelines

Please submit the `.ipynb` file to Canvas.

## Answers
1. Below is the output after tweaking the epoch count - 

[('0.weight', Parameter containing:
tensor([[-7.4787, -7.4792], [-8.8528, -8.8541]], requires_grad=True)), 

('0.bias', Parameter containing:
tensor([11.1868,  3.9412], requires_grad=True)), 

('2.weight', Parameter containing:
tensor([[ 19.7294, -20.0986]], requires_grad=True)), 

('2.bias', Parameter containing:
tensor([-9.5642], requires_grad=True))]

Model - 
Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Sigmoid()
  (2): Linear(in_features=2, out_features=1, bias=True)
  (3): Sigmoid()
)

For the Python output above, the weights are - 
- Layer 0.weight (inputs into h1)
w_h1_x1 = -7.4787
w_h1_x2 = -7.4792

- Layer 0.weight (inputs into h2)
w_h2_x1 = -8.8528
w_h2_x2 = -8.8541

- Layer 0.bias
w_h1_x0 = 11.1868
w_h2_x0 = 3.9412

- Layer 2.bias
w_o_0 = -9.5642

- Layer 2.weight
w_o_1 = 19.7294
w_o_2 = -20.0986

The model seems to have performed well.
The number of epochs was increased from the default value, and the output for the XOR(0, 1) case, whose target value is 1, rose from 0.9995 to almost 0.9999.
For XOR(1, 1), the expected answer is 0, and after 5000 epochs of training the output gets as close as 0.0001. Even after doubling the number of epochs, these values do not seem to improve.

2. Predictions computed with the sigmoid activation function for XOR over the usual 0 and 1 inputs, rounded to the nearest integer:
Input: (0, 0) -> XOR Output: 0, expected value - 0
Input: (0, 1) -> XOR Output: 1, expected value - 1
Input: (1, 0) -> XOR Output: 1, expected value - 1
Input: (1, 1) -> XOR Output: 0, expected value - 0

Helper function added to check the values - 
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

W0 = np.array([[-7.4787, -7.4792], [-8.8528, -8.8541]])
b0 = np.array([11.1868, 3.9412])
W2 = np.array([[19.7294, -20.0986]])
b2 = np.array([-9.5642])

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

def predict(inputs):
    layer0_output = np.dot(inputs, W0.T) + b0
    layer0_output = sigmoid(layer0_output)
    layer2_output = np.dot(layer0_output, W2.T) + b2
    final_output = sigmoid(layer2_output)
    return final_output

predictions = predict(inputs)
rounded_predictions = np.round(predictions)

for input_combination, prediction in zip(inputs, rounded_predictions):
    # each prediction is a 1-element array, so take its single value explicitly
    print(f"Input: {tuple(input_combination.tolist())}, XOR Output: {int(prediction[0])}")

3. The weights are as mentioned in the above questions - 
W0 = np.array([[-7.4787, -7.4792], [-8.8528, -8.8541]])
b0 = np.array([11.1868, 3.9412])
W2 = np.array([[19.7294, -20.0986]])
b2 = np.array([-9.5642])

The weights of layer 0 are all negative and the biases are positive, which means each hidden neuron is active by default and is switched off only by certain inputs of x1 and x2. The first hidden node, h1, can be seen as implementing
something similar to the NOT AND (NAND) function, since it outputs a value near 1 for all combinations except when both inputs are 1. The activation function in turn squashes each node's output into the range between 0 and 1, which makes the values easy to read as truth values.

If we look at the weights in W2, one is very positive (19.7294, applied to h1) and one is very negative (-20.0986, applied to h2), which makes the output sensitive to the difference between h1 and h2.
Because of this, inputs with the same value end up being penalized and give an output near 0, whereas inputs with different values are not penalized and give an output near 1.
For example, for the same-valued input 0, 0, the values of h1, h2 and o are
h1 - 0.9999861442999742, h2 - 0.9809452456394838, o - 7.114641454479416e-05. Whereas, for a different-valued input like 0, 1, the values are
h1 - 0.9760512741967798, h2 - 0.0072974941917869586, o - 0.9999285210869288. The difference is that same-valued inputs give h1 and h2 that agree (both high for 0, 0 and both low for 1, 1), while different-valued inputs
give a high h1 and a low h2. Only in that last case does the positive weight on h1 overcome the output bias b2 = -9.5642 without being cancelled by the negative weight on h2, and hence the output is high.

We can indeed interpret the network as a combination of AND and OR style functions if we look at the output layer, which is a weighted sum followed by a sigmoid transformation.
The second hidden node, h2, is close to 1 only when both inputs are 0, so it behaves like NOT OR (NOR). The output node is close to 1 only when h1 is close to 1 and h2 is close to 0, so it behaves like h1 AND (NOT h2), that is, NAND AND OR, which is XOR.
This is the case for 0, 1: h1 - 0.9760512741967798, h2 - 0.0072974941917869586, o - 0.9999285210869288. For inputs with different values,
the weights of layer 0 and the bias terms give an h1 close to 1 and an h2 close to 0, whereas inputs with the same value give h1 and h2 that are both close to 1 or both close to 0.

An example quoted from Mitchell's neural nets book - "If we assume boolean values of 1 (true) and -1 (false), then one way to
use a two-input perceptron to implement the AND function is to set the weights w0 = -.8, and w1 = w2 = .5", and to turn it into an OR function,
we alter w0 = -.3. The output layer of our network works in a similar way: its weights 19.7294 and -20.0986, together with the bias -9.5642, act as a threshold on h1 and h2, just as w0 sets the threshold in this example. So, this neural net does behave like a combination of AND and OR functions.
Essentially, the second layer combines the outputs of the previous layer, emphasizes the effect of differing inputs and suppresses the output when both inputs are the same.
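
The logical reading of the network can be checked directly. The sketch below is one way to test the interpretation, using the trained weights reported in this answer and treating any activation above 0.5 as "true" (the 0.5 threshold and the helper name `hidden_and_output` are assumptions made for illustration). It compares h1 against NAND, h2 against NOR, and the output against both XOR and "h1 AND NOT h2".

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Trained parameters reported above.
W0 = [[-7.4787, -7.4792], [-8.8528, -8.8541]]
b0 = [11.1868, 3.9412]
W2 = [19.7294, -20.0986]
b2 = -9.5642

def hidden_and_output(x1, x2):
    """Return (h1, h2, o) for one input pair, computed without the model."""
    h1 = sigmoid(W0[0][0] * x1 + W0[0][1] * x2 + b0[0])
    h2 = sigmoid(W0[1][0] * x1 + W0[1][1] * x2 + b0[1])
    o = sigmoid(W2[0] * h1 + W2[1] * h2 + b2)
    return h1, h2, o

results = []
for x1 in (0, 1):
    for x2 in (0, 1):
        h1, h2, o = hidden_and_output(x1, x2)
        # Threshold each activation at 0.5 to read it as a truth value.
        h1_bit, h2_bit, o_bit = h1 > 0.5, h2 > 0.5, o > 0.5
        results.append((x1, x2, h1_bit, h2_bit, o_bit))
        print(f"({x1}, {x2}): h1={h1_bit} (NAND={not (x1 and x2)}), "
              f"h2={h2_bit} (NOR={not (x1 or x2)}), "
              f"o={o_bit} (XOR={bool(x1 ^ x2)})")
```

Under this thresholding, the output row matches XOR exactly, and it also equals `h1_bit and not h2_bit` for every input, which supports the NAND/NOR decomposition for this particular set of weights. A network retrained from a different random seed may settle on a different but equivalent decomposition.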

