<a href="https://colab.research.google.com/github/robotictang/BAA3284-Capstone-Project/blob/pytorch/t81_558_class_03_5_weights.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Applications of Deep Neural Networks
**Module 3: Introduction to PyTorch and Keras**


# Module 3 Material

* Part 3.1: Deep Learning and Neural Network Introduction Keras [[Video]](https://www.youtube.com/watch?v=zYnI4iWRmpc) [[Notebook]](t81_558_class_03_1_neural_net.ipynb)
* Part 3.2: Introduction to Keras [[Video]](https://www.youtube.com/watch?v=PsE73jk55cE) [[Notebook]](t81_558_class_03_2_pytorch.ipynb)
* Part 3.3: Saving and Loading a Keras Neural Network [[Video]](https://www.youtube.com/watch?v=-9QfbGM1qGw) [[Notebook]](t81_558_class_03_3_save_load.ipynb)
* Part 3.4: Early Stopping in Keras to Prevent Overfitting [[Video]](https://www.youtube.com/watch?v=m1LNunuI2fk) [[Notebook]](t81_558_class_03_4_early_stop.ipynb)
* Part 3.5: Extracting Weights and Manual Calculation Keras [[Video]](https://www.youtube.com/watch?v=7PWgx16kH8s) [[Notebook]](t81_558_class_03_5_weights.ipynb)
* Part 3.6: Deep Learning and Neural Network Introduction PyTorch [[Video]](https://www.youtube.com/watch?v=zYnI4iWRmpc&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_03_6_neural_net.ipynb)
* Part 3.7: Introduction to PyTorch [[Video]](https://www.youtube.com/watch?v=PsE73jk55cE&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_03_7_pytorch.ipynb)
* Part 3.8: Saving and Loading a PyTorch Neural Network [[Video]](https://www.youtube.com/watch?v=-9QfbGM1qGw&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_03_8_save_load.ipynb)
* Part 3.9: Early Stopping in PyTorch to Prevent Overfitting [[Video]](https://www.youtube.com/watch?v=m1LNunuI2fk&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_03_9_early_stop.ipynb)
* **Part 3.10: Extracting Weights and Manual Calculation** [[Video]](https://www.youtube.com/watch?v=7PWgx16kH8s&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_03_10_weights.ipynb)

# Google CoLab Instructions

The following code ensures that Google CoLab is running and maps Google Drive if needed.

In [None]:
try:
    import google.colab
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# Part 3.5: Extracting Weights and Manual Network Calculation

## Weight Initialization

The weights of a neural network determine the output for the neural network. The training process can adjust these weights, so the neural network produces useful output. Most neural network training algorithms begin by initializing the weights to a random state. Training then progresses through iterations that continuously improve the weights to produce better output.

The random weights of a neural network impact how well that neural network can be trained. If a neural network fails to train, you can remedy the problem by simply restarting with a new set of random weights. However, this solution can be frustrating when you are experimenting with the architecture of a neural network and trying different combinations of hidden layers and neurons. If you add a new layer, and the network’s performance improves, you must ask yourself if this improvement resulted from the new layer or from a new set of weights. Because of this uncertainty, we look for two key attributes in a weight initialization algorithm:

* How consistently does this algorithm provide good weights?
* How much of an advantage do the weights of the algorithm provide?

One of the most common yet least practical approaches to weight initialization is to set the weights to random values within a specific range. Numbers between -1 and +1 or -5 and +5 are often the choice. If you want to ensure that you get the same set of random weights each time, you should use a seed. The seed specifies a set of predefined random weights to use. For example, a seed of 1000 might produce random weights of 0.5, 0.75, and 0.2. These values are still random; you cannot predict them, yet you will always get these values when you choose a seed of 1000. 
Not all seeds are created equal. One problem with random weight initialization is that the random weights created by some seeds are much more difficult to train than others. The weights can be so bad that training is impossible. If you cannot train a neural network with a particular weight set, you should generate a new set of weights using a different seed.
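The effect of a seed can be sketched in a few lines. The example below uses Python's standard `random` module rather than PyTorch so that it runs anywhere; the helper `random_weights` and the range of -1 to +1 are illustrative assumptions, not part of this notebook. In PyTorch, `torch.manual_seed` plays the same role for layer initialization.

```python
import random

def random_weights(seed, count=3, low=-1.0, high=1.0):
    """Return `count` uniform random weights in [low, high] for a given seed."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return [rng.uniform(low, high) for _ in range(count)]

first = random_weights(1000)
second = random_weights(1000)  # same seed: the same weights again
other = random_weights(1001)   # different seed: a different set of weights

print(first)
print(second)
print(other)
```

The first two lists are identical because they share a seed, which is what makes an experiment repeatable when you change the architecture but want to keep the starting weights comparable.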

Because weight initialization is a problem, considerable research has been done on it. By default, PyTorch uses a [uniform random distribution](https://discuss.pytorch.org/t/how-are-layer-weights-and-biases-initialized-by-default/13073) based on the size of the layer. The Xavier weight initialization algorithm, introduced in 2010 by Glorot & Bengio[[Cite:glorot2010understanding]](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf), is also a common choice for weight initialization. This relatively simple algorithm uses normally distributed random numbers.  

To use the Xavier weight initialization, it is necessary to understand that normally distributed random numbers are not the typical random numbers between 0 and 1 that most programming languages generate. Normally distributed random numbers are centered on a mean ($\mu$, mu) that is typically 0. If 0 is the center (mean), then you will get an equal number of random numbers above and below 0. The next question is how far these random numbers will venture from 0. In theory, you could end up with both positive and negative numbers close to the maximum positive and negative ranges supported by your computer. However, the reality is that you will more likely see random numbers that are between 0 and three standard deviations from the center.

The standard deviation ($\sigma$, sigma) parameter specifies the size of this standard deviation. For example, if you specified a standard deviation of 10, you would mainly see random numbers between -30 and +30, and the numbers nearer to 0 have a much higher probability of being selected.  

For a normal distribution with a standard deviation of 1, the probability density peaks at about 0.4 at the center, which in this case is 0. Additionally, the density decreases very quickly beyond -2 or +2 standard deviations. By defining the center and how large the standard deviations are, you can control the range of random numbers that you will receive.
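The spread described above can be checked empirically. This is a minimal sketch using the standard library's `random.gauss`; the seed of 42, the sample count, and the helper `fraction_within` are arbitrary choices made for illustration.

```python
import random

rng = random.Random(42)  # arbitrary seed, only for repeatability
mu, sigma = 0.0, 10.0    # center and standard deviation from the text
samples = [rng.gauss(mu, sigma) for _ in range(100_000)]

def fraction_within(values, center, radius):
    """Fraction of values that lie within `radius` of `center`."""
    return sum(abs(v - center) <= radius for v in values) / len(values)

for k in (1, 2, 3):
    frac = fraction_within(samples, mu, k * sigma)
    print(f"within {k} standard deviation(s): {frac:.3f}")
```

In theory about 68%, 95%, and 99.7% of the values fall within one, two, and three standard deviations, so with a standard deviation of 10 nearly every sample lands between -30 and +30.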

The Xavier weight initialization sets all weights to normally distributed random numbers. These weights are always centered at 0; however, their standard deviation varies depending on how many connections are present for the current layer of weights. Specifically, the following equation gives the variance, where $n_{in}$ and $n_{out}$ are the numbers of incoming and outgoing connections of the layer:

$$ Var(W) = \frac{2}{n_{in}+n_{out}} $$

The above equation shows how to obtain the variance for all weights. The square root of the variance is the standard deviation. Most random number generators accept a standard deviation rather than a variance. As a result, you usually need to take the square root of the above equation. Figure 3.XAVIER shows how this algorithm might initialize one layer. 

**Figure 3.XAVIER: Xavier Weight Initialization**
![Xavier Weight Initialization](https://github.com/jeffheaton/t81_558_deep_learning/blob/pytorch/images/xavier_weight.png?raw=1)

We complete this process for each layer in the neural network.  
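As a rough sketch of the procedure, the code below takes the square root of the variance formula and draws one set of weights per layer. It uses only the standard library; `xavier_std` and `xavier_layer` are hypothetical helper names, and the layer sizes are those of the small XOR network built in the next section. PyTorch ships its own version of this scheme as `torch.nn.init.xavier_normal_`.

```python
import math
import random

def xavier_std(n_in, n_out):
    """Standard deviation from Var(W) = 2 / (n_in + n_out)."""
    return math.sqrt(2.0 / (n_in + n_out))

def xavier_layer(n_in, n_out, rng):
    """One layer's weights: n_out rows of n_in normally distributed values."""
    std = xavier_std(n_in, n_out)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)   # arbitrary seed for repeatability
layer_sizes = [2, 2, 1]  # 2 inputs, 2 hidden neurons, 1 output

# Repeat the same process for each pair of adjacent layers
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    weights = xavier_layer(n_in, n_out, rng)
    print(f"{n_in} -> {n_out}: std = {xavier_std(n_in, n_out):.4f}")
    print(weights)
```

Note that wider layers receive a smaller standard deviation, which keeps the weighted sums from growing as the number of connections increases.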

## Manual Neural Network Calculation

This section will build a neural network and analyze it down to the individual weights. We will train a simple neural network that learns the XOR function. It is not hard to hand-code the neurons to provide an [XOR function](https://en.wikipedia.org/wiki/Exclusive_or); however, for simplicity, we will let PyTorch train this network for us. The neural network is small, with two inputs, two hidden neurons, and a single output. We train with the Adam optimizer until the loss drops below 0.01, resetting the weights every 1,000 epochs so that a poor random start does not stall training. This approach is heavy-handed, but it gets the result, and our focus here is not on tuning.

In [39]:
import torch
import torch.nn as nn

# XOR truth table: four input pairs and their expected outputs
x = torch.tensor(
    [[0, 0],
     [0, 1],
     [1, 0],
     [1, 1]], dtype=torch.float32)
y = torch.tensor([0, 1, 1, 0], dtype=torch.float32).view(-1, 1)

class Net(nn.Module):
    def __init__(self, input_dim=2, output_dim=1):
        super().__init__()
        self.lin1 = nn.Linear(input_dim, 2)   # 2 inputs -> 2 hidden neurons
        self.lin2 = nn.Linear(2, output_dim)  # 2 hidden neurons -> 1 output

    def forward(self, x):
        x = self.lin1(x)
        x = torch.sigmoid(x)  # sigmoid activation on the hidden layer
        x = self.lin2(x)      # linear output
        return x

    def reset(self):
        # Re-randomize the weights of every layer that supports it
        for layer in self.children():
            if hasattr(layer, 'reset_parameters'):
                layer.reset_parameters()

model = Net()

loss_func = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

i = 0
loss = 1
# Train until the loss is small enough
while loss > 1e-2:
    i += 1
    optimizer.zero_grad()
    pred = model(x)
    loss = loss_func(pred, y)
    loss.backward()
    optimizer.step()

    if i % 100 == 0:
        print(f"Epoch: {i}, {loss.item()}")
    # Restart from new random weights if training has not converged yet
    if i % 1000 == 0 and loss > 1e-2:
        model.reset()

print(f"Final loss: {float(loss)}")

      

Epoch: 100, 0.25281381607055664
Epoch: 200, 0.2504228949546814
Epoch: 300, 0.25005096197128296
Epoch: 400, 0.24999704957008362
Epoch: 500, 0.2499832808971405
Epoch: 600, 0.24997738003730774
Epoch: 700, 0.24997231364250183
Epoch: 800, 0.24996590614318848
Epoch: 900, 0.24995684623718262
Epoch: 1000, 0.24994313716888428
Epoch: 1100, 0.24922847747802734
Epoch: 1200, 0.24562330543994904
Epoch: 1300, 0.23647019267082214
Epoch: 1400, 0.2098490297794342
Epoch: 1500, 0.17246952652931213
Epoch: 1600, 0.14777694642543793
Epoch: 1700, 0.1369965672492981
Epoch: 1800, 0.1326705813407898
Epoch: 1900, 0.1305185854434967
Epoch: 2000, 0.12924674153327942
Epoch: 2100, 0.24947486817836761
Epoch: 2200, 0.24793633818626404
Epoch: 2300, 0.235604390501976
Epoch: 2400, 0.162125825881958
Epoch: 2500, 0.029538724571466446
Final loss: 0.009675893001258373


The output above shows the training loss every 100 epochs, including the restarts from new random weights after epochs 1,000 and 2,000, until the loss drops below 0.01. Once trained, the network should produce values near 0.0 for the inputs [0,0] and [1,1] and values near 1.0 for the inputs [1,0] and [0,1]. Due to random starting weights, it is sometimes necessary to run the above through several cycles to get a good result.

Now that we've trained the neural network, we can dump the weights.  

In [51]:
for layerNum, layer in enumerate(model.children()):
    for toNeuronNum, bias in enumerate(layer.bias):
        print(f'{layerNum}B -> L{layerNum+1}N{toNeuronNum}: {bias.item()}')

    # nn.Linear stores weights as (out_features, in_features):
    # each row holds the incoming weights of one destination neuron
    for toNeuronNum, wgt in enumerate(layer.weight):
        for fromNeuronNum, wgt2 in enumerate(wgt):
            print(f'L{layerNum}N{fromNeuronNum} '
                  f'-> L{layerNum+1}N{toNeuronNum} = {wgt2.item()}')

0B -> L1N0: -0.9977866411209106
0B -> L1N1: -3.027963876724243
L0N0 -> L1N0 = -2.822441339492798
L0N1 -> L1N0 = 2.410576105117798
L0N0 -> L1N1 = 4.310168266296387
L0N1 -> L1N1 = -4.205879211425781
1B -> L2N0: -0.34139105677604675
L1N0 -> L2N0 = 1.5188194513320923
L1N1 -> L2N0 = 1.6310256719589233
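The dump can be checked by hand. The sketch below copies the values printed above and reruns the forward pass of `Net` (sigmoid hidden layer, linear output) in plain Python. It relies on the layout of `nn.Linear`, whose `weight` has shape `(out_features, in_features)`, so the values are arranged with one row per destination neuron in the order they were printed. The helper names `sigmoid`, `dense`, and `predict` are illustrative and not part of the notebook.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Values copied from the PyTorch dump above; row j, column i of each
# weight matrix connects input i to neuron j.
W1 = [[-2.822441339492798, 2.410576105117798],
      [4.310168266296387, -4.205879211425781]]
B1 = [-0.9977866411209106, -3.027963876724243]
W2 = [[1.5188194513320923, 1.6310256719589233]]
B2 = [-0.34139105677604675]

def dense(inputs, weights, biases):
    """Weighted sum plus bias for every neuron in one layer."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def predict(x0, x1):
    hidden = [sigmoid(s) for s in dense([x0, x1], W1, B1)]
    return dense(hidden, W2, B2)[0]  # the output layer has no activation

for x0, x1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"[{x0}, {x1}] -> {predict(x0, x1):.3f}")
```

The outputs for [0,1] and [1,0] come out close to 1 and the outputs for [0,0] and [1,1] close to 0, which is consistent with the small final loss reported during training.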


In [None]:
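# Dump weights (Keras version of the loop above; `model` here must be a
# Keras model, since the PyTorch Net class has no `layers` attribute)
for layerNum, layer in enumerate(model.layers):
    weights = layer.get_weights()[0]  # weight matrix, shape (inputs, outputs)
    biases = layer.get_weights()[1]   # one bias per neuron in the layer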
    
    for toNeuronNum, bias in enumerate(biases):
        print(f'{layerNum}B -> L{layerNum+1}N{toNeuronNum}: {bias}')
    
    for fromNeuronNum, wgt in enumerate(weights):
        for toNeuronNum, wgt2 in enumerate(wgt):
            print(f'L{layerNum}N{fromNeuronNum} \
                  -> L{layerNum+1}N{toNeuronNum} = {wgt2}')

0B -> L1N0: 1.3025760914331386e-08
0B -> L1N1: -1.4192625741316078e-08
L0N0                   -> L1N0 = 0.659289538860321
L0N0                   -> L1N1 = -0.9533336758613586
L0N1                   -> L1N0 = -0.659289538860321
L0N1                   -> L1N1 = 0.9533336758613586
1B -> L2N0: -1.9757269598130733e-08
L1N0                   -> L2N0 = 1.5167843103408813
L1N1                   -> L2N0 = 1.0489506721496582


If you rerun this, you will probably get different weights.  There are many ways to solve the XOR function.

In the next section, we copy/paste the weights from above and recreate the calculations done by the neural network, using max(0, x) (the ReLU function) as the activation of each neuron.  Because weights can change with each training, the weights used for the code below came from this run:

```
0B -> L1N0: 1.3025760914331386e-08
0B -> L1N1: -1.4192625741316078e-08
L0N0 -> L1N0 = 0.659289538860321
L0N0 -> L1N1 = -0.9533336758613586
L0N1 -> L1N0 = -0.659289538860321
L0N1 -> L1N1 = 0.9533336758613586
1B -> L2N0: -1.9757269598130733e-08
L1N0 -> L2N0 = 1.5167843103408813
L1N1 -> L2N0 = 1.0489506721496582
```

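In [54]:
# One XOR case: the input [0, 1] should produce an output near 1
input0 = 0
input1 = 1

# Weighted sum for each hidden neuron (weights rounded, biases are effectively 0)
hidden0Sum = (input0*0.66)+(input1*-0.66)+(0)
hidden1Sum = (input0*-0.95)+(input1*0.95)+(0)

print(hidden0Sum) # -0.66
print(hidden1Sum) # 0.95

# ReLU activation: negative sums become 0
hidden0 = max(0,hidden0Sum)
hidden1 = max(0,hidden1Sum)

print(hidden0) # 0
print(hidden1) # 0.95

# Weighted sum for the output neuron
outputSum = (hidden0*1.5)+(hidden1*1.0)+(0)
print(outputSum) # 0.95

output = max(0,outputSum)

print(f"Final output: {output}") # 0.95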

-0.66
0.95
0
0.95
0.95
Final output: 0.95
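The result of 0.95 rather than 1.0 comes from rounding the weights to two digits. As a sketch, the same calculation can be repeated for all four XOR inputs with the full-precision values from the listing above. The names `relu`, `W_HIDDEN`, and `predict` are introduced here only for illustration.

```python
def relu(x):
    return max(0.0, x)

# Full-precision values copied from the weight listing above
B_HIDDEN = [1.3025760914331386e-08, -1.4192625741316078e-08]
W_HIDDEN = {  # (input index, hidden neuron index) -> weight
    (0, 0): 0.659289538860321,
    (0, 1): -0.9533336758613586,
    (1, 0): -0.659289538860321,
    (1, 1): 0.9533336758613586,
}
B_OUT = -1.9757269598130733e-08
W_OUT = [1.5167843103408813, 1.0489506721496582]

def predict(inputs):
    # Hidden layer: weighted sum plus bias, then ReLU
    hidden = [
        relu(sum(inputs[i] * W_HIDDEN[(i, j)] for i in range(2)) + B_HIDDEN[j])
        for j in range(2)
    ]
    # Output neuron: weighted sum plus bias, then ReLU as in the cell above
    output_sum = sum(h * w for h, w in zip(hidden, W_OUT)) + B_OUT
    return relu(output_sum)

for inputs in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(f"{inputs} -> {predict(inputs):.6f}")
```

With the unrounded weights, the outputs for [0,1] and [1,0] are much closer to the target of 1, and the outputs for [0,0] and [1,1] stay at essentially 0.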
