# Part 3.5: Extracting Weights and Manual Network Calculation

### Weight Initialization

The weights of a neural network determine the output for the neural network.  The process of training can adjust these weights so the neural network produces useful output.  Most neural network training algorithms begin by initializing the weights to a random state.  Training then progresses through a series of iterations that continuously improve the weights to produce better output.

The random weights of a neural network impact how well that neural network can be trained.  If a neural network fails to train, you can remedy the problem by simply restarting with a new set of random weights. However, this solution can be frustrating when you are experimenting with the architecture of a neural network and trying different combinations of hidden layers and neurons.  If you add a new layer, and the network’s performance improves, you must ask yourself if this improvement resulted from the new layer or from a new set of weights.  Because of this uncertainty, we look for two key attributes in a weight initialization algorithm:

* How consistently does this algorithm provide good weights?
* How much of an advantage do the weights of the algorithm provide?

One of the most common, yet least effective, approaches to weight initialization is to set the weights to random values within a specific range.  Numbers between -1 and +1 or -5 and +5 are often the choice.  If you want to ensure that you get the same set of random weights each time, you should use a seed.  The seed fixes the starting state of the random number generator, so the same seed always yields the same sequence of weights.  For example, a seed of 1000 might produce random weights of 0.5, 0.75, and 0.2. These values still behave like random numbers, and you cannot easily predict them in advance, yet you will always get these values when you choose a seed of 1000.

Not all seeds are created equal.  One problem with random weight initialization is that the random weights created by some seeds are much more difficult to train than others.  In fact, the weights can be so bad that training is impossible.  If you find that you cannot train a neural network with a particular weight set, you should generate a new set of weights using a different seed.
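
The effect of a seed can be sketched with NumPy's random generator. This is a minimal illustration rather than the course's own code: the seed value 1000 echoes the example above, while the helper name `random_weights` and the range of -1 to +1 are assumptions chosen for the demonstration.

```python
import numpy as np

def random_weights(seed, count=3, low=-1.0, high=1.0):
    """Return `count` uniform random weights in [low, high) for a given seed.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    return rng.uniform(low, high, size=count)

a = random_weights(1000)
b = random_weights(1000)   # same seed: identical weights
c = random_weights(1001)   # different seed: a different set of weights

print(np.array_equal(a, b))
print(np.array_equal(a, c))
```

Trying a different seed, as in the last call, is the programmatic equivalent of "generating a new set of weights" when a network refuses to train.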

Because weight initialization is a problem, there has been considerable research around it.  In this course we use the Xavier weight initialization algorithm, introduced in 2010 by Glorot & Bengio[[Cite:glorot2010understanding]](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf), which produces good weights with reasonable consistency.  This relatively simple algorithm uses normally distributed random numbers.  

To use the Xavier weight initialization, it is necessary to understand that normally distributed random numbers are not the typical uniformly distributed random numbers between 0 and 1 that most programming languages generate.  Instead, normally distributed random numbers are centered on a mean ($\mu$, mu) that is typically 0.  If 0 is the center (mean), then you will get roughly equal numbers of random values above and below 0.  The next question is how far these random numbers will venture from 0.  In theory, you could end up with both positive and negative numbers close to the maximum positive and negative ranges supported by your computer.  In practice, nearly all of the random numbers will fall within three standard deviations of the center.

The standard deviation ($\sigma$, sigma) parameter specifies how widely the random numbers spread around the mean.  For example, if you specified a standard deviation of 10, then you would mainly see random numbers between -30 and +30, and the numbers nearer to 0 have a much higher probability of being selected.  

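The above figure illustrates that values near the center, which in this case is 0, are the most likely to be generated.  For a standard normal distribution (a standard deviation of 1), the probability density peaks at about 0.4 at the center; this is a density, not a 40% chance of generating exactly 0.  Additionally, the density decreases very quickly beyond -2 or +2 standard deviations. By defining the center and how large the standard deviations are, you are able to control the range of random numbers that you will receive.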
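
These properties of the normal distribution can be checked numerically. The sketch below is only an illustration: the seed, the sample count, and the helper name `normal_pdf` are arbitrary choices, and the standard deviation of 10 echoes the example in the text.

```python
import math
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for repeatability
sigma = 10.0
samples = rng.normal(loc=0.0, scale=sigma, size=100_000)

# Fraction of samples within three standard deviations of the mean
within_3_sigma = np.mean(np.abs(samples) <= 3 * sigma)

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

print(within_3_sigma)    # close to 0.997
print(normal_pdf(0.0))   # about 0.3989, the peak of the standard normal curve
print(normal_pdf(2.0))   # about 0.054, much lower two standard deviations out
```

The peak value of roughly 0.4 applies only when the standard deviation is 1; a wider distribution has a lower, flatter peak.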

The Xavier weight initialization sets all of the weights to normally distributed random numbers.  These weights are always centered at 0; however, their standard deviation varies depending on how many connections are present for the current layer of weights.  Specifically, Equation 4.2 gives the variance of the weights, from which the standard deviation follows:

$ Var(W) = \frac{2}{n_{in}+n_{out}} $

The above equation shows how to obtain the variance for all of the weights in a layer, where $n_{in}$ is the number of inputs to the layer and $n_{out}$ is the number of outputs.  The square root of the variance is the standard deviation.  Most random number generators accept a standard deviation rather than a variance.  As a result, you usually need to take the square root of the above equation.  Figure 3.XAVIER shows how one layer might be initialized. 

**Figure 3.XAVIER: Xavier Weight Initialization**
<img src="xavier_weight.png"  style="width: 150px;"/>

This process is completed for each layer in the neural network.  
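
The per-layer procedure can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation Keras uses: the function name `xavier_normal`, the seed, and the layer sizes are hypothetical choices for illustration.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Sample an (n_in, n_out) weight matrix with Var(W) = 2 / (n_in + n_out)."""
    variance = 2.0 / (n_in + n_out)
    std = np.sqrt(variance)  # the generator takes a standard deviation
    return rng.normal(loc=0.0, scale=std, size=(n_in, n_out))

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability

# Layer sizes for a hypothetical 2-input, 2-hidden-neuron, 1-output network
layer_sizes = [2, 2, 1]
weights = [xavier_normal(n_in, n_out, rng)
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

for w in weights:
    print(w.shape)

# With a large layer, the sample standard deviation approaches the target
big = xavier_normal(300, 500, rng)
print(big.std(), np.sqrt(2.0 / 800))
```

Each layer gets its own standard deviation because $n_{in}$ and $n_{out}$ differ from layer to layer.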



## Summary

The weights determine the output of the neural network. What we actually train are the weights; here,
'train' means adjusting the weights.

To start training, we initialize the network with random weights. Some sets of initial weights are
better starting points than others.

The initial weights influence the result of training. If a neural network fails
to train, we can restart it with different starting weights.

When you are improving your neural network, for instance by adding hidden layers or neurons, you cannot
tell whether an improvement comes from the new architecture or from the new initial weights.

This tells us that we need a good weight initialization algorithm. We look for two key attributes in a weight initialization algorithm:

* How consistently does this algorithm provide good weights?
* How much of an advantage do the weights of the algorithm provide?

One of the most common, yet least effective, approaches to weight initialization is to set the weights to random values within a specific range.  Numbers between -1 and +1 or -5 and +5 are often the choice.  If you want to ensure that you get the same set of random weights each time, you should use a seed. 

Not all seeds are created equal.  One problem with random weight initialization is that the random weights created by some seeds are much more difficult to train than others.  In fact, the weights can be so bad that training is impossible.  If you find that you cannot train a neural network with a particular weight set, you should generate a new set of weights using a different seed.

Because weight initialization is a problem, there has been considerable research around it.  In this course we use the Xavier weight initialization algorithm, introduced in 2010 by Glorot & Bengio[[Cite:glorot2010understanding]](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf), which produces good weights with reasonable consistency.  This relatively simple algorithm uses normally distributed random numbers.



# Manual Neural Network Calculation

This example uses a simple neural network that learns the XOR (exclusive or) function. For simplicity, we use Keras to train the network for us. The neural network is small: two inputs, two hidden neurons, and a single output.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy as np

# Create a dataset for the XOR function
x = np.array([
    [0, 0],
    [1, 0],
    [0, 1],
    [1, 1]
])  # predictors

y = np.array([
    0,
    1,
    1,
    0
])  # dependent variable

# Build the network
# sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)

done = False
cycle = 1

while not done:
    print("Cycle #{}".format(cycle))
    cycle += 1

    # A fresh model each cycle, so every attempt starts from new random weights
    model = Sequential()
    model.add(Dense(2, input_dim=2, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    model.fit(x, y, verbose=0, epochs=10000)

    # Predict
    pred = model.predict(x)

    # Check if successful.  It takes several runs with this
    # small of a network
    done = pred[0] < 0.01 and pred[3] < 0.01 and pred[1] > 0.9 \
        and pred[2] > 0.9
    print(pred)
```

```text
Cycle #1
[[0.49999997]
 [0.49999997]
 [0.49999997]
 [0.49999997]]
Cycle #2
[[0.49999997]
 [0.49999997]
 [0.49999997]
 [0.49999997]]
Cycle #3
[[4.9999988e-01]
 [1.0000000e+00]
 [4.9999988e-01]
 [1.1920929e-07]]
Cycle #4
[[0.3333333 ]
 [0.99999994]
 [0.3333333 ]
 [0.3333333 ]]
Cycle #5
[[4.9999985e-01]
 [1.0000000e+00]
 [4.9999985e-01]
 [1.1920929e-07]]
Cycle #6
[[9.4441020e-08]
 [9.9999988e-01]
 [9.9999988e-01]
 [1.6084614e-07]]
```


```python
# Dump weights
for layerNum, layer in enumerate(model.layers):
    # A Dense layer returns [kernel, bias]; the kernel is indexed [from][to]
    weights, biases = layer.get_weights()

    for toNeuronNum, bias in enumerate(biases):
        print(f'{layerNum}B -> L{layerNum+1}N{toNeuronNum}: {bias}')

    for fromNeuronNum, wgt in enumerate(weights):
        for toNeuronNum, wgt2 in enumerate(wgt):
            # Two adjacent f-strings avoid embedding the line-continuation
            # indentation in the printed label
            print(f'L{layerNum}N{fromNeuronNum} '
                  f'-> L{layerNum+1}N{toNeuronNum} = {wgt2}')
```

```text
0B -> L1N0: -1.2041040658950806
0B -> L1N1: 7.557218140163968e-08
L0N0 -> L1N0 = 1.2041043043136597
L0N0 -> L1N1 = 1.4311785697937012
L0N1 -> L1N0 = 1.2041040658950806
L0N1 -> L1N1 = 1.4311779737472534
1B -> L2N0: 4.163685574098963e-08
L1N0 -> L2N0 = -1.6609855890274048
L1N1 -> L2N0 = 0.6987249255180359
```

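```python
# Manually compute the network output for the input (0, 1).
# Note: the weights below are rounded, illustrative values. They do not
# exactly match the weights dumped above, so the result only approximates
# the expected XOR output of 1.
input0 = 0
input1 = 1

# Weighted sum for each hidden neuron: inputs times weights, plus bias
hidden0Sum = (input0*1.3)+(input1*1.3)+(-1.3)
hidden1Sum = (input0*1.2)+(input1*1.2)+(0)

print(hidden0Sum) # 0
print(hidden1Sum) # 1.2

# ReLU activation on the hidden layer
hidden0 = max(0,hidden0Sum)
hidden1 = max(0,hidden1Sum)

print(hidden0) # 0
print(hidden1) # 1.2

outputSum = (hidden0*-1.6)+(hidden1*0.8)+(0)
print(outputSum) # 0.96

# The output layer, Dense(1), has no activation function,
# so the output is the weighted sum itself
output = outputSum

print(output) # 0.96
```

```text
0.0
1.2
0
1.2
0.96
0.96
```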
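
The same calculation can be written as a vectorized forward pass over all four XOR inputs. The sketch below copies the weights from the dump above and assumes the kernel layout used by that dump (rows are "from" neurons, columns are "to" neurons); the names `W1`, `b1`, `W2`, `b2`, and `forward` are introduced here for illustration only.

```python
import numpy as np

# Weights copied from the dump; rows are "from" neurons, columns "to" neurons
W1 = np.array([[1.2041043043136597, 1.4311785697937012],
               [1.2041040658950806, 1.4311779737472534]])
b1 = np.array([-1.2041040658950806, 7.557218140163968e-08])
W2 = np.array([[-1.6609855890274048],
               [0.6987249255180359]])
b2 = np.array([4.163685574098963e-08])

def forward(x):
    """Forward pass: ReLU hidden layer followed by a linear output layer."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

x = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
outputs = forward(x).flatten()
print(np.round(outputs, 4))
```

With the full-precision weights, the four outputs land very close to 0, 1, 1, and 0, in line with the predictions from the final training cycle.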
