The following code implements a neural network with a hidden layer.
For a simpler neural network, see: http://iamtrask.github.io/2015/07/12/basic-python-network/

In [1]:
#dependencies (matrix math) 
import numpy as np

In [2]:
# input data: four samples; the third column is always 1
input_data = np.array([[0, 0, 1],
                       [0, 1, 1],
                       [1, 0, 1],
                       [1, 1, 1]])

# expected output for each sample
output_labels = np.array([[0],
                          [1],
                          [1],
                          [0]])

print(input_data)
print(output_labels)

[[0 0 1]
 [0 1 1]
 [1 0 1]
 [1 1 1]]
[[0]
 [1]
 [1]
 [0]]


In [3]:
# sigmoid activation; with deriv=True, x must already be a sigmoid output
def activate(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))
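
The `deriv=True` branch computes `x*(1-x)`, which equals the sigmoid's slope only when `x` is already a sigmoid output. A minimal sketch of that convention, checked against a finite-difference estimate (the sample points `z` and the step `eps` are arbitrary choices made for this illustration):

```python
import numpy as np

def activate(x, deriv=False):
    # Same convention as the notebook: with deriv=True, x must already be sigmoid(z)
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

z = np.array([-2.0, 0.0, 3.0])       # arbitrary pre-activation values
s = activate(z)                      # sigmoid(z)
analytic = activate(s, deriv=True)   # slope of the sigmoid at z, computed from s

# Central finite difference as an independent estimate of the slope
eps = 1e-6
numeric = (activate(z + eps) - activate(z - eps)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-8))
```

This is why the training loop passes `layer1` and `layer2` (activations) to `activate(..., deriv=True)` rather than the raw dot products.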

In [4]:
# two weight matrices with values in [-1, 1): input -> hidden (3x4) and hidden -> output (4x1)
synaptic_weight_0 = 2*np.random.random((3,4)) - 1
synaptic_weight_1 = 2*np.random.random((4,1)) - 1

print(synaptic_weight_0)
print(synaptic_weight_1)

[[ 0.54969199 -0.7388136   0.65276113  0.27332282]
 [-0.95528218 -0.04012118 -0.14603297  0.85252821]
 [-0.00283674 -0.94953697  0.12963559 -0.34949967]]
[[-0.81678388]
 [ 0.90764242]
 [ 0.14785876]
 [-0.53525641]]


### Optimization
We want to find the parameter values that give our model the smallest error.
Searching the model's parameter space for values that score better is called optimization.
The strategy used here is gradient descent; in a neural network, the gradients it needs are computed with backpropagation.
How does it work?
- Initialize the weights randomly
- Calculate the error using an error function (e.g. MSE)
- Use that error to compute the partial derivative with respect to each weight; together these form the gradient
- Use the gradient to move the weights slightly in the direction that reduces the error
- Repeat these steps until the weights converge to values that minimize the error
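
The steps above can be sketched on a much smaller problem than the network. The example below fits a line `w*x + b` by gradient descent on the MSE; the data, the true values `3.0` and `0.5`, the seed, and `learning_rate` are all assumptions made for this illustration and do not come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, chosen only for repeatability
x = np.linspace(-1, 1, 50)
y = 3.0 * x + 0.5                         # noiseless targets from a known line

w, b = rng.uniform(-1, 1, size=2)         # step 1: initialize the weights randomly
learning_rate = 0.1                       # hypothetical step size

for step in range(2000):
    error = (w * x + b) - y
    mse = np.mean(error ** 2)             # step 2: calculate the error (MSE)
    grad_w = 2 * np.mean(error * x)       # step 3: partial derivatives of the MSE
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w           # step 4: move against the gradient
    b -= learning_rate * grad_b           # step 5: the loop repeats until convergence

print(w, b, mse)
```

The network below follows the same pattern, except that the partial derivatives are obtained layer by layer.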

![optimization.png](images/optimization.png)

![optimization3.png](images/optimization3.png)

In [24]:
for j in range(60000):

    # Forward propagate through layers 0, 1, and 2
    layer0 = input_data
    layer1 = activate(np.dot(layer0, synaptic_weight_0))
    layer2 = activate(np.dot(layer1, synaptic_weight_1))

    # Output error: how far the prediction is from the labels
    layer2_error = output_labels - layer2

    # Report the mean absolute error every 10000 iterations
    if (j % 10000) == 0:
        print("Error:" + str(np.mean(np.abs(layer2_error))))

    # Scale the error by the sigmoid slope at layer 2 (the output-layer gradient term)
    layer2_gradient = layer2_error * activate(layer2, deriv=True)

    # Propagate that term back through synaptic_weight_1 to get the layer 1 error
    layer1_error = layer2_gradient.dot(synaptic_weight_1.T)

    # Scale by the sigmoid slope at layer 1
    layer1_gradient = layer1_error * activate(layer1, deriv=True)

    # Update the weights; the error is (label - prediction), so adding these terms reduces it
    synaptic_weight_1 += layer1.T.dot(layer2_gradient)
    synaptic_weight_0 += layer0.T.dot(layer1_gradient)

Error:0.0037511378445238145
Error:0.0034609556160009073
Error:0.003228404626190851
Error:0.0030366894373669443
Error:0.002875144132431823
Error:0.0027366293755244644
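
One way to check that the weight updates in the loop really follow the gradient is to compare them with a numerical estimate. The sketch below assumes the loss is half the sum of squared errors (the notebook does not state its loss explicitly) and uses a fixed seed; `loss`, `numeric_grad`, and the shortened variable names are hypothetical helpers for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w0, w1, x, y):
    # Assumed loss: half the sum of squared output errors
    l1 = sigmoid(x @ w0)
    l2 = sigmoid(l1 @ w1)
    return 0.5 * np.sum((y - l2) ** 2)

rng = np.random.default_rng(1)            # fixed seed, an assumption for repeatability
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
w0 = 2 * rng.random((3, 4)) - 1
w1 = 2 * rng.random((4, 1)) - 1

# The same quantities the training loop computes in one iteration
l1 = sigmoid(x @ w0)
l2 = sigmoid(l1 @ w1)
l2_delta = (y - l2) * l2 * (1 - l2)
l1_delta = (l2_delta @ w1.T) * l1 * (1 - l1)
step_w1 = l1.T @ l2_delta                 # what the loop adds to synaptic_weight_1
step_w0 = x.T @ l1_delta                  # what the loop adds to synaptic_weight_0

def numeric_grad(f, w, eps=1e-6):
    # Central finite differences, perturbing one entry of w at a time
    grad = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        old = w[idx]
        w[idx] = old + eps
        up = f()
        w[idx] = old - eps
        down = f()
        w[idx] = old
        grad[idx] = (up - down) / (2 * eps)
    return grad

grad_w0 = numeric_grad(lambda: loss(w0, w1, x, y), w0)
grad_w1 = numeric_grad(lambda: loss(w0, w1, x, y), w1)

# Each update should equal the negative gradient of the assumed loss
print(np.allclose(step_w0, -grad_w0, atol=1e-6))
print(np.allclose(step_w1, -grad_w1, atol=1e-6))
```

Under that assumed loss, each `+=` in the loop is a gradient descent step with a learning rate of 1.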


In [27]:
# testing: hidden-layer activations for the input [0, 1, 1] (first layer only, not the final prediction)
print(activate(np.dot([0, 1, 1], synaptic_weight_0)))

[0.86024    0.96566549 0.00311442 0.99991361]
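
The four values above are the hidden-layer activations. To get the network's prediction, the sample has to pass through both weight matrices. A minimal sketch follows, with a hypothetical `predict` helper; it retrains from scratch with a fixed seed and fewer iterations so that it runs on its own, so its weights will differ from the ones printed earlier.

```python
import numpy as np

def activate(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

def predict(sample, w0, w1):
    # Hypothetical helper: full forward pass through both layers
    hidden = activate(np.dot(sample, w0))
    return activate(np.dot(hidden, w1))

input_data = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
output_labels = np.array([[0], [1], [1], [0]])

np.random.seed(1)                          # assumption: fixed seed, not used in the notebook
synaptic_weight_0 = 2 * np.random.random((3, 4)) - 1
synaptic_weight_1 = 2 * np.random.random((4, 1)) - 1

for j in range(20000):                     # shorter run than the notebook's 60000
    layer1 = activate(np.dot(input_data, synaptic_weight_0))
    layer2 = activate(np.dot(layer1, synaptic_weight_1))
    layer2_gradient = (output_labels - layer2) * activate(layer2, deriv=True)
    layer1_gradient = layer2_gradient.dot(synaptic_weight_1.T) * activate(layer1, deriv=True)
    synaptic_weight_1 += layer1.T.dot(layer2_gradient)
    synaptic_weight_0 += input_data.T.dot(layer1_gradient)

# The label for [0, 1, 1] is 1, so the prediction should be close to 1
prediction = predict(np.array([0, 1, 1]), synaptic_weight_0, synaptic_weight_1)
print(prediction)
```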


##### Which optimizer would we use?

![adam.png](images/adam.png)

With Adam, we keep running estimates of the first moment of the gradient (its mean) and of the second moment (its uncentered variance).
We then use both estimates to scale the update applied to each weight (parameter).

![adam2.png](images/adam2.png)
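
The update described above can be sketched in NumPy as follows. This is an illustrative version of the commonly published Adam rule, not code from the notebook: `adam_step` is a hypothetical helper, the default hyperparameters are the usual suggested values, and the toy objective `sum(p**2)` was picked only to keep the example short.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running estimate of the first moment (mean of the gradient)
    m = beta1 * m + (1 - beta1) * grad
    # Running estimate of the second moment (mean of the squared gradient)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction, since m and v start at zero
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Step scaled per parameter by the second-moment estimate
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize f(p) = sum(p**2), whose gradient is 2*p
p = np.array([1.0, -2.0])
m = np.zeros_like(p)
v = np.zeros_like(p)
for t in range(1, 5001):                   # t starts at 1 for the bias correction
    grad = 2 * p
    p, m, v = adam_step(p, grad, m, v, t, lr=0.01)

print(p)
```

To use this in the network above, `grad` would be the negative of the terms the training loop adds to each weight matrix, with one `m` and one `v` array kept per matrix.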