### 1. Introduction to Neural Predictions

### Deep Learning vs Machine Learning

ML: Statistical models that enable computers to learn and make predictions or decisions without being explicitly programmed.

DL: A subset of machine learning that uses neural networks with multiple layers to analyze complex patterns and relationships in data.

AI: The use of predictions and automation to optimize and solve complex tasks that humans have historically done, such as facial and speech recognition, decision making and translation.
![](pynb_pics/AIMLEllipses.jpg)<br>

**ANN**: Artificial Neural Networks - What we see today in Deep learning models<br>
**SNN**: Spiking Neural Networks - NEW - The model can keep training on new information while it is in operation.


In [1]:
# Please make sure you have these libraries installed
# !pip install --no-index numpy
# !pip install --no-index torch
# !pip install --no-index matplotlib
# !pip install --no-index scikit-learn
# !pip install --no-index pandas
import numpy as np


In [2]:
# Let's make a single prediction

# This is our dataset.  For each time the team won, we averaged the number of toes (some players had one or more broken toes)
number_of_toes = [8.5, 9.5, 10, 9]

# In this example, the weight scales the number of toes into a score that we read as the probability of the team winning
weight = 0.1

def neural_network(input, weight):
    prediction = input*weight
    return prediction

input = number_of_toes[0] # 8.5 from the dataset, will this team win?
pred = neural_network(input, weight)

print(round(pred, 2))

0.85


Given an input average of 8.5 toes, the network predicts an 85% chance that the team will win. This step can be visualized as a network of 2 neurons: an input multiplied by a given weight to produce an output.

![](pynb_pics/single_in.jpg)

In [3]:
# Multiple inputs variables and single output network

toes = [8.5, 9.5, 10, 9] # dataset 1 number of average toes for the team
wlrec = [0.65, 0.8, 0.8, 0.9] # dataset 2 - win / loss ratio
nfans = [1200, 1300, 500, 1000] # dataset 3 - number of fans

weights = [0.1, 0.2, 0]

def w_sum(a,b):  # Calculates the weighted sum of the inputs and weights
    assert(len(a) == len(b))
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

def neural_network(input, weights):  # Parameter renamed so the body uses the argument, not the global
    prediction = w_sum(input, weights)
    return prediction

input = [toes[0],wlrec[0],nfans[0]] # 3 inputs to predict win: 8.5 toes, 0.65 win/loss, 1200 fans
pred = neural_network(input, weights)

print(round(pred, 3))

0.98


Given 8.5 toes, a 0.65 win/loss ratio and 1200 fans, the network predicts a 98% chance that the team will win the match. This step can be visualized as a network of multiple neurons: 3 inputs multiplied by their respective weights and summed to produce an output.

![](pynb_pics/mult_inp.jpg)

In [4]:
# Single input and multiple outputs: scenario 1, scenario 2, scenario 3

wlrec = [0.65, 0.8, 0.8, 0.9] # dataset - win / loss ratio

weights = [0.3, 0.2, 0.9] # one weight per output scenario

def ele_mul(number, vector):
    output = [0, 0, 0]
    assert(len(output) == len(vector))
    for i in range(len(vector)):
        output[i] = number * vector[i]
    return output

def neural_network(input, weights):
    pred = ele_mul(input, weights)
    return pred

input = wlrec[0]
pred = neural_network(input, weights)

print(np.round(pred, 3))

[0.195 0.13  0.585]


Since we are assigning different weights to a single input, we interpret the outputs as possible scenarios. This step can be visualized as a network of multiple neurons: 1 input multiplied by each of 3 weights to produce 3 possible outputs.

![](pynb_pics/mult_out.jpg)

In [5]:
# Combine multiple inputs and multiple outputs: hurt, win, sad predictions

toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1200, 1300, 500, 1000]

weights = [ [0.1, 0.1, -0.3],
            [0.1, 0.2, 0.0],
            [0.0, 1.3, 0.1] ]

def w_sum(a,b):
    assert(len(a) == len(b))
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

def vect_mat_mul(vect, matrix):
    output = [0] * len(vect)
    for i in range(len(vect)):
        output[i] = w_sum(vect, matrix[i])
        if output[i] < 0:
            output[i] = 0 # Negative values have no meaning in this case, so set them to zero
    return output # Predictions

def neural_network(input, weights):
    pred = vect_mat_mul(input,weights)
    return pred

input = [toes[0],wlrec[0],nfans[0]]
pred = neural_network(input, weights)

print(np.round(pred, 2))

[  0.     0.98 120.84]


Given a team with 8.5 toes, a 0.65 win/loss ratio and 1200 fans, the network predicts no hurt players, a 98% chance of winning and a value of about 120 for sad. This step can be visualized as a network of multiple neurons: 3 inputs multiplied by their respective weights and summed to produce 3 outputs.

![](pynb_pics/mult_inout.jpg)

### Predicting on predictions

In [6]:
# Neural networks can be stacked. A network can take the output of one network and feed it as input to another network.
# This results in 2 consecutive vector-matrix multiplications (used later for image classification)

           #toes %win #fans
ih_wgt = [ [0.1, 0.2, -0.1],#hid[0]
           [-0.1,0.1, 0.9], #hid[1]
           [0.1, 0.4, 0.1] ]#hid[2]

        # hid[0] hid[1] hid[2]
hp_wgt = [ [0.3, 1.1, -0.3],#hurt?
           [0.1, 0.2, 0.0], #win?
           [0.0, 1.3, 0.1] ]#sad?
weights = [ih_wgt, hp_wgt]

def w_sum(a,b):
    assert(len(a) == len(b))
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

def vect_mat_mul(vect, matrix):
    output = [0, 0, 0]
    for i in range(len(vect)):
        output[i] = w_sum(vect, matrix[i])
    return output

def neural_network(input, weights):
    hid = vect_mat_mul(input, weights[0]) # Call prediction with hidden weights
    pred = vect_mat_mul(hid, weights[1]) # hid is input for final prediction
    return pred

toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65,0.8, 0.8, 0.9]
nfans = [1200, 1300, 500, 1000]

input = [toes[0], wlrec[0], nfans[0]]
pred = neural_network(input, weights)

print(np.round(pred, 2))

[1115.1   203.94 1415.09]


Now, given a team with 8.5 toes, a 0.65 win/loss ratio and 1200 fans, and a random set of weights, the outputs no longer make sense. There are errors in this model, and as the calculations propagate from left to right, those errors grow through the depth of the neural network.

![](pynb_pics/mult_inout_stack.jpg)

For the neural network to be useful, we need to minimize the errors that propagate through it. This starts with the notion of gradient descent.

In [7]:
# Refactoring the code from above using NumPy functions makes it more readable

import numpy as np

# Datasets
toes = np.array([8.5, 9.5, 9.9, 9.0])
wlrec = np.array([0.65,0.8, 0.8, 0.9])
nfans = np.array([1200, 1300, 500, 1000])

            #toes %win #fans
ih_wgt = np.array(              # Input to hiddens weight matrix
         [ [0.1, 0.2, -0.1],    #hid[0]
           [-0.1,0.1, 0.9],     #hid[1]
           [0.1, 0.4, 0.1] ]).T #hid[2]

        # hid[0] hid[1] hid[2] 
hp_wgt = np.array(              # hiddens to predictions weight matrix
         [ [0.3, 1.1, -0.3],    #hurt?
           [0.1, 0.2, 0.0],     #win?
           [0.0, 1.3, 0.1] ]).T #sad?

weights = [ih_wgt, hp_wgt]

def neural_network(input, weights):
    hid = input.dot(weights[0])
    pred = hid.dot(weights[1])
    return pred

input = np.array([toes[0], wlrec[0], nfans[0]])
pred = neural_network(input, weights)

print(np.round(pred, 2))

[1115.1   203.94 1415.09]


### 2. Introduction to neural learning

#### 2.1 Gradient descent

#### Introduction
Gradient descent is an optimization algorithm commonly used to train machine learning models and neural networks. It trains them by minimizing the error between predicted and actual results.

Training data helps these models learn over time, and the cost function (errors at each iteration) within gradient descent specifically acts as a barometer, gauging its accuracy with each iteration of parameter updates. Until the function is close to or equal to zero, the model will continue to adjust its parameters to yield the smallest possible error. Once machine learning models are optimized for accuracy, they can be powerful tools for artificial intelligence (AI) and computer science applications.

#### How does gradient descent work?
Before we dive into gradient descent, it may help to review some concepts from linear regression. You may recall the equation of a line, y = mx + b, where m represents the slope and b is the intercept on the y-axis.

You may also recall plotting a scatterplot in statistics and finding the line of best fit, which required calculating the error between the actual output and the predicted output (y-hat) using the mean squared error formula. The gradient descent algorithm behaves similarly, but it is based on a convex function.
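
As a small illustration of the line-of-best-fit idea above, the sketch below computes the mean squared error of two candidate lines against a handful of points. The data points and the helper name `mean_squared_error` are made up for this example and are not part of the original notebook.

```python
import numpy as np

def mean_squared_error(m, b, x, y):
    """Average squared difference between the line's predictions and the actual y values."""
    y_hat = m * x + b                  # predicted outputs (y-hat) for the line y = mx + b
    return np.mean((y - y_hat) ** 2)

# Hypothetical data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

good_fit = mean_squared_error(2.0, 1.0, x, y)   # the line the data was generated from
poor_fit = mean_squared_error(1.0, 0.0, x, y)   # a different slope and intercept

print("MSE of y = 2x + 1:", good_fit)
print("MSE of y = x:", poor_fit)
```

The line with the lower mean squared error is the better fit. Gradient descent searches for the parameters (here m and b) that make this value as small as possible.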

The starting point is just an arbitrary point for us to evaluate the performance. From that starting point, we will find the derivative (or slope), and from there, we can use a tangent line to observe the steepness of the slope. The slope will inform the updates to the parameters—i.e. the weights and bias. The slope at the starting point will be steeper, but as new parameters are generated, the steepness should gradually reduce until it reaches the lowest point on the curve, known as the point of convergence.   

Similar to finding the line of best fit in linear regression, the goal of gradient descent is to minimize the cost function, or the error between predicted and actual y. To do this, it needs two things: a direction and a learning rate. These factors determine the partial derivative calculations of future iterations, allowing it to gradually arrive at the local or global minimum (i.e. point of convergence).

Learning rate (also referred to as step size or alpha) is the size of the steps that are taken to reach the minimum. This is typically a small value, and it is evaluated and updated based on the behavior of the cost function. High learning rates result in larger steps but risk overshooting the minimum. Conversely, a low learning rate takes small steps. While this has the advantage of more precision, the larger number of iterations compromises overall efficiency, as it takes more time and computation to reach the minimum.

The cost (or loss) function measures the difference, or error, between actual y and predicted y at its current position. This improves the machine learning model's efficacy by providing feedback to the model so that it can adjust the parameters to minimize the error and find the local or global minimum. It continuously iterates, moving along the direction of steepest descent (or the negative gradient) until the cost function is close to or at zero. At this point, the model will stop learning. Additionally, while the terms cost function and loss function are often used interchangeably, there is a slight difference between them: a loss function refers to the error of one training example, while a cost function calculates the average error across an entire training set.
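
To make the learning-rate trade-off concrete, here is a minimal sketch that runs the same one-weight problem with three step sizes. The function name `train`, its default values and the three alphas are assumptions picked for illustration.

```python
def train(alpha, steps=20, weight=0.5, x=0.5, goal=0.8):
    """Run gradient descent on a single weight and return the final squared error."""
    for _ in range(steps):
        prediction = x * weight
        gradient = (prediction - goal) * x   # slope of the squared error w.r.t. the weight (factor of 2 dropped)
        weight = weight - alpha * gradient   # step against the slope, scaled by the learning rate
    return (x * weight - goal) ** 2

for alpha in (0.01, 1.0, 10.0):
    print("alpha:", alpha, "final error:", train(alpha))
```

With the smallest alpha the error has barely moved after 20 steps, with alpha = 1.0 it is close to zero, and with alpha = 10.0 each step overshoots the minimum by more than the previous one, so the error grows instead of shrinking.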

Let's have a look at what this means by going back to our 2 neurons example: INPUT -> WEIGHT -> OUTPUT

In [8]:
# Brief explanation of gradient descent
weight = 0.5          # Starting value.  Our initial guess.
input = 0.5
goal_prediction = 0.8 # The target output, i.e. what we know to be true
step_amount = 10      # Not used below: the update in the loop takes the full step (equivalent to alpha = 1)

def neural_network(input, weight):
  prediction = input * weight
  return prediction

for iteration in range(20):
  prediction = neural_network(input, weight)
  error = (prediction - goal_prediction) ** 2
  direction_and_amount = (prediction - goal_prediction) * input
  weight = weight - direction_and_amount
  print("Step:" + str(iteration+1) )
  print("Error:" + str(error) + " Prediction:" + str(prediction) )
  print("Weight:" + str(weight) )

Step:1
Error:0.30250000000000005 Prediction:0.25
Weight:0.775
Step:2
Error:0.17015625000000004 Prediction:0.3875
Weight:0.9812500000000001
Step:3
Error:0.095712890625 Prediction:0.49062500000000003
Weight:1.1359375
Step:4
Error:0.05383850097656251 Prediction:0.56796875
Weight:1.251953125
Step:5
Error:0.03028415679931642 Prediction:0.6259765625
Weight:1.33896484375
Step:6
Error:0.0170348381996155 Prediction:0.669482421875
Weight:1.4042236328125
Step:7
Error:0.00958209648728372 Prediction:0.70211181640625
Weight:1.453167724609375
Step:8
Error:0.005389929274097089 Prediction:0.7265838623046875
Weight:1.4898757934570312
Step:9
Error:0.0030318352166796153 Prediction:0.7449378967285156
Weight:1.5174068450927733
Step:10
Error:0.0017054073093822882 Prediction:0.7587034225463867
Weight:1.53805513381958
Step:11
Error:0.0009592916115275371 Prediction:0.76902756690979
Weight:1.553541350364685
Step:12
Error:0.0005396015314842384 Prediction:0.7767706751823426
Weight:1.5651560127735138
Step:13
Error:

With each step, the prediction moves closer to the 0.8 goal.
A single line of code calculates both the direction and the amount by which to change the weight to reduce the error.
This form of learning increases or decreases the weight by a small amount at each step to get as close to zero error as possible. Note that the weight needed to achieve goal_prediction=0.8 approaches 1.6 (0.8 / 0.5), compared to the initial value of 0.5.

#### 2.2 Direction and amount - Explanation

On the line **direction_and_amount = (prediction - goal_prediction) * input**, the term **(prediction - goal_prediction)** is the pure error: the raw direction and amount by which we missed. Let's call it the offset.<br>
If the offset is positive, we predicted too high.<br>
If the offset is negative, we predicted too low.<br>
The offset is multiplied by the input, and the weight is adjusted by subtracting the result.
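
The same line can be read as a derivative. A short derivation, assuming the squared error used in the code above and writing $x$ for input, $w$ for weight and $y$ for goal_prediction (these symbols are shorthand introduced here):

$$
E(w) = (wx - y)^2, \qquad \frac{dE}{dw} = 2\,(wx - y)\,x
$$

The code drops the constant factor 2, which only rescales the step, so direction_and_amount is $(wx - y)\,x$ and the update is

$$
w \leftarrow w - (wx - y)\,x
$$

When the prediction is too high, $(wx - y)$ is positive and the weight decreases (for a positive input). When the prediction is too low, the weight increases.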

#### 2.3 Iterations

During training, the backpropagation of error estimates the amount of error for which the weights of a node in the network are responsible. Instead of updating the weight with the full amount, it is scaled by the learning rate.

A learning rate of 0.01, a traditionally common default value, means that the weights in the network are updated by 0.01 * (estimated weight error), or 1% of the estimated weight error, each time they are updated.

In [11]:
#dataset
toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

win_or_lose_binary = [1, 1, 0, 1]  # What we know to be true for the output, i.e. the target values
true = win_or_lose_binary[0]       # Since we train on index 0 of the inputs, the expected output is 1

alpha = 0.01                       # Learning rate
weights = [0.1, 0.2, -.1]

def neural_network(input, weights):
  out = 0
  for i in range(len(input)):
    out += (input[i] * weights[i])
  return out

def ele_mul(scalar, vector):
  out = [0,0,0]
  for i in range(len(out)):
    out[i] = vector[i] * scalar
  return out


input = [toes[0],wlrec[0],nfans[0]]

for iter in range(3):
  pred = neural_network(input,weights)
  error = (pred - true) ** 2
  delta = pred - true
  weight_deltas=ele_mul(delta,input)
    
#   weight_deltas[0] = 0
  print("Iteration:" + str(iter+1))
  print("Pred:" + str(pred))
  print("Error:" + str(error))
  print("Delta:" + str(delta))
  print("Weights:" + str(weights))
  print("Weight_Deltas:")
  print(str(weight_deltas))
  print()
  for i in range(len(weights)):
    weights[i]-=alpha*weight_deltas[i]

Iteration:1
Pred:0.8600000000000001
Error:0.01959999999999997
Delta:-0.1399999999999999
Weights:[0.1, 0.2, -0.1]
Weight_Deltas:
[-1.189999999999999, -0.09099999999999994, -0.16799999999999987]

Iteration:2
Pred:0.9637574999999999
Error:0.0013135188062500048
Delta:-0.036242500000000066
Weights:[0.1119, 0.20091, -0.09832]
Weight_Deltas:
[-0.30806125000000056, -0.023557625000000044, -0.04349100000000008]

Iteration:3
Pred:0.9906177228125002
Error:8.802712522307997e-05
Delta:-0.009382277187499843
Weights:[0.11498061250000001, 0.20114557625, -0.09788509000000001]
Weight_Deltas:
[-0.07974935609374867, -0.006098480171874899, -0.011258732624999811]



What happens over several iterations?<br>
More iterations allow our network to keep learning.<br>
The slopes are reflected by the weight_delta values.<br>
Over the iterations, the slope decreases as we approach the bottom of the parabola.

Given a perfectly configured learning rate, the model will learn to best approximate the function given available resources (the number of layers and the number of nodes per layer) in a given number of training epochs (passes through the training data).

![](pynb_pics/grad_desc.jpg)

### 3. Building a DNN

#### Let's build a DNN
streetlights is the dataset of observations made at an intersection.<br>
walk_vs_stop holds the results observed. It's what we know.

In [12]:
import numpy as np
weights = np.array([0.5,0.48,-0.7])
alpha = 0.1         

# Here we train the weights on the observed data patterns and the outputs we know to be true.
streetlights = np.array( [[ 1, 0, 1 ],
                          [ 0, 1, 1 ],
                          [ 0, 0, 1 ],
                          [ 1, 1, 1 ],
                          [ 0, 1, 1 ],
                          [ 1, 0, 1 ] ] )
walk_vs_stop = np.array( [[ 0 ],
                          [ 1 ],
                          [ 0 ],
                          [ 1 ],
                          [ 1 ],
                          [ 0 ] ] )

for iteration in range(40):
  error_for_all_lights = 0
  for row_index in range(len(walk_vs_stop)):
    input = streetlights[row_index]
    goal_prediction = walk_vs_stop[row_index]
    prediction = input.dot(weights)
    error = (goal_prediction - prediction) ** 2
    error_for_all_lights += error
    delta = prediction - goal_prediction
    weights = weights - (alpha * (input * delta)) # Update the weights: the shared error (delta) is multiplied by each respective input
    print( "Weights:" + str(weights))
    print( "Prediction:" + str(prediction))
  print( "Error:" + str(error_for_all_lights) + "\n")

Weights:[ 0.52  0.48 -0.68]
Prediction:-0.19999999999999996
Weights:[ 0.52  0.6  -0.56]
Prediction:-0.19999999999999996
Weights:[ 0.52   0.6   -0.504]
Prediction:-0.5599999999999999
Weights:[ 0.5584  0.6384 -0.4656]
Prediction:0.6160000000000001
Weights:[ 0.5584   0.72112 -0.38288]
Prediction:0.17279999999999995
Weights:[ 0.540848  0.72112  -0.400432]
Prediction:0.17552
Error:[2.65612311]

Weights:[ 0.5268064  0.72112   -0.4144736]
Prediction:0.14041599999999999
Weights:[ 0.5268064   0.79045536 -0.34513824]
Prediction:0.3066464
Weights:[ 0.5268064   0.79045536 -0.31062442]
Prediction:-0.34513824
Weights:[ 0.52614267  0.78979163 -0.31128815]
Prediction:1.006637344
Weights:[ 0.52614267  0.84194128 -0.2591385 ]
Prediction:0.4785034751999999
Weights:[ 0.49944225  0.84194128 -0.28583891]
Prediction:0.26700416768
Error:[0.96287018]

Weights:[ 0.47808192  0.84194128 -0.30719925]
Prediction:0.213603334144
Weights:[ 0.47808192  0.88846708 -0.26067345]
Prediction:0.5347420299776
Weights:[ 0.4780

Look at the weights. What do they tell us?<br>
The highest weight indicates a correlation between the output and the second parameter in each pattern.<br>
streetlights = np.array( [[ 1, 0, 1 ],<br>
                          [ 0, 1, 1 ],<br>
                          [ 0, 0, 1 ],<br>
                          [ 1, 1, 1 ],<br>
                          [ 0, 1, 1 ],<br>
                          [ 1, 0, 1 ] ] )<br>

In this case, the second weight is far greater than the other two. This is caused by the up and down pressures on the weights during gradient descent: on average, there is more up pressure on the middle weight and more down pressure on the other two.
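
One way to see where that pressure comes from is to count, for each light, how often its value agrees with the observed outcome. This quick check is not part of the original notebook. It reuses the dataset from the cell above, with walk_vs_stop flattened to one dimension for convenience.

```python
import numpy as np

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1],
                         [0, 1, 1],
                         [1, 0, 1]])
walk_vs_stop = np.array([0, 1, 0, 1, 1, 0])

# For each column (light), count the rows where the light's value equals the outcome
matches = (streetlights == walk_vs_stop[:, None]).sum(axis=0)
print("Rows where each light matches the outcome:", matches)
```

The middle light agrees with the outcome in every row, which is why training keeps pushing its weight up while the other two weights are pushed down.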

#### 3.1 Up and Down pressure

The up and down pressures explain why generalization is desired: the weights should settle on patterns that hold across the whole dataset, so that the model works well with data.

##### Neural Networks Learn by Error Attribution

Error attribution is the process of assigning and understanding the contribution of each weight to the overall error
or loss of the network. It's a crucial aspect of training and optimizing a neural network. The purpose is to identify
how changes in the parameters of the network affect the error, which helps in adjusting the model to improve its performance.

**In the above code:**<br>
weights = weights - (alpha * (input * delta))<br>
The weights are updated by a shared error measure (delta), scaled by the learning rate and multiplied by each respective input.

But given the shared error, we want the network to figure out which weights contributed the most to the output, so that the adjustment focuses on those.
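
A minimal sketch of how the update rule already attributes the error, using the starting weights and the first streetlight pattern from the code above. The variable names `input_row`, `weight_deltas` and `new_weights` are chosen here for illustration.

```python
import numpy as np

weights = np.array([0.5, 0.48, -0.7])
alpha = 0.1
input_row = np.array([1, 0, 1])        # first streetlight pattern
goal_prediction = 0

prediction = input_row.dot(weights)
delta = prediction - goal_prediction   # shared error for this row
weight_deltas = input_row * delta      # each weight's share: zero wherever the input was zero
new_weights = weights - alpha * weight_deltas

print("Weight deltas:", weight_deltas)
print("Updated weights:", new_weights)
```

Because the middle input is 0, the middle weight did not contribute to this prediction and is left unchanged. Only the weights whose inputs were active are adjusted.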

#### 3.2 Backpropagation and introducing non-linearity

Without an activation function, neural networks would be restricted to modeling only linear relationships between inputs and outputs.
The activation function takes the weighted sum computed by a neuron and decides whether, and how strongly, the neuron is activated. Its purpose is to introduce non-linearity into the output of a neuron.
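
The sketch below illustrates the point with hand-picked numbers (hypothetical, chosen to keep the arithmetic easy to follow): two stacked layers without an activation give the same result as a single combined weight matrix, while placing a ReLU between them does not.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0)

x = np.array([[1.0, -2.0]])            # one sample with two inputs
w1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])            # input -> hidden weights
w2 = np.array([[1.0],
               [1.0]])                 # hidden -> output weights

# Without an activation, the two layers collapse into one equivalent weight matrix
stacked_linear = x.dot(w1).dot(w2)
single_layer = x.dot(w1.dot(w2))
print("Stacked linear layers:", stacked_linear, "Single combined layer:", single_layer)

# With ReLU in between, the negative hidden value is zeroed and the result differs
with_relu = relu(x.dot(w1)).dot(w2)
print("With ReLU between the layers:", with_relu)
```

However many linear layers are stacked, they can always be replaced by a single one. The activation function is what lets extra layers add modeling power.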

![](pynb_pics/act_fct.jpg)

Backpropagation passes the delta (error) calculated at the output back through the network, so that the weights of the previous layer are adjusted according to their share of that error during an iteration.
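
In symbols, for a network with one hidden layer and a ReLU activation, one training step can be sketched as follows. The notation is an assumption made for this sketch: $l_0$, $l_1$, $l_2$ are the layer values, $W_{01}$ and $W_{12}$ the weight matrices, $y$ the target, $\alpha$ the learning rate and $\odot$ element-wise multiplication.

$$
l_1 = \mathrm{relu}(l_0 W_{01}), \qquad l_2 = l_1 W_{12}
$$

$$
\delta_2 = y - l_2, \qquad \delta_1 = \left(\delta_2 W_{12}^{T}\right) \odot \mathrm{relu}'(l_1)
$$

$$
W_{12} \leftarrow W_{12} + \alpha\, l_1^{T} \delta_2, \qquad W_{01} \leftarrow W_{01} + \alpha\, l_0^{T} \delta_1
$$

The output delta $\delta_2$ is sent backwards through $W_{12}$ to obtain the hidden delta $\delta_1$, and $\mathrm{relu}'$ switches off the hidden nodes that were inactive and therefore did not contribute to the error.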

![](pynb_pics/back_prop.jpg)

### 4.0 A Deep Neural Network Comes to Life

Let's implement the neural network above and observe what happens during one iteration of the training.

In [23]:
# 1 iteration of the Neural Network

# Initialize weights and data
import numpy as np
np.random.seed(1)

def relu(x):
  return (x > 0) * x

# Returns True (1) where output is greater than 0, False (0) otherwise
def relu2deriv(output):
  return output>0

alpha = 0.2
hidden_size = 3
streetlights = np.array( [[ 1, 0, 1 ],
                          [ 0, 1, 1 ],
                          [ 0, 0, 1 ],
                          [ 1, 1, 1 ] ] )
walk_vs_stop = np.array([[ 1, 1, 0, 0]]).T

weights_0_1 = 2*np.random.random((3,hidden_size)) - 1
weights_1_2 = 2*np.random.random((hidden_size,1)) - 1

# Make a prediction and calculate the output error and delta
layer_2_error = 0
layer_0 = streetlights[0:1]
layer_1 = relu(np.dot(layer_0,weights_0_1))
layer_2 = np.dot(layer_1,weights_1_2)
layer_2_error += np.sum((layer_2 - walk_vs_stop[0:1]) ** 2)
layer_2_delta = (walk_vs_stop[0:1] - layer_2)

# Backpropagating from layer 2 to layer 1
layer_1_delta=layer_2_delta.dot(weights_1_2.T)*relu2deriv(layer_1) 

# Compute the weight deltas and update the weights
weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

print( "Error:" + str(layer_2_error))
print( "Layer 0:" + str(layer_0))
print( "Layer 1:" + str(layer_1))
print( "Layer 2:" + str(layer_2))
print( "Layer 1 Delta:" + str(layer_1_delta))
print( "Layer 2 Delta:" + str(layer_2_delta))
print( "Weights 1_2:" + str(weights_1_2))
print( "Weights 0_1:" + str(weights_0_1))


Error:1.0430445982842722
Layer 0:[[1 0 1]]
Layer 1:[[-0.          0.13177044 -0.        ]]
Layer 2:[[-0.02129555]]
Layer 1 Delta:[[ 0.         -0.16505257  0.        ]]
Layer 2 Delta:[[1.02129555]]
Weights 1_2:[[ 0.07763347]
 [-0.13469566]
 [ 0.370439  ]]
Weights 0_1:[[-0.16595599  0.40763847 -0.99977125]
 [-0.39533485 -0.70648822 -0.81532281]
 [-0.62747958 -0.34188906 -0.20646505]]


In [24]:
weights_0_1 = 2*np.random.random((3,hidden_size)) - 1
print(weights_0_1)

[[-0.5910955   0.75623487 -0.94522481]
 [ 0.34093502 -0.1653904   0.11737966]
 [-0.71922612 -0.60379702  0.60148914]]


In [17]:
import numpy as np
np.random.seed(1)

def relu(x):
  return (x > 0) * x

def relu2deriv(output):
  return output>0

alpha = 0.2
hidden_size = 4
streetlights = np.array( [[ 1, 0, 1 ],
                          [ 0, 1, 1 ],
                          [ 0, 0, 1 ],
                          [ 1, 1, 1 ] ] )
walk_vs_stop = np.array([[ 1, 1, 0, 0]]).T

weights_0_1 = 2*np.random.random((3,hidden_size)) - 1
weights_1_2 = 2*np.random.random((hidden_size,1)) - 1

for iteration in range(40):
  layer_2_error = 0
  for i in range(len(streetlights)):
    layer_0 = streetlights[i:i+1]
    layer_1 = relu(np.dot(layer_0,weights_0_1))
    layer_2 = np.dot(layer_1,weights_1_2)
    layer_2_error += np.sum((layer_2 - walk_vs_stop[i:i+1]) ** 2)
    layer_2_delta = (walk_vs_stop[i:i+1] - layer_2)
    layer_1_delta=layer_2_delta.dot(weights_1_2.T)*relu2deriv(layer_1)
    weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
    weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
  # if(iteration == 0):
    print("Iteration:" +  str(iteration) + "\n")
    print( "Error:" + str(layer_2_error))
    print( "Layer 0:" + str(layer_0))
    print( "Layer 1:" + str(layer_1))
    print( "Layer 2:" + str(layer_2))
    print( "Layer 1 Delta:" + str(layer_1_delta))
    print( "Layer 2 Delta:" + str(layer_2_delta))
    print( "Weights 1_2:" + str(weights_1_2))
    print( "Weights 0_1:" + str(weights_0_1))
  

Iteration:0

Error:0.3697329913497495
Layer 0:[[1 0 1]]
Layer 1:[[-0.          0.51828245 -0.         -0.        ]]
Layer 2:[[0.39194327]]
Layer 1 Delta:[[-0.          0.45983371 -0.          0.        ]]
Layer 2 Delta:[[0.60805673]]
Weights 1_2:[[-0.5910955 ]
 [ 0.8192639 ]
 [-0.94522481]
 [ 0.34093502]]
Weights 0_1:[[-0.16595599  0.53261573 -0.99977125 -0.39533485]
 [-0.70648822 -0.81532281 -0.62747958 -0.30887855]
 [-0.20646505  0.16960021 -0.16161097  0.370439  ]]
Iteration:0

Error:1.3281972624432705
Layer 0:[[0 1 1]]
Layer 1:[[-0.         -0.         -0.          0.06156045]]
Layer 2:[[0.02098811]]
Layer 1 Delta:[[-0.          0.         -0.          0.33377944]]
Layer 2 Delta:[[0.97901189]]
Weights 1_2:[[-0.5910955 ]
 [ 0.8192639 ]
 [-0.94522481]
 [ 0.3529887 ]]
Weights 0_1:[[-0.16595599  0.53261573 -0.99977125 -0.39533485]
 [-0.70648822 -0.81532281 -0.62747958 -0.24212266]
 [-0.20646505  0.16960021 -0.16161097  0.43719489]]
Iteration:0

Error:1.4142058374213209
Layer 0:[[0 0 1]

In each iteration, we adjust the weights in the right direction with each input (4 times for 4 inputs). Eventually, those weights gravitate toward the very bottom of the loss function valley, where the model performs very well, meaning we get as close to the right answer as possible for all of the inputs. We observe that the error gets closer to zero.
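
To follow the error trend without the per-row printouts, a variant of the training loop above can record the total error once per iteration. This is a sketch using the same seed, data and hyperparameters as the cell above. The `errors` list is an addition for illustration.

```python
import numpy as np

np.random.seed(1)

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return output > 0

alpha = 0.2
hidden_size = 4
streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1]])
walk_vs_stop = np.array([[1, 1, 0, 0]]).T

weights_0_1 = 2 * np.random.random((3, hidden_size)) - 1
weights_1_2 = 2 * np.random.random((hidden_size, 1)) - 1

errors = []  # total squared error over the 4 inputs, one entry per iteration
for iteration in range(40):
    layer_2_error = 0
    for i in range(len(streetlights)):
        # Forward pass
        layer_0 = streetlights[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        layer_2_error += np.sum((layer_2 - walk_vs_stop[i:i+1]) ** 2)
        # Backward pass: output delta, then hidden delta
        layer_2_delta = walk_vs_stop[i:i+1] - layer_2
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        # Weight updates
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
    errors.append(layer_2_error)

print("Error after first iteration:", errors[0])
print("Error after last iteration:", errors[-1])
```

Comparing the first and last entries of `errors` shows how far the total error has dropped over the 40 iterations.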

### 5.0 A Deep Neural Network with PyTorch