<a href="https://colab.research.google.com/github/samuelkb/gColab/blob/main/notebooks/Introduction%20to%20neural%20networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to neural networks

What is deep learning, and what is it used for? Pretty much everywhere: beating humans at games such as Go and Jeopardy, detecting spam in emails, forecasting stock prices, recognizing objects in images, diagnosing illnesses (sometimes with more precision than doctors), and, in one of its most celebrated applications, self-driving cars.

At the heart of deep learning are neural networks.

Neural networks loosely mimic how the brain operates. Given some data in the form of blue and red points, a neural network looks for the best line that separates (classifies) them.

### Perceptrons

Perceptrons are the building blocks of neural networks, and are just an encoding of our equations into a small graph.
A great application of perceptrons is implementing logical operators, the most common of which are **AND**, **OR**, **NOT**, and **XOR**.

![AND_perceptron.png](attachment:AND_perceptron.png)

### What are the weights and bias for the AND perceptron?

Let's play and set the weights and bias to values that correctly compute the AND, OR, and NOT operations shown above.
(Note that more than one set of values will work.)

In [3]:
import pandas as pd

# Set weight1, weight2 and bias (you can change them and keep experimenting)
weight1 = 1.0
weight2 = 1.0
bias = -2.0

# Don't change the next block of code
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, False, False, True]
outputs = []

# Validate outputs
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])
    
# Print results
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
if not num_wrong:
    print('Nice!  You got it all correct.\n')
else:
    print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))

Nice!  You got it all correct.

 Input 1    Input 2    Linear Combination    Activation Output   Is Correct
       0          0                  -2.0                    0          Yes
       0          1                  -1.0                    0          Yes
       1          0                  -1.0                    0          Yes
       1          1                   0.0                    1          Yes


### What are the weights and bias for the OR perceptron?

![OR_Perceptron.png](attachment:OR_Perceptron.png)

In [4]:
import pandas as pd

# Set weight1, weight2 and bias (you can change them and keep experimenting)
weight1 = 2.0
weight2 = 2.0
bias = -2.0

# Don't change the next block of code
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, True, True, True]
outputs = []

# Validate outputs
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])
    
# Print results
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
if not num_wrong:
    print('Nice!  You got it all correct.\n')
else:
    print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))

Nice!  You got it all correct.

 Input 1    Input 2    Linear Combination    Activation Output   Is Correct
       0          0                  -2.0                    0          Yes
       0          1                   0.0                    1          Yes
       1          0                   0.0                    1          Yes
       1          1                   2.0                    1          Yes


Notice that to go from the AND to the OR operation we can just increase the weights or decrease the magnitude of the bias.

### What are the weights and bias for the NOT perceptron?

This operator only cares about one input; the other inputs to the perceptron should not change the result. In the test below, the expected output is the negation of the second input.

In [6]:
import pandas as pd

# Set weight1, weight2 and bias (you can change them and keep experimenting)
weight1 = -1.0
weight2 = -3.0
bias = 2.0

# Don't change the next block of code
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [True, False, True, False]
outputs = []

# Validate outputs
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])
    
# Print results
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
if not num_wrong:
    print('Nice!  You got it all correct.\n')
else:
    print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))

Nice!  You got it all correct.

 Input 1    Input 2    Linear Combination    Activation Output   Is Correct
       0          0                   2.0                    1          Yes
       0          1                  -1.0                    0          Yes
       1          0                   1.0                    1          Yes
       1          1                  -2.0                    0          Yes


### What are the weights and bias for the XOR perceptron?

![Captura%20de%20Pantalla%202020-10-25%20a%20la%28s%29%2022.48.10.png](attachment:Captura%20de%20Pantalla%202020-10-25%20a%20la%28s%29%2022.48.10.png)

We can get the XOR operation by building a multi-layer perceptron out of AND, NOT, and OR perceptrons. For that we can use the following diagram:

![xor-quiz.png](attachment:xor-quiz.png)

Where:
- "A" is the AND perceptron
- "B" is the OR perceptron
- "C" is the NOT  perceptron

![xor-quiz2.png](attachment:xor-quiz2.png)
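As a rough sketch of how the three perceptrons can be chained, the code below assumes the wiring XOR = AND(NOT(AND(x1, x2)), OR(x1, x2)), which is one arrangement consistent with the labels A, B and C. The AND and OR weights are the ones used in the cells above; the single-input NOT weights and all function names are illustrative choices, not taken from the diagram.

```python
def perceptron(weights, bias, inputs):
    """Step-activated perceptron: 1 if the weighted sum plus bias is >= 0, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return int(total >= 0)

# AND and OR reuse the weights and bias from the cells above.
def and_gate(x1, x2):
    return perceptron([1.0, 1.0], -2.0, [x1, x2])

def or_gate(x1, x2):
    return perceptron([2.0, 2.0], -2.0, [x1, x2])

# Single-input NOT; these values are an assumption made for this sketch.
def not_gate(x):
    return perceptron([-1.0], 0.5, [x])

def xor_gate(x1, x2):
    # Assumed wiring: XOR = AND(NOT(AND(x1, x2)), OR(x1, x2))
    return and_gate(not_gate(and_gate(x1, x2)), or_gate(x1, x2))

xor_table = {(x1, x2): xor_gate(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
print(xor_table)
```

The output is 1 only when exactly one of the two inputs is 1, which no single perceptron with a straight-line boundary can achieve.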

So, we reviewed perceptrons, using logic and mathematical knowledge to build the most common logical operators.
In real life, though, we can't build these perceptrons ourselves. The idea is that we give them the desired results, and they build themselves.

### How do we find the line that separates one group of data from other?

In the graphs above, we received a 1 when the dot was in the blue area and a 0 when the dot was in the red area. Perceptrons let us see the classification of data graphically, this time with a linear equation, but data is not always separable by a linear model.

The computer doesn't know where to start when classifying a data set, so it might start at a random place by picking a random linear equation. That line defines two areas, so some data will be in the "red" part and some in the "blue" part. It's very likely that some data will be misclassified, so we measure how badly this line is doing the classification and then move it around to try to get better results.

To know how bad the initial classification is, we go through all the points in the data set. Some are correctly classified; the incorrectly classified ones are the ones we want to learn as much as possible from, because they tell us how to improve our initial classification line.

What can a misclassified point tell us? Does that point want the classification line closer or farther away?
A misclassified point wants the line to come closer to it.

Let's look at an example to better understand this approach:

We have a data set divided into two parts, a negative and a positive area, by the equation:
$$
3x_1 + 4x_2 - 10 = 0
$$

All positive dots will be defined by:
$$
3x_1 + 4x_2 > 10
$$

All negative dots will be defined by:
$$
3x_1 + 4x_2 < 10
$$

![Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.20.41.png](attachment:Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.20.41.png)

And we have a misclassified point (4, 5): it belongs to the red (negative) class but lies in the positive area. The point says to the line: "Come closer!"
How do we get the line to come closer to that point? A good idea is to take the coordinates (4, 5) and use them to modify the equation of the line. We shouldn't forget the bias, so we also use a 1 for it, and we subtract these numbers (4, 5, 1) from the parameters of the line (3, 4, -10) to get:

![Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.28.12.png](attachment:Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.28.12.png)

The new line has parameters -1, -1, -11. That is a drastic change, and we don't want drastic changes, because we could accidentally misclassify all our other points. We want to move the line towards that point in small steps.

### Learning rate

To move our line in small steps, we introduce the learning rate, a small number. Instead of subtracting the coordinates of the misclassified point (and the 1 for the bias) directly from the parameters of the line, we first multiply them by the learning rate. Let's say that in our example the learning rate is 0.1:

![Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.33.45.png](attachment:Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.33.45.png)

That will give us the next equation:
$$
2.6x_1 + 3.5x_2 - 10.1 = 0
$$
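The arithmetic behind this equation can be checked with a small sketch. The function name `perceptron_trick` and its `subtract` flag are hypothetical helpers introduced only for illustration.

```python
def perceptron_trick(line, point, learn_rate, subtract=True):
    """Move the line (w1, w2, b) towards a misclassified point (x1, x2).

    subtract=True: a negative point sitting in the positive area.
    subtract=False: a positive point sitting in the negative area.
    """
    sign = -1.0 if subtract else 1.0
    w1, w2, b = line
    x1, x2 = point
    return (w1 + sign * learn_rate * x1,
            w2 + sign * learn_rate * x2,
            b + sign * learn_rate)

# The line 3*x1 + 4*x2 - 10 = 0, the point (4, 5) and a learning rate of 0.1.
new_line = perceptron_trick((3.0, 4.0, -10.0), (4.0, 5.0), learn_rate=0.1)
print([round(v, 2) for v in new_line])  # [2.6, 3.5, -10.1]
```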

That makes our line come closer to the misclassified point in small steps. If instead we have a blue (positive) point incorrectly classified in the red area, we follow the same approach, but instead of subtracting from the original equation, we add:

![Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.37.24.png](attachment:Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.37.24.png)

**So we can use this trick repeatedly for the Perceptron Algorithm.**

For this second example where we defined a line described by:
$$
3x_1 + 4x_2 -10 = 0
$$

and considering our learning rate = 0.1, how many times would we have to apply the perceptron trick to move the line to a position where the blue point (1, 1) is correctly classified?
The answer is 10: the point starts with a score of 3 + 4 - 10 = -3, and each application adds 0.1 to each of the two weights and to the bias, raising the score by 0.3. Let's do it.

In [None]:
# Pending to learn how to graph and show our iterations
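Until the plot is ready, the count can at least be checked numerically. The sketch below is one way to do it, assuming that a point exactly on the line (score 0) counts as positive, as in the step function used in this notebook. It uses exact fractions so that floating-point rounding at the boundary does not change the count; the function name is hypothetical.

```python
from fractions import Fraction

def steps_until_classified(line, point, learn_rate):
    """Count perceptron-trick steps until a positive point has w1*x1 + w2*x2 + b >= 0."""
    w1, w2, b = (Fraction(v) for v in line)
    x1, x2 = (Fraction(v) for v in point)
    steps = 0
    while w1 * x1 + w2 * x2 + b < 0:
        # Positive point in the negative area: add the scaled coordinates and bias term.
        w1 += learn_rate * x1
        w2 += learn_rate * x2
        b += learn_rate
        steps += 1
    return steps

print(steps_until_classified((3, 4, -10), (1, 1), Fraction(1, 10)))  # 10
```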

### Perceptron Algorithm

Remember that the computer starts with a random line with random weights ($W_1, ..., W_n, b$), and each point of our data set tells us whether it is correctly or incorrectly classified. A misclassified point ($X_1, ..., X_n$) says to the line "Come closer".

If a point is misclassified with a prediction of 0 (its label is 1), then for each $i$ from 1 to n we add $W_i = W_i + \alpha X_i$, where $\alpha$ is our learning rate. We also change the bias $b$ to $b + \alpha$, because that will move our line closer to the misclassified point.

If a point is misclassified with a prediction of 1 (its label is 0), then for each $i$ from 1 to n we subtract $W_i = W_i - \alpha X_i$, where $\alpha$ is our learning rate. We also change the bias $b$ to $b - \alpha$, because that will move our line closer to the misclassified point.


Then, we just have to repeat this step until we get no errors, or repeat it a specific number of times.

### Coding the perceptron algorithm

Let's play with a data set to separate the data given in a data.csv file.

Remember, our algorithm works as follows.
For a point with coordinates (p, q), label y, and prediction given by the equation $\hat{y} = step(w_1 p + w_2 q + b)$:
- If the point is correctly classified, do nothing.
- If the point is classified positive, but it has a negative label, subtract $\alpha p, \alpha q$, and $\alpha$ from $w_1, w_2$ and $b$ respectively.
- If the point is classified negative, but it has a positive label, add $\alpha p, \alpha q$, and $\alpha$ to $w_1, w_2$ and $b$ respectively.

You can play with the parameters to see what happens and how your initial conditions can affect the solution.

In [20]:
import numpy as np
# importing the required module 
import matplotlib.pyplot as plt 

# Setting the random seed, feel free to change it and see different solutions.
np.random.seed(100)

def stepFunction(t):
    if t >= 0:
        return 1
    return 0

def prediction(X, W, b):
    return stepFunction((np.matmul(X,W)+b)[0])

# Implementation of the perceptron trick.
# The function receives as inputs the data X, the labels y,
# the weights W (as an array), and the bias b,
# updates the weights and bias W, b, according to the perceptron algorithm,
# and returns W and b.
def perceptronStep(X, y, W, b, learn_rate = 0.01):
    # Apply the perceptron trick to every misclassified point.
    for i in range(len(X)):
        y_hat = prediction(X[i],W,b)
        if y[i]-y_hat == 1:
            W[0] += X[i][0] * learn_rate
            W[1] += X[i][1] * learn_rate
            b += learn_rate
        elif y[i]-y_hat == -1:
            W[0] -= X[i][0] * learn_rate
            W[1] -= X[i][1] * learn_rate
            b -= learn_rate
    return W, b
    
# This function runs the perceptron algorithm repeatedly on the dataset,
# and returns a few of the boundary lines obtained in the iterations,
# for plotting purposes.
# Feel free to play with the learning rate and the num_epochs,
# and see your results plotted below.
def trainPerceptronAlgorithm(X, y, learn_rate = 0.01, num_epochs = 10):
    x_min, x_max = min(X.T[0]), max(X.T[0])
    y_min, y_max = min(X.T[1]), max(X.T[1])
    W = np.array(np.random.rand(2,1))
    b = np.random.rand(1)[0] + x_max
    # These are the solution lines that get plotted below.
    boundary_lines = []
    for i in range(num_epochs):
        # In each epoch, we apply the perceptron step.
        W, b = perceptronStep(X, y, W, b, learn_rate)
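        # Store (slope, intercept) of the boundary line: x2 = slope * x1 + intercept.
        boundary_lines.append((-W[0]/W[1], -b/W[1]))
    return boundary_lines
# To graph the result, draw each stored (slope, intercept) pair as a line over the data points.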
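Since `data.csv` is not shown here, the following is a minimal, dependency-free sketch of the same training loop on a made-up, linearly separable toy data set. The points, labels, starting weights and function names are all assumptions chosen for illustration; the loop stops as soon as every point is classified correctly or after a fixed number of epochs.

```python
def step(t):
    return 1 if t >= 0 else 0

def perceptron_epoch(points, labels, w, b, learn_rate=0.1):
    """One pass over the data, applying the perceptron trick to each misclassified point."""
    for (x1, x2), label in zip(points, labels):
        error = label - step(w[0] * x1 + w[1] * x2 + b)  # +1, 0 or -1
        w[0] += error * learn_rate * x1
        w[1] += error * learn_rate * x2
        b += error * learn_rate
    return w, b

# Made-up toy data (not the contents of data.csv): label 1 = blue, 0 = red.
points = [(2, 3), (3, 3), (3, 2), (4, 4), (0, 0), (1, 0), (0, 1), (0.5, 0.5)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = [0.0, 0.0], 0.0
for epoch in range(1000):
    w, b = perceptron_epoch(points, labels, w, b)
    predictions = [step(w[0] * x1 + w[1] * x2 + b) for x1, x2 in points]
    if predictions == labels:
        break

print("epochs used:", epoch + 1)
print("weights:", w, "bias:", b)
```

Changing the learning rate or the starting weights changes how many epochs are needed and which separating line is found.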


Well, now you can see how the line learns and tries to separate the dots. But real-world data usually can't be separated by a line; more complex equations can classify our data sets better. The perceptron algorithm as it stands won't work in those cases, so we need to redefine it.

### Error functions

An error function is simply something that tells us how far we are from the solution. It guides us by showing in which direction we can take a step to get closer to the solution.

As an analogy, imagine we are on top of a big mountain and we want to descend. The mountain is so big that we can't see the whole picture; we only see the possible directions to take, and we choose the one that makes us descend the most. Once we take a step in that direction, we repeat the process until we arrive at the bottom of the mountain.

All of our decisions are based on the height of the mountain, and we will call that height the error. The error tells us how badly we're doing at the moment and how far we are from an ideal solution. If we constantly take steps that decrease the error, we'll eventually solve our problem.

Notice that this method can lead us to a wrong solution: in our analogy, we could end up in a valley that is only a local minimum. That happens a lot in machine learning, and we'll see ways to deal with it later.

To take advantage of error functions, we have to build continuous ones, which guide us better than discrete ones toward decreasing the error.

![Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%208.19.49.png](attachment:Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%208.19.49.png)

### Discrete vs Continuous predictions

A discrete answer gives us a yes/no, while a continuous answer gives us a number, usually between 0 and 1, which we interpret as a probability. We move from discrete to continuous predictions by changing our activation function from the step function (0s and 1s) to a new function called the **Sigmoid Function**.

The sigmoid function gives values very close to one for large positive numbers, values very close to zero for large negative numbers, and values close to 0.5 for numbers close to zero. The formula is:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
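A few sample values make the shape of the function concrete. This is a minimal sketch using only the standard library; the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Large negative inputs approach 0, zero gives 0.5, large positive inputs approach 1.
for x in (-10, -1, 0, 1, 10):
    print(x, round(sigmoid(x), 4))
```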


The way we obtain probabilities from a space like our red/blue classification areas is simple. We just combine the linear function `Wx + b` with the sigmoid function. So the prediction is defined as:

$$
\hat{y} = \sigma(Wx + b)
$$

This formula takes us closer to a specific prediction (e.g. closer to one if the question is "Is the point blue?" and the point is deep in the blue area).

![Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%2022.37.13.png](attachment:Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%2022.37.13.png)

What our new sigmoid perceptron does is take the inputs, multiply them by the weights on the edges, add the results, and then apply the sigmoid function. So instead of returning one or zero like before, it returns values between zero and one. Now we know the probability, not just a yes or no answer.
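To compare the two kinds of perceptron side by side, here is a small sketch that reuses the line $3x_1 + 4x_2 - 10 = 0$ from earlier as example weights. The function names and the three sample points are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(weights, bias, inputs):
    """Linear part shared by both perceptrons: W.x + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def step_perceptron(weights, bias, inputs):
    return 1 if score(weights, bias, inputs) >= 0 else 0

def sigmoid_perceptron(weights, bias, inputs):
    return sigmoid(score(weights, bias, inputs))

weights, bias = [3.0, 4.0], -10.0  # the line 3*x1 + 4*x2 - 10 = 0
for point in [(4, 5), (2, 1), (1, 1)]:
    print(point,
          step_perceptron(weights, bias, point),
          round(sigmoid_perceptron(weights, bias, point), 4))
```

The point (2, 1) lies exactly on the line, so the sigmoid perceptron reports 0.5, while points far from the line get values close to 0 or 1.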

![Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%2022.40.25.png](attachment:Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%2022.40.25.png)

### The Softmax function

The softmax function is the equivalent of the sigmoid activation function for problems with 3 or more classes.

When we need to classify between 3 or more options, we have to translate our scores into probabilities in such a way that all the probabilities add up to 1.

![Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%2022.56.41.png](attachment:Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%2022.56.41.png)

Exponential functions help us avoid negative numbers, and they allow us to take the value for an option and divide it by the sum of all the values without fear of dividing by zero.

Let's say we have n classes and a linear model that gives us the following scores:
$$Z_1, ..., Z_n$$
one score for each of the classes. To turn them into probabilities, we say that the probability that the object is in class i is e to the power of $Z_i$ divided by the sum of e to the power of $Z_1$ plus all the way to e to the power of $Z_n$:
$$
P(class\ i) = \frac{e^{Z_i}}{e^{Z_1} + ... + e^{Z_n}}
$$
That's how we turn scores into probabilities.

Now we can start programming the Softmax function!

In [2]:
import numpy as np

# Write a function that takes as input a list of numbers, and returns
# the list of values given by the softmax function.
def softmax(L):
    expL = np.exp(L)
    sumExpL = sum(expL)
    result = []
    for i in expL:
        result.append(i*1.0/sumExpL)
    return result

In [5]:
# Let's call our function:
print(softmax([5, 6, 7]))

[0.09003057317038046, 0.24472847105479764, 0.6652409557748219]
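One practical caveat: exponentials of large scores can overflow. A common remedy, shown here as a sketch and not as part of the exercise above, is to subtract the largest score before exponentiating, which leaves the probabilities unchanged. The name `stable_softmax` is hypothetical.

```python
import math

def stable_softmax(scores):
    """Softmax that subtracts the largest score first so exp() cannot overflow."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(stable_softmax([5, 6, 7]))
# Shifting every score by the same amount gives the same probabilities.
print(stable_softmax([1000, 1001, 1002]))
```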


Our data will not always be numeric, and we shouldn't create artificial dependencies between categories and numeric values (for example, by numbering the categories). In these cases we add more columns to our data set, one per category, to better describe the structure of our data. This process is called One-Hot Encoding and we will use it a lot for processing data.
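As a minimal sketch of the idea, the following turns a list of category names into rows of 0s and 1s, with one column per category. The animal names and the helper `one_hot_encode` are illustrative; in practice a library routine would usually do this.

```python
def one_hot_encode(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

animals = ["duck", "beaver", "walrus", "duck"]  # illustrative values
categories, rows = one_hot_encode(animals)
print(categories)
for animal, row in zip(animals, rows):
    print(animal, row)
```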

### Maximum likelihood

Probability will be one of our best friends as we go through Deep Learning. Let's see how we can use it to evaluate and improve our models.


The best model is the one that gives the highest probabilities to the events that actually happened, whether it's acceptance or rejection. This method is known as Maximum Likelihood. We pick the model that gives the existing labels the highest probability; by maximizing that probability, we pick the best possible model.


Suppose we compare two models that classify points according to the color of their area, where the first model correctly classifies two of four points and the second one correctly classifies all four. We can be sure the second model is better, but to prove it with probability, we take the probability of each point being the color it actually is, and then multiply those probabilities together.

Applying maximum likelihood, we notice that the second model, which gives each point a higher probability of being the color it actually is, also has a higher total probability.

![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.10.04.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.10.04.png)

### Maximizing probabilities

We have talked about the error function and how minimizing it takes us to the best possible solution. Could it be that maximizing the probability is equivalent to minimizing the error function?

What we want is to maximize the probability, but the probability is a product of numbers, and products are hard to work with: a product of thousands of numbers between 0 and 1 will be something like 0.00000XX, which is bad for our purposes.

To avoid products, let's change to sums. We need a function that turns products into sums.

![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.18.35.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.18.35.png)

The logarithm function helps us because it has a nice identity: the logarithm of the product A times B is the sum of the logarithms of A and B.
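A short sketch shows both the identity and why it matters numerically. The 2000 probabilities of 0.5 are an illustrative assumption: their product is too small for a floating-point number and collapses to 0.0, while the sum of their logarithms stays perfectly usable.

```python
import math

probabilities = [0.5] * 2000  # illustrative: 2000 points, each with probability 0.5

product = 1.0
for p in probabilities:
    product *= p  # eventually underflows to 0.0

log_sum = sum(math.log(p) for p in probabilities)

print(product)
print(log_sum)

# The identity on two numbers: ln(a * b) = ln(a) + ln(b)
a, b = 0.6, 0.2
print(math.isclose(math.log(a * b), math.log(a) + math.log(b)))
```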

So we take our products and we take the logarithms. Let's say our model 1 has these four probabilities for each point:
$$
0.6 * 0.2 * 0.1 * 0.7 = 0.0084
$$

And our model 2 has these four probabilities for each point:
$$
0.7 * 0.9 * 0.8 * 0.6 = 0.3024
$$

We change to logarithms, for model 1:
$$
\ln(0.6) + \ln(0.2) + \ln(0.1) + \ln(0.7)
$$
$$= -0.51 - 1.61 - 2.30 - 0.36 = -4.78$$

and for model 2:
$$
\ln(0.7) + \ln(0.9) + \ln(0.8) + \ln(0.6)
$$
$$= -0.36 - 0.11 - 0.22 - 0.51 = -1.20$$

Notice that every natural logarithm here is negative. That makes sense, because the logarithm of a number between 0 and 1 is always negative, since the logarithm of one is zero.

We will take the negative of the logarithm of the probabilities.
For model 1:
$$
-\ln(0.6) - \ln(0.2) - \ln(0.1) - \ln(0.7)
$$
$$= 0.51 + 1.61 + 2.30 + 0.36 = 4.78$$

and for model 2:
$$
-\ln(0.7) - \ln(0.9) - \ln(0.8) - \ln(0.6)
$$
$$= 0.36 + 0.11 + 0.22 + 0.51 = 1.20$$

The sum of the negatives of the logarithms of the probabilities is called the **cross entropy**, which is a very important concept.

Calculating the cross entropies, we find that model 1 has a cross entropy of about 4.8, which is high, while model 2 has a cross entropy of about 1.2, which is low. A good model gives us a low cross entropy and a bad model gives us a high cross entropy.
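These two values can be reproduced with a few lines of Python. The sketch below also recovers the original products from the cross entropies, since the product equals e raised to minus the cross entropy; the function name is an illustrative choice.

```python
import math

model_1 = [0.6, 0.2, 0.1, 0.7]
model_2 = [0.7, 0.9, 0.8, 0.6]

def cross_entropy_of(probabilities):
    """Sum of the negative natural logarithms of the probabilities."""
    return -sum(math.log(p) for p in probabilities)

for name, probs in (("model 1", model_1), ("model 2", model_2)):
    ce = cross_entropy_of(probs)
    # exp(-ce) recovers the product of the probabilities.
    print(name, round(ce, 2), round(math.exp(-ce), 4))
```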

This method is actually much more powerful than it seems: if we calculate the probabilities and pair the points with the corresponding negative logarithms, we actually get an error for each point.

![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.37.11.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.37.11.png)



If we look carefully at the values, we can see that the points that are misclassified have high values and the points that are correctly classified have small values. We can think of the negatives of these logarithms as errors at each point.

Our cross entropy will tell us if a model is good or bad. So we change from maximizing the probability to minimizing our cross entropy.

### Cross Entropy

Cross entropy says: if I have a bunch of events and a bunch of probabilities, how likely is it that those events happen based on the probabilities? If it is very likely, we have a small cross entropy. If it is unlikely, we have a large cross entropy.


Let's take another example, here is a table with all the possible scenarios for three doors that have different probabilities to have a gift behind them: 

![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.58.48.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.58.48.png)

There are eight scenarios, since each of the three doors gives us two possibilities. We obtain the probability of each arrangement by multiplying the three independent probabilities.

Notice that the events with high probability have low cross entropy and the events with low probability have high cross entropy.

From this example we can derive a formula, using as variables the probabilities for each door and the presence of a gift behind each door:

![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.04.38.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.04.38.png)

The cross entropy really tells us whether two vectors are similar or different.


Let's code the formula for cross entropy in Python, where Y is the category and P the probability:

In [1]:
import numpy as np

# Write a function that takes as input two lists Y, P,
# and returns the float corresponding to their cross-entropy.
def cross_entropy(Y, P):
    # np.float_ was removed in NumPy 2.0, so convert with np.asarray instead.
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))

If we test it:

In [3]:
Y = [1, 0, 1, 1]
P = [0.4, 0.6, 0.1, 0.5]
print(cross_entropy(Y, P))

4.828313737302301


So far we have seen an example with two classes (gift or no gift). We continue with our three doors, but now behind each door there is an animal, which can be one of three types: duck, beaver or walrus. Here is the probability table for this example:

![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.15.15.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.15.15.png)

We take our doors table and define a new cross entropy formula, where m is the number of classes and $Y_{ij}$ corresponds to the event of a specific animal being behind a specific door:
![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.19.36.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.19.36.png)

![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.20.49.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.20.49.png)

### Logistic regression

One of the most popular and useful algorithms in ML, and the building block of all that constitutes DL, is logistic regression, which does something like this:
- Take your data
- Pick a random model
- Calculate the error
- Minimize the error, and obtain a better model
- Enjoy!

Okay, how do we calculate the error? Coming back to our example of points in the blue and red areas: if y = 1 (blue point), the probability of the point being blue is $\hat{y}$, and the error is the negative of the natural logarithm of $\hat{y}$.
If y = 0 (red point), the probability of the point being red is 1 minus the probability of it being blue, that is $1 - \hat{y}$, and the error is the negative of the natural logarithm of $1 - \hat{y}$.
So we can define a formula for the error that goes like this:
$$
Error = - (1 - y)\ln(1 - \hat{y}) - y\ln(\hat{y})
$$

If the point is red, then `y = 0`, the second term of the formula is 0 and the first one is the negative logarithm of $1 - \hat{y}$. If the point is blue, then `y = 1`, the first term of the formula is 0 and the second one is the negative logarithm of $\hat{y}$.

Following that, the error function is simply the average of the errors of all the points:
$$
\text{Error function} = - \frac{1}{m} \sum_{i=1}^{m} \left[ (1 - y_i)\ln(1 - \hat{y_i}) + y_i\ln(\hat{y_i}) \right]
$$

Since $\hat{y}$ is given by the sigmoid of the linear function `Wx + b`, the total formula for the error is actually in terms of W and b, which are the weights of the model. That is simply the summation we see here:

$$
\text{Error function}(W,b) = - \frac{1}{m} \sum_{i=1}^{m} \left[ (1 - y_i)\ln(1 - \sigma(Wx^{i} + b)) + y_i\ln(\sigma(Wx^{i} + b)) \right]
$$
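To see the error as a function of W and b, here is a minimal sketch that evaluates it for two candidate lines on a handful of made-up points. The points, labels and the second set of weights are assumptions for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def error_function(weights, bias, points, labels):
    """Average binary cross entropy of the predictions sigmoid(W.x + b)."""
    total = 0.0
    for x, y in zip(points, labels):
        y_hat = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        total += -(1 - y) * math.log(1 - y_hat) - y * math.log(y_hat)
    return total / len(points)

# Illustrative points and labels (1 = blue, 0 = red); not data from this notebook.
points = [(1, 1), (2, 1), (3, 3), (4, 2)]
labels = [0, 0, 1, 1]

print(error_function([3.0, 4.0], -10.0, points, labels))   # the line 3*x1 + 4*x2 - 10 = 0
print(error_function([-3.0, -4.0], 10.0, points, labels))  # same line with the areas swapped
```

Swapping the positive and negative areas misclassifies most points, so the second error comes out much larger than the first.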

Here $y_i$ is just the label of the point $x^{i}$. Now that we've calculated the error function, our goal is to minimize it.

This error function applies to binary classification problems. If we have a multi-class classification problem, the error is given by the multi-class cross entropy:

$$
\text{Error function}(W,b) = - \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} y_{ij}\ln(\hat{y}_{ij})
$$

In this last formula, for every data point we take the product of the label and the logarithm of the prediction, sum over the n classes, and then average these values over the m points.
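A direct translation of the multi-class formula might look like the sketch below. The one-hot labels and predicted probabilities are made up for illustration, and zero labels are skipped so that a predicted probability of 0 for a wrong class does not trigger a logarithm of zero.

```python
import math

def multiclass_cross_entropy(labels, predictions):
    """Average over the points of -sum_j y_ij * ln(y_hat_ij)."""
    total = 0.0
    for y_row, p_row in zip(labels, predictions):
        total -= sum(y * math.log(p) for y, p in zip(y_row, p_row) if y)
    return total / len(labels)

# Illustrative one-hot labels and predicted probabilities for three classes.
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
predictions = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.1, 0.5, 0.4]]
print(multiclass_cross_entropy(labels, predictions))
```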

### Minimizing the error function

To minimize the error function, we start with some random weights, which give us the predictions $\sigma(Wx + b)$ and, with them, an error given by the formula `Error function(W, b)`. Remember that each point contributes a larger error if it is misclassified and a smaller one if it is correctly classified.

To reduce that error we will use gradient descent in order to "get to the bottom of our mountain", which gives us a smaller `Error function(W, b)`. This gives rise to new weights, W' and b', which give us a much better prediction, namely $\sigma(W'x + b')$.

### Gradient descent

Our error function is a function of the weights. Say we have inputs W1 and W2 and the error function is given by E. Then the gradient of E is the vector of the partial derivatives of E with respect to W1 and W2. This gradient tells us the direction to move in if we want to increase the error function the most.
Thus, if we take the negative of the gradient, it tells us how to decrease the error function the most. **Yes. That is what we want to do.**

We take the negative of the gradient of the error function at our current point and take a step in that direction, which brings us to a lower position. We repeat this until we "get to the bottom of our mountain".

How do we calculate the gradient?
We start with our initial prediction $\hat{y} = \sigma(Wx + b)$, and we assume this prediction is bad: the error is large since we're at our starting point. The prediction looks like this:
$$
\hat{y} =  \sigma(W_1X_1 + ... + W_nX_n + b)
$$
The error function is given by the formula `Error function(W, b)`, but what matters here is the gradient of the error function:
$$
\nabla E = \left( \frac{\partial E}{\partial W_1}, ..., \frac{\partial E}{\partial W_n}, \frac{\partial E}{\partial b} \right)
$$
Notice that it is precisely the vector formed by the partial derivatives of the error function with respect to the weights and the bias.

Now we take a step in the direction of the negative of the gradient, but we don't want to make any dramatic changes, so we introduce a small learning rate $\alpha$, for example $\alpha = 0.1$, and we multiply the gradient by the learning rate. Taking a step this way is exactly the same as updating the weights and the bias:
$$
W_i' \leftarrow W_i - \alpha \frac{\partial E}{\partial W_i}
$$

where the weight $W_i$ becomes $W_i'$, and the bias becomes b', given by:
$$
b' \leftarrow b - \alpha \frac{\partial E}{\partial b}
$$


This takes us to a prediction with a lower error. So we can conclude that the prediction we have now with weights W' and b', $\hat{y} = \sigma(W'x + b')$, is better than the one we had before with W and b. This is just the gradient descent step.

Let's get our hands dirty and actually compute the derivative of the error function. The sigmoid function has a really nice derivative:
$\sigma '(x) = \sigma(x)(1 - \sigma(x))$
Why? We calculate it using the quotient rule:
$$
\sigma '(x) = \frac{\partial}{\partial x} \frac{1}{1 + e^{-x}} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)(1 - \sigma(x))
$$
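A quick numerical sanity check of this identity, as a sketch: we compare $\sigma(x)(1 - \sigma(x))$ with a central difference approximation of the derivative at a few arbitrary points. The helper names are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Closed form derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x, eps=1e-6):
    """Numerical approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for x in (-2.0, 0.0, 1.5):
    print(x, sigmoid_prime(x), central_difference(sigmoid, x))
```

Both columns should agree to several decimal places, and at $x = 0$ the derivative is $0.5 \cdot 0.5 = 0.25$.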

Let's recall that if we have $m$ points labelled $x^{(1)}, x^{(2)}, ..., x^{(m)}$, the error formula is: $$E = - \frac{1}{m} \sum_{i=1}^{m} \left( y_i \ln(\hat{y_i}) + (1 - y_i) \ln(1 - \hat{y_i}) \right)$$ where the prediction is given by $\hat{y_i} = \sigma(Wx^{(i)} + b)$. Our goal is to calculate the gradient of $E$ at a point $x = (x_1, ..., x_n)$, given by the partial derivatives:
![Captura%20de%20Pantalla%202020-11-25%20a%20la%28s%29%2023.12.56.png](attachment:Captura%20de%20Pantalla%202020-11-25%20a%20la%28s%29%2023.12.56.png)

We can calculate the derivative of the error $E$ at a point $x$ with respect to the weight $w_j$:

![Captura%20de%20Pantalla%202020-11-25%20a%20la%28s%29%2023.14.56.png](attachment:Captura%20de%20Pantalla%202020-11-25%20a%20la%28s%29%2023.14.56.png)

This actually can tell us a very important thing: For a point with coordinates ($x_1, ..., x_n$), label y and prediction $\hat{y}$, the gradient of the error function at that point is:
$$
(-(y - \hat{y})x_1, ..., -(y - \hat{y})x_n, -(y - \hat{y}))
$$
So in summary, the gradient is: $\nabla E = -(y - \hat{y})(x_1, ..., x_n, 1)$. This is amazing because the gradient is actually a scalar times the coordinates of the point (with a 1 appended for the bias). And the scalar is a multiple of the difference between the label and the prediction.

![Captura%20de%20Pantalla%202020-11-25%20a%20la%28s%29%2023.21.07.png](attachment:Captura%20de%20Pantalla%202020-11-25%20a%20la%28s%29%2023.21.07.png)

A small gradient means we will change our weights by a little bit, and a large gradient means we will change them by a lot. This sounds like the perceptron algorithm; we will see it in a bit.
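To illustrate the point about small and large gradients, here is a hedged sketch that evaluates the closed form $(-(y - \hat{y})x_1, ..., -(y - \hat{y})x_n, -(y - \hat{y}))$ for two hypothetical blue points: one well inside the correct region and one badly misclassified. The weights, bias and points are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def point_gradient(x, y, weights, bias):
    """Gradient of the log-loss at one point: -(y - y_hat) * (x_1, ..., x_n, 1)."""
    y_hat = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    scale = -(y - y_hat)
    return [scale * xi for xi in x] + [scale]

weights, bias = [1.0, 1.0], -2.0   # assumed model

# Both points have label 1, but only the first is on the correct side.
well_classified = point_gradient((3.0, 3.0), 1, weights, bias)
misclassified = point_gradient((-3.0, -3.0), 1, weights, bias)

print(well_classified)
print(misclassified)
```

The misclassified point produces much larger gradient entries, so it moves the weights more than the well classified point does.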

### Gradient descent algorithm

We can build the pseudocode of the gradient descent algorithm:
- Start with random weights $W_1, ..., W_n, b$, which give us the whole probability function $\sigma(Wx + b)$. For every point we calculate the error (the error is large for misclassified points and small for correctly classified ones)
- For every point with coordinates ($x_1, ..., x_n$):
-- We update $W_i$ by subtracting the learning rate times the partial derivative: $W_i ' \leftarrow W_i - \alpha \frac{\partial E}{\partial W_i}$
-- We update $b$ in the same way: $b' \leftarrow b - \alpha \frac{\partial E}{\partial b}$

We already calculated those partial derivatives, and we know that they are $(\hat{y} - y)x_i$ and $(\hat{y} - y)$, so:
- Updating...
-- Update $W_i$: $W_i ' \leftarrow W_i - \alpha (\hat{y} - y)x_i$
-- Update $b$: $b' \leftarrow b - \alpha (\hat{y} - y)$

That is how we update our weights, then:
- We repeat this process until the error is small.

The number of times we repeat the process is called the number of epochs; we will talk about them later.
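The pseudocode above can be sketched in plain Python as follows. This is only an illustration under some assumptions: the data is the AND truth table used earlier in the notebook, the learning rate and number of epochs are arbitrary, and the function names are made up.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, weights, bias):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def mean_error(points, labels, weights, bias):
    """Average log-loss over all points."""
    total = 0.0
    for x, y in zip(points, labels):
        y_hat = predict(x, weights, bias)
        total += -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return total / len(points)

def gradient_descent(points, labels, alpha=0.1, epochs=2000, seed=0):
    rng = random.Random(seed)
    # Start with random weights and bias.
    weights = [rng.uniform(-1, 1) for _ in points[0]]
    bias = rng.uniform(-1, 1)
    history = [mean_error(points, labels, weights, bias)]
    for _ in range(epochs):
        for x, y in zip(points, labels):
            y_hat = predict(x, weights, bias)
            # W_i' <- W_i - alpha * (y_hat - y) * x_i
            weights = [w - alpha * (y_hat - y) * xi for w, xi in zip(weights, x)]
            # b' <- b - alpha * (y_hat - y)
            bias = bias - alpha * (y_hat - y)
        history.append(mean_error(points, labels, weights, bias))
    return weights, bias, history

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]   # AND
weights, bias, history = gradient_descent(points, labels)
predictions = [int(predict(x, weights, bias) >= 0.5) for x in points]
print(history[0], history[-1], predictions)
```

Here the weights are updated after every point; updating once per pass with the averaged gradient is another common choice.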

### Perceptron algorithm VS Gradient descent algorithm

Let's see something interesting:

In the gradient descent algorithm we change $w_i$ to $w_i + \alpha (y - \hat{y})x_i$.

In the perceptron algorithm not every point changes the weights, only the misclassified ones. If $x$ is misclassified, we change $w_i$ to:
- $w_i + \alpha x_i$ if the point is positive
- $w_i - \alpha x_i$ if the point is negative

Are both cases the same?

In the perceptron algorithm the labels are 1 and 0, and the predictions $\hat{y}$ are also 1 and 0:
- If the point is correctly classified: $y - \hat{y} = 0$
- If it is misclassified:
-- $y - \hat{y} = 1$ if the point is positive
-- $y - \hat{y} = -1$ if the point is negative

So, effectively, both cases are the same. The only difference is that in the gradient descent algorithm $\hat{y}$ can take any value between zero and one, whereas in the perceptron algorithm $\hat{y}$ can only be 1 or 0.

The gradient descent algorithm does something in addition to the perceptron algorithm: it also changes the weights for correctly classified points. If a point is correctly classified, it would like to be even deeper inside the correctly classified region, so that the prediction is even closer to the label and the error is even smaller. The misclassified points ask the boundary to come closer, and the correctly classified points ask the boundary to go farther away.
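To make the comparison concrete, the sketch below implements both update rules and checks that they coincide when $\hat{y}$ is restricted to 0 or 1. The function names, weights and sample points are hypothetical.

```python
def step(z):
    """Perceptron activation: predictions are exactly 0 or 1."""
    return 1 if z >= 0 else 0

def perceptron_update(weights, bias, x, y, alpha):
    """Perceptron rule: only misclassified points move the boundary."""
    y_hat = step(sum(w * xi for w, xi in zip(weights, x)) + bias)
    if y_hat == y:
        return list(weights), bias
    sign = 1 if y == 1 else -1
    return [w + sign * alpha * xi for w, xi in zip(weights, x)], bias + sign * alpha

def gradient_style_update(weights, bias, x, y, y_hat, alpha):
    """Gradient descent rule: w_i + alpha * (y - y_hat) * x_i."""
    new_weights = [w + alpha * (y - y_hat) * xi for w, xi in zip(weights, x)]
    return new_weights, bias + alpha * (y - y_hat)

weights, bias, alpha = [1.0, -1.0], 0.5, 0.1
samples = [((2.0, 1.0), 1), ((2.0, 1.0), 0), ((-3.0, 1.0), 1), ((-3.0, 1.0), 0)]

results = []
for x, y in samples:
    y_hat = step(sum(w * xi for w, xi in zip(weights, x)) + bias)
    results.append((perceptron_update(weights, bias, x, y, alpha),
                    gradient_style_update(weights, bias, x, y, y_hat, alpha)))

for perceptron_result, gradient_result in results:
    print(perceptron_result, gradient_result)
```

With a sigmoid instead of `step`, `y - y_hat` is never exactly zero, which is why gradient descent also nudges the weights for correctly classified points.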

### Non-Linear data

We have been dealing a lot with data sets that can be separated by a line, but the real world is much more complex than that. This is where neural networks can show their full potential. Let's see how we can deal with more complicated data sets that require highly non-linear boundaries.


Consider a data set of student acceptance at a university:
![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%207.37.42.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%207.37.42.png)

Here we cannot divide the accepted and rejected students with a simple line, so we look for other solutions. The one we consider most seriously is a curve that helps us separate almost all points correctly. How do we find that curve? We still use gradient descent.

We take the data, which is not separable with a line, and we create a probability function where the points in the blue area are more likely to be blue and the points in the red area are more likely to be red. The curve that separates them is the set of points that are equally likely to be blue or red. Everything will be the same as what we have been doing, except that the equation will not be linear, and that is where neural networks come into play.

### Neural network architecture

So, we are ready to build our neural network (yes, a multi layer perceptron).

We want to find an equation for a curve like this:
![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%207.47.08.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%207.47.08.png)

So we are going to do a simple trick, we will combine two linear models into a nonlinear model:

![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%207.48.50.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%207.48.50.png)

We are saying that this line plus this line equals that curve. How do we do that?

A linear model is a whole probability space: for every point it gives us the probability of the point being blue.

For example, our point on the top right side of both linear models is in the blue area. In the first model it has a probability of 0.7 of being blue, while in the second model its probability of being blue is 0.8. How do we combine these two?

The simplest way to combine two numbers is to add them, so $0.7 + 0.8 = 1.5$. That doesn't look like a probability, since it is bigger than one. Let's turn this number into something between 0 and 1. We already reviewed a tool that translates everything to a number between 0 and 1: the sigmoid function. So that is what we are going to do:
$$
0.7 + 0.8 = 1.5 \rightarrow \sigma(1.5) = \frac{1}{1 + e^{-1.5}} = 0.82
$$

That is the probability that the point is blue in the model resulting from the sum.

Well, we already have managed to create a probability function for every single point in the plane. That is the way to combine two models:
- We calculate the probability for one of them
- We calculate the probability for the other of them
- We add the probabilities 
- We apply the sigmoid function.

What if we want to weight this sum? What if we want the first model, on the left above, to have more of a say in the resulting probability than the second model? If we want the resulting model to look a lot more like the first model than the second one, we can add weights. For example, we can say "I want 7 times the first model plus the second one".
![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%208.03.45.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%208.03.45.png)

What that represents is that we take the probability of the same point we have been working on (the blue one on the top right of our model) in the first model and multiply it by its weight: $0.7 * 7$. Then we take the second probability and, let's say we want a weight of 5, we multiply it by 5. We can even add a bias if we want. Let's say our bias is -6, so we have the following equation:
$$
(7 * 0.7) + (5 * 0.8) - 6 = 2.9 \rightarrow \sigma(2.9) = \frac{1}{1 + e^{-2.9}} = 0.95
$$
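Both calculations (the plain sum and the weighted sum with a bias) can be reproduced with a few lines of Python. The helper name `combine` is made up for illustration; the numbers are the ones used in the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def combine(p1, p2, w1=1.0, w2=1.0, bias=0.0):
    """Weighted sum of two model outputs, squashed back into (0, 1)."""
    return sigmoid(w1 * p1 + w2 * p2 + bias)

plain_sum = combine(0.7, 0.8)                            # sigmoid(1.5)
weighted = combine(0.7, 0.8, w1=7.0, w2=5.0, bias=-6.0)  # sigmoid(2.9)

print(round(plain_sum, 2), round(weighted, 2))  # prints 0.82 0.95
```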

At the beginning we had a line that was a linear combination of the input values times the weights plus a bias; now this model is a linear combination of the two previous models times the weights plus a bias. We can think of it as the line between the two models. **This is how neural networks get built.**

We can keep doing this, always obtaining new, more complex models out of linear combinations of the existing ones, and that is what we do when we build neural networks.

For now we have been doing a lot like perceptrons where we can take a value times a constant plus another value times a constant plus a bias and get a new value. 

Let's take the following example:
![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%208.17.51.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%208.17.51.png)

We have the top model represented by the equation $5x_1 - 2x_2 + 8$ and the bottom model represented by the equation $7x_1 - 3x_2 - 1$, both drawn as the perceptrons on the left. We use another perceptron to combine our two models using the linear equation. Here is where the magic happens: we join these together and we get our neural network.
![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%208.21.45.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%208.21.45.png)

We can clean it up a little bit and we have this:
![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%208.23.07.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%208.23.07.png)

All the weights are there: the weights on the left tell us what equations the linear models have, and the weights on the right tell us what linear combination of the two models gives the curved, non-linear model.

Whenever you see a neural network like the one just above, think of what the nonlinear boundary defined by the neural network could be. Note that it is drawn using the notation that puts the bias inside the node, but we can also draw it using the notation that keeps the bias as a separate node:

![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%208.26.57.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%208.26.57.png)

So we have a bias unit coming from a node with a one on it. We can see that this neural network uses the sigmoid activation function and the perceptrons.

### Feedforward

Feedforward is the process neural networks use to turn the input into an output.

Training a neural network really means finding what parameters it should have on the edges in order to model our data well. So, to learn how to train them, we need to look carefully at how they process the input to obtain an output.

Let's come back to our simplest neural network, a perceptron, which receives a data point of the form $x = (x_1, x_2)$ where the label is $y = 1$, which means that the point is blue. The perceptron is defined by a linear equation, $w_1 x_1 + w_2 x_2 + b$, where $w_1$ and $w_2$ are the weights on the edges and $b$ is the bias in the node.
In this case $w_1$ is bigger than $w_2$. What the perceptron does is plot the point $(x_1, x_2)$ and output the probability that the point is blue.

![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2013.02.25.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2013.02.25.png)


Here the point is in the red area, so the output is a small number, since the point is not very likely to be blue. This process is known as feedforward.

We can see that this is a bad model because the point is actually blue, given that the third coordinate, the label $y$, is 1.

If we have a more complicated neural network, then the process is the same. 

![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2013.10.15.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2013.10.15.png)

We have thick edges corresponding to large weights and thin edges corresponding to small weights. The neural network plots the point in the top model and also in the bottom model. The output of the top model is a small number, since the point lies in the red area, which means it has a small probability of being blue. The output of the second model is a large number, since the point lies in the blue area, which means it has a larger probability of being blue. We then combine these two models into the nonlinear model, and the output layer just plots the point and tells us the probability that the point is blue. As you can see, this is a bad model because it puts the point in the red area and the point is blue.

Again, this feedforward process has to be reviewed more carefully. For that, we will use the perceptron notation with the bias as an input, and two matrices.

![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2013.26.38.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2013.26.38.png)

The first matrix denotes the first layer, and its entries are the weights $W_{11}$ to $W_{32}$; the biases are now written as $W_{31}$ and $W_{32}$.
The second matrix denotes the second layer, which contains the weights that tell us how to combine the linear models in the first layer to obtain the nonlinear model in the second layer.

What happens is some math: the input, in the form $(x_1, x_2, 1)$, is multiplied by the first matrix $W^{1}$, with entries $W_{11}^{1}, ..., W_{32}^{1}$, to get the outputs of the first layer. Then we apply the sigmoid function to turn those outputs into values between 0 and 1.

Then the vector formed by these values gets a one attached for the bias unit, and we multiply it by the second matrix $(W_{11}^{2}, W_{21}^{2}, W_{31}^{2})$. This returns an output that now gets thrown into a sigmoid function to obtain the final output, which is $\hat{y}$.

$\hat{y}$ is the prediction, or the probability that the point is labeled blue. Neural networks take the input vector and then apply a sequence of linear models and sigmoid functions; these maps, when combined, become a highly non-linear map. And the final formula is just $\hat{y} = \sigma \circ W^2 \circ \sigma \circ W^1(x)$
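A minimal sketch of this matrix form of feedforward, in plain Python. The first-layer weights come from the two example models $5x_1 - 2x_2 + 8$ and $7x_1 - 3x_2 - 1$; the second-layer weights (7, 5, -6) are an assumption borrowed from the weighted-sum example, and the input point is arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weight_matrix):
    """Multiply a row vector by a matrix (one row per input, one column per node)."""
    n_outputs = len(weight_matrix[0])
    return [
        sum(inputs[i] * weight_matrix[i][j] for i in range(len(inputs)))
        for j in range(n_outputs)
    ]

def feedforward(x, W1, W2):
    """y_hat = sigmoid(W2 applied to sigmoid(W1 applied to (x_1, x_2, 1)))."""
    hidden = [sigmoid(h) for h in layer(list(x) + [1.0], W1)]
    # Attach a one for the bias unit before the second layer.
    output = layer(hidden + [1.0], W2)
    return sigmoid(output[0])

# Columns are the two linear models; the last row holds the biases.
W1 = [[5.0, 7.0],
      [-2.0, -3.0],
      [8.0, -1.0]]
W2 = [[7.0], [5.0], [-6.0]]   # assumed second-layer weights and bias

y_hat = feedforward((0.0, 0.0), W1, W2)
print(y_hat)
```

With NumPy the same computation would be two matrix products, but lists keep the sketch free of dependencies.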

To reiterate, we do the same with multi-layer perceptrons to obtain a prediction ($\hat{y}$).

![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2013.29.06.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2013.29.06.png)

And that is the feedforward process.

### Error function

Neural networks also produce an error function, which, in the end, is what we will be minimizing.
Let's remember what the error function was for perceptrons:

![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2013.35.03.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2013.35.03.png)

This gives us a prediction and also an error function that measures how badly each point is being classified. Roughly, this is a very small number if the point is correctly classified, and a measure of how far the point is from the line if the point is incorrectly classified.

As we saw, our prediction is simply a combination of matrix multiplications and sigmoid functions. The error function can be exactly the same, except that now $\hat{y}$ is just a bit more complicated:

![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2013.38.28.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2013.38.28.png)

And this function will still tell us how badly a point gets misclassified, except that now it is looking at a more complicated boundary.
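As a sketch, assuming the error in the image is the same log-loss $E = -\frac{1}{m}\sum_i \left[ y_i \ln(\hat{y_i}) + (1 - y_i)\ln(1 - \hat{y_i}) \right]$ used for the perceptron, it only needs the labels and the predictions, no matter how the predictions were produced. The labels and predictions below are made up.

```python
import math

def cross_entropy(labels, predictions):
    """Mean log-loss; labels are 0 or 1, predictions are strictly between 0 and 1."""
    total = 0.0
    for y, y_hat in zip(labels, predictions):
        total += y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return -total / len(labels)

labels = [1, 0, 1]
good_predictions = [0.9, 0.1, 0.8]   # mostly agree with the labels
bad_predictions = [0.2, 0.7, 0.4]    # mostly disagree with the labels

good_error = cross_entropy(labels, good_predictions)
bad_error = cross_entropy(labels, bad_predictions)
print(good_error, bad_error)
```

The model whose predictions agree with the labels gets the smaller error, which is the quantity training tries to minimize.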

### Backpropagation

We are ready to get our hands into training a neural network. For this, we will use the method known as **backpropagation**, which consists of:
- Doing a feedforward operation
- Comparing the output of the model with the desired output
- Calculating the error
- Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
- Use this to update the weights, and get a better model.
- Continue this until we have a model that is good

It seems more complicated than what it actually is.
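The steps above can be sketched for a tiny network with two inputs, two hidden nodes and one output. This is an illustration under several assumptions, not a definitive implementation: it trains on a single hypothetical blue point, uses the log-loss with sigmoid activations, and picks the learning rate and number of steps arbitrarily.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def feedforward(x, W1, W2):
    """Return the inputs (with bias unit), hidden activations and prediction."""
    inputs = [x[0], x[1], 1.0]
    hidden = [sigmoid(sum(inputs[i] * W1[i][j] for i in range(3))) for j in range(2)]
    y_hat = sigmoid(sum(a * w for a, w in zip(hidden + [1.0], W2)))
    return inputs, hidden, y_hat

def log_loss(y, y_hat):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def train_step(x, y, W1, W2, alpha):
    """One feedforward pass, one backward pass, one weight update (in place)."""
    inputs, hidden, y_hat = feedforward(x, W1, W2)
    delta_out = y_hat - y   # error signal at the output
    # Spread the error back to the hidden nodes, using the weights before the update.
    delta_hidden = [delta_out * W2[j] * hidden[j] * (1 - hidden[j]) for j in range(2)]
    for j, activation in enumerate(hidden + [1.0]):
        W2[j] -= alpha * delta_out * activation
    for i in range(3):
        for j in range(2):
            W1[i][j] -= alpha * delta_hidden[j] * inputs[i]
    return log_loss(y, y_hat)

rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
W2 = [rng.uniform(-1, 1) for _ in range(3)]
x, y = (1.0, 2.0), 1   # hypothetical blue point

losses = [train_step(x, y, W1, W2, alpha=0.5) for _ in range(100)]
print(losses[0], losses[-1])
```

Each call returns the error measured before that update, so the list shows the error shrinking as the steps are repeated.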

Let's start from the point where we were working with our perceptron in feedforward. We had a bad model misclassifying a blue point in the red area. We asked the point, "What do you want the model to do for you?", and the point said, "Well, I am misclassified, so I want this boundary to come closer to me". And we saw that the line got closer to it by updating the weights. Namely, in that case, let's say that it tells the weight $w_1$ to go lower and the weight $w_2$ to go higher, and we get a $w_1'$ and $w_2'$ which define a new line that is now closer to the point.

What we are doing is like descending from a mountain, yes, as in gradient descent. The height is the error function $E(W)$, and we calculate the gradient of the error function, which is exactly like asking the point what it wants the model to do. As we take a step in the direction of the negative of the gradient, we decrease the error and come down the mountain. This gives us a new error $E(W')$ and a new model $W'$ with a smaller error, which means we get a new line closer to the point. We continue doing this process in order to minimize the error. That was for a single perceptron.

#### What about multi-layer perceptrons?
We still do the same process of reducing the error by descending from the mountain, except that now the error function is more complicated. But it is the same thing: we calculate the error function and its gradient.
$$
E(W) = - \frac{1}{m} \sum_{i=1}^{m} \left[ y_i \ln(\hat{y_i}) + (1 - y_i) \ln(1 - \hat{y_i}) \right]
$$
We then walk in the direction of the negative of the gradient in order to find a new model W' with a smaller error E(W') which will give us a better prediction. And we continue the process in order to minimize the error. 

What feedforward does in a multi-layer perceptron is plot the point in our initial models, combine the results, and plot the point in the resulting non-linear model in the output layer. The probability that the point is blue is given by the position of this point in the final model.

Pay attention to this, because it is the key to training neural networks: it is backpropagation.

We'll do as before and check the error. The model we have worked on is not good, because it predicts that the point will be red when in reality the point is blue. So we ask the point, "What do you want this model to do in order for you to be better classified?", and it tells us, "I kind of want this blue region to come closer to me". What does it mean for the region to come closer to it?

See the two linear models in the hidden layer:
![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2022.18.51.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2022.18.51.png)

It seems like the top one is misclassifying the point, whereas the bottom one is classifying it correctly. We kind of want to listen to the bottom one more and to the top one less. What we want to do is reduce the weight coming from the top model and increase the weight coming from the bottom model. So now our final model will look a lot more like the bottom model than like the top model. But we can do even more: we can actually go to the linear models and ask the point, "What can these models do to classify you better?", and the point will say, "Well, the top model is misclassifying me, so I kind of want this line to move closer to me, and the second model is correctly classifying me, so I want this line to move farther away from me".

That change in the model will actually update the weights, increasing the bottom ones and decreasing the top ones.

So after we update all the weights, we have better predictions from all the models in the hidden layer and also a better prediction from the model in the output layer. Remember, when we update the weights, we are also updating the bias unit.

### Backpropagation math

![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2022.35.26.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2022.35.26.png)

On the left we have a single perceptron with the input vector, the weights and the bias, and the sigmoid function inside the node. On the right, we have a formula for the prediction, which is the sigmoid function of the linear function of the input. Below we have the formula for the error, which is the average over all points of the blue term for the blue points and the red term for the red points. And finally we have the gradient of the error function, which is simply the vector formed by all the partial derivatives of the error function with respect to the weights $W_1$ up to $W_n$ and the bias $b$.

What do we do with multi-layer perceptrons? Well, this time it's a little more complicated, but it's pretty much the same thing:
![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2022.37.38.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2022.37.38.png)

If we want to write the above more formally, we recall that the prediction is a composition of sigmoids and matrix multiplications, where these are the matrices, and the gradient is just going to be formed by all these partial derivatives. In the image $\nabla E$ looks like a matrix, but it is just a long vector. Gradient descent takes each weight $W_{ij}^{(k)}$ and updates it by subtracting a small number: the learning rate times the partial derivative of $E$ with respect to that same weight. That is the gradient descent step, and it gives us the new updated weight $W_{ij}'^{(k)}$. That step gives us a whole new model with new weights that will classify the point better.

![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2022.44.33.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2022.44.33.png)

### Chain rule

Before we start calculating derivatives, let's refresh the chain rule, which is the main technique we will use to calculate them.

The chain rule says: if you have a variable `x` and a function `f` that you apply to `x` to get `f` of `x`, which we're gonna call A, and then another function `g`, which you apply to `f` of `x` to get `g` of `f` of `x`, which we're gonna call B, then the partial derivative of B with respect to `x` is just the partial derivative of B with respect to A times the partial derivative of A with respect to `x`.

![Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2022.55.56.png](attachment:Captura%20de%20Pantalla%202020-11-26%20a%20la%28s%29%2022.55.56.png)

It literally says that, when composing functions, the derivatives just multiply. That is gonna be super useful for us because feedforwarding is literally composing a bunch of functions, and backpropagation is literally taking the derivative at each piece. Since taking the derivative of a composition is the same as multiplying the partial derivatives, all we're gonna do is multiply a bunch of partial derivatives to get what we want.
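A small numerical illustration of the chain rule, as a sketch: we pick two arbitrary functions, `f(x) = x**2` and `g(a) = sin(a)` (chosen only for this example), multiply the two partial derivatives, and compare the product with a central difference approximation of the derivative of the composition.

```python
import math

def f(x):
    return x ** 2          # A = f(x)

def g(a):
    return math.sin(a)     # B = g(A)

def dB_dx(x):
    """Chain rule: dB/dx = dB/dA * dA/dx."""
    dA_dx = 2 * x
    dB_dA = math.cos(f(x))
    return dB_dA * dA_dx

x = 1.3
eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)
print(dB_dx(x), numeric)
```

The two numbers should agree to several decimal places.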

Coming back to our neural network, the weights with superscript 1 (example: $W_{12}^{1}$) belong to the first layer and the weights with superscript 2 (example: $W_{11}^{2}$) belong to the second layer. Also, the bias is not called $b$ anymore; now it takes the names $W_{31}^{1}, W_{32}^{1}$, and so on, so we can have everything in matrix notation.

![Captura%20de%20Pantalla%202020-11-27%20a%20la%28s%29%207.41.54.png](attachment:Captura%20de%20Pantalla%202020-11-27%20a%20la%28s%29%207.41.54.png)

Now, what happens with the input? We do the feedforward process.

In the first layer, we take the inputs and multiply it by the weights and that gives us $h_1$, which is a linear function of the input and the weights:
$$
h_1 = W_{11}^{1}x_1 + W_{21}^{1}x_2 + W_{31}^{1}
$$

We do the same thing with $h_2$, given by the following formula:
$$
h_2 = W_{12}^{1}x_1 + W_{22}^{1}x_2 + W_{32}^{1}
$$

In the second layer, we would take $h_1, h_2$, and the new bias, and apply the sigmoid function and then apply a linear function to them by multiplying them by the weights and adding them to get a value of $h$:
$$
h = W_{11}^{2}\sigma(h_1) + W_{21}^{2}\sigma(h_2) + W_{31}^{2}
$$

And finally in the third layer, we just take a sigmoid function of $h$ to get our prediction or probability between 0 and 1, which is $\hat{y}$.
$$
\hat{y} = \sigma(h)
$$

We can read this in more condensed notation by saying that a matrix corresponding to the first layer is $W^1$, the matrix corresponding to the second layer is $W^2$, and then the prediction we had is just going to be the sigmoid of $W^2$ combined with the sigmoid of $W^1$ applied to the input $x$ and that is **feedforward**:
$$
\hat{y} = \sigma \circ W^{(2)} \circ \sigma \circ W^{(1)}(x)
$$

![Captura%20de%20Pantalla%202020-11-27%20a%20la%28s%29%207.56.19.png](attachment:Captura%20de%20Pantalla%202020-11-27%20a%20la%28s%29%207.56.19.png)

We are going to develop backpropagation, which is precisely the reverse of feedforward.

We are going to calculate the derivative of the error function with respect to each of the weights in the layers by using the chain rule. So we recall our error function:
$$
E(W) = - \frac{1}{m} \sum_{i=1}^{m} \left[ y_i \ln(\hat{y_i}) + (1 - y_i) \ln(1 - \hat{y_i}) \right]
$$

This is a function of the prediction $\hat{y}$, but since the prediction is a function of all the weights $W_{ij}$, then the error function can be seen as the function on all the $W_{ij}$:
$$
E(W) = E(W_{11}^{(1)}, W_{12}^{(1)}, ... W_{31}^{(2)})
$$

Therefore, the gradient is simply the vector formed by all the partial derivatives of the error function E with respect to each of the weights:
$$
\nabla E = (\frac{\mathrm{\partial E} }{\mathrm{\partial W_{11}^{(1)}} }, ..., \frac{\mathrm{\partial E} }{\mathrm{\partial W_{31}^{(2)}} })
$$

Let's calculate one of these derivatives: the derivative of $E$ with respect to $W_{11}^{(1)}$. Since the prediction is simply a composition of functions, by the chain rule we know that this derivative is the product of the partial derivatives along the path:
$$
\frac{\partial E}{\partial W_{11}^{(1)}} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_{11}^{(1)}}
$$

This may seem complicated, but the fact that we can calculate a derivative of such a complicated composition function by just multiplying 4 partial derivatives is remarkable.

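We already calculated the first part. If you remember, for the perceptron the derivative of $E$ with respect to the linear output came out as $\hat{y} - y$, which here is the product of the first two factors, $\frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h}$. So let's calculate the other ones. For that, let's zoom in a bit and look at just one piece of our multi-layer perceptron.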

![Captura%20de%20Pantalla%202020-11-27%20a%20la%28s%29%208.18.32.png](attachment:Captura%20de%20Pantalla%202020-11-27%20a%20la%28s%29%208.18.32.png)

The inputs are some values $h_1$ and $h_2$, which are values coming in from before. Once we apply the sigmoid function and a linear function to $h_1$, $h_2$ and the $1$ corresponding to the bias unit, we get a result $h$. So, what is the derivative of $h$ with respect to $h_1$?

Well, $h$ is a sum of three things and only one of them contains $h_1$. So the second and third summands just give us a derivative of 0. The first summand gives us $W_{11}^{(2)}$, because that is a constant, times the derivative of the sigmoid function with respect to $h_1$:
$$
\frac{\mathrm{\partial h} }{\mathrm{\partial h_1}} = W_{11}^{(2)}\sigma (h_1)[1 - \sigma (h_1)]
$$

We calculated this by remembering that the sigmoid function has a beautiful derivative: namely, the derivative of sigmoid of $h$ is precisely sigmoid of $h$ times one minus sigmoid of $h$:
$$
\sigma '(x) = \frac{\partial}{\partial x} \frac{1}{1 + e^{-x}} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)(1 - \sigma(x))
$$

And that is it, that is how you train a neural network. 
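As a closing sketch, we can multiply the factors of the chain rule for $\frac{\partial E}{\partial W_{11}^{(1)}}$ and compare the result against a numerical derivative. The snippet assumes a single training point, the log-loss error, and arbitrary made-up weights; it also uses the fact that for this error with a sigmoid output the first two factors together simplify to $\hat{y} - y$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    """Return h_1, h_2 and the prediction y_hat for one input point."""
    h1 = W1[0][0] * x[0] + W1[1][0] * x[1] + W1[2][0]
    h2 = W1[0][1] * x[0] + W1[1][1] * x[1] + W1[2][1]
    h = W2[0] * sigmoid(h1) + W2[1] * sigmoid(h2) + W2[2]
    return h1, h2, sigmoid(h)

def error(x, y, W1, W2):
    y_hat = forward(x, W1, W2)[2]
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def dE_dW11(x, y, W1, W2):
    """Product of the chain rule factors for the weight W_11 of the first layer."""
    h1, _, y_hat = forward(x, W1, W2)
    dE_dh = y_hat - y                                    # dE/dy_hat * dy_hat/dh
    dh_dh1 = W2[0] * sigmoid(h1) * (1 - sigmoid(h1))     # dh/dh_1
    dh1_dW11 = x[0]                                      # dh_1/dW_11
    return dE_dh * dh_dh1 * dh1_dW11

x, y = (0.5, -1.0), 1                         # hypothetical point and label
W1 = [[0.3, -0.2], [0.1, 0.4], [0.0, 0.1]]    # assumed first-layer weights
W2 = [0.7, -0.5, 0.2]                         # assumed second-layer weights

analytic = dE_dW11(x, y, W1, W2)

# Central difference on W_11 of the first layer as an independent check.
eps = 1e-6
W1_up = [row[:] for row in W1]
W1_up[0][0] += eps
W1_down = [row[:] for row in W1]
W1_down[0][0] -= eps
numeric = (error(x, y, W1_up, W2) - error(x, y, W1_down, W2)) / (2 * eps)

print(analytic, numeric)
```

The same pattern, with a different last factor or a different path through the network, gives the derivative with respect to any other weight.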