<a href="https://colab.research.google.com/github/samuelkb/gColab/blob/main/notebooks/Introduction%20to%20neural%20networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to neural networks

What is deep learning?
What is it used for? Pretty much everything: beating humans at games such as Go and Jeopardy, detecting spam in emails, forecasting stock prices, recognizing objects in pictures, diagnosing illnesses (sometimes with more precision than doctors), and, in one of its most celebrated applications, driving self-driving cars.

What is at the heart of deep learning? Neural networks.

Neural networks loosely mimic how the brain operates. Given some data in the form of blue and red points, a neural network looks for the best line that separates (classifies) them.

### Perceptrons

Perceptrons are the building blocks of neural networks, and are just an encoding of our equations into a small graph.
A great application of perceptrons is implementing logical operators. The most common of these are **AND**, **OR**, **NOT**, and **XOR**.

![Captura%20de%20Pantalla%202020-10-25%20a%20la%28s%29%2020.25.11.png](attachment:Captura%20de%20Pantalla%202020-10-25%20a%20la%28s%29%2020.25.11.png)

### What are the weights and bias for the AND perceptron?

Let's play and set the weights and bias to values that correctly compute the AND, OR, and NOT operations shown above.
(Keep in mind that more than one set of values will work.)

In [3]:
import pandas as pd

# Set weight1, weight2 and bias (you can replace it and continue playing)
weight1 = 1.0
weight2 = 1.0
bias = -2.0

# Don't change the next block of code
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, False, False, True]
outputs = []

# Validate outputs
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])
    
# Print results
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
if not num_wrong:
    print('Nice!  You got it all correct.\n')
else:
    print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))

Nice!  You got it all correct.

 Input 1    Input 2    Linear Combination    Activation Output   Is Correct
       0          0                  -2.0                    0          Yes
       0          1                  -1.0                    0          Yes
       1          0                  -1.0                    0          Yes
       1          1                   0.0                    1          Yes


### What are the weights and bias for the OR perceptron?

![Captura%20de%20Pantalla%202020-10-25%20a%20la%28s%29%2022.34.18.png](attachment:Captura%20de%20Pantalla%202020-10-25%20a%20la%28s%29%2022.34.18.png)

In [4]:
import pandas as pd

# Set weight1, weight2 and bias (you can replace it and continue playing)
weight1 = 2.0
weight2 = 2.0
bias = -2.0

# Don't change the next block of code
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, True, True, True]
outputs = []

# Validate outputs
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])
    
# Print results
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
if not num_wrong:
    print('Nice!  You got it all correct.\n')
else:
    print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))

Nice!  You got it all correct.

 Input 1    Input 2    Linear Combination    Activation Output   Is Correct
       0          0                  -2.0                    0          Yes
       0          1                   0.0                    1          Yes
       1          0                   0.0                    1          Yes
       1          1                   2.0                    1          Yes


Notice that to go from the AND to the OR operation we can either increase the weights or decrease the magnitude of the bias.
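As a quick check of that remark, here is a minimal sketch that starts from the AND values used above and changes only the weights, or only the bias. The helper names `perceptron` and `truth_table`, and the bias of -1.0 in the last case, are illustrative choices and not part of the original cells.

```python
def perceptron(x1, x2, weight1, weight2, bias):
    """Step-activation perceptron: 1 if the linear combination is >= 0, else 0."""
    return int(weight1 * x1 + weight2 * x2 + bias >= 0)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

def truth_table(weight1, weight2, bias):
    """Outputs of the perceptron for the four possible inputs, in order."""
    return [perceptron(x1, x2, weight1, weight2, bias) for x1, x2 in inputs]

and_table = truth_table(1.0, 1.0, -2.0)      # AND values from the first cell
or_by_weights = truth_table(2.0, 2.0, -2.0)  # larger weights, same bias
or_by_bias = truth_table(1.0, 1.0, -1.0)     # same weights, smaller |bias| (assumed value)

print("AND:", and_table)
print("OR (weights increased):", or_by_weights)
print("OR (bias magnitude decreased):", or_by_bias)
```

Both changes move the boundary line toward the origin, so the points (0, 1) and (1, 0) end up on the positive side.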

### What are the weights and bias for the NOT perceptron?

This operator only cares about one input (in the cell below, the second one): it returns 1 when that input is 0, and 0 when it is 1. The other inputs to the perceptron are ignored.

In [6]:
import pandas as pd

# Set weight1, weight2 and bias (you can replace it and continue playing)
weight1 = -1.0
weight2 = -3.0
bias = 2.0

# Don't change the next block of code
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [True, False, True, False]
outputs = []

# Validate outputs
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])
    
# Print results
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
if not num_wrong:
    print('Nice!  You got it all correct.\n')
else:
    print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))

Nice!  You got it all correct.

 Input 1    Input 2    Linear Combination    Activation Output   Is Correct
       0          0                   2.0                    1          Yes
       0          1                  -1.0                    0          Yes
       1          0                   1.0                    1          Yes
       1          1                  -2.0                    0          Yes


### What are the weights and bias for the XOR perceptron?

![Captura%20de%20Pantalla%202020-10-25%20a%20la%28s%29%2022.48.10.png](attachment:Captura%20de%20Pantalla%202020-10-25%20a%20la%28s%29%2022.48.10.png)

We can get the XOR operation by building a multi-layer perceptron out of AND, NOT, and OR perceptrons. For that we can use the following diagram:

![xor-quiz.png](attachment:xor-quiz.png)

Where:
- "A" is the AND perceptron
- "B" is the OR perceptron
- "C" is the NOT  perceptron

![xor-quiz2.png](attachment:xor-quiz2.png)
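The diagrams are not reproduced as text, so the following is only a sketch under an assumed wiring: the inputs feed an AND perceptron followed by a NOT, and in parallel an OR perceptron, and a final AND combines both results. The AND and OR weights are the ones used in the cells above; the single-input NOT weights (-1.0 with bias 0.5) and all function names are hypothetical.

```python
def step_perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through the step function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return int(total >= 0)

def and_gate(a, b):
    return step_perceptron((a, b), (1.0, 1.0), -2.0)

def or_gate(a, b):
    return step_perceptron((a, b), (2.0, 2.0), -2.0)

def not_gate(a):
    # Hypothetical single-input NOT: positive for 0, negative for 1.
    return step_perceptron((a,), (-1.0,), 0.5)

def xor_gate(a, b):
    # First layer: NOT(AND(a, b)) and OR(a, b); second layer: AND of both.
    return and_gate(not_gate(and_gate(a, b)), or_gate(a, b))

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_gate(a, b))
```

No single line separates the XOR outputs, which is why a second layer is needed here.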

So, we reviewed perceptrons, using logic and mathematical knowledge to build the most common logical operators.
In real life, though, we can't build these perceptrons by hand. The idea is that we give them the expected results, and they build themselves.

### How do we find the line that separates one group of data from other?

In the graphs above, we saw that we received a 1 when the dot was in the blue area, and a 0 when the dot was in the red area. Perceptrons let us see the classification graphically, in this case with a linear equation, but data is not always separable by a linear model.

The computer doesn't know where to start when classifying a data set, so it starts at a random place by picking a random linear equation. That line defines two areas, and some data will fall in the "red" part and the rest in the "blue" part. It is very likely that some data will be misclassified, so we measure how badly this line classifies the data and then move it around to try to get better results.

To know how bad the initial classification is, we go through all the points in the data set. The correctly classified points need no change; from the incorrectly classified ones we want to extract as much information as we can to improve our initial classification line.

What can a misclassified point tell us? Does that point want the classification line to be closer or farther away?
A good starting point: a misclassified point wants the line to come closer to it.

We can see an example to better understand this problem approach:

We have a data set divided into two parts, a positive and a negative area, by the line with equation:
$$
3x_1 + 4x_2 - 10 = 0
$$

All positive dots will be defined by:
$$
3x_1 + 4x_2 > 10
$$

All negative dots will be defined by:
$$
3x_1 + 4x_2 < 10
$$

![Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.20.41.png](attachment:Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.20.41.png)

Now suppose we have a red (negative) point (4, 5) that is misclassified: it lies in the positive (blue) area, since $3 \cdot 4 + 4 \cdot 5 - 10 = 22 > 0$. The point says to the line: "Come closer!"
How do we get the line to come closer to that point? A good idea is to take the coordinates (4, 5) and use them to modify the equation of the line. We shouldn't forget the bias, so we take (4, 5, 1) and subtract these numbers from the parameters of the line (3, 4, -10) to get:

![Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.28.12.png](attachment:Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.28.12.png)

The new line has parameters -1, -1, -11. That is a drastic change, and we don't want a drastic change, because we could accidentally misclassify all our other points. We want to move the line toward that point in small steps.

### Learning rate

To move our line in small steps, we introduce the learning rate, which is a small number. Instead of subtracting the values of the misclassified point directly from the parameters of the line, we first multiply them by the learning rate. Let's say that in our example the learning rate is 0.1:

![Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.33.45.png](attachment:Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.33.45.png)

That will give us the next equation:
$$
2.6x_1 + 3.5x_2 - 10.1 = 0
$$
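The arithmetic behind both updates can be sketched in a few lines. The function name `move_line` and its `sign` argument are illustrative, not part of the original notebook; a learning rate of 1.0 reproduces the drastic update and 0.1 the gentle one.

```python
def move_line(params, point, learn_rate, sign):
    """Apply the perceptron trick once.

    params is (w1, w2, b), point is (x1, x2), sign is -1 to subtract and +1 to add.
    """
    w1, w2, b = params
    x1, x2 = point
    return (w1 + sign * learn_rate * x1,
            w2 + sign * learn_rate * x2,
            b + sign * learn_rate)  # the bias uses 1 as its "coordinate"

line = (3.0, 4.0, -10.0)  # 3x_1 + 4x_2 - 10 = 0
drastic = move_line(line, (4, 5), learn_rate=1.0, sign=-1)
gentle = move_line(line, (4, 5), learn_rate=0.1, sign=-1)

print("without learning rate:", drastic)
print("with learning rate 0.1:", ", ".join(f"{value:.1f}" for value in gentle))
```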

This makes our line come closer to the misclassified point in small steps. If instead we have a blue point incorrectly classified in the red area, we follow the same approach, but instead of subtracting from the original equation, we add:

![Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.37.24.png](attachment:Captura%20de%20Pantalla%202020-10-26%20a%20la%28s%29%207.37.24.png)

**So we can use this trick repeatedly for the Perceptron Algorithm.**

For this second example, consider again the line described by:
$$
3x_1 + 4x_2 - 10 = 0
$$

with a learning rate of 0.1. How many times would we have to apply the perceptron trick to move the line to a position where the blue point (1, 1) is correctly classified?
The answer is 10 times: the equation evaluates to $3 + 4 - 10 = -3$ at (1, 1), and each application raises that value by $0.1 \cdot (1 + 1 + 1) = 0.3$. Let's do it.

In [None]:
# Pending to learn how to graph and show our iterations
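Until the graph is in place, the iterations can at least be printed. This is a minimal sketch, assuming that a score of exactly 0 counts as positive (as in the step function used in the cells above); `count_perceptron_tricks` is a hypothetical helper, and `Fraction` is used only so that the boundary case is not affected by floating-point rounding.

```python
from fractions import Fraction

def count_perceptron_tricks(params, point, learn_rate):
    """Apply the additive perceptron trick until `point` scores >= 0."""
    w1, w2, b = params
    x1, x2 = point
    steps = 0
    while w1 * x1 + w2 * x2 + b < 0:
        w1 += learn_rate * x1
        w2 += learn_rate * x2
        b += learn_rate
        steps += 1
        print(f"step {steps:2d}: {float(w1):.1f}x_1 + {float(w2):.1f}x_2 {float(b):+.1f} = 0")
    return steps, (w1, w2, b)

steps, final_line = count_perceptron_tricks(
    (Fraction(3), Fraction(4), Fraction(-10)),  # 3x_1 + 4x_2 - 10 = 0
    (1, 1),                                     # the blue point
    Fraction(1, 10),                            # learning rate 0.1
)
print("number of steps:", steps)
```

The loop stops at the line $4x_1 + 5x_2 - 9 = 0$, which passes exactly through (1, 1).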

### Perceptron Algorithm

Remember that the computer starts with a random line with random weights ($W_1, ..., W_n, b$), and each point of our data set tells us whether it is correctly or incorrectly classified. A misclassified point ($X_1, ..., X_n$) says to the line "Come closer".

If the point has a positive label but our prediction is 0, then for each $i$ from 1 to $n$ we add: $W_i = W_i + \alpha X_i$, where $\alpha$ is our learning rate. We also change the bias $b$ to $b + \alpha$, because that moves our line closer to the misclassified point.

If the point has a negative label but our prediction is 1, then for each $i$ from 1 to $n$ we subtract: $W_i = W_i - \alpha X_i$, where $\alpha$ is our learning rate. We also change the bias $b$ to $b - \alpha$, because that moves our line closer to the misclassified point.


Then, we just have to repeat this step until we get no errors, or repeat it a specific number of times.

### Coding the perceptron algorithm

Let's play with a data set to separate the data given in a data.csv file.

Remember, our algorithm works as follows:
For a point with coordinates $(p, q)$, label $y$, and prediction given by the equation $\hat{y} = step(w_1 p + w_2 q + b)$:
- If the point is correctly classified, do nothing.
- If the point is classified positive, but it has a negative label, subtract $\alpha p, \alpha q$, and $\alpha$ from $w_1, w_2$, and $b$ respectively.
- If the point is classified negative, but it has a positive label, add $\alpha p, \alpha q$, and $\alpha$ to $w_1, w_2$, and $b$ respectively.

You can play with the parameters to see what happens and how your initial conditions can affect the solution.

In [20]:
import numpy as np
# importing the required module 
import matplotlib.pyplot as plt 

# Setting the random seed, feel free to change it and see different solutions.
np.random.seed(100)

def stepFunction(t):
    if t >= 0:
        return 1
    return 0

def prediction(X, W, b):
    return stepFunction((np.matmul(X,W)+b)[0])

# TODO: Fill in the code below to implement the perceptron trick.
# The function should receive as inputs the data X, the labels y,
# the weights W (as an array), and the bias b,
# update the weights and bias W, b, according to the perceptron algorithm,
# and return W and b.
def perceptronStep(X, y, W, b, learn_rate = 0.01):
    # Fill in code
    for i in range(len(X)):
        y_hat = prediction(X[i],W,b)
        if y[i]-y_hat == 1:
            W[0] += X[i][0] * learn_rate
            W[1] += X[i][1] * learn_rate
            b += learn_rate
        elif y[i]-y_hat == -1:
            W[0] -= X[i][0] * learn_rate
            W[1] -= X[i][1] * learn_rate
            b -= learn_rate
    return W, b
    
# This function runs the perceptron algorithm repeatedly on the dataset,
# and returns a few of the boundary lines obtained in the iterations,
# for plotting purposes.
# Feel free to play with the learning rate and the num_epochs,
# and see your results plotted below.
def trainPerceptronAlgorithm(X, y, learn_rate = 0.01, num_epochs = 10):
    x_min, x_max = min(X.T[0]), max(X.T[0])
    y_min, y_max = min(X.T[1]), max(X.T[1])
    W = np.array(np.random.rand(2,1))
    b = np.random.rand(1)[0] + x_max
    # These are the solution lines that get plotted below.
    boundary_lines = []
    for i in range(num_epochs):
        # In each epoch, we apply the perceptron step.
        W, b = perceptronStep(X, y, W, b, learn_rate)
        # Store the line as plain floats (slope, intercept) of x_2 = slope * x_1 + intercept.
        boundary_lines.append((float(-W[0, 0] / W[1, 0]), float(-b / W[1, 0])))
    return boundary_lines
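One possible way to graph the result, sketched under the assumption that `boundary_lines` is a list of `(slope, intercept)` pairs as built above, and that `X` holds two coordinates per row and `y` holds 0/1 labels. The helpers `line_endpoints` and `plot_boundary_lines`, and the color choices, are hypothetical.

```python
def line_endpoints(slope, intercept, x_min, x_max):
    """Two points of the line x_2 = slope * x_1 + intercept, enough to draw it."""
    return (x_min, x_max), (slope * x_min + intercept, slope * x_max + intercept)

def plot_boundary_lines(X, y, boundary_lines):
    """Scatter the points and draw every boundary line, the last one in solid black."""
    import matplotlib.pyplot as plt  # imported here so line_endpoints works without it

    x1 = [float(row[0]) for row in X]
    x2 = [float(row[1]) for row in X]
    colors = ["blue" if label == 1 else "red" for label in y]

    fig, ax = plt.subplots()
    ax.scatter(x1, x2, c=colors)
    for k, (slope, intercept) in enumerate(boundary_lines):
        xs, ys = line_endpoints(float(slope), float(intercept), min(x1), max(x1))
        is_last = k == len(boundary_lines) - 1
        ax.plot(xs, ys,
                color="black" if is_last else "green",
                linestyle="-" if is_last else "--",
                alpha=1.0 if is_last else 0.4)
    ax.set_xlabel("x_1")
    ax.set_ylabel("x_2")
    return fig, ax

# The line 3x_1 + 4x_2 - 10 = 0 has slope -0.75 and intercept 2.5.
print(line_endpoints(-0.75, 2.5, 0, 2))
```

After training, something like `plot_boundary_lines(X, y, trainPerceptronAlgorithm(X, y))` followed by `plt.show()` would display the intermediate lines as dashed and the final one as solid.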


Now you can see graphically how the line learns and tries to separate the dots. But real-world data usually can't be separated by a line; more complex equations can classify such data sets better, and our perceptron algorithm won't work for them. We need to redefine our perceptron algorithm.

### Error functions

An error function is simply something that tells us how far we are from the solution. It guides us by showing in which direction we can take a step to get closer to the solution.

As an analogy, imagine that we are at the top of a big mountain and we want to descend. The mountain is so big that we don't have the whole picture; we only see the possible directions to take, and we choose the one that helps us descend the most. Once we take a step in that direction, we repeat the process until we arrive at the bottom of the mountain.

All of our decisions are based on the height of the mountain, and that height is what we call the error. The error tells us how badly we're doing at the moment and how far we are from an ideal solution. If we constantly take steps that decrease the error, we'll eventually solve our problem.

Notice that this method can give us wrong solutions: following our analogy, we can get stuck in a valley, or local minimum. That happens a lot in machine learning, and we'll see other ways to deal with it later.

To take advantage of error functions, they have to be continuous: a continuous error function changes a little with every small step, so it tells us which direction decreases the error, while a discrete one (such as the number of misclassified points) often does not change at all.
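The mountain analogy, including the risk of ending in a local minimum, can be sketched in one dimension. The "mountain" `height(x) = x^4 - 3x^2 + x` is a made-up function with two valleys, chosen only for illustration; the learning rate and number of steps are arbitrary.

```python
def height(x):
    """A hypothetical 'mountain' profile with two valleys."""
    return x**4 - 3 * x**2 + x

def slope(x):
    """Derivative of height: tells us which direction goes downhill."""
    return 4 * x**3 - 6 * x + 1

def descend(x, learn_rate=0.01, steps=1000):
    """Repeatedly take a small step in the direction that decreases the height."""
    for _ in range(steps):
        x -= learn_rate * slope(x)
    return x

left = descend(-2.0)   # starts on the left side of the mountain
right = descend(2.0)   # starts on the right side

print(f"start -2.0 -> x = {left:.2f}, height = {height(left):.2f}")
print(f"start  2.0 -> x = {right:.2f}, height = {height(right):.2f}")
```

Both runs stop where the slope is zero, but the run that starts on the right ends in the shallower valley: the starting point decides which minimum we reach.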

![Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%208.19.49.png](attachment:Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%208.19.49.png)

### Discrete vs Continuous predictions

A discrete answer gives us a yes/no, while a continuous answer gives us a number, usually between 0 and 1, which we interpret as a probability. The way we move from discrete predictions to continuous ones is by changing our activation function from the step function (0s and 1s) to a new function called the **Sigmoid Function**.

The sigmoid function gives values very close to one for large positive numbers, values very close to zero for large negative numbers, and values close to 0.5 for numbers close to zero. The formula is:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$


The way we obtain probabilities from spaces like our red/blue classification areas is simple. We just combine the linear function `Wx + b` with the sigmoid function. So the prediction is defined as:

$$
\hat{y} = \sigma(Wx + b)
$$

For now, this formula takes us closer to a specific prediction (e.g. closer to one if the question is "Is the point blue?" and the point is deep in the blue area).

![Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%2022.37.13.png](attachment:Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%2022.37.13.png)

What our new sigmoid perceptron does is take the inputs, multiply them by the weights on the edges, add the results together with the bias, and then apply the sigmoid function. So instead of returning one or zero like before, it returns values between zero and one. Now we know the probability, not just a yes or no answer.
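A minimal sketch of such a sigmoid perceptron, reusing the line $3x_1 + 4x_2 - 10 = 0$ from the earlier example as weights and bias. The function names and the three sample points are illustrative.

```python
import math

def sigmoid(x):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_perceptron(inputs, weights, bias):
    """Linear combination of the inputs followed by the sigmoid activation."""
    linear_combination = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(linear_combination)

weights, bias = (3.0, 4.0), -10.0  # the line 3x_1 + 4x_2 - 10 = 0

for point in [(4, 5), (2, 1), (1, 1)]:
    probability = sigmoid_perceptron(point, weights, bias)
    print(point, round(probability, 3))
```

The point far inside the positive area gets a probability close to 1, the point on the line gets exactly 0.5, and the point in the negative area gets a probability close to 0.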

![Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%2022.40.25.png](attachment:Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%2022.40.25.png)

### The Softmax function

The softmax function is the equivalent of the sigmoid activation function for problems with 3 or more classes.

When we need to classify among 3 or more options, we have the problem of translating each score into a probability while keeping the sum of all the probabilities equal to 1.

![Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%2022.56.41.png](attachment:Captura%20de%20Pantalla%202020-11-23%20a%20la%28s%29%2022.56.41.png)

Exponential functions help us avoid negative numbers, so we can take the value for one option and divide it by the sum of all the values without fear of dividing by zero.

Let's say we have n classes and a linear model that gives us the following scores:
$$Z_1, ..., Z_n$$
There is one score for each class. To turn them into probabilities, we say that the probability that the object is in class i is e to the power of $Z_i$ divided by the sum of e to the power of $Z_1$ all the way to e to the power of $Z_n$:
$$
P(\text{class } i) = \frac{e^{Z_i}}{e^{Z_1} + ... + e^{Z_n}}
$$
That's how we turn scores into probabilities.

Now we can program the softmax function!

In [2]:
import numpy as np

# Write a function that takes as input a list of numbers, and returns
# the list of values given by the softmax function.
def softmax(L):
    expL = np.exp(L)
    sumExpL = sum(expL)
    # Convert to plain Python floats so the list prints as ordinary numbers.
    return [float(i / sumExpL) for i in expL]

In [5]:
# Let's call our function:
print(softmax([5, 6, 7]))

[0.09003057317038046, 0.24472847105479764, 0.6652409557748219]
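One practical caveat: the exponential grows very fast, so large scores can overflow. A common variant, sketched here as an alternative and not as the notebook's own method, subtracts the largest score before exponentiating; this leaves the probabilities unchanged because the common factor cancels in the division. The name `stable_softmax` is hypothetical.

```python
import math

def stable_softmax(scores):
    """Softmax computed after shifting the scores so the largest one is 0."""
    largest = max(scores)
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

probabilities = stable_softmax([5, 6, 7])
print(probabilities)
print(sum(probabilities))

# Scores this large would overflow math.exp without the shift.
print(stable_softmax([1000, 1001, 1002]))
```

Only the differences between scores matter, so both calls give the same probabilities.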


Our data will not always come as numbers, and we can't simply assign numbers such as 0, 1, 2 to categories, because that would create dependencies between the categories that don't exist. For these cases we add one column per category to our data set, with a 1 in the column of the category the row belongs to and a 0 in the others. This process is called One-Hot Encoding, and we will use it a lot when processing data.
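A minimal sketch of one-hot encoding in plain Python. The function name `one_hot_encode` and the sample values are illustrative; libraries such as pandas offer ready-made equivalents.

```python
def one_hot_encode(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

animals = ["duck", "walrus", "beaver", "duck"]  # made-up sample data
categories, rows = one_hot_encode(animals)

print(categories)
for animal, row in zip(animals, rows):
    print(animal, row)
```

Each row contains exactly one 1, so no category is treated as "larger" than another.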

### Maximum likelihood

Probability will be one of our best friends as we go through Deep Learning. Let's see how we can use it to evaluate and improve our models.


The best model is the one that gives the highest probabilities to the events that actually happened, whether that is acceptance or rejection. This method is known as Maximum Likelihood. We pick the model that gives the existing labels the highest probability; thus, by maximizing the probability, we pick the best possible model.


If we compare two models that classify points according to the color of their area, where the first model correctly classifies two of four points and the second one correctly classifies all four, we can be sure that the second model is the better one. But to prove it with probability, we take, for each point, the probability of it being the color it actually is, and then multiply the probabilities of all the points.

Applying maximum likelihood, we notice that the second model, which gives each point a higher probability of being the color it actually is, also has a higher total probability.

![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.10.04.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.10.04.png)

### Maximizing probabilities

We have talked about the error function and how minimizing it takes us to the best possible solution. Could it be that maximizing the probability is equivalent to minimizing the error function?

What we want is to maximize the probability, but the probability is a product of numbers, and products are hard to work with: a product of thousands of numbers between 0 and 1 will be something like 0.00000XX, which is bad for our purposes.

To avoid products, let's switch to sums: we need a function that turns products into sums.

![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.18.35.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.18.35.png)

The logarithm function helps us because it has a nice identity: the logarithm of the product A times B is the sum of the logarithms of A and B.
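Both points, the identity and the trouble with long products, can be checked with a small sketch. The 2000 probabilities of 0.5 are a made-up example, chosen only to make the product small enough to underflow.

```python
import math

# The identity: ln(A * B) = ln(A) + ln(B).
a, b = 0.6, 0.2
print(math.log(a * b), math.log(a) + math.log(b))

# Hypothetical data set: 2000 points, each given probability 0.5 by the model.
probabilities = [0.5] * 2000

product = math.prod(probabilities)                  # too small for a float
log_sum = sum(math.log(p) for p in probabilities)   # still a usable number

print(product)
print(log_sum)
```

The product collapses to zero, so two different models could not be compared with it, while the sum of logarithms remains a finite number.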

So we take our products and we take the logarithms. Let's say our model 1 has these four probabilities for each point:
$$
0.6 * 0.2 * 0.1 * 0.7 = 0.0084
$$

And our model 2 has these four probabilities for each point:
$$
0.7 * 0.9 * 0.8 * 0.6 = 0.3024
$$

We change to logarithms, for model 1:
$$
\ln(0.6) + \ln(0.2) + \ln(0.1) + \ln(0.7)
$$
$$-0.51   -1.61   -2.3   -0.36$$

and for model 2:
$$
\ln(0.7) + \ln(0.9) + \ln(0.8) + \ln(0.6)
$$
$$-0.36   -0.1   -0.22   -0.51$$

Notice that each natural logarithm is a negative number. That makes sense, because the logarithm of a number between 0 and 1 is always negative, since the logarithm of one is zero.

We will take the negative of the logarithm of the probabilities:
for model 1:
$$
-\ln(0.6) - \ln(0.2) - \ln(0.1) - \ln(0.7)
$$
$$0.51   +1.61   +2.3   +0.36$$

and for model 2:
$$
-\ln(0.7) - \ln(0.9) - \ln(0.8) - \ln(0.6)
$$
$$0.36   +0.1   +0.22   +0.51$$

The sum of the negatives of the logarithms of the probabilities is called the **cross entropy**, which is a very important concept.

Calculating the cross entropies, we find that model 1 has a cross entropy of about 4.8, which is high, while model 2 has a cross entropy of about 1.2, which is low. A good model gives us a low cross entropy and a bad model gives us a high cross entropy.
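The numbers for both models can be reproduced with a short sketch. The helper name `cross_entropy_of_labels` is illustrative; the probabilities are the ones listed above.

```python
import math

def cross_entropy_of_labels(probabilities):
    """Sum of -ln(p) over the probabilities the model gives to the observed labels."""
    return -sum(math.log(p) for p in probabilities)

model_1 = [0.6, 0.2, 0.1, 0.7]
model_2 = [0.7, 0.9, 0.8, 0.6]

for name, probs in (("model 1", model_1), ("model 2", model_2)):
    likelihood = math.prod(probs)
    entropy = cross_entropy_of_labels(probs)
    print(f"{name}: likelihood = {likelihood:.4f}, cross entropy = {entropy:.2f}")
```

The cross entropy is just the negative logarithm of the likelihood, so the model with the higher likelihood is always the one with the lower cross entropy.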

This method is actually much more powerful than it seems: if we calculate the probabilities and pair the points with the corresponding logarithms, we actually get an error for each point.

![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.37.11.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.37.11.png)



If we look carefully at the values, we can see that the misclassified points have high values and the correctly classified points have small values. We can think of the negatives of these logarithms as errors at each point.

Our cross entropy tells us if a model is good or bad. So we change from maximizing the probability to minimizing the cross entropy.

### Cross Entropy

Cross entropy says: if I have a bunch of events and a bunch of probabilities, how likely is it that those events happen based on the probabilities? If it is very likely, we have a small cross entropy. If it is unlikely, we have a large cross entropy.


Let's take another example, here is a table with all the possible scenarios for three doors that have different probabilities to have a gift behind them: 

![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.58.48.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2020.58.48.png)

There are eight scenarios, since each of the three doors gives us two possibilities. We obtain the probability of each arrangement by multiplying the three independent probabilities.

Notice that the events with high probability have low cross entropy and the events with low probability have high cross entropy.

From this example we can build a formula, using as variables the probability for each door and the presence of a gift behind it:

![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.04.38.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.04.38.png)

The cross entropy really tells us when two vectors are similar or different. 


Let's code the formula for cross entropy in Python, where Y is the category and P the probability:

In [1]:
import numpy as np

# Write a function that takes as input two lists Y, P,
# and returns the float corresponding to their cross-entropy.
def cross_entropy(Y, P):
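    # np.float_ was removed in NumPy 2.0; np.asarray with dtype=float works in all versions.
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))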

If we test it:

In [3]:
Y = [1, 0, 1, 1]
P = [0.4, 0.6, 0.1, 0.5]
print(cross_entropy(Y, P))

4.828313737302301


So far we showed an example with two classes (gift or no gift). We continue with our three doors, but now behind each door there can be an animal, and the animal can be one of three types: duck, beaver, or walrus. Here is the probability table for our example:

![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.15.15.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.15.15.png)

We take our doors table and define a new cross entropy formula, where m is the number of classes and $y_{ij}$ marks whether a specific animal is behind a specific door (1 if it is, 0 otherwise):
![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.19.36.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.19.36.png)

![Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.20.49.png](attachment:Captura%20de%20Pantalla%202020-11-24%20a%20la%28s%29%2021.20.49.png)
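The formula in the images is not reproduced as text, so this sketch assumes the usual multi-class form: the sum, over doors and animals, of the 0/1 label times the negative logarithm of the corresponding probability. The probability table and the animals placed behind the doors are made-up values for illustration only.

```python
import math

def multiclass_cross_entropy(labels, probabilities):
    """Sum of -y * ln(p) over every door (row) and every class (column)."""
    total = 0.0
    for label_row, prob_row in zip(labels, probabilities):
        for y, p in zip(label_row, prob_row):
            if y:  # only the animal that is actually there contributes
                total -= y * math.log(p)
    return total

# Columns: duck, beaver, walrus. One row per door (hypothetical numbers).
probabilities = [
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.5, 0.4],
]
# Hypothetical outcome: duck behind door 1, walrus behind doors 2 and 3.
labels = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
]

result = multiclass_cross_entropy(labels, probabilities)
print(round(result, 4))
```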

### Logistic regression

One of the most popular and useful algorithms in ML, and the building block of all that constitutes DL, is logistic regression, which works like this:
- Take your data
- Pick a random model
- Calculate the error
- Minimize the error, and obtain a better model
- Enjoy!

Okay, how do we calculate the error? Coming back to our example of points in the blue and red areas, if y = 1 (blue point), the probability of the point being blue is $\hat{y}$ and the error is the negative of the natural logarithm of $\hat{y}$.
If y = 0 (red point), the probability of the point being red is 1 minus the probability of it being blue, that is $1 - \hat{y}$, and the error is the negative of the natural logarithm of $1 - \hat{y}$.
So we can define a formula for the error that goes like this:
$$
Error = -(1 - y)\ln(1 - \hat{y}) - y\ln(\hat{y})
$$

If the point is red then `y = 0`, so the second term of the formula is 0 and the first one is $-\ln(1 - \hat{y})$; if the point is blue then `y = 1`, so the first term of the formula is 0 and the second one is $-\ln(\hat{y})$.

Following that, the error function is simply the average of the errors of all the points:
$$
ErrorFunction = -\frac{1}{m} \sum_{i=1}^{m} \left[ (1 - y_i)\ln(1 - \hat{y}_i) + y_i\ln(\hat{y}_i) \right]
$$

Since $\hat{y}$ is given by the sigmoid of the linear function `Wx + b`, the error is actually a function of W and b, the weights of the model:

$$
ErrorFunction(W,b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ (1 - y_i)\ln(1 - \sigma(Wx^{(i)} + b)) + y_i\ln(\sigma(Wx^{(i)} + b)) \right]
$$

In this case $y_i$ is just the label of the point $x^{(i)}$. Now that we've calculated the error, our goal is to minimize it.
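A minimal sketch of this error function in plain Python, assuming two-dimensional points and reusing the line $3x_1 + 4x_2 - 10 = 0$ as the model. The two points and their labels are illustrative; the same points are evaluated once with labels that agree with the line and once with the labels swapped.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def error_function(points, labels, weights, bias):
    """Average of -(1 - y) ln(1 - y_hat) - y ln(y_hat) over all points."""
    total = 0.0
    for x, y in zip(points, labels):
        y_hat = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        total += -(1 - y) * math.log(1 - y_hat) - y * math.log(y_hat)
    return total / len(points)

weights, bias = (3.0, 4.0), -10.0
points = [(1, 1), (4, 5)]  # (1, 1) is on the negative side, (4, 5) on the positive side

good_fit = error_function(points, [0, 1], weights, bias)  # labels agree with the line
bad_fit = error_function(points, [1, 0], weights, bias)   # labels swapped

print(f"labels agree with the line: {good_fit:.4f}")
print(f"labels swapped:             {bad_fit:.4f}")
```

The same line gets a small error when the labels match its two sides and a large one when they don't, which is exactly the signal gradient descent needs.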

This error function applies to binary classification problems. If we have a multiclass classification problem, the error is given by the multiclass cross entropy:

$$
ErrorFunction(W,b) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} y_{ij}\ln(\hat{y}_{ij})
$$

In this formula, for every data point we take the product of the label and the logarithm of the prediction, sum over the classes, and then average these values over all the points.

### Minimizing the error function

To minimize the error function, we start with some random weights, which give us the predictions $\sigma(Wx + b)$ and, through them, an error given by the formula `Error function(W, b)`. Remember that each point contributes a larger error if it is misclassified and a smaller one if it is correctly classified.

To reduce that error we use gradient descent to "get to the bottom of our mountain", which gives us a smaller value of `Error function(W, b)`. This yields new weights, W' and b', which give us a better prediction, namely $\sigma(W'x + b')$.
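As a rough sketch of that loop: for the sigmoid with the log-loss error above, the gradient with respect to each weight works out to the average of $-(y - \hat{y})x_j$, so every step adds $\alpha (y - \hat{y}) x_j / m$ to $w_j$ and $\alpha (y - \hat{y}) / m$ to $b$. That derivation is not shown in this notebook and is taken here as given. The data set, the zero starting weights, the learning rate, and the number of epochs are all made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(x, weights, bias):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def error_function(points, labels, weights, bias):
    """Average log-loss of the model over all points."""
    total = 0.0
    for x, y in zip(points, labels):
        y_hat = predict(x, weights, bias)
        total += -(1 - y) * math.log(1 - y_hat) - y * math.log(y_hat)
    return total / len(points)

def gradient_descent_step(points, labels, weights, bias, learn_rate):
    """One step downhill: move W and b against the average gradient."""
    m = len(points)
    new_weights, new_bias = list(weights), bias
    for x, y in zip(points, labels):
        difference = y - predict(x, weights, bias)
        for j, xj in enumerate(x):
            new_weights[j] += learn_rate * difference * xj / m
        new_bias += learn_rate * difference / m
    return new_weights, new_bias

# Hypothetical data: three red points (label 0) and three blue points (label 1).
points = [(0, 0), (1, 0), (0, 1), (3, 3), (4, 3), (3, 4)]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = [0.0, 0.0], 0.0
initial_error = error_function(points, labels, weights, bias)
for _ in range(100):
    weights, bias = gradient_descent_step(points, labels, weights, bias, learn_rate=0.1)
final_error = error_function(points, labels, weights, bias)

print(f"error before: {initial_error:.4f}, error after: {final_error:.4f}")
```

Starting from zero weights instead of random ones keeps the run reproducible; with random weights the starting error changes, but the steps are the same.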