# Loss Functions and Optimizers

Loss functions and optimizers are crucial for any neural network application. The loss function guides training by measuring how far the network's predictions are from the targets, and the optimizer acts as the catalyst, determining how the weights of the network should change to perform better.

## Initialization

First, we do the imports:

### Dataset
Since internet access might be a problem, let's create our own dataset. A simple line with some noise will be enough.
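
A minimal sketch of such a dataset, using NumPy. The slope, intercept, noise level, sample count and variable names are illustrative assumptions, not values from the original notebook.

```python
import numpy as np

# Assumed, illustrative settings for the synthetic line.
rng = np.random.default_rng(seed=0)
n_samples = 100
true_slope, true_intercept = 2.0, 1.0

# Inputs evenly spaced on [0, 10]; targets lie on a line plus Gaussian noise.
x = np.linspace(0.0, 10.0, n_samples)
noise = rng.normal(loc=0.0, scale=1.0, size=n_samples)
y = true_slope * x + true_intercept + noise
```

Fixing the seed keeps the data reproducible, so loss values computed later can be compared between runs.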

## Loss Functions (Cost Function)

### Mean Absolute Error

MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight.

![title](img/mae.png)

First, let's see how much the predicted values and the true values differ by graphing the elementwise differences.


Now let's calculate the actual loss.
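
As a small sketch of both steps, the snippet below computes the elementwise absolute differences and then averages them to obtain the MAE. The arrays `y_true` and `y_pred` are made-up values chosen for easy arithmetic; they are not the notebook's data.

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Elementwise absolute differences: [0.5, 0.5, 0.0, 1.0]
abs_diff = np.abs(y_pred - y_true)

# MAE is the mean of those differences: 0.5
mae = abs_diff.mean()
```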

### Root Mean Squared Error
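RMSE is a quadratic scoring rule that also measures the average magnitude of the error. It’s the square root of the average of the squared differences between prediction and actual observation. Because the errors are squared before they are averaged, large errors carry more weight than they do in MAE.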

![title](img/rmse.png)
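
A minimal NumPy sketch of this definition follows. The function name `rmse` and the sample arrays are illustrative assumptions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean of the squared elementwise errors."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Squared errors are [0.25, 0.25, 0.0, 1.0], so the mean is 0.375.
rmse_value = rmse(y_true, y_pred)  # sqrt(0.375), roughly 0.612
```

On these values the MAE is 0.5, so the RMSE is larger: the single error of 1.0 dominates once it is squared.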

### Binary Cross Entropy
![title](img/bce.png)
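
The section gives only the figure for binary cross entropy. Assuming it shows the usual definition, the average over samples of −[y log p + (1 − y) log(1 − p)] with labels y in {0, 1} and predicted probabilities p, a minimal NumPy sketch might look like this. The function name `binary_cross_entropy`, the `eps` clipping constant and the sample values are illustrative choices.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross entropy for labels in {0, 1} and probabilities in [0, 1]."""
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([1.0, 0.0, 1.0, 0.0])
p_pred = np.array([0.9, 0.1, 0.8, 0.3])

bce = binary_cross_entropy(y_true, p_pred)
```

Confident predictions on the wrong side are penalized heavily: swapping `p_pred` for `1 - p_pred` gives a much larger loss.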


## Optimizers

### Gradient Descent

Let’s assume we have an output variable y that we think depends linearly on the input vector x. We approximate y by
![title](img/img3.png)

The cost function for our linear least squares regression will then be
![title](img/img4.png)
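
The two formulas above appear only as figures. As a hedged sketch, assume the usual conventions: the linear model is h(x) = θᵀx and the cost is half the sum of squared residuals, J(θ) = ½ Σ (θᵀxᵢ − yᵢ)². The names `predict` and `least_squares_cost`, the ½ scaling and the toy data are assumptions made for illustration.

```python
import numpy as np

def predict(theta, X):
    """Linear model: one prediction per row of X."""
    return X @ theta

def least_squares_cost(theta, X, y):
    """Half the sum of squared residuals (assumed form of the cost)."""
    residuals = predict(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)

# Toy data generated by y = 1 + 2x; the first column of X is a bias term.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

cost_exact = least_squares_cost(np.array([1.0, 2.0]), X, y)  # 0.0
cost_zero = least_squares_cost(np.zeros(2), X, y)            # 0.5 * (1 + 9 + 25) = 17.5
```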



#### Batch Gradient Descent
Assume that we have a vector of parameters θ and a cost function J(θ), which is simply the quantity we want to minimize (our objective function). Typically, the objective function has the form:

![title](img/img1.png)

where Ji is associated with the i-th observation in our data set. The batch gradient descent algorithm starts with some initial feasible θ (which we can either fix or assign randomly) and then repeatedly performs the update:

![title](img/img2.png)
where η is a constant controlling step-size and is called the learning rate. Note that in order to make a single update, we need to calculate the gradient using the entire dataset. This can be very inefficient for large datasets.
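
A minimal sketch of the batch update for linear least squares follows. It assumes the cost is averaged over the m observations (the figure may use a sum instead, which only rescales η). The function name, the default `eta` and `n_steps`, and the synthetic data are illustrative.

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_steps=500):
    """Minimize the mean of 0.5 * (x_i . theta - y_i)^2 using full-dataset gradients."""
    m, n = X.shape
    theta = np.zeros(n)  # fixed initial point
    for _ in range(n_steps):
        # Gradient of the averaged cost, computed from ALL m observations.
        grad = X.T @ (X @ theta - y) / m
        theta = theta - eta * grad
    return theta

# Synthetic line y = 1 + 2x with a little noise (assumed values).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])  # bias column plus feature
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

theta_hat = batch_gradient_descent(X, y)  # close to [1, 2]
```

Every step touches all 200 rows of `X`, which is the cost that becomes prohibitive for large datasets.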


#### Stochastic Gradient Descent
As noted, the gradient descent algorithm makes intuitive sense: it always proceeds in the direction of steepest descent (the negative gradient of J) and, for a suitably small learning rate, it converges toward a local minimum (a global one under certain assumptions on J, such as convexity). When we have very large data sets, however, the calculation of ∇(J(θ)) can be costly, as we must process every data point before making a single step (hence the name “batch”). An alternative approach, the stochastic gradient descent method, is to update θ sequentially with every observation. The updates then take the form:
![title](img/img6.png)

This stochastic gradient approach allows us to start making progress on the minimization problem right away. It is computationally cheaper, but it results in a larger variance of the loss function in comparison with batch gradient descent.

Generally, the stochastic gradient descent method will get close to the optimal θ much faster than the batch method, but with a constant learning rate it keeps fluctuating around the local (or global) minimum rather than settling exactly on it. Thus the stochastic gradient descent method is useful when we want a quick and dirty approximation for the solution to our optimization problem. A full recipe for stochastic gradient descent follows:

1. Initialize the parameter vector θ and set the learning rate α.
2. Repeat until an acceptable approximation to the minimum is obtained:
   1. Randomly reshuffle the instances in the training data.
   2. For i = 1, 2, …, m do: θ := θ − α ∇θ Ji(θ)
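
The recipe can be sketched in NumPy as follows, assuming a linear model with per-example cost Ji(θ) = ½ (θᵀxᵢ − yᵢ)². The function name, the fixed number of epochs used as a stopping rule, the default `alpha` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, n_epochs=50, seed=0):
    """SGD for linear least squares: one parameter update per training example."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)                      # step 1: initialize theta
    for _ in range(n_epochs):                # step 2: repeat (fixed epoch budget here)
        for i in rng.permutation(m):         # reshuffle, then visit each example once
            grad_i = (X[i] @ theta - y[i]) * X[i]   # gradient of J_i only
            theta = theta - alpha * grad_i
    return theta

# Synthetic line y = 1 + 2x with a little noise (assumed values).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

theta_sgd = stochastic_gradient_descent(X, y)  # near [1, 2], but still jittering
```

With a constant `alpha` the estimate hovers around the minimum; decaying the learning rate over epochs is one common way to reduce that jitter.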



#### MiniBatch Gradient Descent

What if, instead of a single example from the dataset, we use a batch of examples of a given size every time we calculate the gradient?
![title](img/img8.png)
This is what mini-batch gradient descent is about. Using mini-batches has the advantage that the variance in the loss function is reduced, while the computational burden is still reasonable, since we do not use the full dataset. The size of the mini-batches becomes another hyper-parameter of the problem. In standard implementations it ranges from 50 to 256.
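
A minimal sketch of mini-batch gradient descent for the same linear least squares setting, assuming the gradient is averaged over each mini-batch. The function name, `batch_size=32`, the other defaults and the synthetic data are illustrative assumptions.

```python
import numpy as np

def minibatch_gradient_descent(X, y, eta=0.1, batch_size=32, n_epochs=100, seed=0):
    """Linear least squares with one update per mini-batch of examples."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        order = rng.permutation(m)  # reshuffle once per epoch
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]  # last batch may be smaller
            residuals = X[batch] @ theta - y[batch]
            # Gradient averaged over the mini-batch only.
            grad = X[batch].T @ residuals / len(batch)
            theta = theta - eta * grad
    return theta

# Synthetic line y = 1 + 2x with a little noise (assumed values).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

theta_mb = minibatch_gradient_descent(X, y)  # close to [1, 2]
```

Setting `batch_size=1` recovers stochastic gradient descent, and `batch_size=len(y)` recovers the batch method, so the batch size trades gradient noise against cost per step.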

#### Cubic Equation (Example)