## Activation functions

There are many possible activation functions. The most common choice is the *Rectified Linear Unit*, or *ReLU* (*i.e.* max(0, x)), which is fast to compute and fast to differentiate (the value is either 0 or the input, and the derivative is either 0 or 1). There are many variations of it (ELU, LeakyReLU...), but I recommend trying them only once you already have a well-working model that uses ReLU: they can marginally improve the accuracy, but they won't make the difference between 40% and 70% accuracy.

**Warning**: ReLU is usually used *in-between* layers, but not at the end of a classification network: the network needs to output probabilities, while ReLU returns either 0 or x (an unbounded score). So at the end of the network you use the sigmoid (for binary classification problems) or the softmax (for multiclass classification problems).

<img src="assets/activation.png" width=700px>

In [6]:
import torch.nn.functional as F
import torch 

test_input = torch.tensor([-100, -10, -5,-2, 0, 2, 5, 10, 100])
relu_output = F.relu(test_input)
sigmoid_output = F.sigmoid(test_input)
tanh_output = torch.tanh(test_input)

print("Relu:", relu_output)
print("Sigmoid:", sigmoid_output)
print("Tanh:", tanh_output)

Relu: tensor([  0,   0,   0,   0,   0,   2,   5,  10, 100])
Sigmoid: tensor([0.0000e+00, 4.5398e-05, 6.6929e-03, 1.1920e-01, 5.0000e-01, 8.8080e-01,
        9.9331e-01, 9.9995e-01, 1.0000e+00])
Tanh: tensor([-1.0000, -1.0000, -0.9999, -0.9640,  0.0000,  0.9640,  0.9999,  1.0000,
         1.0000])
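
To connect this with the warning about the last layer, here is a minimal sketch of how sigmoid and softmax turn raw scores into probabilities. The score values are made up for illustration and are not taken from any model in this notebook.

```python
import torch

# Hypothetical raw scores (logits) for a batch of 2 samples and 3 classes.
logits = torch.tensor([[2.0, -1.0, 0.5],
                       [0.1, 0.2, 0.3]])

# Softmax over the class dimension turns each row into probabilities.
probabilities = torch.softmax(logits, dim=1)
print(probabilities)
print(probabilities.sum(dim=1))  # each row sums to 1 (up to rounding)

# For binary classification, one score per sample goes through the sigmoid.
binary_scores = torch.tensor([-3.0, 0.0, 3.0])
binary_probabilities = torch.sigmoid(binary_scores)
print(binary_probabilities)  # values strictly between 0 and 1
```

Note that `dim=1` is the class dimension here because the first dimension is the batch.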


## Loss Functions

Choosing the correct loss function is crucial. For binary classification problems, a Binary Cross Entropy loss is often used (`nn.BCELoss()`), while for multiclass classification tasks like MNIST the usual choice is `nn.CrossEntropyLoss()`. For regression problems, the most popular is `nn.MSELoss()`.
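
As a rough illustration of what each of these three losses expects as input, here is a small sketch on toy tensors. All the numbers and variable names are invented for the example.

```python
import torch
import torch.nn as nn

# Binary classification: BCELoss expects probabilities in [0, 1],
# so the raw scores go through a sigmoid first. Targets are floats.
binary_scores = torch.tensor([0.8, -1.2, 2.5])
binary_targets = torch.tensor([1.0, 0.0, 1.0])
bce = nn.BCELoss()(torch.sigmoid(binary_scores), binary_targets)

# Multiclass: CrossEntropyLoss expects raw scores of shape
# (batch, classes) and integer class indices as targets.
class_scores = torch.tensor([[2.0, 0.5, -1.0],
                             [0.1, 0.2, 3.0]])
class_targets = torch.tensor([0, 2])
ce = nn.CrossEntropyLoss()(class_scores, class_targets)

# Regression: MSELoss compares predictions with real-valued targets.
predictions = torch.tensor([2.5, 0.0, 1.0])
true_values = torch.tensor([3.0, 0.0, 2.0])
mse = nn.MSELoss()(predictions, true_values)

print(bce.item(), ce.item(), mse.item())
```

The thing to watch is the input format: probabilities and float targets for `nn.BCELoss()`, raw scores and integer class indices for `nn.CrossEntropyLoss()`.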

In place of `nn.CrossEntropyLoss()`, the Negative Log Likelihood Loss `nn.NLLLoss()` is often used. It expects log probabilities as input instead of raw scores.
However, it's very important to remember that:

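- You can use `nn.NLLLoss()` only if the Neural Network ends with a log-softmax (`nn.LogSoftmax(dim=1)` as a layer, or `F.log_softmax(x, dim=1)` as a function), NOT with a plain Softmax. `nn.CrossEntropyLoss()` already applies the log-softmax internally, so with it the network should output raw scores.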

- This means that if you want the probabilities for each class, you need to apply the exponential function (the inverse of the logarithm) to the model output: `probabilities = torch.exp(model(X))`
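
The two points above can be checked with a small sketch. The random scores and the targets are placeholders that stand in for the output of a real model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # hypothetical scores: 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])   # hypothetical class indices

# Option 1: raw scores straight into CrossEntropyLoss.
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# Option 2: log-softmax first, then NLLLoss.
log_probs = F.log_softmax(logits, dim=1)
loss_nll = nn.NLLLoss()(log_probs, targets)

print(torch.allclose(loss_ce, loss_nll))  # the two options give the same loss

# Recover the probabilities from the log probabilities.
probabilities = torch.exp(log_probs)
print(probabilities.sum(dim=1))  # each row sums to 1 (up to rounding)
```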




## Optimizers

Updating the weights is done by the optimizers. You can find a bunch of them here: https://pytorch.org/docs/stable/optim.html

Each of them has some mathematical reasoning behind it. However, I'll leave that to your own curiosity. What you should know instead is:

- The most used in the literature is the Adam optimizer

`optimizer = optim.Adam(model.parameters(), lr=0.0001)`

- A good alternative is often Stochastic Gradient Descent (SGD), usually with momentum

`optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)`
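
To show where the optimizer fits in a training loop, here is a minimal sketch on a toy regression problem. The model, the data and the learning rate are placeholders chosen for the example, not recommendations.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(3, 1)                 # stand-in for a real network
X = torch.randn(16, 3)                  # toy inputs
y = X.sum(dim=1, keepdim=True)          # toy targets: the sum of the features
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

losses = []
for step in range(100):
    optimizer.zero_grad()               # reset gradients from the previous step
    loss = criterion(model(X), y)       # forward pass and loss
    loss.backward()                     # compute the gradients
    optimizer.step()                    # update the weights
    losses.append(loss.item())

print(losses[0], losses[-1])            # the loss should go down
```

Swapping in SGD only changes the line that builds the optimizer; the rest of the loop stays the same.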





## Batch size

The batch size is a hyperparameter that you will tune yourself. Often the limits are imposed by the machine you are using: you may have to use a small batch size so that a batch fits in memory. Usually, powers of two are used as batch sizes.
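
In PyTorch the batch size is typically set on the `DataLoader`. Here is a small sketch with a made-up dataset of 100 random samples, just to show how the data is split into batches.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset: 100 samples with 8 features and a binary label.
features = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))
dataset = TensorDataset(features, labels)

# batch_size is the hyperparameter discussed above.
loader = DataLoader(dataset, batch_size=32, shuffle=True)

batch_sizes = [batch_features.shape[0] for batch_features, _ in loader]
print(batch_sizes)  # [32, 32, 32, 4]: the last batch holds the leftover samples
```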



## Exercises

Complete the previous notebooks, trying to understand each line of code. The text in between the cells should be descriptive enough for you to understand. If anything is still unclear, we'll have a live coding session today.

### Practice

- Intermediate: Can you apply a Neural Network to tabular data? Take an old dataset and try it.
- Advanced: Can you use an MLP for your rock-paper-scissors problem?

Next week you will be doing custom dataloaders, but I will attach the notebook here already.