# Activation Functions

Based on a video by **Patrick Loeber**: https://www.youtube.com/watch?v=c36lUUr864M&t=9756s

Activation functions are an extremely important feature of neural networks.

![image.png](attachment:b14c78c4-da7a-4172-b002-9ec0a7d29762.png)

Activation functions apply a non-linear transformation to the layer output and basically decide whether a neuron should be activated or not.

Why do we use them? Why is a linear transformation alone not good enough?

![image.png](attachment:d15e649c-2cec-4cf5-b5a9-b5a2e3aabb1a.png)

Typically we would have a linear layer in our network that applies a linear transformation.

![image.png](attachment:90db5d73-cd4a-4611-b1a0-2714606cfa79.png)

Suppose we had no activation functions in between. Then we would only have linear transformations applied one after another, and because a composition of linear transformations is itself a linear transformation, the whole network from input to output would essentially be just a linear regression model. Such a linear model is not suited for more complex tasks. The conclusion is that with non-linear transformations in between, our network can learn better and perform more complex tasks.
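
The collapse into a single linear model can be checked with a tiny sketch in plain Python. The scalar weights `w1`, `b1`, `w2`, `b2` and the function names are illustrative assumptions, not taken from the video; a real layer uses matrices, but the algebra is the same.

```python
# Two scalar "linear layers" with hypothetical example weights.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    """A linear layer followed directly by another linear layer."""
    hidden = w1 * x + b1
    return w2 * hidden + b2

def single_linear_layer(x):
    """One linear layer using the combined weight and bias."""
    return (w2 * w1) * x + (w2 * b1 + b2)

def with_relu_between(x):
    """The same two layers, but with a ReLU applied to the hidden value."""
    hidden = max(0.0, w1 * x + b1)
    return w2 * hidden + b2

for x in (-2.0, 0.0, 2.0):
    print(x, two_linear_layers(x), single_linear_layer(x), with_relu_between(x))
```

Without an activation, the two stacked layers agree with one linear layer for every input. With the ReLU in between, the output at `x = -2.0` differs, so the network is no longer a single straight line.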

![image.png](attachment:e5b70ff2-1a0b-4a81-9b5b-ad2331018e39.png)

After each layer we typically want to apply an activation function. Here we first have our normal linear layer, and then we apply the activation function. With this, our network can learn better.

## Most popular activation functions

+ Step function
+ Sigmoid
+ TanH
+ ReLU
+ Leaky ReLU
+ Softmax

### Step Function

![image.png](attachment:f5529e78-fc19-4c0f-8aa0-9eedef16ad95.png)

This outputs 1 if the input is greater than a threshold (here the threshold is 0) and 0 otherwise. This activation function is not used in practice, but it demonstrates the idea of deciding whether a neuron should be activated or not.
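
As a minimal sketch in plain Python, the step function could look like this. The function name and the behaviour exactly at the threshold (treated as "not activated", following the wording "greater than") are assumptions.

```python
def step(x, threshold=0.0):
    """Return 1.0 if x is strictly greater than the threshold, else 0.0."""
    return 1.0 if x > threshold else 0.0

print([step(x) for x in (-2.0, 0.0, 0.5, 3.0)])  # [0.0, 0.0, 1.0, 1.0]
```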

### Sigmoid Function

![image.png](attachment:6c517be6-8d8f-4f4e-944d-848466c89210.png)

A more popular choice is the sigmoid function. It outputs a value between 0 and 1 that can be interpreted as a probability. It is typically used in the last layer of a binary classification problem.
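
A small plain-Python sketch of the sigmoid, written from its standard formula 1 / (1 + e^(-x)). The helper `predict_label` and its cutoff of 0.5 are hypothetical and only illustrate how the output might be turned into a binary decision.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x)).

    Simple version for illustration; it is not hardened against
    overflow for very large negative inputs.
    """
    return 1.0 / (1.0 + math.exp(-x))

def predict_label(score, cutoff=0.5):
    """Hypothetical helper: map a raw score to class 0 or 1."""
    return 1 if sigmoid(score) >= cutoff else 0

print(sigmoid(0.0))         # 0.5
print(predict_label(2.0))   # 1
print(predict_label(-2.0))  # 0
```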

### TanH Function

![image.png](attachment:adbcde89-1837-4b3f-b524-60720ab2f989.png)

The hyperbolic tangent function, or TanH, is basically a scaled sigmoid function that is also shifted by subtracting 1. It outputs a value between -1 and 1 and is actually a good choice in hidden layers.
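
The "scaled and shifted sigmoid" description corresponds to the identity tanh(x) = 2 * sigmoid(2x) - 1. The sketch below checks it numerically in plain Python; the function names are illustrative.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    """TanH written as a scaled and shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

# Compare against the standard library implementation at a few points.
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, tanh_from_sigmoid(x), math.tanh(x))
```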

### ReLU Function

![image.png](attachment:3f3603a0-06f9-4131-88ef-8636fcbf8652.png)

The ReLU function is the most popular choice in most networks. It outputs 0 for negative values and simply returns the input for positive values, so it is linear for values greater than 0. It doesn't look much different from a plain linear transformation, but it is in fact non-linear. Besides being the most popular choice, it is typically also a very good choice for an activation function.

The rule of thumb: if you don't know which function to use, just use ReLU for hidden layers.
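
A plain-Python sketch of ReLU, together with a quick check that it is not linear: a linear function `f` would satisfy `f(a + b) == f(a) + f(b)`, which ReLU violates as soon as the inputs have different signs. The function name and the sample values are illustrative.

```python
def relu(x):
    """Return 0 for negative inputs and the input itself otherwise."""
    return max(0.0, x)

print([relu(x) for x in (-2.0, -0.5, 0.0, 1.5)])  # [0.0, 0.0, 0.0, 1.5]

# Additivity check: the two sides differ, so ReLU is non-linear.
a, b = -1.0, 1.0
print(relu(a + b), relu(a) + relu(b))  # 0.0 1.0
```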

### Leaky ReLU Function

![image.png](attachment:5d232e6b-0093-4145-af09-cd55bd5ae83f.png)

The Leaky ReLU function is a slightly modified and slightly improved version of ReLU. It still outputs the input for x > 0, but it multiplies the input by a very small value **a** for negative numbers. **a** is typically very small, for example 0.001. This version of ReLU tries to solve the so-called vanishing gradient problem. With the normal ReLU, negative values become 0 in the output, which means that the gradient later in backpropagation is also zero. When the gradient is zero, the corresponding weights are never updated. These neurons won't learn anything, and we say that these neurons are dead. This is why we sometimes want to use the Leaky ReLU function.

So whenever we notice that our weights don't update during training, we should try the Leaky ReLU function instead of the normal ReLU function.
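
The difference in gradients can be sketched in plain Python. The function names are illustrative, and `a=0.001` simply follows the example value mentioned in the text.

```python
def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, a=0.001):
    """Pass positive inputs through, scale negative inputs by a."""
    return x if x > 0 else a * x

def leaky_relu_grad(x, a=0.001):
    """Derivative of Leaky ReLU: 1 for positive inputs, a otherwise."""
    return 1.0 if x > 0 else a

x = -2.0
print(relu_grad(x))        # 0.0 -> no gradient flows back, the neuron is "dead"
print(leaky_relu_grad(x))  # 0.001 -> a small gradient still flows back
```

During backpropagation this local gradient is multiplied into the weight update, so a value of exactly 0 blocks the update, while a small non-zero value keeps the weights moving.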

### Softmax

![image.png](attachment:af6c5107-9216-4e5a-b14e-2e81dd2bf29d.png)

This basically squashes the inputs into outputs between 0 and 1 that sum to 1, so that we get a probability for each class as output. It is typically a good choice in the last layer of a multi-class classification problem.
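
A minimal plain-Python sketch of softmax, assuming the standard definition exp(x_i) / sum_j exp(x_j). Subtracting the maximum score first is a common numerical-stability trick and does not change the result. The input scores are made-up example values.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # roughly [0.659, 0.242, 0.099]
print(sum(probs))  # close to 1.0
```

The largest score gets the largest probability, which is why the predicted class is usually taken as the index of the maximum output.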

## Code

```python
import torch
import torch.nn as nn
import torch.nn.functional as F  # functional API (not used below, kept from the original)

# option 1 (create nn modules)
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(NeuralNet, self).__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()      # hidden-layer activation
        self.linear2 = nn.Linear(hidden_size, 1)
        self.sigmoid = nn.Sigmoid()  # output activation for binary classification

    def forward(self, x):
        out = self.linear1(x)
        out = self.relu(out)
        out = self.linear2(out)
        out = self.sigmoid(out)
        return out

# option 2 (use activation functions directly in forward pass)
class NeuralNet2(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(NeuralNet2, self).__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out = torch.relu(self.linear1(x))
        out = torch.sigmoid(self.linear2(out))
        return out
```
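
As a usage sketch, the same pattern can be run on a random batch. The class name `NeuralNet3`, the layer sizes, the batch shape, and the seed are all assumptions made for illustration; the hidden activation uses `F.leaky_relu` from the functional API with `negative_slope=0.001`, matching the example value of **a** mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralNet3(nn.Module):  # hypothetical variant, not from the video
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # Leaky ReLU in the hidden layer, sigmoid on the single output
        out = F.leaky_relu(self.linear1(x), negative_slope=0.001)
        out = torch.sigmoid(self.linear2(out))
        return out

torch.manual_seed(0)                 # only to make the sketch repeatable
model = NeuralNet3(input_size=4, hidden_size=8)
batch = torch.randn(5, 4)            # 5 samples with 4 features each

with torch.no_grad():                # inference only, no gradients needed
    probs = model(batch)

print(probs.shape)  # torch.Size([5, 1])
```

Each row of `probs` is a single value between 0 and 1, which fits the binary classification setup used in the two options above.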