## Activation Functions
(based on a tutorial by Python Engineer on YouTube)

**Activation Function**
 
 Without activation functions, the whole network would collapse into a single linear function, no matter how many layers it has, and a linear function cannot model more complicated (non-linear) relationships.
 
 So after each linear layer we apply a non-linear activation function, which lets the network learn non-linear mappings.
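
The collapse described above can be checked numerically. The following is a minimal sketch, assuming two small `nn.Linear` layers with arbitrary sizes (4, 8 and 3 are illustrative choices, not from the tutorial): stacked without an activation, they match one linear map with `W = W2 @ W1` and `b = W2 @ b1 + b2`; with a ReLU in between, they no longer do.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the sketch is reproducible

# Two linear layers; the sizes are arbitrary illustrative choices
linear1 = nn.Linear(4, 8)
linear2 = nn.Linear(8, 3)

# The single linear map that two stacked linear layers reduce to
W = linear2.weight @ linear1.weight                 # shape (3, 4)
b = linear2.weight @ linear1.bias + linear2.bias    # shape (3,)

x = torch.randn(5, 4)  # a batch of 5 random inputs

# No activation: the stack equals the collapsed linear map
stacked = linear2(linear1(x))
collapsed = x @ W.T + b
same_without_activation = torch.allclose(stacked, collapsed, atol=1e-5)

# ReLU in between: the result is no longer that linear map
with_relu = linear2(torch.relu(linear1(x)))
same_with_relu = torch.allclose(with_relu, collapsed, atol=1e-5)

print(same_without_activation, same_with_relu)
```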
 


**Different Types of Activation Functions**


    1- Step function
                outputs 1 if the input is greater than a threshold and 0 otherwise (not used in practice).
    2- Sigmoid
                formula: f(x) = 1 / (1 + e^(-x))
                output between 0 and 1
                typically used in the last layer of a binary classification problem.
    3- TanH
                formula: f(x) = 2 / (1 + e^(-2x)) - 1
                a scaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
                output between -1 and 1
                a good choice for hidden layers
    4- ReLU
                the most popular choice in most networks
                formula: f(x) = max(0, x)
                outputs 0 for negative inputs and the input itself for positive inputs =>
                linear for values greater than 0 and constant 0 otherwise => non-linear overall
                *** if you don't know which function to use, use ReLU for the hidden layers.
    5- Leaky ReLU
                modified/improved version of ReLU
                negative inputs are multiplied by a very small number instead of being set to zero
                (e.g. a = 0.001; PyTorch's nn.LeakyReLU defaults to negative_slope = 0.01)
                
                Tries to solve the "dying ReLU" problem, a form of vanishing gradient: with the normal ReLU the
                output for negative inputs is zero, so the derivative/gradient in backpropagation is zero, too.
                The weights feeding such a neuron are then not updated for those inputs, and a neuron whose
                input is always negative never learns anything (the neuron is "dead").
                
                So whenever you notice that some weights are not being updated during training, try
                LeakyReLU.
                
    6- Softmax
                Squashes the inputs to outputs between 0 and 1 that sum to 1, so they can be read as probabilities.
                
                => a good choice for the last layer of a multiclass classification problem.
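
The functions listed above can be tried on a small tensor. This is a minimal sketch: the sample values, the step threshold of 0 and the Leaky ReLU slope of 0.01 (PyTorch's default) are assumptions chosen for illustration. The second half compares the gradient of ReLU and Leaky ReLU at a negative input, which is the reason a ReLU neuron can stop learning.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])  # illustrative sample inputs

step = (x > 0).float()                          # step function, threshold 0 assumed
sigmoid = torch.sigmoid(x)                      # values in (0, 1)
tanh = torch.tanh(x)                            # values in (-1, 1)
relu = torch.relu(x)                            # max(0, x)
leaky = F.leaky_relu(x, negative_slope=0.01)    # 0.01 * x for negative x
softmax = torch.softmax(x, dim=0)               # non-negative, sums to 1

# Gradient at a negative input: zero for ReLU, small but non-zero for Leaky ReLU
a = torch.tensor([-1.0], requires_grad=True)
torch.relu(a).sum().backward()
relu_grad = a.grad.clone()

c = torch.tensor([-1.0], requires_grad=True)
F.leaky_relu(c, negative_slope=0.01).sum().backward()
leaky_grad = c.grad.clone()

print(relu_grad, leaky_grad)
```

Because the ReLU gradient is exactly zero here, no update reaches the weights behind that neuron for this input, while Leaky ReLU still passes a small gradient through.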

In [4]:
import torch
import torch.nn as nn
import numpy as np

## Two ways to use the activation functions

## 1. Create the activation functions as nn modules in `__init__`

In [5]:
class NeuralNet2(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet2, self).__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)  #one linear layer
        self.relu = nn.ReLU()                              #activation function
        self.linear2 = nn.Linear(hidden_size,num_classes)  #last layer
        
    def forward(self, x):
        out = self.linear1(x)
        out = self.relu(out)
        out = self.linear2(out)
        # no softmax here: nn.CrossEntropyLoss applies it internally to the raw outputs (logits)
        return out

model = NeuralNet2(input_size=28*28, hidden_size=5, num_classes=3)
criterion = nn.CrossEntropyLoss() #applies softmax too
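
The comment that `nn.CrossEntropyLoss` "applies softmax too" can be verified directly. This is a minimal sketch with made-up logits and targets (not from the tutorial): the loss on raw logits matches a log-softmax followed by the negative log-likelihood loss, which is why the model's `forward` returns logits without a softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up raw scores for a batch of 2 samples and 3 classes
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.3]])
targets = torch.tensor([0, 1])  # class indices, not one-hot vectors

# Cross-entropy computed directly on the logits
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# The same quantity built by hand: log-softmax, then negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(loss_ce.item(), loss_manual.item())
```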

## 2. Call the activation functions directly in the forward pass

In [6]:
class NeuralNet3(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet3, self).__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)  #one linear layer
        self.linear2 = nn.Linear(hidden_size, num_classes) #last layer
        
    def forward(self, x):
        out = torch.relu(self.linear1(x))
        out = torch.sigmoid(self.linear2(out))
        # sigmoid at the end: each output is squashed into (0, 1)
        return out

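# a single sigmoid output is the usual setup for binary classification
model = NeuralNet3(input_size=28*28, hidden_size=5, num_classes=1)
criterion = nn.BCELoss() # expects probabilities, so it pairs with the sigmoid output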
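
A sigmoid on the output layer is normally paired with a binary cross-entropy loss: `nn.BCELoss` expects probabilities, whereas `nn.CrossEntropyLoss` expects raw logits and applies softmax itself. The sketch below uses made-up logits and targets to show that sigmoid followed by `nn.BCELoss` gives the same value as `nn.BCEWithLogitsLoss` on the raw logits, so either pairing could be used, as long as the sigmoid is applied exactly once.

```python
import torch
import torch.nn as nn

# Made-up raw outputs of a single-output network for 3 samples
logits = torch.tensor([[0.8], [-1.2], [2.0]])
targets = torch.tensor([[1.0], [0.0], [1.0]])  # binary labels as floats

# Option A: sigmoid in the model, BCELoss on the probabilities
probs = torch.sigmoid(logits)
loss_bce = nn.BCELoss()(probs, targets)

# Option B: raw logits from the model, sigmoid folded into the loss
loss_bce_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_bce.item(), loss_bce_logits.item())
```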