# Activation Functions and Their Types

An activation function is applied to the output of each neuron in the hidden layers and determines how strongly that neuron is activated or deactivated. Let's explore the most commonly used activation functions:

1. Sigmoid
2. Tanh
3. ReLU (Rectified Linear Unit)
4. Leaky ReLU
5. PReLU (Parametric ReLU)
6. Softmax

## 1. Sigmoid function

It is defined as $\displaystyle z = \sigma(x) = \frac {1}{1 + e^{-x}} $. 

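The value of $z$ lies between 0 and 1. The derivative of the sigmoid function, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, lies between 0 and 0.25. Because backpropagation multiplies these small derivatives together across layers, the gradient shrinks toward zero in deep networks, which is the vanishing gradient problem.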
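The bounds above can be checked with a minimal sketch in plain Python. The function names `sigmoid` and `sigmoid_derivative` are illustrative choices, not part of any particular library, and the negative branch is just one common way to avoid overflow in `math.exp`.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^-x), written to avoid overflow in exp()."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x).
    ex = math.exp(x)
    return ex / (1.0 + ex)

def sigmoid_derivative(x: float) -> float:
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))              # 0.5
print(sigmoid_derivative(0.0))   # 0.25, the largest value the derivative takes
print(sigmoid_derivative(10.0))  # very close to 0: the function has saturated

# Upper bound on the gradient factor contributed by 10 stacked sigmoid layers.
print(0.25 ** 10)
```

Even in the best case, ten sigmoid layers scale the gradient by at most `0.25 ** 10`, which is below one in a million.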

**Advantages:**

- Smooth gradient
- Output values are bounded between 0 and 1, which normalizes the output.

**Disadvantages:**

- Prone to the vanishing gradient problem
- Function output is not zero-centered
- Computing exponentials increases the computational cost.

## 2. Tanh

It is defined as $\displaystyle z = \tanh(x) = \frac {e^x - e^{-x}}{e^x + e^{-x}} $

The value of tanh lies between -1 and 1. Its derivative, $1 - \tanh^2(x)$, lies between 0 and 1.

Unlike the sigmoid function, tanh is zero-centered: the curve passes through the origin when plotted, and its outputs are spread symmetrically between -1 and 1.
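As a quick check of these properties, here is a small sketch using Python's built-in `math.tanh`. The helper name `tanh_derivative` is an illustrative choice.

```python
import math

def tanh_derivative(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

print(math.tanh(0.0))         # 0.0, the curve passes through the origin
print(tanh_derivative(0.0))   # 1.0, the largest value the derivative takes
print(math.tanh(2.0), math.tanh(-2.0))  # same magnitude, opposite sign
print(tanh_derivative(5.0))   # close to 0: tanh saturates for large |x|
```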

**Advantages:**

- Zero-centered curve

- Output values are bounded between -1 and 1.

**Disadvantages:**

- Still prone to the vanishing gradient problem, since the derivative approaches 0 for inputs of large magnitude

- Higher computational cost, since exponential operations are involved.

## 3. ReLU (Rectified Linear Unit) 

ReLU is the activation function most widely used by researchers. It is defined as

$\mathrm{ReLU}(x) = \max(0, x)$ where $x$ is the input

This simple definition says two things:
- If $x \leq 0$, the value of ReLU is 0.
- If $x > 0$, the value of ReLU is $x$.

The derivative of ReLU is 0 for $x < 0$ and 1 for $x > 0$. It is undefined at exactly $x = 0$, where implementations conventionally use 0.
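The definition and its derivative can be sketched in a few lines of Python. The names `relu` and `relu_derivative` are illustrative, and treating the derivative at exactly 0 as 0 is a convention assumed here, not something the definition dictates.

```python
def relu(x: float) -> float:
    """ReLU: returns x for positive inputs and 0 otherwise."""
    return max(0.0, x)

def relu_derivative(x: float) -> float:
    """Derivative of ReLU; the value at x == 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0

for x in (-3.0, 0.0, 2.5):
    print(x, relu(x), relu_derivative(x))
```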

**Advantages:**

- ReLU is much faster to compute than sigmoid or tanh since it is just a $\max()$ operation.

- Mitigates the vanishing gradient problem, since the derivative is 1 for every positive input.

**Disadvantages:**

- The derivative is either 0 or 1. When it is 0, the gradient term in the weight update formula is 0, so the new weight equals the old weight and nothing is learned from that input.

- **If a neuron's input is negative for every training example, its gradient is always 0 and it stops learning permanently. This is called the Dead ReLU problem.**
- ReLU is not a zero-centered function.
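The first disadvantage can be illustrated with a single neuron and one gradient descent step. This is a sketch under stated assumptions: the squared-error loss, the learning rate, and the names `sgd_step`, `w`, `b`, `x`, and `y` are all hypothetical choices made for the example.

```python
def relu_derivative(z: float) -> float:
    return 1.0 if z > 0 else 0.0

def sgd_step(w: float, b: float, x: float, y: float, lr: float = 0.1):
    """One gradient descent step for a = relu(w * x + b) with loss (a - y)^2."""
    z = w * x + b                       # pre-activation
    a = max(0.0, z)                     # ReLU output
    dloss_da = 2.0 * (a - y)            # derivative of the squared error
    dz = dloss_da * relu_derivative(z)  # chain rule through ReLU
    return w - lr * dz * x, b - lr * dz

# Negative pre-activation: the ReLU derivative is 0, so nothing changes.
print(sgd_step(w=-1.0, b=0.0, x=2.0, y=1.0))

# Positive pre-activation: the gradient flows and the weights move.
print(sgd_step(w=1.0, b=0.0, x=2.0, y=1.0))
```

In the first call the weight and bias come back unchanged even though the prediction is wrong. If this happens for every training input, the neuron is dead.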

## 4. Leaky ReLU 

Leaky ReLU is similar to ReLU, but it addresses the Dead ReLU problem.

It makes one small change to the ReLU function: instead of outputting 0 for negative inputs, it outputs $0.01x$, so the gradient on that side is 0.01 rather than 0.

The final function is $\mathrm{LeakyReLU}(x) = \max(0.01x, x)$

**Note:** It has not been conclusively shown that Leaky ReLU is better than ReLU.
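A minimal sketch of the function above and its derivative follows. The parameter name `slope` is an illustrative choice, with the default of 0.01 taken from the definition.

```python
def leaky_relu(x: float, slope: float = 0.01) -> float:
    """Leaky ReLU: x for positive inputs, slope * x otherwise."""
    return x if x > 0 else slope * x

def leaky_relu_derivative(x: float, slope: float = 0.01) -> float:
    """The derivative is never exactly 0, so a gradient always flows."""
    return 1.0 if x > 0 else slope

for x in (-2.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_derivative(x))
```

For slopes between 0 and 1 this branch form gives the same result as `max(slope * x, x)`.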

## 5. PReLU (Parametric ReLU)

PReLU generalizes ReLU and Leaky ReLU by turning the slope for negative inputs into a parameter $a_i$. The function is defined as follows:

If $y_i > 0$ then $f(y_i) = y_i$

If $y_i \leq 0$ then $f(y_i) = a_i y_i$

Observations:

- If $a_i = 0$, it becomes ReLU
- If $a_i$ is a small fixed positive constant (such as 0.01), it becomes Leaky ReLU
- If $a_i$ is a learnable parameter, it becomes PReLU
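The three observations can be sketched in Python as follows. The function names, the starting slope of 0.25, the learning rate, and the upstream gradient value are all hypothetical and chosen only to make the example concrete.

```python
def prelu(y: float, a: float) -> float:
    """f(y) = y if y > 0, otherwise a * y."""
    return y if y > 0 else a * y

def prelu_grad_wrt_a(y: float) -> float:
    """Derivative of f with respect to the slope a: y on the negative side, else 0."""
    return 0.0 if y > 0 else y

y = -2.0
print(prelu(y, a=0.0))    # behaves like ReLU
print(prelu(y, a=0.01))   # behaves like Leaky ReLU

# When a is learnable, it is updated by gradient descent like any weight.
a = 0.25                  # hypothetical starting value
upstream_grad = 1.0       # hypothetical dLoss/df arriving from the next layer
lr = 0.1
a_new = a - lr * upstream_grad * prelu_grad_wrt_a(y)
print(a_new)
```

Only negative inputs contribute to the update of `a`, because the slope has no effect on the positive side.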



## 6. Softmax

The softmax activation function is used in the output layer for multi-class classification problems. It is defined as:

${\displaystyle \sigma (\mathbf {z} )_{i}={\frac {e^{z_{i}}}{\sum _{j=1}^{K}e^{z_{j}}}}}$

where:

- $\sigma$ = softmax
- $\mathbf{z}$ = input vector
- $e^{z_i}$ = exponential of the $i$-th element of the input vector
- $K$ = number of classes in the multi-class classifier
- $e^{z_j}$ = exponential of the $j$-th element of the input vector; the sum of these terms over all $K$ classes normalizes the outputs so that they add up to 1


**The number of nodes in the output layer = the number of output classes $K$**

What makes this activation function special is that it takes the raw outputs of the neural network and converts them into a vector of probabilities (an array with one entry per class) that sums to 1.

### Example

For example, suppose the output has three classes: [Good, Bad, Neutral].

The target labels are one-hot encoded: "Good" is represented as [1, 0, 0], "Bad" as [0, 1, 0], and "Neutral" as [0, 0, 1].

Softmax converts the raw outputs of the network into a probability vector of the same length, for instance [0.7, 0.2, 0.1], which can be compared against the one-hot target. The predicted class is the one with the highest probability, which is "Good" in this instance.
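The softmax formula can be sketched in plain Python as shown here. The raw scores in `logits` are made-up values for illustration, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(z: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    m = max(z)                            # shift by the max to avoid overflow
    exps = [math.exp(v - m) for v in z]   # e^{z_i} for each class, up to a common factor
    total = sum(exps)                     # the sum over all K classes
    return [e / total for e in exps]

classes = ["Good", "Bad", "Neutral"]
logits = [2.0, 1.0, 0.1]                  # hypothetical raw network outputs

probs = softmax(logits)
predicted = classes[probs.index(max(probs))]

print(probs)       # one probability per class
print(sum(probs))  # approximately 1.0
print(predicted)   # Good
```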

# Which Activation function should be used when?

There are many types of activation functions, so it is crucial to know which one to use in which situation to achieve the desired results. Let's look at this in terms of regression and classification (binary and multi-class).

## Regression

If the problem is a regression problem, the following activation functions should be used in the hidden and output layers:

**Hidden Layer:** ReLU Activation function

**Output Layer:** Linear Activation function (the output should be a continuous value)

## Classification

If the problem is a classification problem, the choice of output activation function depends on whether it is binary or multi-class classification.


### Binary Classification

**Hidden Layer:** ReLU Activation function

**Output Layer:** Sigmoid Activation Function (the output lies between 0 and 1)

### Multi-class Classification

**Hidden Layer:** ReLU Activation function

**Output Layer:** Softmax Activation Function

> #### In all three cases, ReLU is applied in the hidden layers, whereas the activation function applied in the output layer varies with the task.
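The choices above can be summarized as a small lookup table in Python. This is only a sketch: the task names, the dictionary `OUTPUT_ACTIVATION`, and the helper `apply_output_activation` are hypothetical, and the raw output values are made up.

```python
import math

def linear(z: list[float]) -> list[float]:
    """Identity: the raw output is the prediction (regression)."""
    return list(z)

def sigmoid(z: list[float]) -> list[float]:
    """Squash each value into (0, 1) (binary classification)."""
    out = []
    for v in z:
        if v >= 0:
            out.append(1.0 / (1.0 + math.exp(-v)))
        else:
            ev = math.exp(v)
            out.append(ev / (1.0 + ev))
    return out

def softmax(z: list[float]) -> list[float]:
    """Turn K raw scores into K probabilities (multi-class classification)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Output-layer activation per task; the hidden layers use ReLU in every case.
OUTPUT_ACTIVATION = {
    "regression": linear,
    "binary_classification": sigmoid,
    "multiclass_classification": softmax,
}

def apply_output_activation(task: str, raw_outputs: list[float]) -> list[float]:
    return OUTPUT_ACTIVATION[task](raw_outputs)

print(apply_output_activation("regression", [3.7]))
print(apply_output_activation("binary_classification", [0.0]))
print(apply_output_activation("multiclass_classification", [2.0, 1.0, 0.1]))
```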