# Activation function

An activation function is a core component of an artificial neural network: it decides whether, and how strongly, a neuron should be activated.

In artificial neural networks, the activation function defines the output of a node given an input or set of inputs.

**An important use of any activation function is to introduce non-linear properties into the network.**

In simple terms, a neuron calculates a weighted sum of its inputs (xi) using the weights (Wi), adds a bias, and the activation function then decides whether the neuron should be “fired” or not.

> Some activation functions also normalize the output of each neuron to a range between 0 and 1 or between -1 and 1.

![1.jpg](attachment:1.jpg)

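In a neural network, inputs are fed into the neurons in the input layer. Each connection into a neuron has a weight; the neuron multiplies each input by its weight, sums the results together with a bias, and passes that value through the activation function. The result is the output of the neuron, which is transferred to the next layer.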

The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold.
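The "gate" idea can be sketched in a few lines of plain Python. This is only an illustration: the names `step` and `neuron_output`, the threshold of 0 and the example numbers are all assumptions made for this sketch, not something taken from the figure above.

```python
def step(z, threshold=0.0):
    """Step activation: 1.0 if z reaches the threshold, otherwise 0.0."""
    return 1.0 if z >= threshold else 0.0


def neuron_output(inputs, weights, bias, activation=step):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical example values, chosen only for illustration.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]

# Weighted sum is 0.5 - 2.0 + 0.75 = -0.75; the bias shifts it before the gate.
fired_low_bias = neuron_output(inputs, weights, bias=0.5)   # z = -0.25, gate stays closed
fired_high_bias = neuron_output(inputs, weights, bias=1.0)  # z = 0.25, gate opens
```

Changing only the bias moves the neuron from "off" to "on", which is the thresholding behaviour described above.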


> Neural networks use non-linear activation functions, which help the network learn complex data, approximate almost any function that relates inputs to outputs, and provide accurate predictions.

#### Activation functions can be divided into two types:

1. Linear Activation Function
2. Non-linear Activation Functions

##### Linear or Identity Activation Function
As you can see, the function is a straight line. Therefore, the output of the function is not confined to any range.

![2.webp](attachment:2.webp)

**Equation** : f(x) = x

**Range** : (-infinity, infinity)

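A linear activation doesn’t help with the complexity of the data that is usually fed to neural networks: a stack of layers with linear activations is equivalent to a single linear layer, no matter how many layers are used.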
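Why a purely linear network adds no expressive power can be checked with a tiny example. The sketch below uses one-dimensional "layers" with made-up weights and biases (`w1`, `b1`, `w2`, `b2` are hypothetical values); the same algebra holds for weight matrices.

```python
def linear_layer(x, w, b):
    """A layer with the identity activation f(x) = x: just w * x + b."""
    return w * x + b


# Hypothetical parameters for two stacked layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5


def two_linear_layers(x):
    return linear_layer(linear_layer(x, w1, b1), w2, b2)


# The two layers collapse into one: w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
w_combined = w2 * w1
b_combined = w2 * b1 + b2


def single_linear_layer(x):
    return linear_layer(x, w_combined, b_combined)


xs = [-2.0, 0.0, 1.5, 10.0]
outputs_stacked = [two_linear_layers(x) for x in xs]
outputs_single = [single_linear_layer(x) for x in xs]
```

Both lists hold the same values, so the second layer did not let the network represent anything new.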

##### Non-linear Activation Function
Non-linear activation functions are the most widely used activation functions. Non-linearity lets the network fit a curve that looks something like this:

![3.webp](attachment:3.webp)

It makes it easier for the model to generalize or adapt to a variety of data and to differentiate between outputs.

The main terms needed to understand non-linear functions are:

**Derivative or Differential**: The change along the y-axis with respect to the change along the x-axis. It is also known as the slope.

**Monotonic function**: A function which is either entirely non-increasing or non-decreasing.
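Both terms can be checked numerically. The sketch below is only an approximation under stated assumptions: the derivative is estimated with a central difference (step `h` chosen arbitrarily), monotonicity is tested on a finite grid of points, and the cubic `cube` is just a convenient example function.

```python
def numerical_derivative(f, x, h=1e-5):
    """Central-difference estimate of the slope of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


def is_monotonic(f, xs):
    """True if f is entirely non-decreasing or entirely non-increasing on the grid xs."""
    ys = [f(x) for x in xs]
    non_decreasing = all(a <= b for a, b in zip(ys, ys[1:]))
    non_increasing = all(a >= b for a, b in zip(ys, ys[1:]))
    return non_decreasing or non_increasing


def cube(x):
    return x ** 3


xs = [i / 10 for i in range(-50, 51)]  # grid from -5.0 to 5.0

slope_at_one = numerical_derivative(cube, 1.0)  # analytic slope is 3 * x**2 = 3
cube_is_monotonic = is_monotonic(cube, xs)
slope_is_monotonic = is_monotonic(lambda x: numerical_derivative(cube, x), xs)
```

Here the function itself is monotonic, while its slope falls and then rises again, so a monotonic function does not need to have a monotonic derivative.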

The Nonlinear Activation Functions are mainly divided on the basis of their range or curves.

## Commonly used activation functions

## 1. Sigmoid function

The sigmoid activation function is widely used because its output always lies between 0 and 1, so it can be read as a probability-like score. It is therefore a common choice when the network has to make a yes/no decision or predict a value in that range.

The function formula and chart are as follows

![4.png](attachment:4.png)

In the sigmoid function, the output lies in the open interval (0, 1). It is tempting to read the output as a probability, but strictly speaking it should not be treated as one. The sigmoid function used to be more popular than it is now. It can be thought of as the firing rate of a neuron: the middle, where the slope is relatively large, is the sensitive area of the neuron, and the sides, where the slope is very gentle, are the neuron's inhibitory area.

The function itself has certain defects.

1) When the input moves away from the origin, the gradient of the function becomes very small, almost zero. During backpropagation, the chain rule is used to compute the gradient of the loss with respect to each weight w. When backpropagation passes through a sigmoid function, the factor it contributes to this chain is very small, and the chain may pass through many sigmoid functions. As a result, the gradient that reaches the weight w is tiny and the weight is barely updated, which is not conducive to optimization. This problem is called gradient saturation or the vanishing gradient problem.

2) The function output is not centered on 0, which reduces the efficiency of weight updates.

3) The sigmoid function performs exponential operations, which are slower for computers.
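The first defect is easy to see numerically. The sketch below writes the sigmoid as 1 / (1 + e^-x) and uses the identity that its derivative is s * (1 - s), where s is the sigmoid output. The helper `max_chain_factor` is a hypothetical name, and it ignores the weights in the chain, so it only shows the best case contributed by the activations alone.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x), arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


grad_at_0 = sigmoid_grad(0.0)    # the largest value the gradient can take
grad_at_5 = sigmoid_grad(5.0)    # already well below 0.01
grad_at_10 = sigmoid_grad(10.0)  # smaller still


def max_chain_factor(n_layers):
    """Upper bound on the product of n sigmoid gradients (weights ignored)."""
    return grad_at_0 ** n_layers
```

Since the gradient never exceeds its value at 0, every sigmoid on the path shrinks the backpropagated signal, and the shrinkage compounds with depth.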


**Advantages of the sigmoid function**:

1. Smooth gradient, preventing “jumps” in output values.
2. Output values bounded between 0 and 1, normalizing the output of each neuron.
3. Clear predictions, i.e. outputs very close to 1 or 0 for inputs of large magnitude.


**Sigmoid has three major disadvantages**:
* Prone to vanishing gradients
* Function output is not zero-centered
* Exponential operations are relatively time-consuming


## 2. Tanh or hyperbolic tangent Activation Function
tanh is similar to the logistic sigmoid, but its output is zero-centered, which usually makes it the better choice of the two. The range of the tanh function is (-1, 1). tanh is also sigmoidal (S-shaped).

The tanh function formula and curve are as follows

![6.svg](attachment:6.svg)

![5.png](attachment:5.png)

Tanh is the hyperbolic tangent function. The curves of the tanh function and the sigmoid function are relatively similar. Let's compare them. First of all, when the input is very large or very small, the output is almost flat and the gradient is small, which is not conducive to weight updates. The difference is the output interval: tanh outputs values in (-1, 1), centered on 0, while sigmoid outputs values in (0, 1).
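The similarity of the two curves is not a coincidence: tanh is a rescaled and shifted sigmoid, tanh(x) = 2 * sigmoid(2x) - 1. The sketch below checks this identity on a small symmetric grid (the grid itself is an arbitrary choice) and compares the average output of both functions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


xs = [i / 2 for i in range(-8, 9)]  # symmetric grid: -4.0, -3.5, ..., 4.0

# Largest deviation from the identity tanh(x) = 2 * sigmoid(2x) - 1 on the grid.
max_identity_error = max(
    abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) for x in xs
)

# Average output over inputs that are themselves centered on zero.
mean_tanh = sum(math.tanh(x) for x in xs) / len(xs)
mean_sigmoid = sum(sigmoid(x) for x in xs) / len(xs)
```

For zero-centered inputs the tanh outputs average to about 0, while the sigmoid outputs average to about 0.5, which is the difference in output interval described above.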

In general binary classification problems, the tanh function is used for the hidden layers and the sigmoid function is used for the output layer. However, this is not a fixed rule: the activation function to use must be chosen according to the specific problem, or found by experimentation.

**Comparison**

![7.webp](attachment:7.webp)

The advantage is that negative inputs are mapped to strongly negative outputs and inputs near zero are mapped to outputs near zero in the tanh graph.

- The function is differentiable.

- The function is monotonic while its derivative is not monotonic.

- The tanh function is mainly used for classification between two classes.

> Both tanh and logistic sigmoid activation functions are used in feed-forward nets.



## 3. ReLU (Rectified Linear Unit) Activation Function
ReLU is currently the most widely used activation function, since it appears in almost all convolutional neural networks and deep learning models.

ReLU function formula and curve are as follows

![8.svg](attachment:8.svg)

![9.png](attachment:9.png)

As you can see, the ReLU is half rectified (from bottom). f(z) is zero when z is less than zero and f(z) is equal to z when z is above or equal to zero.

Range: [0, infinity)

The function and its derivative both are monotonic.
        
The ReLU (Rectified Linear Unit) function is currently one of the more popular activation functions. Compared with the sigmoid function and the tanh function, it has the following advantages:

1) When the input is positive, there is no gradient saturation problem.

2) The calculation speed is much faster. The ReLU function only involves a comparison and a linear relationship. Whether in the forward or the backward pass, it is much faster than sigmoid and tanh. (Sigmoid and tanh need to calculate an exponential, which is slower.)

Of course, there are disadvantages:

1) When the input is negative, ReLU is completely inactive: its output is zero. In forward propagation this is not a problem in itself, since some areas are sensitive and some are insensitive. In backpropagation, however, the gradient for a negative input is exactly zero, so the weights feeding that neuron receive no update, which is a problem similar to the saturation of the sigmoid and tanh functions. A neuron that receives negative inputs for every training example stops learning, which is known as the dying ReLU problem.

2) The output of the ReLU function is either 0 or a positive number, which means that the ReLU function is not zero-centered.

**Important Note**:

The issue is that all negative values become zero immediately, which decreases the ability of the model to fit or train on the data properly. Any negative input given to the ReLU activation function is turned into zero, which in turn affects the resulting graph because negative values are not mapped appropriately.
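The behaviour described above can be sketched as follows. The gradient at exactly z = 0 is not defined mathematically; setting it to 0 here is a convention assumed for this sketch, and the example pre-activation values are made up.

```python
def relu(z):
    """ReLU: zero for negative z, z itself otherwise."""
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU; the value at z == 0 is set to 0 by convention here."""
    return 1.0 if z > 0 else 0.0


pre_activations = [-3.0, -0.5, 0.0, 0.5, 3.0]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

# A neuron whose pre-activation is negative for every example receives
# a zero gradient everywhere, so its weights are never updated.
dead_neuron_grads = [relu_grad(z) for z in [-2.0, -1.0, -0.1]]
neuron_is_dead = all(g == 0.0 for g in dead_neuron_grads)
```

For positive inputs the gradient is always 1, so there is no saturation on that side; the zero gradient on the negative side is what lets a neuron "die".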



## 4. Leaky ReLU
It is an attempt to solve the dying ReLU problem.

![10.svg](attachment:10.svg)



![11.webp](attachment:11.webp)

Can you see the Leak?

The leak helps to increase the range of the ReLU function. Usually, the value of a is 0.01 or so.

When a is not fixed but sampled at random during training, the function is called Randomized ReLU.

Therefore the range of the Leaky ReLU is (-infinity, infinity).

Both Leaky and Randomized ReLU functions are monotonic in nature. Their derivatives are also monotonic.

**Note**

In order to solve the dead ReLU problem, people proposed setting the negative half of ReLU to 0.01x instead of 0. Another intuitive idea is a parameter-based method, Parametric ReLU: f(x) = max(αx, x), in which α can be learned through backpropagation. In theory, Leaky ReLU has all the advantages of ReLU and avoids the dead ReLU problem, but in practice it has not been conclusively shown that Leaky ReLU is always better than ReLU.
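A minimal sketch of both ideas, assuming a slope of 0.01 as the default leak. The function names are hypothetical, and `prelu_grad_alpha` only shows the quantity that backpropagation would use to update the slope in Parametric ReLU; it is not a full training loop.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Gradient with respect to x: never exactly zero, so the neuron cannot die."""
    return 1.0 if x > 0 else alpha


def prelu_grad_alpha(x):
    """Parametric ReLU: gradient of the output with respect to the slope alpha."""
    return 0.0 if x > 0 else x


xs = [-4.0, -1.0, 0.0, 2.0]
outputs = [leaky_relu(x) for x in xs]

# For 0 < alpha < 1 the piecewise form agrees with max(alpha * x, x).
outputs_max_form = [max(0.01 * x, x) for x in xs]

grads = [leaky_relu_grad(x) for x in xs]
alpha_grads = [prelu_grad_alpha(x) for x in xs]
```

Negative inputs still pass a small gradient back, and in the parametric variant they are also the only inputs that influence how the slope is learned.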

## 5. ELU (Exponential Linear Unit) function

![12.svg](attachment:12.svg)

![13.jpg](attachment:13.jpg)

ELU was also proposed to solve the problems of ReLU. ELU has the main advantages of ReLU, and in addition:

- No Dead ReLU issues
- The mean of the output is close to 0, zero-centered

One small problem is that it is slightly more computationally intensive. Similar to Leaky ReLU, although theoretically better than ReLU, there is currently no good evidence in practice that ELU is always better than ReLU.
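As a rough sketch, ELU can be written as x for positive inputs and α(e^x - 1) otherwise. The default α = 1.0 and the symmetric test grid below are assumptions made for illustration.

```python
import math


def relu(x):
    return max(0.0, x)


def elu(x, alpha=1.0):
    """ELU: x for positive inputs, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)


def elu_grad(x, alpha=1.0):
    """Gradient of ELU: 1 for positive inputs, alpha * e^x otherwise (never exactly 0)."""
    return 1.0 if x > 0 else alpha * math.exp(x)


xs = [i / 2 for i in range(-6, 7)]  # symmetric grid: -3.0, -2.5, ..., 3.0

mean_relu = sum(relu(x) for x in xs) / len(xs)
mean_elu = sum(elu(x) for x in xs) / len(xs)

saturation_value = elu(-100.0)  # approaches -alpha for very negative inputs
```

On this grid the ELU outputs average closer to zero than the ReLU outputs, and the gradient stays positive for negative inputs, at the cost of an exponential per negative input.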

## 6. Scaled Exponential Linear Unit (SELU) activation function

![16.webp](attachment:16.webp)

![17.webp](attachment:17.webp)

SELU is a scaled variant of the ELU activation function. It uses two fixed parameters, α and λ, whose values are derived so that standardized inputs (mean of 0 and standard deviation of 1) stay standardized after the activation. The suggested values are α ≈ 1.6733 and λ ≈ 1.0507.

The major advantage of using SELU is that it provides self-normalization (that is, the output of the SELU activation preserves the mean of 0 and standard deviation of 1), which helps to avoid the vanishing or exploding gradients problem. SELU will provide self-normalization if: (a) the neural network consists only of a stack of dense layers, (b) all the hidden layers use the SELU activation function, (c) the input features are standardized (having a mean of 0 and a standard deviation of 1), (d) the hidden layer weights are initialized with LeCun normal initialization, and lastly, (e) the network is sequential.
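The self-normalizing property can be checked with a small experiment. The sketch below is a simplification: instead of building a full network, it feeds standard-normal values (an idealized stand-in for the pre-activations of a properly initialized dense layer) through SELU once and measures the mean and standard deviation of the result. The sample size and random seed are arbitrary choices.

```python
import math
import random

ALPHA = 1.6733   # rounded values quoted in the text
LAMBDA = 1.0507


def selu(x, alpha=ALPHA, scale=LAMBDA):
    """SELU: scale * x for positive inputs, scale * alpha * (e^x - 1) otherwise."""
    return scale * (x if x > 0 else alpha * math.expm1(x))


rng = random.Random(0)  # fixed seed so the experiment is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
activated = [selu(z) for z in samples]

mean = sum(activated) / len(activated)
variance = sum((a - mean) ** 2 for a in activated) / len(activated)
std = math.sqrt(variance)
```

With these constants, `mean` stays near 0 and `std` near 1, which is the fixed point that the conditions (a) to (e) are meant to preserve from layer to layer.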

## 7. Softmax Activation Function
 
Softmax is used mainly at the last layer, i.e. the output layer, for decision making, in the same way as sigmoid. It converts the raw scores of the output layer into positive values that sum to one, so each value can be read as the probability of the corresponding class.

For **binary classification**, both sigmoid and softmax are suitable, but for multi-class classification problems we generally use **softmax** together with the **cross-entropy** loss.

![15.png](attachment:15.png)
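A minimal sketch of softmax and the cross-entropy loss for a single example follows. Subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result; the example scores and the function names are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into positive values that sum to one."""
    m = max(logits)                      # subtract the max to avoid overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target_index])


logits = [2.0, 1.0, 0.1]                 # hypothetical output-layer scores
probs = softmax(logits)
loss_if_class_0 = cross_entropy(probs, 0)  # true class has the highest score
loss_if_class_2 = cross_entropy(probs, 2)  # true class has the lowest score

# Binary case: softmax over the scores [z, 0] gives the same value as sigmoid(z).
z = 1.5
binary_probs = softmax([z, 0.0])
sigmoid_z = 1.0 / (1.0 + math.exp(-z))
```

The loss is small when the true class receives a high probability and grows as that probability shrinks, and the binary check shows why sigmoid and softmax are interchangeable for two classes.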

**Note**

#### Which activation function you should use in hidden layers?

There is no single answer to this question; it depends on the problem at hand. However, one can start with the order of preference SELU > ELU > Leaky ReLU > ReLU > tanh > sigmoid. If run-time matters, Leaky ReLU may be a good choice. If your neural network architecture prevents you from meeting SELU’s self-normalizing conditions, then ELU might give better results.

There are several other activation functions one can experiment with, such as Parametric leaky ReLU, in which α (the degree to which the function leaks) is treated as a parameter and learned during backpropagation like any other model parameter (for example, weights and biases). Another variant of ReLU is Randomized leaky ReLU (RReLU), in which α is chosen randomly during training and fixed to an average value during testing.
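The difference between training and testing in Randomized leaky ReLU can be sketched as follows. The bounds `lower` and `upper` for the random slope are illustrative assumptions, not values given in the text, and the function name `rrelu` is hypothetical.

```python
import random


def rrelu(x, lower=1 / 8, upper=1 / 3, training=True, rng=random):
    """Randomized leaky ReLU for a single value.

    Training: the negative slope is drawn uniformly from [lower, upper].
    Testing:  the slope is fixed to the average of the two bounds.
    """
    if x > 0:
        return x
    alpha = rng.uniform(lower, upper) if training else (lower + upper) / 2
    return alpha * x


rng = random.Random(0)  # fixed seed so the example is repeatable

# During training the same negative input gives slightly different outputs.
train_outputs = [rrelu(-1.0, training=True, rng=rng) for _ in range(5)]

# During testing the output is deterministic.
test_output = rrelu(-1.0, training=False)
```

The randomness acts only on negative inputs and only during training, so predictions at test time are reproducible.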

# Happy Learning