**Shoutout** to **DataCamp** https://www.datacamp.com/tutorial/introduction-to-activation-functions-in-neural-networks

# Activation Functions

An activation function is a transformation applied to the input of a neuron to produce its output. Without an activation function, the whole neural network is a series of linear transformations, which collapses into a single linear transformation of the input, much like linear regression. Capturing only linear relationships is fine for some problems, but real-world data is rarely purely linear, and adding hidden layers does not help unless something non-linear sits between them. Activation functions are the transformations, applied at the neuron level, that capture these non-linearities. This greatly increases the flexibility and power of neural networks to model complex data.
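The claim that layers without an activation function add no expressive power can be checked with a small sketch. It assumes a toy network with one input, two single-neuron layers, and arbitrary illustrative weights (`W1`, `B1`, `W2`, `B2` are made up for this example):

```python
# Arbitrary illustrative weights and biases for two single-neuron layers.
W1, B1 = 2.0, 1.0
W2, B2 = -3.0, 0.5

def stacked(x):
    """Apply layer 1 then layer 2, with no activation function in between."""
    hidden = W1 * x + B1
    return W2 * hidden + B2

def collapsed(x):
    """A single linear layer with weight W2*W1 and bias W2*B1 + B2."""
    return (W2 * W1) * x + (W2 * B1 + B2)

# Both functions give the same output for every input tried.
for x in [-2.0, 0.0, 1.5]:
    print(x, stacked(x), collapsed(x))
```

The two-layer network and the single-layer one agree on every input, so the extra layer adds nothing until a non-linear activation is placed between the layers.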

The name comes from biology. Neurons in the brain fire only when the signal they receive from receptors reaches a certain strength, and this process is called activation. Neural networks borrowed the term from neurology, so the transformations that capture non-linearity are called activation functions.


![image.png](attachment:image.png)


Source: https://www.baeldung.com/cs/sigmoid-vs-tanh-functions

## Types of Activation Functions

There are various types of activation functions, and each has its own benefits depending on the use case.

### 1. **Linear Activation Function**
It is the most straightforward activation function: it simply returns the input x as the output. Graphically, it looks like a straight line with a slope of 1.

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

The main use case of the linear activation function is in the output layer of a neural network used for regression. For regression problems where we want to predict a numerical value, using a linear activation function in the output layer ensures the neural network outputs a numerical value. The linear activation function does not squash or transform the output, so the actual predicted value is returned.

However, the linear activation function is rarely used in hidden layers of neural networks. This is because it does not provide any non-linearity. The whole point of hidden layers is to learn non-linear combinations of the input features. Using a linear activation throughout would restrict the model to just learning linear transformations of the input.


</br>

### 2. **Sigmoid Function**
The sigmoid function takes a real number and gives an output between 0 and 1, so the output can be interpreted as a probability. That is why it is also used in the output neurons of a prediction task. It approaches 0 for large negative inputs and 1 for large positive inputs. The sigmoid function was popular in early neural networks because its gradient is strongest when the unit's output is near 0.5, which allows efficient backpropagation training in that range. However, the sigmoid function suffers from the "vanishing gradient" problem, which hampers learning in deep neural networks.

As the input values become significantly positive or negative, the function saturates at 0 or 1, with an extremely flat slope as we can see in the illustration below. In these regions, the gradient is very close to zero. This results in very small changes in the weights during backpropagation, particularly for neurons in the earlier layers of deep networks, which makes learning painfully slow or even halts it. This is referred to as the **vanishing gradient problem** in neural networks.
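The saturation described above can be seen numerically. The following is a minimal sketch using only the standard library; the function names `sigmoid` and `sigmoid_grad` are chosen for this example:

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written in two branches to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative of the sigmoid, computed as s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient is largest at x = 0 and shrinks quickly as |x| grows.
for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
```

The gradient peaks at 0.25 when x = 0. Backpropagation multiplies one such factor per sigmoid layer, so even in the best case, and ignoring the weights, ten layers contribute a factor of at most 0.25 ** 10, which is less than one millionth.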

The main use case of the sigmoid function is as the activation for the output layer of binary classification models. It squashes the output to a probability value between 0 and 1, which can be interpreted as the probability of the input belonging to a particular class.


![image-4.png](attachment:image-4.png)

![image-3.png](attachment:image-3.png)

</br>

### 3. **Tanh (Hyperbolic Tangent) Function**
Tanh and sigmoid may look very similar in illustrations but unlike the sigmoid function, tanh is zero-centered, which means that its output is symmetric around the origin of the coordinate system. This is often considered an advantage because it can help the learning algorithm converge faster. 


![image-2.png](attachment:image-2.png)


![image-3.png](attachment:image-3.png)

Source: https://www.datacamp.com/tutorial/introduction-to-activation-functions-in-neural-networks

Comparing their gradients, we see that at 0 the gradient of tanh is exactly 4 times that of the sigmoid function (1 versus 0.25). Near zero, tanh therefore has stronger gradients than sigmoid. Stronger gradients often result in faster learning and convergence during training, because they make the network somewhat more resilient to vanishing gradients than it would be with the sigmoid function.
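The gradient comparison can be reproduced with a short sketch. It relies on the standard identities that the derivative of tanh is 1 - tanh(x)^2 and the derivative of the sigmoid is s(x) * (1 - s(x)); the helper names are chosen for this example:

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid (simple form, fine for the small inputs used here)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Compare the two gradients at zero and further away from zero.
for x in [0.0, 1.0, 3.0]:
    print(f"x={x:3.1f}  tanh'={tanh_grad(x):.5f}  sigmoid'={sigmoid_grad(x):.5f}")
```

At x = 0 the two gradients are 1 and 0.25. Both shrink towards zero as |x| grows, which is why tanh does not escape saturation either.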

![image.png](attachment:image.png)

Source: https://www.baeldung.com/cs/sigmoid-vs-tanh-functions

Also, because the output of tanh ranges between -1 and +1 and is symmetric around zero, training tends to converge faster.


Despite its advantages in training speed and its zero-centred output, tanh is a bounded function that saturates for large positive or negative inputs, so it still suffers from the vanishing gradient problem.

</br>

### 4. **ReLU Function**

ReLU stands for Rectified Linear Unit. It resembles a linear function, but the non-linearity comes from the max function that is applied on top.

![image.png](attachment:image.png)

Plotting the function gives the following illustration:

![image-2.png](attachment:image-2.png)


Source: https://www.datacamp.com/tutorial/introduction-to-activation-functions-in-neural-networks

It thresholds the input at zero, returning 0 for negative values and the input itself for positive values.

For inputs greater than 0, ReLU acts as a linear function with a gradient of 1. This means that it does not alter the scale of positive inputs and allows the gradient to pass through unchanged during backpropagation. This property is critical in mitigating the vanishing gradient problem.
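As a minimal sketch, ReLU and its gradient can be written in a few lines. The gradient at exactly x = 0 is undefined mathematically; returning 0 there is a convention assumed for this example:

```python
def relu(x):
    """Rectified Linear Unit: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0

for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    print(f"x={x:5.1f}  relu={relu(x):4.1f}  gradient={relu_grad(x):3.1f}")
```

Unlike sigmoid or tanh, the gradient for positive inputs stays at 1 no matter how large the input is, so it does not shrink as it is passed back through active units.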

Even though ReLU is linear on each half of its input space, it is a non-linear function overall: its slope changes abruptly from 0 to 1 at x=0, which is also the one point where it is not differentiable. This non-linearity allows neural networks to learn complex patterns.
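One way to check that ReLU is non-linear: a linear function f must satisfy f(a + b) = f(a) + f(b) for all inputs. The sketch below picks one illustrative pair of values for which ReLU breaks this rule:

```python
def relu(x):
    """Rectified Linear Unit: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)

a, b = 1.0, -1.0          # illustrative values, one on each side of zero
lhs = relu(a + b)         # relu(0.0)
rhs = relu(a) + relu(b)   # relu(1.0) + relu(-1.0)
print("relu(a + b)       =", lhs)
print("relu(a) + relu(b) =", rhs)
```

The two sides differ, so ReLU cannot be a linear function even though each of its two pieces is a straight line.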

Since ReLU outputs zero for all negative inputs, it naturally leads to sparse activations; at any time, only a subset of neurons are activated, leading to more efficient computation.
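The sparsity effect can be illustrated with a rough sketch. It assumes, purely for illustration, that the inputs to a layer of ReLU units are drawn from a standard normal distribution; real pre-activations depend on the data and the weights:

```python
import random

def relu(x):
    """Rectified Linear Unit: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)

random.seed(0)  # fixed seed so the run is repeatable
# Hypothetical pre-activations: 10,000 draws from a standard normal distribution.
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]
activations = [relu(z) for z in pre_activations]

# Fraction of units whose output is exactly zero (inactive units).
zero_fraction = sum(1 for a in activations if a == 0.0) / len(activations)
print(f"fraction of inactive units: {zero_fraction:.3f}")
```

Under this assumption roughly half of the units output zero, since about half of the draws are negative.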

The ReLU function is computationally inexpensive because it involves simple thresholding at zero. This allows networks to scale to many layers without a significant increase in computational burden, compared to more complex functions like tanh or sigmoid.


Source: https://www.datacamp.com/tutorial/introduction-to-activation-functions-in-neural-networks

</br>

### 5. **Softmax Function**


The softmax activation function, also known as the normalized exponential function, is particularly useful within the context of multi-class classification problems. This function operates on a vector, often referred to as the logits, which represents the raw predictions or scores for each class computed by the previous layers of a neural network.

For input vector x with elements x1, x2, ..., xC, the softmax function is defined as:

![image.png](attachment:image.png)


The output of the softmax function is a probability distribution that sums up to one. Each element of the output represents the probability that the input belongs to a particular class.

The use of the exponential function ensures that all output values are non-negative. This is crucial because probabilities cannot be negative.
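A minimal sketch of softmax follows. It exponentiates each score and divides by the sum of the exponentials. Subtracting the maximum score first is a common numerical-stability step added here; it is not part of the definition and does not change the result. The example logits are made up:

```python
import math

def softmax(logits):
    """Softmax over a list of scores, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # illustrative logits for three classes
print("probabilities:", [round(p, 4) for p in probs])
print("sum:", sum(probs))
```

Every output is positive and the outputs sum to one (up to floating-point rounding), which is what allows them to be read as a probability distribution over the classes.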

Softmax amplifies differences in the input vector. Even small differences in the input values can lead to substantial differences in the output probabilities, with the highest input value(s) tending to dominate in the resulting probability distribution.

Softmax is typically used in the output layer of a neural network when the task involves classifying an input into one of several (more than two) possible categories (multi-class classification).

The probabilities produced by the softmax function can be interpreted as confidence scores for each class, providing insight into the model's certainty about its predictions.

Because softmax amplifies differences, it can be sensitive to outliers or extreme values. For example, if the input vector has a very large value, softmax can "squash" the probabilities of other classes, leading to an overconfident model.
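The effect of one extreme value can be sketched by comparing two made-up logit vectors that differ only in their largest entry:

```python
import math

def softmax(logits):
    """Softmax over a list of scores, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

moderate = softmax([1.0, 2.0, 3.0])   # evenly spaced scores
extreme = softmax([1.0, 2.0, 10.0])   # same scores, but the last one is an outlier

print("moderate:", [round(p, 4) for p in moderate])
print("extreme: ", [round(p, 4) for p in extreme])
```

In the second case almost all of the probability mass goes to the last class, and the other two classes are pushed close to zero even though their own scores did not change.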

![image-2.png](attachment:image-2.png)


Source: https://www.datacamp.com/tutorial/introduction-to-activation-functions-in-neural-networks