# Activation Functions

## Overview

The `activations` module provides a collection of widely used activation functions for neural networks.
Alongside each function, the module implements its gradient, which helps in understanding and applying the
backpropagation algorithm.
The `softmax` function is the exception: because it maps multiple inputs to multiple outputs, its gradient
requires a separate computation.


In [None]:
from pathlib import Path

ROOT_DIR = Path('..') / '..'
!pip install -q -r {ROOT_DIR / 'requirements.txt'}

import torch  # needed for running the examples
from tqdm import tqdm  # prettier progress bars



## Sigmoid

The sigmoid function maps any real input to a value between 0 and 1.
It is commonly used in binary classification tasks, where the output is interpreted as the probability of the input
belonging to the positive class.

Formally, the sigmoid function is defined as:

$$
\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}
$$

And its gradient is:

$$
\mathrm{sigmoid}'(x) = \mathrm{sigmoid}(x)(1 - \mathrm{sigmoid}(x))
$$

This gradient is at most 0.25 (at x = 0) and approaches 0 for inputs of large magnitude, which can cause the vanishing gradient problem.
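The saturation behavior can be checked numerically. The following is a minimal sketch in plain Python (standard library only, independent of the `activations` module); the helper names `scalar_sigmoid` and `scalar_sigmoid_grad` are hypothetical and exist only for this illustration.

```python
import math


def scalar_sigmoid(x: float) -> float:
    """Logistic sigmoid, branching on the sign of x to avoid overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def scalar_sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = scalar_sigmoid(x)
    return s * (1.0 - s)


# The gradient is largest at 0 and shrinks quickly as |x| grows.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x = {x:6.1f}   gradient = {scalar_sigmoid_grad(x):.6f}")
```

When many such small factors are multiplied together during backpropagation, the resulting gradient can become too small to drive learning in the earlier layers.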



### Examples

#### Example 1: Compute the sigmoid of a tensor

In [None]:
from activations import sigmoid

x = torch.Tensor([0, 1, 2])
result = sigmoid(x)
print(result)  # tensor([0.5000, 0.7311, 0.8808])

#### Example 2: Compute the gradient of the sigmoid of a tensor

In [None]:
x = torch.Tensor([0, 1, 2])
result_with_gradient = sigmoid(x, gradient=True)
print(result_with_gradient)  # tensor([0.2500, 0.1966, 0.1050])

#### Example 3: Higher-dimensional tensors

In [None]:
x = torch.Tensor([[0, 1], [-1, 2]])
result = sigmoid(x)
print(result)
# tensor([[0.5000, 0.7311],
#         [0.2689, 0.8808]])

#### Example 4: Testing against PyTorch's implementation

In [None]:
for _ in tqdm(range(100)):
    x = torch.randn((100, 100, 100))
    actual, expected = sigmoid(x), torch.sigmoid(x)
    assert torch.allclose(actual, expected), f"Expected {expected}, got {actual}"
print("All tests passed!")

## Tanh

**Location**: [`tanh.py`](tanh.py)

**Formula**:

![](https://quicklatex.com/cache3/9f/ql_72eebf5038a7f863b236caa209a7b09f_l3.png)

*Description*: The tanh function maps its input to a value between -1 and 1.

**Gradient**:

![](https://quicklatex.com/cache3/5a/ql_1aec5bd7f300437b4a6a9f9342a6165a_l3.png)
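As a rough illustration, the sketch below assumes the standard definitions, tanh(x) = (e^x - e^-x) / (e^x + e^-x) with derivative 1 - tanh(x)^2, and checks the derivative against a numerical estimate. It uses plain Python rather than `tanh.py`; the names `tanh_from_exp`, `tanh_grad`, and `central_difference` are hypothetical helpers for this example.

```python
import math


def tanh_from_exp(x: float) -> float:
    """tanh written out with exponentials (adequate for moderate |x|)."""
    e_pos, e_neg = math.exp(x), math.exp(-x)
    return (e_pos - e_neg) / (e_pos + e_neg)


def tanh_grad(x: float) -> float:
    """Derivative of tanh, assuming the identity 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Numerical derivative, used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, 0.0, 0.5, 3.0):
    # The explicit formula agrees with the standard library implementation.
    assert math.isclose(tanh_from_exp(x), math.tanh(x), rel_tol=1e-12, abs_tol=1e-12)
    # The analytic gradient agrees with the finite-difference estimate.
    assert math.isclose(tanh_grad(x), central_difference(math.tanh, x), abs_tol=1e-8)
    print(f"x = {x:5.1f}   tanh = {math.tanh(x):+.6f}   gradient = {tanh_grad(x):.6f}")
```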

## ReLU

**Location**: [`relu.py`](relu.py)

**Formula**: 

![](https://quicklatex.com/cache3/a0/ql_37999e1feff124baca2413a754b3bfa0_l3.png)

*Description*: ReLU introduces non-linearity into a model without affecting the receptive fields of convolutional layers.

**Gradient**: 

![](https://quicklatex.com/cache3/41/ql_f90d9e252e775d7a7c0591082d1fd941_l3.png)

ReLU is computationally efficient, but it can occasionally cause dead neurons during training: a unit whose input stays negative outputs 0, receives zero gradient, and stops updating.
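The dead-neuron effect can be sketched with a single unit. The example below assumes the standard definition ReLU(x) = max(0, x) and the common convention that the derivative at exactly 0 is 0; the names `relu`, `relu_grad`, and the toy values of `w` and `b` are illustrative only and are not taken from `relu.py`.

```python
def relu(x: float) -> float:
    """ReLU, assuming the standard definition max(0, x)."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU; the value at exactly 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0


# A single unit y = relu(w * x + b) with a strongly negative bias.
w, b = 1.0, -5.0
inputs = [0.5, 1.0, 2.0]

# Chain rule: dy/dw = relu'(w * x + b) * x
weight_grads = [relu_grad(w * x + b) * x for x in inputs]
print(weight_grads)  # [0.0, 0.0, 0.0]
```

Because the pre-activation is negative for every input, the weight receives no gradient signal from any sample, so gradient descent alone cannot move the unit out of this state.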

## Swish

**Location**: [`swish.py`](swish.py)

**Formula**:

![](https://quicklatex.com/cache3/8d/ql_70016b335ea865d9640e6af078f4e08d_l3.png)

Where: ![](https://quicklatex.com/cache3/e8/ql_9a315236dfcda864a869107144a3fbe8_l3.png) is a learnable parameter.

*Description*: Swish is a self-gated function that combines the merits of ReLU and sigmoid.

**Gradient**: 

![](https://quicklatex.com/cache3/00/ql_e0b16a0e5c70dc33fae2dd6df8f09400_l3.png)
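The sketch below assumes the commonly used form Swish(x) = x * sigmoid(beta * x), where `beta` stands in for the learnable parameter; that name, and the helper functions, are assumptions for this example rather than the contents of `swish.py`. It derives the gradient with respect to both the input and the parameter, then checks each against a finite-difference estimate.

```python
import math


def scalar_sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """Assumed definition: x * sigmoid(beta * x)."""
    return x * scalar_sigmoid(beta * x)


def swish_grad_x(x: float, beta: float = 1.0) -> float:
    """d swish / d x = s + beta * x * s * (1 - s), with s = sigmoid(beta * x)."""
    s = scalar_sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s)


def swish_grad_beta(x: float, beta: float = 1.0) -> float:
    """d swish / d beta = x^2 * s * (1 - s), with s = sigmoid(beta * x)."""
    s = scalar_sigmoid(beta * x)
    return x * x * s * (1.0 - s)


h = 1e-5  # step size for the central-difference check
for x, beta in [(-2.0, 1.0), (0.5, 1.0), (1.5, 2.0)]:
    num_dx = (swish(x + h, beta) - swish(x - h, beta)) / (2 * h)
    num_dbeta = (swish(x, beta + h) - swish(x, beta - h)) / (2 * h)
    assert math.isclose(swish_grad_x(x, beta), num_dx, abs_tol=1e-7)
    assert math.isclose(swish_grad_beta(x, beta), num_dbeta, abs_tol=1e-7)
```

The gradient with respect to `beta` is what would let an optimizer adjust the parameter during training.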

## CELU

**Location**: [`celu.py`](celu.py)

**Formula**: 

![](https://quicklatex.com/cache3/f3/ql_c5273c8c3683571ded65e128719665f3_l3.png)

*Description*: CELU is an extension of the exponential linear unit (ELU), enhanced with a scale parameter
![](https://quicklatex.com/cache3/a0/ql_0c3e2deb84c57937afcc3a11a786fea0_l3.png).

**Gradient**:

![](https://quicklatex.com/cache3/a9/ql_f5f8b0d44fbd0efab3c215f8bf8ea6a9_l3.png)

## Softmax

**Location**: [`softmax.py`](softmax.py)

**Formula**:

![](https://quicklatex.com/cache3/59/ql_9798671b5f273c3282d12bd273d73b59_l3.png)

*Description*: The softmax function is central to multi-class classification tasks: it converts its inputs into a
probability distribution over the classes.

**Gradient**: Computing the gradient of softmax requires care because the normalization makes every output depend on every input.
The derivative is therefore typically expressed as a _Jacobian matrix_ rather than as an element-wise function.
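A minimal sketch of that Jacobian follows, assuming the standard softmax s_i = exp(x_i) / sum_j exp(x_j), for which J[i][j] = s_i * (delta_ij - s_j). It is written in plain Python with lists; the function names are hypothetical and do not describe the interface of `softmax.py`.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(xs: List[float]) -> List[List[float]]:
    """J[i][j] = d softmax_i / d x_j = s_i * (delta_ij - s_j)."""
    s = softmax(xs)
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]


def softmax_backward(xs: List[float], upstream: List[float]) -> List[float]:
    """Vector-Jacobian product: gradient w.r.t. xs given the gradient w.r.t. the outputs."""
    s = softmax(xs)
    dot = sum(g * si for g, si in zip(upstream, s))
    return [si * (g - dot) for si, g in zip(s, upstream)]


x = [1.0, 2.0, 3.0]
J = softmax_jacobian(x)
for row in J:
    print([round(v, 4) for v in row])

# Each row sums to zero because the outputs always sum to one.
assert all(abs(sum(row)) < 1e-12 for row in J)
```

In backpropagation the full matrix is rarely materialized; the vector-Jacobian product in `softmax_backward` yields the same result as multiplying the upstream gradient by the Jacobian, at linear rather than quadratic cost in the number of classes.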
