# Activation Functions

## Overview

The `activations` module provides a collection of common activation functions for neural networks.
Alongside each function definition, the module also computes its gradient, which helps in understanding and applying the
backpropagation algorithm.
The `softmax` function is the exception: it maps a vector of inputs to a vector of outputs, so it requires its own
gradient computation.


In [22]:
from pathlib import Path

ROOT_DIR = Path('..') / '..'
!pip install -q -r {ROOT_DIR / 'requirements.txt'}

import torch  # needed for running the examples
from tqdm import tqdm  # prettier progress bars


## Sigmoid Activation Function

The sigmoid function is an activation function commonly used at the output of binary
classifiers.
It maps any real input to a value strictly between 0 and 1, which can be interpreted as the
probability that a given input point belongs to the positive class.

Mathematically, the sigmoid function is given by:

$$
\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}
$$

Its derivative, crucial for the backpropagation algorithm, is:

$$
\mathrm{sigmoid}'(x) = \mathrm{sigmoid}(x)(1 - \mathrm{sigmoid}(x))
$$

Note that the sigmoid saturates: for inputs of large magnitude, positive or negative, its
derivative approaches 0, which can lead to vanishing gradients.
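The derivative formula and the saturation behaviour can be checked numerically. The following is a minimal sketch that does not use the `activations` module: `sigmoid_grad` is a hypothetical helper written only for this illustration, and PyTorch's autograd serves as the reference.

```python
import torch


def sigmoid_grad(x: torch.Tensor) -> torch.Tensor:
    """Closed-form derivative: sigmoid(x) * (1 - sigmoid(x)). Hypothetical helper."""
    s = torch.sigmoid(x)
    return s * (1 - s)


# Inputs chosen to cover both saturated tails and the centre.
x = torch.tensor([-10.0, 0.0, 10.0], requires_grad=True)

# Autograd reference: the gradient of sum(sigmoid(x)) is the element-wise derivative.
torch.sigmoid(x).sum().backward()

analytic = sigmoid_grad(x.detach())
print(torch.allclose(analytic, x.grad, atol=1e-6))  # closed form vs. autograd
print(analytic)  # largest at x = 0 (0.25), close to zero at x = -10 and x = 10
```

The derivative peaks at 0.25, so even in the best case each sigmoid layer scales the backpropagated signal by at most a quarter.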

### Examples

#### 1. Computing the sigmoid of a tensor:

In [23]:
from activations import sigmoid

x = torch.Tensor([0, 1, 2])
result = sigmoid(x)
print(result)  # Outputs: tensor([0.5000, 0.7311, 0.8808])


tensor([0.5000, 0.7311, 0.8808])


#### 2. Determining the gradient of the sigmoid for a tensor:

In [24]:
x = torch.Tensor([0, 1, 2])
gradient_result = sigmoid(x, gradient=True)
print(gradient_result)  # Outputs: tensor([0.2500, 0.1966, 0.1050])


tensor([0.2500, 0.1966, 0.1050])


#### 3. Handling higher-dimensional tensors:


In [25]:
x = torch.Tensor([[0, 1], [-1, 2]])
result = sigmoid(x)
print(result)
# Outputs: 
# tensor([[0.5000, 0.7311],
#         [0.2689, 0.8808]])


tensor([[0.5000, 0.7311],
        [0.2689, 0.8808]])


#### 4. Verifying against PyTorch's built-in implementation:


In [26]:
for _ in tqdm(range(100)):
    x = torch.randn((100, 100, 100))
    our_implementation = sigmoid(x)
    pytorch_implementation = torch.sigmoid(x)
    assert torch.allclose(our_implementation, pytorch_implementation), \
        f"Expected {pytorch_implementation}, but got {our_implementation}"
print("All tests passed!")


100%|██████████| 100/100 [00:02<00:00, 40.51it/s]

All tests passed!






## Tanh Activation Function

The hyperbolic tangent, or simply $\text{tanh}$, is another prevalent activation function used
in neural networks.
Its outputs range between -1 and 1, making it zero-centered, which can help mitigate some of
the issues observed with non-zero-centered activation functions like the sigmoid.

Mathematically, the $\text{tanh}$ function is expressed as:

$$
\mathrm{tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

Or, equivalently, as:

$$
\mathrm{tanh}(x) = 2 \times \mathrm{sigmoid}(2x) - 1
$$

The derivative of $\text{tanh}$, useful for backpropagation, is:

$$
\mathrm{tanh}'(x) = 1 - \mathrm{tanh}^2(x)
$$

Compared to the sigmoid function, $\text{tanh}$ tends to be preferred for hidden layers due to
its zero-centered nature.
Still, it saturates in the same way: for inputs of large magnitude its derivative approaches 0,
so it shares the vanishing gradient problem.
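Both the sigmoid identity and the derivative formula can be verified numerically. This is a small sketch that relies only on PyTorch's built-in `torch.tanh` and `torch.sigmoid`, not on the `activations` module; the variable names are chosen for illustration.

```python
import torch

x = torch.linspace(-3.0, 3.0, steps=13, requires_grad=True)
y = torch.tanh(x)

# Identity: tanh(x) = 2 * sigmoid(2x) - 1 (computed on a detached copy of x).
via_sigmoid = 2 * torch.sigmoid(2 * x.detach()) - 1
identity_holds = torch.allclose(via_sigmoid, y.detach(), atol=1e-6)

# Derivative: tanh'(x) = 1 - tanh(x)^2, compared with autograd.
y.sum().backward()
analytic_grad = 1 - y.detach() ** 2
derivative_matches = torch.allclose(analytic_grad, x.grad, atol=1e-6)

print(identity_holds, derivative_matches)
```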



### Examples



#### 1. Computing the $\text{tanh}$ of a tensor:

In [27]:
from activations import tanh

x = torch.Tensor([0, 1, 2])
result = tanh(x)
print(result)  # Expected: tensor([0.0000, 0.7616, 0.9640])


tensor([0.0000, 0.7616, 0.9640])


#### 2. Determining the gradient of $\text{tanh}$ for a tensor:


In [28]:
x = torch.Tensor([0, 1, 2])
gradient_result = tanh(x, gradient=True)
print(gradient_result)  # Expected: tensor([1.0000, 0.4200, 0.0707])


tensor([1.0000, 0.4200, 0.0707])


#### 3. Handling higher-dimensional tensors:


In [29]:
x = torch.Tensor([[0, 1], [-1, 2]])
result = tanh(x)
print(result)
# Expected: 
# tensor([[ 0.0000,  0.7616],
#         [-0.7616,  0.9640]])


tensor([[ 0.0000,  0.7616],
        [-0.7616,  0.9640]])


#### 4. Verifying against PyTorch's built-in implementation:


In [30]:
for _ in range(100):
    x = torch.randn((100, 100, 100))
    actual, expected = tanh(x), torch.tanh(x)
    assert torch.allclose(actual, expected, atol=1e-7), f"Expected {expected}, but got {actual}"
print("All tests passed!")

All tests passed!


## ReLU

**Location**: [`relu.py`](relu.py)

**Formula**: 

![](https://quicklatex.com/cache3/a0/ql_37999e1feff124baca2413a754b3bfa0_l3.png)

*Description*: ReLU introduces non-linearity into a model without affecting the receptive fields of convolution layers.

**Gradient**: 

![](https://quicklatex.com/cache3/41/ql_f90d9e252e775d7a7c0591082d1fd941_l3.png)

ReLU is computationally cheap, but it can cause dead neurons during training: a unit whose input stays negative receives zero gradient and stops updating.
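As an illustration, here is a minimal sketch that assumes the standard definition $\mathrm{relu}(x) = \max(0, x)$. The name `relu_sketch` is hypothetical, and its `gradient` flag only mirrors the convention seen in the sigmoid and tanh examples; the actual interface of `relu.py` may differ.

```python
import torch


def relu_sketch(x: torch.Tensor, gradient: bool = False) -> torch.Tensor:
    """Hypothetical ReLU: max(0, x), or its derivative when gradient=True."""
    if gradient:
        # 1 where x > 0, 0 elsewhere (the value at x = 0 is a convention; 0 is used here).
        return (x > 0).to(x.dtype)
    return torch.clamp(x, min=0)


x = torch.tensor([-2.0, 0.0, 3.0])
print(relu_sketch(x))                 # tensor([0., 0., 3.])
print(relu_sketch(x, gradient=True))  # tensor([0., 0., 1.])
```

The zero entries in the gradient are what produce dead neurons: no error signal flows back through a unit while its input is negative.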

## Swish

**Location**: [`swish.py`](swish.py)

**Formula**:

![](https://quicklatex.com/cache3/8d/ql_70016b335ea865d9640e6af078f4e08d_l3.png)

Where: ![](https://quicklatex.com/cache3/e8/ql_9a315236dfcda864a869107144a3fbe8_l3.png) is a learnable parameter.

*Description*: Swish is a self-gated function that combines properties of ReLU and sigmoid.

**Gradient**: 

![](https://quicklatex.com/cache3/00/ql_e0b16a0e5c70dc33fae2dd6df8f09400_l3.png)
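The sketch below assumes the common formulation $\mathrm{swish}(x) = x \cdot \mathrm{sigmoid}(\beta x)$, whose derivative with respect to $x$ is $\mathrm{sigmoid}(\beta x) + \beta x \, \mathrm{sigmoid}(\beta x)(1 - \mathrm{sigmoid}(\beta x))$. The name `swish_sketch` and the symbol `beta` are assumptions made for illustration. Here `beta` is a plain float rather than a learnable parameter, and the actual interface of `swish.py` may differ.

```python
import torch
import torch.nn.functional as F


def swish_sketch(x: torch.Tensor, beta: float = 1.0, gradient: bool = False) -> torch.Tensor:
    """Hypothetical Swish: x * sigmoid(beta * x), or its derivative w.r.t. x."""
    s = torch.sigmoid(beta * x)
    if gradient:
        # Product rule: sigmoid(beta*x) + x * beta * sigmoid'(beta*x)
        return s + beta * x * s * (1 - s)
    return x * s


x = torch.tensor([-2.0, 0.0, 1.0, 3.0], requires_grad=True)
beta = 1.5

# Compare the closed-form derivative with autograd.
swish_sketch(x, beta).sum().backward()
grad_matches = torch.allclose(
    swish_sketch(x.detach(), beta, gradient=True), x.grad, atol=1e-6
)

# With beta = 1 the function coincides with PyTorch's SiLU.
silu_matches = torch.allclose(
    swish_sketch(x.detach()), F.silu(x.detach()), atol=1e-6
)
print(grad_matches, silu_matches)
```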

## CELU

**Location**: [`celu.py`](celu.py)

**Formula**: 

![](https://quicklatex.com/cache3/f3/ql_c5273c8c3683571ded65e128719665f3_l3.png)

*Description*: CELU (continuously differentiable exponential linear unit) extends the exponential linear unit (ELU) with a scale parameter
![](https://quicklatex.com/cache3/a0/ql_0c3e2deb84c57937afcc3a11a786fea0_l3.png).

**Gradient**:

![](https://quicklatex.com/cache3/a9/ql_f5f8b0d44fbd0efab3c215f8bf8ea6a9_l3.png)
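For reference, a minimal sketch assuming the standard definition $\mathrm{celu}(x) = \max(0, x) + \min(0, \alpha(e^{x/\alpha} - 1))$, with derivative $1$ for $x > 0$ and $e^{x/\alpha}$ otherwise. The name `celu_sketch` and its signature are hypothetical, and the actual interface of `celu.py` may differ. PyTorch's `torch.nn.functional.celu` is used as the reference.

```python
import torch
import torch.nn.functional as F


def celu_sketch(x: torch.Tensor, alpha: float = 1.0, gradient: bool = False) -> torch.Tensor:
    """Hypothetical CELU, or its derivative w.r.t. x when gradient=True."""
    # exp(x / alpha) on the non-positive side; clamping keeps exp() from
    # overflowing on large positive inputs (the value there is exp(0) = 1).
    neg = torch.exp(torch.clamp(x, max=0) / alpha)
    if gradient:
        return torch.where(x > 0, torch.ones_like(x), neg)
    return torch.clamp(x, min=0) + alpha * (neg - 1)


x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
alpha = 2.0

value_matches = torch.allclose(
    celu_sketch(x.detach(), alpha), F.celu(x.detach(), alpha=alpha), atol=1e-6
)

# Compare the closed-form derivative with autograd through PyTorch's CELU.
F.celu(x, alpha=alpha).sum().backward()
grad_matches = torch.allclose(
    celu_sketch(x.detach(), alpha, gradient=True), x.grad, atol=1e-6
)
print(value_matches, grad_matches)
```

Both branches of the derivative equal 1 at $x = 0$, which is the continuity property that distinguishes CELU from ELU when $\alpha \neq 1$.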

## Softmax

**Location**: [`softmax.py`](softmax.py)

**Formula**:

![](https://quicklatex.com/cache3/59/ql_9798671b5f273c3282d12bd273d73b59_l3.png)

*Description*: The softmax function is the standard output layer for multi-class classification: it converts a vector of
inputs into a probability distribution over the classes.

**Gradient**: Because of the normalization, every softmax output depends on every input, so the gradient cannot be computed element-wise.
It is instead expressed as a _Jacobian matrix_ of partial derivatives.
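A minimal sketch of this computation for a single 1-D input, assuming the standard definition $s_i = e^{x_i} / \sum_j e^{x_j}$, for which the Jacobian is $J_{ij} = s_i(\delta_{ij} - s_j)$. The names `softmax_sketch` and `softmax_jacobian_sketch` are hypothetical and are not taken from `softmax.py`. PyTorch's `torch.autograd.functional.jacobian` provides the reference.

```python
import torch


def softmax_sketch(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical softmax for a 1-D tensor."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp().
    e = torch.exp(x - x.max())
    return e / e.sum()


def softmax_jacobian_sketch(x: torch.Tensor) -> torch.Tensor:
    """Jacobian J[i, j] = d s_i / d x_j = s_i * (delta_ij - s_j)."""
    s = softmax_sketch(x)
    return torch.diag(s) - torch.outer(s, s)


x = torch.tensor([1.0, 2.0, 3.0])
J = softmax_jacobian_sketch(x)

# Reference Jacobian computed by autograd through PyTorch's softmax.
J_auto = torch.autograd.functional.jacobian(lambda v: torch.softmax(v, dim=0), x)

print(torch.allclose(J, J_auto, atol=1e-6))
print(J.sum(dim=1))  # each row sums to (approximately) zero
```

Each row sums to zero because the outputs always sum to 1, so any change in one probability is offset by the others.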
