# Activation Functions

## Overview

The `activations` module offers a collection of popular activation functions 
essential for neural network designs.
Along with the primary function definitions, this module computes the gradient of each 
function (requested with `gradient=True` in the examples below), which helps in understanding 
and applying the back-propagation algorithm.
Notably, the `softmax` function is an exception due to its inherent multi-input,
multi-output structure, necessitating a unique gradient computation.
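
To see why softmax needs separate treatment, the sketch below uses the standard softmax definition on plain Python lists. The helper names `softmax_list` and `softmax_jacobian` are illustrative and are not taken from the `activations` module. Because every output depends on every input, the gradient is a full Jacobian matrix rather than one value per element.

```python
import math


def softmax_list(values):
    """Standard softmax over a list of floats, shifted by the max for stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(values):
    """Jacobian J[i][j] = d softmax_i / d x_j = s_i * (delta_ij - s_j)."""
    s = softmax_list(values)
    n = len(s)
    return [
        [s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
        for i in range(n)
    ]


jacobian = softmax_jacobian([1.0, 2.0, 3.0])
for row in jacobian:
    print([round(entry, 4) for entry in row])
```

An element-wise function such as sigmoid would give a diagonal matrix here, which is why a single gradient value per input is enough for the other functions.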


In [63]:
from pathlib import Path

ROOT_DIR = Path('..') / '..'
!pip install -q -r {ROOT_DIR / 'requirements.txt'}

import torch  # needed for running the examples
from tqdm import tqdm  # prettier progress bars


## Sigmoid Activation Function

The sigmoid function is a type of activation function that is primarily used in binary 
classification tasks.
It maps any input to a value between 0 and 1, which can often be used to represent the 
probability that a given input point belongs to the positive class.

Mathematically, the sigmoid function is given by:

$$
\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}
$$

Its derivative, crucial for the backpropagation algorithm, is:

$$
\mathrm{sigmoid}'(x) = \mathrm{sigmoid}(x)(1 - \mathrm{sigmoid}(x))
$$

However, the sigmoid function can lead to vanishing gradients: its derivative peaks at 0.25 
at $x = 0$ and approaches 0 as the input becomes very high or very low.
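
As a minimal sketch of the two formulas above, the following plain-Python version works on single floats rather than tensors. The names `sigmoid_scalar` and `sigmoid_grad_scalar` are illustrative and are not the module's API. It branches on the sign of the input, which is one common way to avoid overflow in `exp`.

```python
import math


def sigmoid_scalar(x: float) -> float:
    """Logistic sigmoid of a single float, written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) underflows toward 0 instead of overflowing.
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad_scalar(x: float) -> float:
    """Derivative expressed through the function value: s * (1 - s)."""
    s = sigmoid_scalar(x)
    return s * (1.0 - s)


for x in (0.0, 2.0, 10.0, -10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid_scalar(x):.6f}  gradient={sigmoid_grad_scalar(x):.6f}")
```

The gradient column shrinks quickly as the input moves away from zero, which is the vanishing-gradient behaviour described above.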

### Examples

#### 1. Computing the sigmoid of a tensor:

In [64]:
from activations import sigmoid

x = torch.Tensor([0, 1, 2])
result = sigmoid(x)
print(result)  # Outputs: tensor([0.5000, 0.7311, 0.8808])


tensor([0.5000, 0.7311, 0.8808])


#### 2. Determining the gradient of the sigmoid for a tensor:

In [65]:
x = torch.Tensor([0, 1, 2])
gradient_result = sigmoid(x, gradient=True)
print(gradient_result)  # Outputs: tensor([0.2500, 0.1966, 0.1050])


tensor([0.2500, 0.1966, 0.1050])


#### 3. Handling higher-dimensional tensors:


In [66]:
x = torch.Tensor([[0, 1], [-1, 2]])
result = sigmoid(x)
print(result)
# Outputs: 
# tensor([[0.5000, 0.7311],
#         [0.2689, 0.8808]])


tensor([[0.5000, 0.7311],
        [0.2689, 0.8808]])


#### 4. Verifying against PyTorch's built-in implementation:


In [67]:
for _ in tqdm(range(100)):
    x = torch.randn((100, 100, 100))
    our_implementation = sigmoid(x)
    pytorch_implementation = torch.sigmoid(x)
    assert torch.allclose(our_implementation, pytorch_implementation), \
        f"Expected {pytorch_implementation}, but got {our_implementation}"
print("All tests passed!")


100%|██████████| 100/100 [00:02<00:00, 42.69it/s]

All tests passed!





## Tanh Activation Function

The hyperbolic tangent, or simply $\text{tanh}$, is another prevalent activation function used
in neural networks.
Its outputs range between -1 and 1, making it zero-centered, which can help mitigate some of
the issues observed with non-zero-centered activation functions like the sigmoid.

Mathematically, the $\text{tanh}$ function is expressed as:

$$
\mathrm{tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

Or, equivalently, as:

$$
\mathrm{tanh}(x) = 2 \times \mathrm{sigmoid}(2x) - 1
$$

The derivative of $\text{tanh}$, useful for backpropagation, is:

$$
\mathrm{tanh}'(x) = 1 - \mathrm{tanh}^2(x)
$$

Compared to the sigmoid function, $\text{tanh}$ tends to be preferred for hidden layers due to
its zero-centered nature.
Still, it shares the vanishing gradient problem for extremely high or low inputs.
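
The identities above can be checked numerically. This is a small sketch on plain Python floats, with illustrative helper names that are not part of the `activations` module. It compares the sigmoid-based form against `math.tanh` and evaluates the derivative $1 - \tanh^2(x)$.

```python
import math


def sigmoid_scalar(x: float) -> float:
    """Logistic sigmoid; adequate for the small inputs used in this sketch."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x: float) -> float:
    """tanh written as a rescaled and shifted sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid_scalar(2.0 * x) - 1.0


def tanh_grad_scalar(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


for x in (-2.0, 0.0, 1.0, 2.0):
    difference = abs(tanh_via_sigmoid(x) - math.tanh(x))
    print(f"x={x:5.1f}  identity error={difference:.2e}  gradient={tanh_grad_scalar(x):.4f}")
```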



### Examples



#### 1. Computing the $\text{tanh}$ of a tensor:

In [68]:
from activations import tanh

x = torch.Tensor([0, 1, 2])
result = tanh(x)
print(result)  # Expected: tensor([0.0000, 0.7616, 0.9640])


tensor([0.0000, 0.7616, 0.9640])


#### 2. Determining the gradient of $\text{tanh}$ for a tensor:


In [69]:
x = torch.Tensor([0, 1, 2])
gradient_result = tanh(x, gradient=True)
print(gradient_result)  # Expected: tensor([1.0000, 0.4200, 0.0707])


tensor([1.0000, 0.4200, 0.0707])


#### 3. Handling higher-dimensional tensors:


In [70]:
x = torch.Tensor([[0, 1], [-1, 2]])
result = tanh(x)
print(result)
# Expected: 
# tensor([[ 0.0000,  0.7616],
#         [-0.7616,  0.9640]])


tensor([[ 0.0000,  0.7616],
        [-0.7616,  0.9640]])


#### 4. Verifying against PyTorch's built-in implementation:


In [71]:
for _ in range(100):
    x = torch.randn((100, 100, 100))
    actual, expected = tanh(x), torch.tanh(x)
    assert torch.allclose(actual, expected, atol=1e-7), f"Expected {expected}, but got {actual}"
print("All tests passed!")

All tests passed!


## ReLU (Rectified Linear Unit)

ReLU, or Rectified Linear Unit, is one of the most widely used activation functions in deep learning
models.
It is especially popular in convolutional neural networks and deep feed-forward networks, mainly
because of its simplicity and efficiency.

The ReLU function is mathematically represented as:

$$\mathrm{ReLU}(x) = \max(0,\, x)$$

This means that if the input is positive, it returns the input itself, and if the input is negative
or zero, it returns zero.

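The gradient of the ReLU function is quite simple: it is 0 for $x < 0$ and 1 for $x > 0$.
At $x = 0$ the function is not differentiable, so a value has to be chosen by convention; the
gradient example in this section returns 1 at $x = 0$, which gives:

$$
    \mathrm{ReLU}'(x) = 
        \begin{cases} 
            0 & \text{if } x < 0 \\
            1 & \text{if } x \geq 0 
        \end{cases}
$$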
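
A minimal sketch of ReLU and its gradient on single floats follows. The names and the `grad_at_zero` parameter are hypothetical and only serve to make the choice at $x = 0$ explicit, since ReLU is not differentiable there and implementations differ on the value they return.

```python
def relu_scalar(x: float) -> float:
    """ReLU of a single float: max(0, x)."""
    return max(0.0, x)


def relu_grad_scalar(x: float, *, grad_at_zero: float) -> float:
    """Gradient of ReLU; the value at exactly 0 is a convention the caller picks."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return grad_at_zero


inputs = [-1.5, 0.0, 0.5, 2.0]
print([relu_scalar(x) for x in inputs])
print([relu_grad_scalar(x, grad_at_zero=0.0) for x in inputs])
print([relu_grad_scalar(x, grad_at_zero=1.0) for x in inputs])
```

With `grad_at_zero=1.0` the result matches the values printed by the gradient example in this section.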


### Advantages

1. **Computational Efficiency**: The ReLU function is simple and can be implemented easily without
   requiring any complex operations like exponentials.
   This makes it computationally efficient.
2. **Sparsity**: ReLU activation leads to sparsity.
   A neuron whose output is zero is said to be "inactive", and when many neurons in a layer are
   inactive, the resulting representation is sparse.
   Sparse representations are often reported to work well in deep learning models.
3. **Mitigating the Vanishing Gradient Problem**: Saturating activation functions like sigmoid and
   tanh squash their input into a small range (0 to 1 and -1 to 1, respectively), so their
   derivatives approach zero for inputs of large magnitude.
   In deep networks, multiplying many such small derivatives during backpropagation can leave
   gradients too small for the network to learn effectively.
   ReLU has a derivative of exactly 1 for every positive input, which mitigates this problem and
   often lets models train faster.
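
The vanishing-gradient point in the list above can be made concrete with a deliberately simplified sketch. It ignores weights entirely and assumes the same hypothetical pre-activation value at every layer, then multiplies the per-layer activation derivative across a stack of layers.

```python
import math


def sigmoid_grad_scalar(x: float) -> float:
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def tanh_grad_scalar(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2


def relu_grad_scalar(x: float) -> float:
    return 1.0 if x > 0 else 0.0


depth = 10            # number of stacked layers (illustrative)
pre_activation = 2.0  # assumed identical at every layer, for simplicity

factors = {}
for name, grad in (
    ("sigmoid", sigmoid_grad_scalar),
    ("tanh", tanh_grad_scalar),
    ("relu", relu_grad_scalar),
):
    # Backpropagation multiplies one activation derivative per layer.
    factors[name] = grad(pre_activation) ** depth
    print(f"{name:8s} per-layer={grad(pre_activation):.4f}  product over {depth} layers={factors[name]:.3e}")
```

In a real network the weights also enter the product, so this only isolates the contribution of the activation function.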

### Drawbacks

1. **Dying ReLU Problem**: Since the gradient for negative values is zero, during training, some
   neurons might never activate, effectively getting knocked off during the training and not
   contributing to the model.
   This is called the "dying ReLU" problem.
2. **Not Zero-Centered**: Unlike the tanh function, ReLU outputs are not zero-centered.
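
The dying ReLU problem can be reproduced with a toy setup. This is a sketch with a single hypothetical neuron, a squared-error loss, and hand-picked values. The bias is so negative that the pre-activation is below zero for every input, so every gradient is zero and gradient descent never changes the parameters.

```python
def relu_grad_scalar(z: float) -> float:
    return 1.0 if z > 0 else 0.0


inputs = [0.1, 0.5, 0.9]
targets = [1.0, 1.0, 1.0]
w, b = 1.0, -5.0       # hypothetical starting point: w * x + b < 0 for all inputs
learning_rate = 0.1

for step in range(100):
    grad_w = 0.0
    grad_b = 0.0
    for x, t in zip(inputs, targets):
        z = w * x + b                   # pre-activation
        y = max(0.0, z)                 # ReLU output, always 0 here
        dloss_dy = 2.0 * (y - t)        # derivative of (y - t)^2
        dloss_dz = dloss_dy * relu_grad_scalar(z)  # zero because z < 0
        grad_w += dloss_dz * x
        grad_b += dloss_dz
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # the parameters are unchanged after training
```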

### Examples


#### Example 1: Computing the ReLU of a tensor:

In [72]:
from activations import relu

x = torch.Tensor([-1.5, 0, 0.5, 2])
result = relu(x)
print(result)  # Expected: tensor([0.0000, 0.0000, 0.5000, 2.0000])

tensor([0.0000, 0.0000, 0.5000, 2.0000])


#### Example 2: Computing the gradient of ReLU for a tensor

In [73]:
x = torch.Tensor([-1.5, 0, 0.5, 2])
gradient_result = relu(x, gradient=True)
print(gradient_result)  # Expected: tensor([0., 1., 1., 1.])

tensor([0., 1., 1., 1.])


#### Example 3: Using ReLU on higher-dimensional tensors

In [74]:
x = torch.Tensor([[-1, 1], [0, -2]])
result = relu(x)
print(result)
# Expected:
# tensor([[0., 1.],
#         [0., 0.]])

tensor([[0., 1.],
        [0., 0.]])


#### Example 4: Testing against PyTorch's built-in ReLU

In [75]:
for _ in range(100):
    x = torch.randn((100, 100, 100))
    actual = relu(x)
    expected = torch.relu(x)
    assert torch.allclose(actual, expected), f"Expected {expected}, got {actual}"
print("All tests passed!")

All tests passed!


## CELU Activation Function

The `CELU` (Continuously Differentiable Exponential Linear Unit) activation function is a 
modified version of the ReLU and ELU activation functions.
It aims to resolve the dying ReLU problem by producing negative outputs for inputs below zero,
while still preserving smooth gradients for optimization.

Formally:
Given an input $x$ and a parameter $\alpha > 0$, the CELU function is defined as:

$$
    \mathrm{celu}(x, \alpha) = \begin{cases}
            x                                   & \text{if } x \geq 0 \\
            \alpha (\exp(\frac{x}{\alpha}) - 1) & \text{otherwise}
        \end{cases}
$$

Where:
- $x$ is the input.
- $\alpha$ is a parameter that determines the saturation rate for negative inputs.

The gradient of the CELU function w.r.t. its input $x$ is:

$$
    \frac{\partial\ \text{celu}(x, \alpha)}{\partial x} = \begin{cases}
            1 & \text{if } x \geq 0 \\
            \exp(\frac{x}{\alpha}) & \text{if } x < 0
        \end{cases}
$$

For $x < 0$, the related expression
$\frac{\text{celu}(x, \alpha) - x e^{\frac{x}{\alpha}}}{\alpha}$ is the partial derivative with
respect to $\alpha$, not $x$.
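
A closed-form gradient can be sanity-checked with finite differences. The sketch below is plain Python on single floats, with illustrative helper names that are not part of `activations`. It differentiates the CELU definition numerically with respect to both $x$ and $\alpha$. Under these assumptions, $\exp(x/\alpha)$ matches the slope in $x$, while $(\text{celu}(x, \alpha) - x e^{x/\alpha})/\alpha$ matches the slope in $\alpha$.

```python
import math


def celu_scalar(x: float, alpha: float = 1.0) -> float:
    """CELU of a single float, following the piecewise definition."""
    return x if x >= 0 else alpha * (math.exp(x / alpha) - 1.0)


def celu_grad_x(x: float, alpha: float = 1.0) -> float:
    """Partial derivative of CELU with respect to the input x."""
    return 1.0 if x >= 0 else math.exp(x / alpha)


def celu_grad_alpha(x: float, alpha: float = 1.0) -> float:
    """Partial derivative of CELU with respect to the parameter alpha."""
    if x >= 0:
        return 0.0
    return (celu_scalar(x, alpha) - x * math.exp(x / alpha)) / alpha


def central_difference(f, v: float, h: float = 1e-6) -> float:
    """Symmetric finite-difference estimate of f'(v)."""
    return (f(v + h) - f(v - h)) / (2.0 * h)


x, alpha = -1.0, 1.0
numeric_dx = central_difference(lambda v: celu_scalar(v, alpha), x)
numeric_dalpha = central_difference(lambda a: celu_scalar(x, a), alpha)

print(f"d/dx:     numeric={numeric_dx:.6f}  closed form={celu_grad_x(x, alpha):.6f}")
print(f"d/dalpha: numeric={numeric_dalpha:.6f}  closed form={celu_grad_alpha(x, alpha):.6f}")
```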

### Advantages

1. **Avoiding the Dying ReLU Problem**: Unlike ReLU, which can "kill" neurons leading them to
   output only zeros (especially during the training phase), CELU allows negative values for
   inputs below zero.
2. **Smooth Gradient**: The derivative is continuous everywhere, including at $x = 0$, unlike
   ReLU, whose derivative jumps from 0 to 1 there; this can help optimization and convergence
   during training.
3. **Configurable Saturation Rate**: The $\alpha$ parameter allows for configuring how fast the 
   activation saturates for negative inputs.
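
The role of $\alpha$ in the list above can be illustrated with a short sketch on plain floats. The helper name is illustrative and the $\alpha$ values are arbitrary. For very negative inputs the output approaches $-\alpha$, and a smaller $\alpha$ gets close to that floor sooner.

```python
import math


def celu_scalar(x: float, alpha: float = 1.0) -> float:
    """CELU of a single float, following the piecewise definition."""
    return x if x >= 0 else alpha * (math.exp(x / alpha) - 1.0)


for alpha in (0.5, 1.0, 2.0):
    floor = celu_scalar(-50.0, alpha)    # very negative input: close to -alpha
    at_minus_one = celu_scalar(-1.0, alpha)
    # Fraction of the saturation level -alpha already reached at x = -1.
    saturation = at_minus_one / -alpha
    print(f"alpha={alpha:3.1f}  celu(-50)={floor:.4f}  celu(-1)={at_minus_one:.4f}  saturation at -1={saturation:.2%}")
```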

### Disadvantages

1. **Computational Overhead**: Due to the exponential function, CELU can be more computationally expensive than simpler activation functions like ReLU.
2. **Parameter Tuning**: Introducing the $\alpha$ parameter can sometimes require additional tuning to get optimal performance, adding to the complexity of the model.

### Usage

While CELU can be used in a variety of deep learning architectures, it's especially beneficial in scenarios where you observe the dying ReLU problem or when you want a smoother gradient for better optimization.

### Examples

#### Example 1: Computing the CELU of a tensor:

In [76]:
from activations import celu

print(celu(torch.tensor([-1, 0, 1])))  # Output: tensor([-0.6321,  0.0000,  1.0000])

tensor([-0.6321,  0.0000,  1.0000])


#### Example 2: Varying the Alpha Parameter

In [77]:
result_with_alpha = celu(torch.tensor([-1, 0, 1]), alpha=0.5)
print(result_with_alpha)  # Output: tensor([-0.4323,  0.0000,  1.0000])

tensor([-0.4323,  0.0000,  1.0000])


#### Example 3: Computing the Gradient

In [78]:
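x = torch.Tensor([-1, 0, 1])
gradient_result = celu(x, gradient=True)
print(gradient_result)  # Recorded output: tensor([-0.2642,  1.0000,  1.0000])
# Note: for alpha = 1 the derivative w.r.t. x at x = -1 is exp(-1) ≈ 0.3679;
# the recorded -0.2642 equals the derivative w.r.t. alpha at that point.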

tensor([-0.2642,  1.0000,  1.0000])


#### Example 4: Higher-dimensional Tensors

In [79]:
x = torch.Tensor([[1, -1], [0, 2]])
result = celu(x)
print(result)
# Output: 
# tensor([[ 1.0000, -0.6321],
#         [ 0.0000,  2.0000]])

tensor([[ 1.0000, -0.6321],
        [ 0.0000,  2.0000]])


#### Example 5: Testing against PyTorch's Implementation

In [80]:
for _ in range(100):
    x = torch.randn((10, 10))
    actual = celu(x)
    expected = torch.celu(x)
    assert torch.allclose(actual, expected, atol=1e-4), f"Expected {expected}, but got {actual}"
print("All tests passed!")

All tests passed!
