# Activation Functions

## Overview

The `activations` module offers a collection of popular activation functions 
essential for neural network designs.
Along with each function definition, the module computes the corresponding 
gradient, which helps in understanding and applying the back-propagation 
algorithm.
The `softmax` function is the exception: because each output depends on every 
input, its gradient is a Jacobian matrix rather than an element-wise derivative 
and requires a separate computation.


In [None]:
from pathlib import Path

ROOT_DIR = Path('..') / '..'
!pip install -q -r {ROOT_DIR / 'requirements.txt'}

import torch  # needed for running the examples
from tqdm import tqdm  # prettier progress bars



## Sigmoid Activation Function

The sigmoid function is an activation function used mainly in binary 
classification tasks.
It maps any real input to a value strictly between 0 and 1, which can be interpreted as the 
probability that a given input belongs to the positive class.

Mathematically, the sigmoid function is given by:

$$
\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}
$$

Its derivative, crucial for the backpropagation algorithm, is:

$$
\mathrm{sigmoid}'(x) = \mathrm{sigmoid}(x)(1 - \mathrm{sigmoid}(x))
$$
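
The derivative can be evaluated directly from the formula above. The sketch below uses only the standard library instead of the `activations` module; `sigmoid_scalar` and `sigmoid_grad_scalar` are hypothetical helper names introduced for illustration.

```python
import math


def sigmoid_scalar(x: float) -> float:
    """Logistic sigmoid of a single float (illustrative helper)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad_scalar(x: float) -> float:
    """Derivative sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid_scalar(x)
    return s * (1.0 - s)


# The gradient is largest at x = 0 and shrinks as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  gradient = {sigmoid_grad_scalar(x):.6f}")
```

The printed values shrink toward zero as the input moves away from the origin.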

Note that the sigmoid saturates for inputs of large magnitude: its derivative is at most 0.25 
(at $x = 0$) and approaches 0 as $|x|$ grows, which can lead to vanishing gradients.

### Examples

#### 1. Computing the sigmoid of a tensor:

In [None]:
from activations import sigmoid

x = torch.Tensor([0, 1, 2])
result = sigmoid(x)
print(result)  # Outputs: tensor([0.5000, 0.7311, 0.8808])


#### 2. Determining the gradient of the sigmoid for a tensor:

In [None]:
x = torch.Tensor([0, 1, 2])
gradient_result = sigmoid(x, gradient=True)
print(gradient_result)  # Outputs: tensor([0.2500, 0.1966, 0.1050])


#### 3. Handling higher-dimensional tensors:


In [None]:
x = torch.Tensor([[0, 1], [-1, 2]])
result = sigmoid(x)
print(result)
# Outputs: 
# tensor([[0.5000, 0.7311],
#         [0.2689, 0.8808]])


#### 4. Verifying against PyTorch's built-in implementation:


In [None]:
for _ in tqdm(range(100)):
    x = torch.randn((100, 100, 100))
    our_implementation = sigmoid(x)
    pytorch_implementation = torch.sigmoid(x)
    assert torch.allclose(our_implementation, pytorch_implementation), \
        f"Expected {pytorch_implementation}, but got {our_implementation}"
print("All tests passed!")


## Tanh Activation Function

The hyperbolic tangent, or simply $\text{tanh}$, is another prevalent activation function used
in neural networks.
Its outputs range between -1 and 1, making it zero-centered, which can help mitigate some of
the issues observed with non-zero-centered activation functions like the sigmoid.

Mathematically, the $\text{tanh}$ function is expressed as:

$$
\mathrm{tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

Or, equivalently, as:

$$
\mathrm{tanh}(x) = 2 \times \mathrm{sigmoid}(2x) - 1
$$

The derivative of $\text{tanh}$, useful for backpropagation, is:

$$
\mathrm{tanh}'(x) = 1 - \mathrm{tanh}^2(x)
$$

Compared to the sigmoid function, $\text{tanh}$ tends to be preferred for hidden layers due to
its zero-centered nature.
Still, it shares the vanishing gradient problem for extremely high or low inputs.
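
Both the sigmoid identity and the derivative can be sanity-checked numerically. This is a minimal sketch using the standard library only; the helper names (`tanh_via_sigmoid`, `tanh_grad`, `central_difference`) are hypothetical and are not part of the `activations` module.

```python
import math


def sigmoid_scalar(x: float) -> float:
    """Logistic sigmoid of a single float."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x: float) -> float:
    """tanh(x) written as 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid_scalar(2.0 * x) - 1.0


def tanh_grad(x: float) -> float:
    """Derivative 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Numerical derivative, used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    # The two expressions for tanh agree up to rounding error.
    assert math.isclose(tanh_via_sigmoid(x), math.tanh(x), abs_tol=1e-12)
    # The analytic derivative matches the finite-difference estimate.
    assert math.isclose(tanh_grad(x), central_difference(math.tanh, x), abs_tol=1e-8)
print("tanh identities hold at the sampled points")
```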



### Examples



#### 1. Computing the $\text{tanh}$ of a tensor:

In [None]:
from activations import tanh

x = torch.Tensor([0, 1, 2])
result = tanh(x)
print(result)  # Expected: tensor([0.0000, 0.7616, 0.9640])


#### 2. Determining the gradient of $\text{tanh}$ for a tensor:


In [None]:
x = torch.Tensor([0, 1, 2])
gradient_result = tanh(x, gradient=True)
print(gradient_result)  # Expected: tensor([1.0000, 0.4200, 0.0707])


#### 3. Handling higher-dimensional tensors:


In [None]:
x = torch.Tensor([[0, 1], [-1, 2]])
result = tanh(x)
print(result)
# Expected: 
# tensor([[ 0.0000,  0.7616],
#         [-0.7616,  0.9640]])


#### 4. Verifying against PyTorch's built-in implementation:


In [None]:
for _ in range(100):
    x = torch.randn((100, 100, 100))
    actual, expected = tanh(x), torch.tanh(x)
    assert torch.allclose(actual, expected, atol=1e-7), f"Expected {expected}, but got {actual}"
print("All tests passed!")

## ReLU (Rectified Linear Unit)

ReLU, or Rectified Linear Unit, is one of the most widely used activation functions in deep learning
models.
It is especially popular in convolutional neural networks and deep feed-forward networks, mainly
because of its simplicity and efficiency.

The ReLU function is mathematically represented as:

$$\mathrm{ReLU}(x) = \max(0,\, x)$$

This means that if the input is positive, it returns the input itself, and if the input is negative
or zero, it returns zero.

The gradient of ReLU is 0 for $x \leq 0$ and 1 for $x > 0$.
Strictly speaking, ReLU is not differentiable at $x = 0$; the definition used here assigns a
gradient of 0 at that point:

$$
    \mathrm{ReLU}'(x) = 
        \begin{cases} 
            0 & \text{if } x \leq 0 \\
            1 & \text{if } x > 0 
        \end{cases}
$$
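
As a plain-Python illustration of the two piecewise definitions above (a sketch with hypothetical helper names, independent of the `activations` module):

```python
def relu_scalar(x: float) -> float:
    """max(0, x) for a single float."""
    return max(0.0, x)


def relu_grad_scalar(x: float) -> float:
    """Gradient of ReLU, using the convention that it is 0 at x = 0."""
    return 1.0 if x > 0 else 0.0


inputs = [-1.5, 0.0, 0.5, 2.0]
print([relu_scalar(x) for x in inputs])       # [0.0, 0.0, 0.5, 2.0]
print([relu_grad_scalar(x) for x in inputs])  # [0.0, 0.0, 1.0, 1.0]
```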


### Advantages

1. **Computational Efficiency**: The ReLU function is simple and can be implemented easily without
   requiring any complex operations like exponentials.
   This makes it computationally efficient.
2. **Sparsity**: ReLU activation leads to sparsity.
   When the output is zero, it's said to be "inactive", and when many neurons are inactive in a
   layer, the resulting representations are sparse.
   Sparse representations are often considered beneficial in deep learning models, although the
   effect depends on the task.
3. **Mitigating the Vanishing Gradient Problem**: Traditional activation functions like sigmoid or
   tanh squish their input into a small range between 0 and 1 or -1 and 1 respectively, and their
   derivatives approach zero once they saturate.
   For deep networks, this could lead to gradients that are too small for the network to learn
   effectively.
   ReLU helps mitigate this problem because its gradient is exactly 1 for every positive input,
   which often lets models train faster.
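
The sparsity effect can be illustrated with a small experiment. The sketch below assumes pre-activations drawn from a standard normal distribution, which is a simplification chosen for illustration and not a property of any particular network.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical pre-activations for 10,000 units.
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]
outputs = [max(0.0, z) for z in pre_activations]

# Fraction of units whose ReLU output is exactly zero ("inactive").
zero_fraction = sum(out == 0.0 for out in outputs) / len(outputs)
print(f"fraction of inactive units: {zero_fraction:.2f}")
```

With zero-mean inputs, roughly half of the units end up inactive.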

### Drawbacks

1. **Dying ReLU Problem**: Since the gradient for negative values is zero, during training, some
   neurons might never activate, effectively getting knocked off during the training and not
   contributing to the model.
   This is called the "dying ReLU" problem.
2. **Not Zero-Centered**: Unlike the tanh function, ReLU outputs are not zero-centered.
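
The dying ReLU problem can be seen on a single hypothetical neuron whose bias is so negative that its pre-activation is below zero for every input. The weight, bias, and inputs below are made-up values for illustration.

```python
def relu_grad_scalar(z: float) -> float:
    """Gradient of ReLU with respect to its pre-activation z."""
    return 1.0 if z > 0 else 0.0


# Hypothetical neuron: output = relu(weight * x + bias).
weight, bias = 0.5, -10.0
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Chain rule: d relu(weight * x + bias) / d weight = relu'(weight * x + bias) * x
weight_grads = [relu_grad_scalar(weight * x + bias) * x for x in inputs]

# Every gradient is zero, so gradient descent cannot move the weight.
print(all(g == 0.0 for g in weight_grads))  # True
```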

### Examples


#### Example 1: Computing the ReLU of a tensor:

In [None]:
from activations import relu

x = torch.Tensor([-1.5, 0, 0.5, 2])
result = relu(x)
print(result)  # Expected: tensor([0.0000, 0.0000, 0.5000, 2.0000])

#### Example 2: Computing the gradient of ReLU for a tensor

In [None]:
x = torch.Tensor([-1.5, 0, 0.5, 2])
gradient_result = relu(x, gradient=True)
print(gradient_result)  # Expected: tensor([0., 0., 1., 1.])

#### Example 3: Using ReLU on higher-dimensional tensors

In [None]:
x = torch.Tensor([[-1, 1], [0, -2]])
result = relu(x)
print(result)
# Expected:
# tensor([[0., 1.],
#         [0., 0.]])

#### Example 4: Testing against PyTorch's built-in ReLU

In [None]:
for _ in range(100):
    x = torch.randn((100, 100, 100))
    actual = relu(x)
    expected = torch.relu(x)
    assert torch.allclose(actual, expected), f"Expected {expected}, got {actual}"
print("All tests passed!")

## CELU Activation Function

The `CELU` (Continuously Differentiable Exponential Linear Units) activation function emerges as an enhancement over traditional ReLU and ELU activation functions. Its purpose is twofold:

1. **Overcoming the Dying ReLU Problem**: By permitting negative values for inputs below zero, CELU mitigates the issue where neurons can sometimes become inactive and no longer update their weights—a phenomenon known as the "dying ReLU" problem.

2. **Maintaining Smooth Gradients**: The function is designed to offer continuous differentiability, ensuring smooth gradients that aid in the optimization process.

### Mathematical Definition:

For an input $x$ and a parameter $\alpha > 0$, CELU is mathematically represented as:

$$
    \mathrm{celu}(x, \alpha) = 
    \begin{cases}
            x & \text{if } x \geq 0 \\
            \alpha (\exp(\frac{x}{\alpha}) - 1) & \text{otherwise}
    \end{cases}
$$

Where:
- $x$ denotes the input.
- $\alpha$ is a tunable parameter governing saturation for negative inputs: the output approaches $-\alpha$ as $x \to -\infty$, and larger values of $\alpha$ make the function saturate more slowly.

The gradient of the CELU function with respect to its input $x$ is:

$$
    \frac{\partial\ \text{celu}(x, \alpha)}{\partial x} = 
    \begin{cases}
            1 & \text{if } x \geq 0 \\
            e^{\frac{x}{\alpha}} & \text{if } x < 0
    \end{cases}
$$
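
Differentiating the negative branch $\alpha (\exp(\frac{x}{\alpha}) - 1)$ with respect to $x$ gives $\exp(\frac{x}{\alpha})$. The sketch below compares that expression with a finite-difference estimate; the helper names are hypothetical and the code does not depend on the `activations` module.

```python
import math


def celu_scalar(x: float, alpha: float = 1.0) -> float:
    """CELU of a single float, following the piecewise definition."""
    return x if x >= 0 else alpha * (math.exp(x / alpha) - 1.0)


def celu_grad_scalar(x: float, alpha: float = 1.0) -> float:
    """Derivative with respect to x: 1 for x >= 0, exp(x / alpha) otherwise."""
    return 1.0 if x >= 0 else math.exp(x / alpha)


def central_difference(f, x: float, h: float = 1e-6) -> float:
    """Numerical derivative, used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for alpha in (0.5, 1.0, 2.0):
    for x in (-3.0, -1.0, -0.1, 0.5, 2.0):
        numeric = central_difference(lambda t: celu_scalar(t, alpha), x)
        assert math.isclose(celu_grad_scalar(x, alpha), numeric, abs_tol=1e-6)
print("analytic and numerical CELU gradients agree")
```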

### Advantages:

1. **Avoiding the Dying ReLU Problem**: Unlike ReLU, which can "kill" neurons leading them to
   output only zeros (especially during the training phase), CELU allows negative values for
   inputs below zero.
2. **Smooth Gradient**: Ensures smoother gradients compared to the original ReLU, which can help
   improve optimization and convergence during training.
3. **Configurable Saturation Rate**: The $\alpha$ parameter allows for configuring how fast the 
   activation saturates for negative inputs.
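
The effect of $\alpha$ on saturation can be shown with a short sketch: for a strongly negative input the output settles near $-\alpha$. `celu_scalar` is a hypothetical standard-library helper, not the module's `celu`.

```python
import math


def celu_scalar(x: float, alpha: float = 1.0) -> float:
    """CELU of a single float, following the piecewise definition."""
    return x if x >= 0 else alpha * (math.exp(x / alpha) - 1.0)


# For a very negative input, the output is close to -alpha.
for alpha in (0.5, 1.0, 2.0):
    print(f"alpha = {alpha:.1f}  celu(-50) = {celu_scalar(-50.0, alpha):.4f}")
```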

### Disadvantages:

1. **Computational Overhead**: Due to the exponential function, CELU can be more computationally expensive than simpler activation functions like ReLU.
2. **Parameter Tuning**: Introducing the $\alpha$ parameter can sometimes require additional tuning to get optimal performance, adding to the complexity of the model.

### Usage:

While CELU can be used in a variety of deep learning architectures, it's especially beneficial in scenarios where you observe the dying ReLU problem or when you want a smoother gradient for better optimization.

### Examples

#### Example 1: Computing the CELU of a tensor:

In [None]:
from activations import celu

print(celu(torch.tensor([-1.0, 0.0, 1.0])))  # Output: tensor([-0.6321,  0.0000,  1.0000])

#### Example 2: Varying the Alpha Parameter

In [None]:
result_with_alpha = celu(torch.tensor([-1.0, 0.0, 1.0]), alpha=0.5)
print(result_with_alpha)  # Output: tensor([-0.4323,  0.0000,  1.0000])

#### Example 3: Computing the Gradient

In [None]:
x = torch.Tensor([-1, 0, 1])
gradient_result = celu(x, gradient=True)
print(gradient_result)  # Output: tensor([0.3679, 1.0000, 1.0000])

#### Example 4: Higher-dimensional Tensors

In [None]:
x = torch.Tensor([[1, -1], [0, 2]])
result = celu(x)
print(result)
# Output: 
# tensor([[ 1.0000, -0.6321],
#         [ 0.0000,  2.0000]])

#### Example 5: Testing against PyTorch's Implementation

In [None]:
for _ in range(100):
    x = torch.randn((10, 10))
    actual = celu(x)
    expected = torch.celu(x)
    assert torch.allclose(actual, expected, atol=1e-4), f"Expected {expected}, but got {actual}"
print("All tests passed!")

## Swish Activation Function

The `Swish` activation function, introduced by researchers at Google, is a smooth, non-monotonic function that has been reported to match or outperform ReLU in deep networks. Swish is self-gated: each input is multiplied by a sigmoid of itself, so the input controls how much of its own value passes through.

### Formula:

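The Swish function is given by the formula below, where $\beta$ is a scaling parameter that can be fixed or learned (with $\beta = 1$, Swish is also known as SiLU):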

$$
\mathrm{swish}(x) = x \times \mathrm{sigmoid}(\beta x)
$$

### Properties:

1. **Smoothness**: Swish is continuously differentiable, which ensures smooth gradients and assists in the optimization process.
2. **Non-monotonicity**: Unlike ReLU, Swish is non-monotonic: for negative inputs the output dips below zero before rising back toward zero as $x \to -\infty$.
3. **Self-Gated**: The function's adaptability arises from its self-gated nature, allowing each neuron to regulate its own activation based on its input.
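
The non-monotonic shape is easy to see by evaluating the function at a few negative inputs. This is a minimal standard-library sketch; `swish_scalar` is a hypothetical helper, not the module's `swish`.

```python
import math


def swish_scalar(x: float, beta: float = 1.0) -> float:
    """x * sigmoid(beta * x) for a single float."""
    return x / (1.0 + math.exp(-beta * x))


# Moving from x = -5 toward 0, the output first decreases and then increases.
for x in (-5.0, -2.0, -1.0, -0.5, 0.0):
    print(f"x = {x:5.1f}  swish = {swish_scalar(x):.4f}")
```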

### Benefits:

- **Performance in Deep Networks**: Empirical results reported for Swish show that it often matches or outperforms other activation functions, especially in deeper networks.
- **Computational Efficiency**: Swish needs one sigmoid evaluation per element, so it costs more than ReLU but typically remains cheap relative to the rest of the network.
  
### Gradient:

The gradient of the Swish function with respect to its input $x$ is given by:

$$
    \mathrm{swish}'(x) =
            \mathrm{sigmoid}(\beta x)
            + \beta x \times \mathrm{sigmoid}(\beta x)
                \times (1 - \mathrm{sigmoid}(\beta x))
$$
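
The gradient formula can be checked against a finite-difference estimate for several values of $\beta$. The sketch below uses hypothetical standard-library helpers rather than the module's `swish`.

```python
import math


def sigmoid_scalar(x: float) -> float:
    """Logistic sigmoid of a single float."""
    return 1.0 / (1.0 + math.exp(-x))


def swish_scalar(x: float, beta: float = 1.0) -> float:
    """x * sigmoid(beta * x)."""
    return x * sigmoid_scalar(beta * x)


def swish_grad_scalar(x: float, beta: float = 1.0) -> float:
    """sigmoid(beta x) + beta x sigmoid(beta x) (1 - sigmoid(beta x))."""
    s = sigmoid_scalar(beta * x)
    return s + beta * x * s * (1.0 - s)


def central_difference(f, x: float, h: float = 1e-6) -> float:
    """Numerical derivative, used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for beta in (0.5, 1.0, 1.5):
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        numeric = central_difference(lambda t: swish_scalar(t, beta), x)
        assert math.isclose(swish_grad_scalar(x, beta), numeric, abs_tol=1e-6)
print("analytic and numerical Swish gradients agree")
```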

Every term reuses $\mathrm{sigmoid}(\beta x)$, so the gradient can be computed from a single sigmoid evaluation, and it is continuous for all $x$.

In essence, the Swish activation function offers a blend of linearity and non-linearity, making it a compelling choice for many deep learning tasks.

### Examples

#### 1. Basic Computation of Swish Function:

In [None]:
from activations import swish

x = torch.Tensor([-1, 0, 1])
output = swish(x)
print(output)  # Expected output: tensor([-0.2689,  0.0000,  0.7311])

#### 2. Computing Gradient of Swish Function:

In [None]:
x = torch.Tensor([-1, 0, 1])
gradient = swish(x, gradient=True)
print(gradient)  # Expected output: tensor([0.0723, 0.5000, 0.9277])

#### 3. Using a Different Beta Value:

In [None]:
x = torch.Tensor([-1, 0, 1])
output_with_beta = swish(x, beta=1.5)
print(output_with_beta)  # Expected output: tensor([-0.1824,  0.0000,  0.8176])

#### 4. Handling Higher-Dimensional Tensors:

In [None]:
x = torch.Tensor([[0, 1], [-1, 2]])
result = swish(x)
print(result)
# Expected output:
# tensor([[ 0.0000,  0.7311],
#         [-0.2689,  1.7616]])

#### 5. Verifying Against PyTorch's Sigmoid:

In [None]:
for _ in tqdm(range(100)):
    x = torch.randn((100, 100, 100))
    actual, expected = swish(x), x * torch.sigmoid(x)  # Using PyTorch's built-in sigmoid for verification
    assert torch.allclose(actual, expected, atol=1e-6), f"Expected {expected}, but got {actual}"

print("All tests passed!")

## Softmax Activation Function

The softmax function is a crucial activation function predominantly used in the output layers of classification neural networks. It transforms a vector of arbitrary real values into a probability distribution over multiple classes.

### Definition

Given an input vector $\mathbf{x} = [x_1, x_2, \dots, x_k]$, the $i$-th component of $\mathrm{softmax}(\mathbf{x})$ is defined as:

$$\mathrm{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j=1}^{k} e^{x_j}}$$

### Key Characteristics

1. **Output Range**: Each component of the output vector lies in the range (0, 1), making it interpretable as a probability.
2. **Normalization**: The sum of all the components of the output vector is 1, ensuring it's a valid probability distribution.
3. **Monotonicity**: If one component of the input vector increases while others remain constant, the corresponding component of the softmax output will also increase.
4. **Sensitivity**: It amplifies the differences between the largest component and other components in the input vector.
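
These characteristics can be checked on a small vector. The sketch below is a plain-Python illustration; `softmax_list` is a hypothetical helper and not the module's `softmax`.

```python
import math


def softmax_list(values):
    """Softmax of a list of floats (the maximum is subtracted before exponentiating)."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax_list([2.0, 1.0, 0.1])

# Output range and normalization.
assert all(0.0 < p < 1.0 for p in probs)
assert math.isclose(sum(probs), 1.0, abs_tol=1e-12)

# Monotonicity: raising the first input raises the first output.
assert softmax_list([3.0, 1.0, 0.1])[0] > probs[0]

# Sensitivity: a difference of 1 between two inputs becomes a ratio of e between outputs.
assert math.isclose(probs[0] / probs[1], math.e, rel_tol=1e-9)
print([f"{p:.4f}" for p in probs])
```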

### Applications

1. **Multiclass Classification**: Softmax is extensively used in neural networks for multiclass classification tasks. When the network needs to decide among multiple classes, the softmax function is typically used in the final layer, coupled with the categorical cross-entropy loss during training.
2. **Reinforcement Learning**: In policy gradient methods, the softmax function helps in producing a probability distribution over actions.

### Considerations

1. **Numerical Stability**: Direct computation can lead to numerical instability due to the exponentiation of large numbers. This can be mitigated by subtracting the maximum value in the input vector from all components of the vector before applying the softmax.
2. **Choice of Loss Function**: When using softmax in neural networks, it's crucial to pair it with an appropriate loss function. The categorical cross-entropy loss is the most common choice.
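
The max-subtraction trick can be sketched in plain Python. The function names below are hypothetical. Note one difference from tensors: Python floats raise `OverflowError` when `math.exp` overflows, whereas float32 tensors would silently produce `inf` and then `nan`.

```python
import math


def softmax_naive(values):
    """Direct implementation of the definition; overflows for large inputs."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_stable(values):
    """Subtract the maximum first; the result is mathematically unchanged."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


large = [1000.0, 1001.0, 1002.0]

try:
    softmax_naive(large)
    naive_failed = False
except OverflowError:
    naive_failed = True

print("naive version overflowed:", naive_failed)
print("stable version:", [f"{p:.4f}" for p in softmax_stable(large)])
```

Subtracting the maximum makes the largest exponent zero, so no term can overflow, and the shift cancels between numerator and denominator.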

### Differences from Other Activation Functions

Unlike sigmoid or tanh, which operate element-wise and squish their input into a bounded range, softmax operates on whole vectors and ensures that its outputs sum to 1. This makes it suitable for producing probability distributions over multiple categories.
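
Because every output depends on every input, the derivative of softmax is a Jacobian matrix with entries $\frac{\partial s_i}{\partial x_j} = s_i (\delta_{ij} - s_j)$, where $s = \mathrm{softmax}(\mathbf{x})$. The sketch below builds that matrix in plain Python; the helper names are hypothetical and this is not necessarily how the `activations` module handles the softmax gradient.

```python
import math


def softmax_list(values):
    """Softmax of a list of floats (the maximum is subtracted before exponentiating)."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(values):
    """Matrix J with J[i][j] = s_i * (delta_ij - s_j)."""
    s = softmax_list(values)
    n = len(s)
    return [
        [s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
        for i in range(n)
    ]


jac = softmax_jacobian([2.0, 1.0, 0.1])
for row in jac:
    print([f"{entry: .4f}" for entry in row])
```

Each row sums to zero because the outputs always sum to 1, and the off-diagonal entries are negative: raising one input lowers every other output.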

### Examples

#### 1. Basic 1D Vector

In [None]:
from activations import softmax

softmax_values = softmax(torch.tensor([2.0, 1.0, 0.1]), dim=0)
print(softmax_values)
# Output: tensor([0.6590, 0.2424, 0.0986])

#### 2. 2D Tensor (Matrix)

In [None]:
# Define a 2D tensor
tensor_2d = torch.Tensor([[1.0, 2.0, 3.0], [0.1, -0.5, 0.2]])

# Compute softmax values along dim=1
softmax_matrix = softmax(tensor_2d, dim=1)

print(softmax_matrix)
# Expected output:
# tensor([[0.0900, 0.2447, 0.6652],
#         [0.3768, 0.2068, 0.4164]])

#### 3. Using the Stable Softmax

In [None]:
# exp() overflows float32 for inputs above roughly 88, so these values trigger the problem
tensor = torch.Tensor([1000.0, 1001.0, 1002.0])

# Without stability
values_unstable = softmax(tensor, dim=0, stable=False)

# With stability
values_stable = softmax(tensor, dim=0, stable=True)

print(values_unstable)  # Likely contains nan because exp() overflows to inf

print(values_stable)    # Should produce valid probabilities

#### 4. High-dimensional Tensors

In [None]:
tensor_3d = torch.rand(2, 3, 4)  # 3D tensor

# Apply softmax along the second dimension
softmax_3d = softmax(tensor_3d, dim=1)
print(softmax_3d)

# Softmax is taken over dim=1 (size 3), so within each of the two 3x4 slices every column sums to 1.

#### 5. Testing Against PyTorch's Implementation

In [None]:
for _ in range(100):
    x = torch.randn((100, 100, 100))
    actual = softmax(x, dim=1)
    expected = torch.softmax(x, dim=1)
    assert torch.allclose(actual, expected, atol=1e-4), f"Expected {expected}, but got {actual}"
print("All tests passed!")