1️⃣ Exponential Linear Unit (ELU)

✅ Why is it useful?
Unlike ReLU, ELU allows small negative values for x ≤ 0, avoiding the "dying ReLU" problem.

With α = 1 its gradient is continuous at x = 0, whereas the gradient of Leaky ReLU jumps there; this smoothness can make training better behaved.

✅ Downsides?

More expensive to compute than ReLU because of the exponential term.
Slightly more complex than ReLU, since it adds a hyperparameter α (alpha) to choose.
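
The points above can be sketched in plain Python (standard library only, so the arithmetic stays visible). This assumes the usual ELU definition, ELU(x) = x for x > 0 and α(e^x − 1) for x ≤ 0. The helper names `relu`, `elu`, `relu_grad`, and `elu_grad` are illustrative, not library functions.

```python
import math

def relu(x: float) -> float:
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: identity for x > 0, alpha * (exp(x) - 1) for x <= 0."""
    return x if x > 0 else alpha * math.expm1(x)

def relu_grad(x: float) -> float:
    """Derivative of ReLU (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0

def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Derivative of ELU: 1 for x > 0, alpha * exp(x) for x <= 0."""
    return 1.0 if x > 0 else alpha * math.exp(x)

# Compare values and gradients on a few negative and positive inputs.
for x in (-3.0, -1.5, -0.1, 0.5):
    print(f"x={x:+.1f}  relu={relu(x):+.4f} (grad {relu_grad(x):.4f})  "
          f"elu={elu(x):+.4f} (grad {elu_grad(x):.4f})")
```

For negative inputs the ReLU gradient is exactly zero, while the ELU gradient stays positive, which is why a unit using ELU can keep learning. For x = -1.5 the ELU value is about -0.7769, matching the output of the PyTorch cell in this section.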

2️⃣ Swish (Self-Gated Activation)
✅ Formula:

$$\mathrm{Swish}(x) = x \cdot \mathrm{sigmoid}(x) = \frac{x}{1 + e^{-x}}$$

✅ Why is it useful?

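Unlike ReLU, it is smooth and non-monotonic: small negative inputs produce small negative outputs instead of being zeroed out, while large negative inputs still approach zero.

Researchers at Google Brain reported that Swish tends to match or outperform ReLU in deep networks.

Like ELU, it helps prevent dead neurons, because its gradient is nonzero for almost all negative inputs.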

✅ Downsides?

Slightly more computationally expensive than ReLU.

Its gains show up mainly in deeper networks; in simple ones the benefit over ReLU is often small.
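
To make the "non-monotonic" behaviour concrete, here is a small sketch in plain Python. The function name `swish` and the grid search are illustrative assumptions, not part of any library; the grid only approximates where the minimum sits.

```python
import math

def swish(x: float) -> float:
    """Swish(x) = x * sigmoid(x) = x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))

# Values dip below zero for negative inputs, then rise back toward zero.
for x in (-5.0, -3.0, -1.5, -0.5, 0.0, 0.5):
    print(f"x={x:+.1f}  swish={swish(x):+.4f}")

# Coarse grid search for the minimum on [-5, 0], in steps of 0.01.
grid = [i / 100.0 for i in range(-500, 1)]
x_min = min(grid, key=swish)
print(f"approximate minimum: x={x_min:.2f}, swish={swish(x_min):.4f}")
```

The output decreases as x goes from 0 down to roughly -1.28 and then climbs back toward zero, so the function is not monotonic, yet negative inputs are never clipped to exactly zero as they are with ReLU.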

3️⃣ Softmax (for Multi-Class Classification)

It converts logits (raw output scores) into probabilities that sum to 1, making it useful for multi-class classification.

✅ Why is it useful?

Converts arbitrary real values into probability distributions.
Ensures the sum of all outputs is 1, making it easy to interpret as probabilities.
Used in classification tasks like image recognition.

✅ Downsides?

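Sensitive to large logits: a naive implementation can overflow when exponentiating them, and a saturated softmax (one probability close to 1) yields very small gradients.

The exponential amplifies differences between logits, so modest gaps in the inputs can turn into overconfident probabilities.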
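
Both downsides can be illustrated with a short sketch in plain Python. The names `naive_softmax` and `stable_softmax` are hypothetical helpers written for this example; the stable version uses the common trick of subtracting the largest logit before exponentiating, which leaves the result unchanged mathematically. Library implementations such as `F.softmax` typically handle this internally.

```python
import math

def naive_softmax(logits):
    """Direct definition: exp(z_i) / sum_j exp(z_j). Can overflow."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(logits):
    """Same function, but shifted by max(logits) so every exponent is <= 0."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

small = [2.0, 1.0, 0.1]
large = [1000.0, 999.0, 998.1]  # same gaps as `small`, shifted by 998

# The naive version fails on large logits because exp(1000) overflows a float.
try:
    naive_softmax(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
print("naive overflowed:", naive_overflowed)
print("stable, large logits:", stable_softmax(large))

# Scaling the logits widens the gaps and sharpens the distribution.
for scale in (1.0, 5.0):
    scaled = [scale * z for z in small]
    print(f"scale={scale}: {stable_softmax(scaled)}")
```

Because softmax depends only on the differences between logits, the shifted inputs give the same probabilities as `[2.0, 1.0, 0.1]`. Multiplying the logits by 5 pushes the top probability above 0.99, which is the amplification effect described above.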

In [1]:
import torch
import torch.nn.functional as F

In [2]:
# Input tensor
x = torch.tensor([-1.5])

In [3]:
# Activation functions
elu = F.elu(x, alpha=1.0)
swish = x * torch.sigmoid(x)
softmax = F.softmax(torch.tensor([2.0, 1.0, 0.1]), dim=0)  # Example input for multi-class
In [4]:
# Print results
print(f'ELU : {elu.item()}')
print(f'SWISH : {swish.item()}')
print(f'Softmax (example inputs [2.0, 1.0, 0.1]) : {softmax.tolist()}')

ELU : -0.7768698334693909
SWISH : -0.2736383080482483
Softmax (example inputs [2.0, 1.0, 0.1]) : [0.6590011715888977, 0.24243298172950745, 0.09856589138507843]


👉 Key Takeaways

Use ReLU for most hidden layers (fast & simple).

Use Leaky ReLU/ELU if you see dead neurons with ReLU.

Use Swish if you want potentially better optimization in deep networks and can accept a slightly higher computational cost.

Use Sigmoid for binary-classification outputs and Tanh when you need zero-centered activations.

Use Softmax in the last layer of multi-class classification tasks.
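
As a rough illustration of how these choices fit together, the sketch below runs one forward pass through a tiny network with a ReLU hidden layer and a Softmax output layer. It is plain Python with hand-picked, hypothetical weights (`W1`, `b1`, `W2`, `b2`) and helper names (`linear`, `relu`, `softmax`, `forward`) invented for this example; a real model would learn its weights and would normally be built with a framework such as PyTorch.

```python
import math

def linear(x, weights, bias):
    """Affine layer: one output per weight row, y_i = sum_j W_ij * x_j + b_i."""
    return [sum(w * a for w, a in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    """Element-wise ReLU for the hidden layer."""
    return [max(0.0, a) for a in v]

def softmax(logits):
    """Softmax for the output layer, shifted by the max logit for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights chosen by hand: 2 inputs -> 3 hidden units -> 3 classes.
W1 = [[0.5, -1.0], [-0.3, 0.8], [1.2, 0.1]]
b1 = [0.0, 0.1, -0.2]
W2 = [[1.0, -0.5, 0.3], [-0.7, 0.9, 0.2], [0.1, 0.4, -0.6]]
b2 = [0.0, 0.0, 0.0]

def forward(x):
    hidden = relu(linear(x, W1, b1))   # ReLU in the hidden layer
    logits = linear(hidden, W2, b2)    # raw class scores
    return softmax(logits)             # Softmax only in the last layer

probs = forward([1.0, 2.0])
print(probs)
```

Swapping `relu` for an ELU or Swish function in the hidden layer would change only one line, which is how the alternatives above are usually tried in practice.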