# Activation Functions in Neural Networks

An activation function maps the weighted input of a neuron to its output. Non-linear activations are what let a network learn patterns that a purely linear model cannot. Here’s a structured summary of the most common activation functions, with the formula, a description, advantages, and limitations for each.

---

## 1. Binary Step
- **Formula:**
  $$
  f(x) = \begin{cases}
    0, & x < 0 \\
    1, & x \geq 0
  \end{cases}
  $$
- **Description:** The binary step function outputs either 0 or 1 depending on whether the input is below or above a threshold (usually 0). It is used for simple decision-making tasks, such as binary classification in perceptrons.
- **Why Use:** Simple, useful for binary classification (yes/no).
- **Limitations:** Not differentiable at 0 and has zero gradient everywhere else, so backpropagation gets no learning signal; too rigid for complex tasks.
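
To make the perceptron use case concrete, here is a minimal sketch of a single neuron with a binary step activation that computes logical AND. The weights and bias are hand-picked for illustration (they are not learned), and the function names are hypothetical.

```python
def binary_step(x: float) -> int:
    """Return 1 when x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def perceptron_and(a: int, b: int) -> int:
    # Hand-picked (not learned) weights and bias that implement logical AND.
    w1, w2, bias = 1.0, 1.0, -1.5
    return binary_step(w1 * a + w2 * b + bias)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron_and(a, b))
```

Only the input pair (1, 1) pushes the weighted sum above the threshold, so only that pair produces 1.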

---

## 2. Linear
- **Formula:**
  $$
  f(x) = x
  $$
- **Description:** The linear activation function outputs the input directly. It is used in regression tasks where the output can be any real value. However, stacking linear layers results in a linear function, so it cannot model complex patterns.
- **Why Use:** Easy, keeps values unchanged. Used in regression output layers.
- **Limitations:** No non-linearity, so it can’t model complex patterns; not suitable as the hidden-layer activation of a deep network.
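
The claim that stacked linear layers collapse into one linear function can be checked numerically. The sketch below uses two small, arbitrary weight matrices and omits biases for brevity; the helper names `matmul` and `matvec` are illustrative, not from any library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]


W1 = [[1.0, 2.0], [0.0, -1.0]]   # first layer weights (arbitrary)
W2 = [[3.0, 0.5], [-2.0, 1.0]]   # second layer weights (arbitrary)
x = [4.0, -3.0]

two_layers = matvec(W2, matvec(W1, x))   # apply layer 1, then layer 2
one_layer = matvec(matmul(W2, W1), x)    # apply the single merged matrix W2 W1

print(two_layers, one_layer)
```

Both computations give the same vector, which is why depth adds no expressive power without a non-linear activation in between.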

---

## 3. Sigmoid
- **Formula:**
  $$
  f(x) = \frac{1}{1 + e^{-x}}
  $$
- **Description:** The sigmoid function squashes input values into the range (0, 1). It is often used in the output layer for binary classification, as it can represent probabilities. However, it suffers from vanishing gradients for large positive or negative inputs.
- **Why Use:** Outputs between (0,1). Good for probabilities. Smooth and differentiable.
- **Limitations:** The gradient never exceeds 0.25 and vanishes for large |x|, and outputs are not zero-centered; both can slow convergence.
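
The vanishing gradient is easy to see by evaluating the derivative σ(x)(1 − σ(x)) at a few points. This is a minimal sketch; the branch on the sign of `x` is one common way to avoid overflow in `exp`, and the function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow in exp for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x < 0 here, so exp(x) cannot overflow
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative sigma(x) * (1 - sigma(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.6f}")
```

The gradient shrinks rapidly as |x| grows, so a saturated sigmoid unit passes almost no error signal back to earlier layers.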

---

## 4. Tanh (Hyperbolic Tangent)
- **Formula:**
  $$
  f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
  $$
- **Description:** The tanh function squashes input values into the range (–1, 1). It is zero-centered, making optimization easier than sigmoid. Commonly used in hidden layers of neural networks.
- **Why Use:** Outputs between (–1,1). Zero-centered. Better for hidden layers than sigmoid.
- **Limitations:** Still suffers from vanishing gradients for large |x|, which can slow down learning.
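
For reference, tanh is a rescaled and shifted sigmoid, and its derivative peaks at 1 (at x = 0) rather than 0.25. A short derivation, writing σ for the sigmoid of section 3:

$$
\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}} = \frac{2}{1 + e^{-2x}} - 1 = 2\sigma(2x) - 1,
\qquad
\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)
$$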

---

## 5. ReLU (Rectified Linear Unit)
- **Formula:**
  $$
  f(x) = \max(0, x)
  $$
- **Description:** ReLU outputs the input if it is positive, otherwise 0. It is the most widely used activation in deep learning because it is simple and helps mitigate the vanishing gradient problem.
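- **Why Use:** Simple and fast. The gradient is exactly 1 for every positive input, so it does not saturate on that side. Most widely used in hidden layers.
- **Limitations:** “Dying ReLU” problem (a neuron whose input stays negative outputs 0, receives zero gradient, and stops learning), not zero-centered.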
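
The dying ReLU problem can be illustrated with a single hypothetical neuron whose bias has drifted far into the negative range. The weight, bias, and inputs below are made up for the example; the gradient at exactly 0 is set to 0 by convention.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Gradient of ReLU; the value at x == 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0


# Hypothetical neuron with a strongly negative bias.
weight, bias = 0.5, -10.0
inputs = [-2.0, 0.0, 1.0, 3.0]

pre_activations = [weight * x + bias for x in inputs]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

print(outputs)  # every output is 0.0
print(grads)    # every gradient is 0.0, so no update reaches weight or bias
```

Because the gradient is zero for every input in this range, gradient descent cannot move the neuron back into its active region.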

---

## 6. Leaky ReLU
- **Formula:**
  $$
  f(x) = \begin{cases}
    x, & x \geq 0 \\
    \alpha x, & x < 0
  \end{cases}
  $$
  (α ≈ 0.01)
- **Description:** Leaky ReLU keeps a small, non-zero gradient when the input is negative, which mitigates the dying ReLU problem. The slope α is a small constant chosen before training.
- **Why Use:** Mitigates dying ReLU by allowing a small negative slope.
- **Limitations:** α is a fixed hyperparameter, so the chosen value may not be optimal for every task.
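
A minimal sketch of Leaky ReLU and its gradient, using the α = 0.01 default from the formula above (the function names are illustrative):

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    return x if x >= 0 else alpha * x


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient is 1 for x >= 0 and alpha (never zero) for x < 0."""
    return 1.0 if x >= 0 else alpha


for x in (-10.0, -1.0, 0.0, 2.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Even for a strongly negative input the gradient equals α rather than 0, so the neuron can still receive updates.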

---

## 7. Parametric ReLU (PReLU)
- **Formula:**
  $$
  f(x) = \begin{cases}
    x, & x \geq 0 \\
    a x, & x < 0
  \end{cases}
  $$
  (a is learned during training)
- **Description:** PReLU generalizes Leaky ReLU by making the negative slope a learnable parameter, allowing the network to adapt the slope during training.
- **Why Use:** Learns the negative slope automatically, more flexible.
- **Limitations:** More parameters, risk of overfitting if data is limited.
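
What "learned during training" means for the slope can be sketched with one gradient-descent step on `a`. The squared-error loss, learning rate, input, and target below are assumptions chosen for illustration, not part of the PReLU definition.

```python
def prelu(x: float, a: float) -> float:
    return x if x >= 0 else a * x


def prelu_grad_a(x: float) -> float:
    """Partial derivative of PReLU with respect to the slope a."""
    return 0.0 if x >= 0 else x


# Assumed toy setup: one negative input, one target, loss = 0.5 * (y - target)^2.
x, target = -2.0, -1.0
a = 0.25
learning_rate = 0.1

y = prelu(x, a)                      # -0.5
error = y - target                   # dLoss/dy = 0.5
grad_a = error * prelu_grad_a(x)     # chain rule: dLoss/da = -1.0
a_new = a - learning_rate * grad_a   # slope moves to about 0.35

print(a_new, prelu(x, a_new))
```

After the step the output for this input is closer to the target, which is the sense in which the network adapts the slope. For non-negative inputs the derivative with respect to `a` is 0, so only negative inputs update it.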

---

## 8. Exponential Linear Unit (ELU)
- **Formula:**
  $$
  f(x) = \begin{cases}
    x, & x > 0 \\
    \alpha (e^x - 1), & x \leq 0
  \end{cases}
  $$
- **Description:** ELU uses an exponential function for negative inputs, which helps keep mean activations close to zero and improves learning speed.
- **Why Use:** Negative outputs help keep mean near 0. Smooth gradients, helps convergence.
- **Limitations:** More expensive to compute (uses exponentials); α is a hyperparameter that must be chosen (1 is a common default).
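
The effect on mean activations can be checked on a small symmetric sample. This is a sketch assuming α = 1; `math.expm1` computes e^x − 1 accurately for small x, and the sample inputs are arbitrary.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    return x if x > 0 else alpha * math.expm1(x)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    return 1.0 if x > 0 else alpha * math.exp(x)


def relu(x: float) -> float:
    return max(0.0, x)


inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]   # symmetric around zero
elu_mean = sum(elu(x) for x in inputs) / len(inputs)
relu_mean = sum(relu(x) for x in inputs) / len(inputs)

print(f"mean ELU output:  {elu_mean:.4f}")
print(f"mean ReLU output: {relu_mean:.4f}")
```

On this sample the ELU mean is closer to zero than the ReLU mean, because negative inputs contribute negative outputs instead of being clipped to 0.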

---

## 9. Softmax
- **Formula:**
  $$
  f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}
  $$
- **Description:** Softmax converts a vector of raw scores (logits) into probabilities that sum to 1. Used in the output layer for multi-class classification.
- **Why Use:** Turns outputs into probabilities (sums to 1). Great for multi-class classification.
- **Limitations:** A naive implementation can overflow on large logits; rarely used as a hidden-layer activation.
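
The overflow risk is usually handled by subtracting the largest logit before exponentiating, which leaves the result unchanged because softmax is invariant to adding a constant to every input. A minimal sketch in plain Python (the function name is illustrative):

```python
import math


def softmax(logits):
    """Numerically stable softmax: shift by the max so every exponent is <= 0."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# math.exp(1000.0) on its own would raise OverflowError; the shifted version is fine.
probs = softmax([1000.0, 1001.0, 1002.0])
print(probs, sum(probs))
```

The probabilities for `[1000, 1001, 1002]` match those for `[0, 1, 2]`, since only the differences between logits matter.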

---

## 10. Softplus
- **Formula:**
  $$
  f(x) = \ln(1 + e^x)
  $$
- **Description:** Softplus is a smooth, differentiable approximation to ReLU. Its derivative is the sigmoid function, which is never exactly zero, so it avoids the dying ReLU problem.
- **Why Use:** Smooth, differentiable version of ReLU. No dying ReLU problem.
- **Limitations:** Slower to compute than ReLU; the gradient still approaches zero for large negative inputs, so vanishing gradients are not fully solved.
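
A direct translation of ln(1 + e^x) overflows for large x, so implementations typically use an equivalent rearranged form. The sketch below uses max(x, 0) + log1p(e^(−|x|)), which is algebraically the same function; the function name is illustrative.

```python
import math


def softplus(x: float) -> float:
    """Stable softplus: ln(1 + e^x) rewritten so exp never sees a large argument."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


for x in (-30.0, -2.0, 0.0, 2.0, 1000.0):
    print(x, softplus(x), max(0.0, x))   # softplus next to ReLU
```

For large positive x the result is indistinguishable from ReLU, while for negative x it stays slightly above zero.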

---

## 11. Swish
- **Formula:**
  $$
  f(x) = x \cdot \sigma(x) = x \cdot \frac{1}{1 + e^{-x}}
  $$
- **Description:** Swish is a smooth, non-monotonic function that has been reported to match or outperform ReLU in some deep networks. It produces small negative outputs for negative inputs and is differentiable everywhere.
- **Why Use:** Smooth, avoids dying ReLU. Better performance in some deep nets.
- **Limitations:** More expensive than ReLU, not always better for all tasks.
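
Non-monotonic means the curve dips below zero and then rises back toward it as x becomes more negative. A minimal sketch that shows the dip (function names are illustrative; the sigmoid is written in an overflow-safe form):

```python
import math


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    return x * sigmoid(x)


for x in (-5.0, -3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:5.1f}  swish={swish(x):.5f}")
```

The output at x = −1 is lower than at both x = −5 and x = 0, so the function decreases and then increases on the negative side, unlike ReLU.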

---

## 12. GELU (Gaussian Error Linear Unit)
- **Formula:**
  $$
  f(x) = x \cdot \Phi(x)
  $$
  (Φ(x) = CDF of standard normal distribution)
- **Description:** GELU multiplies each input by the probability that a standard normal variable is less than that input, so larger inputs pass through almost unchanged and strongly negative inputs are scaled toward zero. Used in modern NLP models (e.g., BERT, GPT).
- **Why Use:** Used in Transformers (BERT, GPT). Smooth, probabilistic gating, helps with deep architectures.
- **Limitations:** More expensive than ReLU because it needs the normal CDF (often replaced by a tanh-based approximation); more complex to implement.
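
Since Φ(x) = ½(1 + erf(x/√2)), the exact form can be written with the error function from the standard library. The sketch below also includes the commonly used tanh-based approximation for comparison; the function names are illustrative.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Widely used tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:5.1f}  exact={gelu(x):.5f}  tanh approx={gelu_tanh(x):.5f}")
```

On these sample points the two versions agree closely, which is why the approximation is a popular substitute when the error function is slow or unavailable.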

---

## Summary Table

| Activation Function | Formula | Why Use (Advantages) | Limitations |
|--------------------|---------|---------------------|-------------|
| Binary Step | $f(x)=\begin{cases}0, x<0 \\ 1, x\geq0\end{cases}$ | Simple, binary classification | Not differentiable, too rigid |
| Linear | $f(x)=x$ | Easy, regression output | No non-linearity |
| Sigmoid | $f(x)=\frac{1}{1+e^{-x}}$ | Probabilities, smooth | Vanishing gradient, not zero-centered |
| Tanh | $f(x)=\tanh(x)$ | Zero-centered, range (–1, 1) | Vanishing gradient |
| ReLU | $f(x)=\max(0,x)$ | Fast, avoids vanishing gradient | Dying ReLU |
| Leaky ReLU | $f(x)=\max(\alpha x, x)$ | Fixes dying ReLU | α fixed, not optimal |
| PReLU | $f(x)=\begin{cases}x, x\geq0 \\ ax, x<0\end{cases}$ | Learns slope | More params, overfitting |
| ELU | $f(x)=\begin{cases}x, x>0 \\ \alpha(e^x-1), x\leq0\end{cases}$ | Smooth, mean near 0 | Expensive |
| Softmax | $f(x_i)=\frac{e^{x_i}}{\sum_j e^{x_j}}$ | Probabilities, multi-class | Overflow risk |
| Softplus | $f(x)=\ln(1+e^x)$ | Smooth ReLU | Slower |
| Swish | $f(x)=x\cdot\sigma(x)$ | Smooth, flexible | Expensive |
| GELU | $f(x)=x\cdot\Phi(x)$ | Used in NLP, smooth | Heavy |

---

> **Tip:** Use ReLU or its variants for hidden layers, Softmax for multi-class output, Sigmoid for binary output, and Linear for regression output.
