# Summary of Activation Functions

## **Step Function**
- **Motivation**: Inspired by the human brain, where neurons fire based on a threshold.
- **Definition**:
  $$
  f(x) =
  \begin{cases} 
  1, & x \geq 0 \\
  0, & x < 0
  \end{cases}
  $$
- **Graph**: A binary step shape.
- **Derivative**: Zero everywhere except at \( x = 0 \), where it is undefined.
- **Properties**:
  - Discontinuous at \( x = 0 \).
  - Bounded between 0 and 1.
  - Not zero-centered.
  - **Performance**: Poor compared to modern activations, because the zero gradient gives gradient-based training no signal for updating weights.
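
The zero derivative is the practical obstacle to training with gradient descent. A minimal sketch in plain Python illustrates it; the helper `numerical_derivative` and the sample points are illustrative assumptions, not part of the original notes:

```python
def step(x: float) -> int:
    """Binary step: 1 for x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def numerical_derivative(f, x: float, h: float = 1e-6) -> float:
    """Central-difference estimate of f'(x) (hypothetical helper)."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Away from the jump at x = 0 the slope is exactly zero,
# so backpropagation would pass no gradient through this unit.
slopes = [numerical_derivative(step, x) for x in (-2.0, -0.5, 0.5, 2.0)]
print(slopes)
```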

---

## **Sigmoid Function**
- **Motivation**: A smooth alternative to the step function whose output can be interpreted as a probability.
- **Definition**:
  $$
  f(x) = \frac{1}{1 + e^{-x}}
  $$
- **Graph**: S-shaped curve ranging from 0 to 1.
- **Derivative**:
  $$
  f'(x) = f(x) (1 - f(x))
  $$
- **Properties**:
  - Continuous and differentiable.
  - Non-linear and monotonic.
  - Not zero-centered.
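  - **Performance**: Saturates for large \( |x| \), where the derivative approaches zero (it never exceeds 0.25); this causes the vanishing gradient problem in deep networks.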
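
The derivative formula above can be checked numerically, and it also shows why gradients shrink. The sketch below uses only the standard library; the branch on the sign of `x` is an added numerical-stability choice, and the ten-layer figure ignores weight terms, so treat it as a rough illustration rather than a statement about any particular network:

```python
import math


def sigmoid(x: float) -> float:
    """Sigmoid, written so that math.exp never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


peak = sigmoid_grad(0.0)    # the derivative is largest at x = 0
tail = sigmoid_grad(10.0)   # nearly zero once the unit saturates
# Upper bound on a product of ten sigmoid derivatives (weights ignored).
bound_10_layers = peak ** 10
print(peak, tail, bound_10_layers)
```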

---

## **Tanh Function**
- **Motivation**: A zero-centered version of the sigmoid.
- **Definition**:
  $$
  f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
  $$
- **Graph**: Similar to sigmoid but ranges from -1 to 1.
- **Derivative**:
  $$
  f'(x) = 1 - f^2(x)
  $$
- **Properties**:
  - Continuous, non-linear, monotonic.
  - Zero-centered.
  - **Performance**: Better than sigmoid, but still suffers from vanishing gradients.
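
Tanh is a rescaled and shifted sigmoid, \( \tanh(x) = 2\sigma(2x) - 1 \), which is why it shares the sigmoid's saturation. A small sketch, assuming plain Python with the `math` module, checks this identity and the derivative formula above; the sample inputs are arbitrary:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def tanh_grad(x: float) -> float:
    """f'(x) = 1 - f(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


x = 0.7
identity_gap = abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0))
center_grad = tanh_grad(0.0)   # steepest point, slope 1
tail_grad = tanh_grad(5.0)     # close to zero: the unit has saturated
print(identity_gap, center_grad, tail_grad)
```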

---

## **ReLU (Rectified Linear Unit)**
- **Motivation**: Mitigate the vanishing gradient problem; the gradient is exactly 1 for all positive inputs, so it does not shrink there.
- **Definition**:
  $$
  f(x) =
  \begin{cases} 
  x, & x \geq 0 \\
  0, & x < 0
  \end{cases}
  $$
- **Graph**: Linear for positive values, zero for negatives.
- **Derivative** (undefined at \( x = 0 \); the value 0 is used there by convention):
  $$
  f'(x) =
  \begin{cases} 
  1, & x > 0 \\
  0, & x \leq 0
  \end{cases}
  $$
- **Properties**:
  - Non-linear and monotonic.
  - Not zero-centered.
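  - **Performance**: Cheap to compute, but suffers from the "dying ReLU" problem: a unit whose input is negative for every example outputs zero, receives zero gradient, and stops learning.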
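
The "dying ReLU" effect can be seen with a single unit. In this sketch the weight `w`, bias `b`, and inputs are made-up values chosen so that the pre-activation is negative for every input:

```python
def relu(x: float) -> float:
    """ReLU as defined above: x for x >= 0, otherwise 0."""
    return x if x >= 0 else 0.0


def relu_grad(x: float) -> float:
    """Derivative with the usual convention f'(0) = 0."""
    return 1.0 if x > 0 else 0.0


inputs = [0.5, 1.0, 2.0]   # hypothetical training inputs
w, b = -1.0, -0.5          # hypothetical parameters of one unit

outputs = [relu(w * x + b) for x in inputs]
# d(output)/dw = relu_grad(w * x + b) * x, summed over the inputs.
grad_w = sum(relu_grad(w * x + b) * x for x in inputs)
print(outputs, grad_w)     # all zeros: the unit gets no update signal
```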

---

## **Softplus**
- **Motivation**: Smooth version of ReLU.
- **Definition**:
  $$
  f(x) = \ln(1 + e^x)
  $$
- **Graph**: Similar to ReLU but smooth at \( x = 0 \).
- **Derivative**:
  $$
  f'(x) = \frac{1}{1 + e^{-x}}
  $$
  (which is just the **Sigmoid** function)
- **Properties**:
  - Continuous and differentiable.
  - **Performance**: Similar to ReLU but less used in practice.
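
A short numerical check confirms that the slope of softplus matches the sigmoid. The rewritten form `max(x, 0) + log1p(exp(-|x|))` is an added assumption for numerical stability; it is algebraically equal to \( \ln(1 + e^x) \). The step size `h` is arbitrary:

```python
import math


def softplus(x: float) -> float:
    """ln(1 + e^x), rearranged so that exp never overflows."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


h = 1e-5
gaps = []
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    # Central-difference slope of softplus compared with sigmoid(x).
    slope = (softplus(x + h) - softplus(x - h)) / (2 * h)
    gaps.append(abs(slope - sigmoid(x)))
print(max(gaps))
```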

---

## **Maxout Activation**
- **Motivation**: Solve ReLU's "dying neurons" issue.
- **Definition**:
  $$
  f(x) = \max(w_1x + b_1, w_2x + b_2)
  $$
- **Properties**:
  - Can output both negative and positive values, so it is not restricted to a non-negative range like ReLU.
  - **Performance**: Does not saturate or die the way ReLU can, but with two linear pieces it doubles the number of parameters per unit.
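
Because Maxout takes the maximum of learned linear pieces, particular parameter choices reproduce familiar activations. A minimal scalar sketch follows; the tuple layout `(w, b)` and the constant names are illustrative assumptions:

```python
def maxout(x: float, piece1: tuple, piece2: tuple) -> float:
    """max(w1*x + b1, w2*x + b2) for a scalar input."""
    w1, b1 = piece1
    w2, b2 = piece2
    return max(w1 * x + b1, w2 * x + b2)


# Each piece is (w, b). Two pieces mean two weights and two biases,
# twice what a single ReLU unit needs.
RELU_PIECES = ((1.0, 0.0), (0.0, 0.0))   # max(x, 0)
ABS_PIECES = ((1.0, 0.0), (-1.0, 0.0))   # max(x, -x) = |x|

xs = (-2.0, 0.0, 3.0)
relu_like = [maxout(x, *RELU_PIECES) for x in xs]
abs_like = [maxout(x, *ABS_PIECES) for x in xs]
print(relu_like, abs_like)
```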

---

## **GELU (Gaussian Error Linear Unit)**
- **Motivation**: Combines properties of ReLU and Dropout.
- **Definition**:
  $$
  f(x) = x \cdot \frac{1}{2} \left( 1 + \text{erf} \left(\frac{x}{\sqrt{2}}\right) \right)
  $$
  or, approximately, in terms of the sigmoid \( \sigma \):
  $$
  f(x) \approx x \cdot \sigma(1.702x)
  $$
- **Graph**: Slightly non-monotonic; it dips below zero for small negative inputs and returns toward zero as \( x \to -\infty \).
- **Performance**: Reported to converge faster than ReLU on some tasks.
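
The two definitions above can be compared directly. This sketch uses `math.erf` from the standard library; the grid of test points on \([-5, 5]\) is an arbitrary choice, and the measured gap only describes that grid:

```python
import math


def gelu_exact(x: float) -> float:
    """x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gelu_sigmoid(x: float) -> float:
    """Sigmoid-based approximation x * sigma(1.702 * x)."""
    return x * sigmoid(1.702 * x)


xs = [i / 10 for i in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_sigmoid(x)) for x in xs)
print(max_gap)

# Non-monotonic: the value at -1 is lower than the value at -3.
print(gelu_exact(-3.0), gelu_exact(-1.0), gelu_exact(0.0))
```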

---

## **Swish (Searched Activation Function)**
- **Motivation**: Discovered by an automated search over candidate activation functions, similar in spirit to Neural Architecture Search (NAS).
- **Definition**:
  $$
  f(x) = x \cdot \sigma(x)
  $$
  where \( \sigma(x) \) is the **Sigmoid function**.
- **Properties**:
  - Continuous and differentiable.
  - **Performance**: Works better than ReLU in some deep networks.
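
The notes do not state a derivative for Swish, but the product rule and the sigmoid derivative \( \sigma'(x) = \sigma(x)(1 - \sigma(x)) \) give \( f'(x) = \sigma(x) + x\,\sigma(x)(1 - \sigma(x)) \). The sketch below checks that expression against a finite-difference slope; the step size and sample points are arbitrary:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def swish(x: float) -> float:
    """f(x) = x * sigma(x)."""
    return x * sigmoid(x)


def swish_grad(x: float) -> float:
    """Product rule: sigma(x) + x * sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


h = 1e-5
gaps = []
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    slope = (swish(x + h) - swish(x - h)) / (2 * h)
    gaps.append(abs(slope - swish_grad(x)))
print(max(gaps))

# A negative slope at x = -2 shows that Swish, like GELU, is not monotonic.
print(swish_grad(-2.0))
```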
