# Activation Functions — Comprehensive Guide (one command per code cell)
This notebook-style script lists common activation functions used in deep learning. For each activation, you get:
- Formula (plain text), short explanation, domain/range, and derivative note
- Minimal examples in pure Python/NumPy, TensorFlow, and PyTorch

Conventions in this project:
- Markdown cells start with `#%% md`
- Code cells start with `#%%`
- "One command per code cell": every code cell contains exactly one executable command (plus optional comment).


# Optional installations (commented)
# Uncomment in your environment if needed.

In [None]:
# !pip install numpy tensorflow torch  # install core libs if missing


# Imports — one per cell to respect the rule

In [None]:
import math  # math utilities for scalar formulas


In [None]:
import numpy as np  # NumPy for vectorized Python examples


In [None]:
import tensorflow as tf  # TensorFlow examples (requires tf installed)


In [None]:
import torch  # PyTorch examples (requires torch installed)


# Sample inputs (shared across examples)

In [None]:
x_np = np.linspace(-5, 5, 11)  # NumPy vector spanning negative to positive


In [None]:
x_tf = tf.constant(np.linspace(-5, 5, 11), dtype=tf.float32)  # TensorFlow vector input


In [None]:
x_torch = torch.linspace(-5, 5, steps=11)  # PyTorch vector input


# 1) Identity / Linear
Formula: f(x) = x
Explanation: Pass-through; used for outputs in regression.
Domain: (-∞, ∞) → Range: (-∞, ∞)
Derivative: f'(x) = 1

In [None]:
x_np  # Identity in NumPy: returns the same input


In [None]:
x_tf  # Identity in TensorFlow: returns the same tensor


In [None]:
x_torch  # Identity in PyTorch: returns the same tensor


# 2) Binary Step (Heaviside)
Formula: f(x) = 1 if x >= 0 else 0
Explanation: Discontinuous step; unsuitable for gradient-based training because the gradient is zero almost everywhere.
Domain: (-∞, ∞) → Range: {0, 1}
Derivative: 0 almost everywhere (undefined at 0)

In [None]:
(x_np >= 0).astype(np.float32)  # NumPy binary step using boolean mask


In [None]:
tf.cast(x_tf >= 0.0, tf.float32)  # TensorFlow binary step via cast


In [None]:
(x_torch >= 0).to(dtype=torch.float32)  # PyTorch binary step via boolean to float


# 3) Sigmoid (Logistic)
Formula: f(x) = 1 / (1 + exp(-x))
Explanation: Squashes to (0,1); good for probabilities/binary outputs; can saturate.
Domain: (-∞, ∞) → Range: (0, 1)
Derivative: f'(x) = f(x)·(1 - f(x))

In [None]:
1.0 / (1.0 + np.exp(-x_np))  # NumPy sigmoid


In [None]:
tf.nn.sigmoid(x_tf)  # TensorFlow sigmoid


In [None]:
torch.sigmoid(x_torch)  # PyTorch sigmoid
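
The derivative identity f'(x) = f(x)·(1 - f(x)) can be checked numerically. The sketch below is a pure-Python illustration, not a library implementation; the helper names `sigmoid` and `sigmoid_grad` are made up for this example. It branches on the sign of x so that `exp` is never called with a large positive argument.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, branching on the sign so exp() never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so z lies in (0, 1)
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Derivative via the identity f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare the closed form with a central finite difference.
h = 1e-6
for x in (-5.0, 0.0, 2.5):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
    print(x, sigmoid_grad(x), numeric)
```

The gradient peaks at 0.25 for x = 0 and shrinks toward zero on both sides, which is the saturation mentioned above.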


# 4) Tanh
Formula: f(x) = (e^x - e^{-x}) / (e^x + e^{-x})
Explanation: Zero-centered version of sigmoid; range (-1,1); can still saturate.
Domain: (-∞, ∞) → Range: (-1, 1)
Derivative: f'(x) = 1 - tanh(x)^2

In [None]:
np.tanh(x_np)  # NumPy tanh


In [None]:
tf.nn.tanh(x_tf)  # TensorFlow tanh


In [None]:
torch.tanh(x_torch)  # PyTorch tanh
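
The relationship between tanh and sigmoid can be made concrete: tanh(x) = 2·sigmoid(2x) - 1. The pure-Python sketch below checks that identity and the derivative formula; `tanh_from_sigmoid` and `tanh_grad` are illustrative names, not library functions.

```python
import math

def tanh_from_sigmoid(x: float) -> float:
    """tanh written as a shifted, rescaled logistic: 2*sigmoid(2x) - 1."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def tanh_grad(x: float) -> float:
    """Derivative f'(x) = 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, math.tanh(x), tanh_from_sigmoid(x), tanh_grad(x))
```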


# 5) ReLU (Rectified Linear Unit)
Formula: f(x) = max(0, x)
Explanation: Sparse activations; mitigates vanishing gradients for positive inputs; a unit can "die" (output 0 with zero gradient) if its pre-activation stays negative.
Domain: (-∞, ∞) → Range: [0, ∞)
Derivative: 1 for x>0, 0 for x<0 (undefined at 0 → pick 0 or 1)

In [None]:
np.maximum(x_np, 0.0)  # NumPy ReLU


In [None]:
tf.nn.relu(x_tf)  # TensorFlow ReLU


In [None]:
torch.relu(x_torch)  # PyTorch ReLU


# 6) Leaky ReLU
Formula: f(x) = x if x>0 else α·x, with small α (e.g., 0.01)
Explanation: Mitigates dying ReLU by keeping a small nonzero slope for negative inputs.
Domain: (-∞, ∞) → Range: (-∞, ∞)
Derivative: 1 for x>0, α for x<0

In [None]:
np.where(x_np > 0, x_np, 0.01 * x_np)  # NumPy Leaky ReLU with alpha=0.01


In [None]:
tf.nn.leaky_relu(x_tf, alpha=0.01)  # TensorFlow Leaky ReLU


In [None]:
torch.nn.functional.leaky_relu(x_torch, negative_slope=0.01)  # PyTorch Leaky ReLU


# 7) PReLU (Parametric ReLU)
Formula: f(x) = x if x>0 else a·x, where a is learned (one shared value or one value per channel).
Explanation: Learnable negative slope generalizing Leaky ReLU.
Domain: (-∞, ∞) → Range: (-∞, ∞)
Derivative: 1 for x>0, a for x<0 (a is learnable)

In [None]:
tf.keras.layers.PReLU()(x_tf)  # TensorFlow PReLU layer applied to x (creates learnable alpha)


In [None]:
torch.nn.PReLU()(x_torch)  # PyTorch PReLU module applied to x (learnable parameter)
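
To show what "learnable" means here, the pure-Python sketch below runs a single gradient-descent step on the slope a. It is an illustration under stated assumptions: the helper names, the squared-error loss, the learning rate, and the target value are all hypothetical; only the initial slope 0.25 mirrors the default of `torch.nn.PReLU`.

```python
def prelu(x: float, a: float) -> float:
    """PReLU forward pass for a scalar input x and slope a."""
    return x if x > 0 else a * x

def prelu_grad_a(x: float) -> float:
    """Gradient of the output with respect to a: 0 for x > 0, x otherwise."""
    return 0.0 if x > 0 else x

# One illustrative gradient step on a with loss L = (y - target)^2.
a, lr = 0.25, 0.1            # lr is an arbitrary choice for this sketch
x, target = -2.0, -1.0       # hypothetical input and target
y = prelu(x, a)
grad_a = 2.0 * (y - target) * prelu_grad_a(x)  # chain rule: dL/dy * dy/da
a_new = a - lr * grad_a
print(y, grad_a, a_new)
```

Only negative inputs contribute to the gradient of a, so the slope adapts based on how the network uses the negative half of the input range.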


# 8) ELU (Exponential Linear Unit)
Formula: f(x) = x if x>0 else α·(exp(x) - 1)
Explanation: Negative values push mean activations toward zero; smooth negative part.
Domain: (-∞, ∞) → Range: (-α, ∞)
Derivative: 1 for x>0; α·exp(x) = f(x) + α for x<=0

In [None]:
np.where(x_np > 0, x_np, 1.0 * (np.exp(x_np) - 1.0))  # NumPy ELU with alpha=1.0


In [None]:
tf.nn.elu(x_tf)  # TensorFlow ELU (alpha=1)


In [None]:
torch.nn.functional.elu(x_torch, alpha=1.0)  # PyTorch ELU
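
For x <= 0 the ELU derivative α·exp(x) equals f(x) + α, so the backward pass can reuse the forward output. A minimal pure-Python sketch of that shortcut follows; `elu` and `elu_grad` are illustrative names, and α = 1.5 is an arbitrary value chosen to make the scaling visible.

```python
import math

def elu(x: float, alpha: float = 1.0) -> float:
    """ELU forward pass; expm1 keeps precision for x close to 0."""
    return x if x > 0 else alpha * math.expm1(x)

def elu_grad(x: float, alpha: float = 1.0) -> float:
    """1 for x > 0; otherwise alpha*exp(x), computed as elu(x) + alpha."""
    return 1.0 if x > 0 else elu(x, alpha) + alpha

alpha = 1.5
for x in (-3.0, -1.0, 0.0, 2.0):
    print(x, elu(x, alpha), elu_grad(x, alpha))
```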


# 9) SELU (Scaled ELU)
Formula: f(x) = λ·x for x>0; f(x) = λ·α·(exp(x) - 1) for x<=0, with fixed λ ≈ 1.0507 and α ≈ 1.6733
Explanation: Self-normalizing activations; use with LeCun normal init and AlphaDropout.
Domain: (-∞, ∞) → Range: (-λ·α, ∞) ≈ (-1.7581, ∞)
Derivative: λ for x>0; λ·α·exp(x) for x<=0

In [None]:
tf.nn.selu(x_tf)  # TensorFlow SELU (uses fixed λ and α constants)


In [None]:
torch.nn.functional.selu(x_torch)  # PyTorch SELU
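
A pure-Python counterpart to the library calls, sketched with the standard SELU constants written out. The names `SELU_LAMBDA`, `SELU_ALPHA`, and `selu` are chosen for this example only.

```python
import math

# Fixed constants from the SELU definition (not tunable hyperparameters).
SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772

def selu(x: float) -> float:
    """Scaled ELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    return SELU_LAMBDA * (x if x > 0 else SELU_ALPHA * math.expm1(x))

for x in (-50.0, -1.0, 0.0, 1.0):
    print(x, selu(x))

# The output is bounded below by -lambda * alpha.
print(-SELU_LAMBDA * SELU_ALPHA)
```

For large negative inputs the output approaches -λ·α, roughly -1.7581, so negative activations stay bounded while positive ones grow linearly.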


# 10) GELU (Gaussian Error Linear Unit)
Formula (exact): f(x) = x·Φ(x), where Φ is Gaussian CDF. Approx: 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x^3)))
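Explanation: Smooth gate that weights x by Φ(x); can be read as a deterministic form of stochastic regularization; widely used in Transformers.
Domain: (-∞, ∞) → Range: approximately [-0.17, ∞) (minimum near x ≈ -0.75)
Derivative: f'(x) = Φ(x) + x·φ(x), where φ is the Gaussian pdf; smooth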

In [None]:
0.5 * x_np * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x_np + 0.044715 * (x_np ** 3))))  # NumPy GELU (approx)


In [None]:
tf.nn.gelu(x_tf, approximate=True)  # TensorFlow GELU (approximate)


In [None]:
torch.nn.functional.gelu(x_torch)  # PyTorch GELU (exact form by default; approximate="tanh" selects the tanh approximation)
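
The exact form and the tanh approximation can be compared directly, since the Gaussian CDF is Φ(x) = 0.5·(1 + erf(x/√2)). The pure-Python sketch below measures the largest gap on a grid over [-5, 5]; the function names and the grid spacing are choices made for this illustration.

```python
import math

def gelu_exact(x: float) -> float:
    """x * Phi(x), with the Gaussian CDF expressed through erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest absolute difference on a grid with step 0.01.
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(max_gap)
```

The gap is small relative to typical activation magnitudes, but it is nonzero, which matters when comparing outputs across frameworks that default to different forms.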


# 11) Softplus
Formula: f(x) = ln(1 + exp(x))
Explanation: Smooth approximation to ReLU; always positive.
Domain: (-∞, ∞) → Range: (0, ∞)
Derivative: f'(x) = sigmoid(x)

In [None]:
np.logaddexp(0.0, x_np)  # NumPy Softplus: log(exp(0) + exp(x)), avoids overflow for large x


In [None]:
tf.nn.softplus(x_tf)  # TensorFlow Softplus


In [None]:
torch.nn.functional.softplus(x_torch)  # PyTorch Softplus
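
Writing softplus as max(x, 0) + log(1 + exp(-|x|)) keeps the argument of `exp` non-positive, so it cannot overflow. The pure-Python sketch below contrasts this with the direct formula and checks that the derivative matches sigmoid; the helper name `softplus` is illustrative.

```python
import math

def softplus(x: float) -> float:
    """Stable softplus: max(x, 0) + log1p(exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# The direct formula overflows for large inputs; the stable form does not.
try:
    math.log1p(math.exp(1000.0))
except OverflowError:
    print("direct formula overflows at x = 1000")
print(softplus(1000.0))

# Central finite difference versus sigmoid(x).
h = 1e-6
for x in (-2.0, 0.0, 1.0):
    numeric = (softplus(x + h) - softplus(x - h)) / (2.0 * h)
    print(x, numeric, 1.0 / (1.0 + math.exp(-x)))
```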


# 12) Softsign
Formula: f(x) = x / (1 + |x|)
Explanation: Smooth squashing similar to tanh but with polynomial tails.
Domain: (-∞, ∞) → Range: (-1, 1)
Derivative: f'(x) = 1 / (1 + |x|)^2

In [None]:
x_np / (1.0 + np.abs(x_np))  # NumPy Softsign


In [None]:
tf.nn.softsign(x_tf)  # TensorFlow Softsign


In [None]:
torch.nn.functional.softsign(x_torch)  # PyTorch Softsign


# 13) Swish / SiLU
Formula: f(x) = x · sigmoid(x)
Explanation: Self-gated smooth nonlinearity; aka SiLU in TF/PyTorch.
Domain: (-∞, ∞) → Range: approximately [-0.278, ∞) (minimum near x ≈ -1.28)
Derivative: f'(x) = sigmoid(x) + x·sigmoid(x)·(1 - sigmoid(x))

In [None]:
x_np / (1.0 + np.exp(-x_np))  # NumPy Swish (x * sigmoid(x))


In [None]:
tf.nn.silu(x_tf)  # TensorFlow SiLU (Swish)


In [None]:
torch.nn.functional.silu(x_torch)  # PyTorch SiLU (Swish)
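
Swish is nonmonotonic: it dips below zero for negative inputs and has a single minimum. The pure-Python sketch below locates that minimum by bisection on the derivative; the bracket [-2, -1] and the helper names are assumptions made for this illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def swish(x: float) -> float:
    return x * sigmoid(x)

def swish_grad(x: float) -> float:
    """f'(x) = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

# The derivative is negative at -2 and positive at -1, so a root lies between.
lo, hi = -2.0, -1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if swish_grad(mid) < 0:
        lo = mid
    else:
        hi = mid
x_min = 0.5 * (lo + hi)
print(x_min, swish(x_min))
```

The minimum lands near x ≈ -1.28 with a value of about -0.278, so the output is bounded below even though it is unbounded above.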


# 14) Mish
Formula: f(x) = x · tanh(softplus(x))
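Explanation: Smooth and nonmonotonic (dips below zero for negative inputs); reported to outperform Swish on some tasks.
Domain: (-∞, ∞) → Range: approximately [-0.309, ∞) (minimum near x ≈ -1.19)
Derivative: f'(x) = tanh(sp) + x·sigmoid(x)·(1 - tanh(sp)^2), where sp = softplus(x)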

In [None]:
x_np * np.tanh(np.logaddexp(0.0, x_np))  # NumPy Mish (overflow-safe softplus, then tanh)


In [None]:
x_tf * tf.math.tanh(tf.nn.softplus(x_tf))  # TensorFlow Mish (composed ops)


In [None]:
torch.nn.functional.mish(x_torch)  # PyTorch Mish


# 15) Hard Sigmoid
Formula: f(x) = clip((x + 3) / 6, 0, 1)
Explanation: Piecewise-linear approx of sigmoid; cheap.
Domain: (-∞, ∞) → Range: [0, 1]
Derivative: 1/6 in linear region; 0 in saturated regions

In [None]:
np.clip((x_np + 3.0) / 6.0, 0.0, 1.0)  # NumPy Hard Sigmoid


In [None]:
tf.keras.activations.hard_sigmoid(x_tf)  # TensorFlow hard sigmoid (Keras 3 matches the formula above; older Keras used clip(0.2*x + 0.5, 0, 1))


In [None]:
torch.nn.functional.hardsigmoid(x_torch)  # PyTorch hard sigmoid
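
Two piecewise-linear definitions of hard sigmoid are in circulation: clip((x + 3)/6, 0, 1) and the steeper clip(0.2·x + 0.5, 0, 1) used by older Keras releases. The pure-Python sketch below prints both next to the true sigmoid so the difference is visible; the function names are illustrative.

```python
import math

def clip(v: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, v))

def hard_sigmoid(x: float) -> float:
    """Slope 1/6, saturating at x = -3 and x = 3."""
    return clip((x + 3.0) / 6.0, 0.0, 1.0)

def hard_sigmoid_slope02(x: float) -> float:
    """Slope 0.2, saturating at x = -2.5 and x = 2.5."""
    return clip(0.2 * x + 0.5, 0.0, 1.0)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, sigmoid(x), hard_sigmoid(x), hard_sigmoid_slope02(x))
```

Both variants agree at x = 0 and in the saturated regions but differ in between, so it is worth checking which one a given framework version implements before comparing results.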


# 16) Hard Swish
Formula: f(x) = x · hard_sigmoid(x) = x · clip((x + 3)/6, 0, 1)
Explanation: Efficient Swish approximation used in MobileNetV3.
Domain: (-∞, ∞) → Range: [-0.375, ∞) (minimum at x = -1.5)
Derivative: 0 for x<-3; (2x + 3)/6 for -3<x<3; 1 for x>3 (undefined at -3 and 3)

In [None]:
x_np * np.clip((x_np + 3.0) / 6.0, 0.0, 1.0)  # NumPy Hard Swish


In [None]:
x_tf * tf.nn.relu6(x_tf + 3.0) / 6.0  # TensorFlow hard swish via relu6 (composed ops)


In [None]:
torch.nn.functional.hardswish(x_torch)  # PyTorch hard swish
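
In the middle region -3 < x < 3 hard swish reduces to x·(x + 3)/6, whose derivative is (2x + 3)/6. The pure-Python sketch below writes out the piecewise derivative and checks it against a finite difference; the helper names are chosen for this example.

```python
def hard_swish(x: float) -> float:
    """x * clip((x + 3) / 6, 0, 1)."""
    return x * max(0.0, min(1.0, (x + 3.0) / 6.0))

def hard_swish_grad(x: float) -> float:
    """Piecewise derivative; at the kinks x = -3 and x = 3 this sketch uses the middle branch."""
    if x < -3.0:
        return 0.0
    if x > 3.0:
        return 1.0
    return (2.0 * x + 3.0) / 6.0

h = 1e-6
for x in (-4.0, -1.5, 1.0, 4.0):
    numeric = (hard_swish(x + h) - hard_swish(x - h)) / (2.0 * h)
    print(x, hard_swish(x), hard_swish_grad(x), numeric)
```

The derivative vanishes at x = -1.5, where the function reaches its minimum of -0.375.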


# 17) ReLU6
Formula: f(x) = min(max(0, x), 6)
Explanation: Clipped ReLU used in mobile networks to bound activations.
Domain: (-∞, ∞) → Range: [0, 6]
Derivative: 1 for 0<x<6; 0 otherwise (undefined at 0 and 6)

In [None]:
np.minimum(np.maximum(x_np, 0.0), 6.0)  # NumPy ReLU6


In [None]:
tf.nn.relu6(x_tf)  # TensorFlow ReLU6


In [None]:
torch.nn.functional.relu6(x_torch)  # PyTorch ReLU6


# 18) HardTanh
Formula: f(x) = -1 if x<-1; f(x) = x if -1<=x<=1; f(x) = 1 if x>1
Explanation: Clipped tanh approximation, piecewise linear.
Domain: (-∞, ∞) → Range: [-1, 1]
Derivative: 1 in linear region; 0 in saturated regions

In [None]:
np.clip(x_np, -1.0, 1.0)  # NumPy HardTanh (clamp)


In [None]:
tf.clip_by_value(x_tf, -1.0, 1.0)  # TensorFlow clamp to [-1, 1]


In [None]:
torch.nn.functional.hardtanh(x_torch, min_val=-1.0, max_val=1.0)  # PyTorch HardTanh


# 19) LogSigmoid
Formula: f(x) = log(1 / (1 + exp(-x))) = -softplus(-x)
Explanation: Numerically stable log of sigmoid; useful in noise-contrastive estimation (NCE) and binary log-likelihoods.
Domain: (-∞, ∞) → Range: (-∞, 0)
Derivative: f'(x) = 1 - sigmoid(x) = sigmoid(-x)

In [None]:
-np.logaddexp(0.0, -x_np)  # NumPy LogSigmoid as -softplus(-x) (logaddexp avoids overflow for very negative x)


In [None]:
tf.math.log_sigmoid(x_tf)  # TensorFlow LogSigmoid


In [None]:
torch.nn.functional.logsigmoid(x_torch)  # PyTorch LogSigmoid
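
As an example of the binary log-likelihood use case, the negative log-likelihood of a label y in {0, 1} given a logit z can be written with LogSigmoid alone: -(y·logσ(z) + (1 - y)·logσ(-z)). The pure-Python sketch below assumes that formulation; `log_sigmoid` and `bce_with_logits` are illustrative names rather than library functions.

```python
import math

def log_sigmoid(x: float) -> float:
    """Stable log(sigmoid(x)): min(x, 0) - log1p(exp(-|x|))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def bce_with_logits(z: float, y: float) -> float:
    """Negative Bernoulli log-likelihood for logit z and label y in {0, 1}."""
    return -(y * log_sigmoid(z) + (1.0 - y) * log_sigmoid(-z))

for z, y in ((0.0, 1.0), (3.0, 1.0), (3.0, 0.0), (-1000.0, 1.0)):
    print(z, y, bce_with_logits(z, y))
```

Because the logarithm is never applied to a value that has already underflowed to zero, the loss stays finite even for extreme logits.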


# 20) Softmax (multi-class output)
Formula: f_i(x) = exp(x_i) / Σ_j exp(x_j)
Explanation: Converts logits to probability distribution over classes.
Domain: ℝ^K → Range: simplex (nonnegative, sums to 1)
Derivative: Jacobian ∂p_i/∂x_j = p_i·(δ_{ij} - p_j), where p = softmax(x)

In [None]:
np.exp(x_np - np.max(x_np)) / np.sum(np.exp(x_np - np.max(x_np)))  # NumPy softmax over 1D vector (stable shift)


In [None]:
tf.nn.softmax(x_tf)  # TensorFlow Softmax over last dim (1D vector here)


In [None]:
torch.nn.functional.softmax(x_torch, dim=-1)  # PyTorch Softmax over last dim
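
The Jacobian entries p_i·(δ_ij - p_j) can be built explicitly for a small vector. The pure-Python sketch below does so for three hypothetical logits; the function names are chosen for this illustration, and in practice autodiff computes these products without materializing the matrix.

```python
import math

def softmax(xs):
    """Softmax with max-shift for numerical stability."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(xs):
    """J[i][j] = p_i * (delta_ij - p_j)."""
    p = softmax(xs)
    k = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(k)]
            for i in range(k)]

J = softmax_jacobian([1.0, 2.0, 3.0])
for row in J:
    print(row, sum(row))
```

Each row sums to zero (up to rounding) because the outputs always sum to 1, and the matrix is symmetric.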


# Notes and Tips
1. Numerical stability: prefer `log1p`/`logaddexp`, max-shift in softmax, and library functions (`tf.nn.*`, `torch.nn.functional.*`) over hand-written `exp` expressions.
2. Initialization & normalization: SELU's self-normalizing property assumes LeCun normal init and AlphaDropout; the ReLU family is usually paired with He init.
3. Output activations: Use sigmoid for binary outputs, softmax for multi-class logits, identity for regression.
4. Derivatives listed are for reference; autodiff frameworks compute gradients automatically in TF/PyTorch.
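
The two initializers named in tip 2 differ only in the variance they target: LeCun normal uses a standard deviation of sqrt(1/fan_in), He normal uses sqrt(2/fan_in). A minimal pure-Python sketch of those formulas follows; the function names and the layer width of 256 are assumptions for illustration, and framework initializers may additionally truncate the distribution.

```python
import math

def lecun_normal_std(fan_in: int) -> float:
    """Standard deviation for LeCun normal init: sqrt(1 / fan_in)."""
    return math.sqrt(1.0 / fan_in)

def he_normal_std(fan_in: int) -> float:
    """Standard deviation for He normal init: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

fan_in = 256  # hypothetical number of inputs to a layer
print(lecun_normal_std(fan_in), he_normal_std(fan_in))
```

The factor of 2 in He init compensates for ReLU zeroing out roughly half of its inputs.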
