```{contents}
```

## Activation choice

**1. Hidden layers**
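Avoid sigmoid and tanh in the hidden layers of deep networks. Both saturate: for inputs of large magnitude their outputs flatten and their derivatives approach zero, which causes **vanishing gradients** during backpropagation. Gradients shrink as they propagate backward, so the early layers barely update and training stalls.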

Use **ReLU or its variants** instead:

* ReLU
* Leaky ReLU
* PReLU
* ELU

These avoid vanishing gradients for positive inputs because their derivative there is exactly 1, so the gradient passes through the activation without shrinking.
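
As a rough illustration of the variants listed above, here is a minimal NumPy sketch. The function names and the default slopes (`alpha=0.01` for Leaky ReLU, `alpha=1.0` for ELU) are assumptions chosen for the example, not values prescribed by these notes. PReLU uses the same formula as Leaky ReLU, except that `alpha` is a parameter learned during training.

```python
import numpy as np

def relu(z):
    # max(0, z): passes positive inputs through, zeros out the rest
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Small fixed slope `alpha` for negative inputs (assumed default 0.01).
    # PReLU has the same form, but alpha is learned rather than fixed.
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    # alpha * (exp(z) - 1) for negative inputs; the clamp to <= 0
    # keeps expm1 from overflowing on large positive z
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def relu_grad(z):
    # Derivative is 1 for positive inputs and 0 otherwise
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    # Derivative is 1 for positive inputs and alpha otherwise
    return np.where(z > 0, 1.0, alpha)
```

For positive inputs every derivative here equals 1, which is the property the hidden-layer recommendation relies on.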

**2. Output layer**
The choice depends on the task type:

* **Binary classification (1 output neuron)** → use **sigmoid**
  Output lies in (0, 1) and is read as the probability of class 1.

* **Multi-class classification (more than 2 classes, one neuron per class)** → use **softmax**
  Converts outputs to probabilities that sum to 1.
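
The two output activations above can be sketched in NumPy as follows. The function names and the example logits are illustrative assumptions. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def sigmoid(z):
    # Maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Shift by the max so np.exp never sees large positive values
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    # Normalize so the outputs along the last axis sum to 1
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical raw scores (logits) for a 3-class problem
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)   # three probabilities that sum to 1
p_class1 = sigmoid(0.0)   # a single logit of 0 gives probability 0.5
```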

**3. Workflow inside each neuron**

1. Weighted sum:
     `z = w·x + b`
2. Apply activation:
     `a = activation(z)`

Weights and biases are initialized before training. Weight initialization methods will be discussed separately.
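
The two steps inside a neuron can be written as a short NumPy sketch. The helper name `neuron_forward` and the input, weight, and bias values are arbitrary assumptions for illustration; they stand in for whatever an initialization scheme would produce.

```python
import numpy as np

def neuron_forward(x, w, b, activation):
    # Step 1: weighted sum z = w·x + b
    z = np.dot(w, x) + b
    # Step 2: apply the activation, a = activation(z)
    return activation(z)

# Illustrative values only (not from any real initialization method)
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25, 0.1])
b = 0.2

# ReLU as the activation; here z is about 0.5, so a is about 0.5 too
a = neuron_forward(x, w, b, lambda z: np.maximum(0.0, z))
```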

**4. Depth and gradient impact**
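More layers mean a deeper network. By the chain rule, backpropagation multiplies in one activation derivative per layer. The sigmoid derivative never exceeds 0.25, and the tanh derivative never exceeds 1 and falls well below 1 once the unit saturates. Stacking many sigmoid/tanh layers therefore multiplies many small factors together and drives the gradient toward zero.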
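
To make the shrinkage concrete, the sketch below multiplies one activation derivative per layer. It is a deliberate simplification: it ignores the weight matrices and assumes every layer sees the same pre-activation value, and the helper `chained_gradient` is a hypothetical name used only here.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # exactly 1 for positive inputs

def chained_gradient(grad_fn, z, depth):
    # Product of `depth` identical activation derivatives
    # (simplifying assumption: weights are ignored, same z in every layer)
    return grad_fn(z) ** depth

for depth in (5, 10, 20):
    sig = chained_gradient(sigmoid_grad, 0.0, depth)  # best case for sigmoid
    rel = chained_gradient(relu_grad, 1.0, depth)     # active ReLU unit
    print(f"depth={depth:2d}  sigmoid={sig:.3e}  relu={rel:.1f}")
```

Even in the best case for sigmoid (z = 0, derivative 0.25), 20 layers give 0.25^20, roughly 9.1e-13, while the ReLU factor stays at 1 for an active unit.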

**5. Standard rule**

* Hidden layers → ReLU or variants
* Output layer → Sigmoid (binary) or Softmax (multi-class)


| Activation Function | Output Range      | Derivative Issue             | Pros                                           | Cons / Problems                         | Typical Use Case                    |
| ------------------- | ----------------- | ---------------------------- | ---------------------------------------------- | --------------------------------------- | ----------------------------------- |
| **Sigmoid**         | (0, 1)            | Vanishing gradient           | Smooth output, probability mapping             | Kills gradients in deep nets            | Output (binary classification)      |
| **Tanh**            | (-1, 1)           | Vanishing gradient           | Zero-centered, stronger gradients than sigmoid | Still vanishes in deep networks         | Rare in hidden layers today         |
| **ReLU**            | [0, ∞)            | Dead neurons (zero gradient) | Fast, simple, no vanishing for positives       | Neurons can get stuck outputting 0      | Hidden layers (default)             |
| **Leaky ReLU**      | (-∞, ∞)           | Reduced dead neurons         | Small nonzero slope for negative inputs        | Slope is fixed, not trainable           | Hidden layers                       |
| **PReLU**           | (-∞, ∞)           | Reduced dead neurons         | Learnable negative slope                       | Extra parameters, slightly more compute | Deep hidden layers                  |
| **ELU**             | (-1, ∞) for α = 1 | Saturates only for negatives | Negative outputs pull mean toward zero         | Exp computation is slower               | Hidden layers                       |
| **Softmax**         | (0, 1), sums to 1 | N/A (used only at output)    | Outputs class probabilities                    | Only for multi-class output             | Output (multi-class classification) |
