```{contents}
```


# Sigmoid Activation Function

### ✔ Formula

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
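
As a minimal sketch of the formula in code (the two-branch form is an assumption added here to avoid overflow for large negative inputs, not something these notes prescribe):

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^-x), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, e^-x can overflow, so use the equivalent form e^x / (1 + e^x).
    z = math.exp(x)
    return z / (1.0 + z)

# Illustrative inputs only; any real number is valid.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"sigmoid({x:+.1f}) = {sigmoid(x):.5f}")
```

Large negative inputs land near 0, large positive inputs land near 1, and `sigmoid(0.0)` is exactly 0.5.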

### ✔ Output Range

Sigmoid maps any real input to a value strictly between **0 and 1**.

This made sigmoid initially useful for:

* Small neural networks
* Binary classification
* Probabilistic interpretation (to some extent)

![alt text](../images/sigmoid.png)

---

## ✅ Derivative & Vanishing Gradient Problem

* The derivative of sigmoid lies between **0 and 0.25**, peaking at $x = 0$
* During **backpropagation**, the chain rule multiplies these small derivatives across layers
* In deep networks, the product shrinks drastically → **vanishing gradient problem**
* Result: **weights in early layers barely update**, slowing or halting learning
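
Where the 0.25 bound comes from can be sketched with a short derivation. The depth $n$ is an illustrative symbol, not something these notes define, and the bound covers only the activation derivatives (the weight factors in the chain rule are ignored here):

$$
\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr) \le \frac{1}{4}
$$

The maximum is reached at $x = 0$, where $\sigma(0) = 0.5$. Across $n$ sigmoid layers the product of these factors is therefore at most $(1/4)^n$; for $n = 10$ that is roughly $9.5 \times 10^{-7}$.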

---

## ✅ Drawbacks of Sigmoid

### ❌ 1. Vanishing Gradients

* Due to tiny derivatives (0 to 0.25), weight updates get smaller layer by layer.

### ❌ 2. Not Zero-Centered

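* The output is always positive (between 0 and 1), so activations are never centered on 0
* The next layer sees only positive inputs, so all weight gradients of a given neuron share the same sign, which causes **inefficient, zig-zagging weight updates** during optimization

Zero-centered activations help with: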

* Faster convergence
* Better gradient flow
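
The sign effect can be illustrated with a small sketch. The pre-activations and the upstream gradient `delta` are made-up values chosen for illustration, and `tanh` is used only as a zero-centered point of comparison:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical pre-activations of the previous layer (mixed signs).
pre_activations = [-2.0, -0.5, 0.3, 1.7]

sigmoid_inputs = [sigmoid(z) for z in pre_activations]  # all in (0, 1)
tanh_inputs = [math.tanh(z) for z in pre_activations]   # mixed signs

# Hypothetical upstream gradient dL/dz for one neuron in the next layer.
delta = -0.8

# For z = sum(w_i * a_i) + b, the weight gradient is dL/dw_i = delta * a_i.
grads_sigmoid = [delta * a for a in sigmoid_inputs]
grads_tanh = [delta * a for a in tanh_inputs]

print("sigmoid-fed gradients:", [round(g, 3) for g in grads_sigmoid])
print("tanh-fed gradients:   ", [round(g, 3) for g in grads_tanh])
```

With sigmoid inputs every weight gradient takes the sign of `delta`, so all weights of that neuron move in the same direction in a given step. With zero-centered inputs the signs can differ.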

### ❌ 3. Computationally Slow

* Uses an exponential operation, $e^{-x}$, which is **computationally expensive**

---

## ✅ Advantages of Sigmoid

* ✔ Smooth and continuous
* ✔ Output strictly between **0 and 1**
* ✔ Useful when the output is interpreted as a binary decision
* ✔ Prevents sharp jumps in output

---

## ✅ Key Disadvantages Recap

1. ❌ Prone to **vanishing gradients**
2. ❌ **Not zero-centered**
3. ❌ Uses costly **exponential operations**

Because of these issues, sigmoid is rarely preferred in **deep networks** today.

---

## Transition to Better Alternatives

Because of sigmoid's drawbacks, later activation functions such as **tanh** (and afterwards **ReLU**) were adopted to address these problems.

---

**Conclusion**

The sigmoid function was widely used in early deep learning models, but due to:

* Vanishing gradients
* Non-zero-centered output
* Computation inefficiency

…it has mostly been replaced by more effective activations like **tanh** and **ReLU**.

The next step in the lesson is understanding **tanh activation** — its benefits and limitations.



---

## What is the Sigmoid Activation Function?

The **sigmoid function** is defined as:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

During **forward propagation**, each neuron performs three steps:

1. Multiply the inputs by the weights
2. Add the bias
3. Apply the activation function (here, sigmoid)

Sigmoid **squashes values to the range (0, 1)** and was commonly used in early neural network models.
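
The three forward-propagation steps can be sketched for a single neuron as follows. The function name `neuron_forward` and the input, weight, and bias values are hypothetical, chosen only for illustration:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(inputs, weights, bias):
    """Forward pass of one neuron: weighted sum, bias, then sigmoid."""
    # Steps 1 and 2: multiply inputs by weights, sum them, and add the bias.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step 3: apply the activation function.
    return sigmoid(z)

# Illustrative values; here z = 0.2 - 0.3 - 0.4 + 0.1 = -0.4.
output = neuron_forward([0.5, -1.0, 2.0], [0.4, 0.3, -0.2], bias=0.1)
print(f"neuron output: {output:.4f}")
```

Whatever the weighted sum turns out to be, the output stays inside (0, 1).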

---

## Derivative of Sigmoid & Its Impact

* The derivative of the sigmoid function always lies between **0 and 0.25**
* This is important during **backpropagation**, when gradients are used to update weights
* Due to the chain rule, multiplying many small derivatives across layers causes the **vanishing gradient problem**

Example:

* Multiple small gradients get multiplied (e.g., 0.2 × 0.01 × 0.1…)
* Final gradient becomes extremely small
* Weight updates become negligible → learning stops

$$
w_{\text{new}} \approx w_{\text{old}}
$$

This prevents proper training in deep networks.
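
A rough numeric sketch of this effect follows. The per-layer pre-activations, the learning rate, and the starting weight are made-up values, and the weight factors that also appear in the chain rule are deliberately left out to isolate the sigmoid derivatives:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)  # never exceeds 0.25

# Hypothetical pre-activation values, one per layer on the backward path.
pre_activations = [0.0, 1.5, -2.0, 3.0, -0.5, 2.5, 0.8, -3.5]

gradient = 1.0
for depth, z in enumerate(pre_activations, start=1):
    gradient *= sigmoid_derivative(z)
    print(f"after {depth} layers: gradient factor = {gradient:.2e}")

# Gradient descent update with an assumed learning rate and starting weight.
learning_rate = 0.1
w_old = 0.5
w_new = w_old - learning_rate * gradient
print(f"w_old = {w_old}, w_new = {w_new}")
```

After eight layers the factor is no larger than 0.25 to the power 8, so `w_new` is almost indistinguishable from `w_old`.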

---

## Main Disadvantages of Sigmoid

### ❌ 1. Vanishing Gradient Problem

* Small derivatives (0 to 0.25) lead to tiny weight updates during backpropagation.

### ❌ 2. Not Zero-Centered

* Output lies between **0 and 1**, so it is not symmetric around zero
* Makes gradient descent less efficient, because weight updates tend to zig-zag instead of moving directly toward the minimum

✅ Zero-centered functions help optimize weights faster
(e.g., tanh outputs from −1 to 1)
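
To make the contrast concrete, the sketch below averages both activations over inputs that are symmetric around zero (the input grid is an assumption chosen for illustration):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Inputs from -3.0 to 3.0 in steps of 0.1, symmetric around zero.
inputs = [i / 10.0 for i in range(-30, 31)]

sigmoid_mean = sum(sigmoid(x) for x in inputs) / len(inputs)
tanh_mean = sum(math.tanh(x) for x in inputs) / len(inputs)

print(f"mean sigmoid output: {sigmoid_mean:.3f}")
print(f"mean tanh output:    {tanh_mean:.3f}")
```

Even with perfectly balanced inputs, the sigmoid outputs average about 0.5, while the tanh outputs average about 0.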

### ❌ 3. Computationally Expensive

* Sigmoid uses **exponential operations (e⁻ˣ)**
* Slower on processors compared to simpler functions like ReLU

---

## Advantages of Sigmoid

* ✔ Works well for **binary classification**
* ✔ Outputs lie between **0 and 1**, so they can be read as probabilities
* ✔ Smooth and differentiable function

However, it's only practical in:

* Output layers of binary classification models
* Small or shallow neural networks
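
For the output-layer use case, a minimal sketch might look like this. The helper `predict` and the 0.5 decision threshold are assumptions for illustration, not part of these notes:

```python
import math

def sigmoid(x: float) -> float:
    # Two-branch form to avoid overflow for large negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def predict(logit: float, threshold: float = 0.5):
    """Turn a raw output-layer score into a probability and a 0/1 label."""
    probability = sigmoid(logit)
    label = 1 if probability >= threshold else 0
    return probability, label

# Hypothetical raw scores from the final layer.
for logit in (-3.0, -0.2, 0.0, 2.0):
    probability, label = predict(logit)
    print(f"logit={logit:+.1f} -> p={probability:.3f}, class={label}")
```

A positive score maps to a probability above 0.5 and class 1; a negative score maps below 0.5 and class 0.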

---

## Why Move Beyond Sigmoid?

Because of its:

* Vanishing gradient issues
* Non-zero-centered output
* Computational inefficiency

Researchers introduced better activation functions like:

* **tanh** (fixes the zero-centering issue, though its gradients can still vanish)
* **ReLU** and its variants (mitigate vanishing gradients, since the derivative is 1 for positive inputs)

---

**Conclusion**

Sigmoid was historically important but is **not ideal for deep networks** because:

* It slows or stops learning
* It doesn’t update weights efficiently
* It adds unnecessary computation

It is now mostly used:

* ✅ In **output layers** for binary classification
* ❌ Not in **hidden layers** of deep neural networks

Next, we move to **tanh activation function** as an improvement.
