# **Sigmoid Activation Function - Study Notes**

## **Introduction**
- The **Sigmoid activation function** is commonly used in neural networks, especially for **binary classification**.
- It maps input values to a range between **0 and 1**, providing a probability-like output.

---
## **Graph of the Sigmoid Function**

The graph of the sigmoid function looks like this:

![Sigmoid Function Graph](https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg)

---
## **Definition & Formula**
The **Sigmoid function** is defined as:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

- The function maps any real input (from negative infinity to positive infinity) to an output strictly between 0 and 1.
- As x approaches negative infinity (x → -∞), the output approaches **0**.
- As x approaches positive infinity (x → +∞), the output approaches **1**.
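
The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation. The function name `sigmoid` and the branch on the sign of `x` are choices made here: the branch avoids overflow in `math.exp` for large negative inputs.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, e^(-x) can overflow, so use the equivalent form
    # e^x / (1 + e^x), which only ever exponentiates a negative number.
    z = math.exp(x)
    return z / (1.0 + z)

# Outputs move toward 0 for very negative x and toward 1 for very positive x.
for x in (-10.0, 0.0, 10.0):
    print(f"sigmoid({x:+.1f}) = {sigmoid(x):.6f}")
```

At `x = 0` the function returns exactly 0.5, the midpoint of the curve.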


---

## **Comparison with Step Activation Function**
- The **step function** outputs only **0 or 1**, while **Sigmoid outputs values between 0 and 1**.
- The Sigmoid function helps represent **probabilities**, making it useful in classification tasks.
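
To make the contrast concrete, the sketch below evaluates both activations on a few inputs around zero. The threshold of 0 for the step function and the sample inputs are assumptions made for illustration.

```python
import math

def step(x: float) -> int:
    """Step activation: 1 if x >= 0, else 0 (threshold at 0 is an assumption)."""
    return 1 if x >= 0 else 0

def sigmoid(x: float) -> float:
    """Logistic sigmoid for moderate inputs."""
    return 1.0 / (1.0 + math.exp(-x))

# The step output jumps from 0 to 1, while the sigmoid output changes gradually.
for x in (-2.0, -0.1, 0.1, 2.0):
    print(f"x={x:+.1f}  step={step(x)}  sigmoid={sigmoid(x):.3f}")
```

Near zero the step function flips abruptly between 0 and 1, while the sigmoid stays close to 0.5, reflecting low confidence in either class.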

---

## **Example Calculation**
Given weighted sum values: **1, 2, 3**  

Applying the **sigmoid function**:

$$
\sigma(1) = \frac{1}{1 + e^{-1}} \approx 0.73
$$

$$
\sigma(2) = \frac{1}{1 + e^{-2}} \approx 0.88
$$

$$
\sigma(3) = \frac{1}{1 + e^{-3}} \approx 0.95
$$

Computing the weighted sum of these outputs, reusing 1, 2, 3 as the weights:

$$
(1 \times 0.73) + (2 \times 0.88) + (3 \times 0.95) = 5.34
$$

Applying **sigmoid(5.34)**:

$$
\sigma(5.34) = \frac{1}{1 + e^{-5.34}} \approx 0.995
$$

- The output is **close to 1** but **not exactly 1**.
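
The calculation above can be checked with a short script. It is a sketch that follows the notes' convention of rounding each activation to two decimals and reusing the values 1, 2, 3 as weights. The variable names are illustrative.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid for moderate inputs."""
    return 1.0 / (1.0 + math.exp(-x))

values = [1.0, 2.0, 3.0]

# Round to two decimals to match the hand calculation in the notes.
activations = [round(sigmoid(v), 2) for v in values]

# The same values 1, 2, 3 are reused as weights, as in the worked example.
weighted_sum = sum(w * a for w, a in zip(values, activations))
output = sigmoid(weighted_sum)

print("activations:", activations)
print(f"weighted sum: {weighted_sum:.2f}")
print(f"final output: {output:.4f}")
```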

---

## **Advantages of Sigmoid**
1. **Smooth and differentiable** – Allows gradient-based optimization (backpropagation).
2. **Probability-like output** – Useful for classification tasks.
3. **Monotonic function** – As input increases, the output increases.

---

## **Drawbacks of Sigmoid**
### 1. **Vanishing Gradient Problem**
   - For large positive or large negative inputs (roughly **|x| > 5**), the output **saturates** near 1 or 0.
   - In these regions the **gradient (derivative) becomes very small**; even at its maximum (at x = 0) it is only 0.25.
   - In **backpropagation**, small gradients lead to **minimal weight updates**, so training slows down or **stagnates**.
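
The saturation effect can be seen numerically through the standard derivative identity σ'(x) = σ(x)(1 − σ(x)). The sketch below evaluates it at a few sample points. The helper name `sigmoid_grad` and the chosen inputs are illustrative.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid for moderate inputs."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and shrinks quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.6f}")
```

Because backpropagation multiplies one such factor per sigmoid layer, even the best case of 0.25 per layer shrinks to about 0.25^10 ≈ 9.5e-7 across ten layers (ignoring the weight factors).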
   
### 2. **Not Zero-Centered**
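   - Sigmoid outputs are always **positive** (between 0 and 1), so they are not centered around zero.
   - When every input to a neuron is positive, the gradients for all of its weights share the same sign. This **restricts** the direction of weight updates during training and makes learning inefficient.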
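
One way to see this restriction: for a neuron with pre-activation z = Σ wᵢaᵢ, the gradient for each weight is ∂L/∂wᵢ = (∂L/∂z) · aᵢ. If every aᵢ is a sigmoid output, all of these gradients take the sign of ∂L/∂z. The sketch below uses made-up pre-activations and upstream gradients purely for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid for moderate inputs."""
    return 1.0 / (1.0 + math.exp(-x))

def weight_grads(upstream: float, inputs: list[float]) -> list[float]:
    """Gradient per weight: dL/dw_i = dL/dz * a_i."""
    return [upstream * a for a in inputs]

# Hypothetical pre-activations of the previous layer (mixed signs).
prev_pre_activations = [-3.0, -0.5, 0.0, 2.0]
# After the sigmoid, every input to the next neuron is positive.
inputs = [sigmoid(z) for z in prev_pre_activations]

# Hypothetical upstream gradients dL/dz, one positive and one negative.
for upstream in (0.8, -0.8):
    grads = weight_grads(upstream, inputs)
    signs = ["+" if g > 0 else "-" for g in grads]
    print(f"dL/dz={upstream:+.1f}  gradient signs: {signs}")
```

In each case all weights are pushed in the same direction, so reaching a target that needs some weights to increase and others to decrease takes a zig-zag path.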

---

## **Key Takeaways**
- **Sigmoid is useful** for binary classification but has **limitations in deep networks** due to vanishing gradients.
- **Common alternatives** for hidden layers include **ReLU** (Rectified Linear Unit), which does not saturate for positive inputs and so mitigates the vanishing gradient problem.
- Understanding sigmoid helps in grasping **how neural networks make probabilistic decisions**.
