# Logistic Regression - Mathematical Derivations

This notebook contains **complete mathematical derivations** for logistic regression.

By the end, you'll understand:
1. Properties of the sigmoid function
2. Why we use log loss (cross-entropy)
3. How to derive the gradient
4. Why the gradient has the same form as linear regression

---

## 1. Sigmoid Function Derivation

### Definition:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

---

### Property 1: Range is (0, 1)

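Since $e^{-z} > 0$ for every real $z$, the denominator $1 + e^{-z}$ is always greater than 1, so $0 < \sigma(z) < 1$. The bounds are approached in the limit but never reached:

**When $z \to +\infty$**:
$$e^{-z} \to 0$$
$$\sigma(z) \to \frac{1}{1 + 0} = 1$$

**When $z \to -\infty$**:
$$e^{-z} \to +\infty$$
$$\sigma(z) \to 0$$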

**When $z = 0$**:
$$\sigma(0) = \frac{1}{1 + e^0} = \frac{1}{2}$$
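
The three cases above can be checked numerically. The following is a minimal sketch, not a reference implementation: the function name `sigmoid` and the branch on the sign of `z` are choices made here (the branch avoids overflow in `exp` for large negative `z` by using the equivalent form $e^{z} / (1 + e^{z})$).

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, written so that exp() never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equivalent form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Values approach 0 on the left, equal 1/2 at z = 0, and approach 1 on the right.
for z in (-20.0, -2.0, 0.0, 2.0, 20.0):
    print(f"sigma({z:+.0f}) = {sigmoid(z):.10f}")
```

Mathematically the range is the open interval $(0, 1)$, but in floating point the result rounds to exactly `0.0` or `1.0` once $|z|$ is large enough.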

---

### Property 2: Symmetry

**Claim**: $\sigma(-z) = 1 - \sigma(z)$

**Proof**:

$$\sigma(-z) = \frac{1}{1 + e^{-(-z)}} = \frac{1}{1 + e^{z}}$$

Multiply numerator and denominator by $e^{-z}$:

$$= \frac{e^{-z}}{e^{-z}(1 + e^{z})} = \frac{e^{-z}}{e^{-z} + 1}$$

Now compute $1 - \sigma(z)$:

$$1 - \sigma(z) = 1 - \frac{1}{1 + e^{-z}} = \frac{1 + e^{-z} - 1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}}$$

Therefore: $\sigma(-z) = 1 - \sigma(z)$ ✓
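
As a quick sanity check of the symmetry property, the sketch below compares both sides at a few arbitrarily chosen points. The test values and tolerances are assumptions made for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

for z in (-3.0, -0.5, 0.0, 1.25, 4.0):
    lhs = sigmoid(-z)          # sigma(-z)
    rhs = 1.0 - sigmoid(z)     # 1 - sigma(z)
    print(f"z={z:+.2f}  sigma(-z)={lhs:.12f}  1-sigma(z)={rhs:.12f}")
    # The two sides should agree up to floating-point rounding.
    assert math.isclose(lhs, rhs, rel_tol=1e-12, abs_tol=1e-15)
```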

---

### Property 3: Derivative

**Claim**: $\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z))$

**Proof**:

$$\frac{d\sigma}{dz} = \frac{d}{dz} \left( \frac{1}{1 + e^{-z}} \right)$$

Rewrite as: $(1 + e^{-z})^{-1}$

Use chain rule:

$$= -(1 + e^{-z})^{-2} \cdot \frac{d}{dz}(1 + e^{-z})$$

$$= -(1 + e^{-z})^{-2} \cdot (-e^{-z})$$

$$= \frac{e^{-z}}{(1 + e^{-z})^2}$$

**Simplify**:

$$= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}$$

$$= \frac{1}{1 + e^{-z}} \cdot \left( \frac{1 + e^{-z} - 1}{1 + e^{-z}} \right)$$

$$= \sigma(z) \cdot (1 - \sigma(z))$$

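**Why this matters**: Once $\sigma(z)$ has been computed in the forward pass, its derivative costs only one subtraction and one multiplication. This keeps gradient computation (including backpropagation in neural networks) cheap.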
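
The derivative identity can be verified against a central finite difference. This is an illustrative sketch: the helper names `sigmoid_prime` and `central_difference` and the step size `eps` are assumptions, not part of the derivation.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z: float) -> float:
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(f, z: float, eps: float = 1e-6) -> float:
    """Numerical derivative (f(z + eps) - f(z - eps)) / (2 * eps)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    analytic = sigmoid_prime(z)
    numeric = central_difference(sigmoid, z)
    print(f"z={z:+.1f}  analytic={analytic:.10f}  numeric={numeric:.10f}")
```

The two columns should agree to many decimal places, and the derivative peaks at $z = 0$ with value $\frac{1}{4}$.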

---

## 2. Log Loss (Binary Cross-Entropy) Derivation

### Why Log Loss?

We want a cost function that:
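1. Penalizes confident wrong predictions heavily (the loss grows without bound as the predicted probability of the true class approaches 0)
2. Is convex in $\theta$ (so any local minimum is a global minimum)
3. Has a simple gradient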

---

### Maximum Likelihood Derivation

Assume examples are i.i.d. (independent and identically distributed).

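Interpret $h_\theta(x)$ as the model's estimate of $P(y = 1 \mid x; \theta)$. For a single example, the probability of the observed label is then: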

$$P(y | x; \theta) = \begin{cases} 
h_\theta(x) & \text{if } y = 1 \\
1 - h_\theta(x) & \text{if } y = 0
\end{cases}$$

**Compact form**:

$$P(y | x; \theta) = h_\theta(x)^y (1 - h_\theta(x))^{1-y}$$

Check: When $y=1$: $(1-h)^{1-1} = 1$, so we get $h$. When $y=0$: $h^0 = 1$, so we get $1-h$. ✓

**Likelihood** (probability of all data):

$$L(\theta) = \prod_{i=1}^{m} P(y^{(i)} | x^{(i)}; \theta)$$

$$= \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1-y^{(i)}}$$

**Log-Likelihood** (easier to work with):

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right]$$

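We want to **maximize** the likelihood. Because $\log$ is monotonically increasing, this is the same as maximizing $\ell(\theta)$, or equivalently **minimizing** the negative log-likelihood. Dividing by $m$ rescales the objective without changing the minimizer: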

$$J(\theta) = -\frac{1}{m} \ell(\theta)$$

$$\boxed{J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right]}$$

This is the **log loss** (binary cross-entropy)!
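
To make the link between likelihood and log loss concrete, the sketch below computes both on a tiny made-up dataset and confirms that $J = -\frac{1}{m} \log L$. The arrays `y` and `h` are hypothetical values, and the clipping constant `eps` is an assumption added here to avoid `log(0)`; it is not part of the derivation.

```python
import numpy as np

def log_loss(h, y, eps=1e-12):
    """Binary cross-entropy J = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    h = np.clip(h, eps, 1.0 - eps)  # assumption: guard against log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Hypothetical labels and predicted probabilities P(y = 1 | x).
y = np.array([1, 0, 1, 1, 0])
h = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Likelihood of the whole dataset using the compact form h^y * (1-h)^(1-y).
likelihood = np.prod(h**y * (1 - h) ** (1 - y))

print("log loss:              ", log_loss(h, y))
print("-(1/m) log likelihood: ", -np.log(likelihood) / len(y))

# A confident wrong prediction costs far more than a mildly wrong one.
print("loss for h=0.01, y=1:", log_loss(np.array([0.01]), np.array([1])))
print("loss for h=0.40, y=1:", log_loss(np.array([0.40]), np.array([1])))
```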

---

## 3. Gradient Derivation

### Goal: Compute $\frac{\partial J}{\partial \theta_j}$

Recall:
- $h_\theta(x) = \sigma(\theta^T x)$
- $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h^{(i)} + (1-y^{(i)}) \log(1-h^{(i)}) \right]$

Where $h^{(i)} = h_\theta(x^{(i)})$ for brevity.

---

### Step 1: Gradient of a single example

$$\frac{\partial}{\partial \theta_j} \left[ -y \log h - (1-y) \log(1-h) \right]$$

**Use chain rule**:

$$= -y \frac{1}{h} \frac{\partial h}{\partial \theta_j} - (1-y) \frac{1}{1-h} \cdot \frac{\partial (1-h)}{\partial \theta_j}$$

$$= -y \frac{1}{h} \frac{\partial h}{\partial \theta_j} - (1-y) \frac{1}{1-h} \cdot \left( -\frac{\partial h}{\partial \theta_j} \right)$$

$$= -y \frac{1}{h} \frac{\partial h}{\partial \theta_j} + (1-y) \frac{1}{1-h} \frac{\partial h}{\partial \theta_j}$$

**Factor out** $\frac{\partial h}{\partial \theta_j}$:

$$= \left( \frac{1-y}{1-h} - \frac{y}{h} \right) \frac{\partial h}{\partial \theta_j}$$

**Combine fractions**:

$$= \frac{h(1-y) - y(1-h)}{h(1-h)} \frac{\partial h}{\partial \theta_j}$$

$$= \frac{h - hy - y + yh}{h(1-h)} \frac{\partial h}{\partial \theta_j}$$

$$= \frac{h - y}{h(1-h)} \frac{\partial h}{\partial \theta_j}$$

---

### Step 2: Compute $\frac{\partial h}{\partial \theta_j}$

Recall: $h = \sigma(\theta^T x) = \sigma(z)$ where $z = \theta^T x$

**Use chain rule**:

$$\frac{\partial h}{\partial \theta_j} = \frac{\partial \sigma(z)}{\partial z} \cdot \frac{\partial z}{\partial \theta_j}$$

We proved earlier: $\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z))$

And: $\frac{\partial z}{\partial \theta_j} = \frac{\partial (\theta^T x)}{\partial \theta_j} = x_j$

Therefore:

$$\frac{\partial h}{\partial \theta_j} = \sigma(z)(1-\sigma(z)) \cdot x_j = h(1-h) x_j$$
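
The result $\frac{\partial h}{\partial \theta_j} = h(1-h) x_j$ can be checked by perturbing one parameter at a time. This is a sketch under assumed values: `theta` and `x` are hypothetical, with the first feature fixed at 1 to play the role of an intercept.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.0, 2.0])  # hypothetical parameters
x = np.array([1.0, 0.3, -0.7])      # hypothetical example (first entry acts as intercept)

h = sigmoid(theta @ x)
analytic = h * (1.0 - h) * x        # dh/dtheta_j = h (1 - h) x_j for every j at once

# Central finite difference, one coordinate of theta at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.size):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (sigmoid((theta + step) @ x) - sigmoid((theta - step) @ x)) / (2 * eps)

print("analytic:", analytic)
print("numeric: ", numeric)
```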

---

### Step 3: Combine

$$\frac{\partial}{\partial \theta_j} \left[ -y \log h - (1-y) \log(1-h) \right] = \frac{h - y}{h(1-h)} \cdot h(1-h) x_j$$

**The $h(1-h)$ terms cancel!**

$$= (h - y) x_j$$

---

### Step 4: Full gradient

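Since $J(\theta)$ is the average of the per-example losses and differentiation is linear, the full gradient is the average of the per-example gradients:

$$\boxed{\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}}$$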

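**In vectorized form**, where $X$ is the $m \times n$ matrix whose $i$-th row is $(x^{(i)})^T$, $y$ is the vector of labels, and $\sigma$ is applied elementwise: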

$$\boxed{\nabla J(\theta) = \frac{1}{m} X^T (\sigma(X\theta) - y)}$$
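
A common way to validate a derived gradient is to compare it with finite differences of the cost. The sketch below does this for the vectorized formula. The data are random and purely illustrative, and the function names `cost` and `gradient` are choices made here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Log loss J(theta) for design matrix X (one example per row) and labels y."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """Vectorized gradient (1/m) X^T (sigma(X theta) - y)."""
    m = X.shape[0]
    return X.T @ (sigmoid(X @ theta) - y) / m

# Illustrative random data: 50 examples, intercept column plus 2 features.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = (rng.random(50) < 0.5).astype(float)
theta = rng.normal(size=3)

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])

print("analytic:", gradient(theta, X, y))
print("numeric: ", numeric)
print("max abs difference:", np.max(np.abs(gradient(theta, X, y) - numeric)))
```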

---

### Observation

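This has the **same form** as the gradient for linear regression with the squared-error cost $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (\theta^T x^{(i)} - y^{(i)})^2$!

**Linear Regression**: $\nabla J = \frac{1}{m} X^T (X\theta - y)$

**Logistic Regression**: $\nabla J = \frac{1}{m} X^T (\sigma(X\theta) - y)$

The only difference: the prediction is $\sigma(X\theta)$ instead of $X\theta$.

**Why?** The sigmoid derivative contributes a factor $h(1-h)$ that **cancels** the $h(1-h)$ in the denominator of the log loss derivative, leaving $(h - y)\, x_j$ for each example.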
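
The shared structure can be expressed as a single function that takes the prediction rule as an argument. This is only a sketch: the parameter name `predict` and the small dataset are hypothetical, and the linear regression case assumes the $\frac{1}{2m}$ squared-error cost.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y, predict):
    """(1/m) X^T (predict(X theta) - y); only `predict` differs between the two models."""
    m = X.shape[0]
    return X.T @ (predict(X @ theta) - y) / m

# Hypothetical data: intercept column plus one feature.
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.1, -0.2])

grad_linear = gradient(theta, X, y, predict=lambda z: z)  # identity: linear regression
grad_logistic = gradient(theta, X, y, predict=sigmoid)    # sigmoid: logistic regression

print("linear regression gradient:  ", grad_linear)
print("logistic regression gradient:", grad_logistic)
```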

---

## 4. Gradient Descent Update Rule

$$\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j}$$

$$\boxed{\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}}$$

**Vectorized**:

$$\boxed{\theta := \theta - \alpha \frac{1}{m} X^T (\sigma(X\theta) - y)}$$
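
The vectorized update translates almost line for line into NumPy. The sketch below is a minimal batch gradient descent loop under several assumptions: the learning rate `alpha=0.1`, the iteration count, the zero initialization, the clipping inside `cost`, and the synthetic data (generated from a made-up `true_theta`) are all illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    h = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)  # assumption: avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Repeat theta := theta - alpha * (1/m) X^T (sigma(X theta) - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta = theta - alpha * grad
        history.append(cost(theta, X, y))
    return theta, history

# Synthetic data: labels sampled from a logistic model with made-up parameters.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_theta = np.array([-0.5, 2.0])
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)

theta, history = gradient_descent(X, y)
print("learned theta:", theta)
print("cost after first step:", history[0])
print("cost after last step: ", history[-1])
```

The cost should decrease across iterations. If it oscillates or grows, the learning rate is likely too large for the scale of the features.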

---

## 5. Why Log Loss is Convex

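**Claim**: $J(\theta)$ is convex in $\theta$ when using sigmoid + log loss.

**Proof sketch**:
1. With $z = \theta^T x$, the per-example loss is $-\log \sigma(z) = \log(1 + e^{-z})$ when $y = 1$ and $-\log(1 - \sigma(z)) = \log(1 + e^{z})$ when $y = 0$
2. Both are convex in $z$: each has second derivative $\sigma(z)(1 - \sigma(z)) > 0$
3. $z = \theta^T x$ is linear in $\theta$, a convex function of a linear function is convex, and an average of convex functions is convex, so $J(\theta)$ is convex

Note that the sigmoid by itself is neither convex nor concave (it is convex for $z < 0$ and concave for $z > 0$). The convexity comes from composing it with the negative logarithm.

**Why it matters**: Every local minimum of $J$ is a global minimum, so gradient descent with a suitably small learning rate cannot get stuck in a spurious local minimum! One caveat: if the classes are perfectly linearly separable, the cost keeps decreasing as $\|\theta\| \to \infty$, so no finite minimizer exists without regularization.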
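
Convexity can also be probed numerically. The sketch below uses two checks on random, illustrative data. First, the Hessian of the log loss is $\nabla^2 J = \frac{1}{m} X^T D X$ with $D = \text{diag}\big(h^{(i)}(1 - h^{(i)})\big)$ (a standard result that is not derived in this notebook), and its eigenvalues should be non-negative. Second, the midpoint inequality $J\big(\tfrac{a+b}{2}\big) \le \tfrac{1}{2}\big(J(a) + J(b)\big)$ should hold for any pair of parameter vectors. A numerical check like this supports the claim but is not a proof.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def hessian(theta, X):
    """(1/m) X^T D X with D = diag(h * (1 - h))."""
    h = sigmoid(X @ theta)
    w = h * (1.0 - h)                    # one weight per example
    return (X.T * w) @ X / X.shape[0]    # broadcasting scales each example's column of X^T

# Illustrative random data: 40 examples, intercept column plus 2 features.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = (rng.random(40) < 0.5).astype(float)

# Check 1: smallest Hessian eigenvalue over 100 random parameter vectors.
min_eig = min(
    np.linalg.eigvalsh(hessian(rng.normal(size=3), X)).min() for _ in range(100)
)

# Check 2: count violations of the midpoint inequality over 1000 random pairs.
violations = 0
for _ in range(1000):
    a, b = rng.normal(size=3), rng.normal(size=3)
    if cost((a + b) / 2, X, y) > (cost(a, X, y) + cost(b, X, y)) / 2 + 1e-12:
        violations += 1

print("smallest Hessian eigenvalue seen:", min_eig)
print("midpoint inequality violations:  ", violations)
```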

---

## Summary

### Key Formulas:

1. **Sigmoid**: $\sigma(z) = \frac{1}{1 + e^{-z}}$

2. **Hypothesis**: $h_\theta(x) = \sigma(\theta^T x)$

3. **Cost (Log Loss)**: $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right]$

4. **Gradient**: $\nabla J(\theta) = \frac{1}{m} X^T (\sigma(X\theta) - y)$

5. **Update**: $\theta := \theta - \alpha \nabla J(\theta)$

### Beautiful Properties:

- Sigmoid derivative: $\sigma'(z) = \sigma(z)(1-\sigma(z))$ (elegant!)
- Gradient has same form as linear regression (sigmoid derivative cancels!)
- Cost function is convex (every local minimum is a global minimum)
- Derived from maximum likelihood (probabilistically sound)

---

**Next**: Implement these equations in `logistic_regression_from_scratch.ipynb`!