# Logistic Regression: Theory and Math

This notebook consolidates the mathematical foundation of Logistic Regression, inspired by the *Machine Learning Specialization* by Andrew Ng.  
It covers the intuition, formulas, derivations, gradient descent updates, and links to the corresponding NumPy implementation.

## 1️⃣ Introduction & Intuition

- Logistic Regression is a supervised learning algorithm for **classification** problems.  
- Instead of predicting a continuous value (like Linear Regression), it predicts a **probability** that an input belongs to a class.  
- The output always lies between 0 and 1: the **sigmoid (logistic) function** is used for binary classification, and the **softmax function** for multiclass classification.  

**Types of Logistic Regression:**

| Type | Features | Classes | Activation Function | Notes |
|------|----------|--------|-------------------|-------|
| **Binary Logistic Regression** | 1 or more | 2 | Sigmoid | Predicts probability of one class; output interpreted as $P(y=1)$ |
| **Multiclass (Multinomial) Logistic Regression** | 1 or more | K ≥ 3 | Softmax | Predicts probability distribution over K classes; outputs sum to 1 |

**Examples (Binary):**
- Predicting if a tumor is malignant (1) or benign (0)  
- Predicting if an email is spam (1) or not spam (0)

**Examples (Multiclass):**
- Predicting handwritten digits (0–9)  
- Classifying types of flowers (e.g., iris species)

---

Logistic regression models the probability of class membership by fitting the **sigmoid curve** for binary outcomes or the **softmax function** for multiclass outcomes. The goal is to predict probabilities that align with actual labels while minimizing the **cross-entropy (log-loss) cost**.


![Binary Logistic Regression Plot](images/logistic_regression/binary_logistic_regression_plot.png)

![Multiclass Logistic Regression Plot](images/logistic_regression/multiclass_logistic_regression_plot.png)

## 2️⃣ Notations

A reference table for symbols used in **binary and multiclass logistic regression** derivations, grouped by concept.

---

### 🟢 Inputs & Outputs
| Symbol | Type | Meaning | Example / Notes |
|--------|------|---------|----------------|
| $x^{(i)}$ | Scalar | Feature value of sample $i$ | Single feature input for example $i$ |
| $\mathbf{x}^{(i)} \in \mathbb{R}^n$ | Vector | Feature vector of sample $i$ | $[x_1^{(i)}, x_2^{(i)}, ..., x_n^{(i)}]^T$ |
| $X \in \mathbb{R}^{m \times n}$ | Matrix | Feature matrix for all $m$ samples | Rows = samples, cols = features |
| $y^{(i)}$ | Scalar | Target label of sample $i$ | $y^{(i)} \in \{0,1\}$ (binary) or $\{1,2,...,K\}$ (multiclass) |
| $Y \in \mathbb{R}^{m \times K}$ | Matrix | One-hot encoded label matrix | Row $i$ corresponds to sample $i$ |
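
As a minimal sketch of how integer labels map to the one-hot matrix $Y$, the snippet below assumes zero-based labels ($0, \dots, K-1$), which is the usual NumPy convention rather than the $1, \dots, K$ shown in the table. The helper name `one_hot` and the sample labels are illustrative only.

```python
import numpy as np

def one_hot(y, num_classes):
    """Return an (m, K) one-hot matrix for zero-based integer labels y."""
    y = np.asarray(y, dtype=int)
    Y = np.zeros((y.shape[0], num_classes))
    Y[np.arange(y.shape[0]), y] = 1.0  # exactly one entry per row is set
    return Y

labels = np.array([0, 2, 1, 2])      # hypothetical labels: m = 4 samples, K = 3 classes
Y = one_hot(labels, num_classes=3)   # row i is the one-hot encoding of labels[i]
```

Each row of `Y` sums to 1, which is what lets the multiclass cross-entropy pick out the log-probability of the true class.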


### 🟡 Linear Combination (Logits)
| Symbol | Type | Meaning | Example / Notes |
|--------|------|---------|----------------|
| $z^{(i)}$ | Scalar | Logit (binary) | $z^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b$ |
| $z_c^{(i)}$ | Scalar | Logit (score) for class $c$ (sample $i$) | $z_c^{(i)} = \mathbf{w}_c^\top \mathbf{x}^{(i)} + b_c$ |
| $\mathbf{z}^{(i)} \in \mathbb{R}^K$ | Vector | Logits for all classes (sample $i$) | $\mathbf{z}^{(i)} = W^\top \mathbf{x}^{(i)} + \mathbf{b}$ |

### 🔵 Predictions
| Symbol | Type | Meaning | Example / Notes |
|--------|------|---------|----------------|
| $\sigma(z)$ | Function | Sigmoid (binary) | $\sigma(z) = \frac{1}{1+e^{-z}}$ |
| $\hat{y}^{(i)}$ | Scalar | Predicted probability for class 1 (binary) | $\hat{y}^{(i)} = \sigma(z^{(i)})$ |
| $\hat{y}_c^{(i)}$ | Scalar | Probability that $\mathbf{x}^{(i)}$ belongs to class $c$ | $\hat{y}_c^{(i)} = \frac{e^{z_c^{(i)}}}{\sum_{j=1}^K e^{z_j^{(i)}}}$ |
| $\hat{\mathbf{y}}^{(i)} \in \mathbb{R}^K$ | Vector | Predicted probability distribution over $K$ classes | $\hat{\mathbf{y}}^{(i)} = \text{softmax}(\mathbf{z}^{(i)})$ |
| $\hat{Y} \in \mathbb{R}^{m \times K}$ | Matrix | Predicted probability matrix for $m$ samples | Each row = $\hat{\mathbf{y}}^{(i)}$ |

### 🔴 Parameters
| Symbol | Type | Meaning | Example / Notes |
|--------|------|---------|----------------|
| $w$ | Scalar | Weight / coefficient (binary) | Single-feature logistic regression |
| $\mathbf{w} \in \mathbb{R}^n$ | Vector | Weight vector (binary) | $[w_1, w_2, ..., w_n]^T$ |
| $W \in \mathbb{R}^{n \times K}$ | Matrix | Weight matrix (multiclass) | Each column = weights for class $c$ |
| $b$ | Scalar | Bias term (binary) | Scalar offset |
| $\mathbf{b} \in \mathbb{R}^K$ | Vector | Bias terms for $K$ classes | $[b_1, b_2, ..., b_K]^T$ |

### 🟣 Cost Functions
| Symbol | Type | Meaning | Example / Notes |
|--------|------|---------|----------------|
| $J(w,b)$ | Scalar | Cost function (binary cross-entropy) | $-\frac{1}{m}\sum_{i=1}^m \big[y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})\big]$ |
| $J(W,\mathbf{b})$ | Scalar | Cost function (multiclass cross-entropy) | $-\frac{1}{m}\sum_{i=1}^m \sum_{c=1}^K y_c^{(i)} \log \hat{y}_c^{(i)}$ |

### 🟤 Training Hyperparameters
| Symbol | Type | Meaning | Example / Notes |
|--------|------|---------|----------------|
| $m$ | Scalar | Number of training examples | Dataset size |
| $n$ | Scalar | Number of features | Dimension of $\mathbf{x}^{(i)}$ |
| $\alpha$ | Scalar | Learning rate | Step size in gradient descent |

### ⚫ Gradients
| Symbol | Type | Meaning | Example / Notes |
|--------|------|---------|----------------|
| $\frac{\partial J}{\partial w_j}$ | Scalar | Gradient w.r.t $j$-th weight (binary) | Used in updates |
| $\frac{\partial J}{\partial W} \in \mathbb{R}^{n \times K}$ | Matrix | Gradient of cost w.r.t weight matrix | Used in multiclass updates |
| $\frac{\partial J}{\partial b}$ | Scalar | Gradient w.r.t bias (binary) |  |
| $\frac{\partial J}{\partial \mathbf{b}} \in \mathbb{R}^K$ | Vector | Gradient of cost w.r.t bias vector | Used in multiclass updates |

## 3️⃣ Model Formula

### Single Logistic Regression (binary, 1 feature)

$$
z^{(i)} = w x^{(i)} + b
$$

$$
\hat{y}^{(i)} = f_{w,b}(x^{(i)}) = \sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}
$$
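
A small sketch of the single-feature model in NumPy. The parameter values `w`, `b` and the inputs `x` are made up for illustration; they are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and inputs, chosen only for illustration.
w, b = 2.0, -1.0
x = np.array([-3.0, 0.5, 4.0])

z = w * x + b        # logits: [-7.0, 0.0, 7.0]
y_hat = sigmoid(z)   # probabilities strictly between 0 and 1
```

The middle input gives $z = 0$, so its predicted probability is exactly 0.5; more negative logits move toward 0 and more positive ones toward 1.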

### Multivariable Logistic Regression (binary, multiple features)

$$
z^{(i)} = w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_n x_n^{(i)} + b = \mathbf{w}^\top \mathbf{x}^{(i)} + b
$$

$$
\hat{y}^{(i)} = \sigma(z^{(i)})
$$
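
The multi-feature case is the same computation with a dot product. A minimal sketch, assuming `X` stores one sample per row; the function name `predict_proba` and all numbers are illustrative.

```python
import numpy as np

def predict_proba(X, w, b):
    """Return sigma(X w + b): one probability per row of X."""
    z = X @ w + b                     # shape (m,)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: m = 3 samples, n = 2 features.
X = np.array([[1.0, 2.0],
              [0.0, 0.0],
              [-1.0, 3.0]])
w = np.array([0.5, -0.25])
b = 0.0

y_hat = predict_proba(X, w, b)
```

The first two rows have logit 0 and therefore probability 0.5; the third has logit $-1.25$, so its probability falls below 0.5.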

### Multiclass Logistic Regression (Softmax)

For multiclass classification with $k$ classes (written $K$ in the notation tables), logistic regression is generalized using the **softmax function**.

The softmax function converts raw scores (logits) for each class into **probabilities that sum to 1**:

$$
z_c^{(i)} = \mathbf{w}_c^\top \mathbf{x}^{(i)} + b_c
$$

$$
\hat{y}_c^{(i)} = \frac{e^{z_c^{(i)}}}{\sum_{j=1}^{k} e^{z_j^{(i)}}}
$$
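
A possible NumPy sketch of the softmax. Subtracting the row maximum before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the common factor cancels in the ratio. The logits in `Z` are invented for illustration.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax of an (m, K) array of logits."""
    Z = Z - Z.max(axis=1, keepdims=True)   # shift for stability; result is unchanged
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

# Hypothetical logits: the second row would overflow without the shift.
Z = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1000.0, 1000.0]])
P = softmax(Z)
```

Every row of `P` sums to 1, and equal logits yield a uniform distribution over the classes.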

## 4️⃣ Cost Function (Log Loss)

### Binary Logistic Regression

$$
L(\hat{y}^{(i)}, y^{(i)}) = - \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big]
$$

$$
L(\hat{y}^{(i)}, y^{(i)}) =
\begin{cases}
    - \log \hat{y}^{(i)} & \text{if $y^{(i)}=1$} \\
    - \log \left( 1 - \hat{y}^{(i)} \right) & \text{if $y^{(i)}=0$}
\end{cases}
$$

$$
J(\mathbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})
$$
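
To make the behaviour of the cost concrete, here is a minimal sketch of the binary cross-entropy. The clipping constant `eps` is an added safeguard against $\log 0$ and is an assumption of this sketch; the labels and probabilities are made up.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean log loss over m examples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # keep log() finite
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])   # hypothetical predictions
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])

cost_right = binary_cross_entropy(y, confident_right)
cost_wrong = binary_cross_entropy(y, confident_wrong)
```

Confident correct predictions give a small cost, while confident wrong predictions are penalized heavily.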
 
---

### Multiclass Logistic Regression (Softmax)

$$
\hat{y}_c^{(i)} = \frac{e^{z_c^{(i)}}}{\sum_{j=1}^{k} e^{z_j^{(i)}}}
$$

$$
L(\hat{\mathbf{y}}^{(i)}, \mathbf{y}^{(i)}) = - \sum_{c=1}^{k} y^{(i)}_c \, \log \hat{y}^{(i)}_c
$$
  
$$
J(W, \mathbf{b}) = \frac{1}{m} \sum_{i=1}^m L(\hat{\mathbf{y}}^{(i)}, \mathbf{y}^{(i)}) = -\frac{1}{m} \sum_{i=1}^m \sum_{c=1}^k y_c^{(i)} \log \hat{y}_c^{(i)}
$$
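
A sketch of the multiclass cost with one-hot labels. Because each row of `Y` has a single 1, the inner sum keeps only the log-probability of the true class. The clipping constant `eps` and the sample values are assumptions made for illustration.

```python
import numpy as np

def categorical_cross_entropy(Y, Y_hat, eps=1e-12):
    """Mean cross-entropy for one-hot Y and predicted probabilities Y_hat, both (m, K)."""
    Y_hat = np.clip(Y_hat, eps, 1.0)   # keep log() finite
    return -np.mean(np.sum(Y * np.log(Y_hat), axis=1))

# Hypothetical example: m = 2 samples, K = 3 classes.
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
Y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

cost = categorical_cross_entropy(Y, Y_hat)   # equals -(log 0.7 + log 0.6) / 2
```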

## 5️⃣ Gradient Descent Derivation

Minimize the cost $J$ with gradient descent, repeating the following simultaneous updates until convergence:

$$
w_j := w_j - \alpha \frac{\partial J}{\partial w_j}, \quad
b := b - \alpha \frac{\partial J}{\partial b}
$$
 
---

### 1. Binary Logistic Regression

Gradients:

$$
\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^m \big( \hat{y}^{(i)} - y^{(i)} \big) x_j^{(i)}
$$

$$
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m \big( \hat{y}^{(i)} - y^{(i)} \big)
$$
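
These gradients follow from the chain rule. A sketch of the derivation for a single example, dropping the superscript $(i)$ and using the identity $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$:

$$
\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})},
\qquad
\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y}),
\qquad
\frac{\partial z}{\partial w_j} = x_j, \quad \frac{\partial z}{\partial b} = 1
$$

$$
\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial \hat{y}} \, \frac{\partial \hat{y}}{\partial z} \, \frac{\partial z}{\partial w_j} = (\hat{y} - y)\, x_j,
\qquad
\frac{\partial L}{\partial b} = \hat{y} - y
$$

The factor $\hat{y}(1-\hat{y})$ cancels, and averaging over the $m$ examples gives the two expressions above.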

---

### 2. Multiclass Logistic Regression (Softmax)

$$
\hat{y}^{(i)}_c = \frac{e^{z_c^{(i)}}}{\sum_{j=1}^k e^{z_j^{(i)}}}, 
\quad \mathbf{z}^{(i)} = W^\top \mathbf{x}^{(i)} + \mathbf{b}
$$

Gradients:

$$
\frac{\partial J}{\partial W} = \frac{1}{m} X^\top (\hat{Y} - Y)
$$

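$$
\frac{\partial J}{\partial \mathbf{b}} = \frac{1}{m} (\hat{Y} - Y)^\top \mathbf{1}
$$

where $\mathbf{1} \in \mathbb{R}^m$ is the all-ones vector, so each entry of the bias gradient is the mean of the corresponding column of $\hat{Y} - Y$.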
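
One way to gain confidence in the multiclass gradient formulas is to compare them with a finite-difference estimate. The sketch below does this for a single weight entry on random data; the shapes, seed, step size `h`, and helper names are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # stability shift
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def cost(X, Y, W, b):
    """Multiclass cross-entropy J(W, b)."""
    Y_hat = softmax(X @ W + b)
    return -np.mean(np.sum(Y * np.log(Y_hat), axis=1))

def gradients(X, Y, W, b):
    """Analytic gradients: dW = X^T (Y_hat - Y) / m, db = column means of (Y_hat - Y)."""
    m = X.shape[0]
    diff = softmax(X @ W + b) - Y   # (m, K)
    dW = X.T @ diff / m             # (n, K)
    db = diff.sum(axis=0) / m       # (K,)
    return dW, db

rng = np.random.default_rng(0)
m, n, K = 6, 3, 4
X = rng.normal(size=(m, n))
Y = np.eye(K)[rng.integers(0, K, size=m)]   # random one-hot labels
W = rng.normal(size=(n, K))
b = rng.normal(size=K)

dW, db = gradients(X, Y, W, b)

# Central finite difference for the single entry W[1, 2].
h = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += h
W_minus[1, 2] -= h
numeric = (cost(X, Y, W_plus, b) - cost(X, Y, W_minus, b)) / (2 * h)
```

If the formulas are implemented correctly, `numeric` and `dW[1, 2]` agree to several decimal places.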

![Gradient Descent Plot](images/logistic_regression/grad_desc_logistic_lr.png)

## 6️⃣ Vectorized Form

---

### 1. Binary Logistic Regression (Multiple Features)

#### Predictions

$$
\hat{\mathbf{y}} = \sigma(X \mathbf{w} + b)
$$

#### Cost Function

$$
J(\mathbf{w}, b) = -\frac{1}{m} \Big[ \mathbf{y}^T \log(\hat{\mathbf{y}}) + (1-\mathbf{y})^T \log(1-\hat{\mathbf{y}}) \Big]
$$

#### Gradient Descent Updates

$$
\mathbf{w} := \mathbf{w} - \frac{\alpha}{m} X^T (\hat{\mathbf{y}} - \mathbf{y})
$$

$$
b := b - \frac{\alpha}{m} \sum_{i=1}^m \big(\hat{y}^{(i)} - y^{(i)}\big)
$$
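
Put together, the vectorized updates form a short training loop. This is a minimal sketch, not the linked notebook's implementation: the function name `fit_binary`, the zero initialization, the learning rate, the iteration count, and the toy dataset are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for binary logistic regression."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        error = sigmoid(X @ w + b) - y    # y_hat - y, shape (m,)
        w -= alpha / m * (X.T @ error)    # w := w - (alpha/m) X^T (y_hat - y)
        b -= alpha / m * error.sum()      # b := b - (alpha/m) sum(y_hat - y)
    return w, b

# Toy data (hypothetical): the label is 1 when the single feature is positive.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = fit_binary(X, y)
y_hat = sigmoid(X @ w + b)
```

On this toy data the learned weight is positive, and thresholding `y_hat` at 0.5 recovers the labels.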

---

### 2. Multiclass Logistic Regression (Softmax)

#### Predictions (Softmax)

$$
\hat{Y} = \text{softmax}(X W + \mathbf{b})
$$

where $\mathbf{b}$ is added to every row of $XW$, the softmax is applied row-wise, and

$$
\hat{y}^{(i)}_c = \frac{e^{z_c^{(i)}}}{\sum_{j=1}^k e^{z_j^{(i)}}}, 
\quad \mathbf{z}^{(i)} = W^\top \mathbf{x}^{(i)} + \mathbf{b}
$$

#### Cost Function (Cross-Entropy)

$$
J(W, \mathbf{b}) = -\frac{1}{m} \sum_{i=1}^m \sum_{c=1}^k y_c^{(i)} \log(\hat{y}_c^{(i)})
$$

#### Gradient Descent Updates

$$
W := W - \frac{\alpha}{m} X^T (\hat{Y} - Y)
$$

$$
\mathbf{b} := \mathbf{b} - \frac{\alpha}{m} \sum_{i=1}^m (\hat{\mathbf{y}}^{(i)} - \mathbf{y}^{(i)})
$$
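
The same loop works for the softmax model with matrix-valued parameters. Again a sketch under assumptions: `fit_softmax`, the zero initialization, the hyperparameters, and the three-cluster toy dataset are illustrative, and labels are zero-based.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # stability shift
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def fit_softmax(X, Y, alpha=0.1, num_iters=2000):
    """Batch gradient descent for softmax regression with one-hot labels Y."""
    m, n = X.shape
    K = Y.shape[1]
    W = np.zeros((n, K))
    b = np.zeros(K)
    for _ in range(num_iters):
        diff = softmax(X @ W + b) - Y        # Y_hat - Y, shape (m, K); b broadcasts over rows
        W -= alpha / m * (X.T @ diff)        # W := W - (alpha/m) X^T (Y_hat - Y)
        b -= alpha / m * diff.sum(axis=0)    # b := b - (alpha/m) sum_i (y_hat_i - y_i)
    return W, b

# Toy data (hypothetical): three well-separated clusters in 2D.
X = np.array([[0.0, 0.0], [0.2, 0.1],
              [5.0, 0.0], [5.1, 0.2],
              [0.0, 5.0], [0.1, 5.2]])
labels = np.array([0, 0, 1, 1, 2, 2])
Y = np.eye(3)[labels]

W, b = fit_softmax(X, Y)
pred = np.argmax(softmax(X @ W + b), axis=1)
```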


## 7️⃣ Additional Concepts

### Decision Boundary

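1. **Binary Logistic Regression**
- Classification is made using a threshold (commonly 0.5):
    - If $\hat{y}^{(i)} \geq 0.5$, predict class 1.
    - If $\hat{y}^{(i)} < 0.5$, predict class 0.
- Because $\sigma(z) \geq 0.5$ exactly when $z \geq 0$, the decision boundary at this threshold is the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$.


2. **Multiclass Logistic Regression**
- The predicted class is the one with the highest predicted probability:
    $$
    \text{predicted class}^{(i)} = \arg\max_c \hat{y}_c^{(i)}
    $$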
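
Both decision rules are one-liners in NumPy. The probabilities below are invented for illustration, and `np.argmax` returns zero-based class indices.

```python
import numpy as np

# Hypothetical predicted probabilities, for illustration only.
y_hat_binary = np.array([0.92, 0.50, 0.13])
Y_hat_multi = np.array([[0.1, 0.7, 0.2],
                        [0.6, 0.3, 0.1]])

threshold = 0.5
binary_pred = (y_hat_binary >= threshold).astype(int)   # 1 where y_hat >= 0.5, else 0
multi_pred = np.argmax(Y_hat_multi, axis=1)             # index of the largest probability per row
```

Raising or lowering `threshold` trades false positives against false negatives without retraining the model.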

---

### Regularization

To reduce overfitting, add a penalty on the weights to the cost (the bias $b$ is not penalized in the formulas below):

- **L2 (Ridge)**  
$$
J_\text{ridge}(\mathbf{w}, b) = J(\mathbf{w}, b) + \frac{\lambda}{2m} \sum_{j=1}^n w_j^2
$$

Gradient w.r.t $w_j$:  
$$
\frac{\partial J_\text{ridge}}{\partial w_j} = \frac{\partial J(\mathbf{w}, b)}{\partial w_j} + \frac{\lambda}{m} w_j
$$

*The gradient of the L2 penalty, $\frac{\lambda}{m} w_j$, is proportional to the coefficient itself, so larger coefficients shrink faster and smaller ones shrink more slowly. Coefficients rarely become exactly zero.*
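
A sketch of the L2-regularized gradient for the binary model. The function name `ridge_gradients` and the data are hypothetical; the bias gradient carries no penalty term, matching the cost above.

```python
import numpy as np

def ridge_gradients(X, y, w, b, lam):
    """Gradients of the L2-regularized binary cross-entropy."""
    m = X.shape[0]
    error = 1.0 / (1.0 + np.exp(-(X @ w + b))) - y   # y_hat - y
    dw = X.T @ error / m + (lam / m) * w             # data term + (lambda/m) * w
    db = error.sum() / m                             # bias is not regularized
    return dw, db

# Hypothetical data and parameters.
X = np.array([[1.0, 2.0], [3.0, -1.0], [0.0, 0.5]])
y = np.array([1, 0, 1])
w = np.array([0.5, -2.0])
b = 0.1

dw_plain, _ = ridge_gradients(X, y, w, b, lam=0.0)
dw_ridge, _ = ridge_gradients(X, y, w, b, lam=3.0)
penalty_part = dw_ridge - dw_plain   # equals (lam / m) * w, here exactly w since lam = m = 3
```

The extra term scales with each weight, which is why the larger-magnitude weight receives the stronger pull toward zero.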

- **L1 (Lasso)**  
$$
J_\text{lasso}(\mathbf{w}, b) = J(\mathbf{w}, b) + \frac{\lambda}{2m} \sum_{j=1}^n |w_j|
$$

Gradient w.r.t $w_j$:  
$$
\frac{\partial J_\text{lasso}}{\partial w_j} = \frac{\partial J(\mathbf{w}, b)}{\partial w_j} + \frac{\lambda}{2m} \, \text{sign}(w_j)
$$

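*The gradient of the L1 penalty, $\frac{\lambda}{2m} \text{sign}(w_j)$, has the same magnitude for every nonzero weight, so it pushes weights toward zero at a constant rate (at $w_j = 0$ the absolute value is not differentiable, and a subgradient such as 0 is used). Some coefficients can become exactly zero, making Lasso useful for feature selection.*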
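
To contrast the two penalties numerically, the sketch below evaluates both penalty gradients on the same weights. The helper names and values are illustrative; `np.sign` returns 0 at $w_j = 0$, which is one valid subgradient choice.

```python
import numpy as np

def lasso_penalty_grad(w, lam, m):
    """(Sub)gradient of (lam / (2m)) * sum(|w_j|)."""
    return lam / (2 * m) * np.sign(w)

def ridge_penalty_grad(w, lam, m):
    """Gradient of (lam / (2m)) * sum(w_j^2)."""
    return lam / m * w

w = np.array([3.0, -0.01, 0.0])   # hypothetical weights: large, tiny, and zero
g_l1 = lasso_penalty_grad(w, lam=2.0, m=10)
g_l2 = ridge_penalty_grad(w, lam=2.0, m=10)
```

The L1 term has the same magnitude for the large and the tiny weight, while the L2 term almost vanishes for the tiny one.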


## 8️⃣ Implementation

See the corresponding **NumPy-based implementation** here: [logistic_regression_numpy.ipynb](../implementation/logistic_regression_numpy.ipynb)