```{contents}
```

## Cost Function

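Unlike linear regression, logistic regression is not normally trained with **Mean Squared Error (MSE)**. Combining MSE with the non-linear sigmoid output gives a **non-convex cost function** in $\theta$, which is hard to optimize: gradient descent can stall in flat regions or settle in a local minimum.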

Here’s a breakdown:

---

### Sigmoid Function

First, logistic regression outputs probabilities using the sigmoid function:

$$
\hat{y} = h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}
$$

Where:

* $\hat{y}$ = predicted probability that $y = 1$
* $\theta$ = model parameters
* $x$ = input features
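
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the parameter vector `theta` and the feature matrix `X` are made-up values, and the first column of `X` is assumed to be a constant 1 so that `theta[0]` acts as an intercept.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and inputs (one row of X per example).
theta = np.array([0.5, -1.0, 2.0])
X = np.array([[1.0, 2.0, 0.5],
              [1.0, -1.0, 1.5]])

# theta^T x for every row, then squashed into (0, 1).
y_hat = sigmoid(X @ theta)  # approximately [0.3775, 0.9890]
```

Each entry of `y_hat` is the predicted probability that the corresponding example has label 1.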

---

### Likelihood Function

Logistic regression is based on **Maximum Likelihood Estimation (MLE)**.

The **likelihood** is the probability of observing the given labels under parameters $\theta$. Assuming the $m$ training examples are independent, it factorizes into a product over examples:

$$
L(\theta) = \prod_{i=1}^{m} P(y^{(i)} | x^{(i)}; \theta)
$$

For binary labels $y \in \{0,1\}$:

$$
P(y|x; \theta) = (\hat{y})^y (1-\hat{y})^{1-y}
$$

So the likelihood becomes:

$$
L(\theta) = \prod_{i=1}^{m} (\hat{y}^{(i)})^{y^{(i)}} (1-\hat{y}^{(i)})^{1-y^{(i)}}
$$
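
As a small worked example of the product above, the sketch below evaluates the likelihood for three made-up examples. The labels `y` and predicted probabilities `y_hat` are hypothetical values chosen for illustration.

```python
import numpy as np

def likelihood(y, y_hat):
    """Product over examples of y_hat^y * (1 - y_hat)^(1 - y)."""
    return np.prod(y_hat**y * (1 - y_hat)**(1 - y))

# Hypothetical labels and predicted probabilities P(y = 1 | x).
y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.2, 0.7])

# Each factor is the probability assigned to the observed label:
# 0.9 (y = 1), 1 - 0.2 = 0.8 (y = 0), 0.7 (y = 1).
L = likelihood(y, y_hat)  # 0.9 * 0.8 * 0.7 = 0.504
```

The exponents simply select $\hat{y}$ when $y = 1$ and $1 - \hat{y}$ when $y = 0$.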

---

### Log-Likelihood

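We usually take the **log** of the likelihood. The log turns the product into a sum, which is easier to differentiate and avoids multiplying many small probabilities. Because the log is monotonically increasing, the $\theta$ that maximizes $\ell(\theta)$ also maximizes $L(\theta)$: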

$$
\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big]
$$
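
One practical reason for working in log space can be seen numerically. The sketch below uses hypothetical data (2000 examples, each assigned probability 0.5) to compare the raw product with the log-likelihood sum in double precision.

```python
import numpy as np

def log_likelihood(y, y_hat):
    """Sum of per-example log-probabilities of the observed labels."""
    return np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical data: 2000 positive examples, each predicted at 0.5.
y = np.ones(2000)
y_hat = np.full(2000, 0.5)

# The raw likelihood is 0.5**2000, which is too small for float64
# and underflows to exactly 0.0.
raw_product = np.prod(y_hat**y * (1 - y_hat)**(1 - y))

# The log-likelihood is 2000 * log(0.5), a perfectly ordinary number.
ll = log_likelihood(y, y_hat)
```

Here `raw_product` carries no usable information, while `ll` is about -1386.29 and can still be compared across different parameter values.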

---

### Cost Function (Negative Log-Likelihood / Log Loss)

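Optimizers conventionally **minimize** a cost, so we negate the log-likelihood and average it over the $m$ training examples. The result is the **negative log-likelihood**, also called log loss: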

$$
J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big]
$$

* This is the **primary cost function** used in logistic regression.
* Intuition:

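  * If the model assigns high probability to the true label ($\hat{y}$ close to $y$), the log loss for that example is close to 0.
  * If the model is confident but wrong, the log loss is **very large**: it grows without bound as $\hat{y} \to 0$ for $y = 1$ (or $\hat{y} \to 1$ for $y = 0$).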
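
The intuition above can be checked with a short sketch of $J(\theta)$ written in terms of the predictions. The clipping constant `eps` is an assumption added here to keep `log(0)` from appearing; it is not part of the formula.

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-15):
    """Average negative log-likelihood for binary labels.

    eps is an assumed safeguard: predictions are clipped to
    [eps, 1 - eps] so the logarithms stay finite.
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0])  # a single example whose true label is 1

confident_right = log_loss(y, np.array([0.99]))  # about 0.01
confident_wrong = log_loss(y, np.array([0.01]))  # about 4.61
```

Moving the prediction from 0.99 to 0.01 on a positive example raises the loss by a factor of several hundred, which is the asymmetry the bullets describe.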

---

### Variants / Regularized Cost Functions

To prevent **overfitting**, we add **regularization**:

1. **L2 Regularization (Ridge)**

$$
J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
$$

2. **L1 Regularization (Lasso)**

$$
J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big] + \frac{\lambda}{m} \sum_{j=1}^{n} |\theta_j|
$$

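* $\lambda$ = regularization parameter; larger values penalize the weights more strongly
* L2 penalizes large weights and shrinks them toward 0
* L1 encourages sparsity (some weights become exactly 0)
* Both sums start at $j = 1$; under the usual convention that $\theta_0$ is the intercept, the intercept is not penalized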
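
Both penalized costs can be sketched with one function. The code below makes two assumptions that the formulas only imply: the first column of `X` is a constant 1, so `theta[0]` is the intercept and is left out of the penalty (matching the sum from $j = 1$), and predictions are clipped by a small `eps` to keep the logarithms finite. The function name and the `penalty` argument are hypothetical.

```python
import numpy as np

def regularized_cost(theta, X, y, lam, penalty="l2", eps=1e-15):
    """Log loss plus an L2 or L1 penalty on theta[1:]."""
    m = len(y)
    y_hat = 1.0 / (1.0 + np.exp(-(X @ theta)))
    y_hat = np.clip(y_hat, eps, 1 - eps)  # assumed numerical safeguard
    data_term = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    weights = theta[1:]  # assumption: theta[0] is the unpenalized intercept
    if penalty == "l2":
        reg_term = lam / (2 * m) * np.sum(weights**2)
    elif penalty == "l1":
        reg_term = lam / m * np.sum(np.abs(weights))
    else:
        raise ValueError("penalty must be 'l2' or 'l1'")
    return data_term + reg_term

# Hypothetical data: both rows give theta^T x = 0, so y_hat = 0.5.
theta = np.array([0.0, 1.0, -2.0])
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
y = np.array([1.0, 0.0])

cost_l2 = regularized_cost(theta, X, y, lam=1.0, penalty="l2")  # log(2) + 1.25
cost_l1 = regularized_cost(theta, X, y, lam=1.0, penalty="l1")  # log(2) + 1.5
```

With `lam=0` both variants reduce to the plain log loss, here $\log 2$.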

---

### Alternative (Less Common) Cost Functions

* **Mean Squared Error (MSE):** Sometimes used, but not preferred because it makes the cost function non-convex for logistic regression.
* **Hinge Loss:** Used in SVMs, not typical for logistic regression.
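
The non-convexity of MSE can be illustrated with a one-dimensional check. A convex function must satisfy $J\big(\tfrac{a+b}{2}\big) \le \tfrac{1}{2}\big(J(a) + J(b)\big)$ for every pair of points. The sketch below uses a single hypothetical example ($x = 1$, $y = 1$) and two hand-picked points $a = -10$, $b = -2$ to show MSE violating this inequality while log loss satisfies it at the same points. It is a spot check, not a proof of convexity for log loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_cost(theta, x=1.0, y=1.0):
    """Squared error on one example, with the 1/2 factor."""
    return 0.5 * (sigmoid(theta * x) - y) ** 2

def log_loss_cost(theta, x=1.0, y=1.0):
    """Log loss on one example."""
    p = sigmoid(theta * x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hand-picked points in the region where the sigmoid saturates.
a, b = -10.0, -2.0
mid = 0.5 * (a + b)

# True means the midpoint lies ABOVE the chord, so convexity fails.
mse_violates = mse_cost(mid) > 0.5 * (mse_cost(a) + mse_cost(b))
log_loss_violates = log_loss_cost(mid) > 0.5 * (log_loss_cost(a) + log_loss_cost(b))
```

For these values `mse_violates` is `True` and `log_loss_violates` is `False`: the MSE curve flattens out where the prediction is confidently wrong, whereas log loss keeps growing there.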

---

### Summary Table

| Cost Function         | Formula                                                       | Notes                                           |
| --------------------- | ------------------------------------------------------------- | ----------------------------------------------- |
| Log Loss (Binary)     | $-\frac{1}{m} \sum [y \log \hat{y} + (1-y) \log (1-\hat{y})]$ | Standard cost for logistic regression           |
| L2 Regularized        | Log Loss + $\frac{\lambda}{2m} \sum \theta_j^2$               | Penalizes large weights, prevents overfitting   |
| L1 Regularized        | Log Loss + $\frac{\lambda}{m} \sum \lvert \theta_j \rvert$    | Encourages sparse models                        |
| MSE (Not Recommended) | $\frac{1}{2m} \sum (\hat{y}-y)^2$                             | Non-convex for logistic regression, rarely used |

