```{contents}
```

# Cost Functions 

## 1. **Maximum Likelihood Estimation (MLE) – Training Objective**

Naïve Bayes learns probabilities $P(y)$ and $P(x_i|y)$ from data by maximizing the **likelihood** of the training set:

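$$
L(\theta) = \prod_{j=1}^m P(x^{(j)}, y^{(j)}; \theta) = \prod_{j=1}^m P(y^{(j)}) \prod_{i=1}^n P(x_i^{(j)} \mid y^{(j)})
$$

where:

* $m$ = number of training samples
* $n$ = number of features
* $y^{(j)}$ = class label of sample $j$
* $x^{(j)}$ = features of sample $j$, with $x_i^{(j)}$ the $i$-th feature
* $\theta$ = parameters (priors + likelihoods).

Naïve Bayes is a generative model, so the likelihood is taken over the joint distribution $P(x, y)$ rather than the conditional $P(y \mid x)$. The second equality is the conditional-independence ("naïve") assumption.

⚡ In practice, we maximize the **log-likelihood** (to avoid underflow and turn products into sums):

$$
\ell(\theta) = \sum_{j=1}^m \left[ \log P(y^{(j)}) + \sum_{i=1}^n \log P(x_i^{(j)} \mid y^{(j)}) \right]
$$

👉 So the implicit **cost function** is:

$$
J(\theta) = -\ell(\theta) = - \sum_{j=1}^m \left[ \log P(y^{(j)}) + \sum_{i=1}^n \log P(x_i^{(j)} \mid y^{(j)}) \right]
$$

This is the **negative log-likelihood (NLL)** of the joint distribution. Its minimizer has a closed form, so no iterative optimization is needed: for categorical features, the optimal parameters are the relative frequencies counted from the training data. It is closely related to, but not the same as, the **cross-entropy loss**, which is the NLL of the conditional $P(y \mid x)$.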
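
As a minimal sketch of the estimation step, assuming binary features (a Bernoulli Naïve Bayes variant) and a made-up toy dataset, the parameters that maximize the joint likelihood $P(x, y)$ can be obtained by counting, and the negative log-likelihood can then be evaluated directly. The names `fit_mle` and `joint_nll` are illustrative and not taken from any library, and no smoothing is applied.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 2 binary features, 2 classes.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 1]])
y = np.array([0, 0, 1, 1, 0, 1])

def fit_mle(X, y, n_classes=2):
    """Closed-form MLE for Bernoulli Naive Bayes (relative frequencies)."""
    # priors[c] = P(y = c): fraction of samples in class c
    priors = np.array([(y == c).mean() for c in range(n_classes)])
    # theta[c, i] = P(x_i = 1 | y = c): fraction of class-c samples with x_i = 1
    theta = np.array([X[y == c].mean(axis=0) for c in range(n_classes)])
    return priors, theta

def joint_nll(X, y, priors, theta, eps=1e-12):
    """-sum_j [ log P(y_j) + sum_i log P(x_ij | y_j) ]."""
    t = np.clip(theta[y], eps, 1 - eps)  # guard against log(0)
    log_lik = X * np.log(t) + (1 - X) * np.log(1 - t)
    return -(np.log(priors[y]).sum() + log_lik.sum())

priors, theta = fit_mle(X, y)
nll_mle = joint_nll(X, y, priors, theta)

# Moving the priors away from the counted estimates increases the cost.
nll_other = joint_nll(X, y, np.array([0.4, 0.6]), theta)
print(nll_mle, nll_other)
```

On this data, any other parameter setting (here, priors of 0.4 and 0.6) gives a larger `joint_nll`, which is what makes the counted estimates the maximum-likelihood solution.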

---

## 2. **Cross-Entropy / Log Loss – Evaluation**

When evaluating probabilistic classifiers like Naïve Bayes, we often use **log loss**:

$$
\text{LogLoss} = -\frac{1}{m} \sum_{j=1}^m \sum_{c=1}^k \mathbf{1}\{y^{(j)} = c\} \log P(y=c \mid x^{(j)})
$$

where:

* $k$ = number of classes
* $\mathbf{1}$ = indicator function (1 if true class = $c$, else 0).

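👉 Log loss penalizes confident but wrong predictions most heavily: the per-sample penalty is $-\log p$, where $p$ is the probability assigned to the true class, and it grows without bound as $p \to 0$.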
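
A short sketch of this formula, assuming the predicted probabilities are available as an $m \times k$ NumPy array (the arrays below are made-up values, and `log_loss` here is a hand-written helper rather than a library call). The indicator in the sum simply picks out the probability assigned to the true class.

```python
import numpy as np

def log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    proba = np.clip(proba, eps, 1 - eps)  # avoid log(0)
    m = len(y_true)
    # proba[j, y_true[j]] is the only term the indicator keeps for sample j
    return -np.log(proba[np.arange(m), y_true]).mean()

y_true = np.array([0, 1, 1])

# Same confidence level (0.9) everywhere; only the last prediction differs.
confident_right = np.array([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
confident_wrong = np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]])

loss_right = log_loss(y_true, confident_right)
loss_wrong = log_loss(y_true, confident_wrong)
print(loss_right, loss_wrong)
```

A single confident mistake raises the average loss several times over, even though two of the three predictions are unchanged.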

---

## 3. **Zero-One Loss – Simpler alternative**

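For hard class predictions, we can also use the **0-1 loss**, which ignores the predicted probabilities and only checks whether the predicted label $\hat{y}^{(j)} = \arg\max_c P(y=c \mid x^{(j)})$ matches the true label: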

$$
\text{0-1 Loss} = \frac{1}{m} \sum_{j=1}^m \mathbf{1}\{\hat{y}^{(j)} \neq y^{(j)}\}
$$

This is the **misclassification rate**, i.e. $1 - \text{accuracy}$.
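
For illustration, a minimal sketch assuming the predicted label is the class with the highest predicted probability (the data are made up and `zero_one_loss` is a hand-written helper). Note that a 0.51 prediction and a 0.9 prediction count the same once they are turned into labels.

```python
import numpy as np

def zero_one_loss(y_true, proba):
    """Fraction of samples whose arg-max prediction differs from the true label."""
    y_pred = proba.argmax(axis=1)
    return (y_pred != y_true).mean()

y_true = np.array([0, 1, 1, 0])
proba = np.array([
    [0.90, 0.10],  # correct, confident
    [0.40, 0.60],  # correct, less confident
    [0.80, 0.20],  # wrong
    [0.51, 0.49],  # correct, barely
])

error_rate = zero_one_loss(y_true, proba)
print(error_rate)
```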

---

**Summary**

* **Training** → Naïve Bayes parameters are estimated via **maximum likelihood**, which implicitly minimizes **negative log-likelihood (NLL)**.
* **Evaluation** → Common cost functions:

  * **Log Loss (cross-entropy)** → measures the quality of the predicted probabilities, not just the predicted labels.
  * **0-1 Loss (error rate)** → measures how often the predicted label is wrong ($1 - \text{accuracy}$).

