## Overview of Gaussian Naive Bayes

Naive Bayes is a family of probabilistic classification methods built on generative modeling, where the goal is to characterize how the features and labels are jointly distributed. Instead of directly learning the conditional probability $P(y \mid X)$, the method models the joint structure by estimating the class prior distribution together with the class-conditional distributions of the features. The posterior probability used for classification is then obtained via Bayes' rule.

At the core of Naive Bayes is the naive conditional independence assumption: given the class label, all features are treated as independent. While this assumption is often violated in real-world data, it greatly simplifies the modeling process, allowing all parameters to be estimated efficiently using simple empirical statistics computed within each class.

Different variants of Naive Bayes arise from different assumptions about the class-conditional feature distributions. For example, Multinomial and Bernoulli Naive Bayes are commonly used for text data, while Categorical Naive Bayes applies to discrete, non-numeric features. 

In this project, we focus on Gaussian Naive Bayes, which assumes that each continuous feature follows a class-conditional normal distribution. This makes GaussianNB particularly suitable for datasets with real-valued features and provides a simple yet effective generative model for continuous data.

| Variant | Feature Type | Distribution Assumption | Typical Use Case |
|--------|--------------|-------------------------|------------------|
| **Gaussian Naive Bayes** | Continuous numerical features | Gaussian (normal) distribution | Sensor data, measurements, general continuous features |
| **Multinomial Naive Bayes** | Count-based features | Multinomial distribution | Text classification, word counts (BoW, TF) |
| **Bernoulli Naive Bayes** | Binary (0/1) features | Bernoulli distribution | Word presence/absence, binary indicators |
| **Categorical Naive Bayes** | General categorical features | Categorical distribution | Discrete categorical variables (color, city, etc.) |


## Advantages and Disadvantages of Gaussian Naive Bayes

### **Advantages**

**1. Extremely fast training**  
GaussianNB uses closed-form estimates for class means and variances, requiring no iterative optimization. 

**2. Performs well on small datasets**  
Estimating Gaussian parameters requires relatively few samples, allowing GNB to work effectively even when data is limited.

**3. Simple and interpretable**  
The contribution of each feature to the final prediction can be clearly understood through class-wise means, variances, and log-likelihoods.

**4. Efficient in high-dimensional spaces**  
Due to the independence assumption, the computational cost scales linearly with the number of features, making GNB suitable for high-dimensional data.


### **Disadvantages**

**1. Independence assumption rarely holds**  
The naive assumption that features are conditionally independent is often violated in real-world data, which can degrade model performance.

**2. Sensitive to distribution mismatch**  
GNB assumes each feature follows a Gaussian distribution within each class. Strongly skewed, multimodal, or heavy-tailed distributions may lead to poor results.

**3. Limited model flexibility**  
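The decision boundaries between classes are at most quadratic in the features (and linear when the classes share the same per-feature variances), restricting the model's ability to capture more complex nonlinear structures.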

**4. Diagonal covariance only**  
Each class is modeled with one univariate Gaussian per feature, which is equivalent to a multivariate Gaussian with a diagonal covariance matrix. GNB therefore cannot capture correlations between features, which can be limiting when feature interactions are important.
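
To make the quadratic decision boundaries under "Limited model flexibility" concrete, here is a short derivation sketch for two classes $a$ and $b$. It writes $\mu_{c,j}$ and $\sigma_{c,j}^2$ for the mean and variance of feature $j$ in class $c$, and follows directly from the Gaussian and independence assumptions:

$$
\log \frac{P(y = a \mid x)}{P(y = b \mid x)}
= \log \frac{P(y = a)}{P(y = b)}
+ \sum_{j=1}^d
\left[
\frac{1}{2}\log\frac{\sigma_{b,j}^2}{\sigma_{a,j}^2}
- \frac{(x_j - \mu_{a,j})^2}{2\sigma_{a,j}^2}
+ \frac{(x_j - \mu_{b,j})^2}{2\sigma_{b,j}^2}
\right].
$$

The boundary is the set of points where this expression equals zero. Each feature contributes terms in $x_j^2$ and $x_j$ but no cross terms $x_j x_k$, so the boundary is quadratic and axis-aligned. If $\sigma_{a,j}^2 = \sigma_{b,j}^2$ for every $j$, the $x_j^2$ terms cancel and the boundary becomes linear.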

## Representation

Gaussian Naive Bayes is a **generative classifier**.  
It models how the pair $(x, y)$ is generated, and then uses Bayes’ rule to
predict the most likely class for a new feature vector $x$.

---

### Bayes’ rule

We start from Bayes’ rule for a single class $c$:

$$
P(y = c \mid x)
= \frac{P(x \mid y = c)\, P(y = c)}{P(x)} .
$$

Here

- $P(y = c \mid x)$: posterior — probability that the label is class $c$ given $x$  
- $P(x \mid y = c)$: likelihood of seeing features $x$ under class $c$  
- $P(y = c)$: class prior  
- $P(x)$: evidence (normalization term, same for all classes when we compare them)
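
As a small numeric illustration of the four terms above, the following sketch applies Bayes' rule to a two-class problem. The likelihood and prior values are made-up numbers chosen for illustration, not estimates from any dataset.

```python
# Hypothetical values for a two-class problem (illustrative only).
likelihood = {"a": 0.20, "b": 0.05}  # P(x | y = c)
prior = {"a": 0.30, "b": 0.70}       # P(y = c)

# Numerator of Bayes' rule for each class: P(x | y = c) * P(y = c).
joint = {c: likelihood[c] * prior[c] for c in prior}

# The evidence P(x) is the sum of the numerators over all classes.
evidence = sum(joint.values())

# Normalized posterior P(y = c | x).
posterior = {c: joint[c] / evidence for c in joint}

# The winning class is the same with or without dividing by P(x).
best_unnormalized = max(joint, key=joint.get)
best_normalized = max(posterior, key=posterior.get)
print(posterior, best_unnormalized, best_normalized)
```

Even though class `b` has the larger prior, the larger likelihood of class `a` dominates here, and dividing by the evidence does not change which class wins.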

---

### Class prior $P(y = c)$

We estimate the prior by counting how often each class appears in the training set.
Let $N_c$ be the number of samples with label $c$, and $N$ the total number of
samples:

$$
P(y = c) \approx \hat P(y = c)
= \frac{N_c}{N} .
$$

So the prior just measures how common class $c$ is in the data.
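
A minimal sketch of this counting estimate is shown below. The function name `class_priors` and the example labels are illustrative choices, not part of the project code.

```python
from collections import Counter


def class_priors(labels):
    """Return {class: N_c / N} for a sequence of class labels."""
    counts = Counter(labels)   # N_c for every class c
    total = len(labels)        # N
    return {c: n_c / total for c, n_c in counts.items()}


# Hypothetical label vector with 2 "spam" and 6 "ham" samples.
labels = ["spam", "ham", "ham", "ham", "spam", "ham", "ham", "ham"]
priors = class_priors(labels)
print(priors)
```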

---

### Likelihood of features $P(x \mid y = c)$

Let the feature vector be $x = (x_1, \dots, x_d)$.

1. **Naive conditional independence**

   Given the class $y = c$, features are assumed conditionally independent:

   $$
   P(x \mid y = c)
   = \prod_{j=1}^d P(x_j \mid y = c) .
   $$

2. **Class-conditional Gaussian distributions**

   For each class $c$ and feature $j$, we assume a univariate Gaussian:

   $$
   x_j \mid y = c \sim \mathcal{N}(\mu_{c,j}, \sigma_{c,j}^2),
   $$

   so the per-feature likelihood is

   $$
   P(x_j \mid y = c)
   = \mathcal{N}(x_j \mid \mu_{c,j}, \sigma_{c,j}^2)
   = \frac{1}{\sqrt{2\pi\sigma_{c,j}^2}}
     \exp\!\left(
       -\frac{(x_j - \mu_{c,j})^2}{2\sigma_{c,j}^2}
     \right).
   $$

   Combining these, the total likelihood is

   $$
   P(x \mid y = c)
   = \prod_{j=1}^d
     \mathcal{N}(x_j \mid \mu_{c,j}, \sigma_{c,j}^2).
   $$
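
The two steps above can be sketched in a few lines of Python. The helper names `gaussian_pdf` and `naive_likelihood` and the parameter values are assumptions made for this example only.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal N(mean, var) evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def naive_likelihood(x, means, variances):
    """P(x | y = c) as a product of per-feature Gaussian densities."""
    likelihood = 1.0
    for x_j, mu_j, var_j in zip(x, means, variances):
        likelihood *= gaussian_pdf(x_j, mu_j, var_j)
    return likelihood


# Hypothetical parameters of one class with two features.
means = [0.0, 5.0]
variances = [1.0, 4.0]

# Evaluate the likelihood at the class mean, where each density peaks.
value = naive_likelihood([0.0, 5.0], means, variances)
print(value)
```

At the class mean the exponentials equal 1, so the result reduces to the product of the two normalizing constants, $1/\sqrt{2\pi} \cdot 1/\sqrt{8\pi} = 1/(4\pi)$.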

---

### Final representation (prediction rule)

For prediction we choose the class with the largest posterior probability:

$$
\hat{y}
= \arg\max_{c} P(y = c \mid x).
$$

Using Bayes’ rule and dropping $P(x)$, which is the same for all classes,
this becomes

$$
\hat{y}
= \arg\max_{c} P(x \mid y = c)\, P(y = c)
= \arg\max_{c}
\left[
  \prod_{j=1}^d
  \mathcal{N}(x_j \mid \mu_{c,j}, \sigma_{c,j}^2)
\right]
\frac{N_c}{N}.
$$

In code, we usually work in **log space**, which replaces the product by a sum and avoids floating-point underflow when $d$ is large:

$$
\log P(y = c \mid x)
= \log P(y = c)
+ \sum_{j=1}^d \log \mathcal{N}(x_j \mid \mu_{c,j}, \sigma_{c,j}^2)
\quad \text{(up to an additive constant)} ,
$$

and still predict

$$
\hat{y} = \arg\max_{c} \log P(y = c \mid x).
$$
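
The sketch below illustrates one practical reason for working in log space: with many features, the raw product of densities underflows to zero in double precision, while the sum of log-densities stays finite. The setup (1000 standard-normal features, all evaluated at 0) is an artificial example chosen to trigger the effect.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gaussian_logpdf(x, mean, var):
    """Log-density of N(mean, var) at x, computed without exponentiating."""
    return -0.5 * math.log(2.0 * math.pi * var) - ((x - mean) ** 2) / (2.0 * var)


d = 1000          # number of features (artificially large)
x = [0.0] * d     # every feature sits exactly at its class mean

# Product of densities: each factor is about 0.399, so the product underflows.
product = 1.0
for x_j in x:
    product *= gaussian_pdf(x_j, 0.0, 1.0)

# Sum of log-densities: each term is about -0.919, so the sum is well behaved.
log_sum = sum(gaussian_logpdf(x_j, 0.0, 1.0) for x_j in x)

print(product, log_sum)
```

Once the product has underflowed, every class would receive the same score of zero and the arg max would be meaningless, whereas the log scores can still be compared.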

## Loss

We train Gaussian Naive Bayes by **maximum likelihood**,  
which is equivalent to **minimizing the negative log-likelihood (NLL)**.

---

### General form

Given a parametric model with parameters $\Theta$ and training data
$\{(x_i, y_i)\}_{i=1}^n$, assumed to be independent and identically distributed
(here $n$ is the total number of training samples, written $N$ in the class-prior
estimate), the likelihood of the data is

$$
p_\Theta\big(\{(x_i, y_i)\}_{i=1}^n\big)
= \prod_{i=1}^n p_\Theta(x_i, y_i).
$$

The **negative log-likelihood loss** is

$$
\mathcal{L}_{\text{NLL}}(\Theta)
= - \sum_{i=1}^n \log p_\Theta(x_i, y_i).
$$

Minimizing $\mathcal{L}_{\text{NLL}}$ is the same as maximizing the likelihood.

---

### NLL for Gaussian Naive Bayes

For Gaussian NB, the joint model for a single sample is

$$
p_\Theta(x_i, y_i)
= \pi_{y_i}
  \prod_{j=1}^d
  \mathcal{N}(x_{i,j} \mid \mu_{y_i,j}, \sigma_{y_i,j}^2),
$$

where

- $\pi_c = P(y = c)$ is the class prior,  
- $\mu_{c,j}$ and $\sigma_{c,j}^2$ are the mean and variance of feature $j$ in class $c$.

Plugging this into the general NLL and expanding the Gaussian log-density gives

$$
\mathcal{L}_{\text{NLL}}(\Theta)
=
\sum_{i=1}^n
\left[
-\log \pi_{y_i}
+
\sum_{j=1}^d
\left(
\frac{1}{2}\log\!\big(2\pi \sigma_{y_i,j}^2\big)
+
\frac{(x_{i,j} - \mu_{y_i,j})^2}{2\sigma_{y_i,j}^2}
\right)
\right].
$$

- The term $-\log \pi_{y_i}$ comes from the **class prior**.  
- The inner sum over $j$ comes from the **Gaussian likelihood** of each feature.

Because Gaussian NB has **closed-form MLE solutions** for $\pi_c$, $\mu_{c,j}$, and
$\sigma_{c,j}^2$, we usually do **not** run gradient descent on this loss in practice;
instead we compute the empirical counts, means, and variances that minimize it.
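
As a sanity check on the formula above, the following sketch evaluates the NLL on a tiny hand-made dataset and compares the closed-form parameters against a deliberately shifted mean. The function name `gnb_nll`, the dictionary layout of the parameters, and the data are assumptions for this example.

```python
import math


def gnb_nll(X, y, priors, means, variances):
    """Negative log-likelihood of (X, y) under a Gaussian Naive Bayes model.

    priors[c] is pi_c; means[c][j] and variances[c][j] are the Gaussian
    parameters of feature j in class c.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        total -= math.log(priors[y_i])                      # class-prior term
        for j, x_ij in enumerate(x_i):
            var = variances[y_i][j]
            total += 0.5 * math.log(2.0 * math.pi * var)    # normalizer term
            total += (x_ij - means[y_i][j]) ** 2 / (2.0 * var)  # squared-error term
    return total


# Tiny hypothetical dataset: one feature, two classes.
X = [[1.0], [3.0], [10.0], [14.0]]
y = [0, 0, 1, 1]

# Empirical class frequencies, means, and (1/N_c) variances of the data above.
priors = {0: 0.5, 1: 0.5}
means = {0: [2.0], 1: [12.0]}
variances = {0: [1.0], 1: [4.0]}
nll_mle = gnb_nll(X, y, priors, means, variances)

# Moving one mean away from the empirical mean should increase the loss.
shifted_means = {0: [2.5], 1: [12.0]}
nll_shifted = gnb_nll(X, y, priors, shifted_means, variances)

print(nll_mle, nll_shifted)
```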

## Optimizer (Closed-Form, No Iterative Solver)

Unlike discriminative models such as logistic regression or neural networks,  
Gaussian Naive Bayes does not require an iterative optimizer.  
Although the model minimizes the negative log-likelihood (NLL), the optimal parameters
have closed-form maximum likelihood solutions.

For each class $c$ and feature $j$, the maximum likelihood estimates are:

$$
\hat{\pi}_c = \frac{N_c}{N},
$$
where $N_c$ is the number of samples in class $c$.

$$
\hat{\mu}_{c,j}
= \frac{1}{N_c} \sum_{i : y_i = c} x_{i,j}.
$$

$$
\hat{\sigma}_{c,j}^2
= \frac{1}{N_c} \sum_{i : y_i = c} (x_{i,j} - \hat{\mu}_{c,j})^2.
$$

These estimates are the exact minimizers of the NLL, so training requires only computing
class counts, per-class sample means, and per-class sample variances (normalized by $N_c$,
not $N_c - 1$). No gradient descent, iterative optimization, or numerical solver is needed.
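
For a single feature, the three estimates can be reproduced with Python's `statistics` module, as in the sketch below. The data values and the `estimates` dictionary layout are illustrative assumptions.

```python
from statistics import fmean, pvariance

# Hypothetical one-feature dataset with two classes.
feature_values = [1.0, 3.0, 10.0, 14.0, 12.0]
labels = ["a", "a", "b", "b", "b"]

estimates = {}
for c in sorted(set(labels)):
    # Values of the feature restricted to class c.
    in_class = [v for v, lab in zip(feature_values, labels) if lab == c]
    estimates[c] = {
        "prior": len(in_class) / len(labels),  # N_c / N
        "mean": fmean(in_class),               # (1 / N_c) * sum of values
        "var": pvariance(in_class),            # population variance: divides by N_c
    }

print(estimates)
```

Note that `pvariance` (population variance) matches the maximum likelihood formula, whereas `statistics.variance` divides by $N_c - 1$ and would give a slightly larger value.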


## Pseudo-code

Training Gaussian Naive Bayes model  
**Require:** Training dataset $(X, y)$, smoothing parameter $\text{var\_smoothing}$  
**Ensure:** Class priors $P(c)$, means $\mu_{c,j}$, variances $\sigma_{c,j}^2$  

1:  $N \leftarrow$ number of samples in $X$, $d \leftarrow$ number of features  
2:  **for** $j = 1 \dots d$ **do**  
3:  $\bar{X}_j \leftarrow \frac{1}{N} \sum_{i=1}^N X_{i,j}$, $\;\hat{\sigma}_j^2 \leftarrow \frac{1}{N} \sum_{i=1}^N (X_{i,j} - \bar{X}_j)^2$  
4:  **end for**  
5:  $v_{\max} \leftarrow \max_{1 \le j \le d} \hat{\sigma}_j^2$  
6:  $\epsilon \leftarrow \text{var\_smoothing} \cdot v_{\max}$  

7:  $C \leftarrow \text{unique}(y)$  
8:  **for each** class $c \in C$ **do**  
9:   $S_c \leftarrow \{\, i : y_i = c \,\}$  
10:   $N_c \leftarrow |S_c|$  
11:   $P(c) \leftarrow \dfrac{N_c}{N}$  
12:   **for** $j = 1 \dots d$ **do**  
13:    $X_{c,j} \leftarrow \{\, X_{i,j} : i \in S_c \,\}$  
14:    $\mu_{c,j} \leftarrow \dfrac{1}{N_c} \sum_{i \in S_c} X_{i,j}$  
15:    $\sigma_{c,j}^2 \leftarrow \dfrac{1}{N_c} \sum_{i \in S_c} (X_{i,j} - \mu_{c,j})^2 + \epsilon$  
16:   **end for**  
17: **end for**  
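
The training procedure above could be translated into plain Python roughly as follows. This is a sketch rather than a reference implementation: the function name `fit_gaussian_nb`, the returned dictionary layout, and the default `var_smoothing=1e-9` are choices made for this example.

```python
def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Fit Gaussian Naive Bayes parameters following the pseudo-code steps."""
    n = len(X)

    # Steps 1-6: epsilon is var_smoothing times the largest feature variance,
    # where each variance is computed over the whole dataset.
    global_vars = []
    for col in zip(*X):
        col_mean = sum(col) / n
        global_vars.append(sum((v - col_mean) ** 2 for v in col) / n)
    epsilon = var_smoothing * max(global_vars)

    # Steps 7-17: per-class prior, means, and smoothed variances.
    model = {"priors": {}, "means": {}, "variances": {}, "epsilon": epsilon}
    for c in sorted(set(y)):
        rows = [x_i for x_i, y_i in zip(X, y) if y_i == c]
        n_c = len(rows)
        model["priors"][c] = n_c / n
        mu, var = [], []
        for col in zip(*rows):
            m = sum(col) / n_c
            mu.append(m)
            var.append(sum((v - m) ** 2 for v in col) / n_c + epsilon)
        model["means"][c] = mu
        model["variances"][c] = var
    return model


# Hypothetical data: the second feature is constant (7.0) within class 0.
X = [[1.0, 7.0], [3.0, 7.0], [10.0, 5.0], [14.0, 9.0]]
y = [0, 0, 1, 1]
model = fit_gaussian_nb(X, y)
print(model)
```

In this example the second feature has zero empirical variance within class 0, which would make the Gaussian log-density undefined. Adding $\epsilon$ keeps every variance strictly positive.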

---

Predicting with Gaussian Naive Bayes  
**Require:** Feature vector $x = (x_1, \dots, x_d)$, class set $C$, model parameters $P(c)$, $\mu_{c,j}$, $\sigma_{c,j}^2$  
**Ensure:** Predicted label $\hat{y}$  

1:  **for each** class $c \in C$ **do**  
2:  $\text{score}(c) \leftarrow \log P(c)$  
3:  **for** $j = 1 \dots d$ **do**  
4:   $\ell \leftarrow -\frac{1}{2}\log(2\pi\sigma_{c,j}^2) \;-\; \frac{(x_j - \mu_{c,j})^2}{2\sigma_{c,j}^2}$  
5:   $\text{score}(c) \leftarrow \text{score}(c) + \ell$  
6:  **end for**  
7: **end for**  
8:  $\hat{y} \leftarrow \arg\max_{c} \text{score}(c)$
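

The prediction procedure can be sketched in Python as shown below. The function name `predict_gaussian_nb` and the hard-coded parameters are assumptions for illustration. In practice the parameters would come from the training step.

```python
import math


def predict_gaussian_nb(x, priors, means, variances):
    """Return (predicted class, per-class log scores) for one feature vector x."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])  # log P(c)
        for x_j, mu, var in zip(x, means[c], variances[c]):
            # Per-feature Gaussian log-density, accumulated into the class score.
            score += -0.5 * math.log(2.0 * math.pi * var) - (x_j - mu) ** 2 / (2.0 * var)
        scores[c] = score
    return max(scores, key=scores.get), scores


# Hypothetical fitted parameters for two classes and two features.
priors = {"a": 0.5, "b": 0.5}
means = {"a": [2.0, 7.0], "b": [12.0, 7.0]}
variances = {"a": [1.0, 1.0], "b": [4.0, 4.0]}

label_near_a, scores_a = predict_gaussian_nb([2.5, 7.0], priors, means, variances)
label_near_b, scores_b = predict_gaussian_nb([11.0, 6.0], priors, means, variances)
print(label_near_a, label_near_b)
```

The scores are unnormalized log posteriors, which is enough for the arg max. Turning them into probabilities would additionally require subtracting the log of the evidence.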