## Overview of Gaussian Naive Bayes

Naive Bayes is a family of probabilistic classification methods built on generative modeling, where the goal is to characterize how the features and labels are jointly distributed. Instead of directly learning the conditional probability $P(y \mid X)$, the method models the joint structure by estimating the class prior distribution together with the class-conditional distributions of the features. The posterior probability used for classification is then obtained via Bayes' rule.
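
Written out for a feature vector $x$ and $K$ classes, Bayes' rule gives the posterior as the normalized product of the class prior and the class-conditional likelihood (the symbol $K$ is introduced here for illustration):

$$
P(y = c \mid x)
= \frac{P(y = c)\, P(x \mid y = c)}{\sum_{k=1}^{K} P(y = k)\, P(x \mid y = k)}.
$$

The denominator is the same for every class, so classification only needs to compare the numerators.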

At the core of Naive Bayes is the naive conditional independence assumption: given the class label, all features are treated as independent. While this assumption is often violated in real-world data, it greatly simplifies the modeling process, allowing all parameters to be estimated efficiently using simple empirical statistics computed within each class.

Different variants of Naive Bayes arise from different assumptions about the class-conditional feature distributions. For example, Multinomial and Bernoulli Naive Bayes are commonly used for text data, while Categorical Naive Bayes applies to discrete, non-numeric features. 

In this project, we focus on Gaussian Naive Bayes, which assumes that each continuous feature follows a class-conditional normal distribution. This makes GaussianNB particularly suitable for datasets with real-valued features and provides a simple yet effective generative model for continuous data.

| Variant | Feature Type | Distribution Assumption | Typical Use Case |
|--------|--------------|-------------------------|------------------|
| **Gaussian Naive Bayes** | Continuous numerical features | Gaussian (normal) distribution | Sensor data, measurements, general continuous features |
| **Multinomial Naive Bayes** | Count-based features | Multinomial distribution | Text classification, word counts (BoW, TF) |
| **Bernoulli Naive Bayes** | Binary (0/1) features | Bernoulli distribution | Word presence/absence, binary indicators |
| **Categorical Naive Bayes** | General categorical features | Categorical distribution | Discrete categorical variables (color, city, etc.) |


## Advantages and Disadvantages of Gaussian Naive Bayes

### **Advantages**

**1. Extremely fast training**  
GaussianNB uses closed-form estimates for class means and variances, requiring no iterative optimization. 

**2. Performs well on small datasets**  
Each class needs only one mean and one variance per feature, so the parameters can be estimated from relatively few samples, allowing GNB to work effectively even when data is limited.

**3. Simple and interpretable**  
The contribution of each feature to the final prediction can be clearly understood through class-wise means, variances, and log-likelihoods.

**4. Efficient in high-dimensional spaces**  
Due to the independence assumption, the model has only $O(Kd)$ parameters for $K$ classes and $d$ features, and both training and prediction costs scale linearly with the number of features, making GNB suitable for high-dimensional data.


### **Disadvantages**

**1. Independence assumption rarely holds**  
The naive assumption that features are conditionally independent is often violated in real-world data, which can degrade model performance.

**2. Sensitive to distribution mismatch**  
GNB assumes each feature follows a Gaussian distribution within each class. Strongly skewed, multimodal, or heavy-tailed distributions may lead to poor results.

**3. Limited model flexibility**  
The decision boundaries are at most quadratic in the features (and linear when all classes share the same per-feature variances), restricting the model's ability to capture more complex nonlinear structures.

**4. Assumes each feature dimension is an independent univariate Gaussian**  
GNB does not capture correlations between features (i.e., it does not model full multivariate Gaussian distributions), which can be limiting when feature interactions are important.

## Representation

Gaussian Naive Bayes is a generative model that assumes the data are generated from class-conditional Gaussian distributions with a simple independence structure.

**Class prior:**

For each class $c$, the model includes a class prior

$$
\pi_c = P(y = c),
$$

with

$$
\sum_c \pi_c = 1.
$$

These priors describe how frequent each class is in the population.

**Class-conditional feature distributions:**

Conditioned on the class label $y = c$, each feature $x_j$ is assumed to be generated independently from a univariate Gaussian:

$$
x_j \mid y = c \sim \mathcal{N}(\mu_{c,j}, \sigma_{c,j}^2),
$$

where $\mu_{c,j}$ and $\sigma_{c,j}^2$ are unknown population parameters: the true mean and variance of feature $j$ within class $c$.

Under the Naive Bayes conditional independence assumption, the class-conditional density factorizes as

$$
P(x \mid y = c)
= \prod_{j=1}^d
\mathcal{N}\!\big(x_j \mid \mu_{c,j}, \sigma_{c,j}^2\big).
$$
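
As a minimal sketch of this factorization, the log of the class-conditional density is just a sum of univariate Gaussian log-densities. The function names `gaussian_logpdf` and `class_conditional_logpdf` and the parameter values are illustrative assumptions, and plain Python is used so the arithmetic stays visible.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log-density of a univariate normal N(mean, var) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def class_conditional_logpdf(x, means, variances):
    """log P(x | y = c): per-feature log-densities add up under independence."""
    return sum(gaussian_logpdf(xj, m, v) for xj, m, v in zip(x, means, variances))

# Hypothetical parameters for one class c with d = 2 features.
means_c = [0.0, 1.0]
variances_c = [1.0, 4.0]

# Evaluate the log-density at the class mean itself.
log_density = class_conditional_logpdf([0.0, 1.0], means_c, variances_c)
```

Summing logs rather than multiplying densities gives the same quantity on a log scale and avoids numerical underflow when $d$ is large.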

**Joint model and parameter set:**

The joint distribution over features and labels is

$$
P(x, y = c)
= \pi_c \prod_{j=1}^d
\mathcal{N}\!\big(x_j \mid \mu_{c,j}, \sigma_{c,j}^2\big).
$$

The full parameter set that defines the model is

$$
\Theta
= \{\pi_c, \mu_{c,j}, \sigma_{c,j}^2\}_{c=1,\dots,K;\; j=1,\dots,d},
$$

i.e., one prior $\pi_c$ per class and one Gaussian mean/variance pair $\mu_{c,j}$, $\sigma_{c,j}^2$ for each feature within each class.
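
Because the model is generative, the joint distribution can be sampled directly: draw a class from the prior, then draw each feature independently from that class's Gaussian. The sketch below assumes made-up parameters and a hypothetical helper `sample_gnb`; it is meant only to illustrate the generative story.

```python
import math
import random

def sample_gnb(priors, means, variances, rng):
    """Draw one (x, y) pair: class from the prior, then each feature independently."""
    classes = list(priors)
    y = rng.choices(classes, weights=[priors[c] for c in classes], k=1)[0]
    # random.gauss expects a standard deviation, so take the square root of the variance.
    x = [rng.gauss(m, math.sqrt(v)) for m, v in zip(means[y], variances[y])]
    return x, y

# Hypothetical parameters: K = 2 classes, d = 2 features.
priors = {"a": 0.3, "b": 0.7}
means = {"a": [0.0, 0.0], "b": [5.0, -5.0]}
variances = {"a": [1.0, 1.0], "b": [0.25, 4.0]}

rng = random.Random(0)  # fixed seed so the draws are reproducible
samples = [sample_gnb(priors, means, variances, rng) for _ in range(2000)]
```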

## Loss

Gaussian Naive Bayes parameters are estimated by maximizing the likelihood of the training data, or equivalently, minimizing the negative log-likelihood (NLL).  
Given training samples $\{(x_i, y_i)\}_{i=1}^n$ with $x_i = (x_{i,1}, \dots, x_{i,d})$, the joint model is

$$
P(x_i, y_i)
= \pi_{y_i} \prod_{j=1}^d 
\mathcal{N}(x_{i,j} \mid \mu_{y_i,j}, \sigma_{y_i,j}^2).
$$

Taking the negative log of the likelihood over all samples yields the loss function:

$$
\mathcal{L}_{\text{NLL}}(\Theta)
=
\sum_{i=1}^n 
\left[
-\log \pi_{y_i}
+
\sum_{j=1}^d
\left(
\frac{1}{2}\log(2\pi \sigma_{y_i,j}^2)
+
\frac{(x_{i,j} - \mu_{y_i,j})^2}{2\sigma_{y_i,j}^2}
\right)
\right].
$$

This NLL loss arises directly from the Gaussian pdf and the Naive Bayes independence assumption.  
Although this is the formal loss minimized by the model, its optimal parameters have closed-form solutions, so no iterative optimizer is required.
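
The loss can be evaluated directly from its definition. The following sketch uses a hypothetical function `negative_log_likelihood` and toy data chosen so that the supplied parameters happen to be the maximum likelihood values; shifting one mean away from them increases the loss.

```python
import math

def negative_log_likelihood(X, y, priors, means, variances):
    """Sum over samples of -log pi_{y_i} plus the per-feature Gaussian NLL terms."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        total -= math.log(priors[y_i])
        for x_ij, mu, var in zip(x_i, means[y_i], variances[y_i]):
            total += 0.5 * math.log(2.0 * math.pi * var) + (x_ij - mu) ** 2 / (2.0 * var)
    return total

# Hypothetical toy data: two classes, one feature.
X = [[1.0], [3.0], [10.0], [12.0]]
y = [0, 0, 1, 1]
priors = {0: 0.5, 1: 0.5}
means = {0: [2.0], 1: [11.0]}      # per-class sample means
variances = {0: [1.0], 1: [1.0]}   # per-class variances with divisor N_c

nll_at_mle = negative_log_likelihood(X, y, priors, means, variances)
nll_shifted = negative_log_likelihood(X, y, priors, {0: [2.5], 1: [11.0]}, variances)
```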

## Optimizer (Closed-Form Solution)

Unlike discriminative models such as logistic regression or neural networks,  
Gaussian Naive Bayes does not require an iterative optimizer.  
Although the model minimizes the negative log-likelihood (NLL), the optimal parameters
have closed-form maximum likelihood solutions.

For each class $c$ and feature $j$, the maximum likelihood estimates are:

$$
\hat{\pi}_c = \frac{N_c}{N},
$$

where $N_c$ is the number of samples in class $c$ and $N$ is the total number of training samples (written $n$ in the loss above).

$$
\hat{\mu}_{c,j}
= \frac{1}{N_c} \sum_{i : y_i = c} x_{i,j}.
$$

$$
\hat{\sigma}_{c,j}^2
= \frac{1}{N_c} \sum_{i : y_i = c} (x_{i,j} - \hat{\mu}_{c,j})^2.
$$

These estimates are the exact minimizers of the NLL loss, so training requires only computing
class counts, per-class means, and per-class variances: no gradient descent, no iterative
optimization, and no numerical solver.
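
As a small worked example (the numbers are made up for illustration), suppose there are $N = 4$ training samples, of which $N_c = 2$ belong to class $c$, with feature-$j$ values $1$ and $3$:

$$
\hat{\pi}_c = \frac{2}{4} = 0.5,
\qquad
\hat{\mu}_{c,j} = \frac{1 + 3}{2} = 2,
\qquad
\hat{\sigma}_{c,j}^2 = \frac{(1-2)^2 + (3-2)^2}{2} = 1.
$$

Note the divisor $N_c$ in the variance: the maximum likelihood estimate is the biased variance, not the $N_c - 1$ sample variance.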


## Pseudo-code

Training Gaussian Naive Bayes model  
**Require:** Training dataset $(X, y)$ with $N$ samples and $d$ features; smoothing constant $\epsilon \ge 0$ (with $\epsilon = 0$ the variances are the plain MLE; a small positive value keeps them strictly positive)  
**Ensure:** Class priors $P(c)$, means $\mu_{c,j}$, variances $\sigma_{c,j}^2$  

1:  $C \leftarrow \text{unique}(y)$  
2:  **for each** class $c \in C$ **do**  
3:  $S_c \leftarrow \{\, i : y_i = c \,\}$  
4:  $N_c \leftarrow |S_c|$  
5:  $P(c) \leftarrow \frac{N_c}{N}$  
6:  **for** $j = 1 \dots d$ **do**  
7:   $\mu_{c,j} \leftarrow \frac{1}{N_c} \sum_{i \in S_c} X_{i,j}$  
8:   $\sigma_{c,j}^2 \leftarrow \frac{1}{N_c} \sum_{i \in S_c} (X_{i,j} - \mu_{c,j})^2 + \epsilon$  
9:  **end for**  
10: **end for**  
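
The training procedure can be sketched in a few lines of plain Python. The function name `fit_gaussian_nb`, the dictionary-based return format, and the default `eps=1e-9` are assumptions made for this illustration rather than a reference implementation.

```python
from collections import defaultdict

def fit_gaussian_nb(X, y, eps=1e-9):
    """Return (priors, means, variances) as dicts keyed by class label."""
    rows_by_class = defaultdict(list)
    for x_i, y_i in zip(X, y):
        rows_by_class[y_i].append(x_i)

    n_total = len(X)
    priors, means, variances = {}, {}, {}
    for c, rows in rows_by_class.items():
        n_c = len(rows)
        priors[c] = n_c / n_total
        # zip(*rows) iterates over the feature columns within class c.
        means[c] = [sum(col) / n_c for col in zip(*rows)]
        # Biased (divide-by-N_c) variance plus eps to avoid a zero variance.
        variances[c] = [
            sum((v - mu) ** 2 for v in col) / n_c + eps
            for col, mu in zip(zip(*rows), means[c])
        ]
    return priors, means, variances

# Hypothetical toy data: the second feature is constant within class "a".
X = [[1.0, 5.0], [3.0, 5.0], [10.0, 0.0], [12.0, 4.0]]
y = ["a", "a", "b", "b"]
priors, means, variances = fit_gaussian_nb(X, y)
```

In this toy data the second feature has zero empirical variance in class `"a"`, which is exactly the case the smoothing term guards against: without it, the Gaussian log-likelihood for that feature would be undefined.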

---

Predicting with Gaussian Naive Bayes  
**Require:** Input $x = (x_1, \dots, x_d)$; model parameters $P(c)$, $\mu_{c,j}$, $\sigma_{c,j}^2$  
**Ensure:** Predicted label $\hat{y}$  

1:  **for each** class $c \in C$ **do**  
2:  $\text{score}(c) \leftarrow \log P(c)$  
3:  **for** $j = 1 \dots d$ **do**  
4:   $\ell \leftarrow -\frac{1}{2}\log(2\pi\sigma_{c,j}^2) 
      \;-\; \frac{(x_j - \mu_{c,j})^2}{2\sigma_{c,j}^2}$  
5:   $\text{score}(c) \leftarrow \text{score}(c) + \ell$  
6:  **end for**  
7: **end for**  
8:  $\hat{y} \leftarrow \arg\max_{c} \text{score}(c)$
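
The prediction step can be sketched the same way. The function name `predict_gaussian_nb` and the fitted parameters below are hypothetical; the scores are unnormalized log-joint values $\log P(c) + \sum_j \log \mathcal{N}(x_j \mid \mu_{c,j}, \sigma_{c,j}^2)$, which is all that is needed for the $\arg\max$.

```python
import math

def predict_gaussian_nb(x, priors, means, variances):
    """Return (best_class, scores), where scores[c] is the log-joint score of class c."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for x_j, mu, var in zip(x, means[c], variances[c]):
            score += -0.5 * math.log(2.0 * math.pi * var) - (x_j - mu) ** 2 / (2.0 * var)
        scores[c] = score
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical fitted parameters: two classes, one feature.
priors = {"a": 0.5, "b": 0.5}
means = {"a": [2.0], "b": [11.0]}
variances = {"a": [1.0], "b": [1.0]}

label, scores = predict_gaussian_nb([3.0], priors, means, variances)
```

Working in log space keeps the scores numerically stable, since multiplying many small densities directly can underflow to zero.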