**DATA 2060 Final Project**
-

Markdown
-

# Gaussian Naive Bayes Classification (GNB)

Gaussian Naive Bayes (GNB) is a generative classification model. It assumes that the features are conditionally independent given the class label and that, within each class, every feature follows a Gaussian (normal) distribution. Unlike the multinomial or Bernoulli variants of Naive Bayes, which model discrete features, GNB models continuous features using the normal distribution. To classify an input, the model combines the per-feature probabilities with the class prior for each class and assigns the input to the class with the highest posterior probability.
Formally, the joint distribution factorizes as:

$$
P_{\theta}(\mathbf{x}, y) = P_{\theta}(y) \prod_{i=1}^{d} P_{\theta}(x_i \mid y)
$$
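
Because the model is generative, the factorization above also describes how to draw synthetic data: sample a label from the prior, then sample each feature independently given that label. The sketch below illustrates this under assumed parameters; the values of `priors`, `means`, and `stds`, and the helper `sample_joint`, are made up for illustration and are not part of the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for 2 classes and 3 features (illustrative values only).
priors = np.array([0.6, 0.4])                 # P(y)
means = np.array([[0.0, 1.0, -1.0],
                  [2.0, 0.0, 1.0]])           # mean of feature i given class y
stds = np.array([[1.0, 0.5, 2.0],
                 [1.0, 1.5, 0.5]])            # std. dev. of feature i given class y

def sample_joint(n_samples):
    """Draw (x, y) pairs: y from the prior, then each x_i independently given y."""
    y = rng.choice(len(priors), size=n_samples, p=priors)
    x = rng.normal(loc=means[y], scale=stds[y])   # shape (n_samples, n_features)
    return x, y

X_demo, y_demo = sample_joint(5)
```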


---

## Key Assumptions and Applications

Because GNB assumes normally distributed features, it is best suited to data whose features are continuous and not strongly correlated. The model has two main shortcomings: it is sensitive to outliers, which can distort the estimated means and variances, and on large, complex datasets more expressive models will typically outperform it.

---

## Model Parameter Estimation

Because Gaussian Naive Bayes is a **generative model** with simple class-conditional distributions, it does not need an iterative optimizer. Instead, it relies on the naive assumption that the features are conditionally independent and normally distributed given the class, and uses **closed-form Maximum Likelihood Estimation (MLE)** to estimate its parameters. MLE finds the parameters $ \mu_y, \sigma^2_y, P(y) $ that maximize the likelihood of the observed data, which is equivalent to minimizing the negative log-likelihood (log loss) of the joint distribution over the $ m $ training examples.
Formally, writing $ (\mathbf{x}^{(j)}, y^{(j)}) $ for the $ j $-th training example:
$$
\arg\min_{\theta} \sum_{j=1}^{m} -\log \left[ P_{\theta}(\mathbf{x}^{(j)}, y^{(j)}) \right]
$$
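
Setting the derivatives of this objective to zero gives the standard closed-form estimates. As a sketch, using notation introduced here for illustration ($ m_c $ is the number of training examples with label $ c $, and $ x^{(j)}_i $ is feature $ i $ of training example $ j $):

$$
\hat{P}(y = c) = \frac{m_c}{m}, \qquad
\hat{\mu}_{c,i} = \frac{1}{m_c} \sum_{j \,:\, y^{(j)} = c} x^{(j)}_i, \qquad
\hat{\sigma}^2_{c,i} = \frac{1}{m_c} \sum_{j \,:\, y^{(j)} = c} \left( x^{(j)}_i - \hat{\mu}_{c,i} \right)^2
$$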


### Parameters Estimated:
- **Class Priors:** $ P(y) $, the proportion of observations in each class.
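- **Feature Means:** $ \mu_y $, the mean of each feature $ x_i $ among the observations of class $ y $ (one value per feature and class).
- **Feature Variances:** $ \sigma^2_y $, the variance of each feature $ x_i $ among the observations of class $ y $ (one value per feature and class).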

> **Note:** Unlike other Naive Bayes classifiers, GNB does not use Laplace smoothing because it works with continuous features. Instead, **variance smoothing** is applied by adding a very small constant (e.g., $ 10^{-6} $) to the variance to avoid instability when variance approaches zero.
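
The estimates described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's implementation: the helper `fit_gnb`, the toy arrays, and the smoothing value `1e-6` are assumptions chosen for the example, and every class is assumed to appear at least once in the labels.

```python
import numpy as np

def fit_gnb(X, y, n_classes, var_smoothing=1e-6):
    """Closed-form MLE for priors, per-class feature means, and smoothed variances."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    priors = np.bincount(y, minlength=n_classes) / len(y)
    means = np.zeros((n_classes, X.shape[1]))
    variances = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        X_c = X[y == c]                       # rows belonging to class c (assumed non-empty)
        means[c] = X_c.mean(axis=0)
        # np.var defaults to the MLE (divide by the class count); add the smoothing term
        variances[c] = X_c.var(axis=0) + var_smoothing
    return priors, means, variances

# Toy data: the second feature is constant within class 0, so its raw variance is 0.
X_toy = np.array([[1.0, 5.0], [3.0, 5.0], [10.0, 0.0], [14.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])
priors, means, variances = fit_gnb(X_toy, y_toy, n_classes=2)
```

In the toy data, the constant feature would produce a zero variance and a division by zero in the Gaussian density; the smoothing term keeps it strictly positive.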

---

## Prediction Process

To make predictions, we use the normality assumption to evaluate the class-conditional density of each feature $ x_i $ given each class $ y $:

$$
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)
$$
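
As a quick illustration, the density above and its logarithm can be written directly in NumPy. The function names `gaussian_pdf` and `gaussian_log_pdf` are hypothetical helpers for this sketch. The last two lines hint at why working in log space is attractive: far from the mean, the raw density underflows to zero in double precision while its logarithm remains finite.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Gaussian density with the given mean and variance, evaluated at x."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def gaussian_log_pdf(x, mean, var):
    """Logarithm of the same density, computed without exponentiating."""
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

density_at_mean = gaussian_pdf(0.0, 0.0, 1.0)        # equals 1 / sqrt(2 * pi)
far_density = gaussian_pdf(40.0, 0.0, 1.0)           # underflows to 0.0 in float64
far_log_density = gaussian_log_pdf(40.0, 0.0, 1.0)   # still finite (roughly -800.9)
```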

### Steps:
1. **Compute Conditional Probabilities:** For each feature $ x_i $ and class $ y $, calculate $ P(x_i \mid y) $ using the Gaussian probability density function.
2. **Get Joint Likelihoods:** For each of the $n$ classes $ y $, compute the class-conditional likelihood of all features: $\prod_{i=1}^{d} P(x_i \mid y)$
3. **Calculate Posteriors:** Multiply the likelihoods by the priors, then take logarithms so that the product becomes a sum, which avoids numerical underflow:

   $$
   P(y \mid \mathbf{x}) \propto P(y) \prod_{i=1}^{d} P(x_i \mid y)\\
   \Rightarrow \log P(y \mid \mathbf{x}) = \log P(y) + \sum_{i=1}^{d} \log P(x_i \mid y) + C
   $$

   where $ C $ is a constant that does not depend on $ y $.

4. **Normalize Probabilities:** Convert the unnormalized posteriors into valid probabilities by normalizing them to sum to 1 across classes.
5. **Assign Class:** Select the class $ y $ with the highest posterior probability as the prediction.
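
The steps above can be sketched end to end as follows. This is one possible arrangement rather than the project's implementation: `predict_proba` is a hypothetical helper, the parameter values are invented, and the normalization subtracts the row maximum before exponentiating (the log-sum-exp trick), which is a numerical choice made for this sketch.

```python
import numpy as np

def predict_proba(X, priors, means, variances):
    """Posterior P(y | x) for each row of X under a Gaussian Naive Bayes model."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    # Step 1: per-feature log densities, shape (n_examples, n_classes, n_features)
    diff = X[:, np.newaxis, :] - means[np.newaxis, :, :]
    log_lik = -0.5 * (np.log(2.0 * np.pi * variances) + diff ** 2 / variances)
    # Steps 2-3: sum the log-likelihoods over features and add the log prior
    log_post = np.log(priors) + log_lik.sum(axis=2)
    # Step 4: normalize across classes, subtracting the row maximum for stability
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

# Hypothetical parameters: two classes centred at (0, 0) and (4, 4), unit variances.
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [4.0, 4.0]])
variances = np.ones((2, 2))

X_new = np.array([[0.5, -0.2], [3.8, 4.1], [2.0, 2.0]])
proba = predict_proba(X_new, priors, means, variances)
labels = proba.argmax(axis=1)   # Step 5: pick the most probable class
```

The third point lies exactly halfway between the two class means, so with equal priors and variances its posterior is split evenly between the classes.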

---

## Evaluation

The performance of Gaussian Naive Bayes can be evaluated using standard classification metrics, such as:
- **Accuracy**
- **Precision/Recall**
- **Log Loss**

For this explanation, **accuracy** is used as the primary evaluation metric.
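
For reference, the three metrics can be computed from predicted labels and predicted probabilities as in the sketch below. The helper names and the toy labels and probabilities are assumptions made for illustration; the log loss clips probabilities at a small `eps` to avoid taking the logarithm of zero.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return np.mean(y_true == y_pred)

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class treated as positive."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

def log_loss(y_true, proba, eps=1e-15):
    """Mean negative log probability assigned to the true class."""
    proba = np.clip(proba, eps, 1.0)
    return -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

# Toy example: true labels and hypothetical predicted class probabilities.
y_true = np.array([1, 0, 1, 1, 0])
proba = np.array([[0.2, 0.8], [0.7, 0.3], [0.6, 0.4], [0.1, 0.9], [0.4, 0.6]])
y_pred = proba.argmax(axis=1)

acc = accuracy(y_true, y_pred)
precision, recall = precision_recall(y_true, y_pred)
loss = log_loss(y_true, proba)
```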



In [None]:
import pandas as pd
import numpy as np

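class GaussianNaiveBayes(object):
    """Gaussian Naive Bayes model
    
    @attributes:
        n_classes: number of classes
            for our dataset, we will have 2 classes (heart disease and no heart disease)
        attr_dist: n_classes x n_attributes x 2 NumPy array (3-D array) holding the
            per-class mean ([..., 0]) and variance ([..., 1]) of each attribute
        label_priors: NumPy array of the priors distribution (1-D array)
        var_smoothing: small constant added to every variance for numerical stability
    """

    def __init__(self, n_classes, var_smoothing=1e-6):
        '''
        Stores the hyperparameters; all model parameters are estimated in train().
        '''
        self.n_classes = n_classes
        self.var_smoothing = var_smoothing
        self.attr_dist = None
        self.label_priors = None

    def train(self, X_train, y_train):
        '''
        Trains the model. Calculates label priors. Calculates mean, and variance for each class.
        @params:
            X_train: a 2D (n_examples x n_attributes) numpy array
            y_train: a 1D (n_examples) numpy array of integer labels in [0, n_classes)
        @return:
            None
        '''
        X_train = np.asarray(X_train, dtype=float)
        y_train = np.asarray(y_train, dtype=int)

        # Number of features
        self.n_attributes = X_train.shape[1]

        # Calculating the prior probability
        self.label_priors = np.bincount(y_train, minlength=self.n_classes) / len(y_train)

        # Per-class mean and smoothed variance of every feature (closed-form MLE)
        self.attr_dist = np.zeros((self.n_classes, self.n_attributes, 2))
        for c in range(self.n_classes):
            X_c = X_train[y_train == c]
            if len(X_c) == 0:
                # Class never observed: keep a unit variance so the density stays
                # defined; its zero prior already rules it out at prediction time.
                self.attr_dist[c, :, 1] = 1.0
                continue
            self.attr_dist[c, :, 0] = X_c.mean(axis=0)
            self.attr_dist[c, :, 1] = X_c.var(axis=0) + self.var_smoothing

    def predict(self, inputs):
        '''
        Predicts the most probable class for each example.
        @params:
            inputs: a 2D (n_examples x n_attributes) numpy array
        @return:
            a 1D (n_examples) numpy array of predicted labels
        '''
        inputs = np.atleast_2d(np.asarray(inputs, dtype=float))
        means = self.attr_dist[:, :, 0]
        variances = self.attr_dist[:, :, 1]

        # Log of the Gaussian density for every (example, class, feature) triple
        diff = inputs[:, np.newaxis, :] - means[np.newaxis, :, :]
        log_pdf = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances)

        # Unnormalized log posterior: log prior plus the summed per-feature log-likelihoods
        with np.errstate(divide='ignore'):
            log_priors = np.log(self.label_priors)
        log_posterior = log_priors + log_pdf.sum(axis=2)
        return np.argmax(log_posterior, axis=1)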

    def accuracy(self, X_test, y_test):
        '''
        Calculate accuracy (1 minus the average 0-1 loss) over predictions.
        @params:
            X_test: 2D array (n_examples x n_attributes) where each row is an example and each column is a feature/attribute
            y_test: 1D array where each entry corresponds to the label for a row in X_test
        @return:
            fraction of examples whose predicted label matches the true label
        '''
        predictions = self.predict(X_test)
        return np.mean(predictions == np.asarray(y_test))




Check Model
-

Main
-