<a href="https://www.kaggle.com/code/samithsachidanandan/gaussian-naive-bayes-from-scratch-in-python?scriptVersionId=266659747" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### Import the Necessary Libraries

In [1]:
import numpy as np 



### Defining the GaussianNB Class

In [2]:
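class GaussianNB:
    """Gaussian Naive Bayes classifier implemented with NumPy."""

    def fit(self, X, y):
        """Estimate per-class feature means, variances, and class priors."""
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        n_classes, n_features = len(self.classes_), X.shape[1]

        self.means_ = np.zeros((n_classes, n_features))
        self.variances_ = np.zeros((n_classes, n_features))
        self.priors_ = np.zeros(n_classes)

        for idx, k in enumerate(self.classes_):
            Xk = X[y == k]  # samples belonging to class k

            self.means_[idx] = Xk.mean(axis=0)
            self.variances_[idx] = Xk.var(axis=0)
            self.priors_[idx] = Xk.shape[0] / X.shape[0]
        return self

    def _log_gaussian(self, X):
        """Return an (n_samples, n_classes) array of log-likelihoods."""
        # X[:, None, :] has shape (n_samples, 1, n_features) and broadcasts
        # against means_ of shape (n_classes, n_features).
        # Note: a feature with zero variance within a class causes a
        # division by zero here.
        num = -0.5 * (X[:, None, :] - self.means_) ** 2 / self.variances_
        log_prob = num - 0.5 * np.log(2 * np.pi * self.variances_)

        # Sum over features (the naive independence assumption).
        return log_prob.sum(axis=2)

    def predict(self, X):
        """Return the most probable class label for each sample."""
        X = np.asarray(X)

        log_likelihood = self._log_gaussian(X)
        log_prior = np.log(self.priors_)

        # argmax over classes gives an index; map it back to the label.
        return self.classes_[np.argmax(log_likelihood + log_prior, axis=1)]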


        
        

#### What is happening in the GaussianNB Class

In the `fit` function, the data is first converted to NumPy arrays and all unique class labels are identified. Next, we count the unique classes and the features, and allocate arrays to store the means, variances, and priors for each class and feature. We then loop over each class label k: select all samples with label k, store the mean vector and the variance vector of the features for class k, and compute the prior probability of class k as the fraction of training samples with that label. Finally, `self` is returned so that calls can be chained.
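As a minimal sketch of what `fit` stores, the per-class statistics can be computed by hand on a tiny dataset. The four samples and their labels below are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features, 2 classes.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 10.0]])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)

# One row per class, one column per feature.
means = np.array([X[y == k].mean(axis=0) for k in classes])
variances = np.array([X[y == k].var(axis=0) for k in classes])

# Prior of class k = fraction of samples labelled k.
priors = np.array([(y == k).mean() for k in classes])

print(means)      # [[2. 3.] [6. 8.]]
print(variances)  # [[1. 1.] [1. 4.]]
print(priors)     # [0.5 0.5]
```

Note that `var(axis=0)` uses the population variance (dividing by the number of samples in the class), which is what the class above uses as well.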

In the `_log_gaussian` function, we calculate, for every sample, the log-likelihood of its features under each class using the Gaussian formula.
First, the input is reshaped so that each sample can be compared with each class mean (broadcasting). We then compute the squared difference between each sample and each class mean, divide it by the class variance, and multiply by -0.5. The next step adds the remaining term of the Gaussian log-density, -0.5 * log(2 * pi * variance), for each feature and class. Finally, the function returns the sum of these log-probabilities across all features, giving one value per sample and class.
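To make the broadcasting step concrete, here is a small sketch with made-up shapes (5 samples, 3 features, 2 classes). All values are randomly generated for illustration only. It compares the vectorized computation with an explicit loop over samples, classes, and features.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                      # 5 samples, 3 features
means = rng.normal(size=(2, 3))                  # 2 classes, 3 features
variances = rng.uniform(0.5, 2.0, size=(2, 3))   # strictly positive

# Vectorized: (5, 1, 3) - (2, 3) broadcasts to (5, 2, 3).
diff = X[:, None, :] - means
log_prob = -0.5 * diff**2 / variances - 0.5 * np.log(2 * np.pi * variances)
vectorized = log_prob.sum(axis=2)                # shape (5, 2)

# Explicit loops: sum the Gaussian log-density feature by feature.
looped = np.zeros((5, 2))
for i in range(5):
    for k in range(2):
        for j in range(3):
            looped[i, k] += (
                -0.5 * np.log(2 * np.pi * variances[k, j])
                - (X[i, j] - means[k, j]) ** 2 / (2 * variances[k, j])
            )

print(diff.shape, vectorized.shape)      # (5, 2, 3) (5, 2)
print(np.allclose(vectorized, looped))   # True
```

The two results agree, so the broadcasting form is just a faster way of writing the triple loop.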

In the `predict` function, we pick, for each input sample, the class (from `self.classes_`) with the highest log-posterior score. The first step ensures that the input is a NumPy array. We then get the log-likelihoods (feature-wise log-probabilities summed for each sample and class) and convert the class priors to log form so that they can be added. In the final step, for each sample, the log-likelihood of each class is added to its log-prior, the class with the highest sum is selected, and the actual class labels are returned, not just indices.
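The final step can be sketched with hand-picked numbers. The labels, log-likelihoods, and priors below are hypothetical and chosen so that the prior changes the decision for the second sample.

```python
import numpy as np

classes = np.array(["neg", "pos"])       # hypothetical class labels
log_likelihood = np.array([[-3.0, -6.0],
                           [-4.0, -3.5],
                           [-6.0, -2.0]])  # shape (n_samples, n_classes)
priors = np.array([0.7, 0.3])

# Adding log-priors is the log-space version of multiplying by the prior.
log_posterior = log_likelihood + np.log(priors)

# argmax returns column indices; indexing `classes` maps them to labels.
pred = classes[np.argmax(log_posterior, axis=1)]

print(np.argmax(log_likelihood, axis=1))  # [0 1 1]  (likelihood only)
print(pred)                               # ['neg' 'neg' 'pos']
```

For the second sample the likelihood alone slightly favors "pos", but the larger prior of "neg" tips the decision the other way.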




### Loading the Dataset and Libraries from scikit-learn

In [3]:
from sklearn.datasets import load_breast_cancer 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score 

### Train and Test your Model

In [4]:
X, y = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = GaussianNB().fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(accuracy_score(y_test, y_pred))

0.956140350877193
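
Because `train_test_split` shuffles the data randomly, the accuracy printed above changes from run to run. A minimal sketch of a repeatable split, assuming a fixed seed is acceptable for your experiment, is shown below. The seed value 42 is an arbitrary choice, and `stratify=y` is an optional addition that keeps the class proportions similar in both subsets.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A second call with the same seed yields exactly the same test set.
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape[0], X_test.shape[0])     # 455 114
print(np.array_equal(X_test, X_test_again))  # True
```

With the split fixed, any change in accuracy reflects a change in the model rather than in the sampled test set.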
