# Gaussian Naive Bayes
Gaussian Naive Bayes is a variant of the Naive Bayes algorithm that assumes each continuous feature follows a Gaussian (normal) distribution within each class. It is used for classification and works best when the per-class feature distributions are approximately normal.
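
To make the Gaussian assumption concrete, here is a minimal from-scratch sketch of the prediction rule: for each class, add the log prior to the sum of per-feature Gaussian log-densities and pick the class with the highest total. It uses scikit-learn's bundled iris data purely as an illustration (an assumption, not this notebook's dataset), and the names `eps`, `log_post` and `manual_pred` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Smoothing term: a fraction of the largest feature variance is added to
# every variance, mirroring how scikit-learn describes var_smoothing.
eps = 1e-9 * X.var(axis=0).max()

log_post = []
for c in classes:
    Xc = X[y == c]
    mean = Xc.mean(axis=0)           # per-feature mean for this class
    var = Xc.var(axis=0) + eps       # per-feature variance for this class
    log_prior = np.log(len(Xc) / len(X))
    # Sum of per-feature Gaussian log-densities (the "naive" independence step)
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var, axis=1)
    log_post.append(log_prior + log_lik)

# Pick the class with the highest log posterior for every sample
manual_pred = classes[np.argmax(np.array(log_post), axis=0)]
sklearn_pred = GaussianNB().fit(X, y).predict(X)

agreement = np.mean(manual_pred == sklearn_pred)
print("Agreement with GaussianNB:", agreement)
```

The manual predictions should match `GaussianNB` on essentially every sample, which shows that the model is nothing more than per-class means, variances and priors.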

## Advantages
- Simplicity: Easy to implement and understand.
- Efficiency: Fast to train and make predictions, especially on large datasets.
- Scalability: Works well with a large number of features.
- Good Performance with Few Data: Performs well even with small training datasets.

## Disadvantages
- Assumption of Normality: Assumes that the features follow a Gaussian distribution, which may not always be true.
- Independence Assumption: Assumes that the features are independent of each other, which is often not the case in real-world data.
- Sensitivity to Irrelevant Features: Performance can degrade with irrelevant or highly correlated features.

## Use Cases
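- Text Classification: Spam detection, sentiment analysis (when documents are represented by continuous features; count-based features are usually handled with Multinomial Naive Bayes instead).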
- Medical Diagnosis: Disease prediction based on symptoms.
- Document Categorization: Classifying documents into categories like sports, politics, etc.
- Real-time Prediction: Applications that need low-latency predictions, where its computational efficiency is an advantage.

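## Scaling (not needed)
Gaussian Naive Bayes does not require feature scaling. The model estimates a separate mean and variance for every feature within each class, so a linear rescaling of a feature changes those estimates but leaves the class posteriors, and therefore the predictions, essentially unchanged. The only scale-dependent part is `var_smoothing`, which adds a fraction of the largest feature variance to all variances.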
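
A quick way to check this is to fit the same model on raw and on standardized features and compare the predictions. The sketch below uses scikit-learn's bundled iris data as a stand-in (an assumption, not this notebook's dataset); `pred_raw`, `pred_scaled` and `same` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Fit and predict on the raw features
pred_raw = GaussianNB().fit(X, y).predict(X)

# Fit and predict on standardized features (zero mean, unit variance)
X_scaled = StandardScaler().fit_transform(X)
pred_scaled = GaussianNB().fit(X_scaled, y).predict(X_scaled)

# Fraction of samples that receive the same label in both settings
same = np.mean(pred_raw == pred_scaled)
print("Fraction of identical predictions:", same)
```

With the default `var_smoothing` the two sets of predictions should agree. With large `var_smoothing` values and features on very different scales, the smoothing term can dominate the small-scale features, so scaling may start to matter in that regime.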

## Encoding (necessary)
Categorical features must be encoded as numerical values, because Gaussian Naive Bayes computes a mean and a variance for every feature.
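
One possible way to do the encoding is a `ColumnTransformer` that one-hot encodes the categorical columns and passes numeric columns through, wrapped in a `Pipeline` with the classifier. This is only a sketch: the toy DataFrame and its column names (`age`, `smoker`, `label`) are hypothetical and not part of this notebook's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "age": [25, 47, 35, 52, 23, 44, 36, 58],
    "smoker": ["no", "yes", "no", "yes", "no", "yes", "no", "yes"],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})
X_toy = toy[["age", "smoker"]]
y_toy = toy["label"]

preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["smoker"])],
    remainder="passthrough",  # keep the numeric columns unchanged
    sparse_threshold=0,       # force dense output; GaussianNB rejects sparse input
)

# Encoding and classification in a single estimator
clf = Pipeline([("preprocess", preprocess), ("gnb", GaussianNB())])
clf.fit(X_toy, y_toy)

toy_pred = clf.predict(X_toy)
print(toy_pred)
```

Keeping the encoder inside the pipeline means it is refit on each training fold during cross-validation, so nothing from the validation fold leaks into the encoding.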

# Import libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from scipy.stats import uniform

# Read data

In [2]:
# Load the dataset and separate the features from the target column
df = pd.read_csv('Breast_Cancer.csv')
x = df.drop('diagnosis', axis=1)
y = df['diagnosis']

In [3]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Train

## Grid Search

In [4]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV

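gnb = GaussianNB()

# Candidate values for var_smoothing: 10 points spaced logarithmically from 1e-9 to 1
params = {
    'var_smoothing': np.logspace(-9, 0, 10)
}

# 5-fold cross-validated grid search, scored by accuracy
grid_search = GridSearchCV(gnb, params, scoring='accuracy', cv=5, n_jobs=-1)

# Train the grid search
grid_search.fit(x_train, y_train)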

In [5]:
print("Best Hyperparameter Index:", grid_search.best_index_)
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validated Score:", grid_search.best_score_)

Best Hyperparameter Index: 0
Best Hyperparameters: {'var_smoothing': 1e-09}
Best Cross-Validated Score: 0.9340659340659341


In [6]:
# Get the model with best hyperparameters
model = grid_search.best_estimator_
y_pred = model.predict(x_test)
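
The predictions are computed above but not scored. A minimal evaluation sketch follows, assuming scikit-learn's bundled breast cancer dataset as a stand-in because `Breast_Cancer.csv` is not available here (the two are not necessarily identical); `X_demo`, `demo_model`, `acc` and `cm` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Stand-in data: scikit-learn's bundled breast cancer dataset (an assumption)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = GaussianNB().fit(X_tr, y_tr)
demo_pred = demo_model.predict(X_te)

# Overall accuracy, per-class error counts, and precision/recall per class
acc = accuracy_score(y_te, demo_pred)
cm = confusion_matrix(y_te, demo_pred)
print("Test accuracy:", acc)
print(cm)
print(classification_report(y_te, demo_pred))
```

In this notebook the same calls would take `y_test` and `y_pred`, for example `accuracy_score(y_test, y_pred)`.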

## Randomized Search

In [7]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import RandomizedSearchCV

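gnb = GaussianNB()

# scipy's uniform(loc, scale) samples from [loc, loc + scale],
# i.e. roughly [1e-9, 0.1] on a linear scale
params = {
    'var_smoothing': uniform(1e-9, 1e-1)
}

# Try 40 random draws with 5-fold cross-validation, scored by accuracy
random_search = RandomizedSearchCV(gnb, params, scoring='accuracy', n_iter=40, cv=5, n_jobs=-1, random_state=42)

# Train the random search
random_search.fit(x_train, y_train)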

In [8]:
print("Best Hyperparameter Index:", random_search.best_index_)
print("Best Hyperparameters:", random_search.best_params_)
print("Best Cross-Validated Score:", random_search.best_score_)

Best Hyperparameter Index: 6
Best Hyperparameters: {'var_smoothing': 0.005808362216819947}
Best Cross-Validated Score: 0.8967032967032967
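
The random search scored lower than the grid search (about 0.897 versus 0.934), and the sampling distribution is a likely reason: `uniform(1e-9, 1e-1)` draws on a linear scale, so almost all candidates land above 1e-3 and the very small values that the grid search preferred are practically never tried. A log-uniform distribution spreads the draws evenly across orders of magnitude. The sketch below uses `scipy.stats.loguniform` on scikit-learn's bundled breast cancer dataset as a stand-in (an assumption; results on `Breast_Cancer.csv` may differ), and `log_params`, `log_search` and `best_vs` are illustrative names.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import GaussianNB

# Stand-in data: scikit-learn's bundled breast cancer dataset (an assumption)
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Sample var_smoothing evenly across orders of magnitude between 1e-9 and 1
log_params = {"var_smoothing": loguniform(1e-9, 1e0)}

log_search = RandomizedSearchCV(
    GaussianNB(),
    log_params,
    scoring="accuracy",
    n_iter=20,
    cv=5,
    random_state=42,
)
log_search.fit(X_demo, y_demo)

best_vs = log_search.best_params_["var_smoothing"]
best_score = log_search.best_score_
print("Best var_smoothing:", best_vs)
print("Best cross-validated score:", best_score)
```

This covers the same range as the `np.logspace(-9, 0, 10)` grid, but without being restricted to ten fixed values.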


In [9]:
# Get the model with best hyperparameters
model = random_search.best_estimator_
y_pred = model.predict(x_test)

## Train GaussianNB without search

In [11]:
from sklearn.naive_bayes import GaussianNB
# 1e-9 is scikit-learn's default var_smoothing and the value the grid search selected
model = GaussianNB(var_smoothing=1e-9)
model.fit(x_train, y_train)