Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Q3. How does Bernoulli Naive Bayes handle missing values?

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Q5. Assignment:

Data preparation:

Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

Implementation:

Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

Results:

Report the following performance metrics for each classifier:
- Accuracy
- Precision
- Recall
- F1 score

Discussion:

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

Conclusion:

Summarise your findings and provide some suggestions for future work.

### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?


Defining the events:

- A: Employee uses the health insurance plan.
- B: Employee is a smoker.

We are given the following probabilities:
- P(A) = 0.70 (probability that an employee uses the health insurance plan)
- P(B|A) = 0.40 (probability that an employee is a smoker given that he/she uses the health insurance plan)

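The quantity asked for, P(B|A), is stated directly in the problem: 40% of the employees who use the plan are smokers. No further assumption is needed, so

P(B|A) = 0.40

As a consistency check, the definition of conditional probability gives the same value. The fraction of all employees who both use the plan and smoke is

P(A and B) = P(B|A) * P(A) = 0.40 * 0.70 = 0.28

and therefore

P(B|A) = P(A and B) / P(A) = 0.28 / 0.70 = 0.40

Bayes' theorem, P(B|A) = (P(A|B) * P(B)) / P(A), is not required here. It could not be applied anyway, because neither P(B) nor P(A|B) is given.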
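
The arithmetic for Q1 can be checked in a few lines of Python. This is only a sketch; the variable names are illustrative and the two input values are the ones stated in the question.

```python
# Probabilities stated in the question.
p_plan = 0.70                # P(A): employee uses the health insurance plan
p_smoker_given_plan = 0.40   # P(B|A): smoker among plan users

# Joint probability that an employee uses the plan and smokes.
p_plan_and_smoker = p_smoker_given_plan * p_plan

# Dividing the joint probability by P(A) recovers the conditional probability.
recovered = p_plan_and_smoker / p_plan

print(round(p_plan_and_smoker, 2))
print(round(recovered, 2))
```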

### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?


1. Feature representation:
- Bernoulli Naive Bayes: It assumes that features are binary variables, meaning they take on values of either 0 or 1. 
- Multinomial Naive Bayes: It assumes that features are discrete variables that represent the frequencies or counts of occurrences. 

2. Data type:
- Bernoulli Naive Bayes: It is commonly used when dealing with binary or Boolean features. 
- Multinomial Naive Bayes: It is commonly used for text classification tasks, where the features represent word frequencies or discrete counts. It is suitable for datasets where features have multiple possible values and are represented as discrete frequencies.

3. Modeling approach:
- Bernoulli Naive Bayes: It models each feature with a Bernoulli distribution, so the likelihood includes a term for every feature, whether it is present (1) or absent (0). The absence of a feature therefore counts as evidence.
- Multinomial Naive Bayes: It models the feature counts with a Multinomial distribution, so only features with non-zero counts contribute to the likelihood, and a feature that occurs several times contributes proportionally more.

Both variants assume that features are conditionally independent given the class label.
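
The difference in how the two variants score a sample can be sketched in plain Python. The per-class parameters below (`p_present`, `p_word`) and the three-word vocabulary are made-up values for illustration, not estimates from any dataset, and the functions are a from-scratch sketch rather than scikit-learn's implementation.

```python
import math

# Hypothetical parameters of one class for a three-word vocabulary.
# p_present[j]: Bernoulli probability that word j appears at least once.
# p_word[j]: multinomial probability that a single token is word j (sums to 1).
p_present = [0.8, 0.5, 0.1]
p_word = [0.6, 0.3, 0.1]

counts = [3, 0, 1]  # word counts in one document


def bernoulli_log_likelihood(counts, p_present):
    """Every word contributes: presence uses p, absence uses (1 - p)."""
    total = 0.0
    for count, p in zip(counts, p_present):
        present = count > 0  # counts are reduced to presence/absence
        total += math.log(p if present else 1.0 - p)
    return total


def multinomial_log_likelihood(counts, p_word):
    """Only words with non-zero counts contribute, weighted by their count.

    The multinomial coefficient is omitted because it is identical for
    every class and does not change which class scores highest.
    """
    return sum(c * math.log(p) for c, p in zip(counts, p_word) if c > 0)


print(bernoulli_log_likelihood(counts, p_present))
print(multinomial_log_likelihood(counts, p_word))
```

Doubling every count leaves the Bernoulli score unchanged, because only presence matters, while the multinomial score doubles.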


### Q3. How does Bernoulli Naive Bayes handle missing values?


Bernoulli Naive Bayes does not handle missing values directly. It assumes that every feature is a binary variable taking the value 0 or 1, so a missing entry has no valid representation and must be resolved before training or prediction. scikit-learn's BernoulliNB, for example, raises an error when its input contains NaN.

In practice, there are a few approaches to handling missing values in the context of Bernoulli Naive Bayes:

1. Dropping samples: One option is to remove samples that have missing values for any feature. This approach, however, may lead to a significant loss of data and potential bias in the remaining dataset.

2. Imputation: Another approach is to impute missing values with a specific value, such as the mode or most frequent value of the feature. This allows for retaining the samples with missing values while still providing a binary representation for the feature.

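3. Treating missingness as information: Adding a third value (e.g., 2) to the feature would break the binary assumption; scikit-learn's BernoulliNB, with its default binarize threshold of 0.0, would simply map such a value to 1. Instead, impute the original feature and add a separate binary indicator feature that is 1 when the original value was missing and 0 otherwise. This keeps every feature binary while letting the model learn from the fact that a value was absent.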
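
Approaches 2 and 3 can be combined, as in the following minimal sketch. It assumes a binary column in which `None` marks a missing entry and at least one value is observed; the helper name `impute_with_indicator` is hypothetical.

```python
from collections import Counter


def impute_with_indicator(column):
    """Return (imputed, was_missing) for a binary column.

    Missing entries (None) are replaced by the most frequent observed value,
    and a parallel 0/1 list records where the value was originally missing.
    """
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    imputed = [mode if v is None else v for v in column]
    was_missing = [1 if v is None else 0 for v in column]
    return imputed, was_missing


column = [1, 0, None, 1, None, 1]
imputed, was_missing = impute_with_indicator(column)
print(imputed)
print(was_missing)
```

For array inputs, scikit-learn's `SimpleImputer` with `strategy="most_frequent"` and `add_indicator=True` offers comparable behaviour.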

### Q4. Can Gaussian Naive Bayes be used for multi-class classification?


Yes. Gaussian Naive Bayes is inherently a multi-class classifier and needs no "one-vs-rest" or "one-vs-all" wrapper. For each class, it estimates a prior probability and, for each feature, a class-specific mean and variance.

At prediction time, it computes the posterior probability (up to a constant) of every class as the prior times the product of the per-feature Gaussian densities, and assigns the sample to the class with the highest value. Binary classification is simply the special case with two classes, and scikit-learn's GaussianNB accepts any number of classes without further configuration.

The Gaussian Naive Bayes algorithm assumes that the features follow a Gaussian (normal) distribution. When applied to multi-class classification, it assumes that the features are conditionally independent given the class label and that each feature follows a Gaussian distribution for each class.

While Gaussian Naive Bayes has some simplifying assumptions and may not capture complex dependencies between features, it can still perform reasonably well in practice, especially when the class distributions and feature distributions are approximately Gaussian.
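
A from-scratch sketch with three classes shows that nothing beyond an argmax over the class posteriors is needed. The data points and class labels are made up for illustration, the function names are hypothetical, and the small constant added to the variance is an assumption of this sketch rather than a description of scikit-learn's internals.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The small constant keeps each variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(params):
        prior, means, variances = params
        score = math.log(prior)
        for value, mean, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return score

    return max(model, key=lambda label: log_posterior(model[label]))


# Made-up two-feature data with three well-separated classes.
X = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
     [5.0, 5.2], [5.1, 4.9], [4.9, 5.0],
     [9.0, 1.0], [9.2, 1.1], [8.9, 0.9]]
y = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

model = fit_gaussian_nb(X, y)
for point in ([1.0, 1.0], [5.0, 5.0], [9.0, 1.0]):
    print(point, predict(model, point))
```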

### Q5. Assignment:

In [1]:
## importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv('/Users/aakanksha/My_Codes/data-science-master-course/data/spambase.data', header=None)

In [2]:
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278,1
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,1
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,1
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,1
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,0
4597,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,0
4598,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,0
4599,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,0


In [3]:
# Split the data into features (X) and labels (y)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

## preprocessing the data
# No scaling is applied: MultinomialNB requires non-negative features, and
# StandardScaler would produce negative values. All three classifiers are
# therefore evaluated on the raw features.

## importing all three Naive Bayes classifiers from scikit-learn and instantiating them with default hyperparameters
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
bernoulli_nb = BernoulliNB()
gaussian_nb = GaussianNB()
multinomial_nb = MultinomialNB()

# Perform cross-validation and evaluate performance metrics
from sklearn.model_selection import cross_validate
scoring = ['accuracy', 'precision', 'recall', 'f1']
cv = 10

# performing 10-fold CV for all three Naive Bayes classifiers
bernoulli_score = cross_validate(bernoulli_nb, X, y, cv=cv, scoring=scoring)
gaussian_score = cross_validate(gaussian_nb, X, y, cv=cv, scoring=scoring)
multinomial_score = cross_validate(multinomial_nb, X, y, cv=cv, scoring=scoring)

# Calculate mean scores for each performance metric
mean_bernoulli_scores = {metric: round(bernoulli_score[f"test_{metric}"].mean(), 4) for metric in scoring}
mean_multinomial_scores = {metric: round(multinomial_score[f"test_{metric}"].mean(), 4) for metric in scoring}
mean_gaussian_scores = {metric: round(gaussian_score[f"test_{metric}"].mean(), 4) for metric in scoring}

# Display the performance metrics for each classifier
print("Bernoulli Naive Bayes Performance:")
print(mean_bernoulli_scores)
print("\nMultinomial Naive Bayes Performance:")
print(mean_multinomial_scores)
print("\nGaussian Naive Bayes Performance:")
print(mean_gaussian_scores)

Bernoulli Naive Bayes Performance:
{'accuracy': 0.8839, 'precision': 0.887, 'recall': 0.8152, 'f1': 0.8481}

Multinomial Naive Bayes Performance:
{'accuracy': 0.7863, 'precision': 0.7393, 'recall': 0.7215, 'f1': 0.7283}

Gaussian Naive Bayes Performance:
{'accuracy': 0.8218, 'precision': 0.7104, 'recall': 0.957, 'f1': 0.8131}
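
Each reported F1 value is the mean of the ten per-fold F1 scores, so it need not equal the harmonic mean of the mean precision and mean recall printed next to it. The short sketch below uses the Gaussian Naive Bayes figures above to make the gap visible; the helper name `f1_from` is illustrative.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Mean precision and mean recall reported for Gaussian Naive Bayes.
f1_of_means = f1_from(0.7104, 0.957)
reported_mean_f1 = 0.8131

print(round(f1_of_means, 4))
print(round(f1_of_means - reported_mean_f1, 4))
```

The two numbers are close but not identical, which is expected when per-fold precision and recall vary.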
