# Assignment no 64 (Naive Bayes Implementation) (10.4.23)

#### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?


**Ans -** Let's denote the event that an employee uses the company's health insurance plan as `H` and the event that an employee is a smoker as `S`. From the information given, we know that:

P(H) = 0.7 (the probability that an employee uses the company's health insurance plan)

P(S|H) = 0.4 (the probability that an employee is a smoker given that he/she uses the company's health insurance plan)

The question asks for exactly this conditional probability, so no further calculation is needed: P(S|H) = 0.4, or 40%. The value P(H) = 0.7 is not required for the answer.
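As a consistency check, the two given figures can be combined into the joint probability and then divided back out. This is only a sketch of the definition of conditional probability applied to the numbers in the question; the joint value 0.28 is derived here and is not stated in the survey.

$$
P(S \cap H) = P(S \mid H)\,P(H) = 0.4 \times 0.7 = 0.28
$$

$$
P(S \mid H) = \frac{P(S \cap H)}{P(H)} = \frac{0.28}{0.7} = 0.4
$$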

#### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

**Ans -** Bernoulli Naive Bayes and Multinomial Naive Bayes are two types of Naive Bayes classifiers that differ in the way they model the distribution of the data.

Bernoulli Naive Bayes is used when the features are binary (e.g., present or absent); each feature is assumed to follow a Bernoulli distribution within each class. The likelihood of the data given the class is the product of these per-feature Bernoulli terms, where the probability of a feature being present in a class is estimated from the data. Unlike the multinomial model, the absence of a feature also contributes to the likelihood.

Multinomial Naive Bayes, on the other hand, is used when the features are counts or frequencies, such as how many times each word occurs in a document. In this type of classifier, the likelihood of the data given the class is modeled using a multinomial distribution, where the probability of each feature (for example, each word) occurring in a class is estimated from the data.

In summary, Bernoulli Naive Bayes is used for binary data, while Multinomial Naive Bayes is used for count data.
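The difference can be seen on a small example. The sketch below fits both classifiers on a hypothetical toy count matrix (the data and variable names are made up for illustration) and compares the per-class feature probabilities each model learns. It relies on scikit-learn's default `binarize=0.0` in `BernoulliNB`, which turns any positive count into 1.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy data: rows are documents, columns are counts of three words.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [4, 1, 0],
    [0, 2, 3],
    [0, 3, 1],
    [1, 4, 2],
])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# MultinomialNB uses the counts themselves.
mnb = MultinomialNB().fit(X_counts, y_toy)

# BernoulliNB binarizes its input first (binarize=0.0 by default),
# so it only sees whether each word is present or absent.
bnb = BernoulliNB().fit(X_counts, y_toy)

# feature_log_prob_ has shape (n_classes, n_features) in both models.
mnb_probs = np.exp(mnb.feature_log_prob_)
bnb_probs = np.exp(bnb.feature_log_prob_)

print("Multinomial, per-class word probabilities:")
print(mnb_probs)
print("Bernoulli, per-class presence probabilities:")
print(bnb_probs)
```

In the multinomial model each row is a distribution over the features and sums to 1. In the Bernoulli model each entry is an independent probability of presence, so the rows need not sum to 1.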

#### Q3. How does Bernoulli Naive Bayes handle missing values?

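**Ans -** In general, there are two common ways to handle missing values when training a Naive Bayes classifier: omit every record that contains a missing value, or keep the record and leave only the missing attributes out of the likelihood computation. The second option is possible because Naive Bayes treats the features as conditionally independent given the class, so the term for a missing feature can be dropped on its own.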

In the case of Bernoulli Naive Bayes, where each feature is binary, a missing feature can also be treated as a "0" value, that is, as if the feature were not present. This is a modeling choice: it conflates "unknown" with "absent", so it is reasonable only when a missing value most likely means absence.

It's important to note that the scikit-learn Naive Bayes estimators, including `BernoulliNB`, do not accept missing values (NaN) in their input, for either training or prediction.

You therefore need to either handle the missing values with one of the approaches above before training the model, or implement your own version of Naive Bayes that can skip missing attributes.
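A minimal sketch of the "treat missing as 0" approach is shown below. It assumes hypothetical binary data with NaN entries and uses scikit-learn's `SimpleImputer` with a constant fill value inside a pipeline, so that the same replacement is applied at training and prediction time. This is one possible preprocessing choice, not something `BernoulliNB` does on its own.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features; np.nan marks a missing entry.
X_train = np.array([
    [1, 0, np.nan],
    [1, 1, 0],
    [0, np.nan, 1],
    [0, 0, 1],
], dtype=float)
y_train = np.array([1, 1, 0, 0])

# Replace every missing value with 0 ("feature not present"),
# then fit Bernoulli Naive Bayes on the completed data.
model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    BernoulliNB(),
)
model.fit(X_train, y_train)

# The pipeline applies the same imputation to new samples.
X_new = np.array([[1, np.nan, 0]])
pred = model.predict(X_new)
proba = model.predict_proba(X_new)
print(pred, proba)
```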

#### Q4. Can Gaussian Naive Bayes be used for multi-class classification?

**Ans -** Yes, Gaussian Naive Bayes can be used for multi-class classification. In a multi-class classification problem, the goal is to assign an instance to one of several possible classes. Gaussian Naive Bayes can be used to model the likelihood of the data given each class, and then use Bayes' theorem to calculate the posterior probabilities of each class given the data. The class with the highest posterior probability is chosen as the predicted class for the instance.
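As an illustration, the sketch below fits `GaussianNB` on the three-class Iris dataset bundled with scikit-learn (chosen here only as a convenient example; it is not part of the assignment) and shows that the predicted class is the one with the highest posterior probability.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has three classes, labelled 0, 1 and 2.
X_iris, y_iris = load_iris(return_X_y=True)

gnb = GaussianNB().fit(X_iris, y_iris)

# One posterior probability per class for each sample.
proba_iris = gnb.predict_proba(X_iris[:5])
pred_iris = gnb.predict(X_iris[:5])

print(gnb.classes_)
print(proba_iris.round(3))
print(pred_iris)
```

No special configuration is needed: the same estimator handles binary and multi-class targets.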

#### Q5. Assignment:
    
**Data preparation:**
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

**Implementation:**
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

**Results:**
Report the following performance metrics for each classifier:

- Accuracy
- Precision
- Recall
- F1 score

**Discussion:**
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

**Conclusion:**
Summarise your findings and provide some suggestions for future work.

In [1]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_validate

In [4]:
# Load the data
data = pd.read_csv('spambase.data', header=None)
data.head()

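|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |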


In [5]:
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

In [6]:
# Define the classifiers
classifiers = {
    'BernoulliNB': BernoulliNB(),
    'MultinomialNB': MultinomialNB(),
    'GaussianNB': GaussianNB()
}

In [11]:
# Evaluate the classifiers using 10-fold cross-validation
for name, clf in classifiers.items():
    cv_results = cross_validate(clf, X, y, cv=10, scoring=['accuracy', 'precision', 'recall', 'f1'])
    print(f'{name}:')
    print(f"  Accuracy: {np.mean(cv_results['test_accuracy'])}")
    print(f"  Precision: {np.mean(cv_results['test_precision'])}")
    print(f"  Recall: {np.mean(cv_results['test_recall'])}")
    print(f"  F1 score: {np.mean(cv_results['test_f1'])}")
    print("")

BernoulliNB:
  Accuracy: 0.8839380364047911
  Precision: 0.8869617393737383
  Recall: 0.8152389047416673
  F1 score: 0.8481249015095276

MultinomialNB:
  Accuracy: 0.7863496180326323
  Precision: 0.7393175533565436
  Recall: 0.7214983911116508
  F1 score: 0.7282909724016348

GaussianNB:
  Accuracy: 0.8217730830896915
  Precision: 0.7103733928118492
  Recall: 0.9569516119239877
  F1 score: 0.8130660909542995
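For the discussion it can help to see the metrics side by side. The sketch below collects the mean cross-validated scores into one pandas table. To stay self-contained it runs on synthetic stand-in data from `make_classification` (made non-negative because `MultinomialNB` rejects negative features), so its numbers are not comparable to the Spambase results above. The explicit shuffled `StratifiedKFold` is an assumption added here; the notebook itself passes `cv=10`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic stand-in data (NOT Spambase); abs() keeps features non-negative.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_demo = np.abs(X_demo)

demo_classifiers = {
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
    "GaussianNB": GaussianNB(),
}
scoring = ["accuracy", "precision", "recall", "f1"]

# Shuffled, stratified 10-fold split with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

rows = {}
for name, clf in demo_classifiers.items():
    res = cross_validate(clf, X_demo, y_demo, cv=cv, scoring=scoring)
    # Average each metric over the 10 folds.
    rows[name] = {metric: res[f"test_{metric}"].mean() for metric in scoring}

# One row per classifier, one column per metric.
summary = pd.DataFrame(rows).T
print(summary.round(3))
```

To apply this to the assignment data, `X_demo` and `y_demo` would be replaced by the `X` and `y` arrays loaded from `spambase.data`.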

