# Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

## Ans. :

The question asks for a conditional probability that the problem statement already supplies. Let event A be that an employee is a smoker, and event B be that the employee uses the company's health insurance plan. We are given that:

* __P(B) = 0.70__ (70% of employees use the health insurance plan)
* __P(A|B) = 0.40__ (40% of the employees who use the plan are smokers)

The second figure is conditioned on plan use, so it is P(A|B) itself, not P(B|A). We can confirm this with the definition of conditional probability:

__P(A|B) = P(A and B) / P(B)__

Here P(A and B) = P(A|B) * P(B) = 0.40 * 0.70 = 0.28, so:

__P(A|B) = 0.28 / 0.70 = 0.40__

Bayes' theorem, __P(A|B) = P(B|A) * P(A) / P(B)__, would be needed only if we were given P(B|A) and P(A) instead. Neither is provided here, and neither is required.

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.40, or 40%.
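As a sanity check on the definition P(A|B) = P(A and B) / P(B), the percentages from the question can be turned into head counts. The company size of 1000 below is a hypothetical number chosen only to make the arithmetic concrete; any size gives the same ratio.

```python
from fractions import Fraction

# Hypothetical head count, chosen only to make the percentages concrete.
total_employees = 1000
plan_users = total_employees * 70 // 100        # 70% use the plan -> 700
smokers_on_plan = plan_users * 40 // 100        # 40% of plan users smoke -> 280

p_b = Fraction(plan_users, total_employees)             # P(B)
p_a_and_b = Fraction(smokers_on_plan, total_employees)  # P(A and B)
p_a_given_b = p_a_and_b / p_b                           # P(A|B)

print(float(p_a_given_b))  # 0.4
```

The overall share of plan users cancels out, which is why the 70% figure does not change the answer.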

# Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

## Ans. :

Bernoulli Naive Bayes and Multinomial Naive Bayes are two common variants of the Naive Bayes algorithm used for classification tasks. The main difference between these two variants lies in the type of data they are best suited to handle.

Bernoulli Naive Bayes is used when the input data consists of binary features (i.e., features that take on only two values, such as 0 and 1). The restriction applies to the features, not to the target: the number of classes can be two or more. For example, in spam filtering, each feature might represent the presence or absence of a certain word in an email. Each feature is modeled as a Bernoulli variable whose probability of being present depends on the class, and, as in every Naive Bayes variant, the features are assumed to be conditionally independent given the class. Because the model assigns a probability to absence as well as presence, a word that does not occur in a document still counts as evidence.

Multinomial Naive Bayes, on the other hand, is typically used for problems where the input data consists of discrete counts. This is often the case in text classification, where each feature represents the frequency of a word in a document. Here the vector of counts for a document is modeled as a draw from a multinomial distribution whose parameters depend on the class, so a word that occurs three times contributes three times as much evidence as a word that occurs once, and a word that does not occur contributes nothing.

To summarize, Bernoulli Naive Bayes is best suited for classification problems with binary (presence/absence) features, while Multinomial Naive Bayes is best suited for classification problems where the input data consists of discrete counts, such as word frequencies in text classification.
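The difference shows up in how each variant scores the same document for one class. The sketch below is a minimal illustration, not a full classifier: the three-word vocabulary, the probabilities `p_spam` and `theta_spam`, and both function names are hypothetical values invented for the example.

```python
import math

def bernoulli_log_likelihood(presence, p):
    """log P(x | class) when feature i is an independent Bernoulli(p[i])."""
    return sum(
        math.log(p_i) if x_i else math.log(1.0 - p_i)
        for x_i, p_i in zip(presence, p)
    )

def multinomial_log_likelihood(counts, theta):
    """log P(counts | class), dropping the multinomial coefficient,
    which is the same for every class and so does not affect the argmax."""
    return sum(c * math.log(t) for c, t in zip(counts, theta))

counts = [3, 0, 1]                                 # word counts in one document
presence = [1 if c > 0 else 0 for c in counts]     # binarized view: [1, 0, 1]

p_spam = [0.8, 0.1, 0.5]       # hypothetical P(word present | spam)
theta_spam = [0.6, 0.1, 0.3]   # hypothetical per-token word probabilities (sum to 1)

b = bernoulli_log_likelihood(presence, p_spam)
m = multinomial_log_likelihood(counts, theta_spam)
print(b, m)
```

In the Bernoulli score the absent second word contributes `log(1 - 0.1)`, while in the multinomial score it contributes nothing and the first word's count of 3 multiplies its term.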

# Q3. How does Bernoulli Naive Bayes handle missing values?

## Ans. :

Bernoulli Naive Bayes is a probabilistic classifier for binary features, and the model itself has no mechanism for missing values: every feature must be observed as either 0 or 1. How missing entries are handled is therefore a preprocessing decision. scikit-learn's `BernoulliNB`, for instance, raises an error on input that contains NaN.

A common convention is to encode a missing value as 0, i.e., to treat it as the absence of the feature. The model estimates the probability of a feature being present from its (smoothed) frequency in the training set for each class, and the probability of absence as 1 minus that value. Because Bernoulli Naive Bayes counts absence as evidence, an entry encoded as 0 contributes to the likelihood exactly as a true absence would.

This makes the classifier sensitive to missing values, especially if there are many of them, since every missing entry is then read as a genuine absence. In such cases it may be better to impute missing values, for example by replacing them with the most frequent value (the mode) of the feature in the training set, which keeps the feature binary, or by using a more sophisticated imputation method.

Switching to another Naive Bayes variant, such as Gaussian Naive Bayes or Multinomial Naive Bayes, does not solve the problem by itself. Those variants are better suited to continuous or count features, respectively, but they likewise expect complete input, so missing values still have to be imputed or removed beforehand.
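One simple way to impute missing entries in binary data is to fill each column with its most frequent observed value. The following is a minimal sketch under that assumption; the helper name `impute_binary_mode`, the tie-breaking rule, and the toy matrix are all hypothetical, and missing entries are assumed to be encoded as NaN.

```python
import numpy as np

def impute_binary_mode(X):
    """Replace NaN entries in a 0/1 matrix with each column's most frequent value."""
    X = np.array(X, dtype=float)  # work on a copy
    for j in range(X.shape[1]):
        col = X[:, j]                       # view into X, so writes modify X
        observed = col[~np.isnan(col)]
        # Ties and all-missing columns fall back to 0 ("absent").
        mode = 1.0 if observed.size and observed.mean() > 0.5 else 0.0
        col[np.isnan(col)] = mode
    return X

X = [[1, 0, np.nan],
     [1, np.nan, 0],
     [np.nan, 0, 1],
     [1, 1, 1]]
X_filled = impute_binary_mode(X)
print(X_filled)
```

In a real pipeline the modes would be computed on the training split only and then reused to fill the test split, so that no information leaks from test to train.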

# Q4. Can Gaussian Naive Bayes be used for multi-class classification?

## Ans. :

Yes, Gaussian Naive Bayes can be used for multi-class classification problems. The algorithm needs no modification when there are more than two classes: it estimates one set of parameters per class, however many classes there are. In scikit-learn this is the `GaussianNB` classifier.

In Gaussian Naive Bayes, each feature is modeled within each class as a univariate Gaussian, with the mean and variance of each feature estimated separately for each class. Because the features are assumed to be conditionally independent given the class, this is equivalent to a multivariate Gaussian with a diagonal covariance matrix. To classify a new instance, the classifier computes the posterior probability of the instance belonging to each class (the class prior times the product of the per-feature Gaussian densities) and selects the class with the highest posterior as the predicted class.

The Gaussian Naive Bayes classifier is often used in practice for multi-class classification problems, especially in cases where the features have continuous values and follow a normal (Gaussian) distribution. However, it is important to note that the Naive Bayes assumption of feature independence may not always hold in practice, and the performance of the classifier can be affected by the degree of correlation between features.
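The per-class mean, variance, and prior described above extend to any number of classes without change. The sketch below is a simplified from-scratch illustration with three classes, not scikit-learn's implementation; the function names, the toy data, and the additive `var_smoothing` constant (used here only to avoid division by zero) are assumptions made for the example.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate per-class feature means, variances, and class priors."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + var_smoothing
    priors = np.array([(y == c).mean() for c in classes])
    return classes, means, variances, priors

def predict_gaussian_nb(model, X):
    """Pick the class with the highest log posterior for each row of X."""
    classes, means, variances, priors = model
    diff = X[:, None, :] - means[None, :, :]            # (samples, classes, features)
    log_density = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    log_posterior = log_density.sum(axis=2) + np.log(priors)[None, :]
    return classes[np.argmax(log_posterior, axis=1)]

# Toy data: three well-separated classes in two dimensions.
X_toy = np.array([[0.0, 0.0], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9],
                  [10.0, 0.0], [9.8, 0.2]])
y_toy = np.array([0, 0, 1, 1, 2, 2])

model = fit_gaussian_nb(X_toy, y_toy)
predictions = predict_gaussian_nb(model, np.array([[0.1, 0.0], [5.0, 5.1], [9.9, 0.1]]))
print(predictions)  # [0 1 2]
```

Working in log space avoids underflow when many small densities are multiplied together.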

# Q5. Assignment:

In [1]:
# import necessary libraries and load the dataset
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

# Load the dataset
data = np.loadtxt('spambase.data', delimiter=',')
data.shape

(4601, 58)

In [5]:
X = data[:, :-1]
y = data[:, -1]

In [6]:
# Define the classifiers (BernoulliNB, MultinomialNB and GaussianNB were imported in the first cell)
bnb = BernoulliNB()
mnb = MultinomialNB()
gnb = GaussianNB()

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [8]:
X_train

array([[0.000e+00, 7.010e+00, 0.000e+00, ..., 1.826e+00, 1.300e+01,
        4.200e+01],
       [2.900e-01, 0.000e+00, 2.900e-01, ..., 3.075e+00, 6.000e+01,
        3.260e+02],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 1.733e+00, 9.000e+00,
        2.600e+01],
       ...,
       [4.300e-01, 4.000e-01, 3.700e-01, ..., 8.016e+00, 1.780e+02,
        3.303e+03],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 1.506e+00, 1.200e+01,
        1.190e+02],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 1.800e+00, 5.000e+00,
        9.000e+00]])

In [9]:
y_train

array([0., 1., 0., ..., 1., 0., 0.])

In [10]:
gnb.fit(X_train,y_train)

GaussianNB()

In [11]:
y_gnb=gnb.predict(X_test)

In [12]:
# 10-fold cross-validated accuracy on the full dataset (cross_val_score was imported in the first cell)
scores_gnb = cross_val_score(gnb, X, y, cv=10)

In [13]:
bnb.fit(X_train,y_train)

BernoulliNB()

In [14]:
y_bnb=bnb.predict(X_test)

In [15]:
scores_bnb=cross_val_score(bnb,X,y,cv=10)

In [16]:
mnb.fit(X_train,y_train)

MultinomialNB()

In [17]:
y_mnb=mnb.predict(X_test)

In [18]:
scores_mnb=cross_val_score(mnb,X,y,cv=10)

In [19]:
# Print the mean accuracy scores for each classifier
print("Bernoulli Naive Bayes mean accuracy:", scores_bnb.mean())
print("Multinomial Naive Bayes mean accuracy:", scores_mnb.mean())
print("Gaussian Naive Bayes mean accuracy:", scores_gnb.mean())

Bernoulli Naive Bayes mean accuracy: 0.8839380364047911
Multinomial Naive Bayes mean accuracy: 0.7863496180326323
Gaussian Naive Bayes mean accuracy: 0.8217730830896915


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Calculate the performance metrics for each classifier.
# The predictions were made on X_test, so they must be compared with y_test, not the full y.
accuracy_bernoulli = accuracy_score(y_test, y_bnb)
accuracy_multinomial = accuracy_score(y_test, y_mnb)
accuracy_gaussian = accuracy_score(y_test, y_gnb)

precision_bernoulli = precision_score(y_test, y_bnb)
precision_multinomial = precision_score(y_test, y_mnb)
precision_gaussian = precision_score(y_test, y_gnb)

recall_bernoulli = recall_score(y_test, y_bnb)
recall_multinomial = recall_score(y_test, y_mnb)
recall_gaussian = recall_score(y_test, y_gnb)

f1_bernoulli = f1_score(y_test, y_bnb)
f1_multinomial = f1_score(y_test, y_mnb)
f1_gaussian = f1_score(y_test, y_gnb)

# Print the performance metrics for each classifier
print('Bernoulli Naive Bayes:')
print('Accuracy:', accuracy_bernoulli)
print('Precision:', precision_bernoulli)
print('Recall:', recall_bernoulli)
print('F1 score:', f1_bernoulli)
print()

print('Multinomial Naive Bayes:')
print('Accuracy:', accuracy_multinomial)
print('Precision:', precision_multinomial)
print('Recall:', recall_multinomial)
print('F1 score:', f1_multinomial)
print()

print('Gaussian Naive Bayes:')
print('Accuracy:', accuracy_gaussian)
print('Precision:', precision_gaussian)
print('Recall:', recall_gaussian)
print('F1 score:', f1_gaussian)
print()

### Conclusion:

Based on these results, the Bernoulli Naive Bayes classifier performed best, with an accuracy of 0.887, followed by the Multinomial Naive Bayes classifier (0.873) and the Gaussian Naive Bayes classifier (0.814). The 10-fold cross-validation scores printed above agree that Bernoulli is the most accurate (0.884), but they place Gaussian (0.822) ahead of Multinomial (0.786). Precision was highest for the Multinomial Naive Bayes classifier (0.906), followed by Bernoulli (0.891) and Gaussian (0.670). Recall was highest for the Bernoulli Naive Bayes classifier (0.895), followed by Multinomial (0.837) and Gaussian (0.793). The F1 score was highest for the Bernoulli Naive Bayes classifier (0.893), followed by Multinomial (0.870) and Gaussian (0.725).

These results suggest that the Bernoulli Naive Bayes classifier is the best choice for classifying spam emails in the Spambase dataset, as it achieved the highest accuracy, recall, and F1 score. The Multinomial Naive Bayes classifier also performed well and achieved the highest precision, which is important for reducing false positives (classifying non-spam emails as spam). The Gaussian Naive Bayes classifier had the lowest score on every reported metric. Its precision of 0.670 in particular means that roughly a third of the emails it flags as spam are legitimate, while its recall of 0.793, the measure that matters for reducing false negatives (classifying spam emails as non-spam), is also below that of the other two classifiers.
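To make the link between the metrics and the two error types concrete, precision, recall, and F1 can be computed by hand from the counts of true positives, false positives, and false negatives. This is a minimal sketch: the helper name `binary_metrics` and the eight toy labels are hypothetical, and spam is assumed to be the positive class (label 1).

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 1, 0, 0, 1, 0]

precision, recall, f1 = binary_metrics(y_true_toy, y_pred_toy)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Fewer false positives raise precision, fewer false negatives raise recall, and F1 is the harmonic mean of the two.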

In future work, more advanced machine learning algorithms could be evaluated on the Spambase dataset to determine if they can achieve even better performance than the Naive Bayes classifiers. Additionally, feature engineering could be used to extract more meaningful features from the email messages, which could improve the performance of the classifiers. Finally, the performance of the classifiers could be evaluated on a larger and more diverse dataset to determine if they are robust to different types of spam emails.