Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?


In this case, let A be the event that an employee is a smoker, and let B be the event that an employee uses the health insurance plan. We are asked to find the probability of A given that B has occurred, which is P(A|B).

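We are given P(B) = 0.7, the probability that an employee uses the health insurance plan, and P(A|B) = 0.4, the probability that an employee is a smoker given that he/she uses the plan. The second value is exactly the quantity asked for, so no further calculation is needed:

P(A|B) = 0.4

As a consistency check, the product rule gives the joint probability

P(A and B) = P(A|B) * P(B) = 0.4 * 0.7 = 0.28

so that P(A|B) = P(A and B) / P(B) = 0.28 / 0.7 = 0.4.

The overall probability that an employee is a smoker, P(A), is a different quantity. By the law of total probability,

P(A) = P(A|B) * P(B) + P(A|~B) * P(~B)

which requires P(A|~B), the share of smokers among employees who do not use the plan. The question does not supply that value, so P(A) cannot be determined from the given information.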
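
The relationship between the quantities given in the question can be checked numerically. This is a minimal sketch; the variable names are illustrative and only the two values stated in the question are used.

```python
# Values stated in the question
p_plan = 0.7                # P(B): employee uses the health insurance plan
p_smoker_given_plan = 0.4   # P(A|B): smoker, among plan users

# Product rule: P(A and B) = P(A|B) * P(B)
p_smoker_and_plan = p_smoker_given_plan * p_plan

# Definition of conditional probability: P(A|B) = P(A and B) / P(B)
p_check = p_smoker_and_plan / p_plan

print("P(A and B):", round(p_smoker_and_plan, 2))
print("P(A|B):", round(p_check, 2))
```

The recovered value matches the conditional probability stated in the question, which is the answer being asked for.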

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?


Bernoulli Naive Bayes is used when the features are binary variables (i.e., each takes one of two possible values); the number of classes is not restricted to two. For example, in a spam detection problem, the features might be the presence or absence of certain keywords in an email. Each feature is modeled as a Bernoulli distribution (i.e., a binary variable with a probability of success p), hence the name Bernoulli Naive Bayes. The likelihood of a document given a class is the product, over all features, of p when the feature is present and (1 - p) when it is absent, so the absence of a word explicitly counts as evidence.

Multinomial Naive Bayes, on the other hand, is used when the features are non-negative counts. For example, in a text classification problem, the features might be the frequencies of different words in a document. The whole feature vector is modeled as draws from a multinomial distribution over the k features (i.e., each class has a probability vector of length k), hence the name Multinomial Naive Bayes. The likelihood of a document given a class is proportional to the product of each feature's probability raised to the power of its count, so a word that occurs several times contributes several times, while a word that does not occur contributes nothing.
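
The difference between the two likelihoods can be made concrete with a small sketch. The per-class parameters below (`p_spam`, `theta_spam`) are hypothetical values chosen for illustration, and the multinomial coefficient is omitted because it is the same for every class and does not affect the comparison.

```python
import math

def bernoulli_log_likelihood(present, p):
    """Sum log p for present features and log(1 - p) for absent ones."""
    return sum(math.log(pi) if xi else math.log(1.0 - pi)
               for xi, pi in zip(present, p))

def multinomial_log_likelihood(counts, theta):
    """Sum count * log theta; features with count 0 contribute nothing."""
    return sum(c * math.log(t) for c, t in zip(counts, theta))

counts = [3, 0, 1]                             # word counts in one document
present = [1 if c > 0 else 0 for c in counts]  # the same document, binarized

p_spam = [0.8, 0.6, 0.3]       # hypothetical P(word present | spam)
theta_spam = [0.5, 0.3, 0.2]   # hypothetical word distribution for spam

print("Bernoulli:", bernoulli_log_likelihood(present, p_spam))
print("Multinomial:", multinomial_log_likelihood(counts, theta_spam))
```

In the Bernoulli case the second word lowers the likelihood because it is absent; in the multinomial case it has no effect, while the first word is counted three times.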



Q3. How does Bernoulli Naive Bayes handle missing values?


Bernoulli Naive Bayes has no built-in mechanism for missing values, so they must be dealt with before the model is applied. A common and simple choice is to treat a missing value as the negative state of the corresponding binary feature.

For example, suppose we have a dataset of emails for a spam classification problem, where the features are binary variables indicating the presence or absence of certain words. If the value of a feature was not recorded for an email, we can impute it as 0 (i.e., treat the word as absent from that email).

This is a form of constant imputation, and it is only reasonable when a missing entry really does tend to mean absence. It should not be confused with the "missing-at-random" assumption, which says that the probability of a value being missing does not depend on the unobserved value itself once the observed data are taken into account. When that assumption holds, an alternative that follows directly from the independence assumption of Naive Bayes is to leave the missing feature out of the product of likelihoods for that email.

Once the missing values have been imputed as negative values for the corresponding features, Bernoulli Naive Bayes can be applied as usual to calculate the probability of each class given the observed features.

Imputing 0 can bias the estimated feature probabilities when values are missing for reasons unrelated to absence. In such cases, it may be necessary to use more sophisticated techniques, such as imputation based on the distribution of the observed values, or other machine learning algorithms that can handle missing values more directly.
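
Two simple ways of dealing with a missing binary feature can be sketched as follows: imputing it as absent, or leaving it out of the likelihood product. Both helper functions and the example values are hypothetical and are not part of any library; missing entries are represented here as NaN.

```python
import math

def impute_missing_as_absent(rows):
    """Replace NaN entries with 0, i.e. treat an unrecorded feature as absent."""
    return [[0 if math.isnan(v) else int(v) for v in row] for row in rows]

def bernoulli_log_likelihood_skip_missing(x, p):
    """Log-likelihood of one sample for one class, ignoring NaN features."""
    total = 0.0
    for xi, pi in zip(x, p):
        if math.isnan(xi):
            continue  # drop the missing feature from the product
        total += math.log(pi) if xi == 1 else math.log(1.0 - pi)
    return total

nan = float("nan")
emails = [[1.0, nan, 0.0], [nan, 1.0, 1.0]]

print(impute_missing_as_absent(emails))
print(bernoulli_log_likelihood_skip_missing(emails[0], [0.5, 0.9, 0.5]))
```

The first option produces a complete matrix that any Bernoulli Naive Bayes implementation can consume; the second requires computing the likelihood by hand, as done here.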

Q4. Can Gaussian Naive Bayes be used for multi-class classification?


Yes, Gaussian Naive Bayes can be used for multi-class classification. No modification of the algorithm is needed, because Bayes' theorem applies to any number of classes.

The approach is to estimate, for each class, a prior probability and a separate Gaussian distribution for every feature, using the training observations that belong to that class (each class corresponds to a unique value of the target variable). Then, for a new observation with a set of feature values, the posterior probability of each class is calculated using Bayes' theorem, where the likelihood of the observed feature values given the class is the product of the per-feature univariate Gaussian densities, with means and variances estimated from the training data. This is equivalent to a multivariate Gaussian with a diagonal covariance matrix, since the naive independence assumption rules out covariance between features within a class. The class with the highest posterior probability is then assigned as the predicted class for the new observation.

Gaussian Naive Bayes for multi-class classification is a simple and computationally efficient algorithm that can work well in practice, especially when the number of classes is small and the features are continuous and normally distributed. However, it may not perform as well when the assumptions of independence and normality do not hold or when the class distributions are highly imbalanced. In these cases, other classification algorithms may be more appropriate.
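
The multi-class procedure described above can be sketched from scratch in a few lines. This is an illustrative implementation on a made-up three-class toy dataset, not scikit-learn's code; `var_floor` is a hypothetical parameter added only to avoid division by zero.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a prior and per-feature mean/variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior for one observation."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            # log of the univariate Gaussian density for this feature
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data with three classes (made up for illustration)
X_toy = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9], [9.0, 0.9], [9.2, 1.1]]
y_toy = ["a", "a", "b", "b", "c", "c"]

model = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(model, [5.0, 5.0]))
```

Nothing in the prediction step depends on the number of classes: the loop simply scores every class found during fitting and returns the best one.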

Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.
Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.
Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score
Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
Conclusion:
Summarise your findings and provide some suggestions for future work.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the dataset
data = pd.read_csv(r"C:\Users\mohit bhade\Downloads\spambase\spambase.data",header=None)
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278,1
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,1
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,1
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,1
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,0
4597,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,0
4598,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,0
4599,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,0


In [2]:
# Split the data into features (X) and labels (y)
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

In [3]:
X

array([[0.000e+00, 6.400e-01, 6.400e-01, ..., 3.756e+00, 6.100e+01,
        2.780e+02],
       [2.100e-01, 2.800e-01, 5.000e-01, ..., 5.114e+00, 1.010e+02,
        1.028e+03],
       [6.000e-02, 0.000e+00, 7.100e-01, ..., 9.821e+00, 4.850e+02,
        2.259e+03],
       ...,
       [3.000e-01, 0.000e+00, 3.000e-01, ..., 1.404e+00, 6.000e+00,
        1.180e+02],
       [9.600e-01, 0.000e+00, 0.000e+00, ..., 1.147e+00, 5.000e+00,
        7.800e+01],
       [0.000e+00, 0.000e+00, 6.500e-01, ..., 1.250e+00, 5.000e+00,
        4.000e+01]])

In [4]:
y

array([1, 1, 1, ..., 0, 0, 0], dtype=int64)

In [5]:
# Instantiate the classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

In [6]:
# Perform 10-fold cross-validation and evaluate performance metrics
accuracy_scores = cross_val_score(bernoulli_nb, X, y, cv=10, scoring='accuracy')
precision_scores = cross_val_score(bernoulli_nb, X, y, cv=10, scoring='precision')
recall_scores = cross_val_score(bernoulli_nb, X, y, cv=10, scoring='recall')
f1_scores = cross_val_score(bernoulli_nb, X, y, cv=10, scoring='f1')

# Print the average performance metrics for Bernoulli Naive Bayes
print("Bernoulli Naive Bayes:")
print("Accuracy:", np.mean(accuracy_scores))
print("Precision:", np.mean(precision_scores))
print("Recall:", np.mean(recall_scores))
print("F1 score:", np.mean(f1_scores))

Bernoulli Naive Bayes:
Accuracy: 0.8839380364047911
Precision: 0.8869617393737383
Recall: 0.8152389047416673
F1 score: 0.8481249015095276


In [7]:
# Perform 10-fold cross-validation and evaluate performance metrics
accuracy_scores = cross_val_score(multinomial_nb, X, y, cv=10, scoring='accuracy')
precision_scores = cross_val_score(multinomial_nb, X, y, cv=10, scoring='precision')
recall_scores = cross_val_score(multinomial_nb, X, y, cv=10, scoring='recall')
f1_scores = cross_val_score(multinomial_nb, X, y, cv=10, scoring='f1')

# Print the average performance metrics for Multinomial Naive Bayes
print("\nMultinomial Naive Bayes:")
print("Accuracy:", np.mean(accuracy_scores))
print("Precision:", np.mean(precision_scores))
print("Recall:", np.mean(recall_scores))
print("F1 score:", np.mean(f1_scores))


Multinomial Naive Bayes:
Accuracy: 0.7863496180326323
Precision: 0.7393175533565436
Recall: 0.7214983911116508
F1 score: 0.7282909724016348


In [8]:
# Perform 10-fold cross-validation and evaluate performance metrics
accuracy_scores = cross_val_score(gaussian_nb, X, y, cv=10, scoring='accuracy')
precision_scores = cross_val_score(gaussian_nb, X, y, cv=10, scoring='precision')
recall_scores = cross_val_score(gaussian_nb, X, y, cv=10, scoring='recall')
f1_scores = cross_val_score(gaussian_nb, X, y, cv=10, scoring='f1')

# Print the average performance metrics for Gaussian Naive Bayes
print("\nGaussian Naive Bayes:")
print("Accuracy:", np.mean(accuracy_scores))
print("Precision:", np.mean(precision_scores))
print("Recall:", np.mean(recall_scores))
print("F1 score:", np.mean(f1_scores))


Gaussian Naive Bayes:
Accuracy: 0.8217730830896915
Precision: 0.7103733928118492
Recall: 0.9569516119239877
F1 score: 0.8130660909542995
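
The three evaluation cells above repeat the same four calls for each classifier. One possible alternative is scikit-learn's `cross_validate`, which accepts several metrics at once and fits each fold only once. The sketch below runs on a small synthetic stand-in dataset (`X_demo`, `y_demo`) rather than on Spambase, so its numbers are not comparable to the results reported above; the `evaluate` helper is a hypothetical name.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

def evaluate(model, X, y, folds=10):
    """Return the mean of each metric over the cross-validation folds."""
    scoring = ["accuracy", "precision", "recall", "f1"]
    results = cross_validate(model, X, y, cv=folds, scoring=scoring)
    return {name: float(np.mean(results[f"test_{name}"])) for name in scoring}

# Synthetic, non-negative count data standing in for the real features
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
rates = np.where(y_demo[:, None] == 1,
                 [4.0, 4.0, 1.0, 1.0, 1.0],
                 [1.0, 1.0, 1.0, 4.0, 4.0])
X_demo = rng.poisson(rates).astype(float)

for model in (BernoulliNB(), MultinomialNB(), GaussianNB()):
    print(type(model).__name__, evaluate(model, X_demo, y_demo))
```

Passing `X` and `y` from the notebook instead of the synthetic arrays would evaluate the same three classifiers on Spambase.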


Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?
From the results above, Bernoulli Naive Bayes performs the best overall among the three variants: it achieves the highest accuracy (0.8839), precision (0.8870), and F1 score (0.8481). It does not have the highest recall, however; Gaussian Naive Bayes does (0.9570 versus 0.8152).

One reason why Bernoulli Naive Bayes might be performing well on this dataset is that, with the default hyperparameters, scikit-learn's BernoulliNB binarizes every feature at a threshold of 0.0, so each word or character frequency is reduced to a presence/absence indicator. The presence or absence of certain words or patterns in an email can be indicative of spam, so this binary representation aligns well with the problem, and it is insensitive to the very different scales of the raw features.
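
The effect of the default `binarize=0.0` setting of BernoulliNB can be illustrated with a small sketch. The `binarize` function below is a hand-written stand-in for what the classifier does internally, and the sample values are the first five features of the first two rows in the data preview.

```python
def binarize(rows, threshold=0.0):
    """Map values strictly above the threshold to 1 and all others to 0."""
    return [[1 if value > threshold else 0 for value in row] for row in rows]

# First five features of the first two emails shown in the data preview
sample = [
    [0.00, 0.64, 0.64, 0.0, 0.32],
    [0.21, 0.28, 0.50, 0.0, 0.14],
]

print(binarize(sample))
```

After this step the magnitude of each frequency is discarded, and only whether the word or character occurred at all remains.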

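On the other hand, Multinomial Naive Bayes assumes count-like features and may not be as suitable for this dataset: the word and character features are percentages rather than raw counts, and the last three features (the capital-run-length statistics in columns 54 to 56) take much larger values than the rest, so they can dominate the likelihood. Gaussian Naive Bayes assumes a Gaussian distribution for each continuous feature within a class, which is unlikely to hold here, since the data preview suggests that many features are zero for most emails and strongly right-skewed, leading to suboptimal performance.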

It is also worth noting that Gaussian Naive Bayes achieves a high recall of 0.9570, indicating a low false negative rate (i.e., few spam emails are missed), but its precision of 0.7104 is much lower than that of Bernoulli Naive Bayes (0.8870). This suggests that it classifies a substantial number of non-spam emails as spam, which is usually the more costly error for a spam filter. A limitation shared by all three variants is the independence assumption, which is unlikely to hold for word frequencies that tend to occur together.
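
The trade-off between recall and precision can be made explicit by computing the metrics from confusion-matrix counts. The counts below are hypothetical and chosen only to mimic a high-recall classifier and a more balanced one; they are not taken from the Spambase runs.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for 100 spam emails
high_recall = precision_recall_f1(tp=95, fp=40, fn=5)   # misses little, flags a lot
balanced = precision_recall_f1(tp=80, fp=10, fn=20)     # misses more, flags less

print("high recall:", high_recall)
print("balanced:", balanced)
```

The high-recall classifier catches more spam but pays for it with many false positives, so its F1 score ends up below that of the balanced classifier.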

Summarise your findings and provide some suggestions for future work.
Findings:
In conclusion, the three Naive Bayes classifiers were evaluated on the "Spambase" dataset using 10-fold cross-validation. Bernoulli Naive Bayes performed the best overall, achieving the highest accuracy, precision, and F1 score. Gaussian Naive Bayes achieved the highest recall at the cost of the lowest precision, and Multinomial Naive Bayes scored lowest on accuracy, recall, and F1 score. This indicates that a presence/absence representation of the features suits the problem of spam detection in email messages.

However, it is important to note that the performance of the classifiers may vary depending on the specific characteristics of the dataset. It is recommended to consider multiple evaluation metrics and conduct further analysis to gain deeper insights into the strengths and weaknesses of each classifier.

Suggestions
For future work, the following suggestions can be considered:

Feature Engineering: Explore additional feature engineering techniques to enhance the performance of the classifiers. Since the dataset provides precomputed frequencies rather than raw text, this may include rescaling or log-transforming the skewed features, feature selection, or creating new features based on domain knowledge.

Hyperparameter Tuning: Experiment with different hyperparameter settings for each variant of Naive Bayes to identify optimal configurations. Grid search or random search techniques can be employed to search the hyperparameter space efficiently.

Ensemble Methods: Investigate ensemble methods, such as bagging or boosting, to combine the predictions of multiple Naive Bayes classifiers or incorporate other classifiers to potentially improve performance.

Evaluate Other Algorithms: Compare the performance of Naive Bayes with other machine learning algorithms, such as decision trees, random forests, support vector machines, or neural networks, to determine if alternative models can provide better results.

Cross-Dataset Evaluation: Evaluate the performance of the classifiers on different datasets to assess their generalizability and robustness. It is important to test the classifiers on diverse datasets to ensure their effectiveness across different domains.

Error Analysis: Perform an in-depth analysis of the misclassified instances to gain insights into the specific patterns or characteristics that are challenging for the classifiers. This analysis can guide further improvements and provide valuable information for future research.
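
As an illustration of the hyperparameter tuning suggestion, a grid search over the smoothing strength and binarization threshold of BernoulliNB might look like the sketch below. The parameter grid is an arbitrary example, and the data is a small synthetic stand-in (`X_demo`, `y_demo`) rather than Spambase, so the selected values carry no meaning for the real dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Synthetic, non-negative count data standing in for the real features
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
rates = np.where(y_demo[:, None] == 1,
                 [2.0, 2.0, 0.2, 0.2],
                 [0.2, 0.2, 2.0, 2.0])
X_demo = rng.poisson(rates).astype(float)

# Example grid: smoothing strength and binarization threshold
param_grid = {"alpha": [0.1, 1.0, 10.0], "binarize": [0.0, 1.0]}

search = GridSearchCV(BernoulliNB(), param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean F1:", search.best_score_)
```

Replacing the synthetic arrays with `X` and `y` from the notebook would run the same search on Spambase; the other two variants can be tuned in the same way with their own parameters.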

 