Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

Answer 1: The quantity asked for is the conditional probability P(S|H), where S is the event that the employee is a smoker and H is the event that the employee uses the health insurance plan. The problem statement gives this value directly: "40% of the employees who use the plan are smokers" describes smokers as a fraction of plan users, which is exactly P(S|H).

From the given information, we have:

P(H) = 0.7 (70% of employees use the health insurance plan)
P(S|H) = 0.4 (40% of employees who use the plan are smokers)

Bayes' theorem,

P(S|H) = P(H|S) * P(S) / P(H)

is not needed here. It would only be required if the problem had given P(H|S) and P(S) instead, and neither is provided. What the given values do determine is the joint probability that an employee both uses the plan and smokes:

P(S and H) = P(S|H) * P(H)
= 0.4 * 0.7
= 0.28

Dividing the joint probability by P(H) recovers the same conditional probability:

P(S|H) = P(S and H) / P(H)
= 0.28 / 0.7
= 0.4

Therefore, the probability that an employee is a smoker given that they use the health insurance plan is 0.4 or 40%.
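
As a quick numeric check of the problem statement, the sketch below counts employees in a hypothetical workforce of 1,000 (the headcount is an assumption chosen only to give round numbers). It reads the 40% figure as the share of plan users who smoke and derives the joint and conditional probabilities from the counts.

```python
from fractions import Fraction

total_employees = 1000                     # hypothetical headcount, not from the question
p_plan = Fraction(70, 100)                 # P(H): share of employees on the plan
p_smoker_given_plan = Fraction(40, 100)    # P(S|H): share of plan users who smoke

plan_users = total_employees * p_plan                    # employees on the plan
smoking_plan_users = plan_users * p_smoker_given_plan    # plan users who smoke

p_joint = smoking_plan_users / total_employees   # P(S and H)
p_conditional = p_joint / p_plan                 # P(S|H) = P(S and H) / P(H)

print("Plan users:", plan_users)
print("Smokers among plan users:", smoking_plan_users)
print("P(S and H) =", float(p_joint))
print("P(S|H) =", float(p_conditional))
```

Exact fractions are used instead of floats so that the division returns the conditional probability without rounding noise.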

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Answer 2: Bernoulli Naive Bayes is typically used for binary data, where each feature of the input data takes one of two values, usually represented as 0 or 1. An example is text classification where the presence or absence of a particular word in a document is used as a feature. Each feature is modeled with a Bernoulli distribution conditioned on the class. The independence between features comes from the naive Bayes assumption, not from the Bernoulli distribution itself. Because absence is modeled explicitly, a word that does not appear in a document still contributes to the likelihood.

On the other hand, Multinomial Naive Bayes is designed to handle count data, where each feature represents the frequency of occurrence of a particular event. An example is text classification where the number of times a word appears in a document is used as a feature. The feature vector is modeled with a multinomial distribution conditioned on the class, so each word contributes in proportion to its count and words that do not occur contribute nothing.
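
The difference between the two models can be sketched with a tiny hand-built example. The vocabulary, the document, and the class-conditional parameters below are all hypothetical values chosen for illustration, and the multinomial log-likelihood omits the multinomial coefficient, which is constant across classes.

```python
import math
from collections import Counter

vocabulary = ["free", "money", "meeting"]   # hypothetical vocabulary
tokens = ["free", "free", "money"]          # hypothetical document

token_counts = Counter(tokens)
counts = [token_counts[word] for word in vocabulary]   # input for Multinomial NB
presence = [1 if c > 0 else 0 for c in counts]         # input for Bernoulli NB


def bernoulli_log_likelihood(x, p):
    """Sum over ALL features: present words use log(p), absent ones log(1 - p)."""
    return sum(xi * math.log(pi) + (1 - xi) * math.log(1 - pi)
               for xi, pi in zip(x, p))


def multinomial_log_likelihood(n, theta):
    """Each word contributes count * log(theta); absent words contribute 0."""
    return sum(ni * math.log(ti) for ni, ti in zip(n, theta))


# Hypothetical parameters for a single class (e.g. "spam")
p_present = [0.8, 0.6, 0.1]   # P(word appears at least once | class)
theta = [0.5, 0.4, 0.1]       # P(a drawn token is this word | class), sums to 1

print("counts:", counts)
print("presence:", presence)
print("Bernoulli log-likelihood:", bernoulli_log_likelihood(presence, p_present))
print("Multinomial log-likelihood:", multinomial_log_likelihood(counts, theta))
```

In the Bernoulli case the absent word "meeting" adds a log(1 - 0.1) term, while in the multinomial case it adds nothing and the repeated word "free" is counted twice.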

Q3. How does Bernoulli Naive Bayes handle missing values?

Answer 3: Bernoulli Naive Bayes has no built-in mechanism for missing values. It assumes that the features in the input data are binary, taking on values of 0 or 1, and scikit-learn's BernoulliNB raises an error if the input contains NaN. Missing values are therefore handled in a preprocessing step before training. A common choice is to replace each missing value with the mode of the feature, i.e., the most common value of the feature observed in the training data. This approach assumes that the missing value most likely matches the majority value of the feature; if values are not missing at random, it can bias the estimated feature probabilities.
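
Mode imputation before Bernoulli Naive Bayes can be sketched with a scikit-learn pipeline. The small binary matrix and labels below are made-up values for illustration; `SimpleImputer(strategy="most_frequent")` is one way to do the replacement, not the only one.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with two missing entries (np.nan)
X = np.array([
    [1, 0, 1],
    [1, np.nan, 0],
    [0, 1, 0],
    [0, 1, np.nan],
    [1, 0, 1],
    [0, 1, 0],
], dtype=float)
y = np.array([1, 1, 0, 0, 1, 0])

# The imputer learns each column's most frequent value during fit and
# fills missing entries with it before the classifier sees the data.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)

imputed = model.named_steps["simpleimputer"].transform(X)
predictions = model.predict(X)

print(imputed)
print(predictions)
```

Keeping the imputer inside the pipeline means the mode is learned from training data only, which matters when the pipeline is passed to cross-validation.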

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Answer 4: Yes, Gaussian Naive Bayes can be used for multi-class classification. The algorithm assumes that the input features are continuous and, within each class, follow a Gaussian (normal) distribution. For every class it estimates a prior together with a mean and variance per feature. Nothing in this is specific to two classes: for any number of classes, prediction uses the maximum a posteriori (MAP) rule, which selects the class with the highest posterior probability for a given input.
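
A minimal sketch of the multi-class case, assuming scikit-learn's bundled three-class Iris dataset as a stand-in (it is not part of the assignment). It shows that the fitted model holds one mean vector per class and that the prediction is the class with the largest posterior.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has 3 classes and 4 continuous features
X, y = load_iris(return_X_y=True)

model = GaussianNB().fit(X, y)

posteriors = model.predict_proba(X[:5])   # one posterior per class, per sample
predicted = model.predict(X[:5])          # MAP rule: class with the largest posterior

print("Classes:", model.classes_)
print("Per-class feature means shape:", model.theta_.shape)
print("Posteriors for the first sample:", posteriors[0])
print("Predicted classes:", predicted)
```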

Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is
to predict whether a message is spam or not based on several input features.
Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.
Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score
Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
Conclusion:
Summarise your findings and provide some suggestions for future work.

Note: Create your assignment in a Jupyter notebook, upload it to GitHub, and share the GitHub repository
link through your dashboard. Make sure the repository is public.
Note: This dataset contains a binary classification problem with multiple features. The dataset is
relatively small, but it can be used to demonstrate the performance of the different variants of Naive
Bayes on a real-world problem.

In [17]:
# Answer 5:
import pandas as pd
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')

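# Load the dataset
# Note: spambase.data has no header row, so without header=None pandas uses
# the first record as column names (see the header of the output below).
data = pd.read_csv("spambase.data")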

data.head()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,...,0.41,0.42,0.43,0.778,0.44,0.45,3.756,61,278,1
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1


In [18]:
data.columns=['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d', 'word_freq_our', 'word_freq_over', 'word_freq_remove',       
'word_freq_internet', 'word_freq_order', 'word_freq_mail', 'word_freq_receive', 'word_freq_will', 'word_freq_people', 'word_freq_report',       
'word_freq_addresses', 'word_freq_free', 'word_freq_business', 'word_freq_email', 'word_freq_you', 'word_freq_credit', 'word_freq_your', 'word_freq_font',         
'word_freq_000', 'word_freq_money', 'word_freq_hp', 'word_freq_hpl', 'word_freq_george', 'word_freq_650', 'word_freq_lab',          
'word_freq_labs', 'word_freq_telnet', 'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85', 'word_freq_technology',  
'word_freq_1999', 'word_freq_parts', 'word_freq_pm', 'word_freq_direct', 'word_freq_cs', 'word_freq_meeting', 'word_freq_original',    
'word_freq_project', 'word_freq_re', 'word_freq_edu', 'word_freq_table', 'word_freq_conference', 'char_freq_;', 'char_freq_(',           
'char_freq_[', 'char_freq_!', 'char_freq_$', 'char_freq_#', 'capital_run_length_average', 'capital_run_length_longest', 'capital_run_length_total', 'class']  


In [19]:
data.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,class
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1


In [10]:
import pandas as pd
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings('ignore')

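# Load the dataset
# Caveat: spambase.data has no header row, so this call consumes the first of
# the 4,601 records as a header. Passing header=None would keep all rows; the
# recorded results were produced without it, on 4,600 rows.
data = pd.read_csv("spambase.data")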
data.columns = ['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d', 'word_freq_our', 'word_freq_over', 'word_freq_remove',
'word_freq_internet', 'word_freq_order', 'word_freq_mail', 'word_freq_receive', 'word_freq_will', 'word_freq_people', 'word_freq_report',
'word_freq_addresses', 'word_freq_free', 'word_freq_business', 'word_freq_email', 'word_freq_you', 'word_freq_credit', 'word_freq_your', 'word_freq_font',
'word_freq_000', 'word_freq_money', 'word_freq_hp', 'word_freq_hpl', 'word_freq_george', 'word_freq_650', 'word_freq_lab',
'word_freq_labs', 'word_freq_telnet', 'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85', 'word_freq_technology',
'word_freq_1999', 'word_freq_parts', 'word_freq_pm', 'word_freq_direct', 'word_freq_cs', 'word_freq_meeting', 'word_freq_original',
'word_freq_project', 'word_freq_re', 'word_freq_edu', 'word_freq_table', 'word_freq_conference', 'char_freq_;', 'char_freq_(',
'char_freq_[', 'char_freq_!', 'char_freq_$', 'char_freq_#', 'capital_run_length_average', 'capital_run_length_longest', 'capital_run_length_total', 'class']

# Prepare data for classification
X = data.drop('class', axis=1)
y = data['class']

def report(title, clf, digits=None):
    """Print mean 10-fold CV accuracy plus precision, recall and F1.

    Only the accuracy is cross-validated. Precision, recall and F1 are
    computed on predictions for the same data the model was fitted on,
    so they are training-set scores.
    """
    scores = cross_val_score(clf, X, y, cv=10)
    y_pred = clf.fit(X, y).predict(X)  # fit once and reuse the predictions
    results = {
        "Accuracy": scores.mean(),
        "Precision": precision_score(y, y_pred),
        "Recall": recall_score(y, y_pred),
        "F1 Score": f1_score(y, y_pred),
    }
    print(title)
    for name, value in results.items():
        print(f"{name}:", value if digits is None else round(value, digits))
    return scores


# Rounded summary for Bernoulli and Multinomial Naive Bayes
report("Bernoulli Naive Bayes Classifier:", BernoulliNB(), digits=3)
print()
report("Multinomial Naive Bayes Classifier:", MultinomialNB(), digits=3)
print()

# Full-precision results, each followed by a short discussion

# Gaussian Naive Bayes classifier
scores = report("Gaussian Naive Bayes Classifier:", GaussianNB())

print("\nDiscussion:")
print("The Gaussian Naive Bayes classifier reached a mean accuracy of", scores.mean(),
"with high recall but noticeably lower precision on the training set.")
print("Gaussian Naive Bayes assumes that the input features follow a Gaussian distribution, which is not always the case, and hence may not perform well on datasets with non-Gaussian features.")

# Multinomial Naive Bayes classifier
scores = report("\nMultinomial Naive Bayes Classifier:", MultinomialNB())

print("\nDiscussion:")
print("The Multinomial Naive Bayes classifier also performed well with a mean accuracy of", scores.mean())
print("Multinomial Naive Bayes assumes that the input features are discrete, which is not always the case, and hence may not perform well on datasets with continuous features.")

# Bernoulli Naive Bayes classifier
scores = report("\nBernoulli Naive Bayes Classifier:", BernoulliNB())

print("\nDiscussion:")
print("The Bernoulli Naive Bayes classifier had the highest mean accuracy of", scores.mean())
print("Bernoulli Naive Bayes assumes binary features. scikit-learn's BernoulliNB binarizes inputs at a threshold of 0.0 by default, so here it models whether each word or character occurs at all, which may explain why it copes well with these continuous features.")


print("\nConclusion:")
print("In this project, we implemented three variants of Naive Bayes classifiers, namely, Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes, and evaluated their accuracy on the spam classification dataset using 10-fold cross-validation.")
print("Among the three classifiers, Bernoulli Naive Bayes had the highest mean cross-validated accuracy (0.884), ahead of Gaussian (0.822) and Multinomial (0.786) Naive Bayes.")
print("However, the performance of all three classifiers may be limited by the assumptions they make about the input features, and hence they may not perform well on datasets that violate these assumptions.")
print("In future work, we could explore more advanced variants of Naive Bayes, such as the Complement Naive Bayes or the Semi-Naive Bayes, or use other classification algorithms such as decision trees or random forests.")


Bernoulli Naive Bayes Classifier:
Accuracy: 0.884
Precision: 0.887
Recall: 0.815
F1 Score: 0.849

Multinomial Naive Bayes Classifier:
Accuracy: 0.786
Precision: 0.744
Recall: 0.721
F1 Score: 0.732

Gaussian Naive Bayes Classifier:
Accuracy: 0.8217391304347826
Precision: 0.7010891488503429
Recall: 0.9591611479028698
F1 Score: 0.8100675833139129

Discussion:
The Gaussian Naive Bayes classifier reached a mean accuracy of 0.8217391304347826 with high recall but noticeably lower precision on the training set.
Gaussian Naive Bayes assumes that the input features follow a Gaussian distribution, which is not always the case, and hence may not perform well on datasets with non-Gaussian features.

Multinomial Naive Bayes Classifier:
Accuracy: 0.786086956521739
Precision: 0.7438816163915766
Recall: 0.7213024282560706
F1 Score: 0.7324180442701036

Discussion:
The Multinomial Naive Bayes classifier also performed well with a mean accuracy of 0.786086956521739
Multinomial Naive Bayes