In [None]:
"""Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?
"""

"""Ans: 

Let A be the event that an employee uses the health insurance plan, and B be the event that an employee is a smoker. We want P(B|A), the probability that an 
employee is a smoker given that he/she uses the health insurance plan.
P(A) = 0.7 (70% of employees use the health insurance plan)
P(B|A) = 0.4 (40% of employees who use the plan are smokers)

The second statement is already the conditional probability being asked for, so no further calculation is needed. As a check, using the definition of conditional probability:
P(A and B) = P(B|A) * P(A) = 0.4 * 0.7 = 0.28
P(B|A) = P(A and B) / P(A) = 0.28 / 0.7 = 0.4

P(B) cannot be computed from the given data, because the smoking rate among employees who do not use the plan, P(B|not A), is not stated. It is not needed here.

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.4, or 40%.

"""
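
The arithmetic in Q1 can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and exact fractions are used so that floating-point rounding does not obscure the result.

```python
from fractions import Fraction

# Values taken from the question; exact fractions avoid floating-point noise.
p_a = Fraction(7, 10)          # P(A): employee uses the plan
p_b_given_a = Fraction(4, 10)  # P(B|A): smoker among plan users

# Joint probability, then recover the conditional from its definition.
p_a_and_b = p_b_given_a * p_a
p_b_given_a_check = p_a_and_b / p_a

print(float(p_a_and_b))          # 0.28
print(float(p_b_given_a_check))  # 0.4
```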

In [None]:
"""Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

    Ans: The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes is in the way they model the distribution of the features. Bernoulli Naive Bayes assumes 
         that each feature is binary (present or absent) and explicitly penalizes the absence of a feature, while Multinomial Naive Bayes assumes that each feature is a count 
         of the number of occurrences of a particular event and ignores features that do not occur. This means that Bernoulli Naive Bayes is appropriate for binary data, 
         while Multinomial Naive Bayes is appropriate for count data.
"""
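
A small illustration of the Q2 distinction, assuming scikit-learn and a made-up count matrix (the `_toy` names and the data are hypothetical, not part of the assignment). With its default `binarize=0.0`, `BernoulliNB` records in how many samples of each class a feature is present, while `MultinomialNB` accumulates the raw counts.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy data: rows are documents, columns are word counts.
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 4, 1],
                     [0, 1, 2]])
y_toy = np.array([0, 0, 1, 1])

bnb_toy = BernoulliNB().fit(X_counts, y_toy)    # binarize=0.0 by default
mnb_toy = MultinomialNB().fit(X_counts, y_toy)

# BernoulliNB: number of documents per class in which each word appears.
print(bnb_toy.feature_count_)
# MultinomialNB: total occurrences of each word per class.
print(mnb_toy.feature_count_)
```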

In [None]:
"""Q3. How does Bernoulli Naive Bayes handle missing values?

    Ans: Bernoulli Naive Bayes has no built-in mechanism for missing values: it assumes that every feature is observed and binary, typically 0 or 1. Missing values therefore 
         have to be handled before fitting, either by dropping the affected rows or by imputing them with 0 or 1, depending on the context and the nature of the missingness. 
         A common choice is the most frequent value of each feature, which keeps the data binary; mean or regression imputation produces non-binary values that must be 
         thresholded back to 0/1. Any imputation can lead to biased or inaccurate estimates, especially if the data are not missing completely at random.
"""
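
One possible way to do the imputation described in Q3 is sketched below, assuming scikit-learn's `SimpleImputer` with the `most_frequent` strategy placed in front of `BernoulliNB` in a pipeline. The data and the `X_missing` / `y_missing` names are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary data with missing entries marked as NaN.
X_missing = np.array([[1, 0, np.nan],
                      [1, np.nan, 0],
                      [0, 1, 1],
                      [np.nan, 1, 1],
                      [1, 0, 0],
                      [0, 1, 1]])
y_missing = np.array([0, 0, 1, 1, 0, 1])

# Fill each column with its most frequent value, which keeps the data binary.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X_missing, y_missing)

# Inspect the imputed matrix and the predictions on the same rows.
X_filled = model.named_steps["simpleimputer"].transform(X_missing)
print(X_filled)
print(model.predict(X_missing))
```

Keeping the imputer inside the pipeline means that, under cross-validation, the fill values are learned from the training folds only.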

In [None]:
"""Q4. Can Gaussian Naive Bayes be used for multi-class classification?

    Ans: Yes. Gaussian Naive Bayes handles multi-class classification natively. It estimates a prior and a per-feature mean and variance for each class, computes the 
         posterior probability of every class for a new sample, and predicts the class with the highest posterior. No one-vs-rest scheme or voting step is required.
"""
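
As a quick check of the Q4 answer, the sketch below fits `GaussianNB` directly on the three-class Iris data bundled with scikit-learn. Iris is used here only as a convenient multi-class example and is not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris ships with scikit-learn and has three classes.
X_iris, y_iris = load_iris(return_X_y=True)

# A single GaussianNB model is fitted on all three classes at once.
gnb_multi = GaussianNB().fit(X_iris, y_iris)

print(gnb_multi.classes_)                         # the three class labels
print(gnb_multi.theta_.shape)                     # per-class feature means: (3, 4)
print(gnb_multi.predict_proba(X_iris[:1]).shape)  # one posterior per class: (1, 3)
```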

In [None]:
"""Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). 
This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.
Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. 
You should use the default hyperparameters for each classifier.
Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score
Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
Conclusion:
Summarise your findings and provide some suggestions for future work.

Note: Create your assignment in Jupyter notebook and upload it to GitHub & share that github repository
link through your dashboard. Make sure the repository is public.
Note: This dataset contains a binary classification problem with multiple features. The dataset is
relatively small, but it can be used to demonstrate the performance of the different variants of Naive
Bayes on a real-world problem."""
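
The evaluation protocol described in the assignment (10-fold cross-validation, default hyperparameters, four metrics) could be sketched as follows. The helper name `evaluate_naive_bayes` is hypothetical; it assumes `X` contains non-negative features (required by `MultinomialNB`) and `y` is a binary label with 1 as the positive class.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB


def evaluate_naive_bayes(X, y, n_folds=10):
    """Return mean cross-validated metrics for each default Naive Bayes model."""
    models = {
        "BernoulliNB": BernoulliNB(),
        "MultinomialNB": MultinomialNB(),
        "GaussianNB": GaussianNB(),
    }
    scoring = ["accuracy", "precision", "recall", "f1"]
    results = {}
    for name, model in models.items():
        # An integer cv with a classifier uses stratified folds.
        scores = cross_validate(model, X, y, cv=n_folds, scoring=scoring)
        results[name] = {m: float(np.mean(scores[f"test_{m}"])) for m in scoring}
    return results
```

Once the feature matrix and labels are loaded, `pd.DataFrame(evaluate_naive_bayes(X, y)).T` would give one row per classifier and one column per metric.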

In [34]:
import pandas as pd

In [35]:
df = pd.read_csv('spambase.data',header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [36]:
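# Columns 0-56 are the input features; the last column (57) is the label (1 = spam, 0 = not spam).
X = df.iloc[:, :-1]
y = df.iloc[:, -1]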

In [37]:
y.value_counts()

0    2788
1    1813
Name: 57, dtype: int64

In [38]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=69)

In [39]:
from sklearn.naive_bayes import MultinomialNB,BernoulliNB,GaussianNB
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.model_selection import GridSearchCV

## BernoulliNB()

In [40]:
bnb = BernoulliNB()
bnb.fit(X_train,y_train)
y_p1 = bnb.predict(X_test)
print(confusion_matrix(y_test,y_p1))
print(accuracy_score(y_test,y_p1))
print(classification_report(y_test,y_p1))

[[776  48]
 [100 457]]
0.8928312816799421
              precision    recall  f1-score   support

           0       0.89      0.94      0.91       824
           1       0.90      0.82      0.86       557

    accuracy                           0.89      1381
   macro avg       0.90      0.88      0.89      1381
weighted avg       0.89      0.89      0.89      1381



In [41]:
p = {
    'alpha': [0.1, 1.0, 10.0],
    'binarize': [0.0, 0.5, 1.0],
    'fit_prior': [True, False],
    'class_prior': [None, [0.3, 0.7], [0.5, 0.5]]
}
bnb_G = GridSearchCV(BernoulliNB(),param_grid=p,cv=10)
bnb_G.fit(X_train, y_train)
print(bnb_G.best_params_)
print(bnb_G.best_score_)

{'alpha': 1.0, 'binarize': 0.5, 'class_prior': None, 'fit_prior': True}
0.9027950310559006


In [42]:
y_p2 = bnb_G.predict(X_test)
print(confusion_matrix(y_test,y_p2))
print(accuracy_score(y_test,y_p2))
print(classification_report(y_test,y_p2))

[[743  81]
 [ 81 476]]
0.8826937002172339
              precision    recall  f1-score   support

           0       0.90      0.90      0.90       824
           1       0.85      0.85      0.85       557

    accuracy                           0.88      1381
   macro avg       0.88      0.88      0.88      1381
weighted avg       0.88      0.88      0.88      1381



## GaussianNB()

In [43]:
gnb = GaussianNB()
gnb.fit(X_train,y_train)
y_p1 = gnb.predict(X_test)
print(confusion_matrix(y_test,y_p1))
print(accuracy_score(y_test,y_p1))
print(classification_report(y_test,y_p1))

[[596 228]
 [ 34 523]]
0.8102824040550326
              precision    recall  f1-score   support

           0       0.95      0.72      0.82       824
           1       0.70      0.94      0.80       557

    accuracy                           0.81      1381
   macro avg       0.82      0.83      0.81      1381
weighted avg       0.85      0.81      0.81      1381



In [44]:
p = {
    'var_smoothing': [1e-09, 1e-08, 1e-07, 1e-06, 1e-05, 1e-04, 1e-03, 1e-02, 1e-01]
}
gnb_G = GridSearchCV(GaussianNB(),param_grid=p,cv=10)
gnb_G.fit(X_train, y_train)
print(gnb_G.best_params_)
print(gnb_G.best_score_)
y_p2 = gnb_G.predict(X_test)
print(confusion_matrix(y_test,y_p2))
print(accuracy_score(y_test,y_p2))
print(classification_report(y_test,y_p2))

{'var_smoothing': 1e-06}
0.8770186335403727
[[742  82]
 [ 97 460]]
0.8703837798696596
              precision    recall  f1-score   support

           0       0.88      0.90      0.89       824
           1       0.85      0.83      0.84       557

    accuracy                           0.87      1381
   macro avg       0.87      0.86      0.86      1381
weighted avg       0.87      0.87      0.87      1381



## MultinomialNB()

In [45]:
mnb = MultinomialNB()
mnb.fit(X_train,y_train)
y_p1 = mnb.predict(X_test)
print(confusion_matrix(y_test,y_p1))
print(accuracy_score(y_test,y_p1))
print(classification_report(y_test,y_p1))

[[676 148]
 [168 389]]
0.7711803041274439
              precision    recall  f1-score   support

           0       0.80      0.82      0.81       824
           1       0.72      0.70      0.71       557

    accuracy                           0.77      1381
   macro avg       0.76      0.76      0.76      1381
weighted avg       0.77      0.77      0.77      1381



In [46]:
p = {
    'alpha': [0.1, 0.5, 1.0],
    'fit_prior': [True, False],
    'class_prior': [None, [0.5, 0.5], [0.7, 0.3]]
}
mnb_G = GridSearchCV(MultinomialNB(),param_grid=p,cv=10)
mnb_G.fit(X_train, y_train)
print(mnb_G.best_params_)
print(mnb_G.best_score_)
y_p2 = mnb_G.predict(X_test)
print(confusion_matrix(y_test,y_p2))
print(accuracy_score(y_test,y_p2))
print(classification_report(y_test,y_p2))

{'alpha': 0.1, 'class_prior': [0.7, 0.3], 'fit_prior': True}
0.8065217391304348
[[698 126]
 [173 384]]
0.7834902244750181
              precision    recall  f1-score   support

           0       0.80      0.85      0.82       824
           1       0.75      0.69      0.72       557

    accuracy                           0.78      1381
   macro avg       0.78      0.77      0.77      1381
weighted avg       0.78      0.78      0.78      1381



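# Accuracies (30% hold-out test set):
- BernoulliNB = 0.8928 (default), 0.8827 (after grid search)
- GaussianNB = 0.8103 (default), 0.8704 (after grid search)
- MultinomialNB = 0.7712 (default), 0.7835 (after grid search)

### Bernoulli Naive Bayes gave the best result, most likely because the dataset is sparse: many feature values are 0. The features are not strictly binary, but BernoulliNB binarizes them (values above the `binarize` threshold become 1), so it models whether a word or character occurs in an email rather than how often, and that presence/absence signal appears to be enough to separate spam from non-spam here. Note that the grid-searched BernoulliNB had a higher cross-validation score (0.9028) but a slightly lower test accuracy than the default model.

### A limitation of Bernoulli Naive Bayes is that it works best on binary (0 and 1) data: binarizing the features discards their magnitudes.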
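
The sparsity observation can be checked directly. The snippet below is a minimal sketch with a hypothetical helper, `zero_fraction`, demonstrated on a made-up frame; in the notebook it would be applied to the feature matrix `X`.

```python
import pandas as pd


def zero_fraction(features: pd.DataFrame) -> float:
    """Return the share of entries in `features` that are exactly zero."""
    return float((features == 0).to_numpy().mean())


# Hypothetical toy frame; in the notebook this would be zero_fraction(X).
toy = pd.DataFrame({"a": [0.0, 0.64, 0.0], "b": [0.0, 0.0, 0.32]})
print(zero_fraction(toy))  # 4 of the 6 entries are zero
```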