

# <center> Naive Bayes Theorem Assignment 

----------

# Answer 1


We can use Bayes' theorem to calculate the probability that an employee is a smoker given that he/she uses the health insurance plan.

Let S be the event that the employee is a smoker, and H be the event that the employee uses the health insurance plan. Then, using Bayes' theorem, we have:

- P(S | H) = P(H | S) * P(S) / P(H)

- where P(S) is the prior probability of being a smoker, P(H | S) is the conditional probability of using the health insurance plan given that the employee is a smoker, and P(H) is the marginal probability of using the health insurance plan.

- From the problem statement, we are given that P(H) = 0.70 (70% of the employees use the health insurance plan) and that 40% of the employees who use the plan are smokers.

- The second figure is already the conditional probability we want: "of the employees who use the plan, 40% are smokers" is exactly P(S | H) = 0.40. It is not P(H | S), so Bayes' theorem does not need to be applied a second time.

- As a consistency check, the joint probability is:

P(S and H) = P(S | H) * P(H)
= 0.40 * 0.70
= 0.28

and dividing it by P(H) recovers P(S | H) = 0.28 / 0.70 = 0.40.

- The prior P(S) and the likelihood P(H | S) cannot be determined from the given data, because nothing is stated about smokers among the 30% of employees who do not use the plan. If we assume, purely for illustration, that P(S | not H) is also 0.40, the law of total probability gives:

P(S) = P(S | H) * P(H) + P(S | not H) * P(not H)
= 0.40 * 0.70 + 0.40 * 0.30
= 0.40

Then P(H | S) = P(S and H) / P(S) = 0.28 / 0.40 = 0.70, and Bayes' theorem returns the same value:

P(S | H) = P(H | S) * P(S) / P(H)
= 0.70 * 0.40 / 0.70
= 0.40 or 40%

#### Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.40 or 40%.
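
A short numeric check of this calculation, reading the stated 40% as the share of smokers among plan users, i.e. P(S | H). The value `p_s_given_not_h` is a hypothetical assumption that is not part of the problem statement; it is only there so that Bayes' theorem can be evaluated end to end.

```python
# Quantities stated in the problem.
p_h = 0.70          # P(H): employee uses the health insurance plan
p_s_given_h = 0.40  # P(S | H): smoker, among employees who use the plan

# Joint probability of being a smoker and using the plan.
p_s_and_h = p_s_given_h * p_h

# ASSUMPTION (not given in the problem): smoking rate among non-users.
p_s_given_not_h = 0.40

# Law of total probability for the prior P(S).
p_s = p_s_given_h * p_h + p_s_given_not_h * (1 - p_h)

# Likelihood P(H | S) from the joint, then Bayes' theorem for P(S | H).
p_h_given_s = p_s_and_h / p_s
posterior = p_h_given_s * p_s / p_h

print(f"P(S and H) = {p_s_and_h:.2f}")
print(f"P(S)       = {p_s:.2f}")
print(f"P(H | S)   = {p_h_given_s:.2f}")
print(f"P(S | H)   = {posterior:.2f}")
```

Whatever value is assumed for `p_s_given_not_h`, the posterior comes back to the stated 0.40, because P(S) cancels in the last step.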


## Answer 2 
### <center>  Difference between Bernoulli Naive Bayes and Multinomial Naive Bayes.

## Bernoulli Naive Bayes:
- Bernoulli Naive Bayes is typically used when the input features are binary or Boolean variables.
-  For example, in text classification, the input features can be binary indicators of whether a certain word appears in a document or not. 
- In Bernoulli Naive Bayes, each input feature is treated as a binary variable, and the model estimates the probability of each feature being present in a document given the class label. It assumes that, given the class label, the presence or absence of a feature is independent of the presence or absence of any other feature.

## Multinomial Naive Bayes:
- Multinomial Naive Bayes is typically used when the input features are discrete counts, such as the frequency of occurrence of each word in a document.
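-  In Multinomial Naive Bayes, each instance is represented by a vector of counts, and the model estimates, for each class label, the probability of each feature occurring (for text, the probability of each word per token). 
- Like every Naive Bayes variant, it assumes that the features are conditionally independent given the class label. Unlike Bernoulli Naive Bayes, which explicitly penalizes the absence of a feature, the multinomial model ignores features whose count is zero and takes into account how often each feature occurs.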
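
The difference can be sketched with a toy log-likelihood for a single class. The vocabulary and the parameter values below are hypothetical, chosen only for illustration; the multinomial coefficient is omitted because it does not depend on the class.

```python
import math

vocabulary = ["free", "offer", "meeting"]

# Hypothetical parameters for one class (say, "spam").
# Bernoulli: probability that each word appears at least once in a document.
p_present = {"free": 0.8, "offer": 0.6, "meeting": 0.1}
# Multinomial: probability of each word per token (sums to 1 over the vocabulary).
p_token = {"free": 0.5, "offer": 0.4, "meeting": 0.1}


def bernoulli_log_likelihood(counts):
    """Uses only presence/absence; absent words contribute log(1 - p)."""
    total = 0.0
    for word in vocabulary:
        p = p_present[word]
        total += math.log(p if counts[word] > 0 else 1.0 - p)
    return total


def multinomial_log_likelihood(counts):
    """Uses the counts; words with count 0 contribute nothing."""
    return sum(counts[word] * math.log(p_token[word]) for word in vocabulary)


once = {"free": 1, "offer": 0, "meeting": 0}
thrice = {"free": 3, "offer": 0, "meeting": 0}

# Bernoulli gives both documents the same score; multinomial does not.
print(bernoulli_log_likelihood(once), bernoulli_log_likelihood(thrice))
print(multinomial_log_likelihood(once), multinomial_log_likelihood(thrice))
```

Repeating a word changes only the multinomial score, while the absent words "offer" and "meeting" change only the Bernoulli score.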

-------

##  Answer 3
###  <center> How Bernoulli Naive Bayes handles missing values

- Bernoulli Naive Bayes has no dedicated mechanism for missing values, so they are typically handled by treating a missing value as the feature being absent. This means that if a feature value is missing for a particular instance, it is encoded as 0 (not present) before the classifier is trained or applied. Note that scikit-learn's `BernoulliNB` does not accept NaN inputs, so this encoding has to be done explicitly.


### For example.
Let's say we have a binary feature "has_experience" with values "yes" and "no", and some instances are missing this feature value. Under this convention, the missing values are treated as equivalent to "no", i.e., the instances without a value for "has_experience" are assumed not to have experience.

- This approach can lead to some loss of information, because "unknown" and "absent" become indistinguishable. However, it is simple and widely used in practice. If the missing values are considered to be informative, it may be more appropriate to impute them based on other features or with more advanced techniques, such as matrix factorization or k-nearest neighbor imputation.
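
A minimal sketch of the "missing means absent" encoding for the `has_experience` example. The helper `encode_binary` and the extra `was_missing` indicator are illustrative assumptions, not part of any library; the indicator is one simple way to keep the information that a value was missing.

```python
def encode_binary(values, missing_as=0):
    """Map "yes"/"no" to 1/0 and replace missing entries (None) by `missing_as`."""
    mapping = {"yes": 1, "no": 0}
    return [missing_as if value is None else mapping[value] for value in values]


# None marks a missing value in this toy column.
has_experience = ["yes", None, "no", "yes", None]

# Missing values are encoded as 0, i.e. the feature is treated as absent.
encoded = encode_binary(has_experience)

# Optional second binary feature recording where a value was missing.
was_missing = [1 if value is None else 0 for value in has_experience]

print(encoded)
print(was_missing)
```

Both lists are binary, so either can be passed to a Bernoulli Naive Bayes model as a feature column.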

-------

# Answer 4 
## <center> Can Gaussian Naive Bayes be used for multi-class classification?

 - Yes, Gaussian Naive Bayes can be used for multi-class classification, where the goal is to predict the class label of an instance from among three or more possible classes.

- The algorithm needs no extension to handle more than two classes: for every class it estimates a prior probability and the mean and variance of each feature, then applies Bayes' theorem and predicts the class with the highest posterior probability.
- If desired, it can also be wrapped in a "one-vs-all" scheme, where a separate binary classifier is trained for each class against all the others, and the class with the highest probability is chosen as the final prediction. 
- In that scheme, each binary Gaussian Naive Bayes model estimates the distribution of each feature for its target class and for the remaining classes, and the class probabilities are estimated using Bayes' theorem.


## Another approach 
Another option is the "one-vs-one" approach, where a separate binary Gaussian Naive Bayes classifier is trained for each pair of classes, and the class with the most votes is chosen as the final prediction. 
- In this approach, each pairwise classifier models the distribution of each feature for its two classes only and estimates the class probabilities using Bayes' theorem. Since Gaussian Naive Bayes already computes a posterior for every class directly, such wrappers are optional rather than required.
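
To illustrate how Gaussian Naive Bayes scores several classes at once, here is a small from-scratch sketch in plain Python. It is not scikit-learn's implementation: the function names, the toy data, and the tiny variance floor `1e-9` are assumptions made for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for every class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # Small floor keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (len(rows) / len(X), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior (up to a constant)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score

    return max(model, key=log_posterior)


# Toy data with three well-separated classes.
X = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
     [5.0, 5.2], [5.1, 4.9], [4.9, 5.0],
     [9.0, 1.0], [9.2, 1.1], [8.9, 0.9]]
y = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

model = fit_gaussian_nb(X, y)
predictions = [predict(model, row) for row in [[1.0, 1.0], [5.0, 5.0], [9.0, 1.0]]]
print(predictions)
```

One model covers all three classes; the prediction is simply the class whose posterior score is largest.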

-----------

###  <center>  Answer 5 
# <center> Assignment Spambase Dataset

## Objective:
#### The goal is to predict whether a message is spam or not based on several input features.

In [62]:
# Import the essential libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


In [63]:
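# Load the dataset.
# Note: spambase.data has no header row, so without header=None pandas uses the
# first record as column names (hence names such as "0.64" and 4600 rows in the outputs).
data = pd.read_csv("spambase.data")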


In [64]:
# read_csv already returns a DataFrame, so this conversion is redundant but harmless

df = pd.DataFrame(data)
df.head()

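|   | 0 | 0.64 | 0.64.1 | 0.1 | 0.32 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | ... | 0.41 | 0.42 | 0.43 | 0.778 | 0.44 | 0.45 | 3.756 | 61 | 278 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 1 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.85 | 0.0 | 0.0 | 1.85 | 0.0 | 0.0 | ... | 0.0 | 0.223 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 15 | 54 | 1 |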


In [65]:
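# Without parentheses this displays the bound method (whose repr includes the
# DataFrame) rather than summary statistics; df.describe() is called further down.
df.describe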

<bound method NDFrame.describe of          0  0.64  0.64.1  0.1  0.32   0.2   0.3   0.4   0.5   0.6  ...   0.41  \
0     0.21  0.28    0.50  0.0  0.14  0.28  0.21  0.07  0.00  0.94  ...  0.000   
1     0.06  0.00    0.71  0.0  1.23  0.19  0.19  0.12  0.64  0.25  ...  0.010   
2     0.00  0.00    0.00  0.0  0.63  0.00  0.31  0.63  0.31  0.63  ...  0.000   
3     0.00  0.00    0.00  0.0  0.63  0.00  0.31  0.63  0.31  0.63  ...  0.000   
4     0.00  0.00    0.00  0.0  1.85  0.00  0.00  1.85  0.00  0.00  ...  0.000   
...    ...   ...     ...  ...   ...   ...   ...   ...   ...   ...  ...    ...   
4595  0.31  0.00    0.62  0.0  0.00  0.31  0.00  0.00  0.00  0.00  ...  0.000   
4596  0.00  0.00    0.00  0.0  0.00  0.00  0.00  0.00  0.00  0.00  ...  0.000   
4597  0.30  0.00    0.30  0.0  0.00  0.00  0.00  0.00  0.00  0.00  ...  0.102   
4598  0.96  0.00    0.00  0.0  0.32  0.00  0.00  0.00  0.00  0.00  ...  0.000   
4599  0.00  0.00    0.65  0.0  0.00  0.00  0.00  0.00  0.00  0.00  ...  0.0

In [66]:
df.shape

(4600, 58)

In [67]:
# Check for null values in each column
df.isnull().sum()

0         0
0.64      0
0.64.1    0
0.1       0
0.32      0
0.2       0
0.3       0
0.4       0
0.5       0
0.6       0
0.7       0
0.64.2    0
0.8       0
0.9       0
0.10      0
0.32.1    0
0.11      0
1.29      0
1.93      0
0.12      0
0.96      0
0.13      0
0.14      0
0.15      0
0.16      0
0.17      0
0.18      0
0.19      0
0.20      0
0.21      0
0.22      0
0.23      0
0.24      0
0.25      0
0.26      0
0.27      0
0.28      0
0.29      0
0.30      0
0.31      0
0.33      0
0.34      0
0.35      0
0.36      0
0.37      0
0.38      0
0.39      0
0.40      0
0.41      0
0.42      0
0.43      0
0.778     0
0.44      0
0.45      0
3.756     0
61        0
278       0
1         0
dtype: int64

In [78]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 58 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       4600 non-null   float64
 1   0.64    4600 non-null   float64
 2   0.64.1  4600 non-null   float64
 3   0.1     4600 non-null   float64
 4   0.32    4600 non-null   float64
 5   0.2     4600 non-null   float64
 6   0.3     4600 non-null   float64
 7   0.4     4600 non-null   float64
 8   0.5     4600 non-null   float64
 9   0.6     4600 non-null   float64
 10  0.7     4600 non-null   float64
 11  0.64.2  4600 non-null   float64
 12  0.8     4600 non-null   float64
 13  0.9     4600 non-null   float64
 14  0.10    4600 non-null   float64
 15  0.32.1  4600 non-null   float64
 16  0.11    4600 non-null   float64
 17  1.29    4600 non-null   float64
 18  1.93    4600 non-null   float64
 19  0.12    4600 non-null   float64
 20  0.96    4600 non-null   float64
 21  0.13    4600 non-null   float64
 22  

In [69]:
df.describe()

|   | 0 | 0.64 | 0.64.1 | 0.1 | 0.32 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | ... | 0.41 | 0.42 | 0.43 | 0.778 | 0.44 | 0.45 | 3.756 | 61 | 278 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | ... | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 |
| mean | 0.104576 | 0.212922 | 0.280578 | 0.065439 | 0.312222 | 0.095922 | 0.114233 | 0.105317 | 0.090087 | 0.239465 | ... | 0.038583 | 0.139061 | 0.01698 | 0.26896 | 0.075827 | 0.044248 | 5.191827 | 52.17087 | 283.290435 | 0.393913 |
| std | 0.305387 | 1.2907 | 0.50417 | 1.395303 | 0.672586 | 0.27385 | 0.39148 | 0.401112 | 0.278643 | 0.644816 | ... | 0.243497 | 0.270377 | 0.109406 | 0.815726 | 0.245906 | 0.429388 | 31.732891 | 194.912453 | 606.413764 | 0.488669 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.588 | 6.0 | 35.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.065 | 0.0 | 0.0 | 0.0 | 0.0 | 2.2755 | 15.0 | 95.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.42 | 0.0 | 0.3825 | 0.0 | 0.0 | 0.0 | 0.0 | 0.16 | ... | 0.0 | 0.188 | 0.0 | 0.31425 | 0.052 | 0.0 | 3.70525 | 43.0 | 265.25 | 1.0 |
| max | 4.54 | 14.28 | 5.1 | 42.81 | 10.0 | 5.88 | 7.27 | 11.11 | 5.26 | 18.18 | ... | 4.385 | 9.752 | 4.081 | 32.478 | 6.003 | 19.829 | 1102.5 | 9989.0 | 15841.0 | 1.0 |



 - Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.
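
As background for the evaluation step, the sketch below shows how k-fold cross-validation partitions sample indices. It is a plain, unstratified split written for illustration; the helper name `k_fold_indices` is hypothetical. When scikit-learn is given an integer `cv` with a classifier, it uses stratified folds instead, which keep the class proportions similar in every fold.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    folds = []
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


# 4600 rows, as reported by df.shape, split into 10 folds.
folds = k_fold_indices(4600, 10)
for i, (train, test) in enumerate(folds[:3]):
    print(f"fold {i}: {len(train)} training rows, {len(test)} test rows")
```

Each row lands in exactly one test fold, so every sample is used for testing once and for training nine times.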

In [70]:
# import all essential libraries from sklearn

from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, make_scorer

In [71]:
# Split into features X (all columns except the last) and target y
X = df.iloc[:,:-1]
X.head(3)


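|   | 0 | 0.64 | 0.64.1 | 0.1 | 0.32 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | ... | 0.40 | 0.41 | 0.42 | 0.43 | 0.778 | 0.44 | 0.45 | 3.756 | 61 | 278 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 |
| 1 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.0 | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 |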


In [72]:
# Target y: the last column (1 = spam, 0 = not spam)
y = df.iloc[:, -1]
y

0       1
1       1
2       1
3       1
4       1
       ..
4595    0
4596    0
4597    0
4598    0
4599    0
Name: 1, Length: 4600, dtype: int64

## Define the classifier


In [73]:
# Define the classifiers
bnb = BernoulliNB()
mnb = MultinomialNB()
gnb = GaussianNB()

In [79]:
# Define the performance metrics

scoring = {'accuracy': make_scorer(accuracy_score),
           'precision': make_scorer(precision_score),
           'recall': make_scorer(recall_score),
           'f1_score': make_scorer(f1_score)
           }


# Perform 10-fold cross-validation for each classifier

bnb_scores = cross_validate(bnb, X, y, cv=10, scoring=scoring)
mnb_scores = cross_validate(mnb, X, y, cv=10, scoring=scoring)
gnb_scores = cross_validate(gnb, X, y, cv=10, scoring=scoring)



# Print the results
print("Bernoulli Naive Bayes:")
print("Accuracy:", bnb_scores['test_accuracy'].mean())
print("Precision:", bnb_scores['test_precision'].mean())
print("Recall:", bnb_scores['test_recall'].mean())
print("F1 score:", bnb_scores['test_f1_score'].mean())

print("\nMultinomial Naive Bayes:")
print("Accuracy:", mnb_scores['test_accuracy'].mean())
print("Precision:", mnb_scores['test_precision'].mean())
print("Recall:", mnb_scores['test_recall'].mean())
print("F1 score:", mnb_scores['test_f1_score'].mean())

print("\nGaussian Naive Bayes:")
print("Accuracy:", gnb_scores['test_accuracy'].mean())
print("Precision:", gnb_scores['test_precision'].mean())
print("Recall:", gnb_scores['test_recall'].mean())
print("F1 score:", gnb_scores['test_f1_score'].mean())

Bernoulli Naive Bayes:
Accuracy: 0.8839130434782609
Precision: 0.886914139754535
Recall: 0.8151235504826666
F1 score: 0.8480714616697421

Multinomial Naive Bayes:
Accuracy: 0.786086956521739
Precision: 0.7390291264847734
Recall: 0.7207971586424625
F1 score: 0.7277511309974372

Gaussian Naive Bayes:
Accuracy: 0.8217391304347826
Precision: 0.7102746648832371
Recall: 0.9569394693704085
F1 score: 0.8129997873786424


- From the results, we observed that Bernoulli Naive Bayes performed the best overall among the three classifiers, with an accuracy of 88.39%, precision of 88.69%, recall of 81.51%, and F1 score of 84.81%. Gaussian Naive Bayes had the highest recall of 95.69%, but the lowest precision (71.03%); its F1 score of 81.30% sits between the other two. Multinomial Naive Bayes had the lowest accuracy (78.61%), recall (72.08%), and F1 score (72.78%), with a precision of 73.90%.
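
The reported cross-validation means can be tabulated to see which classifier leads on each metric. The numbers below are copied (rounded to four decimals) from the printed output; the dictionary layout and variable names are just one convenient way to arrange them.

```python
# Mean 10-fold scores as printed in the output, rounded to four decimals.
results = {
    "Bernoulli":   {"accuracy": 0.8839, "precision": 0.8869, "recall": 0.8151, "f1": 0.8481},
    "Multinomial": {"accuracy": 0.7861, "precision": 0.7390, "recall": 0.7208, "f1": 0.7278},
    "Gaussian":    {"accuracy": 0.8217, "precision": 0.7103, "recall": 0.9569, "f1": 0.8130},
}

metrics = ["accuracy", "precision", "recall", "f1"]

# For every metric, pick the classifier with the highest mean score.
best = {m: max(results, key=lambda name: results[name][m]) for m in metrics}
# And the one with the lowest mean score.
worst = {m: min(results, key=lambda name: results[name][m]) for m in metrics}

for m in metrics:
    print(f"{m:<10} best: {best[m]:<12} worst: {worst[m]}")
```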

## Insights:
- Based on the performance metrics, Bernoulli Naive Bayes performed the best overall, with the highest accuracy, precision, and F1 score; only its recall was lower than that of Gaussian Naive Bayes.

- Multinomial Naive Bayes had the lowest accuracy, recall, and F1 score. A likely reason is that it is designed for discrete counts, whereas the Spambase features are continuous frequencies and run-length statistics on very different scales. Bernoulli Naive Bayes instead binarizes each feature (scikit-learn's default threshold is 0.0), and Gaussian Naive Bayes models continuous values directly.

-  Gaussian Naive Bayes had a high recall score, but its precision score was much lower than that of Bernoulli Naive Bayes, indicating that it classified more non-spam messages as spam.
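
One detail that helps when reading this comparison: scikit-learn's `BernoulliNB` has a `binarize` parameter (default `0.0`), so continuous inputs are thresholded into presence/absence indicators before the model is fitted. The plain-Python function below only mimics that thresholding for illustration; the sample values are the first five and last three feature values of the first row shown in `df.head()`.

```python
def binarize(row, threshold=0.0):
    """Map values above the threshold to 1 and all others to 0."""
    return [1 if value > threshold else 0 for value in row]


# First five word-frequency features of row 0, as displayed by df.head().
word_frequencies = [0.21, 0.28, 0.5, 0.0, 0.14]
# Last three features of row 0 (capital-letter run-length statistics).
run_lengths = [5.114, 101, 1028]

print(binarize(word_frequencies))
print(binarize(run_lengths))
```

After thresholding, the large run-length values carry no more weight than any other non-zero feature, which is one plausible reason the Bernoulli model is less affected by the differing feature scales.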

#### Overall Conclusion 
Naive Bayes is a fast and efficient algorithm that can work well for text classification tasks, especially when the features are discrete or binary. However, one limitation of Naive Bayes is that it assumes all features are conditionally independent of each other given the class, which may not always be the case in real-world problems.

### Future Work 
For future work, one possible direction is to explore other classification algorithms and compare their performance with Naive Bayes classifiers on the same dataset. It may also be beneficial to perform feature engineering to select relevant features and improve the performance of the classifiers. Finally, it may be useful to explore techniques to handle missing data and imbalanced data, which are common issues in real-world datasets.

-----------
