### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

Let's say there are 100 employees in the company. 70% of them use the health insurance plan, so there are 70 employees who use the plan. 40% of the employees who use the plan are smokers, so there are 0.4 * 70 = 28 employees who are smokers and use the plan.

The probability that an employee is a smoker given that he/she uses the health insurance plan is:

```
P(smoker|uses plan) = 28/70 = 2/5 = 0.4
```

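This means that 40% of the employees who use the health insurance plan are smokers. The survey states this conditional probability directly, so the 70% enrollment rate cancels out of the calculation. The data say nothing about smoking among employees who do not use the plan, so no comparison between the two groups can be made.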

Here is the solution in code:

```python
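def smoker_probability_given_plan(n_employees=100, plan_rate=0.70, smoker_rate_among_plan=0.40):
  """Return P(smoker | uses plan), computed from head counts."""
  plan_users = n_employees * plan_rate                      # 70 employees use the plan
  smoking_plan_users = plan_users * smoker_rate_among_plan  # 28 of them smoke
  return smoking_plan_users / plan_users

print(smoker_probability_given_plan())  # approximately 0.4 (28/70 = 2/5)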
```

### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes and Multinomial Naive Bayes are both supervised machine learning algorithms that are used for classification tasks. They are both based on Bayes' theorem, but they make different assumptions about the data.

**Bernoulli Naive Bayes** assumes that the features are binary, meaning that they can take on only two values, such as "present" or "absent". For example, a Bernoulli Naive Bayes classifier could be used to classify whether a document is spam or not, based on the presence or absence of certain words.

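**Multinomial Naive Bayes** assumes that the features are non-negative counts, such as the number of times each word appears in a document. Despite the name, it is not meant for arbitrary categorical features; it models counts drawn from a multinomial distribution. For example, a Multinomial Naive Bayes classifier could be used to classify the genre of a movie, based on how often each word appears in the movie's title.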

In both Bernoulli Naive Bayes and Multinomial Naive Bayes, the features are assumed to be independent of each other. This means that the probability of a particular feature value occurring is not affected by the value of any other feature.

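The independence assumption is what lets the classifier score a class by multiplying one likelihood per feature. The sketch below works this out by hand for two binary features; the priors and per-feature probabilities are made-up illustrative numbers, and `naive_bayes_log_score` is a hypothetical helper rather than a library function.

```python
import math

def naive_bayes_log_score(prior, feature_probs, x):
    """Return log P(c) + sum_i log P(x_i | c) for binary features.

    feature_probs[i] is P(feature i is present | class c).
    """
    score = math.log(prior)
    for p_present, x_i in zip(feature_probs, x):
        # Independence: each feature contributes its own factor to the product
        score += math.log(p_present if x_i == 1 else 1.0 - p_present)
    return score

# Hypothetical document: feature 0 present, feature 1 absent
x = [1, 0]
spam_score = naive_bayes_log_score(prior=0.4, feature_probs=[0.8, 0.1], x=x)
ham_score = naive_bayes_log_score(prior=0.6, feature_probs=[0.2, 0.5], x=x)

# Normalise the two joint scores to get the posterior probability of spam
joint_spam, joint_ham = math.exp(spam_score), math.exp(ham_score)
p_spam = joint_spam / (joint_spam + joint_ham)
print("P(spam | x) =", round(p_spam, 4))
```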
The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes is what they model about each feature. Bernoulli Naive Bayes records only whether a feature occurs, and it explicitly counts the absence of a feature as evidence. Multinomial Naive Bayes models how many times each feature occurs, and a feature with a count of zero contributes nothing to the likelihood.

Both models are simple and fast to train. As a rule of thumb, Bernoulli Naive Bayes suits binary indicators or short documents, where presence or absence carries most of the signal, while Multinomial Naive Bayes suits longer documents where word frequency matters.

Here is a table that summarizes the key differences between Bernoulli Naive Bayes and Multinomial Naive Bayes:

| Feature | Bernoulli Naive Bayes | Multinomial Naive Bayes |
|---|---|---|
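| Type of features | Binary (present or absent) | Non-negative counts or frequencies |
| What is modeled | Whether each feature occurs; absence counts as evidence | How often each feature occurs |
| scikit-learn class | `BernoulliNB` (binarizes input, `binarize=0.0` by default) | `MultinomialNB` |
| Typical use | Binary indicators, short documents | Word counts in longer documents |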

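The difference can be seen on a toy example. The sketch below, which assumes scikit-learn is installed and uses a made-up word-count matrix, checks that `BernoulliNB` only looks at presence or absence, while `MultinomialNB` reacts to how often a word is repeated.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix (rows = documents, columns = words); values are made up
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 4, 1],
                     [0, 1, 2]])
y = np.array([1, 1, 0, 0])

# BernoulliNB binarizes its input (binarize=0.0 by default), so fitting on raw
# counts should give the same model as fitting on presence/absence indicators
X_presence = (X_counts > 0).astype(int)
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_presence = BernoulliNB().fit(X_presence, y)
same_after_binarizing = np.allclose(
    bnb_counts.predict_proba(X_counts),
    bnb_presence.predict_proba(X_presence),
)
print("BernoulliNB ignores count magnitudes:", same_after_binarizing)

# MultinomialNB uses the counts themselves, so repeating word 0 shifts the posterior
mnb = MultinomialNB().fit(X_counts, y)
doc_once = np.array([[1, 1, 0]])
doc_repeated = np.array([[5, 1, 0]])
print("MultinomialNB P(class 1), word 0 once:      ", mnb.predict_proba(doc_once)[0, 1])
print("MultinomialNB P(class 1), word 0 five times:", mnb.predict_proba(doc_repeated)[0, 1])
```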
### Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes has no built-in mechanism for missing values: the model expects every feature to be either present (1) or absent (0). scikit-learn's `BernoulliNB`, for instance, raises an error if the input contains `NaN`.

For example, let's say we have a dataset of documents, and one of the features is the presence of the word "the". If that value is missing for a document, the classifier cannot use the feature as-is. Because Naive Bayes multiplies independent per-feature likelihoods, one principled option is to leave the missing feature out of the product for that document, so the prediction relies only on the observed features. Library implementations generally do not do this automatically, so missing values are usually handled during preprocessing.

Simply encoding a missing value as 0 is risky, because the model then treats "unknown" as "absent" and counts it as evidence.

Here are some common ways to handle missing values before fitting Bernoulli Naive Bayes:

* **Impute the missing values:** This means replacing the missing values with some other value, such as the most frequent value (mode) of that feature. The mean is a poor fit for binary features because it is generally a fraction rather than 0 or 1.
* **Drop the instances with missing values:** This means removing the instances from the dataset that have missing values.
* **Use a different classification algorithm:** There are other classification algorithms that are more robust to missing values, such as **Decision Trees** or **Random Forests**.

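The imputation option can be sketched with a scikit-learn pipeline. This is a minimal illustration on made-up binary data, assuming scikit-learn is installed; `SimpleImputer(strategy="most_frequent")` fills each missing entry with the column's mode before `BernoulliNB` sees the data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary features with missing entries (np.nan); values are made up
X = np.array([[1, 0, np.nan],
              [1, np.nan, 0],
              [0, 1, 1],
              [np.nan, 1, 1],
              [0, 1, 0],
              [1, 0, 0]])
y = np.array([1, 1, 0, 0, 0, 1])

# Step 1: replace each NaN with the most frequent value of its column.
# Step 2: fit Bernoulli Naive Bayes on the completed matrix.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)

# The same imputation is applied automatically at prediction time
new_doc = np.array([[1, np.nan, 0]])
prediction = model.predict(new_doc)
print("Predicted class:", prediction[0])
```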
### Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. Gaussian Naive Bayes is a supervised machine learning algorithm that is used for classification tasks. It is based on Bayes' theorem, and it assumes that the features are normally distributed.

In multi-class classification, there are multiple classes that the data can be classified into. For example, a multi-class classification problem could be to classify images of flowers into different types of flowers.

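Gaussian Naive Bayes handles multiple classes naturally. For each class, it estimates a prior probability plus a mean and variance for every feature, assuming the feature is normally distributed within that class. To classify a sample, it evaluates the posterior for every class and picks the class with the highest value, so nothing changes when there are more than two classes.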

Gaussian Naive Bayes is a simple and efficient algorithm that can be used for multi-class classification. However, it is important to note that it makes the assumption that the features are normally distributed. This assumption may not be valid for all data sets, and it can lead to poor performance if the assumption is not met.

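As a small illustration of the flower example, the sketch below fits scikit-learn's `GaussianNB` on the bundled Iris dataset, which has three classes. It assumes scikit-learn is installed; the choice of Iris and of 5 folds is ours, not part of the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Iris: 150 flowers, 4 continuous features, 3 species
X, y = load_iris(return_X_y=True)

gnb = GaussianNB().fit(X, y)
print("Classes:", gnb.classes_)
# One row of feature means per class
print("Per-class feature means shape:", gnb.theta_.shape)

# predict_proba returns one posterior probability per class
first_posterior = gnb.predict_proba(X[:1])
print("Posterior for the first flower:", first_posterior.round(3))

# 5-fold cross-validated accuracy across all three classes
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print("Mean 5-fold accuracy:", scores.mean())
```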
Here are some other algorithms that can be used for multi-class classification:

* **Support Vector Machines:** Support Vector Machines (SVMs) are a powerful machine learning algorithm that can be used for both binary and multi-class classification. An SVM finds a maximum-margin hyperplane that separates two classes; multi-class problems are typically handled by combining several binary SVMs (one-vs-rest or one-vs-one).
* **Decision Trees:** Decision Trees are a simple and intuitive machine learning algorithm that can be used for both binary and multi-class classification. Decision Trees work by splitting the data into smaller and smaller groups until each group is dominated by a single class or a stopping criterion is reached.
* **Random Forests:** Random Forests are an ensemble learning algorithm that combines multiple decision trees to improve the performance of the model. Each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split; the forest predicts the class chosen by most trees.

### Q5. Assignment:
**Data preparation:**

Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

**Implementation:**

Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

**Results:**

Report the following performance metrics for each classifier:

* Accuracy
* Precision
* Recall
* F1 score

**Discussion:**

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

**Conclusion:**

Summarise your findings and provide some suggestions for future work.

**Note: This dataset contains a binary classification problem with multiple features. The dataset is relatively small, but it can be used to demonstrate the performance of the different variants of Naive Bayes on a real-world problem.**


Here are the steps involved in implementing Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python:

* Import the necessary libraries.
* Load the Spambase dataset.
* Split the dataset into a training set and a test set.
* Create three classifiers: a Bernoulli Naive Bayes classifier, a Multinomial Naive Bayes classifier, and a Gaussian Naive Bayes classifier.
* Train each classifier on the training set.
* Evaluate each classifier: 10-fold cross-validation for accuracy, plus precision, recall, and F1 score from the fitted model's predictions.
* Report the results.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the data: every column except the last is a feature, the last is the label
data = pd.read_csv('spambase.csv')
X = data.iloc[:, :-1].values  # Features
y = data.iloc[:, -1].values   # Labels (1 = spam, 0 = not spam, assuming the UCI layout)

# Hold out 20% of the rows; the classifiers below are fitted on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One instance of each Naive Bayes variant, all with default hyperparameters
classifiers = {
    'Bernoulli Naive Bayes': BernoulliNB(),
    'Multinomial Naive Bayes': MultinomialNB(),
    'Gaussian Naive Bayes': GaussianNB(),
}

for name, clf in classifiers.items():
    # Accuracy: mean over 10-fold cross-validation on the full dataset
    accuracy = np.mean(cross_val_score(clf, X, y, cv=10, scoring='accuracy'))

    # Precision, recall, F1: from a model fitted on the training split and
    # applied to the full dataset. NOTE: these are not cross-validated and
    # include rows the model was trained on, so they are likely optimistic.
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X)

    print(f'{name}:')
    print('Accuracy:', accuracy)
    print('Precision:', precision_score(y, y_pred))
    print('Recall:', recall_score(y, y_pred))
    print('F1 score:', f1_score(y, y_pred))
```

```
Bernoulli Naive Bayes:
Accuracy: 0.8839380364047911
Precision: 0.8863499699338545
Recall: 0.8130170987313844
F1 score: 0.8481012658227849
Multinomial Naive Bayes:
Accuracy: 0.7863496180326323
Precision: 0.7488399071925754
Recall: 0.7120794263651407
F1 score: 0.7299971727452643
Gaussian Naive Bayes:
Accuracy: 0.8217730830896915
Precision: 0.7009307972480777
Recall: 0.9553226696083839
F1 score: 0.8085901027077498
```

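The assignment asks for all four metrics under 10-fold cross-validation. One way to get them in a single pass is scikit-learn's `cross_validate` with several scorers. The sketch below is an illustration only: `cross_validated_metrics` is a hypothetical helper, and the demo runs on synthetic count data rather than Spambase, so its numbers are not comparable to the results reported here.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

def cross_validated_metrics(model, X, y, folds=10):
    """Mean accuracy, precision, recall and F1 over stratified k-fold CV."""
    scoring = ["accuracy", "precision", "recall", "f1"]
    results = cross_validate(model, X, y, cv=folds, scoring=scoring)
    # cross_validate stores each metric's per-fold scores under "test_<metric>"
    return {metric: float(np.mean(results[f"test_{metric}"])) for metric in scoring}

# Synthetic stand-in data (NOT Spambase): non-negative counts whose rate
# depends on the label, so that all three Naive Bayes variants can be fitted
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)
X_demo = rng.poisson(lam=1.0 + 2.0 * y_demo[:, None], size=(300, 5))

all_metrics = {}
for name, model in [("Bernoulli", BernoulliNB()),
                    ("Multinomial", MultinomialNB()),
                    ("Gaussian", GaussianNB())]:
    all_metrics[name] = cross_validated_metrics(model, X_demo, y_demo)
    print(name, all_metrics[name])
```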

**Performance Summary:**
1. **Bernoulli Naive Bayes:**
   - Accuracy: 88.4%
   - Precision: 88.6%
   - Recall: 81.3%
   - F1 Score: 84.8%

2. **Multinomial Naive Bayes:**
   - Accuracy: 78.6%
   - Precision: 74.9%
   - Recall: 71.2%
   - F1 Score: 73.0%

3. **Gaussian Naive Bayes:**
   - Accuracy: 82.2%
   - Precision: 70.1%
   - Recall: 95.5%
   - F1 Score: 80.9%

**Discussion:**

1. **Bernoulli Naive Bayes**: 
   - **Advantages**: Bernoulli Naive Bayes had the highest accuracy (88.4%), precision (88.6%), and F1 score (84.8%). Only its recall (81.3%) was lower than that of Gaussian Naive Bayes (95.5%).
   - **Why It Performed Well**: The Spambase features are continuous (word and character frequencies plus capital-letter run lengths), not binary. scikit-learn's `BernoulliNB` binarizes them at its default threshold (`binarize=0.0`), so each feature becomes "occurs in the email" or "does not occur". The result suggests that, for Naive Bayes on this dataset, whether a word occurs is a more reliable signal than how often it occurs.
   - **Limitations**: Binarization discards frequency and intensity information, so Bernoulli Naive Bayes may struggle when the magnitude of a feature matters or when no sensible threshold exists.

2. **Multinomial Naive Bayes**:
   - **Advantages**: Multinomial Naive Bayes is commonly used for text classification and is suitable when features represent counts or frequencies. In this case, however, it had the lowest accuracy (78.6%) and F1 score (73.0%) of the three variants.
   - **Limitations**: It might not perform optimally when the data is not well-modeled as multinomially distributed counts. The Spambase features are percentages and run lengths rather than raw counts, and the capital-letter run-length features take much larger values than the frequency features, so they may dominate the likelihood.

3. **Gaussian Naive Bayes**:
   - **Advantages**: Gaussian Naive Bayes is designed for continuous data. It achieved the highest recall (95.5%), meaning it identified most of the actual spam emails.
   - **Limitations**: Its precision was the lowest of the three (70.1%), so it flagged many legitimate emails as spam. Gaussian Naive Bayes assumes that features follow a Gaussian distribution within each class, which is a poor fit for word-frequency features that are skewed and often zero.

**Findings and Suggestions:**

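- **Best Performer**: Bernoulli Naive Bayes was the best performer overall, with the highest accuracy, precision, and F1 score. This balance matters for spam detection, where false positives (non-spam emails classified as spam) and false negatives (spam emails classified as non-spam) both carry costs. One caveat: only the accuracy figures are cross-validated. Precision, recall, and F1 were computed on the full dataset using models fitted on the 80% training split, so they include training rows and are likely optimistic.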

- **Limitations of Naive Bayes**: Naive Bayes models make strong independence assumptions between features, which might not always hold in real-world data. Additionally, the choice of variant (Bernoulli, Multinomial, or Gaussian) depends on the nature of the data and its distribution.

- **Future Work**: To further improve spam classification, consider the following:
  - Feature Engineering: Spambase provides precomputed features rather than raw text. With access to the raw emails, experiment with different text preprocessing techniques, such as TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings like Word2Vec.
  - Hyperparameter Tuning: Optimize hyperparameters for each Naive Bayes variant or explore other classification algorithms like Random Forests, Support Vector Machines, or deep learning methods.
  - Ensembling: Combine predictions from multiple classifiers to improve overall performance.
  - Anomaly Detection: Consider using anomaly detection methods to identify unusual patterns in spam emails.
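  - Continuous Data Handling: The Spambase features are already continuous, so explore transformations (for example, scaling or discretization) or models better suited to such data, and apply the same care to any additional continuous inputs such as email metadata.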

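As a starting point for the hyperparameter tuning suggestion, the sketch below searches over `alpha` (smoothing) and `binarize` (threshold) for `BernoulliNB` with scikit-learn's `GridSearchCV`. The grid values, the 5-fold setting, and the synthetic stand-in data are all assumptions for illustration; on Spambase the grid would need to be chosen to match the scale of the real features.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data (NOT Spambase): positive continuous features whose
# scale depends on the label, so the binarization threshold matters
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)
X_demo = rng.exponential(scale=1.0 + y_demo[:, None], size=(300, 5))

# Candidate values are illustrative, not recommendations
param_grid = {
    "alpha": [0.01, 0.1, 1.0],     # additive smoothing strength
    "binarize": [0.5, 1.0, 2.0],   # threshold for mapping a feature to 0/1
}

# Every combination is scored by 5-fold cross-validated F1
search = GridSearchCV(BernoulliNB(), param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```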
In conclusion, Bernoulli Naive Bayes performed the best on the given dataset, but the choice of algorithm should be tailored to the specific characteristics of the data and the goals of the spam detection task. Further optimization and experimentation can lead to even better results.