Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?



Let:

A: the employee uses the health insurance plan

B: the employee is a smoker

We are given the following probabilities:

P(A) = 0.7 (70% of the employees use the company's health insurance plan)

P(B|A) = 0.4 (40% of the employees who use the plan are smokers)

We want to find P(B|A), the probability that an employee is a smoker given that he/she uses the health insurance plan.

This quantity is stated directly in the problem: "40% of the employees who use the plan are smokers" is exactly the conditional probability P(B|A). No additional assumptions are needed.

Bayes' theorem, P(B|A) = P(A|B) * P(B) / P(A), is not required here because we are not reversing the direction of the conditioning. It would only help if we were given P(A|B) and P(B). Neither is given, and since P(B|not A) is unknown, P(B) cannot be computed from the law of total probability, P(B) = P(B|A) * P(A) + P(B|not A) * P(not A), without inventing a value.

What we can compute from the given values is the joint probability that an employee both uses the plan and is a smoker:

P(A and B) = P(B|A) * P(A) = 0.4 * 0.7 = 0.28

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.4.
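
As a numerical sanity check on the quantities in Q1, the sketch below computes the joint probability from the given values and then runs Bayes' theorem "there and back" for a few trial values of P(B|not A). The helper `bayes_round_trip` and the trial values are hypothetical and are not part of the question; the point is only that a consistent application of Bayes' theorem returns the given P(B|A), whatever is assumed about non-users.

```python
# Values given in the question
p_a = 0.7          # P(A): employee uses the health insurance plan
p_b_given_a = 0.4  # P(B|A): plan user is a smoker

# Joint probability P(A and B) = P(B|A) * P(A)
p_a_and_b = p_b_given_a * p_a


def bayes_round_trip(p_b_given_not_a):
    """Recover P(B|A) via Bayes' theorem for a hypothetical P(B|not A)."""
    # Law of total probability for P(B)
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    # P(A|B) from the definition of conditional probability
    p_a_given_b = p_a_and_b / p_b
    # Bayes' theorem back to P(B|A)
    return p_a_given_b * p_b / p_a


print(f"P(A and B) = {p_a_and_b:.2f}")
for guess in (0.1, 0.2, 0.5):  # hypothetical values of P(B|not A)
    print(f"P(B|not A) = {guess}: P(B|A) = {bayes_round_trip(guess):.2f}")
```

The assumed value of P(B|not A) changes P(B) and P(A|B), but it cancels out when P(B|A) is recovered.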

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes and Multinomial Naive Bayes are both variations of the Naive Bayes classifier, but they handle features and data distributions differently:

### Bernoulli Naive Bayes
- **Feature Assumption**: Assumes binary/Boolean features. Each feature is either present or absent.
- **Data Distribution**: Models the features as following a Bernoulli distribution (i.e., each feature is either 0 or 1).
- **Use Case**: Best suited for text classification tasks where the presence or absence of a word (or other features) is of interest, rather than the frequency of occurrences.
- **Example**: Given a text document, features could be the presence or absence of certain words.

### Multinomial Naive Bayes
- **Feature Assumption**: Assumes features are counts or frequencies of occurrences.
- **Data Distribution**: Models the features as following a multinomial distribution, which is suitable for count data or frequencies.
- **Use Case**: Ideal for text classification tasks where the frequency of terms or words is important. It works well when you have word counts or term frequencies as features.
- **Example**: Given a text document, features could be the number of times each word appears.

### Summary
- **Bernoulli Naive Bayes**: Assumes binary features (presence/absence).
- **Multinomial Naive Bayes**: Assumes features are counts or frequencies.

In practice, you would choose between them based on whether your features are binary or represent counts/frequencies.
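
The difference can be sketched on a toy word-count matrix. The data below is made up for illustration. The sketch relies on scikit-learn's `BernoulliNB` default of `binarize=0.0`, which turns any positive count into 1 before fitting, so it learns the same parameters from raw counts as from presence/absence flags, whereas `MultinomialNB` uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix: rows are documents, columns are words (made-up data)
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y = np.array([0, 0, 1, 1])

# BernoulliNB binarizes at 0.0 by default, so only presence/absence matters
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_binary = BernoulliNB().fit((X_counts > 0).astype(int), y)
same_model = np.allclose(bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_)
print("BernoulliNB ignores count magnitude:", same_model)

# MultinomialNB models how often each word occurs
mnb = MultinomialNB().fit(X_counts, y)

doc = np.array([[5, 1, 0]])  # a new document, as word counts
print("BernoulliNB probabilities:  ", bnb_counts.predict_proba(doc))
print("MultinomialNB probabilities:", mnb.predict_proba(doc))
```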

Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes has no built-in mechanism for missing values, so they have to be dealt with before the model is fitted:

1. **scikit-learn Behavior**: The scikit-learn implementation (`BernoulliNB`) does not skip or ignore missing values. If the input contains NaN, `fit` and `predict` raise a `ValueError`.

2. **Handling Missing Data**: Common options are to drop the rows (or columns) that contain missing values, to impute them (for binary features, typically with the most frequent value), or to add a separate binary indicator feature that records whether the original value was missing.

3. **In Theory**: Because Naive Bayes treats features as conditionally independent given the class, the likelihood is a product of per-feature terms. A missing feature could therefore be handled by leaving its term out of the product for that sample. This is a property of the model, not something scikit-learn's `BernoulliNB` does for you.

In summary, Bernoulli Naive Bayes itself does not handle missing values. Choose a strategy (dropping, imputing, or encoding missingness as a feature) that matches what a missing value means in your data, and apply it before training.
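
A quick way to see how scikit-learn's `BernoulliNB` behaves when the input contains NaN is to try it. The sketch below uses a small made-up binary matrix, shows that fitting on it fails, and then fits the same data through a pipeline with `SimpleImputer(strategy="most_frequent")`. The choice of imputation strategy is an assumption for illustration, not a recommendation for every dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up binary features with two missing entries
X = np.array([
    [1.0, 0.0, 1.0],
    [np.nan, 1.0, 0.0],
    [0.0, 1.0, np.nan],
    [1.0, 0.0, 0.0],
])
y = np.array([1, 0, 0, 1])

# Fitting directly on data with NaN is rejected by input validation
try:
    BernoulliNB().fit(X, y)
    raised = False
except ValueError as err:
    raised = True
    print("BernoulliNB rejected NaN input:", type(err).__name__)

# Impute each column with its most frequent value, then fit
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)
preds = model.predict(X)
print("Predictions after imputation:", preds)
```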


Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes. Gaussian Naive Bayes (GaussianNB) is suitable for multi-class classification tasks.

Gaussian Naive Bayes is a probabilistic classification technique that models each feature, within each class, with a normal distribution. It makes the "naive" assumption that the features are conditionally independent given the class label. Here are the key points:

1. **Assumption of Normal Distribution**: GaussianNB assumes that the features within each class follow a Gaussian (normal) distribution. This means that the likelihood of observing a particular feature value given the class label follows a bell-shaped curve.

2. **Independence Assumption**: The "naive" part of Naive Bayes comes from assuming that the features are independent of each other, given the class label. While this assumption rarely holds in practice, GaussianNB can still perform well if the features are approximately independent.

3. **Multi-Class Classification**: GaussianNB can handle multi-class classification problems. It assigns a class label to an input based on the maximum posterior probability (computed using Bayes' theorem) across all classes.

4. **Online Updates**: GaussianNB can perform online updates to model parameters via `partial_fit`. This means you can update the model incrementally as new data arrives.

Here's a simple example using Python and scikit-learn:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Example data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

# Create and fit the GaussianNB model
clf = GaussianNB()
clf.fit(X, Y)

# Predict a new data point
new_data_point = np.array([[-0.8, -1]])
predicted_class = clf.predict(new_data_point)
print(f"Predicted class: {predicted_class}")
```

Remember that GaussianNB assumes continuous features following a normal distribution within each class. If your data violates this assumption significantly, consider other Naive Bayes variants like MultinomialNB (for count features) or BernoulliNB (for binary features).
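
The multi-class and online-update points can be sketched together. The example below generates three made-up Gaussian clusters (the centers, batch size, and random seed are arbitrary assumptions) and trains `GaussianNB` in batches with `partial_fit`. The full list of classes must be passed on the first `partial_fit` call, because an individual batch is not guaranteed to contain every class.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Three made-up clusters, one per class
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])
y = rng.integers(0, 3, size=300)
X = centers[y] + rng.normal(size=(300, 2))

clf = GaussianNB()
all_classes = np.array([0, 1, 2])

# Feed the data in batches of 100, as if it arrived over time
for start in range(0, len(X), 100):
    batch = slice(start, start + 100)
    clf.partial_fit(X[batch], y[batch], classes=all_classes)

# One posterior probability per class for each query point
proba = clf.predict_proba(centers)
print("Classes:", clf.classes_)
print("Posterior probabilities at the cluster centers:")
print(proba.round(3))
```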



Here's a Python program using the GaussianNB class from the sklearn library to demonstrate multi-class classification. In this example, we'll use the Iris dataset, which is a classic dataset for classification problems with three classes.



```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Fit the model on the training data
gnb.fit(X_train, y_train)

# Predict the classes on the test data
y_pred = gnb.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```

```
Accuracy: 0.98
```

```python
# Print the per-class classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

```
Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      0.92      0.96        13
   virginica       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45
```



## Q5. Assignment:

## Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase).
This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.

## Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

## Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score


## Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

## Conclusion:
Summarise your findings and provide some suggestions for future work.

## 1. Data Preparation:
First, download the Spambase Data Set from the UCI Machine Learning Repository. You can download the dataset file (spambase.data) and load it into a pandas DataFrame.

## 2. Implementing Naive Bayes Classifiers
You need to implement three types of Naive Bayes classifiers using scikit-learn:
* Bernoulli Naive Bayes
* Multinomial Naive Bayes
* Gaussian Naive Bayes

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np

# Load the dataset (the file has no header row, so column names are assigned here)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
columns = [f'feature_{i}' for i in range(1, 58)] + ['label']
data = pd.read_csv(url, header=None, names=columns)

# Display the DataFrame
data
```

| | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | feature_10 | ... | feature_49 | feature_50 | feature_51 | feature_52 | feature_53 | feature_54 | feature_55 | feature_56 | feature_57 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00 | 0.64 | 0.64 | 0.0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.000 | 0.0 | 0.778 | 0.000 | 0.000 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.50 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.00 | 0.94 | ... | 0.000 | 0.132 | 0.0 | 0.372 | 0.180 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.00 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.010 | 0.143 | 0.0 | 0.276 | 0.184 | 0.010 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.000 | 0.137 | 0.0 | 0.137 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
| 4 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.000 | 0.135 | 0.0 | 0.135 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4596 | 0.31 | 0.00 | 0.62 | 0.0 | 0.00 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.232 | 0.0 | 0.000 | 0.000 | 0.000 | 1.142 | 3 | 88 | 0 |
| 4597 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.000 | 0.0 | 0.353 | 0.000 | 0.000 | 1.555 | 4 | 14 | 0 |
| 4598 | 0.30 | 0.00 | 0.30 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.102 | 0.718 | 0.0 | 0.000 | 0.000 | 0.000 | 1.404 | 6 | 118 | 0 |
| 4599 | 0.96 | 0.00 | 0.00 | 0.0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.057 | 0.0 | 0.000 | 0.000 | 0.000 | 1.147 | 5 | 78 | 0 |


```python
# Prepare the features and target
X = data.iloc[:, :-1]
y = data['label']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the classifiers with default hyperparameters
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Function to evaluate classifier
def evaluate_classifier(clf, X_train, X_test, y_train, y_test):
    """Fit clf on the training split and return test-set metrics."""
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    return accuracy, precision, recall, f1
```

```python
# Evaluate Bernoulli Naive Bayes
accuracy, precision, recall, f1 = evaluate_classifier(bernoulli_nb, X_train, X_test, y_train, y_test)
print("Bernoulli Naive Bayes:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
```

```
Bernoulli Naive Bayes:
Accuracy: 0.8791
Precision: 0.8883
Recall: 0.8128
F1 Score: 0.8489
```

```python
# Evaluate Multinomial Naive Bayes
accuracy, precision, recall, f1 = evaluate_classifier(multinomial_nb, X_train, X_test, y_train, y_test)
print("\nMultinomial Naive Bayes:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
```

```
Multinomial Naive Bayes:
Accuracy: 0.7820
Precision: 0.7624
Recall: 0.6950
F1 Score: 0.7271
```

```python
# Evaluate Gaussian Naive Bayes
accuracy, precision, recall, f1 = evaluate_classifier(gaussian_nb, X_train, X_test, y_train, y_test)
print("\nGaussian Naive Bayes:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
```

```
Gaussian Naive Bayes:
Accuracy: 0.8248
Precision: 0.7207
Recall: 0.9480
F1 Score: 0.8189
```
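
The assignment asks for 10-fold cross-validation, while the evaluation above uses a single train/test split. A minimal sketch of the cross-validated version is given below, using scikit-learn's `cross_validate` with a stratified 10-fold splitter. The helper name `cross_validate_nb`, the shuffling seed, and the synthetic stand-in data are assumptions made for illustration. To run it on Spambase, pass the `X` and `y` prepared earlier instead of the demo arrays. No cross-validated Spambase numbers are reported here.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

SCORING = ("accuracy", "precision", "recall", "f1")


def cross_validate_nb(X, y, n_splits=10, seed=42):
    """Return mean cross-validated metrics for the three Naive Bayes variants."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    models = {
        "BernoulliNB": BernoulliNB(),
        "MultinomialNB": MultinomialNB(),
        "GaussianNB": GaussianNB(),
    }
    results = {}
    for name, model in models.items():
        scores = cross_validate(model, X, y, cv=cv, scoring=SCORING)
        # cross_validate reports one score per fold under "test_<metric>"
        results[name] = {m: float(scores[f"test_{m}"].mean()) for m in SCORING}
    return results


# Synthetic, non-negative stand-in data (MultinomialNB requires non-negative features)
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.poisson(lam=1.0 + 2.0 * y_demo[:, None], size=(200, 5)).astype(float)

demo_results = cross_validate_nb(X_demo, y_demo)
for name, metrics in demo_results.items():
    print(name, {m: round(v, 4) for m, v in metrics.items()})
```

Averaging over ten folds gives a more stable estimate than a single split, and using the same splitter for all three models keeps the comparison on identical folds.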


## 3. Discussion
## Results Analysis:
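1. Performance Comparison: On the 30% hold-out test set, Bernoulli Naive Bayes achieved the highest accuracy (0.8791), precision (0.8883), and F1 score (0.8489). Gaussian Naive Bayes had the highest recall (0.9480) but the lowest precision (0.7207), meaning it catches most spam at the cost of flagging many legitimate messages. Multinomial Naive Bayes had the lowest accuracy (0.7820), recall (0.6950), and F1 score (0.7271). These figures come from a single train/test split, so they are point estimates rather than cross-validated averages.

2. Why the Best Model Performed Better:
* Bernoulli Naive Bayes: Assumes binary features (0 or 1). With its default `binarize=0.0`, scikit-learn converts every feature to "zero" versus "greater than zero", so the model only uses whether a word or character occurs at all. The sample rows above contain many zeros, and this presence/absence signal appears to be informative for spam detection.
* Multinomial Naive Bayes: Works well with count data and is typically used for text classification where feature vectors are word counts. The Spambase features are frequencies and run-length statistics on very different scales rather than raw counts, so the large-valued columns (such as feature_56 and feature_57) can dominate the likelihood, which likely hurts this model.
* Gaussian Naive Bayes: Assumes features are normally distributed within each class. The features here are non-negative and zero-heavy, which is far from normal and is a plausible reason for the low precision.

3. Limitations of Naive Bayes:

  * Assumes the features are conditionally independent given the class, which is rarely true for word frequencies in real email.
  * May not perform well if the features are not distributed according to the assumptions of the model (e.g., Gaussian distribution for Gaussian Naive Bayes, counts for Multinomial Naive Bayes).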

## 4. Conclusion
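On this split, Bernoulli Naive Bayes gave the best overall balance (accuracy 0.8791, F1 score 0.8489), Gaussian Naive Bayes gave the highest recall (0.9480) with the lowest precision (0.7207), and Multinomial Naive Bayes was the weakest overall (accuracy 0.7820). The results suggest that a model performs better when its distributional assumption matches the way the features are represented. Possible future work includes evaluating with 10-fold cross-validation as the assignment specifies, trying different feature engineering techniques, tuning hyperparameters such as `alpha`, `binarize`, and `var_smoothing`, or using more advanced classifiers.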
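
One feature engineering idea that could be tried as future work is to apply a log transform before Gaussian Naive Bayes, since `log1p` compresses long right tails in non-negative features. The sketch below is only an illustration under that assumption: the pipeline name `log_gnb` and the synthetic skewed data are made up, and whether the transform actually helps on Spambase would need to be measured.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# log1p(x) = log(1 + x) is defined at zero and compresses large values
log_gnb = make_pipeline(FunctionTransformer(np.log1p), GaussianNB())

# Synthetic right-skewed, non-negative stand-in data (not Spambase)
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.exponential(scale=1.0 + 3.0 * y_demo[:, None], size=(200, 4))

log_gnb.fit(X_demo, y_demo)
demo_accuracy = log_gnb.score(X_demo, y_demo)
print(f"Training accuracy on the synthetic data: {demo_accuracy:.4f}")
```

Because the transform lives inside the pipeline, it is applied consistently at fit and predict time, and the whole pipeline can be passed to a cross-validation routine as a single estimator.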