In [None]:
Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

The question asks for the probability that an employee is a smoker given that he/she uses the health insurance plan, and this value is stated directly in the problem. Let's denote the following:

- \( P(S) \) = Probability that an employee is a smoker.
- \( P(H) \) = Probability that an employee uses the health insurance plan.
- \( P(S|H) \) = Probability that an employee is a smoker given that he/she uses the health insurance plan.

We are given the following probabilities:
- \( P(H) = 0.70 \) (70% of employees use the health insurance plan).
- \( P(S|H) = 0.40 \) (40% of employees who use the plan are smokers).

Bayes' theorem relates \( P(S|H) \) to the reverse conditional probability:

\[ P(S|H) = \frac{P(H|S) \times P(S)}{P(H)} \]

However, neither \( P(H|S) \) nor \( P(S) \) is given, and neither is needed here. The statement "40% of the employees who use the plan are smokers" is already the conditional probability \( P(S|H) \), so no further calculation is required.

Here's the Python code that records the given values and reports \( P(S|H) \):

# Given probabilities
P_H = 0.70  # Probability that an employee uses the health insurance plan
P_S_given_H = 0.40  # Probability that an employee is a smoker given that he/she uses the health insurance plan

# P(S|H) is given directly in the problem, so Bayes' theorem is not needed
print("Probability that an employee is a smoker given that he/she uses the health insurance plan:", P_S_given_H)

Output:
Probability that an employee is a smoker given that he/she uses the health insurance plan: 0.4


So, the probability that an employee is a smoker given that he/she uses the health insurance plan is \( 0.40 \) or \( 40\% \).
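
As a small extension of the given values, the multiplication rule \( P(S \cap H) = P(S|H) \times P(H) \) gives the share of all employees who both use the plan and smoke. The sketch below is only an illustration; the variable names are chosen here and are not part of the original exercise.

```python
import math

p_h = 0.70          # P(H): employee uses the health insurance plan
p_s_given_h = 0.40  # P(S|H): smoker, given that the employee uses the plan

# Multiplication rule: P(S and H) = P(S|H) * P(H)
p_s_and_h = p_s_given_h * p_h

# Definition of conditional probability recovers the given value
recovered = p_s_and_h / p_h

print(f"P(S and H) = {p_s_and_h:.2f}")
print(f"P(S|H)     = {recovered:.2f}")
```

Dividing the joint probability by \( P(H) \) returns the original 0.40, which confirms that the conditional probability was the given quantity all along.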

In [None]:
Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the type of features they are designed to handle and the underlying distribution assumption of the feature probabilities.

1. **Bernoulli Naive Bayes**:
   - **Features**: Bernoulli Naive Bayes is suitable for binary feature data, where each feature represents the presence or absence of a particular attribute.
   - **Feature Representation**: It assumes that features are binary-valued (0 or 1), indicating the absence or presence of a particular attribute or term.
   - **Example**: It is commonly used for text classification tasks, such as sentiment analysis, where features represent the presence or absence of specific words in a document.
   - **Probability Distribution**: It models the presence of each feature in each class using a Bernoulli distribution, so the absence of a feature also contributes to the likelihood.

2. **Multinomial Naive Bayes**:
   - **Features**: Multinomial Naive Bayes is suitable for discrete count data, where features represent counts or frequencies of occurrences of discrete items.
   - **Feature Representation**: It assumes that features are integer counts or frequencies of occurrences of various categories or terms.
   - **Example**: It is commonly used for text classification tasks, such as document categorization, where features represent the counts of words or terms in a document.
   - **Probability Distribution**: It models the probability of each feature occurring in each class using a Multinomial distribution.
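
The difference can be seen on a tiny, made-up count matrix. The sketch below assumes scikit-learn's default settings, where `BernoulliNB` binarizes its input at a threshold of 0, so it learns the same model from counts as from presence/absence flags, while `MultinomialNB` does use the count magnitudes. The data and variable names are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix: 4 documents, 3 terms (made-up values)
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y = np.array([0, 0, 1, 1])

# Presence/absence version of the same data
X_binary = (X_counts > 0).astype(int)

# BernoulliNB only looks at whether a term occurs
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_binary = BernoulliNB().fit(X_binary, y)
same_bernoulli = np.allclose(
    bnb_counts.predict_proba(X_counts), bnb_binary.predict_proba(X_binary)
)

# MultinomialNB uses how often a term occurs
mnb_counts = MultinomialNB().fit(X_counts, y)
mnb_binary = MultinomialNB().fit(X_binary, y)
same_multinomial = np.allclose(
    mnb_counts.feature_log_prob_, mnb_binary.feature_log_prob_
)

print("BernoulliNB unchanged by binarizing:  ", same_bernoulli)
print("MultinomialNB unchanged by binarizing:", same_multinomial)
```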

In [None]:
Q3. How does Bernoulli Naive Bayes handle missing values?

The Bernoulli Naive Bayes model has no built-in mechanism for missing values. It assumes that every feature is binary-valued (0 or 1), so a missing entry has to be resolved before or during the likelihood calculation. In practice, implementations such as scikit-learn's `BernoulliNB` reject input that contains NaN values.

Two conventions are common. The first treats a missing value as the absence of the corresponding feature (0), which is only appropriate when "not recorded" really means "not present". The second skips the missing feature for that instance and computes the class probabilities from the available features only. The independence assumption makes this possible, because each feature contributes its own factor to the likelihood, but it requires a custom implementation.

In summary, Bernoulli Naive Bayes does not explicitly handle missing values, so the data should be preprocessed before training the classifier to ensure accurate classification results. Common approaches to handling missing values include imputation (replacing missing values with estimated values) or excluding instances with missing values from the analysis.
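
A minimal sketch of the imputation approach, assuming scikit-learn is used: `SimpleImputer` with the `most_frequent` strategy fills each missing binary value before `BernoulliNB` sees the data. The toy matrix and the choice of strategy are illustrative assumptions, not a recommendation for every dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary features with two missing entries (made-up values)
X = np.array([
    [1, 0, 1],
    [0, np.nan, 1],
    [1, 1, 0],
    [0, 0, np.nan],
    [1, 1, 1],
    [0, 0, 0],
], dtype=float)
y = np.array([1, 0, 1, 0, 1, 0])

# Fitting directly on data with NaN is rejected by scikit-learn
try:
    BernoulliNB().fit(X, y)
    raised = False
except ValueError:
    raised = True
print("Fitting on NaN raised an error:", raised)

# Impute each column's most frequent value, then fit the classifier
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)
pred = model.predict(X)
print("Predictions after imputation:", pred)
```

Putting the imputer inside the pipeline means that, under cross-validation, the fill values are learned from the training folds only.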

In [None]:
Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. Gaussian Naive Bayes is a variant of the Naive Bayes algorithm that is suitable for continuous-valued features assumed to follow a Gaussian (normal) distribution. It is commonly used for classification tasks where the features are continuous and can take on real-numbered values.

Gaussian Naive Bayes extends naturally to multiple classes. Each class is modeled by its own set of Gaussian distributions, characterized by the mean and variance of each feature for that class. During classification, the algorithm calculates the likelihood of the instance's feature values under each class using the Gaussian probability density function and combines it with the class prior.

Here's how Gaussian Naive Bayes can be used for multi-class classification:

1. **Model Training**: For each class \( c \), Gaussian Naive Bayes estimates the mean (\( \mu_c \)) and variance (\( \sigma^2_c \)) of each feature based on the training data for that class. This involves calculating the mean and variance of each feature for each class using the available training instances.

2. **Classification**: Given a new instance with feature values \( x_1, x_2, \ldots, x_n \), Gaussian Naive Bayes calculates the likelihood of each class \( c \) as the product of the likelihoods of each feature value given that class, assuming independence between features:

   \[ P(x_1, x_2, \ldots, x_n | c) = P(x_1 | c) \times P(x_2 | c) \times \ldots \times P(x_n | c) \]

   The likelihood of each feature value \( x_i \) given class \( c \) is calculated using the Gaussian probability density function with the mean \( \mu_{c,i} \) and variance \( \sigma^2_{c,i} \) estimated during training.

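3. **Prediction**: The likelihood is multiplied by the class prior \( P(c) \), estimated from the class frequencies in the training data, and the class with the highest posterior probability is predicted as the output class for the new instance:

   \[ \hat{c} = \arg\max_c \; P(c) \times P(x_1 | c) \times P(x_2 | c) \times \ldots \times P(x_n | c) \]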
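
The three steps above can be sketched in a few lines of NumPy for a three-class problem. This is an illustrative from-scratch version on synthetic data, not scikit-learn's `GaussianNB` (which additionally applies variance smoothing); the cluster centers, sample sizes, and function name are assumptions made for the example.

```python
import numpy as np

# Synthetic data: three well-separated classes, two continuous features
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([rng.normal(c, 1.0, size=(50, 2)) for c in centers])
y = np.repeat([0, 1, 2], 50)

# Step 1 (training): per-class prior, per-class and per-feature mean and variance
classes = np.unique(y)
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])

def log_posterior(x):
    """Unnormalized log posterior of every class for one instance x."""
    # Step 2: sum of per-feature Gaussian log densities (independence assumption)
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=1
    )
    # Add the log prior of each class
    return np.log(priors) + log_lik

# Step 3 (prediction): pick the class with the highest posterior
pred = np.array([classes[np.argmax(log_posterior(x))] for x in X])
accuracy = np.mean(pred == y)
print(f"Training accuracy on the synthetic data: {accuracy:.2f}")
```

Working with log probabilities avoids numerical underflow when many small feature likelihoods are multiplied together.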

In [None]:
Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase).
This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.
Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.
Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score
Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
Conclusion:
Summarise your findings and provide some suggestions for future work.

Note: This dataset contains a binary classification problem with multiple features. The dataset is
relatively small, but it can be used to demonstrate the performance of the different variants of Naive
Bayes on a real-world problem.

To complete this assignment, you can follow these steps:

1. **Data Preparation**:
   - Download the "Spambase Data Set" from the provided link.
   - Load the dataset into your Python environment, for example using pandas.
   - Preprocess the data as necessary, such as handling missing values, scaling features, or encoding categorical variables if needed.

2. **Implementation**:
   - Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using scikit-learn.
   - Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You can use `cross_validate` from scikit-learn for this purpose, since it accepts several scoring metrics at once (`cross_val_score` accepts only one).
   - For each classifier, collect the accuracy, precision, recall, and F1 score for each fold of the cross-validation and average them.

3. **Results**:
   - Report the performance metrics (accuracy, precision, recall, F1 score) for each classifier.
   - Compare the performance of the three Naive Bayes variants and discuss the results. Analyze which variant performed the best and why you think that is the case.
   - Identify any limitations or challenges you encountered while working with the Naive Bayes classifiers.

4. **Conclusion**:
   - Summarize your findings from the experiment.
   - Provide suggestions for future work, such as exploring different preprocessing techniques, experimenting with different hyperparameters, or trying other machine learning algorithms for comparison.

Here's a sample code outline to get you started:

import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the dataset (the data file has no header row)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
data = pd.read_csv(url, header=None)

# Split features and target variable (the last column is the spam label)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Initialize classifiers with default hyperparameters
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Perform 10-fold cross-validation and calculate performance metrics.
# cross_validate accepts a list of metrics; cross_val_score accepts only one.
metrics = ['accuracy', 'precision', 'recall', 'f1']
for clf, name in [(bernoulli_nb, 'Bernoulli NB'), (multinomial_nb, 'Multinomial NB'), (gaussian_nb, 'Gaussian NB')]:
    scores = cross_validate(clf, X, y, cv=10, scoring=metrics)
    print(f"{name}:")
    for metric in metrics:
        key = f"test_{metric}"  # cross_validate stores each metric under "test_<name>"
        print(f"  {metric.capitalize()}: {scores[key].mean():.2f}")

This outline assumes that the data file has no header row and that the last column holds the target variable (1 for spam, 0 for not spam); adjust the column selection if your copy of the dataset is laid out differently. Finally, adapt the code to handle any necessary preprocessing steps specific to your dataset.
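
For the Results step, the per-classifier averages can be collected into a single table. The sketch below is one possible way to do this and runs offline on synthetic, non-negative count data that merely stands in for Spambase; the helper name `summarize`, the synthetic data, and the random seed are all assumptions made for illustration, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic stand-in data: 400 samples, 10 non-negative count features.
# The first 5 features have a higher rate for class 1 than for class 0.
rng = np.random.default_rng(42)
n_samples = 400
y_demo = rng.integers(0, 2, size=n_samples)
rates = np.where(y_demo[:, None] == 1, 3.0, 1.0) * np.ones((n_samples, 10))
rates[:, 5:] = 1.0
X_demo = rng.poisson(rates)

def summarize(classifiers, X, y, cv=10):
    """Return a table of mean cross-validated metrics, one row per classifier."""
    scoring = ["accuracy", "precision", "recall", "f1"]
    rows = {}
    for name, clf in classifiers.items():
        result = cross_validate(clf, X, y, cv=cv, scoring=scoring)
        rows[name] = {m: result[f"test_{m}"].mean() for m in scoring}
    return pd.DataFrame.from_dict(rows, orient="index")

table = summarize(
    {
        "Bernoulli NB": BernoulliNB(),
        "Multinomial NB": MultinomialNB(),
        "Gaussian NB": GaussianNB(),
    },
    X_demo,
    y_demo,
)
print(table.round(3))
```

Passing the real `X` and `y` from the outline above to `summarize` would produce the same kind of table for the Spambase data.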