Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

ANS- This problem can be solved using conditional probability.

Let's denote:
- \(A\) as the event that an employee uses the health insurance plan.
- \(B\) as the event that an employee is a smoker.

The probability that an employee uses the health insurance plan is \(P(A) = 0.70\) (given as 70% or 0.70).

The probability that an employee who uses the health insurance plan is a smoker is \(P(B|A) = 0.40\) (given as 40% or 0.40).

We want to find \(P(B|A)\), the probability that an employee is a smoker given that they use the health insurance plan. The problem statement gives this value directly, so no further calculation is needed. For reference, the definition of conditional probability (not Bayes' theorem) is:

\[ P(B|A) = \frac{P(A \cap B)}{P(A)} \]

Rearranging it with \(P(B|A) = 0.40\) and \(P(A) = 0.70\) gives the joint probability that an employee both uses the plan and is a smoker:

\[ P(A \cap B) = P(B|A) \times P(A) \]
\[ P(A \cap B) = 0.40 \times 0.70 = 0.28 \]

Dividing the joint probability by \(P(A)\) recovers the same value: \(0.28 / 0.70 = 0.40\).

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan (\(P(B|A)\)) is \(0.40\), or 40%.
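
The arithmetic can be checked with a few lines of Python. This is a minimal sketch; the variable names are illustrative, and `Fraction` is used only to avoid floating-point rounding in the comparison.

```python
from fractions import Fraction

p_a = Fraction(70, 100)          # P(A): employee uses the health insurance plan
p_b_given_a = Fraction(40, 100)  # P(B|A): smoker, given that the employee uses the plan

# Joint probability from the rearranged definition of conditional probability
p_a_and_b = p_b_given_a * p_a

# Dividing the joint probability by P(A) should give P(B|A) back
recovered = p_a_and_b / p_a

print(float(p_a_and_b))   # P(A ∩ B)
print(float(recovered))   # P(B|A)
```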

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

ANS- The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the types of features they are designed to handle and how they model the data.

1. **Bernoulli Naive Bayes**:
   - Designed for features that are binary or Boolean (0 or 1).
   - Commonly used in text classification tasks where the presence or absence of each word in a document is represented as a binary feature (e.g., a binarized bag-of-words model).
   - Considers only the presence or absence of a feature, not its frequency or count.

2. **Multinomial Naive Bayes**:
   - Typically used for features that represent counts or frequencies.
   - Often applied in text classification where features are word counts per document. Fractional weights such as TF-IDF values often work in practice too, although the model is defined for counts.
   - Considers the frequency of each feature within each class.

A practical consequence is that Bernoulli Naive Bayes uses the absence of a feature as evidence: a word that does not occur in a document lowers the likelihood of classes in which that word is common. Multinomial Naive Bayes only accumulates evidence from the features that do occur, weighted by how often they occur.

Choosing between these models depends on the nature of the data. If the features are binary, or only occurrence matters, Bernoulli Naive Bayes is the natural fit. If the features represent counts (like word counts in documents) and frequency carries information, Multinomial Naive Bayes is usually the better choice.
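
The difference shows up in what each model counts during training. The sketch below uses a small made-up count matrix (hypothetical data, three "word" features) and scikit-learn's default settings. `BernoulliNB` binarizes its input with `binarize=0.0` by default, so it records only how many documents contain each word, while `MultinomialNB` sums the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word-count matrix: 4 documents x 3 words
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y = np.array([1, 1, 0, 0])

mnb = MultinomialNB().fit(X_counts, y)
bnb = BernoulliNB().fit(X_counts, y)  # default binarize=0.0 maps every count > 0 to 1

# Fitting BernoulliNB on explicitly binarized data should give the same model
bnb_binary = BernoulliNB().fit((X_counts > 0).astype(int), y)

print(mnb.feature_count_)  # per-class sums of the word counts
print(bnb.feature_count_)  # per-class numbers of documents containing each word
```

Rows of `feature_count_` follow the sorted class labels in `classes_`, so the first row belongs to class 0 and the second to class 1.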

Q3. How does Bernoulli Naive Bayes handle missing values?

ANS- The Bernoulli Naive Bayes model has no built-in mechanism for missing values. The conditional independence assumption does make them easy to accommodate in principle: each feature contributes its own factor to the likelihood, so the factor for a missing feature can simply be left out. Note that a missing value is not the same as an absent feature (value 0). Absence is evidence that the model uses, whereas a missing value carries no information.

In principle, missing values can be handled in Bernoulli Naive Bayes as follows:

1. **During Training**:
   - Instances with a missing value for a feature are excluded when estimating the probabilities for that specific feature.
   - The missing value counts neither towards the presence nor towards the absence of that feature for a given class.
   - The probabilities are estimated from the available data only.

2. **During Prediction**:
   - When predicting the class for a new instance with missing feature values, the class probabilities are calculated from the available features only.
   - The factors for the missing features are omitted from the product, rather than being treated as 0 (absent).

Library implementations do not necessarily work this way. scikit-learn's `BernoulliNB`, for example, rejects input that contains NaN, so missing values have to be handled before training. Common strategies include imputation (replacing missing values with estimated values) or encoding missingness as a separate binary indicator feature if that is suitable for the context. Either choice can introduce bias, so it should be made with the data in mind.
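
As an illustration of the preprocessing route, here is a minimal sketch using scikit-learn's `SimpleImputer` in a pipeline in front of `BernoulliNB`. The data is made up, and most-frequent imputation with `add_indicator=True` is only one possible choice, not a recommendation for every dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with some missing entries (np.nan)
X = np.array([
    [1, 0, np.nan],
    [1, 1, 0],
    [0, np.nan, 1],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
], dtype=float)
y = np.array([1, 1, 0, 0, 1, 0])

# Fill each missing entry with the column's most frequent value and append
# a binary "was missing" indicator column for columns that had missing values.
model = make_pipeline(
    SimpleImputer(strategy="most_frequent", add_indicator=True),
    BernoulliNB(),
)
model.fit(X, y)

# A new instance with a missing value goes through the same imputation step
X_new = np.array([[1, np.nan, 0]])
prediction = model.predict(X_new)
print(prediction)
```

Keeping the imputer inside the pipeline means it is refit on each training fold during cross-validation, which avoids leaking information from the held-out fold.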

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

ANS- Yes, Gaussian Naive Bayes can be used for multi-class classification.

In Gaussian Naive Bayes, the assumption is that the features within each class are distributed according to a Gaussian (normal) distribution. This makes the model suitable for continuous data: the likelihood of each feature given the class is modeled as a Gaussian with a mean and variance specific to that class-feature combination.

For multi-class problems nothing changes in the model. Gaussian Naive Bayes fits a separate Gaussian (mean and variance) to each feature within each class, however many classes there are. To classify a new instance, it combines each class's prior probability with the product of the per-feature Gaussian likelihoods for that class.

During prediction for multi-class scenarios, Gaussian Naive Bayes computes the probability of an instance belonging to each class and assigns the class with the highest probability as the predicted class for that instance.

So Gaussian Naive Bayes handles multi-class classification natively. Binary classification is just the special case with two classes, and no one-vs-rest or similar adaptation is required.
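
A short sketch on a three-class problem illustrates this. It uses the Iris dataset bundled with scikit-learn, which is an assumption made for the example and not part of the assignment.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# 150 samples, 4 continuous features, 3 classes
X, y = load_iris(return_X_y=True)

clf = GaussianNB().fit(X, y)

print(clf.classes_)      # the class labels seen during training
print(clf.theta_.shape)  # one fitted mean per (class, feature) pair

# One probability per class for each instance; the prediction is the arg max
proba = clf.predict_proba(X[:5])
predicted = clf.classes_[proba.argmax(axis=1)]
print(proba.shape, predicted)
```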

Q5. Assignment:

**Data preparation**: Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

**Implementation**: Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

**Results**: Report the following performance metrics for each classifier:
- Accuracy
- Precision
- Recall
- F1 score

**Discussion**: Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

**Conclusion**: Summarise your findings and provide some suggestions for future work.

```python
!pip install scikit-learn
```

```python
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the dataset (the raw file has no header row)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
column_names = [
    "word_freq_make", "word_freq_address",  # Add column names from the dataset documentation
    # ... (Include all column names)
    "capital_run_length_average", "capital_run_length_longest", "capital_run_length_total", "is_spam"
]
# Caution: the list must name every column in the file before this is run.
# If it is shorter than the file's column count, pandas assigns the names to the
# trailing columns and treats the unnamed leading columns as an index, so X
# would contain only a few of the features.
data = pd.read_csv(url, names=column_names)

# Prepare data and target variable
X = data.drop('is_spam', axis=1)  # Drop the target column to get features
y = data['is_spam']  # Assign the target column as 'is_spam'

# Initialize classifiers with default hyperparameters
classifiers = {
    'Bernoulli NB': BernoulliNB(),
    'Multinomial NB': MultinomialNB(),
    'Gaussian NB': GaussianNB(),
}

# Define evaluation metrics
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Perform 10-fold cross-validation for each classifier
for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=10, scoring=scoring)

    # Print results for each classifier; cross_validate also returns
    # fit_time and score_time, so those appear next to the test_* metrics
    print(f"{name} Classifier:")
    for metric, values in scores.items():
        mean_score = values.mean()  # Mean over the 10 folds
        print(f"{metric.capitalize()}: {mean_score:.4f}")  # Rounded to 4 decimal places
    print("\n")
```


```text
Bernoulli NB Classifier:
Fit_time: 0.0129
Score_time: 0.0114
Test_accuracy: 0.7614
Test_precision: 0.7324
Test_recall: 0.6316
Test_f1: 0.6759


Multinomial NB Classifier:
Fit_time: 0.0093
Score_time: 0.0083
Test_accuracy: 0.5503
Test_precision: 0.4468
Test_recall: 0.5704
Test_f1: 0.5004


Gaussian NB Classifier:
Fit_time: 0.0073
Score_time: 0.0072
Test_accuracy: 0.7164
Test_precision: 0.8669
Test_recall: 0.3299
Test_f1: 0.4760
```
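
Two details of the cell above are worth double-checking before interpreting these numbers. First, when `pd.read_csv` receives fewer `names` than the file has columns, pandas uses the unnamed leading columns as an index, so a shortened name list would leave only a handful of features in `X`. Checking `X.shape` after loading makes this visible. Second, the printed summary includes `Fit_time` and `Score_time` because `cross_validate` returns them alongside the `test_*` scores. The sketch below shows one way to sidestep both, assuming the file layout is "feature columns followed by the class label in the last column". The helper names are hypothetical, and a small synthetic CSV stands in for the real file so the example runs without a download.

```python
import io

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB


def load_features_and_label(source):
    """Read a headerless CSV whose last column is the class label."""
    data = pd.read_csv(source, header=None)  # columns are numbered 0..n-1
    X = data.iloc[:, :-1]  # every column except the last
    y = data.iloc[:, -1]   # the last column
    return X, y


def mean_test_scores(scores):
    """Average only the test_* entries of a cross_validate result."""
    prefix = "test_"
    return {
        key[len(prefix):]: values.mean()
        for key, values in scores.items()
        if key.startswith(prefix)
    }


# Synthetic stand-in for the real file (hypothetical data: 3 features + label)
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
features = rng.poisson(lam=1.0 + 2.0 * labels[:, None], size=(40, 3))
csv_text = pd.DataFrame(np.column_stack([features, labels])).to_csv(header=False, index=False)

X, y = load_features_and_label(io.StringIO(csv_text))
print(X.shape)  # confirms how many feature columns were actually loaded

scores = cross_validate(BernoulliNB(), X, y, cv=5,
                        scoring=["accuracy", "precision", "recall", "f1"])
summary = mean_test_scores(scores)
print(summary)
```

For the real dataset, the `url` from the cell above could be passed to `load_features_and_label` in place of the in-memory file, with `cv=10` as the assignment requires.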


