
## Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan? 
The survey states this conditional probability directly: 40% of the employees who use the plan are smokers, so P(smoker | uses plan) = 0.40. The 70% figure is the probability of using the plan, P(uses plan), and is not needed to answer the question. The Python snippet below confirms the value through the definition of conditional probability:



In [None]:

# Given probabilities
P_A = 0.70  # P(A): an employee uses the health insurance plan
P_B_given_A = 0.40  # P(B|A): an employee is a smoker, given that he/she uses the plan

# P(B|A) is stated in the question, so Bayes' theorem is not required.
# Consistency check via the definition of conditional probability:
# P(B|A) = P(A and B) / P(A)
P_A_and_B = P_B_given_A * P_A  # Joint probability of using the plan and smoking
P_B_given_A_check = P_A_and_B / P_A

# Rounded to hide floating-point noise
print("Probability that an employee is a smoker given that he/she uses the health insurance plan:", round(P_B_given_A_check, 2))

## Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes? 

The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the type of data they are designed to handle:

1. **Bernoulli Naive Bayes**: This classifier is suitable for features that are binary-valued, meaning they can take on only two values, typically 0 and 1. It assumes that features are generated from a Bernoulli distribution, where each feature represents the presence or absence of a particular attribute. It's commonly used in text classification tasks where features are represented as binary indicators (e.g., word presence/absence).

2. **Multinomial Naive Bayes**: This classifier is designed for features that describe counts or frequencies, typically occurring in text data. It assumes that features follow a multinomial distribution, where each feature represents the frequency of a particular term or word in a document. It's commonly used in text classification tasks where features are represented as counts of word occurrences (e.g., word frequency).

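In summary, Bernoulli Naive Bayes suits binary features, while Multinomial Naive Bayes suits count-based features. One practical consequence: the Bernoulli model explicitly penalizes the absence of a feature that is indicative of a class, whereas the multinomial model simply ignores terms that do not occur in a document.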
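
To make the contrast concrete, here is a minimal sketch using scikit-learn. The toy word-count matrix, the labels, and the names `X_counts` and `doc` are invented for illustration and are not part of the assignment data. It shows that `BernoulliNB` (with its default `binarize=0.0`) only sees whether a word occurs, while `MultinomialNB` responds to how often it occurs.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy corpus: rows are documents, columns are counts of three words.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 2, 2],
])
y = np.array([1, 1, 0, 0])  # illustrative class labels

# MultinomialNB models the counts themselves.
multinomial = MultinomialNB().fit(X_counts, y)

# BernoulliNB models presence/absence only; binarize=0.0 (the default)
# maps every count greater than 0 to 1 before fitting and predicting.
bernoulli = BernoulliNB(binarize=0.0).fit(X_counts, y)

# Two documents that contain the same word, once vs. five times.
doc_once = np.array([[1, 0, 0]])
doc_many = np.array([[5, 0, 0]])

print("Multinomial, 1 occurrence: ", multinomial.predict_proba(doc_once))
print("Multinomial, 5 occurrences:", multinomial.predict_proba(doc_many))
print("Bernoulli,   1 occurrence: ", bernoulli.predict_proba(doc_once))
print("Bernoulli,   5 occurrences:", bernoulli.predict_proba(doc_many))
```

The Bernoulli posteriors are identical for both documents because the count is discarded, whereas the multinomial posterior becomes more confident as the count grows.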


## Q3. How does Bernoulli Naive Bayes handle missing values? 

Bernoulli Naive Bayes has no built-in mechanism for missing values, so they have to be dealt with before training (scikit-learn's `BernoulliNB`, for example, rejects inputs that contain NaN). Here are two common approaches:

1. **Treating missing values as a separate category**: Because a Bernoulli feature has only two states, the "missing" state is usually encoded as an additional binary indicator feature (1 if the original value was missing, 0 otherwise). During training, the classifier then learns, for each class, how likely the value is to be missing, alongside the probability of the feature being present or absent. When making predictions for instances with missing values, the classifier incorporates the learned probabilities for the indicator.

2. **Imputing missing values**: Another approach is to impute missing values with a specific value before training the classifier. For binary features in Bernoulli Naive Bayes, missing values can be imputed with either 0 or 1, depending on the context. The choice of imputation value may depend on domain knowledge or the characteristics of the dataset. After imputation, the classifier is trained using the imputed dataset as if the missing values were observed.

The choice between these approaches depends on the nature of the data and the problem at hand. It's essential to consider the potential impact of missing values on the classification performance and choose the approach that best suits the specific requirements of the task.
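
Both approaches can be combined in one preprocessing step. The following is a minimal sketch, assuming scikit-learn's `SimpleImputer` with `add_indicator=True`; the tiny feature matrix and labels are made up for illustration. Missing entries are filled with the most frequent value of their column, and one extra binary "was missing" column is appended per feature that had missing values during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features; np.nan marks a missing value.
X = np.array([
    [1, 0, np.nan],
    [1, np.nan, 0],
    [0, 1, 1],
    [np.nan, 1, 1],
    [1, 0, 0],
    [0, 1, np.nan],
])
y = np.array([1, 1, 0, 0, 1, 0])  # illustrative labels

# Impute with the column mode and append missing-value indicator columns,
# then fit Bernoulli Naive Bayes on the resulting all-binary matrix.
model = make_pipeline(
    SimpleImputer(strategy="most_frequent", add_indicator=True),
    BernoulliNB(),
)
model.fit(X, y)

# A new instance with a missing second feature goes through the same steps.
X_new = np.array([[1, np.nan, 0]])
prediction = model.predict(X_new)
print("Predicted class:", prediction)
```

Whether the mode is a sensible fill value depends on the dataset; the indicator columns at least let the classifier learn from the fact that a value was missing.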

## Q4. Can Gaussian Naive Bayes be used for multi-class classification? 
Yes, Gaussian Naive Bayes can be used for multi-class classification. 

In Gaussian Naive Bayes, each continuous feature is assumed to follow a Gaussian (normal) distribution within each class. The classifier estimates a mean and a variance per class for every feature and uses the Gaussian probability density function to compute the likelihood of the observed feature values given each class.

For multi-class classification, nothing changes in this procedure: for every class, the classifier multiplies the class prior by the per-feature likelihoods, and it predicts the class with the highest resulting posterior probability.

Gaussian Naive Bayes therefore handles any number of classes directly, without needing a one-vs-rest or similar decomposition scheme.
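
As an illustration, here is a short sketch on the three-class Iris dataset bundled with scikit-learn. The dataset choice, the 70/30 split, and `random_state=0` are assumptions made for this example, not part of the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Iris has three classes and four continuous features.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = GaussianNB().fit(X_train, y_train)

# One Gaussian (mean and variance) is fitted per class and per feature.
print("Classes:", clf.classes_)
print("Shape of per-class feature means:", clf.theta_.shape)

# The prediction is the class with the highest posterior probability.
proba = clf.predict_proba(X_test[:1])
predicted = clf.predict(X_test[:1])
print("Posterior probabilities:", proba)
print("Predicted class:", predicted)
print("Test accuracy:", clf.score(X_test, y_test))
```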

## Q5. Assignment:
Data preparation: Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.  
Implementation: Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.  
Results: Report the following performance metrics for each classifier: accuracy, precision, recall, and F1 score.  
Discussion: Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?  
Conclusion: Summarise your findings and provide some suggestions for future work.

In [None]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
column_names = ['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d', 'word_freq_our', 'word_freq_over',
                'word_freq_remove', 'word_freq_internet', 'word_freq_order', 'word_freq_mail', 'word_freq_receive', 
                'word_freq_will', 'word_freq_people', 'word_freq_report', 'word_freq_addresses', 'word_freq_free',
                'word_freq_business', 'word_freq_email', 'word_freq_you', 'word_freq_credit', 'word_freq_your', 
                'word_freq_font', 'word_freq_000', 'word_freq_money', 'word_freq_hp', 'word_freq_hpl', 'word_freq_george',
                'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet', 'word_freq_857', 'word_freq_data',
                'word_freq_415', 'word_freq_85', 'word_freq_technology', 'word_freq_1999', 'word_freq_parts', 'word_freq_pm',
                'word_freq_direct', 'word_freq_cs', 'word_freq_meeting', 'word_freq_original', 'word_freq_project',
                'word_freq_re', 'word_freq_edu', 'word_freq_table', 'word_freq_conference', 'char_freq_;', 'char_freq_(',
                'char_freq_[', 'char_freq_!', 'char_freq_$', 'char_freq_#', 'capital_run_length_average', 'capital_run_length_longest',
                'capital_run_length_total', 'is_spam']
data = pd.read_csv(url, header=None, names=column_names)

# Split features and target variable
X = data.drop('is_spam', axis=1)
y = data['is_spam']

# Instantiate Naive Bayes classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Perform 10-fold cross-validation and calculate performance metrics
classifiers = {'Bernoulli Naive Bayes': bernoulli_nb, 
               'Multinomial Naive Bayes': multinomial_nb, 
               'Gaussian Naive Bayes': gaussian_nb}

# Display labels mapped to scikit-learn scoring names
metrics = {'Accuracy': 'accuracy', 'Precision': 'precision', 'Recall': 'recall', 'F1 Score': 'f1'}

for name, clf in classifiers.items():
    print(f"Classifier: {name}")
    for label, scoring in metrics.items():
        # Mean of the 10 per-fold scores for this metric
        scores = cross_val_score(clf, X, y, cv=10, scoring=scoring)
        print(f"{label}: {scores.mean()}")
    print("------------------------------------------")
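
The loop above runs a separate cross-validation for each metric. As a possible alternative, scikit-learn's `cross_validate` accepts several scoring names at once, so each model is fitted only once per fold. The sketch below uses a small synthetic, non-negative dataset (`X_demo`, `y_demo`, generated here purely as a stand-in so the example runs offline) rather than Spambase, so its numbers say nothing about the actual assignment results.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic stand-in data (assumption): 200 rows, 5 non-negative count features,
# with class 1 drawn from a Poisson distribution with a higher rate than class 0.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)
rates = np.where(y_demo == 1, 3.0, 1.0)[:, None] * np.ones((200, 5))
X_demo = rng.poisson(rates)

scoring = ["accuracy", "precision", "recall", "f1"]

for clf in (BernoulliNB(), MultinomialNB(), GaussianNB()):
    # One 10-fold cross-validation run yields all four metrics.
    results = cross_validate(clf, X_demo, y_demo, cv=10, scoring=scoring)
    summary = {metric: results[f"test_{metric}"].mean() for metric in scoring}
    print(type(clf).__name__, summary)
```

Swapping `X_demo` and `y_demo` for the Spambase `X` and `y` would apply the same evaluation to the real data.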
