### A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

To find the probability that an employee is a smoker given that he/she uses the health insurance plan, you can use the formula for conditional probability:

               P(A|B) = P(A ∩ B)/P(B)

Where:
- P(A|B) is the probability of event A occurring given that event B has occurred.
- P(A ∩ B) is the probability of both events A and B occurring.
- P(B) is the probability of event B occurring.

In this case, event A is "being a smoker," and event B is "using the health insurance plan."

You are given:
- P(B), the probability of using the health insurance plan, which is 70% or 0.70.
- P(A|B), the probability of being a smoker among employees who use the plan, which is 40% or 0.40. The survey reports this figure for plan users only, so it is already the conditional probability, not P(A ∩ B).

The joint probability follows from the multiplication rule:

         P(A ∩ B) = P(A|B) × P(B) = 0.40 × 0.70 = 0.28

Substituting into the formula confirms the result:

         P(A|B) = P(A ∩ B)/P(B) = 0.28/0.70 = 0.40

So, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.40, or 40%.
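
As a quick numeric check of the conditional probability formula, the following minimal sketch assumes the 40% figure is read the way the question words it, as the share of smokers among plan users. The variable names are illustrative only.

```python
# Assumption: 40% is P(smoker | uses plan), as stated in the question.
p_plan = 0.70                # P(B): employee uses the health insurance plan
p_smoker_given_plan = 0.40   # P(A|B): smoker among plan users

# Multiplication rule: P(A ∩ B) = P(A|B) * P(B)
p_smoker_and_plan = p_smoker_given_plan * p_plan

# The conditional probability formula recovers P(A|B) from the joint probability
p_check = p_smoker_and_plan / p_plan

print(f"P(A ∩ B) = {p_smoker_and_plan:.2f}")
print(f"P(A|B)   = {p_check:.2f}")
```

Dividing the joint probability by P(B) returns the 40% that the survey already reported for plan users.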

### What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

The main difference between them lies in the type of data they are designed to handle and the assumptions they make.

1. **Bernoulli Naive Bayes**:
   - **Data Type**: Bernoulli Naive Bayes is suitable for binary data, where each feature is either present (1) or absent (0). It's commonly used in document classification tasks where the presence or absence of specific words or features is important.
   - **Assumption**: It assumes that features are binary and independent, meaning that the presence or absence of one feature does not affect the presence or absence of another feature.
   - **Use Case**: It is often used in spam detection, sentiment analysis, and any task where the focus is on binary presence/absence features.

2. **Multinomial Naive Bayes**:
   - **Data Type**: Multinomial Naive Bayes is designed for discrete data, typically used with features that represent counts or frequencies. It is commonly used in text classification when dealing with text data represented as word counts or term frequencies.
   - **Assumption**: It assumes that features follow a multinomial distribution, which is appropriate for count-based data, such as word counts.
   - **Use Case**: It is widely used in document classification, such as categorizing documents into topics or classifying emails based on the frequency of words.
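
The difference can be seen on a small example. The sketch below uses a hypothetical toy word-count matrix (not taken from any dataset in this notebook) and scikit-learn's default settings. With the default `binarize=0.0`, `BernoulliNB` only looks at whether a count is above zero, while `MultinomialNB` uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy data: rows are documents, columns are word counts.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y_toy = np.array([1, 1, 0, 0])

bnb = BernoulliNB().fit(X_counts, y_toy)    # binarizes counts at 0 by default
mnb = MultinomialNB().fit(X_counts, y_toy)  # models the counts directly

# Two new documents that differ only in how often the first word occurs.
x_once = np.array([[1, 0, 0]])
x_many = np.array([[5, 0, 0]])

# BernoulliNB sees the same presence/absence pattern in both documents.
print("BernoulliNB  :", bnb.predict_proba(x_once), bnb.predict_proba(x_many))
# MultinomialNB becomes more confident as the count grows.
print("MultinomialNB:", mnb.predict_proba(x_once), mnb.predict_proba(x_many))
```

In this sketch the Bernoulli model assigns identical probabilities to both documents, whereas the Multinomial model's probabilities change with the word count.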

### How does Bernoulli Naive Bayes handle missing values?

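Bernoulli Naive Bayes has no built-in way to handle missing values; it expects every feature to be 0 or 1 for every instance. Missing entries therefore have to be dealt with before or during training, typically in one of these ways: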

1. **Imputation**: We can impute missing values by assigning them either 0 or 1, depending on our domain knowledge or the characteristics of our data. However, this approach should be taken with caution, as it may introduce bias or incorrect information into our model.

2. **Ignore Missing Values**: Another option is to simply ignore the instances with missing values during model training and prediction. This means that any instance with missing values for certain features will not contribute to the probability calculations for those features. However, this approach may lead to a loss of information and reduced model performance if there is a significant amount of missing data.

3. **Feature Engineering**: We can create an additional binary feature to explicitly represent the presence or absence of missing values. This way, we can treat missing values as just another category or state of the feature. This approach allows us to retain information about missing data while still adhering to the binary nature of Bernoulli Naive Bayes.

4. **Advanced Imputation Techniques**: Depending on our dataset and the nature of the missing data, we can use data-driven imputation, such as mode (most frequent value) imputation or machine learning-based methods like k-nearest neighbors imputation. Mean imputation is less suitable for binary features because it produces fractional values that are no longer 0 or 1. Data-driven methods can give better estimates for missing values than a fixed constant.
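
Options 1 and 3 above can be combined in a few lines. This is a minimal sketch on a hypothetical binary feature matrix, assuming scikit-learn's `SimpleImputer` with `strategy="most_frequent"` (mode imputation) and `add_indicator=True` (extra 0/1 columns flagging where values were missing). It is one possible setup, not the only way to prepare data for `BernoulliNB`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features; NaN marks a missing entry.
X_missing = np.array([
    [1.0,    0.0, np.nan],
    [1.0,    1.0, 0.0],
    [np.nan, 0.0, 1.0],
    [0.0,    1.0, 1.0],
    [0.0,    0.0, np.nan],
    [1.0,    1.0, 0.0],
])
y_missing = np.array([1, 1, 0, 0, 0, 1])

# Fill each gap with the column's most frequent value and append one
# indicator column per feature that had missing values during fit.
imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
X_filled = imputer.fit_transform(X_missing)
print(X_filled.shape)  # 3 original columns plus 2 missing-value indicators

# The same imputer inside a pipeline, so it is also applied at prediction time.
model = make_pipeline(
    SimpleImputer(strategy="most_frequent", add_indicator=True),
    BernoulliNB(),
)
model.fit(X_missing, y_missing)
preds = model.predict(X_missing)
print(preds)
```

Keeping the imputer inside the pipeline means it is fitted only on the training portion when the pipeline is used in cross-validation.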

### Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. Gaussian Naive Bayes is a variant of the Naive Bayes algorithm that is well-suited for continuous numerical data that approximately follows a Gaussian (normal) distribution within each class. It is commonly used in machine learning for various classification tasks, including multi-class classification.

Gaussian Naive Bayes can handle multi-class classification by using the following approach:

1. **Class Probability Estimation**: For each class in a multi-class problem, Gaussian Naive Bayes calculates the probability that an input instance belongs to that class. It does this by estimating the class prior probability (the probability that a random instance belongs to a specific class) and the class conditional probability (the likelihood of observing the given features given the class).

2. **Maximum a Posteriori (MAP) Estimation**: To make a prediction for a new instance, Gaussian Naive Bayes applies the Maximum a Posteriori (MAP) estimation. It calculates the posterior probability for each class and assigns the input instance to the class with the highest posterior probability. Mathematically, this can be expressed as:

            Prediction = argmax_{c ∈ classes} P(C = c | X)

   Where:
   - Prediction is the predicted class label.
   - C is the class.
   - X is the input feature vector.

3. **Gaussian Distribution Assumption**: Gaussian Naive Bayes assumes that the continuous features within each class follow a Gaussian distribution. This means that it models the distribution of feature values using the mean and variance for each class.

Gaussian Naive Bayes is particularly useful when we have continuous data and we can reasonably assume that the feature distributions within each class are approximately Gaussian. However, it's essential to keep in mind that the "naive" assumption of feature independence may not hold in all real-world scenarios, and the performance of the model can be affected by violations of this assumption.

In [33]:
# Code to support the answer: Gaussian Naive Bayes on the three-class Iris dataset

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

iris = datasets.load_iris()
X = iris.data 
y = iris.target 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb = GaussianNB()

gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

report = classification_report(y_test, y_pred, target_names=iris.target_names)
print("Classification Report:\n", report)

Accuracy: 0.98
Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      0.92      0.96        13
   virginica       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45



#### Data preparation: Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

#### Implementation: Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

#### Results: Report the following performance metrics for each classifier: Accuracy, Precision, Recall, F1 score

In [34]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [35]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [36]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
data = pd.read_csv(url, header=None)

In [37]:
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

In [38]:
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

In [39]:
scoring_metrics = ['accuracy', 'precision', 'recall', 'f1']

In [40]:
# Mean 10-fold cross-validation score per metric, one dictionary per classifier
results_bernoulli = {}
results_multinomial = {}
results_gaussian = {}

for metric in scoring_metrics:
    results_bernoulli[metric] = cross_val_score(bernoulli_nb, X, y, cv=10, scoring=metric).mean()
    results_multinomial[metric] = cross_val_score(multinomial_nb, X, y, cv=10, scoring=metric).mean()
    results_gaussian[metric] = cross_val_score(gaussian_nb, X, y, cv=10, scoring=metric).mean()

In [41]:
results = {
    'Classifier': [],
}

for metric in scoring_metrics:
    results[metric] = []

for classifier in [bernoulli_nb, multinomial_nb, gaussian_nb]:
    # Use the estimator's class name (e.g. "BernoulliNB") as the row label
    classifier_name = type(classifier).__name__
    results['Classifier'].append(classifier_name)
    for metric in scoring_metrics:
        scores = cross_val_score(classifier, X, y, cv=10, scoring=metric)
        results[metric].append(scores.mean())

results_df = pd.DataFrame(results)

print(results_df)

      Classifier  accuracy  precision    recall        f1
0    BernoulliNB  0.883938   0.886962  0.815239  0.848125
1  MultinomialNB  0.786350   0.739318  0.721498  0.728291
2     GaussianNB  0.821773   0.710373  0.956952  0.813066
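
The evaluation above calls `cross_val_score` once per metric, so every classifier is refitted four times per fold. A possible alternative, sketched below, is scikit-learn's `cross_validate`, which accepts a list of metrics and fits each model once per fold. The function name `evaluate_nb_variants` and the synthetic demo data are hypothetical, and the demo does not reproduce the Spambase numbers.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

def evaluate_nb_variants(X, y, cv=10):
    """Mean cross-validated accuracy, precision, recall and F1 per NB variant."""
    metrics = ["accuracy", "precision", "recall", "f1"]
    rows = []
    for clf in (BernoulliNB(), MultinomialNB(), GaussianNB()):
        # One call scores all metrics on the same folds
        cv_out = cross_validate(clf, X, y, cv=cv, scoring=metrics)
        row = {"Classifier": type(clf).__name__}
        for m in metrics:
            row[m] = cv_out[f"test_{m}"].mean()
        rows.append(row)
    return pd.DataFrame(rows)

# Demo on synthetic non-negative data (MultinomialNB requires non-negative features).
rng = np.random.default_rng(0)
X_demo = rng.poisson(lam=2.0, size=(200, 5))
y_demo = (X_demo[:, 0] + rng.integers(0, 2, size=200) > 2).astype(int)

demo_df = evaluate_nb_variants(X_demo, y_demo, cv=5)
print(demo_df)
```

With the Spambase arrays loaded earlier, the call would be `evaluate_nb_variants(X, y)`. Because an integer `cv` uses deterministic stratified folds for classifiers, the averages should agree with the per-metric loop.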


### Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

Bernoulli Naive Bayes:

Bernoulli Naive Bayes achieved the highest accuracy (0.884), precision (0.887), and F1 score (0.848) among the three classifiers, with a recall of 0.815. With default settings, scikit-learn's BernoulliNB binarizes every feature at a threshold of 0, so the model only uses whether a word or character occurs in an email, not how often. This performance suggests that the presence or absence of specific features (e.g., certain words or patterns in emails) is a strong signal for this task. The high precision indicates that most emails it flags as spam really are spam, i.e., it produces relatively few false positives.

Multinomial Naive Bayes:

Multinomial Naive Bayes had the lowest accuracy (0.786), recall (0.721), and F1 score (0.728) of the three, roughly ten percentage points below Bernoulli Naive Bayes in accuracy. Multinomial Naive Bayes is typically used for tasks where the data is represented as counts or frequencies (e.g., text classification with term frequencies). The Spambase features are percentages and capital-letter run-length statistics on very different scales rather than raw counts, which may explain why it fits this dataset less well.

Gaussian Naive Bayes:

Gaussian Naive Bayes had the highest recall (0.957) but the lowest precision (0.710). This means that it identified a high percentage of spam emails (high recall), but it also had more false positives (lower precision). Gaussian Naive Bayes assumes that features follow a Gaussian distribution within each class, which is likely a poor fit for frequency features that are non-negative and often zero.

Limitations:

1. Independence Assumption: Naive Bayes assumes that features are conditionally independent given the class label. In real-world data, this assumption is often violated, which can affect the model's performance.

2. Sensitivity to Feature Distribution: Each variant of Naive Bayes (Bernoulli, Multinomial, Gaussian) makes specific assumptions about the distribution of features. If these assumptions don't align with the data, the model's performance can suffer.

3. Limited Expressiveness: Naive Bayes is a simple and interpretable model, but it may not capture complex relationships in the data as effectively as more advanced models like decision trees or neural networks.

4. Imbalanced Data: Naive Bayes can struggle with imbalanced datasets where one class is much more prevalent than the other. This can lead to biased predictions.

### Summarise your findings and provide some suggestions for future work.

**Summary of Findings**:

**Classifier Performance**:
   - Among the three Naive Bayes classifiers (Bernoulli, Multinomial, Gaussian), Bernoulli Naive Bayes performed the best overall for the spam classification task. It achieved the highest accuracy (0.884), precision (0.887), and F1 score (0.848).
   - Multinomial Naive Bayes had the lowest accuracy (0.786) and F1 score (0.728), while Gaussian Naive Bayes showed the highest recall (0.957) but the lowest precision (0.710).
   - Bernoulli Naive Bayes appears well-suited for this task, where the presence or absence of specific features (e.g., words in emails) is crucial.

**Suggestions for Future Work**:

1. **Feature Engineering**:
   - Explore more advanced feature engineering techniques to improve model performance. For example, consider using natural language processing (NLP) techniques to extract and represent text-based features more effectively.

2. **Model Comparison**:
   - Compare the performance of Naive Bayes with other machine learning algorithms, such as decision trees, random forests, support vector machines, or neural networks, to determine if there are better models suited for this task.

3. **Hyperparameter Tuning**:
   - Perform hyperparameter tuning for each Naive Bayes variant to optimize their performance further. This may involve adjusting smoothing parameters or other relevant hyperparameters.

4. **Ensemble Methods**:
   - Explore ensemble methods like bagging or boosting to combine the strengths of multiple classifiers and potentially improve classification accuracy.

5. **Imbalanced Data Handling**:
   - Implement techniques to address class imbalance if it exists in the dataset. This could involve oversampling the minority class, undersampling the majority class, or using specialized algorithms for imbalanced data.

6. **Evaluate Additional Features**:
   - Consider incorporating additional features into the dataset that may enhance the performance of the classifiers. These features could be related to email metadata, sender information, or other contextual information.

7. **Model Interpretability**
9. **Security and Privacy Measures**
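
As an illustration of the hyperparameter tuning suggestion (item 3), the sketch below runs a small grid search over the smoothing parameter `alpha` and the binarization threshold `binarize` of `BernoulliNB`. The grid values, the function name `tune_bernoulli_nb`, and the synthetic demo data are assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

def tune_bernoulli_nb(X, y, cv=10):
    """Grid-search BernoulliNB by cross-validated F1 (hypothetical grid)."""
    param_grid = {
        "alpha": [0.01, 0.1, 0.5, 1.0],  # additive smoothing strength
        "binarize": [0.0, 0.1, 0.5],     # threshold that maps features to 0/1
    }
    search = GridSearchCV(BernoulliNB(), param_grid, cv=cv, scoring="f1")
    search.fit(X, y)
    return search

# Demo on synthetic data; the label depends on whether feature 0 exceeds 0.5.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 6))
y_demo = (X_demo[:, 0] > 0.5).astype(int)

search_demo = tune_bernoulli_nb(X_demo, y_demo, cv=5)
print(search_demo.best_params_)
print(f"Best mean F1: {search_demo.best_score_:.3f}")
```

On the Spambase arrays, the call would be `tune_bernoulli_nb(X, y)`. Since the search selects parameters using cross-validation scores, a separate held-out set or nested cross-validation would give a less optimistic estimate of the tuned model.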