# ans 1:

To find the probability that an employee is a smoker given that he/she uses the health insurance plan, you can use the conditional probability formula:

\[ P(\text{Smoker | Uses Health Insurance}) = \frac{P(\text{Smoker and Uses Health Insurance})}{P(\text{Uses Health Insurance})} \]

Let's denote the events as follows:

\( A \): Employee uses the health insurance plan
\( B \): Employee is a smoker

The information given is:

\[ P(A) = 0.70 \] (70% of employees use the health insurance plan)

\[ P(B | A) = 0.40 \] (40% of employees who use the plan are smokers)

Because the problem states \( P(B | A) \) directly, the formula serves as a consistency check. The joint probability follows from the multiplication rule:

\[ P(B \cap A) = P(B | A) \, P(A) = 0.40 \times 0.70 = 0.28 \]

Substituting into the formula:

\[ P(B | A) = \frac{P(B \cap A)}{P(A)} = \frac{0.28}{0.70} = 0.40 \]

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.40 or 40%.
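The arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation; the variable names are chosen here for illustration, and the two input values are the ones given in the problem.

```python
# Values taken from the problem statement.
p_a = 0.70          # P(A): employee uses the health insurance plan
p_b_given_a = 0.40  # P(B | A): employee is a smoker, given plan usage

# Multiplication rule: P(A and B) = P(B | A) * P(A)
p_a_and_b = p_b_given_a * p_a

# The conditional probability formula recovers the given value.
recovered = p_a_and_b / p_a

print(f"P(A and B) = {p_a_and_b:.2f}")
print(f"P(B | A)   = {recovered:.2f}")
```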

# ans 2:

Bernoulli Naive Bayes and Multinomial Naive Bayes are both variants of the Naive Bayes algorithm used for classification tasks, but they differ in terms of the types of data they are suitable for and their underlying assumptions.

1. **Bernoulli Naive Bayes:**
   - **Data Type:** Bernoulli Naive Bayes is suitable for binary data, where features are binary-valued (0 or 1), indicating the presence or absence of a particular feature.
   - **Assumption:** It models each feature as an independent Bernoulli variable within each class, so the absence of a feature (a 0) contributes to the likelihood just as its presence (a 1) does.

2. **Multinomial Naive Bayes:**
   - **Data Type:** Multinomial Naive Bayes is appropriate for data that can be represented as counts or frequencies of events, typically in the form of integer counts (e.g., word counts in document classification).
   - **Assumption:** It assumes that the features follow a multinomial distribution, which means they represent the occurrences or frequencies of events.

In summary, the main difference lies in the type of data they are designed to handle. Bernoulli Naive Bayes is suitable for binary presence/absence features, while Multinomial Naive Bayes is more appropriate for discrete features represented by counts or frequencies. The choice between them depends on the nature of your data and the assumptions that align with your specific classification problem.
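To make the contrast concrete, here is a minimal sketch of the two class-conditional log-likelihoods in plain Python. The probabilities `p` and `theta` are made-up values for a single hypothetical class, not estimates from any dataset, and the multinomial coefficient is dropped because it does not depend on the class.

```python
import math


def bernoulli_log_likelihood(x, p):
    """log P(x | class) when each x[i] in {0, 1} is Bernoulli(p[i])."""
    return sum(
        math.log(p_i) if x_i else math.log(1.0 - p_i)
        for x_i, p_i in zip(x, p)
    )


def multinomial_log_likelihood(counts, theta):
    """log P(counts | class), up to the class-independent multinomial coefficient."""
    return sum(c * math.log(t) for c, t in zip(counts, theta))


counts = [3, 0, 1]                             # word counts for one document
binary = [1 if c > 0 else 0 for c in counts]   # presence/absence view

p = [0.8, 0.5, 0.4]      # hypothetical P(word present | class)
theta = [0.6, 0.3, 0.1]  # hypothetical P(word | class), sums to 1

bern = bernoulli_log_likelihood(binary, p)
mult = multinomial_log_likelihood(counts, theta)

print(f"Bernoulli log-likelihood:   {bern:.4f}")
print(f"Multinomial log-likelihood: {mult:.4f}")
```

Note that the absent second word still contributes `log(1 - 0.5)` to the Bernoulli score, whereas a zero count contributes nothing to the multinomial score.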

# ans 3:

Bernoulli Naive Bayes, like other Naive Bayes variants, does not explicitly handle missing values; scikit-learn's `BernoulliNB`, for example, raises an error on input that contains NaN. You can preprocess your data to handle missing values before applying the algorithm. Here are a few common approaches:

1. **Imputation:** Replace missing values with a summary value such as the mode (most frequent value) of the feature. For the binary features used in Bernoulli Naive Bayes, this means filling with whichever of 0 or 1 is more common; the mean is generally unsuitable because it is not a valid binary value.

2. **Meaningful Imputation:** For binary features, you might consider replacing missing values with a value that makes sense in the context of your data. For example, if the feature represents the presence or absence of a specific attribute, you might choose to impute missing values with 0 (absence) if that's more appropriate.

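3. **Create a Separate Category:** You can also treat missing values as a separate category and encode them as such in your dataset. Because Bernoulli Naive Bayes expects binary features, this usually means adding an extra 0/1 indicator feature that records whether the original value was missing. This approach works well when missing values carry some information or represent a meaningful absence of data.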

4. **Model-based Imputation:** Utilize other predictive models to estimate missing values based on the observed values of other features. Once missing values are imputed, you can proceed with Bernoulli Naive Bayes as usual.

Regardless of the approach chosen, it's essential to carefully consider the implications of missing values on your dataset and choose a method that aligns with the assumptions of your analysis and the characteristics of your data. Additionally, preprocessing steps should be applied consistently across training and testing datasets to ensure fair and accurate evaluation of your model's performance.
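As an illustration of mode imputation combined with a missing-value indicator, the sketch below uses only the standard library. The helper names `fit_mode` and `impute_with_indicator` and the toy columns are hypothetical; missing entries are represented by `None`. The fill value is learned from the training column only and then reused on the test column, which is the consistency requirement described above.

```python
from collections import Counter


def fit_mode(column):
    """Return the most frequent observed (non-missing) value of a column."""
    observed = [v for v in column if v is not None]
    return Counter(observed).most_common(1)[0][0]


def impute_with_indicator(column, fill_value):
    """Fill missing entries and return a parallel 0/1 'was missing' feature."""
    filled = [fill_value if v is None else v for v in column]
    was_missing = [1 if v is None else 0 for v in column]
    return filled, was_missing


# Toy binary feature with missing entries (hypothetical data).
train_col = [1, 0, None, 1, 1, None]
test_col = [None, 0, 0]

mode = fit_mode(train_col)  # learned from the training data only
train_filled, train_missing = impute_with_indicator(train_col, mode)
test_filled, test_missing = impute_with_indicator(test_col, mode)

print("train:", train_filled, train_missing)
print("test: ", test_filled, test_missing)
```

Both the filled column and the indicator column are binary, so they can be passed to Bernoulli Naive Bayes together.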

# ans 4:

Yes, Gaussian Naive Bayes (GNB) can be used for multi-class classification tasks. Naive Bayes classifiers, including the Gaussian variant, are inherently designed for both binary and multi-class classification problems.

In the case of Gaussian Naive Bayes, the assumption is that the features within each class follow a Gaussian (normal) distribution. Despite its "naive" assumption of feature independence, it often performs well in practice, especially when the features are approximately normally distributed.

For multi-class classification, GNB estimates a prior and a set of per-feature Gaussian parameters (mean and variance) for each class, then assigns the class with the highest posterior probability given the input features. The decision rule maximizes the product of the class prior and the class-conditional likelihoods of the features.

In summary, Gaussian Naive Bayes is a valid choice for multi-class classification tasks, and it can be particularly effective when the underlying assumptions align well with the characteristics of the data.
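The decision rule can be sketched in plain Python for three classes. The class labels, priors, means, and variances below are invented for illustration; in practice they would be estimated from training data. Working in log space avoids numerical underflow when many features are multiplied together.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def predict(x, priors, means, variances):
    """Return the class with the highest log-posterior, plus all scores."""
    scores = {}
    for label in priors:
        log_post = math.log(priors[label])
        for x_j, m_j, v_j in zip(x, means[label], variances[label]):
            log_post += gaussian_log_pdf(x_j, m_j, v_j)
        scores[label] = log_post
    return max(scores, key=scores.get), scores


# Hypothetical parameters for three classes and two features.
priors = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
means = {"a": [0.0, 0.0], "b": [5.0, 5.0], "c": [10.0, 0.0]}
variances = {label: [1.0, 1.0] for label in priors}

label, scores = predict([4.6, 5.2], priors, means, variances)
print("Predicted class:", label)
```

Nothing in the rule is specific to two classes: adding a class only adds one more entry to each parameter dictionary.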

# ans 5:



import pandas as pd
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import precision_score, recall_score, f1_score

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
names = [f"feature_{i}" for i in range(57)] + ['is_spam']
data = pd.read_csv(url, names=names)

# Split the dataset
X = data.drop('is_spam', axis=1)
y = data['is_spam']

# Implement and evaluate Bernoulli Naive Bayes
bernoulli_nb = BernoulliNB()
y_pred_bernoulli = cross_val_predict(bernoulli_nb, X, y, cv=10)
bernoulli_scores = cross_val_score(bernoulli_nb, X, y, cv=10)

# Implement and evaluate Multinomial Naive Bayes
multinomial_nb = MultinomialNB()
y_pred_multinomial = cross_val_predict(multinomial_nb, X, y, cv=10)
multinomial_scores = cross_val_score(multinomial_nb, X, y, cv=10)

# Implement and evaluate Gaussian Naive Bayes
gaussian_nb = GaussianNB()
y_pred_gaussian = cross_val_predict(gaussian_nb, X, y, cv=10)
gaussian_scores = cross_val_score(gaussian_nb, X, y, cv=10)

# Report performance metrics.
# Accuracy is the mean of the per-fold scores from cross_val_score; precision,
# recall, and F1 are computed on the pooled out-of-fold predictions.
def report_metrics(name, y_true, y_pred, scores):
    print(f"{name} Naive Bayes:")
    print(f"Accuracy: {scores.mean()}")
    print(f"Precision: {precision_score(y_true, y_pred)}")
    print(f"Recall: {recall_score(y_true, y_pred)}")
    print(f"F1 Score: {f1_score(y_true, y_pred)}")
    print("\n")

report_metrics("Bernoulli", y, y_pred_bernoulli, bernoulli_scores)
report_metrics("Multinomial", y, y_pred_multinomial, multinomial_scores)
report_metrics("Gaussian", y, y_pred_gaussian, gaussian_scores)


Bernoulli Naive Bayes:
Accuracy: 0.8839380364047911
Precision: 0.8813357185450209
Recall: 0.815223386651958
F1 Score: 0.8469914040114614


Multinomial Naive Bayes:
Accuracy: 0.7863496180326323
Precision: 0.7323628219484882
Recall: 0.7214561500275786
F1 Score: 0.7268685746040567


Gaussian Naive Bayes:
Accuracy: 0.8217730830896915
Precision: 0.7004440855874041
Recall: 0.9569773855488142
F1 Score: 0.8088578088578089




## Conclusion

**Discussion:**

In the discussion, analyze the results obtained from the three Naive Bayes variants:

- **Bernoulli Naive Bayes:**
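  - This variant assumes that features are binary, making it suitable for data such as the presence or absence of words in spam emails. Spambase stores word and character frequencies rather than 0/1 flags, but `BernoulliNB` binarizes its input at a threshold of 0.0 by default, so each frequency is effectively reduced to "occurs" or "does not occur".
  - Discuss whether this presence/absence view aligns well with the dataset and whether it influenced the performance metrics.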

- **Multinomial Naive Bayes:**
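  - This variant is commonly used for discrete data, such as word counts in text classification.
  - Examine whether the assumptions of this model match the characteristics of the Spambase dataset, whose features are relative frequencies (percentages) and capital-letter run lengths rather than raw integer counts.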

- **Gaussian Naive Bayes:**
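  - Gaussian Naive Bayes is designed for continuous data and assumes that each feature is Gaussian-distributed within each class.
  - Discuss how well this assumption holds for the dataset and whether it impacted the classifier's performance, for example its high recall (0.957) alongside the lowest precision (0.700).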

Compare the metrics such as accuracy, precision, recall, and F1 score for each variant. Identify any patterns or trends in the results.
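One quick sanity check when comparing the metrics is to confirm that each reported F1 score is the harmonic mean of the reported precision and recall. The sketch below copies the values from the output above; the helper name `f1_from` is chosen here for illustration.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the printed results.
reported = {
    "Bernoulli": (0.8813357185450209, 0.815223386651958, 0.8469914040114614),
    "Multinomial": (0.7323628219484882, 0.7214561500275786, 0.7268685746040567),
    "Gaussian": (0.7004440855874041, 0.9569773855488142, 0.8088578088578089),
}

gaps = {}
for name, (precision, recall, f1) in reported.items():
    recomputed = f1_from(precision, recall)
    gaps[name] = abs(recomputed - f1)
    print(f"{name}: recomputed F1 = {recomputed:.4f}, gap = {gaps[name]:.1e}")

# Rank the variants by reported F1 score.
best_by_f1 = max(reported, key=lambda name: reported[name][2])
print("Highest F1:", best_by_f1)
```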

**Conclusion:**

Summarize the key findings and insights:

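- Highlight which Naive Bayes variant performed best on the specified metrics. In the run above, Bernoulli Naive Bayes leads on accuracy (0.884), precision (0.881), and F1 score (0.847), while Gaussian Naive Bayes has the highest recall (0.957) but the lowest precision (0.700), and Multinomial Naive Bayes has the lowest accuracy (0.786).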
- Discuss possible reasons for the observed performance differences, considering the dataset characteristics and the assumptions of each variant.
- Mention any limitations observed during the analysis, such as sensitivity to feature assumptions or potential overfitting.

**Suggestions for Future Work:**

Provide recommendations for future work based on your observations:

- Explore feature engineering techniques to enhance the performance of Naive Bayes models.
- Consider hyperparameter tuning to optimize the models further.
- Experiment with other machine learning algorithms to compare their performance with Naive Bayes.
- Investigate ensemble methods or hybrid models for potentially improved classification accuracy.

By concluding with these suggestions, you provide a roadmap for further research and improvement in spam email classification using machine learning techniques.
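As one possible starting point for the hyperparameter tuning suggestion, the sketch below runs a small grid search over the smoothing strength `alpha` and the `binarize` threshold of `BernoulliNB`. The data is a synthetic stand-in generated with `make_classification` so that the example runs offline, and the grid values are arbitrary assumptions; to tune on Spambase, pass the `X` and `y` loaded earlier instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data (assumption): replace with the Spambase X and y.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)

# Hypothetical search grid: smoothing strength and binarization threshold.
param_grid = {
    "alpha": [0.01, 0.1, 1.0],
    "binarize": [0.0, 0.5, 1.0],
}

# 5-fold cross-validated search, scored by F1 on the positive class.
search = GridSearchCV(BernoulliNB(), param_grid, scoring="f1", cv=5)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```

Scores obtained this way are optimistic if the same folds are also used for the final evaluation, so a held-out test set or nested cross-validation is worth considering.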