
Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

In [None]:
"""
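The problem states this conditional probability directly: 40% of the employees who use the health insurance plan are smokers, so P(S|H) = 0.40. The 70% figure is P(H), the share of employees who use the plan. It is not needed to answer the question, but it gives the joint probability P(S and H) = P(S|H) * P(H) = 0.40 * 0.70 = 0.28.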

"""

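The arithmetic for Q1 can be checked with a short sketch. The variable names are illustrative, and the joint probability is computed only to show how the 70% figure relates to the answer; the question itself does not ask for it.

```python
# Given values from the problem statement
p_h = 0.70          # P(H): employee uses the health insurance plan
p_s_given_h = 0.40  # P(S|H): smoker, given that the employee uses the plan

# Product rule: P(S and H) = P(S|H) * P(H)
p_s_and_h = p_s_given_h * p_h

# Definition of conditional probability: P(S|H) = P(S and H) / P(H)
p_check = p_s_and_h / p_h

print(f"P(S and H) = {p_s_and_h:.2f}")
print(f"P(S|H)     = {p_check:.2f}")
```

Dividing the joint probability by P(H) recovers the 0.40 given in the question, which confirms that no further calculation is required.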

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

In [None]:
"""
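Bernoulli Naive Bayes and Multinomial Naive Bayes are both suited to discrete features, but they model those features differently. The key differences are:

Bernoulli Naive Bayes
Data Type: Designed for binary/boolean features. Each feature is either 0 or 1 (for example, whether a word occurs in a document).
Likelihood: Each feature is modeled as an independent Bernoulli variable, so the absence of a feature also contributes to the class likelihood.

Multinomial Naive Bayes
Data Type: Designed for discrete count features. Each feature represents the count or frequency of an attribute (for example, how many times a word occurs in a document).
Likelihood: The feature vector is modeled as a draw from a multinomial distribution, so only the features that occur contribute, weighted by their counts.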

"""

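The difference can be illustrated with a minimal sketch on a made-up word-count matrix (the data and variable names are assumptions for illustration only). It relies on scikit-learn's `BernoulliNB` binarizing its input at 0.0 by default, while `MultinomialNB` uses the counts as given.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix (rows: documents, columns: words); values are made up
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y = np.array([1, 1, 0, 0])
X_binary = (X_counts > 0).astype(int)  # presence/absence version of the same data

# BernoulliNB binarizes its input at 0.0 by default, so counts and
# presence/absence flags lead to the same fitted model
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_binary = BernoulliNB().fit(X_binary, y)
same_bernoulli = np.allclose(
    bnb_counts.predict_proba(X_counts), bnb_binary.predict_proba(X_binary)
)

# MultinomialNB uses the counts themselves, so the two inputs give
# different per-class word probabilities
mnb_counts = MultinomialNB().fit(X_counts, y)
mnb_binary = MultinomialNB().fit(X_binary, y)
same_multinomial = np.allclose(
    mnb_counts.feature_log_prob_, mnb_binary.feature_log_prob_
)

print("BernoulliNB identical on counts vs. binary:", same_bernoulli)
print("MultinomialNB identical on counts vs. binary:", same_multinomial)
```

In this toy setup the Bernoulli model ignores how often a word occurs, whereas the multinomial model's estimates change once the counts are collapsed to 0/1.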
Q3. How does Bernoulli Naive Bayes handle missing values?

In [None]:
"""
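Bernoulli Naive Bayes has no built-in mechanism for missing values. It computes probabilities from the presence or absence of each feature (binary values, 0 or 1), so every feature must have a value before the model is trained or used for prediction. Common strategies for dealing with missing values are:
1. Imputation: replace each missing value with a fixed value, typically the feature's most frequent value (0 or 1).
2. Ignoring Missing Values: drop the rows that contain missing values, or leave the missing feature's term out of the likelihood product when scoring a sample.
3. Missing Indicator: add a binary feature per column that flags whether the original value was missing.
4. Modeling Missingness: treat "missing" as a category of its own and estimate its probability for each class.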

"""

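Strategies 1 and 3 can be combined in scikit-learn, as in the following minimal sketch. The toy data is made up, and putting `SimpleImputer` in front of `BernoulliNB` is one possible setup rather than a prescribed one. scikit-learn's naive Bayes estimators are expected to reject NaN input, which is why the imputation step comes first.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary data with missing entries (np.nan); values are made up
X = np.array([
    [1.0, 0.0, np.nan],
    [0.0, np.nan, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [np.nan, 1.0, 1.0],
    [1.0, 0.0, 0.0],
])
y = np.array([1, 0, 1, 0, 0, 1])

# Fill each gap with the column's most frequent value and append one
# 0/1 "was missing" indicator column per feature that had gaps
imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
X_filled = imputer.fit_transform(X)
print(X_filled.shape)  # 3 original columns plus 3 indicator columns

# The same imputer can sit in front of the classifier in a pipeline
model = make_pipeline(
    SimpleImputer(strategy="most_frequent", add_indicator=True),
    BernoulliNB(),
)
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Keeping the imputer inside the pipeline means that, under cross-validation, the fill values are learned from the training folds only.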
Q4. Can Gaussian Naive Bayes be used for multi-class classification?

In [None]:
"""
Gaussian Naive Bayes for Multi-class Classification
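Yes. Gaussian Naive Bayes handles multi-class classification directly. For every class it estimates a prior plus a mean and variance per feature, computes the posterior of each class with Bayes' theorem, and predicts the class with the highest posterior, so no one-vs-rest scheme is needed. It assumes that the continuous features follow a Gaussian (normal) distribution within each class, which makes it a natural fit for numerical data.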

"""

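A minimal sketch of multi-class use, assuming the three-class Iris dataset that ships with scikit-learn as example data (it is not part of the assignment):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris ships with scikit-learn and has three classes
X, y = load_iris(return_X_y=True)

clf = GaussianNB().fit(X, y)

# One row of per-feature means is estimated for each class
print("Classes:", clf.classes_)
print("Per-class feature means shape:", clf.theta_.shape)

# predict_proba returns one posterior probability per class
proba = clf.predict_proba(X[:5])
print("Posterior shape for 5 samples:", proba.shape)
```

The same `fit` and `predict` calls are used regardless of the number of classes; the model simply stores one set of Gaussian parameters per class.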
Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to
predict whether a message is spam or not based on several input features.
Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.
Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score
Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
Conclusion:
Summarise your findings and provide some suggestions for future work.

In [1]:
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the Spambase dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
columns = [
    "word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d", "word_freq_our",
    "word_freq_over", "word_freq_remove", "word_freq_internet", "word_freq_order", "word_freq_mail",
    "word_freq_receive", "word_freq_will", "word_freq_people", "word_freq_report", "word_freq_addresses",
    "word_freq_free", "word_freq_business", "word_freq_email", "word_freq_you", "word_freq_credit",
    "word_freq_your", "word_freq_font", "word_freq_000", "word_freq_money", "word_freq_hp", "word_freq_hpl",
    "word_freq_george", "word_freq_650", "word_freq_lab", "word_freq_labs", "word_freq_telnet",
    "word_freq_857", "word_freq_data", "word_freq_415", "word_freq_85", "word_freq_technology",
    "word_freq_1999", "word_freq_parts", "word_freq_pm", "word_freq_direct", "word_freq_cs",
    "word_freq_meeting", "word_freq_original", "word_freq_project", "word_freq_re", "word_freq_edu",
    "word_freq_table", "word_freq_conference", "char_freq_;", "char_freq_(", "char_freq_[", "char_freq_!",
    "char_freq_$", "char_freq_#", "capital_run_length_average", "capital_run_length_longest",
    "capital_run_length_total", "is_spam"
]

df = pd.read_csv(url, header=None, names=columns)

# Prepare the data: Features and Target
X = df.drop(columns=['is_spam'])
y = df['is_spam']

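# Initialize Naive Bayes classifiers with default hyperparameters.
# Only GaussianNB is standardized: MultinomialNB requires non-negative
# features, and BernoulliNB binarizes each feature at 0 by default.
models = {
    'Bernoulli Naive Bayes': BernoulliNB(),
    'Multinomial Naive Bayes': MultinomialNB(),
    'Gaussian Naive Bayes': make_pipeline(StandardScaler(), GaussianNB())
}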

# Define cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Evaluate each model: mean of each metric over the 10 folds
scoring = {'Accuracy': 'accuracy', 'Precision': 'precision', 'Recall': 'recall', 'F1 Score': 'f1'}
results = {}
for name, model in models.items():
    results[name] = {
        metric: cross_val_score(model, X, y, cv=kf, scoring=scorer).mean()
        for metric, scorer in scoring.items()
    }

# Print results
print("Performance Metrics:")
for name, metrics in results.items():
    print(f"\n{name}:")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.4f}")


Performance Metrics:

Bernoulli Naive Bayes:
Accuracy: 0.8852
Precision: 0.8844
Recall: 0.8159
F1 Score: 0.8480

Multinomial Naive Bayes:
Accuracy: 0.7914
Precision: 0.7407
Recall: 0.7218
F1 Score: 0.7304

Gaussian Naive Bayes:
Accuracy: 0.8166
Precision: 0.6945
Recall: 0.9530
F1 Score: 0.8025
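
The evaluation loop above runs a separate cross-validation for each metric, so every model is refitted four times per fold. scikit-learn's `cross_validate` accepts a list of metrics and scores them all from a single set of fits. The sketch below shows the pattern on synthetic data from `make_classification`, so that it runs without downloading anything. The names `X_demo`, `y_demo`, and `kf_demo` are illustrative, and the numbers it prints are not the Spambase results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption for illustration, not Spambase)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

kf_demo = KFold(n_splits=10, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1"]

# One call fits the model once per fold and scores all four metrics
cv_results = cross_validate(GaussianNB(), X_demo, y_demo, cv=kf_demo, scoring=metrics)

for metric in metrics:
    fold_scores = cv_results[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.4f}")
```

Because the folds and the models are deterministic, switching the notebook's loop to this pattern would be expected to give the same averages with a quarter of the fitting work.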
