Q1. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes and Multinomial Naive Bayes are both variants of the Naive Bayes algorithm commonly used in machine learning for text classification tasks. They differ primarily in the assumptions they make about the distribution of feature variables.

 Bernoulli Naive Bayes:

Assumes that each feature is binary-valued (e.g., presence or absence of a term in a document).
Models every term in the vocabulary as a separate Bernoulli variable, so the absence of a term also counts as evidence for or against a class.
Typically used for text classification tasks where only the presence or absence of words matters, not how often they occur.
Suitable when your feature vectors are binary; count or real-valued inputs have to be binarized first.

 Multinomial Naive Bayes:

Assumes that the features represent counts (e.g., how many times each term occurs in a document).
Often used in text classification where the frequency of terms (words) is considered rather than just their presence.
Models the term counts of a document as draws from a per-class multinomial distribution over the vocabulary.
Suitable when your feature vectors hold non-negative counts. Fractional weights such as term frequency-inverse document frequency (TF-IDF) are not true counts, although they are often used with this model in practice.
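
To make the difference concrete, here is a minimal sketch of how the two models score the same document for a single class. The vocabulary and all probabilities are made-up illustrative numbers, and the function names are hypothetical, not part of any library. The multinomial coefficient is left out because it is the same for every class.

```python
import math

# Hypothetical parameters for one class ("spam"); the numbers are illustrative only.
# Bernoulli model: P(word is present | spam), one independent value per word.
p_present = {"free": 0.8, "meeting": 0.1, "offer": 0.6}
# Multinomial model: P(word | spam), a distribution that sums to 1 over the vocabulary.
p_word = {"free": 0.5, "meeting": 0.1, "offer": 0.4}


def bernoulli_log_likelihood(counts, p_present):
    """Every vocabulary word contributes: p if present, (1 - p) if absent."""
    total = 0.0
    for word, p in p_present.items():
        present = counts.get(word, 0) > 0
        total += math.log(p if present else 1.0 - p)
    return total


def multinomial_log_likelihood(counts, p_word):
    """Only observed words contribute, each weighted by its count."""
    return sum(n * math.log(p_word[w]) for w, n in counts.items() if n > 0)


doc_once = {"free": 1}    # "free" appears once
doc_thrice = {"free": 3}  # "free" appears three times

print(bernoulli_log_likelihood(doc_once, p_present),
      bernoulli_log_likelihood(doc_thrice, p_present))
print(multinomial_log_likelihood(doc_once, p_word),
      multinomial_log_likelihood(doc_thrice, p_word))
```

The Bernoulli score is identical for both documents because only presence matters, and it includes terms for the absent words "meeting" and "offer". The multinomial score scales with the count and ignores absent words.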


Q2. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes, like other Naive Bayes variants, is generally not explicitly designed to handle missing values. However, there are a few common strategies for dealing with missing values when using Bernoulli Naive Bayes:

1. Imputation: One approach is to impute missing values with a placeholder value before applying the Bernoulli Naive Bayes algorithm. For binary features, missing values could be replaced with either 0 or 1, depending on the context and the nature of the missingness. However, this approach may introduce bias or noise into the data.

2. Ignore Missing Values: Another approach is to simply ignore instances with missing values during training and classification. This means that any instance with missing values would not contribute to the probability estimation for any class. While straightforward, this approach may result in loss of valuable information, particularly if there are many instances with missing values.

3. Advanced Imputation Techniques: More sophisticated imputation could also be employed, such as machine learning-based methods like k-nearest neighbors (KNN) imputation or predictive mean matching. Mode imputation is the natural simple baseline for binary features, whereas mean imputation yields fractional values that must be binarized again. These methods add complexity and computational overhead.
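
The imputation strategy can be sketched as follows, assuming missing entries are encoded as NaN. The toy data is made up for illustration. scikit-learn's `BernoulliNB` rejects NaN inputs, so the imputer is placed in front of it in a pipeline; `strategy="most_frequent"` is mode imputation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary feature matrix with missing entries marked as NaN (made-up data).
X_toy = np.array([
    [1, 0, np.nan],
    [1, np.nan, 0],
    [0, 1, 1],
    [np.nan, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
], dtype=float)
y_toy = np.array([1, 1, 0, 0, 1, 0])

# Mode imputation keeps every feature binary; mean imputation would not.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X_toy, y_toy)

# Inspect the imputed matrix and the predictions on the training rows.
X_imputed = model.named_steps["simpleimputer"].transform(X_toy)
predictions = model.predict(X_toy)
print(X_imputed)
print(predictions)
```

Keeping the imputer inside the pipeline means that, under cross-validation, the fill values are learned from the training folds only.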



Q3. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. Gaussian Naive Bayes is a variant of the Naive Bayes algorithm that assumes that continuous features follow a Gaussian (normal) distribution. It's particularly suitable for features that are real-valued.

For multi-class classification, Gaussian Naive Bayes extends naturally: it estimates a prior and a per-feature mean and variance for every class, then applies the maximum a posteriori (MAP) decision rule. It calculates the posterior probability of each class given the input features and selects the class with the highest posterior probability as the predicted class, regardless of how many classes there are.
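
The MAP rule for more than two classes can be sketched in plain Python. This is a minimal illustration on made-up data with three classes, not how scikit-learn implements `GaussianNB`. The function names are hypothetical, and the small constant added to the variance is an assumption of this sketch.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    params = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The tiny constant guards against zero variance (an assumption of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        params[label] = (n / len(X), means, variances)
    return params


def log_posterior(x, prior, means, variances):
    """log P(class) + sum of log Gaussian densities, up to a constant shared by all classes."""
    total = math.log(prior)
    for xi, m, v in zip(x, means, variances):
        total += -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
    return total


def predict(x, params):
    """MAP rule: return the class with the largest unnormalised log posterior."""
    return max(params, key=lambda c: log_posterior(x, *params[c]))


# Made-up data: three well-separated classes in two dimensions.
X_toy = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9], [9.0, 9.1], [9.2, 8.8]]
y_toy = ["a", "a", "b", "b", "c", "c"]
params = fit_gaussian_nb(X_toy, y_toy)

print(predict([1.0, 1.0], params), predict([5.0, 5.0], params), predict([9.0, 9.0], params))
```

Nothing in `predict` depends on the number of classes: adding a fourth class only adds one more entry to `params`.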



Q5. Assignment:
    
    Data preparation:

In [8]:

import requests

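def download_spambase_dataset(url, save_path):
    # Fetch the raw file; fail loudly on HTTP errors instead of saving an error page.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(save_path, 'wb') as f:
        f.write(response.content)
    print("Download completed.")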

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
save_path = "spambase.data"

download_spambase_dataset(url, save_path)

Download completed.


Implementation:

In [2]:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.pipeline import make_pipeline
from sklearn.datasets import fetch_openml

In [3]:
data = fetch_openml(name='spambase', version=1, as_frame=True)



In [4]:
X = data.data
y = data.target


In [5]:
bernoulli_nb = BernoulliNB()
bernoulli_scores = cross_val_score(bernoulli_nb, X, y, cv=10)
print("Bernoulli Naive Bayes Mean Accuracy:", np.mean(bernoulli_scores))

Bernoulli Naive Bayes Mean Accuracy: 0.8839380364047911


In [6]:
multinomial_nb = MultinomialNB()
multinomial_scores = cross_val_score(multinomial_nb, X, y, cv=10)
print("Multinomial Naive Bayes Mean Accuracy:", np.mean(multinomial_scores))

Multinomial Naive Bayes Mean Accuracy: 0.7863496180326323


In [7]:
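# Note: KBinsDiscretizer turns each feature into ordinal bin indices before GaussianNB sees it;
# GaussianNB can also be fit directly on the raw continuous features.
gaussian_nb = make_pipeline(KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform'), GaussianNB())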
gaussian_scores = cross_val_score(gaussian_nb, X, y, cv=10)
print("Gaussian Naive Bayes Mean Accuracy:", np.mean(gaussian_scores))

Gaussian Naive Bayes Mean Accuracy: 0.6796321795718192


In [9]:
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.pipeline import make_pipeline
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [12]:
data = fetch_openml(name='spambase', version=1, as_frame=True)

# Separate features and target variable
X = data.data
y = data.target




In [11]:
bernoulli_nb = BernoulliNB()
bernoulli_results = cross_validate(bernoulli_nb, X, y, cv=10, scoring=('accuracy', 'precision', 'recall', 'f1'))
print("Bernoulli Naive Bayes:")
print("Accuracy:", np.mean(bernoulli_results['test_accuracy']))
print("Precision:", np.mean(bernoulli_results['test_precision']))
print("Recall:", np.mean(bernoulli_results['test_recall']))
print("F1 Score:", np.mean(bernoulli_results['test_f1']))

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/opt/conda/lib/python3.10/site-packages/sklearn/metrics/_scorer.py", line 107, in __call__
    score = scorer._score(cached_call, estimator, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sklearn/metrics/_scorer.py", line 268, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sklearn/metrics/_classification.py", line 1954, in precision_score
    p, _, _, _ = precision_recall_fscore_support(
  File "/opt/conda/lib/python3.10/site-packages/sklearn/metrics/_classification.py", line 1573, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
  File "/opt/conda/lib/python3.10/site-packages/sklearn/metrics/_classification.py", line 1382, i

Bernoulli Naive Bayes:
Accuracy: nan
Precision: nan
Recall: nan
F1 Score: nan
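
The truncated traceback above stops inside `_check_set_wise_labels` while computing precision. That is consistent with the target labels being strings such as "0" and "1", so that the scorer's default `pos_label=1` matches no class and every metric falls back to nan. This reading is an assumption, since the actual error message is cut off. Under that assumption, converting the labels to integers before scoring should help. The sketch below uses a small synthetic stand-in for the dataset (made-up data, hypothetical variable names) so it runs without a download; in the notebook the equivalent step would be something like `y = data.target.astype(int)`.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for spambase (made-up data), with string labels "0"/"1"
# to mimic a categorical target loaded from OpenML.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 5))
y_str = np.where(X_demo[:, 0] == 1, "1", "0")

# Integer labels let the default pos_label=1 match a real class.
y_int = y_str.astype(int)

results = cross_validate(
    BernoulliNB(), X_demo, y_int, cv=5,
    scoring=("accuracy", "precision", "recall", "f1"),
)
mean_scores = {name: float(np.mean(results[f"test_{name}"]))
               for name in ("accuracy", "precision", "recall", "f1")}
print(mean_scores)
```

An alternative that keeps the string labels is to build the scorers explicitly, for example `make_scorer(precision_score, pos_label="1")`, and pass them to `cross_validate` as a dictionary.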

