# Naïve Bayes-2
Assignment Questions

The probability that an employee is a smoker given that he/she uses the health insurance plan is the conditional probability P(Smoker | Health Plan). Bayes' theorem relates it to the reverse conditional:

P(Smoker | Health Plan) = P(Health Plan | Smoker) * P(Smoker) / P(Health Plan)

We are given:

P(Health Plan) = 0.7 (70% of employees use the health insurance plan)
P(Smoker | Health Plan) = 0.4 (40% of employees who use the plan are smokers)
P(Smoker) = unknown

The quantity asked for is therefore given directly: P(Smoker | Health Plan) = 0.4. Bayes' theorem is not needed to obtain it, but it can be used as a consistency check.

The joint probability of being a smoker and using the plan is:

P(Smoker and Health Plan) = P(Smoker | Health Plan) * P(Health Plan) = 0.4 * 0.7 = 0.28

To find P(Smoker), we can use the law of total probability:

P(Smoker) = P(Smoker | Health Plan) * P(Health Plan) + P(Smoker | No Health Plan) * P(No Health Plan)

We are not given P(Smoker | No Health Plan). Purely for illustration, let's assume that P(Smoker | No Health Plan) = 0.2 (20% of employees who don't use the plan are smokers). P(No Health Plan) = 1 - 0.7 = 0.3 follows from the given data.

Then, we can calculate:

P(Smoker) = 0.4 * 0.7 + 0.2 * 0.3 = 0.34
P(Health Plan | Smoker) = P(Smoker and Health Plan) / P(Smoker) = 0.28 / 0.34 = 0.8235 (approximately)

Now we can substitute these values into Bayes' theorem:

P(Smoker | Health Plan) = P(Health Plan | Smoker) * P(Smoker) / P(Health Plan)
P(Smoker | Health Plan) = 0.8235 * 0.34 / 0.7
P(Smoker | Health Plan) = 0.4

The result does not depend on the assumed value of P(Smoker | No Health Plan), because P(Smoker) cancels out of the product P(Health Plan | Smoker) * P(Smoker).

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.4 or 40%.
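The arithmetic can be checked numerically. The sketch below uses the two given values (0.7 and 0.4) plus the illustrative value of 0.2 for P(Smoker | No Health Plan), which is an assumption and not part of the problem data. Bayes' theorem should return the conditional probability stated in the problem.

```python
# Given in the problem statement
p_plan = 0.7                  # P(Health Plan)
p_smoker_given_plan = 0.4     # P(Smoker | Health Plan)

# Illustrative assumption only (not part of the problem data)
p_smoker_given_no_plan = 0.2  # P(Smoker | No Health Plan)
p_no_plan = 1 - p_plan        # P(No Health Plan)

# Law of total probability
p_smoker = p_smoker_given_plan * p_plan + p_smoker_given_no_plan * p_no_plan

# Joint probability and the reverse conditional
p_joint = p_smoker_given_plan * p_plan    # P(Smoker and Health Plan)
p_plan_given_smoker = p_joint / p_smoker  # P(Health Plan | Smoker)

# Bayes' theorem recovers the conditional probability we started from
posterior = p_plan_given_smoker * p_smoker / p_plan

print("P(Smoker)               =", round(p_smoker, 4))
print("P(Health Plan | Smoker) =", round(p_plan_given_smoker, 4))
print("P(Smoker | Health Plan) =", round(posterior, 4))
```

Changing `p_smoker_given_no_plan` changes P(Smoker) and P(Health Plan | Smoker), but not the final posterior.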

The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes is the type of data that they are designed to handle.

Bernoulli Naive Bayes is typically used for binary data, where each feature can take on only two values (e.g., presence or absence of a certain word in a document). It models the likelihood of each feature given each class as a Bernoulli distribution, which represents the probability of a binary outcome (e.g., success or failure). Bernoulli Naive Bayes assumes that the features are conditionally independent given the class.

Multinomial Naive Bayes, on the other hand, is designed for count data (e.g., word counts in a document). For each class it learns one probability per feature (e.g., the probability that a token drawn from a document of that class is a given word), and it models the whole count vector as a draw from a multinomial distribution with those probabilities. A feature that occurs several times therefore contributes several times to the likelihood, while a feature with a count of zero contributes nothing. Like Bernoulli Naive Bayes, it assumes that the features are conditionally independent given the class.

In summary, Bernoulli Naive Bayes is used for binary data, while Multinomial Naive Bayes is used for count data. However, there may be cases where either approach can be used, depending on the specific problem and the nature of the data.
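To make the difference concrete, here is a minimal sketch that scores one document under both likelihood models for a single class. The vocabulary, the probabilities, and the helper names `bernoulli_log_likelihood` and `multinomial_log_likelihood` are made up for illustration; they are not fitted values or library functions.

```python
import math

# Hypothetical per-class parameters for a 3-word vocabulary (made-up numbers)
vocab = ["free", "meeting", "offer"]
p_present = [0.8, 0.1, 0.6]  # Bernoulli: P(word appears at least once | class)
p_word = [0.5, 0.1, 0.4]     # Multinomial: P(a token is this word | class), sums to 1

def bernoulli_log_likelihood(counts, p_present):
    """Uses only presence/absence; an absent word contributes log(1 - p)."""
    total = 0.0
    for c, p in zip(counts, p_present):
        total += math.log(p) if c > 0 else math.log(1.0 - p)
    return total

def multinomial_log_likelihood(counts, p_word):
    """Uses the counts; a word with count 0 contributes nothing.
    The multinomial coefficient is omitted because it is identical for every class."""
    return sum(c * math.log(p) for c, p in zip(counts, p_word))

counts = [3, 0, 1]      # "free" x3, "meeting" x0, "offer" x1
counts_alt = [1, 0, 1]  # same words present, but "free" occurs only once

bern = bernoulli_log_likelihood(counts, p_present)
bern_alt = bernoulli_log_likelihood(counts_alt, p_present)
multi = multinomial_log_likelihood(counts, p_word)
multi_alt = multinomial_log_likelihood(counts_alt, p_word)

print(f"Bernoulli:   {bern:.4f} vs {bern_alt:.4f}")
print(f"Multinomial: {multi:.4f} vs {multi_alt:.4f}")
```

The Bernoulli score is the same for both documents because only presence matters, whereas the multinomial score changes with the count of "free".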

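The Bernoulli Naive Bayes model can handle a missing feature value in a straightforward way: the feature is omitted when making the prediction. Because the features are assumed to be conditionally independent given the class, the class-conditional likelihood is a product of per-feature terms, and leaving out the term for a missing feature is equivalent to summing over its possible values. A missing value is not the same as an observed 0: an observed 0 (the feature is absent) is informative and contributes a factor of 1 - P(Xj = 1 | Y), whereas a missing value contributes nothing. This is a property of the model rather than of every implementation; scikit-learn's BernoulliNB, for example, does not accept NaN inputs, so missing values must be imputed or otherwise handled before fitting and prediction.

For example, suppose we have a dataset with three binary features: X1, X2, and X3, and a binary class variable Y. If there is a missing value in X2 for a particular instance, we can simply omit X2 for that instance when calculating the likelihood under each class. That is, we use only P(X1 | Y) and P(X3 | Y), together with the prior P(Y), to compute the posterior of Y.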

It is important to note that the effectiveness of this approach depends on the reason for the missing values. If the missing values are missing at random (i.e., the probability of missingness is independent of the true value of the feature), then this approach is valid. However, if the missing values are related to the class variable or other features in a systematic way, then ignoring them can lead to biased or inaccurate predictions. In such cases, more sophisticated imputation methods may be needed to handle missing values.
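The omission strategy can be sketched in a few lines. The parameters `theta` and `prior` and the function `posterior` below are hypothetical, chosen only to illustrate the X1, X2, X3 example; missing values are encoded as `None`.

```python
import math

# Hypothetical parameters: theta[y][j] = P(Xj = 1 | Y = y) for three binary features
theta = {
    0: [0.2, 0.5, 0.7],
    1: [0.9, 0.4, 0.1],
}
prior = {0: 0.5, 1: 0.5}

def posterior(x, theta, prior):
    """Posterior over classes for one instance.
    Features equal to None are treated as missing and their term is skipped."""
    log_scores = {}
    for y in prior:
        score = math.log(prior[y])
        for xj, p in zip(x, theta[y]):
            if xj is None:
                continue  # missing: contributes no likelihood term
            score += math.log(p if xj == 1 else 1.0 - p)
        log_scores[y] = score
    # Normalize in a numerically stable way
    m = max(log_scores.values())
    unnorm = {y: math.exp(s - m) for y, s in log_scores.items()}
    z = sum(unnorm.values())
    return {y: v / z for y, v in unnorm.items()}

post_missing = posterior([1, None, 0], theta, prior)  # X2 is missing
post_zero = posterior([1, 0, 0], theta, prior)        # X2 observed as 0

print("X2 missing:", post_missing)
print("X2 = 0:    ", post_zero)
```

The two posteriors differ, which shows that skipping a missing feature is not the same as treating it as an observed 0.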

Yes, Gaussian Naive Bayes can be used for multi-class classification. In this case, the model assumes that the distribution of each feature for each class is Gaussian (i.e., normal) and the features are independent given the class.

To perform multi-class classification with Gaussian Naive Bayes, the model estimates the mean and variance of each feature for each class, and uses Bayes' theorem to calculate the posterior probability of each class given the observed values of the features. The class with the highest posterior probability is then selected as the predicted class.
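The procedure just described can be sketched from scratch as follows. This is a minimal illustration on made-up data with three classes, not the scikit-learn implementation; `fit_gaussian_nb` and `predict` are hypothetical helper names, and the small constant added to the variance is an assumption to avoid division by zero.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the class prior and the per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    n = len(X)
    for label, rows in by_class.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # Maximum-likelihood variance plus a small constant for numerical safety
        variances = [sum((v - m) ** 2 for v in c) / len(c) + 1e-9
                     for c, m in zip(cols, means)]
        model[label] = (len(rows) / n, means, variances)
    return model

def predict(model, x):
    """Return the class with the highest log posterior (up to a shared constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for xj, m, v in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data with three well-separated classes (made-up values)
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
     [5.0, 5.0], [5.2, 4.8], [4.8, 5.1],
     [9.0, 1.0], [9.1, 0.9], [8.9, 1.2]]
y = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

model = fit_gaussian_nb(X, y)
predictions = [predict(model, x) for x in [[1.1, 2.0], [5.1, 5.0], [9.0, 1.1]]]
print(predictions)  # ['a', 'b', 'c']
```

The same loop over classes works for two classes or for many, since the prediction is just an argmax over per-class scores.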

Gaussian Naive Bayes handles any number of classes natively: the posterior is computed for every class from its own prior, means, and variances, and all classes are compared at once. Decomposition schemes such as one-vs-all (also called one-vs-rest) and one-vs-one are mainly needed for classifiers that are inherently binary. In the one-vs-all approach, a separate binary classifier is trained for each class to distinguish it from the rest of the classes. In the one-vs-one approach, a binary classifier is trained for each pair of classes. These schemes can be wrapped around Gaussian Naive Bayes, but they are not required.

In summary, Gaussian Naive Bayes can be used for multi-class classification directly, by selecting the class with the highest posterior probability.

In [5]:
import pandas as pd

data = pd.read_csv('spambase.data', header=None)

# The last column contains the target variable (spam or not spam)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

In [12]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_validate

# Load the dataset
data = pd.read_csv("spambase.data", header=None)

# Split the data into features and target
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Instantiate the classifiers
classifiers = {
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Perform 10-fold cross-validation and compute the performance metrics for each classifier.
# cross_val_score accepts only a single metric and raises a ValueError for a list;
# cross_validate accepts a list and returns a dict with one 'test_<metric>' array per metric.
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Print the mean of each metric over the 10 folds for each classifier
for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=10, scoring=scoring)
    print(name + ":")
    print("Accuracy: ", np.mean(scores['test_accuracy']))
    print("Precision: ", np.mean(scores['test_precision']))
    print("Recall: ", np.mean(scores['test_recall']))
    print("F1 score: ", np.mean(scores['test_f1']))
    print()