Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

To find the probability that an employee is a smoker given that they use the health insurance plan, you can use conditional probability. You want to find the probability of being a smoker (S) given that they use the health insurance plan (H), which can be written as P(S|H).

You are given two pieces of information:

1. The probability that an employee uses the health insurance plan: P(H) = 70% = 0.70.
2. The probability that an employee is a smoker given that they use the health insurance plan: P(S|H) = 40% = 0.40.

The second item is already the quantity asked for, so the answer can be read off directly. As a consistency check, recall the formula for conditional probability:

P(S|H) = P(S and H) / P(H)

Here P(S and H) is the probability that an employee both smokes and uses the health insurance plan. The 40% figure is conditional on H, so it is P(S|H), not P(S), and the multiplication rule for independent events does not apply. Use the general multiplication rule instead:

P(S and H) = P(S|H) * P(H)

P(S and H) = 0.40 * 0.70 = 0.28

Now, you have both P(S and H) and P(H). Plug these values into the conditional probability formula:

P(S|H) = 0.28 / 0.70 = 0.4

So, the probability that an employee is a smoker given that they use the health insurance plan is 0.4 or 40%.
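
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and exact fractions are used so that floating-point rounding does not blur the comparison.

```python
from fractions import Fraction

# Values taken from the problem statement.
p_h = Fraction(70, 100)          # P(H): employee uses the health insurance plan
p_s_given_h = Fraction(40, 100)  # P(S|H): smoker, given that they use the plan

# General multiplication rule: P(S and H) = P(S|H) * P(H)
p_s_and_h = p_s_given_h * p_h

# Definition of conditional probability: P(S|H) = P(S and H) / P(H)
recovered = p_s_and_h / p_h

print("P(S and H) =", float(p_s_and_h))
print("P(S|H)     =", float(recovered))
```

Dividing the joint probability by P(H) simply returns the 40% that the question already supplied, which is why no further information (such as the overall smoking rate P(S)) is needed.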

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

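Both variants are designed for discrete features such as text, but they model them differently. Bernoulli Naive Bayes treats each feature as binary (a term is present or absent) and explicitly penalizes the absence of a term. Multinomial Naive Bayes models how many times each term occurs, so repeated terms carry more weight. The choice depends on the nature of your data and the requirements of your task: if you're working with binary data (presence/absence of terms), Bernoulli Naive Bayes may be more suitable; if term frequencies matter, Multinomial Naive Bayes is usually the better choice.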
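
To make the presence-versus-frequency distinction concrete, here is a small sketch of the two class-conditional likelihoods written by hand. The three-term vocabulary, the probabilities, and the function names are all made up for illustration; this is not scikit-learn's implementation.

```python
import math

def bernoulli_log_likelihood(doc_counts, p_present):
    """Bernoulli event model: each term is either present or absent.

    Absent terms contribute log(1 - p), so absence is evidence too.
    """
    total = 0.0
    for count, p in zip(doc_counts, p_present):
        total += math.log(p if count > 0 else 1.0 - p)
    return total

def multinomial_log_likelihood(doc_counts, p_term):
    """Multinomial event model: every occurrence of a term counts.

    The count-only multinomial coefficient is omitted because it is
    the same for every class and does not affect classification.
    """
    return sum(count * math.log(p) for count, p in zip(doc_counts, p_term))

# Hypothetical vocabulary: ["free", "meeting", "offer"]
p_present_spam = [0.8, 0.1, 0.6]  # assumed P(term appears in a spam message)
p_term_spam = [0.5, 0.1, 0.4]     # assumed P(a random spam token is this term)

doc_once = [1, 0, 1]  # "free" appears once
doc_many = [5, 0, 1]  # "free" appears five times

bern_once = bernoulli_log_likelihood(doc_once, p_present_spam)
bern_many = bernoulli_log_likelihood(doc_many, p_present_spam)
multi_once = multinomial_log_likelihood(doc_once, p_term_spam)
multi_many = multinomial_log_likelihood(doc_many, p_term_spam)

print("Bernoulli:  ", bern_once, bern_many)
print("Multinomial:", multi_once, multi_many)
```

Under the Bernoulli model the two documents score identically, because only presence matters. Under the multinomial model the repeated term changes the score.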

Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes doesn't have a built-in mechanism for handling missing values, so you'll need to make decisions based on your specific dataset and problem. Imputation, treating missing values as a separate category, feature selection, or considering alternative Naive Bayes variants are potential strategies depending on your data and objectives.
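
As one possible illustration of two of these strategies, the sketch below fills each missing binary value with the most common observed value in its column and appends a "was missing" indicator per feature. The helper name, the use of `None` for missing entries, and the toy data are assumptions made for this example only.

```python
from collections import Counter

def impute_binary_with_indicator(rows):
    """Replace None with the column mode and append 0/1 missing flags.

    rows: list of equal-length lists containing 0, 1, or None.
    A column with no observed values falls back to 0 (an arbitrary choice).
    """
    n_features = len(rows[0])
    modes = []
    for j in range(n_features):
        observed = [row[j] for row in rows if row[j] is not None]
        modes.append(Counter(observed).most_common(1)[0][0] if observed else 0)

    result = []
    for row in rows:
        filled = [modes[j] if value is None else value for j, value in enumerate(row)]
        flags = [1 if value is None else 0 for value in row]
        result.append(filled + flags)
    return result

# Toy binary feature matrix with missing entries.
rows = [
    [1, None, 0],
    [1, 0, None],
    [None, 0, 1],
    [0, 1, 1],
]
imputed = impute_binary_with_indicator(rows)
for row in imputed:
    print(row)
```

The output stays binary, so it can be passed to a Bernoulli model, and the indicator columns let the classifier learn from the fact that a value was missing.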

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes. Gaussian Naive Bayes handles multi-class classification naturally: it estimates a separate mean and variance for every feature within each class, then predicts the class with the highest posterior probability. It's a popular choice for continuous or real-valued data and can work well on multi-class problems as long as the Gaussian assumption is reasonable for the data within each class.
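
The sketch below shows the idea on three classes with a hand-written Gaussian Naive Bayes. It is a simplified illustration rather than scikit-learn's `GaussianNB`: the function names, the toy data, and the small constant `eps` added to each variance are assumptions of this example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + eps  # eps avoids zero variance
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(len(rows) / len(X)), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(params):
        log_prior, means, variances = params
        score = log_prior
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=lambda label: log_posterior(model[label]))

# Toy data: three well-separated classes in two dimensions.
X = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
     [5.0, 5.2], [5.1, 4.9], [4.9, 5.0],
     [9.0, 1.0], [9.2, 1.1], [8.9, 0.9]]
y = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

model = fit_gaussian_nb(X, y)
predictions = [predict_gaussian_nb(model, p) for p in ([1.0, 1.0], [5.0, 5.0], [9.0, 1.0])]
print(predictions)
```

Nothing in the procedure is specific to two classes: adding a class only adds one more set of means and variances to compare.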

Q5. Assignment:

In [4]:
import pandas as pd

In [5]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"


In [12]:
df = pd.read_csv(url,header=None)

In [13]:
X = df.iloc[:,0:57]

In [14]:
X.head()

        0      1      2      3      4      5      6      7      8      9    ...     47     48     49     50     51     52     53     54     55     56
0    0.0   0.64   0.64    0.0   0.32    0.0    0.0    0.0    0.0    0.0    ...    0.0    0.0    0.0    0.0  0.778    0.0    0.0  3.756     61    278
1   0.21   0.28    0.5    0.0   0.14   0.28   0.21   0.07    0.0   0.94    ...    0.0    0.0  0.132    0.0  0.372   0.18  0.048  5.114    101   1028
2   0.06    0.0   0.71    0.0   1.23   0.19   0.19   0.12   0.64   0.25    ...    0.0   0.01  0.143    0.0  0.276  0.184   0.01  9.821    485   2259
3    0.0    0.0    0.0    0.0   0.63    0.0   0.31   0.63   0.31   0.63    ...    0.0    0.0  0.137    0.0  0.137    0.0    0.0  3.537     40    191
4    0.0    0.0    0.0    0.0   0.63    0.0   0.31   0.63   0.31   0.63    ...    0.0    0.0  0.135    0.0  0.135    0.0    0.0  3.537     40    191


In [18]:
y = df[57]

In [19]:
y

0       1
1       1
2       1
3       1
4       1
       ..
4596    0
4597    0
4598    0
4599    0
4600    0
Name: 57, Length: 4601, dtype: int64

In [42]:
from sklearn.model_selection import train_test_split,cross_val_score

In [21]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=10)

### Gaussian NB

In [22]:
from sklearn.naive_bayes import GaussianNB

In [40]:
gnb = GaussianNB()

In [43]:
cv_gnb = cross_val_score(gnb,X_train,y_train,cv=10)

In [44]:
cv_score_gnb = cv_gnb.mean()

In [45]:
cv_score_gnb

0.8139130434782608

In [25]:
gnb.fit(X_train,y_train)

In [27]:
y_pred = gnb.predict(X_test)

In [28]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [29]:
accuracy_score(y_test,y_pred)

0.8253692441355344

In [30]:
confusion_matrix(y_test,y_pred)

array([[510, 184],
       [ 17, 440]])

In [31]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.97      0.73      0.84       694
           1       0.71      0.96      0.81       457

    accuracy                           0.83      1151
   macro avg       0.84      0.85      0.82      1151
weighted avg       0.86      0.83      0.83      1151



### Multinomial NB

In [32]:
from sklearn.naive_bayes import MultinomialNB

In [33]:
mb = MultinomialNB()

In [46]:
cv_mb = cross_val_score(mb,X_train,y_train,cv=10)

In [47]:
cv_score_mb = cv_mb.mean()

In [48]:
cv_score_mb

0.7994202898550724

In [34]:
mb.fit(X_train,y_train)

In [35]:
y_pred1 = mb.predict(X_test)

In [37]:
accuracy_score(y_test,y_pred1)

0.8132059079061685

In [38]:
confusion_matrix(y_test,y_pred1)

array([[587, 107],
       [108, 349]])

In [39]:
print(classification_report(y_test,y_pred1))

              precision    recall  f1-score   support

           0       0.84      0.85      0.85       694
           1       0.77      0.76      0.76       457

    accuracy                           0.81      1151
   macro avg       0.80      0.80      0.80      1151
weighted avg       0.81      0.81      0.81      1151



### Bernoulli NB

In [49]:
from sklearn.naive_bayes import BernoulliNB

In [50]:
bnb = BernoulliNB()

In [51]:
cv_bnb = cross_val_score(bnb,X_train,y_train,cv=10)

In [52]:
cv_score_bnb = cv_bnb.mean()

In [53]:
cv_score_bnb

0.8863768115942028

In [54]:
bnb.fit(X_train,y_train)

In [55]:
y_pred2 = bnb.predict(X_test)

In [56]:
accuracy_score(y_test,y_pred2)

0.8905299739357081

In [57]:
confusion_matrix(y_test,y_pred2)

array([[646,  48],
       [ 78, 379]])

In [58]:
print(classification_report(y_test,y_pred2))

              precision    recall  f1-score   support

           0       0.89      0.93      0.91       694
           1       0.89      0.83      0.86       457

    accuracy                           0.89      1151
   macro avg       0.89      0.88      0.88      1151
weighted avg       0.89      0.89      0.89      1151



### Discussion

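Bernoulli Naive Bayes performed best of the three variants: it has the highest test accuracy (0.891, versus 0.825 for Gaussian and 0.813 for Multinomial) and the highest mean 10-fold cross-validation accuracy (0.886, versus 0.814 and 0.799). One plausible explanation is that most Spambase features are word and character frequencies, and BernoulliNB's default binarization threshold of 0 reduces each one to whether the term appears at all, which seems to suit this data better than modeling the raw values.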
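
The comparison can be reproduced from the three confusion matrices printed above. The sketch below copies those matrices by hand and treats class 1 (spam) as the positive class; the `summarize` helper and dictionary layout are illustrative choices, not part of the original notebook.

```python
# Confusion matrices copied from the notebook output.
# Layout follows scikit-learn: rows are true labels, columns are predictions.
confusion = {
    "GaussianNB": [[510, 184], [17, 440]],
    "MultinomialNB": [[587, 107], [108, 349]],
    "BernoulliNB": [[646, 48], [78, 379]],
}

def summarize(cm):
    """Compute accuracy plus precision and recall for the spam class (label 1)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "spam_precision": tp / (tp + fp),
        "spam_recall": tp / (tp + fn),
    }

summary = {name: summarize(cm) for name, cm in confusion.items()}
best = max(summary, key=lambda name: summary[name]["accuracy"])

for name, metrics in summary.items():
    print(f"{name:14s} "
          f"accuracy={metrics['accuracy']:.3f} "
          f"spam precision={metrics['spam_precision']:.3f} "
          f"spam recall={metrics['spam_recall']:.3f}")
print("Highest accuracy:", best)
```

Accuracy alone hides a trade-off worth noting: Gaussian NB has the highest spam recall but misclassifies 184 legitimate messages as spam, while Bernoulli NB makes far fewer such errors.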

### Conclusion

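Bernoulli Naive Bayes is the strongest of the three Naive Bayes variants on the Spambase dataset. Future work could explore more advanced machine learning algorithms, feature engineering techniques, or hyperparameter tuning to improve classification performance.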