# Naïve Bayes-2 Assignment

#### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

To find the probability that an employee is a smoker given that he/she uses the health insurance plan, denoted $P(\text{smoker} \mid \text{uses insurance})$, we can use conditional probability.

Let:

- $A$ be the event that an employee uses the health insurance plan.
- $B$ be the event that an employee is a smoker.

We want to find $P(B \mid A)$, which is the probability that an employee is a smoker given that the employee uses the health insurance plan.

We are given:

- $P(A) = 0.70$ (probability that an employee uses the insurance plan)
- $P(B \mid A) = 0.40$ (probability that an employee is a smoker given that they use the insurance plan)

By definition of conditional probability:

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

where $P(A \cap B)$ is the probability that an employee both uses the insurance plan and is a smoker.

The 40% figure is measured among plan users, so it is already the conditional probability we are asked for. As a consistency check, the multiplication rule gives $P(A \cap B) = P(B \mid A)\,P(A) = 0.40 \times 0.70 = 0.28$, and substituting into the definition:

$$P(B \mid A) = \frac{0.28}{0.70} = 0.40$$

Hence, the probability that an employee is a smoker given that he/she uses the health insurance plan is $0.40$ (40%).
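The arithmetic for Q1 can be checked numerically. The sketch below assumes the survey's 40% figure is measured among plan users, so it is read as $P(B \mid A)$; the variable names are illustrative only.

```python
# Survey figures from the question.
p_uses_plan = 0.70            # P(A)
p_smoker_given_plan = 0.40    # P(B | A), measured among plan users

# Joint probability via the multiplication rule: P(A and B) = P(B | A) * P(A).
p_joint = p_smoker_given_plan * p_uses_plan

# Recover the conditional probability from its definition: P(B | A) = P(A and B) / P(A).
p_recovered = p_joint / p_uses_plan

print(f"P(A and B) = {p_joint:.2f}")
print(f"P(B | A)   = {p_recovered:.2f}")
```

Dividing the joint probability by $P(A)$ returns the stated 40%, which confirms that no further calculation is needed beyond reading the conditional probability from the question.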

#### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in how they model the data and the type of features they are suitable for.

1. **Bernoulli Naive Bayes**:
   - **Model Type**: Bernoulli Naive Bayes is designed for binary/boolean features, where features are represented as either 0 or 1 (e.g., presence or absence of a term in a document).
   - **Example**: Bernoulli Naive Bayes is commonly used in text classification tasks, where each term's presence (1) or absence (0) in a document is considered.

2. **Multinomial Naive Bayes**:
   - **Model Type**: Multinomial Naive Bayes is suitable for features that describe discrete frequency counts (e.g., word counts in document classification).
   - **Data Representation**: It models the frequency of each feature (word) appearing in an instance (document).
   - **Example**: Multinomial Naive Bayes is widely used in natural language processing tasks, such as text categorization or spam email detection, where the frequency of words matters.

In summary:
- Bernoulli Naive Bayes assumes binary features (presence or absence), making it useful for tasks where only the occurrence of a feature matters.
- Multinomial Naive Bayes handles discrete feature counts (frequencies), making it suitable for tasks involving counts or frequencies of features, such as word occurrences in text.

The choice between Bernoulli and Multinomial Naive Bayes depends on the nature of the data and the specific requirements of the classification task. For text classification, Multinomial Naive Bayes is often preferred due to its ability to capture word frequencies, which are important for understanding the content and context of text documents.
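To make the distinction concrete, here is a minimal sketch on a hypothetical word-count matrix (the data and variable names are invented for illustration). It assumes scikit-learn's default `binarize=0.0` for `BernoulliNB`, under which any positive count is treated as "present".

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix: rows are documents, columns are words (hypothetical data).
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 2, 2],
])
y_toy = np.array([1, 1, 0, 0])

# BernoulliNB binarizes its input, so raw counts and 0/1 indicators give the same model.
bnb_counts = BernoulliNB().fit(X_counts, y_toy)
bnb_binary = BernoulliNB().fit((X_counts > 0).astype(int), y_toy)
same_model = np.allclose(bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_)
print("Bernoulli ignores count magnitude:", same_model)

# MultinomialNB keeps the magnitudes.
mnb_toy = MultinomialNB().fit(X_counts, y_toy)

# Per class: number of documents containing each word (Bernoulli)
print(bnb_counts.feature_count_)
# Per class: total occurrences of each word (Multinomial)
print(mnb_toy.feature_count_)
```

The two `feature_count_` arrays show the difference directly: the Bernoulli model counts documents in which a word appears, while the multinomial model sums how often it appears.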

#### Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes assumes that features are binary or boolean (taking values of 0 or 1), representing the presence or absence of specific features. The model has no built-in mechanism for missing values (scikit-learn's `BernoulliNB`, for example, rejects inputs containing NaN), so they must be handled before training:

1. **Imputation of Missing Values**:
   - Before applying Bernoulli Naive Bayes, missing values in the dataset need to be handled appropriately. One common approach is to impute missing values with a default value (e.g., 0 or 1) based on the distribution of the feature.
   - For binary features, missing values can be replaced by the most frequent value (0 or 1) in that feature column.

2. **Consideration of Missingness as Information**:
   - In some cases, missing values themselves might be informative or indicative of certain patterns. Because Bernoulli Naive Bayes models each feature as binary, a third "missing" category (different from 0 or 1) does not fit the model's assumptions directly; one workaround is to add a separate binary indicator feature that marks whether the value was missing.
   - Otherwise, imputing with the mode (most frequent value) keeps the feature binary. Mean imputation generally produces fractional values, which would have to be binarized again before the model can use them.

3. **Effect on Model Performance**:
   - The way missing values are handled can impact the performance of Bernoulli Naive Bayes. If imputation is done poorly or if missingness is not appropriately encoded, it can introduce bias or noise into the model.
   - It's important to assess the impact of missing value imputation on the model's predictive accuracy and adjust the approach accordingly.

In summary, Bernoulli Naive Bayes typically requires preprocessing steps to handle missing values, such as imputation with the most frequent value or another suitable strategy based on the nature of binary features. Careful consideration and experimentation with different imputation techniques are necessary to ensure the robustness and reliability of the classifier when dealing with missing data.
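The preprocessing described above can be sketched with scikit-learn as follows. The data are hypothetical, and the choice of `SimpleImputer(strategy="most_frequent")` is one reasonable option rather than the only one.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with some missing entries (np.nan).
X_missing = np.array([
    [1, 0, np.nan],
    [1, np.nan, 1],
    [0, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [np.nan, 1, 0],
])
y_missing = np.array([1, 1, 0, 0, 1, 0])

# Fitting directly on data with NaN is expected to fail.
try:
    BernoulliNB().fit(X_missing, y_missing)
    nan_rejected = False
except ValueError:
    nan_rejected = True
print("NaN rejected by BernoulliNB:", nan_rejected)

# Impute each column with its most frequent value, then fit the classifier.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X_missing, y_missing)
preds = model.predict(X_missing)

print("Fill values per column:", model.named_steps["simpleimputer"].statistics_)
print("Predictions:", preds)
```

Keeping the imputer inside the pipeline means the fill values are learned from the training data only and reused unchanged at prediction time.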

#### Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. It is a variant of the Naive Bayes algorithm for continuous (real-valued) features that assumes each feature follows a Gaussian (normal) distribution within each class.

For every class, the model estimates a prior probability plus a per-feature mean and variance, and it predicts the class with the highest posterior probability. Nothing in this procedure limits the number of classes to two. How well it works depends on how closely the data match the Gaussian and conditional-independence assumptions.
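As an illustration, the following sketch fits `GaussianNB` on three synthetic classes. The blob centres, sample sizes, and variable names are assumptions made for the example and are not part of the assignment data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)

# Three hypothetical classes, each a 2-D Gaussian blob around a different centre.
centres = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X_blobs = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in centres])
y_blobs = np.repeat([0, 1, 2], 50)

gnb_multi = GaussianNB().fit(X_blobs, y_blobs)

print(gnb_multi.classes_)                      # one entry per class
print(gnb_multi.predict(centres))              # predicted class for each blob centre
print(gnb_multi.predict_proba(centres).shape)  # one probability column per class
```

The API is the same as in the binary case; the classifier simply stores one set of Gaussian parameters per class and returns one posterior column per class.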

#### Q5. Assignment:

In [7]:
pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.6-py3-none-any.whl.metadata (5.3 kB)
Downloading ucimlrepo-0.0.6-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.6
Note: you may need to restart the kernel to use updated packages.


In [3]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets 
  
# # metadata 
# print(spambase.metadata) 
  
# # variable information 
# print(spambase.variables) 

In [4]:
X.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191


In [5]:
X.columns


Index(['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d',
       'word_freq_our', 'word_freq_over', 'word_freq_remove',
       'word_freq_internet', 'word_freq_order', 'word_freq_mail',
       'word_freq_receive', 'word_freq_will', 'word_freq_people',
       'word_freq_report', 'word_freq_addresses', 'word_freq_free',
       'word_freq_business', 'word_freq_email', 'word_freq_you',
       'word_freq_credit', 'word_freq_your', 'word_freq_font', 'word_freq_000',
       'word_freq_money', 'word_freq_hp', 'word_freq_hpl', 'word_freq_george',
       'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet',
       'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85',
       'word_freq_technology', 'word_freq_1999', 'word_freq_parts',
       'word_freq_pm', 'word_freq_direct', 'word_freq_cs', 'word_freq_meeting',
       'word_freq_original', 'word_freq_project', 'word_freq_re',
       'word_freq_edu', 'word_freq_table', 'word_freq_conference',


In [6]:
y.head()

Unnamed: 0,Class
0,1
1,1
2,1
3,1
4,1


In [7]:
X.shape

(4601, 57)

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB,BernoulliNB, MultinomialNB
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
import warnings
warnings.filterwarnings('ignore')

In [33]:
from sklearn.model_selection import GridSearchCV

In [30]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=42)

# Using Gaussian Naive Bayes classifiers

In [31]:
gnb=GaussianNB()
gnb.fit(X_train,y_train)
y_pred=gnb.predict(X_test)


print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))


conf_mat=confusion_matrix(y_test,y_pred)
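# confusion_matrix orders rows (true) and columns (predicted) by label: 0 = not spam, 1 = spam.
# The names below treat class 0 (not spam) as the "positive" class,
# so the precision/recall printed here match the class-0 row of the report.
TP=conf_mat[0][0]
FN=conf_mat[0][1]
FP=conf_mat[1][0]
TN=conf_mat[1][1]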

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1_score = 2 * (precision * recall) / (precision + recall)

print(f"Precision: {precision:.4f}")
print(f"Recall (Sensitivity): {recall:.4f}")
print(f"F1 Score: {f1_score:.4f}")

accuracy = (TP + TN) / (TP + FP + FN + TN)
print(f"Accuracy: {accuracy:.4f}")

0.8229098090849243
[[649 237]
 [ 32 601]]
              precision    recall  f1-score   support

           0       0.95      0.73      0.83       886
           1       0.72      0.95      0.82       633

    accuracy                           0.82      1519
   macro avg       0.84      0.84      0.82      1519
weighted avg       0.85      0.82      0.82      1519

Precision: 0.9530
Recall (Sensitivity): 0.7325
F1 Score: 0.8283
Accuracy: 0.8229
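Because scikit-learn orders the confusion matrix by label (rows are true labels, columns are predictions), the manually computed precision and recall depend on which class is taken as positive. The sketch below recomputes the metrics for both classes from the matrix printed above; the helper `per_class_metrics` is a hypothetical name introduced for this example.

```python
# Confusion matrix printed above for Gaussian Naive Bayes.
# Rows are true labels, columns are predicted labels, both ordered [0, 1].
conf_mat = [[649, 237],
            [32, 601]]

def per_class_metrics(cm, positive):
    """Return (precision, recall, f1) treating `positive` (0 or 1) as the positive class."""
    negative = 1 - positive
    tp = cm[positive][positive]
    fn = cm[positive][negative]   # positive-class samples predicted as the other class
    fp = cm[negative][positive]   # other-class samples predicted as the positive class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for label, name in [(0, "not spam"), (1, "spam")]:
    p, r, f = per_class_metrics(conf_mat, label)
    print(f"class {label} ({name}): precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

The class-0 values reproduce the manually printed precision and recall, while the class-1 values match the spam row of the classification report (0.72 precision, 0.95 recall).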


# Using Bernoulli Naive Bayes classifiers

In [32]:
bnb=BernoulliNB()
bnb.fit(X_train,y_train)
y_pred=bnb.predict(X_test)


print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))


conf_mat=confusion_matrix(y_test,y_pred)
# Index 0 is class 0 (not spam), which these names treat as the positive class.
TP=conf_mat[0][0]
FN=conf_mat[0][1]
FP=conf_mat[1][0]
TN=conf_mat[1][1]

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1_score = 2 * (precision * recall) / (precision + recall)

print(f"Precision: {precision:.4f}")
print(f"Recall (Sensitivity): {recall:.4f}")
print(f"F1 Score: {f1_score:.4f}")

accuracy = (TP + TN) / (TP + FP + FN + TN)
print(f"Accuracy: {accuracy:.4f}")

0.8821593153390388
[[824  62]
 [117 516]]
              precision    recall  f1-score   support

           0       0.88      0.93      0.90       886
           1       0.89      0.82      0.85       633

    accuracy                           0.88      1519
   macro avg       0.88      0.87      0.88      1519
weighted avg       0.88      0.88      0.88      1519

Precision: 0.8757
Recall (Sensitivity): 0.9300
F1 Score: 0.9020
Accuracy: 0.8822


In [35]:
parameter={
    'binarize': [0.0, 0.5, 1.0],  # Adjust binarization threshold
    'alpha': [0.0, 0.5, 1.0]
}

In [37]:
bnb_cv=GridSearchCV(bnb,param_grid=parameter,scoring='accuracy',refit=True,cv=10,verbose=3)

In [38]:
bnb_cv

In [39]:
bnb_cv.fit(X_train,y_train)

Fitting 10 folds for each of 9 candidates, totalling 90 fits
[CV 1/10] END ..........alpha=0.0, binarize=0.0;, score=0.618 total time=   0.0s
[CV 2/10] END ..........alpha=0.0, binarize=0.0;, score=0.618 total time=   0.0s
[CV 3/10] END ..........alpha=0.0, binarize=0.0;, score=0.617 total time=   0.0s
[CV 4/10] END ..........alpha=0.0, binarize=0.0;, score=0.617 total time=   0.0s
[CV 5/10] END ..........alpha=0.0, binarize=0.0;, score=0.617 total time=   0.0s
[CV 6/10] END ..........alpha=0.0, binarize=0.0;, score=0.617 total time=   0.0s
[CV 7/10] END ..........alpha=0.0, binarize=0.0;, score=0.617 total time=   0.0s
[CV 8/10] END ..........alpha=0.0, binarize=0.0;, score=0.617 total time=   0.0s
[CV 9/10] END ..........alpha=0.0, binarize=0.0;, score=0.617 total time=   0.0s
[CV 10/10] END .........alpha=0.0, binarize=0.0;, score=0.617 total time=   0.0s
[CV 1/10] END ..........alpha=0.0, binarize=0.5;, score=0.618 total time=   0.0s
[CV 2/10] END ..........alpha=0.0, binarize=0.5;

In [40]:
bnb_cv.best_params_

{'alpha': 0.5, 'binarize': 0.5}
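The `binarize` parameter is the threshold `BernoulliNB` applies to each feature before fitting: values strictly above it become 1 and everything else becomes 0. A small sketch using the first five feature values of the first row shown in `X.head()` illustrates how the default threshold and the selected one differ; `sklearn.preprocessing.binarize` is used here only to demonstrate the thresholding.

```python
import numpy as np
from sklearn.preprocessing import binarize

# First five feature values of the first row shown in X.head().
row = np.array([[0.0, 0.64, 0.64, 0.0, 0.32]])

# Values strictly greater than the threshold map to 1, all others to 0.
default_bits = binarize(row, threshold=0.0)   # BernoulliNB default
tuned_bits = binarize(row, threshold=0.5)     # threshold chosen by the grid search

print(default_bits)
print(tuned_bits)
```

With the threshold at 0.5, a word has to make up more than 0.5% of an email's words to count as present, so low-frequency occurrences such as the 0.32 above no longer switch the feature on.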

In [41]:
y_pred=bnb_cv.predict(X_test)

In [42]:
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))


conf_mat=confusion_matrix(y_test,y_pred)
# Index 0 is class 0 (not spam), which these names treat as the positive class.
TP=conf_mat[0][0]
FN=conf_mat[0][1]
FP=conf_mat[1][0]
TN=conf_mat[1][1]

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1_score = 2 * (precision * recall) / (precision + recall)

print(f"Precision: {precision:.4f}")
print(f"Recall (Sensitivity): {recall:.4f}")
print(f"F1 Score: {f1_score:.4f}")

accuracy = (TP + TN) / (TP + FP + FN + TN)
print(f"Accuracy: {accuracy:.4f}")

0.8986175115207373
[[799  87]
 [ 67 566]]
              precision    recall  f1-score   support

           0       0.92      0.90      0.91       886
           1       0.87      0.89      0.88       633

    accuracy                           0.90      1519
   macro avg       0.89      0.90      0.90      1519
weighted avg       0.90      0.90      0.90      1519

Precision: 0.9226
Recall (Sensitivity): 0.9018
F1 Score: 0.9121
Accuracy: 0.8986


# Using Multinomial Naive Bayes

In [44]:
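# NOTE: this cell instantiates GaussianNB, so the output below repeats the Gaussian results;
# MultinomialNB() (already imported) is needed to evaluate the multinomial model.
mnb=GaussianNB()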
mnb.fit(X_train,y_train)
y_pred=mnb.predict(X_test)


print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))


conf_mat=confusion_matrix(y_test,y_pred)
# Index 0 is class 0 (not spam), which these names treat as the positive class.
TP=conf_mat[0][0]
FN=conf_mat[0][1]
FP=conf_mat[1][0]
TN=conf_mat[1][1]

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1_score = 2 * (precision * recall) / (precision + recall)

print(f"Precision: {precision:.4f}")
print(f"Recall (Sensitivity): {recall:.4f}")
print(f"F1 Score: {f1_score:.4f}")

accuracy = (TP + TN) / (TP + FP + FN + TN)
print(f"Accuracy: {accuracy:.4f}")

0.8229098090849243
[[649 237]
 [ 32 601]]
              precision    recall  f1-score   support

           0       0.95      0.73      0.83       886
           1       0.72      0.95      0.82       633

    accuracy                           0.82      1519
   macro avg       0.84      0.84      0.82      1519
weighted avg       0.85      0.82      0.82      1519

Precision: 0.9530
Recall (Sensitivity): 0.7325
F1 Score: 0.8283
Accuracy: 0.8229
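For reference, a multinomial model is fitted with `MultinomialNB`, which expects non-negative, count-like features. The sketch below uses synthetic Poisson counts rather than Spambase, so its score says nothing about the spam task; all data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Hypothetical count data: class 0 favours the last two "words", class 1 the first two.
X_class0 = rng.poisson(lam=[1, 1, 6, 6], size=(100, 4))
X_class1 = rng.poisson(lam=[6, 6, 1, 1], size=(100, 4))
X_counts = np.vstack([X_class0, X_class1])
y_counts = np.array([0] * 100 + [1] * 100)

# alpha is the additive (Laplace) smoothing parameter.
mnb_demo = MultinomialNB(alpha=1.0).fit(X_counts, y_counts)
train_accuracy = mnb_demo.score(X_counts, y_counts)
print(f"Training accuracy on synthetic counts: {train_accuracy:.3f}")

# Negative inputs are rejected, since counts cannot be negative.
try:
    MultinomialNB().fit(X_counts - 10, y_counts)
    rejected_negative = False
except ValueError:
    rejected_negative = True
print("Negative features rejected:", rejected_negative)
```

The Spambase features are non-negative frequencies, so they satisfy this requirement and the same `fit`/`predict` calls used for the other classifiers would apply.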


Therefore, unless precision is absolutely the top priority and false positives must be minimized at all costs, Bernoulli Naive Bayes seems to be the best choice for classifying spam emails in this scenario. With the tuned parameters (`alpha=0.5`, `binarize=0.5`) it reaches about 0.90 test accuracy, compared with about 0.82 for Gaussian Naive Bayes. It also strikes a good balance between precision and recall, which is crucial for effective spam detection while limiting both false positives and false negatives.