<h1 style="text-align:center;color:#0F4C81">Naïve Bayes</h1>

- Bayes' Theorem
- Hands-on demo
- sklearn implementation:
    - GaussianNB
    - MultinomialNB
    - BernoulliNB

**Bayes' theorem** (alternatively **Bayes' law** or **Bayes' rule**, after Thomas Bayes) gives a mathematical rule for inverting conditional probabilities, allowing one to find the probability of a cause given its effect.[1] For example, if the risk of developing health problems is known to increase with age, Bayes' theorem allows the risk to someone of a known age to be assessed more accurately by conditioning it relative to their age, rather than assuming that the person is typical of the population as a whole.

Bayes' theorem is stated mathematically as the following equation:

$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$

where $A$ and $B$ are events and $P(B) \ne 0$.
- $P(A|B)$ is a **conditional probability**: the probability of event $A$ occurring given that $B$ is true. It is also called the **posterior probability** of $A$ given $B$.
- $P(B|A)$ is also a conditional probability: the probability of event $B$ occurring given that $A$ is true. It can also be interpreted as the **likelihood** of $A$ given a fixed $B$ because $P(B|A) = L(A|B)$.
- $P(A)$ and $P(B)$ are the probabilities of observing $A$ and $B$ respectively without any given conditions. $P(A)$ is known as the **prior probability** and $P(B)$ as the **marginal probability**.

[Wikipedia article](https://en.wikipedia.org/wiki/Bayes%27_theorem)

### **Exercise: Applying Bayes' Theorem to Wealth Prediction**  

#### **Problem Statement**  
Given the dataset below, determine the probability that a person is **super wealthy** (Wealth Group = 3) given that they have a **bachelor’s degree** (Academic Qualification Group = 2).  

#### **Dataset**  

| ID  | Name     | Academic Qualification | Group X | Wealth   | Group Y |
|-----|---------|------------------------|---------|----------|---------|
| 1   | Alice   | Bachelor               | 2       | $18.1B   | 3       |
| 2   | Bob     | PhD                    | 3       | $900,000 | 2       |
| 3   | Charlie | Master                 | 3       | $10,000  | 1       |
| 4   | David   | Bachelor               | 2       | $25,000  | 1       |
| 5   | Emma    | Bachelor               | 2       | $40,000  | 2       |
| 6   | Frank   | High School            | 1       | $12,500  | 1       |
| 7   | Grace   | High School            | 1       | $29,000  | 1       |
| 8   | Henry   | High School            | 1       | $1,500   | 1       |
| 9   | Ivy     | Bachelor               | 2       | $125,000 | 2       |
| 10  | Jack    | Bachelor               | 2       | $100,000 | 2       |

#### **Step 1: Define Events**  
- $ A $ = Being **super wealthy** (Wealth Group = 3)  
- $ B $ = Having a **bachelor’s degree** (Academic Qualification Group = 2)  

We need to compute $ P(A | B) $, the probability of being super wealthy given that someone has a bachelor’s degree.  

Using **Bayes’ Theorem**:  

$$
P(A | B) = \frac{P(B | A) P(A)}{P(B)}
$$

#### **Step 2: Compute Required Probabilities from Data**  

- **Total number of individuals** = 10  
- **Individuals in Wealth Group = 3 (Super Wealthy)** = **1** (Alice)  
- **Individuals in Academic Qualification Group = 2 (Bachelor's Degree)** = **5** (Alice, David, Emma, Ivy, Jack)  
- **Individuals who are both in Wealth Group = 3 and Academic Qualification Group = 2** = **1** (Alice)  

$$
P(A) = \frac{\text{Super Wealthy individuals}}{\text{Total individuals}} = \frac{1}{10} = 0.1
$$

$$
P(B) = \frac{\text{Individuals with a Bachelor's degree}}{\text{Total individuals}} = \frac{5}{10} = 0.5
$$

$$
P(B | A) = \frac{\text{Super Wealthy individuals with a Bachelor's degree}}{\text{Total Super Wealthy individuals}} = \frac{1}{1} = 1
$$

#### **Step 3: Apply Bayes’ Theorem**  

$$
P(A | B) = \frac{(1) (0.1)}{0.5}
$$

$$
P(A | B) = \frac{0.1}{0.5} = 0.2
$$

Thus, **if someone has a bachelor’s degree, the probability that they are super wealthy is 20%**.
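As a quick check of Steps 2 and 3, the three Bayes terms can be built from the counts and combined in code. This is a minimal sketch: the counts are typed in by hand from the table, and the variable names are illustrative rather than part of any library.

```python
from fractions import Fraction

# Counts read off the 10-row wealth table
total = 10
n_super_wealthy = 1   # Group Y == 3 (Alice)
n_bachelor = 5        # Group X == 2
n_both = 1            # super wealthy AND bachelor's degree

p_a = Fraction(n_super_wealthy, total)            # P(A)
p_b = Fraction(n_bachelor, total)                 # P(B)
p_b_given_a = Fraction(n_both, n_super_wealthy)   # P(B | A)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b, float(p_a_given_b))  # prints: 1/5 0.2
```

Using `Fraction` keeps the arithmetic exact, so the result is the same 1/5 obtained by hand.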


### **Interpretation**  
Even though the overall probability of being super wealthy is low (**10%** of the population), having a **bachelor’s degree doubles the probability to 20%**. However, since only one person in the dataset is super wealthy, this is a small sample size, and more data would be needed to draw stronger conclusions.

In [11]:
import pandas as pd

data = {
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma", "Frank", "Grace", "Henry", "Ivy", "Jack"],
    "Academic Qualification": ["Bachelor", "PhD", "Master", "Bachelor", "Bachelor", 
                               "High School", "High School", "High School", "Bachelor", "Bachelor"],
    "Group X": [2, 3, 3, 2, 2, 1, 1, 1, 2, 2],
    "Wealth": [18.1e9, 900000, 10000, 25000, 40000, 12500, 29000, 1500, 125000, 100000],
    "Group Y": [3, 2, 1, 1, 2, 1, 1, 1, 2, 2]
}

df = pd.DataFrame(data)

display(df)

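|    |   ID | Name    | Academic Qualification   |   Group X |        Wealth |   Group Y |
|---:|-----:|:--------|:-------------------------|----------:|--------------:|----------:|
|  0 |    1 | Alice   | Bachelor                 |         2 | 18100000000.0 |         3 |
|  1 |    2 | Bob     | PhD                      |         3 |      900000.0 |         2 |
|  2 |    3 | Charlie | Master                   |         3 |       10000.0 |         1 |
|  3 |    4 | David   | Bachelor                 |         2 |       25000.0 |         1 |
|  4 |    5 | Emma    | Bachelor                 |         2 |       40000.0 |         2 |
|  5 |    6 | Frank   | High School              |         1 |       12500.0 |         1 |
|  6 |    7 | Grace   | High School              |         1 |       29000.0 |         1 |
|  7 |    8 | Henry   | High School              |         1 |        1500.0 |         1 |
|  8 |    9 | Ivy     | Bachelor                 |         2 |      125000.0 |         2 |
|  9 |   10 | Jack    | Bachelor                 |         2 |      100000.0 |         2 |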


In [12]:
# Count total individuals
total_individuals = len(df)

# Count individuals with a Bachelor's degree
bachelor_group = df[df["Group X"] == 2]
count_bachelor = len(bachelor_group)

# Count super wealthy individuals (Wealth Group Y = 3)
super_wealthy = df[df["Group Y"] == 3]
count_super_wealthy = len(super_wealthy)

# Count individuals who are both super wealthy and have a Bachelor's degree
super_wealthy_bachelor = bachelor_group[bachelor_group["Group Y"] == 3]
count_super_wealthy_bachelor = len(super_wealthy_bachelor)

# Calculate P(A | B)
P_A_given_B = count_super_wealthy_bachelor / count_bachelor
print(f"P(Super Wealthy | Bachelor's Degree) = {P_A_given_B:.2f}")

P(Super Wealthy | Bachelor's Degree) = 0.20


### **When Should You Use Bayes' Theorem?**

In this specific case, you can compute $ P(A | B) $ directly from the dataset using simple conditional probability:

$$
P(A | B) = \frac{\text{Individuals in both Group A (super wealthy) and Group B (bachelor's degree)}}{\text{Individuals in Group B (bachelor's degree)}}
$$

which gives us:

$$
P(A | B) = \frac{1}{5} = 0.2
$$


Bayes' Theorem is particularly useful in scenarios where **direct probability computation is not feasible** due to missing or indirect information. Here are some key situations where you **must** use Bayes' Theorem:

#### **1. When You Have Indirect Probabilities**  
- Sometimes, you don't have **direct data on $ P(A | B) $** but have **reverse information** like $ P(B | A) $ and need to compute $ P(A | B) $.
- Example: **Medical Testing**
  - You know the probability of testing positive given that someone has a disease ($ P(B | A) $).
  - You need to compute the probability that a person actually has the disease given a positive test ($ P(A | B) $).

#### **2. When You Need to Incorporate Prior Knowledge**
- Bayes' Theorem allows you to adjust probabilities based on prior beliefs or historical data.
- Example: **Spam Detection**
  - If an email contains the word "free," what is the probability it is spam?
  - We may know:
    - $ P(\text{"free"} | \text{spam}) $ = Probability of "free" appearing in spam.
    - $ P(\text{"free"} | \text{not spam}) $ = Probability of "free" appearing in legitimate emails.
    - $ P(\text{spam}) $ = Overall likelihood of spam.
  - **Bayes' Theorem helps compute the final probability that an email is spam given it contains "free".**

#### **3. When You Have Overlapping or Confusing Events**
- Example: **Diagnostics & Fraud Detection**
  - You might know:
    - Probability of an alert being triggered when fraud is present.
    - Probability of an alert being triggered when no fraud is present (false positives).
    - Overall probability of fraud occurring.
  - If an alert is triggered, what is the chance fraud is actually happening? **Bayes' Theorem helps here**.
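All three scenarios share the same shape: a binary hypothesis (disease, spam, fraud), a piece of evidence, and a known rate of that evidence under each hypothesis. A minimal sketch of that pattern follows. The function name `posterior` and every number in it are hypothetical, chosen only to illustrate the medical-testing case.

```python
def posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Return P(H | E) for a binary hypothesis H using Bayes' theorem."""
    # Marginal P(E): evidence can arise when H is true or when H is false
    p_e = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    return p_e_given_h * prior / p_e

# Hypothetical numbers: 1% prevalence, 95% true-positive rate, 5% false-positive rate
p_disease = posterior(prior=0.01, p_e_given_h=0.95, p_e_given_not_h=0.05)
print(f"P(disease | positive test) = {p_disease:.3f}")  # prints 0.161
```

With these made-up rates, a positive test still leaves the disease fairly unlikely, because the low prior dominates. This is the kind of backwards reasoning the scenarios above describe.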

### **When You Can Skip Bayes' Theorem**
If you **already have** enough direct data to compute conditional probabilities (like in the wealth dataset), **Bayes' Theorem is not necessary**. It's most useful when you're reasoning **backwards from observed data to hidden causes**.

### **Example Where Bayes' Theorem is Necessary**: Identifying a Defective Product  

#### **Scenario:**  
A factory produces **microchips** from **two different machines**:  
- **Machine A** produces **60%** of the total microchips.  
- **Machine B** produces **40%** of the total microchips.  

The probability of a defective chip is:  
- **Machine A:** $ P(D | A) = 2\% = 0.02 $  
- **Machine B:** $ P(D | B) = 5\% = 0.05 $  

If a randomly selected microchip is found to be **defective**, what is the probability that it came from **Machine B**?


### **Step 1: Define Events**  
- $ A $: The chip comes from **Machine A**  
- $ B $: The chip comes from **Machine B**  
- $ D $: The chip is **defective**  

We need to compute:  

$$
P(B | D) = \frac{P(D | B) P(B)}{P(D)}
$$


### **Step 2: Compute Total Probability of Defective Chips**  

Using the **law of total probability**, the overall probability of getting a defective chip is:

$$
P(D) = P(D | A) P(A) + P(D | B) P(B)
$$

Substituting values:

$$
P(D) = (0.02 \times 0.6) + (0.05 \times 0.4)
$$

$$
P(D) = 0.012 + 0.02 = 0.032
$$


### **Step 3: Apply Bayes’ Theorem**  

$$
P(B | D) = \frac{P(D | B) P(B)}{P(D)}
$$

$$
P(B | D) = \frac{(0.05 \times 0.4)}{0.032}
$$

$$
P(B | D) = \frac{0.02}{0.032} = 0.625
$$

So, **if a microchip is defective, there is a 62.5% probability that it came from Machine B**.

In [14]:
# Given probabilities
P_A = 0.6  # Probability that a chip comes from Machine A
P_B = 0.4  # Probability that a chip comes from Machine B

P_D_given_A = 0.02  # Probability of defect given Machine A
P_D_given_B = 0.05  # Probability of defect given Machine B

# Compute total probability of a defective chip using Law of Total Probability
P_D = (P_D_given_A * P_A) + (P_D_given_B * P_B)

print(f"P(D) = {P_D:.4f}")

# Compute P(B | D) using Bayes' Theorem
P_B_given_D = (P_D_given_B * P_B) / P_D

print(f"P(Machine B | Defective Chip) = {P_B_given_D:.3f}")

P(D) = 0.0320
P(Machine B | Defective Chip) = 0.625


**From Bayes' Theorem to Naive Bayes**  
Now that we've established the foundational understanding of Bayes' Theorem and how it allows us to update our beliefs in the presence of new evidence, we can move on to an important application: **Naive Bayes**. While Bayes' Theorem works for general probabilistic reasoning, **Naive Bayes** is a specific application of this concept in classification problems, particularly when dealing with high-dimensional data. The term "naive" comes from the simplifying assumption that the features are **conditionally independent** given the class label, which significantly reduces the complexity of the computation. This assumption, while often unrealistic in real-world data, still allows Naive Bayes classifiers to perform surprisingly well in many practical situations. In the following sections, we will explore how different types of Naive Bayes models, such as **Gaussian**, **Multinomial**, and **Bernoulli**, extend Bayes' Theorem for classification tasks and implement them using **scikit-learn**.
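Written out, the conditional-independence assumption replaces the joint likelihood of all features with a product of per-feature terms. The symbols $x_1, \dots, x_n$ (feature values) and $C_k$ (a class label) are notation introduced here for illustration:

$$
P(x_1, \dots, x_n | C_k) = \prod_{i=1}^{n} P(x_i | C_k)
\quad \Longrightarrow \quad
P(C_k | x_1, \dots, x_n) \propto P(C_k) \prod_{i=1}^{n} P(x_i | C_k)
$$

Each factor $P(x_i | C_k)$ can be estimated from one feature at a time, which is why the model stays cheap even with many features.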

### **Main Types of Naive Bayes Classifier**

There are three main types of Naive Bayes classifiers. The key difference between these types lies in the assumption they make about the distribution of features:

1. **Bernoulli Naive Bayes**: Suited for binary/boolean features. It assumes each feature is a binary-valued (0/1) variable.
2. **Multinomial Naive Bayes**: Typically used for discrete counts. It’s often used in text classification, where features might be word counts.
3. **Gaussian Naive Bayes**: Assumes that continuous features follow a normal distribution.
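As a rough illustration of how the three variants map onto scikit-learn, the sketch below fits each class on a tiny array of the matching feature type. The arrays and labels are made up for this example and carry no meaning beyond showing which input each estimator expects.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Toy labels and features, invented purely for illustration
y = np.array([0, 0, 1, 1])
X_binary = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])                 # 0/1 flags
X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])                 # non-negative counts
X_real = np.array([[1.2, 0.5], [0.9, 0.7], [3.1, 2.2], [2.8, 2.5]])   # continuous values

fitted = {}
for name, model, X in [
    ("BernoulliNB", BernoulliNB(), X_binary),
    ("MultinomialNB", MultinomialNB(), X_counts),
    ("GaussianNB", GaussianNB(), X_real),
]:
    fitted[name] = model.fit(X, y)
    # Class probabilities for the first training row
    print(name, model.predict_proba(X[:1]).round(3))
```

All three share the same `fit` / `predict` / `predict_proba` interface, so switching variants mostly comes down to matching the estimator to the feature type.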

<div style="display:flex;justify-content:center;align-items:center;">
<img src="images/naive_bayes_models.png" style="width:400px;object-fit:cover;" />
</div>

## **Bernoulli Naive Bayes (`BernoulliNB`)**

A good place to start is the simplest variant, Bernoulli Naive Bayes. `BernoulliNB` implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. Therefore, this class requires samples to be represented as binary-valued feature vectors.

We’ll use an artificial golf dataset as an example. The task is to predict whether a person will play golf based on weather conditions:

|    | Outlook   |   Temperature |   Humidity | Wind   | Play   |
|---:|:----------|--------------:|-----------:|:-------|:-------|
|  0 | sunny     |            85 |         85 | False  | No     |
|  1 | sunny     |            80 |         90 | True   | No     |
|  2 | overcast  |            83 |         78 | False  | Yes    |
|  3 | rain      |            70 |         96 | False  | Yes    |
|  4 | rain      |            68 |         80 | False  | Yes    |
|  5 | rain      |            65 |         70 | True   | No     |
|  6 | overcast  |            64 |         65 | True   | Yes    |
|  7 | sunny     |            72 |         95 | False  | No     |
|  8 | sunny     |            69 |         70 | False  | Yes    |
|  9 | rain      |            75 |         80 | False  | Yes    |
| 10 | sunny     |            75 |         70 | True   | Yes    |
| 11 | overcast  |            72 |         90 | True   | Yes    |
| 12 | overcast  |            81 |         75 | False  | Yes    |
| 13 | rain      |            71 |         80 | True   | No     |
| 14 | sunny     |            81 |         88 | True   | No     |
| 15 | overcast  |            74 |         92 | False  | Yes    |
| 16 | rain      |            76 |         85 | False  | Yes    |
| 17 | sunny     |            78 |         75 | True   | No     |
| 18 | sunny     |            82 |         92 | False  | No     |
| 19 | rain      |            67 |         90 | True   | No     |
| 20 | overcast  |            85 |         85 | True   | Yes    |
| 21 | rain      |            73 |         88 | False  | Yes    |
| 22 | sunny     |            88 |         65 | True   | Yes    |
| 23 | overcast  |            77 |         70 | False  | Yes    |
| 24 | sunny     |            79 |         60 | False  | Yes    |
| 25 | overcast  |            80 |         95 | True   | Yes    |
| 26 | rain      |            66 |         70 | False  | No     |
| 27 | overcast  |            84 |         78 | False  | Yes    |


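We’ll adapt it slightly for Bernoulli Naive Bayes by converting every feature to binary. In the table below, `Outlook` is one-hot encoded into `overcast`, `rain`, and `sunny`; `Temperature_Hot` is 1 when the temperature is above 80; `Humidity_Humid` is 1 when the humidity is above 75; and `Wind` and `Play` are mapped to 0/1.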

|    |   Temperature_Hot |   Humidity_Humid |   Wind |   overcast |   rain |   sunny |   Play |
|---:|------------------:|-----------------:|-------:|-----------:|-------:|--------:|-------:|
|  0 |                 1 |                1 |      0 |          0 |      0 |       1 |      0 |
|  1 |                 0 |                1 |      1 |          0 |      0 |       1 |      0 |
|  2 |                 1 |                1 |      0 |          1 |      0 |       0 |      1 |
|  3 |                 0 |                1 |      0 |          0 |      1 |       0 |      1 |
|  4 |                 0 |                1 |      0 |          0 |      1 |       0 |      1 |
|  5 |                 0 |                0 |      1 |          0 |      1 |       0 |      0 |
|  6 |                 0 |                0 |      1 |          1 |      0 |       0 |      1 |
|  7 |                 0 |                1 |      0 |          0 |      0 |       1 |      0 |
|  8 |                 0 |                0 |      0 |          0 |      0 |       1 |      1 |
|  9 |                 0 |                1 |      0 |          0 |      1 |       0 |      1 |
| 10 |                 0 |                0 |      1 |          0 |      0 |       1 |      1 |
| 11 |                 0 |                1 |      1 |          1 |      0 |       0 |      1 |
| 12 |                 1 |                0 |      0 |          1 |      0 |       0 |      1 |
| 13 |                 0 |                1 |      1 |          0 |      1 |       0 |      0 |
| 14 |                 1 |                1 |      1 |          0 |      0 |       1 |      0 |
| 15 |                 0 |                1 |      0 |          1 |      0 |       0 |      1 |
| 16 |                 0 |                1 |      0 |          0 |      1 |       0 |      1 |
| 17 |                 0 |                0 |      1 |          0 |      0 |       1 |      0 |
| 18 |                 1 |                1 |      0 |          0 |      0 |       1 |      0 |
| 19 |                 0 |                1 |      1 |          0 |      1 |       0 |      0 |
| 20 |                 1 |                1 |      1 |          1 |      0 |       0 |      1 |
| 21 |                 0 |                1 |      0 |          0 |      1 |       0 |      1 |
| 22 |                 1 |                0 |      1 |          0 |      0 |       1 |      1 |
| 23 |                 0 |                0 |      0 |          1 |      0 |       0 |      1 |
| 24 |                 0 |                0 |      0 |          0 |      0 |       1 |      1 |
| 25 |                 0 |                1 |      1 |          1 |      0 |       0 |      1 |
| 26 |                 0 |                0 |      0 |          0 |      1 |       0 |      0 |
| 27 |                 1 |                1 |      0 |          1 |      0 |       0 |      1 |

#### **Main Mechanism**
Bernoulli Naive Bayes operates on data where each feature is either 0 or 1.

1. Calculate the probability of each class in the training data.
2. For each feature and class, calculate the probability of the feature being 1 and 0 given the class.
3. For a new instance: For each class, multiply its probability by the probability of each feature value (0 or 1) for that class.
4. Predict the class with the highest resulting probability.
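The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function names are hypothetical, the toy data is made up, and no smoothing is applied, so a feature value never seen in a class gets probability zero.

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """Steps 1-2: class priors and P(feature = 1 | class) by plain counting."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    # Row k holds P(X_i = 1 | class k) for every feature i
    p_one = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, priors, p_one

def predict_bernoulli_nb(x, classes, priors, p_one):
    """Steps 3-4: unnormalized score per class and the winning class."""
    # Use p where the feature is 1 and (1 - p) where it is 0
    per_feature = np.where(x == 1, p_one, 1.0 - p_one)
    scores = priors * per_feature.prod(axis=1)
    return classes[np.argmax(scores)], scores

# Tiny made-up dataset: 2 binary features, 2 classes
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 0, 0, 0])
classes, priors, p_one = fit_bernoulli_nb(X, y)
label, scores = predict_bernoulli_nb(np.array([1, 0]), classes, priors, p_one)
print(label, scores)
```

Note that a feature equal to 0 still contributes a factor, $1 - P(X_i = 1 | C_k)$. This is what distinguishes the Bernoulli model from one that only looks at features that are present.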

## **Step 1: Calculate Prior Probabilities**
The **prior probability** of a class $ C_k $ is calculated as:  

$$
P(C_k) = \frac{\text{Number of instances in class } C_k}{\text{Total instances}}
$$

From the dataset, we count how many times **Play = Yes** and **Play = No**:

- **Yes (Play = 1)** → 18 instances
- **No (Play = 0)** → 10 instances
- **Total instances** → $ 28 $

$$
P(\text{Play=Yes}) = \frac{18}{28} \approx 0.643
$$

$$
P(\text{Play=No}) = \frac{10}{28} \approx 0.357
$$


## **Step 2: Compute Likelihoods**
For each feature (binary column), we compute:

$$
P(X_i = 1 | C_k) = \frac{\text{Number of instances where } X_i = 1 \text{ and } C_k}{\text{Total instances where } C_k}
$$

$$
P(X_i = 0 | C_k) = 1 - P(X_i = 1 | C_k)
$$

We compute these probabilities for **each feature** (e.g., `Temperature_Hot`, `Humidity_Humid`, etc.) **given Play = Yes and Play = No**.

### **Example Calculation for `Sunny` Feature**
- **Sunny = 1 | Play = Yes** → Appears in **4 out of 18** cases.

$$
P(\text{Sunny=1 | Play=Yes}) = \frac{4}{18} \approx 0.222
$$

- **Sunny = 1 | Play = No** → Appears in **6 out of 10** cases.

$$
P(\text{Sunny=1 | Play=No}) = \frac{6}{10} = 0.600
$$

Repeating this for all features gives the following raw relative frequencies (no smoothing applied):

| Feature           | $P(X_i = 1 \mid \text{Play=Yes})$ | $P(X_i = 1 \mid \text{Play=No})$ |
|:------------------|----------------------------------:|---------------------------------:|
| `Temperature_Hot` |                 5/18 ≈ 0.278      |                 3/10 = 0.300     |
| `Humidity_Humid`  |                11/18 ≈ 0.611      |                 7/10 = 0.700     |
| `Wind`            |                 6/18 ≈ 0.333      |                 6/10 = 0.600     |
| `overcast`        |                 9/18 = 0.500      |                 0/10 = 0.000     |
| `rain`            |                 5/18 ≈ 0.278      |                 4/10 = 0.400     |
| `sunny`           |                 4/18 ≈ 0.222      |                 6/10 = 0.600     |

## **Step 3: Apply Bayes' Theorem**
For a **new test instance**:
> **(Temperature_Hot = 1, Humidity_Humid = 1, Wind = 0, Overcast = 0, Rain = 0, Sunny = 1)**  
> (i.e., **Hot, Humid, No Wind, Sunny**)

We use **Bayes’ Rule** to calculate:

$$
P(\text{Play=Yes} | X) = \frac{P(X | \text{Play=Yes}) P(\text{Play=Yes})}{P(X)}
$$

$$
P(\text{Play=No} | X) = \frac{P(X | \text{Play=No}) P(\text{Play=No})}{P(X)}
$$

Since $ P(X) $ is the same for both classes and we only care about comparing them, we ignore it and compute only the numerators.

$$
P(\text{Play=Yes} | X) \propto P(X | \text{Play=Yes}) P(\text{Play=Yes})
$$

$$
P(\text{Play=No} | X) \propto P(X | \text{Play=No}) P(\text{Play=No})
$$

Bernoulli Naive Bayes uses every feature, including those equal to 0, so the likelihood has six factors. For a feature equal to 0 we use $ 1 - P(X_i = 1 | C_k) $.

### **Compute Likelihood for Play = Yes**
$$
P(X | \text{Play=Yes}) = P(\text{TempHot=1} | \text{Play=Yes}) \times P(\text{HumidityHumid=1} | \text{Play=Yes}) \times P(\text{Wind=0} | \text{Play=Yes}) \times P(\text{Overcast=0} | \text{Play=Yes}) \times P(\text{Rain=0} | \text{Play=Yes}) \times P(\text{Sunny=1} | \text{Play=Yes})
$$

$$
= (0.278) \times (0.611) \times (0.667) \times (0.500) \times (0.722) \times (0.222)
$$

$$
\approx 0.00908
$$

Unnormalized posterior:

$$
P(\text{Play=Yes} | X) \propto 0.00908 \times 0.643 \approx 0.0058
$$

### **Compute Likelihood for Play = No**
$$
P(X | \text{Play=No}) = P(\text{TempHot=1} | \text{Play=No}) \times P(\text{HumidityHumid=1} | \text{Play=No}) \times P(\text{Wind=0} | \text{Play=No}) \times P(\text{Overcast=0} | \text{Play=No}) \times P(\text{Rain=0} | \text{Play=No}) \times P(\text{Sunny=1} | \text{Play=No})
$$

$$
= (0.300) \times (0.700) \times (0.400) \times (1.000) \times (0.600) \times (0.600)
$$

$$
\approx 0.0302
$$

Unnormalized posterior:

$$
P(\text{Play=No} | X) \propto 0.0302 \times 0.357 \approx 0.0108
$$


## **Step 4: Make a Prediction**
Comparing the two unnormalized scores:

$$
P(\text{Play=Yes} | X) \propto 0.0058
$$

$$
P(\text{Play=No} | X) \propto 0.0108
$$

Normalizing gives $ P(\text{Play=No} | X) \approx 0.0108 / (0.0108 + 0.0058) \approx 0.65 $. Since $ P(\text{Play=No} | X) > P(\text{Play=Yes} | X) $, we predict **Play = No** (won't play golf).
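The hand calculation can be cross-checked by counting directly from the 28-row golf table. The sketch below is one way to do that with pandas. The thresholds `> 80` and `> 75` are inferred from the binarized table rather than stated by it, all six binary columns enter the product, and no smoothing is applied.

```python
import pandas as pd

outlook = ("sunny sunny overcast rain rain rain overcast sunny sunny rain sunny overcast "
           "overcast rain sunny overcast rain sunny sunny rain overcast rain sunny overcast "
           "sunny overcast rain overcast").split()
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71,
               81, 74, 76, 78, 82, 67, 85, 73, 88, 77, 79, 80, 66, 84]
humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80,
            88, 92, 85, 75, 92, 90, 85, 88, 65, 70, 60, 95, 70, 78]
wind = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1,
        1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0]
play = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0,
        0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1]

X = pd.DataFrame({
    "Temperature_Hot": [int(t > 80) for t in temperature],  # assumed threshold
    "Humidity_Humid": [int(h > 75) for h in humidity],      # assumed threshold
    "Wind": wind,
    "overcast": [int(o == "overcast") for o in outlook],
    "rain": [int(o == "rain") for o in outlook],
    "sunny": [int(o == "sunny") for o in outlook],
})
y = pd.Series(play, name="Play")

priors = y.value_counts(normalize=True)   # P(Play = k)
p_one = X.groupby(y).mean()               # P(X_i = 1 | Play = k), one row per class

# Test instance: hot, humid, no wind, sunny
x_new = pd.Series({"Temperature_Hot": 1, "Humidity_Humid": 1, "Wind": 0,
                   "overcast": 0, "rain": 0, "sunny": 1})

scores = {}
for k in (0, 1):
    # Keep p where the feature is 1, switch to (1 - p) where it is 0
    per_feature = p_one.loc[k].where(x_new == 1, 1 - p_one.loc[k])
    scores[k] = float(priors[k] * per_feature.prod())

print(p_one.round(3))
print({k: round(v, 4) for k, v in scores.items()})
```

Run as written, the scores should come out at roughly 0.0108 for Play = No and 0.0058 for Play = Yes. scikit-learn's `BernoulliNB` applies Laplace smoothing by default (`alpha=1.0`), so its internal probabilities will differ slightly from these raw frequencies.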

In [16]:
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Prepare data for model
df = pd.get_dummies(df, columns=['Outlook'],  prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Split data into training and testing sets
X, y = df.drop(columns='Play'), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Standardize numerical features so that BernoulliNB's default threshold
# (binarize=0.0) maps values above the training mean to 1 and the rest to 0
scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])

# Train the model
nb_clf = BernoulliNB()
nb_clf.fit(X_train, y_train)

# Make predictions
y_pred = nb_clf.predict(X_test)

# Check accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Accuracy: 0.8571428571428571


## **Gaussian Naive Bayes (`GaussianNB`)**

**Overview**

Gaussian Naive Bayes is a classification algorithm based on **Bayes' Theorem** with the assumption that the **features are conditionally independent** given the class label (the "naive" assumption). What makes **Gaussian Naive Bayes (GaussianNB)** unique is that it assumes that the continuous features follow a **Gaussian (normal) distribution**.

### **How Gaussian Naive Bayes Works**

1. **Assumptions**:  
   Gaussian Naive Bayes assumes that:
   - Each feature (variable) is independent of the others within a given class (the Naive assumption).
   - Each feature follows a **Gaussian (normal) distribution** within each class.

2. **Bayes' Theorem**:  
   The algorithm relies on **Bayes' Theorem** to calculate the posterior probability of a class given the features:

   $$
   P(C_k | X) = \frac{P(X | C_k) \cdot P(C_k)}{P(X)}
   $$
   where:
   - $ C_k $ is the class.
   - $ X $ represents the feature vector.
   - $ P(C_k | X) $ is the posterior probability of class $ C_k $ given the feature vector $ X $.
   - $ P(X | C_k) $ is the likelihood, the probability of observing $ X $ given class $ C_k $.
   - $ P(C_k) $ is the prior probability of class $ C_k $.
   - $ P(X) $ is the marginal likelihood of the feature vector $ X $.

   Since we are dealing with continuous features, the likelihood $ P(X | C_k) $ is modeled as the **product of Gaussian distributions** for each feature.

3. **Gaussian Distribution for Each Feature**:  
   Each feature $ x_i $ is assumed to follow a normal distribution, given the class $ C_k $:

   $$
   P(x_i | C_k) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)
   $$
   where:
   - $ \mu $ is the mean of the feature $ x_i $ within class $ C_k $.
   - $ \sigma^2 $ is the variance of the feature $ x_i $ within class $ C_k $.
   
   The likelihood $ P(X | C_k) $ is the product of these Gaussian probabilities for each feature, assuming the features are independent:

   $$
   P(X | C_k) = \prod_{i=1}^{n} P(x_i | C_k)
   $$

4. **Posterior Probability**:  
   After calculating the likelihood $ P(X | C_k) $ and using **Bayes' Theorem**, the algorithm calculates the posterior probability for each class $ C_k $. The class with the highest posterior probability is chosen as the predicted class.

   $$
   P(C_k | X) \propto P(C_k) \prod_{i=1}^{n} P(x_i | C_k)
   $$

   The class with the highest posterior probability is selected as the predicted label for the given input vector $ X $.

### **Key Steps in Gaussian Naive Bayes**

1. **Estimate Parameters**:  
   For each class $ C_k $, calculate:
   - The mean $ \mu_k $ of each feature in class $ C_k $.
   - The variance $ \sigma_k^2 $ of each feature in class $ C_k $.

2. **Compute Likelihood**:  
   For each test point, compute the likelihood of each feature $ x_i $ given each class using the **Gaussian distribution**.

3. **Apply Bayes’ Theorem**:  
   Combine the prior probability of each class $ P(C_k) $, the likelihood of the features $ P(X | C_k) $, and the normalizing constant to compute the posterior probability $ P(C_k | X) $.

4. **Predict**:  
   Select the class $ C_k $ with the highest posterior probability.
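The four key steps translate almost line for line into NumPy. The sketch below is an illustration under stated assumptions, not scikit-learn's code: the function names and the one-feature toy data are made up, and the scores are computed in log space, a common numerical choice that the text above does not itself prescribe.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Step 1: per-class priors, feature means and feature variances."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])  # divides by N
    return classes, priors, means, variances

def log_scores(x, priors, means, variances):
    """Steps 2-3: log prior plus the sum of per-feature Gaussian log densities."""
    log_pdf = -0.5 * np.log(2 * np.pi * variances) - (x - means) ** 2 / (2 * variances)
    return np.log(priors) + log_pdf.sum(axis=1)

# Made-up data: one continuous feature, two well-separated classes
X = np.array([[1.0], [1.2], [0.8], [3.0], [3.2], [2.8]])
y = np.array([0, 0, 0, 1, 1, 1])
classes, priors, means, variances = fit_gaussian_nb(X, y)

scores = log_scores(np.array([1.1]), priors, means, variances)
prediction = classes[np.argmax(scores)]  # Step 4: pick the highest score
print(prediction)
```

Summing logs instead of multiplying densities gives the same ranking of classes while avoiding underflow when there are many features.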

### **When to Use Gaussian Naive Bayes**

1. **Continuous Features**:  
   Gaussian Naive Bayes is particularly useful when the features are continuous and you assume that they follow a **normal distribution**. For example, in applications where features such as height, weight, temperature, etc., are important, Gaussian Naive Bayes is a good fit.

2. **Large Datasets**:  
   Naive Bayes models, including Gaussian Naive Bayes, are computationally efficient and scale well with large datasets. This is especially true when the number of features is high.

3. **Simple and Fast Classification**:  
   Gaussian Naive Bayes is a **simple and fast** algorithm, especially when you need a baseline model or need to process data quickly for exploratory analysis. It can perform surprisingly well even when the assumptions of independence or normality are not fully met.

4. **When Assumptions Hold**:  
   It performs best when the continuous features in the dataset **roughly follow a Gaussian distribution**. If the features deviate significantly from normality, the performance might degrade.

### **Advantages of Gaussian Naive Bayes**

- **Fast to Train and Predict**: Since it makes strong assumptions (like feature independence and normality), it is computationally efficient, especially for large datasets.
- **Simple and Easy to Implement**: The algorithm is simple and interpretable, making it easy to implement and understand.
- **Handles Multi-Class Problems**: Naive Bayes can handle **multi-class classification** problems out of the box.

### **Disadvantages of Gaussian Naive Bayes**

- **Assumption of Feature Independence**: The "naive" assumption that all features are independent given the class is often unrealistic in real-world data, which can lead to suboptimal performance when this assumption is violated.
- **Sensitivity to Imbalanced Data**: Like many probabilistic models, Gaussian Naive Bayes can be sensitive to class imbalance, where it may favor the majority class.
- **Assumption of Gaussian Distribution**: If the features do not follow a Gaussian distribution, the model might not perform well, as it is based on that assumption.

### **When Not to Use Gaussian Naive Bayes**
- When the features **do not follow a Gaussian distribution**, or the distribution is highly skewed.
- When the **feature independence assumption** is clearly violated (e.g., when features are strongly correlated with each other).
- When you have **imbalanced data** and are particularly concerned with predicting the minority class.



## **Hands-on demo:**

## **Step 1: Define the Problem**
We will classify a new data point $ X = (5.5, 3.5) $ into one of two classes:  
- **Class 0**  
- **Class 1**  

Given a training dataset:

| Sample | Feature 1 (X1) | Feature 2 (X2) | Class (Y) |
|--------|--------------|--------------|------------|
| 1      | 5.1          | 3.5          | 0          |
| 2      | 4.9          | 3.0          | 0          |
| 3      | 4.7          | 3.2          | 0          |
| 4      | 4.6          | 3.1          | 0          |
| 5      | 5.0          | 3.6          | 1          |
| 6      | 5.4          | 3.9          | 1          |
| 7      | 4.8          | 3.4          | 1          |
| 8      | 5.2          | 3.7          | 1          |

We will classify $ X = (5.5, 3.5) $ using **Gaussian Naive Bayes**.


## **Step 2: Compute Class Priors** $ P(Y) $
We calculate the probability of each class occurring in the dataset:

$$
P(Y=0) = \frac{\text{Number of samples in Class 0}}{\text{Total number of samples}} = \frac{4}{8} = 0.5
$$

$$
P(Y=1) = \frac{\text{Number of samples in Class 1}}{\text{Total number of samples}} = \frac{4}{8} = 0.5
$$


## **Step 3: Compute Mean and Variance for Each Feature Per Class**
We calculate the mean $ \mu $ and variance $ \sigma^2 $ for **each feature per class**.

### **For Class 0**
#### Feature 1 (X1)
$$
\mu_{0,1} = \frac{5.1 + 4.9 + 4.7 + 4.6}{4} = \frac{19.3}{4} = 4.825
$$

$$
\sigma^2_{0,1} = \frac{(5.1 - 4.825)^2 + (4.9 - 4.825)^2 + (4.7 - 4.825)^2 + (4.6 - 4.825)^2}{4}
$$

$$
= \frac{(0.275)^2 + (0.075)^2 + (-0.125)^2 + (-0.225)^2}{4}
$$

$$
= \frac{0.0756 + 0.0056 + 0.0156 + 0.0506}{4} = \frac{0.1474}{4} = 0.03685
$$

#### Feature 2 (X2)
$$
\mu_{0,2} = \frac{3.5 + 3.0 + 3.2 + 3.1}{4} = \frac{12.8}{4} = 3.2
$$

$$
\sigma^2_{0,2} = \frac{(3.5 - 3.2)^2 + (3.0 - 3.2)^2 + (3.2 - 3.2)^2 + (3.1 - 3.2)^2}{4}
$$

$$
= \frac{(0.3)^2 + (-0.2)^2 + (0)^2 + (-0.1)^2}{4}
$$

$$
= \frac{0.09 + 0.04 + 0 + 0.01}{4} = \frac{0.14}{4} = 0.035
$$


### **For Class 1**
#### Feature 1 (X1)
$$
\mu_{1,1} = \frac{5.0 + 5.4 + 4.8 + 5.2}{4} = \frac{20.4}{4} = 5.1
$$

$$
\sigma^2_{1,1} = \frac{(5.0 - 5.1)^2 + (5.4 - 5.1)^2 + (4.8 - 5.1)^2 + (5.2 - 5.1)^2}{4}
$$

$$
= \frac{(-0.1)^2 + (0.3)^2 + (-0.3)^2 + (0.1)^2}{4}
$$

$$
= \frac{0.01 + 0.09 + 0.09 + 0.01}{4} = \frac{0.2}{4} = 0.05
$$

#### Feature 2 (X2)
$$
\mu_{1,2} = \frac{3.6 + 3.9 + 3.4 + 3.7}{4} = \frac{14.6}{4} = 3.65
$$

$$
\sigma^2_{1,2} = \frac{(3.6 - 3.65)^2 + (3.9 - 3.65)^2 + (3.4 - 3.65)^2 + (3.7 - 3.65)^2}{4}
$$

$$
= \frac{(-0.05)^2 + (0.25)^2 + (-0.25)^2 + (0.05)^2}{4}
$$

$$
= \frac{0.0025 + 0.0625 + 0.0625 + 0.0025}{4} = \frac{0.13}{4} = 0.0325
$$
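The per-class means and variances can be checked with NumPy. This is a minimal sketch that retypes the eight training rows by hand. `var` is called with its default `ddof=0`, which divides by $N$ as the hand calculation does.

```python
import numpy as np

X = np.array([[5.1, 3.5], [4.9, 3.0], [4.7, 3.2], [4.6, 3.1],
              [5.0, 3.6], [5.4, 3.9], [4.8, 3.4], [5.2, 3.7]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

stats = {}
for c in (0, 1):
    mu = X[y == c].mean(axis=0)    # mean of each feature within class c
    var = X[y == c].var(axis=0)    # population variance (ddof=0)
    stats[c] = (mu, var)
    print(f"class {c}: mean={mu}, variance={var}")
```

The exact variance of Feature 1 in Class 0 is 0.036875. Summing squared deviations that were first rounded to four decimals gives 0.03685, a difference too small to affect the final classification.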

## **Step 4: Compute Likelihoods Using the Gaussian Formula**

The Gaussian probability density function (PDF) is:

$$
P(X_i | Y) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(X_i - \mu)^2}{2\sigma^2}}
$$

### **For Class 0**
We have:
- **Feature 1 (X1)**: $ \mu_{0,1} = 4.825 $, $ \sigma^2_{0,1} = 0.036875 $
- **Feature 2 (X2)**: $ \mu_{0,2} = 3.2 $, $ \sigma^2_{0,2} = 0.035 $

#### **Compute $ P(5.5 | Y=0) $**
$$
P(5.5 | Y=0) = \frac{1}{\sqrt{2 \pi (0.036875)}} e^{-\frac{(5.5 - 4.825)^2}{2(0.036875)}}
$$
$$
= \frac{1}{\sqrt{0.2317}} e^{-\frac{(0.675)^2}{0.07375}}
$$
$$
= \frac{1}{0.4813} e^{-6.178}
$$
$$
= 2.078 e^{-6.178}
$$
$$
= 2.078 \times 0.00207 = 0.0043
$$

#### **Compute $ P(3.5 | Y=0) $**
$$
P(3.5 | Y=0) = \frac{1}{\sqrt{2 \pi (0.035)}} e^{-\frac{(3.5 - 3.2)^2}{2(0.035)}}
$$
$$
= \frac{1}{\sqrt{0.2199}} e^{-\frac{(0.3)^2}{0.07}}
$$
$$
= \frac{1}{0.469} e^{-1.286}
$$
$$
= 2.131 e^{-1.286}
$$
$$
= 2.131 \times 0.276 = 0.588
$$

### **For Class 1**
We have:
- **Feature 1 (X1)**: $ \mu_{1,1} = 5.1 $, $ \sigma^2_{1,1} = 0.05 $
- **Feature 2 (X2)**: $ \mu_{1,2} = 3.65 $, $ \sigma^2_{1,2} = 0.0325 $

#### **Compute $ P(5.5 | Y=1) $**
$$
P(5.5 | Y=1) = \frac{1}{\sqrt{2 \pi (0.05)}} e^{-\frac{(5.5 - 5.1)^2}{2(0.05)}}
$$
$$
= \frac{1}{\sqrt{0.314}} e^{-\frac{(0.4)^2}{0.1}}
$$
$$
= \frac{1}{0.56} e^{-1.6}
$$
$$
= 1.785 e^{-1.6}
$$
$$
= 1.785 \times 0.201 = 0.359
$$

#### **Compute $ P(3.5 | Y=1) $**
$$
P(3.5 | Y=1) = \frac{1}{\sqrt{2 \pi (0.0325)}} e^{-\frac{(3.5 - 3.65)^2}{2(0.0325)}}
$$
$$
= \frac{1}{\sqrt{0.204}} e^{-\frac{(-0.15)^2}{0.065}}
$$
$$
= \frac{1}{0.451} e^{-0.346}
$$
$$
= 2.216 e^{-0.346}
$$
$$
= 2.216 \times 0.707 = 1.566
$$
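
The four likelihoods can be checked numerically. The sketch below assumes the per-class means and population variances from Step 3 and a query point $X = (5.5, 3.5)$; `gaussian_pdf` is a hypothetical helper name. Note that these values are densities, so a result above 1 is not an error.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance, evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# (mean, variance) per class, one pair per feature, taken from Step 3
params = {
    0: [(4.825, 0.036875), (3.2, 0.035)],
    1: [(5.1, 0.05), (3.65, 0.0325)],
}
x_new = (5.5, 3.5)

for c, feature_params in params.items():
    for x, (mu, var) in zip(x_new, feature_params):
        print(f"p({x} | Y={c}) = {gaussian_pdf(x, mu, var):.4f}")
```

Small differences in the third decimal place compared with the hand calculation come from rounding the intermediate values.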

## **Step 5: Compute Posterior Probability**
Using **Bayes' Theorem**, we compute:

$$
P(Y | X) \propto P(X | Y) P(Y)
$$

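The denominator $ P(X) $ is the same for every class, so we can ignore it when comparing classes. Here $ P(Y=0) = P(Y=1) = 0.5 $, so the priors do not change the ranking either.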

#### **Compute Score for Class 0**
$$
P(5.5 | Y=0) \times P(3.5 | Y=0) \times P(Y=0)
$$

$$
= 0.0043 \times 0.588 \times 0.5
$$

$$
= 0.00126
$$

#### **Compute Score for Class 1**
$$
P(5.5 | Y=1) \times P(3.5 | Y=1) \times P(Y=1)
$$

$$
= 0.359 \times 1.566 \times 0.5
$$

$$
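= 0.2811
$$


## **Step 6: Choose the Class with the Highest Score**
- **Class 0 Score** = **0.00126**
- **Class 1 Score** = **0.2811**

Since **Class 1 has a much higher score**, we classify $ X = (5.5, 3.5) $ as **Class 1**. These scores are unnormalized, and the Gaussian likelihoods are densities rather than probabilities, which is why a single factor such as 1.566 can exceed 1.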
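
To turn the scores into posterior probabilities that sum to 1, divide each score by their sum (this restores the denominator $P(X)$ that was dropped). A minimal sketch, assuming the Step 3 parameters and equal priors; the helper and variable names are illustrative:

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance, evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

priors = {0: 0.5, 1: 0.5}
params = {
    0: [(4.825, 0.036875), (3.2, 0.035)],   # (mean, variance) for X1, X2 in class 0
    1: [(5.1, 0.05), (3.65, 0.0325)],       # (mean, variance) for X1, X2 in class 1
}
x_new = (5.5, 3.5)

# Unnormalized score: prior times the product of per-feature densities
scores = {}
for c in priors:
    likelihood = math.prod(gaussian_pdf(x, mu, var) for x, (mu, var) in zip(x_new, params[c]))
    scores[c] = priors[c] * likelihood

# Normalizing by the sum of the scores gives posteriors that add up to 1
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}
predicted = max(scores, key=scores.get)

print(scores)
print(posteriors)
print("predicted class:", predicted)
```

Normalizing does not change which class wins, so it is only needed when calibrated-looking probabilities are wanted in addition to the label.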

### sklearn `GaussianNB`

In [20]:
# example 1: Check with sklearn's GaussianNB model
import pandas as pd
from sklearn.naive_bayes import GaussianNB

data = {
    "Sample": [1, 2, 3, 4, 5, 6, 7, 8],
    "Feature1_X1": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.8, 5.2],
    "Feature2_X2": [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.7],
    "Class_Y": [0, 0, 0, 0, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
df.head(10)

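|   | Sample | Feature1_X1 | Feature2_X2 | Class_Y |
|---|--------|-------------|-------------|---------|
| 0 | 1      | 5.1         | 3.5         | 0       |
| 1 | 2      | 4.9         | 3.0         | 0       |
| 2 | 3      | 4.7         | 3.2         | 0       |
| 3 | 4      | 4.6         | 3.1         | 0       |
| 4 | 5      | 5.0         | 3.6         | 1       |
| 5 | 6      | 5.4         | 3.9         | 1       |
| 6 | 7      | 4.8         | 3.4         | 1       |
| 7 | 8      | 5.2         | 3.7         | 1       |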


In [22]:
X = df[['Feature1_X1', 'Feature2_X2']]
y = df['Class_Y']

gnb = GaussianNB()
gnb.fit(X, y)

x_new = pd.DataFrame({
    "Feature1_X1": [5.5],
    "Feature2_X2": [3.5]
})

gnb.predict(x_new)

array([1])

In [23]:
gnb.class_prior_

array([0.5, 0.5])

In [24]:
gnb.classes_

array([0, 1])

In [25]:
gnb.class_count_

array([4., 4.])

In [27]:
gnb.var_

array([[0.036875, 0.035   ],
       [0.05    , 0.0325  ]])

In [28]:
gnb.theta_

array([[4.825, 3.2  ],
       [5.1  , 3.65 ]])

In [29]:
# example 2: Classify Iris dataset
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
import numpy as np

# Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split into training & testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Training set shape: (120, 4)
Test set shape: (30, 4)


In [31]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

In [32]:
y_pred

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0])

In [33]:
y_test

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0])
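
Comparing the two arrays by eye is error-prone, so here is a small sketch that computes the accuracy directly. The lists are copied from the outputs above; `sklearn.metrics.accuracy_score(y_test, y_pred)` performs the same calculation.

```python
# Predicted and true labels copied from the two cell outputs above
y_pred = [1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
          0, 2, 2, 2, 2, 2, 0, 0]
y_test = [1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
          0, 2, 2, 2, 2, 2, 0, 0]

# Accuracy = fraction of test samples whose prediction matches the true label
correct = sum(p == t for p, t in zip(y_pred, y_test))
accuracy = correct / len(y_test)

print(f"{correct}/{len(y_test)} correct, accuracy = {accuracy:.2f}")
```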

## **Multinomial Naive Bayes (MultinomialNB)**

Multinomial Naive Bayes (MNB) is a variant of the Naive Bayes algorithm that is designed for **discrete count data**. It is widely used for **text classification**, where features represent word frequencies or term counts in documents.


#### **How Multinomial Naive Bayes Works**  
Unlike Gaussian Naive Bayes, which assumes continuous features following a normal distribution, MNB works by modeling the likelihood of features using the **multinomial distribution**.

Given a document whose word occurrences are $ X = (X_1, X_2, \dots, X_n) $ (a word that appears $ k $ times contributes $ k $ factors to the product), the probability of class $ Y = c $ is given by:

$$
P(Y = c | X) \propto P(Y = c) \prod_{i=1}^{n} P(X_i | Y = c)
$$

where:

$$
P(X_i | Y = c) = \frac{\text{Count}(X_i, c) + \alpha}{\sum_j \text{Count}(X_j, c) + \alpha V}
$$

- **$\text{Count}(X_i, c)$** = number of times feature $ X_i $ appears in the training samples of class $ c $
- **$V$** = total number of unique features (vocabulary size in text classification)
- **$\alpha$** = a smoothing parameter that avoids zero probabilities for features never seen in a class; $ \alpha = 1 $ is called Laplace smoothing
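
The smoothed estimate is easy to express as a function. The sketch below is illustrative: `smoothed_likelihood` is a hypothetical helper, and the numbers (10 word occurrences in a class, a 16-word vocabulary) are made up for the demonstration.

```python
def smoothed_likelihood(word_count, total_count, vocab_size, alpha=1.0):
    """Additive-smoothed estimate of P(word | class)."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# Hypothetical class with 10 word occurrences and a vocabulary of 16 words
seen_twice = smoothed_likelihood(2, 10, 16)            # (2 + 1) / (10 + 16)
unseen = smoothed_likelihood(0, 10, 16)                # (0 + 1) / (10 + 16), small but not zero
unsmoothed = smoothed_likelihood(0, 10, 16, alpha=0.0) # without smoothing the estimate is zero

print(seen_twice, unseen, unsmoothed)
```

Without smoothing, a single unseen word would force the whole product for that class to zero, no matter how strongly the other words point to it.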

🚫 **Not suitable for continuous numerical data** (e.g., height, weight, or sensor data).  

### **Spam Detection Using Multinomial Naïve Bayes (Step-by-Step by Hand, No Code)**  

Multinomial Naïve Bayes is widely used for text classification, such as **spam detection**. It works by counting the frequency of words in emails and using **Bayes’ Theorem** to classify an email as **Spam or Not Spam**.


### **Given Data (Training Emails)**  
We will use a small dataset of **5 emails** for training.  

| Email # | Text                           | Label |
|---------|--------------------------------|-------|
| 1       | "Win money now"               | Spam  |
| 2       | "Earn cash fast"              | Spam  |
| 3       | "Hello, let’s meet today"     | Not Spam  |
| 4       | "Win a free lottery ticket"  | Spam  |
| 5       | "Meeting schedule update"    | Not Spam  |


### **Step 1: Vocabulary Extraction**  
We extract the unique words from all emails:  

**Vocabulary =** {win, money, now, earn, cash, fast, hello, let’s, meet, today, free, lottery, ticket, meeting, schedule, update}

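Each word will be counted separately in Spam and Not Spam emails. The article "a" in email 4 is treated as a stop word and left out, which gives a vocabulary of 16 words.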

### **Step 2: Count Word Frequencies**
We count how many times each word appears in **Spam** and **Not Spam** emails.

| Word      | Spam Count | Not Spam Count |
|-----------|------------|---------------|
| win       | 2          | 0             |
| money     | 1          | 0             |
| now       | 1          | 0             |
| earn      | 1          | 0             |
| cash      | 1          | 0             |
| fast      | 1          | 0             |
| hello     | 0          | 1             |
| let’s     | 0          | 1             |
| meet      | 0          | 1             |
| today     | 0          | 1             |
| free      | 1          | 0             |
| lottery   | 1          | 0             |
| ticket    | 1          | 0             |
| meeting   | 0          | 1             |
| schedule  | 0          | 1             |
| update    | 0          | 1             |


### **Step 3: Compute Probabilities**
#### **(1) Prior Probabilities**
We calculate the probability of an email being **Spam** or **Not Spam** based on the training data.

$$
P(\text{Spam}) = \frac{\text{Number of Spam Emails}}{\text{Total Emails}} = \frac{3}{5} = 0.6
$$

$$
P(\text{NotSpam}) = \frac{\text{Number of Not Spam Emails}}{\text{Total Emails}} = \frac{2}{5} = 0.4
$$

#### **(2) Likelihood (Word Probabilities)**
Using **Laplace Smoothing** ($ +1 $ to all counts to avoid zero probabilities):

$$
P(\text{word} | \text{Spam}) = \frac{\text{Word Count in Spam} + 1}{\text{Total Words in Spam} + \text{Vocabulary Size}}
$$

$$
P(\text{word} | \text{NotSpam}) = \frac{\text{Word Count in Not Spam} + 1}{\text{Total Words in Not Spam} + \text{Vocabulary Size}}
$$

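Summing the columns of the table in Step 2 gives:  
- Total words in **Spam** emails: **10**  
- Total words in **Not Spam** emails: **7**  
- Vocabulary size: **16**  

For example:

$$
P(\text{win} | \text{Spam}) = \frac{2 + 1}{10 + 16} = \frac{3}{26} = 0.115
$$

$$
P(\text{win} | \text{NotSpam}) = \frac{0 + 1}{7 + 16} = \frac{1}{23} = 0.0435
$$

We repeat this for all words. For the email classified below we also need "cash" and "fast", each of which appears once in Spam and never in Not Spam:

$$
P(\text{cash} | \text{Spam}) = P(\text{fast} | \text{Spam}) = \frac{1 + 1}{10 + 16} = \frac{2}{26} = 0.077
$$

$$
P(\text{cash} | \text{NotSpam}) = P(\text{fast} | \text{NotSpam}) = \frac{0 + 1}{7 + 16} = \frac{1}{23} = 0.0435
$$


### **Step 4: Classify a New Email**
Let's classify:  
**"Win cash fast"**  

Using **Naïve Bayes**, we compute:

#### **(1) Compute Probability for Spam**
$$
P(\text{Spam} | \text{"Win cash fast"}) \propto P(\text{Spam}) \times P(\text{win} | \text{Spam}) \times P(\text{cash} | \text{Spam}) \times P(\text{fast} | \text{Spam})
$$

$$
= 0.6 \times 0.115 \times 0.077 \times 0.077 \approx 4.1 \times 10^{-4}
$$

#### **(2) Compute Probability for Not Spam**
$$
P(\text{NotSpam} | \text{"Win cash fast"}) \propto P(\text{NotSpam}) \times P(\text{win} | \text{NotSpam}) \times P(\text{cash} | \text{NotSpam}) \times P(\text{fast} | \text{NotSpam})
$$

$$
= 0.4 \times 0.0435 \times 0.0435 \times 0.0435 \approx 3.3 \times 10^{-5}
$$


### **Step 5: Compare Probabilities**
Since $ 4.1 \times 10^{-4} > 3.3 \times 10^{-5} $, we have $ P(\text{Spam} | \text{"Win cash fast"}) > P(\text{NotSpam} | \text{"Win cash fast"}) $, so  
we classify this email as **SPAM**.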


### **Summary**
- **Step 1:** Extract vocabulary.  
- **Step 2:** Count word frequencies in Spam & Not Spam emails.  
- **Step 3:** Compute **prior probabilities** & **word likelihoods**.  
- **Step 4:** Compute probabilities for a new email using Bayes’ Theorem.  
- **Step 5:** Compare and classify!

### `MultinomialNB` for Spam Classification
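
Before moving to the full dataset, the five-email example can be reproduced in plain Python as a sanity check. This is a sketch under stated assumptions: the emails are lowercased with punctuation removed by hand, the article "a" is dropped as a stop word, and `tokenize` and `log_score` are hypothetical helper names. Scores are accumulated in log space, a common choice to avoid underflow when many small probabilities are multiplied.

```python
import math
from collections import Counter

# The five training emails, lowercased and with punctuation removed by hand
emails = [
    ("win money now", "Spam"),
    ("earn cash fast", "Spam"),
    ("hello let's meet today", "Not Spam"),
    ("win a free lottery ticket", "Spam"),
    ("meeting schedule update", "Not Spam"),
]
STOP_WORDS = {"a"}  # assumption: "a" is excluded from the vocabulary

def tokenize(text):
    """Split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

# Count words per class and emails per class
word_counts = {}
doc_counts = Counter()
for text, label in emails:
    word_counts.setdefault(label, Counter()).update(tokenize(text))
    doc_counts[label] += 1

vocabulary = set().union(*word_counts.values())
priors = {label: n / len(emails) for label, n in doc_counts.items()}

def log_score(words, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with additive smoothing."""
    counts = word_counts[label]
    total = sum(counts.values())
    score = math.log(priors[label])
    for w in words:
        score += math.log((counts[w] + alpha) / (total + alpha * len(vocabulary)))
    return score

message = tokenize("Win cash fast")
log_scores = {label: log_score(message, label) for label in priors}
prediction = max(log_scores, key=log_scores.get)

print("vocabulary size:", len(vocabulary))
print({label: sum(c.values()) for label, c in word_counts.items()})
print({label: math.exp(s) for label, s in log_scores.items()})
print("prediction:", prediction)
```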

In [1]:
import pandas as pd
import numpy as np
# import nltk
# import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
df = pd.read_csv("data/spam.csv", encoding="latin-1")

# Select only the necessary columns
df = df[['Category', 'Message']]
df.columns = ['label', 'text']  # Rename columns

# Convert labels to numeric: 'spam' -> 1, 'ham' -> 0
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Text preprocessing function
# nltk.download('stopwords')
# from nltk.corpus import stopwords

# def clean_text(text):
#     text = text.lower()  # Convert to lowercase
#     text = "".join([char for char in text if char not in string.punctuation])  # Remove punctuation
#     words = text.split()  # Tokenization
#     words = [word for word in words if word not in stopwords.words('english')]  # Remove stopwords
#     return " ".join(words)

# Apply text cleaning
# df['clean_text'] = df['text'].apply(clean_text)

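# Convert text to numerical features using TF-IDF
# Note: the vectorizer is fitted on the full dataset before the split, so IDF
# statistics from the test messages leak into training. For a stricter
# evaluation, split first and fit the vectorizer on the training text only.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])  # Feature matrix
y = df['label']  # Target variable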

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Multinomial Naïve Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Function to predict new messages
def predict_message(msg):
    # msg_clean = clean_text(msg)  # Preprocess message
    msg_vectorized = vectorizer.transform([msg])  # Convert to TF-IDF vector
    prediction = model.predict(msg_vectorized)[0]
    return "Spam" if prediction == 1 else "Ham"

# Test with a new message
new_message = "Congra wotulations! Youn a free lottery. Click here to claim your prize!"
print(f"Message: {new_message} → Prediction: {predict_message(new_message)}")

Accuracy: 0.9650

Classification Report:
               precision    recall  f1-score   support

           0       0.96      1.00      0.98       966
           1       1.00      0.74      0.85       149

    accuracy                           0.97      1115
   macro avg       0.98      0.87      0.91      1115
weighted avg       0.97      0.97      0.96      1115

Message: Congra wotulations! Youn a free lottery. Click here to claim your prize! → Prediction: Spam
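
The script above fits the vectorizer on all messages before splitting. One way to keep the vectorizer inside the training step is scikit-learn's `make_pipeline`, sketched below. Since `data/spam.csv` is not reproduced here, the sketch uses the five toy emails from the hand-worked example, and it swaps TF-IDF for raw counts (`CountVectorizer`) so the result can be compared with the hand calculation; both choices are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training set from the hand-worked example
train_texts = [
    "Win money now",
    "Earn cash fast",
    "Hello, let's meet today",
    "Win a free lottery ticket",
    "Meeting schedule update",
]
train_labels = ["Spam", "Spam", "Not Spam", "Spam", "Not Spam"]

# The pipeline fits the vectorizer only on the data passed to fit(),
# so statistics from held-out text never reach the model.
clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(train_texts, train_labels)

message = ["Win cash fast"]
predicted_label = clf.predict(message)[0]
probabilities = dict(zip(clf.classes_, clf.predict_proba(message)[0]))

print(predicted_label)
print(probabilities)
```

With a real dataset, the same pipeline object can be passed to `train_test_split` output or to `cross_val_score`, and the vectorizer is then refitted on each training portion automatically.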


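[Bernoulli Naive Bayes explained: a visual guide with code examples for beginners (Towards Data Science)](https://medium.com/towards-data-science/bernoulli-naive-bayes-explained-a-visual-guide-with-code-examples-for-beginners-aec39771ddd6)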

