# Supervised Learning: Naive Bayes Classifiers

In this lesson, we'll explore:
1. **Naive Bayes** fundamentals and mathematical foundation.
2. Different **Naive Bayes variants** in scikit-learn (Gaussian, Multinomial, Bernoulli, Complement, Categorical), and why/when you'd use each.
3. **Mathematical theory** behind each variant.
4. **Pipelines and ColumnTransformers** in a real example.
5. **Worked Python examples** for each NB variant.

By the end, you'll understand how NB works, see code demos, and learn how to apply each variant effectively.

## 1. How Naive Bayes Works

### 1.1 Bayes' Rule
Recall **Bayes' Theorem**:
```
P(y|x) = P(x|y) * P(y) / P(x)
```
We pick the class y that **maximizes** P(y|x). Since P(x) is constant for all classes, we effectively want:
```
argmax_y [P(y) * P(x|y)]
```

### 1.2 "Naive" Conditional Independence
We assume each feature x_i is conditionally independent given y:
```
P(x|y) = product over i of P(x_i | y)
```
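This greatly simplifies the computation: instead of estimating one joint distribution over all features, we estimate one simple distribution per feature per class.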
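
One practical consequence of multiplying many per-feature probabilities is numerical underflow. The sketch below uses made-up likelihood values (not estimated from any dataset) to show why implementations usually sum log-probabilities instead of multiplying raw probabilities.

```python
import numpy as np

# Hypothetical per-feature likelihoods P(x_i | y) for one class (made-up values).
rng = np.random.default_rng(0)
likelihoods = rng.uniform(1e-4, 1e-2, size=500)

# Direct product: every factor is at most 1e-2, so the result is far below
# the smallest positive float64 and underflows to 0.0.
direct_product = np.prod(likelihoods)

# Sum of logs: the same quantity on a log scale, which stays finite.
log_likelihood = np.sum(np.log(likelihoods))

print("Direct product:", direct_product)
print("Sum of logs:   ", log_likelihood)
```

Because the logarithm is monotonic, comparing classes by their summed log-probabilities selects the same class as comparing the products would.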

### 1.3 Intuitive Example
Imagine a spam filter with two binary features: whether an email contains "lottery" and whether it contains "free". We assume the two features are independent given the class (spam or not-spam).
```
P(spam | lottery, free) ~ P(spam) * P(lottery|spam) * P(free|spam)
```
We compute the same score for not-spam and predict whichever class scores higher.
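
The comparison can be worked through with concrete numbers. All probabilities below are hypothetical, chosen only to illustrate the arithmetic; in a real filter they would be estimated from labeled emails.

```python
# Hypothetical priors and per-word likelihoods (illustrative values only).
p_class = {"spam": 0.4, "not_spam": 0.6}
p_lottery = {"spam": 0.30, "not_spam": 0.01}   # P(lottery | class)
p_free = {"spam": 0.50, "not_spam": 0.10}      # P(free | class)

# Unnormalized scores: P(class) * P(lottery | class) * P(free | class)
scores = {c: p_class[c] * p_lottery[c] * p_free[c] for c in p_class}

# Dividing by the evidence P(x) turns the scores into posteriors.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

predicted = max(scores, key=scores.get)
print(scores)
print(posteriors)
print("Predicted class:", predicted)  # spam (posterior about 0.990)
```

The evidence term only rescales the scores, so the predicted class is already determined by the unnormalized products.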

## 2. Different Naive Bayes Variants
Scikit-learn provides:

1. **GaussianNB**: For continuous features, assuming they're Gaussian.
2. **MultinomialNB**: For count features (like word counts in text).
3. **BernoulliNB**: For binary features (0/1 presence).
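4. **ComplementNB**: A variant of MultinomialNB that estimates each class's parameters from the data of all *other* classes (the complement), designed for imbalanced data.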
5. **CategoricalNB**: For purely categorical features.

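Each variant **assumes** a different model for the features given the class:
- Gaussian -> each feature is normally distributed within a class.
- Multinomial -> features are discrete counts.
- Bernoulli -> features are binary (present/absent).
- Complement -> same count features as Multinomial, but parameters are estimated from the complement of each class, which helps with class imbalance.
- Categorical -> each feature takes one of a fixed set of categories.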


| **Naive Bayes Model** | **Use Case** | **Assumption** |
|-----------------------|--------------|----------------|
| **GaussianNB** | Continuous numeric data | Features follow a Gaussian distribution |
| **MultinomialNB** | Discrete counts (e.g. word counts) | Non-negative integer features, typical for text |
| **BernoulliNB** | Binary features (0/1) | Bernoulli distribution, presence/absence of features |
| **ComplementNB** | Count features with imbalanced classes | Same inputs as MultinomialNB; parameters estimated from each class's complement |
| **CategoricalNB** | All categorical features | Each feature follows a categorical distribution; numeric features must be discretized first |

The models differ only in how they estimate \(P(x_i | y)\) for each feature \(x_i\).

## 3. Key Features & Assumptions
- **Fast Prediction**: NB typically has few parameters and is computationally efficient.
- **High-dimensional** data: often used in text classification.
- **Naive** independence: Assumes features are independent given the class. This rarely holds, but NB often works well in practice anyway.
- **Feature distributions**: Must match the variant's assumptions (Gaussian vs. count vs. binary vs. categorical).
- **No missing data**: scikit-learn's NB estimators do not accept missing values, so rows with NaNs must be dropped or imputed first.
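
If dropping rows is not an option, one common approach is to impute before the classifier. The sketch below is an assumption-laden illustration (tiny made-up array, median imputation), not a recommendation for any particular dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Made-up data with missing entries (np.nan).
X_missing = np.array([
    [1.0, 2.0],
    [np.nan, 2.5],
    [1.2, np.nan],
    [5.0, 8.0],
    [5.5, np.nan],
    [np.nan, 7.5],
])
y_missing = np.array([0, 0, 0, 1, 1, 1])

# Fill each column's NaNs with that column's median, then fit GaussianNB.
impute_then_nb = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("nb", GaussianNB()),
])
impute_then_nb.fit(X_missing, y_missing)

# New rows may also contain NaNs; the fitted imputer fills them the same way.
X_new = np.array([[np.nan, 2.2], [5.2, np.nan]])
pred_missing = impute_then_nb.predict(X_new)
print(pred_missing)
```

Keeping the imputer inside the pipeline means its medians are learned from the training data only.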

## 4. Pipelines and ColumnTransformers with a Real Example
We'll use the **Penguins** dataset from seaborn. It has numeric columns (bill_length_mm, etc.) and categorical columns (island, sex). We'll predict `species`.
We'll demonstrate a pipeline with a ColumnTransformer to handle numeric vs. categorical data, then feed into NB classifiers.

In [None]:
!pip install seaborn --quiet
import seaborn as sns
import pandas as pd
import numpy as np

penguins = sns.load_dataset('penguins')
penguins.dropna(inplace=True)
print('Shape after dropping NAs:', penguins.shape)
penguins.head()

Shape after dropping NAs: (333, 7)


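|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |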


Split the data into features `X` and target `y`, then into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split

X = penguins.drop(columns=['species'])
y = penguins['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=125, stratify=y)
X_train.shape, X_test.shape

((233, 6), (100, 6))

### ColumnTransformer
We have numeric columns (`bill_length_mm`, etc.) and categorical columns (`island`, `sex`). We'll scale numeric data and one-hot-encode categorical data. Then we feed it into a chosen NB classifier.

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

num_cols = ['bill_length_mm','bill_depth_mm','flipper_length_mm','body_mass_g']
cat_cols = ['island','sex']

numeric_transformer = Pipeline([
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('ohe', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, num_cols),
    ('cat', categorical_transformer, cat_cols)
])
preprocessor

## 5. Trying Different NB Variants
We'll build pipelines for each NB variant, see which is appropriate, and measure accuracy.

In [None]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB, CategoricalNB
from sklearn.metrics import accuracy_score

def build_pipeline(nb_model):
    return Pipeline([
        ('preprocess', preprocessor),
        ('nb', nb_model)
    ])

classifiers = {
    'GaussianNB': GaussianNB(),
    'MultinomialNB': MultinomialNB(),
    'BernoulliNB': BernoulliNB(),
    'ComplementNB': ComplementNB()
    # CategoricalNB is omitted here: it needs categorical inputs,
    # so numeric features must first be binned into ranges (see 5.1).
}

results = {}
for name, nb_clf in classifiers.items():
    pipe = build_pipeline(nb_clf)
    try:
        pipe.fit(X_train, y_train)
        acc = pipe.score(X_test, y_test)
        results[name] = acc
    except Exception as e:
        results[name] = f'Error: {e}'

results

{'GaussianNB': 0.72,
 'MultinomialNB': 'Error: Negative values in data passed to MultinomialNB (input X).',
 'BernoulliNB': 0.97,
 'ComplementNB': 'Error: Negative values in data passed to ComplementNB (input X).'}

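Some NB variants fail or underperform when the preprocessed data does not match their assumptions. MultinomialNB and ComplementNB expect non-negative values, and `StandardScaler` produces negative ones, hence the errors above. BernoulliNB does not fail because it binarizes every feature at its default threshold (`binarize=0.0`), so each standardized feature becomes an above/below-mean indicator. GaussianNB is the natural choice for continuous features, but here it is also applied to the one-hot columns, which are far from Gaussian; that may explain its lower score on this split.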
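
One way to satisfy MultinomialNB's non-negativity requirement is to swap `StandardScaler` for `MinMaxScaler`. The sketch below uses synthetic numeric data to show the error and the workaround. Note the assumption being stretched: rescaled measurements are still not counts, so this only makes the model fit without error and says nothing about how well it will classify.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic continuous features (illustrative only).
rng = np.random.default_rng(0)
X_num = np.vstack([
    rng.normal(40.0, 3.0, size=(30, 2)),
    rng.normal(50.0, 3.0, size=(30, 2)),
])
y_num = np.array([0] * 30 + [1] * 30)

# StandardScaler centres each column at 0, so MultinomialNB rejects the output.
try:
    Pipeline([("scale", StandardScaler()), ("nb", MultinomialNB())]).fit(X_num, y_num)
    standard_scaler_failed = False
except ValueError as err:
    standard_scaler_failed = True
    print("StandardScaler + MultinomialNB:", err)

# MinMaxScaler maps training values into [0, 1]; clip=True keeps unseen
# values in that range too, so they cannot turn negative at predict time.
minmax_nb = Pipeline([
    ("scale", MinMaxScaler(clip=True)),
    ("nb", MultinomialNB()),
])
minmax_nb.fit(X_num, y_num)
minmax_pred = minmax_nb.predict(X_num[:5])
print("MinMaxScaler + MultinomialNB fits; first predictions:", minmax_pred)
```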


### 5.1 Trying CategoricalNB (with Discretization)
If we want to try **CategoricalNB**, we must turn our numeric features into categories (e.g., via KBinsDiscretizer).

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')

cat_pipeline = ColumnTransformer([
    ('num_disc', disc, num_cols),
    ('cat_ohe', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), cat_cols)
])

cat_nb_model = Pipeline([
    ('preprocess', cat_pipeline),
    ('catnb', CategoricalNB())
])

cat_nb_model.fit(X_train, y_train)
acc_catnb = cat_nb_model.score(X_test, y_test)
acc_catnb

0.97
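
To see what the discretization step hands to CategoricalNB, here is a small sketch on made-up values: `KBinsDiscretizer` with `encode='ordinal'` replaces each number with the index of the equal-width bin it falls into, and CategoricalNB then treats those indices as category codes. Three bins are used instead of ten only to keep the output readable.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer

# Made-up measurements for a single feature.
values = np.array([[30.0], [35.0], [40.0], [45.0], [50.0], [60.0]])
value_labels = np.array([0, 0, 0, 1, 1, 1])

binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
codes = binner.fit_transform(values)

print(binner.bin_edges_[0])  # [30. 40. 50. 60.]
print(codes.ravel())         # [0. 0. 1. 1. 2. 2.]

# CategoricalNB estimates one probability per (class, bin) pair.
bin_nb = CategoricalNB().fit(codes, value_labels)
new_code = binner.transform(np.array([[33.0]]))
bin_pred = bin_nb.predict(new_code)
print(bin_pred)              # [0]
```

The number of bins is a tuning choice: more bins preserve more detail but leave fewer training samples per bin.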

### 5.2 MultinomialNB Example
MultinomialNB is ideal for count data. In text classification, features typically represent counts or frequencies of words in documents. Below is a small demonstration using scikit-learn’s CountVectorizer and MultinomialNB. With only seven documents, the reported accuracy is illustrative, not a meaningful estimate of performance.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import numpy as np

# 1) Create a small text dataset
texts = [
    "free offer for you",             # spam
    "limited time discount",          # spam
    "meet me for lunch",             # not spam
    "how about a movie tonight",      # not spam
    "discount sale free free offer",  # spam
    "let me know about the updates",  # not spam
    "free gift only today"            # spam
]
labels = ["spam", "spam", "not_spam", "not_spam", "spam", "not_spam", "spam"]

# 2) Convert text to integer counts with CountVectorizer
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(texts)

# 3) Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X_counts, labels, test_size=0.3, random_state=42, stratify=labels
)

# 4) Fit MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# 5) Evaluate
acc_multinomial = mnb.score(X_test, y_test)
print("MultinomialNB Accuracy:", acc_multinomial)

# 6) Predict a new text
new_text = ["Im free tonight"]
new_counts = vectorizer.transform(new_text)
prediction = mnb.predict(new_counts)
print("New text prediction (MultinomialNB):", prediction)


MultinomialNB Accuracy: 1.0
New text prediction (MultinomialNB): ['spam']


### 5.3 ComplementNB Example
ComplementNB is a variation of MultinomialNB designed to handle imbalanced data better by focusing on the complement of each class. It is also suited for count-based features. Below is a similar example, using the same data, but substituting ComplementNB for the classifier:

In [None]:
from sklearn.naive_bayes import ComplementNB

# Using the same 'texts' and 'labels' from above,
# and the same vectorizer approach (CountVectorizer).

# 1) Convert text to counts (already done above):
# X_counts = vectorizer.fit_transform(texts)

# 2) Train/Test split (using the same code as above):
# X_train, X_test, y_train, y_test = train_test_split(...)

# 3) Fit ComplementNB
cnb = ComplementNB()
cnb.fit(X_train, y_train)

# 4) Evaluate
acc_complement = cnb.score(X_test, y_test)
print("ComplementNB Accuracy:", acc_complement)

# 5) Predict on the same new text
prediction_cnb = cnb.predict(new_counts)  # 'new_counts' from earlier
print("New text prediction (ComplementNB):", prediction_cnb)

ComplementNB Accuracy: 1.0
New text prediction (ComplementNB): ['spam']
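
The "complement" idea can be illustrated with plain NumPy. The sketch below uses a made-up count matrix and an assumed smoothing value `alpha = 1.0`; it shows only the core estimate (word frequencies taken from all *other* classes) and does not reproduce every detail of scikit-learn's ComplementNB, such as its optional weight normalization.

```python
import numpy as np

# Made-up word-count matrix: rows are documents, columns are words.
counts = np.array([
    [3, 0, 1],   # class 0
    [2, 1, 0],   # class 0
    [4, 0, 0],   # class 0
    [0, 2, 3],   # class 1 (minority class, a single document)
])
doc_labels = np.array([0, 0, 0, 1])
alpha = 1.0  # smoothing strength (assumed for this illustration)

classes = np.unique(doc_labels)
per_class = np.array([counts[doc_labels == c].sum(axis=0) for c in classes])
total = counts.sum(axis=0)

# Complement counts: for each class, the word counts of all other classes.
complement = total - per_class

# Smoothed word distribution of each class's complement.
complement_prob = (complement + alpha) / (complement + alpha).sum(axis=1, keepdims=True)

print(complement)
print(complement_prob)
```

A document then fits a class well when its words are *rare* in that class's complement. Because the complement of a minority class is the large remainder of the data, its estimates rest on more counts than the minority class's own, which is the motivation for using this variant on imbalanced data.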


## 6. Summarizing NB Models
We can compare the scores, but a higher score on one split does not mean a variant's distribution assumptions actually hold for the data.


In [None]:
print('Results with standardized numeric features (negative values break some NB variants):')
for k, v in results.items():
    print(k, '->', v)

print('\nCategoricalNB with discretization ->', acc_catnb)

Results with standardized numeric features (negative values break some NB variants):
GaussianNB -> 0.72
MultinomialNB -> Error: Negative values in data passed to MultinomialNB (input X).
BernoulliNB -> 0.97
ComplementNB -> Error: Negative values in data passed to ComplementNB (input X).

CategoricalNB with discretization -> 0.97


### 6.1 Advantages & Disadvantages

**Advantages**:
- **Easy to implement** and fast to predict (few parameters).
- Handles **large feature spaces** well, e.g., text.
- Often effective even if independence assumption is not strictly true.
- Works well with limited training data.

**Disadvantages**:
- Assumes **feature independence**, which is usually false.
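- Sensitive to **irrelevant or redundant features**: every feature contributes its own likelihood factor, so correlated features are effectively counted more than once.
- May assign **zero probability** to feature values never seen with a class during training, unless smoothing is used (the `alpha` parameter in scikit-learn's discrete NB variants).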
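
The zero-probability problem and its usual fix, additive (Laplace) smoothing, can be shown on a small made-up count matrix. The sketch computes the unsmoothed and smoothed word probabilities by hand and checks the smoothed ones against what `MultinomialNB(alpha=1.0)` stores in `feature_log_prob_`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts: rows are documents, columns are words.
X_small = np.array([
    [2, 1, 0],
    [3, 0, 0],
    [0, 2, 4],
    [0, 1, 3],
])
y_small = np.array([0, 0, 1, 1])

# Total count of each word within each class.
class_word_counts = np.array([X_small[y_small == c].sum(axis=0) for c in (0, 1)])
class_totals = class_word_counts.sum(axis=1, keepdims=True)

# Without smoothing, a word never seen in a class gets probability 0,
# which zeroes out the whole product for any document containing it.
unsmoothed = class_word_counts / class_totals

# Additive smoothing: add alpha to every count before normalizing.
alpha = 1.0
n_words = X_small.shape[1]
smoothed = (class_word_counts + alpha) / (class_totals + alpha * n_words)

mnb_small = MultinomialNB(alpha=alpha).fit(X_small, y_small)
matches = np.allclose(np.exp(mnb_small.feature_log_prob_), smoothed)

print(unsmoothed)
print(smoothed)
print("Matches MultinomialNB:", matches)
```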


### 6.2 Common Applications
- **Spam Email Filtering**: Classifies emails as spam/not-spam.
- **Text Classification**: Sentiment analysis, topic categorization.
- **Medical Diagnosis**: Predicting disease probability from symptoms.
- **Credit Scoring**: Deciding loan approvals.
- **Weather Prediction**: Determining likely conditions.


## 7. Conclusion
1. Naive Bayes uses **Bayes' rule** with a conditional independence assumption.
2. Different NB variants match different data distributions (Gaussian, Multinomial, Bernoulli, etc.).
3. We used **Pipelines** + **ColumnTransformer** to handle numeric/categorical features.
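4. We tested multiple NB variants on a numeric+categorical dataset (penguins). **GaussianNB** is the natural fit for continuous features, but on this split BernoulliNB and CategoricalNB (with discretization) scored higher (0.97 vs. 0.72). MultinomialNB and ComplementNB require non-negative inputs, so they need a different transformation.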


## Summary Table

| **Section**                 | **Key Point**                                                           |
|----------------------------|-------------------------------------------------------------------------|
| Bayes’ Theorem            | Posterior = (Likelihood * Prior) / Evidence                              |
| Naive Assumption          | Features are conditionally independent given the class                   |
| NB Variants               | Gaussian, Multinomial, Bernoulli, Complement, Categorical                |
| Typical Use Cases         | Text classification, spam filtering, simple numeric tasks                |
| Advantages                | Fast, simple, works well in high-dim data, small training data           |
| Disadvantages             | Independence assumption often violated; unseen feature values get zero probability without smoothing |
| Applications              | Spam filtering, sentiment analysis, medical diagnosis, credit scoring    |


### Key Takeaways
- NB is simple, fast, often used for text classification or small data.
- Always pick the variant that matches your feature distribution.
- Pipelines + ColumnTransformer keep code organized.
