<a href="https://colab.research.google.com/github/razon1494/ML-Practices/blob/main/Module16_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 16: Naive Bayes Classifier
## Practice Notebook with TODOs

This notebook is for **practice**. It is aligned with the teaching notebook for Section 16.5.

You will work with two datasets:
1. A **synthetic numeric dataset** created with `make_classification` (for GaussianNB).
2. A **subset of the 20 Newsgroups text dataset** (for MultinomialNB and BernoulliNB).

Where you see `TODO`, write the required code yourself.
You can always refer back to the teaching notebook if you get stuck.

In [None]:
# ===============================================================
# Imports and basic setup
# ===============================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns

plt.rcParams['figure.figsize'] = (7, 4)
sns.set_theme(style='whitegrid')  # sns.set is an older alias of set_theme

## Part A – Gaussian Naive Bayes on Synthetic Numeric Data

In this part you will:
- Create a synthetic numeric dataset.
- Split into train and test sets.
- Train a Gaussian Naive Bayes model.
- Evaluate the model with accuracy and a confusion matrix.
- Experiment by changing the dataset difficulty.
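
Before training the scikit-learn model, it can help to see what a Gaussian Naive Bayes classifier estimates. The sketch below is a simplified NumPy version, not scikit-learn's implementation: the helper names `fit_gaussian_nb` and `predict_gaussian_nb`, the toy data, and the constant added to the variances are all assumptions made for illustration. It estimates a prior, a mean, and a variance per class and feature, then predicts the class with the highest log prior plus summed log likelihood.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate per-class priors, feature means, and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Add a small constant so no variance is exactly zero (a simplification).
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + var_smoothing
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class with the highest log prior + summed log likelihood."""
    # Broadcasting gives arrays of shape (n_samples, n_classes, n_features).
    diff = X[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    # The "naive" assumption: features are independent given the class,
    # so per-feature log likelihoods are simply summed.
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]

# Tiny hand-made dataset: class 0 near (0, 0), class 1 near (5, 5).
X_toy = np.array([[0.0, 0.2], [0.3, -0.1], [-0.2, 0.1],
                  [5.0, 5.1], [4.8, 5.3], [5.2, 4.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(X_toy, y_toy)
toy_pred = predict_gaussian_nb(np.array([[0.1, 0.0], [5.1, 5.0]]), *params)
print(toy_pred)  # [0 1]
```

The two query points sit next to the class 0 and class 1 clusters respectively, so the predictions are 0 and 1. `GaussianNB` follows the same idea but handles variance smoothing differently.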


In [None]:
# Create a synthetic numeric dataset for binary classification
X, y = make_classification(
    n_samples=600,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=1.6,
    random_state=42
)

print('Shape of X:', X.shape)
print('Shape of y:', y.shape)
print('Class distribution:', np.bincount(y))

In [None]:
# TODO 1: Split the data into training and test sets
# Use train_test_split with test_size=0.25 and random_state=42
# Save the result in X_train, X_test, y_train, y_test

# from sklearn.model_selection import train_test_split  # already imported above

X_train, X_test, y_train, y_test = ...  # TODO: write the correct function call

print('Training set shape:', X_train.shape)
print('Test set shape:', X_test.shape)

In [None]:
# TODO 2: Create and train a GaussianNB model
# 1. Create an instance of GaussianNB
# 2. Fit it on the training data

# Example structure:
# gnb = GaussianNB()
# gnb.fit(X_train, y_train)

gnb = ...  # TODO: create the model
...        # TODO: fit the model on X_train, y_train
print('Model training completed.')

In [None]:
# TODO 3: Make predictions on the test set and compute accuracy
# 1. Use the trained model to predict on X_test
# 2. Compute accuracy_score using y_test and predictions

y_pred = ...  # TODO: predictions for X_test
acc = ...     # TODO: compute accuracy_score
print('Accuracy of GaussianNB on synthetic data:', acc)

In [None]:
# TODO 4: Compute and plot the confusion matrix
# Steps:
# 1. Compute confusion_matrix using y_test and y_pred
# 2. Print the confusion matrix
# 3. Plot it with sns.heatmap

cm = ...  # TODO: compute confusion matrix
print('Confusion matrix:\n', cm)

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Gaussian Naive Bayes on Synthetic Numeric Data')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

In [None]:
# TODO 5: Print a classification report
# Use classification_report with y_test and y_pred

report = ...  # TODO: call classification_report here
print(report)

### Experiment: Change the Dataset Difficulty

- Recreate the dataset with a smaller `class_sep` value, such as `class_sep=0.8`.
- Repeat training and evaluation.
- Observe how accuracy and the confusion matrix change.

You can copy your previous code cells and adjust only the `make_classification` call.
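
One possible way to organize this experiment is to loop over several `class_sep` values and record a score for each. The sketch below is only an illustration: the list of values and the names `sep_values`, `X_sep`, `y_sep`, and `mean_scores` are assumptions, and it uses 5-fold `cross_val_score` instead of the single train/test split from the TODOs, so its numbers will not match yours exactly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

sep_values = [0.4, 0.8, 1.6]  # illustrative values, not prescribed by the exercise
mean_scores = {}

for sep in sep_values:
    # Same settings as the dataset above, except for class_sep.
    X_sep, y_sep = make_classification(
        n_samples=600, n_features=6, n_informative=4, n_redundant=0,
        n_clusters_per_class=1, class_sep=sep, random_state=42,
    )
    # 5-fold cross-validated accuracy of a fresh GaussianNB model.
    scores = cross_val_score(GaussianNB(), X_sep, y_sep, cv=5, scoring='accuracy')
    mean_scores[sep] = scores.mean()
    print(f'class_sep={sep}: mean CV accuracy = {scores.mean():.3f}')
```

Compare the printed accuracies across the three values and check whether the trend agrees with what you see in your own confusion matrices.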

## Part B – Naive Bayes for Text Classification (20 Newsgroups Subset)

In this part you will:
- Load a subset of the 20 Newsgroups dataset.
- Convert text into numeric features using `CountVectorizer`.
- Train a `MultinomialNB` classifier.
- Train a `BernoulliNB` classifier with binary features.
- Compare their performance.


In [None]:
# Fetch a subset of the 20 Newsgroups dataset
categories = ['comp.graphics', 'rec.sport.baseball', 'sci.med']  # TODO (optional): add two more categories
newsgroups = fetch_20newsgroups(
    subset='train',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),
    shuffle=True,
    random_state=42
)

print('Number of documents:', len(newsgroups.data))
print('Target names:', newsgroups.target_names)

In [None]:
# Put into a DataFrame for easier handling
df_text = pd.DataFrame({
    'text': newsgroups.data,
    'label': newsgroups.target
})
df_text.head()

In [None]:
# TODO 6: Split text data into train and test sets
# Use train_test_split on df_text['text'] and df_text['label']
# Suggested: test_size=0.25, random_state=42

X_train_text, X_test_text, y_train_text, y_test_text = ...  # TODO

print('Train size:', X_train_text.shape[0])
print('Test size:', X_test_text.shape[0])

In [None]:
# TODO 7: Convert text to count vectors for MultinomialNB
# Steps:
# 1. Create a CountVectorizer with stop_words='english' and max_features=3000
# 2. Fit on X_train_text and transform both train and test sets
#    to get X_train_counts and X_test_counts

vectorizer = ...  # TODO: create CountVectorizer
X_train_counts = ...  # TODO: fit_transform on X_train_text
X_test_counts = ...   # TODO: transform on X_test_text

print('Shape of X_train_counts:', X_train_counts.shape)
print('Shape of X_test_counts:', X_test_counts.shape)

In [None]:
# TODO 8: Train a MultinomialNB model on the count vectors
# 1. Create a MultinomialNB instance
# 2. Fit it on X_train_counts and y_train_text

mnb = ...  # TODO: create model
...        # TODO: fit the model on X_train_counts, y_train_text
print('MultinomialNB model trained on text data.')

In [None]:
# TODO 9: Evaluate MultinomialNB
# 1. Predict on X_test_counts
# 2. Compute accuracy
# 3. Compute and print a confusion matrix
# 4. Print a classification report with target_names=newsgroups.target_names

y_pred_text = ...  # TODO: predictions
acc_text = ...     # TODO: accuracy
print('Accuracy of MultinomialNB on 20 Newsgroups subset:', acc_text)

cm_text = ...      # TODO: confusion matrix
print('Confusion matrix:\n', cm_text)

sns.heatmap(cm_text, annot=True, fmt='d', cmap='Greens',
            xticklabels=newsgroups.target_names,
            yticklabels=newsgroups.target_names)
plt.title('MultinomialNB on 20 Newsgroups Subset')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

report_text = ...  # TODO: classification_report
print(report_text)

### Practice: Bernoulli Naive Bayes with Binary Features

Now repeat a similar process using `BernoulliNB`:
- Use `CountVectorizer` with `binary=True`.
- Train a `BernoulliNB` model.
- Compare its accuracy and confusion matrix with `MultinomialNB`.


In [None]:
# TODO 10: Create binary bag-of-words features
# 1. Create a new CountVectorizer with binary=True
# 2. Fit on X_train_text and transform train and test sets

vectorizer_bin = ...  # TODO: CountVectorizer with binary=True
X_train_bin = ...     # TODO: fit_transform
X_test_bin = ...      # TODO: transform

print('Shape of X_train_bin:', X_train_bin.shape)
print('Shape of X_test_bin:', X_test_bin.shape)

In [None]:
# TODO 11: Train and evaluate BernoulliNB
# Steps:
# 1. Create a BernoulliNB model
# 2. Fit it on X_train_bin and y_train_text
# 3. Predict on X_test_bin
# 4. Compute accuracy, confusion matrix, and classification report
# 5. Compare the results with MultinomialNB

bnb = ...  # TODO: create BernoulliNB model
...        # TODO: fit the model on X_train_bin, y_train_text
y_pred_bin = ...  # TODO: predictions
acc_bin = ...     # TODO: accuracy
print('Accuracy of BernoulliNB on 20 Newsgroups subset:', acc_bin)

cm_bin = ...      # TODO: confusion matrix
print('Confusion matrix:\n', cm_bin)

sns.heatmap(cm_bin, annot=True, fmt='d', cmap='Purples',
            xticklabels=newsgroups.target_names,
            yticklabels=newsgroups.target_names)
plt.title('BernoulliNB on 20 Newsgroups Subset')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

report_bin = ...  # TODO: classification_report
print(report_bin)

## Final Reflection

- Which model worked better on the text data, **MultinomialNB** or **BernoulliNB**?
- How did changing the dataset difficulty in Part A affect the performance of GaussianNB?
- Where do you think Naive Bayes will perform well in real projects (for example, spam detection, topic classification)?

Write a few short notes summarizing your observations.