# Task 5: Naive Bayes

## Overview

In this task, you will train **Naive Bayes classifiers** on two text representations, **Bag of Words (BoW)** and **TF-IDF**, and compare their test accuracy to see how the choice of representation affects classification performance.
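
To make the two representations concrete, the sketch below builds both for a tiny corpus using only the standard library. The corpus, the function names, and the textbook IDF formula `log(N / df)` are assumptions for illustration; the feature files loaded later in this task may have been produced with a different weighting or normalization.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only.
docs = [
    "good movie good plot",
    "bad movie",
    "good acting",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})


def bow_vector(tokens):
    """Raw term counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def tfidf_vector(tokens):
    """Term count times log(N / df), the textbook TF-IDF variant (an assumption here)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vector = []
    for word in vocab:
        # Document frequency: number of documents containing the word.
        df = sum(1 for other in tokenized if word in other)
        vector.append(counts[word] * math.log(n_docs / df))
    return vector


bow = [bow_vector(tokens) for tokens in tokenized]
tfidf = [tfidf_vector(tokens) for tokens in tokenized]

print(vocab)
print(bow[0])
print([round(value, 3) for value in tfidf[0]])
```

In the first document, "good" has the highest raw count, but under this TF-IDF variant "plot" receives the larger weight because it occurs in only one document.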

## Why Naive Bayes?

**Naive Bayes (NB)** is a widely used algorithm for text classification because:
1. It is fast and efficient for large text datasets.
2. It handles high-dimensional text features well.
3. It applies Bayes’ Theorem under the assumption that words are conditionally independent given the class (rarely true in practice, but a useful simplification for text tasks).
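
The independence assumption in point 3 means the score for each class factorizes into a class prior times one likelihood term per word. A minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing follows. The toy counts, labels, and function names are hypothetical, and the code is meant to illustrate the idea rather than reproduce scikit-learn's implementation.

```python
import math

# Hypothetical toy data: rows are documents, columns are word counts
# for the vocabulary ["great", "boring", "film"].
X_train = [
    [2, 0, 1],
    [1, 0, 1],
    [0, 2, 1],
    [0, 1, 1],
]
y_train = ["pos", "pos", "neg", "neg"]


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    n_features = len(X[0])
    log_prior = {}
    log_likelihood = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        log_prior[label] = math.log(len(rows) / len(X))
        # Total count of each word across all documents of this class.
        word_totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denominator = sum(word_totals) + alpha * n_features
        log_likelihood[label] = [
            math.log((total + alpha) / denominator) for total in word_totals
        ]
    return log_prior, log_likelihood


def predict(x, log_prior, log_likelihood):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {
        label: log_prior[label]
        + sum(count * lw for count, lw in zip(x, log_likelihood[label]))
        for label in log_prior
    }
    return max(scores, key=scores.get)


log_prior, log_likelihood = fit_multinomial_nb(X_train, y_train)
print(predict([1, 0, 1], log_prior, log_likelihood))
print(predict([0, 3, 0], log_prior, log_likelihood))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow when a document contains many words.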

In [1]:
# Imports for loading features, training and evaluating the classifier, and saving the model
import numpy as np

from joblib import dump
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

In [2]:
# Naive Bayes for Bag of Words (BoW)
def train_and_save_nb_bow():
    """Train MultinomialNB on BoW features, print test accuracy, and save the model."""
    # get features for train and test dataset
    X_train_bow = np.load('data/bow_train.npy')
    X_test_bow = np.load('data/bow_test.npy')

    # get labels for train and test dataset
    labels_train = np.load('data/train_labels.npy')
    labels_test = np.load('data/test_labels.npy')

    # prepare naive bayes model based on training features and labels
    clf_model = MultinomialNB()
    clf_model.fit(X_train_bow, labels_train)

    # check the accuracy on the test dataset
    accuracy = clf_model.score(X_test_bow, labels_test)
    print(f"BoW Test Accuracy: {100 * accuracy:.1f}%")

    # save the model
    dump(clf_model, 'models/naive_bayes_bow.pkl')

In [3]:
train_and_save_nb_bow()

BoW Test Accuracy: 58.7%


In [4]:
# Naive Bayes for TF-IDF
def train_and_save_nb_tfidf():
    """Train MultinomialNB on TF-IDF features, print test accuracy, and save the model."""
    # get features for train and test dataset
    X_train_tfidf = np.load('data/tfidf_train.npy')
    X_test_tfidf = np.load('data/tfidf_test.npy')

    # get labels for train and test dataset
    labels_train = np.load('data/train_labels.npy')
    labels_test = np.load('data/test_labels.npy')

    # prepare naive bayes model based on training features and labels
    clf_model = MultinomialNB()
    clf_model.fit(X_train_tfidf, labels_train)

    # check the accuracy on the test dataset
    accuracy = clf_model.score(X_test_tfidf, labels_test)
    print(f"TF-IDF Test Accuracy: {100 * accuracy:.1f}%")

    # save the model
    dump(clf_model, 'models/naive_bayes_tfidf.pkl')

In [5]:
train_and_save_nb_tfidf()

TF-IDF Test Accuracy: 65.7%
