## **Exercise for UNIT 4.1 | Naive Bayes**
### **Axel John Nuqui & Joelmar Grecia**

This notebook presents two implementations of the Naive Bayes classifier for text classification.

The first implementation is a manual approach, where the Bag-of-Words representation is constructed from scratch. It computes prior probabilities, Laplace-smoothed likelihoods, and log-posterior scores from the word frequencies in each class.
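The quantities computed by the manual classifier can be written out as follows. This is a sketch of the standard multinomial formulation with add-one smoothing, matching the `train` and `predict` methods in this notebook; the symbols $N$, $N_c$, $\operatorname{count}(w, c)$, and $V$ are notation introduced here, not names from the code.

```latex
% N_c: number of training documents in class c;  N: total number of documents
P(c) = \frac{N_c}{N}

% count(w, c): occurrences of word w in class c;  V: vocabulary over all classes
P(w \mid c) = \frac{\operatorname{count}(w, c) + 1}{\sum_{w' \in V} \operatorname{count}(w', c) + |V|}

% log-posterior score of a document with tokens w_1, ..., w_n (tokens outside V are skipped)
\operatorname{score}(c) = \log P(c) + \sum_{i \,:\, w_i \in V} \log P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \operatorname{score}(c)
```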

The second implementation uses the scikit-learn library, specifically the Multinomial Naive Bayes (`MultinomialNB`) classifier combined with `CountVectorizer` for automated text vectorization and model training.

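This comparison demonstrates both the underlying mathematical principles of Naive Bayes and its practical application using a standard machine learning library. The two implementations tokenize text differently (whitespace splitting versus `CountVectorizer`'s default word pattern), so their vocabularies and probabilities are not identical even when their predictions agree.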

In [37]:
# creating dataset for naive bayes

dataset = [
    {"text": "Free money now!!!", "class": "SPAM"},
    {"text": "Hi mom, how are you?", "class": "HAM"},
    {"text": "Lowest price for your meds", "class": "SPAM"},
    {"text": "Are we still on for dinner?", "class": "HAM"},
    {"text": "Let's catch up tomorrow at the office", "class": "HAM"},
    {"text": "Meeting at 3 PM tomorrow", "class": "HAM"},
    {"text": "Get 50% off, limited time!", "class": "SPAM"},
    {"text": "Free money now!!!", "class": "HAM"},
    {"text": "Free money now!!!", "class": "SPAM"},
    {"text": "Free money now!!!", "class": "HAM"},
]
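Before training, it can help to look at the class balance and at texts that carry more than one label. The sketch below is a hypothetical check, not part of the original exercise; `samples` is a copy of the dataset above as (text, label) pairs, and the variable names are illustrative.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from the dataset cell
samples = [
    ("Free money now!!!", "SPAM"),
    ("Hi mom, how are you?", "HAM"),
    ("Lowest price for your meds", "SPAM"),
    ("Are we still on for dinner?", "HAM"),
    ("Let's catch up tomorrow at the office", "HAM"),
    ("Meeting at 3 PM tomorrow", "HAM"),
    ("Get 50% off, limited time!", "SPAM"),
    ("Free money now!!!", "HAM"),
    ("Free money now!!!", "SPAM"),
    ("Free money now!!!", "HAM"),
]

# class priors: fraction of documents per class
label_counts = Counter(label for _, label in samples)
priors = {label: n / len(samples) for label, n in label_counts.items()}

# texts that appear under more than one class
labels_by_text = defaultdict(set)
for text, label in samples:
    labels_by_text[text].add(label)
conflicting = sorted(t for t, seen in labels_by_text.items() if len(seen) > 1)

print(priors)
print(conflicting)
```

In this dataset the text "Free money now!!!" appears twice under each class, so its words contribute the same raw counts to SPAM and HAM.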

In [38]:
# imports

import math
from collections import Counter
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

In [39]:
# manual naive bayes classifier class

class MANUAL_NAIVE_BAYES_CLASSIFIER():
    def __init__(self):
        self.vocab = set()
        self.prior = {}
        self.likelihood = {}
        self.classes = []
        
    def train(self, dataset):
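        # resetting state so that calling train() again does not accumulate counts
        self.vocab = set()
        self.prior = {}

        # collecting the class labels present in the dataset
        labels = [item['class'] for item in dataset]
        self.classes = sorted(set(labels))  # sorted for a deterministic class order
        total_docs = len(dataset)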
        
        # initializing counters for the bag of words
        word_counts = {c: Counter() for c in self.classes}
        total_words_in_class = {c: 0 for c in self.classes}
        
        # populating the bag of words
        for item in dataset: 
            label = item['class']
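            # whitespace tokenization: punctuation stays attached to words (e.g. "now!!!")
            tokens = item['text'].lower().split()
            self.prior[label] = self.prior.get(label, 0) + 1  # documents seen per class

            # adding each token to the vocabulary and to its class's word counts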
            for token in tokens:
                self.vocab.add(token)
                word_counts[label][token] += 1
                total_words_in_class[label] += 1
            
        # finalizing priors: P(c)
        for c in self.classes:
            self.prior[c] /= total_docs
            
        # calculating likelihoods P(w|c) with Laplace (add-one) smoothing
        vocab_size = len(self.vocab)
        self.likelihood = {c: {} for c in self.classes}
        
        for c in self.classes:
            for word in self.vocab:
                num = word_counts[c][word] + 1
                den = total_words_in_class[c] + vocab_size
                self.likelihood[c][word] = num / den
                
    def predict(self, text):
        tokens = text.lower().split()
        scores = {}
        
        for c in self.classes:
            score = math.log(self.prior[c])
            
            for token in tokens:
                # tokens not seen during training are skipped
                if token in self.vocab:
                    score += math.log(self.likelihood[c][token])
            scores[c] = score
            
        return max(scores, key=scores.get)
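The `predict` method returns only the winning class. If normalized posterior probabilities are wanted, the per-class log scores can be converted with the log-sum-exp trick. The helper below is a minimal sketch under that assumption; `posteriors_from_log_scores` is a hypothetical name, not part of the class above, and the example scores are illustrative rather than taken from the notebook's output.

```python
import math

def posteriors_from_log_scores(log_scores):
    """Turn {class: log P(c) + sum of log P(w|c)} into probabilities that sum to 1."""
    top = max(log_scores.values())
    # subtracting the maximum before exponentiating avoids floating-point underflow
    shifted = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(shifted.values())
    return {c: v / total for c, v in shifted.items()}

# illustrative log scores (assumed values)
example = {"SPAM": math.log(0.03), "HAM": math.log(0.01)}
probs = posteriors_from_log_scores(example)
print(probs)
```

For the scikit-learn model, `MultinomialNB.predict_proba` plays the corresponding role.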

In [40]:
# scikit-learn's multinomial NB classifier

# preparation of data
texts = [item['text'] for item in dataset]
labels = [item['class'] for item in dataset]

# vectorizing the texts into a bag-of-words count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
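`CountVectorizer` does not tokenize the way the manual classifier does. With default settings it lowercases the text and keeps runs of two or more word characters, while the manual class splits on whitespace. The sketch below imitates both with the standard library only; the regular expression is assumed to mirror the vectorizer's default `token_pattern`, and the function names are illustrative.

```python
import re

# assumed to mirror CountVectorizer's defaults: lowercase=True, token_pattern=r"(?u)\b\w\w+\b"
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def whitespace_tokens(text):
    # tokenization used by the manual classifier
    return text.lower().split()

def word_pattern_tokens(text):
    # approximation of CountVectorizer's default tokenization
    return DEFAULT_TOKEN_PATTERN.findall(text.lower())

for sentence in ["Get 50% off, limited time!", "Meeting at 3 PM tomorrow"]:
    print(whitespace_tokens(sentence))
    print(word_pattern_tokens(sentence))
```

Punctuation stays attached to tokens in the whitespace version ("off,", "time!"), and single-character tokens such as "3" are dropped by the word pattern, so the two models are trained on different vocabularies.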

In [41]:
# initialization and training of both manual and multinomial NB classifiers

# manualNBC training
model = MANUAL_NAIVE_BAYES_CLASSIFIER()
model.train(dataset)

# multinomial NB training
sklearn_model = MultinomialNB()
sklearn_model.fit(X, labels)

MultinomialNB()


In [42]:
# test sentences for class prediction

test_texts = ["Limited offer, click here!", "Meeting at 2 PM with the manager."]

In [None]:
# test using manual naive bayes

print("\n- - - PREDICTION USING MANUAL NB CLASSIFIER MODEL - - -\n")
manualNBC_result1 = model.predict(test_texts[0])
manualNBC_result2 = model.predict(test_texts[1])    

print(f'Sentence: "{test_texts[0]}" -> Prediction: {manualNBC_result1}')
print(f'Sentence: "{test_texts[1]}" -> Prediction: {manualNBC_result2}')

# test using multinomial NB from scikit-learn
print("\n- - - PREDICTION USING MULTINOMIAL NB - - -\n")
test_vectors = vectorizer.transform(test_texts)

sklearn_predictions = sklearn_model.predict(test_vectors)

for text, prediction in zip(test_texts, sklearn_predictions):
    print(f'Sentence: "{text}" -> Prediction: {prediction}')


- - - PREDICTION USING MANUAL NB CLASSIFIER MODEL - - -

Sentence: "Limited offer, click here!" -> Prediction: SPAM
Sentence: "Meeting at 2 PM with the manager." -> Prediction: HAM

- - - PREDICTION USING MULTINOMIAL NB - - -

Sentence: "Limited offer, click here!" -> Prediction: SPAM
Sentence: "Meeting at 2 PM with the manager." -> Prediction: HAM
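The manual prediction for the first test sentence can be traced by hand. The sketch below recomputes the counts from a copy of the dataset using the same whitespace tokenization and add-one smoothing as the manual class; the variable and function names are illustrative, not taken from the class.

```python
import math
from collections import Counter

# (text, label) pairs copied from the dataset cell
samples = [
    ("Free money now!!!", "SPAM"),
    ("Hi mom, how are you?", "HAM"),
    ("Lowest price for your meds", "SPAM"),
    ("Are we still on for dinner?", "HAM"),
    ("Let's catch up tomorrow at the office", "HAM"),
    ("Meeting at 3 PM tomorrow", "HAM"),
    ("Get 50% off, limited time!", "SPAM"),
    ("Free money now!!!", "HAM"),
    ("Free money now!!!", "SPAM"),
    ("Free money now!!!", "HAM"),
]

word_counts = {"SPAM": Counter(), "HAM": Counter()}
doc_counts = Counter()
for text, label in samples:
    doc_counts[label] += 1
    word_counts[label].update(text.lower().split())

vocab = set(word_counts["SPAM"]) | set(word_counts["HAM"])

def log_score(label, tokens):
    # log prior plus smoothed log likelihoods of the in-vocabulary tokens
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(samples))
    for token in tokens:
        if token in vocab:
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
    return score

tokens = "Limited offer, click here!".lower().split()
known = [t for t in tokens if t in vocab]
scores = {label: log_score(label, tokens) for label in word_counts}
print(known)
print(scores)
```

Only "limited" is in the training vocabulary, so the decision compares 0.4 × 2/48 for SPAM with 0.6 × 1/61 for HAM, and SPAM wins despite its lower prior.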
