Let's implement the Naive Bayes Classifier

Let's first create the **bag of words** from the [IMDB dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). At the end we'll redo all these lectures with the messages corpus just for fun; for learning purposes we stick to something more serious.







In [None]:
import pandas as pd

In [None]:
imdb_dataset_filepath = "../datasets/IMDB.csv"
messages_dataset_filepath = "../datasets/train.csv"

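#imdb_dataset = pd.read_csv(imdb_dataset_filepath)
# NOTE: the messages corpus is currently loaded under the name imdb_dataset;
# the cells below read its "text" and "block" columns
imdb_dataset = pd.read_csv(messages_dataset_filepath)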

print(imdb_dataset)
#print(f"total examples={len(imdb_dataset)}")
#print(f"positive examples={len(imdb_dataset[imdb_dataset['sentiment'] == 'positive'])}")
#print(f"negative examples={len(imdb_dataset[imdb_dataset['sentiment'] == 'negative'])}")
#




$\hat{P}(c) = \frac{N_c}{N_{doc}}$

So in our case the prior for `positive` is 25000 / 50000 = 1/2, and the same holds for `negative`.

In [None]:
# create the bag of words (the set of distinct word types) from the corpus
def get_bag_of_words(dataset):
  bag_of_words = set()
  for _, row in dataset.iterrows():
    # split every string field of the row on single spaces and keep the
    # tokens of the first column, which is assumed to hold the review text
    review = row.str.split(" ").to_list()[0]
    for word in review:
      bag_of_words.add(word)
  return bag_of_words

#print(bag_of_words)
#print(len(bag_of_words))
#for word in bag_of_words:
#  print(word)



*P(wi|c)* is the fraction of times the word *wi* appears among all words in all documents of topic c.

We first concatenate all documents with category c into one big “category c” text.

Then we use the frequency of *wi* in this concatenated document to give a maximum likelihood estimate of probability:

![alt text](../images/mle_concatenated_documents.png)


Here the vocabulary *V* consists of all the word types in all classes, not just the words in class *c*.
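
As a minimal sketch of this estimate, the counts can be taken from the concatenated per-class text as shown here. The toy corpus and the helper name `mle_likelihoods` are assumptions made up for illustration; they are not part of the notebook's dataset.

```python
from collections import Counter

# toy corpus, invented for illustration only
toy_docs = {
  "positive": ["great fun film", "great plot"],
  "negative": ["boring plot", "no fun"],
}

def mle_likelihoods(docs_by_class):
  # vocabulary V: every word type seen in any class
  vocab = {w for docs in docs_by_class.values() for d in docs for w in d.split()}
  likelihoods = {}
  for c, docs in docs_by_class.items():
    # concatenate all documents of class c into one big text
    bigdoc = " ".join(docs)
    counts = Counter(bigdoc.split())
    total = sum(counts.values())
    for w in vocab:
      # unsmoothed MLE: count(w, c) / number of tokens in class c
      likelihoods[(w, c)] = counts[w] / total
  return vocab, likelihoods

toy_vocab, toy_likelihoods = mle_likelihoods(toy_docs)
print(toy_likelihoods[("great", "positive")])  # 2 of the 5 positive tokens
print(toy_likelihoods[("great", "negative")])  # never seen in negative documents
```

Note that the second value is exactly zero, because "great" never occurs in the toy negative documents.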







In [None]:
documents_of_topic_positive = imdb_dataset[imdb_dataset["block"] == 0]
documents_of_topic_negative = imdb_dataset[imdb_dataset["block"] == 1]






In [None]:
print(documents_of_topic_negative)

In [None]:
print(documents_of_topic_positive)

In [None]:
bag_of_words_in_positive_documents = get_bag_of_words(documents_of_topic_positive)
bag_of_words_in_positive_documents

In [None]:
bag_of_words_in_negative_documents = get_bag_of_words(documents_of_topic_negative)
bag_of_words_in_negative_documents

There is an issue with MLE training. Imagine we try to estimate the likelihood of the word "fantastic" given the class "positive", but no training document both contains the word "fantastic" and is classified as "positive". Perhaps "fantastic" is only used in a *sarcastic* way in the *negative* class. In this case the estimated probability would be 0.

![alt text](../images/sarcastic_probability.png)

Since naive Bayes multiplies all the feature likelihoods together, zero probabilities in the likelihood term for any class will cause the probability of the class to be zero, no matter the other evidence!

The simplest solution is Laplace (add-one) smoothing, which is commonly used in naive Bayes even though language models usually rely on more sophisticated smoothing.

![alt text](../images/naive_bayes_add_laplace.png)
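
A small sketch of add-one smoothing, assuming per-class counts held in a `Counter`. The function name `smoothed_likelihood` and the toy counts are hypothetical and only serve the illustration.

```python
from collections import Counter

def smoothed_likelihood(word, class_counts, vocab):
  # add-one smoothing: (count(w, c) + 1) / (sum of counts in c + |V|)
  total = sum(class_counts[w] for w in vocab)
  return (class_counts[word] + 1) / (total + len(vocab))

# toy counts, invented for illustration only
smoothing_vocab = {"fantastic", "boring", "plot"}
toy_positive_counts = Counter({"plot": 3, "boring": 1})  # "fantastic" never seen

# (0 + 1) / (4 + 3): small, but no longer zero
print(smoothed_likelihood("fantastic", toy_positive_counts, smoothing_vocab))
```

The smoothed values still sum to 1 over the vocabulary, so they remain a valid probability distribution for the class.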







In [None]:
# V is our vocabulary
V = bag_of_words_in_positive_documents.union(bag_of_words_in_negative_documents)
V_sure = get_bag_of_words(imdb_dataset)
assert V == V_sure, f"expected {len(V_sure)} but got {len(V)}"
print(V)

What do we do about **unknown** words, i.e. words that appear in the test data but did not occur in any training document of any class and are therefore not in our vocabulary?

The solution is to ignore them: remove them from the test document and do not include any probability for them at all. A word with no training counts in any class carries no evidence that favours one class over another.
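
Dropping unknown words could look like the sketch below. The helper name `drop_unknown_words` and the toy vocabulary are assumptions for illustration, and tokens are split on whitespace only.

```python
def drop_unknown_words(document, vocab):
  # keep only the tokens that were seen during training
  return [w for w in document.split() if w in vocab]

known_vocab = {"great", "plot", "boring"}  # toy vocabulary, for illustration
print(drop_unknown_words("great plot with zeitgeist", known_vocab))
```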


Some systems choose to ignore another class of words: **stop words**, very frequent words like *the* and *a*. To find them, either sort the vocabulary by frequency in the training set and take the top 10-100 entries as stop words, or use one of the predefined stop word lists available online. Every instance of these words is then removed from both training and test documents as if it had never occurred.
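
The frequency-based option might be sketched as follows. The helper names and the toy documents are hypothetical, and `top_n=1` is chosen only to keep the example small.

```python
from collections import Counter

def most_frequent_words(documents, top_n):
  # count every whitespace-separated token across the training documents
  counts = Counter(w for d in documents for w in d.split())
  return {w for w, _ in counts.most_common(top_n)}

def remove_stop_words(document, stop_words):
  # drop every instance of a stop word, keep the remaining word order
  return " ".join(w for w in document.split() if w not in stop_words)

# toy training documents, invented for illustration only
toy_training_docs = ["the plot of the film", "the cast of a film", "a dull plot"]
toy_stop_words = most_frequent_words(toy_training_docs, top_n=1)
print(toy_stop_words)
print(remove_stop_words("the end of the film", toy_stop_words))
```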







##### Worked example
Let's calculate the prior *P(c)*.

In [None]:
total_examples = len(imdb_dataset)
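# the loaded data labels its classes in the "block" column
# (0 = positive, 1 = negative), matching the filtering cells above
positive_example_counts = len(imdb_dataset[imdb_dataset["block"] == 0])
negative_example_counts = len(imdb_dataset[imdb_dataset["block"] == 1])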
prior_positive_class = positive_example_counts / total_examples
prior_negative_class = negative_example_counts / total_examples

print(f"prior_positive_class={prior_positive_class}")
print(f"prior_negative_class={prior_negative_class}")

![alt text](../images/prior_likelihood_features.png)

To apply the naive Bayes classifier to text, we use each word in the document as a feature, and we consider every word by walking an index through each word position in the document.

![alt text](../images/position_words.png)
    

Naive Bayes calculations are done in log space for the same reasons explained earlier: to avoid numerical underflow and to increase speed.

![alt text](../images/naive_bayes_log.png)
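
To see why log space matters, here is a small sketch with made-up numbers: a hypothetical document of 400 words, each with likelihood 0.01. The natural log is used here; any base gives the same ranking of classes.

```python
import math

# 400 word likelihoods of 0.01 each, a made-up document for illustration
likelihoods = [0.01] * 400
prior = 0.5

# multiplying the raw probabilities underflows to exactly 0.0
product = prior * math.prod(likelihoods)

# summing logs keeps the score finite and comparable across classes
log_score = math.log(prior) + sum(math.log(p) for p in likelihoods)

print(product)
print(log_score)
```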

#### Training the Naive Bayes Classifier
How can we learn the probabilities *P(c)* (the *prior*) and *P(fi|c)* (the *likelihood*)?

First we'll consider the MLE: simply use the frequencies in the data. For the class prior *P(c)* we ask what percentage of the documents in our training data are in each class *c*. Let *Nc* be the number of documents in our training data with class *c* and *Ndoc* be the total number of documents. Then $\hat{P}(c) = \frac{N_c}{N_{doc}}$.













In [None]:
import math
import numpy as np
import re
from functools import reduce
from collections import Counter


# TODO try this at some point potentially filter stop words
#from sklearn.feature_extraction import text
#stop_words = set(text.ENGLISH_STOP_WORDS)
def get_bag_of_words(dataset):
  return set(
    dataset["text"] # access all the text column
    .dropna()       # remove missing values
    #.str.lower()
    .str.split()    # split on whitespaces (more efficient than " ")
    .explode()      # flatten all words into one Series
    #.loc(lambda x: ~x.isin(stop_words)) # filter stopwords early
  )


# we're keeping the punctuation and everything and just considering the
# words as split by whitespaces so some of these results may not be accurate
def train_naive_bayes():
  total_examples = len(imdb_dataset)
  positive_example_counts = len(imdb_dataset[imdb_dataset["block"] == 0])
  negative_example_counts = len(imdb_dataset[imdb_dataset["block"] == 1])

  prior_positive_class = np.log2(positive_example_counts/total_examples)
  prior_negative_class = np.log2(negative_example_counts/total_examples)

  print(prior_positive_class, prior_negative_class)

  # compute the vocabulary V
  bag_of_words_in_negative_documents = get_bag_of_words(documents_of_topic_negative)
  bag_of_words_in_positive_documents = get_bag_of_words(documents_of_topic_positive)
  V = bag_of_words_in_positive_documents.union(bag_of_words_in_negative_documents)
  V_sure = get_bag_of_words(imdb_dataset)
  assert V == V_sure, f"expected {len(V_sure)} but got {len(V)}"

  bigdoc = {
    "positive": " ".join(documents_of_topic_positive["text"].astype(str)),
    "negative": " ".join(documents_of_topic_negative["text"].astype(str)),
  }

  # count tokens with the same whitespace split used to build V, so every
  # count lines up with a vocabulary entry; a regex over all of V is not
  # needed, and \b word boundaries misfire on tokens that keep punctuation
  def count_words(text): return Counter(text.split())

  positive_counts = count_words(bigdoc["positive"])
  negative_counts = count_words(bigdoc["negative"])

  # build counts dictionary
  counts = {}
  for word in V:
    counts[(word, "positive")] = positive_counts.get(word, 0)
    counts[(word, "negative")] = negative_counts.get(word, 0)

  return V, counts


In [None]:

V, counts = train_naive_bayes()

In [None]:
print(V)
print(len(V))

In [None]:
print(counts)
print(len(counts))
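
With a vocabulary and a `(word, class)` counts dictionary in this layout, classification could be sketched as below. This is an illustration under stated assumptions, not the notebook's final method: the function name `classify`, the toy counts, and the equal priors are all made up, add-one smoothing is applied, and unknown words are skipped.

```python
import math

def classify(document, log_priors, counts, vocab):
  # total token count per class, the denominator before smoothing
  totals = {c: sum(counts[(w, c)] for w in vocab) for c in log_priors}
  scores = {}
  for c, log_prior in log_priors.items():
    score = log_prior
    for w in document.split():
      if w not in vocab:
        continue  # unknown words are ignored
      # add-one smoothed log likelihood of w given class c
      score += math.log((counts[(w, c)] + 1) / (totals[c] + len(vocab)))
    scores[c] = score
  # pick the class with the highest log score
  return max(scores, key=scores.get), scores

# toy vocabulary and counts in the (word, class) layout, invented for illustration
toy_V = {"great", "boring", "plot"}
toy_counts = {
  ("great", "positive"): 3, ("great", "negative"): 0,
  ("boring", "positive"): 0, ("boring", "negative"): 2,
  ("plot", "positive"): 1, ("plot", "negative"): 1,
}
toy_log_priors = {"positive": math.log(0.5), "negative": math.log(0.5)}

label, scores = classify("great plot twist", toy_log_priors, toy_counts, toy_V)
print(label)
```

The word "twist" is not in the toy vocabulary, so it contributes nothing to either class score.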