Let's implement the Naive Bayes Classifier

Let's first create the **bag of words** from the [IMDB dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). We'll redo all these lectures at the end with the messages corpus just for fun; for learning purposes, we stick to something more serious here.







In [2]:
import pandas as pd

In [None]:
imdb_dataset_filepath = "../datasets/IMDB.csv"

imdb_dataset = pd.read_csv(imdb_dataset_filepath)

print(f"total examples={len(imdb_dataset)}")
print(f"positive examples={len(imdb_dataset[imdb_dataset['sentiment'] == 'positive'])}")
print(f"negative examples={len(imdb_dataset[imdb_dataset['sentiment'] == 'negative'])}")





total examples=50000
positive examples=25000
negative examples=25000


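$\hat{P}(c) = N_c / N_{doc}$

Here $N_c$ is the number of training documents in class *c* and $N_{doc}$ is the total number of documents. In our case, for `positive` this is 25000 / 50000 = 1/2, and the same holds for `negative`.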
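
The prior estimate can be sketched in a few lines of plain Python. This is only an illustration: `estimate_priors` and `toy_labels` are hypothetical names, and the labels are a tiny stand-in for the balanced IMDB split, not the real data.

```python
from collections import Counter

def estimate_priors(labels):
    """Return {class: N_c / N_doc} for a list of class labels."""
    counts = Counter(labels)   # N_c for every class c
    n_doc = len(labels)        # N_doc, the total number of documents
    return {c: n_c / n_doc for c, n_c in counts.items()}

# Toy labels (assumption): three positive and three negative documents.
toy_labels = ["positive"] * 3 + ["negative"] * 3
priors = estimate_priors(toy_labels)
print(priors)
```

With a balanced set of labels, every class ends up with the same prior, which matches the 1/2 computed for the IMDB dataset.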

In [32]:
# create the bag of words from the imdb corpus
# note: this is a set of distinct words (word types), so it keeps no counts
def get_bag_of_words(dataset):
  bag_of_words = set()
  # iterate over the review texts only; the sentiment column is not needed here
  for review in dataset["review"]:
    # naive tokenization: split on single spaces
    bag_of_words.update(review.split(" "))
  return bag_of_words

#print(bag_of_words)
#print(len(bag_of_words))
#for word in bag_of_words:
#  print(word)



*P(wi|c)* is the fraction of times the word *wi* appears among all words in all documents of topic c.

We first concatenate all documents with category c into one big “category c” text.

Then we use the frequency of *wi* in this concatenated document to give a maximum likelihood estimate of probability:

![alt text](../images/mle_concatenated_documents.png)


Here the vocabulary *V* consists of all the word types in all classes, not just the words in class *c*.
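
A minimal sketch of this estimate, assuming the formula in the figure is the usual ratio of the count of *wi* in class *c* to the total count of all vocabulary words in class *c*. The function `mle_word_likelihoods` and the toy documents are hypothetical and only serve as illustration.

```python
from collections import Counter

def mle_word_likelihoods(documents, vocabulary):
    """Estimate P(w|c) by relative frequency for one class c.

    documents:  list of review strings that all belong to class c
    vocabulary: set of word types collected from all classes (V)
    """
    # Concatenate all documents of the class into one big token list.
    tokens = " ".join(documents).split(" ")
    counts = Counter(tokens)
    # Denominator: total count of all vocabulary words in this class.
    total = sum(counts[w] for w in vocabulary)
    return {w: counts[w] / total for w in vocabulary}

# Toy corpus (assumption), not taken from the IMDB data.
toy_positive_docs = ["great fun great cast", "fun plot"]
toy_negative_docs = ["boring plot", "boring boring cast"]

# V holds the word types of both classes, not just the positive one.
toy_vocabulary = set(" ".join(toy_positive_docs + toy_negative_docs).split(" "))

positive_likelihoods = mle_word_likelihoods(toy_positive_docs, toy_vocabulary)
for word in sorted(toy_vocabulary):
    print(word, positive_likelihoods[word])
```

Because the vocabulary spans both classes, a word such as "boring" that never occurs in the positive documents still gets an entry, with an estimate of 0.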







In [27]:
documents_of_topic_positive = imdb_dataset[imdb_dataset["sentiment"] == "positive"]
documents_of_topic_negative = imdb_dataset[imdb_dataset["sentiment"] == "negative"]


In [28]:
print(documents_of_topic_negative)

                                                  review sentiment
3      Basically there's a family where a little boy ...  negative
7      This show was an amazing, fresh & innovative i...  negative
8      Encouraged by the positive comments about this...  negative
10     Phil the Alien is one of those quirky films wh...  negative
11     I saw this movie when I was about 12 when it c...  negative
...                                                  ...       ...
49994  This is your typical junk comedy.<br /><br />T...  negative
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[25000 rows x 2 columns]


In [29]:
print(documents_of_topic_positive)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
4      Petter Mattei's "Love in the Time of Money" is...  positive
5      Probably my all-time favorite movie, a story o...  positive
...                                                  ...       ...
49983  I loved it, having been a fan of the original ...  positive
49985  Imaginary Heroes is clearly the best film of t...  positive
49989  I got this one a few weeks ago and love it! It...  positive
49992  John Garfield plays a Marine who is blinded by...  positive
49995  I thought this movie did a down right good job...  positive

[25000 rows x 2 columns]


In [None]:
bag_of_words_in_positive_documents = get_bag_of_words(documents_of_topic_positive)
bag_of_words_in_positive_documents

In [None]:
bag_of_words_in_negative_documents = get_bag_of_words(documents_of_topic_negative)
bag_of_words_in_negative_documents

There is an issue with MLE training. Imagine we try to estimate the likelihood of the word "fantastic" given the class "positive", but no training document contains the word "fantastic" and is classified as "positive". Perhaps "fantastic" is only used in a *sarcastic* way in the *negative* class. In this case the estimated probability would be 0.

![alt text](../images/sarcastic_probability.png)
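
The zero-count problem can be reproduced on a toy corpus. The sketch below is an assumption-laden illustration (the documents and the helper `likelihood_positive` are hypothetical): since Naive Bayes multiplies the per-word likelihoods of a document, a single zero estimate drives the whole product for that class to zero.

```python
from collections import Counter

# Toy corpus (assumption): "fantastic" only appears in a sarcastic negative review.
positive_docs = ["wonderful film", "wonderful cast"]
negative_docs = ["fantastic waste of time"]

positive_counts = Counter(" ".join(positive_docs).split(" "))
total_positive = sum(positive_counts.values())

def likelihood_positive(word):
    """MLE estimate of P(word | positive); 0 for words never seen in the class."""
    return positive_counts[word] / total_positive

# Score a new review under the positive class by multiplying word likelihoods.
review = "fantastic wonderful film".split(" ")
score = 1.0
for word in review:
    score *= likelihood_positive(word)

print(likelihood_positive("fantastic"))
print(score)
```

Both printed values are zero: the unseen word "fantastic" wipes out the evidence contributed by "wonderful" and "film".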

In [39]:
# V is our vocabulary
V = bag_of_words_in_positive_documents.union(bag_of_words_in_negative_documents)
V_sure = get_bag_of_words(imdb_dataset)
assert V == V_sure, f"expected {len(V_sure)} but got {len(V)}"
print(V)



What do we do about **unknown** words?