# Text Classification with Naive Bayes 
# Nicholas Cruz 
# April 22, 2023

## Introduction

In this project, we will develop a classifier using Naive Bayes methods to analyze two datasets:
1. A transcript of the October 22, 2020 Presidential Debate between Trump and Biden, scraped from [debates.org](https://www.debates.org/voter-education/debate-transcripts/). We will study how well the classifier predicts who is speaking given new quotes.
2. A csv containing about 25,000 movie reviews, labeled as either positive or negative. We will study how well the classifier identifies a positive or negative review given new reviews.

After developing this classifier, we will identify potential shortcomings of the model.

### Naive Bayes Classifier

Our implementation of this Naive Bayes classifyer will use conditional probability to make predictions. If we have labels $x$ and $y$, the model will compare the conditional probabilities of the data belonging to $x$ and $y$. The label belonging to the greater value is the model's prediction. 

More formally, we need to find $\max{[P(x|y),P(y|x)]}$, where $$P(x|y) = \frac{P(x \cap y)}{P(y)}$$

$$P(y|x) = \frac{P(y \cap x)}{P(x)}$$

## Creating Dataframes

In [2]:
# libraries for implementation
import pandas as pd
import numpy as np
import requests
from collections import Counter
from zipfile import ZipFile
from bs4 import BeautifulSoup
import re

In [3]:
# libraries for testing
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import html
from IPython.core.display import display, HTML

In [4]:
# helper functions

def re_show(regex, text="", flags=0):
    """
    Displays text with the regex match highlighted.
    """
    text_css = '''"border-style: none;
                   border-width: 0px;
                   padding: 0px;
                   font-size: 14px;
                   color: darkslategray;
                   background-color: white;
                   white-space: pre;
                   line-height: 20px;"
                   '''
    match_css = '''"padding: 0px 1px 0px 1px;
                    margin: 0px 0.5px 0px 0.5px;
                    border-style: solid;
                    border-width: 0.5px;
                    border-color: maroon;
                    background-color: cornsilk;
                    color: crimson;"
                    '''
    
    
    r = re.compile(f"({regex})", flags=flags)
    t = r.sub(fr'###START###\1###END###', text)
    t = html.escape(t)
    t = t.replace("###START###", f"<span style={match_css}>")
    t = t.replace("###END###", f"</span>")
    display(HTML(f'<code style={text_css}>{t}</code>'))
    
def clean_string(string):    
    bad_strings = ['\n','.',',',';','>','/','<br']
    for baddy in bad_strings:
        string = string.replace(baddy,'')
    string = string.replace('-',' ').replace('  ',' ')
    return string    

### Presidential Debates Dataframe

Each row of this dataframe will be a word said in the debate, and each column will indicate how many times Biden and Trump said that word.

In [32]:
# get debate html from web and grab div tag with debate content
debate_url = "https://www.debates.org/voter-education/debate-transcripts/october-22-2020-debate-transcript/"
debate = requests.get(debate_url).text
debate_soup = BeautifulSoup(debate)
debate_html = debate_soup.findAll('div',id="content-sm")[0]

# clean up html and turn into string
debate_str = ""
for block in debate_html.findAll('p'):
    debate_str+=str(block)
for tag in ['<p>','<b>','</b>','</p>','<br/>','\n','\xa0', '<span class="Apple-converted-space">','</span>']:
    debate_str=debate_str.replace(tag," ")

# find all word blocks between BIDEN: and the next all caps word 
biden_blocks = re.findall(r'BIDEN:(.*?)(?=\b[A-Z]{2,}\b)', debate_str) 
biden_str = clean_string(' '.join(biden_blocks)).lower() # clean string
biden_wordlist = biden_str.split() # split on white space
biden_counter = Counter(biden_wordlist) # count words in list

# find all word blocks between TRUMP: and the next all caps word 
trump_blocks = re.findall(r'TRUMP:(.*?)(?=\b[A-Z]{2,}\b)', debate_str)
trump_str = clean_string(' '.join(trump_blocks)).lower() # clean string
trump_wordlist = trump_str.split() # split on white space
trump_counter = Counter(trump_wordlist) # count words in list

debate_words_df = pd.DataFrame({'biden':biden_counter, 'trump':trump_counter}) # make dataframe from dictionary
debate_words_df = debate_words_df.fillna(0) # address NaNs
debate_words_df.biden = debate_words_df.biden.astype(int) # declare int type for counts
debate_words_df.trump = debate_words_df.trump.astype(int)

debate_words_df

# laplace smoothing
debate_words_df = debate_words_df.add(1)

debate_words_df

Unnamed: 0,biden,trump
220000,2,1
americans,6,1
dead,3,3
if,21,22
you,83,127
...,...,...
calls,1,2
normally,1,2
together,1,2
unemployment,1,2


### Movie Reviews Dataframe 

This movie review dataframe was given in class. Because of its size, I have chosen to split the dataset in half for training and testing, rather than training on the entire dataset.

In [33]:
movies = pd.read_csv("movie_reviews.zip")

In [34]:
counter_pos = Counter([])
counter_neg = Counter([])

# for convenience, we will split the dataset into testing and training
X_train, X_test, y_train, y_test = train_test_split(movies['review'], 
                                                    movies['sentiment'], 
                                                    test_size=0.50, 
                                                    random_state=1)

n = len(X_train)
# clean formatting of data
for i in tqdm(range(len(X_train))):
    words = movies.loc[i,'review'].lower()
    words = clean_string(words)    
    words = words.split(' ')
    
    # add to respective word counters
    if movies.loc[i,'sentiment'] == "positive":
        counter_pos += Counter(words)
    else:
        counter_neg += Counter(words)
        
review_words_df = pd.DataFrame({'positive':counter_pos, 'negative':counter_neg}) # make dataframe from dictionary
review_words_df = review_words_df.fillna(0) # address NaNs
review_words_df.positive = review_words_df.positive.astype(int) # declare int type for counts
review_words_df.negative = review_words_df.negative.astype(int)

# laplace smoothing
review_words_df = review_words_df.add(1)

100%|████████████████████████████████████| 12500/12500 [01:07<00:00, 186.05it/s]


In [35]:
review_words_df

Unnamed: 0,positive,negative
i,17427,19716
got,840,1010
to,33135,34839
see,2904,2644
this,17223,20013
...,...,...
goldfinger,1,2
"space""?",1,2
"""p9fos""",1,2
"""shampoo""",1,2


## Creating the Naive Bayes Model

The model will iterate through each word of a text (excluding stop words) and calculate the conditional log probability of each word having a certain label (positive/negative for movie reviews and Biden/Trump for debates). The label of higher value is the prediction. 

**Note.** This implementation uses log likelihood to avoid underflow/overflow issues with floats. This means that we add probabilities where we would usually multiply them.

### Model for Movie Reviews 

In [36]:
stops = "a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your"
stops = stops.split(',')

def rev_probs(review_text):
    value_pos_lst = [] # list of 'positive' probabilities 
    value_neg_lst = [] # list of 'negative' probabilities 

    words = clean_string(review_text).lower().split(' ') # clean text and split on whitespace
    for word in words:
        if word in stops:
            None # handle stop words
        else:
            try:
                p = review_words_df.loc[word]/review_words_df.sum() # probability that word is positive or negative 
                value_pos_lst.append(p["positive"]) # append to list
                value_neg_lst.append(p["negative"]) # append to list
            except:
                None
    tot = np.array(review_words_df.sum()).sum() # total amount of words said
    prod = np.ones(2) # initialize prediction array
    prod[0] = review_words_df['negative'].sum()/(tot) # probability of a word being negative
    prod[1] = review_words_df['positive'].sum()/(tot) # probability of a word being positive
    prod = np.log10(prod) # take log of probabilities to avoid underflow
    
    # multiply probabilities according to Naive Bayes 
    prod[0] += np.log10(value_neg_lst).sum() 
    prod[1] += np.log10(value_pos_lst).sum()
    return prod

# Display results
def print_rev_results(prod):
    print('log likelihood of it being negative is: '+str(prod[0]))
    print('log likelihood of it being positive is: '+str(prod[1]))
    if(prod[0]>prod[1]):    
        print('\n It\'s most likely its negative')
    else:
        print('\n It\'s most likely its positive')
        


### Model for Debates

In [37]:
def deb_probs(debate_text):
    value_biden_lst = []
    value_trump_lst = []

    words = clean_string(debate_text).lower().split(' ')
    for word in words:
        if word in stops:
            None
        else:
            try:
                p = debate_words_df.loc[word]/debate_words_df.sum() 
                value_biden_lst.append(p["biden"])
                value_trump_lst.append(p["trump"])
            except:
                None
    
    tot = np.array(debate_words_df.sum()).sum() 
    prod = np.ones(2) 
    prod[0] = debate_words_df['trump'].sum()/(tot) 
    prod[1] = debate_words_df['biden'].sum()/(tot)
    prod = np.log10(prod)
    
    prod[0] += np.log10(value_trump_lst).sum()
    prod[1] += np.log10(value_biden_lst).sum()
    return prod

# Display results
def print_deb_results(prod):
    print('log likelihood of Trump speaking is: '+str(prod[0]))
    print('log likelihood of Biden speaking is: '+str(prod[1]))
    if(prod[0]>prod[1]):    
        print('\n It\'s most likely Trump')
    else:
        print('\n It\'s most likely Biden')

In [38]:
# quick check that both functions run as intended

review_text = 'This movie was terrible' # I expect negative
prod = rev_probs(review_text)
print_rev_results(prod)

print('\n')

debate_text = 'This is a bunch of malarkey' # I expect Biden
prod = deb_probs(debate_text)
print_deb_results(prod)

log likelihood of it being negative is: -5.783843621237331
log likelihood of it being positive is: -6.642097347854029

 It's most likely its negative


log likelihood of Trump speaking is: -4.238848680362337
log likelihood of Biden speaking is: -3.937818684698356

 It's most likely Biden


## Analysis

### Potential Bias in Datasets

When working with both the datasets, I wondered if one label dominated the other in size. If true, how would this favor certain predictions? While conditional probability attempts to mitigate such bias by considering the size of a label, there are cases where the Naive Bayes approach is problematic. 

Consider the Biden vs. Trump debate.

In [39]:
print('Amount of \'Biden\' words: ' + str(debate_words_df['biden'].sum()))
print('Amount of \'Trump\' words: ' + str(debate_words_df['trump'].sum()))

Amount of 'Biden' words: 8486
Amount of 'Trump' words: 8846


Trump said more words in this specific debate, meaning Biden is not represented as accurately as Trump in the dataset. 

Another consequence of Trump saying more words is that common words are distributed more thinly for Trump than for Biden. Take their top 50 words.

In [None]:
print("Trump's top 25-49 words \n", debate_words_df['trump'].sort_values(ascending=False)[:50], '\n')
print("Biden's top 25-49 words \n", debate_words_df['biden'].sort_values(ascending=False)[:50])

If we forget that many of the top 50 words are "stop words", Biden's top words are said more frequently than Trump's despite Trump saying more words. This means that the model works very well for words said infrequently (like "malarkey"), but is more likely to favor Biden for common words.

This could be due to circumstance and way people communicate in general. Someone who says fewer words in a debate will have to be more concise and forward in their language, so they will naturally use more common words.

### The "Naive" in Naive Bayes

The biggest flaw in the model is the assumption that words are independent of each other. Take the following example:

In [None]:
review_text = 'I do not understand why everyone thinks this movie is terrible. I enjoyed it.'
prod = rev_probs(review_text)
print_rev_results(prod)

The algorithm currently does not understand the context behind word usage, only the frequency of the words. A less naive model should consider patterns of words, not just the words themselves. In the above example, a more sophisticated model might see the word "not" and flip the labels of the subsequent words in the sentence. However, one can only hard-code language patterns so much before the model loses generality. 

## Conclusion

While I am impressed by the Naive Bayes model, I do not think it is effective on written language. The "naive" part of the model can result in horrendously inaccurate results, since it does not consider the greater context of words in a sentence. And despite the use of conditional probability to mitigate bias, the model can still fall short when the dataset itself biases one label over another (such as when Trump speaks more during the debate). I believe we can do better when making predictive models for language. 

## References

1. 1.9. naive Bayes. scikit. (n.d.). Retrieved April 23, 2023, from https://scikit-learn.org/stable/modules/naive_bayes.html#:~:text=Naive%20Bayes%20methods%20are%20a,value%20of%20the%20class%20variable 

2. Dave_Child. (2011, October 19). Regular expressions cheat sheet. Cheatography. Retrieved April 23, 2023, from https://cheatography.com/davechild/cheat-sheets/regular-expressions/ 