# Text classification with naive Bayes

## Introduction

This project explores the naive Bayes classifier's ability to determine which group a piece of text belongs to, based on a large set of previously labeled text. The classifier works with two data sets: 25,000 movie reviews labeled positive or negative, and the transcript of the presidential debate held in September 2016.

Bayes' theorem states

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

For those of you who haven't taken a probability course, the theorem gives the probability of A happening given that B has also occurred. For example, it could be used to find the probability that it is thundering given that it is also raining.
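To make the formula concrete, here is a small worked example of the thunder and rain case. The three input probabilities are made up purely to show the arithmetic; they are not measured values.

```python
# Hypothetical probabilities, chosen only to illustrate the arithmetic.
p_thunder = 0.05              # P(A): it is thundering
p_rain = 0.20                 # P(B): it is raining
p_rain_given_thunder = 0.90   # P(B|A): it is raining, given that it is thundering

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_thunder_given_rain = p_rain_given_thunder * p_thunder / p_rain
print(round(p_thunder_given_rain, 3))
```

With these assumed inputs, thunder is fairly unlikely overall (5%) but becomes noticeably more likely once we know it is raining.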

The naive Bayes classifier in this project takes in a sample string, ideally one related to the data set. The string is cleaned up and split into a list of individual words. For each word, the classifier adds the log of the probability of that word under each class to a running sum for that class. In the end there is one sum per class (three for the debate speakers, two for the review sentiments), and the phrase is assigned to the class with the least negative sum.
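The procedure described above can be sketched on a tiny example. This is a minimal illustration, not the notebook's implementation: the two-class training text, the `classify` function, and the variable names are all invented here, and only the standard library is used.

```python
import math
from collections import Counter

# Hypothetical toy training text for two classes (not the project's data).
training = {
    "pos": "great fun great cast".split(),
    "neg": "dull plot dull cast".split(),
}
counts = {label: Counter(words) for label, words in training.items()}
vocab = set().union(*counts.values())
total_words = sum(len(words) for words in training.values())

def classify(phrase):
    scores = {}
    for label, c in counts.items():
        n = sum(c.values())
        # log prior: share of all training words that belong to this class
        score = math.log10(n / total_words)
        for word in phrase.lower().split():
            if word in vocab:
                # add-one smoothing keeps unseen (word, class) pairs above zero
                score += math.log10((c[word] + 1) / (n + len(vocab)))
        scores[label] = score
    # the least negative log score wins
    return max(scores, key=scores.get), scores

label, scores = classify("great cast")
print(label, scores)
```

Each class ends up with one negative number, and the class whose number is closest to zero is the prediction.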

For the first part of this project, I will use the Bayes classifier on the transcript of the 2016 debate between Clinton and Trump, moderated by Holt.

### Import Python libraries

In [1]:
from collections import Counter
import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import numpy as np
from bs4 import BeautifulSoup

A lot of the words in these texts exist mainly to hold sentences together. These are called stopwords, and they include words like "a" and "the." They keep us from sounding like cavemen, but they carry little of the actual meaning. Throughout this project, I will compare the classifier's output when these words are included and when they are ignored.

In [2]:
with open('stopwords.txt') as f:
    stop_words = f.read()
    
print(stop_words[:94])

a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by
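One thing to watch: `f.read()` returns the whole file as a single string, so a later test such as `word not in stop_words` is a substring test rather than a whole-word test. The sketch below shows the difference using the first few entries printed above; `stop_set` is a name introduced here for illustration. The outputs shown in this write-up come from the string version, so switching to a set could change them.

```python
# The first few entries of the stopword list as printed above (a single string).
stop_words = "a,able,about,across,after,all,almost,also,am,among"

# Substring test: "cross" is not a stopword, but it occurs inside "across".
print("cross" in stop_words)        # True

# Exact-match test against a set of whole words.
stop_set = set(stop_words.split(","))
print("cross" in stop_set)          # False
print("about" in stop_set)          # True
```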


In [3]:
def cleaner(review): # method to clean up strings and split them into lists of words
    nonochars = ['<br','.',',',';','(',')',':','!','?','/>','"']
    for nono in nonochars:
        review = review.replace(nono,'')
    review = review.replace('-','').replace('  ',' ')
    review = review.lower().split()
    return review
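A quick check of what `cleaner` does to a typical review fragment can be useful. The function body is repeated so the example runs on its own, and the sample string is made up.

```python
def cleaner(review):
    # same logic as the notebook's cleaner, repeated so the example is self-contained
    nonochars = ['<br', '.', ',', ';', '(', ')', ':', '!', '?', '/>', '"']
    for nono in nonochars:
        review = review.replace(nono, '')
    review = review.replace('-', '').replace('  ', ' ')
    return review.lower().split()

sample = 'A well-done film!<br /><br />It\'s "great".'
print(cleaner(sample))
```

This gives `['a', 'welldone', 'film', "it's", 'great']`: hyphenated words are merged into one token and apostrophes are kept, so "well-done" and "well done" are counted differently.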

## Presidential Debates

For this part of the project, I will apply the Bayes classifier to the presidential debate transcript from 2016. Here I will input a sample phrase into the method and it will return which speaker was most likely to say that phrase.

In [4]:
transcript_file = 'debate_transcripts/6_September 26, 2016 Debate Transcript.html'

# open up the html file and use BeautifulSoup to make the string easy to work with
with open(transcript_file, encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
s = soup.text

# lists of punctuation and audience words that we want to ignore
punc = ',.;:!?"'
otherbadwords = ['--','[applause]','[inaudible]','[laughter]','[crosstalk]']

# loop through the punctuation you want to ignore and replace them with nothing
for p in punc: 
    s = s.replace(p,'')
    
# replace dashes with spaces
s = s.replace('—',' ')

# remove the audience cue words from the string s
for w in otherbadwords: 
    s = s.replace(w,'')
    
speakers = ['HOLT','CLINTON','TRUMP']

# add a colon after each speaker name so the start of a turn is easy to identify
for sp in speakers: 
    s = s.replace(sp,sp+':')  

tags = [ sp.lower()+':' for sp in speakers ]

s = s.lower()

# break the transcript up into a list of clean and lowercase words
words = s.split()
words = words[35:]

holts_words = [] # holt list of words
clintons_words = [] # clinton list of words
trumps_words = [] # trump list of words

# based on what the last tag was, adds each word to the list of the last speaker
for w in words: 
    if w == tags[0]: #speaker is Holt
        current_speaker = holts_words
    elif w == tags[1]: #speaker is Clinton
        current_speaker = clintons_words
    elif w == tags[2]: #speaker is Trump
        current_speaker = trumps_words
        
    else:
        current_speaker.append(w)

# counts the number of words each speaker said and puts them into a dictionary        
c_holt = Counter(holts_words)
c_clinton = Counter(clintons_words)
c_trump = Counter(trumps_words)

# take the three dictionaries and turn them into a DataFrame
d = {'Holt': c_holt, 'Clinton': c_clinton, 'Trump': c_trump}

# create the DataFrame
df_debate = pd.DataFrame(d)

# replaces instances of NaN with 0
df_debate = df_debate.fillna(0)

# sort the DataFrame by the frequency of the words
df_debate = df_debate.sort_values(by=['Clinton','Trump'],ascending = False)
df_debate[:10]

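| word | Holt | Clinton | Trump |
|------|------|---------|-------|
| the | 96.0 | 253.0 | 295.0 |
| to | 80.0 | 239.0 | 257.0 |
| and | 44.0 | 206.0 | 289.0 |
| that | 22.0 | 147.0 | 168.0 |
| i | 15.0 | 141.0 | 238.0 |
| of | 39.0 | 135.0 | 171.0 |
| we | 22.0 | 131.0 | 126.0 |
| a | 25.0 | 121.0 | 171.0 |
| in | 24.0 | 103.0 | 110.0 |
| have | 27.0 | 84.0 | 147.0 |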


In [5]:
print(s[:600]) # what the first 600 characters of the transcript look like


september 26 2016 debate transcript
presidential debate at hofstra university in hempstead new york
september 26 2016
participants
former secretary of state hillary clinton (d) and
businessman donald trump (r)
moderator
lester holt (nbc news)
holt: good evening from hofstra university in hempstead new york i’m lester holt anchor of “nbc nightly news” i want to welcome you to the first presidential debate
the participants tonight are donald trump and hillary clinton this debate is sponsored by the commission on presidential debates a nonpartisan nonprofit organization the commission drafted to


In [6]:
dc = df_debate + 1 # add-one smoothing: no word keeps a zero count, which would make log10 return -inf
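The reason for adding 1 is easiest to see on a tiny example. The counts below are hypothetical and the helper `word_prob` is introduced only for this sketch; it mirrors the idea of adding 1 to every cell before dividing by the column total.

```python
import math
from collections import Counter

# Hypothetical word counts for two speakers (not taken from the transcript).
raw = {"A": Counter({"jobs": 3, "taxes": 1}), "B": Counter({"jobs": 2})}
vocab = sorted(set().union(*raw.values()))

def word_prob(counts, word, smooth):
    # with smoothing, every vocabulary word gets one extra count
    add = 1 if smooth else 0
    total = sum(counts.values()) + add * len(vocab)
    return (counts[word] + add) / total

print(word_prob(raw["B"], "taxes", smooth=False))  # 0.0, and the log of 0 is undefined
print(word_prob(raw["B"], "taxes", smooth=True))   # 0.25
```

Without smoothing, a single word that one speaker never used would rule that speaker out entirely, no matter what the other words say.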

In [7]:
words = list(dc.index.values) # create a list of one of each of the words spoken

In [8]:
def deb_probs(phrase):
    
    # use the cleaner method to clean up the phrase you want to analyze
    phrase  = cleaner(phrase)
    
    # total number of words in the dataframe
    tot = np.array(dc.sum()).sum()
     
    # start prod with each speaker's share of all (smoothed) word counts, i.e. the prior
    prod = np.ones(3)
    prod[0] = dc['Holt'].sum()/(tot)
    prod[1] = dc['Clinton'].sum()/(tot)
    prod[2] = dc['Trump'].sum()/(tot)
    
    # work in log10 space so that products of small probabilities become sums
    prod = np.log10(prod)

    # loop through the words in the input phrase
    for word in phrase:
        
        # lowercase the word, then add the log probability of it under each speaker to prod
        word = word.lower()
        if word not in stop_words:
            if word in words:
                w = dc.loc[word]
                p = w/dc.sum()

                prod += np.log10(np.array(p))    
    
    print(prod)
    
    # returns the name of the most likely speaker, i.e. the one with the least negative log score
    return ['Holt','Clinton','Trump'][np.argmax(prod)]

To test whether the classifier is working, I copied Holt's opening statement to see if the classifier can confirm that Holt said it.

In [9]:
phrase = "good evening from hofstra university in hempstead new york i’m lester holt anchor of “nbc nightly news” i want to welcome you to the first presidential debate the participants tonight are donald trump and hillary clinton this debate is sponsored by the commission on presidential debates a nonpartisan nonprofit organization the commission drafted tonight’s format and the rules have been agreed to by the campaigns"
deb_probs(phrase)

[-120.60631291 -142.63831965 -146.30524956]


'Holt'

As you can see, the least negative score is the first one in the array, which is the score for Holt.

Now that the classifier seems to work, you can make up your own phrases and see which speaker the classifier assigns them to. The first thing I wanted to try was "Make America Great Again," as that is Trump's signature slogan.

In [10]:
phrase = "Make America Great Again"
deb_probs(phrase)

[-14.35378878 -12.60623401 -13.39703798]


'Clinton'

You might expect the classifier to output Trump, but it works word by word, so in this transcript the individual words must be relatively more typical of Clinton than of Trump. Another phrase I was curious about was "I hate America. Do not vote me for president", to see how it would classify the opposite of what you want to say at a presidential debate.

In [11]:
phrase = "I hate America. Do not vote me for president"
deb_probs(phrase)

[-10.12477284  -9.99597476 -10.8922474 ]


'Clinton'
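The printed scores are base-10 log scores, so it can be hard to tell how decisive a result is. One way to read them, sketched below, is to rescale them so they sum to 1. The `normalize` helper is introduced here for illustration; the input numbers are the ones printed above.

```python
# Log10 scores printed above for "I hate America. Do not vote me for president"
speakers = ["Holt", "Clinton", "Trump"]
log_scores = [-10.12477284, -9.99597476, -10.8922474]

def normalize(log10_scores):
    # subtract the maximum first so the powers of ten stay in a safe range
    top = max(log10_scores)
    weights = [10 ** (s - top) for s in log10_scores]
    total = sum(weights)
    return [w / total for w in weights]

for name, p in zip(speakers, normalize(log_scores)):
    print(f"{name}: {p:.2f}")
```

Read this way, Clinton gets a little over half of the weight and Holt is not far behind, so the classifier's preference for this phrase is fairly weak.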

Interestingly, it classified that as something Clinton would say, though the three scores are close, so this is a weak preference rather than strong evidence. Next, by removing the stopword check from the function, you can see how the results change when stopwords are included in the classification.

In [12]:
def deb_probs_with_stopwords(phrase):
    
    # use the cleaner method to clean up the phrase you want to analyze
    phrase  = cleaner(phrase)
    
    # total number of words in the dataframe
    tot = np.array(dc.sum()).sum()
     
    # start prod with each speaker's share of all (smoothed) word counts, i.e. the prior
    prod = np.ones(3)
    prod[0] = dc['Holt'].sum()/(tot)
    prod[1] = dc['Clinton'].sum()/(tot)
    prod[2] = dc['Trump'].sum()/(tot)
    
    # work in log10 space so that products of small probabilities become sums
    prod = np.log10(prod)

    # loop through the words in the input phrase
    for word in phrase:
        
        # lowercase the word, then add the log probability of it under each speaker to prod
        word = word.lower()
        if word in words:
            w = dc.loc[word]
            p = w/dc.sum()

            prod += np.log10(np.array(p))    
    
    print(prod)
    
    # returns the name of the most likely speaker, i.e. the one with the least negative log score
    return ['Holt','Clinton','Trump'][np.argmax(prod)]
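The two debate functions differ only in the stopword check, so one option is a single function with a flag. The sketch below uses plain dictionaries and made-up counts rather than the notebook's DataFrame; `score`, `skip_stopwords`, `stop_set`, and `counts` are all hypothetical names.

```python
import math

# Hypothetical smoothed counts per speaker (illustrative only).
counts = {
    "Holt":    {"the": 5, "debate": 4, "jobs": 1},
    "Clinton": {"the": 9, "debate": 1, "jobs": 3},
}
stop_set = {"the", "a", "and"}

def score(phrase, skip_stopwords=True):
    grand_total = sum(sum(c.values()) for c in counts.values())
    scores = {}
    for speaker, c in counts.items():
        total = sum(c.values())
        s = math.log10(total / grand_total)      # log prior
        for word in phrase.lower().split():
            if skip_stopwords and word in stop_set:
                continue                         # ignore stopwords when asked to
            if word in c:
                s += math.log10(c[word] / total)
        scores[speaker] = s
    return max(scores, key=scores.get), scores

print(score("the debate", skip_stopwords=True))
print(score("the debate", skip_stopwords=False))
```

Calling the same function twice with different flag values makes it harder to accidentally compare a phrase against the wrong variant.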

Now we can rerun the earlier phrases and see whether the results vary.

In [13]:
phrase = "good evening from hofstra university in hempstead new york i’m lester holt anchor of “nbc nightly news” i want to welcome you to the first presidential debate the participants tonight are donald trump and hillary clinton this debate is sponsored by the commission on presidential debates a nonpartisan nonprofit organization the commission drafted tonight’s format and the rules have been agreed to by the campaigns"
print(deb_probs_with_stopwords(phrase))
phrase = "Make America Great Again"
print(deb_probs(phrase))
phrase = "Make America Great Again"
print(deb_probs(phrase))

[-175.4875487  -193.90588002 -196.61046519]
Holt
[-14.35378878 -12.60623401 -13.39703798]
Clinton
[-14.35378878 -12.60623401 -13.39703798]
Clinton


In the cell above, only Holt's opening statement is actually passed to `deb_probs_with_stopwords`; the other two calls reuse `deb_probs` on "Make America Great Again", so their output is identical by construction. For the opening statement, the scores change when stopwords are included but the predicted speaker is still Holt. The outcome might differ on a larger corpus, for example debate transcripts from the last 50 or so years. It is also worth noting that "i" and "we" are among the most common words: the candidates are answering questions about themselves, and they often talk about America as a nation.

In [14]:
df_debate.head(10)

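| word | Holt | Clinton | Trump |
|------|------|---------|-------|
| the | 96.0 | 253.0 | 295.0 |
| to | 80.0 | 239.0 | 257.0 |
| and | 44.0 | 206.0 | 289.0 |
| that | 22.0 | 147.0 | 168.0 |
| i | 15.0 | 141.0 | 238.0 |
| of | 39.0 | 135.0 | 171.0 |
| we | 22.0 | 131.0 | 126.0 |
| a | 25.0 | 121.0 | 171.0 |
| in | 24.0 | 103.0 | 110.0 |
| have | 27.0 | 84.0 | 147.0 |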


## Movie Reviews

Now I will apply the Bayes classifier to a set of 25,000 movie reviews. Here I will check the classifier's accuracy on a sample of the reviews and also make up my own mini-reviews as examples of what can throw off the classifier.

In [15]:
movies = pd.read_csv("movie_reviews.zip") # extract the dataframe of movie reviews from the zip file
movies[:5]

| | review | sentiment |
|---|--------|-----------|
| 0 | This film is absolutely awful, but nevertheles... | negative |
| 1 | Well since seeing part's 1 through 3 I can hon... | negative |
| 2 | I got to see this film at a preview and was da... | positive |
| 3 | This adaptation positively butchers a classic ... | negative |
| 4 | Råzone is an awful movie! It is so simple. It ... | negative |


In [16]:
X_train, X_test, y_train, y_test = train_test_split(movies['review'], 
                                                    movies['sentiment'], 
                                                    test_size=0.01, 
                                                    random_state=1)
# create empty strings for the negative and positive reviews
big_string_neg = ''
big_string_pos = ''

# loop over the first len(X_train) rows of movies
# note: this indexes movies by position, not the shuffled X_train split
for x in tqdm(range(len(X_train))):
    
    # if the current review is negative, append the review followed by a space
    if movies.iloc[x,1] == 'negative':
        big_string_neg += movies.iloc[x,0]
        big_string_neg += ' '
    else:
        
        # same process but for positive reviews
        big_string_pos += movies.iloc[x,0]
        big_string_pos += ' '
        
# clean up the large strings with the cleaner method
big_string_neg = cleaner(big_string_neg)
big_string_pos = cleaner(big_string_pos)

# apply the counters on both of the big strings for dictionaries of word counts
c_n = Counter(big_string_neg)
c_p = Counter(big_string_pos)

# turn the counters into a dictionary of dictionaries
d = {'negative':c_n,'positive':c_p}

# turn the dictionary d in to dataframe df
df = pd.DataFrame(d)

# replace any instance of NaN with 0
df = df.fillna(0)

df.sort_values(by=['positive','negative'],ascending = False)

100%|████████████████████████████████████████████████████████████████████████| 24750/24750 [02:46<00:00, 148.94it/s]


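| word | negative | positive |
|------|----------|----------|
| the | 160622.0 | 170128.0 |
| and | 72816.0 | 87905.0 |
| a | 78039.0 | 82217.0 |
| of | 68092.0 | 75898.0 |
| to | 68035.0 | 65828.0 |
| ... | ... | ... |
| eastcoast | 1.0 | 0.0 |
| leathermen | 1.0 | 0.0 |
| tapers | 1.0 | 0.0 |
| 'torched' | 1.0 | 0.0 |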
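The loop in the cell above reads `movies.iloc[x, ...]`, which takes the first `len(X_train)` rows of `movies` in file order rather than the shuffled rows returned by `train_test_split`. A minimal sketch of counting words from the training split only is shown below. It uses a made-up four-review data set and a standard-library shuffle in place of `train_test_split`, so every name in it is an assumption; with the notebook's variables, the equivalent loop would be `for review, sentiment in zip(X_train, y_train):`.

```python
import random
from collections import Counter

# Hypothetical mini data set of (review, sentiment) pairs.
data = [("great fun", "positive"), ("dull plot", "negative"),
        ("great cast", "positive"), ("dull and slow", "negative")]

# Shuffle once with a fixed seed, then hold out the last item for testing.
rng = random.Random(1)
shuffled = data[:]
rng.shuffle(shuffled)
train, test = shuffled[:-1], shuffled[-1:]

# Count words from the training rows only, so test rows never leak in.
counts = {"negative": Counter(), "positive": Counter()}
for review, sentiment in train:
    counts[sentiment].update(review.lower().split())

print(len(train), len(test))
```

Keeping the test rows out of the counts is what makes a later accuracy figure an estimate of performance on unseen reviews.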


In [17]:
X_train, X_test, y_train, y_test = train_test_split(movies['review'], 
                                                    movies['sentiment'], 
                                                    test_size=0.9, 
                                                    random_state=1)
# create empty strings for the negative and positive reviews
big_string_neg_small = ''
big_string_pos_small = ''

# loop over the first len(X_train) rows of movies (2,500 rows with test_size=0.9)
# note: this indexes movies by position, not the shuffled X_train split
for x in tqdm(range(len(X_train))):
    
    # if the current review is negative, append the review followed by a space
    if movies.iloc[x,1] == 'negative':
        big_string_neg_small += movies.iloc[x,0]
        big_string_neg_small += ' '
    else:
        
        # same process but for positive reviews
        big_string_pos_small += movies.iloc[x,0]
        big_string_pos_small += ' '
        
# clean up the large strings with the cleaner method
big_string_neg_small = cleaner(big_string_neg_small)
big_string_pos_small = cleaner(big_string_pos_small)

# apply the counters on both of the big strings for dictionaries of word counts
c_n_small = Counter(big_string_neg_small)
c_p_small = Counter(big_string_pos_small)

# turn the counters into a dictionary of dictionaries
d_small = {'negative':c_n_small,'positive':c_p_small}

# turn the dictionary d_small into the dataframe df_small
df_small = pd.DataFrame(d_small)

# replace any instance of NaN with 0
df_small = df_small.fillna(0)

df_small.sort_values(by=['positive','negative'],ascending = False)

100%|█████████████████████████████████████████████████████████████████████████| 2500/2500 [00:01<00:00, 1573.14it/s]


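| word | negative | positive |
|------|----------|----------|
| the | 16251.0 | 16806.0 |
| and | 7579.0 | 8704.0 |
| a | 7865.0 | 8219.0 |
| of | 6913.0 | 7504.0 |
| to | 6805.0 | 6590.0 |
| ... | ... | ... |
| reputations | 1.0 | 0.0 |
| arabella | 1.0 | 0.0 |
| cinderella' | 1.0 | 0.0 |
| experimentalism | 1.0 | 0.0 |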


In [18]:
wc = df + 1 # add-one smoothing so no word has zero probability in either class
wc_small = df_small + 1

In [19]:
def rev_probs(review): # determines if the input string is either a negative or a positive review
    
    # clean up the input review
    review  = cleaner(review)
    
    # assign tot to the total number of words
    tot = np.array(wc.sum()).sum()
    
    # start prod with each class's share of all (smoothed) word counts, i.e. the prior
    prod = np.ones(2)
    prod[0] = wc['negative'].sum()/(tot)
    prod[1] = wc['positive'].sum()/(tot)
    
    # take the log base 10 of each value in prod
    prod = np.log10(prod)
    
    
    for word in review:

        # lowercase the word, skip stopwords and words never seen in training
        # (wc.loc would raise a KeyError), then add its log probability per class
        word = word.lower()
        if word not in stop_words and word in wc.index:
            w = wc.loc[word]
            p = w/wc.sum()

            prod += np.log10(np.array(p))
    #print(prod)
    
    return ['negative','positive'][np.argmax(prod)]

In [20]:
def rev_probs_small(review): # determines if the input string is either a negative or a positive review
    
    # clean up the input review
    review  = cleaner(review)
    
    # assign tot to the total number of words
    tot = np.array(wc_small.sum()).sum()
    
    # start prod with each class's share of all (smoothed) word counts, i.e. the prior
    prod = np.ones(2)
    prod[0] = wc_small['negative'].sum()/(tot)
    prod[1] = wc_small['positive'].sum()/(tot)
    
    # take the log base 10 of each value in prod
    prod = np.log10(prod)
    
    
    for word in review:

        # lowercase the word, skip stopwords and words never seen in training
        # (wc_small.loc would raise a KeyError), then add its log probability per class
        word = word.lower()
        if word not in stop_words and word in wc_small.index:
            w = wc_small.loc[word]
            p = w/wc_small.sum()

            prod += np.log10(np.array(p))
    #print(prod)
    
    return ['negative','positive'][np.argmax(prod)]

In [21]:
def rev_probs_no_stopwords(review): # same as rev_probs but without the stopword filter, so every word is scored
    
    # clean up the input review
    review  = cleaner(review)
    
    # assign tot to the total number of words
    tot = np.array(wc.sum()).sum()
    
    # start prod with each class's share of all (smoothed) word counts, i.e. the prior
    prod = np.ones(2)
    prod[0] = wc['negative'].sum()/(tot)
    prod[1] = wc['positive'].sum()/(tot)
    
    # take the log base 10 of each value in prod
    prod = np.log10(prod)
    
    
    for word in review:

        # lowercase the word and add its log probability per class;
        # words never seen in training are skipped (wc.loc would raise a KeyError)
        word = word.lower()
        if word in wc.index:
            w = wc.loc[word]
            p = w/wc.sum()

            prod += np.log10(np.array(p))
    #print(prod)
    
    return ['negative','positive'][np.argmax(prod)]

In [22]:
review = """I saw this recent Woody Allen film because I\'m a fan of
his work and I make it a point to try to see everything he does, though
the reviews of this film led me to expect a disappointing effort. They were right.
This is a confused movie that can\'t decide whether it wants to be a comedy,
a romantic fantasy, or a drama about female mid-life crisis. It fails at all three.
<br /><br />Alice (Mia Farrow) is a restless middle aged woman who has married into
great wealth and leads a life of aimless luxury with her rather boring husband and
their two small children. This rather mundane plot concept is livened up with such
implausibilities as an old Chinese folk healer who makes her invisible with some magic
herbs, and the ghost of a former lover (with whom she flies over Manhattan). If these
additions sound too fantastic for you, how about something more prosaic, like an affair
with a saxophone player?<br /><br />I was never quite sure of what this mixed up muddle
was trying to say. There are only a handful of truly funny moments in the film,
and the endingis a really preposterous touch of Pollyanna.<br /><br />Rent \'Crimes and
Misdemeanors\' instead, a superbly well-done film that suceeds in combining comedy with
a serious consideration of ethics and morals. Or go back to "Annie Hall" or "Manhattan"."""

In [23]:
#prod = rev_probs(movies.iloc[3,0])
prod = rev_probs(review)

print(prod)
#print(['negative','positive'][np.argmax(prod)])

negative


Now that I have checked that the classifier works on this sample negative review of Alice, I can start analyzing its performance. The first statistic I want to look at is accuracy: the number of reviews classified correctly divided by the number of reviews classified. This is a slow process, so I will get a rough estimate from the first 100 reviews.

In [24]:
num = 0
den = 100
for x in tqdm(range(100)):
    estimate = rev_probs(movies.iloc[x,0])
    if estimate == movies.iloc[x,1]:
        num += 1
print(num/den)

100%|█████████████████████████████████████████████████████████████████████████████| 100/100 [00:28<00:00,  3.53it/s]

0.93





In [25]:
num = 0
den = 100
for x in tqdm(range(100)):
    estimate = rev_probs_no_stopwords(movies.iloc[x,0])
    if estimate == movies.iloc[x,1]:
        num += 1
print(num/den)

100%|█████████████████████████████████████████████████████████████████████████████| 100/100 [00:43<00:00,  2.30it/s]

0.93





In [26]:
num = 0
den = 100
for x in tqdm(range(100)):
    estimate = rev_probs_small(movies.iloc[x,0])
    if estimate == movies.iloc[x,1]:
        num += 1
print(num/den)

100%|█████████████████████████████████████████████████████████████████████████████| 100/100 [00:14<00:00,  7.10it/s]

0.99
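The three loops above share the same shape, so they could be folded into one helper. The sketch below is a hypothetical `accuracy` function with a stand-in classifier and made-up examples. With the notebook's names, a call such as `accuracy(rev_probs, X_test, y_test)` would score the held-out split, provided the word counts were built from `X_train` only.

```python
def accuracy(classify, reviews, labels):
    # fraction of reviews whose predicted label matches the true label
    correct = sum(classify(r) == y for r, y in zip(reviews, labels))
    return correct / len(labels)

# Hypothetical stand-in classifier and held-out examples, for illustration only.
def toy_classifier(review):
    return "positive" if "great" in review.lower() else "negative"

held_out_reviews = ["Great cast", "Dull plot", "Not great at all"]
held_out_labels = ["positive", "negative", "negative"]
print(accuracy(toy_classifier, held_out_reviews, held_out_labels))
```

The stand-in gets two of the three examples right; the third ("Not great at all") fails for the same word-by-word reason discussed later in this section.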





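This means that on the first 100 reviews, the classifier built from 24,750 reviews was 93% accurate, with or without the stopword filter. The classifier built from the smaller set of 2,500 reviews (`test_size=0.9`) scored 99% on the same 100 reviews. Note that the training loops read rows from the start of `movies`, so these 100 reviews were also part of both training sets. The scores are therefore training accuracy rather than a held-out estimate, which may be why the smaller model, where each of these reviews carries more weight, does better here. Below is an example of a misclassified review: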

In [27]:
print(movies.iloc[3,0])
print()
print("This review is " + movies.iloc[3,1])
print("But it was classified as " + rev_probs(movies.iloc[3,0]))

This adaptation positively butchers a classic which is beloved for its subtlety. Timothy Dalton has absolutely no conception of the different nuances of Rochester's character. I get the feeling he never even read the book, just sauntered on set in his too tight breeches and was handed a character summary that read "Grumpy, broody, murky past." He plays Rochester not as a character or as a real person but as an over the top grouch who never cracks a smile until after he gets engaged at which point he miraculously morphs into a pansy. There is no chemistry. The only feeling that this adaptation excited in me was incredulity and also sympathy for Charlotte Bronte who is most definitely turning in her grave. GO AND REREAD THE BOOK. ROCHESTER HAS A PERSONALITY. AND BY THE WAY: A "PASSIONATE" LOVE SCENE DOES NOT MEAN YOU HAVE TO EAT HER FACE.

This review is negative
But it was classified as positive


Reading this review, I can see why it was misclassified. The data set labels it negative, but the classifier calls it positive, and several of its words, such as "smile" and "positively", are ones I would associate with a positive review even though they are used negatively here. Running the same review through the classifier built from the smaller training set gives:



In [28]:
print("This review is " + movies.iloc[3,1])
print("But it was classified as " + rev_probs_small(movies.iloc[3,0]))

This review is negative
But it was classified as negative


With the smaller training set, this review is classified correctly, although that does not by itself show that the smaller model is better in general. The misclassification by the larger model is similar to what would happen if I wrote a review saying "This movie is so poorly written and dumb that it's hilarious." This is a positive review that is very likely to be classified as negative because it uses negative words in a positive way. If I put the sample phrase into the classifier, it says that the review is:

In [29]:
rev_probs("This movie is so poorly written and dumb that it's hilarious")

'negative'

As you can see, even though I thought this comedy was hilarious, it was classified as a negative review. But if I write something as simple as "This movie was good", I would expect it to be classified as positive.

In [30]:
rev_probs("This movie was good")

'negative'

Now this is a surprising result. It is similar to the case where the classifier said that Clinton is more likely to say "Make America Great Again." The classifier goes word by word, and without the stopwords this review is just "movie" and "good", so you would expect it to be positive. So next I want to see how it classifies the words "movie" and "good" on their own:

In [31]:
rev_probs("good")

'positive'

In [32]:
rev_probs("movie")

'negative'

From this I conclude that the positive evidence from "good" is outweighed by the negative evidence from "movie". This suggests that people writing positive reviews tend to use more descriptive words than "good."
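One way to picture this is to look at each word's log ratio between the two classes: a positive value pulls toward "positive", a negative value toward "negative", and the values add up across the words of a review. The counts below are invented to reproduce the effect and are not taken from the data set; `log_ratio` is a helper introduced for this sketch, and the class prior is left out for simplicity.

```python
import math

# Hypothetical smoothed counts, invented to illustrate the effect.
counts = {
    "negative": {"movie": 500, "good": 150},
    "positive": {"movie": 380, "good": 160},
}
totals = {"negative": 60_000, "positive": 62_000}  # assumed words per class

def log_ratio(word):
    p_pos = counts["positive"][word] / totals["positive"]
    p_neg = counts["negative"][word] / totals["negative"]
    return math.log10(p_pos / p_neg)   # > 0 favors positive, < 0 favors negative

for w in ("good", "movie"):
    print(w, round(log_ratio(w), 3))
print("sum", round(log_ratio("good") + log_ratio("movie"), 3))
```

With these assumed counts, "good" leans only slightly positive while "movie" leans more strongly negative, so the sum is negative.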

In [33]:
rev_probs("This movie was great")

'positive'

When "good" is changed to "great", the evidence for a positive review is stronger, so the review is classified as positive.

## Conclusions

* The most common words in any data set tend to be stopwords like "the" and "a"; once you ignore them, the remaining frequent words are more specific to the class of data you are looking at.
* Misclassified texts tend to result from the way the author uses language. If someone emphasizes the positive by using negative words, it trips up the classifier, which only counts words and cannot detect tone.
* When I wrote my own sample texts where I thought I would know the classification, the classifier would still sometimes misclassify them. The sample review "This movie was good" was classified as negative even though it is clearly a positive review. This comes down to the fact that the classifier works word by word, and some words carry enough weight on their own to tip the classification.
* The classifier built from the smaller training set (10% of the reviews, `test_size=0.9`) misclassified only one of the first 100 reviews, versus seven for the classifier built from 99% of the reviews. Because those 100 reviews were part of both training sets, this compares fit to the training data rather than accuracy on unseen reviews.

## Bibliography/References

1. Gandhi, R. (2018, May 17). Naive Bayes classifier. Medium. Retrieved April 23, 2023, from https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c 