# Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared Dropbox or from Blackboard.

In [71]:
import sqlite3
import nltk
import random
import numpy as np
import string
from collections import Counter, defaultdict
import os
from string import punctuation
from nltk.corpus import stopwords

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

In [52]:
# Verifying database file
# Change to target directory
os.chdir('C:/Users/tarad/OneDrive/Documents/USD_GRAD_SCHOOL-C/ADS509_AppliedTextMining/Module_4/Assignment_4/TextClassification_assignment4')

# Verify change and current directory
print(os.getcwd())  
# Check to see if file exists in current directory
print(os.path.isfile("2020_Conventions.db"))  

C:\Users\tarad\OneDrive\Documents\USD_GRAD_SCHOOL-C\ADS509_AppliedTextMining\Module_4\Assignment_4\TextClassification_assignment4
True


In [53]:
 # Prints the file size in bytes
print(os.path.getsize("2020_Conventions.db")) 

5664768


In [54]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

## Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" exercise. First, we'll pull in the text 
for each party and prepare it for use in Naive Bayes. 

In [55]:
# Print out tables
tables = convention_cur.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
print(tables)

[('conventions',)]


In [56]:
convention_data = []

# fill the above list up with items that are themselves lists. The 
# sublists will have two elements. The first element in the sublist
# should be the speech in a single string. The second element
# of the sublist should be the party. 

query_results = convention_cur.execute(
                            '''
                            SELECT text, party
                            FROM conventions
                            WHERE party IN ('Republican',  'Democratic')
                            ''')

for row in query_results :
    # store the results in convention_data
    convention_data.append([row[0], row[1]])  

# Check a few rows to verify
print(convention_data[:5])

[['Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Convention (DNC) Night 4 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Democratic National Convention (DNC) Night 4 Transcript Night 4 of the 2020 Democratic National Convention (DNC) on August 20. Read the full transcript of the event here. Transcribe Your Own Content  Try Rev for free  and save time transcribing, captioning, and subtitling.', 'Democratic'], ['I’m her

In [57]:
# it's a best practice to close up your DB connection when you're done
convention_db.close()
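As a hedged sketch of the same load-then-close pattern, the query and the cleanup can be wrapped together so the connection is closed even if the query fails. The helper name `load_labeled_text` and the toy rows are assumptions for illustration; only the `conventions` table with `text` and `party` columns comes from the notebook.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def load_labeled_text(db_path):
    """Return [[text, party], ...] from a `conventions` table, then close the DB."""
    query = """
        SELECT text, party
        FROM conventions
        WHERE party IN ('Republican', 'Democratic')
    """
    # closing() calls .close() on the connection even if the query raises.
    with closing(sqlite3.connect(db_path)) as db:
        return [[text, party] for text, party in db.execute(query)]


# Toy database standing in for 2020_Conventions.db (these rows are made up).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_conventions.db")
with closing(sqlite3.connect(demo_path)) as db:
    db.execute("CREATE TABLE conventions (text TEXT, party TEXT)")
    db.executemany(
        "INSERT INTO conventions VALUES (?, ?)",
        [
            ("We will vote.", "Democratic"),
            ("Law and order.", "Republican"),
            ("Hello.", "Other"),
        ],
    )
    db.commit()

demo_rows = load_labeled_text(demo_path)
print(demo_rows)
```

Rows whose party is neither 'Republican' nor 'Democratic' are filtered out by the `WHERE` clause, just as in the cell above.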

Let's look at some random entries and see if they look right. 

In [58]:
random.choices(convention_data,k=5)

[['When required by law.', 'Republican'],
 ['Joe, he believes we stand with our allies and stand up to our adversaries. Right now, we have a president who turns our tragedies into political weapons. Joe will be a president who turns our challenges into purpose. Joe will bring us together to build an economy that doesn’t leave anyone behind, where a good paying job is the floor, not the ceiling.',
  'Democratic'],
 ['Under civilian direction.', 'Republican'],
 ['This time next year, I hope that we will have a government that is accountable to us. That guarantees health care as a human right.',
  'Democratic'],
 ['Good evening, America. I’m Kimberly Guilfoyle. I speak to you tonight as a mother, a former prosecutor, a Latina, and a proud American and yes, a proud supporter of President Donald J. Trump. Why? Because he is the president who delivers for America. He built the greatest economy the world has ever known for the strivers, the working class and middle class. As commander in chie

It'll be useful for us to have a larger sample size than 2024 affords, since those speeches tend to be long and contiguous. Let's make a new list-of-lists called `conv_sent_data`. Instead of each first entry in the sublists being an entire speech, make each first entry just a sentence from the speech. Feel free to use NLTK's `sent_tokenize` [function](https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html). 

In [66]:
conv_sent_data = []

# For each speech and party in convention_data, split the speech into sentences
for speech, party in convention_data :
    sentences = nltk.tokenize.sent_tokenize(speech) # Tokenize speech to sentences 
    for sentence in sentences:
        conv_sent_data.append((sentence, party)) # Add each sentence with party label


Again, let's look at some random entries. 

In [67]:
random.choices(conv_sent_data,k=5)

[('Dr.', 'Democratic'),
 ('Everybody gets a lawyer come on over to our country.', 'Republican'),
 ('Let us also take a moment to show our profound appreciation for a man who has always fought by our side, and stood up for our values.',
  'Republican'),
 ('He put opportunity zones and the Trump tax bill that would drive investment and to our communities for decades to come.',
  'Republican'),
 ('He just wasn’t there.', 'Republican')]
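The `('Dr.', 'Democratic')` entry suggests that an abbreviation can end up as its own "sentence". One possible way to handle this, sketched here under assumptions, is to drop fragments below a minimum token count. The helper `drop_short_fragments` and the threshold `min_tokens=3` are hypothetical and are not part of the assignment.

```python
def drop_short_fragments(sent_data, min_tokens=3):
    """Keep (sentence, party) pairs whose sentence has at least `min_tokens`
    whitespace-separated tokens. The threshold is an arbitrary illustration."""
    return [(s, p) for s, p in sent_data if len(s.split()) >= min_tokens]


# Examples taken from the random samples printed in this notebook.
sample = [
    ("Dr.", "Democratic"),
    ("He just wasn’t there.", "Republican"),
    ("Under civilian direction.", "Republican"),
]
kept = drop_short_fragments(sample)
print(kept)
```

Short fragments carry little signal once stopwords are removed, but a threshold like this also discards some genuine short sentences, so it is a trade-off rather than a clear improvement.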

Now it's time for our final cleaning before modeling. Go through `conv_sent_data` and take the following steps: 

1. Tokenize on whitespace
1. Remove punctuation
1. Remove tokens that fail the `isalpha` test
1. Remove stopwords
1. Casefold to lowercase
1. Join the remaining tokens into a string
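The steps above can be sketched as a single function. This is a minimal variant under stated assumptions: punctuation is stripped from each token before the `isalpha` test, so a token like "economy," survives as "economy", and `DEMO_STOPWORDS` is a tiny made-up stand-in for NLTK's English stopword list. Stopwords are checked after lowercasing so that capitalized stopwords are caught.

```python
import string

# Tiny stand-in stopword list (an assumption for this sketch);
# the notebook uses NLTK's English stopwords instead.
DEMO_STOPWORDS = {"he", "the", "and", "a", "of"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_sentence(sentence, stop_words=DEMO_STOPWORDS):
    """Apply the six cleaning steps to one sentence and return a string."""
    cleaned = []
    for token in sentence.split():             # tokenize on whitespace
        token = token.translate(_PUNCT_TABLE)  # remove ASCII punctuation
        if not token.isalpha():                # drop non-alphabetic tokens
            continue
        token = token.lower()                  # fold to lowercase
        if token in stop_words:                # drop stopwords
            continue
        cleaned.append(token)
    return " ".join(cleaned)                   # join back into a string


example = clean_sentence("He built the greatest economy, folks!")
print(example)
```

Note that `string.punctuation` only covers ASCII, so a token containing a curly apostrophe (for example "wasn’t") still fails the `isalpha` test and is dropped.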


In [88]:
stop_words = set(stopwords.words('english'))
# Add any additional stopwords
custom_stop_words = set(['and', 'or', 'but', 'like', 'know']) 
stop_words = stop_words.union(custom_stop_words)

punctuation = set(punctuation)

clean_conv_sent_data = [] # list of tuples (sentence, party), with sentence cleaned

for idx, (sentence, party) in enumerate(conv_sent_data) :
    # Tokenize by splitting on whitespace
    words = sentence.split()

    # Remove punctuation, non-alphabetic tokens, and stop words, then lowercase the words
    cleaned_words = [word.lower() for word in words if word.isalpha() and word.lower() not in stop_words and word.lower() not in punctuation]
    
    # Join cleaned words back into a cleaned sentence
    cleaned_sentence = ' '.join(cleaned_words)

    # Append cleaned sentence and party to clean_conv_sent_data list
    clean_conv_sent_data.append((cleaned_sentence, party))

# Check for some random cleaned sentences
random.choices(clean_conv_sent_data,k=5)

[('need vote fight president sake', 'Republican'),
 ('military spouses decided start named rosie riveter campaign used recruit women workers world war',
  'Republican'),
 ('without bravery heroism men women law enforcement saved', 'Republican'),
 ('president army special operators conducted raid', 'Republican'),
 ('joe biden may claim ally comes biden wants keep us completely',
  'Republican')]

If that looks good, let's make our function to turn these into features. First we need to build our list of candidate words. I started my exploration at a cutoff of 5. 

In [89]:
word_cutoff = 5

tokens = [w for t, p in clean_conv_sent_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 1774 as features in the model.
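To see how the cutoff shapes the vocabulary, here is a small sketch that repeats the selection for several cutoffs. The helper `select_feature_words` and the toy texts are assumptions for illustration; the comparison is strict (`count > cutoff`), matching the cell above, so a word that appears exactly `cutoff` times is excluded.

```python
from collections import Counter


def select_feature_words(cleaned_texts, cutoff):
    """Return the set of words that occur more than `cutoff` times overall."""
    counts = Counter(w for text in cleaned_texts for w in text.split())
    return {w for w, c in counts.items() if c > cutoff}


# Made-up cleaned sentences: "vote" x4, "climate" x2, "economy" x1.
toy_texts = ["vote vote vote", "vote climate", "climate economy"]

sizes = {
    cutoff: len(select_feature_words(toy_texts, cutoff))
    for cutoff in (0, 1, 2, 4)
}
print(sizes)
```

In the notebook, the same loop could be run over the texts in `clean_conv_sent_data` to compare vocabulary sizes before settling on a cutoff.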


In [90]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """

    ret_dict = dict()

    # Tokenize the text (split on spaces)
    words = text.split()

    # For each word in the text add it to the dictionary
    for word in words:
        if word in fw:
            ret_dict[word] = True  # Indicating that the word appears in the text
    
    return ret_dict



In [91]:
assert(len(feature_words)>0)
assert(conv_features("obama was the president",feature_words)==
       {'obama':True,'president':True})
assert(conv_features("some people in america are citizens",feature_words)==
                     {'people':True,'america':True,"citizens":True})

In [92]:
# Testing functions by printing 
print(conv_features("obama was the president", feature_words))  # Should print {'obama': True, 'president': True}
print(conv_features("some people in america are citizens", feature_words))  # Should print {'people': True, 'america': True, 'citizens': True}

{'obama': True, 'president': True}
{'people': True, 'america': True, 'citizens': True}


Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [93]:
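# Features here are computed from the raw speech text in convention_data,
# not from the cleaned sentences in clean_conv_sent_data.
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]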

In [94]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [95]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.498
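An accuracy figure is easier to interpret next to a simple baseline. The sketch below computes the accuracy of always predicting the most common label. The helper `majority_baseline` and the toy labels are hypothetical; the real class balance of the test set is not shown in this notebook.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Made-up labels: 6 of one party and 4 of the other.
toy_labels = ["Republican"] * 6 + ["Democratic"] * 4
baseline = majority_baseline(toy_labels)
print(f"Majority-class baseline: {baseline:.3f}")
```

In the notebook this could be called as `majority_baseline([party for _, party in test_set])`. A classifier that does not beat this number has learned little from its features.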


In [96]:
classifier.show_most_informative_features(25)

Most Informative Features
             enforcement = True           Republ : Democr =     27.5 : 1.0
                   votes = True           Democr : Republ =     21.6 : 1.0
                 climate = True           Democr : Republ =     17.3 : 1.0
                 destroy = True           Republ : Democr =     17.1 : 1.0
                supports = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.9 : 1.0
                preserve = True           Republ : Democr =     15.1 : 1.0
                  signed = True           Republ : Democr =     15.1 : 1.0
              appreciate = True           Republ : Democr =     14.0 : 1.0
                freedoms = True           Republ : Democr =     14.0 : 1.0
                 private = True           Republ : Democr =     11.9 : 1.0
                  defund = True           Republ : Democr =     10.9 : 1.0
                    drug = True           Republ : Democr =     10.3 : 1.0
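The ratios in this table compare how likely a word is to be present under one party versus the other. The sketch below shows that idea on toy data. It is a simplified illustration, not NLTK's implementation: the add-0.5 smoothing, the helper `likelihood_ratios`, and the example texts are all assumptions, and NLTK's own numbers may be computed differently in detail.

```python
from collections import Counter


def likelihood_ratios(labeled_texts, smoothing=0.5):
    """Return {word: (favored_label, ratio)} for a corpus with exactly two labels.

    ratio = P(word present | favored label) / P(word present | other label),
    with additive smoothing over the two outcomes present/absent.
    """
    doc_counts = Counter(label for _, label in labeled_texts)
    label_a, label_b = sorted(doc_counts)  # raises unless there are two labels

    # For each label, count the documents in which each word appears.
    present = {label_a: Counter(), label_b: Counter()}
    for text, label in labeled_texts:
        present[label].update(set(text.split()))

    ratios = {}
    for word in set(present[label_a]) | set(present[label_b]):
        p_a = (present[label_a][word] + smoothing) / (doc_counts[label_a] + 2 * smoothing)
        p_b = (present[label_b][word] + smoothing) / (doc_counts[label_b] + 2 * smoothing)
        ratios[word] = (label_a, p_a / p_b) if p_a >= p_b else (label_b, p_b / p_a)
    return ratios


# Made-up cleaned sentences, two per party.
toy_corpus = [
    ("climate votes", "Democratic"),
    ("climate health", "Democratic"),
    ("enforcement media", "Republican"),
    ("enforcement votes", "Republican"),
]
toy_ratios = likelihood_ratios(toy_corpus)
for word, (label, ratio) in sorted(toy_ratios.items(), key=lambda kv: -kv[1][1]):
    print(f"{word:>12} = True   favors {label}   {ratio:.1f} : 1.0")
```

A word seen in both of one party's documents and none of the other's gets a large ratio, while a word seen equally often in both gets a ratio of 1.0. Large ratios built on only a handful of occurrences should be read with caution.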

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

In the Naive Bayes results, many of the most informative words reflect clear differences between the two parties. For example, words like "enforcement", "supports", and "defund" are strongly linked to the Republican party, while words like "votes", "climate", and "elect" appear more often in Democratic speeches. This suggests that the two parties emphasize distinct topics: Republicans focus on law enforcement and defunding, while Democrats tend to discuss voting and climate change. Some seemingly neutral terms, such as "crime" and "freedoms", are still notably more common in Republican speeches. The model's accuracy is 49.8%, about what guessing would achieve on two evenly split classes, so the classifier has trouble telling the parties apart on held-out data. This may reflect a large overlap in the language both parties use.


## Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [97]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [98]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming
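One detail of the query worth knowing: SQLite's `LIKE` is case-insensitive for ASCII letters by default, so `NOT LIKE '%RT%'` also removes tweets that merely contain "rt" inside a word. The sketch below demonstrates this on a toy in-memory table. The table contents and the narrower pattern `'RT @%'` are assumptions for illustration, and the toy column is `TEXT`, whereas the tweets in this notebook come back as bytes, so behavior on the real database may differ.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(":memory:")) as db:
    db.execute("CREATE TABLE tweets (tweet_text TEXT)")
    db.executemany(
        "INSERT INTO tweets VALUES (?)",
        [
            ("RT @someone: hello",),      # an actual retweet
            ("Great start to the day",),  # contains "rt" inside "start"
            ("Vote today",),
        ],
    )

    # Pattern from the query above: drops anything containing "rt" in any case.
    broad = [
        row[0]
        for row in db.execute(
            "SELECT tweet_text FROM tweets "
            "WHERE tweet_text NOT LIKE '%RT%' ORDER BY tweet_text"
        )
    ]

    # Hypothetical narrower pattern: drops only text starting with "RT @".
    narrow = [
        row[0]
        for row in db.execute(
            "SELECT tweet_text FROM tweets "
            "WHERE tweet_text NOT LIKE 'RT @%' ORDER BY tweet_text"
        )
    ]

print(broad)
print(narrow)
```

The broad filter is conservative about retweets at the cost of discarding some original tweets, which may or may not matter for a random sample of this size.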

In [99]:
tweet_data = []

# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.
# Fill up tweet_data with sublists where the tweet text is paired with the political party
for row in results:
    tweet_data.append([row[2], row[1]])  # row[2] is tweet_text, row[1] is party

# Close the DB connection
cong_db.close()

# Check a few random entries
print(random.choices(tweet_data, k=5))

[[b'Here is my statement: https://t.co/djUy4lDpRR', 'Democratic'], [b'My latest for @FCNP: Action After Parkland #NeverAgain #AssaultWeaponsBan #GVRO https://t.co/FphN5Vq5nr', 'Democratic'], [b'#TalkFamDetention A3 I have more pictures along w/ reflections from my trip up here https://t.co/jWRZlo2WMY', 'Democratic'], [b"What an honor to meet @YKleinHalevi, the keynote speaker at last night's #JCRCevent. Thank you Yossi and @JCRCMINNDAK for all you do to foster peace at home and abroad. https://t.co/WqYYNvdxbl", 'Republican'], [b"I sent the following letter with fellow Members of Congress to President-Elect Trump to reconsider Mr. Bannon's appointment to the WH. https://t.co/kKZAotEmSf", 'Democratic']]


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [100]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [101]:
for tweet, party in tweet_data_sample :

    # Preprocess the tweet: use the conv_features function to extract features
    features = conv_features(tweet, feature_words)

    # Use the classifier to estimate the party based on the features
    estimated_party = classifier.classify(features)  

    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifer says {estimated_party}.")
    print("")
    

Here's our (cleaned) tweet: b'Earlier today, I spoke on the House Floor abt protecting health care for women and praised @PPmarmonte for their work on the Central Coast. https://t.co/WqgTRzT7VV'
Actual party is Democratic and our classifer says Democratic.

Here's our (cleaned) tweet: b'Go Tribe! #RallyTogether https://t.co/0NXutFL9L5'
Actual party is Democratic and our classifer says Democratic.

Here's our (cleaned) tweet: b"Apparently, Trump thinks it's just too easy for students overwhelmed by the crushing burden of debt to pay off student loans #TrumpBudget https://t.co/ckYQO5T0Qh"
Actual party is Democratic and our classifer says Democratic.

Here's our (cleaned) tweet: b'We\xe2\x80\x99re grateful for our first responders, our rescue personnel, our firefighters, our police, and volunteers who have been working tirelessly to keep people safe, provide much-needed help, while putting their own lives on the line.\n\nhttps://t.co/eZPv0vMIz3'
Actual party is Republican and our classife
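The tweets print with a `b'...'` prefix, which means they are `bytes`. Splitting bytes yields bytes tokens, and those never compare equal to the `str` entries in `feature_words`, so no feature is found until the tweet is decoded. A minimal sketch of the difference follows; `tweet_to_text`, `presence_features`, and the demo feature words are hypothetical stand-ins, with `presence_features` mirroring the idea of `conv_features`.

```python
def tweet_to_text(raw):
    """Decode a tweet stored as bytes; pass str values through unchanged."""
    if isinstance(raw, bytes):
        return raw.decode("utf-8", errors="replace")
    return raw


def presence_features(text, feature_words):
    """Same idea as conv_features: mark each feature word present in `text`."""
    return {w: True for w in text.split() if w in feature_words}


demo_feature_words = {"health", "care", "women"}  # made-up feature set
raw_tweet = b"protecting health care for women"

# Without decoding, the tokens are bytes and match nothing.
bytes_tokens_match = any(tok in demo_feature_words for tok in raw_tweet.split())

# After decoding (and lowercasing), the feature words are found.
decoded_features = presence_features(
    tweet_to_text(raw_tweet).lower(), demo_feature_words
)

print(bytes_tokens_match)
print(decoded_features)
```

With an empty feature dictionary, a Naive Bayes classifier has little to go on beyond its class priors, so decoding (and ideally the same cleaning applied to the training text) should happen before `classify` is called.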

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [104]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    tweet, party = tp    
    # Now do the same thing as above, but we store the results rather
    # than printing them. 
    # Preprocess the tweet 
    tweet = tweet.decode('utf-8') if isinstance(tweet, bytes) else tweet
    features = conv_features(tweet, feature_words)  # Extract features
   
    # get the estimated party
    estimated_party = classifier.classify(features) 
    
    results[party][estimated_party] += 1
    
    if idx > num_to_score : 
        break

In [105]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 2985, 'Democratic': 1387}),
             'Democratic': defaultdict(int,
                         {'Republican': 3965, 'Democratic': 1665})})

### Reflections

In the results, the classifier struggles to distinguish between Republican and Democratic tweets. For Republican tweets, it correctly classified 2,985 as Republican but misclassified 1,387 as Democratic. For Democratic tweets, it misclassified a large portion as Republican (3,965) and correctly classified only 1,665 as Democratic. Overall, 4,650 of the 10,002 scored tweets were labeled correctly (about 46.5%), and roughly 69% of all tweets were assigned to the Republican class, which indicates that the classifier is biased toward predicting Republican. This could be because of overlapping vocabulary between the two parties or insufficient distinguishing features in the tweets. The classifier's overall performance suggests a need for better feature engineering or rebalancing of the training data to improve classification accuracy. 
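The counts in the `results` dictionary can be summarized as overall accuracy and per-party recall. This is a small sketch: the helper `summarize` is hypothetical, and the numbers are copied from the output above, where the first key is the actual party and the second is the estimated party.

```python
# Counts copied from the `results` output above: confusion[actual][estimated].
confusion = {
    "Republican": {"Republican": 2985, "Democratic": 1387},
    "Democratic": {"Republican": 3965, "Democratic": 1665},
}


def summarize(confusion):
    """Return total count, overall accuracy, and recall for each actual party."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[party][party] for party in confusion)
    recall = {
        party: confusion[party][party] / sum(confusion[party].values())
        for party in confusion
    }
    return {"total": total, "accuracy": correct / total, "recall": recall}


summary = summarize(confusion)
print(f"Tweets scored: {summary['total']}")
print(f"Accuracy:      {summary['accuracy']:.3f}")
for party, value in summary["recall"].items():
    print(f"Recall for {party}: {value:.3f}")
```

Recall makes the imbalance concrete: about 68% of Republican tweets are recovered, compared with about 30% of Democratic tweets.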