## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [1]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Define constants
punctuation = set(string.punctuation) - {"#"}
sw = set(stopwords.words("english"))
whitespace_pattern = re.compile(r"\s+")
unwanted_tokens = {"nan", "null"}

# Define clean_tokenize here instead of importing it
def clean_tokenize(text: str) -> str:
    """Clean and tokenize text by removing punctuation, stopwords, and unwanted tokens."""
    
    # Convert to lowercase and normalize quotes
    text = text.lower().replace("’", "'").replace("‘", "'").replace("“", '"').replace("”", '"')

    # Remove punctuation (except hashtags)
    text = "".join(ch for ch in text if ch not in punctuation)

    # Tokenize (split on whitespace)
    tokens = whitespace_pattern.split(text.strip())

    # Remove stopwords and unwanted tokens
    tokens = [token for token in tokens if token and token not in sw and token not in unwanted_tokens]

    return ' '.join(tokens)
    
#from text_functions_solutions import clean_tokenize, get_patterns
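
To see what this cleaning does to a single string, here is a minimal, self-contained sketch of the same steps. The names `demo_sw`, `demo_punct`, and `demo_clean` are hypothetical, and the tiny stopword set stands in for NLTK's English list so the example runs without downloading any corpus data.

```python
import re
import string

# Assumption: a tiny stand-in stopword list (the notebook uses NLTK's English list).
demo_sw = {"the", "is", "a", "and", "of", "to", "that"}
# Same idea as the notebook: drop all punctuation except the hashtag symbol.
demo_punct = set(string.punctuation) - {"#"}

def demo_clean(text):
    """Lowercase, strip punctuation (keeping '#'), split on whitespace, drop stopwords."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in demo_punct)
    tokens = [t for t in re.split(r"\s+", text.strip()) if t and t not in demo_sw]
    return " ".join(tokens)

print(demo_clean("The President's plan is half-completed. #Vote http://t.co/abc"))
```

Because punctuation is deleted rather than replaced with a space, hyphenated words and URLs collapse into single tokens such as `halfcompleted` and `httptcoabc`, which matches the tokens visible in the cleaned speeches and tweets in this notebook.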

DB files from Canvas:
* 2020_Conventions.db
* congressional_data.db

In [2]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

In [3]:
# Show all tables
convention_cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
print("\nTables in the database:")
print(convention_cur.fetchall())

# Show schema of the table
convention_cur.execute("PRAGMA table_info(conventions);")
print("\nTable schema:")
print(convention_cur.fetchall())

# Show a sample row
convention_cur.execute("SELECT * FROM conventions LIMIT 1;")
print("\nSample row:")
print(convention_cur.fetchall())


Tables in the database:
[('conventions',)]

Table schema:
[(0, 'party', 'TEXT', 0, None, 0), (1, 'night', 'INTEGER', 0, None, 0), (2, 'speaker', 'TEXT', 0, None, 0), (3, 'speaker_count', 'INTEGER', 0, None, 0), (4, 'time', 'TEXT', 0, None, 0), (5, 'text', 'TEXT', 0, None, 0), (6, 'text_len', 'TEXT', 0, None, 0), (7, 'file', 'TEXT', 0, None, 0)]

Sample row:
[('Democratic', 4, 'Unknown', 1, '00:00', 'Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Co

### Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text 
for each party and prepare it for use in Naive Bayes.  

In [4]:
convention_data = []

# fill this list up with items that are themselves lists. The 
# first element in the sublist should be the cleaned and tokenized
# text in a single string. The second element should be the party. 

query_results = convention_cur.execute(
                            '''
                            -- your query here
                            SELECT text, party
                            FROM conventions
                            -- Remove the party "Other". 
                            WHERE party != 'Other'
                            ''')

for row in query_results :
    # store the results in convention_data
    text, party = row
    # Clean and tokenize the speech text
    cleaned_text = clean_tokenize(text)
    convention_data.append([cleaned_text, party])

In [5]:
# it's a best practice to close up your DB connection when you're done
convention_db.close()

Let's look at some random entries and see if they look right. 

In [6]:
random.choices(convention_data,k=10)

[['freedom right good family reap blessings hard work accomplish dreams live securely help others force government goodness heart rights granted government claimed identities unalienable members human race today americas greatness challenged extreme notions defunding law enforcement lawlessness abounds hateful rhetoric telling wear work limiting free speech freedom worship old ideas socialism repackaged redefined words let us restore values made america great',
  'Republican'],
 ['gathered beautiful majestic white house known world peoples house cannot help marvel miracle great american story home larger life figures like teddy roosevelt andrew jackson rallied americans bold visions bigger brighter future within walls lived tenacious generals like president grant eisenhower led soldiers course freedom grounds thomas jefferson sent lewis clark daring expedition cross wild uncharted continent depths bloody civil war president abraham lincoln looked windows upon halfcompleted washington m

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it. 

In [7]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2385 as features in the model.
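
The cell above filters with `count > word_cutoff`, which keeps words occurring strictly more than the cutoff. The toy sketch below shows how that differs from `>=`; the token list and the cutoff of 2 are hypothetical values chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: 'vote' x3, 'climate' x2, 'china' x1.
demo_tokens = "vote vote vote climate climate china".split()
demo_counts = Counter(demo_tokens)
cutoff = 2  # hypothetical cutoff for this toy example

# Strictly more than the cutoff (what `count > word_cutoff` does).
strictly_more = {w for w, c in demo_counts.items() if c > cutoff}
# At least the cutoff (what `count >= word_cutoff` would do).
at_least = {w for w, c in demo_counts.items() if c >= cutoff}

print(sorted(strictly_more))
print(sorted(at_least))
```

With a real vocabulary the two choices can differ by many words, since a large share of word types in a corpus sit near any low cutoff.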


In [8]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    
    # Your code here
    # Split text into unique words
    words = set(text.split())

    # Create dictionary of words that appear in feature words
    ret_dict = {word: True for word in words if word in fw}
    
    return(ret_dict)

In [9]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)==
                     {'america':True,'american':True,"people":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [10]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [11]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [12]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.498
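
An accuracy figure is easier to interpret next to a majority-class baseline, that is, the accuracy of always predicting the most common label. A minimal sketch follows; `majority_baseline` and `demo_labels` are hypothetical names, not part of this notebook.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)

# Hypothetical labels, just to show the call.
demo_labels = ["Republican"] * 6 + ["Democratic"] * 4
print(majority_baseline(demo_labels))
```

In this notebook the same function could be applied to the held-out labels, for example `majority_baseline([party for _, party in test_set])`.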


In [13]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     25.8 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     14.9 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                  defund = True           Republ : Democr =     13.0 : 1.0
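
The ratios in this listing compare how likely a feature is under one label versus the other. The sketch below reproduces the idea with add-0.5 smoothing over two feature values (present or absent). That this matches NLTK's default estimator exactly is an assumption on my part, and the counts are hypothetical.

```python
def smoothed_ratio(n_with_a, n_docs_a, n_with_b, n_docs_b, alpha=0.5, bins=2):
    """Ratio of smoothed P(word present | label A) to P(word present | label B).

    alpha and bins describe additive smoothing over `bins` possible feature
    values; treating these as NLTK's defaults is an assumption.
    """
    p_a = (n_with_a + alpha) / (n_docs_a + alpha * bins)
    p_b = (n_with_b + alpha) / (n_docs_b + alpha * bins)
    return p_a / p_b

# Hypothetical counts: the word appears in 25 of 1000 documents for label A
# and in 0 of 1000 documents for label B.
ratio = smoothed_ratio(25, 1000, 0, 1000)
print(f"{ratio:.1f} : 1.0")
```

Under this kind of smoothing, a word that never appears for one party can earn a large ratio from only a few dozen occurrences for the other, so the top of the list tends to reward words that are exclusive to one side rather than words that are merely frequent.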

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

<i class="fa fa-exclamation-triangle"></i> _The most informative features reveal clear differences in party rhetoric, even though the classifier predicts party poorly: its held-out accuracy is 0.498, no better than random guessing between two classes. The words with the highest Republican ratios include "china" (25.8 : 1), "enforcement" (21.5 : 1), "destroy" (19.2 : 1), and "freedoms" (18.2 : 1), reflecting a focus on international relations, law enforcement, conservative values, and opposition. The strongest Democratic words are "votes" (23.8 : 1) and "climate" (17.8 : 1), suggesting an emphasis on voting rights and environmental issues. Of the 13 features shown, 11 lean Republican._



### Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [14]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [15]:
# Show all tables
cong_cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
print("\nTables in the database:")
print(cong_cur.fetchall())

# Show schema of the table
cong_cur.execute("PRAGMA table_info(candidate_data);")
print("\nTable schema:")
print(cong_cur.fetchall())

# Show a sample row
cong_cur.execute("SELECT * FROM candidate_data LIMIT 1;")
print("\nSample row:")
print(cong_cur.fetchall())


Tables in the database:
[('websites',), ('candidate_data',), ('tweets',)]

Table schema:
[(0, 'index', 'INTEGER', 0, None, 0), (1, 'student', 'TEXT', 0, None, 0), (2, 'state', 'TEXT', 0, None, 0), (3, 'district_num', 'TEXT', 0, None, 0), (4, 'formatted_dist_num', 'INTEGER', 0, None, 0), (5, 'abbrev', 'TEXT', 0, None, 0), (6, 'district', 'TEXT', 0, None, 0), (7, 'candidate', 'TEXT', 0, None, 0), (8, 'party', 'TEXT', 0, None, 0), (9, 'website', 'TEXT', 0, None, 0), (10, 'twitter_handle', 'TEXT', 0, None, 0), (11, 'incumbent', 'TEXT', 0, None, 0), (12, 'age', 'REAL', 0, None, 0), (13, 'gender', 'TEXT', 0, None, 0), (14, 'marital_status', 'TEXT', 0, None, 0), (15, 'white_non_hispanic', 'TEXT', 0, None, 0), (16, 'hispanic', 'TEXT', 0, None, 0), (17, 'black', 'TEXT', 0, None, 0), (18, 'partisian_lean_pvi', 'TEXT', 0, None, 0), (19, 'opposed', 'TEXT', 0, None, 0), (20, 'pct_urban', 'TEXT', 0, None, 0), (21, 'income', 'REAL', 0, None, 0), (22, 'region', 'TEXT', 0, None, 0)]

Sample row:
[(0, 

In [16]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming
cong_db.close() # Close DB connection
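
One property of the `NOT LIKE '%RT%'` filter is worth checking: with SQLite's default settings, `LIKE` is case-insensitive for ASCII letters, so the pattern matches "rt" anywhere in the text, not only the retweet marker. The sketch below demonstrates this on an in-memory database; the table `demo_tweets` and its rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo_tweets (tweet_text TEXT)")  # hypothetical table
con.executemany(
    "INSERT INTO demo_tweets VALUES (?)",
    [
        ("RT @someone: big news",),        # a retweet: intended to be dropped
        ("Proud of our party tonight",),   # contains 'rt' inside 'party'
        ("Great day in Ohio",),            # no 'rt' anywhere
    ],
)

# Same filter shape as the notebook's query, on the toy table.
kept = [
    row[0]
    for row in con.execute(
        "SELECT tweet_text FROM demo_tweets "
        "WHERE tweet_text NOT LIKE '%RT%' ORDER BY rowid"
    )
]
con.close()
print(kept)
```

Only the last row survives in this toy example, which suggests the notebook's query may also drop original tweets containing words such as "party" or "support". A stricter pattern, for instance one anchored to a leading `RT @`, would be one way to narrow the filter, though it would change the row counts reported below.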

In [17]:
print(f"Total number of results: {len(results)}")
print("Sample row format:", results[0] if results else "No results")

Total number of results: 664656
Sample row format: ('Mo Brooks', 'Republican', b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq')


In [18]:
tweet_data = []

# Now fill up tweet_data with sublists like we did on the convention speeches.
for candidate, party, tweet_text in results:
    # Convert bytes to string if needed
    if isinstance(tweet_text, bytes):
        tweet_text = tweet_text.decode('utf-8')
    cleaned_text = clean_tokenize(tweet_text) # Clean and tokenize
    if cleaned_text:  # Skip empty tweets
        tweet_data.append([cleaned_text, party])
# Note that this may take a bit of time, since we have a lot of tweets.

There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [19]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)
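
Note that `random.choices` draws with replacement, so the same item can appear more than once in the sample, while `random.sample` draws without replacement. A small sketch of the difference, using a hypothetical population of integers and a local `random.Random` instance so the global seed is untouched:

```python
import random

rng = random.Random(20201014)      # local generator; does not affect the global seed
population = list(range(20))       # hypothetical stand-in for tweet_data

with_replacement = rng.choices(population, k=10)     # duplicates are possible
without_replacement = rng.sample(population, k=10)   # all items are distinct

print(len(with_replacement), len(set(without_replacement)))
```

With hundreds of thousands of tweets and `k=10`, duplicates are unlikely either way, so the choice matters mostly for small populations.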

In [20]:

for tweet, party in tweet_data_sample :
    # Convert tweet to features using conv_features
    tweet_features = conv_features(tweet, feature_words)
    estimated_party = classifier.classify(tweet_features)
    # Fill in the right-hand side above with code that estimates the actual party
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")
    

Here's our (cleaned) tweet: enjoyed meeting giev kashkooli ufw discuss concerns farmworkers #centralcoast httpstcolngptgvtt9
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: good memories yet another outstanding labor day parade #ohio #labor #labordayweekend httpstcou8ads1k1qx
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: apparently trump thinks easy students overwhelmed crushing burden debt pay student loans #trumpbudget httpstcockyqo5t0qh
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: think learn tax dollars fraudulently used fund union activities httptcobtj6etf29h
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: lets send message liberal democrats win keep america great raise upset get 😜lets keep donations coming thank 🇺🇸 httpstcolmqixg4vzk
Actual party is Republican and our classifier says Republican.

Here's our

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [21]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    tweet, party = tp    
    # Now do the same thing as above, but we store the results rather
    # than printing them. 
    tweet_features = conv_features(tweet, feature_words) # Convert tweet to features
   
    # get the estimated party
    estimated_party = classifier.classify(tweet_features)
    
    results[party][estimated_party] += 1
    
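    # Note: this check runs after the tweet is scored, so the loop
    # scores num_to_score + 2 tweets (idx 0 through num_to_score + 1).
    if idx > num_to_score : 
        break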
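
If the goal is to score exactly `num_to_score` tweets, one option is to slice the iterable with `itertools.islice` instead of breaking on the index. A minimal sketch, where `score_first_n`, `demo_items`, and `demo_classify` are hypothetical stand-ins rather than names from this notebook:

```python
from itertools import islice

def score_first_n(items, n, classify):
    """Tally (actual, predicted) pairs for exactly the first n items."""
    counts = {}
    for text, actual in islice(items, n):
        predicted = classify(text)
        counts[(actual, predicted)] = counts.get((actual, predicted), 0) + 1
    return counts

# Hypothetical data and a dummy classifier that always answers 'Republican'.
demo_items = [
    ("tax cuts", "Republican"),
    ("climate action", "Democratic"),
    ("border security", "Republican"),
]
demo_classify = lambda text: "Republican"

demo_counts = score_first_n(demo_items, 2, demo_classify)
print(sum(demo_counts.values()))
```

In the notebook, `classify` would wrap the feature extraction and the trained model, for example `lambda t: classifier.classify(conv_features(t, feature_words))`.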

In [22]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 3719, 'Democratic': 566}),
             'Democratic': defaultdict(int,
                         {'Republican': 4774, 'Democratic': 943})})

In [23]:
# Confusion matrix for results
print("\nConfusion Matrix:")
print("Actual Party (rows) vs Predicted Party (columns)")
print("\t\tPredicted")
print("\t\tDem\tRep")
print("Actual")
print(f"Dem\t\t{results['Democratic']['Democratic']}\t{results['Democratic']['Republican']}")
print(f"Rep\t\t{results['Republican']['Democratic']}\t{results['Republican']['Republican']}")

# Calculate accuracy metrics
total = sum(results[p1][p2] for p1 in parties for p2 in parties)
correct = sum(results[p][p] for p in parties)
accuracy = correct/total if total > 0 else 0

print(f"\nOverall accuracy: {accuracy:.3f}")

# Calculate per-party metrics
for party in parties:
    party_total = sum(results[party].values())
    party_correct = results[party][party]
    party_accuracy = party_correct/party_total if party_total > 0 else 0
    print(f"{party} accuracy: {party_accuracy:.3f} ({party_correct}/{party_total})")


Confusion Matrix:
Actual Party (rows) vs Predicted Party (columns)
		Predicted
		Dem	Rep
Actual
Dem		943	4774
Rep		566	3719

Overall accuracy: 0.466
Republican accuracy: 0.868 (3719/4285)
Democratic accuracy: 0.165 (943/5717)
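
The per-party accuracies above are recall values: the share of each party's tweets that were labeled correctly. Precision, the share of each predicted label that was right, gives the complementary view. The sketch below computes both from the counts in the confusion matrix; `precision_recall` is a hypothetical helper.

```python
# Counts copied from the confusion matrix above: confusion[actual][predicted].
confusion = {
    "Democratic": {"Democratic": 943, "Republican": 4774},
    "Republican": {"Democratic": 566, "Republican": 3719},
}

def precision_recall(confusion, label):
    """Precision and recall for one label from a nested actual -> predicted dict."""
    true_pos = confusion[label][label]
    predicted = sum(row[label] for row in confusion.values())  # column total
    actual = sum(confusion[label].values())                    # row total
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

for label in confusion:
    p, r = precision_recall(confusion, label)
    print(f"{label}: precision={p:.3f}, recall={r:.3f}")
```

High Republican recall paired with low Republican precision is the signature of a model that predicts "Republican" for most inputs.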


### Reflections

<i class="fa fa-exclamation-triangle"></i> _The overall accuracy is low at 46.6% (4,662 of 10,002 tweets), which is worse than a 50/50 random guess and well below the 57.2% that always guessing Democratic would achieve on this sample. The classifier labels 86.8% of Republican tweets correctly but only 16.5% of Democratic tweets, because it assigns the Republican label to about 85% of all tweets (8,493 of 10,002). This strong bias suggests that features learned from the convention speeches transfer poorly to tweets. Either the training data or the classification method needs to change before the classifier can reliably distinguish the two parties._ 

#### References

Albrecht, J., Ramachandran, S., & Winkler, C. (2020). Blueprints for text analytics using Python. O'Reilly. 

OpenAI. (2025). ChatGPT (Version 4.0) [AI model]. OpenAI. https://openai.com/chatgpt