# Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared Dropbox or from Blackboard.

In [17]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict

from string import punctuation

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

In [16]:
convention_db = sqlite3.connect("convention_speeches.db")
convention_cur = convention_db.cursor()

In [15]:
# Imports
import os
import sqlite3
import random
import numpy as np
from collections import Counter, defaultdict
import nltk
from string import punctuation


# Reproducibility
random.seed(509)
np.random.seed(509)

try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    nltk.download("punkt")

try:
    nltk.data.find("corpora/stopwords")
except LookupError:
    nltk.download("stopwords")

# Open the conventions DB 
db_candidates = ["convention_speeches.db", "2020_Conventions.db"]
db_path = next((p for p in db_candidates if os.path.exists(p)), None)
if db_path is None:
    raise FileNotFoundError(
        "Could not find convention DB. Put 'convention_speeches.db' or "
        "'2020_Conventions.db' in the current folder."
    )

convention_db = sqlite3.connect(db_path)
convention_cur = convention_db.cursor()

# Quick sanity check
try:
    
    convention_cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
    print("Connected to:", db_path)
    print("Tables:", [t[0] for t in convention_cur.fetchall()])
except Exception as e:
    print("Connected, but couldn’t list tables:", e)

Connected to: convention_speeches.db
Tables: []
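
One possible explanation for the empty table list: `sqlite3.connect` creates a new, empty database file when the path does not exist, so an earlier connection to `convention_speeches.db` may have left behind an empty file that the candidate search then found first. A minimal sketch of a guard against that, assuming a simple relative or POSIX-style path; the helper name `open_existing_db` is hypothetical and not part of the notebook:

```python
import os
import sqlite3
import tempfile


def open_existing_db(path):
    """Open an SQLite file read-only; raise if the file does not exist."""
    # mode=ro asks SQLite not to create the file, unlike plain connect(path).
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


# Demonstration with a path that is known to be missing.
missing = os.path.join(tempfile.mkdtemp(), "not_there.db")
try:
    open_existing_db(missing)
    opened = True
except sqlite3.OperationalError:
    opened = False  # no empty file was created as a side effect

print("Opened missing DB?", opened)
```

With a guard like this, a wrong filename fails loudly instead of silently producing a database with no tables.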


## Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" exercise. First, we'll pull in the text 
for each party and prepare it for use in Naive Bayes. 

In [13]:
import os, sqlite3

# paths
CONV_DB_PATH = "2020_Conventions.db"      
CONG_DB_PATH = "congressional_data.db"    

assert os.path.exists(CONV_DB_PATH), f"Missing {CONV_DB_PATH}"
assert os.path.exists(CONG_DB_PATH), f"Missing {CONG_DB_PATH}"

# open convention DB
conv_db = sqlite3.connect(CONV_DB_PATH)
conv_cur = conv_db.cursor()

# open congressional DB
cong_db = sqlite3.connect(CONG_DB_PATH)
cong_cur = cong_db.cursor()

def list_tables(cur):
    cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
    return [t[0] for t in cur.fetchall()]

print("Conventions DB tables:", list_tables(conv_cur))
print("Congressional DB tables:", list_tables(cong_cur))

Conventions DB tables: ['conventions']
Congressional DB tables: ['websites', 'candidate_data', 'tweets']


In [26]:
import sqlite3

# Reopen the convention DB
conv_db = sqlite3.connect(CONV_DB_PATH)
convention_cur = conv_db.cursor()

# Show tables 
convention_cur.execute("SELECT name, type, sql FROM sqlite_master")
for row in convention_cur.fetchall():
    print(row)

('conventions', 'table', 'CREATE TABLE conventions (\n    party TEXT, \n    night INTEGER, \n    speaker TEXT,\n    speaker_count INTEGER,\n    time TEXT, \n    text TEXT,\n    text_len TEXT, \n    file TEXT)')


In [27]:
convention_data = []

query_results = convention_cur.execute(
    """
    SELECT text, party
    FROM conventions
    WHERE party != 'Other'
    """
)

for row in query_results:
    speech, party = row
    convention_data.append([speech, party])

print(f"Loaded {len(convention_data)} speeches")
print(convention_data[:2])

Loaded 2541 speeches
[['Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Convention (DNC) Night 4 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Democratic National Convention (DNC) Night 4 Transcript Night 4 of the 2020 Democratic National Convention (DNC) on August 20. Read the full transcript of the event here. Transcribe Your Own Content  Try Rev for free  and save time transcribing, captioning, and subtitling.', 'De

In [24]:
# it's a best practice to close up your DB connection when you're done
convention_db.close()

Let's look at some random entries and see if they look right. 

In [28]:
random.choices(convention_data,k=5)

[['Giving up on the Affordable Care Act would have meant leaving 20 million without coverage out in the cold. But Joe Biden wasn’t about to give up, because he knew what it was like to stand in their shoes. He was sworn into the Senate next to a hospital bed. His wife and daughter had been killed in a car crash. And lying in that bed were his two sons. 40 years later, one of those little boys, his son, Beau was diagnosed with cancer, and given only months to live. It’s hard to imagine a greater grief than losing your child. But Joe always knew that his family was one of the lucky ones. After that accident, his son got 40 more years of life, all because he had healthcare.',
  'Democratic'],
 ['And focuses on protecting our children.', 'Democratic'],
 ['We will rescue kids from failing schools by helping their parents send them to a safe school of their choice. We will completely rebuild our depleted military. The countries that we are protecting at a massive cost to us will be asked to 

It'll be useful for us to have a larger sample size than 2020 affords, since those speeches tend to be long and contiguous. Let's make a new list-of-lists called `conv_sent_data`. Instead of each first entry in the sublists being an entire speech, make each first entry just a sentence from the speech. Feel free to use NLTK's `sent_tokenize` [function](https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html).

In [34]:
import nltk
nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tanya\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\tanya\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [35]:
from nltk.tokenize import sent_tokenize

conv_sent_data = []

for speech, party in convention_data:
    # Break each speech into sentences
    sentences = sent_tokenize(speech)
    for sent in sentences:
        conv_sent_data.append([sent, party])

print(f"Loaded {len(conv_sent_data)} sentences")
print(conv_sent_data[:5])  

Loaded 10740 sentences
[['Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Convention (DNC) Night 4 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Democratic National Convention (DNC) Night 4 Transcript Night 4 of the 2020 Democratic National Convention (DNC) on August 20.', 'Democratic'], ['Read the full transcript of the event here.', 'Democratic'], ['Transcribe Your Own Content  Try Rev for free  and save time transcr

Again, let's look at some random entries. 

In [39]:
random.choices(conv_sent_data,k=5)

[['90% of Americans support common sense gun laws, because we need to do more to address the epidemic of gun violence.',
  'Democratic'],
 ['This what our President Donald Trump did for me, and for that, I will be forever grateful.',
  'Republican'],
 ['So what did you think about Kamala Harris’s speech last night.',
  'Democratic'],
 ['In America, we have not yet experienced physical persecution, even though the left has tried to silence us.',
  'Republican'],
 ['That same night, Donald Trump came to the hospital along with first lady, Melania Trump.',
  'Republican']]

Now it's time for our final cleaning before modeling. Go through `conv_sent_data` and take the following steps: 

1. Tokenize on whitespace
1. Remove punctuation
1. Remove tokens that fail the `isalpha` test
1. Remove stopwords
1. Casefold to lowercase
1. Join the remaining tokens into a string
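
The six steps above can be sketched as a single helper. This is an illustration under stated assumptions rather than the notebook's own implementation: the name `clean_sentence` is hypothetical, and the tiny `STOPWORDS` set is a stand-in for a fuller list such as `nltk.corpus.stopwords.words("english")`.

```python
from string import punctuation

# Hypothetical stand-in for a real stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "we", "our", "will"}

# Translation table that deletes ASCII punctuation plus common curly quotes.
PUNCT_TABLE = str.maketrans("", "", punctuation + "’‘“”")


def clean_sentence(sentence, stopwords=STOPWORDS):
    """Apply the six cleaning steps and return a single cleaned string."""
    tokens = sentence.split()                                       # 1. tokenize on whitespace
    tokens = [t.translate(PUNCT_TABLE) for t in tokens]             # 2. remove punctuation
    tokens = [t for t in tokens if t.isalpha()]                     # 3. drop non-alphabetic tokens
    tokens = [t for t in tokens if t.casefold() not in stopwords]   # 4. remove stopwords
    tokens = [t.casefold() for t in tokens]                         # 5. casefold
    return " ".join(tokens)                                         # 6. join into a string


print(clean_sentence("We will rebuild the economy, and protect 20 million jobs!"))
```

The stopword check compares the casefolded token so that capitalized words such as "We" are still caught before step 5 runs.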


In [40]:
import re
from string import punctuation

clean_conv_sent_data = []  

for idx, (sentence, party) in enumerate(conv_sent_data):
    # lowercase
    sent = sentence.lower()
    
    # remove punctuation 
    sent = re.sub(f"[{re.escape(punctuation)}]", "", sent)
    
    # split on whitespace into tokens
    tokens = sent.split()
    
    # save cleaned sentence tokens with party
    clean_conv_sent_data.append((tokens, party))

# samples
random.choices(clean_conv_sent_data, k=5)

[(['he',
   'faced',
   'wars',
   'without',
   'end',
   'in',
   'sight',
   'creation',
   'of',
   'failed',
   'states',
   'like',
   'libya',
   'and',
   'syria',
   'a',
   'pass',
   'that',
   'allowed',
   'a',
   'terrorist',
   'caliphate',
   'to',
   'grow',
   'and',
   'leadership',
   'in',
   'washington',
   'that',
   'allowed',
   'our',
   'military',
   'to',
   'atrophy',
   'while',
   'we',
   'spent',
   'trillions',
   'of',
   'dollars',
   'abroad',
   'instead',
   'of',
   'investing',
   'at',
   'home'],
  'Republican'),
 (['joe',
   'biden',
   'said',
   'black',
   'people',
   'are',
   'a',
   'monolithic',
   'community'],
  'Republican'),
 (['well',
   'i’m',
   'glad',
   'to',
   'hear',
   'it',
   'but',
   'boy',
   'i',
   'think',
   'a',
   'lot',
   'of',
   'americans',
   'are',
   'going',
   'to',
   'be',
   'dealing',
   'with',
   'that',
   'for',
   'a',
   'long',
   'time'],
  'Democratic'),
 (['god',
   'bless',
   'you',

If that looks good, let's make our function to turn these into features. First we need to build our list of candidate words. I started my exploration at a cutoff of 5. 

In [42]:
word_cutoff = 5

# Flatten all tokens from clean_conv_sent_data
tokens = [w for t, p in clean_conv_sent_data for w in t]

# Build frequency distribution
word_dist = nltk.FreqDist(tokens)

# Collect words above cutoff
feature_words = set(word for word, count in word_dist.items() if count > word_cutoff)

print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} features in the model.")

With a word cutoff of 5, we have 2514 features in the model.
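
The cells that follow call `conv_features`, but its definition does not appear in this notebook. A minimal sketch of what such a function might look like, assuming it takes a whitespace-delimited string and a set of feature words and returns a `{word: True}` dictionary; that shape matches the printed feature sets and the sanity-check assertions, but the body here is a reconstruction, not the author's code:

```python
def conv_features(text, fw):
    """Return {word: True} for every whitespace token in `text` that is in `fw`.

    Assumed reconstruction: `text` is an already-cleaned string and `fw` is
    the set of feature words built from the frequency cutoff.
    """
    return {word: True for word in text.split() if word in fw}


# Toy usage with a made-up feature set.
example = conv_features("obama was the president", {"obama", "president", "the"})
print(example)
```

Words outside the feature set are simply left out of the dictionary, which is the usual way to pass sparse presence features to NLTK's classifier.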


In [50]:
# Sanity check
print("clean_conv_sent_data size:", len(clean_conv_sent_data))
print("sample:", clean_conv_sent_data[:2])

# Build feature_words from the cleaned tokenized sentences
import nltk

word_cutoff = 5 
tokens = [w for t, _ in clean_conv_sent_data for w in t]  
fd = nltk.FreqDist(tokens)

feature_words = {w for w, c in fd.items() if c >= word_cutoff}
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} features in the model.")

clean_conv_sent_data size: 10740
sample: [(['skip', 'to', 'content', 'the', 'company', 'careers', 'press', 'freelancers', 'blog', '×', 'services', 'transcription', 'captions', 'foreign', 'subtitles', 'translation', 'freelancers', 'about', 'contact', 'login', '«', 'return', 'to', 'transcript', 'library', 'home', 'transcript', 'categories', 'all', 'transcripts', '2020', 'election', 'transcripts', 'classic', 'speech', 'transcripts', 'congressional', 'testimony', 'hearing', 'transcripts', 'debate', 'transcripts', 'donald', 'trump', 'transcripts', 'entertainment', 'transcripts', 'financial', 'transcripts', 'interview', 'transcripts', 'political', 'transcripts', 'press', 'conference', 'transcripts', 'speech', 'transcripts', 'sports', 'transcripts', 'technology', 'transcripts', 'aug', '21', '2020', '2020', 'democratic', 'national', 'convention', 'dnc', 'night', '4', 'transcript', 'rev', '›', 'blog', '›', 'transcripts', '›', '2020', 'election', 'transcripts', '›', '2020', 'democratic', 'nation

In [51]:
# Turn sentences into feature dictionaries
feature_sets = [
    (conv_features(" ".join(tokens), feature_words), party)
    for tokens, party in clean_conv_sent_data
]

print("Number of feature sets:", len(feature_sets))
print("Example:", feature_sets[0])

Number of feature sets: 10740
Example: ({'skip': True, 'to': True, 'content': True, 'the': True, 'company': True, 'careers': True, 'press': True, 'freelancers': True, 'blog': True, '×': True, 'services': True, 'transcription': True, 'captions': True, 'foreign': True, 'subtitles': True, 'translation': True, 'about': True, 'contact': True, 'login': True, '«': True, 'return': True, 'transcript': True, 'library': True, 'home': True, 'categories': True, 'all': True, 'transcripts': True, '2020': True, 'election': True, 'classic': True, 'speech': True, 'congressional': True, 'testimony': True, 'hearing': True, 'debate': True, 'donald': True, 'trump': True, 'entertainment': True, 'financial': True, 'interview': True, 'political': True, 'conference': True, 'sports': True, 'technology': True, 'aug': True, '21': True, 'democratic': True, 'national': True, 'convention': True, 'dnc': True, 'night': True, '4': True, 'rev': True, '›': True, 'of': True, 'on': True, 'august': True, '20': True}, 'Democr

In [52]:
# Shuffle and split into train/test
random.shuffle(feature_sets)

train_size = int(len(feature_sets) * 0.8)  # 80% train, 20% test
train_set, test_set = feature_sets[:train_size], feature_sets[train_size:]

print(f"Training size: {len(train_set)}, Test size: {len(test_set)}")

Training size: 8592, Test size: 2148


In [53]:
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy

# Train
classifier = NaiveBayesClassifier.train(train_set)

# Evaluate
print("Accuracy on test set:", accuracy(classifier, test_set))

# Show most informative features
classifier.show_most_informative_features(15)

Accuracy on test set: 0.7104283054003724
Most Informative Features
                   votes = True           Democr : Republ =     40.6 : 1.0
                 radical = True           Republ : Democr =     30.4 : 1.0
             enforcement = True           Republ : Democr =     22.7 : 1.0
              affordable = True           Democr : Republ =     18.4 : 1.0
                  racism = True           Democr : Republ =     17.0 : 1.0
                   media = True           Republ : Democr =     15.6 : 1.0
                    mike = True           Republ : Democr =     14.4 : 1.0
                   lewis = True           Democr : Republ =     14.2 : 1.0
                 climate = True           Democr : Republ =     14.1 : 1.0
                freedoms = True           Republ : Democr =     13.1 : 1.0
                 freedom = True           Republ : Democr =     13.0 : 1.0
                    isis = True           Republ : Democr =     12.5 : 1.0
                 destroy = True  

In [70]:
result = conv_features("obama was the president", feature_words)
assert "obama" in result
assert "president" in result

result2 = conv_features("some people in america are citizens", feature_words)
assert all(w in result2 for w in ["people", "america", "citizens"])

print("✅ Sanity checks passed!")

✅ Sanity checks passed!


Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [71]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [72]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [73]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.438


In [74]:
classifier.show_most_informative_features(25)

Most Informative Features
             enforcement = True           Republ : Democr =     27.5 : 1.0
                   votes = True           Democr : Republ =     21.6 : 1.0
                 climate = True           Democr : Republ =     17.3 : 1.0
                 destroy = True           Republ : Democr =     17.1 : 1.0
                supports = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.9 : 1.0
                preserve = True           Republ : Democr =     15.1 : 1.0
                  signed = True           Republ : Democr =     15.1 : 1.0
              appreciate = True           Republ : Democr =     14.0 : 1.0
                freedoms = True           Republ : Democr =     14.0 : 1.0
                 private = True           Republ : Democr =     11.9 : 1.0
                  defund = True           Republ : Democr =     10.9 : 1.0
                    drug = True           Republ : Democr =     10.3 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations


The model found words that line up with each party’s common topics. For example, “enforcement” and “freedoms” show up more for Republicans, while “votes” and “climate” show up more for Democrats. That makes sense since each party tends to focus on different issues in its speeches. What’s surprising is the low accuracy (0.438, about 44%), which is slightly worse than the 50% expected from guessing between two parties. This probably means the language overlaps a lot between parties, so the model has a hard time telling them apart.
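
Each row of the "Most Informative Features" table is a ratio of how likely a word is under one party compared with the other. A rough sketch of that kind of ratio with made-up counts; the add-0.5 smoothing, the two-bin assumption (word present or absent), and the helper names are assumptions for illustration and are not guaranteed to match NLTK's internals:

```python
def smoothed_prob(count, n, gamma=0.5, bins=2):
    """Smoothed estimate of P(word present | party) from `count` hits in `n` texts."""
    return (count + gamma) / (n + gamma * bins)


def likelihood_ratio(count_a, n_a, count_b, n_b):
    """How many times more likely the word is under party A than under party B."""
    return smoothed_prob(count_a, n_a) / smoothed_prob(count_b, n_b)


# Made-up counts: the word appears in 40 of 1000 texts for A and 1 of 1000 for B.
common_word = likelihood_ratio(40, 1000, 1, 1000)

# Made-up counts: a rare word seen 10 times for A and never for B.
rare_word = likelihood_ratio(10, 1000, 0, 1000)

print(round(common_word, 1), round(rare_word, 1))
```

Under these assumptions, a word seen only a handful of times by one party and never by the other still gets a large ratio, which is one reason a high-ratio word list can coexist with weak overall accuracy.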

## Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and
is unindexed, so the query takes a minute or two to run on my machine.

In [77]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [78]:
# Connect to the congressional DB 
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

# Pull candidate, party, and tweet text, filtering out tweets that contain 'RT'
results = cong_cur.execute(
    '''
       SELECT DISTINCT 
              cd.candidate, 
              cd.party,
              tw.tweet_text
       FROM candidate_data cd 
       INNER JOIN tweets tw 
           ON cd.twitter_handle = tw.handle 
           AND cd.candidate == tw.candidate 
           AND cd.district == tw.district
       WHERE cd.party in ('Republican','Democratic') 
         AND tw.tweet_text NOT LIKE '%RT%'
    '''
)

# Convert to list 
results = list(results)

print(f"Pulled {len(results)} tweets")
print(results[:3])

Pulled 664656 tweets
[('Mo Brooks', 'Republican', b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq'), ('Mo Brooks', 'Republican', b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6'), ('Mo Brooks', 'Republican', b'"NASA on the Square" event this Sat. 11AM \xe2\x80\x93 4PM. Stop by &amp; hear about the incredible work done in #AL05! @DowntownHSV http://t.co/R9zY8WMEpA')]


In [80]:
# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.

tweet_data = []

# Loop through results 
for candidate, party, tweet in results:
    if tweet and party in ("Republican", "Democratic"):
        tweet_data.append([tweet, party])

print(f"Loaded {len(tweet_data)} tweets")
print(tweet_data[:3])

Loaded 664656 tweets
[[b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq', 'Republican'], [b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6', 'Republican'], [b'"NASA on the Square" event this Sat. 11AM \xe2\x80\x93 4PM. Stop by &amp; hear about the incredible work done in #AL05! @DowntownHSV http://t.co/R9zY8WMEpA', 'Republican']]


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [81]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [83]:
for tweet, party in tweet_data_sample:
    # Decode from bytes if needed
    if isinstance(tweet, bytes):
        tweet = tweet.decode("utf-8", errors="ignore")
    
    # Clean and tokenize the tweet
    cleaned_tweet = " ".join([w for w in tweet.lower().split()])
    
    # Extract features
    features = conv_features(cleaned_tweet, feature_words)
    
    # Estimate party with classifier
    estimated_party = classifier.classify(features)
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")

Here's our (cleaned) tweet: Earlier today, I spoke on the House Floor abt protecting health care for women and praised @PPmarmonte for their work on the Central Coast. https://t.co/WqgTRzT7VV
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: Go Tribe! #RallyTogether https://t.co/0NXutFL9L5
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: Apparently, Trump thinks it's just too easy for students overwhelmed by the crushing burden of debt to pay off student loans #TrumpBudget https://t.co/ckYQO5T0Qh
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: We’re grateful for our first responders, our rescue personnel, our firefighters, our police, and volunteers who have been working tirelessly to keep people safe, provide much-needed help, while putting their own lives on the line.

https://t.co/eZPv0vMIz3
Actual party is Republican and our classifier says Republican.

H

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [85]:
from collections import defaultdict

# dictionary of counts by actual party and estimated party
parties = ['Republican', 'Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties:
    for p1 in parties:
        results[p][p1] = 0

num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data):
    tweet, party = tp
    
    # Decode if tweet is bytes
    if isinstance(tweet, bytes):
        tweet = tweet.decode("utf-8", errors="ignore")
    
    # Clean tweet
    cleaned_tweet = " ".join([w for w in tweet.lower().split()])
    
    # Extract features
    features = conv_features(cleaned_tweet, feature_words)
    
    # Classify
    estimated_party = classifier.classify(features)
    
    # Store result
    results[party][estimated_party] += 1
    
    if idx > num_to_score:
        break

In [86]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 4020, 'Democratic': 352}),
             'Democratic': defaultdict(int,
                         {'Republican': 5162, 'Democratic': 468})})

### Reflections

_Write a little about what you see in the results_ 


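The results show that the classifier labels almost everything as Republican: 9,182 of the 10,002 scored tweets (about 92%). That makes it look really good at spotting Republican tweets (4,020 of 4,372 correct, about 92%), but it struggles with Democratic ones, getting only 468 of 5,630 right (about 8%). Overall accuracy is about 45%, so the classifier is biased toward one label rather than actually separating the parties. This could be because of the way the data was split or the words that were picked as features. Overall, the model works okay for Republicans but not so much for Democrats, so it would need more balance to be useful.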
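
The counts in `results` can be turned into accuracy, recall, and precision per party. A small sketch using the numbers printed above, copied into a plain dictionary; the name `summarize` is a hypothetical helper, not part of the notebook:

```python
# Outer key = actual party, inner key = estimated party (copied from the output above).
counts = {
    "Republican": {"Republican": 4020, "Democratic": 352},
    "Democratic": {"Republican": 5162, "Democratic": 468},
}


def summarize(counts):
    """Compute overall accuracy plus per-party recall and precision."""
    labels = list(counts)
    total = sum(sum(row.values()) for row in counts.values())
    correct = sum(counts[label][label] for label in labels)
    summary = {"total": total, "accuracy": correct / total, "recall": {}, "precision": {}}
    for label in labels:
        actual = sum(counts[label].values())                # tweets truly from this party
        predicted = sum(counts[a][label] for a in labels)   # tweets labeled as this party
        summary["recall"][label] = counts[label][label] / actual if actual else 0.0
        summary["precision"][label] = counts[label][label] / predicted if predicted else 0.0
    return summary


summary = summarize(counts)
print(f"Scored {summary['total']} tweets, accuracy {summary['accuracy']:.3f}")
for party in counts:
    print(f"{party}: recall {summary['recall'][party]:.3f}, precision {summary['precision'][party]:.3f}")
```

The total comes to 10,002 rather than 10,000 because the scoring loop checks `idx > num_to_score` only after scoring the current tweet, so tweets at indices 0 through 10,001 are all counted.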