# Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared dropbox or from blackboard

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [16]:
import sqlite3
import random
import numpy as np
from collections import Counter, defaultdict
import re
import string
import nltk

nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
convention_db = sqlite3.connect("/content/drive/MyDrive/ADS-509-01/Module4/2020_Conventions.db")
convention_cur = convention_db.cursor()

## 1. Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" exercise. First, we'll pull in the text
for each party and prepare it for use in Naive Bayes.

In [6]:
convention_data = []



query_results = convention_cur.execute(
    '''
    SELECT text, party
    FROM conventions
    WHERE party != "Other"
    '''
)

for row in query_results:
    speech_text = row[0]
    party = row[1]
    convention_data.append([speech_text, party])


In [7]:
# it's a best practice to close up your DB connection when you're done
convention_db.close()

Let's look at some random entries and see if they look right.

In [11]:
random.choices(convention_data,k=5)

[['When she was second lady, Jill told me that she would like to continue teaching at community college. And I said, “That’s insane. You cannot possibly do that.” Dr.',
  'Democratic'],
 ['His entire campaign was for all Americans. That was a turning point for me.',
  'Republican'],
 ['This week marks the 100th anniversary of the passage of the 19th Amendment, and we celebrate the women who fought for that right, yet so many of the black women who helped secure that victory were still prohibited from voting long after its ratification, but they were undeterred without fanfare or recognition they organized and testified and rallied and marched and fought, not just for their vote but for a seat at the table.',
  'Democratic'],
 ['We will never, ever sign bad trade deals. America first, again, America first.',
  'Republican'],
 ['And that our recovery is truly people up, families up, and communities up.',
  'Democratic']]

It'll be useful for us to have a large sample size than 2024 affords, since those speeches tend to be long and contiguous. Let's make a new list-of-lists called `conv_sent_data`. Instead of each first entry in the sublists being an entire speech, make each first entry just a sentence from the speech. Feel free to use NLTK's `sent_tokenize` [function](https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html).

In [14]:
conv_sent_data = []

for speech, party in convention_data:
    sentences = sent_tokenize(speech)
    for sentence in sentences:
        conv_sent_data.append([sentence, party])



Again, let's look at some random entries.

In [15]:
random.choices(conv_sent_data,k=5)

[['And I gave my speech with much vigor to a completely empty chamber.',
  'Democratic'],
 ['I am 11 years old.', 'Democratic'],
 ['It’s truly heroic.', 'Democratic'],
 ['Joe Biden said black people are a monolithic community.', 'Republican'],
 ['Not so for President Trump.', 'Republican']]

Now it's time for our final cleaning before modeling. Go through `conv_sent_data` and take the following steps:

1. Tokenize on whitespace
1. Remove punctuation
1. Remove tokens that fail the `isalpha` test
1. Remove stopwords
1. Casefold to lowercase
1. Join the remaining tokens into a string


In [24]:
stop_words = set(stopwords.words('english'))

clean_conv_sent_data = [] # list of tuples (sentence, party), with sentence cleaned

# This code was supported with the help of GitHub Copilot
for idx, (sentence, party) in enumerate(conv_sent_data):
    tokens = sentence.split()

    tokens = [word for word in tokens if word.isalpha()]

    tokens = [word.casefold() for word in tokens if word.casefold() not in stop_words]

    cleaned_sentence = ' '.join(tokens)

    clean_conv_sent_data.append((cleaned_sentence, party))

random.choices(clean_conv_sent_data, k=5)

[('president reactivated white house council native american promote economic development rural prosperity indian',
  'Republican'),
 ('president trump first leader savvy crushed status quo establishment political media',
  'Republican'),
 ('clear', 'Democratic'),
 ('', 'Democratic'),
 ('instead left could lead world', 'Democratic')]

If that looks good, let's make our function to turn these into features. First we need to build our list of candidate words. I started my exploration at a cutoff of 5.

In [25]:
word_cutoff = 5

tokens = [w for t, p in clean_conv_sent_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)

print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 1776 as features in the model.


In [34]:
def conv_features(text, fw):
    """Given some text, this returns a dictionary holding the
       feature words.

       Args:
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word
            in `text` must be in fw in order to be returned. This
            prevents us from considering very rarely occurring words.

       Returns:
            A dictionary with the words in `text` that appear in `fw`.
            Words are only counted once.
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of
            {'quick' : True,
             'fox' :    True}

    """
    # This code was supported with the help of GitHub Copilot
    ret_dict = {}
    words = text.split()
    for word in words:
        if word in fw:
            ret_dict[word] = True
    return ret_dict


In [35]:
assert(len(feature_words)>0)
assert(conv_features("obama was the president",feature_words)==
       {'obama':True,'president':True})
assert(conv_features("some people in america are citizens",feature_words)==
                     {'people':True,'america':True,"citizens":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory.

In [36]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [37]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [38]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.494


In [39]:
classifier.show_most_informative_features(25)

Most Informative Features
             enforcement = True           Republ : Democr =     27.5 : 1.0
                   votes = True           Democr : Republ =     21.6 : 1.0
                 climate = True           Democr : Republ =     17.3 : 1.0
                 destroy = True           Republ : Democr =     17.1 : 1.0
                supports = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.9 : 1.0
                preserve = True           Republ : Democr =     15.1 : 1.0
                  signed = True           Republ : Democr =     15.1 : 1.0
              appreciate = True           Republ : Democr =     14.0 : 1.0
                freedoms = True           Republ : Democr =     14.0 : 1.0
                 private = True           Republ : Democr =     11.9 : 1.0
                  defund = True           Republ : Democr =     10.9 : 1.0
                    drug = True           Republ : Democr =     10.3 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

Party-Specific Terms:

Republican: Words like "enforcement", "destroy", "supports", "media", "preserve", "signed", "appreciate", "freedoms", "private", "defund", "drug", "special", "trade", "everyday", "local", "allowed", "moved", "bless", "land", "agenda", "countries", "crime" are highly indicative of Republican speeches. These terms often relate to law enforcement, media critique, national pride, and conservative policies.

Democratic: Words like "votes", "climate", and "elect" are highly indicative of Democratic speeches, reflecting a focus on environmental issues, voting rights, and elections.
Strength of Association:

The ratio indicates how strongly a word is associated with a particular party. For example, "enforcement" is 27.5 times more likely to appear in Republican speeches than Democratic ones. This strong association suggests that these words are key identifiers of the party's rhetoric.
Surprising Terms:

Some words like "media" and "climate" have strong associations but might be expected to appear frequently in both parties' speeches due to their relevance in contemporary discourse. However, their strong association with specific parties highlights differing focuses or contexts in which these words are used.




## Part 2: Classifying Congressional Tweets

In this part we apply the classifer we just built to a set of tweets by people running for congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and
is unindexed, so the query takes a minute or two to run on my machine.

In [40]:
cong_db = sqlite3.connect("/content/drive/MyDrive/ADS-509-01/Module4/congressional_data.db")
cong_cur = cong_db.cursor()

In [41]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT
                  cd.candidate,
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle
               AND cd.candidate == tw.candidate
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic')
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming

In [46]:
tweet_data = []

for row in results:
    candidate, party, tweet_text = row
    tweet_data.append([tweet_text, party])
# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.


There are a lot of tweets here. Let's take a random sample and see how our classifer does. I'm guessing it won't be too great given the performance on the convention speeches...

In [47]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [64]:

# This code was supported with the help of GitHub Copilot
def preprocess_tweet(tweet):
    # Decode tweet if it's in bytes
    if isinstance(tweet, bytes):
        tweet = tweet.decode('utf-8')

    # Tokenize on whitespace
    tokens = tweet.split()

    # Remove tokens that fail the isalpha test and stopwords
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word.casefold() for word in tokens if word.casefold() not in stop_words]

    # Join the remaining tokens into a string
    cleaned_tweet = ' '.join(tokens)

    return cleaned_tweet

for tweet, party in tweet_data_sample:
    # Preprocess the tweet
    cleaned_tweet = preprocess_tweet(tweet)

    estimated_party = classifier.classify(features)

    print(f"Here's our (cleaned) tweet: {cleaned_tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")

Here's our (cleaned) tweet: mass shooting las vegas horrific act victims families thoughts
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: early morning leaving
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: moderates enemies sides assist
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: rt national security veterans demanding answers release confidential national security
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: 
Actual party is Republican and our classifier says Democratic.

Here's our (cleaned) tweet: glad attend assure everyone could majority americans still stand traditional
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: everyone wraps flag patriotism avoid discussion racism kneeling honoring troops
Actual party is Democratic and our classifier says Republic

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [67]:
# This code was supported with the help of GitHub Copilot
# dictionary of counts by actual party and estimated party.
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    tweet, party = tp
 # Preprocess the tweet
    cleaned_tweet = preprocess_tweet(tweet)

    # Extract features
    features = conv_features(cleaned_tweet, feature_words)

    # Estimate the party using the classifier
    estimated_party = classifier.classify(features)

    results[party][estimated_party] += 1

    if idx > num_to_score :
        break

In [66]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 3329, 'Democratic': 781}),
             'Democratic': defaultdict(int,
                         {'Republican': 4725, 'Democratic': 1167})})

### Reflections

Misclassification Tendency:

The classifier tends to incorrectly classify a significant number of Democratic tweets as Republican (4725 misclassifications). This suggests that the classifier might be biased towards predicting tweets as Republican.
The number of misclassifications for Republican tweets as Democratic (781) is much lower in comparison.

Possible Reasons for Misclassification:

1. Feature Overlap: There might be a significant overlap in the vocabulary used by both parties, leading to confusion in classification.

2. Feature Selection: The features selected for the classifier might be more indicative of Republican tweets, or there might be an overrepresentation of Republican-related features in the training set.

3. Dataset Imbalance: If the training dataset had more Republican tweets, the classifier could be biased towards predicting Republican.
Accuracy:

The classifier correctly identifies Republican tweets with a higher accuracy compared to Democratic tweets. This indicates that the classifier might be better at recognizing features associated with Republican tweets.

In [68]:
!apt-get install texlive texlive-xetex texlive-latex-extra pandoc
!pip install pypandoc

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  dvisvgm fonts-droid-fallback fonts-lato fonts-lmodern fonts-noto-mono fonts-texgyre
  fonts-urw-base35 libapache-pom-java libcmark-gfm-extensions0.29.0.gfm.3 libcmark-gfm0.29.0.gfm.3
  libcommons-logging-java libcommons-parent-java libfontbox-java libfontenc1 libgs9 libgs9-common
  libidn12 libijs-0.35 libjbig2dec0 libkpathsea6 libpdfbox-java libptexenc1 libruby3.0 libsynctex2
  libteckit0 libtexlua53 libtexluajit2 libwoff1 libzzip-0-13 lmodern pandoc-data poppler-data
  preview-latex-style rake ruby ruby-net-telnet ruby-rubygems ruby-webrick ruby-xmlrpc ruby3.0
  rubygems-integration t1utils teckit tex-common tex-gyre texlive-base texlive-binaries
  texlive-fonts-recommended texlive-latex-base texlive-latex-recommended texlive-pictures
  texlive-plain-generic tipa xfonts-encodings xfonts-utils
Suggested packages:
  fonts-noto fonts-fre

In [88]:
!jupyter nbconvert --to PDF "/content/drive/MyDrive/ADS-509-01/Module4/Political Naive Bayes.ipynb"

[NbConvertApp] Converting notebook /content/drive/MyDrive/ADS-509-01/Module4/Political Naive Bayes.ipynb to PDF
[NbConvertApp] Writing 89458 bytes to notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: ['xelatex', 'notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: ['bibtex', 'notebook']
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 91847 bytes to /content/drive/MyDrive/ADS-509-01/Module4/Political Naive Bayes.pdf
