## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared dropbox.

In [248]:
import sqlite3
import nltk
import random
import numpy as np
import pandas as pd
from collections import Counter, defaultdict

# Feel free to include your text patterns functions
import os
import re
import emoji

from nltk.corpus import stopwords
nltk.download('stopwords')
from string import punctuation
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
#from text_functions_solutions import clean_tokenize, get_patterns

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/summerpurschke/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [249]:
convention_db = sqlite3.connect(r'/Users/2020_Conventions.db')
convention_cur = convention_db.cursor()

### Part 1: Exploratory Naive Bayes

We'll first build a Naive Bayes (NB) model on the convention data itself, as a way to understand which words distinguish the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text for each party and prepare it for use in Naive Bayes.

In [250]:
# Execute SQL query to retrieve table names
convention_cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
table_names = convention_cur.fetchall()

# Print the table names
for name in table_names:
    print(name[0])

conventions


In [251]:
#Execute SELECT query on the source database
query_results = convention_cur.execute("SELECT party, text FROM conventions")

convention_data = []
for row in query_results:
    convention_data.append(row)

In [252]:
stopwords_set = set(stopwords.words('english'))
cleaned_data_sw= []


cleaned_data_lower = []
for party, text in convention_data:
    lower_party = party.lower()
    lower_text = text.lower()
    cleaned_data_lower.append((lower_party, lower_text))

convention_data = cleaned_data_lower

for party, text in convention_data:
    cleaned_text = " ".join([word for word in text.split() if word.lower() not in stopwords_set])
    cleaned_data_sw.append((party, cleaned_text))

convention_data = cleaned_data_sw

cleaned_data_punct = []
for party, text in cleaned_data_lower:
    no_punct_party = party.translate(str.maketrans("", "", string.punctuation))
    no_punct_text = text.translate(str.maketrans("", "", string.punctuation))
    cleaned_data_punct.append((no_punct_party, no_punct_text))
    
convention_data = cleaned_data_punct
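
The three cleaning passes above can also be written as a single function, so that no pass can accidentally read from an earlier intermediate list. This is only a minimal sketch, not the notebook's original code: `clean_text` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for `stopwords.words('english')`. Stopwords are filtered before punctuation is stripped, an assumption made so that contractions such as "don't" still match the stopword list.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')).
DEMO_STOPWORDS = {"i", "am", "a", "and", "the", "is", "don't"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text, stopword_set=DEMO_STOPWORDS):
    """Lowercase, drop stopwords, then strip ASCII punctuation."""
    words = [w for w in text.lower().split() if w not in stopword_set]
    stripped = (w.translate(PUNCT_TABLE) for w in words)
    # Discard tokens that consisted of punctuation only.
    return " ".join(w for w in stripped if w)


cleaned = clean_text("And I am a Dreamer!")
print(cleaned)
```

With the real stopword set, the same function could be applied as `[(party.lower(), clean_text(text, stopwords_set)) for party, text in convention_data]`.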

In [253]:
convention_data = [(text, party) for party, text in convention_data]

Let's look at some random entries and see if they look right. 

In [254]:
random.choices(convention_data,k=10)

[('and i am a dreamer', 'democratic'),
 ('ohio', 'republican'),
 ('inaudible 00014701', 'democratic'),
 ('i was born with spina bifida that means my spinal cord didn’t form like they should and the doctors in my town said i wouldn’t survive they gave my mother no hope for my future',
  'democratic'),
 ('we will rescue kids from failing schools by helping their parents send them to a safe school of their choice we will completely rebuild our depleted military the countries that we are protecting at a massive cost to us will be asked to pay their fair share we will take care of our great veterans like they have never been taken care of before',
  'republican'),
 ('okay inaudible 00014426 good', 'democratic'),
 ('women need to be taken seriously proper police respect can prevent what happened to me from happening to someone else thank you',
  'democratic'),
 ('their argument for joe biden boiled down to the fact that they think he’s a nice guy well let me tell you raising taxes on 82 of a

If that looks good, we now need a function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it. 

In [255]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2514 as features in the model.


In [256]:
def conv_features(text, fw):

    ret_dict = {}
    words = text.split()

    for word in words:
        if word in fw:
            ret_dict[word] = True

    return ret_dict

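# TODO: stopwords are still present in the features. The punctuation pass in the
# cleaning cell reads from `cleaned_data_lower` rather than `cleaned_data_sw`,
# so the stopword-filtered text is discarded.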

In [262]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald': True, 'is': True, 'the': True, 'president': True})
assert(conv_features("people are american in america",feature_words)==
                     {'people': True, 'are': True, 'american': True, 'in': True, 'america': True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [263]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [264]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [265]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.444
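
An accuracy figure is easier to interpret next to a baseline. The sketch below computes the accuracy of always guessing the most common label; `majority_baseline` and the demo labels are hypothetical and not part of the original notebook. If the data contain only two party labels, this baseline is at least 0.5, so a score of 0.444 would fall below it.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up labels, for illustration only.
demo_labels = ["democratic"] * 6 + ["republican"] * 4
print(majority_baseline(demo_labels))
```

In the notebook this could be called as `majority_baseline([party for _, party in test_set])`.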


In [266]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           republ : democr =     25.8 : 1.0
                   votes = True           democr : republ =     23.8 : 1.0
             enforcement = True           republ : democr =     21.5 : 1.0
                 destroy = True           republ : democr =     19.2 : 1.0
                freedoms = True           republ : democr =     18.2 : 1.0
                 climate = True           democr : republ =     17.8 : 1.0
                supports = True           republ : democr =     17.1 : 1.0
                   crime = True           republ : democr =     16.1 : 1.0
                   media = True           republ : democr =     14.9 : 1.0
                 beliefs = True           republ : democr =     13.0 : 1.0
               countries = True           republ : democr =     13.0 : 1.0
                 defense = True           republ : democr =     13.0 : 1.0
                    isis = True           republ : democr =     13.0 : 1.0
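
The ratios in this table compare how likely a feature is under one label versus the other. The sketch below reproduces that kind of ratio from raw counts. It assumes add-0.5 smoothing over two possible feature values (present or absent), which is my understanding of NLTK's default estimator; treat that detail, the function names, and the counts as assumptions for illustration.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Smoothed estimate of P(feature is present | label)."""
    return (count + gamma) / (total + gamma * bins)


def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of the larger to the smaller smoothed conditional probability."""
    p_a = smoothed_prob(count_a, total_a)
    p_b = smoothed_prob(count_b, total_b)
    return max(p_a, p_b) / min(p_a, p_b)


# Hypothetical counts: a word appears in 30 of 1000 texts for one party
# and in 1 of 1000 texts for the other.
ratio = likelihood_ratio(count_a=30, total_a=1000, count_b=1, total_b=1000)
print(round(ratio, 1))
```

Because the ratio depends heavily on the smaller count, words that are rare for one party can show large ratios even when they are not frequent overall.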

### My Observations
One notable pattern in the output is the imbalance between features that indicate a Republican speaker and those that indicate a Democratic speaker. Of the features shown, only two, "votes" and "climate," are strongly associated with Democratic speeches. This may suggest that Republican speeches are more repetitive and cover a narrower range of topics, while Democratic speakers use more varied language across a wider array of subjects.

It is worth noting that the absence of words like "welfare" and "Medicaid," which are often associated with Democrats, is surprising. However, this could indicate that the classifier has identified other distinguishing features or that the training data may not have adequately represented these specific terms.

Overall, these observations shed light on the characteristics of the Naive Bayes classifier's learned patterns, indicating a higher prevalence of words aligning with Republicans and a relatively smaller set associated with Democrats. 

### Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll give you the query I used to pull out the tweets. Note that this DB has some big tables and is unindexed, so the query takes a minute or two to run on my machine.

In [267]:
cong_db = sqlite3.connect(r"/Users/summerpurschke/Desktop/ADS/ADS509/mod4/congressional_data.db")
cong_cur = cong_db.cursor()

In [273]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming
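
One detail of the query is worth checking. SQLite's `LIKE` is case-insensitive for ASCII characters by default, so `NOT LIKE '%RT%'` also drops any tweet that contains "rt" inside a word such as "support" or "party". The sketch below shows this on a hypothetical in-memory table. The table contents and the `GLOB 'RT *'` alternative are illustrative assumptions, not the original query, and the toy table stores plain text rather than the byte strings found in the real DB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_text TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?)",
    [
        ("RT @someone: great rally",),
        ("Proud to support our party",),
        ("Town hall tonight",),
    ],
)

# Case-insensitive substring filter, as in the query above.
kept_like = [
    row[0]
    for row in conn.execute(
        "SELECT tweet_text FROM tweets "
        "WHERE tweet_text NOT LIKE '%RT%' ORDER BY rowid"
    )
]

# Case-sensitive prefix filter: only drops tweets that start with 'RT '.
kept_glob = [
    row[0]
    for row in conn.execute(
        "SELECT tweet_text FROM tweets "
        "WHERE tweet_text NOT GLOB 'RT *' ORDER BY rowid"
    )
]

print(kept_like)
print(kept_glob)
conn.close()
```

Whether the stricter filter is preferable depends on how retweets are marked in the data, so it is worth inspecting a few rows before changing the query.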

In [274]:
tweet_data = results

In [276]:
stopwords_set = set(stopwords.words('english'))

tweet_data_sw = []

for party, text in tweet_data:
    # Decode the byte object using an appropriate encoding
    decoded_text = text.decode('utf-8')  # Adjust the encoding if necessary
    # Remove stopwords and clean the text
    cleaned_text = " ".join([word for word in decoded_text.split() if word.lower() not in stopwords_set])
    tweet_data_sw.append((party, cleaned_text))

tweet_data = tweet_data_sw

tweet_data_lower = []
for party, text in tweet_data:
    lower_party = party.lower()
    lower_text = text.lower()
    tweet_data_lower.append((lower_party, lower_text))

tweet_data = tweet_data_lower

tweet_data_punct = []
for party, text in tweet_data_lower:
    no_punct_party = party.translate(str.maketrans("", "", string.punctuation))
    no_punct_text = text.translate(str.maketrans("", "", string.punctuation))
    tweet_data_punct.append((no_punct_party, no_punct_text))
    
tweet_data = tweet_data_punct

There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [277]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [282]:
tweet_data_sample = [(text, party) for party, text in tweet_data_sample]

In [279]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [284]:
# Create a CountVectorizer to convert text into numerical features
vectorizer = CountVectorizer()

# Extract the tweet texts and parties from tweet_data_sample
tweets = [tweet for tweet, _ in tweet_data_sample]
parties = [party for _, party in tweet_data_sample]

# Convert the tweet texts into numerical features
X = vectorizer.fit_transform(tweets)

# Train a Naive Bayes classifier on the sampled tweets.
# Note: this fits a new model on the sample; it is not the convention classifier from Part 1.
classifier = MultinomialNB()
classifier.fit(X, parties)

# Predict the party for each tweet and print the results.
# These are the same tweets the model was fit on, so this is not an out-of-sample test.
for tweet_text, party in tweet_data_sample:
    # Convert the tweet text into numerical features
    X_tweet = vectorizer.transform([tweet_text])
    # Predict the party using the trained classifier
    predicted_party = classifier.predict(X_tweet)[0]
    
    print(f"Here's our (cleaned) tweet: {tweet_text}")
    print(f"Actual party is {party} and our classifier says {predicted_party}.")
    print("")

Here's our (cleaned) tweet: silverbough awesome thrilled join womensmarchonwashington central coast residents  students httpstcobelyaqwamg
Actual party is democratic and our classifier says democratic.

Here's our (cleaned) tweet: congratulations goteamusa 🎉🇺🇸 httpstcox2di6gen8g
Actual party is democratic and our classifier says democratic.

Here's our (cleaned) tweet: yet speaker ryan still willing endorse donald trump candidate proposed racist policy httpstconlklulbo8j
Actual party is democratic and our classifier says democratic.

Here's our (cleaned) tweet: today celebrate anniversary signing nations founding document philadelphia… httpstcoz0imiybfhe
Actual party is republican and our classifier says republican.

Here's our (cleaned) tweet: one year realdonaldtrump economy booming people want work finding work maga 🇺🇸
Actual party is republican and our classifier says republican.

Here's our (cleaned) tweet: video heard concerns town hall advocated committee today must protecttheac
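
To apply the convention classifier from Part 1 to tweets, the tweets need to pass through the same featurizer that was used for training. The sketch below shows one way to do that. `classify_texts` is a hypothetical helper, and `KeywordStub` is a stand-in used only so the example runs on its own; in the notebook the first argument would be the trained NLTK classifier, whose `classify` method takes a feature dictionary.

```python
def conv_features(text, fw):
    """Same idea as the Part 1 featurizer: mark each known word as present."""
    return {word: True for word in text.split() if word in fw}


def classify_texts(classifier, texts, fw):
    """Return (text, predicted_label) pairs from an already-trained classifier."""
    return [(text, classifier.classify(conv_features(text, fw))) for text in texts]


class KeywordStub:
    """Hypothetical stand-in for the trained classifier, for demonstration only."""

    def classify(self, featureset):
        return "democratic" if "climate" in featureset else "republican"


demo_words = {"climate", "china"}
predictions = classify_texts(
    KeywordStub(), ["climate action now", "tough on china"], demo_words
)
print(predictions)
```

Since a later cell rebinds the name `classifier` to a scikit-learn model, the NLTK classifier would need its own variable name before a call such as `classify_texts(nb_convention, tweets, feature_words)`, where `nb_convention` is an assumed name.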

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [289]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    tweet, party = tp    
    # get the estimated party
    tweet_text = tweet
    # Convert the tweet text into numerical features
    X_tweet = vectorizer.transform([tweet_text])
    # Predict the party using the trained classifier
    estimated_party = classifier.predict(X_tweet)[0]
    
    results[party][estimated_party] += 1
    
    if idx > num_to_score : 
        break

In [290]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 0, 'Democratic': 0}),
             'Democratic': defaultdict(int,
                         {'Republican': 0, 'Democratic': 0}),
             'i’m delighted house’s approval congressional gold medal baseball legend civil rights hero amp paterson product larry doby thank repjimrenacci repmaxinewaters sensherrodbrown helping achieve deserved recognition american pioneer httpstcorexhvuqqnx': defaultdict(int,
                         {'democratic': 1}),
             'lee auman announces campaign us congress httpstcowo1tf1njxc': defaultdict(int,
                         {'democratic': 1}),
             'survived bday sleepover 2014 girlswannahavefun httptcoqilk8x2ywn': defaultdict(int,
                         {'democratic': 1}),
             'one cruelest provisions gop’s’ tax plan elimination medical expense deduction benefits middleclass families whose medical ex
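
In the output above, tweet texts appear as top-level keys while the 'Republican' and 'Democratic' counts stay at zero. That pattern is consistent with `tweet_data` holding `(party, text)` tuples that the loop unpacks as `tweet, party`, and with party labels that were lowercased during cleaning. A minimal sketch of a scoring loop under those assumptions follows; `score_tweets`, `accuracy`, and the demo data are hypothetical, and `predict` stands for any function that maps a tweet's text to a label.

```python
import random
from collections import defaultdict


def score_tweets(tweet_data, predict, num_to_score, seed=0):
    """Count (actual, estimated) label pairs for up to num_to_score tweets."""
    counts = defaultdict(lambda: defaultdict(int))
    sample = list(tweet_data)
    random.Random(seed).shuffle(sample)
    # Tuples are assumed to be (party, text), matching the cleaning cell.
    for party, tweet in sample[:num_to_score]:
        counts[party][predict(tweet)] += 1
    return counts


def accuracy(counts):
    """Share of scored tweets whose estimated party equals the actual party."""
    correct = sum(row.get(label, 0) for label, row in counts.items())
    total = sum(sum(row.values()) for row in counts.values())
    return correct / total if total else 0.0


# Made-up data and a toy predictor, for illustration only.
demo_tweets = [
    ("democratic", "climate votes"),
    ("republican", "china crime"),
    ("republican", "climate china"),
]


def demo_predict(text):
    return "democratic" if "climate" in text else "republican"


counts = score_tweets(demo_tweets, demo_predict, num_to_score=10)
print({actual: dict(row) for actual, row in counts.items()})
print(accuracy(counts))
```

Slicing with `sample[:num_to_score]` scores exactly `num_to_score` tweets at most, which avoids the off-by-one that an `idx > num_to_score` check after scoring would introduce.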

### Reflections
The scoring loop is meant to produce a table of counts by actual and estimated party for tweets from candidates of the two major U.S. parties. In such a table, correctly classified tweets sit where the actual and estimated party match and misclassified tweets sit where they differ, which gives a direct numeric estimate of the model's accuracy. In the output recorded above, however, the 'Republican' and 'Democratic' entries are all zero and tweet texts appear as top-level keys, so these results cannot yet be read as an assessment of the classifier's performance.