## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared Dropbox or from Blackboard.

In [1]:
import sqlite3
import nltk
import random
import numpy as np
import pandas as pd
from collections import Counter, defaultdict

# Feel free to include your text patterns functions
import os
import re
import emoji

from nltk.corpus import stopwords
nltk.download('stopwords')
from string import punctuation
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
#from text_functions_solutions import clean_tokenize, get_patterns

In [2]:
convention_db = sqlite3.connect(r'/Users/summerpurschke/Desktop/ADS/ADS509/mod4/2020_Conventions.db')
convention_cur = convention_db.cursor()

### Part 1: Exploratory Naive Bayes

We'll first build an NB model on the convention data itself, as a way to understand which words distinguish the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text for each party and prepare it for use in Naive Bayes.

In [3]:
# Execute SQL query to retrieve table names
convention_cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
table_names = convention_cur.fetchall()

# Print the table names
for name in table_names:
    print(name[0])

conventions


In [5]:
#Execute SELECT query on the source database
query_results = convention_cur.execute("SELECT party, text FROM conventions")

convention_data = []
for row in query_results:
    convention_data.append(row)

In [10]:
# ## Fold to lowercase 
# convention_data = [(item[0].lower(), item[1].lower()) for item in convention_data]

# ## remove punctuation
# punctuation = set(punctuation) # speeds up comparison
# for item in convention_data:
#     cleaned_item = tuple([element.translate(str.maketrans("", "", string.punctuation)) for element in item])
#     convention_data.append(cleaned_item)

    

# ## remove stopwords   

In [18]:
# Lowercase both fields and strip ASCII punctuation in a single pass.
punct_table = str.maketrans("", "", string.punctuation)

convention_data = [(party.lower().translate(punct_table),
                    text.lower().translate(punct_table))
                   for party, text in convention_data]


In [34]:
stopwords_set = set(stopwords.words('english'))

# The text was lowercased and stripped of punctuation in the previous cell,
# so only stopword removal is left.
convention_data = [(party, " ".join(w for w in text.split() if w not in stopwords_set))
                   for party, text in convention_data]

Let's look at some random entries and see if they look right. 

In [36]:
random.choices(convention_data,k=10)

[('democratic', 'inaudible 0024 55 you’re smiling okay'),
 ('democratic',
  'past years america’s body politics weakened divisions growing deeper antisemitism antilatino antiimmigrant fervor racism charlottesville kkk didn’t even bother wear hoods minnesota life squeezed mr floyd strong body fight virus america’s divisions weakened donald trump didn’t create initial division division created trump made worse collective strength exercised government effect immune system current federal government dysfunctional competent couldn’t fight virus fact didn’t even see coming european virus infected northeast white house still fixated china virus attacking us months even knew saw failure government tried deny virus tried ignore try politicize failed federal government watched new york get ambushed negligence watched new york suffer learned absolutely nothing'),
 ('democratic', '1967'),
 ('republican',
  'i’m supporting president trump believes strong america cannot fight endless wars must conti

If that looks good, we now need to write a function that turns these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to do that if you want it.

In [37]:
word_cutoff = 5

# Each row is (party, text); tokenize the text field. Unpacking the row in
# the wrong order tokenizes the party labels and leaves only two "words".
tokens = [w for party, text in convention_data for w in text.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")


In [None]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    
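    # Keep each word of `text` that is a feature word; the dict comprehension
    # records a repeated word only once.
    ret_dict = {word: True for word in text.split() if word in fw}
    
    return ret_dict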

In [None]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)==
                     {'america':True,'american':True,"people":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [None]:
# Rows of convention_data are (party, text), so unpack in that order.
featuresets = [(conv_features(text,feature_words), party) for (party, text) in convention_data]

In [None]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [None]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

In [None]:
classifier.show_most_informative_features(25)
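
To help read that output, here is a rough sketch of the kind of ratio it reports: how much more likely a word is to appear in one party's speeches than in the other's. The toy documents, the `presence_ratio` helper, and the additive smoothing of 0.5 are all assumptions made for illustration; this is not NLTK's own code.

```python
from collections import Counter

# Toy documents (hypothetical), already cleaned: (party, text)
toy_docs = [
    ("democratic", "health care vote"),
    ("democratic", "vote climate care"),
    ("republican", "tax freedom vote"),
    ("republican", "freedom border tax"),
]

def presence_ratio(docs, word, smoothing=0.5):
    """Ratio of smoothed P(word present | party) between the two parties."""
    n_docs = Counter(party for party, _ in docs)
    n_with_word = Counter(party for party, text in docs if word in text.split())
    # Additive smoothing over the two outcomes (present / absent); assumed value
    prob = {
        party: (n_with_word[party] + smoothing) / (n_docs[party] + 2 * smoothing)
        for party in n_docs
    }
    high = max(prob, key=prob.get)
    low = min(prob, key=prob.get)
    return high, low, prob[high] / prob[low]

for w in ["freedom", "care", "vote"]:
    high, low, ratio = presence_ratio(toy_docs, w)
    print(f"{w:>8}: {high} : {low} = {ratio:.1f} : 1.0")
```

A word that shows up in every speech from both parties gets a ratio near 1 and tells us little, while a word used almost exclusively by one side gets a large ratio. Very rare words can also produce large ratios from a handful of occurrences, which is one reason for the `word_cutoff` filter.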

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

_Your observations to come._



### Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll give you the query I used to pull out the tweets. Note that this DB has some big tables and is unindexed, so the query takes a minute or two to run on my machine.

In [41]:
cong_db = sqlite3.connect(r"/Users/summerpurschke/Desktop/ADS/ADS509/mod4/congressional_data.db")
cong_cur = cong_db.cursor()

In [57]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming
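
Since the DB is unindexed, one option is to add an index on the join columns of a private copy of the database before running the query. The sketch below uses an in-memory stand-in with made-up rows; the column types and the index name `idx_tweets_join` are assumptions, and whether SQLite actually uses the index is up to its query planner.

```python
import sqlite3

# In-memory stand-in for congressional_data.db; the rows are made up.
demo_db = sqlite3.connect(":memory:")
demo_cur = demo_db.cursor()
demo_cur.execute(
    "CREATE TABLE candidate_data "
    "(candidate TEXT, party TEXT, district TEXT, twitter_handle TEXT)"
)
demo_cur.execute(
    "CREATE TABLE tweets (handle TEXT, candidate TEXT, district TEXT, tweet_text TEXT)"
)
demo_cur.execute(
    "INSERT INTO candidate_data VALUES ('Jane Doe', 'Democratic', 'XX-01', 'janedoe')"
)
demo_cur.execute(
    "INSERT INTO tweets VALUES ('janedoe', 'Jane Doe', 'XX-01', 'hello district')"
)

# Index the three columns that the join condition compares on the tweets side.
demo_cur.execute(
    "CREATE INDEX IF NOT EXISTS idx_tweets_join "
    "ON tweets (handle, candidate, district)"
)
demo_db.commit()

join_query = """
    SELECT cd.party, tw.tweet_text
    FROM candidate_data cd
    INNER JOIN tweets tw ON cd.twitter_handle = tw.handle
        AND cd.candidate = tw.candidate
        AND cd.district = tw.district
"""

# EXPLAIN QUERY PLAN shows how SQLite intends to execute the join.
for plan_row in demo_cur.execute("EXPLAIN QUERY PLAN " + join_query):
    print(plan_row)

demo_rows = demo_cur.execute(join_query).fetchall()
print(demo_rows)
```

Building the index on the real tables takes time and disk space of its own, so it only pays off if the query is going to be run more than once.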

In [62]:
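# The query returns (candidate, party, tweet_text); keep (party, tweet_text)
# so the rows have the same shape as convention_data.
tweet_data = [(party, tweet_text) for _candidate, party, tweet_text in results]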

In [64]:
stopwords_set = set(stopwords.words('english'))

tweet_data_sw = []

for party, text in tweet_data:
    # Decode the byte object using an appropriate encoding
    decoded_text = text.decode('utf-8')  # Adjust the encoding if necessary
    # Remove stopwords and clean the text
    cleaned_text = " ".join([word for word in decoded_text.split() if word.lower() not in stopwords_set])
    tweet_data_sw.append((party, cleaned_text))

tweet_data = tweet_data_sw

tweet_data_lower = []
for party, text in tweet_data:
    lower_party = party.lower()
    lower_text = text.lower()
    tweet_data_lower.append((lower_party, lower_text))

tweet_data = tweet_data_lower

tweet_data_punct = []
for party, text in tweet_data_lower:
    no_punct_party = party.translate(str.maketrans("", "", string.punctuation))
    no_punct_text = text.translate(str.maketrans("", "", string.punctuation))
    tweet_data_punct.append((no_punct_party, no_punct_text))
    
tweet_data = tweet_data_punct
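
The same three steps (case folding, punctuation removal, stopword removal) are applied to both the speeches and the tweets, so they could be collected into one helper. This is a minimal sketch, not the course's `clean_tokenize`; the name `clean_text` and the tiny `DEMO_STOPWORDS` set are assumptions so that the example runs without any downloads.

```python
import string

# Small stand-in stopword list so the sketch runs on its own; in the
# notebook this would be set(stopwords.words('english')).
DEMO_STOPWORDS = {"the", "is", "a", "to", "and", "of", "in"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text, stopword_set=DEMO_STOPWORDS):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    if isinstance(text, bytes):
        # Assumes UTF-8 for byte strings, as the tweet cell above does
        text = text.decode("utf-8")
    text = text.lower().translate(PUNCT_TABLE)
    return " ".join(w for w in text.split() if w not in stopword_set)

print(clean_text("The economy is BOOMING, and people want to work!"))
print(clean_text(b"Proud to join the #WomensMarch in D.C."))
```

One design choice to be aware of: this version strips punctuation before removing stopwords, so a stopword that contains an apostrophe loses it first and may no longer match the list. `string.punctuation` also covers only ASCII characters, which is why curly apostrophes survive in the cleaned text above.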

There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [66]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [73]:
tweet_data_sample

[('democratic',
  'silverbough awesome thrilled join womensmarchonwashington central coast residents  students httpstcobelyaqwamg'),
 ('democratic', 'congratulations goteamusa 🎉🇺🇸 httpstcox2di6gen8g'),
 ('democratic',
  'yet speaker ryan still willing endorse donald trump candidate proposed racist policy httpstconlklulbo8j'),
 ('republican',
  'today celebrate anniversary signing nations founding document philadelphia… httpstcoz0imiybfhe'),
 ('republican',
  'one year realdonaldtrump economy booming people want work finding work maga 🇺🇸'),
 ('democratic',
  'video heard concerns town hall advocated committee today must protecttheaca httpstcoopgumzs03x'),
 ('democratic',
  'aubrey found hermit crab tide pools hanging cabrillonps 4th grade… httpstcourfgkqkj96'),
 ('democratic',
  'definitely do teacher mother 5 trying changetheconversation washington solving problems getting things done tn04  help us take back house people 2018 httpstco0dffkhvdfk haveyoumetmariah php2018 teachers4congres

In [67]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [72]:
# Create a CountVectorizer to convert text into numerical features
vectorizer = CountVectorizer()

# Each row of tweet_data_sample is (party, tweet text)
parties = [party for party, _ in tweet_data_sample]
tweets = [tweet for _, tweet in tweet_data_sample]

# Convert the tweet texts into numerical features
X = vectorizer.fit_transform(tweets)

# Train a Naive Bayes classifier. Note that this fits a new model on the
# sample itself, so the predictions below are in-sample.
classifier = MultinomialNB()
classifier.fit(X, parties)

# Predict the party for each tweet and print the results
for party, tweet_text in tweet_data_sample:
    # Convert the tweet text into numerical features
    X_tweet = vectorizer.transform([tweet_text])
    # Predict the party using the trained classifier
    predicted_party = classifier.predict(X_tweet)[0]
    
    print(f"Here's our (cleaned) tweet: {tweet_text}")
    print(f"Actual party is {party} and our classifier says {predicted_party}.")
    print("")

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [None]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['republican','democratic']  # lowercase, to match the cleaned labels
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    party, tweet = tp    # rows are (party, tweet text)
    # Now do the same thing as above, but we store the results rather
    # than printing them. 
   
    # get the estimated party, using the vectorizer and classifier fitted above
    estimated_party = classifier.predict(vectorizer.transform([tweet]))[0]
    
    results[party][estimated_party] += 1
    
    if idx > num_to_score : 
        break

In [None]:
results
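
The nested dictionary is easier to interpret once it is boiled down to a few rates. Here is a minimal sketch of one way to do that; the counts in `demo_results` are made up, and `summarize` is a hypothetical helper, not part of the assignment.

```python
from collections import defaultdict

# Hypothetical counts in the same shape as `results`: [actual][estimated]
demo_results = defaultdict(lambda: defaultdict(int))
demo_results["republican"]["republican"] = 30
demo_results["republican"]["democratic"] = 20
demo_results["democratic"]["republican"] = 10
demo_results["democratic"]["democratic"] = 40

def summarize(counts):
    """Overall accuracy plus per-party recall and precision."""
    labels = sorted(counts)  # the actual-party keys
    total = sum(counts[a][e] for a in labels for e in labels)
    correct = sum(counts[label][label] for label in labels)
    summary = {"accuracy": correct / total if total else 0.0}
    for label in labels:
        actual_total = sum(counts[label][e] for e in labels)
        predicted_total = sum(counts[a][label] for a in labels)
        hits = counts[label][label]
        summary[label] = {
            # Of the tweets truly from this party, how many did we catch?
            "recall": hits / actual_total if actual_total else 0.0,
            # Of the tweets we assigned to this party, how many were right?
            "precision": hits / predicted_total if predicted_total else 0.0,
        }
    return summary

demo_summary = summarize(demo_results)
print(demo_summary)
```

Comparing the accuracy with the share of the larger party is a useful sanity check: a classifier that always guesses the majority party already scores that well.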

### Reflections

_Write a little about what you see in the results_ 