## Naive Bayes on Political Text
### Stephen Kuc
### ADS509

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [81]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict
import pandas as pd
from nltk.classify.scikitlearn import SklearnClassifier

# Feel free to include your text patterns functions
# from text_function_solutions import descriptive_stats, remove_stop,  tokenize, prepare

In [50]:
from nltk.corpus import stopwords
from string import punctuation
import re


# Some punctuation variations
punctuation = set(punctuation) # speeds up comparison
tw_punct = punctuation - {"#"}

# Stopwords
sw = stopwords.words("english")
sw = set(sw)

# Two useful regex
whitespace_pattern = re.compile(r"\s+")
hashtag_pattern = re.compile(r"^#[0-9a-zA-Z]+")

# and now our functions
def descriptive_stats(tokens, num_tokens = 5, verbose=True) :
    
    # `num_tokens` arrives as the number of most common tokens to report;
    # save it before the name is reused for the total token count
    top_n = num_tokens
    
    # finding total number of tokens by the length of tokens list
    num_tokens = len(tokens)
    
    # creating a set of tokens to find unique, and then length of that
    num_unique_tokens = len(set(tokens))
    
    # dividing unique over total to find diversity (0.0 for an empty list)
    lexical_diversity = num_unique_tokens / num_tokens if num_tokens else 0.0
    
    # summing the length of every token to count all characters in the document
    num_characters = sum(len(word) for word in tokens)
    
    # utilizing Counter to count the tokens and then finding the most common ones
    common_count = Counter(tokens)
    
    most_common = common_count.most_common(top_n)
    
    if verbose :        
        print(f"There are {num_tokens} tokens in the data.")
        print(f"There are {num_unique_tokens} unique tokens in the data.")
        print(f"There are {num_characters} characters in the data.")
        print(f"The lexical diversity is {lexical_diversity:.3f} in the data.")
    
        # print the most common tokens
        print(f"These are the {top_n} most common tokens and their counts:\n {most_common}")
        
    return([num_tokens, num_unique_tokens,
            lexical_diversity,
            num_characters, most_common])

    
def remove_stop(tokens) :
    
    return [t for t in tokens if t.lower() not in sw]
 
def remove_punctuation(text, punct_set=tw_punct) : 
    return("".join([ch for ch in text if ch not in punct_set]))

def tokenize(text) : 
    text = text.split(' ') # split on single spaces so hashtags and other information stay in the tokens
    return(text)

def prepare(text, pipeline) : 
    tokens = str(text)
    
    for transform in pipeline : 
        tokens = transform(tokens)
        
    return(tokens)

def join_tokens(tokens):
    
    text = " ".join(tokens)
    
    return(text)

In [11]:
convention_db = sqlite3.connect("2020_Conventions.db") # we are in the same directory
convention_cur = convention_db.cursor()

## Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text 
for each party and prepare it for use in Naive Bayes.  

In [63]:
convention_data = []

# fill this list up with items that are themselves lists. The 
# first element in the sublist should be the cleaned and tokenized
# text in a single string. The second element should be the party. 

query_results = convention_cur.execute("SELECT text, party FROM conventions;")

my_pipeline = [str.lower, remove_punctuation, tokenize, remove_stop, join_tokens]

convention_table = []

for row in query_results :
    text, party = row
    
    convention_table.append((text, party))

    
    
convention_df = pd.DataFrame(convention_table, columns = ["text", "party"])

tokens = convention_df['text'].apply(prepare, pipeline = my_pipeline)

convention_df['text'] = tokens

convention_data = convention_df.values.tolist()

# there must be a better way to do this 


Let's look at some random entries and see if they look right. 

In [65]:
random.choices(convention_data,k=10)

[['tennessee', 'Democratic'],
 ['good evening it’s honor come tonight commonwealth kentucky leader washington either new york california consider responsibility look middle america election incredibly consequential middle america president trump knows inherited first generation americans couldn’t promise children better life he’s made mission administration change know work beside every day today’s democrat party doesn’t want improve life middle america prefer us flyover country keep quiet let decide live lives',
  'Republican'],
 ['let us pray pray must grateful citizens country boldly claim one nation god pray must praising lord country freedom religion cherished republicans democrats begin conventions heads bowed prayer pray must conscious suffering covid unwearied frontliners care us pray must lives may protected respected troubled cities police guard tense world situations men women uniform keep peace innocent life baby womb elders nursing care hospice immigrants refugees lives th

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it. 

In [66]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")


With a word cutoff of 5, we have 2391 as features in the model.


In [67]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    
    ret_dict = dict()
    
    words = text.split(" ") # splitting words on white space to create as tokens again
    
    for w in words: 
        if w in fw: # checking if word is included in feature words
            ret_dict[w] = True
        else:
            pass
    
    return(ret_dict)

In [68]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)==
                     {'america':True,'american':True,"people":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [69]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [70]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [71]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.5


In [72]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     25.8 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     14.9 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                    isis = True           Republ : Democr =     13.0 : 1.0

### Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

These are very interesting. Since the conventions took place in 2020, when COVID was at or near its peak, it is pretty apparent that it was an issue talked about. The Republicans seem to like to blame China, or at least to cast China as, presumably, an enemy, although we can't confirm that this is how they use "china" in their speeches from this information alone. 

23 of the 25 most informative features are words that were at least 10x more likely to appear in Republican speeches than in Democratic ones. So, it seems that Republicans use patriotic buzzwords such as "destroy", "freedoms", "beliefs", "liberal", "religion", and "flag" to stir up pride, and possibly nationalism, among their supporters. They also mentioned "enemy", "isis", and again "china". These words also seem a bit divisive.

Perhaps that is a bit negative toward the Republican party, but that's what this data seems to point to. 

Among the Democrats, the only two words in the top 25 are "climate" and "votes". The Democrats seem to use fewer distinctive words and to focus on matters like the climate. They are also probably asking for votes and urging people to go vote, perhaps because they felt at this point that people not voting was the only way they'd lose to the Republicans.
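
For reference, the ratios in the output above compare how likely a word is to appear in a speech from one party versus the other. Below is a minimal sketch of that calculation, assuming add-0.5 (expected likelihood) smoothing over two outcomes (word present or absent), which is how I understand NLTK's `NaiveBayesClassifier` to estimate these probabilities. The function names and the counts are hypothetical and are not taken from the convention data.

```python
def presence_prob(docs_with_word, docs_in_class, gamma=0.5, bins=2):
    """Smoothed estimate of P(word present | class)."""
    return (docs_with_word + gamma) / (docs_in_class + gamma * bins)


def likelihood_ratio(with_a, total_a, with_b, total_b):
    """Ratio of the larger class-conditional probability to the smaller."""
    p_a = presence_prob(with_a, total_a)
    p_b = presence_prob(with_b, total_b)
    return max(p_a, p_b) / min(p_a, p_b)


# Hypothetical counts: the word appears in 12 of 100 speeches from one
# party and in 0 of 100 speeches from the other.
ratio = likelihood_ratio(12, 100, 0, 100)
print(f"{ratio:.1f} : 1.0")  # prints 25.0 : 1.0
```

A word that never appears for one party still gets a finite ratio because of the smoothing, which is one reason fairly rare words in a small corpus can rank as highly informative.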


## Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [73]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [74]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming

In [75]:
tweet_data = []

tweet_table = []


# Note that this may take a bit of time, since we have a lot of tweets.
for row in results :
    candidate, party, text = row
    
    tweet_table.append((text, party))

# Now fill up tweet_data with sublists like we did on the convention speeches.

# making table into df to apply pipeline on the column Series object

tweet_df = pd.DataFrame(tweet_table, columns = ["text", "party"])

tokens = tweet_df['text'].apply(prepare, pipeline = my_pipeline)

tweet_df['text'] = tokens

tweet_data = tweet_df.values.tolist()





There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [118]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)
tweet_data_sample

[['bmy thoughts affected terrible fires burning across californiannthank first responders working tirelessly keep us safe httpstcojnphimfh5l',
  'Democratic'],
 ['bkicking xe2x80x98hiring red white amp youxe2x80x9d veterans job fair w txworkforce commissioner hughes minutemaidparks morn 150 employers w jobs  4000 expected veteranshappy families amp stronger texas #txlege httpstcokwjayi8cng',
  'Democratic'],
 ['bexcited back bitter end nov 16 #songwriter #acoustic #guitar #thebitterend #livemusic httptcopoy7hax29r',
  'Democratic'],
 ['bpotus says since election created 24 million new jobs including 200000 new jobs manufacturing alone #sotu',
  'Republican'],
 ['bauthor atomic obsession john mueller expected join show tonight also lots heatlh care news today httpbitly1qq6qf',
  'Republican'],
 ['bxe2x80x9cwhat ixe2x80x99ve learned since announced xe2x80x94 itxe2x80x99s five months back june xe2x80x94 people ready httpstcobjweysbcw1',
  'Democratic'],
 ['bthoughts prayers great friends 
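
The leading `b` on each tweet and fragments such as `xe2x80x99` in the sample above suggest that `tweet_text` comes back from the database as `bytes`, so `str(text)` in `prepare` yields the printed representation (`b'...'` with escaped bytes) rather than the text itself. I have not verified the column type, so treat this as an assumption. A minimal sketch of a decoding step that could run before the pipeline, where `to_text` is a hypothetical helper:

```python
def to_text(value, encoding="utf-8"):
    """Return a str, decoding bytes instead of taking their repr."""
    if isinstance(value, bytes):
        # errors="replace" keeps going if a tweet contains invalid bytes
        return value.decode(encoding, errors="replace")
    return str(value)


raw = b"it\xe2\x80\x99s five months back"

print(str(raw))      # repr of the bytes: starts with b' and keeps the \x escapes
print(to_text(raw))  # the decoded text, with a proper apostrophe
```

With a step like this, the first word of each tweet would no longer pick up a `b` prefix, so it could match the feature words learned from the speeches.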

In [121]:
# running feature set code again, for this data

word_cutoff = 5

tokens = [w for t, p in tweet_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 55150 as features in the model.


In [122]:
featuresets_twit = [(conv_features(text,feature_words), party) for (text, party) in tweet_data]

In [125]:
limit = 10 # to only show 10
c = 0

for tweet, party in featuresets_twit :
    
    estimated_party = classifier.classify(tweet)
    # Fill in the right-hand side above with code that estimates the actual party
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifer says {estimated_party}.")
    if party == estimated_party:
        print("The classifier was correct")
    else:
        print("The classifier was incorrect")
    print(" ")
    
    c += 1    
    if c >= limit: # stop after `limit` tweets so it doesn't print them all
        break
    

Here's our (cleaned) tweet: {'bbigger': True, 'paychecks': True, 'xe2x86x92': True, 'realdonaldtrump': True, 'xe2x80x9cyour': True, 'going': True, 'way': True, 'taxes': True, 'right': True, 'first': True, 'time': True, 'long': True, 'youxe2x80x99ve': True, 'seen': True, 'coming': True, 'backxe2x80x9d': True}
Actual party is Republican and our classifer says Republican.
The classifier was correct
 
Here's our (cleaned) tweet: {'b50': True, 'years': True, 'repjohnlewis': True, 'brave': True, 'foot': True, 'soldiers': True, 'risked': True, 'lives': True, 'voting': True, 'rights': True, 'bills': True, '#restorethevra': True, 'languishing': True}
Actual party is Democratic and our classifer says Republican.
The classifier was incorrect
 
Here's our (cleaned) tweet: {'btogether': True, 'going': True, 'win': True, 'may': True, '8': True, 'work': True, 'create': True, 'future': True, 'children': True, 'west': True, 'virginia': True, 'join': True, 'team': True, '2': True, 'canvassing': True, 'h

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [126]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


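num_to_score = 10000
random.shuffle(tweet_data) # featuresets_twit was built before this shuffle, so its order is unchanged

for idx, tp in enumerate(featuresets_twit) :
    tweet, party = tp    
    # Now do the same thing as above, but we store the results rather
    # than printing them. 
    estimated_party = classifier.classify(tweet)
    
    results[party][estimated_party] += 1
    
    # idx is zero-based and checked after scoring, so this scores num_to_score + 2 tweets
    if idx > num_to_score : 
        break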

In [127]:
results


defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 3595, 'Democratic': 778}),
             'Democratic': defaultdict(int,
                         {'Republican': 4633, 'Democratic': 996})})
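
To put numbers on the table above, the counts can be turned into overall accuracy and per-party recall. This is a small sketch that copies the counts from the output; the variable names are my own.

```python
# Counts copied from the output above: first key is actual, second is estimated
counts = {
    "Republican": {"Republican": 3595, "Democratic": 778},
    "Democratic": {"Republican": 4633, "Democratic": 996},
}

total = sum(sum(row.values()) for row in counts.values())
correct = sum(counts[party][party] for party in counts)
accuracy = correct / total

# Recall: of the tweets that truly belong to a party, the share labeled correctly
recall = {
    party: counts[party][party] / sum(counts[party].values())
    for party in counts
}

# Share of all scored tweets that the classifier labeled Republican
predicted_rep = sum(counts[party]["Republican"] for party in counts) / total

print(f"Tweets scored: {total}")
print(f"Accuracy: {accuracy:.3f}")
for party, value in recall.items():
    print(f"Recall for {party}: {value:.3f}")
print(f"Share labeled Republican: {predicted_rep:.3f}")
```

On these counts the accuracy is roughly 0.46, with recall near 0.82 for Republican tweets and near 0.18 for Democratic tweets.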

### Reflections

It appears that the classifier leans heavily towards Republicans on these tweets. One likely source of error is that politicians may not tweet the way they give speeches. There are also many more tweets than speeches, and tweets can be one-off remarks or completely unrelated to politics. Democrats may also tweet much more freely than they give speeches, using terms Republicans use or even responding to Republicans directly. Part of the Twitter data contains hashtags, links, typos, and other errors, which would need further cleaning to better match the training data. There was only a very slight class imbalance, which may contribute to the skew of these results but is not enough to fully explain it.

Since it is a Naive Bayes classifier, it assumes that features are independent of one another given the class, which is not the case for text or speech. Because it classifies a tweet from word probabilities learned on the speech data, the mismatch between the two corpora matters: the context differs, and so do the words likely to be used in speeches versus on Twitter. Also, since Republicans account for most of the highly informative features, I'd imagine that whenever any of those words appears, the Naive Bayes model leans heavily towards classifying the tweet as Republican simply due to the probabilities. Again, perhaps Democrats used these words in response to Republicans, turning their words against them, and/or they simply tweet more freely than they give speeches.
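
To illustrate the last point, here is a toy sketch of how Naive Bayes combines per-word probabilities. All of the probabilities below are made up for illustration and are not estimated from the convention data; `posterior` is a hypothetical helper.

```python
import math

# Made-up values of P(word appears | party), for illustration only
word_probs = {
    "Republican": {"china": 0.100, "jobs": 0.20, "families": 0.15},
    "Democratic": {"china": 0.004, "jobs": 0.25, "families": 0.30},
}
priors = {"Republican": 0.5, "Democratic": 0.5}


def posterior(words):
    """Naive Bayes posterior over parties, treating words as independent given the party."""
    log_scores = {
        party: math.log(priors[party])
        + sum(math.log(word_probs[party][w]) for w in words)
        for party in priors
    }
    # Normalize in a numerically stable way
    top = max(log_scores.values())
    unnorm = {party: math.exp(score - top) for party, score in log_scores.items()}
    norm = sum(unnorm.values())
    return {party: value / norm for party, value in unnorm.items()}


print(posterior(["jobs", "families"]))           # leans Democratic
print(posterior(["jobs", "families", "china"]))  # leans Republican
```

In this toy case, two words that lean Democratic are outweighed by a single word with a large ratio, so the label flips to Republican.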

