## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared Dropbox or from Blackboard.

In [53]:
import sqlite3
import nltk
import random
import numpy as np
import re
from collections import Counter, defaultdict

# Imports added
import pandas as pd
from string import punctuation
from contractions import fix

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

In [10]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

### Part 1: Exploratory Naive Bayes

We'll first build an NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text for each party and prepare it for use in Naive Bayes.

In [25]:
# First figure out what table is named within convention_db to be able to query
convention_cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = convention_cur.fetchall()

# Print all table names within the db
for i in tables:
    print(i[0])

conventions


In [27]:
# Preview the data in the table conventions so I can properly query it in the next cell
convention_cur.execute("SELECT * FROM conventions LIMIT 5;")
rows = convention_cur.fetchall()

# Get & print column names
column_names = [i[0] for i in convention_cur.description]
print(column_names)

# Print each row
for j in rows:
    print(j)

['party', 'night', 'speaker', 'speaker_count', 'time', 'text', 'text_len', 'file']
('Democratic', 4, 'Unknown', 1, '00:00', 'Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Convention (DNC) Night 4 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Democratic National Convention (DNC) Night 4 Transcript Night 4 of the 2020 Democratic National Convention (DNC) on August 20. Read the full transcript of the event here. Transcr

In [24]:
# Load in all of conventions data prior to cleaning it out 
query1 = '''SELECT * FROM conventions;'''

conventions_raw_df = pd.read_sql_query(query1, convention_db)
conventions_raw_df.head()

Unnamed: 0,party,night,speaker,speaker_count,time,text,text_len,file
0,Democratic,4,Unknown,1,00:00,Skip to content The Company Careers Press Free...,127,www_rev_com_blog_transcripts2020-democratic-na...
1,Democratic,4,Speaker 1,1,00:33,I’m here by calling the full session of the 48...,41,www_rev_com_blog_transcripts2020-democratic-na...
2,Democratic,4,Speaker 2,1,00:59,"Every four years, we come together to reaffirm...",17,www_rev_com_blog_transcripts2020-democratic-na...
3,Democratic,4,Kerry Washington,1,01:07,We fight for a more perfect union because we a...,28,www_rev_com_blog_transcripts2020-democratic-na...
4,Democratic,4,Bernie Sanders,1,01:18,"We must come together to defeat Donald Trump, ...",22,www_rev_com_blog_transcripts2020-democratic-na...


In [64]:
# Clean & tokenize text field as instructed to for final convention_data list of lists in cell below

# Use functions from last assignment for cleaning & tokenization 
sw = nltk.corpus.stopwords.words("english")
punctuation=set(punctuation)

def remove_stop(tokens):
    # Return all tokens that are not in sw (stopwords)
    tokens = [word for word in tokens if word not in sw]
    return(tokens)
 
def remove_punctuation(text, punct_set=punctuation): 
    return("".join([ch for ch in text if ch not in punct_set]))

def tokenize(text): 
    """ Splitting on whitespace rather than the book's tokenize function. That 
        function will drop tokens like '#hashtag' or '2A', which we need for Twitter. """
    collapse_whitespace = re.compile(r'\s+')
    return(collapse_whitespace.split(text))

def fold_lowercase(text):
    # Cast to string first - in case anything is not already a string
    text = str(text)
    return text.lower()

def expand_contractions(text):
    # Expand out contractions using contractions package
    return fix(text)

def remove_empty_tokens(tokens):
    # Remove any empty tokens
    tokens = [word for word in tokens if word != '']
    return(tokens)

def prepare(text, 
            pipeline=[fold_lowercase, expand_contractions, remove_punctuation, tokenize, remove_stop, 
                      remove_empty_tokens]): 

    for transform in pipeline: 
        text = transform(text)
        
    return(text)

# Use pipeline to clean & tokenize "text" column
default_pipeline = [
    fold_lowercase,
    expand_contractions,
    remove_punctuation,
    tokenize,
    remove_stop,
    remove_empty_tokens
]

conventions_raw_df["tokens"] = conventions_raw_df["text"].apply(prepare,pipeline=default_pipeline)
conventions_raw_df.head()

Unnamed: 0,party,night,speaker,speaker_count,time,text,text_len,file,tokens
0,Democratic,4,Unknown,1,00:00,Skip to content The Company Careers Press Free...,127,www_rev_com_blog_transcripts2020-democratic-na...,"[skip, content, company, careers, press, freel..."
1,Democratic,4,Speaker 1,1,00:33,I’m here by calling the full session of the 48...,41,www_rev_com_blog_transcripts2020-democratic-na...,"[calling, full, session, 48th, quadrennial, na..."
2,Democratic,4,Speaker 2,1,00:59,"Every four years, we come together to reaffirm...",17,www_rev_com_blog_transcripts2020-democratic-na...,"[every, four, years, come, together, reaffirm,..."
3,Democratic,4,Kerry Washington,1,01:07,We fight for a more perfect union because we a...,28,www_rev_com_blog_transcripts2020-democratic-na...,"[fight, perfect, union, fighting, soul, countr..."
4,Democratic,4,Bernie Sanders,1,01:18,"We must come together to defeat Donald Trump, ...",22,www_rev_com_blog_transcripts2020-democratic-na...,"[must, come, together, defeat, donald, trump, ..."


In [75]:
# fill this list up with items that are themselves lists. The 
# first element in the sublist should be the cleaned and tokenized
# text in a single string. As part of your cleaning process,
# remove the stopwords from the text. The second element of the sublist
# should be the party. 
convention_data = conventions_raw_df[["tokens", "party"]].values.tolist()
print(convention_data[:5])

[[['skip', 'content', 'company', 'careers', 'press', 'freelancers', 'blog', '×', 'services', 'transcription', 'captions', 'foreign', 'subtitles', 'translation', 'freelancers', 'contact', 'login', '«', 'return', 'transcript', 'library', 'home', 'transcript', 'categories', 'transcripts', '2020', 'election', 'transcripts', 'classic', 'speech', 'transcripts', 'congressional', 'testimony', 'hearing', 'transcripts', 'debate', 'transcripts', 'donald', 'trump', 'transcripts', 'entertainment', 'transcripts', 'financial', 'transcripts', 'interview', 'transcripts', 'political', 'transcripts', 'press', 'conference', 'transcripts', 'speech', 'transcripts', 'sports', 'transcripts', 'technology', 'transcripts', 'aug', '21', '2020', '2020', 'democratic', 'national', 'convention', 'dnc', 'night', '4', 'transcript', 'rev', '›', 'blog', '›', 'transcripts', '›', '2020', 'election', 'transcripts', '›', '2020', 'democratic', 'national', 'convention', 'dnc', 'night', '4', 'transcript', 'night', '4', '2020', 

Let's look at some random entries and see if they look right. 

In [76]:
random.choices(convention_data,k=5)

[[['trump',
   'team',
   'gave',
   'us',
   'empathy',
   'never',
   'received',
   'obama',
   'administration',
   'obama',
   'administration',
   'said',
   'everything',
   'could',
   'trump',
   'administration',
   'let',
   'say',
   'kayla',
   'donald',
   'trump',
   'president',
   'kayla',
   'captured',
   'would',
   'today',
   'kayla',
   'hostage',
   'would',
   'go',
   'outside',
   'night',
   'look',
   'moon',
   'would',
   'look',
   'moon',
   'would',
   'promise',
   'would',
   'everything',
   'could',
   'get',
   'home',
   'see',
   'moon',
   'reminded',
   'promise',
   'could',
   'keep'],
  'Republican'],
 [['proudly',
   'displayed',
   'shortly',
   'later',
   'website',
   'displayed',
   'big',
   'letters',
   'make',
   'mistake',
   'give',
   'power',
   'joe',
   'biden',
   'radical',
   'left',
   'defund',
   'police',
   'departments',
   'across',
   'america',
   'pass',
   'federal',
   'legislation',
   'reduce',
   'law',
   

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it. 

In [79]:
word_cutoff = 5

tokens = [token for token_list, party in convention_data for token in token_list]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2348 as features in the model.


In [86]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    # Initialize empty dict
    ret_dict = dict()
    
    # Split text on whitespace if it is not already a list of tokens
    if not isinstance(text, list):
        text = re.split(r'\s+', text)
    
    # Add each token that is a feature word; dict keys are unique,
    # so a repeated token is only recorded once
    for token in text:
        if token in fw:
            ret_dict[token] = True
    
    return ret_dict

In [84]:
# Test example provided
text = "quick quick brown fox"
fw = {'quick', 'fox', 'jumps'}
print(conv_features(text, fw))

{'quick': True, 'fox': True}


In [81]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("some people in america are citizens",feature_words)==
                     {'people':True,'america':True,"citizens":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [87]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [88]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [89]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.506


In [90]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     25.8 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     14.9 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                    isis = True           Republ : Democr =     13.0 : 1.0
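
Each ratio in this table compares how likely a feature is under one party versus the other. As a rough illustration, the sketch below computes a smoothed ratio for one word from toy counts. The counts, the helper names `smoothed_prob` and `smoothed_ratio`, and the add-0.5 smoothing over two bins (word present or absent) are assumptions made for illustration; NLTK's internal estimator may differ in detail.

```python
def smoothed_prob(docs_with_word, total_docs, gamma=0.5, bins=2):
    """Estimate P(word present | party) with additive smoothing."""
    return (docs_with_word + gamma) / (total_docs + gamma * bins)


def smoothed_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(word | party A) to P(word | party B)."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)


# Toy counts (hypothetical): the word appears in 30 of 1000 speeches
# for party A and in 1 of 1000 speeches for party B.
ratio = smoothed_ratio(30, 1000, 1, 1000)
print(f"{ratio:.1f} : 1.0")
```

A word that one party almost never uses can produce a very large ratio from only a handful of occurrences, so it is worth checking raw counts before reading much into any single row.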

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

_Your observations to come._



### Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [None]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [None]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming

In [None]:
tweet_data = []

# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.
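
One possible way to fill `tweet_data` is sketched below. It assumes each row has the `(candidate, party, tweet_text)` shape returned by the query, and it guards against the text column coming back as `bytes`, which is an assumption about the database rather than something confirmed here. The names `build_tweet_data`, `simple_prepare`, and `toy_rows` are hypothetical stand-ins so the block runs on its own.

```python
def build_tweet_data(rows, prepare_fn):
    """Turn (candidate, party, tweet_text) rows into [tokens, party] pairs."""
    data = []
    for candidate, party, tweet_text in rows:
        # Assumption: the text column may come back as bytes; decode if so.
        if isinstance(tweet_text, bytes):
            tweet_text = tweet_text.decode("utf-8", errors="ignore")
        tokens = prepare_fn(tweet_text)
        if tokens:  # skip tweets that are empty after cleaning
            data.append([tokens, party])
    return data


def simple_prepare(text):
    """Toy cleaner (hypothetical): lowercase and split on whitespace."""
    return text.lower().split()


# Hypothetical rows, not taken from the database.
toy_rows = [
    ("Candidate A", "Democratic", b"Protect our climate"),
    ("Candidate B", "Republican", "Support law enforcement"),
    ("Candidate C", "Republican", "   "),
]
toy_tweet_data = build_tweet_data(toy_rows, simple_prepare)
print(toy_tweet_data)
```

In the notebook, the equivalent call would be `tweet_data = build_tweet_data(results, prepare)`, which reuses the same cleaning pipeline applied to the convention speeches.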


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...
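
A minimal sketch of estimating the party for one tweet: build the feature dictionary from the tweet's tokens, then pass it to the classifier's `classify` method. `StubClassifier`, `toy_features`, and `estimate_party` are hypothetical names used so the block runs on its own; in the notebook the trained NLTK `classifier`, `conv_features`, and `feature_words` would take their place.

```python
class StubClassifier:
    """Hypothetical stand-in exposing a `classify(featureset)` method."""

    def classify(self, featureset):
        return "Republican" if "china" in featureset else "Democratic"


def toy_features(tokens, fw):
    """Mark each feature word that appears in the tokens as True."""
    return {tok: True for tok in tokens if tok in fw}


def estimate_party(tokens, model, fw, featurizer=toy_features):
    """Featurize the tokens and return the model's predicted label."""
    return model.classify(featurizer(tokens, fw))


toy_fw = {"china", "climate", "votes"}
first = estimate_party(["china", "trade", "deal"], StubClassifier(), toy_fw)
second = estimate_party(["climate", "action"], StubClassifier(), toy_fw)
print(first, second)
```

With the notebook's own objects, the same step is a single line: `estimated_party = classifier.classify(conv_features(tweet, feature_words))`.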

In [None]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [None]:

for tweet, party in tweet_data_sample :
    estimated_party = 'Gotta fill this in'
    # Fill in the right-hand side above with code that estimates the actual party
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")
    

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [None]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    tweet, party = tp    
    # Now do the same thing as above, but we store the results rather
    # than printing them. 
   
    # get the estimated party
    estimated_party = "Gotta fill this in"
    
    results[party][estimated_party] += 1
    
    # Stop once exactly num_to_score tweets have been scored
    if idx + 1 >= num_to_score : 
        break

In [None]:
results
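
To turn the nested counts into something easier to discuss, one option is to compute overall accuracy and per-party recall. The sketch below assumes the `confusion[actual][estimated]` layout used above; the `summarize` helper and the toy counts are hypothetical and are not results from this notebook.

```python
def summarize(confusion):
    """Return overall accuracy and per-party recall from nested counts.

    `confusion[actual][estimated]` holds the number of tweets.
    """
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row.get(actual, 0) for actual, row in confusion.items())
    accuracy = correct / total if total else 0.0

    recall = {}
    for actual, row in confusion.items():
        row_total = sum(row.values())
        recall[actual] = row.get(actual, 0) / row_total if row_total else 0.0
    return accuracy, recall


# Toy counts (hypothetical), not results from the notebook.
toy_confusion = {
    "Republican": {"Republican": 30, "Democratic": 10},
    "Democratic": {"Republican": 20, "Democratic": 40},
}
accuracy, recall = summarize(toy_confusion)
print(accuracy, recall)
```

Per-party recall is useful here because a classifier that labels nearly everything as one party can still post an accuracy close to that party's share of the tweets.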

### Reflections

_Write a little about what you see in the results_ 