# Module 4 Assignment 4.1 Text Classification

## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared Dropbox or from Blackboard.

# Importing Packages

In [2]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict

from string import punctuation

from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords

sw = stopwords.words("english")


In [3]:
# creating SQL connection to SQLite database
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

## Part 1: Exploratory Naive Bayes

We'll first build an NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" exercise. First, we'll pull in the text for each party and prepare it for use in Naive Bayes.

In [4]:
# list to store speech and party
convention_data = []

# fill this list up with items that are themselves lists. The 
# first element in the sublist should be the cleaned and tokenized
# text in a single string. The second element should be the party. 

query_results = convention_cur.execute('SELECT * FROM conventions')

for row in query_results :
    # store the results in convention_data
    convention_data.append([row[5], row[0]])
    

In [5]:
# it's a best practice to close up your DB connection when you're done
convention_db.close()

Let's look at some random entries and see if they look right. 

In [6]:
random.choices(convention_data,k=5)

[['I’m John Peterson, owner of a second generation metal fabrication business called Schuette Metals. We’ve been stamping our products and services made in the USA since 1957. My brother and I purchased the business from my uncles almost 38 years ago. What was a 12 person shop has now grown into a company employing 165 people today. Like most companies that are successful over the long run, we had to reinvent ourselves as the market changed. Six years ago, we invested heavily in our business just as a greatest recession appeared. Barack Obama and Joe Biden, two career politicians who knew nothing about business, couldn’t get the government out of our way. And it put our business in a tailspin. Sadly, we were forced to make decisions which included cutting staff, a torturous experience when our employees are like family. The Obama- Biden era banking regulations left us no choice. It tied our lenders hands and deprived us of the lifeblood of our business capital.',
  'Republican'],
 ['Ca

It'll be useful for us to have a larger sample size than 2020 affords, since those speeches tend to be long and contiguous. Let's make a new list-of-lists called `conv_sent_data`. Instead of each first entry in the sublists being an entire speech, make each first entry just a sentence from the speech. Feel free to use NLTK's `sent_tokenize` [function](https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html).

In [7]:
conv_sent_data = []

for speech, party in convention_data :
    conv_sent_data.append([sent_tokenize(speech), party])
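
The loop above stores one entry per speech, with that speech's sentences held in a list. If one entry per sentence is wanted instead, as the instructions describe, a minimal sketch could look like the following. The names `explode_sentences` and `simple_split` are hypothetical, and the regex splitter is only a stand-in so the example runs without NLTK data; `sent_tokenize` could be passed as `splitter` instead.

```python
import re

def simple_split(text):
    """Naive sentence splitter (illustrative stand-in for nltk's sent_tokenize)."""
    # Split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def explode_sentences(speeches, splitter=simple_split):
    """Turn [speech, party] pairs into one [sentence, party] pair per sentence."""
    rows = []
    for speech, party in speeches:
        for sentence in splitter(speech):
            rows.append([sentence, party])
    return rows

# Made-up demo data, not rows from the convention database
demo = [["We will win. Join us!", "Democratic"],
        ["That I will bear arms.", "Republican"]]
sentence_rows = explode_sentences(demo)
print(sentence_rows)
```

With this layout each first element is a plain string, so any cleaning function that expects a list of sentences (and joins it) would need to accept a string instead.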
    

Again, let's look at some random entries. 

In [8]:
random.choices(conv_sent_data,k=5)

[[['So before we show you a film about our dad’s journey, we wanted to give Beau the last word.',
   'Beau.'],
  'Democratic'],
 [['This is a different kind of convention.'], 'Democratic'],
 [['He had been deported in September and had come back in October to terrorize our community.',
   'I am extremely grateful to President Trump and the FBI for their efforts to deliver justice for Jackie and all of the other innocent victims of violent crime.',
   'I am honored to support the President because he is supporting us.',
   'I know he will never stop fighting for justice, for law and order, for peace, security in our communities.'],
  'Republican'],
 [['That I will bear arms.'], 'Republican'],
 [['They didn’t want to hear Biden’s hollow words of empathy.',
   'They wanted their jobs back.',
   'As Vice President, he supported the transpacific partnership, which would have been a death sentence for the US auto industry.',
   'He backed the horrendous South Korea trade deal, which took man

Now it's time for our final cleaning before modeling. Go through `conv_sent_data` and take the following steps: 

1. Tokenize on whitespace
1. Remove punctuation
1. Remove tokens that fail the `isalpha` test
1. Remove stopwords
1. Casefold to lowercase
1. Join the remaining tokens into a string


In [9]:
punctuation = set(punctuation)
tw_punct = punctuation - {"#"}

In [10]:
# Function to perform tokenization and normalization

def get_cleaned_tokens(speech) :
    
    # Joining the list of sentences into a single string
    sentence = ' '.join(speech)
    
    # Removing punctuation
    sentence = "".join([char for char in sentence if char not in tw_punct])

    # Case folding and tokenization 
    sentence = [token.lower().strip() for token in sentence.split()]
    
    # Removing tokens that are not purely alphabetic
    sentence = [token for token in sentence if token.isalpha()]
  
    # Removing stop words
    sentence = [token for token in sentence if token not in sw]
        
    # Joining tokens into a string
    sentence = ' '.join(sentence)
        
    return sentence

In [11]:
clean_conv_sent_data = [] # list of tuples (sentence, party), with sentence cleaned

for speech_party in conv_sent_data :
    
    # passing index 0 to clean and extract the tokens
    processed_speech = get_cleaned_tokens(speech_party[0])
    
    clean_conv_sent_data.append((processed_speech, speech_party[1]))

random.choices(clean_conv_sent_data,k=5)

[('time fight believe join us', 'Democratic'),
 ('inaudible smiling okay', 'Democratic'),
 ('future really rests investment going investing trillion infrastructure ports bridges highways making sure access things really make difference like solar facility outside harrisburg scranton boy central pennsylvania okay northeast keep faith guys',
  'Democratic'),
 ('means much families grow single parent home serve boys girls right daily basis even covid us give moms opportunity take money check paycheck home back children able go school may opportunity otherwise means much us much greater opportunity individuals come together walk life people really able see positive change filled hope especially throughout time',
  'Republican'),
 ('failed american people catastrophically four years ago came convention said new yorkers know con see one tonight asking vote donald trump bad guy urging vote done bad job today unemployment historic highs small businesses struggling survive way ran mayor spent y

If that looks good, let's make our function to turn these into features. First we need to build our list of candidate words. I started my exploration at a cutoff of 5. 

In [12]:
word_cutoff = 5

tokens = [word for text, party in clean_conv_sent_data for word in text.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2255 as features in the model.
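
Since the cutoff is a starting point for exploration, it can help to see how the vocabulary size responds to it. The sketch below uses only `collections.Counter` and a made-up toy corpus; the helper name `vocab_size_by_cutoff` is hypothetical. It keeps the same rule as the cell above, where a word qualifies when its count is strictly greater than the cutoff.

```python
from collections import Counter

def vocab_size_by_cutoff(texts, cutoffs):
    """Return {cutoff: number of words occurring more than `cutoff` times}."""
    counts = Counter(word for text in texts for word in text.split())
    return {c: sum(1 for n in counts.values() if n > c) for c in cutoffs}

# Toy corpus for illustration only
toy_texts = ["jobs jobs jobs taxes", "jobs climate taxes", "climate vote"]
sizes = vocab_size_by_cutoff(toy_texts, cutoffs=[0, 1, 2])
print(sizes)
```

In the notebook, the cleaned sentence strings would take the place of `toy_texts`.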


In [13]:
def conv_features(text, fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    
    # Your code here
    ret_dict = dict()
    
    for token in text.split() :
        if token in fw :
            ret_dict[token] = True
      
    return ret_dict

In [14]:
assert(len(feature_words)>0)
assert(conv_features("obama was the president",feature_words)==
       {'obama':True,'president':True})
assert(conv_features("some people in america are citizens",feature_words)==
                     {'people':True,'america':True,"citizens":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [15]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [16]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [17]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.498
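
An accuracy figure is easier to interpret next to a simple baseline, such as always predicting the most common label. The following is a minimal sketch with made-up labels; `majority_baseline_accuracy` is a hypothetical helper, and in the notebook the labels could be taken from the test set, for example `[party for _, party in test_set]`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Invented label distribution, not taken from the convention data
toy_labels = ["Republican"] * 6 + ["Democratic"] * 4
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline: {baseline:.2f}")
```

A classifier that scores near or below this baseline is not adding information beyond the class proportions.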


In [18]:
classifier.show_most_informative_features(25)

Most Informative Features
             enforcement = True           Republ : Democr =     27.5 : 1.0
                   votes = True           Democr : Republ =     21.6 : 1.0
                 climate = True           Democr : Republ =     17.3 : 1.0
                 destroy = True           Republ : Democr =     17.1 : 1.0
                supports = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.9 : 1.0
                preserve = True           Republ : Democr =     15.1 : 1.0
                  signed = True           Republ : Democr =     15.1 : 1.0
              appreciate = True           Republ : Democr =     14.0 : 1.0
                freedoms = True           Republ : Democr =     14.0 : 1.0
                 private = True           Republ : Democr =     11.9 : 1.0
                  defund = True           Republ : Democr =     10.9 : 1.0
                    drug = True           Republ : Democr =     10.3 : 1.0

**Write a little prose here about what you see in the classifier. Anything odd or interesting?**

### My Observations

The classifier's accuracy is 0.498, so it labels roughly half of the test instances correctly. For a two-class problem that is about what random guessing would achieve, which means the model does not distinguish Republican from Democratic text in this setup. One likely contributor is that the feature sets were built from the raw `convention_data` rather than from `clean_conv_sent_data`, so capitalized or punctuated tokens never match the lowercase entries in `feature_words`. Among the most informative features, "enforcement" is 27.5 times more likely to appear in Republican text than in Democratic text, while "votes" is 21.6 times more likely and "climate" 17.3 times more likely to appear in Democratic text. Eleven of the 13 features shown lean Republican.
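
The ratios in the feature table compare how likely a word is to be present under one label versus the other. The sketch below shows that kind of calculation under an assumption: additive smoothing with 0.5 over two possible feature values, which to my understanding resembles NLTK's default estimator but is not guaranteed to reproduce its numbers. The function names are hypothetical and the counts are invented.

```python
def smoothed_presence_prob(n_with_word, n_docs, gamma=0.5, bins=2):
    """Smoothed P(word present | label) using additive smoothing (assumed scheme)."""
    return (n_with_word + gamma) / (n_docs + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(word present | label A) to P(word present | label B)."""
    prob_a = smoothed_presence_prob(count_a, total_a)
    prob_b = smoothed_presence_prob(count_b, total_b)
    return prob_a / prob_b

# Invented counts: the word appears in 27 of 1000 documents for label A
# and in 0 of 1000 documents for label B.
ratio = likelihood_ratio(27, 1000, 0, 1000)
print(f"{ratio:.1f} : 1.0")
```

Smoothing is what keeps the ratio finite when a word never appears under one label, which is also why very large ratios often come from words with few total occurrences.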


## Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [19]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [20]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming

The query results show that the tweet text comes back as byte strings, so we decode each tweet to a regular string with the `decode()` method.

In [21]:
tweet_data = []

# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.
for candidate, party, tweet in results :
    
    # Decode the byte string
    decoded_tweet = tweet.decode('utf-8')
    tweet_data.append((decoded_tweet, party))
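
The loop above assumes every tweet is a valid UTF-8 byte string. If some rows turned out to be `str` already, or contained malformed bytes, a slightly more defensive helper could be used. This is a sketch under that assumption, and `to_text` is a hypothetical name rather than part of the assignment.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as str, decoding bytes and passing str through unchanged."""
    if isinstance(value, bytes):
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
        return value.decode(encoding, errors="replace")
    return value

# Made-up samples covering bytes, str, multi-byte UTF-8, and an invalid byte
samples = [b"go tribe #rallytogether", "already text", b"caf\xc3\xa9", b"bad \xff byte"]
decoded = [to_text(s) for s in samples]
print(decoded)
```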

In [22]:
def clean_tweets(tweets) :
    
    # Removing punctuation
    tweets = "".join([char for char in tweets if char not in tw_punct])
    
    # Case folding and tokenization 
    tweets = [token.lower().strip() for token in tweets.split()]
  
    # Removing stop words
    tweets = [token for token in tweets if token not in sw]

    # Joining tokens into a string
    tweets = ' '.join(tweets)
    
    return tweets

In [23]:
clean_tweet_data = []

for tweet_party in tweet_data :
    
    processed_tweet = clean_tweets(tweet_party[0])
    clean_tweet_data.append([processed_tweet, tweet_party[1]])

There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [24]:
random.seed(20201014)

tweet_data_sample = random.choices(clean_tweet_data,k=10)

In [25]:
tweet_data_sample

[['earlier today spoke house floor abt protecting health care women praised ppmarmonte work central coast httpstcowqgtrzt7vv',
  'Democratic'],
 ['go tribe #rallytogether httpstco0nxutfl9l5', 'Democratic'],
 ['apparently trump thinks easy students overwhelmed crushing burden debt pay student loans #trumpbudget httpstcockyqo5t0qh',
  'Democratic'],
 ['we’re grateful first responders rescue personnel firefighters police volunteers working tirelessly keep people safe provide muchneeded help putting lives line httpstcoezpv0vmiz3',
  'Republican'],
 ['let’s make even greater #kag 🇺🇸 httpstcoy9qozd5l2z', 'Republican'],
 ['1hr cavs tie series 22 im #allin216 repbarbaralee scared #roadtovictory',
  'Democratic'],
 ['congrats belliottsd new gig sd city hall glad continue serve… httpstcofkvmw3cqdi',
  'Democratic'],
 ['really close 3500 raised toward match right whoot that’s 7000 nonmath majors room 😂 help us get httpstcotu34c472sd httpstcoqsdqkypsmc',
  'Democratic'],
 ['today comment period po

In [26]:

for tweet, party in tweet_data_sample :
    estimated_party = classifier.classify(conv_features(tweet, feature_words))
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")
    

Here's our (cleaned) tweet: earlier today spoke house floor abt protecting health care women praised ppmarmonte work central coast httpstcowqgtrzt7vv
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: go tribe #rallytogether httpstco0nxutfl9l5
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: apparently trump thinks easy students overwhelmed crushing burden debt pay student loans #trumpbudget httpstcockyqo5t0qh
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: we’re grateful first responders rescue personnel firefighters police volunteers working tirelessly keep people safe provide muchneeded help putting lives line httpstcoezpv0vmiz3
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: let’s make even greater #kag 🇺🇸 httpstcoy9qozd5l2z
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: 1h

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [27]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(clean_tweet_data)

for idx, tp in enumerate(clean_tweet_data) :
    tweet, party = tp    
    # Now do the same thing as above, but we store the results rather
    # than printing them. 
   
    # get the estimated party
    estimated_party = classifier.classify(conv_features(tweet, feature_words))
    
    results[party][estimated_party] += 1
    
    if idx > num_to_score : 
        break

In [28]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 3592, 'Democratic': 686}),
             'Democratic': defaultdict(int,
                         {'Republican': 4738, 'Democratic': 986})})

### Reflections

**Write a little about what you see in the results**

Of the 10,002 tweets scored, the classifier labeled 8,330 (about 83%) as Republican. For tweets from Republican candidates, it correctly identified 3,592 and misclassified 686 as Democratic, a recall of about 84%. For tweets from Democratic candidates, it correctly classified only 986 and labeled 4,738 as Republican, a recall of about 17%. Overall accuracy is about 46% (4,578 of 10,002), slightly below chance. The apparently strong result on Republican tweets therefore reflects a heavy bias toward predicting Republican rather than a real ability to separate the parties. The model may need additional features, or training text closer in style to tweets, to differentiate between the two.
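
The counts in the `results` output can be turned into overall accuracy and per-party recall with a few lines. The sketch below copies the four counts from that output into a plain dictionary; `summarize_confusion` is a hypothetical helper name.

```python
def summarize_confusion(confusion):
    """Compute overall accuracy and per-party recall from nested counts.

    `confusion[actual][predicted]` holds the number of tweets.
    """
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[p][p] for p in confusion)
    recall = {p: confusion[p][p] / sum(confusion[p].values()) for p in confusion}
    return correct / total, recall

# Counts copied from the `results` output above
counts = {
    "Republican": {"Republican": 3592, "Democratic": 686},
    "Democratic": {"Republican": 4738, "Democratic": 986},
}
accuracy, recall = summarize_confusion(counts)
print(f"accuracy={accuracy:.3f}")
for party, value in recall.items():
    print(f"recall[{party}]={value:.3f}")
```

Reporting recall per party alongside accuracy makes the skew toward one predicted label visible at a glance.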