## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [3]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

In [5]:
conn = sqlite3.connect("2020_Conventions.db")
cursor = conn.cursor()

In [8]:
cursor = conn.cursor()

# Inspect the table structure
cursor.execute("PRAGMA table_info(conventions);")
columns = cursor.fetchall()

# Fetch a few rows to see the data
cursor.execute("SELECT * FROM conventions LIMIT 5;")
sample_data = cursor.fetchall()

# Close the connection
conn.close()

columns

[(0, 'party', 'TEXT', 0, None, 0),
 (1, 'night', 'INTEGER', 0, None, 0),
 (2, 'speaker', 'TEXT', 0, None, 0),
 (3, 'speaker_count', 'INTEGER', 0, None, 0),
 (4, 'time', 'TEXT', 0, None, 0),
 (5, 'text', 'TEXT', 0, None, 0),
 (6, 'text_len', 'TEXT', 0, None, 0),
 (7, 'file', 'TEXT', 0, None, 0)]

### Part 1: Exploratory Naive Bayes

We'll first build an NB model on the convention data itself, as a way to understand which words distinguish the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text for each party and prepare it for use in Naive Bayes.

In [50]:
import re 
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Build the stop word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def clean_tokenize(text):
    # lowercase the text
    text = text.lower()
    # drop everything except letters, whitespace, and '#'
    text = re.sub(r"[^a-z\s#]", '', text)
    # tokenize the text
    tokens = word_tokenize(text)
    # remove stopwords
    tokens = [w for w in tokens if w not in STOP_WORDS]
    return tokens


convention_data = []

# Fill this list with (tokens, party) pairs. The first element is the
# cleaned and tokenized text as a list of tokens (it gets joined back
# into a single string when the features are built). The second
# element is the party.

conn = sqlite3.connect("2020_Conventions.db")
cursor = conn.cursor()

query = "SELECT text, party FROM conventions"
cursor.execute(query)
rows = cursor.fetchall()
conn.close()

for text, party in rows:
    tokens = clean_tokenize(text)
    convention_data.append((tokens, party))




Let's look at some random entries and see if they look right. 

In [51]:
random.choices(convention_data,k=2)

[(['joe',
   'bidens',
   'america',
   'radical',
   'left',
   'get',
   'whatever',
   'want',
   'get',
   'pay',
   'theyve',
   'already',
   'taken',
   'joe',
   'biden',
   'democratic',
   'party',
   'dont',
   'let',
   'take',
   'america'],
  'Republican'),
 (['oath'], 'Republican')]

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times (the code tests `count > word_cutoff`). Here's the code to test that if you want it.

In [24]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2327 as features in the model.


In [25]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    # Keep each distinct word of `text` that is also a feature word
    text_words = set(text.split())
    ret_dict = dict()

    for word in text_words:
        if word in fw:
            ret_dict[word] = True

    return ret_dict

In [26]:
conv_features("donald is the president",feature_words)

{'donald': True, 'president': True}

In [27]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)==
                     {'america':True,'american':True,"people":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [28]:
featuresets = [(conv_features(' '.join(text),feature_words), party) for (text, party) in convention_data]

In [29]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [30]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.494
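
An accuracy figure is easier to read next to a naive baseline. The sketch below is my own addition, not part of the assignment: `majority_baseline_accuracy` is a hypothetical helper and the labels are toy values. It computes the accuracy of always predicting the most common training label.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)

# Hypothetical labels, only to show the call. With the notebook's data these
# would be [party for _, party in train_set] and [party for _, party in test_set].
toy_train = ["Democratic"] * 6 + ["Republican"] * 4
toy_test = ["Democratic", "Republican", "Democratic", "Republican", "Democratic"]

label, baseline = majority_baseline_accuracy(toy_train, toy_test)
print(label, baseline)
```

If the classifier's test accuracy is not clearly above this baseline, the informative-features listing is still useful for exploration, but the model should not be trusted for prediction.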


In [31]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     25.8 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     14.9 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                  defund = True           Republ : Democr =     13.0 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

`Republican-Associated Words:` Words like "china", "enforcement", "destroy", "freedoms", "crime", "media", "defense", "defund", "religion", "trade", "flag", and "greatness" are strongly associated with Republican speeches. Their high likelihood ratios (for example, 25.8 : 1 for "china") mean that the word is far more likely to appear in Republican text than in Democratic text. The presence of "china" and "trade" suggests a focus on international relations and economic policy, which were likely significant topics for Republicans during the convention.

`Democratic-Associated Words:` Words like "votes" (23.8 : 1) and "climate" (17.8 : 1) are strongly associated with Democratic speeches, though far fewer of the most informative features lean Democratic than Republican. "climate" indicates an emphasis on environmental issues in Democratic speeches.
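
To make the ratios in the listing less of a black box, here is a minimal sketch of how such a number can be computed. It assumes add-0.5 smoothing over two outcomes per feature (word present or absent), which I believe mirrors NLTK's default estimator, but treat that as an assumption. `presence_ratio` and the toy data are hypothetical, not part of the notebook.

```python
from collections import Counter

def presence_ratio(featuresets, word, label_a, label_b, gamma=0.5):
    """Smoothed ratio P(word present | label_a) / P(word present | label_b)."""
    docs_per_label = Counter(label for _, label in featuresets)
    docs_with_word = Counter(
        label for features, label in featuresets if features.get(word)
    )

    def prob_present(label):
        # Two outcomes per feature (present or absent), each padded by gamma.
        return (docs_with_word[label] + gamma) / (docs_per_label[label] + 2 * gamma)

    return prob_present(label_a) / prob_present(label_b)


# Toy data (hypothetical), shaped like the (features, party) pairs above.
toy_featuresets = [
    ({"china": True}, "Republican"),
    ({"china": True, "jobs": True}, "Republican"),
    ({"jobs": True}, "Democratic"),
    ({"climate": True}, "Democratic"),
]

ratio = presence_ratio(toy_featuresets, "china", "Republican", "Democratic")
print(f"china: {ratio:.1f} : 1.0")  # prints "china: 5.0 : 1.0"
```

Because of the smoothing, a word that never appears for one party still gets a finite ratio, which is why rare words can dominate the listing when the counts are small.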



### Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and
is unindexed, so the query takes a minute or two to run on my machine.
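
If the wait gets annoying, one option is to index the join keys on a copy of the database. The sketch below is only an illustration on an in-memory stand-in: the table and column names come from the query used in this part, but the column types, the sample rows, and the index name `idx_tweets_join` are my assumptions.

```python
import sqlite3

# In-memory stand-in for congressional_data.db; the rows are made up.
demo_db = sqlite3.connect(":memory:")
demo_cur = demo_db.cursor()
demo_cur.execute(
    "CREATE TABLE candidate_data "
    "(candidate TEXT, party TEXT, district TEXT, twitter_handle TEXT)"
)
demo_cur.execute(
    "CREATE TABLE tweets (handle TEXT, candidate TEXT, district TEXT, tweet_text TEXT)"
)
demo_cur.execute(
    "INSERT INTO candidate_data VALUES ('A. Example', 'Democratic', 'XX-01', 'aexample')"
)
demo_cur.execute(
    "INSERT INTO tweets VALUES ('aexample', 'A. Example', 'XX-01', 'hello district')"
)

# Index the join keys on the large table (the index name is hypothetical).
demo_cur.execute(
    "CREATE INDEX IF NOT EXISTS idx_tweets_join ON tweets (handle, candidate, district)"
)

# Ask SQLite how it would run the join now that the index exists.
plan = demo_cur.execute(
    """
    EXPLAIN QUERY PLAN
    SELECT cd.candidate, cd.party, tw.tweet_text
    FROM candidate_data cd
    INNER JOIN tweets tw ON cd.twitter_handle = tw.handle
        AND cd.candidate = tw.candidate
        AND cd.district = tw.district
    """
).fetchall()
for step in plan:
    print(step[-1])

index_names = [
    row[0]
    for row in demo_cur.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
demo_db.close()
```

Building the index on the real tables takes time too, so this only pays off if you expect to rerun the query several times.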

In [147]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [33]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming
cong_db.close()  # Close the database connection


In [34]:
# Prepare tweet_data with sublists
tweet_data = [(row[2], row[1]) for row in results]  # (tweet_text, party)

len(tweet_data), tweet_data[:5] 

(664656,
 [(b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq',
   'Republican'),
  (b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6',
   'Republican'),
  (b'"NASA on the Square" event this Sat. 11AM \xe2\x80\x93 4PM. Stop by &amp; hear about the incredible work done in #AL05! @DowntownHSV http://t.co/R9zY8WMEpA',
   'Republican'),
  (b'"The trouble with Socialism is that eventually you run out of other people\'s money." - Margaret Thatcher https://t.co/X97g7wzQwJ',
   'Republican'),
  (b'"The trouble with socialism is eventually you run out of other people\'s money" \xe2\x80\x93 Thatcher. She\'ll be sorely missed. http://t.co/Z8gBnDQUh8',
   'Republican')])
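
Notice that the tweet text comes back as `bytes` (the `b'...'` prefix), with characters like the curly apostrophe showing up as escape sequences. A minimal sketch of decoding that tolerates bad bytes is shown here. `to_text` is a hypothetical helper, and UTF-8 is an assumption based on the `\xe2\x80\x99` sequences in the output above.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as str, decoding bytes and replacing undecodable bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    return value

# Shortened from one of the tweets shown above.
raw = b"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs"
decoded = to_text(raw)
print(decoded)
```

Using `errors="replace"` means one malformed tweet cannot stop a loop over several hundred thousand rows. The cost is that any undecodable byte becomes the replacement character.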

In [None]:
# tweet_data was already filled in the previous cell with (tweet_text, party)
# pairs, so it is not reset to an empty list here. Resetting it would discard
# the results of the slow query.



There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [35]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [36]:
len(tweet_data_sample)

10

In [47]:
random.seed(20201014)

# Clean and tokenize function
def clean_tokenize(text):
    if isinstance(text, bytes):
        text = text.decode('utf-8')  # Decode bytes to string
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = word_tokenize(text)
    return tokens

# Check the balance of the dataset
party_counts = defaultdict(int)
for tweet, party in tweet_data:
    party_counts[party] += 1

print(f"Democratic tweets: {party_counts['Democratic']}")
print(f"Republican tweets: {party_counts['Republican']}")

# Ensure we have a balanced dataset for training
dem_tweets = [tp for tp in tweet_data if tp[1] == 'Democratic']
rep_tweets = [tp for tp in tweet_data if tp[1] == 'Republican']
min_tweet_count = min(len(dem_tweets), len(rep_tweets))

balanced_tweet_data = dem_tweets[:min_tweet_count] + rep_tweets[:min_tweet_count]
random.shuffle(balanced_tweet_data)

# Convert the tokens back into strings for conv_features
featuresets = [(conv_features(' '.join(clean_tokenize(text)), feature_words), party) for (text, party) in balanced_tweet_data]

# Split the data into training and test sets
test_size = int(0.2 * len(featuresets))
train_set, test_set = featuresets[test_size:], featuresets[:test_size]

# Train the Naive Bayes classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Evaluate the classifier
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Classifier accuracy: {accuracy:.4f}")

# Sample 10 tweets
tweet_data_sample = random.choices(tweet_data, k=10)


for tweet, party in tweet_data_sample :
    # clean and tokenize the raw tweet
    tokens = clean_tokenize(tweet)
    # convert to a feature dict; join with spaces so conv_features
    # can split the string back into individual words
    features = conv_features(" ".join(tokens), feature_words)
    # estimate the party
    estimated_party = classifier.classify(features)
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifer says {estimated_party}.")
    print("")
    

Democratic tweets: 376125
Republican tweets: 288531
Classifier accuracy: 0.6140
Here's our (cleaned) tweet: b'Interesting read. http://t.co/BgcVCNpACE'
Actual party is Republican and our classifer says Democratic.

Here's our (cleaned) tweet: b'This week, the GOP will kick off their Natl Convention. Will we hear about what unites us or divides us? https://t.co/RmliqMGFtv #RNCinCLE'
Actual party is Democratic and our classifer says Democratic.

Here's our (cleaned) tweet: b'@patsox23 Done'
Actual party is Democratic and our classifer says Democratic.

Here's our (cleaned) tweet: b'Will be on @CNNnewsroom at 1:15 pm PT / 4:15 pm ET today.'
Actual party is Democratic and our classifer says Democratic.

Here's our (cleaned) tweet: b'.@SenJohnThune &amp; @SenToomey: Proposed SEC #JOBSAct regulations would have adverse effects on #SmallBiz &amp; investors   http://t.co/9I94uNhjS8'
Actual party is Republican and our classifer says Democratic.

Here's our (cleaned) tweet: b'Equal protection un

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [48]:
# Dictionary of counts by actual party and estimated party
parties = ['Republican', 'Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties:
    for p1 in parties:
        results[p][p1] = 0

num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data):
    tweet, party = tp    
    # Clean and tokenize the tweet
    tokens = clean_tokenize(tweet)
    # Convert to feature vector
    features = conv_features(' '.join(tokens), feature_words)
    # Estimate the party using the classifier
    estimated_party = classifier.classify(features)
    
    results[party][estimated_party] += 1
    
    if idx > num_to_score:
        break

# Print the results
for actual_party in results:
    for estimated_party in results[actual_party]:
        print(f"Actual: {actual_party}, Estimated: {estimated_party}, Count: {results[actual_party][estimated_party]}")

Actual: Republican, Estimated: Republican, Count: 1662
Actual: Republican, Estimated: Democratic, Count: 2699
Actual: Democratic, Estimated: Republican, Count: 799
Actual: Democratic, Estimated: Democratic, Count: 4842


In [49]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 1662, 'Democratic': 2699}),
             'Democratic': defaultdict(int,
                         {'Republican': 799, 'Democratic': 4842})})

### Reflections

- The classifier's accuracy is 61.40% on the held-out 20% of the balanced tweet set, so it predicts the party correctly for about 61% of those tweets. On a balanced set, always guessing one party would score 50%, so the model is somewhat effective but leaves significant room for improvement.
- On the 10,002 tweets scored above, overall accuracy is about 65% (6,504 correct). That sample is drawn from all of `tweet_data`, so it overlaps with the tweets the classifier was trained on and probably overstates performance on unseen tweets.

**`Republican Tweets:`**

- Correctly classified: 1,662 (about 38% of 4,361)
- Incorrectly classified as Democratic: 2,699 (about 62%)
- The classifier misclassifies most Republican tweets: more are labeled Democratic than are correctly labeled Republican.

**`Democratic Tweets:`**

- Correctly classified: 4,842 (about 86% of 5,641)
- Incorrectly classified as Republican: 799 (about 14%)
- The classifier performs much better on Democratic tweets, with far more correct classifications than incorrect ones.
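
The counts above can be turned into per-party recall and precision with a few lines. This is a sketch of my own rather than part of the assignment. The numbers are copied from the `results` output, and `counts` is just a plain-dict stand-in for it.

```python
# Counts copied from the `results` output above: counts[actual][estimated]
counts = {
    "Republican": {"Republican": 1662, "Democratic": 2699},
    "Democratic": {"Republican": 799, "Democratic": 4842},
}

total = sum(sum(row.values()) for row in counts.values())
correct = sum(counts[party][party] for party in counts)
accuracy = correct / total

for party in counts:
    actual_total = sum(counts[party].values())
    predicted_total = sum(counts[actual][party] for actual in counts)
    recall = counts[party][party] / actual_total        # share of actual tweets found
    precision = counts[party][party] / predicted_total  # share of predictions that are right
    print(f"{party}: recall {recall:.3f}, precision {precision:.3f}")

print(f"Overall accuracy: {accuracy:.3f} on {total} tweets")
```

Recall makes the imbalance explicit: the classifier recovers most Democratic tweets but well under half of the Republican ones, which suggests it leans toward predicting "Democratic".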