## [Computational Social Science] Project 5: Natural Language Processing

In this project, you will use natural language processing techniques to explore a dataset containing tweets from members of the 116th United States Congress that met from January 3, 2019 to January 2, 2021. The dataset has also been cleaned to contain information about each legislator. Concretely, you will do the following:

* Preprocess the text of legislators' tweets
* Conduct Exploratory Data Analysis of the text
* Use sentiment analysis to explore differences between legislators' tweets
* Featurize text with manual feature engineering, frequency-based, and vector-based techniques
* Predict legislators' political parties and whether they are a Senator or Representative

You will explore two questions that relate to two central findings in political science and examine how they relate to the text of legislators' tweets. First, political scientists have argued that U.S. politics is currently highly polarized relative to other periods in American history, but also that the polarization is asymmetric. Historically, there were several conservative Democrats (i.e. "blue dog Democrats") and liberal Republicans (i.e. "Rockefeller Republicans"), as measured by popular measurement tools like [DW-NOMINATE](https://en.wikipedia.org/wiki/NOMINATE_(scaling_method)#:~:text=DW%2DNOMINATE%20scores%20have%20been,in%20the%20liberal%2Dconservative%20scale.). However, in the last few years, there are few if any examples of any Democrat in Congress being further to the right than any Republican and vice versa. At the same time, scholars have argued that this polarization is mostly a function of the Republican party moving further right than the Democratic party has moved left. **Does this sort of asymmetric polarization show up in how politicians communicate to their constituents through tweets?**

Second, the U.S. Congress is a bicameral legislature, and there has long been debate about partisanship in the Senate versus the House. The House of Representatives is apportioned by population and all members serve two year terms. In the Senate, each state receives two Senators and each Senator serves a term of six years. For a variety of reasons (smaller chamber size, more insulation from the voters, rules and norms like the filibuster, etc.), the Senate has been argued to be the "cooling saucer" of Congress in that it is more bipartisan and moderate than the House. **Does the theory that the Senate is more moderate have support in Senators' tweets?**

**Note**: See the project handout for more details on caveats and the data dictionary.

In [2]:
# pandas and numpy
import pandas as pd
import numpy as np

# punctuation, stop words and English language model
from string import punctuation
from spacy.lang.en.stop_words import STOP_WORDS
import en_core_web_sm
nlp = en_core_web_sm.load()

# textblob
from textblob import TextBlob

# countvectorizer, tfidfvectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# gensim
import gensim
from gensim import models

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
# load data 
# ----------
congress_tweets = pd.read_csv("data/116th Congressional Tweets and Demographics.csv")
# fill in this line of code with a sufficient number of tweets, depending on your computational resources
congress_tweets = congress_tweets.sample(n=5000)
congress_tweets.head()

Unnamed: 0,tweet_id,screen_name,datetime,text,name_wikipedia,position,joined_congress_date,birthday,gender,state,district_number,party,trump_2016_state_share,clinton_2016_state_share,obama_2012_state_share,romney_2012_state_share
476174,1.09545e+18,SteveDaines,2019-02-12T17:27:18-05:00,BIG DAY for conservation in Montana and across...,Steve Daines,Sen,3-Jan-15,8/20/1962,M,MT,Senate,Republican,279240,177709,201839,267928
545367,1.25996e+18,MartinHeinrich,2020-05-11T17:30:40-04:00,Even though this year's Santa Fe Indian Market...,Martin Heinrich,Sen,3-Jan-13,10/17/1971,M,NM,Senate,Democrat,319667,385234,415335,335788
154284,1.3335e+18,RepLindaSanchez,2020-11-30T14:50:36-05:00,RT @BillPascrell For years the IRS disproporti...,Linda Sánchez,Rep,3-Jan-03,1/28/1969,F,CA,38,Democrat,4483814,8753792,7854285,4839958
810427,1.25265e+18,SenTedCruz,2020-04-21T13:35:21-04:00,As the Senate finally moves ahead with additio...,Ted Cruz,Sen,3-Jan-13,12/22/1970,M,TX,Senate,Republican,4685047,3877868,3308124,4569843
17917,1.14985e+18,RepGosar,2019-07-12T21:15:54-04:00,RT @BreitbartNews Exclusive—@RepGosar: America...,Paul Gosar,Rep,3-Jan-11,11/22/1958,M,AZ,4,Republican,1252401,1161167,1025232,1233654


I'm not sure whether the new columns below should be added back into the dataset, because I'm only working with a small sample of it. I'll revisit this when I get to Part II of the project.

In [4]:
# Build a dictionary of each state's 2012 and 2016 presidential winner and store it in the dataset; we will probably look at change over time for RQ2
mappedstates = {}  # Create an empty dictionary to hold state mappings

# Iterate over rows for 2012 data
for index, row in congress_tweets.iterrows():
    if row['obama_2012_state_share'] > row['romney_2012_state_share']:
        mappedstates[row['state']] = ['Democrat2012', None]  # Assign Democrat for 2012
    else:
        mappedstates[row['state']] = ['Republican2012', None]  # Assign Republican for 2012

# Iterate over rows for 2016 data and update mappings
for index, row in congress_tweets.iterrows():
    if row['clinton_2016_state_share'] > row['trump_2016_state_share']:
        mappedstates[row['state']][1] = 'Democrat2016'  # Update 2016 to Democrat
    else:
        mappedstates[row['state']][1] = 'Republican2016'  # Update 2016 to Republican

# Create new columns for party affiliation in 2012 and 2016
congress_tweets['party_2012'] = congress_tweets['state'].map(lambda state: mappedstates.get(state, [None, None])[0])
congress_tweets['party_2016'] = congress_tweets['state'].map(lambda state: mappedstates.get(state, [None, None])[1])

# Check the updated DataFrame
congress_tweets.head()

Unnamed: 0,tweet_id,screen_name,datetime,text,name_wikipedia,position,joined_congress_date,birthday,gender,state,district_number,party,trump_2016_state_share,clinton_2016_state_share,obama_2012_state_share,romney_2012_state_share,party_2012,party_2016
476174,1.09545e+18,SteveDaines,2019-02-12T17:27:18-05:00,BIG DAY for conservation in Montana and across...,Steve Daines,Sen,3-Jan-15,8/20/1962,M,MT,Senate,Republican,279240,177709,201839,267928,Republican2012,Republican2016
545367,1.25996e+18,MartinHeinrich,2020-05-11T17:30:40-04:00,Even though this year's Santa Fe Indian Market...,Martin Heinrich,Sen,3-Jan-13,10/17/1971,M,NM,Senate,Democrat,319667,385234,415335,335788,Democrat2012,Democrat2016
154284,1.3335e+18,RepLindaSanchez,2020-11-30T14:50:36-05:00,RT @BillPascrell For years the IRS disproporti...,Linda Sánchez,Rep,3-Jan-03,1/28/1969,F,CA,38,Democrat,4483814,8753792,7854285,4839958,Democrat2012,Democrat2016
810427,1.25265e+18,SenTedCruz,2020-04-21T13:35:21-04:00,As the Senate finally moves ahead with additio...,Ted Cruz,Sen,3-Jan-13,12/22/1970,M,TX,Senate,Republican,4685047,3877868,3308124,4569843,Republican2012,Republican2016
17917,1.14985e+18,RepGosar,2019-07-12T21:15:54-04:00,RT @BreitbartNews Exclusive—@RepGosar: America...,Paul Gosar,Rep,3-Jan-11,11/22/1958,M,AZ,4,Republican,1252401,1161167,1025232,1233654,Republican2012,Republican2016


In [5]:
#Play around with formatting of temporal variables
congress_tweets['joined_congress_date'] = pd.to_datetime(congress_tweets['joined_congress_date'], format='%d-%b-%y')

congress_tweets['birthday'] = pd.to_datetime(congress_tweets['birthday'], format='%m/%d/%Y')

congress_tweets.head()

Unnamed: 0,tweet_id,screen_name,datetime,text,name_wikipedia,position,joined_congress_date,birthday,gender,state,district_number,party,trump_2016_state_share,clinton_2016_state_share,obama_2012_state_share,romney_2012_state_share,party_2012,party_2016
476174,1.09545e+18,SteveDaines,2019-02-12T17:27:18-05:00,BIG DAY for conservation in Montana and across...,Steve Daines,Sen,2015-01-03,1962-08-20,M,MT,Senate,Republican,279240,177709,201839,267928,Republican2012,Republican2016
545367,1.25996e+18,MartinHeinrich,2020-05-11T17:30:40-04:00,Even though this year's Santa Fe Indian Market...,Martin Heinrich,Sen,2013-01-03,1971-10-17,M,NM,Senate,Democrat,319667,385234,415335,335788,Democrat2012,Democrat2016
154284,1.3335e+18,RepLindaSanchez,2020-11-30T14:50:36-05:00,RT @BillPascrell For years the IRS disproporti...,Linda Sánchez,Rep,2003-01-03,1969-01-28,F,CA,38,Democrat,4483814,8753792,7854285,4839958,Democrat2012,Democrat2016
810427,1.25265e+18,SenTedCruz,2020-04-21T13:35:21-04:00,As the Senate finally moves ahead with additio...,Ted Cruz,Sen,2013-01-03,1970-12-22,M,TX,Senate,Republican,4685047,3877868,3308124,4569843,Republican2012,Republican2016
17917,1.14985e+18,RepGosar,2019-07-12T21:15:54-04:00,RT @BreitbartNews Exclusive—@RepGosar: America...,Paul Gosar,Rep,2011-01-03,1958-11-22,M,AZ,4,Republican,1252401,1161167,1025232,1233654,Republican2012,Republican2016


In [7]:
print(congress_tweets.columns)

Index(['tweet_id', 'screen_name', 'datetime', 'text', 'name_wikipedia',
       'position', 'joined_congress_date', 'birthday', 'gender', 'state',
       'district_number', 'party', 'trump_2016_state_share',
       'clinton_2016_state_share', 'obama_2012_state_share',
       'romney_2012_state_share', 'party_2012', 'party_2016'],
      dtype='object')


In [6]:
#Obviously a person's name does not change, even if their screen name does. What we want to do is capture all the screen names associated with a particular individual

# Create a mapping dictionary from 'screen_name' to 'name_wikipedia'
screenname_to_wiki = dict(zip(congress_tweets['screen_name'], congress_tweets['name_wikipedia']))

# Map the dictionary to create the 'mapped_wikipedia' column
congress_tweets['mapped_wikipedia'] = congress_tweets['screen_name'].map(screenname_to_wiki)

# Group by 'mapped_wikipedia' and aggregate 'screen_names' into a list, ensuring uniqueness
screenname_grouped = congress_tweets.groupby('mapped_wikipedia')['screen_name'].apply(lambda x: list(set(x))).reset_index()

# Rename the column in 'screenname_grouped' to 'screen_name_all'
screenname_grouped = screenname_grouped.rename(columns={'screen_name': 'screen_name_all'})

# Merge the aggregated 'screen_name_all' back into the original DataFrame
congress_tweets = congress_tweets.merge(screenname_grouped[['mapped_wikipedia', 'screen_name_all']], 
                                        on='mapped_wikipedia', 
                                        how='left', 
                                        suffixes=('', '_merged'))

# Identify and drop any redundant 'screen_name_all' columns, leaving only the correct one
columns_to_drop = [col for col in congress_tweets.columns if 'screen_name_all' in col and col != 'screen_name_all']
congress_tweets.drop(columns=columns_to_drop, inplace=True)

# Check result
congress_tweets[['name_wikipedia', 'screen_name', 'screen_name_all']].head()

In [None]:
# Filter rows where 'screen_name_all' contains more than one screen name
multiple_screen_names = congress_tweets[congress_tweets['screen_name_all'].apply(len) > 1]

# Display rows with multiple screen names
print(multiple_screen_names[['screen_name_all']])


Lol, so there seem to be no updated screen names in this sample. I'm going to keep this code, though, since the project PDF indicated that some individuals will have more than one screen name.

## Preprocessing

The first step in working with text data is to preprocess it. Make sure you do the following:

* Remove punctuation and stop words. The `rem_punc_stop()` function we used in lab is provided to you but you should feel free to edit it as necessary for other steps
* Remove tokens that occur frequently in tweets, but may not be helpful for downstream classification. For instance, many tweets contain a flag for retweeting, or share a URL 

As you search online, you might run into solutions that rely on regular expressions. You are free to use these, but you should also be able to preprocess using the techniques we covered in lab. Specifically, we encourage you to use spaCy's token attributes and string methods to do some of this text preprocessing.
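As a rough illustration of the string-method route mentioned above, here is a minimal sketch that drops retweet flags, @mentions, and URLs before any tokenization. The function name `strip_tweet_noise` and the example tweet are made up for illustration, and whether mentions should be dropped at all is an analytic choice rather than a requirement of the assignment.

```python
def strip_tweet_noise(text):
    """Drop retweet/quote-tweet flags, @mentions, and URLs using only string methods."""
    kept = []
    for piece in text.split():
        if piece in ("RT", "QT"):          # retweet / quote-tweet flags
            continue
        if piece.startswith("@"):          # mentions of other accounts
            continue
        if piece.startswith(("http://", "https://", "www.")):  # shared links
            continue
        kept.append(piece)
    return " ".join(kept)


# Hypothetical tweet text, not taken from the dataset
example = "RT @SomeUser Great news for our state today https://t.co/example"
cleaned = strip_tweet_noise(example)
print(cleaned)
```

A cleaner like this could run before `rem_punc_stop()`, so that spaCy only sees the words that matter for classification.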

In [None]:
import html  # decode HTML escape codes such as &amp;
from string import punctuation

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

def rem_punc_stop(text):
    stop_words = STOP_WORDS  # Set of common stop words in English
    punc = set(punctuation)  # Set of punctuation characters

    # Decode any HTML escape codes like &amp;
    text = html.unescape(text)

    # Replace newline characters with spaces and trim surrounding whitespace
    text = text.replace('\n', ' ').strip()

    # Tokenize with spaCy BEFORE stripping punctuation; otherwise URLs lose
    # their "://" and "." characters and token.like_url can no longer detect them
    doc = nlp(text)

    # Drop URLs, whitespace tokens, and the retweet / quote-tweet flags
    kept = [token.text for token in doc
            if not token.like_url and not token.is_space and token.text not in {'RT', 'QT'}]

    # Remove punctuation characters from each remaining token
    punc_free = ["".join(ch for ch in word if ch not in punc) for word in kept]

    # Drop empty tokens and stop words (case-insensitive match; original casing is kept)
    return [word for word in punc_free if word and word.lower() not in stop_words]


In [None]:
# apply the function to the 'text' i.e. tweet column
# ----------
congress_tweets['tokens'] = congress_tweets['text'].map(lambda x: rem_punc_stop(x)) # can use apply here 
congress_tweets['tokens'] # visualize

## Exploratory Data Analysis

Use two of the techniques we covered in lab (or other techniques outside of lab!) to explore the text of the tweets. You should construct these visualizations with an eye toward the eventual classification tasks: (1) predicting the legislator's political party based on the text of their tweet, and (2) predicting whether the legislator is a Senator or Representative. As a reminder, in lab we covered word frequencies, word clouds, word/character counts, scattertext, and topic modeling as possible exploration tools. 
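To make the word-frequency option concrete with the classification tasks in mind, here is a minimal sketch that counts the most common tokens separately for each group. The helper name `top_tokens_by_group` and the toy token lists are assumptions for illustration; they are not part of the lab code.

```python
from collections import Counter


def top_tokens_by_group(token_lists, labels, n=3):
    """Return the n most common lowercased tokens for each label."""
    counts = {}
    for tokens, label in zip(token_lists, labels):
        counts.setdefault(label, Counter()).update(t.lower() for t in tokens)
    return {label: counter.most_common(n) for label, counter in counts.items()}


# Toy data standing in for the 'tokens' and 'party' columns
toy_tokens = [["health", "care", "families"],
              ["tax", "cuts", "jobs"],
              ["health", "workers"],
              ["jobs", "border"]]
toy_party = ["Democrat", "Republican", "Democrat", "Republican"]

top = top_tokens_by_group(toy_tokens, toy_party, n=1)
print(top)
```

In the notebook, the same helper could be called as `top_tokens_by_group(congress_tweets['tokens'], congress_tweets['party'])`, or with `congress_tweets['position']` as the labels for the Senator v. Representative comparison.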

### EDA 1

In [None]:
# join each tweet's tokens, then all tweets, into one long string
from wordcloud import WordCloud
import matplotlib.pyplot as plt
text = ' '.join(congress_tweets['tokens'].map(lambda tokens: ' '.join(tokens)))

# create WordCloud visualization using the "text" object 
wordcloud = WordCloud(background_color = "white",  # set background color to white
                      random_state=41              # set random state to ensure same word cloud each time
                      ).generate(text)             # build the word cloud from the joined text


# plot 
plt.imshow(wordcloud,                  # specify wordcloud
           interpolation = 'bilinear') # smooths the rendered image
plt.axis('off')                        # turn off axes
plt.show()                             # show the plot

### EDA 2-ish

### Retweets and Quote Tweets

I think that retweets and quote tweets are meaningful signals, though we might not want them skewing our sentiment analysis. 
First, let's pull them out to see how prevalent they actually are (keeping in mind that this is a small subset of a very large dataset).

In [None]:
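# Find all tweets flagged as retweets ("RT") or quote tweets ("QT").
# The word boundaries (\b) avoid matching the same letters inside longer words such as "SUPPORT".
RTtweets = congress_tweets[congress_tweets['text'].str.contains(r'\bRT\b', na=False)]
QTtweets = congress_tweets[congress_tweets['text'].str.contains(r'\bQT\b', na=False)]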

# Print the count of retweets and quote tweets
print(f"Number of tweets containing Retweets: {RTtweets.shape[0]}")
print(f"Number of tweets containing Quote tweets: {QTtweets.shape[0]}")

Now there are some analytic choices here. 
For quote tweets, we will probably need to conceptualize this as a conversation, where the sentiment/message of the original tweet can be the opposite of the re-poster's.
However, should 'simple' re-tweets be considered representative of the person doing the re-tweeting? 
Or should they be attributed to the original speaker(s) instead? What should we do if multiple people are being referenced?
Let's see if people are re-tweeting/quote-tweeting other individuals who are also in our database, since we can only reasonably infer characteristics about people being re-tweeted if they are already in our database.

In [None]:
repost_users = congress_tweets[congress_tweets['text'].str.contains('@', na=False)]

print(repost_users[['text']].head())

# Print the count of tweets with an @
print(f"Total number of tweets containing an @: {repost_users.shape[0]}")

In [None]:
import re

# Look up each legislator's party by screen name (built once, instead of filtering the DataFrame for every mention)
party_by_screen_name = dict(zip(congress_tweets['screen_name'], congress_tweets['party']))

mentionedusers = []  # One record per (tweet, mentioned legislator) pair

# Iterate over rows of congress_tweets
for index, row in congress_tweets.iterrows():
    # Extract mentions (users starting with @) from the tweet text
    mentions = re.findall(r'@([a-zA-Z0-9_]+)', row['text'])

    for mention in mentions:
        # Keep only mentions of legislators whose screen name appears in this sample
        if mention in party_by_screen_name:
            # Store the tweet text, the author's party, and the mentioned person's party
            mentionedusers.append({
                'mentioned_screen_name': mention,
                'mentioner_screen_name': row['screen_name'],  # Tweet author's screen name
                'mentioner_text': row['text'],
                'mentioner_affiliation': row['party'],
                'mentioned_affiliation': party_by_screen_name[mention]
            })

# Print only the first 5 mentions of legislators in the sample, including political affiliation and the tweet author
for details in mentionedusers[:5]:
    print(f"Tweet Author: {details['mentioner_screen_name']}")
    print(f"Author Political Affiliation: {details['mentioner_affiliation']}")
    print(f"Mentioned User: {details['mentioned_screen_name']}")
    print(f"Mentioned User Political Affiliation: {details['mentioned_affiliation']}\n")
    print(f"Tweet Text: {details['mentioner_text']}")
    print("_________")

print(f"Number of mentions: {len(mentionedusers)}")


From this, we can see the different ways users are interacting with one another in the form of @'s, retweets, and quote tweets. 

## Sentiment Analysis

Next, let's analyze the sentiments contained within the tweets. You may use TextBlob or another library for these tasks. Do the following:

* Choose two legislators, one who you think will be more liberal and one who you think will be more conservative, and analyze their sentiment and/or subjectivity scores per tweet. For instance, you might do two scatterplots that plot each legislator's sentiment against their subjectivity, or two density plots for their sentiments. Do the scores match what you thought?
* Plot two more visualizations like the ones you chose in the first part, but do them to compare (1) Democrats v. Republicans and (2) Senators v. Representatives 

`TextBlob` has already been imported in the top cell.

In [63]:
...

Ellipsis

## Featurization

Before going to classification, explore different featurization techniques. Create three dataframes or arrays to represent your text features, specifically:

* Features engineered from your previous analysis. For example, word counts, sentiment scores, topic model etc.
* A term frequency-inverse document frequency matrix. 
* An embedding-based featurization (like a document averaged word2vec)

In the next section, you will experiment with each of these featurization techniques to see which one produces the best classifications.
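For the frequency-based option, here is a minimal sketch of building a tf-idf matrix with scikit-learn's `TfidfVectorizer`, assuming the tokens have already been preprocessed into lists. The toy token lists are made up; joining them back into strings is one simple way to reuse the vectorizer's default tokenizer, not the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy token lists standing in for the preprocessed 'tokens' column
toy_tokens = [["health", "care", "families"],
              ["tax", "cuts", "jobs"],
              ["health", "care", "jobs"]]

# TfidfVectorizer expects strings by default, so join each token list
docs = [" ".join(tokens) for tokens in toy_tokens]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

print(tfidf_matrix.shape)
print(sorted(vectorizer.vocabulary_))  # the learned vocabulary terms
```

Each row of `tfidf_matrix` is one tweet and each column one vocabulary term, so the matrix can be passed directly to most scikit-learn classifiers.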

In [None]:
...

### Engineered Text Features

In [None]:
# Engineered Features
...

### Bag-of-words or Tf-idf

In [None]:
# Frequency Based featurization
...

### Word Embedding

In [None]:
# Load Word2Vec model from Google; OPTIONAL depending on your computational resources (the file is ~1 GB)
# Also note that this file path assumes that the word vectors are underneath 'data'; you may wish to point to the CSS course repo and change the path
# or move the vector file to the project repo 

#model = gensim.models.KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary = True) 

In [None]:
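# Function to average word embeddings for a document; use examples from lab to apply this function. You can also use other techniques such as PCA and doc2vec instead.
import numpy as np
def document_vector(word2vec_model, doc):
    # keep only the words that the embedding model has a vector for
    doc = [word for word in doc if word in word2vec_model]
    if not doc:  # no known words: return a zero vector rather than averaging nothing
        return np.zeros(word2vec_model.vector_size)
    return np.mean(word2vec_model[doc], axis=0)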

In [None]:
# embedding based featurization
...

## Classification

Either use cross-validation or partition your data with training/validation/test sets for this section. Do the following:

* Choose a supervised learning algorithm such as logistic regression, random forest etc. 
* Train six models. For each of the three dataframes you created in the featurization part, train one model to predict whether the author of the tweet is a Democrat or Republican, and a second model to predict whether the author is a Senator or Representative.
* Report the accuracy and other relevant metrics for each of these six models.
* Choose the featurization technique associated with your best model. Combine those text features with non-text features. Train two more models: (1) A supervised learning algorithm that uses just the non-text features and (2) a supervised learning algorithm that combines text and non-text features. Report accuracy and other relevant metrics. 

If time permits, you are encouraged to use hyperparameter tuning or AutoML techniques like TPOT, but are not explicitly required to do so.
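One way the cross-validation route could look is sketched below, using logistic regression and scikit-learn's `cross_validate`. The feature matrix here is synthetic (from `make_classification`) purely so the example runs on its own; in the project, `X` would be one of the three feature sets and `y` a binary label such as party or chamber.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for a feature matrix and a binary label
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

clf = LogisticRegression(max_iter=1000)

# 5-fold cross-validation, reporting accuracy and F1 for each fold
results = cross_validate(clf, X, y, cv=5, scoring=["accuracy", "f1"])

mean_accuracy = results["test_accuracy"].mean()
mean_f1 = results["test_f1"].mean()
print(f"mean accuracy: {mean_accuracy:.3f}, mean F1: {mean_f1:.3f}")
```

Reporting F1 alongside accuracy matters more for the Senator v. Representative task, since Representatives greatly outnumber Senators and a model can score high accuracy by always predicting the majority class.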

### Train Six Models with Just Text

In [None]:
# six models ([engineered features, frequency-based, embedding] * [democrat/republican, senator/representative])
...

### Two Combined Models

In [None]:
# two models ([best text features + non-text features] * [democrat/republican, senator/representative])
...

## Discussion Questions

1. Why do standard preprocessing techniques need to be further customized to a particular corpus?

**YOUR ANSWER HERE** ...

2. Did you find evidence for the idea that Democrats and Republicans have different sentiments in their tweets? What about Senators and Representatives?

**YOUR ANSWER HERE** ...

3. Why is validating your exploratory and unsupervised learning approaches with a supervised learning algorithm valuable?

**YOUR ANSWER HERE** ...

4. Did text only, non-text only, or text and non-text features together perform the best? What is the intuition behind combining text and non-text features in a supervised learning algorithm?

**YOUR ANSWER HERE** ...