
Predictive NLP Classification of Presidential Rhetoric

Nate Hiatt (natehiatt@gmail.com), Nathan Bass (bassn727@gmail.com), Shelley Wang (ShelleyLWang@gmail.com)

Background

Much has been made of the increasing divide between Republicans and Democrats, but can we see evidence of this purported change in political rhetoric? In this project we run speeches made by Republican and Democratic presidential candidates through a machine learning model and test whether the model can accurately predict whether a speech was given by a Republican or a Democrat. Finally, we examine the words and phrases that matter most in determining party affiliation, and track whether presidential speech has become more or less identifiable (more polarized) over time.

We gathered a rich collection of political texts, including speeches from the Republican and Democratic National Conventions, Presidential Inaugural Addresses, and transcripts of Presidential Debates. These texts were sourced from the University of California, Santa Barbara's Presidency Project website. The collection focuses on material from 1960 onwards, the period for which complete debate transcripts are available. We used BeautifulSoup for web scraping, extracting the texts and assembling them into a structured dataframe for further processing.
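A minimal sketch of this scraping step is below. The URL list, tag names, and CSS class are placeholders for illustration; the actual pages may be structured differently.

```python
# Sketch of the scraping step; the URLs and the h1/div selectors are assumptions.
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_speech(url):
    """Fetch one document page and return its title and body text."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1").get_text(strip=True)           # assumed tag
    body = soup.find("div", class_="field-docs-content")   # assumed class
    return {"title": title, "text": body.get_text(" ", strip=True)}

# Collect the documents into a structured dataframe for later processing.
urls = ["https://example.org/speech-1", "https://example.org/speech-2"]  # placeholders
df = pd.DataFrame([scrape_speech(u) for u in urls])
```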

Data Understanding and Preparation

Our initial dataset comprised over 100 distinct texts, each labeled with a speaker name and year.

The first step in our data preparation involved tokenizing the textual content. We employed Python and the Natural Language Toolkit (nltk) for this purpose, segmenting the texts into individual tokens. This tokenization process resulted in a single list object, with each token as an element. The tokenized data was then fed into our analytical pipeline, where it was vectorized to facilitate modeling.
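A minimal tokenization sketch with nltk is shown below, assuming the dataframe from the scraping step has a "text" column; the column names and the lowercasing/filtering choices are illustrative.

```python
# Tokenize each document with nltk; column names are illustrative.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models, downloaded once

def tokenize(text):
    """Lowercase a document and split it into alphabetic word tokens."""
    return [tok.lower() for tok in word_tokenize(text) if tok.isalpha()]

df["tokens"] = df["text"].apply(tokenize)  # one list of tokens per document
```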

Modeling

We instantiated a Multinomial Naive Bayes model. This model uses Bayes' theorem to estimate the probability that a document belongs to a certain class (in this case, a political party). We also instantiated a Tf-Idf Vectorizer. This type of vectorizer is well suited to content-based classification because it weights tokens by their tf-idf score: the higher a word's tf-idf score, the more important it is in a given document relative to how common it is across all documents. In the images below, we graphed the top unigrams and bigrams for each political party; the higher up on the graph, the more important that word was in predicting whether a speech belonged to a Republican or a Democrat.

Figures: Republicans' top tokens; Democrats' top tokens.
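For concreteness, here is a sketch of the tf-idf + Multinomial Naive Bayes setup and a rough way to list the most indicative tokens per party. The "text"/"party" column names and the hyperparameters are illustrative, not necessarily what the notebooks use.

```python
# TF-IDF + Multinomial Naive Bayes sketch; column names and settings are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["party"], test_size=0.2, random_state=42, stratify=df["party"])

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")  # unigrams + bigrams
X_train_tfidf = vec.fit_transform(X_train)
X_test_tfidf = vec.transform(X_test)

mnb = MultinomialNB().fit(X_train_tfidf, y_train)
print("test accuracy:", mnb.score(X_test_tfidf, y_test))

# Most probable tokens per class (a rough proxy for the plotted importances).
terms = np.array(vec.get_feature_names_out())
for i, party in enumerate(mnb.classes_):
    top = terms[np.argsort(mnb.feature_log_prob_[i])[-10:]]
    print(party, "top tokens:", list(top))
```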

We also instantiated a Gaussian Naive Bayes model for thoroughness. However, GNB assumes that the features (words) follow a Gaussian (normal) distribution within each class. In text classification, the features are typically discrete word frequencies or tf-idf values, which do not follow a Gaussian distribution. In addition, GNB is better suited to continuous features, while MNB is specifically designed for discrete features like word counts. Thus, the GNB did not perform as well as the MNB.
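One practical note, continuing from the sketch above: scikit-learn's GaussianNB requires a dense array, so the sparse tf-idf matrix has to be densified first, which can be memory-heavy for large vocabularies.

```python
# Gaussian NB comparison sketch; densifying the tf-idf matrix is required.
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit(X_train_tfidf.toarray(), y_train)
print("GNB test accuracy:", gnb.score(X_test_tfidf.toarray(), y_test))
```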

Finally, for text data, the document-term matrix returned by a vectorizer is typically sparse, since each document lacks most tokens in the vocabulary (i.e., there are lots of 0 values). This means our model has a very large number of features/columns/words. Tree-based models work well with high-dimensional data, so we also instantiated a Random Forest model.

After grid searching all three models over different hyperparameters, we compared their accuracy scores to select the best-performing model, as sketched below.
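The sketch below shows one way to run that comparison, continuing from the train/test split above. The parameter grids are illustrative; Gaussian NB is left out of the loop because it needs dense input and was handled separately above.

```python
# Grid-search sketch comparing Multinomial NB and Random Forest; grids are illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

candidates = {
    "mnb": (MultinomialNB(), {"clf__alpha": [0.1, 0.5, 1.0]}),
    "rf": (RandomForestClassifier(random_state=42),
           {"clf__n_estimators": [200, 500], "clf__max_depth": [None, 20]}),
}

scores = {}
for name, (clf, grid) in candidates.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2))), ("clf", clf)])
    search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy").fit(X_train, y_train)
    scores[name] = search.best_estimator_.score(X_test, y_test)

print(scores)  # pick the model with the highest held-out accuracy
```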

Results

Our best model is the Multinomial Naive Bayes. On the training dataset, it correctly classified whether a speech was made by a Democrat or a Republican 98% of the time. On unseen testing data, its classifications were correct 87% of the time.

The gap between training and testing accuracy means the model is overfit and could use further tuning; we discuss one possible remedy, dimensionality reduction with Principal Component Analysis, under Next Steps.

Overall, we’re still getting highly accurate party predictions based just on what a candidate said.

A crucial application of this model is the ability to track changes in political rhetoric over time. If the model is more accurate on speeches from a certain year, that suggests rhetoric was more identifiable by party (more polarized) during that year. We visualized this below:

Figure: Rhetoric changes over time.
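A sketch of how such a per-year accuracy curve could be computed is shown below, continuing from the earlier sketches and assuming each document carries a "year" column (an assumption for illustration).

```python
# Per-year accuracy sketch; assumes df has a "year" column aligned with the documents.
import matplotlib.pyplot as plt
import pandas as pd

years = df.loc[X_test.index, "year"]
correct = pd.Series(mnb.predict(X_test_tfidf) == y_test.values, index=X_test.index)
by_year = correct.groupby(years).mean()  # fraction classified correctly per year

by_year.plot(marker="o")
plt.ylabel("classification accuracy")
plt.title("Party predictability of speeches by year")
plt.show()
```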

Next Steps

So what about going forward?

The model is overfit, which means it could use further tuning. One possible method is Principal Component Analysis. PCA reduces the number of features (and thus the complexity of the model) by combining features that vary together into a smaller set of components, while preserving as much of the variance as possible, so that a lower-dimensional space still captures most of the signal in the data. Our group did not have time to apply this method.
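A sketch of this proposed step is below. For a sparse tf-idf matrix, TruncatedSVD (latent semantic analysis) is the usual stand-in for PCA, which expects dense, mean-centered data; the component count is illustrative and must stay below the number of documents. A tree-based classifier is used downstream because Multinomial NB cannot accept the (possibly negative) SVD outputs.

```python
# Dimensionality-reduction sketch using TruncatedSVD as a PCA stand-in for sparse tf-idf.
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

reduced = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svd", TruncatedSVD(n_components=50, random_state=42)),  # fewer, denser features
    ("rf", RandomForestClassifier(random_state=42)),
])
reduced.fit(X_train, y_train)
print("reduced-feature test accuracy:", reduced.score(X_test, y_test))
```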

The model would also benefit from a larger data training set. In particular, it would be helpful to pull in campaign stops and other less formal speech occasions. Including candidates for party nominations who nonetheless failed to become the party nominee would also be worthwhile. It is worth considering bringing in other political rhetoric, not merely from those seeking presidential office, although that may go beyond the scope of this particular dataset and model.

With our trained model, there are other analyses that would be worth pursuing. To name a few: how much does rhetoric change before and after a politician becomes his or her party’s nominee? What about once they win the election? And how much does context affect rhetoric: a town hall, versus cable news, versus a formal press conference, and so on?

Finally, we would want to allow others to make use of this model as they see fit. It could be useful to the public to have a front-facing website that allows individuals to input text and get a likelihood prediction of the speaker's party affiliation.

Repo Structure

├── data
│   ├── 
├── Images
│   ├── 
├── Notebooks
│   ├── nathan_working.ipynb
│   ├── nate_scratch.ipynb
│   ├── shelley_scratch.ipynb
├── presentation.pdf
├── .gitignore
├── Final.ipynb
├── LICENSE
├── README.md
