# Introduction

Presidential approval polls are a key metric for understanding public sentiment about political events. However, traditional polling is time- and labor-intensive. Social media now offers a large volume of data on public sentiment about many topics, including politics. The goal of this project is to build a model that uses social media comments to estimate a presidential approval rating.

## Approval Rating Data

Presidential approval ratings were collected from [FiveThirtyEight](https://projects.fivethirtyeight.com/biden-approval-rating/). This project uses only data from Biden's presidency, so the dataset covers January 20, 2021 through September 15, 2021.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from pmaw import PushshiftAPI
import datetime as dt

from textblob import TextBlob

In [2]:
biden = pd.read_csv('Data/Biden.csv')

In [3]:
biden['pollster'].value_counts()

Morning Consult                                    644
Rasmussen Reports/Pulse Opinion Research           320
YouGov                                             139
Ipsos                                               76
HarrisX                                             44
Global Strategy Group/GBAO (Navigator Research)     32
Marist College                                      26
Léger                                               24
RMG Research                                        24
American Research Group                             21
IBD/TIPP                                            20
Quinnipiac University                               18
Gallup                                              16
Echelon Insights                                    14
AP-NORC                                             14
Monmouth University                                 12
McLaughlin & Associates                             12
Harris Poll                                         12
Cor Servic

In [3]:
biden.head()

Unnamed: 0,president,subgroup,modeldate,startdate,enddate,pollster,grade,samplesize,population,weight,...,disapprove,adjusted_approve,adjusted_disapprove,multiversions,tracking,url,poll_id,question_id,createddate,timestamp
0,Joseph R. Biden Jr.,All polls,9/15/2021,1/21/2021,2/2/2021,Gallup,B+,906.0,a,1.314707,...,37.0,56.394227,36.431592,,,https://news.gallup.com/poll/329348/biden-begi...,74344,139651,2/4/2021,9/15/2021 13:24
1,Joseph R. Biden Jr.,All polls,9/15/2021,1/28/2021,2/1/2021,Quinnipiac University,A-,1075.0,a,1.53249,...,36.0,51.37561,35.903037,,,https://poll.qu.edu/poll-release?releaseid=3766,74334,139617,2/3/2021,9/15/2021 13:24
2,Joseph R. Biden Jr.,All polls,9/15/2021,1/27/2021,2/1/2021,Global Strategy Group/GBAO (Navigator Research),B/C,1005.0,rv,0.915957,...,39.0,52.49792,38.182669,,,https://navigatorresearch.org/wp-content/uploa...,74324,139564,2/2/2021,9/15/2021 13:24
3,Joseph R. Biden Jr.,All polls,9/15/2021,1/28/2021,2/1/2021,AP-NORC,,1055.0,a,1.427064,...,38.0,57.39818,38.411874,,,https://apnorc.org/wp-content/uploads/2021/02/...,74336,139629,2/4/2021,9/15/2021 13:24
4,Joseph R. Biden Jr.,All polls,9/15/2021,1/28/2021,2/1/2021,Rasmussen Reports/Pulse Opinion Research,B,1500.0,lv,0.565827,...,45.0,53.40231,39.062956,,T,https://www.rasmussenreports.com/public_conten...,74325,139568,2/2/2021,9/15/2021 13:24


In [4]:
biden.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1595 entries, 0 to 1594
Data columns (total 22 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   president            1595 non-null   object 
 1   subgroup             1595 non-null   object 
 2   modeldate            1595 non-null   object 
 3   startdate            1595 non-null   object 
 4   enddate              1595 non-null   object 
 5   pollster             1595 non-null   object 
 6   grade                1541 non-null   object 
 7   samplesize           1587 non-null   float64
 8   population           1595 non-null   object 
 9   weight               1595 non-null   float64
 10  influence            1595 non-null   float64
 11  approve              1595 non-null   float64
 12  disapprove           1595 non-null   float64
 13  adjusted_approve     1595 non-null   float64
 14  adjusted_disapprove  1595 non-null   float64
 15  multiversions        24 non-null     o

In [5]:
biden.drop(columns=['subgroup', 'president', 'modeldate', 'multiversions', 'tracking', 'url', 
                 'poll_id', 'question_id', 'createddate', 'timestamp'], inplace=True)

In [6]:
biden.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1595 entries, 0 to 1594
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   startdate            1595 non-null   object 
 1   enddate              1595 non-null   object 
 2   pollster             1595 non-null   object 
 3   grade                1541 non-null   object 
 4   samplesize           1587 non-null   float64
 5   population           1595 non-null   object 
 6   weight               1595 non-null   float64
 7   influence            1595 non-null   float64
 8   approve              1595 non-null   float64
 9   disapprove           1595 non-null   float64
 10  adjusted_approve     1595 non-null   float64
 11  adjusted_disapprove  1595 non-null   float64
dtypes: float64(7), object(5)
memory usage: 149.7+ KB
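
The date columns above are still stored as strings (`object`), and each poll carries its own `weight`. As a minimal sketch of how the cleaned polls could be reduced to one number per day, the block below parses `enddate` and takes a weight-weighted mean of `approve`. The toy rows and the names `polls`, `sums`, and `daily_approval` are hypothetical, and a plain weighted mean is my assumption, not necessarily how FiveThirtyEight computes its own average.

```python
import pandas as pd

# Toy rows that mimic three of the columns kept above; the values are made up.
polls = pd.DataFrame({
    "enddate": ["2/1/2021", "2/1/2021", "2/2/2021"],
    "approve": [50.0, 54.0, 57.0],
    "weight": [1.0, 3.0, 2.0],
})

# Parse the month/day/year strings into real timestamps.
polls["enddate"] = pd.to_datetime(polls["enddate"], format="%m/%d/%Y")

# Weighted mean per day: sum(approve * weight) / sum(weight).
polls["weighted"] = polls["approve"] * polls["weight"]
sums = polls.groupby("enddate")[["weighted", "weight"]].sum()
daily_approval = sums["weighted"] / sums["weight"]
```

With real data, `polls` would be replaced by the `biden` frame; rows with a zero total weight on a given day would need separate handling.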


## Collecting Social Media Data

For this project, I will use data from Reddit. Because Reddit is organized into subreddits by interest, it is easy both to filter for political content and to sample a range of viewpoints. I will collect data from six subreddits: r/politics and r/news (the two largest general political subreddits), r/liberal and r/democrats (two of the largest left-leaning political subreddits), and r/conservative and r/libertarian (two of the largest right-leaning political subreddits).

In [7]:
api = PushshiftAPI()

before = int(dt.datetime(2021,9,15,0,0).timestamp())
after = int(dt.datetime(2021,1,20,0,0).timestamp())

subs = ['politics', 'news', 'conservative', 'liberal', 'libertarian', 'democrats']
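
One caveat with the window above: calling `.timestamp()` on a naive `datetime` interprets it in the machine's local time zone, so the boundaries shift depending on where the notebook runs. A minimal sketch of a time-zone-independent alternative follows; the helper name `utc_epoch` and the variables `after_utc` and `before_utc` are hypothetical.

```python
import datetime as dt

def utc_epoch(year, month, day):
    """Return epoch seconds for midnight UTC on the given date."""
    moment = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(moment.timestamp())

# Same window as above, but pinned to UTC rather than local time.
after_utc = utc_epoch(2021, 1, 20)
before_utc = utc_epoch(2021, 9, 15)
```

These values could be passed as `after` and `before` in place of the locally derived ones.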

In [8]:
def get_data(subreddit):
    # Request up to `limit` comments from one subreddit within the date window.
    limit = 100000
    comments = api.search_comments(subreddit=subreddit, limit=limit, before=before, after=after)
    return pd.DataFrame(comments)

def combine_data(subs):
    # Collect each subreddit separately, then concatenate once.
    return pd.concat([get_data(sub) for sub in subs])

In [9]:
reddit = combine_data(subs)

Total:: Success Rate: 84.67% - Requests: 1194 - Batches: 120 - Items Remaining: 0
Total:: Success Rate: 90.25% - Requests: 1118 - Batches: 112 - Items Remaining: 0
Total:: Success Rate: 85.17% - Requests: 1187 - Batches: 119 - Items Remaining: 0
69913 result(s) not found in Pushshift
Total:: Success Rate: 77.23% - Requests: 470 - Batches: 47 - Items Remaining: 1
1 result(s) not found in Pushshift
Total:: Success Rate: 88.17% - Requests: 1150 - Batches: 115 - Items Remaining: 0
Total:: Success Rate: 93.84% - Requests: 1201 - Batches: 121 - Items Remaining: 0
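
The log above reports that 69913 results were not found for one request, so at least one subreddit returned far fewer comments than the 100000 limit. A minimal sketch of how the balance across subreddits could be checked is shown below on a toy frame; `sample`, `counts`, `shares`, and `underfilled` are hypothetical names, and only the `subreddit` column is assumed.

```python
import pandas as pd

# Toy stand-in for the combined frame; only the 'subreddit' column matters here.
sample = pd.DataFrame(
    {"subreddit": ["politics"] * 5 + ["news"] * 5 + ["liberal"] * 2}
)

# Comments per subreddit and each subreddit's share of the total.
counts = sample["subreddit"].value_counts()
shares = counts / counts.sum()

# Subreddits that returned fewer comments than the largest one.
underfilled = counts[counts < counts.max()].index.tolist()
```

Running the same three lines on `reddit` would show which communities are underrepresented before any sentiment is aggregated.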


In [12]:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 530086 entries, 0 to 99999
Data columns (total 50 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   all_awardings                    530086 non-null  object 
 1   associated_award                 0 non-null       object 
 2   author                           530086 non-null  object 
 3   author_flair_background_color    138693 non-null  object 
 4   author_flair_css_class           20339 non-null   object 
 5   author_flair_template_id         46511 non-null   object 
 6   author_flair_text                76278 non-null   object 
 7   author_flair_text_color          156693 non-null  object 
 8   awarders                         509693 non-null  object 
 9   body                             530086 non-null  object 
 10  collapsed_because_crowd_control  0 non-null       object 
 11  comment_type                     0 non-null       object 
 12  cre

In [13]:
reddit.drop(columns = ['all_awardings', 'associated_award', 'author_flair_background_color', 
                        'author_flair_css_class', 'author_flair_template_id', 'author_flair_text', 
                        'author_flair_text_color', 'awarders', 'collapsed_because_crowd_control', 
                        'comment_type', 'gildings', 'id', 'is_submitter', 'link_id', 'locked', 'no_follow', 'parent_id', 
                        'retrieved_on', 'send_replies', 'stickied', 'subreddit_id', 'top_awarded_type', 'treatment_tags',
                        'author_flair_richtext', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 
                        'author_premium', 'distinguished', 'author_cakeday', 'collapsed_reason_code', 
                        'archived', 'body_sha1', 'can_gild', 'collapsed', 'collapsed_reason', 'controversiality', 
                        'gilded', 'retrieved_utc', 'score_hidden', 'subreddit_name_prefixed', 'subreddit_type', 
                        'edited'], inplace=True)

reddit['created_utc'] = pd.to_datetime(reddit['created_utc'], unit='s')

In [14]:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 530086 entries, 0 to 99999
Data columns (total 7 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   author                 530086 non-null  object        
 1   body                   530086 non-null  object        
 2   created_utc            530086 non-null  datetime64[ns]
 3   permalink              530086 non-null  object        
 4   score                  530086 non-null  int64         
 5   subreddit              530086 non-null  object        
 6   total_awards_received  530086 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(4)
memory usage: 32.4+ MB


## Sentiment Analysis

Next, I will use TextBlob to score each comment's polarity (from -1, most negative, to 1, most positive) and subjectivity (from 0, objective, to 1, subjective).

In [15]:
reddit['body'] = reddit['body'].astype(str)

def subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

def polarity(text):
    return TextBlob(text).sentiment.polarity

reddit['Polarity'] = reddit['body'].apply(polarity)
reddit['Subjectivity'] = reddit['body'].apply(subjectivity)

In [16]:
def get_sentiment(score):
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'

reddit['Sentiment'] = reddit['Polarity'].apply(get_sentiment)
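
The labeling rule above depends only on the sign of the polarity score. As a sketch of the same rule applied to a whole column at once, the block below uses `numpy.select` on a few polarity values taken from the sample rows in this notebook; the names `polarity`, `labels`, and `sentiment` are hypothetical.

```python
import numpy as np
import pandas as pd

# A few polarity scores like those in the sample rows.
polarity = pd.Series([-0.284821, 0.0, 0.4125, -0.051389, 0.1])

# Negative below zero, neutral at exactly zero, positive otherwise.
labels = np.select(
    [polarity < 0, polarity == 0],
    ["Negative", "Neutral"],
    default="Positive",
)
sentiment = pd.Series(labels, index=polarity.index)
```

On several hundred thousand rows, this avoids one Python function call per comment, though the result should match `apply(get_sentiment)`.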

In [17]:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 530086 entries, 0 to 99999
Data columns (total 10 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   author                 530086 non-null  object        
 1   body                   530086 non-null  object        
 2   created_utc            530086 non-null  datetime64[ns]
 3   permalink              530086 non-null  object        
 4   score                  530086 non-null  int64         
 5   subreddit              530086 non-null  object        
 6   total_awards_received  530086 non-null  int64         
 7   Polarity               530086 non-null  float64       
 8   Subjectivity           530086 non-null  float64       
 9   Sentiment              530086 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(2), object(5)
memory usage: 44.5+ MB


In [18]:
reddit.head()

Unnamed: 0,author,body,created_utc,permalink,score,subreddit,total_awards_received,Polarity,Subjectivity,Sentiment
0,[deleted],[removed],2021-04-01 17:17:59,/r/politics/comments/mhmik7/ap_poll_finds_bide...,1,politics,0,0.0,0.0,Neutral
1,Flip2428,Growing up there I was always dumbfounded on h...,2021-04-01 17:17:58,/r/politics/comments/mhy1zm/new_mexico_is_set_...,1,politics,0,-0.051389,0.368056,Negative
2,boatpile,"Gaetz didn't even say ""no age you can't be sex...",2021-04-01 17:17:58,/r/politics/comments/mhuv7u/theres_no_age_that...,1,politics,0,0.4125,0.820833,Positive
3,theombudsmen,"I'm sure everyone knows this, but to clarify t...",2021-04-01 17:17:56,/r/politics/comments/mhzm4r/the_gop_rightly_fe...,1,politics,0,0.1,0.853472,Positive
4,Tony_Chu,Hang up decorations with zip ties and cheer th...,2021-04-01 17:17:54,/r/politics/comments/mhulza/biden_must_clean_u...,1,politics,0,0.0,0.0,Neutral


In [8]:
reddit.to_csv('Data/reddit_data.csv')
#biden.to_csv('Data/approval_data.csv')

In [2]:
reddit = pd.read_csv('Data/reddit_data.csv', index_col=0)
#approval = pd.read_csv('Data/approval_topline.csv', index_col=0)

In [17]:
reddit.head()

Unnamed: 0,author,body,created_utc,permalink,score,subreddit,total_awards_received,Polarity,Subjectivity,Sentiment,date
1,Flip2428,Growing up there I was always dumbfounded on h...,2021-04-01 17:17:58,/r/politics/comments/mhy1zm/new_mexico_is_set_...,1,politics,0,-0.051389,0.368056,Negative,2021-04-01
2,boatpile,"Gaetz didn't even say ""no age you can't be sex...",2021-04-01 17:17:58,/r/politics/comments/mhuv7u/theres_no_age_that...,1,politics,0,0.4125,0.820833,Positive,2021-04-01
3,theombudsmen,"I'm sure everyone knows this, but to clarify t...",2021-04-01 17:17:56,/r/politics/comments/mhzm4r/the_gop_rightly_fe...,1,politics,0,0.1,0.853472,Positive,2021-04-01
4,Tony_Chu,Hang up decorations with zip ties and cheer th...,2021-04-01 17:17:54,/r/politics/comments/mhulza/biden_must_clean_u...,1,politics,0,0.0,0.0,Neutral,2021-04-01
5,Fellums2,"Aside from the fact that he is just awful, he ...",2021-04-01 17:17:53,/r/politics/comments/mhmik7/ap_poll_finds_bide...,1,politics,0,-0.284821,0.735714,Negative,2021-04-01


In [18]:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 459988 entries, 1 to 99998
Data columns (total 11 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   author                 459988 non-null  object        
 1   body                   459988 non-null  object        
 2   created_utc            459988 non-null  datetime64[ns]
 3   permalink              459988 non-null  object        
 4   score                  459988 non-null  int64         
 5   subreddit              459988 non-null  object        
 6   total_awards_received  459988 non-null  int64         
 7   Polarity               459988 non-null  float64       
 8   Subjectivity           459988 non-null  float64       
 9   Sentiment              459988 non-null  object        
 10  date                   459988 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(2), object(6)
memory usage: 42.1+ MB


In [55]:
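# Note: '%#m' and '%#d' (no zero padding) work only on Windows; Linux and macOS use '%-m' and '%-d'.
reddit['date'] = reddit['created_utc'].dt.strftime('%#m/%#d/%Y')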

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  reddit['date'] = reddit['created_utc'].dt.strftime('%#m/%#d/%Y')
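
The `%#m` and `%#d` codes used above are platform-specific, so the same cell can fail or behave differently outside Windows. A minimal sketch of a portable way to build the same unpadded month/day/year key follows, assembling it from the datetime components instead of `strftime`; `stamps` and `date_key` are hypothetical names.

```python
import pandas as pd

# Example timestamps in the same format as 'created_utc'.
stamps = pd.Series(
    pd.to_datetime(
        ["2021-04-01 17:17:58", "2021-07-28 06:48:42", "2021-10-06 00:00:00"]
    )
)

# Month and day come out as plain integers, so no zero padding is added.
date_key = (
    stamps.dt.month.astype(str)
    + "/"
    + stamps.dt.day.astype(str)
    + "/"
    + stamps.dt.year.astype(str)
)
```

The resulting strings have the same shape as the keys in the approval data, such as `9/9/2021`.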


In [8]:
reddit['created_utc'] = pd.to_datetime(reddit['created_utc'])

In [15]:
reddit = reddit[reddit['body'] != "[removed]"]

In [30]:
ratings = approval.to_dict()

In [22]:
approval.reset_index(inplace=True)

In [50]:
reddit.reset_index(inplace=True)

In [29]:
approval.set_index('modeldate', inplace=True)

In [59]:
reddit['target'] = reddit['date'].map(ratings)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  reddit['target'] = reddit['date'].map(ratings)
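
`Series.map` with a dictionary looks up each value as a key and returns `NaN` when the key is missing, so the date strings must match the dictionary keys exactly. The sketch below shows how a zero-padded date such as `04/01/2021` misses keys written as `4/1/2021`, which is consistent with the `target` column showing `0 non-null` in one of the outputs. The names `ratings_demo`, `dates`, `missed`, and `matched` are hypothetical; the two ratings are copied from the rows displayed in this notebook.

```python
import pandas as pd

# Two date-to-rating pairs in the unpadded key format.
ratings_demo = {"4/1/2021": 53.43079, "7/28/2021": 51.538847}

dates = pd.DataFrame({
    "padded": ["04/01/2021", "07/28/2021"],
    "unpadded": ["4/1/2021", "7/28/2021"],
})

# Zero-padded strings are different keys, so every lookup fails.
missed = dates["padded"].map(ratings_demo)

# Unpadded strings match the dictionary keys exactly.
matched = dates["unpadded"].map(ratings_demo)
```

Checking `reddit['target'].isna().sum()` right after the mapping would catch this kind of mismatch early.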


In [51]:
reddit['date'][0]

'04/01/2021'

In [24]:
approval.drop(columns=['president', 'subgroup', 'approve_hi', 'approve_lo', 'disapprove_estimate', 'disapprove_hi', 'disapprove_lo', 'timestamp'], inplace=True)

In [33]:
ratings = ratings['approve_estimate']

In [34]:
ratings

{'10/6/2021': 44.934047,
 '10/5/2021': 45.804673,
 '10/4/2021': 44.817262,
 '10/3/2021': 44.872265,
 '10/2/2021': 44.972914,
 '10/1/2021': 45.867995,
 '9/30/2021': 44.877784,
 '9/29/2021': 45.886283,
 '9/28/2021': 44.921096,
 '9/27/2021': 44.919215,
 '9/26/2021': 45.634022,
 '9/25/2021': 47.340349,
 '9/24/2021': 44.870707,
 '9/23/2021': 46.997247,
 '9/22/2021': 46.327145,
 '9/21/2021': 46.175496,
 '9/20/2021': 46.274407,
 '9/19/2021': 45.521082,
 '9/18/2021': 45.521082,
 '9/17/2021': 45.521082,
 '9/16/2021': 45.452362,
 '9/15/2021': 46.106447,
 '9/14/2021': 45.64581,
 '9/13/2021': 45.883096,
 '9/12/2021': 45.581974,
 '9/11/2021': 45.581974,
 '9/10/2021': 45.518097,
 '9/9/2021': 45.613606,
 '9/8/2021': 45.313775,
 '9/7/2021': 45.393391,
 '9/6/2021': 45.625495,
 '9/5/2021': 46.02389,
 '9/4/2021': 45.923564,
 '9/3/2021': 46.02389,
 '9/2/2021': 45.871456,
 '9/1/2021': 46.408959,
 '8/31/2021': 46.33472,
 '8/30/2021': 46.681208,
 '8/29/2021': 47.009413,
 '8/28/2021': 48.47553,
 '8/27/2021': 

In [62]:
reddit

Unnamed: 0,author,body,created_utc,permalink,score,subreddit,total_awards_received,Polarity,Subjectivity,Sentiment,date,target
0,Flip2428,Growing up there I was always dumbfounded on h...,2021-04-01 17:17:58,/r/politics/comments/mhy1zm/new_mexico_is_set_...,1,politics,0,-0.051389,0.368056,Negative,4/1/2021,53.430790
1,boatpile,"Gaetz didn't even say ""no age you can't be sex...",2021-04-01 17:17:58,/r/politics/comments/mhuv7u/theres_no_age_that...,1,politics,0,0.412500,0.820833,Positive,4/1/2021,53.430790
2,theombudsmen,"I'm sure everyone knows this, but to clarify t...",2021-04-01 17:17:56,/r/politics/comments/mhzm4r/the_gop_rightly_fe...,1,politics,0,0.100000,0.853472,Positive,4/1/2021,53.430790
3,Tony_Chu,Hang up decorations with zip ties and cheer th...,2021-04-01 17:17:54,/r/politics/comments/mhulza/biden_must_clean_u...,1,politics,0,0.000000,0.000000,Neutral,4/1/2021,53.430790
4,Fellums2,"Aside from the fact that he is just awful, he ...",2021-04-01 17:17:53,/r/politics/comments/mhmik7/ap_poll_finds_bide...,1,politics,0,-0.284821,0.735714,Negative,4/1/2021,53.430790
...,...,...,...,...,...,...,...,...,...,...,...,...
459983,Andreklooster,SNL in real life .. kinda like the apprentice ...,2021-07-28 06:48:42,/r/democrats/comments/oswgpt/_/h6sx7pf/,1,democrats,0,0.200000,0.300000,Positive,7/28/2021,51.538847
459984,Andreklooster,Thank (insert deity of choice) .. would be ent...,2021-07-28 06:47:11,/r/democrats/comments/oswgpt/_/h6sx3ka/,1,democrats,0,0.000000,0.000000,Neutral,7/28/2021,51.538847
459985,Andreklooster,🖐️,2021-07-28 06:44:14,/r/democrats/comments/oswgpt/_/h6swvor/,1,democrats,0,0.000000,0.000000,Neutral,7/28/2021,51.538847
459986,fraggleliberachi,"He asked what has Biden done that’s positive, ...",2021-07-28 06:40:19,/r/democrats/comments/oseq9m/whats_something_t...,1,democrats,0,-0.011364,0.717172,Negative,7/28/2021,51.538847


In [57]:
reddit['date'] = reddit['date'].astype(str)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  reddit['date'] = reddit['date'].astype(str)


In [46]:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 459988 entries, 1 to 99998
Data columns (total 12 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   author                 459988 non-null  object        
 1   body                   459988 non-null  object        
 2   created_utc            459988 non-null  datetime64[ns]
 3   permalink              459988 non-null  object        
 4   score                  459988 non-null  int64         
 5   subreddit              459988 non-null  object        
 6   total_awards_received  459988 non-null  int64         
 7   Polarity               459988 non-null  float64       
 8   Subjectivity           459988 non-null  float64       
 9   Sentiment              459988 non-null  object        
 10  date                   459988 non-null  object        
 11  target                 0 non-null       float64       
dtypes: datetime64[ns](1), float64(3), int64(2), o

In [61]:
reddit.drop(columns=['index'], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(
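
The repeated warnings in the outputs above typically appear when a DataFrame produced by filtering another one is modified afterwards, because pandas cannot tell whether the change is meant to reach the original. A minimal sketch of one common way to avoid them, taking an explicit copy right after filtering, is shown on a toy frame; `comments` and `kept` are hypothetical names.

```python
import pandas as pd

comments = pd.DataFrame({
    "body": ["[removed]", "first comment", "second comment"],
    "score": [1, 2, 3],
})

# An explicit copy makes the filtered frame independent of the original.
kept = comments[comments["body"] != "[removed]"].copy()

# Adding a column to the copy no longer triggers the warning.
kept["length"] = kept["body"].str.len()
```

The original `comments` frame is left unchanged, which also makes it safe to rerun cells in any order.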


In [7]:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 451652 entries, 0 to 459987
Data columns (total 12 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   author                 451652 non-null  object 
 1   body                   451652 non-null  object 
 2   created_utc            451652 non-null  object 
 3   permalink              451652 non-null  object 
 4   score                  451652 non-null  int64  
 5   subreddit              451652 non-null  object 
 6   total_awards_received  451652 non-null  int64  
 7   Polarity               451652 non-null  float64
 8   Subjectivity           451652 non-null  float64
 9   Sentiment              451652 non-null  object 
 10  date                   451652 non-null  object 
 11  target                 451652 non-null  float64
dtypes: float64(3), int64(2), object(7)
memory usage: 44.8+ MB


In [6]:
reddit = reddit.dropna()