# Introduction

The effect of social media on politics is frequently debated. Media outlets and regulators often claim that social media has a broadly negative effect on political sentiment in the US: that it creates “echo chambers” of like-minded people, which lead to polarization and increasingly extreme views.

In [1]:
import pandas as pd
import numpy as np

from pmaw import PushshiftAPI
import datetime as dt

from textblob import TextBlob

## Collecting Social Media Data

For this project, I will use data from Reddit. Because Reddit is organized into subreddits by interest, it is easy both to filter for political content and to ensure that a variety of viewpoints is represented. I will collect data from six subreddits: r/politics and r/news (the two largest general political subreddits), r/liberal and r/democrats (two of the largest left-leaning political subreddits), and r/conservative and r/libertarian (two of the largest right-leaning political subreddits).
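
The grouping described above can be written down as a small lookup table, which may be handy later for comparing sentiment across leanings. This is only a sketch: the `SUB_GROUPS` name and the `lean` column are hypothetical and do not appear in the notebook, and the names are lowercased first on the assumption that the API may return them with different capitalization.

```python
import pandas as pd

# Hypothetical lookup table built from the grouping described in the text.
SUB_GROUPS = {
    "politics": "general",
    "news": "general",
    "liberal": "left",
    "democrats": "left",
    "conservative": "right",
    "libertarian": "right",
}

# Toy stand-in for the collected comments.
toy = pd.DataFrame({"subreddit": ["politics", "Conservative", "democrats"]})

# Lowercase before mapping so capitalization differences do not produce NaN.
toy["lean"] = toy["subreddit"].str.lower().map(SUB_GROUPS)
print(toy)
```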

In [2]:
api = PushshiftAPI()

before = int(dt.datetime(2021,9,15,0,0).timestamp())
after = int(dt.datetime(2021,1,23,0,0).timestamp())

subs = ['politics', 'news', 'conservative', 'liberal', 'libertarian', 'democrats']
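
One detail worth noting: calling `.timestamp()` on a naive `datetime` interprets it in the machine's local time zone, so the window above can shift by a few hours depending on where the notebook runs. If the bounds are meant to be midnight UTC (an assumption; the notebook does not say), a minimal sketch with the hypothetical names `before_utc` and `after_utc` is:

```python
import datetime as dt

# Assumption: the collection window is meant to start and end at midnight UTC.
before_utc = int(dt.datetime(2021, 9, 15, tzinfo=dt.timezone.utc).timestamp())
after_utc = int(dt.datetime(2021, 1, 23, tzinfo=dt.timezone.utc).timestamp())

# These bounds are the same on every machine, regardless of local time zone.
print(after_utc, before_utc)
```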

In [3]:
def get_data(subreddit):
    #request up to `limit` comments from one subreddit within the time window
    limit = 100000
    comments = api.search_comments(subreddit=subreddit, limit=limit, before=before, after=after)
    df = pd.DataFrame(comments)
    return df

def combine_data(subs):
    #concatenate once rather than growing a DataFrame inside a loop
    return pd.concat([get_data(sub) for sub in subs])
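
Because each per-subreddit frame is numbered from 0, concatenating them as above repeats index labels across subreddits. Passing `ignore_index=True` to `pd.concat` is one way to get a unique row index; the sketch below uses two toy frames rather than real API results.

```python
import pandas as pd

# Toy stand-ins for two per-subreddit results, each indexed from 0.
a = pd.DataFrame({"subreddit": ["politics", "politics"], "body": ["x", "y"]})
b = pd.DataFrame({"subreddit": ["news"], "body": ["z"]})

# Default behaviour keeps the original labels, so 0 appears twice.
stacked = pd.concat([a, b])

# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([a, b], ignore_index=True)

print(stacked.index.is_unique, renumbered.index.is_unique)
```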

In [4]:
reddit = combine_data(subs)

Total:: Success Rate: 99.80% - Requests: 1015 - Batches: 102 - Items Remaining: 0
Total:: Success Rate: 99.41% - Requests: 1018 - Batches: 102 - Items Remaining: 0
Total:: Success Rate: 94.40% - Requests: 1072 - Batches: 108 - Items Remaining: 0
70489 result(s) not found in Pushshift
Total:: Success Rate: 88.19% - Requests: 398 - Batches: 42 - Items Remaining: 0
Total:: Success Rate: 82.91% - Requests: 1223 - Batches: 123 - Items Remaining: 0
Total:: Success Rate: 71.49% - Requests: 1596 - Batches: 160 - Items Remaining: 0
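
The log above reports that 70489 results were not found for one subreddit, so the six subreddits are not equally represented in the combined data. One quick way to see the balance is `value_counts`; the sketch below runs on a toy frame, and in the notebook the same calls would apply to `reddit['subreddit']`.

```python
import pandas as pd

# Toy stand-in for the combined comments.
toy = pd.DataFrame({"subreddit": ["politics", "politics", "news", "liberal", "politics"]})

# Absolute number of comments per subreddit.
counts = toy["subreddit"].value_counts()

# Share of the whole data set contributed by each subreddit.
share = toy["subreddit"].value_counts(normalize=True)

print(counts)
print(share)
```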


In [5]:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 529511 entries, 0 to 99999
Data columns (total 51 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   all_awardings                    529511 non-null  object 
 1   associated_award                 0 non-null       object 
 2   author                           529511 non-null  object 
 3   author_flair_background_color    138503 non-null  object 
 4   author_flair_css_class           20610 non-null   object 
 5   author_flair_richtext            449806 non-null  object 
 6   author_flair_template_id         46738 non-null   object 
 7   author_flair_text                76450 non-null   object 
 8   author_flair_text_color          156709 non-null  object 
 9   author_flair_type                449806 non-null  object 
 10  author_fullname                  449806 non-null  object 
 11  author_patreon_flair             449806 non-null  object 
 12  aut

In [6]:
reddit.drop(columns = ['all_awardings', 'associated_award', 'author_flair_background_color', 
                        'author_flair_css_class', 'author_flair_template_id', 'author_flair_text', 
                        'author_flair_text_color', 'awarders', 'collapsed_because_crowd_control', 
                        'comment_type', 'gildings', 'id', 'is_submitter', 'link_id', 'locked', 'no_follow', 'parent_id', 
                        'retrieved_on', 'send_replies', 'stickied', 'subreddit_id', 'top_awarded_type', 'treatment_tags',
                        'author_flair_richtext', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 
                        'author_premium', 'distinguished', 'author_cakeday', 'collapsed_reason_code', 
                        'archived', 'body_sha1', 'can_gild', 'collapsed', 'collapsed_reason', 'controversiality', 
                        'gilded', 'retrieved_utc', 'score_hidden', 'subreddit_name_prefixed', 'subreddit_type', 
                        'edited'], inplace=True)
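
An alternative to listing every column to drop is to list the few columns to keep. The sketch below is an illustration on a toy frame, not the notebook's method; the `KEEP` list is a hypothetical name holding the eight columns that remain after the drop above.

```python
import pandas as pd

# The eight columns that remain after the drop in the notebook.
KEEP = ["author", "body", "created_utc", "permalink", "score",
        "subreddit", "total_awards_received", "editable"]

# Toy frame with one unwanted column and several wanted ones missing.
toy = pd.DataFrame({"author": ["a"], "body": ["hi"], "score": [1], "gilded": [0]})

# Keep only the wanted columns that are actually present in this frame.
trimmed = toy[[c for c in KEEP if c in toy.columns]]
print(trimmed.columns.tolist())
```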

In [7]:
reddit['created_utc'] = pd.to_datetime(reddit['created_utc'], unit='s')

reddit = reddit[reddit['body'] != "[removed]"]
reddit = reddit[reddit['body'] != "[deleted]"]
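
The two cleaning steps above (converting epoch seconds to datetimes and removing placeholder comments) can be tried on a toy frame. The sketch below uses `isin` to filter both placeholders in one pass, which is a stylistic variation rather than the notebook's exact code.

```python
import pandas as pd

toy = pd.DataFrame({
    "created_utc": [1617478919, 1617478916, 1617478915],
    "body": ["Maybe.", "[removed]", "[deleted]"],
})

# Epoch seconds become datetime64 values.
toy["created_utc"] = pd.to_datetime(toy["created_utc"], unit="s")

# Drop rows whose body is one of Reddit's placeholder strings.
placeholders = ["[removed]", "[deleted]"]
kept = toy[~toy["body"].isin(placeholders)]
print(kept)
```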

In [8]:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 450183 entries, 0 to 99999
Data columns (total 8 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   author                 450183 non-null  object        
 1   body                   450183 non-null  object        
 2   created_utc            450183 non-null  datetime64[ns]
 3   permalink              450183 non-null  object        
 4   score                  450183 non-null  int64         
 5   subreddit              450183 non-null  object        
 6   total_awards_received  450183 non-null  int64         
 7   editable               0 non-null       object        
dtypes: datetime64[ns](1), int64(2), object(5)
memory usage: 30.9+ MB


In [9]:
reddit.head()

| | author | body | created_utc | permalink | score | subreddit | total_awards_received | editable |
|---|---|---|---|---|---|---|---|---|
| 0 | execdysfunction | Maybe. We need to be aiming higher | 2021-04-03 19:41:59 | /r/politics/comments/mj839d/schumer_senate_wil... | 1 | politics | 0 | |
| 1 | yappledapple | I hadn't heard that one. I think the ones stil... | 2021-04-03 19:41:59 | /r/politics/comments/mjcrfb/schumer_says_senat... | 1 | politics | 0 | |
| 2 | Tots4trump | “The statue was presented to the British as a ... | 2021-04-03 19:41:56 | /r/politics/comments/mjczhl/confederate_symbol... | 1 | politics | 0 | |
| 3 | DroopyMcCool | Is this something that is in the DOI's purview... | 2021-04-03 19:41:55 | /r/politics/comments/mj6klw/secretary_deb_haal... | 1 | politics | 0 | |
| 4 | FlyingRock | New York legalizing is definitely why it's bei... | 2021-04-03 19:41:55 | /r/politics/comments/mj839d/schumer_senate_wil... | 1 | politics | 0 | |


## Sentiment Analysis

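I will use TextBlob to score each comment on two measures: polarity, which ranges from -1 (most negative) to 1 (most positive), and subjectivity, which ranges from 0 (objective) to 1 (subjective).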

In [10]:
reddit['body'] = reddit['body'].astype(str)

def subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

def polarity(text):
    return TextBlob(text).sentiment.polarity

reddit['Polarity'] = reddit['body'].apply(polarity)
reddit['Subjectivity'] = reddit['body'].apply(subjectivity)
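
The cell above analyses every comment twice, once per score. On several hundred thousand comments it may be worth scoring each text once and splitting the result into two columns. The sketch below shows that pattern with a hypothetical helper, `add_sentiment_columns`, and a hypothetical `toy_scorer` standing in for TextBlob so that it runs without the library installed.

```python
import pandas as pd

def add_sentiment_columns(df, scorer, text_col="body"):
    """Return a copy of df with Polarity and Subjectivity columns.

    scorer is any callable mapping a string to a (polarity, subjectivity) pair.
    """
    # Score each text exactly once.
    scores = [scorer(text) for text in df[text_col].astype(str)]
    out = df.copy()
    out["Polarity"] = [s[0] for s in scores]
    out["Subjectivity"] = [s[1] for s in scores]
    return out

# Hypothetical toy scorer, used only so the example is self-contained.
def toy_scorer(text):
    return (0.5, 1.0) if "good" in text.lower() else (0.0, 0.0)

toy = pd.DataFrame({"body": ["Good point", "It is Tuesday"]})
scored = add_sentiment_columns(toy, toy_scorer)
print(scored)
```

With TextBlob, the scorer could be `lambda text: TextBlob(text).sentiment`, since that attribute is a named tuple of polarity and subjectivity.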

In [11]:
def get_sentiment(score):
    #label a comment by the sign of its polarity score
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'

reddit['Sentiment'] = reddit['Polarity'].apply(get_sentiment)
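
The same three-way labelling can also be done without a Python-level function call per row. The sketch below uses `np.select` on a toy polarity series; it is an equivalent formulation offered for illustration, not the notebook's code.

```python
import numpy as np
import pandas as pd

# Toy polarity scores: one negative, one exactly zero, one positive.
polarity = pd.Series([-0.166667, 0.0, 0.25])

# Conditions are checked in order; anything matching neither is 'Positive'.
labels = np.select(
    [polarity < 0, polarity == 0],
    ["Negative", "Neutral"],
    default="Positive",
)
print(labels)
```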

In [12]:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 450183 entries, 0 to 99999
Data columns (total 11 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   author                 450183 non-null  object        
 1   body                   450183 non-null  object        
 2   created_utc            450183 non-null  datetime64[ns]
 3   permalink              450183 non-null  object        
 4   score                  450183 non-null  int64         
 5   subreddit              450183 non-null  object        
 6   total_awards_received  450183 non-null  int64         
 7   editable               0 non-null       object        
 8   Polarity               450183 non-null  float64       
 9   Subjectivity           450183 non-null  float64       
 10  Sentiment              450183 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(2), object(6)
memory usage: 41.2+ MB


## Creating the Target Variable

To support modelling, each comment is matched with the presidential approval rating for the day it was posted.

In [13]:
approval = pd.read_csv('Data/approval_topline.csv')

In [14]:
approval.head()

| | president | subgroup | modeldate | approve_estimate | approve_hi | approve_lo | disapprove_estimate | disapprove_hi | disapprove_lo | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Joseph R. Biden Jr. | Voters | 10/6/2021 | 44.934047 | 49.671079 | 40.197016 | 49.332431 | 54.603749 | 44.061113 | 10/6/2021 13:57 |
| 1 | Joseph R. Biden Jr. | Voters | 10/5/2021 | 45.804673 | 50.491925 | 41.117422 | 48.38875 | 54.107867 | 42.669634 | 10/5/2021 20:25 |
| 2 | Joseph R. Biden Jr. | All polls | 10/4/2021 | 44.817262 | 49.089776 | 40.544748 | 47.846172 | 54.00567 | 41.686673 | 10/4/2021 13:15 |
| 3 | Joseph R. Biden Jr. | Adults | 10/3/2021 | 44.872265 | 49.258332 | 40.486198 | 48.060846 | 53.324118 | 42.797574 | 10/3/2021 20:34 |
| 4 | Joseph R. Biden Jr. | All polls | 10/2/2021 | 44.972914 | 49.438682 | 40.507146 | 48.700901 | 54.313456 | 43.088347 | 10/3/2021 20:34 |


In [15]:
approval.set_index('modeldate', inplace=True)
approval.drop(columns=['president', 'subgroup', 'approve_hi', 'approve_lo', 'disapprove_estimate', 
                       'disapprove_hi', 'disapprove_lo', 'timestamp'], inplace=True)

In [16]:
#create dictionary of approval ratings by date
ratings = approval.to_dict()
ratings = ratings['approve_estimate']
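
The preview of the approval file shows a `subgroup` column with several values (`Voters`, `Adults`, `All polls`). If the file holds more than one subgroup per date, which the five previewed rows do not confirm, a dictionary keyed by date alone would keep only one of them. A hedged sketch of selecting a single subgroup before building the lookup follows; the numbers are toy values, and the choice of `All polls` is an assumption about which series is wanted.

```python
import pandas as pd

# Toy approval data with two subgroups per date (values are illustrative only).
toy = pd.DataFrame({
    "subgroup": ["All polls", "Voters", "All polls", "Voters"],
    "modeldate": ["4/3/2021", "4/3/2021", "4/2/2021", "4/2/2021"],
    "approve_estimate": [50.0, 51.0, 52.0, 53.0],
})

# Assumption: the 'All polls' series is the one to match against comments.
all_polls = toy[toy["subgroup"] == "All polls"]

# With one row per date, the date-to-rating dictionary is unambiguous.
ratings = all_polls.set_index("modeldate")["approve_estimate"].to_dict()
print(ratings)
```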

In [17]:
#create date column with format matching the dictionary for mapping
#('%#m/%#d/%Y' only works on Windows, so build the string from the date parts)
utc = reddit['created_utc'].dt
reddit['date'] = utc.month.astype(str) + '/' + utc.day.astype(str) + '/' + utc.year.astype(str)

In [18]:
reddit['target'] = reddit['date'].map(ratings)
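
Matching on formatted date strings works only if both sides are formatted identically. A variation that sidesteps string formatting is to parse the approval dates and match on calendar days directly, then count how many comments found no rating. The sketch below uses toy data and the hypothetical names `toy_approval`, `by_day` and `comments`.

```python
import pandas as pd

# Toy approval data, one row per day (values are illustrative only).
toy_approval = pd.DataFrame({
    "modeldate": ["4/3/2021", "4/2/2021"],
    "approve_estimate": [50.0, 52.0],
})
toy_approval["modeldate"] = pd.to_datetime(toy_approval["modeldate"], format="%m/%d/%Y")
by_day = toy_approval.set_index("modeldate")["approve_estimate"]

# Toy comments; the last one falls on a day with no rating.
comments = pd.DataFrame({
    "created_utc": pd.to_datetime(
        ["2021-04-03 19:41:59", "2021-04-02 08:00:00", "2021-03-01 12:00:00"]
    ),
})

# normalize() truncates each timestamp to midnight so it lines up with by_day.
comments["target"] = comments["created_utc"].dt.normalize().map(by_day)

# Comments without a matching day end up as NaN.
unmatched = comments["target"].isna().sum()
print(comments, unmatched)
```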

In [19]:
reddit.head()

| | author | body | created_utc | permalink | score | subreddit | total_awards_received | editable | Polarity | Subjectivity | Sentiment | date | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | execdysfunction | Maybe. We need to be aiming higher | 2021-04-03 19:41:59 | /r/politics/comments/mj839d/schumer_senate_wil... | 1 | politics | 0 | | 0.25 | 0.5 | Positive | 4/3/2021 | 53.414394 |
| 1 | yappledapple | I hadn't heard that one. I think the ones stil... | 2021-04-03 19:41:59 | /r/politics/comments/mjcrfb/schumer_says_senat... | 1 | politics | 0 | | -0.166667 | 0.5 | Negative | 4/3/2021 | 53.414394 |
| 2 | Tots4trump | “The statue was presented to the British as a ... | 2021-04-03 19:41:56 | /r/politics/comments/mjczhl/confederate_symbol... | 1 | politics | 0 | | 0.295 | 0.43 | Positive | 4/3/2021 | 53.414394 |
| 3 | DroopyMcCool | Is this something that is in the DOI's purview... | 2021-04-03 19:41:55 | /r/politics/comments/mj6klw/secretary_deb_haal... | 1 | politics | 0 | | 0.068182 | 0.227273 | Positive | 4/3/2021 | 53.414394 |
| 4 | FlyingRock | New York legalizing is definitely why it's bei... | 2021-04-03 19:41:55 | /r/politics/comments/mj839d/schumer_senate_wil... | 1 | politics | 0 | | 0.033939 | 0.517576 | Positive | 4/3/2021 | 53.414394 |


In [20]:
#save as csv
reddit.to_csv('Data/reddit_data.csv')
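
By default `to_csv` writes the row index as an extra unnamed first column, and datetimes come back as plain strings when the file is read again. If neither is wanted, a round trip could look like the sketch below, which writes to an in-memory buffer instead of `Data/reddit_data.csv` so that it runs anywhere.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "body": ["Maybe."],
    "created_utc": pd.to_datetime(["2021-04-03 19:41:59"]),
})

# index=False leaves the row index out of the file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# parse_dates restores the datetime dtype when reading back.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["created_utc"])
print(restored.dtypes)
```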