# Generating the data
#### In this document, our team's goal is to retrieve data from Reddit's r/politics subreddit.
> Our team decided to use data from 2020: the period leading up to, during, and after the US elections.<br>
> We gathered samples of approximately 10,000 comments and 10,000 posts from 2020.

As this is a project about unsupervised machine learning and NLP, our goal is simply to examine what kind of clusters our model can come up with when given text from our data as input.

In [1]:
from psaw import PushshiftAPI
import pandas as pd
import time
from functions import clean_text_arr, retrieve_data_submissions, retrieve_data_comments, time_stamp

In [2]:
api = PushshiftAPI()

## Scraping Reddit comments from r/politics

We look at the most upvoted and the most downvoted comments posted to the subreddit during 2020, the election year.

In [4]:
# sort="desc" with sort_type="score" returns the highest-scoring (most upvoted) comments first;
# change sort to "asc" to get the lowest-scoring (most downvoted) comments instead
comments = api.search_comments(
    subreddit='politics',
    limit=5000,
    after=time_stamp("01/01/2020"),
    before=time_stamp("31/12/2020"),
    sort="desc",
    sort_type="score",
)

comments_in_post = []

for comment in comments:
    # Skip comments whose body has been removed or deleted
    if comment.body not in ('[removed]', '[deleted]'):
        comments_in_post.append(
            {"Author": comment.author, "comment": comment.body, "score": comment.score})

df = pd.DataFrame(comments_in_post)
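
`time_stamp` comes from the project's local `functions` module, which is not shown in this notebook. The block below is a minimal sketch of what such a helper might look like, assuming it turns a `dd/mm/yyyy` string into the integer Unix epoch that the `after` and `before` parameters accept. The name `time_stamp_sketch` and the choice of UTC are assumptions made for illustration, not the project's actual implementation.

```python
from datetime import datetime, timezone


def time_stamp_sketch(date_str: str) -> int:
    """Convert a 'dd/mm/yyyy' string to a Unix epoch in seconds (hypothetical helper)."""
    parsed = datetime.strptime(date_str, "%d/%m/%Y")
    # Assumption: dates are interpreted as midnight UTC
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


start_epoch = time_stamp_sketch("01/01/2020")
end_epoch = time_stamp_sketch("31/12/2020")
```

If the real helper behaves like this sketch, `before=time_stamp("31/12/2020")` stops at midnight at the start of 31 December, so comments posted during that last day would fall outside the window.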



## Cleaning data


In [5]:
# Drop exact duplicate rows, then normalise the comment text
df.drop_duplicates(inplace=True)
df['comment'] = clean_text_arr(df['comment'])
df

| | Author | comment | score |
|---|---|---|---|
| 0 | Chendii | Congrats on President elect Joe Biden for winn... | 21260 |
| 1 | jasoniscursed | The level of narcissism it takes to think that... | 19822 |
| 2 | Qu1nlan | Fuck | 19726 |
| 3 | fitDEEZbruh | Direct quote from this guy Redistricting is li... | 18723 |
| 4 | mistervanilla | Are you telling me that a cyber security speci... | 17778 |
| ... | ... | ... | ... |
| 2658 | thedarkdescent1 | The purpose of foreign aid has ALWAYS been qui... | 1 |
| 2659 | crlcan81 | These don t have anything to do with tobacco o... | 1 |
| 2660 | GabuEx | In order to commit in person voter fraud you w... | 1 |
| 2661 | 9xInfinity | Biden said he d be fine with a Republican VP I... | 1 |
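
`clean_text_arr` is also imported from the local `functions` module and is not shown here. Judging only from the output above (for example `don t` and `he d`), it appears to replace punctuation with spaces. The block below is a rough sketch of that kind of cleaning, not the project's actual function; the name `clean_text_sketch` and the exact regular expression are assumptions.

```python
import re


def clean_text_sketch(texts):
    """Replace runs of non-alphanumeric characters with a single space (hypothetical helper)."""
    cleaned = []
    for text in texts:
        # Assumption: anything other than ASCII letters and digits is treated as a separator
        text = re.sub(r"[^A-Za-z0-9]+", " ", text)
        cleaned.append(text.strip())
    return cleaned


print(clean_text_sketch(["These don't have", "364,000 voters!"]))
# prints ['These don t have', '364 000 voters']
```

A side effect of this style of cleaning is that contractions and numbers are split into separate tokens, which matches what the table shows.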


## Creating a CSV file from the comments

In [6]:
# These lines are left commented out so that the existing data is not overwritten

# df.to_csv('../ds-project-data-engineering-3/data/large_upvoted.csv')
# df.to_csv('../ds-project-data-engineering-3/data/large_downvoted.csv')
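
An alternative to commenting the export lines in and out is to guard the write so that it only happens when the target file does not exist yet. The block below is a small sketch of that idea; `save_if_absent` is a hypothetical helper, not part of the project, and `frame` is assumed to be any object with a `to_csv` method, such as a pandas DataFrame.

```python
from pathlib import Path


def save_if_absent(frame, path):
    """Write `frame` to CSV only if `path` does not exist yet; return True if written."""
    target = Path(path)
    if target.exists():
        # Leave the existing data untouched
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(target)
    return True
```

With such a helper, re-running the notebook would leave `large_upvoted.csv` and `large_downvoted.csv` as they are unless the files are deleted first.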


## Scraping Reddit submissions from r/politics

In [8]:
# Columns that seem most useful; we may not end up using all of them
mask = [
    'author', 'title', 'score', 'subreddit', 'subreddit_subscribers',
    'all_awardings', 'is_crosspostable', 'is_original_content',
    'num_comments', 'num_crossposts',
]

In [9]:
subreddit = "politics"

start_date = '01/01/2020'
end_date = '31/12/2020'

limit = 10000
df = retrieve_data_submissions(subreddit=subreddit, start_date=start_date, end_date=end_date, limit=limit)

df = df[mask]
df



| | author | title | score | subreddit | subreddit_subscribers | all_awardings | is_crosspostable | is_original_content | num_comments | num_crossposts |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | krakaman042 | Every american needs to see this. This man can... | 1 | politics | 7051711 | [] | False | False | 4 | 0 |
| 1 | BitterFuture | Fifty maskless anti-lockdown protesters force ... | 1 | politics | 7051708 | [] | False | False | 3 | 0 |
| 2 | 2legit2fart | Senators Tell The USPTO To Remove The Arbitrar... | 1 | politics | 7051708 | [] | True | False | 22 | 0 |
| 3 | Itzyatzee | Iran’s Rouhani issues Trump death threat: ‘In ... | 1 | politics | 7051696 | [] | True | False | 18 | 0 |
| 4 | flawy12 | Jovan Pulitzer hacks Fulton County voting mach... | 1 | politics | 7051685 | [] | False | False | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9961 | wolfie_poe | Trump’s Future: Tons of Cash and Plenty of Opt... | 1 | politics | 7012600 | [] | False | False | 11 | 0 |
| 9962 | [deleted] | Kushner helped create shell Trump campaign com... | 3 | politics | 7012600 | [] | False | False | 2 | 0 |
| 9963 | emmsdahbeaawws | ‘Dominion Effect’ Gives 3 Percent More Votes f... | 0 | politics | 7012595 | [] | False | False | 2 | 0 |
| 9964 | TrumpSharted | Meghan McCain hits Trump over attacks on fathe... | 2236 | politics | 7012595 | [{'award_sub_type': 'GLOBAL', 'award_type': 'g... | True | False | 399 | 1 |
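
`retrieve_data_submissions` lives in the local `functions` module and is not shown in this notebook. The block below is a hedged sketch of how such a function might gather submissions with psaw: it assumes the `api` object exposes `search_submissions` and that each result carries its raw fields in a `d_` dictionary, as psaw results do. The name `retrieve_submissions_sketch` and its parameters are hypothetical.

```python
def retrieve_submissions_sketch(api, subreddit, after, before, limit, columns):
    """Collect submissions as a list of dicts restricted to `columns` (hypothetical helper)."""
    results = api.search_submissions(
        subreddit=subreddit, after=after, before=before, limit=limit
    )
    rows = []
    for submission in results:
        record = submission.d_  # raw fields of the submission as a dict
        # .get() yields None for a missing field instead of raising KeyError
        rows.append({key: record.get(key) for key in columns})
    return rows
```

Passing the resulting list to `pd.DataFrame(rows)` would give a frame with exactly the requested columns. Unlike selecting with `df[mask]` afterwards, this variant does not fail when a field is absent from some submissions.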


### Cleaning up the text from post titles

In [11]:
df['title'] = clean_text_arr(df['title'])
df.sample(10)

# Left commented out so that the existing data is not overwritten
# df.to_csv("data/politics_posts_2020.csv")

| | author | title | score | subreddit | subreddit_subscribers | all_awardings | is_crosspostable | is_original_content | num_comments | num_crossposts |
|---|---|---|---|---|---|---|---|---|---|---|
| 2250 | MitzieTidwell | New York Post Tells Trump To Give It Up And St... | 1 | politics | 7041924 | [] | True | False | 4 | 0 |
| 7993 | Nerdwerfer | The Red Slime Lawsuit That Could Sink Right Wi... | 1 | politics | 7019183 | [] | True | False | 5 | 0 |
| 6471 | 2f4s3g5d | Eligibility of 364 000 Georgia voters challeng... | 1 | politics | 7023531 | [] | False | False | 1 | 0 |
| 4218 | sustainablereview | 10 reasons why the COVID 19 relief package wil... | 1 | politics | 7030528 | [] | False | False | 1 | 0 |
| 304 | Madridsta120 | Trump s covid bill includes 180 day countdown ... | 1 | politics | 7050309 | [] | True | False | 9 | 0 |
| 6296 | newfrontier58 | Donald Trump Hits Feces Flinging Stage of Elec... | 1 | politics | 7023885 | [] | True | False | 3 | 0 |
| 437 | JoeThomas90 | Trump is a historic loser No other one term pr... | 1 | politics | 7049936 | [] | True | False | 670 | 0 |
| 2925 | discocrisco | Analysis Why Donald Trump is already the 2024 ... | 1 | politics | 7038224 | [] | True | False | 98 | 0 |
| 8300 | Ok-Possibility-5066 | Homeless and no ID You can still vote in Denve... | 1 | politics | 7017758 | [] | False | False | 2 | 0 |
| 6377 | TrumpDumper | Gavin Newsom names California s first Latino U... | 1 | politics | 7023709 | [] | True | False | 3 | 0 |
