# Part 1: Subreddit Data Collection
In this notebook, we collect posts from two subreddits, r/AskWomen and r/AskMen, using the Pushshift API.

## Import Packages

In [3]:
import pandas as pd
import numpy as np
import requests
import re

import datetime
from datetime import timedelta

## Custom Functions to Facilitate Scraping
The functions below facilitate the data collection process, which is repeated for two different subreddits. Because the Pushshift API limits each pull to 100 records, get_json takes a start date and an end date and iterates through the dates in between, issuing one request per day, so that we can pull more than 100 records in total.
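
To make the date iteration concrete, here is a minimal sketch of how a date range can be split into daily epoch windows suitable for the API's after and before parameters. It uses only the standard library. The helper name daily_windows is hypothetical (it is not part of this notebook), and it assumes day boundaries at midnight UTC with an exclusive end date.

```python
from datetime import datetime, timedelta, timezone

def daily_windows(start_date, end_date):
    """Return (after, before) epoch-second pairs, one per day in [start_date, end_date)."""
    # hypothetical helper; assumes 'YYYY-MM-DD' strings and midnight-UTC boundaries
    start = datetime.strptime(start_date, '%Y-%m-%d').replace(tzinfo=timezone.utc)
    end = datetime.strptime(end_date, '%Y-%m-%d').replace(tzinfo=timezone.utc)
    windows = []
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        windows.append((int(day.timestamp()), int(next_day.timestamp())))
        day = next_day
    return windows

windows = daily_windows('2019-01-01', '2019-01-04')
print(len(windows))   # 3
print(windows[0])     # (1546300800, 1546387200)
```

Each pair covers exactly 24 hours, so a full year produces one request per day rather than a single request capped at 100 records.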

In [4]:
def get_json(subreddit, start_date, end_date):
    """Collect submissions from subreddit posted in [start_date, end_date), one request per day."""
    #set up empty list to store data
    json_data = []
    #base subreddit url
    base_url = 'https://api.pushshift.io/reddit/search/submission/?subreddit='
    #parse 'YYYY-MM-DD' strings; the last daily window ends at end_date, so end_date itself is excluded
    start_date = datetime.datetime.strptime(start_date, '%Y-%m-%d').date()
    end_date = datetime.datetime.strptime(end_date, '%Y-%m-%d').date()
    dates = pd.date_range(start_date, end_date - timedelta(days=1), freq='D')
    for date in dates:
        #convert the day's boundaries to epoch seconds (pandas treats naive timestamps as UTC)
        next_date = date + timedelta(days=1)
        date_ep = int(date.timestamp())
        next_date_ep = int(next_date.timestamp())
        #get json data with request
        suffix = '&after=' + str(date_ep) + '&before=' + str(next_date_ep)
        res = requests.get(base_url + subreddit + suffix)
        #if request is successful, append the returned posts to json_data list
        if res.status_code == 200:
            payload = res.json()
            #fall back to an empty list if the "data" key is missing
            json_data.extend(payload.get('data') or [])
    return json_data
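
As a small illustration of the request URL that get_json assembles, the sketch below builds the same query string with urllib.parse.urlencode instead of string concatenation. The parameter names (subreddit, after, before) are the ones used in the function above; the helper name build_url is hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.pushshift.io/reddit/search/submission/'

def build_url(subreddit, after, before):
    """Build the search URL for one window (hypothetical helper, no network call)."""
    query = urlencode({'subreddit': subreddit, 'after': after, 'before': before})
    return BASE_URL + '?' + query

url = build_url('askwomen', 1546300800, 1546387200)
print(url)
```

Letting a library encode the parameters avoids mistakes with separators. requests.get also accepts a params dictionary that performs the same encoding.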

In [5]:
def get_field(field, json):
    #pull one field from every post; posts that lack the field yield None
    return [post.get(field) for post in json]

In [6]:
def get_df(json,fields):
    df = pd.DataFrame()
    for field in fields:
        values = get_field(field,json)
        df[field] = values
    return df
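
The sketch below shows, on made-up records, what the field extraction produces when a post lacks one of the requested fields. It restates get_field so that it runs on its own; sample_posts and its values are invented for illustration and do not come from the API.

```python
def get_field(field, records):
    # dict.get returns None when a record has no such key
    return [record.get(field) for record in records]

# invented records: the second one has no 'author_premium' key
sample_posts = [
    {'id': 'a1', 'title': 'First post', 'author_premium': False},
    {'id': 'b2', 'title': 'Second post'},
]

fields = ['id', 'title', 'author_premium']
columns = {field: get_field(field, sample_posts) for field in fields}
print(columns['author_premium'])  # [False, None]
```

Each list has one entry per post, so every column has the same length. The None entries become missing values once the lists are placed in a DataFrame, which is one way a column such as author_premium can be empty for some rows.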

## Scraping Data
Set up the fields that we want to extract from each returned JSON. Set up the date parameters for the API calls.

In [52]:
#create list of fields to extract
fields = ['id','subreddit','created_utc',
          'is_video',
          'spoiler',
          'is_self',
          'score',
          'is_original_content',
          'is_created_from_ads_ui',
          'media_only',
          'over_18',
          'num_comments',
          'num_crossposts',
          'author',
          'author_premium',
          'title',
          'selftext']

#set start and end date for posts (end date is exclusive: get_json stops at the day before edate)
sdate = '2019-01-01'
edate = '2019-12-31'

### Getting JSON output for each subreddit

In [53]:
json_1 = get_json("askwomen", sdate, edate)

In [55]:
json_2 = get_json("askmen", sdate, edate)

### Converting JSON output to a dataframe (df)

In [54]:
df1 = get_df(json_1, fields)
df1

Unnamed: 0,id,subreddit,created_utc,is_video,spoiler,is_self,score,is_original_content,is_created_from_ads_ui,media_only,over_18,num_comments,num_crossposts,author,author_premium,title,selftext
0,abcwoq,AskWomen,1546301192,False,False,True,1,False,,False,False,2,0,throwaway12898932,,How would you feel after holding unrequited fe...,
1,abcxi2,AskWomen,1546301333,False,False,True,1,False,,False,False,1,0,world_citizen7,,"Women, how important is money? [serious]",[removed]
2,abd1f7,AskWomen,1546301994,False,False,True,1,False,,False,False,1,0,jimhalpertignorantsl,,Let’s get real. We all got Amazon gift cards f...,
3,abd1sa,AskWomen,1546302060,False,False,True,1,False,,False,False,3,0,gelatodragon,,"Ladies, do you have any comical/other situatio...",Has this happened to any other female Redditor...
4,abd2z2,AskWomen,1546302285,False,False,True,1,False,,False,False,1,0,MondoFerrari,,Married women vs married men flirting,[removed]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3645,ehf8gq,AskWomen,1577670631,False,False,True,1,False,,False,False,34,0,NinjaOpsWomanager,False,What difficult (for you) task did you accompli...,
3646,ehfdqc,AskWomen,1577671351,False,False,True,1,False,,False,False,1,0,laurenwalters076,False,How to tell if I’ve gained weight?,[removed]
3647,ehff57,AskWomen,1577671545,False,False,True,1,False,,False,False,5,0,HikeTheSky,False,Do I have to be scared when my gf's mom sends ...,
3648,ehfjsf,AskWomen,1577672189,False,False,True,1,False,,False,False,2,0,Cc100c,False,Winter blues,[removed]


In [56]:
df2 = get_df(json_2, fields)
df2

Unnamed: 0,id,subreddit,created_utc,is_video,spoiler,is_self,score,is_original_content,is_created_from_ads_ui,media_only,over_18,num_comments,num_crossposts,author,author_premium,title,selftext
0,abcvrh,AskMen,1546301025,False,False,True,1,False,,False,False,2,0,sunriseglow,,"Men of Reddit, do you daydream about us as muc...",[removed]
1,abcxuj,AskMen,1546301382,False,False,True,1,False,,False,False,26,0,Greatpocketlintking,,Men of reddit who use a loofah in the shower; ...,GF found out I use mine on my whole body and s...
2,abd1ou,AskMen,1546302042,False,False,True,1,False,,False,False,0,0,hr-chicago,,Why do guys who ghost you keep checking your I...,[removed]
3,abd1u7,AskMen,1546302070,False,False,True,1,False,,False,False,31,0,sunriseglow,,"Men of Reddit, what does the media get wrong a...","TV shows, movies, literature, etc. tends to po..."
4,abd7up,AskMen,1546303212,False,False,True,1,False,,False,False,0,0,rextract99,,Wife wants children to have both our last names.,[removed]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2818,efokeu,AskMen,1577322222,False,False,True,1,False,,False,False,2,0,GeraltOfRivia2077,False,Those of you who drop your trousers all the wa...,I can't figure out why someone would go throug...
2819,efolee,AskMen,1577322375,False,False,True,1,False,,False,False,26,0,TheChosenOne55,False,"Men, How much interaction do you have with oth...",For example - I found myself through all my li...
2820,efolml,AskMen,1577322410,False,False,True,1,False,,False,False,0,0,Liuboa,False,How do I confess to a girl that i have crush o...,[removed]
2821,efoowm,AskMen,1577322901,False,False,True,1,False,,False,False,1,0,BlueRoseGirl_xx,False,"Men, if your male friend was talking to you ab...",I’ve read guys don’t usually kiss and tell but...


In [57]:
df = pd.concat([df1,df2])
df.sample(8)

Unnamed: 0,id,subreddit,created_utc,is_video,spoiler,is_self,score,is_original_content,is_created_from_ads_ui,media_only,over_18,num_comments,num_crossposts,author,author_premium,title,selftext
1629,bri4j5,AskWomen,1558485206,False,False,True,0,False,,False,False,21,0,ILikeMonitorLizards,,If you were pregnant with a child you knew was...,
925,b3ji2b,AskWomen,1553126480,False,False,True,1,False,,False,False,1,0,witchyarchivist,,For those who have severe panic attacks before...,[removed]
2066,cy42gi,AskMen,1567302234,False,False,True,1,False,,False,False,1,0,oil2k6,,Are there any early warning signs of erectile ...,[removed]
1962,c2pto2,AskWomen,1560995013,False,False,True,1,False,,False,False,1,0,badbitchchunli,,What are your go-to podcasts?,"I want to get into listening to podcasts more,..."
2312,ckh5xg,AskWomen,1564620663,False,False,True,4,False,,False,False,16,0,incendiaryashes,,You’re going away for 5 days-how do you pack y...,
2509,crf47y,AskWomen,1566002456,False,False,True,1,False,,False,False,1,0,Stranger1001,,Have you found it common for men to vocalize t...,[removed]
1070,bjpb4a,AskMen,1556761029,False,False,True,22,False,,False,False,43,0,sadboipri,,what is the most important risk that you've ev...,
972,bbuh26,AskMen,1554947673,False,False,True,1,False,,False,False,1,0,dbot2000,,Guys with Boobs: do you touch your boobs and p...,


## Additional Data Handling
- Convert post timestamps from epoch seconds to UTC datetimes
- Create new fields, including text lengths and word counts
- Create a new text field with links removed, plus a flag for posts that contain a link
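
The steps above can be sketched on a single made-up post using only the standard library. The regular expression is the one used by remove_urls in this notebook; the post contents and the features dictionary are invented for illustration.

```python
import re
from datetime import datetime, timezone

def remove_urls(text):
    # same pattern as the notebook's remove_urls
    return re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', text, flags=re.MULTILINE)

# invented post for illustration
post = {
    'created_utc': 1546300800,
    'title': 'Any podcast recommendations?',
    'selftext': 'My list so far: https://example.com/list?id=1 thanks',
}

all_text = post['title'] + ' ' + post['selftext']
features = {
    # passing tz makes the conversion independent of the local time zone
    'created': datetime.fromtimestamp(post['created_utc'], tz=timezone.utc).isoformat(),
    'all_text_length': len(all_text),
    'all_text_words': len(all_text.split()),
    'no_links_text': remove_urls(all_text),
    'contains_link': int('http' in all_text),
}
print(features['created'])         # 2019-01-01T00:00:00+00:00
print(features['all_text_words'])  # 9
print(features['no_links_text'])
```

Note that the pattern's character set does not include hyphens, so a URL containing one would only be partially removed.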

In [58]:
#epoch seconds -> UTC datetimes (datetime.fromtimestamp would use the local time zone instead)
df['created_utc'] = pd.to_datetime(df['created_utc'], unit='s')

In [59]:
#https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python
def remove_urls(vTEXT):
    vTEXT = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', vTEXT, flags=re.MULTILINE)
    return vTEXT

In [60]:
df = df.fillna('')
df['selftext'] = df['selftext'].replace('[removed]', '')
df['all_text'] = df['title'] + ' ' + df['selftext']

#get string lengths
df['title_length'] = [len(i) for i in df['title']]
df['selftext_length'] = [len(i) for i in df['selftext']]
df['all_text_length'] = [len(i) for i in df['all_text']]

#get word counts
df['title_words'] = [len(i.split()) for i in df['title']]
df['selftext_words'] = [len(i.split()) for i in df['selftext']]
df['all_text_words'] = [len(i.split()) for i in df['all_text']]

#text with links removed
df['no_links_text'] = [remove_urls(i) for i in df['all_text']]

#flag posts whose text contains a link (matches both http and https)
df['contains_link'] = np.where(df['all_text'].str.contains('http', regex=False), 1, 0)

df.sample(3)

Unnamed: 0,id,subreddit,created_utc,is_video,spoiler,is_self,score,is_original_content,is_created_from_ads_ui,media_only,...,selftext,all_text,title_length,selftext_length,all_text_length,title_words,selftext_words,all_text_words,no_links_text,contains_link
416,ahszsr,AskWomen,2019-01-19 20:55:16,False,False,True,1,False,,False,...,,How men should approach to women and which way...,64,0,65,13,0,13,How men should approach to women and which way...,0
2422,cm524u,AskWomen,2019-08-04 22:23:11,False,False,True,1,False,,False,...,,Has anyone ordered anything from Rotita? Good?...,51,0,52,8,0,8,Has anyone ordered anything from Rotita? Good?...,0
1503,c5yg8h,AskMen,2019-06-26 20:17:33,False,False,True,1,False,,False,...,,Has your dating life changed after you got pla...,59,0,60,10,0,10,Has your dating life changed after you got pla...,0


## Save File
Save file to data folder

In [61]:
df.to_csv("../data/women_men_19.csv")