# A Scraper for the AITA subreddit created using PRAW and PSAW

This scraper produces JSON files of submissions from the AITA subreddit. Each JSON file contains a list of dictionaries holding the fields I needed for my analysis: post title, text, label (the post's flair), upvotes, upvote ratio, ID, and URL. I initially used this to scrape 100k AITA posts.

I had to combine PRAW and PSAW because some of the Reddit submission fields that I needed (link_flair_text, score) were not accurate in PSAW, while PRAW's time-range search functionality has been disabled. The combined scraper first performs a time-range search on the AITA subreddit using PSAW. The submission IDs from those results are then passed to PRAW, which retrieves the relevant submission information.

PRAW (https://praw.readthedocs.io/en/latest/) and PSAW were both used to create this scraper. 
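The two-stage flow described above can be sketched without touching either API. This is only an illustration: `collect_labeled`, `search`, and `fetch` are hypothetical names, and the stand-in functions return made-up data in place of PSAW and PRAW.

```python
from types import SimpleNamespace

def collect_labeled(search, fetch, before, limit):
    """Stage 1: time-range search for ids. Stage 2: fetch full details per id."""
    hits = search(before=before, limit=limit)
    posts = []
    for hit in hits:
        full = fetch(hit.id)
        # keep only submissions that carry a flair
        if full.link_flair_text is not None:
            posts.append({"id": full.id,
                          "label": full.link_flair_text,
                          "upvotes": full.score})
    # timestamp of the last hit, used as the next `before` value
    last_time = hits[-1].created_utc if hits else None
    return posts, last_time

# Made-up data standing in for the two services (hypothetical values).
FAKE_INDEX = [SimpleNamespace(id="a1", created_utc=300),
              SimpleNamespace(id="b2", created_utc=200),
              SimpleNamespace(id="c3", created_utc=100)]
FAKE_DETAILS = {
    "a1": SimpleNamespace(id="a1", link_flair_text="Not the A-hole", score=12),
    "b2": SimpleNamespace(id="b2", link_flair_text=None, score=3),
    "c3": SimpleNamespace(id="c3", link_flair_text="Asshole", score=40),
}

def fake_search(before, limit):
    # newest first, standing in for the time-range search
    return [h for h in FAKE_INDEX if h.created_utc < before][:limit]

def fake_fetch(submission_id):
    # standing in for the per-id lookup
    return FAKE_DETAILS[submission_id]

posts, last_time = collect_labeled(fake_search, fake_fetch, before=400, limit=2)
print(posts, last_time)
```

The unflaired post `b2` is dropped, and `last_time` marks where the next search would start.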



In [1]:
import json
import praw
reddit = praw.Reddit(client_id='YOUR_ID', client_secret='YOUR_SECRET', user_agent='YOUR_AGENT')

In [4]:
import datetime

from psaw import PushshiftAPI

api = PushshiftAPI()
subreddit = reddit.subreddit('AmItheAsshole')

In [5]:
def get_more_posts(start=datetime.date(2020, 1, 14), lim=1000):
    """Return (posts, timestamp of the last search result) for posts made before `start`."""

    # initialize post list
    posts = []

    # search for posts from before the specified time in PSAW
    results = list(api.search_submissions(before=start,
                                          subreddit='amitheasshole',  # change subreddit
                                          filter=['url', 'num_comments', 'created_utc', 'id'],  # change traits returned
                                          limit=lim))

    for result in results:

        # look up the id of each PSAW result in PRAW
        submission = reddit.submission(id=result.id)

        # add the post to the list only if it has a flair (link_flair_text)
        if submission.link_flair_text is not None:
            posts.append({
                "title": submission.title,
                "text": submission.selftext,
                "label": submission.link_flair_text,
                "upvotes": submission.score,
                "upvote_ratio": submission.upvote_ratio,
                "id": submission.id,
                "url": submission.url,
            })

    # return list of posts and the timestamp of the last post in the search
    # (note: results[-1] raises IndexError if the search returned nothing)
    return posts, results[-1].created_utc
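
`get_more_posts` returns `created_utc`, which Reddit reports as seconds since the Unix epoch, while the default `start` is a `datetime.date`. If you would rather pass the same kind of value on every call, the start date can be converted to an epoch timestamp first. A minimal sketch, assuming the start date is meant as midnight UTC; `to_epoch` is a hypothetical helper, not part of PRAW or PSAW.

```python
import datetime

def to_epoch(day):
    """Convert a datetime.date to whole seconds since the Unix epoch (midnight UTC)."""
    midnight = datetime.datetime(day.year, day.month, day.day,
                                 tzinfo=datetime.timezone.utc)
    return int(midnight.timestamp())

start = to_epoch(datetime.date(2020, 1, 14))
print(start)  # 1578960000
```

The result could then be passed in place of the date, for example `get_more_posts(start, 1000)`.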


In [None]:
target = 10000  # the number of posts you want to acquire
post_list = []  # list of posts
time = datetime.date(2020, 1, 14)  # start date

# keep calling get_more_posts until post_list is long enough
while len(post_list) < target:

    print(time)  # optional for seeing when a new loop starts

    try:
        (posts, time) = get_more_posts(time, 1000)
    except IndexError:  # the search came back empty, so the target can't be reached
        break
    post_list.extend(posts)

    with open('aita_psaw_praw_{}.json'.format(len(post_list)), 'w') as json_file:  # optional, save files from one iteration
        json.dump(post_list, json_file)

# target reached (or no more results available): save everything collected
with open('aita_psaw_praw_final_{}.json'.format(len(post_list)), 'w') as json_file:  # save final file
    json.dump(post_list, json_file)

with open("final_time.json", 'w') as json_file:  # save final timestamp in case you need to run some more
    json.dump(time, json_file, default=str)  # default=str covers the case where time is still a date

print(time)  # optional for seeing the last timestamp
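
To pick up where a previous run stopped, the timestamp saved in `final_time.json` can be read back and used as the new start time. A minimal sketch; `load_resume_time` is a hypothetical helper, and the demonstration writes a throwaway file in a temporary directory (with a made-up timestamp) rather than touching real output.

```python
import json
import os
import tempfile

def load_resume_time(path, default):
    """Return the timestamp stored at `path`, or `default` if the file does not exist."""
    if not os.path.exists(path):
        return default
    with open(path) as json_file:
        return json.load(json_file)

# Demonstration with a temporary file (the timestamp is a made-up value).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "final_time.json")
    missing = load_resume_time(path, default=0)   # no file yet, falls back to default
    with open(path, "w") as json_file:
        json.dump(1578960000, json_file)
    resumed = load_resume_time(path, default=0)   # file exists, returns saved value

print(missing, resumed)
```

In the scraper itself, the loaded value would replace the hard-coded start date before the loop begins.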
    