
# Using Pushshift Module to extract Submissions Data from Reddit via Python

PRAW is pretty good at getting Reddit data, but it has some limitations,
including the removal of the [subreddit.submissions endpoint](https://www.reddit.com/r/changelog/comments/7tus5f/update_to_search_api/.).

So, to extract Reddit submissions and their primary data, such as upvotes and comment counts, I put together this notebook using Pushshift.

If you still prefer PRAW for extracting submissions, I have written a code [template here](https://github.com/SeyiAgboola/Seyi_Projects/blob/master/submission_list.py).

I will also [host the code on GitHub](https://github.com/SeyiAgboola/Reddit-Data-Mining/blob/master/Using_Pushshift_Module_to_extract_Submissions.ipynb).

More info on the removal of the [subreddit.submissions endpoint](https://www.reddit.com/r/redditdev/comments/8bia9n/praw_psa_the_subredditsubmissions_method_no/).

# Import modules

In [59]:
import pandas as pd
import requests #Pushshift is accessed via a URL, so this is needed
import json #JSON manipulation
import csv #To Convert final table into a csv file to save to your machine
import time
import datetime

In [62]:
#Adapted from this https://gist.github.com/dylankilkenny/3dbf6123527260165f8c5c3bc3ee331b
#This function builds a Pushshift URL, requests it and returns the list of submissions found in the JSON response
def getPushshiftData(after, before, sub):
    #Build URL
    url = 'https://api.pushshift.io/reddit/search/submission/?size=10&after='+str(after)+'&before='+str(before)+'&subreddit='+str(sub)
    #Print URL to show user
    print(url)
    #Request URL
    r = requests.get(url)
    #Load JSON data from webpage into data variable
    data = json.loads(r.text)
    #return the data element which contains all the submissions data
    return data['data']

def getPushshiftData_comments(sub_id):
    #Build URL
    url = 'https://api.pushshift.io/reddit/search/comment/?link_id=t3_'+str(sub_id)
    #Print URL to show user
    print(url)
    #Request URL
    r = requests.get(url)
    #Load JSON data from webpage into data variable
    data = json.loads(r.text)
    #return the data element which contains all the comments data for this submission
    return data['data']
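
The string concatenation used in the two functions above can also be written with `urllib.parse.urlencode`, which escapes the values for you. This is a minimal sketch, assuming the same endpoint and query parameters as above; `build_submission_url` and `SUBMISSION_ENDPOINT` are hypothetical names introduced for illustration and are not part of Pushshift or the original notebook.

```python
from urllib.parse import urlencode

# Base endpoint copied from getPushshiftData() above.
SUBMISSION_ENDPOINT = "https://api.pushshift.io/reddit/search/submission/"


def build_submission_url(after, before, sub, size=10):
    """Build the submission search URL (hypothetical helper for illustration)."""
    # urlencode keeps the dict's insertion order and escapes each value.
    params = {"size": size, "after": after, "before": before, "subreddit": sub}
    return SUBMISSION_ENDPOINT + "?" + urlencode(params)


print(build_submission_url(1577817000, 1580322600, "ptsd"))
```

Keeping the URL construction in its own function makes it possible to check the query string without sending a request.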

# Extract key information from Submissions
* Submission Title
* Body of Post 
* Author
* Submission post ID
* Score
* Awards
* Upload Time
* No. of Comments 

# Extract key information from Comments
* Body
* Author
* is_op (whether the commenter is the original poster)
* Date created
* Score
* No. of awards



In [63]:
#This function will be used to extract the key data points from each JSON result
def collectSubData(subm):
    subData = list() 
    title = subm['title']
    body = subm['selftext']
    # #flairs are not always present so we wrap in try/except
    # try:
    #     flair = subm['link_flair_text']
    # except KeyError:
    #     flair = "NaN"    
    author = subm['author']
    sub_id = subm['id']
    score = subm['score']
    awards = subm['total_awards_received']
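    created = datetime.datetime.fromtimestamp(subm['created_utc']) #created_utc is a Unix timestamp, e.g. 1520561700.0
    numComms = subm['num_comments']
    #Store one tuple per submission; the field order must match the CSV headers used later
    subData.append((sub_id,title,body,author,score,awards,numComms,created))
    #subStats is the global dictionary defined in the search settings cell
    subStats[sub_id] = subData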

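#This function fetches the comments of one submission (given its ID) and extracts the key data points from each
def collectComments(sub_id):
    data = getPushshiftData_comments(sub_id)
    #comments is the global dictionary defined in the search settings cell
    comments[sub_id] = list()
    for comment in data:
        author = comment['author']
        body = comment['body']
        created = datetime.datetime.fromtimestamp(comment['created_utc'])
        is_op = comment['is_submitter']
        score = comment['score']
        awards = comment['total_awards_received']
        comments[sub_id].append((body,author,is_op,created,score,awards))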
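
The commented-out flair handling in `collectSubData` hints that not every field is guaranteed to be present. Below is a minimal sketch of the same extraction using `dict.get` with defaults. It assumes that optional keys such as `selftext` or `total_awards_received` can be missing from some records, which the source does not confirm. `extract_submission_row` and the default values are hypothetical. Unlike the notebook, it also produces a timezone-aware UTC datetime rather than one in the machine's local time.

```python
import datetime


def extract_submission_row(subm):
    """Return one submission as a tuple, tolerating missing optional keys."""
    # Timezone-aware UTC datetime, independent of the machine's local timezone.
    created = datetime.datetime.fromtimestamp(
        subm["created_utc"], tz=datetime.timezone.utc
    )
    return (
        subm["id"],
        subm["title"],
        subm.get("selftext", ""),             # assumed optional
        subm.get("author", "[unknown]"),      # assumed optional
        subm.get("score", 0),                 # assumed optional
        subm.get("total_awards_received", 0), # assumed optional
        subm.get("num_comments", 0),          # assumed optional
        created,
    )


# Made-up record with only the required keys.
sample = {"id": "abc123", "title": "Example post", "created_utc": 1577836800}
row = extract_submission_row(sample)
print(row)
```

The tuple keeps the same field order as `collectSubData`, so it would line up with the CSV headers used later.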

# Update your Search Settings here

In [66]:
#Create your timestamps and queries for your search URL
#https://www.unixtimestamp.com/index.php > Use this to create your timestamps
after = "1577817000" #Submissions after this timestamp (1577817000 = 31 Dec 2019 18:30 UTC)
before = "1580322600" #Submissions before this timestamp (1580322600 = 29 Jan 2020 18:30 UTC)
sub = "ptsd" #Which Subreddit to search in
#subCount tracks the no. of total submissions we collect
subCount = 0
#subStats is the dictionary where we will store our data.
subStats = {}
comments={}
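
Instead of looking timestamps up on a website, they can be computed with the standard `datetime` module. This is a minimal sketch; `to_unix` is a hypothetical helper name, and it assumes you want midnight UTC on the given date.

```python
import datetime


def to_unix(year, month, day):
    """Unix timestamp for midnight UTC on the given date."""
    dt = datetime.datetime(year, month, day, tzinfo=datetime.timezone.utc)
    return int(dt.timestamp())


print(to_unix(2020, 1, 1))   # 1577836800
print(to_unix(2020, 12, 4))  # 1607040000
```

Passing `tzinfo=datetime.timezone.utc` matters: without it, `timestamp()` interprets the date in the machine's local timezone and the result shifts by the UTC offset.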

In [67]:
# Fetch the first batch before the loop so that 'data' is defined
data = getPushshiftData(after, before, sub)
# Keep going until a request returns no submissions between 'after' and 'before'
while len(data) > 0:
    #Stop once at least 10 submissions have been collected (checked once per batch)
    if subCount >= 10:
        break
    for submission in data:
        collectSubData(submission)
        collectComments(submission['id'])
        subCount+=1
    #Show the batch size and the created date of the last submission in the batch
    print(len(data))
    print(str(datetime.datetime.fromtimestamp(data[-1]['created_utc'])))
    #Move 'after' to the created date of the last submission so the next request continues from there
    after = data[-1]['created_utc']
    #Fetch the next batch using the updated 'after' value
    data = getPushshiftData(after, before, sub)

#Size of the last batch fetched (0 if the search was exhausted)
print(len(data))

https://api.pushshift.io/reddit/search/submission/?size=10&after=1577817000&before=1580322600&subreddit=ptsd
https://api.pushshift.io/reddit/search/comment/?link_id=t3_ei84a7
https://api.pushshift.io/reddit/search/comment/?link_id=t3_ei8r5h
https://api.pushshift.io/reddit/search/comment/?link_id=t3_ei95yg
https://api.pushshift.io/reddit/search/comment/?link_id=t3_ei9ela
https://api.pushshift.io/reddit/search/comment/?link_id=t3_ei9f9p
https://api.pushshift.io/reddit/search/comment/?link_id=t3_ei9l8w
https://api.pushshift.io/reddit/search/comment/?link_id=t3_ei9za5
https://api.pushshift.io/reddit/search/comment/?link_id=t3_eia3x2
https://api.pushshift.io/reddit/search/comment/?link_id=t3_eia9b8
https://api.pushshift.io/reddit/search/comment/?link_id=t3_eibybn
10
2020-01-01 01:18:17
https://api.pushshift.io/reddit/search/submission/?size=10&after=1577841497&before=1580322600&subreddit=ptsd
10
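
The paging logic of the loop can be tried without network access by swapping the API call for a stand-in. This is a minimal sketch, not the notebook's code: `paginate`, `fake_fetch` and `FAKE_POSTS` are hypothetical names, and the cap is checked per item here rather than per batch as in the loop above.

```python
def paginate(fetch_page, after, before, max_items=10):
    """Collect items page by page, moving `after` to the last item's timestamp."""
    collected = []
    page = fetch_page(after, before)
    while page and len(collected) < max_items:
        for item in page:
            if len(collected) >= max_items:
                break
            collected.append(item)
        # Continue from the creation time of the last item on this page.
        after = page[-1]["created_utc"]
        page = fetch_page(after, before)
    return collected


# Stand-in for the API: 25 made-up submissions, one per second.
FAKE_POSTS = [{"id": i, "created_utc": 1000 + i} for i in range(25)]


def fake_fetch(after, before, size=10):
    """Return up to `size` fake posts created strictly between after and before."""
    matches = [p for p in FAKE_POSTS if int(after) < p["created_utc"] < int(before)]
    return matches[:size]


result = paginate(fake_fetch, after=999, before=2000, max_items=12)
print(len(result), result[0]["id"], result[-1]["id"])
```

Because the fetch function is passed in as an argument, the same loop could be pointed at `getPushshiftData` by wrapping it so that it only takes `after` and `before`.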


# Check your Submission Extraction was successful

In [68]:
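# Each entry is a one-element list holding the tuple
# (sub_id, title, body, author, score, awards, numComms, created),
# so the title is index 1 and the created date is index 7 (index 5 is the awards count)
print(str(len(subStats)) + " submissions have added to list")
print("1st entry is:")
print(list(subStats.values())[0][0][1] + " created: " + str(list(subStats.values())[0][0][7]))
print("Last entry is:")
print(list(subStats.values())[-1][0][1] + " created: " + str(list(subStats.values())[-1][0][7]))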

10 submissions have added to list
1st entry is:
Ending the year by myself. created: 0
Last entry is:
Why I can't stand the thought that tonight is the beginning to another year. created: 0


# Save data to CSV file

In [69]:
def updateSubs_file():
    upload_count = 0
    #location = "\\Reddit Data\\" >> If you're running this outside of a notebook you'll need this to point to a specific location
    print("input filename of submission file, please add .csv")
    filename = input() #This asks the user what to name the file
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        a = csv.writer(file, delimiter=',')
        headers = ['sub_id','title','body','author','score','awards','numComms','created']
        a.writerow(headers)
        for sub_id in subStats:
            #Each entry is a one-element list holding the submission tuple
            a.writerow(subStats[sub_id][0])
            upload_count+=1

        print(str(upload_count) + " submissions have been uploaded")
updateSubs_file()

input filename of submission file, please add .csv
ptsd_posts.csv
10 submissions have been uploaded


In [70]:
def updateSubs_file_comm():
    upload_count = 0
    #location = "\\Reddit Data\\" >> If you're running this outside of a notebook you'll need this to point to a specific location
    print("input filename of comments file, please add .csv")
    filename = input() #This asks the user what to name the file
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        fieldnames = ['post_id','body','author','is_op','created','score','awards']
        writer = csv.writer(file)
        writer.writerow(fieldnames)
        for post in comments:
            for com in comments[post]:
                #Prefix each comment tuple with the ID of the post it belongs to
                tmp = [post]
                tmp.extend(com)
                writer.writerow(tmp)
                upload_count+=1

        print(str(upload_count) + " comments have been uploaded")
updateSubs_file_comm()

input filename of comments file, please add .csv
ptsd_comments.csv
59 comments have been uploaded
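
Comment bodies often contain commas and line breaks, so it can be worth confirming that the CSV reads back correctly. This is a minimal sketch of a round trip using the same column layout, written to an in-memory buffer instead of a file; `toy_comments` is made-up data in the same shape as the `comments` dictionary.

```python
import csv
import io

# Made-up data shaped like `comments`:
# {post_id: [(body, author, is_op, created, score, awards), ...]}
toy_comments = {
    "post1": [("first", "alice", False, "2020-01-01 00:05:00", 2, 0)],
    "post2": [
        ("second, with a comma\nand a line break", "bob", True, "2020-01-01 00:10:00", 1, 0),
        ("third", "carol", False, "2020-01-01 00:15:00", 5, 1),
    ],
}
fieldnames = ["post_id", "body", "author", "is_op", "created", "score", "awards"]

# newline="" leaves line endings to the csv module, as with open(..., newline='').
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(fieldnames)
for post_id, coms in toy_comments.items():
    for com in coms:
        writer.writerow((post_id, *com))

# Read the rows back as dictionaries keyed by the header row.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(len(rows), rows[0]["post_id"], rows[-1]["author"])
```

Note that everything comes back as a string (for example `is_op` becomes `"True"`), so numeric and boolean columns need converting after loading.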


In [38]:
print(subStats)
print(comments)

{'ei84a7': [('fcoz0y1', 'Happy new year.  Everything gets better in January.  I hope you get some good rest before morning.'), ('fcoq6la', "Love yourself! Don't keep self-love away from you. It is there, inside! Find it, feel it and live it! Don't judge your life! Trust your life! You are doing the best that you can!  Allow yourself to feel joy about it instead of judging yourself. So, have have a great 2020!  🌟"), ('fcok30g', "Happy new year.\n\nPTSD forces us to get stronger to survive, and if you're getting stronger, that makes next year better than this one. I wish you luck in 2020.")], 'ei8r5h': [('fct3ta4', 'Thank you c: \n\nTake this &lt;3'), ('fct3r23', 'Ah I see. I’m sorry :( I totally understand wanting to talk about it with friends and being afraid of being a downer. I feel like as long as you have learned from your experience you have nothing to be ashamed about.\n\nI kept the conversation of me having ptsd very short, vague, and to the point. Just to make them aware that I