pip install pushshift.py
At present, only Python 3 is supported.
A minimalist wrapper for searching public reddit comments/submissions via the pushshift.io API.
Pushshift is an extremely useful resource, but the API is poorly documented. As such, this API wrapper is currently designed to make it easy to pass pretty much any search parameter the user wants to try.
Although it does not necessarily reflect the current state of the API, you should familiarize yourself with the Pushshift API documentation to get a better sense of which search arguments are likely to work.
- Handles rate limiting and exponential backoff, subject to maximum-retry and maximum-backoff limits. A minimum rate limit of 1 request per second is used as a default, per consultation with Pushshift's maintainer, /u/Stuck_in_the_matrix.
- Handles paging of results when using supported sort options. At the moment, only `created_utc` sort types page properly. Returns all historical results for a given query by default.
- Optionally handles incorporation of `praw` to fetch objects after getting ids from pushshift.
- If not using `praw`, returns results in `comment` and `submission` objects whose API is similar to the corresponding `praw` objects. Additionally, result objects have a `.d_` attribute that offers dict access to the associated data attributes (see the sketch after this list).
- Optionally adds a `created` attribute which converts a comment/submission's `created_utc` timestamp to the user's local time (this may raise exceptions for users with certain timezone settings).
- Simple interface to pass query arguments to the API. The API is sparsely documented, so it's often fruitful to just try an argument and see if it works.
- Limited support for pushshift's `aggs` argument.
- A `stop_condition` argument to make it simple to stop yielding results given arbitrary user-defined criteria.
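Here is a minimal sketch (assuming the package is installed and the pushshift.io API is reachable) of what attribute access on a returned result can look like; the exact fields available depend on what pushshift returns for the item:
from pushshift_py import PushshiftAPI

api = PushshiftAPI()

# Grab a single comment to inspect.
comment = next(api.search_comments(subreddit='askreddit', limit=1))

print(comment.body)       # pushshift data attributes are exposed via dot notation
print(comment.d_.keys())  # `.d_` offers dict access to the same data
print(comment.created)    # `created_utc` converted to local time, when that feature is enabled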
Non-default sorts (i.e. sorting by anything other than `created_utc`) have limited support from the pushshift.io API. As such, this project will raise an exception for any request that can't provide reliably sorted and paged data. Non-default sorts require a `limit` <= `max_results_per_request` (500 by default).
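A rough illustration, continuing with the `api` instance from the sketch above (note that `sort_type` and `sort` are pushshift.io query parameters that the wrapper passes straight through, so treat their use here as an assumption rather than documented wrapper behavior):
# A non-default sort must supply a limit no larger than max_results_per_request
# (500 by default); a larger limit, or no limit at all, will raise an exception.
top_scored = list(api.search_comments(q='science',
                                      sort_type='score',
                                      sort='desc',
                                      limit=100))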
from pushshift_py import PushshiftAPI
api = PushshiftAPI()
Or to use pushshift search to fetch ids and then use praw to fetch objects:
import praw
from pushshift_py import PushshiftAPI
r = praw.Reddit(...)
api = PushshiftAPI(r)
# The `search_comments` and `search_submissions` methods return generator objects
gen = api.search_submissions(limit=100)
results = list(gen)
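Since results are yielded lazily, you can also iterate without materializing the whole list; a small sketch (the `id`/`title` fields assume submission-style results, whether wrapper objects or `praw` objects):
for subm in api.search_submissions(subreddit='askreddit', limit=5):
    print(subm.id, subm.title)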
First 10 submissions to /r/politics in 2017, filtering results to url/author/title/subreddit fields.
The `created_utc` field will be added automatically (it's used for paging).
import datetime as dt
start_epoch = int(dt.datetime(2017, 1, 1).timestamp())
list(api.search_submissions(after=start_epoch,
                            subreddit='politics',
                            filter=['url', 'author', 'title', 'subreddit'],
                            limit=10))
According to the pushshift.io API documentation, we should be able to search submissions by url,
but (at the time of this writing) this doesn't actually work in practice.
The API should still respect the `limit` argument and possibly other supported arguments,
but there are no guarantees. If you find that an argument you have passed is not supported by the API,
the best thing to do is remove it and modify your call to use only supported arguments,
to mitigate the risk of unexpected behavior.
url = 'http://www.politico.com/story/2017/02/mike-flynn-russia-ties-investigation-235272'
url_results = list(api.search_submissions(url=url, limit=500))
len(url_results), any(r.url == url for r in url_results)
# 500, False  (the limit is respected, but none of the results actually match the url)
Use the `q` parameter to search text. Omitting the `limit` parameter performs a full
historical search. Requests are performed in batches of the size specified by the
`max_results_per_request` parameter (default=500). Omitting the `max_response_cache`
test in the demo below will return all results; otherwise, this demo will perform two
API requests, returning 500 comments each. Alternatively, the generator can be queried later for additional results.
gen = api.search_comments(q='OP', subreddit='askreddit')
max_response_cache = 1000
cache = []
for c in gen:
    cache.append(c)
    # Omit this test to actually return all results. Wouldn't recommend it though: could take a while, but you do you.
    if len(cache) >= max_response_cache:
        break

# If you really want to: pick up where we left off to get the rest of the results.
if False:
    for c in gen:
        cache.append(c)
Replicating the example from the pushshift documentation:
I haven't really experimented much with this functionality of the API, so I figured
the simplest way to support it would be to just disable most of the bells and whistles
provided by the API wrapper when the `aggs` argument is provided (i.e. paging, and converting
the result to a namedtuple for dot-notation attribute access).
api = PushshiftAPI()
gen = api.search_comments(q='trump',
                          after='7d',
                          aggs='created_utc',
                          frequency='hour',
                          size=0,
                          )
result = next(gen)
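Because the usual result wrapping is disabled for `aggs` queries, `result` is (as far as I can tell) the raw aggregation payload from the API rather than a wrapper object; its exact shape is determined by pushshift.io, so inspect it before relying on particular keys:
print(type(result))  # likely a plain dict rather than a wrapper object
print(result)        # raw aggregation data as returned by pushshift.io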
# Use `stop_condition` to stop fetching results as soon as a user-defined
# condition is met; here, when a submission author's name contains 'bot'.
gen = api.search_submissions(stop_condition=lambda x: 'bot' in x.author)
for subm in gen:
    pass
print(subm.author)
PSAW's source is provided under the Simplified BSD License.
- Copyright (c) 2018, David Marx