# PushShift

---
r/boardgames

In [1]:
import requests
import calendar
import time
import pandas as pd

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

The most important parameters to pass to the PushShift [API](https://api.pushshift.io/docs#/default/search_reddit_posts_reddit_search_submission_get):

- `subreddit` (name of subreddit)
- `since` (earliest date to get)
- `until` (latest date to get)
- `limit` (how many submissions to get back per call of the API; this is the name the request cells below use.  **Use 100 or fewer per call to avoid timeouts.**)
- `filter` (which pieces of info do you want about the post?  We're primarily just interested in the text itself, rather than other metadata.  **First get a model working that ONLY uses NLP techniques.** If you later want to enhance your classifier using metadata, go ahead.)
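
As a rough sketch of how these parameters fit together, the helpers below build a params dictionary for a date window. `to_utc_epoch` and `build_params` are hypothetical names introduced here for illustration, and the page size is passed as `limit` because that is the key the request cells in this notebook use. The timestamps are plain UTC epoch seconds, the same kind of value as `created_utc`.

```python
import calendar
from datetime import datetime


def to_utc_epoch(year, month, day):
    """Return the Unix timestamp for midnight UTC on the given date."""
    return calendar.timegm(datetime(year, month, day).timetuple())


def build_params(subreddit, since, until, limit=100,
                 fields=('subreddit', 'selftext', 'title', 'created_utc')):
    """Assemble a query dictionary (hypothetical helper, not part of any library)."""
    return {
        'subreddit': subreddit,
        'since': since,              # earliest timestamp to include
        'until': until,              # latest timestamp to include
        'limit': limit,              # submissions per call
        'filter': ','.join(fields),  # comma-separated list of fields to return
    }


params_example = build_params('boardgames',
                              since=to_utc_epoch(2023, 1, 1),
                              until=to_utc_epoch(2023, 1, 6))
print(params_example)
```

The resulting dictionary can be passed to `requests.get(url, params_example)` in the same way as the `params` dictionaries in the cells that follow.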

In [3]:
params = {'subreddit': 'boardgames',
          'limit': 10,
         }

In [4]:
res = requests.get(url, params)
res.status_code

200

In [5]:
res.json().keys()

dict_keys(['data', 'error', 'metadata'])

In [6]:
#res.json()['data'][:1]

In [7]:
df = pd.DataFrame(res.json()['data'])
df.head(3)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,created_utc,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,url_overridden_by_dest
0,boardgames,i seem to recall someone having a program or w...,t2_7kieo,0,has anyone made a foamcore insert or cardholde...,"[{'e': 'text', 't': 'How-To/DIY'}]",r/boardgames,False,6,howto,...,1673037957,0,,False,1673037977,1673037977,2023-01-06 20:45:57,,,
1,boardgames,[removed],t2_toabz5s0,0,here is some money making tips,[],r/boardgames,False,6,,...,1673037511,0,,False,1673037527,1673037528,2023-01-06 20:38:31,self,{'images': [{'source': {'url': 'https://extern...,
2,boardgames,So i wanted to do test ChatGPT a bit and asked...,t2_snchy,0,I asked Chatgpt to create a Libary with 40 gam...,[],r/boardgames,False,6,,...,1673037478,0,,False,1673037494,1673037495,2023-01-06 20:37:58,,,


As we can see, this has 91 columns.  We basically just want the `selftext` column, and maybe the title.

In [8]:
params = {'subreddit': 'boardgames',
          'limit': 10,
          'filter': 'subreddit, selftext, title, created_utc'
         }

In [9]:
res = requests.get(url, params)
res.status_code

200

In [10]:
df = pd.DataFrame(res.json()['data'])
df.head(10)

Unnamed: 0,subreddit,selftext,title,created_utc
0,boardgames,i seem to recall someone having a program or w...,has anyone made a foamcore insert or cardholde...,1673037957
1,boardgames,[removed],here is some money making tips,1673037511
2,boardgames,So i wanted to do test ChatGPT a bit and asked...,I asked Chatgpt to create a Libary with 40 gam...,1673037478
3,boardgames,"So, I've submitted a post like this before. It...",Truly unique games from the last 1-2 years,1673037068
4,boardgames,"Had World for a while, grabbed National Parks ...",Trekking Trilogy complete,1673036868
5,boardgames,"Whenever I bring boardgames to my meet ups, I ...",Ideas for carrying gaming boards?,1673035230
6,boardgames,"Invest in cheap breakfast in bed tables, prefe...",Pro tip for couples who play board games toget...,1673035206
7,boardgames,Hello. Are there any fanmade expansions for Ca...,Carcassonne Star Wars Fanmade Expansions,1673035134
8,boardgames,"In my view, a game rated a 5 complexity has th...",How do you view and judge a game's weight/comp...,1673029540
9,boardgames,,is this rare? Memoir '44 Air Pack days of Wond...,1673028127


The `created_utc` value is a Unix timestamp: the number of seconds that had elapsed since midnight UTC on January 1st, 1970 at the moment the post was created.
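
To make a timestamp readable, it can be converted back to a UTC date and time. This is a small sketch using only the standard library; `utc_string` is a hypothetical helper name, and the sample value is the `created_utc` of the first post in the table above.

```python
from datetime import datetime, timezone


def utc_string(created_utc):
    """Format a Unix timestamp (seconds) as a UTC date-time string."""
    moment = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')


print(utc_string(1673037957))  # prints 2023-01-06 20:45:57
```

For a whole column, `pd.to_datetime(df['created_utc'], unit='s')` does the same conversion in one step.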

Notice that the `created_utc` numbers are decreasing.  By default, PushShift grabs the *newest* posts from the specified subreddit.  So our code grabbed the 10 most recent posts on the subreddit.

So how do we get 10 *more* posts, that don't overlap with the ones we already got?

In [11]:
params2 = {'subreddit': 'boardgames',
          'limit': 10,
          'filter': 'subreddit, selftext, title, created_utc',
           'until': df['created_utc'].min()
           #Include posts that have a UTC earlier than the earliest one in our df
         }

In [12]:
res2 = requests.get(url, params2)

In [56]:
df = pd.DataFrame(res2.json()['data'])
df.head(10)

Unnamed: 0,subreddit,selftext,title,created_utc
0,boardgames,,I made laser-engraved Azul coasters for a chri...,1673027845
1,boardgames,,Friend posted this on Facebook. Any idea what ...,1673026770
2,boardgames,,A Friend of mine posted this on Facebook and I...,1673026385
3,boardgames,I'm looking for recommendations to watch peopl...,Good Channels to watch games?,1673026055
4,boardgames,Hi everyone. I was spending the holidays with ...,Any info on an older game called Bid and Bluff.,1673025517
5,boardgames,,Don't Forget To Read The Instructions: Sheeple,1673023782
6,boardgames,"Me, my mom and grandma are trying to play Scot...","In Scotland Yard: Sherlock Holmes Edition, whe...",1673023103
7,boardgames,,Has anyone tried one of these digital board ga...,1673022029
8,boardgames,,Help identifying board game in this picture…,1673021483
9,boardgames,Hey looking for some fun games to play with a ...,Adult games,1673021001


Notice that the latest UTC time in this new dataframe is just before the earliest one in our last dataframe.

In [55]:
df['created_utc'].min()

1673021001
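
The same trick can be repeated in a loop to walk further back in time. The sketch below is one possible way to do it, not a definitive implementation: `paginate` is a hypothetical helper, and it takes the request function as an argument so the paging logic can be tried without network access. It assumes, as in the cells above, that setting `until` to the smallest `created_utc` seen so far returns only older posts.

```python
import time


def paginate(fetch, base_params, n_pages, pause=1.0):
    """Collect up to n_pages of posts, moving `until` back after each page.

    fetch: callable that takes a params dict and returns a list of post dicts.
    """
    posts = []
    params = dict(base_params)  # copy so the caller's dict is not modified
    for _ in range(n_pages):
        page = fetch(params)
        if not page:            # nothing older is left
            break
        posts.extend(page)
        # Next request: only posts older than the oldest one seen so far
        params['until'] = min(post['created_utc'] for post in page)
        time.sleep(pause)       # be polite to the API between calls
    return posts


# Possible usage with the `url` defined earlier (requires network access):
# fetch = lambda p: requests.get(url, p).json()['data']
# posts = paginate(fetch, {'subreddit': 'boardgames', 'limit': 100}, n_pages=5)
# df_all = pd.DataFrame(posts)
```

Capping the number of pages keeps the loop from running indefinitely if the API keeps returning the same page.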

# PRAW
---

PRAW requires you to register a username and an "app".  Click "create another app" under Reddit's [preferences](https://www.reddit.com/prefs/apps); the form asks for a "redirect uri".  Then you'll need to provide the app's client ID and "secret" to PRAW, along with a user agent string.

In [14]:
import praw

In [15]:
#reddit = praw.Reddit(
#    client_id="CLIENT_ID",
#    client_secret="CLIENT_SECRET",
#    user_agent="testscript by [MY USER NAME]"  # This field can be anything
#)

## How to use PRAW without giving away your account info on GitHub?

Create a `praw.ini` file and fill in the fields.  **BUT MAKE SURE TO LIST `praw.ini` IN YOUR `.gitignore` FILE SO IT IS NEVER COMMITTED!** Ideally, put something in your `README.md` to show people how to set up their own credentials and replicate your code.
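
As an illustration of what the file might contain, the sketch below holds an example `praw.ini` as a string and checks it for the three fields used in the commented-out cell above. The values are placeholders, and `missing_praw_fields` is a hypothetical helper written for this example, not part of PRAW.

```python
import configparser

# Example praw.ini contents; replace the placeholder values with your own.
EXAMPLE_INI = """\
[DEFAULT]
client_id=CLIENT_ID
client_secret=CLIENT_SECRET
user_agent=testscript by MY_USER_NAME
"""

REQUIRED_FIELDS = ('client_id', 'client_secret', 'user_agent')


def missing_praw_fields(ini_text):
    """Return the required fields that are absent or empty in [DEFAULT]."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser['DEFAULT']
    return [field for field in REQUIRED_FIELDS if not section.get(field)]


print(missing_praw_fields(EXAMPLE_INI))  # prints []
```

A check like this can go in the `README.md` instructions so that people replicating the code find out about a missing field before the first request fails.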

After setting up the `praw.ini` file, we can just run the following cell:

In [16]:
reddit = praw.Reddit()

In [17]:
#Get the 10 newest posts from the composer subreddit
posts = reddit.subreddit('composer').new(limit=10)

In [18]:
type(posts)
#It's a generator

praw.models.listing.generator.ListingGenerator

In [19]:
for post in posts:
    print(post.title)
    
#Make sure to actually capture all the information as you run through this loop,
#since a generator yields each item only once and is empty after the first pass

The Mountain Path - Ys inspired music
Re-engraved my 'Prelude e' (mosg) and uploaded to youtube :)
Looking for feedback on this 2.5 minute piece I wrote for string orchestra
Which of these notations do you think is easier to read?
Spitfire Audio and Sibelius 7
Trombone Sonata, II. Cantilena, inspired by jazz and VGM
2-voice canon. - Looking for feedback
How to learn about late-romantic/early modern composition
Toogle pause-note: Note input in engraving software
Looking for software suggestion for composing.
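
The one-pass behaviour is easy to see with an ordinary Python generator. This sketch uses a made-up `titles` function as a stand-in for the listing, so it runs without Reddit credentials.

```python
def titles():
    """Stand-in for a listing: yields each title exactly once."""
    for title in ['first post', 'second post', 'third post']:
        yield title


gen = titles()
first_pass = list(gen)   # consumes every item
second_pass = list(gen)  # the generator is already exhausted

print(first_pass)   # prints ['first post', 'second post', 'third post']
print(second_pass)  # prints []
```

This is why the results should be stored in a list on the first pass through the loop.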


In [20]:
posts2 = reddit.subreddit('musicproduction').new(limit=10)

In [21]:
items = []
for post in posts2:
    items.append(post)

In [22]:
items

[Submission(id='1052rcw'),
 Submission(id='10515l0'),
 Submission(id='104xzcd'),
 Submission(id='104x0q4'),
 Submission(id='104wnyn'),
 Submission(id='104vsd8'),
 Submission(id='104vgkm'),
 Submission(id='104ve26'),
 Submission(id='104toag'),
 Submission(id='104tj2p')]

In [23]:
items[0].selftext

'Is it because of analog gear / analog distortion? The song is from 1995. Cant get it done with digital. The whole grittiness\n\nhttps://youtu.be/CmNhYcFE1-o'

In [24]:
items[0].title

'how to get this sound? Thanks, because i have no idea.'

In [25]:
posts2 = reddit.subreddit('musicproduction').new(limit=1000)
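
One way to capture the fields of interest in a single pass is to turn each submission into a plain dictionary. The sketch below assumes each post exposes `title`, `selftext`, and `created_utc` attributes, as the submissions above do; `posts_to_rows` is a hypothetical helper, and the fake posts exist only so the example runs without credentials.

```python
from types import SimpleNamespace


def posts_to_rows(posts, subreddit_name):
    """Walk the iterable once and keep only the fields we need."""
    rows = []
    for post in posts:
        rows.append({
            'subreddit': subreddit_name,
            'title': post.title,
            'selftext': post.selftext,
            'created_utc': post.created_utc,
        })
    return rows


# Fake posts standing in for Submission objects (illustration only)
fake_posts = (SimpleNamespace(title=f'title {i}', selftext=f'text {i}',
                              created_utc=1673000000 + i)
              for i in range(3))

rows = posts_to_rows(fake_posts, 'musicproduction')
print(rows[0])
```

With real data, `pd.DataFrame(posts_to_rows(posts2, 'musicproduction'))` would give a dataframe with the same columns as the PushShift one.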