# Scraping Reddit Data in Colab
- To get Reddit API access:
  1. Create a Reddit API application:
     - Go to the 'reddit apps' page (https://www.reddit.com/prefs/apps).
     - Select 'script' as the type of app.
     - Give your app a name and a description.
     - Set the redirect URI to http://localhost:8080; you need this to get your refresh token.
     - For a walkthrough, refer to: https://www.jcchouinard.com/reddit-api/
  2. Copy your client_id and client_secret.
- Example sources:
  - https://colab.research.google.com/github/TannerGilbert/Tutorials/blob/master/Reddit%20Webscraping%20using%20PRAW/Reddit%20API.ipynb
  - https://pythonprogramming.net/parsing-comments-python-reddit-api-wrapper-praw-tutorial/?completed=/introduction-python-reddit-api-wrapper-praw-tutorial/

- Async PRAW is an asynchronous version of the PRAW library. Asynchronous programming lets a program start several tasks and switch between them while each one waits, instead of finishing one task before starting the next. This is particularly useful when working with APIs or web scraping, because multiple requests can be in flight concurrently.
- For Async PRAW, use the 'asyncpraw' library.
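
To make the difference between blocking and concurrent requests concrete, here is a minimal sketch that uses only the standard `asyncio` module. `fake_fetch` is a hypothetical stand-in that sleeps instead of contacting Reddit; it is not part of PRAW or Async PRAW.

```python
import asyncio
import time


async def fake_fetch(name: str, delay: float) -> str:
    """Hypothetical stand-in for a network request (not an Async PRAW call)."""
    await asyncio.sleep(delay)  # yields control while "waiting" for a response
    return f"{name} done"


async def fetch_all(names, delay=0.1):
    # gather() schedules all coroutines at once, so their waits overlap
    return await asyncio.gather(*(fake_fetch(n, delay) for n in names))


start = time.perf_counter()
results = asyncio.run(fetch_all(["DataScience", "MachineLearning", "news"]))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")  # total time is close to one delay, not three
```

In Colab or Jupyter an event loop is already running, so `asyncio.run` raises an error there; in a notebook cell, use `results = await fetch_all([...])` instead.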

In [1]:
# install PRAW (the Python Reddit API Wrapper)
!pip install praw

Collecting praw
  Downloading praw-7.7.1-py3-none-any.whl.metadata (9.8 kB)
Collecting prawcore<3,>=2.1 (from praw)
  Downloading prawcore-2.4.0-py3-none-any.whl.metadata (5.0 kB)
Collecting update-checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Downloading praw-7.7.1-py3-none-any.whl (191 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 191.0/191.0 kB 3.1 MB/s eta 0:00:00
Downloading prawcore-2.4.0-py3-none-any.whl (17 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: update-checker, prawcore, praw
Successfully installed praw-7.7.1 prawcore-2.4.0 update-checker-0.18.0


- praw:
  - synchronous (blocking): each call makes a request to the Reddit API, and the program waits for the response before continuing execution.

In [2]:
import praw

Before PRAW can be used to scrape data, we need to authenticate. For this we create a Reddit instance and provide it with a client_id, a client_secret, and a user_agent. To create a Reddit application and get your ID and secret, navigate to [this page](https://www.reddit.com/prefs/apps).

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
!ls drive/MyDrive/'Colab Notebooks'/Reddit*

'drive/MyDrive/Colab Notebooks/Reddit_client_secrets.json'


In [5]:
json_file = 'drive/MyDrive/Colab Notebooks/Reddit_client_secrets.json'

In [6]:
import json
with open(json_file, "r" ) as fp:
    data = json.load( fp )

client_id = data['client_id']
client_secret = data['client_secret']
user_agent = data['user_agent']

In [7]:
# client_id = "..."
# client_secret = "..."
# user_agent = "..."
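
The loading step above can be wrapped in a small helper that fails early when a key is missing. This is only a sketch: it assumes the secrets file is a flat JSON object with the three keys read above (the file's contents are not shown in this notebook), and `load_credentials` is a hypothetical name.

```python
import json

REQUIRED_KEYS = ("client_id", "client_secret", "user_agent")


def load_credentials(path):
    """Read the secrets file and fail early if a required key is missing."""
    with open(path, "r") as fp:
        data = json.load(fp)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"missing keys in {path}: {missing}")
    # return only the keys needed to create the Reddit instance
    return {key: data[key] for key in REQUIRED_KEYS}
```

With the file from this notebook, `creds = load_credentials(json_file)` would return a dict whose three values can be passed as the keyword arguments used in the next cell.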

In [21]:
# create a reddit instance
reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     user_agent=user_agent)

- Get information or posts from a specific subreddit by calling the reddit.subreddit method with the subreddit name.

In [22]:
# get 10 hot posts from the DataScience subreddit

subreddit = reddit.subreddit('DataScience')
hot_posts = subreddit.hot(limit=10)

for post in hot_posts:
    print(post.title)

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Weekly Entering & Transitioning - Thread 30 Sep, 2024 - 07 Oct, 2024
MS in CS/DS (or Eng), what is a good option? Berkeley, Northwestern, Harvard Ext, GT...?
What do recruiters/HMs want to see on your GitHub?
Open-source library to display PDFs in Dash apps
Is undergrad research valuable?
Amazon Pre-Interview Surveys?? No response
Help With Text Classification Project
I'm looking to start a career in data science. I have some questions...
How does ELL compare to langchain?
Ok, 250k ($) INTERN in Data Science - how is this even possible?!


In [23]:
# get hot posts from all subreddits
hot_posts = reddit.subreddit('all').hot(limit=5)
for post in hot_posts:
    print(post.title)

Bombshell special counsel filing includes new allegations of Trump's 'increasingly desperate' efforts to overturn election
THE DETROIT TIGERS SWEEP ASTROS IN TWO GAMES AND MOVE ON TO FACE THE CLEVLAND GUARDIANS IN THE ALDS
Sober driver arrested for DUI and thrown in jail because officer knew his brother
A South African man proposed to his gf at KFC and a journalist took picture and tried to shame him publicly. The backlash rained downed heavily on journalist  with multiple companies offering gifts to fund the couple’s dream wedding and support their new life together. 
My brother’s 25y/o car was slapped with this while he was auditing a superyacht company


- if you do not want sticky posts:
  - sticky posts: special posts that are "stickied" or "pinned" to the top of a subreddit's page. These posts remain fixed at the top, regardless of the chronological order of other posts, and are easily visible to all subreddit visitors.

In [24]:
subreddit = reddit.subreddit('DataScience')
hot_posts = subreddit.hot(limit=10)
for submission in hot_posts:
    if not submission.stickied:
        print(submission.title)

MS in CS/DS (or Eng), what is a good option? Berkeley, Northwestern, Harvard Ext, GT...?
What do recruiters/HMs want to see on your GitHub?
Open-source library to display PDFs in Dash apps
Is undergrad research valuable?
Amazon Pre-Interview Surveys?? No response
Help With Text Classification Project
I'm looking to start a career in data science. I have some questions...
How does ELL compare to langchain?
Ok, 250k ($) INTERN in Data Science - how is this even possible?!
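
Note that `limit=10` in the cell above is applied before the `stickied` check, which is why only nine titles were printed. One way to get exactly n non-sticky posts is to fetch a few extra and slice lazily. The sketch below assumes only that each item has a `stickied` attribute; `first_non_stickied` is a hypothetical helper, and the demo uses stand-in objects rather than real submissions.

```python
from itertools import islice
from types import SimpleNamespace


def first_non_stickied(posts, n):
    """Return up to n posts whose .stickied flag is False, preserving order."""
    return list(islice((p for p in posts if not p.stickied), n))


# Stand-in objects so the sketch runs without network access
demo_posts = [
    SimpleNamespace(title="Weekly thread", stickied=True),
    SimpleNamespace(title="Post A", stickied=False),
    SimpleNamespace(title="Post B", stickied=False),
    SimpleNamespace(title="Post C", stickied=False),
]
print([p.title for p in first_non_stickied(demo_posts, 2)])  # ['Post A', 'Post B']
```

With PRAW, a call along the lines of `first_non_stickied(subreddit.hot(limit=15), 10)` should behave the same way, since the listing is consumed lazily.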


- We can also gather all sorts of information on each submission:

In [25]:
subreddit = reddit.subreddit('DataScience')
hot_posts = subreddit.hot(limit=10)
for submission in hot_posts:
    if not submission.stickied:
        print('Title: {}, ups: {}, downs: {}, Have we visited?: {}'.format(submission.title,
                                                                           submission.ups,
                                                                           submission.downs,
                                                                           submission.visited))

Title: MS in CS/DS (or Eng), what is a good option? Berkeley, Northwestern, Harvard Ext, GT...?, ups: 28, downs: 0, Have we visited?: False
Title: What do recruiters/HMs want to see on your GitHub?, ups: 150, downs: 0, Have we visited?: False
Title: Open-source library to display PDFs in Dash apps, ups: 25, downs: 0, Have we visited?: False
Title: Is undergrad research valuable?, ups: 48, downs: 0, Have we visited?: False
Title: Amazon Pre-Interview Surveys?? No response, ups: 0, downs: 0, Have we visited?: False
Title: Help With Text Classification Project, ups: 17, downs: 0, Have we visited?: False
Title: I'm looking to start a career in data science. I have some questions..., ups: 0, downs: 0, Have we visited?: False
Title: How does ELL compare to langchain?, ups: 4, downs: 0, Have we visited?: False
Title: Ok, 250k ($) INTERN in Data Science - how is this even possible?!, ups: 275, downs: 0, Have we visited?: False


In [26]:
# get MachineLearning subreddit data
ml_subreddit = reddit.subreddit('MachineLearning')

print(ml_subreddit.description)

**[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
--------
+[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
--------
+[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
--------
+[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
--------
+[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ANews)
--------
***[@slashML on Twitter](https://twitter.com/slashML)***
--------
***[Chat with us on Slack](https://join.slack.com/t/rml-talk/shared_invite/enQtNjkyMzI3NjA2NTY2LWY0ZmRjZjNhYjI5NzYwM2Y0YzZhZWNiODQ3ZGFjYmI2NTU3YjE1ZDU5MzM2ZTQ4ZGJmOTFmNWVkMzFiMzVhYjg)***
--------
**Beginners:**
--------
Please have a look at [our FAQ and Link-Collection](http://www.reddit.com/r/MachineLearning/wiki/index)

[Metacademy](http://www.metacademy.org) is a great resource which compiles le

- To save the scraped data, collect it into a variable (here a pandas DataFrame) and then write it to a file:

In [27]:
import pandas as pd

posts = []
ml_subreddit = reddit.subreddit('MachineLearning')
for post in ml_subreddit.hot(limit=10):
    posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
posts = pd.DataFrame(posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])
posts

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created
0,[D] Self-Promotion Thread,7,1fru46i,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,11,"Please post your personal projects, startups, ...",1727576000.0
1,[D] Monthly Who's Hiring and Who wants to be H...,24,1ftdkmb,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,9,**For Job Postings** please use this template\...,1727750000.0
2,[P] Just-in-Time Implementation: A Python Libr...,139,1fujbuz,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,27,Hey r/MachineLearning !\n\nYou know how we hav...,1727884000.0
3,[Discussion] What resource do you use to keep ...,101,1fu7gls,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,39,"In my day job, I work on recommender and searc...",1727841000.0
4,[D] How Safe Are Your LLM Chatbots?,6,1fufrd1,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,14,"Hi folks, I’ve been tackling security concerns...",1727874000.0
5,[D] Experiment with NotebookLM + Daily Medical...,3,1fugsec,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,0,We're working on making our Medical AI/LLM upd...,1727877000.0
6,[D] Why is Tree of Thought an impactful work?,80,1ftx04x,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,31,My advisor recently asked me to read the tot p...,1727812000.0
7,[R] Dealing with paper reproductions,32,1fu1n9y,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,8,"Hello, I’m currently a 1st year PhD student in...",1727823000.0
8,[R] Where to find inspiration for a new resear...,1,1fuj25i,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,7,"So, I want to propose my own topic for a thesi...",1727883000.0
9,[R] latest and greatest image to 3D mesh model,3,1fu9ed0,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,1,What’s out there at the minute?\nAre there any...,1727849000.0
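
The `created` column above holds Unix epoch seconds, which are hard to read. A minimal sketch of converting them while building each row follows; `post_to_row` is a hypothetical helper that reads some of the same attributes as the cell above, and the demo object is a stand-in filled with the values displayed in the first table row (the displayed timestamp may be rounded).

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def post_to_row(post):
    """Build one table row (a dict) from a submission-like object."""
    return {
        "title": post.title,
        "score": post.score,
        "id": post.id,
        "num_comments": post.num_comments,
        # epoch seconds -> timezone-aware UTC datetime
        "created": datetime.fromtimestamp(post.created, tz=timezone.utc),
    }


# Stand-in for a submission, using values shown in the table above
demo = SimpleNamespace(title="[D] Self-Promotion Thread", score=7, id="1fru46i",
                       num_comments=11, created=1727576000.0)
row = post_to_row(demo)
print(row["created"].isoformat())  # 2024-09-29T02:13:20+00:00
```

A list of such dicts can be passed straight to `pd.DataFrame`, for example `pd.DataFrame([post_to_row(p) for p in ml_subreddit.hot(limit=10)])`. PRAW submissions also expose a `created_utc` attribute, which is probably the safer field to use for UTC conversions.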


In [28]:
posts.to_csv('top_ml_subreddit_posts.csv')

- streaming data: real-time data is delivered as a continuous stream of events. This lets developers receive live updates about activity on Reddit, such as new posts and comments.
- "parent" refers to the post or comment that a given comment directly replies to.

In [29]:
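subreddit = reddit.subreddit('news')

# print 5 streamed comments together with the comment each one replies to
i = 0
for comment in subreddit.stream.comments():
    try:
        print(30*'_')
        # parent() returns the parent comment, or the submission for a top-level comment
        parent_id = str(comment.parent())
        parent_comment = reddit.comment(parent_id)
        print('Parent:')
        print(parent_comment.body)
        print('Reply')
        print(comment.body)
        i = i + 1
        if i == 5:
            break
    except praw.exceptions.PRAWException:
        # skip comments whose parent could not be loaded as a comment
        pass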

______________________________
Parent:
______________________________
Parent:
How do toddlers deal with them? That’s our only aversion to getting them put in. Toddler terror
Reply
The one we have is dial activated. As long as you don’t turn the dial on, it just operates as a regular toilet.
______________________________
Parent:


______________________________
Parent:
>It's been two years since [Hurricane Ian](https://www.cbsnews.com/miami/news/florida-insurance-claims-ian/) hit Southwest Florida and an estimated 50 thousand homeowners are still locked in battles with their insurance companies. 

>That split roof is an open wound for the Rapkins, who still have to mow the lawn and make mortgage payments on their rotting home every month. They're also paying rent on an apartment nearby and $4,000 a year to Heritage for home insurance.

Insurance companies are forcing consumers to take them to court over legitimate claims knowing most victims won't follow through.

This is an elaborate scheme to cut losses and boost profits.
Reply
My dad has been battling his insurance company for over 7 years now because of a burst pipe that flooded and ruined the house while he was away. Every time a plumber and such who had come out to inspect and agrees that the house wasn’t built to code and that the insurance company needs 

I guess people are too young now to get the reference
Reply
They should use a luigi board
______________________________
Parent:
FYI, they make travel bidets. Everything from an adapter for a water bottle to a battery powered, rechargeable wand where the water tank becomes the holder. I have the latter and it's fantastic.
Reply
Good to know! Ty!
______________________________
Parent:
If a Amazon delivery vehicles hits you but you are a  prime member, you are out of luck.
Reply
amazon drivers rolling the dice running over a bunch of cars.
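
Several `Parent:` lines in the output above are followed directly by the next separator. A likely reason is that the parent of a top-level comment is the submission itself, not a comment. Reddit fullnames carry a type prefix (`t1_` for comments, `t3_` for submissions), and a PRAW comment's `parent_id` attribute holds such a fullname, so the two cases can be told apart before fetching anything. `parent_kind` below is a hypothetical helper and the IDs in the demo are made up.

```python
def parent_kind(parent_fullname: str) -> str:
    """Classify a Reddit fullname by its type prefix."""
    prefix, _, _ = parent_fullname.partition("_")
    kinds = {"t1": "comment", "t3": "submission"}
    return kinds.get(prefix, "unknown")


print(parent_kind("t1_abc123"))  # comment
print(parent_kind("t3_abc123"))  # submission
```

Inside the streaming loop, a check such as `parent_kind(comment.parent_id) == "comment"` could decide whether to print the parent's `body` or, for a submission, its `title`.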


- one more example

In [30]:
posts = []
ds_subreddit = reddit.subreddit('ChatGPT')
for post in ds_subreddit.hot(limit=5):
    posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
posts = pd.DataFrame(posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])
posts

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created
0,"Weekly Self-Promotional Mega Thread 45, 30.09....",7,1fsue0e,ChatGPT,https://www.reddit.com/r/ChatGPT/comments/1fsu...,18,All the self-promotional posts about your AI p...,1727700000.0
1,Einstein's Relatives,1075,1fuhgw1,ChatGPT,https://v.redd.it/hu7oziexqcsd1,82,,1727879000.0
2,"“Saying please and thank you to ChatGPT, proba...",262,1fupiei,ChatGPT,https://i.redd.it/o657hi6mgdsd1.jpeg,106,,1727899000.0
3,"Smart enough to understand quantum physics, bu...",1345,1fubq89,ChatGPT,https://v.redd.it/324wfgyh5bsd1,158,,1727860000.0
4,Nvidia has just announced an open-source GPT-4...,2156,1fu75bn,ChatGPT,https://i.redd.it/e2usk5m6k9sd1.jpeg,249,It'll be as powerful. They also promised to re...,1727840000.0


- There are many more features! Please refer to the PRAW documentation.

-----------