# Scraping Reddit Data in Colab
- to get reddit API access
  1. create reddit api application
     - go to 'reddit apps' page (https://www.reddit.com/prefs/apps)
     - select 'script' as the type of app
     - name your app with description
     - setup redirect uri to be (http://localhost:8080): you need this to get your refresh token
     - refer to: https://www.jcchouinard.com/reddit-api/
  2. copy your client_id and client_secrets
- example source: https://colab.research.google.com/github/TannerGilbert/Tutorials/blob/master/Reddit%20Webscraping%20using%20PRAW/Reddit%20API.ipynb
- https://pythonprogramming.net/parsing-comments-python-reddit-api-wrapper-praw-tutorial/?completed=/introduction-python-reddit-api-wrapper-praw-tutorial/

- Async PRAW, also known as Async PRAW (AIO), is an asynchronous version of the PRAW library. Asynchronous programming allows you to write code that can perform multiple tasks concurrently without waiting for each task to complete before moving on to the next one. This can be particularly useful when working with APIs or performing web scraping, as it allows you to make multiple requests in parallel and maximize efficiency.
- for async praw, use 'asyncpraw' library

In [1]:
# !pip install praw
!pip install praw

Collecting praw
  Downloading praw-7.7.1-py3-none-any.whl (191 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m191.0/191.0 kB[0m [31m1.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting prawcore<3,>=2.1 (from praw)
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Collecting update-checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: update-checker, prawcore, praw
Successfully installed praw-7.7.1 prawcore-2.3.0 update-checker-0.18.0


In [2]:
import praw

Before it can be used to scrape data we need to authenticate ourselves. For this we need to create a Reddit instance and provide it with a client_id , client_secret and a user_agent . To create a Reddit application and get your id and secret you need to navigate to [this page](https://www.reddit.com/prefs/apps).

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
!ls drive/MyDrive/'Colab Notebooks'/Reddit*

'drive/MyDrive/Colab Notebooks/Reddit_client_secrets.json'


In [5]:
json_file = 'drive/MyDrive/Colab Notebooks/Reddit_client_secrets.json'

In [6]:
import json
with open(json_file, "r" ) as fp:
    data = json.load( fp )

client_id = data['client_id']
client_secret = data['client_secret']
user_agent = data['user_agent']

In [7]:
# client_id = "..."
# client_secret = "..."
# user_agent = "..."

In [8]:
# create a reddit instance
reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     user_agent=user_agent)

- get information or posts from a specifc subreddit using the reddit.subreddit method and passing it a subreddit name.

In [9]:
# get 10 hot posts from the DataScience subreddit
subreddit = reddit.subreddit('DataScience')
hot_posts = subreddit.hot(limit=10)

for post in hot_posts:
    print(post.title)

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Weekly Entering & Transitioning - Thread 17 Jul, 2023 - 24 Jul, 2023
What are people’s thoughts on SAS?
I love Rstudio IDE! Do you know a similar IDE for Python?
Is it normal to both lack documentation & have a hard time completing tasks without colleagues assistance as a new hire? Not sure if I'm simply not cut out for data analytics.
Need 3rd perspective
What’s the most difficult data science problem you’ve ever conquered?
Master in DS. Worth Quitting a Real Job to Get a Data Scientist Internship?
Rant about feeling extremely depressed about new job and advice needed!
NYC Entry Level Data Analyst Salary Expectations
If i want to create my own LLM, what should I do? Where should I start?


In [10]:
# get hot posts from all subreddits
hot_posts = reddit.subreddit('all').hot(limit=5)
for post in hot_posts:
    print(post.title)

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Tell us you haven’t got laid, without telling us
Go to prison for felony and graduate to crazy cat lady
A bittersweet reaction
Guy after a surgery.
meirl


- if you do not want sticky posts:
  - sticky posts: special posts that are "stickied" or "pinned" to the top of a subreddit's page. These posts remain fixed at the top, regardless of the chronological order of other posts, and are easily visible to all subreddit visitors.

In [11]:
subreddit = reddit.subreddit('DataScience')
hot_posts = subreddit.hot(limit=10)
for submission in hot_posts:
    if not submission.stickied:
        print(submission.title)

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



What are people’s thoughts on SAS?
I love Rstudio IDE! Do you know a similar IDE for Python?
Is it normal to both lack documentation & have a hard time completing tasks without colleagues assistance as a new hire? Not sure if I'm simply not cut out for data analytics.
Need 3rd perspective
What’s the most difficult data science problem you’ve ever conquered?
Master in DS. Worth Quitting a Real Job to Get a Data Scientist Internship?
Rant about feeling extremely depressed about new job and advice needed!
NYC Entry Level Data Analyst Salary Expectations
If i want to create my own LLM, what should I do? Where should I start?


- We can also gather all sorts of information on this submission:

In [12]:
subreddit = reddit.subreddit('DataScience')
hot_posts = subreddit.hot(limit=10)
for submission in hot_posts:
    if not submission.stickied:
        print('Title: {}, ups: {}, downs: {}, Have we visited?: {}'.format(submission.title,
                                                                           submission.ups,
                                                                           submission.downs,
                                                                           submission.visited))

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Title: What are people’s thoughts on SAS?, ups: 78, downs: 0, Have we visited?: False
Title: I love Rstudio IDE! Do you know a similar IDE for Python?, ups: 40, downs: 0, Have we visited?: False
Title: Is it normal to both lack documentation & have a hard time completing tasks without colleagues assistance as a new hire? Not sure if I'm simply not cut out for data analytics., ups: 66, downs: 0, Have we visited?: False
Title: Need 3rd perspective, ups: 9, downs: 0, Have we visited?: False
Title: What’s the most difficult data science problem you’ve ever conquered?, ups: 12, downs: 0, Have we visited?: False
Title: Master in DS. Worth Quitting a Real Job to Get a Data Scientist Internship?, ups: 10, downs: 0, Have we visited?: False
Title: Rant about feeling extremely depressed about new job and advice needed!, ups: 9, downs: 0, Have we visited?: False
Title: NYC Entry Level Data Analyst Salary Expectations, ups: 6, downs: 0, Have we visited?: False
Title: If i want to create my own LLM,

- to save the scraped data in some kind of variable or file.

In [14]:
# get MachineLearning subreddit data
ml_subreddit = reddit.subreddit('MachineLearning')

print(ml_subreddit.description)

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



**[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
--------
+[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
--------
+[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
--------
+[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
--------
+[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ANews)
--------
***[@slashML on Twitter](https://twitter.com/slashML)***
--------
***[Chat with us on Slack](https://join.slack.com/t/rml-talk/shared_invite/enQtNjkyMzI3NjA2NTY2LWY0ZmRjZjNhYjI5NzYwM2Y0YzZhZWNiODQ3ZGFjYmI2NTU3YjE1ZDU5MzM2ZTQ4ZGJmOTFmNWVkMzFiMzVhYjg)***
--------
**Beginners:**
--------
Please have a look at [our FAQ and Link-Collection](http://www.reddit.com/r/MachineLearning/wiki/index)

[Metacademy](http://www.metacademy.org) is a great resource which compiles le

- to save the scraped data in some kind of variable or file

In [15]:
import pandas as pd

posts = []
ml_subreddit = reddit.subreddit('MachineLearning')
for post in ml_subreddit.hot(limit=10):
    posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
posts = pd.DataFrame(posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])
posts

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created
0,[D] Simple Questions Thread,10,1518fj5,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,20,Please post your questions here instead of cre...,1689520000.0
1,Additional Resources,217,14ionyi,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,31,"Hi everyone,\n\nAfter an [extended blackout](h...",1687706000.0
2,[P] Running Llama 2 locally in <10 min,48,1547wcv,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,5,I wanted to play with Llama 2 right after its ...,1689803000.0
3,[N] Upstage AI's 30M Llama 1 Outshines 70B Lla...,48,153yfry,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,34,# Title Fix: Upstage AI's 30B Llama 1 Outshine...,1689781000.0
4,[Project] Running Llama2 Locally on Apple Sili...,96,153sl0y,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,21,* Project page: [https://github.com/mlc-ai/mlc...,1689767000.0
5,[R] Converting neural networks into equivalent...,8,15478c9,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,11,According to the paper Neural Networks are Dec...,1689801000.0
6,[Project] Unofficial implementation of Retenti...,27,153vzu6,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,11,"So very recently, a new paper was published t...",1689775000.0
7,[P] TruLens-Eval is an open source project for...,12,1542fbt,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,0,Hey [r/MachineLearning](https://www.reddit.com...,1689790000.0
8,[N] Ensuring Reliable Few-Shot Prompt Selectio...,11,153z255,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,0,Hello Redditors!\n\nIt's pretty well known tha...,1689782000.0
9,[P] MiniGPT4.cpp: (4bit/5bit/16float) MiniGPT4...,1,154hgek,MachineLearning,https://www.reddit.com/r/MachineLearning/comme...,1,[https://github.com/Maknee/minigpt4.cpp](https...,1689829000.0


In [16]:
posts.to_csv('top_ml_subreddit_posts.csv')

- streaming data: real-time data are delivered as a continuous stream of events. (This allows developers and users to receive live updates about various activities happening on Reddit, such as new posts, comments, votes, and more.)
- "parent" typically refers to the original post or comment to which another comment or reply is directly responding.

In [23]:
subreddit = reddit.subreddit('news')

i = 0
for comment in subreddit.stream.comments():
    try:
        print(30*'_')
        print()
        parent_id = str(comment.parent())
        submission = reddit.comment(parent_id)
        print('Parent:')
        print(submission.body)
        print('Reply')
        print(comment.body)
        i = i + 1
        if (i == 5): break
    except praw.exceptions.PRAWException as e:
        pass

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



______________________________

Parent:
What is he supposed to be wearing? 

No honest question, wth should a 40 year old wear because I haven't a clue. There's clothes for old people and young people but the middle is just weird. As a woman at least.
Reply
I just wear a mesh shirt and chaps.  Cant go wrong with that.
______________________________

Parent:
With a bit of tweaking it could work for the cold open of a Supernatural episode
Reply
But then the kid would have to end up being a Fey or something like that.
______________________________

Parent:


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Wait
.. who doesn't poop daily..
Wth
Reply
Uhm… a LOT of people…??? Women also have higher rates of constipation. So do people on Keto diet, or other restrictive diets. Or low fiber intake. Hormonal changes. Pregnancy. Thyroid issues. Certain medications. Eating disorders. Depression. Low activity or exercise. Poor sleep/inadequate rest. Bowel diseases/conditions. Travel constipation. Dehydration. And SO much more can affect your bowel movements.

Do this many people really lack critical thinking skills or the ability to conceptualize others’ bodies function differently from their own…???
______________________________

Parent:
______________________________

Parent:
Millions of  people on the outside don't have AC   ... why would anyone worry about the prisoners ?!
Reply
I hope you get a second brain cell so you can worry about more than one thing at a time.
______________________________

Parent:
You ever take a shit sorry large and think you could probably handle anal sex without to

- one more example

In [24]:
posts = []
ds_subreddit = reddit.subreddit('ChatGPT')
for post in ds_subreddit.hot(limit=5):
    posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
posts = pd.DataFrame(posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])
posts

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created
0,"Journey ""Create a Presentation"" contest - $650...",91,14si211,ChatGPT,https://www.reddit.com/r/ChatGPT/comments/14si...,107,Hello!\n\nWe are team Journey. Journey is a pr...,1688669000.0
1,General discussion thread,382,137vqso,ChatGPT,https://www.reddit.com/r/ChatGPT/comments/137v...,2337,To discuss anything and everything related to ...,1683225000.0
2,Girl gave me her number and it ended up being ...,705,154fck9,ChatGPT,https://i.redd.it/umwjuoqub1db1.jpg,131,,1689823000.0
3,Falsely accused of using CHATGPT by professor.,2484,1543ahv,ChatGPT,https://www.reddit.com/r/ChatGPT/comments/1543...,717,“I believe large portions of your essay draft ...,1689792000.0
4,"Apple has developed ""Apple GPT"" as it prepares...",1191,1542i5i,ChatGPT,https://www.reddit.com/r/ChatGPT/comments/1542...,282,Apple has been relatively quiet on the generat...,1689790000.0


- There are many more ! Please refer to the manual.

-----------