# **1. Install PRAW**

In [1]:
#!pip install praw
import praw


# **2. Create a Reddit App**

To access the Reddit API, you'll need to create an application on Reddit and obtain your API credentials. Follow these steps:

- Go to the Reddit website (https://www.reddit.com/) and log in to your account. Feel free to create a throwaway account for this project!
- Navigate to the Reddit Apps page (https://www.reddit.com/prefs/apps).
- Click the "are you a developer? create an app..." button in the top left.
- Provide a name for your app (e.g., "PRAW"), select the app type ('script'), and optionally add a description. Use http://localhost:8080 as your redirect URI.
- After submitting the form, you will reach a page that looks like the following image. You'll see your application's details, including the client ID and client secret. Keep these credentials handy for the next step.


![Praw](https://www.honchosearch.com/hubfs/Imported_Blog_Media/Client-ID-Client-Secret.png)

# **3. Initialize PRAW**

In [2]:
reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    user_agent='YOUR_USER_AGENT',
    username='YOUR_REDDIT_USERNAME',
    password='YOUR_REDDIT_PASSWORD'
)

Replace 'YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET', 'YOUR_USER_AGENT', 'YOUR_REDDIT_USERNAME', and 'YOUR_REDDIT_PASSWORD' with your actual Reddit API credentials.

Your user agent is a string that Reddit uses to identify the source of requests. You can set it to whatever you want, but choose something descriptive and unique; including your Reddit username is recommended.

**I have removed my own credentials from this workbook. We can show you how to hide your credentials before submitting the project! The following code will need your own credentials in order to successfully work.**
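One common way to keep credentials out of a notebook is to read them from environment variables. The sketch below assumes you have exported five variables in your shell; the variable names and the helper `load_reddit_credentials` are assumptions made for this example, not something PRAW requires.

```python
import os

# Maps praw.Reddit keyword arguments to environment variable names.
# The variable names are assumptions for this sketch; pick any names you like.
REQUIRED_VARS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
}


def load_reddit_credentials(env=None):
    """Collect the keyword arguments for praw.Reddit from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {kwarg: env[var] for kwarg, var in REQUIRED_VARS.items()}


# Usage (requires praw and the variables above to be set):
# reddit = praw.Reddit(**load_reddit_credentials())
```

With this approach the notebook itself never contains the secret values, so it can be shared or committed without editing.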

# **4. Take a look at the documentation [here](https://praw.readthedocs.io/)!**

In [3]:
# Below is JUST an example of how you can use PRAW

# Choose your subreddit
subreddit = reddit.subreddit('marvel')

# Adjust the limit as needed -- Note that this requests up to 1000 of the most recent posts
posts = subreddit.new(limit=1000)

In [4]:
# Below is JUST an example of how you can use PRAW

# Choose your subreddit
subreddit_1 = reddit.subreddit('DC_Cinematic')

# Adjust the limit as needed -- Note that this requests up to 1000 of the most recent posts
posts_1 = subreddit_1.new(limit=1000)

## NOTE
- Reddit API Limitations: The Reddit API imposes limitations on the number of posts you can retrieve in a single request. The maximum number of posts per request is typically 100. Therefore, if you set the limit parameter to a value greater than 100, PRAW will make multiple requests behind the scenes to fetch the desired number of posts.
- Rate Limiting: The Reddit API also enforces rate limits to prevent abuse and ensure fair usage. The specific rate limits can vary depending on factors such as your Reddit account's age and karma. As a standard user, you're typically allowed to make 60 requests per minute. If you exceed the rate limit, you may receive an error response until the rate limit resets.
- TIP: You can use the created_utc attribute of a post to keep track of the timestamp and ensure non-overlapping pulls. The created_utc attribute represents the post's creation time in UTC.
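To make the numbers in the note above concrete, here is a small arithmetic sketch. It takes the figures quoted in the note (100 posts per request, 60 requests per minute) as assumptions; the function name `requests_needed` is invented for this example.

```python
import math

POSTS_PER_REQUEST = 100   # per-request cap quoted in the note above (assumed)
REQUESTS_PER_MINUTE = 60  # typical allowance quoted in the note above (assumed)


def requests_needed(limit, page_size=POSTS_PER_REQUEST):
    """Number of API requests required to fetch `limit` posts."""
    return math.ceil(limit / page_size)


print(requests_needed(25))    # 1
print(requests_needed(1000))  # 10

# A single pull of 1000 posts stays well under the per-minute allowance.
print(requests_needed(1000) <= REQUESTS_PER_MINUTE)  # True
```

Under these assumptions, pulling 1000 posts from each of two subreddits costs about 20 requests in total.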

In [5]:
import pandas as pd

data = []
for post in posts:
    data.append([post.created_utc, post.title, post.selftext, post.subreddit])

# Turn into a dataframe
marvel = pd.DataFrame(data, columns = ['created_utc', 'title', 'self_text', 'subreddit'])
marvel

|  | created_utc | title | self_text | subreddit |
|---|---|---|---|---|
| 0 | 1.697500e+09 | Ultimate Invasion suffered by having Bryan Hitch | I'd like to share my thoughts on Ultimate Inva... | Marvel |
| 1 | 1.697497e+09 | [Fan Art] Silver Surfer sketch |  | Marvel |
| 2 | 1.697495e+09 | Finally watched Doctor Strange in the Multiver... | I had fairly modest expectations because it se... | Marvel |
| 3 | 1.697495e+09 | Hickman Fantastic Four after Secret Wars | Hi folks, \n\nI've read whole Hickman run on A... | Marvel |
| 4 | 1.697493e+09 | If they reboot Blade | Michael Jai White as Blade?\n\nEdit:\n\nOr oth... | Marvel |
| ... | ... | ... | ... | ... |
| 983 | 1.695701e+09 | I don't know about you guys, but for some reas... |  | Marvel |
| 984 | 1.695700e+09 | Marvel vs. Capcom Origins turns 11 years old t... |  | Marvel |
| 985 | 1.695700e+09 | Here's a selfie from our fellow Marvel edgy bo... | Omg he just like me fr | Marvel |
| 986 | 1.695697e+09 | What are you top 10 Marvel video games | Mines:\n10) Ghost Rider (2007)\n9) The Punishe... | Marvel |


In [6]:

data = []
for post in posts_1:
    data.append([post.created_utc, post.title, post.selftext, post.subreddit])

# Turn into a dataframe
dc_cinematic = pd.DataFrame(data, columns = ['created_utc', 'title', 'self_text', 'subreddit'])
dc_cinematic

|  | created_utc | title | self_text | subreddit |
|---|---|---|---|---|
| 0 | 1.697491e+09 | Who should be the main Flash of the DCU? | The new DCU will most likely have the Flash in... | DC_Cinematic |
| 1 | 1.697477e+09 | My cosplay Catwoman from Batman Returns |  | DC_Cinematic |
| 2 | 1.697464e+09 | Which DC movie will cross 400 million first? | Which movie in the DCU coming up will be the f... | DC_Cinematic |
| 3 | 1.697462e+09 | Where would you prefer the DCU to establish as... | I personally prefer some type of off planet ba... | DC_Cinematic |
| 4 | 1.697437e+09 | "Batman has comic fans uneasy" - an 80s newspa... |  | DC_Cinematic |
| ... | ... | ... | ... | ... |
| 970 | 1.691987e+09 | Seeing Blue Beetle? |  | DC_Cinematic |
| 971 | 1.691988e+09 | Scarab protecting Jaime. Blue Beetle Movie clip |  | DC_Cinematic |
| 972 | 1.691980e+09 | I could make an excellent TV series about Mart... | The show would be called Manhunter (Real origi... | DC_Cinematic |
| 973 | 1.691975e+09 | Blue Beetle 3D Model Fan Art | Hello everyone, hope you all like this model. ... | DC_Cinematic |


In [7]:
marvel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 988 entries, 0 to 987
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   created_utc  988 non-null    float64
 1   title        988 non-null    object 
 2   self_text    988 non-null    object 
 3   subreddit    988 non-null    object 
dtypes: float64(1), object(3)
memory usage: 31.0+ KB


In [8]:
dc_cinematic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 975 entries, 0 to 974
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   created_utc  975 non-null    float64
 1   title        975 non-null    object 
 2   self_text    975 non-null    object 
 3   subreddit    975 non-null    object 
dtypes: float64(1), object(3)
memory usage: 30.6+ KB


In [9]:
marvel.head()

|  | created_utc | title | self_text | subreddit |
|---|---|---|---|---|
| 0 | 1697500000.0 | Ultimate Invasion suffered by having Bryan Hitch | I'd like to share my thoughts on Ultimate Inva... | Marvel |
| 1 | 1697497000.0 | [Fan Art] Silver Surfer sketch |  | Marvel |
| 2 | 1697495000.0 | Finally watched Doctor Strange in the Multiver... | I had fairly modest expectations because it se... | Marvel |
| 3 | 1697495000.0 | Hickman Fantastic Four after Secret Wars | Hi folks, \n\nI've read whole Hickman run on A... | Marvel |
| 4 | 1697493000.0 | If they reboot Blade | Michael Jai White as Blade?\n\nEdit:\n\nOr oth... | Marvel |
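The `created_utc` values are Unix timestamps (seconds since 1970-01-01 UTC), which are hard to read at a glance. A minimal sketch of converting one to a timezone-aware datetime with the standard library follows. The helper name `utc_datetime` is an assumption, and the input is the rounded value displayed for the first row, so it only approximates that post's actual creation time.

```python
from datetime import datetime, timezone


def utc_datetime(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


# Rounded value as displayed for row 0 of the Marvel table.
print(utc_datetime(1697500000.0).isoformat())  # 2023-10-16T23:46:40+00:00
```

On a whole DataFrame column, `pd.to_datetime(marvel['created_utc'], unit='s', utc=True)` should give the equivalent result.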


In [10]:
marvel.isna().sum()

created_utc    0
title          0
self_text      0
subreddit      0
dtype: int64

In [11]:
dc_cinematic.head()

|  | created_utc | title | self_text | subreddit |
|---|---|---|---|---|
| 0 | 1697491000.0 | Who should be the main Flash of the DCU? | The new DCU will most likely have the Flash in... | DC_Cinematic |
| 1 | 1697477000.0 | My cosplay Catwoman from Batman Returns |  | DC_Cinematic |
| 2 | 1697464000.0 | Which DC movie will cross 400 million first? | Which movie in the DCU coming up will be the f... | DC_Cinematic |
| 3 | 1697462000.0 | Where would you prefer the DCU to establish as... | I personally prefer some type of off planet ba... | DC_Cinematic |
| 4 | 1697437000.0 | "Batman has comic fans uneasy" - an 80s newspa... |  | DC_Cinematic |


In [12]:
dc_cinematic.isna().sum()

created_utc    0
title          0
self_text      0
subreddit      0
dtype: int64
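Note that `isna()` reports zero missing values even though several `self_text` cells in the tables above are blank. The likely reason is that posts without body text carry an empty string rather than a missing value. Below is a small sketch of counting such blanks. The helper `count_blank` is hypothetical, and the sample rows are shortened from the output above.

```python
def count_blank(rows, column):
    """Count rows whose value in `column` is None, empty, or only whitespace."""
    return sum(1 for row in rows if not (row.get(column) or "").strip())


# Sample rows adapted from the Marvel output above (text shortened).
sample_rows = [
    {"title": "[Fan Art] Silver Surfer sketch", "self_text": ""},
    {"title": "If they reboot Blade", "self_text": "Michael Jai White as Blade?"},
]
print(count_blank(sample_rows, "self_text"))  # 1
```

On the DataFrames themselves, something like `(marvel['self_text'].str.strip() == '').sum()` should give the same count. Keep in mind that pandas, by default, reads empty CSV fields back as `NaN`, so these blanks may show up as missing values after a round trip through `to_csv` and `read_csv`.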

In [13]:
# Save to CSV (to_csv returns None when given a path, so do not reassign the DataFrame)
marvel.to_csv('marvel_post.csv', index=False)

In [14]:
dc_cinematic.to_csv('dc_cinematic_post.csv', index=False)

In [15]:
# Next steps:
# - get a DataFrame for both subreddits
# - concatenate both DataFrames
# - EDA (lemmatization, CountVectorizer)
 

Remember, you will need to pull *at least* 1000 posts from each subreddit, not just 25. As mentioned above, you can use the created_utc attribute of a post to keep track of the timestamp and ensure non-overlapping pulls. We will leave this work for you all to complete.
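One possible way to keep repeated pulls from overlapping is to keep only rows newer than the newest `created_utc` already stored. The sketch below works on plain lists of dictionaries. The function name `keep_new_posts` and the sample rows are assumptions for illustration, not part of PRAW or of this project's required approach.

```python
def keep_new_posts(existing_rows, pulled_rows):
    """Return the pulled rows that are strictly newer than anything already stored."""
    if not existing_rows:
        return list(pulled_rows)
    newest_seen = max(row["created_utc"] for row in existing_rows)
    return [row for row in pulled_rows if row["created_utc"] > newest_seen]


# Made-up rows: the second pull repeats one post and adds one newer post.
first_pull = [
    {"created_utc": 100.0, "title": "older post"},
    {"created_utc": 200.0, "title": "newer post"},
]
second_pull = [
    {"created_utc": 200.0, "title": "newer post"},
    {"created_utc": 300.0, "title": "newest post"},
]
fresh = keep_new_posts(first_pull, second_pull)
print([row["title"] for row in fresh])  # ['newest post']
```

The strict comparison would skip a different post created in exactly the same second as the newest stored one. If that matters, de-duplicating on a unique post identifier is a safer choice.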

Once you have at least 1000 posts from each subreddit, you can do some EDA (perhaps the most common words in each subreddit?). Eventually, you will want to combine your two dataframes to do modeling.
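As a starting point for the word-frequency idea, here is a minimal sketch using only the standard library. The helper `top_words`, its simple regular-expression tokenizer, and the sample titles are assumptions for illustration. A real analysis would likely also remove stop words and may use a vectorizer instead.

```python
import re
from collections import Counter


def top_words(texts, n=3):
    """Return the n most frequent lowercase words across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Very simple tokenizer: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)


# Sample titles; the second one is made up for the example.
sample_titles = ["If they reboot Blade", "Blade fan art"]
print(top_words(sample_titles, n=1))  # [('blade', 2)]
```

With the DataFrames from earlier cells, you could pass a column directly, for example `top_words(marvel['title'])`, assuming `marvel` still holds the DataFrame.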

### Hopefully this is enough of a tutorial to help get you started! If you have any questions, let us know!

### Note: Rather than working in this template notebook, make a brand new "scraping" notebook (or script), with your own comments, so you can use this project in a portfolio!