# **1. Install PRAW**

In [1]:
#!pip install praw
import praw


# **2. Create a Reddit App**

To access the Reddit API, you'll need to create an application on Reddit and obtain your API credentials. Follow these steps:

- Go to the Reddit website (https://www.reddit.com/) and log in to your account. Feel free to create a throwaway account for this project!
- Navigate to the Reddit Apps page (https://www.reddit.com/prefs/apps).
- Click the "are you a developer? create an app..." button in the top left.
- Provide a name for your app (e.g., "PRAW"), select the app type ('script'), and optionally add a description. Use http://localhost:8080 as your redirect URI.
- After submitting the form, you will reach a page that looks like the following image. You'll see your application's details, including the client ID and client secret. Keep these credentials handy for the next step.


![Praw](https://www.honchosearch.com/hubfs/Imported_Blog_Media/Client-ID-Client-Secret.png)

# **3. Initialize PRAW**

In [2]:
reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    user_agent='YOUR_USER_AGENT',
    username='YOUR_REDDIT_USERNAME',
    password='YOUR_REDDIT_PASSWORD'
)

Replace 'YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET', 'YOUR_USER_AGENT', 'YOUR_REDDIT_USERNAME', and 'YOUR_REDDIT_PASSWORD' with your actual Reddit API credentials.

Your user agent is a string Reddit uses to identify the source of requests. You can set it to whatever you want, but choose something descriptive and unique; including your username is recommended.

**I have removed my own credentials from this workbook. We can show you how to hide your credentials before submitting the project! The following code will need your own credentials in order to successfully work.**
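
One common way to keep credentials out of a notebook is to read them from environment variables. The sketch below is only an illustration: the variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical, not part of PRAW, and the demo uses a plain dictionary in place of the real environment.

```python
import os

# Hypothetical environment variable names (an assumption, not a PRAW convention),
# mapped to the keyword arguments that praw.Reddit() is given in the cell above.
REQUIRED_VARS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
}


def load_reddit_credentials(env=None):
    """Return a dict of keyword arguments read from the environment.

    Raises KeyError that lists every missing variable, so nothing is
    silently passed along as an empty credential.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {kwarg: env[name] for kwarg, name in REQUIRED_VARS.items()}


# Demo with a stand-in environment so the sketch runs anywhere.
example_env = {
    "REDDIT_CLIENT_ID": "YOUR_CLIENT_ID",
    "REDDIT_CLIENT_SECRET": "YOUR_CLIENT_SECRET",
    "REDDIT_USER_AGENT": "YOUR_USER_AGENT",
    "REDDIT_USERNAME": "YOUR_REDDIT_USERNAME",
    "REDDIT_PASSWORD": "YOUR_REDDIT_PASSWORD",
}
credentials = load_reddit_credentials(example_env)
print(sorted(credentials))

# With the real variables set in your shell, you could then write:
# reddit = praw.Reddit(**load_reddit_credentials())
```

With this approach the notebook itself never contains the secret values, so it can be shared or committed as-is.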

# 4. Take a look at the documentation [here](https://praw.readthedocs.io/)!

In [3]:
# Below is JUST an example of how you can use PRAW

# Choose your subreddit
subreddit = reddit.subreddit('marvel')

# Adjust the limit as needed -- this requests up to 1000 of the newest posts
posts = subreddit.new(limit=1000)

In [4]:
# Below is JUST an example of how you can use PRAW

# Choose your subreddit
subreddit_1 = reddit.subreddit('DC_Cinematic')

# Adjust the limit as needed -- this requests up to 1000 of the newest posts
posts_1 = subreddit_1.new(limit=1000)

## NOTE
- Reddit API Limitations: The Reddit API limits how many posts a single request can return, typically 100. If you set the limit parameter above 100, PRAW makes multiple requests behind the scenes to fetch the desired number of posts. A listing such as .new() also tops out at roughly 1,000 items, so a single pull cannot reach further back than that.
- Rate Limiting: The Reddit API also enforces rate limits to prevent abuse and ensure fair usage. The specific rate limits can vary depending on factors such as your Reddit account's age and karma. As a standard user, you're typically allowed to make 60 requests per minute. If you exceed the rate limit, you may receive an error response until the rate limit resets.
- TIP: You can use the created_utc attribute of a post to keep track of the timestamp and ensure non-overlapping pulls. The created_utc attribute represents the post's creation time in UTC.
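
The tip about non-overlapping pulls can be sketched as follows. This is a minimal illustration, not the notebook's method: plain dictionaries stand in for PRAW submissions, the helper `merge_pulls` is hypothetical, and it assumes you also store each post's `id` (the cells in this notebook collect `created_utc` but not `id`). The timestamps are made-up values in the same range as the data shown here.

```python
from datetime import datetime, timezone


def merge_pulls(existing, new_batch):
    """Append only the posts from new_batch that were not collected before."""
    seen_ids = {post["id"] for post in existing}
    merged = list(existing)
    for post in new_batch:
        if post["id"] not in seen_ids:
            merged.append(post)
            seen_ids.add(post["id"])
    return merged


# Two toy pulls that overlap on post "a2".
first_pull = [
    {"id": "a1", "created_utc": 1697581000.0, "title": "newest post"},
    {"id": "a2", "created_utc": 1697579000.0, "title": "older post"},
]
second_pull = [
    {"id": "a2", "created_utc": 1697579000.0, "title": "older post"},
    {"id": "a3", "created_utc": 1697578000.0, "title": "oldest post"},
]

all_posts = merge_pulls(first_pull, second_pull)

# created_utc is seconds since the Unix epoch, in UTC.
oldest = min(post["created_utc"] for post in all_posts)
oldest_readable = datetime.fromtimestamp(oldest, tz=timezone.utc).isoformat()

print([post["id"] for post in all_posts])
print(oldest_readable)
```

Tracking the oldest `created_utc` seen so far tells you how far back your collection reaches, which helps when you repeat the pull on a later day to add newer posts.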

In [5]:
import pandas as pd

data = []
for post in posts:
    data.append([post.created_utc, post.title, post.selftext, post.subreddit])

# Turn into a dataframe
marvel = pd.DataFrame(data, columns = ['created_utc', 'title', 'self_text', 'subreddit'])
marvel

Unnamed: 0,created_utc,title,self_text,subreddit
0,1.697581e+09,Well that was fun: Guardian at New York Comic ...,@marc.kandel (On IG),Marvel
1,1.697581e+09,Thank you Kevin 5G for making this happen!,,Marvel
2,1.697581e+09,New Mutant Monday 10/30/23,Been thinking of rewatching new mutants for th...,Marvel
3,1.697579e+09,We're in the Endgame now.,,Marvel
4,1.697578e+09,Ghost rider delivers his vengeance (Spider-Man...,,Marvel
...,...,...,...,...
984,1.695831e+09,How have the X-Men been lately? Took a break l...,"As the title says, I stopped reading the curre...",Marvel
985,1.695830e+09,Sam Raimi,Hey everyone! Sam raimi is going to be at my l...,Marvel
986,1.695830e+09,Back when Bruce could control when to transfor...,,Marvel
987,1.695830e+09,Tony learns how to be a dad to his crazy ai [I...,,Marvel


In [6]:

data = []
for post in posts_1:
    data.append([post.created_utc, post.title, post.selftext, post.subreddit])

# Turn into a dataframe
dc_cinematic = pd.DataFrame(data, columns = ['created_utc', 'title', 'self_text', 'subreddit'])
dc_cinematic

Unnamed: 0,created_utc,title,self_text,subreddit
0,1.697581e+09,Blue Beetle DCEU Connections/References,Haven't seen Blue Beetle but I wanna know does...,DC_Cinematic
1,1.697511e+09,Who should be the villain(s) in Superman: Legacy?,,DC_Cinematic
2,1.697530e+09,"DC asks you to write a pitch for the batman 2,...",,DC_Cinematic
3,1.697564e+09,Unpopular Opinion: I think this is best flash ...,,DC_Cinematic
4,1.697537e+09,Who are some unseen heroes of the DCEU,I just rewatched BVS (ultimate) I paused to ob...,DC_Cinematic
...,...,...,...,...
973,1.692023e+09,"How would you feel if Superman Legacy gets ""Be...","I mean almost all past DC movies got ""Best DC ...",DC_Cinematic
974,1.692020e+09,"In the the trailer, it looks like they're both...",,DC_Cinematic
975,1.691997e+09,Oh the irony,,DC_Cinematic
976,1.692013e+09,Should Batcow be in the DCU?,,DC_Cinematic


In [7]:
marvel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 989 entries, 0 to 988
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   created_utc  989 non-null    float64
 1   title        989 non-null    object 
 2   self_text    989 non-null    object 
 3   subreddit    989 non-null    object 
dtypes: float64(1), object(3)
memory usage: 31.0+ KB


In [8]:
dc_cinematic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 978 entries, 0 to 977
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   created_utc  978 non-null    float64
 1   title        978 non-null    object 
 2   self_text    978 non-null    object 
 3   subreddit    978 non-null    object 
dtypes: float64(1), object(3)
memory usage: 30.7+ KB


In [9]:
marvel.head()

Unnamed: 0,created_utc,title,self_text,subreddit
0,1697581000.0,Well that was fun: Guardian at New York Comic ...,@marc.kandel (On IG),Marvel
1,1697581000.0,Thank you Kevin 5G for making this happen!,,Marvel
2,1697581000.0,New Mutant Monday 10/30/23,Been thinking of rewatching new mutants for th...,Marvel
3,1697579000.0,We're in the Endgame now.,,Marvel
4,1697578000.0,Ghost rider delivers his vengeance (Spider-Man...,,Marvel


In [10]:
marvel.isna().sum()

created_utc    0
title          0
self_text      0
subreddit      0
dtype: int64

In [11]:
dc_cinematic.head()

Unnamed: 0,created_utc,title,self_text,subreddit
0,1697581000.0,Blue Beetle DCEU Connections/References,Haven't seen Blue Beetle but I wanna know does...,DC_Cinematic
1,1697511000.0,Who should be the villain(s) in Superman: Legacy?,,DC_Cinematic
2,1697530000.0,"DC asks you to write a pitch for the batman 2,...",,DC_Cinematic
3,1697564000.0,Unpopular Opinion: I think this is best flash ...,,DC_Cinematic
4,1697537000.0,Who are some unseen heroes of the DCEU,I just rewatched BVS (ultimate) I paused to ob...,DC_Cinematic


In [12]:
dc_cinematic.isna().sum()

created_utc    0
title          0
self_text      0
subreddit      0
dtype: int64

In [13]:
# Save to CSV (to_csv returns None when given a path, so don't reassign the DataFrame)
marvel.to_csv('marvel_post.csv', index=False)

In [14]:
dc_cinematic.to_csv('dc_cinematic_post.csv', index=False)

## EDA

In [17]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import time 

from sklearn.feature_extraction.text import CountVectorizer



##### Read Data

In [19]:
#read Data
marvel = pd.read_csv('marvel_post.csv')
dc_cinematic = pd.read_csv('dc_cinematic_post.csv')

In [20]:
marvel.head()

Unnamed: 0,created_utc,title,self_text,subreddit
0,1697581000.0,Well that was fun: Guardian at New York Comic ...,@marc.kandel (On IG),Marvel
1,1697581000.0,Thank you Kevin 5G for making this happen!,,Marvel
2,1697581000.0,New Mutant Monday 10/30/23,Been thinking of rewatching new mutants for th...,Marvel
3,1697579000.0,We're in the Endgame now.,,Marvel
4,1697578000.0,Ghost rider delivers his vengeance (Spider-Man...,,Marvel


In [22]:
dc_cinematic.head()

Unnamed: 0,created_utc,title,self_text,subreddit
0,1697581000.0,Blue Beetle DCEU Connections/References,Haven't seen Blue Beetle but I wanna know does...,DC_Cinematic
1,1697511000.0,Who should be the villain(s) in Superman: Legacy?,,DC_Cinematic
2,1697530000.0,"DC asks you to write a pitch for the batman 2,...",,DC_Cinematic
3,1697564000.0,Unpopular Opinion: I think this is best flash ...,,DC_Cinematic
4,1697537000.0,Who are some unseen heroes of the DCEU,I just rewatched BVS (ultimate) I paused to ob...,DC_Cinematic


##### Checking for missing values

In [24]:
# marvel
marvel.isna().sum()

created_utc      0
title            0
self_text      532
subreddit        0
dtype: int64

In [25]:
#dc_cinematic
dc_cinematic.isna().sum()

created_utc      0
title            0
self_text      577
subreddit        0
dtype: int64
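
The `self_text` counts above (532 and 577 missing) differ from the zeros reported before the data was saved. The likely cause is that posts without body text have an empty string in `selftext`, and `pd.read_csv` reads empty fields back as `NaN` by default. The sketch below reproduces that round trip on two toy rows and shows one possible fix, filling with empty strings; whether to fill or drop is a modeling choice left to you.

```python
import io

import pandas as pd

# Toy rows: the second post has no body text, like a link or image post.
posts_demo = pd.DataFrame(
    {
        "title": ["A post with text", "A post without text"],
        "self_text": ["Some body text", ""],
    }
)
missing_before_save = int(posts_demo["self_text"].isna().sum())

# Round trip through CSV using an in-memory buffer instead of a file.
buffer = io.StringIO()
posts_demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
missing_after_reload = int(reloaded["self_text"].isna().sum())

# One option: turn the NaN values back into empty strings.
reloaded["self_text"] = reloaded["self_text"].fillna("")

print(missing_before_save, missing_after_reload)
```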

In [27]:
marvel.describe(include='object')

Unnamed: 0,title,self_text,subreddit
count,989,457,989
unique,989,457,1
top,Well that was fun: Guardian at New York Comic ...,@marc.kandel (On IG),Marvel
freq,1,1,989


In [28]:
dc_cinematic.describe(include='object')

Unnamed: 0,title,self_text,subreddit
count,978,401,978
unique,977,401,1
top,Batman No Way Home fan edit where all 4 Batmen...,Haven't seen Blue Beetle but I wanna know does...,DC_Cinematic
freq,2,1,978


##### EDA on title 

In [29]:
# create column for post_length
marvel['post_length'] = marvel['title'].map(len)
dc_cinematic['post_length'] = dc_cinematic['title'].map(len)

In [31]:
marvel.head(0)

Unnamed: 0,created_utc,title,self_text,subreddit,post_length


In [32]:
dc_cinematic.head(0)

Unnamed: 0,created_utc,title,self_text,subreddit,post_length


In [33]:
# create column for post_word_count
marvel['post_word_count'] = marvel['title'].map(lambda x: len(x.split()))
dc_cinematic['post_word_count'] = dc_cinematic['title'].map(lambda x: len(x.split()))

In [35]:
marvel.head(0)

Unnamed: 0,created_utc,title,self_text,subreddit,post_length,post_word_count


In [37]:
dc_cinematic.head(0)

Unnamed: 0,created_utc,title,self_text,subreddit,post_length,post_word_count


In [38]:
# create a ratio
marvel['ratio'] = marvel.post_length/marvel.post_word_count
dc_cinematic['ratio'] = dc_cinematic.post_length/dc_cinematic.post_word_count

In [40]:
marvel.head(0)

Unnamed: 0,created_utc,title,self_text,subreddit,post_length,post_word_count,ratio


In [42]:
dc_cinematic.head(2)

Unnamed: 0,created_utc,title,self_text,subreddit,post_length,post_word_count,ratio
0,1697581000.0,Blue Beetle DCEU Connections/References,Haven't seen Blue Beetle but I wanna know does...,DC_Cinematic,39,4,9.75
1,1697511000.0,Who should be the villain(s) in Superman: Legacy?,,DC_Cinematic,49,8,6.125
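
The three derived columns can be checked by hand on the two titles shown above. The helper `title_stats` below is a hypothetical stand-in for the `.map()` calls: it returns the character count, the whitespace-separated word count, and their ratio, which works out to the average number of characters per word (spaces included).

```python
def title_stats(title):
    """Return (post_length, post_word_count, ratio) for a single title."""
    post_length = len(title)
    post_word_count = len(title.split())
    # Guard against an empty title, which would otherwise divide by zero.
    ratio = post_length / post_word_count if post_word_count else float("nan")
    return post_length, post_word_count, ratio


first = title_stats("Blue Beetle DCEU Connections/References")
second = title_stats("Who should be the villain(s) in Superman: Legacy?")

print(first)   # (39, 4, 9.75)
print(second)  # (49, 8, 6.125)
```

A high ratio therefore points to titles made of a few long words, such as the one-word titles in the shortest-title tables.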


In [46]:
marvel.describe()

Unnamed: 0,created_utc,post_length,post_word_count,ratio
count,989.0,989.0,989.0,989.0
mean,1696769000.0,61.637007,10.82912,5.895385
std,515667.2,41.378581,7.624548,1.151312
min,1695801000.0,3.0,1.0,3.0
25%,1696360000.0,34.0,6.0,5.21875
50%,1696798000.0,52.0,9.0,5.7
75%,1697232000.0,76.0,14.0,6.333333
max,1697581000.0,288.0,55.0,16.0


In [47]:
dc_cinematic.describe()

Unnamed: 0,created_utc,post_length,post_word_count,ratio
count,978.0,978.0,978.0,978.0
mean,1694303000.0,73.54908,12.985685,5.720376
std,1563411.0,47.08389,8.203389,0.858811
min,1691984000.0,5.0,1.0,3.0
25%,1692915000.0,42.25,8.0,5.166667
50%,1694102000.0,62.0,11.0,5.660256
75%,1695481000.0,90.0,16.0,6.12375
max,1697581000.0,299.0,57.0,10.5


#### Top 10 post word count

In [51]:
# 10 Marvel posts with the longest titles (by word count)
marvel.sort_values(by='post_word_count', ascending=False)[['subreddit','post_word_count','post_length','ratio','title']].head(10)

Unnamed: 0,subreddit,post_word_count,post_length,ratio,title
866,Marvel,55,278,5.054545,So I think we can all agree that Wanda’s not d...
944,Marvel,52,288,5.538462,Here’s a weird question for you all: How would...
151,Marvel,51,273,5.352941,if marvel were to animate a comic (the way jap...
262,Marvel,50,283,5.66,I always thought it would be cool if Marvel ev...
918,Marvel,45,238,5.288889,Does anyone know what would have been Donny ca...
572,Marvel,43,204,4.744186,What do you think is the chance that the chara...
891,Marvel,43,226,5.255814,I've seen a lot of people at this point wantin...
438,Marvel,41,246,6.0,I remember this art work being the staple of m...
622,Marvel,41,229,5.585366,At least the avengers bothered to show up at D...
6,Marvel,40,218,5.45,"In Avengers Infinity War, when Iron Man and Dr..."


In [53]:
# 10 DC_Cinematic posts with the longest titles (by word count)
dc_cinematic.sort_values(by='post_word_count', ascending=False)[['subreddit','post_word_count','post_length','ratio','title']].head(10)

Unnamed: 0,subreddit,post_word_count,post_length,ratio,title
18,DC_Cinematic,57,299,5.245614,I have always wondered who it was Clark was pr...
974,DC_Cinematic,52,299,5.75,"In the the trailer, it looks like they're both..."
358,DC_Cinematic,51,293,5.745098,Tim Burton breaks silence on The Flash using h...
503,DC_Cinematic,50,286,5.72,Mr. Mind and the Monster Society of Evil shoul...
204,DC_Cinematic,49,273,5.571429,"""And, yes, some actors will be playing charact..."
209,DC_Cinematic,48,299,6.229167,"Fun Fact: Birdman, an Oscar-winning movie rele..."
686,DC_Cinematic,48,275,5.729167,That’s it! #BlueBeetle had #5 best hold in DCE...
665,DC_Cinematic,46,278,6.043478,"With actuals out, #BlueBeetle was the first ch..."
930,DC_Cinematic,46,264,5.73913,"James Gunn praises WONDER WOMAN HISTORIA as ""a..."
805,DC_Cinematic,46,267,5.804348,When James Gunn cast David Dastmalchian as sup...


#### Bottom 10 post word count

In [52]:
# 10 Marvel posts with the shortest titles (by word count)
marvel.sort_values(by='post_word_count', ascending=True)[['subreddit','post_word_count','post_length','ratio','title']].head(10)

Unnamed: 0,subreddit,post_word_count,post_length,ratio,title
703,Marvel,1,16,16.0,Self-Explanatory
824,Marvel,1,8,8.0,Galactus
485,Marvel,1,7,7.0,Sadness
278,Marvel,1,6,6.0,Ultron
202,Marvel,1,3,3.0,HoX
595,Marvel,1,9,9.0,Opiniyann
701,Marvel,1,11,11.0,Cancerverse
617,Marvel,1,8,8.0,Eternals
227,Marvel,1,13,13.0,Annihilation!
731,Marvel,1,8,8.0,Curious?


In [54]:
# 10 DC_Cinematic posts with the shortest titles (by word count)
dc_cinematic.sort_values(by='post_word_count', ascending=True)[['subreddit','post_word_count','post_length','ratio','title']].head(10)

Unnamed: 0,subreddit,post_word_count,post_length,ratio,title
649,DC_Cinematic,1,5,5.0,Iykyk
44,DC_Cinematic,2,15,7.5,Source material
277,DC_Cinematic,2,20,10.0,Snyder’s controversy
286,DC_Cinematic,2,21,10.5,Batfleck appreciation
530,DC_Cinematic,2,15,7.5,Aquaman Trailer
427,DC_Cinematic,2,13,6.5,"Belgium, 1918"
753,DC_Cinematic,2,10,5.0,Her emails
870,DC_Cinematic,2,16,8.0,Gigachad move!!!
128,DC_Cinematic,2,8,4.0,DCU Poll
510,DC_Cinematic,2,12,6.0,Graphic data


#### CountVectorizing

In [None]:
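# Marvel: count the words in each title to find the 20 most common
cv = CountVectorizer(stop_words='english')
cv.fit(marvel['title'])

marvel_post_cv = cv.transform(marvel['title'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
marvel_post_df = pd.DataFrame(marvel_post_cv.toarray(), columns=cv.get_feature_names_out())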

In [None]:
marvel_top_words = pd.DataFrame(marvel_post_df.sum().sort_values(ascending=False).head(20), columns = ['Count'])
marvel_top_words.T

In [15]:
# TODO:
# - get a DataFrame for both subreddits
# - concatenate both DataFrames
# - EDA: lemmatize, CountVectorizer

Remember, you will need to pull *at least* 1000 posts from each subreddit, not just 25. Like I mentioned above, you can use the created_utc attribute of a post to keep track of the timestamp and ensure non-overlapping pulls. We will leave this work for you all to complete.

Once you have at least 1000 posts from each subreddit, you can do some EDA (perhaps the most common words in each subreddit?). Eventually, you will want to combine your two dataframes to do modeling.
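
Combining the two dataframes could look like the sketch below. It uses tiny stand-in frames so it runs on its own, and the `is_marvel` target column is a hypothetical name chosen for illustration; with the real data you would pass `marvel` and `dc_cinematic` to `pd.concat` instead.

```python
import pandas as pd

# Tiny stand-ins for the two scraped dataframes.
marvel_demo = pd.DataFrame(
    {"title": ["Sam Raimi", "Galactus"], "subreddit": ["Marvel", "Marvel"]}
)
dc_demo = pd.DataFrame(
    {"title": ["Oh the irony", "DCU Poll"], "subreddit": ["DC_Cinematic", "DC_Cinematic"]}
)

# Stack the rows and renumber the index from 0.
combined = pd.concat([marvel_demo, dc_demo], ignore_index=True)

# Binary target for a classifier: 1 for Marvel, 0 for DC_Cinematic.
combined["is_marvel"] = (combined["subreddit"] == "Marvel").astype(int)

print(combined)
```

Using `ignore_index=True` avoids duplicate index labels, which would otherwise make row lookups by label ambiguous.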

### Hopefully this is enough of a tutorial to help get you started! If you have any questions, let us know!

### Note: Rather than working in this template notebook, make a brand new "scraping" notebook (or script), with your own comments, so you can use this project in a portfolio!