## Reddit API Scraper via PRAW

In this notebook, you will see how this app scrapes posts from Reddit using PRAW (the Python Reddit API Wrapper). All scraped data is placed into a pandas DataFrame for further use in the application.

### Setup for PRAW

Here, we import all the libraries this notebook needs.

In [1]:
import praw
import pandas as pd
import datetime as dt
import time
from collections import defaultdict


Below is the setup for the API client that will be used to browse Reddit.

In [2]:
# Initialize PRAW with your Reddit API credentials.
# The values below are placeholders; keep real credentials out of shared notebooks.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="RUControversial/1.0 by /u/Such_Touch2846",
)

# reddit.user.me() returns None in read-only mode: no username/password was supplied,
# so the client can read public data but cannot post, comment, or vote
print(f"Connected as user: {reddit.user.me()}")

Connected as user: None
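
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative reads them from environment variables instead. The variable names (`REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`) and the helper `load_reddit_credentials` are assumptions made for illustration; they are not part of this notebook or of PRAW.

```python
import os

REQUIRED_KEYS = ("client_id", "client_secret", "user_agent")

def load_reddit_credentials(env=None):
    """Build keyword arguments for praw.Reddit from REDDIT_* variables (hypothetical names)."""
    env = os.environ if env is None else env
    creds = {key: env.get(f"REDDIT_{key.upper()}") for key in REQUIRED_KEYS}
    # Fail early with a clear message instead of sending empty credentials
    missing = [f"REDDIT_{key.upper()}" for key, value in creds.items() if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return creds

# Demonstration with a stand-in mapping instead of the real environment
example_env = {
    "REDDIT_CLIENT_ID": "example-id",
    "REDDIT_CLIENT_SECRET": "example-secret",
    "REDDIT_USER_AGENT": "example-agent/1.0",
}
creds = load_reddit_credentials(example_env)
print(sorted(creds))
```

With the variables set in the shell, the returned dictionary could be unpacked directly, as in `praw.Reddit(**load_reddit_credentials())`.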


### Subreddit Scraping: r/AITA

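After successfully connecting to the API, we select a subreddit, r/AmItheAsshole (r/AITA), for our project and begin pulling posts. We keep only posts whose flair is one of the four verdicts, up to 50 posts per verdict.

Reddit offers several listing categories:
- Hot
- Recent
- Top

The cell below reads the Recent (`new`) listing.

Please excuse the subreddit's language.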

In [5]:
subreddit = reddit.subreddit("AmItheAsshole")  # Example subreddit

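# Verdict flairs to keep; posts with any other flair (or none) are skipped
valid_verdicts = [
    "Asshole",
    "Not the A-hole",
    "Everyone Sucks",
    "No A-holes here"
]

# Initialize a set to store unique flairs
unique_flairs = set()

# Upper bound on the number of posts kept for each verdict
MAX_POSTS_PER_VERDICT = 50
posts_by_verdict = defaultdict(list)  # verdict -> list of post dicts
fetched_count = defaultdict(int)      # verdict -> number of posts kept so far
collected_post_ids = set()            # IDs already stored, used to skip duplicates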

# Function to collect posts
def collect_posts(submissions):
    for submission in submissions:
        # Stop if all verdicts have reached the max
        if all(fetched_count[verdict] >= MAX_POSTS_PER_VERDICT for verdict in valid_verdicts):
            break
        if submission.id in collected_post_ids:
            continue
        flair = submission.link_flair_text
        if flair and flair in valid_verdicts and fetched_count[flair] < MAX_POSTS_PER_VERDICT:
            post_data = {
                "post_id": submission.id,
                "title": submission.title,
                "selftext": submission.selftext,
                "subreddit_name": submission.subreddit.display_name,
                "author": str(submission.author) if submission.author else "[deleted]",
                "score": submission.score,
                "upvote_ratio": submission.upvote_ratio,
                "num_comments": submission.num_comments,
                "created_utc": dt.datetime.fromtimestamp(submission.created_utc, dt.timezone.utc),
                "flair": submission.link_flair_text,
                "url": submission.url,
                "is_self": submission.is_self,
                "label": submission.link_flair_text
            }
            posts_by_verdict[flair].append(post_data)
            fetched_count[flair] += 1
            collected_post_ids.add(submission.id)
            print(f"Collected '{flair}': {fetched_count[flair]}")

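# Collect from the newest posts until every verdict reaches MAX_POSTS_PER_VERDICT
# or the listing runs out (Reddit caps a listing at roughly 1000 items)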
try:
    collect_posts(subreddit.new(limit=None))
except praw.exceptions.RedditAPIException as e:
    print(f"API error: {e}")
except Exception as e:
    print(f"General error: {e}")

# Combine posts into a list
all_posts = []
for verdict_posts in posts_by_verdict.values():
    all_posts.extend(verdict_posts)

# Create DataFrame
df = pd.DataFrame(all_posts)

# Print results
print("\nPosts per flair:", dict(fetched_count))
print("Total number of posts:", len(df))

Collected 'Not the A-hole': 1
Collected 'Not the A-hole': 2
Collected 'No A-holes here': 1
Collected 'Not the A-hole': 3
Collected 'Not the A-hole': 4
Collected 'Asshole': 1
Collected 'Not the A-hole': 5
Collected 'Not the A-hole': 6
Collected 'Not the A-hole': 7
Collected 'Asshole': 2
Collected 'Asshole': 3
Collected 'Asshole': 4
Collected 'Not the A-hole': 8
Collected 'Asshole': 5
Collected 'Asshole': 6
Collected 'Not the A-hole': 9
Collected 'Asshole': 7
Collected 'Asshole': 8
Collected 'Everyone Sucks': 1
Collected 'Not the A-hole': 10
Collected 'Not the A-hole': 11
Collected 'Asshole': 9
Collected 'Not the A-hole': 12
Collected 'Not the A-hole': 13
Collected 'Not the A-hole': 14
Collected 'Not the A-hole': 15
Collected 'Asshole': 10
Collected 'Not the A-hole': 16
Collected 'Asshole': 11
Collected 'Everyone Sucks': 2
Collected 'Asshole': 12
Collected 'Asshole': 13
Collected 'Not the A-hole': 17
Collected 'Asshole': 14
Collected 'No A-holes here': 2
Collected 'Not the A-hole': 18
...
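
The capping and de-duplication rules in `collect_posts` can be exercised without calling the API. The sketch below is a simplified stand-in, not the notebook's function: `collect_capped`, the `SimpleNamespace` objects, and the cap of 2 are hypothetical and chosen only to make the behavior visible.

```python
from types import SimpleNamespace

VALID_VERDICTS = ["Asshole", "Not the A-hole", "Everyone Sucks", "No A-holes here"]

def collect_capped(submissions, cap, seen_ids, counts):
    """Keep at most `cap` previously unseen submissions per valid flair."""
    kept = []
    for sub in submissions:
        # Stop early once every verdict has reached the cap
        if all(counts.get(v, 0) >= cap for v in VALID_VERDICTS):
            break
        flair = sub.link_flair_text
        # Skip duplicates and posts without a recognized verdict flair
        if sub.id in seen_ids or flair not in VALID_VERDICTS:
            continue
        # Skip verdicts that are already full
        if counts.get(flair, 0) >= cap:
            continue
        kept.append(sub)
        counts[flair] = counts.get(flair, 0) + 1
        seen_ids.add(sub.id)
    return kept

fake = [
    SimpleNamespace(id="a1", link_flair_text="Asshole"),
    SimpleNamespace(id="a1", link_flair_text="Asshole"),         # duplicate ID
    SimpleNamespace(id="b2", link_flair_text=None),              # no verdict yet
    SimpleNamespace(id="c3", link_flair_text="Asshole"),
    SimpleNamespace(id="d4", link_flair_text="Asshole"),         # over the cap
    SimpleNamespace(id="e5", link_flair_text="Not the A-hole"),
]
seen, counts = set(), {}
kept = collect_capped(fake, cap=2, seen_ids=seen, counts=counts)
print([s.id for s in kept])  # ['a1', 'c3', 'e5']
```

Because the seen-ID set and the counts are passed in, the same state can be reused across several listings, which mirrors how the notebook's module-level `collected_post_ids` and `fetched_count` would behave if `collect_posts` were called more than once.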

### Export Data

After pulling the Reddit posts, we convert the data to a CSV file and export it.

In [6]:
df.to_csv("aita_posts.csv", index=False)
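
When the file is read back later, every column arrives as text, including `created_utc`. As a minimal sketch using only the standard library, the timestamp can be restored with `datetime.fromisoformat`, assuming the column was written in the `YYYY-MM-DD HH:MM:SS+00:00` form. The `read_posts` helper and the sample row are hypothetical and cover only a few of the exported columns.

```python
import csv
import datetime as dt
import io

def read_posts(csv_text):
    """Parse CSV text into dicts, restoring numeric and timestamp fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Timezone-aware datetime, assuming an ISO-style value with a UTC offset
        row["created_utc"] = dt.datetime.fromisoformat(row["created_utc"])
        row["score"] = int(row["score"])
        rows.append(row)
    return rows

# Hypothetical sample using a subset of the exported columns
sample = (
    "post_id,score,created_utc,label\n"
    "abc123,42,2024-01-01 00:00:00+00:00,Asshole\n"
)
posts = read_posts(sample)
print(posts[0]["created_utc"].isoformat())
```

With pandas, passing `parse_dates=["created_utc"]` to `pd.read_csv` serves the same purpose.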