# Reddit Data Source Review

This notebook documents the evaluation of Reddit as a real-time data source for market attention signals.

## Objective
The goal was to stream Reddit posts in real time, detect mentions of specific stock and crypto tickers, and evaluate Reddit’s usefulness for a market attention dashboard.

In [24]:
from dotenv import load_dotenv
import os
import asyncpraw
import asyncio
from datetime import datetime
import re
import pandas as pd

# Load credentials
load_dotenv()
client_id = os.getenv("REDDIT_CLIENT_ID")
client_secret = os.getenv("REDDIT_CLIENT_SECRET")
user_agent = os.getenv("USER_AGENT")

TICKERS = {"GOOGL", "AAPL", "AMZN", "NVDA", "MSFT", "BTC", "XRP"}
SUBREDDITS = "personalfinance+stocks+wallstreetbets+investing+CryptoCurrency"

We first load our credentials so that the API can authenticate us. The same cell defines the tickers to monitor and the subreddits in which to look for them.

In [25]:
async def stream_posts():
    reddit = asyncpraw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent
    )
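    try:
        subreddit = await reddit.subreddit(SUBREDDITS)
        # skip_existing=True: only posts created after the stream starts are yielded
        async for post in subreddit.stream.submissions(skip_existing=True):
            text = (post.title or "") + " " + (post.selftext or "")
            # Upper-case the text, then keep every 2-5 letter word as a ticker candidate
            words = set(re.findall(r"\b[A-Z]{2,5}\b", text.upper()))
            matched = TICKERS & words
            if matched:
                print(f"[{datetime.utcnow()}] 🧠 [{post.subreddit}] {post.title} — matched: {matched}")
    finally:
        # The stream never ends on its own, so close the client on cancellation or error
        await reddit.close()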


This was the original function for retrieving posts from Reddit. In practical use, instead of printing the matched posts, we would save each post with its metadata to a database or JSON file for later use.
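A minimal sketch of the "save instead of print" idea, assuming a JSON Lines file as the storage format. The helper names `find_tickers`, `build_record` and `append_record`, and the record fields, are hypothetical and not part of the notebook; the matching logic mirrors the regex used in `stream_posts`, so lowercase mentions such as `btc` also match because the text is upper-cased first.

```python
import json
import re
from datetime import datetime, timezone
from pathlib import Path

TICKERS = {"GOOGL", "AAPL", "AMZN", "NVDA", "MSFT", "BTC", "XRP"}
TICKER_RE = re.compile(r"\b[A-Z]{2,5}\b")


def find_tickers(title, selftext):
    """Return the monitored tickers mentioned in a post (same logic as stream_posts)."""
    text = (title or "") + " " + (selftext or "")
    return TICKERS & set(TICKER_RE.findall(text.upper()))


def build_record(subreddit, post_id, title, selftext):
    """Build a JSON-serialisable record, or None when no ticker matched."""
    matched = find_tickers(title, selftext)
    if not matched:
        return None
    return {
        "seen_at": datetime.now(timezone.utc).isoformat(),
        "subreddit": subreddit,
        "post_id": post_id,
        "title": title,
        "matched": sorted(matched),  # sets are not JSON-serialisable
    }


def append_record(path, record):
    """Append one record as a single line of JSON (JSON Lines)."""
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Inside the streaming loop, the `print` call could then be replaced by something like `record = build_record(str(post.subreddit), post.id, post.title, post.selftext)` followed by `append_record("matches.jsonl", record)` when `record` is not `None`; the file name is again only an assumption.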

In [51]:
async def stream_posts(max_results: int = 5):

    # set up reddit client
    reddit = asyncpraw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent
    )

    subreddit = await reddit.subreddit(SUBREDDITS)
    seen = 0

    async for post in subreddit.stream.submissions(skip_existing=False):
        text = (post.title or "") + " " + (post.selftext or "")
        words = set(re.findall(r"\b[A-Z]{2,5}\b", text.upper()))
        matched = TICKERS & words

        print(f"[{datetime.utcnow()}] | [{post.subreddit}] {post.title} ({post.id}) — matched: {matched}")
        seen+=1
        if seen >= max_results:
            break

    await reddit.close()


In [52]:
await stream_posts(5)

[2025-07-06 16:00:34.753904] | [CryptoCurrency] Those who like squeezing in the era of the dollar collapse- Altcoins preparing for massive breakout (1lsug2s) — matched: {'BTC'}
[2025-07-06 16:00:34.753904] | [personalfinance] Can i afford this home? Should i buy? (1lsulnl) — matched: set()
[2025-07-06 16:00:34.753904] | [CryptoCurrency] Are airdrops actually legit? (1lsum80) — matched: {'BTC'}
[2025-07-06 16:00:34.753904] | [CryptoCurrency] I'm getting spammed with 0.000001 USDT transactions on Polygon – from random wallets. Why? (1lsun6g) — matched: set()
[2025-07-06 16:00:34.753904] | [CryptoCurrency] Gold Explorer Joins Bitcoin Treasury Bandwagon - Decrypt (1lsuwg6) — matched: set()


We changed the function in three ways: it now stops after 5 posts, it sets `skip_existing=False` so that posts created before the stream started are also returned, and it prints every post instead of only the matched ones. The result is 5 recent posts from the subreddits. If either of the last two changes were reverted to the original behaviour, it would take quite a long time to collect the desired number of results.

In [47]:
async def stream_posts(max_results: int = 5):

    # set up reddit client
    reddit = asyncpraw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent
    )

    subreddit = await reddit.subreddit(SUBREDDITS)
    seen = 0
    rows = []

    async for post in subreddit.stream.submissions(skip_existing=False):
        # No ticker matching here: every post is recorded
        seen += 1

        rows.append({
            "subreddit"    : str(post.subreddit),
            "title"        : post.title,
            "score"        : post.score,
            "ups"          : post.ups,
            "downs"        : post.downs,
        })

        if seen >= max_results:
            break

    await reddit.close()

    return pd.DataFrame(rows)


In [50]:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.width", 0)

df = await stream_posts(5)
df.head()

|   | subreddit | title | score | ups | downs |
|---|---|---|---|---|---|
| 0 | investing | Best app for investing that you use? | 14 | 14 | 0 |
| 1 | personalfinance | Has anyone here successfully used emergency debt relief during a financial crisis? How did it work out? | 0 | 0 | 0 |
| 2 | CryptoCurrency | Haters of XRP… Why? Only Real, Thoughtful Answers Please | 0 | 0 | 0 |
| 3 | personalfinance | Is it better to have two people on a house loan even if one doesn’t have credit? | 0 | 0 | 0 |
| 4 | personalfinance | Savings for Grad School + Retirement | 3 | 3 | 0 |


If we were to use Reddit as a data source, this is how we would save the data for further use: one row per post, with the collected data looking like the table above.
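As a rough illustration of persisting such rows, the sketch below writes a list of row dictionaries (the same shape as the `rows` list built inside `stream_posts`) to a CSV file and reads it back. It uses only the standard library, and the function names `save_rows` and `load_rows` are hypothetical. With the DataFrame itself, `df.to_csv(path, index=False)` would be the one-line equivalent.

```python
import csv

COLUMNS = ["subreddit", "title", "score", "ups", "downs"]
INT_COLUMNS = ("score", "ups", "downs")


def save_rows(path, rows):
    """Write the collected rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


def load_rows(path):
    """Read the rows back, converting the numeric columns to int again."""
    with open(path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    for row in rows:
        for key in INT_COLUMNS:
            row[key] = int(row[key])
    return rows
```

The `csv` module quotes titles that contain commas, so a title such as "Haters of XRP… Why? Only Real, Thoughtful Answers Please" survives the round trip unchanged.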

## Observations & Findings

- Subreddits used: `personalfinance`, `stocks`, `wallstreetbets`, `investing`, `CryptoCurrency`.
- Tickers monitored: `GOOGL`, `AAPL`, `AMZN`, `NVDA`, `MSFT`, `BTC`, `XRP`.

### Match Volume
- Extremely low match rate (~1–3 matched posts per hour), even though the subreddits are active.
- Many posts containing tickers were **not relevant**.

## Conclusion

Reddit alone is **not sufficient** for real-time attention monitoring due to:
- Sparse volume for specific tickers.
- High noise-to-signal ratio.
- Inconsistent naming (tickers vs. full names).
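
The naming problem can be seen in the sample output above, where the post "Gold Explorer Joins Bitcoin Treasury Bandwagon - Decrypt" produced no match because it says "Bitcoin" rather than "BTC". One possible mitigation, sketched here under assumptions, is to match a small alias table per ticker. The `ALIASES` table and the `match_with_aliases` function are hypothetical and were not part of the evaluation.

```python
import re

# Hypothetical alias table; the names are illustrative and would need curation.
ALIASES = {
    "GOOGL": ["googl", "google", "alphabet"],
    "AAPL": ["aapl", "apple"],
    "AMZN": ["amzn", "amazon"],
    "NVDA": ["nvda", "nvidia"],
    "MSFT": ["msft", "microsoft"],
    "BTC": ["btc", "bitcoin"],
    "XRP": ["xrp", "ripple"],
}

# One case-insensitive, whole-word pattern per ticker.
PATTERNS = {
    ticker: re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    for ticker, names in ALIASES.items()
}


def match_with_aliases(text):
    """Return every ticker whose symbol or alias appears as a whole word."""
    return {ticker for ticker, pattern in PATTERNS.items() if pattern.search(text or "")}
```

Aliases improve recall but can add noise of their own: a word like "apple" or "alphabet" also appears in posts that have nothing to do with the company, so this does not address the relevance problem noted earlier.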