# Download Reddit Data
Downloads posts from specified subreddits.

The goal was to get things started with some initial experiment data that has a *social media* flavor.

In [None]:
import pickle
from os import getenv

from dotenv import load_dotenv
from langchain_community.document_loaders import RedditPostsLoader
from tqdm.notebook import tqdm

# Load REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET from a local .env file, if present
load_dotenv()

## Setup
Define what exactly to download:
* `CATEGORIES`: subset of `["controversial", "hot", "new", "rising", "top"]`
* `SEARCH_QUERIES`: list of subreddits
* `NUM_POSTS`: number of posts to download (this seems to apply per combination of category and subreddit)

To use the *LangChain community* `RedditPostsLoader`, a *Reddit* client ID and secret have to be provided as the environment variables `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET`. One way to do this is with a `.env` file.
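
As a minimal sketch of how the credentials could be checked before the loader is created: the helper `missing_credentials` and the constant `REQUIRED_VARS` are hypothetical names introduced here and are not part of the notebook. The `.env` file is assumed to hold one `NAME=value` pair per line.

```python
from os import getenv

# Assumed .env layout (placeholder values, not real credentials):
#   REDDIT_CLIENT_ID=<your client id>
#   REDDIT_CLIENT_SECRET=<your client secret>
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")


def missing_credentials(env_lookup=getenv):
    """Return the names of required variables that are unset or empty.

    `env_lookup` is any callable mapping a variable name to its value
    (or None); it defaults to os.getenv.
    """
    return [name for name in REQUIRED_VARS if not env_lookup(name)]
```

Calling `missing_credentials()` after `load_dotenv()` returns an empty list when both variables are set, so a non-empty result points to a missing or incomplete `.env` file.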

In [None]:
CATEGORIES = ["new", "hot"] # list of any of ["controversial", "hot", "new", "rising", "top"]
SEARCH_QUERIES = ["MachineLearning"] # list of subreddits
NUM_POSTS = 1000 # number of posts to download; seems to be per category x search_query

## Load

In [None]:
loader = RedditPostsLoader(
    client_id=getenv("REDDIT_CLIENT_ID"),
    client_secret=getenv("REDDIT_CLIENT_SECRET"),
    user_agent="extractor",
    categories=CATEGORIES,
    mode="subreddit",
    search_queries=SEARCH_QUERIES,
    number_posts=NUM_POSTS,
)
docs = loader.load()
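
To see how `NUM_POSTS` is distributed across categories and subreddits, the loaded documents can be counted per source. The sketch below assumes each document's `metadata` carries the keys `post_subreddit` and `post_category`. The helper name `count_by_source` is hypothetical and not part of the notebook.

```python
from collections import Counter


def count_by_source(docs):
    """Count documents per (subreddit, category) pair.

    Assumes each document has a `metadata` dict; keys that are absent
    show up as None in the resulting pairs.
    """
    return Counter(
        (d.metadata.get("post_subreddit"), d.metadata.get("post_category"))
        for d in docs
    )
```

Running `count_by_source(docs)` on the result of `loader.load()` gives one counter entry per category and subreddit combination, which can be compared against `NUM_POSTS`.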

## Save
The following cell *pickles* the documents to `reddit-docs.pickle` so that other notebooks can reuse them. Posts without text content are skipped.

In [None]:
# keep only docs that have non-empty page content
docs = [d for d in docs if d.page_content]
print(f"{len(docs)} documents")

with open("reddit-docs.pickle", "wb") as file:
    pickle.dump(docs, file)
    print("Wrote docs to reddit-docs.pickle")
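
For reuse in another notebook, the pickled documents can be read back with the standard `pickle` module. This is a minimal sketch: the helper name `load_docs` is hypothetical, and the default path assumes the other notebook runs in the same directory as this one.

```python
import pickle
from pathlib import Path


def load_docs(path="reddit-docs.pickle"):
    """Load the list of documents previously written with pickle.dump."""
    with Path(path).open("rb") as file:
        return pickle.load(file)
```

A call such as `docs = load_docs()` restores the list. Because `pickle` rebuilds objects from their class definitions, the LangChain package that defines the document class has to be installed in the environment that loads the file, and only pickle files from a trusted source should be loaded.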