# Downloading 100,000+ Reddit posts

Now that we have the Wikipedia data, we need to obtain our Reddit posts.

We do so using the [Pushshift Python library](https://github.com/pushshift/api), an interface to the [Pushshift.io](https://api.pushshift.io/) API, which provides a huge dataset of all public Reddit posts and comments.

In [2]:
import json
import os
from typing import Generator
from psaw import PushshiftAPI

try:
    from library_functions.config import Config
except ModuleNotFoundError:
    from project.library_functions.config import Config



Initialize the API interface:

In [None]:
api = PushshiftAPI()

Create a folder for our data. Due to the volume of data, this folder isn't part of our repository.

In [None]:
reddit_path = Config.Path.private_data_folder / "reddit_data"
# Create the folder (and any missing parents); don't fail if it already exists.
os.makedirs(reddit_path, exist_ok=True)

Note that we also wrote functions to download and analyze *comments*, not only posts. However, that increases our dataset by a factor of 10 (1,000,000+ comments), so we found it impractical to process and analyze in such a short timeframe.

In [None]:
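reddit_comments_path = reddit_path.joinpath("comments")
reddit_submissions_path = reddit_path.joinpath("submissions")

# Re-running this cell is safe: existing folders are left untouched.
os.makedirs(reddit_comments_path, exist_ok=True)
os.makedirs(reddit_submissions_path, exist_ok=True)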

Let's make two utility functions to save the posts/comments returned by the API in a more convenient form:

In [None]:
def save_submissions(submission):
    """Save the relevant fields of one submission to <id>.json."""
    relevant_dict = {
        "title": submission.title,
        "author": submission.author,
        "timestamp": submission.created,
        "id": submission.id,
    }
    try:
        relevant_dict["content"] = submission.selftext
    except AttributeError:
        # The API result may lack `selftext`; store an empty string instead.
        relevant_dict["content"] = ""
    with open(
        reddit_submissions_path.joinpath(relevant_dict["id"] + ".json"), "w"
    ) as f:
        json.dump(relevant_dict, f)


def save_comment(comment):
    """Save the relevant fields of one comment to c__<id>.json."""
    relevant_dict = {
        "author": comment.author,
        "body": comment.body,
        "timestamp": comment.created,
        # The prefix distinguishes comment ids from submission ids.
        "id": "c__" + comment.id,
    }
    with open(
        reddit_comments_path.joinpath(relevant_dict["id"] + ".json"), "w"
    ) as f:
        json.dump(relevant_dict, f)
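
The fallback for posts without text content can be checked offline, without calling the API. The sketch below is only an illustration: `to_record` is a hypothetical helper (not part of the project code), and the two `SimpleNamespace` objects are made-up stand-ins for the objects the API returns.

```python
from types import SimpleNamespace


def to_record(submission) -> dict:
    """Build the dictionary that would be written to disk (hypothetical helper)."""
    return {
        "title": submission.title,
        "author": submission.author,
        "timestamp": submission.created,
        "id": submission.id,
        # Fall back to an empty string when `selftext` is missing.
        "content": getattr(submission, "selftext", ""),
    }


# Stand-in objects with made-up values, mimicking attribute access on API results.
text_post = SimpleNamespace(
    title="A", author="u1", created=1.0, id="abc", selftext="hello"
)
link_post = SimpleNamespace(title="B", author="u2", created=2.0, id="def")

text_record = to_record(text_post)
link_record = to_record(link_post)
print(repr(text_record["content"]), repr(link_record["content"]))
```

Both records end up with the same set of keys, which keeps later analysis code free of special cases for missing content.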



Getting all the submissions (or comments) is as easy as specifying the subreddit we are interested in, and passing a list of the fields we want to retrieve. This returns a generator which downloads the posts as it is iterated over:

In [None]:
all_submissions = api.search_submissions(
    subreddit="nootropics",
    filter=["title", "author", "created_utc", "id", "selftext", "url"],
)

all_comments = api.search_comments(
    subreddit="nootropics",
    filter=["author", "created_utc", "id", "body", "url"],
)
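
Because the search results arrive lazily, it can be useful to inspect a handful of items before committing to the full download. The following is a minimal sketch of that idea using `itertools.islice`; `fake_search` is a hypothetical stand-in generator with made-up data, used here so the example runs without network access.

```python
from itertools import islice
from typing import Iterator


def fake_search(total: int) -> Iterator[dict]:
    """Hypothetical stand-in for an API search generator; yields made-up posts lazily."""
    for n in range(total):
        yield {"id": f"post{n}", "title": f"Title {n}"}


results = fake_search(1_000_000)

# Nothing has been produced yet; islice pulls only the first five items.
sample = list(islice(results, 5))
sample_ids = [post["id"] for post in sample]

# The generator keeps its position, so the next item is the sixth one.
next_id = next(results)["id"]
```

Note that a generator can only be consumed once: items taken for a sample are not yielded again on a later pass.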

And now, go through all the submissions above and save them to file:

In [None]:
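def download_all_submissions(sub_generator: Generator):
    """Iterate over the submissions generator and save each item to file."""
    for i, sub in enumerate(sub_generator):
        # Report progress once per batch of 100 items.
        if i % 100 == 0:
            print(f"Downloading submissions {i} - {i + 99}.")

        save_submissions(sub)


def download_all_comments(com_generator: Generator):
    """Iterate over the comments generator and save each item to file."""
    for i, comment in enumerate(com_generator):
        # Report progress once per batch of 100 items.
        if i % 100 == 0:
            print(f"Downloading comments {i} - {i + 99}.")

        save_comment(comment)


download_all_comments(all_comments)
download_all_submissions(all_submissions)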
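
Once the files are on disk, they can be read back for analysis. The sketch below shows one possible way to do so, assuming one JSON file per post with a `timestamp` field as written above; `load_saved_posts` is a hypothetical helper, and the demonstration uses a temporary folder with made-up records rather than the real data folder.

```python
import json
import tempfile
from pathlib import Path


def load_saved_posts(folder: Path) -> list:
    """Read every .json file in `folder` and return the records, oldest first."""
    records = []
    for file_path in folder.glob("*.json"):
        with open(file_path) as f:
            records.append(json.load(f))
    return sorted(records, key=lambda record: record["timestamp"])


# Demonstration on a temporary folder with two made-up records.
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp)
    for record in (
        {"id": "b2", "title": "Second", "timestamp": 200.0},
        {"id": "a1", "title": "First", "timestamp": 100.0},
    ):
        with open(demo_folder / (record["id"] + ".json"), "w") as f:
            json.dump(record, f)
    loaded = load_saved_posts(demo_folder)

loaded_ids = [record["id"] for record in loaded]
```

Sorting by timestamp is a design choice for convenience; the files themselves carry no inherent order.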