# Reddit and PRAW

This computational notebook contains the code for accessing posts (also known as threads or submissions) and comments from Reddit (https://www.reddit.com/), reading them into Python and converting them into a format that Gephi is able to read: an edge list in CSV format (https://gephi.org/users/supported-graph-formats/csv-format/). The comments include both the top-level comments (comments on the original post) and replies (comments on someone else's comment).

The first command installs the PRAW library using pip, the 'package installer for Python'. The exclamation mark at the beginning tells Jupyter that what follows is not Python code but a shell command to be executed by the operating system. In this course, knowledge of shell commands is not important.

In [None]:
!pip install praw

Next we import the libraries we're going to use.

PRAW is a Python "wrapper" for the Reddit API, which means it is a convenient way of accessing the Reddit API from Python code. Have a look at the documentation at https://praw.readthedocs.io/. In particular, the section at https://praw.readthedocs.io/en/stable/code_overview/praw_models.html tells us which methods and attributes of a comment, redditor, submission or subreddit we can access.

Pandas is probably the single most useful Python library for data science. For the purpose of this course, knowing Pandas well is not a requirement, but if you are serious about getting into data science, you should familiarise yourself with its features and learn Pandas immediately after learning Python. If you would like to know more, an introduction can be found at: https://pandas.pydata.org/docs/user_guide/10min.html

In [None]:
import praw

import pandas as pd

A client ID and secret can be obtained by accessing https://www.reddit.com/prefs/apps after logging into Reddit and creating a new app.

The values below are made up. Never share your secret with someone else!

In [None]:
reddit = praw.Reddit(
    client_id = "my-client-id",
    client_secret = "my-secret",
    user_agent = "MyAcademicAppName/0.1 by myusername"
)
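One way to avoid typing the secret into the notebook at all is to read it from environment variables. The following is a minimal sketch, not part of PRAW: the helper name `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT` are assumptions chosen for illustration.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Collect the three praw.Reddit arguments from environment variables.

    The variable names below are hypothetical; use whatever names you like.
    """
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early with a readable message if something has not been set.
    missing = [var for var in names.values() if var not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {arg: env[var] for arg, var in names.items()}


# Usage in the notebook (the variables must be set before Jupyter starts):
# reddit = praw.Reddit(**load_reddit_credentials())
```

With this approach the notebook can be shared without the secret appearing in any cell.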

Next we retrieve the first 10 posts from the "hot" listing of the Scotland subreddit (https://www.reddit.com/r/Scotland) and print each post's title, number of comments, score and ID.

In [None]:
for submission in reddit.subreddit("Scotland").hot(limit = 10):
    print(submission.title)
    print(submission.num_comments)
    print(submission.score)
    print(submission.id)

Tip: if you know the ID of a submission, you can view it in the browser like this:
http://redd.it/112uhu6

Conversely, given an ID, PRAW can load the submission directly:

In [None]:
submission = reddit.submission("1kv58uq")

The following code shows how to print some information about the submission. It is not necessary for actually converting Reddit data into a Gephi edgelist.

In [None]:
print(submission.title)
print(submission.num_comments)
print(submission.score)
print(submission.url)

In [None]:
author = submission.author

In [None]:
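# submission.author is None if the author's account has been deleted
print(author.name if author is not None else "[deleted]")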

In [None]:
# expand comment tree to download all comments - for very large comment trees this may take a while
submission.comments.replace_more(limit = None)

In [None]:
# this is how you would access only the top-level comments
top_level_comments = list(submission.comments)

# this is how we access all comments (top-level comments and replies)
# note that we need to call submission.comments.list() *after* calling replace_more to make sure we get all comments
all_comments = submission.comments.list()
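To see why the two lists differ in length, it can help to picture the comments as a tree and the full list as a flattened version of it. The sketch below uses made-up data; `ToyComment` and `flatten` are hypothetical names and this is not PRAW's own code, only one simple way of flattening a tree.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyComment:
    """A made-up stand-in for a Reddit comment with nested replies."""
    body: str
    replies: list = field(default_factory=list)


def flatten(top_level):
    """Return every comment in the tree, visiting it level by level."""
    queue = deque(top_level)
    flat = []
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        # replies are queued so they are visited after the current level
        queue.extend(comment.replies)
    return flat


tree = [
    ToyComment("first", [ToyComment("reply to first", [ToyComment("nested reply")])]),
    ToyComment("second"),
]
print(len(tree))           # number of top-level comments
print(len(flatten(tree)))  # number of comments including replies
```

Here the tree has two top-level comments but four comments in total.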

In [None]:
len(all_comments)

In [None]:
all_comments[0].body

In [None]:
def comment_list_to_edge_list(comment_list, include_parent = False):
    """
    A function that converts the comments on a Reddit post into a Gephi CSV edge list.

    The input list is expected to be in the format returned by submission.comments.list().

    The DataFrame returned is in a format so it can be written to a CSV edge list using Pandas' to_csv method.

    The edge list will include an edge from the author of a reply to the author of the comment that is being
    replied to.

    If you pass the optional parameter include_parent, the edge list will also include edges from the authors of
    top-level comments to the author of the original post. In that case include_parent should be set to the
    user name of the redditor who wrote the original post (submission.author).
    """

    # convert from Python list to Pandas DataFrame (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
    df = pd.DataFrame([vars(comment) for comment in comment_list])

    # remove columns we don't care about, and set 'id' to be the index column
    df = df[["id", "author", "parent_id"]].set_index("id")

    # strip the three-character type prefix from parent_id:
    # "t1_" means the parent is a comment, "t3_" means the parent is the submission itself
    df["parent_id"] = df["parent_id"].str[3:]

    # join dataframe with itself to look up parent for each child comment
    df_joined = df.join(df, on = "parent_id", lsuffix = "_child", rsuffix = "_parent")

    edgelist = df_joined[["author_child", "author_parent"]]

    if include_parent is not False:
        edgelist = edgelist.fillna({'author_parent': include_parent})
    else:
        edgelist = edgelist.dropna()

    edgelist = edgelist.rename(columns = {'author_child': 'Source', 'author_parent': 'Target'})

    return edgelist
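The self-join in the function can be thought of as a lookup: for every comment, find the author of the comment whose ID matches its parent ID. The following plain-Python sketch shows the same idea on made-up records. The function name `edges_from_records` and the records are hypothetical, and unlike the function above it uses the prefix of `parent_id` to decide whether the parent is a comment or the original post.

```python
def edges_from_records(records, post_author=None):
    """Build (source, target) pairs from comment records.

    Each record is a dict with 'id', 'author' and 'parent_id', where
    parent_id starts with 't1_' (a comment) or 't3_' (the submission).
    """
    author_by_id = {r["id"]: r["author"] for r in records}
    edges = []
    for r in records:
        prefix, parent = r["parent_id"][:3], r["parent_id"][3:]
        if prefix == "t1_" and parent in author_by_id:
            # reply: edge from the replier to the author being replied to
            edges.append((r["author"], author_by_id[parent]))
        elif prefix == "t3_" and post_author is not None:
            # top-level comment: edge to the author of the original post
            edges.append((r["author"], post_author))
    return edges


# made-up example data
records = [
    {"id": "c1", "author": "alice", "parent_id": "t3_p1"},
    {"id": "c2", "author": "bob", "parent_id": "t1_c1"},
    {"id": "c3", "author": "carol", "parent_id": "t1_c2"},
]
print(edges_from_records(records))
print(edges_from_records(records, post_author="dave"))
```

Without `post_author` only the two reply edges are produced; with it, alice's top-level comment adds a third edge pointing to dave.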

In [None]:
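# here we pass the user name of the redditor who created the original submission
# (this assumes the author's account still exists, i.e. submission.author is not None)
edgelist = comment_list_to_edge_list(all_comments, submission.author.name)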

In [None]:
# write CSV file - the file should appear in the same directory as the Jupyter notebook
edgelist.to_csv('comments_for_gephi_edgelist.csv', index = False)
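Before opening the file in Gephi, it can be worth checking that it has the expected `Source` and `Target` columns and a plausible number of rows. The sketch below does this with Python's built-in `csv` module on a small made-up file. The helper name `summarise_edge_list` and the demo file are assumptions for illustration only.

```python
import csv
import os
import tempfile


def summarise_edge_list(path):
    """Return (column names, number of edges, number of distinct users)."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = reader.fieldnames
    users = {row["Source"] for row in rows} | {row["Target"] for row in rows}
    return columns, len(rows), len(users)


# Demonstration on a made-up file; in the notebook you would pass
# 'comments_for_gephi_edgelist.csv' instead.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo_edgelist.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerows([["Source", "Target"], ["bob", "alice"], ["carol", "bob"]])
    summary = summarise_edge_list(demo_path)

print(summary)
```

The number of distinct users gives a rough idea of how many nodes to expect in Gephi.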