Convert arctic_shift Reddit data dumps into thread-view documents

stas00/reddit-to-threads

Convert Reddit data dumps into thread-views

This repo helps reconstruct thread views similar to the original Reddit ones. It contains only the scripts to do so.

ArthurHeitmann/arctic_shift provides Reddit data dumps per sub via torrents.

arctic_shift data dumps come in pairs:

  1. submissions
  2. comments

e.g., careerguidance_comments.zst and careerguidance_submissions.zst.

The downloaded dumps are in .zst format. The data needs to be extracted and then converted to threads, with the comments sorted correctly.

I found that the easiest quick and dirty way to accomplish this was to feed each pair of dump files into a per-sub sqlite database. A second script then extracts each thread's submission and its corresponding comments, rebuilds the reply graph, and traverses the tree to flatten it into the desired per-thread document.
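As a rough illustration of the first half of this approach, the sketch below loads JSON-lines records into an sqlite database and pulls one thread's comments back out. It is not the code from zst2sqlite.py: the table layout and the function names load_dump and thread_comments are assumptions made for this example. The record fields (id, link_id, parent_id, body, created_utc) are the usual Reddit ones.

```python
import json
import sqlite3


def load_dump(conn, submission_lines, comment_lines):
    """Load JSON-lines records into two tables (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS submissions "
        "(id TEXT PRIMARY KEY, title TEXT, selftext TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments "
        "(id TEXT PRIMARY KEY, link_id TEXT, parent_id TEXT, "
        "body TEXT, created_utc INTEGER)"
    )
    # An index on link_id keeps the per-thread lookup below cheap.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS comments_by_thread ON comments (link_id)"
    )
    for line in submission_lines:
        rec = json.loads(line)
        conn.execute(
            "INSERT OR REPLACE INTO submissions VALUES (?, ?, ?)",
            (rec["id"], rec.get("title", ""), rec.get("selftext", "")),
        )
    for line in comment_lines:
        rec = json.loads(line)
        conn.execute(
            "INSERT OR REPLACE INTO comments VALUES (?, ?, ?, ?, ?)",
            (
                rec["id"],
                rec["link_id"],
                rec["parent_id"],
                rec.get("body", ""),
                int(rec.get("created_utc", 0)),
            ),
        )
    conn.commit()


def thread_comments(conn, submission_id):
    """Return (id, parent_id, body) for one thread, oldest comment first."""
    cur = conn.execute(
        "SELECT id, parent_id, body FROM comments "
        "WHERE link_id = ? ORDER BY created_utc",
        ("t3_" + submission_id,),
    )
    return cur.fetchall()
```

A comment's link_id is the submission id with a t3_ prefix, which is why the lookup prepends it. The parent_id column is what the second stage needs to rebuild the reply tree.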

This is far from the most efficient tooling. It is a quick solution for handling a few Reddit subs. There is no intention of turning this repo into something more than what it is, so the code is provided as is.

Tools

To preprocess .zst to sqlite to .jsonl in 2 stages run:

./zst2sqlite.py *.zst
./sqlite2threads.py *.db

The end result will be .jsonl files per sub with flattened submission+comments in a single text record.

If you want to run each stage separately, read the following notes.

Stage 1. Converting .zst to sqlite database

If going from .zst files directly to sqlite per sub do:

./zst2sqlite.py careerguidance_*.zst

Potential TODO: some subs have tens of millions of records, and building a single sqlite db might take days because it can't be parallelized. The input data could probably be sharded to produce multiple .db files, say one per million records or so.
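One possible shape for such sharding is sketched below. It is only an idea, not something implemented in this repo. Instead of cutting the input every N records, it assigns each record to a shard by hashing its thread id, so that a submission and all of its comments land in the same shard and each thread can still be rebuilt from a single .db file. The function names thread_key, shard_index and split_into_shards are hypothetical.

```python
import json
import zlib


def thread_key(record):
    """Return the submission id a record belongs to.

    Comments carry it as link_id ("t3_<id>"); submissions only have id.
    """
    link_id = record.get("link_id")
    if link_id:
        return link_id.removeprefix("t3_")
    return record["id"]


def shard_index(record, num_shards):
    """Pick a shard with a hash that is stable across runs and processes."""
    return zlib.crc32(thread_key(record).encode("utf-8")) % num_shards


def split_into_shards(lines, num_shards):
    """Group JSON lines into num_shards lists, keeping threads together."""
    shards = [[] for _ in range(num_shards)]
    for line in lines:
        shards[shard_index(json.loads(line), num_shards)].append(line)
    return shards
```

zlib.crc32 is used rather than the built-in hash() because string hashing in Python is randomized per process, which would send the submissions file and the comments file to different shards if they were processed in separate runs. For real dumps the lines would be streamed to per-shard files rather than collected in memory.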

Stage 2. Converting sqlite database into jsonl flattened threads per submission

Create careerguidance.jsonl with flattened comment threads from the sqlite .db file:

./sqlite2threads.py careerguidance.db

You can tweak the traverse function to format the comments differently. For example, if you want email-style nesting of replies and replies-to-replies, adjust the prefix variable to your liking. Some ideas are already in the file.
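To show what such a tweak might look like, here is a minimal sketch of a depth-first traversal with a configurable prefix. It reuses the names traverse and prefix from the description above, but the body is an assumption and may differ from what sqlite2threads.py actually does. The helpers build_children and flatten_thread and the style parameter are hypothetical.

```python
from collections import defaultdict


def build_children(comments):
    """Map each parent id to its replies, keeping the input order."""
    children = defaultdict(list)
    for comment in comments:
        children[comment["parent_id"]].append(comment)
    return children


def traverse(children, parent_id, depth=0, style="indent"):
    """Depth-first walk yielding one formatted entry per comment."""
    for comment in children.get(parent_id, []):
        if style == "email":
            prefix = "> " * depth  # email-style quoting, one ">" per level
        else:
            prefix = "  " * depth  # plain indentation, two spaces per level
        yield prefix + comment["body"]
        # Replies point at their parent comment via "t1_<comment id>".
        yield from traverse(children, "t1_" + comment["id"], depth + 1, style)


def flatten_thread(submission, comments, style="indent"):
    """Join a submission and its comment tree into one text document."""
    ordered = sorted(comments, key=lambda c: c["created_utc"])
    children = build_children(ordered)
    lines = [submission["title"], submission["selftext"]]
    # Top-level comments point at the submission via "t3_<submission id>".
    lines.extend(traverse(children, "t3_" + submission["id"], style=style))
    return "\n".join(lines)
```

Sorting by created_utc before building the tree orders siblings oldest first. Sorting by score instead would be a one-line change. The recursion is the simplest form to read, though very deep reply chains could hit Python's recursion limit, in which case an explicit stack would be safer.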

Related tools

These are some additional optional tools that accomplish the same thing in a roundabout way. They are mostly useful if you want to inspect the contents of .zst files via json dumps.

Converting .zst to .jsonl files w/o thread conversion

This just extracts json data from .zst files. This is not really needed if the desired result is to get the converted threads as this would just add an additional step.

Convert a pair of files to .jsonl dumps:

./zst2jsonl.py careerguidance_*.zst

Convert a folder (here named data) containing many .zst files:

./zst2jsonl.py data

Converting .jsonl to sqlite database

If going from .jsonl files generated by zst2jsonl.py to flattened reddit threads use this tool:

./jsonl2sqlite.py careerguidance_*.jsonl

Each sub needs both its _submissions and its _comments file, but you can pass the file pairs of many subs at once.
