This repo helps reconstruct Reddit threads in a form similar to the original thread view. It contains only the scripts to do so.
ArthurHeitmann/arctic_shift provides Reddit data dumps per sub via torrents.
arctic_shift data dumps come in pairs:
- submissions
- comments
e.g., careerguidance_comments.zst and careerguidance_submissions.zst.
Once the dumps have been downloaded - they are in .zst format - one needs to extract the data and then convert it to threads while sorting the comments correctly.
I found the easiest quick-and-dirty way to accomplish this was to feed each pair of dump files into a per-sub sqlite database and then, in a second script, extract each thread's submission and its corresponding comments, rebuild the reply graph, and traverse the tree to flatten it into the desired per-thread document.
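The sqlite stage described above can be sketched roughly as follows. This is a minimal illustration, not the schema used by zst2sqlite.py: the table layout and the helper names load_pair and thread_comments are assumptions, and the only things taken from Reddit's data format are the id, link_id, parent_id and created_utc fields and the t3_ prefix that link_id uses to point at a submission.

```python
import json
import sqlite3

def load_pair(conn, submissions, comments):
    """Store one sub's submissions and comments (hypothetical schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS submissions (id TEXT PRIMARY KEY, data TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments "
        "(id TEXT PRIMARY KEY, link_id TEXT, parent_id TEXT, created_utc INTEGER, data TEXT)"
    )
    # The index makes the per-thread lookup below cheap.
    conn.execute("CREATE INDEX IF NOT EXISTS comments_by_thread ON comments (link_id)")
    conn.executemany(
        "INSERT OR REPLACE INTO submissions VALUES (?, ?)",
        ((s["id"], json.dumps(s)) for s in submissions),
    )
    conn.executemany(
        "INSERT OR REPLACE INTO comments VALUES (?, ?, ?, ?, ?)",
        (
            (c["id"], c["link_id"], c["parent_id"], int(c["created_utc"]), json.dumps(c))
            for c in comments
        ),
    )
    conn.commit()

def thread_comments(conn, submission_id):
    """Return all comments of one thread, oldest first."""
    rows = conn.execute(
        "SELECT data FROM comments WHERE link_id = ? ORDER BY created_utc, id",
        ("t3_" + submission_id,),
    )
    return [json.loads(row[0]) for row in rows]

# Tiny in-memory demo with made-up records.
conn = sqlite3.connect(":memory:")
load_pair(
    conn,
    submissions=[{"id": "abc", "title": "Career change at 40?", "selftext": "Thinking about it."}],
    comments=[
        {"id": "c2", "link_id": "t3_abc", "parent_id": "t1_c1", "created_utc": 20, "body": "Agreed."},
        {"id": "c1", "link_id": "t3_abc", "parent_id": "t3_abc", "created_utc": 10, "body": "Go for it."},
        {"id": "c9", "link_id": "t3_zzz", "parent_id": "t3_zzz", "created_utc": 5, "body": "Other thread."},
    ],
)
demo_ids = [c["id"] for c in thread_comments(conn, "abc")]
```

Keeping the raw JSON in a data column is just one option; it avoids committing to a fixed set of columns while still allowing an indexed lookup by thread.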
This is far from the most efficient tooling; it is a quick solution for handling a few Reddit subs. There is no intention of turning this repo into something more than what it is, so the code is provided as is.
To preprocess .zst to sqlite to .jsonl in 2 stages run:
./zst2sqlite.py *.zst
./sqlite2threads.py *.db
The end result will be .jsonl files per sub with flattened submission+comments in a single text record.
If you want each stage broken down, read the following notes.
If going from .zst files directly to sqlite per sub do:
./zst2sqlite.py careerguidance_*.zst
Potential TODO: some subs have tens of millions of records, and building a single sqlite db might take days because it can't be parallelized. The input data could probably be sharded to produce multiple .db files, say one per million records or so.
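A rough sketch of what such sharding could look like is shown here. Nothing like this exists in the repo; the function names and the shard file naming scheme are hypothetical. It only splits a stream of records into fixed-size batches, each of which could then be written to its own .db file by a separate worker.

```python
from itertools import islice

def shard_records(records, shard_size=1_000_000):
    """Yield (shard_index, batch) pairs with at most shard_size records each."""
    it = iter(records)
    index = 0
    while True:
        batch = list(islice(it, shard_size))
        if not batch:
            return
        yield index, batch
        index += 1

def shard_db_name(sub, index):
    """Hypothetical naming scheme for the per-shard database files."""
    return f"{sub}.{index:04d}.db"

# Demo with a tiny shard size so the split is visible.
demo = [
    (shard_db_name("careerguidance", i), len(batch))
    for i, batch in shard_records(range(5), shard_size=2)
]
```

Splitting purely by record count can place a thread's comments in a different shard from its submission, so the thread-building stage would then have to read across shards, or the split would have to be keyed on the thread id instead.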
Create careerguidance.jsonl with flattened comment threads from the sqlite .db file:
./sqlite2threads.py careerguidance.db
You can tweak the traverse function to format the comments differently. E.g. if you want the email-style, reply-to-reply sort of nesting, just tweak the prefix variable to your liking - some ideas are already in the file.
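To illustrate the kind of tweak meant here, the following is a standalone sketch of a depth-first traversal with a configurable prefix. It is not the traverse function from sqlite2threads.py, whose signature and output format may differ; build_children, the style argument and the sample records are assumptions made for the example. The parent_id convention (t3_ for the submission, t1_ for a parent comment) is Reddit's own.

```python
from collections import defaultdict

def build_children(comments):
    """Map each parent_id to its direct replies, oldest first."""
    children = defaultdict(list)
    for c in sorted(comments, key=lambda c: c["created_utc"]):
        children[c["parent_id"]].append(c)
    return children

def traverse(children, parent_id, depth=0, style="indent"):
    """Flatten the reply tree below parent_id into a list of text lines."""
    lines = []
    for c in children.get(parent_id, []):
        if style == "email":
            prefix = "> " * depth   # email-style quoting, one marker per level
        else:
            prefix = "  " * depth   # plain indentation
        lines.append(f"{prefix}{c['author']}: {c['body']}")
        # Replies to a comment reference it as t1_<comment id>.
        lines.extend(traverse(children, "t1_" + c["id"], depth + 1, style))
    return lines

# Made-up thread: two top-level comments, one nested reply.
comments = [
    {"id": "c1", "parent_id": "t3_abc", "author": "ann", "created_utc": 10, "body": "Go for it."},
    {"id": "c2", "parent_id": "t1_c1", "author": "bob", "created_utc": 20, "body": "Agreed."},
    {"id": "c3", "parent_id": "t3_abc", "author": "cy", "created_utc": 30, "body": "Think twice."},
]
children = build_children(comments)
indented = traverse(children, "t3_abc")
emailed = traverse(children, "t3_abc", style="email")
```

The sketch prefixes only the first line of each comment; multi-line bodies would need the prefix applied per line.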
Some additional optional tools that accomplish the same in a roundabout way. Mostly useful if you want to introspect the contents of .zst files via json dumps.
zst2jsonl.py just extracts json data from .zst files. This is not really needed if the desired result is the converted threads, as it would just add an additional step.
Convert a pair of files to .jsonl dumps:
./zst2jsonl.py careerguidance_*.zst
Convert a folder data with many .zst files:
./zst2jsonl.py data
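For a quick look inside a dump without any of the scripts, something along these lines may be enough. It is a sketch under assumptions rather than what zst2jsonl.py does: it assumes the dumps are newline-delimited JSON, that the third-party zstandard package is installed, and that a large max_window_size is needed to decode them (the value below is the one commonly used for Reddit dumps, not something stated in this repo). The helper names are hypothetical.

```python
import io
import json

def iter_json_lines(stream, chunk_size=1 << 20):
    """Yield one parsed JSON object per line of a binary stream."""
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last newline is complete; keep the rest.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        yield json.loads(buffer)

def open_zst(path):
    """Open a .zst dump as a decompressed binary stream (needs zstandard)."""
    import zstandard  # third-party; imported lazily so the parser works without it
    fh = open(path, "rb")
    return zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)

# Demo on an in-memory stream, with a small chunk size to exercise
# lines that are split across chunk boundaries.
demo_stream = io.BytesIO(b'{"id": "c1"}\n{"id": "c2"}\n')
demo_records = list(iter_json_lines(demo_stream, chunk_size=8))
```

With a real dump the two helpers would be combined as iter_json_lines(open_zst("careerguidance_comments.zst")), reading the file in chunks instead of loading it into memory.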
If going from .jsonl files generated by zst2jsonl.py to flattened Reddit threads use this tool:
./jsonl2sqlite.py careerguidance_*.jsonl
You need both the _submissions and _comments files of each pair, but you can pass the pairs of many subs at once.