Convert Reddit data dumps into thread-views

This repo helps reconstruct thread views similar to the original Reddit ones from data dumps. It contains only the scripts to do so.

ArthurHeitmann/arctic_shift provides Reddit data dumps per sub via torrents.

arctic_shift data dumps come in pairs:

submissions
comments

e.g., careerguidance_comments.zst and careerguidance_submissions.zst.

Once the dumps have been downloaded (they are in .zst format), the data has to be extracted and then converted into threads, with the comments sorted correctly.

The easiest quick-and-dirty way I found was to feed each pair of dump files into one sqlite database per sub. A second script then extracts each thread's submission and its comments, rebuilds the reply graph, and walks the tree to flatten it into the desired per-thread document.
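The database half of that approach can be sketched as follows. This is a minimal illustration, not the code from zst2sqlite.py: the table layout, the `load_thread` helper and the sample records are all assumptions. The field names (`id`, `link_id`, `parent_id`, `created_utc`) follow the public Reddit JSON layout, where a comment's `link_id` is the submission id with a `t3_` prefix.

```python
import json
import sqlite3

# Hypothetical sample records; field names follow the public Reddit JSON
# layout, not necessarily the schema used by the scripts in this repo.
submissions = [
    {"id": "abc", "title": "Career change at 40?", "selftext": "Is it too late?", "created_utc": 100},
]
comments = [
    {"id": "c1", "link_id": "t3_abc", "parent_id": "t3_abc", "body": "Not at all.", "created_utc": 110},
    {"id": "c2", "link_id": "t3_abc", "parent_id": "t1_c1", "body": "Agreed.", "created_utc": 120},
    {"id": "c3", "link_id": "t3_zzz", "parent_id": "t3_zzz", "body": "Other thread.", "created_utc": 130},
]

db = sqlite3.connect(":memory:")  # a real run would use one .db file per sub
db.execute("CREATE TABLE submissions (id TEXT PRIMARY KEY, data TEXT)")
db.execute("CREATE TABLE comments (id TEXT PRIMARY KEY, link_id TEXT, created_utc INTEGER, data TEXT)")
# The index makes "all comments of one thread" a cheap lookup.
db.execute("CREATE INDEX comments_by_thread ON comments (link_id)")

db.executemany(
    "INSERT INTO submissions VALUES (?, ?)",
    [(s["id"], json.dumps(s)) for s in submissions],
)
db.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?)",
    [(c["id"], c["link_id"], c["created_utc"], json.dumps(c)) for c in comments],
)

def load_thread(db, submission_id):
    """Return (submission, comments) for one thread, comments oldest first."""
    row = db.execute("SELECT data FROM submissions WHERE id = ?", (submission_id,)).fetchone()
    submission = json.loads(row[0])
    rows = db.execute(
        "SELECT data FROM comments WHERE link_id = ? ORDER BY created_utc",
        ("t3_" + submission_id,),
    )
    return submission, [json.loads(r[0]) for r in rows]

submission, thread_comments = load_thread(db, "abc")
print(submission["title"], [c["id"] for c in thread_comments])
```

Storing the full record as a JSON string next to a few indexed columns keeps the schema small while still allowing per-thread queries.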

This is far from the most efficient tooling; it is a quick solution for handling a few Reddit subs. There is no intention of turning this repo into anything more than that, so the code is provided as is.

Tools

To preprocess .zst to sqlite to .jsonl in 2 stages run:

./zst2sqlite.py *.zst
./sqlite2threads.py *.db

The end result is one .jsonl file per sub, where each record holds a flattened submission plus its comments as a single text.

If you want each stage broken down, read the following notes.

Stage 1. Converting .zst to sqlite database

To go from .zst files directly to a sqlite database per sub:

./zst2sqlite.py careerguidance_*.zst

Potential TODO: some subs have tens of millions of records, and building a single sqlite db might take days because it can't be parallelized. The input data could probably be sharded to produce multiple .db files, say one per million records or so.
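One way the sharding idea could look, as a rough sketch only. Nothing like this exists in the repo: `batched`, `write_shards`, the `records` table and the shard file naming are hypothetical, and the demonstration uses a shard size of 2 instead of one million.

```python
import itertools
import json
import os
import sqlite3
import tempfile

def batched(records, size):
    """Yield lists of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch

def write_shards(records, out_dir, prefix, shard_size=1_000_000):
    """Write each batch to its own sqlite file and return the file paths."""
    paths = []
    for n, batch in enumerate(batched(records, shard_size)):
        path = os.path.join(out_dir, f"{prefix}_{n:04d}.db")
        db = sqlite3.connect(path)
        db.execute("CREATE TABLE records (id TEXT PRIMARY KEY, data TEXT)")
        db.executemany(
            "INSERT INTO records VALUES (?, ?)",
            [(r["id"], json.dumps(r)) for r in batch],
        )
        db.commit()
        db.close()
        paths.append(path)
    return paths

# Tiny demonstration: 5 made-up records, 2 per shard, so 3 shard files.
records = ({"id": f"c{i}", "body": f"comment {i}"} for i in range(5))
out_dir = tempfile.mkdtemp()
shard_paths = write_shards(records, out_dir, "careerguidance_comments", shard_size=2)
print([os.path.basename(p) for p in shard_paths])
```

This sketch still writes the shards one after another; the gain would come from handing each batch to a separate worker process. Sharding by record count also spreads one thread's comments over several files, so stage 2 would have to query every shard, or the input would have to be sharded by thread instead.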

Stage 2. Converting sqlite database into jsonl flattened threads per submission

Create flattened comment threads (careerguidance.jsonl) from the sqlite .db file:

./sqlite2threads.py careerguidance.db

You can tweak the traverse function to format the comments differently. E.g., if you want email-style reply and reply-to-reply nesting, tweak the prefix variable to your liking. Some ideas are already in the file.
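To illustrate what such a tweak might look like, here is a self-contained sketch of a depth-first traversal with a configurable prefix. It is not the traverse function from sqlite2threads.py: `build_children`, the `style` parameter and the sample comments are made up for this example. It relies on the Reddit convention that `parent_id` carries a `t3_` prefix for the submission and a `t1_` prefix for a parent comment.

```python
from collections import defaultdict

def build_children(comments):
    """Map each parent id (e.g. 't3_abc' or 't1_c1') to its replies, oldest first."""
    children = defaultdict(list)
    for c in sorted(comments, key=lambda c: c["created_utc"]):
        children[c["parent_id"]].append(c)
    return children

def traverse(children, parent_id, depth=0, style="indent"):
    """Depth-first walk that yields one formatted line per comment."""
    for c in children.get(parent_id, []):
        if style == "email":
            prefix = "> " * depth  # quote markers, like nested email replies
        else:
            prefix = "  " * depth  # plain indentation
        yield f"{prefix}{c['author']}: {c['body']}"
        # Replies to a comment reference it with the 't1_' prefix.
        yield from traverse(children, "t1_" + c["id"], depth + 1, style)

# Made-up comments, deliberately out of order to show the sorting.
comments = [
    {"id": "c2", "parent_id": "t1_c1", "author": "bob", "body": "Agreed.", "created_utc": 120},
    {"id": "c1", "parent_id": "t3_abc", "author": "amy", "body": "Not at all.", "created_utc": 110},
    {"id": "c3", "parent_id": "t3_abc", "author": "cat", "body": "Depends.", "created_utc": 130},
]
lines = list(traverse(build_children(comments), "t3_abc", style="email"))
print("\n".join(lines))
```

With `style="email"` the reply from bob is printed as `> bob: Agreed.` under amy's comment. Very deep threads could hit Python's recursion limit with a recursive walk like this one, in which case an explicit stack would be the safer choice.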

Related tools

Some additional, optional tools that accomplish the same in a roundabout way. They are mostly useful if you want to inspect the contents of .zst files via json dumps.

Converting .zst to .jsonl files w/o thread conversion

This just extracts the json data from .zst files. It is not needed if all you want is the converted threads, as it only adds an extra step.

Convert a pair of files to .jsonl dumps:

./zst2jsonl.py careerguidance_*.zst

Convert a data folder with many .zst files:

./zst2jsonl.py data
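Once a .jsonl dump exists, a few lines of Python are enough to get a feel for its contents. The sketch below is an assumption-laden example rather than part of the repo: `summarize_jsonl` is a hypothetical helper, and an in-memory string stands in for an opened .jsonl file with made-up records.

```python
import collections
import io
import json

def summarize_jsonl(lines, sample=3):
    """Count records and field names, keeping the first few records as a sample."""
    field_counts = collections.Counter()
    samples = []
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        total += 1
        field_counts.update(record.keys())
        if len(samples) < sample:
            samples.append(record)
    return total, field_counts, samples

# Stand-in for an opened .jsonl dump; the records are made up.
fake_file = io.StringIO(
    '{"id": "c1", "author": "amy", "body": "Not at all."}\n'
    '{"id": "c2", "author": "bob", "body": "Agreed.", "score": 3}\n'
)
total, field_counts, samples = summarize_jsonl(fake_file)
print(total, dict(field_counts))
```

Because the function reads line by line, it works the same on a real file object and never holds more than the sample records in memory.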

Converting .jsonl to sqlite database

To go from .jsonl files generated by zst2jsonl.py to flattened Reddit threads, first load them into sqlite with this tool:

./jsonl2sqlite.py careerguidance_*.jsonl

You need both the _submissions and the _comments file for each sub, but you can pass files for many subs at once.
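A quick pre-flight check for that pairing requirement could look like the sketch below. It is not taken from jsonl2sqlite.py: `pair_dumps` is a hypothetical helper that assumes the `<sub>_submissions` / `<sub>_comments` naming shown above, and the file paths are made up.

```python
import collections
import os

def pair_dumps(paths):
    """Group dump files by sub and report subs missing one half of the pair."""
    by_sub = collections.defaultdict(dict)
    for path in paths:
        stem = os.path.splitext(os.path.basename(path))[0]
        for kind in ("submissions", "comments"):
            suffix = "_" + kind
            if stem.endswith(suffix):
                by_sub[stem[: -len(suffix)]][kind] = path
    complete = {sub: files for sub, files in by_sub.items() if len(files) == 2}
    incomplete = sorted(sub for sub, files in by_sub.items() if len(files) < 2)
    return complete, incomplete

paths = [
    "data/careerguidance_submissions.jsonl",
    "data/careerguidance_comments.jsonl",
    "data/jobs_comments.jsonl",  # made-up sub with its submissions file missing
]
complete, incomplete = pair_dumps(paths)
print(sorted(complete), incomplete)
```

Matching on the suffix rather than splitting on the first underscore keeps sub names that contain underscores intact.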

