This repository has been archived by the owner on Oct 14, 2020. It is now read-only.

import scripts (stackoverflow) supporting citus distributed postgresql


supaplextor/so-citus-import-utils


so-citus-import-utils

Be sure you have the appropriate Citus extension installed: https://github.com/citusdata/citus

Most of this runs on the coordinator rather than on the worker nodes. Because the StackOverflow export is so large, even compressed, Citus shards tables across the worker nodes to fan out the workload. An insert query touches a single node, whereas a select query gathers rows from the shards on every node and merges them into one result set. This means the database application only needs to talk to the coordinator node when accessing a distributed database.
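To make the insert-versus-select behaviour concrete, here is a toy model in plain Python. It is only an illustration of hash sharding in general: the `ToyCluster` class, the worker names, and the use of `crc32` are all assumptions made for this sketch and say nothing about how Citus itself hashes or routes rows.

```python
import zlib
from collections import defaultdict


class ToyCluster:
    """Toy stand-in for a coordinator; NOT how Citus is implemented."""

    def __init__(self, worker_names):
        self.worker_names = list(worker_names)
        self.shards = defaultdict(list)  # worker name -> rows stored there

    def worker_for(self, key):
        # Stable hash of the distribution key picks exactly one worker.
        digest = zlib.crc32(str(key).encode("utf-8"))
        return self.worker_names[digest % len(self.worker_names)]

    def insert(self, key, row):
        # An insert lands on a single worker.
        worker = self.worker_for(key)
        self.shards[worker].append((key, row))
        return worker

    def select_all(self):
        # A select visits every worker and merges the partial results.
        rows = []
        for name in self.worker_names:
            rows.extend(self.shards[name])
        return sorted(rows, key=lambda pair: pair[0])


cluster = ToyCluster(["worker-1", "worker-2", "worker-3"])  # hypothetical names
for post_id in range(10):
    cluster.insert(post_id, {"Id": post_id})
merged = cluster.select_all()
```

The caller only ever talks to `cluster`, which mirrors the point above: the application addresses the coordinator and never a worker directly.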

This is a fork of https://github.com/badmonster-nc/stackoverflow_in_pg that uses shell pipelines (stdin/stdout) instead of local files. The so2pg scripts are replaced by -xml-to-psql.py scripts.
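As a rough sketch of the stdin/stdout idea, the following shows one way a streaming converter could turn the export's `<row .../>` elements into tab-separated text suitable for PostgreSQL `COPY ... FROM STDIN`. It is not the code of the scripts in this repository; the function names and the column list are assumptions chosen for illustration.

```python
import io
import xml.etree.ElementTree as ET


def escape_copy(value):
    """Escape one field for PostgreSQL COPY text format."""
    if value is None:
        return r"\N"  # COPY's default marker for NULL
    return (value.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def xml_rows_to_copy(source, sink, columns):
    """Stream <row> elements from source and write one COPY line per row."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            fields = (escape_copy(elem.get(name)) for name in columns)
            sink.write("\t".join(fields) + "\n")
            elem.clear()  # drop the row's attributes once written


# Small in-memory demonstration; the columns here are illustrative only.
sample = b'<posts><row Id="1" Score="5" Body="hello" /><row Id="2" Body="x" /></posts>'
out = io.StringIO()
xml_rows_to_copy(io.BytesIO(sample), out, ["Id", "Score", "Body"])
```

In a pipeline the same function would be called as `xml_rows_to_copy(sys.stdin.buffer, sys.stdout, columns)`, so nothing is written to a local file between decompression and the database.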

StackOverflow exports can be downloaded from https://archive.org/download/stackexchange

Run the steps below after the Citus extension's "make install". I usually clone things into ~/Projects/ or ~/usr/src/. Until a more elegant solution is available, hostnames are hardcoded in the scripts. This lab is just a PoC, so it is up to you to fix hostnames etc. in these examples. The use case is to burn in Citus with a real-world data set.

The StackOverflow archive of 7z files should reside in "../stackoverflow" relative to the "so-citus-import-utils" directory.

ln -s ~/Downloads/stackexchange/ stackoverflow
git clone https://github.com/supaplextor/so-citus-import-utils.git
cd so-citus-import-utils
./import-site.sh math
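For readers who want to see how such a pipeline could be wired without intermediate files, here is a minimal sketch using `subprocess`. It does not reproduce `import-site.sh`. The archive name, member name, converter path, table name, host, and database passed in are all hypothetical placeholders; the host is a parameter here precisely because the real scripts hardcode theirs.

```python
import subprocess
import sys


def build_pipeline(archive, member, converter, table, host, database):
    """Return the argv list for each stage: extract, convert, load."""
    return [
        ["7z", "x", "-so", archive, member],   # -so writes the member to stdout
        [sys.executable, converter],            # converter reads stdin, writes stdout
        ["psql", "-h", host, "-d", database,
         "-c", f"COPY {table} FROM STDIN"],     # psql feeds its stdin to COPY
    ]


def run_pipeline(stages):
    """Chain the stages with pipes and return each stage's exit code."""
    procs = []
    upstream = None
    for index, argv in enumerate(stages):
        is_last = index == len(stages) - 1
        proc = subprocess.Popen(
            argv,
            stdin=upstream,
            stdout=None if is_last else subprocess.PIPE,
        )
        if upstream is not None:
            upstream.close()  # let the earlier stage see a closed pipe if this one exits
        upstream = proc.stdout
        procs.append(proc)
    return [proc.wait() for proc in procs]


# Hypothetical values throughout; adjust to your own cluster and files.
stages = build_pipeline(
    archive="../stackoverflow/example-site.7z",
    member="Posts.xml",
    converter="converter.py",
    table="posts",
    host="coordinator.example",
    database="so",
)
```

Calling `run_pipeline(stages)` would start all three processes at once, so decompression, conversion, and loading overlap rather than running one after another.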

Archive Status Oct 13 2020

The worker nodes have been offline for some time, so I have no way to keep fully testing this. The StackOverflow export allowed me to dig into the database without hammering their website.
