Warehousing dbSNP's JSON Data into PostgreSQL
NCBI hosts a large, open dataset of human SNPs (single-nucleotide polymorphisms), along with a good deal of auxiliary data related to each SNP. The data is hosted on an FTP server here:
and is split across 25 gzipped JSON files (chromosomes 1-22, X, Y, and mitochondrial DNA), for a total compressed size of ~100GB (~2TB uncompressed!).
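At roughly 2TB uncompressed, the files are far too large to decompress to disk or load into memory whole, so they are best read as a stream. The following is a minimal sketch of that idea, not this application's actual code. It assumes each line of a file holds one complete JSON object (check this against the real files), and the function name `iter_records` is hypothetical.

```python
import gzip
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped file.

    Assumption: the file is line-delimited JSON (one object per line).
    """
    # Mode "rt" decompresses lazily and decodes to text, so only the
    # current line is held in memory at any time.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because the function is a generator, a caller can loop over `iter_records(path)` and hand each record to a database writer without ever holding a whole chromosome file in memory.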
More details can be found in this series of blog posts, a three-part walkthrough that breaks the development of this application into three steps: