Warehousing DbSNP's JSON Data into PostgreSQL

Intro

NCBI hosts a large, openly available dataset of human SNPs (single-nucleotide polymorphisms), along with a good deal of auxiliary data related to each SNP. The data is hosted on an FTP server here:

ftp://ftp.ncbi.nlm.nih.gov/snp/.redesign/latest_release/JSON

and is split across 25 gzipped JSON files (chromosomes 1-22, X, Y, and mitochondrial DNA), with a total compressed size of ~100GB (~2TB uncompressed!).
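
Because the uncompressed data is so large, it helps to read each file as a stream rather than unpacking it to disk. The following is a minimal sketch, not this project's actual code. It assumes each gzipped file holds one JSON object per line; the `iter_snps` helper and the `"MT"` label for mitochondrial DNA are hypothetical names used only for illustration.

```python
import gzip
import json
from typing import Iterator

# Chromosomes 1-22, X, Y and mitochondrial DNA: 25 files in total.
# The "MT" label is an assumption, not taken from the FTP listing.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def iter_snps(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a gzipped JSON file.

    Assumes a line-delimited layout (one JSON object per line), so only
    a single record is held in memory at a time.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Looping over `iter_snps(path)` for a downloaded file processes records one at a time, which keeps memory use flat regardless of file size.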

Further Reading

More details can be found in this series of blog posts, a three-part walkthrough that breaks the development of this application down into three steps:

  1. Downloading JSON SNP Data & Initializing the Database
  2. Extracting ClinVar Disease & Frequency Study Data
  3. Efficiently Writing Data to PostgreSQL Database