seanharr11/snip_warehouse

Warehousing DbSNP's JSON Data into PostgreSQL

Intro

NCBI hosts a large, open dataset of human SNPs (single-nucleotide polymorphisms), along with a good deal of auxiliary data related to each SNP. The data is hosted on an FTP server here:

ftp://ftp.ncbi.nlm.nih.gov/snp/.redesign/latest_release/JSON

and is split across 25 gzipped JSON files (Chromosomes 1-22, X, Y and Mitochondrial DNA), amassing a total compressed size of ~100GB (~2TB uncompressed!).
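
At this size, fully decompressing a file before reading it is impractical, so one option is to stream each file and parse it record by record. The following is a minimal sketch, not the code used in this repository. It assumes each file holds one JSON object per line, and the `chromosome_filename` pattern is a hypothetical placeholder; check the FTP listing for the actual file names.

```python
import gzip
import json
from typing import Iterator

# Chromosomes 1-22, X, Y and mitochondrial DNA: 25 files in total.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def chromosome_filename(chrom: str) -> str:
    # Hypothetical naming pattern (an assumption, not taken from the FTP server).
    return f"refsnp-chr{chrom}.json.gz"


def iter_snps(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, decompressing on the fly.

    Assumes the file is gzip-compressed, newline-delimited JSON.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because `iter_snps` is a generator, only one record is held in memory at a time, regardless of how large the uncompressed file is.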

Further Reading

More details can be found in this series of blog posts, a three-part walkthrough that breaks the development of this application into three steps:

  1. Downloading JSON SNP Data & Initializing the Database
  2. Extracting ClinVar Disease & Frequency Study Data
  3. Efficiently Writing Data to PostgreSQL Database