# Abstract

**Author:** [Charles Tapley Hoyt](https://github.com/cthoyt/)

**Estimated Run Time:** 

This notebook outlines the process for programmatically downloading a curated SNP listing and building a namespace with the PyBEL namespace builder.

In [2]:
import requests
import os
import time

import pybel_tools
from pybel_tools.definition_utils import write_namespace

In [3]:
pybel_tools.__version__

'0.1.3-dev'

In [4]:
time.asctime()

'Wed Mar 15 12:02:31 2017'

# Data from dbSNP

In [3]:
dbsnp_url = 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz'

In [2]:
dbsnp_out_path = os.path.expanduser('~/Downloads/dbsnp_snps.belns')

## Download

A standard resource for SNPs is dbSNP. It lists variations for which .

This notebook was written while build 147 was the most recent.

The dated file is at ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All_20161122.vcf.gz, but a symlink to the most recent release is kept at ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz.

A VCF file begins with comment lines starting with `##` (the first 50 or so lines here), followed by a header row and data rows in the following format:

```
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	10019	rs775809821	TA	T	.	.	RS=775809821;RSPOS=10020;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP
1	10055	rs768019142	T	TA	.	.	RS=768019142;RSPOS=10055;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP
```
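As a minimal sketch of how one of these rows breaks down, the snippet below splits a row on tabs and unpacks the `INFO` column into a dictionary. The helper name `parse_vcf_row` and the `VCF_COLUMNS` list are illustrative assumptions, not part of `pybel_tools`; the sample row is the first data row shown above.

```python
# Column names taken from the VCF header row shown above.
VCF_COLUMNS = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO']


def parse_vcf_row(line):
    """Split one tab-separated VCF data row into a dict keyed by column name."""
    record = dict(zip(VCF_COLUMNS, line.rstrip('\n').split('\t')))

    # INFO holds semicolon-separated entries: KEY=VALUE pairs or bare flags.
    info = {}
    for entry in record['INFO'].split(';'):
        key, sep, value = entry.partition('=')
        info[key] = value if sep else True  # bare flags such as R5 become True
    record['INFO'] = info
    return record


row = (
    '1\t10019\trs775809821\tTA\tT\t.\t.\t'
    'RS=775809821;RSPOS=10020;dbSNPBuildID=144;SSR=0;SAO=0;'
    'VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;'
    'WGT=1;VC=DIV;R5;ASP'
)
record = parse_vcf_row(row)
print(record['ID'], record['INFO']['GENEINFO'])
```

Only the `ID` column is needed for the namespace; the `INFO` parsing is included to show where annotations such as `GENEINFO` live.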

### Do it in Bash

In bash, this was accomplished in three steps:

1. `gunzip` the file, or `zcat` it into the next command
2. `grep -v "##" All_20161122.vcf | cut -d $'\t' -f 3 > rsids.txt` (the `#CHROM` header row is kept, so the first line of `rsids.txt` is `ID`)
3. `cat` the result into the `pybel_tools` namespace builder
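The same extraction can be sketched in Python using only the standard library, which avoids decompressing the whole file to disk. The function names `iter_rsids` and `extract_rsids` are hypothetical helpers introduced for illustration.

```python
import gzip


def iter_rsids(lines):
    """Yield the ID column (third field) of each VCF data row."""
    for line in lines:
        # Skip '##' meta lines and the '#CHROM' column header.
        if line.startswith('#'):
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 3:
            yield fields[2]


def extract_rsids(vcf_gz_path, out_path):
    """Stream a gzipped VCF and write one rsID per line, returning the count."""
    count = 0
    with gzip.open(vcf_gz_path, 'rt') as vcf, open(out_path, 'w') as out:
        for rsid in iter_rsids(vcf):
            out.write(rsid + '\n')
            count += 1
    return count
```

Unlike the `grep -v "##"` step, this sketch also skips the `#CHROM` header row, so its output has no header line to discard before passing the values to the namespace builder.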

## Output

In [4]:
%%time
with open(os.path.expanduser('~/Downloads/rsids.txt')) as dbsnp_snps, open(dbsnp_out_path, 'w') as f:
    _ = next(dbsnp_snps)  # Throw out the header line ("ID") left over from the VCF header row

    # The output keyword is assumed to be `file`, matching the Ensembl cell in this notebook
    write_namespace(
        "dbSNP Common SNPs",
        "dbSNP",
        "Gene and Gene Products",
        'Charles Tapley Hoyt',
        dbsnp_url,
        dbsnp_snps,
        namespace_description="SNP List acquired from dbSNP",
        namespace_species='9606',
        author_copyright='WTF License',
        functions="G",
        author_contact="charles.hoyt@scai.fraunhofer.de",
        file=f
    )

# Data from ENSEMBL

In [2]:
ensembl_url = "http://www.ensembl.org/biomart/martview/8335b18c43ef6a646d303cdba04299ce?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_snp.default.snp.refsnp_id|hsapiens_snp.default.snp.refsnp_source|hsapiens_snp.default.snp.chr_name|hsapiens_snp.default.snp.chrom_start|hsapiens_snp.default.snp.chrom_end&FILTERS=&VISIBLEPANEL=resultspanel"

In [5]:
ensembl_out_path = os.path.expanduser('~/Downloads/ensembl_snps.belns')

It's also possible to download the list of SNPs in ENSEMBL. Their data download service is slow and often crashes when trying to dump this database.

In [4]:
%%time
res = requests.get(ensembl_url)
res.status_code

CPU times: user 15.7 ms, sys: 5.63 ms, total: 21.3 ms
Wall time: 210 ms


In [5]:
ensembl_snps = res.content.decode('utf-8').split('\n')
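Splitting the response on newlines keeps every column of each row, plus any header and trailing empty line. The sketch below shows one way the list might be reduced to identifiers before building the namespace. It assumes the response body is tab-separated text with a header row and the rsID in the first column, which this notebook does not show or verify; `first_column_values` is a hypothetical helper.

```python
def first_column_values(lines, has_header=True):
    """Return the unique, non-empty first-column values from tab-separated lines."""
    seen = set()
    values = []
    for i, line in enumerate(lines):
        # Assumption: the first line is a column header, not data.
        if has_header and i == 0:
            continue
        value = line.split('\t')[0].strip()
        if value and value not in seen:
            seen.add(value)
            values.append(value)
    return values


example = ['refsnp_id\trefsnp_source', 'rs1\tdbSNP', 'rs2\tdbSNP', 'rs1\tdbSNP', '']
print(first_column_values(example))
```

It would be worth inspecting the first few entries of `ensembl_snps` before relying on any such cleanup, since the service may return something other than plain tabular text.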

## Output

The `buildns` command in `pybel_tools` takes a list of items on stdin and the appropriate annotations as arguments, and writes a `*.belns` file conforming to the [specification](https://openbel-framework.readthedocs.io/en/latest/tutorials/building_custom_namespaces.html) from the OpenBEL Framework.
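As a rough illustration of what ends up in the file, a `*.belns` file lists its entries under a `[Values]` section, each paired with its function encoding by a delimiter. The helper below is a simplified, hypothetical sketch of that one section only; it is not the `pybel_tools` implementation and omits the metadata sections, and the sorting and de-duplication are choices made for this example.

```python
def format_values_section(values, functions='G', delimiter='|'):
    """Render a simplified [Values] section, one 'value|functions' line per entry."""
    lines = ['[Values]']
    # Sorting and de-duplicating here is an assumption made for readability.
    for value in sorted(set(values)):
        lines.append('{}{}{}'.format(value, delimiter, functions))
    return '\n'.join(lines)


section = format_values_section(['rs775809821', 'rs768019142', 'rs775809821'])
print(section)
```

Here `functions='G'` mirrors the argument passed to `write_namespace` in this notebook, marking each entry as usable in gene terms.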

In [6]:
%%time
with open(ensembl_out_path, 'w') as f:
    write_namespace(
        "Ensembl SNPS", 
        "EnsblSNP", 
        "Gene and Gene Products", 
        'Charles Tapley Hoyt', 
        ensembl_url, 
        ensembl_snps,
        namespace_description="SNP List acquired from Ensembl",
        namespace_species='9606',
        author_copyright='WTF License',
        functions="G",
        author_contact="charles.hoyt@scai.fraunhofer.de",
        file=f
    )

CPU times: user 1.45 ms, sys: 1.34 ms, total: 2.79 ms
Wall time: 2.97 ms


# Conclusions

There are multiple databases containing SNPs. Some are curated and others are not, and they all contain different annotations for different purposes. There are many other sources that also contain these data, such as the Affymetrix and Illumina sequence probe manifests.