# Building a SNP Namespace

**Author:** [Charles Tapley Hoyt](https://github.com/cthoyt/)

This notebook outlines the process to programmatically download a curated SNP listing and build a namespace with the PyBEL namespace builder.

In [1]:
import requests
import os

from pybel_tools.namespace_utils import build_namespace

# dbSNP

## Download

A standard resource for SNPs is dbSNP, the NCBI database of short genetic variations.

This notebook was written while build 147 was the most current.

The dated release is at ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All_20161122.vcf.gz, and a symlink to the most recent release is kept at ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz.
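Since `requests` does not handle `ftp://` URLs, one way to fetch the file from Python is the standard library. The following is a minimal sketch, not part of the original workflow; the helper name `download` is hypothetical.

```python
import shutil
from urllib.request import urlopen


def download(url, path):
    """Stream the resource at url to a local file without loading it into memory.

    Hypothetical helper; urlopen handles ftp://, http:// and file:// URLs.
    """
    with urlopen(url) as response, open(path, 'wb') as out:
        shutil.copyfileobj(response, out)
    return path
```

It could be called as `download(url, os.path.expanduser('~/Downloads/00-All.vcf.gz'))`, where the target path is only an example.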

The VCF file begins with 50 or so meta-information lines starting with `##`, followed by a header line starting with `#CHROM` and then tab-separated rows with the following format:

```
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	10019	rs775809821	TA	T	.	.	RS=775809821;RSPOS=10020;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP
1	10055	rs768019142	T	TA	.	.	RS=768019142;RSPOS=10055;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP
```
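To make the row layout concrete, here is a small sketch that splits one data row into its named columns and unpacks the `INFO` field. The function name `parse_vcf_row` is hypothetical, and the example row is an abbreviated copy of the first row shown above.

```python
def parse_vcf_row(line):
    """Split one tab-separated VCF data row into a dict keyed by column name."""
    columns = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO']
    row = dict(zip(columns, line.rstrip('\n').split('\t')))

    # INFO is a semicolon-separated list of KEY=VALUE pairs and bare flags
    info = {}
    for entry in row['INFO'].split(';'):
        key, sep, value = entry.partition('=')
        info[key] = value if sep else True  # bare flags such as R5 become True
    row['INFO'] = info
    return row


# Abbreviated version of the first example row above
example = (
    '1\t10019\trs775809821\tTA\tT\t.\t.\t'
    'RS=775809821;RSPOS=10020;dbSNPBuildID=144;GENEINFO=DDX11L1:100287102;VC=DIV;R5;ASP'
)
row = parse_vcf_row(example)
print(row['ID'], row['INFO']['GENEINFO'], row['INFO']['R5'])
# prints: rs775809821 DDX11L1:100287102 True
```

Only the `ID` column is needed for the namespace; the `INFO` parsing is shown for orientation.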

### Do it in Bash

In Bash, this was accomplished with the following steps:

1. `gunzip` the file, or `zcat` it into the next command
2. `grep -v "^##" All_20161122.vcf | cut -d $'\t' -f 3 > rsids.txt` (the `#CHROM` header row is kept, so the first line of `rsids.txt` is `ID`)
3. `cat` the result into the `pybel_tools` namespace builder
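The same extraction can be sketched in Python without decompressing the file to disk first. This is an illustrative equivalent of the pipeline above, not what the notebook ran; `extract_ids` is a hypothetical name. Like the pipeline, it drops the `##` lines and keeps the third column of everything else, so the output starts with the `ID` column header that the next cell skips with `next(i)`.

```python
import gzip


def extract_ids(vcf_gz_path, out_path):
    """Write the ID column of a gzipped VCF to out_path, one value per line.

    Returns the number of lines written, including the header's "ID".
    """
    count = 0
    with gzip.open(vcf_gz_path, 'rt') as vcf, open(out_path, 'w') as out:
        for line in vcf:
            if line.startswith('##'):  # meta-information lines
                continue
            # Third tab-separated column; "ID" for the #CHROM header row
            print(line.rstrip('\n').split('\t')[2], file=out)
            count += 1
    return count
```

Reading line by line keeps memory use flat regardless of the size of the VCF.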

In [2]:
out_path = os.path.expanduser('~/Downloads/snps.belns')

In [3]:
url = 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz'

In [4]:
%%time
with open(os.path.expanduser('~/Downloads/rsids.txt')) as i, open(out_path, 'w') as o:
    # Skip the first line, which is the "ID" column header rather than an rsID
    header = next(i)

    build_namespace(
        "dbSNP Common SNPs",
        "dbSNP",
        "Gene and Gene Products",
        'Charles Tapley Hoyt',
        url,
        i,
        namespace_description="SNP List acquired from dbSNP",
        namespace_species='9606',
        author_copyright='WTF License',
        functions="G",
        author_contact="charles.hoyt@scai.fraunhofer.de",
        output=o
    )

# ENSEMBL

It's also possible to download the list of SNPs in ENSEMBL. Their data download service is slow and often crashes when trying to dump this database.

In [2]:
url = "http://www.ensembl.org/biomart/martview/8335b18c43ef6a646d303cdba04299ce?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_snp.default.snp.refsnp_id|hsapiens_snp.default.snp.refsnp_source|hsapiens_snp.default.snp.chr_name|hsapiens_snp.default.snp.chrom_start|hsapiens_snp.default.snp.chrom_end&FILTERS=&VISIBLEPANEL=resultspanel"

In [4]:
%%time
res = requests.get(url)
res.status_code

CPU times: user 15.7 ms, sys: 5.63 ms, total: 21.3 ms
Wall time: 210 ms


In [5]:
# splitlines() handles \r\n endings and does not leave a trailing empty entry
snps = res.content.decode('utf-8').splitlines()
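The query above requests five attributes per SNP, so if the service returns tab-separated rows, each line carries more than the identifier. The sketch below keeps only the first field of each row. It assumes a tab-delimited response with `refsnp_id` in the first column, which this notebook does not confirm; `first_column` and the sample rows are hypothetical.

```python
def first_column(text, delimiter='\t'):
    """Return the first field of every non-empty row in delimited text."""
    return [
        line.split(delimiter)[0].strip()
        for line in text.splitlines()
        if line.strip()
    ]


# Hypothetical two-row response with the five requested attributes
sample = 'rs775809821\tdbSNP\t1\t10019\t10020\nrs768019142\tdbSNP\t1\t10055\t10055\n'
print(first_column(sample))
# prints: ['rs775809821', 'rs768019142']
```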

## Output

The `buildns` command in `pybel_tools` takes a list of items on stdin and the appropriate annotations as arguments to write a `*.belns` file conforming to the [specification](https://openbel-framework.readthedocs.io/en/latest/tutorials/building_custom_namespaces.html) from the OpenBEL Framework.

In [6]:
%%time
with open(out_path, 'w') as f:
    build_namespace(
        "Ensembl SNPS", 
        "EnsblSNP", 
        "Gene and Gene Products", 
        'Charles Tapley Hoyt', 
        url, 
        snps,
        namespace_description="SNP List acquired from Ensembl",
        namespace_species='9606',
        author_copyright='WTF License',
        functions="G",
        author_contact="charles.hoyt@scai.fraunhofer.de",
        output=f
    )

CPU times: user 1.45 ms, sys: 1.34 ms, total: 2.79 ms
Wall time: 2.97 ms
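A quick way to sanity check the written file is to read its `[Values]` section back. The following sketch assumes the usual `*.belns` layout of bracketed section headers and `name|encoding` value lines with `|` as the delimiter; `read_belns_values` is a hypothetical helper, not part of `pybel_tools`, and the sample is a made-up, abbreviated file.

```python
def read_belns_values(lines):
    """Return the names listed in the [Values] section of a BEL namespace file."""
    values = []
    in_values = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('['):
            # Section header; only [Values] holds the namespace entries
            in_values = line == '[Values]'
            continue
        if in_values:
            # Each entry is "name|encoding"; keep the name
            name, sep, _encoding = line.rpartition('|')
            values.append(name if sep else line)
    return values


# Hypothetical, abbreviated namespace file
sample = """[Namespace]
Keyword=dbSNP

[Values]
rs775809821|G
rs768019142|G
""".splitlines()
print(read_belns_values(sample))
# prints: ['rs775809821', 'rs768019142']
```

Applied to the real output, it could be called as `read_belns_values(open(out_path))`, after which the length of the result can be compared with the number of SNPs passed in.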
