# Download PanCan12 data

In this notebook, we download various data files required for the analyses performed in the other notebooks.

> **Attention**
>
> Downloading the TCGA data requires the [firehose_get](https://confluence.broadinstitute.org/display/GDAC/Download) utility, which we assume has been installed and added to the executable `PATH`.
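Before running the download cells, it can help to confirm that the utility is actually reachable. The following is a minimal sketch, assuming Python 3.3 or later for `shutil.which`; the helper name `is_on_path` is hypothetical and not part of `nbsupport`.

```python
import shutil


def is_on_path(command):
    """Return True if `command` resolves to an executable on the PATH."""
    return shutil.which(command) is not None


# Warn early instead of failing halfway through the download cell.
if not is_on_path("firehose_get"):
    print("firehose_get not found on PATH; install it before running the TCGA downloads.")
```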

In [1]:
import sys
sys.path.append("../lib")

In [2]:
import nbsupport.tcga
import nbsupport.util

In [3]:
firehoseDate = "2014_07_15"

In [4]:
DATA_DIR = "../data"

In [5]:
!mkdir -p {DATA_DIR}

In [6]:
%cd {DATA_DIR}

/data/s.canisius/notebooks/discover/data


In [7]:
!mkdir -p downloads
!mkdir -p entrez
!mkdir -p tcga
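The same directory layout can be created from Python so that re-running the notebook does not fail on directories that already exist. This is a sketch only, assuming Python 3.2 or later for the `exist_ok` argument; `ensure_dirs` is a hypothetical helper, not part of `nbsupport`.

```python
import os


def ensure_dirs(base, names):
    """Create base/name for each name, leaving existing directories untouched."""
    paths = [os.path.join(base, name) for name in names]
    for path in paths:
        os.makedirs(path, exist_ok=True)  # no error if the directory exists
    return paths
```

For the layout used here, the call would be `ensure_dirs("../data", ["downloads", "entrez", "tcga"])`.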

In [8]:
%cd downloads

/data/s.canisius/notebooks/discover/data/downloads


## PanCan12 alteration data

In [9]:
!firehose_get -b -only copynumber_gistic MutSigNozzleReport2CV analyses {firehoseDate} {" ".join(nbsupport.tcga.PANCAN12_STUDIES)}

Validating run selection against Broad Institute website ...
You've asked to download archives for the following tasks

     copynumber_gistic MutSigNozzleReport2CV

run against the disease cohorts

      COAD LUSC READ GBM LAML HNSC BLCA UCEC LUAD OV BRCA KIRC

from the analyses__2014_07_15 Firehose run. 
Attempting to retrieve data for Broad GDAC run analyses__2014_07_15 ...
--2016-02-24 11:50:39--  http://gdac.broadinstitute.org/runs/analyses__2014_07_15/data/COAD/
     0K                                                      100%  161M=0s
2016-02-24 11:50:39 ERROR 404: Not Found.
--2016-02-24 11:50:39--  http://gdac.broadinstitute.org/runs/analyses__2014_07_15/data/COAD/?C=N;O=D
     0K                                                      100%  111M=0s
--2016-02-24 11:50:39--  http://gdac.broadinstitute.org/runs/analyses__2014_07_15/data/COAD/?C=M;O=A
     0K                                                      100%  164M=0s
--2016-02-24 11:50:39--  http://gdac.broadinstitute.org/ru

In [10]:
import shutil
import tarfile
import tempfile

from contextlib import closing

In [11]:
for study in nbsupport.tcga.PANCAN12_STUDIES:
    filePrefix = "analyses__{date}/{study}/{date2}/gdac.broadinstitute.org_{study}-{tumourType}".format(
        date=firehoseDate, date2=firehoseDate.replace("_", ""), study=study, tumourType="TB" if study == "LAML" else "TP")
    
    gisticResults = "{prefix}.CopyNumber_Gistic2.Level_4.{date2}00.0.0.tar.gz".format(
        prefix=filePrefix, date2=firehoseDate.replace("_", ""))
    with tarfile.open(gisticResults, "r|gz") as archive:
        for entry in archive:
            if entry.name.endswith("all_thresholded.by_genes.txt"):
                with closing(archive.extractfile(entry)) as stream, tempfile.TemporaryFile() as out:
                    shutil.copyfileobj(stream, out)
                    out.seek(0)
                    cn = nbsupport.tcga.read_copynumber_data(out)
                    cn.to_hdf("../tcga/tcga-pancan12.h5", "/data/{}/cn".format(study), complevel=9, complib="bzip2")
    
    mutsigResults = "{prefix}.MutSigNozzleReport2CV.Level_4.{date2}00.1.0.tar.gz".format(
        prefix=filePrefix, date2=firehoseDate.replace("_", ""))
    with tarfile.open(mutsigResults, "r|gz") as archive:
        for entry in archive:
            if entry.name.endswith(".final_analysis_set.maf"):
                with closing(archive.extractfile(entry)) as stream, tempfile.TemporaryFile() as out:
                    shutil.copyfileobj(stream, out)
                    out.seek(0)
                    mut = nbsupport.tcga.read_mutation_data(out)
                    mut.to_hdf("../tcga/tcga-pancan12.h5", "/data/{}/mut".format(study), complevel=9, complib="bzip2")
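The loop above repeats one pattern twice: scan a gzipped tar archive in streaming mode, find the member with a given name suffix, and copy it to a temporary file so that the reader gets a seekable object. A minimal sketch of that pattern as a reusable function follows, assuming Python 3; `open_member_by_suffix` is a hypothetical name and not part of `nbsupport`.

```python
import shutil
import tarfile
import tempfile


def open_member_by_suffix(archive_path, suffix):
    """Return a seekable temporary file holding the first regular member
    whose name ends with `suffix`, or None if no member matches."""
    # "r|gz" reads the archive as a forward-only stream, so each member
    # must be copied out while the iterator is positioned on it.
    with tarfile.open(archive_path, "r|gz") as archive:
        for entry in archive:
            if entry.isfile() and entry.name.endswith(suffix):
                out = tempfile.TemporaryFile()
                with archive.extractfile(entry) as stream:
                    shutil.copyfileobj(stream, out)
                out.seek(0)  # rewind so the caller can read from the start
                return out
    return None
```

The caller is responsible for closing the returned file, for example with `contextlib.closing` or a `with` statement.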

In [12]:
filePrefix = "analyses__{date}/BRCA/{date2}/gdac.broadinstitute.org_BRCA-TP".format(
        date=firehoseDate, date2=firehoseDate.replace("_", ""))

gisticResults = "{prefix}.CopyNumber_Gistic2.Level_4.{date2}00.0.0.tar.gz".format(
        prefix=filePrefix, date2=firehoseDate.replace("_", ""))

with tarfile.open(gisticResults, "r|gz") as archive:
    for entry in archive:
        if entry.name.endswith("amp_genes.conf_99.txt"):
            with closing(archive.extractfile(entry)) as stream, open("../tcga/amp_genes.conf_99.BRCA.txt", "wb") as out:
                shutil.copyfileobj(stream, out)
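The archive paths used in the two cells above all follow the same naming scheme, which can be captured in one function. The sketch below only restates the string formatting already present in those cells; the function name `firehose_archive_path` and its parameter names are hypothetical.

```python
def firehose_archive_path(date, study, task, revision):
    """Build the relative path of a Level 4 analysis archive as laid out
    by the downloads above.

    date     -- run date with underscores, e.g. "2014_07_15"
    study    -- cohort abbreviation, e.g. "BRCA"
    task     -- e.g. "CopyNumber_Gistic2" or "MutSigNozzleReport2CV"
    revision -- trailing version string, e.g. "0.0" or "1.0"
    """
    compact = date.replace("_", "")
    # LAML archives are labelled TB; all other cohorts here use TP.
    sample_type = "TB" if study == "LAML" else "TP"
    return (
        "analyses__{date}/{study}/{compact}/"
        "gdac.broadinstitute.org_{study}-{sample_type}."
        "{task}.Level_4.{compact}00.{revision}.tar.gz"
    ).format(date=date, study=study, compact=compact,
             sample_type=sample_type, task=task, revision=revision)
```

With this helper, the GISTIC archive for BRCA would be `firehose_archive_path("2014_07_15", "BRCA", "CopyNumber_Gistic2", "0.0")`.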

## High-confidence mutational drivers list

A list of high-confidence mutational drivers has been published in the following paper.

> Tamborero, D. *et al*. Comprehensive identification of mutational cancer driver genes across 12 tumor types. *Sci Rep* **3**, 2650 (2013), [doi:10.1038/srep02650](http://doi.org/10.1038/srep02650).

In [13]:
import urllib
import zipfile

In [14]:
url = "http://www.nature.com/article-assets/npg/srep/2013/131002/srep02650/extref/srep02650-s2.zip"

In [15]:
filename, response = urllib.urlretrieve(url, "srep02650-s2.zip")

In [16]:
nbsupport.util.check_digest(filename, "5a56134a26ff9e83b40e8cd1e7043e5e")
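The implementation of `nbsupport.util.check_digest` is not shown in this notebook. The expected values are 32 hexadecimal characters long, which is consistent with MD5, so a check of this kind could look roughly like the sketch below. The use of MD5 and the names `file_md5` and `md5_matches` are assumptions for illustration only.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as stream:
        # Read in chunks so that large downloads need not fit in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected):
    """Return True if the file's MD5 digest equals `expected` (assumed MD5)."""
    return file_md5(path) == expected.lower()
```

A mismatch would typically indicate a truncated download or a file that has changed upstream since the digest was recorded.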

In [17]:
with zipfile.ZipFile(filename) as archive:
    with archive.open("srep02650-s3.csv") as stream, open("../tcga/mutational-drivers.csv", "wb") as out:
        shutil.copyfileobj(stream, out)

## Cancer gene list

The [Bushman lab](http://www.bushmanlab.org) maintains a list of genes implicated in cancer.

In [18]:
url = "http://www.bushmanlab.org/assets/doc/allonco_20130923.tsv"

In [19]:
filename, response = urllib.urlretrieve(url, "../tcga/cancer-genes.tsv")

In [20]:
nbsupport.util.check_digest(filename, "72832c847db96c7fb019fa97050fb9a4")

## Entrez gene annotation

In [21]:
url = "ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ARCHIVE/ANNOTATION_RELEASE.105/mapview/seq_gene.md.gz"

In [22]:
filename, response = urllib.urlretrieve(url, "../entrez/seq_gene.md.gz")

In [23]:
nbsupport.util.check_digest(filename, "6fa4691b852dd9bb603c2a2f64718317")
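The download cells in this notebook call `urllib.urlretrieve`, which exists only in Python 2; in Python 3 the function lives in `urllib.request`. A minimal sketch of an import that works under either version is shown below. The wrapper name `download` is hypothetical and not part of `nbsupport`.

```python
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2


def download(url, destination):
    """Fetch `url` into the local file `destination` and return its name."""
    filename, _headers = urlretrieve(url, destination)
    return filename
```

The returned filename can then be passed to the digest check in the same way as in the cells above.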

## Manual downloads

The following files are password-protected and therefore cannot be downloaded automatically.

### Recurrent copy number alterations

Download the following two files from Synapse and add them to the `data/tcga` directory.

* [amp_genes.conf_95.pancan12.txt](https://www.synapse.org/#!Synapse:syn2204280)
* [del_genes.conf_95.pancan12.txt](https://www.synapse.org/#!Synapse:syn2204295)

In [24]:
nbsupport.util.check_digest("../tcga/amp_genes.conf_95.pancan12.txt", "0f23349056bc5d451c2bf8358e99dab8")

In [25]:
nbsupport.util.check_digest("../tcga/del_genes.conf_95.pancan12.txt", "80b3e9027ba6ca0e6b56eedde6bb019f")
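Because these files have to be placed by hand, it may be useful to report which ones are still missing before running the digest checks. The sketch below assumes nothing beyond the standard library; `missing_files` is a hypothetical helper name.

```python
import os


def missing_files(directory, filenames):
    """Return the subset of `filenames` that is not present in `directory`."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(directory, name))]
```

For the two Synapse files, the call would be `missing_files("../tcga", ["amp_genes.conf_95.pancan12.txt", "del_genes.conf_95.pancan12.txt"])`; an empty list means both files are in place.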

### MSigDB pathways

Download version 5.0 of the following file from MSigDB and add it to the `data/msigdb` directory, creating the directory if it does not exist yet.

* [c2.cp.v5.0.symbols.gmt](http://software.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/5.1/c2.cp.v5.1.symbols.gmt)

In [26]:
nbsupport.util.check_digest("../msigdb/c2.cp.v5.0.symbols.gmt", "9790879a470eda3c6c431a91f3b652f8")