# How do I find<sup>1</sup> & download TCGA data hosted on AWS?

## Goal
The **goal** of this tutorial is to empower the public to access [TCGA](https://wiki.nci.nih.gov/display/TCGA/TCGA+Home) data via [Amazon Web Services](https://aws.amazon.com/). 

## TCGA Data Types
There are two [types](https://wiki.nci.nih.gov/display/TCGA/Open+Access+and+Controlled+Access+Data) of TCGA data:

 * Open Access Data
 * Controlled Access Data 

Investigators can obtain access to _Controlled_ data through the [NIH](https://www.nih.gov/) via the [dbGaP](http://www.ncbi.nlm.nih.gov/gap) site. Seven Bridges has obtained _Trusted Partner_ status with the NIH and authenticates users through the Federated Identity service.

## Steps to data access
Finding and downloading TCGA data is accomplished in three steps. Data is _found_ by constructing and executing a [SPARQL](https://www.w3.org/TR/rdf-sparql-query/) query (Step 1). Data can be _downloaded_ by an API call (Step 3) if the user has registered and has appropriate access to the data type (Step 2).

 1. Use [OpenSPARQL](https://opensparql.sbgenomics.com/#/) to query an [RDF](https://www.w3.org/RDF/) database of hosted TCGA data
 2. Register on the [Cancer Genomics Cloud](https://cgc.sbgenomics.com/login/) (CGC)
 3. Use the CGC [API](http://docs.cancergenomicscloud.org/docs/the-cgc-api) to download files
 
We will follow these steps to find and then download _Open Access Data_ which will be accessible to **all users**. However, we _encourage users_ to modify the example query to stratify particularly interesting cohorts or obtain _Controlled Access Data_ (if they have the appropriate permission).

## Notes
This tutorial is written in **Python 2.7**. For compatibility with Python 3, replace the `urllib.urlopen` call with `urllib.request.urlopen`.

<sup>1</sup> This is the **OpenSPARQL** flavor of this tutorial. There is also a _Datasets API_ version [here](access_TCGA_on_AWS_via_DatasetsAPI.ipynb)

## _Step 1_: Find data with OpenSPARQL
### Imports
We import [urllib](https://docs.python.org/2/library/urllib.html) and [SPARQLWrapper](https://rdflib.github.io/sparqlwrapper/) for checking the OpenSPARQL endpoint and constructing the SPARQL object. Both json and requests will be needed later for the API calls.

In [1]:
# Needed for Step 1
import urllib
import SPARQLWrapper as spark


# Needed for Step 3
import json
from requests import request

### Check OpenSPARQL endpoint
This tutorial relies on a public endpoint that can be temporarily down. Here we make sure it is operational. Then we initialize the SPARQL object.

In [2]:
# Check SPARQL endpoint
try:
    rc = urllib.urlopen("https://opensparql.sbgenomics.com").getcode()
except Exception:
    rc = 0
if rc != 200:
    print(
        """script relies on sparql endpoint 
        (https://opensparql.sbgenomics.com/) 
        which is currently not responding. 
        Can not continue, exiting.""")
    raise KeyboardInterrupt
else:
    print("Endpoint is operational, we are good to go!")
    
    
# Initialize SPARQL object
sparql_endpoint = "https://opensparql.sbgenomics.com/bigdata/namespace/tcga_metadata_kb/sparql"
sparql = spark.SPARQLWrapper(sparql_endpoint)   

Endpoint is operational, we are good to go!
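
For readers on Python 3, the endpoint check above can be sketched as follows. This is a minimal sketch, not part of the original notebook: the function name `endpoint_is_up` and its `opener` parameter are hypothetical, and the opener is injectable only so the logic can be exercised without network access.

```python
# Python 3 sketch of the endpoint check (the cell above targets Python 2.7).
from urllib.request import urlopen


def endpoint_is_up(url, opener=urlopen, timeout=10):
    """Return True if `url` answers with HTTP 200, False otherwise.

    `opener` (hypothetical parameter) defaults to urllib.request.urlopen.
    """
    try:
        response = opener(url, timeout=timeout)
        return response.getcode() == 200
    except Exception:
        # Any failure (DNS, timeout, HTTP error) is treated as "endpoint down"
        return False
```

Calling `endpoint_is_up("https://opensparql.sbgenomics.com")` before initializing the SPARQL object would mirror the behavior of the cell above.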


### Define a query
We will construct a query to find a set of TCGA files. First, please review the TCGA **ontology** that we have defined for this dataset [here](http://docs.cancergenomicscloud.org/docs/tcga-metadata-on-the-cgc). It includes **140 properties** and takes some time to navigate.  

This tutorial searches for **female** **Breast Cancer** patients (_Cases_) who are **alive**, and for the associated _Files_ that are **open-access**, provide **Gene expression** data, and come from the **experimental strategy** of RNA-Seq. One way to think about this is nesting, e.g.
 
 * Case
   * hasGender
   * hasDiseaseType 
   * hasVitalStatus
 * File
   * hasAccessLevel
   * hasDataType 
   * hasExperimentalStrategy

We will assign an _exact value_ to the above properties. However, the query also needs a few _non-specific_ parameters such as **File : hasStoragePath**. We include this in the query without an exact value so that this information is _returned_ by the query. We will need it for downloading the file in a later step. Alternatively, you can return properties and operate on them directly in Python. We provide an example here with **Case : hasDaysToLastFollowUp**.

Below we set the query and execute it. The query results are stored in an object named **results**.

In [3]:
# Create the query above as a block-string
query = """
    prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    prefix tcga: <https://www.sbgenomics.com/ontologies/2014/11/tcga#>

    select distinct ?case_id ?file_name ?file ?path ?vital_status ?days_to_follow
    where
    {
      ?case a tcga:Case .
      ?case rdfs:label ?case_id .
      
      ?case tcga:hasDiseaseType ?dt .
      ?dt rdfs:label 'Breast Invasive Carcinoma' .

      ?case tcga:hasGender ?gender. 
      ?gender rdfs:label 'FEMALE' .
  
      ?case tcga:hasVitalStatus ?vs .
      ?vs rdfs:label 'Alive' .
      
      ?case tcga:hasDaysToLastFollowUp ?days_to_follow .

      ?case tcga:hasFile ?file .
    
      ?file rdfs:label ?file_name .
      ?file tcga:hasStoragePath ?path .
      
      ?file tcga:hasAccessLevel ?ac .
      ?ac rdfs:label 'Open' .
      
      ?file tcga:hasExperimentalStrategy ?es .
      ?es rdfs:label 'RNA-Seq'.
      
      ?file tcga:hasDataType ?dat.
      ?dat rdfs:label 'Gene expression'
    }
"""

sparql.setQuery(query)              # Define query on the wrapper
sparql.setReturnFormat(spark.JSON)  # We want server to return JSON to use
results = sparql.query().convert()  # Convert results to Python object
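
To make it easier to modify the example query for other cohorts, the exact-value filters can be generated from plain dictionaries. The sketch below is only an illustration: `label_filters` and `build_query` are hypothetical helpers (not part of SPARQLWrapper or the CGC), and they assume every filtered property is matched through its `rdfs:label`, as in the query above, and that labels contain no single quotes.

```python
# Hypothetical helpers for assembling a query like the one above.
PREFIXES = """
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix tcga: <https://www.sbgenomics.com/ontologies/2014/11/tcga#>
"""


def label_filters(subject, filters):
    """Turn {'hasGender': 'FEMALE'} into triple patterns that match on rdfs:label."""
    lines = []
    for prop, label in sorted(filters.items()):
        var = "?%s_%s" % (subject, prop)          # e.g. ?case_hasGender
        lines.append("?%s tcga:%s %s ." % (subject, prop, var))
        lines.append("%s rdfs:label '%s' ." % (var, label))
    return lines


def build_query(case_filters, file_filters):
    """Return a SPARQL query string selecting cases and their files."""
    body = [
        "?case a tcga:Case .",
        "?case rdfs:label ?case_id .",
        "?case tcga:hasFile ?file .",
        "?file rdfs:label ?file_name .",
        "?file tcga:hasStoragePath ?path .",
    ]
    body += label_filters("case", case_filters)
    body += label_filters("file", file_filters)
    select = "select distinct ?case_id ?file_name ?file ?path"
    return "%s\n%s\nwhere\n{\n  %s\n}" % (PREFIXES.strip(), select, "\n  ".join(body))


# Same cohort as the tutorial query, expressed as dictionaries
example_query = build_query(
    {"hasDiseaseType": "Breast Invasive Carcinoma",
     "hasGender": "FEMALE",
     "hasVitalStatus": "Alive"},
    {"hasAccessLevel": "Open",
     "hasExperimentalStrategy": "RNA-Seq",
     "hasDataType": "Gene expression"},
)
```

The resulting string could be passed to `sparql.setQuery(example_query)` in place of the hand-written block. Properties returned without an exact value, such as `hasDaysToLastFollowUp`, would still need to be added by hand.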

### Extracting useful results
Here we extract useful information from the _results_ object. First we extract two _examples_ of properties which may be actionable in IPython:

 1. UUID
 2. Days to last followup
 
Next we pull out two properties which will be necessary for downloading the data:

 1. Path
 2. (optional) File name
 
We print out some summary stats about the query and list the first 10 results.

#### Note
There will very likely be repeated _UUIDs_ and _days to followup_ in the print-out. This is **expected behavior** because we have not restricted the query to be exclusive, e.g. by specifying **Sample : hasSampleType = 'Primary Tumor'**. Instead we are seeing multiple files per case, likely due to multiple samples. 

In [4]:
# Every row returned by the query lives under results['results']['bindings']
bindings = results['results']['bindings']

# Information (potentially actionable) about the query results
uuid_list = [result['case_id']['value'] for result in bindings]
day_to_follow_list = [result['days_to_follow']['value'] for result in bindings]

# Information for downloading files within the query
file_paths = [result['path']['value'] for result in bindings]
file_names = [result['file_name']['value'] for result in bindings]
file_ids = [result['file']['value'].split('/')[-1] for result in bindings]

# Print some information about the query results
print("Query returned %i results, printing the first 10:" % (len(uuid_list)))
for ii in range(0,min(10, len(uuid_list))):
    print("Case UUID %s had %s days to last followup \n" \
         % (uuid_list[ii], day_to_follow_list[ii]))

Query returned 4349 results, printing the first 10:
Case UUID 084C6AEC-94F7-4090-8F3F-59FA9E89721A had 786 days to last followup 

Case UUID 084C6AEC-94F7-4090-8F3F-59FA9E89721A had 786 days to last followup 

Case UUID 084C6AEC-94F7-4090-8F3F-59FA9E89721A had 786 days to last followup 

Case UUID 084C6AEC-94F7-4090-8F3F-59FA9E89721A had 786 days to last followup 

Case UUID 084C6AEC-94F7-4090-8F3F-59FA9E89721A had 786 days to last followup 

Case UUID 08306EDE-E74B-4BE9-B768-7FD98647B6AC had 344 days to last followup 

Case UUID 08306EDE-E74B-4BE9-B768-7FD98647B6AC had 344 days to last followup 

Case UUID 08306EDE-E74B-4BE9-B768-7FD98647B6AC had 344 days to last followup 

Case UUID 08306EDE-E74B-4BE9-B768-7FD98647B6AC had 344 days to last followup 

Case UUID 08306EDE-E74B-4BE9-B768-7FD98647B6AC had 344 days to last followup 
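
Because each case appears once per file, it can help to group the rows by case and to operate on the returned follow-up values directly in Python. The sketch below uses a small hand-made stand-in with the same shape as the _results_ object; the IDs, file names, and values in `sample_results` are illustrative, not real query output. The helper names are hypothetical, and the code assumes the follow-up values are integer strings like those printed above.

```python
from collections import defaultdict

# Stand-in mimicking the SPARQL JSON structure (illustrative values only)
sample_results = {"results": {"bindings": [
    {"case_id": {"value": "CASE-A"}, "file_name": {"value": "a1.txt"},
     "days_to_follow": {"value": "786"}},
    {"case_id": {"value": "CASE-A"}, "file_name": {"value": "a2.txt"},
     "days_to_follow": {"value": "786"}},
    {"case_id": {"value": "CASE-B"}, "file_name": {"value": "b1.txt"},
     "days_to_follow": {"value": "344"}},
]}}


def files_by_case(results):
    """Map each case UUID to the list of file names returned for it."""
    grouped = defaultdict(list)
    for row in results["results"]["bindings"]:
        grouped[row["case_id"]["value"]].append(row["file_name"]["value"])
    return dict(grouped)


def cases_followed_at_least(results, min_days):
    """Return the sorted case UUIDs whose days to last follow-up is >= min_days."""
    keep = set()
    for row in results["results"]["bindings"]:
        # SPARQL JSON delivers values as strings, so convert before comparing
        if int(row["days_to_follow"]["value"]) >= min_days:
            keep.add(row["case_id"]["value"])
    return sorted(keep)


grouped = files_by_case(sample_results)
long_follow_up = cases_followed_at_least(sample_results, 365)
```

Passing the real _results_ object instead of `sample_results` should give one entry per case rather than one per file.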



## _Step 2_: Register for a Cancer Genomics Cloud Account
### Create a new account
**NOTE**: Users with an existing CGC account can skip this step.

There is an excellent guide to registering for the CGC [here](http://docs.cancergenomicscloud.org/docs/sign-up-for-the-cgc). The whole process should take less than 5 minutes and is free.  

### Get your Developer Token
Once you have registered and verified your account, log in to the CGC. In the top right corner, click on your user name (here "jack"). A small window will drop down, including your account balance, data access approval, Account Settings, and Payments.  

<img src="images/TCGA_AWS_0.png" height="218" width="780"> 

Click on Account Settings to open a new window with your information.

<img src="images/TCGA_AWS_1.png" height="496" width="537"> 

In the left panel, click on Developer to get to the Developer Dashboard.

<img src="images/TCGA_AWS_2.png" height="338" width="780"> 

If this is your first time logging in, you won't have a token yet. Click the green Generate Token button to get one. 

<img src="images/TCGA_AWS_3.png" height="274" width="579"> 

Copy the Authentication Token. We will use it later for data access. 

**NOTE**: your token is **confidential** information which grants the same access as your username and password. Please keep it safe. In case it is compromised (for example in a publicly shared IPython notebook), make sure to click the _Regenerate Token_ button to disable the old one and generate a new one.

## _Step 3_: Download data with the Cancer Genomics Cloud API
We will use the Python wrapper (sevenbridges-python) to download the data. If you haven't already installed it, please do so as described below.

### Install _sevenbridges-python_ library
You need to install the _sevenbridges-python_ library. Library details are available [here](http://sevenbridges-python.readthedocs.io/en/latest/sevenbridges/). The easiest way to install sevenbridges-python is using pip:
```bash
pip install sevenbridges-python
```

### Save your Authentication Token to a local configuration file
There are multiple ways to pass your authentication token to the sevenbridges-python library. Here we focus on using a configuration file named `.sbgrc` in your HOME directory. The file uses the standard ini format, with one section per environment, as shown below:

```ini
[cgc]
auth-token = 910975f5b24a470bb0b028fe813b8100
api-url = https://cgc-api.sbgenomics.com/v2

[sbpla]
auth-token = 700992f7b24a470bb0b028fe813b8100
api-url = https://api.sbgenomics.com/v2
```
To **create** this file<sup>1</sup>, use the following steps in your _Terminal_:

 1. Create and open the file:

```bash
cd ~
touch .sbgrc
vi .sbgrc
```

 2. Press "i" to go into **insert mode**
 3. Write the text above for each environment. It should look like:

<img src="images/example_sbgrc_file.png" height="505" width="409"> 

 4. Press "ESC", then type ":wq" and press Enter to save the file and exit vi
  
<sup>1</sup> If the file already exists, omit the _touch_ command
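
If you would rather not edit the file in vi, the same ini file can be written from Python. This is a minimal sketch under stated assumptions: `write_sbgrc` is a hypothetical helper (not part of sevenbridges-python), and it simply reproduces the `auth-token` / `api-url` layout shown above using the standard-library config parser.

```python
import os

try:
    import configparser                      # Python 3
except ImportError:
    import ConfigParser as configparser      # Python 2.7


def write_sbgrc(path, profile, auth_token, api_url):
    """Add or update one profile section in an ini-style config file."""
    config = configparser.RawConfigParser()
    config.read(path)                        # a missing file is silently skipped
    if not config.has_section(profile):
        config.add_section(profile)
    config.set(profile, "auth-token", auth_token)
    config.set(profile, "api-url", api_url)
    with open(path, "w") as handle:
        config.write(handle)
    os.chmod(path, 0o600)                    # restrict access, the token is a secret
```

For example, `write_sbgrc(os.path.expanduser("~/.sbgrc"), "cgc", "<your token>", "https://cgc-api.sbgenomics.com/v2")` would create or update the `[cgc]` section while leaving other sections in place.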

### Imports
We import the official sevenbridges-python bindings below (aliased as _sbg_), which provide the _Api_ class.

In [5]:
import sevenbridges as sbg

### Initialize the object
The _Api_ object needs to know your **auth\_token** and the correct path. Here we assume you are using the .sbgrc file in your home directory. For other options see <a href="https://github.com/sbg/okAPI/blob/master/Recipes/CGC/Setup_API_environment.ipynb">Setup_API_environment.ipynb</a>

In [6]:
# [USER INPUT] specify the profile (section name) from .sbgrc, e.g. {cgc, sbpla}
prof = 'cgc'


config_file = sbg.Config(profile=prof)
api = sbg.Api(config=config_file)

### List of links to download files
Here we loop through the _first ten_ files using the **file\_ids** list from above. We do three things with these ids:

 1. Create a list of files on the platform. From this point, it would be possible to **take action** on the CGC, e.g. combining these files with an _App_ and then starting a _Task_. 
 2. (optional) Generate a list of download links
 3. Download each of the ten files in this list. They will be saved to a _downloads_ folder in your local directory.
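
When re-running the notebook, it may be convenient to skip files that are already on disk. The sketch below wraps the download step of the list above in a hypothetical helper, `download_missing`. It assumes only that each file object exposes `name` and `download(path=...)`, as the file objects in this tutorial do.

```python
import os


def download_missing(files, dl_dir="downloads"):
    """Download each file into dl_dir unless a file of that name is already there.

    Returns the list of local paths that were actually downloaded.
    """
    if not os.path.isdir(dl_dir):
        os.makedirs(dl_dir)
    fetched = []
    for f in files:
        target = os.path.join(dl_dir, f.name)
        if os.path.exists(target):
            continue                          # already on disk, skip it
        f.download(path=target)
        fetched.append(target)
    return fetched
```

Note that this check looks only at the file name, so a partially downloaded file would also be skipped; delete such files before re-running.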

In [8]:
# 1) Generate a list of file objects from the file_ids list
file_list = []
for f_id in file_ids[0:10]:
    file_list.append(api.files.get(id = f_id))
    print(file_list[-1].name)    
    
# (BRANCH-POINT) Do something awesome with these files on the CGC


# 2) (optional) Generate a list of download links
dl_list = []
for f in file_list:
    dl_list.append(f.download_info())

    
# 3) Download each of the files in the list to a _downloads_ folder in your local directory.
import os

dl_dir = 'downloads'
# Create the downloads folder only if it does not exist yet
if not os.path.isdir(dl_dir):
    os.mkdir(dl_dir)

for f in file_list:
    f.download(path=os.path.join(dl_dir, f.name))

UNCID_654572.TCGA-A2-A0D1-01A-11R-A034-07.110309_UNC12-SN629_0068_BB05LPABXX.3.trimmed.annotated.translated_to_genomic.spljxn.quantification.txt
UNCID_654594.TCGA-A2-A0D1-01A-11R-A034-07.110309_UNC12-SN629_0068_BB05LPABXX.3.trimmed.annotated.gene.quantification.txt
unc.edu.01370d42-f75c-4532-9b9c-24ff7302b033.1770587.bt.exon_quantification.txt
unc.edu.01370d42-f75c-4532-9b9c-24ff7302b033.1151943.junction_quantification.txt
UNCID_654642.TCGA-A2-A0D1-01A-11R-A034-07.110309_UNC12-SN629_0068_BB05LPABXX.3.trimmed.annotated.translated_to_genomic.exon.quantification.txt
unc.edu.e3639f71-cfe7-46c7-8390-73f8043ac6b0.1147657.junction_quantification.txt
UNCID_477289.TCGA-A2-A0T6-01A-11R-A084-07.110406_UNC13-SN749_0049_AC008AABXX.7.trimmed.annotated.translated_to_genomic.exon.quantification.txt
UNCID_477240.TCGA-A2-A0T6-01A-11R-A084-07.110406_UNC13-SN749_0049_AC008AABXX.7.trimmed.annotated.translated_to_genomic.spljxn.quantification.txt
unc.edu.e3639f71-cfe7-46c7-8390-73f8043ac6b0.1769820.bt.exon_