# How do I find<sup>2</sup> & download TCGA data hosted on AWS?

## Goal
The **goal** of this tutorial is to empower the public to access [TCGA](https://wiki.nci.nih.gov/display/TCGA/TCGA+Home) data via [Amazon Web Services](https://aws.amazon.com/). 

## TCGA Data Types
There are two [types](https://wiki.nci.nih.gov/display/TCGA/Open+Access+and+Controlled+Access+Data) of TCGA data:

 * Open Access Data
 * Controlled Access Data 

Investigators can obtain access to _Controlled_ data through the [NIH](https://www.nih.gov/) via the [dbGaP](http://www.ncbi.nlm.nih.gov/gap) site. Seven Bridges has obtained _Trusted Partner_ status with the NIH and authenticates users through the Federated Identity service.

## Steps to data access
Finding and downloading TCGA data are accomplished with three different steps. Data is _found_ by constructing and executing a [SPARQL](https://www.w3.org/TR/rdf-sparql-query/) query within the _Datasets API_ (Step 2). Data can be _downloaded_ by an API call (Step 3) if the user has appropriate access to the data type (Step 1). 

 1. Register on the [Cancer Genomics Cloud](https://cgc.sbgenomics.com/login/) (CGC).
 2. Use the [Datasets API](http://docs.cancergenomicscloud.org/docs/datasets-api-overview) to query an [RDF](https://www.w3.org/RDF/) database of hosted TCGA data.
 3. Use the CGC [API](http://docs.cancergenomicscloud.org/docs/the-cgc-api) to download files.
 
We will follow these steps to find and then download _Open Access Data_ which will be accessible to **all users**. However, we _encourage users_ to modify the example query to stratify particularly interesting cohorts or obtain _Controlled Access Data_ (if they have the appropriate permission).

## Notes
 * This tutorial is written in **Python 2.7**. For compatibility with Python 3, please swap out the _urllib_ library.
 * The Datasets API is an **Advanced Access** feature: it is fully operational, but it is subject to change.

<sup>2</sup> This is the **Datasets API** flavor of this tutorial. There is also an _OpenSPARQL_ version [here](access_TCGA_on_AWS.ipynb).

## _Step 1_: Register for a Cancer Genomics Cloud Account
### Create a new account
**NOTE**: Users with an existing CGC account can skip this step.

There is an excellent guide to registering for the CGC [here](http://docs.cancergenomicscloud.org/docs/sign-up-for-the-cgc). The whole process should take less than 5 minutes and is free.

### Get your Developer Token
Once you have registered and verified your account, log in to the CGC. In the top right corner, click on your user name (here "jack"). A small window will drop down, including your account balance, data access approval, Account Settings, and Payments.  

<img src="images/TCGA_AWS_0.png" height="218" width="780"> 

Click on Account Settings to open a new window with your information.

<img src="images/TCGA_AWS_1.png" height="496" width="537"> 

In the left panel, click on Developer to get to the Developer Dashboard.

<img src="images/TCGA_AWS_2.png" height="338" width="780"> 

If this is your first time logging in, you won't have a token yet. Click the green Generate Token button to get one.

<img src="images/TCGA_AWS_3.png" height="274" width="579"> 

Copy the Authentication Token; we will use it later for data access.

**NOTE**: your token is **confidential** information which grants the same access as your username and password. Please keep it safe. If it is compromised (for example, in a publicly shared IPython notebook), click the _Regenerate Token_ button to disable the old token and generate a new one.

## _Step 2_: Find data with Datasets API
### Imports
We import _json_ and _requests_ to write a wrapper around the API call.

In [1]:
import json
from requests import request

### Define an API call wrapper
We define a simple function here to send and receive JSON from the API using correctly formatted HTTP calls. The necessary imports have already been handled above.

Note that the base URL is an optional argument because the Datasets API and the Cancer Genomics Cloud API use different endpoints.

In [2]:
def api_call(path, method='GET', query=None, data=None, token=None,
             base_url='https://cgc-datasets-api.sbgenomics.com/datasets/tcga/v0/'):
    """Send an HTTP request to the API and return the decoded JSON response."""
    # Serialize dict or list payloads to JSON; send no body otherwise
    data = json.dumps(data) if isinstance(data, (dict, list)) else None

    headers = {
        'X-SBG-Auth-Token': token,
        'Accept': 'application/json',
        'Content-type': 'application/json',
    }

    response = request(method, base_url + path, params=query,
                       data=data, headers=headers)
    response_dict = json.loads(response.content) if response.content else {}

    # Integer division, so every 2xx status counts as success in Python 2 and 3
    if response.status_code // 100 != 2:
        print(response_dict.get('message'))
        print('Error Code: %s.' % response_dict.get('code'))
        print(response_dict.get('more_info'))
        raise Exception('Server responded with status code %s.'
                        % response.status_code)
    return response_dict

### Provide the Authentication Token
Pasting the token directly into the notebook, as in the cell below, is the worst approach (we are just seeing if you are paying attention). Examples of properly handling your auth\_token are available for the [sevenbridges-python bindings](https://github.com/sbg/okAPI/blob/master/Recipes/CGC/Setup_API_environment.ipynb).

In [3]:
auth_token = '15baf12492f6432086b6b44b3d6389d0'
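
As one possible alternative to hard-coding, the token could be read from an environment variable. This is only a minimal sketch: the variable name `CGC_AUTH_TOKEN` and the helper `load_token` are hypothetical names chosen for illustration, not something defined by the CGC or by this tutorial.

```python
import os


def load_token(env=None, var='CGC_AUTH_TOKEN'):
    """Return the auth token stored in an environment variable.

    CGC_AUTH_TOKEN is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    token = env.get(var)
    if not token:
        raise RuntimeError('Set the %s environment variable first.' % var)
    return token
```

After exporting the variable in the shell that launches the notebook, `auth_token = load_token()` would replace the hard-coded assignment, and the token would never be saved in the notebook itself.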

### Define a query
We will construct a query to find a set of TCGA files. First, please review the TCGA **ontology** that we have defined for this dataset [here](http://docs.cancergenomicscloud.org/docs/tcga-metadata-on-the-cgc). It includes **140 properties** and takes some time to navigate.

This tutorial searches for **female** **Breast Cancer** patients (_Cases_) who are **alive**, and for the associated _Files_ which are **open-access**, provide **Gene expression** data, and come from the **experimental strategy** of RNA-seq. One way to think about this is nesting, e.g.
 
 * File
   * hasAccessLevel
   * hasDataType 
   * hasExperimentalStrategy
 * Case
   * hasGender
   * hasDiseaseType 
   * hasVitalStatus

We will assign an _exact value_ to the above properties. However, the query also needs a few _non-specific_ parameters such as **File : hasStoragePath**. We include this in the query without an exact value so that this information is _returned_ by the query. We will need it for downloading the file in the next step. Alternatively, you can return properties and operate on them directly in Python; we provide an example here with **Case : hasDaysToLastFollowup**.

Below we define the query. It is executed in the following cells, and the results are stored in a list named **files_in_query**.

#### PROTIP:
 * Extensive details of the Datasets API calls are available [here](http://docs.cancergenomicscloud.org/docs/query-tcga-via-the-datasets-api)

In [4]:
query_body = {
    "entity": "files",
    "hasAccessLevel" : "Open",
    "hasDataType" : "Gene expression",
    "hasExperimentalStrategy": "RNA-Seq",
    "hasCase": {
        "hasDiseaseType" : "Breast Invasive Carcinoma",
        "hasGender" : "FEMALE",
        "hasVitalStatus" : "Alive"
    }
}
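
To make it easier to stratify other cohorts, the query body could be produced by a small helper. The sketch below is illustrative only: `build_query` is a hypothetical function, and any values passed to it must match the terms used in the TCGA ontology linked above.

```python
def build_query(disease_type, gender, vital_status,
                access_level='Open', data_type='Gene expression',
                strategy='RNA-Seq'):
    """Build a Datasets API query body with the same shape as query_body."""
    return {
        "entity": "files",
        "hasAccessLevel": access_level,
        "hasDataType": data_type,
        "hasExperimentalStrategy": strategy,
        "hasCase": {
            "hasDiseaseType": disease_type,
            "hasGender": gender,
            "hasVitalStatus": vital_status,
        },
    }


# Reproduces the query used in this tutorial
breast_query = build_query("Breast Invasive Carcinoma", "FEMALE", "Alive")
```

Calling the helper with different arguments yields a new query body that can be passed to the same API calls used in the rest of this step.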

### Check the total number of records in the query
The first API call returns the total number of records that match the query.

In [5]:
total = api_call(method='POST', path='query/total',
                 token=auth_token, data=query_body)

### Create a list of all records
We _page_ through the records 100 at a time to build a list of all files in the query.

#### PROTIP
Working with pagination in the API is beautifully described <a href="http://docs.cancergenomicscloud.org/docs/the-cgc-api#section-response-pagination">here</a>
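
The arithmetic behind paging can be sketched on its own. The helper below is hypothetical (`page_offsets` is not part of any API); it only assumes, as this tutorial does, that records are fetched 100 at a time by offset.

```python
def page_offsets(total_records, page_size=100):
    """Return the offset of every page needed to cover total_records."""
    return list(range(0, total_records, page_size))


print(page_offsets(250))  # [0, 100, 200]
```

The number of offsets equals the number of requests needed, so a total of 250 records takes three calls, the last of which returns a partial page.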

In [6]:
from __future__ import division  # must come before any other statement
from math import ceil

files_in_query = []

# Number of 100-record pages needed to cover all records
loops = int(ceil(total['total'] / 100))

for ii in range(loops):
    files_in_query.append(api_call(method='POST',
                                   path='query?offset=%i' % (100 * ii),
                                   token=auth_token, data=query_body))
    print("%3.1f percent of files added" % (100 * (ii + 1) / loops))

# NOTE: each item in files_in_query is a page of up to 100 files from the query. Example below:
print('\n \n')
print(files_in_query[0]['_embedded']['files'][0])
print(files_in_query[1]['_embedded']['files'][0])

2.3 percent of files added
4.5 percent of files added
6.8 percent of files added
9.1 percent of files added
11.4 percent of files added
13.6 percent of files added
15.9 percent of files added
18.2 percent of files added
20.5 percent of files added
22.7 percent of files added
25.0 percent of files added
27.3 percent of files added
29.5 percent of files added
31.8 percent of files added
34.1 percent of files added
36.4 percent of files added
38.6 percent of files added
40.9 percent of files added
43.2 percent of files added
45.5 percent of files added
47.7 percent of files added
50.0 percent of files added
52.3 percent of files added
54.5 percent of files added
56.8 percent of files added
59.1 percent of files added
61.4 percent of files added
63.6 percent of files added
65.9 percent of files added
68.2 percent of files added
70.5 percent of files added
72.7 percent of files added
75.0 percent of files added
77.3 percent of files added
79.5 percent of files added
81.8 percent of files added
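
Because each item in **files_in_query** is a page rather than a single file, it can be convenient to flatten the pages into one list. This is a minimal sketch that assumes every page stores its records under `['_embedded']['files']`, as the first page does; `flatten_pages` is a hypothetical helper name.

```python
def flatten_pages(pages):
    """Collect the file records from every page into one flat list."""
    files = []
    for page in pages:
        # Each page keeps its records under ['_embedded']['files']
        files.extend(page.get('_embedded', {}).get('files', []))
    return files
```

With this helper, `all_files = flatten_pages(files_in_query)` gives one record per file, and `[f['id'] for f in all_files]` gives every file id in the query.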

## _Step 3_: Download data with the Cancer Genomics Cloud API
We will use the Python wrapper (sevenbridges-python) to download the data. If you haven't already installed it, please do so now.

### Install _sevenbridges-python_ library
You need to install the _sevenbridges-python_ library. Library details are available [here](http://sevenbridges-python.readthedocs.io/en/latest/sevenbridges/). The easiest way to install sevenbridges-python is using pip:
```bash
pip install sevenbridges-python
```

### Save your Authentication Token to a local configuration file
There are multiple ways to pass your authentication token to the sevenbridges-python library. Here we focus on using a configuration file in your HOME directory. The format of the .sbgrc is standard ini file format, as shown below:

```ini
[cgc]
auth-token = 910975f5b24a470bb0b028fe813b8100
api-url = https://cgc-api.sbgenomics.com/v2

[sbpla]
auth-token = 700992f7b24a470bb0b028fe813b8100
api-url = https://api.sbgenomics.com/v2
```
To **create** this file<sup>1</sup>, use the following steps in your _Terminal_:

 1. Create and open the file:
```bash
cd ~
touch .sbgrc
vi .sbgrc
```
 2. Press "i" to go into **insert mode**
 3. Write the text above for each environment. It should look like:

<img src="images/example_sbgrc_file.png" height="505" width="409">

 4. Press "ESC", then type ":wq" and press Enter to save the file and exit vi
  
<sup>1</sup> If the file already exists, omit the _touch_ command
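
If you would rather not edit the file by hand, the same ini layout could be written from Python with the standard library. This is a sketch under assumptions: `write_sbgrc` is a hypothetical helper, it overwrites any existing file at the given path, and restricting the file permissions is our suggestion rather than a requirement of sevenbridges-python.

```python
import os

try:
    import configparser                      # Python 3
except ImportError:
    import ConfigParser as configparser      # Python 2.7


def write_sbgrc(path, profiles):
    """Write profiles to an ini file, replacing any existing file.

    profiles maps a profile name to an (auth_token, api_url) pair.
    """
    parser = configparser.RawConfigParser()
    for name, (token, url) in sorted(profiles.items()):
        parser.add_section(name)
        parser.set(name, 'auth-token', token)
        parser.set(name, 'api-url', url)
    with open(path, 'w') as handle:
        parser.write(handle)
    # The token is confidential, so restrict the file to its owner
    os.chmod(path, 0o600)
```

For example, `write_sbgrc(os.path.expanduser('~/.sbgrc'), {'cgc': (my_token, 'https://cgc-api.sbgenomics.com/v2')})` would create a single **cgc** profile, where `my_token` holds your own token.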

### Imports
We import the official sevenbridges-python bindings below; the _Api_ class is then available as _sbg.Api_.

In [7]:
import sevenbridges as sbg

### Initialize the object
The _Api_ object needs to know your **auth\_token** and the correct path. Here we assume you are using the .sbgrc file in your home directory. For other options see <a href="https://github.com/sbg/okAPI/blob/master/Recipes/CGC/Setup_API_environment.ipynb">Setup_API_environment.ipynb</a>

In [8]:
# [USER INPUT] specify the profile name from your .sbgrc file {cgc, sbpla}
prof = 'cgc'


config_file = sbg.Config(profile=prof)
api = sbg.Api(config=config_file)

### List of links to download files
Here we loop through the _first ten_ files in the first item of the **files_in_query** list from above, using the **'id'** key. Specifically, the id for the first file is in:

```python
files_in_query[0]['_embedded']['files'][0]['id']
```
We do three things with these ids:

 1. Create a list of files on the platform. From this point, it would be possible to **take action** on the CGC, i.e. combining these files with an _App_ and then starting a _Task_.
 2. (optional) Generate a list of download links.
 3. Download each of the ten files in this list. They will be saved to a _downloads_ folder in your local directory.

In [9]:
# 1) Generate a list of file objects from the file ids
file_list = []
for f in files_in_query[0]['_embedded']['files'][0:10]:
    file_list.append(api.files.get(id=f['id']))
    print(file_list[-1].name)

# (BRANCH-POINT) Do something awesome with these files on the CGC


# 2) (optional) Generate a list of download links
dl_list = []
for f in file_list:
    dl_list.append(f.download_info())


# 3) Download each of the files in the list to a _downloads_ folder in your local directory.
import os

dl_dir = 'downloads'
# Create the folder only if it does not exist yet
if not os.path.isdir(dl_dir):
    os.mkdir(dl_dir)

for f in file_list:
    f.download(path=os.path.join(dl_dir, f.name))

unc.edu.96c1cfef-0507-46f5-a007-d0f7369cced5.2090373.bt.exon_quantification.txt
unc.edu.96c1cfef-0507-46f5-a007-d0f7369cced5.2089520.junction_quantification.txt
unc.edu.7dbd8795-b264-4f68-bdea-4cfe39884864.1155813.junction_quantification.txt
UNCID_608183.TCGA-A8-A06O-01A-11R-A00Z-07.110228_UNC10-SN254_0198_BB041RABXX.4.trimmed.annotated.gene.quantification.txt
unc.edu.7dbd8795-b264-4f68-bdea-4cfe39884864.1771027.bt.exon_quantification.txt
UNCID_608349.TCGA-A8-A06O-01A-11R-A00Z-07.110228_UNC10-SN254_0198_BB041RABXX.4.trimmed.annotated.translated_to_genomic.exon.quantification.txt
UNCID_608179.TCGA-A8-A06O-01A-11R-A00Z-07.110228_UNC10-SN254_0198_BB041RABXX.4.trimmed.annotated.translated_to_genomic.spljxn.quantification.txt
unc.edu.4808bc63-000a-4a49-a25b-4b817ca5ea54.1768727.bt.exon_quantification.txt
unc.edu.4808bc63-000a-4a49-a25b-4b817ca5ea54.1145980.junction_quantification.txt
UNCID_470209.TCGA-AO-A0J4-01A-11R-A034-07.110503_UNC12-SN629_0079_AB0151ABXX.5.trimmed.annotated.gene.quanti
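
When downloading more than a handful of files, it may help to skip files that are already on disk so that the cell can be re-run safely. The sketch below is one way to do this; `download_all` is a hypothetical helper that relies only on each object having a `name` attribute and a `download(path=...)` method, as used in the cell above. It does not detect partially downloaded files.

```python
import os


def download_all(files, dl_dir='downloads'):
    """Download each file into dl_dir, skipping files already present.

    files is any iterable of objects exposing .name and .download(path=...).
    """
    if not os.path.isdir(dl_dir):
        os.makedirs(dl_dir)
    saved = []
    for f in files:
        target = os.path.join(dl_dir, f.name)
        # Only fetch files that are not on disk yet
        if not os.path.exists(target):
            f.download(path=target)
        saved.append(target)
    return saved
```

Calling `download_all(file_list)` would then return the local path of every file in the list, whether it was downloaded now or in an earlier run.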