# Programmatically Access TCGA Data using the Seven Bridges Cancer Genomics Cloud via the Datasets API

---
**NOTE**

This tutorial is not actively maintained!

--- 

The Cancer Genome Atlas (TCGA) is one of the world’s largest cancer genomics data collections, including more than eleven thousand patients, representing 33 cancers, and over half a million total files. Seven Bridges has created a unified metadata ontology from the diverse cancer studies, made this data available, and provided compute infrastructure to facilitate customized analyses on the Cancer Genomics Cloud (the CGC). The CGC provides powerful methods to query and reproducibly analyze TCGA data, alone or in conjunction with your own data.
We continue to develop new methods of interacting with data on the CGC. However, we also appreciate that it is sometimes useful to analyze data locally, or in an AWS environment that you have configured yourself. While the CGC has undergone thorough testing and is certified as a FISMA-moderate system, if you wish to analyze data in alternative locations, you must take the appropriate steps to ensure your computing environment is secure and compliant with current best practices. If you plan to download large numbers of files for local analysis, we recommend using the download utilities available from the Genomic Data Commons, which have been specifically optimized for this purpose.
Below, we provide a tutorial showing how to find and access TCGA data using the Datasets API. Alternatively, you can query TCGA data using SPARQL.


## Goal of this Tutorial
During this tutorial, you will learn how to use the Datasets API to get the gene expression files for tumor-normal matched Breast Cancer (BRCA) datasets. To do this, we first identify the primary tumor samples and the solid tissue normal samples from BRCA for which RNA-seq experiments have been performed. We then find the tumor-normal matched datasets by identifying the cases (patients) that have both kinds of sample. Finally, we get the gene expression files for these matched cases and copy them into a project on the CGC.

## Prerequisites
Before you begin this tutorial, you should:
 1. **Set up your CGC account.** If you haven't already done so, navigate to https://cgc.sbgenomics.com/ and follow these directions to register for the CGC. This tutorial uses Open Data, which is available to all CGC users. The same approach can be used by approved researchers to access Controlled Data. Learn more about TCGA data access here.
 2. **Install the Seven Bridges API Python library.** This tutorial uses the library sevenbridges-python. Learn how to install it before continuing.
 3. **Obtain your authentication token.** You'll use your authentication token to encode your user credentials when interacting with the CGC programmatically. Learn how to access your authentication token. It is important to store your authentication token in a safe place, as it can be used to access your account. The time and location your token was last used are shown on the developer dashboard. If for any reason you believe your token has been compromised, you can regenerate it at any time.
 
## Query using the Datasets API
The Datasets API is an API designed around the TCGA data structure and focused on search functionality. You can use the Datasets API to browse TCGA using API requests written in JSON. Queries made using the Datasets API return entities and are particularly suitable for browsing TCGA data.
We'll write a Python script to issue our query into TCGA using the Datasets API. Since the Datasets API is not included in our Python library, sevenbridges-python, we will use two Python modules, json and requests, to interact with it instead. We'll use these modules to write a wrapper around the API request.

In [None]:
import json
from requests import request

Below, we define a simple function to send and receive JSONs from the API using the correctly formatted HTTP calls. The necessary imports are handled above.

In [None]:
def api_call(path, method='GET', query=None, data=None, token=None):
    """Send a request to the Datasets API and return the decoded JSON body."""
    base_url = 'https://cgc-datasets-api.sbgenomics.com/datasets/tcga/v0/'

    # Only dicts and lists are serialized; any other payload is dropped.
    data = json.dumps(data) if isinstance(data, (dict, list)) else None

    headers = {
        'X-SBG-Auth-Token': token,
        'Accept': 'application/json',
        'Content-type': 'application/json',
    }

    response = request(method, base_url + path, params=query,
                       data=data, headers=headers)
    # Decode the body once; use an empty dict if the decoded body is empty.
    response_dict = response.json() or {}

    # Floor division, so that every 2xx status counts as success in Python 3
    # (with true division, 201 / 100 evaluates to 2.01 and would raise here).
    if response.status_code // 100 != 2:
        print(response_dict.get('message'))
        print('Error Code: %s.' % response_dict.get('code'))
        print(response_dict.get('more_info'))
        raise Exception('Server responded with status code %s.'
                        % response.status_code)
    return response_dict
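
As a side note on the status check in a wrapper like this one: HTTP success codes span 200 to 299, so the test needs floor division to behave the same way on Python 2 and Python 3. The helper below is a minimal sketch of that check in isolation; `is_success` is a hypothetical name used only for illustration and is not part of the Datasets API or of sevenbridges-python.

```python
def is_success(status_code):
    """Return True for any 2xx HTTP status code."""
    # Floor division keeps the hundreds digit as an integer on Python 3,
    # where 201 / 100 would evaluate to 2.01 instead of 2.
    return status_code // 100 == 2


# A few representative codes: OK, Created, Not Found.
for code in (200, 201, 404):
    print(code, is_success(code))
```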

Then, provide your authentication token, as shown below. Examples of how to store your auth_token safely are available in the documentation for the sevenbridges-python bindings.

In [None]:
auth_token = 'Enter your Authentication token here'
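
If you would rather not paste the token into the notebook, one option is to read it from an environment variable. The sketch below assumes a variable called `CGC_AUTH_TOKEN`; that name and the `load_auth_token` helper are hypothetical and chosen only for this example.

```python
import os


def load_auth_token(env_var='CGC_AUTH_TOKEN'):
    """Read the token from the environment, failing loudly if it is absent."""
    # env_var is an assumed name; use whatever variable you export yourself.
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError('Set the %s environment variable first.' % env_var)
    return token


# Usage, once the variable is exported in your shell:
# auth_token = load_auth_token()
```

Keeping the token out of the notebook makes it less likely to be shared by accident when the notebook is committed or sent to a colleague.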

Now, we can define a query in JSON for finding all primary tumor samples from Breast Invasive Carcinoma cases that have RNA-seq gene expression files. Note that the query below also restricts the cases to female patients whose vital status is Alive.


In [None]:
tumor_samples_query = {
    "entity": "samples",
    "hasSampleType": "Primary Tumor",
    "hasCase": {
        "hasDiseaseType": "Breast Invasive Carcinoma",
        "hasGender": "FEMALE",
        "hasVitalStatus": "Alive"
    },
    "hasFile": {
        "hasExperimentalStrategy": "RNA-Seq",
        "hasDataType": "Gene expression"
    }
}

In [None]:
total = api_call(method='POST', path ='query/total', \
                 token=auth_token, data=tumor_samples_query)
print("There are {} samples matching the query".format(total['total']))

Below, we define a simple function that retrieves all matches to a query. It first asks the API for the total number of matches, then requests the results in pages of 100 by increasing the offset on each call.

In [None]:
import math

def getAllMatches(auth_token, query_body):
    """Page through the query results, 100 at a time, and return every match."""
    numberFiles = api_call(method="POST", path="query/total",
                           token=auth_token, data=query_body)["total"]
    numCalls = int(math.ceil(numberFiles / 100.0))
    matches = []
    entity = query_body["entity"]
    for i in range(numCalls):
        # Copy the query so the caller's dict is not modified by paging.
        page_query = dict(query_body, offset=str(i * 100))
        currSet = api_call(method="POST", path="query",
                           token=auth_token, data=page_query)["_embedded"][entity]
        matches.extend(currSet)

    return matches
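
The paging arithmetic can be checked without calling the API. The following is a minimal sketch that computes the offsets such a loop would request, assuming the page size of 100 used above; `page_offsets` is a hypothetical helper, not part of the Datasets API.

```python
import math


def page_offsets(total, page_size=100):
    """Return the offset of each page needed to cover `total` results."""
    # float() keeps the division exact on Python 2 as well as Python 3.
    num_pages = int(math.ceil(total / float(page_size)))
    return [page * page_size for page in range(num_pages)]


# 250 matches need three calls, starting at offsets 0, 100 and 200.
print(page_offsets(250))
```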

Using these functions, we can now get all the samples that match the required queries.

In [None]:
tumor_samples = getAllMatches(auth_token, tumor_samples_query)
tumor_sample_ids = [curr_sample["id"] for curr_sample in tumor_samples]

Now, we can define a query in JSON for getting all solid tissue normal samples from Breast Invasive Carcinoma cases that have RNA-seq gene expression files. As before, the query also restricts the cases to female patients whose vital status is Alive.

In [None]:
tissue_normal_samples_query = {
    "entity": "samples",
    "hasSampleType": "Solid Tissue Normal",
    "hasCase": {
        "hasDiseaseType": "Breast Invasive Carcinoma",
        "hasGender": "FEMALE",
        "hasVitalStatus": "Alive"
    },
    "hasFile": {
        "hasExperimentalStrategy": "RNA-Seq",
        "hasDataType": "Gene expression"
    }
}
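
The tumor and normal sample queries differ only in the value of `hasSampleType`, so they could also be generated by a small helper. This is a sketch under that assumption; `build_samples_query` is a hypothetical name, and the filter values are copied from the queries in this tutorial.

```python
def build_samples_query(sample_type):
    """Build a BRCA RNA-seq gene expression sample query for one sample type."""
    return {
        "entity": "samples",
        "hasSampleType": sample_type,
        "hasCase": {
            "hasDiseaseType": "Breast Invasive Carcinoma",
            "hasGender": "FEMALE",
            "hasVitalStatus": "Alive",
        },
        "hasFile": {
            "hasExperimentalStrategy": "RNA-Seq",
            "hasDataType": "Gene expression",
        },
    }


# Each call returns a fresh dict, so the two queries share no state.
example_tumor_query = build_samples_query("Primary Tumor")
example_normal_query = build_samples_query("Solid Tissue Normal")
print(example_tumor_query["hasSampleType"], example_normal_query["hasSampleType"])
```

Building both queries from one function keeps the case and file filters in sync if you later change, for example, the disease type.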

In [None]:
tissue_normal_samples = getAllMatches(auth_token, tissue_normal_samples_query)
tissue_normal_sample_ids = [curr_sample["id"] for curr_sample in tissue_normal_samples]

In [None]:
total = api_call(method='POST', path ='query/total', \
                 token=auth_token, data=tissue_normal_samples_query)
print("There are {} samples matching the query".format(total['total']))

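Now, we are ready to identify the cases (patients) that have both a primary tumor sample and a solid tissue normal sample with RNA-seq experiments. We first look up the cases for each set of samples, then intersect the two lists of case IDs.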

In [None]:
tumor_cases_query = {
    "entity": "cases",
    "hasSample": tumor_sample_ids
}
tumor_cases = getAllMatches(auth_token, tumor_cases_query)
tumor_case_ids = [curr_case["id"] for curr_case in tumor_cases]

In [None]:
tissue_normal_cases_query = {
    "entity": "cases",
    "hasSample": tissue_normal_sample_ids
}
tissue_normal_cases = getAllMatches(auth_token, tissue_normal_cases_query)
tissue_normal_case_ids = [curr_case["id"] for curr_case in tissue_normal_cases]

In [None]:
tumor_match_case_ids = list(set(tumor_case_ids) & set(tissue_normal_case_ids))
print("There are {} cases that have both primary tumor and solid tissue normal samples with RNA-seq experiments".format(len(tumor_match_case_ids)))
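
The matching step is a plain set intersection. The toy example below uses made-up case IDs (they are placeholders, not real TCGA identifiers) to show that duplicates are collapsed and only cases present in both lists survive.

```python
# Placeholder IDs for illustration only; real case IDs come from the API.
toy_tumor_case_ids = ["case-A", "case-B", "case-C", "case-B"]
toy_normal_case_ids = ["case-B", "case-C", "case-D"]

# set() removes duplicates; & keeps only the IDs found in both collections.
toy_matched = sorted(set(toy_tumor_case_ids) & set(toy_normal_case_ids))
print(toy_matched)
```

Note that a set has no defined order, which is why the sketch sorts the result before printing it.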

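Now that we know the case IDs, we can use them to get the corresponding gene expression files, once for the primary tumor samples and once for the solid tissue normal samples.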

In [None]:
tumor_match_files_query = {
    "entity": "files",
    "hasExperimentalStrategy": "RNA-Seq",
    "hasDataType" : "Gene expression",
    "hasSample": {
        "hasSampleType" : "Primary Tumor"
    },
    "hasCase": tumor_match_case_ids
}
tumor_match_files = getAllMatches(auth_token, tumor_match_files_query)

In [None]:
tissue_normal_match_files_query = {
    "entity": "files",
    "hasExperimentalStrategy": "RNA-Seq",
    "hasDataType" : "Gene expression",
    "hasSample": {
        "hasSampleType" : "Solid Tissue Normal"
    },
    "hasCase": tumor_match_case_ids
}
tissue_normal_match_files = getAllMatches(auth_token, tissue_normal_match_files_query)

In [None]:
print("There are {} files corresponding to Gene Expression for Tumor samples in tumor-normal matched cases for BRCA".format(len(tumor_match_files)))
print("There are {} files corresponding to Gene Expression for Solid tissue normal samples in tumor-normal matched cases for BRCA".format(len(tissue_normal_match_files)))

## Initialize the sevenbridges-python library
This section assumes that you have installed sevenbridges-python and stored your credentials in its config file. Let's import the official sevenbridges-python bindings.

In [None]:
import sevenbridges as sbg

Let's initialize the api object so the API knows our credentials.

In [None]:
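# [USER INPUT] name of the profile in your credentials file that points to
# the platform you want to use; this example uses a profile called 'sbpla'
prof = 'sbpla'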


config_file = sbg.Config(profile=prof)
api = sbg.Api(config=config_file)

Next, create a new project to hold the files. The cell below stops if a project with the same name already exists.

In [None]:
# [USER INPUT] Set project name here:
new_project_name = 'Matched Tumor-Control Samples'                          
      
    
# What are my funding sources?
billing_groups = api.billing_groups.query()  

# Pick the first group (arbitrary)
print((billing_groups[0].name + \
       ' will be charged for computation and storage (if applicable) for your new project'))

# Set up the information for your new project
new_project = {
        'billing_group': billing_groups[0].id,
        'description': """A project created by the API recipe (projects_makeNew.ipynb).
                      This also supports **markdown**
                      _Pretty cool_, right?
                   """,
        'name': new_project_name
}

# check if this project already exists. LIST all projects and check for name match
my_project = [p for p in api.projects.query(limit=100).all() \
              if p.name == new_project_name]      
              
if my_project:    # exploit fact that empty list is False, {list, tuple, etc} is True
    print('A project with the name (%s) already exists, please choose a unique name' \
          % new_project_name)
    raise KeyboardInterrupt
else:
    # CREATE the new project
    my_project = api.projects.create(name = new_project['name'], \
                                     billing_group = new_project['billing_group'], \
                                     description = new_project['description'])
    
    # (re)list all projects, and get your new project
    my_project = [p for p in api.projects.query(limit=100).all() \
              if p.name == new_project_name][0]

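Finally, copy the files into the project. File IDs are transferable between the Datasets API and the platform API, so the IDs returned by the queries above can be passed directly to sevenbridges-python.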

In [None]:
def copyToProject(api, my_project, finalFiles):
    # Names of the files that are already present in the target project
    my_files = api.files.query(limit=100, project=my_project.id).all()
    my_file_names = [f.name for f in my_files]
    newFiles = []
    for currFile in finalFiles:
        if currFile["label"] in my_file_names:
            print('File %s already exists in the project, skipping it'
                  % currFile["label"])
        else:
            # The file ID from the Datasets API is also valid for the platform API
            fileObject = api.files.get(id=currFile['id'])
            my_new_file = fileObject.copy(project=my_project.id,
                                          name=fileObject.name)
            newFiles.append(my_new_file)
    print("Files Imported!")
    return newFiles

copyToProject(api, my_project, tumor_match_files)
copyToProject(api, my_project, tissue_normal_match_files)
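
The duplicate check in the copy step can be tried out without touching the platform. The sketch below separates it into a pure function that works on plain dictionaries shaped like the Datasets API file results used above (with `id` and `label` keys). `select_new_files` and the sample records are hypothetical and exist only for illustration.

```python
def select_new_files(candidate_files, existing_names):
    """Split candidates into files to copy and files already in the project."""
    existing = set(existing_names)  # set lookup avoids rescanning a list
    to_copy, skipped = [], []
    for candidate in candidate_files:
        if candidate["label"] in existing:
            skipped.append(candidate)
        else:
            to_copy.append(candidate)
    return to_copy, skipped


# Made-up records for illustration; real ones come from the Datasets API.
toy_files = [
    {"id": "id-1", "label": "sample1.txt"},
    {"id": "id-2", "label": "sample2.txt"},
]
to_copy, skipped = select_new_files(toy_files, ["sample2.txt"])
print([f["label"] for f in to_copy], [f["label"] for f in skipped])
```

Deciding what to copy before making any API calls also makes it easy to review the list first, which can be useful when a query returns many files.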