The Datasets API enables researchers to search for files in the TCGA (legacy hg19) dataset based on their metadata. Similar endpoints are available for querying other public datasets, such as CCLE. All researchers can run queries that return Case, Sample, and File IDs, and can import level 3 files (i.e., files containing non-personally identifiable information), such as gene quantification or somatic VCF files. Files containing personally identifiable data, such as germline VCF files and raw sequencing data, can only be accessed with appropriate dbGaP permissions on the CGC. In this section, we show how to write a Python script that uses the Datasets API to find RNA-sequencing data from matched tumor-normal samples from patients diagnosed with BRCA and then imports those files into a project.

Necessary Requirements 
- A computer with internet access and an up-to-date Internet browser (e.g. Firefox, Chrome, Safari). 
- An account on the Seven Bridges Cancer Genomics Cloud (CGC) (https://cgc.sbgenomics.com). To access Controlled Data from TCGA, you must register with your eRA Commons or NIH-CIT credentials and hold access permissions through the Database of Genotypes and Phenotypes (dbGaP) (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login).
- A working Python installation (for example, through conda) and the Python bindings for the Seven Bridges API, which can be installed using:
        > pip install sevenbridges-python


1) The Datasets API accepts queries written in JSON and returns entities such as Case, Sample, or File IDs. We will use two Python modules, json and requests, to write a wrapper around the API request, so the first step is to import these modules.


In [None]:
import json
from requests import request

2) We define a simple function that sends JSON to the API and receives JSON back using correctly formatted HTTP calls. The token argument is your Authentication token; the next step explains how to obtain it.

In [None]:

def api_call(path, method='GET', query=None, data=None, token=None):
    # Base URL for the Datasets API
    base_url = 'https://cgc-datasets-api.sbgenomics.com/datasets/tcga/v0/'

    # Convert the input for the API call to a JSON string
    data = json.dumps(data) if isinstance(data, (dict, list)) else None

    # Headers for the API call
    headers = {
        'X-SBG-Auth-Token': token,
        'Accept': 'application/json',
        'Content-type': 'application/json',
    }

    # API call
    response = request(method, base_url + path, params=query,
                       data=data, headers=headers)

    # Convert the response from JSON to a dictionary; fall back to an
    # empty dictionary if the body is empty or is not valid JSON
    try:
        response_dict = response.json() or {}
    except ValueError:
        response_dict = {}

    # Any status code outside the 2xx range is an error. Integer division
    # is required here: in Python 3, 201 / 100 is 2.01, not 2.
    if response.status_code // 100 != 2:
        print(response_dict.get('message'))
        print('Error Code: {}.'.format(response_dict.get('code')))
        print(response_dict.get('more_info'))
        raise Exception('Server responded with status code %s.'
                        % response.status_code)
    return response_dict
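
To make the shape of the request concrete, the sketch below assembles the same URL, headers, and JSON body that api_call sends, without contacting the server. The helper name build_request is hypothetical and exists only for illustration; the base URL and header names are taken from the function above, and the token shown is a placeholder.

```python
import json

BASE_URL = 'https://cgc-datasets-api.sbgenomics.com/datasets/tcga/v0/'


def build_request(path, data=None, token=None):
    """Hypothetical helper: return the pieces of the HTTP request that
    api_call would send, without sending anything."""
    # Only dictionaries and lists are serialized, as in api_call
    body = json.dumps(data) if isinstance(data, (dict, list)) else None
    headers = {
        'X-SBG-Auth-Token': token,
        'Accept': 'application/json',
        'Content-type': 'application/json',
    }
    return {'url': BASE_URL + path, 'headers': headers, 'body': body}


# Preview a request to the "query/total" path with a placeholder token
preview = build_request('query/total',
                        data={'entity': 'samples'},
                        token='dummy-token')
print(preview['url'])
print(preview['body'])
```

Inspecting the request this way can help when a query is rejected, because the JSON body is visible exactly as it would be sent.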


3) Next, obtain your CGC Authentication token for the API from the Developer Dashboard within your account. To open the Developer Dashboard, click your username in the upper right corner and then click Developer. Click Auth token to view the Authentication token (Figure 10). Note: Your Auth token grants access to the CGC and your projects; treat it as you would your password.

In [None]:
auth_token = "Your Authentication Token Here"
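
Because the token should be treated like a password, one option is to keep it out of the script entirely. The sketch below reads it from an environment variable instead. The function name load_auth_token and the variable name CGC_AUTH_TOKEN are assumptions made for this example; neither is defined by the CGC.

```python
import os


def load_auth_token(env_var="CGC_AUTH_TOKEN"):
    """Return the token stored in the given environment variable.

    The default variable name is an assumption for this sketch.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            "Set the {} environment variable to your CGC "
            "Authentication token.".format(env_var))
    return token
```

With the variable set in your shell, auth_token = load_auth_token() can replace the hard-coded string above.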

4) Now we build a query in JSON format to find primary tumor samples that (i) are from Cases (i.e., patients) diagnosed with Breast Invasive Carcinoma and (ii) have associated RNA-Seq reads. You can retrieve each entity’s metadata schema with an API call, as described at http://docs.cancergenomicscloud.org/docs/query-via-the-datasets-api#section-step-1-get-an-entity-s-metadata-schema.


In [None]:
tumor_samples_query = {
    "entity": "samples",
    "hasSampleType": "Primary Tumor",
    "hasCase": {
        "hasDiseaseType": "Breast Invasive Carcinoma"
    },
    "hasFile": {
        "hasExperimentalStrategy": "RNA-Seq",
        "hasDataFormat": "TARGZ"
    }
}


5) Next, we perform an API query using the function from step 2 and the query from step 4 to return the number of BRCA primary tumor samples with RNA-seq reads files. Appending “query/total” to the base TCGA metadata path returns the number of samples that match the metadata query above.

In [None]:
total = api_call(method='POST', path ='query/total', \
                 token=auth_token, data=tumor_samples_query)
print("There are {} samples matching the query".format(total['total']))

6) Next, we define a simple function that retrieves all matches to a query using correctly formatted HTTP calls. The previous API call returned only the total number of samples matching the query (primary tumor samples from breast invasive carcinoma cases that have raw RNA-seq reads). To get the IDs and all associated metadata for those samples, we instead append just “query” to the path. The results are returned in pages of at most 100 entities each, so we loop over the pages to build the final list of sample IDs and associated metadata.

In [None]:
import math

def getAllMatches(auth_token, query_body):
    # Total number of entities matching the query
    numberMatches = api_call(method="POST", path="query/total",
                             token=auth_token, data=query_body)["total"]
    # Results come back in pages of at most 100 entities
    numCalls = int(math.ceil(numberMatches / 100.0))
    matches = []
    entity = query_body["entity"]
    for i in range(numCalls):
        # Work on a copy so the caller's query is not modified
        page_query = dict(query_body, offset=str(i * 100))
        currSet = api_call(method="POST", path="query",
                           token=auth_token, data=page_query)["_embedded"][entity]
        matches.extend(currSet)

    return matches
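
The paging arithmetic can be checked in isolation. The sketch below computes the offsets that the loop above would request for a given total, assuming the page size of 100 stated in step 6. The helper name page_offsets is hypothetical.

```python
import math


def page_offsets(total, page_size=100):
    """Return the offset of each page needed to cover `total` results."""
    num_calls = int(math.ceil(total / float(page_size)))
    return [i * page_size for i in range(num_calls)]


print(page_offsets(250))  # [0, 100, 200]
print(page_offsets(100))  # [0]
print(page_offsets(0))    # []
```

A total of 250 therefore needs three calls, and a query with no matches needs none.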

7) We can call this new function to return all the sample IDs that match the query.

In [None]:
tumor_samples = getAllMatches(auth_token, tumor_samples_query)
tumor_sample_ids = [curr_sample["id"] for curr_sample in tumor_samples]


8) We can similarly query for BRCA solid tissue normal samples with RNA-seq reads files.

In [None]:
tissue_normal_samples_query = {
    "entity": "samples",
    "hasSampleType": "Solid Tissue Normal",
    "hasCase": {
        "hasDiseaseType" : "Breast Invasive Carcinoma"
        },
    "hasFile": {
        "hasExperimentalStrategy": "RNA-Seq",
        "hasDataFormat" : "TARGZ"
    }
}


tissue_normal_samples = getAllMatches(auth_token, tissue_normal_samples_query)
tissue_normal_sample_ids = [curr_sample["id"] for curr_sample in tissue_normal_samples]
total = api_call(method='POST', path ='query/total', \
                 token=auth_token, data=tissue_normal_samples_query)
print("There are {} samples matching the query".format(total['total']))
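
The tumor and normal sample queries differ only in the value of hasSampleType, so they could be generated from one template. The sketch below shows one way to do this. The function name build_samples_query and its parameter names are hypothetical; the field names and values are the ones used in the queries above.

```python
def build_samples_query(sample_type,
                        disease_type="Breast Invasive Carcinoma",
                        strategy="RNA-Seq",
                        data_format="TARGZ"):
    """Hypothetical helper: build a samples query for one sample type."""
    return {
        "entity": "samples",
        "hasSampleType": sample_type,
        "hasCase": {"hasDiseaseType": disease_type},
        "hasFile": {
            "hasExperimentalStrategy": strategy,
            "hasDataFormat": data_format,
        },
    }


example_tumor_query = build_samples_query("Primary Tumor")
example_normal_query = build_samples_query("Solid Tissue Normal")
print(example_normal_query["hasSampleType"])
```

Generating both queries from the same function keeps the disease type and file filters identical, which matters when the results are intersected later.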

9) Next, we identify the corresponding Cases (patients) that have tumor samples or tissue normal samples with RNA-seq experiments.

In [None]:
tumor_cases_query = {
    "entity": "cases",
    "hasSample": tumor_sample_ids
}
tumor_cases = getAllMatches(auth_token, tumor_cases_query)
tumor_case_ids = [curr_case["id"] for curr_case in tumor_cases]

In [None]:
tissue_normal_cases_query = {
    "entity": "cases",
    "hasSample": tissue_normal_sample_ids
}
tissue_normal_cases = getAllMatches(auth_token, tissue_normal_cases_query)
tissue_normal_case_ids = [curr_case["id"] for curr_case in tissue_normal_cases]

10) To identify the subset of patients with RNA-seq experiments for both primary tumor and tissue normal, we take the intersection of the two lists.

In [None]:
tumor_match_case_ids = list(set(tumor_case_ids) & set(tissue_normal_case_ids))
print("There are {} cases that have both primary tumor and solid tissue normal samples with RNA-seq experiments".format(len(tumor_match_case_ids)))
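
As a small illustration of the intersection step, the sketch below applies the same set operation to made-up Case IDs. The IDs are placeholders and do not correspond to real TCGA cases.

```python
# Placeholder Case IDs; "case-B" appears twice to show that sets
# also remove duplicates
example_tumor_case_ids = ["case-A", "case-B", "case-C", "case-B"]
example_normal_case_ids = ["case-B", "case-C", "case-D"]

# Cases present in both lists
matched = sorted(set(example_tumor_case_ids) & set(example_normal_case_ids))
# Cases that have a tumor sample but no matching normal sample
tumor_only = sorted(set(example_tumor_case_ids) - set(example_normal_case_ids))

print(matched)     # ['case-B', 'case-C']
print(tumor_only)  # ['case-A']
```

Note that converting to a set discards the original order, so the sketch sorts the result to make the output reproducible.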

11) Now we obtain the paired tumor-normal samples by querying with the matched Case IDs and the appropriate metadata for the primary tumor and solid tissue normal samples, respectively. Each query returns a list of JSON objects, one per file, containing that file's metadata.


In [None]:
tumor_match_files_query = {
    "entity": "files",
    "hasExperimentalStrategy": "RNA-Seq",
    "hasDataFormat" : "TARGZ",
    "hasSample": {
        "hasSampleType" : "Primary Tumor"
    },
    "hasCase": tumor_match_case_ids
}
tumor_match_files = getAllMatches(auth_token, tumor_match_files_query)

In [None]:
tissue_normal_match_files_query = {
    "entity": "files",
    "hasExperimentalStrategy": "RNA-Seq",
    "hasDataFormat" : "TARGZ",
    "hasSample": {
        "hasSampleType" : "Solid Tissue Normal"
    },
    "hasCase": tumor_match_case_ids
}
tissue_normal_match_files = getAllMatches(auth_token, tissue_normal_match_files_query)

In [None]:

print("There are {} files corresponding to raw reads for Tumor samples in tumor-normal matched cases for BRCA".format(len(tumor_match_files)))
print("There are {} files corresponding to raw reads for Solid tissue normal samples in tumor-normal matched cases for BRCA".format(len(tissue_normal_match_files)))

12) Each File object associates metadata with a File ID (a unique identifier for a file on the CGC), and this ID can be used to copy the file to your project. In the case of Controlled Access data, you must have appropriate dbGaP permissions to copy these files or view their contents. Now you can initialize the Seven Bridges Python bindings.

In [None]:
import sevenbridges as sbg
api = sbg.Api(url='https://cgc-api.sbgenomics.com/v2', token=auth_token)

13) If you haven’t already made a project for these files using the GUI, you can create one using the API. To create a new project, you need to provide a project name and a billing group. This code block selects the first billing group available, which will return a user’s personal billing group in most cases. If you are involved in collaborations and are in multiple billing groups, you will want to select the appropriate one for each project.

In [None]:
new_project_name = 'Protocol 1'                          
billing_groups = api.billing_groups.query()  
print((billing_groups[0].name + \
       ' will be charged for computation and storage (if applicable) for your new project'))

new_project = {
        'billing_group': billing_groups[0].id,
        'name': new_project_name, 
        'tags': ['tcga']
}

my_project = api.projects.create(name = new_project['name'], \
                                 billing_group = new_project['billing_group'], \
                                 tags = new_project['tags'])
my_project = [p for p in api.projects.query(limit=100).all() \
              if p.name == new_project_name][0] 

14) Now you can copy the files to the project. The ID of each file returned from the Datasets API can be used to copy that file into a project in the user’s workspace. To do this, we create a function that loops through the list of files we want to copy and copies each one to the user’s project.

In [None]:
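def copy_to_project(api, my_project, final_files):
    new_files = []
    for curr_file in final_files:
        # Look up the file by its ID, then copy it under the same name
        file_object = api.files.get(id = curr_file['id'])
        my_new_file = file_object.copy(project = my_project.id, name = file_object.name)
        new_files.append(my_new_file)
    print("Files Imported!")
    # Return the copied files so they can be used in later steps
    return new_files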


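# Copy both halves of each tumor-normal pair
copy_to_project(api, my_project, tumor_match_files)
copy_to_project(api, my_project, tissue_normal_match_files)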
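
If the tumor and normal file lists are combined before copying, it may be worth guarding against the same File ID appearing more than once. The sketch below merges several lists of file metadata and keeps the first occurrence of each ID. The helper name unique_by_id is hypothetical, and the example records are placeholders that carry only the "id" key used by copy_to_project.

```python
def unique_by_id(*file_lists):
    """Merge lists of file metadata dicts, keeping the first occurrence
    of each "id" and preserving order."""
    seen = set()
    merged = []
    for file_list in file_lists:
        for file_meta in file_list:
            if file_meta["id"] not in seen:
                seen.add(file_meta["id"])
                merged.append(file_meta)
    return merged


# Placeholder records standing in for Datasets API results
example_tumor_files = [{"id": "file-1"}, {"id": "file-2"}]
example_normal_files = [{"id": "file-2"}, {"id": "file-3"}]

example_merged = unique_by_id(example_tumor_files, example_normal_files)
print([f["id"] for f in example_merged])  # ['file-1', 'file-2', 'file-3']
```

Under these assumptions, copy_to_project(api, my_project, unique_by_id(tumor_match_files, tissue_normal_match_files)) would process every matched file once.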