# Programmatically Access CCLE Data using the Seven Bridges Cancer Genomics Cloud via the Datasets API

The CCLE is made possible through a collaboration between the Broad Institute, the Novartis Institutes for Biomedical Research, and the Genomics Institute of the Novartis Research Foundation to perform detailed genetic and pharmacologic characterization of a large number of human cancer models.

The CCLE public project contains Open Access sequencing data (in the form of reads aligned to the hg19 broad variant reference genome) for nearly 1000 cancer cell line samples.


## Goal of this Tutorial
During this tutorial, you will learn how to use the Datasets API to retrieve the BAM files produced by whole genome sequencing of breast cancer cell lines. To do this, we first identify the files within the CCLE data that match these metadata requirements. After identifying these files, we copy them into a project on the CGC using the CGC API.

## Prerequisites
Before you begin this tutorial, you should:
 1. **Set up your CGC account.** If you haven't already done so, navigate to https://cgc.sbgenomics.com/ and follow these directions to register for the CGC. This tutorial uses Open Data, which is available to all CGC users. The same approach can be used by approved researchers to access Controlled Data. Learn more about TCGA data access here.
 2. **Install the Seven Bridges API Python library.** This tutorial uses the library sevenbridges-python. Learn how to install it before continuing.
 3. **Obtain your authentication token.** You'll use your authentication token to encode your user credentials when interacting with the CGC programmatically. Learn how to access your authentication token. It is important to store your authentication token in a safe place as it can be used to access your account. The time and location your token was last used is shown on the developer dashboard. If for any reason you believe your token has been compromised, you can regenerate it at any time.
 
## Query using the Datasets API
The Datasets API is an API designed around the CCLE/TCGA data structure and focused on search functionality. You can use the Datasets API to browse CCLE using API requests written in JSON. Queries made using the Datasets API return entities and are particularly suitable for browsing CCLE data.
We'll write a Python script to issue our query into CCLE using the Datasets API. Since the Datasets API is not included in our Python library, sevenbridges-python, we will use two Python modules, json and requests, to interact with it instead. We'll use these modules to write a wrapper around the API request.

In [1]:
import json
from requests import request

Below, we define a simple function to send and receive JSONs from the API using the correctly formatted HTTP calls. The necessary imports are handled above.

In [None]:
def api_call(path, method='GET', query=None, data=None, token=None):
    """Send a request to the CCLE Datasets API and return the decoded JSON."""
    base_url = 'https://cgc-datasets-api.sbgenomics.com/datasets/ccle/v0/'

    # Serialize dict or list payloads to JSON; send no body otherwise
    data = json.dumps(data) if isinstance(data, (dict, list)) else None

    headers = {
        'X-SBG-Auth-Token': token,
        'Accept': 'application/json',
        'Content-type': 'application/json',
    }

    response = request(method, base_url + path, params=query,
                       data=data, headers=headers)

    # Decode the body once; fall back to an empty dict if it is empty or not JSON
    try:
        response_dict = response.json() or {}
    except ValueError:
        response_dict = {}

    # Integer division so that every 2xx status counts as success in Python 3
    if response.status_code // 100 != 2:
        print(response_dict)
        raise Exception('Server responded with status code %s.' % response.status_code)
    return response_dict

Then, provide your authentication token, as shown below. Examples of how to store your auth_token properly are available for the sevenbridges-python bindings.

In [None]:
auth_token = 'Insert Your Authentication Token Here'
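
Pasting a token directly into a notebook makes it easy to share by accident. As one possible alternative, the sketch below reads the token from an environment variable. The helper name `load_auth_token` and the variable name `SB_AUTH_TOKEN` are choices made for this example, not something this tutorial requires.

```python
import os

def load_auth_token(env_var="SB_AUTH_TOKEN", environ=None):
    """Return the token stored in env_var, or raise if it is missing or empty.

    `environ` defaults to os.environ; passing a plain dict is handy for testing.
    """
    environ = os.environ if environ is None else environ
    token = environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError("Set the %s environment variable first." % env_var)
    return token

# Example usage, once the variable is set in your shell:
# auth_token = load_auth_token()
```

Failing loudly when the variable is unset avoids sending requests with a placeholder token and getting a less obvious authentication error back.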

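Now, we can define a query in JSON for finding all BAM files from whole genome sequencing (WGS) experiments that are aligned to the HG19_Broad_variant reference genome. The hasCase block restricts the results to Breast Invasive Carcinoma and to the sample type given there.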


In [None]:
files_query = {
    "entity": "files",
    "hasReferenceGenome": "HG19_Broad_variant",
    "hasDataFormat": "BAM",
    "hasExperimentalStrategy": "WGS",
    "hasCase": {
        "hasDiseaseType" : "Breast invasive carcinoma",
        "hasSampleType" : "EBV Immortalized Normal"
        }
} #Change for your purposes
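
If you expect to run the same query for several disease types or strategies, it may help to build the dictionary with a small function. The sketch below is one way to do that; `build_files_query` is a hypothetical helper, and it only reuses the field names that appear in the query above. Which values the Datasets API accepts for each field is not checked here.

```python
def build_files_query(disease_type, sample_type=None, strategy="WGS",
                      data_format="BAM", reference="HG19_Broad_variant"):
    """Build a files query with the same shape as the one in this tutorial."""
    case_filter = {"hasDiseaseType": disease_type}
    # Only add the sample type filter when one is requested
    if sample_type is not None:
        case_filter["hasSampleType"] = sample_type
    return {
        "entity": "files",
        "hasReferenceGenome": reference,
        "hasDataFormat": data_format,
        "hasExperimentalStrategy": strategy,
        "hasCase": case_filter,
    }

example_query = build_files_query("Breast invasive carcinoma")
print(sorted(example_query.keys()))
```

The returned dictionary can be passed as the `data` argument wherever `files_query` is used.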

In [None]:
total = api_call(method='POST', path ='query/total', \
                 token=auth_token, data=files_query)
print("There are {} files matching the query".format(total['total']))

Below, we define a simple function that retrieves all matches to the query by paging through the results 100 at a time.

In [None]:
import math

def getAllMatches(auth_token, query_body):
    # Ask the API how many entities match, then page through them 100 at a time
    numberFiles = api_call(method="POST", path="query/total",
                           token=auth_token, data=query_body)["total"]
    numCalls = int(math.ceil(numberFiles / 100.0))
    matches = []
    entity = query_body["entity"]
    # Work on a copy so the caller's query is not modified by the paging offset
    paged_query = dict(query_body)
    for i in range(numCalls):
        paged_query["offset"] = str(i * 100)
        currSet = api_call(method="POST", path="query",
                           token=auth_token, data=paged_query)["_embedded"][entity]
        matches.extend(currSet)

    return matches
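
The paging arithmetic in `getAllMatches` can be checked on its own, without calling the API. The sketch below isolates it in a hypothetical helper, `page_offsets`, and assumes a page size of 100 as in the function above.

```python
import math

def page_offsets(total, page_size=100):
    """Return the offsets needed to page through `total` results."""
    num_calls = int(math.ceil(total / float(page_size)))
    return [i * page_size for i in range(num_calls)]

print(page_offsets(250))  # [0, 100, 200]
print(page_offsets(100))  # [0]
print(page_offsets(0))    # []
```

A total that is an exact multiple of the page size needs no extra request, and a total of zero produces no requests at all.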

Using these functions, we can now get all the files that match the query.

In [None]:
files = getAllMatches(auth_token, files_query)
file_ids = [curr_file["id"] for curr_file in files]

## Initialize the sevenbridges-python library
By this point you should have installed sevenbridges-python and stored your credentials in a config file. Let's import the official sevenbridges-python bindings.

In [None]:
import sevenbridges as sbg

Let's initialize the api object so the API knows our credentials.

In [None]:
# [USER INPUT] specify platform {cgc, sbg}
prof = 'cgc'


config_file = sbg.Config(profile=prof)
api = sbg.Api(config=config_file)
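
`sbg.Config(profile=prof)` looks up a named profile in your credentials file. The sketch below assumes the INI-style layout described in the sevenbridges-python documentation: one section per profile, with `api_endpoint` and `auth_token` keys, typically stored in `~/.sevenbridges/credentials`. Treat the path, key names, and endpoint as assumptions to check against your installed version. The helper `profile_is_complete` is hypothetical; it only parses an example string with the standard-library `configparser` and never touches your real file.

```python
import configparser

# Example contents only; replace the placeholder with your own token in the real file
EXAMPLE_CREDENTIALS = """
[cgc]
api_endpoint = https://cgc-api.sbgenomics.com/v2
auth_token = <your-auth-token>
"""

def profile_is_complete(text, profile):
    """Return True if `profile` exists and defines both expected keys."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(profile):
        return False
    required = ("api_endpoint", "auth_token")
    return all(parser.has_option(profile, key) for key in required)

print(profile_is_complete(EXAMPLE_CREDENTIALS, "cgc"))  # True
print(profile_is_complete(EXAMPLE_CREDENTIALS, "sbg"))  # False
```

A check like this can help explain a failed `sbg.Config` call when the profile name in the notebook does not match a section in the file.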

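Create a new project. The cell below charges the first billing group on your account and stops if a project with the same name already exists.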

In [None]:
# [USER INPUT] Set project name here:
new_project_name = 'CCLE Files'                          
      
    
# What are my funding sources?
billing_groups = api.billing_groups.query()  

# Pick the first group (arbitrary)
print((billing_groups[0].name + \
       ' will be charged for computation and storage (if applicable) for your new project'))

# Set up the information for your new project
new_project = {
        'billing_group': billing_groups[0].id,
        'description': """A project created by the API recipe (projects_makeNew.ipynb).
                      This also supports **markdown**
                      _Pretty cool_, right?
                   """,
        'name': new_project_name
}

# check if this project already exists. LIST all projects and check for name match
my_project = [p for p in api.projects.query(limit=100).all() \
              if p.name == new_project_name]      
              
if my_project:    # exploit fact that empty list is False, {list, tuple, etc} is True
    print('A project with the name (%s) already exists, please choose a unique name' \
          % new_project_name)
    raise KeyboardInterrupt
else:
    # CREATE the new project
    my_project = api.projects.create(name = new_project['name'], \
                                     billing_group = new_project['billing_group'], \
                                     description = new_project['description'])
    
    # (re)list all projects, and get your new project
    my_project = [p for p in api.projects.query(limit=100).all() \
              if p.name == new_project_name][0]

Copy the files into the new project. The file IDs returned by the Datasets API are transferable to the platform API, so they can be used directly to look up and copy each file.

In [None]:
def copyToProject(api, my_project, finalFiles):
    my_files = api.files.query(limit = 100, project = my_project.id).all()
    # Collect the names of the files already in the project
    my_file_names = [f.name for f in my_files]
    newFiles = []
    for currFile in finalFiles:
        if currFile["label"] in my_file_names:
            print('File %s already exists in the project, skipping it' % currFile["label"])
        else:
            # Look up the file by its ID, then copy it into the project
            fileObject = api.files.get(id = currFile['id'])
            my_new_file = fileObject.copy(project = my_project.id, name = fileObject.name)
            newFiles.append(my_new_file)
    print("Files Imported!")
    return newFiles

copied_files = copyToProject(api, my_project, files)
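
The name check inside `copyToProject` can be tried out without touching the platform. The sketch below separates that decision into a hypothetical pure function, `split_by_existing_name`, which takes Datasets API records (dictionaries with a `label` key, as used above) and the names already present in the project. The example records are made up for illustration.

```python
def split_by_existing_name(candidates, existing_names, name_key="label"):
    """Split records into those to copy and those already present by name."""
    existing = set(existing_names)  # set lookup is faster than scanning a list
    to_copy, skipped = [], []
    for item in candidates:
        if item[name_key] in existing:
            skipped.append(item)
        else:
            to_copy.append(item)
    return to_copy, skipped

# Made-up records standing in for Datasets API results
example_files = [{"id": "1", "label": "a.bam"}, {"id": "2", "label": "b.bam"}]
to_copy, skipped = split_by_existing_name(example_files, ["b.bam"])
print([f["id"] for f in to_copy])   # ['1']
print([f["id"] for f in skipped])   # ['2']
```

Running this on the real query results before copying gives a preview of how many files would be transferred.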