# Use TCGA metadata to access files

This document describes how to find and download TCGA files by filtering the file metadata with a SPARQL query.

### Prerequisites:

1. A [CGC account](http://docs.cancergenomicscloud.org/docs/sign-up-for-the-cgc).
2. [Controlled Data access](http://docs.cancergenomicscloud.org/docs/tcga-data-access) so that you can use all TCGA data on the CGC.
3. A Python library that lets you query a SPARQL endpoint programmatically. The code below uses [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper), which you can install using `pip install SPARQLWrapper`, but any other SPARQL library should do.
4. Familiarity with the CGC API. For this, please take a look at [the documentation](http://docs.cancergenomicscloud.org/docs/the-cgc-api). 
  
### Notes

This example uses Python 2.7 but can be trivially adapted to use Python 3.x.


## Issuing SPARQL queries programmatically

The following query retrieves the list of files we want to process. To run it, we need to:

1. Create our SPARQL query
2. Define an endpoint
3. Send the query to the server
4. Grab the results


The query returns files from cases of the disease 'Lung Adenocarcinoma' with the histological diagnosis 'Lung Adenocarcinoma Mixed Subtype', in which the patients are alive, have a days-to-last-follow-up value greater than 550, and have received chemotherapy. We further specify that the sample is taken from the primary tumor and that the file was produced using the experimental strategy WXS (Whole Exome Sequencing).

In [None]:
from SPARQLWrapper import SPARQLWrapper, JSON
import json

# Use the public endpoint

sparql_endpoint = "https://opensparql.sbgenomics.com/blazegraph/namespace/tcga_metadata_kb/sparql"

# Initialize the SPARQL wrapper with the endpoint
sparql = SPARQLWrapper(sparql_endpoint)

query = """
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix tcga: <https://www.sbgenomics.com/ontologies/2014/11/tcga#>

select distinct ?case ?sample ?file_name ?path ?xs_label ?subtype_label
where
{
 ?case a tcga:Case .
 ?case tcga:hasDiseaseType ?disease_type .
 ?disease_type rdfs:label "Lung Adenocarcinoma" .
 
 ?case tcga:hasHistologicalDiagnosis ?hd .
 ?hd rdfs:label "Lung Adenocarcinoma Mixed Subtype" .
 

 

 
 ?case tcga:hasFollowUp ?follow_up .
 ?follow_up tcga:hasDaysToLastFollowUp ?days_to_last_follow_up .
 filter(?days_to_last_follow_up>550) 
  
 ?follow_up tcga:hasVitalStatus ?vital_status .
 ?vital_status rdfs:label ?vital_status_label .
 filter(?vital_status_label="Alive")
 
 ?case tcga:hasDrugTherapy ?drug_therapy .
 ?drug_therapy tcga:hasPharmaceuticalTherapyType ?pt_type .
 ?pt_type rdfs:label ?pt_type_label .
 filter(?pt_type_label="Chemotherapy")
  
 ?case tcga:hasSample ?sample .
 ?sample tcga:hasSampleType ?st .
 ?st rdfs:label ?st_label .
 filter(?st_label="Primary Tumor")
     
 ?sample tcga:hasFile ?file .
 ?file rdfs:label ?file_name .
 
 ?file tcga:hasStoragePath ?path.
  
 ?file tcga:hasExperimentalStrategy ?xs.
 ?xs rdfs:label ?xs_label .
 filter(?xs_label="WXS")
  
 ?file tcga:hasDataSubtype ?subtype .
 ?subtype rdfs:label ?subtype_label

}





"""


sparql.setQuery(query)

sparql.setReturnFormat(JSON)
results = sparql.query().convert()



# From the results, we grab the list of file paths returned by the TCGA metadata database.
filelist = [result['path']['value'] for result in results['results']['bindings']]





In [None]:
# The file paths are now in the filelist list, as shown below
print('Your query returned %s files with paths:' % len(filelist))

for path in filelist:
    print(path)
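
The list comprehension above keeps only the `path` column. The sketch below shows how the same kind of response could be unpacked into one dictionary per row, so that `file_name`, `xs_label`, and the other selected variables stay attached to each path. It assumes the standard SPARQL JSON results layout (`results`, then `bindings`, then one `value` per variable); the helper name `bindings_to_rows` and the sample response are illustrative and contain made-up values.

```python
def bindings_to_rows(results):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    rows = []
    for binding in results['results']['bindings']:
        # Each binding maps a variable name to {'type': ..., 'value': ...}
        rows.append({var: cell['value'] for var, cell in binding.items()})
    return rows


# Hypothetical response with made-up values, shaped like a SPARQL JSON result
sample_results = {
    'head': {'vars': ['file_name', 'path', 'xs_label']},
    'results': {'bindings': [
        {'file_name': {'type': 'literal', 'value': 'example_1.bam'},
         'path': {'type': 'literal', 'value': 'example/path/1'},
         'xs_label': {'type': 'literal', 'value': 'WXS'}},
        {'file_name': {'type': 'literal', 'value': 'example_2.bam'},
         'path': {'type': 'literal', 'value': 'example/path/2'},
         'xs_label': {'type': 'literal', 'value': 'WXS'}},
    ]},
}

rows = bindings_to_rows(sample_results)
for row in rows:
    print('%s -> %s' % (row['file_name'], row['path']))
```

With the real query, `bindings_to_rows(results)` would play the same role, and `[row['path'] for row in rows]` would reproduce `filelist`.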


## Use the CGC API to download a file

Prerequisites:

1. An account on CGC with access to [TCGA Controlled Data](http://docs.cancergenomicscloud.org/docs/tcga-data-access).
2. Your authentication token used to access the CGC API. This is available at https://cgc.sbgenomics.com/account#developer.

In order to download a file, we need to do the following:

1. Map the file paths we get from the TCGA metadata database to the file IDs used on the CGC.
2. Get each file's download URL from the API.
3. Use a download program such as `wget` or `aria2c` to download the files.

In [None]:
# The following script uses the Python requests library to make a small wrapper around the CGC API
import json
import pprint
import requests

SEPARATOR = '-' * 116

def api(api_url, path, auth_token, method='GET', query=None, data=None):
    """Send a request to the CGC API, print the response, and return its parsed JSON body."""
    # Serialise dict or list payloads to JSON; any other payload is sent as an empty body
    data = json.dumps(data) if isinstance(data, (dict, list)) else None

    headers = {
        'X-SBG-Auth-Token': auth_token,
        'Accept': 'application/json',
        'Content-type': 'application/json',
    }

    response = requests.request(method, api_url + path, params=query, data=data, headers=headers)
    print('URL: %s' % response.url)
    print('RESPONSE CODE: %s' % response.status_code)
    print(SEPARATOR)
    # An empty response body is treated as an empty dict
    response_dict = json.loads(response.content) if response.content else {}
    response_headers = dict(response.headers)

    pprint.pprint(response_headers)
    print(SEPARATOR)
    pprint.pprint(response_dict)
    return response_dict
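
To make the request that the wrapper builds easier to inspect, the sketch below assembles the same URL, headers, and JSON body without sending anything over the network. `build_request` is a hypothetical helper introduced here for illustration only; the token and the file paths are placeholders.

```python
import json


def build_request(api_url, path, auth_token, data=None):
    """Return the URL, headers and body that the wrapper above would send."""
    # Only dicts and lists are serialised; anything else results in an empty body
    body = json.dumps(data) if isinstance(data, (dict, list)) else None
    headers = {
        'X-SBG-Auth-Token': auth_token,
        'Accept': 'application/json',
        'Content-type': 'application/json',
    }
    # The path is appended directly to the base URL, so the base must end with '/'
    return {'url': api_url + path, 'headers': headers, 'body': body}


request = build_request(
    api_url='https://cgc-api.sbgenomics.com/v2/',
    path='action/files/get_download_url',
    auth_token='PLACEHOLDER_TOKEN',                # not a real token
    data=['example/path/1', 'example/path/2'],     # made-up file paths
)
print(request['url'])
print(request['body'])
```

Printing the assembled request this way can help when a call returns an unexpected response code, since it shows exactly which URL and payload were produced.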

### Prepare the API endpoint and authentication token

The base URL for the CGC API is https://cgc-api.sbgenomics.com/v2/

Your CGC API authentication token can be retrieved from https://cgc.sbgenomics.com/account#developer. Enter your token in the code below, replacing `<YOUR TOKEN HERE>`.

In [None]:
# API base URL
base = 'https://cgc-api.sbgenomics.com/v2/' 

auth_token = '<YOUR TOKEN HERE>'



In [None]:
# Get download data for each of the files 

# Note that here we use a special purpose API call on CGC as described on 
# http://docs.cancergenomicscloud.org/v1.0/docs/get-a-files-download-url

download_urls = api(
    api_url=base,
    auth_token=auth_token,
    path='action/files/get_download_url',
    method='POST',
    query=None,
    data=filelist,
)



### Download files

We'll write the download URLs to a file, `download.txt`, which can then be used to download the files with a tool such as [wget](https://www.gnu.org/software/wget/) or [aria2](http://aria2.sourceforge.net/).



In [None]:
# Write one download URL per line; the with-statement closes the file automatically
with open('download.txt', 'w') as outfile:
    for url in download_urls:
        outfile.write(url)
        outfile.write('\n')


Now we can use the list of download links to obtain the files:

1. Using `wget`: `wget --content-disposition -i download.txt`
2. Using `aria2`: `aria2c -i download.txt --file-allocation=none`
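
If neither tool is available, the same list could in principle be processed from Python. The following is a minimal sketch using only the standard library, not a replacement for `wget` or `aria2c`: it has no retries or parallelism, and it names each local file after the last path segment of its URL, which is an assumption (the `wget` command above takes the name from the `Content-Disposition` header instead). The helper names `read_url_list`, `local_name`, and `download_all` are hypothetical.

```python
import os

try:
    from urllib.request import urlretrieve   # Python 3
    from urllib.parse import urlparse
except ImportError:
    from urllib import urlretrieve            # Python 2.7
    from urlparse import urlparse


def read_url_list(list_path):
    """Return the non-empty lines of a URL list file such as download.txt."""
    with open(list_path) as handle:
        return [line.strip() for line in handle if line.strip()]


def local_name(url):
    """Derive a local file name from the path component of a URL."""
    return os.path.basename(urlparse(url).path)


def download_all(list_path, dest_dir='.'):
    """Download every URL listed in list_path into dest_dir; return the saved paths."""
    saved = []
    for url in read_url_list(list_path):
        target = os.path.join(dest_dir, local_name(url))
        urlretrieve(url, target)   # streams the response to disk
        saved.append(target)
    return saved
```

Calling `download_all('download.txt')` would then save each listed file into the current directory.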