In [1]:
import requests
import loompy

### Set up server URL
Change the URL here to point the client at another implementation of the API. The examples here are based on the server running at:
http://felcat.caltech.edu/rnaget

In [2]:
server_params = {'caltech-edu': {'rnaget_url': "http://felcat.caltech.edu/rnaget",
                                 'project_tag': "PCAWG",
                                 'headers': {'User-Agent': 'python-requests/2.21.0',
                                             'Accept-Encoding': 'gzip, deflate',
                                             'Accept': '*/*',
                                             'Connection': 'keep-alive'}
                                },
                 'crg-cat': {'rnaget_url': "https://genome.crg.cat/rnaget",
                             'project_tag': "cancer",
                             'headers': {'User-Agent': 'python-requests/2.21.0',
                                         'Accept-Encoding': 'gzip, deflate',
                                         'Accept': '*/*',
                                         'Connection': 'keep-alive',
                                         'Authorization': 'Bearer abcdefuvwxyz'}
                            }
                }

# use this to select the target server.  This should be the only knob to turn - everything
# from here on should use the API
rnaget_params = server_params['caltech-edu']
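
Every request below builds its URL by formatting the base URL with an endpoint path. As a minimal sketch, a small helper can do this in one place and tolerate a trailing slash in the configured URL. The name `rnaget_endpoint` and the `example_params` dictionary are hypothetical and not part of the original notebook.

```python
def rnaget_endpoint(params, path):
    """Join the configured base URL and an API path with exactly one slash."""
    base = params["rnaget_url"].rstrip("/")
    return "{}/{}".format(base, path.lstrip("/"))


# Stand-in for one entry of server_params (note the trailing slash).
example_params = {"rnaget_url": "http://felcat.caltech.edu/rnaget/"}
projects_url = rnaget_endpoint(example_params, "/projects")
print(projects_url)
```

With this in place, a call such as `requests.get(rnaget_endpoint(rnaget_params, "projects"), ...)` would be equivalent to the `format` calls used in the cells below.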

### Project
The demonstration dataset is the PCAWG dataset. For security, the server will not return all available projects in one unfiltered request. To get the project we have to go through the search endpoint with a reasonable guess at the tag.

In [3]:
payload = {"tags": rnaget_params['project_tag']}
r = requests.get("{}/projects".format(rnaget_params['rnaget_url']), params=payload,
                 headers=rnaget_params['headers'])

The response is a JSON list with one object per matching project.

In [4]:
r.json()

[{'tags': ['PCAWG', 'cancer'],
  'version': '1.0',
  'name': 'PCAWG',
  'id': '43378a5d48364f9d8cf3c3d5104df560',
  'description': 'Pan Cancer Analysis of Whole Genomes test data from Expression Atlas E-MTAB-5423'}]

Here we extract the project ID.  With this we can drill down through the hierarchy and get related entries.

In [5]:
pcawg_project_id = r.json()[0]['id']
pcawg_project_id

'43378a5d48364f9d8cf3c3d5104df560'
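
Indexing with `[0]` assumes the search matched at least one record. If the tag guess is wrong, the list is presumably empty and the lookup fails with an unhelpful `IndexError`. The sketch below wraps the extraction in a hypothetical helper, `first_id`, that raises a clearer error. The sample data is copied from the response shown above.

```python
def first_id(records):
    """Return the 'id' of the first record, or raise if the search found nothing."""
    if not records:
        raise LookupError("search returned no records; check the tag or ID used")
    return records[0]["id"]


# Sample shaped like the /projects response shown above.
sample_projects = [{"tags": ["PCAWG", "cancer"],
                    "name": "PCAWG",
                    "id": "43378a5d48364f9d8cf3c3d5104df560"}]
print(first_id(sample_projects))
```

The same helper would apply to the study lookup in the next section, for example `first_id(r.json())`.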

### Study
With the PCAWG project ID, we can get the related study.  We use the same technique to extract the study ID.

In [6]:
payload = {"projectID": pcawg_project_id}
r = requests.get("{}/studies".format(rnaget_params['rnaget_url']), params=payload,
                 headers=rnaget_params['headers'])
pcawg_study_id = r.json()[0]['id']
pcawg_study_id

'6cccbbd76b9c4837bd7342dd616d0fec'

### Expression
With the study ID we can request a ticket for the related expression data. The `url` field of the ticket is the location from which to download the matrix.

In [7]:
payload = {"studyID": pcawg_study_id, "format": "loom"}
r = requests.get("{}/expressions/tickets".format(rnaget_params['rnaget_url']), params=payload,
                 headers=rnaget_params['headers'])
r.json()

{'tags': [],
 'version': '1.0',
 'fileType': 'loom',
 'studyID': '6cccbbd76b9c4837bd7342dd616d0fec',
 'units': 'TPM',
 'url': 'http://woldlab.caltech.edu/~sau/rnaget/E-MTAB-5423-query-results.tpms.loom',
 'id': '2a7ab5533ef941eaa59edbfe887b58c4'}

In [9]:
matrix_url = r.json()['url']
print(matrix_url)

http://woldlab.caltech.edu/~sau/rnaget/E-MTAB-5423-query-results.tpms.loom


## Working with the files
With the download URL we can stream the expression file to a local file and run whatever downstream analysis we wish. The following are some simple navigation examples.

The felcat.caltech.edu server is serving files in loom format.  To explore these proceed to **The loom file** below.

The genome.crg.cat server is serving files in tsv format. To explore these proceed to **The tsv file** below.

### The loom file
Now that we have the URL we can stream it to a local file.  The felcat server is providing a loom file so the loompy package can be used to explore it.

(This step may take a few moments, roughly 10-15 seconds on my connection, because we are downloading the entire PCAWG expression matrix.)

In [10]:
# static name for download of PCAWG example dataset
pcawg_loom_file = "full-loom-file.loom"

In [11]:
r = requests.get(matrix_url, stream=True, headers=rnaget_params['headers'])
with open(pcawg_loom_file, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)
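
The download loop above is repeated for every file in this notebook, and it writes whatever the server sends, including the body of an error response. A minimal sketch of a reusable version follows. The name `save_response` and the 64 KiB default chunk size are assumptions, not part of the original notebook. It relies only on the `raise_for_status` and `iter_content` methods of a `requests` response object.

```python
def save_response(response, path, chunk_size=64 * 1024):
    """Write a streamed response to `path` and return the number of bytes written."""
    # Fail early on 4xx/5xx instead of saving an error page as a .loom file.
    response.raise_for_status()
    written = 0
    with open(path, "wb") as fd:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                fd.write(chunk)
                written += len(chunk)
    return written


# Intended use with the names from this notebook (not executed here):
#   r = requests.get(matrix_url, stream=True, headers=rnaget_params['headers'])
#   save_response(r, pcawg_loom_file)
```

A larger chunk size than 128 bytes means far fewer loop iterations and write calls for a file of this size, without changing the result.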

A quick look at the loom file:

In [12]:
ds = loompy.connect(pcawg_loom_file)

In [13]:
ds.shape

(56717, 1350)

In [14]:
ds.ca.keys()

['Condition', 'Sample', 'Tissue']

In [15]:
ds.ca['Sample']

array(['DO221123 - primary tumour', 'DO221124 - primary tumour',
       'DO221127 - primary tumour', ..., 'DO27588 - primary tumour',
       'DO27636 - primary tumour', 'DO27747 - primary tumour'],
      dtype=object)

In [16]:
ds.ra.keys()

['GeneID', 'GeneName']

In [17]:
ds.ra['GeneName']

array(['TSPAN6', 'TNMD', 'DPM1', ..., 'AC008264.2', 'AP000229.1',
       'AC098479.1'], dtype=object)

### Slicing the loom file
A useful operation is to slice the loom file. This can be done by passing the study ID along with the rows and/or columns to return.

In [18]:
payload = {"studyID": pcawg_study_id,
           "format": "loom",
           "featureNameList": "TSPAN6"}
r = requests.get("{}/expressions/tickets".format(rnaget_params['rnaget_url']), params=payload,
                 headers=rnaget_params['headers'])
tspan_matrix_url = r.json()['url']
tspan_matrix_url

'https://woldlab.caltech.edu/~sau/rnaget/98500efecd8c465fb542e908eb5ec0b0_filtered_loom.loom'

Download it and verify that the matrix was sliced.

In [19]:
feature_loom_name = 'tspan-loom-file.loom'

In [20]:
r = requests.get(tspan_matrix_url, stream=True, headers=rnaget_params['headers'])
with open(feature_loom_name, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

In [21]:
ds2 = loompy.connect(feature_loom_name)

In [22]:
ds2.shape

(1, 1350)

In [23]:
ds2.ra['GeneName']

array(['TSPAN6'], dtype=object)

We can also select multiple features by passing a comma-separated list of names.

In [24]:
payload = {"studyID": pcawg_study_id,
           "format": "loom",
           "featureNameList": "TSPAN6,DPM1"}
r = requests.get("{}/expressions/tickets".format(rnaget_params['rnaget_url']), params=payload,
                 headers=rnaget_params['headers'])
multi_slice_matrix_url = r.json()['url']
multi_slice_matrix_url

'https://woldlab.caltech.edu/~sau/rnaget/425f98d741ff44bdb5d2d100a926d159_filtered_loom.loom'

In [25]:
multi_feature_loom_name = "sliced-loom.loom"

In [26]:
r = requests.get(multi_slice_matrix_url, stream=True, headers=rnaget_params['headers'])
with open(multi_feature_loom_name, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

In [27]:
ds3 = loompy.connect(multi_feature_loom_name)
ds3.ra['GeneName']

array(['TSPAN6', 'DPM1'], dtype=object)

The same thing works on the sample axis.

In [28]:
payload = {"studyID": pcawg_study_id,
           "format": "loom",
           "sampleIDList": "DO221124 - primary tumour"}
r = requests.get("{}/expressions/tickets".format(rnaget_params['rnaget_url']), params=payload,
                 headers=rnaget_params['headers'])
DO221124_matrix_url = r.json()['url']
DO221124_matrix_url

'https://woldlab.caltech.edu/~sau/rnaget/91dc3ebc2bf2428fb69b0809f7b66a92_filtered_loom.loom'

In [29]:
sample_loom_name = "221124-loom-file.loom"

In [30]:
r = requests.get(DO221124_matrix_url, stream=True, headers=rnaget_params['headers'])
with open(sample_loom_name, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

In [31]:
ds4 = loompy.connect(sample_loom_name)

In [32]:
ds4.shape

(56717, 1)

In [33]:
ds4.ca['Sample']

array(['DO221124 - primary tumour'], dtype=object)

In [34]:
ds4.ra['GeneName'][:3]

array(['TSPAN6', 'TNMD', 'DPM1'], dtype=object)
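
The slice requests above differ only in their query parameters, so the payload can be built by one function. This is a sketch under stated assumptions: the helper name `slice_payload` is hypothetical, the parameter names `featureNameList` and `sampleIDList` are taken from the cells above, and comma-joining several sample IDs is assumed to work the same way as it does for feature names. Sending both lists in one request is also an assumption.

```python
def slice_payload(study_id, features=None, samples=None, file_format="loom"):
    """Build the query parameters for a sliced /expressions/tickets request."""
    payload = {"studyID": study_id, "format": file_format}
    if features:
        # Feature names are sent as one comma-separated string.
        payload["featureNameList"] = ",".join(features)
    if samples:
        # Assumed to follow the same comma-separated convention.
        payload["sampleIDList"] = ",".join(samples)
    return payload


example_payload = slice_payload("6cccbbd76b9c4837bd7342dd616d0fec",
                                features=["TSPAN6", "DPM1"])
print(example_payload)
```

The result can be passed as `params=` to `requests.get` exactly as in the cells above.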

### Continuous signal data
The `continuous` RNA data type is designed to return a value for every base in a range. This is used for browser tracks and for some types of epigenomic data. It is a parallel endpoint to `expression`.

In [35]:
payload = {"studyID": pcawg_study_id,
           "format": "loom"}
r = requests.get("{}/continuous/tickets".format(rnaget_params['rnaget_url']), params=payload,
                 headers=rnaget_params['headers'])
r.json()

{'tags': ['cancer'],
 'version': '1.0',
 'fileType': 'loom',
 'studyID': '6cccbbd76b9c4837bd7342dd616d0fec',
 'units': 'counts',
 'url': '/woldlab/castor/home/sau/public_html/rnaget/signal-query-results.loom',
 'id': 'fa057c6d18c44960a1b8b49d065b3889'}
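
Note that the `url` in this response looks like a filesystem path on the server rather than an HTTP URL, so it presumably cannot be fetched with `requests.get` the way the expression matrix was. A minimal sketch of a check to run before attempting a download is shown below. The helper name `is_http_url` is hypothetical, and the two URLs are copied from the responses in this notebook.

```python
from urllib.parse import urlparse


def is_http_url(url):
    """True when `url` has an http(s) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


expression_url = "http://woldlab.caltech.edu/~sau/rnaget/E-MTAB-5423-query-results.tpms.loom"
continuous_url = "/woldlab/castor/home/sau/public_html/rnaget/signal-query-results.loom"
print(is_http_url(expression_url))  # True
print(is_http_url(continuous_url))  # False
```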