In [1]:
import requests
import loompy

### Set up server URL
Change the URL here to point the client at other implementations of the API.  The examples here are based on the server running at:
http://felcat.caltech.edu/rnaget

In [3]:
# uncomment the following based on the target implementation

# felcat.caltech.edu

#rnaget_url = "http://felcat.caltech.edu/rnaget"
#project_tag = "PCAWG"
#headers = {'User-Agent': 'python-requests/2.21.0',
#           'Accept-Encoding': 'gzip, deflate',
#           'Accept': '*/*',
#           'Connection': 'keep-alive'}


# server 2
rnaget_url = "https://genome.crg.cat/rnaget"
project_tag = "cancer"
headers = {'User-Agent': 'python-requests/2.21.0',
           'Accept-Encoding': 'gzip, deflate',
           'Accept': '*/*',
           'Connection': 'keep-alive',
           'Authorization': 'Bearer abcdefuvwxyz'}
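
The bearer token in the cell above is a hard-coded placeholder. As a minimal sketch, the same headers could be built with the token read from the environment instead. The helper `build_headers` and the variable name `RNAGET_TOKEN` are hypothetical and not part of the rnaget API.

```python
import os


def build_headers(token=None):
    """Return request headers, adding Authorization only when a token is given."""
    headers = {
        'User-Agent': 'python-requests/2.21.0',
        'Accept-Encoding': 'gzip, deflate',
        'Accept': '*/*',
        'Connection': 'keep-alive',
    }
    if token:
        headers['Authorization'] = 'Bearer {}'.format(token)
    return headers


# RNAGET_TOKEN is an assumed environment variable name, used for illustration only
example_headers = build_headers(os.environ.get('RNAGET_TOKEN'))
```

A server that needs no authentication simply gets the headers without the `Authorization` entry.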

### Project
The demonstration dataset is the PCAWG dataset.  For security, the server does not return all available results in one request.  To find the project, we query the search endpoint with a reasonable guess at the tag.

In [4]:
payload = {"tags": project_tag}
r = requests.get("{}/projects/search".format(rnaget_url), params=payload, headers=headers)

The response is a JSON array of project objects:

In [5]:
r.json()

[{'description': 'The Pancancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in more than 2,800 cancer whole genomes from the International Cancer Genome Consortium.',
  'id': 'E-MTAB-5200',
  'name': 'Pancancer Analysis of Whole Genomes',
  'tags': ['bulk', 'RNA-seq', 'human', 'cancer'],
  'version': '1.0'},
 {'description': 'Single-Cell Analysis of Human Pancreas Reveals Transcriptional Signatures  of Aging and Somatic Mutation Patterns',
  'id': 'E-GEOD-81547',
  'name': 'Single cell transcriptome analysis of human pancreas',
  'tags': ['single-cell', 'RNA-seq', 'human', 'cancer'],
  'version': '1.0'}]

Here we extract the project ID.  With this we can drill down through the hierarchy and get related entries.

In [6]:
pcawg_project_id = r.json()[0]['id']
pcawg_project_id

'E-MTAB-5200'
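
Taking element `[0]` relies on the order in which the server returns matches. A minimal sketch of selecting the project by name instead is shown here. The helper `find_project` is hypothetical, and the list is a trimmed copy of the response shown above.

```python
def find_project(projects, name_fragment):
    """Return the first project whose name contains name_fragment (case-insensitive)."""
    fragment = name_fragment.lower()
    for project in projects:
        if fragment in project.get('name', '').lower():
            return project
    raise LookupError('no project matching {!r}'.format(name_fragment))


# Trimmed copy of the search response shown above
projects = [
    {'id': 'E-MTAB-5200', 'name': 'Pancancer Analysis of Whole Genomes'},
    {'id': 'E-GEOD-81547', 'name': 'Single cell transcriptome analysis of human pancreas'},
]

print(find_project(projects, 'pancancer')['id'])  # E-MTAB-5200
```

In the notebook, the list would be `r.json()` rather than the literal used here.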

### Study
With the PCAWG project ID, we can get the related study.  We use the same technique to extract the study ID.

In [9]:
payload = {"projectID": pcawg_project_id}
r = requests.get("{}/studies/search".format(rnaget_url), params=payload, headers=headers)
pcawg_study_id = r.json()[0]['id']
pcawg_study_id

'E-MTAB-5200-ST0'
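
Each level of the hierarchy is queried the same way: a `search` path under the resource name, with the parent ID passed as a query parameter. The sketch below composes such a URL with the standard library, without sending a request, so the pattern is visible. The helper `search_url` is hypothetical. For simple string values like these, `requests` should produce an equivalent query string from `params`.

```python
from urllib.parse import urlencode


def search_url(base_url, resource, **filters):
    """Compose the URL for a <resource>/search query without sending it."""
    return '{}/{}/search?{}'.format(base_url.rstrip('/'), resource, urlencode(filters))


print(search_url('https://genome.crg.cat/rnaget', 'studies', projectID='E-MTAB-5200'))
# https://genome.crg.cat/rnaget/studies/search?projectID=E-MTAB-5200
```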

### Expression
With the study ID we can get the related expression objects.  This can now be used to get the URL to download the matrix.

In [10]:
payload = {"studyID": pcawg_study_id}
r = requests.get("{}/expressions/search".format(rnaget_url), params=payload, headers=headers)
r.json()

[{'URL': 'https://genome.crg.cat/rnaget/data/E-MTAB-5200.tpms.tsv',
  'fileType': 'tsv',
  'id': '8beada7b93d5e55aa557138b39c6f930',
  'studyID': 'E-MTAB-5200-ST0',
  'tags': ['bulk', 'cancer', 'human', 'RNA-seq']}]

In [11]:
matrix_url = r.json()[0]['URL']
print(matrix_url)

https://genome.crg.cat/rnaget/data/E-MTAB-5200.tpms.tsv


## Working with the files
With the download URL we can stream the expression file to a local file and run whatever downstream analysis we wish.  The following are some simple navigation examples.

The felcat.caltech.edu server is serving files in loom format.  To explore these proceed to **The loom file** below.

The genome.crg.cat server is serving files in tsv format.  To explore these proceed to **The tsv file** below.

### The loom file
Now that we have the URL we can stream it to a local file.  The felcat server is providing a loom file so the loompy package can be used to explore it.

(This step might take a few moments, roughly 10-15 seconds on my connection, as we are downloading the entire PCAWG expression matrix.)

In [9]:
# static name for download of PCAWG example dataset
pcawg_loom_file = "full-loom-file.loom"

In [10]:
r = requests.get(matrix_url, stream=True)
with open(pcawg_loom_file, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

A quick look at the loom file:

In [11]:
ds = loompy.connect(pcawg_loom_file)

In [12]:
ds.shape

(56717, 1350)

In [13]:
ds.ca.keys()

['Condition', 'Sample', 'Tissue']

In [14]:
ds.ca['Sample']

array(['DO221123 - primary tumour', 'DO221124 - primary tumour',
       'DO221127 - primary tumour', ..., 'DO27588 - primary tumour',
       'DO27636 - primary tumour', 'DO27747 - primary tumour'],
      dtype=object)

In [15]:
ds.ra.keys()

['GeneID', 'GeneName']

In [16]:
ds.ra['GeneName']

array(['TSPAN6', 'TNMD', 'DPM1', ..., 'AC008264.2', 'AP000229.1',
       'AC098479.1'], dtype=object)

### Slicing the loom file
A useful operation is to slice the loom file.  This can be done by passing the ID of the study along with the rows and/or columns to return.

In [17]:
payload = {"studyID": pcawg_study_id,
           "featureNameList": "TSPAN6"}
r = requests.get("{}/expressions/search".format(rnaget_url), params=payload, headers=headers)
tspan_matrix_url = r.json()[0]['URL']
tspan_matrix_url

'http://woldlab.caltech.edu/~sau/rnaget/e361b762b6544ba7a7b15f2ce7c3172f_filtered_loom.loom'

Download it and verify that the matrix was sliced.

In [18]:
feature_loom_name = 'tspan-loom-file.loom'

In [19]:
r = requests.get(tspan_matrix_url, stream=True, headers=headers)
with open(feature_loom_name, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

In [20]:
ds2 = loompy.connect(feature_loom_name)

In [21]:
ds2.shape

(1, 1350)

In [22]:
ds2.ra['GeneName']

array(['TSPAN6'], dtype=object)

We can also select multiple features:

In [23]:
payload = {"studyID": pcawg_study_id,
           "featureNameList": "TSPAN6,DPM1"}
r = requests.get("{}/expressions/search".format(rnaget_url), params=payload, headers=headers)
multi_slice_matrix_url = r.json()[0]['URL']
multi_slice_matrix_url

'http://woldlab.caltech.edu/~sau/rnaget/14329d0cb4a14a5e84309f8965b0c463_filtered_loom.loom'

In [24]:
multi_feature_loom_name = "sliced-loom.loom"

In [25]:
r = requests.get(multi_slice_matrix_url, stream=True)
with open(multi_feature_loom_name, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

In [26]:
ds3 = loompy.connect(multi_feature_loom_name)
ds3.ra['GeneName']

array(['TSPAN6', 'DPM1'], dtype=object)
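
The `featureNameList` value is a comma-separated string, so a Python list of gene names has to be joined before it is sent. A minimal sketch of a payload builder follows. `slice_payload` is a hypothetical helper, and the parameter keys are the ones used in this notebook's own queries (`studyID`, `featureNameList`, `sampleID`).

```python
def slice_payload(study_id, features=None, sample_id=None):
    """Build the query parameters for a sliced expressions/search request."""
    payload = {'studyID': study_id}
    if features:
        payload['featureNameList'] = ','.join(features)
    if sample_id:
        payload['sampleID'] = sample_id
    return payload


print(slice_payload('E-MTAB-5200-ST0', features=['TSPAN6', 'DPM1']))
# {'studyID': 'E-MTAB-5200-ST0', 'featureNameList': 'TSPAN6,DPM1'}
```

The result can be passed as `params=` to `requests.get`, exactly like the hand-written dictionaries above.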

The same thing on the sample axis:

In [27]:
payload = {"studyID": pcawg_study_id,
           "sampleID": "DO221124 - primary tumour"}
r = requests.get("{}/expressions/search".format(rnaget_url), params=payload, headers=headers)
DO221124_matrix_url = r.json()[0]['URL']
DO221124_matrix_url

'http://woldlab.caltech.edu/~sau/rnaget/E-MTAB-5423-query-results.tpms.loom'

In [28]:
sample_loom_name = "221124-loom-file.loom"

In [29]:
r = requests.get(DO221124_matrix_url, stream=True)
with open(sample_loom_name, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

In [30]:
ds4 = loompy.connect(sample_loom_name)

In [31]:
ds4.shape

(56717, 1350)

In [32]:
ds4.ca['Sample']

array(['DO221123 - primary tumour', 'DO221124 - primary tumour',
       'DO221127 - primary tumour', ..., 'DO27588 - primary tumour',
       'DO27636 - primary tumour', 'DO27747 - primary tumour'],
      dtype=object)

In [33]:
ds4.ra['GeneName'][:3]

array(['TSPAN6', 'TNMD', 'DPM1'], dtype=object)
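
In the output above, the file returned for the `sampleID` query still has shape `(56717, 1350)`, the same as the full matrix. It is therefore worth confirming a slice programmatically rather than assuming it was applied. The sketch below uses a hypothetical helper, `check_slice`, which works on any shape tuple and label sequence, for example `ds4.shape` and `ds4.ca['Sample']`.

```python
def check_slice(shape, labels, expected_labels, axis):
    """Return True if the given axis holds exactly the expected labels."""
    expected = list(expected_labels)
    return shape[axis] == len(expected) and list(labels) == expected


# Feature slice shown earlier: one gene row, all samples
print(check_slice((1, 1350), ['TSPAN6'], ['TSPAN6'], axis=0))  # True

# Sample query above: 1350 columns remain, so a one-sample slice was not applied
all_samples = ['DO221123 - primary tumour', 'DO221124 - primary tumour']  # truncated list
print(check_slice((56717, 1350), all_samples, ['DO221124 - primary tumour'], axis=1))  # False
```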

### The tsv file
Now that we have the URL we can stream it to a local file.

(This step might take a few moments, roughly 2-3 minutes on my connection, as we are downloading the entire PCAWG expression matrix.)

In [12]:
# static name for download of PCAWG example dataset
pcawg_tsv_file = "full-tsv-file.tsv"

In [13]:
r = requests.get(matrix_url, stream=True, headers=headers)
# iter_content yields bytes, so the file must be opened in binary mode
with open(pcawg_tsv_file, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

Let's take a look at the first few entries.  The tsv file has 4 header rows followed by the sample headings.

In [18]:
with open(pcawg_tsv_file, 'r') as fd:
    for i in range(4):
        fd.readline()
    samples = fd.readline()
    print(samples.split('\t')[:4])
    for i in range(3):
        print(fd.readline().split('\t')[:4])
    

['Gene ID', 'Gene Name', 'DO221123 - primary tumour, B-cell non-Hodgkin lymphoma, blood', 'DO221124 - primary tumour, B-cell non-Hodgkin lymphoma, blood']
['ENSG00000000003', 'TSPAN6', '4.0', '5.0']
['ENSG00000000005', 'TNMD', '', '0.4']
['ENSG00000000419', 'DPM1', '154.0', '90.0']
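
Note that missing values appear as empty strings, as in the `TNMD` row above. A minimal sketch of parsing the file into Python structures with the standard `csv` module is shown here. `read_expression_tsv` is a hypothetical helper, and the four header lines in the demonstration data are placeholders, not the real file contents.

```python
import csv
import io


def read_expression_tsv(fd, header_rows=4):
    """Parse an expression tsv into (sample_names, {gene_name: [values]})."""
    for _ in range(header_rows):  # skip the leading header rows
        fd.readline()
    reader = csv.reader(fd, delimiter='\t')
    heading = next(reader)  # 'Gene ID', 'Gene Name', then one column per sample
    samples = heading[2:]
    values = {}
    for row in reader:
        if not row:
            continue
        # empty cells become None instead of raising on float('')
        values[row[1]] = [float(v) if v else None for v in row[2:]]
    return samples, values


# Offline demonstration; the '# header' lines are placeholder content
demo = io.StringIO(
    '# header 1\n# header 2\n# header 3\n# header 4\n'
    'Gene ID\tGene Name\tS1\tS2\n'
    'ENSG00000000003\tTSPAN6\t4.0\t5.0\n'
    'ENSG00000000005\tTNMD\t\t0.4\n'
)
samples, values = read_expression_tsv(demo)
print(samples)         # ['S1', 'S2']
print(values['TNMD'])  # [None, 0.4]
```

For the downloaded file, the same function could be called on `open(pcawg_tsv_file, 'r')`. Keying by gene name assumes the names are unique, and later duplicates would overwrite earlier rows.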
