Import libraries:
> * **pandas** for loading data into spreadsheet-like objects called dataframes
> * **requests** for *HTTP requests*, i.e., interacting with web APIs, accessing databases, etc.
> * **json** to work with *JavaScript Object Notation*, a format commonly used for web responses
> * **io** to treat the HTTP response as an in-memory stream of bytes that can be read like a file
> * **tarfile** to read the tar archive that the download request returns
> * **re** (regular expressions) to search within strings and retrieve parts of them

In [None]:
import pandas as pd
import requests
import json
import io
import tarfile
import re

* Set up the base URL and the endpoint,
* and preview the response to check which fields are available (mappable).

In [None]:
defaultUrl = "https://api.gdc.cancer.gov/"

endpoint = input("Enter the endpoint to be accessed:")

url = defaultUrl + endpoint.lower()

response = requests.get(url)

data = response.json()

print(json.dumps(data, indent=2))

Enter the endpoint to be accessed:files/_mapping
{
  "_mapping": {
    "files.access": {
      "description": "",
      "doc_type": "files",
      "field": "access",
      "full": "files.access",
      "type": "keyword"
    },
    "files.acl": {
      "description": "",
      "doc_type": "files",
      "field": "acl",
      "full": "files.acl",
      "type": "keyword"
    },
    "files.analysis.analysis_id": {
      "description": "",
      "doc_type": "files",
      "field": "analysis.analysis_id",
      "full": "files.analysis.analysis_id",
      "type": "keyword"
    },
    "files.analysis.analysis_type": {
      "description": "",
      "doc_type": "files",
      "field": "analysis.analysis_type",
      "full": "files.analysis.analysis_type",
      "type": "keyword"
    },
    "files.analysis.created_datetime": {
      "description": "",
      "doc_type": "files",
      "field": "analysis.created_datetime",
      "full": "files.analysis.created_datetime",
      "type": "keyword"
  

Construct the parameters for the request to the `files` endpoint as a dictionary object (representing a JSON input) with the following keys (keys and scalar values are passed as strings):

* `"filters"`, with its value as a nested dictionary containing:
> an **operator** field (`"op"`) set to one of the following:
>> 1. *comparison* operators, such as `"=", "!=", "<", "<=", ">", ">="`
>>
>> and the *logical* operators `"and"` and `"or"`, which relate two or more filter dictionaries nested in the `"content"` key
>>
>> 2. `"is"` and `"not"` operators to test whether a field's value is missing or not
>>
>> 3. `"in"` and `"exclude"` operators to return results whose values are, or are not, respectively, in the list given for the `"value"` key
>
> a **content** field (`"content"`) that contains either a *nested dictionary* with:
>> 1. a `"field"` key paired with the property you want to filter on for whichever endpoint you're querying (projects, cases, files, data (for downloading), etc.)
>>
>> 2. a `"value"` key paired with the value(s), as a list, that you want that property to hold
>
> **or** a *list of further nested* filter dictionaries (used with `"and"` and `"or"`)

* `"format"`, to choose the response format from:
> 1. **JSON**, handled as a dictionary in Python,
> 2. **TSV**, or Tab Separated Values,
> 3. **XML**, or Extensible Markup Language,

* `"fields"`, to specify which of the properties to return when querying metadata/information,

* `"size"` to set a maximum number of results to include in the response.

*(The steps above let us build a request that, when passed to the `json=` parameter of the `.post()` method, returns the UUIDs of the files matching the specified filters.)*
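
As a minimal sketch of the structure described above, the two helpers below build the nested filter dictionaries so that each condition fits on one line. `eq_filter` and `and_filters` are hypothetical names introduced here for illustration; they are not part of the GDC API or of `requests`. The field names and values are the ones used in the next cell, and the `"size"` of `"10"` is an arbitrary choice.

```python
import json


def eq_filter(field, values):
    """Return one "=" filter dictionary for a field and a list of accepted values."""
    return {"op": "=", "content": {"field": field, "value": list(values)}}


def and_filters(*conditions):
    """Combine several filter dictionaries under a single "and" operator."""
    return {"op": "and", "content": list(conditions)}


# Two of the conditions from the notebook, written with the hypothetical helpers
example_filters = and_filters(
    eq_filter("cases.project.project_id", ["TCGA-BRCA"]),
    eq_filter("files.access", ["open"]),
)

# The full payload has the same shape as the params dictionary built by hand
example_params = {
    "filters": example_filters,
    "format": "json",
    "fields": "file_id",
    "size": "10",
}

print(json.dumps(example_params, indent=2))
```

The resulting dictionary could be passed to `requests.post(..., json=example_params)` in the same way as the hand-written `params` dictionary.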


In [None]:
# Set up the URL to query the /files endpoint for metadata about the files

fileIdObtain_URL = "https://api.gdc.cancer.gov/files"

# Construct the filters that the returned files must satisfy

filters = {
  "op": "and",
  "content": [
    {
      "op": "=",
      "content": {
        "field": "cases.project.project_id",
        "value": ["TCGA-BRCA"]
      }
    },
    {
      "op": "=",
      "content": {
        "field": "files.access",
        "value": ["open"]
      }
    },
    {
      "op": "=",
      "content": {
        "field": "files.data_format",
        "value": ["TSV"]
      }
    },
    {
      "op": "=",
      "content": {
        "field": "files.data_category",
        "value": ["Transcriptome Profiling"]
      }
    },
    {
      "op": "=",
      "content": {
        "field": "files.data_type",
        "value": ["Gene Expression Quantification"]
      }
    },
    {
      "op": "=",
      "content": {
        "field": "cases.samples.sample_type",
        "value": ["Solid Tissue Normal"]
      }
    },
    {
      "op": "=",
      "content": {
        "field": "cases.diagnoses.synchronous_malignancy",
        "value": ["No"]
      }
    },
    {
      "op": "=",
      "content": {
        "field": "cases.diagnoses.prior_malignancy",
        "value": ["No"]
      }
    },
    {
      "op": "=",
      "content": {
        "field": "cases.diagnoses.prior_treatment",
        "value": ["No"]
      }
    }
  ]
}

# Specify the other parameters

params = {
    "filters": filters,
    "format": "json",
    "fields": "file_id",
    "size": "1000000"
}

# Initialise a list to hold the UUIDs of the files we want to obtain

file_uuid_list = []

# Query the URL we set up with the params dict passed to the json= parameter of the .post function
# (used in place of the .get because the request we're sending contains a complex json payload)

response = requests.post(fileIdObtain_URL, json=params)

# For every dictionary nested in the list at the "hits" key
# of the dictionary nested in the value for the "data" key
# in the JSON response formatted as a dictionary,
# extract the value at the "file_id" key
# and add it to file_uuid_list

for file_entry in response.json()["data"]["hits"]:
    file_uuid_list.append(file_entry["file_id"])

# Print the list of UUIDs

print(file_uuid_list)

# For comparison, print the JSON response formatted as a python dictionary

print(json.dumps(response.json(), indent=2))


['456bc30b-59f8-4427-b798-5b113ca635a0', 'c6a4afd8-8044-475f-b4fd-a1b4cb922976', '18b0bd2d-505f-4ecb-9bea-b52e6a74cebd', '75182885-7501-49b1-bb0d-8a88da1080a3', '52151eca-7819-496e-bf31-4875b68d429d', 'd7a48283-c113-4745-be6b-553966e6b457', '2bf56d2d-8c5e-4579-847b-03fd0ba46143', 'eddb2dc6-2b72-43a8-a7fc-3dd09dde68af', '699e47b8-5396-43eb-927d-8d05b0e79644', 'e84b9f1a-0def-425b-86c1-c143b27509b5', 'c440cfb1-33ec-4be4-a4a4-4aea9a66d021', 'bd239a8e-56e2-45ca-bc44-bff98b72c1d6', 'b9eb33d4-1017-42cf-b72c-f917de5425e7', 'ddb8fb65-cf53-41a0-8acd-a538a8754fa5', '8b000fae-e6de-4038-8486-45316cae622d', '8ebe0bf6-11fa-418d-918c-5c73f0e7e9ac', 'dafb2454-1ea7-4cb6-8b3b-2b5b6f19aa89', '248ac510-a5af-4608-9cdc-5c8673633b82', '63adf5e2-d5cb-4937-a0ee-f2914e130b25', 'a930c017-565f-49ad-893a-37cc394c269d', '165fe176-2dea-489f-92d6-34b0c6848312', '1479c033-ebe7-423d-8460-bbe84fd5ffb6', '958813f4-8036-42f7-856d-7a69c4175adc', 'f11c5295-697a-46f7-91fe-b760dc9f1029', '38854c85-fc09-4a51-93a7-257762517583',

By *default*, the GDC API returns a **compressed tar.gz** archive when a download of more than one file is requested.
> However, if we append `"?tarfile"` to the URL of the data endpoint, it returns an uncompressed tar bundle, so we *don't have to decompress it*.
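
The difference between the two archive forms can be illustrated without contacting the API. The sketch below builds a small archive in memory twice, once uncompressed and once gzip-compressed, and opens both with `mode="r:*"`, which lets `tarfile` detect the compression by itself. The member name and its content are made up for the example, and `build_archive` and `list_members` are hypothetical helpers.

```python
import io
import tarfile


def build_archive(mode):
    """Create an in-memory tar archive holding one made-up TSV file."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as tar:
        payload = b"gene_id\ttpm_unstranded\n"
        info = tarfile.TarInfo(name="sample/example.tsv")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)  # rewind so the archive can be read from the start
    return buffer


def list_members(buffer):
    """Open an archive with transparent compression detection and list its members."""
    with tarfile.open(fileobj=buffer, mode="r:*") as tar:
        return tar.getnames()


plain_names = list_members(build_archive("w"))    # uncompressed tar
gzip_names = list_members(build_archive("w:gz"))  # gzip-compressed tar.gz

print(plain_names)
print(gzip_names)
```

Because `"r:*"` handles both cases, reading code written this way works whether or not the archive is compressed.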

In [None]:
# Set up the URL with the /data?tarfile endpoint to download the files

uncompressed_bundled_dataEndPt = "https://api.gdc.cancer.gov/data?tarfile"

# Pass the UUID list we constructed to the "ids" key in the params dictionary

params = {
    "ids": file_uuid_list
}

# Query the URL we set up, passing the new params dict with the file UUIDs to the json= parameter

response = requests.post(uncompressed_bundled_dataEndPt, json=params)

# Create an in-memory binary file object (nothing is written to disk)
# that holds the content of the response

downloaded_data = io.BytesIO(response.content)

The following is the data pipeline we'll use to process the files, so that we can compare the variance in TPM and FPKM values across samples:

1. **Open** the tarball bundle of files,


2. **Loop through each** file, starting at the first, and carry out the following steps:
> 1. Set the *first* column, containing gene IDs, as the **index**
>
> 2. For every gene in the file, extract the value in:
>> * first the `"tpm_unstranded"` column, and add it **to the dataframe for TPM**, such that each *row represents a gene* and each column holds its TPM in a different sample
>>
>> * then the `"fpkm_unstranded"` column, adding that to the FPKM dataframe in the **same pattern**.
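
The pipeline above can be sketched on synthetic data before running it on the real download. The two "files" below are made-up strings with a reduced, hypothetical column layout (a comment line, a header, one summary row with no values, and two genes); only the column names `tpm_unstranded` and `fpkm_unstranded` are taken from the description above. Unlike the cell that follows, this sketch selects columns by name and reads each file once, which is one possible variant rather than the notebook's own method.

```python
import io

import pandas as pd

# Made-up file contents keyed by a made-up sample identifier
sample_files = {
    "sample_1": (
        "# hypothetical comment line\n"
        "gene_id\tgene_name\ttpm_unstranded\tfpkm_unstranded\n"
        "N_unmapped\t\t\t\n"
        "GENE_A\tA\t1.5\t0.5\n"
        "GENE_B\tB\t2.5\t1.0\n"
    ),
    "sample_2": (
        "# hypothetical comment line\n"
        "gene_id\tgene_name\ttpm_unstranded\tfpkm_unstranded\n"
        "N_unmapped\t\t\t\n"
        "GENE_A\tA\t3.0\t1.5\n"
        "GENE_B\tB\t4.0\t2.0\n"
    ),
}

tpm_columns = []
fpkm_columns = []

for sample_id, text in sample_files.items():
    # Skip comment lines, keep three columns, and index the rows by gene ID
    table = pd.read_csv(
        io.StringIO(text),
        sep="\t",
        comment="#",
        usecols=["gene_id", "tpm_unstranded", "fpkm_unstranded"],
        index_col="gene_id",
    )
    # One column per sample, named after the sample
    tpm_columns.append(table["tpm_unstranded"].rename(sample_id))
    fpkm_columns.append(table["fpkm_unstranded"].rename(sample_id))

# Genes as rows, samples as columns; rows with missing values are dropped
tpm_example = pd.concat(tpm_columns, axis=1).dropna()
fpkm_example = pd.concat(fpkm_columns, axis=1).dropna()

print(tpm_example)
print(fpkm_example)
```

The summary row has no TPM or FPKM value, so `dropna()` removes it and only the two gene rows remain.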



In [None]:
# Lists that collect one pandas Series (one column) per file
tpm_columns_list = []
fpkm_columns_list = []

# Rewind the in-memory file, then let tarfile detect the archive type ("r:*")
downloaded_data.seek(0)
with tarfile.open(fileobj=downloaded_data, mode="r:*") as tar:
    for i, member in enumerate(tar, start=1):
        if member.name.endswith(".tsv"):
            print(f"{i}. Extracting: {member.name}, size: {round(member.size / (1024 * 1024), 2)} MB")

            file = tar.extractfile(member)

            # Use the part of the file name before the first "." as the column name
            file_name = member.name.split("/")[-1]
            file_name_splitting = re.match(r"^([^.]+)", file_name)
            file_uuid = file_name_splitting.group(1)

            # Read column 0 (gene IDs) and column 6 (TPM) as a Series indexed by gene ID
            series_tpm = pd.read_csv(
                file,
                sep="\t",
                header=0,
                usecols=[0, 6],
                names=["gene_id", "TPM"],
                comment="#",
                dtype={"gene_id": str},
                index_col="gene_id").squeeze()

            series_tpm.name = file_uuid

            # Rewind the same file to read it a second time
            file.seek(0)

            # Read column 0 (gene IDs) and column 7 (FPKM) in the same way
            series_fpkm = pd.read_csv(
                file,
                sep="\t",
                header=0,
                usecols=[0, 7],
                names=["gene_id", "FPKM"],
                comment="#",
                dtype={"gene_id": str},
                index_col="gene_id").squeeze()

            series_fpkm.name = file_uuid

            tpm_columns_list.append(series_tpm)
            fpkm_columns_list.append(series_fpkm)

    print(f'Total files: {len(tar.getmembers())}')


# Join the per-file columns side by side and drop rows with missing values
tpm_df = pd.concat(tpm_columns_list, axis=1).dropna()
fpkm_df = pd.concat(fpkm_columns_list, axis=1).dropna()

2. Extracting: 040d00da-2bc9-49d4-bf6c-b4515b6a2bbf/d4f91697-1c39-4398-bf2f-85217a22ddff.rna_seq.augmented_star_gene_counts.tsv, size: 4.05 MB
3. Extracting: 046fe29e-7c99-4093-8565-f2f205a00796/61f811cf-9dc1-4f48-ad62-c3ebfd1f0847.rna_seq.augmented_star_gene_counts.tsv, size: 4.05 MB
4. Extracting: 0972f396-9045-4faa-98f8-e8c3e02f9901/37157138-01ef-49a5-bc74-c45feaf411e2.rna_seq.augmented_star_gene_counts.tsv, size: 4.06 MB
5. Extracting: 0d11c50a-8648-48ea-a107-e138a1d3e086/bf3ea4a0-bcd6-4e9d-acbb-3416f6ce53b7.rna_seq.augmented_star_gene_counts.tsv, size: 4.04 MB
6. Extracting: 0e89f9f0-419f-46e7-981d-781a1302e4be/98f19a00-fd01-46b1-a75d-f66522ebde2d.rna_seq.augmented_star_gene_counts.tsv, size: 4.04 MB
7. Extracting: 1320db11-22a5-417f-8ec7-65c0bf4681a2/9926a02f-9fa7-42c9-bfa1-e6ec45018fed.rna_seq.augmented_star_gene_counts.tsv, size: 4.05 MB
8. Extracting: 1479c033-ebe7-423d-8460-bbe84fd5ffb6/967fde0b-9c38-4c69-97f2-9640ebe1dc9a.rna_seq.augmented_star_gene_counts.tsv, size: 4.06 MB

Check the dataframes to ensure they've been preprocessed correctly before we carry out our analysis steps, with:
> * `.shape`, to display the **number of rows and columns**
>
> * `.iloc[:x, :y]`, to display the **first x rows and the first y columns**
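
A small sketch of these checks on toy data, assuming nothing about the real matrices: the gene and sample labels and all values are made up, and the FPKM frame is derived from the TPM frame by an arbitrary factor purely to have a second frame with the same layout. It also adds one extra check that is not in the notebook, confirming that both frames share the same row and column labels.

```python
import pandas as pd

# Toy TPM matrix: genes as rows, samples as columns (all values are made up)
toy_tpm = pd.DataFrame(
    {"sample_1": [1.5, 2.5, 3.5], "sample_2": [4.0, 5.0, 6.0]},
    index=pd.Index(["GENE_A", "GENE_B", "GENE_C"], name="gene_id"),
)

# Second toy matrix with the same layout (the factor is arbitrary)
toy_fpkm = toy_tpm / 2

# .shape gives (number of rows, number of columns)
print(toy_tpm.shape)

# .iloc[:2, :1] keeps the first 2 rows and the first 1 column
preview = toy_tpm.iloc[:2, :1]
print(preview)

# Extra check: both matrices should describe the same genes and samples
same_genes = toy_tpm.index.equals(toy_fpkm.index)
same_samples = toy_tpm.columns.equals(toy_fpkm.columns)
print(same_genes, same_samples)
```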

In [None]:
print("TPM matrix:",  tpm_df.shape)
print("FPKM matrix:", fpkm_df.shape)
print("TPM:")
display(tpm_df.iloc[:25, :25])

print("FPKM:")
display(fpkm_df.iloc[:25, :25])

TPM matrix: (60660, 111)
FPKM matrix: (60660, 111)
TPM:


gene_id,d4f91697-1c39-4398-bf2f-85217a22ddff,61f811cf-9dc1-4f48-ad62-c3ebfd1f0847,37157138-01ef-49a5-bc74-c45feaf411e2,bf3ea4a0-bcd6-4e9d-acbb-3416f6ce53b7,98f19a00-fd01-46b1-a75d-f66522ebde2d,9926a02f-9fa7-42c9-bfa1-e6ec45018fed,967fde0b-9c38-4c69-97f2-9640ebe1dc9a,467bd108-f391-4dd1-92de-43879822400d,16e1fcb7-863d-40c1-b2d6-365d519148f6,678aa892-0631-47cc-b19e-cdcef42cecb3,...,d23ff3bc-5524-4dca-b0f4-a1561a30566c,5dfb4024-ca51-4296-9fc7-0037b7b1e27b,69303772-638d-4171-84d2-cfa535256976,fb3713b9-fad5-4d66-b419-8f53530b14cd,bd1f12ab-ee49-4e7e-aad4-14924c49d306,1cbbe329-a471-425e-b14d-e9043fce6926,da549530-7ba5-4302-acd2-40a766860216,0a950719-ca7a-4dbf-8755-262cabf3d4b7,4da20b00-8ed6-41ed-88d9-9969d15761c7,c058aa43-349b-4f45-8217-6b193bf7c610
ENSG00000000003.15,45.9824,77.1714,52.4591,80.2665,64.4346,68.3142,131.3761,75.3017,70.0615,90.2373,...,66.9001,77.5386,43.0014,58.4397,44.5582,55.0109,70.0346,103.0264,67.3136,86.9116
ENSG00000000005.6,37.0512,8.3699,13.7006,8.5532,41.7691,30.9113,7.5982,8.362,16.2751,6.7653,...,72.9793,110.806,31.6191,2.9499,18.5232,4.7416,5.916,7.9248,15.6231,46.6355
ENSG00000000419.13,78.0841,81.3918,73.7749,117.2099,88.8489,98.6978,98.8816,91.0519,70.7344,97.9863,...,111.9876,76.4689,65.9192,102.1078,79.5989,123.2829,117.4734,109.0865,94.4963,107.2129
ENSG00000000457.14,9.2776,9.8335,8.3744,11.6211,11.1609,11.1053,19.3951,18.8496,9.7836,18.5271,...,10.9101,7.348,5.9159,16.5362,7.7207,26.6012,20.2739,19.9034,16.399,16.8192
ENSG00000000460.17,1.4015,1.8052,2.5084,3.4706,3.1388,3.3004,4.6964,4.0655,3.4433,4.0972,...,3.0322,1.9413,1.2737,4.0842,2.0354,6.3034,4.0675,5.8266,3.4367,4.8315
ENSG00000000938.13,9.1242,6.5091,11.9813,9.8024,33.9159,7.5613,10.9468,5.9892,11.1783,10.9976,...,24.8802,5.1402,2.9,6.846,15.234,6.9882,9.7128,10.3577,6.224,14.6143
ENSG00000000971.16,20.5844,8.6628,32.3818,60.1291,39.8891,33.1138,39.3028,38.6294,75.3753,14.2467,...,71.9791,110.4566,40.4136,26.3179,57.6448,14.2753,12.2971,23.6751,36.0174,46.2473
ENSG00000001036.14,44.3797,29.7359,31.9208,55.9374,38.2603,52.4396,36.738,31.3686,42.1712,29.444,...,68.552,39.2939,23.2154,43.5818,42.1548,31.8399,33.5947,38.1506,44.1885,42.9804
ENSG00000001084.13,8.0623,13.5573,16.6655,16.7514,18.6092,20.8848,21.9084,14.1776,24.4927,17.2846,...,15.2806,35.0076,26.8915,17.8017,15.397,19.5966,18.4889,20.558,18.2775,24.2
ENSG00000001167.14,18.6356,33.7614,29.2547,32.6968,36.5263,29.3157,60.2435,46.2701,37.138,49.5914,...,34.0588,19.0612,23.5234,45.3365,23.5096,51.279,58.2414,49.6867,51.8045,48.1951


FPKM:


gene_id,d4f91697-1c39-4398-bf2f-85217a22ddff,61f811cf-9dc1-4f48-ad62-c3ebfd1f0847,37157138-01ef-49a5-bc74-c45feaf411e2,bf3ea4a0-bcd6-4e9d-acbb-3416f6ce53b7,98f19a00-fd01-46b1-a75d-f66522ebde2d,9926a02f-9fa7-42c9-bfa1-e6ec45018fed,967fde0b-9c38-4c69-97f2-9640ebe1dc9a,467bd108-f391-4dd1-92de-43879822400d,16e1fcb7-863d-40c1-b2d6-365d519148f6,678aa892-0631-47cc-b19e-cdcef42cecb3,...,d23ff3bc-5524-4dca-b0f4-a1561a30566c,5dfb4024-ca51-4296-9fc7-0037b7b1e27b,69303772-638d-4171-84d2-cfa535256976,fb3713b9-fad5-4d66-b419-8f53530b14cd,bd1f12ab-ee49-4e7e-aad4-14924c49d306,1cbbe329-a471-425e-b14d-e9043fce6926,da549530-7ba5-4302-acd2-40a766860216,0a950719-ca7a-4dbf-8755-262cabf3d4b7,4da20b00-8ed6-41ed-88d9-9969d15761c7,c058aa43-349b-4f45-8217-6b193bf7c610
ENSG00000000003.15,15.6546,23.728,17.1887,23.824,19.3257,20.9001,34.421,20.1276,21.2464,25.2229,...,20.3263,25.2538,16.496,16.4625,14.4503,15.1857,18.5367,27.6631,17.2596,23.1603
ENSG00000000005.6,12.614,2.5735,4.4891,2.5387,12.5277,9.457,1.9908,2.2351,4.9355,1.891,...,22.1733,36.0887,12.1296,0.831,6.0071,1.3089,1.5658,2.1278,4.0059,12.4275
ENSG00000000419.13,26.5835,25.0257,24.173,34.7892,26.6482,30.1957,25.9073,24.3375,21.4505,27.3889,...,34.0252,24.9054,25.2877,28.7639,25.8141,34.0321,31.0928,29.2903,24.2294,28.5701
ENSG00000000457.14,3.1585,3.0235,2.7439,3.4493,3.3475,3.3976,5.0816,5.0384,2.9669,5.1787,...,3.3148,2.3932,2.2694,4.6583,2.5038,7.3432,5.3661,5.3442,4.2048,4.482
ENSG00000000460.17,0.4771,0.555,0.8219,1.0301,0.9414,1.0097,1.2305,1.0867,1.0442,1.1452,...,0.9213,0.6323,0.4886,1.1505,0.6601,1.74,1.0766,1.5645,0.8812,1.2875
ENSG00000000938.13,3.1063,2.0014,3.9258,2.9094,10.1723,2.3133,2.8681,1.6009,3.3899,3.074,...,7.5594,1.6741,1.1125,1.9285,4.9404,1.9291,2.5708,2.7811,1.5959,3.8944
ENSG00000000971.16,7.0079,2.6636,10.6102,17.847,11.9638,10.1309,10.2975,10.3253,22.8578,3.9822,...,21.8694,35.9749,15.5033,7.4138,18.6944,3.9407,3.2548,6.3569,9.2351,12.324
ENSG00000001036.14,15.1089,9.1429,10.4591,16.6028,11.4753,16.0434,9.6255,8.3846,12.7886,8.2301,...,20.8282,12.7977,8.9058,12.2771,13.6709,8.7894,8.8918,10.2436,11.3302,11.4534
ENSG00000001084.13,2.7448,4.1685,5.4606,4.972,5.5814,6.3895,5.7401,3.7896,7.4275,4.8314,...,4.6427,11.4017,10.316,5.0148,4.9933,5.4096,4.8936,5.5199,4.6865,6.4488
ENSG00000001167.14,6.3444,10.3807,9.5856,9.7048,10.9553,8.9689,15.784,12.3677,11.2622,13.8617,...,10.3481,6.2081,9.024,12.7713,7.6242,14.1555,15.4153,13.3411,13.283,12.843
