# 🧬 VPODC Cohort File Selection & Download Guide
### Using Gen3 Client for Automated File Retrieval
---

**Last Updated**: May 2025  
**Tools Used**: Python, Gen3 Client, Jupyter Notebook  

This notebook provides a step-by-step walkthrough for:
1. Selecting a VPODC cohort from `Gen3`.
2. Listing the object IDs of the associated VCF and pathology slide files.
3. Downloading files using the `gen3-client`.


---

## ⚙️ Install the Gen3 Python SDK

Before proceeding, we need to install the [Gen3](https://gen3.org) Python SDK.  
This SDK enables programmatic access to Gen3 commons, including authentication and data submission APIs.

The command below installs the SDK with pip and discards pip's output. Because the output is discarded, the confirmation message prints even if the installation fails.


In [1]:
# Install the Gen3 Python SDK via pip
print("Installing Gen3...")
!pip install --user --force-reinstall --upgrade gen3 --ignore-installed certifi > /dev/null 2>&1
print("✅ Gen3 installation complete.")

Installing Gen3...
✅ Gen3 installation complete.
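
Since pip's output is redirected to `/dev/null`, a failed install would go unnoticed. The following is a minimal sketch of one way to confirm that the package is actually present; the helper name `installed_version` is hypothetical and not part of the Gen3 SDK.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("gen3")
if version is None:
    print("gen3 is not installed; rerun pip without '> /dev/null 2>&1' to see the error.")
else:
    print(f"gen3 {version} is installed.")
```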


In [2]:
# Import some Python packages
import requests, json, fnmatch, os, os.path, sys, subprocess, glob, ntpath, copy, re
import pandas as pd
import gen3
from gen3.auth import Gen3Auth
from gen3.submission import Gen3Submission

## 📦 Configure Credentials and the Gen3 Client Profile

Make sure you have a valid profile and credentials configured in your Gen3 client.

In [3]:
# Set `creds` to the location of your credentials.json downloaded from the "Profile" page.
# This should be the only line you may need to edit for this notebook to work.
profile = 'vpodc'
api = 'https://vpodc.data-commons.org'
creds = '/home/jovyan/pd/vpodc-credentials.json'
client = '/home/jovyan/pd/.gen3/gen3-client'

auth = Gen3Auth(api, refresh_file=creds)
sub = Gen3Submission(api, auth)
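
Authentication problems often come down to a missing or malformed credentials file. As a rough sketch, the file can be checked before it is handed to `Gen3Auth`. The helper `credential_problems` is hypothetical, and the key names `api_key` and `key_id` are an assumption about the layout of the downloaded `credentials.json`.

```python
import json
import os


def credential_problems(path, required_keys=("api_key", "key_id")):
    """Return a list of problems found in a credentials file (empty if none)."""
    if not os.path.isfile(path):
        return [f"file not found: {path}"]
    try:
        with open(path) as handle:
            content = json.load(handle)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(content, dict):
        return ["expected a JSON object at the top level"]
    # Assumed key names; adjust if your credentials file differs.
    return [f"missing key: {key}" for key in required_keys if key not in content]
```

Calling `credential_problems(creds)` before the `Gen3Auth(...)` line would return an empty list when the file looks usable.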

In [4]:
# Download the gen3-client (for downloading files) and configure a profile 
!curl https://api.github.com/repos/uc-cdis/cdis-data-client/releases/latest | grep browser_download_url.*linux |  cut -d '"' -f 4 | wget -qi -
!unzip dataclient_linux.zip
!mkdir -p /home/jovyan/pd/.gen3
!mv gen3-client /home/jovyan/pd/.gen3
!rm dataclient_linux.zip

# Configure a profile
cmd = client +' configure --profile='+profile+' --apiendpoint='+api+' --cred='+creds
try:
    output = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode('UTF-8')
except subprocess.CalledProcessError as e:
    # Only CalledProcessError carries the command's captured output
    output = e.output.decode('UTF-8')
    print("ERROR:" + output)

# Check authorization privileges
cmd = client +' auth --profile='+profile
try:
    output = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode('UTF-8')
    #print("\n"+output)
except subprocess.CalledProcessError as e:
    output = e.output.decode('UTF-8')
    #print("ERROR:" + output)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   276  100   276    0     0   9399      0 --:--:-- --:--:-- --:--:--  9517
unzip:  cannot find or open dataclient_linux.zip, dataclient_linux.zip.zip or dataclient_linux.zip.ZIP.
mv: cannot stat 'gen3-client': No such file or directory
rm: cannot remove 'dataclient_linux.zip': No such file or directory
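
In the output above, `unzip`, `mv`, and `rm` all report that the expected files do not exist, so this run did not install a fresh client. The later cells still call the client at the configured path, which suggests a binary from an earlier run was already there. A small check like the sketch below can make that explicit before continuing; `client_is_usable` is a hypothetical helper.

```python
import os


def client_is_usable(path):
    """True if `path` is an existing regular file that the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)
```

For example, `client_is_usable(client)` should return `True` before any download cell is run.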


In [5]:
# Download some extra functions for interacting with APIs
!rm -f -- expansion.py
!wget https://raw.githubusercontent.com/cgmeyer/gen3sdk-python/master/expansion/expansion.py
%run ./expansion.py
exp = Gen3Expansion(api, auth, sub)

--2025-05-20 18:51:05--  https://raw.githubusercontent.com/cgmeyer/gen3sdk-python/master/expansion/expansion.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 235484 (230K) [text/plain]
Saving to: ‘expansion.py’


2025-05-20 18:51:05 (132 MB/s) - ‘expansion.py’ saved [235484/235484]



## 🔍 Query Primary Site Values from Gen3

We query all available values of `"PrimarysiteX"` from the `oncology_primary` node  
within the specified project (`VA-PODR-COHORT-A`) using the `paginate_query` function.

This function helps avoid timeouts by breaking the request into chunks.  
We specifically request the column:  
- `PrimarysiteX`  
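
To illustrate the idea of chunked retrieval, here is a generic offset-based pagination loop. It is not the implementation of `paginate_query`; `fetch_all` and `fetch_page` are hypothetical names, and a plain list stands in for the Gen3 API so the sketch runs offline.

```python
def fetch_all(fetch_page, chunk_size=10000):
    """Collect every record by requesting `chunk_size` rows at a time.

    `fetch_page(offset, limit)` is a hypothetical callable that returns a list
    of at most `limit` records starting at position `offset`.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(offset, chunk_size)
        records.extend(page)
        if len(page) < chunk_size:  # a short or empty page means the end was reached
            break
        offset += chunk_size
    return records


# Stand-in data source so the sketch runs without a Gen3 connection.
fake_rows = [{"PrimarysiteX": f"SITE {i}"} for i in range(25)]
rows = fetch_all(lambda offset, limit: fake_rows[offset:offset + limit], chunk_size=10)
print(len(rows))  # 25
```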


In [6]:
props = ['PrimarysiteX']
data = exp.paginate_query(node='oncology_primary',project_id='VA-PODR-COHORT-A',props=props,format='tsv', chunk_size=10000)


	Found 202 records in 'oncology_primary' node of project 'VA-PODR-COHORT-A'. 
	Records retrieved: 202 of 202 (100%), offset: 10000, chunk_size: 10000.                                                                                                                                

### 🫁 Filter for Lung-Related Sites

From the full list of `PrimarysiteX` values, we filter those containing the string `"LUNG"`.  
This gives us a focused list of lung-related primary tumor sites to use in downstream filtering.


In [7]:
sites = [x for x in set(data.PrimarysiteX) if isinstance(x, str)]
lung_sites = [i for i in sites if 'LUNG' in i]
lung_sites

['LUNG, LOWER LOBE',
 'LUNG, UPPER LOBE',
 'LUNG, MIDDLE LOBE',
 'LUNG, MAIN BRONCHUS',
 'LUNG NOS']
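
Missing values in a column loaded with pandas usually appear as `NaN` (a float) rather than `None`, and a substring test on a float raises a `TypeError`. The toy example below sketches a filter that tolerates both; the case-insensitive match is an optional variation, not something the notebook requires.

```python
# Toy stand-in for the unique `PrimarysiteX` values; the real ones come from Gen3.
toy_sites = ["LUNG NOS", None, "PROSTATE GLAND", "Lung, upper lobe", float("nan")]

# Keep only real strings so that None and NaN never reach the substring test.
text_sites = [site for site in toy_sites if isinstance(site, str)]
lung_like = sorted(site for site in text_sites if "LUNG" in site.upper())
print(lung_like)  # ['LUNG NOS', 'Lung, upper lobe']
```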

### 🧬 Retrieve Case Submitter IDs for Lung Cases

Using the filtered `lung_sites`, we call `paginate_query` again, this time on the `case` node, to retrieve the `submitter_id`s  
of cases that link to an `oncology_primary` record with a matching `PrimarysiteX` value.

Each `lung_site` value is used in a query loop to match corresponding entries in Gen3.
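
The filter string for each site is built by plain string substitution. As a sketch, the same string can be produced with `json.dumps`, which also escapes any quote or backslash inside the value. The helper `path_filter` is hypothetical, and it assumes the query endpoint accepts JSON-style string escapes.

```python
import json


def path_filter(node_type, prop, value):
    """Build a `with_path_to` filter string in the same shape the loop uses.

    json.dumps adds the surrounding double quotes and escapes quotes or
    backslashes inside the value, which "%s" substitution alone would not.
    """
    return "with_path_to:{type:%s, %s: %s}" % (json.dumps(node_type), prop, json.dumps(value))


print(path_filter("oncology_primary", "PrimarysiteX", "LUNG NOS"))
# with_path_to:{type:"oncology_primary", PrimarysiteX: "LUNG NOS"}
```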


In [8]:
props = ['submitter_id']
ids = []
for lung_site in lung_sites:
    args = 'with_path_to:{type:"oncology_primary", PrimarysiteX: "%s"}' % lung_site
    data = exp.paginate_query(node='case', project_id=None, props=props, args=args, format='tsv', chunk_size=10000)
    ids1 = list(set(data.submitter_id))
    ids += ids1
case_submitter_ids = list(set(ids))


	Found 1880 records in 'case' node of project 'None'. 
	Records retrieved: 1743 of 1880 (92%), offset: 10000, chunk_size: 10000.                                                                                                                               
	Found 4115 records in 'case' node of project 'None'. 
	Records retrieved: 3748 of 4115 (91%), offset: 10000, chunk_size: 10000.                                                                                                                               
	Found 318 records in 'case' node of project 'None'. 
	Records retrieved: 300 of 318 (94%), offset: 10000, chunk_size: 10000.                                                                                                                                 
	Found 231 records in 'case' node of project 'None'. 
	Records retrieved: 225 of 231 (97%), offset: 10000, chunk_size: 10000.                                                                                                           

In [9]:
len(case_submitter_ids)

6587

## 📄 Extracting Object IDs from Structured Data

We can download all the **structured data** from the data file nodes and use **Pandas** to filter the relevant `object_id`s for download.

#### 🔍 Steps Overview:
- 🛠️ Use the function `get_node_tsvs` to fetch structured data from a node.
- 📋 This function returns a **DataFrame** containing fields like:
  - `case_submitter_id`
  - `cases.submitter_id#1` (usually the same as `case_submitter_id`)
- 🎯 Filter the DataFrame to include only rows where `case_submitter_id` matches one in our predefined list.
- 📦 Extract the corresponding `object_id`s from those rows.
- ⬇️ Use the extracted `object_id`s to download the data files with `gen3-client`.

> ✅ This approach ensures you're only downloading the exact files tied to your case list.
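
The filtering step can be tried on a toy table first. The sketch below uses made-up case and object IDs; only the column names `case_submitter_id` and `object_id` are taken from this notebook.

```python
import pandas as pd

# Toy table with the two columns used by the filter; all IDs are made up.
toy_df = pd.DataFrame(
    {
        "case_submitter_id": ["C1", "C2", "C3", "C1"],
        "object_id": ["guid-a", "guid-b", "guid-c", "guid-d"],
    }
)
wanted_cases = ["C1", "C3"]

# Boolean mask: True where the row's case is in the list of wanted cases.
mask = toy_df["case_submitter_id"].isin(wanted_cases)
toy_object_ids = toy_df.loc[mask, "object_id"].tolist()
print(toy_object_ids)  # ['guid-a', 'guid-c', 'guid-d']
```

A case with several files contributes several `object_id`s, so the result can be longer than the list of cases.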


In [10]:
node = 'variant_call_file'
projects = ['VA-PODR-COHORT-A']
df = exp.get_node_tsvs(node,projects,overwrite=True)

df.head()


Output written to file: node_tsvs/variant_call_file_tsvs/VA-PODR-COHORT-A_variant_call_file.tsv
node_tsvs/variant_call_file_tsvs/VA-PODR-COHORT-A_variant_call_file.tsv has 152 records.
length of all dfs: 152
Master node TSV with 152 total records written to master_variant_call_file.tsv.


Unnamed: 0,type,id,project_id,submitter_id,data_category,data_format,data_type,file_name,file_size,md5sum,...,state_comment,variant_calling_workflow,aligned_reads_files.id,aligned_reads_files.submitter_id,cases.id,cases.submitter_id,core_metadata_collections.id,core_metadata_collections.submitter_id,unaligned_reads_files.id,unaligned_reads_files.submitter_id
0,variant_call_file,0034ce46-f41d-42ef-a03c-0333b8081a76,VA-PODR-COHORT-A,VA-PODR-COHORT-A_C179813604_2_ef1d,Simple Nucleotide Variation,VCF,Annotated Somatic Mutation,C179813604_2.vcf,15737,10931a53640eb6b4d10e5b3262678e61,...,,,,,6c845a82-c75d-4070-8368-d1d9f78f4624,C179813604,4254caef-09ea-4399-b4ee-bdc271f99037,VCF_collection-01,,
1,variant_call_file,0367675c-96e2-4626-83c9-da46f0eb9ad7,VA-PODR-COHORT-A,VA-PODR-COHORT-A_C2040737988_1_a1b4,Simple Nucleotide Variation,VCF,Annotated Somatic Mutation,C2040737988_1.vcf,1895,8ead48f026250bd47c1e2d4f984695f1,...,,,,,9355ddf9-be48-48dd-8b6a-9a9af64a2767,C2040737988,4254caef-09ea-4399-b4ee-bdc271f99037,VCF_collection-01,,
2,variant_call_file,03cbabb7-4680-4a25-80a8-fcf08b4d1898,VA-PODR-COHORT-A,VA-PODR-COHORT-A_C1361659370_1_8682,Simple Nucleotide Variation,VCF,Annotated Somatic Mutation,C1361659370_1.vcf,12818,dfd44a7ac69522f5e98949c07c83ce5c,...,,,,,b44eca18-0f91-4eef-a60d-0055fcd3b0ae,C1361659370,4254caef-09ea-4399-b4ee-bdc271f99037,VCF_collection-01,,
3,variant_call_file,066d3640-ed34-4d42-96cd-6370397fa588,VA-PODR-COHORT-A,VA-PODR-COHORT-A_C240403488_1_2240,Simple Nucleotide Variation,VCF,Annotated Somatic Mutation,C240403488_1.vcf,13778,811bd9abfbff9d69e0afa249f551ba88,...,,,,,9a439d4f-a852-44ed-afda-dcec5446b016,C240403488,4254caef-09ea-4399-b4ee-bdc271f99037,VCF_collection-01,,
4,variant_call_file,0d9f6a8f-0e97-466c-8ea4-33332464b81e,VA-PODR-COHORT-A,VA-PODR-COHORT-A_C179813604_3_b334,Simple Nucleotide Variation,VCF,Annotated Somatic Mutation,C179813604_3.vcf,16190,9d008680794aec1857340d0c441f7a30,...,,,,,6c845a82-c75d-4070-8368-d1d9f78f4624,C179813604,4254caef-09ea-4399-b4ee-bdc271f99037,VCF_collection-01,,


### 🧾 Extract Object IDs for Lung Cases

From the main structured DataFrame `df`, we filter for rows where the `case_submitter_id` is found in our list of lung cases.  
Then, we extract the corresponding `object_id` values, which will be used to download files tied to these records.


In [11]:
vcf_object_ids = list(df.loc[df['case_submitter_id'].isin(case_submitter_ids)]['object_id'])
len(vcf_object_ids)

65

### ⬇️ Download Files via Gen3 Client

We loop through the list of `object_id`s and use the `gen3-client` command-line tool to download each file.

- Files are saved to `Downloaded_Data/variant_call_files`, passed to the client via `--download-path`.
- The `--no-prompt` flag is used to skip overwrite confirmation.
- This process downloads one file per iteration based on its GUID.

> ⚠️ If files already exist in the destination folder, they will be silently overwritten unless `--rename` is set.
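
The loop can also be written with `subprocess`, which makes it possible to record which GUIDs failed. This is a sketch only: the helper names are hypothetical, the flags are the ones used in this notebook, and it assumes that the client reports a failed download through a non-zero exit status, which has not been verified here.

```python
import subprocess


def download_command(client, guid, profile, download_path):
    """Return the gen3-client call as an argument list (no shell quoting needed)."""
    return [
        client,
        "download-single",
        f"--guid={guid}",
        f"--profile={profile}",
        f"--download-path={download_path}",
        "--no-prompt",
    ]


def download_all(client, guids, profile, download_path):
    """Run one download per GUID and return the GUIDs whose command exited non-zero."""
    failed = []
    for guid in guids:
        result = subprocess.run(
            download_command(client, guid, profile, download_path),
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:  # assumption: non-zero exit means the download failed
            failed.append(guid)
    return failed
```

With the variables from this notebook, the call would look like `failed = download_all(client, vcf_object_ids, profile, dl_dir)`.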


In [12]:
dl_dir = '/home/jovyan/pd/Downloaded_Data/variant_call_files'
!mkdir -p $dl_dir

# Reuse the `client` and `profile` variables defined earlier instead of hard-coding them
for object_id in vcf_object_ids:
    !$client download-single --guid=$object_id --profile=$profile --download-path=$dl_dir --no-prompt

2025/05/20 18:51:12 Error occurred when checking for latest version: GET https://api.github.com/repos/uc-cdis/cdis-data-client/tags: 403 API rate limit exceeded for 3.86.93.34. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.) [rate reset in 18m15s]
2025/05/20 18:51:12 Total number of objects in manifest: 1
2025/05/20 18:51:12 Preparing file info for each file, please wait...
2025/05/20 18:51:12 File info prepared successfully
2025/05/20 18:51:12 1 files downloaded.
2025/05/20 18:51:13 Error occurred when checking for latest version: GET https://api.github.com/repos/uc-cdis/cdis-data-client/tags: 403 API rate limit exceeded for 3.86.93.34. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.) [rate reset in 18m14s]
2025/05/20 18:51:13 Total number of objects in manifest: 1
2025/05/20 18:51:13 Preparing file info for each file, please wait...
2025/0

### 🧮 Count Downloaded VCF Files

We perform a final check to verify how many `.vcf` files were successfully downloaded to the target directory.  
This uses the `find` command to search for files with the `.vcf` extension and counts them with `wc -l`.

> ✅ This helps confirm that the number of downloaded VCFs matches the number of expected `object_id`s.


In [13]:
!find $dl_dir -name '*.vcf' | wc -l

61
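
The count above (61) is lower than the 65 `object_id`s selected earlier, so some files may have failed to download or may share a file name. One way to find out which, sketched below, is to compare the expected file names with what is on disk. `missing_files` is a hypothetical helper, and it assumes the client saves each file under its original `file_name`.

```python
import os


def missing_files(expected_names, directory):
    """Return the expected file names that are not present anywhere under `directory`."""
    found = set()
    for _root, _dirs, files in os.walk(directory):
        found.update(files)
    return sorted(set(expected_names) - found)
```

With the variables from this notebook, the expected names could be taken from `df.loc[df['object_id'].isin(vcf_object_ids), 'file_name']` and passed together with `dl_dir`.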


### 🧫 Extract Structured Data for Pathology Slide Files

We now repeat the process for the `pathology_slide` node using the `get_node_tsvs()` function.

This function retrieves all available metadata for pathology slide files within the `VA-PODR-COHORT-A` project.  
The result is stored as a structured DataFrame (`df`) and saved as a `.tsv` file for reference.

💡 Key columns include:
- `submitter_id`: uniquely identifies the slide
- `data_format`: expected to be `JPEG`
- `file_name`, `file_size`: for download and quality assessment
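
Because `file_size` is part of the metadata, the total download volume can be estimated before any file is fetched. The sketch below parses a small made-up TSV with the standard library; only the column names are taken from the node export.

```python
import csv
import io

# Toy TSV using the same column names as the node export; the values are made up.
toy_tsv = (
    "file_name\tfile_size\tdata_format\n"
    "slide_1.jpg\t1000000\tJPEG\n"
    "slide_2.jpg\t2500000\tJPEG\n"
)

slide_rows = list(csv.DictReader(io.StringIO(toy_tsv), delimiter="\t"))
total_bytes = sum(int(row["file_size"]) for row in slide_rows)
formats = sorted({row["data_format"] for row in slide_rows})

print(f"{len(slide_rows)} files, {total_bytes / 1e6:.1f} MB, formats: {formats}")
# 2 files, 3.5 MB, formats: ['JPEG']
```

On a pandas DataFrame, `df['file_size'].sum()` gives the same total.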


In [14]:
node = 'pathology_slide'
df = exp.get_node_tsvs(node,projects='VA-PODR-COHORT-A',overwrite=True)
df.head()


Output written to file: node_tsvs/pathology_slide_tsvs/VA-PODR-COHORT-A_pathology_slide.tsv
node_tsvs/pathology_slide_tsvs/VA-PODR-COHORT-A_pathology_slide.tsv has 170 records.
length of all dfs: 170
Master node TSV with 170 total records written to master_pathology_slide.tsv.


Unnamed: 0,type,id,project_id,submitter_id,data_category,data_format,data_type,file_name,file_size,md5sum,...,study_date,study_description,study_id,study_instance_uid,cases.id,cases.submitter_id,core_metadata_collections.id,core_metadata_collections.submitter_id,samples.id,samples.submitter_id
0,pathology_slide,00dc89a5-33e9-4a84-98cb-b09bd0b21c4e,VA-PODR-COHORT-A,VA-PODR-COHORT-A_C434194520_1_56f7,,JPEG,,C434194520_1.jpg,4147004,2d6e7f28825a31e9b9a4908ef62d0539,...,,,,,,,92fb3e1a-68c4-40a0-8a28-b4294e5ba990,cohortA_batch10_pathology_slides,,
1,pathology_slide,0230ae05-e066-448e-a42b-32747357681e,VA-PODR-COHORT-A,VA-PODR-COHORT-A_C477240694_1_3976,,JPEG,,C477240694_1.jpg,448896,ff6b6394b752367b3cfab0a89522f676,...,,,,,4c7fd0df-bf01-439e-8a23-a4cf92b5ab87,C477240694,e70b4083-a7aa-49a8-b87e-0b65af215cea,Pathology_Slides_HDD5,,
2,pathology_slide,024efa6a-bc4d-45e2-836a-50ac53ed0c3f,VA-PODR-COHORT-A,VA-PODR-COHORT-A_C1797125648_2_70f3,,JPEG,,C1797125648_2.jpg,368135,4142f35cca4bc29f86715510fe7600a2,...,,,,,83cf9cc0-794d-4830-a374-e2e88081e04e,C1797125648,e70b4083-a7aa-49a8-b87e-0b65af215cea,Pathology_Slides_HDD5,,
3,pathology_slide,02655df4-778f-4562-b964-3c91940f2913,VA-PODR-COHORT-A,VA-PODR-COHORT-A_C2126691600_1_ab20,,JPEG,,C2126691600_1.jpg,3287790,3b052d6ca9adcbf94fdd34ce520b92a9,...,,,,,6651a3e6-97ce-48f1-ba5e-40c6aeb58cce,C2126691600,e70b4083-a7aa-49a8-b87e-0b65af215cea,Pathology_Slides_HDD5,,
4,pathology_slide,0485adc9-31cb-497d-93e3-32b761b5d5bf,VA-PODR-COHORT-A,VA-PODR-COHORT-A_C914031896_1_7eba,,JPEG,,C914031896_1.jpg,2543461,65379910e28fe7a661aee8a6099d1c09,...,,,,,8b3a65db-5e8e-4fe7-b7e5-9278660753e4,C914031896,e70b4083-a7aa-49a8-b87e-0b65af215cea,Pathology_Slides_HDD5,,


### 🔗 Match Pathology Slide Files to Lung Cases

We filter the `pathology_slide` DataFrame to include only those records whose `case_submitter_id` appears in our list of lung cases.  
This gives us a refined list of `object_id`s corresponding to pathology slide images linked to the lung cases.


In [15]:
slide_object_ids = list(df.loc[df['case_submitter_id'].isin(case_submitter_ids)]['object_id'])
len(slide_object_ids)

56

### ⬇️ Download Pathology Slide Files Using Gen3 Client

We loop through each `object_id` and download the associated pathology slide image using the `gen3-client`.

- Downloads are saved to a designated directory (`Downloaded_Data/pathology_slides`)
- The `--no-prompt` flag ensures the script proceeds without manual confirmation
- If the file already exists, it will be overwritten unless the `--rename` flag is set


In [16]:
dl_dir = '/home/jovyan/pd/Downloaded_Data/pathology_slides'
!mkdir -p $dl_dir

# Reuse the `client` and `profile` variables defined earlier instead of hard-coding them
for object_id in slide_object_ids:
    !$client download-single --guid=$object_id --profile=$profile --download-path=$dl_dir --no-prompt

2025/05/20 18:52:10 Error occurred when checking for latest version: GET https://api.github.com/repos/uc-cdis/cdis-data-client/tags: 403 API rate limit exceeded for 3.86.93.34. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.) [rate reset in 14m16s]
2025/05/20 18:52:10 Total number of objects in manifest: 1
2025/05/20 18:52:10 Preparing file info for each file, please wait...
2025/05/20 18:52:10 File info prepared successfully
2025/05/20 18:52:11 1 files downloaded.
2025/05/20 18:52:11 Error occurred when checking for latest version: GET https://api.github.com/repos/uc-cdis/cdis-data-client/tags: 403 API rate limit exceeded for 3.86.93.34. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.) [rate reset in 17m15s]
2025/05/20 18:52:11 Total number of objects in manifest: 1
2025/05/20 18:52:11 Preparing file info for each file, please wait...
2025/0

### 🧮 Verify Downloaded Files

After completing the download loop, we verify the total number of pathology slide images that were successfully saved.  
This command searches the download directory (`$dl_dir`) for all `.jpg` files and counts them using `wc -l`.

> This provides a quick sanity check to ensure the number of downloaded files matches the number of requested `object_id`s.


In [17]:
!find $dl_dir -name '*.jpg' | wc -l

56
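
A file count confirms that files exist, not that they are intact. Since the node metadata includes an `md5sum` column, a checksum comparison is a natural follow-up. The sketch below uses hypothetical helper names and assumes files are stored under their original `file_name`.

```python
import hashlib
import os


def md5_of_file(path, block_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def checksum_mismatches(expected, directory):
    """Return names whose file is missing or whose MD5 differs from `expected[name]`."""
    bad = []
    for name, md5sum in expected.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or md5_of_file(path) != md5sum:
            bad.append(name)
    return bad
```

The `expected` mapping could be built with `dict(zip(df['file_name'], df['md5sum']))` and checked against `dl_dir`.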


---

### 🔗 Useful Resources

- [Gen3 Client Documentation](https://gen3.org/resources/user/gen3-client/)  
- [Gen3 Commons Portal](https://gen3.datacommons.io/)  
- [VPODC Project Overview](https://vpodc.data-commons.org/)

---

> ⚠️ If you encounter download errors (e.g., HTTP 403), ensure your Gen3 credentials are correctly configured and your profile has appropriate access.  
> For additional support, contact your system administrator or the Gen3 support team.
