<a href="https://colab.research.google.com/github/cgmeyer/AI-Deep-Learning-Lab-2024/blob/midrc-cohort%2Fupdate_readme_nb/sessions/midrc-cohort/MIDRC_Cohort_Building_DLL_RSNA_2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cohort Building Using the MIDRC Data Commons and Biomedical Imaging Hub
---
This notebook briefly demonstrates how to use the MIDRC open APIs to build a cohort of MIDRC imaging studies using patient clinical data and AI-research-based annotations in the MIDRC data commons and then access and view the X-ray image files associated with those imaging studies.

It also demonstrates how to use the MIDRC Biomedical Imaging Hub (BIH) open metadata APIs to discover images across the Biomedical Data Fabric (BDF), including those in data resources other than the MIDRC data commons.

All cohort selection possible in the [MIDRC data explorer UI](https://data.midrc.org/explorer) and the [MIDRC BIH Explorer](https://imaging-hub.data-commons.org/Explorer) can also be achieved programmatically using API requests. In this notebook, we'll select the same cohort as in the data explorer demo detailed in [these slides](https://docs.google.com/presentation/d/1cMKyl-QWa2oM9HFnr0F7D83JaPx74GyFErz7O-gJjas/edit?usp=sharing).

by Chris Meyer, PhD

Manager of Data and User Services at the Center for Translational Data Science at University of Chicago

Presented at the MIDRC RSNA 2024 Deep Learning Lab on December 2, 2024

## 1) Set up Python environment
---


### Download an API key file containing your credentials
---

1) Navigate to the MIDRC data portal in your browser: https://data.midrc.org.

2) Read and accept the DUA (if you haven't already).

3) Navigate to the user profile page: https://data.midrc.org/identity

4) Click on the button "Create API Key" and save the `credentials.json` file somewhere safe as "midrc-credentials.json".
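
Before moving on, it can help to confirm that the saved file is readable JSON. The following is a minimal sketch, not part of the Gen3 SDK: the helper name `check_credentials_file` is hypothetical, and it assumes the API key file holds `api_key` and `key_id` entries.

```python
import json
from pathlib import Path


def check_credentials_file(path):
    """Return True if `path` is a JSON file holding the two expected keys."""
    p = Path(path)
    if not p.is_file():
        return False
    try:
        data = json.loads(p.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    # Assumption: Gen3 API key files contain "api_key" and "key_id" entries
    return isinstance(data, dict) and {"api_key", "key_id"} <= data.keys()
```

For example, `check_credentials_file("/content/midrc-credentials.json")` should return `True` once the file has been uploaded to Colab.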


### Set local variables
---
Change the following `cred` variable path to point to your credentials file downloaded from the MIDRC data portal following the instructions above.

In [None]:
cred = "/content/midrc-credentials.json" # location of your MIDRC credentials, downloaded from https://data.midrc.org/identity by clicking "Create API key" button and saving the credentials.json locally
api = "https://data.midrc.org" # The base URL of the data commons being queried. This shouldn't change for MIDRC.


### Install / Import Python Packages and Scripts

In [None]:
## The packages below may need to be installed to satisfy the imports in the subsequent cells.

import sys
#!{sys.executable} -m pip install
#!{sys.executable} -m pip install --upgrade pandas
#!{sys.executable} -m pip install --upgrade --ignore-installed PyYAML
#!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install --upgrade gen3
!{sys.executable} -m pip install pydicom
#!{sys.executable} -m pip install --upgrade Pillow
#!{sys.executable} -m pip install psmpy
#!{sys.executable} -m pip install python-gdcm --upgrade
#!{sys.executable} -m pip install pylibjpeg --upgrade

In [None]:
## Import Python Packages and scripts

import os, subprocess
import pandas as pd
import numpy as np
import pydicom
from PIL import Image
import glob
#import gdcm
#import pylibjpeg

# import some Gen3 packages
import gen3
from gen3.auth import Gen3Auth
from gen3.query import Gen3Query



### Initiate instances of the Gen3 SDK Classes using credentials file for authentication
---
Again, make sure the `cred` path variable set above reflects the location of your credentials file.

In [None]:
auth = Gen3Auth(api, refresh_file=cred) # authentication class
query = Gen3Query(auth) # query class


## 2) Build Cohorts by Sending Queries to the MIDRC APIs
#### General notes on sending queries:
* There are many ways to query and access metadata for cohort building in MIDRC, but this notebook will focus on using the [Gen3](https://gen3.org) graphQL query service ["guppy"](https://github.com/uc-cdis/guppy/#readme). This is the backend query service that [MIDRC's data explorer GUI](https://data.midrc.org/explorer) uses. So, anything you can do in the explorer GUI, you can do with guppy queries, and more!
* The guppy graphQL service has more functionality than is demonstrated in this simple example. You can find extensive documentation in GitHub [here](https://github.com/uc-cdis/guppy/blob/master/doc/queries.md) in case you'd like to build your own queries from scratch.
* The Gen3 SDK (initialized as `query` above in this notebook) has Python wrapper scripts to make sending queries to the guppy graphQL API simpler. The guppy SDK package can be viewed in GitHub [here](https://github.com/uc-cdis/gen3sdk-python/blob/master/gen3/query.py).
* Guppy queries focus on a particular type of data (cases, imaging studies, files, etc.), which corresponds to the major tabs in [MIDRC's data explorer GUI](https://data.midrc.org/explorer).
* Queries include arguments that are akin to selecting filter values in [MIDRC's data explorer GUI](https://data.midrc.org/explorer).
* To see more documentation about how to use and combine filters with various operator logic (like AND/OR/IN, etc.) see [this page](https://github.com/uc-cdis/guppy/blob/master/doc/queries.md#filter).
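
To illustrate how filters combine, here is a small sketch that builds guppy-style filter objects as plain Python dictionaries. The helper names (`eq`, `one_of`, `between`, `all_of`, `any_of`) are hypothetical conveniences, not part of the Gen3 SDK; the field names are the ones used later in this notebook.

```python
def eq(field, value):
    """Match records where `field` equals `value`."""
    return {"=": {field: value}}


def one_of(field, values):
    """Match records where `field` is any of `values`."""
    return {"IN": {field: list(values)}}


def between(field, low, high):
    """Match records where low <= field <= high."""
    return {"AND": [{">=": {field: low}}, {"<=": {field: high}}]}


def all_of(*clauses):
    """Every clause must match."""
    return {"AND": list(clauses)}


def any_of(*clauses):
    """At least one clause must match."""
    return {"OR": list(clauses)}


# Male cases whose study modality is DX or CR, written with an explicit OR
example_filter = all_of(
    eq("sex", "Male"),
    any_of(eq("study_modality", "DX"), eq("study_modality", "CR")),
)
print(example_filter)
```

A dictionary built this way can be passed as the `filter_object` argument of the query calls used in this notebook.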

---


#### Set query parameters
---
* Here, we'll send a query to the `imaging_study` guppy index, which corresponds to the "Imaging Studies" tab of [MIDRC's data explorer GUI](https://data.midrc.org/explorer).
* The filters defined below can be modified to return different subsets of imaging studies. Here, we'll use rather restrictive parameters so the number of studies returned is small for demonstration purposes.
* If our query request is successful, the API response should be in JSON format, and it should contain a list of imaging study UIDs along with any other study-related data we ask for.


In [None]:
### Set some "imaging_study" query parameters

## mRALE filter: we'll select all imaging studies annotated with an mRALE score greater than or equal to this threshold number
mRALE_threshold = 20

## days from study to positive COVID-19 test filter: we want imaging studies performed within two days after a positive test
min_days_from_study_to_test = -2
max_days_from_study_to_test = 0

## Imaging study modality filter: we select imaging studies with a modality of either DX or CR
study_modalities = ["DX", "CR"]

## Imaging study body part filter: here we select "chest" as the "LOINC system" filter, which is the body part examined
body_part_examined = "Chest"

## Case filters: we will select Hispanic males 70 years of age and older
ethnicity = "Hispanic or Latino"
sex = "Male"
age_threshold = 70

In [None]:
## Note: the "fields" option defines what fields we want the query to return. If set to "None", returns all available fields.

imaging_studies = query.raw_data_download(
                    data_type="imaging_study",
                    fields=None,
                    filter_object={
                        "AND": [
                            {"=": {"loinc_system": body_part_examined}},
                            {"=": {"sex": sex}},
                            {"=": {"ethnicity": ethnicity}},
                            {">=": {"age_at_index": age_threshold}},
                            {"IN": {"study_modality": study_modalities}},
                            {"nested": {"path": "imaging_study_annotations", ">=": {"midrc_mRALE_score": mRALE_threshold}}},
                            {"AND": [
                                {">=": {"days_from_study_to_pos_covid_test": min_days_from_study_to_test}},
                                {"<=": {"days_from_study_to_pos_covid_test": max_days_from_study_to_test}}
                            ]}
                        ]
                    },
                    sort_fields=[{"submitter_id": "asc"}]
                )

if len(imaging_studies) > 0:
    imaging_studies_ids = [i['submitter_id'] for i in imaging_studies if 'submitter_id' in i] ## make a list of the imaging study IDs returned
    print("Query returned {} study IDs from {} cases.".format(len(imaging_studies),len(set([i['case_ids'][0] for i in imaging_studies if 'case_ids' in i]))))
    print("Data is a list with rows like this:\n\t {}".format(imaging_studies[0:1]))
else:
    print("Your query returned no data! Please, check that query parameters are valid.")

In [None]:
imaging_studies_df = pd.DataFrame(imaging_studies)
display(imaging_studies_df)
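
Because the query result is a plain list of dictionaries, it can also be summarized without pandas. The sketch below counts imaging studies per case; `studies_per_case` is a hypothetical helper, and the records shown are placeholders rather than real MIDRC data.

```python
from collections import Counter


def studies_per_case(records):
    """Count how many study records reference each case ID."""
    counts = Counter()
    for rec in records:
        # "case_ids" is treated as a list, as in the query cell above;
        # records without it are skipped
        for case_id in rec.get("case_ids") or []:
            counts[case_id] += 1
    return counts


# Placeholder records for illustration only
example_records = [
    {"submitter_id": "study-1", "case_ids": ["case-A"]},
    {"submitter_id": "study-2", "case_ids": ["case-A"]},
    {"submitter_id": "study-3", "case_ids": ["case-B"]},
    {"submitter_id": "study-4"},
]
print(studies_per_case(example_records).most_common())
```

With the real results, `studies_per_case(imaging_studies)` would show which cases contribute more than one study to the cohort.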


## 3) Send another query to get data file details for our cohort / case ID
---
The `object_id` field in each imaging study record above contains the file identifiers for all files associated with each imaging study, which could include files like third-party annotations. If we simply want to access all files associated with our list of cases, we can use those object_ids.

However, in this example, we'll ask for specific types of files and get more detailed information about each of the files. This is achieved by querying the `data_file` guppy index, which corresponds to the "Data Files" tab of the MIDRC data explorer GUI.

All MIDRC data files, including both images and annotations, are listed in the guppy index "data_file", which is queried in a similar manner to our query of the `imaging_study` index above. The query parameter `data_type` below determines which guppy (Elasticsearch) index we're querying.

To get only `data_file` records that correspond to our imaging study cohort built previously, we'll use the list of study UIDs as a query filter.


### Set 'data_file' query parameters
---
Here, we'll utilize the property `source_node` to filter the list of files for our cohort to only those matching the type of files we're interested in. In this example, we ask only for CR and DX (x-ray) images, which will exclude any other types of files like annotations.

We're also using the property `study_uid` as a filter to restrict the `data_file` records returned down to those associated with the imaging studies in our cohort built above.


In [None]:
# Build a list of study UIDs to use as a filter in our data_file query
study_uids = [i['study_uid'] for i in imaging_studies]
study_uids

In [None]:
# Choose the types of data we want using "source_node" as a filter
source_nodes = ["cr_series_file","dx_series_file"]


In [None]:
## Search for specific files associated with our cohort by adding "study_uid" as a filter
# * Note: "fields" is set to "None" in this query, which by default returns all the properties available
data_files = query.raw_data_download(
                    data_type="data_file",
                    fields=None,
                    filter_object={
                        "AND": [
                            {"IN": {"study_uid": study_uids}},
                            {"IN": {"source_node": source_nodes}},
                        ]
                    },
                    sort_fields=[{"submitter_id": "asc"}]
                )

if len(data_files) > 0:
    object_ids = [i['object_id'] for i in data_files if 'object_id' in i] ## make a list of the file object_ids returned by our query
    cases = list(set([i['case_ids'][0] for i in data_files if 'case_ids' in i]))
    studies = list(set([i['study_uid'][0] for i in data_files if 'study_uid' in i]))
    print("Query returned {} data files with {} object_ids from {} studies of {} cases.".format(len(data_files),len(object_ids),len(studies),len(cases)))
    print("Data is a list with rows like this:\n\t {}".format(data_files[0:1]))
else:
    print("Your query returned no data! Please, check that query parameters are valid.")
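
Since the `source_node` filter can exclude every file of a study, it may be worth checking which studies in the cohort came back with no matching files. This is a minimal sketch with hypothetical helper names; it accepts `study_uid` as either a string or a list, because the shape of that field in `data_file` records is an assumption here.

```python
def as_list(value):
    """Normalize a scalar, list, or missing value to a list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple, set)):
        return list(value)
    return [value]


def studies_without_files(study_uids, file_records):
    """Return the study UIDs that no file record refers to, in input order."""
    seen = set()
    for rec in file_records:
        seen.update(as_list(rec.get("study_uid")))
    # dict.fromkeys removes duplicates while keeping the original order
    return [uid for uid in dict.fromkeys(study_uids) if uid not in seen]


# Placeholder values for illustration only
example_uids = ["1.2.3", "1.2.4", "1.2.5"]
example_files = [{"study_uid": "1.2.3"}, {"study_uid": ["1.2.5"]}]
print(studies_without_files(example_uids, example_files))
```

With the real results, the call would be `studies_without_files(study_uids, data_files)`.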

In [None]:
# object_id (AKA "data GUID") is a globally unique file identifier that points to an actual file object in cloud storage. We'll use the object_ids along with the gen3 command-line tool to download the files these object_ids point to.
object_ids


## 4) Access data files using their object_id / data GUID (globally unique identifiers)
---
In order to download files stored in MIDRC, users need to reference the file's object_id (AKA data GUID or Globally Unique IDentifier).

Once we have a list of GUIDs we want to download, we can use either the gen3-client or the gen3 SDK to download the files. You can also access individual files in your browser after logging in and entering the GUID after the `files/` endpoint, as in this URL: https://data.midrc.org/files/GUID

where GUID is the actual GUID, e.g.: https://data.midrc.org/files/dg.MD1R/b87d0db3-d95a-43c7-ace1-ab2c130e04ec
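
The URL pattern above is simple enough to generate for a whole list of GUIDs. A minimal sketch, where `file_page_url` is a hypothetical helper name:

```python
def file_page_url(guid, base_url="https://data.midrc.org"):
    """Build the browser URL for a file from its object_id / GUID."""
    return "{}/files/{}".format(base_url.rstrip("/"), guid)


print(file_page_url("dg.MD1R/b87d0db3-d95a-43c7-ace1-ab2c130e04ec"))
```

Applied to the query results, `[file_page_url(oid) for oid in object_ids]` would give one link per file; each link still requires being logged in to the portal.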

For instructions on how to install and use the gen3-client, please see [the MIDRC quick-start guide](https://data.midrc.org/dashboard/Public/documentation/Gen3_MIDRC_GetStarted.pdf), which can be found linked here and in the MIDRC data portal header as "Get Started".

Below we use the gen3 SDK command `gen3 drs-pull object` which is [documented in detail here](https://github.com/uc-cdis/gen3sdk-python/blob/master/docs/howto/drsDownloading.md).
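
The `gen3 drs-pull object` command can also be assembled as an argument list, which avoids shell quoting problems when a path contains spaces. This is a sketch under the assumption that the flags are exactly those used in this notebook (`--auth`, `--endpoint`, `--output-dir`); `drs_pull_command` is a hypothetical helper, and nothing is downloaded here.

```python
import shlex


def drs_pull_command(cred_path, object_id, output_dir="downloads",
                     endpoint="data.midrc.org"):
    """Return the gen3 CLI invocation as a list of arguments."""
    return [
        "gen3", "--auth", cred_path, "--endpoint", endpoint,
        "drs-pull", "object", object_id, "--output-dir", output_dir,
    ]


cmd = drs_pull_command(
    "/content/midrc-credentials.json",
    "dg.MD1R/b87d0db3-d95a-43c7-ace1-ab2c130e04ec",
)
# shlex.join shows the equivalent shell command line for inspection
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True)` without `shell=True`.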

### Use the Gen3 SDK command `gen3 drs-pull object` to download an individual file

In [None]:
## Make a new directory in Colab /content dir for downloaded files
## Note: if this command is run in Google Colab, this will not alter any local directories
os.system("rm -r downloads")
os.system("mkdir -p downloads")


In [None]:
## We can use a simple loop to download all files and keep track of successes and failures
## Here we will only download one image to save time for demo purposes
oid_num = 1
success,failure,other=[],[],[]
count,total = 0,len(object_ids[0:oid_num])
for object_id in object_ids[0:oid_num]:
    count+=1
    cmd = "gen3 --auth {} --endpoint data.midrc.org drs-pull object {} --output-dir downloads".format(cred,object_id)
    stout = subprocess.run(cmd, shell=True, capture_output=True)
    if not stout.stdout:
        raise Exception(f"gen3 sdk failure: {stout.stderr}")
    #print("Progress ({}/{}): {}".format(count,total,stout.stdout))
    print("Progress ({}/{}): {}".format(count,total,stout.stdout.decode("utf-8")))
    if "failed" in str(stout.stdout):
        failure.append(object_id)
    elif "successfully" in str(stout.stdout):
        success.append(object_id)
    else:
        other.append(object_id)


In [None]:
# Get a list of all downloaded .dcm files
image_files = glob.glob(pathname='**/*.dcm',recursive=True,)
image_files

### View the DICOM Images
---
Here we'll use the [Python package `pydicom`](https://pydicom.github.io/pydicom/stable/) to view the downloaded DICOM images.

Note that some of the files may contain compressed pixel data that require other packages to view; so, for this demo we'll simply skip over those using the following loop.

In [None]:
for image_file in image_files:
    print(image_file)
    ds = pydicom.dcmread(image_file)
    try:
        new_image = ds.pixel_array.astype(float)
        scaled_image = (np.maximum(new_image, 0) / new_image.max()) * 255.0
        scaled_image = np.uint8(scaled_image)
        final_image = Image.fromarray(scaled_image)
        print(type(final_image))
        display(final_image)
    except Exception as e:
        print("Couldn't view {}: {}.".format(image_file,e))
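
The scaling step in the loop above divides by the maximum pixel value, which yields invalid values for an all-zero image. The following sketch shows one way to guard against that case; `scale_to_uint8` is a hypothetical helper, and simple min-to-max scaling is an illustrative choice rather than a clinically meaningful windowing method.

```python
import numpy as np


def scale_to_uint8(pixels):
    """Clip negatives to 0 and rescale so the brightest pixel maps to 255."""
    arr = np.maximum(np.asarray(pixels, dtype=float), 0)
    peak = arr.max()
    if peak == 0:
        # Avoid dividing by zero: an empty-looking image stays all black
        return np.zeros(arr.shape, dtype=np.uint8)
    return (arr / peak * 255.0).astype(np.uint8)


example = scale_to_uint8([[0, 50], [100, 200]])
print(example.dtype, example.min(), example.max())
```

Inside the loop, `Image.fromarray(scale_to_uint8(ds.pixel_array))` would replace the three scaling lines.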

#### View the DICOM Headers
---
DICOM files have metadata elements embedded in the images. These can also be read and viewed using the `pydicom` package.

In [None]:
ds = pydicom.dcmread(image_files[0],force=True)
display(ds)

In [None]:
# Access individual elements
display(ds.file_meta)
display(ds.ImageType)
display(ds[0x0008, 0x0016])


In [None]:
# View the dicom metadata for all files as a DataFrame
dfs = []
for image_file in image_files:
    ds = pydicom.dcmread(image_file)
    df = pd.DataFrame(ds.values())
    df[0] = df[0].apply(lambda x: pydicom.dataelem.convert_raw_data_element(x) if isinstance(x, pydicom.dataelem.RawDataElement) else x)
    df['name'] = df[0].apply(lambda x: x.name)
    df['value'] = df[0].apply(lambda x: x.value)
    df = df[['name', 'value']]
    df = df.set_index('name').T.reset_index(drop=True)
    df['filename'] = image_file
    df.drop(columns=['Pixel Data'],inplace=True,errors='ignore') # drop the pixel data (if present) as it's too large and nonsensical to store in a DataFrame
    dfs.append(df)

In [None]:
# Make a master dataframe for all images using only headers in all dataframes
headers = list(set.intersection(*map(set,dfs)))
df = pd.concat([df[headers] for df in dfs])
df.set_index('filename',inplace=True)


In [None]:
display(df)

In [None]:
## Export the file metadata as a TSV file
filename = "MIDRC_DICOM_metadata.tsv"
df.to_csv(filename, sep='\t')


## 5) Set up Python environment for MIDRC BIH
---


### Download an API key file containing your credentials
---
1) Navigate to the MIDRC BIH login page in your browser: https://imaging-hub.data-commons.org/portal/login.
2) Navigate to the user profile page: https://imaging-hub.data-commons.org/portal/identity.
3) Click on the button "Create API Key" and save the `credentials.json` file somewhere safe as `bih-credentials.json`.


### Set local variables
---
Change the following `bcred` variable path to point to your credentials file downloaded from the MIDRC BIH portal following the instructions above.

In [None]:
bcred = "/content/bih-credentials.json" # location of your MIDRC BIH credentials, downloaded from https://imaging-hub.data-commons.org/portal/identity by clicking "Create API key" button and saving the credentials.json locally
bapi = "https://imaging-hub.data-commons.org/" # The base URL of the data commons being queried. This shouldn't change for MIDRC BIH.


### Initiate instances of the Gen3 SDK Classes using credentials file for authentication
---
Again, make sure the `bcred` path variable set above reflects the location of your BIH credentials file.

In [None]:
bauth = Gen3Auth(bapi, refresh_file=bcred) # authentication class
bquery = Gen3Query(bauth) # query class


## 6) Build Cohorts by Sending Queries to the MIDRC BIH metadata API
---
#### Set query parameters
---
* Here, we'll send a query to the `imaging_series` guppy index, which the [MIDRC BIH Explorer](https://imaging-hub.data-commons.org/Explorer) runs on.
* The filters defined below can be modified to return different subsets of imaging series. Here, we'll use a rather restrictive combination of Body Part Examined, Modality, Patient Sex, and disease type filters to narrow our selected imaging series to a small number for demonstration purposes.
* If our query request is successful, the API response should be in JSON format, and it should contain a list of imaging series along with any other data we ask for, including data GUIDs we will use to access image files.
* Reminder that the guppy graphQL service has extensive documentation in GitHub [here](https://github.com/uc-cdis/guppy/blob/master/doc/queries.md).

In [None]:
### Set some "imaging_series" query parameters to select Lung CT imaging series for female COVID-19 cases across the Biomedical Imaging Data Fabric

## Here we select imaging series with a BodyPartExamined of "LUNG"
BodyPartExamined = "LUNG"

## Here we select imaging series with a Modality of "CT"
Modality = "CT"

## Here we select imaging series with a PatientSex of "Female"
PatientSex = "Female"

## Here we select imaging series with a disease_type of "COVID-19"
disease_type = "COVID-19"



In [None]:
## Note: the "fields" option defines what fields we want the query to return. If set to "None", returns all available fields.

series = bquery.raw_data_download(
                    data_type="imaging_series",
                    fields=None,
                    filter_object={
                        "AND": [
                            {"=": {"BodyPartExamined": BodyPartExamined}},
                            {"=": {"Modality": Modality}},
                            {"=": {"PatientSex": PatientSex}},
                            {"=": {"disease_type": disease_type}},
                        ]
                    },
                    sort_fields=[{"submitter_id": "asc"}]
                )

if len(series) > 0:
    series_ids = list(set([i['submitter_id'] for i in series if 'submitter_id' in i])) ## make a list of the imaging series IDs returned
    object_ids = list(set([rec['object_ids'][0] for rec in series if 'object_ids' in rec])) ## make a list of the file object_ids (data GUIDs) returned
    subject_ids = list(set([rec['subject_id'][0] for rec in series if 'subject_id' in rec])) ## make a list of the subject IDs returned
    print("Query returned {} imaging series for {} subjects with {} object_ids.".format(len(series),len(subject_ids),len(object_ids)))
    print("Data is a list with rows like this:")
    for k,v in series[0:1][0].items():
      print("\t\'{}' : '{}'".format(k,v))
else:
    print("Your query returned no data! Please, check that query parameters are valid.")

In [None]:
series_df = pd.DataFrame(series)
display(series_df)


In [None]:
## Export the file metadata as a TSV file
filename = "MIDRC_BIH_imaging_series_metadata.tsv"
series_df.to_csv(filename, sep='\t')


## 7) Access data files using their object_id / data GUID (globally unique identifiers)
---
In order to programmatically access files for imaging series indexed in MIDRC BIH, users can reference the file's object_id (AKA data GUID or Globally Unique IDentifier, which is an example of a GA4GH DRS URI).

If an imaging series does not have an object_id associated with it, users will need to follow the platform links in the data table to the host platform where the data can be accessed or requested.
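
To see how many series can be downloaded directly, the records can be split by whether they carry any `object_ids`. This is a minimal sketch; `split_by_object_ids` is a hypothetical helper, and the records shown are placeholders rather than real BIH data.

```python
def split_by_object_ids(records):
    """Separate records that have at least one object_id from those that don't."""
    with_ids, without_ids = [], []
    for rec in records:
        ids = rec.get("object_ids")
        if isinstance(ids, str):
            ids = [ids]  # tolerate a single GUID stored as a string
        if ids:
            with_ids.append(rec)
        else:
            without_ids.append(rec)
    return with_ids, without_ids


# Placeholder records for illustration only
example_series = [
    {"submitter_id": "series-1", "object_ids": ["dg.XXXX/0001"]},
    {"submitter_id": "series-2", "object_ids": []},
    {"submitter_id": "series-3"},
]
downloadable, external = split_by_object_ids(example_series)
print(len(downloadable), len(external))
```

With the real results, `split_by_object_ids(series)` would return the series whose files can be pulled by GUID and the series that must be accessed through their host platform.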

As above for the MIDRC data commons, once we have a list of object_ids / image GUIDs we want to download, we can use either the gen3-client or the gen3 SDK to download the files.

For instructions on how to install and use the gen3-client, please see [the MIDRC quick-start guide](https://data.midrc.org/dashboard/Public/documentation/Gen3_MIDRC_GetStarted.pdf), which can be found linked here and in the MIDRC data portal header as "Get Started".

As in section 4 above, the gen3 SDK command `gen3 drs-pull object`, which is [documented in detail here](https://github.com/uc-cdis/gen3sdk-python/blob/master/docs/howto/drsDownloading.md), can be used to download these files.

### View the DICOM Images
---
The MIDRC BIH aggregates DICOM viewer URLs from across connected nodes in the Biomedical Imaging Data Fabric. If a connected data resource runs a DICOM viewer and provides URLs for imaging series, the URL should be available in the BIH metadata, as demonstrated below.

In [None]:
for rec in series:
  url = rec.get('dicom_viewer_url')
  # Note: `x != np.nan` is always True, so missing values are tested for explicitly
  if url and not (isinstance(url, float) and np.isnan(url)):
    print("{}".format(url))

## The End
---
If you have any questions related to this notebook don't hesitate to reach out to the MIDRC Helpdesk at midrc-support@gen3.org or the author directly at cgmeyer@uchicago.edu

Happy data wrangling!