<a href="https://colab.research.google.com/github/fedorov/IDC-Examples/blob/master/notebooks/cookbook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IDC Google Colab cookbook notebook

This notebook collects small, reusable snippets that IDC users may find helpful when developing their own analysis notebooks.

Please email Andrey Fedorov (andrey dot fedorov at gmail dot com) if you have any questions or suggestions!

* Prepared: Spring 2022
* Updated: June 29, 2022

# Prerequisites

* To use Colab, and to access data in IDC, you will need a [Google Account](https://support.google.com/accounts/answer/27441?hl=en)
* Make sure your Colab instance has a GPU! To do this, open "Runtime > Change runtime type" and choose the GPU runtime.
* To perform queries against IDC BigQuery tables you will need a cloud project. You can get started with a free Google Cloud project by following the steps below (they are also illustrated in [this short video](https://youtu.be/i08S0KJLnyw)):
  1. Go to https://console.cloud.google.com/, and accept the Terms and conditions.
  2. Click the "Select a project" button in the upper left corner of the screen, and then click "New project".
  3. Open the console menu by clicking the ☰ menu icon in the upper left corner, and select "Dashboard". You will see information about your project, including your Project ID. Insert that project ID in the cell below in place of `REPLACE_ME_WITH_YOUR_PROJECT_ID`.

In [1]:
# initialize this variable with your Google Cloud Project ID!
my_ProjectID = "REPLACE_ME_WITH_YOUR_PROJECT_ID"

import os
os.environ["GCP_PROJECT_ID"] = my_ProjectID
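
A quick sanity check can catch a forgotten placeholder before any query is attempted. The sketch below is an optional addition; the helper name `check_project_id` is hypothetical and not part of any Google library.

```python
PLACEHOLDER = "REPLACE_ME_WITH_YOUR_PROJECT_ID"

def check_project_id(project_id, placeholder=PLACEHOLDER):
    """Return the project ID (whitespace stripped), or raise if it was left unset."""
    if not project_id or project_id.strip() == placeholder:
        raise ValueError(
            "Set my_ProjectID to your Google Cloud Project ID before continuing."
        )
    return project_id.strip()

# usage in this notebook:
# my_ProjectID = check_project_id(my_ProjectID)
```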

# Authentication

In [2]:
# you will need to authenticate with your Google ID to do anything meaningful with IDC
from google.colab import auth
auth.authenticate_user()

# Query

BigQuery SQL is an extremely powerful instrument for searching DICOM metadata available in IDC! The examples below are intended to give you a basic idea of some of its capabilities. If you want to know more, please refer to the [BigQuery query syntax documentation](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax). You can also easily find various tutorials online, such as [this one](https://blog.coupler.io/bigquery-tutorial/), to help you get started. While learning all the tricks of SQL will take a lot of effort, you should be able to quickly master the skills that can go a long way in exploring IDC data!

When experimenting with queries, the [BigQuery SQL console](https://console.cloud.google.com/bigquery) is very handy!

To run queries, first instantiate the query client, which is then used to submit each query.

In [3]:
# python API is the most flexible way to query IDC BigQuery metadata tables
from google.cloud import bigquery
bq_client = bigquery.Client(my_ProjectID)

## Get list of the available collections



In [None]:
selection_query = f"""
  SELECT  
    DISTINCT(collection_id) 
  FROM 
    `bigquery-public-data.idc_current.dicom_all` 
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

selection_df

## Get some summary information about collections

In [None]:
selection_query = f"""
  SELECT  
    collection_id,
    STRING_AGG(DISTINCT(Modality)) as collection_modalities, # DICOM modalities encountered
    ROUND(SUM(instance_size)/POW(1024,3),3) as collection_size_GB # total size on disk
  FROM 
    `bigquery-public-data.idc_current.dicom_all` 
  GROUP BY
    collection_id
  ORDER BY
    collection_size_GB DESC
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

selection_df
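
To make the aggregation concrete, here is a plain-Python sketch of what the query computes per collection: the distinct modalities joined into one string (as `STRING_AGG(DISTINCT ...)` does) and the total size converted from bytes to GB. The rows are made-up toy values, not IDC data, and `summarize` is a hypothetical helper.

```python
from collections import defaultdict

# made-up rows for illustration only: (collection_id, Modality, instance_size in bytes)
toy_rows = [
    ("collection_a", "CT", 512 * 1024**2),
    ("collection_a", "CT", 512 * 1024**2),
    ("collection_a", "SEG", 1024**3),
    ("collection_b", "MR", 256 * 1024**2),
]

def summarize(rows):
    """Group rows by collection, mimicking the GROUP BY in the query above."""
    modalities = defaultdict(set)
    total_bytes = defaultdict(int)
    for collection_id, modality, size in rows:
        modalities[collection_id].add(modality)
        total_bytes[collection_id] += size
    summary = [
        # modalities are sorted here only to make the output deterministic
        (cid, ",".join(sorted(modalities[cid])), round(total_bytes[cid] / 1024**3, 3))
        for cid in total_bytes
    ]
    # largest collections first, like ORDER BY collection_size_GB DESC
    return sorted(summary, key=lambda item: item[2], reverse=True)

summary = summarize(toy_rows)
```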

## Select items by specific UID/ID

In [18]:
# select rows corresponding to all instances in the lidc_idri collection
# to select by PatientID, SOPInstanceUID, SeriesInstanceUID or StudyInstanceUID instead,
# delete the # character in front of the corresponding line below and
# comment out (or delete) the collection_id line
selection_query = f"""
  SELECT  
    StudyInstanceUID, 
    SeriesInstanceUID, 
    SOPInstanceUID, 
    instance_size, 
    gcs_url 
  FROM 
    `bigquery-public-data.idc_current.dicom_all` 
  WHERE 
#    PatientID = \"R01-001\"
#   SOPInstanceUID = \"1.3.6.1.4.1.14519.5.2.1.6450.2626.226637977389233552278537838820\" 
#   SeriesInstanceUID = \"1.3.6.1.4.1.14519.5.2.1.4334.1501.312037286778380630549945195741\" 
#   StudyInstanceUID = \"1.3.6.1.4.1.14519.5.2.1.4334.1501.116796918629271881210561198785\" 
   collection_id = "lidc_idri"
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

In [None]:
size_gb = round(selection_df["instance_size"].sum()/(1024**3),4)
print(f"Cohort size on disk: {size_gb} GB")
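
Rather than editing the query text for every new identifier, the filter value can be passed as a BigQuery query parameter. The sketch below is one way to do this with the Python client; `build_selection_query`, `select_by`, and the column allowlist are hypothetical helpers introduced for illustration. Column names cannot be query parameters, which is why the column is checked against a fixed list before being placed into the SQL.

```python
ALLOWED_COLUMNS = {
    "PatientID",
    "StudyInstanceUID",
    "SeriesInstanceUID",
    "SOPInstanceUID",
    "collection_id",
}

def build_selection_query(column):
    """Build the selection query with a @value placeholder for the filter."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"Unsupported filter column: {column}")
    return f"""
  SELECT
    StudyInstanceUID,
    SeriesInstanceUID,
    SOPInstanceUID,
    instance_size,
    gcs_url
  FROM
    `bigquery-public-data.idc_current.dicom_all`
  WHERE
    {column} = @value
"""

def select_by(client, column, value):
    """Run the query with `value` bound to @value and return a dataframe."""
    # imported here so that build_selection_query can be used without the library
    from google.cloud import bigquery
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("value", "STRING", value)]
    )
    query_job = client.query(build_selection_query(column), job_config=job_config)
    return query_job.result().to_dataframe()

# usage in this notebook (requires the authenticated bq_client defined earlier):
# selection_df = select_by(bq_client, "PatientID", "R01-001")
```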

## Select by availability of segmentations

What segmentations do we have anyway? Let's look at the distinct combinations of segmentation property category, type and anatomic location, which are the metadata attributes that describe segmentations.

In this instance, we run the query using the `%%bigquery` magic. This requires less code, but the query cannot be parameterized as easily as with the Python BigQuery client.

In [None]:
%%bigquery --project=$my_ProjectID

SELECT
  DISTINCT(SegmentedPropertyCategory.CodeMeaning) as SegmentedPropertyCategory_CodeMeaning,
  SegmentedPropertyType.CodeMeaning as SegmentedPropertyType_CodeMeaning,
  AnatomicRegion.CodeMeaning as AnatomicRegion_CodeMeaning
FROM
  `bigquery-public-data.idc_current.segmentations`

Select all rows that correspond to the instances of segmentations of anything in the prostate.

In [24]:
# select rows corresponding to cases that have segmentation of anything in the prostate
selection_query = f"""
  SELECT  
    dicom_all.StudyInstanceUID, 
    dicom_all.SeriesInstanceUID, 
    dicom_all.SOPInstanceUID, 
    gcs_url 
  FROM 
    `bigquery-public-data.idc_current.dicom_all` as dicom_all 
  JOIN 
    `bigquery-public-data.idc_current.segmentations` as segmentations 
  ON 
    dicom_all.SOPInstanceUID = segmentations.SOPInstanceUID 
  WHERE 
    segmentations.SegmentedPropertyType.CodeMeaning LIKE \"%prostate%\" OR 
    segmentations.AnatomicRegion.CodeMeaning LIKE \"%prostate%\"
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

In [None]:
selection_df

| | StudyInstanceUID | SeriesInstanceUID | SOPInstanceUID | gcs_url |
|---|---|---|---|---|
| 0 | 1.3.6.1.4.1.14519.5.2.1.3671.4754.266963586071... | 1.2.276.0.7230010.3.1.3.1426846371.7356.151320... | 1.2.276.0.7230010.3.1.4.1426846371.7356.151320... | gs://public-datasets-idc/2688dccd-cc69-4f4a-ae... |
| 1 | 1.3.6.1.4.1.14519.5.2.1.3671.4754.266963586071... | 1.2.276.0.7230010.3.1.3.1426846371.7356.151320... | 1.2.276.0.7230010.3.1.4.1426846371.7356.151320... | gs://public-datasets-idc/2688dccd-cc69-4f4a-ae... |
| 2 | 1.3.6.1.4.1.14519.5.2.1.3671.4754.266963586071... | 1.2.276.0.7230010.3.1.3.1426846371.7356.151320... | 1.2.276.0.7230010.3.1.4.1426846371.7356.151320... | gs://public-datasets-idc/2688dccd-cc69-4f4a-ae... |
| 3 | 1.3.6.1.4.1.14519.5.2.1.7310.5101.130276529947... | 1.2.276.0.7230010.3.1.3.1070885483.11412.15991... | 1.2.276.0.7230010.3.1.4.1070885483.11412.15991... | gs://public-datasets-idc/59a0d450-f21a-433d-8a... |
| 4 | 1.3.6.1.4.1.14519.5.2.1.7310.5101.130276529947... | 1.2.276.0.7230010.3.1.3.1070885483.11412.15991... | 1.2.276.0.7230010.3.1.4.1070885483.11412.15991... | gs://public-datasets-idc/59a0d450-f21a-433d-8a... |
| ... | ... | ... | ... | ... |
| 525 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.726872428105... | 1.2.276.0.7230010.3.1.3.1070885483.16388.15991... | 1.2.276.0.7230010.3.1.4.1070885483.16388.15991... | gs://public-datasets-idc/864543fe-9efe-4515-85... |
| 526 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.726872428105... | 1.2.276.0.7230010.3.1.3.1070885483.16388.15991... | 1.2.276.0.7230010.3.1.4.1070885483.16388.15991... | gs://public-datasets-idc/864543fe-9efe-4515-85... |
| 527 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.236131511359... | 1.2.276.0.7230010.3.1.3.1070885483.17072.15991... | 1.2.276.0.7230010.3.1.4.1070885483.17072.15991... | gs://public-datasets-idc/3fa71302-051c-4900-97... |
| 528 | 1.3.6.1.4.1.14519.5.2.1.7311.5101.236131511359... | 1.2.276.0.7230010.3.1.3.1070885483.17072.15991... | 1.2.276.0.7230010.3.1.4.1070885483.17072.15991... | gs://public-datasets-idc/3fa71302-051c-4900-97... |


# Visualization

In [None]:
# helper function to view a study or a specific series hosted by IDC
def get_idc_viewer_url(studyUID, seriesUID=None):
  url = "https://viewer.imaging.datacommons.cancer.gov/viewer/"+studyUID
  if seriesUID is not None:
    url = url+"?seriesInstanceUID="+seriesUID
  return url

# take the study and series of the first row; that series belongs to that study by construction
my_StudyInstanceUID = selection_df["StudyInstanceUID"].iloc[0]
my_SeriesInstanceUID = selection_df["SeriesInstanceUID"].iloc[0]

print("URL to view the entire study:")
print(get_idc_viewer_url(my_StudyInstanceUID))
print()
print("URL to view the specific series:")
print(get_idc_viewer_url(my_StudyInstanceUID, my_SeriesInstanceUID))
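
Since the query result has one row per instance, the same series appears many times. If viewer links for every series in the selection are needed, a sketch like the following could collect one URL per distinct study/series pair. The helper `viewer_urls_for_series` is hypothetical; `get_idc_viewer_url` is repeated from the cell above only to keep the sketch self-contained.

```python
def get_idc_viewer_url(studyUID, seriesUID=None):
    # same helper as in the cell above, repeated so this sketch runs on its own
    url = "https://viewer.imaging.datacommons.cancer.gov/viewer/" + studyUID
    if seriesUID is not None:
        url = url + "?seriesInstanceUID=" + seriesUID
    return url

def viewer_urls_for_series(study_series_pairs):
    """Return one viewer URL per distinct (study, series) pair, in first-seen order."""
    seen = set()
    urls = []
    for study_uid, series_uid in study_series_pairs:
        if (study_uid, series_uid) not in seen:
            seen.add((study_uid, series_uid))
            urls.append(get_idc_viewer_url(study_uid, series_uid))
    return urls

# usage with the dataframe produced by the query above:
# viewer_urls_for_series(
#     zip(selection_df["StudyInstanceUID"], selection_df["SeriesInstanceUID"])
# )
```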

# Downloading

Refer to the documentation page on the topic for the most up-to-date information:

https://learn.canceridc.dev/data/downloading-data

In [26]:
import os
os.environ["DOWNLOAD_DEST"] = "/content/IDC_downloads"
os.environ["MANIFEST"] = "/content/idc_manifest.txt"

In [27]:
!mkdir -p ${DOWNLOAD_DEST}
!echo "gsutil cp \$* $DOWNLOAD_DEST" > gsutil_download.sh
!chmod +x gsutil_download.sh

In [28]:
# creating a manifest file for the subsequent download of files
selection_df["gcs_url"].to_csv(os.environ["MANIFEST"], header=False, index=False)

In [29]:
%%capture
# download is this simple (but not very fast!)

!cat ${MANIFEST} | gsutil -m cp -I ${DOWNLOAD_DEST}

If you want to download a non-trivial amount of data, you will want to parallelize downloads, as illustrated below.

In [None]:
!cat ${MANIFEST} | xargs -n 25 -P 10 ./gsutil_download.sh
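
In the command above, `-n 25` passes at most 25 URLs to each invocation of the download script, and `-P 10` runs up to 10 invocations at the same time. The batching part can be sketched in Python as follows; `batches` is a hypothetical helper and the URLs are placeholders, not real IDC files.

```python
def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# placeholder URLs, for illustration only
urls = [f"gs://example-bucket/file_{i}.dcm" for i in range(60)]

chunks = batches(urls, 25)
chunk_sizes = [len(chunk) for chunk in chunks]
# 60 URLs are split into chunks of 25, 25 and 10
```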

WIP: it is **much** faster to download using [s5cmd](https://github.com/peak/s5cmd) - see our documentation here for details: https://learn.canceridc.dev/data/downloading-data.
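
As a rough sketch of what that could look like: s5cmd can execute a file of commands via its `run` subcommand, so the `gs://` URLs would need to be rewritten into `cp` commands. The version below assumes the bucket can be reached through the S3-compatible endpoint `https://storage.googleapis.com` with unsigned requests. Treat this, the `s3://` rewrite, the manifest path, and the helper names as assumptions, and check the documentation linked above for the currently recommended procedure.

```python
def to_s5cmd_command(gcs_url, dest):
    """Turn a gs:// URL into an s5cmd 'cp' command line (assumed s3:// rewrite)."""
    prefix = "gs://"
    if not gcs_url.startswith(prefix):
        raise ValueError(f"Not a gs:// URL: {gcs_url}")
    return f"cp s3://{gcs_url[len(prefix):]} {dest}"

def write_s5cmd_manifest(gcs_urls, manifest_path, dest):
    """Write one 'cp' command per URL to a file that 's5cmd run' can execute."""
    with open(manifest_path, "w") as manifest:
        for url in gcs_urls:
            manifest.write(to_s5cmd_command(url, dest) + "\n")

# usage in this notebook (assumes s5cmd is installed; the paths are examples):
# write_s5cmd_manifest(selection_df["gcs_url"], "/content/s5cmd_manifest.txt",
#                      "/content/IDC_downloads/")
# !s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run /content/s5cmd_manifest.txt
```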

# Sorting

In [32]:
%%capture
!pip install pydicom
!git clone https://github.com/pieper/dicomsort
!sudo apt-get install -y dcmtk

In [None]:
import os
os.environ["SORTED_DEST"] = "/content/IDC_sorted"

!mkdir -p $SORTED_DEST
!rm -rf $SORTED_DEST/*
!python dicomsort/dicomsort.py -k -u $DOWNLOAD_DEST ${SORTED_DEST}/%StudyInstanceUID/%SeriesInstanceUID/%SOPInstanceUID.dcm
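
The last argument of the command above is a path pattern in which each `%Keyword` stands for the value of the corresponding DICOM attribute of the file being sorted. The sketch below only illustrates how such a pattern expands into a path; it is not dicomsort's implementation, `expand_pattern` is a hypothetical helper, and the UIDs are placeholders.

```python
import re

def expand_pattern(pattern, attributes):
    """Replace each %Keyword in the pattern with the matching attribute value."""
    return re.sub(r"%(\w+)", lambda match: str(attributes[match.group(1)]), pattern)

# placeholder attribute values, for illustration only
attributes = {
    "StudyInstanceUID": "1.2.3",
    "SeriesInstanceUID": "4.5.6",
    "SOPInstanceUID": "7.8.9",
}

path = expand_pattern(
    "/content/IDC_sorted/%StudyInstanceUID/%SeriesInstanceUID/%SOPInstanceUID.dcm",
    attributes,
)
# path is "/content/IDC_sorted/1.2.3/4.5.6/7.8.9.dcm"
```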