Add DICOM Stack data set #5

Closed

@adam-grant-hendry
Contributor

adam-grant-hendry commented Nov 8, 2021

This dataset will be used to test improved volume rendering and DICOM stack reading per pyvista/pyvista-support issue #500.

![image](https://user-images.githubusercontent.com/59346180/140785116-3dc68d20-bca6-4d6f-85b6-6b832a576138.png)

Skin Cancer

The Cancer Imaging Archive (TCIA) is a service which de-identifies and hosts a large archive of medical images of cancer accessible for public download. The data are organized as “collections”; typically patients’ imaging related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc) or research focus. DICOM is the primary file format used by TCIA for radiology imaging. Supporting data related to the images such as patient outcomes, treatment details, genomics and expert analyses are also provided when available.

This dataset is a member of the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium Sarcomas (CPTAC-SAR) cohort. CPTAC is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. Radiology and pathology images from CPTAC patients are being collected and made publicly available by The Cancer Imaging Archive to enable researchers to investigate cancer phenotypes which may correlate to corresponding proteomic, genomic and clinical data.

This data has been published under the Creative Commons Attribution 3.0 Unported License and must adhere to the CPTAC Data Use Agreement. Per the TCIA Data Usage Policy (see License file), all oral or written presentations, disclosures, or publications must acknowledge the specific dataset(s) or applicable accession number(s) and the NIH-designated data repositories through which the investigator accessed any data. The appropriate citations are included in the Citations file. The metadata for this dataset is included in metadata.csv. Questions may be directed to help@cancerimagingarchive.net.

Title: Forearm Sarcoma
DataDescription URI: https://doi.org/10.7937/TCIA.2019.9bt23r95
Number of Images: 3
Total Size: 1.51 MB
File Format: DICOM

Files:
DICOM_Stack.zip
LICENSE.txt
CITATION.txt
metadata.csv

This dataset is a member of the Pancreatic-CT-CBCT-SEG collection and is distributed under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). Per the TCIA Data Usage Policy (see `License` file), all oral or written presentations, disclosures, or publications must acknowledge the specific dataset(s) or applicable accession number(s) and the NIH-designated data repositories through which the investigator accessed any data. The appropriate citations are included in the `Citations` file. See [metadata.csv](https://github.com/pyvista/vtk-data/files/8467442/metadata.csv).

Specifically, the metadata for this dataset is as follows:

Series UID: 1.3.6.1.4.1.14519.5.2.1.302382790855582445722435410442490497846
Collection: Pancreatic-CT-CBCT-SEG
3rd Party Analysis: NO
DataDescription URI: NA
Subject ID: Pancreas-CT-CB_001
Study UID: 1.3.6.1.4.1.14519.5.2.1.21087345762211724523378497892240459677
Study Description: PANCREAS

MatthewFlamm and others added 6 commits August 20, 2021 16:53
This dataset will be used to test improved volume rendering and DICOM
stack reading per `pyvista/pyvista-support` issue #500.

The Cancer Imaging Archive (TCIA) is a service which de-identifies and
hosts a large archive of medical images of cancer accessible for public
download. The data are organized as “collections”; typically patients’
imaging related by a common disease (e.g. lung cancer), image modality
or type (MRI, CT, digital histopathology, etc) or research focus. DICOM
is the primary file format used by TCIA for radiology imaging.
Supporting data related to the images such as patient outcomes,
treatment details, genomics and expert analyses are also provided when
available.

This dataset is a member of the Pancreatic-CT-CBCT-SEG collection and
is distributed under the Creative Commons Attribution 4.0 International
License (https://creativecommons.org/licenses/by/4.0/). Per the TCIA
Data Usage Policy (see `License` file), all oral or written
presentations, disclosures, or publications must acknowledge the specific
dataset(s) or applicable accession number(s) and the NIH-designated data
repositories through which the investigator accessed any data. The
appropriate citations are included in the `Citations` file.

Specifically, the metadata for this dataset is as follows:

Series UID: 1.3.6.1.4.1.14519.5.2.1.302382790855582445722435410442490497846
Collection: Pancreatic-CT-CBCT-SEG
3rd Party Analysis: NO
DataDescription URI: NA
Subject ID: Pancreas-CT-CB_001
Study UID: 1.3.6.1.4.1.14519.5.2.1.21087345762211724523378497892240459677
Study Description: PANCREAS
Study Date: 7/6/2012
Series Description: PANCREAS DI iDose 3
Manufacturer: Philips
Modality: CT
SOP Class Name: CT Image Storage
SOP Class UID: 1.2.840.10008.5.1.4.1.1.2
Number of Images: 134
File Size: 70.58 MB
File Location: .\Pancreatic-CT-CBCT-SEG\Pancreas-CT-CB_001\07-06-2012-NA-PANCREAS-59677\201.000000-PANCREAS DI iDose 3-97846
Download Timestamp: 2021-11-07T16:51:32.384
@adam-grant-hendry
Contributor Author

adam-grant-hendry commented Nov 9, 2021

Additionally, TCIA has a REST API, so if we want to use more data from this site in the future, we could lazily load and convert the data instead of adding it directly to this repo (making sure to include proper citations, of course).

@MatthewFlamm
Contributor

The license looks very permissive, but the citation requirement will have to be kept in mind if used for documentation. This is a somewhat large dataset for testing (~67 MB in total); it might be prohibitive to be constantly downloading these files? But it certainly is an interesting dataset, which is also nice for the documentation.

I'm not familiar with DICOM datasets, but is it possible to reduce the number of files for testing purposes? This may be a middle ground here. If so, this looks good to me.

@adam-grant-hendry
Contributor Author

adam-grant-hendry commented Nov 11, 2021

@MatthewFlamm

the citation requirement will have to be kept in mind if used for documentation

Yes, agreed. The docstrings in the corresponding updated code and sphinx docs will need to contain the citations.

This is a somewhat large dataset for testing (~67 MB in total); it might be prohibitive to be constantly downloading these files?

Yes, agreed; I'm concerned about that as well. Instead, I can write a private module that implements the TCIA REST API so we can download directly from their site rather than store data on GitHub. It would behave similarly to the URL request functions currently implemented in pyvista/examples/downloads.py:

  1. You access a resource by sending an HTTP request to the TCIA API server. The server replies with a response that either contains the data you requested, or a status indicator.
  2. You can access the metadata of an API by appending /metadata to the end of the query. The metadata is in JSON format and conforms to this schema.
  3. Most APIs can return results as CSV/JSON/XML/HTML. You can specify the return format by including the query parameter `format`.
  4. An API request takes the following structure:
     `<BaseURL><Resource><QueryEndpoint>?<QueryParameters><Format>`

They also provide code examples and an SDK for python hosted here on GitHub.
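As a rough illustration of the request structure listed above, here is a minimal Python sketch that assembles such a URL. The function name `build_request_url` is hypothetical, and the base URL, resource, endpoint, and query parameter shown are placeholder assumptions, not values confirmed in this thread; the real ones should be taken from the TCIA API documentation.

```python
from urllib.parse import urlencode


def build_request_url(base_url, resource, endpoint, params=None, fmt="json"):
    """Compose <BaseURL><Resource><QueryEndpoint>?<QueryParameters><Format>."""
    query = dict(params or {})
    # The return format is passed as an ordinary query parameter named "format".
    query["format"] = fmt
    return f"{base_url}{resource}{endpoint}?{urlencode(query)}"


# Placeholder values for illustration only (assumptions, not the real service).
url = build_request_url(
    base_url="https://example.org/services/v4",
    resource="/TCIA",
    endpoint="/query/getSeries",
    params={"Collection": "CPTAC-SAR"},
)
print(url)
```

Keeping the URL construction in one small function would make it easy to swap the endpoint or the return format without touching the download code.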

Would you and the team be alright if I made such a python module to implement the API?

I'm not familiar with DICOM datasets, but is it possible to reduce the number of files for testing purposes?

DICOM is simply an image file format standardized by the medical community to secure data transfer, particularly for patient files. A single standalone DICOM image is typically something like a chest x-ray, mammogram, bone fracture image, etc. Alternatively, an MRI or CT (computed tomography) machine can reconstruct a 3D volume by taking images of a body at multiple slices (hence the "tomography" in "computed tomography") along an axis.

Thus, multiple images will always exist for a raw DICOM stack that represents a volume, unfortunately. Oftentimes, there are many files. A single file volume image, like a ".ply" or an ".stl", can be created from a stack, but raw data is important in research. There is a lot of cleanup that occurs in generating the 3D model, and we typically want to experiment with and list the filters used to create the 3D model in our research so that we can guarantee the results can be repeated by others.
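To make the idea of a stack concrete, here is a minimal, library-free Python sketch of how per-slice images might be ordered along the scan axis and combined into a 3D volume. The `Slice` class and `stack_slices` function are hypothetical stand-ins for illustration; a real reader would take the position and pixel data from each DICOM file rather than from hand-written lists.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Slice:
    """One 2D image from a stack (hypothetical stand-in for a DICOM file)."""
    position: float          # location of the slice along the scan axis
    pixels: List[List[int]]  # rows x columns of intensity values


def stack_slices(slices):
    """Order slices along the scan axis and stack them into a 3D volume."""
    if not slices:
        raise ValueError("a volume needs at least one slice")
    ordered = sorted(slices, key=lambda s: s.position)
    shape = (len(ordered[0].pixels), len(ordered[0].pixels[0]))
    for s in ordered:
        if (len(s.pixels), len(s.pixels[0])) != shape:
            raise ValueError("all slices must share the same dimensions")
    # volume[k][i][j]: slice k, row i, column j
    return [s.pixels for s in ordered]


# Three tiny 2x2 slices, deliberately listed out of order.
slices = [
    Slice(position=2.5, pixels=[[5, 6], [7, 8]]),
    Slice(position=0.0, pixels=[[1, 2], [3, 4]]),
    Slice(position=5.0, pixels=[[9, 10], [11, 12]]),
]
volume = stack_slices(slices)
print(len(volume), len(volume[0]), len(volume[0][0]))
```

Even this toy version shows why a volume cannot shrink to a single raw file: every slice contributes its own 2D image, so the file count grows with the number of slices.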

I could try to find a smaller data set, which may very well exist, but I think the best solution would be to use the TCIA API and download from them.

What do you think?

@MatthewFlamm
Contributor

Having the data here is probably the best IMO if it is used for testing. Otherwise, the testing could be broken due to the other endpoint being down. I'm realizing that if we use all the DICOM layers for the documentation building, we already need to download the whole dataset for full testing anyway. So my question above about utilizing a partial dataset is probably not important.

Maybe the 50+ MB file size is no issue? Let's wait to get another opinion from @akaszynski.

@adam-grant-hendry
Contributor Author

adam-grant-hendry commented Nov 14, 2021

@akaszynski Do you have any opinions?

@adam-grant-hendry
Contributor Author

@adeak @banesullivan Any thoughts on this as well?

@banesullivan
Member

Skimmed over this...

Any data that are used for testing/examples need to be hosted here, as GitHub is generally reliable. If that external service goes down or changes its URL, it will break our CI and create a significant burden for us, much like pyvista/pyvista#1226 did.

@banesullivan
Member

Ah, just saw the concerns about file size

Perhaps we should just make a separate repo for LFS files?

@MatthewFlamm
Contributor

There is also this dataset from vtk data testing that is much smaller. I know nothing about it.

https://data.kitware.com/#collection/55f17f758d777f6ddc7895b7/folder/5afd93708d777f15ebe1b516

@adam-grant-hendry
Contributor Author

adam-grant-hendry commented Nov 29, 2021

There is also this dataset from vtk data testing that is much smaller. I know nothing about it.

@MatthewFlamm The data appears corrupted (I cannot open it in ParaView as a volume). This appears to be several slices of a prostate CT scan exam. A prostate scan would definitely be smaller than a torso. I found a working one from TCIA that is 2.1MB. Would that be okay?

About the smallest useful set I can find is 2MB. Is that acceptable?

@akaszynski
Member

About the smallest useful set I can find is 2MB. Is that acceptable?

2MB is quite acceptable.

@MatthewFlamm
Contributor

I found the inconsistency in the data set I linked. It is for a different reader: https://vtk.org/doc/nightly/html/classvtkDICOMImageReader.html

It is used in this example
https://kitware.github.io/vtk-examples/site/Cxx/IO/ReadDICOM/

This reader isn't available in my version of ParaView either.

I would agree that a smaller dataset would make a whole lot more sense.

Per contributor feedback, a smaller dataset (<= 2MB) is ideal.

This dataset will be used to test improved volume rendering and DICOM
stack reading per pyvista/pyvista-support issue #500.

The Cancer Imaging Archive (TCIA) is a service which de-identifies and
hosts a large archive of medical images of cancer accessible for public
download. The data are organized as “collections”; typically patients’
imaging related by a common disease (e.g. lung cancer), image modality
or type (MRI, CT, digital histopathology, etc) or research focus. DICOM
is the primary file format used by TCIA for radiology imaging.
Supporting data related to the images such as patient outcomes,
treatment details, genomics and expert analyses are also provided when
available.

This dataset is a member of the National Cancer Institute's Clinical
Proteomic Tumor Analysis Consortium Sarcomas (CPTAC-SAR) cohort. CPTAC
is a national effort to accelerate the understanding of the molecular
basis of cancer through the application of large-scale proteome and
genome analysis, or proteogenomics. Radiology and pathology images from
CPTAC patients are being collected and made publicly available by The
Cancer Imaging Archive to enable researchers to investigate cancer
phenotypes which may correlate to corresponding proteomic, genomic and
clinical data.

This data has been published under the `Creative Commons Attribution 3.0
Unported License` and must adhere to the CPTAC Data Use Agreement. Per
the TCIA Data Usage Policy (see `License` file), all oral or written
presentations, disclosures, or publications must acknowledge the
specific dataset(s) or applicable accession number(s) and the
NIH-designated data repositories through which the investigator
accessed any data. The appropriate citations are included in the
`CITATIONS` file. The metadata for this dataset is included in
`metadata.csv`. Questions may be directed to
<help@cancerimagingarchive.net>.

Title: Forearm Sarcoma
DataDescription URI: https://doi.org/10.7937/TCIA.2019.9bt23r95
Number of Images: 3
Total Size: 1.51 MB
File Format: DICOM
@adam-grant-hendry
Contributor Author

@akaszynski @MatthewFlamm @banesullivan I've replaced the dataset with a 1.5 MB dataset. Please kindly review at your earliest convenience and let me know if this will work. Thank you!

@MatthewFlamm
Contributor

This seems much more reasonable. It looks like a rebase or merge got messed up, and will have to be fixed before merging. GitHub is saying there are 102 files changed.

@adam-grant-hendry
Contributor Author

adam-grant-hendry commented Apr 12, 2022

@MatthewFlamm That should be the 102 files I deleted so that there are only 3 DICOM files. I forgot to add the upstream. I'll fix this.

@adam-grant-hendry adam-grant-hendry deleted the feat/dicomstack branch April 12, 2022 01:19
@adam-grant-hendry
Contributor Author

adam-grant-hendry commented Apr 12, 2022

@MatthewFlamm I accidentally deleted this PR. Please see #9 to continue. Sorry for the confusion.
