Add DICOM Stack data set #5
3dgallery data
Add pvd files from paraview data
This dataset will be used to test improved volume rendering and DICOM stack reading per `pyvista/pyvista-support` issue #500.
The Cancer Imaging Archive (TCIA) is a service which de-identifies and hosts a large archive of medical images of cancer accessible for public download. The data are organized as “collections”; typically patients’ imaging related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc) or research focus. DICOM is the primary file format used by TCIA for radiology imaging. Supporting data related to the images such as patient outcomes, treatment details, genomics and expert analyses are also provided when available.
This dataset is a member of the Pancreatic-CT-CBCT-SEG collection and is distributed under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). Per the TCIA Data Usage Policy (see `License` file), all oral or written presentations, disclosures, or publications must acknowledge the specific dataset(s) or applicable accession number(s) and the NIH-designated data repositories through which the investigator accessed any data. The appropriate citations are included in the `Citations` file.
Specifically, the metadata for this dataset is as follows:
Series UID: 1.3.6.1.4.1.14519.5.2.1.302382790855582445722435410442490497846
Collection: Pancreatic-CT-CBCT-SEG
3rd Party Analysis: NO
DataDescription URI: NA
Subject ID: Pancreas-CT-CB_001
Study UID: 1.3.6.1.4.1.14519.5.2.1.21087345762211724523378497892240459677
Study Description: PANCREAS
Study Date: 7/6/2012
Series Description: PANCREAS DI iDose 3
Manufacturer: Philips
Modality: CT
SOP Class Name: CT Image Storage
SOP Class UID: 1.2.840.10008.5.1.4.1.1.2
Number of Images: 134
File Size: 70.58 MB
File Location: .\Pancreatic-CT-CBCT-SEG\Pancreas-CT-CB_001\07-06-2012-NA-PANCREAS-59677\201.000000-PANCREAS DI iDose 3-97846
Download Timestamp: 2021-11-07T16:51:32.384
Additionally, TCIA has a REST API, so if we want to use more data from this site in the future, we could lazy load and convert the data instead of adding it directly to this repo (making sure to include proper citations, of course).
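As a rough illustration of the lazy-loading idea, the sketch below builds a download URL for one series and caches the resulting archive locally. The base URL and the `getImage` query are assumptions based on TCIA's public NBIA REST API and should be checked against the current TCIA documentation; `series_url`, `fetch_series`, and the cache directory are hypothetical names, not existing code in this repo.

```python
import urllib.parse
import urllib.request
from pathlib import Path

# Assumed endpoint of the TCIA (NBIA) REST API; verify before relying on it.
TCIA_BASE_URL = "https://services.cancerimagingarchive.net/nbia-api/services/v1"


def series_url(series_uid: str, base_url: str = TCIA_BASE_URL) -> str:
    """Build the URL that requests one DICOM series as a zip archive."""
    query = urllib.parse.urlencode({"SeriesInstanceUID": series_uid})
    return f"{base_url}/getImage?{query}"


def fetch_series(series_uid: str, cache_dir: Path) -> Path:
    """Download a series once and reuse the cached zip afterwards."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{series_uid}.zip"
    if not target.exists():  # lazy: only hit the network on a cache miss
        with urllib.request.urlopen(series_url(series_uid)) as response:
            target.write_bytes(response.read())
    return target
```

With a cache like this, the network is only touched the first time a series is requested, but any use of the data would still need to carry the citations mentioned above.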
The license looks very permissive, but the citation requirement will have to be kept in mind if the data is used for documentation. This is a somewhat large dataset for testing (~67 MB in total), so constantly downloading these files might be prohibitive. But it certainly is an interesting dataset, which is also nice for the documentation. I'm not familiar with DICOM datasets, but is it possible to reduce the number of files for testing purposes? That may be a middle ground here. If so, this looks good to me.
Yes, agreed. The docstrings in the corresponding updated code and sphinx docs will need to contain the citations.
Yes, agreed; I'm concerned about that as well. Instead, I can write a private module that implements the TCIA REST API so we can download directly from their site rather than store data on GitHub. It behaves similarly to the URL requests functions currently implemented in
They also provide code examples and an SDK for Python hosted here on GitHub. Would you and the team be alright if I made such a Python module to implement the API?
DICOM is simply an image file format standardized by the medical community to secure data transfer, particularly for patient files. A single standalone DICOM image is typically something like a chest x-ray, mammogram, bone fracture image, etc. Alternatively, an MRI or CT (computed tomography) machine can reconstruct a 3D volume by taking images of a body at multiple slices (hence the "tomography" in "computed tomography") along an axis. Thus, unfortunately, a raw DICOM stack that represents a volume will always consist of multiple images, and often many files. A single-file 3D model, like a ".ply" or an ".stl", can be created from a stack, but raw data is important in research. A lot of cleanup occurs in generating the 3D model, and we typically want to experiment with and list the filters used to create it so that we can guarantee the results can be repeated by others. I could try to find a smaller data set, which may very well exist, but I think the best solution would be to use the TCIA API and download from them. What do you think?
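To make the slice-to-volume idea concrete, here is a minimal sketch in plain Python. It assumes each slice has already been read into a 2D grid of pixel values together with its position along the scan axis; the `Slice` type and the `stack_slices` function are hypothetical and do not parse real DICOM files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Slice:
    """One 2D image of the stack (hypothetical stand-in for a DICOM file)."""
    position: float          # location of the slice along the scan axis
    pixels: List[List[int]]  # rows of pixel values


def stack_slices(slices: List[Slice]) -> List[List[List[int]]]:
    """Order slices along the scan axis and stack them into a 3D volume."""
    if not slices:
        raise ValueError("a volume needs at least one slice")
    ordered = sorted(slices, key=lambda s: s.position)
    rows, cols = len(ordered[0].pixels), len(ordered[0].pixels[0])
    for s in ordered:
        if len(s.pixels) != rows or any(len(r) != cols for r in s.pixels):
            raise ValueError("all slices must share the same dimensions")
    # volume[k][i][j] is the pixel at row i, column j of the k-th slice
    return [s.pixels for s in ordered]


# Files often arrive out of order, so sorting by position matters.
volume = stack_slices([
    Slice(position=2.0, pixels=[[5, 6], [7, 8]]),
    Slice(position=0.0, pixels=[[1, 2], [3, 4]]),
])
```

This is why a volume cannot be shipped as a single raw DICOM image: every entry along the first axis of `volume` comes from its own file.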
Having the data here is probably best IMO if it is used for testing. Otherwise, the testing could be broken due to the other endpoint being down. I'm realizing that if we use all the DICOM layers for the documentation build, we already need to download the whole dataset for full testing anyway. So my question above about using a partial dataset is probably not important. Maybe the 50+ MB file size is no issue? Let's wait to get another opinion from @akaszynski.
@akaszynski Do you have any opinions?
@adeak @banesullivan Any thoughts on this as well?
Skimmed over this... Any data that are used for testing/examples need to be hosted here, as GitHub is generally reliable. If that external service goes down or changes its URL, it will break our CI and create a significant burden for us, much like pyvista/pyvista#1226 did.
Ah, just saw the concerns about file size. Perhaps we should just make a separate repo for LFS files?
There is also this dataset from vtk data testing that is much smaller. I know nothing about it. https://data.kitware.com/#collection/55f17f758d777f6ddc7895b7/folder/5afd93708d777f15ebe1b516
@MatthewFlamm The data appears corrupted (I cannot open it in ParaView as a volume). This appears to be several slices of a prostate CT scan exam. A prostate scan would definitely be smaller than a torso. I found a working one from TCIA that is 2.1 MB, which is about the smallest useful set I can find. Is that acceptable?
2 MB is quite acceptable.
I found the inconsistency in the data set I linked. It is for a different reader: https://vtk.org/doc/nightly/html/classvtkDICOMImageReader.html It is used in this example. This reader isn't available in my version of ParaView either. I would agree that a smaller dataset would make a lot more sense.
Per contributor feedback, a smaller dataset (<= 2 MB) is ideal. This dataset will be used to test improved volume rendering and DICOM stack reading per pyvista/pyvista-support issue #500.
The Cancer Imaging Archive (TCIA) is a service which de-identifies and hosts a large archive of medical images of cancer accessible for public download. The data are organized as “collections”; typically patients’ imaging related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc) or research focus. DICOM is the primary file format used by TCIA for radiology imaging. Supporting data related to the images such as patient outcomes, treatment details, genomics and expert analyses are also provided when available.
This dataset is a member of the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium Sarcomas (CPTAC-SAR) cohort. CPTAC is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. Radiology and pathology images from CPTAC patients are being collected and made publicly available by The Cancer Imaging Archive to enable researchers to investigate cancer phenotypes which may correlate to corresponding proteomic, genomic and clinical data.
This data has been published under the `Creative Commons Attribution 3.0 Unported License` and must adhere to the CPTAC Data Use Agreement. Per the TCIA Data Usage Policy (see `License` file), all oral or written presentations, disclosures, or publications must acknowledge the specific dataset(s) or applicable accession number(s) and the NIH-designated data repositories through which the investigator accessed any data. The appropriate citations are included in the `CITATIONS` file. The metadata for this dataset is included in `metadata.csv`. Questions may be directed to <help@cancerimagingarchive.net>.
Title: Forearm Sarcoma
DataDescription URI: https://doi.org/10.7937/TCIA.2019.9bt23r95
Number of Images: 3
Total Size: 1.51 MB
File Format: DICOM
@akaszynski @MatthewFlamm @banesullivan I've replaced the dataset with a 1.5 MB dataset. Please kindly review at your earliest convenience and let me know if this will work. Thank you!
This seems much more reasonable. It looks like a rebase or merge got messed up, and will have to be fixed before merging. GitHub is saying there are 102 files changed.
@MatthewFlamm That should be the 102 files I deleted so that there are only 3 DICOM files. I forgot to add the upstream. I'll fix this.
@MatthewFlamm I accidentally deleted this PR. Please see #9 to continue. Sorry for the confusion.
This dataset will be used to test improved volume rendering and DICOM stack reading per pyvista/pyvista-support issue #500.
![image](https://user-images.githubusercontent.com/59346180/140785116-3dc68d20-bca6-4d6f-85b6-6b832a576138.png)
The Cancer Imaging Archive (TCIA) is a service which de-identifies and hosts a large archive of medical images of cancer accessible for public download. The data are organized as “collections”; typically patients’ imaging related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc) or research focus. DICOM is the primary file format used by TCIA for radiology imaging. Supporting data related to the images such as patient outcomes, treatment details, genomics and expert analyses are also provided when available.
This dataset is a member of the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium Sarcomas (CPTAC-SAR) cohort. CPTAC is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. Radiology and pathology images from CPTAC patients are being collected and made publicly available by The Cancer Imaging Archive to enable researchers to investigate cancer phenotypes which may correlate to corresponding proteomic, genomic and clinical data.
This data has been published under the Creative Commons Attribution 3.0 Unported License and must adhere to the CPTAC Data Use Agreement. Per the TCIA Data Usage Policy (see `License` file), all oral or written presentations, disclosures, or publications must acknowledge the specific dataset(s) or applicable accession number(s) and the NIH-designated data repositories through which the investigator accessed any data. The appropriate citations are included in the `Citations` file. The metadata for this dataset is included in `metadata.csv`. Questions may be directed to help@cancerimagingarchive.net.
Title: Forearm Sarcoma
DataDescription URI: https://doi.org/10.7937/TCIA.2019.9bt23r95
Number of Images: 3
Total Size: 1.51 MB
File Format: DICOM
Files:
DICOM_Stack.zip
LICENSE.txt
CITATION.txt
metadata.csv
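One quick way to confirm that an archive holds the expected number of slices is to count the entries that look like DICOM files. The sketch below relies on the DICOM Part 10 file layout (a 128-byte preamble followed by the ASCII marker `DICM`), so files written without that preamble would be missed. The function name `count_dicom_files` is hypothetical, and the example builds a small in-memory archive rather than reading the real `DICOM_Stack.zip`.

```python
import io
import zipfile

DICOM_PREAMBLE_LENGTH = 128  # bytes that precede the "DICM" marker
DICOM_MARKER = b"DICM"


def count_dicom_files(archive) -> int:
    """Count zip entries that carry the DICOM Part 10 marker."""
    count = 0
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            with zf.open(info) as member:
                header = member.read(DICOM_PREAMBLE_LENGTH + len(DICOM_MARKER))
            if header[DICOM_PREAMBLE_LENGTH:] == DICOM_MARKER:
                count += 1
    return count


# Build a tiny in-memory archive with two fake DICOM entries and one text file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    fake_dicom = b"\x00" * DICOM_PREAMBLE_LENGTH + DICOM_MARKER + b"payload"
    zf.writestr("slice_1.dcm", fake_dicom)
    zf.writestr("slice_2.dcm", fake_dicom)
    zf.writestr("notes.txt", "not a DICOM file")
buffer.seek(0)
dicom_count = count_dicom_files(buffer)
```

Passing the path of `DICOM_Stack.zip` instead of the in-memory buffer should return 3 if the archive matches the description above.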
This dataset is a member of the Pancreatic-CT-CBCT-SEG collection and is distributed under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). Per the TCIA Data Usage Policy (see `License` file), all oral or written presentations, disclosures, or publications must acknowledge the specific dataset(s) or applicable accession number(s) and the NIH-designated data repositories through which the investigator accessed any data. The appropriate citations are included in the `Citations` file. Specifically, the metadata for this dataset ([metadata.csv](https://github.com/pyvista/vtk-data/files/8467442/metadata.csv)) is as follows:
Series UID: 1.3.6.1.4.1.14519.5.2.1.302382790855582445722435410442490497846
Collection: Pancreatic-CT-CBCT-SEG
3rd Party Analysis: NO
DataDescription URI: NA
Subject ID: Pancreas-CT-CB_001
Study UID: 1.3.6.1.4.1.14519.5.2.1.21087345762211724523378497892240459677
Study Description: PANCREAS