# DICOM Crawling Code
Given a list of directories, this notebook re-saves the data from every DICOM file stored directly in those folders. The pixel data is written to ``pixel_data.h5`` and the metadata from the DICOM headers is written to ``metadata.csv``. The HDF5 and CSV formats support efficient data reading and evaluation. Running this notebook is equivalent to calling ``run_crawler.py`` from the command line.


### Start by importing required packages and functions

In [1]:
from crawler_utils import dicom_crawl
import glob
import os

### Set the parameters

In [3]:
# Define a list of directories, each of which should contain DICOMs. All DICOMs directly inside these directories will be processed.
# Note: subdirectories are not searched, entries in this list that are not directories are ignored, and files in these directories that are not DICOM files are ignored.
dicom_folders = ['/data_storage/train_images','/data_storage/test_images']

# Define a directory to store outputs
storage_folder = os.getcwd()

# Define a unique identifier for output filenames; can be None
# ex: output_id = 'study1' results in outputs 'pixel_data_study1.h5' and 'metadata_study1.csv'
# ex: output_id = None results in outputs 'pixel_data.h5' and 'metadata.csv'
output_id = 'CT'

# Define number of processors; note that if saving 2d scans, parallelization will be ignored
n_procs = 1

# Turn on/off functionality to create h5 of all pixel data; writing pixel data increases run time significantly
# Set to True to write all pixel data, False to write no pixel data
write_pixeldata = True 

# Choose whether to evaluate 3d series or 2d images
# If eval_3d_scans = True, dicom_crawl() finds all DICOM files that share a series instance UID, stacks them in order into a 3d image, saves the 3d stack in the h5, and saves one row of metadata per scan
# If eval_3d_scans = False, dicom_crawl() saves each individual DICOM file's pixel data as a 2d image in the h5 file and writes each DICOM file's metadata to the CSV
eval_3d_scans = True

# Choose whether to parallelize over the folders (par_over_folder = True) or over the scans within a folder (par_over_folder = False)
# Note that this parallelization is only used when evaluating 3d data.
# If you have many folders in dicom_folders, each with only a few scans, set the parameter below to True
# If you have many scans per folder, set the parameter below to False so the code parallelizes over scans within a folder
par_over_folder = True
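
The difference between the two `eval_3d_scans` modes can be sketched in plain Python. This is an illustration only, not the code in `crawler_utils`: the `group_into_series` helper, the dictionary keys, and the use of the instance number as the stacking order are all assumptions, since the comments above say only that slices are stacked "in order".

```python
from collections import defaultdict

def group_into_series(slices):
    """Group 2D slices by series UID and order each group.

    `slices` is a list of dicts with hypothetical keys 'series_uid',
    'instance_number', and 'pixels' (a 2D list of values).
    Returns {series_uid: [pixels, pixels, ...]}, one 3D stack per series.
    """
    by_series = defaultdict(list)
    for s in slices:
        by_series[s["series_uid"]].append(s)

    stacks = {}
    for uid, members in by_series.items():
        # Assumed ordering key; a real crawler might sort by slice position instead.
        members.sort(key=lambda s: s["instance_number"])
        stacks[uid] = [s["pixels"] for s in members]
    return stacks

# Three slices from two series, deliberately out of order.
example = [
    {"series_uid": "A", "instance_number": 2, "pixels": [[2, 2]]},
    {"series_uid": "B", "instance_number": 1, "pixels": [[9, 9]]},
    {"series_uid": "A", "instance_number": 1, "pixels": [[1, 1]]},
]
stacks = group_into_series(example)
print({uid: len(stack) for uid, stack in stacks.items()})
```

In this toy input, series A yields a two-slice stack and series B a single-slice stack. With `eval_3d_scans = False` no grouping would take place, and each of the three slices would be stored as its own 2d image.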

### Run DICOM crawling code

In [4]:
# Crawl dicoms
dicom_crawl(dicom_folders, storage_folder, output_id, n_procs, write_pixeldata, eval_3d_scans, par_over_folder)
print("Finished running DICOM crawling code.")


Number of previously stored scans:  0
Starting DICOM crawling...
100%|██████████| 35/35 [01:06<00:00,  1.90s/it]

Finished running DICOM crawling code.
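
Once the run finishes, the outputs can be located from `storage_folder` and `output_id` using the naming rule given in the parameter comments. The sketch below uses two hypothetical helpers, `output_paths` and `count_metadata_rows` (neither is part of `crawler_utils`), to rebuild those filenames and count the rows in the metadata CSV with only the standard library. It assumes nothing about the CSV's columns.

```python
import csv
import os

def output_paths(storage_folder, output_id=None):
    """Return (h5_path, csv_path) following the naming rule in the parameter comments."""
    suffix = "" if output_id is None else f"_{output_id}"
    h5_path = os.path.join(storage_folder, f"pixel_data{suffix}.h5")
    csv_path = os.path.join(storage_folder, f"metadata{suffix}.csv")
    return h5_path, csv_path

def count_metadata_rows(csv_path):
    """Count data rows in the metadata CSV (header row excluded)."""
    with open(csv_path, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))

h5_path, csv_path = output_paths(os.getcwd(), "CT")
print(os.path.basename(h5_path), os.path.basename(csv_path))
if os.path.exists(csv_path):
    print("metadata rows:", count_metadata_rows(csv_path))
```

With `output_id = 'CT'`, as set above, the expected filenames are `pixel_data_CT.h5` and `metadata_CT.csv`. The row count should be one per scan when `eval_3d_scans = True` and one per DICOM file otherwise.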



