
DICOM Preprocessing

Implements a tool that preprocesses DICOM files into TIFF images. The primary motivation is to prepare DICOM images for use in computer vision tasks, with a focus on efficient storage and minimizing decode time.

Transformation Sequence

Depending on the options used, the following transformations are applied to the DICOM image:

  1. Cropping - if the --crop option is used, the image or volume is cropped such that all-zero rows and columns are removed from the edges of the image.
  2. Resizing - the image is resized to the target size, preserving the aspect ratio.
  3. Padding - if the aspect ratio does not match the target size, the image is padded in the direction specified by the --padding option.
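
A minimal sketch of the resize-and-pad arithmetic in steps 2 and 3, assuming the scale factor is chosen to fit the image inside the target while preserving aspect ratio (an illustration only; the tool implements this in Rust with the configured filter):

```python
# Minimal sketch of aspect-preserving resize followed by padding. The offsets
# shown are for the "center" padding direction; the other directions place the
# resized image flush against an edge instead.
def fit_with_padding(w, h, target_w, target_h):
    scale = min(target_w / w, target_h / h)   # fit inside target, keep aspect
    new_w, new_h = round(w * scale), round(h * scale)
    pad_x = (target_w - new_w) // 2           # horizontal padding offset
    pad_y = (target_h - new_h) // 2           # vertical padding offset
    return (new_w, new_h), (pad_x, pad_y)
```

For example, a 3000x4000 image with a 512x384 target gives scale = min(512/3000, 384/4000) = 0.096, i.e. a 288x384 resized image padded by 112 pixels on each side horizontally.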

To enable mapping coordinates from the original image to the output image, the following TIFF tags will be set:

  • DefaultCropOrigin - the origin of the initial cropping step as (x, y)
  • DefaultCropSize - the size of the initial cropping step as (width, height)
  • DefaultScale - the floating point scale of the resizing step as (x, y)
  • ActiveArea - coordinates of the non-padded area of the image as (left, top, right, bottom)
  • PageNumber - tuple of (0, total) indicating the total number of frames in the file
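
A minimal Python sketch of reading these tags and mapping an original-image coordinate into the output image is shown below. It assumes the tifffile package, the standard DNG tag codes (50719, 50720, 50718, 50829), and that DefaultScale is the original-to-output scale factor; exact value types depend on how the tags were encoded, so treat this as an illustration rather than a reference implementation.

```python
# Minimal sketch: map an (x, y) coordinate from the original DICOM image into
# the preprocessed TIFF using the tags described above. Assumes `tifffile`;
# rational-encoded tags may decode as (numerator, denominator) pairs and need
# an extra division step.
import tifffile

def map_coordinate(tiff_path, x, y):
    with tifffile.TiffFile(tiff_path) as tif:
        tags = tif.pages[0].tags
        crop_x, crop_y = tags[50719].value            # DefaultCropOrigin (x, y)
        scale_x, scale_y = tags[50718].value          # DefaultScale (x, y)
        left, top, right, bottom = tags[50829].value  # ActiveArea (l, t, r, b)
    # Undo the crop, apply the resize scale, then offset into the padded frame.
    return (x - crop_x) * scale_x + left, (y - crop_y) * scale_y + top
```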

Command Line Interface

Preprocess DICOM files into (multi-frame) TIFFs

Usage: dicom-preprocess [OPTIONS] <SOURCE> <OUTPUT>

Arguments:
  <SOURCE>  Source path. Can be a DICOM file, directory, or a text file with DICOM file paths
  <OUTPUT>  Output path. Can be a directory (for multiple files) or a file (for a single file)

Options:
  -c, --crop                         Crop the image. Pixels with value equal to zero are cropped away.
  -m, --crop-max                     Also include pixels with value equal to the data type's maximum value in the crop calculation
  -s, --size <SIZE>                  Target size (width,height)
  -f, --filter <FILTER>              Filter type [default: triangle] [possible values: triangle, nearest, catmull-rom, gaussian, lanczos3]
  -p, --padding <PADDING_DIRECTION>  Padding direction [default: zero] [possible values: zero, top-left, bottom-right, center]
  -z, --compressor <COMPRESSOR>      Compression type [default: packbits] [possible values: packbits, lzw, uncompressed]
      --strict                       Fail on input paths that are not DICOM files
  -h, --help                         Print help
  -V, --version                      Print version
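
For example, the following command (with placeholder paths, and the (width,height) size syntax from the help text above) crops a directory of DICOMs and resizes them to 512x384:

dicom-preprocess --crop --size 512,384 /path/to/dicoms /path/to/output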

Example Images

Below are example images demonstrating the effects of different cropping options (resized to 512x384):

[Example images: Original Image | Cropped (Zero Pixels) | Cropped (Zero + Maximum Pixels)]

The maximum pixel cropping option (-m, --crop-max) prevents certain image watermarks from impacting the cropping calculation. Effective cropping can maximize the information extracted from the image at a given resolution budget.
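
Concretely, the crop finds the bounding box of foreground pixels, where background is zero and, with --crop-max, also the data type's maximum value. A minimal numpy sketch of that rule (an illustration, not the crate's implementation):

```python
import numpy as np

def crop_bounds(img: np.ndarray, crop_max: bool = False):
    """Return (x, y, width, height) of the box enclosing all foreground pixels."""
    background = img == 0
    if crop_max and np.issubdtype(img.dtype, np.integer):
        background |= img == np.iinfo(img.dtype).max
    rows = np.flatnonzero(~background.all(axis=1))  # rows with any foreground
    cols = np.flatnonzero(~background.all(axis=0))  # cols with any foreground
    if rows.size == 0 or cols.size == 0:
        return 0, 0, img.shape[1], img.shape[0]     # all background: no crop
    x, y = cols[0], rows[0]
    return int(x), int(y), int(cols[-1] - x + 1), int(rows[-1] - y + 1)
```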

Below are example images demonstrating various volume handling options:

[Example images: Central Slice | Maximum Intensity]
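
The two modes pictured above correspond to simple reductions over the frame axis of the volume; a minimal numpy sketch (an illustration, not the crate's implementation):

```python
import numpy as np

def central_slice(volume: np.ndarray) -> np.ndarray:
    """Take the middle frame of a (frames, height, width) volume."""
    return volume[volume.shape[0] // 2]

def max_intensity(volume: np.ndarray) -> np.ndarray:
    """Maximum-intensity projection: per-pixel max across all frames."""
    return volume.max(axis=0)
```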

Optimization Notes

Compression and ZFS

Below is a comparison of file sizes for 26,474 digital breast tomosynthesis (DBT) volumes after preprocessing to TIFF when stored in ZFS. Example decode times from a local NVMe SSD are also provided for each configuration. Note that the Rust PackBits decoder seems suboptimal, as PackBits decoding is generally faster than LZW.

TIFF Compression   Total Size   Total Size (LZ4 Compressed)   Peak Decode Time (ms)
Uncompressed       12 TB        6.5 TB                        3.204
PackBits           8.3 TB       6.5 TB                        67.288
LZW                5.6 TB       5.6 TB                        44.080

PackBits compression does not yield a substantial reduction in stored file size on ZFS. The primary consideration when choosing a compression algorithm then becomes the network bandwidth between the storage and compute nodes. Uncompressed files require more bandwidth to transfer, but need no decompression on the compute node. Furthermore, eliminating decompression frees the CPU for other train-time tasks like augmentation. Note that TrueNAS stores compressed blocks in the adaptive replacement cache (ARC), so uncompressed and PackBits-compressed files will have a similar memory footprint.
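
As a rough illustration (the link speeds here are hypothetical, and the peak decode times above are treated as per-file representatives): the mean uncompressed volume is about 12 TB / 26,474 ≈ 450 MB, versus ≈ 210 MB for LZW. On a saturated 10 Gb/s link (≈ 1.25 GB/s) these transfer in roughly 360 ms and 170 ms respectively, so LZW's ≈ 44 ms decode more than pays for itself; on a 100 Gb/s link the transfer gap shrinks to about 20 ms and the decode step dominates.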

In summary, if you have sufficient storage capacity and network bandwidth, and are using an access pattern that saturates the network link, uncompressed TIFFs are a good choice. Reads from local flash storage are fast enough that decompression becomes the bottleneck, so uncompressed TIFFs are also the ideal choice there for maximum throughput.

Access Patterns

When loading preprocessed TIFFs from HDDs over a local network, access patterns become a significant determinant of throughput. Spinning disks suffer from high seek latencies, so random access patterns are suboptimal. Below is a comparison of sequential and random access patterns for the dataset described above. This is not an exact comparison, as the order of file reads differs between the two runs and thus the slice chosen from each DBT volume differs. However, the substantial difference in throughput demonstrates the impact of access patterns.

Access Pattern   Throughput (files/s)
Sequential       641.4
Random           9.805

ARC

TrueNAS will store retrieved data in ARC. For sufficiently small datasets, it is possible that the entire dataset can be stored in ARC, thus eliminating bottlenecks associated with disk I/O and random access patterns. Below is a comparison of two benchmark runs, both using random access patterns with a consistent seed between runs. The second run benefits from ARC, as the dataset is smaller than the available ARC capacity.

Benchmark Run   Throughput (files/s)
Run 1           9.836
Run 2           820.5

Given sufficient network bandwidth and ARC capacity, operations on datasets that have been cached will likely be bottlenecked by decode time.

Manifest Creation

When dealing with large datasets stored on slow drives, it is useful to create a manifest of the dataset. This manifest should track the preprocessed file paths that comprise the dataset, as well as the inode of each preprocessed file (for optimizing sequential read performance). A binary, dicom-manifest, is provided to create a manifest from a directory of preprocessed TIFFs. It is assumed that the preprocessed TIFFs are named in the format {study_instance_uid}/{sop_instance_uid}.tiff, and that dicom-manifest will be given a source path at the root of the preprocessed dataset.

The manifest will contain the following columns, sorted by (study_instance_uid, sop_instance_uid):

  • study_instance_uid - the study instance UID of the preprocessed file
  • sop_instance_uid - the SOP instance UID of the preprocessed file
  • path - the path of the preprocessed file relative to the source path
  • inode - the inode number of the preprocessed file
  • width - the width of the preprocessed file
  • height - the height of the preprocessed file
  • channels - the number of channels in the preprocessed file
  • num_frames - the number of frames in the preprocessed file

Example usage:

dicom-manifest /path/to/preprocessed/dataset /path/to/manifest.csv
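
A sketch of consuming the manifest for HDD-friendly reads follows: sorting by inode visits files in approximately on-disk order, turning a random scan into a mostly sequential one. The pandas dependency and the paths are this example's choices, not requirements of the tool.

```python
import pandas as pd

manifest = pd.read_csv("/path/to/manifest.csv")
# Visiting files in inode order approximates their on-disk layout.
for row in manifest.sort_values("inode").itertuples():
    path = f"/path/to/preprocessed/dataset/{row.path}"
    with open(path, "rb") as f:
        data = f.read()  # hand off to the decode / training pipeline
```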
