# Trust but Verify - Inspection of Large Image Collections

This notebook and accompanying [Python script characterize_data.py](characterize_data.py) illustrate the use of SimpleITK as a tool for efficient data inspection on large image collections, as part of familiarizing oneself with the data and performing cleanup prior to its use in deep learning or any other supervised machine learning approach.

The reasons for inspecting your data before using it include:
1. Identification of corrupt images.
2. Identification of erroneous images (label noise).
3. Assessment of data quality and variability in terms of intensity range, image resolution, and pixel types.
4. Reduction of workload by identifying redundant information content (e.g. a greyscale/single-channel image masquerading as a color/three-channel image, such as an x-ray stored in a jpg file).
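
As an illustration of the last item, the following is a minimal sketch of how a three-channel image whose channels carry identical information could be detected. It works on a plain list of `(r, g, b)` tuples so that it runs without any imaging library; the function name `is_effectively_grayscale` and the `tolerance` parameter are assumptions made for this example and are not part of `characterize_data.py`.

```python
def is_effectively_grayscale(pixels, tolerance=0):
    """Return True if, for every (r, g, b) pixel, the channels differ by at most tolerance."""
    return all(max(p) - min(p) <= tolerance for p in pixels)


# A made-up "color" image whose three channels are identical, and one with real color.
xray_like = [(12, 12, 12), (200, 200, 200), (0, 0, 0), (90, 90, 90)]
true_color = [(12, 12, 12), (200, 10, 10), (0, 0, 0), (90, 90, 90)]

print(is_effectively_grayscale(xray_like))   # True
print(is_effectively_grayscale(true_color))  # False
```

A nonzero `tolerance` may be useful for lossy formats, where compression can introduce small differences between channels.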


In [1]:
import os
import pickle

import matplotlib.pyplot as plt
import ipywidgets as widgets
import ipympl  # provides the backend used by the %matplotlib widget magic
from DicomHelper import *

# directory that receives the CSV report and the pickled visualization results
OUTPUT_DIR = "OUTPUT-ALL-UNLABELED"
os.makedirs(OUTPUT_DIR, exist_ok=True)



In [2]:
%env SITK_SHOW_COMMAND /Applications/Slicer.app/Contents/MacOS/Slicer

env: SITK_SHOW_COMMAND=/Applications/Slicer.app/Contents/MacOS/Slicer


In [3]:
%matplotlib widget

In [4]:
data_root_dir = "/Users/seanreed/Documents/ALL-UNLABELED-DATA"

In [5]:
!ls $data_root_dir

1.2.276.0.2783747.3.1.2.1150183542.91496.1584557269.16654
1.2.276.0.2783747.3.1.2.1150183542.91496.1584560923.16663
1.2.276.0.7230010.3.1.2.1498839191.5144.1567777520.295
1.2.276.0.7230010.3.1.2.2733322860.4416.1572435955.336
1.2.276.0.7230010.3.1.2.738517100.4712.1625692712.6881
1.2.276.0.7230010.3.1.2.738517100.4712.1625694761.7821
1.2.276.0.7230010.3.1.2.738517100.4712.1625695650.8766
1.2.276.0.7230010.3.1.2.738517100.5904.1625697387.543
1.2.276.0.7230010.3.1.2.738517100.7560.1611692908.810
1.2.276.0.7230010.3.1.2.738517100.7560.1611699852.1756
1.2.392.200036.9116.2.2.2.1762671078.1544058796.778235
1.2.392.200036.9116.2.2.2.1762671078.1544062790.854133
1.2.392.200036.9116.2.2.2.1762671078.1544064007.853456
1.2.392.200036.9116.2.2.2.1762671078.1544065347.70543
1.2.392.200036.9116.2.2.2.1762671078.1544071914.125853

## Characterizing the image set

To characterize the image set we have written a [Python script](characterize_data.py) that you run from the command line. The script is flexible and lets you characterize your image set robustly. Try the various options to learn more about your data; it is surprising how often the data turns out not to be what you thought it was when you rely only on visual inspection. The script can inspect your data either on a file by file basis or as DICOM series, where a single image (volume) is stored in multiple files.

DICOM series:
```
python characterize_data.py --dir data --output output/DICOM_image_data_report.csv \
--metadata_keys "0008|0060" "0018|5101" --metadata_keys_headings "modality" "radiographic view"
```
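
Once the report has been written, a quick tally of a metadata column is often the first sanity check. The sketch below counts the values in one column of a CSV file using only the standard library. It assumes the report has a header row that includes the headings passed via `--metadata_keys_headings`; the `files` column and all values in the sample are made up for illustration, and `count_column_values` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_column_values(csv_file, column):
    """Count how often each value occurs in one column of a CSV report."""
    reader = csv.DictReader(csv_file)
    # Missing or empty cells are grouped under a single "<empty>" label.
    return Counter((row.get(column) or "<empty>") for row in reader)


# Hypothetical stand-in for the report; the "files" column and all values are made up.
sample_report = io.StringIO(
    "files,modality,radiographic view\n"
    "a.dcm,CT,\n"
    "b.dcm,CR,AP\n"
    "c.dcm,CR,LATERAL\n"
    "d.dcm,CT,\n"
)

modality_counts = count_column_values(sample_report, "modality")
sample_report.seek(0)  # rewind so the same data can be read again
view_counts = count_column_values(sample_report, "radiographic view")
print(modality_counts.most_common())
print(view_counts.most_common())
```

For a real report, pass a file opened with `open(path, newline="")` instead of the in-memory sample.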


After characterizing the image set we turn to visual inspection. 

In [None]:
!python characterize_data.py --dir $data_root_dir --output "./OUTPUT-ALL-UNLABELED/output_all_unlabeled.csv" --metadata_keys "0008|0060" "0018|5101" --metadata_keys_headings "modality" "radiographic view" 

args are 
 Namespace(analyze='per_series', dir='/Users/seanreed/Documents/ALL-UNLABELED-DATA', external_applications=[], external_applications_headings=[], imageIO='', metadata_keys=['0008|0060', '0018|5101'], metadata_keys_headings=['modality', 'radiographic view'], output='./OUTPUT-ALL-UNLABELED/output_all_unlabeled.csv')
ImageSeriesReader (0x7fc597184dc0): Non uniform sampling or missing slices detected,  maximum nonuniformity:44.7

ImageSeriesReader (0x7fc597172d30): Non uniform sampling or missing slices detected,  maximum nonuniformity:28.4194

ImageSeriesReader (0x7fc5967ede40): Non uniform sampling or missing slices detected,  maximum nonuniformity:47.5

ImageSeriesReader (0x7fc5971a5ee0): Non uniform sampling or missing slices detected,  maximum nonuniformity:16.9483

ImageSeriesReader (0x7fc597b56df0): Non uniform sampling or missing slices detected,  maximum nonuniformity:19.3538

ImageSeriesReader (0x7fc5971a5ff0): Non uniform sampling or missing slices detected,  maximum n

ImageSeriesReader (0x7fc5967ede40): Non uniform sampling or missing slices detected,  maximum nonuniformity:10.4528

ImageSeriesReader (0x7fc5971a4cf0): Non uniform sampling or missing slices detected,  maximum nonuniformity:3.22388

ImageSeriesReader (0x7fc597b20980): Non uniform sampling or missing slices detected,  maximum nonuniformity:8.2

ImageSeriesReader (0x7fc5971a56c0): Non uniform sampling or missing slices detected,  maximum nonuniformity:13.4328

ImageSeriesReader (0x7fc597b617c0): Non uniform sampling or missing slices detected,  maximum nonuniformity:18

ImageSeriesReader (0x7fc5967ede40): Non uniform sampling or missing slices detected,  maximum nonuniformity:12.9167

ImageSeriesReader (0x7fc597172d30): Non uniform sampling or missing slices detected,  maximum nonuniformity:9.63636

ImageSeriesReader (0x7fc5967ede40): Non uniform sampling or missing slices detected,  maximum nonuniformity:12.8571

ImageSeriesReader (0x7fc59aa05e60): Non uniform sampling or missing slice

ImageSeriesReader (0x7fc597172d30): Non uniform sampling or missing slices detected,  maximum nonuniformity:41.7692

ImageSeriesReader (0x7fc5967ede40): Non uniform sampling or missing slices detected,  maximum nonuniformity:52.3333

ImageSeriesReader (0x7fc59aa050e0): Non uniform sampling or missing slices detected,  maximum nonuniformity:1091.3

ImageSeriesReader (0x7fc59aa050e0): Non uniform sampling or missing slices detected,  maximum nonuniformity:1106.18

ImageSeriesReader (0x7fc59aa04460): Non uniform sampling or missing slices detected,  maximum nonuniformity:129.075

ImageSeriesReader (0x7fc597172d30): Non uniform sampling or missing slices detected,  maximum nonuniformity:11.8491

ImageSeriesReader (0x7fc597172d30): Non uniform sampling or missing slices detected,  maximum nonuniformity:11.8318

ImageSeriesReader (0x7fc5967ede40): Non uniform sampling or missing slices detected,  maximum nonuniformity:28.8689

ImageSeriesReader (0x7fc593044930): Non uniform sampling or missi

ImageSeriesReader (0x7fc597172d30): Non uniform sampling or missing slices detected,  maximum nonuniformity:10.2778

ImageSeriesReader (0x7fc597b3a580): Non uniform sampling or missing slices detected,  maximum nonuniformity:8.33333

ImageSeriesReader (0x7fc5971583c0): Non uniform sampling or missing slices detected,  maximum nonuniformity:4

ImageSeriesReader (0x7fc591ea59d0): Non uniform sampling or missing slices detected,  maximum nonuniformity:59.8889

ImageSeriesReader (0x7fc59aa04730): Non uniform sampling or missing slices detected,  maximum nonuniformity:20.4348

ImageSeriesReader (0x7fc59aa04080): Non uniform sampling or missing slices detected,  maximum nonuniformity:18.8667

ImageSeriesReader (0x7fc597b50ec0): Non uniform sampling or missing slices detected,  maximum nonuniformity:69

ImageSeriesReader (0x7fc593047d30): Non uniform sampling or missing slices detected,  maximum nonuniformity:21.25

ImageSeriesReader (0x7fc5967ede40): Non uniform sampling or missing slices de

ImageSeriesReader (0x7fc5967ede40): Non uniform sampling or missing slices detected,  maximum nonuniformity:88.0909

ImageSeriesReader (0x7fc597b5bb00): Non uniform sampling or missing slices detected,  maximum nonuniformity:13.1613

ImageSeriesReader (0x7fc591ea5180): Non uniform sampling or missing slices detected,  maximum nonuniformity:6.25

ImageSeriesReader (0x7fc5967ede40): Non uniform sampling or missing slices detected,  maximum nonuniformity:4.80851

ImageSeriesReader (0x7fc5967ede40): Non uniform sampling or missing slices detected,  maximum nonuniformity:13.6712

ImageSeriesReader (0x7fc597172d30): Non uniform sampling or missing slices detected,  maximum nonuniformity:11.5238

ImageSeriesReader (0x7fc597b5bb00): Non uniform sampling or missing slices detected,  maximum nonuniformity:10.9841

ImageSeriesReader (0x7fc593044ca0): Non uniform sampling or missing slices detected,  maximum nonuniformity:12.36

ImageSeriesReader (0x7fc597b20980): Non uniform sampling or missing s

ImageSeriesReader (0x7fc5967ede40): Non uniform sampling or missing slices detected,  maximum nonuniformity:31.9277

ImageSeriesReader (0x7fc597b56df0): Non uniform sampling or missing slices detected,  maximum nonuniformity:6.72727

ImageSeriesReader (0x7fc597b59510): Non uniform sampling or missing slices detected,  maximum nonuniformity:8.12308

ImageSeriesReader (0x7fc597172d30): Non uniform sampling or missing slices detected,  maximum nonuniformity:8.42254

ImageSeriesReader (0x7fc5967ede40): Non uniform sampling or missing slices detected,  maximum nonuniformity:15.6136

ImageSeriesReader (0x7fc5967ede40): Non uniform sampling or missing slices detected,  maximum nonuniformity:30.0926

ImageSeriesReader (0x7fc597172d30): Non uniform sampling or missing slices detected,  maximum nonuniformity:10.6613

ImageSeriesReader (0x7fc591ea8ad0): Non uniform sampling or missing slices detected,  maximum nonuniformity:16.3077

ImageSeriesReader (0x7fc597b69150): Non uniform sampling or miss

ImageSeriesReader (0x7fc597b3a580): Non uniform sampling or missing slices detected,  maximum nonuniformity:25.9444

ImageSeriesReader (0x7fc5967ede40): Non uniform sampling or missing slices detected,  maximum nonuniformity:13.4146

ImageSeriesReader (0x7fc394004470): Non uniform sampling or missing slices detected,  maximum nonuniformity:10.2927
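
The warnings above vary widely in magnitude, so it can help to pull out the reported values and look at the largest ones first. The following is a minimal sketch that extracts the numbers from captured log text with a regular expression and skips truncated lines. The messages identify only an object address, not a series, so this summarizes magnitudes rather than pinpointing data. `extract_nonuniformity` and `REVIEW_THRESHOLD` are hypothetical names, and the cutoff value is purely illustrative.

```python
import re

# Matches the numeric value at the end of a complete warning line.
NONUNIFORMITY_RE = re.compile(r"maximum nonuniformity:\s*(\d+(?:\.\d+)?)")


def extract_nonuniformity(log_text):
    """Return the nonuniformity values found in the log, skipping truncated lines."""
    values = []
    for line in log_text.splitlines():
        match = NONUNIFORMITY_RE.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


# Three lines copied from the output above; the second one is truncated.
sample_log = """\
ImageSeriesReader (0x7fc597184dc0): Non uniform sampling or missing slices detected,  maximum nonuniformity:44.7
ImageSeriesReader (0x7fc5971a5ff0): Non uniform sampling or missing slices detected,  maximum n
ImageSeriesReader (0x7fc59aa050e0): Non uniform sampling or missing slices detected,  maximum nonuniformity:1106.18
"""

values = extract_nonuniformity(sample_log)
REVIEW_THRESHOLD = 100.0  # illustrative cutoff, not a value recommended by ITK
to_review = [v for v in values if v > REVIEW_THRESHOLD]
print(sorted(values, reverse=True))
print(len(to_review))
```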



## Now we look at our data using the DICOM series based approach.

After selecting the images of interest we print their associated files. Note that with the series based approach some images are associated with a single file and others with multiple files.
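
To make the single-file versus multi-file distinction explicit, the sketch below splits a list of per-series file lists into the two groups. It assumes the file list is a list of lists of file names, which matches how `image_file_list` is iterated later in this notebook; `split_by_file_count` and the file names are hypothetical.

```python
def split_by_file_count(series_file_lists):
    """Separate series stored in a single file from series spread over several files."""
    single = [files for files in series_file_lists if len(files) == 1]
    multi = [files for files in series_file_lists if len(files) > 1]
    return single, multi


# Made-up file names standing in for a real file list.
example_file_list = [
    ["xray_0001.dcm"],
    ["ct_0001.dcm", "ct_0002.dcm", "ct_0003.dcm"],
    ["xray_0002.dcm"],
]

single_file_series, multi_file_series = split_by_file_count(example_file_list)
print(f"{len(single_file_series)} single-file series, {len(multi_file_series)} multi-file series")
```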

In [None]:
### unlabeled data
data_root_dir = "/Users/seanreed/Documents/ALL-UNLABELED-DATA"

In [None]:
faux_series_volume_file_name = os.path.join(OUTPUT_DIR, "faux_series_volume.pkl")
faux_series_file_list_name = os.path.join(OUTPUT_DIR, "faux_series_file_list.pkl")
faux_volume_image_files, image_file_list = visualize_series(
    data_root_dir, projection_axis=2, thumbnail_size=[64, 64], tile_size=[5, 5]
)
with open(faux_series_volume_file_name, "wb") as fp:
    pickle.dump(faux_volume_image_files, fp)
with open(faux_series_file_list_name, "wb") as fp:
    pickle.dump(image_file_list, fp)
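
The cell above pickles its results so that later cells can reload them from disk. A minimal sketch of a load-or-compute helper that skips the computation when the cache file already exists is shown below; `load_or_compute` and `expensive` are hypothetical names for this example and are not part of `DicomHelper`.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute):
    """Return the pickled result at cache_path, computing and saving it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fp:
            return pickle.load(fp)
    result = compute()
    # Create the parent directory if needed before writing the cache file.
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "wb") as fp:
        pickle.dump(result, fp)
    return result


calls = []


def expensive():
    """Stand-in for a slow computation; records each call so reuse can be checked."""
    calls.append(1)
    return [["a.dcm"], ["b1.dcm", "b2.dcm"]]


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = os.path.join(tmp_dir, "file_list.pkl")
    first = load_or_compute(cache_file, expensive)   # computes and writes the cache
    second = load_or_compute(cache_file, expensive)  # reads the cache, no recomputation

print(first == second, len(calls))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.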

In [None]:
print(image_file_list[:2])
print(len(image_file_list))
for i, sub_list in enumerate(image_file_list):
    print(f"sub_list{i} has {len(sub_list)} dicom files")

In [None]:
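# reload the cached results so this cell can be rerun without recomputing them
with open(faux_series_volume_file_name, "rb") as fp:
    faux_volume_image_files = pickle.load(fp)
with open(faux_series_file_list_name, "rb") as fp:
    image_file_list = pickle.load(fp)

# create the interactive selection GUI; errors are printed rather than raised,
# in which case image_selection_gui2 is not defined for the next cell
try:
    image_selection_gui2 = ImageSelection(
        faux_volume_image_files,
        image_file_list,
        figure_size=(10, 5),
        selection_func=show_image,
    )
except Exception as e:
    print(e)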

In [None]:
selected_files = image_selection_gui2.get_selected_images()
print(selected_files)