# DICOM metadata EDA

Work in progress to analyze the metadata attributes of CT scans in DICOM files. Comments welcome.

The data was extracted in another notebook by loading every CT scan and dumping each metadata attribute. I saved the result to a CSV file to speed things up, since the extraction takes 5-10 minutes. [This kernel](https://www.kaggle.com/anarthal/dicom-metadata-extracting-attributes-to-dataframe) shows how I did it.
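
For readers who don't want to open the linked kernel, the idea can be sketched roughly as follows. This is a minimal sketch, not the code from that kernel: `dataset_to_row` and `extract_metadata` are hypothetical names, and the only pydicom call used is `pydicom.dcmread` with `stop_before_pixels=True`, which skips the pixel data so that only the header is read.

```python
from pathlib import Path
from types import SimpleNamespace


def dataset_to_row(ds, attributes):
    """Collect the requested attributes from a dataset-like object.

    Attributes that are missing from the object are stored as None.
    """
    return {name: getattr(ds, name, None) for name in attributes}


def extract_metadata(root, attributes):
    """Walk `root`, read the header of every .dcm file and return one dict per file."""
    import pydicom  # imported lazily so dataset_to_row can be used without it

    rows = []
    for dcm_path in sorted(Path(root).rglob('*.dcm')):
        ds = pydicom.dcmread(str(dcm_path), stop_before_pixels=True)
        rows.append(dataset_to_row(ds, attributes))
    return rows


# Stand-in object (made up for illustration) so the helper can be tried without any DICOM file
fake = SimpleNamespace(KVP=120, Modality='CT')
row = dataset_to_row(fake, ['KVP', 'Modality', 'SliceThickness'])
print(row)
```

The list of dicts returned by `extract_metadata` can be passed to `pd.DataFrame` and saved with `to_csv`.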

# 1. Imports

GDCM is needed to load certain DICOM files. It's not strictly needed here, but I install it just in case.

In [None]:
!conda install -c conda-forge gdcm -y

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib
import pydicom
import os
from os import path
import tqdm
import IPython
import re
import PIL.Image  # importing the submodule guarantees that PIL.Image is available
from pathlib import Path

sns.set()

# 2. Loading the data

Load the data. Some fields are arrays, but the CSV stores them as strings containing Python list literals, so they have to be parsed back when reading.
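
As an illustration of what that parsing involves, here is a small sketch using `ast.literal_eval`, which only accepts Python literals and never executes code. The helper names `parse_literal_tuple` and `parse_position` and the sample strings are made up for this example; the real columns may contain other shapes.

```python
import ast
import math


def parse_literal_tuple(text):
    """Turn a string such as "['ORIGINAL', 'PRIMARY']" into a tuple.

    Empty strings and values that are not lists/tuples yield an empty tuple.
    """
    if text == '':
        return ()
    value = ast.literal_eval(text)
    if isinstance(value, (list, tuple)):
        return tuple(value)
    return ()


def parse_position(text):
    """Turn a string such as "[-158.7, -153.5, -69.75]" into a tuple of floats.

    Missing positions (empty strings) become NaN.
    """
    parts = parse_literal_tuple(text)
    if not parts:
        return math.nan
    return tuple(float(p) for p in parts)


print(parse_literal_tuple("['ORIGINAL', 'PRIMARY', 'AXIAL']"))
print(parse_position("[-158.7, -153.5, -69.75]"))
```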

In [None]:
import ast  # literal_eval parses Python literals without executing arbitrary code

dftrain = pd.read_csv('/kaggle/input/osic-pulmonary-fibrosis-progression/train.csv')
dftest = pd.read_csv('/kaggle/input/osic-pulmonary-fibrosis-progression/test.csv')
dfmeta = pd.read_csv('/kaggle/input/osic-pulmonary-fibrosis-metadata/meta.csv', converters={
    'ImageType': ast.literal_eval,
    'ImagePositionPatient': lambda x: tuple(float(elm) for elm in ast.literal_eval(x)) if x != '' else np.nan
})
# Normalize ImageType to a tuple; anything that did not parse as a list becomes an empty tuple
dfmeta['ImageType'] = dfmeta.ImageType.map(lambda x: tuple(x) if isinstance(x, list) else tuple())
# Split the position into one column per axis (NaN where the position is missing)
dfmeta['ImagePositionPatientX'] = dfmeta['ImagePositionPatient'].map(lambda x: x[0] if isinstance(x, tuple) else np.nan)
dfmeta['ImagePositionPatientY'] = dfmeta['ImagePositionPatient'].map(lambda x: x[1] if isinstance(x, tuple) else np.nan)
dfmeta['ImagePositionPatientZ'] = dfmeta['ImagePositionPatient'].map(lambda x: x[2] if isinstance(x, tuple) else np.nan)

dcms = []
for root, dirs, fnames in os.walk('/kaggle/input/osic-pulmonary-fibrosis-progression/train'):
    dcms += list(os.path.join(root, f) for f in fnames if f.endswith('.dcm'))

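Build a per-patient table: the first non-null value of each attribute per patient, plus the number of slices (SliceCount) and the range of z positions covered by the scan (Span). TBC: explain this in more detail.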

In [None]:
dfpat = dfmeta.groupby('PatientID').first()
dfpat['SliceCount'] = dfmeta.groupby('PatientID').size()
dfpat['Span'] = dfmeta.groupby('PatientID')['ImagePositionPatientZ'].apply(lambda x: x.max() - x.min())

In [None]:
dfpat.loc['ID00165637202237320314458']

# 3. DICOM attributes

List of analyzed DICOM metadata (see cells below for graphs and tables supporting my arguments):

- [BitsAllocated](https://dicom.innolitics.com/ciods/ct-image/image-pixel/00280100): number of bits allocated per pixel in the image. Always 16 (DROP).
- [BitsStored](https://dicom.innolitics.com/ciods/ct-image/image-pixel/00280101): number of bits per pixel actually stored in the image. Ranges from 12 to 16 (TBC: find out what this means in practice).
- [BodyPartExamined](https://dicom.innolitics.com/ciods/raw-data/general-series/00180015): obvious meaning, always 'Chest' (DROP).
- [Rows](https://dicom.innolitics.com/ciods/mr-image/image-pixel/00280010), [Columns](https://dicom.innolitics.com/ciods/mr-image/image-pixel/00280011): exactly what you would expect, equivalent to obj.pixel_array.shape.
- [ConvolutionKernel](https://dicom.innolitics.com/ciods/ct-image/ct-image/00181210): meaning unclear to me so far (???). A wide range of values is present.
- [DeidentificationMethod](https://dicom.innolitics.com/ciods/ct-image/patient/00120063): how the CT scan was made anonymous. All values are identical (DROP)
- [GantryDetectorTilt](http://dicomlookup.com/lookup.asp?sw=Tnumber&q=(0018,1120)): the angle of the gantry when the CT scan was made. It's informational and its value is constant (DROP)
- [HighBit](https://dicom.innolitics.com/ciods/us-image/image-pixel/00280102): the specification says it should be 1 less than BitsStored. The train set agrees with that, so no extra info here (DROP).
- [InstanceNumber](https://dicom.innolitics.com/ciods/ct-image/general-image/00200013): an identification for the current file. In our case, this matches with the file name (e.g. for a file called 42.dcm, InstanceNumber == 42). Not useful (DROP).
- [KVP](https://dicom.innolitics.com/ciods/digital-x-ray-image/x-ray-generation/00180060): [Peak kilovoltage](https://en.wikipedia.org/wiki/Peak_kilovoltage) of the X-ray generator, i.e. the maximum voltage applied across the X-ray tube, which determines the maximum energy of the generated X-ray photons. As far as I understand, higher values tend to generate better images (TBC: study [this article](https://radiopaedia.org/articles/kilovoltage-peak)). Ranges from 100 to 140, with mode 120.
- [Manufacturer](https://dicom.innolitics.com/ciods/ct-image/general-equipment/00080070): the company that produced the equipment (e.g. SIEMENS, Philips...). There are several different values, although I don't see how these could be important for our purpose (DROP).
- [ManufacturerModelName](https://dicom.innolitics.com/ciods/ct-image/general-equipment/00081090): the equipment model's name. Same consideration as above (DROP).
- [Modality](https://www.dicomlibrary.com/dicom/modality/): describes what the DICOM file contains. In our case, it's always CT, for Computed Tomography scan (DROP).
- [PatientID](https://dicom.innolitics.com/ciods/ct-image/patient/00100020): self explanatory. In the training set, checked that it's always the same as the directory name the image is in (DROP).
- [PatientPosition](https://dicom.innolitics.com/ciods/ct-image/general-series/00185100): position of the patient when the CT scan was taken. Seems important, as the CT scan appears to be oriented differently for images with different values (see the GIFs in the cells below to see what I mean). The standard says it's just informative (TBC: check Patient Orientation Code Sequence).
    - HFS (Head-First Supine): head goes first, and the patient is in supine position (lying on her back, facing upwards, towards the ceiling).
    - FFS (Feet-First Supine): feet go first, and the patient is in supine position (lying on her back, facing upwards, towards the ceiling).
    - HFP (Head-First Prone): head goes first, and the patient is in prone position (lying face down, towards the floor).
    - FFP (Feet-First Prone): feet go first, and the patient is in prone position (lying face down, towards the floor).
- PatientSex: seems redundant, we already have sex in the tabular metadata (DROP).
- [PhotometricInterpretation](https://dicom.innolitics.com/ciods/ct-image/image-pixel/00280004): tells us how to interpret the pixel values in the image. For all CT scans here we have PhotometricInterpretation == 'MONOCHROME2', which means that it's a grayscale image where lower pixel values correspond to darker colors than higher ones (DROP).
- [PixelRepresentation](https://dicom.innolitics.com/ciods/ct-image/image-pixel/00280103): whether the image pixel values are encoded as signed or unsigned integers. I assume this is already taken into account by pydicom when reading the image (grepping its source code yields a number of matches). Our dataset has both signed and unsigned images. Too low level (DROP).
- [StudyInstanceUID](https://dicom.innolitics.com/ciods/ct-image/general-study/0020000d): identifier for the study. According to [this Stack Overflow answer](https://stackoverflow.com/questions/1434918/dicom-whats-the-point-of-sopinstanceuid-tag), DICOM defines a hierarchy patient -> study -> series -> instance. In our case, there is a 1-to-1 correspondence between PatientID and StudyInstanceUID, so this gives us no extra info (DROP).
- [SOPInstanceUID](https://dicom.innolitics.com/ciods/ct-image/sop-common/00080018): a unique identifier for each CT scan. (In DICOM, SOP means Service Object Pair. A CT scan is one of the possible types of SOPs the standard defines.) Not useful (DROP).
- [ImageType](https://dicom.innolitics.com/ciods/ct-image/general-image/00080008): information about what the image contains. It is actually a list of 2 or more values. Patient ID00421637202311550012437 appears to be corrupted, as it has the string literal '1'. The values mean the following:
    - ImageType[0] can be ORIGINAL or DERIVED. Derived images are generated from other ones, while originals are not (from what I've read, derived images entail some post-processing while original images do not). Most of the images are ORIGINAL, but we also have a few DERIVED ones.
    - ImageType[1] can be PRIMARY or SECONDARY. Primary images are created directly by the examination of the patient, while secondary ones are created afterwards (from what I've read, a 3D reconstruction would be secondary while a plain scan would be primary). Most of the images are PRIMARY, while a few of them are SECONDARY. The separation seems a little blurry, so I'm not sure how much info this gives us.
    - [ImageType[2]](https://dicom.innolitics.com/ciods/ct-image/ct-image/00080008) is CT scan-specific and may be AXIAL or LOCALIZER. Of these two, I can only see AXIAL images here. However, there are two other values in our dataset: REFORMATTED and OTHER (no idea what these are about).
    - Further values are implementation specific, very little info on them.
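
Several of the claims in the list above (constant columns, HighBit being BitsStored minus 1, the 1-to-1 correspondence between patient and study) are simple to check with pandas. The snippet below is a sketch on a tiny made-up table rather than on `dfmeta`; `constant_columns` and `is_one_to_one` are hypothetical helper names, and the study column is assumed to be called `StudyInstanceUID`.

```python
import pandas as pd


def constant_columns(df):
    """Columns holding a single distinct non-null value: candidates for DROP."""
    return [col for col in df.columns if df[col].nunique(dropna=True) == 1]


def is_one_to_one(df, left, right):
    """True if each value of `left` maps to exactly one value of `right` and vice versa."""
    pairs = df[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)


# Toy data, invented for illustration only
toy = pd.DataFrame({
    'PatientID': ['A', 'A', 'B'],
    'StudyInstanceUID': ['1.2.3', '1.2.3', '1.2.4'],
    'BitsAllocated': [16, 16, 16],
    'BitsStored': [12, 12, 16],
    'HighBit': [11, 11, 15],
})

print(constant_columns(toy))
print(is_one_to_one(toy, 'PatientID', 'StudyInstanceUID'))
print((toy.HighBit == toy.BitsStored - 1).all())
```

The same calls should work on `dfmeta`, as long as the columns being checked hold hashable values.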
    
    
The following is a list of attributes I've seen in several notebooks preprocessing images, like [this](https://www.kaggle.com/anarthal/osic-autoencoder-training/edit) and [this](https://www.kaggle.com/gzuidhof/full-preprocessing-tutorial):

- [ImagePositionPatient](https://dicom.innolitics.com/ciods/ct-image/image-plane/00200032): an array of 3 elements indicating the coordinates (x, y, z) of the upper-left corner of the image. Most of the images have this attribute, except for 3 patients.
    - The Z axis increases towards the patient's head (slices with greater ImagePositionPatient[2] correspond to the upper part of the lungs). This value can be used to order the slices and reconstruct the 3D volume the CT scan represents.
    - X and Y axis origins differ from patient to patient. In general, the X and Y origins are the same for every slice of a scan, except for two patients. TBC: study if these attributes are somehow useful.
- [SliceLocation](https://dicom.innolitics.com/ciods/ct-image/image-plane/00201041): a single number specifying the location of the image plane with respect to an implementation-defined plane. It contains similar information to ImagePositionPatient[2], but the two are not always consistent (sometimes ImagePositionPatient[2] increases in one direction along the z axis while SliceLocation increases in the opposite one). All train images that have a value for ImagePositionPatient[2] also have a value for SliceLocation. TBC: investigate further and see if there is any relation with ImageOrientationPatient.
- [RescaleType](https://dicom.innolitics.com/ciods/ct-image/ct-image/00281054): DICOM images are stored scaled. This attribute is supposed to indicate how to convert between the values in the pixel array (the "stored values") and actual meaningful values. In most cases, this attribute is either NaN or HU (meaning Hounsfield Units). Some others are US or UNSPECIFIED (synonyms). It's reasonable to assume everything is in HU.
- [RescaleIntercept](https://dicom.innolitics.com/ciods/ct-image/ct-image/00281052) and [RescaleSlope](https://dicom.innolitics.com/ciods/ct-image/ct-image/00281053): to convert from the pixel array to HU, we have to apply a linear transformation: `HU = RescaleIntercept + RescaleSlope * sv`, where `sv` is the stored value. Note that these values are constant across the slices of a single CT scan. In the training set only the intercept is meaningful, since the slope is always 1.
- [SliceThickness](https://dicom.innolitics.com/ciods/ct-image/image-plane/00180050): the thickness (in mm), along the Z axis, of the slice that this image represents.
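
The two operations described in this list, ordering slices by their z position and converting stored values to Hounsfield Units, can be sketched with plain NumPy. The function names `to_hounsfield` and `sort_slices_by_z` are hypothetical and the arrays are made up; in the notebook the inputs would come from `pixel_array`, `RescaleSlope`, `RescaleIntercept` and `ImagePositionPatient[2]`.

```python
import numpy as np


def to_hounsfield(stored, slope, intercept):
    """Apply HU = RescaleIntercept + RescaleSlope * sv to an array of stored values."""
    return stored.astype(float) * slope + intercept


def sort_slices_by_z(pixel_arrays, z_positions):
    """Stack 2D slices into a 3D volume ordered by increasing z position."""
    order = np.argsort(z_positions)
    return np.stack([pixel_arrays[i] for i in order])


# Made-up example: three 2x2 slices given out of order
slices = [np.full((2, 2), 2), np.full((2, 2), 0), np.full((2, 2), 1)]
z = [30.0, 10.0, 20.0]
volume = sort_slices_by_z(slices, z)

stored = np.array([[0, 1024]], dtype=np.uint16)
hu = to_hounsfield(stored, slope=1.0, intercept=-1024.0)
print(volume[:, 0, 0], hu)
```

Casting to float before applying the intercept matters for unsigned stored values, which would otherwise wrap around when a negative intercept is added.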


TBC:

- ImageOrientationPatient
- DistanceSourceToDetector
- DistanceSourceToPatient
- FocalSpots
- FrameOfReferenceUID
- GeneratorPower
- LargestImagePixelValue
- Modality
- PatientOrientation
- PixelPaddingValue
- PixelRepresentation
- PixelSpacing
- PositionReferenceIndicator
- RevolutionTime
- RotationDirection
- SamplesPerPixel
- SeriesInstanceUID
- SingleCollimationWidth
- SliceThickness
- SmallestImagePixelValue
- SpacingBetweenSlices
- SpatialResolution
- SpecificCharacterSet
- SpiralPitchFactor
- TableFeedPerRotation
- TableHeight
- TableSpeed
- TotalCollimationWidth
- WindowCenter
- WindowCenterWidthExplanation
- WindowWidth
- XRayTubeCurrent

In [None]:
# TBC: unfinished expression, commented out so the notebook runs
# dfmeta.SliceCount * dfmeta.

## 3.1. NAN count

In [None]:
plt.figure(figsize=(10, 20))
nans = dfmeta.isna().sum().sort_index()
sns.barplot(y=nans.index, x=nans, orient='h')

## 3.2. BitsStored

In [None]:
sns.countplot(x=dfmeta.BitsStored)

## 3.3. Image sizes

In [None]:
sizes = dfmeta.apply(lambda x: f'{x.Rows}x{x.Columns}', axis=1)
plt.figure(figsize=(15, 8))
sns.countplot(x=sizes)

## 3.4. KVP

In [None]:
plt.figure(figsize=(8, 8))
sns.countplot(x=dfmeta.KVP)

## 3.5. Manufacturer

In [None]:
plt.figure(figsize=(15, 8))
sns.countplot(x=dfmeta.Manufacturer)

In [None]:
dfmeta.ManufacturerModelName.value_counts()

## 3.6. PatientPosition

As explained above, this tells us how the patient was positioned in the scan. Let's visualize two positions to understand it. TBC: may be worth considering rotating the images?
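
If rotating turns out to be worthwhile, it could look something like the sketch below. It rests on an assumption I have not verified on this dataset: that prone scans (HFP, FFP) appear rotated 180 degrees in the image plane relative to supine ones. `normalize_orientation` is a hypothetical helper and the volume is a made-up array with axes (slice, row, column).

```python
import numpy as np


def normalize_orientation(volume, patient_position):
    """Rotate prone volumes by 180 degrees in the image plane; leave others untouched.

    ASSUMPTION: prone scans are rotated 180 degrees with respect to supine ones.
    """
    if patient_position in ('HFP', 'FFP'):
        # k=2 means two 90-degree turns, applied on the (row, column) axes
        return np.rot90(volume, k=2, axes=(1, 2))
    return volume


vol = np.arange(8).reshape(2, 2, 2)
rotated = normalize_orientation(vol, 'FFP')
print(rotated[0])
```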

In [None]:
tmp = dfmeta.groupby('PatientPosition')['PatientID'].nunique()
sns.barplot(x=tmp.index, y=tmp.values)

Let's use GIFs to see it. Note: if a patient has too many slices, the GIF may freeze your browser! Note how the CT scan changes depending on PatientPosition. The GIF creation is inspired by [this notebook](https://www.kaggle.com/andradaolteanu/pulmonary-fibrosis-competition-eda-dicom-prep).
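
One possible way to avoid oversized GIFs is to keep only an evenly spaced subset of slices before building the animation. This is a sketch, not something used in the cells of this notebook; `subsample_slices` and `max_frames` are hypothetical names and the limit of 60 frames is arbitrary.

```python
import numpy as np


def subsample_slices(volume, max_frames=60):
    """Keep at most `max_frames` evenly spaced slices along the first axis."""
    n_slices = len(volume)
    if n_slices <= max_frames:
        return volume
    # Evenly spaced indices that always include the first and the last slice
    idx = np.linspace(0, n_slices - 1, max_frames).round().astype(int)
    return volume[idx]


# Made-up volume with 200 slices of 1x1 pixels, each holding its own index
vol = np.arange(200).reshape(200, 1, 1)
small = subsample_slices(vol, max_frames=60)
print(small.shape)
```

The result can be passed to a GIF helper in place of the full volume.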

In [None]:
def display_gif(img, duration=10000, low=-2048., high=2048.):
    """Show a 3D volume (slices, rows, columns) as an animated GIF lasting `duration` ms in total."""
    def to_pil(slc):
        # Clip to the [low, high] window first, then scale to 0-255.
        # Casting to uint8 before clipping would make out-of-range values wrap around.
        pixels = 255. * (np.clip(slc, low, high) - low) / (high - low)
        return PIL.Image.fromarray(pixels.astype('uint8'), 'L')

    pil_imgs = [to_pil(slc) for slc in img]
    print(f'{len(pil_imgs)} frames for this patient')
    pil_imgs[0].save('tmp.gif', format='GIF',
                     append_images=pil_imgs[1:],
                     save_all=True,
                     duration=duration // len(pil_imgs), loop=0)
    IPython.display.display(IPython.display.Image('tmp.gif'))
    
def read_patient_img(patient_id):
    patient_dir = Path('/kaggle/input/osic-pulmonary-fibrosis-progression/train') / patient_id
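    # dcmread replaces read_file, which is deprecated in recent pydicom versions
    slices = [pydicom.dcmread(p) for p in patient_dir.glob('*.dcm')]
    try:
        slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))
    except AttributeError:
        # A few patients lack ImagePositionPatient: fall back to InstanceNumber,
        # which matches the file name in this dataset
        slices.sort(key=lambda x: int(x.InstanceNumber))
    # Convert stored values to Hounsfield Units: HU = RescaleSlope * sv + RescaleIntercept
    image = np.stack([s.pixel_array.astype(float) * s.RescaleSlope + s.RescaleIntercept for s in slices])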
    return image

In [None]:
selected_patients = dfmeta.groupby('PatientPosition').apply(lambda gp: gp.PatientID.value_counts().index[-1])
descr = {
    'FFP': 'Feet-first prone (head down)',
    'FFS': 'Feet-first supine (head up)',
    'HFP': 'Head-first prone (head down)',
    'HFS': 'Head-first supine (head up)',
}
for pos in ['FFP', 'FFS']:
    print(f'{descr[pos]}')
    display_gif(read_patient_img(selected_patients[pos]))

## 3.7. Pixel representation

In [None]:
dfmeta.PixelRepresentation.value_counts()

## 3.8. ImageType

In [None]:
dfmeta['ImageType0'] = dfmeta.ImageType.map(lambda x: x[0] if len(x) >= 1 else np.nan)
dfmeta['ImageType1'] = dfmeta.ImageType.map(lambda x: x[1] if len(x) >= 2 else np.nan)
dfmeta['ImageType2'] = dfmeta.ImageType.map(lambda x: x[2:])

In [None]:
sns.countplot(x=dfmeta.ImageType0)

In [None]:
sns.countplot(x=dfmeta.ImageType1)

In [None]:
dfmeta.groupby('PatientID').apply(lambda gp: gp.ImageType0.iloc[0]).value_counts()

In [None]:
dfmeta.groupby('PatientID').apply(lambda gp: gp.ImageType1.iloc[0]).value_counts()

In [None]:
dfmeta.ImageType.map(lambda x: x[:2]).value_counts()

In [None]:
dfmeta.ImageType2.value_counts()

## 3.9. ImagePositionPatient

In [None]:
dfmeta['ImagePositionPatientX'] = dfmeta.ImagePositionPatient.map(lambda x: x[0] if type(x) is tuple else np.nan)
dfmeta['ImagePositionPatientY'] = dfmeta.ImagePositionPatient.map(lambda x: x[1] if type(x) is tuple else np.nan)
dfmeta['ImagePositionPatientZ'] = dfmeta.ImagePositionPatient.map(lambda x: x[2] if type(x) is tuple else np.nan)

In [None]:
dfmeta[~dfmeta.ImagePositionPatient.isna()].groupby('PatientID').ImagePositionPatientY.nunique().value_counts()

## 3.10. Rescaling (converting to Hounsfield Units)

In [None]:
dfmeta['RescaleType'].value_counts()

In [None]:
dfmeta.groupby('PatientID')['RescaleIntercept'].nunique().value_counts()

In [None]:
dfmeta.groupby('PatientID')['RescaleSlope'].nunique().value_counts()

In [None]:
dfmeta['RescaleIntercept'].value_counts()

In [None]:
dfmeta['RescaleSlope'].value_counts()

## 3.11. Voxel size

In [None]:
sns.jointplot(data=dfpat, x='SliceThickness', y='SliceCount')

In [None]:
dfpat[dfpat['SliceThickness'] > 7]

In [None]:
patid = 'ID00229637202260254240583'
print(f'Patient {patid}, with SliceCount={dfpat.loc[patid, "SliceCount"]}, SliceThickness={dfpat.loc[patid, "SliceThickness"]}')

In [None]:
dfmeta[dfmeta.PatientID.eq(patid)].ImagePositionPatient

In [None]:
display_gif(read_patient_img('ID00229637202260254240583'))