# DICOM metadata

This notebook extracts every available metadata attribute from the DICOM files and stores it in a DataFrame, which is also written to a .csv file so that other notebooks can reuse it.
The process takes a few minutes and uses a lot of memory, so it is better to run it just once.

In [this other notebook](https://www.kaggle.com/anarthal/dicom-metadata-eda) I use this data to do some initial research on what each attribute means, followed by an EDA. This notebook just dumps the data into a CSV file.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pydicom
from tqdm import tqdm
import os

# 1. File listing

Find the paths of all files with the `.dcm` extension in the training set.

In [None]:
dcms = []
for root, dirs, fnames in os.walk('/kaggle/input/osic-pulmonary-fibrosis-progression/train'):
    dcms.extend(os.path.join(root, f) for f in fnames if f.endswith('.dcm'))
# Each .dcm file holds a single slice, so this counts files rather than whole scans
print(f'There are {len(dcms)} DICOM files')
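
As an alternative to the `os.walk` loop, the same listing can be sketched with `pathlib`. The helper name `list_dcm_files` is hypothetical, and sorting the result is an extra choice made here so the file order is the same on every run.

```python
from pathlib import Path

def list_dcm_files(root):
    """Return the sorted paths of every .dcm file under root, at any depth."""
    # rglob walks the directory tree recursively, like os.walk does
    return sorted(str(p) for p in Path(root).rglob('*.dcm') if p.is_file())

# Example call (path taken from the cell above):
# dcms = list_dcm_files('/kaggle/input/osic-pulmonary-fibrosis-progression/train')
```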

# 2. Attribute name listing

Let's collect every attribute present in **any** of the DICOM files. The `.dir()` method comes in handy for this, since it lists the attribute names found in a file. Note that the set of attributes varies from file to file, so inspecting a single file is not enough. Running this takes a few minutes.

In [None]:
attrs = set()
for fname in tqdm(dcms):
    with pydicom.dcmread(fname) as obj:
        attrs.update(obj.dir())
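
To see why a single file is not enough, here is a small sketch of the same union logic on plain lists. The three keyword lists are made up for illustration and stand in for what `obj.dir()` would return for three different files.

```python
# Made-up keyword lists, standing in for the result of obj.dir() on three files
file_keywords = [
    ['Modality', 'PatientID', 'SliceThickness'],
    ['Modality', 'PatientID', 'KVP'],
    ['Modality', 'PatientID', 'SliceThickness', 'SpacingBetweenSlices'],
]

def union_of_keywords(keyword_lists):
    """Accumulate every keyword seen in at least one list."""
    attrs = set()
    for keywords in keyword_lists:
        attrs.update(keywords)
    return attrs

all_attrs = union_of_keywords(file_keywords)
# Attributes that inspecting only the first file would have missed
missing_from_first = all_attrs - set(file_keywords[0])

print(sorted(all_attrs))
# ['KVP', 'Modality', 'PatientID', 'SliceThickness', 'SpacingBetweenSlices']
print(sorted(missing_from_first))
# ['KVP', 'SpacingBetweenSlices']
```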

This is the complete list of DICOM attributes found in the files. We drop `PixelData`, which contains the actual image, so we do not run out of memory, and `PatientName`, because the data is anonymized.

In [None]:
dcm_keys = sorted(attrs) # Sorted so the column order is the same on every run
dcm_keys.remove('PixelData') # The actual array of pixels, this is not metadata
dcm_keys.remove('PatientName') # Anonymous data!
dcm_keys

# 3. Load the actual values from the files

If an attribute is not present in a file, we store `np.nan` in its place. We also cast some pydicom-specific types to standard Python types to make things easier.

In [None]:
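meta = []
# Convert pydicom-specific types to standard Python types
typemap = {
    pydicom.uid.UID: str,
    pydicom.multival.MultiValue: list
}

def cast(x):
    # Look up a converter by exact type; values of any other type pass through unchanged
    return typemap.get(type(x), lambda v: v)(x)

for fname in tqdm(dcms):
    with pydicom.dcmread(fname) as obj:
        # Attributes missing from this file become np.nan
        meta.append([cast(obj.get(key, np.nan)) for key in dcm_keys])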

dfmeta = pd.DataFrame(meta, columns=dcm_keys)
dfmeta
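
The casting and NaN-filling logic can be tried out without any DICOM files. The sketch below uses stand-ins: `FakeUID` is a hypothetical `str` subclass playing the role of `pydicom.uid.UID`, a `tuple` plays the role of `MultiValue`, and plain dicts play the role of datasets. All values are made up.

```python
import math

class FakeUID(str):
    """Hypothetical stand-in for pydicom.uid.UID, which is also a str subclass."""

# tuple stands in for pydicom.multival.MultiValue
typemap = {FakeUID: str, tuple: list}

def cast(x):
    # Dispatch on the exact type; anything not in typemap is returned unchanged
    convert = typemap.get(type(x))
    return convert(x) if convert is not None else x

# Two made-up records; the second one lacks PixelSpacing
records = [
    {'SOPInstanceUID': FakeUID('1.2.3'), 'PixelSpacing': (0.5, 0.5)},
    {'SOPInstanceUID': FakeUID('1.2.4')},
]
keys = ['PixelSpacing', 'SOPInstanceUID']

# One row per record, one column per key, NaN where the key is missing
rows = [[cast(r.get(k, math.nan)) for k in keys] for r in records]
print(rows)
```

Because the lookup uses `type(x)` rather than `isinstance`, only values whose type is exactly one of the listed classes are converted. Instances of subclasses fall through unchanged.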

# 4. Done!

In [None]:
dfmeta.to_csv('meta.csv', index=False)
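
One thing to keep in mind when loading `meta.csv` elsewhere: CSV stores text only, so list-valued cells come back as strings and missing values as empty cells. Below is a minimal sketch of parsing such cells back into lists, using only the standard library. The helper `parse_cell` and the sample rows are hypothetical, and the sketch assumes list cells were written as Python list literals, which may not hold for every pydicom value type.

```python
import ast
import csv
import io

def parse_cell(text):
    """Turn a CSV cell back into a list when it looks like a list literal."""
    if text.startswith('[') and text.endswith(']'):
        try:
            return ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return text  # Not a valid literal after all: keep the raw text
    return text

# Made-up sample with the same shape as a metadata CSV
sample = io.StringIO('PatientID,PixelSpacing\nID001,"[0.5, 0.5]"\nID002,\n')
rows = [{k: parse_cell(v) for k, v in row.items()} for row in csv.DictReader(sample)]
print(rows)
```

With pandas, the same helper could be passed to `pd.read_csv` through its `converters` argument, for example `converters={'PixelSpacing': parse_cell}`, assuming that column exists in the file.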