This notebook is the first in a series that uses the [fast.ai](https://fast.ai) Medical Imaging API, which is built on top of PyTorch. It is modeled after [Jeremy Howard's notebooks](https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/discussion/114214) from the 2019 RSNA Intracranial Hemorrhage Detection Kaggle Challenge.

First, we need to upgrade the `fastai` library to version `2.0.x` and import the relevant libraries.

In [None]:
!pip install fastai --upgrade >/dev/null

In [None]:
# order of importing the fastai libraries matters here, possibly due to a namespace conflict
from fastai.medical.imaging import *
from fastai.basics import *

import glob

# Exploring the file tree
We know from the data overview that the data are split into `train` and _public_ `test` groups of 7,279 and 650 studies, respectively. Each study follows the standard DICOM hierarchy: the top-level directory is named with the `StudyInstanceUID`, the images sit in a `SeriesInstanceUID` sub-directory, and each file is named with its `SOPInstanceUID`. This gives the path format `<StudyInstanceUID>/<SeriesInstanceUID>/<SOPInstanceUID>.dcm`.
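
As a small illustration of that path format, the sketch below recovers the three identifiers from a file path using only `pathlib`. The helper name `parse_dicom_path` and the placeholder UIDs are made up for this example and are not part of the fastai API or the dataset.

```python
from pathlib import Path

def parse_dicom_path(fn):
    """Split <StudyInstanceUID>/<SeriesInstanceUID>/<SOPInstanceUID>.dcm into its three UIDs."""
    fn = Path(fn)
    return {
        'StudyInstanceUID': fn.parent.parent.name,   # top-level study directory
        'SeriesInstanceUID': fn.parent.name,         # series sub-directory
        'SOPInstanceUID': fn.stem,                   # file name without the .dcm suffix
    }

# Placeholder UIDs for illustration only (not real identifiers from the dataset)
example = Path('train/study_uid/series_uid/sop_uid.dcm')
ids = parse_dicom_path(example)
print(ids)
```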

In [None]:
path = Path('../input/rsna-str-pulmonary-embolism-detection')

In [None]:
!ls {path}

In [None]:
path_trn = path/'train'
dirs_trn = path_trn.ls()
dirs_trn[:5].attrgot('name')

In [None]:
path_tst = path/'test'
dirs_tst = path_tst.ls()
print(f'Number of training studies: {len(dirs_trn)}')
print(f'Number of test studies: {len(dirs_tst)}')

# Creating a DataFrame of DICOM metadata
Now we'll proceed to extract the metadata from the DICOM files and put it into a `pandas.DataFrame`, which we'll save in `feather` format for later use.

In [None]:
fns_trn = L(glob.glob(f'{path_trn}/**/*.dcm', recursive=True))
fns_trn = fns_trn.map(Path)
print(len(fns_trn))
fns_trn[:5]

Since there are ~1.8 million images in the training dataset, it's impractical to extract metadata for every image.

Instead, we'll select one image from each study for inclusion in our DICOM metadata DataFrame.

In [None]:
import gc, os
del(fns_trn)
gc.collect();

In [None]:
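# Walk the training tree and keep the first file listed in every directory that contains files
fns_trn = L()
for r, d, f in os.walk(path_trn):
    if f:  # only the series directories hold files; study directories hold sub-directories
        fn = Path(r)/f[0]
        fns_trn.append(fn)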
print(len(fns_trn))
fns_trn[:5]
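
The one-file-per-directory selection can be tried out on a throwaway directory tree before running it on the full dataset. The sketch below is a minimal stand-alone variant, not the exact cell above. It assumes two extras for reproducibility: it filters by the `.dcm` suffix and sorts names so the same file is picked on every run. The function name `first_file_per_dir` and the fake study and image names are hypothetical.

```python
import os
import tempfile
from pathlib import Path

def first_file_per_dir(root, suffix='.dcm'):
    """Return one file (first in sorted order) from every directory under root that holds matching files."""
    picked = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so os.walk visits sub-directories in a fixed order
        matches = sorted(f for f in filenames if f.endswith(suffix))
        if matches:
            picked.append(Path(dirpath) / matches[0])
    return picked

# Build a tiny fake tree: 2 studies, 1 series each, 3 empty placeholder files per series
with tempfile.TemporaryDirectory() as tmp:
    for study in ('study_a', 'study_b'):
        series_dir = Path(tmp) / study / 'series_1'
        series_dir.mkdir(parents=True)
        for sop in ('img_1', 'img_2', 'img_3'):
            (series_dir / f'{sop}.dcm').touch()
    picked = first_file_per_dir(tmp)
    picked_rel = [p.relative_to(tmp).as_posix() for p in picked]

print(picked_rel)
```

Each fake study contributes exactly one path, which mirrors the goal of sampling a single image per study.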

In [None]:
fn = fns_trn[0]
dcm = fn.dcmread()
dcm

In [None]:
df_trn = pd.DataFrame.from_dicoms(fns_trn, px_summ=False)
df_trn.to_feather('df_trn.fth')
df_trn.head()

We'll clean up here before proceeding.

In [None]:
del(df_trn, fns_trn)
gc.collect();

# Creating a DataFrame of labels
Here we'll extract the labels from `train.csv` and save them in `feather` format for future use.

In [None]:
path_lbls = path/'train.csv'
lbls = pd.read_csv(path_lbls)
print(lbls.shape)
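# ignore_index=True renumbers the rows 0..n-1; some pandas versions refuse to write a non-default index to feather
lbls.drop_duplicates(['StudyInstanceUID', 'SOPInstanceUID'], inplace=True, ignore_index=True)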
print(lbls.shape)
lbls.head()

Looks like the labels are in a nice, readable format, so we'll save them in `feather` format.

In [None]:
lbls.to_feather('lbls.fth')
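
To see what the de-duplication step earlier in this section does, here is a toy example with made-up IDs and a hypothetical `label` column. It only assumes `pandas` is installed. It is a sketch of `drop_duplicates` on a column subset, not a reproduction of `train.csv`.

```python
import pandas as pd

# Toy labels table with made-up IDs; the last row repeats the first (study, image) pair
toy = pd.DataFrame({
    'StudyInstanceUID': ['s1', 's1', 's2', 's1'],
    'SOPInstanceUID':   ['a',  'b',  'a',  'a'],
    'label':            [0,    1,    0,    0],   # hypothetical label column
})

# Keep the first row for each (StudyInstanceUID, SOPInstanceUID) pair and renumber the index
deduped = toy.drop_duplicates(['StudyInstanceUID', 'SOPInstanceUID'], ignore_index=True)
print(toy.shape, deduped.shape)
```

If the two printed shapes match on the real data, as they may well do here, the table contained no repeated `(StudyInstanceUID, SOPInstanceUID)` pairs.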

The [second notebook in the series](https://www.kaggle.com/wfwiggins203/exploring-the-dicom-metadata-images-with-fast-ai) explores the DICOM metadata a little further and looks at a sampling of the images. I try to inject some extra domain knowledge from my day job as a radiologist.