In [None]:
import pandas as pd
import pydicom
import glob
import ast
import numpy as np
import matplotlib.pyplot as plt

# Stratification

Cross Validation: Splits the data into k "random" folds.  
Stratified Cross Validation: Splits the data into k folds, making sure each fold is representative of the original data (class distribution, mean, variance, etc.).  

It is straightforward to do a stratified K-fold split on a dataset using a single feature (the class label) with e.g. `sklearn.model_selection.StratifiedKFold()`. But what if we want to stratify on multiple features in addition to the label itself?  
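
As a point of reference, the single-label case can be sketched as follows. The data here is made up for illustration (20 samples, 25 % positives), and `positives_per_fold` is just a name chosen for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (made up for illustration): 20 samples, 5 positives and 15 negatives
X = np.arange(20).reshape(-1, 1)
y = np.array([1] * 5 + [0] * 15)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = []
for train_index, test_index in skf.split(X, y):
    # the held-out part of each split should keep the 25 % positive rate
    positives_per_fold.append(int(y[test_index].sum()))

print(positives_per_fold)
```

Each held-out fold receives the same share of positives as the full dataset. The rest of the notebook extends this idea from one label to a combination of features.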

The SIIM-FISABIO-RSNA COVID-19 Detection dataset has metadata from multiple sources: the .csv files and the metadata stored in the DICOM files. In this notebook we will collect data from all the sources and create stratified splits based on multiple features.



# Inspect dicom metadata
Not all the fields are very interesting, so we only extract the ones we think could be useful in the stratification process. In this case we want the following properties:  
  * Patient's Sex
  * Patient ID
  * Photometric Interpretation. The images are either 'MONOCHROME1' or 'MONOCHROME2' format. The difference between them is explained [here](http://dicom.nema.org/medical/dicom/current/output/chtml/part03/sect_C.7.6.3.html#sect_C.7.6.3.1.2). 
  * Imager Pixel Spacing. This is a numerical value, good for practicing binning.
  
Let's take a look at what information is found in the DICOM files:

In [None]:
# dcmread is the current name of the reader; read_file is the older alias
pydicom.dcmread('../input/siim-covid19-detection/train/00e936c58da6/fb532194f195/b81969467c6b.dcm')

# Collect all data into DataFrame
Below, data is collected from the .csv files as well as the DICOM files. Some of the studies have multiple images of the same patient. We clearly do not want images from the same study spread across several folds (data leakage), so we keep the data at the study level. At the same time we want to keep track of how many bounding boxes there are within each study as a potential stratification feature.
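
The per-study bounding box counts can also be computed without an explicit loop. The sketch below uses a made-up stand-in for `train_image_level.csv`. It assumes the `boxes` column holds either a string with a Python-style list of dicts or a missing value. The names `toy`, `count_boxes` and `per_study` are chosen for this example only.

```python
import ast
import numpy as np
import pandas as pd

# Toy stand-in for train_image_level.csv (all values are made up)
toy = pd.DataFrame({
    'StudyInstanceUID': ['s1', 's1', 's2', 's3'],
    'boxes': ["[{'x': 1, 'y': 2, 'width': 3, 'height': 4}]",
              np.nan,
              "[{'x': 1, 'y': 2, 'width': 3, 'height': 4}, "
              "{'x': 5, 'y': 6, 'width': 7, 'height': 8}]",
              np.nan],
})

def count_boxes(cell):
    # missing annotations are treated as zero boxes
    if pd.isna(cell):
        return 0
    return len(ast.literal_eval(cell))

toy['n_boxes'] = toy['boxes'].apply(count_boxes)
toy['has_box'] = toy['n_boxes'] > 0

# aggregate image-level counts to study level
per_study = toy.groupby('StudyInstanceUID').agg(
    TotalBBoxCount=('n_boxes', 'sum'),
    FilesWithBBoxes=('has_box', 'sum'),
)
per_study['HasBBoxes'] = per_study['TotalBBoxCount'] > 0
print(per_study)
```

The resulting table is indexed by study and could be joined onto a study-level DataFrame on the study ID.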

In [None]:
dfs = pd.read_csv('../input/siim-covid19-detection/train_study_level.csv')
data, fids = [], []

for i in range(len(dfs)):
    study = dfs.id.iloc[i].split('_')[0]
    files = glob.glob('../input/siim-covid19-detection/train/'+study+'/**/*.dcm')
    dcm_data = pydicom.dcmread(files[0]) # fetch dicom data from first file in study
    data.append([study, 
                dfs['Negative for Pneumonia'].iloc[i], 
                dfs['Typical Appearance'].iloc[i],
                dfs['Indeterminate Appearance'].iloc[i],
                dfs['Atypical Appearance'].iloc[i],
                dcm_data.get('PatientSex'),
                dcm_data.get('PatientID'),
                dcm_data.get('PhotometricInterpretation'),
                dcm_data.get('ImagerPixelSpacing')[0],
                len(files)
                ])
    # create list of file IDs
    flist = []
    for f in files:
        flist.append(f.split('/')[-1].split('.')[0])
    fids.append(flist)
    
# assemble DataFrame
df = pd.DataFrame(data, columns = ['Study', 'Negative for Pneumonia', 'Typical Appearance', 
                                   'Indeterminate Appearance', 'Atypical Appearance',
                                   'PatientSex', 'PatientID', 'ImageType', 'PixelSpacing', 'StudyFileCount'])
# count bounding boxes per study
bcnt, fcnt = np.zeros(len(df), dtype=int), np.zeros(len(df), dtype=int)

dfi = pd.read_csv('../input/siim-covid19-detection/train_image_level.csv')
for i in range(len(dfi)):
    try:
        cnt = len(ast.literal_eval(dfi.boxes.iloc[i]))
    except (ValueError, SyntaxError, TypeError):
        cnt = 0  # boxes entry is not a parsable list (e.g. missing)
    # row of the study this image belongs to
    row = df[df['Study']==dfi.StudyInstanceUID.iloc[i]].index.values[0]
    # accumulated # of bboxes per study
    bcnt[row] += cnt
    # number of images with bboxes per study
    if cnt > 0:
        fcnt[row] += 1

# add last columns to DataFrame
df['HasBBoxes'] = (bcnt > 0)
df['FilesWithBBoxes'] = fcnt
df['TotalBBoxCount'] = bcnt
df['Group'] = np.zeros(len(df), dtype=int) # This will be our stratification column later on
df['FileIDs'] = fids
df.sample(15)

# Binning of continuous data
In the DataFrame above, 'PixelSpacing' is a column that needs binning. Let's look at the histogram:

In [None]:
df.hist('PixelSpacing', bins=12)
plt.title('PixelSpacing (unique values={})'.format(len(df.PixelSpacing.unique())));

Pandas has two functions to bin data: `cut` and `qcut`. `cut` is used when we want to define the bin edges ourselves (typically uniform bin widths), while `qcut` is used when we want about the same number of elements in each bin. Here we go for the latter and bin the 'PixelSpacing' column into three bins called 'High', 'Med' and 'Low', stored in a new column called 'Resolution'. The labels describe image resolution, so the bin with the smallest pixel spacing is labelled 'High'.
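
To make the difference between the two functions concrete, here is a small sketch on made-up, skewed values. The bin labels `bin0`, `bin1`, `bin2` and the variable names are chosen for this example only.

```python
import pandas as pd

# Made-up, skewed values standing in for a continuous feature
values = pd.Series([0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.22, 0.31, 0.40])
labels = ['bin0', 'bin1', 'bin2']

# cut: three equal-width intervals over the value range
by_width = pd.cut(values, bins=3, labels=labels)
# qcut: three quantile-based intervals with about equal counts
by_count = pd.qcut(values, q=3, labels=labels)

print(by_width.value_counts().sort_index())
print(by_count.value_counts().sort_index())
```

On skewed data like this, `cut` puts most values into the first bin, while `qcut` spreads them evenly. Evenly filled bins are what we want for stratification, since sparse bins quickly lead to groups that are too small to split.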

In [None]:
df['Resolution'] = pd.qcut(df['PixelSpacing'], q=3, labels=['High', 'Med', 'Low'])
df.sample(5)

Check the distribution of image resolutions - looks fine:

In [None]:
df.Resolution.value_counts()

# Using pivot tables to assess stratification features
Now that we have all the features represented in a DataFrame, we can use the pandas pivot table function to explore stratification options. Which features to stratify on depends on, for example, what type of model is to be trained. Generally we try to have at least as many examples (studies in this case) per stratification group as there are folds, say 5. The pivot table is used to group and count the number of studies according to the features we select. Then a group number is assigned to each row in the pivot table, and this group number will be our stratification column in the end.  
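
The rule of thumb about group sizes can be checked programmatically. The sketch below uses made-up data and `groupby(...).size()` instead of a pivot table. The column names `label` and `sex` and the variables `group_sizes` and `too_small` are chosen for this example only.

```python
import pandas as pd

# Toy data (made up): one label column and one extra feature
toy = pd.DataFrame({
    'label': [0] * 12 + [1] * 8,
    'sex':   ['M'] * 6 + ['F'] * 6 + ['M'] * 5 + ['F'] * 3,
})
n_splits = 5

# number of rows for every observed feature combination
group_sizes = toy.groupby(['label', 'sex']).size()
print(group_sizes)

# combinations with fewer rows than folds cannot be spread over all folds
too_small = group_sizes[group_sizes < n_splits]
print(too_small)
```

If `too_small` is not empty, the last feature added is probably one too many, or its bins need to be made coarser.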

Let's start off with just the labels:

In [None]:
idx = ["Negative for Pneumonia", "Typical Appearance", "Indeterminate Appearance", "Atypical Appearance"]
table = pd.pivot_table(df,
                   index=idx,
                   values=[],
                   aggfunc=[len], margins=True, margins_name="Total")
table['Group'] = np.arange(0, len(table), dtype=int)
table

We observe that the dataset has only 4 classes (one-hot encoded as it is), and a total of 6054 studies. Now, let's add the patient's sex:

In [None]:
idx = ["Negative for Pneumonia", "Typical Appearance", "Indeterminate Appearance", "Atypical Appearance", "PatientSex"]
table = pd.pivot_table(df,
                       index=idx,
                       values=[],
                       aggfunc=[len], margins=True, margins_name="Total")
table['Group'] = np.arange(0, len(table), dtype=int)
table

Still plenty of examples in each bin. Add image type to the features:

In [None]:
idx = ["Negative for Pneumonia", "Typical Appearance", "Indeterminate Appearance", "Atypical Appearance", "PatientSex", "ImageType"]
table = pd.pivot_table(df,
                       index=idx,
                       values=[],
                       aggfunc=[len], margins=True, margins_name="Total")
table['Group'] = np.arange(0, len(table), dtype=int)
table

Finally add resolution to the feature list:

In [None]:
idx = index = ["Negative for Pneumonia", "Typical Appearance", "Indeterminate Appearance", "Atypical Appearance", "PatientSex", "ImageType", "Resolution"]
table = pd.pivot_table(df,
                       index=idx,
                       values=[],
                       aggfunc=[len], margins=True, margins_name="Total")
table['Group'] = np.arange(0, len(table), dtype=int)
table

The smallest group has 8 examples (studies). Let's stop there, with 48 unique feature combinations. The last step now is to add the group number from the pivot table to our DataFrame. There is probably a more elegant way of doing this, but below we simply loop through each row in the pivot table while updating the Group column in the dataset DataFrame.
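
One candidate for the more elegant way mentioned above is `GroupBy.ngroup`, which numbers each row by the feature combination it belongs to. The sketch below runs on made-up data, and the names `toy` and `features` are chosen for this example only. The resulting numbers need not match the pivot table's row numbers, and this has not been checked against the notebook's data. For categorical columns such as 'Resolution' it may be worth passing `observed=True` to `groupby`, so that only combinations present in the data are numbered.

```python
import pandas as pd

# Toy data (made up) with two stratification features
toy = pd.DataFrame({
    'label': [0, 0, 1, 1, 0, 1],
    'sex':   ['F', 'M', 'F', 'M', 'F', 'M'],
})
features = ['label', 'sex']

# ngroup() numbers the groups 0..n-1 in the order of the sorted group keys
toy['Group'] = toy.groupby(features).ngroup()
print(toy)
```

The notebook itself continues with the loop-based version.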

In [None]:
def set_group(org_df, pivtable, indexes):
    # loop through each group in the pivot table (the last row is the "Total" margin)
    for g in range(len(pivtable)-1):
        # build a query string that selects the rows of this stratification group
        terms = []
        for i in range(len(indexes)):
            val = str(pivtable.index.get_level_values(i)[g])
            if not val.isnumeric():
                val = '"' + val + '"'  # quote non-numeric values
            terms.append('`' + indexes[i] + '` == ' + val)
        qstr = ' and '.join(terms)
        # assign the group number to all matching rows
        org_df.loc[org_df.query(qstr).index, 'Group'] = g
    return org_df

In [None]:
# Assign the group names
df = set_group(df, table, idx)
df.to_pickle('dataset.pkl')

# Perform stratified K-Folds Split
Once we have grouped the data using pivot tables, splitting the data into K folds is straightforward. Here we use the `StratifiedKFold` class from scikit-learn:

In [None]:
from sklearn.model_selection import StratifiedKFold

# create column names, one per (feature, value) pair
cols, keys = [], []
for name in idx:
    for val in df[name].value_counts().index:
        cols.append(name + '=' + str(val))
        keys.append((name, val))
# split and count examples per split (counted on the training part of each split)
skf = StratifiedKFold(n_splits=5)
rows = []
for train_index, test_index in skf.split(df, df.Group):
    train = df.iloc[train_index]
    # count rows per (feature, value) pair in a fixed order so columns line up
    rows.append([(train[name] == val).sum() for name, val in keys])
dfs = pd.DataFrame(rows, columns=cols)
dfs

Notice how the number of samples per feature value is evenly distributed across the 5 splits.
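
For later use it can be handy to store the fold number of each row in a column. A minimal sketch on made-up data follows. The `Fold` column and the `toy` DataFrame are assumptions for this example and not part of the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data (made up): 4 stratification groups with 10 rows each
toy = pd.DataFrame({'Group': np.repeat([0, 1, 2, 3], 10)})
toy['Fold'] = -1

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_index, test_index) in enumerate(skf.split(toy, toy['Group'])):
    # the held-out part of each split defines the fold of those rows
    toy.loc[toy.index[test_index], 'Fold'] = fold

# rows per (group, fold) combination
fold_table = pd.crosstab(toy['Group'], toy['Fold'])
print(fold_table)
```

Every row ends up in exactly one fold, and the cross table shows how the rows of each group are spread over the folds.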

# Summary
In this notebook we have seen how Pandas pivot tables can be a powerful tool for creating stratified K-Folds splits on a dataset when multiple features are considered.