<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/26680/logos/header.png?t=2021-04-23-22-04-05)" />

# Project Description

## Understanding the challenge

This challenge, like the dataset itself, has two levels. The first is the image level, which contains the chest radiographs. Above it is the study level, which contains the overall conclusion drawn from all of a patient's radiographs.  
At the study level, specialists classify each study as `Negative for Pneumonia`, or as having a `Typical Appearance`, `Indeterminate Appearance`, or `Atypical Appearance` for COVID-19.
The grading system is based on [this paper](https://journals.lww.com/thoracicimaging/Fulltext/2020/11000/Review_of_Chest_Radiograph_Findings_of_COVID_19.4.aspx), which proposes a new reporting language for chest radiograph (CXR) findings related to COVID-19, as described in the following table (Table 1 in the paper):


> | Radiographic Classification | CXR Findings | Suggested Reporting Language |
> |:--------------------------|:-------------|:----------------------------|
> | Typical appearance | Multifocal bilateral, peripheral opacities; Opacities with rounded morphology; Lower lung–predominant distribution | “Findings typical of COVID-19 pneumonia are present. However, these can overlap with other infections, drug reactions, and other causes of acute lung injury” |
> | Indeterminate appearance | Absence of typical findings AND Unilateral, central or upper lung predominant distribution | “Findings indeterminate for COVID-19 pneumonia and which can occur with a variety of infections and noninfectious conditions” |
> | Atypical appearance | Pneumothorax or pleural effusion; Pulmonary edema; Lobar consolidation; Solitary lung nodule or mass; Diffuse tiny nodules; Cavity | “Findings atypical or uncommonly reported for COVID-19 pneumonia. Consider alternative diagnoses” |
> | Negative for pneumonia | No lung opacities | “No findings of pneumonia. However, chest radiographic findings can be absent early in the course of COVID-19 pneumonia” |
      
Although these findings refer to the CXRs themselves, in this challenge the labels are provided only at the study level, and each study can have many images.
At the image level, each image has a list of bounding boxes of findings. The bounding boxes can contain findings of different types, as described by the competition hosts:
> Bounding boxes were placed on lung opacities, whether typical or indeterminate. Bounding boxes were also placed on some atypical findings including solitary lobar consolidation, nodules/masses, and cavities. Bounding boxes were not placed on pleural effusions, or pneumothoraces. No bounding boxes were placed for the negative for pneumonia category.

The dataset does not distinguish between finding types. All findings are labeled `opacity`, and the predicted class of every finding in the submission should always be `opacity`.  
The table above describes how a study is graded according to the findings in its images. Even though the exact meaning of the terminology is definitely beyond my understanding, one thing we can learn from this table is that the classification is based on the nature of the findings as well as on their location in the lungs. This is crucial for understanding what our model is supposed to learn.

## Evaluation

The evaluation at the study level is quite simple: we will check our prediction accuracy. But at the image level, our predictions are bounding boxes. The predicted boxes will probably not match the labeled ones exactly, and we do not care about minor differences. So how do we decide whether a prediction is consistent with the labels?
What matters most is how much the predicted area overlaps the ground truth. For an ideal prediction, the predicted area matches the ground truth exactly, so the intersection of the two boxes equals each of them. In a more realistic case, the prediction is a bit smaller or larger than the ground truth, or extends too far in one direction and falls short in another. In all these cases, the larger the intersection is relative to both the predicted area and the ground truth area, the more we can regard the prediction as correct. This is the rationale behind the PASCAL VOC2010 IoU (Intersection over Union) evaluation method, which is used in this competition: a bounding box prediction is considered correct if the ratio of the intersection to the union of the prediction and the ground truth is greater than $0.5$, i.e., we require
$$IoU = \frac{|A_y \cap A_{\hat{y}}|}{|A_y \cup A_{\hat{y}}|} > 0.5$$  
where $A_y$ is the ground truth bounding box, $A_{\hat{y}}$ is the predicted box, and $|\cdot|$ denotes area.
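
To make the criterion concrete, here is a minimal sketch of the IoU computation for two axis-aligned boxes. The helper names `box_area` and `iou` and the sample coordinates are illustrative assumptions, not part of the competition code; boxes are assumed to be given as `(xmin, ymin, xmax, ymax)`.

```python
def box_area(box):
    """Area of an (xmin, ymin, xmax, ymax) box; zero if the box is empty."""
    xmin, ymin, xmax, ymax = box
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    # Corners of the overlapping rectangle (empty if the boxes are disjoint).
    inter = (
        max(box_a[0], box_b[0]),
        max(box_a[1], box_b[1]),
        min(box_a[2], box_b[2]),
        min(box_a[3], box_b[3]),
    )
    inter_area = box_area(inter)
    union_area = box_area(box_a) + box_area(box_b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


ground_truth = (100, 100, 300, 300)  # hypothetical labeled box
prediction = (150, 100, 350, 300)    # hypothetical prediction, shifted right by 50 px
score = iou(ground_truth, prediction)
print(f"IoU = {score:.2f}, counted as correct: {score > 0.5}")
```

In this example the boxes share an area of 30000 pixels out of a union of 50000, so the IoU is 0.6 and the prediction would count as correct under the 0.5 threshold.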

# Notebook Initialization

## Basic Imports

In [None]:
!pip install numpy --upgrade
!pip install python-gdcm

In [None]:
from pathlib import Path
import sys
from ast import literal_eval

import numpy as np
import pandas as pd

import matplotlib as mlp
import matplotlib.pyplot as plt
import matplotlib.patches as patches

import seaborn as sns

print(sys.version)

mlp.rcParams['figure.figsize'] = (15, 7)

## Environment Settings

In [None]:
is_colab = 'google.colab' in sys.modules
if is_colab:
    from google.colab import drive
    drive.mount('/content/drive')
    path = Path('/content/drive/MyDrive/covid19-detection/data') 
else:
    path = Path('/kaggle/input/siim-covid19-detection') 

# EDA

The dataset is composed of three parts: the CXR files in DICOM format, and two metadata tables, one for the image level and another for the study level. Let's first explore the image-level metadata of the training set.

In [None]:
image_df = pd.read_csv(path/'train_image_level.csv', index_col='id')

In [None]:
image_df.head()

Before doing anything else, we'd like to change this terrible column name `StudyInstanceUID` to a more reasonable one.

In [None]:
image_df = image_df.rename(columns={'StudyInstanceUID':'study_id'})

Now it's much better.

For each image we are provided with the image id, the study id, the bounding boxes of the findings, and a label for each bounding box. Let's examine the label column first.
The content of the label column follows the format required for submission. It contains descriptions of any number of findings, separated by whitespace. Each description consists of 6 fields, also separated by whitespace, as follows:
  
`finding_label confidence xmin ymin xmax ymax`  
Each row repeats this pattern once per finding in the image. So if an image has $k$ findings, its label will be:
`finding_label_1 confidence_1 xmin_1 ymin_1 xmax_1 ymax_1 finding_label_2 ...  finding_label_k confidence_k xmin_k ymin_k xmax_k ymax_k`
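
As a small worked example of this format, the following sketch splits one label string into groups of six fields using only the standard library. The function name `parse_label_string` and the sample string are hypothetical and the coordinate values are made up; the notebook's own pandas-based extraction is a separate step.

```python
FIELDS = ("label", "confidence", "xmin", "ymin", "xmax", "ymax")


def parse_label_string(label):
    """Split a whitespace-separated label string into one dict per finding."""
    tokens = label.split()
    if len(tokens) % len(FIELDS) != 0:
        raise ValueError("label string does not contain a multiple of 6 fields")
    findings = []
    for start in range(0, len(tokens), len(FIELDS)):
        name, *numbers = tokens[start:start + len(FIELDS)]
        record = {"label": name}
        # The remaining five fields are numeric: confidence and box corners.
        record.update(zip(FIELDS[1:], map(float, numbers)))
        findings.append(record)
    return findings


# Hypothetical label string with k = 2 findings (values are made up).
sample = "opacity 1 10 20 110 220 opacity 1 300 40 450 260"
parsed = parse_label_string(sample)
print(len(parsed), parsed[0])
```

Raising on a malformed string is a design choice of this sketch; a more tolerant variant could log the row and continue.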

Let's extract these values.

In [None]:
def extract_label(row):
    values = row.label.split()
    if len(values) % 6 != 0:
        # corrupted row: the label must hold a whole number of 6-field findings
        print(f'row {row.name}: wrong number of parameters in label field')
    fields = 'id finding_id label confident xmin ymin xmax ymax'.split()
    # one dict per finding: image id, running finding index, then the 6 label fields
    return [dict(zip(fields, [row.name, i] + values[6*i:6*(i+1)])) for i in range(len(values) // 6)]

In [None]:
findings = pd.DataFrame.from_dict(image_df.apply(extract_label, axis=1).sum()).set_index('id')
findings.head()

Now we can see the domains of these values:

In [None]:
print(f'The unique findings label values are {findings.label.unique()}')

In [None]:
print(f'The unique findings confidence values are {findings.confident.unique()}')

The labels are only `none` and `opacity`, and the confidence in the training set is always 1 (since this is a labeled dataset). All bounding box data is also provided in the `boxes` field, which is `NaN` when there are no findings (as we can see in the second row in the head of the DataFrame printed above). So in fact, all the data we need exists in the `boxes` field.  
Thus, we can extract the findings data directly from the `boxes` field and examine some of its properties.

In [None]:
findings = image_df.apply(lambda x: [{'id':x.name, 'finding_id': i, **box} for i, box in enumerate(literal_eval(x.boxes))] if type(x.boxes) == str else [{'id': x.name}], axis=1)
findings = pd.DataFrame.from_dict(findings.sum()).set_index('id')

In [None]:
print(f'Total number of findings: {findings.shape[0]}')
findings.head()
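
Note that the `boxes` entries describe each box by its top-left corner and size (`x`, `y`, `width`, `height`, the keys used elsewhere in this notebook), whereas the label string uses corner coordinates. Below is a minimal sketch of the conversion between the two; the helper names `to_corners` and `to_label_string` and the sample box are illustrative assumptions.

```python
def to_corners(box):
    """Convert {'x', 'y', 'width', 'height'} to (xmin, ymin, xmax, ymax)."""
    return (box["x"], box["y"], box["x"] + box["width"], box["y"] + box["height"])


def to_label_string(boxes, label="opacity", confidence=1):
    """Join boxes into 'label confidence xmin ymin xmax ymax ...' form.

    Returns an empty string for an empty list; how images without
    findings should be encoded is left out of this sketch.
    """
    parts = []
    for box in boxes:
        corners = to_corners(box)
        parts.append(" ".join([label, str(confidence), *(str(v) for v in corners)]))
    return " ".join(parts)


# Hypothetical box, made up for illustration.
boxes = [{"x": 10.5, "y": 20.0, "width": 100.0, "height": 200.0}]
print(to_label_string(boxes))
```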

Before further exploration of findings properties, let's explore the study-level metadata:

In [None]:
study_df = pd.read_csv(path/'train_study_level.csv', index_col='id')

In [None]:
study_df.head()

In this dataframe each study is classified into one of 4 classes: `Negative for Pneumonia`, `Typical Appearance`, `Indeterminate Appearance`, and `Atypical Appearance`. It is important to know how these classes are distributed over the dataset.  
The evaluation section of the competition page on Kaggle states that
> Studies in the test set may contain more than one label. They are as follows: `negative`, `typical`, `indeterminate`, `atypical`

Accordingly,  this is a multilabel classification task.    
In contrast, in [a post](https://www.kaggle.com/c/siim-covid19-detection/discussion/240250) in the competition discussion section, the hosts indicated that
> Per the grading schema, chest radiographs are classified into one of four categories, which are **mutually exclusive**

Since the two descriptions contradict each other, it is worth inspecting the training set to see which labels can be assigned to a study together.

In [None]:
(study_df.apply(sum, axis=1) == 1).all()

So in the training set, each study has a single label attached to it, and the table is in fact a one-hot encoding of each study's assignment to one of the 4 classes.  

Since the result of our check on the training set supports the second post, and since the labels seem to be mutually exclusive by their very meaning, we will treat this as a single-label classification task. For convenience, we will store the labels in a single column rather than in one-hot encoding format and join it with the images dataframe.

In [None]:
image_df.study_id += '_study' 
study_labels = study_df.idxmax(axis=1).rename('study_label')
# study_labels.index = study_labels.index.str.extract('(^[^_]*)').apply(lambda x:x[0], axis=1)
image_df = image_df.merge(study_labels, left_on='study_id', right_index=True)

Next we will check how these classes are distributed over the dataset.

In [None]:
label_count = study_df.sum()
plt.figure(figsize=(8, 8))
plt.pie(label_count, labels=label_count.index, wedgeprops={'edgecolor': 'black'}, autopct='%1.f%%', textprops={'fontsize': 16}, explode=[.01]*4, shadow=True)
plt.title('Label Distribution', fontdict={'fontsize':22});

We have 47% certain Covid cases (typical appearance), 36% non-covid (28% negative for pneumonia and 8% atypical for covid), and 17% obscure cases. From a covid vs non-covid point of view, the dataset is quite balanced. But from the classification point of view, almost 50% of the cases belong to one class and only 8% belong to another.  
To better understand the distribution, let’s see the class distribution in absolute numbers:

In [None]:
g = sns.barplot(x=label_count.index, y=label_count)
for label, count in zip(range(label_count.index.shape[0]), label_count):
    g.text(label, count, count, color='black', ha="center", fontdict=dict(fontsize=14, weight='bold'))

Next, let's look at some properties of the findings. The number of findings varies from image to image. Each finding is an opacity (or another of the finding types mentioned above) in the CXR, and for each one we are provided with the bounding box of the affected area. Let’s look at the number of findings per image and the main statistical properties of their areas: sum, mean, max, etc.

In [None]:
findings['area'] = findings.width * findings.height
findings_props = findings.groupby('id').area.agg(['count','sum', 'min', 'max', 'mean', 'std'])

In [None]:
# findings_props.index = findings_props.index.str.extract('(^[^_]*)').apply(lambda x:x[0], axis=1)
image_props = image_df.join(findings_props)
image_props.head()

The findings count for each class label is:

In [None]:
plt.figure(figsize=(10,6))
plt.gca().yaxis.grid(linestyle='--')
sns.violinplot(data=image_props, x='study_label', y='count')
plt.title('Findings Count', fontdict={'fontsize':16})
plt.show()

It is clear now that all the negative results have no findings at all, as stated in the grading method table. On the other hand, for each of the other three types it seems that there are instances with no opacity findings, contrary to the descriptions of these grades in the table above. But is this really the case? We saw earlier that we have more images than studies. That is, some studies have more than one image. So it seems that in some cases the prognosis is based on findings that are marked in only one of the scans. Let's verify this conclusion.

In [None]:
study_findings_props = image_props.groupby(['study_id'])
study_findings_count = study_findings_props['count'].agg('sum').to_frame().join(study_labels)
plt.figure(figsize=(10,6))
plt.gca().yaxis.grid(linestyle='--')
sns.violinplot(data=study_findings_count, x='study_label', y='count')
plt.title('Findings Count', fontdict={'fontsize':16})
plt.show()

The plot confirms this conclusion. Almost all of the clear covid-19 cases have 2 findings, and a couple of them have 3. The indeterminate cases also have at least one finding each, and only the non-covid cases sometimes have no findings, even when positive for pneumonia. But according to our table, no findings should mean `Negative for Pneumonia`, so we'll put these instances aside for now.

In [None]:
# Total number of findings per study
findings_per_study = study_findings_props['count'].sum()
no_findings_ids = findings_per_study[findings_per_study == 0].index
# Studies without any bounding box that are nevertheless not labeled negative
studies_to_remove = study_df[study_df.index.isin(no_findings_ids) &
                             (study_df['Negative for Pneumonia'] == 0)]
study_df['removed'] = False
study_df.loc[studies_to_remove.index, 'removed'] = True
print(f'Total number of removed rows: {len(studies_to_remove)}')
print('\n\nRemoved rows by label:\n')
print(study_df.loc[studies_to_remove.index].iloc[:, :-1].sum().to_string())

In [None]:
image_props = image_props.drop(image_props.loc[image_props.study_id.isin(studies_to_remove.index)].index)

study_findings_props = image_props.groupby(['study_id'])
study_findings_count = study_findings_props['count'].agg('sum').to_frame().join(study_labels)
plt.figure(figsize=(10,6))
# plt.gca().yaxis.grid(linestyle='--')
sns.violinplot(data=study_findings_count, x='study_label', y='count')
plt.title('Findings Count', fontdict={'fontsize':16})
plt.show()

Now all the positive cases have findings. Let's inspect other findings properties:

In [None]:
plt.figure(figsize=(20,10))

for i, prop in enumerate(['sum', 'min', 'max', 'mean', 'std'], start=1):
    plt.subplot(2, 3, i)
#     plt.gca().yaxis.grid(linestyle='--')
    sns.violinplot(data=image_props, x='study_label', y=prop, order=study_labels.unique())
    plt.xticks(rotation=10)
    title = f'Findings {prop.title()}'
    if prop.title() != 'Sum': title += ' Area'
    plt.title(title, fontdict={'fontsize':16})
plt.tight_layout()
plt.show()

One can see that clear covid cases strongly tend to have a larger findings area. The indeterminate cases also tend to have larger findings areas than the atypical ones, but this difference is much less pronounced. Let's inspect these features again, but now at the study level.

In [None]:
study_props = (image_props[['study_id']]
               .join(findings)
               .groupby('study_id').area.agg(['count','sum', 'min', 'max', 'mean', 'std'])
               .merge(study_labels, left_on='study_id', right_index=True))

plt.figure(figsize=(20,10))

for i, prop in enumerate(['sum', 'min', 'max', 'mean', 'std'], start=1):
    plt.subplot(2, 3, i)
#     plt.gca().yaxis.grid(linestyle='--')
    sns.violinplot(data=study_props, x='study_label', y=prop, order=study_labels.unique(), palette=['blue', 'green', 'green', 'red'])
    plt.xticks(rotation=10)
    title = f'Findings {prop.title()}'
    if prop.title() != 'Sum': title += ' Area'
    plt.title(title, fontdict={'fontsize':16})
plt.tight_layout()
plt.show()

It seems that there is no significant difference between the image level and the study level. But this leads us to two important questions: How many images belong to one study on average? And when a study has more than one image, on how many of them is the prognosis based?

In [None]:
image_df.groupby('study_id').label.count().unique()

In [None]:
print(image_df.groupby('study_id').label.agg(images_count='count').value_counts().to_string())
# g = sns.countplot(data=image_df.groupby('study_id').label.agg(images_count='count'), x='images_count')

In most cases there’s one image per study. But in the cases with multiple images, what is the difference between the images? Is the prognosis based on all of the images?  
It is straightforward to answer the second question from the data: we simply count the number of images labeled with findings in each study.

In [None]:
multiple_images = image_df.groupby('study_id').filter((lambda x: x.label.count() > 1))
multiple_images_with_findings = multiple_images[multiple_images.boxes.notna()]
print('Images with finding in Study Counts')
print(multiple_images_with_findings.groupby('study_id').boxes.agg(count='count').value_counts().to_string())

So it is established that there is never more than one image labeled with findings per study. Now it will be interesting to see how the images within one study differ. To do that, we have to turn to the third and most important part of our dataset: the DICOM files.

## DICOM files

The data is provided in [DICOM format](https://en.wikipedia.org/wiki/DICOM), which is the standard for medical imaging information and related data. This format packs each medical image together with related data, such as Patient Id, Name, Sex, etc. In our case the data is de-identified for privacy reasons, but the metadata in the DICOM files may still contain important information.
Let's pick a file and see what it looks like.

In [None]:
from pydicom import dcmread

def extract_id(full_id): return full_id[:full_id.index('_')]
def get_file_path(study_id, image_id, dataset='train'):
    study_id, image_id = extract_id(study_id), extract_id(image_id)
    return [*(path/dataset/study_id).glob(f'**/{image_id}.dcm')][0] 

#     return [*(path/dataset/row.study_id.str.extract("(^[^_]*)").values[0,0]).glob(f'**/{row.index.str.extract("(^[^_]*)").values[0,0]}.dcm')][0] 
sample = image_df.sample(random_state=14)
fpath = get_file_path(sample.study_id.values[0], sample.index.values[0])
ds = dcmread(fpath)

print(ds)
plt.imshow(ds.pixel_array, cmap=plt.cm.gray)
plt.colorbar()
plt.show()

Let's get a taste of our data:

In [None]:
# import gdcm
def show_dicoms(df, ncols=5, size=5, annotate=False):
    n = df.shape[0]
    nrows = int(np.ceil(n/ncols))

    fig = plt.figure(figsize=(size*ncols, size*nrows))
    for i, row in enumerate(df.itertuples()):
        fpath = get_file_path(row.study_id, row.Index)
        ds = dcmread(fpath)
        plt.subplot(nrows, ncols,i+1)
        plt.imshow(ds.pixel_array, cmap=plt.cm.gray)
        if annotate:
            if isinstance(row.boxes, str):
                for box in literal_eval(row.boxes):
                    rect = patches.Rectangle((box['x'], box['y']), 
                                             box['width'], box['height'],
                                            color='r', fill=False)
                    plt.gca().add_patch(rect)
        plt.xticks([]), plt.yticks([])
    plt.subplots_adjust()
    return fig
#     plt.show()

In [None]:
show_dicoms(image_df.sample(25, random_state=25))
plt.show()

Many of the images are cropped or rotated, and they have different illumination levels. The lungs appear in all of the images, but their location in the image is not constant, the margin sizes vary, and the images may contain other body parts: neck, stomach, hands, etc. To get a better understanding of the matter at hand, it will be helpful to see CXRs from the different classes with the annotated bounding boxes drawn on them.

In [None]:
gb = image_df.groupby('study_label')
for name, group in gb:
    if not name.startswith('Neg'): group = group.loc[group.boxes.notna()]
    fig = show_dicoms(group.sample(9, random_state=4), 3, 8, annotate=True)
    fig.suptitle(name, fontsize=18)
    plt.show()

The results are not clear to the non-expert eye. Although sometimes there is a kind of opacity in the boxes, in other cases there is no clear difference between the area inside and the area outside the box. This will affect the algorithm development and verification processes, since I will not be able to rely on my own knowledge and intuition.

Now let's take a look at the metadata provided by the DICOMs. The attributes that may interest us are the body part examined, sex, image size, pixel spacing (which relates pixels to the physical size of the image), modality (the scanning method), and image type. Let's extract these features into a pandas dataframe for later use.

In [None]:
from tqdm.notebook import tqdm
data = {}
def append_dcm_properties(row, props):
#     print(row)
    try:
        fpath = get_file_path(row.study_id, row.Index)
#         fpath = [*(path/'train'/row.study_id[:row.study_id.index("_")]).glob(f'**/{row.Index[:row.Index.index("_")]}.dcm')][0] 
        ds = dcmread(fpath)
        data[row.Index] = {prop: getattr(ds, prop.replace(' ', '')) for prop in props}
    except Exception:
        # row is a namedtuple from itertuples(), so the image id is row.Index
        print(f'**/{extract_id(row.Index)}.dcm')
        raise
        
props = ['Image Type', 'Modality','Body Part Examined', 'Photometric Interpretation',
                                       'Patient Sex', 'Imager Pixel Spacing', 'Rows', 'Columns']
# image_df.apply(lambda x: append_dcm_properties(x, props) , axis=1)
for row in tqdm(image_df.itertuples(), total=image_df.shape[0]):
    append_dcm_properties(row, props)

In [None]:
dm = pd.DataFrame.from_dict(data, orient='index')
dm.to_csv('DICOM_metadata.csv')
dm.head()
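
As a side note on how pixel spacing relates to physical size, here is a minimal sketch. It assumes the standard DICOM convention that `Imager Pixel Spacing` holds two values in millimetres, row spacing first and column spacing second, measured at the detector plane. The function name `physical_size_mm` and the sample values are hypothetical.

```python
def physical_size_mm(rows, columns, pixel_spacing):
    """Approximate detector-plane extent (height_mm, width_mm) of an image.

    pixel_spacing is assumed to be [row_spacing, column_spacing] in mm,
    possibly given as strings, as DICOM decimal values often are.
    """
    row_spacing, column_spacing = (float(v) for v in pixel_spacing)
    return rows * row_spacing, columns * column_spacing


# Hypothetical image: 3000 x 2500 pixels with 0.14 mm spacing in both directions.
height_mm, width_mm = physical_size_mm(rows=3000, columns=2500, pixel_spacing=["0.14", "0.14"])
print(f"{height_mm:.0f} mm x {width_mm:.0f} mm")
```

Because the spacing is measured at the detector, the result describes the imaged area on the detector rather than the true size of the anatomy.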

Let's inspect the value ranges of our new data:

In [None]:
for column in dm[:]:
    print(f'{column}:' )
    print(dm[column].unique())
    print('-'*100)

The first thing to inspect is the sex field.
How is our data split between the sexes? How is sex related to the COVID-19 prognoses?

In [None]:
sns.countplot(data=dm, x='Patient Sex');

In [None]:
plt.figure(figsize=(15,7))
sns.countplot(data=dm.join(image_df), x='study_label', hue='Patient Sex')

The plot agrees with the widely reported observation that, statistically, women are affected less severely by Covid-19. Although our data is quite balanced with respect to sex, women in this dataset suffer much less from pneumonia of any kind. Among the typical covid cases (clear/severe covid cases), there are only about two-thirds as many women as men.

In the `Body Part Examined` column, we have the unique values
>`'CHEST' 'PORT CHEST' 'TORAX' nan 'T?RAX' 'Pecho' 'THORAX' 'ABDOMEN'
 'SKULL' '2- TORAX' 'TÒRAX' 'PECHO'`
 
'CHEST', 'THORAX' (which exists in many versions), and 'PECHO' are all the same, whether you prefer English or Spanish. Let's try to see what else we can extract from the metadata. (Why do we have SKULLs here???)
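
One way to collapse these spelling variants into a single value is sketched below, using only the standard library. The function name `normalize_body_part` is hypothetical, and two assumptions are made: that `'T?RAX'` is an accented `TORAX` whose character was lost in encoding, and that `'PORT CHEST'` and `'2- TORAX'` should also count as chest images.

```python
import unicodedata

CHEST_WORDS = ("CHEST", "THORAX", "TORAX", "PECHO")


def normalize_body_part(value):
    """Map chest spelling variants to 'CHEST'; return other values upper-cased."""
    if not isinstance(value, str):
        return value  # missing values (NaN) pass through untouched
    # Upper-case and strip accents, so that 'TÒRAX' becomes 'TORAX'.
    decomposed = unicodedata.normalize("NFKD", value.upper())
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: 'T?RAX' is an accented 'TORAX' with a character lost in encoding.
    plain = plain.replace("T?RAX", "TORAX")
    return "CHEST" if any(word in plain for word in CHEST_WORDS) else plain


for raw in ["CHEST", "PORT CHEST", "T?RAX", "Pecho", "2- TORAX", "TÒRAX", "ABDOMEN", "SKULL"]:
    print(f"{raw!r:14} -> {normalize_body_part(raw)}")
```

Applied to the dataframe, this could be used as `dm['Body Part Examined'].map(normalize_body_part)`, leaving only the genuinely different values (`ABDOMEN`, `SKULL`, and missing entries) to inspect.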

In [None]:
body_parts = dm.loc[~dm['Body Part Examined'].isin(['CHEST', 'TORAX','THORAX', 'T?RAX',
                                                     'TÒRAX','PECHO', 'Pecho'])]
gb = body_parts.join(image_df).groupby('Body Part Examined')
for name, group in gb:
    fig = show_dicoms(group.sample(6, random_state=4), 3, 8)
    fig.suptitle(name, fontsize=18)
    plt.show()

The `ABDOMEN` images seem to contain the lower part of the body too. Besides that, there does not seem to be a significant difference between the image groups (in particular, we have no actual skulls here).
