**Analysis of the training labels/boxes for the [SIIM COVID-19 Detection](https://www.kaggle.com/c/siim-covid19-detection/overview) competition**.

**Conclusions**:
- There are no duplicates.
- 32% of images don't have a box.
- There can be multiple images per study!
- There are two recurring areas: one for each lung.

# CONFS

In [None]:
IMAGES = '/kaggle/input/siim-covid19-detection/train_image_level.csv'
STUDIES = '/kaggle/input/siim-covid19-detection/train_study_level.csv'

# IMPORTS

In [None]:
import pandas as pd
import plotly.express as px

# DATASETS

In [None]:
studies = pd.read_csv(STUDIES, index_col='id')
assert studies.index.is_unique  # 'id' is the index here, so check it directly
studies = studies.idxmax(axis=1).to_frame(name='case')
studies.index = studies.index.str.replace('_study', '')
studies.head()

In [None]:
labels = pd.read_csv(IMAGES, index_col='id')
assert labels.index.is_unique  # 'id' is the index here, so check it directly
# every image must point to a known study, and every study must have an image
assert set(labels['StudyInstanceUID']) == set(studies.index)
labels = labels.merge(studies, left_on='StudyInstanceUID', right_index=True)
labels.head()

In [None]:
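import ast  # literal_eval parses the box strings without executing arbitrary code

boxes = []
for i, row in labels.iterrows():
    value = row['boxes']
    if pd.isna(value):  # image without any box
        print('x', end='')
        continue
    values = ast.literal_eval(value)  # list of dicts with x, y, width, height
    print('.', end='')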
    for val in values:
        box = {
            'image': i,
            'study': row['StudyInstanceUID'],
            'case': row['case'],
            'xmin': val['x'],
            'ymin': val['y'],
            'xmax': val['x'] + val['width'],
            'ymax': val['y'] + val['height'],
            'width': val['width'],
            'height': val['height'],
        }
        boxes.append(box)
boxes = pd.DataFrame(boxes)
boxes.head()
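
The row loop above can also be written without `iterrows`. The following is a minimal sketch, assuming each non-missing `boxes` cell is a Python-literal list of dicts with `x`, `y`, `width` and `height` keys, as the loop reads them. The helper name `explode_boxes` and the toy rows are hypothetical and only serve as illustration.

```python
import ast

import pandas as pd


def explode_boxes(labels: pd.DataFrame) -> pd.DataFrame:
    """Return one row per box, indexed by image id (hypothetical helper)."""
    # Parse only the non-missing cells; literal_eval does not execute code.
    parsed = labels['boxes'].dropna().map(ast.literal_eval)
    # One row per box; images with several boxes repeat their index value.
    exploded = parsed.explode().dropna()
    flat = pd.DataFrame(exploded.tolist(), index=exploded.index)
    flat['xmax'] = flat['x'] + flat['width']
    flat['ymax'] = flat['y'] + flat['height']
    return flat.rename(columns={'x': 'xmin', 'y': 'ymin'})


# Toy rows (made up, not taken from the competition data).
toy_labels = pd.DataFrame(
    {
        'boxes': [
            "[{'x': 10, 'y': 20, 'width': 30, 'height': 40},"
            " {'x': 100, 'y': 200, 'width': 50, 'height': 60}]",
            None,
        ]
    },
    index=['img_a', 'img_b'],
)
toy_boxes = explode_boxes(toy_labels)
print(toy_boxes)
```

The study and case columns are left out here; they could be joined back on the image index if needed.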

# ANALYSIS

## Labels

In [None]:
labels.info()

- Most studies contain a single image.
- The maximum number of images per study is 9.

In [None]:
images_per_study = labels['StudyInstanceUID'].value_counts()
images_per_study

In [None]:
px.histogram(images_per_study)
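
The two statements about images per study can be checked numerically instead of being read off the histogram. This is a minimal sketch, assuming a frame with a `StudyInstanceUID` column like `labels`. The helper name `summarize_images_per_study` and the toy rows are hypothetical.

```python
import pandas as pd


def summarize_images_per_study(labels: pd.DataFrame) -> dict:
    """Share of single-image studies and the largest number of images in a study."""
    counts = labels['StudyInstanceUID'].value_counts()
    return {
        'single_image_share': float((counts == 1).mean()),
        'max_images': int(counts.max()),
    }


# Toy rows (made up): study 'a' has 3 images, 'b' and 'c' have 1 each.
toy_labels = pd.DataFrame({'StudyInstanceUID': ['a', 'a', 'a', 'b', 'c']})
summary = summarize_images_per_study(toy_labels)
print(summary)
```

Calling the helper on the real `labels` frame would give the figures behind "most studies" and "maximum".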

- 32% of images don't have boxes.

In [None]:
images_no_boxes = labels[labels['label']=='none 1 0 0 1 1']
len(images_no_boxes) / len(labels) * 100
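
The percentage above relies on the `none 1 0 0 1 1` label string. A second way to count is to look for missing `boxes` cells. The sketch below compares the two on toy rows. It assumes, without having verified it on the real data, that both criteria mark the same images, and it reports whether they agree. The helper name `no_box_report` is hypothetical.

```python
import pandas as pd

NO_BOX_LABEL = 'none 1 0 0 1 1'


def no_box_report(labels: pd.DataFrame) -> dict:
    """Compare the two ways of spotting images without boxes."""
    by_nan = labels['boxes'].isna()
    by_label = labels['label'] == NO_BOX_LABEL
    return {
        'percent_by_nan': float(by_nan.mean() * 100),
        'percent_by_label': float(by_label.mean() * 100),
        'criteria_agree': bool((by_nan == by_label).all()),
    }


# Toy rows (made up, not taken from the competition data).
toy_labels = pd.DataFrame({
    'boxes': [
        None,
        "[{'x': 0, 'y': 0, 'width': 10, 'height': 20}]",
        None,
        "[{'x': 5, 'y': 5, 'width': 15, 'height': 25}]",
    ],
    'label': [
        NO_BOX_LABEL,
        'opacity 1 0 0 10 20',
        NO_BOX_LABEL,
        'opacity 1 5 5 20 30',
    ],
})
report = no_box_report(toy_labels)
print(report)
```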

## Boxes

In [None]:
boxes.info()

- We see two modes, one for each lung.
- The distribution of cases is about the same.
- There seem to be more boxes on the left lung.

In [None]:
px.histogram(boxes, x='xmin', color='case')

In [None]:
px.histogram(boxes, x='ymin', color='case')

In [None]:
px.histogram(boxes, x='xmax', color='case')

In [None]:
px.histogram(boxes, x='ymax', color='case')

In [None]:
px.histogram(boxes, x='width', color='case')

In [None]:
px.histogram(boxes, x='height', color='case')
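
The impression that one side carries more boxes can be turned into a count. The sketch below is rough: it splits boxes by their horizontal center against a `midline` value in pixels. That value is an assumption the caller must supply, since image widths are not among the columns used in this notebook and images may differ in size. The helper name `count_by_side` and the toy rows are hypothetical.

```python
import pandas as pd


def count_by_side(boxes: pd.DataFrame, midline: float) -> pd.Series:
    """Count boxes whose horizontal center falls left or right of `midline`."""
    center_x = (boxes['xmin'] + boxes['xmax']) / 2
    side = center_x.lt(midline).map({True: 'image left', False: 'image right'})
    return side.value_counts()


# Toy rows (made up); centers are 300, 450 and 1750 pixels.
toy_boxes = pd.DataFrame({
    'xmin': [100, 200, 1500],
    'xmax': [500, 700, 2000],
})
side_counts = count_by_side(toy_boxes, midline=1000)  # hypothetical midline
print(side_counts)
```

A per-image midline taken from the actual image width would be more reliable than a single global value.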

- There is a roughly linear relation between the width and the height.
- The height is often greater than the width (explained by the shape of human lungs).

In [None]:
px.scatter(boxes, x='width', y='height', color='case')
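
The two observations about width and height can be backed by numbers: a Pearson correlation for the linear relation, and the share of boxes that are taller than wide. This is a minimal sketch, assuming a frame with `width` and `height` columns like `boxes`. The helper name `width_height_summary` and the toy rows are hypothetical.

```python
import pandas as pd


def width_height_summary(boxes: pd.DataFrame) -> dict:
    """Pearson correlation of width and height, and share of boxes taller than wide."""
    return {
        'pearson_r': float(boxes['width'].corr(boxes['height'])),
        'taller_than_wide_share': float((boxes['height'] > boxes['width']).mean()),
    }


# Toy rows (made up, not taken from the competition data).
toy_boxes = pd.DataFrame({
    'width': [100, 200, 300, 400],
    'height': [150, 310, 440, 390],
})
wh_summary = width_height_summary(toy_boxes)
print(wh_summary)
```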