## Dataset information

In this competition, we are **identifying and localizing** COVID-19 abnormalities on chest radiographs.
This is an object detection and classification problem: for each test image, you predict
a bounding box and class for every finding. The training dataset comprises 6,334 chest scans.
All images were labeled by a panel of experienced radiologists for the presence of opacities as well as overall appearance.

Note that all images are stored in paths with the form **study/series/image**. 
The study ID here relates directly to the study-level predictions, and the image ID is the ID used for image-level predictions.
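As a minimal sketch of how the two IDs can be recovered from a file path, assuming the `study/series/image` layout described above and a `.dcm` file extension. The helper name `split_dicom_path` and the example path are made up for illustration:

```python
from pathlib import Path
from typing import NamedTuple


class DicomIds(NamedTuple):
    study_id: str
    series_id: str
    image_id: str


def split_dicom_path(path: Path) -> DicomIds:
    """Split a path ending in study/series/image.dcm into its three IDs."""
    # The last three path components are study, series and image file name.
    study_id, series_id, file_name = path.parts[-3:]
    # Path.stem drops the ".dcm" extension from the image file name.
    return DicomIds(study_id, series_id, Path(file_name).stem)


# Hypothetical path, not taken from the dataset.
ids = split_dicom_path(Path("train/study_a/series_b/image_c.dcm"))
print(ids.study_id, ids.image_id)
```

The `study_id` is what the study-level predictions refer to, and the `image_id` is what the image-level predictions refer to.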

The hidden test dataset is roughly the same size as the training dataset.
For each test study, you should assign one of the following labels:

* Negative for Pneumonia 
* Typical Appearance
* Indeterminate Appearance 
* Atypical Appearance

Prediction examples: 

* "none 1 0 0 1 1" - prediction for no findings ("none" is the class ID for no finding, and this provides a one-pixel bounding box with a confidence of 1.0)
* "atypical 1 0 0 1 1" - prediction for Atypical Appearance
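The examples above suggest that a prediction is a space-separated string: a class ID, a confidence, then four box coordinates. A minimal sketch of building such a string follows. The helper name `format_prediction` is hypothetical, and the coordinate order (`xmin ymin xmax ymax`) is an assumption that the text above does not spell out:

```python
from typing import Sequence


def _fmt(value: float) -> str:
    # Print whole numbers without a trailing ".0" so 1.0 becomes "1".
    return str(int(value)) if float(value).is_integer() else str(value)


def format_prediction(label: str, confidence: float, box: Sequence[float]) -> str:
    """Build one prediction as 'label confidence c1 c2 c3 c4'."""
    if len(box) != 4:
        raise ValueError("box must hold exactly four coordinates")
    # Assumed coordinate order: xmin, ymin, xmax, ymax.
    fields = [label, _fmt(confidence), *(_fmt(c) for c in box)]
    return " ".join(fields)


# The two examples from the list above, each with a one-pixel box.
no_finding = format_prediction("none", 1.0, (0, 0, 1, 1))
atypical = format_prediction("atypical", 1.0, (0, 0, 1, 1))
print(no_finding)
print(atypical)
```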

**Note:** The images are in [DICOM](https://en.wikipedia.org/wiki/DICOM) format, which means they contain additional data that might be useful for visualizing and classifying.

In [None]:
# installing additional libs
! pip install pandas-profiling[notebook]

In [None]:
import numpy as np
import pandas as pd
import pydicom
import matplotlib.pyplot as plt
from pathlib import Path
from pandas_profiling import ProfileReport

%matplotlib inline

root = Path("../input/siim-covid19-detection")

## train_image_level.csv 

The training image-level metadata, with one row per image, including both the correct labels and any bounding boxes in dictionary format. Some images in both the test and training sets have multiple bounding boxes.

Columns:

* **id** - unique image identifier
* **boxes** - bounding boxes in easily-readable dictionary format
* **label** - the correct prediction label for the provided bounding boxes
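A minimal sketch of turning one **boxes** cell into corner-format tuples. It assumes the cell is a string holding a list of dictionaries with the keys `x`, `y`, `width` and `height`, and that images without boxes have an empty or NaN cell. None of this is stated in the column description above, so check it against the actual CSV. The helper name `parse_boxes` is hypothetical:

```python
import ast
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def parse_boxes(raw: object) -> List[Box]:
    """Convert one `boxes` cell into a list of corner-format boxes."""
    # Cells without boxes are assumed to arrive as NaN (a float) or "".
    if not isinstance(raw, str) or not raw.strip():
        return []
    # literal_eval parses Python literals without executing arbitrary code.
    records = ast.literal_eval(raw)
    # Assumed keys: x, y (top-left corner), width, height.
    return [
        (b["x"], b["y"], b["x"] + b["width"], b["y"] + b["height"])
        for b in records
    ]


# Made-up cell value for illustration.
example = "[{'x': 10.0, 'y': 20.0, 'width': 30.0, 'height': 40.0}]"
print(parse_boxes(example))
```

Once the CSV is loaded, the same function could be applied to the whole column with `train_image_level["boxes"].map(parse_boxes)`.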

In [None]:
train_image_level = pd.read_csv(root / "train_image_level.csv")
train_image_level_profile = ProfileReport(train_image_level, title="train_image_level_report")
train_image_level_profile

## train_study_level.csv

The training study-level metadata, with one row per study, including the correct labels.

Columns:

* **id** - unique study identifier
* **Negative for Pneumonia** - 1 if the study is negative for pneumonia, 0 otherwise
* **Typical Appearance** - 1 if the study has this appearance, 0 otherwise
* **Indeterminate Appearance**  - 1 if the study has this appearance, 0 otherwise
* **Atypical Appearance**  - 1 if the study has this appearance, 0 otherwise
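The four indicator columns can be collapsed into a single label per study. The sketch below assumes exactly one indicator is set to 1 in each row, which the column descriptions suggest but do not guarantee. The helper name `study_label` and the example row are made up for illustration:

```python
from typing import Mapping

STUDY_LABELS = (
    "Negative for Pneumonia",
    "Typical Appearance",
    "Indeterminate Appearance",
    "Atypical Appearance",
)


def study_label(row: Mapping[str, object]) -> str:
    """Return the name of the single label column set to 1."""
    active = [name for name in STUDY_LABELS if row[name] == 1]
    # Assumption: exactly one label is set per study.
    if len(active) != 1:
        raise ValueError(f"expected exactly one active label, got {active}")
    return active[0]


# Made-up row for illustration.
row = {
    "id": "example_study",
    "Negative for Pneumonia": 0,
    "Typical Appearance": 1,
    "Indeterminate Appearance": 0,
    "Atypical Appearance": 0,
}
print(study_label(row))
```

Because a pandas row supports the same `row[name]` lookup, the function could also be used as `train_study_level.apply(study_label, axis=1)` once the DataFrame is loaded.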

In [None]:
train_study_level = pd.read_csv(root / "train_study_level.csv")
train_study_level_profile = ProfileReport(train_study_level, title="train_study_level_report")
train_study_level_profile

## DICOM metadata

In [None]:
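# Take the first row; its fourth column is the study ID, used as a directory name.
_id, boxes, label, study = train_image_level.iloc[0]
study_images_filenames = list((root / "train" / study).rglob("*.dcm"))
# dcmread is the current pydicom reader; read_file is deprecated.
dicom = pydicom.dcmread(study_images_filenames[0])
dicom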

In [None]:
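image_np = dicom.pixel_array
# Min-max scale to [0, 1]: divide by the range, not the max, so the top value maps to 1.
image_norm = (image_np - image_np.min()) / (image_np.max() - image_np.min())
image_norm = (image_norm * 255).astype(np.uint8)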

plt.imshow(image_norm, cmap="gray")
plt.axis("off")
plt.show()