![Image_compition](https://storage.googleapis.com/kaggle-competitions/kaggle/24800/logos/header.png?t=2020-12-17-19-26-15)

# VinBigData Chest X-ray Abnormalities Detection
##### **Automatically localize and classify thoracic abnormalities from chest radiographs**

In this competition, the goal is to automatically localize and classify 14 types of thoracic abnormalities from chest radiographs. If successful, we will help build what could be a valuable second opinion for radiologists. An automated system that could accurately identify and localize findings on chest radiographs would relieve the stress of busy doctors while also providing patients with a more accurate diagnosis. 

Before doing that, let's first learn more about the competition, the data type, the evaluation and so on.

 ----- WORK IN PROGRESS -----

**My goal is to publish several notebooks, each representing a step in this competition. Please let me know if you have any feedback.**

 

# Table of Contents
1. [Introduction 📚](#introduction)
    1. [What is the competition about and how did we get the data ?](#how)
    2. [Chest X-rays](#chest)
2. [The Data 💾](#data)
    1. [Short presentation of the data](#presentation)
    2. [Reading the data](#reading)
    3. [The findings](#findings)


# Introduction 📚 <a class="anchor" id="introduction"></a>

## What is the competition about and how did we get the data ? <a class="anchor" id="how"></a>

In an effort to provide a large dataset of chest x-ray (CXR) images with high-quality labels for the research community, VinBigData have built the VinDr-CXR dataset from more than 100,000 raw images in DICOM format that were retrospectively collected from the Hospital 108 and the Hanoi Medical University Hospital, two of the largest hospitals in Vietnam.

Out of this raw data, they released 18,000 postero-anterior (PA) view CXR scans that come with both the localization of critical findings and the classification of common thoracic diseases. These images were annotated by a group of 17 radiologists with at least 8 years of experience for the presence of 22 critical findings (local labels) and 6 diagnoses (global labels); each finding is localized with a bounding box, while the global labels reflect the diagnostic impression of the radiologist.

The released dataset is divided into a training set of 15,000 scans and a test set of 3,000 scans. Each scan in the training set was independently labeled by 3 radiologists, while each scan in the test set was labeled by the consensus of 5 radiologists.

Why is that super cool? Because, as the authors say in their recent paper ["VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations"](https://arxiv.org/pdf/2012.15029.pdf), most of the existing chest X-ray datasets include labels from a list of findings without specifying their locations on the radiographs. This limits the development of machine learning algorithms for the detection and localization of chest abnormalities.

For example, the MIMIC-CXR dataset, released in 2019, contains more than 377,000 CXR images, but these are labeled at the image level (no bounding box, just an image label). Like most other existing CXR datasets, it depends on automated rule-based labelers that either use keyword matching or an NLP model to extract disease labels from free-text radiology reports. These tools can produce labels on a large scale but, at the same time, introduce a high rate of inconsistency, uncertainty, and errors, as mentioned in the paper.

So with this competition, we benefit from the knowledge of several experienced radiologists to hopefully help build an automated system that could accurately identify and localize findings on chest radiographs.

So it is pretty clear by now that we are going to work with X-ray images and, to be more precise, chest X-rays. Let's learn more about that part.

## Chest X-rays <a class="anchor" id="chest"></a>

### **A short reminder of the definition of X-rays**

X-rays are a type of radiation called electromagnetic waves. X-ray imaging creates pictures of the inside of your body. The images show the parts of your body in different shades of black and white. This is because different tissues absorb different amounts of radiation. Calcium in bones absorbs x-rays the most, so bones look white. Fat and other soft tissues absorb less and look gray. Air absorbs the least, so for example lungs which are full of air look black.

The most familiar use of x-rays is checking for fractures (broken bones), but x-rays are also used in other ways. For example, chest x-rays can spot pneumonia and mammograms use x-rays to look for breast cancer.

A chest x-ray is an x-ray of the chest, lungs, heart, large arteries, ribs, and diaphragm. There are several CXR views. In this competition we will only have to deal with postero-anterior (PA) views. That means that the patient stands in front of a radiographic plate, hands on hips, with the X-ray source 2 m behind.

### **A normal CXR**

It is important to know the normal chest radiograph and common landmarks so that we can recognize what is abnormal.  

![Chest Radiograph - Normal](https://www.ebmconsult.com/content/images/Xrays/ChestXrayAPNmlLabeled.png)

Note that the left side of the subject is on the right side of the image. In most CXRs, a small R or L marker indicates the side.

Here are some common landmarks for the normal CXR:

- The trachea should sit midline and be in between the right and left clavicular heads.
- The carina is the point or level at which the trachea divides into the right and left main bronchi.  This is usually midline with the spinous process being behind it. 
- The aortic knob should be visualized in the normal chest radiograph around the level of T3 to T4 or just lateral to the carina. 
- Usually the heart is positioned with one-third of its diameter to the right, and two-thirds to the left of the thoracic vertebrae spinous processes.
- The right hemidiaphragm normally sits slightly higher than the left due to the presence of the liver under the diaphragm.


Note that the T4 (for example) vertebra is the fourth thoracic vertebra that makes up the middle segment of spinal column of the human body. The thoracic spinal vertebrae consist of 12 total vertebrae and are located between the cervical vertebrae (which begin at the base of the skull) and the lumbar spinal vertebrae.


After this brief introduction about the construction of the dataset, the reminder about X-rays and the presentation of a normal CXR, we can move on to see some abnormal CXRs and, by extension, talk about our data.

# The Data 💾 <a class="anchor" id="data"></a>

## Short presentation of the data <a class="anchor" id="presentation"></a>

As we said before, the dataset comprises 18,000 postero-anterior (PA) CXR scans in DICOM format: 15,000 for training and 3,000 for testing.

In addition, we have 2 files :


    - train.csv - the train set metadata, with one row for each object, including a class and a bounding box. Some images in both test and train have multiple objects.
    - sample_submission.csv - a sample submission file in the correct format


The train.csv file contains the following columns:

    - image_id - unique image identifier
    - class_name - the name of the class of detected object (or "No finding")
    - class_id - the ID of the class of detected object
    - rad_id - the ID of the radiologist that made the observation
    - x_min - minimum X coordinate of the object's bounding box
    - y_min - minimum Y coordinate of the object's bounding box
    - x_max - maximum X coordinate of the object's bounding box
    - y_max - maximum Y coordinate of the object's bounding box
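
To make the column layout concrete, here is a small sketch on a toy table that mimics train.csv. The image ids, radiologist ids and coordinates are made up, and the "No finding" row is assumed to carry empty coordinates; nothing here reads the real file.

```python
import pandas as pd

# Toy rows mimicking train.csv; all values are made up for illustration
toy = pd.DataFrame(
    {
        "image_id": ["img_a", "img_a", "img_a", "img_b"],
        "class_name": ["Cardiomegaly", "Cardiomegaly", "Aortic enlargement", "No finding"],
        "class_id": [3, 3, 0, 14],
        "rad_id": ["R1", "R2", "R2", "R1"],
        "x_min": [100.0, 110.0, 400.0, float("nan")],
        "y_min": [200.0, 190.0, 150.0, float("nan")],
        "x_max": [500.0, 520.0, 600.0, float("nan")],
        "y_max": [450.0, 460.0, 300.0, float("nan")],
    }
)

# Number of annotation rows and of distinct radiologists per image
summary = toy.groupby("image_id").agg(
    n_rows=("class_id", "size"),
    n_radiologists=("rad_id", "nunique"),
)
print(summary)

# Bounding-box area in pixels; stays NaN where there is no box
toy["box_area"] = (toy["x_max"] - toy["x_min"]) * (toy["y_max"] - toy["y_min"])
print(toy[["image_id", "class_name", "box_area"]])
```

The same `groupby` could be applied to the real `train_csv` to check how many radiologists annotated each image.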


For each test image, we will be predicting a bounding box and class for all findings. If we predict that there are no findings, we should create a prediction of "14 1 0 0 1 1".

- The first number is the class ID, which goes from 0 to 14.
- The second number is the confidence of our detection. This value is between 0 and 1; the closer to 1, the more confident we are.
- The rest is x_min, y_min, x_max and y_max: the bounding box coordinates.
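
As a minimal sketch of that format, the helper below builds such a string. The function name `format_prediction_string` and the tuple layout are assumptions made for illustration and are not part of the competition tooling; several findings for one image are assumed to be joined with spaces in the same string. The box coordinates in the example are made up.

```python
def format_prediction_string(predictions):
    """Join (class_id, confidence, x_min, y_min, x_max, y_max) tuples into one string.

    An empty list is treated as "no finding" and encoded as class 14 with
    confidence 1 and a dummy one-pixel box, as described above.
    """
    if not predictions:
        return "14 1 0 0 1 1"
    parts = []
    for class_id, confidence, x_min, y_min, x_max, y_max in predictions:
        parts.append(
            f"{int(class_id)} {confidence:g} "
            f"{int(x_min)} {int(y_min)} {int(x_max)} {int(y_max)}"
        )
    return " ".join(parts)


print(format_prediction_string([]))
print(format_prediction_string([(3, 0.9, 691, 1375, 1653, 1831)]))
```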


Here is the distribution of findings and pathologies on the training set of the original VinDr-CXR dataset.

![Distribution of findings and pathologies on the training set of the VinDr-CXR Dataset.](https://vindr.ai/wp-content/uploads/2020/12/pasted-image-0-2-e1607659858378.png)

As you can see there are 28 classes in the original dataset. For the competition we have to localize and classify the presence of only 14 critical radiographic findings as listed below:

0. Aortic enlargement
1. Atelectasis
2. Calcification
3. Cardiomegaly
4. Consolidation
5. ILD
6. Infiltration
7. Lung Opacity
8. Nodule/Mass
9. Other lesion
10. Pleural effusion
11. Pleural thickening
12. Pneumothorax
13. Pulmonary fibrosis
14. No finding
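
For later use in code, the list above can be written as a lookup table. This is only a convenience sketch: the names `CLASS_NAMES` and `NAME_TO_ID` are made up here, and the last entry is spelled "No finding" on the assumption that it should match the `class_name` column of train.csv.

```python
# Class ids as listed above
CLASS_NAMES = {
    0: "Aortic enlargement",
    1: "Atelectasis",
    2: "Calcification",
    3: "Cardiomegaly",
    4: "Consolidation",
    5: "ILD",
    6: "Infiltration",
    7: "Lung Opacity",
    8: "Nodule/Mass",
    9: "Other lesion",
    10: "Pleural effusion",
    11: "Pleural thickening",
    12: "Pneumothorax",
    13: "Pulmonary fibrosis",
    14: "No finding",
}

# Reverse mapping, from class name to class id
NAME_TO_ID = {name: class_id for class_id, name in CLASS_NAMES.items()}

print(CLASS_NAMES[3], NAME_TO_ID["Pneumothorax"])
```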

We will go through every one of them in one section below.


The images are in DICOM format, which means they contain additional data that might be useful for visualizing and classifying. Let's begin dealing with the DICOM files.

## Reading the data <a class="anchor" id="reading"></a>

First, let's import everything we need and define our paths. Then we can start to take a look at our data.

In [None]:
# All the import

import os
import re
import pydicom
import matplotlib
import numpy as np
import pandas as pd
from tqdm import tqdm
from math import ceil
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from pydicom.pixel_data_handlers.util import apply_voi_lut


In [None]:
train_csv = pd.read_csv("../input/vinbigdata-chest-xray-abnormalities-detection/train.csv")
train_path = "../input/vinbigdata-chest-xray-abnormalities-detection/train"
train_dicom_path = [f"{train_path}/{i}" for i in os.listdir(train_path)] 

In [None]:
train_csv.head()

In [None]:
print(f"The train shape is {train_csv.shape}")
print(f"There are {len(train_csv['image_id'].unique())} unique images in this CSV and {len(train_dicom_path)} DICOM images")

Let's see what information we have for one CXR:

In [None]:
train_csv[train_csv['image_id'] == '9a5094b2563a1ef3ff50dc5c7ff71345']

So one radiologist can report several findings. If two or more radiologists report the same finding, the positions of their bounding boxes may differ (see the rows with class_id = 3).
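
One common way to quantify how much two boxes differ is their intersection over union (IoU): 1 means identical boxes, 0 means no overlap. The sketch below is illustrative only; the helper name `box_iou` and the two boxes are made up and do not come from the dataset.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Made-up boxes standing in for two radiologists' annotations of one finding
rad_1 = (100, 100, 300, 300)
rad_2 = (150, 150, 350, 350)
print(f"IoU = {box_iou(rad_1, rad_2):.3f}")
```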

In [None]:
# Let's read the dicom of the same image above
image_id = '9a5094b2563a1ef3ff50dc5c7ff71345'
data_file = pydicom.dcmread(f"{train_path}/{image_id}.dicom")
print(data_file)

For now let's visualize some of the CXR we have.

In [None]:
# from here https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way

def read_xray(path, voi_lut = True, fix_monochrome = True):
    # dcmread is the current name; read_file is a deprecated alias
    dicom = pydicom.dcmread(path)
    
    # VOI LUT (if available by DICOM device) is used to transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
               
    # depending on this value, X-ray may look inverted - fix that:
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
        
    data = data - np.min(data)
    data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
        
    return data

def example(classe, number_ex, path=train_path):
    # First `number_ex` annotation rows of the requested class
    selection = train_csv[train_csv['class_id'] == classe].head(number_ex)
    rows = int(ceil(number_ex / 4))
    gs = gridspec.GridSpec(rows, 4)
    fig = plt.figure(figsize=(30, 20))
    fig.suptitle(f"The class {selection['class_name'].values[0]}", fontsize=30)
    for i, (_, row) in enumerate(selection.iterrows()):
        ax = fig.add_subplot(gs[i])
        # Title with the id of the image actually displayed
        ax.set_title(row['image_id'])
        ax.axis("off")
        data = read_xray(f"{path}/{row['image_id']}.dicom")
        ax.imshow(data, cmap='gray')
        if row['class_name'] != 'No finding':
            # Draw the annotated bounding box in red
            p = matplotlib.patches.Rectangle((row['x_min'], row['y_min']),
                                             row['x_max'] - row['x_min'],
                                             row['y_max'] - row['y_min'],
                                             ec='r', fc='none', lw=2.)
            ax.add_patch(p)
    fig.tight_layout()
    plt.subplots_adjust(hspace=0.1, wspace=0.1, top=0.95)


In [None]:
example(1,8)

## The findings <a class="anchor" id="findings"></a>

OK, that's cool: we can now visualize our data and the abnormalities in each CXR. It's now time to understand each of the abnormalities.

### Aortic enlargement


This is known as a sign of an aortic aneurysm. It's literally an enlargement of the aorta (the body's largest artery), which can occur in the chest (thoracic aneurysm) or abdomen.

Here is one figure that can help understand it.

![types of aneurysm](https://www.uofmhealth.org/sites/default/files/styles/large/public/Types%20of%20Aneurysms.jpg?itok=2POtyibt)

The above illustration shows three types of aneurysm and the two aneurysm shapes. The three types of aneurysm are Ascending Thoracic aortic Aneurysm (TAA), Descending Thoracic Aortic Aneurysm (TAA), and Abdominal Aortic Aneurysm (AAA). The Fusiform Aneurysm and Saccular Aneurysm show the two types of aneurysm shapes.


Of course, we are working on chest data so we won't see any Abdominal Aortic Aneurysm but it's still important to know that kind of stuff.

Risk factors for it may include:

    Age
    Gender
    Smoking
    High blood pressure
    ...

Note that we have the age and gender information in our metadata! That can help us later.
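
A minimal sketch of how those fields could be pulled out. It assumes the standard DICOM keywords `PatientSex` and `PatientAge` are present in the files, which is not guaranteed since anonymized files often drop them. The helper name `patient_info` is hypothetical. It only relies on a `.get(key, default)` method, which pydicom datasets provide, so a plain dict stands in for a dataset here.

```python
def patient_info(dataset, keys=("PatientSex", "PatientAge")):
    """Return the requested fields, with None for fields that are missing."""
    return {key: dataset.get(key, None) for key in keys}


# A plain dict stands in for a pydicom dataset; with a real file one would
# pass the object returned by pydicom.dcmread(...) instead.
fake_dataset = {"PatientSex": "F"}
print(patient_info(fake_dataset))
```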


Here are some real examples.

In [None]:
example(0, 8)

### Atelectasis

Atelectasis is a medical term used to describe the complete or partial collapse of a lung. It is sometimes referred to as a "collapsed lung". It occurs when tiny air sacs in the lungs known as alveoli deflate.

![](http://www.msrblog.com/wp-content1/uploads/2018/10/Causes-of-Atelectasis.jpg)

As we can see, this abnormality is recognizable because there is no air in part or all of the lung. Air appears black, so I think atelectasis should show up as a whiter area in the CXR:

In [None]:
example(1, 8)

### Calcification 


Calcification is a deposit of calcium in the tissues, causing the tissue to harden. We can detect it because calcium absorbs x-rays the most, so calcifications look white in the X-ray. Because these are deposits, the bounding box can sometimes be small, so it may be more difficult for us non-radiologists to detect this abnormality.

In [None]:
example(2, 8)

### Cardiomegaly

The term "cardiomegaly" refers to an enlarged heart that we can see in the CXR. An enlarged heart may be the result of a short-term stress on the body, such as pregnancy, or a medical condition, such as the weakening of the heart muscle, coronary artery disease, heart valve problems or abnormal heart rhythms.

One way to notice it is a cardiothoracic ratio that is too large.
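
The cardiothoracic ratio is the widest horizontal extent of the heart divided by the widest internal width of the thorax. As a rough sketch, and assuming we had both a heart box and a thorax box in pixel coordinates (the thorax box is hypothetical, the competition data does not provide one), it could be computed as below. The 0.5 cut-off is the commonly quoted rule of thumb for PA views, not a value taken from the dataset paper.

```python
def cardiothoracic_ratio(heart_box, thorax_box):
    """Heart width divided by thorax width; boxes are (x_min, y_min, x_max, y_max)."""
    heart_width = heart_box[2] - heart_box[0]
    thorax_width = thorax_box[2] - thorax_box[0]
    if thorax_width <= 0:
        raise ValueError("thorax box must have a positive width")
    return heart_width / thorax_width


# Made-up pixel coordinates, for illustration only
heart = (700, 1300, 1700, 1900)
thorax = (300, 400, 2100, 2000)
ratio = cardiothoracic_ratio(heart, thorax)
print(f"CTR = {ratio:.2f}, above 0.5: {ratio > 0.5}")
```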

In [None]:
example(3, 8)

### Consolidation

This abnormality occurs when the air that usually fills the small airways in the lungs is replaced with something else. Depending on the cause, the air may be replaced with:

    a fluid, such as pus, blood, or water
    a solid, such as stomach contents or cells

On X-ray, we will see that as a whiter zone instead of the normal black zone of the lungs.

In [None]:
example(4, 8)

### ILD


Interstitial lung disease includes more than 200 different conditions that cause inflammation and scarring around the balloon-like air sacs in the lungs, called the alveoli. Oxygen travels through the alveoli into your bloodstream. When they're scarred, these sacs can't expand as much. As a result, less oxygen enters the blood.

Because it's always easier to understand something with a figure :

![Interstitial Lung Disease - What is it ?](https://projects.iq.harvard.edu/files/styles/os_files_xlarge/public/acil/files/8353.jpg?m=1434040440&itok=oRoBd9z0)

Other parts of your lungs can be affected too, such as the airways, lung lining, and blood vessels. 


On a chest radiograph, interstitial lung disease (ILD) shows four patterns: linear, reticular, reticulonodular, and nodular opacities.

In [None]:
example(5, 8)

### Infiltration


Infiltration is the gradual accumulation, within the lung interstitium and/or alveoli (we saw them above, refer to the figure), of a substance or type of cell that is foreign to the lung or that builds up in greater than normal quantity within it.

Just like consolidation, it appears on an X-ray as a whiter zone instead of the normal black zone of the lungs. The problem is that it looks so similar to consolidation that the two are hard to tell apart.

In [None]:
example(6, 8)

### Lung Opacity


From [this](https://www.kaggle.com/zahaviguy/what-are-lung-opacities) notebook: a lung opacity is any area in the chest radiograph that is whiter than it should be. The notebook explains the whole thing very well.

In [None]:
example(7, 8)

### Nodule/Mass

Pulmonary nodules are small, rounded opacities within the pulmonary interstitium. The differential diagnosis for a nodule can be refined by its size, location, and density.

Size : 

    miliary nodules: <2 mm
    pulmonary micronodule: 2-7 mm
    pulmonary nodule: 7-30 mm
    pulmonary mass: >30 mm

A pulmonary mass is any area of pulmonary opacification that measures more than 30 mm. The commonest cause of a pulmonary mass is lung cancer. [radiopaedia](https://radiopaedia.org/articles/pulmonary-mass?lang=us).
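
The size categories above can be written as a small function. This is a sketch: the name `size_category` is made up, and how the exact boundary values (2, 7 and 30 mm) are assigned is an assumption, since the list does not say which side they fall on. Converting a bounding box from pixels to millimetres would additionally need the pixel spacing of the image, which is not covered here.

```python
def size_category(diameter_mm):
    """Map a lesion diameter in millimetres to the size categories listed above."""
    if diameter_mm <= 0:
        raise ValueError("diameter must be positive")
    if diameter_mm < 2:
        return "miliary nodule"
    if diameter_mm < 7:
        return "pulmonary micronodule"
    if diameter_mm <= 30:
        return "pulmonary nodule"
    return "pulmonary mass"


for diameter in (1, 5, 20, 45):
    print(diameter, "mm ->", size_category(diameter))
```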


In [None]:
example(8, 8)

### Other lesion


As we saw, the original dataset has more than 14 classes. I think this class combines the original "Other lesion" class (about 363 records according to the paper above) with all the classes that are missing here (e.g. rib fracture, clavicle fracture, edema, etc.).

Note : I have to check that :) 

In [None]:
example(9, 8)

### Pleural effusion

Pleural effusion, also called water on the lung, is an excessive buildup of fluid in the space between your lungs and chest cavity.

Thin membranes, called pleura, cover the outside of the lungs and the inside of the chest cavity. There's always a small amount of liquid within this lining to help lubricate the lungs as they expand within the chest during breathing.

The pleura creates too much fluid when it's irritated, inflamed, or infected. This fluid accumulates in the chest cavity outside the lung, causing what's known as a pleural effusion.

Certain types of cancer can cause pleural effusions, lung cancer in men and breast cancer in women being the most common. Other causes include pulmonary embolism and pneumonia, for example.

![RCNi - pleural effusion](https://dm3omg1n1n7zx.cloudfront.net/rcni/static/journals/ns/28/41/ns.28.41.51.e8849/graphic/ns_v28_n41_8849_0001.jpg)


Imaging criteria (found [here](http://www.stritch.luc.edu/lumen/MedEd/Radio/curriculum/Medicine/Pleural_effusion1.htm), which gives pretty good details) are:

    Homogenous density
    Density in dependent portion
        Upright: Costophrenic angle in PA view
        Lateral view: Anterior and posterior portions of gutter
        Lateral decubitus position: Along sides
        Supine position: Along posteriorly, giving diffuse haziness on the side of effusion
    Silhouette of upper limit of density
        Upper margin high in axilla in PA view
        Upper margin high anteriorly and posteriorly in lateral view
        This is just an illusion
    Loss of silhouette: for example, the left diaphragm is not identifiable before, and becomes visible after, clearance of the fluid (Silhouette sign principle)
    Mediastinal shift



In [None]:
example(10, 8)

### Pleural thickening

Pleural thickening occurs when scar tissue develops on the lining of the lungs, or the pleura. Pleural thickening can develop following asbestos exposure or other conditions, such as infection.

Depending on the cause, pleural thickening may form in different parts of the pleura.


    Visceral pleura: The membrane directly covering the lung tissue
    Parietal pleura: The outer membrane of the lung attached to the chest wall
    Pleural space: The space between the visceral and parietal pleura

![Mesothelioma - Pleural thickening](https://www.mesothelioma.com/wp-content/uploads//Meso_Pleural-Thickening-53-2.svg)


In [None]:
example(11, 8)

### Pneumothorax

Pneumothorax is the medical term for a collapsed lung. Pneumothorax occurs when air enters the space around your lungs (the pleural space). Air can find its way into the pleural space when there's an open injury in your chest wall or a tear or rupture in your lung tissue, disrupting the pressure that keeps your lungs inflated.

A pneumothorax typically demonstrates:

    - visible visceral pleural edge is seen as a very thin, sharp white line
    - no lung markings are seen peripheral to this line
    - peripheral space is radiolucent compared to the adjacent lung

In [None]:
example(12, 8)

### Pulmonary fibrosis

Pulmonary fibrosis is a lung disease that occurs when lung tissue becomes damaged and scarred. This thickened, stiff tissue makes it more difficult for your lungs to work properly.

![](https://medicaldialogues.in/h-upload/2020/05/18/128958-idiopathic-pulmonary-fibrosis.jpg)

The most common radiographic changes to pulmonary fibrosis are an interstitial shadowing of small (1 to 2 mm), irregular opacities, which are seen in about 75% of patients. Less common are small, round opacities, which are seen in 20% of patients. This finding is generally known as reticulonodular opacities. [emedicine.medscape](https://emedicine.medscape.com)

In [None]:
example(13, 8)

The last class is "No finding", so a normal CXR. Even if we already saw one example, here are some more.

In [None]:
example(14,5)

Sources :

* [VinDR - CXR](https://vindr.ai/datasets/cxr)
* [VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations](https://arxiv.org/pdf/2012.15029.pdf)
* [EBM Consult](https://www.ebmconsult.com/articles/radiology-chest-xray-normal)
* [Healthline](https://www.healthline.com/human-body-maps/t4-fourth-thoracic-vertebrae#1)
* [UMCVC](https://www.umcvc.org/conditions-treatments/aortic-aneurysm)
* [MSRBlog](http://www.msrblog.com/science/medical/atelectasis.html)
* [Projects Harvard](https://projects.iq.harvard.edu)
* [NursingStandard](https://dm3omg1n1n7zx.cloudfront.net/rcni/static/journals/ns/28/41/ns.28.41.51.e8849/graphic/ns_v28_n41_8849_0001.jpg)
* [Medicaldialogues](https://medicaldialogues.in/pulmonology/news/pirfenidone-found-safe-efficacious-in-idiopathic-pulmonary-fibrosis-65865)