<h1 style="text-align: center; font-family: Verdana; font-size: 32px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; font-variant: small-caps; letter-spacing: 3px; color: #7b4f44; background-color: #ffffff;">VinBigData Chest X-ray Abnormalities Detection</h1>

### About the Competition

When we have a broken arm, radiologists help save the day—and the bone. These doctors diagnose and treat medical conditions using imaging techniques like CT and PET scans, MRIs, and, of course, X-rays. Yet, as it happens when working with such a wide variety of medical tools, radiologists face many daily challenges, perhaps the most difficult being the chest radiograph. The interpretation of chest X-rays can lead to medical misdiagnosis, even for the best practicing doctor. Computer-aided detection and diagnosis systems (CADe/CADx) would help reduce the pressure on doctors at metropolitan hospitals and improve diagnostic quality in rural areas.

Existing methods of interpreting chest X-ray images classify them into a list of findings. They currently do not specify where each finding is located on the image, which sometimes leads to inexplicable results. A solution for localizing findings on chest X-ray images is needed to give doctors more meaningful diagnostic assistance.

In this competition, we'll automatically localize and classify 14 types of thoracic abnormalities from chest radiographs. We'll work with a dataset of 18,000 scans that have been annotated by experienced radiologists. The model is trained on 15,000 independently labeled images and evaluated on a test set of 3,000 images.

# 1: IMPORTING LIBRARIES AND DATA

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from pydicom.pixel_data_handlers.util import apply_voi_lut
import cv2
import pydicom
from pydicom import dcmread
from pathlib import Path

In [None]:
df = pd.read_csv("../input/vinbigdata-chest-xray-abnormalities-detection/train.csv")
df.head(10)

In [None]:
df.info()

## 1.1: DATA DESCRIPTION

**Columns**
- `image_id` - unique image identifier
- `class_name` - the name of the class of detected object
- `class_id` - the ID of the class of detected object
- `rad_id` - the ID of the radiologist that made the observation
- `x_min` - minimum X coordinate of the object's bounding box
- `y_min` - minimum Y coordinate of the object's bounding box
- `x_max` - maximum X coordinate of the object's bounding box
- `y_max` - maximum Y coordinate of the object's bounding box
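
The four coordinate columns are enough to derive each box's width, height, and area. Here is a minimal sketch on a tiny hand-made DataFrame. The rows and values are invented for illustration and are not taken from `train.csv`. It also assumes that rows without a box (such as `No finding`) hold `NaN` coordinates.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the train.csv layout (all values are invented)
toy = pd.DataFrame({
    "image_id": ["img_a", "img_a", "img_b"],
    "class_name": ["Cardiomegaly", "Aortic enlargement", "No finding"],
    "class_id": [3, 0, 14],
    "rad_id": ["R1", "R2", "R1"],
    "x_min": [100.0, 400.0, np.nan],
    "y_min": [200.0, 150.0, np.nan],
    "x_max": [500.0, 550.0, np.nan],
    "y_max": [450.0, 300.0, np.nan],
})

# Box geometry in pixels; NaN coordinates propagate to NaN geometry
toy["box_width"] = toy["x_max"] - toy["x_min"]
toy["box_height"] = toy["y_max"] - toy["y_min"]
toy["box_area"] = toy["box_width"] * toy["box_height"]

print(toy[["class_name", "box_width", "box_height", "box_area"]])
```

Because the `NaN` values carry through the arithmetic, rows without a box are easy to filter out afterwards with `dropna`.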


**Labels**
- `0` - Aortic enlargement
- `1` - Atelectasis
- `2` - Calcification
- `3` - Cardiomegaly
- `4` - Consolidation
- `5` - ILD
- `6` - Infiltration
- `7` - Lung Opacity
- `8` - Nodule/Mass
- `9` - Other lesion
- `10` - Pleural effusion
- `11` - Pleural thickening
- `12` - Pneumothorax
- `13` - Pulmonary fibrosis
- `14` - No finding
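
When decoding predictions it can help to keep this table as a lookup. A small sketch follows. `CLASS_NAMES` and `decode_label` are hypothetical names introduced here for illustration and are not part of the dataset or any library.

```python
# Class ID -> class name, copied from the label list above
CLASS_NAMES = {
    0: "Aortic enlargement",
    1: "Atelectasis",
    2: "Calcification",
    3: "Cardiomegaly",
    4: "Consolidation",
    5: "ILD",
    6: "Infiltration",
    7: "Lung Opacity",
    8: "Nodule/Mass",
    9: "Other lesion",
    10: "Pleural effusion",
    11: "Pleural thickening",
    12: "Pneumothorax",
    13: "Pulmonary fibrosis",
    14: "No finding",
}


def decode_label(class_id: int) -> str:
    """Return the class name for a class ID; raises KeyError for unknown IDs."""
    return CLASS_NAMES[class_id]


print(decode_label(3))
```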

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">Aortic enlargement</b>
* Aortic enlargement is known as a sign of an aortic aneurysm. This condition often occurs in the ascending aorta. 
* In general, the term aneurysm is used when the axial diameter is >5.0 cm for the ascending aorta and >4.0 cm for the descending aorta.

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">Atelectasis</b>
* Atelectasis is a condition where there is no air in part or all of the lungs and they have collapsed. 
* A common cause of atelectasis is obstruction of the bronchi.
* In atelectasis, there is an increase in density on chest x-ray (usually whiter; black on black-and-white inversion images).

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">Calcification</b>
* Calcium (calcification) may be deposited in areas where previous inflammation of the lungs or pleura has healed. 
* Many diseases or conditions can cause calcification on chest x-ray. 
* Calcification may occur in the aorta (as with atherosclerosis) or in mediastinal lymph nodes (as with previous infection, tuberculosis, or histoplasmosis).

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">Cardiomegaly</b>
* Cardiomegaly is usually diagnosed when the ratio of the heart's width to the width of the chest is more than 50%. This diagnostic criterion may be an essential basis for this competition.
* Cardiomegaly can be caused by many conditions, including hypertension, coronary artery disease, infections, inherited disorders, and cardiomyopathies.
* The heart-to-lung ratio criterion for the diagnosis of cardiomegaly is a ratio greater than 0.5. However, this is only valid if the X-ray is taken while the patient is standing. If the patient is sitting or in bed, this criterion cannot be used. To determine whether a patient was standing (and consequently whether this criterion is valid), we can check for the presence of air in the stomach: if there is none, the patient was not standing and the criterion cannot be used.
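
The width-ratio criterion above comes down to a single division. The sketch below is illustrative only: `cardiothoracic_ratio` and `suggests_cardiomegaly` are hypothetical helpers, the widths are invented pixel measurements, and both widths are assumed to be measured on the same upright image in the same units.

```python
def cardiothoracic_ratio(heart_width: float, chest_width: float) -> float:
    """Ratio of the heart's width to the chest's width (same units for both)."""
    if chest_width <= 0:
        raise ValueError("chest_width must be positive")
    return heart_width / chest_width


def suggests_cardiomegaly(heart_width: float, chest_width: float,
                          threshold: float = 0.5) -> bool:
    """True when the ratio exceeds the threshold (0.5 per the criterion above)."""
    return cardiothoracic_ratio(heart_width, chest_width) > threshold


# Invented measurements in pixels
print(cardiothoracic_ratio(1100, 2000))
print(suggests_cardiomegaly(1100, 2000))
print(suggests_cardiomegaly(900, 2000))
```

In this dataset such a ratio could only be approximated, for example from the width of a `Cardiomegaly` box relative to the image width, which is a cruder measure than the clinical one.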

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">Consolidation</b>
* Consolidation is a decrease in lung permeability due to infiltration of fluid, cells, or tissue replacing the air-containing spaces in the alveoli.
* Consolidation is officially referred to as air space consolidation. 
* On X-rays displaying air space consolidation, the lung field's density is increased, and pulmonary blood vessels are not seen, but black bronchi can be seen in the white background, which is called <i>"air bronchogram"</i>. Since air remains in the bronchial tubes, they do not absorb X-rays and appear black, and the black and white are reversed from normal lung fields.

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">ILD</b>
* ILD stands for <i>"Interstitial Lung Disease"</i>.
* Interstitial Lung Disease is a general term for many conditions in which the interstitial space is injured. 
* The interstitial space refers to the walls of the alveoli (air sacs in the lungs) and the space around the blood vessels and small airways.
* Chest radiographic findings include ground-glass opacities (i.e., an area of hazy opacification), linear reticular shadows, and granular shadows.

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">Infiltration</b>
* The infiltration of some fluid component into the alveoli causes an infiltrative shadow (Infiltration).
* It is difficult to distinguish from consolidation and, in some cases, impossible to distinguish. Please see [this link](https://allnurses.com/consolidation-vs-infiltrate-vs-opacity-t483538/) for more information.

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">Lung Opacity</b>
* Lung opacity is a loose term with many potential interpretations/meanings. Please see this [kaggle discussion](https://www.kaggle.com/zahaviguy/what-are-lung-opacities) for more information.
* Lung opacity can often be identified as any area in the chest radiograph that is <b>more white than it should be.</b>

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">Nodule/Mass</b>
* Nodules and masses are seen primarily in lung cancer, and metastasis from other parts of the body such as colon cancer and kidney cancer, tuberculosis, pulmonary mycosis, non-tuberculous mycobacterium, obsolete pneumonia, and benign tumors.
* A nodule/mass is a round shadow (typically less than 3 cm in diameter, which results in much smaller than average bounding boxes) that appears on a chest X-ray image.

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">Other lesion</b>
* Others include all abnormalities that do not fall into any other category. This includes bone penetrating images, fractures, subcutaneous emphysema, etc.

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">Pleural effusion</b>
* Pleural effusion is the accumulation of water outside the lungs in the chest cavity. 
* The outside of the lungs is covered by a thin membrane consisting of two layers known as the pleura. Fluid accumulation between these two layers (chest-wall/parietal-pleura and the lung-tissue/visceral-pleura) is called pleural effusion.
* The findings of pleural effusion vary widely and depend on whether the radiograph is taken in the upright or supine position.
* The most common presentation of pleural effusion is <b>elevation of the diaphragm on one side, flattening the diaphragm, or blunting the angle between rib and diaphragm (typically more than 30 degrees)</b>

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">Pleural thickening</b>
* The pleura is the membrane that covers the lungs, and the change in the thickness of the pleura is called pleural thickening. 
* It is often seen in the uppermost part of the lung field (the apex of the lung).

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">Pneumothorax</b>
* A pneumothorax is a condition in which air leaks from the lungs and accumulates in the chest cavity. 
* When air leaks and accumulates in the chest, it cannot expand outward like a balloon due to the ribs' presence. Instead, the lungs are pushed by the air and become smaller. In other words, a pneumothorax is a situation where air leaks from the lungs and the lungs become smaller (collapsed).
* In a chest radiograph of a pneumothorax, the collapsed lung is whiter than normal, and the area where the lung is gone is uniformly black. In addition, the edges of the lung may appear linear.

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">Pulmonary fibrosis</b>
* Pulmonary Fibrosis is inflammation of the lung interstitium due to various causes, resulting in thickening and hardening of the walls, fibrosis, and scarring.
* The fibrotic areas lose their air content, which often results in dense cord shadows or granular shadows.

<br><b style="text-decoration: underline; font-family: Verdana; text-transform: uppercase;">No finding</b>
* There are no findings on x-ray images. This is the normal image and is the baseline image needed to differentiate from the abnormal image.

# 2: DATA VISUALIZATION & ANALYSIS

In [None]:
df['rad_id'].unique()

In [None]:
df['class_name'].value_counts()

In [None]:
fig = px.histogram(df, x="class_name", color="class_name")

fig.update_layout(
    # axis `titlefont` is deprecated in Plotly; set the font under `title` instead
    yaxis=dict(title=dict(text='Count', font=dict(size=20))),
    xaxis=dict(title=dict(text='Abnormality Label Name', font=dict(size=20))),
    title_text='Abnormalities Count Plot'
)
fig.show()

In [None]:
def percent_distribution(data):
    # Get the count for each label
    label_counts = data.class_name.value_counts()

    # Get total number of samples
    total_samples = len(data)

    # Count the number of items in each class
    for i in range(len(label_counts)):
        label = label_counts.index[i]
        count = label_counts.values[i]
        percent = int((count / total_samples) * 10000) / 100
        print("{:<30s}:   {} or {}%".format(label, count, percent))

percent_distribution(df)

In [None]:
sns.pairplot(df, hue='class_id');

In [None]:
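def plot_width_of_bounding_boxes(data):
    """Plot the distribution of bounding-box widths for each abnormality class."""
    fig, axes = plt.subplots(7, 2, figsize=(16,20), sharex=True)
    fig.suptitle("width of bounding box for different categories", fontsize=16)
    
    classes = data.class_name.unique()
    # 'No finding' rows carry no bounding box, so that class is skipped
    for j, name in enumerate(classes[~np.isin(classes, 'No finding')]):
        data_ = data[data['class_name']==name]
        ax = axes[j%7, j//7]
        # histplot with a KDE overlay replaces the deprecated sns.distplot
        sns.histplot(data_['x_max'] - data_['x_min'], kde=True, stat='density', ax=ax)
        ax.set_title(name)
    plt.show()

plot_width_of_bounding_boxes(df)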

In [None]:
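def plot_height_of_bounding_boxes(data):
    """Plot the distribution of bounding-box heights for each abnormality class."""
    fig, axes = plt.subplots(7, 2, figsize=(16,20), sharex=True)
    fig.suptitle("height of bounding box for different categories", fontsize=16)
    
    classes = data.class_name.unique()
    # 'No finding' rows carry no bounding box, so that class is skipped
    for j, name in enumerate(classes[~np.isin(classes, 'No finding')]):
        data_ = data[data['class_name']==name]
        ax = axes[j%7, j//7]
        # histplot with a KDE overlay replaces the deprecated sns.distplot
        sns.histplot(data_['y_max'] - data_['y_min'], kde=True, stat='density', ax=ax)
        ax.set_title(name)
    plt.show()

plot_height_of_bounding_boxes(df)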

In [None]:
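def plot_area_of_bounding_boxes(data):
    """Plot the distribution of bounding-box areas for each abnormality class."""
    fig, axes = plt.subplots(7, 2, figsize=(16,20), sharex=True)
    fig.suptitle("area of bounding box for different categories", fontsize=16)
    
    classes = data.class_name.unique()
    # 'No finding' rows carry no bounding box, so that class is skipped
    for j, name in enumerate(classes[~np.isin(classes, 'No finding')]):
        data_ = data[data['class_name']==name]
        ax = axes[j%7, j//7]
        area = (data_['y_max'] - data_['y_min'])*(data_['x_max'] - data_['x_min'])
        # histplot with a KDE overlay replaces the deprecated sns.distplot
        sns.histplot(area, kde=True, stat='density', ax=ax)
        ax.set_title(name)
    plt.show()

plot_area_of_bounding_boxes(df)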

# 3: IMAGE DATA VISUALIZATION

In [None]:
def read_xray(path, voi_lut = True, fix_monochrome = True):
    """ Convert dicom file to numpy array 
    
    Args:
        path (str): Path to the dicom file to be converted
        voi_lut (bool): Whether or not to apply the VOI LUT transform
        fix_monochrome (bool): Whether or not to apply monochrome fix
        
    Returns:
        Numpy array of the respective dicom file 
        
    """
    dicom = pydicom.dcmread(path)      # Read the dicom file (dcmread replaces the deprecated read_file)
    
    # VOI LUT (if available by DICOM device) is used to transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
               
    # depending on this value, X-ray may look inverted - fix that:
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
        
    data = data - np.min(data)
    data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
        
    return data
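
To see what the last steps of `read_xray` do without a DICOM file at hand, here is a sketch on a synthetic array. `to_uint8` is a hypothetical helper that mirrors the inversion and min-max scaling above. Its `invert` flag stands in for the `MONOCHROME1` check, and the guard for constant images is an addition that `read_xray` does not have.

```python
import numpy as np


def to_uint8(pixels: np.ndarray, invert: bool = False) -> np.ndarray:
    """Optionally invert, then min-max scale an array to the 0..255 uint8 range."""
    data = pixels.astype(np.float64)
    if invert:
        # MONOCHROME1 stores bright tissue as low values, so flip the scale
        data = data.max() - data
    data = data - data.min()
    peak = data.max()
    if peak > 0:
        # Skip the division for constant images to avoid dividing by zero
        data = data / peak
    return (data * 255).astype(np.uint8)


# Invented 12-bit pixel values, as a detector might produce
raw = np.array([[0, 2048], [1024, 4095]], dtype=np.uint16)
print(to_uint8(raw))
print(to_uint8(raw, invert=True))
```

The cast to `uint8` truncates rather than rounds, which matches the `astype(np.uint8)` call in `read_xray`.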

In [None]:
root = '/kaggle/input/vinbigdata-chest-xray-abnormalities-detection/train/'
ext = '.dicom'

data = read_xray(root + df.image_id[0] + ext)
plt.imshow(data, 'gray');

In [None]:
def show_xray(data, root=root, ext=ext):
    fig, axs = plt.subplots(3,3, figsize=(16,18))
    
    for i in range(9):
        k = np.random.randint(0,len(data))  # random row position for plotting an X-ray
        img_id = data.image_id.iloc[k]      # positional lookup also works on filtered frames
        class_name = data.class_name.iloc[k]
        
        img_path = os.path.join(root, img_id+ext)
        dicom_pixel = read_xray(img_path)
        
        axs[i//3,i%3].imshow(dicom_pixel, cmap='gray')
        axs[i//3,i%3].title.set_text(class_name)
    plt.show()

show_xray(df)

## 3.1 Visualizing Bounding Boxes

In [None]:
def show_xray_withAnchorBoxs(data, root=root, ext=ext, n=3):
    fig, axs = plt.subplots(n,n, figsize=(40,40))

    un_xrays = data.image_id.unique()
    for i in range(n*n):
        k = np.random.randint(0,len(un_xrays))  # pick a random X-ray
        xray_id = un_xrays[k]  # X-ray name (image_id)
        
        # all bounding boxes annotated for this image_id, with a fresh 0-based index
        abnorm_df = data[data.image_id==xray_id].reset_index(drop=True)
   
        xray_path = os.path.join(root, xray_id+ext) #path of that image
        img = read_xray(xray_path) #reading dicom file
        
        axs[i//n,i%n].imshow(img, cmap='gray')#plotting the image
        
        for j in range(len(abnorm_df)):

            axs[i//n,i%n].add_patch(patches.Rectangle(
                (abnorm_df['x_min'][j], abnorm_df['y_min'][j]), 
                abnorm_df['x_max'][j] - abnorm_df['x_min'][j], 
                abnorm_df['y_max'][j] - abnorm_df['y_min'][j], 
                edgecolor='y',
                linewidth=1,
                fill=False)
            )
            axs[i//n,i%n].text(
                abnorm_df['x_min'][j], abnorm_df['y_min'][j], #coordinate
                abnorm_df.class_name[j],  # label string
                bbox=dict(fill=True, edgecolor='yellow', linewidth=2)
            )
    plt.show()

show_xray_withAnchorBoxs(df[df.class_name!='No finding']) #filtering no finding images

In [None]:
show_xray_withAnchorBoxs(df[df.class_name!='No finding'], n=4)

## 3.2 Generating Heatmaps

In [None]:
# Courtesy of https://www.kaggle.com/craigmthomas/localization-of-findings

def add_image_dimensions_gender(df):
    path_spec = "../input/vinbigdata-chest-xray-abnormalities-detection/train/{}.dicom"
    height = []
    width = []
    gender = []
    age = []
    for _, row in df.iterrows():
        dcm = dcmread(Path(path_spec.format(row["image_id"])), stop_before_pixels=True)
        height.append(dcm.Rows)
        width.append(dcm.Columns)
        gender.append(dcm.get("PatientSex", ""))  # tag (0010,0040); "" if the tag is missing
        patient_age = dcm.get("PatientAge", "")   # tag (0010,1010); "" if the tag is missing
        age.append(patient_age)
    df["image_height"] = height
    df["image_width"] = width
    df["gender"] = gender
    df["age"] = age
    
    return df

def scale_bounding_boxes(df):
    df["x_min_norm"] = df["x_min"] / df["image_width"]
    df["y_min_norm"] = df["y_min"] / df["image_height"]
    df["x_max_norm"] = df["x_max"] / df["image_width"]
    df["y_max_norm"] = df["y_max"] / df["image_height"]
    
    return df

def draw_box_on_array(row, np_array):
    xy = [
        int(row["x_min_norm"] * 400), 
        int(row["y_min_norm"] * 500),
        int(row["x_max_norm"] * 400),
        int(row["y_max_norm"] * 500),
    ]
    np_array[xy[1]:xy[3], xy[0]:xy[2]] += 1
    
def get_bbox(df, class_id):
    np_array = np.zeros(shape=(500, 400))
    if class_id == 14:
        return np_array
    for _, row in df[df["class_id"] == class_id].iterrows():
        draw_box_on_array(row, np_array)
    return np_array

def plot_heatmap(data=df):
    data = add_image_dimensions_gender(data)
    data = scale_bounding_boxes(data)
    
    classes = [
        "0 - Aortic enlargement",
        "1 - Atelectasis",
        "2 - Calcification",
        "3 - Cardiomegaly", 
        "4 - Consolidation",
        "5 - ILD",
        "6 - Infiltration",
        "7 - Lung Opacity",
        "8 - Nodule/Mass",
        "9 - Other lesion",
        "10 - Pleural effusion",
        "11 - Pleural thickening",
        "12 - Pneumothorax",
        "13 - Pulmonary fibrosis",
        "14 - No finding",
    ]

    fig, axs = plt.subplots(nrows=5, ncols=3, figsize=(20, 30))
    for i, ax in enumerate(axs.flatten()):
        # use the `data` argument rather than the global `df`
        ax.imshow(get_bbox(data, i), cmap='hot', interpolation='nearest')
        _ = ax.set_title(classes[i], fontweight="bold", size=15)
    plt.show()

plot_heatmap()
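
Each heatmap is a count of how many boxes cover each cell of a fixed grid, after the coordinates have been normalized by image size. Here is a toy sketch of that accumulation, using a 4x4 grid and two invented boxes from images of different sizes. `accumulate_boxes` is a hypothetical helper and not part of the code above.

```python
import numpy as np


def accumulate_boxes(boxes, grid_h, grid_w):
    """Count box coverage per grid cell.

    boxes: iterable of (x_min, y_min, x_max, y_max, image_width, image_height).
    """
    heat = np.zeros((grid_h, grid_w), dtype=np.int64)
    for x_min, y_min, x_max, y_max, img_w, img_h in boxes:
        # Normalize to [0, 1] by image size, then scale to grid coordinates
        c0 = int(x_min / img_w * grid_w)
        c1 = int(x_max / img_w * grid_w)
        r0 = int(y_min / img_h * grid_h)
        r1 = int(y_max / img_h * grid_h)
        heat[r0:r1, c0:c1] += 1
    return heat


# Invented boxes: (x_min, y_min, x_max, y_max, image_width, image_height)
boxes = [
    (0, 0, 500, 500, 1000, 1000),         # top-left quadrant of a 1000x1000 image
    (500, 500, 1500, 2000, 2000, 2000),   # central-lower band of a 2000x2000 image
]
heat = accumulate_boxes(boxes, grid_h=4, grid_w=4)
print(heat)
```

Normalizing first is what lets boxes from images of different resolutions land on the same grid.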

<h3 style="text-align: font-family: Verdana; font-size: 20px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: green; background-color: #ffffff;"> Work In Progress</h3>

<h3 style="text-align: font-family: Verdana; font-size: 20px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: green; background-color: #ffffff;"> Do upvote if it helped and comment</h3>