<h1 style="text-align: center; font-family: Gabriola; font-size: 32px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; font-variant: small-caps; letter-spacing: 3px; color: #ffffff; background-color: #502b6b;">VinBigData Chest X-ray Abnormalities Detection</h1>
<h2 style="text-align: center; font-family: Futara; font-size: 20px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 1px; color: #ffffff; background-color: #502b6b;">Automatically localize and classify thoracic abnormalities from chest radiographs</h2>
<h3 style="text-align: center; font-family: Futara; font-size: 18px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 1px; color: #ffffff; background-color: #502b6b;">Detailed Data Analysis</h3>

## Table of contents

1. [Introduction](#Introduction)

2. [The Dataset](#The-Dataset)
   - [DATASET INFORMATION](#DATASET-INFORMATION)
   - [DATA FILES](#DATA-FILES)

3. [Getting DICOM Metadata](#Getting-DICOM-Metadata)
4. [EDA](#EDA)
5. [Analysis For Data Cleaning](#Analysis-For-Data-Cleaning)
   - [Redundant Samples](#Redundant-Samples)
   - [Consider Columns That Have Very Few Values](#Consider-Columns-That-Have-Very-Few-Values)


## More coming soon

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import matplotlib.patches  # used explicitly in plot_example
import seaborn as sns
import pydicom as dicom
import cv2

import warnings
warnings.filterwarnings("ignore")

In [None]:
path = '/kaggle/input/vinbigdata-chest-xray-abnormalities-detection/'
train_data = pd.read_csv(path+'train.csv')

In [None]:
def plot_example(idx_list):
    # Show three annotated rows of train_data side by side, drawing the box when there is one
    fig, axs = plt.subplots(1, 3, figsize=(15, 10))
    fig.subplots_adjust(hspace=.1, wspace=.1)
    axs = axs.ravel()
    for i in range(3):
        row = train_data.loc[idx_list[i]]
        data_file = dicom.dcmread(path+'train/'+row['image_id']+'.dicom')
        img = data_file.pixel_array
        axs[i].imshow(img, cmap='gray')
        axs[i].set_title(row['class_name'])
        axs[i].set_xticklabels([])
        axs[i].set_yticklabels([])
        # "No finding" rows carry no bounding box
        if row['class_name'] != 'No finding':
            p = matplotlib.patches.Rectangle((row['x_min'], row['y_min']),
                                             row['x_max']-row['x_min'],
                                             row['y_max']-row['y_min'],
                                             ec='r', fc='none', lw=2.)
            axs[i].add_patch(p)


---

## Introduction

[[ go back to the top ]](#Table-of-contents)

In this Kaggle competition, we are going to automatically localize and classify 14 types of thoracic abnormalities from chest radiographs. We will work with a dataset of 18,000 scans that have been annotated by experienced radiologists. Models can be trained on 15,000 independently labeled images and are evaluated on a test set of 3,000 images.




---


## The Dataset

[[ go back to the top ]](#Table-of-contents)

In this competition, we are classifying common thoracic lung diseases and localizing critical findings. This is an object detection and classification problem.

For each test image, you will be predicting a bounding box and class for all findings. If you predict that there are no findings, you should create a prediction of **`14 1 0 0 1 1`** *(14 is the class ID for no finding, and this provides a one-pixel bounding box with a confidence of 1.0)*
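
As a rough illustration of the format described above, the sketch below builds a prediction string from a list of detections. The helper name `format_prediction_string` is hypothetical, and the field order (class, confidence, then the four box coordinates) and the space-joined layout for several findings are assumptions read off the "no finding" example.

```python
def format_prediction_string(detections):
    """Build one prediction string from (class_id, confidence, x_min, y_min, x_max, y_max) tuples."""
    if not detections:
        # No findings: class 14 with confidence 1 and a one-pixel box
        return "14 1 0 0 1 1"
    parts = []
    for class_id, confidence, x_min, y_min, x_max, y_max in detections:
        parts.append(f"{class_id} {confidence} {x_min} {y_min} {x_max} {y_max}")
    # Assumption: several findings on one image are joined with single spaces
    return " ".join(parts)


print(format_prediction_string([]))
print(format_prediction_string([(0, 0.5, 100, 100, 200, 200)]))
```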

Note that the images are in **DICOM** format, which means they contain additional data that might be useful for visualizing and classifying.
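
One piece of that extra data that matters for visualization is the `PhotometricInterpretation` tag: in the DICOM standard, `MONOCHROME1` means the lowest pixel value is displayed as white, while `MONOCHROME2` means the lowest value is displayed as black. The sketch below is one possible way to bring both to the same convention before plotting. The function name `to_display_uint8` is hypothetical and the simple min-max scaling is an assumption, not something this notebook applies.

```python
import numpy as np


def to_display_uint8(pixels, photometric_interpretation):
    """Scale a raw pixel array to 0..255 so that higher values are always displayed brighter."""
    img = pixels.astype(np.float32)
    if photometric_interpretation == "MONOCHROME1":
        # MONOCHROME1 shows the minimum value as white, so flip the intensities
        img = img.max() - img
    img -= img.min()
    peak = img.max()
    if peak > 0:
        img /= peak
    return (img * 255).astype(np.uint8)


# Toy 2x2 "image" with made-up raw values
raw = np.array([[0, 100], [200, 400]], dtype=np.uint16)
as_mono2 = to_display_uint8(raw, "MONOCHROME2")
as_mono1 = to_display_uint8(raw, "MONOCHROME1")
```

With a real file, `pixels` would be `data_file.pixel_array` and the tag would come from `data_file.PhotometricInterpretation`.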



![Example Radiographs](https://i.imgur.com/QWmbhXx.png)

<br>

#### DATASET INFORMATION

The dataset comprises **`18,000`** postero-anterior (PA) CXR scans in DICOM format, which were de-identified to protect patient privacy. 

All images were labeled by a panel of experienced radiologists for the presence of **14** critical radiographic findings as listed below:

- [0. Aortic enlargement](#0.-Aortic-enlargement)
- [1. Atelectasis](#1.-Atelectasis)
- [2. Calcification](#2.-Calcification)
- [3. Cardiomegaly](#3.-Cardiomegaly)
- [4. Consolidation](#4.-Consolidation)
- [5. ILD](#5.-ILD)
- [6. Infiltration](#6.-Infiltration)
- [7. Lung Opacity](#7.-Lung-Opacity)
- [8. Nodule/Mass](#8.-Nodule/Mass)
- [9. Other lesion](#9.-Other-lesion)
- [10. Pleural effusion](#10.-Pleural-effusion)
- [11. Pleural thickening](#11.-Pleural-thickening)
- [12. Pneumothorax](#12.-Pneumothorax)
- [13. Pulmonary fibrosis](#13.-Plumonary-fibrosis)
- [14. No finding](#14.-No-finding)

Note that a key part of this competition is working with ground truth from multiple radiologists. That means that the same image will have multiple ground-truth labels as annotated by different radiologists.
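
To get a feel for what multiple ground-truth labels per image look like, here is a minimal sketch on a toy frame in the `train.csv` layout. The image IDs, radiologist IDs, and labels are made up for illustration.

```python
import pandas as pd

# Toy rows in the train.csv layout; all values are invented
toy = pd.DataFrame({
    "image_id": ["img_a", "img_a", "img_a", "img_b", "img_b", "img_b"],
    "class_name": ["Cardiomegaly", "Cardiomegaly", "No finding",
                   "No finding", "No finding", "No finding"],
    "rad_id": ["R1", "R2", "R3", "R1", "R2", "R3"],
})

# How many distinct radiologists labeled each image
readers_per_image = toy.groupby("image_id")["rad_id"].nunique()

# How many radiologists reported each class on each image
votes = (toy.groupby(["image_id", "class_name"])["rad_id"]
            .nunique()
            .unstack(fill_value=0))

print(readers_per_image)
print(votes)
```

On the real data the same two group-bys can be run on `train_data` to see how often the radiologists agree on an image.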

<br>


#### DATA FILES

> **`train.csv`** - the train set metadata, with one row for each object, including a class and a bounding box (multiple rows per image possible)<br>
**`sample_submission.csv`** - a sample submission file in the correct format

<br>

<b style="text-decoration: underline; font-family: Futara;">TRAIN COLUMNS</b>
> **`image_id`** - unique image identifier<br>
**`class_name`** - the name of the class of detected object (or "No finding")<br>
**`class_id`** - the ID of the class of detected object<br>
**`rad_id`** - the ID of the radiologist that made the observation<br>
**`x_min`** - minimum X coordinate of the object's bounding box<br>
**`y_min`** - minimum Y coordinate of the object's bounding box<br>
**`x_max`** - maximum X coordinate of the object's bounding box<br>
**`y_max`** - maximum Y coordinate of the object's bounding box
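
The four coordinate columns are enough to derive the size of each box. The sketch below uses a toy frame with made-up values, and assumes that rows without a box (such as "No finding") hold missing coordinates, so the derived columns are missing there too.

```python
import pandas as pd

# Toy rows with the train.csv coordinate columns; the numbers are invented
toy = pd.DataFrame({
    "class_name": ["Cardiomegaly", "Nodule/Mass", "No finding"],
    "x_min": [600.0, 1200.0, None],
    "y_min": [1400.0, 800.0, None],
    "x_max": [1800.0, 1300.0, None],
    "y_max": [1900.0, 880.0, None],
})

# Width, height, and area in pixels; NaN coordinates propagate to NaN sizes
toy["box_width"] = toy["x_max"] - toy["x_min"]
toy["box_height"] = toy["y_max"] - toy["y_min"]
toy["box_area"] = toy["box_width"] * toy["box_height"]

print(toy[["class_name", "box_width", "box_height", "box_area"]])
```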

## 0. Aortic enlargement

[[Back to top]](#The-Dataset)

Also known as an aortic aneurysm, this condition can be deadly if left undiagnosed.

![Example Radiographs](https://www.umcvc.org/sites/default/files/styles/large/public/Types%20of%20Aneurysms.jpg?itok=2POtyibt)

The illustration above shows three types of aneurysm and two aneurysm shapes. The three types are ascending thoracic aortic aneurysm (TAA), descending thoracic aortic aneurysm (TAA), and abdominal aortic aneurysm (AAA). The two shapes are fusiform and saccular.

The aorta is your largest artery and it brings oxygenated blood to all parts of the body. If the walls of the aorta become weak, an enlargement can occur, which is known as an aortic aneurysm. Aneurysms can form in any section of the aorta, but are most common in the abdomen (abdominal aortic aneurysm) or the upper body (thoracic aortic aneurysm). 
  
(Please compare the below images to the normal images.)

In [None]:
idx_list = train_data[train_data['class_id']==0][9:12].index.values
plot_example(idx_list)

## 1. Atelectasis

[[Back to top]](#The-Dataset)


<img src="https://www.verywellhealth.com/thmb/FsRDod4S4NNcpnXTVRBXDKpH_vs=/900x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/2248927-article-understanding-atelectasis-01-5a5e2c3ebeba3300368891ea.png" alt="drawing" style="width:500px;"/>

Your airways are branching tubes that run throughout each of your lungs. When you breathe, air moves from the main airway in your throat, sometimes called your windpipe, to your lungs. The airways continue branching and get progressively smaller until they end in little sacs called alveoli.

Your alveoli help to exchange the oxygen in the air for carbon dioxide, a waste product from your tissues and organs. In order to do this, your alveoli must fill with air.

When some of your alveoli don’t fill with air, it’s called “atelectasis.”

Depending on the underlying cause, atelectasis can involve either small or large portions of your lung.

In [None]:
idx_list = train_data[train_data['class_id']==1][0:3].index.values
plot_example(idx_list)

## 2. Calcification

[[Back to top]](#The-Dataset)



<img src="https://pulmonarychronicles.com/index.php/pulmonarychronicles/article/download/521/1151/3005" alt="Example Radiographs" style="width:500px;"/>

Many diseases or conditions can cause calcification on chest x-ray. Calcium (calcification) may be deposited in areas where previous inflammation of the lungs or pleura has healed. Calcium may be deposited in the aorta due to atherosclerosis. Or calcification may occur in mediastinal lymph nodes.

On the radiograph, calcification is characterized by a density similar to that of bone.

In [None]:
idx_list = train_data[train_data['class_id']==2][0:3].index.values
plot_example(idx_list)

## 3. Cardiomegaly

[[Back to top]](#The-Dataset)

<img src="https://image.shutterstock.com/image-vector/cardiomegaly-enlarged-normal-heart-muscles-600w-1063197287.jpg" style="width:400px;"/>

Cardiomegaly (sometimes megacardia or megalocardia) is a medical condition in which the heart is enlarged. As such, it is more commonly referred to simply as "having an enlarged heart".

Cardiomegaly is not a disease, but rather a condition that can result from a host of other diseases such as obesity or coronary artery disease. Cardiomegaly can be serious: depending on what part of the heart is enlarged, the patient can suffer from heart failure. Recent studies suggest that cardiomegaly is associated with a higher risk of sudden cardiac death.

In [None]:
idx_list = train_data[train_data['class_id']==3][0:3].index.values
plot_example(idx_list)

## 4. Consolidation

[[Back to top]](#The-Dataset)



<img src="https://radiologyassistant.nl/assets/chest-x-ray-lung-disease/a51a09750db92c_Chest-density2.jpg" style="width:400px;"/>

A pulmonary consolidation is a region of normally compressible lung tissue that has filled with liquid instead of air. The condition is marked by induration (swelling or hardening of normally soft tissue) of a normally aerated lung.

In other words, lung consolidation occurs when the air that usually fills the small airways in your lungs is replaced with something else. Depending on the cause, the air may be replaced with:

 - a fluid, such as pus, blood, or water
 - a solid, such as stomach contents or cells
 
The appearance of your lungs on a chest X-ray, and your symptoms, are similar for all these substances. So, you’ll typically need more tests to find out why your lungs are consolidated. With appropriate treatment, the consolidation usually goes away and air returns.


In [None]:
idx_list = train_data[train_data['class_id']==4][0:3].index.values
plot_example(idx_list)

## 5. ILD

[[Back to top]](#The-Dataset)



<img src="https://www.svhlunghealth.com.au/Images/UserUploadedImages/3444/2_SVH_Lung_Health_Interstitial_Lung_Disease_final_1080.jpg" style="width:600px;"/>

ILD stands for "Interstitial Lung Disease."

Interstitial lung disease (ILD) is the name for more than 200 lung disorders that affect the interstitium, the tissue and space around the alveoli (air sacs). The interstitium is the tiny, fluid-filled tissue and space around the air sacs in the lungs.

In healthy people, the interstitium is very thin. In interstitial lung diseases, the interstitium thickens and scars. Over time, the scarring can cause lung stiffness and eventually affect breathing.

People with interstitial lung disease find it hard to get enough oxygen into their bloodstream. If you have an interstitial lung disease, other compartments of your lungs can also be affected, including:

 - Alveoli
 - Airways (trachea, bronchi, and bronchioles)
 - Blood vessels
 - Pleura (outside lining of the lung).

In [None]:
idx_list = train_data[train_data['class_id']==5][0:3].index.values
plot_example(idx_list)

## 6. Infiltration

[[Back to top]](#The-Dataset)



<img src="https://media.sciencephoto.com/image/f0189104/800wm" style="width:300px;"/>

A pulmonary infiltrate is a substance denser than air, such as pus, blood, or protein, which lingers within the parenchyma of the lungs. Pulmonary infiltrates are associated with pneumonia and tuberculosis, and can be observed on a chest radiograph.

In [None]:
idx_list = train_data[train_data['class_id']==6][0:3].index.values
plot_example(idx_list)

## 7. Lung Opacity

[[Back to top]](#The-Dataset)



<img src="https://www.aafp.org/afp/2009/1201/afp20091201p1289-uf1.jpg" style="width:300px;"/>

Pulmonary opacification represents the result of a decrease in the ratio of gas to soft tissue (blood, lung parenchyma and stroma) in the lung.

As we can see, this class is quite generic, so it is worth checking which radiologists are assigning it.
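
Checking who assigns this class could look roughly like the sketch below, which counts Lung Opacity rows (`class_id` 7) per radiologist. The toy rows and radiologist IDs are made up for illustration.

```python
import pandas as pd

# Toy annotation rows; class 7 is Lung Opacity, the rad_id values are invented
toy = pd.DataFrame({
    "class_id": [7, 7, 7, 3, 7, 14],
    "rad_id": ["R9", "R9", "R10", "R9", "R8", "R10"],
})

# Number of Lung Opacity annotations made by each radiologist, most active first
opacity_by_reader = toy.loc[toy["class_id"] == 7, "rad_id"].value_counts()
print(opacity_by_reader)
```

On the real data the same line can be run with `train_data` in place of `toy`.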

In [None]:
idx_list = train_data[train_data['class_id']==7][0:3].index.values
plot_example(idx_list)

## 8. Nodule/Mass

[[Back to top]](#The-Dataset)



<img src="https://health.clevelandclinic.org/wp-content/uploads/sites/3/2016/11/GettyImages-490025288-1200x630.jpg" style="width:300px;"/>

Pulmonary nodules are small, rounded opacities within the pulmonary interstitium. Pulmonary nodules are common and, as the spatial resolution of CT scanners has increased, detection of smaller and smaller nodules has occurred, which are more often an incidental finding.

In other words, a nodule/mass is a round opacity (typically less than 3 cm in diameter).

In [None]:
idx_list = train_data[train_data['class_id']==8][0:3].index.values
plot_example(idx_list)

## 9. Other lesion

[[Back to top]](#The-Dataset)



<img src="https://www.verywellhealth.com/thmb/VwMJkvROXXTi3RjIck0UTqON5YA=/900x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/lung-mass-possible-causes-and-what-to-expect-2249388-5bc3f847c9e77c00512dc818.png" style="width:500px;"/>

"Other lesion" covers all abnormalities that do not fall into any other category, such as bone-penetrating findings, fractures, and subcutaneous emphysema.

In [None]:
idx_list = train_data[train_data['class_id']==9][0:3].index.values
plot_example(idx_list)

## 10. Pleural effusion

[[Back to top]](#The-Dataset)



<img src="https://upload.wikimedia.org/wikipedia/commons/4/48/Diagram_showing_a_build_up_of_fluid_in_the_lining_of_the_lungs_%28pleural_effusion%29_CRUK_054.svg" style="width:400px; background-color: #502b6b;"/>

A pleural effusion is excess fluid that accumulates in the pleural cavity, the fluid-filled space that surrounds the lungs. Excess fluid can impair breathing by limiting the expansion of the lungs. Various kinds of pleural effusion, depending on the nature of the fluid and what caused its entry into the pleural space, are hydrothorax (serous fluid), hemothorax (blood), urinothorax (urine), chylothorax (chyle), or pyothorax (pus) commonly known as pleural empyema. In contrast, a pneumothorax is the accumulation of air in the pleural space, and is commonly called a "collapsed lung".



In [None]:
idx_list = train_data[train_data['class_id']==10][0:3].index.values
plot_example(idx_list)

## 11. Pleural thickening

[[Back to top]](#The-Dataset)



<img src="https://www.mesothelioma.com/wp-content/uploads/Meso_Pleural-Thickening-53-2.svg" style="width:400px;"/>

Pleural thickening refers to a thickening of the lining of the lungs, the pleura, which is a thin layer of membrane that covers the inside of the rib-cage as well as the outside of the lungs. Diffuse pleural thickening (DPT) is diagnosed when the pleura thickens to the extent that it causes breathlessness.



In [None]:
idx_list = train_data[train_data['class_id']==11][0:3].index.values
plot_example(idx_list)

## 12. Pneumothorax

[[Back to top]](#The-Dataset)



<img src="https://upload.wikimedia.org/wikipedia/commons/4/46/Blausen_0742_Pneumothorax.png" style="width:300px;"/>

A pneumothorax is a condition in which air leaks from the lungs and accumulates in the chest cavity. When air leaks and accumulates in the chest, it cannot expand outward like a balloon due to the ribs' presence. Instead, the lungs are pushed by the air and become smaller. In other words, a pneumothorax is a situation where air leaks from the lungs and the lungs become smaller (collapsed).

In a chest radiograph of a pneumothorax, the collapsed lung is whiter than normal, and the area where the lung is gone is uniformly black. Besides, the edges of the lung may appear linear.


In [None]:
idx_list = train_data[train_data['class_id']==12][1:4].index.values
plot_example(idx_list)

<a id="13.-Plumonary-fibrosis"></a>

## 13. Pulmonary fibrosis

[[Back to top]](#The-Dataset)



<img src="https://www.mayoclinic.org/-/media/kcms/gbs/patient-consumer/images/2016/08/10/14/57/mcdc7_pulmonaryfibrosis-8col.jpg" style="width:500px;"/>

Pulmonary fibrosis is a lung disease that occurs when lung tissue becomes damaged and scarred. This thickened, stiff tissue makes it more difficult for your lungs to work properly. As pulmonary fibrosis worsens, you become progressively more short of breath.

In [None]:
idx_list = train_data[train_data['class_id']==13][3:6].index.values
plot_example(idx_list)

## 14. No finding

[[Back to top]](#The-Dataset)


These are normal images, with no findings.

In [None]:
idx_list = train_data[train_data['class_id']==14][0:3].index.values
plot_example(idx_list)


----

# Getting DICOM Metadata

[[ go back to the top ]](#Table-of-contents)


The idea is to get the DICOM metadata, combine it with the training labels, and do some analysis. First, let's peek at the metadata of one train DICOM file and one test DICOM file.

In [None]:
filepath = "../input/vinbigdata-chest-xray-abnormalities-detection/train/50a418190bc3fb1ef1633bf9678929b3.dicom"
ds = dicom.filereader.dcmread(filepath)
print(ds)

In [None]:
filepath = "../input/vinbigdata-chest-xray-abnormalities-detection/test/00a2145de1886cb9eb88869c85d74080.dicom"
ds = dicom.filereader.dcmread(filepath)
print(ds)

In [None]:
# Adapted from https://www.kaggle.com/hiicao/dicom-metadata-to-pandasframe

# Get the DICOM metadata and save it to a CSV file
def extract_DICOM_metadata(folder_path):
    rows = []
    # Also dump the full header of every file to a text file for reference
    with open("dicom_metadata.txt", "w") as text_file:
        for image in os.listdir(folder_path):
            image_name = image.split(".")[0]
            dicom_file_path = os.path.join(folder_path, image)
            # dcmread replaces the deprecated read_file
            dicom_file_dataset = dicom.dcmread(dicom_file_path)
            text_file.write(str(dicom_file_dataset))

            # Missing tags are stored as empty strings
            rows.append({
                'image_id': image_name,
                'PatientSex': dicom_file_dataset.get('PatientSex', ""),
                'PatientAge': dicom_file_dataset.get('PatientAge', ""),
                'PhotometricInterpretation': dicom_file_dataset.get('PhotometricInterpretation', ""),
            })

    # Build the frame once; DataFrame.append was removed in pandas 2.0
    df = pd.DataFrame(rows, columns=['image_id', 'PatientSex', 'PatientAge', 'PhotometricInterpretation'])
    df.to_csv('./dicom_metadata.csv', index=False)
    return df




In [None]:
extract_DICOM_metadata('../input/vinbigdata-chest-xray-abnormalities-detection/train')

In [None]:
#loading DICOM meta data 
meta_data = pd.read_csv('./dicom_metadata.csv')


In [None]:
meta_data

In [None]:
train_df = pd.read_csv('../input/vinbigdata-chest-xray-abnormalities-detection/train.csv')

train_df

In [None]:
#combining it with TRAIN data
all_df = pd.merge(left=meta_data, right=train_df, on="image_id", how="inner")

In [None]:
all_df
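
An inner merge silently drops images that appear in only one of the two tables. As a sanity check, one option is to run the merge with pandas' `validate` and `indicator` arguments. The sketch below does this on two toy frames with made-up IDs; it assumes the metadata table has exactly one row per image.

```python
import pandas as pd

# Toy stand-ins for meta_data and train_df; all values are invented
meta = pd.DataFrame({"image_id": ["img_a", "img_b", "img_c"],
                     "PatientSex": ["M", "F", "O"]})
train = pd.DataFrame({"image_id": ["img_a", "img_a", "img_b"],
                      "class_id": [0, 3, 14]})

# validate raises if image_id is not unique on the metadata side;
# indicator adds a _merge column telling where each row came from
merged = pd.merge(meta, train, on="image_id", how="outer",
                  validate="one_to_many", indicator=True)

print(merged["_merge"].value_counts())
```

Rows marked `left_only` or `right_only` are the ones an inner merge would have discarded.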


---


# EDA

[[ go back to the top ]](#Table-of-contents)

Checking missing values.

In [None]:
print(f"Data dimension: {all_df.shape}")
for col in all_df.columns:
    print(f"Column: {col:35} | type: {str(all_df[col].dtype):7} \
| missing values: {all_df[col].isna().sum():3d}")

Peeking into the numerical features.

In [None]:
# define numerical features
numerical_features = [col for col in all_df.columns if np.issubdtype(all_df[col].dtype, np.number)]
print(numerical_features)

In [None]:
# print statistics about the different numerical columns
all_df[numerical_features].describe().T

In [None]:
# plot distributions of numerical features
# (histplot with kde=True replaces the deprecated distplot)
plt.figure(figsize=(10,18))
for index, col in enumerate(numerical_features):
    plt.subplot(5, 2, index+1)
    sns.histplot(all_df[col], kde=True)

Now, peeking into the categorical features.

In [None]:
# define categorical features, leaving out the image_id and PatientAge columns
# (the same two columns the positional deletes removed)
categorical_features = [col for col in all_df.columns
                        if pd.api.types.is_string_dtype(all_df[col])
                        and col not in ('image_id', 'PatientAge')]
print(categorical_features)

In [None]:
# plot distributions of categorical features
plt.figure(figsize=(25,35))
for index, col in enumerate(categorical_features):
    plt.subplot(6, 2, index+1)
    ax = sns.countplot(y=col, data=all_df)
    ax.set_xlabel("count", fontsize=20)
    ax.set_ylabel(col, fontsize=20)
    ax.tick_params(labelsize=20)


---


# Analysis For Data Cleaning

## Redundant Samples

[[Back to top]](#Table-of-contents)

Let's find out whether there are duplicate rows in our dataset.

Two types of row duplication are possible, although the chances are slim because a Kaggle dataset usually needs little preprocessing.
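
One possible reading of the two types is complete duplicates (every column identical) and partial duplicates (identical on a chosen subset of columns). That interpretation is an assumption, as are the toy rows and the column subset in the sketch below.

```python
import pandas as pd

# Toy annotation rows; all values are invented
toy = pd.DataFrame({
    "image_id": ["img_a", "img_a", "img_a", "img_b"],
    "class_id": [3, 3, 3, 14],
    "rad_id": ["R1", "R1", "R2", "R1"],
    "x_min": [10.0, 10.0, 10.0, None],
})

# Complete duplicates: every column matches another row
full_dupes = toy[toy.duplicated(keep=False)]

# Partial duplicates: same image, class, and box, regardless of radiologist
box_cols = ["image_id", "class_id", "x_min"]
partial_dupes = toy[toy.duplicated(subset=box_cols, keep=False)]

print(len(full_dupes), len(partial_dupes))
```

`keep=False` marks every member of a duplicate group, which makes the groups easier to inspect than the default of flagging only the repeats.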

In [None]:
# Complete duplicate rows
# dropna=False keeps rows that contain NaN values in the grouping
check_df_duplication = all_df.groupby(all_df.columns.tolist(), as_index=False, dropna=False).size()
check_df_duplication[check_df_duplication['size'] > 1]

## Consider Columns That Have Very Few Values

[[Back to top]](#Table-of-contents)

Let's find out which features have very few distinct values.


In [None]:
#replace NaN values with zeros 
all_df = all_df.replace(np.nan, 0)

In [None]:
all_df['PatientSex'] = all_df['PatientSex'].replace(0, 'U')

In [None]:
all_df['PatientAge'] = all_df['PatientAge'].replace(0, 'U')

In [None]:
all_df['PatientAge'].unique()
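
The DICOM standard defines the Age String type as three digits followed by a unit letter (`D`, `W`, `M`, or `Y`), for example `045Y`. If the values listed above follow that pattern, a small parser like the sketch below could turn them into numbers. The function name `parse_dicom_age` is hypothetical, the day counts per month and year are approximations, and anything that does not match the pattern (including the `U` placeholder used here) is returned as `None`.

```python
import re


def parse_dicom_age(value):
    """Return the age in years for a DICOM Age String such as '045Y', or None if it cannot be parsed."""
    match = re.fullmatch(r"(\d{1,3})([DWMY])", str(value).strip().upper())
    if match is None:
        return None
    number, unit = int(match.group(1)), match.group(2)
    # Approximate length of each unit in days
    days_per_unit = {"D": 1, "W": 7, "M": 30.4375, "Y": 365.25}
    return number * days_per_unit[unit] / 365.25


for raw in ["045Y", "006M", "U", ""]:
    print(repr(raw), parse_dicom_age(raw))
```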

In [None]:
all_df = all_df.fillna(0)

In [None]:
# column index, number of unique values, and unique values as a percentage of all rows
for i, col in enumerate(all_df.columns):
    num = all_df[col].nunique()
    percentage = float(num) / len(all_df) * 100
    print('%d, %d, %.1f%%' % (i, num, percentage))

## Further analysis coming soon