# VinBigData Chest X-ray Abnormalities Detection

## VinBigData Preprocessing

**Author: Théo LANGÉ** - s394369 - theo.lange.369@cranfield.ac.uk

This code has been inspired by the following notebook: 

> https://www.kaggle.com/code/bhallaakshit/dicom-to-jpeg-using-tf2

This notebook takes the data provided for the VinBigData competition as input and preprocesses it to create the dataset used to train models and to infer annotations on the test set.

It first converts all DICOM images to JPEG. Every image is resized with padding to the shape (512, 512). This notebook was originally run in an older Python environment. The warnings displayed during the conversion do not affect the JPEG images.

Then, the positions of the bounding boxes on the resized images are recalculated according to the padding and the original shape. For each image in the train set, a text file is created to store the label and position of each bounding box. In addition, a `test.csv` file is created from the original shapes of the test images.

The dataset is publicly available on Kaggle:
> https://www.kaggle.com/datasets/theolange/ai-vinbigdata

## Table of contents
0. [Imports](#0)
1. [Convert DICOM images to JPEG](#1)
2. [Update of the train.csv file](#2)
3. [Creation of the file test.csv](#3)
4. [Labels files](#4)

<a id="0"></a>
# 0. Imports


In [1]:
import os
import sys
import shutil
from glob import glob

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

import matplotlib.pyplot as plt
import matplotlib.image as img
import seaborn as sns

import tensorflow as tf
import tensorflow_io as tfio

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Define the root data directory
DATA_DIR = "/kaggle/input/vinbigdata-chest-xray-abnormalities-detection"

# Define the paths to the training and testing dicom folders respectively
TRAIN_DIR = os.path.join(DATA_DIR, "train")
TEST_DIR = os.path.join(DATA_DIR, "test")

# Define paths to the relevant csv files
TRAIN_CSV = os.path.join(DATA_DIR, "train.csv")

# Working directory
WORKING_DIR = "/kaggle/working"

<a id="1"></a>
# 1. Convert DICOM images to JPEG

Each DICOM image is converted into a JPEG image and resized with padding to the shape (512, 512). The shape of each original image is also recorded: it is used to update the positions of the bounding boxes and is saved for later use.
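
The geometry of a resize with padding can be sketched as follows. This is a minimal illustration rather than the notebook's code: `letterbox_geometry` is a hypothetical helper, and it works with fractional values, whereas TensorFlow operates on whole pixels, so the real offsets may differ slightly.

```python
def letterbox_geometry(height, width, target=512):
    """Return (scale, pad_x, pad_y) for an aspect-preserving resize
    followed by centered padding to a target x target square.

    Hypothetical helper for illustration only.
    """
    # The longest side is scaled to exactly `target` pixels
    scale = target / max(height, width)
    # The shortest side is centered, so the leftover space is split in two
    pad_x = (target - width * scale) / 2
    pad_y = (target - height * scale) / 2
    return scale, pad_x, pad_y


# Example with a portrait image of 2836 x 2336 pixels (height x width)
scale, pad_x, pad_y = letterbox_geometry(2836, 2336)
print(f"scale={scale:.4f}, pad_x={pad_x:.2f}, pad_y={pad_y:.2f}")
```

For a portrait image the padding is added on the left and right, for a landscape image on the top and bottom, and a square image needs no padding.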

In [3]:
"""
Read a DICOM image

Input: Path to the image and the list of shapes collected so far
Output: The image as a TensorFlow tensor and the updated list of shapes
"""


def read_dicom(path, shape):

    # Read the bytes from the DICOM image and decode them
    image_bytes = tf.io.read_file(path)
    image = tfio.image.decode_dicom_image(
        image_bytes, 
        dtype = tf.uint16
    )
    
    # Store the image ID (file name without ".dicom") with the original height and width
    shape.append([path.split("/")[-1][:-6], image.shape[1], image.shape[2]])
    
    # Drop the leading dimension
    image = tf.squeeze(image, axis = 0)
    
    # Resize the image with padding to keep the width/height ratio within the border
    image = tf.image.resize_with_pad(image, 512, 512)
    
    # Rescale the intensities to [0, 255] and cast them to 8-bit integers
    image = image - tf.reduce_min(image)
    image = image / tf.reduce_max(image)
    image = tf.cast(image * 255, tf.uint8)
    
    return image, shape
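
The intensity rescaling at the end of `read_dicom` is a min-max normalisation followed by a truncating cast. The sketch below reproduces the arithmetic in plain Python on a flat list of pixel values. `min_max_to_uint8` is a hypothetical name, and the guard for constant images is an added assumption that the notebook itself does not include.

```python
def min_max_to_uint8(pixels):
    """Rescale a flat list of intensities to integers in [0, 255].

    Hypothetical helper for illustration only.
    """
    lo, hi = min(pixels), max(pixels)
    span = hi - lo
    # Assumption: a constant image is mapped to zeros instead of dividing by zero
    if span == 0:
        return [0] * len(pixels)
    # int() truncates towards zero, as a float-to-integer cast does
    return [int((p - lo) / span * 255) for p in pixels]


print(min_max_to_uint8([100, 600, 1100]))  # -> [0, 127, 255]
```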

In [4]:
"""
Convert DICOM images to grayscale JPEG images

Input: The source directory of the DICOM images and the destination directory for the JPEG images
Output: A list with the original shape of every image
"""


def dicom_to_jpeg(source, destination):

    # Create the destination folder if it does not exist
    os.makedirs(destination, exist_ok = True)
    
    shape = []
    
    # Convert each image, store it and record its original shape
    for name in tqdm(sorted(os.listdir(source))):
        image, shape = read_dicom(os.path.join(source, name), shape)
        image = tf.io.encode_jpeg(
            image, 
            quality = 100, 
            format = 'grayscale'
        )

        name = name.replace(".dicom", ".jpeg")
        tf.io.write_file(os.path.join(destination, name), image)
    
    return shape

In [5]:
# Convert and store all the Images in the train set
train_shape = dicom_to_jpeg(TRAIN_DIR, "ai_vinbigdata/train")

  0%|          | 0/15000 [00:00<?, ?it/s]

W: invalid value for 'BitsAllocated' (16), > 8 for OB encoded uncompressed 'PixelData'

In [6]:
# Convert and Store all the Images in the test set
test_shape = dicom_to_jpeg(TEST_DIR, "ai_vinbigdata/test")

  0%|          | 0/3000 [00:00<?, ?it/s]

<a id="2"></a>
# 2. Update of the train.csv file

The original shapes of the images are added to the `train.csv` file. New columns are then added to give the positions of the bounding boxes on the resized JPEG images.

In [7]:
train = pd.read_csv(TRAIN_CSV)
train.head()

| | image_id | class_name | class_id | rad_id | x_min | y_min | x_max | y_max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 50a418190bc3fb1ef1633bf9678929b3 | No finding | 14 | R11 | NaN | NaN | NaN | NaN |
| 1 | 21a10246a5ec7af151081d0cd6d65dc9 | No finding | 14 | R7 | NaN | NaN | NaN | NaN |
| 2 | 9a5094b2563a1ef3ff50dc5c7ff71345 | Cardiomegaly | 3 | R10 | 691.0 | 1375.0 | 1653.0 | 1831.0 |
| 3 | 051132a778e61a86eb147c7c6f564dfe | Aortic enlargement | 0 | R10 | 1264.0 | 743.0 | 1611.0 | 1019.0 |
| 4 | 063319de25ce7edb9b1c6b8881290140 | No finding | 14 | R10 | NaN | NaN | NaN | NaN |


In [8]:
# Create a DataFrame using the shapes obtained while converting the Images
train_meta = pd.DataFrame(train_shape, columns=["image_id", "h", "w"])
train_meta.head()

| | image_id | h | w |
| --- | --- | --- | --- |
| 0 | 000434271f63a053c4128a0ba6352c7f | 2836 | 2336 |
| 1 | 00053190460d56c53cc3e57321387478 | 2430 | 1994 |
| 2 | 0005e8e3701dfb1dd93d53e2ff537b6e | 3072 | 3072 |
| 3 | 0006e0a85696f6bb578e84fafa9a5607 | 3000 | 3000 |
| 4 | 0007d316f756b3fa0baea2ff514ce945 | 2880 | 2304 |


In [9]:
"""
Add the shape of the original images to the Train DataFrame

Input: Train DataFrame and the DataFrame of original image shapes
Output: Updated Train DataFrame
"""

def update_train_meta(train, train_meta):

    # Index the original shapes by image ID (one row per image)
    shapes = train_meta.set_index("image_id")

    # Add the original shape of the images to the Train DataFrame
    train["img_original_height"] = train["image_id"].map(shapes["h"])
    train["img_original_width"] = train["image_id"].map(shapes["w"])

    return train

# Apply the update
train = update_train_meta(train, train_meta)

In [10]:
"""
This function computes the new coordinate of a box corner given the original shape of the image
Because of the padding, the boxes must be translated on the resized images

Input: The value to update, the original height and width of the image and the axis of the coordinate ('x' or 'y')
Output: The updated position of the box coordinate
"""

def update_box(val, height, width, coord):

    # In this case, the padding is added to the x coordinate
    if height > width: 
        ratio = 512/height
        if coord == 'x':
            val = np.round(val*ratio  + (512 - width*ratio)/2)
        else:
            val = np.round(val*ratio)
        
    # In this case, the padding is added to the y coordinate
    elif width > height: 
        ratio = 512/width
        if coord == 'y':
            val = np.round(val*ratio  + (512 - height*ratio)/2)
        else:
            val = np.round(val*ratio)
    
    # No padding in this case
    else:
        ratio = 512/height
        val = np.round(val*ratio)
    
    return val
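
As a sanity check, the same mapping can be written compactly and compared with the first annotated rows of `train.csv` (the Cardiomegaly and Aortic enlargement boxes). This is a sketch in plain Python: `to_resized` is a hypothetical name, and it should agree with `update_box` except possibly at exact rounding ties, because the padding term is computed for both axes.

```python
def to_resized(val, height, width, coord, target=512):
    """Map one coordinate from the original image to the padded square.

    Hypothetical compact restatement of the mapping, for illustration only.
    """
    # The longest side of the image is scaled to `target` pixels
    scale = target / max(height, width)
    # Padding along the requested axis (about zero along the longest side)
    side = width if coord == "x" else height
    pad = (target - side * scale) / 2
    return round(val * scale + pad)


# Cardiomegaly box on an image of 2336 x 2080 pixels (height x width)
box = [to_resized(691, 2336, 2080, "x"), to_resized(1375, 2336, 2080, "y"),
       to_resized(1653, 2336, 2080, "x"), to_resized(1831, 2336, 2080, "y")]
print(box)  # -> [180, 301, 390, 401]
```

Both Python's `round` and `np.round` round halves to the nearest even value, so the two versions treat ties the same way.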

In [11]:
"""
This function updates the positions of the bounding boxes after every image has been resized with padding

Input: Train DataFrame
Output: Train DataFrame with bounding box locations updated according to the original image shape and the padding
"""

def update_bboxes(train):

    # Apply the function update_box to each coordinate of the bounding boxes
    train['x_min_resized'] = train.apply(lambda row: update_box(row.x_min, row.img_original_height, row.img_original_width, 'x'), axis =1)
    train['y_min_resized'] = train.apply(lambda row: update_box(row.y_min, row.img_original_height, row.img_original_width, 'y'), axis =1)
    train['x_max_resized'] = train.apply(lambda row: update_box(row.x_max, row.img_original_height, row.img_original_width, 'x'), axis =1)
    train['y_max_resized'] = train.apply(lambda row: update_box(row.y_max, row.img_original_height, row.img_original_width, 'y'), axis =1)

    return train

# Update the bounding boxes
train = update_bboxes(train)
train.head()

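| | image_id | class_name | class_id | rad_id | x_min | y_min | x_max | y_max | img_original_height | img_original_width | x_min_resized | y_min_resized | x_max_resized | y_max_resized |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 50a418190bc3fb1ef1633bf9678929b3 | No finding | 14 | R11 | NaN | NaN | NaN | NaN | 2580 | 2332 | NaN | NaN | NaN | NaN |
| 1 | 21a10246a5ec7af151081d0cd6d65dc9 | No finding | 14 | R7 | NaN | NaN | NaN | NaN | 3159 | 2954 | NaN | NaN | NaN | NaN |
| 2 | 9a5094b2563a1ef3ff50dc5c7ff71345 | Cardiomegaly | 3 | R10 | 691.0 | 1375.0 | 1653.0 | 1831.0 | 2336 | 2080 | 180.0 | 301.0 | 390.0 | 401.0 |
| 3 | 051132a778e61a86eb147c7c6f564dfe | Aortic enlargement | 0 | R10 | 1264.0 | 743.0 | 1611.0 | 1019.0 | 2880 | 2304 | 276.0 | 132.0 | 338.0 | 181.0 |
| 4 | 063319de25ce7edb9b1c6b8881290140 | No finding | 14 | R10 | NaN | NaN | NaN | NaN | 3072 | 2540 | NaN | NaN | NaN | NaN |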


In [12]:
train.to_csv('ai_vinbigdata/train.csv', index=None)

<a id="3"></a>
# 3. Creation of the file test.csv

We also need to save the original size of the images in the test set. Indeed, after the localisation of the abnormalities on the images, the boxes will need to be resized according to the original size of the image.
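
The inverse of the mapping from section 2 can be sketched as below: remove the padding, then divide by the scale. `to_original` is a hypothetical helper, and the box used in the example is made up for illustration, not a model output.

```python
def to_original(val, height, width, coord, target=512):
    """Map one coordinate from the padded square back to the original image.

    Hypothetical helper for illustration only.
    """
    # Same scale and padding as in the forward mapping
    scale = target / max(height, width)
    side = width if coord == "x" else height
    pad = (target - side * scale) / 2
    # Remove the padding first, then undo the scaling
    return (val - pad) / scale


# First test image: 2584 x 2345 pixels (height x width)
height, width = 2584, 2345
# Made-up box in the 512 x 512 frame: x_min, y_min, x_max, y_max
x_min, y_min, x_max, y_max = 100.0, 150.0, 300.0, 400.0
box_original = [
    to_original(x_min, height, width, "x"),
    to_original(y_min, height, width, "y"),
    to_original(x_max, height, width, "x"),
    to_original(y_max, height, width, "y"),
]
print([round(v) for v in box_original])
```

Because the resized coordinates are rounded to whole pixels, a round trip through both mappings recovers the original position only to within a few pixels.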

In [13]:
test = pd.DataFrame(test_shape, columns=["image_id", "h", "w"])
test.head()

| | image_id | h | w |
| --- | --- | --- | --- |
| 0 | 002a34c58c5b758217ed1f584ccbcfe9 | 2584 | 2345 |
| 1 | 004f33259ee4aef671c2b95d54e4be68 | 3028 | 2517 |
| 2 | 008bdde2af2462e86fd373a445d0f4cd | 2880 | 2304 |
| 3 | 009bc039326338823ca3aa84381f17f1 | 2430 | 1994 |
| 4 | 00a2145de1886cb9eb88869c85d74080 | 2408 | 2136 |


In [14]:
test.to_csv("ai_vinbigdata/test.csv", index = None)

<a id="4"></a>
# 4. Labels files

In this section, a .txt file is created for each image in the train set. Each file contains one row per finding on the image, giving the ID of the finding and its relative location on the image. If an image is classified under the label *No finding*, the file is empty.
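
To make the row format concrete, the sketch below builds one label line from the resized Cardiomegaly box of section 2. `yolo_line` is a hypothetical helper, and the six-decimal formatting is an assumption chosen for readability: the notebook writes the values at full floating-point precision.

```python
def yolo_line(class_id, x_min, y_min, x_max, y_max, size=512):
    """Format one box as 'class x_center y_center width height',
    with every coordinate relative to the image size.

    Hypothetical helper for illustration only.
    """
    x_center = (x_min + x_max) / 2 / size
    y_center = (y_min + y_max) / 2 / size
    width = (x_max - x_min) / size
    height = (y_max - y_min) / size
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Cardiomegaly (class 3) box on the resized 512 x 512 image
print(yolo_line(3, 180, 301, 390, 401))
# -> 3 0.556641 0.685547 0.410156 0.195312
```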

In [15]:
df = train.copy()
df.head()

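| | image_id | class_name | class_id | rad_id | x_min | y_min | x_max | y_max | img_original_height | img_original_width | x_min_resized | y_min_resized | x_max_resized | y_max_resized |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 50a418190bc3fb1ef1633bf9678929b3 | No finding | 14 | R11 | NaN | NaN | NaN | NaN | 2580 | 2332 | NaN | NaN | NaN | NaN |
| 1 | 21a10246a5ec7af151081d0cd6d65dc9 | No finding | 14 | R7 | NaN | NaN | NaN | NaN | 3159 | 2954 | NaN | NaN | NaN | NaN |
| 2 | 9a5094b2563a1ef3ff50dc5c7ff71345 | Cardiomegaly | 3 | R10 | 691.0 | 1375.0 | 1653.0 | 1831.0 | 2336 | 2080 | 180.0 | 301.0 | 390.0 | 401.0 |
| 3 | 051132a778e61a86eb147c7c6f564dfe | Aortic enlargement | 0 | R10 | 1264.0 | 743.0 | 1611.0 | 1019.0 | 2880 | 2304 | 276.0 | 132.0 | 338.0 | 181.0 |
| 4 | 063319de25ce7edb9b1c6b8881290140 | No finding | 14 | R10 | NaN | NaN | NaN | NaN | 3072 | 2540 | NaN | NaN | NaN | NaN |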


In [16]:
"""
This function computes the bounding box positions in the format used by YOLO models
Each box is located by the coordinates of its center, its width and its height, all relative to the image size

Input: Train DataFrame
Output: Train DataFrame with Bounding Boxes position as xywh
"""

def boxes_xywh(df):

    # Get the position of the center of the box
    df['x_center'] = df.apply(lambda row: ((row.x_max_resized + row.x_min_resized)/2)/512, axis =1)
    df['y_center'] = df.apply(lambda row: ((row.y_max_resized + row.y_min_resized)/2)/512, axis =1)

    # Get the box dimensions
    df["height"] = df.apply(lambda row: (row.y_max_resized - row.y_min_resized)/512, axis = 1)
    df["width"] = df.apply(lambda row: (row.x_max_resized - row.x_min_resized)/512, axis = 1)

    return df

df = boxes_xywh(df)
df.head()

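| | image_id | class_name | class_id | rad_id | x_min | y_min | x_max | y_max | img_original_height | img_original_width | x_min_resized | y_min_resized | x_max_resized | y_max_resized | x_center | y_center | height | width |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 50a418190bc3fb1ef1633bf9678929b3 | No finding | 14 | R11 | NaN | NaN | NaN | NaN | 2580 | 2332 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 21a10246a5ec7af151081d0cd6d65dc9 | No finding | 14 | R7 | NaN | NaN | NaN | NaN | 3159 | 2954 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 9a5094b2563a1ef3ff50dc5c7ff71345 | Cardiomegaly | 3 | R10 | 691.0 | 1375.0 | 1653.0 | 1831.0 | 2336 | 2080 | 180.0 | 301.0 | 390.0 | 401.0 | 0.556641 | 0.685547 | 0.195312 | 0.410156 |
| 3 | 051132a778e61a86eb147c7c6f564dfe | Aortic enlargement | 0 | R10 | 1264.0 | 743.0 | 1611.0 | 1019.0 | 2880 | 2304 | 276.0 | 132.0 | 338.0 | 181.0 | 0.599609 | 0.305664 | 0.095703 | 0.121094 |
| 4 | 063319de25ce7edb9b1c6b8881290140 | No finding | 14 | R10 | NaN | NaN | NaN | NaN | 3072 | 2540 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |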


In [17]:
"""
For each image, a text file is created to store its annotations

Input: Train DataFrame
Output: None, .txt files are created to store the bounding boxes of every image
"""

def write_labels(df):

    os.makedirs('ai_vinbigdata/labels', exist_ok=True)

    for image_id in tqdm(df.image_id.unique()):
        with open(f'ai_vinbigdata/labels/{image_id}.txt', 'w+') as f:

            # Select the annotations of the current image
            row = df[df['image_id']==image_id]\
            [['class_id', 'x_center', 'y_center', 'width', 'height']].values
            row = row.astype('str')

            # class_id is converted to a float string (e.g. '3.0'): drop the trailing '.0'
            for i in range(len(row)):
                row[i][0] = row[i][0][:-2]
                
            # Write one line per finding, skipping class 14 ("No finding")
            for box in range(len(row)):
                if row[box][0] != '14': 
                    text = ' '.join(row[box])
                    f.write(text)
                    f.write('\n')

In [18]:
write_labels(df)

  0%|          | 0/15000 [00:00<?, ?it/s]

In [19]:
shutil.make_archive("ai_vinbigdata", "zip", "/kaggle/working/ai_vinbigdata")

'/kaggle/working/ai_vinbigdata.zip'