# VinBigData Chest X-ray Abnormalities Detection
Automatically localize and classify thoracic abnormalities from chest radiographs

### The aim of this notebook is to: 
1. Read DICOM x-ray images efficiently.
2. Explore and visualize images and annotation metadata.
3. Apply suitable Image Enhancement techniques to images.
4. Preprocess annotation metadata.
5. Perform Stratified K-Fold Sharding to create training and validation sets.
6. Encode images as JPEG and store them along with annotations as TFRecords.

## Install TF 2 Object Detection API
1. TF Model Garden
2. Protobuf
3. COCO API
4. Object Detection API 

In [None]:
!# Download models
!git clone --depth 1 https://github.com/tensorflow/models

!# Compile proto files 
! # sudo apt install -y protobuf-compiler # Already present
%cd models/research
!protoc object_detection/protos/*.proto --python_out=.
%cd ..
%cd ..

!# Install cocoapi
!pip install cython 
!git clone https://github.com/cocodataset/cocoapi.git
%cd cocoapi/PythonAPI
!make
%cd ..
%cd ..
!cp -r cocoapi/PythonAPI/pycocotools models/research/

!# Install object detection api
%cd models/research
!cp object_detection/packages/tf2/setup.py .
!python -m pip install .
%cd ..
%cd ..

## Import libraries
1. **NumPy:** Numerical computing
2. **Pandas:** Data manipulation 
3. **Open-CV:** Computer Vision
4. **Matplotlib:** Plotting
5. **Scikit-learn:** Machine Learning
6. **TensorFlow:** Deep Learning
7. **Miscellaneous**

In [None]:
!pip install tensorflow_io
!pip install ensemble-boxes

In [None]:
import os
import pandas as pd
import numpy as np
import cv2
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedKFold
from ensemble_boxes import weighted_boxes_fusion
from tqdm.notebook import tqdm

import pydicom
from pydicom.tag import Tag

import tensorflow as tf
import tensorflow_io as tfio

from object_detection.protos.string_int_label_map_pb2 import StringIntLabelMap, StringIntLabelMapItem
from object_detection.dataset_tools import tf_record_creation_util
from object_detection.utils import dataset_util
import contextlib2

from google.protobuf import text_format

## Read data
1. Chest X-Ray annotations by radiologists (metadata)
2. Sample Chest X-Ray (DICOM image)

In [None]:
# Reading dataset of annotations
path = "../input/vinbigdata-chest-xray-abnormalities-detection"
df = pd.read_csv(os.path.join(path, "train.csv"))

In [None]:
# Reading DICOM images
def read_dicom(path, max_dim):
    image_bytes = tf.io.read_file(path)
    image = tfio.image.decode_dicom_image(
        image_bytes, 
        dtype = tf.uint16
    )
    
    image = tf.squeeze(image, axis = 0)
    
    h, w, _ = image.shape
    
    image = tf.image.resize(
        image, 
        (max_dim, max_dim), 
        preserve_aspect_ratio = True
    )
    
    image = image - tf.reduce_min(image)
    image = image / tf.reduce_max(image)
    image = tf.cast(image * 255, tf.uint8)
    
    return image, h, w

## Visualize and preprocess data

1. Exploring distribution of radiologists

In [None]:
temp = df[["image_id", "rad_id"]].drop_duplicates().reset_index(drop = True)
temp = temp.groupby(["rad_id"]).agg(
    count = pd.NamedAgg("image_id", "count")
).reset_index()

In [None]:
%matplotlib inline

fig, ax = plt.subplots(1, 2, figsize = (15, 5))

sns.countplot(
    x = df["rad_id"], 
    palette = "tab10", 
    order = list(temp["rad_id"]), 
    ax = ax[0]
)
ax[0].set_title("Number of annotations by radiologists")

sns.barplot(
    x = "rad_id", 
    y = "count", 
    data = temp, 
    palette = "tab10", 
    ax = ax[1]
)
ax[1].set_title("Number of x-rays seen by radiologists")

plt.show()

Radiologists R9, R10 and R8 saw the most x-rays and made the most annotations.

2. Exploring distribution of thoracic abnormalities

In [None]:
temp = df[["image_id", "class_name"]].drop_duplicates().reset_index(drop = True)
temp = temp.groupby(["class_name"]).agg(
    count = pd.NamedAgg("image_id", "count")
).reset_index()

In [None]:
%matplotlib inline

sns.barplot(
    x = "class_name", 
    y = "count", 
    data = temp, 
    palette = "tab10"
)
plt.xticks(rotation = 90)
plt.show()

Most x-rays have no finding. Among the abnormalities, Aortic enlargement is the most common: at least one occurrence was found in about 3000 x-rays. Cardiomegaly, Pleural thickening and Pulmonary fibrosis follow.

3. Exploring x-rays

In [None]:
%matplotlib inline

max_dim = 500
demo_image = "6d5acf3f8a973a26844d617fffe72998.dicom"
image, h, w = read_dicom(os.path.join(path, "train", demo_image), max_dim)

plt.figure(figsize = (5, 5))
plt.imshow(tf.squeeze(image), 'gray')

Let's improve the contrast of this image using CLAHE (Contrast Limited Adaptive Histogram Equalization). This pre-processing step redistributes the lightness values of the image, making patterns more apparent.
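
To make "redistributes the lightness values" concrete, here is a minimal sketch of plain global histogram equalization in pure Python. It is not CLAHE itself: CLAHE applies the same idea separately on each tile of the image and clips the histogram before building the mapping. The function name `equalize` and the sample pixel values are made up for illustration.

```python
def equalize(pixels, levels=256):
    """Global histogram equalization of a flat list of integer pixel values."""
    # Histogram: how many pixels take each intensity level
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution: number of pixels at or below each level
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    # Smallest non-zero CDF value, so that the darkest pixel maps to 0
    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # constant image: nothing to stretch
        return list(pixels)

    # Map each pixel through the normalized CDF onto the full intensity range
    return [(cdf[p] - cdf_min) * (levels - 1) // (total - cdf_min) for p in pixels]


# Eight pixels squeezed into a narrow band of grey levels (illustrative values)
low_contrast = [100, 100, 101, 101, 102, 102, 103, 103]
print(equalize(low_contrast))  # [0, 0, 85, 85, 170, 170, 255, 255]
```

The four nearly identical grey levels end up spread over the whole 0 to 255 range, which is why faint structures become easier to see.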

In [None]:
def CLAHE(image):
    clahe = cv2.createCLAHE(
        clipLimit = 2., 
        tileGridSize = (10, 10)
    )
    
    image = clahe.apply(image.numpy()) 
    image = tf.expand_dims(image, axis = 2)
    
    return image

In [None]:
%matplotlib inline

fig = plt.figure(figsize = (8, 8))

axes = fig.add_subplot(1, 2, 1)
plt.imshow(tf.squeeze(image), cmap = "gray")
axes.set_title("Original")

axes = fig.add_subplot(1, 2, 2)
image = CLAHE(image)
plt.imshow(tf.squeeze(image), cmap = "gray")
axes.set_title("Post CLAHE")

Before visualizing the abnormalities on the x-rays, let's perform some preprocessing.

**IMPORTANT**

The API requires class ids to run from 1 to n and outputs 0 when no class is found. Since our labels start at 0, we increment class_id by 1 and build the label map from the new ids.

In [None]:
# Creating LabelMap
df["class_id"] = df["class_id"] + 1 # Incrementing by 1
LabelMap = df.loc[df["class_name"] != "No finding", ["class_name", "class_id"]] # Removing the examples with no finding
LabelMap = LabelMap.drop_duplicates().reset_index(drop = True)
LabelMap

In [None]:
# Using 14 unique colors to annotate the abnormalities.
LABEL_COLORS = [
    (230, 25, 75), (60, 180, 75), (255, 225, 25), (0, 130, 200), (245, 130, 48), (145, 30, 180), (70, 240, 240), 
    (240, 50, 230), (210, 245, 60), (250, 190, 212), (0, 128, 128), (220, 190, 255), (170, 110, 40), (255, 250, 200), 
]
LabelMap["colors"] = LABEL_COLORS

Let's also save the label mapping as .pbtxt (required). With that we can now visualize the abnormalities on the x-rays.
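
For reference, this is roughly what the label-map text looks like. The sketch below builds the same kind of text by hand, without protobuf, just to show the structure. `format_label_map` is a hypothetical helper and the two sample entries are illustrative; the notebook itself uses `StringIntLabelMap`.

```python
def format_label_map(items):
    """Render (class_id, class_name) pairs as label-map style text."""
    blocks = []
    for class_id, class_name in items:
        # Ids must start at 1 because 0 is what the API outputs for "no class"
        if class_id < 1:
            raise ValueError("class ids must start at 1")
        blocks.append(f'item {{\n  name: "{class_name}"\n  id: {class_id}\n}}\n')
    return "".join(blocks)


# Illustrative entries, not the full label map
print(format_label_map([(1, "Aortic enlargement"), (4, "Cardiomegaly")]))
```

Each abnormality becomes one `item` block holding its name and its (incremented) id.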

In [None]:
# Save mappings as .pbtxt
def save_mapping(LabelMap):
    msg = StringIntLabelMap()
    
    for _, row in LabelMap.iterrows():
        # Cast to a plain int so protobuf does not have to handle a NumPy integer
        msg.item.append(StringIntLabelMapItem(id = int(row["class_id"]), name = row["class_name"]))
    
    text = text_format.MessageToString(msg, as_utf8 = True)
    
    with open("LabelMap.pbtxt", "w") as f:
        f.write(text)
    
save_mapping(LabelMap)

In [None]:
# Remove examples with no findings (won't be used for training)
df = df.dropna().reset_index(drop = True)

# Change data types
df = df.astype({
    "x_min": int, 
    "y_min": int, 
    "x_max": int, 
    "y_max": int,
    "class_id": str
})

In [None]:
def plot_boxes(image, data, title):    
    img = cv2.cvtColor(image.numpy(), cv2.COLOR_GRAY2RGB)
    
    for i, row in data.iterrows():
    
        x1, y1 = row["x_min"], row["y_min"]
        x2, y2 = row["x_max"], row["y_max"]
    
        cv2.rectangle(
            img,
            pt1 = (x1, y1),
            pt2 = (x2, y2),
            color = row["colors"],
            thickness = 2
        )
    
        cv2.putText(
            img, 
            row["class_name"], 
            (x1, y1-5), 
            cv2.FONT_HERSHEY_SIMPLEX, 
            0.5, 
            row["colors"], 
            1
        )

    plt.figure(figsize = (8, 8))
    plt.imshow(img) 
    plt.title(title)

In [None]:
# Selecting a particular radiologist
demo_rad = "R9"

# Preprocessing metadata to suit needs
data = df.loc[
    (df["image_id"] == demo_image[:-6]) & (df["rad_id"] == demo_rad),
    ["class_name", "x_min", "y_min", "x_max", "y_max"]
].copy() # Copy so the rescaling below does not touch df

H, W, _ = image.shape
data[["x_min", "x_max"]] = (data[["x_min", "x_max"]]* W/w).astype(int)
data[["y_min", "y_max"]] = (data[["y_min", "y_max"]]* H/h).astype(int)

data = pd.merge(data, LabelMap)

# Plotting annotation by radiologist
plot_boxes(image, data, "Labels for " + demo_image + " by " + demo_rad)

Let's now explore annotations by other radiologists for this x-ray.

In [None]:
# Preprocessing metadata to suit needs
data = df.loc[
    (df["image_id"] == demo_image[:-6]),
    ["class_name", "x_min", "y_min", "x_max", "y_max"]
].copy() # Copy so the rescaling below does not touch df

H, W, _ = image.shape
data[["x_min", "x_max"]] = (data[["x_min", "x_max"]]* W/w).astype(int)
data[["y_min", "y_max"]] = (data[["y_min", "y_max"]]* H/h).astype(int)

data = pd.merge(data, LabelMap)

# Plotting annotation by all radiologists
plot_boxes(image, data, "Labels for " + demo_image + " by all radiologists")

That's cluttered. We need not train our model on multiple annotations of the same abnormality. Instead, we use a technique called Weighted Boxes Fusion (WBF), which merges overlapping boxes of the same class into a single box by averaging their coordinates. Applied to the whole training set later in this notebook, it cuts the annotation metadata from 36096 rows to 21836.
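
To show the idea behind the fusion, here is a simplified pure-Python sketch: boxes of the same class that overlap enough (IoU above a threshold) are grouped, and each group is replaced by the average of its boxes. This is only an illustration under the assumption that every box has the same score; it is not the `ensemble_boxes` implementation, and the helper names `iou`, `fuse` and `simple_fusion` are made up.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse(boxes):
    """Coordinate-wise average of the boxes in one cluster (equal scores assumed)."""
    return tuple(sum(box[k] for box in boxes) / len(boxes) for k in range(4))


def simple_fusion(boxes, labels, iou_thr=0.3):
    """Greedy clustering: a box joins the first same-label cluster it overlaps enough."""
    clusters = []  # each entry is [label, list of member boxes]
    for box, label in zip(boxes, labels):
        for cluster in clusters:
            if cluster[0] == label and iou(fuse(cluster[1]), box) > iou_thr:
                cluster[1].append(box)
                break
        else:
            # No matching cluster was found, so this box starts a new one
            clusters.append([label, [box]])
    return [(label, fuse(members)) for label, members in clusters]


# Two radiologists mark nearly the same region (label 1); a third box is unrelated
boxes = [(0.10, 0.10, 0.50, 0.50), (0.12, 0.10, 0.52, 0.50), (0.60, 0.60, 0.90, 0.90)]
labels = [1, 1, 4]
for label, box in simple_fusion(boxes, labels):
    print(label, box)
```

The two overlapping label-1 boxes collapse into one averaged box, while the unrelated box is left alone. The coordinates are normalized to the 0 to 1 range, which is also what `weighted_boxes_fusion` expects.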

In [None]:
# Preprocessing as needed for WBF
data = df.loc[
    (df["image_id"] == demo_image[:-6]),
    ["class_name", "x_min", "y_min", "x_max", "y_max"]
].copy() # Copy so the normalization below does not touch df

data[["x_min", "x_max"]] = data[["x_min", "x_max"]]/w
data[["y_min", "y_max"]] = data[["y_min", "y_max"]]/h

data = pd.merge(data, LabelMap)

boxes_list = data[["x_min", "y_min", "x_max", "y_max"]].values.tolist()
scores_list = [1]*len(boxes_list)
labels_list = list(data["class_id"])

# Applying WBF
boxes, _, labels = weighted_boxes_fusion(
    boxes_list = [boxes_list],
    scores_list = [scores_list],
    labels_list = [labels_list],
    weights = None, 
    iou_thr = 0.3, 
    skip_box_thr = 0.0001
)

In [None]:
# Postprocessing after applying WBF 
data = pd.DataFrame(boxes, columns = ["x_min", "y_min", "x_max", "y_max"])

H, W, _ = image.shape
data[["x_min", "x_max"]] = (data[["x_min", "x_max"]]* W).astype(int)
data[["y_min", "y_max"]] = (data[["y_min", "y_max"]]* H).astype(int)

data["class_id"] = labels.astype(int)

data = pd.merge(data, LabelMap)

# Plotting fused annotations
plot_boxes(image, data, "Labels for " + demo_image + " post WBF")

Awesome. We successfully eliminated multiple annotations for the same abnormality.

## TFRecord Creation

The TFRecord format is a simple format for storing a sequence of binary records. It is efficient in terms of storage and retrieval, and it is the input format the API expects. Before creating TFRecords, we must first apply WBF to the metadata, and WBF in turn needs normalized coordinates, so we need the dimensions of every x-ray. Reading each image just to extract its dimensions is time-consuming. With PyDICOM we can instead read only the x-ray metadata, from which the dimensions are quickly extracted.
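
As a small worked example of the normalization step, the sketch below divides pixel coordinates by the image width and height, and shows that the normalized box can be mapped back onto a resized copy of the same image. The helper names and the image sizes are made up for illustration.

```python
def normalize_box(box, width, height):
    """Convert pixel coordinates (x_min, y_min, x_max, y_max) to the 0 to 1 range."""
    x_min, y_min, x_max, y_max = box
    return (x_min / width, y_min / height, x_max / width, y_max / height)


def to_pixels(box, width, height):
    """Map a normalized box back to integer pixel coordinates of an image."""
    x_min, y_min, x_max, y_max = box
    return (round(x_min * width), round(y_min * height),
            round(x_max * width), round(y_max * height))


# Hypothetical x-ray of 2000 x 2500 pixels (width x height)
norm = normalize_box((500, 1000, 1000, 1500), width=2000, height=2500)
print(norm)                       # (0.25, 0.4, 0.5, 0.6)

# The same box on a 400 x 500 resized copy
print(to_pixels(norm, 400, 500))  # (100, 200, 200, 300)
```

Because normalized coordinates do not depend on resolution, the boxes stay valid after the images are resized.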

In [None]:
# Dropping rad_id as it is not required for training
df = df.drop(columns = ["rad_id"])

# Obtaining set of x-rays with at least one finding
xrays = set(df["image_id"]) # Only 4394 x-rays, not 15000. Roughly 30% of the x-rays remain.

In [None]:
dimensions = []
for xray in tqdm(xrays):
    ds = pydicom.dcmread(
        os.path.join(path, "train", xray + ".dicom"), 
        specific_tags = [
            Tag("0028", "0010"), # Tag for Rows (Height)
            Tag("0028", "0011")  # Tag for Columns (Width)
        ]
    )
    
    dimensions.append([xray, ds.Rows, ds.Columns])

In [None]:
dimensions = pd.DataFrame(dimensions, columns = ["image_id", "height", "width"])
df = pd.merge(dimensions, df)

In [None]:
# Normalize coordinates
df["x_min"], df["x_max"] = df["x_min"]/df["width"], df["x_max"]/df["width"]
df["y_min"], df["y_max"] = df["y_min"]/df["height"], df["y_max"]/df["height"]

In [None]:
# Before applying WBF we had 36096 rows
df_list = []
for xray in tqdm(xrays):
    data = df[df["image_id"] == xray]

    boxes_list = data[["x_min", "y_min", "x_max", "y_max"]].values.tolist()
    scores_list = [1]*len(boxes_list)
    labels_list = list(data["class_id"])

    # Applying WBF
    boxes, _, labels = weighted_boxes_fusion(
        boxes_list = [boxes_list],
        scores_list = [scores_list],
        labels_list = [labels_list],
        weights = None, 
        iou_thr = 0.3, 
        skip_box_thr = 0.0001
    )
    
    data = pd.DataFrame(boxes, columns = ["x_min", "y_min", "x_max", "y_max"]) 
    # Leaving the coordinates normalized since the API expects them to be so. 
    
    data["class_id"] = labels.astype(int)
    
    data["image_id"] = xray 
    
    df_list.append(data)

In [None]:
df = pd.concat(df_list) # After applying WBF we have 21836 rows
df = pd.merge(df, LabelMap)
df = df.drop(columns = ["colors"])

Since we have more than a few thousand examples, it is beneficial to shard the dataset into multiple files:
* Shards can be read in parallel, which improves input throughput.
* Shards can be shuffled cheaply, which gives better-mixed training batches.

Sharding is cool but you know what's cooler? Stratified K-Fold Sharding. We break down our dataset into multiple ("K") TFRecords (each is a shard) in such a way that: 
* The distribution of abnormalities remains roughly the same in each shard.
* Each x-ray is part of exactly one shard (to avoid information leakage). 

We can conveniently use these shards for training, validation and testing.
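
The two properties above can be sketched in pure Python. Each x-ray gets a stratify key built from its number of distinct classes and a coarse bucket of its box count, and x-rays with the same key are then spread evenly over the shards. The round-robin assignment below is only a deterministic stand-in for `StratifiedKFold`, and the helper names and sample data are made up.

```python
from collections import defaultdict


def stratify_key(class_ids, bucket=15):
    """Key made of the number of distinct classes and a coarse bucket of the box count."""
    return f"{len(set(class_ids))}_{len(class_ids) // bucket}"


def assign_shards(image_keys, num_shards):
    """Deal the x-rays of each stratify group round-robin across the shards."""
    by_group = defaultdict(list)
    for image_id, key in sorted(image_keys.items()):
        by_group[key].append(image_id)

    shard_of = {}
    position = 0
    for key in sorted(by_group):
        for image_id in by_group[key]:
            shard_of[image_id] = position % num_shards  # exactly one shard per x-ray
            position += 1
    return shard_of


# Hypothetical x-rays mapped to the class ids of their boxes
annotations = {"a": [1, 1, 4], "b": [1, 4, 4], "c": [2], "d": [3]}
keys = {image_id: stratify_key(ids) for image_id, ids in annotations.items()}
print(keys)                    # {'a': '2_0', 'b': '2_0', 'c': '1_0', 'd': '1_0'}
print(assign_shards(keys, 2))  # {'c': 0, 'd': 1, 'a': 0, 'b': 1}
```

Each shard receives one x-ray from every group, and no x-ray appears in more than one shard.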

In [None]:
# Stratified K-Fold Sharding

num_shards = 25

skf = StratifiedKFold(
    n_splits = num_shards, 
    shuffle = True, 
    random_state = 0
)

df_folds = df[['image_id']].copy()

df_folds.loc[:, 'bbox_count'] = 1
df_folds = df_folds.groupby('image_id').count()   # Number of bounding boxes in the image
df_folds.loc[:, 'object_count'] = df.groupby('image_id')['class_id'].nunique() # Number of classes in the image

# Preparing stratify groups
df_folds.loc[:, 'stratify_group'] = np.char.add(
    df_folds['object_count'].values.astype(str),
    df_folds['bbox_count'].apply(lambda x: f'_{x // 15}').values.astype(str)
)

# Determining which fold the x-ray will fall in
df_folds.loc[:, 'fold'] = 0
skf_split = skf.split(
    X = df_folds.index, 
    y = df_folds['stratify_group']
)

for fold_number, (train_index, val_index) in enumerate(skf_split):
    df_folds.loc[df_folds.iloc[val_index].index, 'fold'] = fold_number
    
df_folds.reset_index(inplace = True)

In [None]:
df = pd.merge(df, df_folds)

temp = df.groupby(["fold", "class_name"]).agg(
    count = pd.NamedAgg("class_name", "count")
).reset_index()

temp = temp.pivot_table(
    index = "class_name",
    columns = "fold",
    values = "count"
)

In [None]:
plt.figure(figsize = (20, 10))
sns.heatmap(
    temp,
    annot = True,
    cmap = "YlGnBu",
    fmt = "g"
)
plt.title("Heatmap of class distribution")

Notice how the color is similar along each row. This indicates that the class distribution is similar across all folds (shards).

With the folds assigned, we apply CLAHE to each x-ray and then write the TFRecords. We must remember to apply the same transformations to the x-rays we intend to make predictions for.

In [None]:
def create_tf_record(img_path, max_dim, img_df):
    
    filename = os.path.basename(img_path).encode()
    source_id = img_path.encode()
    
    # Preprocess image 
    img, _, _ = read_dicom(img_path, max_dim)
    height, width, _ = img.shape
    img = CLAHE(img)
    
    # Encode as JPEG (Lossy compression)
    img = tf.io.encode_jpeg(
        img, 
        quality = 100, 
        format = 'grayscale'
    )
    
    img_bytes = img.numpy()
    
    img_format = b'jpeg'

    xmin_list = list(img_df["x_min"])
    xmax_list = list(img_df["x_max"])
    ymin_list = list(img_df["y_min"])
    ymax_list = list(img_df["y_max"])
    
    class_name_list = list(img_df["class_name"])
    class_name_list = [c.encode() for c in class_name_list]
    
    class_id_list = list(img_df["class_id"])
    
    # Creating TFRecord
    tf_record = tf.train.Example(
        features = tf.train.Features(
            feature = {
                'image/height': dataset_util.int64_feature(height),
                'image/width': dataset_util.int64_feature(width),
                'image/filename': dataset_util.bytes_feature(filename),
                'image/source_id': dataset_util.bytes_feature(source_id),
                'image/encoded': dataset_util.bytes_feature(img_bytes),
                'image/format': dataset_util.bytes_feature(img_format),
                'image/object/bbox/xmin': dataset_util.float_list_feature(xmin_list),
                'image/object/bbox/xmax': dataset_util.float_list_feature(xmax_list),
                'image/object/bbox/ymin': dataset_util.float_list_feature(ymin_list),
                'image/object/bbox/ymax': dataset_util.float_list_feature(ymax_list),
                'image/object/class/text': dataset_util.bytes_list_feature(class_name_list),
                'image/object/class/label': dataset_util.int64_list_feature(class_id_list),
            }
        )
    )
    
    return tf_record

In [None]:
annot_path = "workspace/annotations" 
os.makedirs(annot_path, exist_ok = True) 

In [None]:
img_cnt = np.zeros(num_shards, dtype = int)

with contextlib2.ExitStack() as tf_record_close_stack:
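    # annot_path is used as a filename prefix: shards are written as
    # <annot_path>-00000-of-00025, <annot_path>-00001-of-00025, ...
    output_tfrecords = tf_record_creation_util.open_sharded_output_tfrecords(
        tf_record_close_stack, 
        annot_path, 
        num_shards
    )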
    
    for i in tqdm(range(num_shards)):
        df_shard = df[df["fold"] == i]
        xrays = set(df_shard["image_id"])
        
        for xray in xrays:
            df_image = df_shard[df_shard["image_id"] == xray]
            
            img_path = os.path.join(path, "train", xray + ".dicom")
            tf_record = create_tf_record(img_path, max_dim, df_image)
            output_tfrecords[i].write(tf_record.SerializeToString())
            
            img_cnt[i] += 1

print("Converted {} images".format(np.sum(img_cnt)))
print("Images per shard: {}".format(img_cnt))

In [None]:
# Save dataframe
df.to_csv("data.csv", index = False)

TFRecords created! We are now ready to use these for training, validation and testing. Jump to my [second notebook](https://www.kaggle.com/bhallaakshit/training-evaluation-with-tf2-object-detection-api). 
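
As a rough sketch of how the shards can be divided between training and validation, the snippet below lists the shard files and holds a few of them out. It assumes the shards follow the `<base_path>-XXXXX-of-XXXXX` naming used by `open_sharded_output_tfrecords`; the helper names and the choice of holding out the first 5 shards are arbitrary illustrations.

```python
def shard_paths(base_path, num_shards):
    """File names of the shards, assuming the <base>-00000-of-00025 naming pattern."""
    return [f"{base_path}-{idx:05d}-of-{num_shards:05d}" for idx in range(num_shards)]


def split_shards(paths, val_shards):
    """Split shard files into training and validation lists by shard index."""
    train = [p for i, p in enumerate(paths) if i not in val_shards]
    val = [p for i, p in enumerate(paths) if i in val_shards]
    return train, val


paths = shard_paths("workspace/annotations", 25)
train_files, val_files = split_shards(paths, val_shards={0, 1, 2, 3, 4})

print(paths[0])                           # workspace/annotations-00000-of-00025
print(len(train_files), len(val_files))   # 20 5
```

Since every x-ray lives in exactly one shard, holding out whole shards keeps the training and validation sets disjoint.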

## CREDITS
### I'm a novice TensorFlow developer. This notebook would not have been possible without the following:
1. https://www.kaggle.com/mistag/data-create-tfrecords-of-vinbigdata-chest-x-rays
2. https://www.kaggle.com/backtracking/smart-data-split-train-eval-for-object-detection/comments


Please consider upvoting these notebooks as well. :D