# <u>RANZCR CLiP: Visualize and Understand Dataset</u>

## Contents

1. [Introduction](#1)
1. [Background knowledge](#2)
1. [Data overview](#3)
1. [Visualize x-ray images](#4)
1. [Visualize x-ray images with annotation](#5)
1. [Visualize train label](#6)

<a id="1"></a> <br>
# <div class="alert alert-block alert-info">Introduction</div>

## Goal

The goal is to build a model that detects the presence and position of catheters and lines on chest x-rays and flags tubes that are poorly placed.

Deep learning algorithms may be able to automatically detect malpositioned catheters and lines. Once alerted, clinicians can reposition or remove them to avoid life-threatening complications.

In [None]:
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image, ImageDraw
import seaborn as sns
from IPython.display import YouTubeVideo

--------------

<a id="2"></a> <br>
# <div class="alert alert-block alert-info">Background knowledge</div>

This section introduces the technical terms needed to understand the dataset, with a reference for each.

## Endotracheal Tube

An endotracheal tube is a flexible plastic tube that is placed through the mouth into the trachea (windpipe) to help a patient breathe. The endotracheal tube is then connected to a ventilator, which delivers oxygen to the lungs. The process of inserting the tube is called endotracheal intubation.

For more information, see [How an Endotracheal Tube Is Used](https://www.verywellhealth.com/endotracheal-tube-information-2249093).

In [None]:
YouTubeVideo('FtJr7i7ENMY')

## Nasogastric Tube

A nasogastric tube (NGT) is a plastic tube passed through the nose, past the throat, and down into the stomach. The procedure of placing it is called nasogastric intubation.

For more information, see [Wikipedia - Nasogastric intubation](https://en.wikipedia.org/wiki/Nasogastric_intubation).

In [None]:
YouTubeVideo('Abf3Gd6AaZQ')

## Central venous catheter

A central venous catheter (CVC), also known as a central line, is a catheter placed into a large vein. It is a form of venous access. Critically ill patients, and those requiring prolonged intravenous therapy, often need larger catheters placed in more centrally located veins for more reliable vascular access.

For more information, see [Wikipedia - Central venous catheter](https://en.wikipedia.org/wiki/Central_venous_catheter).

In [None]:
YouTubeVideo('mTBrCMn86cU')

## Swan Ganz Catheter Present

A Swan-Ganz catheter is used in pulmonary artery catheterization (PAC), also called right heart catheterization, in which a catheter is inserted into a pulmonary artery. Its purpose is diagnostic: it is used to detect heart failure or sepsis, monitor therapy, and evaluate the effects of drugs.

For more information, see [Wikipedia - Pulmonary artery catheter](https://en.wikipedia.org/wiki/Pulmonary_artery_catheter).

In [None]:
YouTubeVideo('YkN30T6ig30')

--------------

<a id="3"></a> <br>
# <div class="alert alert-block alert-success">Data overview</div>

Let's look at the data provided. There are roughly seven kinds of data.

In [None]:
!ls ../input/ranzcr-clip-catheter-line-classification/

## train.csv

train.csv contains image IDs, binary labels, and patient IDs.

In [None]:
train = pd.read_csv("../input/ranzcr-clip-catheter-line-classification/train.csv")
train.head()

In [None]:
train.info()

## train_annotations.csv

Segmentation annotations for training samples that have them.

In [None]:
train_anno = pd.read_csv("../input/ranzcr-clip-catheter-line-classification/train_annotations.csv")
train_anno.head()

In [None]:
train_anno.info()

These are all the values that appear in the label column.

In [None]:
set(train_anno.label)
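
Beyond listing the distinct labels, it can help to see how often each one occurs and how many annotation rows an image has. The following is a minimal sketch on a toy dataframe; `toy_anno` and its rows are hypothetical stand-ins for `train_anno`, not real data.

```python
import pandas as pd

# Toy stand-in for train_annotations.csv (hypothetical rows, not real data).
toy_anno = pd.DataFrame({
    "StudyInstanceUID": ["a", "a", "b", "c"],
    "label": ["CVC - Normal", "ETT - Normal", "CVC - Normal", "NGT - Abnormal"],
})

# Number of annotated lines per label, most frequent first.
label_counts = toy_anno["label"].value_counts()

# Number of annotation rows per image.
rows_per_image = toy_anno.groupby("StudyInstanceUID").size()

print(label_counts)
print(rows_per_image)
```

The same two calls should work on `train_anno` directly, assuming it has the `StudyInstanceUID` and `label` columns shown above.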

## sample_submission.csv

sample_submission.csv is a sample submission file.

In [None]:
sub = pd.read_csv("../input/ranzcr-clip-catheter-line-classification/sample_submission.csv")
sub.head()

## train, test

These directories contain the chest x-ray images. How to visualize them is explained [here](#4).

## train_tfrecords, test_tfrecords

Serialized train and test data.

In [None]:
! ls ../input/ranzcr-clip-catheter-line-classification/train_tfrecords/

In [None]:
! ls ../input/ranzcr-clip-catheter-line-classification/test_tfrecords/

<a id="4"></a> <br>
# <div class="alert alert-block alert-success">Visualize x-ray images</div>

We can display the images with matplotlib.

In [None]:
train_idx = 5010
uid = train[["StudyInstanceUID"]].iat[train_idx,0]
train_img = Image.open(f"../input/ranzcr-clip-catheter-line-classification/train/{uid}.jpg")

In [None]:
plt.figure(figsize=(15, 15))
plt.imshow(train_img)
plt.title(f"train_{uid}")

This is a chest x-ray. The left and right sides are marked, and several tubes are visible in the image.

However, we can't tell from the image alone which tubes and catheters are the ones to detect, so in the next section we will overlay the annotation data to confirm.
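
Before plotting, it can be useful to check an image's mode and size. The sketch below uses a synthetic grayscale image; `toy_img` and its dimensions are hypothetical, and whether the competition JPEGs load as single-channel images is an assumption to verify on the real files.

```python
import numpy as np
from PIL import Image

# Synthetic 8-bit grayscale image standing in for a chest x-ray (hypothetical size).
toy_img = Image.new("L", (640, 480), color=128)

width, height = toy_img.size   # Pillow reports (width, height)
pixels = np.asarray(toy_img)   # NumPy reports (rows, cols), i.e. (height, width)

print(toy_img.mode, width, height, pixels.shape, pixels.dtype)
```

If an image does load in mode `"L"`, `plt.imshow` applies its default colormap to the single channel, so passing `cmap="gray"` gives the familiar x-ray look.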

<a id="5"></a> <br>
# <div class="alert alert-block alert-success">Visualize x-ray images with annotation</div>

We'll extract the annotation data from train_annotations.csv and overlay it on the images.

## Normal sample

First, we'll check the images labeled "XXX - Normal".

The image whose StudyInstanceUID is '1.2.826.0.1.3680043.8.498.11012190756062412030973253000324820445' makes a good example, so I'll extract it.

In [None]:
uid = train_anno["StudyInstanceUID"].iloc[5010]
uid

In [None]:
anno_data = train_anno[train_anno["StudyInstanceUID"] == uid]
anno_data = anno_data.reset_index(drop=True)
anno_data

In this image, tubes and catheters labeled "NGT - Normal", "ETT - Normal", and "CVC - Normal" are present.

We'll create a utility function to add easy-to-read annotations.

In [None]:
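def draw_annotaion(structures, im):
    """
    Draw one annotation on an image.

    structures: flat list of coordinates [x0, y0, x1, y1, ...].
    im: Pillow image instance (modified in place and returned).
    """
    # Pair up the flat coordinates into (x, y) points.
    points = [tuple(structures[i:i + 2]) for i in range(0, len(structures) - 1, 2)]
    draw = ImageDraw.Draw(im)
    # Draw the whole polyline once instead of redrawing it for every new point.
    if len(points) >= 2:
        draw.line(points, width=50, fill='red')
    return im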
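
To check that this kind of overlay behaves as expected without needing the dataset, here is a minimal sketch on a blank canvas. `draw_polyline` is a hypothetical variant of the helper above, not the notebook's function. It converts the image to RGB first, on the assumption that some inputs may be single-channel, in which case a named color such as `'red'` would be drawn as a gray level.

```python
from PIL import Image, ImageDraw


def draw_polyline(flat_coords, im, width=5, color="red"):
    """Draw a polyline given a flat [x0, y0, x1, y1, ...] list; im is left untouched."""
    out = im.convert("RGB")  # convert() returns a new image
    # Pair up coordinates; a trailing unpaired value is ignored.
    points = [tuple(flat_coords[i:i + 2]) for i in range(0, len(flat_coords) - 1, 2)]
    if len(points) >= 2:
        ImageDraw.Draw(out).line(points, width=width, fill=color)
    return out


# Black grayscale canvas with one horizontal line across the middle.
canvas = Image.new("L", (100, 100), color=0)
annotated = draw_polyline([10, 50, 90, 50], canvas)

print(annotated.mode, annotated.getpixel((50, 50)), annotated.getpixel((50, 10)))
```

Reading back a pixel on the line and one away from it is a quick way to confirm both the position and the color of the overlay.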

The annotation coordinates are stored in the dataframe as a string rather than a list, so we extract them with a regular expression.

The extracted annotations are then drawn over the image.

In [None]:
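# Raw string so that \d reaches the regex engine unchanged.
regex = re.compile(r'\d+')
# Load the image that belongs to the annotated StudyInstanceUID.
anno_img = Image.open(f"../input/ranzcr-clip-catheter-line-classification/train/{uid}.jpg")
images = []
labels = anno_data["label"].values
for i in range(len(anno_data)):
    match = regex.findall(anno_data["data"].at[i])
    match = [int(item) for item in match]
    images.append(draw_annotaion(match, anno_img.copy()))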
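
As an alternative to the regular expression, the string can be parsed as a Python literal. This is a minimal sketch that assumes the `data` column holds text of the form `[[x, y], [x, y], ...]`; the `raw` value below is a made-up example, not a row from the dataset.

```python
import ast
import re

# Hypothetical annotation string in the "[[x, y], [x, y], ...]" form.
raw = "[[1487, 1279], [1477, 1168], [1472, 1052]]"

# literal_eval safely parses the text into nested lists; keep the (x, y) pairing.
points = [tuple(pt) for pt in ast.literal_eval(raw)]

# The regex approach yields the same numbers as one flat list.
flat = [int(n) for n in re.findall(r"\d+", raw)]

print(points)
print(flat)
```

Keeping the points paired avoids the manual stepping by two when drawing, while the regex version is more tolerant of unexpected formatting.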

Let's visualize them.

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(15, 15))
# Fill the grid in order and hide any axes left without an image.
for idx, ax in enumerate(axs.flat):
    if idx < len(images):
        ax.imshow(images[idx])
        ax.set_title(f"{labels[idx]}")
    else:
        ax.axis("off")

## Swan Ganz Catheter Present

We'll also check a Swan-Ganz catheter.

In [None]:
uid = train_anno["StudyInstanceUID"].iloc[170]
anno_data = train_anno[train_anno["StudyInstanceUID"] == uid]
anno_data = anno_data.reset_index(drop=True)
anno_data

In [None]:
img = Image.open(f"../input/ranzcr-clip-catheter-line-classification/train/{uid}.jpg")

match = regex.findall(anno_data["data"].at[0])
match = [int(item) for item in match]
    
img_annotated = draw_annotaion(match, img.copy())
plt.figure(figsize=(8, 8))
plt.imshow(img_annotated)
plt.title("Swan Ganz Catheter Present")

## Abnormal sample

We will also look at the abnormal data.

### CVC

In [None]:
uid = train_anno["StudyInstanceUID"].iloc[25]
anno_data = train_anno[train_anno["StudyInstanceUID"] == uid]
anno_data = anno_data.reset_index(drop=True)
anno_data

In [None]:
img = Image.open(f"../input/ranzcr-clip-catheter-line-classification/train/{uid}.jpg")

match = regex.findall(anno_data["data"].at[0])
match = [int(item) for item in match]
    
img_annotated = draw_annotaion(match, img.copy())
plt.figure(figsize=(8, 8))
plt.imshow(img_annotated)
plt.title("CVC - Abnormal")
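
The row index used above is hard-coded, which only works while the CSV keeps the same row order. A minimal sketch of looking up an example by its label instead follows. `first_with_label` and `toy_anno` are hypothetical names, and the rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for train_annotations.csv (hypothetical rows, not real data).
toy_anno = pd.DataFrame({
    "StudyInstanceUID": ["a", "a", "b", "c"],
    "label": ["CVC - Normal", "ETT - Normal", "CVC - Abnormal", "NGT - Abnormal"],
    "data": ["[[1, 2], [3, 4]]"] * 4,
})


def first_with_label(anno, label):
    """Return the first annotation row carrying the given label."""
    matches = anno.loc[anno["label"] == label]
    if matches.empty:
        raise ValueError(f"no annotation with label {label!r}")
    return matches.iloc[0]


row = first_with_label(toy_anno, "CVC - Abnormal")
print(row["StudyInstanceUID"], row["data"])
```

The returned row carries both the `StudyInstanceUID` needed to open the image and the `data` string to draw, so the position of the row within the image's annotations no longer matters.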

### ETT

In [None]:
uid = train_anno["StudyInstanceUID"].iloc[1562]
anno_data = train_anno[train_anno["StudyInstanceUID"] == uid]
anno_data = anno_data.reset_index(drop=True)
anno_data

In [None]:
img = Image.open(f"../input/ranzcr-clip-catheter-line-classification/train/{uid}.jpg")

match = regex.findall(anno_data["data"].at[0])
match = [int(item) for item in match]
    
img_annotated = draw_annotaion(match, img.copy())
plt.figure(figsize=(8, 8))
plt.imshow(img_annotated)
plt.title("ETT - Abnormal")

### NGT

In [None]:
uid = train_anno["StudyInstanceUID"].iloc[1287]
anno_data = train_anno[train_anno["StudyInstanceUID"] == uid]
anno_data = anno_data.reset_index(drop=True)
anno_data

In [None]:
img = Image.open(f"../input/ranzcr-clip-catheter-line-classification/train/{uid}.jpg")

match = regex.findall(anno_data["data"].at[0])
match = [int(item) for item in match]
    
img_annotated = draw_annotaion(match, img.copy())
plt.figure(figsize=(8, 8))
plt.imshow(img_annotated)
plt.title("NGT - Abnormal")

## Borderline sample

We will also look at the borderline data.

### ETT

In [None]:
uid = train_anno["StudyInstanceUID"].iloc[129]
anno_data = train_anno[train_anno["StudyInstanceUID"] == uid]
anno_data = anno_data.reset_index(drop=True)
anno_data

In [None]:
img = Image.open(f"../input/ranzcr-clip-catheter-line-classification/train/{uid}.jpg")

match = regex.findall(anno_data["data"].at[0])
match = [int(item) for item in match]
    
img_annotated = draw_annotaion(match, img.copy())
plt.figure(figsize=(8, 8))
plt.imshow(img_annotated)
plt.title("ETT - Borderline")

### NGT

In [None]:
uid = train_anno["StudyInstanceUID"].iloc[515]
anno_data = train_anno[train_anno["StudyInstanceUID"] == uid]
anno_data = anno_data.reset_index(drop=True)
anno_data

In [None]:
img = Image.open(f"../input/ranzcr-clip-catheter-line-classification/train/{uid}.jpg")

match = regex.findall(anno_data["data"].at[2])
match = [int(item) for item in match]
    
img_annotated = draw_annotaion(match, img.copy())
plt.figure(figsize=(8, 8))
plt.imshow(img_annotated)
plt.title("NGT - Borderline")

### CVC

In [None]:
uid = train_anno["StudyInstanceUID"].iloc[17940]
anno_data = train_anno[train_anno["StudyInstanceUID"] == uid]
anno_data = anno_data.reset_index(drop=True)
anno_data

In [None]:
img = Image.open(f"../input/ranzcr-clip-catheter-line-classification/train/{uid}.jpg")

match = regex.findall(anno_data["data"].at[1])
match = [int(item) for item in match]
    
img_annotated = draw_annotaion(match, img.copy())
plt.figure(figsize=(8, 8))
plt.imshow(img_annotated)
plt.title("CVC - Borderline")

## Incompletely Imaged sample

"Incompletely Imaged" probably means that the NG tube extends beyond the edge of the x-ray image.

In [None]:
uid = train_anno["StudyInstanceUID"].iloc[17867]
anno_data = train_anno[train_anno["StudyInstanceUID"] == uid]
anno_data = anno_data.reset_index(drop=True)
anno_data

In [None]:
img = Image.open(f"../input/ranzcr-clip-catheter-line-classification/train/{uid}.jpg")

match = regex.findall(anno_data["data"].at[1])
match = [int(item) for item in match]
    
img_annotated = draw_annotaion(match, img.copy())
plt.figure(figsize=(8, 8))
plt.imshow(img_annotated)
plt.title("NGT - Incompletely Imaged")

In [None]:
uid = train_anno["StudyInstanceUID"].iloc[457]
anno_data = train_anno[train_anno["StudyInstanceUID"] == uid]
anno_data = anno_data.reset_index(drop=True)
anno_data

In [None]:
img = Image.open(f"../input/ranzcr-clip-catheter-line-classification/train/{uid}.jpg")

match = regex.findall(anno_data["data"].at[2])
match = [int(item) for item in match]
    
img_annotated = draw_annotaion(match, img.copy())
plt.figure(figsize=(8, 8))
plt.imshow(img_annotated)
plt.title("NGT - Incompletely Imaged")

<a id="6"></a> <br>
# <div class="alert alert-block alert-success">Visualize train label</div>

train.csv contains binary labels. From them we can tell which tubes and catheters appear in each StudyInstanceUID image and what state they are in, so we will visualize them.

In [None]:
train.head()

To prepare, define variables and utility functions.

In [None]:
ett_cols = [col for col in train.columns if "ETT" in col]
ngt_cols = [col for col in train.columns if "NGT" in col]
cvc_cols = [col for col in train.columns if "CVC" in col]

In [None]:
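def df_transform(df):
    """Stack the columns of df into long form: one row per value, with its source column in "type"."""
    frames = []
    for col in df.columns:
        df_tmp = pd.DataFrame(df[col])
        df_tmp.columns = ["data"]
        df_tmp["type"] = col
        frames.append(df_tmp)
    # Concatenate once at the end instead of growing a dataframe inside the loop.
    return pd.concat(frames)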
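
The same wide-to-long reshaping can also be sketched with pandas' built-in `melt`. The dataframe `toy` below is a hypothetical example, and the result differs from `df_transform` only in that the index is reset.

```python
import pandas as pd

# Toy stand-in for a few label columns of train.csv (hypothetical values).
toy = pd.DataFrame({"ETT - Normal": [1, 0, 0], "ETT - Abnormal": [0, 0, 1]})

# One row per (column, value) pair: "type" holds the column name, "data" the value.
long_df = toy.melt(var_name="type", value_name="data")

# Count how often each value occurs per label column.
counts = long_df.groupby(["type", "data"]).size()

print(long_df)
print(counts)
```

A long-form frame like this is what `sns.countplot` needs in order to split the bars by `hue="type"`.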

## Breakdown of each treatment performed

ETT appears in about 10% of the training data. Where it is present, most cases are normal, a few are borderline, and abnormal cases are rare.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15,6), gridspec_kw=dict(wspace=0.3, hspace=0.6))
g1 = sns.countplot(data = df_transform(train[ett_cols]),x="data", ax=axes[0])
g1.set_title("Number of each status of ETT")

g2 = sns.countplot(data = df_transform(train[ett_cols]),x="data", hue="type", ax=axes[1])
g2.set_title("Number of each status of ETT - detail")
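
The plots count individual label cells, so a direct calculation can complement them. The sketch below computes the share of images with at least one ETT label and the number of positives per status. `toy_train` and its values are hypothetical, not taken from train.csv.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical values).
toy_train = pd.DataFrame({
    "ETT - Abnormal":   [0, 0, 0, 1],
    "ETT - Borderline": [0, 1, 0, 0],
    "ETT - Normal":     [0, 0, 0, 0],
    "CVC - Normal":     [1, 1, 0, 0],
})

ett_cols = [c for c in toy_train.columns if c.startswith("ETT")]

# True for images where any ETT status is set.
has_ett = toy_train[ett_cols].any(axis=1)
share_with_ett = has_ett.mean()

# Number of positive images per ETT status.
per_status = toy_train[ett_cols].sum()

print(share_with_ett)
print(per_status)
```

Swapping in the NGT or CVC columns gives the corresponding figures for the other devices.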

NGT also appears in about 10% of the training data. Borderline and abnormal cases are rare.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15,6), gridspec_kw=dict(wspace=0.3, hspace=0.6))
g1 = sns.countplot(data = df_transform(train[ngt_cols]),x="data", ax=axes[0])
g1.set_title("Number of each status of NGT")

g2 = sns.countplot(data = df_transform(train[ngt_cols]),x="data", hue="type", ax=axes[1])
g2.set_title("Number of each status of NGT - detail")

CVC appears in about 30% of the training data. Borderline and abnormal cases are less common than normal ones, but a non-negligible number of them occur.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15,6), gridspec_kw=dict(wspace=0.3, hspace=0.6))
g1 = sns.countplot(data = df_transform(train[cvc_cols]),x="data", ax=axes[0])
g1.set_title("Number of each status of CVC")

g2 = sns.countplot(data = df_transform(train[cvc_cols]),x="data", hue="type", ax=axes[1])
g2.set_title("Number of each status of CVC - detail")

There seem to be few procedures that use the Swan Ganz Catheter.

In [None]:
g = sns.countplot(data = train, x="Swan Ganz Catheter Present")
g.set_title("Number of images with and without a Swan Ganz catheter")

### Are multiple treatments performed at the same time?

Each record can have several tubes and catheters. A value of 1 means the corresponding tube or catheter is present, so summing the values across a row gives the number of them.

In [None]:
data_cols = [col for col in train.columns if col not in ['StudyInstanceUID', 'PatientID']]
# Each row is one image, so the row-wise sum is the number of labels set for that image.
g = sns.countplot(x=train[data_cols].sum(axis=1))
g.set_title("Number of conditions occurring simultaneously per image")

Most images show a single procedure, but fewer than 10000 images show two or more, and fewer than 3000 show four or more.
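
The counts read off the plot can also be computed directly. This is a minimal sketch on a toy dataframe; `toy_train` and its values are hypothetical, and only the column layout mirrors train.csv.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical values).
toy_train = pd.DataFrame({
    "StudyInstanceUID": ["a", "b", "c", "d"],
    "ETT - Normal": [1, 0, 0, 1],
    "NGT - Normal": [1, 0, 1, 0],
    "CVC - Normal": [1, 1, 0, 0],
    "PatientID": ["p1", "p1", "p2", "p3"],
})

label_cols = [c for c in toy_train.columns if c not in ("StudyInstanceUID", "PatientID")]

# Number of labels set for each image (one row per image).
labels_per_image = toy_train[label_cols].sum(axis=1)

# How many images have 1, 2, 3, ... labels set.
distribution = labels_per_image.value_counts().sort_index()

# Images with two or more labels set.
n_multi = int((labels_per_image >= 2).sum())

print(distribution)
print(n_multi)
```

Note that the sum counts positive labels, so two labels of the same device type on one image are counted separately.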