# <center> RANZCR Catheter and Line Position Challenge </center>
## <center> Exploratory Data Analysis </center>

In [None]:
import os
import ast

import numpy as np
import pandas as pd

import cv2

import re
regex = re.compile(r'\d+')  # raw string avoids the invalid-escape warning for \d

from matplotlib import pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

In [None]:
BASE_DIR = '../input/ranzcr-clip-catheter-line-classification'

df_train = pd.read_csv(os.path.join(BASE_DIR, 'train.csv'))
df_annotations = pd.read_csv(os.path.join(BASE_DIR, 'train_annotations.csv'))

df_annotations['image_path'] = [os.path.join(BASE_DIR, 'train', x + '.jpg') for x in df_annotations['StudyInstanceUID']]

## Number of Samples

- The number of samples (images) is ~30k, which is very small compared to the ImageNet sample size (14M). So, transfer learning is likely to be very beneficial in this competition.

In [None]:
df_train.shape[0]

## Number of Targets

In [None]:
targets = df_train.columns[1:-1]
print(list(targets))
print('\nNumber of Targets: %d' % len(targets))

The target list covers 4 types of tubes that are seen in the X-ray images. Since the goal of the competition is to predict whether these tubes are placed normally, abnormally, or borderline, it helps to know where these tubes are inserted in the body and where they are directed.

`ETT:` Endotracheal Tube - A flexible plastic tube that is **placed through the mouth into the trachea (windpipe)** to help a patient breathe. The endotracheal tube is then connected to a ventilator, which delivers oxygen to the lungs.

`NGT:` Nasogastric Tube - A flexible tube of rubber or plastic that is **passed through the nose, down through the esophagus, and into the stomach**. It can be used to either remove substances from or add them to the stomach. 

`CVC:` Central Venous Catheter - A thin, flexible tube that is **inserted into a vein, usually below the right collarbone (shoulders), and guided (threaded) into a large vein above the right side of the heart** called the superior vena cava.

`Swan Ganz Catheter`: A thin tube that is usually **inserted in the neck or groin and directed through the veins and into the right side of the heart**.

## Target Co-occurrence

- The number of positive targets per sample is not always 1, as seen below, so this is a **MULTI-LABEL** classification problem.
- 24 samples have no targets!

In [None]:
df_train.head()
df_train[targets].sum(1).value_counts(sort=True).plot.bar()
plt.xlabel('Co-occurrence')
plt.ylabel('Frequency')
plt.show()

In [None]:
df_train[targets].sum(1).value_counts(sort=True)
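
The counts above come from row-wise sums over the target columns. As a minimal sketch of the same idea on a toy frame (the column names and values are made up for illustration and only stand in for `df_train[targets]`), the samples with no positive target and the samples with several can be counted directly:

```python
import pandas as pd

# Toy stand-in for df_train[targets]; names and values are invented for illustration.
toy = pd.DataFrame({
    'ETT - Normal': [1, 0, 0, 1],
    'NGT - Normal': [1, 0, 1, 0],
    'CVC - Normal': [1, 0, 0, 0],
})

# Number of positive labels per sample (row-wise sum).
labels_per_sample = toy.sum(axis=1)

# Samples with no positive label at all.
n_unlabeled = int((labels_per_sample == 0).sum())

# Samples with more than one positive label, i.e. the multi-label case.
n_multi = int((labels_per_sample > 1).sum())

print(n_unlabeled, n_multi)
```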

## Target Mean

- There is a huge target imbalance.
- The most frequent target is CVC-Normal, which is observed in 71% of the data. 

In [None]:
df_train[targets].mean()
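
To make the imbalance easier to read, the per-target means can be sorted so the most and least frequent targets stand out. A small sketch on toy data (the target names `A`, `B`, `C` are placeholders, not real columns):

```python
import pandas as pd

# Toy binary targets; names and values are invented for illustration.
toy = pd.DataFrame({
    'A': [1, 1, 1, 0],
    'B': [0, 1, 0, 0],
    'C': [0, 0, 0, 0],
})

# Fraction of positive samples per target, most frequent first.
prevalence = toy.mean().sort_values(ascending=False)

# Name of the most frequent target.
most_common = prevalence.idxmax()

print(prevalence)
print(most_common)
```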

## Patient Frequency

- There are 3255 unique patients. One particular patient appears as many as 172 times in the data. It may be a good idea to do cross-validation grouped by patient, so that images of the same patient do not end up in both the training and validation folds.

In [None]:
df_train['PatientID'].value_counts()
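
One way to group the folds by patient is scikit-learn's `GroupKFold`. The sketch below is an assumption about how this could be done, not part of the original analysis; the toy frame and the `fold` column are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy stand-in for df_train; IDs are invented for illustration.
toy = pd.DataFrame({
    'StudyInstanceUID': [f'img_{i}' for i in range(10)],
    'PatientID': ['p1', 'p1', 'p1', 'p1', 'p2', 'p2', 'p3', 'p4', 'p4', 'p5'],
})

# Assign a fold number to every image so that all images of a patient share one fold.
toy['fold'] = -1
gkf = GroupKFold(n_splits=3)
for fold, (_, val_idx) in enumerate(gkf.split(toy, groups=toy['PatientID'])):
    toy.loc[toy.index[val_idx], 'fold'] = fold

# Sanity check: each patient should appear in exactly one fold.
folds_per_patient = toy.groupby('PatientID')['fold'].nunique()
print(folds_per_patient.max())
```

With a heavily skewed patient distribution like the one above, fold sizes can differ noticeably, so it is worth checking the number of images per fold as well.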

## Are the label conditions mutually exclusive?

- ETT can have **at most** one of the three conditions (Abnormal, Borderline, Normal) per sample; it can also have none.

In [None]:
df_train[targets[:3]].value_counts().rename('Counts').reset_index()

- NGT can have 2 conditions at the same time, but never 3. Only 25 samples have 2 conditions.

In [None]:
df_train[targets[4:7]].value_counts().rename('Counts').reset_index()

- CVC can have any combination of Abnormal, Borderline, and Normal.

In [None]:
df_train[targets[7:10]].value_counts().rename('Counts').reset_index()
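
The checks above select columns by position, which depends on the column order. A possible alternative is to select them by name. The helper `count_multi_condition` below is hypothetical, and it assumes that the target columns start with the tube abbreviation (for example `CVC`); the toy values are invented:

```python
import pandas as pd

def count_multi_condition(df, prefix):
    """Count rows with more than one positive label among columns starting with `prefix`."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    return int((df[cols].sum(axis=1) > 1).sum())

# Toy stand-in for df_train; column names are assumed, values are invented.
toy = pd.DataFrame({
    'CVC - Abnormal':   [1, 0, 1, 0],
    'CVC - Borderline': [1, 0, 0, 0],
    'CVC - Normal':     [0, 1, 1, 0],
    'ETT - Normal':     [1, 1, 0, 0],
})

n_cvc_multi = count_multi_condition(toy, 'CVC')
n_ett_multi = count_multi_condition(toy, 'ETT')
print(n_cvc_multi, n_ett_multi)
```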

## Annotations

- Annotation lengths are not the same for all images. Measured as the number of integer values in the `data` field, they vary between 4 and 152. The distribution of the annotation lengths is shown in the figure below.

In [None]:
# Annotation length = number of integer values found in the 'data' string
df_annotations['length'] = df_annotations['data'].apply(lambda s: len(regex.findall(s)))
print('Minimum annotation length:', df_annotations['length'].min())
print('Maximum annotation length:', df_annotations['length'].max())

In [None]:
ax = df_annotations['length'].hist(bins=70, grid=False)
ax.set_xlabel('Annotation Length')
ax.set_ylabel('Frequency')
plt.show()
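
Note that the length computed above counts every integer in the `data` string, so a point with an x and a y coordinate contributes 2. A minimal sketch of the difference, assuming the `data` field is a nested list of `[x, y]` pairs (the coordinates below are invented):

```python
import ast
import re

# Toy annotation string; the coordinates are invented for illustration.
data = '[[100, 200], [110, 240], [125, 300]]'

# Number of annotated points: parse the string and count the [x, y] pairs.
points = ast.literal_eval(data)
n_points = len(points)

# Number of integer values: what the regex-based length measures.
n_values = len(re.findall(r'\d+', data))

print(n_points, n_values)
```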

## Number of annotated tubes for a given image

- The majority of the images with tube annotations have 1 annotated tube. However, as many as 6 annotated tubes are observed in a single image!
- There are also images without any tube annotation data.

In [None]:
df_annotations['StudyInstanceUID'].value_counts().value_counts().plot.bar()
plt.show()
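
The images without annotation data can be found by comparing the image IDs in the two tables. A minimal sketch on toy frames (`toy_train` and `toy_annotations` are invented stand-ins for `df_train` and `df_annotations`):

```python
import pandas as pd

# Toy stand-ins for df_train and df_annotations; IDs and labels are invented.
toy_train = pd.DataFrame({'StudyInstanceUID': ['a', 'b', 'c', 'd']})
toy_annotations = pd.DataFrame({
    'StudyInstanceUID': ['a', 'a', 'c'],
    'label': ['CVC - Normal', 'NGT - Normal', 'ETT - Normal'],
})

# Number of annotated tubes per annotated image.
tubes_per_image = toy_annotations['StudyInstanceUID'].value_counts()

# Images in the training table that never appear in the annotation table.
has_annotation = toy_train['StudyInstanceUID'].isin(toy_annotations['StudyInstanceUID'])
n_missing = int((~has_annotation).sum())

print(tubes_per_image)
print(n_missing)
```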

In [None]:
def plot_row(row=5):
    # Colors are given as RGB because plt.imshow interprets the channels as RGB
    colors = [(228, 26, 28), (55, 126, 184), (152, 78, 163), (255, 255, 51), (166, 86, 40), (247, 129, 191), (153, 153, 153)]
    studyid = df_annotations.loc[row, 'StudyInstanceUID']
    # Load as a 3-channel image so that colored lines can be drawn on it
    img = cv2.imread(df_annotations.loc[row, 'image_path'], cv2.IMREAD_COLOR)

    # All annotations that belong to the same image
    data = df_annotations[df_annotations['StudyInstanceUID'] == studyid]
    print(data['label'])
    # Use a separate loop variable so the `row` argument is not shadowed
    for _, ann in data.iterrows():
        pts = np.array(ast.literal_eval(ann['data']), np.int32)
        tube = ann['label'][:3]
        if tube == 'NGT':
            color_index = 0
        elif tube == 'CVC':
            color_index = 1
        elif tube == 'ETT':
            color_index = 3
        else:
            color_index = 4
        # Draw the tube as line segments between consecutive points
        for start, end in zip(pts[:-1], pts[1:]):
            cv2.line(img, tuple(map(int, start)), tuple(map(int, end)), colors[color_index], 30)
    plt.imshow(img)
    plt.show()

### Some Example Images with Annotations

In [None]:
plot_row(35)

In [None]:
plot_row(42)

In [None]:
plot_row(310)

In [None]:
plot_row(325)