# Why this Competition?
This competition provides another great opportunity for computer vision to be applied in the real world, with the potential to have a meaningful impact on people's lives. It is also a great opportunity for us Data Science enthusiasts to understand the medical definitions of improper catheter position and to showcase our skills in a competitive setting. It also gives beginners (myself included) a unique opportunity to get their hands dirty and engage in constructive discussions and knowledge sharing on this platform.

# Problem Statement
Essentially, this competition requires us to automate the work of a physician/radiologist by correctly classifying the position of lines and catheters in an X-ray image. This in turn makes our algorithm a virtual doctor (hence the name of this notebook) and reduces the monotonous, error-prone yet critical workload of a physician/radiologist. Once a malpositioned line is flagged, clinicians can reposition it and avoid life-threatening complications.
The scoring metric for this competition is also a tricky one for multi-label classification: **AUC (Macro Averaged)**.
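
To make the metric concrete, here is a minimal sketch of what "macro averaged" AUC means: compute a binary AUC for each label column separately, then take the plain mean. The toy labels and scores are made up for illustration, and the helper names (`binary_auc`, `macro_auc`) are my own; in practice a library routine such as scikit-learn's `roc_auc_score(..., average="macro")` would normally be used instead.

```python
def binary_auc(y_true, y_score):
    """AUC for one label: probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both positive and negative samples")
    # Count winning (positive, negative) pairs; ties count as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(y_true_rows, y_score_rows):
    """Unweighted mean of the per-label AUCs (each row is one image, each column one label)."""
    n_labels = len(y_true_rows[0])
    per_label = [
        binary_auc([row[j] for row in y_true_rows], [row[j] for row in y_score_rows])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Toy example: 4 images, 2 labels (values are hypothetical)
toy_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_score = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.6]]
toy_macro_auc = macro_auc(toy_true, toy_score)
print(toy_macro_auc)
```

Because every label contributes equally to the mean, a rare label that is ranked poorly hurts the score just as much as a common one.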

## Why bother?
Medical staff are overwhelmed in the current Covid-19 world, which makes the environment at medical facilities very stressful. Performing a critical task such as inserting lifesaving lines and catheters in a patient in such an environment is both time-consuming and prone to human error. *"Serious complications can occur as a result of malpositioned lines and tubes in patients."* Properly following the protocols for imaging the positions and analyzing them requires tenacity and time, both of which are at a premium in the current scenario. This makes the objective very relevant and important for us.

## Data Description:-
* About 40,000 chest X-ray images provided with labels showing the catheter position and categorizing if it is poorly placed.

## Expected Outcome:-
* Detect the presence and position of catheters and lines on chest X-rays and categorize a tube that is poorly placed.

## Problem Category:-
From the data and objective it is evident that this is a **multi-label classification problem** in the **Computer Vision** domain.
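
As a rough illustration of what "multi-label" implies for a model's output, the sketch below treats every label as its own yes/no decision: each raw score goes through a sigmoid independently, so several labels can be switched on for the same image. The label names are a subset of the columns in `train.csv`; the raw scores, the 0.5 threshold and the helper name `predict_labels` are assumptions made purely for this example.

```python
import math

# A few of the label columns from train.csv (subset chosen for illustration)
LABELS = ["CVC - Abnormal", "CVC - Normal", "ETT - Normal"]


def sigmoid(x):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(raw_scores, threshold=0.5):
    """Return every label whose independent probability reaches the threshold."""
    probabilities = [sigmoid(score) for score in raw_scores]
    return [label for label, p in zip(LABELS, probabilities) if p >= threshold]


# Hypothetical raw scores for one image with two central lines
predicted = predict_labels([2.0, 1.5, -3.0])
print(predicted)
```

Unlike a softmax, the probabilities here do not have to sum to 1, which is what allows one X-ray to be both "CVC - Abnormal" and "CVC - Normal". The threshold only matters for this illustration: an AUC-based metric is computed on the scores themselves.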

# About this Notebook
* As I am a beginner myself, this notebook will focus solely on the basics: getting to know the data and building a primitive yet effective model.
    * This notebook will be updated as and when I learn interesting new things that I think will be useful for the audience. Please keep in mind that I too am on a learning voyage here.
* Our weapon of choice throughout this notebook will be Deep Learning.
* With this notebook we are starting our journey to train a virtual physician who can (hopefully) correctly detect the position of catheters and lines in a patient. So this notebook will follow a semester-by-semester approach to gradually train our own Deep-Learning Doctor.

***DISCLAIMER:- No offence to the doctors in the audience. By no means am I suggesting that we can actually get an MD with Computer Vision. It is meant as a pun and I hope you take it as some harmless humour. If I end up offending anyone, let me know and I will happily change the title.***

# Entrance: Imports
Just like any student, our journey begins by having some pre-requisites that will help us navigate through this course. So let's crack this with some basic library imports we require to get ready for school.

In [None]:
# For reproducible results
from numpy.random import seed
seed(1)
from tensorflow import random as random_tf
random_tf.set_seed(2)

# Aesthetics
import warnings
import sklearn.exceptions
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)

# General
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
import os
import sys
import time
import cv2
import ast
import imagehash
import glob
from tqdm import tqdm

# Visualisation
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid")
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from PIL import Image

# Deep Learning
import torch

In [None]:
data_path = '../input/ranzcr-clip-catheter-line-classification'

labels_file_path = os.path.join(data_path, 'train.csv')
segmentation_annotation_path = os.path.join(data_path, 'train_annotations.csv')
sample_submission_path = os.path.join(data_path, 'sample_submission.csv')
train_images_path = os.path.join(data_path, 'train')
test_images_path = os.path.join(data_path, 'test')

print(f'Label File path: {labels_file_path}')
print(f'Segmentation File path: {segmentation_annotation_path}')
print(f'Sample Submission File path: {sample_submission_path}')
print(f'Train Images path: {train_images_path}')
print(f'Test Images path: {test_images_path}')

# Utils

In [None]:
def plot_img_from_df(test_df, train_images_path, images_in_each_row = 3):
    image_list = test_df['StudyInstanceUID'].to_list()
    # Round up so a partially filled last row still gets a subplot slot
    rows = -(-len(image_list) // images_in_each_row)
    plt.figure(figsize=(20,20))
    for i, img in enumerate(image_list):
        full_path = os.path.join(train_images_path, img) + '.jpg'
        # OpenCV loads images as BGR; convert to RGB for matplotlib
        img = cv2.imread(full_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        plt.subplot(rows, images_in_each_row, i+1)
        plt.axis('off')
        plt.imshow(img)

In [None]:
# The function below is referred from https://www.kaggle.com/ihelon/catheter-position-exploratory-data-analysis
def plot_image_with_annotations(annot_df, train_images_path, row_ind):
    row = annot_df.iloc[row_ind]
    image_path = os.path.join(train_images_path, row["StudyInstanceUID"] + ".jpg")
    label = row["label"]
    data = np.array(ast.literal_eval(row["data"]))
    
    plt.figure(figsize=(20, 10))
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    plt.subplot(1, 2, 1)
    plt.axis('off')
    plt.imshow(image)
    plt.subplot(1, 2, 2)
    plt.axis('off')
    plt.imshow(image)
    plt.scatter(data[:, 0], data[:, 1])
    plt.suptitle(label, fontsize=15)

In [None]:
# The function below is referred from https://www.kaggle.com/c/ranzcr-clip-catheter-line-classification/discussion/203353#1117642
def visualize_annotations(annot_df, train_images_path, img_id):
    plt.figure(figsize=(8, 8))
    image = cv2.imread(os.path.join(train_images_path, img_id + ".jpg"))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    plt.imshow(image)
    
    df_patient = annot_df.loc[annot_df["StudyInstanceUID"] == img_id]
    
    if df_patient.shape[0]:        
        labels = df_patient["label"].values.tolist()
        lines = df_patient["data"].apply(ast.literal_eval).values.tolist()

        for line, label in zip(lines, labels):         
            line = np.asarray(line)
            plt.scatter(line[:, 0], line[:, 1], s=40, label=label)
        
        plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0, prop={'size': 20})
        
    plt.tick_params(axis="x", labelsize=15)
    plt.tick_params(axis="y", labelsize=15)
    plt.axis('off')
    plt.show()

# Semester 1: Analyzing the Inputs  
Let's start by seeing what kind of data is provided to us and familiarize ourselves with some basic terms as well as form some base understanding of the data.
We will start by answering the following questions:-
1. What are the various labels in the train data?
2. What is ETT, NGT and CVC?
3. What is Normal, Abnormal and Borderline for each of them? What are the traits to look for while identifying them?
4. What is the representation of each observation in our training data?

With these questions in mind, let's start cracking...

In [None]:
labels_df = pd.read_csv(labels_file_path)
labels_df.head()

In [None]:
annot_df = pd.read_csv(segmentation_annotation_path)
annot_df.head()
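
The helper functions above read the `data` column of the annotation file by passing it through `ast.literal_eval`, which turns the stored string into a list of `[x, y]` points along the line or tube. A minimal sketch of that parsing step, using a made-up string in the same shape (the coordinates are hypothetical, not taken from the dataset):

```python
import ast

# Hypothetical value of the `data` column for one annotated line
raw_points = "[[10, 20], [15, 40], [22, 65]]"

# literal_eval safely parses the string into nested Python lists
points = ast.literal_eval(raw_points)

# Split into x and y coordinates, as the plotting helpers do before plt.scatter
xs = [point[0] for point in points]
ys = [point[1] for point in points]
print(xs, ys)
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for strings read from a CSV file.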

In [None]:
labels_df.describe()

Let's explore this tabular data first before jumping to the image data. We will now examine whether a single image has both a Normal and an Abnormal label for each type of intubation/catheter...

In [None]:
print(labels_df[labels_df['ETT - Abnormal'] == 1]['ETT - Normal'].value_counts())
print(labels_df[labels_df['ETT - Normal'] == 1]['ETT - Abnormal'].value_counts())

In [None]:
print(labels_df[labels_df['NGT - Abnormal'] == 1]['NGT - Normal'].value_counts())
print(labels_df[labels_df['NGT - Normal'] == 1]['NGT - Abnormal'].value_counts())

Uh-Oh! We can see that there are some images classified as both having Normal as well as Abnormal NGT. Let's look at some such examples:-

In [None]:
test_df = labels_df[(labels_df['NGT - Abnormal'] == 1) & (labels_df['NGT - Normal'] == 1)]
plot_img_from_df(test_df, train_images_path)

In [None]:
print(labels_df[labels_df['CVC - Abnormal'] == 1]['CVC - Normal'].value_counts())
print(labels_df[labels_df['CVC - Normal'] == 1]['CVC - Abnormal'].value_counts())

In [None]:
test_df = labels_df[(labels_df['CVC - Abnormal'] == 1) & (labels_df['CVC - Normal'] == 1)][:9]
plot_img_from_df(test_df, train_images_path)

We can see that a lot of images are classified both as normal as well as abnormal in the same X-ray. The reason for that is, **One person at the same point in time can have multiple catheters which are visible in same image.** In a situation where 1 is normal and another is abnormal, both the columns will get a True value.  
Now let's see how many images in our data set have each of these procedures:-

In [None]:
cols_dict = {'ETT' : ['ETT - Abnormal', 'ETT - Borderline', 'ETT - Normal'],
             'NGT' : ['NGT - Abnormal', 'NGT - Borderline', 'NGT - Incompletely Imaged', 'NGT - Normal'],
             'CVC' : [ 'CVC - Abnormal', 'CVC - Borderline', 'CVC - Normal'],
             'SGC' : ['Swan Ganz Catheter Present']}
counts_dict = {}

In [None]:
for proc in cols_dict.keys():
    counts_dict[proc] = labels_df[labels_df[cols_dict[proc]].sum(axis=1) > 0].shape[0]

In [None]:
fig = go.Figure(
    data=[go.Bar(x = list(counts_dict.keys()), y = list(counts_dict.values()))],
    layout=go.Layout(
        title=go.layout.Title(text="Representation of Each procedure in Train Set")
    )
)

fig.show()

As we can see, almost all of our training images have some form of CVC inserted in them, and very few have an SGC present. Now let's examine each category one by one.

## ETT
ETT stands for EndoTracheal Tube, which is placed through the mouth into the trachea (windpipe) to help a patient who is having trouble breathing. More details can be found in the link [here](https://youtu.be/gwKwCARKYfw).
The procedure looks something like the one below:-
![](https://yourcprmd.com/wp-content/uploads/2018/08/e211.jpg)

Source:- https://www.yourcprmd.com/  
Let's look at some examples...

In [None]:
counts_dict['ETT']

In [None]:
temp_df = labels_df[labels_df[cols_dict['ETT']].sum(axis=1) > 0].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

If we look closely at the area near the throat, parallel to the spinal cord, we can see a faint tube-like structure. That is the ETT. Now let's look at some normal and abnormal cases of the same.

### 1. Normal Case:-

In [None]:
temp_df = labels_df[labels_df['ETT - Normal'] == 1].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

In [None]:
# Walk through the sampled images until one has an annotation for this label
for count, uid in enumerate(temp_df['StudyInstanceUID'].values):
    matches = annot_df.index[(annot_df['StudyInstanceUID'] == uid) & (annot_df['label'] == 'ETT - Normal')]
    if len(matches) > 0:
        plot_image_with_annotations(annot_df, train_images_path, row_ind = matches[0])
        break
else:
    print('No annotated images found for this condition')

In [None]:
visualize_annotations(annot_df, train_images_path,
                      img_id = temp_df['StudyInstanceUID'].values[count])

We can vaguely see the tube run cleanly through the throat into the trachea and end just above the carina. That is the recommended position for the end of the tube.

### 2. Abnormal Case:-

In [None]:
temp_df = labels_df[labels_df['ETT - Abnormal'] == 1].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

In [None]:
# Walk through the sampled images until one has an annotation for this label
for count, uid in enumerate(temp_df['StudyInstanceUID'].values):
    matches = annot_df.index[(annot_df['StudyInstanceUID'] == uid) & (annot_df['label'] == 'ETT - Abnormal')]
    if len(matches) > 0:
        plot_image_with_annotations(annot_df, train_images_path, row_ind = matches[0])
        break
else:
    print('No annotated images found for this condition')

In [None]:
visualize_annotations(annot_df, train_images_path,
                      img_id = temp_df['StudyInstanceUID'].values[count])

In some of the images we can see that the tube runs a little too far inside the trachea and has already overshot the carina. Maybe that is the reason for this being classified as abnormal.

### 3. Borderline Case:-

In [None]:
temp_df = labels_df[labels_df['ETT - Borderline'] == 1].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

In [None]:
# Walk through the sampled images until one has an annotation for this label
for count, uid in enumerate(temp_df['StudyInstanceUID'].values):
    matches = annot_df.index[(annot_df['StudyInstanceUID'] == uid) & (annot_df['label'] == 'ETT - Borderline')]
    if len(matches) > 0:
        plot_image_with_annotations(annot_df, train_images_path, row_ind = matches[0])
        break
else:
    print('No annotated images found for this condition')

In [None]:
visualize_annotations(annot_df, train_images_path,
                      img_id = temp_df['StudyInstanceUID'].values[count])

For the borderline cases it is not so apparent why they were labeled as such. The tube appears to be longer than in the normal cases but shorter than in the abnormal ones. It has overshot the carina but not by much.  
Let's have a quick look at the values of each category:-

In [None]:
ett_counts_dict = dict(labels_df[cols_dict['ETT']].sum(axis=0))

In [None]:
fig = go.Figure(
    data=[go.Bar(x = list(ett_counts_dict.keys()), y = list(ett_counts_dict.values()))],
    layout=go.Layout(
        title=go.layout.Title(text="Count of each ETT category")
    )
)

fig.show()

***Observations:-***
* Most of the ETT tubes are placed either normally or on the borderline.
* If we generate new images again and again by re-running the function, we see that the images can be rotated or skewed, or can contain extra anatomy such as the jaw/neck.
* Some images have reflections/shadows too.
* We need to look at the chest region to confidently know the condition, but in some images it is covered by other organs around that area (probably due to an awkward angle while imaging), so the image is not always clear enough to make a judgement.

## NGT

NGT stands for NasoGastric Tube which is passed through the nose, down through the esophagus, and into the stomach. It can be used to either remove substances from or add them to the stomach. An NG tube is only meant to be used on a temporary basis and is not for long-term use. More details can be found in the link [here](https://youtu.be/7dSEKQLMa18). The procedure looks something like in the image below:-  
![](https://spareyourtummy.files.wordpress.com/2013/05/nasogastric_tube1.jpg?w=584)  
Source:- https://spareyourtummy.wordpress.com/others/enteral-nutrition/nasogastric-tube/  
**So typically it should be longer than the ETT.**  
Let's look at some examples...

In [None]:
counts_dict['NGT']

In [None]:
temp_df = labels_df[labels_df[cols_dict['NGT']].sum(axis=1) > 0].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

Although it is not as apparent as the ETT, and not all images extend down to the stomach, we can still make out the faint presence of a tube running parallel to the spinal cord. It is very difficult to make out where it ends, but let's hope our Computer Vision model performs better than an untrained human.

### 1. Normal Case:-

In [None]:
temp_df = labels_df[labels_df['NGT - Normal'] == 1].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

In [None]:
# Walk through the sampled images until one has an annotation for this label
for count, uid in enumerate(temp_df['StudyInstanceUID'].values):
    matches = annot_df.index[(annot_df['StudyInstanceUID'] == uid) & (annot_df['label'] == 'NGT - Normal')]
    if len(matches) > 0:
        plot_image_with_annotations(annot_df, train_images_path, row_ind = matches[0])
        break
else:
    print('No annotated images found for this condition')

In [None]:
visualize_annotations(annot_df, train_images_path,
                      img_id = temp_df['StudyInstanceUID'].values[count])

So in normal cases the tube starts from the nose and goes all the way down to stomach. It also passes the 4 point inspection test.  

### 2. Abnormal Case:-

In [None]:
temp_df = labels_df[labels_df['NGT - Abnormal'] == 1].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

In [None]:
# Walk through the sampled images until one has an annotation for this label
for count, uid in enumerate(temp_df['StudyInstanceUID'].values):
    matches = annot_df.index[(annot_df['StudyInstanceUID'] == uid) & (annot_df['label'] == 'NGT - Abnormal')]
    if len(matches) > 0:
        plot_image_with_annotations(annot_df, train_images_path, row_ind = matches[0])
        break
else:
    print('No annotated images found for this condition')

In [None]:
visualize_annotations(annot_df, train_images_path,
                      img_id = temp_df['StudyInstanceUID'].values[count])

In abnormal NGT cases we can see that the tube ends before fully entering the stomach. Also, points 3 & 4 of the NGT inspection are not well defined in the image. That is the reason it is classified as abnormal. Here we can also see another NGT line which overshoots the image boundaries, which is why it is classified as Incompletely Imaged. So our previous hypothesis of multiple lines in the same patient at the same time is confirmed.

### 3. Borderline Case:-

In [None]:
temp_df = labels_df[labels_df['NGT - Borderline'] == 1].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

In [None]:
# Walk through the sampled images until one has an annotation for this label
for count, uid in enumerate(temp_df['StudyInstanceUID'].values):
    matches = annot_df.index[(annot_df['StudyInstanceUID'] == uid) & (annot_df['label'] == 'NGT - Borderline')]
    if len(matches) > 0:
        plot_image_with_annotations(annot_df, train_images_path, row_ind = matches[0])
        break
else:
    print('No annotated images found for this condition')

In [None]:
visualize_annotations(annot_df, train_images_path,
                      img_id = temp_df['StudyInstanceUID'].values[count])

The tube seems to be going inside the stomach but at a different angle and this angle might not be that conducive to coiling inside the stomach. That is why it might be classified as borderline.  

### 4. Incompletely Imaged:-

In [None]:
temp_df = labels_df[labels_df['NGT - Incompletely Imaged'] == 1].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

In [None]:
# Walk through the sampled images until one has an annotation for this label
for count, uid in enumerate(temp_df['StudyInstanceUID'].values):
    matches = annot_df.index[(annot_df['StudyInstanceUID'] == uid) & (annot_df['label'] == 'NGT - Incompletely Imaged')]
    if len(matches) > 0:
        plot_image_with_annotations(annot_df, train_images_path, row_ind = matches[0])
        break
else:
    print('No annotated images found for this condition')

In [None]:
visualize_annotations(annot_df, train_images_path,
                      img_id = temp_df['StudyInstanceUID'].values[count])

These images lack proper exposure of the stomach region, so points 3 & 4 of the checklist can't be validated; hence they are inconsequential.

In [None]:
ngt_counts_dict = dict(labels_df[cols_dict['NGT']].sum(axis=0))

In [None]:
fig = go.Figure(
    data=[go.Bar(x = list(ngt_counts_dict.keys()), y = list(ngt_counts_dict.values()))],
    layout=go.Layout(
        title=go.layout.Title(text="Count of each NGT category")
    )
)

fig.show()

***Observations:-***
* Most of the NGT tubes are placed either normally or on the borderline. There is also a huge proportion of images which are inconsequential in nature. Luckily, however, the abnormal cases are quite few.
* Similar to the ETT images, if we generate new images again and again by re-running the function, we see that the images can be either rotated or skewed.
* Some images have reflections/shadows too.
* There is also a substantial overlap between the type of images in the "Incompletely Imaged" and "Abnormal" classes.

## CVC
CVC stands for Central Venous Catheter, which is inserted into a vein, usually below the right collarbone, and guided (threaded) into a large vein above the right side of the heart called the superior vena cava. It is used to give intravenous fluids, blood transfusions, chemotherapy, and other drugs. More details can be found in the link [here](https://youtu.be/mTBrCMn86cU). The procedure looks something like this:-
![](https://dm3omg1n1n7zx.cloudfront.net/rcni/static/journals/ns/aop/ns.2020.e11559/graphic/ns.2020.e11559_0001.jpg) 
Source:- https://journals.rcni.com/  
So typically it should be a lot thinner than both ETT and NGT (which were tubes). It can be inserted through the jugular, subclavian or femoral vein. Commonly, imaging is only done when insertion is through the jugular or subclavian vein.  
Let's look at some examples...

In [None]:
counts_dict['CVC']

In [None]:
temp_df = labels_df[labels_df[cols_dict['CVC']].sum(axis=1) > 0].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

If we look closely, we can see a very thin but dense wire-like shape in each of the images. While some of those may be external devices like EKG leads, the ones entering the vena cava are most likely CVCs. That is what we are going to study and classify here. But as you can guess, this one is going to be tricky because of its thin nature and resemblance to many other lines.

### 1. Normal Case:-

In [None]:
temp_df = labels_df[labels_df['CVC - Normal'] == 1].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

In [None]:
# Walk through the sampled images until one has an annotation for this label
for count, uid in enumerate(temp_df['StudyInstanceUID'].values):
    matches = annot_df.index[(annot_df['StudyInstanceUID'] == uid) & (annot_df['label'] == 'CVC - Normal')]
    if len(matches) > 0:
        plot_image_with_annotations(annot_df, train_images_path, row_ind = matches[0])
        break
else:
    print('No annotated images found for this condition')

In [None]:
visualize_annotations(annot_df, train_images_path,
                      img_id = temp_df['StudyInstanceUID'].values[count])

From the image above we can see that there are 2 catheters in this patient: one probably through the jugular and the other through the subclavian vein. They do not share any common pathway and terminate near the superior vena cava on the right side.  
That is the standard procedure and probably the reason why this image is classified as normal.

### 2. Abnormal Case:-

In [None]:
temp_df = labels_df[labels_df['CVC - Abnormal'] == 1].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

In [None]:
# Walk through the sampled images until one has an annotation for this label
for count, uid in enumerate(temp_df['StudyInstanceUID'].values):
    matches = annot_df.index[(annot_df['StudyInstanceUID'] == uid) & (annot_df['label'] == 'CVC - Abnormal')]
    if len(matches) > 0:
        plot_image_with_annotations(annot_df, train_images_path, row_ind = matches[0])
        break
else:
    print('No annotated images found for this condition')

In [None]:
visualize_annotations(annot_df, train_images_path,
                      img_id = temp_df['StudyInstanceUID'].values[count])

The line enters through the subclavian vein but has terminated well above the superior vena cava and also to the left of the midline, which is not an expected position or outcome. It can be caused by various things, such as arterial placement of the central line or not enough length being pushed into the patient. That is probably why this image is classified as Abnormal.

### 3. Borderline Case:-

In [None]:
temp_df = labels_df[labels_df['CVC - Borderline'] == 1].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

In [None]:
# Walk through the sampled images until one has an annotation for this label
for count, uid in enumerate(temp_df['StudyInstanceUID'].values):
    matches = annot_df.index[(annot_df['StudyInstanceUID'] == uid) & (annot_df['label'] == 'CVC - Borderline')]
    if len(matches) > 0:
        plot_image_with_annotations(annot_df, train_images_path, row_ind = matches[0])
        break
else:
    print('No annotated images found for this condition')

In [None]:
visualize_annotations(annot_df, train_images_path,
                      img_id = temp_df['StudyInstanceUID'].values[count])

As we can see the catheter is administered through subclavian vein (it's kind of becoming a pattern now) and terminates to the right of the midline but only just. So it might have a possibility of being in the left brachiocephalic vein or just about in the superior vena cava. And probably because of these traits it's marked as borderline.

In [None]:
cvc_counts_dict = dict(labels_df[cols_dict['CVC']].sum(axis=0))

In [None]:
fig = go.Figure(
    data=[go.Bar(x = list(cvc_counts_dict.keys()), y = list(cvc_counts_dict.values()))],
    layout=go.Layout(
        title=go.layout.Title(text="Count of each CVC category")
    )
)

fig.show()

***Observations:-***
* Most of the CVC lines are placed either normally or on the borderline. Luckily the abnormal cases are fewer, but sadly not as few as we would like.
* Similar to the ETT and NGT images, if we generate new images again and again by re-running the function, we see that the images can be rotated, skewed or cut.
* Some images have reflections/shadows too.
* It is going to be an uphill task classifying the various CVC types because of their thin nature and because they sit in a dense region of the body (close to the spinal cord). To add to this pain, many other lines like EKG leads have a similar construction and probably look the same; only the position and, for example, the presence of a bigger tip discriminate between them.

## SGC
SGC stands for Swan Ganz Catheter, which is inserted into the right side of the heart and the arteries leading to the lungs. It is used to monitor the heart's function and blood flow and pressures in and around the heart. This test is most often done in people who are very ill. More details can be found in the link [here](https://youtu.be/y241HEaBkLA). The procedure looks something like this:-
![](https://i.pinimg.com/474x/0c/2e/29/0c2e297965bf884835cd4cab1e56179a.jpg) 
Source:- https://in.pinterest.com/pin/707909635160327979/  
So typically it should be a lot thinner than both ETT and NGT (which were tubes). It can be inserted through the internal jugular vein, passing through the right atrium into the pulmonary artery, which carries blood to the lungs.  
Let's look at some examples...

In [None]:
counts_dict['SGC']

Only a very small subset of patients needed Pulmonary Artery Catheter.

In [None]:
temp_df = labels_df[labels_df[cols_dict['SGC']].sum(axis=1) > 0].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

We can see a line similar to a CVC in these images as well; the only difference is that this one goes right through the heart and terminates well within the left side of the midline. As there is no normal/abnormal split in the training data, we are only going to detect its presence or absence in an image.

### 1. Present:-

In [None]:
temp_df = labels_df[labels_df['Swan Ganz Catheter Present'] == 1].sample(frac=1)[:9]
plot_img_from_df(temp_df, train_images_path)

In [None]:
temp_df = labels_df[labels_df['Swan Ganz Catheter Present'] == 1].sample(frac=1)
# Walk through the sampled images until one has an annotation for this label
for count, uid in enumerate(temp_df['StudyInstanceUID'].values):
    matches = annot_df.index[(annot_df['StudyInstanceUID'] == uid) & (annot_df['label'] == 'Swan Ganz Catheter Present')]
    if len(matches) > 0:
        plot_image_with_annotations(annot_df, train_images_path, row_ind = matches[0])
        break
else:
    print('No annotated images found for this condition')

Another differentiating factor between CVC and SGC is the presence of a loop in the system, which is basically unavoidable due to the construction of the human heart.

## Summary : Semester 1
Let's summerize our observations and data understanding in this section:-
* In majority of cases the images are either normal or borderline. The representation of abnormal images unluckily (or luckily for the pateints) is very low.
* We required sqinting of eye and changing screen brightness to see lines/tubes properly. Which suggests that wile we are training our model it might be a good idea to reduce the brightness of images while augmentation.
* ETT are tubes which are inserted into the trachea to help a person breathe.
    * In a normal condition, the tube will pass parallely to the spine and terminate a little above the carina.
    * In abnormal situations, the tube might overshoot the carina and maybe even be present in esophagus instead of the trachea.
    * In borderline conditions, the tube might be in the right tract, but a little bit longer than normal which might caus ethe patient some discomfort when they try to move their neck.
* NGT are tubes which are commonly passed through the nasal cavity and esophagus into stomach to assist feeding of a person or collect sample from stomach.
    * In a normal condition, the tube will pass parallely to the spine and bend just before the stomach and bend to reach it. It should also pass the 4-point test.
    * In abnormal situations, the tube might not even reach the stomach or might be curled in an improper manner such that the 4-point check is violated.
    * In borderline conditions, the tube will reach the stomach but the bend might not be completely in line with the SOP and chgecking procedure.
    * In cases where the image is not adequate or has been cut above the abdomenal cavity, no comments could be made regarding the correctness of the installation.
* CVC is line used to administer medicines to the patient. It can be inserted either through jugular or subclavian vein.
    * In a normal condition, the line will pass cleanly through the vein and terminate to the right of the midline near the superior vena-cava.
    * In abnormal cases, the line might end up either to the left of midline or before the superior vena cava.
    * In borderline cases, the line would terminate just to the right of the midline and in the left brachiocephalic vein or just about in the superior vena cava
* SVC is a line used for multiple purposes like checking pressure, heartbeat, etc.
    * It can be identified by the distinctive loop near the heart where it enters the pulmonary artery from the internal jugular through the right atrium.
    * Similar to CVC, it will terminate to the right of the midline.
    * We do not have data regarding normal/abnormal classification of SVC; we just need to predict its presence.
* A patient can have multiple CVCs or NGTs at any point in time.
* CVCs can be difficult to identify in images because of other similar-looking lines, like EKG leads.
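
Since one X-ray can show several devices at once, each image is better treated as a multi-label example than as a single class. Below is a minimal sketch of how the device/status descriptions above could be turned into a multi-hot target vector. The label strings are illustrative assumptions and may not match the exact column names in the competition files.

```python
# Hypothetical label names built from the device/status pairs described above.
LABELS = [
    "ETT - Abnormal", "ETT - Borderline", "ETT - Normal",
    "NGT - Abnormal", "NGT - Borderline", "NGT - Incompletely Imaged", "NGT - Normal",
    "CVC - Abnormal", "CVC - Borderline", "CVC - Normal",
    "SVC - Present",
]


def to_multi_hot(findings):
    """Return a 0/1 list with one slot per entry in LABELS."""
    unknown = set(findings) - set(LABELS)
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    return [1 if label in findings else 0 for label in LABELS]


# Example: a patient with a well-placed ETT and two CVCs,
# one placed normally and one borderline.
target = to_multi_hot({"ETT - Normal", "CVC - Normal", "CVC - Borderline"})
print(target)
```

Several slots can be active at the same time, which is why a per-label metric such as macro-averaged AUC suits this problem.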

Now that we have the prerequisites and know what to look for in the image, let's start with some data cleaning and make everything "sterile" for the model training.

# Semester 2: Sterilizing the Data  
We saw in the Cassava competition that the exact same image could appear in 2 different classes, or that a subset of one image could belong to a completely different class. Let's see if we have similar problems here as well.

In [None]:
# Referenced from https://www.kaggle.com/tanulsingh077/how-to-become-leaf-doctor-with-deep-learning/

funcs = [
    imagehash.average_hash,
    imagehash.phash,
    imagehash.dhash,
    imagehash.whash,
]

image_ids = []
hashes = []

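for path in tqdm(glob.glob(os.path.join(train_images_path, '*.jpg'))):
    image_id = os.path.basename(path)
    image_ids.append(image_id)
    # Each of the 4 hash functions yields a 64-bit hash; flatten them into one 256-bit vector.
    # The context manager closes the file handle once the hashes are computed.
    with Image.open(path) as image:
        hashes.append(np.array([f(image).hash for f in funcs]).reshape(256))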
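
To build some intuition for what a perceptual hash captures, here is a simplified average hash written from scratch. It is only a sketch under stated assumptions: it downsamples by block averaging rather than by the resizing that `imagehash` performs, so its bits will not match the library's output, and `average_hash_sketch` is a hypothetical helper name.

```python
import numpy as np


def average_hash_sketch(gray, hash_size=8):
    """Simplified average hash: block-mean downsample, then threshold at the mean.

    `gray` is a 2D array whose height and width are at least `hash_size`.
    """
    h, w = gray.shape
    bh, bw = h // hash_size, w // hash_size
    # Crop to a multiple of hash_size, then average each (bh x bw) block.
    cropped = gray[:bh * hash_size, :bw * hash_size]
    small = cropped.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    # One bit per cell: True where the cell is brighter than the overall mean.
    return small > small.mean()


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64)).astype(float)
brighter = image * 0.9 + 10  # a uniform brightness/contrast change

bits_a = average_hash_sketch(image)
bits_b = average_hash_sketch(brighter)
print(bits_a.shape, (bits_a == bits_b).mean())
```

Because every bit only records whether a region is brighter than the image average, a uniform brightness or contrast change leaves the hash unchanged, which is what makes these hashes useful for finding near-duplicates.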

In [None]:
hashes_all = np.array(hashes)

# Convert numpy array into torch tensor to speed up similarity calculation
hashes_all = torch.Tensor(hashes_all.astype(int)).cpu()

# Calculate similarities among all image pairs. Divide the value by 256 to normalize (0-1)
%time sims = np.array([(hashes_all[i] == hashes_all).sum(dim=1).cpu().numpy()/256 for i in range(hashes_all.shape[0])])
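
The similarity used here is simply the fraction of hash bits on which two images agree, i.e. one minus the normalized Hamming distance. A toy example with three 8-bit vectors (instead of 256 bits) makes the numbers easy to check by hand. The `toy_` names are illustrative and not part of the notebook.

```python
import numpy as np

# Three toy 8-bit "hashes" (the notebook uses 256 bits per image).
toy_hashes = np.array([
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 1],  # differs from row 0 in one bit
    [0, 1, 0, 0, 1, 1, 0, 1],  # bitwise complement of row 0
])
n_bits = toy_hashes.shape[1]

# Broadcasting compares every row with every other row: (3, 3, 8) -> (3, 3).
toy_sims = (toy_hashes[:, None, :] == toy_hashes[None, :, :]).sum(axis=2) / n_bits
print(toy_sims)
```

The diagonal is always 1.0 because each image matches itself, and the matrix is symmetric. Note that the full matrix grows quadratically with the number of images, so on a large dataset it may be worth processing it in chunks.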

In [None]:
# Thresholding
indices1 = np.where(sims > 0.95)
indices2 = np.where(indices1[0] != indices1[1])
image_ids1 = [image_ids[i] for i in indices1[0][indices2]]
image_ids2 = [image_ids[i] for i in indices1[1][indices2]]
# Sort each pair so that (a, b) and (b, a) collapse into a single dictionary key
dups = {tuple(sorted([id_1, id_2])): True for id_1, id_2 in zip(image_ids1, image_ids2)}
print('found %d duplicate pairs' % len(dups))
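
The thresholding step has to discard self-matches on the diagonal and collapse mirrored pairs. One alternative way to get the same result, sketched below on a toy similarity matrix, is to look only at the strict upper triangle. The ids and similarity values are made up for illustration.

```python
import numpy as np

toy_ids = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
toy_sims = np.array([
    [1.00, 0.97, 0.40, 0.10],
    [0.97, 1.00, 0.35, 0.20],
    [0.40, 0.35, 1.00, 0.99],
    [0.10, 0.20, 0.99, 1.00],
])

# k=1 keeps only entries strictly above the diagonal, so self-matches and
# mirrored (j, i) pairs never appear.
rows, cols = np.where(np.triu(toy_sims > 0.95, k=1))
toy_pairs = [(toy_ids[i], toy_ids[j]) for i, j in zip(rows, cols)]
print(toy_pairs)
```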

Now plotting some duplicate images:-

In [None]:
duplicate_image_ids = sorted(dups)

# Show the first 11 duplicate pairs side by side
for img_id_1, img_id_2 in duplicate_image_ids[:11]:
    img1 = cv2.imread(os.path.join(train_images_path, img_id_1))
    img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB)
    img2 = cv2.imread(os.path.join(train_images_path, img_id_2))
    img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2RGB)

    plt.figure(figsize=(10, 10))
    plt.subplot(1, 2, 1)
    plt.axis('off')
    plt.imshow(img1)
    plt.subplot(1, 2, 2)
    plt.axis('off')
    plt.imshow(img2)
    plt.show()
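
Finding duplicates is only half of the Cassava-style check; the other half is whether the two copies carry different labels. Here is a minimal sketch of that comparison on toy data. The DataFrame layout, the `image_id` index and the label column names are assumptions for illustration, so adapt them to the actual training CSV.

```python
import pandas as pd

# Toy stand-in for the training labels; the column names are hypothetical.
toy_labels = pd.DataFrame(
    {
        "image_id": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
        "ETT - Normal": [1, 1, 0, 0],
        "CVC - Abnormal": [0, 0, 1, 0],
    }
).set_index("image_id")

# Toy duplicate pairs in the same (id_1, id_2) form as the keys of `dups`.
toy_dups = [("a.jpg", "b.jpg"), ("c.jpg", "d.jpg")]

# A pair is a conflict when at least one label differs between the two images.
conflicts = [
    (id_1, id_2)
    for id_1, id_2 in toy_dups
    if (toy_labels.loc[id_1].to_numpy() != toy_labels.loc[id_2].to_numpy()).any()
]
print(conflicts)
```

Pairs that end up in `conflicts` are the ones worth inspecting by hand before training.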

With that out of the way, let's learn the skills to detect the classes automatically from the images.

# Semester 3: Algorithmic Background
We are going to make extensive use of image augmentations, deep neural networks, CNNs, transfer learning (ResNet50, InceptionV3, EfficientNet) for computer vision, and a cyclic learning rate scheduler (cosine decay). Please feel free to pause and go through the reading materials linked below. The modelling notebook will make use of these resources and knowledge to build classifier(s).  
* [Deep Neural Network](https://machinelearningmastery.com/what-is-deep-learning/)
* [Convolutional Neural Networks (CNN)](https://cs231n.github.io/convolutional-networks/)
* [Transfer Learning](https://www.coursera.org/lecture/convolutional-neural-networks/transfer-learning-4THzO)
* [Resnet 50](https://arxiv.org/abs/1512.03385)
* [Inception V3](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44903.pdf)
* [Efficient Net](https://arxiv.org/abs/1905.11946)
* [Image Augmentation](https://www.pyimagesearch.com/2019/07/08/keras-imagedatagenerator-and-data-augmentation/)
* [Cyclic Learning Rate](https://arxiv.org/abs/1506.01186)
* [Cosine Decay](https://arxiv.org/abs/1608.03983)
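
As a small taste of the last two items, here is a minimal sketch of a cosine decay schedule that restarts at a fixed interval. The function name, cycle length and learning-rate bounds are illustrative assumptions, not the settings of the upcoming modelling notebook.

```python
import math


def cosine_lr(step, cycle_len, lr_max=1e-3, lr_min=1e-5):
    """Learning rate at `step` for cosine decay that restarts every `cycle_len` steps."""
    t = step % cycle_len  # position inside the current cycle
    cosine = 0.5 * (1 + math.cos(math.pi * t / cycle_len))
    return lr_min + (lr_max - lr_min) * cosine


schedule = [cosine_lr(s, cycle_len=10) for s in range(20)]
print([f"{lr:.2e}" for lr in schedule[:11]])
```

Within each cycle the rate falls smoothly from `lr_max` towards `lr_min`, then jumps back up at the restart, which gives the optimizer a periodic chance to leave a poor region of the loss surface.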

The modelling notebook is under construction and will be available very soon.  
I will link it at the top when it's ready.  

And thanks a lot for reading my notebook. I hope it was worth your time and you have learnt about the domain as well as the objective of the problem at hand. 😄