# Masks for catheter segmentation

In this notebook you will find:
* Code snippets for mask creation
* Examples of usage of already created masks
* Dataset with created masks

In [None]:
import os
import ast
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from typing import Dict

import cv2
from PIL import Image
from scipy import interpolate
from torch.utils.data import Dataset

In [None]:
def set_seed(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    
set_seed(123)

In [None]:
TRAIN_DATA = '../input/ranzcr-clip-catheter-line-classification/train'
TEST_DATA = '../input/ranzcr-clip-catheter-line-classification/test'

TRAIN_CSV = '../input/ranzcr-clip-catheter-line-classification/train.csv'
TRAIN_ANNOT_CSV = '../input/ranzcr-clip-catheter-line-classification/train_annotations.csv'
SUBMISSION = '../input/ranzcr-clip-catheter-line-classification/sample_submission.csv'

MASKS = '../input/ranzcr-catheter-and-line-masks/train_masks'

In [None]:
train_df = pd.read_csv(TRAIN_CSV)
train_annot_df = pd.read_csv(TRAIN_ANNOT_CSV)
submission = pd.read_csv(SUBMISSION)

In the competition dataset, about 30% of the data has manual annotations of catheter and line positions. <br>
No annotations are available for the test set.

In [None]:
train_imgs = set(train_df['StudyInstanceUID'].unique())
annotated_imgs = set(train_annot_df['StudyInstanceUID'].unique())
test_imgs = set(submission['StudyInstanceUID'].unique())

print('Annotated images in train set: ', len(train_imgs.intersection(annotated_imgs)))
print('Annotated images in test set: ', len(test_imgs.intersection(annotated_imgs)))

One of the most straightforward ways to use the provided annotations is to train a segmentation model, predict catheter masks for the train and test sets, and then use those masks as additional input to the downstream classification model. <br>
Training such a segmentation model requires catheter masks as targets. <br>
To create the segmentation target masks, one should:
* Interpolate annotated points as a continuous line
* Define classes for masks
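
As a rough illustration of the "additional data" idea above, a predicted mask could be appended to the image as an extra channel before it is fed to a classifier. This is only one possible option, not necessarily the approach used elsewhere in this notebook; the helper `stack_image_and_mask` and the scaling by `n_classes` are assumptions made for this sketch.

```python
import numpy as np

def stack_image_and_mask(img, mask, n_classes=4):
    """Hypothetical helper: combine a grayscale image (H, W) and a label
    mask (H, W) into a float array of shape (H, W, 2)."""
    img_f = img.astype(np.float32) / 255.0         # scale pixels to [0, 1]
    mask_f = mask.astype(np.float32) / n_classes   # scale label ids to [0, 1]
    return np.stack([img_f, mask_f], axis=-1)

# Toy data standing in for a real image and a predicted mask
img = np.random.randint(0, 256, size=(8, 8), dtype=np.uint8)
mask = np.zeros((8, 8), dtype=np.int16)
mask[2:4, :] = 1

x = stack_image_and_mask(img, mask)
print(x.shape, x.dtype)
```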

Different strategies for class definition can be used:
* 1 class - find any catheter in the image.
* 4 classes - find and classify one of the catheter types regardless of their positioning (e.g. CVC, NGT, etc.).
* 11 classes - classify catheters with respect to their positioning (e.g. CVC-Normal, ETT-Abnormal, etc.).
* N classes - any other combination you may find useful.

1-class segmentation should be the simplest to train, but it provides the least information; 11-class segmentation is the opposite. As a tradeoff between model complexity and the information provided, I decided to use 4-class segmentation. <br>
If you want to change this, adjust the classes in `labels_dict` below and run the `create_masks` function.


In [None]:
labels_dict = {
    'CVC - Normal': 1,
    'CVC - Borderline': 1,
    'CVC - Abnormal': 1,
    'NGT - Normal': 2,
    'NGT - Incompletely Imaged': 2,
    'NGT - Borderline': 2,
    'NGT - Abnormal': 2,
    'ETT - Normal': 3,
    'ETT - Borderline': 3,
    'ETT - Abnormal': 3, 
    'Swan Ganz Catheter Present': 4,
}
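
For the other strategies listed above, the mapping could be built along these lines. This is a sketch: the names `binary_labels_dict` and `full_labels_dict` are made up here, and the id order in the 11-class variant is an arbitrary choice.

```python
# Label names as used in the keys of `labels_dict`
labels = [
    'CVC - Normal', 'CVC - Borderline', 'CVC - Abnormal',
    'NGT - Normal', 'NGT - Incompletely Imaged', 'NGT - Borderline', 'NGT - Abnormal',
    'ETT - Normal', 'ETT - Borderline', 'ETT - Abnormal',
    'Swan Ganz Catheter Present',
]

# 1 class: every catheter gets id 1 (0 stays the background)
binary_labels_dict = {label: 1 for label in labels}

# 11 classes: one id per (catheter type, position) pair, starting at 1
full_labels_dict = {label: i for i, label in enumerate(labels, start=1)}
```

Either dictionary can be passed wherever `labels_dict` is expected.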

### Data preparation
First, let's check the consistency between the `train` and `train_annotations` labels:

In [None]:
# Check consistency between train and train_annot:
# every annotated label must be set to 1 for the same image in train_df

for idx in train_annot_df.index:
  uid = train_annot_df.loc[idx, 'StudyInstanceUID']
  label = train_annot_df.loc[idx, 'label']
  train_label_value = train_df[train_df['StudyInstanceUID'] == uid][label].values[0]
  assert train_label_value == 1
print('Labels are consistent.')
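
The row-by-row loop is easy to read but scans `train_df` once per annotation. A vectorized variant of the same check could look like the sketch below. It runs on toy frames (`toy_train` and `toy_annot` are made-up stand-ins with the same column layout), and `find_inconsistent` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for train_df and train_annot_df (values are invented)
toy_train = pd.DataFrame({
    'StudyInstanceUID': ['a', 'b'],
    'CVC - Normal': [1, 0],
    'NGT - Normal': [0, 1],
})
toy_annot = pd.DataFrame({
    'StudyInstanceUID': ['a', 'b'],
    'label': ['CVC - Normal', 'NGT - Normal'],
})

def find_inconsistent(train, annot):
    """Return annotation rows whose label is not set to 1 in `train`."""
    # Wide -> long: one row per (image, label) pair with its 0/1 value
    long_df = train.melt(id_vars='StudyInstanceUID',
                         value_vars=sorted(annot['label'].unique()),
                         var_name='label', value_name='value')
    merged = annot.merge(long_df, on=['StudyInstanceUID', 'label'], how='left')
    # Unmatched rows get NaN and are reported as inconsistent too
    return merged[merged['value'] != 1]

inconsistent = find_inconsistent(toy_train, toy_annot)
print(len(inconsistent), 'inconsistent annotation rows')
```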

Then let's correct some label mistakes. <br>
The corrections are based on this [discussion topic](https://www.kaggle.com/c/ranzcr-clip-catheter-line-classification/discussion/210064).

In [None]:
# Error corrections.
# Each entry: [row index in train_annot_df, StudyInstanceUID, wrong label, correct label]

to_correct = [
	[3589,	'1.2.826.0.1.3680043.8.498.57005638787237813934531972491254580369',	'CVC - Borderline',	'NGT - Borderline'],
	[4344,	'1.2.826.0.1.3680043.8.498.93345761486297843389996628528592497280',	'ETT - Abnormal',	'CVC - Abnormal'],
	[6294,	'1.2.826.0.1.3680043.8.498.50891603479257167332052859560303996365',	'NGT - Normal',	'CVC - Normal'],
	[7558,	'1.2.826.0.1.3680043.8.498.32665013930528750130301395098139968929',	'NGT - Borderline',	'CVC - Borderline'],
	[8457,	'1.2.826.0.1.3680043.8.498.47822809495672253227315400926882161159',	'NGT - Borderline',	'CVC - Borderline'],
	[8586,	'1.2.826.0.1.3680043.8.498.55171965195784371324650309161724846475',	'NGT - Borderline',	'CVC - Borderline'],
	[8589,	'1.2.826.0.1.3680043.8.498.29639870594803047496855371142714987539',	'ETT - Normal',	'CVC - Normal'],
	[9908,	'1.2.826.0.1.3680043.8.498.52422864792637441690285442425747003963',	'NGT - Normal',	'ETT - Normal'],
	[10889,	'1.2.826.0.1.3680043.8.498.51277351337858188519077141427236143108',	'NGT - Normal',	'CVC - Normal'],
	[10963,	'1.2.826.0.1.3680043.8.498.33011244702337270174558484639492100815',	'CVC - Normal',	'NGT - Normal'],
	[11902,	'1.2.826.0.1.3680043.8.498.10505287747515183956922280117689383476',	'NGT - Normal',	'CVC - Normal'],
	[12041,	'1.2.826.0.1.3680043.8.498.43340424479611237895060478106689360500',	'NGT - Normal',	'CVC - Normal'],
	[12782,	'1.2.826.0.1.3680043.8.498.12545979153892772426852721449004507757',	'NGT - Abnormal',	'CVC - Abnormal'],
	[13513,	'1.2.826.0.1.3680043.8.498.83700037297895094021306651705503600111',	'NGT - Normal',	'ETT - Normal'],
	[14226,	'1.2.826.0.1.3680043.8.498.35772244095675958072394978496245125294',	'NGT - Normal',	'ETT - Normal'],
	[15750,	'1.2.826.0.1.3680043.8.498.96130195933728659348647733812659169362',	'CVC - Abnormal',	'NGT - Abnormal'],
	[15779,	'1.2.826.0.1.3680043.8.498.75269816256944932004789976844599885553',	'NGT - Abnormal',	'CVC - Abnormal'],
	[16629,	'1.2.826.0.1.3680043.8.498.11935284122896798228836385959451625327',	'NGT - Abnormal',	'CVC - Abnormal'],
	[17501,	'1.2.826.0.1.3680043.8.498.83574817573978660270935463700320068005',	'NGT - Abnormal',	'CVC - Abnormal']
]

In [None]:
for case in to_correct:
  train_df.loc[train_df.StudyInstanceUID==case[1], case[2]] = 0
  train_df.loc[train_df.StudyInstanceUID==case[1], case[3]] = 1
  train_annot_df.loc[case[0], 'label'] = case[3]

print('Labels are corrected.')

### Masks creation
The function below performs piecewise linear interpolation of the annotated points, then thickens the interpolated line and returns the result as an `np.array`. <br>
Labels in the resulting mask correspond to `labels_dict`; label `0` corresponds to the background.
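
To make the interpolation step concrete, here is a small NumPy-only sketch on a toy grid. Note that it uses a different technique from the function below: it samples each segment with `np.linspace` instead of `scipy.interpolate.interp1d`, and it does no dilation. The helper `draw_segment` and the toy points are assumptions made for illustration.

```python
import numpy as np

def draw_segment(mask, p0, p1, label_id):
    """Rasterize the straight segment p0 -> p1 (points given as (x, y))."""
    (x0, y0), (x1, y1) = p0, p1
    # One sample per pixel along the longer axis, so the line has no gaps
    n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
    xs = np.rint(np.linspace(x0, x1, n)).astype(int).clip(0, mask.shape[1] - 1)
    ys = np.rint(np.linspace(y0, y1, n)).astype(int).clip(0, mask.shape[0] - 1)
    mask[ys, xs] = label_id   # rows are y, columns are x
    return mask

mask = np.zeros((10, 10), dtype=np.uint8)
points = [(1, 1), (8, 3), (8, 8)]   # toy annotation: three (x, y) points
for p0, p1 in zip(points[:-1], points[1:]):
    draw_segment(mask, p0, p1, label_id=2)

print(mask)

# Kernel size used for thickening, e.g. for a hypothetical 2048 x 2500 image:
ks = max((2048, 2500)) // 145   # 17
```

The kernel size scales with the larger image side, so the drawn catheter keeps roughly the same relative thickness across image resolutions.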

In [None]:
def create_mask(img_name, data_path, df, labels_dict, thin_scale=145):

  img_path = os.path.join(data_path, img_name + '.jpg')
  img = np.asanyarray(Image.open(img_path), dtype='uint8')
  img_data = df[df['StudyInstanceUID'] == img_name]

  mask = np.zeros_like(img)

  for idx in img_data.index:
    data = np.array(ast.literal_eval(img_data.loc[idx, 'data']))
    label = img_data.loc[idx, 'label']
    label_id = labels_dict[label]
    x, y = data[:, 0], data[:, 1]

    for i in range(data.shape[0]-1):
      xi, yi = np.array([x[i], x[i+1]]), np.array([y[i], y[i+1]])
      f1, f2 = interpolate.interp1d(xi, yi), interpolate.interp1d(yi, xi)
      x_new, y_new = np.arange(xi.min(), xi.max(), 1), np.arange(yi.min(), yi.max(), 1)
      y_inter, x_inter = f1(x_new), f2(y_new)
      
      y_mask = y_inter.astype(np.int32).clip(0, mask.shape[0]-1)  
      x_mask = x_inter.astype(np.int32).clip(0, mask.shape[1]-1)  

      mask[y_mask, x_new] = label_id
      mask[y_new, x_mask] = label_id

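  # Thicken the one-pixel line; the kernel size scales with the image size
  ks = max(mask.shape) // thin_scale
  # cv2.MORPH_OPEN is an operation type, not a kernel shape; it only worked
  # because it shares its numeric value with cv2.MORPH_ELLIPSE
  kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ks, ks))
  mask = cv2.dilate(mask, kernel, iterations=1)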

  return mask.astype(np.int16)

**Visual test**

In [None]:
def plot_sample(img_sample, train_annot_df, train_df, images_path, labels_dict):
  fig, ax = plt.subplots(1, 2, figsize=(14, 14))

  img_path = os.path.join(images_path, img_sample + '.jpg')
  img = np.asanyarray(Image.open(img_path), dtype=np.uint16)

  sample = train_annot_df[train_annot_df['StudyInstanceUID'] == img_sample]
  annots_data = np.array(ast.literal_eval(sample['data'].values[0]))

  mask = create_mask(img_sample, images_path, train_annot_df, labels_dict)

  print(sample.label)
  ax[0].imshow(img)
  ax[0].scatter(annots_data[:, 0], annots_data[:, 1])
  ax[1].imshow(mask)

In [None]:
img_sample = train_annot_df['StudyInstanceUID'].sample().values[0]

plot_sample(img_sample, train_annot_df, train_df, TRAIN_DATA, labels_dict)

You can use the masks from the [provided dataset](https://www.kaggle.com/glebkum/ranzcr-catheter-and-line-masks) or create your own, for example with a different class labeling, using the `create_masks` function below:

In [None]:
# PATH_TO_SAVE_MASKS = # define path

def create_masks(train_annot_df, train_data_path, labels_dict, path_to_save):
    # Create and save masks for the annotated train images in `.npz` format
    os.makedirs(path_to_save, exist_ok=True)
    for img_name in tqdm(train_annot_df['StudyInstanceUID'].unique()):
        mask = create_mask(img_name, train_data_path, train_annot_df, labels_dict)
        f_name = os.path.join(path_to_save, img_name)
        # np.savez_compressed appends the `.npz` extension to f_name
        np.savez_compressed(f_name, mask)
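
Since the mask is passed to `np.savez_compressed` as a positional argument, NumPy stores it under the default key `arr_0`. The round trip below shows this on a toy mask; the file name `example_uid` and the temporary directory are placeholders for this sketch.

```python
import os
import tempfile
import numpy as np

mask = np.zeros((4, 4), dtype=np.int16)
mask[1, :] = 3   # toy mask with one row of class 3

with tempfile.TemporaryDirectory() as tmp_dir:
    f_name = os.path.join(tmp_dir, 'example_uid')   # placeholder file name
    np.savez_compressed(f_name, mask)               # writes example_uid.npz
    with np.load(f_name + '.npz') as archive:
        keys = archive.files         # names of the stored arrays
        loaded = archive['arr_0']    # positional arrays are named arr_0, arr_1, ...

print(keys, loaded.dtype, loaded.shape)
```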

### Usage example
To train the segmentation model, one can use the already created masks. <br>
An example of a `Dataset` class for that task is provided below.

In [None]:
class SegmDataset(Dataset):
    def __init__(self,
                 data_df_path: str,
                 data_path: str,
                 data_masks_path: str,
                 transforms=None):
        super().__init__()
        self.data_df = pd.read_csv(data_df_path)
        self.data_path = data_path
        self.data_masks_path = data_masks_path
        self.transforms = transforms
        self.unique_images = self.data_df['StudyInstanceUID'].unique()

    def __len__(self):
        return len(self.unique_images)

    def __getitem__(self, idx):
        img_uid = self.unique_images[idx]
        img_path = os.path.join(self.data_path, img_uid + '.jpg')
        mask_path = os.path.join(self.data_masks_path, img_uid + '.npz')
        img = np.asarray(Image.open(img_path), dtype='uint8')
        mask = np.load(mask_path)['arr_0']

        if self.transforms:
            transformed = self.transforms(image=img, mask=mask)
            img = transformed['image']
            mask = transformed['mask']

        data = {
            'image': img,
            'mask': mask
        }

        return data
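
The `transforms` argument is called with the keyword arguments `image` and `mask` and must return a dictionary with the same keys, which is the calling convention Albumentations pipelines use. A minimal hand-written transform with that interface might look like this; `hflip_transform` is a hypothetical example, not something the notebook defines.

```python
import numpy as np

def hflip_transform(image, mask):
    """Hypothetical transform: flip image and mask horizontally together,
    so the mask stays aligned with the image."""
    return {
        'image': np.ascontiguousarray(image[:, ::-1]),
        'mask': np.ascontiguousarray(mask[:, ::-1]),
    }

# Toy image and mask standing in for one dataset item
image = np.arange(12, dtype=np.uint8).reshape(3, 4)
mask = np.zeros((3, 4), dtype=np.int16)
mask[:, 0] = 1   # class 1 in the leftmost column

out = hflip_transform(image=image, mask=mask)
print(out['mask'])
```

Such a callable could be passed as `transforms=hflip_transform` when constructing `SegmDataset`.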

In [None]:
dataloader = SegmDataset(TRAIN_ANNOT_CSV,
                         TRAIN_DATA,
                         MASKS)

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(12, 12))

for i in range(3):
    # random.randint includes both endpoints, so the upper bound is len - 1
    rand_idx = random.randint(0, len(dataloader) - 1)
    data_sample = dataloader[rand_idx]
    ax[i][0].imshow(data_sample['image'])
    ax[i][1].imshow(data_sample['mask'])
    
ax[0][0].set_title('Original image')
ax[0][1].set_title('Target mask')