# Shrink Wasted Space 87% and Increase Data Pts 4x

## Motivation

This notebook builds on [previous work](https://www.kaggle.com/dannellyz/tissue-detect-scaling-bounding-boxes-4xfaster) I did to isolate key areas in medical imaging. From experience in other competitions, I agree with [@Loulou](https://www.kaggle.com/louise2001) in this [post](https://www.kaggle.com/c/ranzcr-clip-catheter-line-classification/discussion/203356) on previous work: finding the most important areas of the images and shrinking the learning area is a must. As the first section shows, a large share of each image is left un-annotated when you bound just the annotated sections. While investigating, I also found large inconsistencies in the number of data points per annotation and in the spacing between observations, so I worked on adding data points to normalize the labeling.

## Review on Unused Space in Train Images

To begin, let's look at the space in each image that is not used in training. Let's compute some features of each image to help show this:

In [None]:
#Install and update as needed
!pip install --upgrade seaborn

In [None]:
#import needed libs
import pandas as pd
import numpy as np
import cv2
import ast
import itertools
from PIL import Image
import matplotlib.pyplot as plt
from math import hypot
import seaborn as sns 

#Set up Globals and get Annotations
BASE_DIR = "/kaggle/input/ranzcr-clip-catheter-line-classification/"
TRAIN_IMG_DIR = BASE_DIR + "train/"
TESTING = False
train_anno = pd.read_csv(BASE_DIR + "train_annotations.csv")

#Sample for Testing
if TESTING:
    train_anno = train_anno.sample(int(len(train_anno) * .1))
    
train_anno.head()

### Features of Each Image
* data - nested list of the points along the catheter insertion
* data_pt_cnt - total number of data points in the annotation
* data_pt_len - the total length of the placement annotation
* pt_lev_avg - the average length between points in an annotation
* img_sz - (width, height) tuple of image size, as returned by PIL
* img_area - simple product of the img_sz tuple
* min_rect - (x,y,w,h) produced from cv2.boundingRect(data)
* min_area - area of the minimum rectangle bounding the data points
* extra_area - img_area minus min_area
* pct_extra - extra_area as a fraction of img_area

In [None]:
def get_img_loc(UID):
    return f"{TRAIN_IMG_DIR}{UID}.jpg"


def get_img_features(row):
    return_cols = []
    
    # Transform string data into nested list and get number
    data = ast.literal_eval(row.data)
    data_pt_cnt = len(data)
    
    # Length of the annotation and average per points
    data_pt_len = sum(hypot(x1 - x2, y1 - y2) for (x1, y1), (x2, y2) in zip(data, data[1:]))
    pt_lev_avg = data_pt_len / data_pt_cnt
    
    #image Size with PIL...much faster than cv2
    fname = get_img_loc(row["StudyInstanceUID"])
    im = Image.open(fname)
    img_sz = im.size
    img_area = np.prod(img_sz)
    
    # min_rect = x,y,w,h - bounded around all data points
    data_array = np.float32(data)
    min_rect = cv2.boundingRect(data_array)
    
    #Get areas and extra
    min_area = np.prod(min_rect[2:])
    extra_area = img_area - min_area
    pct_extra = extra_area / img_area
    return pd.Series([data, data_pt_cnt, data_pt_len, pt_lev_avg, img_sz, img_area, min_rect, min_area, extra_area, pct_extra])

img_features = ["data", "data_pt_cnt", "data_pt_len", "pt_lev_avg", "img_sz", "img_area", "min_rect", "min_area", "extra_area", "pct_extra"]
train_anno[img_features] = pd.DataFrame(train_anno.apply(get_img_features, axis=1))

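#Add number of placement types
train_df = pd.read_csv(BASE_DIR + "train.csv")
# numeric_only=True keeps non-numeric columns such as StudyInstanceUID out of the row sum
train_df["placement_cnt"] = train_df.sum(axis=1, numeric_only=True)
train_anno = train_anno.merge(train_df[["StudyInstanceUID","placement_cnt"]], on="StudyInstanceUID")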

train_anno.head()

In [None]:
avg_extra = train_anno["pct_extra"].mean()
print(f"The average share of unused space in each image, pct_extra, is: {avg_extra:.1%}")

**Mean Extra Space 87%!**

Given this, there must be ways to crop the images to home in on the parts that contain the data.
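
One way to act on this is to crop each image to its `min_rect` plus some padding. The block below is only a minimal sketch of that idea, not a method this notebook settles on: `crop_to_rect`, its `pad` parameter, and the toy image are hypothetical names and values introduced here for illustration.

```python
import numpy as np

def crop_to_rect(img, rect, pad=0):
    """Crop a (H, W[, C]) array to an (x, y, w, h) rectangle plus optional padding."""
    x, y, w, h = rect
    height, width = img.shape[:2]
    # Clamp the padded rectangle to the image bounds
    x0 = max(int(x) - pad, 0)
    y0 = max(int(y) - pad, 0)
    x1 = min(int(x + w) + pad, width)
    y1 = min(int(y + h) + pad, height)
    return img[y0:y1, x0:x1]

# Hypothetical 100 x 200 grayscale image and bounding rectangle (made-up values)
toy_img = np.zeros((100, 200), dtype=np.uint8)
toy_rect = (50, 20, 40, 30)

cropped = crop_to_rect(toy_img, toy_rect, pad=10)
print(cropped.shape)  # (50, 60)
```

The padding is there because a model probably benefits from some context around the line. How much to keep is a judgment call that I have not tested here.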

### Correlation of Extra Space to data points or placement types

I started exploring this extra space by checking whether images with more placement types or more annotated points used less space. The idea is that as the amount of annotated data increases, there should be less unused space:

In [None]:
data_extra = train_anno["data_pt_cnt"].corr(train_anno["pct_extra"])
placement_extra = train_anno["placement_cnt"].corr(train_anno["pct_extra"])
placement_data = train_anno["placement_cnt"].corr(train_anno["data_pt_cnt"])
print(f"1: {data_extra:.2} = data_pt_cnt to pct_extra | strong negative | more data points, less extra space")
print(f"2: {placement_extra:.2} = placement_cnt to pct_extra | weak positive | more placements, somewhat more extra space")
print(f"3: {placement_data:.2} = placement_cnt to data_pt_cnt | weak negative | more placements, somewhat fewer data points")

1. The first finding makes sense: as you add data points, there is less extra space.
2. The second makes less sense. One would assume that as you add types of placements there would be less extra space. Potentially the x-rays are tighter in cases where there are more types of placements.
3. This also does not make sense, but it does square with the above: as there are more types of placements, there are fewer data points. I have added a [discussion](https://www.kaggle.com/c/ranzcr-clip-catheter-line-classification/discussion/206294) to explore this.

In that discussion, the competition host suggested:
> @jarrelscy: My guess would be that because labelers label the whole image at one time, when there are many lines and tubes they may be trying to finish it faster and hence putting less points.

I think this makes sense, so I wanted to explore that relationship further and also look into interpolating data points for annotations with fewer labels.

# Discovery 1: Data Point Inconsistency Per Annotation


The chart below is a histogram of the count of data points per x-ray, `data_pt_cnt`. As you can see, it fluctuates quite a bit and is not consistent across the number of placement annotations.

In [None]:
sns.displot(train_anno, x="data_pt_cnt", hue="placement_cnt", multiple="stack", palette="Set2")

Furthermore, when looking at the average length between annotation points, `pt_lev_avg`, one can see a skew toward lower values for those with fewer placements.

In [None]:
sns.displot(train_anno, x="pt_lev_avg", hue="placement_cnt", multiple="stack", palette="Set2")

Looking at these two findings as box plots, you can see the downward skew of data points in relation to the number of placement annotations, as well as the upward skew in the average length between annotation points over the same x-axis.

In [None]:
sns.boxplot(x="placement_cnt", y="pt_lev_avg", data=train_anno)

In [None]:
sns.boxplot(x="placement_cnt", y="data_pt_cnt", data=train_anno)
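
To put numbers next to the box plots, a grouped median is a quick check. The block below is a sketch on a made-up `toy_anno` frame. Its values are invented for illustration and are not results from the competition data. On the real data, the same `groupby` call would run on `train_anno`.

```python
import pandas as pd

# Hypothetical stand-in for train_anno; every value here is made up
toy_anno = pd.DataFrame({
    "placement_cnt": [1, 1, 2, 2, 3, 3],
    "data_pt_cnt": [40, 30, 25, 15, 10, 12],
    "pt_lev_avg": [20.0, 30.0, 45.0, 55.0, 80.0, 70.0],
})

# Median point count and median point spacing for each number of placements
summary = (
    toy_anno
    .groupby("placement_cnt")[["data_pt_cnt", "pt_lev_avg"]]
    .median()
)
print(summary)
```

I use the median rather than the mean because the histograms above have long tails, so a few heavily annotated x-rays would otherwise dominate each group.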

# Finding 1: The more annotations the less provided data

Given this, the x-rays with more types of placements receive less attention from the experts when labeling. From a human perspective this makes sense. On more crowded, full slides the expert would need to move quicker and would have less time to detail each individual placement. For those with only one or two types of placement, they could add much more detail in the same period of time.

# Solution 1: a) Interpolate or b) Drop Outliers.

## a) Interpolating

Interpolating seeks to take a vector and add additional data points based on its shape. As seen below, this works well for convex lines, but concave lines give it problems.

In [None]:
import numpy as np
import scipy as sp
import scipy.interpolate
import matplotlib.pyplot as plt

#https://stackoverflow.com/questions/4072844/add-more-sample-points-to-data

# Generate some random data
y = (np.random.random(10) - 0.5).cumsum()
x = np.arange(y.size)

# Interpolate the data using a cubic spline to "new_length" samples
new_length = 50
new_x = np.linspace(x.min(), x.max(), new_length)
new_y = sp.interpolate.interp1d(x, y, kind='cubic')(new_x)

# Plot the results
plt.figure()
plt.subplot(2,1,1)
plt.plot(x, y, 'bo-')
plt.title('Using 1D Cubic Spline Interpolation')

plt.subplot(2,1,2)
plt.plot(new_x, new_y, 'ro-')

plt.show()

Since many of the annotations can be quite irregular, we need to interpolate between each pair of points rather than over the line as a whole.

<span style="color:blue;"> Blue == Base Annotation </span>

<span style="color:crimson;"> Red  == Interpolation </span>

In [None]:
#sample_img = train_anno.sample(1)
#sample_loc = sample_img["StudyInstanceUID"].values[0]
#sample_data = sample_img["data"].values[0]
sample_data = train_anno[train_anno["StudyInstanceUID"] == '1.2.826.0.1.3680043.8.498.13137786603668786022908361036269592497']["data"].values[0]
x_test = np.array([x[0] for x in sample_data])
y_test = np.array([x[1] for x in sample_data])

# Interpolate over the whole line at once to "new_length" samples (linear this time)
new_length = 50
new_x_test = np.linspace(x_test.min(), x_test.max(), new_length)
new_y_test = sp.interpolate.interp1d(x_test, y_test, kind='linear')(new_x_test)

# Plot the results
plt.figure()
plt.subplot(2,1,1)
plt.plot(x_test, y_test, 'bo-')
plt.title('Using 1D Linear Interpolation Over the Whole Line')

plt.subplot(2,1,2)
plt.plot(new_x_test, new_y_test, 'ro-')

plt.show()

The fix is to interpolate iteratively between each pair of consecutive points, as shown below.

In [None]:
x_test = np.array([x[0] for x in sample_data])
y_test = np.array([x[1] for x in sample_data])

all_xs = []
all_ys = []

for i in range(1, len(x_test)):
    # Interpolate linearly between this pair of consecutive points to "new_length" samples.
    # Running linspace from the first point to the second keeps the original drawing
    # direction and also handles vertical segments, where both x values are equal.
    new_length = 6
    new_xs = np.linspace(x_test[i-1], x_test[i], new_length)
    new_ys = np.linspace(y_test[i-1], y_test[i], new_length)

    all_xs.extend(new_xs)
    all_ys.extend(new_ys)
#print(all_xs)
#print(all_ys)
# Plot the results
plt.figure()
plt.subplot(2,1,1)
plt.plot(x_test, y_test, 'bo-')
plt.title('Using Linear Interpolation Between Consecutive Points')

plt.subplot(2,1,2)
plt.plot(all_xs, all_ys, 'ro-')

plt.show()
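
The per-segment idea can also be wrapped in a reusable function. The sketch below is a variation rather than the exact loop above: it interpolates x and y against the cumulative distance along the line, which evens out the spacing between points and copes with lines that double back in x. `resample_polyline` and its `spacing` parameter are hypothetical names, and the square-ish toy line is made up for illustration.

```python
import numpy as np

def resample_polyline(points, spacing):
    """Place points at most `spacing` pixels apart along a polyline (linear interpolation)."""
    pts = np.asarray(points, dtype=float)
    # Cumulative distance travelled along the line at each original point
    seg_len = np.hypot(*np.diff(pts, axis=0).T)
    dist = np.concatenate([[0.0], np.cumsum(seg_len)])
    total = dist[-1]
    if total == 0:
        return pts[:1]
    # Evenly spaced target distances that always include both end points
    n_new = max(int(np.ceil(total / spacing)) + 1, 2)
    target = np.linspace(0.0, total, n_new)
    # Interpolate x and y separately against distance along the line
    new_x = np.interp(target, dist, pts[:, 0])
    new_y = np.interp(target, dist, pts[:, 1])
    return np.column_stack([new_x, new_y])

# Made-up line that doubles back in x, which breaks interpolation of y against x
toy_line = [(0, 0), (10, 0), (10, 10), (0, 10)]
dense = resample_polyline(toy_line, spacing=5)
print(dense.shape)  # (7, 2)
```

Because the new points are evenly spaced along the whole line, original points are only reproduced exactly when they happen to fall on the new grid, as they do in this toy case.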

In [None]:
sample1 = '1.2.826.0.1.3680043.8.498.13137786603668786022908361036269592497'
sample2 = '1.2.826.0.1.3680043.8.498.80893755100097324352087763503154307486'
sample3 = '1.2.826.0.1.3680043.8.498.87570237940815582849204532832440641266'

## More to come...