## **0. Why this notebook?**
I want to share my EDA and helper functions for this competition, because:
1. I am building my public portfolio.
2. I do not enter competitions to win. I do it to learn and gain experience.
3. If my work helps you reach a good place in the final round, it would validate my ideas. I would like to see comments about them.

**Update in Section 4** June 17 2022: Added a function to split the dataset by cases. The function makes sure that each fold has a **similar amount of data** in terms of the number of images. This is important because the hidden test set contains cases that do not appear in the given training set. Overall, splitting by cases helps to mimic the hidden test set.

## **1. Drop Images (by ordering)**
Why should we drop some images if we can?
For neural nets, it takes **weights** to learn that something is **important** (the **input - target** relation), and it also takes **weights** to learn that something is **not important** (e.g. a **large black background**). If we can **systematically** tell that some images always (or almost always) contain no targets (large bowel, small bowel and stomach), we may drop them from the train set, which helps the neural net focus on learning the input - target relation. The whole thing works like a filter, on both the train set and the test set.

By quickly checking through the image set, we can see that **very dark images** (almost all black) never have labels, and most of the time the **first and last few images** of each day do not have labels either.

The following part finds out how many images at the beginning and end of each day we can drop.

**Summary**: If you do not want to miss any image that has targets in it, we can systematically filter out 7845 of the 38496 images, which is about 20%. If you are willing to take some risk, you may be able to filter out 9000 images while losing fewer than 50 images with targets.

The **toBeDropped_order.csv** file will be **saved** to output, so you can just **download** it for your project.

**Criteria** to filter images without missing targets: for the days that have **144** slices, drop the **first 23** images and **last 7** images. For the days that have **80** slices, drop the **first** image and **last 4** images. The image ids are in the dataframe **toBeDropped**. Feel free to check.
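The criteria above can be written as a small predicate, which is also handy at test time when no labels are available. This is only a sketch: `is_edge_slice` is a hypothetical helper name, and the day counts (259 and 15) are the ones reported later in this section.

```python
def is_edge_slice(slice_number, n_slices):
    """Return True if a 1-based slice falls in the 'safe to drop' range."""
    if n_slices == 144:
        # first 23 slices (1..23) or last 7 slices (138..144)
        return slice_number < 24 or (144 - slice_number) < 7
    if n_slices == 80:
        # first slice (1) or last 4 slices (77..80)
        return slice_number < 2 or (80 - slice_number) < 4
    return False  # unknown day length: keep the image


dropped_144 = sum(is_edge_slice(s, 144) for s in range(1, 145))
dropped_80 = sum(is_edge_slice(s, 80) for s in range(1, 81))
total_dropped = 259 * dropped_144 + 15 * dropped_80
print(dropped_144, dropped_80, total_dropped)
```

The total agrees with the 7845 images quoted in the summary.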

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
# read the label file.
mask_csv = pd.read_csv(r'../input/uw-madison-gi-tract-image-segmentation/train.csv')
# create a new column to show if one id (image) has any labels
mask_csv['hasSeg'] = mask_csv['segmentation'].notnull()
id_seg = mask_csv.groupby('id')['hasSeg'].sum().to_frame(name = 'hasSeg').reset_index()
id_seg['hasSeg'] = id_seg['hasSeg']>0
id_seg.head(5)

In [None]:
# Here we see that 21906 images do not have labels, and 16590 images have labels. 43% of the train set has labels.
id_seg.hasSeg.value_counts()

In the following cell, we extract the day and the slice number from the ID. Some days have **144** images and some have **80**. My guess is that when the doctor is sure about the location of the target, fewer images are taken, so below I treat them as two groups. The **maxSlice** column shows how many slices that day has.
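As a plain-Python illustration of that parsing, here is a minimal sketch assuming IDs look like `case123_day20_slice_0082`. `parse_image_id` is a hypothetical helper; the notebook itself does the same job with `split` inside `DataFrame.apply`, and its day key keeps a trailing underscore.

```python
def parse_image_id(image_id):
    """Split an ID such as 'case123_day20_slice_0082' into (day_key, slice_number)."""
    case, day, _, slice_str = image_id.split("_")
    return f"{case}_{day}", int(slice_str)


day_key, slice_number = parse_image_id("case123_day20_slice_0082")
print(day_key, slice_number)
```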

In [None]:
id_seg['day'] = id_seg.apply(lambda row: row['id'].split('slice')[0], axis = 1)
id_seg['slice'] = id_seg.apply(lambda row: int(row['id'].split('_')[-1]), axis = 1)

temp_slice = id_seg.groupby('day')['slice'].max().to_frame(name = 'maxSlice').reset_index()
max_slice = pd.merge(id_seg, temp_slice, on='day')
max_slice.head(5)

Only **15** ( = 1200 / 80 ) days have 80 slices. **259** ( = 37296 / 144 ) days have 144 slices.

In [None]:
max_slice.maxSlice.value_counts()

In [None]:
# Separate them
max_slice_80 = max_slice[max_slice.maxSlice == 80].copy()
max_slice_144 = max_slice[max_slice.maxSlice == 144].copy()

In the following cell, I use a for loop to check whether the first x images of each day have targets (labels), and then save the ratio = (the number of images that have labels) / (the total number of images within this distance).

For example, if distance is set to 3, then for each day I take the first 2 ( = 3 - 1) slices and check how many of them have labels. In this case 1 out of 30 images has labels, so the ratio is 0.03333333333333333.

Another example: if distance is set to 5, then for each day I take the first 4 ( = 5 - 1) slices and check how many of them have labels. In this case 5 out of 60 images have labels, so the ratio is 0.08333333333333333.

Why is this useful? It means that if the train set is **representative**, I can **safely** drop **the first image** of each day, since none of them has labels.
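To make the ratio concrete, here is a toy version in plain Python. The data is made up (3 days with 6 slices each), and `start_label_ratio` is a hypothetical helper, not the pandas code used in the next cell.

```python
def start_label_ratio(has_label_by_day, distance):
    """Fraction of labelled images among slices 1 .. distance-1 of every day."""
    n_labelled = 0
    n_total = 0
    for flags in has_label_by_day.values():
        head = flags[: distance - 1]  # flags[0] corresponds to slice 1
        n_labelled += sum(head)
        n_total += len(head)
    return n_labelled / n_total


# Made-up flags: True means the slice has at least one label.
toy_days = {
    "day_a": [False, False, True, True, True, False],
    "day_b": [False, True, True, True, False, False],
    "day_c": [False, False, False, True, True, False],
}
ratios = {d: start_label_ratio(toy_days, d) for d in range(2, 6)}
print(ratios)
```

A ratio of 0 for a given distance means that every image closer to the start than that distance is unlabelled in the data at hand, so dropping them loses no targets.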

In [None]:
ratio_80_beginning = []
for distance in range(2,10):
    max_slice_80['isCloseToStart'] = max_slice_80.apply(lambda x: (x['slice'] < distance), axis=1)
    ratio_80_beginning.append(max_slice_80[max_slice_80['isCloseToStart'] == True]['hasSeg'].sum() / (15*(distance-1)))
    #This 15 is from that 15 days have nb of slices = 80
ratio_80_beginning

In [None]:
temp = max_slice[(max_slice['maxSlice'] == 80) & (max_slice['slice'] < 3)]['hasSeg']
print('nb of images with labels within distance(3): ' + str(temp.sum()))
print('nb of images within distance(3): '  + str(temp.shape[0]))
print('ratio of images with labels:' + str(temp.sum() / temp.shape[0]))

In [None]:
temp = max_slice[(max_slice['maxSlice'] == 80) & (max_slice['slice'] < 5)]['hasSeg']
print('nb of images with labels within distance(5): ' + str(temp.sum()))
print('nb of images within distance(5): '  + str(temp.shape[0]))
print('ratio of images with labels:' + str(temp.sum() / temp.shape[0]))

The following code shows the ratio for the end of days that have 80 slices, and for the start and end of days that have 144 slices.

In [None]:
ratio_80_end = []
for distance in range(2,10):
    max_slice_80['isCloseToEnd'] = max_slice_80.apply(lambda x: (x['maxSlice'] - x['slice'])<distance-1, axis=1)
    # distance-1 slices per day are selected, over 15 days
    ratio_80_end.append(max_slice_80[max_slice_80['isCloseToEnd'] == True]['hasSeg'].sum() / (15*(distance-1)))
print(ratio_80_end)

ratio_144_beg = []
for distance in range(2,30):
    max_slice_144['isCloseToBeg'] = max_slice_144.apply(lambda x: (x['slice'] < distance), axis=1)
    ratio_144_beg.append(max_slice_144[max_slice_144['isCloseToBeg'] == True]['hasSeg'].sum() / (259*(distance-1)))
print(ratio_144_beg)

ratio_144_end = []
for distance in range(2,20):
    max_slice_144['isCloseToEnd'] = max_slice_144.apply(lambda x: (x['maxSlice'] - x['slice'])<distance-1, axis=1)
    ratio_144_end.append(max_slice_144[max_slice_144['isCloseToEnd'] == True]['hasSeg'].sum() / (259*(distance-1)))
print(ratio_144_end)

In [None]:
is_144 = max_slice['maxSlice'] == 144
is_80 = max_slice['maxSlice'] == 80
toBeDropped = max_slice[
    (is_144 & ((max_slice['slice'] < 24) | ((144 - max_slice['slice']) < 7))) |
    (is_80 & ((max_slice['slice'] < 2) | ((80 - max_slice['slice']) < 4)))
]
print('total number of dropped images with labels:' + str(toBeDropped.hasSeg.sum()))
print('total number of dropped images:' + str(toBeDropped.shape[0]))

In [None]:
toBeDropped.head(5)

In [None]:
s1 = pd.merge(mask_csv, toBeDropped, how='inner', on=['id'])
print('double check, should be 0 if all these images do not have target: '+ str(s1.hasSeg_x.sum()))
# each image id has 3 rows in train.csv (one per class), hence the division by 3
print('nb of images filtered out: ' + str(s1.shape[0] // 3))

In [None]:
# save the csv file.
toBeDropped.to_csv('toBeDropped_order.csv',index=False)

## **2. Drop Images (by intensity)**
As in the previous section, we are going to check whether we can drop some images based on their **intensity**. Some images have a **lower average intensity** (and differ in other features), so we may be able to create filters that drop images which almost certainly do not contain any targets.

For example, among the images that have targets in them, the **lowest average image intensity** is **29.66**. This means that if a given image has an average intensity below 29.66, we could treat it as a non-informative image and drop it from the train set. The same filter would be applied at test time.

However, we need a **safety margin** here, since the given train set is most likely not 100% representative. Here I set the safety margin to 0.8 (`safty_margin` in the code), which means that if an image has an average intensity below **29.66 * 0.8 ≈ 23.7**, I drop it from the train set, or just give an all-zero prediction if it is in the test set.

The safety margin works the same way for the other features.
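The threshold logic can be sketched in a few lines. `drop_threshold` and `should_drop` are hypothetical helper names, and the example values 20.0 and 25.0 are made up; only 29.66 and 0.8 come from this section.

```python
def drop_threshold(min_labelled_value, safety_margin=0.8):
    """Cut-off below which an image is treated as non-informative."""
    return min_labelled_value * safety_margin


def should_drop(feature_value, min_labelled_value, safety_margin=0.8):
    """True if the feature is below the margin-adjusted minimum of labelled images."""
    return feature_value < drop_threshold(min_labelled_value, safety_margin)


threshold = drop_threshold(29.66)        # 29.66 * 0.8
very_dark = should_drop(20.0, 29.66)     # below the threshold
borderline = should_drop(25.0, 29.66)    # below 29.66 but protected by the margin
print(threshold, very_dark, borderline)
```

An image with average intensity 25.0 is darker than every labelled training image, yet it is kept, which is exactly the protection the margin is meant to give.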

I also made more features:
1. ratio_nonZero: the ratio of number of non zero pixels to total number of pixels, for each image.
2. ratio_larger10: the ratio of the number of pixels with a value larger than 10 to the total number of pixels, for each image.
3. ratio_larger20, ratio_larger50, ratio_larger100 are defined similarly.

**Summary**: With a **safety margin** of 0.8, this method alone can drop **2888** images. If you **combine** it with the previous method, you can drop **8334** images in total. Since the ordering method alone drops 7845, only 489 of the 2888 dark images are new, which means that most of the dark images are taken at the beginning or the end of the days.

The **toBeDropped_intensity.csv** file will be **saved** to output, so you can just **download** it for your project.

In [None]:
import os
import cv2

The major part of the following decode function is from:
https://www.kaggle.com/code/awsaf49/uwmgi-unet-train-pytorch

**Note**: Output is RGB, in which **Red: Large bowel**, **Green: Small bowel**, **Blue: Stomach**
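For readers new to the run-length encoding, here is a tiny pure-Python example on a made-up 3 x 4 image. It assumes, like the `decode` function in this notebook, that the string holds `start length` pairs, that starts are 1-based, and that pixels are numbered row by row. `rle_decode_toy` is a hypothetical name used only for this illustration.

```python
def rle_decode_toy(rle, shape):
    """Decode 'start length start length ...' into a nested list of 0/1 rows."""
    height, width = shape
    flat = [0] * (height * width)
    numbers = [int(token) for token in rle.split()]
    for start, length in zip(numbers[0::2], numbers[1::2]):
        for idx in range(start - 1, start - 1 + length):  # starts are 1-based
            flat[idx] = 1
    # cut the flat list back into rows
    return [flat[r * width:(r + 1) * width] for r in range(height)]


toy_mask = rle_decode_toy("2 3 8 2", shape=(3, 4))
for row in toy_mask:
    print(row)
```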

In [None]:
# Helper function to decode mask.
# Inputs: masks_csv: the csv file that contain all the label information. Straight from competition.
# id: the image id. It can only decode one id at a time.
# shape: the shape of the corresponding image.
def decode(masks_csv, id, shape):
    temp_masks = masks_csv[masks_csv['id'] == id]
    temp_mask_list = []
    for name in ['large_bowel', 'small_bowel', 'stomach']:
        temp_mask_rle = temp_masks[temp_masks['class']==name]['segmentation'].values[0]
        h, w = shape
        img = np.zeros((h * w,), dtype=np.float32)
        if isinstance(temp_mask_rle, str):  # NaN (a float) means no mask for this class
            s = temp_mask_rle.split()
            s = list(map(int, s))
            starts = np.array(s[0::2]) - 1
            lengths = s[1::2]
            ends = starts + lengths
            for lo, hi in zip(starts, ends):
                img[lo : hi] = 1
        img = img.reshape(shape)
        temp_mask_list.append(img)

    mask = np.stack(temp_mask_list, -1).astype(np.uint8)
    return mask

In [None]:
mask_stats = pd.DataFrame(columns=['id', 'hasMask', 'avgIntensity', \
                'ratio_nonZero', 'ratio_larger10', 'ratio_larger20', 'ratio_larger50', 'ratio_larger100'])

src = r'../input/uw-madison-gi-tract-image-segmentation/train'

# counter = 0
for path, subdirs, files in os.walk(src):
    if len(files)>0:
        for name in files: 
            # counter += 1
            # if counter <10:
            case_temp = path.split('/')[-3]
            day_temp = (path.split('/')[-2]).split('_')[-1]
            filename_original = os.path.join(path, name)
            image = cv2.imread(filename_original, cv2.IMREAD_UNCHANGED)
            filename_dest = case_temp+'_' + day_temp + '_' + name
            splits_temp = filename_dest.split('_')

            shape_temp = (int(splits_temp[5]),int(splits_temp[4]))
            id_temp = splits_temp[0] +'_' + splits_temp[1] +'_' + splits_temp[2] +'_' +splits_temp[3]
            mask = decode(mask_csv, id_temp, shape_temp)

            nb_pixels = image.shape[0]*image.shape[1]
            
            temp_row = pd.DataFrame({'id':id_temp, 
            'hasMask':(mask.max()>0) + 0, 
            'avgIntensity':np.mean(image),
            'ratio_nonZero':(image>0).sum() / nb_pixels, 
            'ratio_larger10': (image>10).sum() / nb_pixels, 
            'ratio_larger20': (image>20).sum() / nb_pixels, 
            'ratio_larger50': (image>50).sum() / nb_pixels, 
            'ratio_larger100': (image>100).sum() / nb_pixels},  index=[0])
            mask_stats = pd.concat([mask_stats, temp_row],ignore_index= True)

In [None]:
mask_stats.head(2)

In [None]:
mask_stats[mask_stats.hasMask == 1].avgIntensity.min()

Visualization: you can clearly see that for some bins, none of the images within that bin contain targets.

In [None]:
import matplotlib.pyplot as plt
mask = mask_stats[mask_stats.hasMask == 1]
noMask = mask_stats[mask_stats.hasMask == 0]
plt.figure(figsize=[18,9])
column_names = ['avgIntensity', 'ratio_nonZero', 'ratio_larger10', 'ratio_larger20', 'ratio_larger50', 'ratio_larger100']
for i in range(len(column_names)):
    cur_subplot = 231 + i
    cur_feature = column_names[i]
    plt.subplot(cur_subplot)
    plt.hist(x = [mask[cur_feature], noMask[cur_feature]], stacked=False,label = ['mask','noMask'], color = ['skyblue','orange'])
    plt.xlabel(cur_feature)
    plt.ylabel('# of images')
    plt.legend()

In [None]:
safty_margin = 0.8

toBeDropped_intensity = pd.DataFrame()
for i in column_names:
    temp_min = mask_stats[mask_stats.hasMask == 1][i].min()
    temp_toBeDropped = mask_stats[(mask_stats.hasMask == 0) & (mask_stats[i] < temp_min*safty_margin)]
    toBeDropped_intensity = pd.concat([toBeDropped_intensity, temp_toBeDropped])
toBeDropped_intensity = toBeDropped_intensity.drop_duplicates()

In [None]:
toBeDropped_intensity.head(5)

In [None]:
toBeDropped_intensity.to_csv('toBeDropped_intensity.csv',index=False)

In [None]:
len(set(toBeDropped_intensity.id.tolist() + toBeDropped.id.tolist()))

## **3. False Annotation**
Thanks to https://www.kaggle.com/competitions/uw-madison-gi-tract-image-segmentation/discussion/319963 and https://www.kaggle.com/competitions/uw-madison-gi-tract-image-segmentation/discussion/321979. 

**Masks for case7_day0 and case81_day30 are incorrect. Some masks are missing for case138_day0.**

**Summary**: These two discussions deserve more attention. False annotation is a common and serious problem in deep learning. Especially with a medium-sized dataset like this one, false annotations can be critical enough to affect the final result. I personally **removed all slices** from these 3 days.
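Removing those three days can be done with a simple filter on the image ID. This is a sketch under the assumption that IDs look like `case7_day0_slice_0001`; `BAD_DAYS` and `is_from_bad_day` are hypothetical names, and the IDs in the example list are made up for illustration.

```python
BAD_DAYS = {"case7_day0", "case81_day30", "case138_day0"}


def is_from_bad_day(image_id):
    """True if the image belongs to one of the days with known annotation problems."""
    case, day = image_id.split("_")[:2]
    # compare the exact 'case_day' key instead of using startswith()
    return f"{case}_{day}" in BAD_DAYS


example_ids = [
    "case7_day0_slice_0001",
    "case7_day13_slice_0001",
    "case77_day0_slice_0010",
    "case138_day0_slice_0100",
]
kept = [i for i in example_ids if not is_from_bad_day(i)]
print(kept)
```

Matching on the exact key keeps, for example, other days of case7 in the train set.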

## **4. Train-Test Split**
Data description: "Each case in this competition is represented by multiple sets of scan slices (each set is identified by the day the scan took place). **Some cases are split by time (early days are in train, later days are in test) while some cases are split by case - the entirety of the case is in train or test.** The goal of this competition is to be able to generalize to both partially and wholly unseen cases."
From https://www.kaggle.com/competitions/uw-madison-gi-tract-image-segmentation/data.

When we do the train-test split, we should try our best to **mimic the hidden test set**. When doing **N-fold cross validation / ensembling**, we should split the data by **days** or even by **cases**. I personally split by days, since this makes it easier to ensure that all folds have a similar number of images. Later I may try to randomly split by both cases and days.

**Do NOT just copy all images together into one folder and split them.**
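A quick way to check a split for this kind of leakage is to compare the groups (days or cases) on both sides. The sketch below assumes IDs of the form `case123_day20_slice_0082`; `group_key` and `leaked_groups` are hypothetical helpers and the IDs are made up.

```python
def group_key(image_id, level="day"):
    """Return the case ('case123') or the day ('case123_day20') an image belongs to."""
    case, day = image_id.split("_")[:2]
    return case if level == "case" else f"{case}_{day}"


def leaked_groups(train_ids, valid_ids, level="day"):
    """Groups that appear on both sides of the split (should be empty)."""
    train_groups = {group_key(i, level) for i in train_ids}
    valid_groups = {group_key(i, level) for i in valid_ids}
    return sorted(train_groups & valid_groups)


train_ids = ["case1_day0_slice_0001", "case1_day0_slice_0002", "case2_day5_slice_0001"]
valid_ids = ["case1_day7_slice_0001", "case3_day0_slice_0001"]

print(leaked_groups(train_ids, valid_ids, level="day"))   # split is clean by day
print(leaked_groups(train_ids, valid_ids, level="case"))  # but case1 is on both sides
```

A split that is clean at the day level can still share cases between folds, which only mimics the "split by time" part of the hidden test set.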

**Update**: added a function to split the training data into folds that have a similar amount of data in terms of the number of images. Some cases have more days and some have fewer, so when splitting training data by cases, it is important to make sure the split is even. The code below uses each case's folder size (in MB) as a proxy for its number of images. The first two helper functions are from: https://stackoverflow.com/questions/3420937/algorithm-to-find-which-number-in-a-list-sum-up-to-a-certain-number. They use dynamic programming (a memoized subset-sum search) to find a group of cases whose total size hits the target, so that the number of images in each fold is similar.
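If an exact subset-sum solution is not needed, a simpler alternative is a greedy heuristic: sort the cases from largest to smallest and always give the next case to the currently lightest fold. Note that this is a different technique from the subset-sum approach used in this notebook; it works for any number of folds but only approximates an even split. `greedy_balanced_folds` and the toy sizes are made up for illustration.

```python
import heapq


def greedy_balanced_folds(sizes, n_folds):
    """Assign each case to the fold with the smallest running total (largest cases first)."""
    heap = [(0, fold) for fold in range(n_folds)]  # (running total, fold index)
    heapq.heapify(heap)
    folds = [[] for _ in range(n_folds)]
    # sort by size descending, then by name so the result is deterministic
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, fold = heapq.heappop(heap)
        folds[fold].append(name)
        heapq.heappush(heap, (total + size, fold))
    return folds


toy_sizes = {"caseA": 30, "caseB": 25, "caseC": 20, "caseD": 15, "caseE": 10, "caseF": 10}
folds = greedy_balanced_folds(toy_sizes, n_folds=2)
fold_totals = [sum(toy_sizes[c] for c in fold) for fold in folds]
print(folds, fold_totals)
```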

In [None]:
def f(v, i, S, memo):
    if i >= len(v): return 1 if S == 0 else 0
    if (i, S) not in memo:  # <-- Check if value has not been calculated.
        count = f(v, i + 1, S, memo)
        count += f(v, i + 1, S - v[i], memo)
        memo[(i, S)] = count  # <-- Memoize calculated result.
    return memo[(i, S)]     # <-- Return memoized value.
def g(v, S, memo):
    subset = []
    for i, x in enumerate(v):
        # Check if there is still a solution if we include v[i]
        if f(v, i + 1, S - x, memo) > 0:
            subset.append(x)
            S -= x
    return subset

In [None]:
# helper function, get the size of a folder
def get_dir_size(path='.'):
    total = 0
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file():
                total += entry.stat().st_size
            elif entry.is_dir():
                total += get_dir_size(entry.path)
    return total

In [None]:
import os
import numpy as np
full_directory = r'../input/uw-madison-gi-tract-image-segmentation/train'
# create a dict that maps the case name with its folder size, in MB
size_dirc = {}
for temp_dirc in os.listdir(full_directory):
    # use a plain Python int: np.uint8 would overflow for folders larger than 255 MB
    temp_size = int(round(get_dir_size(os.path.join(full_directory, temp_dirc)) / (1024*1024)))
    size_dirc[temp_dirc] = temp_size

In [None]:
# input: dictionary that has the case name as key, and the size of the folder as value
# output: two dictionaries that have a similar amount of data
def split_cases_twoGroups(size_directory):
    size_list = np.array(list(size_directory.values()))
#     since the previous 0-1 Knapsack Problem function depends on the order of the options, 
#     here I shuffle the order first to introduce randomness
    np.random.seed(42)
    np.random.shuffle(size_list)
#     the target here is the amount of data each fold should have.
#     you may change it, if you want to split it to 5 fold or something else
    target = int(size_list.sum()/2)
    memo = dict()
    if f(size_list, 0, target, memo) == 0:
        # without this, half_sizes would be undefined further down
        raise ValueError("There are no valid subsets.")
    half_sizes = np.array(g(size_list, target, memo))

    size_dirc_half_0 = size_directory.copy()
    size_dirc_half_1 = {}
    used_key = []
    for size in half_sizes:
#         find the key by searching value. Since there could be more than one key that have the same value
#         need to pop the key after it is found
#         the left over dict is the second half
        key_temp = list(size_dirc_half_0.keys())[list(size_dirc_half_0.values()).index(size)]
        size_dirc_half_1[key_temp] = size
        size_dirc_half_0.pop(key_temp)
    return size_dirc_half_0, size_dirc_half_1

In [None]:
size_dirc_half_0, size_dirc_half_1 = split_cases_twoGroups(size_directory = size_dirc)
size_dirc_quarter_0, size_dirc_quarter_1 = split_cases_twoGroups(size_directory = size_dirc_half_0)
size_dirc_quarter_2, size_dirc_quarter_3 = split_cases_twoGroups(size_directory = size_dirc_half_1)

In [None]:
size_dirc_quarter_0

In [None]:
size_dirc_quarter_1

In [None]:
print('size of the 1 quarter (MB): ' + str(np.array(list(size_dirc_quarter_0.values())).sum()))
print('size of the 2 quarter (MB): ' + str(np.array(list(size_dirc_quarter_1.values())).sum()))
print('size of the 3 quarter (MB): ' + str(np.array(list(size_dirc_quarter_2.values())).sum()))
print('size of the 4 quarter (MB): ' + str(np.array(list(size_dirc_quarter_3.values())).sum()))

## **5. Create 2.5D Images / Files**
Credits to the idea here: https://www.kaggle.com/competitions/uw-madison-gi-tract-image-segmentation/discussion/322549. 2.5D images give the model **inter-slice (depth)** context from neighboring slices, on top of the in-plane **spatial** information from a single image. The following function generates a 2.5D image, given a center image.

**Note**: 
1. In the result of the function, the **target** image (the "center" image) is placed as the **first slice** in the stack. This is for convenience. Such **ordering** will not hurt model performance, since the neural network will **automatically** decide which layer is more important. As long as the ordering is **consistent**, the ordering itself does not matter.
2. For this 2.5D function, there can be **many slices (images)** but only **one mask**, which corresponds to the "center" image. This is because the 2.5D image is used to give more information to the CNN, so it is **N to 1** segmentation. If **more than one mask** were used, it would become **N to N** segmentation and be unnecessarily complicated.
3. This helper function supports **augmentation** (only **Albumentations**, because it has nice support for 16 bit images). If an augmentation is passed, all slices and the mask receive **exactly the same** augmentation, which is done by adding **additional targets** to the augmentation pipeline.

**Parameters**:
1. **masks_csv**: the label mask csv. This is needed to get decoded mask.
2. **image_path**: the path to the "center" image. Again, this image will be at index 0 in the output.
3. **nb_layers_oneSide**: the number of layers on one side. E.g. if this is set to 3, the output will have 3*2+1 = 7 layers (images): the center image, 3 images from the left, and 3 images from the right.
4. **stride**: the stride used to pick images. E.g. if the slice number of the center image is 10, nb_layers_oneSide = 3 and stride = 2, the output will contain slices 4, 6, 8, 10, 12, 14, 16.
5. **augmentation**: optional. If used, a transform from Albumentations is needed.
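The slice selection described by **nb_layers_oneSide** and **stride** can be sketched as follows. `neighbour_slices` and `slice_token` are hypothetical helpers, shown only to illustrate the ordering (center first) and the zero-padded file name pattern `slice_0008`.

```python
def neighbour_slices(center, nb_layers_one_side, stride):
    """Slice numbers in output order: the center first, then the neighbours in ascending order."""
    offsets = [k * stride for k in range(-nb_layers_one_side, nb_layers_one_side + 1)]
    neighbours = [center + off for off in offsets if off != 0]
    return [center] + neighbours


def slice_token(slice_number):
    """Zero-padded token as it appears in the scan file names."""
    return f"slice_{slice_number:04d}"


order = neighbour_slices(center=10, nb_layers_one_side=3, stride=2)
print(order)
print([slice_token(n) for n in order])
```

Slice numbers that do not exist on disk (for example near the start or end of a day) are filled with all-zero layers by the function below.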

At the end, if needed, I recommend saving the output as a NumPy file, since it can contain more than 3 layers.

In [None]:
import cv2
import os
from PIL import Image

In [None]:
def twoPointFiveD(masks_csv, image_path, nb_layers_oneSide, stride, augmentation = None):
    # parse the file path, and get case, day, shape and ID
    case_temp = image_path.split('/')[-4]
    day_temp = (image_path.split('/')[-3]).split('_')[-1]
    filename = case_temp + '_' + day_temp + '_' +image_path.split('/')[-1]
    splits_temp = filename.split('_')
    shape_temp = (int(splits_temp[5]),int(splits_temp[4]))
    id_temp = splits_temp[0] +'_' + splits_temp[1] +'_' + splits_temp[2] +'_' +splits_temp[3]

    image = cv2.imread(image_path, cv2.IMREAD_UNCHANGED)
    mask = decode(masks_csv, id_temp, shape_temp)
    

    # get slice offsets
    slice_number = int(image_path.split('slice_')[1].split('_')[0])
    offsets = np.arange(-nb_layers_oneSide, nb_layers_oneSide+1)*stride

    # load other slices and stack them together
    # NOTE: The target image will be the first image with index 0, instead of being in the middle like sandwich
    layers = [image]
    for offset in offsets:
        temp_zeros = (4-len(str(slice_number+offset)))*'0'
        temp_path = image_path.split('slice_')[0] + 'slice_' + temp_zeros +str(slice_number+offset) + image_path.split('slice_')[1][4:]
        temp_layer = None
        # print(temp_path)
        if offset ==0:
            continue
        elif not os.path.isfile(temp_path):
            # keep the dtype of the real slices (np.zeros would default to float64)
            layers.append(np.zeros_like(image))
        else:
            layers.append(cv2.imread(temp_path, cv2.IMREAD_UNCHANGED))
    result = np.moveaxis(np.stack(layers), 0, -1)

    # if augmentation, all images and one single mask will have exactly the same augmentation
    # NOTE: The augmented target image will be the first image with index 0, instead of being in the middle like sandwich
    if augmentation:
        additional_targets = {}
        for i in range(1,nb_layers_oneSide*2+1):
            additional_targets['image' + str(i)] = 'image'
        augmentation.add_targets(additional_targets)

        args = {'image':result[:,:,0], 'mask':mask}
        for i in range(1,nb_layers_oneSide*2+1):
            args['image' + str(i)] = result[:,:,i]
        transformed = augmentation(**args)

        result_layers = [transformed['image']]
        for key in additional_targets.keys():
            result_layers.append(transformed[key])
        # take the augmented mask once, outside the loop, so it is also set when there are no extra layers
        mask = transformed['mask']
        result = np.moveaxis(np.stack(result_layers), 0, -1)
    return result, mask

In [None]:
image_path = '../input/uw-madison-gi-tract-image-segmentation/train/case145/case145_day19/scans/slice_0102_360_310_1.50_1.50.png'

In [None]:
nb_layers_oneSide = 1
stride = 2
images_stacked, mask = twoPointFiveD(masks_csv = mask_csv, image_path= image_path, 
            nb_layers_oneSide = nb_layers_oneSide, stride = stride, augmentation = None)

In [None]:
colors = {0:(255,0,0), 1:(0,255,0),2:(0,0,255),}
image_original = images_stacked[:,:,0]
mask_original = mask
image_toShow = ((image_original/image_original.max())*255).astype(np.uint8)
image_contour = cv2.cvtColor(image_toShow.copy(),cv2.COLOR_GRAY2BGR)

for i in range(3):
    output_temp_class = np.zeros(image_original.shape).astype(np.uint8)
    output_temp_class = mask_original[:,:,i]
    contours, hierarchy = cv2.findContours(output_temp_class, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    _ = cv2.drawContours(image_contour, contours, contourIdx = -1, color = colors[i], thickness = 1)
Image.fromarray(image_contour)

In [None]:
import albumentations as A
transform = A.Compose([
    A.ToFloat(max_value = 65535.0),
    A.Affine(scale=(0.9,1.1), rotate = (-15,15), shear = (-7,7), p = 0.5),
    A.RandomBrightnessContrast(brightness_limit=0.001, contrast_limit=0.001, p = 0.5),
    A.GaussNoise(var_limit=0.0000002, p = 0.5),
    A.RandomCrop(width=224, height=224),
    A.FromFloat(max_value = 65535.0),
])

nb_layers_oneSide = 2
stride = 2

image_path = '../input/uw-madison-gi-tract-image-segmentation/train/case123/case123_day20/scans/slice_0082_266_266_1.50_1.50.png'
images_stacked, mask = twoPointFiveD(masks_csv = mask_csv, image_path= image_path, 
            nb_layers_oneSide = nb_layers_oneSide, stride = stride, augmentation = transform)
print(images_stacked.shape)

In [None]:
colors = {0:(255,0,0), 1:(0,255,0),2:(0,0,255),}
image_original = images_stacked[:,:,0]
mask_original = mask
image_toShow = ((image_original/image_original.max())*255).astype(np.uint8)
image_contour = cv2.cvtColor(image_toShow.copy(),cv2.COLOR_GRAY2RGB)

for i in range(3):
    output_temp_class = np.zeros(image_original.shape).astype(np.uint8)
    output_temp_class = mask_original[:,:,i]
    contours, hierarchy = cv2.findContours(output_temp_class, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    _ = cv2.drawContours(image_contour, contours, contourIdx = -1, color = colors[i], thickness = 1)
Image.fromarray(image_contour)

## **6. Augmentation**
Augmentation is very project-specific. A good augmentation should improve the **robustness** of a model trained on the dataset, which helps in production to deal with subtle **randomness** (camera angle change, dust, ambient light change, etc.).

Here I would like to share my augmentation. Since the original images are **16 bit**, everything about augmenting brightness and contrast is **trickier** than with 8 bit images. You will see later that I set some parameters **extremely small**, because otherwise the pixel values just go crazy. I am not sure if this is a bug in Albumentations. Anyhow, the result of this augmentation seems reasonable to me.
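One possible reading of why the limits have to be so small (an assumption on my side, not verified against the Albumentations source): after `A.ToFloat(max_value = 65535.0)` the image lives in the range [0, 1], so a brightness shift of `b` on that scale corresponds to about `b * 65535` raw 16 bit counts. The sketch below only does this arithmetic. `counts_for_limit` is a hypothetical helper, and 0.2 is just an example of a limit that would look ordinary for 8 bit images.

```python
MAX_16BIT = 65535.0  # same max_value as passed to A.ToFloat / A.FromFloat


def counts_for_limit(limit, max_value=MAX_16BIT):
    """Raw 16 bit counts that a shift of `limit` on the [0, 1] scale represents."""
    return limit * max_value


large_shift = counts_for_limit(0.2)     # example of an "ordinary" limit (assumption)
small_shift = counts_for_limit(0.0005)  # the brightness_limit used in transform_aug
print(large_shift, small_shift)
```

Under that assumption, a limit of 0.2 could shift pixels by about 13,000 counts, while the darkest labelled image in Section 2 has an average intensity of only 29.66. A limit of 0.0005 corresponds to roughly 33 counts, which is on the same scale as the image content.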

Also, instead of using a random crop, I iterate over the 4 corners of each image with a crop size of 224 (changeable, depending on your neural net). For example, for an image of size (310, 360), the position of each corner crop is:
1. Top left:  [0:224, 0:224]
2. Top right: [0:224, 136:360]
3. Bottom left: [86:310, 0:224]
4. Bottom right: [86:310, 136:360]

In this case, the **center part** of the image will be most likely **repeated** in each corner image. This is actually going to be helpful for training, because in most of the images, the center part contains more information and more targets.
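The corner positions and the region shared by all four crops can be computed for any image size. This is a minimal sketch; `corner_windows` and `shared_region` are hypothetical helpers, and each window is given as (row_start, row_end, col_start, col_end).

```python
def corner_windows(height, width, crop=224):
    """Crop windows for top-left, top-right, bottom-left and bottom-right corners."""
    if height < crop or width < crop:
        raise ValueError("image is smaller than the crop size")
    row_starts = (0, height - crop)
    col_starts = (0, width - crop)
    return [(r, r + crop, c, c + crop) for r in row_starts for c in col_starts]


def shared_region(windows):
    """Rectangle that lies inside every window (the repeated center part)."""
    row_start = max(w[0] for w in windows)
    row_end = min(w[1] for w in windows)
    col_start = max(w[2] for w in windows)
    col_end = min(w[3] for w in windows)
    return row_start, row_end, col_start, col_end


windows = corner_windows(310, 360)
center = shared_region(windows)
print(windows)
print(center)
```

For the (310, 360) example the shared center is rows 86:224 and columns 136:224, so that part of the image shows up in all four corner crops.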

**NOTE**: I apply augmentation to the **FULL** image **BEFORE** the corner cropping, because augmenting (translating, scaling down, ...) may generate garbage pixels (zeros). Applying augmentation to the full image and then cropping helps to reduce such garbage pixels.

In [None]:
import albumentations as A
import cv2

transform_aug = A.Compose([
    A.ToFloat(max_value = 65535.0),
    A.Affine(translate_percent = (0,0.1), scale=(0.9,1.1), rotate = (-15,15), shear = (-7,7), p = 0.75),
    A.RandomBrightnessContrast(brightness_limit=0.0005, contrast_limit=0.0005, p = 0.25),
    A.GaussNoise(var_limit=0.0000001, p = 0.25),
    A.FromFloat(max_value = 65535.0),
])

In [None]:
#  The first part is same as creating 2.5D image. Just to get the id and decoded mask.
image_path = '../input/uw-madison-gi-tract-image-segmentation/train/case145/case145_day19/scans/slice_0102_360_310_1.50_1.50.png'
case_temp = image_path.split('/')[-4]
day_temp = (image_path.split('/')[-3]).split('_')[-1]
filename = case_temp + '_' + day_temp + '_' +image_path.split('/')[-1]
splits_temp = filename.split('_')
shape_temp = (int(splits_temp[5]),int(splits_temp[4]))
id_temp = splits_temp[0] +'_' + splits_temp[1] +'_' + splits_temp[2] +'_' +splits_temp[3]

image = cv2.imread(image_path, cv2.IMREAD_UNCHANGED)
mask = decode(mask_csv, id_temp, shape_temp)

# Calculating offsets.
offset_height = image.shape[0] - 224
offset_width = image.shape[1] - 224
offsets = [(0,0), (offset_height,0), (0, offset_width), (offset_height, offset_width)]
image_patches = []
mask_patches = []

# for each corner, augment the full image and then crop.
for offset_index in range(4):
    offset_height_temp, offset_width_temp = offsets[offset_index]
    transformed = transform_aug(image=image, mask = mask)
    transformed_image = transformed['image']
    transformed_mask = transformed['mask']
    image_aug_temp = transformed_image[offset_height_temp:offset_height_temp+224, offset_width_temp:offset_width_temp+224]
    mask_aug_temp = transformed_mask[offset_height_temp:offset_height_temp+224, offset_width_temp:offset_width_temp+224,:]
    image_patches.append(image_aug_temp)
    mask_patches.append(mask_aug_temp)


In [None]:
from PIL import Image
# a helper function to visualize the mask and the image.
def visulize_contour_mask(image, mask):
    colors = {0:(255,0,0), 1:(0,255,0),2:(0,0,255),}
    image_original = image
    mask_original = mask
    image_toShow = ((image_original/image_original.max())*255).astype(np.uint8)
    image_contour = cv2.cvtColor(image_toShow.copy(),cv2.COLOR_GRAY2BGR)

    for i in range(3):
        output_temp_class = np.zeros(image_original.shape).astype(np.uint8)
        output_temp_class = mask_original[:,:,i]
        contours, hierarchy = cv2.findContours(output_temp_class, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
        _ = cv2.drawContours(image_contour, contours, contourIdx = -1, color = colors[i], thickness = 1)
    return Image.fromarray(image_contour)

In [None]:
visulize_contour_mask(image_patches[0], mask_patches[0])

In [None]:
visulize_contour_mask(image_patches[1], mask_patches[1])