
# Set up environment

In [1]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
!git clone https://github.com/yxmauw/gi-im-segmentation-2d.git

Cloning into 'gi-im-segmentation-2d'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 16 (delta 4), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (16/16), done.


In [3]:
!mkdir -p ~/.kaggle # make a directory named ".kaggle" (no error if it already exists)

In [4]:
!cp ./gi-im-segmentation-2d/kaggle.json ~/.kaggle/ # copy "kaggle.json" into this new directory

In [5]:
!chmod 600 ~/.kaggle/kaggle.json # restrict the key file to owner read/write only

In [6]:
!kaggle competitions download -c uw-madison-gi-tract-image-segmentation # download dataset

Downloading uw-madison-gi-tract-image-segmentation.zip to /content
 99% 2.28G/2.30G [00:19<00:00, 136MB/s]
100% 2.30G/2.30G [00:19<00:00, 124MB/s]


In [None]:
!unzip uw-madison-gi-tract-image-segmentation.zip #unzip folders

# Data

Since predictions cannot be submitted for a Kaggle score, the train, validation and test sets must all be generated from the competition's training set.
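
One way to build such a split is sketched below, under assumptions the notebook does not state: the split is done at the case level, so slices from one case never land in more than one subset. The helper name `split_cases`, the 70/15/15 fractions and the seed are illustrative choices, and the case ids are made up.

```python
import numpy as np

def split_cases(case_ids, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle unique case ids and cut them into train/val/test lists (hypothetical helper)."""
    cases = np.array(sorted(set(case_ids)))  # sort first so the result depends only on the seed
    rng = np.random.default_rng(seed)
    rng.shuffle(cases)
    n_test = int(round(len(cases) * test_frac))
    n_val = int(round(len(cases) * val_frac))
    test = cases[:n_test].tolist()
    val = cases[n_test:n_test + n_val].tolist()
    train = cases[n_test + n_val:].tolist()
    return train, val, test

# toy case ids; in practice these would be parsed from the 'id' column of train.csv
demo_cases = [f"case{i}" for i in range(1, 21)]
train_cases, val_cases, test_cases = split_cases(demo_cases)
print(len(train_cases), len(val_cases), len(test_cases))
```

Splitting by case rather than by slice is a design choice: neighbouring slices of the same scan look alike, so a slice-level split could leak near-duplicates between subsets.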

In [8]:
import numpy as np
import pandas as pd

In [14]:
# read mask annotations
mask_df = pd.read_csv('train.csv')
print(len(mask_df))

# find out how many rows have a mask
im_with_mask = mask_df.loc[mask_df['segmentation'].notna(), :]
print(len(im_with_mask))
display(im_with_mask.head(3))

115488
33913


Unnamed: 0,id,class,segmentation
194,case123_day20_slice_0065,stomach,28094 3 28358 7 28623 9 28889 9 29155 9 29421 ...
197,case123_day20_slice_0066,stomach,27561 8 27825 11 28090 13 28355 14 28620 15 28...
200,case123_day20_slice_0067,stomach,15323 4 15587 8 15852 10 16117 11 16383 12 166...


In [15]:
im_with_mask['class'].value_counts()

large_bowel    14085
small_bowel    11201
stomach         8627
Name: class, dtype: int64

In [16]:
mask_df['class'].value_counts()

large_bowel    38496
small_bowel    38496
stomach        38496
Name: class, dtype: int64

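Only the rows that have a segmentation mask are considered from here on. That leaves 33,913 annotated rows; because `train.csv` has one row per slice and class, a single slice can account for up to three of them.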
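
Because `train.csv` holds one row per slice and class, the number of annotated rows and the number of distinct annotated slices can differ. A minimal sketch on made-up data (the ids and mask strings below are placeholders, not values from the dataset) shows how both could be counted:

```python
import pandas as pd

# toy stand-in for train.csv: one row per (slice id, class); values are made up
toy = pd.DataFrame({
    "id": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "class": ["large_bowel", "small_bowel", "stomach"] * 2,
    "segmentation": ["10 2", None, "30 4", None, None, None],
})

annotated = toy[toy["segmentation"].notna()]
n_rows = len(annotated)               # annotated (slice, class) rows
n_slices = annotated["id"].nunique()  # distinct slices with at least one mask
print(n_rows, n_slices)
```

On the real data the same two lines would be applied to `im_with_mask`.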

According to other users on Kaggle, case7 and case81 are segmented incorrectly.
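
The notebook does not show whether these two cases are dropped. If one chose to exclude them, a sketch could look like the following; the `bad_cases` list and the toy ids are illustrative. Matching on the extracted case id rather than on a substring avoids also dropping `case77`.

```python
import pandas as pd

bad_cases = ["case7", "case81"]  # cases flagged by Kaggle users, per the note above

# made-up ids in the same format as the 'id' column
toy = pd.DataFrame({"id": [
    "case7_day0_slice_0001",
    "case77_day20_slice_0065",
    "case81_day30_slice_0010",
    "case123_day20_slice_0065",
]})

# pull out the exact case id so 'case7' does not also match 'case77'
toy["case"] = toy["id"].str.extract(r"^(case\d+)_", expand=False)
kept = toy[~toy["case"].isin(bad_cases)]
print(kept["id"].tolist())
```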

In [21]:
# use a regex to extract the case number from each id into a new 'case' column
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
im_with_mask = im_with_mask.assign(
    case=im_with_mask['id'].str.extract(r'(case[0-9]+)', expand=False))
im_with_mask.head(3)


Unnamed: 0,id,class,segmentation,case
194,case123_day20_slice_0065,stomach,28094 3 28358 7 28623 9 28889 9 29155 9 29421 ...,case123
197,case123_day20_slice_0066,stomach,27561 8 27825 11 28090 13 28355 14 28620 15 28...,case123
200,case123_day20_slice_0067,stomach,15323 4 15587 8 15852 10 16117 11 16383 12 166...,case123


In [28]:
# how many unique cases in im_with_mask
display(im_with_mask['case'].unique())
print(len(im_with_mask['case'].unique()))

array(['case123', 'case77', 'case133', 'case129', 'case139', 'case130',
       'case88', 'case44', 'case145', 'case15', 'case110', 'case42',
       'case118', 'case66', 'case91', 'case142', 'case58', 'case63',
       'case114', 'case102', 'case115', 'case65', 'case53', 'case122',
       'case125', 'case117', 'case140', 'case134', 'case9', 'case113',
       'case90', 'case49', 'case19', 'case6', 'case67', 'case154',
       'case135', 'case84', 'case147', 'case101', 'case7', 'case119',
       'case32', 'case24', 'case33', 'case22', 'case149', 'case11',
       'case148', 'case124', 'case111', 'case89', 'case136', 'case116',
       'case143', 'case35', 'case108', 'case43', 'case55', 'case141',
       'case92', 'case16', 'case131', 'case81', 'case34', 'case36',
       'case20', 'case121', 'case29', 'case18', 'case138', 'case146',
       'case144', 'case40', 'case54', 'case78', 'case47', 'case156',
       'case85', 'case107', 'case41', 'case80', 'case2', 'case74',
       'case30'], dtype=object)

85


In [29]:
# count the case folders in the original training data
import glob
folder = glob.glob("train/*")
len(folder)

85

This means that every case in the train folder has at least some slices with a segmentation mask, but not every slice of each case has one.
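
Equal counts alone do not prove that the two sets of cases are the same. A stricter check compares the case names directly; the sketch below uses made-up folder paths and ids in place of `glob.glob("train/*")` and the `case` column.

```python
import os

# made-up stand-ins for glob.glob("train/*") and im_with_mask['case'].unique()
folders = ["train/case2", "train/case7", "train/case123"]
cases_with_mask = ["case123", "case2", "case7"]

folder_cases = {os.path.basename(p) for p in folders}
missing_masks = folder_cases - set(cases_with_mask)    # folders with no annotated slice
missing_folders = set(cases_with_mask) - folder_cases  # annotated cases with no folder
print(missing_masks, missing_folders)
```

Two empty sets mean the folders and the annotated cases match one to one.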

In [None]:
# attach case number and day to filenames (pixel/size info is not removed here)
from glob import glob
import os
pre = "case101_day20_"
for f in glob("train/case101/case101_day20/scans/*.png"):
  # prepend the prefix to the file name only, not to the whole path
  os.rename(f, os.path.join(os.path.dirname(f), pre + os.path.basename(f)))

In [39]:
import os
import re
from glob import glob

dst_dir = 'drive/MyDrive/Colab Notebooks/working_directory'
for filename in glob("train/case101/case101_day20/scans/*.png"):
  print(filename)
  # keep only the 'slice_NNNN' stem; this drops the size/spacing suffix
  base_filename = re.search(r'slice_\d+', os.path.basename(filename)).group(0)
  x = os.path.join(dst_dir, base_filename + '.png')
  print(x)
  break  # preview the first file only

train/case101/case101_day20/scans/slice_0059_266_266_1.50_1.50.png
drive/MyDrive/Colab Notebooks/working_directory/slice_0059.png


In [None]:
# extract case files and put into a working directory
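
A possible shape for this step is sketched below under assumptions: scans live under `train/<case>/<case>_<day>/scans/` as in the cell above, and the target name is `<case>_<day>_slice_NNNN.png` so that it lines up with the `id` column of `train.csv`. The helper name `collect_scans` is hypothetical, and the demo runs on a temporary directory with an empty placeholder file rather than on the real dataset.

```python
import os
import re
import shutil
import tempfile
from glob import glob

def collect_scans(src_root, dst_dir):
    """Copy every scan into dst_dir as '<case>_<day>_slice_NNNN.png' (hypothetical helper)."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for path in sorted(glob(os.path.join(src_root, "*", "*", "scans", "*.png"))):
        # the folder two levels up is named like 'case101_day20'
        case_day = os.path.basename(os.path.dirname(os.path.dirname(path)))
        m = re.search(r"slice_\d+", os.path.basename(path))
        if m is None:
            continue  # skip files that do not follow the slice naming pattern
        dst = os.path.join(dst_dir, f"{case_day}_{m.group(0)}.png")
        shutil.copy(path, dst)
        copied.append(os.path.basename(dst))
    return copied

# demo on a throwaway tree with one empty placeholder file
with tempfile.TemporaryDirectory() as tmp:
    scans = os.path.join(tmp, "train", "case101", "case101_day20", "scans")
    os.makedirs(scans)
    open(os.path.join(scans, "slice_0059_266_266_1.50_1.50.png"), "wb").close()
    result = collect_scans(os.path.join(tmp, "train"), os.path.join(tmp, "work"))
print(result)
```

Copying instead of renaming leaves the unzipped dataset untouched, so the step can be rerun safely.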
