# Cholec Folder Split

**NOTE: If you only need to split your data and already have both an `images` folder and a ground-truth (`gt`) folder, skip to `Generate Random Data Split` below.**

The following code was written specifically for the `cholec` dataset.

- After downloading the Cholec Segmentation dataset from Kaggle, we organize the dataset into two main folders:
    - `images` contains the input images for our semantic segmentation network
    - `gt` contains the groundtruth image masks for our semantic segmentation network

In [8]:
import os
from shutil import move

In [9]:
cholec_path = "../../../Downloads/cholec/"
# The prefix length is printed because the cells below slice paths by fixed index
print(len(cholec_path))

26


In [10]:
gt_dir = "../../../anaconda3/cholec/gt/"

for dirname, _, filenames in os.walk(cholec_path):
    for filename in filenames:
        curr_file = os.path.join(dirname, filename)
        # Hard-coded slice: extracts the video ID from the path, so it only
        # works for this exact directory layout and path length.
        video_num = curr_file[33:47]
        new_file_name = video_num + "_" + filename

        # Color masks are the ground truth; move and rename in one step.
        if "color" in filename:
            move(curr_file, gt_dir + new_file_name)

In [11]:
images_dir = "../../../anaconda3/cholec/images/"

for dirname, _, filenames in os.walk(cholec_path):
    for filename in filenames:
        curr_file = os.path.join(dirname, filename)
        # Same hard-coded slice as in the previous cell.
        video_num = curr_file[33:47]
        new_file_name = video_num + "_" + filename
        # Files without "mask" in their name are the raw input frames.
        if "mask" not in filename:
            move(curr_file, images_dir + new_file_name)
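
The slice `curr_file[33:47]` depends on the exact length of the path prefix, so it stops working if the dataset is stored somewhere else. A minimal sketch of a less brittle alternative follows. It assumes that each frame sits in a folder whose name is the prefix we want; that layout, the helper name `prefixed_name`, and the example path are assumptions for illustration and are not taken from the cells above.

```python
from pathlib import Path


def prefixed_name(file_path):
    """Return '<parent folder name>_<file name>' for a file path.

    Assumption: the immediate parent folder carries the video ID that
    the notebook extracts with a hard-coded slice.
    """
    path = Path(file_path)
    return f"{path.parent.name}_{path.name}"


# Hypothetical path, used only to show the naming scheme
example = "cholec/video01/video01_00080/frame_80_endo.png"
print(prefixed_name(example))
```

Because the prefix comes from the folder name rather than a character offset, the same helper would work for any download location.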

# Generate Random Data Split

- The code below splits the data in `images` and `gt` into `train`/`val`/`test` sets with the following ratios:
    - `train` = 0.7
    - `val` = 0.15
    - `test` = 0.15

- The split code is adapted from the Stanford CS230 blog post at https://cs230.stanford.edu/blog/split/
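
The ratios above can be sketched as one small helper. This is an illustrative variant, not the code used in this notebook: the function name `split_pairs`, its parameters, and the toy file names are assumptions, and it uses a local `random.Random` instance instead of seeding the global generator.

```python
import random


def split_pairs(pairs, test_frac=0.15, val_frac=0.15, seed=2021):
    """Shuffle (image, mask) pairs and cut them into train/val/test lists."""
    pairs = sorted(pairs)        # fixed starting order, so the shuffle is reproducible
    rng = random.Random(seed)    # local RNG, leaves the global random state untouched
    rng.shuffle(pairs)
    n_test = int(test_frac * len(pairs))
    n_val_end = int((test_frac + val_frac) * len(pairs))
    return {
        "test": pairs[:n_test],
        "val": pairs[n_test:n_val_end],
        "train": pairs[n_val_end:],  # everything left over goes to train
    }


# Toy data: 20 hypothetical (image, mask) file-name pairs
toy = [(f"img_{i:02d}.png", f"mask_{i:02d}.png") for i in range(20)]
splits = split_pairs(toy)
print({name: len(items) for name, items in splits.items()})
```

Keeping each image and its mask together in one tuple means a single shuffle can never separate them.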

In [12]:
import glob
import random
import os
from PIL import Image
from tqdm import tqdm

In [32]:
image_list = glob.glob('/home/sohamaserkar/Downloads/cholec/images/*')
gt_list = glob.glob("/home/sohamaserkar/Downloads/cholec/gt/*")

In [33]:
image_list.sort()
gt_list.sort()

In [34]:
random.seed(2021)

In [35]:
combined_list = list(zip(image_list, gt_list))
random.shuffle(combined_list)
image_list, gt_list = zip(*combined_list)
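
Pairing by `zip` relies on both sorted lists having the same length and the same order. A minimal sketch of a sanity check that could be run on the two lists is shown below. The helper name `check_pairs` and the `_color_mask` suffix are assumptions about the naming convention, so adjust them to the actual file names.

```python
import os


def check_pairs(image_paths, gt_paths, gt_suffix="_color_mask"):
    """Return the (image, mask) pairs whose base names do not line up.

    Assumption: a mask is named like its image plus `gt_suffix` before
    the extension, e.g. 'a_endo.png' <-> 'a_endo_color_mask.png'.
    """
    if len(image_paths) != len(gt_paths):
        raise ValueError("image and mask lists differ in length")
    mismatched = []
    for img, gt in zip(image_paths, gt_paths):
        img_stem = os.path.splitext(os.path.basename(img))[0]
        gt_stem = os.path.splitext(os.path.basename(gt))[0]
        if gt_stem != img_stem + gt_suffix:
            mismatched.append((img, gt))
    return mismatched


# Hypothetical file names for illustration
images = ["x/v1_frame_1_endo.png", "x/v1_frame_2_endo.png"]
masks = ["y/v1_frame_1_endo_color_mask.png", "y/v1_frame_2_endo_color_mask.png"]
print(check_pairs(images, masks))  # an empty list means every pair matches
```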

In [36]:
split_1 = int(0.15 * len(image_list))
split_2 = int(0.3 * len(image_list))

test_images, test_gt = image_list[:split_1], gt_list[:split_1]
val_images, val_gt = image_list[split_1:split_2], gt_list[split_1:split_2]
train_images, train_gt = image_list[split_2:], gt_list[split_2:]

In [37]:
print(len(train_images) / len(image_list))
print(len(test_images) / len(image_list))
print(len(val_images) / len(image_list))

0.7
0.15
0.15


In [38]:
print(len(train_gt) / len(gt_list))
print(len(test_gt) / len(gt_list))
print(len(val_gt) / len(gt_list))

0.7
0.15
0.15


In [39]:
im_filenames = {'train_images': train_images, 'val_images': val_images, 'test_images': test_images}
gt_filenames = {'train_gt': train_gt, 'val_gt': val_gt, 'test_gt': test_gt}

In [43]:
im_output_dir = "/home/sohamaserkar/Downloads/combined_cholec"

In [44]:
def resize_and_save(filename, output_dir):
    """Save a copy of the image in `filename` to `output_dir`.

    Despite the name, the image is not resized here.
    """
    image = Image.open(filename)
    save_path = os.path.join(output_dir, os.path.basename(filename))
    image.save(save_path)
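
Since no resizing takes place, a plain file copy would produce the same split folders without decoding and re-encoding every image. The following is a minimal sketch of that alternative using only the standard library. The helper name `copy_to_split` and the demo file name are hypothetical, and this is not the function the notebook actually calls.

```python
import os
import shutil
import tempfile


def copy_to_split(filename, output_dir):
    """Copy `filename` into `output_dir` unchanged and return the new path."""
    os.makedirs(output_dir, exist_ok=True)
    save_path = os.path.join(output_dir, os.path.basename(filename))
    shutil.copy2(filename, save_path)  # byte-for-byte copy, keeps file metadata
    return save_path


# Demonstration on a throwaway file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "frame_0_endo.png")  # hypothetical file name
    with open(src, "wb") as fh:
        fh.write(b"not a real image, just bytes")
    dst = copy_to_split(src, os.path.join(tmp, "train_images"))
    with open(dst, "rb") as fh:
        copied = fh.read()
    print(os.path.basename(dst), len(copied))
```

A copy also leaves the mask pixel values untouched, which matters for segmentation labels.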

In [46]:
for split in ['train_images', 'val_images', 'test_images']:
    output_dir_split = os.path.join(im_output_dir, split)
    print(output_dir_split)

    # makedirs also creates `im_output_dir` itself if it is missing
    if not os.path.exists(output_dir_split):
        os.makedirs(output_dir_split)
    else:
        print("Warning: dir {} already exists".format(output_dir_split))

    print("Processing {} data, saving preprocessed data to {}".format(split, output_dir_split))
    for filename in tqdm(im_filenames[split]):
        resize_and_save(filename, output_dir_split)

print("Done building dataset")

/home/sohamaserkar/Downloads/combined_cholec/train_images
Processing train_images data, saving preprocessed data to /home/sohamaserkar/Downloads/combined_cholec/train_images
100%|██████████| 5656/5656 [09:11<00:00, 10.25it/s]

/home/sohamaserkar/Downloads/combined_cholec/val_images
Processing val_images data, saving preprocessed data to /home/sohamaserkar/Downloads/combined_cholec/val_images
100%|██████████| 1212/1212 [02:00<00:00, 10.06it/s]

/home/sohamaserkar/Downloads/combined_cholec/test_images
Processing test_images data, saving preprocessed data to /home/sohamaserkar/Downloads/combined_cholec/test_images
100%|██████████| 1212/1212 [02:07<00:00,  9.50it/s]

Done building dataset

In [47]:
gt_output_dir = "/home/sohamaserkar/Downloads/combined_cholec"

In [48]:
for split in ['train_gt', 'val_gt', 'test_gt']:
    output_dir_split = os.path.join(gt_output_dir, split)
    print(output_dir_split)

    # makedirs also creates `gt_output_dir` itself if it is missing
    if not os.path.exists(output_dir_split):
        os.makedirs(output_dir_split)
    else:
        print("Warning: dir {} already exists".format(output_dir_split))

    print("Processing {} data, saving preprocessed data to {}".format(split, output_dir_split))
    for filename in tqdm(gt_filenames[split]):
        resize_and_save(filename, output_dir_split)

print("Done building dataset")

/home/sohamaserkar/Downloads/combined_cholec/train_gt
Processing train_gt data, saving preprocessed data to /home/sohamaserkar/Downloads/combined_cholec/train_gt
100%|██████████| 5656/5656 [01:49<00:00, 51.81it/s]

/home/sohamaserkar/Downloads/combined_cholec/val_gt
Processing val_gt data, saving preprocessed data to /home/sohamaserkar/Downloads/combined_cholec/val_gt
100%|██████████| 1212/1212 [00:23<00:00, 51.46it/s]

/home/sohamaserkar/Downloads/combined_cholec/test_gt
Processing test_gt data, saving preprocessed data to /home/sohamaserkar/Downloads/combined_cholec/test_gt
100%|██████████| 1212/1212 [00:23<00:00, 52.40it/s]

Done building dataset
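
Once both loops have finished, it can be worth confirming that every split holds as many masks as images. The sketch below counts the files in each `<split>_images` / `<split>_gt` folder pair. The helper name `count_split_files` is hypothetical, and the demo builds a small throwaway directory tree rather than touching the real dataset.

```python
import os
import tempfile


def count_split_files(root, splits=("train", "val", "test")):
    """Return {split: (n_images, n_masks)} for '<split>_images' and '<split>_gt'."""
    counts = {}
    for split in splits:
        img_dir = os.path.join(root, f"{split}_images")
        gt_dir = os.path.join(root, f"{split}_gt")
        counts[split] = (len(os.listdir(img_dir)), len(os.listdir(gt_dir)))
    return counts


# Build a tiny fake dataset tree to demonstrate the check
with tempfile.TemporaryDirectory() as root:
    for split, n in (("train", 3), ("val", 1), ("test", 1)):
        for kind in ("images", "gt"):
            folder = os.path.join(root, f"{split}_{kind}")
            os.makedirs(folder)
            for i in range(n):
                open(os.path.join(folder, f"{i}.png"), "wb").close()
    counts = count_split_files(root)

print(counts)
```

On the real output directory, the two numbers in each tuple should be equal; a mismatch would point to a file that failed to save.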



