# Preprocessing
## Train, Validation and Test Split using CRIC Structured Dataset

## About:

Author: Fahad

Email: sfahadahmed@gmail.com

This notebook splits a dataset stored as one folder per class into a Train-Validation-Test folder structure that can be used for model training, validation and testing.

The code has been tested using the CRIC structured dataset.







## Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
!unzip CRIC_Dataset_Strutured.zip

Archive:  CRIC_Dataset_Strutured.zip
   creating: dataset/
   creating: dataset/ASC-H/
  inflating: dataset/ASC-H/ASC-H_48.png  
  inflating: dataset/ASC-H/ASC-H_339.png  
  inflating: dataset/ASC-H/ASC-H_386.png  
  inflating: dataset/ASC-H/ASC-H_387.png  
  inflating: dataset/ASC-H/ASC-H_52.png  
  inflating: dataset/ASC-H/ASC-H_55.png  
  inflating: dataset/ASC-H/ASC-H_81.png  
  inflating: dataset/ASC-H/ASC-H_51.png  
  inflating: dataset/ASC-H/ASC-H_56.png  
  inflating: dataset/ASC-H/ASC-H_384.png  
  inflating: dataset/ASC-H/ASC-H_388.png  
  inflating: dataset/ASC-H/ASC-H_389.png  
  inflating: dataset/ASC-H/ASC-H_49.png  
  inflating: dataset/ASC-H/ASC-H_3.png  
  inflating: dataset/ASC-H/ASC-H_82.png  
  inflating: dataset/ASC-H/ASC-H_64.png  
  inflating: dataset/ASC-H/ASC-H_47.png  
  inflating: dataset/ASC-H/ASC-H_65.png  
  inflating: dataset/ASC-H/ASC-H_58.png  
  inflating: dataset/ASC-H/ASC-H_385.png  
  inflating: dataset/ASC-H/ASC-H_53.png  
  inflating: dataset/ASC-

## Imports

In [6]:
import os
import shutil
import glob2
import math

## Explore

In [7]:
# CRIC structured dataset files
DATA_DIR = 'dataset/'

In [8]:
classes = os.listdir(DATA_DIR)
classes

['LSIL', 'SCC', 'HSIL', 'ASC-US', 'NILM', 'ASC-H']

In [9]:
NUM_CLASSES = len(classes)
NUM_CLASSES

6

## Organize Samples

In [10]:
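# Collect the image paths and the image count for each class.
# The paths are sorted so the listing order (and therefore the split) is reproducible.
files_by_class = {}
for c in classes:
    images = sorted(glob2.glob(DATA_DIR + c + '/*.png'))
    files_by_class[c] = {'images': images, 'num_images': len(images)}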

In [12]:
# Show sample count for each class
total_images = 0
for k in files_by_class:
    obj = files_by_class[k]
    total_images += int(obj['num_images'])
    print(k,': ', obj['num_images'])

print('Total: ', total_images)

LSIL :  165
SCC :  21
HSIL :  24
ASC-US :  101
NILM :  59
ASC-H :  30
Total:  400


## Split Dataset for Train, Validation and Test

In [16]:
# 80-10-10 split
train_split = 0.8
val_split = 0.1
test_split = 0.1
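
As a rough illustration of what these ratios mean for a single class, the sketch below computes the three sample counts from a class size. `split_sizes` is a hypothetical helper introduced here, not part of the original notebook. It assumes the train and validation counts are rounded and the test set takes whatever remains, which matches how the copy step in this notebook slices the file lists.

```python
def split_sizes(num_images, train_split=0.8, val_split=0.1):
    """Return (train, val, test) counts for one class.

    Hypothetical helper: train and val are rounded, and test receives
    the remainder so the three counts always sum to num_images.
    """
    train = round(num_images * train_split)
    val = round(num_images * val_split)
    test = num_images - train - val
    return train, val, test


# Class sizes taken from the sample counts printed earlier in this notebook
for name, n in [('LSIL', 165), ('SCC', 21), ('HSIL', 24)]:
    print(name, split_sizes(n))
```

Because the test count is a remainder rather than `round(num_images * test_split)`, rounding never loses or duplicates a sample. Note also that Python's `round` rounds halves to the nearest even integer, which is why a class of 165 images yields 16 validation samples rather than 17.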

## Create Folder Structure

In [18]:
# create folders for train, val and test, each with one subfolder per class
# (exist_ok=True lets this cell be re-run without raising FileExistsError)
dirs = ['dataset_split/train', 'dataset_split/val', 'dataset_split/test']
for split_dir in dirs:
    for c in classes:
        os.makedirs(split_dir + '/' + c, exist_ok=True)

## Copy Files to Folder Structure

In [19]:
# Split each class into train, val and test, then copy the files
for k in files_by_class:
    obj = files_by_class[k]
    images = obj['images']
    num_images = int(obj['num_images'])

    # calculate sample sizes; the test set takes whatever remains after
    # train and val, so the three parts always add up to num_images
    train_samples = round(num_images * train_split)
    val_samples = round(num_images * val_split)

    # split arrays (files are taken in listing order, without shuffling)
    train_images = images[0:train_samples]
    val_images = images[train_samples:train_samples+val_samples]
    test_images = images[train_samples+val_samples:]

    # copy files into dataset_split/<split>/<class>/
    for split, split_images in (('train', train_images), ('val', val_images), ('test', test_images)):
        for img in split_images:
            filename = os.path.basename(img)
            shutil.copyfile(img, 'dataset_split/' + split + '/' + k + '/' + filename)

    print('%s, train_len: %s, val_len: %s, test_len: %s' % (k, len(train_images), len(val_images), len(test_images)))


LSIL, train_len: 132, val_len: 16, test_len: 17
SCC, train_len: 17, val_len: 2, test_len: 2
HSIL, train_len: 19, val_len: 2, test_len: 3
ASC-US, train_len: 81, val_len: 10, test_len: 10
NILM, train_len: 47, val_len: 6, test_len: 6
ASC-H, train_len: 24, val_len: 3, test_len: 3
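
The copy step slices each class's file list in listing order. If that order happens to correlate with how the images were collected, shuffling with a fixed seed before slicing may give more representative splits. The following is a minimal sketch under that assumption; `shuffled_split`, its `seed` parameter and the example file names are hypothetical and not part of the original notebook.

```python
import random


def shuffled_split(items, train_split=0.8, val_split=0.1, seed=42):
    """Shuffle a copy of items with a fixed seed, then slice it into
    (train, val, test). The test part takes the remainder."""
    items = list(items)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(items)  # seeded, so the split is repeatable
    n_train = round(len(items) * train_split)
    n_val = round(len(items) * val_split)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Made-up file names standing in for one class of 30 images
example_files = ['ASC-H_%d.png' % i for i in range(30)]
train, val, test = shuffled_split(example_files)
print(len(train), len(val), len(test))
```

Using a dedicated `random.Random(seed)` instance keeps the shuffle independent of any other code that touches the global random state, so re-running the split produces the same three lists.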
