# Prepare training, test, and validation sets for the pets dataset

## Overview

This notebook assumes a data directory called **data/pets**, relative to the current directory, containing two files previously downloaded from the [Oxford-IIIT Pet Dataset](http://www.robots.ox.ac.uk/~vgg/data/pets/):

* **annotations.tar.gz**
* **images.tar.gz**

Both archives are moved to a new directory, **data/pets/pristine**, before being extracted. At the end of this notebook, your data directory should contain two further sub-directories:

* **data/pets/full**
* **data/pets/sample**

The **full** sub-directory contains the full dataset, with images separated according to label. The **sample** sub-directory contains a subset of this data that allows for faster training while initially developing a model.
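
Before running the cells below, it can help to confirm that both archives are actually in place. The following is a minimal sketch, not part of the original notebook: the helper name `missing_archives` is hypothetical, and it assumes the archives sit either directly in the data directory or, after an earlier run, in its **pristine** sub-directory.

```python
import os

# Archive names come from the list above; the helper name is hypothetical.
ARCHIVES = ('annotations.tar.gz', 'images.tar.gz')

def missing_archives(data_dir):
    """Return the archive names found neither in data_dir nor in data_dir/pristine."""
    missing = []
    for name in ARCHIVES:
        candidates = (
            os.path.join(data_dir, name),
            os.path.join(data_dir, 'pristine', name),
        )
        if not any(os.path.isfile(path) for path in candidates):
            missing.append(name)
    return missing
```

Calling `missing_archives(os.path.join(os.getcwd(), 'data', 'pets'))` returns an empty list when both files are present.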

## Processing

In [None]:
import glob, os, random, re, shutil, sys, tarfile
import numpy as np

We draw a fresh seed, create a random number generator from it, and print the seed so that a run can be reproduced later by reusing the printed value:

In [None]:
seed = random.randrange(sys.maxsize)
rng = random.Random(seed)
print("Random seed:", seed)
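
To illustrate why printing the seed matters, here is a small sketch of how a recorded seed reproduces a shuffle. The helper `shuffled_copy`, the file names, and the seed value `12345` are all hypothetical and stand in for a value copied from an earlier run's output.

```python
import random

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a private generator seeded with seed."""
    rng = random.Random(seed)
    result = sorted(items)  # sort first so the input order cannot affect the result
    rng.shuffle(result)
    return result

files = ['c.jpg', 'a.jpg', 'b.jpg']   # hypothetical file names
recorded_seed = 12345                 # hypothetical seed from an earlier printout

first = shuffled_copy(files, recorded_seed)
second = shuffled_copy(files, recorded_seed)
print(first == second)  # prints True: same seed, same order
```

Reproducibility only holds if every random choice in the notebook goes through the seeded generator; calls to other generators (for example `np.random`) are not affected by this seed.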

Generate path for data directory:

In [None]:
data_dir = os.path.join(os.getcwd(), 'data', 'pets')
print("Data directory:", data_dir)

Create directory structure:

In [None]:
%mkdir -p {data_dir}/full/test
%mkdir -p {data_dir}/full/train
%mkdir -p {data_dir}/full/valid
%mkdir -p {data_dir}/pristine
%mkdir -p {data_dir}/sample
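
The `%mkdir` magics above rely on IPython and a Unix-like shell. As a hedged alternative, the same layout can be sketched in plain Python with `os.makedirs`; the function name `create_layout` is an assumption made for this example.

```python
import os

# Sub-directories taken from the %mkdir cell above.
SUBDIRS = ('full/test', 'full/train', 'full/valid', 'pristine', 'sample')

def create_layout(data_dir):
    """Create the directory tree under data_dir; existing directories are left alone."""
    for sub in SUBDIRS:
        os.makedirs(os.path.join(data_dir, *sub.split('/')), exist_ok=True)
```

Because `exist_ok=True` is passed, calling `create_layout(data_dir)` a second time is harmless, which matches the behavior of `mkdir -p`.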

Move the original downloaded archives into the **pristine** directory:

In [None]:
%mv {data_dir}/*.tar.gz {data_dir}/pristine

Extracting **images.tar.gz** will create a directory called **images**:

In [None]:
with tarfile.open(os.path.join(data_dir, 'pristine', 'images.tar.gz'), "r:gz") as tar:
    tar.extractall(os.path.join(data_dir, 'full'))

The filename for each image is prefixed with its label, so we can use that to sort them into directories, before setting aside a portion of those images as test and validation sets:

In [None]:
# Escape the dot so that it matches a literal "." rather than any character
pattern = r'([^/]+)_\d+\.jpg$'

classes = set()
images_dir = os.path.join(data_dir, 'full', 'images')
for file in glob.glob(os.path.join(images_dir, '*.jpg')):
    basename = os.path.basename(file)
    matches = re.match(pattern, basename)
    if matches:
        classes.add(matches.group(1))
    else:
        print('Failed to extract label from filename:', file)

classes = sorted(classes)
for c in classes:
    print('Moving images for class:', c)
    target_dir = os.path.join(data_dir, 'full', 'train', c)
    prefix = os.path.join(images_dir, c)
    %mkdir -p {target_dir}
    %mv {prefix}_*.jpg {target_dir}
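
The label extraction can be tried in isolation. The sketch below mirrors the pattern from the cell above, with the dot before `jpg` escaped so that it matches a literal dot. The function name `extract_label` and the sample file names are illustrative assumptions, not output from the dataset.

```python
import re

# Pattern mirroring the cell above; "\." matches a literal dot.
LABEL_PATTERN = re.compile(r'([^/]+)_\d+\.jpg$')

def extract_label(filename):
    """Return the label prefix of an image file name, or None if it does not match."""
    match = LABEL_PATTERN.match(filename)
    return match.group(1) if match else None

# Illustrative file names only.
for name in ['Abyssinian_1.jpg', 'american_bulldog_100.jpg', 'notes.txt']:
    print(name, '->', extract_label(name))
```

Since `[^/]+` is greedy, a label that itself contains underscores (such as `american_bulldog`) is kept whole, and only the final `_<number>.jpg` suffix is stripped.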

Set aside test and validation sets:

In [None]:
def set_aside_images(kind, ratio):
    """Move a random `ratio` of each class's training images into full/<kind>/<class>."""
    for c in classes:
        src_dir = os.path.join(data_dir, 'full', 'train', c)
        # Sort before shuffling: glob order is arbitrary, and the seeded rng
        # only reproduces a split when it starts from the same ordering.
        file_list = sorted(glob.glob(os.path.join(src_dir, '*.jpg')))
        rng.shuffle(file_list)
        num_images = int(len(file_list) * ratio)
        target_dir = os.path.join(data_dir, 'full', kind, c)
        %mkdir -p {target_dir}
        for path in file_list[:num_images]:
            shutil.move(path, target_dir)

# Set aside 20% of each class for the test set
set_aside_images('test', 0.2)

# Set aside 10% of the remaining training images (about 8% of the original
# total) for the validation set
set_aside_images('valid', 0.1)
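
Because the validation images are drawn from what is left after the test images have been moved, the two ratios compound. The sketch below works through the arithmetic for a hypothetical class of 200 images; `split_counts` is a name introduced here for illustration and only mirrors the `int(count * ratio)` rounding used above.

```python
def split_counts(num_images, test_ratio=0.2, valid_ratio=0.1):
    """Return (train, valid, test) counts when the splits are taken sequentially."""
    num_test = int(num_images * test_ratio)    # taken from the full class
    remaining = num_images - num_test
    num_valid = int(remaining * valid_ratio)   # taken from what is left
    num_train = remaining - num_valid
    return num_train, num_valid, num_test

print(split_counts(200))  # prints (144, 16, 40)
```

For this hypothetical class, the validation set ends up with 16 images, which is 8% of the original 200 rather than 10%.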

Remove the original **images** directory extracted from **images.tar.gz**:

In [None]:
%rm -rf {images_dir}
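
As a final sanity check, the number of images per class in each split can be counted. This is a minimal sketch under the directory layout described above; the helper name `count_images` is an assumption and not part of the original notebook.

```python
import os

def count_images(full_dir, kinds=('train', 'valid', 'test')):
    """Return {kind: {class_name: number of .jpg files}} for each split directory."""
    counts = {}
    for kind in kinds:
        kind_dir = os.path.join(full_dir, kind)
        per_class = {}
        if os.path.isdir(kind_dir):
            for class_name in sorted(os.listdir(kind_dir)):
                class_dir = os.path.join(kind_dir, class_name)
                if os.path.isdir(class_dir):
                    # Count only .jpg files, matching the glob used earlier
                    per_class[class_name] = sum(
                        1 for f in os.listdir(class_dir) if f.endswith('.jpg')
                    )
        counts[kind] = per_class
    return counts
```

Calling `count_images(os.path.join(data_dir, 'full'))` after the cells above should show every class present in all three splits, with the largest count under `train`.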