# Prepare training and test sets for the dogs/cats dataset

## Overview

This notebook assumes that a data directory named **data/dogscats-redux** exists relative to the current directory and contains two files previously downloaded from Kaggle's 'dogscats-redux' dataset:

* **test.zip**
* **train.zip**

These are moved to a new directory, **data/dogscats-redux/pristine**, before being extracted. By the end of this notebook, the data directory should contain two more sub-directories:

* **data/dogscats-redux/full**
* **data/dogscats-redux/sample**

The **full** sub-directory contains the full dataset, with the training and validation images separated into one directory per label. The **sample** sub-directory contains a small random subset of this data, which allows faster training while a model is first being developed.

In [None]:
import glob, os, random, shutil, sys, zipfile
import numpy as np

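We pick a random seed and print it, so that a run can be reproduced later by hard-coding the printed value. The seed is also applied to NumPy's global generator, because the cells below shuffle with `np.random.permutation`:

In [None]:
seed = random.randrange(2**32)  # np.random.seed only accepts integers in [0, 2**32 - 1]
rng = random.Random(seed)
np.random.seed(seed)  # the shuffles below use np.random.permutation
print("Random seed:", seed)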
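
As a minimal sketch of why printing the seed matters, the snippet below shows that shuffling with the same seed gives the same order twice. The helper name `shuffled_copy` and the file names are hypothetical and are not used elsewhere in this notebook.

```python
import random


def shuffled_copy(items, seed):
    """Return a shuffled copy of `items` using a private, seeded generator."""
    rng = random.Random(seed)  # independent of the global random state
    result = list(items)       # copy, so the caller's list is left untouched
    rng.shuffle(result)
    return result


# Hypothetical file names, for illustration only.
names = ["cat.0.jpg", "cat.1.jpg", "cat.2.jpg", "dog.0.jpg", "dog.1.jpg"]
first = shuffled_copy(names, seed=42)
second = shuffled_copy(names, seed=42)
print(first == second)
```

Using a private `random.Random` instance, rather than the global generator, keeps the result independent of any other code that draws random numbers in between.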

Generate path for data directory:

In [None]:
data_dir = os.path.join(os.getcwd(), 'data', 'dogscats-redux')
print("Data directory:", data_dir)

Create directory structure:

In [None]:
%mkdir -p {data_dir}/full
%mkdir -p {data_dir}/pristine
%mkdir -p {data_dir}/sample

Set aside original test and train data files:

In [None]:
%mv {data_dir}/test.zip {data_dir}/train.zip {data_dir}/pristine

Decompress test data:

In [None]:
with zipfile.ZipFile(os.path.join(data_dir, "pristine", "test.zip"), "r") as ref:
    ref.extractall(os.path.join(data_dir, "full", "test"))

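# The archive unpacks into a nested test/ folder; rename it to "unlabelled"
# so that the test images sit in a single sub-directory, like a class folder.
full_test_unlabelled_dir = os.path.join(data_dir, "full", "test", "unlabelled")
%mv {data_dir}/full/test/test {full_test_unlabelled_dir}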

Decompress and rearrange training data:

In [None]:
with zipfile.ZipFile(os.path.join(data_dir, "pristine", "train.zip"), "r") as ref:
    ref.extractall(os.path.join(data_dir, "full"))

full_train_dir = os.path.join(data_dir, "full", "train")

full_trains_cats_dir = os.path.join(full_train_dir, "cats")
%mkdir -p {full_trains_cats_dir}
for file in glob.glob(os.path.join(full_train_dir, "cat.*.jpg")):
    shutil.move(file, full_trains_cats_dir)

full_trains_dogs_dir = os.path.join(full_train_dir, "dogs")
%mkdir -p {full_trains_dogs_dir}
for file in glob.glob(os.path.join(full_train_dir, "dog.*.jpg")):
    shutil.move(file, full_trains_dogs_dir)
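
The cell above relies on the `%mkdir` magic, which is only available inside IPython. A minimal sketch of the same rearrangement in plain Python follows, assuming the `<label>.<number>.jpg` naming that the glob patterns above imply. The helper name `move_by_prefix` is hypothetical.

```python
import glob
import os
import shutil


def move_by_prefix(src_dir, prefix, dest_dir):
    """Move every `<prefix>.*.jpg` file in src_dir into dest_dir.

    Returns the number of files moved.
    """
    os.makedirs(dest_dir, exist_ok=True)  # plain-Python stand-in for `mkdir -p`
    moved = 0
    for path in sorted(glob.glob(os.path.join(src_dir, prefix + ".*.jpg"))):
        shutil.move(path, dest_dir)
        moved += 1
    return moved
```

With the variables defined above, a call might look like `move_by_prefix(full_train_dir, "cat", full_trains_cats_dir)`. Returning the count gives a quick check that the glob pattern matched the files you expected.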

Create a validation dataset:

In [None]:
full_valid_dir = os.path.join(data_dir, "full", "valid")

full_valid_cats_dir = os.path.join(full_valid_dir, "cats")
%mkdir -p {full_valid_cats_dir}

full_valid_dogs_dir = os.path.join(full_valid_dir, "dogs")
%mkdir -p {full_valid_dogs_dir}

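# Sort before shuffling: glob returns files in an arbitrary, filesystem-dependent order.
cats_shuffled = np.random.permutation(sorted(glob.glob(os.path.join(full_trains_cats_dir, "*.jpg"))))
dogs_shuffled = np.random.permutation(sorted(glob.glob(os.path.join(full_trains_dogs_dir, "*.jpg"))))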

num_validation_images_per_class = 1000
for i in range(num_validation_images_per_class):
    shutil.move(cats_shuffled[i], full_valid_cats_dir)
    shutil.move(dogs_shuffled[i], full_valid_dogs_dir)
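
The hold-out step can also be written as a small reusable function. This is a sketch under stated assumptions, not the notebook's own method: it swaps `np.random.permutation` for `random.Random.sample`, and the helper name `move_random_subset` is hypothetical.

```python
import glob
import os
import random
import shutil


def move_random_subset(src_dir, dest_dir, count, rng):
    """Move `count` randomly chosen .jpg files from src_dir to dest_dir.

    `rng` is a random.Random instance, so the choice is repeatable for a
    given seed. Returns the base names of the files that were moved.
    """
    os.makedirs(dest_dir, exist_ok=True)
    # Sort first so the result depends only on the seed, not on glob order.
    files = sorted(glob.glob(os.path.join(src_dir, "*.jpg")))
    if count > len(files):
        raise ValueError("requested %d files but only %d available" % (count, len(files)))
    chosen = rng.sample(files, count)
    for path in chosen:
        shutil.move(path, dest_dir)
    return [os.path.basename(path) for path in chosen]
```

The explicit size check fails early with a clear message, whereas indexing past the end of a shuffled array raises an `IndexError` partway through the loop, after some files have already been moved.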

Sample from training dataset:

In [None]:
sample_train_cats_dir = os.path.join(data_dir, "sample", "train", "cats")
%mkdir -p {sample_train_cats_dir}

sample_train_dogs_dir = os.path.join(data_dir, "sample", "train", "dogs")
%mkdir -p {sample_train_dogs_dir}

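num_training_samples_per_class = 1000
# The first num_validation_images_per_class entries were moved to the validation
# set above, so sample from the entries that follow them.
for i in range(num_validation_images_per_class, num_validation_images_per_class + num_training_samples_per_class):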
    shutil.copy(cats_shuffled[i], sample_train_cats_dir)
    shutil.copy(dogs_shuffled[i], sample_train_dogs_dir)

Sample from test dataset:

In [None]:
sample_test_unlabelled_dir = os.path.join(data_dir, "sample", "test", "unlabelled")
%mkdir -p {sample_test_unlabelled_dir}

# Sort before shuffling: glob returns files in an arbitrary, filesystem-dependent order.
test_shuffled = np.random.permutation(sorted(glob.glob(os.path.join(full_test_unlabelled_dir, "*.jpg"))))

num_test_samples = 500
for i in range(num_test_samples):
    shutil.copy(test_shuffled[i], sample_test_unlabelled_dir)

Sample from validation dataset:

In [None]:
sample_valid_cats_dir = os.path.join(data_dir, "sample", "valid", "cats")
%mkdir -p {sample_valid_cats_dir}

sample_valid_dogs_dir = os.path.join(data_dir, "sample", "valid", "dogs")
%mkdir -p {sample_valid_dogs_dir}

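# Sort before shuffling: glob returns files in an arbitrary, filesystem-dependent order.
valid_cats_shuffled = np.random.permutation(sorted(glob.glob(os.path.join(full_valid_cats_dir, "*.jpg"))))
valid_dogs_shuffled = np.random.permutation(sorted(glob.glob(os.path.join(full_valid_dogs_dir, "*.jpg"))))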

num_validation_samples_per_class = 100
for i in range(num_validation_samples_per_class):
    shutil.copy(valid_cats_shuffled[i], sample_valid_cats_dir)
    shutil.copy(valid_dogs_shuffled[i], sample_valid_dogs_dir)
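
Once all cells have run, a quick count of the images in each directory is a useful sanity check. The helper below is a minimal sketch. The name `count_images` is hypothetical, and it assumes every image has a `.jpg` extension, as the glob patterns above do.

```python
import os


def count_images(root):
    """Map each directory under `root` that holds .jpg files to its file count.

    Keys are paths relative to `root`; directories without images are omitted.
    """
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        n = sum(1 for name in filenames if name.lower().endswith(".jpg"))
        if n:
            counts[os.path.relpath(dirpath, root)] = n
    return counts
```

If the cells above ran once against a clean data directory, `count_images(os.path.join(data_dir, "sample"))` should report 1000 images each for `train/cats` and `train/dogs`, 100 each for `valid/cats` and `valid/dogs`, and 500 for `test/unlabelled`.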