In [74]:
import os
import numpy as np
from tqdm import tqdm
import shutil

# Generate the data splits

In an ideal world, we would receive the data pre-partitioned into train, validation, and test sets. That is not the case here, so we have to create the splits ourselves. Ideally, I would run k-fold cross-validation to check that the model is robust, but I'm opting for a single naive split for the sake of time and reduced complexity.
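For reference, here is a minimal sketch of how k-fold indices could be generated. It is not used anywhere in this notebook; the helper name `kfold_indices`, the seed, and the toy sizes are all made up for illustration.

```python
import random

def kfold_indices(num_samples, k, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so fold sizes differ by at most 1
    fold_sizes = [num_samples // k + (1 if i < num_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

# Toy example: 10 samples split into 5 folds
folds = list(kfold_indices(num_samples=10, k=5))
for fold_number, (train_idx, val_idx) in enumerate(folds):
    print(fold_number, len(train_idx), len(val_idx))
```

Each sample lands in exactly one validation fold, so a model would be trained k times, which is the extra cost I'm avoiding here.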

To start, let's choose our split percentages: I will train on 80% and validate on 20%. I'm not creating a test set because I'm going to naively assume that the validation set is representative of the data the model will see at test time and during deployment. This split lets us select a model, after which we can analyze the kNN retrievals on the validation set.

In [30]:
DATA_DIR = '/home/zack/datasets/geological_similarity'
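# Splits are written inside DATA_DIR itself, as train/ and val/ subfolders
OUTPUT_DIR = '/home/zack/datasets/geological_similarity/'
TRAIN_PERCENT = 0.8
# train/ and val/ sit next to the class folders, so when regenerating the dataset
# only these class folders are treated as sources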
CLASSES = ['rhyolite', 'quartzite', 'marble', 'andesite', 'schist', 'gneiss']

In [31]:
train_file_list = []
val_file_list = []
for subdir in os.listdir(DATA_DIR):
    if subdir not in CLASSES: continue
    subdir_path = os.path.join(DATA_DIR, subdir)
    # Make sure we don't open any weird hidden files
    if not os.path.isdir(subdir_path) or subdir[0] == '.': continue
    files = np.array([os.path.join(subdir_path, x) for x in os.listdir(subdir_path)])
    # Randomly shuffle the file list (we don't know if the files are alphabetical/by order of difficulty)
    np.random.shuffle(files)
    # Compute the number of files from this subdir that will be placed in train
    num_train_samples = int(len(files) * TRAIN_PERCENT)
    # Generate the splits
    train, val = list(files[:num_train_samples]), list(files[num_train_samples:])
    train_file_list += train
    val_file_list += val
print(f'Training set will be generated with {len(train_file_list)} samples. Validation set will be generated with {len(val_file_list)} samples.')

Training set will be generated with 23998 samples. Validation set will be generated with 6000 samples.


It looks like there are only 29998 files in the dataset, so we are missing two images. This won't be an issue, but it's worth pointing out. The count was confirmed from the command line as well:


```
$ find /home/zack/datasets/geological_similarity -type f -name "*.jpg" | wc -l 
29998
```
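One caveat about the split cell: as far as the cells shown here go, `np.random.shuffle` runs without a seed, so rerunning the notebook produces a different partition. Below is a minimal sketch of a reproducible variant using only the standard library. The helper `split_files`, the seed value, and the stand-in file names are assumptions for illustration, not what was run above.

```python
import random

def split_files(files, train_percent, seed):
    """Return (train, val) lists; the same seed always gives the same split."""
    # Sort first so the result does not depend on the order os.listdir returns
    shuffled = sorted(files)
    random.Random(seed).shuffle(shuffled)
    num_train = int(len(shuffled) * train_percent)
    return shuffled[:num_train], shuffled[num_train:]

example_files = [f'img_{i:03d}.jpg' for i in range(10)]  # stand-in file names
train_a, val_a = split_files(example_files, train_percent=0.8, seed=42)
# Same files in a different input order, same seed: identical split
train_b, val_b = split_files(list(reversed(example_files)), train_percent=0.8, seed=42)
```

Applied per class folder, this would keep the 80/20 ratio within each class, just like the loop above.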

Now that we have our splits, let's create train and val subfolders and symlink all of our data into them. This requires the least code on the Torch side, because the resulting layout (one subfolder per class) meets the spec for an ImageFolder dataset.

In [32]:
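def symlink_file_list(file_list, data_dir, output_dir):
    """Symlink each file into output_dir, keeping its class subfolder."""
    for file in tqdm(file_list, desc='Creating symlinks for split'):
        # Swap the data_dir prefix for output_dir: <data_dir>/<class>/<name> -> <output_dir>/<class>/<name>
        output_path = file.replace(data_dir, output_dir)
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        os.symlink(file, output_path)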

In [38]:
train_output_dir = os.path.join(OUTPUT_DIR, 'train')
val_output_dir = os.path.join(OUTPUT_DIR, 'val')

# Remove splits left over from a previous run; os.symlink raises if the link already exists
if os.path.exists(train_output_dir): shutil.rmtree(train_output_dir)
if os.path.exists(val_output_dir): shutil.rmtree(val_output_dir)

symlink_file_list(train_file_list, DATA_DIR, train_output_dir)
symlink_file_list(val_file_list, DATA_DIR, val_output_dir)

Creating symlinks for split: 100%|████████████████████████████████████████████████████████████████████████████████████| 23998/23998 [00:00<00:00, 41130.83it/s]
Creating symlinks for split: 100%|██████████████████████████████████████████████████████████████████████████████████████| 6000/6000 [00:00<00:00, 41547.16it/s]
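A quick sanity check on the generated folders can catch dangling links or a lopsided class before training. The sketch below counts symlinks per class folder and lists any link whose target no longer exists. `summarize_split` is a hypothetical helper, and the demo runs on a throwaway temporary directory rather than the real dataset.

```python
import os
import tempfile
from collections import Counter

def summarize_split(split_dir):
    """Count symlinks per class folder and collect links whose target is missing."""
    counts, broken = Counter(), []
    for class_name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, class_name)
        if not os.path.isdir(class_dir):
            continue
        for name in os.listdir(class_dir):
            path = os.path.join(class_dir, name)
            if os.path.islink(path):
                counts[class_name] += 1
                # os.path.exists follows the link, so False means a dangling link
                if not os.path.exists(path):
                    broken.append(path)
    return counts, broken

# Tiny self-contained demo: one valid link and one dangling link
with tempfile.TemporaryDirectory() as root:
    source = os.path.join(root, 'marble_0.jpg')
    open(source, 'w').close()
    link_dir = os.path.join(root, 'train', 'marble')
    os.makedirs(link_dir)
    os.symlink(source, os.path.join(link_dir, 'marble_0.jpg'))
    os.symlink(os.path.join(root, 'missing.jpg'), os.path.join(link_dir, 'missing.jpg'))
    demo_counts, demo_broken = summarize_split(os.path.join(root, 'train'))
```

On the real data, calling `summarize_split(train_output_dir)` and `summarize_split(val_output_dir)` should report counts that add up to the 23998 and 6000 samples printed earlier, with no broken links.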


# Compute normalization statistics

Zero-mean, unit-standard-deviation inputs assist the learning process and speed up convergence. To get there, we need the per-channel mean and standard deviation of the training dataset. Let's compute them now.
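To make that concrete, write $x_{i,c}$ for the value of pixel $i$ in channel $c$ (red, green, or blue) and $N$ for the total number of training pixels. These symbols are introduced here for illustration only. The statistics and the standardized value $\hat{x}_{i,c}$ are then:

$$
\mu_c = \frac{1}{N}\sum_{i=1}^{N} x_{i,c}, \qquad
\sigma_c = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_{i,c}-\mu_c\right)^2}, \qquad
\hat{x}_{i,c} = \frac{x_{i,c}-\mu_c}{\sigma_c}
$$

The $N-1$ denominator matches the default behaviour of `torch.Tensor.std`, which computes the sample standard deviation. With $N$ in the millions, the difference from dividing by $N$ is negligible.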

In [71]:
from torchvision.datasets import ImageFolder
from torchvision.transforms import ToTensor
import torch


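# ToTensor converts each image to a float tensor of shape (3, H, W) with values in [0, 1]
ds = ImageFolder(train_output_dir, transform=ToTensor())
# Stack all images into one tensor of shape (N, 3, H, W); assumes every image has the same size
ds = torch.hstack([x.unsqueeze(1) for (x, y) in ds]).transpose(0, 1)
# Flatten to one row per pixel, shape (N * H * W, 3), so statistics are per channel
ds = ds.view(ds.shape[0], 3, -1).transpose(1, 2).view(-1, 3)
print(f'Dataset loaded with shape {ds.shape}')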

Dataset loaded with shape torch.Size([18814432, 3])
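Holding every training pixel in memory at once (18,814,432 rows of 3 values) is fine at this size, but it would not scale to a much larger dataset. A minimal sketch of the alternative is to accumulate per-channel sums batch by batch. The class `RunningChannelStats` is hypothetical, and plain Python is used only to keep the sketch dependency-free. In practice the same sums would be accumulated with tensor operations.

```python
import math

class RunningChannelStats:
    """Accumulate per-channel sums so mean/std can be computed batch by batch."""

    def __init__(self, num_channels=3):
        self.count = 0
        self.total = [0.0] * num_channels
        self.total_sq = [0.0] * num_channels

    def update(self, pixels):
        # pixels: iterable of per-pixel channel tuples, e.g. (r, g, b) in [0, 1]
        for pixel in pixels:
            self.count += 1
            for c, value in enumerate(pixel):
                self.total[c] += value
                self.total_sq[c] += value * value

    def mean(self):
        return [t / self.count for t in self.total]

    def std(self):
        # Sample standard deviation (divides by N - 1), like torch.Tensor.std by default
        means = self.mean()
        return [
            math.sqrt(max(sq - self.count * m * m, 0.0) / (self.count - 1))
            for sq, m in zip(self.total_sq, means)
        ]

stats = RunningChannelStats()
stats.update([(0.0, 0.5, 1.0), (1.0, 0.5, 0.0)])  # first "batch"
stats.update([(0.5, 0.5, 0.5), (0.5, 0.5, 0.5)])  # second "batch"
```

The sum-of-squares form is simple but can lose precision when the mean is large relative to the spread. For values in [0, 1] that is unlikely to matter much, though a Welford-style update would be the more careful choice.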


In [72]:
print(f'Dataset mean: {ds.mean(dim=0)}')
print(f'Dataset std: {ds.std(dim=0)}')

Dataset mean: tensor([0.5083, 0.5198, 0.5196])
Dataset std: tensor([0.1851, 0.1997, 0.2196])
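To show what these numbers are for, here is a small sketch that standardizes a single RGB pixel with the statistics printed above and then inverts the operation. The helper names are made up for illustration. torchvision's `transforms.Normalize(mean, std)` applies the same per-channel operation to whole image tensors, which is presumably how the values get used during training.

```python
# Statistics printed above, rounded as shown in the notebook output
TRAIN_MEAN = (0.5083, 0.5198, 0.5196)
TRAIN_STD = (0.1851, 0.1997, 0.2196)

def normalize_pixel(pixel, mean=TRAIN_MEAN, std=TRAIN_STD):
    """Standardize one (r, g, b) pixel with values in [0, 1]."""
    return tuple((value - m) / s for value, m, s in zip(pixel, mean, std))

def denormalize_pixel(pixel, mean=TRAIN_MEAN, std=TRAIN_STD):
    """Invert normalize_pixel, e.g. before visualizing an image."""
    return tuple(value * s + m for value, m, s in zip(pixel, mean, std))

# The mean colour maps to the origin after normalization
mean_pixel = normalize_pixel(TRAIN_MEAN)
# Normalizing and then denormalizing recovers the original pixel (up to float error)
round_trip = denormalize_pixel(normalize_pixel((0.25, 0.5, 0.75)))
```

The same mean and std should be applied to the validation set as well. The statistics come from the training split only, so nothing leaks from validation into training.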


# Continue to supervised_baseline notebook

Excellent! Let's continue on to training our model.