# Labeling

This notebook creates a set of labeled images for training a classifier that distinguishes "good" from "bad" quality images. It uses Jupyter widgets as a basic labeling interface. The labels created here are required before running either the classify_rf.py or classify_nn.py script.

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
import os
import numpy as np
import dask.array as da

#this is the module with the labeling widgets
from corrector import Dataset, PredictionsCorrector

In [None]:
savedir = './'
impaths = da.from_npy_stack(os.path.join(savedir, 'deduplicated.npz'))
print(f'{len(impaths)} image paths in array')

## Manual Labeling

Note that the label array covers every image in impaths, not just the ones that we want to label. If the number of desired labeled images is known ahead of time, set the num_images parameter. For example, for 1000 labeled images, num_images=1000. When this parameter is set, the PredictionsCorrector class will stop loading new images once num_images have been loaded. Otherwise, PredictionsCorrector will continue to load images until all the images in impaths have been labeled.

In [None]:
#it's possible to resume from a previously created label array by setting
#resume_labels to the path of the label array
resume_labels = None #os.path.join(savedir, 'patch_quality_labels.npy')

if resume_labels is not None:
    labels = np.load(resume_labels)
    assert len(labels) == len(impaths), \
        "Number of labels in resumed label array does not match the number of images!"
else:
    #if we're not resuming, then we'll start with all
    #labels as "none"
    labels = np.array(['none'] * len(impaths))
    
#make the dataset object
#setting the eval_label to "none" ensures that only unlabeled images
#will be presented; this is especially important if we're resuming from a 
#label file that has some images marked as "good" or "bad"
dataset = Dataset(impaths, labels=labels, num_images=None, eval_label='none')
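To make the effect of eval_label and num_images concrete, here is a minimal NumPy-only sketch of the selection they imply. It does not use the corrector module, and `select_unlabeled` is a hypothetical helper written for illustration; the real Dataset class may choose or order images differently.

```python
import numpy as np

def select_unlabeled(labels, eval_label="none", num_images=None):
    """Return indices whose label equals eval_label, optionally capped.

    Hypothetical helper: it only illustrates the assumed behaviour of
    eval_label and num_images, not the actual Dataset implementation.
    """
    labels = np.asarray(labels)
    # positions of images that still carry the placeholder label
    indices = np.flatnonzero(labels == eval_label)
    if num_images is not None:
        # keep at most num_images candidates
        indices = indices[:num_images]
    return indices

example_labels = np.array(["good", "none", "bad", "none", "none"])
print(select_unlabeled(example_labels))                # [1 3 4]
print(select_unlabeled(example_labels, num_images=2))  # [1 3]
```

When resuming from a partially filled label array, only the positions still marked "none" would be candidates under this assumption.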

Every time the submit button is pressed, the labels for the dataset are updated. This means that it is possible to extract and save all labels before the PredictionsCorrector class has loaded all the images. For example, if the dataset contains 5000 images and the batch_size for the PredictionsCorrector is 50, it is possible to run the two cells that print label counts and save the labels after clicking submit on the first 50 images. Save often so that labels are not lost if the notebook crashes.
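Because saving happens repeatedly during a session, one option is to write the array to a temporary file and then rename it, so that a crash in the middle of a save does not leave a truncated label file. The sketch below is one possible approach; `save_labels_atomically` is a hypothetical helper and not part of the corrector module.

```python
import os
import numpy as np

def save_labels_atomically(labels, path):
    """Write labels to a temporary file, then swap it into place.

    Hypothetical helper for illustration only.
    """
    # the temporary name ends in .npy so np.save does not append a suffix
    tmp_path = path + ".tmp.npy"
    np.save(tmp_path, np.asarray(labels))
    # os.replace overwrites the destination; the rename is atomic when
    # both paths are on the same filesystem
    os.replace(tmp_path, path)
```

After each submitted batch it could be called as `save_labels_atomically(pc.corrected_labels(), os.path.join(savedir, 'patch_quality_labels.npy'))`.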

In [None]:
#manually label images by changing default "none" to either "good" or "bad"
#in the dropdown that appears under the image
classes = ['good', 'bad', 'none']
pc = PredictionsCorrector(dataset, classes, batch_size=50, rows=5) #50 images per batch in 5 rows implies 10 columns

In [None]:
#fetch the corrected labels once and count each class
corrected = np.array(pc.corrected_labels())
num_bad_labels = np.count_nonzero(corrected == 'bad')
num_good_labels = np.count_nonzero(corrected == 'good')
num_unlabeled = len(impaths) - num_bad_labels - num_good_labels
print(f'Images with label "bad": {num_bad_labels}, "good": {num_good_labels}, "none": {num_unlabeled}')

In [None]:
#save the results
#note that the saved array contains strings of either "good", "bad", or "none"
np.save(os.path.join(savedir, 'patch_quality_labels.npy'), np.array(pc.corrected_labels()))
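Before resuming from a saved file or handing it to a training script, it can help to check that the array has one entry per image and contains only the expected strings. The following is a minimal sketch; `validate_labels` is a hypothetical helper, and the default class names are taken from the classes list used in this notebook.

```python
import numpy as np

def validate_labels(labels, num_images, classes=("good", "bad", "none")):
    """Check the length and allowed values of a label array.

    Hypothetical helper for illustration only.
    """
    labels = np.asarray(labels)
    if len(labels) != num_images:
        raise ValueError(
            f"Expected {num_images} labels, found {len(labels)}"
        )
    # any value outside the known classes indicates a corrupted or foreign file
    unexpected = sorted(set(np.unique(labels).tolist()) - set(classes))
    if unexpected:
        raise ValueError(f"Unexpected label values: {unexpected}")
    return labels
```

For the file saved above, this could be called as `validate_labels(np.load(os.path.join(savedir, 'patch_quality_labels.npy')), len(impaths))`.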