## Building an Open Images classifier dataset

In this notebook we will transform the metadata and image files that we have into a dataset file formatted for input into a machine learning model.

### Prelude

At this point we've already run the script that selects the relevant images and places them in the `../data/images/` directory.

In [1]:
%ls ../data/images/ | head

10006714784_9337d5d0e1_o.jpg
10014143174_1de79c8af8_o.jpg
10022662923_ab0567fe1a_o.jpg
10052146336_dc364e0a10_o.jpg
1006312339_d306fc933d_o.jpg
10065094283_0db2b64b2d_o.jpg
10102600246_6385283711_o.jpg
10123662565_4ab592b952_o.jpg
101266618_99a28a70ff_o.jpg
10148587244_f576b88c8f_o.jpg


To do that, we either executed `src/openimager.py` as a script (e.g. `python openimager.py "sandwich" "hot dog" "hamburger"`) or used it as a module (using `openimager.download`). TODO: adapt `download` with a folder target parameter.

### Packaging the data

Before we can move on to training the model, we must subselect the relevant image metadata and package it alongside the raw images so that we have a common starting point.

The script we wrote for downloading relevant training images computes all of the metadata we need, but it does not save it anywhere: it is solely responsible for downloading images, not for writing metadata to disk.

The following segment of code is copied from that routine, and it recreates the relevant metadata.

TODO: unify the script and this code cell with a common method for doing this work.

In [2]:
import pandas as pd

categories = ['Sandwich', 'Hot dog', 'Hamburger']

kwargs = {'header': None, 'names': ['LabelID', 'LabelName']}
class_names = pd.read_csv("../data/metadata/image-class-names.csv", **kwargs)
train_boxed = pd.read_csv("../data/metadata/train-annotations-bbox.csv", index_col=0)
image_ids = pd.read_csv("../data/metadata/train-images-ids.csv", index_col=0)
label_map = dict(class_names.set_index('LabelName').loc[categories, 'LabelID']
                 .to_frame().reset_index().set_index('LabelID')['LabelName'])
label_values = set(label_map.keys())
relevant_training_images = train_boxed[train_boxed.LabelName.isin(label_values)]
relevant_flickr_urls = (relevant_training_images.set_index('ImageID')
                        .join(image_ids.set_index('ImageID'))
                        .loc[:, 'OriginalURL'])
relevant_flickr_img_metadata = (relevant_training_images.set_index('ImageID').loc[relevant_flickr_urls.index]
                                .pipe(lambda df: df.assign(LabelValue=df.LabelName.map(lambda v: label_map[v]))))
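
To make the label lookup easier to follow, here is a minimal sketch of the same idea on toy data. The label IDs and rows are invented for illustration; only the column names follow the cell above, and the real tables are of course much larger.

```python
import pandas as pd

# Toy stand-ins for the metadata tables; the IDs are invented for illustration.
class_names = pd.DataFrame({
    'LabelID': ['/m/aaa', '/m/bbb', '/m/ccc'],
    'LabelName': ['Sandwich', 'Hot dog', 'Bicycle'],
})
train_boxed = pd.DataFrame({
    'ImageID': ['img1', 'img2', 'img3'],
    # in the annotation table, the LabelName column holds the label ID
    'LabelName': ['/m/aaa', '/m/ccc', '/m/bbb'],
})

categories = ['Sandwich', 'Hot dog']

# Map label ID -> human-readable name, restricted to the categories of interest.
label_map = (class_names[class_names.LabelName.isin(categories)]
             .set_index('LabelID')['LabelName']
             .to_dict())

# Keep only annotation rows whose label ID is one of ours, then attach the readable name.
relevant = train_boxed[train_boxed.LabelName.isin(list(label_map))]
relevant = relevant.assign(LabelValue=relevant.LabelName.map(label_map))
```

In this toy example `relevant` keeps `img1` and `img3` and drops the bicycle annotation.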

Now we join this information in a way that lets us map each image file to its metadata entries.

In [3]:
u_relevant_flickr_urls = pd.Series(relevant_flickr_urls.unique(), 
                                   index=relevant_flickr_urls.index.unique(),
                                   name='OriginalURL')

X_meta = (relevant_training_images
          .set_index('ImageID')
          .join(u_relevant_flickr_urls.map(lambda v: v.split("/")[-1]), how='left')
          .reset_index()
          .pipe(lambda df: df.assign(LabelName=df.LabelName.map(lambda l: label_map[l]))))
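
The join above relies on turning each original URL into the name of the file saved on disk by taking the last path segment. A small sketch of that step follows. The URL is hypothetical, and the second helper (`url_to_filename_strict`, a name made up here) is a possible variation rather than what the notebook uses: it also tolerates a query string.

```python
import posixpath
from urllib.parse import urlparse

def url_to_filename(url):
    """Return the last path segment of a URL, as done in the join above."""
    return url.split("/")[-1]

def url_to_filename_strict(url):
    """Variation: parse the URL first so that any query string is ignored."""
    return posixpath.basename(urlparse(url).path)

# Hypothetical URL, used only to illustrate the transformation.
example_url = "https://example.com/photos/10006714784_9337d5d0e1_o.jpg"
simple_name = url_to_filename(example_url)
strict_name = url_to_filename_strict(example_url + "?download=1")
```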

The machine learning model takes the raw images as input, which requires every input image to be the same size. The boxed images included in this dataset are _not_ all the same size (or orientation, or anything else like that). In this way they are representative of a true image classification task out "in the wild", as opposed to more templatized image recognition tasks like MNIST and Fashion-MNIST.
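
As a rough illustration of why a common size matters, the sketch below shows that arrays of different shapes cannot be stacked into one batch, while resized ones can. The `nearest_resize` helper is a crude nearest-neighbor stand-in written for this example, not the resampler used later in the notebook.

```python
import numpy as np

# Two fake "crops" with different heights and widths (pixel values are arbitrary).
crop_a = np.zeros((40, 60, 3), dtype='uint8')
crop_b = np.zeros((75, 30, 3), dtype='uint8')

try:
    np.stack([crop_a, crop_b])
    stacked_raw = True
except ValueError:
    # Arrays of different shapes cannot be stacked into a single batch.
    stacked_raw = False

def nearest_resize(img_arr, size=(128, 128)):
    """Resize by nearest-neighbor index sampling (aspect ratio is not preserved)."""
    rows = np.arange(size[0]) * img_arr.shape[0] // size[0]
    cols = np.arange(size[1]) * img_arr.shape[1] // size[1]
    return img_arr[rows][:, cols]

# After resizing, both crops share one shape and stack cleanly.
batch = np.stack([nearest_resize(crop_a), nearest_resize(crop_b)])
```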

In [4]:
from PIL import Image  # pip install pillow
import numpy as np

In [5]:
# TODO: write these transformations in sklearn semantics?
from skimage.transform import resize

def get_image_arr(img):
    """Given an image file name, returns that image read from local disk into a numpy array."""
    return np.asarray(Image.open('../data/images/' + img), dtype='int32')

def crop_image_to_bbox(img_arr, XMin, XMax, YMin, YMax):
    """Given an image array, crops the image to just the bounding box provided."""
    # image arrays are (height, width, channels): X scales with width, Y with height
    height, width = img_arr.shape[:2]

    XMin_x = int(np.floor(XMin*width))
    XMax_x = int(np.ceil(XMax*width))
    YMin_y = int(np.floor(YMin*height))
    YMax_y = int(np.ceil(YMax*height))

    # rows are indexed by y, columns by x
    return img_arr[YMin_y:YMax_y, XMin_x:XMax_x]

def resize_img_arr(img_arr):
    """Given an image array, resizes that image to a standard size (aspect ratio is not preserved)."""
    return resize(img_arr, (128, 128), anti_aliasing=True, mode='constant', preserve_range=True)

def is_non_unary_channel(img_arr):
    """Returns whether or not the given image has multiple channels or not (expect 3: RGB)."""
    return len(img_arr.shape) == 3 and img_arr.shape[2] > 1
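
A worked example of the bounding-box arithmetic may help here. This sketch assumes that an image loaded through PIL becomes an array of shape (height, width, channels) and that the box coordinates are fractions of the image width (X) and height (Y). The function name `crop_to_bbox` and the synthetic image are made up for the illustration.

```python
import numpy as np

def crop_to_bbox(img_arr, xmin, xmax, ymin, ymax):
    """Crop a (height, width, ...) array to a box given in normalized [0, 1] coordinates."""
    height, width = img_arr.shape[:2]
    left, right = int(np.floor(xmin * width)), int(np.ceil(xmax * width))
    top, bottom = int(np.floor(ymin * height)), int(np.ceil(ymax * height))
    # Rows are indexed by y, columns by x.
    return img_arr[top:bottom, left:right]

# A fake RGB image with 100 rows and 200 columns.
img = np.zeros((100, 200, 3), dtype='uint8')

# The box spans the middle half horizontally and the second quarter vertically.
crop = crop_to_bbox(img, xmin=0.25, xmax=0.75, ymin=0.25, ymax=0.5)
```

Here the crop covers columns 50 to 150 and rows 25 to 50, so it has 25 rows, 100 columns, and 3 channels.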

In [6]:
# TODO: it would be more efficient to broadcast these operations directly on the numpy array(s).

def process_chunk(X_meta):
    return (X_meta
     # use the metadata to select images and generate cropped numpy arrays
     .pipe(lambda df: 
           df.apply(
               lambda srs: crop_image_to_bbox(
                   get_image_arr(srs.OriginalURL),
                   srs.XMin, srs.XMax, srs.YMin, srs.YMax
               ), axis='columns')
          )
     # exclude single-channel (e.g. grayscale) images; these seem to always be Image Not Found placeholders
     .pipe(lambda srs: srs[srs.map(is_non_unary_channel)])
     # resize the images to 128x128
     .map(resize_img_arr)
     # round to integer values, and then store as unsigned integers
     .map(lambda arr: np.round(arr))
     .map(lambda arr: arr.astype('uint8'))
     # linearize the pixel values for the purposes of saving to disk
     .map(np.ravel)
     # construct a DataFrame out of the pixel values, keeping the original row labels
     # so that rows dropped by the filter above do not shift the join below
     .pipe(lambda srs: pd.DataFrame(np.vstack(srs.values), index=srs.index).add_prefix('p'))
     .join(X_meta
           .loc[:, ['LabelName', 'ImageID', 'IsOccluded', 'IsTruncated', 'IsGroupOf', 'IsDepiction', 'IsInside']])
     # move ImageID to first axis of columns
     .set_index('ImageID').reset_index()
     # drop rows with -1 values, indicating "missing" on IsOccluded et al.
     # these always appear together and there are just 33 images with these values missing
     .query("IsOccluded >= 0")
     # store the remaining labels as boolean values (np.bool was removed from recent NumPy releases)
     .astype({label: bool for label in ['IsOccluded', 'IsTruncated', 'IsGroupOf', 'IsDepiction', 'IsInside']})
    )
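
One detail worth checking in a pipeline like this is row alignment: once some images have been filtered out, the pixel rows and the metadata rows only line up if the join uses the surviving row labels. The toy sketch below (all data invented) contrasts a frame rebuilt with fresh row numbers against one that keeps the original index.

```python
import numpy as np
import pandas as pd

meta = pd.DataFrame({
    'ImageID': ['a', 'b', 'c'],
    'LabelName': ['Sandwich', 'Hot dog', 'Hamburger'],
})

# One flattened "image" per row; pretend row 1 was filtered out as a grayscale placeholder.
pixels = pd.Series([np.array([1, 2]), np.array([3, 4])], index=[0, 2])

# Rebuilding the frame without an index renumbers the rows 0..n-1,
# so the second pixel row is paired with the metadata of image 'b'.
renumbered = pd.DataFrame(np.vstack(list(pixels))).add_prefix('p').join(meta)

# Passing the surviving index keeps each pixel row next to its own metadata.
aligned = (pd.DataFrame(np.vstack(list(pixels)), index=pixels.index)
           .add_prefix('p')
           .join(meta))
```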

In [8]:
from tqdm import tqdm_notebook

frames = []

# this process is chunked because reading entire images into numpy arrays is very expensive;
# you'll run out of memory very easily because some of the images are very large
for i in tqdm_notebook(range(X_meta.index.max() // 20 + 1)):
    n_s, n_e = i * 20, i * 20 + 20
    # n_e is already the exclusive end of the chunk; adding to it would make chunks overlap
    X_meta_fragment = X_meta.iloc[n_s:n_e]
    frames.append(process_chunk(X_meta_fragment))
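
One way to express the chunk boundaries so that every row is visited exactly once is a small generator of (start, stop) pairs. This is only a sketch; the helper name `chunk_bounds` is hypothetical and is not part of the notebook or of `openimager`.

```python
def chunk_bounds(n_rows, chunk_size=20):
    """Yield (start, stop) pairs covering range(n_rows) in non-overlapping chunks."""
    for start in range(0, n_rows, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, n_rows)

bounds = list(chunk_bounds(45, chunk_size=20))
```

Each pair could then be used as `X_meta.iloc[start:stop]`, since `iloc` slices exclude the stop position.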

In [17]:
!mkdir -p "../data/training/"

In [9]:
pd.concat(frames).to_csv("../data/training/X_meta_concat.csv")
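
For completeness, here is a sketch of how a file with this layout might be read back and turned into image arrays again. It uses a tiny in-memory stand-in (2x2 pixels instead of 128x128) rather than the real CSV, and it assumes the pixel columns are named `p0`, `p1`, ... in flattening order, as produced by `add_prefix('p')` above.

```python
import io

import numpy as np
import pandas as pd

side, channels = 2, 3  # tiny stand-in for the 128x128x3 images in this notebook
n_pixels = side * side * channels

# Build a toy frame with the same layout: ImageID, p0..pN pixel columns, then a label.
toy = pd.DataFrame(np.arange(2 * n_pixels).reshape(2, n_pixels)).add_prefix('p')
toy.insert(0, 'ImageID', ['a', 'b'])
toy['LabelName'] = ['Sandwich', 'Hot dog']

# Round-trip through CSV text in memory instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
loaded = pd.read_csv(buffer, index_col=0)

# Select the pixel columns in order and restore the (rows, cols, channels) shape per image.
pixel_cols = [f'p{i}' for i in range(n_pixels)]
images = loaded[pixel_cols].to_numpy(dtype='uint8').reshape(-1, side, side, channels)
```

Because `np.ravel` and `reshape` both default to row-major order, the reshape undoes the flattening exactly.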