# Data conversion and normalization

This file contains all the code necessary to convert the [NYU hand pose dataset](http://cims.nyu.edu/~tompson/NYU_Hand_Pose_Dataset.htm) into a single [.hdf5](http://www.h5py.org/) file for use by [deep_hand_pose.ipynb](deep_hand_pose.ipynb).

In [17]:
import h5py
import numpy as np
import pandas as pd
import scipy.misc
import scipy.io as sio
from keras.utils.generic_utils import Progbar
from PIL import Image
import sys
from os import path

### Loading dataset

The default dataset output directory is the [dataset](../dataset) subfolder of the project root.

The default input dataset directory, containing the NYU hand pose dataset, is assumed to be a folder called [nyu_hand_dataset_v2](../nyu_hand_dataset_v2) in the project root.

Simply unzipping the NYU dataset into the project root should be enough to get up and running without any extra configuration.

#### Training data

Training data is located in the [dataset/train](../nyu_hand_dataset_v2/dataset/train) subfolder of the NYU dataset.

There are 72757 training images and corresponding labels.

#### Testing data

Testing data is located in the [dataset/test](../nyu_hand_dataset_v2/dataset/test) subfolder of the NYU dataset.

There are 8252 testing images and corresponding labels.

In [29]:
INPUT_DIR   = '../nyu_hand_dataset_v2/dataset'
DATASET_DIR = '../dataset'
dataset     = h5py.File(path.join(DATASET_DIR, 'dataset.hdf5'), 'a')  # 'a': read/write, create if missing

data_path   = lambda *args : path.join(INPUT_DIR, *args)
image_path  = lambda type, angle, index : '%s_%d_%07d.png' % (type, angle, index + 1)
labels_path = 'joint_data.mat'

In [19]:
def load_data(set, type, angle, dtype='uint8'):
    labels = load_labels(set, angle)
    image_paths = (data_path(set, image_path(type, angle, i)) for i in range(len(labels)))
    images = load_images(image_paths, dtype)
    return images, labels

#### Images

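Each image is $ 640 \times 480 $ pixels with 3 channels, stored in the standard .png format as unsigned 8-bit integers. Loaded as a NumPy array, it has shape $ 480 \times 640 \times 3 $ (rows, columns, channels).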

Images are named as [type]\_[angle]\_[number].png.

Type may be one of:

* rgb
* depth
* synthdepth

Angle may be one of:

* 1 (front)
* 2 (top)
* 3 (side)

Numbers start at 1 and are padded with zeroes to 7 digits.
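
As a quick illustration of the naming scheme, the sketch below builds a file name from a zero-based index and parses one back. The helper names `image_filename` and `parse_image_filename` are made up for this example and are not part of the notebook.

```python
import re

IMAGE_TYPES = ('rgb', 'depth', 'synthdepth')
ANGLES = (1, 2, 3)  # front, top, side

def image_filename(kind, angle, index):
    """Build a file name from a zero-based index (file numbers start at 1)."""
    if kind not in IMAGE_TYPES or angle not in ANGLES:
        raise ValueError('unknown image type or angle')
    return '%s_%d_%07d.png' % (kind, angle, index + 1)

def parse_image_filename(name):
    """Split a file name back into (type, angle, number)."""
    match = re.match(r'^(rgb|depth|synthdepth)_([123])_(\d{7})\.png$', name)
    if match is None:
        raise ValueError('not a dataset image name: %r' % name)
    return match.group(1), int(match.group(2)), int(match.group(3))

print(image_filename('depth', 1, 0))                     # depth_1_0000001.png
print(parse_image_filename('synthdepth_3_0000042.png'))  # ('synthdepth', 3, 42)
```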

In [20]:
def load_images(image_paths, dtype='uint8'):
    for image_path in image_paths:
        with Image.open(image_path) as image:
            yield np.asarray(image, dtype)

#### Labels

Each label is $ 36 \times 3 $, stored as double-precision floating point numbers in a MATLAB .mat file named joint_data.mat. There is one label per image for each of the three camera angles.

Each 3-tuple is a point in the $ uvd $ coordinate space, which is the same as that used in the [depth images](#Depth-images). $ u $ and $ v $ are in pixels, while $ d $ is in millimeters.

There are 36 tuples per label, each corresponding to a keypoint on the hand. Each of the five fingers is represented by 6 keypoints (30 in total); the remaining 6 cover the palm and the wrist, with three keypoints each.

Finger keypoints are named as [finger]\_[segment]\_[component].

Finger may be one of:

* F1 (pinky)
* F2 (ring)
* F3 (middle)
* F4 (index)
* TH (thumb)

Segment may be one of:

* KNU3 (fingertip / distal phalanx)
* KNU2 (finger middle / middle phalanx)
* KNU1 (finger base / proximal phalanx)

Component may be one of:

* A (bone neck)
* B (joint / bone base)

Palm and wrist keypoints are named as follows:

* PALM_1 (outer / ulnar wrist)
* PALM_2 (inner / radial wrist)
* PALM_3 (inner / radial palm)
* PALM_4 (outer / ulnar palm)
* PALM_5 (middle palm / ring metacarpal)
* PALM_6 (middle wrist / lunate)
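
The naming scheme can be checked by enumerating it. The sketch below only illustrates how the names combine; the order in which the keypoints are actually stored comes from `joint_names` in joint_data.mat and is not assumed to match this enumeration.

```python
from itertools import product

FINGERS = ['F1', 'F2', 'F3', 'F4', 'TH']
SEGMENTS = ['KNU3', 'KNU2', 'KNU1']
COMPONENTS = ['A', 'B']

# Every finger/segment/component combination, then the six palm and wrist names
finger_names = ['%s_%s_%s' % parts for parts in product(FINGERS, SEGMENTS, COMPONENTS)]
palm_names = ['PALM_%d' % i for i in range(1, 7)]
all_names = finger_names + palm_names

print(len(finger_names), len(palm_names), len(all_names))  # 30 6 36
```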

In [21]:
def load_labels(set, angle):
    joint_data  = sio.loadmat(data_path(set, labels_path))
    labels      = joint_data['joint_uvd'][angle - 1]
    joint_names = [name[0] for name in joint_data['joint_names'][0]]

    return pd.Panel(labels, major_axis=joint_names, minor_axis=['u', 'v', 'd'])

#### Depth images

Depth images and synthetic depth images [store the most significant bits of depth data in the green channel and the least significant bits in the blue channel](https://github.com/jsupancic/deep_hand_pose/blob/master/src/caffe/jvl.cpp#L507).
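
As a worked example of this encoding, the sketch below decodes a single made-up pixel. The channels are widened to 16 bits before shifting, because shifting 8-bit data left by 8 bits would discard the high byte. The pixel values are hypothetical.

```python
import numpy as np

# One hypothetical RGB pixel: green holds the high byte, blue the low byte
pixel = np.array([[[0, 0x04, 0xD2]]], dtype=np.uint8)

green = pixel[:, :, 1].astype(np.uint16)
blue = pixel[:, :, 2].astype(np.uint16)
depth = (green << 8) | blue   # 0x04D2

print(depth[0, 0])  # 1234
```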

In [22]:
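def convert_depth(image):
    # Widen to 16 bits first; shifting uint8 data left by 8 would overflow
    image = image.astype(np.uint16)
    return np.expand_dims((image[:, :, 1] << 8) + image[:, :, 2], 2)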

### Normalization

The [Deep Hand Pose](https://github.com/jsupancic/deep_hand_pose) project performs a number of normalization steps on the NYU dataset before it is fed into the model.

Broadly, these consist of:

* [Rescaling image depth](https://github.com/jsupancic/deep_hand_pose/blob/master/src/caffe/jvl.cpp#L510) from millimeters to centimeters
* [Rescaling label depth](https://github.com/jsupancic/deep_hand_pose/blob/master/src/caffe/jvl.cpp#L621) from millimeters to centimeters
* [Recentering](https://github.com/jsupancic/deep_hand_pose/blob/master/src/caffe/jvl.cpp#L631) around the [middle finger base](https://github.com/jsupancic/deep_hand_pose/blob/master/src/caffe/jvl.cpp#L630)
* [Clipping](https://github.com/jsupancic/deep_hand_pose/blob/master/src/caffe/jvl.cpp#L650) to [a smaller, depth-dependent bounding box](https://github.com/jsupancic/deep_hand_pose/blob/master/src/caffe/jvl.cpp#L638)
* [Clamping](https://github.com/jsupancic/deep_hand_pose/blob/master/src/caffe/jvl.cpp#L665) depth data and scaling it to a range of -1 to 1
* [Resizing](https://github.com/jsupancic/deep_hand_pose/blob/master/src/caffe/layers/hand_data_layer.cpp#L422) the image and label to 128 x 128
* [Extracting a subset of the hand pose keypoints](https://github.com/jsupancic/deep_hand_pose/blob/master/include/jvl/blob_io.hpp#L8)
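
The clamp-and-scale step in the list above can be sketched in isolation. This is a minimal illustration, assuming the clamp is symmetric around a reference depth; `clamp_and_scale` and its `half_range` parameter are hypothetical names, and the range used by Deep Hand Pose itself is not reproduced here.

```python
import numpy as np

def clamp_and_scale(depth, center_depth, half_range):
    """Clamp depth to center_depth +/- half_range, then map linearly to [-1, 1]."""
    relative = np.asarray(depth, dtype='float') - center_depth
    clamped = np.clip(relative, -half_range, half_range)
    return clamped / half_range

# Hypothetical depths in centimeters, centered on 75 cm with a 10 cm half range
depth_cm = np.array([40.0, 70.0, 75.0, 80.0, 300.0])
scaled = clamp_and_scale(depth_cm, center_depth=75.0, half_range=10.0)

print(scaled)
```

Values closer than 65 cm or farther than 85 cm saturate at -1 and 1 respectively, so background pixels do not dominate the input range.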

In [23]:
def normalize(image, label):
    label = label.copy()
    
    label.ix[:, 'd'] /= 10.
    image = image / 10.
    
    center = label.ix['F3_KNU1_B', ['v', 'u']]
    center_depth = label.ix[:, 'd'].median()
    
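    # Presumably: metric_size is the side length of the crop box in centimeters
    # and fx is the camera's focal length in pixels, so fx * metric_size / depth
    # is that side length projected into pixels at the hand's depth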
    metric_size = 38
    fx = 525
    metric_size_factor = fx * metric_size
    
    bounding_box = np.array([[0, 0], [1, 1]], dtype='float')
    bounding_box = (bounding_box - 0.5) * metric_size_factor / center_depth
    bounding_box += center
    
    label.ix[:, 'd'] -= center_depth
    
    image, label = clip(image, label, bounding_box.astype(int))
    
    return image, label
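
To make the bounding-box arithmetic in `normalize` concrete, the standalone sketch below repeats it for a hypothetical hand. The center pixel and median depth are made-up inputs; the two constants are the ones used in `normalize`.

```python
import numpy as np

metric_size = 38   # constant from normalize
fx = 525           # constant from normalize

center = np.array([240.0, 320.0])   # hypothetical (v, u) of the middle finger base
center_depth = 76.0                 # hypothetical median label depth, in centimeters

# Side length of the square crop in pixels; it shrinks as the hand moves away
side = fx * metric_size / center_depth
unit_box = np.array([[0, 0], [1, 1]], dtype='float')
bounding_box = ((unit_box - 0.5) * side + center).astype(int)

print(side)  # 262.5
print(bounding_box)
```

Doubling the depth halves the side length, which is what makes the crop cover roughly the same physical area regardless of how far the hand is from the camera.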

In [24]:
# Resize an image to the specified dimensions, scaling its label accordingly
def resize(image, label, dimensions):
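    # Force float division so the scale factor is not truncated under Python 2
    label.ix[:, ['v', 'u']] *= np.array(dimensions, dtype='float') / image.shape[:-1]
    
    # TODO: Try to implement or use OpenCV's INTER_AREA resize strategy?
    # Note: imresize min-max rescales non-uint8 input to 0-255 and returns uint8
    image = scipy.misc.imresize(np.squeeze(image), dimensions, 'bilinear')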
    
    return image, label
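
The label half of `resize` amounts to multiplying each (v, u) coordinate by the ratio of the target size to the source size. A minimal NumPy-only sketch, with the hypothetical helper `scale_label` standing in for the pandas indexing used above:

```python
import numpy as np

def scale_label(points_vu, source_shape, target_shape):
    """Scale (v, u) pixel coordinates from one image size to another."""
    factors = np.array(target_shape, dtype='float') / np.array(source_shape, dtype='float')
    return np.asarray(points_vu, dtype='float') * factors

# A hypothetical keypoint at the center of a 256 x 512 crop
points = [[128.0, 256.0]]
print(scale_label(points, (256, 512), (128, 128)))  # [[64. 64.]]
```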

In [25]:
# Clip an image to the specified bounding box, translating its label accordingly
# Bounding box should look like np.array([[v_1, u_1], [v_2, u_2]]), where
# (v_1, u_1) are the (row, column) coordinates of the top left corner and
# (v_2, u_2) are the (row, column) coordinates of the bottom right corner (exclusive)
def clip(image, label, bounding_box):
    label.ix[:, ['v', 'u']] -= bounding_box[0]
    
    image_box = np.array([[0, 0], image.shape[:-1]], dtype='int')
    
    padding = np.array([image_box[0] - bounding_box[0], bounding_box[1] - image_box[1]]).clip(0)
    bounding_box += padding[0]
    padding = np.concatenate((padding.T, np.array([[0, 0]])))
    
    image = np.pad(image, padding, 'edge')
    image = image[slice(*bounding_box[:, 0]), slice(*bounding_box[:, 1])]
    
    return image, label
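
The padding behavior of `clip` is easiest to see on a tiny array. The sketch below is a simplified 2-D variant written for illustration (the name `crop_with_edge_padding` is hypothetical, and it omits the channel axis and the label translation): when the box extends past the image, the missing rows and columns are filled by repeating the nearest edge pixels.

```python
import numpy as np

def crop_with_edge_padding(image, box):
    """Crop a 2-D image to box = [[v_1, u_1], [v_2, u_2]], padding by edge replication."""
    box = np.array(box, dtype=int)
    shape = np.array(image.shape)
    before = np.clip(-box[0], 0, None)         # rows/columns missing before the image
    after = np.clip(box[1] - shape, 0, None)   # rows/columns missing after the image
    padded = np.pad(image, list(zip(before, after)), 'edge')
    start, stop = box + before                 # shift the box into padded coordinates
    return padded[start[0]:stop[0], start[1]:stop[1]]

image = np.arange(9).reshape(3, 3)
# The box starts one row above the image and ends one column past its right edge
print(crop_with_edge_padding(image, [[-1, 1], [2, 4]]))
```

The result always has the size of the requested box, here 3 x 3, even though only part of the box overlaps the image.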

In [None]:
images, labels = load_data('train', 'depth', 1)

data_image = dataset.require_dataset('image/train', (len(labels), 128, 128, 1), dtype='float')
data_label = dataset.require_dataset('label/train', (len(labels), 28), dtype='float')

p = Progbar(len(labels))

for image, item in zip(images, labels.iteritems()):
    index, label = item
    p.update(index)
    
    image = convert_depth(image)
    image, label = normalize(image, label)
    image, label = resize(image, label, (128, 128))
    
    data_image[index] = np.expand_dims(image, 2)
    data_label[index] = label.ix[[
            'F1_KNU3_A',
            'F1_KNU2_B',
            'F2_KNU3_A',
            'F2_KNU2_B',
            'F3_KNU3_A',
            'F3_KNU2_B',
            'F4_KNU3_A',
            'F4_KNU2_B',
            'TH_KNU3_A',
            'TH_KNU3_B',
            'TH_KNU2_B',
            'PALM_1',
            'PALM_2',
            'PALM_3'
        ], ['u', 'v']].values.flatten()

dataset.close()
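
Each stored label row holds 14 keypoints flattened as u, v pairs, which is where the 28 columns come from. A short sketch of how such a row could be unpacked again, using a made-up row in place of data read from the file:

```python
import numpy as np

NUM_KEYPOINTS = 14

# A hypothetical stored row laid out as u_0, v_0, u_1, v_1, ...
row = np.arange(2 * NUM_KEYPOINTS, dtype='float')

keypoints = row.reshape(NUM_KEYPOINTS, 2)
u, v = keypoints[:, 0], keypoints[:, 1]

print(keypoints.shape)  # (14, 2)
print(u[:3], v[:3])
```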
