Environment:
- Ubuntu 16.04 LTS
- TensorFlow 1.4
- Python 2.7
- OpenCV 3.2.0
In this work, we aim to learn a direct mapping from hand pose to depth hand image. Our goals are two-fold: (1) be capable of exploring the underlying conditional image-formation distribution
To generate depth hand images that are as similar as possible to raw depth hand images, we propose to combine a GAN with style transfer to obtain the synthesized images. The architecture can be divided into three parts: the generator, the discriminator, and the style transfer network. The generator synthesizes hand images from hand poses; we then follow the GAN idea of a two-player zero-sum game and consider the optimization problem that characterizes the interplay between G and D; finally, the style transfer part transforms the smooth synthetic images into depth hand images that are more similar to real ones.
Suppose the generator is $G_{\theta}$ and the discriminator is $D_r$. The adversarial objective is
$\min_{\theta} \, \max_{r} \; E_{x,y\sim p(x,y)}[\log D_r(x,y)] + E_{y\sim p(y)}[\log(1-D_r(G_{\theta}(y)))]$
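To make the two-player objective concrete, the following is a minimal NumPy sketch (not the paper's actual TensorFlow code) that evaluates the minimax value for one batch of discriminator scores; the function name `gan_value` is illustrative.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Value of the two-player objective for one batch.

    d_real: discriminator outputs D_r(x, y) on real pose/image pairs, in (0, 1).
    d_fake: discriminator outputs D_r(G_theta(y)) on generated images.
    """
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# The discriminator maximizes this value; the generator minimizes it.
d_real = np.array([0.9, 0.8])   # confident "real" scores
d_fake = np.array([0.1, 0.2])   # confident "fake" scores
v_good_d = gan_value(d_real, d_fake)

d_real2 = np.array([0.5, 0.5])  # an uncertain discriminator
d_fake2 = np.array([0.5, 0.5])
v_weak_d = gan_value(d_real2, d_fake2)

assert v_good_d > v_weak_d  # a sharper discriminator attains a higher value
```

A generator that fools the discriminator pushes `d_fake` toward 1, which drives the second term (and the overall value) down.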
with
To summarize, learning the generator $G_{\theta}$ amounts to minimizing this objective with respect to $\theta$.
On the other hand, the discriminator $D_r$ is learned by maximizing the objective with respect to $r$.
The learning process is carried out in practice by alternately optimizing the discriminator $D_r$ and the generator $G_{\theta}$.
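The alternating schedule can be sketched schematically as follows. This is an assumption about the training loop's shape, not the paper's code: `k` discriminator steps per generator step follows the common GAN training schedule, and `disc_step`/`gen_step` are hypothetical callables standing in for the actual optimizer updates.

```python
def train(gen_step, disc_step, num_iters, k=1):
    """Alternating optimization: k discriminator ascent steps
    (maximize over r) per generator descent step (minimize over theta)."""
    log = []
    for _ in range(num_iters):
        for _ in range(k):
            disc_step()      # ascend the objective w.r.t. r
            log.append('D')
        gen_step()           # descend the objective w.r.t. theta
        log.append('G')
    return log

# With k=2, each outer iteration runs two D updates, then one G update.
schedule = train(lambda: None, lambda: None, num_iters=2, k=2)
assert schedule == ['D', 'D', 'G', 'D', 'D', 'G']
```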
Style transfer is the task of generating an image whose style matches a style image and whose content matches a content image. With a clear definition of the style and content representations, we can define a loss function that measures how far our generated images are from a perfect style transfer.
Without style transfer, the synthetic images from the generator are rather smooth, so style transfer is applied to make them more similar to real ones. In the following, we describe the specific network architecture of our style transfer in detail. Following the style transfer idea, we use the convolutional neural network VGG-19 to extract features from its multiple layers. We can define the index
Content loss: Given the chosen content layer
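The content term can be sketched as follows: a minimal NumPy example assuming the chosen layer's feature maps are already available as arrays (the shapes and the name `content_loss` are illustrative, not taken from the paper).

```python
import numpy as np

def content_loss(feat_content, feat_generated):
    """Mean squared error between the feature maps of the content
    image and the generated image at one chosen VGG-19 layer."""
    return np.mean((feat_content - feat_generated) ** 2)

f_c = np.ones((4, 4, 8))   # hypothetical H x W x C feature map
f_g = np.zeros((4, 4, 8))
assert content_loss(f_c, f_g) == 1.0   # maximally different toy maps
assert content_loss(f_c, f_c) == 0.0   # identical maps cost nothing
```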
Style loss: We do something similar for the style layers, but now we want to measure which features in the style layers activate simultaneously for the style image, and then copy this activation pattern to the mixed image. These feature correlations are given by the Gram matrix
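The Gram matrix of one layer can be computed as below; a NumPy sketch assuming channel-last `(H, W, C)` feature maps, with normalization by the number of spatial positions (a common convention, assumed rather than taken from the paper).

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel feature correlations for one layer.

    features: feature map of shape (H, W, C).
    Returns a (C, C) matrix of inner products between channel
    activations, normalized by the number of spatial positions.
    """
    h, w, c = features.shape
    f = features.reshape(h * w, c)   # flatten spatial dims, keep channels
    return f.T @ f / (h * w)

feat = np.random.randn(8, 8, 4)
g = gram_matrix(feat)
assert g.shape == (4, 4)
assert np.allclose(g, g.T)  # Gram matrices are symmetric by construction
```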
The loss function for style is quite similar to our content loss, except that we compute the mean squared error between the Gram matrices instead of between the raw tensor outputs of the layers.
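Written out, the style term sums the Gram-matrix MSE over the chosen style layers. A minimal sketch, assuming the per-layer Gram matrices have already been computed (uniform layer weighting is an assumption; the paper may weight layers differently).

```python
import numpy as np

def style_loss(grams_style, grams_generated):
    """Sum over style layers of the mean squared error between the
    Gram matrices of the style image and the generated image."""
    return sum(np.mean((gs - gg) ** 2)
               for gs, gg in zip(grams_style, grams_generated))

g_style = [np.eye(3), 2 * np.eye(3)]        # toy Gram matrices, two layers
assert style_loss(g_style, g_style) == 0.0  # matching styles cost nothing
assert style_loss(g_style, [np.zeros((3, 3))] * 2) > 0.0
```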
Total variation loss: Furthermore, we encourage spatial smoothness in the generated phantom by incorporating the following total variation loss:
$l_{tv}\left(G_\theta\right)=\sum_{w,h}\left(\left\|\hat{x}_{w,h+1}-\hat{x}_{w,h}\right\|_2^2+\left\|\hat{x}_{w+1,h}-\hat{x}_{w,h}\right\|_2^2\right)$
with $\hat{x}_{w,h}$ denoting the pixel of the generated image at position $(w,h)$.
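The total variation term above can be sketched in a few lines of NumPy, assuming a single-channel image array (the function name `tv_loss` is illustrative):

```python
import numpy as np

def tv_loss(x):
    """Total variation loss l_tv: sum of squared differences between
    neighbouring pixels, penalizing high-frequency noise and thereby
    encouraging spatial smoothness."""
    dh = x[:, 1:] - x[:, :-1]   # horizontal neighbours: x[w, h+1] - x[w, h]
    dv = x[1:, :] - x[:-1, :]   # vertical neighbours:   x[w+1, h] - x[w, h]
    return np.sum(dh ** 2) + np.sum(dv ** 2)

flat = np.full((4, 4), 0.5)            # constant image
noisy = np.arange(16.0).reshape(4, 4)  # image with large neighbour jumps
assert tv_loss(flat) == 0.0            # no variation at all
assert tv_loss(noisy) > tv_loss(flat)
```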
The above-mentioned three loss functions together lead to
the objective function of discriminator
With the GAN and style transfer structure, we can obtain synthetic images that are far more similar to the raw images. We also tried the GAN alone, without the style transfer structure, and the synthesized images were smoother than the raw ones. Because the GAN ignores the noise of real depth images, style transfer is employed to extract the contours of the synthetic image and the textures of the style image, and then to mix the content and style features to obtain the phantom based on the particular style representation provided by the single style image. Besides, we empirically observe that the style structure eliminates the shadow of the image background.
| Raw | Style | GAN+Style Transfer |
|---|---|---|