Code adapted from [affinelayer's TensorFlow implementation](https://github.com/affinelayer/pix2pix-tensorflow). I would highly recommend checking out [his post](https://affinelayer.com/pix2pix/) for a high-level overview of how the model works. This post is meant more as a line-by-line walk through the code.

## Load Data

First, let's load in our data. To start, we're going to work with the facades dataset. Each example pairs a photo of a building facade with a labeled version of that photo, in which different colors represent different features like windows, doors, etc. Our goal will be to produce the photo from the labeled version.

Unlike prior posts, we're going to use some of TensorFlow's utilities for loading in data. We get our list of files, put them into a queue, and have a `WholeFileReader` read each one before decoding it as a JPEG. Each file holds both images side by side: the left half is the target photo, and the right half is the annotated version.

<img style="display:block;margin:auto" src="imgs/ipynb/pix2pix/input_example.jpg">

We'll preprocess the image so its pixel values lie in $[-1, 1]$ and then cut it in half, assigning the left half to our target and the right half to our input (flipped since we're mapping labeled $\rightarrow$ photo).

In [1]:
import tensorflow as tf
import numpy as np
import glob
import math
import time
import os

BATCH_SIZE = 1
def loadData(path, shuffle=True):
    ''' Loads in image data from path. '''
    input_paths = glob.glob(os.path.join(path, '*.jpg')) # All jpgs
    path_queue = tf.train.string_input_producer(input_paths, shuffle=shuffle) # Produces image paths
    reader = tf.WholeFileReader()
    paths, contents = reader.read(path_queue)
    rawInput = tf.image.decode_jpeg(contents)
    rawInput = tf.image.convert_image_dtype(rawInput, dtype=tf.float32)

    # [height, width, channel]
    rawInput.set_shape([None, None, 3])
    width = tf.shape(rawInput)[1]

    def process(r):
        # Resize to 256x256
        r = tf.image.resize_images(r, [256, 256], method=tf.image.ResizeMethod.AREA)
        # Pix vals from [0, 1] => [-1, 1]
        return r * 2 - 1

    targets = process(rawInput[:,:width//2,:]) # Left side
    inputs = process(rawInput[:,width//2:,:]) # Right side

    paths, inputs, targets = tf.train.batch([paths, inputs, targets], batch_size=BATCH_SIZE)
    steps_per_epoch = int(math.ceil(len(input_paths) / BATCH_SIZE))
    return paths, inputs, targets, steps_per_epoch

# Train data
paths, inputs, targets, steps_per_epoch = loadData('data/pix2pix/facades/train')
# Test data
tst_paths, tst_inputs, tst_targets, tst_steps_per_epoch = loadData('data/pix2pix/facades/val', shuffle=False)
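
To make the split-and-rescale step concrete without a TensorFlow session, here is a minimal NumPy sketch of the same idea. It is only an illustration: `split_and_scale` is a hypothetical helper, the test image is made up, and the resize to 256x256 is skipped.

```python
import numpy as np

def split_and_scale(combined):
    """Split a side-by-side uint8 image into (inputs, targets) scaled to [-1, 1]."""
    width = combined.shape[1]
    scaled = combined.astype(np.float32) / 255.0   # [0, 255] -> [0, 1]
    scaled = scaled * 2 - 1                        # [0, 1]   -> [-1, 1]
    targets = scaled[:, :width // 2, :]            # left half: the photo
    inputs = scaled[:, width // 2:, :]             # right half: the labels
    return inputs, targets

# Made-up 256x512 RGB image: left half black, right half white.
combined = np.zeros((256, 512, 3), dtype=np.uint8)
combined[:, 256:, :] = 255

inputs, targets = split_and_scale(combined)
print(inputs.shape, targets.shape)   # (256, 256, 3) (256, 256, 3)
```

Black pixels land on -1 and white pixels on 1, which matches the range of the `tanh` the generator ends with.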

## Some Helper Functions

Before we start building the model, there are a couple functions we'll be using that we should define.

### Conv & Deconv

If you're not familiar with convolution, check out my post on [Convolutional Neural Nets](/posts/cnn-mnist.html). Deconvolution (or transposed convolution) was new to me. Normally when performing convolution, we apply filters in such a way that we decrease the size of our image, effectively downsampling it. Deconvolution does the opposite, upsampling an image to output a larger tensor. There is a lot of debate as to what to call this; deconvolution has a well-defined meaning in signal processing as the inverse of convolution, which is not what this operation does. Because of this, many people opt for the name transposed convolution instead, which is also what the TensorFlow API uses. I'm just going to use `deconv` in code because it's easier to type.

What deconvolution ends up looking like is just convolution but with more padding around/between pixels:

<div style='text-align: center'>
<figure style="display: inline-block;">
<img src="imgs/ipynb/pix2pix/conv.gif">
<figcaption>Convolution with Stride 2</figcaption>
</figure>
<figure style="display: inline-block;">
<img src="imgs/ipynb/pix2pix/deconv.gif">
<figcaption>Deconvolution with Stride 2</figcaption>
</figure>
<p>Animations from <a href="https://github.com/vdumoulin/conv_arithmetic">here</a></p>
</div>

In [2]:
def conv(batch_input, out_channels, stride):
    ''' Convolve input with given stride. '''
    with tf.variable_scope("conv"):
        in_channels = batch_input.get_shape()[3]
        # The trainable filter we create for the conv
        conv_filter = tf.get_variable("filter",
                                      [4, 4, in_channels, out_channels],
                                      dtype=tf.float32,
                                      initializer=tf.random_normal_initializer(0, 0.02))

        padded_input = tf.pad(batch_input,
                              [[0, 0], [1, 1], [1, 1], [0, 0]],
                              mode="CONSTANT")
        # Output of the conv
        conv = tf.nn.conv2d(padded_input,
                            conv_filter,
                            [1, stride, stride, 1],
                            padding="VALID")
        return conv

def deconv(batch_input, out_channels):
    ''' Transposed Convolution. '''
    with tf.variable_scope("deconv"):
        batch, in_height, in_width, in_channels = [int(d) for d in batch_input.get_shape()]

        # The trainable filter we create for the deconv
        conv_filter = tf.get_variable("filter",
                                      [4, 4, out_channels, in_channels],
                                      dtype=tf.float32,
                                      initializer=tf.random_normal_initializer(0, 0.02))
        # Output of the deconv
        conv = tf.nn.conv2d_transpose(batch_input,
                                      conv_filter,
                                      [batch, in_height * 2, in_width * 2, out_channels],
                                      [1, 2, 2, 1],
                                      padding="SAME")
        return conv
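
The shape comments later in the post all follow from a little arithmetic on these two helpers. The sketch below is my own summary of that arithmetic, not part of the original code; `conv_out_size` and `deconv_out_size` are hypothetical names, and the defaults mirror the 4x4 filter and 1-pixel padding used above.

```python
def conv_out_size(size, stride, kernel=4, pad=1):
    """Spatial output size of conv(): pad each side, then a VALID convolution."""
    return (size + 2 * pad - kernel) // stride + 1

def deconv_out_size(size, stride=2):
    """Spatial output size requested by deconv(): the input size times the stride."""
    return size * stride

print(conv_out_size(256, stride=2))  # 128
print(conv_out_size(32, stride=1))   # 31
print(conv_out_size(31, stride=1))   # 30
print(deconv_out_size(128))          # 256
```

With stride 2 the convolution halves the height and width, while with stride 1 it trims one pixel, which is where the 31x31 and 30x30 shapes in the discriminator come from.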

### Leaky ReLU

Leaky ReLU is a variation of normal ReLU that helps prevent dead neurons. With ReLU, values less than 0 are set to 0 and have zero gradient. This creates a problem: a neuron can end up outputting 0 for every input, and once in that state it can't get out, since there is no gradient to follow. Leaky ReLU tries to fix this by giving values less than zero a small positive slope $a$ instead of a flat line, equivalent to the following:

\begin{align*}
\operatorname{ReLU}(x) &= \max(0,x) \\
\operatorname{LReLU}(x,a) &= \frac{1+a}{2}x + \frac{1-a}{2} \vert x \vert  \\
&= \begin{cases}
    x,&  x \geq 0\\
    ax,& x < 0
\end{cases}
\end{align*}

<div style="text-align:center">
<img style="margin: 5px; border: 1px solid black; display: inline-block; width: 30%" src="imgs/ipynb/pix2pix/relu.png"><img style="margin: 5px; border: 1px solid black; display: inline-block; width: 30%" src="imgs/ipynb/pix2pix/lrelu.png"></div>

In [3]:
def lrelu(x, a):
    ''' Leaky ReLU.
        x is our tensor.
        a is the slope applied to x < 0.
    '''
    return (0.5 * (1 + a)) * x + (0.5 * (1 - a)) * tf.abs(x)
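
As a quick sanity check that the absolute-value form really matches the piecewise definition, here is a small NumPy comparison. The function names are mine and NumPy stands in for TensorFlow.

```python
import numpy as np

def lrelu_abs(x, a):
    """Leaky ReLU written with an absolute value, as in the post."""
    return 0.5 * (1 + a) * x + 0.5 * (1 - a) * np.abs(x)

def lrelu_piecewise(x, a):
    """Leaky ReLU written as the piecewise definition."""
    return np.where(x >= 0, x, a * x)

x = np.linspace(-3.0, 3.0, 13)
same = np.allclose(lrelu_abs(x, 0.2), lrelu_piecewise(x, 0.2))
print(same)
```

For $x \geq 0$ the two terms add up to $x$, and for $x < 0$ they add up to $ax$, so the check should hold for any slope `a`.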

### Batch Normalization

Batch Normalization is a cool general technique for improving training. All it does is normalize the input to a layer to zero mean and unit variance, then apply a trainable scale and offset that let the network adjust how much normalization is kept. From [the paper](https://arxiv.org/pdf/1502.03167.pdf) where it was introduced:

<img style="width: 40%; display: block; margin: auto;" src="imgs/ipynb/pix2pix/batchnorm.png">

If $\gamma \approx \sigma$ and $\beta \approx \mu$ then the transformation becomes an identity, allowing the NN to disable the behavior if not beneficial. In general, however, it allows for higher learning rates, reduces the need for dropout as it has a regularizing effect, and reduces dependence on initialization. 

In [4]:
def batchnorm(inp):
    ''' Batch Normalization. '''
    with tf.variable_scope("batchnorm"):
        channels = inp.get_shape()[3]
        offset = tf.get_variable("offset",
                                 [channels],
                                 dtype=tf.float32,
                                 initializer=tf.zeros_initializer())
        scale = tf.get_variable("scale",
                                [channels],
                                dtype=tf.float32,
                                initializer=tf.random_normal_initializer(1.0, 0.02))

        mean, variance = tf.nn.moments(inp, axes=[0, 1, 2], keep_dims=False)
        variance_epsilon = 1e-5
        normalized = tf.nn.batch_normalization(inp, mean, variance,
                                               offset, scale, variance_epsilon=variance_epsilon)
        return normalized
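
Here is a minimal NumPy sketch of the same computation, mainly to show the per-channel statistics and the identity case mentioned above. `batchnorm_np` is a hypothetical stand-in, and the input is random data I made up for illustration.

```python
import numpy as np

def batchnorm_np(x, scale, offset, eps=1e-5):
    """Normalize over batch, height, and width: one mean/variance per channel."""
    mean = x.mean(axis=(0, 1, 2))
    variance = x.var(axis=(0, 1, 2))
    normalized = (x - mean) / np.sqrt(variance + eps)
    return scale * normalized + offset

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 8, 3))  # [batch, height, width, channels]

out = batchnorm_np(x, scale=np.ones(3), offset=np.zeros(3))
print(out.mean(axis=(0, 1, 2)))  # close to 0 for every channel
print(out.std(axis=(0, 1, 2)))   # close to 1 for every channel

# With scale = sigma and offset = mu the layer is (almost exactly) the identity.
identity = batchnorm_np(x, scale=x.std(axis=(0, 1, 2)), offset=x.mean(axis=(0, 1, 2)))
print(np.allclose(identity, x, atol=1e-4))
```

The identity is only approximate because of the small `eps` added to the variance.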

Now with that out of the way, let's get to actually building our model. Like all GAN architectures, our model will have a generator and a discriminator. Let's start with the generator.

## Generator

<figure>
<img style="width: 100%;" src="imgs/ipynb/pix2pix/generator.png">
<img style="display:block;width:50%;margin:auto;" src="imgs/ipynb/pix2pix/units.png">
<figcaption style='text-align:center'>Images from <a href="https://affinelayer.com/pix2pix/">affinelayer</a></figcaption>
</figure>

The generator is made up of a series of convolution layers, followed by a series of deconvolution layers. This has an effect similar to an autoencoder, where the model is forced to compress the 256x256x3 image into a single 1x1x512 vector, keeping only the most important parts of the image before upsampling it again with the deconvolution layers. However, in these image-to-image translation tasks the larger structural features are often the same in both input and target, so the model has skip connections: the output of each convolution layer is also fed into its mirror-image deconvolution layer, allowing this information to transfer.

Each encoder layer is preceded by a leaky ReLU, and each decoder layer by a normal ReLU, with dropout on the first three decoder layers. Every layer gets batch normalization except the first encoder layer and the final decoder layer, which ends in a `tanh` instead.

In [5]:
# Number of generator filters
NGF = 64

def create_generator(generator_inputs):
    ''' Creates our generator for the given inputs. '''
    layers = []

    # encoder_1: [batch, 256, 256, 3] => [batch, 128, 128, ngf]
    # This layer doesn't get batchnorm (from paper)
    with tf.variable_scope("encoder_1"):
        output = conv(generator_inputs, NGF, stride=2)
        layers.append(output)

    layer_specs = [
        NGF * 2, # encoder_2: [batch, 128, 128, ngf] => [batch, 64, 64, ngf * 2]
        NGF * 4, # encoder_3: [batch, 64, 64, ngf * 2] => [batch, 32, 32, ngf * 4]
        NGF * 8, # encoder_4: [batch, 32, 32, ngf * 4] => [batch, 16, 16, ngf * 8]
        NGF * 8, # encoder_5: [batch, 16, 16, ngf * 8] => [batch, 8, 8, ngf * 8]
        NGF * 8, # encoder_6: [batch, 8, 8, ngf * 8] => [batch, 4, 4, ngf * 8]
        NGF * 8, # encoder_7: [batch, 4, 4, ngf * 8] => [batch, 2, 2, ngf * 8]
        NGF * 8, # encoder_8: [batch, 2, 2, ngf * 8] => [batch, 1, 1, ngf * 8]
    ]

    for out_channels in layer_specs:
        with tf.variable_scope("encoder_%d" % (len(layers) + 1)):
            rectified = lrelu(layers[-1], 0.2)
            # [batch, in_height, in_width, in_channels] => [batch, in_height/2, in_width/2, out_channels]
            convolved = conv(rectified, out_channels, stride=2)
            output = batchnorm(convolved)
            layers.append(output)

    layer_specs = [
        (NGF * 8, 0.5),   # decoder_8: [batch, 1, 1, ngf * 8]       => [batch, 2, 2, ngf * 8 * 2]
        (NGF * 8, 0.5),   # decoder_7: [batch, 2, 2, ngf * 8 * 2]   => [batch, 4, 4, ngf * 8 * 2]
        (NGF * 8, 0.5),   # decoder_6: [batch, 4, 4, ngf * 8 * 2]   => [batch, 8, 8, ngf * 8 * 2]
        (NGF * 8, 0.0),   # decoder_5: [batch, 8, 8, ngf * 8 * 2]   => [batch, 16, 16, ngf * 8 * 2]
        (NGF * 4, 0.0),   # decoder_4: [batch, 16, 16, ngf * 8 * 2] => [batch, 32, 32, ngf * 4 * 2]
        (NGF * 2, 0.0),   # decoder_3: [batch, 32, 32, ngf * 4 * 2] => [batch, 64, 64, ngf * 2 * 2]
        (NGF, 0.0),       # decoder_2: [batch, 64, 64, ngf * 2 * 2] => [batch, 128, 128, ngf * 2]
    ]

    num_encoder_layers = len(layers)
    for decoder_layer, (out_channels, dropout) in enumerate(layer_specs):
        # Conv layer to connect to
        skip_layer = num_encoder_layers - decoder_layer - 1
        with tf.variable_scope("decoder_%d" % (skip_layer + 1)):
            if decoder_layer == 0:
                # First decoder layer doesn't have skip connections
                # since it is directly connected to the skip_layer.
                inp = layers[-1]
            else:
                inp = tf.concat([layers[-1], layers[skip_layer]], axis=3)

            rectified = tf.nn.relu(inp)
            # [batch, in_height, in_width, in_channels] => [batch, in_height*2, in_width*2, out_channels]
            output = deconv(rectified, out_channels)
            output = batchnorm(output)

            if dropout > 0.0:
                output = tf.nn.dropout(output, keep_prob=1 - dropout)

            layers.append(output)

    # decoder_1: [batch, 128, 128, ngf * 2] => [batch, 256, 256, 3]
    with tf.variable_scope("decoder_1"):
        inp = tf.concat([layers[-1], layers[0]], axis=3)
        rectified = tf.nn.relu(inp)
        output = deconv(rectified, 3)
        output = tf.tanh(output) # Limits output to (-1, 1)
        layers.append(output)

    return layers[-1]
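
To see the whole hourglass at a glance, the sketch below traces the output shape of every layer in plain Python. It is my own bookkeeping rather than part of the model: `trace_generator_shapes` is a hypothetical helper, and the channel counts it reports are each layer's own output, before any skip connection is concatenated on.

```python
NGF = 64

def trace_generator_shapes(size=256, ngf=NGF):
    """Return (name, height/width, channels) for each generator layer output."""
    encoder_channels = [ngf, ngf * 2, ngf * 4] + [ngf * 8] * 5
    decoder_channels = [ngf * 8] * 4 + [ngf * 4, ngf * 2, ngf, 3]
    shapes = []
    for i, channels in enumerate(encoder_channels):
        size //= 2  # each stride-2 conv halves height and width
        shapes.append(("encoder_%d" % (i + 1), size, channels))
    for i, channels in enumerate(decoder_channels):
        size *= 2   # each deconv doubles height and width
        shapes.append(("decoder_%d" % (8 - i), size, channels))
    return shapes

shapes = trace_generator_shapes()
for name, size, channels in shapes:
    print(name, (size, size, channels))
```

The middle of the list is `encoder_8` at 1x1 with `NGF * 8 = 512` channels, and the last entry is `decoder_1` back at 256x256 with 3 channels.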

## Discriminator

<figure>
<img style="width: 100%" src="imgs/ipynb/pix2pix/discriminator.png">
<figcaption style='text-align:center'>Image from <a href="https://affinelayer.com/pix2pix/">affinelayer</a></figcaption>
</figure>

Next up is our discriminator, which is a lot simpler than the generator. The discriminator receives an input (a labeled facade in this case) and a generated or real output for that input, and has to gauge the legitimacy of that output. It is made up of a series of convolutions ending in a 30x30 grid of probabilities, each rating how "real" one patch of the image looks. The first layer gets only leaky ReLU, the middle layers get batch normalization followed by leaky ReLU, and the final layer gets a sigmoid.

In [6]:
# Number of discriminator filters
NDF = 64

def create_discriminator(discrim_inputs, discrim_targets):
    n_layers = 3
    layers = []

    # 2x [batch, height, width, in_channels] => [batch, height, width, in_channels * 2]
    inp = tf.concat([discrim_inputs, discrim_targets], axis=3)

    # layer_1: [batch, 256, 256, in_channels * 2] => [batch, 128, 128, ndf]
    with tf.variable_scope("layer_1"):
        convolved = conv(inp, NDF, stride=2)
        rectified = lrelu(convolved, 0.2)
        layers.append(rectified)

    # layer_2: [batch, 128, 128, ndf] => [batch, 64, 64, ndf * 2]
    # layer_3: [batch, 64, 64, ndf * 2] => [batch, 32, 32, ndf * 4]
    # layer_4: [batch, 32, 32, ndf * 4] => [batch, 31, 31, ndf * 8]
    for i in range(n_layers):
        with tf.variable_scope("layer_%d" % (len(layers) + 1)):
            out_channels = NDF * min(2**(i+1), 8)
            stride = 1 if i == n_layers - 1 else 2  # last layer here has stride 1
            convolved = conv(layers[-1], out_channels, stride=stride)
            normalized = batchnorm(convolved)
            rectified = lrelu(normalized, 0.2)
            layers.append(rectified)

    # layer_5: [batch, 31, 31, ndf * 8] => [batch, 30, 30, 1]
    with tf.variable_scope("layer_%d" % (len(layers) + 1)):
        convolved = conv(rectified, out_channels=1, stride=1)
        output = tf.sigmoid(convolved) # Limits output to (0, 1), a probability
        layers.append(output)

    return layers[-1]
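
Since the output is a 30x30 grid rather than one number, it can help to ask how much of the image each grid cell looks at. The sketch below works that out with the usual receptive-field recurrence; `receptive_field` is a hypothetical helper of mine, and the kernel and stride values are read off the code above.

```python
def receptive_field(layers):
    """Receptive field of one output unit, given (kernel, stride) per layer."""
    field = 1
    for kernel, stride in reversed(layers):
        field = (field - 1) * stride + kernel
    return field

# (kernel, stride) for layer_1 .. layer_5 of create_discriminator
discrim_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(discrim_layers))  # 70
```

Under these assumptions each output probability judges a 70x70 patch of the input.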

## Loss & Optimizer

The discriminator's loss is simple: it is penalized for labeling a real pair as fake or a generated pair as real.

For the generator, we have two losses. The first is the standard GAN loss, which is small when the discriminator mistakes the generated image for a real one. The second is an L1 loss that tries to minimize the differences between the generated and target images, since they should be as similar as possible. L2 loss is also sometimes used here, but the paper notes that L1 loss produces less blurring. We then linearly combine these two losses to get our total generator loss.

In [7]:
# Our generated images
with tf.variable_scope("generator"):
    outputs = create_generator(inputs)
with tf.variable_scope("generator", reuse=True):
    tst_outputs = create_generator(tst_inputs)

# Discriminator for our "real" images
with tf.variable_scope("discriminator"):
    predict_real = create_discriminator(inputs, targets)

# Discriminator for our generated images
with tf.variable_scope("discriminator", reuse=True):
    predict_fake = create_discriminator(inputs, outputs)

# To prevent log(0)
EPS = 1e-12

# Discriminator Loss
discrim_loss = tf.reduce_mean(-(tf.log(predict_real + EPS) + tf.log(1 - predict_fake + EPS)))

# Parameters controlling how much we weight each loss
L1_WEIGHT = 100.
GAN_WEIGHT = 1.

# Adversarial Loss
gen_loss_GAN = tf.reduce_mean(-tf.log(predict_fake + EPS))
# L1 Loss
gen_loss_L1 = tf.reduce_mean(tf.abs(targets - outputs))
# Overall Loss
gen_loss = gen_loss_GAN * GAN_WEIGHT + gen_loss_L1 * L1_WEIGHT


# Learning Rate
LR = 0.0002
# Momentum Term
BETA1 = 0.5

# Discriminator training variables
discrim_tvars = [var for var in tf.trainable_variables() if var.name.startswith("discriminator")]
discrim_optim = tf.train.AdamOptimizer(LR, BETA1)
discrim_grads_and_vars = discrim_optim.compute_gradients(discrim_loss, var_list=discrim_tvars)
discrim_train = discrim_optim.apply_gradients(discrim_grads_and_vars)

# Makes it so discrim must exec first
with tf.control_dependencies([discrim_train]):
    # All trainable generator variables
    gen_tvars = [var for var in tf.trainable_variables() if var.name.startswith("generator")]
    gen_optim = tf.train.AdamOptimizer(LR, BETA1)
    gen_grads_and_vars = gen_optim.compute_gradients(gen_loss, var_list=gen_tvars)
    gen_train = gen_optim.apply_gradients(gen_grads_and_vars)

# Maintain moving averages of vars
ema = tf.train.ExponentialMovingAverage(decay=0.99)
update_losses = ema.apply([discrim_loss, gen_loss_GAN, gen_loss_L1])

# Global Step
global_step = tf.contrib.framework.get_or_create_global_step()
incr_global_step = tf.assign(global_step, global_step+1)


train = tf.group(update_losses, incr_global_step, gen_train)
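
To get a feel for the numbers these losses produce, here is a NumPy sketch of the same formulas evaluated on made-up predictions. The function names and all input values are hypothetical; only the formulas and weights come from the code above.

```python
import numpy as np

EPS = 1e-12  # prevents log(0)

def discriminator_loss(predict_real, predict_fake):
    """Low when real pairs score near 1 and generated pairs score near 0."""
    return np.mean(-(np.log(predict_real + EPS) + np.log(1 - predict_fake + EPS)))

def generator_loss(predict_fake, targets, outputs, gan_weight=1.0, l1_weight=100.0):
    """Weighted sum of the adversarial loss and the L1 distance to the target."""
    gan = np.mean(-np.log(predict_fake + EPS))
    l1 = np.mean(np.abs(targets - outputs))
    return gan * gan_weight + l1 * l1_weight

grid = (1, 30, 30, 1)  # shape of the discriminator output
confident = discriminator_loss(np.full(grid, 0.9), np.full(grid, 0.1))
unsure = discriminator_loss(np.full(grid, 0.5), np.full(grid, 0.5))
print(confident, unsure)  # the confident discriminator has the lower loss

image = (1, 256, 256, 3)
g_loss = generator_loss(np.full(grid, 0.5), np.zeros(image), np.full(image, 0.1))
print(g_loss)  # roughly log(2) + 100 * 0.1
```

With an L1 weight of 100, even an average pixel error of 0.1 dominates the adversarial term here.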

## Train

Training is pretty straightforward. One thing to note is that, unlike prior models I've written up, this one really needs a GPU if you're going to run the full 200 epochs; otherwise you'll be waiting a while.

In [None]:
def save_images(fetches):
    ''' Saves out images from fetches. '''
    image_dir = os.path.join('out', "images")
    if not os.path.exists(image_dir):
        os.makedirs(image_dir)

    filesets = []
    for i, in_path in enumerate(fetches["paths"]):
        name, _ = os.path.splitext(os.path.basename(in_path.decode("utf8")))
        fileset = {"name": name}
        for kind in ["inputs", "outputs", "targets"]:
            filename = name + "-" + kind + ".png"
            fileset[kind] = filename
            out_path = os.path.join(image_dir, filename)
            contents = fetches[kind][i]
            with open(out_path, "wb") as f:
                f.write(contents)
        filesets.append(fileset)
    return filesets

def convert(image):
    ''' Converts NN output back to normal image. '''
    image = (image + 1) / 2 # [-1, 1] => [0, 1]
    return tf.image.convert_image_dtype(image, dtype=tf.uint8, saturate=True)

# Reverse processing on images so they can be outputted
converted_inputs = convert(tst_inputs)
converted_targets = convert(tst_targets)
converted_outputs = convert(tst_outputs)

# Gets image data for inputs, targets, and outputs
display_fetches = {
    "paths": tst_paths,
    "inputs": tf.map_fn(tf.image.encode_png, converted_inputs, dtype=tf.string, name="input_pngs"),
    "targets": tf.map_fn(tf.image.encode_png, converted_targets, dtype=tf.string, name="target_pngs"),
    "outputs": tf.map_fn(tf.image.encode_png, converted_outputs, dtype=tf.string, name="output_pngs"),
}

MAX_EPOCHS = 200
OUTPUT_FREQ = 50
SAVE_FREQ = 5000

saver = tf.train.Saver(max_to_keep=1)
sv = tf.train.Supervisor(logdir='out', saver=None)
with sv.managed_session() as sess:
    max_steps = steps_per_epoch * MAX_EPOCHS
    start = time.time()

    for step in range(max_steps):
        def should(freq):
            ''' Returns true if correct frequency interval. '''
            return freq > 0 and ((step + 1) % freq == 0 or step == max_steps - 1)

        fetches = {
            "train": train,
            "global_step": sv.global_step,
        }

        if should(OUTPUT_FREQ):
            fetches["discrim_loss"] = discrim_loss
            fetches["gen_loss_GAN"] = gen_loss_GAN
            fetches["gen_loss_L1"] = gen_loss_L1

        results = sess.run(fetches)

        if should(OUTPUT_FREQ):
            train_epoch = math.ceil(results["global_step"] / steps_per_epoch)
            train_step = (results["global_step"] - 1) % steps_per_epoch + 1
            print(f"progress  epoch {train_epoch}  step {train_step}")
            print("discrim_loss", results["discrim_loss"])
            print("gen_loss_GAN", results["gen_loss_GAN"])
            print("gen_loss_L1", results["gen_loss_L1"])

        if should(SAVE_FREQ):
            print("saving model")
            saver.save(sess, os.path.join('out', "model"), global_step=sv.global_step)

        if sv.should_stop():
            break

    # Save out images
    for step in range(tst_steps_per_epoch):
        results = sess.run(display_fetches)
        filesets = save_images(results)
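
The step that turns network output back into a viewable image is easy to get subtly wrong, so here is a minimal NumPy sketch of the intended mapping from $[-1, 1]$ to 8-bit pixel values. `to_uint8` is a hypothetical stand-in, and its rounding is my own choice and may differ slightly from what `tf.image.convert_image_dtype` does.

```python
import numpy as np

def to_uint8(image):
    """Map network-range pixels in [-1, 1] to displayable values in [0, 255]."""
    unit = (image + 1) / 2  # [-1, 1] -> [0, 1]
    return np.clip(np.rint(unit * 255), 0, 255).astype(np.uint8)

pixels = np.array([-1.0, 1.0, 1.5])
print(to_uint8(pixels))  # [  0 255 255]
```

The clip plays the role of `saturate=True`: values outside the expected range are pinned to 0 or 255 rather than wrapping around.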

## Results

Some results from the facades and maps datasets:

| Input | Generated | Target |
|-------|-----------|--------|
| <img src='imgs/ipynb/pix2pix/results/6-inputs.png'> | <img src='imgs/ipynb/pix2pix/results/6-outputs.png'> | <img src='imgs/ipynb/pix2pix/results/6-targets.png'> |
| <img src='imgs/ipynb/pix2pix/results/9-inputs.png'> | <img src='imgs/ipynb/pix2pix/results/9-outputs.png'> | <img src='imgs/ipynb/pix2pix/results/9-targets.png'> |