<h1><b>BicycleGAN</b><i> (Implementation in PyTorch)</i></h1>

> <h2><b>Multimodal Image-to-Image Translation</b></h2><br>

![BicycleGAN](img/bicyclegan.png)
<h2>Introduction</h2>
<h6>
Deep learning techniques have made rapid progress in conditional image generation. However, most techniques in this space have focused on generating a single result. Our aim is to generate a distribution of output images given an input image.<br><br>
Mapping from a high-dimensional input to a high-dimensional output distribution is challenging. A common approach to representing multimodality is learning a low-dimensional latent code, which should represent aspects of the possible outputs not contained in the input image.  At inference time,
a deterministic generator uses the input image, along with stochastically sampled latent codes, to produce randomly sampled outputs.
</h6>

---

<br>
<h2>Why BicycleGAN?</h2>
A common problem in existing methods is mode collapse, where the generator covers only a few of the modes of the real data distribution.

>  **Mode Collapse** <br>
Real-life data distributions are multimodal. For example, the **MNIST** dataset has 10 major modes, the digits 0 to 9. Under mode collapse, the generator produces only a few of these modes.
Put simply, it is a lack of variety. Complete collapse is rare, whereas partial collapse is common. The figure below illustrates the difference.

![Mode Collapse](img/mode_collapse.png)<br>
The top row covers all 10 modes of MNIST, whereas the bottom row produces only a single mode (the digit '6').<br><br>
BicycleGAN encourages a bijection between the output and the latent space.<br>
In addition to the direct task of mapping the latent code (along with the input) to the output, an encoder from the output back to the latent space is learned jointly. This discourages two different latent codes from generating the same output (a non-injective mapping), which helps prevent **mode collapse**.

---
<br>
<h2>What does BicycleGAN do?</h2>
<h6>
The goal is to learn a multi-modal mapping between two image domains, for example, edges and photographs, or night and day images.
Consider the input domain  <b>A ⊂ R$^{H×W×3}$</b>, which is to be mapped to an output domain <b>B ⊂ R$^{H×W×3}$</b> (for example, A could be edge maps and B the photographs corresponding to those edges).<br><br>
We are given a dataset of paired instances from these domains, (A∈A, B∈B), which is representative of a joint distribution p(A, B). It is important to note that there could be multiple plausible paired instances B that would correspond to
an input instance A, but the training dataset usually contains only one such pair. However, given a new instance A at test time, our model should be able to generate a diverse set of outputs $\hat{B}$, corresponding to different modes in the distribution p(B|A).<br><br>
We would like to learn a mapping that can sample the output $\hat{B}$  
from the true conditional distribution given A, and produce results that are both diverse and realistic.<br><br>
To do so, we learn a low-dimensional latent space z ∈ R$^{Z}$, which encapsulates the ambiguous aspects of the
output mode that are not present in the input image. For example, a sketch of a shoe could map to a variety of colors and textures, which could get compressed in this latent code. We then learn a deterministic mapping G : (A, z) → B to the output. To enable stochastic sampling, we desire the latent code vector z to be drawn from some prior distribution p(z); we use a standard Gaussian distribution N(0, I) in this work.
</h6>

---

<br>
<h3>Our model consists of 2 parts</h3>

>  Conditional Variational Autoencoder GAN: cVAE-GAN (1st part of model)
 (**B** → **z** → $\hat{B}$)

<br>

![cVAE-GAN](img/cvae.png)
<br>


*  The ground truth B is encoded directly into the latent code z by an encoder E.
*  The generator G then uses both the latent code and the input image A to synthesize the desired output $\hat{B}$.
*  The overall model can be understood as a reconstruction of B, with the latent encoding z concatenated with the paired A in the middle, similar to an autoencoder.
* The distribution Q(z|B) of the latent code z (the output of the encoder E) is modeled as a Gaussian, $Q(\mathrm{z}|\mathrm{B})=E(\mathrm{B})$.<br><br>


---


> <h4><b>cVAE-GAN objective, a conditional version of the VAE-GAN</b></h4><br>
$$G^{*},\displaystyle \ E^{*}=\arg\min_{G,E}\max_{D}\ \mathcal{L}_{\mathrm{G}\mathrm{A}\mathrm{N}}^{\mathrm{V}\mathrm{A}\mathrm{E}}(G,\ D,\ E)+\lambda \mathcal{L}_{1}^{\mathrm{V}\mathrm{A}\mathrm{E}}(G,\ E)+\lambda_{\mathrm{K}\mathrm{L}}\mathcal{L}_{\mathrm{K}\mathrm{L}}(E)$$

<h3>where</h3> $$\mathcal{L}_{\mathrm{G}\mathrm{A}\mathrm{N}}^{\mathrm{V}\mathrm{A}\mathrm{E}}=\mathrm{E}_{\mathrm{A},\mathrm{B}\sim p(\mathrm{A},\mathrm{B})}[\log(D(\mathrm{A},\ \mathrm{B}))]+\mathrm{E}_{\mathrm{A},\mathrm{B}\sim p(\mathrm{A},\mathrm{B}),\mathrm{z}\sim E(\mathrm{B})}[\log(1-D(\mathrm{A},\ G(\mathrm{A},\ \mathrm{z})))]$$

* <h5>This is the standard GAN loss, in which the generator and the discriminator play a min-max game: the generator tries to fool the discriminator, while the discriminator tries to distinguish generated images from real ones.</h5>
<br><br>

$$\mathcal{L}_{1}^{\mathrm{V}\mathrm{A}\mathrm{E}}(G)= \mathrm{E}_{\mathrm{A},\mathrm{B}\sim p(\mathrm{A},\mathrm{B}),\mathrm{z}\sim E(\mathrm{B})}||\mathrm{B}-G(\mathrm{A},\ \mathrm{z})||_{1}$$
* <h5>To encourage the output of the generator to match the input and to stabilize training, we use an $\ell_{1}$ loss between the output and the ground-truth image.</h5>
<br><br><br>

$$\mathcal{L}_{\mathrm{K}\mathrm{L}}(E)=\mathrm{E}_{\mathrm{B}\sim p(\mathrm{B})}[\mathcal{D}_{\mathrm{K}\mathrm{L}}(E(\mathrm{B})||\mathcal{N}(0,\ I))]$$
* <h6>The latent distribution encoded by $E(\mathrm{B})$ is encouraged to be close to a standard Gaussian to enable sampling at inference time, when $\mathrm{B}$ is not known. </h6><b>Here</b>  $$\displaystyle \mathcal{D}_{\mathrm{K}\mathrm{L}}(p||q)=\int p(z)\log\frac{p(z)}{q(z)}dz$$
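
When the encoder outputs a diagonal Gaussian, the KL term against $\mathcal{N}(0, I)$ has a closed form, so no integral has to be evaluated. The sketch below is a minimal pure-Python illustration, assuming the encoder produces a mean vector `mu` and a log-variance vector `log_var`; the function name `kl_to_standard_normal` is made up for this example.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a single latent vector.

    Closed form: 0.5 * sum(mu^2 + var - log(var) - 1), summed over dimensions.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

# A code that already matches the prior has zero divergence.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# Shifting the mean or shrinking the variance increases the divergence.
print(kl_to_standard_normal([1.0, 0.0], [0.0, -2.0]))
```

The divergence is zero only when `mu` is all zeros and `log_var` is all zeros, which is what pulls the encoded distribution toward the prior used for sampling at test time.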
 

---


<br><br>
>Consider the deterministic version of this approach, i.e., dropping the KL-divergence term and encoding z = E(B). It is called cAE-GAN.
There is no guarantee in cAE-GAN on the distribution of the latent space z, which makes the test-time
sampling of z difficult.

<br><br>

---

>  Conditional Latent Regressor GAN: cLR-GAN (${z}$ → $\hat{B}$ → $\hat{z}$)
<br>(2nd part of the model)

<br>

![cLR-GAN](img/clr.jpg)
<br>

* A randomly drawn latent code z is recovered with $\hat{\mathrm{z}}=E(G(\mathrm{A},\ \mathrm{z}))$
*  Encoder E here is producing a point estimate for $\hat{\mathrm{z}}$, whereas the encoder in the previous section was predicting a Gaussian distribution.
<br><br>
> <h4><b>cLR-GAN objective function</b></h4>
<br>
$G^{*},\displaystyle \ E^{*}=\arg\min_{G,E}\max_{D}\ \mathcal{L}_{\mathrm{G}\mathrm{A}\mathrm{N}}(G,\ D)+\lambda_{\mathrm{l}\mathrm{a}\mathrm{t}\mathrm{e}\mathrm{n}\mathrm{t}}\mathcal{L}_{1}^{\mathrm{l}\mathrm{a}\mathrm{t}\mathrm{e}\mathrm{n}\mathrm{t}}(G,\ E)$

<br>
<h3>where</h3>
$$\mathcal{L}_{1}^{\mathrm{l}\mathrm{a}\mathrm{t}\mathrm{e}\mathrm{n}\mathrm{t}}(G,\ E)=\mathrm{E}_{\mathrm{A}\sim p(\mathrm{A}),\mathrm{z}\sim p(\mathrm{z})}||\mathrm{z}-E(G(\mathrm{A},\ \mathrm{z}))||_{1}$$

* $\hat{\mathrm{z}}=E(G(\mathrm{A},\ \mathrm{z}))$ is encouraged to be close to the randomly drawn $\mathrm{z}$ to enable bijective mapping.
<br><br>
<h6>
* The discriminator loss $\mathcal{L}_{\mathrm{G}\mathrm{A}\mathrm{N}}(G,\ D)$  on $\hat{\mathrm{B}}$ is used to encourage the network to generate realistic results.
</h6>

---

<br><br>
> <h2>Hybrid Model: BicycleGAN</h2>

Combine the cVAE-GAN and cLR-GAN objectives in the hybrid model.<br><br>
Training is done in both directions, aiming to take advantage of both cycles<br> ($\mathrm{B}\rightarrow \mathrm{z}\rightarrow\hat{\mathrm{B}}$ and $\mathrm{z}\rightarrow\hat{\mathrm{B}}\rightarrow\hat{\mathrm{z}}$), hence the name BicycleGAN.<br><br>

> <h4>Combined Objective</h4>

<br>
$$
G^{*},\ E^{*}=\arg\min_{G,E}\max_{D}\ \mathcal{L}_{\mathrm{G}\mathrm{A}\mathrm{N}}^{\mathrm{V}\mathrm{A}\mathrm{E}}(G,\ D,\ E)+\lambda \mathcal{L}_{1}^{\mathrm{V}\mathrm{A}\mathrm{E}}(G,\ E)
$$
$$
+\mathcal{L}_{\mathrm{G}\mathrm{A}\mathrm{N}}(G,\ D)+\lambda_{\mathrm{l}\mathrm{a}\mathrm{t}\mathrm{e}\mathrm{n}\mathrm{t}}\mathcal{L}_{1}^{\mathrm{l}\mathrm{a}\mathrm{t}\mathrm{e}\mathrm{n}\mathrm{t}}(G,\ E)+\lambda_{\mathrm{K}\mathrm{L}}\mathcal{L}_{\mathrm{K}\mathrm{L}}(E)\ ,
$$
<h5>
where the hyper-parameters $\lambda, \lambda_{\mathrm{l}\mathrm{a}\mathrm{t}\mathrm{e}\mathrm{n}\mathrm{t}}$, and $\lambda_{\mathrm{K}\mathrm{L}}$ control the relative importance of each term.</h5>
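
As a rough illustration of how the weights enter, the combined objective seen by G and E is a weighted sum of five scalar terms. The sketch below is only that sum in plain Python; the function name `bicycle_objective` and its argument names are made up for this example, and the default weights are taken from the `Solver` defaults further down in this notebook (`lambda_img=10`, `lambda_z=0.5`, `lambda_kl=0.01`).

```python
def bicycle_objective(gan_vae, gan_lr, l1_image, l1_latent, kl,
                      lambda_img=10.0, lambda_latent=0.5, lambda_kl=0.01):
    """Weighted sum of the five loss terms of the combined objective.

    gan_vae   : adversarial term of the cVAE-GAN branch
    gan_lr    : adversarial term of the cLR-GAN branch
    l1_image  : image reconstruction term ||B - G(A, z)||_1
    l1_latent : latent reconstruction term ||z - E(G(A, z))||_1
    kl        : KL divergence between E(B) and N(0, I)
    """
    return (gan_vae + lambda_img * l1_image
            + gan_lr + lambda_latent * l1_latent
            + lambda_kl * kl)

# With every raw term equal to 1, the image reconstruction term dominates.
print(bicycle_objective(1.0, 1.0, 1.0, 1.0, 1.0))
```

With these weights the image reconstruction term carries the most influence and the KL term the least, so the raw magnitudes of the individual losses matter when comparing runs.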


---

<br><br>
> <h2>Implementation details</h2>

<h4>Network architecture</h4>

* For generator G, U-Net is used, which contains an encoder-decoder
architecture, with symmetric skip connections. The architecture has been shown to produce strong results in the unimodal image prediction setting when there is a spatial correspondence between input and output pairs.<br><br>
![U-Net](img/unet.png)<br><br>
* For discriminator D, generally two PatchGAN discriminators at different
scales are used, which aim to predict real vs. fake overlapping image patches.
<br><br>
* For the encoder E, these two networks are preferred: <br>
(1) $E_{CNN}$: a CNN with a few convolutional and downsampling layers.<br>
(2) $E_{Resnet}$: a classifier with several residual blocks.



---

<br><br>
> <h2>Implementation in PyTorch</h2>

<br>
<h4>Dataset : Edges2Shoes</h4>

<br>

![Edges2Shoes](img/edgesToShoes.png)
<br>

<h5>Let us first import the required libraries</h5>


In [None]:
import os
import argparse
import numpy as np
from  PIL import Image

import torch
import torchvision
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as Transforms
from  torch.utils.data import Dataset
from  torch.autograd import Variable

<h4>Dataloader</h4>

In [None]:
class Edges2Shoes(Dataset):
    def __init__(self, root, transform, mode='train'):
        self.root = root
        self.transform = transform
        self.mode = mode
        
        data_dir = os.path.join(root, mode)
        self.file_list = os.listdir(data_dir)
        
    def __len__(self):
        return len(self.file_list)
        
    def __getitem__(self, idx):
        img_path = os.path.join(self.root, self.mode, self.file_list[idx])
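        # Force 3 channels so the 3-channel Normalize in the transform always applies
        img = Image.open(img_path).convert('RGB')
        W, H = img.size

        # Each file holds the pair side by side: left half is the input, right half is the ground truth
        data = img.crop((0, 0, W // 2, H))
        ground_truth = img.crop((W // 2, 0, W, H))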
        
        data = self.transform(data)
        ground_truth = self.transform(ground_truth)
        
        return (data, ground_truth)

def data_loader(root, batch_size=1, shuffle=True, img_size=128, mode='train'):    
    transform = Transforms.Compose([Transforms.Resize((img_size, img_size)),
                                    Transforms.ToTensor(),
                                    Transforms.Normalize(mean=(0.5, 0.5, 0.5),
                                                         std=(0.5, 0.5, 0.5))
                                   ])
    
    dset = Edges2Shoes(root, transform, mode=mode)
    
    if batch_size == 'all':
        batch_size = len(dset)
        
    dloader = torch.utils.data.DataLoader(dset,
                                          batch_size=batch_size,
                                          shuffle=shuffle,
                                          num_workers=0,
                                          drop_last=True)
    dlen = len(dset)
    
    return dloader, dlen


> <h3>ConvBlock</h3>

<br>
<h5>Small unit block consists of  (convolution layer - normalization layer - non linearity layer)</h5>
<br>
<i>Parameters</i><h5>

    1. in_dim : Input dimension(channels number)
    2. out_dim : Output dimension(channels number)
    3. k : Kernel size(filter size)
    4. s : stride
    5. p : padding size
    6. norm : If it is true add Instance Normalization layer, otherwise skip this layer
    7. non_linear : You can choose between 'leaky_relu', 'relu', None
</h5>

    

In [None]:
class ConvBlock(nn.Module):
    def __init__(self, in_dim, out_dim, k=4, s=2, p=1, norm=True, non_linear='leaky_relu'):
        super(ConvBlock, self).__init__()
        layers = []
        
        # Convolution Layer
        layers += [nn.Conv2d(in_dim, out_dim, kernel_size=k, stride=s, padding=p)]
        
        # Normalization Layer
        if norm is True:
            layers += [nn.InstanceNorm2d(out_dim, affine=True)]
            
        # Non-linearity Layer
        if non_linear == 'leaky_relu':
            layers += [nn.LeakyReLU(negative_slope=0.2, inplace=True)]
        elif non_linear == 'relu':
            layers += [nn.ReLU(inplace=True)]
        
        self.conv_block = nn.Sequential(* layers)
        
    def forward(self, x):
        out = self.conv_block(x)
        return out
    

<h3>DeconvBlock</h3>
<h5>Small unit block consists of (transpose conv layer - normalization layer - non linearity layer)
    
<i>Parameters</i>

    1. in_dim : Input dimension(channels number)
    2. out_dim : Output dimension(channels number)
    3. k : Kernel size(filter size)
    4. s : stride
    5. p : padding size
    6. norm : If it is true add Instance Normalization layer, otherwise skip this layer
    7. non_linear : You can choose between 'relu', 'tanh', None
</h5>

In [None]:
class DeconvBlock(nn.Module):
    def __init__(self, in_dim, out_dim, k=4, s=2, p=1, norm=True, non_linear='relu'):
        super(DeconvBlock, self).__init__()
        layers = []
        
        # Transpose Convolution Layer
        layers += [nn.ConvTranspose2d(in_dim, out_dim, kernel_size=k, stride=s, padding=p)]
        
        # Normalization Layer
        if norm is True:
            layers += [nn.InstanceNorm2d(out_dim, affine=True)]
        
        # Non-Linearity Layer
        if non_linear == 'relu':
            layers += [nn.ReLU(inplace=True)]
        elif non_linear == 'tanh':
            layers += [nn.Tanh()]
            
        self.deconv_block = nn.Sequential(* layers)
            
    def forward(self, x):
        out = self.deconv_block(x)
        return out

<h2>Generator</h2>

> U-Net Generator 

<br><h5>
Downsampled and upsampled activation volumes with the same width and height are paired, and each pair is concatenated during upsampling.<br></h5>

    Pairs : (up_1, down_6)
            (up_2, down_5)  
            (up_3, down_4) 
            (up_4, down_3) 
            (up_5, down_2) 
            (up_6, down_1)
            down_7 doesn't have a partner.
<br><h5>
ex) up_1 and down_6 both have size (N, 512, 2, 2) when the input size is (N, 3, 128, 128). Before upsample_2, up_1 and down_6 are concatenated into (N, 1024, 2, 2), and upsample_2 then produces (N, 512, 4, 4). That is why upsample_2 has 1024 input channels and 512 output channels.
All upsampling blocks except upsample_1 follow the same pattern.</h5>
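
The pairing can be checked with a little arithmetic before building any layers. The sketch below is a pure-Python bookkeeping helper (the name `unet_shapes` is made up for this example) that assumes the channel widths used in the `Generator` class and that every block with k=4, s=2, p=1 halves or doubles the spatial size.

```python
def unet_shapes(img_size=128, z_dim=8):
    """Track (channels, H, W) through the U-Net and the input channels of each block."""
    down_channels = [64, 128, 256, 512, 512, 512, 512]
    up_channels = [512, 512, 512, 256, 128, 64, 3]

    shapes = {}
    # The image (3 channels) is concatenated with the tiled latent code (z_dim channels).
    in_channels = {'downsample_1': 3 + z_dim}

    size = img_size
    for i, ch in enumerate(down_channels, start=1):
        size //= 2  # k=4, s=2, p=1 halves H and W
        shapes['down_%d' % i] = (ch, size, size)

    prev_ch = down_channels[-1]
    for i, ch in enumerate(up_channels, start=1):
        # upsample_1 takes down_7 alone; upsample_i (i > 1) takes [up_(i-1), down_(8-i)]
        skip_ch = 0 if i == 1 else shapes['down_%d' % (8 - i)][0]
        in_channels['upsample_%d' % i] = prev_ch + skip_ch
        size *= 2  # the transposed convolution doubles H and W
        shapes['up_%d' % i] = (ch, size, size)
        prev_ch = ch

    return shapes, in_channels

shapes, in_channels = unet_shapes()
print(shapes['up_1'], shapes['down_6'], in_channels['upsample_2'])
```

Each skip connection adds the channels of the matching downsampled volume, which is why the input width of an upsampling block is larger than the output width of the block before it.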

In [None]:
class Generator(nn.Module):
    def __init__(self, z_dim=8): 
        super(Generator, self).__init__()
        # Reduce H and W by half at every downsampling
        self.downsample_1 = ConvBlock(3 + z_dim, 64, k=4, s=2, p=1, norm=False, non_linear='leaky_relu')
        self.downsample_2 = ConvBlock(64, 128, k=4, s=2, p=1, norm=True, non_linear='leaky_relu')
        self.downsample_3 = ConvBlock(128, 256, k=4, s=2, p=1, norm=True, non_linear='leaky_relu')
        self.downsample_4 = ConvBlock(256, 512, k=4, s=2, p=1, norm=True, non_linear='leaky_relu')
        self.downsample_5 = ConvBlock(512, 512, k=4, s=2, p=1, norm=True, non_linear='leaky_relu')
        self.downsample_6 = ConvBlock(512, 512, k=4, s=2, p=1, norm=True, non_linear='leaky_relu')
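        # No normalization on the 1x1 bottleneck: InstanceNorm2d needs more than one spatial element per channel
        self.downsample_7 = ConvBlock(512, 512, k=4, s=2, p=1, norm=False, non_linear='leaky_relu')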
        
        # Need concatenation when upsampling, see forward function for details
        self.upsample_1 = DeconvBlock(512, 512, k=4, s=2, p=1, norm=True, non_linear='relu')
        self.upsample_2 = DeconvBlock(1024, 512, k=4, s=2, p=1, norm=True, non_linear='relu')
        self.upsample_3 = DeconvBlock(1024, 512, k=4, s=2, p=1, norm=True, non_linear='relu')
        self.upsample_4 = DeconvBlock(1024, 256, k=4, s=2, p=1, norm=True, non_linear='relu')
        self.upsample_5 = DeconvBlock(512, 128, k=4, s=2, p=1, norm=True, non_linear='relu')
        self.upsample_6 = DeconvBlock(256, 64, k=4, s=2, p=1, norm=True, non_linear='relu')
        self.upsample_7 = DeconvBlock(128, 3, k=4, s=2, p=1, norm=False, non_linear='tanh')
    
    def forward(self, x, z):
        # z : (N, z_dim) -> (N, z_dim, 1, 1) -> (N, z_dim, H, W)
        # x_with_z : (N, 3 + z_dim, H, W)
        z = z.unsqueeze(dim=2).unsqueeze(dim=3)
        z = z.expand(z.size(0), z.size(1), x.size(2), x.size(3))
        x_with_z = torch.cat([x, z], dim=1)
        
        down_1 = self.downsample_1(x_with_z)
        down_2 = self.downsample_2(down_1)
        down_3 = self.downsample_3(down_2)
        down_4 = self.downsample_4(down_3)
        down_5 = self.downsample_5(down_4)
        down_6 = self.downsample_6(down_5)
        down_7 = self.downsample_7(down_6)

        up_1 = self.upsample_1(down_7)
        up_2 = self.upsample_2(torch.cat([up_1, down_6], dim=1))
        up_3 = self.upsample_3(torch.cat([up_2, down_5], dim=1))
        up_4 = self.upsample_4(torch.cat([up_3, down_4], dim=1))
        up_5 = self.upsample_5(torch.cat([up_4, down_3], dim=1))
        up_6 = self.upsample_6(torch.cat([up_5, down_2], dim=1))
        out = self.upsample_7(torch.cat([up_6, down_1], dim=1))
        
        return out 
    

<h2> Discriminator </h2>

> PatchGAN discriminator <b>:</b> 

<br><h5>
  It uses two discriminators with different output sizes (different local probabilities).

    d_1 : (N, 3, 128, 128) -> (N, 1, 13, 13)
    d_2 : (N, 3, 128, 128) -> (N, 1, 30, 30)

In training, the generator needs to fool both d_1 and d_2, which makes the generator more robust.</h5>
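
The patch-grid size of each discriminator follows from the kernel size, stride and padding of its layers, so it can be verified without running the network. The sketch below is a small pure-Python check; `out_size` and `stack_out_size` are helper names made up for this example, and the (kernel, stride, padding) triples are copied from the `Discriminator` class in the next cell.

```python
def out_size(size, k, s, p):
    """Output length of a conv or pooling layer (dilation 1, floor rounding)."""
    return (size + 2 * p - k) // s + 1

def stack_out_size(size, layers):
    """Apply out_size for each (kernel, stride, padding) triple in order."""
    for k, s, p in layers:
        size = out_size(size, k, s, p)
    return size

# d_1 starts with AvgPool2d(kernel_size=3, stride=2, padding=0), then four ConvBlocks
d_1_layers = [(3, 2, 0), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]
# d_2 is four ConvBlocks applied to the full-resolution image
d_2_layers = [(4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]

print(stack_out_size(128, d_1_layers), stack_out_size(128, d_2_layers))
```

Because d_1 first average-pools the image, each of its output cells covers a larger region of the input than a cell of d_2, which is how the two discriminators judge realism at different scales.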

In [None]:
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()       
        # Discriminator with last patch (13x13)
        # (N, 3, 128, 128) -> (N, 1, 13, 13)
        self.d_1 = nn.Sequential(nn.AvgPool2d(kernel_size=3, stride=2, padding=0, count_include_pad=False),
                                 ConvBlock(3, 32, k=4, s=2, p=1, norm=False, non_linear='leaky_relu'),
                                 ConvBlock(32, 64, k=4, s=2, p=1, norm=True, non_linear='leaky_relu'),
                                 ConvBlock(64, 128, k=4, s=1, p=1, norm=True, non_linear='leaky_relu'),
                                 ConvBlock(128, 1, k=4, s=1, p=1, norm=False, non_linear=None))
        
        # Discriminator with last patch (30x30)
        # (N, 3, 128, 128) -> (N, 1, 30, 30)
        self.d_2 = nn.Sequential(ConvBlock(3, 64, k=4, s=2, p=1, norm=False, non_linear='leaky_relu'),
                                 ConvBlock(64, 128, k=4, s=2, p=1, norm=True, non_linear='leaky_relu'),
                                 ConvBlock(128, 256, k=4, s=1, p=1, norm=True, non_linear='leaky_relu'),
                                 ConvBlock(256, 1, k=4, s=1, p=1, norm=False, non_linear=None))
    
    def forward(self, x):
        out_1 = self.d_1(x)
        out_2 = self.d_2(x)
        return (out_1, out_2)

<h2>ResBlock</h2>
    
This residual block differs from the commonly used one, which consists of<br>[conv - norm - act - conv - norm] with an identity mapping (x -> x) as the shortcut. Here the main branch is [norm - act - conv - norm - act - conv - avgpool] and the shortcut is [avgpool - 1x1 conv], so the spatial size is halved by AvgPool2d.

In [None]:
class ResBlock(nn.Module):
    def __init__(self, in_dim, out_dim):
        super(ResBlock, self).__init__()
        self.conv = nn.Sequential(nn.InstanceNorm2d(in_dim, affine=True),
                                  nn.LeakyReLU(negative_slope=0.2, inplace=True),
                                  nn.Conv2d(in_dim, in_dim, kernel_size=3, stride=1, padding=1),
                                  nn.InstanceNorm2d(in_dim, affine=True),
                                  nn.LeakyReLU(negative_slope=0.2, inplace=True),
                                  nn.Conv2d(in_dim, out_dim, kernel_size=3, stride=1, padding=1),
                                  nn.AvgPool2d(kernel_size=2, stride=2, padding=0))
        
        self.short_cut = nn.Sequential(nn.AvgPool2d(kernel_size=2, stride=2, padding=0),
                                       nn.Conv2d(in_dim, out_dim, kernel_size=1, stride=1, padding=0))
        
    def forward(self, x):
        out = self.conv(x) + self.short_cut(x)
        return out

<h2>Encoder</h2><h5>
The output is mu and log(var) for the reparameterization trick used in the Variational Autoencoder.<br>Encoding is done in this order.

    1. Use this encoder and get mu and log_var
    2. std = exp(log_var / 2)
    3. random_z ~ N(0, I)
    4. encoded_z = random_z * std + mu (Reparameterization trick)
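
The four steps above can be sketched in plain Python. This is a minimal illustration on lists of floats rather than tensors; the function name `reparameterize` and the optional `eps` argument (used to pass a fixed noise vector) are assumptions made for this example.

```python
import math
import random

def reparameterize(mu, log_var, eps=None, rng=random):
    """Return z = mu + eps * std, with std = exp(log_var / 2) and eps ~ N(0, 1)."""
    if eps is None:
        # Step 3: draw one standard-normal noise value per latent dimension
        eps = [rng.gauss(0.0, 1.0) for _ in mu]
    # Steps 2 and 4: scale the noise by std and shift it by mu
    return [m + e * math.exp(lv / 2.0) for m, lv, e in zip(mu, log_var, eps)]

# With fixed noise the result is deterministic: 1.0 + 2.0 * exp(0) = 3.0
print(reparameterize([1.0], [0.0], eps=[2.0]))
# Without eps, a fresh sample is drawn on every call
print(reparameterize([0.0] * 8, [0.0] * 8))
```

Keeping the randomness in `eps` means `mu` and `log_var` enter the result through ordinary arithmetic, which is what lets gradients flow back into the encoder during training.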

In [None]:
class Encoder(nn.Module):
    def __init__(self, z_dim=8):
        super(Encoder, self).__init__()
        
        self.conv = nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)
        self.res_blocks = nn.Sequential(ResBlock(64, 128),
                                        ResBlock(128, 192),
                                        ResBlock(192, 256))
        self.pool_block = nn.Sequential(nn.LeakyReLU(negative_slope=0.2, inplace=True),
                                        nn.AvgPool2d(kernel_size=8, stride=8, padding=0))
        
        # Return mu and logvar for reparameterization trick
        self.fc_mu = nn.Linear(256, z_dim)
        self.fc_logvar = nn.Linear(256, z_dim)
        
    def forward(self, x):
        # (N, 3, 128, 128) -> (N, 64, 64, 64)
        out = self.conv(x)
        # (N, 64, 64, 64) -> (N, 128, 32, 32) -> (N, 192, 16, 16) -> (N, 256, 8, 8)
        out = self.res_blocks(out)
        # (N, 256, 8, 8) -> (N, 256, 1, 1)
        out = self.pool_block(out)
        # (N, 256, 1, 1) -> (N, 256)
        out = out.view(x.size(0), -1)
        
        # (N, 256) -> (N, z_dim) x 2
        mu = self.fc_mu(out)
        log_var = self.fc_logvar(out)
        
        return (mu, log_var)

<h3>Function var</h3>
<h4>Convert tensor to Variable</h4>


In [None]:
def var(tensor, requires_grad=True):
    if torch.cuda.is_available():
        dtype = torch.cuda.FloatTensor
    else:
        dtype = torch.FloatTensor
        
    var = Variable(tensor.type(dtype), requires_grad=requires_grad)
    
    return var

<h3>Function make_img </h3>

* <h4>Generate images</h4>

<h5>Parameters

    dloader : Data loader for test data set
    G : Generator
    z : random_z (size = (N, img_num, z_dim))
    N : Number of test images
    img_num : Number of images that you want to generate with one test img
    z_dim : Dimension of the latent code (8 by default)
</h5>


In [None]:
def make_img(dloader, G, z, img_num=5, img_size=128):
    if torch.cuda.is_available():
        dtype = torch.cuda.FloatTensor
    else:
        dtype = torch.FloatTensor
        
    dloader = iter(dloader)
    img, _ = next(dloader)

    N = img.size(0)    
    img = var(img.type(dtype))

    result_img = torch.FloatTensor(N * (img_num + 1), 3, img_size, img_size).type(dtype)

    for i in range(N):
        # original image to the leftmost
        result_img[i * (img_num + 1)] = img[i].data

        # Insert generated images to the next of the original image
        for j in range(img_num):
            img_ = img[i].unsqueeze(dim=0)
            z_ = z[i, j, :].unsqueeze(dim=0)
            
            out_img = G(img_, z_)
            result_img[i * (img_num + 1) + j + 1] = out_img.data


    # [-1, 1] -> [0, 1]
    result_img = result_img / 2 + 0.5
    
    return result_img


<h2>mse_loss</h2>

* Calculate mean squared error loss

<h5>Parameters
    
    score : Output of discriminator
    target : 1 for real and 0 for fake
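
Using a mean squared error against targets of 1 and 0 corresponds to the least-squares GAN (LSGAN) formulation, in place of the log loss written in the objectives above. The sketch below shows the two sides of that game in plain Python on lists of scores; the function names are made up for this example.

```python
def mean_squared_error(scores, target):
    """Mean of (score - target)^2 over all patch scores."""
    return sum((s - target) ** 2 for s in scores) / len(scores)

def lsgan_d_loss(real_scores, fake_scores):
    """Discriminator side: push real scores toward 1 and fake scores toward 0."""
    return mean_squared_error(real_scores, 1.0) + mean_squared_error(fake_scores, 0.0)

def lsgan_g_loss(fake_scores):
    """Generator side: push the scores of generated images toward 1."""
    return mean_squared_error(fake_scores, 1.0)

# A discriminator that separates real from fake perfectly has zero loss,
# while the generator loss is at its largest for the same scores.
print(lsgan_d_loss([1.0, 1.0], [0.0, 0.0]), lsgan_g_loss([0.0, 0.0]))
```

The same score list gives opposite signals to the two players, which is the min-max game expressed with squared errors.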

In [None]:
def mse_loss(score, target=1):
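    # LSGAN-style target: all ones for real (target=1), all zeros for fake (target=0)
    if target == 1:
        label = var(torch.ones(score.size()), requires_grad=False)
    elif target == 0:
        label = var(torch.zeros(score.size()), requires_grad=False)
    else:
        raise ValueError('target must be 0 or 1')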
    
    criterion = nn.MSELoss()
    loss = criterion(score, label)
    
    return loss

<h2>L1_loss</h2>

* Calculate L1 loss

<h5>Parameters

    pred : Output of network
    target : Ground truth
</h5>

In [None]:
def L1_loss(pred, target):
    return torch.mean(torch.abs(pred - target))

In [None]:
def lr_decay_rule(epoch, start_decay=100, lr_decay=100):
    decay_rate = 1.0 - (max(0, epoch - start_decay) / float(lr_decay))
    return decay_rate
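
The rule returns a multiplier for the learning rate: it stays at 1 until `start_decay`, then falls linearly to 0 over the next `lr_decay` epochs. The sketch below repeats the function so that it runs on its own and prints the resulting learning rate at a few epochs, assuming the base rate of 0.0002 from the `Solver` defaults. How the multiplier is attached to the optimizers is not shown in this part of the notebook; passing it as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR` would be one option, but that is an assumption.

```python
def lr_decay_rule(epoch, start_decay=100, lr_decay=100):
    # Multiplier is 1.0 up to start_decay, then decreases linearly to 0.0
    # over the following lr_decay epochs.
    return 1.0 - (max(0, epoch - start_decay) / float(lr_decay))

base_lr = 0.0002  # assumed: the default lr of the Solver class
for epoch in (0, 100, 150, 200):
    print(epoch, base_lr * lr_decay_rule(epoch))
```

Note that the multiplier becomes negative past `start_decay + lr_decay` epochs, so training longer than that would need the rule to be clamped at zero.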

<h3> Solver function</h3>

* <h5>Includes all the training details</h5>


In [None]:
class Solver():
    def __init__(self, root='data/edges2shoes', result_dir='result', weight_dir='weight', load_weight=False,
                 batch_size=2, test_size=20, test_img_num=5, img_size=128, num_epoch=100, save_every=1000,
                 lr=0.0002, beta_1=0.5, beta_2=0.999, lambda_kl=0.01, lambda_img=10, lambda_z=0.5, z_dim=8):
        
        # Data type(Can use GPU or not?)
        self.dtype = torch.cuda.FloatTensor
        if torch.cuda.is_available() is False:
            self.dtype = torch.FloatTensor
        
        # Data loader for training
        self.dloader, dlen = data_loader(root=root, batch_size=batch_size, shuffle=True, 
                                         img_size=img_size, mode='train')

        # Data loader for test
        self.t_dloader, _ = data_loader(root=root, batch_size=test_size, shuffle=False, 
                                        img_size=img_size, mode='val')

        # Both D_cVAE and D_cLR contain two discriminators with different output sizes.
        # In total, we have four discriminators.
        # The classes are defined in the cells above, so they are used directly.
        self.D_cVAE = Discriminator().type(self.dtype)
        self.D_cLR = Discriminator().type(self.dtype)
        self.G = Generator(z_dim=z_dim).type(self.dtype)
        self.E = Encoder(z_dim=z_dim).type(self.dtype)

        # Optimizers
        self.optim_D_cVAE = optim.Adam(self.D_cVAE.parameters(), lr=lr, betas=(beta_1, beta_2))
        self.optim_D_cLR = optim.Adam(self.D_cLR.parameters(), lr=lr, betas=(beta_1, beta_2))
        self.optim_G = optim.Adam(self.G.parameters(), lr=lr, betas=(beta_1, beta_2))
        self.optim_E = optim.Adam(self.E.parameters(), lr=lr, betas=(beta_1, beta_2))

        # fixed random_z for intermediate test
        self.fixed_z = var(torch.randn(test_size, test_img_num, z_dim))
        
        # Some hyperparameters
        self.z_dim = z_dim
        self.lambda_kl = lambda_kl
        self.lambda_img = lambda_img
        self.lambda_z = lambda_z

        # Extra things
        self.result_dir = result_dir
        self.weight_dir = weight_dir
        self.load_weight = load_weight
        self.test_img_num = test_img_num
        self.img_size = img_size
        self.start_epoch = 0
        self.num_epoch = num_epoch
        self.save_every = save_every
        
    '''
        < set_train_phase >
        Set training phase
    '''
    def set_train_phase(self):
        self.D_cVAE.train()
        self.D_cLR.train()
        self.G.train()
        self.E.train()
        
    '''
        < load_pretrained >
        If you want to continue to train, load pretrained weight
    '''
    def load_pretrained(self):
        self.D_cVAE.load_state_dict(torch.load(os.path.join(self.weight_dir, 'D_cVAE.pkl')))
        self.D_cLR.load_state_dict(torch.load(os.path.join(self.weight_dir, 'D_cLR.pkl')))
        self.G.load_state_dict(torch.load(os.path.join(self.weight_dir, 'G.pkl')))
        self.E.load_state_dict(torch.load(os.path.join(self.weight_dir, 'E.pkl')))
        
        # log.txt is expected to hold the epoch to resume from on its first line
        with open('log.txt', 'r') as log_file:
            self.start_epoch = int(log_file.readline())
        
    '''
        < save_weight >
        Save weight
    '''
    def save_weight(self, epoch=None):
        if epoch is None:
            d_cVAE_name = 'D_cVAE.pkl'
            d_cLR_name = 'D_cLR.pkl'
            g_name = 'G.pkl'
            e_name = 'E.pkl'
        else:
            d_cVAE_name = '{epochs}-{name}'.format(epochs=str(epoch), name='D_cVAE.pkl')
            d_cLR_name = '{epochs}-{name}'.format(epochs=str(epoch), name='D_cLR.pkl')
            g_name = '{epochs}-{name}'.format(epochs=str(epoch), name='G.pkl')
            e_name = '{epochs}-{name}'.format(epochs=str(epoch), name='E.pkl')
            
        torch.save(self.D_cVAE.state_dict(), os.path.join(self.weight_dir, d_cVAE_name))
        torch.save(self.D_cLR.state_dict(), os.path.join(self.weight_dir, d_cLR_name))
        torch.save(self.G.state_dict(), os.path.join(self.weight_dir, g_name))
        torch.save(self.E.state_dict(), os.path.join(self.weight_dir, e_name))
    
    '''
        < all_zero_grad >
        Set all optimizers' grad to zero 
    '''
    def all_zero_grad(self):
        self.optim_D_cVAE.zero_grad()
        self.optim_D_cLR.zero_grad()
        self.optim_G.zero_grad()
        self.optim_E.zero_grad()
        
    '''
        < train >
        Train the D_cVAE, D_cLR, G and E 
    '''
    def train(self):
        if self.load_weight is True:
            self.load_pretrained()
        
        self.set_train_phase()
        
        for epoch in range(self.start_epoch, self.num_epoch):
            for iters, (img, ground_truth) in enumerate(self.dloader):
                # img : (2, 3, 128, 128) of domain A / ground_truth : (2, 3, 128, 128) of domain B
                img, ground_truth = var(img), var(ground_truth)

                # Separate data for cVAE_GAN and cLR_GAN
                cVAE_data = {'img' : img[0].unsqueeze(dim=0), 'ground_truth' : ground_truth[0].unsqueeze(dim=0)}
                cLR_data = {'img' : img[1].unsqueeze(dim=0), 'ground_truth' : ground_truth[1].unsqueeze(dim=0)}

                ''' ----------------------------- 1. Train D ----------------------------- '''
                #############   Step 1. D loss in cVAE-GAN #############

                # Encoded latent vector
                mu, log_variance = self.E(cVAE_data['ground_truth'])
                std = torch.exp(log_variance / 2)
                random_z = var(torch.randn(1, self.z_dim))
                encoded_z = (random_z * std) + mu

                # Generate fake image
                fake_img_cVAE = self.G(cVAE_data['img'], encoded_z)

                # Get scores and loss
                real_d_cVAE_1, real_d_cVAE_2 = self.D_cVAE(cVAE_data['ground_truth'])
                fake_d_cVAE_1, fake_d_cVAE_2 = self.D_cVAE(fake_img_cVAE)
                
                # mse_loss for LSGAN
                D_loss_cVAE_1 = mse_loss(real_d_cVAE_1, 1) + mse_loss(fake_d_cVAE_1, 0)
                D_loss_cVAE_2 = mse_loss(real_d_cVAE_2, 1) + mse_loss(fake_d_cVAE_2, 0)
                
                #############   Step 2. D loss in cLR-GAN   #############

                # Random latent vector
                random_z = var(torch.randn(1, self.z_dim))

                # Generate fake image
                fake_img_cLR = self.G(cLR_data['img'], random_z)

                # Get scores and loss
                real_d_cLR_1, real_d_cLR_2 = self.D_cLR(cLR_data['ground_truth'])
                fake_d_cLR_1, fake_d_cLR_2 = self.D_cLR(fake_img_cLR)
                
                D_loss_cLR_1 = mse_loss(real_d_cLR_1, 1) + mse_loss(fake_d_cLR_1, 0)
                D_loss_cLR_2 = mse_loss(real_d_cLR_2, 1) + mse_loss(fake_d_cLR_2, 0)

                D_loss = D_loss_cVAE_1 + D_loss_cLR_1 + D_loss_cVAE_2 + D_loss_cLR_2

                # Update
                self.all_zero_grad()
                D_loss.backward()
                self.optim_D_cVAE.step()
                self.optim_D_cLR.step()

                ''' ----------------------------- 2. Train G & E ----------------------------- '''
                ############# Step 1. GAN loss to fool discriminator (cVAE_GAN and cLR_GAN) #############

                # Encoded latent vector
                mu, log_variance = self.E(cVAE_data['ground_truth'])
                std = torch.exp(log_variance / 2)
                random_z = util.var(torch.randn(1, self.z_dim))
                encoded_z = (random_z * std) + mu

                # Generate fake image and get adversarial loss
                fake_img_cVAE = self.G(cVAE_data['img'], encoded_z)
                fake_d_cVAE_1, fake_d_cVAE_2 = self.D_cVAE(fake_img_cVAE)

                GAN_loss_cVAE_1 = mse_loss(fake_d_cVAE_1, 1)
                GAN_loss_cVAE_2 = mse_loss(fake_d_cVAE_2, 1)

                # Random latent vector
                random_z = util.var(torch.randn(1, self.z_dim))

                # Generate fake image and get adversarial loss
                fake_img_cLR = self.G(cLR_data['img'], random_z)
                fake_d_cLR_1, fake_d_cLR_2 = self.D_cLR(fake_img_cLR)

                GAN_loss_cLR_1 = mse_loss(fake_d_cLR_1, 1)
                GAN_loss_cLR_2 = mse_loss(fake_d_cLR_2, 1)

                G_GAN_loss = GAN_loss_cVAE_1 + GAN_loss_cVAE_2 + GAN_loss_cLR_1 + GAN_loss_cLR_2

                ############# Step 2. KL-divergence with N(0, 1) (cVAE-GAN) #############
                
                KL_div = self.lambda_kl * torch.sum(0.5 * (mu ** 2 + torch.exp(log_variance) - log_variance - 1))

                ############# Step 3. Reconstruction of ground truth image (|G(A, z) - B|) (cVAE-GAN) #############
                img_recon_loss = self.lambda_img * L1_loss(fake_img_cVAE, cVAE_data['ground_truth'])

                EG_loss = G_GAN_loss + KL_div + img_recon_loss
                self.all_zero_grad()
                EG_loss.backward(retain_graph=True)
                self.optim_E.step()
                self.optim_G.step()

                ''' ----------------------------- 3. Train ONLY G ----------------------------- '''
                ############ Step 1. Reconstruction of random latent code (|E(G(A, z)) - z|) (cLR-GAN) ############
                
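                # This step should update ONLY G: E also receives gradients here,
                # but only optim_G.step() is called afterwards.
                # Regenerate the image first so the graph uses G's current weights;
                # recent PyTorch versions raise an in-place modification error when
                # backpropagating through a graph built before optim_G.step().
                fake_img_cLR = self.G(cLR_data['img'], random_z)
                mu_, log_variance_ = self.E(fake_img_cLR)
                z_recon_loss = L1_loss(mu_, random_z)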

                G_alone_loss = self.lambda_z * z_recon_loss

                self.all_zero_grad()
                G_alone_loss.backward()
                self.optim_G.step()

                # Record the current epoch; the context manager closes the file on every iteration
                with open('log.txt', 'w') as log_file:
                    log_file.write(str(epoch))
                
                # Print losses and save intermediate result image and weights
                # (.item() replaces the older .data[0] idiom, which fails on 0-dim tensors in recent PyTorch)
                if iters % self.save_every == 0:
                    print('[Epoch : %d / Iters : %d] => D_loss : %f / G_GAN_loss : %f / KL_div : %f / img_recon_loss : %f / z_recon_loss : %f'\
                          %(epoch, iters, D_loss.item(), G_GAN_loss.item(), KL_div.item(), img_recon_loss.item(), G_alone_loss.item()))

                    # Save intermediate result image
                    os.makedirs(self.result_dir, exist_ok=True)

                    result_img = util.make_img(self.t_dloader, self.G, self.fixed_z, 
                                               img_num=self.test_img_num, img_size=self.img_size)

                    img_name = '{epoch}_{iters}.png'.format(epoch=epoch, iters=iters)
                    img_path = os.path.join(self.result_dir, img_name)

                    torchvision.utils.save_image(result_img, img_path, nrow=self.test_img_num+1)

                    # Save intermediate weight
                    os.makedirs(self.weight_dir, exist_ok=True)
                    
                    self.save_weight()
                    
            # Save weight at the end of every epoch
            self.save_weight(epoch=epoch)
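
The KL term in Step 2 of the G & E update is the closed-form divergence between the encoder's diagonal Gaussian and the standard normal prior, and `encoded_z` is drawn with the reparameterization trick. The sketch below reproduces both computations in plain Python so the arithmetic can be checked by hand without PyTorch. The function names `kl_to_standard_normal` and `reparameterize` are illustrative and do not appear in the training code.

```python
import math
import random


def kl_to_standard_normal(mu, log_variance):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)) summed over latent dimensions.

    Mirrors the expression on the KL_div line (without the lambda_kl weight):
    0.5 * (mu^2 + exp(log_variance) - log_variance - 1).
    """
    return sum(
        0.5 * (m ** 2 + math.exp(lv) - lv - 1.0)
        for m, lv in zip(mu, log_variance)
    )


def reparameterize(mu, log_variance, rng=random):
    """Sample z = mu + eps * std with eps ~ N(0, 1) and std = exp(log_variance / 2)."""
    return [
        m + rng.gauss(0.0, 1.0) * math.exp(lv / 2.0)
        for m, lv in zip(mu, log_variance)
    ]


# A posterior equal to the prior has zero divergence.
kl_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting one mean by 1 (unit variance) costs 0.5 * 1^2 = 0.5.
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])

# With a very small variance the sample stays close to mu.
z = reparameterize([2.0, -1.0], [-40.0, -40.0], rng=random.Random(0))
```

The divergence grows as the encoder's output moves away from the prior; in the training loop this penalty is scaled by `lambda_kl` before being added to `EG_loss`.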

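The discriminator and generator steps both call `mse_loss` with a scalar target (1 or 0), following the least-squares GAN objective. The helper's definition is not shown in this section, so the following is a minimal sketch that assumes it averages the squared distance between each score and the target. The names `lsgan_mse`, `d_loss`, and `g_loss` are illustrative.

```python
def lsgan_mse(scores, target):
    """Mean squared distance between each discriminator score and a scalar target."""
    return sum((s - target) ** 2 for s in scores) / len(scores)


def d_loss(real_scores, fake_scores):
    """Discriminator objective: push real scores toward 1 and fake scores toward 0."""
    return lsgan_mse(real_scores, 1.0) + lsgan_mse(fake_scores, 0.0)


def g_loss(fake_scores):
    """Generator objective: push the scores of generated images toward 1."""
    return lsgan_mse(fake_scores, 1.0)


# A discriminator that separates real from fake perfectly has zero loss,
# while the generator then pays the full squared distance of 1.
perfect_d = d_loss(real_scores=[1.0, 1.0], fake_scores=[0.0, 0.0])
fooled_g = g_loss(fake_scores=[0.0, 0.0])

# An undecided discriminator (all scores 0.5) pays 0.25 on each term.
undecided_d = d_loss(real_scores=[0.5, 0.5], fake_scores=[0.5, 0.5])
```

In the training loop the same target convention is applied to both score outputs of `D_cVAE` and `D_cLR`, and the four resulting terms are summed into `D_loss`.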

<h3>Resources</h3>

* [Toward Multimodal Image-to-Image Translation](https://arxiv.org/abs/1711.11586)
* [eveningglow/BicycleGAN-pytorch](https://github.com/eveningglow/BicycleGAN-pytorch)