## Unsupervised representation learning with generative adversarial networks (GANs)

### Outline
* What is a GAN?
* Why study GANs?
* GANs vs Other generative models
* How do GANs work?
    * GAN framework
    * Training process
        * Cost functions
        * Minimax game
* Implementations
    * GAN implementation
    * DCGAN implementation
    * Cycle GAN implementation

$$ % Latex macros
\newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}}
\newcommand{\p}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\b}[1]{\boldsymbol{#1}}
\newcommand{\c}[1]{\mathcal{#1}}
$$

## How do GANs work?
### GAN Framework

A generative adversarial network consists of two models: the __generator__ and the __discriminator__. 
    
The discriminator is a binary classifier that assigns each input to one of two classes, _real_ or _fake_. Meanwhile, the generator is trained to fool the discriminator so that its outputs are classified as _real_ data.

The generator can be thought of as a forger who makes fake art, and the discriminator as a fraud detective who tries to tell genuine artwork from fakes. Just as the forger aims to make art that is indistinguishable from the real thing, the generator aims to produce samples that appear to be drawn from the same distribution as the actual data.

Hence we end up with two models competing against each other, with the generator learning to produce samples that the discriminator can no longer detect as fake. This competition drives the learning of both networks.

<img src="figures/GAN_framework.png" width="500">

Formally, the players in the game are represented as two functions: 
* Discriminator: $D(x, \b \theta^{(D)})$ where $\b x$ are observed variables
* Generator: $G(z, \b \theta^{(G)})$ where $\b z$ are latent variables

Here $\b \theta^{(i)}$ denotes the parameters (weights) of model $i$.

The discriminator $D(x, \b \theta^{(D)})$ and the generator $G(z, \b \theta^{(G)})$ optimize cost functions that depend on each other's parameters:

$J^{(D)}(\b \theta^{(D)}, \b \theta^{(G)})$ $\rightarrow$ The discriminator minimizes this cost function while changing only its parameters $\b \theta^{(D)}$

$J^{(G)}(\b \theta^{(D)}, \b \theta^{(G)})$ $\rightarrow$ The generator minimizes this cost function while changing only its parameters $\b \theta^{(G)}$

This problem is more naturally framed as a game than as an optimization problem, because each player's cost function depends on the other player's parameters, yet each player controls only its own. In game-theoretic terms, the solution is a Nash equilibrium: a pair $(\b \theta^{(D)}, \b \theta^{(G)})$ at which $J^{(D)}$ is at a (local) minimum with respect to $\b \theta^{(D)}$ and $J^{(G)}$ is at a (local) minimum with respect to $\b \theta^{(G)}$.

To train the networks, simultaneous stochastic gradient descent (SGD) steps are taken for the two players: one step for $D$ to reduce $J^{(D)}$ and one step for $G$ to reduce $J^{(G)}$.
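
To make the game view concrete, here is a minimal sketch on a toy problem that does not come from the text above: two scalar players with value function $V(x, y) = xy$, where the first player minimizes $J^{(1)} = xy$ by changing only $x$ and the second minimizes $J^{(2)} = -xy$ by changing only $y$. The function name `simultaneous_steps`, the starting point, and the learning rate are illustrative assumptions. The only Nash equilibrium of this game is $(0, 0)$.

```python
import math

def simultaneous_steps(x, y, lr=0.1, n_steps=100):
    """Simultaneous gradient descent for the toy zero-sum game V(x, y) = x * y.

    Player 1 controls only x and minimizes J1 = x * y.
    Player 2 controls only y and minimizes J2 = -x * y.
    """
    trajectory = [(x, y)]
    for _ in range(n_steps):
        grad_x = y    # dJ1/dx
        grad_y = -x   # dJ2/dy
        # Both players step at the same time, each using the other's old value
        x, y = x - lr * grad_x, y - lr * grad_y
        trajectory.append((x, y))
    return trajectory

trajectory = simultaneous_steps(1.0, 0.0)
start_radius = math.hypot(*trajectory[0])
end_radius = math.hypot(*trajectory[-1])
print(f"distance from (0, 0): start={start_radius:.3f}, end={end_radius:.3f}")
```

In this toy game each simultaneous step multiplies the distance from the equilibrium by $\sqrt{1 + \text{lr}^2}$, so the iterates circle around $(0, 0)$ and slowly drift away from it instead of settling there. This suggests why finding an equilibrium can be harder than minimizing a single cost function.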

### Cost functions

$\textbf {Discriminator cost function}$

For most implementations of GANs, the cost function used for the discriminator is the standard binary cross-entropy loss used for binary classifiers with sigmoid output activations. The only difference is that it is trained on both real data from the dataset (labeled as $1$) and fake data from the generator (labeled as $0$). 

$$J^{(D)}(\b \theta^{(D)}, \b \theta^{(G)})  = -\textbf{E}_{x\sim p_{data}(x)}[\log{D(x)}] - \textbf{E}_{z\sim p_{z}(z)}[\log{(1-D(G(z)))}]$$
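
As a small numerical illustration of this cost, the sketch below evaluates $J^{(D)}$ in plain Python, with the expectations replaced by mini-batch averages. The function name `discriminator_cost` and the example probabilities are assumptions made for illustration. The two terms correspond to the binary cross-entropy on a real batch with target $1$ and on a generated batch with target $0$.

```python
import math

def discriminator_cost(d_real, d_fake):
    """Mini-batch estimate of J^(D).

    d_real: discriminator outputs D(x) on real samples (label 1)
    d_fake: discriminator outputs D(G(z)) on generated samples (label 0)
    """
    real_term = -sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere
undecided = discriminator_cost([0.5, 0.5], [0.5, 0.5])
# A discriminator that is mostly right gets a lower cost
confident = discriminator_cost([0.9, 0.8], [0.1, 0.2])
print(undecided, confident)
```

An undecided discriminator pays $2\log 2$, and the cost falls towards zero as the discriminator becomes more accurate.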

Training the discriminator in this way uses supervised learning to estimate the density ratio at every point $x$:
$$\frac{p_{data}(x)}{p_{model}(x)}$$
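
One way to see where this ratio comes from: for a fixed generator, the discriminator that minimizes $J^{(D)}$ is $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}$, so the ratio can be read off as $D^*(x) / (1 - D^*(x))$. The sketch below checks this numerically in plain Python. The two 1-D Gaussians are illustrative stand-ins for $p_{data}$ and $p_{model}$, and all function names are assumptions.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def p_data(x):
    """Stand-in for the data density (assumption for illustration)."""
    return gaussian_pdf(x, mean=0.0, std=1.0)

def p_model(x):
    """Stand-in for the generator's density (assumption for illustration)."""
    return gaussian_pdf(x, mean=1.0, std=1.0)

def optimal_discriminator(x):
    """D*(x) = p_data(x) / (p_data(x) + p_model(x)) for a fixed generator."""
    return p_data(x) / (p_data(x) + p_model(x))

for x in (-1.0, 0.5, 2.0):
    d_star = optimal_discriminator(x)
    ratio_from_d = d_star / (1.0 - d_star)
    print(f"x={x:+.1f}  D*(x)={d_star:.3f}  ratio from D*={ratio_from_d:.3f}  "
          f"true ratio={p_data(x) / p_model(x):.3f}")
```

Where the two densities are equal (here at $x = 0.5$), the optimal discriminator outputs $0.5$ and the ratio is $1$.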

$\textbf {Generator cost function}$

To complete the specification of the game, the generator cost function must be defined. The simplest case is to consider a zero-sum game, i.e., the total cost for all players is always zero:

$$J^{(G)} = -J^{(D)}$$

Then we can define the game just by specifying the discriminator's pay-off, which is the negative of its cost:

$$V(\b \theta^{(D)}, \b \theta^{(G)})  = -J^{(D)}(\b \theta^{(D)}, \b \theta^{(G)}) = \textbf{E}_{x\sim p_{data}(x)}[\log{D(x)}] + \textbf{E}_{z\sim p_{z}(z)}[\log{(1-D(G(z)))}]$$

The discriminator maximizes $V$ while the generator minimizes it, so we end up with a $\textbf {minimax}$ game:

$$\b \theta^{(G)*} = \arg \min_{\b \theta^{(G)}}\max_{\b \theta^{(D)}}V(\b \theta^{(D)}, \b \theta^{(G)})$$

In practice, both players are implemented as neural networks, so that the cost functions are differentiable with respect to the parameters and can be optimized by gradient descent.


The following section describes the first implementation of a GAN using two multilayer perceptrons acting as the adversarial discriminator and generator networks.

## Vanilla Generative Adversarial Network (GAN)

## Deep convolutional GAN (DCGAN)

One of the most widely used types of GAN is the deep convolutional GAN (DCGAN), introduced by Radford et al. (2016). Its main feature is an all-convolutional architecture with batch normalization to stabilize training. The discriminator is a convolutional neural network (CNN) classifier, while the generator is a stack of transposed convolution blocks.

<table align="center">
<tr>
<td> <img src="figures/conv_anim.gif" style="height: 350px; width: 350px"/> </td>
<td> <img src= "figures/convT_anim.gif" style="height: 350px; width: 350px"/> </td>
</tr>
<tr>
<th style= "text-align:center"> Convolution operation </th >
<th style= "text-align:center"> Convolution transpose operation</th>  
</tr>
</table>
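
To make the two operations concrete, here is a minimal 1-D sketch in plain Python (single channel, no padding). The function names `conv1d` and `conv_transpose1d` are illustrative and are not library calls. As in deep learning libraries, "convolution" here means cross-correlation, without flipping the kernel.

```python
def conv1d(x, w, stride=1):
    """Strided 1-D convolution (cross-correlation), no padding."""
    k = len(w)
    n_out = (len(x) - k) // stride + 1
    return [sum(x[i * stride + j] * w[j] for j in range(k)) for i in range(n_out)]

def conv_transpose1d(x, w, stride=1):
    """Transposed 1-D convolution: each input value scatters a scaled copy of the kernel."""
    k = len(w)
    out = [0.0] * ((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        for j in range(k):
            out[i * stride + j] += value * w[j]
    return out

kernel = [1.0, 1.0, 1.0]
downsampled = conv1d([1.0, 2.0, 3.0, 4.0, 5.0], kernel, stride=2)
upsampled = conv_transpose1d([1.0, 2.0], kernel, stride=2)
print(downsampled)  # [6.0, 12.0]
print(upsampled)    # [1.0, 1.0, 3.0, 2.0, 2.0]
```

With a stride of 2, the convolution shrinks a length-5 input to length 2, while the transposed convolution maps a length-2 input back to length 5. The transposed convolution restores the shape, not the original values.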

The key features of a DCGAN are given by Radford et al. (2016) as follows:
<img src="figures/Architecture_guidelines.png" width="700">


Batch normalization is described by the following algorithm:
<img src="figures/batch_norm.png" width="300">
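
Since the algorithm is only given as a figure, here is a minimal plain-Python sketch of the standard batch normalization transform for a single feature over one mini-batch (training mode). The function name `batch_norm` and the default `eps` are assumptions; `gamma` and `beta` stand for the learned scale and shift.

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m   # biased mini-batch variance
    normalized = [(x - mean) / (var + eps) ** 0.5 for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]

activations = [2.0, 4.0, 6.0, 8.0]
out = batch_norm(activations)
out_mean = sum(out) / len(out)
out_var = sum((y - out_mean) ** 2 for y in out) / len(out)
print(out_mean, out_var)
```

With `gamma=1` and `beta=0`, the output has zero mean and a variance just below one (because of `eps`), regardless of the scale of the incoming activations.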

The architecture used for their generator network is shown below:
<img src="figures/Generator_network.png" width="700">
<table align="center">
<tr>
<th style= "text-align:center"> Generator network architecture for the DCGAN</th>
</tr>
</table>

### DCGAN PyTorch implementation

In our implementation of the DCGAN, we follow the architecture and training method of Radford et al. (2016). Shown below are code snippets for the PyTorch implementation.

In [12]:
from __future__ import division
from IPython.display import clear_output
import torch
from torch import nn
from torch.autograd import Variable
from torch.optim import Adam
from torchvision import transforms, datasets
import torchvision.utils as vutils

In [13]:
class Discriminator(torch.nn.Module):
    """
    This discriminator network is based on the original DCGAN paper by Radford et al.
    The discriminator is a CNN which takes as input a 3-channel (i.e. RGB) image of size 64x64
    and outputs a probability, p(real), that the image is from the real dataset.
    """
    def __init__(self):
        super(Discriminator, self).__init__()
        #Conv block 1
        self.Conv1 = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=128, kernel_size=4, stride=2, padding=1, bias=False),
            nn.LeakyReLU(0.2, inplace=True)
        )
        #Conv block 2
        self.Conv2 = nn.Sequential(
            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True)
        )
        #Conv block 3
        self.Conv3 = nn.Sequential(
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.2, inplace=True)
        )
        #Conv block 4
        self.Conv4 = nn.Sequential(
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=4,stride=2, padding=1, bias=False),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.2, inplace=True)
        )
        self.Out = nn.Sequential(
            nn.Linear(1024*4*4, 1),
            nn.Sigmoid(),
        )

    def forward(self, I):
        # Convolutional layers
        X = self.Conv1(I)
        X = self.Conv2(X)
        X = self.Conv3(X)
        X = self.Conv4(X)
        # flatten the feature maps, then apply the linear layer and sigmoid activation
        X = X.view(-1, 1024*4*4)
        X = self.Out(X)
        return X

In [16]:
class Generator(torch.nn.Module):
    """
    This generator network is based on the original DCGAN paper by Radford et al.
    The generator takes as input a 100-dimensional noise vector (z) and maps it to the data space
    (which in this case is the image space) via a series of transposed convolution blocks.
    From the input random noise, the generator outputs a 3x64x64 image with values in [-1, 1],
    i.e. the same size as the images fed to the discriminator.
    """
    
    def __init__(self):
        super(Generator, self).__init__()
        self.nz = 100
        self.linear = torch.nn.Linear(self.nz, 1024*4*4)
        
        #first transposed convolution block
        self.Conv1 = nn.Sequential(
            nn.ConvTranspose2d(in_channels=1024, out_channels=512, kernel_size=4,stride=2, padding=1, bias=False),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True))
        
        #second transposed convolution block
        self.Conv2 = nn.Sequential(
            nn.ConvTranspose2d(in_channels=512, out_channels=256, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True)
        )
        
        #third transposed convolution block
        self.Conv3 = nn.Sequential(
            nn.ConvTranspose2d(in_channels=256, out_channels=128, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True)
        )
        
        #fourth transposed convolution block
        self.Conv4 = nn.Sequential(
            nn.ConvTranspose2d(in_channels=128, out_channels=3, kernel_size=4, stride=2, padding=1, bias=False)
        )
        self.out = torch.nn.Tanh()

    def forward(self, z):
        """
        Perform forward calculation for generator output, given random noise input z
        """
        # Project and reshape
        X = self.linear(z)
        X = X.view(X.shape[0], 1024, 4, 4)
        # conv blocks
        X = self.Conv1(X)
        X = self.Conv2(X)
        X = self.Conv3(X)
        X = self.Conv4(X)
        # tanh activation
        return self.out(X)
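
As a sanity check on the layer shapes in the two networks above, the sketch below applies the standard output-size formulas for convolutions and transposed convolutions (dilation 1, no output padding) to the settings used in every block: `kernel_size=4`, `stride=2`, `padding=1`. The helper names are illustrative. The 64x64 input size is not stated explicitly in the code; it is inferred from the `1024*4*4` features expected by the discriminator's linear layer.

```python
def conv_out_size(size, kernel_size=4, stride=2, padding=1):
    """Spatial output size of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

def conv_transpose_out_size(size, kernel_size=4, stride=2, padding=1):
    """Spatial output size of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel_size

# Discriminator: four stride-2 convolutions halve the resolution each time
d_sizes = [64]
for _ in range(4):
    d_sizes.append(conv_out_size(d_sizes[-1]))

# Generator: four stride-2 transposed convolutions double it each time
g_sizes = [4]
for _ in range(4):
    g_sizes.append(conv_transpose_out_size(g_sizes[-1]))

print(d_sizes)  # [64, 32, 16, 8, 4]
print(g_sizes)  # [4, 8, 16, 32, 64]
```

The generator therefore mirrors the discriminator: it starts from the same 4x4 grid of 1024 channels that the discriminator ends with and works back up to a 64x64 image.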

In [17]:
def train_discriminator(optimizer, real_data, fake_data):
    # Reset gradients
    optimizer.zero_grad()
    
    # 1.1 Train on Real Data
    prediction_real = netD(real_data)
    # Calculate error and backpropagate
    error_real = loss(prediction_real, real_data_target(real_data.size(0)))
    error_real.backward()

    # 1.2 Train on Fake Data
    prediction_fake = netD(fake_data)
    # Calculate error and backpropagate
    error_fake = loss(prediction_fake, fake_data_target(fake_data.size(0)))
    error_fake.backward()
    
    # 1.3 Update weights with gradients
    optimizer.step()
    
    # Return error
    return error_real + error_fake, prediction_real, prediction_fake

def train_generator(optimizer, fake_data):
    # 2. Train Generator
    # Reset gradients
    optimizer.zero_grad()
    # Score the generated samples with the discriminator
    prediction = netD(fake_data)
    # Calculate error and backpropagate
    error = loss(prediction, real_data_target(prediction.size(0)))
    error.backward()
    # Update weights with gradients
    optimizer.step()
    # Return error
    return error
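
Note that `train_generator` labels the generated samples as real, so with a binary cross-entropy loss the generator minimizes $-\textbf{E}_{z}[\log D(G(z))]$ rather than the zero-sum cost $\textbf{E}_{z}[\log(1 - D(G(z)))]$ of the minimax game. This is the commonly used non-saturating variant. The plain-Python sketch below compares the gradients of both costs with respect to the discriminator's pre-sigmoid output $a$, where $D = \sigma(a)$; the function names are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def minimax_grad(a):
    """d/da of log(1 - sigmoid(a)), the zero-sum generator cost for one sample."""
    return -sigmoid(a)

def non_saturating_grad(a):
    """d/da of -log(sigmoid(a)), the cost implied by BCE with 'real' targets."""
    return -(1.0 - sigmoid(a))

# a = -6 means the discriminator is confident the sample is fake (D ~ 0.0025)
for a in (-6.0, 0.0, 6.0):
    print(f"a={a:+.0f}  D={sigmoid(a):.4f}  "
          f"minimax={minimax_grad(a):+.4f}  non-saturating={non_saturating_grad(a):+.4f}")
```

When the discriminator rejects generated samples with high confidence ($D \approx 0$), the minimax gradient nearly vanishes while the non-saturating gradient stays close to $-1$, which gives the generator a usable learning signal early in training.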

In [22]:
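def weights_init(m):
    """Draw conv and batch-norm weights from a zero-centered normal with std 0.02"""
    classname = m.__class__.__name__
    if classname.find('Conv') != -1 or classname.find('BatchNorm') != -1:
        m.weight.data.normal_(0.00, 0.02)

def noise(s):
    """Generate s 100-dimensional noise vectors from a standard normal distribution"""
    z = Variable(torch.randn(s, 100))
    return z.cuda() if torch.cuda.is_available() else z

# The two target helpers below are used by the training functions but were not
# defined in the notebook; these are assumed definitions.
def real_data_target(size):
    """Labels (all ones) for `size` real samples, shaped like the output of netD"""
    target = Variable(torch.ones(size, 1))
    return target.cuda() if torch.cuda.is_available() else target

def fake_data_target(size):
    """Labels (all zeros) for `size` generated samples, shaped like the output of netD"""
    target = Variable(torch.zeros(size, 1))
    return target.cuda() if torch.cuda.is_available() else target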

In [23]:
# create network instances
netG = Generator()
netD = Discriminator()

#initialize weights
netD.apply(weights_init)
netG.apply(weights_init)

#use cuda if available
if torch.cuda.is_available():
    netG.cuda()
    netD.cuda()

# Set learning rate
lr = 0.0002

# Number of training epochs
num_epochs = 200

# setup optimizers
optD = Adam(netD.parameters(), lr=lr, betas=(0.5, 0.999))
optG = Adam(netG.parameters(), lr=lr, betas=(0.5, 0.999))

# fixed test noise
test_noise = noise(20)

# loss function
loss = nn.BCELoss()

real_label = 1
fake_label = 0

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
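for epoch in range(num_epochs):
    dloss_log = []
    gloss_log = []
    # data_loader is not defined in the cells shown; it is assumed to yield
    # (images, labels) batches of 3x64x64 images
    for i, data in enumerate(data_loader, 0):
        real_data = Variable(data[0])
        if torch.cuda.is_available(): real_data = real_data.cuda()

        # 1. Train Discriminator
        # Generate fake data; detach so no gradients flow back into the generator
        fake_data = netG(noise(real_data.size(0))).detach()
        d_error, d_pred_real, d_pred_fake = train_discriminator(optD,
                                                                real_data, fake_data)

        # 2. Train Generator
        # Generate fake data, this time keeping the graph for the generator update
        fake_data = netG(noise(real_data.size(0)))
        g_error = train_generator(optG, fake_data)

        # Record the losses for this batch
        dloss_log.append(d_error.item())
        gloss_log.append(g_error.item())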
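
The generator ends in a `Tanh`, so its outputs lie in $[-1, 1]$, and Radford et al. scale the training images to that same range. The data pipeline is not shown here, so the following is only a sketch of that scaling in plain Python, assuming 8-bit pixel values; the function names are hypothetical.

```python
def to_tanh_range(pixel):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1]."""
    return (pixel / 255.0 - 0.5) / 0.5

def from_tanh_range(value):
    """Map a generator output in [-1, 1] back to [0, 255]."""
    return (value * 0.5 + 0.5) * 255.0

print([to_tanh_range(p) for p in (0, 255)])   # [-1.0, 1.0]
print(from_tanh_range(to_tanh_range(64)))
```

The inverse mapping is what one would apply to generated samples before displaying them as images.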

### Results for MNIST
<img src="figures/MNIST.gif" width = '800'> 

### Results for Fashion MNIST
<img src="figures/FASHION.gif" width = '800'> 

### Results for CIFAR10
<img src="figures/CIFAR10.gif" width = '800'> 