In [None]:
from google.colab import drive

drive.mount('/content/drive', force_remount=True)

# enter the foldername in your Drive where you have saved the unzipped
# 'cs231n' folder containing the '.py', 'classifiers' and 'datasets'
# folders.
FOLDERNAME = 'IntroDL/hw5/'

assert FOLDERNAME is not None, "[!] Enter the foldername."

%cd drive/My\ Drive
%cp -r $FOLDERNAME ../../
%cd ../..
%cd 'hw5/'

### What is a GAN?

In 2014, [Goodfellow et al.](https://arxiv.org/abs/1406.2661) presented a method for training generative models called Generative Adversarial Networks (GANs for short). In a GAN, we build two different neural networks. Our first network is a traditional classification network, called the **discriminator**. We will train the discriminator to take images, and classify them as being real (belonging to the training set) or fake (not present in the training set). Our other network, called the **generator**, will take random noise as input and transform it using a neural network to produce images. The goal of the generator is to fool the discriminator into thinking the images it produced are real.

We can think of this back and forth process of the generator ($G$) trying to fool the discriminator ($D$), and the discriminator trying to correctly classify real vs. fake as a minimax game:
$$\underset{G}{\text{minimize}}\; \underset{D}{\text{maximize}}\; \mathbb{E}_{x \sim p_\text{data}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log \left(1-D(G(z))\right)\right]$$
where $x \sim p_\text{data}$ are samples from the input data, $z \sim p(z)$ are the random noise samples, $G(z)$ are the generated images using the neural network generator $G$, and $D$ is the output of the discriminator, specifying the probability of an input being real. In [Goodfellow et al.](https://arxiv.org/abs/1406.2661), they analyze this minimax game and show how it relates to minimizing the Jensen-Shannon divergence between the training data distribution and the generated samples from $G$.

To optimize this minimax game, we will alternate between taking gradient *descent* steps on the objective for $G$, and gradient *ascent* steps on the objective for $D$:
1. update the **generator** ($G$) to minimize the probability of the __discriminator making the correct choice__. 
2. update the **discriminator** ($D$) to maximize the probability of the __discriminator making the correct choice__.

While these updates are useful for analysis, they do not perform well in practice. Instead, we will use a different objective when we update the generator: maximize the probability of the **discriminator making the incorrect choice**. This small change helps to alleviate problems with the generator gradient vanishing when the discriminator is confident. This is the standard update used in most GAN papers, and was used in the original paper from [Goodfellow et al.](https://arxiv.org/abs/1406.2661).

In this assignment, we will alternate the following updates:
1. Update the generator ($G$) to maximize the probability of the discriminator making the incorrect choice on generated data:
$$\underset{G}{\text{maximize}}\;  \mathbb{E}_{z \sim p(z)}\left[\log D(G(z))\right]$$
2. Update the discriminator ($D$), to maximize the probability of the discriminator making the correct choice on real and generated data:
$$\underset{D}{\text{maximize}}\; \mathbb{E}_{x \sim p_\text{data}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log \left(1-D(G(z))\right)\right]$$
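
To make the vanishing-gradient argument concrete, the sketch below compares the gradient each generator objective sends back through the discriminator's logit $s$, where $D = \sigma(s)$. It is an illustrative calculation in plain Python, not part of the assignment code; the function names are made up for this example.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def saturating_grad(s):
    """d/ds of log(1 - sigmoid(s)): the generator term of the minimax objective."""
    return -sigmoid(s)

def non_saturating_grad(s):
    """d/ds of -log(sigmoid(s)): the loss form of 'maximize log D(G(z))'."""
    return sigmoid(s) - 1.0

# Very negative logits mean the discriminator confidently calls a sample fake.
for s in (-8.0, -4.0, 0.0, 4.0):
    print(f"s={s:+.1f}  saturating={saturating_grad(s):+.5f}  "
          f"non-saturating={non_saturating_grad(s):+.5f}")
```

When the discriminator confidently rejects a sample (very negative $s$), the first gradient shrinks toward zero while the second stays close to $-1$, so the generator keeps receiving a learning signal.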

### What else is there?
Since 2014, GANs have exploded into a huge research area, with massive [workshops](https://sites.google.com/site/nips2016adversarial/), and [hundreds of new papers](https://github.com/hindupuravinash/the-gan-zoo). Compared to other approaches for generative models, they often produce the highest quality samples but are some of the most difficult and finicky models to train (see [this github repo](https://github.com/soumith/ganhacks) that contains a set of 17 hacks that are useful for getting models working). Improving the stability and robustness of GAN training is an open research question, with new papers coming out every day! For a more recent tutorial on GANs, see [here](https://arxiv.org/abs/1701.00160). There is also some even more recent exciting work that changes the objective function to Wasserstein distance and yields much more stable results across model architectures: [WGAN](https://arxiv.org/abs/1701.07875), [WGAN-GP](https://arxiv.org/abs/1704.00028).


GANs are not the only way to train a generative model! For other approaches to generative modeling check out the [deep generative model chapter](http://www.deeplearningbook.org/contents/generative_models.html) of the Deep Learning [book](http://www.deeplearningbook.org). 

Here's an example of what your outputs from the 3 different models you're going to train should look like... note that GANs are sometimes finicky, so your outputs might not look exactly like this... this is just meant to be a *rough* guideline of the kind of quality you can expect:

In [None]:
from IPython.display import Image
Image(filename="gan_outputs_tf.png")

## Setup

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torchvision import datasets, transforms

import numpy as np
import os

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# A bunch of utility functions

def show_images(images):
    images = images.view(images.shape[0], -1).detach().cpu().numpy()
    sqrtn = int(np.ceil(np.sqrt(images.shape[0])))
    sqrtimg = int(np.ceil(np.sqrt(images.shape[1])))

    fig = plt.figure(figsize=(sqrtn, sqrtn))
    gs = gridspec.GridSpec(sqrtn, sqrtn)
    gs.update(wspace=0.05, hspace=0.05)

    for i, img in enumerate(images):
        ax = plt.subplot(gs[i])
        plt.axis('off')
        ax.set_xticklabels([])
        ax.set_yticklabels([])
        ax.set_aspect('equal')
        plt.imshow(img.reshape([sqrtimg, sqrtimg]))
    plt.show()

def preprocess_img(x):
    return 2 * x - 1.0

def deprocess_img(x):
    return (x + 1.0) / 2.0

def rel_error(x, y):
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

def count_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

answers = {}

for k, v in np.load('gan-checks-tf.npz').items():
    answers[k] = torch.tensor(v)

NOISE_DIM = 10
NUM_SAMPLES = 10000

## Dataset
GANs are notoriously finicky with hyperparameters, and also require many training epochs. In order to make this assignment approachable without a GPU, we will be working on the MNIST dataset, which has 60,000 training and 10,000 test images. Each picture contains a centered image of a white digit (0 through 9) on a black background. This was one of the first datasets used to train convolutional neural networks and it is fairly easy -- a standard CNN model can easily exceed 99% accuracy.
 

**Heads-up**: With `transforms.ToTensor()`, the torchvision MNIST loader returns images as tensors of shape (batch, 1, 28, 28), with type `torch.float32` and values bounded in [0,1]. The fully connected models below work on flattened vectors of shape (batch, 784), so reshape between the two forms as needed.

In [None]:
transform = transforms.Compose([
    transforms.ToTensor(), # [0,1]
])

mnist_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
batch_size = 16
mnist_loader = DataLoader(mnist_dataset, batch_size=batch_size, shuffle=False)


In [None]:
# Show a batch
data_iter = iter(mnist_loader)
images, labels = next(data_iter)
show_images(images)

## Random Noise
Generate a Torch `Tensor` containing uniform noise from -1 to 1 with shape `[batch_size, dim]`.

In [None]:
def sample_noise(batch_size, dim):
    """Generate random uniform noise from -1 to 1.
    
    Inputs:
    - batch_size: integer giving the batch size of noise to generate
    - dim: integer giving the dimension of the noise to generate
    
    Returns:
    PyTorch Tensor containing uniform noise in [-1, 1] with shape [batch_size, dim]
    """
    # TODO: sample and return noise
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return torch.rand(batch_size, dim) * 2 - 1

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

Make sure noise is the correct shape and type:

In [None]:
def test_sample_noise():
    batch_size = 3
    dim = 4
    z = sample_noise(batch_size, dim)
    # Check z has the correct shape
    assert list(z.shape) == [batch_size, dim]
    # Make sure z is a Tensor and not a numpy array
    assert isinstance(z, torch.Tensor)
    # Check that we get different noise for different evaluations
    z1 = sample_noise(batch_size, dim)
    z2 = sample_noise(batch_size, dim)
    assert not np.array_equal(z1.numpy(), z2.numpy())
    # Check that we get the correct range
    assert np.all(z1.numpy() >= -1.0) and np.all(z1.numpy() <= 1.0)
    print("All tests passed!")
    
test_sample_noise()

## Discriminator
Our first step is to build a discriminator. **Hint:** You should use the layers in `torch.nn` to build the model.

Architecture:
 * Fully connected layer with input size 784 and output size 256
 * LeakyReLU with alpha 0.01
 * Fully connected layer with output size 256
 * LeakyReLU with alpha 0.01
 * Fully connected layer with output size 1 
 
The output of the discriminator should thus have shape `[batch_size, 1]`, and contain real numbers corresponding to the scores that each of the `batch_size` inputs is a real image.
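
As a quick sanity check on the architecture above, the parameter count can be worked out by hand: each fully connected layer contributes a weight matrix plus one bias per output unit, and LeakyReLU has no parameters. The helper name `linear_params` is hypothetical and only used for this illustration.

```python
def linear_params(in_features, out_features):
    # Weight matrix (in_features x out_features) plus one bias per output unit.
    return in_features * out_features + out_features

# (input size, output size) for the three fully connected layers listed above.
layer_sizes = [(784, 256), (256, 256), (256, 1)]
total = sum(linear_params(i, o) for i, o in layer_sizes)
print(total)
```

The total should agree with the expected count used by the parameter test for this discriminator.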

In [None]:
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, input_dim=784):
        super(Discriminator, self).__init__()
        # TODO: implement architecture
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        self.model = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.LeakyReLU(0.01),
            nn.Linear(256, 256),
            nn.LeakyReLU(0.01),
            nn.Linear(256, 1)
        )
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    def forward(self, x):
        # TODO: forward function
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        if len(x.shape) == 4:
            x = x.view(x.size(0), -1)
        return self.model(x)
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****


Test to make sure the number of parameters in the discriminator is correct:

In [None]:
def test_discriminator(true_count=267009):
    model = Discriminator()
    cur_count = count_params(model)
    if cur_count != true_count:
        print('Incorrect number of parameters in discriminator. {0} instead of {1}. Check your architecture.'.format(cur_count,true_count))
    else:
        print('Correct number of parameters in discriminator.')
        
test_discriminator()

## Generator
Now to build a generator. You should use the layers in `torch.nn` to construct the model. All fully connected layers should include bias terms. Note that activation functions are available as modules in `torch.nn` (or as functions in `torch.nn.functional`). Once again, use the default initializers for parameters.

Architecture:
 * Fully connected layer with input size z.shape[1] (the number of noise dimensions) and output size 1024
 * `ReLU`
 * Fully connected layer with output size 1024 
 * `ReLU`
 * Fully connected layer with output size 784
 * `Tanh` (To restrict every element of the output to be in the range [-1,1])

In [None]:
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim=NOISE_DIM):
        super(Generator, self).__init__()
        # TODO: implement architecture
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        self.model = nn.Sequential(
            nn.Linear(noise_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 1024),
            nn.ReLU(),
            nn.Linear(1024, 784),
            nn.Tanh()
        )
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    def forward(self, z):
        # TODO: implement forward function
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        x = self.model(z)
        x = x.view(-1, 1, 28, 28)
        return x
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****


Test to make sure the number of parameters in the generator is correct:

In [None]:
def test_generator(true_count=1858320):
    model = Generator(4)
    cur_count = count_params(model)
    if cur_count != true_count:
        print('Incorrect number of parameters in generator. {0} instead of {1}. Check your architecture.'.format(cur_count,true_count))
    else:
        print('Correct number of parameters in generator.')
        
test_generator()

# GAN Loss

Compute the generator and discriminator loss. The generator loss is:
$$\ell_G  =  -\mathbb{E}_{z \sim p(z)}\left[\log D(G(z))\right]$$
and the discriminator loss is:
$$ \ell_D = -\mathbb{E}_{x \sim p_\text{data}}\left[\log D(x)\right] - \mathbb{E}_{z \sim p(z)}\left[\log \left(1-D(G(z))\right)\right]$$
Note that these are negated from the equations presented earlier as we will be *minimizing* these losses.

**HINTS**: Use `torch.ones_like` and `torch.zeros_like` to generate labels for your discriminator. Use `torch.nn.BCEWithLogitsLoss` to help compute your loss function.
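
To see how binary cross-entropy on logits maps onto the two losses above, here is a small plain-Python sketch. It uses the standard identity $\text{BCE}(s, y) = \max(s, 0) - s\,y + \log(1 + e^{-|s|})$ for a logit $s$ and a label $y \in \{0, 1\}$, which avoids overflow for large $|s|$. This illustrates the math only and is not a description of how PyTorch implements `BCEWithLogitsLoss` internally; the helper names and toy logits are made up.

```python
import math

def bce_with_logits(s, y):
    """Numerically stable binary cross-entropy for one logit s and label y."""
    return max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))

def naive_bce(s, y):
    """Direct formula; overflows or hits log(0) for large |s|."""
    p = 1.0 / (1.0 + math.exp(-s))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Toy logits (assumed values, for illustration only).
logits_real = [2.0, 0.5]
logits_fake = [-1.0, 0.3]

# Discriminator: real samples get label 1, fake samples get label 0.
d_loss = (mean(bce_with_logits(s, 1) for s in logits_real)
          + mean(bce_with_logits(s, 0) for s in logits_fake))
# Generator: wants its fake samples to be labelled 1.
g_loss = mean(bce_with_logits(s, 1) for s in logits_fake)
print(d_loss, g_loss)
```

With label 1 the expression reduces to $-\log \sigma(s)$ and with label 0 to $-\log(1 - \sigma(s))$, which are exactly the terms in $\ell_G$ and $\ell_D$.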

In [None]:
import torch
import torch.nn as nn


def discriminator_loss(logits_real, logits_fake):
    """
    Computes the discriminator loss described above.
    
    Inputs:
    - logits_real: Tensor of shape (N, 1) giving scores for the real data.
    - logits_fake: Tensor of shape (N, 1) giving scores for the fake data.
    
    Returns:
    - loss: Tensor containing (scalar) the loss for the discriminator.
    """
    loss = None
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    real_labels = torch.ones_like(logits_real)
    fake_labels = torch.zeros_like(logits_fake)
    
    criterion = nn.BCEWithLogitsLoss()
    
    loss_real = criterion(logits_real, real_labels)
    
    loss_fake = criterion(logits_fake, fake_labels)
    
    loss = loss_real + loss_fake
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
    return loss

def generator_loss(logits_fake):
    """
    Computes the generator loss described above.

    Inputs:
    - logits_fake: PyTorch Tensor of shape (N,) giving scores for the fake data.
    
    Returns:
    - loss: PyTorch Tensor containing the (scalar) loss for the generator.
    """
    loss = None
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    # The generator wants D to label its samples as real, so the targets are all ones.
    real_labels = torch.ones_like(logits_fake)
    
    criterion = nn.BCEWithLogitsLoss()
    
    loss = criterion(logits_fake, real_labels)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
    return loss

Test your GAN loss. Make sure both the generator and discriminator loss are correct. You should see errors less than 1e-8.

In [None]:
def test_discriminator_loss(logits_real, logits_fake, d_loss_true):
    d_loss = discriminator_loss(logits_real,
                                logits_fake)
    print("Maximum error in d_loss: %g"%rel_error(d_loss_true.numpy(), d_loss.numpy()))

test_discriminator_loss(answers['logits_real'], answers['logits_fake'],
                        answers['d_loss_true'])

In [None]:
def test_generator_loss(logits_fake, g_loss_true):
    g_loss = generator_loss(logits_fake)
    print("Maximum error in g_loss: %g"%rel_error(g_loss_true.numpy(), g_loss.numpy()))

test_generator_loss(answers['logits_fake'], answers['g_loss_true'])

# Optimizing our loss
Make an `Adam` optimizer with a 1e-3 learning rate and beta1=0.5 to minimize G_loss and D_loss separately. The trick of decreasing beta was shown to be effective in helping GANs converge in the [Improved Techniques for Training GANs](https://arxiv.org/abs/1606.03498) paper. In fact, with our current hyperparameters, if you set beta1 to the Adam default of 0.9, there's a good chance your discriminator loss will go to zero and the generator will fail to learn entirely. This is a common failure mode in GANs; if your D(x) learns too fast (e.g. loss goes near zero), your G(z) is never able to learn. Often D(x) is trained with SGD with Momentum or RMSProp instead of Adam, but here we'll use Adam for both D(x) and G(z).
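
One way to build intuition for what beta1 controls: it sets how long Adam's first-moment average remembers past gradients. The sketch below tracks only that running average (ignoring bias correction and the second-moment scaling, neither of which changes its sign) and counts how many steps it takes to change sign after the gradient flips from +1 to -1. This is an assumption-laden toy, not a claim about why the cited paper's trick works; the function name is hypothetical.

```python
def steps_until_sign_flip(beta1, warmup_steps=20, max_steps=100):
    """Steps needed for the first-moment average to turn negative after the
    gradient switches from +1 to -1."""
    m = 0.0
    for _ in range(warmup_steps):
        m = beta1 * m + (1.0 - beta1) * 1.0      # gradient is +1
    for step in range(1, max_steps + 1):
        m = beta1 * m + (1.0 - beta1) * (-1.0)   # gradient is now -1
        if m < 0.0:
            return step
    return None

fast = steps_until_sign_flip(0.5)
slow = steps_until_sign_flip(0.9)
print(f"beta1=0.5 reacts after {fast} step(s); beta1=0.9 after {slow} step(s)")
```

A smaller beta1 lets each optimizer react sooner when the other network changes the direction of its gradient, which is plausibly helpful in an adversarial setting.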

In [None]:
import torch.optim as optim

def get_solvers(D, G, learning_rate=1e-3, beta1=0.5):
    """
    Create Adam optimizers for GAN training in PyTorch.

    Inputs:
    - D: Discriminator (nn.Module)
    - G: Generator (nn.Module)
    - learning_rate: learning rate for both solvers
    - beta1: beta1 value for Adam (1st moment decay)

    Returns:
    - D_solver: optimizer for the discriminator
    - G_solver: optimizer for the generator
    """
    D_solver = None
    G_solver = None
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    D_solver = optim.Adam(D.parameters(), lr=learning_rate, betas=(beta1, 0.999))
    G_solver = optim.Adam(G.parameters(), lr=learning_rate, betas=(beta1, 0.999))
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return D_solver, G_solver

# Training a GAN!
Well that wasn't so hard, was it? After the first epoch, you should see fuzzy outlines, clear shapes as you approach epoch 3, and decent shapes, about half of which will be sharp and clearly recognizable as we pass epoch 5. In our case, we'll simply train D(x) and G(z) with one batch each every iteration. However, papers often experiment with different schedules of training D(x) and G(z), sometimes doing one for more steps than the other, or even training each one until the loss gets "good enough" and then switching to training the other. 

In [None]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def run_a_gan(D, G, D_solver, G_solver, discriminator_loss, generator_loss,
              show_every=20, print_every=20, batch_size=128, num_epochs=10,
              noise_size=NOISE_DIM):
    """
    Train a GAN in PyTorch.
    
    Inputs:
    - D: Discriminator model (nn.Module)
    - G: Generator model (nn.Module)
    - D_solver: optimizer for Discriminator
    - G_solver: optimizer for Generator
    - discriminator_loss: function to compute D loss
    - generator_loss: function to compute G loss
    """
    D = D.to(device)
    G = G.to(device)
    transform = transforms.Compose([transforms.ToTensor()])
    dataloader = DataLoader(
        datasets.MNIST(root='./data', train=True, download=True, transform=transform),
        batch_size=batch_size, shuffle=True
    )

    iter_count = 0
    for epoch in range(num_epochs):
        for x, _ in dataloader:
            real_data = x.view(x.size(0), -1).to(device)

            # ---------------------
            # 1. Update Discriminator
            # ---------------------
            D_solver.zero_grad()
            logits_real = D(preprocess_img(real_data))

            g_fake_seed = sample_noise(batch_size, noise_size).to(device)
            fake_images = G(g_fake_seed)
            # Detach so this step only computes gradients for D, not for G.
            logits_fake = D(fake_images.detach())

            d_total_error = discriminator_loss(logits_real, logits_fake)
            d_total_error.backward()
            D_solver.step()

            # ---------------------
            # 2. Update Generator
            # ---------------------
            G_solver.zero_grad()
            g_fake_seed = sample_noise(batch_size, noise_size).to(device)
            fake_images = G(g_fake_seed)
            gen_logits_fake = D(fake_images)

            g_error = generator_loss(gen_logits_fake)
            g_error.backward()
            G_solver.step()

            # ---------------------
            # 3. Logging & Visualization
            # ---------------------
            if iter_count % show_every == 0:
                print(f'Epoch: {epoch}, Iter: {iter_count}, D: {d_total_error.item():.4f}, G: {g_error.item():.4f}')
                imgs_numpy = fake_images[:16].detach().cpu()
                show_images(imgs_numpy)
                plt.show()

            iter_count += 1

    # ----- Final Visualization -----
    z = sample_noise(batch_size, noise_size).to(device)
    G_sample = G(z).detach().cpu()
    print('Final images')
    show_images(G_sample[:16])
    plt.show()


#### Train your GAN! This should take about 10 minutes on a CPU, or about 2 minutes on GPU.

In [None]:
# Make the discriminator
D = Discriminator()

# Make the generator
G = Generator()

# Use the function you wrote earlier to get optimizers for the Discriminator and the Generator
D_solver, G_solver = get_solvers(D, G)

# Run it!
run_a_gan(D, G, D_solver, G_solver, discriminator_loss, generator_loss)

# Least Squares GAN
We'll now look at [Least Squares GAN](https://arxiv.org/abs/1611.04076), a newer, more stable alternative to the original GAN loss function. For this part, all we have to do is change the loss function and retrain the model. We'll implement equation (9) in the paper, with the generator loss:
$$\ell_G  =  \frac{1}{2}\mathbb{E}_{z \sim p(z)}\left[\left(D(G(z))-1\right)^2\right]$$
and the discriminator loss:
$$ \ell_D = \frac{1}{2}\mathbb{E}_{x \sim p_\text{data}}\left[\left(D(x)-1\right)^2\right] + \frac{1}{2}\mathbb{E}_{z \sim p(z)}\left[ \left(D(G(z))\right)^2\right]$$


**HINTS**: Instead of computing the expectation, we will be averaging over elements of the minibatch, so make sure to combine the loss by averaging instead of summing. When plugging in for $D(x)$ and $D(G(z))$ use the direct output from the discriminator (`scores_real` and `scores_fake`).
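
As a worked example of the two formulas, the following plain-Python sketch evaluates them on a tiny hand-made minibatch. The scores are invented and the helper names (`ls_d_loss`, `ls_g_loss`) are hypothetical, chosen so they do not clash with the functions you are asked to write.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def ls_d_loss(scores_real, scores_fake):
    # Real scores are pulled toward 1, fake scores toward 0.
    return (0.5 * mean((s - 1.0) ** 2 for s in scores_real)
            + 0.5 * mean(s ** 2 for s in scores_fake))

def ls_g_loss(scores_fake):
    # The generator wants its fake scores pulled toward 1.
    return 0.5 * mean((s - 1.0) ** 2 for s in scores_fake)

scores_real = [0.9, 1.2]
scores_fake = [0.1, -0.3]
print(ls_d_loss(scores_real, scores_fake))
print(ls_g_loss(scores_fake))
```

Note that the raw discriminator outputs are used directly, with no sigmoid, and that each term is a mean over the minibatch rather than a sum.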

In [None]:
import torch

def ls_discriminator_loss(scores_real, scores_fake):
    """
    Compute the Least-Squares GAN loss for the discriminator.
    
    Inputs:
    - scores_real: Tensor of shape (N, 1) giving scores for the real data.
    - scores_fake: Tensor of shape (N, 1) giving scores for the fake data.
    
    Outputs:
    - loss: A Tensor containing the loss.
    """
    loss = None
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    loss_real = 0.5 * torch.mean((scores_real - 1) ** 2)
    loss_fake = 0.5 * torch.mean(scores_fake ** 2)
    loss = loss_real + loss_fake
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return loss

def ls_generator_loss(scores_fake):
    """
    Computes the Least-Squares GAN loss for the generator.
    
    Inputs:
    - scores_fake: Tensor of shape (N, 1) giving scores for the fake data.
    
    Outputs:
    - loss: A Tensor containing the loss.
    """
    loss = None
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    loss = 0.5 * torch.mean((scores_fake - 1) ** 2)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return loss

Test your LSGAN loss. You should see errors less than 1e-8.

In [None]:
def test_lsgan_loss(score_real, score_fake, d_loss_true, g_loss_true):
    
    d_loss = ls_discriminator_loss(score_real, score_fake)
    g_loss = ls_generator_loss(score_fake)
    print("Maximum error in d_loss: %g"%rel_error(d_loss_true.numpy(), d_loss.numpy()))
    print("Maximum error in g_loss: %g"%rel_error(g_loss_true.numpy(), g_loss.numpy()))

test_lsgan_loss(answers['logits_real'], answers['logits_fake'],
                answers['d_loss_lsgan_true'], answers['g_loss_lsgan_true'])

Create new training steps so we instead minimize the LSGAN loss:

In [None]:
# Make the discriminator
D = Discriminator()

# Make the generator
G = Generator()

# Use the function you wrote earlier to get optimizers for the Discriminator and the Generator
D_solver, G_solver = get_solvers(D, G)

# Run it!
run_a_gan(D, G, D_solver, G_solver, ls_discriminator_loss, ls_generator_loss)

# Deep Convolutional GANs
In the first part of the notebook, we implemented an almost direct copy of the original GAN network from Ian Goodfellow. However, this network architecture allows no real spatial reasoning. It is unable to reason about things like "sharp edges" in general because it lacks any convolutional layers. Thus, in this section, we will implement some of the ideas from [DCGAN](https://arxiv.org/abs/1511.06434), where we use convolutional networks as our discriminators and generators.

#### Discriminator
We will use a discriminator inspired by the TensorFlow MNIST classification [tutorial](https://www.tensorflow.org/get_started/mnist/pros), which is able to get above 99% accuracy on the MNIST dataset fairly quickly. *Be sure to check the dimensions of x and reshape when needed*: fully connected blocks expect [N,D] Tensors while `nn.Conv2d` blocks expect [N,C,H,W] Tensors. Please use `torch.nn` to define the following architecture:

Architecture:
* Conv2D: 32 Filters, 5x5, Stride 1, padding 0
* Leaky ReLU(alpha=0.01)
* Max Pool 2x2, Stride 2
* Conv2D: 64 Filters, 5x5, Stride 1, padding 0
* Leaky ReLU(alpha=0.01)
* Max Pool 2x2, Stride 2
* Flatten
* Fully Connected with output size 4 x 4 x 64
* Leaky ReLU(alpha=0.01)
* Fully Connected with output size 1

Once again, please use biases for all convolutional and fully connected layers, and use the default parameter initializers. Note that a padding of 0 is the default for `nn.Conv2d` (it corresponds to the 'VALID' padding option in TensorFlow).
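
The "4 x 4 x 64" in the architecture follows from tracking the spatial size of a 28x28 MNIST image through the layers. A minimal sketch of that arithmetic, using the standard output-size formula for convolution and pooling (the helper name `conv_out` is hypothetical):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28
size = conv_out(size, kernel=5)            # Conv2D 5x5, stride 1, padding 0
size = conv_out(size, kernel=2, stride=2)  # Max Pool 2x2, stride 2
size = conv_out(size, kernel=5)            # Conv2D 5x5, stride 1, padding 0
size = conv_out(size, kernel=2, stride=2)  # Max Pool 2x2, stride 2

flat_features = size * size * 64           # 64 channels after the second conv
print(size, flat_features)
```

The flattened feature count is the input size of the first fully connected layer.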

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        # TODO: implement architecture
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=0)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=0)
        self.fc1 = nn.Linear(4 * 4 * 64, 4 * 4 * 64)
        self.fc2 = nn.Linear(4 * 4 * 64, 1)
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    def forward(self, x):
        # TODO: implement forward function
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
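        # run_a_gan flattens real images to (N, 784); restore (N, 1, 28, 28) for the conv layers.
        x = x.view(-1, 1, 28, 28)
        x = self.conv1(x)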
        x = F.leaky_relu(x, negative_slope=0.01)
        x = self.pool(x)
        
        x = self.conv2(x)
        x = F.leaky_relu(x, negative_slope=0.01)
        x = self.pool(x)
        
        x = x.view(-1, 4 * 4 * 64)
        
        x = self.fc1(x)
        x = F.leaky_relu(x, negative_slope=0.01)
        
        x = self.fc2(x)
        
        return x
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

model = Discriminator()
test_discriminator(1102721)

#### Generator
For the generator, we will copy the architecture exactly from the [InfoGAN paper](https://arxiv.org/pdf/1606.03657.pdf). See Appendix C.1 MNIST. Please use `torch.nn` for your implementation. You might find the documentation for [torch.nn.ConvTranspose2d](https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html) useful. The architecture is as follows.

Architecture:
* Fully connected with output size 1024 
* `ReLU`
* BatchNorm
* Fully connected with output size 7 x 7 x 128 
* `ReLU`
* BatchNorm
* Reshape into an image Tensor of size 128 x 7 x 7 (channels first)
* Conv2D^T (transpose): 64 filters of 4x4, stride 2
* `ReLU`
* BatchNorm
* Conv2d^T (transpose): 1 filter of 4x4, stride 2
* `TanH`

Once again, use biases for the fully connected and transpose convolutional layers. Please use the default initializers for your parameters. For padding, use `padding=1` in the transpose convolutions so that each one exactly doubles the spatial size (the equivalent of TensorFlow's 'same' option here). For Batch Normalization, assume we are always in 'training' mode.
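
To check that two stride-2 transpose convolutions take the 7x7 feature map to a 28x28 image, the sketch below applies the output-size formula for a transpose convolution with no dilation and no output padding. The helper name `conv_transpose_out` is hypothetical, and `padding=1` is the value assumed for a 4x4 kernel with stride 2.

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Spatial output size of a transpose convolution
    (dilation 1, output_padding 0)."""
    return (size - 1) * stride - 2 * padding + kernel

size = 7
size = conv_transpose_out(size, kernel=4, stride=2, padding=1)  # first Conv2D^T
size = conv_transpose_out(size, kernel=4, stride=2, padding=1)  # second Conv2D^T
print(size)

# For comparison: with padding=0 the first layer would overshoot.
print(conv_transpose_out(7, kernel=4, stride=2, padding=0))
```

With these settings each layer doubles the height and width, so the final output matches the MNIST image size.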

In [None]:
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim=NOISE_DIM):
        super(Generator, self).__init__()
        # TODO: implement architecture
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        self.fc1 = nn.Linear(noise_dim, 1024)
        self.bn1 = nn.BatchNorm1d(1024)
        
        self.fc2 = nn.Linear(1024, 7 * 7 * 128)
        self.bn2 = nn.BatchNorm1d(7 * 7 * 128)
        
        self.conv_t1 = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        
        self.conv_t2 = nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1)
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    def forward(self, z):
        # TODO: implement forward function
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        x = self.fc1(z)
        x = nn.functional.relu(x)
        x = self.bn1(x)
        
        x = self.fc2(x)
        x = nn.functional.relu(x)
        x = self.bn2(x)
        
        x = x.view(-1, 128, 7, 7)
        
        x = self.conv_t1(x)
        x = nn.functional.relu(x)
        x = self.bn3(x)
        
        x = self.conv_t2(x)
        x = torch.tanh(x)
        
        return x
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
test_generator(6580801)

We have to recreate our network since we've changed our functions.

### Train and evaluate a DCGAN
This is the one part of A3 that significantly benefits from using a GPU. It takes 3 minutes on a GPU for the requested five epochs. Or about 50 minutes on a dual core laptop on CPU (feel free to use 3 epochs if you do it on CPU).

In [None]:
# Make the discriminator
D = Discriminator()

# Make the generator
G = Generator()

# Use the function you wrote earlier to get optimizers for the Discriminator and the Generator
D_solver, G_solver = get_solvers(D, G)

# Run it!
run_a_gan(D, G, D_solver, G_solver, ls_discriminator_loss, ls_generator_loss, num_epochs=5)

### Inception score

In [None]:
# ----- Hyperparameters -----
batch_size = 128
num_classes = 10
epochs = 20
# ----- Data Preparation -----
# Load MNIST with torchvision, reusing the `transform` defined in the Dataset section.
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

print(f'{len(train_dataset)} train samples')
print(f'{len(test_dataset)} test samples')

# ----- Model Definition -----
class MLPClassifier(nn.Module):
    def __init__(self):
        super(MLPClassifier, self).__init__()
        self.fc1 = nn.Linear(784, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = torch.flatten(x, start_dim=1)
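        x = F.relu(self.fc1(x))
        # Pass self.training so dropout is disabled after model.eval().
        x = F.dropout(x, 0.2, training=self.training)
        x = F.relu(self.fc2(x))
        x = F.dropout(x, 0.2, training=self.training)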
        x = self.fc3(x)
        return x

    def prob(self, x):
        x = self.forward(x)
        prob = F.softmax(x, dim=-1)
        return prob

model = MLPClassifier().to(device)
optimizer = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9)
criterion = nn.CrossEntropyLoss()

# ----- Training -----
for epoch in range(epochs):
    model.train()
    for batch_x, batch_y in train_loader:
        batch_x = batch_x.to(device)
        batch_y = batch_y.to(device)
        optimizer.zero_grad()
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch+1}/{epochs}, Loss: {loss.item():.4f}')

# ----- Evaluation -----
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for batch_x, batch_y in test_loader:
        batch_x = batch_x.to(device)
        batch_y = batch_y.to(device)
        outputs = model(batch_x)
        predicted = torch.argmax(outputs, dim=1)
        total += batch_y.size(0)
        correct += (predicted == batch_y).sum().item()

print('Test accuracy:', correct / total)

### Verify the trained classifier on the generated samples
Generate samples and visually inspect if the predicted labels on the samples match the actual digits in generated images.

In [None]:
with torch.no_grad():
    z = sample_noise(NUM_SAMPLES, NOISE_DIM).to(device)
    G_sample = G(z)
    G_sample = deprocess_img(G_sample)
show_images(G_sample[:20].cpu())
plt.show()

In [None]:
with torch.no_grad():
    G_sample = G_sample.reshape(NUM_SAMPLES, 784)
    print(np.argmax(model(G_sample[:20].to(device)).cpu().numpy(), axis=-1))

### Implement the inception score
Implement Equation 1 of reference [3]. Replace the expectation in the equation with the empirical average over `num_samples` samples, and don't forget the exponentiation at the end. You should get an Inception score of at least 8.5.
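
As a quick sanity check on the quantity being computed, here is a minimal sketch on toy inputs. It assumes the score is the exponential of the mean KL divergence between each per-sample distribution $p(y|x)$ and the marginal $p(y)$, as in the solution cell of this notebook. The helper name `inception_score_from_probs`, the `eps` value, and the two toy arrays are made up for illustration and are not part of the assignment code.

```python
import numpy as np

def inception_score_from_probs(probs, eps=1e-12):
    """probs: (N, C) array; each row is a predicted class distribution p(y|x)."""
    probs = np.asarray(probs, dtype=np.float64)
    p_y = probs.mean(axis=0, keepdims=True)  # marginal p(y) over all samples
    # KL(p(y|x) || p(y)) per sample; eps keeps both logs finite
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Toy case 1: confident one-hot predictions spread evenly over 10 classes
confident = np.eye(10)[np.arange(1000) % 10]
# Toy case 2: every sample gets the same uniform prediction
uniform = np.full((1000, 10), 0.1)

print(round(inception_score_from_probs(confident), 4))
print(round(inception_score_from_probs(uniform), 4))
```

The two toy cases bracket the possible range for 10 classes: confident, evenly spread predictions give a score close to 10, while identical predictions for every sample give a score of 1.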

In [None]:
with torch.no_grad():
    # TODO: implement here
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
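    # p(y|x): per-sample class probabilities from the trained classifier
    probs = F.softmax(model(G_sample.to(device)), dim=1).cpu().numpy()

    # p(y): marginal class distribution over all generated samples
    p_y = np.mean(probs, axis=0)

    # KL(p(y|x) || p(y)) per sample; eps guards both logs against log(0)
    eps = 1e-10
    kl_divs = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)

    # Inception score: exponential of the mean KL divergence
    inception_score = np.exp(np.mean(kl_divs))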
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
print(f'Inception score: {inception_score:.4f}')

### Plot the histogram of predicted labels
Let's additionally inspect the class diversity of the generated samples.

In [None]:
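with torch.no_grad():
    # Predicted digit for each generated sample, as a NumPy array for plotting
    pred_labels = model(G_sample.to(device)).argmax(dim=-1).cpu().numpy()
plt.hist(pred_labels, bins=np.arange(11)-0.5, rwidth=0.8, density=True)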
plt.xticks(range(10))
plt.show()

## INLINE QUESTION 1

We will look at an example to see why alternating minimization of the same objective (like in a GAN) can be tricky business.

Consider $f(x,y)=xy$. What does $\min_x\max_y f(x,y)$ evaluate to? (Hint: minmax tries to minimize the maximum value achievable.)

Now try to evaluate this function numerically for 6 steps, starting at the point $(1,1)$,
by using alternating gradient updates (first updating $y$, then updating $x$ using the updated $y$) with step size $1$. **Here the step size is the learning rate, and each step is `learning_rate * gradient`.**
You'll find that writing out the update step in terms of $x_t,y_t,x_{t+1},y_{t+1}$ will be useful.

Briefly explain what $\min_x\max_y f(x,y)$ evaluates to and record the six pairs of explicit values for $(x_t,y_t)$ in the table below.

### Your answer:
 
For $f(x,y) = xy$ and any fixed $x \neq 0$, the inner problem $\max_y xy$ is unbounded: it grows to $+\infty$ as $y \to \pm\infty$ with the same sign as $x$. For $x = 0$, the product is $0$ for every $y$. Hence $\min_x\max_y xy = 0$, attained at $x = 0$.

Alternating gradient updates (ascent in $y$, then descent in $x$) with step size $1$ use $\nabla_y f(x,y) = x$ and $\nabla_x f(x,y) = y$:

$$y_{t+1} = y_t + \text{step size} \cdot \nabla_y f(x_t, y_t) = y_t + x_t$$

$$x_{t+1} = x_t - \text{step size} \cdot \nabla_x f(x_t, y_{t+1}) = x_t - y_{t+1} = x_t - (y_t + x_t) = -y_t$$

 $t$   |  0  |  1  |  2  |  3  |  4  |  5  |  6
 ----- | --- | --- | --- | --- | --- | --- | ---
 $x_t$ |  1  | -1  | -2  | -1  |  1  |  2  |  1
 $y_t$ |  1  |  2  |  1  | -1  | -2  | -1  |  1

The iterates return to the starting point $(1,1)$ at $t = 6$.
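
The recurrence $y_{t+1} = y_t + x_t$, $x_{t+1} = x_t - y_{t+1}$ can be checked numerically with a few lines of Python. This is a minimal sketch; the function name `alternating_gradient_xy` and its parameters are illustrative choices, not part of the assignment code.

```python
def alternating_gradient_xy(x0=1.0, y0=1.0, lr=1.0, steps=6):
    """Alternating gradient ascent in y / descent in x on f(x, y) = x * y."""
    x, y = x0, y0
    trajectory = [(x, y)]
    for _ in range(steps):
        y = y + lr * x  # ascent step in y: df/dy = x
        x = x - lr * y  # descent step in x, using the updated y: df/dx = y
        trajectory.append((x, y))
    return trajectory

trajectory = alternating_gradient_xy()
for t, (x, y) in enumerate(trajectory):
    print(f"t={t}: x={x:+.0f}, y={y:+.0f}")
```

Starting from $(1,1)$ with step size $1$, the trajectory is back at $(1,1)$ after six steps, so the iterates cycle instead of approaching $x = 0$.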
   


## INLINE QUESTION 2
Using this method, will we ever reach the optimal value? Why or why not?

### Your answer:
No, this method never reaches the optimal value. Applying the update rule from Question 1, the iterates $(x_t, y_t)$ cycle with period 6: $(1,1)$, $(-1,2)$, $(-2,1)$, $(-1,-1)$, $(1,-2)$, $(2,-1)$, and then back to $(1,1)$. The algorithm keeps orbiting the saddle point $(0,0)$, where the optimal value $0$ is attained, instead of converging to it. The same kind of oscillation is one source of instability in GAN training.
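
One way to see why the iterates cycle rather than converge is to write the step-size-1 update from Question 1 ($x_{t+1} = -y_t$, $y_{t+1} = y_t + x_t$) as a linear map. The matrix name $M$ is introduced here only for this short derivation.

$$
\begin{pmatrix} x_{t+1} \\ y_{t+1} \end{pmatrix}
= M \begin{pmatrix} x_t \\ y_t \end{pmatrix},
\qquad
M = \begin{pmatrix} 0 & -1 \\ 1 & 1 \end{pmatrix},
\qquad
\det(M - \lambda I) = \lambda^2 - \lambda + 1 = 0
\;\Rightarrow\;
\lambda = \tfrac{1 \pm i\sqrt{3}}{2} = e^{\pm i\pi/3}.
$$

Both eigenvalues have modulus $1$ and satisfy $\lambda^6 = 1$. Since they are distinct, $M$ is diagonalizable and $M^6 = I$, so every starting point other than $(0,0)$ returns to itself after six steps and never moves closer to the saddle point.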


## INLINE QUESTION 3
If the generator loss decreases during training while the discriminator loss stays at a constant high value from the start, is this a good sign? Why or why not? A qualitative answer is sufficient.

### Your answer:
This is not a good sign. If the generator loss decreases while the discriminator loss stays high from the very beginning, the discriminator is not learning. In healthy GAN training, the generator and discriminator compete and stay roughly balanced. A discriminator that fails to learn cannot give the generator useful feedback, so the generator's loss can fall even though its samples remain low quality. This can lead to mode collapse, or to a generator that only exploits the discriminator's weaknesses.
