# Course 03 - Apply Generative Adversarial Networks (GANs)

## Syllabus

* Week 1: GANs for Data Augmentation and Privacy
* Week 2: Image-to-Image Translation with Pix2Pix
* Week 3: Unpaired Translation with CycleGAN

## Week 1: GANs for Data Augmentation and Privacy

### Overview of GAN Applications

One cool application of GANs is translating from one style to another. Here you can sketch something very roughly on the left side and expect the GAN to generate a realistic photo from it, where the different colors in your sketch represent the different classes that you would like it to draw. This uses a GAN called GauGAN.

Another application of image-to-image translation is the Super-Resolution GAN, which takes an original image of very low resolution and produces a much higher-resolution image from it.

<table><tr>
<td> <img src="images/gan_Over_1.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_Over_2.PNG" style="width: 500px;"/> </td>
</tr></table>

There is multimodal image-to-image translation as well.

Beyond image-to-image translation there's also text to image translation.

<table><tr>
<td> <img src="images/gan_Over_3.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_Over_4.PNG" style="width: 500px;"/> </td>
</tr></table>

You can also map from multiple inputs to an output, meaning you're conditioning on all of these inputs.

GANs can also be used for image filters such as those on Snapchat.

Another application area is image editing. Say you have an image of a person and a mask of that person. You can edit that mask slightly, for example around the hair, and the GAN will take the edited mask and edit the image to match.

<table><tr>
<td> <img src="images/gan_Over_5.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_Over_6.PNG" style="width: 500px;"/> </td>
</tr></table>


GANs can be used to stylize various images, making it easier for people to create beautiful pieces of art, or at least to select the GAN outputs they want.

GANs can also be used for data augmentation, meaning the generated data is used to supplement real data for a downstream task such as classification, detection, or segmentation. You can increase the size and diversity of your real data set by adding generated samples.

<img src="images/gan_Over_7.PNG" style="width: 500px;"/>

GANs have also been used in medical areas, for example to simulate tissues with a state-of-the-art GAN.

GANs can also be used in various types of media, and this has both positive and negative implications. Deepfakes, for instance, generally carry a negative connotation because they are often about using someone's identity without permission.

### Data Augmentation: Methods and Uses

Data augmentation is typically used to supplement data when real data is too expensive to acquire or too rare, so that you don't have enough of it. GANs are well suited for this task because you can generate fake data to supplement the real data and then use both for a downstream task such as a classifier, a detector, a segmentation model, or any other type of discriminative model.

One common way of doing data augmentation does not use GANs at all. You take an input image, apply all sorts of augmentations to it, and feed the resulting images to the classifier as real data as well. The augmentation could be a horizontal flip, a rotation, a zoom and crop, or a filter applied to the image.
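The classic (non-GAN) augmentations mentioned above can be sketched in a few lines. This is only an illustration on a tiny image stored as nested Python lists; the function names are made up for this example, and in practice you would use an image library rather than hand-written helpers.

```python
def horizontal_flip(image):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def center_crop(image, size):
    """Keep the central size x size window (a crude 'zoom and crop')."""
    top = (len(image) - size) // 2
    left = (len(image[0]) - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


# A toy 4 x 4 grayscale "image"
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

# Each augmented copy keeps the label of the original image
augmented = [
    image,
    horizontal_flip(image),
    rotate_90(image),
    center_crop(image, 2),
]
```

Every entry of `augmented` would be fed to the classifier with the same label as the original image.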

A GAN can help with data augmentation by generating a large number of different images. There is also a body of literature that explores how best to combine images from different data augmentation techniques. It tries to find a policy that determines which image to give to your classifier at which point. Explore RandAugment for more information (RandAugment: Practical automated data augmentation with a reduced search space (Cubuk, Zoph, Shlens, and Le, 2019): https://arxiv.org/abs/1909.13719).
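As a rough sketch of what such a policy can look like, RandAugment reduces the search to two numbers: how many operations to apply per image and one shared magnitude for all of them. The operation names below are placeholders for this example, not the operation list from the paper, and the magnitude value is arbitrary.

```python
import random


def rand_augment_policy(op_names, num_ops, magnitude, rng):
    """Pick num_ops operations uniformly at random, all sharing one magnitude."""
    return [(rng.choice(op_names), magnitude) for _ in range(num_ops)]


# Placeholder operation names (assumed for illustration)
OPS = ["flip", "rotate", "zoom_crop", "filter"]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
policy = rand_augment_policy(OPS, num_ops=2, magnitude=9, rng=rng)
# policy is a list of (operation_name, magnitude) pairs to apply in order
```

A new policy would be sampled for every training image, so the only things left to tune are `num_ops` and `magnitude`.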

When it is difficult to gather more data, you could instead use a GAN to generate it, for example fake spectrograms. The data you're working with might also be very hard to obtain because of patient privacy, such as brain scans or mammograms of tumors. In that setting you can use a GAN (a DCGAN, for instance) to generate synthetic liver lesions. In other cases, it might be unethical to acquire more data samples.

### Data Augmentation: Pros & Cons

Pros:
* They are often better than handcrafted synthetic examples. If you're already going to use data augmentation, GAN-generated data that mimics the real examples closely has been shown to be more helpful than handcrafted synthetic examples.
* You can generate more labeled examples. If your training data set is imbalanced or doesn't have many examples of a certain class, you can use a conditional GAN to generate significantly more labeled examples for those classes.
* It can improve your downstream model's generalization. For example, your GAN can generate more data that mimics what an expert is doing, so that your downstream model learns better segmentation.
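To make the second point concrete, here is a minimal sketch that works out how many labeled samples a conditional GAN would need to generate per class to balance a data set. The class names and counts are invented for illustration, and balancing up to the largest class is just one possible target.

```python
from collections import Counter


def synthetic_samples_needed(labels):
    """Number of generated samples per class needed to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}


# Hypothetical imbalanced data set: 90 benign scans, 10 malignant scans
labels = ["benign"] * 90 + ["malignant"] * 10
needed = synthetic_samples_needed(labels)
# The conditional GAN would be asked for needed[cls] samples of each class
```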

Cons:
* It won't be able to cover the entire diversity of what you need if your training data set is limited. The diversity of your generated outputs still relies heavily on the diversity of your training data set.
* If your GAN starts to memorize or mimic the real data so closely that the fakes look almost identical to the reals, in other words it overfits to the real data, then supplementing your task with this fake data might not be helpful.

### GANs for Privacy

Medical privacy is important because you want to protect real patient data. Using real patient data in your models can harm patients if someone reverse-engineers your model and figures out who those people are, or if the data is somehow released. Ensuring that medical data stays private can also encourage data sharing between institutions, because sharing would then not breach any personal health information (PHI). Finally, simulated medical data from a GAN is less expensive to acquire and more abundant than real data, because you can keep generating it indefinitely.

In this way you don't have to expose the real training data set to the downstream model, or to whoever uses that model later on: you can train the model using only GAN-generated outputs. Note the difference from data augmentation, where you use both real and generated data. For privacy preservation you only want to use the GAN-generated data.

But how well does a model do when it is trained without any real data? When you compare a model trained on just GAN-generated data with one trained on just real data, the accuracy of the former approaches that of the latter. It's not the same as training on real data, but it gets pretty close, and that might be enough for some applications to warrant using GAN-generated data for the sake of privacy.

There are caveats to this approach. It is very possible that your GAN will generate samples that look nearly identical to your reals. That's really bad, because you are then no longer preserving the privacy of, say, this person. On the other hand, while the GAN can very likely generate a lot of the samples in your real data set, it will also generate a ton of other samples, such that probabilistically no one will know which ones are real and which ones are fake.
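One crude way to look for the first caveat is to measure how close each generated sample is to its nearest real sample. The sketch below does this with an L1 distance on flattened toy images; the function names and the threshold are arbitrary assumptions for this example, and passing such a check does not by itself show that privacy is preserved.

```python
def l1_distance(a, b):
    """Sum of absolute pixel differences between two flattened images."""
    return sum(abs(p - q) for p, q in zip(a, b))


def nearest_real_distance(fake, reals):
    """Distance from one generated sample to its closest real sample."""
    return min(l1_distance(fake, real) for real in reals)


def flag_near_copies(fakes, reals, threshold):
    """True for every generated sample that is suspiciously close to a real one."""
    return [nearest_real_distance(fake, reals) < threshold for fake in fakes]


# Toy flattened images with pixel values in [0, 1]
reals = [[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
fakes = [[0.0, 0.0, 1.0, 0.99], [0.5, 0.5, 0.5, 0.5]]

flags = flag_near_copies(fakes, reals, threshold=0.1)
# The first fake is almost a copy of the first real image; the second is not
```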

### GANs for Anonymity

GANs have the power to enable healthy expression for various stigmatized groups, helping people remain anonymous while still expressing themselves through a realistic-looking face. However, GAN anonymization can be used for both good and evil. Identity theft with deepfakes is certainly not good.

### Automated Data Augmentation

RandAugment: Practical automated data augmentation with a reduced search space (Cubuk, Zoph, Shlens, and Le, 2019): https://arxiv.org/abs/1909.13719

### Generative Teaching Networks

Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data (Such et al. 2019): https://arxiv.org/abs/1912.07768

Essentially, a GTN is composed of a generator (i.e. teacher), which produces synthetic data, and a student, which is trained on this data for some task. The key difference between GTNs and GANs is that GTN models work cooperatively (as opposed to adversarially).

### Talking heads

Few-Shot Adversarial Learning of Realistic Neural Talking Head Models (Zakharov, Shysheya, Burkov, and Lempitsky, 2019): https://arxiv.org/abs/1905.08233

### De-identification

De-identification without losing faces (Li and Lyu, 2019): https://arxiv.org/abs/1902.04202

### GAN Fingerprints

Attributing Fake Images to GANs: Learning and Analyzing GAN Fingerprints (Yu, Davis, and Fritz, 2019): https://arxiv.org/abs/1811.08180

### Works cited

* Semantic Image Synthesis with Spatially-Adaptive Normalization (Park, Liu, Wang, and Zhu, 2019): https://arxiv.org/abs/1903.07291
* Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network (Ledig et al., 2017): https://arxiv.org/abs/1609.04802
* Multimodal Unsupervised Image-to-Image Translation (Huang et al., 2018): https://github.com/NVlabs/MUNIT
* StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks (Zhang et al., 2017): https://arxiv.org/abs/1612.03242
* Few-Shot Adversarial Learning of Realistic Neural Talking Head Models (Zakharov, Shysheya, Burkov, and Lempitsky, 2019): https://arxiv.org/abs/1905.08233
* MaskGAN: Towards Diverse and Interactive Facial Image Manipulation (Lee, Liu, Wu, and Luo, 2020): https://arxiv.org/abs/1907.11922
* When AI generated paintings dance to music... (2019): https://www.youtube.com/watch?v=85l961MmY8Y
* Data Augmentation Generative Adversarial Networks (Antoniou, Storkey, and Edwards, 2018): https://arxiv.org/abs/1711.04340

* Establishing an evaluation metric to quantify climate change image realism (Sharon Zhou, Luccioni, Cosne, Bernstein, and Bengio, 2020): https://iopscience.iop.org/article/10.1088/2632-2153/ab7657/meta
* Deepfake example (2019): https://en.wikipedia.org/wiki/File:Deepfake_example.gif
* Introduction to adversarial robustness (Kolter and Madry): https://adversarial-ml-tutorial.org/introduction/
* Large Scale GAN Training for High Fidelity Natural Image Synthesis (Brock, Donahue, and Simonyan, 2019): https://openreview.net/pdf?id=B1xsqj09Fm
* GazeGAN - Unpaired Adversarial Image Generation for Gaze Estimation (Sela, Xu, He, Navalpakkam, and Lagun, 2017): https://arxiv.org/abs/1711.09767
* Data Augmentation using GANs for Speech Emotion Recognition (Chatziagapi et al., 2019): https://pdfs.semanticscholar.org/395b/ea6f025e599db710893acb6321e2a1898a1f.pdf
* GAN-based Synthetic Medical Image Augmentation for increased CNN Performance in Liver Lesion Classification (Frid-Adar et al., 2018): https://arxiv.org/abs/1803.01229
* GANsfer Learning: Combining labelled and unlabelled data for GAN based data augmentation (Bowles, Gunn, Hammers, and Rueckert, 2018): https://arxiv.org/abs/1811.10669
* Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks (Sandfort, Yan, Pickhardt, and Summers, 2019): https://www.nature.com/articles/s41598-019-52737-x/figures/3

* De-identification without losing faces (Li and Lyu, 2019): https://arxiv.org/abs/1902.04202
* Privacy-Preserving Generative Deep Neural Networks Support Clinical Data Sharing (Beaulieu-Jones et al., 2019): https://www.ahajournals.org/doi/epub/10.1161/CIRCOUTCOMES.118.005122
* DeepPrivacy: A Generative Adversarial Network for Face Anonymization (Hukkelås, Mester, and Lindseth, 2019): https://arxiv.org/abs/1909.04538
* GAIN: Missing Data Imputation using Generative Adversarial Nets (Yoon, Jordon, and van der Schaar, 2018): https://arxiv.org/abs/1806.02920
* Conditional Infilling GANs for Data Augmentation in Mammogram Classification (E. Wu, K.  Wu, Cox, and Lotter, 2018): https://link.springer.com/chapter/10.1007/978-3-030-00946-5_11
* The Effectiveness of Data Augmentation in Image Classification using Deep Learning (Perez and Wang, 2017): https://arxiv.org/abs/1712.04621
* CIFAR-10 and CIFAR-100 Dataset; Learning Multiple Layers of Features from Tiny Images (Krizhevsky, 2009): https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

## Week 2: Image-to-Image Translation with Pix2Pix

Image-to-image translation is a framework of conditional generation that transforms images into different styles. It takes in an image and transforms it into a different image of a different style while maintaining the content.

* A black and white image getting transformed into a colored image. This is a type of conditional generation, but it conditions on the content of an image.
* Going from a segmentation map to a realistic photo, which draws trees where the map is labeled trees, road where it's labeled road, and car where it's labeled car.
* Video to video, where the frames of one video map onto the frames of another video. It is essentially image-to-image translation, but for many images, many frames. You can take an old black and white film of a train and turn it into 4K footage with realistic color as well.

In your training dataset, every input example has a corresponding output image, or target image, that contains the contents of that input image in a different style. The mapping is one-to-one. You condition on the input to get the output image. The paired output image is not necessarily the ground truth, but a ground truth: one possibility among several.

<table><tr>
<td> <img src="images/gan_im2im_1.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_im2im_2.PNG" style="width: 500px;"/> </td>
</tr></table>

Instead of conditioning on a single input image, you can also take as input an image of a model wearing certain clothes together with a point-wise map of their target pose, where the dots represent the different poses. From these two pieces of information, you want your GAN to generate that person in a different pose.

You can also go from text to image.

<img src="images/gan_im2im_3.PNG" style="width: 500px;"/>


### Pix2Pix Overview

Pix2Pix is a very successful use of a conditional GAN to perform paired image-to-image translation, where you condition on the input image and have a direct output pair. Instead of a class vector, Pix2Pix passes in an entire image as the input.

Typically, the noise vector is there so that the generator can produce many different outputs. The Pix2Pix authors found, however, that the noise vector didn't make much of a difference to what the generated output looked like. This is likely because there is a paired output image that the generator is trying to reach. Instead of noise, they found that they could add some stochasticity to the network using dropout. Dropout randomly drops out nodes in certain layers of your neural network as it trains, so that different nodes can learn different things. As a result, it adds some randomness to your outputs during training, though not as drastically as feeding in different noise vectors.

<table><tr>
<td> <img src="images/gan_pix2pix_1.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_pix2pix_2.PNG" style="width: 500px;"/> </td>
</tr></table>


The discriminator also gets the real input image. Concatenated with it is either the real target output or the generated output, and the discriminator has to decide whether that output is real or fake. Because it sees the entire segmentation mask, the question it answers is: "Does this image actually look like a realistic mapping of that segmentation mask?"
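The concatenation happens along the channel dimension, so a 3-channel conditioning image and a 3-channel candidate output become one 6-channel input. A minimal sketch with images stored as `[channel][row][col]` nested lists follows; the helper names and the tiny 4 x 4 size are assumptions for this example only.

```python
def blank_image(channels, height, width, value=0.0):
    """Build a constant image stored as [channel][row][col]."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def concat_channels(condition, candidate):
    """Stack two images with equal height and width along the channel axis."""
    assert len(condition[0]) == len(candidate[0]), "heights must match"
    assert len(condition[0][0]) == len(candidate[0][0]), "widths must match"
    return condition + candidate


segmentation_mask = blank_image(3, 4, 4)          # the real input (condition)
real_target = blank_image(3, 4, 4, value=1.0)     # real output or generated output

disc_input = concat_channels(segmentation_mask, real_target)
shape = (len(disc_input), len(disc_input[0]), len(disc_input[0][0]))
# shape is (channels, height, width) of what the discriminator sees
```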

Both the generator and the discriminator get an upgrade. The generator becomes what's called a U-Net, an architecture typically used for segmentation. It consists of an encoder that encodes an image, followed by a decoder, with skip connections between the two. The discriminator becomes a PatchGAN. Instead of saying whether an entire image is real or fake, it says whether different patches of that image are real or fake, and outputs a matrix of these real/fake determinations. This means the discriminator gives a lot more feedback to the generator.

<img src="images/gan_pix2pix_3.PNG" style="width: 500px;"/>


### Pix2Pix: PatchGAN

The PatchGAN discriminator outputs a matrix of classifications instead of a single real-or-fake value. Each value in this matrix is still between 0 and 1, where 0 corresponds to a fake classification and 1 to a real classification. By sliding its field of view across all the patches in the input image, the PatchGAN gives feedback on each region, or patch, of the image. Because it outputs the probability of each patch being real, it can still be trained with BCE loss. For a fake image from the generator, this means the PatchGAN should try to output a matrix of all zeros. By the same logic, for a real image from your data set it should try to output a matrix of all ones.
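A small sketch of BCE over a patch matrix is shown below. The 2 x 2 prediction matrix is made up, and averaging over patches is an assumption about the reduction (it mirrors the default of common deep learning libraries).

```python
import math


def bce(prediction, label, eps=1e-7):
    """Binary cross-entropy for one patch probability."""
    p = min(max(prediction, eps), 1 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def patch_bce_loss(patch_predictions, label):
    """Average BCE over every patch, comparing against an all-`label` matrix."""
    values = [bce(p, label) for row in patch_predictions for p in row]
    return sum(values) / len(values)


# Hypothetical PatchGAN output: one probability of "real" per patch
patch_predictions = [[0.9, 0.8], [0.2, 0.6]]

loss_if_real = patch_bce_loss(patch_predictions, label=1.0)  # target: all ones
loss_if_fake = patch_bce_loss(patch_predictions, label=0.0)  # target: all zeros
```

Patches whose prediction is far from the target label dominate the loss, which is how the generator receives feedback about specific regions of the image.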

<img src="images/gan_PatchGAN_1.PNG" style="width: 500px;"/>


### Pix2Pix: U-Net

Pix2Pix uses a U-Net for its generator. A U-Net is an encoder-decoder framework with skip connections that concatenate feature maps of the same resolution (the same block, or level) from the encoder to the decoder. This helps the decoder learn details from the encoder directly, in case finer details are lost during the encoding stage. The skip connections also help in the backward pass, allowing more gradient to flow from the decoder back to the encoder.

<img src="images/gan_UNet_1.PNG" style="width: 500px;"/>

Segmentation means taking a real image and producing a segmentation mask, that is, a label on every single pixel of that image saying what object it belongs to. It can be seen as an image-to-image translation task in which there is a correct answer for which class each pixel belongs to.

U-Net is good at taking in an input image and mapping it to an output image. It is typically used just for image segmentation, but Pix2Pix uses it for a generation task as well. The U-Net generator takes in an entire image, and the architecture of the Pix2Pix generator is this encoder-decoder structure.

<table><tr>
<td> <img src="images/gan_UNet_2.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_UNet_3.PNG" style="width: 500px;"/> </td>
</tr></table>

All the important information in the input image x is compressed into the "bottleneck", so that you get high-level features that are then decoded into an output y, another image. This might remind you of an auto-encoder, except that with an auto-encoder you want y to be as close as possible to x, and here you don't. You want y to be a different style, conditioned on x.

However, since it's easy to overfit these networks to your training image pairs, U-Net also introduces skip connections from the encoder to the decoder. Every block in the encoding stage that has the same resolution as a block in the decoding stage gets an extra connection, and its output is concatenated onto the decoder's features before they go into the corresponding convolutional block in the decoder. This way, information that might have been compressed too much can still trickle through to the later layers.

Skip connections simply allow information to flow from earlier layers to later layers. In the forward pass, they make it easy to pass the decoder the finer-grained details that the encoder may have lost in the process of downsampling. In the backward pass, skip connections improve gradient flow. They were introduced to help with the vanishing gradient problem that appears when you stack too many layers together: the gradient gets so tiny as it is multiplied in backprop that it limits networks from going deeper and having more layers.



The input goes through eight encoder blocks that compress it. Each block downsamples the spatial size by a factor of two, so a 256 x 256 input ends up as 1 x 1 (height x width) at the very end. Each of these encoder blocks contains a convolutional layer, a BatchNorm layer, and a LeakyReLU activation. The convolutions make their input smaller by halving the height and width with a stride of 2.
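The arithmetic behind the eight blocks is easy to check: halving 256 eight times gives 1. A tiny sketch, where the function name is made up and only the spatial size is tracked, not the channel count:

```python
def encoder_sizes(input_size=256, num_blocks=8):
    """Spatial size (height = width) before and after each stride-2 encoder block."""
    sizes = [input_size]
    for _ in range(num_blocks):
        sizes.append(sizes[-1] // 2)  # a stride-2 convolution halves height and width
    return sizes


sizes = encoder_sizes()
# sizes lists the input size followed by the output size of each encoder block
```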

<table><tr>
<td> <img src="images/gan_UNet_4.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_UNet_5.PNG" style="width: 500px;"/> </td>
</tr></table>

On the decoder side you start from this 1 x 1 input and again have eight blocks, this time decoder blocks. The output y is your generated image, which is the same size as your input: 256 x 256 with 3 color channels. Each decoder block is composed of a transposed convolution, followed by a BatchNorm and then a ReLU activation function.

Dropout is added to this network, but only to the first three blocks of the decoder. Dropout randomly disables different neurons at each iteration of training, so that different neurons can learn. It is only present during training and, as with all uses of dropout, it is typically turned off at inference or test time. Dropout does add some noise to the model. Remember that the noise input has been taken away, so this is where stochasticity seeps into the architecture, but only during training.
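The training-only behavior of dropout can be sketched as follows. This uses the "inverted dropout" scaling that common frameworks apply, which is an assumption here rather than something stated in these notes, and the rate of 0.5 is just an example value.

```python
import random


def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate` during training only."""
    if not training or rate == 0.0:
        return list(activations)  # inference: dropout is a no-op
    keep = 1.0 - rate
    # Surviving activations are scaled up so the expected value stays the same
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
activations = [1.0, 1.0, 1.0, 1.0]

train_out = dropout(activations, rate=0.5, training=True, rng=rng)   # random zeros
eval_out = dropout(activations, rate=0.5, training=False, rng=rng)   # unchanged
```

Calling the training version twice on the same activations generally gives different results, which is the source of randomness that replaces the noise vector.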

You can think of the decoder as performing the inverse operations of the encoder, which is why they contain the same number of blocks: eight each.
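Because the decoder mirrors the encoder, each decoder block can be matched with the encoder block that produced features at the same resolution. The sketch below lists one such wiring for eight blocks and a 256 x 256 input; the function name is hypothetical, and the exact wiring in a given implementation may differ.

```python
def unet_skip_plan(input_size=256, num_blocks=8):
    """For each decoder block: (block number, input size, encoder block to concatenate)."""
    # Spatial size of each encoder block's output: 128, 64, ..., 1
    encoder_out = [input_size // 2 ** (i + 1) for i in range(num_blocks)]
    plan = []
    size = encoder_out[-1]  # the decoder starts at the 1 x 1 bottleneck
    for d in range(1, num_blocks + 1):
        # Encoder block whose output has the same resolution as this decoder input
        match = encoder_out.index(size) + 1
        # The bottleneck feeds the first decoder block directly: nothing to concatenate
        skip_from = None if d == 1 else match
        plan.append((d, size, skip_from))
        size *= 2  # each transposed convolution doubles height and width
    return plan


plan = unet_skip_plan()
```

Under these assumptions, the second decoder block receives 2 x 2 features concatenated with the output of encoder block 7, and the last decoder block receives 128 x 128 features concatenated with the output of encoder block 1.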

<table><tr>
<td> <img src="images/gan_UNet_6.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_UNet_7.PNG" style="width: 500px;"/> </td>
</tr></table>


### Pix2Pix: Pixel Distance Loss Term

Pix2Pix adds an L1 regularization term to the generator's loss function that takes the pixel difference between the real target output and the fake one. This encourages the generator to make images similar to the real target of the image-to-image translation. This extra layer of supervision definitely helps with this style transfer task.

<img src="images/gan_dloss_1.PNG" style="width: 500px;"/>

This is an additional loss for the Pix2Pix generator in particular. As with other additional loss terms, such as L1 regularization or a gradient penalty, you weight it with a lambda coefficient so that it doesn't overwhelm the rest of your loss.

An adversarial loss is just another way to talk about the GAN loss. For Pix2Pix in particular, if you want your output to look good, you can add a pixel loss term that gives the generator a little more information about the real target image so that it can try to match it more closely. The pixel distance loss term looks at the generated output from the generator and the real target output and takes the pixel difference between the two, encouraging the generated output to be as close as possible to the real output. If this pixel distance is really small, the two images are almost exactly the same.

This term is also an added layer of supervision, in which the generator now sees the real image. It implicitly sees the paired output, even though only in a very soft way, and this mainly makes the samples look much nicer.

Taken together, the total Pix2Pix generator loss is composed of the BCE loss, which is the adversarial loss, plus this pixel distance loss weighted by lambda.
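A minimal sketch of that sum follows, using flattened lists in place of image tensors. The helper names are invented for this example, mean reduction is an assumption, and the weight of 100 is used only as an illustrative value for lambda.

```python
import math


def mean_bce(predictions, label, eps=1e-7):
    """Average binary cross-entropy of patch predictions against one label."""
    total = 0.0
    for p in predictions:
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(predictions)


def mean_l1(generated, target):
    """Average absolute pixel difference between generated and target images."""
    return sum(abs(g - t) for g, t in zip(generated, target)) / len(generated)


def generator_loss(patch_predictions, generated, target, lambda_pixel):
    """Adversarial BCE against all-ones labels plus weighted pixel distance."""
    adversarial = mean_bce(patch_predictions, label=1.0)
    pixel = mean_l1(generated, target)
    return adversarial + lambda_pixel * pixel


# Hypothetical discriminator output on the generated image, and toy pixel values
patch_predictions = [0.5, 0.5]
loss_match = generator_loss(patch_predictions, [0.2, 0.4], [0.2, 0.4], lambda_pixel=100)
loss_far = generator_loss(patch_predictions, [0.0, 0.0], [1.0, 1.0], lambda_pixel=100)
```

When the generated pixels match the target exactly, only the adversarial term remains; the further they drift, the more the weighted pixel term dominates.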

### Pix2Pix: Putting It All Together

It's time to put all the components of Pix2Pix together: the U-Net generator, the PatchGAN discriminator, and the pixel distance loss term. After training, you can adapt a segmentation mask or draw your own, and your GAN will generate a realistic image for you.

You pass your input into the U-Net generator, which generates an output. That generated image is concatenated along the channel dimension with the real input image that was used for conditioning. The result goes into the PatchGAN discriminator, which outputs a classification matrix of values between zero and one indicating how real or fake different parts of the image look. For the discriminator's loss, its output on the generated image concatenated with the real input is compared to the fake label matrix, a matrix of all zeros, because the discriminator succeeds if it classifies every single patch of that image as fake (zero). On the real target output concatenated with the real input, the discriminator's classification matrix is compared with the real label matrix, because it wants its predictions to be as close as possible to all ones.

For the generator's loss, the discriminator still looks at the generated output concatenated with the real input along the channel dimension. But instead of a matrix of all zeros, its output is compared to the real label matrix of all ones, because the generator wants the discriminator to think that every single patch of its generated image looks real. The generator also has the pixel distance loss term, which measures how different its generated output is from the real target output, multiplied by some lambda. Together, that's the full generator loss.

<table><tr>
<td> <img src="images/gan_tog_1.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_tog_2.PNG" style="width: 500px;"/> </td>
</tr></table>


### Pix2Pix Advancements

Pix2PixHD operates on much higher resolution images and includes a lot of modifications that make it significantly better. Basically you can have the segmentation mask of someone's face and you can adapt that mask however you want and be able to generate a different type of face.

GauGAN by NVIDIA lets you draw a sketch, as on the left here, and indicate which class each region is, and it generates a realistic photo for you. It takes in the segmentation map through spatially-adaptive normalization (SPADE), a technique related to adaptive instance normalization (AdaIN), the normalization used for injecting style information.

<img src="images/gan_adv_1.PNG" style="width: 500px;"/>


#### Interesting papers

Image-to-Image Translation with Conditional Adversarial Networks (Isola, Zhu, Zhou, and Efros, 2018): https://arxiv.org/abs/1611.07004

Pix2PixHD, which synthesizes high-resolution images from semantic label maps. Proposed in High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs (Wang et al. 2018) (https://arxiv.org/abs/1711.11585), Pix2PixHD improves upon Pix2Pix via multiscale architecture, improved adversarial loss, and instance maps.

Super-Resolution GAN (SRGAN), a GAN that enhances the resolution of images by 4x, proposed in Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network (Ledig et al. 2017) (https://arxiv.org/abs/1609.04802)

Patch-Based Image Inpainting with Generative Adversarial Networks (Demir and Unal, 2018): https://arxiv.org/abs/1803.07422

GauGAN, which synthesizes high-resolution images from semantic label maps, which you implement and train. GauGAN is based around a special denormalization technique proposed in Semantic Image Synthesis with Spatially-Adaptive Normalization (Park et al. 2019) (https://arxiv.org/abs/1903.07291)

#### From the videos:

* DeOldify... (Antic, 2019): https://twitter.com/citnaj/status/1124904251128406016
* pix2pixHD (Wang et al., 2018): https://github.com/NVIDIA/pix2pixHD
* [4k, 60 fps] Arrival of a Train at La Ciotat (The Lumière Brothers, 1896) (Shiryaev, 2020): https://youtu.be/3RYNThid23g
* Image-to-Image Translation with Conditional Adversarial Networks (Isola, Zhu, Zhou, and Efros, 2018): https://arxiv.org/abs/1611.07004
* Pose Guided Person Image Generation (Ma et al., 2018): https://arxiv.org/abs/1705.09368
* AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks (Xu et al., 2017): https://arxiv.org/abs/1711.10485
* Few-Shot Adversarial Learning of Realistic Neural Talking Head Models (Zakharov, Shysheya, Burkov, and Lempitsky, 2019): https://arxiv.org/abs/1905.08233
* Patch-Based Image Inpainting with Generative Adversarial Networks (Demir and Unal, 2018): https://arxiv.org/abs/1803.07422
* Image Segmentation Using DIGITS 5 (Heinrich, 2016): https://developer.nvidia.com/blog/image-segmentation-using-digits-5/
* Stroke of Genius: GauGAN Turns Doodles into Stunning, Photorealistic Landscapes (Salian, 2019): https://blogs.nvidia.com/blog/2019/03/18/gaugan-photorealistic-landscapes-nvidia-research/

#### From the notebooks:

* Crowdsourcing the creation of image segmentation algorithms for connectomics (Arganda-Carreras et al., 2015): https://www.frontiersin.org/articles/10.3389/fnana.2015.00142/full
* U-Net: Convolutional Networks for Biomedical Image Segmentation (Ronneberger, Fischer, and Brox, 2015): https://arxiv.org/abs/1505.04597

## Week 3: Unpaired Translation with CycleGAN

All you need is two piles of images of two different styles, and your GAN will figure out the mapping from there. Actually, it will be two GANs figuring out that mapping: one goes in one direction and the other goes in the other direction. The interaction between these two GANs forms a cycle, hence the name CycleGAN.

<img src="images/gan_intro_1.PNG" style="width: 500px;"/>


### Unpaired Image-to-Image Translation

Unpaired image-to-image translation uses piles of differently styled images instead of paired images. The model learns the mapping between the two piles by keeping the content that is present in both while changing the style, which is different or unique to each pile.

Unpaired image-to-image translation is a mapping between two piles of image styles. It's really about finding the common content of these two piles as well as their differences. You've seen paired image-to-image translation before: that's when each input has a clear corresponding output. Such a paired data set can sometimes be obtained pretty easily, for example by running an edge detector on realistic images to get the matching edge maps.
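To illustrate how cheaply such pairs can be produced, here is a crude gradient-threshold edge map on a toy grayscale image. It is a stand-in for this example, not a specific edge detector from the literature, and the threshold is arbitrary.

```python
def edge_map(image, threshold):
    """Mark pixels whose right/down neighbors differ strongly in intensity."""
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            here = image[r][c]
            # At the borders, compare the pixel with itself (no edge)
            right = image[r][c + 1] if c + 1 < width else here
            down = image[r + 1][c] if r + 1 < height else here
            if abs(right - here) + abs(down - here) > threshold:
                edges[r][c] = 1
    return edges


# Toy "photo": dark region on the left, bright region on the right
photo = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# One training pair for paired translation: (input edges, target photo)
pair = (edge_map(photo, threshold=4), photo)
```

No such automatic pairing exists for most style changes, which is what motivates the unpaired setting.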

<table><tr>
<td> <img src="images/gan_pup_1.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_pup_2.PNG" style="width: 500px;"/> </td>
</tr></table>

The difference between these two image-to-image translation paradigms is that in one of them you have paired images, where all the different pairs match onto each other. You don't always have that correspondence, though, so in unpaired image-to-image translation you just have two piles of images in two different styles, X and Y. Using these two piles, you want your model to learn general stylistic elements of each pile and transform images from one into the other, and sometimes also vice versa.

Some type of content is still preserved; it's just the stylistic elements that change. That's key in thinking about this translation task: there are commonalities between the piles as well as stylistic differences unique to each, and you want to tease out what's common, keep those common elements, and transfer only the elements that are unique to each pile. The model's goal is to learn the mapping between the two piles by figuring out those common and unique elements. The common elements are often known as content, which is shared by both piles, and style often refers to what is different between them.

### CycleGAN Overview

CycleGAN consists of two different GANs that transform images between two piles, to and from each other. You don't have pairs of images, so how is your model supposed to know what to generate for you?

If you map an image from one pile to the other and then back again, you're only changing styles and not the content of the image, so technically the original image and the image you get back should be the same. This content preservation while mapping from one pile to another and back again is known as cycle consistency, because the translation forms a cycle between the two piles. One simple way of creating this cycle is to use two different GANs. Together, these two GANs use cycle consistency to make realistic unpaired translation possible. The adversarial part, meaning the discriminators of both GANs, ensures realism in the images, while the cycle consistency part is in charge of preserving the content while only the styles move around.

<table><tr>
<td> <img src="images/gan_ov_1.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_ov_2.PNG" style="width: 500px;"/> </td>
</tr></table>


The discriminator is a PatchGAN. The generators combine ideas from DCGAN and U-Net, with additional skip connections. CycleGAN borrows from U-Net in that there's an encoding section with downsampling, followed by a decoding section with upsampling. These encoding and decoding blocks are composed of convolutional layers with batch norm and ReLU. In addition to this U-Net framework, CycleGAN also expands the bottleneck section with more convolutional layers, like those in DCGAN's generator. The bottleneck section also uses additional skip connections within itself, called ResBlocks or residual connections. They make it easier to add more layers and image transformations, because they allow the model to learn identity functions.
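
A minimal sketch of why a residual connection makes an identity function easy to learn. It assumes a toy fully connected layer in place of the convolutional layers a real CycleGAN bottleneck would use, and the name `residual_block` and all shapes are illustrative rather than taken from the paper or the course notebook.

```python
import numpy as np

def residual_block(x, weight, bias):
    """Toy residual block: output = x + F(x), where F is a linear layer plus ReLU.

    Assumption: a real CycleGAN block uses convolutions and normalization;
    a single dense layer is used here only to keep the sketch short.
    """
    transformed = np.maximum(0.0, x @ weight + bias)  # F(x)
    return x + transformed  # the skip connection adds the input back

x = np.array([[1.0, -2.0, 3.0]])

# With all-zero weights and bias, F(x) is zero, so the block is an identity map.
identity_out = residual_block(x, np.zeros((3, 3)), np.zeros(3))
print(identity_out)
```

Because the block only has to learn the change to apply on top of its input, stacking more of these blocks does not force the signal through a transformation at every layer.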

### CycleGAN: Two GANs

CycleGAN is made up of four components: two GANs, each with a generator and a discriminator. The inputs to the generators and discriminators are the same as in Pix2Pix, except that there are no paired real images. So you don't have the extra pixel distance loss against a real target output that you saw before, because there is no target output. Instead, you're looking at two piles of different images.

### CycleGAN: Cycle Consistency

Cycle consistency is an extra loss term added to the loss function for each of the two GANs, and it's what puts the cycle in CycleGAN. You take a real image, translate it to the other pile to get a fake image, and then translate that fake back to the original pile. Cycle consistency expects the image that comes back to look exactly like the real one you started with, because only styles should have changed. So you take the pixel difference between these two images and add it to your loss function, which encourages them to be as close as possible. This also applies in the opposite direction.

You can now construct the entire cycle consistency loss by summing the pixel differences from both directions, summed over the samples i in each pile. This constitutes the entire cycle consistency loss term for your generators. Since each direction uses both generators, you just have one optimizer for both of your generators, and there's only one loss term that both generators use.

This cycle consistency loss is calculated once and shared between the two generators. More concretely, you sum the cycle consistency terms over the images from both piles in your training dataset and weight the result by some lambda term to get your full cycle consistency loss term, which you then add to your generators' loss.
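
The two-direction loss described above can be sketched as follows. This is a hedged illustration, not the course implementation: the generators are stand-in Python callables, the pixel difference is assumed to be a mean absolute (L1) distance, and `lambda_cycle=10.0` is an assumed default weight.

```python
import numpy as np

def pixel_distance(a, b):
    """Mean absolute (L1) pixel difference between two image arrays (assumed metric)."""
    return float(np.mean(np.abs(a - b)))

def cycle_consistency_loss(real_x, real_y, gen_xy, gen_yx, lambda_cycle=10.0):
    """Weighted sum of the cycle losses in both directions.

    gen_xy maps pile x to pile y and gen_yx maps pile y back to pile x.
    Both are hypothetical callables standing in for the two generators.
    """
    cycled_x = gen_yx(gen_xy(real_x))  # x -> fake y -> back to x
    cycled_y = gen_xy(gen_yx(real_y))  # y -> fake x -> back to y
    return lambda_cycle * (pixel_distance(real_x, cycled_x)
                           + pixel_distance(real_y, cycled_y))

rng = np.random.default_rng(0)
real_x = rng.random((2, 3, 8, 8))  # toy batch: 2 images, 3 channels, 8x8 pixels
real_y = rng.random((2, 3, 8, 8))

# Toy "generators" that are exact inverses of each other give (near) zero loss.
perfect = cycle_consistency_loss(real_x, real_y,
                                 lambda img: img + 0.5, lambda img: img - 0.5)

# If the return trip is off by 0.1 per pixel, each direction contributes 0.1.
imperfect = cycle_consistency_loss(real_x, real_y,
                                   lambda img: img + 0.5, lambda img: img - 0.4)
print(perfect, imperfect)
```

Both directions feed a single scalar, which is why one optimizer can update both generators together.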

Cycle consistency is a broader concept than its use in CycleGAN. It's used across deep learning quite a bit. It has helped with data augmentation, for example, and has also been used in text translation.

#### Ablation studies

The CycleGAN paper also showed some cool ablation studies. Ablation means cutting out various components of what CycleGAN introduces and seeing how the model does without them.

<table><tr>
<td> <img src="images/gan_CC_1.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_CC_2.PNG" style="width: 500px;"/> </td>
</tr></table>


* First, the authors took CycleGAN but took away all of the GAN components. What if it only had the cycle consistency loss? With just the cycle consistency loss, the model doesn't do too well.
What you're looking at in the figure is actually a paired dataset. Remember that CycleGAN operates on unpaired data, so it is really just looking at two piles. The ground truth takes a realistic image and produces a segmentation output and, in the opposite direction, takes a segmentation output and produces a realistic image.

* With cycle consistency alone, you can see that the outputs are just not realistic at all, and there's probably some mode collapse going on. The adversarial loss from GANs is what makes things look realistic, so as expected, without it you don't get the realistic outputs you would want.

* What about if you only have GANs and there's no cycle consistency? Well, the outputs actually look fairly realistic. They look pretty good, except for a little bit of mode collapse going on.

* You could use half of cycle consistency, meaning the loss in only one direction, but that also isn't enough for the GAN to learn diverse, high-quality mappings. You can see that without the full cycle consistency loss, some mode collapse appears.

In summary, cycle consistency is important for transferring uncommon style elements while maintaining common content across those images. This is done by adding a pixel distance loss to the adversarial loss to encourage cycle consistency in both directions. The ablation studies show that the cycle consistency loss term in both directions helps prevent mode collapse.

### CycleGAN: Least Squares Loss

Least squares loss is used to help with training stability, namely the vanishing gradient problems you saw with BCE loss, which can cause mode collapse and other issues that bring learning to a halt, which is the worst thing you could possibly get. Least squares is a method that minimizes the sum of squared residuals. That means it tries to find the best-fit line, the one with the smallest sum of squared distances between the line and all the points. By taking the sum of all of those squares and minimizing that value, you get the best-fit line you can.

<img src="images/gan_LS_1.PNG" style="width: 500px;"/>
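
As a small illustration of the best-fit-line idea, the sketch below fits a line to four made-up points with NumPy's `np.polyfit` and compares its sum of squared residuals against a hand-picked line. The data points and the helper name `sum_squared_residuals` are assumptions for illustration only.

```python
import numpy as np

def sum_squared_residuals(slope, intercept, xs, ys):
    """Sum of squared vertical distances between the points and a line."""
    residuals = ys - (slope * xs + intercept)
    return float(np.sum(residuals ** 2))

# Made-up points scattered around the line y = 2x + 1.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.1, 2.9, 5.2, 6.8])

# A degree-1 polynomial fit is the least squares best-fit line.
slope, intercept = np.polyfit(xs, ys, deg=1)

fit_error = sum_squared_residuals(slope, intercept, xs, ys)
guess_error = sum_squared_residuals(2.0, 1.0, xs, ys)  # a reasonable hand guess
print(slope, intercept, fit_error, guess_error)
```

No other line can have a smaller sum of squared residuals on these points than the fitted one, which is what "best fit" means here.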

How this translates into GAN land is that your line is now your label of real or fake. The discriminator's loss under the least squares adversarial loss function measures how far its predictions on real images are from one and how far its predictions on fake images are from zero. On the generator side, the loss instead measures how far the discriminator's predictions on the fake images are from one. This might look similar to previous loss functions you've seen, namely BCE loss. Importantly, though, this loss isn't flat over wide regions the way the sigmoids in BCE were, which caused all of those vanishing gradient problems. It's only flat where the discriminator's predictions are exactly one for real and exactly zero for fake. This loss is also known as mean squared error, or MSE.
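
A minimal sketch of these two loss terms, assuming the discriminator outputs a matrix of per-patch scores and that the real and fake parts of the discriminator loss are simply added (some implementations average them instead). The function names are hypothetical.

```python
import numpy as np

def discriminator_ls_loss(pred_real, pred_fake):
    """Least squares discriminator loss: reals pushed toward 1, fakes toward 0."""
    real_term = np.mean((pred_real - 1.0) ** 2)
    fake_term = np.mean((pred_fake - 0.0) ** 2)
    return float(real_term + fake_term)  # assumption: plain sum, no 0.5 factor

def generator_ls_loss(pred_fake):
    """Least squares generator loss: the generator wants fakes scored as 1."""
    return float(np.mean((pred_fake - 1.0) ** 2))

# Toy 4x4 patch score matrices from a hypothetical discriminator.
pred_real = np.full((4, 4), 0.9)  # fairly confident the real image is real
pred_fake = np.full((4, 4), 0.2)  # fairly confident the fake image is fake

disc_loss = discriminator_ls_loss(pred_real, pred_fake)
gen_loss = generator_ls_loss(pred_fake)
print(disc_loss, gen_loss)
```

The gradient of each squared term grows with the distance from the target, so a generator whose fakes are scored far from one still receives a useful learning signal.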

### CycleGAN: Identity Loss

Identity loss is an optional loss term that was proposed in CycleGAN, mainly to help with color preservation in the outputs. It's another pixel difference loss term.

It ensures that if you put an image from pile B into the opposite generator, the one that goes from A to B, the generator ideally outputs the exact same image, because the input is already in the B style. You expect an identity mapping here, or essentially no change from input to output.

You can take the pixel distance between the real input and whatever the generator produces, and add that to your loss function. When the pixel distance is zero, the identity loss is zero, which is exactly what you want your generator to be doing. You don't want your generator to transform the image into anything else, so you encourage this behavior and apply it to both directions of the cycle. You again add a lambda term, so the cycle consistency loss term and the identity loss term can have different weightings.

<table><tr>
<td> <img src="images/gan_IL_1.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_IL_2.PNG" style="width: 500px;"/> </td>
</tr></table>

In summary, identity loss takes a real image from one of the piles and inputs it into the opposite generator. What you expect is an identity mapping, because the input image already has the styles that the generator is trying to map to. Pixel distance is used to measure this, and ideally there's no difference between input and output, so the identity loss is zero. The main reason it's optional is that it has been shown to be very helpful in many cases, but in other cases it doesn't make much of a difference.
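
A hedged sketch of the identity loss in both directions. The generators are stand-in callables, the pixel distance is assumed to be a mean absolute difference, and `lambda_identity=0.1` is an illustrative weight rather than a recommended value.

```python
import numpy as np

def identity_loss(real_a, real_b, gen_ab, gen_ba, lambda_identity=0.1):
    """Weighted pixel distance between each real image and the output of
    the generator that already targets that image's own style.

    gen_ab maps pile A to pile B and gen_ba maps pile B to pile A.
    Both are hypothetical callables standing in for the two generators.
    """
    same_b = gen_ab(real_b)  # a B image through the A-to-B generator
    same_a = gen_ba(real_a)  # an A image through the B-to-A generator
    distance = np.mean(np.abs(real_b - same_b)) + np.mean(np.abs(real_a - same_a))
    return float(lambda_identity * distance)

real_a = np.zeros((1, 3, 4, 4))
real_b = np.ones((1, 3, 4, 4))

# Generators that leave same-style inputs untouched incur no identity loss.
no_change = identity_loss(real_a, real_b, lambda img: img, lambda img: img)

# A generator that shifts every pixel (a color shift, say) is penalized.
shifted = identity_loss(real_a, real_b, lambda img: img + 0.2, lambda img: img)
print(no_change, shifted)
```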

### CycleGAN: Putting It All Together

To start, you input your zebra image and get a fake horse using the generator that maps from zebra to horse. Your horse discriminator then looks at this fake image as well as real horse images. It doesn't know which one is which, and it outputs a classification matrix of how real or how fake it thinks the patches of those images are. Least squares loss is used here: for the discriminator, the target classification matrix is full of ones for reals and full of zeros for fakes, and that's how you compute the least squares adversarial loss. For the generator, the target classification matrix for the fakes is all ones.

In addition to this least squares adversarial loss, you also take your fake horse and feed it through the other generator, the one going from horses to zebras, to generate a fake zebra so that you can compute the cycle consistency loss in this direction. You do that by taking the pixel difference between the real input zebra and this fake generated zebra, because they really should look the same, as you're only supposed to be transferring styles between these two generators. The same is done in the opposite direction.

If you do choose to use identity loss for your task, you also put your real zebra through the opposite generator, the one from horses to zebras, which should give you a zebra back. You take the pixel difference between the input and the output and save that as the identity loss. Again, you do the same with horses going through the zebra-to-horse generator.

<table><tr>
<td> <img src="images/gan_AT_1.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_AT_2.PNG" style="width: 500px;"/> </td>
</tr></table>
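
The walkthrough above can be sketched as a single generator loss. This is an illustration under stated assumptions, not the course notebook: the generators and discriminators are stand-in callables, pixel differences use a mean absolute distance, and the weights `lambda_cycle` and `lambda_identity` are hypothetical defaults.

```python
import numpy as np

def l1(a, b):
    """Mean absolute pixel difference (assumed pixel distance)."""
    return float(np.mean(np.abs(a - b)))

def ls_to_ones(patch_preds):
    """Least squares loss against an all-ones target classification matrix."""
    return float(np.mean((patch_preds - 1.0) ** 2))

def generator_total_loss(real_zebra, real_horse, gen_zh, gen_hz,
                         disc_horse, disc_zebra,
                         lambda_cycle=10.0, lambda_identity=0.1):
    """Adversarial + cycle consistency + identity loss for both generators."""
    fake_horse = gen_zh(real_zebra)
    fake_zebra = gen_hz(real_horse)
    # Adversarial: each generator wants its fakes classified as real (ones).
    adversarial = ls_to_ones(disc_horse(fake_horse)) + ls_to_ones(disc_zebra(fake_zebra))
    # Cycle consistency: translate there and back, then compare with the input.
    cycle = l1(real_zebra, gen_hz(fake_horse)) + l1(real_horse, gen_zh(fake_zebra))
    # Identity (optional): same-style inputs should pass through unchanged.
    identity = l1(real_zebra, gen_hz(real_zebra)) + l1(real_horse, gen_zh(real_horse))
    return adversarial + lambda_cycle * cycle + lambda_identity * identity

real_zebra = np.zeros((1, 3, 8, 8))
real_horse = np.ones((1, 3, 8, 8))

# Toy stand-ins: generators that change nothing, and discriminators that
# score every patch 0.5 regardless of the input.
total = generator_total_loss(
    real_zebra, real_horse,
    gen_zh=lambda img: img, gen_hz=lambda img: img,
    disc_horse=lambda img: np.full((2, 2), 0.5),
    disc_zebra=lambda img: np.full((2, 2), 0.5),
)
print(total)
```

With these toy stand-ins the cycle and identity terms vanish and only the adversarial term remains, which makes it easy to see how each piece contributes to the total.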


### CycleGAN Applications & Variants

CycleGAN can be used for many different things, including the various filters you've seen on Snapchat, aging a face, changing someone's perceived gender, turning a horse into a zebra, changing the season of a scene, and rendering a scene in a certain painting style (style transfer). CycleGAN is also often used in data augmentation.

CycleGAN isn't the only model that can do unpaired image-to-image translation. There is a variant called UNIT, which stands for Unsupervised Image-to-Image Translation. You can think of unpaired image-to-image translation as unsupervised because you don't have labels for it. The key insight in this model is the shared latent space, which basically states that given a noise vector Z in this latent space, the model can generate an image in domain X1, and there's a mapping from that image back to the latent. The same latent can also map to another domain, say X2, and map back to the latent again.


<table><tr>
<td> <img src="images/gan_var_1.PNG" style="width: 500px;"/> </td>
<td> <img src="images/gan_var_2.PNG" style="width: 500px;"/> </td>
</tr></table>

Taking UNIT a step further is Multimodal UNIT, or MUNIT. Multimodal means that you can go from one domain, such as a sketch, to many modes in the second domain. So MUNIT is able to find not only this one mapping but all of these others as well. You never actually tell the model about the different styles within your shoe pile. It's unsupervised.

#### The CycleGAN Paper

Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (Zhu, Park, Isola, and Efros, 2020): https://arxiv.org/abs/1703.10593

#### CycleGAN for Medical Imaging

Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks (Sandfort, Yan, Pickhardt, and Summers, 2019): https://www.nature.com/articles/s41598-019-52737-x.pdf

#### MUNIT

In this notebook, you will learn about and implement MUNIT, a method for unsupervised image-to-image translation, as proposed in Multimodal Unsupervised Image-to-Image Translation (Huang et al. 2018).
https://colab.research.google.com/github/https-deeplearning-ai/GANs-Public/blob/master/C3W3_MUNIT_(Optional).ipynb

#### From the videos:

* Image-to-Image Translation with Conditional Adversarial Networks (Isola, Zhu, Zhou, and Efros, 2018): https://arxiv.org/abs/1611.07004
* Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (Zhu, Park, Isola, and Efros, 2020): https://arxiv.org/abs/1703.10593
* PyTorch implementation of CycleGAN (2017): https://github.com/togheppi/CycleGAN
* Distribution Matching Losses Can Hallucinate Features in Medical Image Translation (Cohen, Luck, and Honari, 2018): https://arxiv.org/abs/1805.08841
* Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks (Sandfort, Yan, Pickhardt, and Summers, 2019): https://www.nature.com/articles/s41598-019-52737-x.pdf
* Unsupervised Image-to-Image Translation (NVIDIA, 2018): https://github.com/mingyuliutw/UNIT
* Multimodal Unsupervised Image-to-Image Translation (Huang et al., 2018): https://github.com/NVlabs/MUNIT

#### From the notebooks:

* PyTorch-CycleGAN (2017): https://github.com/aitorzip/PyTorch-CycleGAN/blob/master/datasets.py
* Horse and Zebra Images Dataset: https://people.eecs.berkeley.edu/~taesung_park/CycleGAN/datasets/horse2zebra.zip