> Nicolas Sholten  
Valentin Portillo

<span style="font-size: 30px">**Pix2Pix**</span>

<span style="font-size: 20px">**Image-to-Image Translation with Conditional Adversarial Networks**</span>

## Context

<span style="font-size: 18px">Pix2Pix belongs to the family of **Generative Models** within **Machine Learning**, and more specifically within **Deep Learning**. It builds on the principles of **Generative Adversarial Networks (GANs)**: the architecture contains a generator and a discriminator, where the generator tries to fool the discriminator and the discriminator tries to catch the generator. **The generator** is built with an encoder-decoder ("auto-encoder") architecture, meaning that it tries to capture the underlying distribution needed to map the input representation to the target representation. **The discriminator**, in contrast, evaluates the output of the generator and classifies it as fake or real.</span>

![alt text](https://www.kdnuggets.com/wp-content/uploads/generative-adversarial-network.png)



## Problem

<span style="font-size: 18px">Pix2Pix addresses the need to **translate an input image into the corresponding output image**: the input is one representation of an image and the output is the image it represents. The examples are varied: </span>

<p></p>
<span style="font-size: 18px">
    <ul>
        <li>Semantic labels ↔ photo</li>
        <li>Architectural labels → photo</li>
        <li>Map ↔ aerial photo</li>
        <li>BW → color photos</li>
        <li>Edges → photo</li>
        <li>Sketch → photo: tests edges → photo models on human-drawn sketches</li>
        <li>Day → night</li>
        <li>Thermal → color photos</li>
        <li>Photo with missing pixels → inpainted photo</li>
    </ul>   
</span>

![alt text](https://phillipi.github.io/pix2pix/images/teaser_v3.png)

## Formulation

<span style="font-size: 18px">**Objective function:**</span>

\begin{equation}
L_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x,y)] + \mathbb{E}_{x,z}[\log(1 - D(x,G(x,z)))]
\end{equation}
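
As a rough numerical illustration of the objective above, the sketch below estimates $L_{cGAN}$ from discriminator outputs on a small batch. The function name `cgan_objective` and the probability values are made up for illustration; `d_real` stands for $D(x,y)$ and `d_fake` for $D(x,G(x,z))$.

```python
import math

def cgan_objective(d_real, d_fake):
    """Batch estimate of E[log D(x,y)] + E[log(1 - D(x,G(x,z)))].

    d_real: discriminator probabilities on real (x, y) pairs.
    d_fake: discriminator probabilities on generated (x, G(x, z)) pairs.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator outputs (not taken from any experiment).
confident_d = cgan_objective(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
fooled_d = cgan_objective(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

# D wants the objective high, G wants it low.
print(confident_d > fooled_d)  # True
```

A discriminator that separates real from generated pairs well pushes the value toward 0, while a fully fooled discriminator (all outputs 0.5) gives 2·log(0.5).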

<span style="font-size: 18px">-> *G tries to minimize the objective function against an adversarial D that tries to maximize it*</span>

\begin{equation}
G^{*} = \arg\min_{G}\max_{D} L_{cGAN}(G,D)
\end{equation}

<span style="font-size: 18px">**Generator Loss:**</span>

\begin{equation}
L_{L1}(G) = \mathbb{E}_{x,y,z}[\lVert y - G(x,z) \rVert_{1}]
\end{equation}

<span style="font-size: 18px">then, the final objective is:</span>

\begin{equation}
G^{*} = \arg\min_{G}\max_{D} L_{cGAN}(G,D) + \lambda L_{L1}(G)
\end{equation}

> where
>
> * z - noise
> * y - output (target image)
> * x - condition (input image)
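
Seen from the generator's side, the final objective is the adversarial term plus the weighted L1 term. The sketch below is only an illustration: `l1_loss`, `generator_objective`, the flat pixel lists, and the numbers are hypothetical, the L1 term is averaged over pixels (a scaled version of the norm), and `lam=100.0` is an assumed default for λ that the text above does not fix.

```python
import math

def l1_loss(target, generated):
    """Mean absolute difference between target y and output G(x, z)."""
    return sum(abs(t - g) for t, g in zip(target, generated)) / len(target)

def generator_objective(d_fake, target, generated, lam=100.0):
    """Generator-side objective: E[log(1 - D(x, G(x,z)))] + lam * L1.

    The E[log D(x, y)] term is dropped because it does not depend on G.
    lam is an assumed weight, not a value given in the text.
    """
    adversarial = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return adversarial + lam * l1_loss(target, generated)

# Hypothetical flattened pixel values in [0, 1].
y = [0.0, 0.5, 1.0, 1.0]
g = [0.1, 0.5, 0.8, 1.0]

print(round(l1_loss(y, g), 3))                     # 0.075
print(round(generator_objective([0.5], y, g), 3))  # 6.807
```

With a large λ the L1 term dominates the value, so the generator is pulled strongly toward the target image while the adversarial term shapes the fine detail.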

## Architecture

<span style="font-size: 18px">To perform these tasks, Pix2Pix adapts the basic GAN architecture in several ways. First, the GAN is a **conditional GAN (cGAN)**, which means that in addition to noise it receives a condition as input. This condition could be a label such as "drawing" or, in our case, an image such as a hand-drawn purse, for which we expect an image of a purse as output. Second, the generator downsamples and then upsamples the image, and a "bridge" links each downsampling level to the upsampling level of the same resolution, which helps recover the information lost during downsampling. These bridges are called **"skip connections"**, and U-Net is the architecture built around this type of connection. Third, the generator objective adds an **L1 loss**: because L1 (absolute error) penalizes large errors less strongly than L2 (squared error), it produces less blurry images, while the discriminator is still trained with the adversarial objective given in the formulation. Fourth, **normalization** is applied in the downsampling and upsampling blocks so that activations stay in a controlled range instead of skyrocketing. Finally, the discriminator does not give a single fake/real score for the whole image: its last layer is divided into patches, each scored as fake or real, so the penalty is applied at the scale of patches. This structure models the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter. As a result, the discriminator models only high-frequency structure, and low-frequency correctness is left mainly to the L1 term. This architecture is called **PatchGAN**.
</span>
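
To make the normalization step concrete, here is a minimal sketch that standardizes a list of activations to zero mean and unit variance, which is the core of batch-style normalization. The helper name `normalize`, the `eps` value, and the input numbers are illustrative assumptions, and the learned scale and shift used in practice are omitted.

```python
from statistics import fmean, pvariance

def normalize(activations, eps=1e-5):
    """Standardize activations to roughly zero mean and unit variance."""
    mean = fmean(activations)
    var = pvariance(activations, mu=mean)
    return [(a - mean) / (var + eps) ** 0.5 for a in activations]

# Hypothetical activations with a very large spread.
raw = [2.0, 40.0, 600.0, 8000.0]
out = normalize(raw)

print(max(abs(v) for v in out) < 2.0)  # True: no value is extreme any more
```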

<span style="font-size: 18px">**cGAN**</span>
> Difference from a plain GAN: a condition (label, image, ...) is given as an additional input

<img style="width: 500px" src="http://nooverfit.com/wp/wp-content/uploads/2017/10/Screenshot-from-2017-10-07-120039.png">
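
One common way to feed the condition to the discriminator is to stack it with the candidate image along the channel axis. The sketch below is an assumption about the implementation, not something the figure specifies; images are nested lists in `[channel][row][col]` layout, and `make_image` and `make_pair` are hypothetical helpers.

```python
def make_image(channels, height, width, value):
    """Build a constant image as nested lists in [channel][row][col] layout."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]

def make_pair(condition, candidate):
    """Stack condition x and a candidate image along the channel axis."""
    assert len(condition[0]) == len(candidate[0]), "heights must match"
    assert len(condition[0][0]) == len(candidate[0][0]), "widths must match"
    return condition + candidate

x = make_image(3, 4, 4, 0.0)       # condition, e.g. an edge map
y_fake = make_image(3, 4, 4, 1.0)  # candidate, e.g. G(x, z)

pair = make_pair(x, y_fake)
print(len(pair), len(pair[0]), len(pair[0][0]))  # 6 4 4
```

The discriminator then scores the 6-channel pair, so it judges whether the candidate is a plausible translation of this particular condition and not just a plausible image.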

<span style="font-size: 18px">**U-Net**</span>
> Used in the generator. The figure is not an exact match: it has an extra convolutional layer and is missing the normalization layers

<img style="width: 800px" src="https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2Fblog.qure.ai%2Fassets%2Fimages%2Fsegmentation-review%2Funet.png&f=1&nofb=1">
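
The effect of skip connections on tensor shapes can be traced with a little bookkeeping. This is a sketch under assumptions: `unet_shapes` is a hypothetical helper, every block is taken to halve or double the resolution, and the channel counts are illustrative, not read from the figure.

```python
def unet_shapes(size, enc_channels):
    """Trace (channels, resolution) through a U-Net-style encoder-decoder."""
    encoder = []
    for channels in enc_channels:
        size //= 2  # each encoder block downsamples by 2
        encoder.append((channels, size))

    decoder = []
    _, size = encoder[-1]  # start from the bottleneck
    for skip_channels, skip_size in reversed(encoder[:-1]):
        size *= 2  # each decoder block upsamples by 2
        assert size == skip_size, "skip connects equal resolutions"
        # Decoder output is concatenated with the mirrored encoder output,
        # so the next block sees twice as many channels.
        decoder.append((skip_channels + skip_channels, size))
    return encoder, decoder

encoder, decoder = unet_shapes(256, [64, 128, 256, 512])
print(encoder)  # [(64, 128), (128, 64), (256, 32), (512, 16)]
print(decoder)  # [(512, 32), (256, 64), (128, 128)]
```

The doubled channel counts in the decoder come from the concatenation: each upsampling level receives both the decoded features and the untouched encoder features of the same resolution.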

<span style="font-size: 18px">**L1 loss**</span>
> Comparison between L1 (absolute value) and L2 (squared value) losses

<img style="width: 400px" src="https://t1.daumcdn.net/cfile/tistory/999BFC475AE189661D">
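
One way to see why L1 tends to blur less than L2: when several target values are plausible for a pixel, the single value that minimizes squared error is their mean, while the value that minimizes absolute error is their median. The sketch below checks this by brute force; `best_guess`, the target values, and the grid of candidates are made up for illustration.

```python
def best_guess(targets, cost):
    """Grid-search the single value in [0, 1] that minimizes the summed cost."""
    candidates = [i / 100 for i in range(101)]
    return min(candidates, key=lambda c: sum(cost(t, c) for t in targets))

# Hypothetical pixel: dark in two plausible outputs, bright in one.
targets = [0.0, 0.0, 1.0]

best_l1 = best_guess(targets, lambda t, c: abs(t - c))
best_l2 = best_guess(targets, lambda t, c: (t - c) ** 2)

print(best_l1)  # 0.0  (the median: a value that actually occurs)
print(best_l2)  # 0.33 (close to the mean: an in-between gray)
```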

<span style="font-size: 18px">**PatchGAN**</span>
> Example

<img style="width: 350px" src="https://paper-attachments.dropbox.com/s_84D9D849F786EC83B26BF2A0F74F0C33230682E8BA1D41AD8C3F3D770D23236A_1566269789757_pg.png">
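
How large a patch each discriminator output "sees" follows from the kernel sizes and strides of its convolutions. The sketch below assumes a commonly used layout of five 4x4 convolutions with strides 2, 2, 2, 1, 1 and padding 1; this layout, the 256-pixel input, and the helper names are assumptions, since the text above does not list the layers.

```python
def receptive_field(layers):
    """Receptive field of one output unit for a stack of (kernel, stride) convs."""
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

def output_size(size, layers, padding=1):
    """Side length of the output score map for a square input."""
    for kernel, stride in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size

# Assumed layout: five 4x4 convolutions with strides 2, 2, 2, 1, 1.
patch_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

print(receptive_field(patch_layers))   # 70
print(output_size(256, patch_layers))  # 30
```

Under these assumptions the discriminator produces a 30 x 30 grid of fake/real scores, each judging a 70 x 70 patch of the input.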



## Results

## Conclusions

## References

[1] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, *"Image-to-Image Translation with Conditional Adversarial Networks"*, Berkeley AI Research (BAIR) Laboratory, UC Berkeley

## Annexes

<img src="./datasets/Modèle discriminateur.png">

<img src="./datasets/Modèle générateur.png">