# Introduction

In the last lesson, we gave the GAN a small upgrade so that it generates better images. This week we explore another major issue with GAN training: a **GAN generating the same thing each time**. For example, a GAN trained on many different dog breeds might only ever generate a golden retriever. This happens because, as the discriminator improves, it gets stuck saying that an image of a dog looks either extremely fake or extremely real. As a classifier, it is encouraged to output one (real) or zero (fake) as it gets better. But if, in a single round of training, the only thing the discriminator thinks looks real is the generator's golden retriever, even if it doesn't look that real, then the generator will cling to that golden retriever and only produce golden retrievers. 

Then, when the discriminator learns in the next round of training that this golden retriever is fake, the generator won't know where to go, because it has nothing else in its arsenal of images. That's the end of learning, and the end of learning is very bad for these networks. 

Digging one level deeper, this happens because of the binary cross-entropy loss, where the discriminator is forced to produce a value between zero and one. Even though there's an infinite number of decimal values between zero and one, its outputs will approach zero or one as it gets better. 

Now we introduce a new loss function that allows the discriminator to output any real number, such as -4 or +100 (anything between negative infinity and infinity), which mitigates this problem and allows both networks to keep on learning.

# Mode Collapse

A **mode** in a distribution of data is just an area with a high concentration of observations. For instance, the mean value in a normal distribution is the single mode of that distribution. There are also distributions with multiple modes, where the mean doesn't have to be one of them. The distribution below is bimodal, i.e., it has two modes. More generally, a distribution with multiple modes is called multimodal. 

<img src="images/mode_distribution.svg" width="50%"/>

More intuitively, any peak of the probability density distribution over features is a mode of that distribution. For example, take handwritten digits represented by two features, $x_1$ and $x_2$. These are just dimensions along which you can represent the different handwritten digits. 

<img src="images/multi_mode.svg" width="40%"/>

The probability density distribution in this case will be a surface with many peaks, one for each digit. This is definitely multimodal, with 10 different modes. Different observations of the number 7, for example, will be represented by similar $x_1$ and $x_2$ pairs. 

<img src="images/mode_seven.svg" width="45%"/>

Both of those values would be sevens, and the one marked in red at the mean would be an average-looking seven. You can imagine each of these peaks coming out at you in a 3D representation, where darker circles represent higher altitudes. Different pairs of $x_1$ and $x_2$ features would produce handwritten fives. A value in between the seven and the five (marked with a green line) lies in a region of very low density, so in the real dataset there is a very low probability of seeing an $x_1$, $x_2$ pair in this in-between space. Such a pair would probably produce a mixture of a seven and a five, an intermediate-looking number. 

<img src="images/mode_seven_five.svg" width="40%"/>

Different observations of the same digit will be grouped together in this feature space, with the highest concentration around the most common way to write that digit. The average seven sits at the center of the seven mode's peak and, of course, the average five sits at the center of the five mode's peak. This probability density distribution over the features $x_1$ and $x_2$ will have 10 modes, one for each of the digits. 

Many real-life distributions used for training GANs are multimodal. Take a discriminator that has learned to be good at identifying which handwritten digits are fake, except for cases where the generated images look like a `1` or a `7`. This could mean the discriminator is at a local minimum of its cost function. The discriminator classifies most of the digits correctly, except for the ones that resemble ones and sevens, and this information is passed on to the generator. The generator looks at this feedback and gets a good idea of how to fool the discriminator in the next round. It sees that all the images the discriminator misclassified resemble either a one or a seven, so it generates a lot of pictures that resemble those two digits. These generated images are passed on to the discriminator in the next round, which then misclassifies every picture except, perhaps, the one that looks more like a seven. 

The generator gets that feedback and sees that the discriminator's weakness is with the pictures that resemble a handwritten one, so this time all the pictures it produces resemble that digit, collapsing to a single mode of the whole distribution of possible handwritten digits. Eventually the discriminator will probably catch on and learn to catch the generator's fake handwritten ones by getting out of that local minimum. However, the generator could then migrate to another mode of the distribution and collapse again to that different mode. Or the generator might not be able to figure out where else to diversify. 

To sum up, modes are peaks in the probability distribution of our features. Real-world datasets have many modes, related to each possible class within them, like the digits in this dataset of handwritten digits. Mode collapse happens when the generator learns to fool the discriminator by producing examples from a single class of the whole training dataset, like handwritten ones. This is unfortunate because, while the generator is optimizing to fool the discriminator, that's not what you ultimately want your generator to do.

# Problem with BCE Loss

Binary Cross-Entropy loss, or BCE loss, is traditionally used for training GANs, but it isn't the best way to do it. With BCE loss, GANs are prone to mode collapse and other problems. Remember that the BCE loss function is just the average cost the discriminator incurs for misclassifying real and fake observations. 

$$
J(\theta) = - \frac{1}{m} \sum\limits_{i=1}^{m} [y^{(i)} \log h(x^{(i)}, \theta) + (1 - y^{(i)}) \log(1 - h(x^{(i)}, \theta))]
$$

Here the first term is for the reals and the second term is for the fakes. The higher this cost value is, the worse the discriminator is doing. The generator wants to maximize this cost, because that means the discriminator is doing poorly and is classifying its fake values as real, whereas the discriminator wants to minimize this cost function, because that means it's classifying things correctly. Of course, the generator only sees the fake side of things, so it never sees anything about the reals. This maximization and minimization is often called a `minimax` game. 
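
To make the formula concrete, here is a minimal sketch of the BCE cost in plain Python. The function name `bce_loss`, the clamping constant `eps`, and the example predictions are all illustrative assumptions, not part of the lesson; the labels follow the convention above (1 for real, 0 for fake).

```python
import math

def bce_loss(labels, predictions, eps=1e-12):
    """Average binary cross-entropy J(theta) over m examples.

    labels:      1 for a real example, 0 for a fake one
    predictions: discriminator outputs h(x, theta), each in (0, 1)
    eps:         illustrative clamp so that log(0) is never evaluated
    """
    total = 0.0
    for y, h in zip(labels, predictions):
        h = min(max(h, eps), 1.0 - eps)
        # First term is active for reals (y = 1), second for fakes (y = 0).
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / len(labels)

labels = [1, 1, 0, 0]  # two reals, two fakes
confident_cost = bce_loss(labels, [0.9, 0.8, 0.1, 0.2])  # mostly correct
confused_cost = bce_loss(labels, [0.4, 0.3, 0.7, 0.6])   # mostly wrong
print(confident_cost, confused_cost)
```

The discriminator that classifies correctly gets the lower cost, which is why the discriminator minimizes this value while the generator tries to push it up.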

At the end of this **minimax** game, the interaction between the generator and the discriminator translates to a more general objective for the whole GAN architecture: to make the real and generated data distributions of features very similar, getting the generated distribution as close as possible to the real one. 

<img src="images/gans_distribution.svg" width="45%"/>

This minimax of the Binary Cross-Entropy loss function is somewhat approximating the minimization of another, more complex cost function that's trying to make this happen. Of course, during this whole training process, the discriminator naturally tries to separate the real and fake distributions as much as possible, whereas the generator tries to make the generated distribution look more like the real one. 

However, let's take a step back to the generator's and discriminator's roles. The discriminator, again, needs to output just a single value, a prediction between zero and one, whereas the generator needs to produce a pretty complex output composed of multiple features, for example an image, to try to fool the discriminator. As a result, the discriminator's job tends to be a little bit easier. To put it another way, it's more straightforward to look at paintings in a museum than it is to paint those masterpieces. 

During training it's possible, and in fact quite common, for the discriminator to outperform the generator. At the beginning of training this isn't such a big problem, because the discriminator isn't that good yet. It has trouble distinguishing the generated and real distributions: there's some overlap, and it's not quite sure. As a result, it's able to give useful feedback to the generator in the form of a non-zero gradient. 

<img src="images/gan_learning.svg" width="60%"/>

However, as the discriminator gets better during training, it starts to separate the generated and real distributions more and can distinguish them much more easily. Its outputs for the real distribution will be centered around one, and its outputs for the generated distribution will approach zero. As a result, as the discriminator gets better, it starts giving less informative feedback. In fact, it might give gradients close to zero, which is unhelpful for the generator because then the generator doesn't know how to improve. 

<img src="images/gan_vanishing.svg" width="60%"/>

This is how the vanishing gradient problem arises. In summary, GANs try to make the generated distribution look similar to the real one by minimizing an underlying cost function that measures how different the distributions are. The discriminator improves during training, and sometimes improves more easily than the generator. When that happens, the underlying cost function has flat regions where the distributions are very different from one another: the discriminator can distinguish the reals from the fakes much more easily and can say, "Reals look really real, a label of one, and fakes look really fake, a label of zero." All of this causes vanishing gradient problems.
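
The effect can be sketched numerically. The snippet below assumes the discriminator ends in a sigmoid applied to a raw score `s` (the names `s`, `fake_term`, and `fake_term_grad` are illustrative), and looks at the fake part of the BCE cost, $\log(1 - h)$, which is the only part the generator sees. Its derivative with respect to the raw score is $-\text{sigmoid}(s)$, so the more confidently the discriminator calls a fake "fake" (very negative `s`), the smaller the gradient that flows back.

```python
import math

def sigmoid(s):
    """Squash a raw score into (0, 1), as the discriminator's last layer does."""
    return 1.0 / (1.0 + math.exp(-s))

def fake_term(s):
    """Fake part of the BCE cost for one generated example: log(1 - h)."""
    return math.log(1.0 - sigmoid(s))

def fake_term_grad(s):
    """Derivative of log(1 - sigmoid(s)) with respect to s, which is -sigmoid(s)."""
    return -sigmoid(s)

# As the discriminator grows more confident that the example is fake,
# the raw score s becomes more negative and the gradient shrinks toward zero.
for s in [0.0, -2.0, -5.0, -10.0]:
    print(f"score={s:6.1f}  prediction={sigmoid(s):.5f}  gradient={fake_term_grad(s):.5f}")
```

This is only a one-example sketch of the saturation, not a full training loop, but it shows why a very good discriminator ends up giving the generator almost nothing to learn from.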

# Earth Mover’s Distance

When using BCE loss to train a GAN, we often encounter mode collapse and vanishing gradient problems due to the underlying cost function of the whole architecture. Even though there is an infinite number of decimal values between zero and one, the discriminator, as it improves, pushes its outputs towards those two ends. 

So take these generated and real distributions, with the same variance but different means, and assume they are normal distributions. The Earth Mover's distance (EMD) measures how different these two distributions are by estimating the amount of effort it takes to make the generated distribution equal to the real one. 

<img src="images/distributions.svg" width="50%"/>

So intuitively, if the generated distribution were a pile of dirt, how difficult would it be to move that pile of dirt and mold it into the shape and location of the real distribution? That's what the Earth mover's distance means. The function depends on both the distance and the amount that the generated distribution needs to be moved. 
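
As a rough illustration, here is a sketch of the Earth mover's distance for a very special case: two one-dimensional sets of samples of the same size. In that case the cheapest way to move the dirt is to match the sorted samples one to one, so the distance is the average gap between matched samples. The function name `emd_1d` and the toy samples are assumptions for illustration; a GAN never computes this directly on images.

```python
def emd_1d(generated, real):
    """Earth mover's distance between two equal-sized 1D samples.

    Each sample carries the same amount of "dirt" (1/n), and in one
    dimension the cheapest plan pairs the sorted samples in order.
    """
    if len(generated) != len(real):
        raise ValueError("this sketch assumes equally many samples on both sides")
    pairs = zip(sorted(generated), sorted(real))
    return sum(abs(g - r) for g, r in pairs) / len(real)

real = [0.0, 1.0, 2.0, 3.0]
for shift in [1.0, 10.0, 100.0]:
    # Same shape as the real samples, just moved further and further away.
    generated = [x + shift for x in real]
    print(shift, emd_1d(generated, real))
```

The measure keeps growing with the shift instead of flattening out, however far apart the two sets of samples are.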

So the problem with BCE loss is that, as the discriminator improves, it starts giving more extreme values between zero and one, that is, values closer to one and closer to zero. As a result, its feedback to the generator becomes less helpful, and the generator stops learning due to vanishing gradient problems. 

<img src="images/gan_vanishing.svg" width="60%"/>

With Earth mover's distance, however, there's no such ceiling at zero and one, so the cost function continues to grow regardless of how far apart these distributions are. The gradient of this measure won't approach zero and, as a result, GANs are less prone to vanishing gradient problems and, in turn, to the mode collapse that follows from them. 

<img src="images/emd_loss.svg" width="60%"/>

So wrapping up, Earth mover's distance is a function of the effort it takes to make one distribution equal to another, so it depends on both distance and amount. Unlike BCE, it doesn't have flat regions when the distributions start to get very different and the discriminator starts to improve a lot. Approximating this measure therefore eliminates the vanishing gradient problem and reduces the likelihood of mode collapse in GANs.

# Wasserstein Loss


As you've seen previously, BCE Loss is traditionally used to train GANs. However, it has many problems due to the form of the function it approximates. So now we will see an alternative loss function called Wasserstein Loss (W-Loss), which approximates the Earth Mover's Distance. To that end, we'll first see an alternative way to look at the BCE Loss function that's simpler and more compact, and then we will see how W-Loss is calculated and compare it with BCE Loss. 

BCE Loss is computed by a long equation that essentially measures how badly, on average, the discriminator classifies observations as fake or real. The generator in a GAN wants to maximize this cost, because that means the discriminator is saying that its fake values seem really real, while the discriminator wants to minimize that cost.

$$
J(\theta) = - \frac{1}{m} \sum\limits_{i=1}^{m} [y^{(i)} \log h(x^{(i)}, \theta) + (1 - y^{(i)}) \log(1 - h(x^{(i)}, \theta))]
$$

This is often referred to as a minimax game, and this very long equation for BCE Loss can be simplified as follows. The sum and division over the $m$ examples is nothing but a mean, or expected value. The first part inside the sum measures how badly the discriminator classifies real observations, where $y=1$ means real, and the second part measures how badly it classifies fake observations produced by the generator, where $y=0$ means fake. 

$$
J(\theta) = min_d max_g\ \ -[\mathbb{E}(\log(d(x))) + \mathbb{E}(\log(1-d(g(z))))]
$$

W-Loss, on the other hand, approximates the Earth Mover's Distance between the real and generated distributions, and it has nicer properties than BCE. It does look very similar to the simplified form of the BCE Loss: the function calculates the difference between the expected values of the predictions of the discriminator, which here is called the **critic** and is represented with a $c$. That is, $c$ of a real example $x$ versus $c$ of a fake example $g(z)$, where the generator takes in a noise vector $z$ to produce a fake image $g(z)$, or perhaps you can call it $\hat{x}$.

$$
\color{grey}{min_g max_c} \mathbb{E}(c(x)) - \mathbb{E}(c(g(z)))
$$

So the critic looks at these two things, and it wants to maximize the distance between its thoughts on the reals and its thoughts on the fakes. It's trying to push these two distributions as far apart as possible. 

$$
max_c\ \ \mathbb{E}(c(x)) - \mathbb{E}(c(g(z)))
$$

Meanwhile, the generator wants to minimize this difference, because it wants the critic to think that its fake images are as close as possible to the reals. Note that, in contrast with BCE, there are no logs in this function, since the critic's outputs are no longer bounded between 0 and 1.

$$
min_g\ \ \mathbb{E}(c(x)) - \mathbb{E}(c(g(z)))
$$
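
The two objectives can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the critic's scores for a batch of reals and a batch of fakes are already available as lists of numbers; the function names and the example scores are hypothetical. Both objectives are written as losses to be minimized, which is how optimizers usually expect them.

```python
def mean(values):
    """Sample estimate of an expected value."""
    return sum(values) / len(values)

def w_objective(critic_real, critic_fake):
    """E(c(x)) - E(c(g(z))), estimated from two batches of critic scores."""
    return mean(critic_real) - mean(critic_fake)

def critic_loss(critic_real, critic_fake):
    """The critic maximizes the objective, so it minimizes its negative."""
    return -w_objective(critic_real, critic_fake)

def generator_loss(critic_fake):
    """The generator minimizes the objective. E(c(x)) does not depend on
    the generator, so only the fake term is left."""
    return -mean(critic_fake)

# Critic scores are unbounded real numbers, not probabilities.
critic_real = [2.0, 4.0]
critic_fake = [-100.0, -4.0]
print(w_objective(critic_real, critic_fake))
print(critic_loss(critic_real, critic_fake), generator_loss(critic_fake))
```

Notice that nothing here needs a logarithm or a sigmoid, because the scores are not being compared against labels of 0 and 1.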

So, for the BCE Loss to make sense, the output of the discriminator needs to be a prediction between 0 and 1. That's why the discriminator's neural network in a GAN trained with BCE Loss has a sigmoid activation function in the output layer, which squashes the values to between 0 and 1. W-Loss, however, doesn't have that requirement at all, so you can have a linear layer at the end of the network, and that can produce any real-valued output. You can interpret that output as how real the critic considers an image to be. It's no longer bounded between 0 and 1, where 0 means fake and 1 means real, so the network is no longer classifying into, or discriminating between, these two classes. As a result, it wouldn't make much sense to call that neural network a discriminator. For W-Loss, the equivalent of a discriminator is called a **critic**, and what it tries to do is maximize the distance between its evaluation of a fake and its evaluation of a real.

<img src="images/critic_output.svg" width="60%"/>

So, one of the main differences between W-Loss and BCE Loss is that the discriminator under BCE Loss outputs a value between 0 and 1, while the critic under W-Loss can output any number. Additionally, the forms of the cost functions are very similar, but W-Loss doesn't have any logarithms in it, because it's a measure of how far the critic's prediction for the reals is from its prediction for the fakes. BCE Loss, meanwhile, measures the distance between a prediction and a ground truth of 1 or 0. The important takeaway is that the discriminator is bounded between 0 and 1, whereas the critic is no longer bounded and is just trying to separate the two distributions as much as possible. Because it's not bounded, the critic is allowed to improve without degrading its feedback to the generator. It doesn't have a vanishing gradient problem, and this mitigates mode collapse, because the generator will always get useful feedback. 

In summary, W-Loss looks very similar to BCE Loss, but it isn't as complex a mathematical expression. Under the hood, it approximates the Earth Mover's Distance, so it helps prevent mode collapse and vanishing gradient problems. 

# Condition on Wasserstein Critic

Wasserstein Loss, or W-Loss, solves some problems faced by GANs, like mode collapse and vanishing gradients. But for it to work well, there is a special condition that needs to be met by the critic. W-Loss is a simple expression that computes the difference between the expected value of the critic's output for the real examples $x$ and the expected value of its predictions on the fake examples $g(z)$. 

The generator tries to minimize this expression, trying to get the generated examples to be as close as possible to the real examples, while the critic wants to maximize this expression: it wants to differentiate between the reals and the fakes, so it wants the distance to be as large as possible. 

$$
\color{blue}{min_g}\ \color{red}{max_c} \mathbb{E}(c(x)) - \mathbb{E}(c(g(z)))
$$

However, for training GANs using W-Loss, the critic has to satisfy a special condition. It needs to be something called **1-Lipschitz Continuous**, or **1-L Continuous**. For a function like the critic's neural network to be 1-Lipschitz Continuous, the norm of its gradient needs to be at most one. What that means is that the slope can't be greater than one in absolute value at any point. To check whether a function, for example $f(x)=x^2$, is 1-Lipschitz Continuous, you go along every point of the function and make sure the slope there is at most one. One way to do this is to draw two lines through the point where you're evaluating the function: one with slope exactly one, and one with slope exactly negative one. 

<img src="images/wloss_limitation.svg" width="35%"/>

You want to make sure that the function never goes out of the bounds set by these lines, because staying within them means that the function is growing at most linearly. Here, this function is **not** 1-Lipschitz Continuous because it comes out of bounds in several sections. It doesn't stay within the green area, which shows that it's growing more than linearly. 

Look at another example here. This is a smooth curve. Again, you want to check every single point on this function before you can determine whether or not it is 1-Lipschitz Continuous. Let's say you check every single point and the function never grows more than linearly. Then this function is 1-Lipschitz Continuous. 

<img src="images/l_continuous.svg" width="35%"/>
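
The "check every point" idea can be approximated numerically for one-dimensional functions. The sketch below walks along a grid and records the steepest slope it finds; `max_slope` and `looks_1_lipschitz` are hypothetical helper names, and `math.sin` is only an assumed stand-in for the smooth curve in the figure. A grid check like this can only give evidence on the chosen interval, not a proof.

```python
import math

def max_slope(f, lo, hi, steps=1000):
    """Steepest |f(b) - f(a)| / |b - a| over adjacent points of a grid."""
    width = (hi - lo) / steps
    steepest = 0.0
    for i in range(steps):
        a = lo + i * width
        b = a + width
        steepest = max(steepest, abs(f(b) - f(a)) / (b - a))
    return steepest

def looks_1_lipschitz(f, lo, hi, tol=1e-9):
    """True if no sampled slope exceeds one (up to a small tolerance)."""
    return max_slope(f, lo, hi) <= 1.0 + tol

def square(x):
    return x * x

# f(x) = x^2 gets steeper than slope one once |x| > 0.5.
print(looks_1_lipschitz(square, -3.0, 3.0))
# sin(x) has slope cos(x), which never exceeds one in absolute value.
print(looks_1_lipschitz(math.sin, -3.0, 3.0))
```

For a critic that looks at images, the same idea applies to the norm of its gradient with respect to every pixel, which is why checking it everywhere is not practical.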

This condition on the critic's neural network is important for W-Loss because it ensures that the W-Loss function is not only continuous and differentiable, but also that it doesn't grow too much and maintains some stability during training. This is what makes the underlying Earth Mover's Distance, which W-Loss is founded on, valid. It is required for training both the critic's and the generator's neural networks, and it also increases stability because the variation as the GAN learns will be bounded. 

To recap, the critic in a GAN that uses W-Loss for training needs to be 1-Lipschitz Continuous in order for its underlying Earth Mover's Distance comparison between the reals and the fakes to be a valid comparison. 

# 1-Lipschitz Continuity Enforcement

1-Lipschitz continuity, or 1-L continuity, of the critic neural network in your Wasserstein loss GAN ensures that the Wasserstein loss is valid. Recall that the critic being 1-L continuous means that the norm of its gradient is at most one at every single point of this function. 

$$
||\nabla f(x)||_2 \le 1
$$

where the upside-down triangle $\nabla$ is the sign for the gradient, the function $f$ is the critic, and $x$ is the image. This says that the norm of that gradient is less than or equal to one. Using the L2 norm is very common; it is the Euclidean distance, often thought of as the length of the hypotenuse of a triangle. Intuitively, in two dimensions, the slope is less than or equal to one in absolute value. Thus, at every single point of the function, the function stays within the triangles drawn in the images shown earlier. 

Two common ways of ensuring this condition are **weight clipping** and **gradient penalty**. With **weight clipping**, the weights of the critic's neural network are forced to take values within a fixed interval. After you update the weights during gradient descent, you clip any weights outside of the desired interval. That means that weights outside that interval, either too high or too low, are set to the maximum or the minimum value allowed. This is one way of enforcing 1-L continuity, but it has a downside. Forcing the weights of the critic into a limited range of values could limit the critic's ability to learn, and ultimately the GAN's ability to perform: if the critic's weights can't take on many different values, it might not be able to improve easily or find good global optima. So besides enforcing 1-L continuity, weight clipping might also limit the critic too much. On the other hand, it might limit the critic too little if you don't clip the weights enough. There's a lot of hyperparameter tuning involved. 
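
Weight clipping itself is a one-liner. The sketch below treats the critic's weights as a flat list of numbers; the helper names, the learning rate, and the clipping bound `c=0.01` are illustrative assumptions, and the bound is exactly the hyperparameter that needs tuning.

```python
def clip_weights(weights, c=0.01):
    """Force every weight into the interval [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]

def sgd_step_with_clipping(weights, grads, lr=0.1, c=0.01):
    """One plain gradient descent update on the critic, followed by clipping."""
    updated = [w - lr * g for w, g in zip(weights, grads)]
    return clip_weights(updated, c)

# Weights that are too high or too low are set to the nearest bound,
# and weights already inside the interval are left alone.
print(clip_weights([0.5, -0.3, 0.005]))
print(sgd_step_with_clipping([0.0, 0.005], [1.0, -0.01]))
```

A small `c` keeps the critic's slope in check but leaves it very few parameter values to choose from, while a large `c` barely constrains it at all.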

The **gradient penalty**, which is the other method, is a much softer way to encourage the critic to be 1-Lipschitz continuous. With the gradient penalty, all you need to do is add a regularization term to your loss function. 

$$
min_g\ max_c \mathbb{E}(c(x)) - \mathbb{E}(c(g(z))) + \lambda\text{reg}
$$

What this regularization term does to your W-loss function is penalize the critic when its gradient norm is higher than one. The regularization term is $\text{reg}$, and $\lambda$ is just a hyperparameter that sets how much to weigh this regularization term against the main loss function. 

Checking the critic's gradient at every possible point of the feature space is virtually impossible, or at least not practical. Instead, when implementing the gradient penalty, all you do is sample some points by interpolating between real and fake examples. For instance, you could take one image from the set of reals and one image from the set of fakes, and get an intermediate image by interpolating between those two images using a random number epsilon. Epsilon could be, say, 0.3, which is the weight on the real image, and one minus epsilon, the weight on the fake image, would then be 0.7. 

<img src="images/interpolation.svg" width="50%"/>

That gets you a random interpolated image that's in between these two images. I'll call this random interpolated image $\hat{x}$. It's on $\hat{x}$ that you want the norm of the critic's gradient to be less than or equal to one. This is exactly what's happening here: the critic looks at $\hat{x}$, you get the gradient of the critic's prediction on $\hat{x}$, then you take the norm of that gradient, and you want that norm to be one. 

$$
\text{reg} = (||\nabla c(\hat{x})||_2 -1)^2
$$

Here it's simpler to say, "Hey, can I get the norm of the gradient to be one?" as opposed to "at most one", so this term in fact penalizes any value other than one. The exponent two just says, "I want the squared distance, as opposed to perhaps the absolute value," which penalizes values much more when they're further away from one. Specifically, $\hat{x}$ is an intermediate image that weights a real against a fake using epsilon. 

$$
\hat{x} = \epsilon x + (1 - \epsilon) g(z)
$$
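
As a small sketch of this interpolation, assume each image has been flattened into a list of pixel values (the function name `interpolate` and the tiny two-pixel "images" are illustrative only):

```python
import random

def interpolate(real, fake, epsilon):
    """x_hat = epsilon * x + (1 - epsilon) * g(z), pixel by pixel."""
    return [epsilon * r + (1 - epsilon) * f for r, f in zip(real, fake)]

real_image = [1.0, 1.0]   # stand-in for a real example x
fake_image = [0.0, 0.5]   # stand-in for a generated example g(z)

# With epsilon = 0.3 the real image gets weight 0.3 and the fake 0.7.
print(interpolate(real_image, fake_image, 0.3))

# During training, epsilon would be drawn at random for each pair.
epsilon = random.random()
x_hat = interpolate(real_image, fake_image, epsilon)
```

Every pixel of $\hat{x}$ ends up somewhere between the corresponding pixels of the real and the fake image.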

With this method, you're not strictly enforcing 1-L continuity, you're just encouraging it. This has proven to work well, and much better than weight clipping. The complete expression, the loss function that you use for training a GAN with W-loss and gradient penalty, now has two components. 

$$
min_g\ max_c \mathbb{E}(c(x)) - \mathbb{E}(c(g(z))) + \lambda\,\mathbb{E}\left((||\nabla c(\hat{x})||_2 -1)^2\right)
$$
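
Putting the two components together, here is a rough sketch of the critic's side of this loss in plain Python. Several things are assumptions made for illustration: images are flattened lists of numbers, the gradient is estimated with finite differences (a real implementation would rely on automatic differentiation), the toy critics are simple linear functions, and `lam=10.0` is just an example weight. The sketch is written as a loss the critic minimizes, so the main W-loss term is negated and the penalty is added on top.

```python
import math

def numerical_gradient(critic, x, h=1e-5):
    """Central-difference estimate of the gradient of critic at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((critic(up) - critic(down)) / (2 * h))
    return grad

def gradient_penalty(critic, real, fake, epsilon):
    """(||grad c(x_hat)||_2 - 1)^2 at one interpolated point x_hat."""
    x_hat = [epsilon * r + (1 - epsilon) * f for r, f in zip(real, fake)]
    grad = numerical_gradient(critic, x_hat)
    norm = math.sqrt(sum(g * g for g in grad))
    return (norm - 1.0) ** 2

def critic_loss_gp(critic, reals, fakes, epsilons, lam=10.0):
    """Negated W-loss term plus lam times the average gradient penalty."""
    m = len(reals)
    w_term = sum(critic(x) for x in reals) / m - sum(critic(x) for x in fakes) / m
    penalty = sum(
        gradient_penalty(critic, r, f, e) for r, f, e in zip(reals, fakes, epsilons)
    ) / m
    return -w_term + lam * penalty

def steep_critic(x):
    return 3.0 * x[0] + 4.0 * x[1]   # gradient norm 5, heavily penalized

def unit_critic(x):
    return 0.6 * x[0] + 0.8 * x[1]   # gradient norm 1, no penalty

reals, fakes, epsilons = [[1.0, 0.0]], [[0.0, 1.0]], [0.3]
print(gradient_penalty(steep_critic, reals[0], fakes[0], epsilons[0]))
print(critic_loss_gp(unit_critic, reals, fakes, epsilons))
```

The generator's side is unchanged by the penalty, since the regularization term only involves the critic's gradient at the interpolated points.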

First, you approximate the Earth Mover's distance with the main W-loss component. This makes the GAN less prone to mode collapse and vanishing gradients. The second part of this loss function is a regularization term that encourages the critic to meet the condition it needs in order to make the main term valid. Of course, this is only a soft constraint on making the critic 1-Lipschitz continuous, but it has been shown to be very effective. It keeps the norm of the critic's gradient close to one almost everywhere ("almost everywhere" is actually the technical term). 

Wrapping up, we saw two ways of enforcing the critic to be 1-Lipschitz continuous, or 1-L continuous: weight clipping and gradient penalty. Weight clipping might be problematic because you're either strongly limiting the way the critic learns during training or you're being too soft, so there's a bit of hyperparameter tuning. Gradient penalty, on the other hand, is a softer way to enforce 1-Lipschitz continuity. While it doesn't strictly enforce the critic's gradient norm to be at most one at every point, it works better in practice than weight clipping.