# Wasserstein GAN
> Martin Arjovsky (Courant Institute of Mathematical Sciences), Soumith Chintala (Facebook AI Research), and Léon Bottou (Courant Institute of Mathematical Sciences, Facebook AI Research)

- toc:true
- branch: master
- badges: false
- comments: false 
- author: 최서연
- categories: [Wasserstein GAN, GAN]

ref: https://arxiv.org/pdf/1701.07875.pdf

https://jonathan-hui.medium.com/gan-wasserstein-gan-wgan-gp-6a1a2aa1b490

https://ahjeong.tistory.com/7

## Different Distances

- Let $X$ be a compact metric set (such as the space of images $[0, 1]^d$) 
- and let $Σ$ denote the set of all the Borel subsets of $X$ . 
- Let $Prob(X )$ denote the space of probability measures defined on $X$ .
- We can now define elementary distances and divergences between two distributions $P_r, P_g ∈ Prob(X )$
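    - The distances and divergences considered are the Total Variation (TV) distance, the Kullback-Leibler (KL) divergence, the Jensen-Shannon (JS) divergence, and the Earth-Mover (EM) distance, also called Wasserstein-1.
    - For the KL divergence, $P_r$ and $P_g$ are assumed to be absolutely continuous, so both admit densities with respect to the same measure $μ$ defined on $X$.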
- The following example illustrates how apparently simple sequences of probability distributions converge under the EM distance but do not converge under the other distances and divergences listed above (TV, KL, and JS).
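
For reference, the four quantities can be written out as follows. These are the standard definitions in the notation used above. The JS divergence is written with $\tfrac{1}{2}$ weights, which is an assumption about normalization chosen so that two distributions with disjoint supports give $\log 2$. $Π(P_r, P_g)$ denotes the set of joint distributions $γ(x, y)$ whose marginals are $P_r$ and $P_g$, and $P_m = \tfrac{1}{2}(P_r + P_g)$.

$$
δ(P_r, P_g) = \sup_{A ∈ Σ} |P_r(A) - P_g(A)|
$$

$$
KL(P_r \| P_g) = \int \log\left(\frac{P_r(x)}{P_g(x)}\right) P_r(x)\, dμ(x)
$$

$$
JS(P_r, P_g) = \tfrac{1}{2} KL(P_r \| P_m) + \tfrac{1}{2} KL(P_g \| P_m)
$$

$$
W(P_r, P_g) = \inf_{γ ∈ Π(P_r, P_g)} E_{(x, y) ∼ γ}\big[\|x - y\|\big]
$$

Of the four, only the EM distance involves the metric of $X$ through $\|x - y\|$, which is why it can register how far apart two non-overlapping distributions are.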

*Example 1 (Learning parallel lines)*

- Let $Z ∼ U[0, 1]$ be uniformly distributed on the unit interval.
- Let $P_0$ be the distribution of $(0, Z) ∈ R^2$ (a 0 on the x-axis and the random variable Z on the y-axis), uniform on a straight vertical line passing through the origin.
- Now let $g_θ(z) = (θ, z)$ with $θ$ a single real parameter. 
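- It is easy to see that in this case,
    - $W(P_0, P_θ) = |θ|$,
    - $JS(P_0, P_θ) = \log 2$ if $θ ≠ 0$, and $0$ if $θ = 0$,
    - $KL(P_θ \| P_0) = KL(P_0 \| P_θ) = +∞$ if $θ ≠ 0$, and $0$ if $θ = 0$,
    - $δ(P_0, P_θ) = 1$ if $θ ≠ 0$, and $0$ if $θ = 0$.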

- When $θ_t → 0$, the sequence $(P_{θ_t})_{t∈N}$ converges to $P_0$ under the EM distance, but does not converge at all under either the JS, KL, reverse KL, or TV divergences.
- Figure 1 illustrates this for the case of the EM and JS distances.

- Example 1 gives us a case where we can learn a probability distribution over a low dimensional manifold by doing gradient descent on the EM distance. 
- This cannot be done with the other distances and divergences because the resulting loss function is not even continuous.
    - In Example 1 the JS divergence jumps from $\log 2$ to $0$ at $θ = 0$ and is constant elsewhere, so its gradient says nothing about which way to move $θ$.
- Although this simple example features distributions with disjoint supports, the same conclusion holds when the supports have a non-empty intersection contained in a set of measure zero.
- This happens to be the case when two low dimensional manifolds intersect in general position.
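
The gradient-descent claim in Example 1 can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: it plugs in the closed forms for the parallel-lines example ($W = |θ|$, and $JS = \log 2$ for $θ ≠ 0$), and the function names, learning rate, and step count are all hypothetical choices.

```python
import math

def em_loss(theta):
    """W(P_0, P_theta) = |theta| for the parallel-lines example."""
    return abs(theta)

def em_grad(theta):
    """A subgradient of |theta|: sign(theta), taken as 0 at theta = 0."""
    return (theta > 0) - (theta < 0)

def js_loss(theta):
    """JS(P_0, P_theta): log 2 whenever the lines differ, 0 when they coincide."""
    return 0.0 if theta == 0 else math.log(2)

def js_grad(theta):
    """JS is constant for theta != 0, so its derivative there is 0.
    At theta = 0 the loss is discontinuous; 0 is returned as a placeholder."""
    return 0.0

def descend(grad, theta, lr=0.05, steps=100):
    """Plain gradient descent with a constant step size (assumed settings)."""
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

theta_em = descend(em_grad, theta=1.0)
theta_js = descend(js_grad, theta=1.0)
print(f"EM: theta = {theta_em:.4f}, loss = {em_loss(theta_em):.4f}")
print(f"JS: theta = {theta_js:.4f}, loss = {js_loss(theta_js):.4f}")
```

With a constant step size, the EM run should end within one step of $θ = 0$, while the JS run never leaves its starting point because every gradient it sees is zero.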

![](https://xiucheng.org/assets/images/wgan-example.png)

Since the Wasserstein distance is much weaker than the JS distance, we can now ask whether $W(P_r, P_θ)$ is a continuous loss function of $θ$ under mild assumptions. This, and more, is true, as we now state.

*Theorem 1.*

- Let $P_r$ be a fixed distribution over $X$ .
- Let $Z$ be a random variable (e.g. Gaussian) over another space $\mathcal{Z}$.
- Let $g : \mathcal{Z} × R^d → X$ be a function, which will be denoted $g_θ(z)$ with $z$ the first coordinate and $θ$ the second.
- Let $P_θ$ denote the distribution of $g_θ(Z)$. Then,
1. If $g$ is continuous in $θ$, so is $W(P_r, P_θ)$.
2. If $g$ is locally Lipschitz and satisfies regularity assumption 1, then $W(P_r, P_θ)$ is continuous everywhere, and differentiable almost everywhere.
    - In other words, a Lipschitz-type condition on $g$ gives not only a continuous EM distance but also usable gradients almost everywhere.
3. Statements 1-2 are false for the Jensen-Shannon divergence $JS(P_r, P_θ)$ and for both KL divergences, $KL(P_r \| P_θ)$ and $KL(P_θ \| P_r)$.
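
Statement 2 relies on a regularity assumption that is not restated in these notes. Paraphrasing the paper: $g$ is locally Lipschitz if every point $(θ, z)$ has a neighborhood $U$ and a constant $L(θ, z)$ such that, for all $(θ', z') ∈ U$,

$$
\|g_θ(z) - g_{θ'}(z')\| \le L(θ, z)\,\big(\|θ - θ'\| + \|z - z'\|\big)
$$

and $g$ satisfies assumption 1 for a distribution $p$ over the latent space if these local constants can be chosen so that

$$
E_{z ∼ p}\big[L(θ, z)\big] < +∞
$$

Roughly, the generator may be steep in places, as long as its steepness averaged over the prior stays finite.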

The following corollary tells us that learning by minimizing the EM distance makes sense (at least in theory) with neural networks.

*Corollary 1*

- Let $g_θ$ be any feedforward neural network parameterized by $θ$, and $p(z)$ a prior over $z$ such that $E_{z∼p(z)}[\|z\|] < ∞$ (e.g. Gaussian, uniform, etc.).
- Then assumption 1 is satisfied and therefore $W(P_r, P_θ)$ is continuous everywhere and differentiable almost everywhere.
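
The continuity claim can be checked numerically on a toy case. The sketch below is an illustration under stated assumptions, not the paper's experiment: the one-unit "network" $g_θ(z) = \tanh(θ z)$, the uniform stand-in for $P_r$, the sample size, and the grid of $θ$ values are all hypothetical. It relies on the fact that, for two equal-size one-dimensional samples, the empirical Wasserstein-1 distance is the mean gap between sorted values.

```python
import math
import random

def empirical_w1(xs, ys):
    """W1 between two equal-size 1-D samples: mean gap between sorted values."""
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def g(theta, z):
    """Toy one-unit 'network' (hypothetical): g_theta(z) = tanh(theta * z)."""
    return math.tanh(theta * z)

random.seed(0)
n = 2000
real = [random.uniform(-1.0, 1.0) for _ in range(n)]   # stand-in samples for P_r
noise = [random.gauss(0.0, 1.0) for _ in range(n)]     # z ~ N(0, 1), so E|z| is finite

step = 0.1
thetas = [k * step for k in range(0, 31)]              # theta on a grid in [0, 3]
losses = [empirical_w1(real, [g(t, z) for z in noise]) for t in thetas]

# tanh is 1-Lipschitz, so |g_theta(z) - g_theta'(z)| <= |z| * |theta - theta'|,
# which gives |W(theta) - W(theta')| <= mean|z| * |theta - theta'|.
mean_abs_z = sum(abs(z) for z in noise) / n
max_jump = max(abs(a - b) for a, b in zip(losses, losses[1:]))
print(f"largest change between neighbouring thetas: {max_jump:.4f}")
print(f"bound mean|z| * step:                       {mean_abs_z * step:.4f}")
```

The largest change between neighbouring grid points should never exceed the Lipschitz bound, which is the mechanism behind the corollary: a finite $E\|z\|$ keeps the average steepness of the generator, and hence of the loss, under control.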

All this shows that EM is a much more sensible cost function for our problem than at least the Jensen-Shannon divergence. The following theorem describes the relative strength of the topologies induced by these distances and divergences, with KL the strongest, followed by JS and TV, and EM the weakest.
- In terms of relative strength, KL is the strongest, JS and TV are equivalent to each other (Theorem 2), and EM is the weakest.

*Theorem 2.*

- Let $P$ be a distribution on a compact space $X$ and $(P_n)_{n∈N}$ be a sequence of distributions on $X$ .
- Then, considering all limits as $n → ∞$, 
1. The following statements are equivalent 
    - $δ(P_n, P) → 0$ with $δ$ the total variation distance.
    - $JS(P_n, P) → 0$ with JS the Jensen-Shannon divergence.
2. The following statements are equivalent
    - $W(P_n, P) → 0.$
    - $P_n \xrightarrow{D} P$ where $\xrightarrow{D}$ represents convergence in distribution for random variables.
3. $KL(P_n \| P) → 0$ or $KL(P \| P_n) → 0$ imply the statements in (1).
4. The statements in (1) imply the statements in (2).
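
The implications in Theorem 2 only run one way, and a small discrete example makes that concrete. The sketch below is an illustration with hypothetical helper names (`tv`, `kl`, `js`, `w1`), restricted to distributions with finitely many atoms on the real line and stored as `{point: probability}` dictionaries. It uses the standard discrete formulas, including the one-dimensional identity that $W_1$ equals the area between the two CDFs.

```python
import math

def tv(p, q):
    """Total variation: half the L1 distance between the two pmfs."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def kl(p, q):
    """KL(p || q); infinite if p puts mass where q has none."""
    total = 0.0
    for k, pk in p.items():
        if pk == 0.0:
            continue
        qk = q.get(k, 0.0)
        if qk == 0.0:
            return math.inf
        total += pk * math.log(pk / qk)
    return total

def js(p, q):
    """Jensen-Shannon divergence with 1/2 weights (assumed normalization)."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1(p, q):
    """Wasserstein-1 on the real line: area between the two CDFs."""
    xs = sorted(set(p) | set(q))
    total, cdf_gap = 0.0, 0.0
    for left, right in zip(xs, xs[1:]):
        cdf_gap += p.get(left, 0.0) - q.get(left, 0.0)
        total += abs(cdf_gap) * (right - left)
    return total

p = {0.0: 1.0}                       # point mass at 0
for n in (1, 10, 100, 1000):
    p_n = {1.0 / n: 1.0}             # point mass at 1/n
    print(n, w1(p_n, p), tv(p_n, p), js(p_n, p), kl(p_n, p))
```

Here the point masses at $1/n$ converge in distribution to the point mass at $0$, so the EM column shrinks like $1/n$, while TV, JS, and KL stay fixed at their values for disjoint supports. The statements in (2) therefore hold without those in (1), which is the sense in which EM induces the weakest topology.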

This highlights the fact that the KL, JS, and TV distances are not sensible cost functions when learning distributions supported by low dimensional manifolds. The EM distance, however, is sensible in that setup. This leads to the next section, which introduces a practical approximation of optimizing the EM distance.

## Wasserstein GAN

- Again, Theorem 2 points to the fact that $W(P_r, P_θ)$ might have nicer properties when optimized than $JS(P_r, P_θ)$. 
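- However, the infimum in the definition of the EM distance (equation (1) in the paper), taken over all joint distributions with marginals $P_r$ and $P_θ$, is highly intractable.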