Tiled-mode for projector #21

Closed
woctezuma opened this issue Sep 9, 2020 · 5 comments

@woctezuma

Tiled-mode for the projector seems to lead to better fits to real images than Nvidia's original projector. It consists in using all 18 rows of the latent (W(18, 512)) instead of just the first one broadcast to every layer (W(1, 512)), and I think you were the first to introduce this change.

Would you mind explaining the idea behind it and why it works?

Also, does it have limitations depending on what we want to do with the latent afterwards? For instance, is it fitting noise?

It is mentioned in this pull request:
#9

And this short commit which showcases the vanilla mode and the tiled mode:
kreativai@2036fb8

And this longer commit of yours:
bc3face#diff-bc58d315f42a097b984deff88b4698b5
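For reference, here is a minimal shape-level sketch (NumPy pseudocode, not the repo's actual projector code) of the two parameterizations being compared, with tiled-mode understood as optimizing all 18 rows, as described above:

```python
import numpy as np

num_layers, latent_size = 18, 512

# Nvidia's original projector: optimize a single w vector and broadcast it
# to every layer of the synthesis network, i.e. a W(1, 512) latent.
w_single = np.random.randn(1, latent_size)
dlatents_w1 = np.tile(w_single, (num_layers, 1))           # shape (18, 512)

# Tiled-mode as described in this issue: optimize one independent w vector
# per layer, i.e. the full W(18, 512) latent.
dlatents_w18 = np.random.randn(num_layers, latent_size)    # shape (18, 512)

# Both arrays feed the synthesis network identically; the difference is how
# many degrees of freedom the projector gets to optimize.
```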

@woctezuma
Author

woctezuma commented Sep 9, 2020

Limitations of W(18,*) compared to W(1,*) are mentioned in:

  1. Improving initialization #2 (comment)

Encoder output can have high visual quality, but bad semantics.

The W(18, 512) projector, for example, explores W-space so well that its output, beyond a certain number of iterations, becomes meaningless. It is so far from w_avg that the usual applications -- interpolate with dlatents, apply direction vectors obtained from samples, etc. -- won't work as expected.

For comparison: the mean semantic quality score of Z -> W(1, 512) dlatents is 0.44.
A score of 2 is okay (1000 iterations with the W(18, 512) projector).
A score of 4 is bad (5000 iterations with the W(18, 512) projector).

  1. Improving initialization #2 (comment)

On the left is a Z -> W(1, 512) face, ψ=0.75, with a semantics score of 0.28.
On the right is the same face projected into W(18, 512), it=5000, with a score of 3.36.
They both transition along the same "surprise" vector.
On the left, this looks gimmicky, but visually okay.
On the right, you have to multiply the vector by 10 to achieve a comparable amount of change, which leads to obvious artifacts.
As long as you obtain your vectors from annotated Z -> W(1, 512) samples, you're going to run into this problem.

Should you just try to adjust your vectors more cleverly, or find better ones?
My understanding is that this won't work, and that there is no outer W-space where you can smoothly interpolate between all the cool projections that are missing from the regular inner W-space mappings.
(Simplified: a Z latent is a unit vector, a point on a 512-dimensional sphere.
Ideally, young-old would be north-south pole, male-female east-west pole, smile-unsmile front-back pole, and so on.
W(1, 512) is a learned deformation of that surface that accounts for the uneven distribution of features in FFHQ.
W(18, 512) is a synthesizer option that allows for style mixing and can be abused for projection.
But all the semantics of StyleGAN reside on W(1, 512). W(18, 512) vectors filled with 18 Z -> W(1, 512) mappings already belong to a different species.
High-quality projections are paintings of faces.)

  1. Improving initialization #2 (comment)

Single-layer mixing of a face with the projection of the "Mona Lisa", using W(18, *).
Artifacts appear when using the projection result after 5000 iterations, compared to the result after 1000 iterations.

An interesting paper (Image2StyleGAN) is mentioned here.
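To make the direction-vector problem quoted above concrete, here is a toy sketch (NumPy pseudocode; the names and the 10x factor are illustrative, taken from the quoted comment):

```python
import numpy as np

num_layers, latent_size = 18, 512
# Stand-in for a direction learned from annotated Z -> W(1, 512) samples.
surprise_direction = np.random.randn(latent_size)

# Z -> W(1, 512) face: the direction works at its normal strength.
w_sampled = np.random.randn(latent_size)
edited_sampled = np.tile(w_sampled + 1.0 * surprise_direction, (num_layers, 1))

# W(18, 512) projection after many iterations: far from w_avg, so the same
# direction has to be scaled up (roughly 10x in the quoted example) to produce
# a comparable change, which introduces artifacts.
w_projected = np.random.randn(num_layers, latent_size)
edited_projected = w_projected + 10.0 * surprise_direction  # broadcasts over the 18 layers
```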

@woctezuma
Author

woctezuma commented Sep 10, 2020

Related issues:
pbaylies/stylegan-encoder#1
pbaylies/stylegan-encoder#2
Puzer/stylegan-encoder#6

Limitations mentioned there:

  1. Latent layers importance Puzer/stylegan-encoder#6 (comment)

When experimenting with the net, I've noticed StyleGAN behaves much better when it comes to interpolation & mixing if you "play by the rules", e.g. use a single 1x512 dlatent vector to represent your target image.
With 18x512, we're kind of cheating. In fact, Image2StyleGAN shows that you can encode images this way on a completely randomly initialized net! (Although interpolation is pretty meaningless in that instance.)

  1. https://github.com/pender/stylegan-encoder

Why limit the encoded latent vectors to shape [1, 512] rather than [18, 512]?

  • The mapping network of the original StyleGAN outputs [1, 512] latent vectors, suggesting that the reconstructed images may better resemble the natural outputs of the StyleGAN network.
  • Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space? (Abdal, Qin & Wonka 2019) demonstrated that use of the full [18, 512] latent space allows all manner of images to be reproduced by the pretrained StyleGAN network, even images highly dissimilar to training data, perhaps suggesting that the accuracy of the encoded images more reflects the amount of freedom afforded by the expanded latent vector than the domain expertise of the network.
  1. Inverse network output shape pbaylies/stylegan-encoder#1 (comment)

My goal with this encoder is to be able to encode faces well into the latent space of the model; by constraining the total output to [1, 512], it's tougher to get a very realistic face without artifacts. Because the dlatents run from coarse to fine, it's possible to mix them for more variation and finer control over details, which NVIDIA does in the original paper. In my experience, an encoder trained like this does a good job of generating smooth and realistic faces, with fewer artifacts than the original generator.

I am open to having [1, 512] as an option for building a model, but not as the only option, because I don't believe it will ultimately perform as well for encoding as using the entire latent space -- but it will surely train faster!
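A small sketch of the coarse-to-fine mixing idea mentioned in the quote above (NumPy pseudocode; the layer split index is illustrative, not the repo's exact choice):

```python
import numpy as np

num_layers, latent_size = 18, 512
dlatents_a = np.random.randn(num_layers, latent_size)  # provides coarse structure (pose, face shape)
dlatents_b = np.random.randn(num_layers, latent_size)  # provides fine details (color, texture)

mixed = dlatents_a.copy()
mixed[8:] = dlatents_b[8:]  # early layers from A, later layers from B; vary the cut to control the mix
```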

@woctezuma
Author

woctezuma commented Sep 13, 2020

This is actually discussed in the StyleGAN2 paper, because the trick of extending the latent to increase image quality had already been used by some people with StyleGAN1.

article

@woctezuma
Author

Also, in the StyleGAN1 repository:

The dlatents array stores a separate copy of the same w vector for each layer of the synthesis network to facilitate style mixing.

https://github.com/NVlabs/stylegan#using-pre-trained-networks
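A hedged sketch of what that README sentence looks like in practice, assuming a pre-trained StyleGAN1 network Gs loaded as in the linked README (shapes are for the 1024x1024 FFHQ model; treat the exact calls as illustrative):

```python
import numpy as np
# Assumes Gs is a pre-trained StyleGAN1 network, loaded as shown in the linked README.

z = np.random.randn(1, 512)                      # a Z latent
dlatents = Gs.components.mapping.run(z, None)    # shape (1, 18, 512)

# Every row along axis 1 is a copy of the same w vector, tiled so that style
# mixing can later overwrite individual layers.
assert np.allclose(dlatents[0, 0], dlatents[0, -1])

images = Gs.components.synthesis.run(dlatents, randomize_noise=False)
```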

@woctezuma
Author

woctezuma commented Feb 16, 2021

Relevant paper discussing the trade-off between visual quality and semantic quality:

Tov, O., Alaluf, Y., Nitzan, Y., Patashnik, O., & Cohen-Or, D. (2021). Designing an Encoder for StyleGAN Image Manipulation. arXiv preprint arXiv:2102.02766.
https://arxiv.org/abs/2102.02766

It is by the people behind https://github.com/orpatashnik/StyleCLIP

Semantic quality is called "editability".
Visual quality is divided into two elements:

  • distortion (distance between the input and the projection)
  • perception (realism of the projection).
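As a rough illustration only (hypothetical helper names; the actual losses in the paper are more involved), the three criteria can be thought of as competing terms in a projection or encoding objective:

```python
# Rough sketch in PyTorch-style pseudocode; G, lpips and w_avg are assumed to
# be provided, and the weights are arbitrary placeholders.
def projection_objective(w, target, G, lpips, w_avg,
                         lambda_perception=0.8, lambda_editability=0.005):
    image = G.synthesis(w)                       # render the candidate latent
    distortion = ((image - target) ** 2).mean()  # distance between input and projection
    perception = lpips(image, target)            # realism of the projection
    editability = ((w - w_avg) ** 2).mean()      # stay near well-behaved latents so edits keep working
    return distortion + lambda_perception * perception + lambda_editability * editability
```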
