model synthesizes features to fool the loss #28
Also, I wonder: when the target pose / representation is learned early in the training process, does it still improve / change later?
Hi @petercmh01. Yeah, this is a failure mode that can pop up during training. It's a result of the second STN generating high frequency flow fields to fool the perceptual loss. Here are a few tips that might be helpful for mitigating it.

(1) We have two regularizers that try to keep the highly-expressive flow STN under control: the total variation penalty on the flow, whose strength is set by `tv_weight`, and the identity regularizer, which penalizes the flow's deviation from the identity transform.

(2) As a more extreme version of (1), you can also try training a similarity-only STN and see how performance is there. You won't get a super accurate alignment, but it might be a good sanity check to make sure you get an accurate coarse alignment before adding on the flow STN.

(3) Training with reflection padding instead of the default padding mode can also help.

(4) A good sanity check is looking at the visuals created for the learned target mode (template) during training. You want to make sure that whatever StyleGAN latent mode is being learned is "reachable" from most images in your dataset. For example, if you trained a unimodal GANgealing model on LSUN Horses, you might get a side-profile of a horse as the discovered latent template. It will be virtually impossible to align, e.g., images of the front of horses to this mode, and you'll get bad high frequency flows for those images instead. If the learned mode seems "bad" or unreasonable, you can try adjusting the parameters that control the learned mode (e.g., `--ndirs` or `--inject`, discussed below).

(5) One thing to keep in mind is that there will always be some images for which this happens, even if the model is generally quite good. For example, our LSUN Cats model generates high frequency flow fields when it gets an image without a cat's upper-body visible or with significant out-of-plane rotation of the cat's face (we show some examples of this in the last figure of the paper's supplementary materials). The hope, of course, is that these failures by-and-large should only occur for "unalignable" images.

The above tips are all for training. There are a couple of things you can do at test time that somewhat mitigate this problem, but in my experience most of the mileage comes from adjusting training hyperparameters.

We use a technique at test time called "recursive alignment" which is important to get the best results on hard datasets (LSUN). We talk about this a bit in our supplementary, but the idea is really simple: you just repeatedly feed the aligned image output by the similarity STN back into itself N times before forwarding to the flow STN (we set N=3 for LSUN). In the code, this simply amounts to running the similarity STN's forward pass the desired number of times (a sketch of both test-time tricks follows below).

Our flipping algorithm takes advantage of the tendency of the STN to "fail loudly" (i.e., it produces high frequency flows both for failure-case images and for unalignable images); basically, you can probe the flow STN with x and flip(x) and see which of the two yields the smoother flow. In general, a smooth flow --> no/few high frequencies --> the STN is doing its job correctly without "cheating." Flipping is a test-time only operation that can significantly help address the issue without re-training (training visuals don't use flipping). But again, if all of your images are exhibiting high frequency flows, this won't solve the issue by itself. Also, it only helps models with asymmetric learned templates. So, for example, flipping helps with LSUN Bicycles and CUB, but it won't really do anything for LSUN Cats and CelebA.

Regarding training length, you almost certainly don't need to train beyond a million iterations unless you're trying to squeeze out every last drop of performance.
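Here's a rough PyTorch sketch of those two test-time tricks (recursive alignment and flip probing via flow smoothness). The `sim_stn` / `flow_stn` modules and their return values are stand-ins, not gangealing's actual API; this is a minimal illustration, assuming the flow STN returns both the warped image and its flow field:

```python
import torch

def tv_loss(flow):
    """Total variation smoothness of an (N, H, W, 2) flow field."""
    dh = (flow[:, 1:, :, :] - flow[:, :-1, :, :]).abs().mean()
    dw = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return dh + dw

def recursive_align(sim_stn, x, n_iters=3):
    """Recursive alignment: feed the similarity STN's output back into itself."""
    for _ in range(n_iters):
        x = sim_stn(x)  # each pass refines the coarse alignment
    return x

def align_with_flip_probe(sim_stn, flow_stn, x):
    """Probe the flow STN with x and flip(x); keep whichever yields the smoother flow."""
    best, best_tv = None, float("inf")
    for candidate in (x, torch.flip(x, dims=[-1])):  # original and horizontally flipped
        coarse = recursive_align(sim_stn, candidate, n_iters=3)
        aligned, flow = flow_stn(coarse)  # assumed to return (warped image, flow field)
        smoothness = tv_loss(flow)
        if smoothness < best_tv:
            best, best_tv = aligned, smoothness
    return best
```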
I have seen high frequency flows get a bit better for "hard" images at around the ~700K iteration mark, but you can usually get a reasonable sense of how training is going by ~200K iterations, and the remaining hard images seem to work themselves out by ~700K. For your question about the target latent: in my experience it converges very rapidly (usually after 150K to 200K iterations) and doesn't change much afterwards. We continue training it jointly with the STN for the entirety of training out of simplicity, but I'm pretty sure you could get away with freezing it after some point if you wanted to. You'd probably get a nice training speed-up from this since backpropagating through the generator is quite costly.
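For anyone who wants to try that freezing idea, here's a minimal sketch, assuming hypothetical names (`G`, `target_latent`, `FREEZE_AT`); gangealing's actual training loop will differ:

```python
import torch

FREEZE_AT = 200_000  # assumed convergence point for the target latent (see above)

def synthesize_template(G, target_latent, step):
    """Hypothetical sketch: freeze the learned template after FREEZE_AT iterations."""
    if step >= FREEZE_AT:
        target_latent.requires_grad_(False)
        with torch.no_grad():  # skipping the backward pass through G is the speed-up
            return G(target_latent)
    return G(target_latent)    # early training: latent still learned jointly with the STN
```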
Increasing `tv_weight` is the first thing I'd try here. I never tried changing the identity regularizer's weight myself.
Thanks a lot, I will give it a try!
Yeah, that should do the trick. Also make sure to set the weight high enough to actually see an effect.
Thanks a lot! I have gotten very robust coarse alignment on my dataset now by using a tv_weight of 40000. For the mask for dense tracking, should I use the average transformed image or the truncated image?
Awesome! I made all the masks using the average transformed image. You'll probably get similar results using the average truncated image, though.
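For reference, one possible way to build such a mask (an assumption, not necessarily the exact procedure used here) is to average the STN-aligned images over the dataset and threshold the result:

```python
import torch

@torch.no_grad()
def average_transformed_image(stn, loader, device="cuda"):
    """Average the STN-aligned (congealed) images over the whole dataset."""
    total, count = None, 0
    for x in loader:                     # batches of shape (N, C, H, W)
        aligned = stn(x.to(device))      # congeal each batch
        total = aligned.sum(0) if total is None else total + aligned.sum(0)
        count += aligned.shape[0]
    return total / count                 # (C, H, W) average transformed image

def mask_from_average(avg, thresh=0.1):
    """Threshold the average image (assumed in [-1, 1]) into a binary tracking mask."""
    return (avg.abs().mean(0) > thresh).float()  # (H, W); tune thresh by inspection
```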
Hey @wpeebles, now I'm trying to slowly work toward more accurate alignment, building on the previous parameters that gave good coarse alignment. I think my latent target mode is "reachable" but is missing a few features (for example, say I'm trying to align cats but the target mode only has the face and is missing the cat's ears, or I'm trying to align birds but it's missing the bird's head). Would slightly adjusting the --ndirs or --inject parameters be helpful? I've seen the inject parameter impact it a bit in my previous experiments. Thank you!
It's possible changing `--ndirs` or `--inject` could help with this. Another option is to specify the target mode yourself instead of learning it: you could sample latents from the generator and hand-pick one that contains the missing features, or use GAN inversion on a representative image from your dataset to obtain its latent.
If you try either of these methods, you'll want to supply your chosen latent to the training script in place of the learned one.
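A generic sketch of the GAN-inversion route (the interfaces for `G` and `lpips_fn` are assumptions, not gangealing's code): optimize a w+ latent so the generator reconstructs a reference image whose pose and features you want as the target mode.

```python
import torch

def invert(G, target_img, lpips_fn, n_layers=14, steps=1000, lr=0.05, device="cuda"):
    """Optimize a w+ latent so G(w) reconstructs target_img (assumed interfaces)."""
    # In practice you'd initialize from the generator's mean w for faster
    # convergence; zeros are used here only to keep the sketch self-contained.
    w = torch.zeros(1, n_layers, 512, device=device, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        recon = G(w)                                      # synthesize from the current latent
        loss = lpips_fn(recon, target_img).mean() \
             + 0.1 * (recon - target_img).pow(2).mean()   # perceptual + pixel terms
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()  # use this latent as the (manually specified) target mode
```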
Just to follow up: I tried this GAN inversion method https://github.com/abdulium/gan-inversion-stylegan2/blob/main/Task2_GANInversion_YOURNAMEHERE.ipynb and I got my desired target mode. Thanks for your help again!
Hello:
I've been using the model on my own custom dataset for a while. When I visualize the congealing process on the test set and the propagated dense tracking, I noticed the model synthesizing features to fool the loss.
I read the paper's discussion of using flow smoothness and flipping to avoid this issue, and I understand it can occur a lot. How exactly do flipping and flow smoothness help avoid this issue? What parameters can I adjust to make my model more robust? Does the model improve on this issue after 1 million iterations? For resource reasons I haven't been able to run to 1 million iterations yet, but I read in the other issue post that you mentioned it usually takes that long for the model to improve.
I have tried the default settings scripts for the 1-head, 2-head, and 4-head configurations. I also tried increasing the inject and flow_size parameters, and I tried turning on the sample_from_full_resolution option, but I haven't made much progress with these trials yet.
Thanks in advance, and I really appreciate that you're consistently helping out :)