
BigGAN: consistency regularization (SimCLR-style) loss #11

Open
gwern opened this issue May 3, 2020 · 4 comments
gwern commented May 3, 2020

Self-supervision/semi-supervised learning is ultra-hot now, with new SOTAs being set in DRL by shockingly simple methods, and self-supervised learning becoming competitive with classical supervised CNNs at ImageNet classification. Self-supervised auxiliary losses have also been slightly helpful in the latest variants of BigGAN.

Hypothetically, adding a self-supervised loss which forces the Discriminator to learn more about images could stabilize training (by providing a second loss unrelated to the unstable zero-sum dynamics of GAN training) and make D learn better semantics & more meaningful classifications for teaching G.

Skylion ran initial experiments with a simple rotation loss from SS-GAN, where D tries to predict how an image has been randomly rotated. This helped a little.
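A minimal sketch of what that auxiliary rotation loss looks like (`d_features`/`rot_head` are hypothetical names for D's backbone and an extra linear head, not anything in the actual codebase):

```python
import torch
import torch.nn.functional as F

def rotation_ss_loss(d_features, rot_head, images):
    """SS-GAN-style auxiliary loss: D must predict which of 4 rotations
    was applied to each image. d_features maps images to feature vectors;
    rot_head maps features to 4 rotation logits (both names hypothetical)."""
    rotated, labels = [], []
    for k in range(4):  # 0, 90, 180, 270 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long,
                                 device=images.device))
    logits = rot_head(d_features(torch.cat(rotated)))
    return F.cross_entropy(logits, torch.cat(labels))
```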

SimCLR establishes that cropping & color-distorting an image into two views and forcing the network to encode them similarly ('consistency') works extremely well for learning classification, and various DRL papers establish that even just cropping plus a consistency loss is amazingly effective in DRL. A prototype by lucidrains using just cropping+flipping showed some promise in BigGAN runs: the proto-CLR runs seemed to learn better overall structure, despite problems with balancing the proto-CLR loss against the regular classification loss, and the slowdown introduced by an additional training phase.

We would like to use full SimCLR-like distortion + consistency training on BigGAN to train D on distorted real & fake images (Zhao et al show that, for GANs, augmenting both is better than augmenting just the reals).
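A minimal sketch of what the D side could look like in PyTorch (augmentation parameters, and the `d_embed`/`nt_xent` names, are illustrative assumptions, not the actual setup):

```python
import torch
import torchvision.transforms as T

# SimCLR-style augmentation pipeline: crop -> flip -> color distortion.
# (In practice you would augment per-image; applied to a batched tensor,
# these transforms reuse one set of random parameters for the whole batch.)
simclr_aug = T.Compose([
    T.RandomResizedCrop(128, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
])

def d_contrastive_loss(d_embed, nt_xent, reals, fakes):
    """Train D's embedding on augmented reals *and* fakes (per Zhao et al).
    d_embed: D backbone + projection head (hypothetical name);
    nt_xent: a SimCLR NT-Xent loss over two batches of views
    (sketched later in this thread)."""
    images = torch.cat([reals, fakes.detach()])  # detach: G is not updated here
    view1, view2 = simclr_aug(images), simclr_aug(images)
    return nt_xent(d_embed(view1), d_embed(view2))
```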


gwern commented May 8, 2020

We can do SimCLR on G by treating a minibatch of generated samples as negative samples for each other (the negative loss), and by generating slightly-variant sibling images from slightly-varied z vectors, which serve as positive samples for each other (the positive loss). This may enable removing the adversarial loss entirely, replacing it with a self-supervised perceptual loss.


One idea the ICR paper introduces is zCR-GAN: a CR-GAN where the Generator is given its own consistency loss, in which 2 images generated from almost the same z latent vector are encouraged to come out different, as a way of fighting mode collapse and encouraging diversity.
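As I understand the zCR generator term, it amounts to something like this sketch (the perturbation scale and all names are illustrative, not the paper's values):

```python
import torch

def zcr_gen_loss(G, z, sigma=0.05):
    """zCR generator term: perturb z slightly and penalize the sibling
    images for being too *similar*, encouraging diversity. sigma is an
    illustrative perturbation scale."""
    z_jitter = z + sigma * torch.randn_like(z)
    # Negative sign: we are maximizing the image-space distance.
    return -torch.mean((G(z) - G(z_jitter)) ** 2)
```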

This raises a question: if you have SimCLR implemented in D, learning embeddings via a contrastive loss on data-augmented images as outlined above (a souped-up CR-GAN), why not add an embedding-related loss to the outputs of G as well, similar to zCR-GAN?

In this idea, G takes random z latents, generates images, and feeds them into D to get SimCLR embeddings. These embeddings can then be trained with a contrastive loss, updating the parameters of G (while freezing D) to push the minibatch of sampled images apart from each other (each sample is a negative sample for all the others).

This is fine, but using only negative samples seems inadequate, and it might simply blow out G. We also need a source of positive examples.

Data augmentation won't work as it does for D, because G is generating the same image, so there's nothing to train on. However, G has data augmentation built in, in the form of small random changes to a given z latent vector. So G can generate positive samples simply by jittering each z and generating similar sibling images. This provides both positive & negative samples for a SimCLR-like contrastive loss on G.
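A minimal sketch of that G-side contrastive loss (all names and hyperparameters are hypothetical):

```python
import torch
import torch.nn.functional as F

def g_contrastive_loss(G, d_embed, z, sigma=0.05, temperature=0.1):
    """Contrastive loss on G's outputs: jittered-z siblings are positives,
    every other sample in the minibatch is a negative. D is 'frozen' only
    in the sense that its optimizer is not stepped here; gradients still
    flow through d_embed back to G."""
    z_jitter = z + sigma * torch.randn_like(z)
    h1 = F.normalize(d_embed(G(z)), dim=1)         # (N, d) embeddings
    h2 = F.normalize(d_embed(G(z_jitter)), dim=1)  # sibling embeddings
    logits = h1 @ h2.t() / temperature             # cosine similarities
    # Diagonal entries are the positive sibling pairs; off-diagonal
    # entries are negatives, pushing the rest of the minibatch apart.
    labels = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, labels)
```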

One might wonder: is the adversarial loss even necessary at this point? Why not have 'D' simply be a SimCLR CNN which constantly trains on real & fake images, while G does its generative variant, benefiting from the 'D' SimCLR embedding? No supervision beyond the two contrastive losses.

So D loops through batches of reals/fakes, refining its SimCLR embedding; meanwhile, G keeps generating batches of fakes, pushing samples from random z vectors as far apart as possible while still generating similar images for neighboring z vectors. All very smooth and nonadversarial: the negative examples fight mode collapse, the positive examples create a meaningful latent space, and the fake images help bootstrap the SimCLR loss by providing lots of weird images to train on. So it's like zCR-GAN, except way cooler and contrastier. (The G loss is only the contrastive loss: G tries to make nearby z vectors get similar contrastive embeddings. My hypothesis is that only realistic images will produce well-behaved embeddings from the real-image-trained SimCLR D model, so G will constantly be molded towards realism.)
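Put together, one training step of this fully contrastive scheme might look like the following sketch (reusing the hypothetical `d_contrastive_loss`/`g_contrastive_loss`/`nt_xent` from the earlier sketches):

```python
def train_step(G, d_embed, nt_xent, opt_d, opt_g, reals, z):
    # 1. Refine D's SimCLR embedding on reals + current fakes
    #    (fakes are detached inside d_contrastive_loss, so G is untouched).
    opt_d.zero_grad()
    d_contrastive_loss(d_embed, nt_xent, reals, G(z)).backward()
    opt_d.step()

    # 2. Update G against the embedding: jittered-z positives,
    #    minibatch negatives. No adversarial term anywhere.
    opt_g.zero_grad()
    g_contrastive_loss(G, d_embed, z).backward()
    opt_g.step()
```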

One question would be: should the fakes be fed into D during SimCLR training, or should D only see real images? It seems like D should see fakes in order to provide more of a signal to G (imagine the beginning of training, when G is only generating random static noise: a SimCLR trained only on real images might have little to say about those early samples), but that does introduce a dependency/feedback loop whose dynamics I'm not sure about: G will generate weird images for D, which could be 'rewarded' by shifting the embedding, but maybe that's aligned in a good way?

The advantage of this (aside from being intellectually interesting) is that it may reduce mode-dropping, as diversity within each minibatch is directly optimized for, and increase stability, as it is not (as) adversarial.


Prototype update: an initial attempt by gna, using a lightweight DCGAN on FFHQ, has been made (training D on reals only). The sample is odd: it is clearly learning something, but why the tessellated hair/eyes? Suspected cause: the highly-aggressive cropping of SimCLR is throwing away global structure. Presumably it needs bigger images.

[Sample image: FFHQ faces with tessellated hair/eyes]

But if you use light cropping which includes most of the image, will there be enough data...? That might be a justification for training on G's fake samples too: it gives you an arbitrarily large number of whole (albeit low-quality) images with which to pseudo-augment the whole reals. gna is skeptical that training on fakes will do anything but confuse D, or potentially allow collapse to degenerate minima where D creates an 'easy' embedding that G can excel on, but I think this must be tested empirically.


gwern commented May 11, 2020

The current status of the full SimCLR loss for BigGAN, according to shawwn:

I haven't gotten it working on the TPU cores, only the TPU's CPU, which resulted in a memory leak; or rather, it didn't work with dataset caching, and I didn't run with caching disabled.


gwern commented May 28, 2020

gna has continued fiddling with a CLR-like GAN, moving over to a perceptual loss on the internal embeddings as trained by a SimCLR final loss, on the reasoning that the final embedding throws away too much global information but may still induce useful intermediate representations:

When I've trained with G's output included in what D trains on and use D's logits for G's loss, I see nothing resembling learning by G.

Meanwhile, when G is not included in D's data (as in our example), what we're seeing is absolutely an example of what Goff and you are referring to: eyes and mouths with no global coherence, because D isn't being trained to learn global coherence.

[Sample image: eyes & mouths with no global coherence]

That's why I've moved towards using D's rich intermediate features [as a perceptual loss] to train G instead of using D's output, which by the nature of SimCLR does not include much information (because it's supposed to be invariant to augmentation).

I guess I could just say "D's output for G's loss" because I generally don't use just the logits, but:
https://github.com/google-research/simclr/blob/master/objective.py

The logits are made by L2-normalizing D's output and then matrix-multiplying it with itself and with the outputs from the symmetrically augmented samples. The logits are essentially (if not exactly?) the cosine similarities between each batch of representations and within each batch of representations. These logits are divided by the temperature, then softmaxed, and then a cross-entropy loss is run on them with labels such that positive samples are 1 and negative samples are 0. It isn't classification, but it uses a softmax.
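Condensed, the objective being described is something like this sketch (following the prose above rather than the exact google-research/simclr code):

```python
import torch
import torch.nn.functional as F

def nt_xent(h1, h2, temperature=0.5):
    """NT-Xent: L2-normalize, take cosine similarities within & across the
    two augmented batches, divide by temperature, then cross-entropy with
    each sample's augmented twin as the positive."""
    n = h1.size(0)
    h = F.normalize(torch.cat([h1, h2]), dim=1)  # (2N, d)
    sim = h @ h.t() / temperature                # scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))            # mask self-similarity
    # The positive for sample i is its twin at i+N (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n),
                         torch.arange(n)]).to(h.device)
    return F.cross_entropy(sim, targets)
```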

You can take this process and use different labels to encourage G to make its samples all look like positive samples of real images.

The setup is different here [from original CLR-GAN]: I'm not maximizing D's output; instead I'm matching its intermediate layers. I take the pairwise perceptual distance within batches for real and generated images, then I take the difference between those, get the gram matrix (a la style transfer) of that, and minimize the absolute value of the gram matrix. The intuition is that we want the differences between the images within a generated batch to look just like the differences between real images.
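As I read that description, the loss is something like this sketch (layer choice, distance metric, and all names are assumptions on my part):

```python
import torch

def pairwise_dists(feats):
    # Flatten (N, C, H, W) intermediate features and take (N, N) pairwise
    # L2 distances within the batch.
    return torch.cdist(feats.flatten(1), feats.flatten(1))

def gram_perceptual_loss(d_layer, reals, fakes):
    """Match differences-between-images rather than images themselves:
    within-batch pairwise perceptual distances for reals and fakes,
    subtracted, then the Gram matrix of the result (a la style transfer),
    whose absolute value is minimized."""
    diff = pairwise_dists(d_layer(fakes)) - pairwise_dists(d_layer(reals))
    gram = diff @ diff.t()
    return gram.abs().mean()
```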

...the setup for the loss is surprisingly involved (& yet still kind-of elegant to me somehow). Of course, it all still amounts to telling D: "make these very similar if they're matched" and "make these very different if not."

Oh, and as to whether I've gotten past the tessellated eyes: I've found some... really awful but somewhat promising results while looking for good intermediate layers in the SimCLR pretrained ResNet. These are from CelebA, because you can't get a much easier dataset than that. Reviewers have said that most perceptual-loss-based generation relies heavily on the perceptual model's supervised learning, so it isn't fairly comparable to GANs (since it isn't really fully unsupervised); I'd like to see if I can't show that it's possible to do with no labels.

[Sample image: CelebA faces from the perceptual-loss setup]


gwern commented Jun 7, 2020

SimCLR news: @lucidrains has experimented with a StyleGAN2 implementation drawing on his contrastive-learner library (which may or may not be helping).


SimCLR has been confirmed to work for BigGAN and to help!

The new paper "Image Augmentations for GAN Training", Zhao et al 2020b, reports:

Data augmentations have been widely studied to improve the accuracy and robustness of classifiers. However, the potential of image augmentation in improving GAN models for image synthesis has not been thoroughly investigated in previous studies. In this work, we systematically study the effectiveness of various existing augmentation techniques for GAN training in a variety of settings. We provide insights and guidelines on how to augment images for both vanilla GANs and GANs with regularizations, improving the fidelity of the generated images substantially. Surprisingly, we find that vanilla GANs attain generation quality on par with recent state-of-the-art results if we use augmentations on both real and generated images. When this GAN training is combined with other augmentation-based regularization techniques, such as contrastive loss and consistency regularization, the augmentations further improve the quality of generated images. We provide new state-of-the-art results for conditional generation on CIFAR-10 with both consistency loss and contrastive loss as additional regularizations.

Their SimCLR BigGAN, which they call "Cntr-GAN", builds on the fixed data augmentation from earlier in the paper (discussed in more detail in #35), on their previous work on additional consistency losses (essentially, jittering inputs to G/D and requiring the losses to be similar/dissimilar for similar images), and then adds SimCLR for even further benefits. The gains:

[Table 1: BigGAN and regularizations (FID & Inception Scores on CIFAR-10)]

SimCLR on its own is a small improvement, but it comes in addition to all the other gains. (Interestingly, SimCLR benefits from MixUp almost as much as from the usual crop+scale+color-distortion SimCLR data augmentation, while the regular D data augs did not benefit at all from MixUp, suggesting that SimCLR is doing something qualitatively different.)

The writeup is very brief and sketchy, unfortunately: the specific SimCLR data augmentations used are relegated to "Appendix D. Cntr-GAN: GAN with Contrastive Loss".

However, it seems to be pretty much what we were trying? So if we can fix the final bugs, this should get us another few FID points / quality improvement of 10% or so.
