
Learned Prior #11

Open
MoeinSorkhei opened this issue Jan 15, 2020 · 12 comments

Comments

@MoeinSorkhei

Hi,
Thank you for your great implementation.

Regarding the "learned" prior, I wanted to ask:
1- Why do you make the prior a Gaussian with trainable parameters rather than, say, a unit Gaussian?
2- For the Gaussian prior, what is the motivation for obtaining the mean and std of the Gaussian by passing the features through a CNN? Is it just because you found it to work better? (https://github.com/rosinality/glow-pytorch/blob/master/model.py#L285)

Thanks in advance.

@rosinality
Owner

I found it in the official implementation. It was beneficial to training stability. I don't think the intuition for this is given in the papers, but having a spatial, content-dependent prior should make the overall problem easier.
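The learned prior being discussed can be sketched roughly as follows (illustrative names, not the repo's exact API): a zero-initialised conv maps the passed-through features to a per-pixel mean and log-std, so at the start of training the prior is exactly a unit Gaussian and only becomes content-dependent as the conv weights move away from zero.

```python
import torch
from torch import nn

class LearnedPrior(nn.Module):
    """Minimal sketch of a spatial, content-dependent Gaussian prior."""

    def __init__(self, in_channel):
        super().__init__()
        # twice the channels: one half for the mean, one for the log-std
        self.conv = nn.Conv2d(in_channel, 2 * in_channel, 3, padding=1)
        # zero init => the prior starts out as N(0, 1) everywhere
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, out):
        mean, log_sd = self.conv(out).chunk(2, 1)
        return mean, log_sd

prior = LearnedPrior(4)
h = torch.randn(2, 4, 8, 8)
mean, log_sd = prior(h)
# at initialisation mean and log_sd are all zeros, i.e. a unit Gaussian
```

Because the conv is zero-initialised, training starts from the plain unit-Gaussian prior and the model is free to learn per-location deviations from it.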

@MoeinSorkhei
Author

Thanks for the response!
I will post here if I learn more about this technique.

@MoeinSorkhei
Author

MoeinSorkhei commented Feb 10, 2020

Hi,

I have some detailed questions about your code. I would be glad if you have the time to answer them.

  1. In both the ActNorm and AffineCoupling modules, where scale and shift operations are applied, both your code and the original OpenAI code apply the shift first and then the scale. Although it should not matter much as far as optimization is concerned, did you have a reason to swap the operations? In the paper, the scale is applied before the shift (code lines: https://github.com/rosinality/glow-pytorch/blob/master/model.py#L54 and https://github.com/rosinality/glow-pytorch/blob/master/model.py#L184).

  2. Can you explain the additional learnable (scale * 3) factor in the ZeroConv2d module (as in the OpenAI code)? And why is the scale multiplied by 3? (Code line: https://github.com/rosinality/glow-pytorch/blob/master/model.py#L151)

  3. Can you explain why the input is shifted by +2 before applying the sigmoid in the AffineCoupling module (as in the OpenAI code)? (Code line: https://github.com/rosinality/glow-pytorch/blob/master/model.py#L182)

  4. Why do you pad with 1 (and not 0) in the ZeroConv2d module (code line: https://github.com/rosinality/glow-pytorch/blob/master/model.py#L149), while padding with 0 in the other Conv2d layers in the AffineCoupling? OpenAI pads with 0 if I am not mistaken (https://github.com/openai/glow/blob/master/tfops.py#L203).

  5. If I am not mistaken, the official implementation applies an additional ActNorm right after both the Conv2d and ZeroConv2d layers (code lines https://github.com/openai/glow/blob/master/tfops.py#L256 and https://github.com/openai/glow/blob/master/tfops.py#L309). Is there a reason you did not include this in your code?

I appreciate your time.
Thank you in advance.

@rosinality
Owner

Actually, I don't know the exact reasons for these details - I tried to replicate the paper, and when that did not work well I took these details from the official implementation. In any case, they are related to training stability.

  1. It could make the optimization process different, as the location parameter is also scaled.
  2. I don't know why... Maybe it is to accelerate adaptation during the initial training phase.
  3. It keeps the scaling factor close to 1, which helps information propagation.
  4. I think they also used 1 padding. (https://github.com/openai/glow/blob/master/tfops.py#L217)
  5. Did they use ActNorm after ZeroConv2d? I can't find it in conv2d_zeros.
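Questions 2-4 above can be illustrated with a small sketch (illustrative, not the repo's exact code): a zero-initialised conv that pads its input with the constant 1 and rescales its output by exp(scale * 3), plus, for question 3, the near-identity behaviour of sigmoid(x + 2) at initialisation.

```python
import torch
from torch import nn
import torch.nn.functional as F

class ZeroConv2d(nn.Module):
    """Sketch of a zero-initialised conv: weights and bias start at zero,
    so the layer initially outputs zeros, and the learnable `scale`
    (multiplied by 3 inside exp) lets the output magnitude adapt quickly
    once gradients arrive."""

    def __init__(self, in_channel, out_channel):
        super().__init__()
        self.conv = nn.Conv2d(in_channel, out_channel, 3, padding=0)
        self.conv.weight.data.zero_()
        self.conv.bias.data.zero_()
        self.scale = nn.Parameter(torch.zeros(1, out_channel, 1, 1))

    def forward(self, x):
        # pad the border with the constant 1 instead of 0 (question 4)
        out = F.pad(x, [1, 1, 1, 1], value=1)
        out = self.conv(out)
        # the factor 3 steepens the gradient w.r.t. `scale` (question 2)
        return out * torch.exp(self.scale * 3)

layer = ZeroConv2d(3, 6)
y = layer(torch.randn(1, 3, 8, 8))
# at initialisation the output is exactly zero

# question 3: the coupling net is zero-initialised, so its raw scale
# output is 0 at the start; shifting by +2 before the sigmoid gives
s = torch.sigmoid(torch.zeros(1) + 2)
# s ≈ 0.88, close to 1, so the coupling starts near the identity map
```

A scale factor near 1 at initialisation means each coupling layer initially passes its input through almost unchanged, which helps signals (and gradients) propagate through a deep stack of flows.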

@MoeinSorkhei
Author

Hi,

Thanks for your response; you are right.
I have a question about the amount of noise added to each data point during training. As can be seen in your code here (and in OpenAI's code), you add noise to the data based on 1/(2 ** n_bits). Since 8-bit pixels, which naturally take values in [0, 255], are divided by 255 when being converted to the [0, 1] interval, why shouldn't the right quantity be 1/(2 ** n_bits - 1), i.e. 1/255 in the case of 8-bit images?

I appreciate your thoughts.
Thanks very much in advance.

@rosinality
Owner

Hmm, wouldn't it be better to have noise slightly smaller than actual pixel-value changes in [0, 255]? Though I guess it will not make much difference.
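The dequantization being discussed can be sketched as follows (an illustrative sketch, not the repo's exact preprocessing): each integer pixel value marks the left edge of a bin of width 1/n_bins, and uniform noise of that width is added, so the bins tile [0, 1) with no gaps or overlap.

```python
import torch

def dequantize(image, n_bits=8):
    """Uniform dequantization sketch. `image` holds integer pixel values
    in [0, 2**n_bits - 1]; each is mapped to a bin of width 1/n_bins and
    filled uniformly with noise of exactly that width."""
    n_bins = 2 ** n_bits
    x = image.float() / n_bins            # bin left edges: k / n_bins
    x = x + torch.rand_like(x) / n_bins   # fill the bin uniformly
    return x

pixels = torch.randint(0, 256, (4, 3, 8, 8))
x = dequantize(pixels)
# with noise width 1/256 the result stays strictly inside [0, 1);
# a width of 1/255 would let the top bin spill past 1
```

This is one way to see the owner's point: dividing by 2 ** n_bits and using matching noise width keeps the bins non-overlapping, whereas mixing a /255 rescaling with 1/255 noise would make adjacent pixel values share mass.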

@MoeinSorkhei
Author

Yes, it might be. I also think it does not make a huge difference overall.

@MoeinSorkhei
Author

> Actually, I don't know the exact reasons for these details - I tried to replicate the paper, and when that did not work well I took these details from the official implementation. In any case, they are related to training stability.
>
> 1. It could make the optimization process different, as the location parameter is also scaled.
> 2. I don't know why... Maybe it is to accelerate adaptation during the initial training phase.
> 3. It keeps the scaling factor close to 1, which helps information propagation.
> 4. I think they also used 1 padding. (https://github.com/openai/glow/blob/master/tfops.py#L217)
> 5. Did they use ActNorm after ZeroConv2d? I can't find it in conv2d_zeros.

Hi, after investigating the official Glow implementation more thoroughly, I wanted to make some clarifications on this topic:

  • First, I think they always pad with one, since they use the add_edge_padding function, which pads with 1, as you pointed out in your response. However, in your code you pad with 0 (the PyTorch default) except in ZeroConv2d, where you pad with 1. Am I right?

  • Secondly, about ActNorm: as far as I understand their code, the official implementation always applies ActNorm after the Conv2d operations they define, except for ZeroInitConv2d, which does not apply ActNorm.

Thanks.

@rosinality
Owner

> First, I think they always pad with one, since they use the add_edge_padding function, which pads with 1, as you pointed out in your response. However, in your code you pad with 0 (the PyTorch default) except in ZeroConv2d, where you pad with 1. Am I right?

Slightly different: they also used zero padding, but concatenated an additional input channel that is all zeros except at the borders.

@MoeinSorkhei
Author

> > First, I think they always pad with one, since they use the add_edge_padding function, which pads with 1, as you pointed out in your response. However, in your code you pad with 0 (the PyTorch default) except in ZeroConv2d, where you pad with 1. Am I right?
>
> Slightly different: they also used zero padding, but concatenated an additional input channel that is all zeros except at the borders.

Absolutely right. Do you understand the purpose of that additional input (the pad variable)? In other words, why do they concatenate this additional input after padding with zeros?

@rosinality
Owner

It helps the feature maps take different values (a kind of bias) at the edges. I think zero padding can also achieve that to some degree, though.

@MoeinSorkhei
Author

Right. As you mentioned, zero padding also allows for different values at the edges, since the kernel convolved with the input has a bias by default; so even though the kernel weights are multiplied by zeros (the pads), the output there can still differ. But the pad variable that is concatenated to the input additionally lets the kernel weights themselves act on the border, so in theory it allows more degrees of freedom. However, I am not sure how helpful it is in practice.
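The mask-channel trick discussed above can be sketched as follows (illustrative, not OpenAI's exact add_edge_padding code): zero-pad the input spatially, then append one extra channel that is 1 exactly on the padded border and 0 in the interior, so the conv kernel can learn an edge-specific response rather than relying only on its bias.

```python
import torch
import torch.nn.functional as F

def add_edge_padding(x, pad=1):
    """Zero-pad spatially, then concatenate a mask channel that flags
    padded positions (1 on the border, 0 in the interior)."""
    x = F.pad(x, [pad, pad, pad, pad], value=0)
    n, _, h, w = x.shape
    mask = torch.ones(n, 1, h, w)
    mask[:, :, pad:h - pad, pad:w - pad] = 0  # interior 0, border 1
    return torch.cat([x, mask], dim=1)

x = torch.randn(2, 3, 8, 8)
y = add_edge_padding(x)
# y has one extra channel and a 1-pixel border: shape (2, 4, 10, 10)
```

With plain zero padding, the conv output at the border differs from the interior only through the bias term; with the mask channel, the kernel weights attached to that channel give the network a learnable, position-aware edge response.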
