
Correct approach to separate static and dynamic regions #3

Closed
kwea123 opened this issue Mar 10, 2021 · 8 comments

kwea123 commented Mar 10, 2021

Hi, I originally mentioned this in #1, but it seemed to deviate from that issue's original question, so I decided to open a new issue.

As discussed in #1, I tried setting raw_blend_w to either 0 or 1 to create "static only" and "dynamic only" images that, in theory, should look like Fig. 5 in the paper and the corresponding parts of the video. However, this approach seems to be wrong: in my results the static part looks ok-ish, but the dynamic part contains almost the entire scene, which is not good at all (we want only the moving part, e.g. only the running kid).

I have been testing this for a week while waiting for a response, but still to no avail. @zhengqili @sniklaus @snavely @owang Sorry for bothering you, but could any of the authors kindly clarify what is wrong with my approach of separating static/dynamic by setting the blending weight to either 0 or 1? I also tried blending the sigmas (opacity in the code) instead of the alphas as in the paper (see the sketch after the snippet below), and directly using rgb_map_ref_dy as the output image, but neither helped.

opacity_dy = act_fn(raw_dy[..., 3] + noise)#.detach() #* raw_blend_w
opacity_rigid = act_fn(raw_rigid[..., 3] + noise)#.detach() #* (1. - raw_blend_w)
# alpha with blending weights
alpha_dy = (1. - torch.exp(-opacity_dy * dists) ) * raw_blend_w
alpha_rig = (1. - torch.exp(-opacity_rigid * dists)) * (1. - raw_blend_w)
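
For reference, the sigma-blending variant I tried looked roughly like the following. This is only a sketch: the variable names follow render_utils.py, but the wrapper function itself is my own and not part of the repo.

import torch

def blend_sigmas(raw_dy, raw_rigid, raw_blend_w, dists, noise=0.,
                 act_fn=torch.nn.functional.relu):
    # Apply the blending weight to the densities (sigmas) before converting
    # them to alphas, instead of blending the alphas afterwards.
    opacity_dy = act_fn(raw_dy[..., 3] + noise) * raw_blend_w
    opacity_rigid = act_fn(raw_rigid[..., 3] + noise) * (1. - raw_blend_w)
    alpha_dy = 1. - torch.exp(-opacity_dy * dists)
    alpha_rig = 1. - torch.exp(-opacity_rigid * dists)
    return alpha_dy, alpha_rig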

I have applied the above approach to other pretrained scenes, but none of them produces good results.


[Two image pairs from different pretrained scenes] Left: static (raw_blend_w=0). Right: dynamic (raw_blend_w=1).

I believe there is something wrong with my approach, but I cannot figure out what. I would really appreciate it if the authors could kindly point out the correct approach. Thank you very much.

zhengqili (Owner) commented Mar 10, 2021

Hi, sorry for the late reply. One way you can visualize the rendered dynamic part is as follows:

In render_utils.py, go to the function raw2outputs_blending and change the computation of the rendered rgb_map from

rgb_map = torch.sum(weights_dy[..., None] * rgb_dy + weights_rig[..., None] * rgb_rigid, -2)

to

rgb_map = torch.sum(weights_dy[..., None] * rgb_dy, -2)

In that case, you should get something like the image below, which softly masks out most of the background region:

[rendered dynamic-only example image]

Let me know if you have any questions.

kwea123 (Author) commented Mar 10, 2021

Hi, thanks for the suggestion. I'm able to reproduce your result now:
[Dynamic-only renderings reproduced for three scenes]

However, there are still problems:

  1. The result is far from what is shown in the paper: the background is still largely included, especially for the last truck scene.
  2. From my understanding, the method you suggest is equivalent to setting rgb_rigid to all zeros while keeping its weight weights_rig in the composition. This explicitly paints the static background black but retains its occupancy, so this "black cloud" still blocks whatever is behind it. In my opinion, the black should not come from a "black cloud" (black static sample points with high opacity blocking the dynamic points) but from a "black vacuum" (black because both the static and the dynamic opacities are zero); see the sketch after this list. What do you think?
  3. Finally, because of point 2, this method does not work if I want to generate a "static only" image: the dynamic "black cloud" simply blocks what is behind it, resulting in black holes.
    [Static-only rendering showing black holes where the dynamic content was]
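
To make point 2 concrete, here is a rough sketch of the "black vacuum" rendering I have in mind for the dynamic part. The variable names follow the snippet in my first post, but the helper itself is only illustrative and not from the repo: the transmittance is computed from the dynamic alphas alone, so static sample points neither contribute colour nor occlude anything behind them.

import torch

def render_dy_only(alpha_dy, rgb_dy):
    # Transmittance accumulated from the dynamic alphas only ("black vacuum"):
    # static sample points neither add colour nor block what is behind them.
    T_dy = torch.cumprod(
        torch.cat([torch.ones_like(alpha_dy[..., :1]), 1. - alpha_dy + 1e-10], -1),
        -1)[..., :-1]
    weights_dy_only = T_dy * alpha_dy
    return torch.sum(weights_dy_only[..., None] * rgb_dy, -2)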

zhengqili (Owner) commented Mar 10, 2021

Yes, from my observation the current method does not segment the moving and static regions very well for some of the examples we tried, and the segmentation quality seems to depend on hyperparameters such as the depth/flow loss weights in our training paradigm.

As mentioned in the paper, the main reason we introduce the static scene model is to improve the rendering quality of the static region without relying too much on a segmentation mask (which indeed helps recover high-frequency content a lot); the segmentation is a by-product application. Thanks for the valuable suggestion; I will think about whether there is a better way to visualize this and revise it in the final version of the paper.

kwea123 (Author) commented Mar 10, 2021

From the paper, I cannot find a reason why the model would be able to learn the fg/bg separation so cleanly with absolutely zero supervision: without any constraint, it is perfectly fine for the model to always learn v=1, i.e. treat everything as dynamic. I agree that adding the static component improves the total score according to your experiments, but I suspect this is due to the insufficient expressivity of the dynamic model. Currently the positionally encoded time input has 21 dimensions, which may not be enough to express the whole scene; what if we used more dimensions? A recent paper that uses a 1024-dimensional time input is able to learn everything as the dynamic part.
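
(For concreteness, 21 dimensions would correspond to the raw time value plus sin/cos at 10 frequencies, as in a standard NeRF-style positional encoding; the helper below is only an illustrative sketch, not code from this repo.)

import torch

def encode_time(t, n_freqs=10):
    # NeRF-style positional encoding applied to a scalar time input.
    # With n_freqs=10 the output has 1 + 2*10 = 21 channels.
    out = [t]
    for i in range(n_freqs):
        out.append(torch.sin((2.0 ** i) * t))
        out.append(torch.cos((2.0 ** i) * t))
    return torch.cat(out, -1)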

Next, I don't see how the depth/flow losses are involved in the process of learning v. If, for example, you used the magnitude of the learnt flow to regularize v, e.g. pushing v towards 0 wherever the flow is smaller than some threshold, then I would agree that the flow loss has some influence and the segmentation quality might improve in that case, but I don't think the current code does that.
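
Something like the following is what I have in mind; this is purely a hypothetical regularizer to illustrate the idea, not something that exists in the released code.

import torch

def blend_w_flow_reg(blend_w, scene_flow, thresh=1e-3):
    # Hypothetical regularizer: push the blending weight v towards 0 (static)
    # wherever the predicted scene flow magnitude is below a small threshold.
    static_mask = (torch.linalg.norm(scene_flow, dim=-1) < thresh).float()
    return (static_mask * blend_w).mean()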

To sum up, from the paper's description and from running your pretrained models, I'm not convinced that the current method can correctly separate fg/bg for any scene (not just some) as claimed in the paper and video. Thanks again for the discussion, and I hope you will find a better way to visualize this and revise it in the final version of the paper!

zhengqili (Owner) commented Mar 10, 2021

Hi, thanks for the suggestion! The motivation, and the reason I think the models can separate fg/bg, is similar to https://retiming.github.io/ (although they do use more supervision): for static regions, the static model should converge faster than the dynamic model and give less error, and therefore dominate those regions, whereas for dynamic regions the dynamic model converges faster. I do observe this phenomenon during training, for example in the case below, where even though I used human masks, most of the human region still belongs to the static part because that person is almost still throughout the sequence:
[balloon1 foreground/background decomposition images]

But I agree that for some examples, such as the truck or playground scenes in the Nvidia dataset you may have seen, this idea does not work well without any regularization, although it does work for some in-the-wild videos I tried. I did try adding regularization or extra supervision for the scene decomposition during my experiments, which gives better separation, but it makes our losses more complicated and deviates from our initial goal, so I decided not to include it in the final system.

kwea123 (Author) commented Mar 11, 2021

I read the paper you mentioned; I think its setting is much simpler than yours, and as you said they use a mask loss to explicitly supervise the model.

"for static regions, the static model should converge faster than the dynamic model and give less error, and therefore dominate those regions, whereas for dynamic regions the dynamic model converges faster"

I think you are right. However, my concern is that even if the static model learns faster in static regions, it probably cannot fully learn them before the dynamic model also learns something there. As a result, the "static only" image is not bad (as shown in #1), but the dynamic model still adds some layers on top of it to complete the scene and minimize the color loss, which makes the segmentation (fg/bg separation) not meaningful.

I agree that this is not the main contribution of the paper (the flow guidance is excellent!) and only a by-product, but I believe there are people like me who were impressed by the apparent accuracy of the segmentation and then could not (at least for now) reproduce a satisfactory result with this code.

A final comment: the losses you propose are already quite complicated, haha, but they really do work! I had implemented your paper before you published the code, and it worked pretty well on our data, producing beautiful flows and depths (and of course the images). Thanks again for the nice work!

zhengqili (Owner) commented Mar 11, 2021

Thanks! I believe the blending weight is not the optimal way to go (although it gives me the fewest artifacts in my system when rendering); removing the weights and directly composing the two models along the ray should be better, but that idea is not so easy to train on monocular videos.
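
By directly composing the two models I mean something roughly like the sketch below; the names are illustrative and this is not code from the repo, just the general density-weighted composition idea.

import torch

def compose_along_ray(sigma_static, rgb_static, sigma_dy, rgb_dy, dists):
    # Densities of the two models add, and each model's colour contributes in
    # proportion to its share of the total density at each sample, so no
    # separate learned blending weight is needed.
    sigma = sigma_static + sigma_dy
    alpha = 1. - torch.exp(-sigma * dists)
    T = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1. - alpha + 1e-10], -1),
        -1)[..., :-1]
    weights = T * alpha
    rgb = (sigma_static[..., None] * rgb_static + sigma_dy[..., None] * rgb_dy) \
          / (sigma[..., None] + 1e-10)
    return torch.sum(weights[..., None] * rgb, -2)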

And for challenging, complex dynamic scenes captured with a monocular camera, I had to add more and more components to make it work, which might not be the optimal way either, so I hope someone will figure out a better and more elegant solution to this problem in the near future.

kwea123 (Author) commented Mar 11, 2021

I actually tried the composition as in NeRF in the Wild. It performs very well and separates the fg from the bg almost perfectly (with only a very coarse mask as supervision). However, my data is multi-view, so I'm not sure whether this translates to your monocular setting. I think this still requires a future solution!
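
The kind of coarse-mask supervision I mean is roughly the following; this is only an illustrative sketch, not the exact loss I used and not code from this repo or from NeRF in the Wild.

import torch
import torch.nn.functional as F

def coarse_mask_loss(weights_dy, mask):
    # Encourage the accumulated dynamic/transient weight along each ray to
    # roughly match a coarse foreground mask (values in [0, 1]).
    acc_dy = weights_dy.sum(-1).clamp(1e-5, 1. - 1e-5)
    return F.binary_cross_entropy(acc_dy, mask)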
