Correct approach to separate static and dynamic regions #3
Yes, from my observation the current method does not segment the moving and static regions very well for some examples we tried, and segmentation quality seems to depend on hyperparameters such as the depth/flow loss weights in our training paradigm. As mentioned in the paper, the main reason we introduce the static scene model is to improve the rendering quality of the static region without relying too much on a segmentation mask (which indeed helps recover high-frequency content a lot); segmentation is a by-product application. Thanks for the valuable suggestion. I will think about whether there is a better way to visualize them and revise the final version of the paper.
From the paper, I cannot find a reason why the model should be able to learn fg/bg so clearly with absolutely zero supervision: it is perfectly fine for the model to let the dynamic branch explain everything. I also don't see how the depth/flow losses are involved in learning the separation. To sum up, from the paper's description and from running your pretrained models, I'm not convinced that the current method can correctly separate fg/bg (for any scene, not just some) as claimed in the paper and the video. Thanks again for the discussion, and I hope you will find a better way to visualize them and revise the final version of the paper!
Hi, thanks for the suggestion! The reason I think the models can separate fg/bg is similar to https://retiming.github.io/ (although they did use more supervision): in static regions, the static model should converge faster than the dynamic model and give lower error, therefore dominating those regions, whereas in dynamic regions the dynamic model converges faster. I do observe this phenomenon during training, such as in the example shown below, where I found that even when I used human masks, most of the human region still belonged to the static part because that person is almost still in the sequence. But I agree that for some examples, such as the truck or playground scenes in the Nvidia dataset you might have seen, this idea did not quite work without any regularization, though it did work for some in-the-wild videos I tried. I did try adding regularization or extra supervision for scene decomposition during my experiments, which provides better separation, but it makes our losses more complicated and deviates from our initial goal. Therefore, I decided not to include these in our final system.
I read the paper you mentioned; I think its setting is much simpler than yours, and, as you said, they use a mask loss to explicitly supervise the model.
I think you are right. However, the concern is that even if the static model learns faster in static regions, it probably cannot fully learn them before the dynamic model learns something there too. As a result, the "static only" rendering is not bad (as shown in #1), but the dynamic model still adds some layers on top of it to complete the scene and minimize the color loss, making the segmentation (fg/bg separation) meaningless. I agree that it's not the main contribution of the paper (the flow guidance is excellent!) and only a by-product, but I believe there are people like me who are amazed by the apparent accuracy of the segmentation yet cannot (at least for now) reproduce a satisfactory result with this code. A final comment: the loss you propose is already very complicated, haha, but it really does work! I had implemented your paper before you published the code, and it worked pretty well on our data, producing beautiful flows and depths (and of course the images). Thanks for the really nice job!
Thanks! I believe the blending weight is not the optimal way to go (although it gives me the fewest artifacts in my system when rendering); removing the weights and directly composing the two models along the ray should be better, but that idea is not so easy to train on monocular videos. For challenging, complex dynamic scenes captured with a monocular camera, I had to add more and more components to make it work, which might not be optimal either, so I hope someone figures out a better and more elegant solution to this problem in the near future.
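For concreteness, composing two radiance fields directly along the ray (no learned blending weight) usually means adding their densities and mixing colors by density share, as in NeRF-W-style composition. The sketch below is illustrative NumPy, not this repo's code; all names are made up for the example:

```python
import numpy as np

def composite_two_fields(sigma_st, sigma_dy, rgb_st, rgb_dy, dists):
    """Additive composition of a static and a dynamic field along one ray.

    sigma_st, sigma_dy : [N] per-sample densities of each branch
    rgb_st, rgb_dy     : [N, 3] per-sample colors of each branch
    dists              : [N] distances between ray samples
    """
    # Densities of the two fields simply add along the ray.
    sigma = sigma_st + sigma_dy
    alpha = 1.0 - np.exp(-sigma * dists)
    # Transmittance: product of (1 - alpha) over all previous samples.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha + 1e-10]))[:-1]
    weights = alpha * trans
    # Each sample's color is the density-weighted mix of the two branches.
    rgb = (sigma_st[:, None] * rgb_st + sigma_dy[:, None] * rgb_dy) / (
        sigma[:, None] + 1e-10)
    rgb_map = (weights[:, None] * rgb).sum(axis=0)
    return rgb_map, weights
```

With this formulation, the per-sample ratio `sigma_dy / (sigma_st + sigma_dy)` acts as an implicit soft segmentation, so no separate blending weight has to be learned.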
I actually tried the composition as in NeRF in the Wild. It performs very well and separates the fg from the bg almost perfectly (with only a very coarse mask as supervision). However, my data is multi-view, so I'm not sure whether it translates to your monocular setting. I think this remains open for future work!
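If one did want to add the coarse-mask supervision mentioned here, a minimal version could be a masked binary cross-entropy between the predicted blending weight and the coarse dynamic/static mask. This is purely illustrative and not part of this repo's losses; all names are hypothetical:

```python
import numpy as np

def coarse_mask_loss(blend_w, mask, valid):
    """Binary cross-entropy between per-pixel blending weights and a
    coarse dynamic(1)/static(0) mask, averaged only over pixels where
    the mask is trusted (valid == 1). Illustrative sketch only."""
    eps = 1e-6
    w = np.clip(blend_w, eps, 1.0 - eps)
    bce = -(mask * np.log(w) + (1.0 - mask) * np.log(1.0 - w))
    return (bce * valid).sum() / max(valid.sum(), 1)
```

Applying it only where the mask is confident keeps the supervision "coarse": uncertain boundary pixels contribute nothing, and the photometric losses still decide the fine separation there.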
Hi, I originally mentioned this issue in #1, but it seemed to deviate from the original question there, so I decided to open a new issue.
As discussed in #1, I tried setting `raw_blend_w` to either 0 or 1 to create "static only" and "dynamic only" images that should in theory look like Fig. 5 in the paper and in the video. However, this approach seems to be wrong: in the results, the static part looks ok-ish, but the dynamic part contains almost the entire scene, which is not good at all (we want only the moving part, e.g. only the running kid). I have been testing this for a week while waiting for a response, but still to no avail. @zhengqili @sniklaus @snavely @owang Sorry for bothering you, but could any of the authors kindly clarify what is wrong with my approach of separating static/dynamic by setting the blending weight to either 0 or 1? I also tried blending the sigmas (`opacity` in the code) instead of the alphas as in the paper, and directly using `rgb_map_ref_dy` as the output image, but neither helped.

Neural-Scene-Flow-Fields/nsff_exp/render_utils.py, lines 804 to 809 in 7d8a336
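For context, a per-sample blending weight typically combines the two branches' alphas along the lines of the sketch below. This is a simplified illustration, not necessarily the exact formulation at those lines, and the variable names are mine:

```python
import numpy as np

def blended_render(sigma_st, sigma_dy, rgb_st, rgb_dy, blend_w, dists):
    """Alpha blending with a per-sample weight w: w -> 1 trusts the
    dynamic branch, w -> 0 the static branch. Forcing w to 0 or 1
    re-renders with one branch's alphas alone, so the 'dynamic only'
    image can still contain the whole scene if the dynamic branch
    happened to learn the static background too."""
    alpha_st = 1.0 - np.exp(-sigma_st * dists)
    alpha_dy = 1.0 - np.exp(-sigma_dy * dists)
    # Blend the opacities, then composite front-to-back as usual.
    alpha = blend_w * alpha_dy + (1.0 - blend_w) * alpha_st
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha + 1e-10]))[:-1]
    weights_dy = blend_w * alpha_dy * trans
    weights_st = (1.0 - blend_w) * alpha_st * trans
    rgb_map = (weights_dy[:, None] * rgb_dy
               + weights_st[:, None] * rgb_st).sum(axis=0)
    return rgb_map
```

Under this formulation, setting `blend_w = 1` everywhere only removes the static branch's contribution; it does not force the dynamic branch to contain nothing but the moving subject.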
I have applied the above approach to other pretrained scenes, but none of them produces good results.
Left: static (raw_blend_w=0). Right: dynamic (raw_blend_w=1).
Left: static (raw_blend_w=0). Right: dynamic (raw_blend_w=1).
I believe there is something wrong with my approach, but I cannot figure out what. I would really appreciate it if the authors could kindly point out the correct approach. Thank you very much.