Evaluate on NYUV2 test set? #34
Sorry for not having uploaded it; I noticed some inconsistencies that I wanted to look into further but haven't had time to do so yet. Does the following script work for you, and does it replicate the results presented in the paper? Thanks!
Great! It works just as described in the paper! However, I have a question. You are solving a least squares problem to transform the prediction to the target domain. Does this mean that the output of the model is not in meters?
I am happy to hear that you can replicate the results stated in our paper! And you are correct, the prediction does not have a unit. By the way, I think that this line in the evaluation script should be changed to reverse the channel ordering and convert RGB to BGR:
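The exact line from the script is not reproduced here; as a minimal sketch, assuming the frame is loaded as an H x W x 3 NumPy array in RGB order, the reversal could look like the following.

```python
import numpy

# hypothetical example: a dummy H x W x 3 image in RGB channel order
image_rgb = numpy.random.rand(480, 640, 3).astype(numpy.float32)

# reversing the last axis swaps the channel order, converting RGB to BGR
image_bgr = image_rgb[:, :, ::-1]
```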
In other words, it seems that I accidentally ran the evaluation with an incorrect channel ordering. As such, our paper reports the NYU metrics as higher than they should be.
Yes! I have made the changes and got much better results. Results after the changes:
I get pretty much the same results on my end; it just seems a little too good to be true when compared to the other methods.
Do you mean a visual comparison? I have not seen better single-image results yet... Also, you mentioned that you are planning to release the dataset? Do you have any updates?
Another question is about the transformation of the predictions. Is it fair to use the true targets for evaluation? You solve the least squares problem via X = (A.T A)^-1 (A.T B), which already has B (the true targets) in it, to transform your outputs.
I just re-ran the NYU evaluation with the corrected color channel order. As such, the part that corresponds to NYU in Table 1 of our paper should be corrected as follows.
I am just always a little skeptical when results are much better than competing methods (on a side note, MiDaS is a great single image depth estimation method as well, but we have not compared to it in our paper). We have a great dataset and spent a lot of time optimizing our network, but there is always the worry of having made a mistake in the evaluation. If anyone spots an issue with the evaluation code I posted above, please do not hesitate to let me know.

As for the dataset, please see #25 for any updates. We are currently still discussing with our legal team but hope to have an update soon.

Regarding your question about the transform: yes, it is fair since we 1) use the same transform optimization for all methods we compare to, and 2) it is just solving for a scale (and bias) that is impossible to predict otherwise.
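For readers following along, here is a minimal sketch of such a per-sample scale-and-shift fit, assuming `prediction` and `target` are flattened NumPy arrays over the valid pixels (an illustration only, not the code from the evaluation script).

```python
import numpy

def fit_scale_and_shift(prediction, target):
    # solve min ||A x - b||^2 with A = [prediction, 1] and b = target,
    # i.e. x = (A.T A)^-1 (A.T b), which yields a scale and a shift
    A = numpy.stack([prediction, numpy.ones_like(prediction)], axis=1)
    x, _, _, _ = numpy.linalg.lstsq(A, target, rcond=None)
    scale, shift = x
    return scale * prediction + shift

# toy usage: recover a known scale and shift from a noisy target
prediction = numpy.random.rand(480 * 640)
target = 2.5 * prediction + 0.7 + 0.01 * numpy.random.randn(480 * 640)
aligned = fit_scale_and_shift(prediction, target)
```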
Great! Thanks for the reply!
Hello Simon! Thank you very much for the script! Actually, I would like to echo @oljike regarding the use of least squares. The question is what the fair and correct way is to evaluate a model that returns scale- and shift-invariant disparity against ground truth that is absolute depth. Using the fact that the ground truth lies within a certain range (0 to 10) seems unfair, and you do not use it. At the same time, however, this could implicitly be part of the model (e.g. you actually trained on NYU, whereas other models, such as MegaDepth, did not), or even explicitly if somebody forces their model to return values in the target range (e.g. by putting sigmoid * 10 at the end).

Another important issue is outliers. Minimal clamping or a more robust mapping (median, MAD, etc.) in addition to or instead of least squares could have a huge impact on the result (speaking in general, not specifically about your model). Alternatively, the typical disparity visualization, where you just do an affine mapping to 0..1, could look much worse, so a manual inspection of the output could seem pretty bad relative to the benchmark results.

Also, what is the reason you remove 16 pixels from each edge? For example, you could compute the depth on the whole image and then skip the border when comparing depths, because you were not sure of the extrapolation/interpolation in that area. Instead, you start by removing the border from the image and the ground-truth depth, and then compute and compare on the whole resulting image and depth map. I would be happy to hear your considerations. And, once again, thank you very much for sharing your code!
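As an illustration of the kind of robust alignment mentioned above, here is a sketch of a median/MAD-based scale-and-shift fit, assuming `prediction` and `target` are flattened NumPy arrays (a hypothetical alternative, not part of the evaluation script discussed in this thread).

```python
import numpy

def fit_median_mad(prediction, target):
    # align the medians (shift) and the mean absolute deviations from
    # the median (scale); less sensitive to outliers than least squares
    shift_p, shift_t = numpy.median(prediction), numpy.median(target)
    scale_p = numpy.mean(numpy.abs(prediction - shift_p))
    scale_t = numpy.mean(numpy.abs(target - shift_t))
    return (prediction - shift_p) / scale_p * scale_t + shift_t
```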
You are very welcome to change the error minimization metric and redo the evaluation. 🙂 I maintain that our evaluation is fair; we used the same error minimization for all baselines. If you disagree with evaluating MegaDepth in this way because it has not been trained on NYU, then ignore those results and focus on the ones from DIW, which uses the same network architecture after all. Also, DIW themselves use a similar error minimization metric in their paper, but they apply it on the entire set instead of on a per-sample basis. If you do some literature review you will find that there are a few different approaches for dealing with the scale ambiguity in the evaluation, none of them fully satisfying. The important thing is to be consistent and to treat all methods in the same way, which we do.

I am afraid that I do not understand your concerns about the disparity visualization. We do not compare visualizations of depth or disparity maps in our paper. We only do so in the supplementary material since one reviewer asked us for such a comparison. However, we believe that visualizations of depth maps have little meaning. Side-by-side comparisons, for example, make it impossible to judge how well the depth edges are aligned with image edges at depth discontinuities.

I strongly believe that cropping the input image by 16 pixels is more appropriate than cropping the depth estimate, since the input images are subject to a white boundary as shown below. Besides, evaluating the cropped depth estimate when using the unaltered input yields insignificant changes in the error measurement, as also shown below. And again, the most important thing is to be consistent and to treat all methods in the same way, which we do.
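A minimal sketch of the border cropping described above, using dummy arrays at the NYUv2 resolution (an illustration, not the exact evaluation code):

```python
import numpy

# dummy H x W x 3 image and H x W ground-truth depth at the NYUv2 resolution
image = numpy.random.rand(480, 640, 3)
depth = numpy.random.rand(480, 640)

# remove 16 pixels from each edge before running the network and evaluating,
# since the NYUv2 frames carry a white boundary
image_cropped = image[16:-16, 16:-16, :]
depth_cropped = depth[16:-16, 16:-16]
```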
I just added the
Hi Sniklaus, |
Hi! Could you please provide the script you used to evaluate on the NYUv2 test set? I am trying to reproduce the results mentioned in the paper but can't.