
Code related clarifications #3

Closed
poornimajd opened this issue Jun 23, 2020 · 7 comments

Comments

@poornimajd

Hello @hurjunhwa and team!
I am using this code to train on my own dataset. In this regard, I have a few questions.
First, how did you get the scaling factor of 0.54 in the following code?

depth = k_value.unsqueeze(1).unsqueeze(1).unsqueeze(1) * 0.54 / (disp + (1.0 - mask))

Also, in the line below, dividing disp by 256 is specific to the KITTI dataset, right?
disp_np = io.imread(disp_file).astype(np.uint16) / 256.0

Regarding the flow, I am not sure what the numbers in the following line mean. Do they need to be changed when using a different dataset?
flow_u = np.clip((u * 64 + 2 ** 15), 0.0, 65535.0).astype(np.uint16)

Also, I was able to visualize the scene flow, but I am not sure how to validate it, because I have neither ground-truth flow nor disparity. Can you please help me out in this case?
It was earlier mentioned that this line of code gives the scene flow:
out_sceneflow = interpolate2d_as(output_dict['flow_f_pp'][0], input_l1, mode="bilinear")

But I am unsure what each value in this output scene flow represents. Is it the normalized x, y, and z coordinate of the motion vector?
Can you please help me understand this line of code and the output it produces?
Any help is greatly appreciated!
Thank you

@hurjunhwa
Collaborator

hurjunhwa commented Jun 23, 2020

Hi,

  1. This is the baseline distance between the two cameras of the stereo rig: 0.54 m.

  2. and 3.
    Yes, these are specific to the KITTI dataset.
    KITTI stores the disparity and the flow as uint16.
    After loading and decoding, both the disparity and the flow are in pixel units.

  4. The output scene flow is defined in meters (m), not normalized.
    It is not easy to validate if you don't have any ground truth or pseudo ground truth for your dataset.
    As a sanity check, you may verify whether the source image and the warped target image (warped using the estimated disparity and scene flow) look similar.
    If you know the camera intrinsics of your dataset, you may adjust the scale of the estimation accordingly by referring to KITTI's intrinsics.

Best,
Jun
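
To make points 1–3 above concrete, here is a minimal sketch of the KITTI devkit conventions being discussed. The file path, the focal length `721.5377` (a typical KITTI calibration value), and the helper names are placeholders for illustration, not values or APIs from this repository:

```python
import numpy as np
import skimage.io as io

# KITTI stores disparity maps as uint16 PNGs holding disparity * 256;
# dividing by 256 recovers disparity in pixels. A stored value of 0
# marks an invalid pixel.
disp_png = io.imread("disp_occ_0/000000_10.png").astype(np.float32)
disp = disp_png / 256.0
valid = disp_png > 0

# Standard stereo relation: depth = fx * baseline / disparity.
# For KITTI the baseline is ~0.54 m; fx comes from the calibration files.
fx = 721.5377
baseline = 0.54
depth = fx * baseline / np.maximum(disp, 1e-6)
depth[~valid] = 0.0

# KITTI flow PNGs encode each component as uint16 via u * 64 + 2**15
# (which is the line quoted in the question), with the third channel
# acting as a validity mask. Decoding inverts the same transform.
def encode_kitti_flow(u, v, valid_mask):
    flow_png = np.zeros((*u.shape, 3), dtype=np.uint16)
    flow_png[..., 0] = np.clip(u * 64.0 + 2**15, 0.0, 65535.0).astype(np.uint16)
    flow_png[..., 1] = np.clip(v * 64.0 + 2**15, 0.0, 65535.0).astype(np.uint16)
    flow_png[..., 2] = valid_mask.astype(np.uint16)
    return flow_png

def decode_kitti_flow(flow_png):
    u = (flow_png[..., 0].astype(np.float32) - 2**15) / 64.0
    v = (flow_png[..., 1].astype(np.float32) - 2**15) / 64.0
    valid_mask = flow_png[..., 2] > 0
    return u, v, valid_mask
```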

@poornimajd
Copy link
Author

poornimajd commented Jun 23, 2020

Thanks for the quick and detailed reply!

4. The output scene flow is defined in meters (m), not normalized.

Also a small clarification on this
For example if the output of out_sceneflow is the following

`tensor([[[[-0.3044, -0.3008, -0.2973,  ...,  0.4740,  0.4754,  0.4768],
          [-0.3038, -0.3004, -0.2970,  ...,  0.4728,  0.4741,  0.4753],
          [-0.3032, -0.3000, -0.2967,  ...,  0.4716,  0.4727,  0.4738],
          ...,
          [-0.1817, -0.1817, -0.1817,  ...,  0.1915,  0.1919,  0.1922],
          [-0.1817, -0.1817, -0.1818,  ...,  0.1915,  0.1919,  0.1923],
          [-0.1816, -0.1817, -0.1818,  ...,  0.1916,  0.1920,  0.1924]],

         [[-0.0584, -0.0579, -0.0574,  ..., -0.0122, -0.0131, -0.0140],
          [-0.0585, -0.0579, -0.0574,  ..., -0.0113, -0.0123, -0.0132],
          [-0.0585, -0.0579, -0.0574,  ..., -0.0104, -0.0114, -0.0124],
          ...,
          [ 0.0720,  0.0713,  0.0706,  ...,  0.0674,  0.0678,  0.0682],
          [ 0.0723,  0.0716,  0.0708,  ...,  0.0677,  0.0681,  0.0685],
          [ 0.0725,  0.0718,  0.0711,  ...,  0.0679,  0.0683,  0.0687]],

         [[-1.0730, -1.0736, -1.0742,  ..., -0.8998, -0.8987, -0.8977],
          [-1.0746, -1.0751, -1.0757,  ..., -0.9013, -0.9002, -0.8992],
          [-1.0761, -1.0766, -1.0771,  ..., -0.9027, -0.9017, -0.9007],
          ...,
          [-1.2378, -1.2389, -1.2400,  ..., -1.1398, -1.1389, -1.1380],
          [-1.2371, -1.2383, -1.2395,  ..., -1.1393, -1.1384, -1.1375],
          [-1.2364, -1.2376, -1.2389,  ..., -1.1388, -1.1379, -1.1369]]]],
       device='cuda:0')`

and with size- ([1, 3, 370, 1226])
Then the x coordinate of the motion (in meters) is -0.3044, and similarly y (in meters) is -0.0584 and z (in meters) is -1.0730, right? This means the object has moved by these values along each direction compared to the previous frame, correct?
Sorry for the overly in-depth analysis.
Any suggestion is appreciated!

@hurjunhwa
Collaborator

Hi,

Yes, that's right.
You can find the definition of the coordinate frame and the calibration information in their paper or in the devkit on the dataset web page:
https://www.mrt.kit.edu/z/publ/download/2013/GeigerAl2013IJRR.pdf
(Fig. 1, the red-colored coordinate frame)
No worries!
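
As an illustration of that layout, here is a hypothetical helper for reading out a single pixel's motion vector. The function name and the coordinate reading are my interpretation of the discussion above, not code from this repository:

```python
import torch

# out_sceneflow has shape [B, 3, H, W]: channel 0 is the x component,
# channel 1 the y component, and channel 2 the z component of the 3D
# motion in meters, in the KITTI camera frame (x right, y down, z forward).
def motion_at(out_sceneflow: torch.Tensor, y: int, x: int):
    dx, dy, dz = out_sceneflow[0, :, y, x].tolist()
    return dx, dy, dz

# For the tensor printed above, motion_at(out_sceneflow, 0, 0) gives
# roughly (-0.3044, -0.0584, -1.0730): the point at that pixel moved about
# 0.30 m left, 0.06 m up, and 1.07 m toward the camera between the frames.
```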

@rohaldb

rohaldb commented Jul 7, 2021

I am running this on a custom dataset, and I'm confused about this line:

depth = k_value.unsqueeze(1).unsqueeze(1).unsqueeze(1) * 0.54 / (disp + (1.0 - mask))

If during evaluation we supply the model with only monocular images, can we leave this value at 0.54? It doesn't really make sense to set it to the distance between two cameras when there is only one.

Thanks!

@hurjunhwa
Collaborator

Hi,
Yes, you can leave it as it is when testing on a custom dataset.
Then, of course, the scale of the output depth and scene flow is unknown.
The value is only used for the KITTI dataset to recover the scale of the depth and scene flow, given the camera intrinsics and the stereo baseline distance.
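
If it helps, here is a hypothetical sketch of the kind of rescaling one could do with a single metric reference in the scene. Every name and value below is assumed for illustration and is not part of this repository:

```python
# depth_pred and sceneflow_pred are assumed to be the model's outputs for
# a monocular sequence, so their absolute scale is unknown. Given one
# point at pixel (y_ref, x_ref) whose true distance has been measured,
# a single global scale factor can be estimated and applied to both.
def rescale_with_reference(depth_pred, sceneflow_pred, y_ref, x_ref,
                           known_distance_m):
    scale = known_distance_m / depth_pred[y_ref, x_ref]
    return depth_pred * scale, sceneflow_pred * scale
```

Note that this only fixes the global scale; as the following comments discuss, a monocular prediction can also be ambiguous up to a shift, which a single reference point cannot resolve.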

@rohaldb

rohaldb commented Jul 7, 2021

Thanks so much for the speedy reply!

Just to clarify, it would be unknown up to scale and shift, not just scale, correct? Apologies if this is a trivial question; my graphics/vision background is not that strong!

@hurjunhwa
Collaborator

Yes, you are right :) Both scale and shift.

By the way, as you may know, there is a very nice paper at CVPR this year that recovers the scale and shift, and even the focal length: Learning to Recover 3D Scene Shape from a Single Image.
It would also be interesting to read!
