Results Using Predicted 2D Keypoints #13

Closed
juxuan27 opened this issue Jan 27, 2022 · 9 comments


@juxuan27

Thank you so much for your excellent work! However, I ran into some problems while testing the model on predicted 2D keypoints (from AlphaPose with the Fast Pose backbone, as mentioned in the README) on the 3DPW dataset.
This is what I tried:

  • Because it is quite hard to map the multi-person results generated by AlphaPose to the 3DPW ground truth, I selected only videos containing a single person. The list is as follows:
                         "courtyard_backpack_00",
                         "courtyard_basketball_01",
                         "courtyard_bodyScannerMotions_00",
                         "courtyard_box_00",
                         "courtyard_golf_00",
                         "courtyard_jacket_00",
                          "courtyard_jumpBench_01",
                         "courtyard_laceShoe_00",
                         "courtyard_relaxOnBench_00",
                         "courtyard_relaxOnBench_01",
                         "downtown_stairs_00",
                         "downtown_walkDownhill_00",
                         "flat_guitar_01",
                         "flat_packBags_00",
                         "outdoors_climbing_00",
                         "outdoors_climbing_01",
                         "outdoors_climbing_02",
                         "outdoors_crosscountry_00",
                         "outdoors_fencing_01",
                         "outdoors_freestyle_00",
                         "outdoors_freestyle_01",
                         "outdoors_golf_00",
                         "outdoors_parcours_00",
                         "outdoors_parcours_01",
                         "outdoors_slalom_00",
                         "outdoors_slalom_01",
    
  • Then, I ran the internet video baseline and obtained the predicted cam, rotmat, and beta parameters for each frame.
  • After that, I calculated MPJPE, PA-MPJPE, and PVE for each step (a minimal sketch of the metric computation is given right after this list).
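For reference, this is roughly how I computed the metrics. It is a minimal numpy sketch with my own helper names (not code from this repo), assuming predicted and ground-truth joints of shape (N, J, 3) in meters (root-aligned where required) and, for PVE, SMPL vertices of shape (N, V, 3):

```python
# Minimal metric sketch (my own code, not the repo's): MPJPE, PA-MPJPE,
# and per-axis MPJPE. PVE is mpjpe() applied to SMPL vertices instead.
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error in mm; pred, gt: (N, J, 3) in meters."""
    return np.linalg.norm(pred - gt, axis=-1).mean() * 1000.0

def per_axis_mpjpe(pred, gt, axis):
    """Mean absolute error along one axis (0=X, 1=Y, 2=Z), in mm."""
    return np.abs(pred - gt)[..., axis].mean() * 1000.0

def procrustes_align(pred, gt):
    """Similarity-align each frame of pred onto gt (for PA-MPJPE)."""
    aligned = np.empty_like(pred)
    for i in range(pred.shape[0]):
        p = pred[i] - pred[i].mean(axis=0)      # center both point sets
        g = gt[i] - gt[i].mean(axis=0)
        U, S, Vt = np.linalg.svd(p.T @ g)       # optimal rotation via SVD
        d = np.ones(3)
        if np.linalg.det(U @ Vt) < 0:           # keep a proper rotation
            d[-1] = -1.0
        R = (U * d) @ Vt
        scale = (S * d).sum() / (p ** 2).sum()  # optimal isotropic scale
        aligned[i] = scale * p @ R + gt[i].mean(axis=0)
    return aligned

# PA-MPJPE = mpjpe(procrustes_align(pred_joints, gt_joints), gt_joints)
```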

The final results are as follows (plus MPJPE along the X, Y, and Z axes):

| Metric (mm) | DynaBOA w/ GT 2D | DynaBOA w/ predicted 2D |
| --- | --- | --- |
| MPJPE | 65.56047058105469 | 186.74376 |
| PA-MPJPE | 40.92316436767578 | 77.56925 |
| PVE | 83.11467019999202 | 195.08884 |
| MPJPE (X axis) | 21.0639544272907 | 67.5 |
| MPJPE (Y axis) | 25.5786684319053 | 57.8 |
| MPJPE (Z axis) | 50.4342290491508 | 140.7 |

I was quite confused about why the results were so bad, so I tried adding Gaussian perturbation to the ground-truth 2D keypoints and ran the 3DPW baseline. The code I changed is as follows:

```python
self.smpl_j2ds.append(smpl_j2ds)
```

changed to (e.g., sigma = 1):

```python
self.smpl_j2ds.append(smpl_j2ds + np.random.normal(0, 1, size=tuple(smpl_j2ds.shape)))
```

And here is the result:
[Image: evaluation results with Gaussian perturbation on the ground-truth 2D keypoints]

Furthermore, I calculated the mean variance between the ground-truth 2D and the AlphaPose-predicted 2D keypoints, and the result is 12.65. If we assume the detected 2D keypoints are the ground truth plus Gaussian noise, the results are therefore expected to be worse.
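To be concrete, this is roughly the statistic I computed (my own sketch, not repo code, with a `valid` mask marking joints that were detected and matched):

```python
# Sketch of the noise statistic (my own code, not from the repo): pixel
# offsets between detected and GT 2D joints, over valid joints only.
import numpy as np

def noise_stats(det_2d, gt_2d, valid):
    """det_2d, gt_2d: (N, J, 2) pixel coordinates; valid: (N, J) bool mask."""
    diff = (det_2d - gt_2d)[valid]   # (M, 2) offsets of usable joints
    return diff.mean(), diff.var()   # mean offset (~0 if unbiased), variance
```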

So does this mean DynaBOA cannot be combined with detected 2D keypoints? Or is it because of some improper operation on my part?

Thank you so much for your patience in reading my issue.

@syguan96
Owner

Hi @juxuan27, sorry for the late reply. I just finished a deadline.
The experiment is very insightful and inspiring. I'm glad to discuss it with you.

In the results from AlphaPose, some joints are missing. Such results are not good supervision for network adaptation, especially for online adaptation. Top-down methods are indeed more appropriate. I also think the Gaussian-noise assumption is not very suitable for results from bottom-up methods.
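One common way to handle the missing joints, shown here as a minimal PyTorch sketch of the usual practice (not DynaBOA's exact loss), is to weight the 2D reprojection loss by detector confidence, so a missing joint contributes no gradient:

```python
# Sketch only: confidence-weighted 2D keypoint loss. A missing joint gets
# confidence 0 and therefore no gradient during online adaptation.
import torch

def weighted_kp2d_loss(pred_2d, det_2d, conf):
    """pred_2d, det_2d: (B, J, 2); conf: (B, J) in [0, 1], 0 = missing."""
    per_joint = ((pred_2d - det_2d) ** 2).sum(dim=-1)  # squared pixel error
    return (conf * per_joint).mean()
```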

@syguan96
Owner

The annotation gap will also influence the evaluation results.

@syguan96
Owner

The hyperparameters should also be tuned.

@syguan96
Owner

> I calculated the mean variance between the ground-truth 2D and the AlphaPose-predicted 2D keypoints, and the result is 12.65.

Do you ignore a joint if it is missed?

@juxuan27
Author

Thank you for your reply! To calculate the mean variance between the ground-truth 2D and the AlphaPose-predicted 2D keypoints, I filtered out the missing and mismatched joints. But the result may not be completely accurate, since AlphaPose sometimes outputs more than one person's annotations for a single-person image. In that situation, I use the candidate with the minimum MPJPE (a sketch of this matching follows below). Also, I found that the mean offset between the ground-truth 2D and the AlphaPose-predicted 2D keypoints is about 0 (maybe -0.0xxx, I forget the exact number).
I agree with your point that the AlphaPose results should not be assumed to be Gaussian noise: if they were, the mean variance between the ground-truth and predicted 2D keypoints should be between 1 and 1.5.
I wonder whether, when the model is given the 2D ground truth of the current image, it tends to overfit to the 2D annotation instead of the temporal information. Note that both the lower-level and upper-level optimization steps include the 2D ground truth in their loss functions. What's more, I think it would be worth conducting an experiment that freezes the model after fine-tuning on the 3DPW train set and then runs inference directly on the 3DPW train set. This may help us further understand how it works 😄
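For reference, the candidate matching I mentioned looks roughly like this (a minimal sketch, not my exact code; in practice missing joints are masked out as well):

```python
# Sketch: pick the AlphaPose candidate with the lowest mean 2D joint error
# against the GT annotation of the single person in the frame.
import numpy as np

def best_candidate(candidates, gt_2d):
    """candidates: list of (J, 2) detections for one frame; gt_2d: (J, 2)."""
    errors = [np.linalg.norm(c - gt_2d, axis=-1).mean() for c in candidates]
    return candidates[int(np.argmin(errors))]
```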

@juxuan27
Author

> The hyperparameters should also be tuned.

> The annotation gap will also influence the evaluation results.

I agree hahhhh

@syguan96
Owner

syguan96 commented Feb 8, 2022

> I wonder whether, when the model is given the 2D ground truth of the current image, it tends to overfit to the 2D annotation instead of the temporal information. Note that both the lower-level and upper-level optimization steps include the 2D ground truth in their loss functions. What's more, I think it would be worth conducting an experiment that freezes the model after fine-tuning on the 3DPW train set and then runs inference directly on the 3DPW train set. This may help us further understand how it works 😄

Hi, I just finished the Spring Festival holiday.
For point 1, refer to Tab. 7 of our paper: with the temporal constraint, MPJPE and PVE improve more significantly, and these metrics are tightly related to temporal correlation. So I think that under the bilevel optimization, the temporal and single-frame (GT 2D keypoint) terms are auxiliary constraints, but I agree the single-frame term is more important.

For point 2, if I fine-tune on the 3DPW training set, should the GT 3D mesh/joints be used? In Tab. 4, I fine-tuned SPIN on the 3DPW test set with GT 2D keypoints (termed *SPIN). I also compared against other baselines that are fine-tuned on the 3DPW training set. Please refer to that table for more details.

@syguan96
Owner

syguan96 commented Feb 8, 2022

As for analyzing the noise distribution of each joint, I think using the results detected by AlphaPose is not appropriate, since it is hard to deal with missing joints. Top-down methods may be a more appropriate alternative.
This is my guess. Maybe we can chat by email or WeChat (shuishiguanshanyan).

@juxuan27
Author

juxuan27 commented Feb 8, 2022

Thank you for your answer! I've added your WeChat!
