
Confusion about Affordance Features #1

Closed
ASMIftekhar opened this issue Apr 9, 2021 · 10 comments

Comments

@ASMIftekhar

Hello, thanks for your nice work. After reading the ATL paper, I am confused about the affordance features. You say in the paper:

We first extract the human, object, and affordance features via the ROI-Pooling from the feature pyramids.

What are these affordance features, exactly? That is, from where are these features pooled?

@zhihou7
Owner

zhihou7 commented Apr 10, 2021

Hi, thanks for your interest.
In our experiment, the affordance features are pooled from the union box of the human and the object, i.e. the same as the verb features in VCL and FCL. We think a verb describes an existing interaction from the person's perspective, while an affordance illustrates the interaction possibilities (or action possibilities) of an object.
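To make this concrete, here is a minimal sketch of the union-box pooling (illustrative only; it uses torchvision's roi_align rather than our actual code, and the boxes, shapes, and stride are made-up values):

```python
# Illustrative sketch (not our actual code): pooling the affordance/verb
# feature from the union box of a human-object pair.
import torch
from torchvision.ops import roi_align

def union_box(human_box, object_box):
    """Smallest box enclosing both boxes; boxes are (x1, y1, x2, y2)."""
    x1 = torch.minimum(human_box[0], object_box[0])
    y1 = torch.minimum(human_box[1], object_box[1])
    x2 = torch.maximum(human_box[2], object_box[2])
    y2 = torch.maximum(human_box[3], object_box[3])
    return torch.stack([x1, y1, x2, y2])

# One pyramid level of shape (N, C, H, W); boxes are in image coordinates.
feature_map = torch.randn(1, 256, 64, 64)
human = torch.tensor([10., 10., 40., 80.])
obj = torch.tensor([30., 50., 70., 90.])

# roi_align expects rows of (batch_index, x1, y1, x2, y2).
ubox = torch.cat([torch.zeros(1), union_box(human, obj)]).unsqueeze(0)
# spatial_scale maps image coordinates onto this level (stride 4 here).
affordance_feat = roi_align(feature_map, ubox, output_size=(7, 7),
                            spatial_scale=0.25)  # -> (1, 256, 7, 7)
```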

Regards,

@ASMIftekhar
Author

OK, got it. In that case, I am curious what would happen if we just used human features as the affordance features. Affordance features are basically concatenated with object features to compose new HOIs. It would be interesting to see the results; I am not sure if you have tested it already. Anyway, thanks for the reply.
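To be clear about what I mean by composing, here is a rough sketch (the shapes and the all-pairs scheme are just my illustrative assumptions, not necessarily the paper's exact procedure):

```python
# Rough sketch of the composition idea: pair each affordance/verb feature
# with object features (possibly from other images) to synthesize new HOI
# samples, then classify the concatenation.
import torch

verb_feats = torch.randn(4, 512)    # pooled from union (or human) boxes
object_feats = torch.randn(6, 512)  # pooled from object boxes

# Cartesian pairing: every verb feature with every object feature.
v = verb_feats.unsqueeze(1).expand(-1, object_feats.size(0), -1)
o = object_feats.unsqueeze(0).expand(verb_feats.size(0), -1, -1)
composed_hoi = torch.cat([v, o], dim=-1).reshape(-1, 1024)  # (24, 1024)
```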

@zhihou7
Owner

zhihou7 commented Apr 11, 2021

Hi. With the human box feature as the affordance, the performance of HOI detection decreases noticeably compared to the union box (see Table 3 in VCL). However, the human box with the compositional approach still effectively improves over the baseline. We also evaluated the human box in FCL, where we observed a similar trend (human-box baseline: 22.91 / 16.66 / 24.77; human-box FCL: 23.83 / 18.62 / 25.39). Thus, the human box does not affect the effectiveness of the compositional approach.

However, we did not evaluate the human box on affordance recognition, since we found the union box achieves a consistent improvement over the human box. We think the human box would not affect the effectiveness of the visual compositional approach on affordance recognition compared to the baseline. For the comparison between the human box and the union box on affordance recognition, we intuitively think the union box might be better because it achieves a better verb representation, but we are not sure. We have removed the model weights of the human box and thus cannot evaluate this right now. While considering your question, we found a set of experiments on the verb auxiliary loss, which achieves a better verb representation and a better HOI detection result:

| method | val2017 | object365_coco | gthico | object365 | HOI detection |
| --- | --- | --- | --- | --- | --- |
| baseline with verb auxiliary loss | 19.71 | 17.86 | 23.18 | 6.80 | 23.44 |
| baseline without verb auxiliary loss | 19.77 | 17.85 | 27.23 | 6.90 | 22.83 |

The table (reported in mAP; we first evaluated ATL with F1, but found that F1 might be less robust than mAP while preparing the camera-ready version) corresponds to Tab. 5 in the Appendix. However, the auxiliary loss does not seem to always improve affordance recognition. Thanks for your comments; we hadn't noticed this before.
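For reference, the verb auxiliary loss is just an extra verb classification term added to the HOI loss. A minimal sketch (the names and the weight here are illustrative, not our exact implementation):

```python
# Minimal sketch of a verb auxiliary loss: an extra verb classifier on the
# verb (union-box) representation, weighted and added to the HOI loss.
import torch.nn.functional as F

def total_loss(hoi_logits, hoi_labels, verb_logits, verb_labels,
               aux_weight=0.5):  # aux_weight is an illustrative value
    # Multi-label HOI classification loss.
    hoi_loss = F.binary_cross_entropy_with_logits(hoi_logits, hoi_labels)
    # Auxiliary verb classification on the verb representation.
    verb_loss = F.binary_cross_entropy_with_logits(verb_logits, verb_labels)
    return hoi_loss + aux_weight * verb_loss
```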

@ASMIftekhar
Author

Thanks a lot for the clarification. I am just a bit skeptical about using union boxes as affordance features, since union boxes contain the old object features.

@zhihou7
Owner

zhihou7 commented Apr 12, 2021

You are welcome; your question is valuable. I think the compositional approach (composing verbs and objects across different images) also forces the verb representation to be more discriminative (see the t-SNE figure in VCL). This might alleviate the effect of the old object features; otherwise, VCL would not improve over the corresponding union-box baseline.

When the affordance recognition experiment with the human-box model finishes, I'll post the result.

@zhihou7
Owner

zhihou7 commented Apr 12, 2021

Well, the mAP of the ATL (HICO) model on the HICO test set is 46.32, which is much worse than the result (59.44) of the corresponding union-box model in Tab. 12 in the Appendix. I'll check the result again after the model converges.

@zhihou7
Owner

zhihou7 commented Apr 14, 2021

The HOI detection performance of the human box (ATL (HICO)) is 22.99% mAP. The mAP on COCO val2017 is 39.40%, which is also much worse than the 52.01 of the corresponding union-box model in Tab. 12 in the Appendix. All the results are worse than I expected.

@ASMIftekhar
Author

I really appreciate that you take the time to run experiments to answer my questions. You might want to add this experiment to the supplementary material of the paper.

@zhihou7
Owner

zhihou7 commented Apr 15, 2021

Thanks. It is just because I have benefited a lot from taking questions seriously (especially comments from peer review). The first two works (VCL & ATL) were rejected on first submission, but the reviewers' comments made the papers better and sometimes inspired me a lot. I'll consider adding this experiment to the Appendix.
Emm, it doesn't take much time to run those small experiments; I just submit a job and wait for the results.

@zhihou7
Owner

zhihou7 commented Jun 10, 2021

Hi, I have added more experiments and updated the pre-print version on arXiv: https://arxiv.org/abs/2104.02867. Interestingly, I find that with the human-box verb representation, the performance of the baseline increases, while the performance of ATL drops.
