Official PyTorch implementation of "CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding" (WACV 2026). CAPE resolves referential ambiguity in cluttered scenes by fusing complementary pointing heatmaps with CLIP semantic guidance.
- [Mar 2026] Visit our poster: WACV 2026, March 9, Poster Session 3.
- [Mar 2026] Training code and evaluation scripts will be released soon.
- [Nov 2025] Paper accepted to WACV 2026.
Embodied Reference Understanding (ERU) is the task of identifying the specific object in a visual scene that a person refers to through a language instruction combined with a pointing gesture. Previous methods rely heavily on the assumption that the target lies exactly along the head-to-fingertip direction. However, this "single-line assumption" breaks down when the user looks elsewhere, the scene is cluttered, or the pointing direction is driven by the wrist rather than the gaze.
CAPE tackles this by moving beyond the strict geometric assumption. We propose a dual-model framework built on complementary head-to-fingertip and wrist-to-fingertip Gaussian ray heatmaps. The two models are dynamically combined at inference time by a CLIP-Aware Pointing Ensemble (CAPE), which leverages CLIP semantic similarity together with a size-aware adaptation rule, achieving state-of-the-art results on YouRefIt, ISL Pointing, and CAESAR.
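To make the idea concrete, below is a minimal sketch of the two ingredients: rendering a Gaussian ray heatmap along a pointing direction, and a CLIP-based rule that picks between the two models' predicted boxes. This is an illustration under assumed interfaces, not the released code: the function names, the `sigma` value, and the crop-and-compare scoring are simplifications of ours, and the paper's size-aware adaptation rule is omitted.

```python
# Illustrative sketch only (not the official CAPE implementation).
# Assumes the openai/CLIP package (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip
from PIL import Image


def gaussian_ray_heatmap(origin, fingertip, h, w, sigma=8.0):
    """Gaussian heatmap around the half-line from `origin` through `fingertip`.

    `origin` is the head (or wrist) keypoint and `fingertip` the fingertip
    keypoint, both float tensors of shape (2,) in (x, y) pixel coordinates.
    The `sigma` value is an illustrative choice, not the paper's setting.
    """
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    pts = torch.stack([xs, ys], dim=-1)                      # (H, W, 2) pixel grid
    d = (fingertip - origin) / (fingertip - origin).norm()   # unit pointing direction
    t = ((pts - origin) * d).sum(-1).clamp(min=0.0)          # projection onto the ray
    foot = origin + t.unsqueeze(-1) * d                      # closest ray point per pixel
    dist2 = ((pts - foot) ** 2).sum(-1)                      # squared distance to the ray
    return torch.exp(-dist2 / (2.0 * sigma ** 2))            # (H, W) heatmap in [0, 1]


@torch.no_grad()
def clip_pick(image: Image.Image, expression, box_hf, box_wf, device="cpu"):
    """Pick between the head-fingertip and wrist-fingertip boxes by comparing
    the CLIP similarity of each predicted crop to the referring expression."""
    model, preprocess = clip.load("ViT-B/32", device=device)  # load once in practice
    text = model.encode_text(clip.tokenize([expression]).to(device))
    text = text / text.norm(dim=-1, keepdim=True)

    scores = []
    for box in (box_hf, box_wf):                              # (x1, y1, x2, y2)
        crop = preprocess(image.crop(box)).unsqueeze(0).to(device)
        feat = model.encode_image(crop)
        feat = feat / feat.norm(dim=-1, keepdim=True)
        scores.append((feat @ text.T).item())                 # cosine similarity

    return box_hf if scores[0] >= scores[1] else box_wf
```

In the full method, this CLIP comparison is further modulated by the size-aware adaptation rule mentioned above; see the paper for its exact formulation.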
```bibtex
@inproceedings{eyiokur2026cape,
  title     = {CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding},
  author    = {Eyiokur, Fevziye Irem and Yaman, Dogucan and Ekenel, Haz{\i}m Kemal and Waibel, Alexander},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2026}
}
```
This work was supported in part by the European Union's Horizon Europe research and innovation programme through the projects Meetween (grant No. 101135798) and DVPS (Diversibus Viis Plurima Solvo, grant No. 101213369), and by KIT Campus Transfer GmbH (KCT).