This is the official repository for the paper "High-Level Adaptive Feature Enhancement and Attention Mask-Guided Aggregation for Visual Place Recognition".
HAM-VPR is an enhanced Visual Place Recognition (VPR) framework designed to improve robustness against challenges such as dynamic occlusion and viewpoint variation. Key innovations include:
- High-Level Adaptive Feature Enhancement
  - Integrates a lightweight AdapterFormer module into DINOv2's Transformer blocks to enhance semantic adaptability while preserving fine-grained features (a rough adapter sketch follows this list).
  - Reduces parameter redundancy and generates structured segmentation feature maps, bridging the gap between pre-trained models and VPR tasks.
- Attention Mask-Guided Aggregation
  - A lightweight attention module generates implicit masks to guide global feature aggregation, suppressing irrelevant regions and amplifying discriminative areas (a pooling sketch also follows this list).
  - Two-stage training ensures seamless fusion of mask and segmentation features without re-extracting base features.
- Dataset & Validation
  - Introduces the VPR-City-Mask dataset (derived from GSV-City) with region annotations for real-world mask validation.
  - Achieves state-of-the-art performance on multiple VPR benchmarks, demonstrating scalability and robustness.
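The paper's AdapterFormer module is not reproduced here; as a rough illustration of the idea, an AdaptFormer-style bottleneck adapter placed in parallel with a frozen Transformer block's MLP looks like the PyTorch sketch below. All names (`BottleneckAdapter`, `bottleneck_dim`, `scale`) are illustrative assumptions, not this repository's actual API.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Illustrative bottleneck adapter: a small down-project/up-project
    MLP added in parallel to a frozen Transformer MLP, so only a handful
    of parameters are trained."""

    def __init__(self, dim: int, bottleneck_dim: int = 64, scale: float = 0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, dim)
        self.scale = scale
        # Zero-init the up-projection so the adapter starts as an identity
        # mapping and does not perturb the pre-trained features.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * self.up(self.act(self.down(x)))

# Parallel insertion into a frozen block's MLP path, schematically:
#   out = x + block.mlp(block.norm2(x)) + adapter(block.norm2(x))
```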
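Likewise, the attention mask-guided aggregation can be pictured as a lightweight scorer that predicts a per-patch weight (the implicit mask) and uses it to pool patch tokens into one global descriptor. This is only a sketch under assumed shapes; `MaskGuidedPooling` and its single-linear scorer are placeholders, not the paper's exact module.

```python
import torch
import torch.nn as nn

class MaskGuidedPooling(nn.Module):
    """Illustrative mask-guided aggregation: a one-layer scorer produces
    a soft mask over patch tokens, which re-weights them before pooling
    into a global descriptor."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # lightweight attention head

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) patch features from the backbone
        mask = torch.sigmoid(self.scorer(tokens))          # (B, N, 1) implicit mask
        weighted = (tokens * mask).sum(dim=1)              # suppress irrelevant regions
        desc = weighted / mask.sum(dim=1).clamp_min(1e-6)  # masked average pooling
        return nn.functional.normalize(desc, dim=-1)       # L2-normalized descriptor
```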
The dataset should be organized in the following directory tree:
```
datasets_vpr
└── datasets
    └── VPR-City-Mask
        └── images
            ├── train
            │   ├── database
            │   ├── database_mask
            │   ├── queries
            │   └── queries_mask
            ├── val
            │   ├── database
            │   ├── database_mask
            │   ├── queries
            │   └── queries_mask
            └── test
                ├── database
                ├── database_mask
                ├── queries
                └── queries_mask
```
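A throwaway snippet like the one below can sanity-check this layout before training. It is not part of the repo; adjust the root path to your machine.

```python
from pathlib import Path

# Root of the VPR-City-Mask images directory (hypothetical path).
root = Path("/path/to/your/datasets_vpr/datasets/VPR-City-Mask/images")
for split in ("train", "val", "test"):
    for sub in ("database", "database_mask", "queries", "queries_mask"):
        d = root / split / sub
        print(f"{d}: {'OK' if d.is_dir() else 'MISSING'}")
```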
We use the pre-trained foundation model DINOv2 (ViT-L/14) (HERE) as the basis for fine-tuning. The released model is fine-tuned on VPR-City-Mask (for diverse scenes); its results are shown in the table below.
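For reference, the DINOv2 ViT-L/14 backbone can be loaded from the official hub as below. Freezing the backbone and training only the added modules follows the usual parameter-efficient fine-tuning recipe; the `"adapter"` name filter is a placeholder for illustration, not this repository's actual parameter naming.

```python
import torch

# Load the pre-trained DINOv2 ViT-L/14 backbone (official hub entry point).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")

# Freeze everything except newly added modules. "adapter" is a
# hypothetical substring used here only for illustration.
for name, param in backbone.named_parameters():
    param.requires_grad = "adapter" in name
```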
| Dataset | R@1 | R@5 | R@10 |
|---|---|---|---|
| Pitts30k-test | 89.7 | 95.9 | 96.6 |
| Pitts250k-test | 93.7 | 98.2 | 98.6 |
| MSLS-val | 83.6 | 93.0 | 95.0 |
| Tokyo24/7 | 85.6 | 92.2 | 94.3 |
| SF-XL-testv1 | 76.9 | 83.6 | 80.5 |
Set `rerank_num=100` to reproduce the results in the paper, or set `rerank_num=20` for nearly the same results at 1/5 of the re-ranking runtime (0.018 s per query).
```
python3 eval.py --datasets_folder=/path/to/your/datasets_vpr/datasets --dataset_name=pitts30k --resume=./weight/HAM-VPR.pth
```
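If `rerank_num` is exposed as a command-line flag (an assumption; it may instead be a constant inside the code), the faster setting would look like:

```
python3 eval.py --datasets_folder=/path/to/your/datasets_vpr/datasets --dataset_name=pitts30k --resume=./weight/HAM-VPR.pth --rerank_num=20
```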
Parts of this repo are inspired by the following repositories:
If you find this repo useful for your research, please consider leaving a star ⭐️ and citing the paper:
```bibtex
@inproceedings{HAM-VPR,
  title={High-Level Adaptive Feature Enhancement and Attention Mask-Guided Aggregation for Visual Place Recognition},
  author={Wang, Longhao and Lan, Chaozhen and Wu, Beibei and Yao, Fushan and Wei, Zijun and Gao, Tian and Yu, Hanyang},
  booktitle={***},
  year={2025}
}
```
