This repository is the official implementation of "Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment" (ICLRW 2024).
This paper proposes HarMA, a framework for efficient remote sensing built on Harmonized Transfer Learning and Modality Alignment, which addresses key challenges in remote sensing image-text retrieval. HarMA adopts a unified perspective on multimodal transfer learning to jointly improve task performance, modality alignment, and single-modality uniform alignment. Its core component is a hierarchical multimodal adapter, inspired by how the human brain processes information, which integrates shared mini-adapters to improve fine-grained semantic alignment. Through parameter-efficient fine-tuning, HarMA significantly reduces training overhead while achieving state-of-the-art performance on popular multimodal retrieval tasks without relying on external data or other tricks. With only minimal parameter updates, it outperforms fully fine-tuned models, making it a versatile and resource-efficient solution for remote sensing applications. Experiments validate the effectiveness of HarMA and showcase its potential to enhance vision and language representations in remote sensing tasks.
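For intuition, here is a minimal, hypothetical PyTorch sketch of the shared mini-adapter idea: a small bottleneck block whose weights are reused across layers and across the image/text branches. This is not the repository's actual implementation; all class names, dimensions, and scaling choices below are illustrative assumptions.

```python
# Illustrative sketch only -- see the repository code for the real HarMA modules.
import torch
import torch.nn as nn


class MiniAdapter(nn.Module):
    """Small bottleneck block intended to be shared across layers/modalities."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))


class HierarchicalAdapter(nn.Module):
    """Per-layer adapter that wraps a shared MiniAdapter with a residual path."""

    def __init__(self, dim: int, shared: MiniAdapter, scale: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.shared = shared  # the same instance can be reused by other layers/branches
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.scale * self.shared(self.norm(x))


# Usage: one shared mini-adapter plugged into both modality branches.
shared = MiniAdapter(dim=768)
img_adapter = HierarchicalAdapter(dim=768, shared=shared)
txt_adapter = HierarchicalAdapter(dim=768, shared=shared)
img_feat = img_adapter(torch.randn(2, 197, 768))  # ViT-style image tokens
txt_feat = txt_adapter(torch.randn(2, 32, 768))   # text tokens
```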
Set up the environment by running:

```bash
pip install -r requirements.txt
```
All experiments are based on the RSITMD and RSICD datasets.
Download the images from Baidu Disk or Google Drive and set the image path in the corresponding YAML file under `configs/`:

```yaml
image_root: './images/datasets_name/'
```

The annotation files for the datasets are located in the `data/finetune` directory.
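For example, an RSITMD config might contain entries along these lines. Only `image_root` is documented above; the annotation-file keys and file names are illustrative assumptions, so check the actual YAML for the exact field names.

```yaml
# Illustrative excerpt; key names other than image_root are assumptions.
image_root: './images/rsitmd/'
train_file: ['data/finetune/rsitmd_train.json']
test_file: 'data/finetune/rsitmd_test.json'
```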
Download the GeoRSCLIP pre-trained model from this link and place it in the `models/pretrain/` directory.
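For instance (the checkpoint file name below is a placeholder; use whatever name the download provides):

```bash
mkdir -p models/pretrain
mv /path/to/georsclip_checkpoint.pt models/pretrain/
```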
If you encounter environment issues, you can modify the `get_dist_launch` function in `run.py`. For example, for a 2-GPU setup:

```python
elif args.dist == 'f2':
    # Adjust the interpreter path, visible devices, and master port to match your machine.
    return "CUDA_VISIBLE_DEVICES=0,1 WORLD_SIZE=2 /root/miniconda3/bin/python -W ignore -m torch.distributed.launch --master_port 9999 --nproc_per_node=2 " \
           "--nnodes=1 "
```
Start training with:

```bash
python run.py --task 'itr_rsitmd_vit' --dist "f2" --config 'configs/Retrieval_rsitmd_vit.yaml' --output_dir './checkpoints/HARMA/full_rsitmd_vit'
python run.py --task 'itr_rsicd_vit' --dist "f2" --config 'configs/Retrieval_rsicd_vit.yaml' --output_dir './checkpoints/HARMA/full_rsicd_vit'
python run.py --task 'itr_rsitmd_geo' --dist "f2" --config 'configs/Retrieval_rsitmd_geo.yaml' --output_dir './checkpoints/HARMA/full_rsitmd_geo'
python run.py --task 'itr_rsicd_geo' --dist "f2" --config 'configs/Retrieval_rsicd_geo.yaml' --output_dir './checkpoints/HARMA/full_rsicd_geo'
```
To evaluate a trained model, set `if_evaluation` to `True` in the corresponding YAML file under `configs/`, then run:

```bash
python run.py --task 'itr_rsitmd_vit' --dist "f2" --config 'configs/Retrieval_rsitmd_vit.yaml' --output_dir './checkpoints/HARMA/test' --checkpoint './checkpoints/HARMA/full_rsitmd_vit/checkpoint_best.pth' --evaluate
python run.py --task 'itr_rsicd_vit' --dist "f2" --config 'configs/Retrieval_rsicd_vit.yaml' --output_dir './checkpoints/HARMA/test' --checkpoint './checkpoints/HARMA/full_rsicd_vit/checkpoint_best.pth' --evaluate
python run.py --task 'itr_rsitmd_geo' --dist "f2" --config 'configs/Retrieval_rsitmd_geo.yaml' --output_dir './checkpoints/HARMA/test' --checkpoint './checkpoints/HARMA/full_rsitmd_geo/checkpoint_best.pth' --evaluate
python run.py --task 'itr_rsicd_geo' --dist "f2" --config 'configs/Retrieval_rsicd_geo.yaml' --output_dir './checkpoints/HARMA/test' --checkpoint './checkpoints/HARMA/full_rsicd_geo/checkpoint_best.pth' --evaluate
```
Note: We provide a Jupyter notebook for direct execution; please refer to the `begin.ipynb` file. If you want to test or use the pre-trained models directly, you can download the checkpoints from Checkpoints-v1.0.0.
If you find this paper or repository useful for your work, please give it a star ⭐ and cite it as follows:
```bibtex
@article{huang2024efficient,
  title={Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment},
  author={Huang, Tengjun},
  journal={arXiv preprint arXiv:2404.18253},
  year={2024}
}
```
This code builds upon the excellent work of PIR by Pan et al.