BITA

This is the official code for "Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning"

Dependencies

The project was developed locally with PyTorch 2.0. Install the dependencies with:

pip install -r requirements.txt

Dataset

This paper uses the NWPU-Caption, RSICD, and UCM-Caption datasets. Pre-training uses only the training sets of these three datasets. For the final fine-tuning stage, uncomment the val and test fields for the three datasets in the BITA/configs/datasets/ directory.
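As a quick check before fine-tuning, the sketch below lists dataset-config lines where a val or test field still appears commented out. It assumes the fields are disabled with a leading # and written as val: / test: keys. The helper name list_commented_splits is hypothetical and not part of the repository.

```shell
# Hypothetical helper (not part of BITA): list dataset-config lines where a
# 'val:' or 'test:' key is still commented out with a leading '#'.
# Assumption: the splits are disabled as '# val:' / '# test:' in the configs.
list_commented_splits() {
  config_dir="$1"
  grep -rnE '^[[:space:]]*#[[:space:]]*(val|test)[[:space:]]*:' "$config_dir"
}

# Example call from the repository root:
#   list_commented_splits BITA/configs/datasets/
```

If the command prints nothing, no commented-out val or test keys matching this pattern were found.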

Weights

The download links for the weights from the two-stage pre-training and the final fine-tuning stage are available here. Within this, the 'Caption' folder contains the fine-tuned model weights that achieved the best accuracy on the validation set.

Pre-training (stage1)

In the first stage of pre-training, the visual encoder used for training is ViT-L/14 from CLIP. Please ensure the value of the 'pretrained' field in the 'BITA/configs/models/bita/bita_pretrain_vitL.yaml' file is set to 'https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_vitL.pth'. Then, run the following script:

bash ./scripts/bita/train/pretrain_stage1.sh
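To confirm the 'pretrained' field before launching stage 1, a small check along these lines may help. It is only a sketch: check_pretrained is a hypothetical helper, and it assumes the field sits on a single line of the form pretrained: <value>.

```shell
# Hypothetical helper (not part of BITA): succeed only if the 'pretrained'
# line of a config file contains the expected checkpoint path or URL.
# Assumption: the field is written on one line as 'pretrained: <value>'.
check_pretrained() {
  cfg="$1"
  expected="$2"
  grep -E '^[[:space:]]*pretrained:' "$cfg" | grep -qF "$expected"
}

# Example call from the repository root:
#   check_pretrained BITA/configs/models/bita/bita_pretrain_vitL.yaml \
#     'https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_vitL.pth' \
#     && bash ./scripts/bita/train/pretrain_stage1.sh
```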

Pre-training (stage2)

In the second stage of pre-training, replace the value of the 'pretrained' field in the 'BITA/configs/models/bita/bita_pretrain_opt2.7b.yaml' file with the path to the weights produced by the first stage of pre-training, located at '/usr/code/BITA/BITA_weights/Stage1/checkpoint_best.pth'. Then, run the following script:

bash ./scripts/bita/train/pretrain_stage2.sh
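Editing the 'pretrained' field by hand works fine. For repeated runs, the replacement could be scripted roughly as follows. This is a sketch under assumptions: set_pretrained is a hypothetical helper, the field is assumed to sit on one line as pretrained: <value>, and the new path must not contain the characters |, & or a backslash.

```shell
# Hypothetical helper (not part of BITA): overwrite the value of the
# 'pretrained' field in a YAML config, keeping a '.bak' copy of the original.
# Assumptions: the field is on one line as 'pretrained: <value>', and the new
# value contains no '|', '&' or backslash characters.
set_pretrained() {
  cfg="$1"
  weights="$2"
  sed -i.bak -E "s|^([[:space:]]*pretrained:[[:space:]]*).*|\1\"$weights\"|" "$cfg"
}

# Example call from the repository root:
#   set_pretrained BITA/configs/models/bita/bita_pretrain_opt2.7b.yaml \
#     /usr/code/BITA/BITA_weights/Stage1/checkpoint_best.pth
```

The same helper could be pointed at the fine-tuning config once stage 2 has produced its checkpoint.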

Fine-tune & Evaluation

In the final fine-tuning stage, replace the value of the 'pretrained' field in the 'BITA/configs/models/bita/bita_caption_opt2.7b.yaml' file with the path to the weights produced by the second stage of pre-training, located at '/usr/code/BITA/BITA_weights/Stage2/checkpoint_best.pth'. Then, run the following script:

bash ./scripts/bita/train/train_caption.sh
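Because each stage depends on the checkpoint written by the previous one, it can be useful to verify that the file exists before starting a long run. The guard below is a minimal sketch; require_checkpoint is a hypothetical helper, not something the repository provides.

```shell
# Hypothetical guard (not part of BITA): stop early if a checkpoint file is
# missing or empty, instead of failing later inside the training script.
require_checkpoint() {
  if [ ! -s "$1" ]; then
    echo "Missing or empty checkpoint: $1" >&2
    return 1
  fi
}

# Example call from the repository root:
#   require_checkpoint /usr/code/BITA/BITA_weights/Stage2/checkpoint_best.pth \
#     && bash ./scripts/bita/train/train_caption.sh
```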

Evaluating Only

bash ./scripts/bita/eval/eval_caption.sh

Acknowledgments

This implementation is largely based on the code of LAVIS - A Library for Language-Vision Intelligence. Thanks a lot.

Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{10415446,
  author={Yang, Cong and Li, Zuchao and Zhang, Lefei},
  journal={IEEE Transactions on Geoscience and Remote Sensing}, 
  title={Bootstrapping Interactive Image–Text Alignment for Remote Sensing Image Captioning}, 
  year={2024},
  volume={62},
  number={},
  pages={1-12},
  doi={10.1109/TGRS.2024.3359316}}
