Multimodal Models for Remote Sensing Image Change Retrieval and Captioning

rogerferrod/RSICRC

Recently, there has been increasing interest in multimodal applications that integrate text with other data modalities, such as images, audio and video, to enable natural language interaction with AI systems and to fully exploit the potential of multimodal models. This could be critical for Remote Sensing (RS) applications such as environmental protection, disaster monitoring and land planning. Existing solutions, however, either cannot account for temporal changes between multiple observations or are too narrowly focused on specific tasks such as classification, captioning and retrieval, and few foundation models are available.

To address this gap, we propose a novel foundation model for bi-temporal RS image pairs, leveraging Contrastive Learning and the LEVIR-CC dataset for both captioning and text-image retrieval. By jointly training a contrastive encoder and a captioning decoder, our model adds text-image retrieval capabilities while maintaining captioning performance comparable to the state of the art.
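
As a rough illustration of what jointly training a contrastive encoder and a captioning decoder can look like, the sketch below combines a symmetric InfoNCE contrastive term with a token-level cross-entropy captioning term. This is not taken from the repository: the exact loss used by the model is not stated here, and every name in the block (`contrastive_loss`, `captioning_loss`, `joint_loss`, `retrieve`, `temperature`, `caption_weight`) is hypothetical. It uses plain Python lists in place of real encoder and decoder outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(pair_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE (assumed objective): item i of pair_embs is the
    embedding of a bi-temporal image pair and matches caption i of text_embs."""
    p = [l2_normalize(v) for v in pair_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(p)
    sim = [[sum(a * b for a, b in zip(p[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image pair -> text direction: the correct caption sits on the diagonal.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text -> image pair direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


def captioning_loss(token_logits, target_ids):
    """Mean cross-entropy between decoder logits and reference caption tokens."""
    total = sum(-log_softmax(logits)[tid]
                for logits, tid in zip(token_logits, target_ids))
    return total / len(target_ids)


def joint_loss(pair_embs, text_embs, token_logits, target_ids, caption_weight=1.0):
    """Weighted sum of both terms; caption_weight is a hypothetical hyperparameter."""
    return (contrastive_loss(pair_embs, text_embs)
            + caption_weight * captioning_loss(token_logits, target_ids))


def retrieve(query_emb, pair_embs):
    """Rank image-pair embeddings by cosine similarity to a text query embedding."""
    q = l2_normalize(query_emb)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(v))) for v in pair_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Under these assumptions, the contrastive term is what makes retrieval possible: once image pairs and captions share an embedding space, `retrieve` only needs a similarity ranking, while the captioning term keeps the decoder trained on the same representation.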

Pretrained weights are available at drive.google.com/RSICRC
