Jiayang Ao · Yanbei Jiang · Qiuhong Ke · Krista A. Ehinger
We introduce a training-free framework that expands amodal completion capabilities by accepting flexible text queries as input. Our approach generalizes to arbitrary objects specified by both direct terms and abstract queries. We term this capability reasoning amodal completion, where the system reconstructs the full appearance of the queried object based on the provided image and language query. Our framework unifies segmentation, occlusion analysis, and inpainting to handle complex occlusions and generates completed objects as RGBA elements, enabling seamless integration into applications such as 3D reconstruction and image editing.
Overview of our framework. Starting with a text query, a VLM generates a visible mask to locate the target object in the input image. The framework then identifies all objects and background segments for occlusion analysis. An auto-generated prompt guides the inpainting model, which iteratively reconstructs the occluded object to produce a transparent RGBA amodal completion output.
This repository provides the implementation of our Open-World Amodal Appearance Completion pipeline. The core logic resides in main.py.
Python: 3.10.14
PyTorch: 1.13.1+cu117
Install dependencies via:
pip install -r requirements.txt

The pipeline uses several pre-trained models:
- LISA (mapping textual queries to visible object regions): Clone and install from the official LISA repository, then download the checkpoint LISA-13B-llama2-v1 from Hugging Face.
  ⚠️ Replace the original LISA/app.py with our modified version in this repository. It makes minimal changes to line 310 and line 322 so that the raw segmentation mask (pred_mask) is returned for integration with our pipeline.
  We access LISA via API to avoid dependency conflicts. Run the LISA server locally and update LISA_SERVER_URL in main.py accordingly (a client sketch is given after this list).
- InstaOrder (for occlusion relationships): Clone and install from the InstaOrder repository, then download the checkpoint InstaOrder_InstaOrderNet_od.pth.tar.
- RAM-Grounded-SAM: Install RAM++ following the instructions in the official recognize-anything repository and download the checkpoint ram_plus_swin_large_14m.pth. Install Grounded-SAM following the instructions in the official Grounded-Segment-Anything repository and download the checkpoints groundingdino_swint_ogc.pth and sam_vit_h_4b8939.pth.
- Stable Diffusion (for inpainting): the Stable Diffusion v2 inpainting model.
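Since LISA runs as a separate local server, a minimal client call might look like the sketch below. The endpoint path, payload fields, and response format are assumptions for illustration only; the actual interface is defined by the modified LISA/app.py, and LISA_SERVER_URL must match the address printed when you launch the server.

```python
# Hypothetical sketch of querying a locally running LISA server.
# The endpoint path and payload schema are assumptions; adapt them to the
# interface exposed by the modified LISA/app.py.
import base64

import numpy as np
import requests

LISA_SERVER_URL = "http://127.0.0.1:7860"  # must match the value set in main.py


def query_visible_mask(image_path: str, text_query: str) -> np.ndarray:
    """Send an image and a text query; return the predicted visible mask."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {"image": image_b64, "query": text_query}  # hypothetical schema
    resp = requests.post(f"{LISA_SERVER_URL}/predict", json=payload, timeout=120)
    resp.raise_for_status()
    # Assume the server returns the raw pred_mask as a nested list of 0/1 values.
    return np.array(resp.json()["pred_mask"], dtype=np.uint8)


# Example usage:
# mask = query_visible_mask("./images_example/demo.jpg", "the cup behind the laptop")
```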
To run the pipeline, you need to prepare the following input files:
- Input images: Place all image files in a directory (e.g., ./images_example/).
- Image filename list: A text file (e.g., img_filenames_example.txt) that contains one image filename per line. For large-scale runs, the pipeline processes this file in batches.
- JSON annotations: A JSON file (e.g., example_annotation.json) containing the text query for each image (see the sketch after this list).
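As an illustration of how these inputs could be prepared programmatically, the snippet below writes a filename list and a matching annotation file. The JSON schema shown here (a simple filename-to-query mapping) is an assumption; mirror the structure of example_annotation.json for the format actually expected by main.py.

```python
# Illustrative preparation of the input files. The annotation schema
# (image filename -> text query) is an assumption; follow example_annotation.json.
import json
from pathlib import Path

image_dir = Path("./images_example")
filenames = sorted(p.name for p in image_dir.glob("*.jpg"))

# One image filename per line.
Path("img_filenames_example.txt").write_text("\n".join(filenames) + "\n")

# One text query per image (direct terms or abstract queries both work).
annotations = {name: "the object hidden behind the chair" for name in filenames}
with open("example_annotation.json", "w") as f:
    json.dump(annotations, f, indent=2)
```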
To process images in batches (e.g., 5 images at a time), you can use the provided script main_batch_example.sh.
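If you prefer driving the batching from Python instead of the shell script, it can be sketched as below. The command-line arguments passed to main.py are hypothetical placeholders; use main_batch_example.sh as the reference for the actual invocation.

```python
# Hypothetical Python equivalent of main_batch_example.sh: split the filename
# list into batches of 5 and invoke main.py once per batch. The flag names are
# placeholders; check main_batch_example.sh for the real arguments.
import subprocess
from pathlib import Path

BATCH_SIZE = 5
filenames = Path("img_filenames_example.txt").read_text().splitlines()

for i in range(0, len(filenames), BATCH_SIZE):
    batch_file = Path(f"batch_{i // BATCH_SIZE:03d}.txt")
    batch_file.write_text("\n".join(filenames[i:i + BATCH_SIZE]) + "\n")
    subprocess.run(
        ["python", "main.py",
         "--img_list", str(batch_file),               # placeholder flag
         "--annotation", "example_annotation.json",   # placeholder flag
         "--img_dir", "./images_example"],            # placeholder flag
        check=True,
    )
```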
Our framework does not require any training. For evaluation, we use a collection of real-world occluded images with corresponding annotations, which can be downloaded here.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
We thank the authors of the following papers for their open-source code, pre-trained models, and datasets:
- Amodal Completion via Progressive Mixed Context Diffusion [CVPR 2024]
- LISA: Reasoning Segmentation via Large Language Model [CVPR 2024]
- Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection [ECCV 2024]
- Segment Anything [ICCV 2023]
- Open-Set Image Tagging with Multi-Grained Text Supervision [arXiv 2023]
- Instance-Wise Occlusion and Depth Orders in Natural Scenes [CVPR 2022]
- High-Resolution Image Synthesis with Latent Diffusion Models [CVPR 2022]
- Learning Transferable Visual Models from Natural Language Supervision [ICML 2021]
- Semantic Amodal Segmentation [CVPR 2017]
- Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations [IJCV 2017]
- LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs [Data Centric AI NeurIPS Workshop 2021]
If you find this helpful in your work, please consider citing our paper:
@inproceedings{ao2025open,
title={Open-world amodal appearance completion},
author={Ao, Jiayang and Jiang, Yanbei and Ke, Qiuhong and Ehinger, Krista A},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={6490--6499},
year={2025}
}
If you have any questions regarding this work, please email jiayang.ao@student.unimelb.edu.au.

