
GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement

GSRFormer is an approach for grounded situation recognition (GSR) that aims to mimic human-like understanding of visual scenes. While machines can detect objects and classify images well, interpreting the narrative and semantics conveyed in an image remains challenging.

GSRFormer seeks to advance GSR by modeling not just primary actions, but the associated entities and roles that form a cohesive visual situation.


Key Features

  • Alternating Learning Scheme: Uses a bidirectional learning process in which verbs and nouns refine each other in turns, yielding a holistic semantic understanding beyond unidirectional interpretation (see the sketch after this list).

  • Pseudo Labeling: Initially assumes pseudo labels for semantic roles to focus directly on learning intermediate representations from images, avoiding verb ambiguity issues in conventional GSR.

  • Support Images: Leverages supplementary images during training to refine verbs using corresponding nouns and vice versa, enhancing generalization.
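
To make the alternating idea concrete, here is a minimal PyTorch sketch of two cross-attention passes between one verb token and a set of role (noun) tokens. The module names, shapes, and two-step loop are illustrative assumptions for exposition, not the repository's actual architecture:

import torch
import torch.nn as nn

class AlternateRefinement(nn.Module):
    """Toy sketch of alternating verb/noun refinement (illustrative only)."""
    def __init__(self, dim=512, heads=8, steps=2):
        super().__init__()
        self.steps = steps
        self.verb_from_nouns = nn.MultiheadAttention(dim, heads)  # verb attends to role tokens
        self.nouns_from_verb = nn.MultiheadAttention(dim, heads)  # role tokens attend to verb

    def forward(self, verb, nouns):
        # PyTorch attention expects (sequence, batch, dim):
        # verb: (1, B, D), a single verb token; nouns: (R, B, D), R role tokens.
        for _ in range(self.steps):
            # Refine the verb estimate from the current noun/role features ...
            verb = verb + self.verb_from_nouns(verb, nouns, nouns)[0]
            # ... then refine the noun/role features from the updated verb.
            nouns = nouns + self.nouns_from_verb(nouns, verb, verb)[0]
        return verb, nouns

verb, nouns = torch.randn(1, 2, 512), torch.randn(6, 2, 512)
verb, nouns = AlternateRefinement()(verb, nouns)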

For more details, please refer to our ACM Multimedia 2022 paper.

GSRFormer achieves state-of-the-art performance on the SWiG benchmark, advancing scene understanding and narrative interpretation capabilities.

Setup & Installation

Prerequisites

  • Conda
  • PyTorch

Installation Steps

  1. Start by cloning the repository:
    git clone https://github.com/zhiqic/GSRFormer.git
    cd GSRFormer
  2. Create and activate the Conda environment:
    conda create --name GSRFormer python=3.9              
    conda activate GSRFormer
    conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge
  3. Install the packages required for the project:
    pip install -r requirements.txt                   

Dataset: SWiG

GSRFormer is trained, validated, and tested on the SWiG dataset:

  • Annotations: Found in "SWiG/SWiG_jsons/".
  • Images: Download them here and place them in "SWiG/images_512/".

Directory Structure

  • Images: "SWiG/images_512/"
  • Training annotations: "SWiG/SWiG_jsons/train.json"
  • Development annotations: "SWiG/SWiG_jsons/dev.json"
  • Testing annotations: "SWiG/SWiG_jsons/test.json"
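
To sanity-check the data on disk, the snippet below prints one annotation. The field names ("verb", "frames", "bb") follow the public SWiG release; treat them as an assumption and adjust if your copy differs:

import json

# Load the training annotations from the path listed above.
with open("SWiG/SWiG_jsons/train.json") as f:
    train = json.load(f)

# Each key is an image filename; each value describes its situation.
name, ann = next(iter(train.items()))
print(name)              # image filename
print(ann["verb"])       # salient activity depicted in the image
print(ann["frames"][0])  # one annotator's role -> noun mapping
print(ann["bb"])         # role -> bounding box used for grounding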

Training

Kickstart the training with:

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py \
           --backbone resnet50 --dataset_file swig \
           --encoder_epochs 20 --decoder_epochs 25 \
           --preprocess True \
           --num_workers 4 --num_enc_layers 6 --num_dec_layers 5 \
           --dropout 0.15 --hidden_dim 512 --output_dir GSRFormer
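
The launcher above assumes four GPUs. For a quick single-GPU sanity run, the same flags should work without the distributed launcher (a common pattern in DETR-style codebases; verify against the argument parser in main.py):

python main.py --backbone resnet50 --dataset_file swig \
           --encoder_epochs 20 --decoder_epochs 25 --preprocess True \
           --num_workers 4 --num_enc_layers 6 --num_dec_layers 5 \
           --dropout 0.15 --hidden_dim 512 --output_dir GSRFormer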

Evaluation

Assess your model using:

python main.py --output_dir GSRFormer --dev
python main.py --output_dir GSRFormer --test
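
Both commands presumably load the trained checkpoint from the --output_dir directory; --dev scores the model against dev.json and --test against test.json.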

Inference

To run the model on your own images:

python inference.py --image_path inference/filename.jpg \
                    --output_dir inference
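
To process a whole folder of images, a simple shell loop over the same entry point should suffice (assuming inference.py handles one image per invocation, as the command above implies):

for f in inference/*.jpg; do
    python inference.py --image_path "$f" --output_dir inference
done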

Acknowledgments

Thanks to the authors of the CoFormer repository for providing an excellent codebase that our work builds upon. We sincerely appreciate the support of Microsoft Research throughout this project.

Citation

If you build upon this work, please cite:

@inproceedings{cheng2022gsrformer,
  title={GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement},
  author={Cheng, Zhi-Qi and Dai, Qi and Li, Siyao and Mitamura, Teruko and Hauptmann, Alexander},
  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
  pages={3272--3281},
  year={2022}
}

License

Refer to the Apache 2.0 license provided in LICENSE for usage details.
