[WACV'25] Spatio-Temporal Context Prompting for Zero-Shot Action Detection (ST-CLIP)

The Official PyTorch implementation of Spatio-Temporal Context Prompting for Zero-Shot Action Detection (WACV'25).

Wei-Jhe Huang¹, Min-Hung Chen², Shang-Hong Lai¹
¹National Tsing Hua University, ²NVIDIA Research Taiwan

[Paper] [Website] [BibTeX]

This work proposes Spatio-Temporal Context Prompting for Zero-Shot Action Detection (ST-CLIP), which aims to adapt the pretrained image-language model to detect unseen actions. We propose the Person-Context Interaction which employs pretrained knowledge to model the relationship between people and their surroundings, and the Context Prompting module which can utilize visual information to augment the text content. To address multi-action videos, we further introduce the Interest Token Spotting mechanism to identify the visual tokens most relevant to each individual action. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on different datasets. The experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
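At its core, the zero-shot recipe described above follows the CLIP paradigm: a detected person's visual feature is compared against text embeddings of candidate action names, so unseen actions can be scored without retraining. The sketch below illustrates that general matching step only; the function name, shapes, and temperature value are illustrative assumptions, not ST-CLIP's actual API.

```python
import numpy as np

def zero_shot_action_scores(person_feat, text_feats, temperature=0.07):
    """Score candidate actions for one detected person (illustrative sketch).

    person_feat: (D,) visual feature for the person.
    text_feats:  (C, D) text embeddings, one per action name.
    Returns a (C,) probability distribution over actions, obtained by
    cosine similarity followed by a temperature-scaled softmax --
    the standard CLIP-style zero-shot classification step.
    """
    # L2-normalize both sides so the dot product is cosine similarity.
    p = person_feat / np.linalg.norm(person_feat)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = (t @ p) / temperature
    # Numerically stable softmax over the candidate actions.
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

Because the action classes enter only through their text embeddings, swapping in embeddings of unseen action names is all that is needed at inference time.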

For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing.

Installation

Please check INSTALL.md to install the environment.

Data Preparation

Please check DATA.md to prepare data needed for training or inference. We provide the data for J-HMDB first, and the others will be released as soon as possible.

Training and Inference

Please check GETTING_STARTED.md for the training/inference instructions.

Acknowledgement

We are very grateful to the authors of HIT and X-CLIP for open-sourcing their code, on which this repository is heavily based. If you find their research useful, please consider citing their papers as well.

Citation

If this project helps you in your research or project, please cite this paper:

@article{huang2024spatio,
  title={Spatio-Temporal Context Prompting for Zero-Shot Action Detection},
  author={Huang, Wei-Jhe and Chen, Min-Hung and Lai, Shang-Hong},
  journal={arXiv preprint arXiv:2408.15996},
  year={2024}
}

Licenses

Copyright © 2025, NVIDIA Corporation. All rights reserved.

This work is made available under the NVIDIA Source Code License-NC. Click here to view a copy of this license.
