TAP: Temporally-Aggregative Pretraining with Transformers for Temporal Action Detection

Given the long duration of untrimmed videos and the difficulty of end-to-end training, contemporary temporal action detection (TAD) methods rely heavily on pre-computed video feature sequences for subsequent analysis. However, clip features extracted directly from video encoders trained for trimmed action classification commonly lack temporal sensitivity. To overcome this limitation, we propose temporally-aggregative pretraining (TAP), a framework built on the principle of extracting temporally sensitive TAD features that are more discriminative than those produced by trimmed action classification. TAP consists of two core modules, a feature encoding module and a temporal aggregation module, which exploit local and global features, respectively, during pretraining. The feature encoding module employs a multiscale vision transformer as its video encoder, combining the concept of a multiscale feature hierarchy with transformers for effective feature extraction from video clips. The temporal aggregation module introduces a temporal pyramid pooling layer that captures temporal-contextual semantic information from video feature sequences, yielding more discriminative global video representations. Extensive experiments on two commonly used datasets validate the significantly improved discriminative power of our pretrained features.
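The repository does not include the temporal aggregation code in this page, but the general idea of a temporal pyramid pooling layer can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the pyramid levels `(1, 2, 4)`, and the use of average pooling are illustrative assumptions.

```python
import numpy as np

def temporal_pyramid_pool(features: np.ndarray, levels=(1, 2, 4)) -> np.ndarray:
    """Pool a (T, D) clip-feature sequence at multiple temporal scales.

    For each pyramid level k, the T clip features are split into k
    contiguous temporal segments; each segment is average-pooled into a
    single D-dimensional vector. Concatenating the pooled vectors from
    all levels gives a fixed-length global representation of size
    D * sum(levels), independent of the sequence length T.
    """
    pooled = []
    for k in levels:
        # np.array_split tolerates T not being divisible by k.
        for segment in np.array_split(features, k, axis=0):
            pooled.append(segment.mean(axis=0))
    return np.concatenate(pooled)
```

For example, a sequence of 30 clip features of dimension 8 is aggregated into a single vector of dimension 8 * (1 + 2 + 4) = 56, with the level-1 component equal to the global average of the sequence.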