TAP: Temporally-Aggregative Pretraining with Transformers for Temporal Action Detection

Given the long duration of untrimmed videos and the difficulty of end-to-end training, contemporary temporal action detection (TAD) methods rely heavily on pre-computed video feature sequences for subsequent analysis. However, clip features extracted directly from video encoders trained for trimmed action classification commonly lack temporal sensitivity. To overcome this limitation, we propose temporally-aggregative pretraining (TAP), a framework built on the principle of extracting temporally sensitive TAD features that are more discriminative than those produced by trimmed action classification. TAP consists of two core modules, a feature encoding module and a temporal aggregation module, which exploit local and global features, respectively, during pretraining. The feature encoding module employs a multiscale vision transformer as its video encoder, combining the concept of a multiscale feature hierarchy with transformers for effective feature extraction from video clips. The temporal aggregation module introduces a temporal pyramid pooling layer that captures temporal-contextual semantic information from video feature sequences, yielding more discriminative global video representations. Extensive experiments on two commonly used datasets validate the significantly improved discriminative power of our pretrained features.
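The repository does not include the temporal aggregation code in this page, but the general idea of a temporal pyramid pooling layer can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the pyramid levels `(1, 2, 4)`, and the use of average pooling are illustrative assumptions.

```python
import numpy as np

def temporal_pyramid_pool(features: np.ndarray, levels=(1, 2, 4)) -> np.ndarray:
    """Pool a (T, D) clip-feature sequence at multiple temporal scales.

    For each pyramid level k, the T clip features are split into k
    contiguous temporal segments; each segment is average-pooled into a
    single D-dimensional vector. Concatenating the pooled vectors from
    all levels gives a fixed-length global representation of size
    D * sum(levels), independent of the sequence length T.
    """
    pooled = []
    for k in levels:
        # np.array_split tolerates T not being divisible by k.
        for segment in np.array_split(features, k, axis=0):
            pooled.append(segment.mean(axis=0))
    return np.concatenate(pooled)
```

For example, a sequence of 30 clip features of dimension 8 is aggregated into a single vector of dimension 8 * (1 + 2 + 4) = 56, with the level-1 component equal to the global average of the sequence.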