ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos [ICPR 2024]
This repository contains all the datasets and code bases (experiments and evaluations) used to develop and evaluate our newly proposed ActNetFormer framework for video action recognition.
The official repository of the paper, with supplementary material: ActNetFormer!!
This project was carried out at Monash University, Malaysia campus.
Project Members -
Sharana Dharshikgan Suresh Dass (Monash University, Malaysia)
Hrishav Bakul Barua (Monash University and TCS Research, Kolkata, India)
Ganesh Krishnasamy (Monash University, Malaysia)
Raveendran Paramesran (Monash University, Malaysia)
Raphaël C.-W. Phan (Monash University, Malaysia)
This work has been accepted at ICPR 2024.
This work is supported by the Global Research Excellence Scholarship, Monash University, Malaysia. This research is also supported, in part, by the prestigious Global Excellence and Mobility Scholarship (GEMS), Monash University (Malaysia & Melbourne, Australia).
Human action or activity recognition in videos is a fundamental task in computer vision, with applications in surveillance and monitoring, self-driving cars, sports analytics, human-robot interaction, and more. Traditional supervised methods require large annotated datasets for training, which are expensive and time-consuming to acquire. This work proposes a novel approach using cross-architecture pseudo-labeling with contrastive learning for semi-supervised action recognition. Our framework leverages both labeled and unlabeled data to robustly learn action representations in videos, combining pseudo-labeling with contrastive learning for effective learning from both types of samples. We introduce a novel cross-architecture approach in which 3D Convolutional Neural Networks (3D CNNs) and Video Transformers (ViTs) are utilized to capture different aspects of action representations; hence the name ActNetFormer. The 3D CNNs excel at capturing spatial features and local dependencies in the temporal domain, while ViTs excel at capturing long-range dependencies across frames. By integrating these complementary architectures within the ActNetFormer framework, our approach can effectively capture both local and global contextual information of an action. This comprehensive representation learning enables the model to achieve better performance in semi-supervised action recognition tasks by leveraging the strengths of each architecture. Experimental results on standard action recognition datasets demonstrate that our approach outperforms existing methods, achieving state-of-the-art performance with only a fraction of labeled data.
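To make the cross-architecture pseudo-labeling idea concrete, below is a minimal PyTorch sketch, assuming each model maps a batch of clips to class logits. The names `cross_architecture_pseudo_labels`, `primary_model`, `aux_model`, and the confidence threshold are illustrative assumptions, not the actual API of this repository.

```python
import torch
import torch.nn.functional as F

def cross_architecture_pseudo_labels(primary_model, aux_model,
                                     unlabeled_clips, threshold=0.8):
    # Each architecture pseudo-labels the unlabeled clips for the OTHER one:
    # the 3D CNN teaches the video transformer and vice versa.
    with torch.no_grad():
        p_cnn = F.softmax(primary_model(unlabeled_clips), dim=1)  # (B, classes)
        p_vit = F.softmax(aux_model(unlabeled_clips), dim=1)

    conf_cnn, labels_for_vit = p_cnn.max(dim=1)   # CNN's confident guesses
    conf_vit, labels_for_cnn = p_vit.max(dim=1)   # ViT's confident guesses

    # Keep a pseudo-label only when the teaching model is confident enough.
    return (labels_for_cnn, conf_vit >= threshold), \
           (labels_for_vit, conf_cnn >= threshold)
```

Each model would then be trained with cross-entropy on the other architecture's confidence-masked pseudo-labels, in addition to the usual supervised loss on the labeled clips.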
Components | Details/Link | Structure |
---|---|---|
3D CNN as Primary model | 3D-ResNet50, Paper | Input: 3 × 8 × 224 × 224; Layers: 6 |
Video Transformer (ViT) as Auxiliary model | ViT-S: the Vision Transformer (ViT), extended to video via TimeSformer, is employed as the auxiliary model in ActNetFormer, with a configuration inspired by DeiT-S | Dimensions: 384; Heads: 6; Layers: 12 |
Spatial Data Augmentation | We utilize the spatial augmentation techniques from SlowFast and this | NA |
Temporal Data Augmentation | We incorporate variations in frame rate for temporal data augmentation, inspired by prior research in TCL and VTHCL | NA |
Contrastive Learning | We use weakly augmented samples from each architecture for cross-architecture contrastive learning, inspired by this; the contrastive loss is adopted from SimCLR and TCL (a minimal sketch follows this table) | NA |
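For concreteness, here is a minimal sketch of a SimCLR-style NT-Xent contrastive loss applied across the two architectures' embeddings of the same weakly augmented clips. The function name, the `(N, D)` projection-head layout, and the default temperature are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn.functional as F

def nt_xent_cross_arch(z_cnn, z_vit, temperature=0.5):
    # z_cnn, z_vit: (N, D) projection-head outputs for the SAME N clips,
    # one view per architecture; z_cnn[i] and z_vit[i] form the positive pair.
    z_cnn = F.normalize(z_cnn, dim=1)
    z_vit = F.normalize(z_vit, dim=1)
    z = torch.cat([z_cnn, z_vit], dim=0)              # (2N, D)
    sim = z @ z.t() / temperature                     # scaled cosine similarity
    n = z_cnn.size(0)
    # Remove self-similarity so a sample never counts as its own positive.
    eye = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float('-inf'))
    # The positive for row i is row i+N (and vice versa); every other clip
    # in the batch acts as a negative, as in SimCLR's NT-Xent loss.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

Minimizing this loss pulls the 3D CNN's and the video transformer's embeddings of the same clip together while pushing apart embeddings of different clips in the batch.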
State-of-the-art methods compared in our experiments:

Venue | Method | Paper | Code |
---|---|---|---|
NeurIPS 2020 | FixMatch | Simplifying semi-supervised learning with consistency and confidence | Code |
CVPR 2021 | TCL | Semi-Supervised Action Recognition with Temporal Contrastive Learning | Code |
ICCV 2021 | MvPL | Multiview Pseudo-Labeling for Semi-supervised Learning from Video | Code |
IEEE TCSVT 2022 | TACL | Semi-Supervised Action Recognition From Temporal Augmentation Using Curriculum Learning | Code |
CVPR 2022 | LTG | Learning from Temporal Gradient for Semi-supervised Action Recognition | Code |
CVPR 2022 | CMPL | Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition | Code |
IEEE TIP 2023 | NCCL | Neighbor-guided consistent and contrastive learning for semi-supervised action recognition | Code |
Elsevier NN 2023 | DANet | DANet: Semi-supervised differentiated auxiliaries guided network for video action recognition | Code |
Datasets used in our experiments:

Source | Dataset | Paper | Link |
---|---|---|---|
arXiv 2017 | Kinetics-400 | The Kinetics Human Action Video Dataset | Link |
CRCV-TR-12-01 2012 | UCF-101 | UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild | Link |
For more details and experimental results, please check out the paper!!
If you find our work (i.e. the code, the theory/concept, or the dataset) useful for your research or development activities, please consider citing our work as follows:
@article{dass2024actnetformer,
title={ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos},
author={Dass, Sharana Dharshikgan Suresh and Barua, Hrishav Bakul and Krishnasamy, Ganesh and Paramesran, Raveendran and Phan, Raphael C-W},
journal={arXiv preprint arXiv:2404.06243},
year={2024}
}
----------------------------------------------------------------------------------------
Copyright 2024 | All the authors and contributors of this repository as mentioned above.
----------------------------------------------------------------------------------------
Please check the License Agreement.