We will release the code soon!!

ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos [ICPR 2024]


This page contains all the datasets and code bases (experiments and evaluations) used to develop and evaluate our newly proposed ActNetFormer framework for video action recognition.

The official repository of the paper, with supplementary material: ActNetFormer!!

About the project

This project was carried out at the Monash University, Malaysia campus.

Project Members -
Sharana Dharshikgan Suresh Dass (Monash University, Malaysia)
Hrishav Bakul Barua (Monash University and TCS Research, Kolkata, India)
Ganesh Krishnasamy (Monash University, Malaysia)
Raveendran Paramesran (Monash University, Malaysia)
Raphaël C.-W. Phan (Monash University, Malaysia)

This work has been accepted at ICPR 2024.

Funding details

This work is supported by the Global Research Excellence Scholarship, Monash University, Malaysia. This research is also supported, in part, by the prestigious Global Excellence and Mobility Scholarship (GEMS), Monash University (Malaysia & Melbourne, Australia).

Overview

Human action or activity recognition in videos is a fundamental task in computer vision, with applications in surveillance and monitoring, self-driving cars, sports analytics, human-robot interaction, and many more. Traditional supervised methods require large annotated datasets for training, which are expensive and time-consuming to acquire.

This work proposes a novel approach that combines cross-architecture pseudo-labeling with contrastive learning for semi-supervised action recognition, leveraging both labeled and unlabeled data to robustly learn action representations in videos. We introduce a cross-architecture approach in which 3D Convolutional Neural Networks (3D CNNs) and Video Transformers (ViTs) capture different aspects of action representations; hence the name ActNetFormer. The 3D CNNs excel at capturing spatial features and local dependencies in the temporal domain, while ViTs excel at capturing long-range dependencies across frames. By integrating these complementary architectures within the ActNetFormer framework, our approach captures both the local and the global contextual information of an action, enabling the model to leverage the strengths of each architecture.

Experimental results on standard action recognition datasets demonstrate that our approach outperforms existing methods, achieving state-of-the-art performance with only a fraction of the labeled data.
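At a high level, each architecture pseudo-labels the unlabeled clips for the other: confident predictions from the 3D CNN on a weakly augmented clip supervise the ViT on a strongly augmented view of the same clip, and vice versa. The following PyTorch code is a minimal sketch of this idea, not the official implementation (to be released in this repository); `cnn3d`, `vit`, the batch layout, and the hyperparameter values are placeholder assumptions.

```python
# Illustrative sketch of cross-architecture pseudo-labeling; NOT the official
# ActNetFormer implementation. `cnn3d`, `vit`, the batch layout, `threshold`,
# and `lambda_u` are assumed placeholders.
import torch
import torch.nn.functional as F

def cross_arch_step(cnn3d, vit, labeled_batch, unlabeled_batch,
                    threshold=0.95, lambda_u=1.0):
    x_l, y_l = labeled_batch            # labeled clips and their action labels
    u_weak, u_strong = unlabeled_batch  # weak/strong augmented views of the same clips

    # Supervised loss: both architectures are trained on the labeled clips.
    loss_sup = F.cross_entropy(cnn3d(x_l), y_l) + F.cross_entropy(vit(x_l), y_l)

    # Each architecture pseudo-labels the weakly augmented view for the other.
    with torch.no_grad():
        conf_cnn, pl_cnn = F.softmax(cnn3d(u_weak), dim=1).max(dim=1)
        conf_vit, pl_vit = F.softmax(vit(u_weak), dim=1).max(dim=1)

    # Confidence filtering: only confident pseudo-labels contribute.
    loss_u = (F.cross_entropy(vit(u_strong), pl_cnn, reduction='none')
              * (conf_cnn >= threshold)).mean()
    loss_u = loss_u + (F.cross_entropy(cnn3d(u_strong), pl_vit, reduction='none')
                       * (conf_vit >= threshold)).mean()

    return loss_sup + lambda_u * loss_u
```

The confidence threshold mirrors the FixMatch-style pseudo-label filtering referenced in the related-work list below.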

Comparison of performance between different architectural models


The components of our ActNetFormer architecture

| Components | Details/Link | Structure |
| --- | --- | --- |
| 3D CNN as primary model | 3D-ResNet50 (Paper) | Input: 3 × 8 × 224 × 224; Layers: 6 |
| Video Transformer (ViT) as auxiliary model | ViT-S: the Vision Transformer (ViT) extended with TimeSformer for video, inspired by DeiT-S | Dimensions: 384; Heads: 6; Layers: 12 |
| Spatial data augmentation | We utilize techniques from SlowFast and this | NA |
| Temporal data augmentation | We incorporate variations in frame rates, inspired by prior research in TCL and VTHCL | NA |
| Contrastive learning | We use weakly augmented samples from each architecture for cross-architecture contrastive learning, inspired by this; the contrastive loss is adopted from SimCLR and TCL (see the sketch after this table) | NA |
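For the contrastive learning component, the sketch below shows a minimal NT-Xent (SimCLR-style) loss applied across the two architectures; `z_cnn` and `z_vit` are assumed to be projection-head embeddings of the same clips from the 3D CNN and the ViT, and this illustrates the adopted loss family rather than the exact ActNetFormer loss.

```python
# Illustrative NT-Xent (SimCLR-style) contrastive loss across the two
# architectures; a sketch, not the exact ActNetFormer loss. `z_cnn` and
# `z_vit` are assumed (N x d) projection-head embeddings of the same N
# clips from the 3D CNN and the ViT, respectively.
import torch
import torch.nn.functional as F

def cross_arch_nt_xent(z_cnn, z_vit, temperature=0.5):
    n = z_cnn.size(0)
    z = F.normalize(torch.cat([z_cnn, z_vit], dim=0), dim=1)  # (2N x d), unit norm
    sim = z @ z.t() / temperature                             # scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))                         # exclude self-pairs

    # Positives: clip i embedded by the CNN pairs with clip i embedded by the ViT.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

Treating each clip's two architecture-specific embeddings as a positive pair pulls the CNN and ViT representations of the same action together while pushing apart embeddings of different clips.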

Our work utilizes the following:

State-of-the-art learning models for action/activity recognition

NeurIPS 2020 | FixMatch - Simplifying Semi-Supervised Learning with Consistency and Confidence | Code

CVPR 2021 | TCL - Semi-Supervised Action Recognition with Temporal Contrastive Learning | Code

ICCV 2021 | MvPL - Multiview Pseudo-Labeling for Semi-supervised Learning from Video | Code

IEEE TCSVT 2022 | TACL - Semi-Supervised Action Recognition From Temporal Augmentation Using Curriculum Learning | Code

CVPR 2022 | LTG - Learning from Temporal Gradient for Semi-supervised Action Recognition | Code

CVPR 2022 | CMPL - Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition | Code

IEEE TIP 2023 | NCCL - Neighbor-Guided Consistent and Contrastive Learning for Semi-Supervised Action Recognition | Code

Elsevier NN 2023 | DANet - Semi-Supervised Differentiated Auxiliaries Guided Network for Video Action Recognition | Code

Action recognition datasets

arXiv 2017 | Kinetics-400 - The Kinetics Human Action Video Dataset | Link

CRCV-TR-12-01 2012 | UCF-101 - A Dataset of 101 Human Action Classes From Videos in The Wild | Link

Experiments and Results

For more details and experimental results, please check out the paper!!

Citation

If you find our work (i.e., the code, the theory/concept, or the dataset) useful for your research or development activities, please consider citing it as follows:

@article{dass2024actnetformer,
  title={ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos},
  author={Dass, Sharana Dharshikgan Suresh and Barua, Hrishav Bakul and Krishnasamy, Ganesh and Paramesran, Raveendran and Phan, Raphael C-W},
  journal={arXiv preprint arXiv:2404.06243},
  year={2024}
}

License and Copyright

----------------------------------------------------------------------------------------
Copyright 2024 | All the authors and contributors of this repository as mentioned above.
----------------------------------------------------------------------------------------

Please check the License Agreement.
