We will release the code soon!!

ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos


This repository contains all the datasets and code bases (experiments and evaluations) used to develop and evaluate our newly proposed ActNetFormer framework for video action recognition.

The official repository of the paper (with supplementary material): ActNetFormer!!

About the project

This project was carried out at the Monash University, Malaysia campus.

Project Members -
Sharana Dharshikgan Suresh Dass (Monash University, Malaysia),
Hrishav Bakul Barua (Monash University and TCS Research, Kolkata, India),
Ganesh Krishnasamy (Monash University, Malaysia),
Raveendran Paramesran (Monash University, Malaysia), and
Raphaël C.-W. Phan (Monash University, Malaysia).

Funding details

This work is supported by the Global Research Excellence Scholarship, Monash University, Malaysia. This research is also supported, in part, by the prestigious Global Excellence and Mobility Scholarship (GEMS), Monash University (Malaysia & Australia).

Overview

Human action or activity recognition in videos is a fundamental task in computer vision, with applications in surveillance and monitoring, self-driving cars, sports analytics, human-robot interaction, and many more. Traditional supervised methods require large annotated datasets for training, which are expensive and time-consuming to acquire.

This work proposes a novel approach using cross-architecture pseudo-labeling with contrastive learning for semi-supervised action recognition. Our framework leverages both labeled and unlabeled data to robustly learn action representations in videos, combining pseudo-labeling with contrastive learning for effective learning from both types of samples. We introduce a novel cross-architecture approach in which 3D Convolutional Neural Networks (3D CNNs) and video transformers (ViTs) are utilized to capture different aspects of action representations; hence we call it ActNetFormer. The 3D CNNs excel at capturing spatial features and local dependencies in the temporal domain, while ViTs excel at capturing long-range dependencies across frames.

By integrating these complementary architectures within the ActNetFormer framework, our approach can effectively capture both local and global contextual information of an action. This comprehensive representation learning enables the model to achieve better performance in semi-supervised action recognition tasks by leveraging the strengths of each of these architectures. Experimental results on standard action recognition datasets demonstrate that our approach performs better than existing methods, achieving state-of-the-art performance with only a fraction of the labeled data.
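To make the cross-architecture idea concrete, below is a minimal, illustrative PyTorch sketch of one pseudo-labeling step on unlabeled clips, assuming two classifiers that map a clip tensor of shape (batch, channels, frames, height, width) to class logits. The function names, the confidence threshold, the FixMatch-style masking, and the toy stand-in models are our illustrative assumptions, not the official implementation.

```python
# A minimal sketch (not the official implementation) of one cross-architecture
# pseudo-labeling step on unlabeled clips.
import torch
import torch.nn.functional as F

def cross_pseudo_label_step(primary_3dcnn, aux_transformer,
                            weak_clip, strong_clip, conf_threshold=0.95):
    """Each model is supervised by the other's confident pseudo-labels."""
    with torch.no_grad():
        # Pseudo-labels come from predictions on weakly augmented clips.
        probs_primary = F.softmax(primary_3dcnn(weak_clip), dim=1)
        probs_aux = F.softmax(aux_transformer(weak_clip), dim=1)
        conf_p, label_p = probs_primary.max(dim=1)
        conf_a, label_a = probs_aux.max(dim=1)

    # Each model predicts on the strongly augmented version of the same clips.
    logits_primary = primary_3dcnn(strong_clip)
    logits_aux = aux_transformer(strong_clip)

    # Cross supervision: the 3D CNN learns from the transformer's confident
    # labels and vice versa; low-confidence samples are masked out.
    mask_a = (conf_a >= conf_threshold).float()
    mask_p = (conf_p >= conf_threshold).float()
    loss_primary = (F.cross_entropy(logits_primary, label_a, reduction="none") * mask_a).mean()
    loss_aux = (F.cross_entropy(logits_aux, label_p, reduction="none") * mask_p).mean()
    return loss_primary + loss_aux

# Toy usage with stand-in models (the real models are a 3D-ResNet50 and a video ViT).
if __name__ == "__main__":
    num_classes = 101
    toy_model = lambda: torch.nn.Sequential(
        torch.nn.AdaptiveAvgPool3d(1), torch.nn.Flatten(), torch.nn.Linear(3, num_classes))
    clips = torch.randn(2, 3, 8, 224, 224)  # (B, C, T, H, W)
    loss = cross_pseudo_label_step(toy_model(), toy_model(), clips, clips)
    print(loss.item())
```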

Comparison of performance between different architectural models


The components of our ActNetFormer architecture

| Components | Details/Link | Structure |
| --- | --- | --- |
| 3D CNN as Primary model | 3D-ResNet50, Paper | Input: 3 × 8 × 224 × 224; Layers: 6 |
| Video Transformer (ViT) as Auxiliary model | ViT-S: we employ the Vision Transformer (ViT) extended with the video TimeSformer as the auxiliary model in our ActNetFormer, inspired by DeiT-S | Dimensions: 384; Heads: 6; Layers: 12 |
| Spatial Data Augmentation | We utilize the techniques in SlowFast and this | NA |
| Temporal Data Augmentation | We incorporate variations in frame rates for temporal data augmentation, inspired by prior research in TCL and VTHCL | NA |
| Contrastive Learning | We use weakly augmented samples from each architecture for cross-architecture contrastive learning, inspired by this; the contrastive loss is adopted from SimCLR and TCL (a minimal loss sketch follows this table) | NA |
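For the contrastive component, the table above notes that the loss is adopted from SimCLR and TCL. The following is a hedged sketch of a SimCLR-style NT-Xent loss applied to projection embeddings from the two architectures; the function name, the 128-d projection size, and the temperature are illustrative assumptions rather than values from the paper.

```python
# Hedged sketch of a SimCLR-style NT-Xent loss, where z1 and z2 are projection
# embeddings of the same batch of clips produced by the 3D CNN and the video
# transformer, respectively.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """Contrast matched pairs (z1[i], z2[i]) against every other sample in the batch."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)              # (2B, D)
    sim = z @ z.t() / temperature               # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))           # a view is never its own positive
    n = z.shape[0]
    # The positive for sample i is its counterpart from the other architecture.
    targets = (torch.arange(n, device=z.device) + n // 2) % n
    return F.cross_entropy(sim, targets)

# Example: 4 clips with 128-d projections from each architecture.
loss = nt_xent_loss(torch.randn(4, 128), torch.randn(4, 128))
```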

Our work utilizes the following:

State-of-the-art learning models for action/activity recognition

NeurIPS 2020 | FixMatch - Simplifying semi-supervised learning with consistency and confidence | Code

CVPR 2021 | TCL - Semi-Supervised Action Recognition with Temporal Contrastive Learning | Code

ICCV 2021 | MvPL - Multiview Pseudo-Labeling for Semi-supervised Learning from Video | Code

IEEE TCSVT 2022 | TACL - Semi-Supervised Action Recognition From Temporal Augmentation Using Curriculum Learning | Code

CVPR 2022 | LTG - Learning from Temporal Gradient for Semi-supervised Action Recognition | Code

CVPR 2022 | CMPL - Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition | Code

IEEE TIP 2023 | NCCL - Neighbor-guided consistent and contrastive learning for semi-supervised action recognition | Code

Elsevier NN 2023 | DANet - DANet: Semi-supervised differentiated auxiliaries guided network for video action recognition | Code

Action recognition datasets

Kinetics-400 2017 | Kinetics-400 Dataset: The Kinetics Human Action Video Dataset | Link

CRCV-TR-12-01 2012 | UCF-101 - UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild | Link

Experiments and Results

For more details and experimental results, please check out the paper!!

Citation

If you find our work (i.e. the code, the theory/concept, or the dataset) useful for your research or development activities, please consider citing our work as follows:

@article{dass2024actnetformer,
  title={ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos},
  author={Dass, Sharana Dharshikgan Suresh and Barua, Hrishav Bakul and Krishnasamy, Ganesh and Paramesran, Raveendran and Phan, Raphael C-W},
  journal={arXiv preprint arXiv:2404.06243},
  year={2024}
}

License and Copyright

----------------------------------------------------------------------------------------
Copyright 2024 | All the authors and contributors of this repository as mentioned above.
----------------------------------------------------------------------------------------

Please check the License Agreement.
