Enhancing Human Pose Estimation in Videos: A Comparative Study of GRU and Transformer Architectures - ANU ENGN8501/COMP8539 Project
This project is an extension of the original VIBE (Video Inference for Human Body Pose and Shape Estimation) paper by Kocabas et al. (2020). Our implementation seeks to improve upon the VIBE architecture by integrating transformer networks for temporal encoding and motion discrimination, and by utilizing an expanded training dataset. The goal is to refine 3D human pose estimation in videos, leveraging the power of transformers to capture complex temporal dependencies.
This project is a collaborative effort by the following contributors:
- Rohit Mistry (u7034818@anu.edu.au)
- Lewis Aston (u7056676@anu.edu.au)
- Dian Jiao (u6504904@anu.edu.au)
Original Paper: Kocabas, M., Athanasiou, N., & Black, M. J. (2020). VIBE: Video Inference for Human Body Pose and Shape Estimation. CVPR.
Check the YouTube videos below for more details.
| Original Paper Video | Qualitative Results |
|---|---|
| ![]() | ![]() |
VIBE: Video Inference for Human Body Pose and Shape Estimation,
Muhammed Kocabas, Nikos Athanasiou, Michael J. Black,
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020
This project builds upon the foundational work of the VIBE model with the following features and contributions:
- **Transformer Integration:** Replaced the GRU-based temporal encoder and motion discriminator with transformer models to explore their efficacy in capturing temporal dependencies in video data.
- **Dataset Expansion:** Utilized an updated and larger AMASS dataset for the motion discriminator, aiming to expose the model to a wider variety of human poses and improve its predictive performance.
- **Model Comparison:** Conducted a comprehensive comparison between the original VIBE model and our modified versions, providing insights into the impact of architectural changes on performance.
- **Performance Analysis:** Evaluated the models using standard metrics on the 3DPW test set, offering a detailed analysis of the strengths and limitations of each approach.
- **Open Source:** The entire project is open source, including the modified codebase and environment setup, to facilitate replication and further research by the community.
These contributions are aimed at advancing the field of video-based 3D human pose estimation and providing a platform for future explorations into the use of transformer models in this domain.
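As a rough illustration of the first contribution, a GRU temporal encoder can be swapped for a transformer encoder along the following lines. This is a minimal sketch, not the project's actual code: the class name, feature dimension, and hyperparameters below are illustrative assumptions.

```python
# Sketch: transformer-based temporal encoder over per-frame CNN features.
# Names and hyperparameters are illustrative, not the project's identifiers.
import torch
import torch.nn as nn

class TransformerTemporalEncoder(nn.Module):
    def __init__(self, feat_dim=2048, n_heads=8, n_layers=2, dropout=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, dropout=dropout,
            batch_first=True)  # expect (batch, seq, feature) layout
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):
        # x: per-frame features, shape (batch, seq_len, feat_dim).
        # Self-attention lets every frame attend to every other frame,
        # unlike a GRU's strictly sequential recurrence.
        return self.encoder(x)

feats = torch.randn(4, 16, 2048)            # 4 clips of 16 frames each
out = TransformerTemporalEncoder()(feats)
print(out.shape)                            # torch.Size([4, 16, 2048])
```

The output keeps the input's sequence shape, so such a module can be dropped in wherever the GRU encoder's per-frame features were consumed downstream.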
VIBE has been implemented and tested on Ubuntu 18.04 with Python >= 3.7. It supports both GPU and CPU inference.
Clone the repo:
```bash
git clone https://github.com/sriparashiva/VIBE-anu_8501.git
cd VIBE-anu_8501
```

Install the dependencies with either pip or conda:

```bash
# pip
source scripts/install_pip.sh

# conda
source scripts/install_conda.sh
```

Alternatively, once Anaconda is installed, you can create a conda environment using the environment.yml file provided in the repository. This file contains all the necessary packages and their specific versions required for the project.
- Ensure that you are in the root directory of the cloned repository, where the environment.yml file is located.
- Create the conda environment by running:

  ```bash
  conda env create -f environment.yml
  ```

- Activate the newly created environment, replacing your-env-name with the environment name specified in the environment.yml file:

  ```bash
  conda activate your-env-name
  ```

- Verify that the environment was set up correctly and all packages were installed:

  ```bash
  conda list
  ```

  You should see a list of all packages specified in the environment.yml file.
Run the commands below to start training:

```bash
source scripts/prepare_training_data.sh
python train.py --cfg configs/config.yaml
```

Note that the training datasets should be downloaded and prepared before running the data processing script.
Please see doc/train.md for details on how to prepare them.
Here we compare our modified models against the VIBE baseline on the 3DPW test set. The evaluation metrics are Mean Per Joint Position Error (MPJPE), Procrustes-Aligned MPJPE (PA-MPJPE), and Per Vertex Error (PVE), all in mm, along with acceleration error (ACCEL) in mm/s².
| Configuration | MPJPE (mm) | PA-MPJPE (mm) | PVE (mm) | ACCEL (mm/s²) |
|---|---|---|---|---|
| Baseline Model, Original Paper's Dataset | 82.90 | 51.90 | 99.10 | 23.40 |
| Baseline Model, Baseline Dataset | 91.99 | 56.42 | 108.96 | 27.97 |
| Baseline Model, Updated Dataset | 91.29 | 55.55 | 108.08 | 27.50 |
| Modified Model V1, Baseline Dataset | 103.31 | 59.10 | 122.88 | 19.58 |
| Modified Model V1, Updated Dataset | 100.95 | 62.33 | 123.00 | 18.88 |
| Modified Model V2, Baseline Dataset | 101.86 | 60.13 | 120.30 | 19.74 |
| Modified Model V2, Updated Dataset | 104.70 | 59.43 | 125.09 | 16.89 |
See doc/eval.md to reproduce the results in this table or
evaluate a pretrained model.
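For intuition on the PA-MPJPE column: the metric rigidly aligns the predicted joints to the ground truth (optimal scale, rotation, and translation via orthogonal Procrustes) before measuring per-joint error, so it isolates pose accuracy from global pose errors. A minimal NumPy sketch, not the project's actual evaluation code:

```python
# Sketch of PA-MPJPE: per-joint error after Procrustes alignment.
# NumPy-only illustration; the repository's eval code may differ in detail.
import numpy as np

def pa_mpjpe(pred, gt):
    """pred, gt: (J, 3) joint positions in mm. Returns mean error in mm."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g          # center both joint sets
    H = p.T @ g                            # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T                         # optimal rotation
    if np.linalg.det(R) < 0:               # guard against reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()       # optimal isotropic scale
    aligned = scale * p @ R.T + mu_g       # aligned prediction
    return np.linalg.norm(aligned - gt, axis=1).mean()

gt = np.random.rand(14, 3) * 100           # 14 joints, mm scale
pred = gt + 5.0                            # prediction off by a translation
print(round(pa_mpjpe(pred, gt), 6))        # 0.0: alignment removes the offset
```

Because the alignment absorbs any global rotation, translation, and scale error, PA-MPJPE is always at most the corresponding MPJPE, which matches the gap between the two columns in the table above.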
```bibtex
@inproceedings{kocabas2019vibe,
  title = {VIBE: Video Inference for Human Body Pose and Shape Estimation},
  author = {Kocabas, Muhammed and Athanasiou, Nikos and Black, Michael J.},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
```

The codebase is forked from the codebase provided by the original authors of the VIBE paper. Our team's main contributions and modifications are marked with comments in the .py files. The majority of the contributions are in the following files: