Carnegie Mellon University, Meta AI Research
Shikhar Bahl*, Russell Mendonca*, Lili Chen, Unnat Jain, Deepak Pathak
[Paper] [Project] [Demo] [Video] [Dataset] [BibTeX]
Given a scene, our model (VRB) learns actionable representations for robot learning. VRB predicts contact points and a post-contact trajectory, learned from human videos. We seamlessly integrate VRB with robotic manipulation across 4 real-world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
Our model takes a human-agnostic frame as input. The contact head outputs a contact heatmap, and the trajectory transformer predicts post-contact wrist waypoints. These outputs can be used directly at inference time, combined with sparse 3D information (such as depth) and robot kinematics.
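To make that last step concrete, here is a minimal sketch of one way to lift the 2D predictions into 3D targets for a robot, using an aligned depth image and pinhole camera intrinsics. All names (`contact_heatmap`, `trajectory_px`, `fx`, `fy`, `cx`, `cy`) and the depth/intrinsics inputs are illustrative assumptions, not the repository's API.

```python
# Sketch (not part of the released code): lift VRB's 2D outputs to 3D
# camera-frame points using a depth image and pinhole intrinsics.
import numpy as np

def deproject(px, py, depth_m, fx, fy, cx, cy):
    """Back-project a pixel with known depth (meters) into the camera frame."""
    z = depth_m
    x = (px - cx) * z / fx
    y = (py - cy) * z / fy
    return np.array([x, y, z])

# contact_heatmap: (H, W) array from the contact head
# depth:           (H, W) aligned depth image in meters
# trajectory_px:   (T, 2) predicted wrist waypoints as (u, v) pixel coordinates
def to_3d_targets(contact_heatmap, depth, trajectory_px, fx, fy, cx, cy):
    # Pick the most likely contact pixel and look up its depth.
    v, u = np.unravel_index(np.argmax(contact_heatmap), contact_heatmap.shape)
    contact_3d = deproject(u, v, depth[v, u], fx, fy, cx, cy)
    # Lift each predicted waypoint using the depth at its pixel location.
    waypoints_3d = [
        deproject(u_w, v_w, depth[int(v_w), int(u_w)], fx, fy, cx, cy)
        for u_w, v_w in trajectory_px
    ]
    # A camera-to-robot-base transform is still needed before these points
    # can be passed to the robot's kinematics / controller.
    return contact_3d, np.stack(waypoints_3d)
```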
This code uses `python>=3.9` and `pytorch>=2.0`, which can be installed by running the following:
First create the conda environment:
conda env create -f environment.yml
Then activate the environment and install the required libraries:
conda activate vrb
pip install -r requirements.txt
pip install -U git+https://github.com/luca-medeiros/lang-segment-anything.git
Either download the model weights and place them in the `models` folder, or run:
mkdir models
bash download_model.sh
To run the model:
python demo.py --image ./kitchen.jpeg --model_path ./models/model_checkpoint_1249.pth.tar
The output is a visualization of the predicted contact points and post-contact trajectory overlaid on the input image.
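To run the demo on several of your own scene images, a small wrapper like the one below loops over a folder using the same CLI flags as above; the `./my_scenes` folder name is an assumption for illustration.

```python
# Run demo.py on every JPEG in a folder (folder name is an example; the CLI
# flags are the same ones shown above).
import subprocess
from pathlib import Path

MODEL_PATH = "./models/model_checkpoint_1249.pth.tar"

for image_path in sorted(Path("./my_scenes").glob("*.jpeg")):
    subprocess.run(
        ["python", "demo.py", "--image", str(image_path), "--model_path", MODEL_PATH],
        check=True,  # stop if the demo fails on an image
    )
```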
- Paper
- Project Website
- Project Video
- Twitter Thread
- Learning Manipulation from Watching Human Videos
- Learning Dexterous Policies from Human Videos
If you find our model useful for your research, please cite the following:
@inproceedings{bahl2023affordances,
title={Affordances from Human Videos as a Versatile Representation for Robotics},
author={Bahl, Shikhar and Mendonca, Russell and Chen, Lili and Jain, Unnat and Pathak, Deepak},
booktitle={CVPR},
year={2023}
}
We thank Shivam Duggal, Yufei Ye and Homanga Bharadhwaj for fruitful discussions and are grateful to Shagun Uppal, Ananye Agarwal, Murtaza Dalal and Jason Zhang for comments on early drafts of this paper. We would also like to thank the authors of HOI-Forecast [1], as the training code for VRB is adapted from their codebase. RM, LC, and DP are supported by NSF IIS-2024594, ONR MURI N00014-22-1-2773 and ONR N00014-22-1-2096.
[1] Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos. Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, Xiaolong Wang. CVPR 2022.