Carnegie Mellon University, Meta AI Research
Shikhar Bahl*, Russell Mendonca*, Lili Chen, Unnat Jain, Deepak Pathak
[Paper] [Project] [Demo] [Video] [Dataset] [BibTeX]
Given a scene, our model (VRB) learns actionable representations for robot learning. VRB predicts contact points and a post-contact trajectory, learned from human videos. We seamlessly integrate VRB with robotic manipulation across 4 real-world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
Our model takes a human-agnostic frame as input. The contact head outputs a contact heatmap, and the trajectory transformer predicts post-contact wrist waypoints. These outputs can be used directly at inference time, combined with sparse 3D information (such as depth) and robot kinematics.
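To make that last step concrete, here is a minimal sketch of one way to lift the 2D predictions into 3D targets for a robot, using an aligned depth image and pinhole camera intrinsics. All names (`contact_heatmap`, `trajectory_px`, `fx`, `fy`, `cx`, `cy`) and the depth/intrinsics inputs are illustrative assumptions, not the repository's API.

```python
# Sketch (not part of the released code): lift VRB's 2D outputs to 3D
# camera-frame points using a depth image and pinhole intrinsics.
import numpy as np

def deproject(px, py, depth_m, fx, fy, cx, cy):
    """Back-project a pixel with known depth (meters) into the camera frame."""
    z = depth_m
    x = (px - cx) * z / fx
    y = (py - cy) * z / fy
    return np.array([x, y, z])

# contact_heatmap: (H, W) array from the contact head
# depth:           (H, W) aligned depth image in meters
# trajectory_px:   (T, 2) predicted wrist waypoints as (u, v) pixel coordinates
def to_3d_targets(contact_heatmap, depth, trajectory_px, fx, fy, cx, cy):
    # Pick the most likely contact pixel and look up its depth.
    v, u = np.unravel_index(np.argmax(contact_heatmap), contact_heatmap.shape)
    contact_3d = deproject(u, v, depth[v, u], fx, fy, cx, cy)
    # Lift each predicted waypoint using the depth at its pixel location.
    waypoints_3d = [
        deproject(u_w, v_w, depth[int(v_w), int(u_w)], fx, fy, cx, cy)
        for u_w, v_w in trajectory_px
    ]
    # A camera-to-robot-base transform is still needed before these points
    # can be passed to the robot's kinematics / controller.
    return contact_3d, np.stack(waypoints_3d)
```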
This code uses `python>=3.9` and `pytorch>=2.0`, which can be installed by running the following:
First create the conda environment:
conda env create -f environment.yml
Then activate the environment and install the required libraries:
conda activate vrb
pip install -r requirements.txt
pip install -U git+https://github.com/luca-medeiros/lang-segment-anything.git
Either download the model weights and place them in the `models` folder, or run:
mkdir models
bash download_model.sh
To run the model:
python demo.py --image ./kitchen.jpeg --model_path ./models/model_checkpoint_1249.pth.tar
The output is a visualization of the predicted contact points and post-contact trajectory overlaid on the input image.
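To run the demo on several of your own scene images, a small wrapper like the one below loops over a folder using the same CLI flags as above; the `./my_scenes` folder name is an assumption for illustration.

```python
# Run demo.py on every JPEG in a folder (folder name is an example; the CLI
# flags are the same ones shown above).
import subprocess
from pathlib import Path

MODEL_PATH = "./models/model_checkpoint_1249.pth.tar"

for image_path in sorted(Path("./my_scenes").glob("*.jpeg")):
    subprocess.run(
        ["python", "demo.py", "--image", str(image_path), "--model_path", MODEL_PATH],
        check=True,  # stop if the demo fails on an image
    )
```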
- Paper
- Project Website
- Project Video
- Twitter Thread
- Learning Manipulation from Watching Human Videos
- Learning Dexterous Policies from Human Videos
If you find our model useful for your research, please cite the following:
@inproceedings{bahl2023affordances,
title={Affordances from Human Videos as a Versatile Representation for Robotics},
author={Bahl, Shikhar and Mendonca, Russell and Chen, Lili and Jain, Unnat and Pathak, Deepak},
booktitle={CVPR},
year={2023}
}
We thank Shivam Duggal, Yufei Ye and Homanga Bharadhwaj for fruitful discussions and are grateful to Shagun Uppal, Ananye Agarwal, Murtaza Dalal and Jason Zhang for comments on early drafts of this paper. We would also like to thank the authors of HOI-Forecast [1], as the training code for VRB is adapted from their codebase. RM, LC, and DP are supported by NSF IIS-2024594, ONR MURI N00014-22-1-2773 and ONR N00014-22-1-2096.
[1] Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos. Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, Xiaolong Wang. CVPR 2022.