VIMA: General Robot Manipulation with Multimodal Prompts

[Website] [arXiv] [PDF] [Pretrained Models] [VIMA-Bench] [Training Data] [Model Card]

Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. However, different robotics tasks are still tackled by specialized models. This work shows that we can express a wide spectrum of robot manipulation tasks with multimodal prompts that interleave textual and visual tokens. We introduce VIMA (VisuoMotor Attention model, pronounced "v-eye-ma"), a novel scalable multi-task robot learner with a uniform sequence IO interface achieved through multimodal prompts.

The architecture follows the encoder-decoder transformer design that has proven effective and scalable in NLP. VIMA encodes an input sequence of interleaved textual and visual prompt tokens with a pretrained language model, and decodes robot control actions autoregressively for each environment interaction step. The transformer decoder is conditioned on the prompt via cross-attention layers that alternate with the usual causal self-attention. Instead of operating on raw pixels, VIMA adopts an object-centric approach. We parse all images in the prompt or observation into objects with off-the-shelf detectors, and flatten them into sequences of object tokens. Together, these design choices yield a conceptually simple architecture with strong model and data scaling properties.
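To make the idea of an interleaved multimodal prompt more concrete, here is a minimal, self-contained sketch of how words and detected objects could be flattened into one token sequence. It is not the VIMA implementation: TextToken, ObjectToken, build_prompt_sequence, the placeholder syntax, and the bounding-box tuples are all hypothetical names and values chosen for illustration, and the real model embeds tokens with learned encoders rather than plain Python objects.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

BBox = Tuple[int, int, int, int]  # hypothetical (x, y, width, height) box


@dataclass(frozen=True)
class TextToken:
    word: str


@dataclass(frozen=True)
class ObjectToken:
    image: str  # name of the prompt image the object was detected in
    bbox: BBox


Token = Union[TextToken, ObjectToken]


def build_prompt_sequence(
    template: str, detections: Dict[str, List[BBox]]
) -> List[Token]:
    """Interleave word tokens and object tokens in prompt order."""
    sequence: List[Token] = []
    for piece in template.split():
        if piece.startswith("{") and piece.endswith("}"):
            image = piece[1:-1]
            # Flatten every object detected in this image into object tokens.
            sequence.extend(ObjectToken(image, bbox) for bbox in detections[image])
        else:
            sequence.append(TextToken(piece))
    return sequence


# Stand-in detector output; a real pipeline would run an object detector here.
detections = {
    "dragged_obj": [(12, 30, 24, 24)],
    "base_obj": [(80, 40, 48, 48)],
}
sequence = build_prompt_sequence(
    "Put the {dragged_obj} into the {base_obj}", detections
)
kinds = ["obj" if isinstance(tok, ObjectToken) else "text" for tok in sequence]
print(kinds)
```

The point of the sketch is only the ordering: text and object tokens share a single sequence, so one encoder can consume any task expressed this way.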

In this repo, we provide VIMA model code, pretrained checkpoints covering a spectrum of model sizes, and demo and eval scripts. This codebase is released under the MIT License.

Installation

VIMA requires Python ≥ 3.9. We have tested on Ubuntu 20.04. Installing the VIMA codebase is as simple as:

pip install git+https://github.com/vimalabs/VIMA
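After installing, a quick sanity check can confirm that the interpreter meets the Python ≥ 3.9 requirement and that the package is visible. The sketch below assumes the package is importable under the name vima (inferred from the repository layout, not confirmed by the text above); the helper names are hypothetical.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in the installation notes


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version is at least 3.9."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def vima_is_installed() -> bool:
    """Return True if a top-level module named 'vima' can be found.

    The module name is an assumption; adjust it if the package is
    exposed under a different name.
    """
    return importlib.util.find_spec("vima") is not None


print("Python version supported:", python_is_supported())
print("vima importable:", vima_is_installed())
```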

Pretrained Models

We host pretrained models spanning a range of model capacities on HuggingFace. Download links are listed below.

200M	92M	43M	20M	9M	4M	2M

Demo

To run the live demonstration, first follow the instructions to install VIMA-Bench. Then we can run a live demo through

python3 scripts/example.py --ckpt={ckpt_path} --device={device} --partition={eval_level} --task={task}

Here, eval_level selects one of four evaluation levels: placement_generalization, combinatorial_generalization, novel_object_generalization, or novel_task_generalization. It is passed through the --partition flag. The task argument names a specific meta-task. Please refer to the task suite and benchmark for more details.
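When running the demo across several evaluation levels, it can help to assemble the command programmatically and reject a mistyped level before launching. The following is a minimal sketch, not part of the repository: build_demo_command is a hypothetical helper, and the checkpoint path and task name are placeholder values that should be replaced with a real checkpoint file and a task from the VIMA-Bench task suite.

```python
import shlex
from typing import List

# The four evaluation levels accepted by --partition, as listed above.
EVAL_LEVELS = (
    "placement_generalization",
    "combinatorial_generalization",
    "novel_object_generalization",
    "novel_task_generalization",
)


def build_demo_command(
    ckpt_path: str, device: str, partition: str, task: str
) -> List[str]:
    """Return the argv list for scripts/example.py, validating the partition."""
    if partition not in EVAL_LEVELS:
        raise ValueError(
            f"unknown evaluation level {partition!r}; expected one of {EVAL_LEVELS}"
        )
    return [
        "python3",
        "scripts/example.py",
        f"--ckpt={ckpt_path}",
        f"--device={device}",
        f"--partition={partition}",
        f"--task={task}",
    ]


# Placeholder arguments (assumptions): substitute your own values.
cmd = build_demo_command(
    ckpt_path="checkpoints/200M.ckpt",  # hypothetical local path
    device="cpu",
    partition="placement_generalization",
    task="visual_manipulation",  # assumed task name; check the task suite
)
print(shlex.join(cmd))
# To actually launch the demo: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string avoids shell quoting problems if the checkpoint path contains spaces.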

After running the above command, we should see a PyBullet GUI pop up, alongside a small window showing the multimodal prompt. Then a robot arm should move to complete the corresponding task. Note that this demo may not work on headless machines since the PyBullet GUI requires a display.

Paper and Citation

Our paper is posted on arXiv. If you find our work useful, please consider citing us!

@article{jiang2022vima,
  title   = {VIMA: General Robot Manipulation with Multimodal Prompts},
  author  = {Yunfan Jiang and Agrim Gupta and Zichen Zhang and Guanzhi Wang and Yongqiang Dou and Yanjun Chen and Li Fei-Fei and Anima Anandkumar and Yuke Zhu and Linxi Fan},
  year    = {2022},
  journal = {arXiv preprint arXiv:2210.03094}
}
