# HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation  
### Authors: Yi Li, Yuquan Deng, Jesse Zhang, et al.  
### Date: 10 May 2025  
### Institution: NVIDIA, University of Washington, USC  
### Link: [https://arxiv.org/abs/2502.05485](https://arxiv.org/abs/2502.05485)

---
HAMSTER helps robots understand what to do using cheaper, widely available data (such as videos and simulations) instead of relying only on costly on-robot demonstrations. By separating high-level planning (done by a large VLM) from low-level control (done by a smaller 3D-aware policy), HAMSTER combines strong reasoning with robust execution. It reports improved generalization across tasks, environments, and instructions, which makes scalable deployment of intelligent robots more practical for industry.

### 1. Problem Context & Motivation
- HAMSTER addresses the challenge of enabling robots to generalize across diverse real-world tasks without relying on massive, expensive robot-specific datasets.

- The problem matters because labeled robot data is expensive to collect. VLMs have revolutionized fields like NLP and computer vision and have shown promise in robotics, but wider adoption of this technology, especially for commercial use, requires training to be much cheaper. HAMSTER presents a path toward that.

- Prior models ("monolithic VLAs") map language and vision inputs directly to robot actions but generalize poorly to new tasks or environments. They also require costly on-robot training data: earlier systems such as Pi0 could not leverage cheaper, action-free data like videos.

---

### 2. Prior Work and State-of-the-Art
- **Previous approaches:**  
  - Monolithic VLA models like OpenVLA and RT-2 (Google DeepMind) that predict actions directly from images and instructions.
  - 3D policy networks like RVT and 3D Diffuser Actor (3D-DA) for dexterous control, trained on small in-domain datasets.
  - RT-Trajectory and RoboPoint for task specification via waypoints or affordances.

- **Strengths:**  
  - Monolithic VLAs showed promise in integrating vision and language.
  - 3D policies were good at precise manipulation once trained.

- **Weaknesses:**  
  - Poor generalization to new tasks, domains, or environments.
  - Dependency on expensive robot data.
  - Inability to efficiently use large off-domain (cheap) data.

---

### 3. Summary of the Paper’s Contributions
- **Main idea:**  
  Introduces **HAMSTER**, a hierarchical architecture that separates high-level semantic planning from low-level control.

- **Key innovations:**  
  - Uses VLMs to predict **2D paths** instead of direct actions.
  - Low-level policy interprets these paths to execute 3D robot motions.
  - Enables use of **off-domain data** like videos and simulations for training.
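
The split between planner and controller can be sketched as a small interface. This is a minimal illustration, not the authors' code: `PathPoint`, `run_episode`, `stub_planner`, and `stub_controller` are hypothetical names, the path format is an assumption, and querying the planner once per episode is a simplification made here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class PathPoint:
    """One waypoint of the 2D path (assumed format: normalized image coordinates)."""
    x: float            # horizontal position in [0, 1]
    y: float            # vertical position in [0, 1]
    gripper_open: bool  # desired gripper state at this waypoint


# Hypothetical low-level command: (x, y, gripper). A real policy would output 3D actions.
Action = Tuple[float, float, float]


def run_episode(image, instruction: str,
                planner: Callable[[object, str], List[PathPoint]],
                controller: Callable[[object, List[PathPoint], int, int], Action],
                steps: int) -> List[Action]:
    """High-level planner runs once; low-level controller runs at every step."""
    path = planner(image, instruction)
    return [controller(image, path, t, steps) for t in range(steps)]


def stub_planner(image, instruction):
    # Stand-in for the fine-tuned VLM: always returns a fixed two-point path.
    return [PathPoint(0.2, 0.8, True), PathPoint(0.7, 0.3, False)]


def stub_controller(image, path, step, steps):
    # Stand-in for the 3D policy: follow the waypoint matching episode progress.
    idx = min(step * len(path) // steps, len(path) - 1)
    p = path[idx]
    return (p.x, p.y, 1.0 if p.gripper_open else 0.0)


actions = run_episode(None, "pick up the cup", stub_planner, stub_controller, steps=4)
```

Because the only thing passed between the two stages is the path, either stage could in principle be swapped or retrained without touching the other.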


---

### 4. Technical Methods (with Plain Language Explanation)
- **Core architecture:**  
  - **High-level VLM (e.g., VILA-1.5-13B)** takes an image and a language instruction and outputs a 2D path in image space that traces the intended end-effector trajectory and gripper state.
  - **Low-level controller** (e.g., 3D-DA or RVT-2) receives the 2D path, depth observations, and proprioception data to compute motor actions.

- **Key techniques:**  
  - Finetuning VLMs using **off-domain datasets** (action-free videos, simulations, robot demos).
  - Representing trajectories as simplified 2D paths (via Ramer–Douglas–Peucker).
  - Overlaying paths on camera input to guide policies without altering architecture.

- **Plain terms:**  
  - The VLM doesn’t "control the robot" directly. Instead, it shows "where to go," and a smaller, faster policy figures out *how* to get there precisely.
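
Two of the techniques above, path simplification and path overlay, can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the paper's implementation: `rdp` is a textbook Ramer-Douglas-Peucker routine, `overlay_path` is a hypothetical helper that draws a single-color polyline, the tolerance `epsilon` is arbitrary, and gripper-state markers are omitted.

```python
import numpy as np


def rdp(points: np.ndarray, epsilon: float) -> np.ndarray:
    """Simplify an (N, 2) polyline with Ramer-Douglas-Peucker (epsilon >= 0)."""
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    chord_len = np.hypot(chord[0], chord[1])
    rel = points - start
    if chord_len == 0.0:
        # Closed segment: fall back to distance from the start point.
        dists = np.hypot(rel[:, 0], rel[:, 1])
    else:
        # Perpendicular distance to the chord via the 2D cross product.
        dists = np.abs(chord[0] * rel[:, 1] - chord[1] * rel[:, 0]) / chord_len
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        # Keep the farthest point and simplify both halves recursively.
        left = rdp(points[: idx + 1], epsilon)
        right = rdp(points[idx:], epsilon)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])


def overlay_path(image: np.ndarray, path: np.ndarray, color=(255, 0, 0)) -> np.ndarray:
    """Draw a polyline given in pixel (x, y) coordinates onto a copy of an (H, W, 3) image."""
    out = image.copy()
    h, w = out.shape[:2]
    for (x0, y0), (x1, y1) in zip(path[:-1], path[1:]):
        n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1  # samples per segment
        xs = np.clip(np.rint(np.linspace(x0, x1, n + 1)).astype(int), 0, w - 1)
        ys = np.clip(np.rint(np.linspace(y0, y1, n + 1)).astype(int), 0, h - 1)
        out[ys, xs] = color
    return out


# Usage: an L-shaped path with redundant collinear points.
raw_path = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [3, 1], [3, 2], [3, 3]], dtype=float)
simplified = rdp(raw_path, epsilon=0.5)  # keeps only the two ends and the corner
image = np.zeros((4, 4, 3), dtype=np.uint8)
overlaid = overlay_path(image, simplified)
```

Because the path is drawn into the image itself, the low-level policy still receives an ordinary image as input, which is why no architectural change is needed.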

---

### 5. Experimental Results and Performance
- **Key benchmarks:**  
  - Tested on 222 real-world tasks with novel objects, lighting, and instructions.
  - Simulation environment: Colosseum benchmark.

- **Results:**  
  - HAMSTER outperforms OpenVLA by **20% on average** in absolute success rate, which corresponds to a **50% relative improvement**.
  - Achieves **2x success rate** with **half the data** compared to standard 3D policies.
  - Robust across novel camera views, object types, and language instructions.
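
The relationship between the absolute and relative figures against OpenVLA can be checked with simple arithmetic. The sketch below assumes both numbers describe the same averaged success rate; the implied baseline and improved rates are derived from that assumption and are not values quoted from the paper. `implied_rates` is a hypothetical helper name.

```python
from typing import Tuple


def implied_rates(abs_gain_pts: float, rel_gain: float) -> Tuple[float, float]:
    """Return (baseline, improved) success rates in percentage points.

    Uses: abs_gain = improved - baseline and rel_gain = abs_gain / baseline.
    """
    baseline = abs_gain_pts / rel_gain
    return baseline, baseline + abs_gain_pts


# Assumed inputs: a 20-point absolute gain and a 50% relative gain.
baseline, improved = implied_rates(20.0, 0.5)
print(f"implied baseline: {baseline:.0f}%, implied improved rate: {improved:.0f}%")
```

Under that assumption, a 20-point gain that is also a 50% relative gain implies a baseline near 40% and an improved rate near 60%.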

---



