1 Show Lab, National University of Singapore
2 NVIDIA
* Equal contribution † Project lead ✉ Corresponding author
ActionMap replaces the single-point action decoder of vision-language-action models with a voxel action heatmap, improving success rate, data efficiency, and convergence across LIBERO simulation and real-world Franka manipulation.
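To make the contrast with single-point decoding concrete, here is a minimal sketch of how a translation could be read out of a voxel heatmap. It is not the released implementation. It assumes the head emits one logit per voxel over the normalized range [-1, 1] and decodes with a softmax-weighted mean of voxel centers (soft-argmax). The helper names `voxel_centers`, `encode_target`, and `decode_heatmap` are hypothetical, and ActionMap's actual encoding and decoding may differ.

```python
import numpy as np

def voxel_centers(grid, low=-1.0, high=1.0):
    """Return an array of shape (*grid, 3) holding each voxel's center coordinate."""
    axes = [low + (np.arange(n) + 0.5) * (high - low) / n for n in grid]
    return np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)

def encode_target(xyz, grid, low=-1.0, high=1.0):
    """Map a normalized translation to the index of the voxel that contains it."""
    xyz = np.asarray(xyz, dtype=float)
    n = np.asarray(grid)
    idx = np.floor((xyz - low) / (high - low) * n).astype(int)
    return tuple(np.clip(idx, 0, n - 1))  # clip so that xyz == high stays in range

def decode_heatmap(logits, low=-1.0, high=1.0):
    """Soft-argmax: softmax over all voxels, then the probability-weighted mean center."""
    probs = np.exp(logits - logits.max())  # subtract the max for numerical stability
    probs /= probs.sum()
    centers = voxel_centers(logits.shape, low, high)
    return (probs[..., None] * centers).sum(axis=(0, 1, 2))

# Assumed setup: a (32, 32, 16) translation grid and a sharply peaked heatmap.
grid = (32, 32, 16)
target = encode_target([0.25, -0.5, 0.1], grid)
logits = np.full(grid, -20.0)
logits[target] = 20.0
decoded = decode_heatmap(logits)
print(target, decoded)
```

With a sharply peaked heatmap the decoded value is the center of the target voxel, so in this sketch the quantization error is at most half a voxel width per axis. A flatter or multi-modal heatmap would instead spread probability over several voxels, which a single regressed point cannot represent.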
Our code is coming soon. As a preview, we release the core implementation of our action head, which can replace a VLA's native action decoder (e.g., OpenVLA-OFT's L1 regression head). The example below shows how to plug it in.
import torch
from heatmap_action_head import HeatmapActionHead
head = HeatmapActionHead(
    input_dim=4096,            # VLA backbone hidden size
    num_actions_chunk=8,       # action tokens per chunk
    action_dim=7,              # [x, y, z, r, p, w, grip]
    trans_grid=(32, 32, 16),   # translation voxel grid
    rot_grid=(16, 16, 16),     # rotation voxel grid
)
# Run your VLA backbone and keep the last hidden layer.
outputs = backbone(input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=True)
hidden = outputs.hidden_states[-1] # (B, seq_len, llm_dim)
# Gather the hidden states at the action-token positions:
# (B, num_actions_chunk * action_dim, llm_dim)
actions_hidden = hidden[:, action_token_indices]
# Training (ground-truth actions are normalized to [-1, 1]):
pred_actions, loss = head.predict_action_with_loss(actions_hidden, gt_actions)
loss.backward()
# Inference:
pred_actions = head.predict_action(actions_hidden) # (B, num_actions_chunk, 7)

@article{actionmap,
  title={ActionMap: Robot Policy Learning via Voxel Action Heatmap},
  author={Pei Yang and Hai Ci and Yanzhe Chen and Qi Lv and Han Cai and Mike Zheng Shou},
  year={2026},
  archivePrefix={arXiv},
}