# Gemini Robotics: Bringing AI into the Physical World
### Authors: Gemini Robotics team, Date: 25 Mar 2025, Institution: Google DeepMind
### Link: https://arxiv.org/html/2503.20020v1

---

### 1. Problem Context & Motivation
- Large multimodal models perform well on digital tasks, but those capabilities are hard to transfer to the physical world through robots. For example, a model may identify objects reliably, yet getting a robot to move a soup can over a pan right next to it can still be difficult.
- This is particularly relevant now because physical agents (robots) are desirable for generalist tasks, which requires them to have a data-heavy backbone (e.g., the Gemini 2.0 VLM).
- Previous VLMs (vision-language models) could accurately interpret images but could not generate robot actions or operate properly in 3D space.

---

### 2. Prior Work and State-of-the-Art
- The key prior work and SoTA is the π₀ generalist policy.
- π₀ combines a pre-trained vision-language model (the backbone that provides reasoning), flow-matching for continuous control (i.e., making robot actions smooth), training data from different robot types, and a two-stage training recipe (pretraining on general data, then fine-tuning on a specific task).
- It brought zero-shot generalization (handling new tasks without retraining), high-frequency smooth dexterity (the robot moves fluidly and naturally, with commands issued at 50 Hz), and cross-robot adaptability (the form of the robot doesn't matter).
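
The flow-matching idea mentioned above can be sketched in a few lines. This is a toy illustration under stated assumptions, not π₀'s implementation: flow matching trains a network to predict a velocity field that carries a noise sample towards an action, and at inference time that field is integrated in a few small steps. Here the learned network is replaced by a hypothetical closed-form stand-in (`toy_velocity`) so the example runs without any ML library; `target`, `num_steps`, and `seed` are illustrative names.

```python
import random

def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * noise + t * target, the
    velocity that carries x_t to the target is (target - x_t) / (1 - t).
    A real policy would predict this from images, text, and robot state.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample_action(target, num_steps=10, seed=0):
    """Integrate the velocity field from noise (t=0) towards an action (t=1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so no division by zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # one Euler step
    return x

# Illustrative 3-dimensional "action" (e.g., three joint targets).
action = sample_action(target=[0.2, -0.5, 0.1])
print(action)
```

Because the output is a continuous vector rather than a discrete token, the resulting commands can vary smoothly, which is the property the summary attributes to flow matching.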


---

### 3. Summary of the Paper’s Contributions

- Gemini Robotics merges the power of large multimodal models (text + vision + action) with real-world robot control.
- The paper introduces a reasoning benchmark, Embodied Reasoning Question Answering (ERQA), and two models that both use Gemini 2.0 as their VLM backbone.
    - The Gemini 2.0 model excels at tasks such as detecting objects and points in 2D, using 2D pointing for grasping and trajectories, and finding corresponding points and detecting objects in 3D. This is a large step towards bringing the reasoning capabilities of consumer tools, like ChatGPT, to robotics.
    1. Gemini Robotics-ER: Focuses on understanding the physical world: it reasons about objects, space, and actions using just images and text.

    2. Gemini Robotics: Builds on ER but adds direct robot control. It can execute tasks like folding clothes, packing lunch boxes, and even playing cards using real robots.

    Gemini Robotics can turn instructions like "zip the lunch bag" into precise motor actions in real time, with robot motion at 50 Hz. It also generalizes exceptionally well: it can handle new environments, new instructions (including typos), and new robot types. It can perform new tasks without additional training (zero-shot) and improves when given new data (few-shot). With further tuning, it can complete complicated multi-step tasks such as origami.

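
To illustrate how 2D pointing can feed a grasping pipeline, here is a minimal sketch of converting a model's pointing output into pixel coordinates. The response format is an assumption made for illustration (a JSON list of objects with a `"point": [y, x]` pair normalized to a 0-1000 range and a `"label"`); this summary does not confirm the format, and `points_to_pixels` is a hypothetical helper.

```python
import json

def points_to_pixels(response_text, image_width, image_height, scale=1000):
    """Convert normalized [y, x] points into integer pixel (x, y) coordinates.

    Assumes (hypothetically) that the model returns JSON such as
    [{"point": [y, x], "label": "..."}] with coordinates in [0, scale].
    """
    pixels = {}
    for item in json.loads(response_text):
        y_norm, x_norm = item["point"]
        x_px = round(x_norm / scale * image_width)
        y_px = round(y_norm / scale * image_height)
        pixels[item["label"]] = (x_px, y_px)
    return pixels

# Made-up response for a 640x480 camera image.
example = '[{"point": [500, 250], "label": "soup can"}]'
print(points_to_pixels(example, image_width=640, image_height=480))
# prints {'soup can': (160, 240)}
```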
---

### 4. Technical Methods (with Plain Language Explanation)
- A pretrained model (Gemini 2.0) serves as the backbone and is paired with a local action decoder running on the robot; together they deliver actions at 50 Hz with roughly 250 ms of latency from input to action.
- Gemini Robotics is a very large model trained on multimodal data, robot demonstrations, and robot state data (poses in a variety of environments).
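
The two timing figures above fit together through simple arithmetic: at 50 Hz each action covers 20 ms, so a 250 ms inference latency spans 12.5 control steps. The sketch below works this out. It assumes, as a simplification not stated in this summary, that the model emits a chunk of several actions per inference and that the next chunk is requested as soon as the current one starts executing; `min_chunk_length` is a hypothetical helper.

```python
import math

CONTROL_HZ = 50     # effective control frequency quoted in the summary
LATENCY_S = 0.250   # approximate input-to-action latency quoted in the summary

def min_chunk_length(control_hz, latency_s):
    """Smallest number of actions per chunk that covers one inference latency."""
    return math.ceil(latency_s * control_hz)

step_ms = 1000 / CONTROL_HZ
chunk = min_chunk_length(CONTROL_HZ, LATENCY_S)
print(f"each action covers {step_ms:.0f} ms")
print(f"a chunk needs at least {chunk} actions to hide {LATENCY_S * 1000:.0f} ms of latency")
```

Under these assumptions a chunk of at least 13 actions keeps the robot moving while the next inference is still in flight, which is how a model with roughly 250 ms of latency can still drive 50 Hz motion.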


---

### 5. Experimental Results and Performance
- Summary of the main results and where they were tested
- Key benchmarks and evaluation metrics
- Performance relative to previous methods

---

### 6. Comparison to Prior Work
- Summary of how this method improves over earlier work
- The previous SoTA model, π₀, was outperformed by Gemini Robotics.
- Gemini Robotics accomplished this by improving on several aspects:
1. Gemini 2.0 is a much larger model than the 3B-parameter VLM that π₀ uses as its backbone.
2. Gemini Robotics was trained on a more diverse dataset (see Section 4).
3. Gemini Robotics includes an embodied reasoning layer (Gemini Robotics-ER) that allows it to reason about 3D environments, predict actions, and make inferences.
4. Both Gemini Robotics and π₀ were fine-tuned on tasks, but Gemini Robotics had extensive training on multi-step tasks.

---

### 7. Broader Implications and Applications
- Gemini Robotics has established itself as the new SoTA for general-purpose task interaction in 3D space.
- Because of its zero-shot and few-shot capabilities, the model can be deployed faster than previous related tools such as RT and π₀.
- It reestablishes and improves the feasibility of a generalist robot.

---


### 8. Related Papers and Integration Points
- Mention other directly related papers
- Explain how this work builds on, improves, or contrasts with them
- Include internal links to other notebook summaries if applicable


---

### 10. Executive Summary
- One-paragraph summary focused on business relevance
- What problem it solves
- How it improves upon earlier approaches
- What it enables in real-world robotics or automation contexts

---
