# Foundation Models for Robotics: A Deep Dive into Google's Approach

This notebook explores the concepts presented in Ted Xiao's lecture on the shift towards foundation models in robotics. We will deconstruct the philosophy, architecture, and experiments behind key projects like RT-1, SayCan, Inner Monologue, and DIAL. The goal is to provide an interactive, educational experience that follows the evolution of thought at Google's robotics team, moving from online Reinforcement Learning to large-scale, offline, language-guided imitation learning.

## Section 1: Overview & Prerequisites

### Summary of the Research Paper/Lecture Topic

This lecture outlines a paradigm shift in robot learning, advocating for the application of principles from foundation models (like large language models) to robotics. The core thesis is that scaling up data, model size, and compute—a recipe successful in other domains—can unlock emergent capabilities necessary for robots to operate in the complex, unstructured real world. The talk argues for a move from online, trial-and-error learning (Reinforcement Learning) to an offline paradigm where data collection and policy learning are decoupled. The proposed recipe for modern embodied intelligence is: **Large, diverse offline datasets + High-capacity architectures (Transformers) + Language as a universal glue.**

The lecture details several key projects that embody this philosophy:
- **RT-1 (Robotics Transformer 1):** A Transformer-based architecture for multi-task imitation learning that demonstrates robustness and scalability with diverse data.
- **SayCan:** A framework that uses large language models (LLMs) for high-level, long-horizon planning, grounding the LLM's suggestions with the robot's learned skills (affordances).
- **Inner Monologue:** An extension of SayCan that incorporates closed-loop feedback from the environment using Vision-Language Models (VLMs) to enable dynamic replanning and error recovery.
- **DIAL:** A data-centric method that uses VLMs to automatically relabel a large, templated robotics dataset with rich, natural language descriptions, significantly enhancing the policy's language understanding and generalization.

### Prerequisite Mathematical Concepts

- **Self-Attention Mechanism:** The core component of the Transformer architecture, allowing the model to weigh the importance of different tokens in an input sequence.
- **Cross-Entropy Loss:** A standard loss function for classification problems, used here for predicting discretized robot actions from a fixed set of possibilities.
- **Value Functions (Q-Learning):** Functions that estimate the expected return (reward) of taking an action in a given state. Used conceptually in SayCan to represent the robot's confidence or `affordance` for a specific skill.

### Prerequisite ML/CS Concepts

- **Transformer Architecture:** Understanding of the encoder-decoder structure, tokenization, embeddings, and the role of self-attention.
- **Imitation Learning (Behavioral Cloning):** The supervised learning approach of training a policy to mimic expert demonstrations (mapping observations to actions).
- **Offline vs. Online Reinforcement Learning:** Understanding the distinction between learning from a static, pre-collected dataset (offline) versus learning through active, real-time interaction with an environment (online).
- **Large Language Models (LLMs):** Familiarity with the capabilities of models like PaLM or GPT-3, particularly their use in few-shot prompting for planning and reasoning.
- **Vision-Language Models (VLMs):** Familiarity with models like CLIP that can connect text and images, used for tasks like visual question answering (VQA) and image captioning.

### Hierarchy of Topics

1.  **The Philosophical Shift:** Why Foundation Models? The Bitter Lessons 1.0 & 2.0.
2.  **Mathematical Foundations:** Implementing Self-Attention from scratch.
3.  **Prerequisite Algorithm:** A simple implementation of Behavioral Cloning (BC).
4.  **Core Research - RT-1:** Deconstructing the Robotics Transformer architecture for robust skill learning.
5.  **Core Research - SayCan & Inner Monologue:** Using LLMs and VLMs for planning and feedback.
6.  **Core Research - DIAL:** Scaling data diversity through automatic VLM-based relabeling.
7.  **Experimental Analysis:** Reproducing the key findings and ablation studies discussed in the lecture.
8.  **Context & Extensions:** The broader implications and future of this research direction.

### Learning Objectives

- Understand the motivation behind applying foundation model principles to robotics.
- Grasp the architecture of the RT-1 model and its key design choices (e.g., action discretization).
- Implement the core logic of grounding LLM planners with robotic affordances (SayCan).
- Appreciate how VLMs can provide closed-loop feedback and augment training data.
- Analyze the trade-offs between data quantity, task diversity, and model size based on the lecture's experimental results.

**Estimated Time:** 2-3 hours

## Section 2: Mathematical Foundations

### The Transformer's Core: Self-Attention

The lecture credits high-capacity architectures like the Transformer as a key ingredient for success. The magic of the Transformer lies in the self-attention mechanism, which allows the model to dynamically weigh the importance of different parts of the input sequence. For a robot, this could mean paying more attention to the object it needs to grasp rather than the distracting background. 

The formula for scaled dot-product attention is:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Where $Q$ (Query), $K$ (Key), and $V$ (Value) are learned linear projections of the input tokens. Let's implement this from scratch.

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact, FloatSlider

def educational_scaled_dot_product_attention(Q, K, V):
    """
    Clear implementation of scaled dot-product attention for understanding.
    - Based directly on the mathematical formulation.
    - Extensive comments explaining each step.
    """
    # Get the dimension of the key vectors
    d_k = K.shape[-1]
    
    # Step 1: Compute the dot products of the query with all keys (QK^T).
    # This gives us a raw score of how much each token should attend to every other token.
    # Shape: (batch_size, num_queries, num_keys)
    attention_scores = torch.matmul(Q, K.transpose(-2, -1))
    
    # Step 2: Scale the attention scores by the square root of the key dimension.
    # Large dot products push the softmax into saturated regions where gradients become
    # very small; dividing by sqrt(d_k) counteracts this, especially for large d_k.
    scaled_attention_scores = attention_scores / np.sqrt(d_k)
    
    # Step 3: Apply softmax to get the attention weights.
    # This converts the scores into a probability distribution, where weights sum to 1.
    # Shape: (batch_size, num_queries, num_keys)
    attention_weights = F.softmax(scaled_attention_scores, dim=-1)
    
    # Step 4: Multiply the weights by the value vectors.
    # This creates a weighted sum of the values, where tokens that are more "important"
    # (have higher attention weights) contribute more to the output.
    # Shape: (batch_size, num_queries, value_dim)
    output = torch.matmul(attention_weights, V)
    
    return output, attention_weights

# --- Let's create some dummy data to test it ---
batch_size = 1
sequence_length = 5 # e.g., 1 language token + 4 image patch tokens
embedding_dim = 64 # Dimension of each token

# In a real transformer, Q, K, and V are projections of the same input sequence.
input_sequence = torch.randn(batch_size, sequence_length, embedding_dim)

# For simplicity, let's treat the input as Q, K, and V directly.
Q, K, V = input_sequence, input_sequence, input_sequence

output, attention_weights = educational_scaled_dot_product_attention(Q, K, V)

print("Input shape:", input_sequence.shape)
print("Output shape:", output.shape)
print("Attention weights shape:", attention_weights.shape)

# Visualize the attention weights
plt.figure(figsize=(6, 5))
sns.heatmap(attention_weights.squeeze(0).detach().numpy(), annot=True, cmap='viridis')
plt.xlabel("Key Tokens")
plt.ylabel("Query Tokens")
plt.title("Self-Attention Weights Heatmap")
plt.show()
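
To see why the scaling in Step 2 matters, the following standard-library sketch compares softmax weights computed from raw and scaled dot products. The random vectors and the helper names `softmax` and `dot` are assumptions made for illustration; they are not part of the lecture.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
d_k = 512  # assumed key dimension, chosen large to make the effect visible
query = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(5)]

raw_scores = [dot(query, k) for k in keys]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print("max weight without scaling:", round(max(raw_weights), 3))
print("max weight with scaling:   ", round(max(scaled_weights), 3))
```

With unit-variance inputs, the raw dot products have a standard deviation of about $\sqrt{d_k}$, so the unscaled softmax tends to concentrate most of its weight on a single key, while the scaled version stays flatter.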

### Value Functions as Affordances
The SayCan paper uses a `value function` to represent the robot's `affordance` for a task. In simple terms, this is a score from the robot policy that says, "Given the current world state (what I see), how likely am I to succeed at task X?" A high value means high confidence. This is crucial for grounding the LLM's abstract plans in the robot's actual capabilities.
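
One way to make this precise, under the idealized assumption that the robot receives a single reward of 1 when the skill succeeds and 0 otherwise, with no discounting, is to note that the value of starting skill $\pi$ in state $s$ equals its success probability. The symbols $s$, $\pi$, and $r_t$ are notation introduced here for illustration:

$$ V^{\pi}(s) = \mathbb{E}\left[\sum_{t} r_t \,\middle|\, s_0 = s,\ \text{skill } \pi\right] = P(\text{success} \mid s, \pi) $$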

Let's create a mock value function to simulate this.

In [None]:
def educational_mock_affordance_function(state, possible_skills):
    """
    Simulates a robot's value function to estimate success probability for skills.
    - In a real system, this would be a learned neural network.
    - Here, we use simple heuristics for educational purposes.
    """
    affordances = {}
    # The 'state' is a simple dictionary describing what the robot sees.
    objects_in_scene = state.get('objects', [])
    robot_hand_empty = state.get('hand_empty', True)
    
    print(f"\nRobot State: {state}")
    
    for skill in possible_skills:
        # Default confidence is low.
        confidence = 0.1
        
        # Heuristics to determine confidence
        if 'pick' in skill:
            # Can only pick an object if it's in the scene and hand is empty
            target_object = skill.split(' ')[1]
            if target_object in objects_in_scene and robot_hand_empty:
                confidence = 0.9
            else:
                confidence = 0.05 # Very low confidence if conditions aren't met
        
        elif 'place' in skill:
            # Can only place if the hand is not empty
            if not robot_hand_empty:
                confidence = 0.85
        
        elif 'find' in skill:
            # Robot can always try to find things
            confidence = 0.7
        
        affordances[skill] = confidence
        
    return affordances

# --- Example Usage ---
robot_skills = ['pick apple', 'pick coke_can', 'place on_table', 'find sponge']

# Scenario 1: Robot sees an apple, hand is empty
current_state_1 = {'objects': ['apple', 'table'], 'hand_empty': True}
affordances_1 = educational_mock_affordance_function(current_state_1, robot_skills)
print("Affordances (Scenario 1):", affordances_1)

# Scenario 2: Robot is holding something, sees a table
current_state_2 = {'objects': ['table'], 'hand_empty': False}
affordances_2 = educational_mock_affordance_function(current_state_2, robot_skills)
print("Affordances (Scenario 2):", affordances_2)

## Section 3: Prerequisite Algorithms

### Behavioral Cloning (BC)

The lecture highlights the team's consolidation around multi-task imitation learning. The simplest form of this is Behavioral Cloning (BC). It treats robot learning as a supervised learning problem: given a large dataset of `(observation, action)` pairs from an expert (e.g., a human teleoperator), we train a policy network $\pi(action | observation)$ to predict the expert's action given an observation. The RT-1 model is fundamentally a large, sophisticated BC policy.
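
Written as an objective, and using notation introduced here for illustration ($\mathcal{D}$ for the demonstration dataset, $\theta$ for the policy parameters), behavioral cloning maximizes the likelihood of the expert's actions:

$$ \theta^{*} = \arg\min_{\theta}\ \mathbb{E}_{(o, a) \sim \mathcal{D}}\left[-\log \pi_{\theta}(a \mid o)\right] $$

For a Gaussian policy with fixed variance, minimizing this negative log-likelihood is equivalent to minimizing the mean squared error between predicted and expert actions. For discretized actions, it becomes the cross-entropy loss.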

Let's create a toy example.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def educational_behavioral_cloning():
    """
    A clear, from-scratch implementation of Behavioral Cloning for understanding.
    - Creates a simple dataset.
    - Defines a simple neural network policy.
    - Trains the policy to mimic the 'expert' data.
    """
    # Step 1: Create a synthetic dataset of (observation, action) pairs.
    # Imagine a robot trying to move to a target on a 1D line.
    # Observation: [current_position, target_position]
    # Action: [velocity]
    num_samples = 1000
    # Observations: current position is random, target is random
    observations = torch.rand(num_samples, 2) * 10  # values between 0 and 10
    
    # Expert actions: move towards the target. Action = target - current
    actions = observations[:, 1] - observations[:, 0]
    actions = actions.unsqueeze(1) # Reshape for the model
    
    # Create a PyTorch DataLoader
    dataset = TensorDataset(observations, actions)
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    
    # Step 2: Define the policy network.
    # A simple Multi-Layer Perceptron (MLP).
    policy_net = nn.Sequential(
        nn.Linear(2, 64),
        nn.ReLU(),
        nn.Linear(64, 64),
        nn.ReLU(),
        nn.Linear(64, 1) # Output a single action value
    )
    
    # Step 3: Define loss function and optimizer.
    # Mean Squared Error is common for continuous actions.
    loss_fn = nn.MSELoss()
    optimizer = optim.Adam(policy_net.parameters(), lr=0.001)
    
    # Step 4: The training loop.
    print("Starting Behavioral Cloning training...")
    num_epochs = 10
    for epoch in range(num_epochs):
        total_loss = 0
        for obs_batch, act_batch in dataloader:
            # Forward pass: get the policy's predicted action
            predicted_actions = policy_net(obs_batch)
            
            # Compute the loss between predicted and expert actions
            loss = loss_fn(predicted_actions, act_batch)
            
            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        
        avg_loss = total_loss / len(dataloader)
        if (epoch + 1) % 2 == 0:
            print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}")
    
    print("\nTraining complete.")
    return policy_net

bc_policy = educational_behavioral_cloning()

# --- Test the trained policy ---
test_obs = torch.tensor([[2.0, 8.0]]) # Should move right (positive action)
expert_action = test_obs[0, 1] - test_obs[0, 0]
predicted_action = bc_policy(test_obs)
print(f"\nTest Observation: {test_obs.numpy()}")
print(f"Expert Action: {expert_action.item():.2f}")
print(f"BC Policy Predicted Action: {predicted_action.item():.2f}")

## Section 4: Core Research Content

### RT-1: Robotics Transformer

RT-1 is a decoder-only Transformer that takes a sequence of images and a natural language instruction as input and outputs a discretized action. Let's outline its architecture conceptually based on the lecture.

**Key Components:**
1.  **Image Tokenizer (FiLM-EfficientNet):** An EfficientNet model (pre-trained on vision tasks) processes each image in a history of 6 images. Its final spatial feature map is flattened into 81 tokens per image. The language instruction is used to condition the EfficientNet's features using FiLM (Feature-wise Linear Modulation) layers. This helps the vision system focus on instruction-relevant features early on.
2.  **TokenLearner:** To manage the computational cost of processing many image patches over a history of images, TokenLearner dynamically selects the most relevant 8 image patch tokens (out of 81) at each timestep. This is a learned attention mechanism over the patches.
3.  **Language Tokenizer:** A standard text encoder (e.g., from T5) converts the natural language instruction into a sequence of embeddings.
4.  **Transformer:** The selected image tokens and language tokens are fed into a decoder-only Transformer. It uses causal self-attention to predict the next action token.
5.  **Action Discretization:** Instead of predicting continuous 7-DoF end-effector poses, RT-1 predicts a single token from a vocabulary of 256 discrete bins for each dimension of the action space. This turns the regression problem into a classification problem, which is often more stable to train.
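
The action discretization step can be sketched in a few lines. This is a minimal illustration that assumes uniform bins and a per-dimension range of $[-1, 1]$; the function names `discretize` and `undiscretize` and the example action are hypothetical.

```python
def discretize(value, low, high, num_bins=256):
    """Map a continuous value in [low, high] to an integer bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The upper bound maps to fraction 1.0, so clamp it into the last bin.
    return min(int(fraction * num_bins), num_bins - 1)

def undiscretize(bin_index, low, high, num_bins=256):
    """Map a bin index back to the center of its bin."""
    bin_width = (high - low) / num_bins
    return low + (bin_index + 0.5) * bin_width

# Hypothetical 7-D action, with every dimension assumed to lie in [-1, 1].
action = [0.0, 0.25, -1.0, 1.0, 0.5, -0.3, 0.9]
tokens = [discretize(a, -1.0, 1.0) for a in action]
recovered = [undiscretize(t, -1.0, 1.0) for t in tokens]

print("tokens:   ", tokens)
print("recovered:", [round(r, 3) for r in recovered])
```

The round trip loses at most half a bin width per dimension, which is the price paid for turning regression into classification.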

In [None]:
class EducationalRT1(nn.Module):
    """
    A conceptual PyTorch implementation of the RT-1 architecture.
    - Focuses on the data flow and interaction between components.
    - Uses placeholders for complex modules like TokenLearner and EfficientNet.
    """
    def __init__(self, num_action_bins=256, action_dim=7, embedding_dim=128, num_heads=4, num_layers=4):
        super().__init__()
        self.num_action_bins = num_action_bins
        self.action_dim = action_dim
        self.embedding_dim = embedding_dim

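        # 1. Placeholder for Image Tokenizer (FiLM-EfficientNet)
        # This would take an image (e.g., 3x224x224) and language embedding and output patch embeddings
        # Let's assume it outputs 81 patches of `embedding_dim`
        # We pool the image down to 9x9 first: a direct Linear(3*224*224, 81*embedding_dim)
        # would need roughly 1.5 billion weights.
        self.image_tokenizer = nn.Sequential(
            nn.Unflatten(1, (3, 224, 224)),  # flattened image -> (3, 224, 224)
            nn.AdaptiveAvgPool2d(9),         # -> (3, 9, 9)
            nn.Flatten(),                    # -> 243 features
            nn.Linear(3 * 9 * 9, 81 * embedding_dim),
        ) # Highly simplified!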
        print("Component [1]: Image Tokenizer (Placeholder) Initialized.")

        # 2. Placeholder for TokenLearner
        # This would take 81 patch tokens and select the top 8
        # We simulate this by simply selecting the first 8 tokens
        self.token_learner = lambda x: x[:, :8, :] # Simplified selector
        print("Component [2]: TokenLearner (Placeholder) Initialized.")

        # 3. Placeholder for Language Tokenizer
        # A real one would use a pre-trained model like BERT or T5.
        self.language_embedding = nn.Embedding(num_embeddings=100, embedding_dim=embedding_dim) # vocab size = 100
        print("Component [3]: Language Tokenizer (Placeholder) Initialized.")

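        # 4. The Transformer Decoder
        # Note: nn.TransformerDecoder cross-attends to a separate `memory` sequence, so this
        # placeholder only approximates the decoder-only design described above.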
        decoder_layer = nn.TransformerDecoderLayer(d_model=embedding_dim, nhead=num_heads, batch_first=True)
        self.transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        print("Component [4]: Transformer Decoder Initialized.")

        # 5. The Action Prediction Head
        # Predicts a distribution over bins for EACH action dimension
        self.action_head = nn.Linear(embedding_dim, action_dim * num_action_bins)
        print("Component [5]: Action Prediction Head Initialized.")

    def forward(self, image_history, language_instruction):
        """
        image_history: (batch, history_len, C, H, W) e.g., (1, 6, 3, 224, 224)
        language_instruction: (batch, seq_len) e.g., (1, 10)
        """
        batch_size = image_history.shape[0]
        history_len = image_history.shape[1]
        
        # Process language instruction
        lang_tokens = self.language_embedding(language_instruction)
        # lang_tokens shape: (batch_size, lang_seq_len, embedding_dim)

        # Process image history
        image_tokens_list = []
        for i in range(history_len):
            img = image_history[:, i, :, :, :].view(batch_size, -1)
            # Pass through tokenizer (FiLM would also take lang_tokens here)
            patch_embeddings = self.image_tokenizer(img).view(batch_size, 81, self.embedding_dim)
            # Pass through TokenLearner
            selected_patches = self.token_learner(patch_embeddings)
            image_tokens_list.append(selected_patches)
        
        # Concatenate all tokens: [lang, img1_p1..p8, img2_p1..p8, ...]
        image_tokens = torch.cat(image_tokens_list, dim=1)
        # Memory for the decoder (what it attends to)
        memory = torch.cat([lang_tokens, image_tokens], dim=1)

        # The decoder needs a target (query) sequence. A real model would start from a learned
        # embedding (or a special [START] token) and predict autoregressively.
        # For simplicity, we use a random placeholder query here, so it is NOT learned and
        # the output changes from call to call.
        tgt = torch.randn(batch_size, 1, self.embedding_dim, device=memory.device)
        
        # The transformer decoder will attend to the `memory` (images + language)
        transformer_output = self.transformer_decoder(tgt, memory)
        
        # The output of the transformer is used to predict the action
        action_logits = self.action_head(transformer_output)
        
        # Reshape to (batch, action_dim, num_bins)
        action_logits = action_logits.view(batch_size, self.action_dim, self.num_action_bins)
        
        return action_logits

# --- Instantiate and test the model with dummy data ---
rt1_model = EducationalRT1()
dummy_images = torch.randn(1, 6, 3, 224, 224) # batch=1, history=6
dummy_instruction = torch.randint(0, 100, (1, 10)) # batch=1, 10 words

predicted_action_logits = rt1_model(dummy_images, dummy_instruction)
print(f"\nOutput Logits Shape: {predicted_action_logits.shape}")
print("(Batch, Action_Dimension, Num_Bins)")
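
Because each action dimension is a classification over bins, training reduces to a per-dimension cross-entropy loss. The sketch below uses only the standard library and a tiny hypothetical example (3 dimensions, 4 bins) instead of the 7 dimensions and 256 bins used by the model above; the logits and expert bins are made up.

```python
import math

def cross_entropy(logits, target_bin):
    """Negative log-probability of target_bin under softmax(logits)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[target_bin]

# Hypothetical logits: one row per action dimension, one column per bin.
action_logits = [
    [2.0, 0.1, -1.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [-1.0, 1.5, 0.2, 0.3],
]
expert_bins = [0, 2, 3]  # discretized expert action, one bin index per dimension

# Average the classification loss over action dimensions.
per_dim_loss = [cross_entropy(row, t) for row, t in zip(action_logits, expert_bins)]
loss = sum(per_dim_loss) / len(per_dim_loss)

# Greedy decoding: pick the most likely bin in each dimension.
predicted_bins = [max(range(len(row)), key=row.__getitem__) for row in action_logits]

print(f"loss = {loss:.4f}, predicted bins = {predicted_bins}")
```

The third dimension shows a disagreement between the prediction and the expert, which is where most of the loss comes from in this toy example.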

### SayCan: Grounding LLMs in Reality

The core idea of SayCan is to combine the semantic common-sense knowledge of an LLM with the practical, learned knowledge of a robot policy. An LLM might suggest many plausible steps to solve a problem, but most are not physically possible for the robot. SayCan filters these suggestions.

**Workflow:**
1.  **Prompt an LLM:** Give a high-level instruction (e.g., "I spilled my drink, can you help?") to an LLM, along with a list of the robot's available low-level skills (e.g., `['find sponge', 'pick sponge', 'wipe table']`).
2.  **Get LLM Scores:** The LLM assigns a probability to each skill, indicating how semantically relevant it is as the next step.
3.  **Get Affordance Scores:** The robot's policy (a value function) looks at the current scene and scores each skill based on its predicted success probability (affordance).
4.  **Combine Scores:** Multiply the LLM score and the affordance score for each skill.
5.  **Select Action:** The skill with the highest combined score is chosen and executed.
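
The workflow above can be summarized in one expression. The notation is introduced here for illustration: $i$ is the user instruction, $s$ the current state, $\Pi$ the set of available skills, $\ell_{\pi}$ the text description of skill $\pi$, and $c_{\pi}$ the event that the skill completes successfully.

$$ \pi^{*} = \arg\max_{\pi \in \Pi}\ \underbrace{p(\ell_{\pi} \mid i)}_{\text{LLM score ("Say")}}\ \cdot\ \underbrace{p(c_{\pi} \mid s, \ell_{\pi})}_{\text{affordance ("Can")}} $$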

In [None]:
import pandas as pd

def educational_mock_llm(instruction, possible_skills):
    """
    Simulates an LLM providing semantic scores for the next best step.
    """
    scores = {}
    # Simple keyword matching to simulate LLM reasoning
    if 'apple' in instruction and 'table' in instruction:
        scores['find apple'] = 0.9
        scores['pick apple'] = 0.8 # Less likely first step than finding
        scores['place on_table'] = 0.5 # Can't place before picking
    elif 'coke' in instruction:
        scores['find coke_can'] = 0.85
        scores['pick coke_can'] = 0.75
    else:
        # Default uniform distribution if instruction is not understood
        for skill in possible_skills: scores[skill] = 1.0 / len(possible_skills)
    
    # Normalize to make it a probability distribution
    total_score = sum(scores.values())
    normalized_scores = {k: v / total_score for k, v in scores.items()}
    return normalized_scores

def educational_saycan_planner(instruction, robot_state, possible_skills):
    """
    Implements the core logic of the SayCan planner.
    """
    # Step 1 & 2: Get scores from the LLM
    llm_scores = educational_mock_llm(instruction, possible_skills)
    
    # Step 3: Get scores from the robot's affordance function
    affordance_scores = educational_mock_affordance_function(robot_state, possible_skills)
    
    # Step 4: Combine the scores
    combined_scores = {}
    for skill in possible_skills:
        # Ensure all skills are present in both dictionaries
        llm_score = llm_scores.get(skill, 0.0)
        affordance_score = affordance_scores.get(skill, 0.0)
        combined_scores[skill] = llm_score * affordance_score
    
    # Step 5: Select the best action
    if not combined_scores:
        return "No valid action"
    
    best_action = max(combined_scores, key=combined_scores.get)
    
    # --- For visualization ---
    results = pd.DataFrame([
        llm_scores,
        affordance_scores,
        combined_scores
    ], index=['LLM Score (Say)', 'Affordance Score (Can)', 'Combined Score (SayCan)'])
    results = results.T.sort_values(by='Combined Score (SayCan)', ascending=False)
    print("\n--- SayCan Decision Matrix ---")
    print(results)
    
    return best_action

# --- Example Usage: "put an apple on the table" ---
high_level_instruction = "put an apple on the table"
robot_skills = ['find apple', 'pick apple', 'place on_table', 'find coke_can']

# State 1: Robot hand is empty, doesn't see an apple yet.
state_1 = {'objects': ['table'], 'hand_empty': True}
print(f"\nExecuting SayCan for instruction: '{high_level_instruction}'")
action_1 = educational_saycan_planner(high_level_instruction, state_1, robot_skills)
print(f"\n==> Selected Action 1: {action_1}")

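# State 2: Robot has found and picked the apple.
# Note: this mock LLM ignores which steps were already executed, so 'find apple' keeps the
# highest LLM score and may be selected again. SayCan itself appends completed steps to the
# LLM prompt so that the scores shift toward the next step.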
state_2 = {'objects': ['table', 'apple'], 'hand_empty': False}
print(f"\nExecuting SayCan for instruction: '{high_level_instruction}'")
action_2 = educational_saycan_planner(high_level_instruction, state_2, robot_skills)
print(f"\n==> Selected Action 2: {action_2}")



### Inner Monologue: Adding Closed-Loop Feedback

Inner Monologue improves upon SayCan by making the planning process dynamic. Instead of a one-shot plan, it creates a continuous dialogue where the robot gets feedback from the environment, allowing it to recover from errors and react to changes.

**Key Feedback Mechanisms:**
1.  **Passive Scene Description:** A VLM captions the scene (e.g., "I see a coke can and an apple on the table.") and adds this text to the LLM's context.
2.  **Success Detection:** After executing an action, a VLM-based classifier determines if it succeeded or failed. This result ("Success: picked up the coke can" or "Failure: failed to pick up the coke can") is fed back to the LLM.
3.  **Active Scene Description (VQA):** The LLM can ask questions about the scene if it's uncertain (e.g., "Is the drawer open?"), which are answered by a VQA model or a human.
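
All three feedback types end up as plain text in the LLM's prompt. The following sketch shows one way such a prompt could be assembled; the `build_prompt` helper, the line labels, and the example history are hypothetical and not the format used in the paper.

```python
def build_prompt(instruction, history):
    """Join the instruction and all feedback lines into one text prompt for the planner LLM."""
    lines = [f"Human: {instruction}"]
    for kind, text in history:
        lines.append(f"{kind}: {text}")
    # End with an open slot that the LLM is asked to complete with the next skill.
    lines.append("Robot action:")
    return "\n".join(lines)

# Hypothetical feedback collected during one episode.
history = [
    ("Scene", "I see a coke can and an apple on the table."),
    ("Robot action", "pick coke_can"),
    ("Success", "False"),
    ("Robot action", "pick coke_can"),
    ("Success", "True"),
]

prompt = build_prompt("bring me a coke can", history)
print(prompt)
```

Because the failure is recorded in the prompt, the planner can see that the first grasp did not work and choose to retry rather than move on.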

In [None]:
def educational_inner_monologue():
    """
    Simulates a long-horizon task using the Inner Monologue loop.
    """
    # -- Mocks for VLM components --
    def mock_scene_describer(state):
        return f"Scene description: I see {', '.join(state['objects'])}. My hand is {'empty' if state['hand_empty'] else 'holding something'}."
    
    def mock_success_detector(action, state_before, state_after):
        if 'pick' in action:
            return not state_after['hand_empty'] # Success if hand is not empty after picking
        if 'place' in action:
            return state_after['hand_empty'] # Success if hand is empty after placing
        return True # Assume other actions always succeed

    # -- Simulation Setup --
    instruction = "pick up the coke can"
    skills = ['find coke_can', 'pick coke_can', 'place on_table']
    robot_state = {'objects': ['table', 'coke_can'], 'hand_empty': True}
    llm_context = [f"User instruction: {instruction}"]
    max_steps = 3

    print("--- Starting Inner Monologue Execution ---")
    for step in range(max_steps):
        print(f"\n--- Step {step+1} ---")
        # 1. Add passive scene description to context
        scene_text = mock_scene_describer(robot_state)
        llm_context.append(scene_text)
        print(f"LLM Context: ... {llm_context[-1]}")

        # 2. Run SayCan to get the next action
        # (Passing the full context to the LLM would be ideal, we simulate its effect)
        action_to_execute = educational_saycan_planner(instruction, robot_state, skills)
        print(f"Selected Action: {action_to_execute}")

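        # 3. Simulate executing the action and detecting success
        state_before = robot_state.copy()
        state_after = robot_state.copy()

        # Simulate a grasp failure on the first pick attempt only. Earlier feedback stored
        # in the context tells us whether a pick has already been tried.
        already_tried_pick = any("Executed 'pick coke_can'" in entry for entry in llm_context)
        if action_to_execute == 'pick coke_can' and not already_tried_pick:
            print("\033[91mSimulating a grasp failure!\033[0m")
            # State does not change: the hand is still empty.
        elif 'pick' in action_to_execute:
            state_after['hand_empty'] = False  # On retry, the grasp succeeds
        elif 'place' in action_to_execute:
            state_after['hand_empty'] = True

        # Let the mock success detector judge the outcome from the state change
        success = mock_success_detector(action_to_execute, state_before, state_after)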

        # 4. Add success/failure feedback to context
        feedback = f"Executed '{action_to_execute}'. Result: {'Success' if success else 'Failure'}."
        llm_context.append(feedback)
        print(feedback)
        robot_state = state_after
        
        if not success:
            print("\nAction failed. The LLM will now use this feedback to replan.")
        else:
            print("\nAction succeeded.")
            break # End simulation after success

educational_inner_monologue()

### DIAL: Data-driven Instruction Augmentation for Language-conditioned Control

A key bottleneck is the cost of collecting and labeling data. The original dataset for RT-1 used 700 templated instructions (e.g., "pick coke can"). This limits the robot's ability to understand natural, varied language. DIAL addresses this by using a VLM to automatically relabel the entire dataset.

**Workflow:**
1.  **Collect a small, high-quality labeled set:** A small fraction (3%) of the data is manually labeled with rich, descriptive captions (e.g., "lift up the red coke can that is next to the bag of chips").
2.  **Fine-tune a VLM:** A pre-trained VLM (like CLIP) is fine-tuned on this small dataset to become a specialized robot trajectory captioner.
3.  **Pseudo-label the large dataset:** The fine-tuned VLM is used to generate new, descriptive labels for the remaining 97% of the dataset that only had templated labels.
4.  **Train the policy:** The RT-1 model is trained on the combined dataset (original templated data + human-labeled data + VLM pseudo-labeled data), vastly increasing the linguistic diversity of its training signal.
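
Step 4 amounts to pooling instructions from three sources for the same trajectories. The sketch below illustrates this with a hypothetical `build_training_pairs` helper and made-up labels; it is not the paper's data pipeline.

```python
def build_training_pairs(template_labels, human_labels, vlm_labels):
    """Return (trajectory_id, instruction, source) tuples from all three label sources."""
    pairs = []
    for traj_id, template in template_labels.items():
        pairs.append((traj_id, template, "template"))
        if traj_id in human_labels:
            pairs.append((traj_id, human_labels[traj_id], "human"))
        if traj_id in vlm_labels:
            pairs.append((traj_id, vlm_labels[traj_id], "vlm"))
    return pairs

# Hypothetical labels: every trajectory has a template, one has a human caption,
# and the others are pseudo-labeled by the fine-tuned VLM.
template_labels = {
    "traj_001": "pick coke_can",
    "traj_002": "pick coke_can",
    "traj_003": "open drawer",
}
human_labels = {"traj_001": "lift up the red coke can that is next to the bag of chips"}
vlm_labels = {
    "traj_002": "pick up the coke can from the left side of the table",
    "traj_003": "hold and pull out the top drawer",
}

pairs = build_training_pairs(template_labels, human_labels, vlm_labels)
unique_instructions = {instruction for _, instruction, _ in pairs}
print(f"{len(pairs)} training pairs, {len(unique_instructions)} distinct instructions")
```

The same three trajectories now carry five distinct instructions instead of two, which is the kind of added linguistic diversity the workflow aims for.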

In [None]:
def educational_dial_relabeling():
    """
    Simulates the DIAL process of relabeling a dataset with a VLM.
    """
    # Step 1: The original dataset with templated labels
    original_dataset = {
        'trajectory_001': {'template_label': 'pick coke_can', 'visuals': 'coke can is on the left'},
        'trajectory_002': {'template_label': 'pick coke_can', 'visuals': 'coke can is red, near chips'},
        'trajectory_003': {'template_label': 'open drawer', 'visuals': 'top drawer is being opened'},
        'trajectory_004': {'template_label': 'pick apple', 'visuals': 'green apple picked from bowl'}
    }
    print("--- Original Dataset ---")
    for traj, data in original_dataset.items():
        print(f"{traj}: '{data['template_label']}'")
        
    # Step 2 & 3: Simulate a fine-tuned VLM generating new labels
    # The VLM takes the trajectory's visuals and outputs a rich caption.
    def mock_finetuned_vlm(visuals):
        if 'left' in visuals:
            return "pick up the coke can from the left side of the table"
        if 'red' in visuals and 'chips' in visuals:
            return "get the red can of coke that is next to the chip bag"
        if 'top drawer' in visuals:
            return "hold and pull out the top drawer"
        if 'green apple' in visuals:
            return "lift the green apple from the white bowl"
        return "unknown action"
        
    # Apply the VLM to relabel the dataset
    relabeling_results = {}
    for traj_id, data in original_dataset.items():
        new_label = mock_finetuned_vlm(data['visuals'])
        relabeling_results[traj_id] = new_label
        
    print("\n--- VLM Relabeling Results ---")
    for traj, new_label in relabeling_results.items():
        print(f"{traj}: '{new_label}'")
        
    # Step 4: The new training data for RT-1 now includes these diverse labels
    print("\nThis enriched dataset is now used to train the RT-1 policy,")
    print("improving its generalization to novel language instructions.")
    
educational_dial_relabeling()

## Section 5: Experimental Analysis

### RT-1: Data Size vs. Task Diversity

The lecture presented a key finding from the RT-1 experiments: **task diversity is more important than the number of demonstrations per task.** Reducing the number of unique tasks hurt performance more than proportionally reducing the number of episodes for every task. This suggests that for robots to scale, collecting a wider variety of behaviors is more critical than collecting more data for the same few behaviors.

Let's visualize this finding.

In [None]:
def plot_rt1_scaling_ablation():
    """
    Draws a conceptual version of the plot from the RT-1 ablation study.
    The curves are synthetic power laws chosen to illustrate the trend, not measured results.
    """
    # Percentage of total data used
    data_fraction = np.array([100, 75, 50, 25, 10])
    
    # Max success rate
    baseline_success = 85
    
    # Performance when reducing episodes per task (less data, same diversity)
    # Performance degrades gracefully.
    perf_reduce_episodes = baseline_success * (data_fraction / 100)**0.5
    
    # Performance when reducing the number of unique tasks (less diversity)
    # Performance degrades much more sharply.
    perf_reduce_tasks = baseline_success * (data_fraction / 100)**1.2

    plt.figure(figsize=(10, 6))
    plt.plot(data_fraction, perf_reduce_episodes, marker='o', linestyle='--', label='Reducing Demos per Task (High Diversity)')
    plt.plot(data_fraction, perf_reduce_tasks, marker='s', linestyle='--', label='Reducing Number of Tasks (Low Diversity)')
    
    plt.title('RT-1 Ablation: Task Diversity vs. Data per Task')
    plt.xlabel('Percentage of Training Data Used')
    plt.ylabel('Success Rate (%)')
    plt.gca().invert_xaxis() # Match the lecture's style
    plt.grid(True, which='both', linestyle='-', linewidth=0.5)
    plt.legend()
    plt.show()

plot_rt1_scaling_ablation()
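
The two ablation conditions differ only in how the dataset is subsampled. Below is a minimal sketch of that difference, assuming a toy dataset stored as a dict that maps each task name to a list of episodes. The helper names and the dataset sizes are illustrative and are not taken from the RT-1 codebase.

```python
import random

def reduce_episodes_per_task(dataset, fraction, rng):
    """Keep every task, but only a fraction of each task's episodes."""
    return {
        task: rng.sample(episodes, max(1, int(len(episodes) * fraction)))
        for task, episodes in dataset.items()
    }

def reduce_number_of_tasks(dataset, fraction, rng):
    """Keep all episodes, but only for a fraction of the tasks."""
    n_keep = max(1, int(len(dataset) * fraction))
    kept_tasks = rng.sample(sorted(dataset), n_keep)
    return {task: list(dataset[task]) for task in kept_tasks}

def count_episodes(dataset):
    """Total number of episodes across all tasks."""
    return sum(len(episodes) for episodes in dataset.values())

rng = random.Random(0)  # fixed seed so the subsets are reproducible

# Toy dataset: 20 tasks with 10 episodes each (made-up sizes).
toy_dataset = {
    f"task_{i:02d}": [f"ep_{i:02d}_{j}" for j in range(10)]
    for i in range(20)
}

fewer_episodes = reduce_episodes_per_task(toy_dataset, 0.5, rng)
fewer_tasks = reduce_number_of_tasks(toy_dataset, 0.5, rng)

print("fewer episodes:", len(fewer_episodes), "tasks,", count_episodes(fewer_episodes), "episodes")
print("fewer tasks:   ", len(fewer_tasks), "tasks,", count_episodes(fewer_tasks), "episodes")
```

In this toy setup both subsets contain 100 episodes, so a performance gap between policies trained on them would reflect task diversity rather than total data volume.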

### SayCan: Scaling with Better Foundation Models

Another key insight relates to the "Bitter Lesson 2.0": favor methods that improve as foundation models improve. The SayCan experiment illustrates this clearly. When the team swapped the underlying LLM from a smaller model (FLAN) to a much larger one (PaLM), the high-level planning success rate increased significantly, with no change to the robot or its skills.

This shows the power of decoupling the planning module; as external foundation models get better, the robot's capabilities improve for free.

In [None]:
def plot_saycan_llm_scaling():
    """
    Visualizes the improvement in planning success with larger LLMs.
    """
    # All values below are illustrative, not measured results.
    model_params = np.array([8, 62, 540]) # Billions of parameters (small, medium, large)
    planning_success_rate = np.array([75, 84, 95])
    # Execution success sits below planning success: a correct plan can still fail when executed.
    execution_success_rate = np.array([65, 74, 84])

    plt.figure(figsize=(10, 6))
    plt.plot(model_params, planning_success_rate, marker='o', linestyle='--', label='Planning Success Rate')
    plt.plot(model_params, execution_success_rate, marker='s', linestyle='--', label='Overall Execution Success Rate')

    plt.title('SayCan: Performance Scales with LLM Size (Bitter Lesson 2.0)')
    plt.xlabel('LLM Parameters (Billions)')
    plt.ylabel('Success Rate (%)')
    plt.xscale('log')
    plt.xticks(model_params, [f'{x}B' for x in model_params])
    plt.grid(True, which='both', linestyle='-', linewidth=0.5)
    plt.ylim(50, 100)
    plt.legend()
    plt.show()

plot_saycan_llm_scaling()
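
The swap is cheap because the planner depends on the LLM only through a narrow scoring interface. The sketch below assumes that interface is a function from (instruction, skill) to a score. `make_table_scorer`, `small_llm_score`, and `large_llm_score` are hard-coded stand-ins rather than real model calls, and every number is invented for illustration.

```python
from typing import Callable, Dict

# Assumed interface: (instruction, skill) -> usefulness score.
LLMScorer = Callable[[str, str], float]

def make_table_scorer(table: Dict[str, float]) -> LLMScorer:
    """Build a stand-in scorer that looks up a fixed score per skill."""
    def score(instruction: str, skill: str) -> float:
        # A real LLM would condition on the instruction; this mock ignores it.
        return table.get(skill, 0.0)
    return score

def select_skill(instruction: str,
                 affordances: Dict[str, float],
                 llm_score: LLMScorer) -> str:
    """Pick the skill that maximizes llm_score * affordance."""
    return max(
        affordances,
        key=lambda skill: llm_score(instruction, skill) * affordances[skill],
    )

instruction = "I spilled my drink, can you help?"
affordances = {"find sponge": 0.8, "find apple": 0.8, "pick sponge": 0.1}

# Invented scores: the "small" model ranks an irrelevant skill highest.
small_llm_score = make_table_scorer(
    {"find sponge": 0.3, "find apple": 0.4, "pick sponge": 0.3}
)
large_llm_score = make_table_scorer(
    {"find sponge": 0.7, "find apple": 0.1, "pick sponge": 0.2}
)

# Same planner, same affordances; only the scorer changes.
print("small LLM ->", select_skill(instruction, affordances, small_llm_score))
print("large LLM ->", select_skill(instruction, affordances, large_llm_score))
```

Because `select_skill` never inspects the scorer's internals, a better language model can be dropped in without retraining the skills or the affordance estimates.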

### Interactive Parameter Exploration

Let's create an interactive widget to explore the SayCan decision logic. You can adjust the LLM's confidence and the robot's affordance for a specific skill and see how it affects the final choice. This demonstrates the balance between "what makes sense" (language) and "what is possible" (affordance).

In [None]:
def interactive_saycan_explorer(llm_pick_apple=0.8, affordance_pick_apple=0.9):
    """
    Interactive widget to explore the SayCan scoring.
    """
    # Other skills have fixed scores for this example
    scores = {
        'find apple': {'llm': 0.9, 'affordance': 0.7},
        'pick apple': {'llm': llm_pick_apple, 'affordance': affordance_pick_apple},
        'place on_table': {'llm': 0.3, 'affordance': 0.05} # can't place if hand is empty
    }
    
    combined_scores = {skill: data['llm'] * data['affordance'] for skill, data in scores.items()}
    best_action = max(combined_scores, key=combined_scores.get)
    
    df = pd.DataFrame(scores).T
    df.columns = ['LLM Score', 'Affordance Score']
    df['Combined Score'] = df['LLM Score'] * df['Affordance Score']
    df = df.sort_values(by='Combined Score', ascending=False)
    
    print("--- SayCan Interactive Decision Matrix ---")
    display(df)
    print(f"==> Best Action: {best_action}")
    print("\nTry lowering the affordance for 'pick apple' to see when 'find apple' becomes the better choice.")


interact(
    interactive_saycan_explorer,
    llm_pick_apple=FloatSlider(min=0.0, max=1.0, step=0.05, value=0.8, description='LLM Score (pick apple)'),
    affordance_pick_apple=FloatSlider(min=0.0, max=1.0, step=0.05, value=0.9, description='Affordance (pick apple)')
);
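
For readers without widget support, the same crossover can be found with a plain sweep. This sketch reuses the fixed scores from the widget (LLM score 0.9 and affordance 0.7 for 'find apple', LLM score 0.8 for 'pick apple'). The helper name `best_skill` is hypothetical.

```python
def best_skill(affordance_pick_apple, llm_pick_apple=0.8):
    """Return the skill with the highest LLM score * affordance product."""
    combined = {
        "find apple": 0.9 * 0.7,
        "pick apple": llm_pick_apple * affordance_pick_apple,
        "place on_table": 0.3 * 0.05,
    }
    return max(combined, key=combined.get)

# Sweep the affordance in steps of 0.05, matching the slider resolution.
sweep = {step / 20: best_skill(step / 20) for step in range(21)}
for affordance, skill in sweep.items():
    print(f"affordance={affordance:.2f} -> {skill}")

# 'pick apple' wins when 0.8 * affordance > 0.9 * 0.7.
crossover = (0.9 * 0.7) / 0.8
print(f"crossover affordance: {crossover:.4f}")
```

With these scores, 'pick apple' is selected only when its affordance exceeds 0.63 / 0.8, roughly 0.79. On the slider, 0.80 selects 'pick apple' and 0.75 selects 'find apple'.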

## Section 6: Research Context & Extensions

### Research Contribution in Context

The body of work presented in the lecture represents a significant contribution by championing a specific, scalable paradigm for robot learning. It places itself firmly in the "data-driven" camp, moving away from classical robotics and even the dominant online RL approaches of the recent past.

- **Historical Context:** The lecture traces the team's journey from the "arm farm" (on-policy RL, 2016), through various methods like QT-Opt and BC-Z, to a consolidation around multi-task imitation learning (2022). This history highlights the "bitter lessons" learned: methods that scale with data and general computation (like Transformers on large offline datasets) ultimately win out over more complex, specialized algorithms.
- **The Bitter Lesson 2.0:** A key contextual idea is Karol Hausman's "Bitter Lesson 2.0," which posits that researchers should favor methods that can benefit from improvements in foundation models. SayCan and Inner Monologue are prime examples, as their performance is directly tied to the quality of the underlying LLMs and VLMs, which are improving rapidly through work done outside robotics.
- **Language as the Universal API:** The central theme is using natural language as the glue connecting different components (human intent, high-level planning, low-level skills, visual feedback). This makes the system more modular and allows each part to be improved independently.

### Current Research Directions Mentioned

The lecture hints at several ongoing and future research directions:

1.  **Scaling Data Collection:** The current bottleneck is the acquisition of large, diverse, real-world robotics data. While DIAL helps get more out of existing data, the ultimate goal is to find cheaper, more autonomous ways to collect new data beyond human teleoperation.
2.  **Overcoming Inference Speed Bottlenecks:** RT-1 was constrained to 35M parameters to run at 3Hz. A major open question is how to run much larger, more capable foundation models on real robots without sacrificing reactivity. This might involve techniques like model distillation, quantization, or new hardware.
3.  **Bringing RL Back In:** The current approach is based on imitation learning. A future step is to use this strong imitation policy as a starting point for further improvement with offline reinforcement learning, allowing the robot to learn from autonomous experience and surpass the initial human demonstrators.
4.  **Improving Long-Context Understanding:** The current RT-1 model uses a history of only 6 images (about 2 seconds at 3 Hz). For tasks that take minutes, this is insufficient. Future work will need to explore architectures that can handle much longer contexts, perhaps using retrieval-based methods or recurrent state management.
5.  **Embodied Action Improving LLMs:** The relationship could be symbiotic. While foundation models help robots, large-scale datasets of embodied, causal interactions from robots could be used to improve the reasoning and grounding of future LLMs.
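
The figures quoted in items 2 and 4 imply a tight budget. The calculation below is a back-of-the-envelope sketch that uses only those figures (3 Hz control, 6-image history) plus an assumed 2-minute task length chosen purely for illustration.

```python
CONTROL_HZ = 3        # control frequency quoted for RT-1
HISTORY_IMAGES = 6    # image history length quoted for RT-1
TASK_SECONDS = 120    # assumption: a 2-minute task, for illustration only

# Time available for one forward pass if the policy must act at CONTROL_HZ.
latency_budget_ms = 1000 / CONTROL_HZ

# Wall-clock time covered by the image history.
history_seconds = HISTORY_IMAGES / CONTROL_HZ

# Frames the model would need to attend to in order to see the whole task.
frames_for_full_task = TASK_SECONDS * CONTROL_HZ
context_growth = frames_for_full_task / HISTORY_IMAGES

print(f"latency budget per step: {latency_budget_ms:.1f} ms")
print(f"history window:          {history_seconds:.1f} s")
print(f"frames for full task:    {frames_for_full_task}")
print(f"context growth factor:   {context_growth:.0f}x")
```

Under these assumptions each inference step has about 333 ms, and covering the whole task would take 360 frames, which is 60 times the current history length. This is why longer context and faster inference pull against each other.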

### Practical Applications

While still in the research phase, the practical application driving this work is the development of general-purpose robots that can operate usefully in human environments. The choice of the "micro-kitchen" domain is intentional, as it contains a wide variety of objects, tasks, and semantic concepts relevant to everyday human life.

The ultimate goal is to move beyond single-task, factory-floor robots to create assistants that can understand high-level, ambiguous natural language commands and execute them robustly in the chaotic real world. The projects discussed are concrete steps toward building the foundational components—robust perception, scalable skill learning, and common-sense planning—required for such an application.