# MU5EEH15: Interactive Robot Learning

**Objective**: Learn how to program interactive robot learning in `Python`.
- Machine learning / Human Robot Interaction (HRI)
- Reinforcement learning (rewards, human feedback)
- Supervised learning
- Imitation learning

**Organization**: Lectures and Practical Labs **(TP 40%)** + Final exams **(60%)**

**Evaluation**: C++ Programming exam **(30%)** + ROS Driver Design exam **(30%)**

**Teacher**: Mohamed CHETOUANI - mail: mohamed.chetouani@sorbonne-universite.fr

___

# Table of Contents
- [Organization](#organization)
- [Introduction](#introduction)
- [Machine Learning (ML) and Interactive ML](#machine-learning-ml-and-interactive-ml)
- [Learning to Summarize from Human Feedback](#learning-to-summarize-from-human-feedback)
- [Agents Learning from Human Teachers](#agents-learning-from-human-teachers)
- [Robot Social Learning](#robot-social-learning)
  - [Social Interaction Between Learner and Tutor](#social-interaction-between-learner-and-tutor)
  - [Beliefs, Desires, and Intentions (BDI) in Interactive Learning](#beliefs-desires-and-intentions-bdi-in-interactive-learning)
  - [Ostensive Signals (Attention)](#ostensive-signals-attention)
  - [Communication in Action](#communication-in-action)
- [Humans in the Machine Learning Process](#humans-in-the-machine-learning-process)
  - [a) Standard Imitation Learning](#a-standard-imitation-learning)
  - [b) Evaluative Feedback](#b-evaluative-feedback)
  - [c) Imitation from Observation](#c-imitation-from-observation)
  - [d) Learning Attention from Humans](#d-learning-attention-from-humans)
  - [e) Learning from Human Preference](#e-learning-from-human-preference)
  - [f) Hierarchical Imitation](#f-hierarchical-imitation)
- [Teaching and Learning Costs](#teaching-and-learning-costs)
  - [A) Imitation](#a-imitation)
  - [B) Feedback](#b-feedback)
- [Interactive Robot Learning Paradigm](#interactive-robot-learning-paradigm)
- [Human Strategies: Example of graspable objects](#human-strategies)
- [Conclusion](#conclusion)

# Organization

- Introduction to Interactive Machine Learning
- Strategies to teach machines: How do humans teach machines?
- Learning from evaluative feedback and/or demonstrations
- Open challenges in Interactive Robot Learning
___

# Introduction

*Don't hesitate to take a look at the document [Open PDF](InteractiveRobotLearning.pdf)...*

___

# Machine Learning (ML) and Interactive ML

For Machine Learning, we usually need the following:
- Define the **features** (the inputs the machine bases its choices on),
- Define the **metrics** (accuracy, precision, recall, F1-score…),
- Get to **know**/**understand** the problem (ask the users: data, insights, labels…),
- Choose the **design/algorithm** (random forest, support vector machine, k-means, etc.),
- In neural networks:
  - Define the **architecture** (layers, neurons, activation functions),
  - Specify the **loss function** and optimization method,
  - Use **gradient-based optimization** (e.g., gradient descent, backpropagation) to adjust weights,
  - Train with **labeled data** until convergence.
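
The metrics named in the list above can be computed from a confusion count. The snippet below is a minimal sketch for binary labels; the function name `classification_metrics` and the toy label lists are illustrative assumptions, not part of the course material.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example (made-up labels): 3 TP, 3 TN, 1 FP, 1 FN.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```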

With Interactive Machine Learning,
- We can, as humans, directly **teach the robot** through **feedback** (reinforcement, corrections, demonstrations),
- The learning is **incremental and adaptive**, guided by user input instead of only large offline datasets,
- Adjustments can also rely on **gradient-based updates**, but influenced by **human feedback** rather than only a static loss function,
- Less **data** is needed because human feedback is **targeted and informative**: it corrects mistakes directly and guides the model toward the right solution without requiring thousands of redundant examples.
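
As a rough illustration of incremental learning driven by corrections, the sketch below uses a perceptron-style learner that is updated only when a simulated human corrects a wrong prediction. The teacher rule `human_label`, the data stream, and the learning rate are all assumptions made for the example.

```python
def predict(weights, bias, x):
    """Linear classifier: returns 1 when the score is non-negative, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else 0


def interactive_update(weights, bias, x, correction, lr=0.5):
    """Perceptron-style step toward the label supplied by the human."""
    error = correction - predict(weights, bias, x)  # -1, 0 or +1
    new_weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return new_weights, bias + lr * error


def human_label(x):
    """Hypothetical teacher: the correct label is 1 when x[0] > x[1]."""
    return 1 if x[0] > x[1] else 0


stream = [(2.0, 1.0), (1.0, 3.0), (4.0, 0.5), (0.5, 2.0), (3.0, 1.0), (1.0, 2.5)]
weights, bias = [0.0, 0.0], 0.0
corrections = 0

for _ in range(10):  # several passes over the same interaction stream
    for x in stream:
        label = human_label(x)
        if predict(weights, bias, x) != label:  # the human only speaks up on mistakes
            weights, bias = interactive_update(weights, bias, x, label)
            corrections += 1

print(f"weights={weights}, bias={bias}, corrections={corrections}")
```

The learner is only updated when it makes a mistake, which is one simple way to read "targeted" feedback: the human's effort is spent exactly where the model is wrong.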

___

# Learning to Summarize from Human Feedback

#### 1. Sample Selection
- Humans typically provide a **large set of samples** (examples, demonstrations, corrections).
- The robot learns to identify and generate a **smaller set of preferred samples**, focusing on the most relevant or useful ones.
- This process is known as **Preference Learning**.

#### 2. Training with Human Preferences
- Training involves concepts like **loss functions**, **entropy**, and **reward signals**.
- The robot may be given two candidate summaries, and a human indicates **which one is better**.
- This resembles supervised learning, but only for the **preference selection step**, not the entire learning process.

#### 3. Design
- The model architecture is typically a **neural network** capable of scoring or ranking candidate summaries.
- The **loss function** is designed to **maximize the preference agreement** with human feedback.
- Training may include **gradient-based updates** guided by the reward signal from human preferences.
- The system is designed to **generalize from few examples**, leveraging the targeted human feedback efficiently.
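
One common way to turn pairwise human choices into a loss is a Bradley-Terry style objective, $-\log \sigma(r_{\text{preferred}} - r_{\text{rejected}})$. The sketch below assumes this objective and a linear scoring model instead of a neural network; the two features per summary and the comparison data are invented for illustration.

```python
import math


def score(weights, features):
    """Linear reward model: a stand-in for the neural scorer."""
    return sum(w * f for w, f in zip(weights, features))


def preference_probability(score_a, score_b):
    """Bradley-Terry probability that candidate A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def preference_loss(score_preferred, score_rejected):
    """Negative log-likelihood of the human's choice."""
    return -math.log(preference_probability(score_preferred, score_rejected))


# Each item: (features of the preferred summary, features of the rejected one).
# Hypothetical features: (coverage of the source, amount of redundancy).
comparisons = [
    ((0.9, 0.2), (0.4, 0.8)),
    ((0.7, 0.1), (0.6, 0.9)),
    ((0.8, 0.3), (0.2, 0.4)),
]

weights = [0.0, 0.0]
lr = 0.5
for _ in range(200):
    for preferred, rejected in comparisons:
        p = preference_probability(score(weights, preferred), score(weights, rejected))
        # Gradient of the loss w.r.t. the weights is -(1 - p) * (preferred - rejected).
        for i in range(len(weights)):
            weights[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])

mean_loss = sum(
    preference_loss(score(weights, a), score(weights, b)) for a, b in comparisons
) / len(comparisons)
print(f"weights={weights}, mean preference loss={mean_loss:.4f}")
```

After training, the preferred summary of every pair receives the higher score, which is what "maximizing preference agreement" means in this simplified setting.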

___

# Agents Learning from Human Teachers

**Objective:** Learning from a human teacher → Interactive task learning

Interactive task learning refers to any process by which an agent learns:
1. to communicate about a task, and
2. to perform a task through natural interaction with a human.

**Focus on teaching signals:**
- Human teaching signals are **multimodal cues** communicated through different **forms**, such as feedback, demonstrations, or instructions.
- These signals can be more or less informative, being **explicit** (clear choice) or **implicit**, and aim to **intentionally** shape the agent's behavior.
- How can an agent make sense of all these diverse teaching signals?
- Flow: **Human teacher → teaching signals → learning agent → feedback to human teacher**

___


# Robot Social Learning

## Social Interaction Between Learner and Tutor
- Learning occurs through **interaction between the learner (robot) and the tutor (human)**.
- Signals can be divided into two main channels:
  1. **Task-channel signals** – communication about the **task itself**, such as goals, actions, or outcomes.
  2. **Social-channel signals** – communication about the **social context**, such as encouragement, attention, or guidance.
- Task-channel signals help in **communicating the robot's understanding and performance of the task**.
- Social-channel signals help the robot interpret **intentions and expectations of the human tutor**.

## Beliefs, Desires, and Intentions (BDI) in Interactive Learning
- **Beliefs (hypotheses):** The robot’s understanding of the world, including task states and human cues.
- **Desires (goals):** The objectives or goals the robot wants to achieve (aligned with the human tutor’s goals).
- **Intentions (plans):** The specific actions the robot commits to in order to achieve its goals (*complex for humans*).
- The robot can **reason about human instructions, anticipate outcomes, and plan actions** more effectively within both task and social channels.

## Ostensive Signals (Attention)

- Ostensive signals are cues that indicate **communication is intended for the learner**.
- In humans, studies show that the **infant brain responds strongly to dynamic mutual and averted gaze stimuli**, which signal attention and intention from the caregiver.
- These signals help the learner **detect relevant information and understand the teacher’s intention**, forming the basis for social learning.

## Communication in Action

- An **actor** (human or robot) intends not just to **perform an action**, but also to **convey information about the action**.
- This involves two complementary processes:
  1. **Planning and Acting (A → C):** Deciding and executing the action while embedding communicative intent.
  2. **Inference (B → C):** Observers infer the actor's intention or meaning from the action.

___



# Humans in the Machine Learning Process

*Slide 16 [Open PDF: slides](InteractiveRobotLearning_slides.pdf)*

- Humans play a central role in **guiding and supervising the learning process**.
- Key contributions include:
  - **Data collection and labeling** – providing high-quality datasets for training.
  - **Feature selection and engineering** – helping define what information the machine should focus on.
  - **Designing reward or feedback signals** – enabling interactive or reinforcement learning.
  - **Interpreting results and correcting errors** – guiding the model toward better performance.
- In **Interactive Machine Learning**, humans act as **teachers**, giving targeted feedback, demonstrations, or preferences to shape learning efficiently.

#### a) Standard Imitation Learning
- The robot **learns by observing human demonstrations** of a task.
- It tries to **replicate the demonstrated behavior** without explicit programming.
- Useful when **explicit reward functions are difficult to define**.
- Focuses on **copying correct actions** rather than evaluating performance.
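
A very small reading of standard imitation learning is tabular behavior cloning: for each state, copy the action the demonstrator chose most often. The state and action names below are hypothetical, and real systems would typically fit a function approximator rather than a lookup table.

```python
from collections import Counter, defaultdict


def behavior_cloning(demonstrations):
    """Build a policy that copies the most frequent demonstrated action per state."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return {state: actions.most_common(1)[0][0] for state, actions in counts.items()}


# Hypothetical demonstrations as (state, action) pairs; one of them is noisy.
demonstrations = [
    ("far", "approach"),
    ("far", "approach"),
    ("near", "grasp"),
    ("holding", "lift"),
    ("near", "grasp"),
    ("far", "wait"),  # a demonstration error, outvoted by the other two
]

policy = behavior_cloning(demonstrations)
print(policy)
print(policy.get("dropped"))  # unseen state: the cloned policy has no answer
```

The last line shows a known limit of pure copying: the policy says nothing about states that never appeared in the demonstrations.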

#### b) Evaluative Feedback
- The robot improves based on **human-provided evaluations** of its actions.
- Feedback can indicate **good/bad performance** or **how well the task was done**.
- Enables **incremental and interactive learning**, guiding the robot toward better behavior.
- Often combined with **preference learning** or **gradient-based updates** for efficiency.
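
The idea can be sketched as a table $H(s,a)$ of predicted human feedback that the robot updates after each evaluation and then acts greedily on. This is only an illustrative scheme under stated assumptions: the simulated teacher, the action names, and the step size `alpha` are all made up for the example.

```python
from collections import defaultdict

ACTIONS = ["wait", "push", "grasp"]


def simulated_human(state, action):
    """Hypothetical teacher: approves grasping when the object is near."""
    return 1.0 if (state, action) == ("near", "grasp") else -1.0


def greedy_action(H, state):
    """Pick the action with the highest predicted human feedback."""
    return max(ACTIONS, key=lambda a: H[(state, a)])


def update(H, state, action, feedback, alpha=0.5):
    """Move the estimate H(s, a) toward the feedback just received."""
    H[(state, action)] += alpha * (feedback - H[(state, action)])


H = defaultdict(float)  # all estimates start at 0.0
history = []
for _ in range(5):
    action = greedy_action(H, "near")
    feedback = simulated_human("near", action)
    update(H, "near", action, feedback)
    history.append(action)

print(history)
print(dict(H))
```

Negative feedback pushes the robot away from the actions it tried first, and positive feedback then locks in the approved action.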

#### c) Imitation from Observation
- The robot **learns by passively observing human actions** without explicit instructions.
- Focuses on **extracting patterns and goals** from observed behavior.
- Useful when **direct demonstrations or guidance are limited**.
- Enables the robot to **generalize actions** in similar contexts.

#### d) Learning Attention from Humans
- The robot **learns where to focus** by observing **human gaze, gestures, or cues**.
- Helps the robot **identify relevant parts of the environment or task**.
- Critical for **social learning** and improving **task performance efficiency**.
- Can be combined with imitation or feedback to **guide learning priorities**.

#### e) Learning from Human Preference
- The robot **learns by comparing alternatives** based on human preferences.
- Humans indicate **which option is better** among multiple candidate actions or outputs.
- Enables **efficient learning from fewer examples** because feedback is **targeted and informative**.
- Often combined with **preference learning algorithms** and **gradient-based updates**.

#### f) Hierarchical Imitation
- The robot **learns complex tasks by decomposing them into sub-tasks**.
- Imitates human demonstrations at **different levels of abstraction** (high-level goals and low-level actions).
- Useful for **structured or multi-step tasks**, allowing **modular learning and generalization**.
- Can integrate with **task planning** and **BDI frameworks** for more efficient execution.

___

# Teaching and Learning Costs

#### A) Imitation
- Cost is mainly on the **human teacher**, who must provide demonstrations of the task.
- The robot learns **passively**, so less human intervention is needed during learning.
- **Challenges:** Requires high-quality demonstrations; errors in demonstration can propagate to the robot.

#### B) Feedback
- Cost is distributed between **human teacher and robot**.
- Humans provide **evaluative or corrective signals** during learning, which may require ongoing attention.
- The robot can **learn incrementally**, reducing the total number of required examples.
- **Challenges:** Feedback must be **consistent and informative**; poorly timed or ambiguous feedback can slow learning.

___

# Interactive Robot Learning Paradigm

**Table 1.** Description of main Human Teaching Strategies. Robot action is performed at time-step $t$. A teaching signal is the physical support of the strategy using social and/or task channels.

| Category          | Property          | Feedback             | Demonstration                          | Instruction             |
|-------------------|-------------------|----------------------|----------------------------------------|-------------------------|
| **Nature**        |   Notation        | $H(s,a)$             | $D=\{(s_t,a_t^*), (s_{t+1},a_{t+1}^*), \dots\}$ | $I_\pi(s) = a_t^*$ |
| **Nature**        |   Value           | Binary / Scalar      | State-Action pairs                     | Probability of an action |
| **Time-step**     |   $t-1$           |                      | ✓                                      | ✓                       |
| **Time-step**     |   $t$             |                      | ✓                                      |                         |
| **Time-step**     |   $t+1$           | ✓                    |                                        |                         |
| **Human**         |   Intention       | Evaluating or Correcting | Showing                            | Telling                 |
| **Human**         |   Teaching cost   | Low                  | High                                   | Medium                  |
| **Robot**         |   Interpretation  | State-Action evaluation (Reward-/Value-like) | Optimal actions (Policy-like) | Optimal action |
| **Robot**         |   Learning cost   | High                 | Low                                    | High                    |

# Human Strategies

Humans can adopt different strategies when teaching a robot or machine:

- **Optimal teaching (boundary strategy):** Provide examples **closest to the decision boundary** to maximize learning efficiency.
- **Extreme strategy:** Present examples **from easy to hard** or only the extremes of the spectrum.

#### Example: Graspable Objects

**Objective:** Classify objects as **graspable** or **not graspable**.

**Three typical teaching behaviors:**
1. **Extremes:** Only the most clear-cut examples; e.g., “this is very graspable” vs. “this is absolutely not graspable.” *Can be too binary...*
2. **Linear:** Examples are presented **gradually from not graspable to graspable**, covering the spectrum. *High teaching cost...*
3. **Positive only:** Only **graspable examples** are shown; the robot assumes everything else is non-graspable. *Omitting negative examples can mislead the learner...*
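
To see how the choice of examples changes what is learned, the sketch below reduces the problem to a single hypothetical feature (object width in cm) and a learner that places its decision threshold midway between the widest graspable and the narrowest non-graspable example. The true limit of 8.0 cm, the example widths, and the learner itself are assumptions for illustration only.

```python
def learn_threshold(examples):
    """Estimate the largest graspable width from (width_cm, is_graspable) pairs."""
    positives = [w for w, graspable in examples if graspable]
    negatives = [w for w, graspable in examples if not graspable]
    if not negatives:
        # Positive-only teaching: anything wider than the widest positive is rejected.
        return max(positives)
    # Otherwise place the boundary halfway between the two classes.
    return (max(positives) + min(negatives)) / 2


TRUE_LIMIT = 8.0  # hypothetical ground truth, used only to label the examples


def label(widths):
    return [(w, w <= TRUE_LIMIT) for w in widths]


strategies = {
    "extremes": label([1.0, 20.0]),
    "linear": label([2.0, 5.0, 8.0, 11.0, 14.0]),
    "positive_only": label([1.0, 4.0, 7.5]),
    "boundary": label([7.5, 8.5]),
}

results = {name: (learn_threshold(ex), len(ex)) for name, ex in strategies.items()}
for name, (threshold, cost) in results.items():
    print(f"{name:14s} threshold={threshold:5.2f} cm  examples={cost}")
```

In this toy setting, two examples near the boundary recover the true limit, while the extremes strategy is cheap but imprecise and the linear strategy costs more examples.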


# Autonomous and Interactive Learning

- **Goal:** Combine the strengths of
  - **Autonomous learning:** the robot explores the environment on its own using **trial-and-error** to maximize rewards.
  - **Interactive learning:** the robot learns from **human-provided guidance**, such as feedback, demonstrations, or preferences.

#### Formal Framework: Markov Decision Process (MDP)

**State, Action, Transition, Reward** (SATR) elements: 
- **State and Action Spaces:**
  - $s \in S$: the state belongs to the state space $S$
  - $a \in A$: the action belongs to the action space $A$
- **State-Action Mapping (Transition Function):**
  - $T: S \times A \to S$: maps a **state-action pair** $(s, a)$ to the **next state** $s'$ (deterministic case).
- **Reward Function:**
  - $R: S \times A \to \mathbb{R}$: assigns a **reward** for taking action $a$ in state $s$.
- **Policy:**
  - $\pi(s)$: defines the **action to take** in each state.
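
The four elements and the policy can be written down directly for a toy problem. The sketch below assumes a hypothetical four-cell corridor with a goal in the last cell and a deterministic transition function; none of these specifics come from the course slides.

```python
STATES = [0, 1, 2, 3]        # state space S: positions in a corridor
ACTIONS = ["left", "right"]  # action space A
GOAL = 3


def transition(state, action):
    """T(s, a): deterministic next state, clipped to the corridor."""
    step = 1 if action == "right" else -1
    return min(max(state + step, 0), GOAL)


def reward(state, action):
    """R(s, a): 1.0 for the move that enters the goal, 0.0 otherwise."""
    return 1.0 if state != GOAL and transition(state, action) == GOAL else 0.0


def policy(state):
    """A fixed policy for illustration: always move right."""
    return "right"


def rollout(start, max_steps=10):
    """Follow the policy from `start` and return the trajectory and total reward."""
    state, total, trajectory = start, 0.0, [start]
    for _ in range(max_steps):
        if state == GOAL:
            break
        action = policy(state)
        total += reward(state, action)
        state = transition(state, action)
        trajectory.append(state)
    return trajectory, total


trajectory, total_reward = rollout(0)
print(trajectory, total_reward)
```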

By **combining autonomous exploration with human guidance**, the robot can:
- Explore the environment **on its own**.
- Incorporate **targeted human input** to improve learning speed and accuracy.
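
One possible way to combine the two is to add a human feedback term to the environment reward inside a standard Q-learning update. The sketch below is an assumption-laden illustration, not the scheme from the slides: the corridor environment, the simulated teacher `human_feedback`, and the weight `beta` are all hypothetical.

```python
import random

ACTIONS = ["left", "right"]
GOAL = 3


def step(state, action):
    """Deterministic corridor: returns (next_state, environment reward)."""
    next_state = min(max(state + (1 if action == "right" else -1), 0), GOAL)
    return next_state, (1.0 if next_state == GOAL else 0.0)


def human_feedback(state, action):
    """Simulated teacher (assumption): approves moves toward the goal."""
    return 1.0 if action == "right" else -1.0


def q_learning(beta, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning where the reward is shaped by beta * human feedback."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(50):  # cap on episode length
            if state == GOAL:
                break
            if rng.random() < epsilon:  # autonomous exploration
                action = rng.choice(ACTIONS)
            else:                       # exploit current estimates
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, env_reward = step(state, action)
            shaped = env_reward + beta * human_feedback(state, action)
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (shaped + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q


Q = q_learning(beta=0.5)
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(greedy)
```

Setting `beta=0.0` recovers plain Q-learning on the environment reward alone, so the same function can be used to compare autonomous and human-guided learning.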

#### *The rest of the resources are available here:*
- *Mathematical formulas slide 26 [Open PDF: slides](InteractiveRobotLearning_slides.pdf)*
- *Q-learning scheme slide 27 [Open PDF: slides](InteractiveRobotLearning_slides.pdf)*
- *Shaping with evaluative feedback [Open PDF: slides](InteractiveRobotLearning_slides.pdf)*