# MU5EEH15: Interactive Robot Learning

Objective: Learn how to program interactive robot learning in `Python`.
- Machine learning / Human Robot Interaction (HRI)
- Reinforcement learning (rewards, human feedback)
- Supervised learning
- Imitation learning

**Organization**: Lectures and Practical Labs **(TP 40%)** + Final project **(40%)** and exam **(20%)**

**Lecturers**: 
- Silvia TULLI - mail: silvia.tulli@sorbonne-universite.fr
- Hamed RAHIMI - mail: hamed.rahimi@sorbonne-universite.fr

**Student**: William WU - mail: william.wu@etu.sorbonne-universite.fr

___

# Learning goals

By the end of this lecture, you should be able to:
- Explain each of the key ingredients of imitation learning (demonstrations, environment, policy class, loss function, learning algorithm)
- Define different imitation learning techniques, their strengths and limitations
- Define state visitation frequency, occupancy frequency and distributional shift
- Explain the underlying differences between direct policy learning vs. Inverse Reinforcement Learning (IRL)
- Explain different reward function representations in IRL
- Define the main problems and solutions related to IRL:
    - reward function ambiguity (degeneracy)
    - expert suboptimality
    - computational intensiveness
- Master the following three learning algorithms:
    - Behavioral Cloning (BC)
    - DAgger
    - Apprenticeship Learning IRL

___

# 1) Behavioral Cloning Algorithm

Imitation learning (IL) formulated as a standard machine learning problem:
- Fix a policy class
    - e.g., support vector machine, neural network, decision tree, deep belief net...
- Estimate a policy (= mapping from states to actions) from the training examples

$D = \{(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots, (s_n, a_n)\}$
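As a minimal sketch of this recipe, the snippet below fixes a very simple policy class (1-nearest-neighbour lookup) and estimates a policy from a handful of state-action pairs. The demonstrations, the 1-D "lateral offset" state and the helper name `fit_nearest_neighbour` are assumptions made for illustration only.

```python
# Hypothetical demonstrations D: state = lateral offset from the lane centre,
# action = steering command chosen by the expert.
demos = [(-1.0, "right"), (-0.4, "right"), (0.0, "straight"),
         (0.3, "left"), (0.9, "left")]


def fit_nearest_neighbour(dataset):
    """Policy class: return the action of the closest demonstrated state."""
    def policy(state):
        _, action = min(dataset, key=lambda pair: abs(pair[0] - state))
        return action
    return policy


bc_policy = fit_nearest_neighbour(demos)
print(bc_policy(0.05))   # query close to the demonstrated state 0.0
print(bc_policy(2.0))    # query far outside the demonstrated range
```

The second query already hints at the weakness discussed next: the policy still answers for states far from anything the expert visited, but nothing in $D$ supports that answer.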

### Distributional Shift

A common assumption is that the train and test sets are independent and identically distributed (i.i.d.).  
However, the learned policy does not see the same observations as the expert: $p_{\pi^*}(O_t) \neq p_{\pi_{\theta}}(O_t)$

#### Why does it matter ?

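- At each step, the BC policy has error rate $\epsilon$ (e.g., 5% mistakes)
- After $T$ steps, the expected number of mistakes grows as $O(\epsilon T^2)$ in the worst case, because one mistake leads to unseen states where further mistakes are more likely
- Example: 1% error rate over 100 steps ≈ 63% chance of failure, counting any single mistake as a failure ($1 - 0.99^{100} \approx 0.63$)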
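The 63% figure can be checked with a few lines of arithmetic. This sketch assumes mistakes are independent across steps and that a single mistake counts as failure; the function name is illustrative.

```python
def prob_at_least_one_mistake(eps, horizon):
    """Probability of making one or more mistakes in `horizon` independent steps."""
    return 1.0 - (1.0 - eps) ** horizon


p_fail = prob_at_least_one_mistake(eps=0.01, horizon=100)
print(f"{p_fail:.3f}")  # 0.634
```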

#### Why i.i.d. breaks:

- Supervised learning: test data ~ training data
- Sequential decisions: $p(s_t|\pi) \neq p(s_t|\pi^*)$ after errors

#### Behavioral Cloning (supervised learning):

$ \arg \min_{\theta} \mathbb{E}_{(s,a^*) \sim P^*} L(a^*, \pi_{\theta}(s)) $ : Distribution provided exogenously

#### (General) Imitation Learning:

$ \arg \min_{\theta} \mathbb{E}_{s \sim P(s|\theta)} L(\pi^*(s), \pi_{\theta}(s)) $ : Distribution depends on the rollout, where $P(s|\theta)$ is the state distribution induced by $\pi_{\theta}$

#### Types of Imitation Learning

| Method | Direct Policy Learning | Reward Learning | Access to Environment | Interactive Demonstrator | Pre-collected Demonstrations |
|--------|------------------------|-----------------|----------------------|-------------------------|-----------------------------|
| Behavioral Cloning | Yes | No | No | No | Yes |
| Direct Policy Learning (Interactive IL) | Yes | No | Yes | Yes | Optional |
| Inverse Reinforcement Learning | No | Yes | Yes | No | Yes |



# 2) DAgger Algorithm

Initialize $D \leftarrow \emptyset $.  
Initialize $\hat{\pi}_1$ to any policy in $\Pi$.  

**for** $i = 1 $ **to** $N $ **do**  
 Let $\pi_i = \beta_i \pi^* + (1 - \beta_i) \hat{\pi}_i$.  
 Sample $T $-step trajectories using $\pi_i$.  
 Get dataset $D_i = \{(s, \pi^*(s))\} $ of states visited by $\pi_i$, labeled with the actions given by the expert.  
 Aggregate datasets: $D \leftarrow D \cup D_i $.  
 Train classifier $\hat{\pi}_{i+1}$ on $D $.  
**end for**  

Return best $\hat{\pi}_i$ on validation.
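The loop above can be sketched in plain Python on a toy lane-keeping problem. Everything concrete here is an assumption for illustration: the `expert` rule, the `step` dynamics, the 1-nearest-neighbour policy class in `train`, and the schedule $\beta_i = 0.5^{i-1}$. For brevity the sketch returns the last policy instead of selecting the best one on validation.

```python
import random


def expert(s):
    """Hypothetical expert pi*: steer back toward the lane centre s = 0."""
    if s > 0.25:
        return -1
    if s < -0.25:
        return 1
    return 0


def step(s, a, rng):
    """Hypothetical dynamics: the action shifts the car, plus a random drift."""
    return s + 0.5 * a + rng.uniform(-0.3, 0.3)


def train(dataset):
    """Policy class Pi: 1-nearest-neighbour lookup on the aggregated data."""
    def policy(s):
        return min(dataset, key=lambda pair: abs(pair[0] - s))[1]
    return policy


def dagger(n_iters=5, horizon=30, seed=0):
    rng = random.Random(seed)
    data = []                          # D <- empty set
    learner = lambda s: 0              # pi_hat_1: any policy in Pi
    for i in range(1, n_iters + 1):
        beta = 0.5 ** (i - 1)          # beta_1 = 1, then decays
        s = 0.0
        for _ in range(horizon):       # one T-step trajectory under pi_i
            a = expert(s) if rng.random() < beta else learner(s)
            data.append((s, expert(s)))    # visited state, expert's action
            s = step(s, a, rng)
        learner = train(data)          # pi_hat_{i+1} trained on all of D
    return learner, data


dagger_policy, dagger_data = dagger()
print(len(dagger_data))  # 150 labeled states (5 iterations x 30 steps)
```

Note that the states are generated by the mixed policy, while the labels always come from the expert. That is what lets the dataset cover the learner's own mistakes.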

#### Interactive Expert

- Can query expert at any state
- Construct loss function: $ L(\pi^*(s), \pi(s)) $
- Typically applied to rollout trajectories of policies we are training: $ s \sim P(s | \pi) $
- Driving example: $ L(\pi^*(s), \pi(s)) = (\pi^*(s) - \pi(s))^2 $

#### BC vs Interactive IL: Concrete example

| Method | Training Data | What Happens |
|--------|---------------|--------------|
| Behavioral Cloning | Only center-of-lane images ("steer straight") | Slight drift → edge of lane → NO DATA for recovery → crash |
| Interactive IL (DAgger) | Center images + near-edge images (from rollout) → expert corrections | Slight drift → have data for "steer back to center" → recovers |

#### When to Use Which Method

| Criterion | Behavioral Cloning | DAgger | Apprenticeship IRL |
|-----------|-------------------|--------|-------------------|
| Expert availability | Offline dataset only | Must be interactive | Offline dataset |
| Environment access | Not needed | Required | Required |
| Best for | Large datasets, deterministic tasks | Critical applications | Transfer, understanding intent |
| Computational cost | Low (one-time supervised learning) | Medium (iterative) | High (RL in inner loop) |
| Sample efficiency | Poor (needs many demos) | Good (targeted queries) | Moderate |
| Handles suboptimality | No | No | Yes (with modifications) |
| Transfer to new tasks | Poor | Poor | Good |

- BC: Large expert dataset, deterministic environments, offline learning
- DAgger: Access to expert during training, expensive but effective
- IRL: Transfer to new environments, expert suboptimality, understanding intent matters

# 3) Apprenticeship Learning via IRL

<img src="picture_apprenticeship.png">

#### Problem setup

**Input:**
- State space $S$, action space $A$
- Transition model: $P_{sa} (s_{t+1} | s_t, a_t)$
- **MDP with no reward function**

**Teacher's demonstration:**
$ D = \{(s_0, a_0), (s_1, a_1), (s_2, a_2), \cdots, (s_n, a_n)\} $

Trace of the teacher's policy $\pi^*$, which maps states to distributions over actions:
$ \pi^* : s \mapsto P(a \mid s) $

___


### Three questions

- **Behavioral Cloning or Direct Imitation:**  
  Can we directly learn the teacher's policy using supervised learning?

- **Inverse RL:**  
  Can we recover $R$ ?

- **Apprenticeship learning via inverse RL:**  
  Can we then use this $R$ to find a good policy?

#### Problem setup

**Rollout:** Starting from an initial state $s_0$, sequentially execute $\pi$ to produce a trajectory:
$ \tau = s_0, a_0, s_1, a_1, \ldots $

#### $P(\tau | \pi)$: Distribution of trajectories induced by a policy

1. Sample $s_0$ from $P_0$ (distribution over initial states) and set $t = 0$
2. Sample action $a_t$ from $\pi(s_t)$
3. Sample next state $s_{t+1}$ from applying $a_t$ to $s_t$ (requires access to environment)
4. Repeat from step 2 with $t = t + 1$
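The sampling procedure can be written as a short generic function. The corridor environment, the 80/20 policy and all function names below are hypothetical stand-ins; only the loop structure mirrors the steps above.

```python
import random


def rollout(policy, sample_initial_state, transition, horizon, rng):
    """Sample a trajectory [(s_0, a_0), (s_1, a_1), ...] from P(tau | pi)."""
    s = sample_initial_state(rng)        # s_0 ~ P_0
    trajectory = []
    for _ in range(horizon):
        a = policy(s, rng)               # a_t ~ pi(s_t)
        trajectory.append((s, a))
        s = transition(s, a, rng)        # s_{t+1}: needs environment access
    return trajectory


# Hypothetical 5-cell corridor: states 0..4, actions -1 (left) / +1 (right).
def corridor_start(rng):
    return rng.choice([0, 1])


def mostly_right(s, rng):
    return 1 if rng.random() < 0.8 else -1


def corridor_step(s, a, rng):
    return min(4, max(0, s + a))


tau = rollout(mostly_right, corridor_start, corridor_step,
              horizon=10, rng=random.Random(0))
print(tau)
```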

#### $P(s | \pi)$: Distribution of states induced by a policy

1. Let $P_t(s|\pi)$ denote the distribution over the state at time step $t$
2. Average over a horizon of $T$ steps: $ P(s | \pi) = \frac{1}{T} \sum_{t=0}^{T-1} P_t(s | \pi) $
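In practice $P(s|\pi)$ is rarely available in closed form, but it can be approximated by counting state visits over many rollouts. The sketch below does this for a hypothetical corridor policy; the environment and the helper names are assumptions, not part of the lecture.

```python
import random
from collections import Counter


def estimate_state_distribution(sample_states, n_rollouts, rng):
    """Monte Carlo estimate of P(s | pi): visit frequency pooled over all time steps."""
    counts = Counter()
    total = 0
    for _ in range(n_rollouts):
        states = sample_states(rng)
        counts.update(states)
        total += len(states)
    return {s: c / total for s, c in counts.items()}


def corridor_states(rng, horizon=10):
    """Hypothetical policy on cells 0..4: move right w.p. 0.8, else left."""
    s, visited = 0, []
    for _ in range(horizon):
        visited.append(s)
        step = 1 if rng.random() < 0.8 else -1
        s = min(4, max(0, s + step))
    return visited


p_hat = estimate_state_distribution(corridor_states, 2000, random.Random(0))
for state in sorted(p_hat):
    print(state, round(p_hat[state], 3))
```

Running the same estimator on rollouts from the expert and from the learned policy gives a direct, if crude, way to look at distributional shift.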

#### Challenges
- IRL is an ill-posed (underdetermined) problem: many reward functions explain the same demonstrations
- It is difficult to evaluate a learned reward
- Demonstrations might not be optimal

___

### Behavioral Cloning vs Inverse Control

Behavioral Cloning has no notion of intention:
- Expert suboptimality
- Embodiment (correspondence problem)
- Robustness

Which is the more succinct description of a task:
$ \pi^* \text{ or } R? $

Especially in planning oriented tasks, the reward function is often much more succinct than the optimal policy.

Let $R(s) = w^T \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$

$ E \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \, | \, \pi \right] = E \left[ \sum_{t=0}^{\infty} \gamma^t w^T \phi(s_t) \, | \, \pi \right] $

$ = w^T E \left[ \sum_{t=0}^{\infty} \gamma^t \phi(s_t) \, | \, \pi \right] $

$ = w^T \mu(\pi) $

**Substituting into:** $ E \left[ \sum_{t=0}^{\infty} \gamma^t R^*(s_t) \, | \, \pi^* \right] \geq E \left[ \sum_{t=0}^{\infty} \gamma^t R^*(s_t) \, | \, \pi \right] \quad \forall \pi $

Find $w^*$ such that $w^{*T} \mu(\pi^*) \geq w^{*T} \mu(\pi) \quad \forall \pi$

where $\mu(\pi)$ is the expected cumulative discounted sum of feature values, also called the "feature expectations".
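Feature expectations are usually estimated from sampled trajectories by averaging discounted feature sums. The following sketch assumes state-only trajectories and a hypothetical two-dimensional feature map `phi` on a 1-D corridor with the goal at cell 3.

```python
def feature_expectations(trajectories, phi, gamma):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t) | pi]."""
    n_features = len(phi(trajectories[0][0]))
    mu = [0.0] * n_features
    for states in trajectories:
        for t, s in enumerate(states):
            for k, value in enumerate(phi(s)):
                mu[k] += (gamma ** t) * value
    return [m / len(trajectories) for m in mu]


def phi(s):
    """Hypothetical features: [at_goal, distance_to_goal] with goal at cell 3."""
    return [float(s == 3), float(abs(3 - s))]


expert_trajectories = [[0, 1, 2, 3], [1, 2, 3, 3]]
mu_expert = feature_expectations(expert_trajectories, phi, gamma=0.5)
print(mu_expert)  # [0.25, 3.375]

# Value of the expert under a candidate reward R(s) = w^T phi(s):
w = [1.0, -1.0]
print(sum(wk * mk for wk, mk in zip(w, mu_expert)))  # -3.125
```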

___

### Feature based reward function
Grid Navigation Example:
```cpp
// Feature vector (observable quantities)
φ(s) = [
    distance_to_goal(s),  // Lower is better
    near_obstacle(s),     // Avoid
    velocity(s),          // Efficiency
    smoothness(s)         // Comfort
]

// Weight vector (what matters?): learned weights
w = [-1.0, -5.0, +0.5, +0.2]
```
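To see how these weights rank states, here is a small sketch that evaluates $R(s) = w^T \phi(s)$ for two made-up feature vectors. The feature values are hypothetical; only the weights come from the example above.

```python
def linear_reward(w, features):
    """R(s) = w^T phi(s) for a feature vector phi(s) given as a list."""
    return sum(w_k * f_k for w_k, f_k in zip(w, features))


# Order: [distance_to_goal, near_obstacle, velocity, smoothness]
w = [-1.0, -5.0, 0.5, 0.2]

# Hypothetical states: one farther from the goal but clear of obstacles,
# one closer to the goal but next to an obstacle.
phi_clear = [2.0, 0.0, 1.0, 1.0]
phi_near_obstacle = [1.0, 1.0, 1.0, 1.0]

r_clear = linear_reward(w, phi_clear)
r_near_obstacle = linear_reward(w, phi_near_obstacle)
print(r_clear > r_near_obstacle)  # True: the obstacle penalty dominates
```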

#### Why linear?

- Tractable: Convex optimization
- Interpretable: See what expert values
- Transferable: Same features, new weights for new tasks
- Limited: Can't capture complex preferences

Alternative representations:
- Deep networks: $R(s) = NN(s)$
- State-action: $R(s,a) = w^T \phi(s,a)$
- Trajectory: $R(\tau) = f(s_0,...,s_T)$

___

# Apprenticeship Learning via IRL

$$ E \left[ \sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi^* \right] \geq E \left[ \sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi \right] \quad \forall \pi $$

Let $R(s) = w^\top \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$

Find $w^*$ such that $w^{*\top} \mu(\pi^*) \geq w^{*\top} \mu(\pi) \quad \forall \pi$

For a policy to be guaranteed to perform as well as the expert policy, it suffices that its feature expectations "match" those of the expert.
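A short derivation shows why matching is enough. It assumes the weights are bounded, $\|w\|_2 \leq 1$ (a normalization that is not stated in these notes but is standard in this setting), and that the feature expectations are within $\epsilon$ of each other, $\|\mu(\pi) - \mu(\pi^*)\|_2 \leq \epsilon$. By the Cauchy-Schwarz inequality:

$$ \left| w^\top \mu(\pi) - w^\top \mu(\pi^*) \right| = \left| w^\top \left( \mu(\pi) - \mu(\pi^*) \right) \right| \leq \|w\|_2 \, \|\mu(\pi) - \mu(\pi^*)\|_2 \leq \epsilon $$

So for any reward of the form $R(s) = w^\top \phi(s)$ with bounded weights, the value of $\pi$ is within $\epsilon$ of the expert's value, even though $w^*$ itself is never identified.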

#### Apprenticeship Learning via IRL Algorithm

Let $R(s) = w^T \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$

- Initialize some policy $\pi_0$
- Iterate for $i = 1, 2, 3, \ldots$
  - Guess the reward: find a reward function such that the demonstrator policy maximally outperforms all previously found policies $\pi_0, \ldots, \pi_{i-1}$
  - Find an optimal control policy $\pi_i$ for the current reward function
    - If the expert is suboptimal, pick the best policy in a mixture
  - Exit if the margin $\gamma \leq \epsilon/2$ (here $\gamma$ is the margin by which the expert outperforms the previously found policies, not the discount factor)
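The loop can be sketched end to end on a tiny deterministic corridor. Note the substitution: instead of the max-margin reward guess, this sketch uses the simpler projection update described by Abbeel and Ng (2004), where the reward guess is $w = \mu(\pi^*) - \bar{\mu}$ and the margin is $\|w\|_2$. The corridor, the one-hot features, the truncation horizon `H` and all function names are assumptions for illustration.

```python
GAMMA, N_STATES, H = 0.9, 5, 200     # discount, corridor cells, truncation horizon
ACTIONS = (-1, 1)


def move(s, a):
    return min(N_STATES - 1, max(0, s + a))


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def feature_expectations(policy):
    """mu(pi) for one-hot features phi(s), start state 0, truncated at H steps."""
    mu, s = [0.0] * N_STATES, 0
    for t in range(H):
        mu[s] += GAMMA ** t
        s = move(s, policy[s])
    return mu


def optimal_policy(w):
    """Value iteration for the reward R(s) = w^T phi(s) = w[s]."""
    v = [0.0] * N_STATES
    for _ in range(H):
        v = [w[s] + GAMMA * max(v[move(s, a)] for a in ACTIONS)
             for s in range(N_STATES)]
    return [max(ACTIONS, key=lambda a: v[move(s, a)]) for s in range(N_STATES)]


def apprenticeship_learning(mu_expert, eps=1e-3, max_iters=50):
    policy = [-1] * N_STATES                 # pi_0: arbitrary initial policy
    mu_bar = feature_expectations(policy)
    margin = float("inf")
    for _ in range(max_iters):
        w = [e - m for e, m in zip(mu_expert, mu_bar)]   # guess the reward
        margin = dot(w, w) ** 0.5
        if margin <= eps:                    # expert no longer clearly better
            break
        policy = optimal_policy(w)           # best response to current reward
        d = [m - b for m, b in zip(feature_expectations(policy), mu_bar)]
        if dot(d, d) == 0.0:
            break
        scale = dot(d, w) / dot(d, d)        # project mu_expert onto the line
        mu_bar = [b + scale * x for b, x in zip(mu_bar, d)]
    return policy, margin


mu_expert = feature_expectations([1] * N_STATES)   # expert always moves right
learned_policy, final_margin = apprenticeship_learning(mu_expert)
print(learned_policy)  # [1, 1, 1, 1, 1]
```

In this toy problem the first reward guess already makes "always move right" optimal, so the feature expectations match after one iteration. Harder problems need several iterations and, in general, a mixture of the policies found.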

**Grid Example:** Expert always goes: S → → ↓ → G

All these rewards make this optimal:

1. $R_1(s) = 0$ everywhere (degenerate!)  
2. $R_2(s) = +10$ at goal, 0 elsewhere  
3. $R_3(s) = -\text{distance}$ to goal  
4. $R_4(s) = +100$ at goal, -1 per step  

**Why this matters:** In supervised learning there is one correct label $y = f(x)$ for each input; in IRL there are infinitely many reward functions $R(s)$ consistent with the same demonstrations.

**Degeneracy:** Many reward functions make the expert's behavior optimal, so how do we pick one?
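The ambiguity is easy to reproduce numerically with the linear constraint $w^\top \mu(\pi^*) \geq w^\top \mu(\pi)$. In this sketch the feature expectations and the candidate weight vectors are made up for illustration. Several very different weights all pass the check, including the all-zero reward.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def expert_is_optimal(w, mu_expert, mu_others):
    """Check w^T mu(pi*) >= w^T mu(pi) for every alternative policy."""
    return all(dot(w, mu_expert) >= dot(w, mu) for mu in mu_others)


# Hypothetical feature expectations: [goal_reached, steps_taken]
mu_expert = [1.0, 2.0]
mu_others = [[0.5, 3.0], [0.25, 2.5]]

candidates = {
    "zero reward": [0.0, 0.0],
    "goal only": [1.0, 0.0],
    "goal minus steps": [1.0, -1.0],
    "scaled by 100": [100.0, -100.0],
    "reward for steps": [0.0, 1.0],
}
consistent = {name: expert_is_optimal(w, mu_expert, mu_others)
              for name, w in candidates.items()}
print(consistent)
```

Only the last candidate is ruled out by the demonstrations. The constraint alone cannot choose among the other four.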

#### Solutions to Degeneracy

- Max Margin: Find R that makes expert better than alternatives by maximum margin
- Maximum Entropy: Among the trajectory distributions that match the expert's feature expectations, pick the one with maximum entropy
- Bayesian: Posterior distribution over R given demos
- Pragmatic: Just match feature expectations

