- title: Imitation Learning
- summary: step-by-step practical on Behavioral Cloning (BC)
- author: Silvia TULLI
- feedback and revision: Kim BARAKA, Mohamed CHETOUANI
- teaching assistant: Louis SIMON
- date: 2025-09-23

This notebook contains an excerpt from the **Human-Interactive Robot Learning (HIRL)** educational module.\
For more information check out [our website](https://sites.google.com/view/hirl-education?usp=sharing).


The practical can be done alone or with a colleague. Please add your information below.

Student(s):\
(1) NAME___________________ SURNAME___________________ ID___________________ Course___________________\
(2) NAME___________________ SURNAME___________________ ID___________________ Course___________________



# **Goals**
* Understand and apply Imitation Learning (IL) techniques to simple tasks.
* Implement direct Behavioral Cloning (BC) and a more advanced (interactive) IL algorithm.
* Experiment with parameters such as the number of demonstrations and corrupted demonstrations to analyze their impact on performance.

# **Prerequisites**

* mathematics for machine learning: linear algebra, calculus, probability and statistics
* python programming for data science
* lecture on interactive robot learning, in particular read the chapter Interactive Robot Learning (https://hal.science/hal-04060804/file/ACAI2021_chetouani_author-version.pdf)



# **Imitation learning**

Imitation Learning (IL) is a set of techniques for training a model to mimic an expert's actions from a collection of demonstrations. In this practical, we focus on **offline imitation learning**: we assume a dataset of demonstrations in which each sample corresponds to a state-action pair collected from the expert. In practice, this might mean collecting keystrokes of a human player on an Atari game, controls of a human-driven car, or medical decisions of a doctor along a course of treatment. The goal of an IL algorithm is then to learn a policy (i.e., a full mapping from states to actions) that mimics the expert policy, of which you only have a set of observations.

# **Part 1: Behavioral Cloning (BC)**

Behavioral Cloning is a simple but effective imitation learning technique that involves training a model by learning a direct mapping from states to actions, without any intermediate representation.



We're tackling imitation learning within a specific environment characterized by a discrete Markov Decision Process (MDP) with a fixed time horizon of $T$. There is an expert policy, denoted as $\pi^*$, which provides deterministic actions at each state.

The input of a Behavioral Cloning algorithm is a restricted policy class $\Pi=\{\pi: S \mapsto \Delta(A)\}$ together with a dataset of $M$ expert state-action pairs $(s_i^{\star}, a_i^{\star})$. The output is the policy in $\Pi$ that minimizes a loss function $\ell$ summed over the dataset:

$$
\pi_\theta = \arg \min _{\pi \in \Pi} \sum_{i=1}^{M} \ell\left(\pi, s_i^{\star}, a_i^{\star}\right)
$$

We consider a set of human expert trajectories that follow $\pi^*$, and we aim to create an imitation policy $\pi_\theta$ that replicates these trajectories. The objective is that, for each state, the action chosen by our policy $\pi_\theta$ matches the action chosen by the expert policy $\pi^*$.

In simpler terms, we aim to develop a policy that perfectly imitates the expert's actions in the given environment, making deterministic choices at each state to mimic the expert's behavior precisely.
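
For a small discrete state space, the minimization above has a simple closed form when $\ell$ is the 0-1 loss and $\Pi$ contains every deterministic mapping: for each state, pick the action the expert chose most often. The sketch below illustrates this on a made-up dataset; `demos`, `fit_tabular_bc`, and `zero_one_loss` are illustrative names and not part of the practical's code.

```python
from collections import Counter, defaultdict

def fit_tabular_bc(pairs):
    """Return a dict state -> most frequent expert action (0-1 loss minimizer)."""
    counts = defaultdict(Counter)
    for state, action in pairs:
        counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

def zero_one_loss(policy, pairs):
    """Number of demonstration pairs on which the policy disagrees with the expert."""
    return sum(policy[s] != a for s, a in pairs)

# Hypothetical demonstrations as (state, action) pairs, with 1 = DOWN and 3 = RIGHT
demos = [(0, 3), (1, 3), (1, 3), (1, 1), (4, 1)]
policy = fit_tabular_bc(demos)
remaining_errors = zero_one_loss(policy, demos)
```

Such a table says nothing about states that never appear in the demonstrations, which is one reason to use a parametric model such as a neural network instead.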

# **Environment Set up**

First, we define the environment, in this case a 5x5 gridworld. We also define the transition probabilities for each action.

## Environment 1 definition

The environment looks like this:\
![grid](img/grid.png)
![action space](img/grid_action.png)
![actions](img/action.png)\
Each state is represented by a number.
The goal is to reach `State 25` (i.e., the flag cell) from any random state. There are three obstacles (i.e., TNT cells). The four actions that the agent can execute are: Up - Down - Left - Right. The agent cannot cross walls; therefore, the action space looks as pictured above.




A transition matrix describes the dynamics of the MDP (Markov Decision Process) under one specific action, here `Action Up`. Its rows correspond to the current states, ranging from state 1 (cell 0) to state 25 (cell 24). Each column represents a possible next state when `Action Up` is taken.

Example of a transition matrix for Action Up (0):
```
#               1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25                
Pu = np.array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],  # 1
               [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],  # 2
               [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],  # 3
               [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],  # 4
               [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],  # 5
               ...
])
```


In the given context:

* The rows represent the current state of the system within the MDP, which comprises $25$ distinct states.
* The columns represent the possible next states that the system can transition to when `Action Up` is executed.
* The values in the matrix are the probabilities of transitioning from the current state (row) to a specific next state (column) upon taking `Action Up`.

For instance, the first row (`State 1`) shows a probability of $1.0$ (or $100$%) of transitioning to `State 1` when `Action Up` is performed. In other words, if the system is in `State 1` and `Action Up` is executed, it stays in `State 1` with certainty.

Conversely, a column gives the probabilities of arriving in one particular state from each possible current state. In the rows shown above (states 1 to 5), the diagonal elements are 1.0 and the off-diagonal elements are 0.0, so `Action Up` leaves the agent where it is; this is what one would expect if these states form the top row of the grid, where a wall blocks the move.

It's important to note that this MDP describes the system's probabilistic behavior in response to `Action Up` and is distinct from a standard Markov chain, as it incorporates the notion of making decisions (in this case, taking an action) and observing the subsequent state transitions.
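
As a rough sketch of how such a matrix can be built, the code below constructs the `Action Up` matrix for a 5x5 grid. It assumes row-major, zero-indexed cells, deterministic moves, and no obstacles; the actual `grid.py` may differ, and `build_up_matrix` is a hypothetical helper.

```python
import numpy as np

def build_up_matrix(n_rows=5, n_cols=5):
    """Deterministic transition matrix for moving up on an obstacle-free grid."""
    n_states = n_rows * n_cols
    P = np.zeros((n_states, n_states))
    for s in range(n_states):
        row = s // n_cols
        # In the top row the wall blocks the move, so the agent stays put
        next_s = s if row == 0 else s - n_cols
        P[s, next_s] = 1.0
    return P

Pu = build_up_matrix()
# Every row is a probability distribution over next states
assert np.allclose(Pu.sum(axis=1), 1.0)
```

Under these assumptions the first five rows reproduce the excerpt shown above, and every other row has its single 1.0 five columns to the left of the diagonal.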

Below is an example representation of an MDP from Wikipedia:\
![wikipedia](https://upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Markov_Decision_Process.svg/400px-Markov_Decision_Process.svg.png)

Here you can observe that from $S_0$, performing $a_0$ leads to a 50% probability of staying in $S_0$ and a 50% probability of moving to $S_2$.
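
To see what such a stochastic transition means in practice, the sketch below samples next states for this single transition and compares the empirical frequencies with the stated probabilities. Only the transition from $S_0$ under $a_0$ is encoded; the variable names and the seed are illustrative choices.

```python
import numpy as np

# Transition probabilities from S0 under a0, over next states [S0, S1, S2]
P_s0_a0 = np.array([0.5, 0.0, 0.5])

rng = np.random.default_rng(seed=0)
samples = rng.choice(3, size=10_000, p=P_s0_a0)

# Empirical frequencies approach the transition probabilities
freqs = np.bincount(samples, minlength=3) / samples.size
```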

In [None]:
!pip install torch

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim

In [None]:
%load_ext autoreload
%autoreload 2

### **Question 1.1**
* Define a different environment and represent the environment dynamics as an MDP using a graph (pen and paper).

* Specify the MDP

### **Question 1.2**
You will now modify the `grid.py` file by:

* Writing a `print()` method to visualize the grid environment
* Implementing a `set_mdp()` method to set the MDP, as well as a `reset()` method to initialize the state
* Completing the `step()` method using the MDP

In [None]:
from grid import GridWorldEnvironment

In [None]:
grid = (5, 5)
# Sample three obstacle cells, excluding the goal cell (the last state)
obstacles = np.random.randint(0, np.prod(grid) - 1, size=3)

env = GridWorldEnvironment(grid_size=grid, obstacles=obstacles)
env.reset()

## Test print() and step()
action = ["UP", "DOWN", "LEFT", "RIGHT"]
env.print()
for i, a in enumerate(action):
    print("Going", a)
    env.step(i)
    env.print()


# **Dataset Generation**
Next, we generate our dataset of expert demonstrations, which are represented as state-action pairs $(s,a)$. In this example, the expert agent moves right whenever possible (i.e., when not in the last column) and moves down once it reaches the last column. This policy is designed to guide the agent toward the goal state located in the bottom-right corner of the grid.

### **Question 2.1**

* Implement the *right-down* expert policy and generate demonstrations
* Print the expert demonstrations

In [None]:
# Generate expert demonstrations
def generate_expert_demonstrations(env, num_demos):
    expert_demonstrations = []

    for _ in range(num_demos):
        state = env.reset()
        trajectory = []

        while state != env.state_space - 1:  # Continue until reaching the goal state
            # Define a policy to reach the goal state while avoiding obstacles
            action = simple_policy(env, state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action))
            state = next_state


        if len(trajectory) > 0:
            expert_demonstrations.append(trajectory)

    return expert_demonstrations

def simple_policy(env, state):
    # A simple policy that avoids obstacles and follows a right-down strategy.
    
    """ Add your code here """


    """ """



    return action

# Create the grid world environment
env = GridWorldEnvironment()

# Generate expert demonstrations
num_demos = 10  # You can adjust the number of demonstrations
expert_demonstrations = generate_expert_demonstrations(env, num_demos)




# Print the expert demonstrations
""" Add your code here """


""" """



Below we can observe that the demonstration dataset consists of state-action pairs.

In [None]:
expert_demonstrations

### **Question 2.2**
Write a function `demo_summary()` that reports statistics about the demonstrations, such as their length and the number of states visited.

In [None]:
def demo_summary(demo_list):

    """ Add your code here """


    """ """

### **Question 2.3**

We provided demonstrations as if they had been given by a human.
* Can you provide alternative demonstrations for this environment (e.g., based on an alternative policy)?
* Can you suggest ways to generate these demonstrations?

$\rightarrow$ Play with the environment variables (e.g., number of obstacles) and the number of expert demonstrations.

# **Define the Behavioral Cloning Model**

## Model Structure
The BehavioralCloningModel described below consists of a Neural Network with three fully connected (dense) layers:\
Input (1D) -> [fc1] -> [ReLU] -> [fc2] -> [ReLU] -> [fc3] -> Output (Mapping from states to deterministic actions)
* `Input (1D)` represents the input layer with a single node, as state_dim is 1.
* `[fc1]` represents the first fully connected layer with 64 neurons.
* `[ReLU]` represents the Rectified Linear Unit (ReLU) activation function, applied after `fc1` and `fc2` (the output layer `fc3` has no activation).
* `[fc2]` represents the second fully connected layer with 64 neurons.
* `[fc3]` represents the third fully connected layer, which outputs the mapping from states to deterministic actions. The number of neurons in this layer is determined by action_dim.

The arrows between layers represent the connections and transformations of data as it flows through the network during the forward pass. The ReLU activation functions introduce non-linearity into the model.
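
To get a feel for the size of this network, the sketch below counts its trainable parameters by hand. It assumes `action_dim = 4` (one output per action: Up, Down, Left, Right); `linear_params` is a hypothetical helper.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

state_dim, action_dim, hidden = 1, 4, 64
layer_sizes = [(state_dim, hidden), (hidden, hidden), (hidden, action_dim)]
n_params = sum(linear_params(i, o) for i, o in layer_sizes)
```

Most of the parameters sit in the 64-by-64 middle layer, far more than the 25 states of the grid would strictly require.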

The model learns a policy using **supervised learning** by minimizing the loss function. In behavioral cloning, the model is trained to predict actions that are as close as possible to the actions taken by the expert. Accuracy is then used to evaluate the model.

In the code below, the **cross-entropy loss** quantifies the dissimilarity between the model's predicted distribution over actions and the actions actually taken by the expert.

In the context of behavioral cloning for deterministic environments (where expert actions are treated as deterministic), the cross-entropy loss is used to train the model to predict actions that resemble the expert's actions. However, it's important to note that this approach assumes that the expert's actions are ground truth.
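
To make the loss concrete, the sketch below computes the cross-entropy for a single state by hand: a softmax turns the network's raw outputs (logits) into action probabilities, and the loss is the negative log-probability of the expert's action. The logits are made-up values; `nn.CrossEntropyLoss` applies the same computation to a batch of logits and averages the result by default.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, expert_action):
    """Negative log-probability assigned to the expert's action."""
    return -math.log(softmax(logits)[expert_action])

# Made-up logits for the actions [UP, DOWN, LEFT, RIGHT]
logits = [0.1, 0.2, -0.5, 2.0]
loss_if_right = cross_entropy(logits, expert_action=3)  # expert went RIGHT
loss_if_left = cross_entropy(logits, expert_action=2)   # expert went LEFT
```

The loss is small when the model puts most of its probability on the expert's action and grows as that probability shrinks.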


Find out more about cross-entropy loss in [3.2.1.3 Log Loss Function](https://takaosa.github.io/paper/algorithmic-perspective-imitation.pdf).
A general overview of imitation learning approaches, including different loss functions, can be found in the [Imitation Learning Lecture](https://web.stanford.edu/class/cs237b/pdfs/lecture/lecture_10111213.pdf) from Stanford.

![architecture](img/archi.png)


### **Question 3.1**
* What other loss functions can be used? Why? Try them and explain your results.
* Add code to compute the accuracy at each epoch

In [None]:
# Define the Behavioral Cloning Model
class BehavioralCloningModel(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(BehavioralCloningModel, self).__init__()
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.fc1 = nn.Linear(state_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

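    # Note: this method shadows nn.Module.train(mode), so calling model.eval()
    # (which internally calls self.train(False)) will fail on this class.
    def train(self, model, states, actions, epochs=50, batch_size=32):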
        optimizer = optim.Adam(self.parameters(), lr=0.001)
        loss_fn = nn.CrossEntropyLoss()

        states = torch.tensor(states, dtype=torch.float32)
        # print('states', states)
        print("States size", states.size())

        actions = torch.tensor(actions, dtype=torch.int64)
        print("Actions size", actions.size())

        dataset = torch.utils.data.TensorDataset(states, actions)
        dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

        accuracy_values = []
        loss_values = []  # Initialize loss_values to track loss values

        for epoch in range(epochs):
            total_loss = 0.0
            correct_predictions = 0

            for batch_states, batch_actions in dataloader:
                optimizer.zero_grad()
                predicted_actions = model(batch_states)
                loss = loss_fn(predicted_actions, batch_actions)
                loss.backward()
                optimizer.step()
                total_loss += loss.item()

                # Calculate the number of correct predictions in the current batch
                correct_predictions += (predicted_actions.argmax(dim=1) == batch_actions).sum().item()

            # Calculate accuracy for the current epoch and store it
            """ Add your code here"""

            """ """

            # Append the average loss for the current epoch to loss_values
            loss_values.append(total_loss / len(dataloader))

            print(f'Epoch {epoch + 1}/{epochs}, Loss: {loss_values[-1]:.4f}, Accuracy: {accuracy * 100:.2f}%')

        # Calculate standard deviation for loss and accuracy
        loss_std = np.std(loss_values)
        accuracy_std = np.std(accuracy_values)

        return loss_values, accuracy_values, loss_std, accuracy_std # Return loss and accuracy values


In [None]:
# Extract states and actions from expert demonstrations
states = []
actions = []

for demonstration in expert_demonstrations:
    for state, action in demonstration:
        states.append([state])  # Wrap the state in a list to make it 2D (1x1)
        actions.append(action)

# Define state_dim and action_dim based on your data
state_dim = 1  # Assuming state is a scalar value
action_dim = max(actions) + 1  # Calculate action_dim based on the maximum action value

# Instantiate the model
model = BehavioralCloningModel(state_dim, action_dim)

# Train the model with the expert demonstration data
loss_values, accuracy_values, loss_std, accuracy_std = model.train(model, states, actions, epochs=500, batch_size=32)

In [None]:
epochs = len(accuracy_values)
# Define lighter colors for the standard deviation lines
lighter_blue = "b"
lighter_red = "r"

# Plot Loss and Accuracy
""" Add your code here """


""" """

In [None]:
# Visualize the Distribution of Actions
action_counts = [actions.count(i) for i in range(action_dim)]
plt.bar(range(action_dim), action_counts)
plt.xticks(range(action_dim))
plt.xlabel("Action")
plt.ylabel("Count")
plt.title("Distribution of Actions")
plt.show()

## Rollout policy

We evaluate the efficiency of the learned model by using it as a policy in the RL environment and logging the final reward for multiple episodes.

### **Question 3.2**
Implement the policy rollout:
* Select an action from the neural network
* Execute the action in the environment
* Update the `done`, `reward`, and `step` variables

In [None]:
def policy_rollout(env, bc_policy, n_episode=10, max_step=100):

  avg_reward, avg_nstep = [], []

  for e in range(n_episode):

    #Starting one episode
    env.reset()
    reward, step = 0, 0
    done = False

    while not done:

      """ Add your code here """

      # Select action with behavioral cloning policy


      # Execute action
   


      # Update done, reward, and step



      """ """

      if step == max_step:
        reward = -5
        break


    avg_reward.append(reward)
    avg_nstep.append(step)

  return avg_reward, avg_nstep

In [None]:
avg_reward, avg_nstep = policy_rollout(env, model, 50)
print(f"Steps {np.mean(avg_nstep)} / {np.std(avg_nstep):.3f}")
print(f"Reward {np.mean(avg_reward)} / {np.std(avg_reward):.3f}")

# **Bonus**: *Stochastic-Icy-MDP*
![charlie_brown](img/charlie_brown.jpg)

Note that while the term *probabilities* might imply a probabilistic policy, in this deterministic case, these values represent the model's confidence or preference for each possible action given a particular state.

Now, consider an *icy* environment with some inertia. For example, if the agent positioned at `state = 0` goes to the right, it might end up in state 1, 2, 3, or even 4! The probability of reaching a state in the same row or column decreases with its distance from the initial state. For example:

$$ \mathcal{P}([1,2,3,4] | s=0, a=right) = [0.8, 0.1, 0.08, 0.02] $$

## **Question 4**
* Define a new `set_mdp()` method in the `IcyGridWorldEnvironment` class and run the experiment again with this stochastic environment. Make sure that the probabilities sum to 1, *i.e.*, $\sum_{s_{t+1}}\mathcal{P}(s_{t+1}|s_t=s, a_t=a) = 1$.

**Hint**: To define transition probabilities that sum to 1, you can either use a softmax or a power law.
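
One possible way to follow the power-law hint is to weight each reachable cell by $d^{-\alpha}$, where $d$ is its distance from the current cell, and then normalize. The sketch below does only that; the exponent `alpha` and the helper name `slide_probabilities` are assumptions for illustration, and the resulting values will not match the example vector above exactly.

```python
import numpy as np

def slide_probabilities(n_reachable, alpha=3.0):
    """Power-law probabilities over cells at distance 1..n_reachable."""
    distances = np.arange(1, n_reachable + 1)
    weights = distances.astype(float) ** (-alpha)
    return weights / weights.sum()

# From state 0 moving right on a 5-wide row, 4 cells are reachable
probs = slide_probabilities(4)
```

When no cell is reachable because the agent faces a wall, this helper returns an empty array, so that case (the agent stays in place) still has to be handled separately in `set_mdp()`.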

In [None]:
""" Initialize an IcyGridWorldEnvironment and run Behavioral Cloning """

# **Open-ended Project**
Compare the performance of Behavioral Cloning with another Imitation Learning algorithm of your choice (e.g., IRL, DAgger).
To choose your alternative algorithm, feel free to check out the implementation offered by [imitation](https://imitation.readthedocs.io/en/latest/algorithms/bc.html).
* Do a benchmark of similar existing projects/approaches.
* Together with the alternative algorithm of your choice, provide details about how the algorithm works.
* Discuss your results, highlighting the advantages and limitations of your approach.

# **References**
* D. A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3:88–97, 1991.
* F. Torabi, G. Warnell, and P. Stone. Behavioral cloning from observation. ArXiv, abs/1805.01954, 2018.
* [Imitation Learning open source library](https://imitation.readthedocs.io/en/latest/algorithms/bc.html)
* CS 285 at UC Berkeley, [Deep Reinforcement Learning](https://rail.eecs.berkeley.edu/deeprlcourse/)
* M. Chetouani. Interactive Robot Learning: An Overview. In M. Chetouani, V. Dignum, P. Lukowicz, and C. Sierra (eds.), Human-Centered Artificial Intelligence, Lecture Notes in Computer Science, vol. 13500, Springer International Publishing, pp. 140–172, 2023. doi:10.1007/978-3-031-24349-3_9. [hal-04060804](https://hal.science/hal-04060804/file/ACAI2021_chetouani_author-version.pdf)
