# Behavior Cloning (BC)

In Behavior Cloning (BC), we find the optimal parameters $\theta$ of a policy $\pi_{\theta}$ by solving a regression (or classification) problem on the expert dataset $\mathcal{D}$, that is, by supervised learning.<br>
Therefore, you can simply apply existing regression (or classification) methods, such as a Gaussian model, GMM, non-parametric methods (LWR, GPR), or neural network learners.<br>
See [my post](https://tsmatz.wordpress.com/2017/08/30/regression-in-machine-learning-math-for-beginners/) for the design choices in regression problems.

In this notebook, I'll build a neural network policy $\pi_{\theta}$ and then optimize its parameters (weights) by minimizing the cross-entropy loss in PyTorch.
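Written out in the notation above, minimizing the cross-entropy loss amounts to maximizing the log-likelihood of the expert's actions. As a sketch, assuming $\mathcal{D}$ is treated as a set of state-action pairs $(s, a)$:

$$
\theta^{*} = \arg\min_{\theta} \left( - \sum_{(s, a) \in \mathcal{D}} \log \pi_{\theta}(a \mid s) \right)
$$

Here $\pi_{\theta}(a \mid s)$ is the probability the policy assigns to the expert's action $a$ in state $s$, obtained by applying softmax to the network's logits.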

The trained policy can then be used as a starting point for regular reinforcement learning (RL) methods, if you want to refine the model for better performance. (See [here](https://github.com/tsmatz/reinforcement-learning-tutorials) for RL algorithms.)

BC is a basic approach to imitation learning and is easy to apply in a variety of scenarios.

But it's worth noting that it also has shortcomings in some situations.<br>
One of these is that an agent trained by BC may encounter unknown states that are not covered by the expert's demonstrations, because the expert dataset doesn't contain enough data for failure scenarios. In most cases, the trained agent works well in success cases, but fails when it encounters such irregular states.<br>
In such cases, you can apply [DAgger](./02_dagger.ipynb) (next example), or refine the policy with regular reinforcement learning after BC has been applied.

Now let's start.

*(back to [index](https://github.com/tsmatz/imitation-learning-tutorials/))*

Before we start, we need to install the required packages.

In [None]:
!pip install torch numpy

## Restore environment

First, I restore the GridWorld environment from a JSON file.

For details about this environment, see [Readme.md](https://github.com/tsmatz/imitation-learning-tutorials/blob/master/Readme.md).

> Note : See [this script](./00_generate_expert_trajectories.ipynb) for generating the same environment.

In [1]:
import torch
import json
from gridworld import GridWorld

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

with open("gridworld.json", "r") as f:
    json_object = json.load(f)
    env = GridWorld(**json_object, device=device)

Now I visualize our GridWorld environment.

The number in each cell indicates the reward for that state.<br>
The goal state is in the bottom-right corner (where the reward is ```10.0```), and the initial state is picked uniformly from the gray-colored cells.<br>
If the agent reaches the goal state without losing any reward, it gets a total reward of ```10.0```.

See [Readme.md](https://github.com/tsmatz/imitation-learning-tutorials/blob/master/Readme.md) for details about the rules of this environment.

In [2]:
from IPython.display import HTML, display

valid_states_all = torch.cat((env.valid_states, torch.tensor([env.grid_size-1,env.grid_size-1]).to(device).unsqueeze(dim=0)))
valid_states_all = valid_states_all[:,0] * env.grid_size + valid_states_all[:,1]

html_text = "<table>"
for row in range(env.grid_size):
    html_text += "<tr>"
    for col in range(env.grid_size):
        if row*env.grid_size + col in valid_states_all:
            html_text += "<td bgcolor=\"gray\">"
        else:
            html_text += "<td>"
        html_text += str(env.reward_map[row*env.grid_size+col].tolist())
        html_text += "</td>"
    html_text += "</tr>"
html_text += "</table>"

display(HTML(html_text))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49
0,-1,0,0,0,-1,0,0,0,0,0,0,-1,0,-1,-1,-1,-1,0,-1,0,-1,0,0,0,-1,0,0,0,-1,-1,0,-1,0,0,0,0,-1,-1,0,0,0,0,0,0,0,0,0,0,0
-1,0,0,0,0,-1,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,-1,-1,0,-1,0,0,0,0,0,-1,0,-1,0,-1,0,-1,0,-1,-1,0,0,0,-1,-1,0,0,-1,-1
0,0,-1,-1,0,-1,0,-1,0,-1,0,0,0,0,0,0,0,-1,0,0,-1,0,0,-1,0,0,0,0,0,0,0,-1,0,0,-1,0,0,-1,-1,0,0,-1,0,0,0,0,-1,0,0,0
0,0,-1,0,-1,-1,0,0,-1,0,-1,0,0,0,0,-1,0,-1,0,-1,0,-1,0,0,-1,-1,0,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,0,0,0,0,0,0,-1
-1,-1,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,-1,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,-1,0,0,0,-1,0,0,-1,0,-1,-1,0,0
0,-1,-1,0,0,-1,0,-1,0,0,-1,0,0,-1,0,0,-1,-1,0,0,-1,0,0,0,-1,0,0,0,0,-1,-1,0,0,-1,0,-1,-1,-1,-1,0,0,-1,0,-1,0,0,0,0,0,-1
-1,-1,0,0,0,0,0,0,0,0,0,0,-1,0,-1,0,0,0,0,0,-1,-1,0,0,-1,-1,0,0,-1,-1,0,-1,0,0,0,0,0,0,0,-1,-1,0,0,0,-1,0,-1,0,-1,-1
0,0,0,0,-1,0,0,0,-1,0,0,0,0,-1,0,0,-1,-1,0,0,-1,0,0,-1,0,0,-1,0,0,0,-1,-1,0,0,-1,0,-1,0,-1,0,0,-1,0,-1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,-1,-1,0,-1,0,0,-1,0,0,-1,-1,0,0,0,0,0,0,-1,0,0,0,-1,-1,-1,-1,-1,-1,0,-1,-1,0,-1,0,0,0,-1,0,0,0,0
0,0,0,-1,0,-1,0,0,-1,-1,0,0,-1,0,0,0,0,0,0,0,0,0,-1,-1,-1,0,-1,0,0,0,0,0,0,0,0,0,0,-1,0,0,0,0,0,0,-1,-1,0,-1,-1,-1


## Define policy

Now I build a policy $\pi_{\theta}$.

This network receives the current state (as a one-hot vector) as input and returns the logits over the actions as output. The action itself is then sampled from the categorical distribution defined by these logits.

In [3]:
import torch.nn as nn
from torch.nn import functional as F

STATE_SIZE = env.grid_size*env.grid_size  # 2500
ACTION_SIZE = env.action_size             # 4

#
# Define model
#
class PolicyNet(nn.Module):
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.hidden = nn.Linear(STATE_SIZE, hidden_dim)
        self.classify = nn.Linear(hidden_dim, ACTION_SIZE)

    def forward(self, s):
        outs = self.hidden(s)
        outs = F.relu(outs)
        logits = self.classify(outs)
        return logits

#
# Generate model
#
policy_func = PolicyNet().to(device)
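To make the one-hot input concrete, here is a minimal sketch of how a grid cell could be turned into the network's input. It assumes a 50 x 50 grid (inferred from the `# 2500` comment) and that a state index is `row * grid_size + col`, as in the visualization cell above. The helper `to_state_index` is hypothetical and not part of the `GridWorld` class.

```python
import torch
from torch.nn import functional as F

GRID_SIZE = 50                       # assumption: 50 * 50 = 2500 states
STATE_SIZE = GRID_SIZE * GRID_SIZE

def to_state_index(row, col, grid_size=GRID_SIZE):
    """Flatten a (row, col) cell into a single state index (hypothetical helper)."""
    return row * grid_size + col

# top-left cell and bottom-right cell (the goal in this environment)
goal_index = to_state_index(GRID_SIZE - 1, GRID_SIZE - 1)
s = torch.tensor([to_state_index(0, 0), goal_index])

# each state index becomes a vector with a single 1 at that index
s_onehot = F.one_hot(s, num_classes=STATE_SIZE)

print(goal_index)                    # 2499
print(tuple(s_onehot.shape))         # (2, 2500)
```

Because the input is one-hot, the first linear layer effectively acts as a lookup table that holds one learned vector per state.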

## Run agent before training

For comparison, I first run this agent without any training.

In this game, the maximum episode reward (reaching the goal without losing any reward) is ```10.0```. (See [Readme.md](https://github.com/tsmatz/imitation-learning-tutorials/blob/master/Readme.md) for the rules of this environment.)<br>
As you can see below, the average reward is low.

In [4]:
# Pick stochastic samples with policy model
def pick_sample_and_logits(policy, s):
    """
    Stochastically pick up action and logits with policy model.

    Parameters
    ----------
    policy : torch.nn.Module
        Policy network to use
    s : torch.tensor((..., STATE_SIZE), dtype=int)
        The feature (one-hot) of state.
        The above "..." can have arbitrary shape with 0 or 1 dimension.

    Returns
    ----------
    action : torch.tensor((...), dtype=int)
        The picked-up actions.
        If input shape is (*, STATE_SIZE), this shape becomes (*).
    logits : torch.tensor((..., ACTION_SIZE), dtype=float)
        Logits of categorical distribution (used to optimize model).
        If input shape is (*, STATE_SIZE), this shape becomes (*, ACTION_SIZE).
    """
    # get logits from state
    # --> size : (*, ACTION_SIZE)
    logits = policy(s.float())
    # from logits to probabilities
    # --> size : (*, ACTION_SIZE)
    probs = F.softmax(logits, dim=-1)
    # pick up action's sample
    # --> size : (*, 1)
    a = torch.multinomial(probs, num_samples=1)
    # --> size : (*)
    # (squeeze only the last dimension, so a batch of size 1 keeps its batch dimension)
    a = a.squeeze(dim=-1)

    # Return
    return a, logits
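The sampling step can be illustrated in isolation. The following is a small sketch with made-up logits for 4 actions (not taken from the trained policy): softmax turns the logits into probabilities, and `torch.multinomial` draws actions in proportion to them, so the empirical frequencies approach the probabilities as the number of samples grows.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)

# made-up logits for 4 actions (illustration only)
logits = torch.tensor([2.0, 1.0, 0.0, -1.0])

# from logits to probabilities (roughly [0.64, 0.24, 0.09, 0.03])
probs = F.softmax(logits, dim=-1)

# draw many actions from the categorical distribution
samples = torch.multinomial(probs, num_samples=10000, replacement=True)

# empirical frequency of each action
freqs = torch.bincount(samples, minlength=4).float() / samples.numel()

print(probs)
print(freqs)
```

This is why an untrained policy wanders almost at random: its logits are close to each other, so all actions are sampled with similar probability.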

In [5]:
def evaluate(policy, batch_size):
    total_reward = torch.tensor(0.0).to(device)
    s = env.reset(batch_size)
    while True:
        s_onehot = F.one_hot(s, num_classes=STATE_SIZE)
        with torch.no_grad():
            a, _ = pick_sample_and_logits(policy, s_onehot)
        s, r, term, trunc = env.step(a, s)
        total_reward += torch.sum(r)
        done = torch.logical_or(term, trunc)
        work_indices = torch.logical_not(done).nonzero().squeeze(dim=-1)
        if len(work_indices) == 0:
            break
        s = s[work_indices]
    return total_reward.item() / batch_size

avg_reward = evaluate(policy_func, 300)
print(f"Estimated rewards (before training): {avg_reward}")

Estimated rewards (before training): -67.3


## Train policy

Now we train our policy with expert data.

> Note : The expert data is located in the ```./expert_data``` folder in this repository. See [this script](./00_generate_expert_trajectories.ipynb) for generating the expert dataset.

In this training, I compute the cross-entropy loss for the categorical distribution over actions and then optimize the policy using only the expert dataset.<br>
Unlike [reinforcement learning](https://github.com/tsmatz/reinforcement-learning-tutorials), this training does not use the reward at all.
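As a worked example of this loss, the sketch below computes the cross-entropy for a single sample by hand, using made-up logits for the 4 actions. The function `cross_entropy` here is a plain-Python illustration written for this example, not the PyTorch implementation.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# made-up logits for 4 actions (illustration only)
logits = [2.0, 1.0, 0.0, -1.0]

# the expert chose the action the policy already prefers: small loss
loss_agree = cross_entropy(logits, 0)
# the expert chose the action the policy considers least likely: large loss
loss_disagree = cross_entropy(logits, 3)

print(round(loss_agree, 3))     # 0.44
print(round(loss_disagree, 3))  # 3.44
```

Minimizing this loss therefore raises the logit of the expert's action relative to the other actions.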

As you can see below, the average reward rises as training proceeds, which shows that the policy is well trained. (See [Readme.md](https://github.com/tsmatz/imitation-learning-tutorials/blob/master/Readme.md) for the rules of this environment.)

> Note : You can process the data in batches to speed up training. (Here I compute the loss one sample at a time, because this training is very easy.)
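As a rough sketch of the batched variant mentioned in the note, the loss for a whole episode can be computed with a single forward pass. The stand-in `policy` below only mirrors the layer sizes of `PolicyNet`, and the states and actions are made up for illustration. They are not taken from the expert dataset.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)

STATE_SIZE, ACTION_SIZE = 2500, 4   # sizes used in this notebook

# stand-in policy with the same layer sizes as PolicyNet (assumption)
policy = nn.Sequential(
    nn.Linear(STATE_SIZE, 64), nn.ReLU(), nn.Linear(64, ACTION_SIZE)
)

# made-up episode: 5 state indices and 5 expert actions
states = torch.tensor([0, 1, 51, 101, 151])
actions = torch.tensor([1, 2, 2, 2, 1])

# batched: one forward pass for the whole episode
s_onehot = F.one_hot(states, num_classes=STATE_SIZE).float()
batched_loss = F.cross_entropy(policy(s_onehot), actions, reduction="sum")

# one by one: same total, but one forward pass per sample
loop_loss = sum(
    F.cross_entropy(policy(F.one_hot(s, num_classes=STATE_SIZE).float()), a)
    for s, a in zip(states, actions)
)

print(torch.allclose(batched_loss, loop_loss, atol=1e-4))  # True
```

Both versions give the same summed loss, so the gradients are the same as well. The batched form simply avoids the Python-level loop over samples.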

In [6]:
import pickle

# use the following expert dataset
dest_dir = "./expert_data"
checkpoint_files = ["ckpt0.pkl"]

# create optimizer
opt = torch.optim.AdamW(policy_func.parameters(), lr=0.001)

for ckpt in checkpoint_files:
    # load expert data from pickle
    with open(f"{dest_dir}/{ckpt}", "rb") as f:
        all_data = pickle.load(f)
    all_states = all_data["states"]
    all_actions = all_data["actions"]
    timestep_lens = all_data["timestep_lens"]
    # loop all episodes in demonstration
    current_timestep = 0
    for i, timestep_len in enumerate(timestep_lens):
        # pick up states and actions in a single episode
        states = all_states[current_timestep:current_timestep+timestep_len]
        actions = all_actions[current_timestep:current_timestep+timestep_len]
        # collect loss and optimize (train)
        opt.zero_grad()
        loss = []
        for s, a in zip(states, actions):
            s_onehot = F.one_hot(torch.tensor(s).to(device), num_classes=STATE_SIZE)
            _, logits = pick_sample_and_logits(policy_func, s_onehot)
            l = F.cross_entropy(logits, torch.tensor(a).to(device), reduction="none")
            loss.append(l)
        total_loss = torch.stack(loss, dim=0)
        total_loss.sum().backward()
        opt.step()
        # log
        print("Processed {:5d} episodes in checkpoint {}...".format(i + 1, ckpt), end="\r")
        # run evaluation in each 1000 episodes
        if i % 1000 == 999:
            avg = evaluate(policy_func, 200)
            print(f"\nEvaluation result (Average reward): {avg}")
        # proceed to next episode
        current_timestep += timestep_len

Processed  1000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 3.13
Processed  2000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 7.215
Processed  3000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 7.58
Processed  4000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 8.285
Processed  5000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 8.225
Processed  6000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 8.575
Processed  7000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 8.485
Processed  8000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 8.715
Processed  9000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 8.63
Processed 10000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 8.705
