<h1><a href="https://arxiv.org/abs/1606.03476">
Generative Adversarial Imitation Learning</a></h1>
Jonathan Ho, Stefano Ermon


<h2>Summary</h2>

* Reformulated IRL as a dual of an `occupancy measure` matching problem

* Drew an analogy between IRL and GAN

* Proposed a model-free imitation learning algorithm utilizing the connection between IRL and GAN
    

<h2>Motivation</h2>

* Many IRL algorithms are expensive to run because they solve a reinforcement learning problem in an inner loop.

* IRL learns a cost function but does not directly tell the learner how to act



<h2>A General Framework for IRL</h2>

<h3>Basis</h3>

* State space is $S$; 
* Action space is $A$;
* Dynamics model $P(s'|s,a)$, which the model-free algorithm does not need to know explicitly;
* Cost $c:S\times A\rightarrow \mathbb{R}$

<h3>RL problem</h3>

Given a cost function $c$ that has already been learnt, the RL problem seeks the policy that has high entropy and low expected cost.
$$RL(c)=\arg\min_{\pi\in\Pi}-H(\pi)+\mathbb{E}_\pi[c(s,a)]$$
where $H(\pi)\triangleq -\sum_{(s,a)} \rho_\pi(s,a)\log\pi(a|s)$ is the $\gamma$-discounted causal entropy and $\rho_\pi(s,a)=\pi(a|s)\sum^\infty_{t=0} \gamma^t P(s_t=s|\pi)$ is the occupancy measure of policy $\pi$.
><font color="green">The $\gamma$-discounted causal entropy of a policy $\pi$ is 
\begin{eqnarray}
H(\pi)&=&-\sum_{(s,a)}\rho_\pi(s,a)log\pi(a|s)\\
&=&-\sum_{(s,a)}\rho_\pi(s)\pi(a|s)log\pi(a|s)\\
&=&-\sum_{s}\rho_\pi(s)\sum_{a}\pi(a|s)log\pi(a|s)\\
&=&\sum_{s}\rho_\pi(s)H(\pi(\cdot|s))    
\end{eqnarray}
Here $\rho_\pi(s)=\sum^\infty_{t=0}\gamma^t P(s_t=s|\pi)$ is the discounted state occupancy. This entropy is not the mean of the per-state entropies $H(\pi(\cdot|s))$, because the weights $\rho_\pi(s)$ do not sum to one:
\begin{eqnarray}
\sum_s\rho_\pi(s)&=&\sum_s\sum^\infty_{t=0}\gamma^t P(s_t=s|\pi)\\
&=&\sum^\infty_{t=0}\gamma^t \sum_s P(s_t=s|\pi)\\
&=&\sum^\infty_{t=0}\gamma^t\\
&=&\frac{1}{1-\gamma}\neq 1
\end{eqnarray}
</font>
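
The two identities in the note above can be checked numerically. The following is a minimal sketch on a hypothetical 2-state, 2-action MDP; the transition table `P`, the policy `PI`, the initial distribution `P0` and the truncation horizon are all assumptions made for illustration, not values from the paper.

```python
import math

GAMMA = 0.9
# Hypothetical transition table: P[s][a][s2] = probability of s -> s2 under action a.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
# Hypothetical policy: PI[s][a] = pi(a|s).
PI = [[0.6, 0.4], [0.25, 0.75]]
P0 = [1.0, 0.0]  # hypothetical initial state distribution


def next_state_dist(d):
    """Push the state distribution d one step forward under PI and P."""
    return [sum(d[s] * PI[s][a] * P[s][a][s2] for s in range(2) for a in range(2))
            for s2 in range(2)]


def state_occupancy(horizon=2000):
    """rho(s) = sum_t gamma^t P(s_t = s | pi), truncated after `horizon` steps."""
    d, rho, discount = list(P0), [0.0, 0.0], 1.0
    for _ in range(horizon):
        rho = [r + discount * x for r, x in zip(rho, d)]
        d = next_state_dist(d)
        discount *= GAMMA
    return rho


rho_s = state_occupancy()

# Causal entropy from the joint occupancy rho(s, a) = rho(s) * pi(a|s).
h_joint = -sum(rho_s[s] * PI[s][a] * math.log(PI[s][a])
               for s in range(2) for a in range(2))
# Causal entropy as an occupancy-weighted sum of per-state entropies.
h_state = sum(rho_s[s] * -sum(p * math.log(p) for p in PI[s]) for s in range(2))

total_mass = sum(rho_s)  # approaches 1 / (1 - GAMMA)
```

Both entropy expressions agree, and `total_mass` approaches $\frac{1}{1-\gamma}$, which is why the weights do not form a probability distribution over states.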




<h3>IRL problem</h3>

Assume that the set of possible cost functions is $C$, e.g. $C=\mathbb{R}^{S\times A}$. Then an IRL formula with an entropy regularizer is

\begin{eqnarray}
IRL({\pi_E})&=&argmax_{c\in C} (min_{\pi\in\Pi}(-H(\pi)+\mathbb{E}_\pi[c(s,a)]))-\mathbb{E}_{\pi_E}[c(s,a)] \\
&=&argmax_{c\in C} min_{\pi\in\Pi}(-H(\pi)+\mathbb{E}_\pi[c(s,a)]-\mathbb{E}_{\pi_E}[c(s,a)]) 
\end{eqnarray}
It can be shown that the expected cost of policy $\pi$ under cost function $c$ can be evaluated with the occupancy measure $\rho_\pi(s,a)$, such that $\mathbb{E}_\pi[c(s,a)]=\sum_{s,a}\rho_\pi(s,a)c(s,a)=\mathbb{E}_{\rho_\pi}[c(s,a)]$. Likewise, since $\pi(a|s)=\rho_\pi(s,a)/\sum_{a'}\rho_\pi(s,a')$, the entropy term $H(\pi)$ can also be written in terms of $\rho_\pi$:

\begin{eqnarray}
H(\pi)&=&-\sum_{(s,a)}\rho_\pi(s,a)\log\pi(a|s)\\
&=&-\sum_{(s,a)}\rho_\pi(s,a)\log\frac{\rho_\pi(s,a)}{\sum_{a'}\rho_\pi(s,a')}\\
&=&\bar{H}(\rho_\pi)
\end{eqnarray}
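
As a small illustration of these identities, the sketch below recovers the policy from an occupancy table and evaluates the entropy and the expected cost directly from it. The tables `RHO` and `COST` are hypothetical and are simply taken as given; the function names are introduced here for illustration only.

```python
import math


def policy_from_occupancy(rho):
    """pi(a|s) = rho(s, a) / sum_a' rho(s, a')."""
    return [[x / sum(row) for x in row] for row in rho]


def entropy_from_occupancy(rho):
    """H_bar(rho) = -sum_{s,a} rho(s,a) * log(rho(s,a) / sum_a' rho(s,a'))."""
    total = 0.0
    for row in rho:
        state_mass = sum(row)
        for x in row:
            if x > 0.0:  # 0 * log 0 is treated as 0
                total -= x * math.log(x / state_mass)
    return total


def expected_cost(rho, cost):
    """E_pi[c] = sum_{s,a} rho(s,a) * c(s,a)."""
    return sum(r * c for rho_row, cost_row in zip(rho, cost)
               for r, c in zip(rho_row, cost_row))


# Hypothetical occupancy table RHO[s][a]; its entries sum to 10 = 1 / (1 - 0.9).
RHO = [[3.0, 1.0], [1.5, 4.5]]
# Hypothetical cost table COST[s][a].
COST = [[1.0, 0.0], [0.5, 2.0]]

pi = policy_from_occupancy(RHO)
h_bar = entropy_from_occupancy(RHO)
cost_value = expected_cost(RHO, COST)
```

Computing $-\sum_{(s,a)}\rho_\pi(s,a)\log\pi(a|s)$ with the recovered `pi` gives the same number as `h_bar`, which is the content of the derivation above.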

Replacing $\pi$ with its occupancy measure $\rho_\pi$ in $IRL({\pi_E})$ gives $IRL(\rho_{\pi_E})$.

\begin{eqnarray}
IRL(\rho_{\pi_E})&=&\arg\max_{c\in C} \min_{\rho_\pi\in D}(-\bar{H}(\rho_\pi)+\sum_{s,a}\rho_\pi(s,a)c(s,a)-\sum_{s,a}\rho_{\pi_E}(s,a)c(s,a))\\
&=&\arg\max_{c\in C} \min_{\rho_\pi\in D}[-\bar{H}(\rho_\pi)+\sum_{s,a}(\rho_\pi(s,a)-\rho_{\pi_E}(s,a))c(s,a)]
\end{eqnarray}
where $D$ is the feasible set of occupancy measures induced by the set of policies. The objective is convex in $\rho_\pi$ (because $\bar{H}$ is concave) and linear in $c$, which ensures that the min-max problem has a saddle point.

$IRL(\rho_{\pi_E})$ is therefore the dual of the primal problem
$$\min_{\rho_\pi\in D} -\bar{H}(\rho_\pi)\qquad s.t.\ \rho_\pi(s,a)=\rho_{\pi_E}(s,a)\ \forall (s,a)\in S\times A$$
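
One way to see the duality, sketched here with the standard Lagrangian construction: attach a multiplier $c(s,a)$ to each equality constraint of the primal, which gives

$$L(\rho_\pi,c)=-\bar{H}(\rho_\pi)+\sum_{s,a}c(s,a)(\rho_\pi(s,a)-\rho_{\pi_E}(s,a))$$

The dual problem $\max_{c}\min_{\rho_\pi\in D}L(\rho_\pi,c)$ has the same objective as $IRL(\rho_{\pi_E})$ when $C=\mathbb{R}^{S\times A}$, so the cost function plays the role of the dual variable.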



<h3>General IRL formulation</h3>

**According to the paper, to avoid overfitting on a finite dataset (to be clarified)**, a convex cost function regularizer $\psi:\mathbb{R}^{S\times A}\rightarrow \bar{\mathbb{R}}$ can be introduced into the IRL formula such that

$$IRL(\rho_{\pi_E})=argmax_{c\in C} min_{\pi\in\Pi}(-\psi(c)-H(\pi)+\mathbb{E}_\pi[c(s,a)]-\mathbb{E}_{\pi_E}[c(s,a)])$$

This IRL formula recovers other existing IRL algorithms by customizing the regularizer $\psi(c)$.


<em>For instance, the apprenticeship learning formula is

$$AL(\rho_{\pi_E})=\arg\min_{\pi\in\Pi}\max_{c\in C}\ \mathbb{E}_\pi[c(s,a)] -\mathbb{E}_{\pi_E}[c(s,a)]$$
where $c(s,a)={\bf \omega}^T {\bf f}(s,a)$ is a linear combination of a set of feature functions ${\bf f}(s,a)=[f_1(s,a), f_2(s,a),\ldots, f_k(s,a)]^T$ with $||{\bf \omega}||_2\leq 1$. 

Then $IRL(\rho_{\pi_E})$ recovers $AL(\rho_{\pi_E})$ by defining 
$$
\psi(c) = \left\{
        \begin{array}{ll}
            0 & \quad if\ c\in C \\
            \infty &\quad otherwise
        \end{array}
    \right.
$$
</em>
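
For the linear cost class above, the inner maximization has a closed form: the worst-case cost gap equals the Euclidean norm of the difference in feature expectations. The sketch below checks this on hypothetical feature expectations `MU_PI` and `MU_E`; these vectors and the function names are assumptions for illustration only.

```python
import math
import random


def cost_gap(weights, mu_pi, mu_e):
    """E_pi[c] - E_piE[c] for c = w^T f, written with feature expectations."""
    return sum(w * (p - e) for w, p, e in zip(weights, mu_pi, mu_e))


def worst_case_weights(mu_pi, mu_e):
    """Maximizer of the cost gap over the unit L2 ball: w* = gap / ||gap||_2."""
    gap = [p - e for p, e in zip(mu_pi, mu_e)]
    norm = math.sqrt(sum(g * g for g in gap))
    return [g / norm for g in gap], norm


def random_unit_vector(rng, dim):
    """Sample a direction uniformly by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Hypothetical feature expectations E_pi[f] and E_piE[f].
MU_PI = [0.2, 0.5, 0.3]
MU_E = [0.5, 0.1, 0.3]

w_star, gap_norm = worst_case_weights(MU_PI, MU_E)
best_gap = cost_gap(w_star, MU_PI, MU_E)
```

Under this cost class, apprenticeship learning therefore amounts to matching feature expectations: the gap is zero exactly when the two feature expectation vectors coincide.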

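Define the convex conjugate of $\psi(c)$ as $\psi^*(x)=\sup_c x^T c-\psi(c)$. Feeding the cost recovered by IRL into RL swaps the order of $\max$ and $\min$ at the saddle point. With $x=\rho_\pi-\rho_{\pi_E}$, the inner maximization over $c$ is exactly the conjugate, so
\begin{eqnarray}
RL\circ IRL(\rho_{\pi_E})&=&\arg\min_{\pi\in\Pi}\max_{c\in C} [-\psi(c)-\bar{H}(\rho_\pi)+\sum_{s,a}(\rho_\pi(s,a)-\rho_{\pi_E}(s,a))c(s,a)]\\
&=&\arg\min_{\pi\in\Pi}\ \psi^*(\rho_\pi-\rho_{\pi_E})-\bar{H}(\rho_\pi)
\end{eqnarray}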
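
Two special cases may help to read this result. They follow directly from the definition of the conjugate and are written here as a short sketch. If $\psi$ is constant (no regularization), then

$$\psi^*(x)=\sup_c x^Tc=\left\{
        \begin{array}{ll}
            0 & \quad x=0 \\
            +\infty & \quad otherwise
        \end{array}
    \right.$$

so the learner must match the expert's occupancy measure exactly, $\rho_\pi=\rho_{\pi_E}$. If $\psi$ is the indicator of a cost class $C$ ($0$ on $C$ and $+\infty$ elsewhere), then

$$\psi^*(\rho_\pi-\rho_{\pi_E})=\sup_{c\in C}\sum_{s,a}(\rho_\pi(s,a)-\rho_{\pi_E}(s,a))c(s,a)$$

which is the apprenticeship learning objective.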


<h2> Generative Adversarial Imitation Learning</h2>

* Define the regularizer for GAIL as follows
$$\psi_{GA}(c)\triangleq\left\{
        \begin{array}{ll}
            \mathbb{E}_{\rho_{\pi_E}}[g(c(s,a))] & \quad c<0 \\
           +\infty & \quad otherwise
        \end{array}
    \right.$$
where 
$$g(x)=\left\{
        \begin{array}{ll}
            -x-log(1-e^x) & \quad x<0 \\
           +\infty & \quad otherwise
        \end{array}
    \right.$$
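
The shape of $g$ is easy to inspect numerically. The sketch below implements the definition above; the use of `math.expm1` for numerical stability near $0$ and the sample points are choices made for this illustration.

```python
import math


def g(x):
    """g(x) = -x - log(1 - e^x) for x < 0, and +infinity otherwise."""
    if x >= 0.0:
        return math.inf
    # 1 - e^x is computed as -expm1(x) to stay accurate when x is close to 0.
    return -x - math.log(-math.expm1(x))


# Setting g'(x) = -1 + e^x / (1 - e^x) to zero gives e^x = 1/2.
x_min = -math.log(2.0)
g_min = g(x_min)  # equals 2 * log(2)

# Sample values: g blows up near 0 and grows roughly linearly for very negative x.
samples = {x: g(x) for x in (-10.0, -2.0, x_min, -0.1, -1e-6)}
```

So $g$ is smallest at $x=-\log 2$ and increases in both directions: steeply as $x$ approaches $0$ from below, and roughly like $-x$ for very negative $x$.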
    
> <font color="green">This regularizer is convex. $g(x)$ grows without bound as $x$ approaches $0$ from below and equals $+\infty$ when $x\geq 0$, so $\psi_{GA}(c)=+\infty$ unless $c(s,a)<0$ everywhere. Because the penalty is an expectation under $\rho_{\pi_E}$, a state-action pair $(s,a)$ with high $\rho_{\pi_E}(s,a)$ contributes the most to it. Therefore the regularizer $\psi_{GA}(c)$ lays a heavy penalty on cost functions whose $c(s,a)$ is close to $0$ where $\rho_{\pi_E}(s,a)>0$. </font>

The conjugate of $\psi_{GA}(c)$ is
$$\psi^*_{GA}(\rho_\pi-\rho_{\pi_E})=sup_{c\in C}\sum_{(s,a)}(\rho_\pi(s,a)-\rho_{\pi_E}(s,a))c(s,a)-\psi_{GA}(c)$$

Rearranging the terms in $\psi^*_{GA}(\rho_\pi-\rho_{\pi_E})$ gives

\begin{eqnarray}
\psi^*_{GA}(\rho_\pi-\rho_{\pi_E})&=&\sup_{c\in C}\sum_{(s,a)}(\rho_\pi(s,a)-\rho_{\pi_E}(s,a))c(s,a)-\psi_{GA}(c)\\
&=&\sup_{c\in C} \sum_{(s,a)}\rho_\pi(s,a)c(s,a)-\sum_{(s,a)}\rho_{\pi_E}(s,a)c(s,a)-\sum_{(s,a)}\rho_{\pi_E}(s,a)g(c(s,a))\\
&=&\sup_{c\in C} \sum_{(s,a)}\rho_\pi(s,a)c(s,a)+\sum_{(s,a)}\rho_{\pi_E}(s,a)[-c(s,a)-g(c(s,a))]\\
&=&\sup_{c\in C} \sum_{(s,a)}\rho_\pi(s,a)c(s,a)+\sum_{(s,a)}\rho_{\pi_E}(s,a)[-c(s,a)+c(s,a)+\log(1-e^{c(s,a)})]\\
&=&\sup_{c\in C} \sum_{(s,a)}\rho_\pi(s,a)c(s,a)+\sum_{(s,a)}\rho_{\pi_E}(s,a)\log(1-e^{c(s,a)})
\end{eqnarray}

Let $c(s,a)=\log\frac{1}{1+e^{-\gamma(s,a)}}$, where the logit $\gamma(s,a)\in(-\infty, +\infty)$ (not to be confused with the discount factor) can be the output of an NN model given input $(s,a)$. This guarantees $c(s,a)<0$ and gives $c(s,a)$ the form of a logistic loss. Then $\log(1-e^{c(s,a)})=\log(1-\frac{1}{1+e^{-\gamma(s,a)}})$.

Write $D(s,a)=\frac{1}{1+e^{-\gamma(s,a)}}\in(0,1)$, so that the logistic loss $c(s,a)$ becomes $\log D(s,a)$ and $\log(1-e^{c(s,a)})$ becomes $\log(1-D(s,a))$. Use $D(s,a)$ instead of $c(s,a)$ in the IRL formula.
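
The change of variables can be verified numerically. The sketch below assumes a scalar logit standing in for $\gamma(s,a)$; the helper names `log_sigmoid` and `cost_terms` are hypothetical and chosen for this illustration.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0.0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def cost_terms(logit):
    """Return (c, log(1 - e^c)) for c = log D and D = sigmoid(logit)."""
    c = log_sigmoid(logit)                 # c = log D, always negative
    log_one_minus_d = log_sigmoid(-logit)  # log(1 - D), since 1 - sigmoid(z) = sigmoid(-z)
    return c, log_one_minus_d


# Hypothetical logits, as a network might output for three (s, a) pairs.
terms = [cost_terms(z) for z in (-3.0, 0.0, 2.0)]
```

Computing both terms from the logit avoids evaluating $\log(1-e^{c})$ directly, which loses precision when $c$ is close to $0$.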

$$IRL(\rho_{\pi_E})=\min_{\rho_\pi}\max_{D\in(0,1)^{S\times A}} -\bar{H}(\rho_\pi)+\mathbb{E}_{\rho_\pi} [\log(D(s,a))]+\mathbb{E}_{\rho_{\pi_E}} [\log(1-D(s,a))]$$

This formula resembles the GAN objective if $D(s,a)$ is regarded as the output of a discriminator and $\rho_\pi$ as the output of a generator.
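
As in a GAN, the inner maximization over $D$ can be solved pointwise. The sketch below assumes two hypothetical distributions over four state-action pairs, normalized to sum to one for simplicity; for such normalized inputs the maximum equals $2\,D_{JS}-\log 4$, a standard GAN identity that the code checks.

```python
import math
import random


def discriminator_objective(rho_pi, rho_e, d):
    """sum_{s,a} rho_pi * log D + rho_E * log(1 - D), with (s, a) flattened."""
    return sum(p * math.log(x) + q * math.log(1.0 - x)
               for p, q, x in zip(rho_pi, rho_e, d))


def optimal_discriminator(rho_pi, rho_e):
    """Pointwise maximizer D*(s,a) = rho_pi / (rho_pi + rho_E)."""
    return [p / (p + q) for p, q in zip(rho_pi, rho_e)]


def kl(u, v):
    return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0.0)


def js_divergence(p, q):
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical normalized occupancy measures over four (s, a) pairs.
RHO_PI = [0.4, 0.3, 0.2, 0.1]
RHO_E = [0.1, 0.2, 0.3, 0.4]

d_star = optimal_discriminator(RHO_PI, RHO_E)
best_value = discriminator_objective(RHO_PI, RHO_E, d_star)
js_value = 2.0 * js_divergence(RHO_PI, RHO_E) - math.log(4.0)
```

When the two measures coincide, $D^*=\frac{1}{2}$ everywhere and the discriminator can no longer tell the learner from the expert.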

In addition, when optimizing w.r.t. $\rho_\pi$, parameterize $\pi$ with $\theta$ such that the gradient of the loss becomes
$$\nabla_\theta [-\bar{H}(\rho_{\pi_\theta})+\sum_{(s,a)} \rho_{\pi_\theta}(s,a) c(s,a)]=-\nabla_\theta\bar{H}(\rho_{\pi_\theta})+\mathbb{E}_{(s,a)}[\nabla_\theta \log\pi_\theta(a|s) Q(s,a)]$$
which recovers the policy gradient formula. When the policy gradient problem is solved iteratively, both the expectation $\mathbb{E}_{(s,a)}$ and $Q(s,a)$ can be estimated from trajectories sampled in the last iteration. The paper uses TRPO for this step.
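
The cost part of this gradient can be checked on the simplest possible case. The sketch below assumes a one-step, single-state problem with a softmax policy, so that $Q(s,a)$ reduces to the immediate cost; the parameter vector `THETA` and the cost vector `COST` are hypothetical. It compares the score-function expression with a finite-difference gradient of the expected cost.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_cost(theta, cost):
    return sum(p * c for p, c in zip(softmax(theta), cost))


def score_function_gradient(theta, cost):
    """sum_a pi(a) * grad_theta log pi(a) * Q(a), with Q(a) = cost[a] here."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p_a, q_a) in enumerate(zip(pi, cost)):
        for b in range(len(theta)):
            # d log pi(a) / d theta_b for a softmax policy.
            dlogpi = (1.0 if a == b else 0.0) - pi[b]
            grad[b] += p_a * dlogpi * q_a
    return grad


def finite_difference_gradient(theta, cost, eps=1e-6):
    grad = []
    for b in range(len(theta)):
        up = theta[:b] + [theta[b] + eps] + theta[b + 1:]
        down = theta[:b] + [theta[b] - eps] + theta[b + 1:]
        grad.append((expected_cost(up, cost) - expected_cost(down, cost)) / (2.0 * eps))
    return grad


THETA = [0.3, -0.2, 0.5]  # hypothetical policy logits
COST = [-0.7, -1.2, -0.4]  # hypothetical costs, e.g. values of log D

pg = score_function_gradient(THETA, COST)
fd = finite_difference_gradient(THETA, COST)
```

In the multi-step case the same expression holds with $Q(s,a)$ in place of the immediate cost, and the expectation is replaced by an average over sampled trajectories.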

<h3>Algorithm</h3>

1. Given a set of expert trajectories $\tau_E\sim\pi_E$. 

2. Initialize a parameterized policy $\pi_{\theta_0}$ and a parameterized discriminator $D_{\omega_0}$

3. In iteration $i\in\{0,1,2,\ldots\}$
    * Sample trajectories $\tau_i$ from policy $\pi_{\theta_i}$
    * Update $\omega_i$ to $\omega_{i+1}$ by gradient ascent with gradient 
    $$\mathbb{E}_{\tau_i} [\nabla_\omega \log(D_\omega (s,a))] + \mathbb{E}_{\tau_E}[\nabla_\omega \log(1-D_\omega(s,a))]$$
    * Update $\theta_i$ to $\theta_{i+1}$ by using TRPO with cost function $c(s,a)=\log(D_{\omega_{i+1}}(s,a))$. The gradient of the penalized loss is 
    $$\mathbb{E}_{\tau_i}[\nabla_\theta \log \pi_\theta(a|s)Q(s,a)]-\lambda\nabla_\theta H(\pi_\theta)$$ where $Q(\bar{s},\bar{a})=\mathbb{E}_{\tau_i}[\log(D_{\omega_{i+1}}(s,a))|s_0=\bar{s}, a_0=\bar{a}]$ and $\lambda\geq 0$ weights the entropy term
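
The loop above can be sketched on a toy problem. The code below is a simplified illustration, not the paper's implementation: it assumes a single-state problem with three actions, a tabular discriminator, exact expectations instead of sampled trajectories, and plain gradient descent standing in for TRPO. The expert distribution `EXPERT`, the step sizes and the function name `gail_bandit` are all hypothetical.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def gail_bandit(expert, iters=2000, d_steps=5, lr_d=0.5, lr_pi=0.1, lam=1e-3):
    """Toy GAIL on a single-state problem, using exact expectations."""
    k = len(expert)
    theta = [0.0] * k  # policy logits, pi = softmax(theta)
    omega = [0.0] * k  # discriminator logits, D(a) = sigmoid(omega[a])
    for _ in range(iters):
        pi = softmax(theta)
        # Discriminator: gradient ascent on E_pi[log D] + E_expert[log(1 - D)].
        for _ in range(d_steps):
            for a in range(k):
                d = sigmoid(omega[a])
                omega[a] += lr_d * (pi[a] * (1.0 - d) - expert[a] * d)
        # Policy: gradient descent on E_pi[log D] - lam * H(pi).
        # The entropy term is folded into an effective cost log D + lam * log pi.
        c = [math.log(sigmoid(w)) + lam * math.log(p) for w, p in zip(omega, pi)]
        baseline = sum(p * x for p, x in zip(pi, c))
        for b in range(k):
            theta[b] -= lr_pi * pi[b] * (c[b] - baseline)
    return softmax(theta), [sigmoid(w) for w in omega]


EXPERT = [0.7, 0.2, 0.1]  # hypothetical expert action distribution
learned_pi, learned_d = gail_bandit(EXPERT)
```

In this setting the learned policy moves from the uniform initialization toward the expert distribution. Several discriminator steps per policy step are used so that the discriminator stays close to its optimum for the current policy.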