<h1><a href="https://arxiv.org/abs/1606.03476">
Generative Adversarial Imitation Learning</a></h1>
Jonathan Ho, Stefano Ermon


<h2>Summary</h2>
* Reformulated IRL as a dual of an `occupancy measure` matching problem
* Proposed a model-free imitation learning algorithm
    * GAN-like framework
    * Learns a policy directly from data, bypassing the intermediate step of learning a cost function

<h2>Motivation</h2>

* Many IRL algorithms are expensive to run, requiring reinforcement learning in an inner loop.

* IRL learns a cost function but does not directly tell the learner how to act



<h2>Study on IRL</h2>

<h3>Basis</h3>

* State space is $S$; 
* Action space is $A$;
* Dynamics model $P(s'|s,a)$; 
* Cost $c:S\times A\rightarrow \mathbb{R}$


* Maximum causal entropy IRL 

\begin{eqnarray}
&&\max_{c\in C} \Big(\min_{\pi\in\Pi}-H(\pi)+\mathbb{E}_\pi[c(s,a)]\Big)-\mathbb{E}_{\pi_E}[c(s,a)] \\
&=&\max_{c\in C} \min_{\pi\in\Pi}\Big(-H(\pi)+\mathbb{E}_\pi[c(s,a)]-\mathbb{E}_{\pi_E}[c(s,a)]\Big)
\end{eqnarray}

>where $H(\pi)\triangleq \mathbb{E}_\pi[-\log\pi(a|s)]=-\sum_{s,a} \rho_\pi(s,a)\log\pi(a|s)$ is the $\gamma$-discounted causal entropy and $\rho_\pi(s,a)$ is the unnormalized discounted state-action visitation rate (the occupancy measure), which will be explained later.

MaxEnt IRL looks for a cost function $c$ in the family $C$ that assigns low cost to the expert policy and high cost to every other policy, while the policy found by the inner minimization keeps high entropy.

Given a cost function $c$ learnt by IRL, finding the corresponding policy, one with high entropy and low cost, is an RL problem:
$$RL(c)=\arg\min_{\pi\in\Pi}-H(\pi)+\mathbb{E}_\pi[c(s,a)]$$
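
To make $RL(c)$ concrete, here is a minimal Python sketch under a strong simplifying assumption that is not in the paper: a single-state MDP, so that $\rho_\pi(a)=\pi(a)/(1-\gamma)$ and the causal entropy is the Shannon entropy of $\pi$ scaled by $1/(1-\gamma)$. In that special case the minimizer of $-H(\pi)+\mathbb{E}_\pi[c]$ is the softmax of $-c$. The cost vector `c`, the discount `gamma`, and the function names are illustrative.

```python
import math

def objective(pi, c, gamma):
    """-H(pi) + E_pi[c] for a single-state MDP (assumption of this sketch)."""
    scale = 1.0 / (1.0 - gamma)  # sum of gamma^t over t = 0, 1, 2, ...
    neg_entropy = scale * sum(p * math.log(p) for p in pi if p > 0)
    expected_cost = scale * sum(p * ci for p, ci in zip(pi, c))
    return neg_entropy + expected_cost

def softmax_policy(c):
    """Closed-form minimizer in the single-state case: pi(a) proportional to exp(-c(a))."""
    weights = [math.exp(-ci) for ci in c]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical costs for three actions.
c = [0.0, 1.0, 2.0]
gamma = 0.9
best = softmax_policy(c)
```

Cheaper actions receive more probability, but every action keeps some mass because of the entropy term.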

<h3>Reformulate IRL problem</h3>

* Assume that the set of possible cost functions $C$ is the set of all functions $\mathbb{R}^{S\times A}$, rather than affine combinations of feature functions.

* To avoid overfitting on a finite dataset, introduce a convex cost function regularizer $\psi:\mathbb{R}^{S\times A}\rightarrow \overline{\mathbb{R}}$, which maps cost functions to extended real values. The $\psi$-regularized IRL problem is then expressed as

$$IRL_\psi(\pi_E)=\arg\max_{c\in \mathbb{R}^{S\times A}} -\psi(c)+\Big(\min_{\pi\in\Pi}-H(\pi)+\mathbb{E}_\pi[c(s,a)]\Big)-\mathbb{E}_{\pi_E}[c(s,a)]$$

* Assume that the cost function $\bar{c}$ is one solution of $IRL_\psi(\pi_E)$. To characterize $RL(\bar{c})$, define the occupancy measure of a policy as $\rho_\pi(s,a)=\pi(a|s)\sum^\infty_{t=0} \gamma^t P(s_t=s|\pi)$.
> It is not a probability distribution over state-action pairs because it is discounted and unnormalized: it sums to $1/(1-\gamma)$ rather than 1.

* Then use $\mathbb{E}_\pi[c(s,a)]=\sum_{s,a}\rho_\pi(s,a)c(s,a)$ to evaluate the cost of a policy under a given cost function.

* The convex conjugate of a function $f:\mathbb{R}^{S\times A}\rightarrow\overline{\mathbb{R}}$ is $f^*:\mathbb{R}^{S\times A}\rightarrow\overline{\mathbb{R}}$, with $f^*(x)=\sup_{y\in\mathbb{R}^{S\times A}} x^Ty-f(y)$.

* Let the convex conjugate of $\psi$, evaluated at $\rho_\pi-\rho_{\pi_E}$, be $\psi^*(\rho_\pi-\rho_{\pi_E})=\sup_{c\in\mathbb{R}^{S\times A}}\big(\sum_{s,a}(\rho_\pi(s,a) - \rho_{\pi_E}(s,a))c(s,a) - \psi(c)\big)$, then 
$$RL\circ IRL_\psi(\pi_E)=\arg\min_{\pi\in\Pi} - H(\pi)+\psi^*(\rho_\pi-\rho_{\pi_E})$$
> This equation holds because the min-max problem in $RL\circ IRL_\psi$ has a saddle point (the paper gives the proof).

* Therefore, $\psi$-regularized IRL seeks a policy whose occupancy measure $\rho_\pi$ is close to the expert's $\rho_{\pi_E}$, with closeness measured by $\psi^*$. Different choices of $\psi$ lead to different imitation learning algorithms.

* Suppose $\rho_{\pi_E}>0$. If $\psi$ is a constant function, $\tilde{c}\in IRL_\psi(\pi_E)$, and $\tilde{\pi}\in RL(\tilde{c})$, then $\rho_{\tilde{\pi}}=\rho_{\pi_E}$.
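> * $\psi$ being constant means that 
\begin{eqnarray}
\psi^*(\rho_\pi-\rho_{\pi_E})&=&\sup_{c\in\mathbb{R}^{S\times A}}\Big(\sum_{s,a}(\rho_\pi(s,a) - \rho_{\pi_E}(s,a))c(s,a) - \psi(c)\Big)\\
&=&- \psi+\sup_{c\in\mathbb{R}^{S\times A}}\sum_{s,a}(\rho_\pi(s,a) - \rho_{\pi_E}(s,a))c(s,a)
\end{eqnarray}
> The supremum is $+\infty$ unless $\rho_\pi=\rho_{\pi_E}$, so the minimization over $\pi$ forces the two occupancy measures to match.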
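>* It can be proved that if $\rho\in D$, where $D=\{\rho:\ \rho\geq 0,\ \sum_a \rho(s,a)=p_0(s)+\gamma\sum_{s',a}P(s|s',a)\rho(s',a)\ \forall s\in S\}$ is the set of valid occupancy measures and $p_0$ is the initial state distribution, then $\rho$ is the occupancy measure of $\pi_\rho(a|s)\triangleq \rho(s,a)/\sum_{a'}\rho(s,a')$, and $\pi_\rho$ is the only policy whose occupancy measure is $\rho$. The correspondence between policies and occupancy measures is therefore one-to-one.
>* Let $\bar{H}(\rho)=-\sum_{s,a}\rho(s,a)\log\big(\rho(s,a)/\sum_{a'}\rho(s,a')\big)$. Then $\bar{H}$ is strictly concave and $H(\pi)=\bar{H}(\rho_\pi)$ (the paper provides the proof).
>* Define $L(\pi,c)=-H(\pi)+\mathbb{E}_\pi[c(s,a)]-\mathbb{E}_{\pi_E}[c(s,a)]$ and $\bar{L}(\rho, c)=-\bar{H}(\rho)+\sum_{s,a}(\rho(s,a)-\rho_E(s,a))c(s,a)$; then $L(\pi_\rho, c)=\bar{L}(\rho, c)$.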

* With constant $\psi$, the IRL problem can be transformed into
\begin{eqnarray}
IRL_\psi(\pi_E)&=& \arg\max_{c\in\mathbb{R}^{S\times A}} \min_{\pi\in\Pi} -H(\pi)+\mathbb{E}_\pi [c(s,a)]-\mathbb{E}_{\pi_E} [c(s,a)]-\psi\\
&=&\arg\max_{c\in\mathbb{R}^{S\times A}}\min_{\rho\in D}-\bar{H}(\rho)+\sum_{s,a} (\rho(s,a)-\rho_E(s,a))c(s,a)\\
&=& \arg\max_{c\in\mathbb{R}^{S\times A}} \min_{\rho\in D} \bar{L}(\rho, c)\\
&\Rightarrow& \min_{\rho\in D} -\bar{H}(\rho)\quad \text{s.t.}\ \rho(s,a)=\rho_E(s,a)\ \forall s\in S, a\in A
\end{eqnarray}
> The IRL formulation is actually a dual problem. The primal optimization problem is to maximize causal entropy subject to the solution's occupancy measure being equal to the expert policy's; the costs $c(s,a)$ play the role of the dual variables of the equality constraints.


* Classic IRL algorithms perform a kind of `dual ascent` and solve RL in an inner loop. At each step the dual variables, which are the costs, are held fixed and RL finds the policy that minimizes the Lagrangian; the costs are then updated. At the dual optimum, the resulting policy is the primal optimum and matches the expert's occupancy measure.

* This paper treats IRL as finding a policy that matches the expert's occupancy measure, rather than finding a cost function under which the expert policy is optimal.
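
The occupancy-measure definitions in this section can be checked numerically on a toy problem. The following is a minimal Python sketch, not taken from the paper: the two-state, two-action MDP (`P`, `p0`), the policy `pi`, and the function names are hypothetical. It computes $\rho_\pi$ by iterating the Bellman flow equation that defines $D$, then recovers the policy through $\pi_\rho(a|s)=\rho(s,a)/\sum_{a'}\rho(s,a')$ and evaluates $\mathbb{E}_\pi[c(s,a)]=\sum_{s,a}\rho_\pi(s,a)c(s,a)$.

```python
def occupancy_measure(P, pi, p0, gamma, iters=2000):
    """rho_pi(s, a) = pi(a|s) * sum_t gamma^t P(s_t = s | pi), by fixed-point iteration.

    P[s_prev][a][s] is the probability of reaching s from s_prev under action a.
    """
    n_s, n_a = len(pi), len(pi[0])
    d = list(p0)  # discounted state visitation, refined below
    for _ in range(iters):
        d = [p0[s] + gamma * sum(P[sp][a][s] * pi[sp][a] * d[sp]
                                 for sp in range(n_s) for a in range(n_a))
             for s in range(n_s)]
    return [[pi[s][a] * d[s] for a in range(n_a)] for s in range(n_s)]

def policy_from_occupancy(rho):
    """pi_rho(a|s) = rho(s, a) / sum_a' rho(s, a')."""
    return [[x / sum(row) for x in row] for row in rho]

def expected_cost(rho, c):
    """E_pi[c(s, a)] written as the inner product of rho and c."""
    return sum(rho[s][a] * c[s][a] for s in range(len(rho)) for a in range(len(rho[0])))

# Hypothetical toy MDP and policy.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
pi = [[0.7, 0.3], [0.4, 0.6]]
p0 = [1.0, 0.0]
gamma = 0.9

rho = occupancy_measure(P, pi, p0, gamma)
total_mass = sum(sum(row) for row in rho)  # close to 1 / (1 - gamma), not 1
recovered = policy_from_occupancy(rho)     # close to pi
```

Fixed-point iteration is used here only because it needs no linear algebra library; solving the linear flow equations directly would give the same $\rho_\pi$.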




<h2>Choose the Regularizer</h2>

* Taking $\psi$ constant gives a primal problem that is not useful in practice. The expert's occupancy measure is only estimated from a finite set of demonstrations, so in a large environment it is zero for most state-action pairs, and exact matching would force the learned policy never to visit those unseen pairs simply because of a lack of data.

* Learning a parameterized policy $\pi_\theta$ with function approximation is also impractical under that many constraints, one per state-action pair.

* The constrained IRL optimization problem can be relaxed into a penalized optimization problem
$$\min_\pi d_\psi(\rho_\pi, \rho_E)-H(\pi)$$
where $d_\psi(\rho_\pi, \rho_E)\triangleq\psi^*(\rho_\pi-\rho_E)$

* Apprenticeship learning uses a heavy penalty and therefore ignores the $-H(\pi)$ term. However, its cost function is restricted to linear combinations of features, and it is difficult to design features such that the cost function is expressive enough to explain the expert's occupancy measure.
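
As a worked instance of the penalty $d_\psi$ above, here is a short derivation under one assumption: $\psi$ is taken to be the indicator function $\delta_C$ of a restricted cost class $C$, with $\delta_C(c)=0$ if $c\in C$ and $+\infty$ otherwise, which is how the paper casts apprenticeship learning. The conjugate then reduces to the largest gap over $C$ between the learner's and the expert's expected cost:

$$\delta_C^*(\rho_\pi-\rho_E)=\sup_{c\in\mathbb{R}^{S\times A}}\Big(\sum_{s,a}(\rho_\pi(s,a)-\rho_E(s,a))c(s,a)-\delta_C(c)\Big)=\sup_{c\in C}\ \mathbb{E}_\pi[c(s,a)]-\mathbb{E}_{\pi_E}[c(s,a)]$$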


<h2>Generative Adversarial Imitation Learning</h2>

* Define a new regularizer
$$\psi_{GA}(c)\triangleq\left\{
        \begin{array}{ll}
            \mathbb{E}_{\pi_E}[g(c(s,a))] & \quad c<0 \\
           +\infty & \quad \text{otherwise}
        \end{array}
    \right.$$
where
$$g(x)=\left\{
        \begin{array}{ll}
            -x-\log(1-e^x) & \quad x<0 \\
           +\infty & \quad \text{otherwise}
        \end{array}
    \right.$$
    
> On $x<0$, $g$ is convex, attains its minimum $2\log 2$ at $x=-\log 2$, and tends to $+\infty$ as $x\to 0^-$. So $\psi_{GA}$ places a low penalty on cost functions that assign a moderate amount of negative cost to expert state-action pairs, and a heavy penalty on those that assign the expert costs close to zero, the upper bound of feasible costs. Because the penalty is an expectation under $\pi_E$, pairs with high $\rho_E(s,a)$ carry the most weight.

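* The paper claims that $\psi_{GA}$ is an average over expert data, and therefore can adjust to arbitrary expert datasets. By contrast, the indicator regularizers used by linear apprenticeship learning are fixed and cannot adapt to the data.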

* The convex conjugate of $\psi_{GA}$, evaluated at $\rho_\pi-\rho_{\pi_E}$, is
$$\psi^*_{GA}(\rho_\pi-\rho_{\pi_E})=\sup_{D\in(0,1)^{S\times A}} \mathbb{E}_\pi [\log(D(s,a))]+\mathbb{E}_{\pi_E} [\log(1-D(s,a))]$$
where $D:S\times A\rightarrow(0,1)$ is a discriminative classifier whose output is the predicted probability that $(s,a)$ was generated by $\pi$ rather than by the expert.

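* At the optimal discriminator, this objective equals, up to a constant shift and scaling, the Jensen-Shannon divergence $D_{JS}(\bar\rho_\pi, \bar\rho_{\pi_E})=D_{KL}(\bar\rho_\pi\|(\bar\rho_\pi + \bar\rho_{\pi_E})/2) +D_{KL}(\bar\rho_{\pi_E}\|(\bar\rho_\pi+\bar\rho_{\pi_E})/2)$ between the normalized occupancy distributions $\bar\rho_\pi=(1-\gamma)\rho_\pi$ and $\bar\rho_{\pi_E}=(1-\gamma)\rho_{\pi_E}$. The normalization is needed because occupancy measures sum to $1/(1-\gamma)$ rather than 1.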

* Therefore, treating the causal entropy as a policy regularizer weighted by $\lambda\geq 0$, the optimization problem to solve is
\begin{eqnarray}
&&\min_\pi \psi^*_{GA}(\rho_\pi -\rho_{\pi_E})-\lambda H(\pi)\\
&\Rightarrow& \min_\pi D_{JS}(\rho_\pi, \rho_{\pi_E}) - \lambda H(\pi)\\
&\Rightarrow& \min_\pi \max_D \mathbb{E}_\pi [\log(D(s,a))]+\mathbb{E}_{\pi_E} [\log(1-D(s,a))] -\lambda H(\pi)
\end{eqnarray}

* To fit the GAN framework, regard the learnt policy's occupancy measure $\rho_\pi$ as the data distribution produced by the generative model $G$, and regard $\rho_E$ as the true data distribution.
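
Two claims in this section are easy to check numerically: the shape of $g$, and the link between the discriminator objective and the Jensen-Shannon divergence. The Python sketch below is illustrative only. The distributions `rho_pi` and `rho_e` are hypothetical normalized occupancy distributions over three state-action pairs, and the tabular discriminator `d_star` uses the pointwise maximizer $D^*=\rho_\pi/(\rho_\pi+\rho_E)$, a standard GAN result that these notes do not derive. With the unscaled $D_{JS}$ written above, the optimal objective works out to $D_{JS}-\log 4$.

```python
import math

def g(x):
    """g from the definition of psi_GA; +inf outside x < 0."""
    if x >= 0:
        return math.inf
    return -x - math.log(1.0 - math.exp(x))

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js_unscaled(p, q):
    """KL(p || m) + KL(q || m) with m = (p + q) / 2, without the usual 1/2 factor."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return kl(p, m) + kl(q, m)

def discriminator_objective(p, q, d):
    """E_p[log D] + E_q[log(1 - D)] for a tabular discriminator d."""
    return sum(a * math.log(di) + b * math.log(1.0 - di)
               for a, b, di in zip(p, q, d))

# Hypothetical normalized occupancy distributions (learner, expert).
rho_pi = [0.5, 0.3, 0.2]
rho_e = [0.2, 0.3, 0.5]

d_star = [a / (a + b) for a, b in zip(rho_pi, rho_e)]
best = discriminator_objective(rho_pi, rho_e, d_star)
shifted_js = js_unscaled(rho_pi, rho_e) - math.log(4.0)  # should match best
g_min = g(-math.log(2.0))                                # minimum of g, 2 log 2
```

When the learner matches the expert exactly, `d_star` is 0.5 everywhere and the objective bottoms out at $-\log 4$, so minimizing over $\pi$ pushes $\rho_\pi$ toward $\rho_E$.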

<h3>Algorithm</h3>

1. Given a set of expert trajectories $\tau_E\sim\pi_E$. 

2. Initialize a parameterized policy $\pi_{\theta_0}$ and a parameterized discriminator $D_{\omega_0}$

3. For each iteration $i\in\{0,1,2,\ldots\}$:
    * Sample trajectories $\tau_i$ from policy $\pi_{\theta_i}$
    * Update the discriminator parameters from $\omega_i$ to $\omega_{i+1}$ with the gradient 
    $$\mathbb{E}_{\tau_i} [\nabla_\omega \log(D_\omega (s,a))] + \mathbb{E}_{\tau_E}[\nabla_\omega \log(1-D_\omega(s,a))]$$
    * Update the policy parameters from $\theta_i$ to $\theta_{i+1}$ with a TRPO step using the cost function $c(s,a)=\log(D_{\omega_{i+1}}(s,a))$. The gradient of the penalized loss is 
    $$\mathbb{E}_{\tau_i}[\nabla_\theta \log \pi_\theta(a|s)Q(s,a)]-\lambda\nabla_\theta H(\pi_\theta)$$ where $Q(\bar{s},\bar{a})=\mathbb{E}_{\tau_i}[\log(D_{\omega_{i+1}}(s,a))\,|\,s_0=\bar{s}, a_0=\bar{a}]$
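
The discriminator half of step 3 can be sketched in a few lines of plain Python. This is an illustration under assumptions that are not in the paper: each state-action pair is summarized by a single scalar feature, the discriminator is a logistic model $D_\omega(x)=\sigma(\omega_0+\omega_1 x)$, the sample lists `policy_x` and `expert_x` are made up, and the update is repeated on a fixed batch so the effect is visible. The TRPO policy update is omitted; only the cost $c(s,a)=\log D_\omega(s,a)$ that would be handed to it is computed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def discriminator(omega, x):
    """D_omega for a scalar feature x of a state-action pair (hypothetical model)."""
    return sigmoid(omega[0] + omega[1] * x)

def discriminator_step(omega, policy_x, expert_x, lr):
    """One gradient-ascent step on E_tau_i[log D] + E_tau_E[log(1 - D)]."""
    grad = [0.0, 0.0]
    for x in policy_x:
        d = discriminator(omega, x)
        # Gradient of log D with respect to omega is (1 - D) * (1, x).
        grad[0] += (1.0 - d) / len(policy_x)
        grad[1] += (1.0 - d) * x / len(policy_x)
    for x in expert_x:
        d = discriminator(omega, x)
        # Gradient of log(1 - D) with respect to omega is -D * (1, x).
        grad[0] -= d / len(expert_x)
        grad[1] -= d * x / len(expert_x)
    return [omega[0] + lr * grad[0], omega[1] + lr * grad[1]]

def cost(omega, x):
    """Cost signal for the policy update: c = log D_omega."""
    return math.log(discriminator(omega, x))

# Made-up scalar features of sampled and expert state-action pairs.
policy_x = [1.0, 1.2, 0.8, 1.1]
expert_x = [-1.0, -0.9, -1.2, -0.8]

omega = [0.0, 0.0]
for _ in range(200):
    omega = discriminator_step(omega, policy_x, expert_x, lr=0.5)
```

After training, $D_\omega$ is high on the learner's samples and low on the expert's, so $\log D_\omega$ is a lower cost on expert-like pairs, which is what steers the policy toward the expert's occupancy measure.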