### First part recap ###

Imitation Learning<br><br>

**Meta-learning**: better classifier by generating better training data from demonstrations. 

- rollin/outs
- DAgger algorithm
- DAgger with rollouts and LoLS

### Part 2: NLP Applications and practical advice

- Applications:
    - dependency parsing 
    - natural language generation
    - semantic parsing

- Practical advice
    - making things faster
    - debugging

<center>
<h2>Applying Imitation Learning on Dependency Parsing</h2>
</center>

### Dependency parsing ###

<img src="images/toBeAnimated/depParse1.png">

Goal: represent the syntax of a sentence as directed, labeled arcs between words.
- Labels represent dependencies between words.

###  Applying Imitation Learning ###

[Goldberg and Nivre 2012](http://www.aclweb.org/anthology/C12-1059) proposed an Imitation Learning system for dependency parsing.
- Very similar to DAgger.

### DAgger Reminder ###

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 75%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; expert\; \pi^{\star}, \; loss \; function \; L,\\
& classifier \; H,\; training\; examples\; \cal E = \emptyset, \; expert\; probability\; \beta=1\\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \color{blue}{\text{set} \; rollin \; policy \; \pi^{in} = \beta \; \pi^{\star} + (1-\beta) H}\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \color{blue}{\text{rollin to predict} \; \hat \alpha_1\dots\hat \alpha_T  = \pi^{in}(\mathbf{x},\mathbf{y})}\\
& \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\
& \quad \quad \quad \color{blue}{\text{ask expert for } \underline{\text{a set of best actions}}\; \{\alpha_{1}^{\star}\dots\alpha_{k}^{\star}\} = \pi^{\star}(\mathbf{x},S_{t-1})} \\
& \quad \quad \quad \text{extract } features=\phi(\mathbf{x},S_{t-1}) \\
& \quad \quad \quad \cal E = \cal E \cup (features,\alpha^{\star})\\
& \quad \text{learn classifier} \; \text{from}\; \cal E\\
& \quad \text{decrease} \; \beta\\
\end{align}
</p>

Multi-label classification<ul><li>The expert policy may return multiple correct actions.</li></ul>
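
A minimal Python sketch of the loop above, on a toy tagging task. The names `dagger`, `featurize`, `expert`, and `fit` are illustrative and not taken from any library. The lookup-table classifier only stands in for a real learner, and the roll-in mixes expert and classifier per action, which is one possible reading of the mixture policy.

```python
import random
from collections import Counter, defaultdict

def featurize(x, state):
    """Hypothetical features: current word and previously taken action."""
    return (x[len(state)], state[-1] if state else None)

def expert(x, y, state):
    """Toy expert: the set of optimal actions is just the gold tag."""
    return {y[len(state)]}

def fit(examples):
    """Stand-in classifier: most frequent expert action per feature tuple."""
    counts = defaultdict(Counter)
    for feats, best_actions in examples:
        counts[feats].update(best_actions)
    table = {f: c.most_common(1)[0][0] for f, c in counts.items()}
    return lambda feats: table.get(feats, "NN")  # "NN" is an arbitrary fallback

def dagger(train, n_iters=3, decay=0.5, seed=0):
    rng = random.Random(seed)
    examples, policy, beta = [], None, 1.0  # beta = expert probability
    for _ in range(n_iters):
        for x, y in train:
            state = []  # actions taken so far during roll-in
            for _word in x:
                feats = featurize(x, state)
                best_actions = expert(x, y, state)
                examples.append((feats, best_actions))  # aggregate, never discard
                if policy is None or rng.random() < beta:
                    state.append(min(best_actions))      # follow the expert
                else:
                    state.append(policy(feats))          # follow the classifier
        policy = fit(examples)
        beta *= decay  # trust the expert less in later iterations
    return policy

train = [(["I", "studied", "in", "London"], ["PRP", "VBD", "IN", "NNP"])]
policy = dagger(train)
print(policy(("studied", "PRP")))
```

Each training example stores the whole set of expert actions, which is what makes the learning problem multi-label.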

To apply Imitation Learning on any task, we need to define:
- Transition system
- Loss function
- Expert policy

### Transition system? ###

We can assume any transition-based system (e.g. Arc-Eager).

<span class="fragment" data-fragment-index="1"> <b>State:</b> arcs, stack, and buffer.</span>
<br>
<br>
<span class="fragment" data-fragment-index="2"> <b>Action space:</b> <br> <span style="font-variant: small-caps;">Shift, Reduce, Arc-Left</span>, and <span style="font-variant: small-caps;">Arc-Right.</span></span>
<ul>
<li class="fragment" data-fragment-index="3"><span style="font-variant: small-caps;">Arc-Left</span> / <span style="font-variant: small-caps;">Arc-Right</span> combine with arc labels, </li>
<li class="fragment" data-fragment-index="4"> but limited #labels in dependency parsing (~50).</li>
</ul>
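
A small sketch of how the labeled action space can be enumerated. The function name `action_space` and the label subset are assumptions for illustration.

```python
def action_space(labels):
    """SHIFT and REDUCE are unlabeled; each arc action is paired with every label."""
    actions = [("SHIFT", None), ("REDUCE", None)]
    for arc_action in ("ARC-LEFT", "ARC-RIGHT"):
        actions.extend((arc_action, label) for label in labels)
    return actions

labels = ["amod", "nsubj", "dobj", "det"]  # illustrative subset of the label set
print(len(action_space(labels)))           # 2 unlabeled + 2 * 4 labeled actions
```

With roughly 50 labels this gives about 2 + 2 × 50 = 102 classes, which is still manageable for a multiclass classifier.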

The length of the transition sequence is variable.
<br>
<br>
<span style="font-variant: small-caps;">Shift -> Shift -> Arc-Right -> Shift -> ... -> Arc-Left</span>
<br>
<ul>
<li class="fragment" data-fragment-index="2"> In proportion to the length of the sentence. </li>
<li class="fragment" data-fragment-index="3">But not fixed.<br>  In what task would it be fixed? <span class="fragment" data-fragment-index="4"><b>POS tagging!</b></span></li>
</ul>

### Transition-based dependency parsing in action! ###

<br>
<img src="images/toBeAnimated/transitionEx_1.png">

<small>	
<b>Stack:</b> -
<br>
<b>Buffer:</b> ROOT, 'economic', 'news', 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'
</small>	

### Transition-based dependency parsing in action! ###

<center><h3><span style="font-variant: small-caps;">Shift</span></h3>
<img src="images/toBeAnimated/transitionEx_1.png">
</center>
<small>	
<b>Stack:</b> ROOT
<br>
<b>Buffer:</b> 'economic', 'news', 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'
</small>	

### Transition-based dependency parsing in action! ###

<center><h3><span style="font-variant: small-caps;">Shift</span></h3>
<img src="images/toBeAnimated/transitionEx_1.png">
</center>
<small>	
<b>Stack:</b> ROOT, 'economic'
<br>
<b>Buffer:</b> 'news', 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'
</small>	

### Transition-based dependency parsing in action! ###

<center><h3><span style="font-variant: small-caps;">Arc-Left:</span> amod</h3>
<img src="images/toBeAnimated/transitionEx_2.png">
</center>
<small>	
<b>Stack:</b> ROOT
<br>
<b>Buffer:</b> 'news', 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'
</small>	

### Transition-based dependency parsing in action! ###

<center><h3><span style="font-variant: small-caps;">Shift</span></h3>
<img src="images/toBeAnimated/transitionEx_2.png">
</center>
<small>	
<b>Stack:</b> ROOT, 'news'
<br>
<b>Buffer:</b> 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'
</small>	

### Transition-based dependency parsing in action! ###

<center><h3><span style="font-variant: small-caps;">Arc-Left:</span> nsubj</h3>
<img src="images/toBeAnimated/transitionEx_3.png">
</center>
<small>	
<b>Stack:</b> ROOT
<br>
<b>Buffer:</b> 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'
</small>	

### Transition-based dependency parsing in action! ###

<center><h3>and so on</h3>
<img src="images/toBeAnimated/depParse1.png">
</center>
<small>	
<b>Stack:</b> ROOT
<br>
<b>Buffer:</b> -
</small>	
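
The first steps of the walkthrough above can be replayed with a few lines of Python. This is a sketch only: the function `apply` and the use of word strings as tokens are assumptions (a real parser would use token indices), and no preconditions are checked.

```python
SHIFT, REDUCE, ARC_LEFT, ARC_RIGHT = "SHIFT", "REDUCE", "ARC-LEFT", "ARC-RIGHT"

def apply(stack, buffer, arcs, action, label=None):
    """Apply one arc-eager transition in place; arcs holds (head, label, dependent)."""
    if action == SHIFT:
        stack.append(buffer.pop(0))
    elif action == REDUCE:
        stack.pop()
    elif action == ARC_LEFT:       # buffer front becomes the head of the stack top
        arcs.add((buffer[0], label, stack.pop()))
    elif action == ARC_RIGHT:      # stack top becomes the head of the buffer front
        arcs.add((stack[-1], label, buffer[0]))
        stack.append(buffer.pop(0))
    else:
        raise ValueError(action)

stack, arcs = [], set()
buffer = ["ROOT", "economic", "news", "had", "little", "effect",
          "on", "financial", "markets", "."]
steps = [(SHIFT, None), (SHIFT, None), (ARC_LEFT, "amod"),
         (SHIFT, None), (ARC_LEFT, "nsubj")]
for action, label in steps:
    apply(stack, buffer, arcs, action, label)
    print(action, label, "| stack:", stack, "| buffer:", buffer)
```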

### Loss function? ###

<b>Hamming loss:</b> given predicted arcs, how many parents were incorrectly predicted?
<br>
<ul>
<li class="fragment" data-fragment-index="2"> Directly corresponds to attachment score metrics used to evaluate dependency parsers. </li>
<li class="fragment" data-fragment-index="3"> Decomposable? <span class="fragment" data-fragment-index="4"><b>Not with this transition model!</b> Cannot score <span style="font-variant: small-caps;">Shift</span> independently of <span style="font-variant: small-caps;">Arc-Right</span>!</span></li>
</ul>
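
A short sketch of the Hamming loss over predicted heads and its relation to the unlabeled attachment score. The sentence, its gold heads, and the function names are assumptions made for the example.

```python
def hamming_loss(predicted_heads, gold_heads):
    """Number of words whose predicted head differs from the gold head."""
    if len(predicted_heads) != len(gold_heads):
        raise ValueError("both head lists must cover the same words")
    return sum(p != g for p, g in zip(predicted_heads, gold_heads))

def unlabeled_attachment_score(predicted_heads, gold_heads):
    """Fraction of words with the correct head: 1 - loss / number of words."""
    return 1.0 - hamming_loss(predicted_heads, gold_heads) / len(gold_heads)

# Assumed example: "She wrote her a letter ." with 1-based word positions, 0 = ROOT.
gold_heads = [2, 0, 2, 5, 2, 2]
predicted_heads = [2, 0, 5, 5, 2, 2]   # 'her' wrongly attached to 'letter'
print(hamming_loss(predicted_heads, gold_heads))
print(unlabeled_attachment_score(predicted_heads, gold_heads))
```

The loss is defined on the final set of arcs, so it says nothing directly about a single <span style="font-variant: small-caps;">Shift</span> or <span style="font-variant: small-caps;">Reduce</span> taken along the way.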

### Expert policy? ###

<center>
<img src="images/oracle-delphi.jpg">
</center>


Returns the best action at the current state by looking at the gold standard assuming future actions are also optimal:

$$\alpha^{\star}=\pi^{\star}(S_t, \mathbf{y}) = \mathop{\arg \min}_{\alpha \in {\cal A}} L(S_t(\alpha,\pi^{\star}),\mathbf{y})$$

### How do we make an expert policy? ###

We can derive a <b>static</b> transition sequence from initial to terminal state using the gold standard.
- Static expert policy.
<br>
<br>
<center>
<span class="fragment" data-fragment-index="1" style="font-variant: small-caps;">Shift -> Shift -> Arc-Right -> Shift -> ... -> Arc-Left</span>
<br>
</center>
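
One way such a static sequence could be derived for Arc-Eager is sketched below. The function `static_oracle`, the stopping rule (stop when the buffer is empty), and the gold heads for the example sentence are assumptions; ties are resolved by a fixed priority order.

```python
def static_oracle(heads):
    """heads[i] is the gold head of token i; token 0 is ROOT (head None)."""
    stack, buffer = [], list(range(len(heads)))
    has_head, actions = set(), []
    while buffer:
        b = buffer[0]
        s = stack[-1] if stack else None
        if s is not None and heads[s] == b:
            actions.append("ARC-LEFT")       # b is the gold head of s
            has_head.add(s)
            stack.pop()
        elif s is not None and heads[b] == s:
            actions.append("ARC-RIGHT")      # s is the gold head of b
            has_head.add(b)
            stack.append(buffer.pop(0))
        elif (s is not None and s in has_head
              and any(heads[b] == k or heads[k] == b for k in stack[:-1])):
            actions.append("REDUCE")         # b still needs something below s
            stack.pop()
        else:
            actions.append("SHIFT")
            stack.append(buffer.pop(0))
    return actions

# ROOT economic news had little effect on financial markets .
# Assumed gold heads, indexed by token position (0 = ROOT).
heads = [None, 2, 3, 0, 5, 3, 5, 8, 6, 3]
print(" -> ".join(static_oracle(heads)))
```

The sequence is fixed once the gold tree is known, which is why it says nothing about states off this path.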

However, a static expert policy is only optimal on states visited through the static transition sequence.
- i.e. it assumes the previous actions (rollin) are optimal.

<center>
<img src="images/stateTransit.png">
</center>

A static expert policy may be sufficient for tasks where we do not care whether the previous actions were optimal.
<ul>
<li class="fragment" data-fragment-index="1"> What would such a task be? <span class="fragment" data-fragment-index="2"> <b>POS tagging.</b></span></li>
</ul>

<table class="fragment" data-fragment-index="1" style="font-size:80%; border-style: hidden; border-collapse: collapse; padding: 50px">
<thead>
<tr>
<th>I</th> 
<th>studied</th>
<th>in</th>
<th>London</th>
<th>with</th>
<th>Sebastian</th>
<th>Riedel</th>
</tr>
</thead>
<tbody>
<tr>
<td>PRP</td>
<td>VBD</td>
<td>IN</td>
<td>NNP</td>
<td><font color="red"><b>VBD</b></font></td>
<td>NNP</td>
<td>NNP</td>
</tr>
</tbody>
</table>
<br>
- If the previous word is tagged incorrectly, the expert policy's suggestion remains the <b>same</b> and <b>optimal</b>!
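
A tiny sketch of why the static expert is enough here, assuming Penn Treebank style tags and a per-word Hamming loss. The function name `pos_expert` is hypothetical.

```python
def pos_expert(gold_tags, predicted_so_far):
    """Static expert for tagging: the best action is the gold tag of the
    next word, whatever was predicted before."""
    return gold_tags[len(predicted_so_far)]

# I studied in London with Sebastian Riedel
gold = ["PRP", "VBD", "IN", "NNP", "IN", "NNP", "NNP"]
after_correct_rollin = ["PRP", "VBD", "IN", "NNP", "IN"]
after_wrong_rollin = ["PRP", "VBD", "IN", "NNP", "VBD"]  # 'with' mis-tagged

# The suggestion for 'Sebastian' is the same after either roll-in.
print(pos_expert(gold, after_correct_rollin))
print(pos_expert(gold, after_wrong_rollin))
```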

### Rollin mistakes in dependency parsing ###

Let's assume that we rollin using the classifier.

<img src="images/stateTransitError.png">

The static policy has not encountered this state before.
<ul>
<li class="fragment" data-fragment-index="10"> Cannot know which action will lead to the optimal terminal state. </li>
<li class="fragment" data-fragment-index="11"> May default to an action (e.g. <span style="font-variant: small-caps;">Shift</span>). </li>
</ul>

<center><h3><span style="font-variant: small-caps; color: green;">Arc-Right:</span> iobj ?</h3>
<img src="images/toBeAnimated/depParse_mistake_1.png">
</center>
<small>	
<b>Stack:</b> 'wrote'
<br>
<b>Buffer:</b> 'her', 'a', 'letter', '.'
</small>	

<center><h3><span style="font-variant: small-caps; color: red;">Shift</span></h3>
<img src="images/toBeAnimated/depParse_mistake_2.png">
</center>
<small>	
<b>Stack:</b> 'wrote', 'her'
<br>
<b>Buffer:</b> 'a', 'letter', '.'
</small>	

<center><h3>Default: <span style="font-variant: small-caps;">Shift</span></h3>
<img src="images/toBeAnimated/depParse_mistake_2.png">
</center>
<small>	
<b>Stack:</b> 'wrote', 'her', 'a'
<br>
<b>Buffer:</b> 'letter', '.'
</small>	

<center><h3><span style="font-variant: small-caps;">Arc-Left: </span>det</h3>
<img src="images/toBeAnimated/depParse3.png">
</center>
<small>	
<b>Stack:</b> 'wrote', 'her'
<br>
<b>Buffer:</b> 'letter', '.'
</small>	

<center><h3>Default: <span style="font-variant: small-caps;">Shift</span></h3>
<img src="images/toBeAnimated/depParse3.png">
</center>
<small>
<b>Stack:</b> 'wrote', 'her', 'letter'
<br>
<b>Buffer:</b> '.'
</small>

<center><h3>Default: <span style="font-variant: small-caps;">Shift</span></h3>
<img src="images/toBeAnimated/depParse3.png">
</center>
<small>	
<b>Stack:</b> 'wrote', 'her', 'letter', '.'
<br>
<b>Buffer:</b> -
</small>	

A static expert policy cannot recover from errors in the rollin.

### Also, what if there are multiple correct transitions? ###

<img src="images/toBeAnimated/depParse_2.png">
<small>	
<b>Stack:</b> 'her'
<br>
<b>Buffer:</b> 'a', 'letter', '.'
</small>	

Two possible actions: <span style="font-variant: small-caps;">Reduce</span> 'her' / <span style="font-variant: small-caps;">Shift</span> 'a'

<span class="fragment" data-fragment-index="1">A static expert policy chooses arbitrarily <br>(e.g. prioritizes shifts over other actions). </span>
<ul>
<li class="fragment" data-fragment-index="2"> But choosing any one action indirectly labels the alternative actions as incorrect!</li>
<li class="fragment" data-fragment-index="3"> Leads to noise in the training signal.</li>
</ul>

### Dynamic expert policy ###
Takes into account the previous actions.<ul><li>Can recover from errors.</li></ul><br><br><span class="fragment" data-fragment-index="1">Allows for multiple optimal actions at each time-step.<ul><br><li class="fragment" data-fragment-index="1">Multi-label classification.</li></ul></span>

### Reachable terminal state ###

Reachable terminal state:
- Can be reached through a sequence of expert actions $\alpha_1^{\star}\dots \alpha_T^{\star}$, and
- no further actions can be taken at that state.

- For an optimal reachable terminal state, $L(S_{final}, \mathbf{y}) = 0$.

### How does a dynamic expert policy work? ###

For each possible action at a time-step:
<font color=white>- determine the reachable terminal state
  - Ideally, we can rollout using the expert policy.</font>
<br>
<br>
<center>
<img src="images/toBeAnimated/depParse3.png">
</center>
<small>	
<b>Stack:</b> 'wrote', 'her'
<br>
<b>Buffer:</b> 'letter', '.'
</small>	

### How does a dynamic expert policy work? ###

For each possible action at a time-step:
<font color=white>- determine the reachable terminal state
  - Ideally, we can rollout using the expert policy.</font>
<br>
<br>
<center><h3><span style="font-variant: small-caps;">Arc-Left? Arc-Right? Reduce? Shift?</span></h3>
<img src="images/expertPolicyRollOut1.png">
</center>
<small>	
<b>Stack:</b> 'wrote', 'her'
<br>
<b>Buffer:</b> 'letter', '.'
</small>	

### How does a dynamic expert policy work? ###

For each possible action at a time-step:
- determine the reachable terminal state
  - Ideally, rollout using the expert policy.
<br>
<br>
<center><h3><span style="font-variant: small-caps;">Arc-Left: </span>det</h3>
<img src="images/expertPolicyRoll2.png">
</center>
<small>	
<b>Stack:</b> 'wrote'
<br>
<b>Buffer:</b> 'letter', '.'
</small>	

### How does a dynamic expert policy work? ###

For each possible action at a time-step:
- determine the reachable terminal state
  - Ideally, rollout using the expert policy.
<br>
<br>
<center><h3><span style="font-variant: small-caps;">Arc-Right: </span>dobj</h3>
<img src="images/expertPolicyRoll3.png">
</center>
<small>	
<b>Stack:</b> 'wrote'
<br>
<b>Buffer:</b> '.'
</small>	

### How does a dynamic expert policy work? ###

For each possible action at a time-step:
- determine the reachable terminal state
  - Ideally, rollout using the expert policy.
<br>
<br>
<center><h3><span style="font-variant: small-caps;">Arc-Right: </span>p</h3>
<img src="images/expertPolicyRoll4.png">
</center>
<small>	
<b>Stack:</b> 'wrote'
<br>
<b>Buffer:</b> -
</small>	

### How does a dynamic expert policy work? ###

For each possible action at a time-step:<ul><li>determine the reachable terminal state,</li><li><font color=white>score it according to the loss function.</font></li></ul>
<font color=white>Return the set of actions that lead to best reachable terminal state.</font>
<br>
<img src="images/determReachStates.png">

### How does a dynamic expert policy work? ###

For each possible action at a time-step:<ul><li>determine the reachable terminal state,</li><li>score it according to the loss function.</li></ul><font color=white>Return the best set of actions: {<span style="font-variant: small-caps;">Arc-Left, Arc-Right</span>}</font>
<br>
<center>
<img src="images/determReachStatesScored.png">
</center>

### How does a dynamic expert policy work? ###

For each possible action at a time-step:
<ul><li>determine the reachable terminal state,</li><li>score it according to the loss function.</li></ul>Return the best set of actions: {<span style="font-variant: small-caps;">Arc-Left</span>}
<br>
<center>
<img src="images/determReachStatesScored.png">
</center>
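
A brute-force sketch of this recipe, not the method of the cited paper: every legal action is followed by an exhaustive search over continuations, which stands in for an expert rollout. The helper names, the stopping rule (buffer empty), keeping ROOT at the bottom of the stack, and the gold heads for 'wrote her a letter .' are all assumptions. The example state is the one with stack top 'her' and buffer 'a', 'letter', '.'.

```python
def legal_actions(stack, buffer, heads):
    """Arc-eager preconditions; token 0 is ROOT, heads maps dependent -> head."""
    if not buffer:
        return []                        # terminal state
    actions = ["SHIFT"]
    if stack:
        actions.append("ARC-RIGHT")
        if stack[-1] in heads:
            actions.append("REDUCE")     # top already has a head
        elif stack[-1] != 0:
            actions.append("ARC-LEFT")   # top still needs a head
    return actions

def step(stack, buffer, heads, action):
    """Return the successor state without modifying the inputs."""
    stack, buffer, heads = list(stack), list(buffer), dict(heads)
    if action == "SHIFT":
        stack.append(buffer.pop(0))
    elif action == "REDUCE":
        stack.pop()
    elif action == "ARC-LEFT":
        heads[stack.pop()] = buffer[0]
    elif action == "ARC-RIGHT":
        heads[buffer[0]] = stack[-1]
        stack.append(buffer.pop(0))
    return stack, buffer, heads

def hamming_loss(heads, gold):
    return sum(heads.get(dep) != head for dep, head in gold.items())

def best_reachable_loss(stack, buffer, heads, gold):
    """Lowest loss over all terminal states reachable from this state."""
    if not buffer:
        return hamming_loss(heads, gold)
    return min(best_reachable_loss(*step(stack, buffer, heads, a), gold)
               for a in legal_actions(stack, buffer, heads))

def dynamic_expert(stack, buffer, heads, gold):
    """Score each legal action of a non-terminal state; return the best set."""
    costs = {a: best_reachable_loss(*step(stack, buffer, heads, a), gold)
             for a in legal_actions(stack, buffer, heads)}
    best = min(costs.values())
    return {a for a, c in costs.items() if c == best}, costs

# 0 ROOT, 1 wrote, 2 her, 3 a, 4 letter, 5 .
gold = {1: 0, 2: 1, 3: 4, 4: 1, 5: 1}
best, costs = dynamic_expert([0, 1, 2], [3, 4, 5], {1: 0, 2: 1}, gold)
print(sorted(best), costs)
```

The search is exponential in the number of remaining words, so it is only usable on toy inputs.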

### Full rollouts are expensive ###
To determine the reachable terminal state:<ul><li>Ideally, rollout using the expert policy.</li><li>Computationally expensive, so use heuristics!</li></ul>

<center><h3><span style="font-variant: small-caps;">Arc-Right:</span> det</h3>
<img src="images/expertPolicyRollOut_5.png" width="70%">
</center>
<small>
<b>Stack:</b> 'wrote', 'her'
<br>
<b>Buffer:</b> '.'
</small>

After this action, 'letter' can no longer have 'wrote' as its head.
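
One such heuristic counts the gold arcs that an action makes unreachable, instead of rolling out. The sketch below loosely follows the arc-decomposition idea behind the Goldberg and Nivre oracle; it is a simplified, unlabeled reading, and the exact conditions in the paper may differ. The function name `action_cost`, the state encoding, and the gold heads are assumptions.

```python
def action_cost(action, stack, buffer, heads, gold):
    """Number of gold arcs that become unreachable after `action`.

    stack and buffer hold token indices, heads maps dependent -> head for
    the arcs built so far, gold maps dependent -> gold head.
    """
    s, b, rest = stack[-1], buffer[0], buffer[1:]
    headless = [k for k in stack if k not in heads]
    if action == "SHIFT":      # b can no longer attach to anything on the stack
        return (gold.get(b) in stack) + sum(gold.get(k) == b for k in headless)
    if action == "REDUCE":     # s loses its dependents still in the buffer
        return sum(gold.get(k) == s for k in buffer)
    if action == "ARC-LEFT":   # s is popped with head b
        lost_head = gold.get(s) != b and gold.get(s) in rest
        return lost_head + sum(gold.get(k) == s for k in buffer)
    if action == "ARC-RIGHT":  # b is pushed with head s
        lost_head = gold.get(b) != s and gold.get(b) in stack[:-1] + rest
        return lost_head + sum(gold.get(k) == b for k in headless)
    raise ValueError(action)

# 0 ROOT, 1 wrote, 2 her, 3 a, 4 letter, 5 .
gold = {1: 0, 2: 1, 3: 4, 4: 1, 5: 1}
stack, buffer, heads = [0, 1, 2], [4, 5], {1: 0, 3: 4}
# The lost arc is wrote -> letter: 'wrote' sits below 'her' on the stack.
print(action_cost("ARC-RIGHT", stack, buffer, heads, gold))
```

The cost is computed from the current stack and buffer alone, so no rollout is needed.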

### Results ###
<img src="images/dependResultBars.png">

### Summary so far ### 

We discussed modifications to the DAgger framework.
- Hard decay schedule after $k$ epochs when determining the roll-in and roll-out policies.
- Using a mix of expert and learned policy during roll-outs.

We showed that dynamic oracles improve on the results of static oracles.