<center>
<h2>Applying Imitation Learning on Semantic Parsing</h2>
<h3>[Goodman et al. 2016](http://aclweb.org/anthology/P16-1001)</h3>
</center>

### Semantic parsing ###

Semantic parsers map natural language to meaning representations.
- Need to abstract over syntactic phenomena, 
- resolve anaphora, 
- and eliminate ambiguity in language.

- Essentially the inverse of natural language generation (NLG).

### Abstract meaning representation ###
###### ([Banarescu et al. 2013](http://www.aclweb.org/anthology/W13-2322)) ###### 

<div style="width:60%; float:left;">
<br>
<br>
<br>
An MR formalism where concept <br><br>relations are represented in a DAG.
<ul><li class="fragment" data-fragment-index="1">Abstracts away from function words and inflection.</li>
<br><li class="fragment" data-fragment-index="2">Transition-based approaches are common.</li></ul>
<br>
<br>
<span class="fragment" data-fragment-index="3">[AMR tutorial by Schneider et al. 2015](https://github.com/nschneid/amr-tutorial/tree/master/slides)</span>
</div>

<img src="images/amrExample.png" style="width:40%; float: right;">

### Transition system? ###

We take a dependency tree as input.
- The dependency tree is derived from the input sentence.

<span class="fragment" data-fragment-index="1"><b>State:</b> nodes, arcs, $\sigma$ and $\beta$ stacks.</span>
<ul><li class="fragment" data-fragment-index="3">In intermediate states, nodes may be labeled with either words or AMR concepts.</li></ul>
<br><br><br>

<center><h3><span style="font-variant: small-caps; color:white;"> E </span></h3>
<img src="images/toBeAnimated/parse2amr_1.png"></center>
<font size="5">
<b>$\sigma$:</b> struck, by, attacks, cyber, in, 2007
<br>
<b>$\beta$:</b> -
</font>	

<center><h3><span style="font-variant: small-caps;">Insert:</span> date-entity</h3>
<img src="images/toBeAnimated/parse2amr_2.png"></center>
<font size="5">
<b>$\sigma$:</b> struck, by, attacks, cyber, in, date-entity
<br>
<b>$\beta$:</b> -
</font>	

<center><h3><span style="font-variant: small-caps;">ReplaceHead:</span> in</h3>
<img src="images/toBeAnimated/parse2amr_3.png"></center>
<font size="5">
<b>$\sigma$:</b> struck, by, attacks, cyber
<br>
<b>$\beta$:</b> -
</font>	

<center><h3><span style="font-variant: small-caps;">Reattach:</span> date-entity</h3>
<img src="images/toBeAnimated/parse2amr_4.png"></center>
<font size="5">
<b>$\sigma$:</b> struck, by, attacks
<br>
<b>$\beta$:</b> -
</font>	

<center><h3><span style="font-variant: small-caps;">ReplaceHead:</span> by </h3>
<img src="images/toBeAnimated/parse2amr_5.png"></center>
<font size="5">
<b>$\sigma$:</b> -
<br>
<b>$\beta$:</b> -
</font>	

### Action space ###

<div style="width:70%; float:left;">
<br>
<br>
Actions combine with labels<br><br>(PropBank framesets).<ul><li class="fragment" data-fragment-index="4">Number of labels on the order of 10<sup>3</sup> to 10<sup>4</sup>.</li><li class="fragment" data-fragment-index="5">Performing rollouts for all actions may be time-consuming.</li></ul></div>
<img src="images/amrActions.png" style="width:30%; float: right;">

The length of the transition sequence is variable. <ul><li>In the range of 50-200 actions.</li><li class="fragment" data-fragment-index="2">Need to prevent cycles between state transitions!</li>
<br>
<span class="fragment" data-fragment-index="2" style="font-variant: small-caps;">... -> Swap($e_i$, $e_j$) -> Swap($e_j$, $e_i$) -> ...</span></ul>
<br>
<br>
<span class="fragment" data-fragment-index="3">Also, the transition system must preserve acyclicity and <br><br> full connectivity in the graph.</span>
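
The Swap cycle above can be avoided by filtering the candidate actions against the transition history. The following is a minimal sketch of one possible constraint, not necessarily the one used by the parser; the `(name, args)` action encoding and the function name `allowed_actions` are assumptions made for illustration.

```python
def allowed_actions(candidates, history):
    """Drop Swap actions whose pair of nodes has already been swapped.

    Actions are hypothetical (name, args) tuples, e.g. ("Swap", ("e1", "e2")).
    """
    # A frozenset ignores argument order, so Swap(e_i, e_j) and
    # Swap(e_j, e_i) count as the same pair.
    swapped = {frozenset(args) for name, args in history if name == "Swap"}
    return [(name, args) for name, args in candidates
            if not (name == "Swap" and frozenset(args) in swapped)]


history = [("Swap", ("e1", "e2"))]
candidates = [("Swap", ("e2", "e1")),
              ("Swap", ("e1", "e3")),
              ("Insert", ("date-entity",))]
remaining = allowed_actions(candidates, history)
```

Here the reverse swap of `e1` and `e2` is removed, while the other two candidates remain available.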

### Loss function? ###

<b>Smatch</b> ([Cai and Knight, 2013](http://amr.isi.edu/smatch-13.pdf))
<br>
<ul>
<li> F<sub>1</sub> score between the predicted and gold-standard AMR graphs. </li>
<li class="fragment" data-fragment-index="1">Calculates all possible mappings of nodes.</li>
<li class="fragment" data-fragment-index="2">Computationally expensive for every rollout<br>(NP-complete).</li>
</ul>

<b>Naive Smatch</b> employs heuristics.
<br>
<ul>
<li class="fragment" data-fragment-index="2"> How many labels and edges appear in only one of the predicted and gold-standard AMRs? </li>
<li class="fragment" data-fragment-index="3"> Decomposable? <span class="fragment" data-fragment-index="4"><b>No!</b></span></li>
<br><br>
<li class="fragment" data-fragment-index="7"> To encourage short sequences, <br> a length penalty is applied to the loss.</li>
</ul>
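
As a rough illustration of the Naive Smatch idea, the sketch below counts the labels and edges that appear in only one of the two graphs and adds a length penalty. The set-based graph representation, the linear form of the penalty, the `length_penalty` weight, and the toy graphs are all assumptions of this sketch, not the paper's implementation.

```python
def naive_smatch_loss(pred_labels, pred_edges, gold_labels, gold_edges,
                      num_actions=0, length_penalty=0.0):
    """Count labels and edges present in only one of the two graphs.

    Hypothetical representation: labels are a set of concept strings and
    edges are a set of (head_label, relation, dependent_label) triples.
    """
    # Symmetric difference: items found in exactly one of the two sets.
    label_mismatches = len(set(pred_labels) ^ set(gold_labels))
    edge_mismatches = len(set(pred_edges) ^ set(gold_edges))
    # Assumed penalty form: linear in the number of actions taken.
    return label_mismatches + edge_mismatches + length_penalty * num_actions


# Toy graphs, made up for illustration.
gold_labels = {"strike-01", "attack", "cyber", "date-entity"}
gold_edges = {("strike-01", "ARG0", "attack"),
              ("attack", "mod", "cyber"),
              ("strike-01", "time", "date-entity")}
pred_labels = {"strike-01", "attack", "cyber", "in"}
pred_edges = {("strike-01", "ARG0", "attack"),
              ("attack", "mod", "cyber"),
              ("strike-01", "time", "in")}

loss = naive_smatch_loss(pred_labels, pred_edges, gold_labels, gold_edges,
                         num_actions=10, length_penalty=0.1)
```

No search over node mappings is needed, which is what makes this cheap enough to evaluate after every rollout.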

### Expert policy? ###

A set of heuristic rules, based on alignments between nodes in the dependency tree and the AMR graph.
<ul>
<li>Mapped nodes and edges may need to be renamed.</li>
<li>Unmapped nodes may need to be inserted or deleted.</li>
<br><br>
<li class="fragment" data-fragment-index="5"> Suboptimal? <span class="fragment" data-fragment-index="6"><b>Yes!</b></span></li>
</ul>

### V-DAgger reminder ###

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 70%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; expert\; \pi^{\star}, \; loss \; function \; L, \\
& \quad \quad \quad learning\; rate\; p\\
& \text{set} \; training\; examples\; \cal E = \emptyset \\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \color{red}{\beta = (1 - p)^i}\\
& \quad \color{red}{\text{set} \; rollin/out \; policy \; \pi^{in/out} = (1-\beta) H + \beta \pi^{\star}}\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \text{rollin to predict} \; \hat \alpha_1\dots\hat \alpha_T  = \pi^{in/out}(\mathbf{x},\mathbf{y})\\
& \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\
& \quad \quad \quad \mathbf{for} \; \alpha \in {\cal A} \; \mathbf{do}\\
& \quad \quad \quad \quad \text{rollout} \; S_{final} = \pi^{in/out}(S_{t-1}, \alpha, \mathbf{x})\\
& \quad \quad \quad \quad cost\; c_{\alpha}=L(S_{final}, \mathbf{y})\\
& \quad \quad \quad \text{extract } features=\phi(\mathbf{x}, S_{t-1}) \\
& \quad \quad \quad {\cal E} = {\cal E} \cup (features,\mathbf{c})\\
& \quad \text{learn classifier} \; \text{from}\; \cal E\\
\end{align}
</p>
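
The loop above can be sketched on a toy tagging task. Everything task-specific here is an assumption for illustration: the two-tag action set, the Hamming loss, the single-token features, and the `learn` function, which stands in for a real cost-sensitive classifier by picking the action with the lowest accumulated cost.

```python
import random

ACTIONS = ["X", "Y"]  # toy action set (hypothetical)


def step(state, action):
    return state + (action,)


def is_final(state, x):
    return len(state) == len(x)


def expert(state, x, y):
    return y[len(state)]  # the gold tag for the next token


def loss(final_state, y):
    return sum(p != g for p, g in zip(final_state, y))  # Hamming loss


def features(x, state):
    return x[len(state)]  # just the next token


def learn(examples):
    # Stand-in for a cost-sensitive classifier: per feature value,
    # choose the action with the lowest accumulated cost.
    totals = {}
    for feat, costs in examples:
        bucket = totals.setdefault(feat, {a: 0.0 for a in ACTIONS})
        for a, c in costs.items():
            bucket[a] += c

    def policy(state, x):
        bucket = totals.get(features(x, state))
        return min(bucket, key=bucket.get) if bucket else ACTIONS[0]

    return policy


def v_dagger(train, p=0.3, epochs=3, seed=0):
    rng = random.Random(seed)
    examples, learned = [], None
    for i in range(epochs):
        beta = (1 - p) ** i  # probability of following the expert

        def mixed(state, x, y):
            # Step-level mixture of the expert and the learned policy.
            if learned is None or rng.random() < beta:
                return expert(state, x, y)
            return learned(state, x)

        def complete(state, x, y):
            while not is_final(state, x):
                state = step(state, mixed(state, x, y))
            return state

        for x, y in train:
            # Roll in: record every non-final state the mixed policy visits.
            visited, state = [], ()
            while not is_final(state, x):
                visited.append(state)
                state = step(state, mixed(state, x, y))
            # Roll out once per action from each visited state.
            for s in visited:
                costs = {a: loss(complete(step(s, a), x, y), y) for a in ACTIONS}
                examples.append((features(x, s), costs))
        learned = learn(examples)
    return learned


def decode(policy, x):
    state = ()
    while not is_final(state, x):
        state = step(state, policy(state, x))
    return list(state)


train = [(["a", "b", "a"], ["X", "Y", "X"]), (["b", "b"], ["Y", "Y"])]
policy = v_dagger(train)
```

The inner loop over `ACTIONS` is the expensive part: with thousands of actions and long sequences, one rollout per action per time-step quickly becomes impractical.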

### Exploration variations ###

<center>
<img src="images/targetedExploration.png">
</center>

Rollouts span 50-200 time-steps, each with 10<sup>3</sup> to 10<sup>4</sup> candidate actions.
<br><br>
<span class="fragment" data-fragment-index="1"><b>Partial exploration</b> is used by SCB-LOLS ([Chang et al., 2015](https://arxiv.org/pdf/1502.02206.pdf)).</span><ul><li class="fragment" data-fragment-index="1">Randomly select time-steps and actions to roll out.</li></ul><br><br><br><br>

### Exploration variations ###

<center>
<img src="images/targetedExploration.png">
</center>

Rollouts span 50-200 time-steps, each with 10<sup>3</sup> to 10<sup>4</sup> candidate actions.
<br><br>
<span><b>Targeted exploration</b> is used by [Goodman et al. 2016](http://aclweb.org/anthology/P16-1001):</span><ul><li class="fragment" data-fragment-index="3">Perform rollouts only for the expert policy action,</li><li class="fragment" data-fragment-index="4">and actions scored within a threshold $\tau$ from the best.</li><br><li class="fragment" data-fragment-index="5">In the first epoch (no classifier), roll out randomly chosen actions.</li></ul>
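
A minimal sketch of the selection rule, assuming classifier scores where higher is better. The function name `actions_to_explore`, the `num_random` parameter, and the choice to always include the expert action in the first epoch are assumptions of this sketch.

```python
import random


def actions_to_explore(actions, expert_action, scores=None, tau=0.1,
                       num_random=2, rng=None):
    """Choose which actions to roll out from the current state.

    scores: dict mapping action -> classifier score (higher is better),
    or None when no classifier has been trained yet (first epoch).
    """
    if scores is None:
        # First epoch: no classifier, so sample a few actions at random.
        rng = rng or random.Random(0)
        others = [a for a in actions if a != expert_action]
        sample = rng.sample(others, min(num_random, len(others)))
        return {expert_action, *sample}
    best = max(scores[a] for a in actions)
    # Keep actions scored within tau of the best, plus the expert action.
    chosen = {a for a in actions if best - scores[a] <= tau}
    chosen.add(expert_action)
    return chosen


actions = ["a", "b", "c", "d"]
scores = {"a": 1.0, "b": 0.5, "c": 0.2, "d": 0.95}
targeted = actions_to_explore(actions, expert_action="c", scores=scores, tau=0.1)
first_epoch = actions_to_explore(actions, expert_action="c")
```

With these toy scores, only `a` and `d` fall within the threshold, and `c` is added because it is the expert action.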

### Targeted exploration results ###
<h5>[Goodman et al. 2016](http://aclweb.org/anthology/P16-1001)</h5>
<br><br>
<center>
<img src="images/amr_targetedExploration_results.png" width="90%">
</center>

### Step-level stochasticity ###

V-DAgger and SEARN mix policies at the step level during rollout.
- Each rollout step is taken by either the <font color="blue">classifier</font> or the <font color="green">expert</font>.
<br>
<br>

<center>
<img src="images/stepStoch_1.png">
</center><br>
<br>
<br>
<br>

### Step-level stochasticity ###

V-DAgger and SEARN mix policies at the step level during rollout.
- Each rollout step is taken by either the <font color="blue">classifier</font> or the <font color="green">expert</font>.
- Rollouts for the same action $a$ may result in different sequences.
<br><br>
<center>
<img src="images/stepStoch_1.png">
<img src="images/stepStoch_2.png">
</center>

Step-level stochasticity causes high variance in training signal.
<ul><li class="fragment" data-fragment-index="1">Use LOLS instead?</li></ul>

<center>
<img src="images/perStepLOLS_1.png">
<img src="images/perStepLOLS_2.png">
</center>

<ul><li class="fragment" data-fragment-index="1">Sequence too long for full expert policy rollout.</li></ul>
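
The difference between the two mixing schemes can be illustrated by recording which policy acts at each rollout step. This is a toy sketch: `E` and `C` stand for an expert step and a classifier step, and the function names, the mixing probability, and the seeds are assumptions.

```python
import random


def rollout_step_level(steps, beta, rng):
    """Each step is taken by the expert with probability beta."""
    return "".join("E" if rng.random() < beta else "C" for _ in range(steps))


def rollout_sequence_level(steps, beta, rng):
    """One coin flip picks a single policy for the whole rollout (LOLS-style)."""
    who = "E" if rng.random() < beta else "C"
    return who * steps


# Repeat the same 5-step rollout with 20 different random seeds.
step_level = {rollout_step_level(5, 0.5, random.Random(s)) for s in range(20)}
sequence_level = {rollout_sequence_level(5, 0.5, random.Random(s))
                  for s in range(20)}
```

Step-level mixing can produce many different policy interleavings for the same explored action, while sequence-level mixing produces at most two, which is where the lower variance comes from.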

### Focused costing ###

Introduced by [Vlachos and Craven, 2011](http://www.aclweb.org/anthology/W/W11/W11-0307.pdf).<ul><li>Use the <font color="blue">classifier</font> for first $b$ steps of rollout,</li><li>use <font color="green">expert policy</font> for the rest.</li></ul>
<br>
<center>
<img src="images/perStepFocused.png">
</center>
<br>
<span class="fragment" data-fragment-index="1">Classifier costing focused on immediate actions.</span><ul><li class="fragment" data-fragment-index="3">No errors in distant actions of the rollout.</li><br><li class="fragment" data-fragment-index="5">Gradually increase $b$.</li></ul>
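
A minimal sketch of a focused-costing rollout on a toy tagging task. The state encoding, the helper functions, and the deliberately poor `classifier` are assumptions made for illustration.

```python
def step(state, action):
    return state + (action,)


def is_final(state, x):
    return len(state) == len(x)


def expert(state, x, y):
    return y[len(state)]  # gold tag for the next position


def classifier(state, x):
    return "X"  # a deliberately poor learned policy (hypothetical)


def focused_rollout(state, x, y, b):
    """Complete a trajectory: classifier for the first b steps, then expert."""
    taken = 0
    while not is_final(state, x):
        if taken < b:
            action = classifier(state, x)
        else:
            action = expert(state, x, y)
        state = step(state, action)
        taken += 1
    return state


x = ["t1", "t2", "t3", "t4", "t5"]
y = ["Y", "Y", "Y", "Y", "Y"]
# The state after the explored action "Y" has been applied at step 1.
final = focused_rollout(("Y",), x, y, b=2)
```

Only the two steps right after the explored action reflect the classifier's behaviour; the remaining steps are error-free, so the cost is not polluted by distant mistakes.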

### Focused costing results ###
<h5>[Goodman et al. 2016](http://aclweb.org/anthology/P16-1001)</h5>
<br><br>
<center>
<img src="images/amr_focusedCosting_results.png" width="50%">
</center>

### $\alpha$-bound ###

Introduced by [Khardon and Wachman (2007)](http://www.jmlr.org/papers/volume8/khardon07a/khardon07a.pdf).

Reduce training noise by ignoring noisy training instances.<br><br>
<span class="fragment" data-fragment-index="1">During training, if the classifier makes more than $\alpha$ mistakes on a training instance:</span><ul><li class="fragment" data-fragment-index="2">Exclude the instance from future training iterations.</li><li class="fragment" data-fragment-index="3">Related to Coaching (<a href="https://papers.nips.cc/paper/4545-imitation-learning-by-coaching.pdf">He et al., 2012</a>)</li></ul>
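
The bookkeeping for the mistake bound can be sketched as a per-instance counter. The class name `AlphaBound`, its methods, and the instance identifiers are assumptions of this sketch.

```python
from collections import Counter


class AlphaBound:
    """Track mistakes per training instance and drop the noisy ones."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.mistakes = Counter()

    def record(self, instance_id, predicted, target):
        # Count a mistake whenever the classifier disagrees with the target.
        if predicted != target:
            self.mistakes[instance_id] += 1

    def is_active(self, instance_id):
        # Instances with more than alpha mistakes are excluded from training.
        return self.mistakes[instance_id] <= self.alpha


bound = AlphaBound(alpha=2)
for _ in range(3):
    bound.record("n1", predicted="Insert", target="Swap")
bound.record("n2", predicted="Insert", target="Swap")
active = [i for i in ["n1", "n2", "n3"] if bound.is_active(i)]
```

Instance `n1` exceeds the bound after its third mistake and is skipped in later iterations, while `n2` and `n3` stay in the training set.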

### $\alpha$-bound results ###
<h5>[Goodman et al. 2016](http://aclweb.org/anthology/P16-1001)</h5>
<br>
<center>
<img src="images/aboundResults.png" width="60%">
</center>

### Comparison between IL approaches ###
<h5>[Goodman et al. 2016](http://aclweb.org/anthology/P16-1001)</h5>

<center>
<img src="images/amrResults_otherIL.png">
</center>

### Comparison against state of the art ###
<h5>[Goodman et al. 2016](http://aclweb.org/anthology/P16-1001)</h5>

<center>
<img src="images/semParseRes.png">
</center>

### Summary so far ### 

We discussed more modifications to the DAgger framework.
- Targeted exploration considers only the expert policy's action and the actions the learned policy scores close to its best.
- Using an $\alpha$-bound, we filter out training examples that repeatedly confuse the classifier.
- Focused costing uses the learned policy only for the rollout steps immediately following the explored action, and the expert policy for the rest.

We showed that imitation learning improves AMR parsing results.