In [1]:
from IPython.html.services.config import ConfigManager
from IPython.paths import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
              'theme': 'solarized',
              'transition': 'slide',
              'start_slideshow_at': 'selected',
              'progress': 'true',
})



{'progress': 'true',
 'scroll': 'true',
 'start_slideshow_at': 'selected',
 'theme': 'solarized',
 'transition': 'slide'}

<center>
<h2>Imitation learning</h2>
</center>

### Imitation learning for part-of-speech tagging

<table style="float: left;">
<thead>
<tr>
<th>I</th>
<th>can</th>
<th>fly</th>
</tr>
</thead>
<tbody>
<tr>
<td><span>Pronoun</span></td>
<td><span>Modal</span></td>
<td><span>Verb</span></td>
</tr>
</tbody>
</table>

**Q1: Transition system**?

Label each token with a PoS tag, left-to-right

**Q2: Task loss**?

Hamming loss: the number of incorrectly predicted tags in a sentence

**Q3: Expert policy**?

For each word, return the label from the gold standard
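The three answers above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the names `hamming_loss`, `expert_policy` and `tag_left_to_right` are hypothetical and chosen only for this example.

```python
from typing import Callable, List

GOLD = ["Pronoun", "Modal", "Verb"]  # gold-standard tags for "I can fly"

def hamming_loss(predicted: List[str], gold: List[str]) -> int:
    """Task loss: number of positions where the predicted tag differs from gold."""
    return sum(p != g for p, g in zip(predicted, gold))

def expert_policy(history: List[str], gold: List[str]) -> str:
    """Expert: return the gold tag of the next untagged word."""
    return gold[len(history)]

def tag_left_to_right(words: List[str], policy: Callable, gold: List[str]) -> List[str]:
    """Transition system: one tagging action per token, left to right."""
    history: List[str] = []
    for _ in words:
        history.append(policy(history, gold))
    return history

predicted = tag_left_to_right(["I", "can", "fly"], expert_policy, GOLD)
```

Following the expert reproduces the gold standard, so its Hamming loss is zero.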

<h3>Gold standard in search space</h3>

<img src="images/tikz/posImitGold.png" style="width:75%; float:left;">
<br>
<p style="float:left;">
Three actions to complete the output
<br>
Expert policy replicates the gold standard
</p>

<h3>Training a classifier<span class="fragment" data-fragment-index="1"> with structure features </span></h3>

<img src="images/tikz/posImitClassifierTraining.png" style="width:75%; float:left;">

<table style="float:left;">
<thead>
<tr>
<th>word</th>
<th>label</th>
<th>features</th>
</tr>
</thead>
<tbody>
<tr>
<td><i>I</i></td>
<td><b>Pronoun</b></td>
<td>token=I, ...<span class="fragment" data-fragment-index="1">, <font color="red">prev=<b>NULL</b></font></span></td>
</tr>
<tr>
<td><i>can</i></td>
<td><b>Modal</b></td>
<td>token=can, ...<span class="fragment" data-fragment-index="1">, <font color="red">prev=<b>Pronoun</b></font></span></td>
</tr>
<tr>
<td><i>fly</i></td>
<td><b>Verb</b></td>
<td>token=fly, ...<span class="fragment" data-fragment-index="1">, <font color="red">prev=<b>Modal</b></font></span></td>
</tr>
</tbody>
</table>

<p style="float:left; font-size: 70%" class="fragment" data-fragment-index="2">With logistic regression and the $k$ previous tags as features, this is the same as training a $k$th-order Maximum Entropy Markov Model (<a href="http://people.csail.mit.edu/mcollins/6864/slides/memm.pdf">McCallum et al., 2000</a>)</p>

The restriction to $k$ previous tags, though, is needed only so that dynamic programming (i.e. Viterbi) can be used for efficient joint inference. An incremental model does not need it, so its features can use all previous tags.
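As a rough sketch of how such training examples could be built from the gold standard, the hypothetical `extract_features` below uses every previous tag by default and only the last `k` when `k` is given (assumed to be a positive integer). The feature names (`token=`, `prev-1=`, `prev-2=`, ...) are illustrative, not the ones used to produce the slides.

```python
def extract_features(words, position, previous_tags, k=None):
    """Features for tagging words[position] given the tags predicted so far.

    k=None uses the whole tag history; a positive k keeps only the last k tags.
    """
    history = list(previous_tags)
    if k is not None:
        history = history[-k:]
    features = {"token=" + words[position]}
    if not history:
        features.add("prev-1=NULL")
    # prev-1 is the immediately preceding tag, prev-2 the one before it, etc.
    for distance, tag in enumerate(reversed(history), start=1):
        features.add("prev-%d=%s" % (distance, tag))
    return features

words = ["I", "can", "fly"]
gold = ["Pronoun", "Modal", "Verb"]

# One (features, label) training example per word, following the expert.
examples = [(extract_features(words, n, gold[:n]), gold[n]) for n in range(len(words))]
```

With `k=1` the features for *fly* reduce to the token and the single previous tag, as in the table above.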

### Falling off the expert trajectory

### DAgger

Maybe use some of the driving images here

Also talk about the meaning of the learning rate

introduce mixed *roll-in*

### DAgger algorithm

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 80%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; expert\; \pi^{\star}, \; loss \; function \; \ell\\
& \textbf{Output:} \; classifier \; H\\
& training\; examples\; \cal E = \emptyset\\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \text{set} \; rollin \; policy \; \pi^{in} = mix(H,\pi^{\star})\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \text{rollin to predict} \; \hat y_1\dots\hat y_N  = \pi^{in}(\mathbf{x})\\
& \quad \quad \mathbf{for} \; \hat y_n \in \hat y_1\dots\hat y_N \; \mathbf{do}\\
& \quad \quad \quad \text{ask expert for best action}\; y^{\star} = \pi^{\star}(\hat y_1\dots\hat y_{n-1}, \mathbf{x},\mathbf{y}) \; \\
& \quad \quad \quad \text{extract features}\; f=\phi(\mathbf{x},\hat y_1\dots \hat y_{n-1}) \\
& \quad \quad \quad \cal E = \cal E \cup \{(f,y^{\star})\}\\
& \quad \text{learn} \; H\; \text{from}\; \cal E\\
\end{align}
</p>
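The loop above could look roughly as follows in Python. Several things here are assumptions made for the sake of a runnable sketch: the classifier is a small multiclass perceptron, $mix(H,\pi^{\star})$ picks the expert independently for each action with probability `beta ** i` in iteration `i`, and the roll-in is interleaved with example collection, which yields the same examples for a left-to-right tagger. All names are hypothetical.

```python
import random
from collections import defaultdict

TAGS = ["Pronoun", "Modal", "Verb", "Noun"]

def phi(words, history):
    """Features of the next word given the tags predicted so far."""
    prev = history[-1] if history else "NULL"
    return ("token=" + words[len(history)], "prev=" + prev)

def expert(history, gold):
    """Expert policy: the gold tag of the next word."""
    return gold[len(history)]

class Perceptron:
    def __init__(self):
        self.w = defaultdict(float)

    def predict(self, feats):
        return max(TAGS, key=lambda t: sum(self.w[(f, t)] for f in feats))

    def fit(self, examples, epochs=5):
        self.w.clear()  # retrain from scratch on the aggregated examples
        for _ in range(epochs):
            for feats, label in examples:
                guess = self.predict(feats)
                if guess != label:
                    for f in feats:
                        self.w[(f, label)] += 1.0
                        self.w[(f, guess)] -= 1.0

def dagger(data, iterations=3, beta=0.5, seed=0):
    rng = random.Random(seed)
    H, examples = Perceptron(), []
    for i in range(iterations):
        p_expert = beta ** i  # iteration 0 rolls in with the expert only
        for words, gold in data:
            history = []
            for _ in words:
                feats = phi(words, history)
                # Always label the visited state with the expert's action.
                examples.append((feats, expert(history, gold)))
                # Roll in with the mixed policy to reach the next state.
                if rng.random() < p_expert:
                    history.append(expert(history, gold))
                else:
                    history.append(H.predict(feats))
        H.fit(examples)
    return H

def tag(H, words):
    history = []
    for _ in words:
        history.append(H.predict(phi(words, history)))
    return history

H = dagger([(["I", "can", "fly"], ["Pronoun", "Modal", "Verb"])])
```

The key point is that the states come from the mixed roll-in while the labels always come from the expert, so the classifier sees states that its own mistakes lead to.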

### Is that all?

So far we have looked at how to deal with errors in previous predictions, a problem also known as exposure bias.

What about the future ones?

### Training labels as costs

<img src="images/tikz/posImitClassifierTraining.png" style="width:75%; float:left;">

<table style="float:left;">
<thead>
<tr>
<th>word</th>
<th><b>Pronoun</b></th>
<th><b>Modal</b></th>
<th><b>Verb</b></th>
<th><b>Noun</b></th>
<th>features</th>
</tr>
</thead>
<tbody>
<tr>
<td><i>I</i></td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>token=I, prev=<b>NULL</b>...</td>
</tr>
<tr>
<td><i>can</i></td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>token=can, prev=<b>Pronoun</b>...</td>
</tr>
<tr>
<td><i>fly</i></td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>token=fly, prev=<b>Modal</b>...</td>
</tr>
</tbody>
</table>

<h3>Cost breakdown</h3>

<img src="images/tikz/posImitActionCosting1.png" style="float: left; width:50%">
<img src="images/tikz/posImitActionCosting2.png" style="float: left;width:50%">


<p style="float:left;">
<ul>
<li><i>roll-in</i> to a point in the sentence</li>
<li>try each possible label and <i>roll-out</i> to the end of the sentence</li>
<li>evaluate each complete output with the task loss</li>
<li>when the <i>roll-out</i> follows the expert, the correct action has cost 0 and each incorrect action has cost 1</li>
</ul>
</p>
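A minimal sketch of this costing procedure, assuming Hamming loss as the task loss and an expert roll-out. Subtracting the smallest loss so that the best action has cost 0 is an assumption of this example; the function names are hypothetical.

```python
TAGS = ["Pronoun", "Modal", "Verb", "Noun"]
GOLD = ["Pronoun", "Modal", "Verb"]  # gold tags for "I can fly"

def hamming_loss(predicted, gold):
    return sum(p != g for p, g in zip(predicted, gold))

def expert(history, gold):
    """Expert roll-out policy: the gold tag of the next word."""
    return gold[len(history)]

def rollout_costs(rollin_history, gold, rollout_policy):
    """Cost of each action at the position reached by the roll-in."""
    losses = {}
    for action in TAGS:
        output = list(rollin_history) + [action]   # try this label
        while len(output) < len(gold):             # roll out to the end
            output.append(rollout_policy(output, gold))
        losses[action] = hamming_loss(output, gold)
    best = min(losses.values())
    # Assumption: report costs relative to the best action, so that errors
    # already made during the roll-in do not inflate every cost.
    return {action: loss - best for action, loss in losses.items()}

costs_for_can = rollout_costs(["Pronoun"], GOLD, expert)
```

For the word *can* this gives cost 0 for Modal and 1 for every other tag, matching the second row of the cost table.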

### Mixed roll-outs

### DAgger with roll-outs

Algorithm

### Roll-outs

First proposed in SEARN, and also used by Vlachos to hybridise DAgger.

Also known as look-aheads ([Tsuruoka et al. 2011](http://www.anthology.aclweb.org/W/W11/W11-0328.pdf)) and shown to help incremental, transition-based models rival global ones.


Very useful, but can be expensive (wait for the second part to see how to help ourselves!)


### LoLS

<h3>Generic imitation learning</h3>

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 80%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; expert\; \pi^{\star}, \; loss \; function \; \ell\\
& \textbf{Output:} \; classifier \; H\\
& training\; examples\; \cal E = \emptyset\\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \text{set} \; rollin \; policy \; \pi^{in} = mix(H,\pi^{\star})\\
& \quad \text{set} \; rollout \; policy \; \pi^{out} = mix(H,\pi^{\star})\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \text{rollin to predict} \; \hat y_1\dots\hat y_N  = \pi^{in}(\mathbf{x})\\
& \quad \quad \mathbf{for} \; \hat y_n \in \hat y_1\dots\hat y_N \; \mathbf{do}\\
& \quad \quad \quad \text{rollout to obtain costs}\; c \; \text{for all possible actions using}\; \ell\;  \\
& \quad \quad \quad \text{extract features}\; f=\phi(\mathbf{x},\hat y_1\dots \hat y_{n-1}) \\
& \quad \quad \quad \cal E = \cal E \cup \{(f,c)\}\\
& \quad \text{learn} \; H\; \text{from}\; \cal E\\
\end{align}
</p>
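The generic loop above might be sketched as below. The assumptions are loud ones: the cost-sensitive learner is a simple perceptron that updates towards the cheapest action, both mixed policies choose the expert per action with probability `beta ** i`, and the task loss is Hamming loss. None of the names come from an existing library.

```python
import random
from collections import defaultdict

TAGS = ["Pronoun", "Modal", "Verb", "Noun"]

def phi(words, history):
    prev = history[-1] if history else "NULL"
    return ("token=" + words[len(history)], "prev=" + prev)

def hamming_loss(predicted, gold):
    return sum(p != g for p, g in zip(predicted, gold))

class CostSensitivePerceptron:
    def __init__(self):
        self.w = defaultdict(float)

    def predict(self, feats):
        return max(TAGS, key=lambda t: sum(self.w[(f, t)] for f in feats))

    def fit(self, examples, epochs=5):
        self.w.clear()
        for _ in range(epochs):
            for feats, costs in examples:
                guess = self.predict(feats)
                best = min(TAGS, key=costs.get)
                margin = costs[guess] - costs[best]
                if margin > 0:  # the prediction is costlier than necessary
                    for f in feats:
                        self.w[(f, best)] += margin
                        self.w[(f, guess)] -= margin

def mix(H, p_expert, rng):
    """Policy that follows the expert with probability p_expert, else H."""
    def policy(words, history, gold):
        if rng.random() < p_expert:
            return gold[len(history)]
        return H.predict(phi(words, history))
    return policy

def complete(policy, words, history, gold):
    """Extend a partial tag sequence to the end of the sentence."""
    history = list(history)
    while len(history) < len(words):
        history.append(policy(words, history, gold))
    return history

def imitation_learning(data, iterations=3, beta=0.5, seed=0):
    rng = random.Random(seed)
    H, examples = CostSensitivePerceptron(), []
    for i in range(iterations):
        pi_in, pi_out = mix(H, beta ** i, rng), mix(H, beta ** i, rng)
        for words, gold in data:
            rollin = complete(pi_in, words, [], gold)
            for n in range(len(words)):
                # Roll out once per action and score the full output.
                costs = {a: hamming_loss(complete(pi_out, words, rollin[:n] + [a], gold), gold)
                         for a in TAGS}
                examples.append((phi(words, rollin[:n]), costs))
        H.fit(examples)
    return H

H = imitation_learning([(["I", "can", "fly"], ["Pronoun", "Modal", "Verb"])])
```

Setting the roll-out policy to the expert alone and keeping only the cheapest action as the label recovers a DAgger-style update, which is one way to see how the variants relate.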

### Non-decomposable

### Latent actions

Mention UNSEARN

### Summary so far