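# Configure the RISE (livereveal) slideshow settings for this notebook.
from IPython.html.services.config import ConfigManager
from IPython.paths import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
# update() merges these values into the stored 'livereveal' section
# and returns the resulting settings.
cm.update('livereveal', {
    'theme': 'solarized',
    'transition': 'slide',
    'start_slideshow_at': 'selected',
    'progress': 'true',
})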

{'background': '#ff0000',
 'progress': 'true',
 'scroll': 'true',
 'start_slideshow_at': 'selected',
 'theme': 'solarized',
 'transition': 'slide'}

<center>
<h2>Preliminaries</h2>
</center>

<h3>Structured prediction</h3>

<table  style="border-style: hidden; border-collapse: collapse; padding: 50px">
<thead>
<tr>
<th>I</th> 
<th>studied</th>
<th>in</th>
<th>Sheffield</th>
<th>with</th>
<th>Lucia</th>
<th>Specia</th>
</tr>
</thead>
<tbody style="font-size:100%">
<tr>
<td><span class="fragment" data-fragment-index="1">PRP</span></td>
<td><span class="fragment" data-fragment-index="1">VBD</span></td>
<td><span class="fragment" data-fragment-index="1">IN</span></td>
<td><span class="fragment" data-fragment-index="1">NNP</span></td>
<td><span class="fragment" data-fragment-index="1">IN</span></td>
<td><span class="fragment" data-fragment-index="1">NNP</span></td>
<td><span class="fragment" data-fragment-index="1">NNP</span></td>
</tr>
<tr>
<td><span class="fragment" data-fragment-index="2">O</span></td>
<td><span class="fragment" data-fragment-index="2">O</span></td>
<td><span class="fragment" data-fragment-index="2">O</span></td>
<td><span class="fragment" data-fragment-index="2">B-LOC</span></td>
<td><span class="fragment" data-fragment-index="2">O</span></td>
<td><span class="fragment" data-fragment-index="2">B-PER</span></td>
<td><span class="fragment" data-fragment-index="2">I-PER</span></td>
</tr>
</tbody>
</table>

<p>
				<ul>
  			<li class="fragment" data-fragment-index="1">part of speech (PoS) tagging</li>
  			<li class="fragment" data-fragment-index="2">named entity recognition (NER)</li>
				</ul>
			</p>


<p><b>Input:</b> a sentence $\mathbf{x}=[x_1...x_N]$<br> <b>Output:</b> a sequence of labels $\mathbf{y}=[y_{1}\ldots y_{N}] \in {\cal Y}^N$</p>

<h3>Structured prediction</h3>

 <img src="images/tikz/semParse.png" style="width:100%;">

<p>Semantic parsing (but also syntactic parsing, semantic role labeling, question answering over knowledge bases, etc.)</p>
<p><b>Input:</b> a sentence $\mathbf{x}=[x_1...x_N]$<br>
<b>Output:</b> a meaning representation graph $\mathbf{G}=(V,E) \in {\cal G_{\mathbf{x}}}$</p> 

<h3>Structured prediction</h3>

<img src="images/nlg.png" style="width:100%;">

<p>Natural language generation (NLG), but also summarization, decoding in machine translation, etc.</p>

<p><b>Input:</b> a meaning representation<br>
<b>Output:</b> a sentence $\mathbf{w}=[w_1...w_N], w_n\in {\cal V}\cup \{END\}, w_N=END$</p>  

### Two main paradigms

Joint modeling, a.k.a.:
- global inference
- structured prediction

Incremental modeling, a.k.a.:
- local 
- greedy
- pipeline
- transition-based
- history-based

### Joint modeling

A model (e.g. conditional random field) that scores complete outputs (e.g. label sequences):

$$\mathbf{\hat y} =\hat y_{1}\ldots \hat y_{N} = \mathop{\arg \max}_{y_{1}\ldots y_{N} \in {\cal Y}^N} f(y_{1}\ldots y_{N}, \mathbf{x})$$

<ul class="fragment">
					<li>no error propagation</li>
					<li>exhaustive exploration of the search space</li>
					<li>large/complex search spaces are challenging</li>
					<li>efficient dynamic programming restricts modelling flexibility
						(i.e. Markov assumptions)</li>
				</ul>
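To make the dynamic-programming point concrete, here is a minimal Viterbi sketch. It assumes that $f$ decomposes into per-position scores plus first-order transition scores (the Markov assumption mentioned above). The function name `viterbi` and the toy score tables are hypothetical, chosen for illustration only.

```python
def viterbi(emission, transition, labels):
    """Highest-scoring label sequence under a first-order Markov assumption.

    emission[i][y]     : score of label y at position i (toy values)
    transition[(p, y)] : score of label y following label p (toy values)
    """
    # best[i][y] = best score of any label prefix ending with y at position i
    best = [{y: emission[0][y] for y in labels}]
    back = []
    for i in range(1, len(emission)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transition[(p, y)])
            scores[y] = best[-1][prev] + transition[(prev, y)] + emission[i][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label
    y = max(labels, key=lambda l: best[-1][l])
    path = [y]
    for pointers in reversed(back):
        y = pointers[y]
        path.append(y)
    return path[::-1]


labels = ["O", "B-PER", "I-PER"]
emission = [
    {"O": 0.0, "B-PER": 1.0, "I-PER": 0.9},
    {"O": 0.2, "B-PER": 0.1, "I-PER": 1.0},
]
transition = {(p, y): 0.0 for p in labels for y in labels}
transition[("O", "I-PER")] = -10.0  # discourage I-PER directly after O

print(viterbi(emission, transition, labels))
```

The search is exact, but only because the score of a label depends on nothing beyond the previous label; richer features over the full history would break the recursion.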


### Incremental modeling

A classifier that predicts one label at a time given the previous predictions:


\begin{align}
\hat y_1 &=\mathop{\arg \max}_{y \in {\cal Y}} f(y, \mathbf{x}),\\
\mathbf{\hat y} = \quad \hat y_2 &=\mathop{\arg \max}_{y \in {\cal Y}} f(y, \mathbf{x}, \hat y_1), \cdots\\
\hat y_N &=\mathop{\arg \max}_{y \in {\cal Y}} f(y, \mathbf{x}, \hat y_{1} \ldots \hat y_{N-1})
\end{align}

<ul class="fragment">
					<li>use our favourite classifier</li>
					<li>no restrictions on features</li>
					<li>prone to error propagation (i.i.d. assumption broken)</li>
					<li>local model not trained wrt the task-level loss</li>
				</ul>
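The greedy procedure above can be sketched in a few lines. This is an illustration only: `toy_score` is a hypothetical hand-written scorer standing in for a trained classifier $f$, and its argument layout (label, sentence, position, history) is an assumption of this sketch.

```python
def greedy_decode(words, labels, score):
    """Predict one label at a time, conditioning on the previous predictions."""
    predicted = []
    for i in range(len(words)):
        # arg max over labels of f(y, x, y_1 ... y_{i-1})
        best = max(labels, key=lambda y: score(y, words, i, predicted))
        predicted.append(best)
    return predicted


def toy_score(y, words, i, history):
    """Hypothetical scorer: capitalised non-initial words are person names."""
    capitalised = i > 0 and words[i][0].isupper()
    previous = history[-1] if history else "O"
    if y == "B-PER":
        return 1.0 if capitalised and previous == "O" else 0.0
    if y == "I-PER":
        return 1.0 if capitalised and previous in ("B-PER", "I-PER") else 0.0
    return 0.5  # "O"


words = "I studied with Lucia Specia".split()
print(greedy_decode(words, ["O", "B-PER", "I-PER"], toy_score))
```

Note that the scorer is free to inspect the whole prediction history, which is exactly the flexibility (and the source of error propagation) listed above.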


### Imitation learning

Improve incremental modeling to:
- address error-propagation
- train wrt the task-level loss function

**Meta-learning**: use our favourite classifier and features,
but generate better (non-i.i.d.) training data

But let's see some basic concepts first

<h3>Transition system</h3>

<p>The actions $\cal A$ the classifier $f$ can predict from and their effect:</p>

<p style="text-align: left; border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 80%">
\begin{align}
& \textbf{Input:} \; \mathbf{x}\\
& state \; S_1=initialize(\mathbf{x}); timestep \; t = 1\\
& \mathbf{while}\; S_t \; \text{not final}\; \mathbf{do}\\
& \quad action \; \alpha_t = \mathop{\arg \max}_{\alpha \in {\cal A}} f(\alpha, \mathbf{x}, S_t)\\
& \quad S_{t+1}=update(\alpha_t,S_t); t=t+1\\
& \textbf{Output:} \; S_{final} = S_t\\
\end{align}
</p>


<ul>
<li><b>PoS/NER tagging?</b> <span class="fragment">for each word in the sentence, left-to-right, predict a tag which is added to the output</span></li>
<li class="fragment"><b>NLG?</b> <span class="fragment">predict a word from the vocabulary that is added to the output, until <code>END</code> is predicted</span></li>
</ul>

<p class="fragment"> <b>state</b>: a data structure containing the output; so far $S=[\alpha_1...\alpha_T]$</p>
<!--<p class="fragment"> <b>trajectory</b>: the actions $[\alpha_1...\alpha_T]$ taken to reach $S_{final}$</p>-->
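A rough Python rendering of the loop above, instantiated for the NLG case. Everything here is a sketch: the helper names (`run_transition_system`, `is_final`), the toy vocabulary and the hand-written scorer are hypothetical, and the scorer is assumed to see the current state as well as the input.

```python
def run_transition_system(x, actions, f, initialize, update, is_final):
    """Pick the best-scoring action, apply it, repeat until the state is final."""
    state = initialize(x)
    while not is_final(state, x):
        alpha = max(actions, key=lambda a: f(a, x, state))
        state = update(alpha, state)
    return state


# Toy NLG-style instance: the state is the list of words emitted so far.
VOCAB = ["hello", "world", "END"]


def f(action, x, state):
    # Hypothetical scorer: copy the words of x in order, then emit END.
    target = x[len(state)] if len(state) < len(x) else "END"
    return 1.0 if action == target else 0.0


final = run_transition_system(
    x=["hello", "world"],
    actions=VOCAB,
    f=f,
    initialize=lambda x: [],
    update=lambda a, s: s + [a],
    is_final=lambda s, x: bool(s) and s[-1] == "END",
)
print(final)
```

For tagging, the same loop applies with the tag set as `actions` and a final-state test that checks whether every word has received a tag.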

### Task loss

Given a final state $S_{final}$, how does it compare to the gold standard?

$$loss  = L(S_{final}, \mathbf{y}) \geq 0$$

<ul>
<li><b>PoS tagging?</b> <span class="fragment">Hamming loss: number of incorrect tags</span></li>
<li class="fragment"><b>NER?</b> <span class="fragment">number of false positives and false negatives</span></li>
<li class="fragment"><b>NLG?</b> <span class="fragment">BLEU: % of predicted n-grams that are present in the gold reference(s) ($L=1-BLEU(S_{final}, \mathbf{y})$)</span></li>
</ul>

<p class="fragment"><b>Goal:</b> models that minimize the loss on unseen data (generalization)</p>
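As a small illustration of a task loss, the Hamming loss for PoS tagging can be written as follows. The predicted sequence is made up for the example (one deliberate mistake on "Lucia").

```python
def hamming_loss(predicted, gold):
    """Number of positions where the predicted tag differs from the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("sequences must have the same length")
    return sum(p != g for p, g in zip(predicted, gold))


gold = ["PRP", "VBD", "IN", "NNP", "IN", "NNP", "NNP"]
predicted = ["PRP", "VBD", "IN", "NNP", "IN", "NN", "NNP"]  # hypothetical output
print(hamming_loss(predicted, gold))
```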

### Decomposable loss

Given a transition system, a loss is **decomposable** if the loss due to an action $\alpha_t$ can be determined independently of the future actions $[\alpha_{t+1}...\alpha_T]$:

$$L(S,\mathbf{y})= \sum_{t=1}^T \ell(\alpha_t, \mathbf{y}, [\alpha_1 ...\alpha_{t-1}]) $$


<table class="fragment" data-fragment-index="1" style="border-style: hidden; border-collapse: collapse; padding: 50px">
<thead>
<tr>
<th>I</th> 
<th>studied</th>
<th>in</th>
<th>Sheffield</th>
<th>with</th>
<th>Lucia</th>
<th>Specia</th>
</tr>
</thead>
<tbody style="font-size:100%">
<tr>
<td><span class="fragment" data-fragment-index="1">PRP</span></td>
<td><span class="fragment" data-fragment-index="1">VBD</span></td>
<td><span class="fragment" data-fragment-index="1">IN</span></td>
<td><span class="fragment" data-fragment-index="1">NNP</span></td>
<td><span class="fragment" data-fragment-index="1">IN</span></td>
<td><span class="fragment" data-fragment-index="1">NNP</span></td>
<td><span class="fragment" data-fragment-index="2">NNP</span></td>
</tr>
<tr>
<td><span class="fragment" data-fragment-index="3">O</span></td>
<td><span class="fragment" data-fragment-index="3">O</span></td>
<td><span class="fragment" data-fragment-index="3">O</span></td>
<td><span class="fragment" data-fragment-index="3">B-LOC</span></td>
<td><span class="fragment" data-fragment-index="3">O</span></td>
<td><span class="fragment" data-fragment-index="3">B-PER</span></td>
<td><span class="fragment" data-fragment-index="4">I-PER</span></td>
</tr>
</tbody>
</table>

<p class="fragment" data-fragment-index="1">
Can we tell $\ell(\alpha_6| \cdot)$ for
<ul>
<li class="fragment" data-fragment-index="1">PoS tagging? <span class="fragment" data-fragment-index="2"><b>Yes!</b> $\ell(\alpha_6| \cdot)=0$ no matter what $\alpha_7$ is</span></li>
<li class="fragment" data-fragment-index="3">NER? <span class="fragment" data-fragment-index="4"><b>No!</b> If $\alpha_7$ is</span>
<ul class="fragment" data-fragment-index="4">
<li>I-PER:  $\ell(\alpha_6| \cdot)=0$ (correct)</li> 
<li>O: $\ell(\alpha_6| \cdot)=2$ (1 FP and 1 FN)</li>
<li>B-*: $\ell(\alpha_6| \cdot)=3$ (2 FP and 1 FN)</li>
</ul>
</li>
</ul>
</p>
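The NER case can be checked with a short sketch that counts entity-level false positives and false negatives. The span extraction below is one simple reading of the BIO scheme and is an assumption of this sketch: an `I-` tag that does not continue an open entity of the same type is ignored.

```python
def spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence."""
    found, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and tag[2:] == kind
        if start is not None and not continues:
            found.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return found


def ner_loss(predicted, gold):
    """False positives plus false negatives, counted over whole entities."""
    p, g = spans(predicted), spans(gold)
    return len(p - g) + len(g - p)


gold = ["O", "O", "O", "B-LOC", "O", "B-PER", "I-PER"]
prefix = gold[:6]  # alpha_1 ... alpha_6 all agree with the gold standard
for alpha_7 in ["I-PER", "O", "B-LOC"]:
    print(alpha_7, ner_loss(prefix + [alpha_7], gold))
```

The prefix is identical in all three runs, yet the final loss differs, so the cost of predicting `B-PER` at step 6 cannot be fixed before step 7 is known.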



### Non-decomposable loss

Is BLEU score decomposable?

$$
BLEU([\alpha_1...\alpha_T],\mathbf{y}) = \prod_{n=1}^N \frac{\# \text{n-grams} \in ([\alpha_1...\alpha_T] \cap \mathbf{y})}{\# \text{n-grams} \in [\alpha_1...\alpha_T]}
$$

**No**! Assuming N>1 and a word-by-word predictor.

The issue of (non-)decomposability affects joint models too, in that the loss does not always decompose over the graphical model ([Tarlow and Zemel, 2012](http://www.cs.toronto.edu/~dtarlow/tarlow_zemel_aistats12.pdf)).


Even F-score for binary classification is non-decomposable ([Narasimhan et al., 2015](http://jmlr.org/proceedings/papers/v37/narasimhana15.pdf))!
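A minimal sketch of the simplified BLEU formula above (a product of n-gram precisions, with no brevity penalty) makes the point numerically. The clipping of repeated n-grams and the toy sentences are assumptions of this sketch, not part of the formula as written.

```python
from collections import Counter


def ngrams(words, n):
    """Multiset of the n-grams in a word sequence."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def simple_bleu(predicted, reference, max_n=2):
    """Product over n of (matching n-grams) / (predicted n-grams)."""
    score = 1.0
    for n in range(1, max_n + 1):
        pred, ref = ngrams(predicted, n), ngrams(reference, n)
        total = sum(pred.values())
        if total == 0:
            return 0.0
        matched = sum((pred & ref).values())  # clipped overlap
        score *= matched / total
    return score


reference = ["the", "cat", "sat"]
for alpha_3 in ["sat", "ran"]:
    for alpha_2 in ["cat", "dog"]:
        output = ["the", alpha_2, alpha_3]
        print(output, 1 - simple_bleu(output, reference))
```

Choosing `dog` instead of `cat` at step 2 increases the loss by a different amount depending on the word chosen at step 3, because each precision is a ratio over the whole output and the ratios are multiplied together.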

### Supervision for incremental structured prediction

Usually what we have is labeled data, e.g.:

What can we do with it?

### Expert policy

A function that cheats (oracle) by looking at the labels

Only available for the training data

$\pi^{\star}$

Data as demonstrator, learning from demonstrations
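For sequence tagging, such an expert can be sketched in a couple of lines. The function name and the convention that the state is the list of labels predicted so far are assumptions made for this illustration.

```python
def expert_policy(state, gold_labels):
    """Static expert for tagging: return the gold label of the next word.

    `state` is the list of labels predicted so far, so its length is the
    index of the word to be labelled next.
    """
    return gold_labels[len(state)]


gold = ["PRP", "VBD", "IN", "NNP", "IN", "NNP", "NNP"]

# Rolling out the expert yields a demonstration: one (state, action) pair per step.
state, demonstration = [], []
while len(state) < len(gold):
    action = expert_policy(state, gold)
    demonstration.append((list(state), action))
    state.append(action)

print(demonstration[:2])
```

Since the expert needs `gold_labels`, it can only be queried on the training data, as noted above.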

### Action level supervision

- labeled words in PoS tagging (obvious to infer)
- oracle labels for transitions in dependency parsing (not so obvious)

Also known as (expert) demonstrations/trajectories

Show examples to reinforce the fact that we have sequence labeling

It is possible to have dynamic experts (use the NER example)

### Per-action reward

Incorporate the effect a mistaken label has on the loss

### Cost-sensitive classification