<center>
<h2>Syntactic Parsing part 2: <br> Transition-based Dependency parsing</h2>
<p style="text-align:center">
Natural Language Processing<br>
(COM4513/6513)<br>
<br>
<a href="http://andreasvlachos.github.io">Andreas Vlachos</a><br>
a.vlachos@sheffield.ac.uk<br>
<small>Department of Computer Science<br>
University of Sheffield
</small>
</p>
</center>

### Previous lecture

Dependency parsing:
<img src="images/tikz/test.png" style="width:1000px; border:none; box-shadow:none;">

### Problem setup

Training data is pairs of word sequences (sentences) and dependency trees:

\begin{align}
D_{train} & = \{(\mathbf{x}^1,G_x^1)...(\mathbf{x}^M,G_x^M)\} \\
\mathbf{x}^m & = [x_1,..., x_N]\\
graph\; G_\mathbf{x} &= (V_\mathbf{x}, A_\mathbf{x})\\
vertices\; V_\mathbf{x} &=\{0,1,...,N\}\\
edges\; A_\mathbf{x} &=\{(i,j,k)|i,j\in V, k \in L \text{(labels)}\}
\end{align}
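
As a minimal sketch of this setup, a labeled sentence can be stored as a word list plus a set of `(head, dependent, label)` triples, with index 0 reserved for ROOT. The sentence is the one used in the worked example later on; the arcs and labels for "financial", "markets" and "." are assumptions made for illustration, and `is_single_headed` is a hypothetical helper name.

```python
from typing import Set, Tuple

Arc = Tuple[int, int, str]  # (head i, dependent j, label k)

# Sentence x = [x_1, ..., x_N]; index 0 is reserved for the artificial ROOT.
words = ["ROOT", "Economic", "news", "had", "little", "effect",
         "on", "financial", "markets", "."]

# Arcs A_x. The arcs for "financial", "markets" and "." (and their labels)
# are assumptions made for illustration.
arcs: Set[Arc] = {
    (2, 1, "amod"), (3, 2, "nsubj"), (0, 3, "root"),
    (5, 4, "amod"), (3, 5, "dobj"), (5, 6, "prep"),
    (8, 7, "amod"), (6, 8, "pobj"), (3, 9, "punct"),
}


def is_single_headed(n_words: int, arcs: Set[Arc]) -> bool:
    """True if each word 1..N has exactly one head and ROOT (0) has none."""
    heads = {}
    for head, dep, _label in arcs:
        if dep == 0 or dep in heads:
            return False
        heads[dep] = head
    return set(heads) == set(range(1, n_words + 1))


print(is_single_headed(len(words) - 1, arcs))  # True
```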

We want to learn a model to predict the best graph:

$$
\hat G_\mathbf{x} = \mathop{\arg \max}\limits_{G_\mathbf{x} \in \cal G_\mathbf{x}} score(G_\mathbf{x},\mathbf{x})
$$

### Graph-based dependency parsing: Inference

<img src="images/MSTParsing.png" style="width:1000px; border:none; box-shadow:none;">

Pick the highest scoring edges forming a spanning tree of the graph

We assume we have a function $f$ scoring each edge $(head, dependent, label)$ given the sentence $\mathbf{x}$

### Graph-based dependency parsing: training

Decompose the graph score into arc scores:

\begin{align}
				\hat G_\mathbf{x} & = \mathop{\arg \max}\limits_{G_\mathbf{x} \in \cal G_\mathbf{x}} score(G_\mathbf{x},\mathbf{x})\\
				 & = \mathop{\arg \max}\limits_{G_\mathbf{x} \in \cal G_\mathbf{x}} \mathbf{w} \cdot \Phi(G_\mathbf{x},\mathbf{x}) \quad \text{(linear model)}\\
				 & = \mathop{\arg \max}\limits_{G_\mathbf{x} \in \cal G_\mathbf{x}} \sum_{(i,j,l) \in A_\mathbf{x}} \mathbf{w} \cdot \phi((i,j,l),\mathbf{x}) \quad \text{(arc-factored)}
				 \end{align}

Can learn the weights with the structured perceptron!
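
The decomposition above can be sketched with sparse feature counts. The feature templates in `phi`, the toy sentence and the weight values are all hypothetical; the point is only that summing per-arc scores gives the same number as scoring the summed feature vector $\Phi(G_\mathbf{x},\mathbf{x})$.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Arc = Tuple[int, int, str]  # (head, dependent, label)


def phi(arc: Arc, x: List[str]) -> Counter:
    """Hypothetical feature templates for a single arc."""
    i, j, label = arc
    return Counter({
        f"head={x[i]},label={label}": 1,
        f"dep={x[j]},label={label}": 1,
        f"head={x[i]},dep={x[j]}": 1,
        f"dir={'R' if i < j else 'L'},label={label}": 1,
    })


def dot(w: Dict[str, float], feats: Counter) -> float:
    return sum(w.get(name, 0.0) * value for name, value in feats.items())


def score_graph(w: Dict[str, float], arcs: Set[Arc], x: List[str]) -> float:
    """Arc-factored score: sum over arcs of w . phi(arc, x)."""
    return sum(dot(w, phi(arc, x)) for arc in arcs)


def Phi(arcs: Set[Arc], x: List[str]) -> Counter:
    """Graph-level features: the sum of the arc-level feature vectors."""
    total = Counter()
    for arc in arcs:
        total.update(phi(arc, x))
    return total


x = ["ROOT", "news", "had", "effect"]
arcs = {(2, 1, "nsubj"), (0, 2, "root"), (2, 3, "dobj")}
w = {"head=had,label=nsubj": 2.0, "dir=R,label=dobj": 1.0, "head=ROOT,dep=had": 3.0}

print(score_graph(w, arcs, x), dot(w, Phi(arcs, x)))  # 6.0 6.0
```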

### In this lecture

Graph-based dependency parsing restricts the features to
perform joint inference efficiently.

**Transition-based dependency parsing** trades joint inference for feature flexibility.

No more argmax over graphs, just use a classifier with any features we want!

### Joint vs incremental prediction

<p style="float: left;">**Joint**: score (and enumerate)<br> complete outputs (graphs)</p> 
<img src="images/tikz/StucturedPredictionDef.png" style="width:40%; float: right;">

<p style="float: left;">**Incremental**: predict a sequence <br>of actions (transitions) constructing<br> the output</p> 
<img src="images/tikz/StucturedPrediction.png" style="width:40%; float: right;">

<h3>Transition system</h3>

<p>The <b>actions</b> $\cal A$ the classifier $f$ can predict and their effect on the <b>state</b> which tracks the prediction: $S_{t+1}=S_1(\alpha_1\ldots\alpha_t)$</p>

<img src="images/tikz/IncrementalStructure.png" style="align:center; width:65%">

What should the actions (transitions) be for dependency parsing?

### Transition system setup

**input**: Vertices $V_\mathbf{x} =\{0,1,...,N\}$ (the words of sentence $\mathbf{x}$, with 0 for ROOT)

**state** $S=(Stack,Buf,A)$: 
- Arcs $A$ (dependencies predicted so far)
- Buffer $Buf$ (words left to process)
- Stack $Stack$ (last-in, first-out memory)

Initial state: $S_1 = ([],[0,1,...,N],\{\})$

Final state: $S_{final}=(Stack,[],A)$

### Transition system

$\text{Shift} \; (Stack, i|Buf, A)\rightarrow(Stack|i, Buf, A)$<br>
push next word from the buffer ($i$) to stack

$\text{Reduce} \; (Stack|i, Buf, A)\rightarrow(Stack, Buf, A)$<br>
pop the word on top of the stack ($i$), if it has a head

$\text{Right-Arc}(label)\; (Stack|i, j|Buf, A) \rightarrow (Stack|i|j, Buf, A\cup\{(i,j,label)\})$<br>
create edge $(i,j,label)$ between top of the stack ($i$) <br>and next in buffer ($j$), push $j$

$\text{Left-Arc}(label)\; (Stack|i, j|Buf, A) \rightarrow (Stack, j|Buf, A\cup\{(j,i,label)\})$<br>
create edge $(j,i,label)$ and pop $i$, if $i$ has no head
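
A minimal sketch of these four transitions as pure functions on an immutable state. The `State` class and function names are illustrative choices, and the extra check that ROOT (index 0) is never made a dependent by Left-Arc is an assumption not stated above.

```python
from typing import FrozenSet, NamedTuple, Tuple

Arc = Tuple[int, int, str]  # (head, dependent, label)


class State(NamedTuple):
    stack: Tuple[int, ...]  # top of the stack is the last element
    buf: Tuple[int, ...]    # next word is the first element
    arcs: FrozenSet[Arc]


def initial_state(n_words: int) -> State:
    """Empty stack, buffer [0, 1, ..., N], no arcs."""
    return State((), tuple(range(n_words + 1)), frozenset())


def has_head(s: State, i: int) -> bool:
    return any(dep == i for _head, dep, _label in s.arcs)


def shift(s: State) -> State:
    if not s.buf:
        raise ValueError("Shift needs a non-empty buffer")
    return State(s.stack + (s.buf[0],), s.buf[1:], s.arcs)


def reduce(s: State) -> State:
    if not s.stack or not has_head(s, s.stack[-1]):
        raise ValueError("Reduce needs a stack top that already has a head")
    return State(s.stack[:-1], s.buf, s.arcs)


def right_arc(s: State, label: str) -> State:
    if not s.stack or not s.buf:
        raise ValueError("Right-Arc needs a stack top and a buffer front")
    i, j = s.stack[-1], s.buf[0]
    return State(s.stack + (j,), s.buf[1:], s.arcs | {(i, j, label)})


def left_arc(s: State, label: str) -> State:
    if not s.stack or not s.buf:
        raise ValueError("Left-Arc needs a stack top and a buffer front")
    i, j = s.stack[-1], s.buf[0]
    # Assumption: ROOT (index 0) is never made a dependent.
    if i == 0 or has_head(s, i):
        raise ValueError("Left-Arc needs a headless, non-ROOT stack top")
    return State(s.stack[:-1], s.buf, s.arcs | {(j, i, label)})


# Replay nine actions on a 9-word sentence (indices 1..9, 0 is ROOT).
s = initial_state(9)
for step in [shift, shift, lambda t: left_arc(t, "amod"),
             shift, lambda t: left_arc(t, "nsubj"),
             lambda t: right_arc(t, "root"),
             shift, lambda t: left_arc(t, "amod"),
             lambda t: right_arc(t, "dobj")]:
    s = step(s)
print(s.stack, s.buf)  # (0, 3, 5) (6, 7, 8, 9)
```

Raising on violated preconditions keeps illegal actions from silently corrupting the state; a parser would normally filter the action set instead.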

### Example

<img src="images/tikz/depParseArcEager0.png" style="width:1000px; border:none; box-shadow:none;">

Stack = []

Buffer = [ROOT, Economic, news, had, little, effect, on, financial, markets, .]

Action?<span class="fragment">$\quad \text{Shift}$</span>

### Example

<img src="images/tikz/depParseArcEager0.png" style="width:1000px; border:none; box-shadow:none;">

Stack = [ROOT]

Buffer = [Economic, news, had, little, effect, on, financial, markets, .]

Action?<span class="fragment">$\quad \text{Shift}$</span>

### Example

<img src="images/tikz/depParseArcEager0.png" style="width:1000px; border:none; box-shadow:none;">

Stack = [ROOT, Economic]

Buffer = [news, had, little, effect, on, financial, markets, .]

Action?<span class="fragment">$\quad \text{Left-Arc}(amod)$</span>

### Example

<img src="images/tikz/depParseArcEager1.png" style="width:1000px; border:none; box-shadow:none;">

Stack = [ROOT]

Buffer = [news, had, little, effect, on, financial, markets, .]

Action?<span class="fragment">$\quad \text{Shift}$</span>

### Example

<img src="images/tikz/depParseArcEager1.png" style="width:1000px; border:none; box-shadow:none;">

Stack = [ROOT, news]

Buffer = [had, little, effect, on, financial, markets, .]

Action?<span class="fragment">$\quad \text{Left-Arc}(nsubj)$</span>

### Example

<img src="images/tikz/depParseArcEager2.png" style="width:1000px; border:none; box-shadow:none;">

Stack = [ROOT]

Buffer = [had, little, effect, on, financial, markets, .]

Action?<span class="fragment">$\quad \text{Right-Arc}(root)$</span>

### Example

<img src="images/tikz/depParseArcEager3.png" style="width:1000px; border:none; box-shadow:none;">

Stack = [ROOT, had]

Buffer = [little, effect, on, financial, markets, .]

Action?<span class="fragment">$\quad \text{Shift}$</span>

### Example

<img src="images/tikz/depParseArcEager3.png" style="width:1000px; border:none; box-shadow:none;">

Stack = [ROOT, had, little]

Buffer = [effect, on, financial, markets, .]

Action?<span class="fragment">$\quad \text{Left-Arc}(amod)$</span>

### Example

<img src="images/tikz/depParseArcEager4.png" style="width:1000px; border:none; box-shadow:none;">

Stack = [ROOT, had]

Buffer = [effect, on, financial, markets, .]

Action?<span class="fragment">$\quad \text{Right-Arc}(dobj)$</span>

### Example

<img src="images/tikz/depParseArcEager5.png" style="width:1000px; border:none; box-shadow:none;">

Stack = [ROOT, had, effect]

Buffer = [on, financial, markets, .]

Action?<span class="fragment">$\quad \text{a few more later...}$</span>

### Example

<img src="images/tikz/depParseArcEagerFinal.png" style="width:1000px; border:none; box-shadow:none;">

Stack = [ROOT, had, .]

Buffer = []

Empty buffer. DONE!

### Other transition systems?

This was the arc-eager system. Others:
- arc-standard (3 actions)
- easy-first (not left-to-right), etc.


All operate with actions combining:
- moving words from the buffer to the stack and back (shift/un-shift)
- popping words from the stack (reduce)
- creating labeled arcs left and right

Intuition: Define actions that are easy to learn

<h3>Transition-based parsing</h3>

<p style="text-align: left; border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em;">
\begin{align}
& \textbf{Input:} \; sentence \; \mathbf{x}\\
& state \; S_1=initialize(\mathbf{x}); timestep \; t = 1\\
& \mathbf{while}\; S_t \; \text{not final}\; \mathbf{do}\\
& \quad action \; \alpha_t = \mathop{\arg \max}_{\alpha \in {\cal A}} f(\alpha, S_t)\\
& \quad S_{t+1}=S_t(\alpha_t); t=t+1\\
\end{align}
</p>

What is $f$? <span class="fragment">A multiclass classifier</span>

What do we need to learn it?

- learning algorithm (perceptron, logistic regression)
- labeled training data
- feature extraction
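
The loop above can be sketched as follows for the arc-eager actions. Everything here is illustrative: `apply`, `legal_actions` and `parse` are hypothetical names, and the scorer `f` is a hand-written lookup table standing in for a trained multiclass classifier.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # (name, label); label is "" for shift/reduce


def apply(state, action: Action):
    """Apply one arc-eager action to a (stack, buf, arcs) state."""
    stack, buf, arcs = state
    name, label = action
    if name == "shift":
        return stack + (buf[0],), buf[1:], arcs
    if name == "reduce":
        return stack[:-1], buf, arcs
    if name == "right_arc":
        return stack + (buf[0],), buf[1:], arcs | {(stack[-1], buf[0], label)}
    if name == "left_arc":
        return stack[:-1], buf, arcs | {(buf[0], stack[-1], label)}
    raise ValueError(name)


def legal_actions(state, labels: List[str]) -> List[Action]:
    """Actions whose preconditions hold (buffer assumed non-empty)."""
    stack, _buf, arcs = state
    top_has_head = bool(stack) and any(dep == stack[-1] for _, dep, _ in arcs)
    actions = [("shift", "")]
    if stack and top_has_head:
        actions.append(("reduce", ""))
    if stack:
        actions += [("right_arc", label) for label in labels]
    if stack and stack[-1] != 0 and not top_has_head:
        actions += [("left_arc", label) for label in labels]
    return actions


def parse(n_words: int, labels: List[str], f: Callable):
    state = ((), tuple(range(n_words + 1)), frozenset())
    actions = []
    while state[1]:  # final state: empty buffer
        best = max(legal_actions(state, labels), key=lambda a: f(a, state))
        state = apply(state, best)
        actions.append(best)
    return state, actions


# Stand-in scorer for "ROOT news had effect": a table keyed on
# (stack top, buffer front). A real f would be learned from data.
TABLE = {(None, 0): ("shift", ""), (0, 1): ("shift", ""),
         (1, 2): ("left_arc", "nsubj"), (0, 2): ("right_arc", "root"),
         (2, 3): ("right_arc", "dobj")}


def f(action: Action, state) -> float:
    stack, buf, _arcs = state
    key = (stack[-1] if stack else None, buf[0])
    return 1.0 if TABLE.get(key) == action else 0.0


final_state, actions = parse(3, ["nsubj", "root", "dobj"], f)
print(actions)
print(sorted(final_state[2]))
```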

### What are the right actions?

We only have sentences labeled with graphs: $D_{train} = \{(\mathbf{x}^1,G_x^1)...(\mathbf{x}^M,G_x^M)\}$

<center>
<a href="http://www.ancient-origins.net/sites/default/files/field/image/">
<img src="images/oracle-delphi.jpg" style="width:60%; border:none; box-shadow:none;"></a>
</center>

Ask an oracle to tell us the actions constructing the graph!

In our case, the oracle is a set of **rules** that compares the current state $S=(Stack,Buffer,ArcsPredicted)$ with the gold graph $G_x$ and returns the correct action, which serves as the label for that state
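
One common choice of such rules for the arc-eager system is a static oracle, sketched below under the assumption that the gold tree is projective. The function names are hypothetical, and the gold arcs for "financial", "markets" and "." are assumed for illustration.

```python
from typing import Dict, List, Set, Tuple

Arc = Tuple[int, int, str]   # (head, dependent, label)
Action = Tuple[str, str]     # (name, label); label is "" for shift/reduce


def oracle_action(stack: List[int], buf: List[int],
                  gold_head: Dict[int, Tuple[int, str]]) -> Action:
    """Rule-based choice of the correct action for one state."""
    if not stack:
        return ("shift", "")
    i, j = stack[-1], buf[0]
    head_i, label_i = gold_head.get(i, (None, ""))
    head_j, label_j = gold_head.get(j, (None, ""))
    if head_i == j:                       # buffer front is the head of the stack top
        return ("left_arc", label_i)
    if head_j == i:                       # stack top is the head of the buffer front
        return ("right_arc", label_j)
    # The buffer front is linked to a word deeper in the stack: pop to reach it.
    if any(head_j == k or gold_head.get(k, (None, ""))[0] == j for k in stack[:-1]):
        return ("reduce", "")
    return ("shift", "")


def oracle_sequence(n_words: int, gold_arcs: Set[Arc]):
    """Return (state, action) training pairs and the arcs that were built."""
    gold_head = {dep: (head, label) for head, dep, label in gold_arcs}
    stack, buf, arcs = [], list(range(n_words + 1)), set()
    examples = []
    while buf:
        name, label = oracle_action(stack, buf, gold_head)
        examples.append(((tuple(stack), tuple(buf), frozenset(arcs)), (name, label)))
        if name == "shift":
            stack.append(buf.pop(0))
        elif name == "reduce":
            stack.pop()
        elif name == "right_arc":
            arcs.add((stack[-1], buf[0], label))
            stack.append(buf.pop(0))
        else:  # left_arc: add (j, i, label) and pop i
            arcs.add((buf[0], stack.pop(), label))
    return examples, arcs


# ROOT Economic news had little effect on financial markets .
gold = {(2, 1, "amod"), (3, 2, "nsubj"), (0, 3, "root"), (5, 4, "amod"),
        (3, 5, "dobj"), (5, 6, "prep"), (8, 7, "amod"), (6, 8, "pobj"),
        (3, 9, "punct")}
examples, built = oracle_sequence(9, gold)
for (_state, action) in examples:
    print(action)
print(built == gold)  # True
```

Each `(state, action)` pair is one training instance for the classifier, so a sentence with $T$ transitions yields $T$ labeled examples.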

<h3>Learning from an oracle</h3>
Given a labeled sentence and a transition system, an oracle returns states labeled with the correct actions.

\begin{align}
D_{train} & = \{(\mathbf{x}^1,G_x^1)...(\mathbf{x}^M,G_x^M)\} \\
\mathbf{x}^m & = [x_1,..., x_N]\\
graph\; G_\mathbf{x} &= (V_\mathbf{x}, A_\mathbf{x})\\
vertices\; V_\mathbf{x} &=\{0,1,...,N\}\\
edges\; A_\mathbf{x} &=\{(i,j,k)|i,j\in V, k \in L \text{(labels)}\}\\
\color{red}{states\; \mathbf{S}^m} & \color{red}{=[S_1,...,S_T]}\\
\color{red}{actions\; \mathbf{\alpha}^m} & \color{red}{=[\alpha_1,...,\alpha_T]}\\
\end{align}


### Features

<img src="images/tikz/depParseArcEager5.png" style="width:1000px; border:none; box-shadow:none;">

Stack = [ROOT, had, effect]

Buffer = [on, financial, markets, .]

What features would help us predict the correct action $\text{Right-Arc}(prep)$?

Features based on the words/PoS in stack and buffer:
<br> wordS1=effect, wordB1=on, wordS2=had, posS1=NOUN, etc.

Features based on the dependencies so far:
<br> depS1=dobj, depLeftChildS1=amod, depRightChildS1=NULL, etc.

Features based on previous actions:
<br> $\alpha_{t-1}=\text{Right-Arc}(dobj)$, etc.
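
A minimal sketch of a feature extractor producing the features listed above for the state shown. The function name and the string encoding are illustrative, and every PoS tag other than NOUN for "effect" is an assumption.

```python
from typing import List, Optional, Sequence, Set, Tuple

Arc = Tuple[int, int, str]  # (head, dependent, label)


def extract_features(words: Sequence[str], pos: Sequence[str],
                     stack: Sequence[int], buf: Sequence[int],
                     arcs: Set[Arc], prev_action: str) -> List[str]:
    def dep_label(i: Optional[int]) -> str:
        """Label of the arc attaching i to its head, or NULL."""
        return next((l for _h, d, l in arcs if d == i), "NULL")

    def child_label(i: Optional[int], left: bool) -> str:
        """Label of the leftmost (or rightmost) dependent of i, or NULL."""
        kids = sorted((d, l) for h, d, l in arcs
                      if h == i and (d < i if left else d > i))
        if not kids:
            return "NULL"
        return kids[0][1] if left else kids[-1][1]

    s1 = stack[-1] if len(stack) >= 1 else None
    s2 = stack[-2] if len(stack) >= 2 else None
    b1 = buf[0] if buf else None
    feats = {
        "wordS1": words[s1] if s1 is not None else "NULL",
        "wordS2": words[s2] if s2 is not None else "NULL",
        "wordB1": words[b1] if b1 is not None else "NULL",
        "posS1": pos[s1] if s1 is not None else "NULL",
        "depS1": dep_label(s1),
        "depLeftChildS1": child_label(s1, left=True),
        "depRightChildS1": child_label(s1, left=False),
        "prevAction": prev_action,
    }
    return [f"{name}={value}" for name, value in feats.items()]


words = ["ROOT", "Economic", "news", "had", "little", "effect",
         "on", "financial", "markets", "."]
# Assumed tags, except NOUN for "effect".
pos = ["ROOT", "ADJ", "NOUN", "VERB", "ADJ", "NOUN", "ADP", "ADJ", "NOUN", "PUNCT"]
arcs = {(2, 1, "amod"), (3, 2, "nsubj"), (0, 3, "root"),
        (5, 4, "amod"), (3, 5, "dobj")}

features = extract_features(words, pos, stack=[0, 3, 5], buf=[6, 7, 8, 9],
                            arcs=arcs, prev_action="Right-Arc(dobj)")
print(features)
```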

### Transition-based vs Graph-based parsing

- Transition-based tends to be better on shorter sentences, graph-based on longer ones

- Graph-based tends to be better on long-range dependencies

- Graph-based lacks the rich structural features

- Transition-based is greedy and suffers from early mistakes

Actually, can we ameliorate the greedy issue?

### Beam Search


<a href="http://slideplayer.com/slide/8593664/"><img width="80%" src="images/beam_pos.jpg"></a>

Don't need to be completely greedy if we are happy to increase CPU and memory usage
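
A minimal, generic sketch of beam search over action sequences, with cumulative additive scores. The toy scoring table is hypothetical and chosen so that the greedy choice (beam size 1) commits to a locally best first action and misses the best overall sequence.

```python
def beam_search(initial, is_final, legal_actions, apply, score, beam_size):
    """Keep the beam_size highest-scoring partial action sequences at each step."""
    beam = [(0.0, initial, [])]  # (cumulative score, state, actions so far)
    while not all(is_final(state) for _, state, _ in beam):
        candidates = []
        for total, state, actions in beam:
            if is_final(state):
                candidates.append((total, state, actions))  # carry finished items over
                continue
            for a in legal_actions(state):
                candidates.append((total + score(a, state),
                                   apply(state, a),
                                   actions + [a]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    return beam[0]


# Toy problem: a state is the tuple of actions taken; two steps in total.
SCORES = {
    (): {"a": 3.0, "b": 2.0},
    ("a",): {"a": 1.0, "b": 1.0},
    ("b",): {"a": 5.0, "b": 0.0},
}
toy = dict(
    initial=(),
    is_final=lambda s: len(s) == 2,
    legal_actions=lambda s: ["a", "b"],
    apply=lambda s, a: s + (a,),
    score=lambda a, s: SCORES[s][a],
)

greedy = beam_search(beam_size=1, **toy)
wider = beam_search(beam_size=2, **toy)
print(greedy[0], greedy[2])  # 4.0 ['a', 'a']
print(wider[0], wider[2])    # 7.0 ['b', 'a']
```

The cost grows linearly with the beam size, since every item in the beam is expanded at each step.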

<h3>Non-Projectivity</h3>
<img src="images/tikz/depParseNonProjective.png" style="width:1000px; border:none; box-shadow:none;">
<ul>
<li>a.k.a. crossing dependencies</li>
<li>long-range dependencies</li>
<li>free word order</li>
</ul>
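
Crossing dependencies can be detected by checking whether the spans of two arcs interleave. This is a small sketch with hypothetical function names; the three-word non-projective tree is an abstract example, not the sentence in the figure.

```python
from typing import Iterable, List, Tuple

Arc = Tuple[int, int, str]  # (head, dependent, label)
Span = Tuple[int, int]      # (left end, right end) of an arc


def crossing_pairs(arcs: Iterable[Arc]) -> List[Tuple[Span, Span]]:
    """Pairs of arcs whose spans interleave: a_left < b_left < a_right < b_right."""
    spans = sorted((min(h, d), max(h, d)) for h, d, _label in arcs)
    return [(a, b)
            for idx, a in enumerate(spans)
            for b in spans[idx + 1:]
            if a[0] < b[0] < a[1] < b[1]]


def is_projective(arcs: Iterable[Arc]) -> bool:
    return not crossing_pairs(arcs)


# Projective: ROOT -> 2, 2 -> 1, 2 -> 3
projective = {(0, 2, "root"), (2, 1, "x"), (2, 3, "y")}
# Non-projective: ROOT -> 2, 2 -> 3, 3 -> 1 (arc 3 -> 1 crosses ROOT -> 2)
non_projective = {(0, 2, "root"), (2, 3, "x"), (3, 1, "y")}

print(is_projective(projective))       # True
print(crossing_pairs(non_projective))  # [((0, 2), (1, 3))]
```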

### Non-projective transition-based parsing

The standard stack-based systems cannot do it.

But there are extensions:
- swap actions: word reordering
- k-planar parsing: use multiple stacks (usually 2)


Standard graph-based parsing handles non-projectivity.

### Incremental language processing

Other problems solved with similar approaches (a.k.a. transition-based, greedy):
- semantic parsing
- coreference resolution
- etc.

Whenever you have a problem with a very large space of outputs, this approach is worth considering.

Humans process language incrementally, should machines do the same?

### Bibliography
- Nivre and McDonald's tutorial [slides](http://stp.lingfil.uu.se/~nivre/eacl14.html)
- Nivre's article on [deterministic transition-based dependency parsing](http://www.mitpressjournals.org/doi/pdf/10.1162/coli.07-056-R1-07-027)
- Nivre and McDonald's [paper](http://www.aclweb.org/anthology/D07-1013) comparing their approaches

### Coming up next

Continuous representations (getting ready for neural networks)