In [3]:
%matplotlib
from IPython.html.services.config import ConfigManager
from IPython.paths import locate_profile
from IPython.display import Image
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
              'theme': 'solarized',
              'transition': 'slide',
              'start_slideshow_at': 'selected',
              'progress': 'true',
})

Using matplotlib backend: Qt5Agg


{'progress': 'true',
 'start_slideshow_at': 'selected',
 'theme': 'solarized',
 'transition': 'slide'}

<center>
<h2>Applying Imitation Learning on Dependency Parsing</h2>
</center>

### Dependency parsing ###
###### ([Goldberg and Nivre 2012](http://www.aclweb.org/anthology/C12-1059), [Goldberg and Nivre 2013](https://www.aclweb.org/anthology/Q/Q13/Q13-1033.pdf)) ###### 

<img src="images/toBeAnimated/depParse1.png">

Dependency parsing represents the syntax of a sentence as directed, labeled edges between words.
- The labels denote the type of dependency relation between the two words.
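
As a rough illustration of this representation, a parse can be stored as a set of labeled arcs. The sentence, the label names, and the `Arc` and `heads_of` names below are hypothetical and not taken from the figure.

```python
from typing import NamedTuple

class Arc(NamedTuple):
    head: int       # index of the head word (0 is an artificial ROOT)
    dependent: int  # index of the dependent word
    label: str      # type of the dependency relation

# Hypothetical example sentence, chosen only for illustration
words = ["ROOT", "She", "ate", "fish"]
arcs = {Arc(0, 2, "root"), Arc(2, 1, "nsubj"), Arc(2, 3, "obj")}

def heads_of(arcs):
    """Map each dependent to its (head, label) pair."""
    return {a.dependent: (a.head, a.label) for a in arcs}

for dep, (head, label) in sorted(heads_of(arcs).items()):
    print(f"{words[head]} -{label}-> {words[dep]}")
```

Every word except ROOT has exactly one head, so the mapping from dependent to head describes the whole parse.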

### Error propagation ###

Error propagation arises from greedy decoding, where the parser builds the parse while keeping only the single best hypothesis at each step.
- The first error will confuse the classifier, since it moves the parser into a part of the state space not explored by the gold sequence of actions.
- More errors will likely follow, as the transition sequence moves into increasingly unfamiliar states.

### How can Imitation Learning help with that? ### 

Imitation Learning addresses error propagation by considering the interaction between the transition being considered and the transitions to be predicted later in the sentence.
- Explores the search space, but avoids enumerating all possible outputs.
- Also learns how to recover from errors.

### Transition system? ###

We can assume any transition-based system (Arc-Eager, Arc-Standard, Easy-First, etc.).
- Each action transforms the current state/graph until a terminal state/graph is reached.
- In essence, which arc and label should we add next?
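
As one concrete instance, here is a minimal sketch of the Arc-Eager system. The `State` class, the `apply_action` function, and the example sentence are assumptions made for illustration, and the sketch does not check transition preconditions (for example, that LEFT is only applied to a word without a head).

```python
from dataclasses import dataclass, field

@dataclass
class State:
    stack: list                              # partially processed words, ROOT (0) at the bottom
    buffer: list                             # words not yet read
    arcs: set = field(default_factory=set)   # (head, dependent, label) triples

    def is_terminal(self):
        return not self.buffer

def apply_action(state, action, label=None):
    """Apply one arc-eager transition in place and return the state."""
    s, b = state.stack[-1], state.buffer[0]
    if action == "SHIFT":            # move b onto the stack
        state.stack.append(state.buffer.pop(0))
    elif action == "LEFT":           # add arc b -> s, then pop s
        state.arcs.add((b, s, label))
        state.stack.pop()
    elif action == "RIGHT":          # add arc s -> b, then push b
        state.arcs.add((s, b, label))
        state.stack.append(state.buffer.pop(0))
    elif action == "REDUCE":         # pop s, which is assumed to have a head already
        state.stack.pop()
    else:
        raise ValueError(action)
    return state

# Hypothetical sentence "She ate fish" with word indices 1..3
state = State(stack=[0], buffer=[1, 2, 3])
for action, label in [("SHIFT", None), ("LEFT", "nsubj"),
                      ("RIGHT", "root"), ("RIGHT", "obj")]:
    apply_action(state, action, label)
```

After the four actions the buffer is empty, so the state is terminal and `state.arcs` holds the full parse.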

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/depParse1.png">

### Task loss? ###

Hamming loss: the number of incorrectly predicted labeled or unlabeled dependency arcs.
- Directly related to the attachment score metrics used to evaluate dependency parsers.
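
A small sketch of this loss, assuming each parse is stored as a mapping from a dependent to its `(head, label)` pair. The function names and the example parses are hypothetical.

```python
def hamming_loss(predicted, gold, labeled=True):
    """Count words whose predicted head (and label, if labeled) is wrong."""
    errors = 0
    for dep, (gold_head, gold_label) in gold.items():
        pred_head, pred_label = predicted.get(dep, (None, None))
        if pred_head != gold_head or (labeled and pred_label != gold_label):
            errors += 1
    return errors

def attachment_score(predicted, gold, labeled=True):
    """Fraction of words attached correctly: LAS if labeled, UAS otherwise."""
    return 1.0 - hamming_loss(predicted, gold, labeled) / len(gold)

# Hypothetical parses: the prediction has the right head but the wrong label for word 3
gold = {1: (2, "nsubj"), 2: (0, "root"), 3: (2, "obj")}
pred = {1: (2, "nsubj"), 2: (0, "root"), 3: (2, "nmod")}

labeled_loss = hamming_loss(pred, gold, labeled=True)
unlabeled_loss = hamming_loss(pred, gold, labeled=False)
```

The labeled loss counts the mislabeled arc as an error, while the unlabeled loss does not, which mirrors the difference between the labeled and unlabeled attachment scores.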

### Expert policy? ###

A single static canonical sequence of actions from the initial to the terminal state.
- Derived from the reference graph.

Single static policies worked well for the Part-of-Speech tagging task.
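
For dependency parsing, such a canonical sequence can be read off the reference graph. The sketch below uses the usual Arc-Eager rule order (LEFT, then RIGHT, then REDUCE, then SHIFT) and assumes a projective, unlabeled gold tree. The function name and the input format are assumptions.

```python
def static_oracle(gold_heads):
    """Return one canonical arc-eager action sequence for a projective tree.

    gold_heads maps each word index (1..n) to its head index (0 is ROOT).
    """
    stack, buffer = [0], list(range(1, len(gold_heads) + 1))
    actions = []
    while buffer:
        s, b = stack[-1], buffer[0]
        if gold_heads.get(s) == b:            # gold arc b -> s
            actions.append("LEFT")
            stack.pop()
        elif gold_heads[b] == s:              # gold arc s -> b
            actions.append("RIGHT")
            stack.append(buffer.pop(0))
        elif any(gold_heads[b] == k or gold_heads.get(k) == b
                 for k in stack[:-1]):        # b still needs a word below s
            actions.append("REDUCE")
            stack.pop()
        else:
            actions.append("SHIFT")
            stack.append(buffer.pop(0))
    return actions

# Hypothetical gold tree for "She ate fish": She <- ate -> fish, ROOT -> ate
sequence = static_oracle({1: 2, 2: 0, 3: 2})
```

The fixed rule order is what makes the policy static: whenever several transitions would be acceptable, the same one is always chosen.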

### But what if there are multiple correct transitions? ###

<img src="images/toBeAnimated/depParse2.png">

A static policy could arbitrarily choose a transition (e.g. prioritize shifts over other actions).
- But this implicitly labels the alternative transitions as incorrect!

### And what if a mistake happens? ###

<img src="images/toBeAnimated/depParse3.png">

A static policy is not well defined in states that are not part of the gold transition sequence.

### Dynamic policy ###

Non-deterministic and complete policy
- Allows ambiguous transitions.
- Defined for all states.
- Recovers from errors.
  
<img src="images/toBeAnimated/depParse4.png">

### What is the expert policy then? ###

Given a particular state, where an error may or may not have already occurred:
- We need to determine the best reachable terminal state
  - Quite possibly not an optimal terminal state, if we have made an error before.
  - "Best" according to some loss function in relation to the gold terminal state.

- Return the set of transition actions that lead to that state.
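
The definition above can be made concrete with a brute-force sketch: for every legal action, search all reachable terminal states and keep the actions whose best reachable loss is minimal. This only illustrates the definition on tiny inputs; it is not the efficient cost computation of Goldberg and Nivre. The Arc-Eager encoding, the unlabeled loss, and all names are assumptions.

```python
def has_head(heads, word):
    return any(dep == word for dep, _ in heads)

def legal_actions(state):
    stack, buffer, heads = state
    if not buffer:                       # terminal state: nothing left to read
        return []
    s = stack[-1]
    actions = ["SHIFT", "RIGHT"]
    if s != 0 and not has_head(heads, s):
        actions.append("LEFT")
    if has_head(heads, s):
        actions.append("REDUCE")
    return actions

def step(state, action):
    """Return the successor state; heads is a frozenset of (dependent, head)."""
    stack, buffer, heads = state
    s, b = stack[-1], buffer[0]
    if action == "SHIFT":
        return stack + (b,), buffer[1:], heads
    if action == "LEFT":
        return stack[:-1], buffer, heads | {(s, b)}
    if action == "RIGHT":
        return stack + (b,), buffer[1:], heads | {(b, s)}
    if action == "REDUCE":
        return stack[:-1], buffer, heads
    raise ValueError(action)

def best_loss(state, gold):
    """Smallest unlabeled Hamming loss over all terminal states reachable from state."""
    actions = legal_actions(state)
    if not actions:
        return sum((d, h) not in state[2] for d, h in gold.items())
    return min(best_loss(step(state, a), gold) for a in actions)

def dynamic_oracle(state, gold):
    """Return the set of minimum-cost actions and the cost of every legal action."""
    costs = {a: best_loss(step(state, a), gold) for a in legal_actions(state)}
    best = min(costs.values())
    return {a for a, c in costs.items() if c == best}, costs

gold = {1: 2, 2: 0, 3: 2}                       # hypothetical gold heads
start = ((0,), (1, 2, 3), frozenset())
after_error = step(start, "RIGHT")              # wrongly attaches word 1 to ROOT
```

Calling `dynamic_oracle(after_error, gold)` still returns a well-defined answer, even though this state is not on the gold sequence and the best reachable loss is no longer zero.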

### Expert policy in action! ###

<img src="images/toBeAnimated/depParse4.png">

Cost of 1, if we use the labeled attachment score as a loss function.

### Is that DAgger? ###

[Goldberg and Nivre 2013](https://www.aclweb.org/anthology/Q/Q13/Q13-1033.pdf) proposed a system that employs dynamic expert policies for dependency parsing.
- As well as an algorithm to learn parameters by exploration.
- Very similar to DAgger.
  - Roll-in is a mix of the learned and expert policies, at the time-step level.
  - There may be multiple correct actions at each time-step.
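
The per-time-step mixing can be sketched as a single choice function. This is a sketch under assumptions: `k` is taken to be the number of initial epochs that follow the expert only, and `p` the probability of following the learned policy afterwards. The function name and the tie-breaking rule are hypothetical.

```python
import random

def choose_roll_in_action(epoch, predicted, expert_actions, k, p, rng):
    """Pick the action used to advance the parser state during training.

    predicted      -- action chosen by the learned policy
    expert_actions -- set of actions the dynamic expert considers correct
    """
    explore = epoch > k and rng.random() < p
    if explore or predicted in expert_actions:
        return predicted                 # follow the learned policy
    return min(expert_actions)           # arbitrary but deterministic tie-break

rng = random.Random(0)
# Within the first k epochs the expert is always followed
early = choose_roll_in_action(1, "RIGHT", {"SHIFT"}, k=2, p=0.9, rng=rng)
# After k epochs with p = 1.0 the learned policy is always followed
late = choose_roll_in_action(3, "RIGHT", {"SHIFT"}, k=2, p=1.0, rng=rng)
```

Because the expert returns a set, a prediction that lies inside it is kept as is, so ambiguous transitions are never penalized.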

### Effect of $k$ and $p$ ###
<img src="images/dependHeatMaps.png">

### Results ###
<img src="images/dependResults.png">

### Summary so far ### 

We discussed modifications to the DAgger framework.
- Hard decay schedule after $k$ epochs when determining the roll-in and roll-out policies.
- Using a mix of expert and learned policy during roll-outs.

We showed that dynamic oracles improve on the results of static oracles.