In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import il_tutorial.cost_graphs as cg
import il_tutorial.util as util
from IPython.display import HTML

In [3]:
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

<center>
<h2>Connections and interpretations</h2>
</center>

### Why does IL work?

A bit of theory (Goodman)

### Reinforcement learning

<p style="float: left;">
<ul style="float: left;">
<li>states defined via features</li>
<li>the agent is a classifier</li>
<li>rewards?</li>
</ul>
</p>

<a href="https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node28.html"><img src="images/RL_sutton.png" style="width:40%; float: right;"></a>

**Inverse reinforcement learning**

- we have the expert policy (inferred from the gold standard in the training data)
- we infer the per-action reward function (rollin/out)


LOLS with classifier-only rollouts amounts to RL ([Chang et al., 2015](https://arxiv.org/pdf/1502.02206.pdf))

### Inverse reinforcement learning

- and learn a policy (classifier) to generalize to unseen data
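
As a rough illustration of per-action costing with a roll-in and an expert roll-out, here is a minimal sketch for the "I can fly" tagging example. It assumes a Hamming loss and an expert that simply emits the gold tags; the names `hamming_loss` and `action_costs` are hypothetical helpers, not part of `il_tutorial`.

```python
from typing import Dict, Sequence


def hamming_loss(predicted: Sequence[str], gold: Sequence[str]) -> int:
    """Number of positions where the predicted tag differs from the gold tag."""
    return sum(p != g for p, g in zip(predicted, gold))


def action_costs(
    roll_in: Sequence[str], gold: Sequence[str], actions: Sequence[str]
) -> Dict[str, int]:
    """Cost of each candidate action at step t = len(roll_in).

    Assumption: the roll-out uses the expert, which emits the gold tags.
    """
    t = len(roll_in)
    costs = {}
    for action in actions:
        # Roll-in prefix + action being assessed + expert roll-out.
        trajectory = list(roll_in) + [action] + list(gold[t + 1:])
        costs[action] = hamming_loss(trajectory, gold)
    # Subtract the minimum so that the best action has cost 0.
    best = min(costs.values())
    return {a: c - best for a, c in costs.items()}


gold = ["Pronoun", "Modal", "Verb"]  # tags for "I can fly"
tags = ["Noun", "Verb", "Modal", "Pronoun"]
costs = action_costs(roll_in=["Pronoun"], gold=gold, actions=tags)
print(costs)  # {'Noun': 1, 'Verb': 1, 'Modal': 0, 'Pronoun': 1}
```

The resulting cost vector is what a cost-sensitive classifier would be trained on for this state.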



### Bandit learning

Rolling out each action can be expensive:
- many possible actions
- expensive loss functions to calculate

What if we only roll out one action?


<img src="images/bandits.jpg" style="float: right; width:40%;">

**Bandit Learning** 

Structured Contextual Bandit LOLS ([Chang et al., 2015](https://arxiv.org/pdf/1502.02206.pdf))

[Sokolov et al. (2016)](http://www.cl.uni-heidelberg.de/~sokolov/pubs/sokolov16learning.pdf) used bandit feedback for chunking
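
To make the single-rollout idea concrete, the sketch below samples one action epsilon-greedily, pays for one roll-out, and builds an importance-weighted cost estimate for all actions. This is a generic inverse-propensity estimator written for illustration; it is an assumption, not necessarily the exact estimator used in the papers cited above, and `bandit_cost_estimate` and `true_costs` are hypothetical names.

```python
import random
from typing import Callable, Dict, Sequence


def bandit_cost_estimate(
    actions: Sequence[str],
    greedy_action: str,
    rollout_cost: Callable[[str], float],
    epsilon: float,
    rng: random.Random,
) -> Dict[str, float]:
    """Roll out ONE sampled action and return importance-weighted cost estimates."""
    n = len(actions)
    # Epsilon-greedy sampling distribution over the actions.
    probs = {
        a: epsilon / n + (1.0 - epsilon) * (a == greedy_action) for a in actions
    }
    chosen = rng.choices(list(actions), weights=[probs[a] for a in actions], k=1)[0]
    observed = rollout_cost(chosen)  # the only roll-out we pay for
    # Inverse-propensity weighting: zero for unobserved actions,
    # observed cost divided by sampling probability for the chosen one.
    return {a: (observed / probs[a] if a == chosen else 0.0) for a in actions}


tags = ["Noun", "Verb", "Modal", "Pronoun"]
# Made-up roll-out costs, for illustration only.
true_costs = {"Noun": 1.0, "Verb": 2.0, "Modal": 0.5, "Pronoun": 1.0}
estimate = bandit_cost_estimate(
    tags, "Modal", true_costs.__getitem__, epsilon=0.5, rng=random.Random(0)
)
print(estimate)
```

Each individual estimate is noisy, but averaged over many samples it approaches the true per-action costs, which is what makes learning from one roll-out per state possible.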

### Semi/Unsupervised learning

Learning with non-decomposable loss functions means
- no need to know the correct actions
- learn to predict them in order to minimize the loss

<img src="images/tikz/semParse.png" style="width:80%;">

UNSEARN ([Daumé III, 2009](http://www.umiacs.umd.edu/~hal/docs/daume09unsearn.pdf)): Predict the structured output so that you can predict the input from it (auto-encoder!)

### Negative data sampling


In [9]:
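paths = [[], [(0, 4), (1, 3)], [(0, 4), (1, 3), (2, 2)], [(0, 4), (1, 3), (2, 2), (3, 1)]]
rows = ['Noun', 'Verb', 'Modal', 'Pronoun', 'NULL']
columns = ['NULL', 'I', 'can', 'fly']
# Each path entry is a (column index, row index) cell, i.e. (word, tag).
gold_path = [(0, 4), (1, 3), (2, 2), (3, 1)]
cbs = []
# Expert action sequence: the positive example.
cb_gold = cg.draw_cost_breakdown(rows, columns, gold_path)
cbs.append(cb_gold)
# A partial sequence that deviates from the expert: a negative example.
wrong_path = [(0, 4), (1, 2)]
cb_wrong = cg.draw_cost_breakdown(rows, columns, wrong_path)
cbs.append(cb_wrong)
# Explore every alternative tag for "can" and cost the resulting sequence.
for i in range(4):
    p = gold_path.copy()
    p[2] = (gold_path[2][0], i)
    if p == gold_path:
        cost = 0
    elif i == 1:
        # In this case the final cell also moves to (3, 0) and the cost is 2.
        cost = 2
        p[3] = (3, 0)
    else:
        cost = 1
    cbs.append(cg.draw_cost_breakdown(rows, columns, p, cost, p[3], roll_in_cell=p[1], roll_out_cell=(3, 0), explore_cell=p[2]))
util.Carousel(cbs)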

- Expert action sequence → positive example
- All other action sequences → negative examples

Generate useful negative samples around the expert; this is related to **adversarial training** ([Ho and Ermon, 2016](https://arxiv.org/abs/1606.03476)).
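
One simple way to read "negative samples around the expert" is to take every sequence that differs from the expert's in exactly one action. The sketch below assumes this one-step-deviation scheme purely for illustration; `one_step_deviations` is a hypothetical helper, not a function from `il_tutorial`.

```python
from typing import List, Sequence, Tuple


def one_step_deviations(
    expert: Sequence[str], actions: Sequence[str]
) -> List[Tuple[List[str], int]]:
    """All sequences that differ from the expert's in exactly one position.

    Returns (sequence, position of the deviation) pairs.
    """
    negatives = []
    for t, gold_action in enumerate(expert):
        for action in actions:
            if action != gold_action:
                seq = list(expert)
                seq[t] = action  # deviate from the expert at step t only
                negatives.append((seq, t))
    return negatives


expert = ["Pronoun", "Modal", "Verb"]  # tags for "I can fly"
tags = ["Noun", "Verb", "Modal", "Pronoun"]
negatives = one_step_deviations(expert, tags)
print(len(negatives))  # 9
```

These negatives stay close to the expert trajectory, unlike sequences sampled uniformly from the full output space.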

### Coaching

<p style="float: left;">If the optimal action is difficult to<br>predict, the coach teaches a good one<br>that is easier (<a href="https://papers.nips.cc/paper/4545-imitation-learning-by-coaching.pdf">He et al., 2012</a>)</p> <a href="https://commons.wikimedia.org/wiki/File:US_Navy_091206-N-2013O-023_Sam_Givens,_a_player_for_the_Harlem_Ambassadors_basketball_team,_demonstrates_proper_dribbling_techniques_to_a_boy_during_a_basketball_camp_sponsored_by_Yokosuka%27s_Morale,_Welfare_and_Recreation_Youth.jpg"><img src="images/coaching.jpg" style="width:30%; float: right;"></a>

Expert: $\alpha^{\star}= \mathop{\arg \min}_{\alpha \in {\cal A}} L(S_t(\alpha,\pi^{\star}),\mathbf{y})$

Coach: $\alpha^{\dagger}= \mathop{\arg \min}_{\alpha \in {\cal A}} \lbrace L(S_t(\alpha,\pi^{\star}),\mathbf{y}) - \lambda \cdot f(\alpha, \mathbf{x}, S_t)\rbrace$

i.e. $\alpha^{\dagger}=\mathop{\arg \min}_{\alpha \in {\cal A}} \lbrace \mathrm{loss}(\alpha) - \lambda \cdot \mathrm{classifier}(\alpha) \rbrace$
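
The two selection rules above can be sketched in a few lines. The loss and score values below are made up for illustration, and `expert_action` and `coach_action` are hypothetical names; `scores` stands in for the current classifier's score $f(\alpha, \mathbf{x}, S_t)$.

```python
from typing import Dict


def expert_action(losses: Dict[str, float]) -> str:
    """Expert: the action with the lowest loss."""
    return min(losses, key=losses.get)


def coach_action(losses: Dict[str, float], scores: Dict[str, float], lam: float) -> str:
    """Coach: trade off low loss against what the classifier already prefers."""
    return min(losses, key=lambda a: losses[a] - lam * scores[a])


# Made-up values: "Modal" is optimal but the classifier scores it very low.
losses = {"Modal": 0.0, "Verb": 1.0, "Noun": 2.0}
scores = {"Modal": -3.0, "Verb": 1.0, "Noun": 0.5}

print(expert_action(losses))                  # Modal
print(coach_action(losses, scores, lam=0.5))  # Verb
print(coach_action(losses, scores, lam=0.0))  # Modal
```

With $\lambda = 0$ the coach coincides with the expert; larger values of $\lambda$ pull the target towards actions the classifier can already predict.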

### Changeprop

Speed-up action costing in LOLS
([Vieira and Eisner, 2016](https://timvieira.github.io/doc/2016-tacl-pruning.pdf)):

If one action changes, don't re-assess the cost of the whole trajectory; under certain assumptions, just **prop**agate the **change**!

<img src="images/changeprop.png" style="width:70%;">

### Ranking

Jana Rao Doppa

<h3> What about Recurrent Neural Networks?</h3>

<img src="images/rnn.png" width="60%"  style="background:none;" />

<p>They face similar problems:
<ul>
<li>trained at the word rather than sentence level</li>
<li>assume previous predictions are correct</li>
</ul>
</p>

### Imitation learning and RNNs

<img src="images/mixer.png" width="70%"  style="background:none;" />


- DAgger mixed rollins, similar to scheduled sampling ([Bengio et al., 2015](http://arxiv.org/abs/1506.03099))
- MIXER ([Ranzato et al., 2016](https://arxiv.org/abs/1511.06732)): mix REINFORCE-ment learning with imitation, since we have the expert policy!
- no rollouts, learn a regressor to estimate action costs
- end-to-end backpropagation through the sequence
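
The mixed roll-in in the first bullet can be sketched as follows: at each step, take the expert's action with probability `beta` and the learned policy's action otherwise. This is a minimal sketch assuming per-step mixing and an expert that emits the gold tags; `mixed_roll_in` and `always_noun` are hypothetical names, and the learned policy is a stand-in for a trained classifier or RNN.

```python
import random
from typing import Callable, List, Sequence


def mixed_roll_in(
    gold: Sequence[str],
    learned_policy: Callable[[List[str], int], str],
    beta: float,
    rng: random.Random,
) -> List[str]:
    """Build a trajectory, choosing expert vs. learned policy at every step."""
    prefix: List[str] = []
    for t in range(len(gold)):
        if rng.random() < beta:
            prefix.append(gold[t])  # expert action
        else:
            prefix.append(learned_policy(prefix, t))  # model's own prediction
    return prefix


def always_noun(prefix: List[str], t: int) -> str:
    """Toy learned policy that ignores its input."""
    return "Noun"


gold = ["Pronoun", "Modal", "Verb"]
print(mixed_roll_in(gold, always_noun, beta=1.0, rng=random.Random(0)))
# ['Pronoun', 'Modal', 'Verb']
print(mixed_roll_in(gold, always_noun, beta=0.0, rng=random.Random(0)))
# ['Noun', 'Noun', 'Noun']
```

Decaying `beta` from 1 towards 0 over training exposes the model to more and more of its own predictions, which is the motivation shared with scheduled sampling.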

### Actor-critic

![](images/actorcritic.png)

[Bahdanau et al. (2017)](https://arxiv.org/pdf/1607.07086.pdf):
- actor: the RNN we are learning to use during testing
- critic: another RNN that is trained to predict the value of the actions of the actor

### Summary so far

- basic concepts
  - loss function decomposability
  - expert policy
- imitation learning
  - rollin/outs
  - DAgger algorithm
  - DAgger with rollouts and LOLS
- connections and interpretations

### After the break

- Applications:
  - dependency parsing
  - natural language generation
  - semantic parsing
- Practical advice
  - making things faster
  - debugging

<center>

<h1>Break!</h1>

</center>