# Finite D(x) with constrained ITG

For the standard ITG 

```
S -> X
X -> [X X]     (Monotone)
X -> <X X>     (Inverted)
X -> x/y       (Translation)
X -> x/eps     (Deletion)
X -> eps/y     (Insertion)
```

\\( D(x) = \\{ d: yield_\Sigma(d)=x \\}  \\) 
is an infinite set. This is because the insertion rule lets derivations grow without bound: it maps nothing in the source to a target word in the lexicon, so it can be applied any number of times.

We've seen how to constrain that set by explicitly intersecting it with a finite regular language. Here we show a different approach: we modify the grammar so that D(x) is finite by construction. This is convenient because, when training a CRF based on such a modified grammar, we save some calls to a general parsing/intersection procedure.

The approach is to first hard-code in the grammar whether an insertion happened. So we will define terminal rules whose nonterminals encode the purpose of the rule:

```
T -> x/y
D -> x/eps
I -> eps/y
```

We then have translation rules (T), where neither x nor y is empty, deletion rules (D), and insertion rules (I).

Then we will upgrade these rules to the status of phrases (X):

```
X -> T
X -> D
X -> I
```

And allow phrases to be concatenated in either order recursively:

```
X -> [X X]
X -> <X X>
```

Finally, phrases eventually become sentences:

```
S -> X
```

Now the crucial part. First we deal with the fact that we can delete as many words as we like by adding a recursive deletion: 

```
D -> [D D]     
```
(Note that there is no need for an inverted variant: both children have an empty target side, so the target string is the same in either order.)
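
To make the remark about word order concrete, here is a minimal sketch that computes the source and target yields of a toy derivation. The tuple encoding of derivations and the function name `yields` are assumptions made for illustration only; they are not part of `libitg`.

```python
# Toy encoding (an assumption for illustration, not libitg's data structures):
#   terminal node: (src_word, tgt_word), with None standing for eps
#   binary node:   (orientation, left, right), orientation '[]' or '<>'

def yields(node):
    """Return (source_words, target_words) for a toy ITG derivation."""
    if len(node) == 2:  # terminal rule x/y, x/eps or eps/y
        src, tgt = node
        src_words = (src,) if src is not None else ()
        tgt_words = (tgt,) if tgt is not None else ()
        return src_words, tgt_words
    orientation, left, right = node
    left_src, left_tgt = yields(left)
    right_src, right_tgt = yields(right)
    if orientation == '[]':  # monotone: same order on both sides
        return left_src + right_src, left_tgt + right_tgt
    # inverted: the source order is kept, the target order is swapped
    return left_src + right_src, right_tgt + left_tgt


# Two deletions: the orientation makes no difference on the target side.
deleted_mono = yields(('[]', ('le', None), ('petit', None)))
deleted_inv = yields(('<>', ('le', None), ('petit', None)))

# Two translations: the orientation does change the target order.
trans_mono = yields(('[]', ('chien', 'dog'), ('noir', 'black')))
trans_inv = yields(('<>', ('chien', 'dog'), ('noir', 'black')))
```

With two deletions both orientations give the same pair of yields, which is why a single monotone `D -> [D D]` is enough, whereas with two translations the target sides differ.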

But, we will not do the same for insertions! Instead, we will limit insertions to be accompanied by a translation (on either side):

```
X -> [T I]
X -> <T I>
X -> [I T]
X -> <I T>
```

And this is it! An ITG for which the set D(x) is finite! Here is the complete grammar:

```
S -> X
X -> [X X]
X -> <X X>
X -> T
X -> D
X -> I
X -> [T I]
X -> <T I>
X -> [I T]
X -> <I T>
D -> [D D]
T -> x/y
D -> x/eps
I -> eps/y
```
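
As a rough illustration of how such a grammar could be instantiated from a lexicon, the sketch below enumerates the rules above as plain tuples. The function `finite_itg_rules` and the `(lhs, rhs, orientation)` encoding are hypothetical and are not how `libitg` represents grammars; the only convention taken from this notebook is that `-EPS-` marks the empty string in the lexicon.

```python
EPS = '-EPS-'  # empty-string marker, as used in the lexicon of this notebook

def finite_itg_rules(lexicon):
    """List the constrained ITG as (lhs, rhs, orientation) tuples.

    Hypothetical helper for illustration; not part of libitg.
    """
    rules = [
        ('S', ('X',), None),
        ('X', ('X', 'X'), '[]'),
        ('X', ('X', 'X'), '<>'),
        ('X', ('T',), None),
        ('X', ('D',), None),
        ('X', ('I',), None),
        ('X', ('T', 'I'), '[]'),
        ('X', ('T', 'I'), '<>'),
        ('X', ('I', 'T'), '[]'),
        ('X', ('I', 'T'), '<>'),
        ('D', ('D', 'D'), '[]'),
    ]
    # Terminal rules: the LHS records whether the pair is a
    # translation (T), a deletion (D) or an insertion (I).
    for src, tgts in sorted(lexicon.items()):
        for tgt in sorted(tgts):
            if src == EPS and tgt == EPS:
                continue  # eps/eps carries no information
            if src == EPS:
                lhs = 'I'
            elif tgt == EPS:
                lhs = 'D'
            else:
                lhs = 'T'
            rules.append((lhs, (src, tgt), None))
    return rules


toy_lexicon = {'le': {EPS, 'the'}, 'chien': {'dog'}, EPS: {'a'}}
rules = finite_itg_rules(toy_lexicon)
for lhs, rhs, orientation in rules:
    print(lhs, '->', orientation or '', ' '.join(rhs))
```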

For you to try this out, we created two helper functions:

1. `libitg.make_source_side_finite_itg`
2. `libitg.make_target_side_finite_itg`

They are drop-in replacements for the older (standard ITG) versions.

Parsing with this grammar is much faster, and we managed to parse longer sentences (we tested with as many as 20 words). You might even be able to play with a slightly bigger lexicon, but we make no promises ;). Check the file `fast_example.py` in the repository.

**Importantly!** There's no need for \\(D_n(x)\\) *whatsoever*, so the roadmap would become:

1.	Get a grammar 
    `src_grammar = libitg.make_source_side_finite_itg(...)`
2.	Parse the source
    `_Dx = earley with src_grammar and src_fsa`
3.	Project the forest getting a finite \\(D(x)\\)
    `Dx = libitg.make_target_side_finite_itg using _Dx`
4.	For training, also parse the target using the newly obtained \\(D(x)\\)
    `Dxy = earley with Dx and tgt_fsa`

### Extra note on feature function

The following paragraph was a bit confusing:

> You need to make sure your feature function is aware that now you will have a single span per symbol when dealing with \\(D(x)\\) and a pair of spans when dealing with \\(D(x,y)\\). Also, your feature function can now capitalise on the new types of rules (for example to count number of translations, insertions and deletions in each edge).

This might have implied that you need a separate feature function for \\(D(x, y)\\), but that is **not** the case. \\(D(x, y)\\) is just \\(D(x)\\) constrained.

To be able to restrict \\(D(x)\\) we need to transfer memory from the automaton to the grammar. Intersection transfers memory by "refining" nonterminals so that they contain enough information to partition the set of strings they project onto. That's why \\(D(x,y)\\) has more span information: its symbols tell you about paths in the target FSA.

**So quite formally, you are always featurising edges in \\(D(x)\\).** Every edge in \\(D(x, y)\\) is nothing but an edge of \\(D(x)\\) with extra span information (from which you should not really derive any additional features).

Thus the right way to go about designing your feature function is to **design it once**, and to design it for \\(D(x)\\). Your log-potentials are now \\(w \cdot \phi(r, s, x)\\), where r is the rule identity, s the source span, and x the source sentence. Compared to what we had before with \\(D_n(x)\\), we lost access to t (the target span), but hopefully you will see below that we haven't lost all that much.

### Why losing access to the target span does not matter a lot

Before, you couldn't really get much from the target span. For example, you couldn't use target span information to build features that inspect target n-grams: y was not really available, and such features would break the edge factorisation of the model. Think of it this way: you wouldn't be able to get target n-gram features from \\(D_n(x)\\), and whatever you can get only from \\(D(x,y)\\) is useless because you need to work with both when doing MLE.

The only thing you could really get from target span information was span size. That is somewhat gone now (it's the price we pay for not intersecting an automaton that can count string length for us). Also note that tracking length is indeed expensive: it makes the forest much bigger, because you need to refine nonterminals so that they encode the size of the strings they project onto. That's where the current improved performance comes from: our grammar manages to constrain insertions without counting directly.

Now, some of the target length information you can actually recover without direct access to target span info. Locally to each edge in D(x), you can have a feature that records whether a translation, insertion, or deletion happened, because LHS symbols like T/D/I tell you that. Since this is a global model, the feature "insertion" for a derivation is the sum of "insertion" over its edges, which gives you the number of insertions. So if you increment a feature like "target-length" each time a translation or an insertion happens in an edge, the total over a derivation will be the correct target length.
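
A minimal sketch of this counting argument, assuming a toy `Edge` type with just an LHS label and an RHS tuple (the real forest edges carry span-annotated symbols; `Edge`, `edge_features` and `derivation_features` are illustrative names, not `libitg` API):

```python
from collections import Counter, namedtuple

# Toy edge: LHS label plus RHS symbols (an assumption for illustration).
Edge = namedtuple('Edge', ['lhs', 'rhs'])

def edge_features(edge):
    """Local features that depend only on the rule of a single edge."""
    feats = Counter()
    if edge.lhs == 'T':
        feats['translation'] += 1
        feats['target-length'] += 1   # a translation emits one target word
    elif edge.lhs == 'I':
        feats['insertion'] += 1
        feats['target-length'] += 1   # an insertion emits one target word
    elif edge.lhs == 'D' and len(edge.rhs) == 1:
        feats['deletion'] += 1        # D -> x/eps, but not D -> [D D]
    return feats

def derivation_features(edges):
    """Global features are the sum of the local ones over the derivation."""
    total = Counter()
    for edge in edges:
        total.update(edge_features(edge))
    return total


# One derivation of 'le chien noir' -> 'black dog': delete 'le',
# translate 'chien' and 'noir', and invert the last two.
derivation = [
    Edge('S', ('X',)),
    Edge('X', ('X', 'X')),
    Edge('X', ('D',)), Edge('D', ('le/-EPS-',)),
    Edge('X', ('X', 'X')),
    Edge('X', ('T',)), Edge('T', ('chien/dog',)),
    Edge('X', ('T',)), Edge('T', ('noir/black',)),
]
totals = derivation_features(derivation)
```

Here `totals['target-length']` adds up to the length of `black dog` even though no edge ever sees a target span.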

**In sum, design one feature function only, and do it limiting yourself to the degree of information available in edges of D(x).** When dealing with D(x,y), just ignore the target span info and focus on the common part: r, s, and x.

In practical terms, if you **hash** your features by (r, s) for each training/test instance, you won't make mistakes. And in training, after featurising the edges in D(x), you'll already have all quantities necessary for D(x,y) cached.
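
One possible way to organise such a cache is sketched below, assuming an edge can be reduced to a rule identity, a source span and (for D(x,y) only) a target span. The class `EdgeFeatureCache` and the toy `featurise` function are hypothetical and only meant to show the keying by (r, s).

```python
class EdgeFeatureCache:
    """Per-instance cache of edge features keyed by (rule, source span)."""

    def __init__(self, featurise, x):
        self.featurise = featurise  # the single feature function phi(r, s, x)
        self.x = x                  # source sentence of this instance
        self.cache = {}

    def __call__(self, rule, src_span, tgt_span=None):
        # The target span is accepted but deliberately ignored.
        key = (rule, src_span)
        if key not in self.cache:
            self.cache[key] = self.featurise(rule, src_span, self.x)
        return self.cache[key]


calls = []  # records how often the feature function really runs

def featurise(rule, src_span, x):
    calls.append((rule, src_span))
    return {'rule:' + rule: 1.0, 'src-span-size': src_span[1] - src_span[0]}

phi = EdgeFeatureCache(featurise, 'le chien noir'.split())

# Featurise the edges of D(x) first ...
Dx_edges = [('T -> chien/dog', (1, 2)), ('T -> noir/black', (2, 3))]
for rule, s in Dx_edges:
    phi(rule, s)
calls_after_Dx = len(calls)

# ... then the edges of D(x,y) are pure cache hits.
Dxy_edges = [('T -> chien/dog', (1, 2), (1, 2)),
             ('T -> noir/black', (2, 3), (0, 1))]
Dxy_feats = [phi(rule, s, t) for rule, s, t in Dxy_edges]
calls_after_Dxy = len(calls)
```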

### Conclusion

We hope this helps everybody discard less data and parse quickly!

Note that it is sometimes normal for this grammar to produce an empty \\(D(x,y)\\), because some strings are not within its generative capacity. In those cases you can simply discard the training instance.
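
Discarding such instances can be as simple as the sketch below. It assumes a callable `parse_pair(x, y)` that returns the rules of \\(D(x,y)\\) as a sized collection (empty when the pair is outside the grammar); both `keep_parsable` and `parse_pair` are hypothetical names, and the stub parser exists only to exercise the filter.

```python
def keep_parsable(instances, parse_pair):
    """Keep only the training pairs whose D(x,y) is non-empty."""
    kept = []
    for x, y in instances:
        Dxy = parse_pair(x, y)
        if len(Dxy) > 0:
            kept.append((x, y, Dxy))
    return kept


# Stub parser: pretend that only the first pair is within the grammar.
fake_forests = {
    ('le chien noir', 'black dog'): ['rule-1', 'rule-2'],
}

def stub_parse_pair(x, y):
    return fake_forests.get((x, y), [])

training = keep_parsable(
    [('le chien noir', 'black dog'),
     ('le chien noir', 'a a a a black dog')],
    stub_parse_pair,
)
```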

If you go with this new grammar, note that effectively you will be using \\(D(x)\\) everywhere you previously used \\(D_n(x)\\).



In [1]:
from fast_example import test
from collections import defaultdict

In [2]:
# Create test lexicon
lexicon = defaultdict(set)
lexicon['le'].update(['-EPS-', 'the', 'some', 'a', 'an'])  # we will assume that `le` can be deleted
lexicon['e'].update(['-EPS-', 'and', '&', 'also', 'as'])
lexicon['chien'].update(['-EPS-', 'dog', 'canine', 'wolf', 'puppy'])
lexicon['noir'].update(['-EPS-', 'black', 'noir', 'dark', 'void'])
lexicon['blanc'].update(['-EPS-', 'white', 'blank', 'clear', 'flash'])
lexicon['petit'].update(['-EPS-', 'small', 'little', 'mini', 'almost'])
lexicon['petite'].update(['-EPS-', 'small', 'little', 'mini', 'almost'])
lexicon['.'].update(['-EPS-', '.', '!', '?', ','])
lexicon['-EPS-'].update(['.', 'a', 'the', 'some', 'of'])  # we will assume that these target words can be inserted
print('LEXICON')
for src_word, tgt_words in lexicon.items():
    print('%s: %d options' % (src_word, len(tgt_words)))
print()

LEXICON
-EPS-: 5 options
noir: 5 options
le: 5 options
e: 5 options
petite: 5 options
.: 5 options
chien: 5 options
petit: 5 options
blanc: 5 options



In [3]:
# Let's test the faster parser!

test(lexicon,
        'le chien noir',
        'black dog',
        inspect_strings=False)
test(lexicon,
        'le chien noir',
        'the black dog .',
        inspect_strings=False)
test(lexicon,
        'le petit chien noir e le petit chien blanc .',
        'the little white dog and the little black dog .')
test(lexicon,
        'le petit chien noir e le petit chien blanc .',
        'the little white dog and the little black dog .')
test(lexicon,
        'le petit chien noir e le petit chien blanc e le petit petit chien .',
        'the little black dog and the little white dog and the mini dog .')
test(lexicon,
        'le petit chien noir e le petit chien blanc e le petit chien petit blanc .',
        'the little black dog and the little white dog and the mini almost white dog .')

print('**** The next example should be out of the space of the constrained ITG ***')
test(lexicon,
        'le petit chien noir e le petit chien blanc e le petit petit chien petit blanc e petit noir .',
        'the little black dog and the little white dog and the dog a bit white and a bit black .')


TRAINING INSTANCE: |x|=3 |y|=2
le chien noir
black dog

D(x): 70 rules in 0.0092 secs or clean=70 rules at extra 0.0019 secs
D(x,y): 52 rules in 0.0206 secs or clean=19 rules at extra 0.0014 secs
19 loaded

TRAINING INSTANCE: |x|=3 |y|=4
le chien noir
the black dog .

D(x): 70 rules in 0.0054 secs or clean=70 rules at extra 0.0013 secs
D(x,y): 94 rules in 0.0188 secs or clean=30 rules at extra 0.0008 secs
30 loaded

TRAINING INSTANCE: |x|=10 |y|=10
le petit chien noir e le petit chien blanc .
the little white dog and the little black dog .

D(x): 707 rules in 0.0366 secs or clean=707 rules at extra 0.0089 secs
D(x,y): 5347 rules in 0.2817 secs or clean=365 rules at extra 0.0221 secs
365 loaded

TRAINING INSTANCE: |x|=10 |y|=10
le petit chien noir e le petit chien blanc .
the little white dog and the little black dog .

D(x): 707 rules in 0.0261 secs or clean=707 rules at extra 0.0067 secs
D(x,y): 5347 rules in 0.2647 secs or clean=365 rules at extra 0.0431 secs
365 loaded

TRAINING INS

In [4]:
# Now let's see how slow the previous way of parsing was

from slow_example import test as slow_test

slow_test(lexicon,
        'le chien noir',
        'black dog',
        'length', inspect_strings=False)
slow_test(lexicon,
        'le chien noir',
        'the black dog .',
        'insertion', nb_insertions=1, inspect_strings=False)
slow_test(lexicon,
        'le petit chien noir e le petit chien blanc .',
        'the little white dog and the little black dog .',
        'length')
slow_test(lexicon,
        'le petit chien noir e le petit chien blanc .',
        'the little white dog and the little black dog .',
        'insertion', nb_insertions=3)

TRAINING INSTANCE: |x|=3 |y|=2
le chien noir
black dog

Using LengthConstraint
states=3
initial=0
final=0 1 2
arcs=2
origin=0 destination=1 label=-WILDCARD-
origin=1 destination=2 label=-WILDCARD-
D(x): 73 rules in 0.0060 secs or clean=73 rules at extra 0.0019 secs
D_n(x): 262 rules in 0.0324 secs or clean=262 rules at extra 0.0057 secs
D(x,y): 35 rules in 0.0079 secs or clean=14 rules at extra 0.0005 secs
14 loaded

TRAINING INSTANCE: |x|=3 |y|=4
le chien noir
the black dog .

Using InsertionConstraint
states=2
initial=0
final=0 1
arcs=3
origin=0 destination=0 label=-WILDCARD-
origin=0 destination=1 label=-EPS-
origin=1 destination=1 label=-WILDCARD-
D(x): 73 rules in 0.0037 secs or clean=73 rules at extra 0.0014 secs
D_n(x): 112 rules in 0.0076 secs or clean=112 rules at extra 0.0021 secs
D(x,y): 192 rules in 0.0225 secs or clean=26 rules at extra 0.0013 secs
26 loaded

TRAINING INSTANCE: |x|=10 |y|=10
le petit chien noir e le petit chien blanc .
the little white dog and the little b

In [None]:
# From here on the length constraint is a bit too slow, but you can test it if you are patient

slow_test(lexicon,
       'le petit chien noir e le petit chien blanc e le petit petit chien .',
       'the little black dog and the little white dog and the mini dog .',
       'length')
slow_test(lexicon,
        'le petit chien noir e le petit chien blanc e le petit petit chien .',
        'the little black dog and the little white dog and the mini dog .',
        'insertion', nb_insertions=3)
slow_test(lexicon,
       'le petit chien noir e le petit chien blanc e le petit petit chien petit blanc e petit noir .',
       'the little black dog and the little white dog and the dog a bit white and a bit black .',
       'length')
slow_test(lexicon,
        'le petit chien noir e le petit chien blanc e le petit petit chien petit blanc e petit noir .',
        'the little black dog and the little white dog and the dog a bit white and a bit black .',
        'insertion', nb_insertions=3)