In [8]:
import re

test_set_doc = 'validate.txt'
test_set_doc_checked = test_set_doc + '.checked'

## Step 1: Load test set

In [9]:
test_items = []

with open(test_set_doc, encoding='utf-8') as infile:
    for line in infile:
        if '\t' not in line:
            continue
        test_items.append(line.strip().split('\t')[1].strip())

print(f'Loaded {len(test_items)} lines')

Loaded 199 lines


Test set annotation:
- Preverb marked by a prefixed backslash followed by an ID number, e.g. `meg\1`
- Word from which the preverb was separated marked by a pipe followed by the same ID number, e.g. `főzve|1`
- Within the same line, different preverb-verb pairs must receive different ID numbers.
- Only single-digit ID numbers allowed.
- A preverb that is not separated from any word in the sentence (ellipsis etc.) is marked with a zero ID:
    * `"Hazakísérhetlek?" "Meg\0 hát."`
- Any number of preverbs can have the 0 ID within the same line.
- A preverb that directly follows its verb is not annotated; a postverbal preverb is annotated only if other words separate it from the preceding verb:
    * `főzte meg`
    * but: `főzte|1 volna meg\1`
- Normally there is a 1:1 correspondence between preverbs and verbs. However, there are exceptions, and these are annotated accordingly, e.g. `Se ki\1, se be\1 nem lehetett menni|1 Budakesziről`
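To make the annotation scheme concrete, the following minimal sketch collects the annotated preverbs and verbs of a single line and groups them by ID. The helper `annotated_pairs` and both regular expressions are illustrative assumptions, not part of the notebook or of `connect_prev`.

```python
import re
from collections import defaultdict

# Hypothetical helper, not part of the notebook: groups the annotated
# preverbs and verbs of one test line by their ID number.
PREV_RE = re.compile(r'(\w+)\\(\d)')  # e.g. meg\1
VERB_RE = re.compile(r'(\w+)\|(\d)')  # e.g. főzve|1

def annotated_pairs(line):
    """Return {id: {'prev': [...], 'verb': [...]}} for one annotated line."""
    pairs = defaultdict(lambda: {'prev': [], 'verb': []})
    for word, idx in PREV_RE.findall(line):
        pairs[idx]['prev'].append(word)
    for word, idx in VERB_RE.findall(line):
        pairs[idx]['verb'].append(word)
    return dict(pairs)

# Regular 1:1 pair
print(annotated_pairs('főzte|1 volna meg\\1'))
# Two preverbs sharing one verb (the documented exception)
print(annotated_pairs('Se ki\\1, se be\\1 nem lehetett menni|1 Budakesziről'))
# Preverb with zero ID: no verb counterpart in the sentence
print(annotated_pairs('"Hazakísérhetlek?" "Meg\\0 hát."'))
```

Grouping by ID rather than pairing by position keeps the exceptional many-to-one cases visible as a list with more than one entry.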

## Step 2: Check test set for errors and reformat for emtsv tok

In [10]:
with open(test_set_doc_checked, 'w', encoding='utf-8') as outfile:
    for t in test_items:
        # IDs are single digits; 0 marks a preverb without a verb counterpart
        prev_indices = re.findall(r'\\([1-9])', t)
        if not re.search(r'\\\d', t): # re.findall never returns None, so test for a match instead
            print("No prevs:", t) # unannotated line
            continue
        verb_indices = re.findall(r'\|([1-9])', t)
        if (len(set(prev_indices)) != len(prev_indices) or # any duplicate indices
            len(set(verb_indices)) != len(verb_indices) or
            sorted(prev_indices) != sorted(verb_indices) # any unmatched indices
           ):
            print('Possible error:', t)
            print(prev_indices)
            print(verb_indices)
        outfile.write(t.replace('\\', '_p').replace('|', '_v') + '\n')

If any errors are reported, correct them in the input file, then go back to step 1.
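The checks of this step can also be tried on single lines. The sketch below is a hypothetical helper (`find_problems` is not part of the notebook) that assumes nonzero IDs in the range 1 to 9 and names the kind of problem found. It also shows that the documented exception, where two preverbs share one verb, is reported as a duplicate, so a "Possible error" message is a prompt for manual review rather than proof of an annotation mistake.

```python
import re
from collections import Counter

def find_problems(line):
    """Hypothetical helper: list the consistency problems of one annotated line."""
    # Zero IDs are ignored here: they have no verb counterpart by definition.
    prevs = Counter(re.findall(r'\\([1-9])', line))
    verbs = Counter(re.findall(r'\|([1-9])', line))
    problems = []
    dup_prevs = sorted(i for i, n in prevs.items() if n > 1)
    dup_verbs = sorted(i for i, n in verbs.items() if n > 1)
    unmatched = sorted(set(prevs) ^ set(verbs))  # IDs present on one side only
    if dup_prevs:
        problems.append(f'duplicate preverb IDs: {dup_prevs}')
    if dup_verbs:
        problems.append(f'duplicate verb IDs: {dup_verbs}')
    if unmatched:
        problems.append(f'unmatched IDs: {unmatched}')
    return problems

print(find_problems('főzte|1 volna meg\\1'))
print(find_problems('főzte|1 volna meg\\2'))
print(find_problems('Se ki\\1, se be\\1 nem lehetett menni|1 Budakesziről'))
```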

## Step 3: Tokenize in emtsv, insert testid column, run morph,pos

In [11]:
# on Linux or macOS, replace `type` with `cat`
!type {test_set_doc_checked} | docker run -i mtaril/emtsv tok > tok_output.txt

print("tok_output done")

with open('tok_output.txt', encoding='utf-8') as infile:
    with open('morph_input.txt', 'w', encoding='utf-8') as outfile:
        outfile.write(next(infile).strip() + '\ttestid\n')
        for line in infile:
            if line.strip() == '':
                outfile.write(line)
                continue
            testid = '.' # avoid removal of empty column by xtsv
            m = re.search(r'_([pv]\d)', line) # raw string, otherwise '\d' is an invalid escape sequence
            if m:
                testid = m.group(1)
                line = re.sub(r'_[pv]\d', '', line)
            outfile.write(line.strip() + '\t' + testid + '\n')

print("morph_input done")

!type morph_input.txt | docker run -i mtaril/emtsv morph,pos > pos_output.txt

print("pos_output done")

tok_output done
morph_input done
pos_output done
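The marker-to-column step can be tried without Docker on an in-memory sample. The sketch below repeats the logic of the cell above in a function. The name `add_testid_column` and the sample rows are made up for illustration, and the sample assumes that the tokenizer output has `form` and `wsafter` columns and keeps a marker such as `_v1` attached to its token.

```python
import io
import re

MARKER_RE = re.compile(r'_([pv]\d)')  # markers written in step 2, e.g. _p1 or _v1

def add_testid_column(infile, outfile):
    """Move the _p/_v marker of each token into a new 'testid' column."""
    outfile.write(next(infile).strip() + '\ttestid\n')  # extend the header
    for line in infile:
        if line.strip() == '':
            outfile.write(line)  # keep sentence boundaries
            continue
        m = MARKER_RE.search(line)
        testid = m.group(1) if m else '.'  # '.' keeps the column non-empty
        outfile.write(MARKER_RE.sub('', line).strip() + '\t' + testid + '\n')

# Made-up sample in the assumed tokenizer output format
sample = 'form\twsafter\nfőzte_v1\t" "\nvolna\t" "\nmeg_p1\t""\n\n'
out = io.StringIO()
add_testid_column(io.StringIO(sample), out)
print(out.getvalue())
```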


In [12]:
import sys
sys.path.append('../preverb')
from word import Word

## Step 4: Check if preverbs are annotated as preverbs and verb-like tokens as verbs

In [13]:
from word import Word

with open('pos_output.txt', encoding='utf-8') as infile:
    header = next(infile)
    Word.features = header.strip().split('\t')
    for n, line in enumerate(infile):
        if line.strip() == '':
            continue
        token = Word(line.strip().split('\t'))
        if token.testid[0] == 'p' and not token.xpostag.startswith('[/Prev]'):
            print(f"Bad prev on line {n + 2}:", token, '\n')
        if token.testid[0] == 'v' and not token.xpostag.startswith('[/V]'):
            print(f"Bad verb (?) on line {n + 2}:", token, '\n')


Bad verb (?) on line 760: fogadható	" "	v1	[{"lemma": "fogadható", "tag": "[/Adj][Nom]", "morphana": "fogad[/V]=fogad+ható[_ModPtcp/Adj]=ható+[Nom]=", "readable": "fogad[/V] + ható[_ModPtcp/Adj] + [Nom]", "twolevel": "f:f o:o g:g a:a d:d :[/V] h:h a:a t:t ó:ó :[_ModPtcp/Adj] :[Nom]"}]	fogadható	[/Adj][Nom] 

Bad verb (?) on line 1142: avatkozás	" "	v1	[{"lemma": "avatkozás", "tag": "[/N][Nom]", "morphana": "avatkozik[/V]=avatkoz+ás[_Ger/N]=ás+[Nom]=", "readable": "avatkozik[/V]=avatkoz + ás[_Ger/N] + [Nom]", "twolevel": "a:a v:v a:a t:t k:k o:o z:z :i :k :[/V] á:á s:s :[_Ger/N] :[Nom]"}]	avatkozás	[/N][Nom] 

Bad verb (?) on line 1149: képzelhető	" "	v1	[{"lemma": "képzelhető", "tag": "[/Adj][Nom]", "morphana": "képzel[/V]=képzel+hető[_ModPtcp/Adj]=hető+[Nom]=", "readable": "képzel[/V] + hető[_ModPtcp/Adj] + [Nom]", "twolevel": "k:k é:é p:p z:z e:e l:l :[/V] h:h e:e t:t ő:ő :[_ModPtcp/Adj] :[Nom]"}, {"lemma": "képzelhető", "tag": "[/Adj][Nom]", "morphana": "képzelhető[/Adj]=képzelhető+

If there is any output, the line numbers refer to `pos_output.txt`.

Check whether the tokens annotated as separated preverbs are also analysed as preverbs by `morph,pos`. If one is not (e.g. if the preverb _meg_ is tagged by emtsv as `[/Conj]`), **remove this annotation** from the test set, or remove the whole item if no annotation is left. In such cases `connect_prev` will necessarily fail because of the incorrect emtsv analysis, which is extraneous to the evaluation of its performance.

Exception: person-inflected preverb-like postpositions, as in `utánam\1 dobják|1`, which emtsv tags as `[/Post]`, and case-inflected personal pronouns, as in `hozzá\1 voltam szokva|1`, which it tags as `[/N|Pro]`, **should not be removed from the test set**, since `connect_prev` should be able to handle these.

If a token is annotated as the verb stem counterpart of a separated preverb but is not tagged by emtsv as a verb, check whether the preverb annotation is correct. If it is, **do not remove this annotation** from the test set: `connect_prev` is supposed to be able to connect such separated preverbs.

After all real errors have been corrected, rerun everything from step 1.
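The review rules above can be summarised as a small decision function. This is only a sketch: `triage` and its return strings are hypothetical, it looks only at the start of the `xpostag` value, and it cannot tell whether a `[/Post]` or `[/N|Pro]` token is really one of the inflected forms described above, so its suggestions still need to be checked by hand.

```python
def triage(testid, xpostag):
    """Hypothetical helper: suggest what to do with one annotated token."""
    kind = testid[0]
    if kind == 'p':
        if xpostag.startswith('[/Prev]'):
            return 'ok'
        if xpostag.startswith(('[/Post]', '[/N|Pro]')):
            # possibly an inflected postposition or personal pronoun
            return 'keep: possible exception, check manually'
        return 'remove annotation: preverb not tagged as preverb'
    if kind == 'v':
        if xpostag.startswith('[/V]'):
            return 'ok'
        return 'keep: check that the preverb annotation is correct'
    return 'not annotated'

print(triage('p1', '[/Conj]'))
print(triage('p1', '[/Post]'))
print(triage('v1', '[/Adj][Nom]'))
```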

In [14]:
!type pos_output.txt | python ../preverb > preverb_output.tsv