In [1]:
import re

test_set_doc = 'validate.txt'
test_set_doc_checked = test_set_doc + '.checked'

## Step 1: Load test set

In [2]:
test_items = []

with open(test_set_doc, encoding='utf-8') as infile:
    for line in infile:
        if '\t' not in line:
            continue
        test_items.append(line.strip().split('\t')[1].strip())

print(f'Loaded {len(test_items)} lines')

Loaded 199 lines


Test set annotation:
- Preverb marked by a suffixed backslash followed by an ID number, e.g. `meg\1`
- Word from which the preverb was separated marked by a pipe followed by the same ID number, e.g. `főzve|1`
- Within the same line, different verb-prefix pairs must (obviously) receive different ID numbers.
- Only single-digit ID numbers allowed.
- A preverb that is not separated from any word in the sentence (ellipsis etc.) is marked with a zero ID:
    * `"Hazakísérhetlek?" "Meg\0 hát."`
- Any number of preverbs can have the 0 ID within the same line.
- A preverb that follows its verb is only annotated if other words separate it from that verb; a verb directly followed by its preverb is left unannotated:
    * `főzte meg`
    * but: `főzte|1 volna meg\1`
- Normally there is a 1:1 correspondence between preverbs and verbs. However, there are exceptions, and these are annotated accordingly, e.g. `Se ki\1, se be\1 nem lehetett menni|1 Budakesziről`; `át-\1 meg átjárták|1`
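
The annotation scheme above can be read back mechanically. The following is a minimal sketch, not part of the original pipeline: `parse_annotation` is a hypothetical helper, and the word pattern (`\w` plus hyphens) is an assumption about what counts as a token.

```python
import re

# Hypothetical helper: extract (word, ID) pairs from one annotated line.
# Preverbs are written as <word>\<digit>, their verbs as <word>|<digit>.
PREVERB = re.compile(r'(\w[\w-]*)\\(\d)')
VERB = re.compile(r'(\w[\w-]*)\|(\d)')

def parse_annotation(line):
    """Return (preverbs, verbs), each a list of (word, id) tuples."""
    return PREVERB.findall(line), VERB.findall(line)

# Examples taken from the annotation guidelines above.
print(parse_annotation(r'főzte|1 volna meg\1'))
print(parse_annotation(r'Se ki\1, se be\1 nem lehetett menni|1 Budakesziről'))
print(parse_annotation(r'"Hazakísérhetlek?" "Meg\0 hát."'))
```

The second example shows the 2:1 exception (two preverbs sharing one verb), and the third shows a preverb with the zero ID and no verb counterpart.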

## Step 2: Check test set for errors and reformat for emtsv tok

In [3]:
with open(test_set_doc_checked, 'w', encoding='utf-8') as outfile:
    for t in test_items:
        prev_indices = re.findall(r'\\([1-5])', t)
        if not re.search(r'\\\d', t): # re.findall() never returns None; look for any \N label (including \0)
            print("No prevs:", t) # unannotated line
            continue
        verb_indices = re.findall(r'\|([1-5])', t)
        if (sorted(list(set(prev_indices))) != sorted(prev_indices) or # any duplicate indices
            sorted(list(set(verb_indices))) != sorted(verb_indices) or
            sorted(prev_indices) != sorted(verb_indices) # any unmatched indices
           ):
            print('Possible error:', t)
            print(prev_indices)
            print(verb_indices)
        t = re.sub(r'(\\|\|)(\d)(\S)', r'\1\2 \3', t)
        outfile.write(t.replace('\\', '_p').replace('|', '_v') + '\n')

Possible error: S így búslakodván érzetem csapkodásaiban, kedves anyám jutott eszembe, s el\1 is gondoltam|1, hogy bezzeg az ő kies kertjében más gyümölcs termik; s el\1 azt is, hogy az a gyümölcs most jobb volna nekem, mert anyám nem azt akarná, hogy én álljak helyt a bajban, hanem azt inkább, hogy az öröm állja meg mellettem a helyét.
['1', '1']
['1']


If there are errors, correct them in the input file, then go back to Step 1.
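
To see why a line is reported, it can help to name the individual problems. The sketch below is an illustrative restatement of the check in this step, not the notebook's own code; `annotation_problems` is a hypothetical function that flags the same lines but says which condition fired.

```python
from collections import Counter

def annotation_problems(prev_ids, verb_ids):
    """Return human-readable problems for one line's preverb and verb IDs."""
    problems = []
    dup_prev = sorted(i for i, c in Counter(prev_ids).items() if c > 1)
    dup_verb = sorted(i for i, c in Counter(verb_ids).items() if c > 1)
    if dup_prev:
        problems.append(f'duplicate preverb IDs: {dup_prev}')
    if dup_verb:
        problems.append(f'duplicate verb IDs: {dup_verb}')
    # IDs that occur on only one side (preverb without verb or vice versa)
    unmatched = sorted(set(prev_ids) ^ set(verb_ids))
    if unmatched:
        problems.append(f'unmatched IDs: {unmatched}')
    return problems

# The ID lists printed for the reported line above
print(annotation_problems(['1', '1'], ['1']))
```

A duplicate preverb ID is not always an error: the annotation guidelines allow several preverbs to share one verb, which is why the output says "Possible error".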

## Step 3: Tokenize in emtsv, insert testid column, run morph,pos

In [4]:
!cat {test_set_doc_checked} | docker run -i mtaril/emtsv tok > tok_output.txt

print("tok_output done")

with open('tok_output.txt', encoding='utf-8') as infile:
    with open('morph_input.txt', 'w', encoding='utf-8') as outfile:
        outfile.write(next(infile).strip() + '\ttestid\n')
        for line in infile:
            if line.strip() == '':
                outfile.write(line)
                continue
            testid = '.' # avoid removal of empty column by xtsv
            m = re.findall(r'_([pv]\d)', line) # raw string: '\d' is an invalid escape in a plain string
            if m:
                testid = m[0]
                line = re.sub(r'_[pv]\d', '', line)
            outfile.write(line.strip() + '\t' + testid + '\n')

print("morph_input done")

!cat morph_input.txt | docker run -i mtaril/emtsv morph,pos > pos_output.txt

print("pos_output done")

tok_output done
morph_input done
pos_output done
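
The label-to-column step in the cell above can be illustrated on single tokens. This is a minimal sketch under the assumption that the label sits on the token form; `split_label` is a hypothetical name, and the notebook itself applies the same regular expression to whole TSV lines.

```python
import re

LABEL = re.compile(r'_([pv]\d)')

def split_label(form):
    """Return (clean_form, testid) for one token form."""
    m = LABEL.search(form)
    if m is None:
        # '.' is a placeholder so that xtsv does not drop an empty column
        return form, '.'
    return LABEL.sub('', form), m.group(1)

for form in ['meg_p1', 'főzve_v1', 'hát']:
    print(split_label(form))
```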


In [5]:
import sys
sys.path.append('../preverb')
from word import Word

## Step 4: Check if preverbs are annotated as preverbs and verb-like tokens as verbs

In [6]:
from word import Word

with open('pos_output.txt', encoding='utf-8') as infile:
    header = next(infile)
    Word.features = header.strip().split('\t')
    for n, line in enumerate(infile):
        if line.strip() == '':
            continue
        token = Word(line.strip().split('\t'))
        if token.testid[0] == 'p' and not token.xpostag.startswith('[/Prev]'):
            print(f"Bad prev on line {n + 2}:", token, '\n')
        if token.testid[0] == 'v' and not token.xpostag.startswith('[/V]'):
            print(f"Bad verb (?) on line {n + 2}:", token, '\n')


Bad verb (?) on line 760: fogadható	" "	v1	[{"lemma": "fogadható", "tag": "[/Adj][Nom]", "morphana": "fogad[/V]=fogad+ható[_ModPtcp/Adj]=ható+[Nom]=", "readable": "fogad[/V] + ható[_ModPtcp/Adj] + [Nom]", "twolevel": "f:f o:o g:g a:a d:d :[/V] h:h a:a t:t ó:ó :[_ModPtcp/Adj] :[Nom]"}]	fogadható	[/Adj][Nom] 

Bad verb (?) on line 1146: avatkozás	" "	v1	[{"lemma": "avatkozás", "tag": "[/N][Nom]", "morphana": "avatkozik[/V]=avatkoz+ás[_Ger/N]=ás+[Nom]=", "readable": "avatkozik[/V]=avatkoz + ás[_Ger/N] + [Nom]", "twolevel": "a:a v:v a:a t:t k:k o:o z:z :i :k :[/V] á:á s:s :[_Ger/N] :[Nom]"}]	avatkozás	[/N][Nom] 

Bad verb (?) on line 1153: képzelhető	" "	v1	[{"lemma": "képzelhető", "tag": "[/Adj][Nom]", "morphana": "képzel[/V]=képzel+hető[_ModPtcp/Adj]=hető+[Nom]=", "readable": "képzel[/V] + hető[_ModPtcp/Adj] + [Nom]", "twolevel": "k:k é:é p:p z:z e:e l:l :[/V] h:h e:e t:t ő:ő :[_ModPtcp/Adj] :[Nom]"}, {"lemma": "képzelhető", "tag": "[/Adj][Nom]", "morphana": "képzelhető[/Adj]=képzelhető+

If there is any output, the line numbers refer to `pos_output.txt`.

Check whether tokens annotated as separated preverbs are also analysed by morph,pos as preverbs. If not (e.g. if the preverb _meg_ is tagged by emtsv as a `[/Conj]`), **remove this annotation** (or the whole item if no annotation left) from the test set because `preverb` will necessarily fail due to incorrect emtsv annotation, which is extraneous to its performance evaluation.

Exception: person-inflected preverb-like postpositions such as in `utánam\1 dobják|1`, which are tagged by emtsv as `[/Post]`, and case-inflected personal pronouns such as in `hozzá\1 voltam szokva|1`, which are tagged as `[/N|Pro]`, **should not be removed from the test set** since `preverb` should be able to handle these.

If a token is annotated as the verb stem counterpart of a separated preverb but is not tagged by emtsv as a verb, check whether the preverb annotation is correct. If it is, **do not remove this annotation** from the test set: `preverb` is supposed to be able to connect such separated preverbs.

After all real errors have been corrected, rerun everything from step 1.
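
The decision rules above can be summarised as a small lookup. The sketch below only encodes the prose of this step; `triage` is a hypothetical helper, the returned strings are illustrative, and the final decision still needs a human check of the annotation.

```python
def triage(testid, xpostag):
    """Suggest what to do with one annotated token, given its emtsv tag."""
    if testid.startswith('p'):
        if xpostag.startswith('[/Prev]'):
            return 'ok'
        # Exceptions: inflected postpositions and case-inflected pronouns stay
        if xpostag.startswith('[/Post]') or xpostag.startswith('[/N|Pro]'):
            return 'keep'
        return 'remove annotation'
    if testid.startswith('v'):
        if xpostag.startswith('[/V]'):
            return 'ok'
        return 'keep if annotation is correct'
    return 'ok'  # unannotated token

print(triage('p1', '[/Conj]'))      # preverb mis-tagged as a conjunction
print(triage('p1', '[/Post]'))      # e.g. utánam
print(triage('v1', '[/Adj][Nom]'))  # e.g. fogadható
```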

In [7]:
!cat pos_output.txt | python ../preverb > preverb_output.tsv

## Step 5: Verify diffs

NB: ONLY DO THIS ON THE VALIDATION DATASET.

In [8]:
def print_error(error_str, n, token):
    print(error_str, f"on token {n + 2}: {token.form}")

with open('preverb_output.tsv', encoding='utf-8') as infile:
    header = next(infile)
    Word.features = header.strip().split('\t')
    curr_sent = ''
    testid_to_previd = {}
    annotated_previds = {}
    irrelevant_previds = {}
    bad_flag = False
    for n, token in enumerate(infile):
        if token.strip() == '':
            if bad_flag:
                print(curr_sent, "\n")
            curr_sent = ''
            bad_flag = False
            testid_to_previd = {}
            annotated_previds = {}
            irrelevant_previds = {}
            continue
        token = Word(token.strip('\n').split('\t'))
        token_suffix = token.testid
        if token_suffix == '.':
            token_suffix = ''
        else:
            token_suffix = '_' + token_suffix
        curr_sent += token.form + token_suffix + token.wsafter.replace('"','')
        
        if token.testid == 'p0':
            if token.separated == 'conn':
                print_error("Incorrectly connected preverb", n, token)
            continue
        
        if token.testid[0] == 'p' and not token.separated == 'conn':
            print_error("Preverb mismatch", n, token)
            bad_flag = True
            continue
        elif token.testid[0] == 'v' and not token.separated == 'sep':
            print_error("Verb mismatch", n, token)
            bad_flag = True
            continue
        elif token.testid == '.' and token.previd != '':
            if token.previd in annotated_previds:
                print_error("Unannotated token connected to annotated", n, token)
                print(f"Incorrectly connected to {annotated_previds[token.previd]}")
                bad_flag = True
            else:
                irrelevant_previds[token.previd] = token.form
            continue
        elif token.previd == '':
            continue

        if token.previd in irrelevant_previds:
            print_error("Annotated token connected to unannotated", n, token)
            print(f"Incorrectly connected to {irrelevant_previds[token.previd]}")
            bad_flag = True
        elif token.testid[1] not in testid_to_previd:
            testid_to_previd[token.testid[1]] = token.previd
            annotated_previds[token.previd] = token.form
        elif testid_to_previd[token.testid[1]] != token.previd:
            print_error("Annotated token index mismatch", n, token)
            bad_flag = True

Unannotated token connected to annotated on token 21: van
Incorrectly connected to le
Verb mismatch on token 22: írva
S le_p1 is van írva_v1 .\n 

Preverb mismatch on token 294: el
Verb mismatch on token 299: mondani
Akkor Fehérlófia önkéntelenül magához ölelte És mintha nem is ő tenné megcsókolta Krisztina száját Sután ügyetlenül mert összekoccant a foguk Rögtön elengedte hátra_p1 is lépett_v1 tőle Zavarában el_p2 sem búcsúzott_v2 szaladt vissza_p3 se nézve_v3 Megcsókoltam ismételte ujjongva magában Megcsókoltam igen itt maradunk a hazában Együtt a hazában édes vagy édes édes\nEzt persze el_p4 lehetne nem-költői módon is mondani_v4 , de mennyivel nagyobb dolog, ha ez költői megfogalmazást kap.  

Preverb mismatch on token 382: Be
Verb mismatch on token 385: durrantani
Be_p1 lehetett volna durrantani_v1 a vendégek fogadására szánt szobába.\n 

Verb mismatch on token 585: Fölidézte
Preverb mismatch on token 603: föl
Verb mismatch on token 637: mondott
Preverb mismatch on token 639: el
P

## Step 6: Calculate evaluation metrics

In [31]:
# Lines marked "debug" are only needed for debugging / transparency and can be deleted once everything works.

preverb_count, true_pos, false_pos, true_neg, false_neg = 0, 0, 0, 0, 0

with open('validate_baseline_output.tsv', encoding='utf-8') as infile:
    header = next(infile)
    Word.features = header.strip().split('\t')

    curr_sent = '' # debug

    p_testids = {}
    v_testids = {}
    irrelevant_previds = []

    for n, token in enumerate(infile):
        if token.strip() == '':
            for p_testid, p_previd in p_testids.items():
                if (p_previd in irrelevant_previds or  # \i is connected to a non-annotated verb
                    p_testid not in v_testids or       # \i is not connected to any annotated verbs
                    v_testids[p_testid] != p_previd):  # \i is not connected to |i, but some other |j
                    false_pos += 1
                    print("False pos", p_testid) # debug
                else: 
                    true_pos += 1
                    print("True pos", p_testid) # debug
            print(curr_sent)  # debug
            print("P:", p_testids) # debug
            print("V:", v_testids) # debug
            print("Irrel:", irrelevant_previds) # debug
            curr_sent = '' # debug
            p_testids = {}
            v_testids = {}
            irrelevant_previds = []
            print() # debug
            continue
        
        token = Word(token.strip('\n').split('\t'))
        token_suffix = token.testid # debug
        if token_suffix == '.': # debug
            token_suffix = '' # debug
        else: # debug
            token_suffix = '_' + token_suffix # debug
        curr_sent += token.form + token_suffix + token.wsafter.replace('"','') # debug
        
        if token.testid == '.':
            if token.previd != '':
                irrelevant_previds.append(token.previd)
            continue

        if token.testid == 'p0':
            preverb_count += 1
            if token.separated == 'conn':
                false_pos += 1   # preverb annotated as \0 connected to anything
                print("False pos") # debug
            else:
                true_neg += 1    # not connected, OK
                print("True neg") # debug
            continue
        
        if token.testid[0] == 'p':
            preverb_count += 1
            if token.separated != 'conn':
                false_neg += 1   # preverb annotated as \1 etc. should have been connected
                print("False neg", token.testid[1:]) # debug
            elif token.testid[1:] in p_testids and p_testids[token.testid[1:]] != token.previd:
                false_pos += 1   # previd does not match preverb with the same annotation
                print("False pos", token.testid[1:]) # debug
            else:
                p_testids[token.testid[1:]] = token.previd
            continue

        if token.testid[0] == 'v' and token.separated == 'sep':
            v_testids[token.testid[1:]] = token.previd        

print(preverb_count) # debug
print([true_pos, false_pos, true_neg, false_neg]) # debug
assert preverb_count == sum([true_pos, false_pos, true_neg, false_neg]) # debug

True pos 1
True pos 2
Be_p1 fog következni_v1, mert be_p2 kell következzék_v2 az immár elkerülhetetlen totális civilizációs összeomlás.\n
P: {'1': '1', '2': '2'}
V: {'1': '1', '2': '2'}
Irrel: []

False pos 3
S le_p3 is van írva_v3.\n
P: {'3': '3'}
V: {}
Irrel: ['3']

True pos 4
Jaj, annyi a gondom, hogy ki_p4 sem látszom_v4 belőle.\n
P: {'4': '4'}
V: {'4': '4'}
Irrel: []

True pos 5
True pos 6
Idő kérdése, de törvényszerűen alakul_v5 majd ki_p5 végül az a bizonyos új istenű és új módon értelmes világ, aminek el_p6 kell jönnie_v6...\n
P: {'5': '5', '6': '6'}
V: {'5': '5', '6': '6'}
Irrel: []

True pos 7
Menne ő is szívesen (miként azt később meg_p7 is teszi_v7), de egyelőre még szerve zési feladatai vannak.\n
P: {'7': '7'}
V: {'7': '7'}
Irrel: []

True pos 8
A nézőt valahogy meg_p8 kell tartani_v8.\n
P: {'8': '8'}
V: {'8': '8'}
Irrel: []

True pos 9
Melyhez mintha már hozzá_p9 is szoktunk_v9 volna, mely nélkül mintha már létezni sem tudnánk.\n
P: {'9': '9'}
V: {'9': '9'}
Irrel: []

Tru

True pos 220
True pos 221
True pos 222
Az önkormányzat létrejöttétől eltelt időt azonban ki_p220 kell értékelni_v220 és meg_p221 kell nézni_v221, hogy az a törvény, amely jelenleg szabályozza az önkormányzatok működését, mennyiben állja ki a gyakorlat próbáját, és meg_p222 kell találni_v222 azokat a pontokat, ahol kisebb-nagyobb változásokkal a rendszer működtethető.\n
P: {'220': '221', '221': '222', '222': '224'}
V: {'220': '221', '221': '222', '222': '224'}
Irrel: ['223', '223']

False neg 223
Ugyanakkor rá_p223 kell arra mutatnom_v223, hogy - a háromoldalú tárgyalásokat is beleszámítva, amelyek közismerten 1989-ben történtek - most már nyolc éve folyik az egész jogrendünk átalakítása, és bizony megítélésem szerint ennek a nyolc évnek bőségesen elegendőnek kellett volna lennie ahhoz, hogy egy új büntetőkódex kerüljön megalkotásra.\n
P: {}
V: {}
Irrel: []

True pos 224
True pos 225
Emberarcúvá kell formálni s meg_p224 kell nemesíteni_v224 az ember tágabb környezetét, és meg_p225 kell 

In [10]:
# precision:
precision = true_pos / (true_pos + false_pos)
print("Precision: %.4f"%precision)

# recall:
recall = true_pos / (true_pos + false_neg)
print("Recall: %.4f"%recall)

# F1
print("F1: %.4f"%(2 * precision * recall / (precision + recall)))

# accuracy:
print("Accuracy: %.4f"%((true_pos + true_neg) / preverb_count))

Precision: 0.9730
Recall: 0.8072
F1: 0.8824
Accuracy: 0.7957
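
The sentence-final decision in the Step 6 loop can be isolated for inspection. The following is a sketch with a hypothetical function name, `classify_sentence`, fed with the `P`, `V` and `Irrel` values printed in the debug output above.

```python
def classify_sentence(p_testids, v_testids, irrelevant_previds):
    """Map each connected annotated preverb ID to 'true_pos' or 'false_pos'."""
    outcome = {}
    for p_testid, p_previd in p_testids.items():
        if (p_previd in irrelevant_previds        # linked to an unannotated token
                or p_testid not in v_testids      # annotated verb was not linked
                or v_testids[p_testid] != p_previd):  # linked to a different verb
            outcome[p_testid] = 'false_pos'
        else:
            outcome[p_testid] = 'true_pos'
    return outcome

# "Be_p1 fog következni_v1, mert be_p2 kell következzék_v2 ..."
print(classify_sentence({'1': '1', '2': '2'}, {'1': '1', '2': '2'}, []))
# "S le_p3 is van írva_v3."
print(classify_sentence({'3': '3'}, {}, ['3']))
```

In the second sentence the preverb was connected to the unannotated `van` instead of `írva`, so it counts as a false positive.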


## Step 7: Evaluate the baseline algorithms

In [11]:
!cat preverb_output.tsv | python old_connect_prev.py > baseline_output.tsv

In [30]:
# Basically same procedure as above, debug lines removed, wrapped in a function,
# abstraction over target field name

PREVERB_COUNT, TRUE_POS, FALSE_POS, TRUE_NEG, FALSE_NEG = 0, 1, 2, 3, 4

def evaluate_target_column(input_file_name, target_col_name):
    preverb_count, true_pos, false_pos, true_neg, false_neg = 0, 0, 0, 0, 0

    with open(input_file_name, encoding='utf-8') as infile:
        header = next(infile)
        Word.features = header.strip().split('\t')

        p_testids = {}
        v_testids = {}
        irrelevant_previds = []

        for n, token in enumerate(infile):
            if token.strip() == '':
                for p_testid, p_target_id in p_testids.items():
                    if (p_target_id in irrelevant_previds or  # \i is connected to a non-annotated verb
                        p_testid not in v_testids or          # \i is not connected to any annotated verbs
                        v_testids[p_testid] != p_target_id):  # \i is not connected to |i, but some other |j
                        false_pos += 1
                    else: 
                        true_pos += 1
                p_testids = {}
                v_testids = {}
                irrelevant_previds = []
                continue

            token = Word(token.strip('\n').split('\t'))
            target_id = getattr(token, target_col_name)
            
            if token.testid == '.':
                if target_id not in ('.', ''):
                    irrelevant_previds.append(target_id)
                continue

            if token.testid == 'p0':
                preverb_count += 1
                if target_id in ('.',''):
                    true_neg += 1    # not connected, OK
                else:
                    false_pos += 1   # preverb annotated as \0 connected to anything
                continue

            if token.testid[0] == 'p':
                preverb_count += 1
                if target_id in ('.',''):
                    false_neg += 1   # preverb annotated as \1 etc. should have been connected
                elif token.testid[1:] in p_testids and p_testids[token.testid[1:]] != target_id:
                    false_pos += 1   # previd does not match preverb with the same annotation
                else:
                    p_testids[token.testid[1:]] = target_id
                continue

            if token.testid[0] == 'v' and target_id not in ('.',''):
                v_testids[token.testid[1:]] = target_id
    
    return (preverb_count, true_pos, false_pos, true_neg, false_neg)

def calculate_metrics(results):
    # precision:
    precision = results[TRUE_POS] / (results[TRUE_POS] + results[FALSE_POS])
    print("Precision: %.4f"%precision)

    # recall:
    recall = results[TRUE_POS] / (results[TRUE_POS] + results[FALSE_NEG])
    print("Recall: %.4f"%recall)

    # F1
    print("F1: %.4f"%(2 * precision * recall / (precision + recall)))

    # accuracy:
    print("Accuracy: %.4f"%((results[TRUE_POS] + results[TRUE_NEG]) / results[PREVERB_COUNT]))

print("Evaluate old connect_prev baseline")

results = evaluate_target_column('baseline_output.tsv', 'prevold')
print(results)
assert results[PREVERB_COUNT] == sum([results[TRUE_POS], 
                                      results[FALSE_POS],
                                      results[TRUE_NEG],
                                      results[FALSE_NEG]]) # debug
calculate_metrics(results)
print()

print("Evaluate connect closest baseline")
results = evaluate_target_column('baseline_output.tsv', 'prevclosest')
print(results)
assert results[PREVERB_COUNT] == sum([results[TRUE_POS], 
                                      results[FALSE_POS],
                                      results[TRUE_NEG],
                                      results[FALSE_NEG]]) # debug
calculate_metrics(results)
print()


Evaluate old connect_prev baseline
(235, 185, 9, 7, 34)
Precision: 0.9536
Recall: 0.8447
F1: 0.8959
Accuracy: 0.8170

Evaluate connect closest baseline
(235, 172, 55, 2, 6)
Precision: 0.7577
Recall: 0.9663
F1: 0.8494
Accuracy: 0.7404
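
As a cross-check of the arithmetic, the metrics can be recomputed directly from a result tuple. This is a minimal sketch; `metrics` is a hypothetical variant of `calculate_metrics` that returns the values instead of printing them. The counts are the ones printed above for the old connect_prev baseline.

```python
def metrics(preverb_count, true_pos, false_pos, true_neg, false_neg):
    """Return precision, recall, F1 and accuracy as a dict."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (true_pos + true_neg) / preverb_count
    return {'precision': precision, 'recall': recall,
            'f1': f1, 'accuracy': accuracy}

m = metrics(235, 185, 9, 7, 34)
for name, value in m.items():
    print(f'{name}: {value:.4f}')
```

Note that accuracy is computed over all annotated preverbs, so it also rewards preverbs with the zero ID that were correctly left unconnected.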



## Step 8: Evaluate Stanza preverb connections

In [14]:
!cat validate.txt | python reformat_for_tok.py | docker run -i mtaril/emtsv emstanza-tok > stanzatok_output.txt
!cat stanzatok_output.txt | python label_to_column.py | \
    docker run -i mtaril/emtsv emstanza-lem | \
    docker run -i mtaril/emtsv emstanza-parse > stanzaparse_output.txt

In [15]:
from more_itertools import split_at

with open('stanzaparse_output.txt', encoding='utf-8') as infile:
    with open('stanzaprev.txt', 'w', encoding='utf-8') as outfile:
        header = next(infile)
        Word.features = header.strip().split('\t') + ['stanzaid']
        outfile.write(Word.header() + '\n')
        lines = infile.read().split('\n')
        sentences = list(split_at(lines, lambda x: x == ''))
        for s in sentences:
            words = [Word(token.split('\t') + ['.']) for token in s]
            for w in words:
                if w.deprel == 'compound:preverb':
                    w.stanzaid = w.head
                    words[int(w.head) - 1].stanzaid = w.head
            for w in words:
                outfile.write(str(w) + '\n')
            outfile.write('\n')
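
The mapping from dependency relations to the `stanzaid` column can be tried on a toy sentence. The sketch below uses plain dicts instead of `Word` objects, and the three-token parse is hand-written for illustration, not actual parser output.

```python
def mark_preverb_links(sentence):
    """Return one ID per token: the head index for linked tokens, '.' otherwise.

    `sentence` is a list of dicts with 'form', 'head' (1-based, as a string)
    and 'deprel' keys.
    """
    ids = ['.'] * len(sentence)
    for i, tok in enumerate(sentence):
        if tok['deprel'] == 'compound:preverb':
            ids[i] = tok['head']                   # mark the preverb
            ids[int(tok['head']) - 1] = tok['head']  # mark its head verb
    return ids

# Hand-made example parse (assumption, for illustration only)
toy = [
    {'form': 'főzte', 'head': '0', 'deprel': 'root'},
    {'form': 'volna', 'head': '1', 'deprel': 'aux'},
    {'form': 'meg', 'head': '1', 'deprel': 'compound:preverb'},
]
print(mark_preverb_links(toy))
```

Both the preverb and its head receive the head's index, so `evaluate_target_column` can treat `stanzaid` like `previd`.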

In [32]:
results = evaluate_target_column("stanzaprev.txt", 'stanzaid')
print(results)
assert results[PREVERB_COUNT] == sum([results[TRUE_POS], 
                                      results[FALSE_POS],
                                      results[TRUE_NEG],
                                      results[FALSE_NEG]]) # debug
calculate_metrics(results)
print()

(235, 180, 25, 1, 29)
Precision: 0.8780
Recall: 0.8612
F1: 0.8696
Accuracy: 0.7702



## Step 9: Generalising evaluation workflow

In [36]:
test_set_name = 'validate'
test_set_raw = test_set_name + '.txt'
test_set_reformatted = test_set_name + '_reformat.txt'
test_set_emtok_output = test_set_name + '_emtok_output.tsv'
test_set_empos_output = test_set_name + '_empos_output.tsv'
test_set_preverb_output = test_set_name + '_preverb_output.tsv'
test_set_baseline_output = test_set_name + '_baseline_output.tsv'

test_set_stanza_tok_output = test_set_name + '_stanzatok_output.tsv'
test_set_stanza_parse_output = test_set_name + '_stanzaparse_output.tsv'
test_set_stanza_ids = test_set_name + '_stanzaids.tsv'

In [37]:
!cat {test_set_raw} | python reformat_for_tok.py > {test_set_reformatted}
!cat {test_set_reformatted} | docker run -i mtaril/emtsv tok > {test_set_emtok_output}
!cat {test_set_emtok_output} | python label_to_column.py | docker run -i mtaril/emtsv morph,pos > {test_set_empos_output}
!cat {test_set_empos_output} | python ../preverb > {test_set_preverb_output}
!cat {test_set_preverb_output} | python old_connect_prev.py > {test_set_baseline_output}

In [38]:
!cat {test_set_reformatted} | docker run -i mtaril/emtsv emstanza-tok > {test_set_stanza_tok_output}
!cat {test_set_stanza_tok_output} | python label_to_column.py | \
    docker run -i mtaril/emtsv emstanza-lem | \
    docker run -i mtaril/emtsv emstanza-parse > {test_set_stanza_parse_output}
!cat {test_set_stanza_parse_output} | python add_stanzaid.py > {test_set_stanza_ids}

In [39]:
print("emPreverb results")
results = evaluate_target_column(test_set_baseline_output, 'previd')
calculate_metrics(results)
print()

print("old connect_prev (max2) results")
results = evaluate_target_column(test_set_baseline_output, 'prevold')
calculate_metrics(results)
print()

print("closest verb baseline results")
results = evaluate_target_column(test_set_baseline_output, 'prevclosest')
calculate_metrics(results)
print()

print("stanza results")
results = evaluate_target_column(test_set_stanza_ids, 'stanzaid')
calculate_metrics(results)
print()

emPreverb results
Precision: 0.9728
Recall: 0.8027
F1: 0.8796
Accuracy: 0.7915

old connect_prev (max2) results
Precision: 0.9534
Recall: 0.8402
F1: 0.8932
Accuracy: 0.8128

closest verb baseline results
Precision: 0.7689
Recall: 0.9505
F1: 0.8501
Accuracy: 0.7404

stanza results
Precision: 0.8780
Recall: 0.8612
F1: 0.8696
Accuracy: 0.7702
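
The file-name bookkeeping of this step depends only on the test set name, so it could be generated by a function. The following is a sketch of one possible refactoring; `pipeline_files` and its dictionary keys are hypothetical, while the suffixes are copied from the variables defined above.

```python
def pipeline_files(test_set_name):
    """Return all intermediate file names for one test set."""
    suffixes = {
        'raw': '.txt',
        'reformatted': '_reformat.txt',
        'emtok_output': '_emtok_output.tsv',
        'empos_output': '_empos_output.tsv',
        'preverb_output': '_preverb_output.tsv',
        'baseline_output': '_baseline_output.tsv',
        'stanza_tok_output': '_stanzatok_output.tsv',
        'stanza_parse_output': '_stanzaparse_output.tsv',
        'stanza_ids': '_stanzaids.tsv',
    }
    return {key: test_set_name + suffix for key, suffix in suffixes.items()}

files = pipeline_files('validate')
print(files['raw'])
print(files['baseline_output'])
```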



## Step 10: Evaluation on general (easy) dataset

In [40]:
test_set_name = 'gold500_test'
test_set_raw = test_set_name + '.txt'
test_set_reformatted = test_set_name + '_reformat.txt'
test_set_emtok_output = test_set_name + '_emtok_output.tsv'
test_set_empos_output = test_set_name + '_empos_output.tsv'
test_set_preverb_output = test_set_name + '_preverb_output.tsv'
test_set_baseline_output = test_set_name + '_baseline_output.tsv'

test_set_stanza_tok_output = test_set_name + '_stanzatok_output.tsv'
test_set_stanza_parse_output = test_set_name + '_stanzaparse_output.tsv'
test_set_stanza_ids = test_set_name + '_stanzaids.tsv'

!cat {test_set_raw} | python reformat_for_tok.py > {test_set_reformatted}
!cat {test_set_reformatted} | docker run -i mtaril/emtsv tok > {test_set_emtok_output}
!cat {test_set_emtok_output} | python label_to_column.py | docker run -i mtaril/emtsv morph,pos > {test_set_empos_output}
!cat {test_set_empos_output} | python ../preverb > {test_set_preverb_output}
!cat {test_set_preverb_output} | python old_connect_prev.py > {test_set_baseline_output}

!cat {test_set_reformatted} | docker run -i mtaril/emtsv emstanza-tok > {test_set_stanza_tok_output}
!cat {test_set_stanza_tok_output} | python label_to_column.py | \
    docker run -i mtaril/emtsv emstanza-lem | \
    docker run -i mtaril/emtsv emstanza-parse > {test_set_stanza_parse_output}
!cat {test_set_stanza_parse_output} | python add_stanzaid.py > {test_set_stanza_ids}

In [41]:
print("emPreverb results")
results = evaluate_target_column(test_set_baseline_output, 'previd')
calculate_metrics(results)
print()

print("old connect_prev (max2) results")
results = evaluate_target_column(test_set_baseline_output, 'prevold')
calculate_metrics(results)
print()

print("closest verb baseline results")
results = evaluate_target_column(test_set_baseline_output, 'prevclosest')
calculate_metrics(results)
print()

print("stanza results")
results = evaluate_target_column(test_set_stanza_ids, 'stanzaid')
calculate_metrics(results)
print()

emPreverb results
Precision: 0.9979
Recall: 0.9539
F1: 0.9754
Accuracy: 0.9520

old connect_prev (max2) results
Precision: 0.9794
Recall: 0.9694
F1: 0.9744
Accuracy: 0.9500

closest verb baseline results
Precision: 0.9656
Recall: 0.9876
F1: 0.9765
Accuracy: 0.9540

stanza results
Precision: 0.9604
Recall: 0.9046
F1: 0.9316
Accuracy: 0.8720

