I routinely use git diff to check whether the most recent change in processing had the intended impact on the data. Recently I noticed a number of spurious mention head changes after every change I made, and they had nothing obvious to do with what I had changed in the conversion code. So I added corefud.MoveHead to the scenario, but the problem is still there.
To test it, I ran the same scenario on the same input (cs_pcedt-ud-dev.conllu) three times in a row. The output of the first two runs was identical (showing only the second one below) but the output of the third one was different.
[20:09:20]sol5:/net/work/people/zeman/hamledt/normalize/cs-pcedt(master)> udapy -s read.OldCorefUD corefud.FixInterleaved corefud.MergeSameSpan corefud.MoveHead < cs_pcedt-ud-dev.conllu > /net/work/people/zeman/unidep/UD_Czech-PCEDT/cs_pcedt-ud-dev.conllu
2022-02-08 20:10:22,165 [ INFO] execute - ---- ROUND ----
2022-02-08 20:10:22,165 [ INFO] execute - Executing block read.OldCorefUD
2022-02-08 20:10:26,406 [ INFO] execute - Executing block corefud.FixInterleaved
2022-02-08 20:10:26,656 [ INFO] execute - Executing block corefud.MergeSameSpan
2022-02-08 20:10:26,872 [ INFO] execute - Executing block corefud.MoveHead
2022-02-08 20:10:26,994 [ INFO] execute - Executing block write.Conllu
2022-02-08 20:10:30,075 [ INFO] process_end - corefud.MoveHead overview of mentions:
2022-02-08 20:10:30,076 [ INFO] process_end - total = 24968 (100.0%)
2022-02-08 20:10:30,076 [ INFO] process_end - single-word = 12451 ( 49.9%)
2022-02-08 20:10:30,076 [ INFO] process_end - treelet = 10119 ( 40.5%)
2022-02-08 20:10:30,076 [ INFO] process_end - treelet-kept = 9916 ( 39.7%)
2022-02-08 20:10:30,076 [ INFO] process_end - nontreelet = 2059 ( 8.2%)
2022-02-08 20:10:30,076 [ INFO] process_end - nontreelet-kept = 1697 ( 6.8%)
2022-02-08 20:10:30,076 [ INFO] process_end - nontreelet-moved = 362 ( 1.4%)
2022-02-08 20:10:30,076 [ INFO] process_end - gappy = 339 ( 1.4%)
2022-02-08 20:10:30,076 [ INFO] process_end - gappy-moved = 267 ( 1.1%)
2022-02-08 20:10:30,076 [ INFO] process_end - treelet-moved = 203 ( 0.8%)
2022-02-08 20:10:30,076 [ INFO] process_end - gappy-kept = 72 ( 0.3%)

[20:10:30]sol5:/net/work/people/zeman/hamledt/normalize/cs-pcedt(master)> udapy -s read.OldCorefUD corefud.FixInterleaved corefud.MergeSameSpan corefud.MoveHead < cs_pcedt-ud-dev.conllu > /net/work/people/zeman/unidep/UD_Czech-PCEDT/cs_pcedt-ud-dev.conllu
2022-02-08 20:11:38,993 [ INFO] execute - ---- ROUND ----
2022-02-08 20:11:38,993 [ INFO] execute - Executing block read.OldCorefUD
2022-02-08 20:11:43,347 [ INFO] execute - Executing block corefud.FixInterleaved
2022-02-08 20:11:43,557 [ INFO] execute - Executing block corefud.MergeSameSpan
2022-02-08 20:11:43,769 [ INFO] execute - Executing block corefud.MoveHead
2022-02-08 20:11:43,914 [ INFO] execute - Executing block write.Conllu
2022-02-08 20:11:47,032 [ INFO] process_end - corefud.MoveHead overview of mentions:
2022-02-08 20:11:47,032 [ INFO] process_end - total = 24968 (100.0%)
2022-02-08 20:11:47,032 [ INFO] process_end - single-word = 12451 ( 49.9%)
2022-02-08 20:11:47,032 [ INFO] process_end - treelet = 10119 ( 40.5%)
2022-02-08 20:11:47,032 [ INFO] process_end - treelet-kept = 9916 ( 39.7%)
2022-02-08 20:11:47,032 [ INFO] process_end - nontreelet = 2059 ( 8.2%)
2022-02-08 20:11:47,032 [ INFO] process_end - nontreelet-kept = 1697 ( 6.8%)
2022-02-08 20:11:47,032 [ INFO] process_end - nontreelet-moved = 362 ( 1.4%)
2022-02-08 20:11:47,032 [ INFO] process_end - gappy = 339 ( 1.4%)
2022-02-08 20:11:47,033 [ INFO] process_end - gappy-moved = 268 ( 1.1%)
2022-02-08 20:11:47,033 [ INFO] process_end - treelet-moved = 203 ( 0.8%)
2022-02-08 20:11:47,033 [ INFO] process_end - gappy-kept = 71 ( 0.3%)
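The repeated-run check described above can be automated. The following is only a minimal sketch, not part of Udapi: the helper name `output_digests` is hypothetical, and it assumes the scenario reads its input from stdin and writes its result to stdout, as in the commands shown. It runs the same command several times and compares SHA-256 digests of the outputs.

```python
import hashlib
import subprocess


def output_digests(cmd, input_bytes, runs=3):
    """Run `cmd` `runs` times on the same stdin; return the SHA-256 of each stdout."""
    digests = []
    for _ in range(runs):
        # stderr (where the log messages go) is left untouched; only stdout is hashed.
        result = subprocess.run(
            cmd, input=input_bytes, stdout=subprocess.PIPE, check=True
        )
        digests.append(hashlib.sha256(result.stdout).hexdigest())
    return digests


# Example usage (not executed here; command taken from the runs above):
#
#   with open("cs_pcedt-ud-dev.conllu", "rb") as f:
#       data = f.read()
#   cmd = ["udapy", "-s", "read.OldCorefUD", "corefud.FixInterleaved",
#          "corefud.MergeSameSpan", "corefud.MoveHead"]
#   digests = output_digests(cmd, data, runs=3)
#   print("deterministic" if len(set(digests)) == 1 else "outputs differ")
```

Comparing digests rather than relying on git diff keeps the check independent of whatever is currently committed in the target repository.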
And the git diffs on the result (there was no commit in the meantime, so both diffs are against the same base):
[20:09:34]zen:/net/work/people/zeman/unidep/UD_Czech-PCEDT(dev *)> git diff cs_pcedt-ud-dev.conllu
diff --git a/cs_pcedt-ud-dev.conllu b/cs_pcedt-ud-dev.conllu
index 01e7a91..7607889 100644
--- a/cs_pcedt-ud-dev.conllu
+++ b/cs_pcedt-ud-dev.conllu
@@ -75070,9 +75070,9 @@
 # orig_file_sentence wsj0118#131
 1 To ten DET PDNS1---------- Case=Nom|Gender=Neut|Number=Sing|PronType=Dem 2 nsubj 2:nsubj Entity=(wsj0118001c173--1-gstype:spec)|MentionHead=1|MentionText=To
 1.1 on #PersPron PRON _ Case=Nom|Number=Sing|Person=3|PronType=Prs _ _ 2:nsubj Entity=(wsj0118001c173--1-gstype:spec)|Functor=ACT|MentionHead=1.1|MentionText=on
-1.2 někoho #PersPron PRON _ Case=Acc|PronType=Prs _ _ 2:obj Entity=(wsj0118001c174[1/3]--4-gstype:spec)|Functor=PAT|MentionHead=1.2,12,17|MentionText=někoho mnohem menší, než o jaký usiluje většina tradičních sběračů akcií jako hlavní cíl své práce
+1.2 někoho #PersPron PRON _ Case=Acc|PronType=Prs _ _ 2:obj Entity=(wsj0118001c174[1/3]--1-gstype:spec)|Functor=PAT|MentionHead=1.2,12,17|MentionText=někoho mnohem menší, než o jaký usiluje většina tradičních sběračů akcií jako hlavní cíl své práce
 2 znamená znamenat VERB VB-S---3P-AAI-- Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act 0 root 0:root _
-3 velmi velmi ADV Db------------- _ 4 advmod 4:advmod Entity=(wsj0118001c174[2/3]--4-gstype:spec
+3 velmi velmi ADV Db------------- _ 4 advmod 4:advmod Entity=(wsj0118001c174[2/3]--1-gstype:spec
 4 malý malý ADJ AAIS4----1A---- Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing|Polarity=Pos 5 amod 5:amod _
 5 zisk zisk NOUN NNIS4-----A---- Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos 2 obj 2:obj MentionHead=5|MentionText=velmi malý zisk "navíc
 6 " " PUNCT Z:------------- _ 5 punct 5:punct SpaceAfter=No
@@ -75080,7 +75080,7 @@
 8 " " PUNCT Z:------------- _ 5 punct 5:punct SpaceAfter=No
 9 , , PUNCT Z:------------- _ 2 punct 2:punct _
 10 bezesporu bezesporu PART TT------------- _ 12 advmod 12:advmod _
-11 mnohem mnohem ADV Db------------- _ 12 advmod 12:advmod Entity=(wsj0118001c174[3/3]--4-gstype:spec
+11 mnohem mnohem ADV Db------------- _ 12 advmod 12:advmod Entity=(wsj0118001c174[3/3]--1-gstype:spec
 12 menší malý ADJ AAIS4----2A---- Animacy=Inan|Case=Acc|Degree=Cmp|Gender=Masc|Number=Sing|Polarity=Pos 2 dep 2:dep SpaceAfter=No
 13 , , PUNCT Z:------------- _ 17 punct 17:punct _
 14 než než SCONJ J,------------- _ 17 mark 17:mark LId=než-2

[20:11:04]zen:/net/work/people/zeman/unidep/UD_Czech-PCEDT(dev *)> git diff cs_pcedt-ud-dev.conllu
diff --git a/cs_pcedt-ud-dev.conllu b/cs_pcedt-ud-dev.conllu
index 01e7a91..c1ad519 100644
--- a/cs_pcedt-ud-dev.conllu
+++ b/cs_pcedt-ud-dev.conllu
@@ -36181,7 +36181,7 @@
 # sent_id = wsj0071-001-p1s30
 # text = Některá mladší vína, dokonce i ta za 90 až 100 dolarů za láhev, jsou téměř zadarmo."
 # orig_file_sentence wsj0071#31
-1 Některá některý DET PZNP1---------- Case=Nom|Gender=Neut|Number=Plur|PronType=Ind 3 det 3:det Entity=(wsj0071001c30--7-gstype:spec
+1 Některá některý DET PZNP1---------- Case=Nom|Gender=Neut|Number=Plur|PronType=Ind 3 det 3:det Entity=(wsj0071001c30--3-gstype:spec
 2 mladší mladý ADJ AANP1----2A---- Case=Nom|Degree=Cmp|Gender=Neut|Number=Plur|Polarity=Pos 3 amod 3:amod _
 3 vína víno NOUN NNNP1-----A---- Case=Nom|Gender=Neut|Number=Plur|Polarity=Pos 16 nsubj 16:nsubj MentionHead=3,5,6|MentionText=Některá mladší vína, dokonce i|SpaceAfter=No
 4 , , PUNCT Z:------------- _ 3 punct 3:punct _
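In the changed lines above, the only difference is the number following the double hyphen inside Entity (4 vs. 1 in the first diff, 7 vs. 3 in the second), which matches the mention head changes described at the top. To isolate such differences without reading whole lines, something like the following could help. It is a rough sketch: the helper `misc_changes` is hypothetical, and it assumes tab-separated CoNLL-U token lines with MISC in the tenth column and attributes separated by `|`.

```python
def misc_changes(old_line, new_line):
    """Return (key, old_value, new_value) for MISC attributes that differ
    between two tab-separated CoNLL-U token lines."""

    def parse_misc(line):
        misc = line.rstrip("\n").split("\t")[9]  # MISC is the 10th column
        attrs = {}
        if misc != "_":
            for attr in misc.split("|"):
                # Split on the first "=" only; values may contain "=" themselves.
                key, _, value = attr.partition("=")
                attrs[key] = value
        return attrs

    old, new = parse_misc(old_line), parse_misc(new_line)
    return [
        (key, old.get(key), new.get(key))
        for key in sorted(old.keys() | new.keys())
        if old.get(key) != new.get(key)
    ]


# Token 3 from the first diff, rebuilt with tabs for illustration.
columns = ["3", "velmi", "velmi", "ADV", "Db-------------", "_", "4", "advmod", "4:advmod"]
old_line = "\t".join(columns + ["Entity=(wsj0118001c174[2/3]--4-gstype:spec"])
new_line = "\t".join(columns + ["Entity=(wsj0118001c174[2/3]--1-gstype:spec"])
for key, before, after in misc_changes(old_line, new_line):
    print(key, before, "->", after)
```

Applied to each removed/added pair of a diff, this reduces the output to the Entity values alone, which makes it easier to see which mentions flip between runs.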