
mwt training fails in evaluating dev set, but the dev set passes validation #1167

Closed
toufiglu opened this issue Dec 19, 2022 · 5 comments

@toufiglu

Hi, I am training Stanza on the Arabic PADT treebank. While training the MWT expander, evaluation on the dev set failed and I got the following error.

2022-12-19 17:16:31 INFO: Training dictionary-based MWT expander...
2022-12-19 17:16:32 INFO: Evaluating on dev set...
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/yiminglu/Desktop/week_10/corpus/new_stanza/stanza-dev/stanza/utils/training/run_mwt.py", line 113, in <module>
    main()
  File "/Users/yiminglu/Desktop/week_10/corpus/new_stanza/stanza-dev/stanza/utils/training/run_mwt.py", line 110, in main
    common.main(run_treebank, "mwt", "mwt_expander")
  File "/Users/yiminglu/Desktop/week_10/corpus/new_stanza/stanza-dev/stanza/utils/training/common.py", line 274, in main
    run_treebank(mode, paths, treebank, short_name,
  File "/Users/yiminglu/Desktop/week_10/corpus/new_stanza/stanza-dev/stanza/utils/training/run_mwt.py", line 79, in run_treebank
    mwt_expander.main(train_args)
  File "/Users/yiminglu/Desktop/week_10/corpus/new_stanza/stanza-dev/stanza/models/mwt_expander.py", line 94, in main
    train(args)
  File "/Users/yiminglu/Desktop/week_10/corpus/new_stanza/stanza-dev/stanza/models/mwt_expander.py", line 135, in train
    _, _, dev_f = scorer.score(system_pred_file, gold_file)
  File "/Users/yiminglu/Desktop/week_10/corpus/new_stanza/stanza-dev/stanza/models/mwt/scorer.py", line 8, in score
    evaluation = ud_scores(gold_conllu_file, system_conllu_file)
  File "/Users/yiminglu/Desktop/week_10/corpus/new_stanza/stanza-dev/stanza/models/common/utils.py", line 127, in ud_scores
    system_ud = ud_eval.load_conllu_file(system_conllu_file)
  File "/Users/yiminglu/Desktop/week_10/corpus/new_stanza/stanza-dev/stanza/utils/conll18_ud_eval.py", line 656, in load_conllu_file
    return load_conllu(_file,treebank_type)
  File "/Users/yiminglu/Desktop/week_10/corpus/new_stanza/stanza-dev/stanza/utils/conll18_ud_eval.py", line 256, in load_conllu
    parent = ud.words[sentence_start + hd -1] if hd else hd  # just assign '0' to parent for root cases
IndexError: list index out of range

However, the dev set passes the validation script (validate.py) from the UD tools repository on GitHub.

(env) (base) Toufig-Lu:tools-master yiminglu$ python validate.py --lang ar --level 2 ar_padt-ud-dev.conllu 
*** PASSED ***

I am also training other treebanks, and none of them has run into this problem. Is there anything I can do to fix it? Thank you.

@AngledLuffa
Collaborator

The dev set might pass validation, but honestly I don't see how. There is a dependency pointing to word 63, whereas the sentence # sent_id = afp.20000715.0001:p2u1 only has 62 words.
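
For reference, a small standalone check will flag mismatches like this one. This is a hypothetical sketch, not part of Stanza or the UD tools, and the script and function names are made up:

# check_heads.py -- hypothetical sanity check: flag any HEAD value that
# points past the last word of its sentence in a CoNLL-U file.
import sys

def check_heads(path):
    with open(path, encoding="utf-8") as f:
        words = []      # (file line number, columns) for word lines of the current sentence
        sent_id = None
        for line_no, line in enumerate(f, 1):
            line = line.rstrip("\n")
            if line.startswith("# sent_id"):
                sent_id = line.split("=", 1)[-1].strip()
            elif line == "":
                n = len(words)
                for word_line, cols in words:
                    head = cols[6]
                    if head.isdigit() and int(head) > n:
                        print(f"{sent_id}: line {word_line}: HEAD={head}, but the sentence has only {n} words")
                words = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                # keep only plain word lines; skip MWT ranges like 5-6 and empty nodes like 7.1
                if len(cols) >= 7 and cols[0].isdigit():
                    words.append((line_no, cols))

if __name__ == "__main__":
    check_heads(sys.argv[1])

Running it on the file that fails to load, e.g. python check_heads.py ar_padt-ud-dev.conllu, prints the sent_id and line number of any offending word.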

@AngledLuffa
Collaborator

The original dataset actually has 65 words in that sentence, so clearly we are transcribing something wrong as part of the MWT process. I will figure it out later today.

@toufiglu
Author

Hi! Thanks so much for your help. I am a beginner in this line of work. May I ask: how can we check which sentence has the error when the traceback does not indicate it? I often find tracebacks reporting a particular error, but I can't tell which instance caused it.

@AngledLuffa
Collaborator

If you update to the latest dev branch, it should be fixed:

9c39636

In terms of debugging this particular problem, what I did was change the existing script to output the line number when there was an exception, and that made it pretty clear what happened. The eval script is from a different repo, though, and I'm not sure they'll want that particular edit made permanent.
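
For anyone wanting to apply the same trick elsewhere, a minimal sketch of the pattern (not the actual change to conll18_ud_eval.py; parse_file and process_line are made-up names) is to track the line number while reading the file and attach it when re-raising:

# Sketch only: wrap per-line parsing so any exception reports the offending line.
def parse_file(path, process_line):
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            try:
                process_line(line.rstrip("\n"))
            except Exception as err:
                # re-raise with the file name, line number, and raw text attached
                raise RuntimeError(f"{path}, line {line_no}: {line.rstrip()!r}") from err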

@toufiglu
Author

Thanks so much, John!
