Cannot Download wmt21 en2zh test data #116

Open
Tracked by #112
Pzzzzz5142 opened this issue Jun 7, 2022 · 5 comments

Comments

@Pzzzzz5142

Here is my mtdata.recipes.wmt22-constrained.yaml config:

- id: wmt22-zhen-t
  langs: zho-eng
  desc: WMT 22 General MT
  url: https://www.statmt.org/wmt22/translation-task.html
  dev:
  test:
    - Statmt-newstest_enzh-2021-eng-zho
  train:

When I download the test set using the following command,

mtdata get-recipe -ri wmt22-zhen-t -o .

it raises an error. Here is the error log:

2022-06-07 15:19:36 data.add_parts_sequential:329 ERROR:: Unable to add Statmt-newstest_enzh-2021-eng-zho: /Users/pzzzzz/.mtdata/data.statmt.org/1df0/c1646dcf67bf017db12b47b5c987/wmt21tests.tgz-extracted/test/newstest2021.en-zh.xml has unequal number of segs: 1845 == 2847?

It seems that the 2021 en2zh test set has multiple ref sentences per src sentence, so the number of ref segs no longer matches the number of src segs and the assert statement raises the error above.

The code causing this issue is at sgm.py line 79:

srcs = list(xpath_all(tree.getroot(), xpath=".//src//seg"))
tgts = list(xpath_all(tree.getroot(), xpath=".//ref//seg"))
assert len(srcs) == len(tgts), f'{data} has unequal number of segs: {len(srcs)} == {len(tgts)}?'
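
For illustration only, here is a minimal sketch of one way to avoid the count mismatch: pair segs by id within each doc and read a single translator at a time. This is not mtdata's code. The XML layout (one src and several ref elements with a translator attribute per doc) is assumed from the file discussed here, and SAMPLE and read_pairs are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a WMT21-style test file. The layout is an assumption:
# each <doc> has one <src> and zero or more <ref translator="...">.
SAMPLE = """
<dataset>
  <doc id="d1">
    <src lang="en"><p><seg id="1">src-1</seg><seg id="2">src-2</seg></p></src>
    <ref lang="zh" translator="A"><p><seg id="1">A-1</seg><seg id="2">A-2</seg></p></ref>
    <ref lang="zh" translator="B"><p><seg id="1">B-1</seg></p></ref>
  </doc>
</dataset>
"""


def read_pairs(root, translator="A"):
    """Yield (source, reference) pairs aligned by seg id within each doc."""
    for doc in root.iter("doc"):
        refs = {}
        for ref in doc.iter("ref"):
            if ref.get("translator") == translator:
                for seg in ref.iter("seg"):
                    refs[seg.get("id")] = (seg.text or "").strip()
        for src in doc.iter("src"):
            for seg in src.iter("seg"):
                # A seg this translator skipped gets an empty reference
                # instead of breaking the alignment.
                yield (seg.text or "").strip(), refs.get(seg.get("id"), "")


root = ET.fromstring(SAMPLE.strip())

# Counting all segs globally, as the assert does, gives 2 src vs 3 ref here.
n_src = sum(1 for s in root.iter("src") for _ in s.iter("seg"))
n_ref = sum(1 for r in root.iter("ref") for _ in r.iter("seg"))

pairs_a = list(read_pairs(root, "A"))
pairs_b = list(read_pairs(root, "B"))
```

With this sample, translator A yields a reference for both source segs, while translator B yields an empty reference for the second one.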
@khayrallah

Just wanted to make a note that this affects more than just enzh. German-English is also affected when using the default scripts provided by WMT.

@thammegowda
Owner

Thanks for reporting this.
Sorry for the delay; I was on vacation and away from GitHub.
I will try to fix this issue soon and release a new version.

@thammegowda thammegowda mentioned this issue Jun 28, 2022
7 tasks
@thammegowda
Owner

thammegowda commented Jul 3, 2022

Thanks, @khayrallah, for the pointer!

You are right: WMT21 test refs come from multiple translators, unlike previous years.

What is causing the delay is that not all files have multiple refs, and where there are multiple refs, not every translator translated every segment. I will need a bit more time to fix it properly.

$ for i in ~/.mtdata/data.statmt.org/1df0/c1646dcf67bf017db12b47b5c987/wmt21tests.tgz-extracted/test/newstest2021.*xml;
  do basename "$i"; grep -o 'translator="[^"]*"' "$i" | sort | uniq -c; done
  
newstest2021.cs-en.xml
    167 translator="A"
     62 translator="B"
newstest2021.de-en.xml
     67 translator="A"
     61 translator="B"
newstest2021.de-fr.xml
     61 translator="A"
newstest2021.en-cs.xml
    201 translator="A"
     68 translator="B"
newstest2021.en-de.xml
     74 translator="A"
     68 translator="C"
     68 translator="D"
newstest2021.en-ha.xml
   3524 translator="A"
newstest2021.en-is.xml
     65 translator="A"
newstest2021.en-ja.xml
     65 translator="A"
newstest2021.en-ru.xml
     77 translator="A"
     68 translator="B"
newstest2021.en-zh.xml
     77 translator="A"
     68 translator="B"
newstest2021.fr-de.xml
     74 translator="A"
newstest2021.ha-en.xml
   3559 translator="A"
newstest2021.is-en.xml
     47 translator="A"
newstest2021.ja-en.xml
     81 translator="A"
newstest2021.ru-en.xml
    116 translator="A"
    107 translator="B"
newstest2021.zh-en.xml
    165 translator="A"
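
The grep above counts occurrences of the translator attribute, not segments. To see how many segs each translator actually covers, something like the following sketch could help. It assumes the attribute sits on ref elements that contain seg elements. SAMPLE and seg_coverage are hypothetical names, not part of mtdata.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Small stand-in document. The layout is assumed, not taken from mtdata.
SAMPLE = """
<dataset>
  <doc id="d1">
    <src lang="en"><p><seg id="1">s1</seg><seg id="2">s2</seg></p></src>
    <ref lang="zh" translator="A"><p><seg id="1">a1</seg><seg id="2">a2</seg></p></ref>
    <ref lang="zh" translator="B"><p><seg id="1">b1</seg></p></ref>
  </doc>
  <doc id="d2">
    <src lang="en"><p><seg id="1">s3</seg></p></src>
    <ref lang="zh" translator="A"><p><seg id="1">a3</seg></p></ref>
  </doc>
</dataset>
"""


def seg_coverage(root):
    """Return (number of source segs, Counter of ref segs per translator)."""
    n_src = sum(1 for src in root.iter("src") for _ in src.iter("seg"))
    per_translator = Counter()
    for ref in root.iter("ref"):
        # Refs without the attribute are grouped under a placeholder key.
        name = ref.get("translator", "<none>")
        per_translator[name] += sum(1 for _ in ref.iter("seg"))
    return n_src, per_translator


n_src, coverage = seg_coverage(ET.fromstring(SAMPLE.strip()))
for name, n_ref in sorted(coverage.items()):
    print(f"translator={name}: {n_ref} of {n_src} segs")
```

A translator whose count is below the number of source segs has skipped some segments, which is the case a fix would need to handle.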

@khayrallah

Thanks for the update! It might be a good idea to add a note on the main WMT page, since it links to this tool as the way to download the WMT data.

@thammegowda
Owner

Thanks for the suggestion! I have sent a pull request to the WMT22 page. Once it is merged, a note will appear under the “limitations” section.
