
Definition of "very small corpora" #2

Closed
ftyers opened this issue Jul 26, 2016 · 4 comments


@ftyers

ftyers commented Jul 26, 2016

(with some exceptions -- Japanese because of the license, very small corpora and corpora which cannot be reliably detokenized).

How would you define "very small corpora"? Would this mean that Kazakh and Buryat are excluded?

@foxik
Member

foxik commented Jul 26, 2016

Our worry about very small corpora is that their test set is so small that even several words make up a considerable percentage (6 words in the Kazakh test set are more than 1%) -- we wanted to avoid people spending most of their time working on the smallest corpus to get a better overall score.
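The arithmetic behind this worry can be sketched as follows. The 600-word test-set size is a hypothetical figure inferred from "6 words are more than 1%", and the 25,000-word comparison size is an illustrative assumption, not a number from this thread:

```python
# Sketch of the small-test-set concern: on a tiny corpus, flipping a
# handful of tokens swings the score by a visible percentage.

def accuracy_swing(words_changed: int, test_set_size: int) -> float:
    """Percentage-point change in accuracy from changing `words_changed` tokens."""
    return 100.0 * words_changed / test_set_size

# Hypothetical sizes: ~600 words makes 6 words worth a full point;
# on a 25,000-word test set the same 6 words are nearly invisible.
small = accuracy_swing(6, 600)     # -> 1.0 percentage point
large = accuracy_swing(6, 25000)   # -> 0.024 percentage points
print(small, large)
```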

Another worry was whether we could get enough raw text.

But the first issue could be alleviated by changing the overall score computation. We will discuss it in a new issue.

@ftyers
Author

ftyers commented Jul 27, 2016

There is no problem with getting raw text for Kazakh or Buryat. As for the other points we can take it up in the new issue. One other possibility would be to have a "small corpus" track where subsets of all the corpora are given which are approximately the same size.

The visibility of being included in shared tasks like this is a massive motivation for people planning to work on open UD-based treebanks, and it would be a shame to turn around and tell them "well, you didn't do enough work", especially without explicitly saying how big or small a corpus needs to be.

In any case regarding the average, it's two languages out of 40-50, so even if people really tune the hell out of Kazakh and Buryat, I don't expect it would have a massive effect on the final score.
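The dilution argument above can be made concrete. The treebank count (45) and the per-treebank gain (2 accuracy points) are illustrative assumptions, chosen only to match the "two languages out of 40-50" estimate:

```python
# Sketch of macro-average dilution: a gain on 2 treebanks out of ~45
# barely moves the overall (unweighted) mean score.

n_treebanks = 45        # assumed total number of treebanks in the task
gain_per_tuned = 2.0    # hypothetical accuracy gain on Kazakh and Buryat

# Each treebank contributes 1/n_treebanks to the macro average,
# so heavy tuning on two of them shifts the mean by only:
overall_shift = gain_per_tuned * 2 / n_treebanks
print(round(overall_shift, 3))  # -> 0.089 percentage points
```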

@foxik
Member

foxik commented Jul 28, 2016

You have a very good point regarding "not being part of a shared task".

We have therefore dropped the exclusion of very small corpora. We currently suggest leaving out only Japanese [because of the licence of the original corpus] and the corpora we cannot reliably detokenize [Old Church Slavonic and Gothic being candidates, but if someone is able to get additional raw data, they should be fine].

@foxik foxik closed this as completed in d6fddcc Jul 28, 2016
@dan-zeman
Member

Note that the final decision, as per the Berlin meeting, is to re-introduce a lower limit on corpus size: the test data must have at least 10,000 words and the development data should ideally also reach 10,000 words, although this is not a hard constraint. There is no size requirement for training data—it can be zero.
