Companion data necessary for training? #6

Closed
danielhers opened this issue Feb 3, 2021 · 5 comments

Comments

@danielhers

To train PERIN on a new dataset (not from MRP 2020), a companion file currently needs to be specified for the new text. Is this a real requirement, or is it just a result of the implementation? Does PERIN actually use any of the information from the companion data? If so, what is the easiest way to generate that data for new text?

@foxik
Member

foxik commented Feb 3, 2021

David knows better, but my feeling is that we use the lemmas from the companion data. I.e., when constructing the rules for labels, we allow copying/modifying a corresponding lemma instead of a token (or other sources).

So either you need to lemmatize the data (you could use the UDPipe service, for example; the new BERT version trained on UD 2.6 is running at https://lindat.mff.cuni.cz/services/udpipe/), or you could disable the usage of the lemma rules (and perform the lemmatization during the syntactic parsing).
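A minimal sketch of the first option, lemmatizing new text through the UDPipe web service. The REST endpoint path (`/api/process`), the `data`/`tokenizer`/`tagger`/`model` parameters, and the `result` field of the JSON response are assumptions based on the service's public REST API and should be checked against its documentation; `parse_conllu_lemmas` and `lemmatize` are hypothetical helper names, not part of PERIN.

```python
import json
import urllib.parse
import urllib.request
from typing import List, Optional, Tuple

# Assumed REST endpoint of the LINDAT UDPipe service (verify in the service docs).
UDPIPE_URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"

Sentence = List[Tuple[str, str]]  # (form, lemma) pairs


def parse_conllu_lemmas(conllu: str) -> List[Sentence]:
    """Return one list of (form, lemma) pairs per sentence of a CoNLL-U string."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in conllu.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        columns = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not columns[0].isdigit():
            continue
        current.append((columns[1], columns[2]))  # FORM, LEMMA
    if current:
        sentences.append(current)
    return sentences


def lemmatize(text: str, model: Optional[str] = None) -> List[Sentence]:
    """Tokenize and tag `text` remotely; requires network access."""
    params = {"data": text, "tokenizer": "", "tagger": ""}
    if model is not None:
        params["model"] = model  # otherwise the service default is used
    body = urllib.parse.urlencode(params).encode("utf-8")
    with urllib.request.urlopen(UDPIPE_URL, data=body) as response:
        payload = json.load(response)
    return parse_conllu_lemmas(payload["result"])
```

The parsing step is kept separate from the network call so it can also be applied to CoNLL-U files produced by an offline UDPipe run.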

@davda54
Collaborator

davda54 commented Feb 3, 2021

Yeah, the only problem is the lemmatized tokens, which are used to create a more efficient set of relative label rules. So specifically, the absence of the companion data shouldn't impact UCCA parsing (but it will most likely hurt the accuracy of label prediction for the other frameworks).
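To illustrate why lemmas can give a more compact rule set, here is a rough sketch. It assumes a relative rule has the simplified form "drop k trailing characters from the source, then append a suffix"; this format and the names `relative_rule`, `apply_rule`, and `rule_inventory` are hypothetical and are not PERIN's actual rule encoding.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical rule format: (trailing characters to drop, suffix to append).
Rule = Tuple[int, str]


def relative_rule(source: str, label: str) -> Rule:
    """Encode `label` as an edit of `source` based on their common prefix."""
    common = 0
    for a, b in zip(source, label):
        if a != b:
            break
        common += 1
    return len(source) - common, label[common:]


def apply_rule(source: str, rule: Rule) -> str:
    """Reconstruct the label from `source` and a rule."""
    drop, suffix = rule
    return source[: len(source) - drop] + suffix


def rule_inventory(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count the distinct rules needed to cover (source, label) pairs."""
    return Counter(relative_rule(source, label) for source, label in pairs)


# Three inflected tokens whose node label is "run".
from_tokens = rule_inventory([("running", "run"), ("runs", "run"), ("ran", "run")])
# The same three nodes, using the lemma as the source instead.
from_lemmas = rule_inventory([("run", "run")] * 3)

print(len(from_tokens), len(from_lemmas))
```

In this toy case the token-based inventory needs a separate rule per inflection, while the lemma-based one collapses to a single identity rule, which is the kind of saving lost when no lemmas are available.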

I've quickly hacked together a workaround for preprocessing data without a companion file; it is in the no_lemmas branch.

@davda54
Collaborator

davda54 commented Feb 3, 2021

As for generating the companion data (i.e. lemmas), you can use the code from UDPipeWrapper.

@danielhers
Author

This makes a lot of sense. Thank you both for the quick solution!
Besides, it's good to know the new UDPipe is already usable, even if not yet offline.
I'll be happy to close the issue unless you want to keep it, e.g. for adding documentation about this option.

@davda54
Collaborator

davda54 commented Mar 21, 2021

Merged into the main branch [#9], closing.


3 participants