Companion data necessary for training? #6
Comments
David knows better, but my feeling is that we use the lemmas from the companion data. I.e., when constructing the rules for labels, we allow copying/modifying a corresponding lemma instead of a token (or other sources). So either you need to lemmatize the data (you could use the UDPipe service, for example; we have the new BERT version trained on UD 2.6 running at https://lindat.mff.cuni.cz/services/udpipe/), or you could disable the usage of the lemma rules (and perform the lemmatization during the syntactic parsing).
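As a minimal sketch of the first option, raw text can be lemmatized through the LINDAT UDPipe REST `process` endpoint, which returns CoNLL-U output (the LEMMA column is what the companion data needs). The exact model name below is an assumption; check the service's model list before use.

```python
import json
import urllib.parse
import urllib.request

# LINDAT UDPipe REST API endpoint (see the service's API reference for details).
UDPIPE_URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"

def build_request(text, model="english-ewt-ud-2.6-200830"):
    """Encode a tokenize+tag request; the tagger fills the CoNLL-U LEMMA column.

    The model name is an assumption -- substitute one listed by the service.
    """
    params = urllib.parse.urlencode({
        "model": model,
        "tokenizer": "",  # empty value = run with default options
        "tagger": "",     # tagging produces the lemmas we need
        "data": text,
    }).encode("utf-8")
    return urllib.request.Request(UDPIPE_URL, data=params)

def lemmatize(text, model="english-ewt-ud-2.6-200830"):
    """Return CoNLL-U output for `text`; requires network access."""
    with urllib.request.urlopen(build_request(text, model)) as resp:
        return json.loads(resp.read().decode("utf-8"))["result"]
```

The returned CoNLL-U string can then be saved and passed to PERIN as the companion file.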
Yeah, the only problem is with the lemmatized tokens, which are used to create a more efficient set of relative label rules -- so specifically, the absence of the companion data shouldn't impact UCCA parsing (but it will most likely negatively influence the accuracy of label prediction for the other frameworks). I've quickly hacked a workaround that preprocesses the data without a companion file into the branch no_lemmas.
As for generating the companion data (i.e., lemmas), you can use the code from UDPipeWrapper.
This makes a lot of sense. Thank you both for the quick solution!
Merged into the main branch [#9], closing.
To train PERIN on a new dataset (not from MRP 2020), a companion file currently needs to be specified for the new text. Is this a hard requirement, or just an artifact of the implementation? Does PERIN actually use any of the information from the companion data? If so, what is the easiest way to generate that data for new text?