
Brackets (punct) are not properly tagged to its heads in show tables (english-ewt-ud-2.12-230717) #175

Closed
Shasetty opened this issue Oct 18, 2023 · 2 comments


@Shasetty

Text :
MALVERN, Pa., Aug. 09, 2023 (GLOBE NEWSWIRE) -- Galera Therapeutics, Inc. (Nasdaq: GRTX), a clinical-stage biopharmaceutical company focused on developing and commercializing a pipeline of novel, proprietary therapeutics that have the potential to transform radiotherapy in cancer, today announced that it has received a Complete Response Letter (CRL) from the U.S.Food and Drug Administration (FDA) regarding the Company’s New Drug Application (NDA) for avasopasem manganese (avasopasem) for radiotherapy-induced severe oral mucositis (SOM) in patients with head and neck cancer undergoing standard-of-care treatment.

Correct output in "Show Trees".

Wrong output in "Show Table" and "Output Text": (FDA), (NDA)
https://lindat.mff.cuni.cz/services/udpipe/

@martinpopel
Member

I confirm that the right brackets following FDA and NDA are attached to the wrong parent (i.e. not to FDA and NDA, respectively) when parsing this very long sentence with english-ewt-ud-2.12-230717. You can use `udapy -s ud.FixPunct < in.conllu > out.conllu` to fix it.
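To spot such attachments in CoNLL-U output before deciding whether to run a fixer, a rough check can help. The following is a minimal sketch in plain Python, not the logic of `ud.FixPunct`: it assumes a simplified rule (both brackets of a pair should have their head strictly between them), and the function names `read_sentences` and `suspicious_brackets` as well as the sample tokens are hypothetical.

```python
def read_sentences(conllu_text):
    """Yield sentences as lists of (id, form, head) for plain word lines."""
    sentence = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        head = int(cols[6]) if cols[6].isdigit() else 0
        sentence.append((int(cols[0]), cols[1], head))
    if sentence:
        yield sentence


def suspicious_brackets(sentence):
    """Return (open_id, close_id) pairs where a bracket's head lies outside the pair.

    Assumption: a simplified heuristic, not the full UD punctuation guidelines.
    """
    flagged = []
    stack = []  # open brackets waiting for their closing partner
    for tok_id, form, head in sentence:
        if form == "(":
            stack.append((tok_id, head))
        elif form == ")" and stack:
            open_id, open_head = stack.pop()
            heads_inside = (open_id < open_head < tok_id) and (open_id < head < tok_id)
            if not heads_inside:
                flagged.append((open_id, tok_id))
    return flagged


def row(tok_id, form, head, deprel):
    """Build one 10-column CoNLL-U line with unused columns set to '_'."""
    return "\t".join([str(tok_id), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])


# Hypothetical fragment: the closing bracket (token 5) hangs on token 6, not on FDA.
sample = "\n".join([
    row(1, "the", 2, "det"),
    row(2, "Administration", 0, "root"),
    row(3, "(", 4, "punct"),
    row(4, "FDA", 2, "appos"),
    row(5, ")", 6, "punct"),
    row(6, "regarding", 2, "dep"),
]) + "\n"

for sent in read_sentences(sample):
    print(suspicious_brackets(sent))
```

For real data, the text would be read from a file such as `in.conllu`; the check only flags candidates and does not repair anything.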

However, the output in "Show Trees" is exactly the same as in "Show Table" (and as the CoNLL-U in "Output Text"), so there is no bug in UDPipe. These GitHub issues are for reporting bugs in the software. You cannot expect 100% parsing accuracy from all models.

BTW: When using e.g. the english-gum-ud-2.12-230717 model, the brackets enclosing FDA and NDA are attached correctly. This suggests GUM is better training data than EWT in this respect. Indeed, when applying ud.FixPunct to en_gum-ud-train.conllu, only 39 errors are fixed, but on en_ewt-ud-train.conllu, 7496 are. So maybe the authors of EWT should fix these errors, and the new version of UDPipe will then be better. However, that should not be discussed here, but at https://github.com/UniversalDependencies/UD_English-EWT/issues
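How the counts above were obtained is not stated in this thread. One possible way to get a comparable number is to diff the file before and after the fix and count punctuation tokens whose HEAD changed. The sketch below assumes both files contain the same tokens in the same order; the function names `punct_heads` and `count_changed_punct` are hypothetical.

```python
def punct_heads(conllu_text):
    """Return the HEAD column of every token tagged UPOS=PUNCT, in file order."""
    heads = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Only plain word lines: 10 columns and an integer ID.
        if len(cols) == 10 and cols[0].isdigit() and cols[3] == "PUNCT":
            heads.append(cols[6])
    return heads


def count_changed_punct(before_text, after_text):
    """Count punctuation tokens whose HEAD differs between two CoNLL-U texts."""
    before = punct_heads(before_text)
    after = punct_heads(after_text)
    if len(before) != len(after):
        raise ValueError("files do not contain the same number of PUNCT tokens")
    return sum(1 for b, a in zip(before, after) if b != a)


def row(tok_id, form, upos, head, deprel):
    """Build one 10-column CoNLL-U line with unused columns set to '_'."""
    return "\t".join([str(tok_id), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])


# Hypothetical before/after pair: only the closing bracket is reattached.
before = "\n".join([
    row(1, "(", "PUNCT", 2, "punct"),
    row(2, "FDA", "PROPN", 0, "root"),
    row(3, ")", "PUNCT", 4, "punct"),
    row(4, "regarding", "VERB", 2, "dep"),
]) + "\n"
after = before.replace(row(3, ")", "PUNCT", 4, "punct"), row(3, ")", "PUNCT", 2, "punct"))

print(count_changed_punct(before, after))
```

With real data, `before` and `after` would be the contents of the input and output files of the `udapy` run (for example `in.conllu` and `out.conllu`).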

@foxik
Member

foxik commented Oct 18, 2023

Thanks @martinpopel for your detailed answer 😊

@Shasetty UDPipe is a statistical tool, so its performance depends both on (a) its ability to train effectively on the UD training data and to generalize correctly to user inputs, and (b) the correctness of the training data. It is expected to make errors, and we cannot easily fix them one by one (so there is little use in reporting them to us); but you can definitely try improving the training data in the repository @martinpopel suggested.
