Span crossing two sentences? #1333

Closed
rizpras opened this issue Jan 20, 2024 · 5 comments
@rizpras

rizpras commented Jan 20, 2024

Hello

I got an error saying that the model predicted a span that crosses two sentences, and asking me to send the example to GitHub. Here is my code (pretty simple):

```python
import stanza

pipe = stanza.Pipeline("en", processors="tokenize, coref")
out = pipe("""If an electrical machine or equipment generates mechanical vibrations when in service, e.g. because it is out of balance, the vibration amplitude measured on the machine or the equipment on board shall not lie outside area A. For this evaluation, reference is made only to the self-generated vibration components. Area A may only be utilized if the loading of all components, with due allowance for local excess vibration, does not impair reliable long-term operation""")

print(out)
```

My guess is that the term "Area A" is the cause. Is the model currently unable to handle coreference that crosses two sentences? What can I do about this sentence?

Thank you

rizpras added the bug label Jan 20, 2024
@AngledLuffa
Collaborator

Ah, this was me being an idiot. I put an error check in the coref model to make sure the spans were all in the same sentence (the original code masks for that AFAIK), but the error check itself was buggy.

@AngledLuffa
Collaborator

If you use the dev branch, it should now be fixed...

I was thinking it might be good to wait for a bigger feature to be finished before making a new release, but seeing as we've fixed a couple of bugs in the last couple of months, it might be worth doing an interim release.

@rizpras
Author

rizpras commented Jan 21, 2024

Thank you very much! Just curious, can I still use it in Google Colab if it's in the dev branch? Another thing, why does the model need to make sure that the spans are all in the same sentence?

@AngledLuffa
Collaborator

> Just curious, can I still use it in Google Colab if it's in the dev branch?

I don't know how you've installed Stanza, but you should be able to pip install from a branch, if that's what you did:

https://stackoverflow.com/questions/20101834/pip-install-from-git-repo-branch
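
As a rough sketch of what that looks like when driven from Python (for example from a notebook cell): the snippet below builds the `pip install git+<repo>@<branch>` command described in the linked answer. The repository URL and the branch name `dev` are assumptions based on this thread, and the helper names `pip_install_command` and `install_from_branch` are made up for illustration.

```python
import subprocess
import sys

# Assumption: Stanza's repository URL; the thread only says "the dev branch".
STANZA_REPO = "https://github.com/stanfordnlp/stanza.git"


def pip_install_command(repo_url, branch):
    """Build the pip command that installs a package from a git branch."""
    # pip accepts "git+<url>@<ref>" to install from a specific branch, tag, or commit.
    return [sys.executable, "-m", "pip", "install", f"git+{repo_url}@{branch}"]


def install_from_branch(repo_url=STANZA_REPO, branch="dev"):
    """Run the install with the interpreter that is currently executing."""
    subprocess.check_call(pip_install_command(repo_url, branch))


# Uncomment to actually install (needs network access and git):
# install_from_branch()
```

After installing in a notebook, the runtime usually has to be restarted so the new version is the one that gets imported.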

> Another thing, why does the model need to make sure that the spans are all in the same sentence?

Technically it doesn't, but the model was trained to only produce spans which are contained in a single sentence, and I used that assumption downstream when turning the spans into human-readable output. I had put in an assertion to test that, but the assertion itself was buggy when a span ended exactly at the end of a sentence. Since sentences usually end in punctuation, that hadn't come up until you hit a sentence that the tokenizer splits incorrectly.
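
To illustrate the kind of bug described above, here is a hypothetical sketch, not the actual Stanza code. It assumes spans are stored as `(start, end)` word indices with an exclusive `end`, and shows how a same-sentence check can misfire when it looks at the word just past the span instead of the last word in it. All function names are invented for this example.

```python
def word_to_sentence(sentence_lengths):
    """Map each word index in the document to the index of its sentence."""
    ids = []
    for sent_idx, length in enumerate(sentence_lengths):
        ids.extend([sent_idx] * length)
    return ids


def span_in_one_sentence(word_sent, start, end):
    """Correct check: compare the first word with the last word inside the span."""
    return word_sent[start] == word_sent[end - 1]


def buggy_span_in_one_sentence(word_sent, start, end):
    """Off by one: word_sent[end] is the word after the span, which belongs
    to the next sentence whenever the span ends exactly at a sentence boundary."""
    return word_sent[start] == word_sent[end]


# Two sentences of 3 and 2 words; the span covers words 1 and 2,
# so it ends exactly at the end of the first sentence.
word_sent = word_to_sentence([3, 2])
print(span_in_one_sentence(word_sent, 1, 3))        # True
print(buggy_span_in_one_sentence(word_sent, 1, 3))  # False, a spurious failure
```

With this layout, the buggy version only misfires for spans that include the final word of a sentence, which fits the observation that the problem stays hidden as long as sentences end in punctuation that no span covers.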

@rizpras
Author

rizpras commented Jan 23, 2024

I installed Stanza from the dev branch and now it works! Thank you very much @AngledLuffa, you have been very helpful!

I'm closing this issue.
