Coref model predicted a span that crossed two sentences #1339
Thank you. This has already been addressed in the dev branch. I
should probably make a new release with that fix.
On Mon, Jan 29, 2024, 3:29 PM Abhinav Patil ***@***.***> wrote:
*Describe the bug*
Depending on whether tokenize_pretokenized and tokenize_no_ssplit are
each True or False, the following sentence causes the coref processor
to raise either ValueError: The coref model predicted a span that
crossed two sentences! or IndexError: list index out of range, on
lines 120 and 119 of stanza/pipeline/coref_processor.py, respectively.
The sentence: The son of Mr. and Mrs. X. He is four during the events of
the first book . <eos>
*To Reproduce*
Steps to reproduce the behavior:
Set up code:
import stanza
s = "The son of Mr. and Mrs. X. He is four during the events of the first book . <eos>"
pipeline = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse,coref")
pipeline_no_ssplit = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse,coref", tokenize_no_ssplit=True)
pipeline_pretok = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse,coref", tokenize_pretokenized=True)
pipeline_pretok_no_ssplit = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse,coref", tokenize_pretokenized=True, tokenize_no_ssplit=True)
Then the following line of code:
a = pipeline(s)
produces the exception:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/apatil/anaconda3/envs/base_nlp/lib/python3.9/site-packages/stanza/pipeline/core.py", line 476, in __call__
    return self.process(doc, processors)
  File "/Users/apatil/anaconda3/envs/base_nlp/lib/python3.9/site-packages/stanza/pipeline/core.py", line 427, in process
    doc = process(doc)
  File "/Users/apatil/anaconda3/envs/base_nlp/lib/python3.9/site-packages/stanza/pipeline/coref_processor.py", line 120, in process
    raise ValueError("The coref model predicted a span that crossed two sentences! Please send this example to us on our github")
ValueError: The coref model predicted a span that crossed two sentences! Please send this example to us on our github
whereas any of the following lines of code:
b = pipeline_no_ssplit(s)
c = pipeline_pretok(s)
d = pipeline_pretok_no_ssplit(s)
produce the exception:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/apatil/anaconda3/envs/base_nlp/lib/python3.9/site-packages/stanza/pipeline/core.py", line 476, in __call__
    return self.process(doc, processors)
  File "/Users/apatil/anaconda3/envs/base_nlp/lib/python3.9/site-packages/stanza/pipeline/core.py", line 427, in process
    doc = process(doc)
  File "/Users/apatil/anaconda3/envs/base_nlp/lib/python3.9/site-packages/stanza/pipeline/coref_processor.py", line 119, in process
    if sent_ids[span[1]] != sent_id:
IndexError: list index out of range
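To illustrate how a single check can yield both exceptions, here is a minimal sketch. It is not Stanza's implementation: check_span, outcome, and the toy sent_ids lists are hypothetical, and both the exclusive-end span convention and the end_exclusive_fix adjustment are assumptions that this thread does not confirm. Only the comparison sent_ids[span[1]] != sent_id is taken from the traceback.

```python
# Simplified stand-in for the check shown in the tracebacks; NOT Stanza's code.
# Assumption: a span is (start, end) over word indices, with an exclusive end.

def check_span(sent_ids, span, end_exclusive_fix=False):
    """Return the sentence id of a span, mimicking the failing comparison."""
    sent_id = sent_ids[span[0]]
    # With the (assumed) fix, look at the last word inside the span instead
    # of the first word after it.
    end = span[1] - 1 if end_exclusive_fix else span[1]
    if sent_ids[end] != sent_id:
        raise ValueError("span crossed two sentences")
    return sent_id


def outcome(sent_ids, span, **kwargs):
    """Return the sentence id, or the name of the exception that was raised."""
    try:
        return check_span(sent_ids, span, **kwargs)
    except (ValueError, IndexError) as err:
        return type(err).__name__


# Five words split into two sentences: word index -> sentence index.
split_ids = [0, 0, 0, 1, 1]
# The same five words kept as one sentence (as with tokenize_no_ssplit=True).
unsplit_ids = [0, 0, 0, 0, 0]

print(outcome(split_ids, (0, 3)))    # span ends where sentence 0 ends
print(outcome(unsplit_ids, (2, 5)))  # span ends at the last word of the doc
```

Under these assumptions, a span that ends exactly at a sentence boundary is read as crossing into the next sentence, while a span that ends at the last word of the document indexes past the end of the list. That would be consistent with the split pipeline raising ValueError and the unsplit ones raising IndexError, but it remains a guess about the cause.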
*Expected behavior*
All of these should just work. They should not raise either of the
exceptions above.
*Environment (please complete the following information):*
- OS: Reproduced on Mac (with CPU) and Oracle Linux (with GPU)
- Python version: Python 3.9.16 | packaged by conda-forge
- Stanza version: 1.7.0
*Additional context*
I have also seen sporadic instances of the coref model predicted a span
that crossed two sentences! error elsewhere, but previously only with a
large group of sentences in a single doc. Strangely, omitting any one of
those sentences made the error disappear. This is the first time I've
been able to reproduce it with a single sentence, which is why I am
reporting it. I can, however, provide other batches of sentences that
result in the same issue, if it helps.
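For collecting such examples, a small helper along these lines could record which documents fail and with which exception. This is only a sketch: find_failing_texts and fake_annotate are hypothetical names, and fake_annotate stands in for a real pipeline call so that the example runs without Stanza or its models.

```python
def find_failing_texts(annotate, texts):
    """Call annotate on each text; return (index, exception name) per failure."""
    failures = []
    for index, text in enumerate(texts):
        try:
            annotate(text)
        except (ValueError, IndexError) as err:
            # Only the two exception types reported in this issue are caught.
            failures.append((index, type(err).__name__))
    return failures


def fake_annotate(text):
    """Hypothetical stand-in for a pipeline; fails on texts containing <eos>."""
    if "<eos>" in text:
        raise ValueError("The coref model predicted a span that crossed two sentences!")
    return text


docs = ["A plain sentence .", "He is four . <eos>", "Another one ."]
print(find_failing_texts(fake_annotate, docs))
```

In real use, annotate would be one of the pipelines defined above, and each entry of texts would be a whole document, since the error may depend on the full group of sentences.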
Ah, I see that now, as reported in #1333. Yes, if you could make a release for that, it would be very helpful.
This fix was released in 1.8.0, since superseded by 1.8.1 because the 1.8.0 release had some critical bugs.
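A quick way to check whether an installed version is at least 1.8.1 is sketched below. The helper names parse_version and is_recommended_version are hypothetical, and the parser assumes purely numeric dotted version strings such as "1.8.1".

```python
def parse_version(version):
    """Turn '1.8.1' into (1, 8, 1); assumes numeric, dot-separated parts."""
    return tuple(int(part) for part in version.split("."))


def is_recommended_version(version, minimum="1.8.1"):
    """True if version is at least the release recommended in this thread."""
    return parse_version(version) >= parse_version(minimum)


# With Stanza installed, the argument would be stanza.__version__.
print(is_recommended_version("1.7.0"))
print(is_recommended_version("1.8.1"))
```

Comparing tuples of integers rather than raw strings keeps a version such as 1.10.0 ordered after 1.8.1.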