Stanza 1.8.1 failing to split sentence apart #1362

Open
khannan-livefront opened this issue Mar 6, 2024 · 2 comments

khannan-livefront commented Mar 6, 2024

Describe the bug
We've encountered a sentence pattern where Stanza fails to split apart two sentences. It appears when certain names are used (e.g. Max, Anna) but not with others (e.g. Ann).

To Reproduce
Steps to reproduce the behavior:

  1. Go to http://stanza.run/ or run either of these inputs through Stanza:

Max has the map? No. Max has no map.

Anna has the map? No. Anna has no map.

  2. See the error: Stanza fails to split "No." into a separate sentence.

Screenshot 2024-03-06 at 12 59 01 PM

Expected behavior
The parse returns No. as a separate sentence.
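
A minimal script for reproducing this locally might look like the sketch below. It assumes the default English models and a tokenize-only pipeline; the helper names `sentence_texts`, `has_standalone_no`, and `main` are made up for illustration and are not part of Stanza.

```python
# Inputs copied from the report above.
EXAMPLES = [
    "Max has the map? No. Max has no map.",
    "Anna has the map? No. Anna has no map.",
]


def sentence_texts(nlp, text):
    """Run a Stanza-style pipeline and return the text of each sentence."""
    return [sentence.text for sentence in nlp(text).sentences]


def has_standalone_no(sentences):
    """True if "No." came back as its own sentence (the expected behavior)."""
    return "No." in [s.strip() for s in sentences]


def main():
    # Imported here so the helpers above can be used without Stanza installed.
    import stanza

    stanza.download("en")  # assumption: default English models
    nlp = stanza.Pipeline(lang="en", processors="tokenize")
    for text in EXAMPLES:
        sentences = sentence_texts(nlp, text)
        print(sentences, "OK" if has_standalone_no(sentences) else "NOT SPLIT")
```

Calling `main()` downloads the English models on first use and prints the segmentation for each input.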

Environment (please complete the following information):

  • OS: MacOS Ventura 13.4
  • Python version: Python 3.12.2 using Poetry 1.8.2
  • Stanza version: 1.6.1

Additional context
This issue also appears in Stanza 1.8.1. Have not tested it with Stanza 1.7.x. Screenshot is from Stanza 1.6.1.

AngledLuffa (Collaborator) commented

It is definitely on our radar to improve the tokenizer in general. I would say that in this particular instance it is treating "No." as the abbreviation for "Number", even though it should be conditioned not to do that when a name (or rather, a capital letter) comes after the "No.". I wonder if there's room to add some examples to the training data to discourage this behavior.
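
As a rough illustration of that condition only, the sketch below re-splits an already segmented sentence wherever "No." is followed by whitespace and a capital letter. This is a hypothetical post-processing heuristic, not how Stanza's tokenizer works, and `resplit_no` is a made-up name. It would wrongly split text such as "No. Ten" where "No." really does mean "Number".

```python
import re

# "No." followed by whitespace and then a capital letter (lookahead only).
_NO_BEFORE_CAPITAL = re.compile(r"\bNo\.\s+(?=[A-Z])")


def resplit_no(sentence):
    """Split `sentence` after each "No." that precedes a capitalized word.

    Hypothetical workaround: returns a list of pieces, or a one-element
    list if the pattern does not occur.
    """
    pieces = []
    start = 0
    for match in _NO_BEFORE_CAPITAL.finditer(sentence):
        cut = match.start() + len("No.")  # keep "No." with the left piece
        pieces.append(sentence[start:cut].strip())
        start = match.end()  # skip the whitespace after "No."
    pieces.append(sentence[start:].strip())
    return [piece for piece in pieces if piece]
```

Applied to each sentence text returned by the pipeline, this leaves "See No. 5 for details." intact because a digit, not a capital letter, follows "No.".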


khannan-livefront commented Mar 11, 2024

@AngledLuffa We have found more examples, this time of sentences being oversplit, that you could add to the training data:

"I do not love this thick fog!" yells Thad.

Screenshot 2024-03-11 at 2 54 59 PM


Then a dog licks Thad on his leg.

Screenshot 2024-03-11 at 2 55 37 PM


Sentences with dialogue are not splitting correctly either:

"Is this something bad?" Doc Chez said, "It's OK, Max. We will get you glasses."

Screenshot 2024-03-11 at 3 21 31 PM


Note: the screenshots are from Stanza 1.6.1.

UPDATE: Had many more examples here, but removed the ones now working in Stanza 1.8.1. Big improvement! :) But these ones are still broken.
