If a word does not have a POS (specifically xpos) value what should I do? #1366

Closed
gabriellestein opened this issue Mar 13, 2024 · 5 comments

@gabriellestein

(I apologize if this has already been asked; I tried for a long time to find an existing answer.)
I am running into an error where I need the XPOS value for a word, but it doesn't have any POS value.
I have provided an extremely simplified version of my code below.

import stanza

stanza.download('en')
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,depparse')
text = "My favorite actress is Joanna Lumley."
doc = nlp(text)
for sentence in doc.sentences:
    for word in sentence.to_dict():
        print(word)
        xpos = word['xpos']  # raises KeyError on the 'joanna' entry

These are the word values for the last two words:

{'id': 4, 'text': 'actress', 'lemma': 'actress', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Number=Sing', 'head': 8, 'deprel': 'nsubj', 'start_char': 14, 'end_char': 21}
{'id': (5, 6), 'text': 'joanna', 'start_char': 22, 'end_char': 28}

print(word['xpos'])
KeyError: 'xpos'

I am using the XPOS value of each word in the text. The string "joanna" does not have any POS data. Is the reason for this that the word doesn't exist in the vocabulary where the POS data is sourced? It is a name so it makes sense that every name wouldn't be added to a vocabulary. Should I manually state in my code that the XPOS is PROPN? I am working with a large amount of text that has many potential instances of unique names. Is there a better way to handle such instances, like an existing library of uncommon names?

Thank you for your help. I'm sorry if there is an obvious solution to this; I am a complete stanza novice.

@AngledLuffa
Collaborator

This is kinda funny. The tokenizer has misinterpreted Johanna to be similar to tokens such as wanna and gonna, which get split into want to and going to. I suppose in this case it would be Johan to? Obviously this is incorrect, and I will add a sentence or two to the training data to hopefully fix it up.

To answer the question about the missing xpos: tokens which are split into multiple word pieces are included in the to_dict() result like this. Those macro tokens don't have POS because they are composed of multiple words; with wanna, for example, want and to each have their own POS. You can skip entries whose ID isn't a single integer, or you can iterate over just the word pieces with doc.sentences[i].words
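A minimal sketch of the first option, assuming the to_dict() entries look like the two printed earlier in this thread. The helper name xpos_of_words is made up for illustration and is not part of stanza:

```python
# Two entries copied from the to_dict() output shown earlier in this thread.
entries = [
    {'id': 4, 'text': 'actress', 'lemma': 'actress', 'upos': 'NOUN', 'xpos': 'NN',
     'feats': 'Number=Sing', 'head': 8, 'deprel': 'nsubj',
     'start_char': 14, 'end_char': 21},
    {'id': (5, 6), 'text': 'joanna', 'start_char': 22, 'end_char': 28},
]


def xpos_of_words(entries):
    """Return (text, xpos) pairs, skipping multi-word token entries.

    Hypothetical helper: an entry is treated as a plain word only when its
    'id' is a single integer, as in the output above.
    """
    pairs = []
    for entry in entries:
        if not isinstance(entry['id'], int):
            continue  # range entry such as (5, 6): it carries no POS fields
        pairs.append((entry['text'], entry['xpos']))
    return pairs


print(xpos_of_words(entries))
```

With a real pipeline, the same filter would be applied to each sentence's to_dict() result. The second option avoids the filter entirely: each item in a sentence's words list is a single word, so its xpos can be read directly.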

Going through the English words ending with "nna", here are some broken examples:

My favorite actress is Joanna Lumley
I used a burnt sienna crayon to color in the picture
A large goanna bit my arm
Someone I know wears a bandanna every time he goes dancing
Legend tells us belladonna killed Socrates
Ms. Bumbry was never afraid to inhabit the primadonna role offstage
Unlike most fish, channa can leave the water for a short period of time
Apparently Madonna loathes hydrangeas
The savanna has many lions
I got a henna tattoo of my Chinese name
The Israelites survived on manna during the Exodus
That restaurant has excellent panna cotta
Nanna is an Icelandic singer

Correctly not split:

I accidentally touched Jennifer's antenna and she accused me of sexual harassment
The common platanna is invasive in the US
The music at the feiseanna was energizing
Gloria, hosanna in excelsis
The Duenna is an opera in three acts
I have not read Anna Karenina
A pinna is a fern leaf
One set of Islamic traditions is the sunna
You were supposed to take ONE senna, not ten.  Now who's going to clean up all this ...

Correctly split, but the lemmatizer gets it wrong (presumably no training data available):

Dinna light that candle!

should split to "do not"

Then there's the ambiguous case of canna, which is either a lily or Scotty saying "I canna change the laws of physics!"

Heh, THREE Star Trek references in one github response. I'm becoming more efficient!

AngledLuffa added a commit to stanfordnlp/handparsed-treebank that referenced this issue Mar 13, 2024
@gabriellestein
Author

Thank you for the quick response. That fixed my issue!
if not isinstance(word['id'], int):
I’m not too concerned about incorrect lemmatizations for my application, but thank you for updating the training set; that will surely help someone down the road!
I wish I had a clever Star Trek response but embarrassingly I’ve never seen a Star Trek movie. So just imagine I said something very witty.

@AngledLuffa
Collaborator

Without knowing your particular application, I would be surprised if it's happy with Johanna being broken into two pieces and then possibly not given an NNP tag. However, updating the training set with a few examples of -nna makes it not split for most of the cases listed above. Weirdly, it still splits for henna...

@AngledLuffa
Collaborator

Alright, I added a few more sentences with henna to the training data until it stopped splitting that word for no reason. The new models should be automatically downloaded by v1.8.1

>>> print("{:C}".format(pipe("Johanna said she's gonna get a henna tattoo")))
# text = Johanna said she's gonna get a henna tattoo
# sent_id = 0
1       Johanna _       _       _       _       0       _       _       start_char=0|end_char=7
2       said    _       _       _       _       1       _       _       start_char=8|end_char=12
3-4     she's   _       _       _       _       _       _       _       start_char=13|end_char=18
3       she     _       _       _       _       2       _       _       start_char=13|end_char=16
4       's      _       _       _       _       3       _       _       start_char=16|end_char=18
5-6     gonna   _       _       _       _       _       _       _       start_char=19|end_char=24
5       gon     _       _       _       _       4       _       _       start_char=19|end_char=22
6       na      _       _       _       _       5       _       _       start_char=22|end_char=24
7       get     _       _       _       _       6       _       _       start_char=25|end_char=28
8       a       _       _       _       _       7       _       _       start_char=29|end_char=30
9       henna   _       _       _       _       8       _       _       start_char=31|end_char=36
10      tattoo  _       _       _       _       9       _       _       start_char=37|end_char=43|SpaceAfter=No

@AngledLuffa
Collaborator

This is now part of the 1.8.2 release
