If a word does not have a POS (specifically xpos) value what should I do? #1366

gabriellestein · 2024-03-13T00:58:17Z

(I apologize if this has already been asked; I tried for a long time to find an existing answer.)
I am running into an error where I need the XPOS value for a word, but the word doesn't have any POS value.
I have provided an extremely simplified version of my code below.

import stanza

stanza.download('en')
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,depparse')
text = "My favorite actress is Joanna Lumley."
doc = nlp(text)
for word in doc.sentences[0].to_dict():
    print(word)
    xpos = word['xpos']  # raises KeyError on entries that have no POS fields

These are the word values for the last two words:

{'id': 4, 'text': 'actress', 'lemma': 'actress', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Number=Sing', 'head': 8, 'deprel': 'nsubj', 'start_char': 14, 'end_char': 21}
{'id': (5, 6), 'text': 'joanna', 'start_char': 22, 'end_char': 28}

print(word['xpos'])
KeyError: 'xpos'

I am using the XPOS value of each word in the text. The string "joanna" does not have any POS data. Is the reason for this that the word doesn't exist in the vocabulary where the POS data is sourced? It is a name so it makes sense that every name wouldn't be added to a vocabulary. Should I manually state in my code that the XPOS is PROPN? I am working with a large amount of text that has many potential instances of unique names. Is there a better way to handle such instances, like an existing library of uncommon names?

Thank you for your help. I'm sorry if there is an obvious solution to this; I am a complete Stanza novice.

AngledLuffa · 2024-03-13T03:11:19Z

This is kinda funny. The tokenizer has misinterpreted Johanna to be similar to tokens such as wanna and gonna, which get split into want to and going to. I suppose in this case it would be Johan to? Obviously this is incorrect, and I will add a sentence or two to the training data to hopefully fix it up.

To answer the question regarding not having an xpos: tokens which are split into multiple words are included in the to_dict() result like this. Those multi-word tokens don't have POS because they are composed of multiple words, so for example want and to each have their own POS in the above example. You can skip entries whose ID isn't a single integer, or you can iterate over just the words with doc.sentences[i].words
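A minimal sketch of the "skip entries whose ID isn't a single integer" approach. It runs on hand-written dictionaries shaped like the to_dict() output quoted in this thread rather than on a live pipeline; the `word_xpos` helper name and the xpos tags on the two word pieces are assumptions made up for illustration, not real tagger output.

```python
# Entries shaped like Sentence.to_dict() output as quoted in this thread.
# NOTE: the tags on 'gon' and 'na' are illustrative, not real tagger output.
entries = [
    {'id': 4, 'text': 'actress', 'xpos': 'NN'},
    {'id': (5, 6), 'text': 'gonna'},          # multi-word token: no POS fields
    {'id': 5, 'text': 'gon', 'xpos': 'VBG'},  # hypothetical tag
    {'id': 6, 'text': 'na', 'xpos': 'TO'},    # hypothetical tag
]


def word_xpos(entries):
    """Return (text, xpos) pairs, skipping multi-word token entries."""
    pairs = []
    for entry in entries:
        # Multi-word tokens carry a tuple ID such as (5, 6); words carry an int.
        if not isinstance(entry['id'], int):
            continue
        pairs.append((entry['text'], entry.get('xpos')))
    return pairs


print(word_xpos(entries))
```

With a real pipeline, iterating over doc.sentences[i].words should avoid the multi-word token entries altogether, so no ID check is needed there.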

Going through the English words ending with "nna", here are some broken examples:

My favorite actress is Joanna Lumley
I used a burnt sienna crayon to color in the picture
A large goanna bit my arm
Someone I know wears a bandanna every time he goes dancing
Legend tells us belladonna killed Socrates
Ms. Bumbry was never afraid to inhabit the primadonna role offstage
Unlike most fish, channa can leave the water for a short period of time
Apparently Madonna loathes hydrangeas
The savanna has many lions
I got a henna tattoo of my Chinese name
The Israelites survived on manna during the Exodus
That restaurant has excellent panna cotta
Nanna is an Icelandic singer

Correctly not split:

I accidentally touched Jennifer's antenna and she accused me of sexual harassment
The common platanna is invasive in the US
The music at the feiseanna was energizing
Gloria, hosanna in excelsis
The Duenna is an opera in three acts
I have not read Anna Karenina
A pinna is a fern leaf
One set of Islamic traditions is the sunna
You were supposed to take ONE senna, not ten.  Now who's going to clean up all this ...

Correctly split, but the lemmatizer gets it wrong (presumably no training data available):

Dinna light that candle!

should split to "do not"

Then there's the ambiguous case of canna, which is either a lily or Scotty saying "I canna change the laws of physics!"

Heh, THREE Star Trek references in one github response. I'm becoming more efficient!

gabriellestein · 2024-03-13T10:17:27Z

Thank you for the quick response. That fixed my issue!
if not isinstance(word['id'], int):
I'm not too concerned about incorrect lemmatizations for my application, but thank you for updating the training set; that will surely help someone down the road!
I wish I had a clever Star Trek response but embarrassingly I’ve never seen a Star Trek movie. So just imagine I said something very witty.

AngledLuffa · 2024-03-13T16:51:49Z

Without knowing your particular application, I would be surprised if it's happy with Johanna being broken into two pieces and then possibly not given an NNP tag. However, updating the training set with a few examples of -nna makes it not split for most of the cases listed above. Weirdly, it still splits for henna...

AngledLuffa · 2024-03-13T22:22:18Z

Alright, I added a few more sentences with henna to the training data until it stopped splitting that word for no reason. The new models should be automatically downloaded by v1.8.1

>>> print("{:C}".format(pipe("Johanna said she's gonna get a henna tattoo")))
# text = Johanna said she's gonna get a henna tattoo
# sent_id = 0
1       Johanna _       _       _       _       0       _       _       start_char=0|end_char=7
2       said    _       _       _       _       1       _       _       start_char=8|end_char=12
3-4     she's   _       _       _       _       _       _       _       start_char=13|end_char=18
3       she     _       _       _       _       2       _       _       start_char=13|end_char=16
4       's      _       _       _       _       3       _       _       start_char=16|end_char=18
5-6     gonna   _       _       _       _       _       _       _       start_char=19|end_char=24
5       gon     _       _       _       _       4       _       _       start_char=19|end_char=22
6       na      _       _       _       _       5       _       _       start_char=22|end_char=24
7       get     _       _       _       _       6       _       _       start_char=25|end_char=28
8       a       _       _       _       _       7       _       _       start_char=29|end_char=30
9       henna   _       _       _       _       8       _       _       start_char=31|end_char=36
10      tattoo  _       _       _       _       9       _       _       start_char=37|end_char=43|SpaceAfter=No
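In the CoNLL-U style output above, multi-word tokens show up with a range ID such as 5-6, while the words themselves have single integer IDs. The following is a rough sketch of separating the two when reading such text directly. The `split_rows` helper is a hypothetical name, the sample is a handful of lines copied from the output above with whitespace as the column separator, and it assumes there are no empty nodes (IDs like 8.1).

```python
# A few lines copied from the output above; columns are whitespace-separated here.
CONLLU = """\
# text = Johanna said she's gonna get a henna tattoo
# sent_id = 0
1 Johanna _ _ _ _ 0 _ _ start_char=0|end_char=7
5-6 gonna _ _ _ _ _ _ _ start_char=19|end_char=24
5 gon _ _ _ _ 4 _ _ start_char=19|end_char=22
6 na _ _ _ _ 5 _ _ start_char=22|end_char=24
9 henna _ _ _ _ 8 _ _ start_char=31|end_char=36
"""


def split_rows(conllu):
    """Split rows into (words, multi-word tokens) based on the ID column."""
    words, mwts = [], []
    for line in conllu.splitlines():
        # Skip blank lines and comment lines such as "# text = ...".
        if not line.strip() or line.startswith('#'):
            continue
        fields = line.split()
        token_id, form = fields[0], fields[1]
        if '-' in token_id:
            mwts.append((token_id, form))        # range ID: multi-word token
        else:
            words.append((int(token_id), form))  # single ID: an actual word
    return words, mwts


words, mwts = split_rows(CONLLU)
print(words)
print(mwts)
```

Here henna and Johanna land in the word list as single entries, while gonna is the only range row, matching the retrained tokenizer output shown above.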

AngledLuffa · 2024-04-20T18:58:49Z

This is now part of the 1.8.2 release

gabriellestein added the question label Mar 13, 2024

AngledLuffa added a commit to stanfordnlp/handparsed-treebank that referenced this issue Mar 13, 2024

Add several example sentences for words which end in -nna. stanfordnl…

2c48d40

…p/stanza#1366

AngledLuffa mentioned this issue Mar 13, 2024

WARNING: Can not find mwt: default from official model list. Ignoring it. #297

Closed

AngledLuffa added the fixed on dev label Mar 13, 2024

AngledLuffa closed this as completed Apr 20, 2024
