Latin default package doesn't usually lemmatize words starting with a capital letter #1330

pseudomonas · 2024-01-12T14:25:01Z

The default Latin package (ITTB) usually fails to lemmatize words that start with a capital letter. This happens whether the word is a proper noun that is normally capitalised (e.g. "Iacobi"), a common word that happens to be capitalised, or a word capitalised out of devotion (e.g. "Deo"). The problem seems systematic, though not universal: in the example below, "Erat" is still lemmatized to "sum". I have not done any digging into what might provoke this behaviour.

To Reproduce
see code below

Environment (please complete the following information):

OS: Ubuntu
Python version: Conda Python 3.10.9
Stanza version: 1.6.1

import stanza

# Default Latin package (ITTB): tokenization, POS tagging, and lemmatization
latindefault = stanza.Pipeline('la', processors='tokenize,pos,lemma')

sent = "Quod Erat Demonstrandum"

print(latindefault(sent))

#### Correctly diagnoses parts of speech; does not lemmatize.
# {
#   "id": 3,
#   "text": "Demonstrandum",
#   "lemma": "Demonstrandum",
#   "upos": "VERB",
#   "xpos": "J2|modO|grp1|casA|gen3",
#   "feats": "Aspect=Prosp|Case=Nom|Gender=Neut|InflClass=LatA|InflClass[nominal]=IndEurO|Number=Sing|VerbForm=Part|Voice=Pass",
#   "start_char": 10,
#   "end_char": 23
# }

print(latindefault(sent.lower()))
#### Correctly diagnoses parts of speech and lemmatizes.

# {
#   "id": 3,
#   "text": "demonstrandum",
#   "lemma": "demonstro",
#   "upos": "VERB",
#   "xpos": "J2|modO|grp1|casA|gen3",
#   "feats": "Aspect=Prosp|Case=Nom|Gender=Neut|InflClass=LatA|InflClass[nominal]=IndEurO|Number=Sing|VerbForm=Part|Voice=Pass",
#   "start_char": 10,
#   "end_char": 23
# }

AngledLuffa · 2024-01-14T06:07:05Z

The entire treebank is in lowercase letters. I could imagine adding a feature where, if the treebank is >99% lowercase, the model always lowercases its input.

Interestingly, the POS model already lowercases words before looking up the word vectors, which is why it does not fail horribly when fed capitals.
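
To make the ">99% lowercase" idea concrete, here is a minimal sketch of such a check. The names lowercase_fraction and should_lowercase and the way words are counted are assumptions for illustration only; this is not Stanza's implementation.

```python
def lowercase_fraction(words):
    """Fraction of cased words that are entirely lowercase.

    Hypothetical helper, not part of Stanza.
    """
    # Skip tokens with no cased characters (punctuation, digits),
    # since they say nothing about the casing of the treebank.
    cased = [w for w in words if w.lower() != w.upper()]
    if not cased:
        # No evidence either way, so report 0.0 and leave casing alone.
        return 0.0
    return sum(w == w.lower() for w in cased) / len(cased)


def should_lowercase(words, threshold=0.99):
    """True if more than `threshold` of the cased words are lowercase."""
    return lowercase_fraction(words) > threshold
```

With a check like this, a lemmatizer could decide at training time whether to lowercase all input, rather than relying on a hard-coded list of treebanks.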

pseudomonas · 2024-01-14T16:00:41Z

I'd expect it to behave like an uncased model and it's not a huge faff for me to just convert everything to lower-case before processing it. It just seemed like an unfortunate quirk.
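
For anyone else hitting this, the lowercase-first workaround could be sketched as below. The helper name lemmatize_lowercased is made up for this example; it assumes the pipeline returns a document with sentences, words, lemma, start_char and end_char, as in the output shown above, and it uses the character offsets to recover the original capitalised surface forms.

```python
def lemmatize_lowercased(nlp, text):
    """Run `nlp` on lowercased text and return (original form, lemma) pairs.

    Hypothetical helper: `nlp` is any callable that behaves like a
    Stanza Pipeline with the tokenize, pos and lemma processors.
    """
    lowered = text.lower()
    # The offsets only map back onto the original text if lowercasing
    # kept the length unchanged (true for plain Latin-alphabet text).
    if len(lowered) != len(text):
        raise ValueError("lowercasing changed the text length")
    doc = nlp(lowered)
    pairs = []
    for sentence in doc.sentences:
        for word in sentence.words:
            if word.start_char is None or word.end_char is None:
                # No offsets available: fall back to the lowercased form.
                surface = word.text
            else:
                surface = text[word.start_char:word.end_char]
            pairs.append((surface, word.lemma))
    return pairs


# Example use (requires the Latin models to be downloaded):
#   import stanza
#   nlp = stanza.Pipeline('la', processors='tokenize,pos,lemma')
#   lemmatize_lowercased(nlp, "Quod Erat Demonstrandum")
```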

AngledLuffa · 2024-01-14T20:06:14Z

Just to verify, are these the lemmas you want?

qui sum demonstro

… data (or a user flag) requested it
Testing additions:
Add a basic unit test of the all_lowercase function
Add a test of the caseless lemmatizer in the Pipeline
Test that the Latin ITTB lemmatizer is marked as caseless. Check that the results for capitalized text is as expected
Addresses #1330

AngledLuffa · 2024-01-14T21:27:02Z

If you try the lowercase_lemmas branch, the la_ittb lemmatizer will now automatically treat all text as if it were lowercased. I haven't done anything with the tokenizer or POS yet, though. Have you noticed the tokenizer behaving badly with capitalized letters?

AngledLuffa · 2024-01-19T08:04:55Z

Any thoughts on this fix?

AngledLuffa · 2024-03-03T21:47:17Z

The lemmatizer now trains a caseless version of itself if all of the training data is caseless, as proposed in the above PR. The 1.8.1 version of the Latin lemmatizer uses that feature, so the lemmatizer gives the same output for any capitalization variation of "quod erat demonstrandum".

POS and depparse already use caseless versions of the word embeddings, so casing has much less impact on those processors.

Please let us know if this resolves the issue.
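
One way to check this locally is sketched below. The helper names lemmas and is_case_invariant are made up for this example, and the comparison assumes the tokenizer splits every capitalization variant the same way, which has not been verified in this thread.

```python
def lemmas(nlp, text):
    """Return the list of lemmas that `nlp` produces for `text`."""
    return [word.lemma for sentence in nlp(text).sentences for word in sentence.words]


def is_case_invariant(nlp, text):
    """True if several capitalization variants of `text` get identical lemmas.

    Hypothetical helper: `nlp` is any callable that behaves like a
    Stanza Pipeline with the tokenize, pos and lemma processors.
    """
    variants = [text, text.lower(), text.upper(), text.title()]
    results = [lemmas(nlp, variant) for variant in variants]
    return all(result == results[0] for result in results)


# Example use (requires the Latin models to be downloaded):
#   import stanza
#   nlp = stanza.Pipeline('la', processors='tokenize,pos,lemma')
#   print(is_case_invariant(nlp, "quod erat demonstrandum"))
```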

pseudomonas added the bug label Jan 12, 2024

AngledLuffa mentioned this issue Jan 14, 2024

Potentially lowercase the data in a lemmatizer if all of the training… #1331

Merged
