Latin default package doesn't usually lemmatize words starting with a capital letter #1330

Open
pseudomonas opened this issue Jan 12, 2024 · 6 comments

@pseudomonas

The default Latin package (ITTB) usually fails to lemmatize words that start with a capital letter. This seems to be the case whether the word is a proper noun that is normally capitalised (e.g. "Iacobi"), a common word that is capitalised unusually, or a word capitalised out of devotion (e.g. "Deo"). The problem seems systematic but not universal: in the example below, "Erat" is lemmatized to "sum". I have not done any digging into what might provoke this behaviour.

To Reproduce
See the code below.

Environment (please complete the following information):

  • OS: Ubuntu
  • Python version: Conda Python 3.10.9
  • Stanza version: 1.6.1

import stanza

# Default Latin package (ITTB), with only the processors needed for lemmas.
latindefault = stanza.Pipeline('la', processors='tokenize,pos,lemma')

sent = "Quod Erat Demonstrandum"

print(latindefault(sent))

#### Correctly diagnoses parts of speech; does not lemmatize.
 # {
 #      "id": 3,
 #      "text": "Demonstrandum",
 #      "lemma": "Demonstrandum",
 #      "upos": "VERB",
 #      "xpos": "J2|modO|grp1|casA|gen3",
 #      "feats": "Aspect=Prosp|Case=Nom|Gender=Neut|InflClass=LatA|InflClass[nominal]=IndEurO|Number=Sing|VerbForm=Part|Voice=Pass",
 #      "start_char": 10,
 #      "end_char": 23
 #    }

print(latindefault(sent.lower()))
#### Correctly diagnoses parts of speech and lemmatizes.

# {
#       "id": 3,
#       "text": "demonstrandum",
#       "lemma": "demonstro",
#       "upos": "VERB",
#       "xpos": "J2|modO|grp1|casA|gen3",
#       "feats": "Aspect=Prosp|Case=Nom|Gender=Neut|InflClass=LatA|InflClass[nominal]=IndEurO|Number=Sing|VerbForm=Part|Voice=Pass",
#       "start_char": 10,
#       "end_char": 23
#     }
@AngledLuffa
Collaborator

The entire treebank is lowercase letters. I could imagine adding a feature where if the treebank is >99% lowercase, the model always lowercases everything.

Interestingly, the POS model already lowercases before using the word vectors, hence not failing horribly when feeding the model capitals.
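
The ">99% lowercase" idea above can be sketched in plain Python. This is only an illustration of the heuristic, not Stanza's implementation: the function names `lowercase_fraction`, `should_train_caseless`, and `normalize_for_lemmatizer`, and the default threshold of 0.99, are assumptions made for this example.

```python
def lowercase_fraction(words):
    """Return the fraction of cased words that are entirely lowercase.

    Tokens with no cased characters (punctuation, digits) are ignored.
    """
    cased = [w for w in words if w.lower() != w.upper()]
    if not cased:
        # Nothing cased to judge by; treat the data as lowercase.
        return 1.0
    return sum(w == w.lower() for w in cased) / len(cased)


def should_train_caseless(words, threshold=0.99):
    """Hypothetical check: is the training data (almost) all lowercase?"""
    return lowercase_fraction(words) > threshold


def normalize_for_lemmatizer(word, caseless):
    """Lowercase the input word only when the model is marked caseless."""
    return word.lower() if caseless else word


# Toy "treebank" standing in for real training data.
treebank_words = ["quod", "erat", "demonstrandum", ".", "deo"]
caseless = should_train_caseless(treebank_words)
print(caseless, normalize_for_lemmatizer("Demonstrandum", caseless))
```

With a flag like this stored alongside the model, capitalised input would be folded to lowercase before lookup, so it would match what the lemmatizer saw in training.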

@pseudomonas
Author

I'd expect it to behave like an uncased model and it's not a huge faff for me to just convert everything to lower-case before processing it. It just seemed like an unfortunate quirk.
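
The lowercasing workaround mentioned above might look like the sketch below. The helper `lemmatize_lowercased` is a hypothetical name, not part of Stanza. It relies only on the `sentences`, `words`, `lemma`, `text`, `start_char`, and `end_char` attributes of the pipeline output, and it assumes that lowercasing does not change the length of the text, so character offsets still point into the original string.

```python
def lemmatize_lowercased(nlp, text):
    """Run `nlp` on lowercased text; return (original surface form, lemma) pairs.

    `nlp` is any callable returning a Stanza-style document, for example
    stanza.Pipeline('la', processors='tokenize,pos,lemma').
    """
    lowered = text.lower()
    if len(lowered) != len(text):
        # Some Unicode characters change length when lowercased;
        # the offsets would then no longer line up with `text`.
        raise ValueError("lowercasing changed the text length")
    doc = nlp(lowered)
    pairs = []
    for sentence in doc.sentences:
        for word in sentence.words:
            if word.start_char is None or word.end_char is None:
                # No offsets available; fall back to the lowercased text.
                surface = word.text
            else:
                surface = text[word.start_char:word.end_char]
            pairs.append((surface, word.lemma))
    return pairs


# Usage (requires the Latin models to be downloaded):
#   import stanza
#   nlp = stanza.Pipeline('la', processors='tokenize,pos,lemma')
#   lemmatize_lowercased(nlp, "Quod Erat Demonstrandum")
```

Recovering the surface form from the offsets keeps the original capitalisation available for display while the lemmatizer only ever sees lowercase input.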

@AngledLuffa
Collaborator

Just to verify, the lemmas you want are:

qui sum demonstro

AngledLuffa added a commit that referenced this issue Jan 14, 2024
… data (or a user flag) requested it

Testing additions:

Add a basic unit test of the all_lowercase function
Add a test of the caseless lemmatizer in the Pipeline
Test that the Latin ITTB lemmatizer is marked as caseless.  Check that the results for capitalized text are as expected

Addresses #1330
@AngledLuffa
Collaborator

If you try the lowercase_lemmas branch, the la_ittb lemmatizer will now automatically treat all text as if it were lowercased. I haven't done anything with the tokenizer or POS yet, though. Have you noticed the tokenizer behaving badly with capitalized letters?

@AngledLuffa
Collaborator

Any thoughts on this fix?

AngledLuffa added a commit that referenced this issue Feb 3, 2024
@AngledLuffa
Collaborator

The lemmatizer now trains a caseless version of itself if all of the training data is caseless, as proposed in the above PR. The 1.8.1 version of the Latin lemmatizer uses that feature, so the lemmatizer gives the same output for any capitalization variation of "quod erat demonstrandum".
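
One way to check that claim locally is to compare the lemmas across several capitalisations of the same sentence. The sketch below is illustrative only: `lemmas` and `lemmas_are_case_invariant` are hypothetical helper names, and the only Stanza behaviour assumed is that the pipeline output exposes `sentences`, `words`, and `lemma`.

```python
def lemmas(nlp, text):
    """Return the list of lemmas that `nlp` produces for `text`."""
    return [word.lemma for sentence in nlp(text).sentences for word in sentence.words]


def lemmas_are_case_invariant(nlp, text):
    """True if `nlp` yields the same lemmas for several casings of `text`."""
    variants = [text, text.lower(), text.upper(), text.title()]
    results = [lemmas(nlp, variant) for variant in variants]
    return all(result == results[0] for result in results)


# Usage (requires a Stanza version with the caseless Latin lemmatizer):
#   import stanza
#   nlp = stanza.Pipeline('la', processors='tokenize,pos,lemma')
#   print(lemmas_are_case_invariant(nlp, "Quod Erat Demonstrandum"))
```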

POS and depparse already use caseless versions of the word embeddings, so casing has much less impact on their output.

Please let us know if this satisfies the issue.
