Latin default package doesn't usually lemmatize words starting with a capital letter #1330
Comments
The entire treebank is in lowercase letters. I could imagine adding a feature where, if the treebank is >99% lowercase, the model always lowercases everything. Interestingly, the POS model already lowercases before using the word vectors, which is why it does not fail horribly when fed capitals.
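As a rough illustration of the ">99% lowercase" idea above, the sketch below measures how much of a word list is lowercase and decides whether to normalize case. The names `lowercase_fraction`, `should_lowercase`, and `normalize` are hypothetical helpers made up for this example, not functions from the library, and the 0.99 threshold is simply the figure mentioned in the comment.

```python
def lowercase_fraction(words):
    """Fraction of cased words that are entirely lowercase.

    Words with no cased characters (punctuation, digits) are ignored.
    Hypothetical helper for illustration only.
    """
    cased = [w for w in words if w.lower() != w.upper()]
    if not cased:
        return 0.0
    return sum(1 for w in cased if w == w.lower()) / len(cased)


def should_lowercase(words, threshold=0.99):
    """True if the training words are lowercase often enough to go caseless."""
    return lowercase_fraction(words) > threshold


def normalize(word, caseless):
    """Lowercase the input word only when the model was marked caseless."""
    return word.lower() if caseless else word
```

With a check like this, a model trained on an all-lowercase treebank would see "Deo" as "deo" at prediction time, while a model trained on mixed-case data would be left alone.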
I'd expect it to behave like an uncased model and it's not a huge faff for me to just convert everything to lower-case before processing it. It just seemed like an unfortunate quirk.
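The lower-casing workaround described above might look something like the sketch below. It assumes `nlp` is an already constructed pipeline object that is callable on a string and returns a document exposing `sentences`, `words`, `text`, and `lemma`; the wrapper name `lemmatize_lowercased` is invented for this example.

```python
def lemmatize_lowercased(nlp, text):
    """Lowercase the text before running the pipeline, then collect lemmas.

    `nlp` is assumed to be a callable pipeline, for example one built with
    stanza.Pipeline("la", processors="tokenize,pos,lemma"); that construction
    is an assumption here and is not run by this sketch.
    """
    doc = nlp(text.lower())
    return [
        (word.text, word.lemma)
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

Note that this returns the lowercased surface forms, so the original capitalisation would have to be kept separately if it matters downstream.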
Just to verify, what you want is the lemmas
… data (or a user flag) requested it
Testing additions:
Add a basic unit test of the all_lowercase function
Add a test of the caseless lemmatizer in the Pipeline
Test that the Latin ITTB lemmatizer is marked as caseless. Check that the results for capitalized text are as expected
Addresses #1330
If you try the
Any thoughts on this fix?
The lemmatizer now trains a caseless version of itself if all of the training data is caseless, as proposed in the above PR. The 1.8.1 version of the Latin lemmatizer uses that feature, so the lemmatizer gives the same output for any capitalization variation of "quod erat demonstrandum". POS and depparse already use caseless versions of the word embeddings, so casing has much less impact on their output for those words. Please let us know if this resolves the issue.
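One way to check the behaviour described above is to run several capitalization variants of the same phrase through the lemmatizer and compare the results. The sketch below assumes `lemmatize` is any function that maps a string to a list of lemmas (for instance a thin wrapper around a pipeline); both helper names are hypothetical.

```python
def casing_variants(text):
    """A few capitalization variants of the same text."""
    return [
        text,
        text.lower(),
        text.upper(),
        text.title(),
        text.capitalize(),
    ]


def is_case_invariant(lemmatize, text):
    """True if `lemmatize` returns the same result for every casing variant.

    `lemmatize` is an assumed callable: str -> list of lemmas.
    """
    results = [lemmatize(variant) for variant in casing_variants(text)]
    return all(result == results[0] for result in results)
```

For a caseless lemmatizer, `is_case_invariant(lemmatize, "quod erat demonstrandum")` should hold; with the earlier behaviour reported in this issue it would be expected to fail.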
Latin default package (ITTB) doesn't usually lemmatize words starting with a capital letter. This seems to be the case whether the word is a proper noun that is normally capitalised (eg "Iacobi"), a common word that is unusually capitalised, or a word capitalised out of devotion (eg "Deo"). This seems to be a systematic problem, though in the example below "Erat" is lemmatized to "sum". I have not done any digging into what might provoke this behaviour.
To Reproduce
see code below
Environment (please complete the following information):