Improve implementation of disambiguated lemmas #142

Open
jacobwegner opened this issue May 23, 2023 · 6 comments

Comments

@jacobwegner
Contributor

See our LSJ entries for ἄωρος in urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:12.89:

image

https://beyond-translation.perseus.org/reader/urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:12.89?mode=dictionary-entries&entryUrn=urn%3Acite2%3Ascafife-viewer%3Adictionary-entries.atlas_v1%3Alsj-18938

@jacobwegner
Contributor Author

@jtauber:

We had introduced a "normalized" version of the entry headword:

image

If I use the "display" value instead of the normalized value, things get cluttered:

image

I can make use of the "display" version when choosing a "sibling":

image

Any thoughts?

I'll get a deploy done soon so you can play around with this some more...

@jacobwegner
Contributor Author

(To review character stripping)

@jacobwegner
Contributor Author

@jtauber: Here is a better explanation of what is going on with δελφῖνάς in Odyssey 12.
If you click through to load this query:

https://tinyurl.com/gh-bt-142-sample

You can see that headwordNormalizedStripped for LSJ, Cunliffe and Cambridge is stored as δελφις.

headword is provided directly from each lexicon.

headwordNormalized is computed in normalized_no_digits:

  • get the NFD normalized form
  • get the case-folded NFKC form of the NFD normalized form
  • strip digits (the digits disambiguate homographs, e.g. ἄωρος1 vs ἄωρος2)
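
The three steps above could be sketched in Python roughly as follows. This is an illustration based only on the description in this comment, not the project's actual normalized_no_digits; applying case folding after NFKC and using `\d` to match the digits are assumptions.

```python
import re
import unicodedata


def normalized_no_digits(value: str) -> str:
    """Sketch of the headwordNormalized computation (assumed details)."""
    # Step 1: NFD, so accents become separate combining characters.
    nfd = unicodedata.normalize("NFD", value)
    # Step 2: NFKC of the NFD form, then case folding (ordering is an assumption).
    folded = unicodedata.normalize("NFKC", nfd).casefold()
    # Step 3: drop the digits used to number homographs.
    return re.sub(r"\d", "", folded)


# Numbered homographs collapse to the same key...
same_key = normalized_no_digits("ἄωρος1") == normalized_no_digits("ἄωρος2")
# ...while words that differ only in accent position stay distinct.
accent_distinct = normalized_no_digits("θεά1") != normalized_no_digits("θέα2")
```

One caveat: Python's `str.casefold()` also folds final sigma (ς) to σ, so a sketch like this may not reproduce the stored values character for character.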

headwordNormalizedStripped is computed in normalize_and_strip_marks:

  • get the NFD normalized form
  • remove characters matching UNICODE_MARK_CATEGORY_REGEX
  • get the case-folded NFKC form of the NFD normalized, mark-stripped value
  • does not strip digits (so that θεά1 and θέα2 in LSJ remain distinct)
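
A comparable Python sketch of the mark-stripping variant is below. It is an assumption-laden illustration, not the project's normalize_and_strip_marks: the real UNICODE_MARK_CATEGORY_REGEX is not shown in this thread, so the sketch stands in for it with a check on the Unicode general category (anything starting with "M"), and the helper name `strip_marks` is hypothetical.

```python
import unicodedata


def strip_marks(value: str) -> str:
    """Hypothetical stand-in for UNICODE_MARK_CATEGORY_REGEX: drop Mn/Mc/Me."""
    return "".join(
        ch for ch in value if not unicodedata.category(ch).startswith("M")
    )


def normalize_and_strip_marks(value: str) -> str:
    """Sketch of the headwordNormalizedStripped computation (assumed details)."""
    # NFD first, so every accent is a separate combining mark.
    nfd = unicodedata.normalize("NFD", value)
    # Remove the combining marks, leaving base letters and digits.
    stripped = strip_marks(nfd)
    # NFKC then case folding; digits are deliberately kept.
    return unicodedata.normalize("NFKC", stripped).casefold()


# Without digits, the two accentuations collapse to one key.
collapsed = normalize_and_strip_marks("θεά") == normalize_and_strip_marks("θέα")
# With the disambiguating digits kept, the LSJ entries stay apart.
kept_apart = normalize_and_strip_marks("θεά1") != normalize_and_strip_marks("θέα2")
```

Keeping the digits is what lets the stripped form still tell the two LSJ entries apart once their accents are gone.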

Beyond Translation is currently using headwordNormalized for the lookups; I believe this was done to avoid the exact kind of error where we might resolve both θεά and θέα within LSJ.

We apply the same normalization used for headwordNormalized to the search term a user enters on the frontend.

So, back to δελφῖνάς in Od. 12:

  • The lemma we're using is δελφίς
  • The headwordNormalized form for the Cambridge Greek Lexicon is δελφῑ́ς
  • If we could make the headword in the file you're providing for Cambridge Greek Lexicon δελφίς or δελφίς, the headwordNormalized would then become δελφίς
  • headwordDisplay could continue to have δελφῑ́ς
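
To make the mismatch concrete, here is a small Python sketch. It assumes the lemma uses iota with acute and the Cambridge headword uses iota with macron and acute; the entry dictionaries and the `matches` helper are hypothetical and only mirror the headword / headwordDisplay naming used in this thread.

```python
import re
import unicodedata


def normalized_no_digits(value: str) -> str:
    """Assumed equivalent of headwordNormalized: NFD, NFKC + casefold, no digits."""
    nfd = unicodedata.normalize("NFD", value)
    folded = unicodedata.normalize("NFKC", nfd).casefold()
    return re.sub(r"\d", "", folded)


# Lemma used by the reader (assumed code points: iota with acute).
lemma = "δελφ\u03af\u03c2"
# Cambridge headword as supplied (assumed: iota + macron + acute).
cambridge_headword = "δελφ\u03b9\u0304\u0301\u03c2"

# Hypothetical entry records: today's data vs. the proposed split.
current = {"headword": cambridge_headword}
proposed = {"headword": lemma, "headwordDisplay": cambridge_headword}


def matches(entry: dict, term: str) -> bool:
    """Compare an entry and a search term on their normalized forms."""
    return normalized_no_digits(entry["headword"]) == normalized_no_digits(term)


# The macron survives normalization, so the current headword does not match.
current_matches = matches(current, lemma)
# With a plain headword for lookups, the match succeeds and the
# macron form is still available for display.
proposed_matches = matches(proposed, lemma)
```

Under these assumptions, only the lookup headword changes; the macron form shown to readers is untouched.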

Does that make sense to you? I have some additional things I'd like to document around this, but I think having this new headwordDisplay option will be a big help going forward.

@jacobwegner
Contributor Author

(We should review this for Cambridge and Lexicon Thucydideum, and replicate what the "word study tool" does for lookups: https://www.perseus.tufts.edu/hopper/morph?l=%CF%84%CE%B1%CF%81%CE%AC%CF%83%CF%83%CF%89&la=greek)
