Replace [Ss]commaaccent with [Ss]cedilla as required for Turkic languages #72

alerque · 2022-02-15T12:14:12Z

Closes #71

alerque · 2022-02-15T12:21:39Z

This is much more messed up than this PR corrects. The Romanian derivations seem to have the correct alternates, but all the Turkic ones are messed up in one way or another.

MrBrezina · 2022-02-17T15:04:01Z

Thank you, Caleb. We will merge this in and set up a new issue (and investigation) in the other Turkic languages. cc: @kontur

MrBrezina · 2022-02-17T15:09:05Z

@alerque you have actually corrected this in the Latin Plus data, which was there only for reference. The database is in lib/hyperglot/hyperglot.yaml. I can edit it myself, no problem.

kontur · 2022-02-17T15:13:47Z

Happy for a PR on this, but let's not change the original reference data; if anything, we can add a corrected version of the latin plus data set, if it has been addressed there.

Presumably this applies to Turkish and Ottoman Turkish — any others?

alerque · 2022-02-17T15:25:05Z

Yes, others. I started going down the rabbit hole farther after opening this and edited some of the YAML files. I'll push that commit here just so it isn't lost, but it is incomplete.

I'd be happy to make the fix for real if you let me know which data set is canonical vs. which are derived.

At the very least Turkish, Ottoman Turkish, Gagavuz, Kurdish (Latin), and Turkmen are affected, but that isn't an exhaustive search. I stopped looking when I realized that there were so many and I didn't know what data I was supposed to be editing.

kontur · 2022-02-17T15:29:41Z

hyperglot.yaml is the source of truth :)

If you have the package installed you can also run hyperglot-validate (checks the data in the YAML) and hyperglot-save (enforces some sorting and, for example, mark-related things).

Happy to answer questions or give pointers. We appreciate your input 👍

alerque · 2022-02-17T16:25:06Z

In that case let me go through the YAML file a bit more, clean up the bits I'm sure of, and maybe comment on the ones I suspect. I'll force-push and tag for review when that's done.

alerque · 2022-02-17T18:00:14Z

I have updated this PR to be the two bits I'm pretty confident on.

For Ottoman Turkish the glyph list is just completely foobared as far as I can tell. To the best of my knowledge there are three common ways to Romanize Ottoman Turkish: using the modern Turkish orthography, using the IJMES transliteration system, or using the ALA-LC rules shown in the chart on Wikipedia.

The glyph list in the hyperglot data set is a jumble of all three, with some glyphs from each. This PR should complete the set for using modern Turkish (although it might be better to accomplish that with an include rather than listing them again). The list also includes fragments from the other two schemes, but neither is complete. I don't know what the goal is here. List all of them from all popular competing Romanization schemes? Anyway, I left that for another commit or PR so that this one can get reviewed and moved along, since the error in modern Turkish is pretty bad.

kontur · 2022-02-18T11:24:36Z

Thanks. I think for both Turkish and Ottoman Turkish (and probably others we can identify) the requirement for base should be the current established norm, i.e. the cedilla variants.

I think it would be useful to include the comma-variants as auxiliary to do justice to the reality that those may be encountered when typesetting those languages. At the same time, this would denote the cedilla-variants as preferred. I can add this when merging the PR next week, or feel free to add this still.
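To make the distinction concrete, here is a minimal sketch of a check for the rule proposed above (cedilla variants in base, comma variants at most in auxiliary). It is not part of Hyperglot: the helper names and the space-separated sample lists are assumptions for illustration only, and the only facts relied on are the Unicode code points themselves.

```python
import unicodedata

# Code points as defined by the Unicode standard.
S_CEDILLA = {"\u015E", "\u015F"}      # Ş ş (S with cedilla)
S_COMMA_BELOW = {"\u0218", "\u0219"}  # Ș ș (S with comma below)


def split_s_variants(chars):
    """Hypothetical helper: return (cedilla, comma-below) S variants found in chars."""
    chars = set(chars)
    return sorted(chars & S_CEDILLA), sorted(chars & S_COMMA_BELOW)


def check_turkic_base(base):
    """Hypothetical check: list problems with a base character list for a Turkic language."""
    cedilla, comma = split_s_variants(base)
    problems = [
        f"{c} ({unicodedata.name(c)}) belongs in auxiliary, not base"
        for c in comma
    ]
    if len(cedilla) < 2:
        problems.append("base is missing S with cedilla (U+015E / U+015F)")
    return problems


if __name__ == "__main__":
    # Illustrative fragments only, not the real hyperglot.yaml entries.
    for sample in ("a b c ş Ş", "a b c ș Ș"):
        print(repr(sample), "->", check_turkic_base(sample) or "ok")
```

The two pairs look nearly identical in many fonts but are distinct characters with different decompositions (s + U+0327 versus s + U+0326), which is why a list can silently carry the wrong one.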

The Ottoman transliteration is, or should be, based on https://www.cambridge.org/core/journals/international-journal-of-middle-east-studies/information/author-resources/ijmes-translation-and-transliteration-guide; this should be added as an actual source in the YAML, not just as a note. I will cross-check against e.g. https://www.cambridge.org/core/services/aop-file-manager/file/57d83390f6ea5a022234b400/TransChart.pdf to un-foobar it, if it currently is.

kontur · 2022-03-01T19:10:35Z

Thanks again @alerque for the contribution. The changes to Turkish are now published in 0.3.8 as well as on the Hyperglot website. We'll review the other related languages bit by bit.

alerque marked this pull request as draft February 15, 2022 12:20

MrBrezina mentioned this pull request Feb 17, 2022

Issues with s-cedilla in several Turkic and related languages #73

Open

alerque added 2 commits February 17, 2022 20:31

Replace [Ss]commaaccent with [Ss]cedilla in Turkish

1a961e3

Add missing [Ss]cedilla to Ottoman Turkish

2771c6f

Required for Romanization to modern Turkish

alerque force-pushed the master branch from d72c19d to 2771c6f February 17, 2022 17:54

alerque marked this pull request as ready for review February 17, 2022 18:00

MrBrezina merged commit 2771c6f into rosettatype:master Feb 23, 2022
