Replace [Ss]commaaccent with [Ss]cedilla as required for Turkic languages #72

Merged
merged 2 commits into from
Feb 23, 2022

alerque
Contributor

@alerque alerque commented Feb 15, 2022

Closes #71

@alerque alerque marked this pull request as draft February 15, 2022 12:20
@alerque
Contributor Author

alerque commented Feb 15, 2022

This is much more messed up than this PR corrects. The Romanian derivations seem to have the correct alternates, but all the Turkic ones are messed up in one way or another.
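For readers unfamiliar with the two letter forms, here is a minimal Python sketch of the difference. It assumes the glyph names in the PR title follow the usual Adobe Glyph List mapping (Scedilla = U+015E, Scommaaccent = U+0218); the helper names `describe` and `combining_mark` are made up for this illustration and are not part of Hyperglot.

```python
import unicodedata

# Glyph names from the PR title mapped to code points
# (assumed to follow the Adobe Glyph List naming).
CEDILLA = {"Scedilla": "\u015E", "scedilla": "\u015F"}
COMMA_BELOW = {"Scommaaccent": "\u0218", "scommaaccent": "\u0219"}


def describe(char: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


def combining_mark(char: str) -> str:
    """Return the name of the last code point in the NFD form of char."""
    decomposed = unicodedata.normalize("NFD", char)
    return unicodedata.name(decomposed[-1])


for glyph_name, char in {**CEDILLA, **COMMA_BELOW}.items():
    print(glyph_name, describe(char), "->", combining_mark(char))
```

The two pairs are distinct code points with distinct combining marks, so a font or data set that lists one pair does not automatically cover the other.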

@MrBrezina
Member

Thank you, Caleb. We will merge this in and set up a new issue (and investigation) in the other Turkic languages. cc: @kontur

@MrBrezina
Member

@alerque you have actually corrected this in the Latin Plus data, which is there only for reference. The database is in lib/hyperglot/hyperglot.yaml. I can edit it myself, no problem.

@kontur
Contributor

kontur commented Feb 17, 2022

Happy for a PR on this, but let's not change the original reference data; if anything, we can add a corrected version of the latin plus data set, if it has been addressed there.

Presumably this applies to Turkish and Ottoman Turkish — any others?

@alerque
Contributor Author

alerque commented Feb 17, 2022

Yes, others. I started going down the rabbit hole farther after opening this and edited some of the YAML files. I'll push that commit here just so it isn't lost, but it is incomplete.

I'd be happy to make the fix for real if you let me know which data set is canonical vs. which are derived.

At the very least Turkish, Ottoman Turkish, Gagauz, Kurdish (Latin), and Turkmen are affected, but that isn't an exhaustive search. I stopped looking when I realized that there were so many and I didn't know what data I was supposed to be editing.
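A search like the one described above could be sketched roughly as follows. This is only an illustration: the dictionary is a toy stand-in with invented, partial entries (not copied from hyperglot.yaml), its flat name-to-characters layout is an assumption, and `languages_with_comma_below_s` is a hypothetical helper.

```python
# Comma-below S, the letter the PR title calls [Ss]commaaccent.
COMMA_BELOW_S = {"\u0218", "\u0219"}

# Toy data: language name -> space-separated characters.
# Invented and partial, for illustration only; NOT the real database.
toy_data = {
    "Romanian": "s \u0219 t",                 # comma below is correct here
    "Turkish (before fix)": "r s \u0219 t",   # comma below is the error
    "Turkish (after fix)": "r s \u015F t",    # cedilla, as required
}


def languages_with_comma_below_s(data):
    """List entries whose characters include a comma-below S."""
    return sorted(
        name
        for name, chars in data.items()
        if COMMA_BELOW_S & set(chars.split())
    )


print(languages_with_comma_below_s(toy_data))
```

The scan flags every entry containing the comma-below letters, Romanian included, so a human still has to decide which hits are errors.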

@kontur
Contributor

kontur commented Feb 17, 2022

hyperglot.yaml is the source of truth :)

If you have the package installed you can also run hyperglot-validate (checks the data in the yaml) and hyperglot-save (enforces some sorting and, for example, mark-related things).

Happy to answer questions or give pointers. We appreciate your input 👍

@alerque
Contributor Author

alerque commented Feb 17, 2022

In that case let me go through the YAML file a bit more and clean up the bits I'm sure of and maybe comment on some of the ones I suspect. I'll force-push and tag for review when that's done.

@alerque
Contributor Author

alerque commented Feb 17, 2022

I have updated this PR to be the two bits I'm pretty confident on.

For Ottoman Turkish the glyph list is just completely foobared as far as I can tell. To the best of my knowledge there are three common ways to Romanize Ottoman Turkish: using the modern Turkish orthography, using the IJMES transliteration system, or using the ALA-LC rules shown in the chart on Wikipedia.

The glyph list in the hyperglot data set is a jumble of all three, with some from each. This PR should complete the set for using modern Turkish (although it might be better to accomplish that with an include rather than listing them again). The list also includes fragments from the other two, but neither is complete. I don't know what the goal is here. List all of them from all popular competing Romanization schemes? Anyway, I left that for another commit or PR so that this one can get reviewed and moved along, since the error in modern Turkish is pretty bad.

@alerque alerque marked this pull request as ready for review February 17, 2022 18:00
@kontur
Contributor

kontur commented Feb 18, 2022

Thanks. I think for both Turkish and Ottoman Turkish (and probably others we can identify) the requirement for base should be the current established norm, so using the cedilla-variants.

I think it would be useful to include the comma-variants as auxiliary to do justice to the reality that those may be encountered when typesetting those languages. At the same time, this would denote the cedilla-variants as preferred. I can add this when merging the PR next week, or feel free to add this still.
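To make the base/auxiliary idea concrete, here is a rough Python sketch. It is not Hyperglot's actual checking logic: the dictionary layout, the `support_level` function, and its return strings are all assumptions for illustration, and only the S letters are listed.

```python
# Hypothetical orthography entry: cedilla variants required (base),
# comma-below variants tolerated as extras (auxiliary).
toy_orthography = {
    "base": "S s \u015E \u015F",
    "auxiliary": "\u0218 \u0219",
}


def support_level(font_chars, orthography):
    """Classify a character set against base and auxiliary lists."""
    base = set(orthography["base"].split())
    auxiliary = set(orthography.get("auxiliary", "").split())
    if not base <= font_chars:
        return "unsupported"
    if auxiliary <= font_chars:
        return "base and auxiliary"
    return "base"


cedilla_only = set("Ss\u015E\u015F")
comma_only = set("Ss\u0218\u0219")
print(support_level(cedilla_only, toy_orthography))
print(support_level(comma_only, toy_orthography))
```

Under this model a font with only the comma-below letters would not count as supporting the language, which matches the intent of treating the cedilla variants as preferred.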

The Ottoman transliteration is or should be based on https://www.cambridge.org/core/journals/international-journal-of-middle-east-studies/information/author-resources/ijmes-translation-and-transliteration-guide — this should be added as an actual source in the yaml, not just as a note. I will cross-check against e.g. https://www.cambridge.org/core/services/aop-file-manager/file/57d83390f6ea5a022234b400/TransChart.pdf to un-foobar it, if it currently is.

@MrBrezina MrBrezina merged commit 2771c6f into rosettatype:master Feb 23, 2022
@kontur
Contributor

kontur commented Mar 1, 2022

Thanks again @alerque for the contribution. The changes to Turkish are now published in 0.3.8 as well as on the Hyperglot website. We'll review the other related languages bit by bit.

Development

Successfully merging this pull request may close these issues.

Wrong required letter in Turkish/Türkçe
3 participants