Improvements to the language and writing direction detection #250

Merged

saschaleib
Contributor

This code changes the way `:lang` attributes are handled, allowing more flexibility, including an optional script specification, as defined in BCP 47.

The direction specification (e.g. `dir="rtl"`) now uses the language code as a default, but allows the script specification to override this when needed.

A side effect of this change is that the additional config file is no longer needed.
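
To illustrate the rule described above (language gives the default direction, a script specification overrides it), here is a minimal Python sketch. It is not the plugin's actual code: the function name `resolve_direction` and the three lookup sets are hypothetical and deliberately incomplete, and the real built-in lists are not shown in this thread.

```python
# Hypothetical, deliberately short lookup tables (assumptions for illustration).
RTL_LANGUAGES = {"ar", "he", "fa", "ur", "yi"}    # ISO 639 codes usually written RTL
RTL_SCRIPTS = {"Arab", "Hebr", "Syrc", "Thaa"}    # ISO 15924 codes of RTL scripts
LTR_SCRIPTS = {"Latn", "Cyrl", "Grek"}            # ISO 15924 codes of LTR scripts


def resolve_direction(language, script=None):
    """Return "rtl" or "ltr"; a recognised script wins over the language default."""
    if script:
        normalized = script.title()  # "latn" or "LATN" becomes "Latn"
        if normalized in RTL_SCRIPTS:
            return "rtl"
        if normalized in LTR_SCRIPTS:
            return "ltr"
    # No script given, or script unknown: fall back to the language default.
    return "rtl" if language.lower() in RTL_LANGUAGES else "ltr"
```

With these assumed tables, `resolve_direction("ar")` gives `"rtl"`, while `resolve_direction("ar", "Latn")` gives `"ltr"` because the script takes precedence.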

@saschaleib
Contributor Author

Hello, and thanks for this extremely useful plugin!

I have spent a bit of time trying to see how this works, as I was trying to improve compatibility of my template with this plugin. In this process, I noticed a couple of issues where I think I could contribute a bit ... So expect a couple more pull requests from me in the near future :-)

First off: I noticed that the code used to determine the dir attribute makes a couple of assumptions about languages that do not always hold. Most importantly, the text direction is not a property of a language, but of the script it is written in. The two usually line up, but there are many cases where they do not. Please allow me to explain this in a bit more detail:

Firstly, there are languages which use more than one writing system (the technical term is "digraphia"). For example, Serbian can be written either in Latin or in Cyrillic alphabet. Turkish switched the writing system in the 20th century from Arabic to Latin – but there are still many old texts that are written in Arabic. Kurdish can be written in either Arabic, Latin or Cyrillic, etc.

But there is more: if you read a transliteration of a non-Latin text, this is still the same language, but written in a different writing system. To illustrate this, the following are actually two examples that I found in my own wiki:

  • <span lang="grc">Οὐδὲν ἐξ οὐδενός</span> [<i lang="grc-Latn">oudén ex oudenós</i>]
  • <bdi dir="rtl" style="font-size:120%" lang="ar">الله</bdi> [<i lang="ar-Latn">Allāh</i>], and <bdi dir="rtl" lang="he">יהוה</bdi> [<i lang="he-Latn">Jahwe</i>]

This also shows the correct way to specify the language in this situation: the script is added as a four-letter ISO 15924 code directly after the ISO 639 language code, and before any further subtags such as the country.

At the moment, the language detection does not pass such codes through to the output, so that had to be changed as well. With this change it becomes possible to specify even very obscure language tags (Wikipedia has the beautiful example "he-IL-u-ca-hebrew-tz-jeruslm", which even specifies the time zone used, etc.).

The change also makes it possible to specify the region. Remember that just as en-GB is not the same as en-US, the same is true for many other languages, such as de-DE vs. de-AT or de-CH, …
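
As a rough illustration of how the script and region can be picked out of such tags, the following Python sketch splits a tag on hyphens and classifies subtags by length, following the BCP 47 subtag order. It is a simplification under stated assumptions, not a full BCP 47 parser and not the plugin's code: the name `split_language_tag` is hypothetical, and variants, extended language subtags and extensions are simply ignored.

```python
def split_language_tag(tag):
    """Split a BCP 47 style tag into language, script and region (simplified)."""
    subtags = tag.split("-")
    result = {"language": subtags[0].lower(), "script": None, "region": None}
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            # A singleton such as "u" or "x" starts an extension; stop here.
            break
        is_script = len(subtag) == 4 and subtag.isalpha()
        is_region = (len(subtag) == 2 and subtag.isalpha()) or (
            len(subtag) == 3 and subtag.isdigit()
        )
        if is_script and result["script"] is None and result["region"] is None:
            # The script subtag comes before the region, e.g. "sr-Cyrl-RS".
            result["script"] = subtag.title()
        elif is_region and result["region"] is None:
            result["region"] = subtag.upper()
        # Anything else (variants, extended language subtags) is skipped.
    return result
```

For the examples in this thread, `split_language_tag("grc-Latn")` yields the script `Latn` with no region, and `split_language_tag("he-IL-u-ca-hebrew-tz-jeruslm")` yields the region `IL` with no script, leaving the extension part untouched.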

I hope you find this change useful, and I will already start looking at some improvements on the semantic markup and CSS ... coming soon ;-)

Best greetings /sascha

@saschaleib saschaleib marked this pull request as ready for review March 10, 2023 14:06
@5shekel

5shekel commented Apr 30, 2023

@saschaleib
Contributor Author

Thanks, @5shekel, this is indeed a good use case, as your site mixes RTL and LTR. May I interest you in trying my own Ad-Hoc Tags plugin as an alternative? It gives you more flexibility, as it also supports the dir attribute, which can override the language attribute.

@Klap-in
Collaborator

Klap-in commented May 10, 2023

Could it be that people use the config file to configure other combinations, so that merging would cause backward compatibility issues? If you expect not, I will merge.

@saschaleib
Contributor Author

Hm, in principle that would be possible, but I think the overhead would not be worth the benefits. I reckon that the built-in lists of languages and scripts now cover most cases, and with the option to override the script this should be pretty much complete.

I should add that I have moved on and made my own plugin which implements (and extends) the attribute handling and other aspects. If there is an interest, I am happy to backport some more features here. :-)

@Klap-in Klap-in merged commit ce2c1d7 into selfthinker:master Aug 13, 2023
@saschaleib saschaleib deleted the saschaleib-patch-language-dir branch August 13, 2023 13:28