hreflang should support ISO 639-1, ISO 3166-1 Alpha 2, ISO 15924 formats #668

Open
nethubonline opened this issue Apr 15, 2019 · 26 comments · Fixed by #880
Labels
core Core functionalities, including the admin section enhancement New feature or request help wanted Extra attention is needed

Comments

@nethubonline

nethubonline commented Apr 15, 2019

According to Google: https://support.google.com/webmasters/answer/189077

The value of the hreflang attribute identifies the language (in ISO 639-1 format) and optionally a region (in ISO 3166-1 Alpha 2 format) of an alternate URL. (The language need not be related to the region.) For example:

  • de: German language content, independent of region
  • en-GB: English language content, for GB users
  • de-ES: German language content, for users in Spain

Some languages have more than one written form, such as Traditional Chinese and Simplified Chinese. ISO 639-1 provides only one language code, "zh", for both, so we cannot enter "zh" for Traditional and Simplified Chinese at the same time.

According to Google's answer, we should use "zh-TW" / "zh-CN" or "zh-Hant" / "zh-Hans" instead, but the "Language Code" field in the qtranslate-XT settings is limited to 2 characters.

Could qtranslate-XT lift the 2-character limit so that we can follow what Google suggests?

@herrvigg
Collaborator

Good point, seems legit for a plugin dealing with localizations! I won't work on this for now and I don't know how much would have to change, but it should definitely be feasible. I'll keep it open, and anyone should feel free to send a PR.

@herrvigg herrvigg added the enhancement New feature or request label Apr 15, 2019
@nethubonline
Author

Sorry, I am not good at programming, but I found that WP Multilang supports it.

Although it officially uses 2 characters (e.g. en), I can still enter en-us in order to follow what Google suggests.

I still love qtranslate-xt, and I hope that it can meet Google's standard.

Screenshot: https://ps.w.org/wp-multilang/assets/screenshot-1.png?rev=1760406
Reference site: https://wordpress.org/plugins/wp-multilang/

@herrvigg
Collaborator

herrvigg commented Jun 17, 2019

It should be feasible but it requires a bit of work and a lot of testing.

The main part is handled via regex formats such as [a-z]{2}. This part is easy to change once the specification is clearly defined (see below). In practice, one problem is that there are many regexes in both the PHP and JS code, though they seem to be limited to a few files. What is even more problematic is that some parts of the code assume the length is 3, typically when looking at the lengths of strings. This is much harder to track. There might also be other tricky cases. Also, the addition of the locale might have other impacts.

For the new format, a very lazy way would just be to handle something like [a-z]{2}[-][A-Z]{2} for language-LOCALE (e.g. zh-TW). But the specification should be studied more carefully to do something right for the long term. The regex should not be too complicated because it is used several times in the code, especially when parsing the data, so for performance reasons it has to be done wisely. Maybe the format itself should be a variable.
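
To illustrate the idea of keeping the format in one variable, here is a minimal sketch in Python (not the plugin's PHP or JS code). The names LANG_CODE, URL_PREFIX and lang_from_path are hypothetical, and the pattern is the "lazy" one with the region made optional, which is an assumption for this example.

```python
import re

# Hypothetical single source of truth for the language code format.
# Assumption: the "lazy" language-REGION pattern with the region made optional.
LANG_CODE = r"[a-z]{2}(?:-[A-Z]{2})?"

# Other patterns are derived from LANG_CODE instead of repeating "[a-z]{2}".
# URLs are matched case-insensitively, so "/zh-tw/" is accepted as well.
URL_PREFIX = re.compile(rf"^/({LANG_CODE})(?:/|$)", re.IGNORECASE)


def lang_from_path(path):
    """Return the language prefix of a URL path, or None if there is none."""
    match = URL_PREFIX.match(path)
    return match.group(1) if match else None


def lang_from_path_fixed_length(path):
    """What a hard-coded length assumption looks like: it truncates longer codes."""
    return path[1:3]
```

With a path such as "/zh-TW/about", the derived pattern returns the whole code while the fixed-length slice silently returns only "zh", which is the kind of hidden assumption that is hard to track down.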

@nethubonline
Author

Hi,

Just wanted to know if there is any update on this enhancement. If you need a tester, I am here :)

@herrvigg
Collaborator

herrvigg commented Nov 2, 2019

Yes, I still have this in mind, I think it's an important feature to have. I often see many places in the code where the 2-character format is hard-coded, so I know more or less what to change, but this is quite a deep change. I don't think it's reasonable to do it in the current codebase because it is too much spaghetti code. This would happen after migrating to a new repo and refactoring most of the core part.

@nethubonline
Author

Understood, hope to see it soon :)

@Devoleksiy

Sorry for my English:
Here is my first time instruction, it will help you.
https://devoleksiy.vpoltave.net/cms/wordpress/del-prefiks-url-qtranslate-xt/

@nethubonline
Author

Devoleksiy,

Your information is completely unrelated to this enhancement.


@herrvigg
Collaborator

herrvigg commented Nov 5, 2019

Yes it's a different topic.
Regarding SEO, I don't think you should remove the language from the language link even if you hide the default. The problem is that the default with hidden language can also depend on the current value of your cookie if you switch from the browser, so the result is not fully deterministic. IMO it's better to keep it in the hreflang links. But this is not the same topic as extending hreflang to other ISO formats with more than 2 characters.

@herrvigg herrvigg added help wanted Extra attention is needed core Core functionalities, including the admin section labels Mar 11, 2020
@herrvigg
Collaborator

herrvigg commented May 23, 2020

We also have to take into account @mikoet's remark from #836 mentioning ISO 639-2 and ISO 639-3:

I want to use this plugin for a website that will contain content in Low Saxon and maybe also North Frisian. Both are minority/regional languages in Northern Germany and for Low Saxon also in the Eastern Netherlands.

Both, Low Saxon and North Frisian have no ISO 639-1 two letter code, but ISO 639-2 and ISO 639-3 three letter codes. Those are "nds" for Low Saxon and "frr" for North Frisian.

See https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

For us the distinction between ISO 639-2 and ISO 639-3 does not really matter, because the 639-3 variant simply adds the notion of macro-language, which is just one hierarchical level (we may come back to this concept later). The format for qTranslate can be summarized as a 2- or 3-letter code.

@herrvigg
Collaborator

herrvigg commented May 23, 2020

This topic is quite an interesting question, very important for qTranslate i'd say!
If I try to summarize how to deal with the formats, I've dug a bit more and here is what I get. Double-check and don't hesitate to comment if you see mistakes.

Here we don't really care about the values but only the format that will define the pattern for the regex (regular expression).

ISO 639-1 - the only format currently supported in qTranslate-XT (!)
Format: 2 letters (lower case)
Describes a language.
Note: this is not ISO 3166-1 which is upper case! And they don't describe the same (language vs country/region)

ISO 3166-1 alpha-2
Format: 2 letters (upper case)
Describes a country code or an area, also called "region" for hreflang...
Great they added alpha-2 only for the purpose of creating a confusion with 3166-2 which is a totally different format. That's pure evil really...!!!

ISO 3166-2
Adds a subdivision code...
Format: 2 letters (upper case) - up to 3 alphanumeric characters

  • the first 2 letters are ISO 3166-1 (alpha-2) - perfect mix with the names, congrats guys really! 👏
  • the subdivision part is one to three alphanumeric characters (letters apparently upper case)

ISO 639-2 and 639-3
Format: 3 letters (lower case); combined with ISO 639-1 this gives 2 or 3 letters
Describes a language. 639-3 adds the concept of macro-language but we don't care here, the format is the same for 639-2 and 639-3.

ISO 15924
Format: 4 letters (first letter upper case, e.g. Hant or Hans)
Used for scripts, i.e. writing systems. Honestly I don't know the details.

hreflang according to google
quite a bastard mix of ISO 639-1 (2 letters lower case for language) - ISO 3166-1 alpha-2 (2 letters upper case for... the region - but not the subdivision).

Uh!! 🤯
What else?! I'm sure we are forgetting something 😅

@herrvigg
Collaborator

herrvigg commented May 23, 2020

Google does not seem to encourage ISO-639-2 (3 letters language) for hreflang. But i think this is wrong. I don't know exactly how they handle this but if we follow the RFC 8288 that describes hreflang, it says:

The "hreflang" attribute, when present, is a hint indicating what the language of the result of dereferencing the link should be. Note that this is only a hint; for example, it does not override the Content-Language header field of a HTTP response obtained by actually following the link.
Multiple hreflang attributes on a single link-value indicate that multiple languages are available from the indicated resource.

The ABNF for the hreflang parameter's value is:

Language-Tag

If we look for the definition of Language-Tag, I found this: https://tools.ietf.org/html/rfc5646#section-2.1

Which is quite complex but if we keep just the simplest part...

 langtag       = language
                 ["-" script]
                 ["-" region]

 language      = 2*3ALPHA            ; shortest ISO 639 code
                 ["-" extlang]       ; sometimes followed by
                                     ; extended language subtags
               / 4ALPHA              ; or reserved for future use
               / 5*8ALPHA            ; or registered language subtag

 extlang       = 3ALPHA              ; selected ISO 639 codes
                 *2("-" 3ALPHA)      ; permanently reserved

 script        = 4ALPHA              ; ISO 15924 code

 region        = 2ALPHA              ; ISO 3166-1 code
               / 3DIGIT              ; UN M.49 code
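
As an illustration, the simplified grammar above can be turned into a regex with one named group per subtag. This is a minimal Python sketch under stated assumptions: the names LANGTAG and parse_langtag are hypothetical, and extlang, variants, extensions and private-use subtags are deliberately ignored. Matching is case-insensitive here because ALPHA in the ABNF covers both cases.

```python
import re

# Simplified langtag: language ["-" script] ["-" region]
# (extlang, variants, extensions and private-use subtags are left out on purpose).
LANGTAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"             # 2*3ALPHA, shortest ISO 639 code
    r"(?:-(?P<script>[A-Za-z]{4}))?"           # 4ALPHA, ISO 15924 code
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"  # 2ALPHA ISO 3166-1 or 3DIGIT UN M.49
)


def parse_langtag(tag):
    """Split a tag into its subtags, or return None if it does not fit the subset."""
    match = LANGTAG.fullmatch(tag)
    return match.groupdict() if match else None
```

For example, parse_langtag("zh-Hant-TW") yields the three subtags separately, while a tag such as "nds" only fills the language entry and leaves script and region as None.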

The concept of "shortest ISO 639 code" might be a bit difficult to grasp but this other document can explain better what it means:

https://www.w3.org/International/articles/language-tags/

"Most language tags consist of a two- or three-letter language subtag. Often this is followed by a two-letter or three-digit region subtag. RFC 5646 also allows for a number of additional subtags, where needed. These will be explained briefly in the next section, and include extended language, script, variant, extension and private-use subtags.

The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere."

I feel this is the right way to specify the languages properly. But this raises some questions for qTranslate-XT.

  • why do we have this distinction between language code and front-end value for hreflang?
  • shouldn't the language code match exactly these definitions?
  • what about the WordPress locales?

herrvigg referenced this issue Jul 14, 2020
These functions are not used and don't seem useful:
- qtranxf_stripSlashesIfNecessary (admin_utils)
- qtranxf_get_domain_language
- qtranxf_isAvailableIn
@herrvigg
Collaborator

herrvigg commented Sep 5, 2020

An update here, this is the feature i'm working on currently.

I have taken some big steps forward in handling the editing of the new language format on the admin side, which seems to work pretty well. I refactored the language code checks entirely and generalized them to a single regex defined on the server.

// Language code format: language[-script][-region]
// #1 language ISO 639-1 and 639-2: 2 or 3 alpha (lower case)
// #2 script ISO 15924: 4 alpha (first letter upper case)
// #3 region ISO 3166-1 alpha-2: 2 alpha (upper case), or UN M.49: 3 digits

define( 'QTX_LANG_CODE', '[a-z]{2,3}(?:-[A-Z][a-z]{3})?(?:-[A-Z]{2}|-\d{3})?' );
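
To get a feel for what this pattern accepts, here is a small Python sketch that checks a few candidate codes against it. Only the pattern comes from the definition above; the function name is_valid_lang_code and the list of candidates are made up for the example, and the plugin itself applies the pattern in PHP.

```python
import re

# Same pattern as the QTX_LANG_CODE definition above, copied verbatim.
QTX_LANG_CODE = r"[a-z]{2,3}(?:-[A-Z][a-z]{3})?(?:-[A-Z]{2}|-\d{3})?"


def is_valid_lang_code(code):
    """True if the whole string matches the pattern (case-sensitive)."""
    # re.ASCII restricts \d to 0-9, which is closer to the default PCRE behaviour.
    return re.fullmatch(QTX_LANG_CODE, code, re.ASCII) is not None


candidates = ["en", "nds", "zh-Hant", "zh-TW", "zh-Hant-TW", "es-419",
              "EN", "zh-tw", "zh-hant", "english"]
for code in candidates:
    print(f"{code:12} {is_valid_lang_code(code)}")
```

The first six candidates are accepted. The last four are rejected because the pattern is case-sensitive and must match the whole string.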

This is available in this new branch:
https://github.com/qtranslate/qtranslate-xt/tree/feature/lang-code

The admin part should be quite fine.
Now on the front-end side there are some redirection issues with infinite loops, but I hope this can be solved.

@herrvigg
Collaborator

herrvigg commented Sep 5, 2020

Front-end redirections fixed. This seems to work now!

@herrvigg
Collaborator

herrvigg commented Sep 12, 2020

I have a little problem with the case-sensitive checks. In the current version, the 2-letter code check is entirely case-insensitive. This means you could enter en or EN interchangeably. Lower case was only a suggestion. I don't really know the historical reasons:

  • maybe the reference was this old list of ISO-639 codes, which contained upper case versions at the end. But this list has become obsolete. If we look at more recent specifications, it seems any ISO-639 code should be lower case.
  • maybe the reason was to simplify the relation with the URL. The URL check should be case-insensitive. But this is not the same as the internal format.

My view is that the internal language code should be case-sensitive to match exactly the official language specifications. When we check against the URL, this particular check can remain case-insensitive, but internally the language code should be strictly case-sensitive to ensure a perfect consistency between the definitions of the enabled languages and the embedded tags stored in DB.

I would like to enable a strict case-sensitive check when the language codes are edited on the admin side. The problem is that some users may have 2-letter codes in upper case, e.g. with [:EN], though I hope this is not the general case. Before enabling a strict check we'll need to convert all of those in the post content to lower case, and this requires a bit more work. So for now I will end up with a mix: enforcing the case-sensitive check for the new formats other than 2-letter, while still allowing a deviation for the existing 2-letter codes. We will probably need a big warning to prepare for this migration.
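
The mixed policy described above could look roughly like the following Python sketch. It is only an illustration under assumptions: the function names, the LEGACY_CODE rule and the way existing codes are passed in are all hypothetical, not how the plugin implements it.

```python
import re

# Assumptions: the extended pattern discussed in this thread, plus a
# hypothetical "legacy" rule that tolerates any case for 2-letter codes.
STRICT_CODE = re.compile(
    r"[a-z]{2,3}(?:-[A-Z][a-z]{3})?(?:-[A-Z]{2}|-\d{3})?", re.ASCII
)
LEGACY_CODE = re.compile(r"[A-Za-z]{2}")


def is_accepted_in_admin(code, existing_codes):
    """Strict check for new entries; existing 2-letter codes may keep their case."""
    if STRICT_CODE.fullmatch(code):
        return True
    return code in existing_codes and LEGACY_CODE.fullmatch(code) is not None


def match_url_lang(url_lang, enabled_codes):
    """Case-insensitive URL check that returns the internal (case-sensitive) code."""
    by_lower = {code.lower(): code for code in enabled_codes}
    return by_lower.get(url_lang.lower())
```

The point of the second function is that the URL can stay relaxed while everything stored internally uses exactly one spelling of each code.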

@herrvigg
Collaborator

I also realize that adding the country code and the script raises a new need, having fallback mechanisms to a more generic language. Let's say you have now:

  • en-US: American English
  • en-GB: British English

With the new feature, the two language codes would exist in DB. When looking for a translation, we would look for that language strictly, with either [:en-US] or [:en-GB] tags. But if a post doesn't contain them, it would be quite logical to take the existing [:en] content if it is found in the post. This is not the case for now.

The same would apply to zh-TW / zh-CN --> zh, or with the scripts to zh-Hant / zh-Hans --> zh.

We can think of the language codes as a hierarchy, depending on the level at which they are defined and on what we have in the database with the post content. That would require a few more changes but it's an idea for future features.
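
A minimal Python sketch of such a hierarchy, assuming the fallback simply drops the last subtag at each step. The function names are hypothetical and the translations are represented as a plain dictionary for the example.

```python
def fallback_chain(code):
    """List a language code followed by its more generic parents."""
    parts = code.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def pick_translation(translations, code):
    """Return the text for the most specific available code, or None."""
    for candidate in fallback_chain(code):
        if candidate in translations:
            return translations[candidate]
    return None
```

With this rule en-US falls back to en, and zh-Hant-TW falls back to zh-Hant and then zh. Note that it does not relate zh-TW to zh-Hant; that kind of mapping would need extra data.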

@herrvigg
Collaborator

Mmmm what we need maybe is rather embedded sub-tags... For example we would get this with standard tags (extended):

[:en-US]What a nice color! The same as my favorite soccer team![:]
[:en-GB]What a nice colour! The same as my favourite football team![:]

If we handle this with two separate language codes, all the post content would be duplicated.

But we could think of a smarter format like this:

[:en]What a nice [:en-US]color[:en-GB]colour[:-]! The same as my [:en-US]favorite soccer[:en-GB]favourite football[:-] team![:]

This would save a lot of space in the database because the common parts are shared. And the fallback propagation here would be quite intrinsic to the format.

But this is definitely more complex and requires more work to handle. I don't want to block this new feature for the extension of the language formats too long but... the problem is if i release the new full extended format it may become very hard to migrate to such format in the future... I need to think about it a bit more.

For sure the 3-letter format ISO-639-2 can be released. Maybe i will do a first release for this. The current regex allows much more flexibility for the main format.

@herrvigg
Collaborator

herrvigg commented Sep 13, 2020

Or maybe it should be handled like this:

[:en]What a nice [:][:en-US]color[:en-GB]colour[:][:en]! The same as my [:][:en-US]favorite soccer[:en-GB]favourite football[:][:en] team![:]

Though this looks more verbose than the embedded version, it could be much easier to handle in the code. It is important to realize that these specific variations of words are a very small subset compared to the common part. The example is just to show what we really need for the regional, but in reality the main language matters more for 99% of the content.

Another way to see it would be to look for those specific words explicitly. We could have a database of regional translations for specific words translated on the fly for the front-end only. In database we would just keep one language of reference.

@nethubonline is that the same for "zh-TW" / "zh-CN" and "zh-Hant" / "zh-Hans"? Do these variations change only a relatively small part, or do they change the whole content drastically?

herrvigg added a commit that referenced this issue Sep 13, 2020
Fixes #836.  Fixes partially #668.
Major refactoring: language code format now handled with a unique regex.
The new format allows 2 or 3-letter (ISO 639-2 and 639-3), lower case.
Upper case values are only allowed for legacy codes but not for new entries.
A migration of DB will be required before enforcing to lower case.
URL checks remain case-insensitive (unchanged).
@herrvigg herrvigg reopened this Sep 13, 2020
@herrvigg
Collaborator

herrvigg commented Sep 13, 2020

So i released a 3.9.0 for the initial 3-letter support with all the related refactoring.
Created a new ticket #884 for the follow-up on the legacy upper case that should be migrated to lower case.

For support of regions and scripts we can continue the discussion here or create new specific topics, i need to clarify a bit the needs before deciding for the right format in the database for the sub-parts related to regions and scripts. What i want to avoid is opting for a solution that would require a DB migration later.

@nethubonline
Author

Hi herrvigg,

Glad to see this big step for qtranslate-xt, thank you very much for your efforts.

hm.....
zh-Hant is Traditional Chinese which is used in Taiwan (zh-TW), Hong Kong (zh-HK);
zh-Hans is Simplified Chinese which is used in China (zh-CN)

Depending on the content, they can change the whole text drastically. Below is an example:
en: For support of regions and scripts we can continue the discussion here or create new specific topics
zh-TW: 為了支持區域和腳本,我們可以在此處繼續討論或創建新的特定主題
zh-CN: 为了支持区域和脚本,我们可以在此处继续讨论或创建新的特定主题

I agree that internally the language code should be strictly case-sensitive to ensure a perfect consistency, because we just need to show correct hreflang for SEO.

FYI, not sure if it helps: please check the source of https://www.apple.com/hk/ , Apple shows quite a lot of hreflang tags.

<link rel="alternate" href="https://www.apple.com/uk/" hreflang="en-GB" />
<link rel="alternate" href="https://www.apple.com/hk/" hreflang="zh-HK" />
<link rel="alternate" href="https://www.apple.com/tw/" hreflang="zh-TW" />
<link rel="alternate" href="https://www.apple.com.cn/" hreflang="zh-CN" />


@herrvigg
Collaborator

Thank you, yes the hreflang example looks good, that's what we are aiming for! The country (region) or script suffix are meant to be optional but they are relevant and qTranslate should support this.
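
For reference, producing such hreflang links from a table of language codes and URLs is straightforward once the codes can carry a region or script. A minimal Python sketch, where the function name hreflang_links and the example URL are assumptions for illustration and not the plugin's code:

```python
from html import escape


def hreflang_links(alternates):
    """Build one <link rel="alternate"> element per (language code, URL) pair."""
    return [
        '<link rel="alternate" href="{}" hreflang="{}" />'.format(
            escape(url, quote=True), escape(code, quote=True)
        )
        for code, url in alternates.items()
    ]


# Hypothetical site with two regional Chinese versions.
for line in hreflang_links({
    "zh-TW": "https://website.com/zh-tw/",
    "zh-CN": "https://website.com/zh-cn/",
}):
    print(line)
```

The hreflang value keeps the official casing (zh-TW) even if the URL path is lower case.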

In the example you gave between zh-TW and zh-CN it is actually similar to en-US vs en-GB in the sense most of the content is common and ideally should be shared. The longer the content, the more shared parts they will have for sure.

If i just enable the regions and scripts as new language entries, we would have a lot of content duplication and i believe there is a better way to handle it. I could release the new format but if we change later, this may create problems of migration if the internal format changes. The question is how to find a good user interface to edit those and how to store it in DB. I still need a bit of time to reflect on this to find a good plan.

@nethubonline
Author

herrvigg,

Hm.....with all respect, I can confirm that no Chinese people will use the shared common content between zh-TW & zh-CN 😄 because editing the content will kill us 🤣 , and in most cases of Chinese characters, it will also take up more database space.

for the same example above, we need to rewrite the original code:
Before: [:zh-TW] 為了支持區域和腳本,我們可以在此處繼續討論或創建新的特定主題 [:zh-CN] 为了支持区域和脚本,我们可以在此处继续讨论或创建新的特定主题 [:]

After: [:zh-TW] 為[:zh-CN] 为[:][:zh]了支持[:][:zh-TW]區域[:zh-CN]区域[:][:zh]和[:][:zh-TW]腳[:zh-CN]脚[:][:zh]本,我[:][:zh-TW]們[:zh-CN]们[:][:zh]可以在此[:][:zh-TW]處繼續討論[:zh-CN]处继续讨论[:][:zh]或[:][:zh-TW]創[:zh-CN]创[:][:zh]建新的特定主[:][:zh-TW]題[:zh-CN]题[:]

However, it may help for other languages 😄

@herrvigg
Collaborator

herrvigg commented Sep 14, 2020

Ahah fair point! Some regional variations should definitely not be merged 🤣😅

Indeed for Chinese the differences are actually more frequent and it would perhaps be more efficient to keep the contents separate. For other languages we may also offer the same possibility. But in general there are two questions:

  1. fallback for missing content
    Let's say you write a zh-TW version as the main language.
    If you have zh-CN for just a few pages, it would be nice to fall back to zh-TW if zh-CN does not exist. The same would apply to any combination such as en-US / en-GB.
    This also applies to the existing content without region/scripts. Let's say we have a database with a lot of existing zh content. If we just want to add zh-TW or zh-CN we would also want to benefit from the existing zh content as fallback (same for en / en-GB).

  2. URL redirections
    Currently we don't have any kind of mapping, so if you have zh-CN and zh-TW you would get two URL paths:
    https://website.com/zh-cn/
    https://website.com/zh-tw/
    But this would be missing: https://website.com/zh/. We could still create zh but that would be a separate language, as if it were completely different, such as en. It is not clear to me yet how we should handle the relation between the new language+region entries and just language.

These are the main reasons we still need to think how to combine language with or without regions/scripts.

@herrvigg
Collaborator

herrvigg commented Sep 14, 2020

As a complement to my point 2 and your example:

<link rel="alternate" href="https://website.com/uk/" hreflang="en-GB" />
<link rel="alternate" href="https://website.com/hk/" hreflang="zh-HK" />
<link rel="alternate" href="https://website.com/tw/" hreflang="zh-TW" />
<link rel="alternate" href="https://website.com.cn/" hreflang="zh-CN" />

For the first 3 items, there is an additional mapping between the URL path and hreflang. In qTranslate, I think the internal language should be the one with the region (en-GB, zh-HK, zh-TW) as shown in point 2 earlier. But we may add a new level of mapping just for the URL, as a form of alias, to say "map the path uk to language en-GB, path hk -> language zh-HK and so on".
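
The alias idea could be sketched as follows in Python. Everything here is hypothetical (the URL_ALIASES table, ENABLED_LANGUAGES and the function name); it only shows the lookup order: alias first, then a case-insensitive match on the language code itself.

```python
# Hypothetical alias table: URL path segment -> internal language code.
URL_ALIASES = {"uk": "en-GB", "hk": "zh-HK", "tw": "zh-TW"}
ENABLED_LANGUAGES = ["en-GB", "zh-HK", "zh-TW"]


def resolve_url_language(path):
    """Map the first path segment to an enabled language code, or None."""
    segment = path.strip("/").split("/")[0].lower()
    if segment in URL_ALIASES:
        return URL_ALIASES[segment]
    # Without an alias, fall back to a case-insensitive match on the code itself.
    for code in ENABLED_LANGUAGES:
        if code.lower() == segment:
            return code
    return None
```

This keeps the region-qualified code as the internal language and treats the short path purely as a URL-level alias.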

For the last item this is not supported by qTranslate yet. What we have is the QTX_URL_DOMAIN or QTX_URL_DOMAINS option but it is not possible to mix different URL methods (it cannot be combined at the same time with QTX_URL_PATH). You may add some redirection rules in your web server configuration (HTTP level - not the web application) but that can create some complications.

@nethubonline
Author

Yeah, falling back for missing content would be a great feature.

For point 2 URL redirections, I am not sure if it is useful for other users. If I have a site with URLs below, actually I don't need https://website.com/zh/

https://website.com/zh-cn/
https://website.com/zh-tw/

Again, thank you for your great effort on this.

@herrvigg
Collaborator

herrvigg commented Sep 16, 2020

Hybrid sequences - too hard!

I've made a few experiments and my idea of mixing hybrid sequences of tags (shared and specific parts) is going to be way too complicated!

Internally, qTranslate decomposes a post content into blocks in memory and assigns each block to one language, in the order set in the language options. Therefore, the order and the position of the original tags are lost. This is already true now if you edit the raw content and enable the LSB later. There can only be a unique part per language.
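
To make the loss of order concrete, here is a simplified Python sketch of such a decomposition. It is not the plugin's parser: the function name split_blocks is hypothetical, only the [:code] ... [:] tag style is handled, and text outside any block is ignored as a simplifying assumption.

```python
import re

LANG_CODE = r"[a-z]{2,3}(?:-[A-Z][a-z]{3})?(?:-[A-Z]{2}|-\d{3})?"
# "[:code]" opens a block for that language, "[:]" closes the current block.
TAG = re.compile(rf"\[:({LANG_CODE})?\]")


def split_blocks(content):
    """Group the tagged text by language; repeated tags are concatenated."""
    blocks = {}
    current = None  # language of the text being read, None outside any block
    position = 0
    for tag in TAG.finditer(content):
        if current is not None:
            blocks[current] = blocks.get(current, "") + content[position:tag.start()]
        current = tag.group(1)  # None for the closing "[:]"
        position = tag.end()
    return blocks


hybrid = "[:en]What a nice [:][:en-US]color[:en-GB]colour[:][:en]! The same[:]"
print(split_blocks(hybrid))
```

Once the content is reduced to one string per language, the two [:en] fragments are merged and nothing records that the regional words belonged between them, which is why the hybrid sequences cannot be reconstructed.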

Also, the classic editor in the admin section only allows to edit one language at a time even if you have the LSB to switch quickly in the same session. In Gutenberg, maybe we could have other alternatives with the Gutenberg blocks, but this would require a whole different approach and we can't really do this for now.

New feature - maybe soon!

So, for a pragmatic solution regarding the regional codes, we have to treat them as normal languages, both for storage and for editing. In other words, I think I will enable this feature soon, just by extending the regex to the one I showed previously.

My main concern was the storage format, but it will remain the same. The feature will allow more flexibility but the users will have to manage the new possibilities on their side.

One need that may be raised is the possibility to rename a language code internally in the whole database. For example, if you created long ago a language as en you might want to convert it to en-US. Someone who knows regex and SQL could do this directly, but qTranslate should certainly offer this feature or other utility functions at some point, but this is not strictly needed just for now.
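
As a rough idea of what such a rename involves at the content level, here is a Python sketch that rewrites the opening tags in a single string. The function name rename_lang_tag is hypothetical, and it assumes that only the [:code] tag style needs to be changed; applying this to a real database (post fields, options, other storage formats) would need more care.

```python
import re


def rename_lang_tag(content, old_code, new_code):
    """Rewrite "[:old]" opening tags to "[:new]" and leave everything else alone."""
    pattern = re.compile(r"\[:" + re.escape(old_code) + r"\]")
    # A function is used as replacement so new_code is inserted literally.
    return pattern.sub(lambda _match: f"[:{new_code}]", content)
```

Because the closing bracket is part of the pattern, renaming en does not touch tags of longer codes such as [:en-GB].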

I'm still doing a few experiments and i'd like to solve the fallback question a bit better:

  • for the main vs regional decomposition there is a natural order (for example a missing en-GB should look for en first if it exists) and this should not be a huge change in the code.
  • I also noticed the available languages sometimes have very strange names (en becomes "American English" though it has the name English in qTranslate, and this is confusing if I have an en-US which is my own "English American" that I created).
