Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upcasing in Greek no longer drops diacritics #581

Closed
gmilde opened this issue Mar 23, 2023 · 47 comments
Closed

Upcasing in Greek no longer drops diacritics #581

gmilde opened this issue Mar 23, 2023 · 47 comments
Milestone

Comments

@gmilde
Copy link

gmilde commented Mar 23, 2023

Greek letters drop diacritics (eccept dialytika and sub-iota) in UPPERCASE. This
is not cared for by the Unicode standard (except with a reference implementation for a Greek-specific upcasing algorithm https://icu.unicode.org/design/case/greek-upper).
Polyglossia uses the \uccode re-definitions from "xgreek.sty" to achieve this effect. However, since the new implementation in the 2022/06 LaTeX release, these re-definitions are ignored by \MakeUppercase.

\documentclass{article}
\usepackage{fontspec}
\setmainfont{DejaVu Serif}

\usepackage{polyglossia}
\setmainlanguage[variant=poly]{greek}
% if works still with Babel
% \usepackage[polutoniko.greek]{babel}

\begin{document}
 άδικος, κείμενο, ίριδα → \MakeUppercase{άδικος, κείμενο, ίριδα}
\end{document}

Somehow a "locale setting" by the Babel core activates Greek
upcasing rules while by default the common upcasing rules from Unicode
are used for Greek literals.
Unfortunately, there is no documented API that I know of.

@jspitz
Copy link
Collaborator

jspitz commented Apr 1, 2023

Note that \MakeUppercase[locale=el]{άδικος, κείμενο, ίριδα} works. The reason why babel works, AFAIU, is not due to a change in babel, but due to an incomplete kernel change which is meanwhile using expl3's \text_uppercase:nn and friends in the legacy commands and currently only checks for babel, not (yet) for polyglossia.

See ltnews36 which states how:

The commands \MakeUppercase, \MakeLowercase
and \MakeTitlecase now automatically detect the
locale currently in use when babel is loaded. This
allows automatic adjustment of letter mappings where
appropriate. They also accept a leading optional
argument. This accepts a key–value list of control
settings. At present, there is one key available: locale,
which can also be accessed via the alias lang. This is
intended to allow local setting of the language, which
can be done using a BCP-47 descriptor. For example,
this could be used to force Turkish case changing in
otherwise English input
\MakeUppercase[lang = tr]{Ragıp Hulûsi Özdem}
yields RAGIP HULÛSİ ÖZDEM.

Now the question is whether it is planned to also detect the locale when polyglossia is loaded. If not, we need to roll our own definitions à la:

% Roll a locale-aware \MakeUppercase version that handles diacritics correctly (#581)
% This employs the expl3 case changer
\ExplSyntaxOn
   \cs_gset:Npn \MakeXPGGreekUppercase #1 { \text_uppercase:nn { el } {#1} }
\ExplSyntaxOff

% Save original \MakeUppercase definition
\let\xpg@save@MakeUppercase\MakeUppercase

\def\blockextras@greek{%
  \let\MakeUppercase\MakeXPGGreekUppercase%
}

\def\noextras@greek{%
   % restore original \MakeUppercase definition
   \let\MakeUppercase\xpg@save@MakeUppercase%
}

Of course I would very much prefer if we could avoid all the necessary ad hoc definitions in our gloss files.
@josephwright are there serious reasons for the neglection of polyglossia or is this planned for later anyway?

@jspitz
Copy link
Collaborator

jspitz commented Apr 1, 2023

BTW @josephwright it looks like the following addition (ensuing the respective babel test in the kernel) will do:

    \@ifpackageloaded { polyglossia }
       {
	\@ifpackagelater { polyglossia } { 2020-01-29 }
	{
	    \cs_gset_protected:Npn \@@text@case@aux@
		{
		    \str_set:Nx \reserved@a
		    { \csuse{bcp47id} }
		}
	}
	{ }
    }
    { }

@jspitz
Copy link
Collaborator

jspitz commented Apr 1, 2023

(I'm using \csuse{bcp47id} here since \languageid is not expandable. We can also provide an expandable version, but it does not strike me necessary here).

@jspitz jspitz added the needinfo label Apr 1, 2023
@u-fischer
Copy link

@jspitz

It would be better if polyglossia would provide an expandable \localeinfo* { tag.bcp47 } too, so that the kernel doesn't have to test for external packages. Having a common interface shared by both packages would also allow to remove the test for babel and define a dummy in the kernel.

@jspitz
Copy link
Collaborator

jspitz commented Apr 1, 2023

I don't understand why we would need to emulate the babel syntax (which is newer than polyglossia's)

@u-fischer
Copy link

I don't understand why we would need to emulate the babel syntax (which is newer than polyglossia's)

well Joseph asked last year for a common interface for the BCP-data.

But Javier just reminded me that after this discussion babel added another interface. If polyglossia would implement that too, the kernel could use it in the next release.

\documentclass{article}

\usepackage[austrian]{babel}

\begin{document}

(\BCPdata{language}) (\BCPdata{script}) (\BCPdata{region})

\end{document}

@jspitz
Copy link
Collaborator

jspitz commented Apr 1, 2023

Forget my request. I'll fix this in polyglossia.

@jspitz jspitz removed the needinfo label Apr 1, 2023
@u-fischer
Copy link

if you mean that you want to redefine \MakeUppercase: Apart from the fact that I think that patching kernel commands like this isn't a good idea, be aware that hyperref relies for the bookmarks on an expandable interface to the bcp tag too (currently it still uses \localeinfo, but I will switch soon to \BCPtag too).

jspitz added a commit that referenced this issue Apr 2, 2023
@jspitz jspitz added the FIXED IN DEV This bug is fixed for the next release label Apr 2, 2023
@jspitz jspitz added this to the 1.61 milestone Apr 2, 2023
@jspitz
Copy link
Collaborator

jspitz commented Apr 2, 2023

Fixed downstream.

@jspitz
Copy link
Collaborator

jspitz commented Apr 2, 2023

if you mean that you want to redefine \MakeUppercase: Apart from the fact that I think that patching kernel commands like this isn't a good idea, be aware that hyperref relies for the bookmarks on an expandable interface to the bcp tag too (currently it still uses \localeinfo, but I will switch soon to \BCPtag too).

\csuse{bcp47id} is expandable and works well for the hyperref case. But since you seem to be unwilling to support everything else than babel syntax, I am going to patch into \MakeUppercaseUnsupportedInPdfStrings and \MakeLowercaseUnsupportedInPdfStrings as well.

@josephwright
Copy link

Of course I would very much prefer if we could avoid all the necessary ad hoc definitions in our gloss files. @josephwright are there serious reasons for the neglection of polyglossia or is this planned for later anyway?

There's no intention to neglect polyglossia, but to explain the current situation, the background as I see it.

The change of approach to case changing brought in the idea of locale to a kernel function for the first time. This raised the opportunity to have a fixed kernel definition and to then pick up the locale set by packages. However, as there is currently no generally-agreed kernel interface for the the locale data, we couldn't just 'make a change'.

I therefore looked at what babel and polyglossia offer in terms of interface for BCP-47 info: I was hoping for a common one that we could use in the kernel. Again, there isn't a current agreed common interface so I looked at what there is in both packages. As has been mentioned already, we ideally need something expandable, or at least returning a 'string' containing the BCP-47 info (or a proxy to it). In babel, there is a documented interface that does that, but for polyglossia there is not currently: I know \bcp47id is available but that's an internal data store as far as I can make out.

Ideally, the kernel shouldn't be checking for any packages at all, so we did have quite a lively discussion about whether adding code that is dependent on babel is 'acceptable'. We've gone with the current approach as a temporary measure as babel has a documented interface for recovering BCP-47 data (including decomposition) (and so if 'temporary' sticks we are using something documented).

However, the aim remains to find a way toward an agreed single interface that both babel and polyglossia can use to
make the information available. From my point of view, it would be ideal if the two could agree on a common data structure too, such that loading both would mean that setting the locale from either would apply to both, but I suspect at present that is too much of an ask.

With a common interface, the kernel can then avoid having any package dependence at all: we set up the interface in the kernel, babel and polyglossia can then re-define it to provide the data in an appropriate way (or if we can agree a common data store, they can populate the data store and the kernel interface could be left alone).

In terms of a common interface, I think many things are possible. One could for example imagine documenting \bcp47id and arranging for babel to populate it. Or one could take babel's \BCPdata{<aspect>} approach, or ... I guess I would weakly favour the babel method as it offers decomposition of the data into different aspects, which I suspect will become more useful over time.

What we really wanted to avoid is a set of 'if package loaded' tests in the kernel. The aim of the current temporary situation was to move forward somewhat whilst not simply putting in a set of patches that would be forgotten and bite all of us later. The current thread is in that sense what I was anticipating: coming back to the interface questions and trying to get to a position everyone can work with.

@u-fischer
Copy link

\csuse{bcp47id} is expandable and works well for the hyperref case. But since you seem to be unwilling to support everything else than babel syntax,

I'm willing to support whatever syntax, but as Joseph wrote: we need one syntax that in the long term is not dependant on package tests. The need to query the locale will appear also in other places, e.g. in tagging structures where one sometimes have to set the language too, and it won't do if the kernel or hyperref has to plant everywhere tests about special package code.

I am going to patch into \MakeUppercaseUnsupportedInPdfStrings and \MakeLowercaseUnsupportedInPdfStrings as well.

Well I can't prevent you to go this way, but I doubt that these commands will stay forever in hyperref.

@josephwright

I guess I would weakly favour the babel method as it offers decomposition of the data into different aspects, which I suspect will become more useful over time.

Yes, I agree with this.

@jspitz
Copy link
Collaborator

jspitz commented Apr 2, 2023

I completely agree with "a common interface would be ideal". However, devising such a common interface requires, in my opinion, that all involved parties sit down and talk about how this should look like. At least in the cases where I have been involved, this has not been the case. The babel maintainer did not show any interest in corporation but simply enforced his own approach (without even considering talking back with us). This is not how collaboration works, and my motivation to embark on this ship is consequently low.

@u-fischer
Copy link

The babel maintainer did not show any interest in corporation but simply enforced his own approach (without even considering talking back with us)

\BCPdata was a suggestion from Joseph, and is not a "babel approach". babel only implemented it after the discussion by mail in may in which you participated too. If you don't like the idea, please suggest something else for the collaboration with the kernel and packages like hyperref.

jspitz added a commit that referenced this issue Apr 2, 2023
@jspitz
Copy link
Collaborator

jspitz commented Apr 2, 2023

I have added a very basic (expandable) \BCPdata command. For the time being, this returns the full tag no matter what argument you pass to it. I suppose this will suffice for the current uses with casing. So instead of checking for a particular babel version, you probably can simply check whether this command is defined at document begin.

We can add subtags later if required. This requires a bit more work, as polyglossia currently does only store full tags.

@jbezos
Copy link

jbezos commented Apr 2, 2023

@josephwright @u-fischer @jspitz

babel defines the following 4 fields whose names are (I think) obvious, because they are used in the BCP 47: language, script, region, and variant (eg, \BCPdata{script}). For the 3rd the CLDR seems to prefer territory, and polyglossia uses ‘variant’ with another meaning. Of course, I’m willing to create aliases if necessary.

Extensions are tentatively named extension.t and extension.u. The private use area is named extension.x which is not quite correct. Since these names are documented as tentative in babel, they may be changed.

@jspitz
Copy link
Collaborator

jspitz commented Apr 2, 2023

babel defines the following 4 fields whose names are (I think) obvious, because they are used in the BCP 47: language, script, region, and variant (eg, \BCPdata{script}). For the 3rd the CLDR seems to prefer territory, and polyglossia uses ‘variant’ with another meaning. Of course, I’m willing to create aliases if necessary.

I'll go with those as well for polyglossia (in the medium term), and probably extension.x which is used by latin. We don't have u and t extensions yet.

@jspitz
Copy link
Collaborator

jspitz commented Apr 3, 2023

@jbezos I am wondering whether it would make sense to add

  1. a way to access these info also for the main language (as opposed to the current language), e.g. \BCPdata{main.language}, as sometimes it is useful to know the difference
  2. a standardized way to let \BCPdata output the BCP-47 tag as it is recorded in the gloss/ini file (\localeinfo*{tag.bcp47} in babel, \languageid{bcp47} in polyglossia), e.g. \BCPdata{tag} or simply with empty argument

@josephwright
Copy link

I have added a very basic (expandable) \BCPdata command. For the time being, this returns the full tag no matter what argument you pass to it. I suppose this will suffice for the current uses with casing. So instead of checking for a particular babel version, you probably can simply check whether this command is defined at document begin.

We can add subtags later if required. This requires a bit more work, as polyglossia currently does only store full tags.

Many thanks. I think we can work with this. Probably at some stage we might want to have \providecommand in the packages and \newcommand in the kernel (so no \ifdefined test is needed), but that is likely a refinement for a future 'pass'.

@josephwright
Copy link

I will look to adjust the kernel code for the next (2023-06-01) release: does this fit with everyone's thinking?

@u-fischer
Copy link

Many thanks. I think we can work with this. Probably at some stage we might want to have \providecommand in the packages and \newcommand in the kernel (so no \ifdefined test is needed), but that is likely a refinement for a future 'pass'.

If I got it right, then the kernel can already now make \newcommand\BCPdata[1]{...} with some suitable default definition (and document what the output should be if packages overwrite the definition).

@josephwright
Copy link

Many thanks. I think we can work with this. Probably at some stage we might want to have \providecommand in the packages and \newcommand in the kernel (so no \ifdefined test is needed), but that is likely a refinement for a future 'pass'.

If I got it right, then the kernel can already now make \newcommand\BCPdata[1]{...} with some suitable default definition (and document what the output should be if packages overwrite the definition).

@jbezos, @jspitz That work for both of you?

@jbezos
Copy link

jbezos commented Apr 3, 2023

  1. a way to access these info also for the main language (as opposed to the current language), e.g. \BCPdata{main.language}, as sometimes it is useful to know the difference
  2. a standardized way to let \BCPdata output the BCP-47 tag as it is recorded in the gloss/ini file (\localeinfo*{tag.bcp47} in babel, \languageid{bcp47} in polyglossia), e.g. \BCPdata{tag} or simply with empty argument

Why not, but \BCPdata is more a LaTeX command, so I wonder what @josephwright and @u-fischer think. Since it’s not documented (at least in babel I didn’t documented it because it was essentially tentative and ‘unfinished’), we can still make some changes, or at least leave room for future changes.

@jbezos
Copy link

jbezos commented Apr 3, 2023

If I got it right, then the kernel can already now make \newcommand\BCPdata[1]{...} with some suitable default definition (and document what the output should be if packages overwrite the definition).

@jbezos, @jspitz That work for both of you?

That’s mostly fine, but I thinking of something like this: we load Polytonic Greek, whose tag is el-polyton. I want to keep this tag for whatever reason, to correctly identify which Greek variant is in force (in the BCP 47 sense, and in this case in the polyglossia sense, too, iir). But I also want to activate the iota rules by default, always, without explicitly passing el-x-iota. I’m not sure how this can be done.

@josephwright
Copy link

That’s mostly fine, but I thinking of something like this: we load Polytonic Greek, whose tag is el-polyton. I want to keep this tag for whatever reason, to correctly identify which Greek variant is in force (in the BCP 47 sense, and in this case in the polyglossia sense, too, iir). But I also want to activate the iota rules by default, always, without explicitly passing el-x-iota. I’m not sure how this can be done.

Wouldn't that be el-polyton-x-iota? My reading of the x stuff is it can be used to add 'whatever' to the end of the tag. I suspect I may need to do more work if we want multiple x tag combos, but I think el-polyton-x-iota should already work: we do decomposition into <lang>-<filler>-x-<extension> so it should 'just work'.

@jbezos
Copy link

jbezos commented Apr 3, 2023

If it ‘just works’, great (actually, didn’t test).

@josephwright
Copy link

If it ‘just works’, great (actually, didn’t test).

I did: this does work :)

@josephwright
Copy link

  1. a way to access these info also for the main language (as opposed to the current language), e.g. \BCPdata{main.language}, as sometimes it is useful to know the difference
  2. a standardized way to let \BCPdata output the BCP-47 tag as it is recorded in the gloss/ini file (\localeinfo*{tag.bcp47} in babel, \languageid{bcp47} in polyglossia), e.g. \BCPdata{tag} or simply with empty argument

Why not, but \BCPdata is more a LaTeX command, so I wonder what @josephwright and @u-fischer think. Since it’s not documented (at least in babel I didn’t documented it because it was essentially tentative and ‘unfinished’), we can still make some changes, or at least leave room for future changes.

Should we open a kernel issue to hammer this out now we have broad agreement? The plan would be I think for the kernel to define the command and document the expected semantics, but itself do nothing, then leave it to polyglossia and babel to implement whatever parts of the documented API make sense.

@u-fischer
Copy link

If it ‘just works’, great (actually, didn’t test).

I did: this does work :)

But wouldn't you get in \MakeUppercase only el from \BCPdata{language}?

@josephwright
Copy link

If it ‘just works’, great (actually, didn’t test).

I did: this does work :)

But wouldn't you get in \MakeUppercase only el from \BCPdata{language}?

Er, yes: what I meant is that el-polyton-x-iota is read by the case code as el + x-iota and the polyton part is ignored.

@jbezos
Copy link

jbezos commented Apr 4, 2023

@josephwright Making tests. I’m still not sure what to do wrt to the last part of my question:

But I also want to activate the iota rules by default, always, without explicitly passing el-x-iota.

Will the casing macros take the private area into account if the corresponding field is defined?

(btw, I wonder if this discussion should take place in the LaTeX repository.)

@u-fischer
Copy link

But I also want to activate the iota rules by default, always, without explicitly passing el-x-iota.

Will the casing macros take the private area into account if the corresponding field is defined?

I wondered if an indirection would make sense. So that \MakeUppercase uses say \BCPdata{casing}, and you could then define this e.g. as a combination of \BCPdata{language}-\BCPdata{whatever)?

@jbezos
Copy link

jbezos commented Apr 4, 2023

@u-fischer 👍 It’s my preferred option, indeed, which imo will make things easier for other packages (eg, xgreek) and for users who don’t like babel or polyglossia. And document what the casing macros expect (which, in fact, is already done).

@josephwright
Copy link

(btw, I wonder if this discussion should take place in the LaTeX repository.)

I think that might be best: we are really talking about something separate from the thread here

@jspitz
Copy link
Collaborator

jspitz commented Apr 8, 2023

Many thanks. I think we can work with this. Probably at some stage we might want to have \providecommand in the packages and \newcommand in the kernel (so no \ifdefined test is needed), but that is likely a refinement for a future 'pass'.

If I got it right, then the kernel can already now make \newcommand\BCPdata[1]{...} with some suitable default definition (and document what the output should be if packages overwrite the definition).

@jbezos, @jspitz That work for both of you?

Works for me.

@jspitz
Copy link
Collaborator

jspitz commented Apr 8, 2023

  1. a way to access these info also for the main language (as opposed to the current language), e.g. \BCPdata{main.language}, as sometimes it is useful to know the difference
  2. a standardized way to let \BCPdata output the BCP-47 tag as it is recorded in the gloss/ini file (\localeinfo*{tag.bcp47} in babel, \languageid{bcp47} in polyglossia), e.g. \BCPdata{tag} or simply with empty argument

Why not, but \BCPdata is more a LaTeX command, so I wonder what @josephwright and @u-fischer think. Since it’s not documented (at least in babel I didn’t documented it because it was essentially tentative and ‘unfinished’), we can still make some changes, or at least leave room for future changes.

I think it will be good if at least babel and polyglossia use the same arguments here (and then we both document it).

jspitz added a commit that referenced this issue Apr 9, 2023
@jspitz jspitz removed the FIXED IN DEV This bug is fixed for the next release label Apr 9, 2023
@jspitz
Copy link
Collaborator

jspitz commented Apr 9, 2023

As there is a kernel change in sight (and hopefully also hyperref), I've removed the patches. Let's get this right from the start. @josephwright I define \BCPdata in polyglossia via \DeclareExpandableDocumentCommand. This works with your plans, doesn't it?

@josephwright
Copy link

I've opened latex3/latex2e#1035 as defining the interface really does need to be done at the kernel level.

@josephwright
Copy link

I suggest closing here at this point as the issue is being handled elsewhere

@jspitz
Copy link
Collaborator

jspitz commented Apr 10, 2023

I'll close as soon as I have implemented \BCPdata with all keywords (which will take me some time).

jspitz added a commit that referenced this issue Apr 12, 2023
TODO: Add the proper subtags to the gloss files
@jspitz
Copy link
Collaborator

jspitz commented Apr 13, 2023

Everything is done on polyglossia side now. Some subtags might need adjustment, but in general all data is in, and \BCPdata{casing} should always return the appropriate tag.

Closing this as "upstream" bug (for the record, this will be fixed with the next LaTeX kernel release, 2023/06).

@jspitz
Copy link
Collaborator

jspitz commented Apr 17, 2023

I have just committed poylglossia 1.61 to CTAN. This includes \BCPdata.

@jspitz
Copy link
Collaborator

jspitz commented Apr 21, 2023

@u-fischer with the latest hyperref release, the following should work right?

% !TeX TS-program = xelatex
\documentclass{article}
\usepackage{fontspec}
\setmainfont{DejaVu Serif}

\usepackage{polyglossia}
\setmainlanguage[variant=poly]{greek}

\usepackage[bookmarks]{hyperref}

\begin{document}
	\section{\MakeUppercase{άδικος, κείμενο, ίριδα}}
	άδικος, κείμενο, ίριδα → \MakeUppercase{άδικος, κείμενο, ίριδα}
\end{document}

still get wrong casing in the bookmarks.

Same with

\usepackage[greek.polutoniko]{babel}

@jspitz
Copy link
Collaborator

jspitz commented Apr 21, 2023

No, sorry, babel works (after an additional xelatex pass). So is there something wrong with the return value of polyglossia?

@u-fischer
Copy link

@jspitz your \BCPdata is not expandable.

\edef\blub{\BCPdata{casing}}\show\blub 

gives

> \blub=macro:
->\tl_set:Nn casing{casing}\str_if_in:nnTF {casing}{main.}{\seq_set_split:Nnx \s__seq {.}casing\seq_get_right:NN \s__seq casing\str_set_eq:NN greekgreek}{\str_set_eq:NN greekgreek}\cs_set_nopar:cpx {tmpcasing}{}el.

In babel I get

> \blub=macro:
->el.

hyperref needs expandable commands in the bookmarks (and the kernel needs an expandable \BCPdata too).

@jspitz
Copy link
Collaborator

jspitz commented Apr 21, 2023

Any advise how to make it expandable is welcome.

@jspitz
Copy link
Collaborator

jspitz commented Apr 22, 2023

I have reworked \BCPdata at 8a5a2fd. It is now expandable. A hotfix release (1.61b) will go to CTAN later today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants