-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement (Almost) All The Unicode Caseless Matching Systems #232
base: series/2.x
Are you sure you want to change the base?
Conversation
This commit adds CaseFoldedString as a partner to CIString. A CaseFoldedString is case folded according to the Unicode rules for Caseless Matching. In contrast to CIString, it does _not_ keep a reference to the input `String`. This commit changes CIString to be based on CaseFoldedString.
The exception is identifier caseless matching because the normalization operations are not natively supported by `java.text.Normalizer` and would require even more code. This change introduces a staggering _12_ types of case folded strings. These break down to simple/full case folded strings, each with a default and Turkic variant, and then for each of those we have default, canonical, and compatibility normalization/folding. CIString is now deprecated and defers to CIStringS, which is an input remembering caseless string using default Unicode caseless matching on simple case folded strings. This is the most similar form to what CIString was doing before, though I think most users will actually want to use CIStringCF which is canonical Unicode caseless matching on full case folded strings.
Tests are expected to fail right now. This branch currently still has the tests I added for CIString to debug/validate the full case folding behavior on CIString. Since CIString is now explicitly using default case less matching with simple folding, it shouldn't pass default case less matching with full folding. If we move forward here, I'll fix all that up. |
Phew! 😅 Some assorted thoughts:
If
Static types are also better for Scala.js, since the optimizer can detect unused types and elide the relevant code. If this is implemented dynamically, then it no longer can do that. Although not sure if it this will make much difference in practice 😅 I think the hardest thing to stomach is that as a user or library author it's no longer obvious what the "right" choice is. I'm worried that if libraries don't sync up on this, then there will be fragmentation between APIs. |
My reasoning is that while I think the current implementation of I'm okay just deprecating though if that is preferred.
Yeah, I agree. If we keep the other types around I think we should update the github site and basically all documentation to point people towards I on the fence about having the full set of types proposed here, but I'm trying not to be overly biased towards my specific use case. Since they are all in the standard, it's reasonable to think someone might want them eventually. They do have some nice properties, like I'm willing to be talked out of this though. I also hate the names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like mostly a superset of the other one, so it's probably better to discuss the design there.
message = | ||
"Please use either CIStringCF, CIStringCS, or CIStringS instead. CIString/CIStringS implement Unicode default caseless matching on simple case folded strings. For most applications you probably want to use CIStringCF which implements Unicode canonical caseless matching on full case folded strings.", | ||
since = "1.3.0") | ||
final class CIString private (override val toString: String, val asCIStringS: CIStringS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http4s-0.22 and 0.23 would have to pin, because this is a big part of its public API, and even if we nowarned all the usages, the deprecation would be viral into everyone's apps. I would really rather avoid a source-breaking change on this type if we can avoid it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm struggling with what to do here. I'd like CIString
to be the one most users will want out of the box, but we can't do that unless we break the semantics by changing it to be Unicode canonical caseless matching with full folded strings.
For most users, they probably won't notice, so maybe it's okay? That also would help with the concern about advertising which is the best type to use by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, are we breaking semantics? Or was our current implementation buggy, and we are fixing the semantics?
The CI microsite states that "locale independence" is a design goal. Is that what @isomarcte is achieving here?
Edit: hmm, I see that "locale-independence" is a technical term meaning that e.g. Locale.setDefault(Locale.forLanguageTag("tr"))
should not change the result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@armanbilge that's a tricky question. Yes and no. The Turkic special cases are obviously not locale independent. However, and forgive me for splitting hairs here, I'm not sure you can say that any Unicode caseless matching is locale independent. Whether or not "ı" == "I"
in a caseless match is really only answerable in the context of some locale.
In as much as one can have locale independence in this context, the non-Turkic variants achieve that.
I can drop the Turkic ones, but...I don't know...maybe some Turkic language users will want to use them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit: hmm, I see that "locale-independence" is a technical term meaning that e.g. Locale.setDefault(Locale.forLanguageTag("tr")) should not change the result
I did not realize that. 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Idk if we can declare this a bug in case-insensitive specifically, but IIUC it is a bug in http4s.
The [Unicode simple](https://www.w3.org/TR/charmod-norm/#dfn-unicode-simple) casefolding form is not appropriate for string identity matching on the Web.
@armanbilge is it clear to you if they are implying default full caseless matching or canonical full caseless matching? It is not to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@isomarcte canonical?
Case sensitivity is not recommended for most specifications but, in the case of an exception where the vocabulary allows non-ASCII characters and which does not want to be sensitive to case distinctions, the 'Unicode canonical case fold' approach SHOULD be used.
https://www.w3.org/TR/charmod-norm/#matching_CanonicalFoldNormalizationStep
How does "compatibility" fit into this?
A 'Unicode compatibility case fold' approach should not be used.
https://www.w3.org/TR/charmod-norm/#matching_CompatibilityFoldNormalizationStep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@armanbilge I don't think compatibility comes into play in our http4s context. (thanks for finding that btw!)
I'm becoming increasingly convinced we should change CIString
to be full canonical caseless matching and update the docs. I suspect most of our current user base is working around web based technologies and probably should be using full canonical caseless matching anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing here contradicts that: https://mvnrepository.com/artifact/org.typelevel/case-insensitive/usages
It's probably fine 🙃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would challenge that this is a bug in http4s. From the next paragraph in the spec:
Specifications that define case-insensitive matching in vocabularies limited to the Basic Latin (ASCII) subset of Unicode MAY specify ASCII case-insensitive matching.
I can't think of anywhere that http4s is using it that isn't an ASCII context. The one place I'm less sure of is the URI, but even there, encodings are by convention, so I think the answer is ambiguous. So, while CIString
is less general purpose than we would like, is there a smoking gun that its use in the http4s ecosystem is wrong?
// Yes, you read that right. Round and round we go. | ||
// | ||
// > "Yes, that’s it," said the Hatter with a sigh: | ||
// > "it’s always tea-time, and we’ve no time to wash the things between whiles." | ||
// > | ||
// > "Then you keep moving round, I suppose?" said Alice. | ||
// > | ||
// > "Exactly so," said the Hatter: "as the things get used up." | ||
// > | ||
// > "But what happens when you come to the beginning again?" Alice ventured to ask. | ||
// > | ||
// > "Suppose we change the subject," the March Hare interrupted, yawning | ||
// | ||
// - Alice's Adventures In Wonderland, Chapter VII, by Lewis Carroll |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😆
This one drops the generic |
This allows us to not deprecate the type. There is still an unfortunate issue here. Full case folding can change the length (in terms of characters) of the `String`. This makes `CIString.length` not safe to use, as it is not clear which length it now refers to. This commit deprecates that method because of this. However, the `glob` based `unapplySeq` relies both on `length` and `Char` based case insensitivity. This means that it is no longer correct. That said, by relying on `Char` based case insensitivity, it was probably not quite correct to begin with.
@rossabaker @armanbilge I've pushed a new commit which undeprecates and changes
I ran into a different issue however. The unapplySeq for the I think we should deprecate it. 🙁 Edit: I linked the wrong commit, 619b36f |
Hmm. Putting aside the semantic changes, all these deprecations to |
Yeah, it's not great. I'm still open to suggestions on what to do here. We could provide an alternative to to the |
Right, and that wouldn't help http4s anyway. Seems to me like we can either swallow this pretty breaking change on both the semantic and API level or put this into 2.0.0. IIUC this came up while working on cats-uri which itself is targeting http4s 1.0 I assume. At the very least, starting a 2.x branch could unblock you for now, and we could think about strategies to get this into 1.x? /shrug |
My concern with not changing this until http4s 1.x.x is that what CIString does right now is (probably) equivalent to simple case folding with Turkic rules and default caseless matching. When that disagrees with canonical caseless matching with full folding, the result is a bug in the http4s context. Are we okay with that on 0.22.x or 0.23.x? I'm not sure what the full impact would be when they disagree. In a Uri for example, those would be cause two equivalent regNames to be treated as non-equivalent. Or, in the case of dotless I characters, our current CIString would cause two non-equivalent regNames to compare equal. This might get blocked by DNS rules, DNS adds a bunch of rules on top of RFC-3986, but I'm not sure. That's just one example, I'm sure there are others. I mean...I get it...we've no great options and big source breaking changes suck...I'm just thinking out loud here.
Yeah, I'll work towards that in the meantime. |
Yep me too 😅 Definitely agree that it's a bug in http4s, but I can't begin to understand how serious it is. Defer to Ross on next steps :) |
@rossabaker I'm (finally) circling back here. Do you have any thoughts? I suspect there are some very deep serious side effects of the incorrect case-folding, but that they are quite rare. Given our user base doesn't seem to be complaining about them right now in http4s, I think we could pivot for version 2 of this library and hope any such invalid behavior is rare enough that we don't see it on 0.23.x of http4s. Thoughts? |
I have severe apprehensions about introducing a 2.x version. The scala-xml upgrade has been a shit show, not so much due to actual incompatibility, and more because it's setting off alarms in SBT and getting casually upgraded in patch releases downstream because it was advertised as "compatible enough". We risk a similar shitshow as soon as we introduce a 2.x here. Eyeballing this after several months away, it seems like it's binary compatible, and I didn't set off any MiMa alarm bells. Am I correct that this a purely semantic change as written? I would also like to consider benchmarks before changing the core type used in http4s. It seems like this is heavier, but it probably doesn't matter in the context of an IO application? It's worth considering, given that this library is well established with one consumer that actually matters. |
I've been distracted by work in https://github.com/typelevel/idna4s and at $WORK, but IIRC this can be made bincompat with the possible exception of the globbing matcher. See #297 which I just hastily wrote up.
Definitely. |
See also my comment in #232 (comment). While we can do this bin-compatibly, it feels like a technicality since so much would be changing. It may be worthwhile however.
I recently adopted it in Natchez as well. |
The exception is identifier caseless matching because the normalization operations are not natively supported by
java.text.Normalizer
and would require even more code.This change introduces a staggering 12 types of case folded strings. These break down to simple/full case folded strings, each with a default and Turkic variant, and then for each of those we have default, canonical, and compatibility normalization/folding.
CIString is now deprecated and defers to CIStringS, which is an input remembering caseless string using default Unicode caseless matching on simple case folded strings. This is the most similar form to what CIString was doing before, though I think most users will actually want to use CIStringCF which is canonical Unicode caseless matching on full case folded strings.
Note: There is still quite a bit of work to do on this branch. Tests/Scaladoc/typeclass instances come to mind, but I wanted to get it up for initial thoughts before I did all that.
This is an alternative to #229. There are some obvious cons here.
A major Con is that there are so many types. I don't know that we'd need/want to add
CIString
variants for each of the types of case folded strings, but even if we just have the types introduced thus far we've gone from 1 CIString to 15 Folded/Caseless strings. That said, this is just what the reality of the Unicode standard dictates. There are a few more succinct encodings I toyed with, but they have some negative code complexity tradeoffs.An alternative is to just have a single
CaseFoldedString
which is the result of applying one of the folding algorithms and one of the normalization algorithms. This is what I did in #229, but I'm very wary of it as we lose information about what happened. Two case folded strings could be equal but not really be equivalent because they we derived from different folding/normalization. That's why, despite the verbosity, the independent types seemed like the right choice.But then again 🤷 . I may have just lost my mind after staring at the Unicode standard this long. 🤣