Implement (Almost) All The Unicode Caseless Matching Systems #232

isomarcte · 2022-02-06T18:09:48Z

The exception is identifier caseless matching because the normalization operations are not natively supported by java.text.Normalizer and would require even more code.

This change introduces a staggering 12 types of case folded strings. These break down to simple/full case folded strings, each with a default and Turkic variant, and then for each of those we have default, canonical, and compatibility normalization/folding.

CIString is now deprecated and defers to CIStringS, which is an input remembering caseless string using default Unicode caseless matching on simple case folded strings. This is the most similar form to what CIString was doing before, though I think most users will actually want to use CIStringCF which is canonical Unicode caseless matching on full case folded strings.

Note: There is still quite a bit of work to do on this branch. Tests/Scaladoc/typeclass instances come to mind, but I wanted to get it up for initial thoughts before I did all that.

This is an alternative to #229. There are some obvious cons here.

A major Con is that there are so many types. I don't know that we'd need/want to add CIString variants for each of the types of case folded strings, but even if we just have the types introduced thus far we've gone from 1 CIString to 15 Folded/Caseless strings. That said, this is just what the reality of the Unicode standard dictates. There are a few more succinct encodings I toyed with, but they have some negative code complexity tradeoffs.

An alternative is to just have a single CaseFoldedString which is the result of applying one of the folding algorithms and one of the normalization algorithms. This is what I did in #229, but I'm very wary of it as we lose information about what happened. Two case folded strings could be equal but not really be equivalent because they we derived from different folding/normalization. That's why, despite the verbosity, the independent types seemed like the right choice.

But then again 🤷 . I may have just lost my mind after staring at the Unicode standard this long. 🤣

This commit adds CaseFoldedString as a partner to CIString. A CaseFoldedString is case folded according to the Unicode rules for Caseless Matching. In contrast to CIString, it does _not_ keep a reference to the input `String`. This commit changes CIString to be based on CaseFoldedString.

The exception is identifier caseless matching because the normalization operations are not natively supported by `java.text.Normalizer` and would require even more code. This change introduces a staggering _12_ types of case folded strings. These break down to simple/full case folded strings, each with a default and Turkic variant, and then for each of those we have default, canonical, and compatibility normalization/folding. CIString is now deprecated and defers to CIStringS, which is an input remembering caseless string using default Unicode caseless matching on simple case folded strings. This is the most similar form to what CIString was doing before, though I think most users will actually want to use CIStringCF which is canonical Unicode caseless matching on full case folded strings.

isomarcte · 2022-02-06T18:12:16Z

Tests are expected to fail right now. This branch currently still has the tests I added for CIString to debug/validate the full case folding behavior on CIString. Since CIString is now explicitly using default case less matching with simple folding, it shouldn't pass default case less matching with full folding.

If we move forward here, I'll fix all that up.

armanbilge · 2022-02-06T21:12:26Z

Phew! 😅 Some assorted thoughts:

CIString is now deprecated and defers to CIStringS, which is an input remembering caseless string using default Unicode caseless matching on simple case folded strings.

If CIString is deprecated anyway, why not just let it retain its old implementation, for sake of compatibility? Unless I misunderstood and that is what it was doing.

That's why, despite the verbosity, the independent types seemed like the right choice.

Static types are also better for Scala.js, since the optimizer can detect unused types and elide the relevant code. If this is implemented dynamically, then it no longer can do that. Although not sure if it this will make much difference in practice 😅

I think the hardest thing to stomach is that as a user or library author it's no longer obvious what the "right" choice is. I'm worried that if libraries don't sync up on this, then there will be fragmentation between APIs.

isomarcte · 2022-02-06T23:12:56Z

If CIString is deprecated anyway, why not just let it retain its old implementation, for sake of compatibility? Unless I misunderstood and that is what it was doing.

My reasoning is that while I think the current implementation of CIString is actually Unicode default caseless matching with simple folding + the turkic rules, I'm not 100% sure on that. It might not actually be exactly any of the 16 different rules. Also, the standard states that the turkic rules should be omitted by default, but the current implementation of CIString includes them.

I'm okay just deprecating though if that is preferred.

I think the hardest thing to stomach is that as a user or library author it's no longer obvious what the "right" choice is. I'm worried that if libraries don't sync up on this, then there will be fragmentation between APIs.

Yeah, I agree. If we keep the other types around I think we should update the github site and basically all documentation to point people towards CIStringCF/CanonicalFullCaseFoldedString unless they know they specifically want something else since the Unicode standard states this will yield the "most correct results". (I'm trying not to freak out too much that we are dealing in ranges of correctness here...)

I on the fence about having the full set of types proposed here, but I'm trying not to be overly biased towards my specific use case. Since they are all in the standard, it's reasonable to think someone might want them eventually. They do have some nice properties, like SimpleCaseFoldedString won't change the length of the String and TurkicSimpleCaseFoldedString is actually the most similar to String.equalsIgnoreCase.

I'm willing to be talked out of this though.

I also hate the names CIStringCF, CIStringCS, and CIStringS, but I was at a lose for a better scheme.

rossabaker

This seems like mostly a superset of the other one, so it's probably better to discuss the design there.

rossabaker · 2022-02-06T22:46:32Z

core/src/main/scala/org/typelevel/ci/CIString.scala

+  message =
+    "Please use either CIStringCF, CIStringCS, or CIStringS instead. CIString/CIStringS implement Unicode default caseless matching on simple case folded strings. For most applications you probably want to use CIStringCF which implements Unicode canonical caseless matching on full case folded strings.",
+  since = "1.3.0")
+final class CIString private (override val toString: String, val asCIStringS: CIStringS)


http4s-0.22 and 0.23 would have to pin, because this is a big part of its public API, and even if we nowarned all the usages, the deprecation would be viral into everyone's apps. I would really rather avoid a source-breaking change on this type if we can avoid it.

Yeah, I'm struggling with what to do here. I'd like CIString to be the one most users will want out of the box, but we can't do that unless we break the semantics by changing it to be Unicode canonical caseless matching with full folded strings.

For most users, they probably won't notice, so maybe it's okay? That also would help with the concern about advertising which is the best type to use by default.

Well, are we breaking semantics? Or was our current implementation buggy, and we are fixing the semantics?

The CI microsite states that "locale independence" is a design goal. Is that what @isomarcte is achieving here?

Edit: hmm, I see that "locale-independence" is a technical term meaning that e.g. Locale.setDefault(Locale.forLanguageTag("tr")) should not change the result

@armanbilge that's a tricky question. Yes and no. The Turkic special cases are obviously not locale independent. However, and forgive me for splitting hairs here, I'm not sure you can say that any Unicode caseless matching is locale independent. Whether or not "ı" == "I" in a caseless match is really only answerable in the context of some locale.

In as much as one can have locale independence in this context, the non-Turkic variants achieve that.

I can drop the Turkic ones, but...I don't know...maybe some Turkic language users will want to use them?

Edit: hmm, I see that "locale-independence" is a technical term meaning that e.g. Locale.setDefault(Locale.forLanguageTag("tr")) should not change the result

I did not realize that. 👀

Ok. Idk if we can declare this a bug in case-insensitive specifically, but IIUC it is a bug in http4s.

The [Unicode simple](https://www.w3.org/TR/charmod-norm/#dfn-unicode-simple) casefolding form is not appropriate for string identity matching on the Web.

https://www.w3.org/TR/charmod-norm/#sec_unicode_cs

@armanbilge is it clear to you if they are implying default full caseless matching or canonical full caseless matching? It is not to me.

@isomarcte canonical?

Case sensitivity is not recommended for most specifications but, in the case of an exception where the vocabulary allows non-ASCII characters and which does not want to be sensitive to case distinctions, the 'Unicode canonical case fold' approach SHOULD be used.

https://www.w3.org/TR/charmod-norm/#matching_CanonicalFoldNormalizationStep

How does "compatibility" fit into this?

A 'Unicode compatibility case fold' approach should not be used.

https://www.w3.org/TR/charmod-norm/#matching_CompatibilityFoldNormalizationStep

@armanbilge I don't think compatibility comes into play in our http4s context. (thanks for finding that btw!)

I'm becoming increasingly convinced we should change CIString to be full canonical caseless matching and update the docs. I suspect most of our current user base is working around web based technologies and probably should be using full canonical caseless matching anyway.

Nothing here contradicts that: https://mvnrepository.com/artifact/org.typelevel/case-insensitive/usages

It's probably fine 🙃

I would challenge that this is a bug in http4s. From the next paragraph in the spec:

Specifications that define case-insensitive matching in vocabularies limited to the Basic Latin (ASCII) subset of Unicode MAY specify ASCII case-insensitive matching.

I can't think of anywhere that http4s is using it that isn't an ASCII context. The one place I'm less sure of is the URI, but even there, encodings are by convention, so I think the answer is ambiguous. So, while CIString is less general purpose than we would like, is there a smoking gun that its use in the http4s ecosystem is wrong?

rossabaker · 2022-02-06T22:49:06Z

core/src/main/scala/org/typelevel/ci/CompatibilityFullCaseFoldedString.scala

+    // Yes, you read that right. Round and round we go.
+    //
+    // > "Yes, that’s it," said the Hatter with a sigh:
+    // > "it’s always tea-time, and we’ve no time to wash the things between whiles."
+    // >
+    // > "Then you keep moving round, I suppose?" said Alice.
+    // >
+    // > "Exactly so," said the Hatter: "as the things get used up."
+    // >
+    // > "But what happens when you come to the beginning again?" Alice ventured to ask.
+    // >
+    // > "Suppose we change the subject," the March Hare interrupted, yawning
+    //
+    // - Alice's Adventures In Wonderland, Chapter VII, by Lewis Carroll


isomarcte · 2022-02-07T13:01:03Z

@rossabaker

This seems like mostly a superset of the other one, so it's probably better to discuss the design there.

This one drops the generic CaseFoldedString for separate types. I think CaseFoldedString may be dangerous.

This allows us to not deprecate the type. There is still an unfortunate issue here. Full case folding can change the length (in terms of characters) of the `String`. This makes `CIString.length` not safe to use, as it is not clear which length it now refers to. This commit deprecates that method because of this. However, the `glob` based `unapplySeq` relies both on `length` and `Char` based case insensitivity. This means that it is no longer correct. That said, by relying on `Char` based case insensitivity, it was probably not quite correct to begin with.

isomarcte · 2022-02-07T14:43:53Z

@rossabaker @armanbilge I've pushed a new commit which undeprecates and changes CIString to be Unicode canonical caseless matching with full case folding as discussed here.

~~See: b376e37~~

I ran into a different issue however. The unapplySeq for the ci StringContext is unsound on two counts. The first is that it relies on the length of the CIString, but that is ambiguous under full case folding. The second is that it is using case insensitivity on Char values directly. I don't think you can actually apply any of the Unicode caseless matching systems directly to a Char.

I think we should deprecate it. 🙁

Edit: I linked the wrong commit, 619b36f

armanbilge · 2022-02-07T15:22:22Z

Hmm. Putting aside the semantic changes, all these deprecations to length, unapply, etc. make this seem much more like a breaking change. Like full case folding is fundamentally incompatible/unsound with the API currently exposed.

isomarcte · 2022-02-07T16:03:32Z

Yeah, it's not great. I'm still open to suggestions on what to do here. We could provide an alternative to to the ci syntax, but still not great.

armanbilge · 2022-02-07T16:19:03Z

Right, and that wouldn't help http4s anyway. Seems to me like we can either swallow this pretty breaking change on both the semantic and API level or put this into 2.0.0. IIUC this came up while working on cats-uri which itself is targeting http4s 1.0 I assume.

At the very least, starting a 2.x branch could unblock you for now, and we could think about strategies to get this into 1.x? /shrug

isomarcte · 2022-02-07T16:26:40Z

My concern with not changing this until http4s 1.x.x is that what CIString does right now is (probably) equivalent to simple case folding with Turkic rules and default caseless matching. When that disagrees with canonical caseless matching with full folding, the result is a bug in the http4s context. Are we okay with that on 0.22.x or 0.23.x?

I'm not sure what the full impact would be when they disagree. In a Uri for example, those would be cause two equivalent regNames to be treated as non-equivalent. Or, in the case of dotless I characters, our current CIString would cause two non-equivalent regNames to compare equal. This might get blocked by DNS rules, DNS adds a bunch of rules on top of RFC-3986, but I'm not sure. That's just one example, I'm sure there are others.

I mean...I get it...we've no great options and big source breaking changes suck...I'm just thinking out loud here.

At the very least, starting a 2.x branch could unblock you for now, and we could think about strategies to get this into 1.x?

Yeah, I'll work towards that in the meantime.

armanbilge · 2022-02-07T16:36:06Z

I mean...I get it...we've no great options and big source breaking changes suck...I'm just thinking out loud here.

Yep me too 😅

Definitely agree that it's a bug in http4s, but I can't begin to understand how serious it is. Defer to Ross on next steps :)

isomarcte · 2022-04-18T16:11:02Z

@rossabaker I'm (finally) circling back here. Do you have any thoughts?

I suspect there are some very deep serious side effects of the incorrect case-folding, but that they are quite rare. Given our user base doesn't seem to be complaining about them right now in http4s, I think we could pivot for version 2 of this library and hope any such invalid behavior is rare enough that we don't see it on 0.23.x of http4s.

Thoughts?

rossabaker · 2023-01-04T04:37:37Z

I have severe apprehensions about introducing a 2.x version. The scala-xml upgrade has been a shit show, not so much due to actual incompatibility, and more because it's setting off alarms in SBT and getting casually upgraded in patch releases downstream because it was advertised as "compatible enough". We risk a similar shitshow as soon as we introduce a 2.x here.

Eyeballing this after several months away, it seems like it's binary compatible, and I didn't set off any MiMa alarm bells. Am I correct that this a purely semantic change as written?

I would also like to consider benchmarks before changing the core type used in http4s. It seems like this is heavier, but it probably doesn't matter in the context of an IO application? It's worth considering, given that this library is well established with one consumer that actually matters.

isomarcte · 2023-01-04T14:35:58Z

Eyeballing this after several months away, it seems like it's binary compatible, and I didn't set off any MiMa alarm bells. Am I correct that this a purely semantic change as written?

I've been distracted by work in https://github.com/typelevel/idna4s and at $WORK, but IIRC this can be made bincompat with the possible exception of the globbing matcher. See #297 which I just hastily wrote up.

I would also like to consider benchmarks before changing the core type used in http4s. It seems like this is heavier, but it probably doesn't matter in the context of an IO application? It's worth considering, given that this library is well established with one consumer that actually matters.

Definitely.

armanbilge · 2023-01-04T14:38:43Z

See also my comment in #232 (comment). While we can do this bin-compatibly, it feels like a technicality since so much would be changing. It may be worthwhile however.

It's worth considering, given that this library is well established with one consumer that actually matters.

I recently adopted it in Natchez as well.

isomarcte added 7 commits February 5, 2022 17:15

Fix Formatting Error

f0bd02f

Fix Compare

cfebcb0

Fix CIString Compare

9ca6c36

Rename CaseFolds To CaseFolding

98f82e7

Add Simple Case Folding Tables

6ed8433

rossabaker reviewed Feb 6, 2022

View reviewed changes

isomarcte added 2 commits February 6, 2022 16:45

Merge remote-tracking branch 'origin/main' into all-the-case-folding

4a3fba6

Add Missing new Keywords

3bc849d

isomarcte mentioned this pull request Feb 7, 2022

Implement Unicode Case Folding #229

Draft

isomarcte mentioned this pull request May 7, 2022

Series 2.x.x #247

Open

armanbilge changed the base branch from main to series/2.x May 13, 2022 18:24

Add More Docs

47e5980

isomarcte mentioned this pull request Jan 4, 2023

Globbing #297

Open

rossabaker mentioned this pull request May 2, 2023

Case insensitive keys TextMapGetter typelevel/otel4s#191

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement (Almost) All The Unicode Caseless Matching Systems #232

Implement (Almost) All The Unicode Caseless Matching Systems #232

isomarcte commented Feb 6, 2022 •

edited

Loading

isomarcte commented Feb 6, 2022

armanbilge commented Feb 6, 2022

isomarcte commented Feb 6, 2022

rossabaker left a comment

rossabaker Feb 6, 2022

isomarcte Feb 6, 2022

armanbilge Feb 6, 2022 •

edited

Loading

isomarcte Feb 6, 2022

isomarcte Feb 6, 2022

isomarcte Feb 6, 2022 •

edited

Loading

armanbilge Feb 7, 2022

isomarcte Feb 7, 2022

armanbilge Feb 7, 2022

rossabaker Jan 4, 2023

rossabaker Feb 6, 2022

isomarcte commented Feb 7, 2022

isomarcte commented Feb 7, 2022 •

edited

Loading

armanbilge commented Feb 7, 2022 •

edited

Loading

isomarcte commented Feb 7, 2022

armanbilge commented Feb 7, 2022 •

edited

Loading

isomarcte commented Feb 7, 2022 •

edited

Loading

armanbilge commented Feb 7, 2022

isomarcte commented Apr 18, 2022

rossabaker commented Jan 4, 2023

isomarcte commented Jan 4, 2023

armanbilge commented Jan 4, 2023 •

edited

Loading

Implement (Almost) All The Unicode Caseless Matching Systems #232

Are you sure you want to change the base?

Implement (Almost) All The Unicode Caseless Matching Systems #232

Conversation

isomarcte commented Feb 6, 2022 • edited Loading

isomarcte commented Feb 6, 2022

armanbilge commented Feb 6, 2022

isomarcte commented Feb 6, 2022

rossabaker left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

armanbilge Feb 6, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

isomarcte Feb 6, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

isomarcte commented Feb 7, 2022

isomarcte commented Feb 7, 2022 • edited Loading

armanbilge commented Feb 7, 2022 • edited Loading

isomarcte commented Feb 7, 2022

armanbilge commented Feb 7, 2022 • edited Loading

isomarcte commented Feb 7, 2022 • edited Loading

armanbilge commented Feb 7, 2022

isomarcte commented Apr 18, 2022

rossabaker commented Jan 4, 2023

isomarcte commented Jan 4, 2023

armanbilge commented Jan 4, 2023 • edited Loading

isomarcte commented Feb 6, 2022 •

edited

Loading

armanbilge Feb 6, 2022 •

edited

Loading

isomarcte Feb 6, 2022 •

edited

Loading

isomarcte commented Feb 7, 2022 •

edited

Loading

armanbilge commented Feb 7, 2022 •

edited

Loading

armanbilge commented Feb 7, 2022 •

edited

Loading

isomarcte commented Feb 7, 2022 •

edited

Loading

armanbilge commented Jan 4, 2023 •

edited

Loading