Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Series 2.x.x #247

Open
isomarcte opened this issue May 7, 2022 · 14 comments
Open

Series 2.x.x #247

isomarcte opened this issue May 7, 2022 · 14 comments

Comments

@isomarcte
Copy link
Member

I'm considering targeting #232 to a series/2.x.x branch.

@rossabaker @armanbilge any objections to me creating that branch and moving the code there?

@armanbilge
Copy link
Member

armanbilge commented May 7, 2022

👍 thanks for all your work on this. I think a series/2.x is the best path forward for the time being.

Since this is an issue about series/2.x in general, I'm thinking we should consider using a newtype for CIString (at least on Scala 3) now that we are getting a typeclass-based Map in typelevel/cats#4185 which does not depend on universal equals/hashCode.

@isomarcte
Copy link
Member Author

@armanbilge Do you just mean CIString encoded as an opaque type or are you talking about wrapping CIString again? And are you foreseeing us having hashCode differ from Hash?

@armanbilge
Copy link
Member

Do you just mean CIString encoded as an opaque type

Yes, this. Which would mean that hashCode would differ from Hash.

@isomarcte
Copy link
Member Author

Yeah, so in #232 that changes a bit. I changed the underlying hashCode/equals/compare to be based on a CanonicalFullCaseFoldedString.

There is still some discussion to have there, but all CanonicalFullCaseFoldedString values have the same representation (for the same input) and thus the same universal hashCode and equality. So CanonicalFullCaseFoldedString could be an opaque type, though it wouldn't need to for hash/equal reasons. And under this encoding CIString is effectively a specialized (String, CanonicalCaseFoldedString) so making it opaque doesn't gain you as much.

FWIW, I'm starting to be of the opinion we should be using CanonicalCaseFoldedString for most use cases where we use CIString. That is, I don't know if we need to hold a reference to the original representation as the normative case.

@isomarcte
Copy link
Member Author

I hope that makes some amount of sense. Rereading my comment makes me feel like I didn't explain that well.

This is what I'm talking about: https://github.com/typelevel/case-insensitive/pull/232/files#diff-90a144a03717314cdaa1151276da76557be0f1e78c890580b57c9d95406f95f6R80

@isomarcte
Copy link
Member Author

@armanbilge can you make series/1.x.x and series/2.x.x branches? I don't have access apparently.

@armanbilge
Copy link
Member

Neither do I :)

@armanbilge
Copy link
Member

@isomarcte to clarify, my proposal is we replace:

final class CIString private (override val toString: String)

with

opaque type CIString = String

If we do this, then the universal hashCode and equals for CIString will be those of String itself, because the opaque type is erased to String. This is why the instances will be incompatible with them.

@armanbilge
Copy link
Member

armanbilge commented May 7, 2022

Wait, sorry, I wasn't looking carefully. I see, CIString can no longer be represented as an opaque type, sorry! 🤔

Edit: wait, nope, I'm confused by all the various implementations 😬

@isomarcte
Copy link
Member Author

Yeah, there are a lot of types in the Unicode standard. For our discussion here we just need to focus on CanonicalFullCaseFoldedString and CIString.

The new encoding of CIString on my branch is similar to this.

final case class CIString private (original: String, caselessString: CanonicalFullCaseFoldedString)

To make a cassless string, in general, you need to case fold it, then normalize it (this is simplified, there can be more steps depending on the type of caseless string).

Thus caselessString is the result of normalize(caseFold(original)).

The reason I changed the representation of CIString is because what we are doing in 1.x.x is incorrect for caseless strings. To get to a caseless string is more complicated than just toUpper.toLower on each char, so it seemed reasonable to hold a reference to the caseless string in CIString, rather than computing it each time we needed to do an operation on the class. That also allows us to use the CanonicalFullCaseFoldedString as the source of hashCode and equals.

There are other ways we could do this. We could have the caselessString be a def,

opaque type CIString = String

extension (inline ciString: CIString) {
  inline def caselessString: CanonicalFullCaseFoldedString = normalize(caseFold(ciString))
}

Then all the methods on CIString would need to convert to a CanonicalFullCaseFoldedString on each operation. We could cache the hash via a null init pattern as is currently done on main, but we couldn't easily do that for equality checks and the other methods.

All of these gymnastics are the result of us keeping around a reference to the original input string in CIString. That's why I'm leaning on making CanonicalFullCaseFoldedString (perhaps with a more succinct name) the default type we use, rather than CIString. CanonicalFullCaseFoldedString can be trivial represented as an opaque type, though we don't need to for hash/eq purposes.

@armanbilge
Copy link
Member

Thanks for the detailed explanation, that makes sense to me. Regarding whether to cache or not, it probably makes sense? Not sure if there are situations where lots of CIStrings are allocated but no ops are called on them.

All of these gymnastics are the result of us keeping around a reference to the original input string in CIString.

The FAQ suggests there are Reasons™, but I'm not sure what they are :)
https://typelevel.org/case-insensitive/#why-pay-for-a-wrapper-when-you-can-fold-case-before-storing

@rossabaker
Copy link
Member

I added @armanbilge and @isomarcte to the team, so anyone should be able to create the necessary branches.

Not keeping the original would be a non-semantic, but highly visible, change to HTTP rendering. For example, all field names would be rendered lower case.

I don't know how much uptake CIString has seen outside http4s to evaluate how bad a binary breakage would be, but we know what immense pain the last source breakage was. I would very much prefer to avoid source-breaking changes to http4s in this.

@armanbilge
Copy link
Member

Here are steward PRs for case-insensitive:
https://github.com/search?q=case-insensitive+author%3Ascala-steward&type=Issues&ref=advsearch&l=&l=

Besides http4s, I recognize trace4cats.

@armanbilge
Copy link
Member

@isomarcte I created a series/2.x branch and re-targeted #232 to there.

I think some form of source-breaking changes are inevitable (at least in the form of deprecations) since in #232 (comment) we realized many of the exsting APIs are unsound.

Personally I think the best strategy to is keep pushing for http4s 1.x, with better CI strings, better URIs, better entities, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants