Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Byte)string contracts unsound in the presence of mutable values #2179

Open
LiberalArtist opened this issue Jul 21, 2018 · 12 comments
Open

(Byte)string contracts unsound in the presence of mutable values #2179

LiberalArtist opened this issue Jul 21, 2018 · 12 comments

Comments

@LiberalArtist
Copy link
Contributor

LiberalArtist commented Jul 21, 2018

Per the docs, strings and byte strings can be used as contracts "that recognize themselves using equal?", and regular expressions "are treated as contracts that recognize byte strings and strings that match the regular expression." Such contracts seem to be unsound when applied to mutable (byte)strings. The problem also affects predicates that inspect the contents of (byte)strings and are frequently used as contracts, such as path-string? and string-no-nuls?.

For example, in the following program:

#lang racket

(module server racket
  (provide (contract-out
            [struct a-string-box
              ([string (and/c string? #rx"a")])]
            ))
  (struct a-string-box (string)
    #:transparent))

(require 'server)

(define my-string
  (string #\c #\a #\r))

(define boxed
  (a-string-box my-string))

(string-set! my-string 1 #\d)

(a-string-box-string boxed)

the use of a-string-box-string raises an exception:

a-string-box-string: broke its own contract
  promised: #rx"a"
  produced: "cdr"
  in: an and/c case of
      the range of
      (-> a-string-box? (and/c string? #rx"a"))
  contract from: (anonymous-module server)
  blaming: (anonymous-module server)
   (assuming the contract is correct)
  at: unsaved-editor:8.10

The problem is exacerbated because Racket doesn't implement chaperone-string or chaperone-bytes. A module that wants to enforce an invariant like the one server tries to put on a-string-box must either insist that its clients supply only immutable? (byte)strings, implement an impersonator contract that coerces mutable values to immutable ones, or resort to manual checking.

I suspect this problem is rare in practice, given that Racketeers avoid mutating (byte)strings, but it does seem worth addressing, especially given the great effort racket/contract makes to provide sound contracts for other mutable values. At a minimum, I think it would be worth noting in the docs, along the lines of the explicit warnings given for other parts of the contract library that can be used to create unsound contracts.

@rfindler
Copy link
Member

rfindler commented Jul 21, 2018 via email

@LiberalArtist
Copy link
Contributor Author

LiberalArtist commented Jul 21, 2018

I'm not sure.

I would love for (byte) strings and regular expressions used as contracts to be sound by default, if the backwards-compatibility issues are surmountable.

I suspect (without any data) that just rejecting mutable values would cause problems. I often get (byte) strings from IO and treat them as immutable without actually coercing them with string->immutable-string/bytes->immutable-bytes, and my guess is that, if I had to manually add α->immutable-α to all the right places in previously-working code, I would end up missing a few.

For me personally, I think it would be nicer for the contracts themselves to do the coercion when needed with α->immutable-α. For immutable values, that would be exactly the same as the current behavior. For mutable ones, it would break eq? (but the guide already says, "do not use eq? on values that have contracts") and obviously would break any programs that actually do mutate the (byte) strings, though I bet it is very rare to intentionally mutate (byte) strings with these kinds of contracts. My biggest concern is that this change would make these impersonator contracts instead of flat contracts, and I can imagine some programs might rely on their being flat. On the other hand, such programs would inherently be unsound, and, if something has to break, that might be the best choice. In theory, it might even be possible to let (byte) strings and regular expressions be unsoundly coerced to flat contracts when needed, perhaps logging a warning, while functioning as impersonator contracts most of the time.

If chaperone-string and chaperone-bytes were implemented, treating (byte) strings and regular expressions as chaperone contracts would be an option, but I'm not sure the payoff compared to impersonator versions would be worth what I suspect would be a substantial implementation effort.

In any case, dealing with (byte) strings and regular expressions used as contracts still leaves predicates like path-string?, string-no-nuls?, and bytes-no-nuls?. I could imagine making some combinator like sound-string/c that handles the string? check, deals with string->immutable-string (or chaperone-string), and then applies the inner combinator. If programmers who care about soundness in the presence of mutable (byte) strings have to change their habits anyway, I could then see an argument that (byte) strings and regular expressions used as contracts should remain flat and unsound, but completely backwards-compatible.

As a matter of principle, I think sound-by-default is better, but I don't feel like I have a good sense of just how daunting the backwards-compatibility would be in practice or what potential change would break the least amount of currently-working code.

@rfindler
Copy link
Member

rfindler commented Jul 21, 2018 via email

@mfelleisen
Copy link
Collaborator

mfelleisen commented Jul 21, 2018 via email

@rfindler
Copy link
Member

I'm happy to admit it was a mistake, but I'm concerned about compounding the mistake.

Considering how strings and regular expressions are turned into contracts, we have some choices:

  1. they produce contracts that reject mutable strings
  2. they produce contracts that copy the strings (preserving mutability, but they won't be flat contracts anymore)
  3. add string, etc. chaperones and use them

Are there other choices?

I feel like the second one is worse because it may introduce extremely subtle bugs in programs and anyway copying mutable data seems unfriendly.

The third one seems like it will introduce some slowdowns and it seems like a lot of work. But it also seems the best from the backwards compatibility perspective.

I think we can probably see how bad option 1 is by making the change and then running a pkg test and see what changes. (I don't feel confident about that for option 2).

@LiberalArtist
Copy link
Contributor Author

LiberalArtist commented Jul 22, 2018

  1. they produce contracts that copy the strings (preserving mutability, but they won't be flat contracts anymore)

I had envisioned something slightly different, though I don't know that it's better: a contract that would accept any kind of string, but copy mutable strings to immutable ones with string->immutable-string.

I am also concerned about the possibility for subtle bugs. With option 1 in particular, a complication is that string literals are immutable, so writing tests using mutable strings requires special effort, which I, at least, generally have not done.

For a concrete example, I have some functions that use (and/c string? #px"\\S") to insist that an argument string have at least one non-whitespace character. All of my tests are run on string literals, but the "real" arguments to these functions are strings that come in over the network. I have other code at the user interface level that deals with bad cases, and I never mutate the strings, but I also (until now) haven't actually called string->immutable-string. So changing the meaning of (and/c string? #px"\\S") to be equivalent to (and/c string? immutable? #px"\\S") would break that code, and I wouldn't have caught it with automated tests.

I wouldn't hate that breaking, since from a certain perspective it amounts to the contract system protecting me from a category of bug I hadn't even thought about (one of the advantages in general of contracts over manual tests). I am in fact adding immutable? to contracts now at key places. But I think the change on the whole would have the potential to break things that might not be revealed by tests.

@mfelleisen
Copy link
Collaborator

mfelleisen commented Jul 22, 2018 via email

@lexi-lambda
Copy link
Member

(Conjecture: it won’t break anything.)

I agree that testing the change would be a better idea than blindly guessing as to its impact, but I’d bet you’re wrong. Sadly, in Racket, mutable strings are really common. String literals produce immutable strings, but almost everything else produces mutable ones. Even string-append on two immutable strings produces a mutable string. I think this is kind of disappointing, but it seems impossible to change now.

@LiberalArtist
Copy link
Contributor Author

That brings up another potential possibility, though a particularly radical one: make strings and byte strings immutable, as was done with pairs, presumably adding new mstring? and mbytes? datatypes (by analogy to mpair?) for the case when mutability is actually desired.

Obviously the backwards-compatibility issues would be very significant, but arguably breaking things noisily could be better than breaking them subtly. There would be a clear path to identifying what fails to compile if string-set!, bytes-set!, bytes-copy!, and bytes-fill! (which I believe is the complete list of primitive mutation operations) cease to exist. My intuition is that string-set! is especially rare—I've never used it outside of a demo like this—whereas mutable buffers is a more significant use-case for byte strings. If @rfindler is right that chaperone-bytes might be doable but that chaperone-string "might be off the table because of the performance implications in the runtime," potentially strings could become immutable by default and bytes could get chaperones.

This would be a far-reaching change to Racket, well beyond addressing this issue with contracts, and I'm certainly not able to assess all of the implications, especially for the runtime. If it weren't for the precedent with pairs, I wouldn't even float it as a possibility. But it seems like it would be a particularly principled response, and it might have advantages beyond this issue, in addition to the disadvantages. If I'm right that Racketeers almost always reason about strings in particular as immutable, having the language reflect that would seem to have similar advantages to making pairs immutable. Right now, as @lexi-lambda points out, enforcing immutability for strings is especially awkward.

@LiberalArtist
Copy link
Contributor Author

In general, though, I agree that guessing about what would cause the most problems ultimately must yield to someone picking an approach that seems promising, trying to implement it, and seeing what happens. Also, at least from my perspective, coming up with a solution that gets the long-term considerations right seems much more important than doing something quickly. I don't know of a real-world program that's currently having problems because of this issue: I see it as a matter of correctness more than an urgent bug.

@gus-massa
Copy link
Contributor

I was thinking about a similar problem, because string.rkt had also a problem with mutable strings.

Some functions have to calculate an internal regexp of the arguments, and to gain some speed they are cached. The problem was that it was using the argument that was a (mutable) string as the key in a hash. My solution was to the leave the immutable string alone and use string->immutable-string to transform the mutable strings. (And add some cache for the conversion, to avoid allocations in the common case when the mutable string doesn't change.)

I'm still not 100% happy with the current implementation. I was thinking about changing string->immutable-string to datum-intern-literal but I'm not sure of the trade-offs. (Probably string->immutable-string is usually faster, but the ad hoc cache may make as slow as datum-intern-literal.)

Anyway, about the current issue:

I think that the option 1 is a good solution. Probably most of the times the strings in the arguments are not mutated. I'm more worried about the contract in the result of the function.

What about using datum-intern-literal for the mutable strings? It will also change the eq-ness of the values in a different way. Perhaps I should take later a look at the generated contracts to try to make the optimizer happier, at least in the easiest cases where the argument is a string in the source code.

There were some ideas to make the use of immutable strings easier, but they were never merged. See #1219 and #1341 . I like immutability by default, but it's a big not backward compatible change. Some intermediate step like a #:immutable? keyword may be better for now.

@racket-discourse-github-bot

This issue has been mentioned on Racket Discussions. There might be relevant details there:

https://racket.discourse.group/t/advice-on-implementing-a-contract-system/832/1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants