You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
LazySmallcheck2012 (original, and my fork that works on newer GHCs) behaves in a way where data is only generated when the test forces its evaluation. (I just checked the original lazysmallcheck package, and it doesn't seem to behave this way.)
For example, consider this simple test in falsify:
This is asserting that every lst :: [Int] with more than three elements also has a first element that is at least 10. It fails (as expected):
Running1 test suites...Test suite falsify-test-test:RUNNING...PlayingUnneeded elements:FAIL (0.12s)
failed after 28 successful tests and5 shrinks
known failure, to check strictness
Logs for failed test run:
generated [0,0,0,0] at CallStack (from HasCallStack):
gen, called at test/Main.hs:18:19in main:MainUse--falsify-replay=010985dabce2ff4eb5c78d1e70747b669b to replay.1 out of1 tests failed (0.13s)
Test suite falsify-test-test:FAIL
In this case falsify reports a minimal counterexample of [0, 0, 0, 0]. Whilst that's the smallest fully-normalised value, it is actually over-specified: the test never forces the last three elements, so they are irrelevant to the outcome (only the spine and the first element are relevant). If we compare this to the same check in LazySmallcheck2012 (in GHCi):
*LSC2012> test (\(lst :: [Int]) ->not (length lst >3&&head lst <10))
LSC:Depth0:LSC:Property holds after 3 tests.LSC:Depth1:LSC:Property holds after 5 tests.LSC:Depth2:LSC:Property holds after 7 tests.LSC:Depth3:LSC:Property holds after 9 tests.LSC:Depth4:LSC:Counterexample found after 12 tests.Var0:-3:_:_:_:[]***Exception:ExitFailure1
It gives -3:_:_:_:[], AKA [-3, _, _, _]. Here _ indicates a value which was never forced, so never ran its generator, and is hence irrelevant to the cause of the test failure. Unfortunately in this case the first value (-3) isn't minimal, since (Lazy)Smallcheck doesn't do any shrinking; the ideal counterexample would be [0, _, _, _].
Why lazy generation
There are a few reasons this behaviour is nicer, as a user, than seeing fully normalised counterexamples:
Minimal values, like 0, are ambiguous: sometimes it means they can be ignored, but sometimes there is significance to that value. Using _ for the former removes some ambiguity.
If we expect a value to be important for branching, seeing it appear as _ indicates a problem with our control flow
Using _ instead of a particular value avoids accidental coincidences that we can waste time ruling out, e.g. [0,0,0,0] contains duplicate elements, but that's just a coincidence and isn't related to the failure.
More properties can be proved correct. Those with small, finite domains like Bool can already be proved by exhaustion; lazy generators complement this, since a property which passes without forcing some part of its input must therefore be true for all values of that part.
There may be performance benefits, by avoiding a bunch of generators running; but I haven't found that particularly noticable.
How LazySmallcheck does it
As far as I know, LazySmallcheck2012 works by starting with all inputs as exceptions, and retrying a test with more refined inputs if any of those exceptions are thrown (until some depth limit is reached). In the above example it might have run the property on an input like throw A, which causes the exception A to be thrown; so it ran again with [] (which passed) and throw A : throw B (which throws B). The latter is refined to throw A : [] (which passed) and throw A : throw B : throw C (which throws C) and so on until throw A : throw B : throw C : throw D : [] finally satisfies the length check, and causes the head check to throw A; causing it to be retried with an input like -3 : throw A : throw B : throw C : [] which finally causes the test to fail, and that is our counterexample (with throw X shown as _).
How falsify might do it
Reading through https://well-typed.com/blog/2023/04/falsify/ it appears like falsify's sample trees could be used for lazy generation. From my understanding, each generator gets its own sub-tree; each node contains a PRNG sample which a generator can use directly; and calls to other generators can be given the child subtrees. It seems to me that we can tell which parts of the input were forced by a test, by keeping track of which nodes of the sample tree have been forced.
Unfortunately I can't think of an elegant, pure way to do this. We could wrap each node in a thunk which has the side-effect of inserting an ID in a set, then freeze a copy of the set after running the test, to see which nodes were forced. Ugly, but workable.
Once we know which parts of the same tree were used/unused, we need to map that over to actual counterexamples, which would require running the generator again. We'd need a pretty-printing mechanism that's similar to LazySmallcheck's, i.e. capable of inserting a _ in place of undefined values (e.g. by replacing the unused sample nodes with such printers, or exceptions, or something else).
I'm sure someone could think of a more elegant approach than this, but I at least think it's (a) desirable behaviour and (b) plausible to implement, given the way (I think) falsify works.
The text was updated successfully, but these errors were encountered:
As you already hint at, falsify is very much designed with lazy generation in mind; that's how it can generate infinite data structures such as functions, for example. Only the part of the value that the test needs is actually generated.
I think a pure and idiomatic way to do what you sketch above is something like this:
Then write a regular generator that can shrink Defined values to Undefined; that way, if the function under test never looks at the value, it can definitely shrink to Undefined and you get the behaviour you want, I tihnk.
Let me know how that works out; if it does work and it's useful, a PR would be great! :)
What is lazy generation
LazySmallcheck2012 (original, and my fork that works on newer GHCs) behaves in a way where data is only generated when the test forces its evaluation. (I just checked the original lazysmallcheck package, and it doesn't seem to behave this way.)
For example, consider this simple test in falsify:
This is asserting that every
lst :: [Int]
with more than three elements also has a first element that is at least10
. It fails (as expected):In this case falsify reports a minimal counterexample of
[0, 0, 0, 0]
. Whilst that's the smallest fully-normalised value, it is actually over-specified: the test never forces the last three elements, so they are irrelevant to the outcome (only the spine and the first element are relevant). If we compare this to the same check in LazySmallcheck2012 (in GHCi):It gives
-3:_:_:_:[]
, AKA[-3, _, _, _]
. Here_
indicates a value which was never forced, so never ran its generator, and is hence irrelevant to the cause of the test failure. Unfortunately in this case the first value (-3
) isn't minimal, since (Lazy)Smallcheck doesn't do any shrinking; the ideal counterexample would be[0, _, _, _]
.Why lazy generation
There are a few reasons this behaviour is nicer, as a user, than seeing fully normalised counterexamples:
0
, are ambiguous: sometimes it means they can be ignored, but sometimes there is significance to that value. Using_
for the former removes some ambiguity._
indicates a problem with our control flow_
instead of a particular value avoids accidental coincidences that we can waste time ruling out, e.g.[0,0,0,0]
contains duplicate elements, but that's just a coincidence and isn't related to the failure.Bool
can already be proved by exhaustion; lazy generators complement this, since a property which passes without forcing some part of its input must therefore be true for all values of that part.There may be performance benefits, by avoiding a bunch of generators running; but I haven't found that particularly noticable.
How LazySmallcheck does it
As far as I know, LazySmallcheck2012 works by starting with all inputs as exceptions, and retrying a test with more refined inputs if any of those exceptions are thrown (until some depth limit is reached). In the above example it might have run the property on an input like
throw A
, which causes the exceptionA
to be thrown; so it ran again with[]
(which passed) andthrow A : throw B
(which throwsB
). The latter is refined tothrow A : []
(which passed) andthrow A : throw B : throw C
(which throwsC
) and so on untilthrow A : throw B : throw C : throw D : []
finally satisfies thelength
check, and causes thehead
check to throwA
; causing it to be retried with an input like-3 : throw A : throw B : throw C : []
which finally causes the test to fail, and that is our counterexample (withthrow X
shown as_
).How falsify might do it
Reading through https://well-typed.com/blog/2023/04/falsify/ it appears like falsify's sample trees could be used for lazy generation. From my understanding, each generator gets its own sub-tree; each node contains a PRNG sample which a generator can use directly; and calls to other generators can be given the child subtrees. It seems to me that we can tell which parts of the input were forced by a test, by keeping track of which nodes of the sample tree have been forced.
Unfortunately I can't think of an elegant, pure way to do this. We could wrap each node in a thunk which has the side-effect of inserting an ID in a set, then freeze a copy of the set after running the test, to see which nodes were forced. Ugly, but workable.
Once we know which parts of the same tree were used/unused, we need to map that over to actual counterexamples, which would require running the generator again. We'd need a pretty-printing mechanism that's similar to LazySmallcheck's, i.e. capable of inserting a
_
in place of undefined values (e.g. by replacing the unused sample nodes with such printers, or exceptions, or something else).I'm sure someone could think of a more elegant approach than this, but I at least think it's (a) desirable behaviour and (b) plausible to implement, given the way (I think) falsify works.
The text was updated successfully, but these errors were encountered: