Deep integration between Hypothesis and py.test is currently impossible #916

Open
DRMacIver opened this Issue Aug 5, 2015 · 45 comments


@DRMacIver

Context: I write Hypothesis, a randomized testing library for Python. It works "well" under py.test, but only in the sense that it ignores py.test almost completely other than doing its best to expose functions in a way that py.test fixtures can understand.

A major problem with using Hypothesis under py.test is that function-level fixtures get evaluated once per top-level test function, not once per example. When such a fixture is mutable and the test mutates it, this is really bad: the test body runs many times against the same fixture instance, mutating it further on each example.
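
A hedged sketch of that failure mode (the fixture and test names here are invented for illustration):

# Sketch only: demonstrates the scoping mismatch described above.
import pytest
from hypothesis import given, strategies as st

@pytest.fixture
def bucket():
    # Function-scoped: py.test builds this once per collected test function...
    return []

@given(x=st.integers())
def test_append(bucket, x):
    # ...but Hypothesis calls the test body once per generated example, so
    # every example after the first sees what earlier examples appended.
    bucket.append(x)
    assert len(bucket) == 1  # fails from the second example onwards

(Recent Hypothesis versions detect this combination and fail a health check for exactly this reason.)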

People keep running into this as an issue, but currently it seems to be impossible to fix without significant changes to py.test. @RonnyPfannschmidt asked me to write a ticket about this as an example use-case of subtests, so here I am.

So what's the problem?

A test using Hypothesis looks something like:

@given(b=integers())
def test_some_stuff(a, b):
    ...

This translates into something approximately like:

def test_some_stuff(a, b=special_default):
    if b == special_default:
        for b in examples():
            ...
    else:
        ...

The key problem here is that examples() cannot be evaluated at collect time because it depends on the results of previous test execution.

The reasons for this, in decreasing order of "this seems to be impossible" (i.e. with py.test's current feature set I have no idea how to solve the first and neither does anyone else, the second could maybe be solved, and the third could definitely be worked around):

  1. The fundamental blocker is that this is a two-phase process (see the sketch after this list). You've got an initial generate phase, but then if a failure is found you have a "simplify" phase, which runs multiple simplify passes over the failing example. The space of possible examples to explore here is essentially infinite and depends intimately on the structure of the failing test.
  2. The number of examples run depends both on timing (Hypothesis stops running examples after a configurable timeout) and on what the test does. In particular, tests can throw an UnsatisfiedAssumption exception, which causes the example not to count towards the maximum number of examples to run (there is an additional, larger cap which does count these).
  3. Some examples may be skipped if they come from the same batch as something that produced an UnsatisfiedAssumption error.
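
To make the two-phase structure in point 1 concrete, here is a very rough sketch of the control flow; all the names in it (run_test, generate_examples, simplify_passes) are invented and the real Hypothesis engine is considerably more involved:

# Very rough sketch of the generate/simplify loop; every name is hypothetical.
def run_property(run_test, generate_examples, simplify_passes):
    for example in generate_examples():              # phase 1: generation
        try:
            run_test(example)
        except Exception:
            # phase 2: shrinking - which candidates get tried depends on
            # which example failed and how, so the full set of "subtests"
            # cannot be known at collect time.
            for candidate in simplify_passes(example):
                try:
                    run_test(candidate)
                except Exception:
                    example = candidate              # keep the smaller failing example
            run_test(example)                        # re-raise with the minimal example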
@The-Compiler
Member

There's a somewhat similar issue in pytest-dev/pytest-qt#63 - it adds a way for pytest-qt to test Qt models and ensure they behave correctly.

The tests can't easily change the data in the model (a model has a defined interface for getting data out of it, but not necessarily for adding/removing/changing data), so the approach of the original C++ tests is to re-run all checks whenever the model changes: you "attach" the tester, then make the model do something, and the checks rerun as soon as the model changes.

I've not found a satisfying way to do that yet, since the tests aren't known at collection time. What the code currently does is provide a qtmodeltester.setup_and_run(model) method which runs the tests once and listens for changes; the user then modifies the model as part of their (single) test.

This, however, poses several problems, e.g. how to tell the user which of the "sub-tests" failed, which tests ran, etc.
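
For reference, a sketch of that workflow (setup_and_run is the name used in the current code as described above; whatever API eventually ships in pytest-qt may look different):

# Sketch only: assumes a qtmodeltester fixture exposing the setup_and_run()
# method described above; the released pytest-qt API may differ.
from PyQt5.QtCore import QStringListModel

def test_my_model(qtmodeltester):
    model = QStringListModel(["a", "b", "c"])
    # Runs the model checks once and attaches listeners to the model's
    # change signals so the checks re-run on every modification.
    qtmodeltester.setup_and_run(model)
    # The user then drives the model from their (single) test; each change
    # triggers the attached checks again.
    model.setStringList(["a", "b"])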

/cc @nicoddemus

@RonnyPfannschmidt
Contributor

@The-Compiler I think your use-case is fundamentally different

as far as I understand, @DRMacIver needs sub-test-level operations (setup/teardown),
while you need something that's more like a set of attached checks that run per model change

@The-Compiler
Member

I think both use-cases would be satisfied by having a way to generate new tests (or sub-tests) while a test is running. Then pytest would take care of running the new tests and handling setup/teardown for each one.

@untitaker
Contributor

Generating new first-class tests while the tests are already running would be awkward for the UI, so I think subtests are the only option (for a start, only the parent test would be visible in the UI).

I wonder if, for Hypothesis' case, there's an upper bound on the number of test runs needed that can be determined at collection time.

@DRMacIver

There isn't right now, but one could be introduced. However, it's going to be somewhere between 10 and 100 times larger than the typical number of runs.

@DRMacIver

Also note that Hypothesis in default configuration runs 200 subtests per test as part of its typical run, so if you want to display those in the UI it's already going to be um, fun.
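
(For context, the per-test example count is configurable via Hypothesis settings; a sketch, noting that the exact settings API has been renamed across Hypothesis versions:)

# Sketch of configuring the per-test example count.
from hypothesis import given, settings, strategies as st

@settings(max_examples=20)          # the default at the time of this issue is 200
@given(x=st.integers())
def test_small_run(x):
    assert isinstance(x, int)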

@untitaker
Contributor

I see. The idea was, as a workaround, to generate as many test cases as Hypothesis could possibly need, and then just skip the ones that are not needed.

@DRMacIver

Yeah, I figured it would be something like that. It's... sort of possible, but the problem is also that Hypothesis can't really know in advance what each example is going to be, so there'd have to be a bunch of work to match the two up. I think I would rather simply not support the feature than use this workaround.

@untitaker
Contributor

I'm currently fooling around with this. Would it be an OK API if there's a way to instantiate sub-sessions (on the same config)?

@nicoddemus
Member

@untitaker you mean subtests (#153)? or something else?

@untitaker
Contributor

No, I meant to actually instantiate a new _pytest.Session within the existing test session. Nevermind, it seems to be unnecessary.

Meanwhile I've come up with https://gist.github.com/untitaker/49a05d4ea9c426b179e9; it works for function-scoped fixtures only.
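
Roughly, the idea is a fixture that builds a fresh pytest Function item around a callback and pushes it through the normal runtest protocol. A simplified reconstruction (not the gist verbatim, and it leans on pytest internals that are not a stable API):

# Simplified reconstruction of the approach; Function and runtestprotocol
# are pytest internals and may change between releases.
import pytest
from _pytest.python import Function
from _pytest.runner import runtestprotocol

@pytest.fixture
def subtest(request):
    parent_item = request.node

    def run(func):
        # Build a brand-new test item around `func`; its function-scoped
        # fixtures are set up and torn down independently of the outer test.
        item = Function(
            name=parent_item.name + "[subtest]",
            parent=parent_item.parent,
            callobj=func,
        )
        runtestprotocol(item, log=False, nextitem=None)

    return run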

@RonnyPfannschmidt
Contributor

@untitaker that looks pretty much like what I mean by subtests, however the way it's implemented might add extra unnecessary setup/teardown cost due to nextitem

@untitaker
Contributor

I'm not sure if we can set nextitem properly without changes to at least Hypothesis.

@DRMacIver

I'm not expecting this to work automatically. :-) Hypothesis doesn't depend on py.test by default, but I can either hook into things from the hypothesis-pytest plugin or provide people with a decorator they can use to make this work (the former would be better).

What sort of unnecessary teardown/setup cost did you have in mind? Does it just run the fixtures an extra time?

@untitaker
Contributor

Currently it seems that module-level fixtures are set up and torn down for each subtest. I wonder if that's because of the incorrect nextitem value.

@DRMacIver

Ah, yes, that would be unfortunate.

@RonnyPfannschmidt
Contributor

@untitaker that's exactly the problem, but I consider that a pytest bug - unfortunately it's a structural one, so hard to fix before 3.0

as a hack you could perhaps use the parent as the next item; that way the teardown_towards mechanism should keep things intact
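
in terms of the sketch above that's basically:

# passing the outer (real) test item as nextitem means teardown only unwinds
# fixtures not shared with it, so module/session-scoped fixtures survive
# across subtests
runtestprotocol(item, log=False, nextitem=parent_item)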

@RonnyPfannschmidt
Contributor

@untitaker in future I'd like to see a subtest mechanism help with those details

@untitaker
Contributor

I'm currently experimenting with this; I fear that it might leak state to subsequent test functions in different modules/classes.

@RonnyPfannschmidt
Contributor

the state leak should be prevented by the outer runtest_protocol of the actual test function

since it does a teardown_towards with a next item there, cleanup should be expected,
but to ensure it works, an acceptance test with a fnmatch_lines check is needed

@untitaker
Contributor

I've updated the gist.

@untitaker
Contributor

BTW should this hack rather go into hypothesis-pytest for trying it out, or do you already want to stabilize an API in pytest?

@untitaker
Contributor

Also I'd like to hide the generated tests from the UI.

@DRMacIver

Yeah I was just about to ask if there was a way to do that. This looks great (just tried it locally), but I'd rather not spam the UI with 200 tests, particularly for people like me who typically run in verbose mode.

@RonnyPfannschmidt
Contributor

@untitaker it should go into something external, and later on we should figure out a proper feature so we can kill it off

@DRMacIver the proper solution is still a bit away (it would hide the number of sub-tests)

however making that happen is a bit major, and between personal life and a job I can't make any promises for quick progress

right now I'm not even putting the needed amount of time into the pytest-cache merge and the yield test refactoring

@DRMacIver

That's of course totally fine. Life always takes priority over free work. I'm also probably not going to be that active on the Hypothesis side of this in the near term.

@untitaker
Contributor

I don't think the number of subtests is relevant -- Hypothesis would bump this count by hundreds for each test case. I'd like to see exactly the same UI as before.

@untitaker
Contributor

To clarify, the number of collected tests in the pytest UI is still 1 with that gist.

@RonnyPfannschmidt
Contributor

@untitaker subtests are named at test item execution time, not collection time

it's needed for fixing #16 as well (and it's perfectly fine to collect and report exactly one item in most cases)

@RonnyPfannschmidt
Contributor

(for Hypothesis, for example, it would be very sensible to pick and choose which of the subtests to report in error cases (i.e. the minimized, still-failing examples), and generally hide the non-error cases)

@DRMacIver

One thing to note there: at some point I would like to start reporting multiple distinct errors per test when multiple are found, even though Hypothesis currently reports only one error per test.

@DRMacIver

So I'm looking into this approach and I don't think it works. The problem is that the inner tests seem to run after the outer test has executed, which makes it impossible for minimize to work (and also to stop execution when the first failure is found). The behaviour Hypothesis needs is to run the test function and get an exception immediately if it fails.
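
My assumption about what's needed (not necessarily how the gist does or should do it): the subtest runner has to inspect the reports that runtestprotocol returns and re-raise on failure while the outer test is still executing. Roughly:

# Sketch of surfacing a subtest failure synchronously; run_and_raise is an
# invented helper name, not an existing API.
from _pytest.runner import runtestprotocol

def run_and_raise(item, nextitem):
    for report in runtestprotocol(item, log=False, nextitem=nextitem):
        if report.failed:
            # Raising here lets Hypothesis see the failure while the outer
            # test is still running, so it can stop generating examples and
            # start the simplify phase for this one.
            raise AssertionError(str(report.longrepr))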

@untitaker
Contributor

Try again? I'm not sure whether I'm missing some critical setup now, though.

@DRMacIver

Yes, this seems to work. FWIW my test for it working is just to add "assert len(s) <= 2" in test_inner: What should happen is that it prints out a falsifying example of s='000', and it does. Yay!

@DRMacIver

Although actually it's not quite right: something you're doing is interfering with output capturing, so it prints the falsifying example in the wrong place. It normally appears in the captured output. (I have a pytest plugin for better reporting integration, so this isn't intrinsically a problem; I could just add it to the report in the same way.)

@untitaker
Contributor

Oh dear...

@untitaker
Contributor

Well, I think I've reached a dead end with trying random hooks! 😆

@untitaker
Contributor

As far as I can tell, pytest's capturing stuff is not cleanly nestable.

@RonnyPfannschmidt
Contributor

currently capture is not nest-able; I have a rough plan to change that, but it needs discussion with @hpk42 and about 2 days of work to build a proof of concept - so it's definitely not a soon item

@untitaker
Contributor

I'm considering putting my hack above into a new package, as I need it for one of my projects. Has a better solution appeared since then, or is there some other reason (other than shitty diagnostics) I shouldn't use it?

@RonnyPfannschmidt
Contributor

There is no better solution yet, but please document it as a hack that might break between pytest minor releases.

@untitaker
Contributor

I've now published https://github.com/untitaker/pytest-subtesthack

There's also an experimental drop-in replacement for given, https://gist.github.com/untitaker/49a05d4ea9c426b179e9, but it's extremely buggy because it's too simple. I don't want to replicate all the argspec-juggling logic in hypothesis :(

@DRMacIver

Yeah, that's fair. I wrote it, and I don't want to replicate all the argspec juggling logic in Hypothesis :-)

@untitaker
Contributor

I wonder if there's a way to split given up into unstable internal APIs such that I can hook into it better.

@DRMacIver

I'm open to suggestions. It's not totally obvious where that would be though.
