Make testing more robust especially for random objects #1268

Open
nathanshammah opened this issue May 15, 2020 · 3 comments

@nathanshammah (Member)

Is your feature request related to a problem? Please describe.
A lot of test failures seem to arise from random objects (as well as from low-level math in MKL and/or Cython issues).

Describe the solution you'd like
There are several options at hand.

stick to pytest and be creative

A possible fix may be to add pre-generated random data that the tests then point to, or to fix the seeds.
Pros: fast (?)
Cons: technical debt.
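
As a concrete illustration of the fixed-seed approach, here is a minimal sketch (the fixture name, seed value, and tested property are illustrative, not existing QuTiP tests):

```python
import numpy as np
import pytest


@pytest.fixture
def rng():
    # A seeded generator makes the "random" inputs identical on every run;
    # the seed value itself is arbitrary.
    return np.random.default_rng(seed=1234)


def test_random_state_vector_is_normalised(rng):
    # Illustrative check on reproducible random data.
    amplitudes = rng.normal(size=8) + 1j * rng.normal(size=8)
    state = amplitudes / np.linalg.norm(amplitudes)
    assert abs(np.vdot(state, state) - 1.0) < 1e-12
```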

pytest-randomly plugin

pytest-randomly is a pytest plugin that addresses this kind of issue. It allows controlling random.seed, rather than numpy.random.seed.
Pros: pytest plugin, supports doctest.
Cons: not super popular, not designed for numpy
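
For reference, the seed is controlled from the command line once the plugin is installed; a brief sketch of the invocations (seed value arbitrary, flag names as documented by the plugin):

```shell
# Run the suite with a fixed, reproducible seed:
pytest --randomly-seed=1234

# Reuse whatever seed the previous run reported in its header:
pytest --randomly-seed=last

# Disable the plugin for a single run:
pytest -p no:randomly
```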

property-based testing with Hypothesis

Hypothesis is a library that aims at changing the way tests are designed, allegedly: instead of testing a single instance, you design a test that applies to a whole domain of instances (property-based testing). It is not super clear to me right now.
It contains various randomness-related features, including a seed function.

Pros: sounds powerful and clever, popular and growing, well documented, more robust even beyond these randomness problems.
Cons: radical change of testing framework (?), steep learning curve (?), overkill (?).
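
To make the property-based idea concrete, a minimal sketch of what such a test could look like (the tested property, strategy bounds, and tolerance are assumptions for illustration, not an existing QuTiP test):

```python
import numpy as np
from hypothesis import given
from hypothesis import strategies as st
from hypothesis.extra.numpy import arrays


# Property: for any complex matrix A, the matrix A + A^dagger is Hermitian,
# so its eigenvalues should be real up to floating-point error.
@given(
    arrays(
        dtype=np.complex128,
        shape=(4, 4),
        elements=st.complex_numbers(
            max_magnitude=10, allow_nan=False, allow_infinity=False
        ),
    )
)
def test_hermitian_part_has_real_eigenvalues(matrix):
    hermitian = matrix + matrix.conj().T
    eigenvalues = np.linalg.eigvals(hermitian)
    assert np.allclose(eigenvalues.imag, 0.0, atol=1e-8)
```

Rather than picking one random matrix, Hypothesis generates many matrices across the strategy's whole domain and shrinks any failing case to a minimal counterexample.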

I also admit that until recently I used nose / ran tests written with nose in mind under pytest, without taking advantage of pytest's full power.

@nathanshammah (Member, Author)

"Assigned" this just to stimulate discussion...

@jakelishman (Member)

I think hypothesis is the best method here in the long term, but it will most likely have to be a long-term goal. I think the main pro in favour of it is that it actually makes an attempt to remove randomness; it's attempting to comprehensively test a spanning set of input parameters, rather than just Monte-Carlo'ing our way through and hoping. There are a couple of points which make it difficult to implement:

  1. QuTiP can be quite fragile with respect to unexpected input formats, particularly in older parts of the code.
  2. Various components are only accurate up to some tolerance, and the error propagation to work out how that corresponds to useful measurable quantities can be rather tricky.

Those are certainly both solvable problems, and point 1 in particular is just general improvement of usability. The second point is about designing the tests well, which again is certainly doable, but will take a while (it takes long enough just to refactor them, let alone a total rewrite of large chunks of them!).

@hristog (Contributor) commented May 11, 2022

Hi @nathanshammah,

Could this issue perhaps be broken down into sub-tasks in some way, to make starting work on it a bit more feasible?

Also, I can see it's been labelled as a "good first issue", but it seems to me that the definition of done for the entire issue (i.e., what would a PR, or a set of PRs, that successfully addresses the problem entail?) could be clarified a bit further. A breakdown into sub-tasks (which might itself emerge organically from further discussion) could help in that direction.

Regarding possible approaches for handling randomness: I have to admit I've started looking into QuTiP only very recently, and I've yet to familiarise myself with its more intricate details, run the full set of tests, and investigate what kind of test failures occur. In the meantime, I'll share some experience that I think is relevant to handling randomness in tests:

  • My understanding is that we're not talking about testing the behaviour of random-number generators per se, but about how the numbers they generate affect the non-deterministic functions/algorithms that depend on them. In such cases, I've used a fixed seed (as you've mentioned in your original post), which can get a bit messy, since there are several sources of randomness to keep in sync: Python's random and NumPy's random, which itself comes in a legacy and a modern flavour (numpy.random.RandomState and numpy.random.Generator). Also, as you've mentioned, this creates tight coupling between the implementation logic and the corresponding set of tests, so updates to the former require constant maintenance of the latter.
    • As an alternative, I've sometimes resorted to mocking the random-number generation process itself and, in a way equivalent to using a fixed seed, providing a pre-defined sequence of numbers to the functions which depend on it. It comes with the same disadvantage of relatively tight coupling, but is somewhat more flexible and powerful than just setting the seeds (a minimal sketch of this idea follows after this list).
  • I have used Hypothesis in the past (on a relatively small-scale project), and it's indeed a great framework with a lot of interesting and nice functionality. I like it very much, but one disadvantage I ran into was the increased overall time to complete a test run. In my case, a viable approach was to run my Hypothesis tests (a subset of all tests) less frequently in the CI pipeline than the regular tests. Another concern I intuitively have (I may be wildly wrong on this one, not having yet properly explored QuTiP's testing landscape) is that, depending on how some Hypothesis tests are set up, one might eventually have to implement guards against inputs that would be virtually impossible to occur in practical scenarios. In any case, I'd strongly recommend a time-boxed experiment on a small subset of tests that are well suited to property-based testing, examining how it affects test-run completion time and whether it introduces more failures than would be desirable/cost-efficient.
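
A minimal sketch of the stubbing idea mentioned above (every name here is hypothetical, not existing QuTiP code; it only illustrates injecting a fake generator in place of seeding):

```python
import numpy as np
import pytest


def noisy_expectation(n_samples, rng=None):
    # Hypothetical stand-in for a non-deterministic routine under test.
    rng = np.random.default_rng() if rng is None else rng
    samples = rng.normal(loc=1.0, scale=0.1, size=n_samples)
    return float(np.mean(samples))


class StubRNG:
    """Fake generator that returns a pre-defined, deterministic sequence."""

    def normal(self, loc, scale, size):
        # Pretend every draw lands exactly on the mean.
        return np.full(size, loc)


def test_noisy_expectation_with_stubbed_rng():
    # With the stub injected, the expected result is exact and no seed is needed.
    assert noisy_expectation(4, rng=StubRNG()) == pytest.approx(1.0)
```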
