
Public hypothesis strategies for generating xarray data #6911

Open
TomNicholas opened this issue Aug 12, 2022 · 0 comments · May be fixed by #6908
Labels
enhancement topic-hypothesis Strategies or tests using the hypothesis library topic-testing


TomNicholas commented Aug 12, 2022

Proposal

We should expose a public set of hypothesis strategies for use in testing xarray code. It could be useful for downstream users, but also for our own internal test suite. It should live in xarray.testing.strategies. Specifically perhaps

  • xarray.testing.strategies.variables
  • xarray.testing.strategies.dataarrays
  • xarray.testing.strategies.datasets
  • (xarray.testing.strategies.datatrees ?)
  • xarray.testing.strategies.indexes
  • xarray.testing.strategies.chunksizes following dask.array.testing.strategies.chunks
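Once public, such strategies could be used in downstream tests roughly like this. This is a self-contained sketch assuming only `hypothesis` and `numpy`: the `variables` strategy below is a hand-rolled stand-in that generates `(dims, ndarray)` pairs, not the real (proposed) `xarray.testing.strategies.variables` API.

```python
# Sketch of a "variables-like" strategy and how a user might consume it.
# The name `variables` and its parameters are illustrative assumptions,
# not the final xarray interface.
import string

import numpy as np
from hypothesis import given, settings
from hypothesis import strategies as st
from hypothesis.extra.numpy import arrays

# Dimension names: short lowercase strings, kept simple for readability.
dim_names = st.text(alphabet=string.ascii_lowercase, min_size=1, max_size=3)


@st.composite
def variables(draw, max_ndim=3, max_side=4):
    """Generate a (dims, np.ndarray) pair with one dim name per axis."""
    ndim = draw(st.integers(min_value=0, max_value=max_ndim))
    dims = draw(st.lists(dim_names, min_size=ndim, max_size=ndim, unique=True))
    shape = tuple(draw(st.integers(1, max_side)) for _ in range(ndim))
    data = draw(arrays(dtype=np.float64, shape=shape))
    return tuple(dims), data


@given(variables())
@settings(max_examples=10)
def test_dims_match_data(var):
    dims, data = var
    assert len(dims) == data.ndim


test_dims_match_data()
```

A real implementation would return `xr.Variable` objects, but the property-based test pattern (`@given(variables())`) is the same.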

This issue is different from #1846 because that issue describes how we could use such strategies in our own testing code, whereas this issue is for how we create general strategies that we could use in many places (including exposing publicly).

I've become interested in this as part of wanting to see #6894 happen. #6908 would effectively close this issue, but itself is just a pulled out section of all the work @keewis did in #4972.

(Also xref #2686. Also also @max-sixty didn't you have an issue somewhere about creating better and public test fixtures?)


Previous work

I was pretty surprised to see this comment by @Zac-HD in #1846

@rdturnermtl wrote a Hypothesis extension for Xarray, which is at least a nice demo of what's possible.

given that we might have just used that instead of writing new ones in #4972! (@keewis had you already seen that extension?)

We could literally just include that extension in xarray and call this issue solved...


Shrinking performance of strategies

However I was also reading about strategies that shrink yesterday and think that we should try to make some effort to come up with strategies for producing xarray objects that shrink in a performant and well-motivated manner. In particular by pooling the knowledge of the @xarray-dev core team we could try to create strategies that search for many of the edge cases that we are collectively aware of.

My understanding of that guide is that our strategies ideally should:

  1. Quickly include or exclude complexity

    For instance `if draw(booleans()):  # then add coordinates to the generated dataset`

    It might also be nice to have strategy constructors which allow passing other strategies in, so the user can choose how much complexity they want their strategy to generate. e.g. I think a signature like this should be possible

    from hypothesis import strategies as st

    import xarray as xr

    @st.composite
    def dataarrays(
        draw,  # composite strategies receive `draw` as their first argument
        data: xr.Variable | st.SearchStrategy[xr.Variable] | duckarray | st.SearchStrategy[duckarray] | None = ...,
        coords: ...,
        dims: ...,
        attrs: ...,
        name: ...,
    ) -> xr.DataArray:  # the decorated `dataarrays` returns a SearchStrategy[xr.DataArray]
        """
        Hypothesis strategy for generating arbitrary DataArray objects.

        Parameters
        ----------
        data
            Can pass a concrete value of an appropriate type (i.e. `Variable`, `np.ndarray` etc.),
            or pass a strategy which generates such types.
            Default is that the generated DataArray could contain any possible data.
        ...
        (similar flexibility for other constructor arguments)
        """
        ...
  2. Deliberately generate known edge cases

    For instance deliberately create:

    • dimension coordinates,
    • names which are Hashable but not strings,
    • multi-indexes,
    • weird dtypes,
    • NaNs,
    • duckarrays instead of np.ndarray,
    • inconsistent chunking between different variables,
    • (any other ideas?)
  3. Be very modular internally, to help with "keeping things local"

    Each sub-strategy should be in its own function, so that hypothesis' decision tree can cut branches off as soon as possible.

  4. Avoid obvious inefficiencies

    e.g. not .filter(...) or assume(...) if we can help it, and if we do need them then keep them in the same function that generates that data. Plus just keep all sizes small by default.
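The four points above can be sketched in one toy composite strategy. This is a minimal, self-contained illustration assuming `hypothesis` and `numpy`; the helper names `names`, `small_arrays`, and `datasets_like` are hypothetical, and plain dicts stand in for real xarray objects.

```python
# Toy strategy applying points 1-4: complexity gated by a single boolean
# draw, each sub-strategy in its own function, edge-case names generated
# deliberately, and no .filter()/assume() anywhere.
import string

import numpy as np
from hypothesis import given, settings
from hypothesis import strategies as st
from hypothesis.extra.numpy import arrays


def names():
    # Point 2: deliberately include names that are Hashable but not strings.
    return st.one_of(
        st.text(string.ascii_lowercase, min_size=1, max_size=3),
        st.integers(),
        st.tuples(st.integers()),
    )


def small_arrays():
    # Point 4: keep sizes small by construction rather than filtering
    # oversized draws after the fact.
    return arrays(dtype=np.float64, shape=st.tuples(st.integers(1, 3)))


@st.composite
def datasets_like(draw):
    # Point 3: names() and small_arrays() live in separate functions so
    # hypothesis' decision tree can prune each branch independently.
    n_vars = draw(st.integers(1, 2))
    data = {draw(names()): draw(small_arrays()) for _ in range(n_vars)}
    # Point 1: include or exclude the "coords" complexity with one boolean,
    # so shrinking can discard it in a single step.
    coords = {"x": np.arange(3)} if draw(st.booleans()) else {}
    return data, coords


@given(datasets_like())
@settings(max_examples=10)
def test_data_is_1d(ds):
    data, coords = ds
    assert all(arr.ndim == 1 for arr in data.values())


test_data_is_1d()
```

Because the coordinate complexity hangs off one `booleans()` draw, a failing example shrinks toward "no coords" immediately instead of shrinking each coordinate separately.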

Perhaps the solutions implemented in #6894 or this hypothesis xarray extension already meet these criteria - I'm not sure. I just wanted a dedicated place to discuss building the strategies specifically, without it getting mixed in with complicated discussions about whatever we're trying to use the strategies for!

@TomNicholas TomNicholas added enhancement topic-testing topic-hypothesis Strategies or tests using the hypothesis library labels Aug 12, 2022
@TomNicholas TomNicholas linked a pull request Sep 2, 2022 that will close this issue