
Public hypothesis strategies for generating xarray data #6911

Open
TomNicholas opened this issue Aug 12, 2022 · 0 comments · May be fixed by #6908
Labels
enhancement topic-hypothesis Strategies or tests using the hypothesis library topic-testing


TomNicholas commented Aug 12, 2022

Proposal

We should expose a public set of hypothesis strategies for use in testing xarray code. It could be useful for downstream users, but also for our own internal test suite. It should live in xarray.testing.strategies. Specifically perhaps

  • xarray.testing.strategies.variables
  • xarray.testing.strategies.dataarrays
  • xarray.testing.strategies.datasets
  • (xarray.testing.strategies.datatrees ?)
  • xarray.testing.strategies.indexes
  • xarray.testing.strategies.chunksizes following dask.array.testing.strategies.chunks
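Once public, such strategies could be used in downstream tests roughly like this. This is a self-contained sketch assuming only `hypothesis` and `numpy`: the `variables` strategy below is a hand-rolled stand-in that generates `(dims, ndarray)` pairs, not the real (proposed) `xarray.testing.strategies.variables` API.

```python
# Sketch of a "variables-like" strategy and how a user might consume it.
# The name `variables` and its parameters are illustrative assumptions,
# not the final xarray interface.
import string

import numpy as np
from hypothesis import given, settings
from hypothesis import strategies as st
from hypothesis.extra.numpy import arrays

# Dimension names: short lowercase strings, kept simple for readability.
dim_names = st.text(alphabet=string.ascii_lowercase, min_size=1, max_size=3)


@st.composite
def variables(draw, max_ndim=3, max_side=4):
    """Generate a (dims, np.ndarray) pair with one dim name per axis."""
    ndim = draw(st.integers(min_value=0, max_value=max_ndim))
    dims = draw(st.lists(dim_names, min_size=ndim, max_size=ndim, unique=True))
    shape = tuple(draw(st.integers(1, max_side)) for _ in range(ndim))
    data = draw(arrays(dtype=np.float64, shape=shape))
    return tuple(dims), data


@given(variables())
@settings(max_examples=10)
def test_dims_match_data(var):
    dims, data = var
    assert len(dims) == data.ndim


test_dims_match_data()
```

A real implementation would return `xr.Variable` objects, but the property-based test pattern (`@given(variables())`) is the same.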

This issue is different from #1846 because that issue describes how we could use such strategies in our own testing code, whereas this issue is for how we create general strategies that we could use in many places (including exposing publicly).

I've become interested in this as part of wanting to see #6894 happen. #6908 would effectively close this issue, but itself is just a pulled out section of all the work @keewis did in #4972.

(Also xref #2686. Also also @max-sixty didn't you have an issue somewhere about creating better and public test fixtures?)


Previous work

I was pretty surprised to see this comment by @Zac-HD in #1846

@rdturnermtl wrote a Hypothesis extension for Xarray, which is at least a nice demo of what's possible.

given that we might have just used that instead of writing new ones in #4972! (@keewis had you already seen that extension?)

We could literally just include that extension in xarray and call this issue solved...


Shrinking performance of strategies

However I was also reading about strategies that shrink yesterday and think that we should try to make some effort to come up with strategies for producing xarray objects that shrink in a performant and well-motivated manner. In particular by pooling the knowledge of the @xarray-dev core team we could try to create strategies that search for many of the edge cases that we are collectively aware of.

My understanding of that guide is that our strategies ideally should:

  1. Quickly include or exclude complexity

    For instance `if draw(booleans()):  # then add coordinates to the generated dataset`

    It might also be nice to have strategy constructors which allow passing other strategies in, so the user can choose how much complexity they want their strategy to generate. e.g. I think a signature like this should be possible

    from hypothesis import strategies as st

    import xarray as xr

    @st.composite
    def dataarrays(
        draw,  # composite strategies receive `draw` as their first argument
        data: xr.Variable | st.SearchStrategy[xr.Variable] | duckarray | st.SearchStrategy[duckarray] | None = ...,
        coords: ...,
        dims: ...,
        attrs: ...,
        name: ...,
    ) -> xr.DataArray:  # the decorated `dataarrays` returns a SearchStrategy[xr.DataArray]
        """
        Hypothesis strategy for generating arbitrary DataArray objects.

        Parameters
        ----------
        data
            Can pass a concrete value of an appropriate type (i.e. `Variable`, `np.ndarray` etc.),
            or pass a strategy which generates such types.
            Default is that the generated DataArray could contain any possible data.
        ...
        (similar flexibility for other constructor arguments)
        """
        ...
  2. Deliberately generate known edge cases

    For instance deliberately create:

    • dimension coordinates,
    • names which are Hashable but not strings,
    • multi-indexes,
    • weird dtypes,
    • NaNs,
    • duckarrays instead of np.ndarray,
    • inconsistent chunking between different variables,
    • (any other ideas?)
  3. Be very modular internally, to help with "keeping things local"

    Each sub-strategy should be in its own function, so that hypothesis' decision tree can cut branches off as soon as possible.

  4. Avoid obvious inefficiencies

    e.g. not .filter(...) or assume(...) if we can help it, and if we do need them then keep them in the same function that generates that data. Plus just keep all sizes small by default.
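The four points above can be sketched in one toy composite strategy. This is a minimal, self-contained illustration assuming `hypothesis` and `numpy`; the helper names `names`, `small_arrays`, and `datasets_like` are hypothetical, and plain dicts stand in for real xarray objects.

```python
# Toy strategy applying points 1-4: complexity gated by a single boolean
# draw, each sub-strategy in its own function, edge-case names generated
# deliberately, and no .filter()/assume() anywhere.
import string

import numpy as np
from hypothesis import given, settings
from hypothesis import strategies as st
from hypothesis.extra.numpy import arrays


def names():
    # Point 2: deliberately include names that are Hashable but not strings.
    return st.one_of(
        st.text(string.ascii_lowercase, min_size=1, max_size=3),
        st.integers(),
        st.tuples(st.integers()),
    )


def small_arrays():
    # Point 4: keep sizes small by construction rather than filtering
    # oversized draws after the fact.
    return arrays(dtype=np.float64, shape=st.tuples(st.integers(1, 3)))


@st.composite
def datasets_like(draw):
    # Point 3: names() and small_arrays() live in separate functions so
    # hypothesis' decision tree can prune each branch independently.
    n_vars = draw(st.integers(1, 2))
    data = {draw(names()): draw(small_arrays()) for _ in range(n_vars)}
    # Point 1: include or exclude the "coords" complexity with one boolean,
    # so shrinking can discard it in a single step.
    coords = {"x": np.arange(3)} if draw(st.booleans()) else {}
    return data, coords


@given(datasets_like())
@settings(max_examples=10)
def test_data_is_1d(ds):
    data, coords = ds
    assert all(arr.ndim == 1 for arr in data.values())


test_data_is_1d()
```

Because the coordinate complexity hangs off one `booleans()` draw, a failing example shrinks toward "no coords" immediately instead of shrinking each coordinate separately.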

Perhaps the solutions implemented in #6894 or this hypothesis xarray extension already meet these criteria - I'm not sure. I just wanted a dedicated place to discuss building the strategies specifically, without it getting mixed in with complicated discussions about whatever we're trying to use the strategies for!

@TomNicholas TomNicholas added enhancement topic-testing topic-hypothesis Strategies or tests using the hypothesis library labels Aug 12, 2022
@TomNicholas TomNicholas linked a pull request Sep 2, 2022 that will close this issue