
Adding an xarray wrapper with apply_ufunc #15

Merged
merged 25 commits into from
Jun 15, 2022

Conversation

jbusecke
Contributor

@jbusecke jbusecke commented May 12, 2022

This draft PR represents the progress @TomNicholas and I made today.

We achieved the following:

Testing the output for given shapes of input

We wrote a little script (dev_numpy_wrapper.py) that tests the return shape of values from the 'raw' f2py function.

The results can be summarized like this:

# original: (200, 200, 10)    | output: (200, 200, 10)
# original: (200, 200, 1)     | output: (200, 200, 1)
# original: (200, 200)        | output: (200, 200, 1)
# original: 200               | output: (200, 1, 1)
# original: ()                | output: (1, 1, 1)
# original: (1, 1, 1)         | output: (1, 1, 1)
# original: (200, 200, 10, 4) | ValueError: too many axes: 4 (effrank=4), expected rank=3
# original: (200, 200, 0)     | ValueError: unexpected array size: new_size=40000, got array with arr_size=0
# original: (200, 0, 3)       | ValueError: unexpected array size: new_size=40000, got array with arr_size=0

From this you can see that any input with at most 3 dims/axes returns a 3D array, while more than 3 dims/axes or a zero-size axis results in an error.

We concluded from this that the easiest way to use apply_ufunc would be to ensure len(dims) == 3 before invoking xr.apply_ufunc, because handling changes in array dimensionality is complicated. Our high-level approach for now is to simply assert that input arrays have 3 dimensions after broadcasting. In the future we could expand along dummy dimensions, pass to apply_ufunc, and then squeeze the dummy dimensions out again (so the user would get back the same shape of array that they put in).

Working with chunked dask arrays

We have an initial test running in dev_xarray_wrapper.py that seems to successfully parallelize over a chunked array.

Outstanding issues:

  • Refactor this into a proper module + write tests
  • Check that results are 'transpose invariant'. If I am right that the calculation is in principle pointwise, it doesn't matter for the results which dimension the Fortran code internally iterates over. I will write a test that confirms that with some synthetic data. If this turns out to be right, we could tweak the performance of the Fortran code by transposing the arrays appropriately outside of the wrapper (and add a comment about this). If the results are NOT the same, we need to think more deeply about how to make sure the time dimension is identified (and then parallelized over, since at that point it will be a core dimension).
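A minimal version of such a transpose-invariance test (using a stand-in pointwise function rather than the real aerobulk routine) could look like this:

```python
import numpy as np

def pointwise_flux(a, b):
    """Stand-in for the aerobulk routine; any pointwise function behaves the same way."""
    return a * np.sqrt(np.abs(b))

rng = np.random.default_rng(42)
a = rng.normal(size=(4, 5, 6))
b = rng.normal(size=(4, 5, 6))

result = pointwise_flux(a, b)

# Transpose the inputs, compute, and transpose back: for a truly pointwise
# calculation the values must match no matter which axis is iterated first.
transposed = pointwise_flux(
    a.transpose(2, 0, 1), b.transpose(2, 0, 1)
).transpose(1, 2, 0)

np.testing.assert_allclose(result, transposed)
```

If the equivalent test passes with the Fortran routine swapped in, transposing for performance outside the wrapper is safe.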

@jbusecke
Contributor Author

UPDATE: I have added the xarray wrapper to flux.py and added some initial tests.

I believe the tests confirm that this is indeed true:

Check that results are 'transpose invariant'. If I am right that the calculation is in principle pointwise, it doesn't matter for the results which dimension the Fortran code internally iterates over. I will write a test that confirms that with some synthetic data. If this turns out to be right, we could tweak the performance of the Fortran code by transposing the arrays appropriately outside of the wrapper (and add a comment about this).

Which is great news! It means we can already use this wrapper to process our data in the cloud (as long as we are careful about units).

@jbusecke
Contributor Author

Hmm, interesting. The CI seems to be failing, but it worked for me locally. Could this be platform dependent?

@jbusecke
Contributor Author

Just expanded the testing matrix to include macOS to rule that out.

Member

@TomNicholas TomNicholas left a comment


Tests look good, at least without a known answer value to compare to.

Lots of other small things to do still here though - do you want me to push to this PR?

@jbusecke
Copy link
Contributor Author

Yeah, sure. You can make suggestions or push directly. I will probably try to work on it later; just let me know when you break for the weekend.

Oh, I think the CI problems are caused by the None kwargs in the numpy wrapper, so if we could solve this with positional arguments only, that might fix it.
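One hypothetical way around the None-kwarg problem (a sketch, not the PR's actual code; the default values and the commented-out f2py call are illustrative) is to resolve the optional kwargs to concrete defaults on the Python side, so that only concrete positional arguments ever cross the Fortran boundary:

```python
def noskin(sst, t_zt, hum_zt, u_zu, v_zu,
           slp=None, algo=None, zt=None, zu=None, niter=None):
    """Resolve optional kwargs here, in Python, so the compiled routine
    only ever sees positional values (never None)."""
    slp = 101000.0 if slp is None else slp     # illustrative default, Pa
    algo = "coare3p0" if algo is None else algo
    zt = 10 if zt is None else zt
    zu = 10 if zu is None else zu
    niter = 6 if niter is None else niter
    # A real wrapper would now call the f2py routine positionally, e.g.:
    # return mod_aerobulk_wrap.aerobulk_model_noskin(
    #     sst, t_zt, hum_zt, u_zu, v_zu, slp, algo, zt, zu, niter)
    return (sst, t_zt, hum_zt, u_zu, v_zu, slp, algo, zt, zu, niter)
```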

@jbusecke
Contributor Author

Sorry, this should not have been closed; my bad.

@jbusecke jbusecke reopened this May 27, 2022
@jbusecke jbusecke mentioned this pull request Jun 9, 2022
@jbusecke
Contributor Author

jbusecke commented Jun 9, 2022

Ooof, this is really messed up. I am getting weird errors during test execution, yet the runs show up as passed even though only a subset of the tests actually pass.
I need to check this really carefully before this gets merged.
More tomorrow.


@jbusecke jbusecke marked this pull request as ready for review June 10, 2022 20:34
Contributor

@rabernat rabernat left a comment


This seems like a good way to keep the project moving forward.

There is a lot more that could be done to make the API more friendly and flexible towards different input configurations. I should be able to call this on 1D or 4D data and have it automatically expanded / contracted appropriately. The constraint to use 3 dimensions should really not be seen by the user.

However, I am absolutely fine with deferring those features to a future PR.


END SUBROUTINE AEROBULK_MODEL_NOSKIN

END MODULE mod_aerobulk_wrapper_noskin
Contributor

I think it would be simpler to put both functions (skin + noskin) in the same module / f90 file.

Contributor Author

So this seems to have caused the problems we had with pytest. The problem was actually not due to pytest, but arose when both Python wrappers were imported and called in the same module/script/notebook.
This was a brute-force solution, but I think the overhead is not too bad for now.
As much as I would like to understand the problem, I would rather have a working implementation now than sink more time into this somewhat obscure issue.


if len(sst.dims) < 3:
# TODO promote using expand_dims?
raise NotImplementedError
Contributor

Add an actual message to this error

Contributor Author

I refactored these warnings into a check_input decorator, which we can later use to implement logic to expand/reshape the input and revert that action after the values are returned.

Comment on lines +246 to +247
input_core_dims=[()] * 6,
output_core_dims=[()] * 5,
Contributor

@rabernat rabernat Jun 13, 2022


I'm curious about the lack of core dims. I think of 'time', 'lat', 'lon' as the core dims (input and output). Providing these would allow us to automatically work on arrays with more dimensions.

Contributor Author

On a theoretical level the calculation is pointwise (this test confirms that). For practical reasons we definitely do not want to define time as a core dim, since that would prevent us from using dask='parallelized' in most cases.
The restriction (of requiring 3D arrays) can be solved independently of the actual dimension names, so I think it is not appropriate to define core dims here.
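To illustrate the point about empty core dims with a toy example (not the PR's wrapper; `toy_flux` is a stand-in, and xarray must be installed): with `input_core_dims=[()]` per argument, apply_ufunc treats every dimension as a broadcast dimension, so a chunked time axis remains chunkable; declaring 'time' a core dim would instead force each dask chunk to hold the entire time axis.

```python
import numpy as np
import xarray as xr

def toy_flux(a, b):
    return a + b  # stand-in pointwise calculation

da = xr.DataArray(np.ones((2, 3, 4)), dims=["time", "lat", "lon"])

out = xr.apply_ufunc(
    toy_flux,
    da,
    da,
    input_core_dims=[(), ()],  # no core dims: everything broadcasts
    output_core_dims=[()],
)
# out keeps the original dims and shape, and every dim stays chunkable
```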

@jbusecke
Contributor Author

I made #28 to serve as a reminder of @rabernat's concerns. Do you think we can merge this for now? Keen to get this released before our hack tomorrow.

cc @TomNicholas

algo,
zt,
zu,
niter,
):
"""Python wrapper for aerobulk without skin correction.
!ATTENTION If input not provided in correct units, will crash.
Member

This is a very vague warning. I would prefer that we actually listed the allowed value ranges under each input in the docstring.

Member

I also don't think we should describe this error in terms of "correct units". The function expects the values to be expressed in certain units, AND the function will error if the values given are outside some (completely arbitrary) range.

Member

Also you should be consistent with using the Warnings RST block.

Comment on lines +165 to +167
test_arg = args[0]  # assuming that all the input shapes are the same size. TODO: More thorough check
Member

Why are we not just looping over all the input arguments?

@jbusecke
Contributor Author

We need this to be available on conda for today's hack, so we will merge now, but record the concerns raised here in issues.

Successfully merging this pull request may close these issues: xarray wrapper.