Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for byte shuffle as a codec so that it can be used as a Zarr filter #273

Merged
merged 33 commits into from Mar 18, 2021

Conversation

pbranson
Copy link
Contributor

@pbranson pbranson commented Feb 17, 2021

PR for a byte shuffle codec as discussed in #260

Became too difficult (for me atleast) to see a way forward making use of the optimised blosc-c shuffle.

This add a dependency on numba which may not be acceptable, in which case perhaps a cython implementation of the shuffle functions will be required. Would appreciate advice if that is the case

TODO:

  • Unit tests and/or doctests in docstrings
  • tox -e py39 passes locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • tox -e docs passes locally
  • GitHub Actions CI passes
  • Test coverage to 100% (Coveralls passes)

@pep8speaks
Copy link

pep8speaks commented Feb 17, 2021

Hello @pbranson! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 95:1: E402 module level import not at top of file

Comment last updated at 2021-03-18 13:41:16 UTC

@joshmoore
Copy link
Member

@pbranson : The wheel GH actions don't make use of requirements.txt. I think you'll need to register numba in setup.py as well.

@pbranson
Copy link
Contributor Author

pbranson commented Mar 2, 2021

Thanks for the suggestion @joshmoore. The wheels now build for all except Python 3.9

@joshmoore
Copy link
Member

Interesting:

ModuleNotFoundError: No module named 'numpy'

Looks like NumPy is missing somewhere, but I haven't tracked down where yet.

@joshmoore
Copy link
Member

At least for Windows/3.9, it looks like it's failing on trying to build the numba wheel. If there's no wheel available, perhaps installing numba via conda would help? But this raises the issue of whether end users will face the same issue.

@pbranson
Copy link
Contributor Author

pbranson commented Mar 3, 2021

Looks like the build problems with 3.9 as numba does not yet have a release with 3.9 support (see numba/numba#6345). This is coming soon in numba 0.53.0

However, it would probably be good to get some consensus on whether including numba as a dependency is something acceptable to for numcodecs: @jakirkham, @alimanfoo do you have any guidance on that front?

@martindurant
Copy link
Member

I would suggest, to push this forward, that numba not be a dependency, but only needed when instantiating this filter, i.e., using any other filters works fine without numba installed. However, numba should be in the tests, and installed from conda. There is a good reason for the existence of conda: you can include complex binary dependencies like LLVM outside of the python tree in a way that is really hard to achieve with pip.

How hard is it to write shuffle cython, like the other filters?

Copy link
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank for looking into this Paul! 😄

Unfortunately have a few concerns with this implementation noted below. There are probably a few different ways to address this. I think the easiest will likely be adding a new optional dependency for this functionality. Though maybe other solution like using NumPy directly may be viable (though I haven't dug deeply into this)

numcodecs/shuffle.py Outdated Show resolved Hide resolved
numcodecs/shuffle.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
@martindurant
Copy link
Member

I made a cython version in pbranson#1 that eliminated the dep problem. It's not quite done, but the tests pass. Almost identical code.

For licensing, presumably we need to just find a publicly free definition of shuffle - I'm sure HDF didn't invent it. Actually I'm not sure what their license is anyway.

@pbranson
Copy link
Contributor Author

Thanks @martindurant and @jakirkham for taking a look at this!

The license for HSDS is apache 2.0 https://github.com/HDFGroup/hsds/blob/master/LICENSE which from my naive reading seems to suggest that inclusion of the copyright notice is all that is needed...

@martindurant Thanks a lot for the cython code! I have had a look at your PR - will have a go at merging it into this branch to resolve the numba dep

@pbranson
Copy link
Contributor Author

Yay, all ticks!

@pbranson
Copy link
Contributor Author

Heh, I went into the logs for the Windows CI and for all of the codecs that have that ImportError handling the tests are skipped...
https://github.com/zarr-developers/numcodecs/runs/2092033530#step:6:12

So on Windows only 108 out of a possible 593 tests are actually executed.... So I dont know if it works on those platforms! Doesnt seem satisfactory

@martindurant
Copy link
Member

cc ^ @zarr-developers/core-devs

@joshmoore
Copy link
Member

I assume related to #270

@martindurant
Copy link
Member

cc @JacksonMaxfield - it seems we get ImportErrors for any cython implementation on windows, and plenty of tests not running. Is this expected?

@evamaxfield
Copy link
Contributor

evamaxfield commented Mar 16, 2021

I know some of this is already said above but condensing into a single comment as I get caught up:

See here any of the Windows CI builds from my PR.

... collected 108 items / 6 skipped / 102 selected
... 107 passed, 6 skipped, 1 xfailed, 4 warnings in 7.60s

Did 400 tests get added to this repo in that past month? Did I miss something in my PR?

No that's not the case because even on that same PR the ubuntu and mac OS builds have the expected number of tests detected and ran.

... collected 533 items
... 494 passed, 39 xfailed, 4 warnings in 11.61s

Well then whats the difference between the windows build from the others?

Taking the step definition of the ci-macos and the ci-windows workflows to a text compare shows they are exactly the same....

So no differences in the CI workflow definition, and crucially it uses the same testing command as the others, so still confused as to what is going on.

There is this comment from my PR: #270 (comment)

So, if my memory is correct and understanding the context of that comment is correct, there was no Windows testing prior to my PR.

A bit of digging on "pytest fails to pick up Cython tests on Windows" got me to this stackoverflow

Which has a final update of:

How can I use pytest with Cython? How can I discover why these methods aren't working and then fix it?

FINAL UPDATE: After taking both the original problem and the question Updates into consideration (thanks hoefling for solving these issues!), this question is now reduced to the question of:

why can pytest no import the Cython module calculateScore, even though running the code just with python (no pytest) works just fine?

Looks like this has been an issue of pytest + cython + windows for a while?

@evamaxfield
Copy link
Contributor

One more update. Looked at that stackover again and saw there was a single upvoted answer down below that linked to this answer: https://stackoverflow.com/a/49068163

Which says to build the cython extension manually instead of using cythonize but that seems cumbersome?

@martindurant
Copy link
Member

The question is: can we be certain that these cythonized modules will work on windows? We know from the linux tests that the algorithms themselves are fine.

@hoefling
Copy link

I'm not sure why I was summoned by Github, but I took a look at the workfow YAMLs, and this is my guess for the different amount of tests on Win + MacOS on one side, and Linux on the other:

Linux: project is installed properly via pip install --editable ., Win + Mac: only setup.py build is executed. This leaves the extensions in the build dir, so they can't be imported. Whereas the editable install runs setup.py build_ext --inplace under the hood, which copies built extensions to the target package.

Second, Linux uses pytest -v --cov=numcodecs --doctest-modules --doctest-glob=*.pyx numcodecs which explicitly runs doctests, while Win + Mac uses pytest -v --pyargs numcodecs, so no doctests (they are not being run by default). why using pyargs at all here, BTW?

Hope this helps!

@evamaxfield
Copy link
Contributor

Hey @hoefling! Sorry for pinging you. You and the person who helped answer the question on StackOverflow have the same name (or you are the same person!) and when I copy-pasted the answer in to GitHub I forgot to remove the @ character but your solutions definitely work for running all of the tests on Windows and Mac!

Made a PR to fix this: #276

@pbranson
Copy link
Contributor Author

Thats awesome, thanks everyone!

For this PR should I change the codec registration to not handle the import error and we accept that the windows CI tests will fail until #276 is merged?

@hoefling
Copy link

Hey @hoefling! Sorry for pinging you. You and the person who helped answer the question on StackOverflow have the same name (or you are the same person!) and when I copy-pasted the answer in to GitHub I forgot to remove the @ character but your solutions definitely work for running all of the tests on Windows and Mac!

Ah, I get it now, funny thing :-) Absolutely no worries, happy to help!

@martindurant
Copy link
Member

I have merged #276 , so you can merge in, and then we can merge this too (note the conflict in release.rst).

@pbranson
Copy link
Contributor Author

OK I think this is ready!

docs/index.rst Outdated Show resolved Hide resolved
docs/release.rst Outdated Show resolved Hide resolved
@martindurant
Copy link
Member

Is it right that all the fixture .dat and .npy files show up as empty?

@pbranson
Copy link
Contributor Author

Yeah that is weird about the fixtures - they arent empty on my system - if you click to view file it shows as ~7kb raw. Actually now they are all showing correctly?!

@martindurant
Copy link
Member

Screen Shot 2021-03-18 at 11 44 42

But if I go to view raw and download it, it has content. I suppose the tests woudn't pass without them!

@martindurant martindurant merged commit f946d7d into zarr-developers:master Mar 18, 2021
@jakirkham
Copy link
Member

Yeah I think one has to manually add them. They shouldn't be empty, but we do have them in .gitignore I believe. @pbranson could you please open a new PR that adds these?

@martindurant
Copy link
Member

@jakirkham - the files really are there, they just didn't show in the diff for some reason
https://github.com/zarr-developers/numcodecs/blob/master/fixture/shuffle/array.00.npy

@jakirkham
Copy link
Member

Oh interesting did not realize that. Thanks for correcting me Martin 😄

@pbranson
Copy link
Contributor Author

Yay, thanks everyone for your help and time reviewing. I'm pretty new to this so has been a good learning experience!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants