Outreachy project proposals (Oct. 2021) #39

Closed
joshmoore opened this issue Sep 23, 2021 · 2 comments
@joshmoore
Member

Zarr has been accepted as an Outreachy project for mentoring interns during a three-month paid project. Projects are proposed by the mentors and should balance the attractiveness and feasibility of the project with the learning effect for the interns. Projects must be submitted by September 29th. Comments and ideas welcome (though complete drafts are more likely to be used 😉).

Contribution section:

During the application period, interns make contributions to open-source projects to give mentors a chance to evaluate their skills & interest. This same block of text will likely be used for all proposals under the "Contribution" section:

Contributions to any of the repositories in the zarr-developers organization are welcome. Particularly of interest are the "help wanted" and "low-hanging-fruit" labels. Most of the issues will be on the zarr-python repo. If you are interested in getting involved in another language, see the list under "Zarr implementations" and let us know what you are interested in under https://github.com/zarr-developers/community/discussions.


Proposal 1. Add multi-language Zarr implementation tests (C/Julia/etc.)

Zarr is a chunked, compressed format for the storage of N-dimensional arrays like those created by NumPy. There are a number of Zarr implementations across a wide range of programming languages which all need to conform to the same specification. We would like to have Zarr tested across as many of these languages as possible in order to ensure conformance. Several languages are already well-represented in the Zarr test suite; others have not yet been added, including Julia (#42) and C (netcdf #35; or tensorstore from Google #20). Candidates should pick implementations in their order of preference and test them through several phases as described under “tasks”.

Tasks incl. stretch goals:

  • Phase 1: A small driver application will be needed in each language. Initially, the driver should create a Zarr file using the given library and evaluate if the output can be read by other implementations.
  • Phase 2: Extend the driver to reading Zarr files from all other implementations (#25), reporting (and perhaps fixing!) any failures.
  • Phase 3 (stretch): Add any test functions that may be of particular interest to a given implementation, reporting (and perhaps fixing!) any failures.
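
Phases 1 and 2 amount to a write/read matrix across implementations. A minimal sketch in Python of what such a harness might look like, assuming each implementation is wrapped as a pair of `write(path)` / `read(path)` callables. The names `IMPLEMENTATIONS` and `run_matrix` and the JSON stand-ins are hypothetical and not part of the existing test suite; a real driver would call the library under test instead of the stand-ins:

```python
import json
import tempfile
from pathlib import Path

# Reference data every writer is expected to store and every reader to return.
EXPECTED = [[0, 1, 2], [3, 4, 5]]


def _standin_write(path):
    """Stand-in writer; a real driver would call the library under test."""
    Path(path).write_text(json.dumps(EXPECTED))


def _standin_read(path):
    """Stand-in reader returning the stored values as nested lists."""
    return json.loads(Path(path).read_text())


def _broken_read(path):
    """Simulates a reader that cannot handle the file."""
    raise ValueError("unsupported metadata")


# Hypothetical registry: implementation name -> (write, read).
IMPLEMENTATIONS = {
    "impl_a": (_standin_write, _standin_read),
    "impl_b": (_standin_write, _broken_read),
}


def run_matrix(implementations, expected):
    """Have every writer produce a file and every reader try to read it back."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for writer_name, (write, _) in implementations.items():
            path = Path(tmp) / f"{writer_name}.data"
            write(path)
            for reader_name, (_, read) in implementations.items():
                try:
                    ok = read(path) == expected
                except Exception:
                    # Any reader failure counts as an incompatibility to report.
                    ok = False
                results[(writer_name, reader_name)] = ok
    return results


matrix = run_matrix(IMPLEMENTATIONS, EXPECTED)
for (writer_name, reader_name), ok in sorted(matrix.items()):
    print(f"{writer_name} -> {reader_name}: {'ok' if ok else 'FAIL'}")
```

Keeping the harness independent of any one library means a new language only has to supply its own writer and reader commands.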

Benefits:

This project will give the prospective intern the opportunity to try out several new languages while learning the basics of the Zarr format.

Community Benefits:

As the Zarr ecosystem grows, testing that all the implementations produce compatible data is critical for maintaining user trust.


Proposal 2. Build registry of Zarr codecs

Zarr is a chunked, compressed format for the storage of N-dimensional arrays like those created by NumPy. Each compression algorithm used by Zarr is assigned an identifier by the numcodecs library. In order to make these identifiers useful in other programming languages, issue #278 requires a registry to be defined that is machine-readable from multiple programming languages (C, Java, JavaScript, etc.).

Tasks incl. stretch goals:

  • Research and compare existing lists of compression filters (e.g. from HDF5 and imagecodecs).
  • Document all of the current codecs and their identifiers from the research in an issue.
  • Propose a format for the registry in an issue for community review.
  • Create and publish the registry in both machine-readable and human-friendly form.
  • Document the steps for others to update the registry.
  • Stretch goal: Write tests that show which codecs from each implementation are supported by other implementations. Display a matrix of the results.
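
One possible shape for a machine-readable registry is a plain JSON document, since most languages can parse it without extra dependencies. The sketch below is only an illustration: the schema (keys `codecs`, `id`, `description`), the helper names `load_registry` and `support_matrix`, and the support data are all assumptions, and the format actually chosen would come out of the community review. The identifiers shown are ones believed to be used by numcodecs:

```python
import json

# Hypothetical registry layout; the schema (keys "codecs", "id", "description")
# is an assumption for illustration, not an agreed format.
REGISTRY_JSON = """
{
  "codecs": [
    {"id": "zlib", "description": "DEFLATE via zlib"},
    {"id": "gzip", "description": "gzip container around DEFLATE"},
    {"id": "blosc", "description": "Blosc meta-compressor"},
    {"id": "zstd", "description": "Zstandard"}
  ]
}
"""


def load_registry(text):
    """Parse the registry and index entries by codec identifier."""
    entries = json.loads(text)["codecs"]
    by_id = {}
    for entry in entries:
        if entry["id"] in by_id:
            raise ValueError(f"duplicate codec id: {entry['id']}")
        by_id[entry["id"]] = entry
    return by_id


def support_matrix(registry, supported):
    """Map each registered codec id to the implementations that claim it."""
    return {
        codec_id: sorted(name for name, ids in supported.items() if codec_id in ids)
        for codec_id in registry
    }


registry = load_registry(REGISTRY_JSON)
# Made-up support data, only to show the shape of the stretch-goal matrix.
claimed = {"impl_a": {"zlib", "blosc", "zstd"}, "impl_b": {"zlib", "gzip"}}
matrix = support_matrix(registry, claimed)
for codec_id, names in matrix.items():
    print(codec_id, names)
```

Rejecting duplicate identifiers at load time is one way to keep the registry usable as a lookup table in every language.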

Benefits:

This project will give the prospective intern insight into low-level details of a number of file formats and compression techniques as well as the importance of standardization for keeping data readable.

Community Benefits:

The registry will serve as a critical bridge between a number of communities (different programming languages like C/Java/Python as well as multiple file formats like Zarr/NetCDF/HDF5).


Proposal 3. Benchmark Zarr implementations

Zarr is a chunked, compressed format for the storage of N-dimensional arrays like those created by NumPy. Performance when reading and writing very large arrays is critical for user acceptance. With issue #337, we would like to have benchmarks updated with each version of Zarr showing the long-term trend of various metrics, warning developers if a change causes too much slowdown. Airspeed Velocity (asv) is a likely tool for this project. (Example output can be seen in the aicsimageio project.)

Tasks incl. stretch goals:

  • Define a number of simple metrics for reading and writing Zarr files, asking the community for further suggestions. Examples include: writing arrays in a number of different sizes & shapes; reading those same arrays back in.
  • Implement these metrics in zarr-python and optionally in other programming languages available in zarr_implementations.
  • A further stretch goal would be to compare the same operations using HDF5 or NumPy files (.npy) as described in issue #519.
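
As a rough idea of what a metric could look like, asv benchmarks are typically written as classes whose `time_*` methods are timed, with `setup` run beforehand and `params` listing the values to sweep. The sketch below follows that convention but uses a stand-in workload (compressing bytes with `zlib` into a temporary file) so that it stays self-contained; the class name `TimeChunkRoundTrip` and the workload are assumptions, and a real benchmark would create and read Zarr arrays instead:

```python
import os
import tempfile
import zlib


class TimeChunkRoundTrip:
    """asv-style benchmark: asv times methods whose names start with ``time_``."""

    # asv runs each benchmark once per parameter value.
    params = [1_000, 100_000]
    param_names = ["n_bytes"]

    def setup(self, n_bytes):
        # Stand-in workload; a real benchmark would create a Zarr array here.
        self.tmpdir = tempfile.TemporaryDirectory()
        self.path = os.path.join(self.tmpdir.name, "chunk.bin")
        self.payload = bytes(i % 251 for i in range(n_bytes))
        # Write once so that time_read has a file to read.
        with open(self.path, "wb") as fh:
            fh.write(zlib.compress(self.payload))

    def teardown(self, n_bytes):
        self.tmpdir.cleanup()

    def time_write(self, n_bytes):
        with open(self.path, "wb") as fh:
            fh.write(zlib.compress(self.payload))

    def time_read(self, n_bytes):
        with open(self.path, "rb") as fh:
            zlib.decompress(fh.read())
```

Sweeping sizes through `params` matches the "different sizes & shapes" idea above; shapes and chunk layouts could be added as further parameters.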

Benefits:

This project will give the prospective intern the opportunity to learn the basics of the Zarr format and APIs as well as experiment with benchmarking, a useful tool for any programmer.

Community Benefits:

The community will benefit from a clear, visual representation of Zarr’s speed over time, which will identify areas for further optimization.


Proposal 4. Fuzz-test Zarr implementations

Zarr is a chunked, compressed format for the storage of N-dimensional arrays like those created by NumPy. Implementing libraries must reliably read the same data across a large number of parameters: a range of platforms and codecs, big and little endian systems, etc. Fuzz testing, “an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program”, will help to show that each library can cope with this diversity.

Tasks incl. stretch goals:

  • Choose a first Zarr library for testing together with the mentors (e.g. zarr-python).
  • Evaluate fuzz testing frameworks for the appropriate language and report back on a public issue.
  • Choose a framework and automate fuzzing of the Zarr library with GitHub Actions.
  • Optionally, choose a next library and repeat.
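
To make the idea concrete, here is a minimal mutation-based fuzzing loop using only the Python standard library. It is a sketch under stated assumptions, not a substitute for the framework evaluation above: `decode_chunk` is a hypothetical target standing in for a library's chunk decoder, and `mutate` and `fuzz` are illustrative helpers. The loop corrupts one byte of a valid input at a time and records any failure other than a clean rejection:

```python
import random
import struct
import zlib


def decode_chunk(data):
    """Hypothetical target: decode a zlib-compressed chunk of little-endian int32."""
    raw = zlib.decompress(data)
    if len(raw) % 4:
        raise ValueError("chunk length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(raw) // 4}i", raw))


def mutate(data, rng):
    """Flip one bit, insert one byte, or delete one byte at a random position."""
    buf = bytearray(data)
    choice = rng.randrange(3)
    pos = rng.randrange(len(buf)) if buf else 0
    if choice == 0 and buf:
        buf[pos] ^= 1 << rng.randrange(8)
    elif choice == 1:
        buf.insert(pos, rng.randrange(256))
    elif buf:
        del buf[pos]
    return bytes(buf)


def fuzz(target, seed_input, iterations=1000, seed=0):
    """Return inputs that made the target raise something other than a clean error."""
    rng = random.Random(seed)  # Fixed seed keeps any finding reproducible.
    findings = []
    for _ in range(iterations):
        candidate = mutate(seed_input, rng)
        try:
            target(candidate)
        except (zlib.error, ValueError):
            pass  # Rejecting corrupt input cleanly is the desired behaviour.
        except Exception as exc:
            findings.append((candidate, exc))
    return findings


valid = zlib.compress(struct.pack("<4i", 1, 2, 3, 4))
findings = fuzz(decode_chunk, valid)
print(f"{len(findings)} unexpected failures")
```

A dedicated framework would add coverage guidance and input minimization on top of this loop, which is part of what the evaluation task is meant to compare.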

Benefits:

This project will give the prospective intern the opportunity to learn the basics of the Zarr format and APIs as well as experiment with fuzz testing, a useful tool for any programmer.

Community Benefits:

The community will benefit from the improved security of the Zarr libraries.


Remaining possible projects:

These were ideas found while going through open issues and PRs. Thoughts on how to refine one or more of them are welcome.


Originally drafted in gdoc

@joshmoore
Member Author

thanks for the 🎉s. I've now uploaded the proposals. I see that others from the community can also open their own and offer to co-mentor. If anyone is interested, get in touch.

@joshmoore
Member Author

In the end, we did not accept any of the potential Outreachy mentees this round. It was lovely to interact with everyone on Zoom, Gitter, and GitHub, but ultimately I made the mistake of not making clear how much existing Python experience would be needed to get involved with Zarr. My apologies for this! I will try to be clearer in the future. In that sense it was a good learning experience for me and I thank everyone who participated.

The above outline is probably still useful for future proposals but I'll close this ticket for now.
