
Decide how to bundle .pickle files from COMSOL solution results data #4026

Closed
7 of 8 tasks
agriyakhetarpal opened this issue Apr 20, 2024 · 6 comments · Fixed by #4098
Labels: difficulty: medium (Will take a few days), priority: medium (To be resolved if time allows), release blocker (Issues that need to be addressed before the creation of a release)

Comments

agriyakhetarpal (Member) commented Apr 20, 2024

.pickle files are binary and not inherently secure: a maliciously modified pickle that makes it to the develop branch or a PyPI release (binary or non-binary) could compromise users when loaded. For this reason, these files are best kept out of PyBaMM's packaging infrastructure when publishing to common distribution channels like PyPI: they should be included neither in the source distribution (controlled by MANIFEST.in) nor in the wheels (in the case where the wheel is built from the sdist). Binary platform-specific wheels for PyBaMM cannot be built from sdists right now and have to be built in-tree (#3651, #3564); addressing that would require restructuring the build system configuration at least, if not rewriting which files are copied into the package structure, since pybamm/input/comsol_results/ is a namespace package (via setuptools.command.install_data).
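The risk described above can be demonstrated in a few lines (a generic illustration of pickle's behaviour, not related to PyBaMM's actual files):

```python
import pickle

# Why pickle is unsafe: unpickling can invoke arbitrary callables via __reduce__.
# This class pickles to "call print with this string" instead of plain data;
# a real attacker would substitute something far more harmful than print.
class Malicious:
    def __reduce__(self):
        return (print, ("arbitrary code ran during pickle.loads",))

payload = pickle.dumps(Malicious())
pickle.loads(payload)  # executes print() as a side effect of loading
```

This is why the pickle documentation itself warns to only unpickle data from trusted sources, and why non-executable formats such as JSON are preferable for distributed data.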

We have the following options:

  1. Store the data in non-binary formats such as JSON, CSV, or TXT instead of .pickle.
  2. Use pooch to reproducibly download these files when needed, adjusting PyBaMM's public API and example notebooks accordingly, while moving the files to a separate repository in the pybamm-team organisation and marking that repository as an archive to prevent tampering.
  3. Use non-binary storage solutions from xarray or zarr, which can handle multi-dimensional data.

Of these, option 2 requires an internet connection: anyone wishing to load COMSOL results from these files would need internet access to download them, so it should be the last option to consider.


Task list, as of 10/05/2024

There are two major things to be done:

agriyakhetarpal added the difficulty: medium and priority: medium labels on Apr 20, 2024
agriyakhetarpal (Member, Author) commented:

NumPy arrays in the values of the dict can't be serialised to JSON directly. It might be worth converting these arrays to lists to be able to save them, or using pandas (which is already an optional dependency of PyBaMM) to convert to JSON and read it back where needed.
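The list-based conversion suggested above can be sketched as follows (the dictionary keys and values here are illustrative, not PyBaMM's actual COMSOL data):

```python
import json

import numpy as np

# Hypothetical COMSOL-style results: a dict of NumPy arrays
data = {
    "time": np.linspace(0.0, 1.0, 5),
    "voltage": np.array([3.7, 3.6, 3.5, 3.4, 3.3]),
}

# np.ndarray is not JSON-serialisable directly, so convert to nested lists first
serialisable = {key: value.tolist() for key, value in data.items()}
text = json.dumps(serialisable)

# Reading it back: json.loads yields plain lists, which np.asarray restores
loaded = {key: np.asarray(value) for key, value in json.loads(text).items()}
assert all(np.array_equal(data[k], loaded[k]) for k in data)
```

`ndarray.tolist()` converts elements to native Python floats, which `json.dumps` serialises with full double precision, so the round trip is lossless for float64 data.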

agriyakhetarpal added the release blocker label on Apr 25, 2024
kratman (Contributor) commented Apr 26, 2024

@agriyakhetarpal Why is this a release blocker?

agriyakhetarpal removed the release blocker label on Apr 26, 2024
agriyakhetarpal (Member, Author) commented:

Ah, I added this in error; most likely it was because I was selecting multiple issues, my apologies. It's something to look into soon enough, but by no means a release blocker. I have removed the label.

santacodes (Contributor) commented:

I think the first option would be better; JSON and CSV files would provide a better structure than TXT files, in my opinion. I could try this out if it's not already being worked on by someone else.

agriyakhetarpal (Member, Author) commented:

Thanks, assigned you. Be aware that if you convert NumPy arrays to some other format without a compatibility-providing library and then convert them back, you could lose data. You will also need to take care of floating-point precision, since most of the elements in the data have decimal values.
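One way to guard against the data loss mentioned above is a round-trip check (a sketch with made-up array values, not code from PyBaMM):

```python
import json

import numpy as np

original = np.array([1.0e-12, 3.141592653589793, 2.718281828459045])

# json.dumps uses Python's shortest-repr float formatting, which round-trips
# IEEE 754 doubles exactly, so this conversion loses no precision
restored = np.asarray(json.loads(json.dumps(original.tolist())))

# Exact equality holds for the JSON round trip; for formats that truncate
# precision (e.g. CSV written with a fixed number of decimal places), a
# tolerance-based comparison like assert_allclose would be needed instead
np.testing.assert_array_equal(original, restored)
```

Running such a check on each converted file before deleting the original .pickle would confirm the migration is lossless.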

agriyakhetarpal added the release blocker label on May 13, 2024
agriyakhetarpal (Member, Author) commented:

This is now a legitimate release blocker, so I have added the label. The reason is that the JSON files cumulatively amount to 26.47 MB (https://github.com/pybamm-team/pybamm-data/releases/tag/v1.0.0), which would bloat the sdist and wheels with files larger in total than the Python extension module we ship inside the wheel. We will need to move to pooch and add relevant classes/functions to download them quite soon, and we will need to do this before we do the v24.5rc0 pre-release (cc: @santacodes).

The size of the sdist should be no more than a few MBs. While there is no formal guideline for sdist sizes, it is generally accepted to keep only minimal files in them so that they are smaller than the wheels, and to provide a reasonably sized download for users (both wheels and sdist). For context, our Windows wheels are currently ~8 MB, macOS ones ~12 MB, and Linux ones ~22 MB. This would not have been a blocker if the wheels excluded the files while the sdist included them; at this time, both will include them.
