JOSS - Functionality #3
Hello!
I hope that makes sense!
I think this is not the case anymore (from 0.10.0).
Basically https://joblib.readthedocs.io/en/latest/persistence.html
This is a complicated matter, IMO. Versioning will always be a problem when dealing with pickle (Python version, pickle protocol version, library version -> the way data structures are stored). However, it is true that being compatible with
IMO, this is the most important point. I will go through the code, but I would expect a benchmark of pickling/unpickling. It could be done on different sizes of arrays and data structures. We could start with the LFW dataset using the available gist there:
Sorry, I was not aware of that. However:

```python
import joblib
import numpy

d = {'a': numpy.array([1, 2, 3]),
     'b': numpy.array([4, 5, 6]),
     'c': numpy.ma.array([7, 8, 9], mask=[False, True, False])}
joblib.dump(d, '/tmp/out.pkl')
x = joblib.load('/tmp/out.pkl', mmap_mode='r+')
```
Now, how do I add a new key to the dictionary stored in `/tmp/out.pkl`? Regarding the benchmark, it is very far from the classical use case of `mmappickle`.
Masked arrays are not supported in joblib, this is true.
You would need to dump the dict again.
Could you give more information about the use case? Apparently, having a dictionary is useful for your use case. Does it allow you to dump and memmap on the fly?
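To make the "dump the dict again" workaround concrete, here is a minimal sketch (reusing the `/tmp/out.pkl` path from the example above). Note that the whole dictionary, including every array already on disk, is re-serialized on each update:

```python
import joblib
import numpy

# Initial dump: a dict of arrays serialized into one file.
d = {'a': numpy.array([1, 2, 3]), 'b': numpy.array([4, 5, 6])}
joblib.dump(d, '/tmp/out.pkl')

# To "add a key": load the dict, modify it, and dump it again.
# Every array is rewritten, so the cost grows with the total size.
d = joblib.load('/tmp/out.pkl')
d['c'] = numpy.array([7, 8, 9])
joblib.dump(d, '/tmp/out.pkl')

# The arrays can still be memmapped afterwards.
x = joblib.load('/tmp/out.pkl', mmap_mode='r')
print(sorted(x.keys()))  # ['a', 'b', 'c']
```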
That doesn't seem very efficient, especially as all the matrices have to be written to disk again?
Exactly. Here's the simplest use case I can think of: hyperspectral time lapses. At every time step, an image is captured, but we don't know how many images will be captured in total. An image is usually a 3D numpy array of a few hundred megabytes, and its size is known beforehand (it consists of multiple camera frames). There are a lot of advantages to holding all the images in the same file instead of having many different files (for example, all the common metadata are stored only once), and it is not possible to hold all images in RAM simultaneously (usually we use a laptop to capture data). From the
The nice thing of using `mmappickle` is that it supports this incremental use case. That's why the common "pickling all data at once" benchmark doesn't really make sense for `mmappickle`. Does that make sense to you?
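To illustrate the access pattern described above (one growing file, fixed-size images, never all in RAM at once), here is a hedged sketch using plain `numpy.memmap` on a raw file. The path and image shape are invented for the example, and this is only an approximation of the workflow, not how `mmappickle` itself works:

```python
import numpy as np

shape = (4, 8, 16)          # hypothetical (frames, rows, cols) per image
nbytes = int(np.prod(shape)) * np.dtype(np.float32).itemsize
path = '/tmp/stack.dat'     # single file holding the growing image stack

def append_image(img):
    """Append one image by extending the file, without touching the rest."""
    with open(path, 'ab') as f:
        f.write(np.ascontiguousarray(img, dtype=np.float32).tobytes())

def read_image(i):
    """Memory-map only image i; the OS pages data in on demand."""
    return np.memmap(path, dtype=np.float32, mode='r',
                     offset=i * nbytes, shape=shape)

open(path, 'wb').close()    # start with an empty stack
for step in range(3):       # three "time steps" of the capture
    append_image(np.full(shape, step))

print(float(read_image(2)[0, 0, 0]))  # 2.0
```

Appending only writes the new image's bytes, and reading maps a single image, which matches the laptop-capture constraint above.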
I will go point by point through the JOSS review in the coming days.
But before starting, here is a question I had as soon as I saw the project.
What is the difference/benefit of `mmappickle` in comparison with `joblib`? `joblib` handles the pickling/unpickling of large numpy arrays and already manages parallelization with auto-memmapping for the user. There is also ongoing work on integration with dask to allow distributed programming. Also, `joblib` is used in scikit-learn and other Python libraries.
So my question is the following: what does `mmappickle` offer beyond `joblib` for the applications that you stated in the paper (multiprocessing / pickling of numpy arrays)?
Note: regarding pickling, you can refer to: here for a blog post about the performance at that time (some improvements have been made since that post).
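For context on the auto-memmapping mentioned above, here is a small sketch of `joblib.Parallel` handing a large array to worker processes. The function name and the `max_nbytes` threshold are chosen for the demo (joblib's default threshold is 1M); large inputs above the threshold are dumped to a temporary file and memmapped rather than pickled once per worker:

```python
import numpy as np
from joblib import Parallel, delayed

def column_sum(arr, i):
    # In a worker process, `arr` may arrive as a read-only memmap
    # backed by a temporary file, instead of a pickled copy.
    return float(arr[:, i].sum())

data = np.ones((1000, 4))
# max_nbytes lowers the auto-memmapping threshold to 1 kB for the demo.
sums = Parallel(n_jobs=2, max_nbytes='1K')(
    delayed(column_sum)(data, i) for i in range(4))
print(sums)  # [1000.0, 1000.0, 1000.0, 1000.0]
```

This covers the parallel-read side; the question above is essentially whether `mmappickle` adds something on top of this for incremental writes.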
cc @clemaitre58