
JOSS - Functionality #3

Closed · glemaitre opened this issue Mar 27, 2018 · 6 comments
@glemaitre

I will go point by point through the JOSS review in the coming days.

But before starting, I have a question that came up as soon as I saw the project: what is the difference/benefit of mmappickle in comparison with joblib?

joblib handles the pickling/unpickling of large numpy arrays and already manages parallelization with auto-memmapping for the user (see the sketch below). There is also ongoing work on integration with dask to allow distributed programming. joblib is also used in scikit-learn and other Python libraries.
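To make the auto-memmapping concrete, here is a minimal sketch (the array size, worker count, and max_nbytes threshold are arbitrary choices):

```python
from joblib import Parallel, delayed
import numpy

big = numpy.random.rand(2_000_000)  # ~16 MB array

# Arrays larger than max_nbytes are automatically dumped to a
# temporary memmap and shared with the workers instead of copied.
results = Parallel(n_jobs=2, max_nbytes='1M')(
    delayed(numpy.mean)(big) for _ in range(4)
)
print(results)
```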

So my question is the following: what does mmappickle offer over joblib for the applications stated in the paper (multiprocessing / pickling of numpy arrays)?

Note:

Regarding pickling, see here for a blog post on performance at that time (some improvements have been made since).

cc @clemaitre58

@lfasnacht

Hello!
Thank you very much for agreeing to review. Here's a quick answer; I can go into more detail if required (though I would need to learn more about joblib first), just let me know.

joblib shares multiple goals with mmappickle. Here are what I think are the main differences, as far as I know (feel free to correct me if needed):

  • joblib uses one file per array. This can create issues when working with many arrays simultaneously (limits on the number of open files).
  • joblib seems to focus mainly on computation. mmappickle, on the other hand, also tries to ease data storage and distribution. Typically, having all the data in one file makes it simple to share (no need to "zip" files), and in addition the file uses the standard pickle protocol version 4, meaning that it can be loaded by pickle.load (see the sketch after this list). This is, in my opinion, important when sharing and archiving data, since individual projects, even open source ones, tend to evolve or disappear, while it seems quite likely that Python's pickle will stay stable.
  • Concerning performance, mmappickle has a small, nearly constant overhead over numpy.memmap. Since all the alternatives (including joblib) also use numpy.memmap, performance is usually similar. The overhead depends linearly on the number of keys in the dictionary and is cached, so it is usually not an issue unless the dictionary has a large number (thousands) of keys and changes frequently.
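To illustrate the point about pickle compatibility, a minimal sketch (the file name is arbitrary; this assumes mmappickle's mmapdict API):

```python
import pickle
import numpy
from mmappickle import mmapdict

# Store an array through mmappickle (single file, memory-mapped).
m = mmapdict('/tmp/data.mmdpickle')
m['spectrum'] = numpy.arange(10.0)

# The same file is a standard protocol-4 pickle, so it can be read
# back with plain pickle, without mmappickle installed.
with open('/tmp/data.mmdpickle', 'rb') as f:
    d = pickle.load(f)
print(d['spectrum'])
```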

I hope that makes sense!

@glemaitre

> joblib uses one file per array. This can create issues when working with many arrays simultaneously (limits on the number of open files).

I think this is not the case anymore (since 0.10.0).

> joblib seems to focus mainly on computation. mmappickle, on the other hand, also tries to ease data storage and distribution. Typically, having all the data in one file makes it simple to share (no need to "zip" files).

Basically, joblib.load and joblib.dump are intended for this usage (no longer with multiple files). Compression comes for free, on the fly, when dumping:

https://joblib.readthedocs.io/en/latest/persistence.html
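For illustration, a minimal sketch of the single-file dump with on-the-fly compression (the file path and compression level are arbitrary):

```python
import joblib
import numpy

data = {'a': numpy.random.rand(1000, 1000),
        'b': numpy.arange(10)}

# One output file for the whole dict; compress=3 enables zlib
# compression on the fly, so no separate archiving step is needed.
joblib.dump(data, '/tmp/out.pkl', compress=3)
restored = joblib.load('/tmp/out.pkl')
```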

> and in addition the file uses the standard pickle protocol version 4, meaning that it can be loaded by pickle.load. This is, in my opinion, important when sharing and archiving data, since individual projects, even open source ones, tend to evolve or disappear, while it seems quite likely that Python's pickle will stay stable.

This is a complicated matter, IMO. Versioning will always be troublesome when dealing with pickle (Python version, pickle protocol version, library version -> the way data structures are stored). However, it is true that being compatible with the pickle library is a plus, at least at time t :)

> Concerning performance, mmappickle has a small, nearly constant overhead over numpy.memmap. Since all the alternatives (including joblib) also use numpy.memmap, performance is usually similar. The overhead depends linearly on the number of keys in the dictionary and is cached, so it is usually not an issue unless the dictionary has a large number (thousands) of keys and changes frequently.

IMO, this is the most important point. I will go through the code, but I would expect a benchmark of the pickling/unpickling. It could be done on different sizes of arrays and data structures.

We could start with the LFW dataset, using the gist available here:
https://gist.github.com/aabadie/2ba94d28d68f19f87eb8916a2238a97c
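A benchmark along these lines could be as simple as this hypothetical sketch, varying the array size:

```python
import time
import joblib
import numpy

for n in (500, 2000, 4000):
    a = numpy.random.rand(n, n)
    t0 = time.time()
    joblib.dump({'img': a}, '/tmp/bench.pkl')
    dump_t = time.time() - t0
    t0 = time.time()
    joblib.load('/tmp/bench.pkl', mmap_mode='r')
    load_t = time.time() - t0
    print('%d x %d: dump %.3fs, load %.3fs' % (n, n, dump_t, load_t))
```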

@lfasnacht

> Basically, joblib.load and joblib.dump are intended for this usage (no longer with multiple files). Compression comes for free, on the fly, when dumping:
>
> https://joblib.readthedocs.io/en/latest/persistence.html

Sorry, I was not aware of that. However, mmappickle is made to store a dictionary of things. If I try to do the same with joblib, let's say I have the following case:

```python
import joblib, numpy

d = {'a': numpy.array([1, 2, 3]),
     'b': numpy.array([4, 5, 6]),
     'c': numpy.ma.array([7, 8, 9], mask=[False, True, False])}
joblib.dump(d, '/tmp/out.pkl')
x = joblib.load('/tmp/out.pkl', mmap_mode='r+')
```

x['a'] and x['b'] are memmaps, but x['c'] is not. (This would work with mmappickle.)

Now, how do I add a new key to the dictionary stored in /tmp/out.pkl?

Regarding the benchmark, it is very far from the typical use case of mmappickle, but I'll adapt it nevertheless. Give me a few days ;-)

@glemaitre

> x['a'] and x['b'] are memmaps, but x['c'] is not. (This would work with mmappickle.)

Masked arrays are not supported in joblib, that is true:
https://github.com/joblib/joblib/blob/master/joblib/test/test_numpy_pickle.py#L275

> Now, how do I add a new key to the dictionary stored in /tmp/out.pkl?

You would need to dump the dict again.
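That is, continuing the example above (hypothetical sketch):

```python
import joblib
import numpy

d = joblib.load('/tmp/out.pkl')      # load everything back
d['d'] = numpy.array([10, 11, 12])   # add the new key in memory
joblib.dump(d, '/tmp/out.pkl')       # re-dump: every array is written to disk again
```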

> Regarding the benchmark, it is very far from the typical use case of mmappickle, but I'll adapt it nevertheless. Give me a few days ;-)

Could you give more information about the use case? Apparently having a dictionary is useful in your case: does it allow you to dump and memmap on the fly?


@lfasnacht

> > Now, how do I add a new key to the dictionary stored in /tmp/out.pkl?
>
> You would need to dump the dict again.

That doesn't seem very efficient, especially since all the matrices have to be written to disk again?

> Could you give more information about the use case? Apparently having a dictionary is useful in your case: does it allow you to dump and memmap on the fly?

Exactly.

Here's the simplest use case I can think of: hyperspectral time lapses. At every time step, an image is captured, but we don't know how many images will be captured in total. An image is a 3D numpy array, usually a few hundred megabytes, and its size is known beforehand (it consists of multiple camera frames). There are many advantages to holding all the images in the same file instead of in many different files (for example, the common metadata is stored only once), and it is not possible to hold all the images in RAM simultaneously (we usually use a laptop to capture the data).

From the mmappickle point of view, here's what happens (sketched in code after the list):

  1. At first, the file is created and the metadata is written.
  2. At the beginning of each scan, a new key is added to the dict. This doesn't take much time, as only the matrix structure and a "hole" are written to the file.
  3. Each frame is written to the (now memmapped) array, filling the "hole".
  4. Go back to step 2, as long as the user hasn't stopped the capture.
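In code, the steps look roughly like this (a sketch only; the shapes, keys, and number of scans are arbitrary, and EmptyNDArray is mmappickle's stub for preallocating the "hole"):

```python
import numpy
from mmappickle import mmapdict
from mmappickle.stubs import EmptyNDArray

m = mmapdict('/tmp/capture.mmdpickle')
m['metadata'] = {'exposure_ms': 10.0}      # step 1: file + metadata

for scan in range(3):                      # step 4: loop until stopped
    key = 'scan-%d' % scan
    # step 2: add the key; only the structure and a "hole" are written
    m[key] = EmptyNDArray((5, 64, 64))
    arr = m[key]                           # a numpy.memmap into the file
    for frame in range(5):                 # step 3: fill frame by frame
        arr[frame] = numpy.random.rand(64, 64)
```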

The nice thing about using mmappickle is that it is possible to view the contents of the file while it is being written, or even to use it from another program (for example, to check that everything looks right, or to run an algorithm on the first captured images while still capturing new data). This is the kind of simultaneous access that other libraries do not consider.
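For example, while the capture above is still running, a second process could do something like this (hypothetical sketch):

```python
from mmappickle import mmapdict

# Open the same file from another process while it is being written.
m = mmapdict('/tmp/capture.mmdpickle')
print(list(m.keys()))   # the scans written so far
first = m['scan-0']     # a numpy.memmap view; nothing is copied to RAM
print(first.mean())
```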

That's why the common "pickle all data at once" benchmark doesn't really make sense for mmappickle. The common use case is random access (add/delete/memmap) to the dict.

Does that make sense to you?
