
JOSS - Functionality #3

Closed · glemaitre opened this issue Mar 27, 2018 · 6 comments
@glemaitre

I will go point by point through the JOSS review in the coming days.

But before starting, I have a question that came up as soon as I saw the project: what is the difference/benefit of mmappickle in comparison with joblib?

joblib handles the pickling/unpickling of large numpy arrays and already manages parallelization with auto-memmapping for the user (see the sketch below). There is also ongoing work on integration with dask to allow distributed programming. joblib is also used in scikit-learn and other Python libraries.
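To make the auto-memmapping concrete, here is a minimal sketch (the array size, worker count, and max_nbytes threshold are arbitrary choices):

```python
from joblib import Parallel, delayed
import numpy

big = numpy.random.rand(2_000_000)  # ~16 MB array

# Arrays larger than max_nbytes are automatically dumped to a
# temporary memmap and shared with the workers instead of copied.
results = Parallel(n_jobs=2, max_nbytes='1M')(
    delayed(numpy.mean)(big) for _ in range(4)
)
print(results)
```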

So my question is the following: what does mmappickle offer over joblib for the applications stated in the paper (multiprocessing / pickling of numpy arrays)?

Note:

Regarding pickling, see here for a blog post on performance at that time (some improvements have been made since).

cc @clemaitre58

@lfasnacht

Hello!
Thank you very much for agreeing to review. Here's a quick answer; I can go into more detail if required (though I would need to learn more about joblib first), just let me know.

joblib shares multiple goals with mmappickle. Here are what I think are the main differences, as far as I know (feel free to correct me if needed):

  • joblib uses one file per array. This can create issues when working with many arrays simultaneously (limits on the number of open files).
  • joblib seems to focus mainly on computation. mmappickle, on the other hand, also tries to ease data storage and distribution. Typically, having all the data in one file makes it simple to share (no need to "zip" files), and in addition the file uses the standard pickle protocol version 4, meaning that it can be loaded by pickle.load (see the sketch after this list). This is, in my opinion, important when sharing and archiving data, since individual projects, even open source ones, tend to evolve or disappear, while it seems quite likely that Python's pickle will stay stable.
  • Concerning performance, mmappickle has a small, nearly constant overhead over numpy.memmap. Since all the alternatives (including joblib) also use numpy.memmap, performance is usually similar. The overhead depends linearly on the number of keys in the dictionary and is cached, so it is usually not an issue unless the dictionary has a large number (thousands) of keys and changes frequently.
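To illustrate the point about pickle compatibility, a minimal sketch (the file name is arbitrary; this assumes mmappickle's mmapdict API):

```python
import pickle
import numpy
from mmappickle import mmapdict

# Store an array through mmappickle (single file, memory-mapped).
m = mmapdict('/tmp/data.mmdpickle')
m['spectrum'] = numpy.arange(10.0)

# The same file is a standard protocol-4 pickle, so it can be read
# back with plain pickle, without mmappickle installed.
with open('/tmp/data.mmdpickle', 'rb') as f:
    d = pickle.load(f)
print(d['spectrum'])
```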

I hope that makes sense!

@glemaitre

> joblib uses one file per array. This can create issues when working with many arrays simultaneously (limits on the number of open files).

I think this is not the case anymore (since 0.10.0).

> joblib seems to focus mainly on computation. mmappickle, on the other hand, also tries to ease data storage and distribution. Typically, having all the data in one file makes it simple to share (no need to "zip" files).

Basically, joblib.load and joblib.dump are intended for this usage (no longer with multiple files). Compression comes for free, on the fly, when dumping:

https://joblib.readthedocs.io/en/latest/persistence.html
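For illustration, a minimal sketch of the single-file dump with on-the-fly compression (the file path and compression level are arbitrary):

```python
import joblib
import numpy

data = {'a': numpy.random.rand(1000, 1000),
        'b': numpy.arange(10)}

# One output file for the whole dict; compress=3 enables zlib
# compression on the fly, so no separate archiving step is needed.
joblib.dump(data, '/tmp/out.pkl', compress=3)
restored = joblib.load('/tmp/out.pkl')
```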

> and in addition the file uses the standard pickle protocol version 4, meaning that it can be loaded by pickle.load. This is, in my opinion, important when sharing and archiving data, since individual projects, even open source ones, tend to evolve or disappear, while it seems quite likely that Python's pickle will stay stable.

This is a complicated matter, IMO. Versioning will always be troublesome when dealing with pickle (Python version, pickle protocol version, library version -> the way data structures are stored). However, it is true that being compatible with the pickle library is a plus, at least at time t :)

> Concerning performance, mmappickle has a small, nearly constant overhead over numpy.memmap. Since all the alternatives (including joblib) also use numpy.memmap, performance is usually similar. The overhead depends linearly on the number of keys in the dictionary and is cached, so it is usually not an issue unless the dictionary has a large number (thousands) of keys and changes frequently.

IMO, this is the most important point. I will go through the code, but I would expect a benchmark of the pickling/unpickling. It could be done on different sizes of arrays and data structures.

We could start with the LFW dataset, using the gist available here:
https://gist.github.com/aabadie/2ba94d28d68f19f87eb8916a2238a97c
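A benchmark along these lines could be as simple as this hypothetical sketch, varying the array size:

```python
import time
import joblib
import numpy

for n in (500, 2000, 4000):
    a = numpy.random.rand(n, n)
    t0 = time.time()
    joblib.dump({'img': a}, '/tmp/bench.pkl')
    dump_t = time.time() - t0
    t0 = time.time()
    joblib.load('/tmp/bench.pkl', mmap_mode='r')
    load_t = time.time() - t0
    print('%d x %d: dump %.3fs, load %.3fs' % (n, n, dump_t, load_t))
```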

@lfasnacht

> Basically, joblib.load and joblib.dump are intended for this usage (no longer with multiple files). Compression comes for free, on the fly, when dumping:
>
> https://joblib.readthedocs.io/en/latest/persistence.html

Sorry, I was not aware of that. However, mmappickle is made to store a dictionary of things. If I try to do the same with joblib, let's say I have the following case:

```python
import joblib, numpy

d = {'a': numpy.array([1, 2, 3]),
     'b': numpy.array([4, 5, 6]),
     'c': numpy.ma.array([7, 8, 9], mask=[False, True, False])}
joblib.dump(d, '/tmp/out.pkl')
x = joblib.load('/tmp/out.pkl', mmap_mode='r+')
```

x['a'] and x['b'] are memmaps, but x['c'] is not. (This would work with mmappickle.)

Now, how do I add a new key to the dictionary stored in /tmp/out.pkl?

Regarding the benchmark, it is very far from the typical use case of mmappickle, but I'll adapt it nevertheless. Give me a few days ;-)

@glemaitre

> x['a'] and x['b'] are memmaps, but x['c'] is not. (This would work with mmappickle.)

Masked arrays are not supported in joblib, that is true:
https://github.com/joblib/joblib/blob/master/joblib/test/test_numpy_pickle.py#L275

> Now, how do I add a new key to the dictionary stored in /tmp/out.pkl?

You would need to dump the dict again.
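That is, continuing the example above (hypothetical sketch):

```python
import joblib
import numpy

d = joblib.load('/tmp/out.pkl')      # load everything back
d['d'] = numpy.array([10, 11, 12])   # add the new key in memory
joblib.dump(d, '/tmp/out.pkl')       # re-dump: every array is written to disk again
```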

> Regarding the benchmark, it is very far from the typical use case of mmappickle, but I'll adapt it nevertheless. Give me a few days ;-)

Could you give more information about the use case? Apparently having a dictionary is useful in your case: does it allow you to dump and memmap on the fly?


@lfasnacht

> > Now, how do I add a new key to the dictionary stored in /tmp/out.pkl?
>
> You would need to dump the dict again.

That doesn't seem very efficient, especially since all the matrices have to be written to disk again?

> Could you give more information about the use case? Apparently having a dictionary is useful in your case: does it allow you to dump and memmap on the fly?

Exactly.

Here's the simplest use case I can think of: hyperspectral time lapses. At every time step, an image is captured, but we don't know how many images will be captured in total. An image is a 3D numpy array, usually a few hundred megabytes, and its size is known beforehand (it consists of multiple camera frames). There are many advantages to holding all the images in the same file instead of in many different files (for example, the common metadata is stored only once), and it is not possible to hold all the images in RAM simultaneously (we usually use a laptop to capture the data).

From the mmappickle point of view, here's what happens (sketched in code after the list):

  1. At first, the file is created and the metadata is written.
  2. At the beginning of each scan, a new key is added to the dict. This doesn't take much time, as only the matrix structure and a "hole" are written to the file.
  3. Each frame is written to the (now memmapped) array, filling the "hole".
  4. Go back to step 2, as long as the user hasn't stopped the capture.
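In code, the steps look roughly like this (a sketch only; the shapes, keys, and number of scans are arbitrary, and EmptyNDArray is mmappickle's stub for preallocating the "hole"):

```python
import numpy
from mmappickle import mmapdict
from mmappickle.stubs import EmptyNDArray

m = mmapdict('/tmp/capture.mmdpickle')
m['metadata'] = {'exposure_ms': 10.0}      # step 1: file + metadata

for scan in range(3):                      # step 4: loop until stopped
    key = 'scan-%d' % scan
    # step 2: add the key; only the structure and a "hole" are written
    m[key] = EmptyNDArray((5, 64, 64))
    arr = m[key]                           # a numpy.memmap into the file
    for frame in range(5):                 # step 3: fill frame by frame
        arr[frame] = numpy.random.rand(64, 64)
```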

The nice thing about using mmappickle is that it is possible to view the contents of the file while it is being written, or even to use it from another program (for example, to check that everything looks right, or to run an algorithm on the first captured images while still capturing new data). This is the kind of simultaneous access that other libraries do not consider.
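For example, while the capture above is still running, a second process could do something like this (hypothetical sketch):

```python
from mmappickle import mmapdict

# Open the same file from another process while it is being written.
m = mmapdict('/tmp/capture.mmdpickle')
print(list(m.keys()))   # the scans written so far
first = m['scan-0']     # a numpy.memmap view; nothing is copied to RAM
print(first.mean())
```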

That's why the common "pickle all data at once" benchmark doesn't really make sense for mmappickle. The common use case is random access (add/delete/memmap) to the dict.

Does that make sense to you?
