Create efficient binary storage format alternative to pickle #686

Closed · wesm opened this issue Jan 25, 2012 · 20 comments

Labels: Enhancement, IO Data (IO issues that don't fit into a more specific label)

Comments

@wesm (Member) commented Jan 25, 2012

Ideally it should support compression! Possibly using blosc or some other method.
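
As a rough illustration of the blosc idea, compressing a raw array buffer with the third-party python-blosc package might look like this (the array and settings are made up for demonstration):

import blosc
import numpy as np

arr = np.arange(1_000_000, dtype='float64')

# typesize lets blosc's shuffle filter exploit the regularity
# of fixed-width numeric data before compressing.
packed = blosc.compress(arr.tobytes(), typesize=arr.dtype.itemsize)

unpacked = np.frombuffer(blosc.decompress(packed), dtype=arr.dtype)
assert (arr == unpacked).all()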

@mcobzarenco

As a first-order approximation, what do you think about:

import pickle
import bz2

def pickle_compress(obj, path):
    # Dump the object through a bz2-compressed file handle.
    with bz2.BZ2File(path, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def unpickle_compress(path):
    # Read the compressed pickle back.
    with bz2.BZ2File(path, 'rb') as f:
        return pickle.load(f)

@wesm (Member, Author) commented Aug 15, 2012

Unfortunately, this doesn't solve the "pickle problem" (i.e. that classes can't move to different modules).

@dalejung (Contributor) commented Sep 3, 2012

@wesm have you considered making HDF5 the preferred backend? You would get compression baked in. Plus, the random access support would make on-disk dataframe operations possible.
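
For reference, a minimal sketch of what this suggestion looks like with pandas' HDFStore (the filename and query are made up; blosc support depends on the PyTables build):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4), columns=list('abcd'))

# complevel/complib give compression baked in via PyTables.
with pd.HDFStore('demo.h5', complevel=9, complib='blosc') as store:
    store.put('df', df, format='table')  # table format supports on-disk queries
    subset = store.select('df', where='index < 100')  # random access without loading everything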

@wesm (Member, Author) commented Sep 8, 2012

I have, but the main problem is the deserialization speed for lots of small objects (this turns out to be a fairly important use case for a lot of users). I think using msgpack along with snappy or blosc (which uses the fastlz algorithm) for fast in-memory compression might be a good way to go.

I should point out that using HDF5 (or at least PyTables) adds quite a bit of overhead when loading small objects.
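
To make the msgpack idea concrete, here is a naive sketch assuming the msgpack-python package (the schema is invented; a real format would preserve dtypes and serialize raw buffers instead of Python lists):

import msgpack
import pandas as pd

def pack_series(s):
    # Invented schema: index and values as plain lists.
    return msgpack.packb({'index': list(s.index), 'values': s.tolist()})

def unpack_series(buf):
    obj = msgpack.unpackb(buf)
    return pd.Series(obj['values'], index=obj['index'])

s = pd.Series([1.0, 2.0], index=['a', 'b'])
assert unpack_series(pack_series(s)).equals(s)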

@dalejung (Contributor)

When you talk about loading many small objects, what do you mean exactly? I've definitely run into a lot of performance cul-de-sacs with PyTables, so I'm not sure which one you're referring to. :/

I've resorted to keeping a DatetimeIndex in memory and translating the boolean arrays into lists of slices, since PyTables dies on long lists of indexes. It's really quick, but it bypasses most PyTables features, like in-kernel searches.

@wesm (Member, Author) commented Sep 18, 2012

As a benchmark, try loading 1000 Series objects, each containing 100 values and a random string index with 10 or fewer characters per label. So we're talking roughly 8K of data per Series.
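
Something like the following sketch could generate that benchmark data (the random-string scheme is one arbitrary choice, written against the modern NumPy Generator API):

import string
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
letters = list(string.ascii_lowercase)

def make_series(n=100):
    # Random string labels of 10 or fewer characters, per the benchmark.
    idx = [''.join(rng.choice(letters, rng.integers(1, 11))) for _ in range(n)]
    return pd.Series(rng.standard_normal(n), index=idx)

many_series = [make_series() for _ in range(1000)]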

@ghost commented Mar 24, 2013

Related: #3151

@ghost commented Apr 4, 2013

The metadata discussion in #2485 reached an impasse about having metadata propagate through data ops, but it still makes sense to allow a metadata descriptor to be serialized with an object and be available immediately upon deserialization, for things like date of measurement, source-of-data information, etc.

The pickle code in pandas is not my favorite bit; it would be nice to tackle it as part of a new binary format, if and when one happens.

@jreback (Contributor) commented Apr 4, 2013

I think in 0.12 we could add a generic .meta that attaches to all objects and propagates. It's easy to save/recreate in HDF (see http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore) and pickle; csv/excel would probably need some work to support that, though. Adding a direct .name should also be straightforward (on frames/panels).
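
The cookbook pattern linked above looks roughly like this (the attribute name and contents are arbitrary):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 3))

with pd.HDFStore('meta_demo.h5') as store:
    store.put('df', df)
    # Arbitrary metadata rides along as PyTables node attributes.
    store.get_storer('df').attrs.metadata = {'source': 'sensor A', 'foo': 1}
    meta = store.get_storer('df').attrs.metadata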

@ghost commented Apr 4, 2013

What do you mean by propagate? Do you have a way to do #2485 cleanly across operations, or are you just adding it to the current pickle serialization?

@jreback (Contributor) commented Apr 4, 2013

pickle/HDF is easy. I would do it the way name propagates in Series: essentially it's an addition to the constructor (and maybe move name and make it a property of meta). I think it can propagate across most common operations.

Of course, I'm not exactly sure what to do in a case like this:

df1.meta = dict(foo=1)
df2.meta = dict(bar=1)

df = df1 + df2

df.meta = dict(foo=1, bar=1)

and maybe a warning for clobbering?

@ghost commented Apr 4, 2013

#2485 (possibly the longest discussion ever?) convinced me there be dragons on the propagation
front, too much complexity for something that really doesn't mesh with the data model.

Adding metadata to store/load only is indeed fairly straightforward, and useful for archiving. That's
why I brought it up.

@ghost commented Apr 7, 2013

re: msgpack, maybe Google protobuf or Apache Avro should be considered? msgpack was meant to be a small-footprint replacement for JSON, but it doesn't beat compressed JSON by much. Protobufs and Avro have other selling points as binary formats, and they have a lot of mileage.

Perhaps dataframes in the browser may be a future consideration; msgpack is a browser-friendly format. Not sure if that's an issue.

@jreback (Contributor) commented Apr 7, 2013

What exactly is the goal here? To provide essentially a pickle replacement, or just to support saving/loading a large number of smaller objects?

It's clear (to me at least!) that if you have large amounts of data, HDF5 is the way to go, so are we talking about small, fast storage of data (in binary)? Essentially a savez/loadz replacement?

Why is compression a requirement in this case anyway, or even a binary format? What is wrong with JSON / BSON?

I am somewhat agnostic on msgpack; protobuf / avro look good too.

We need a use case to figure out which format suits.

@wesm (Member, Author) commented Apr 8, 2013

Yeah, for a ton of small objects using pickle is the best way. I would be willing to explore avro, especially since it's compatible with HDFS (S, not 5)

@jreback (Contributor) commented Apr 8, 2013

http://www.slideshare.net/IgorAnishchenko/pb-vs-thrift-vs-avro

+1 on Avro - it's more Python-like (and no compiling schemas)

@ghost commented Apr 8, 2013

Good read: http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html. It's somewhat scary how accurate the psychological profile in the intro is.

@ghost commented Apr 9, 2013

Please bake a serialization format version into this.

@wesm (Member, Author) commented Apr 9, 2013

Agreed re: versioning
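
A minimal sketch of baking a version into a file header (the magic bytes and helper names are invented for illustration, not what pandas ultimately used):

import struct

MAGIC = b'PDBF'      # invented magic bytes
FORMAT_VERSION = 1

def write_payload(path, payload):
    # Prefix the serialized payload with magic bytes and a version,
    # so future readers can dispatch on (or reject) old versions.
    with open(path, 'wb') as f:
        f.write(MAGIC)
        f.write(struct.pack('<H', FORMAT_VERSION))
        f.write(payload)

def read_payload(path):
    with open(path, 'rb') as f:
        if f.read(4) != MAGIC:
            raise ValueError('unrecognized file format')
        (version,) = struct.unpack('<H', f.read(2))
        if version > FORMAT_VERSION:
            raise ValueError('unsupported format version %d' % version)
        return f.read()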

@jreback (Contributor) commented Oct 1, 2013

closed by #3525

jreback closed this as completed Oct 1, 2013