Create efficient binary storage format alternative to pickle #686

Closed · wesm opened this issue Jan 25, 2012 · 20 comments

Labels: Enhancement, IO Data (IO issues that don't fit into a more specific label)

Comments

@wesm (Member) commented Jan 25, 2012

Ideally it should support compression! Possibly using blosc or some other method.
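
As a rough illustration of the blosc idea, compressing a raw array buffer with the third-party python-blosc package might look like this (the array and settings are made up for demonstration):

import blosc
import numpy as np

arr = np.arange(1_000_000, dtype='float64')

# typesize lets blosc's shuffle filter exploit the regularity
# of fixed-width numeric data before compressing.
packed = blosc.compress(arr.tobytes(), typesize=arr.dtype.itemsize)

unpacked = np.frombuffer(blosc.decompress(packed), dtype=arr.dtype)
assert (arr == unpacked).all()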

@mcobzarenco

As a first-order approximation, what do you think about:

import pickle
import bz2

def pickle_compress(obj, path):
    # Dump the object through a bz2-compressed file handle.
    with bz2.BZ2File(path, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def unpickle_compress(path):
    # Read the compressed pickle back.
    with bz2.BZ2File(path, 'rb') as f:
        return pickle.load(f)

@wesm (Member, Author) commented Aug 15, 2012

Unfortunately, this doesn't solve the "pickle problem" (i.e. that classes can't move to different modules).

@dalejung (Contributor) commented Sep 3, 2012

@wesm have you considered making HDF5 the preferred backend? You would get compression baked in. Plus, the random access support would make on-disk dataframe operations possible.
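
For reference, a minimal sketch of what this suggestion looks like with pandas' HDFStore (the filename and query are made up; blosc support depends on the PyTables build):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4), columns=list('abcd'))

# complevel/complib give compression baked in via PyTables.
with pd.HDFStore('demo.h5', complevel=9, complib='blosc') as store:
    store.put('df', df, format='table')  # table format supports on-disk queries
    subset = store.select('df', where='index < 100')  # random access without loading everything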

@wesm (Member, Author) commented Sep 8, 2012

I have, but the main problem is the deserialization speed for lots of small objects (this turns out to be a fairly important use case for a lot of users). I think using msgpack along with snappy or blosc (which uses the fastlz algorithm) for fast in-memory compression might be a good way to go.

I should point out that using HDF5 (or at least PyTables) adds quite a bit of overhead when loading small objects.
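
To make the msgpack idea concrete, here is a naive sketch assuming the msgpack-python package (the schema is invented; a real format would preserve dtypes and serialize raw buffers instead of Python lists):

import msgpack
import pandas as pd

def pack_series(s):
    # Invented schema: index and values as plain lists.
    return msgpack.packb({'index': list(s.index), 'values': s.tolist()})

def unpack_series(buf):
    obj = msgpack.unpackb(buf)
    return pd.Series(obj['values'], index=obj['index'])

s = pd.Series([1.0, 2.0], index=['a', 'b'])
assert unpack_series(pack_series(s)).equals(s)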

@dalejung (Contributor)

When you talk about loading many small objects, what do you mean exactly? I've definitely run into a lot of performance cul-de-sacs with PyTables, so I'm not sure which one you're referring to. :/

I've resorted to keeping a DatetimeIndex in memory and translating the boolean arrays into lists of slices, since PyTables dies on long lists of indexes. It's really quick, but it bypasses most PyTables features, like in-kernel searches.

@wesm (Member, Author) commented Sep 18, 2012

As a benchmark, try loading 1000 Series objects, each containing 100 values and a random string index with 10 or fewer characters per label. So we're talking roughly 8K of data per Series.
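
Something like the following sketch could generate that benchmark data (the random-string scheme is one arbitrary choice, written against the modern NumPy Generator API):

import string
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
letters = list(string.ascii_lowercase)

def make_series(n=100):
    # Random string labels of 10 or fewer characters, per the benchmark.
    idx = [''.join(rng.choice(letters, rng.integers(1, 11))) for _ in range(n)]
    return pd.Series(rng.standard_normal(n), index=idx)

many_series = [make_series() for _ in range(1000)]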

@ghost commented Mar 24, 2013

Related: #3151

@ghost commented Apr 4, 2013

The metadata discussion in #2485 reached an impasse about having metadata propagate through data ops, but it still makes sense to allow a metadata descriptor to be serialized with an object and be available immediately upon deserialization, for things like date of measurement, source-of-data information, etc.

The pickle code in pandas is not my favorite bit; it would be nice to tackle it as part of a new binary format, if and when one happens.

@jreback (Contributor) commented Apr 4, 2013

I think in 0.12 we could add a generic .meta that attaches to all objects and propagates. It's easy to save/recreate in HDF (see http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore) and pickle; csv/excel would probably need some work to support that, though. Adding a direct .name should also be straightforward (on frames/panels).
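
The cookbook pattern linked above looks roughly like this (the attribute name and contents are arbitrary):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 3))

with pd.HDFStore('meta_demo.h5') as store:
    store.put('df', df)
    # Arbitrary metadata rides along as PyTables node attributes.
    store.get_storer('df').attrs.metadata = {'source': 'sensor A', 'foo': 1}
    meta = store.get_storer('df').attrs.metadata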

@ghost commented Apr 4, 2013

What do you mean by propagate? Do you have a way to do #2485 cleanly across operations, or are you just adding it to the current pickle serialization?

@jreback (Contributor) commented Apr 4, 2013

pickle/HDF is easy. I would do it the way name propagates in Series: essentially it's an addition to the constructor (and maybe move name and make it a property of meta). I think it can propagate across most common operations.

Of course, I'm not exactly sure what to do in a case like this:

df1.meta = dict(foo=1)
df2.meta = dict(bar=1)

df = df1 + df2

df.meta = dict(foo=1, bar=1)

and maybe a warning for clobbering?

@ghost commented Apr 4, 2013

#2485 (possibly the longest discussion ever?) convinced me there be dragons on the propagation
front, too much complexity for something that really doesn't mesh with the data model.

Adding metadata to store/load only is indeed fairly straightforward, and useful for archiving. That's
why I brought it up.

@ghost commented Apr 7, 2013

re: msgpack, maybe Google protobuf or Apache Avro should be considered? msgpack was meant to be a small-footprint replacement for JSON, but it doesn't beat compressed JSON by much. Protobufs and Avro have other selling points as binary formats, and they have a lot of mileage.

Perhaps dataframes in the browser may be a future consideration; msgpack is a browser-friendly format. Not sure if that's an issue.

@jreback (Contributor) commented Apr 7, 2013

What exactly is the goal here? To provide essentially a pickle replacement, or just to support saving/loading a large number of smaller objects?

It's clear (to me at least!) that if you have large amounts of data, HDF5 is the way to go, so are we talking about small, fast storage of data (in binary)? Essentially a savez/loadz replacement?

Why is compression a requirement in this case anyway, or even a binary format? What is wrong with JSON / BSON?

I am somewhat agnostic on msgpack; protobuf / avro look good too.

We need a use case to figure out which format suits.

@wesm (Member, Author) commented Apr 8, 2013

Yeah, for a ton of small objects using pickle is the best way. I would be willing to explore avro, especially since it's compatible with HDFS (S, not 5)

@jreback (Contributor) commented Apr 8, 2013

http://www.slideshare.net/IgorAnishchenko/pb-vs-thrift-vs-avro

+1 on Avro - it's more Python-like (and no compiling schemas)

@ghost commented Apr 8, 2013

Good read: http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html. It's somewhat scary how accurate the psychological profile in the intro is.

@ghost commented Apr 9, 2013

Please bake a serialization format version into this.

@wesm (Member, Author) commented Apr 9, 2013

Agreed re: versioning
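
A minimal sketch of baking a version into a file header (the magic bytes and helper names are invented for illustration, not what pandas ultimately used):

import struct

MAGIC = b'PDBF'      # invented magic bytes
FORMAT_VERSION = 1

def write_payload(path, payload):
    # Prefix the serialized payload with magic bytes and a version,
    # so future readers can dispatch on (or reject) old versions.
    with open(path, 'wb') as f:
        f.write(MAGIC)
        f.write(struct.pack('<H', FORMAT_VERSION))
        f.write(payload)

def read_payload(path):
    with open(path, 'rb') as f:
        if f.read(4) != MAGIC:
            raise ValueError('unrecognized file format')
        (version,) = struct.unpack('<H', f.read(2))
        if version > FORMAT_VERSION:
            raise ValueError('unsupported format version %d' % version)
        return f.read()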

@jreback (Contributor) commented Oct 1, 2013

closed by #3525

jreback closed this as completed Oct 1, 2013