Add support to save model using optimized data formats #504
That sounds like a good idea. I don't know much about the binary-serialization landscape in C++ though, so we'll need to do some research first. (P.S.: parametric models will be much smaller.)
Interesting! The current implementation uses boost property trees, which can be exported to a number of data formats representable by such a tree, including XML, INI, and JSON. Are you using truncated models? If not, you should consider it. Also, are you using nonparametric families? I just tried a 2000-dimensional parametric model truncated after 2 trees and got less than 2 MB; for a nonparametric model, it was below 200 MB. The issue with nonparametric models is that a 30x30 grid of numbers needs to be stored for each pair, and JSON, XML, and the like are plain-text formats, so any non-binary representation will take a lot of space.
We set this to tll with 50 levels at the moment. Happy to share the model (~200 MB compressed).
This may be an edge case for copula users. Otherwise, would it be possible to consider a move to something like HDF5 in future releases, instead of text-based formats, which are inherently inefficient for large data volumes? (Some C++ examples: https://support.hdfgroup.org/HDF5/doc/cpplus_RM/examples.html)
A 2000-dimensional nonparametric model with 50 trees is definitely something we aim to be able to handle. To be honest, we've never encountered this issue because, when experimenting with such large models, we were using the R interface, where objects are saved in a binary format rather than plain text. But the Python bindings really only wrap the C++ code, and I've been looking at solutions to this issue. The main problem is that plain text without compression isn't a good idea at that model size. As you've noticed, the compressed files are much smaller, so we could surely do a lot better. One way that seems sensible without adding other dependencies (moving to HDF5 is currently not feasible because of this requirement): use boost serialization and hook up some compression on top (see e.g. here and here). @tnagler What do you think?
Also, HDF5 was considered as an addition to boost serialization a long time ago, but it didn't pan out. Not sure why.
I like it! Seems both easy and solid.
Quick update: after noticing that boost serialization isn't header-only, we need to find another way.
Small update. I spent some time today on this issue. Since #539, we are now using https://github.com/nlohmann/json instead of boost::property_tree, which lets us convert models to and from JSON directly in C++.
I'm integrating this into
Sorry, shouldn't have closed right away :)
Alright, it's in. I'm closing for now, don't hesitate to reopen!
OK, I just did the following:
And then
I was also looking at
I am currently working with a large model (> 2 GB as JSON), which makes the files large and loading slow due to parsing. Are there any plans to add drivers for saving to other formats like HDF5?