
Feature Request: Boost dependencies should be modular #545

Closed · joincamp opened this issue May 3, 2017 · 18 comments

joincamp commented May 3, 2017

Summary:

The Boost super-library is included, but only a few of its modules appear to be used. The super-library is very large, which makes it hard to deploy Stan (PyStan in my case) to architectures with size limitations (AWS Lambda in my case). Rather than shipping the full super-library, only the necessary modules should be included.

Description:

It appears that only the following Boost modules are used directly (boostdep would most likely be needed to determine the full subset of modules in use):

  • type_traits
  • utility
  • math
  • random
  • throw_exception
  • tuple
  • numeric
  • odeint

I don't do much C++, but I think this project should either use https://svn.boost.org/trac/boost/wiki/ModularBoost or just manually embed the subset of libraries it actually uses.
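
A rough way to sanity-check that list is to collect the top-level Boost directories that the Stan Math headers include directly. The sketch below is illustrative only: the source path is an assumption and would need adjusting to the actual checkout, and it only catches direct includes (boostdep would still be needed for the transitive set).

# Sketch: list top-level Boost modules included directly by the Stan Math headers.
# "stan/math" is an assumed path into a Stan Math checkout.
grep -rhoE '#include <boost/[A-Za-z0-9_]+' stan/math \
  | sed 's|#include <boost/||' \
  | sort -u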

Reproducible Steps:

In my case, using a project that depends on PyStan (fbprophet): delete Boost modules that are not referenced anywhere in Stan Math (e.g. phoenix), then regression-test the parent project.

Current Output:

When modules like phoenix are removed, there do not appear to be any regressions in downstream projects.

Expected Output:

Much smaller distributable sizes.

Additional Information:

Maybe http://www.boost.org/doc/libs/master/tools/boostdep/doc/html/ could be used to ensure the right modules are included, and http://www.boost.org/doc/libs/1_64_0/tools/bcp/doc/html/index.html to generate the distributable. From the bcp documentation:

The bcp utility is a tool for extracting subsets of Boost, it's useful for Boost authors who want to distribute their library separately from Boost, and for Boost users who want to distribute a subset of Boost with their application.
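
As a rough sketch (the module names, the Boost tree path, and the output directory below are illustrative assumptions, not a verified invocation), bcp could be pointed at the bundled Boost tree and asked for just the modules listed above plus their dependencies:

# Sketch only: extract a subset of Boost, plus whatever it depends on, into a trimmed tree.
bcp --boost=lib/boost_1.62.0 \
    math random type_traits utility tuple throw_exception \
    boost/numeric/odeint.hpp \
    lib/boost_subset

boostdep could then be used to double-check the dependency closure before swapping the subset in for the full tree.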

Current Version:

v2.15.0

bob-carpenter (Contributor) commented

Thanks for the request and tips about Boost. I didn't even know they'd done that---we've been using it since before it moved to GitHub and I haven't really paid attention to their version control software.

Any recommendations for integrating this with the submodule structure we already have? CmdStan, RStan, and PyStan depend on Stan, which in turn depends on Math. Stan uses phoenix as part of the parser, for instance.

Part of the reason we just tossed in all of Boost is that it's easier on our developers---we don't have to fiddle with new bits of Boost as we introduce them (we haven't used any of the non-header-only libs).

In R, we depend on Boost through the BH package, which I don't believe is 100% of Boost. And I know Jiqiang used to have a way of cutting down Boost. It's surprising how big it is even after only including what you use---the libraries we use tend to make liberal use of other Boost libs.

I'd like to hear from the repo managers: @betanalpha, @seantalts, @syclik, @bgoodri, @ariddell

bgoodri (Contributor) commented May 3, 2017 via email

joincamp (Author) commented May 3, 2017

@bob-carpenter that's an interesting point about the Boost libs being used in Stan as well as in Math; I didn't realize that was the case. My own use case must be a corner case that doesn't exercise that subset of functionality in Stan.

@bgoodri since you already ran that, do you mind telling me the /tmp/boost size vs the ./lib/boost size?

ariddell commented May 3, 2017

I'm happy with culling parts of boost if it can be done reliably and automatically.

@joincamp the binaries (or Python extension modules) that Stan generates can be rather big. For example, _api.cpython-35m-x86_64-linux-gnu.so, which wraps stanc, is 188M. I'm not sure removing parts of Boost from the source will make any dent in that. What exactly is the size/memory limit you're hitting?

joincamp (Author) commented May 3, 2017

@ariddell I'm a few levels removed from this, so I'm trying to make sense of it. I'm using https://github.com/facebookincubator/prophet, which depends on PyStan, and I'm trying to fit everything into a deployable package for AWS Lambda (ephemeral disk capacity of 512 MB, or uncompressed zip/jar size of 250 MB, depending on the approach I take). The limits on uncompressed disk usage are strict, and when I investigated the site-packages directory in my Python virtual environment, these were the heavy hitters:

28M	./Cython
35M	./matplotlib
47M	./fbprophet
75M	./numpy
83M	./pandas
445M	./pystan

After some very inelegant hacking and slashing, I was able to trim down a workable (for me) version that has these sizes:

8.7M	./Cython
14M	./pandas
47M	./fbprophet
56M	./numpy
117M	./pystan

Mostly by removing modules from the Boost library (under the incorrect assumption that Boost was only being used by Math, and not downstream in Stan).

Since I was able to make my particular use case work, I was hoping to go about this the right way and see whether there is a more general solution. The fact that my assumptions were incorrect may invalidate that, though; I might just be in the wild west of monkey-patching.

ariddell commented May 3, 2017

Boost is certainly a heavy hitter.

672K	./cpplint_4.45
139M	./boost_1.62.0
1.3M	./gtest_1.7.0
1.5M	./cvodes_2.9.0
5.3M	./eigen_3.2.9
147M	.

If prophet uses a fixed model you could probably throw away all the source (and _api).

joincamp (Author) commented May 3, 2017

That should be perfect. It uses pkl files for the models.

bgoodri (Contributor) commented May 4, 2017 via email

bob-carpenter (Contributor) commented May 4, 2017 via email

bob-carpenter (Contributor) commented May 4, 2017 via email

ariddell commented May 4, 2017

gtest and cpplint are already excluded from the PyStan distribution.

joincamp (Author) commented May 4, 2017

Looks like about 10M of that is logs too; the actual boost dir is only 55M:

/tmp/boost_1.62.0 > du -h -d1
 55M	./boost
 66M	.

ariddell commented May 5, 2017

For the record, here's what PyStan trims during release. Suggestions for additions would be welcome. If there are parts of Boost we're certain are not going to be used, we can at least stop distributing copies of them.

# unused parts of the stan source
prune pystan/stan/make
prune pystan/stan/src/docs
prune pystan/stan/src/doxygen
prune pystan/stan/src/python
prune pystan/stan/src/test

# unused parts of the stan math source
prune pystan/stan/lib/stan_math*/doc
prune pystan/stan/lib/stan_math*/doxygen
prune pystan/stan/lib/stan_math*/lib/cpplint_*
prune pystan/stan/lib/stan_math*/lib/gtest_*
prune pystan/stan/lib/stan_math*/make
prune pystan/stan/lib/stan_math*/test

(from MANIFEST.in)
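
If further pruning of Boost turned out to be safe, additions in the same style might look like the following. These entries are hypothetical, not part of the actual MANIFEST.in, and each candidate directory would need to be verified (e.g. with boostdep) before shipping:

# hypothetical additions -- each would need verification before release
prune pystan/stan/lib/stan_math*/lib/boost_*/doc
prune pystan/stan/lib/stan_math*/lib/boost_*/libs/*/test
prune pystan/stan/lib/stan_math*/lib/boost_*/libs/*/example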

depet commented Jun 20, 2017

@joincamp, I have exactly the same issue. Would you mind sharing a list of the modules you removed from the Boost library, or from any other libraries? It seems you somehow managed to reduce the size of the pandas and numpy libraries as well.

joincamp (Author) commented Jun 21, 2017

@depet I ran through some general cleaning procedures that I found elsewhere. It basically amounted to removing .pyc files, documentation, and tests.

Here is a gist of the files in my distributable. Maybe you can diff it to yours to get an idea of what you might be able to remove.

https://gist.github.com/joincamp/69f32ee84ef1eb9c1dfac2c5b4449739
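
For anyone reproducing this, the general cleaning amounts to something like the following, run inside site-packages before zipping for Lambda. This is a sketch rather than the exact commands used, and the tests deletion in particular is aggressive, so regression-test afterwards:

# Sketch only: strip compiled bytecode, caches, and bundled test suites from site-packages.
find . -name '*.pyc' -delete
find . -type d -name '__pycache__' -exec rm -rf {} +
find . -type d -name 'tests' -exec rm -rf {} +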

depet commented Jun 21, 2017

Thanks a lot @joincamp - much appreciated.

rok-cesnovar (Member) commented

Closing this issue, as the Boost lib has since been upgraded a number of times and we now prune it to reduce its size. We could prune it further if desired; please open another issue if that is the case.
