
Support *-sys packages larger than 10MB (somehow) #40

Closed

emk opened this issue Nov 20, 2014 · 8 comments
@emk commented Nov 20, 2014

(Continuing a discussion started here.)

The cld2 library is a natural-language detection library from Google, and it does some pretty cool stuff. I've packaged it as two Rust libraries, cld2 and cld2-sys. But because the upstream cld2 library is packaged by very few Linux distributions, I've chosen to distribute the source code with the cld2-sys package and build it using the Rust gcc library. So far, so good—all this works quite nicely.
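In spirit, the build script is just the gcc crate compiling the bundled sources, something like this sketch (the file names here are illustrative, not the real cld2 layout):

```rust
// build.rs — a sketch only; the actual cld2-sys file list differs.
extern crate gcc; // [build-dependencies] gcc = "0.3"

fn main() {
    gcc::Config::new()
        .cpp(true)                                 // cld2 is C++
        .include("cld2/internal")                  // bundled headers
        .file("cld2/internal/cldutil.cc")          // illustrative source files
        .file("cld2/internal/generated_tables.cc")
        .compile("libcld2.a");                     // produces OUT_DIR/libcld2.a
}
```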

But I can't upload the package to crates.io because it contains statistical language models, and those models are just too big:

$ du -sh target/package/cld2-sys-0.0.1.crate 
35M target/package/cld2-sys-0.0.1.crate

I can shrink this down somewhat (by omitting everything I don't need for the build), but I almost certainly can't get it under the 10MB limit. I can think of a couple of ways to address this issue:

  1. Accept that certain *-sys packages will be larger than 10MB, and provide some way to override the limit selectively.
  2. Store compressed source code in an S3 bucket, and ask build.rs to download it. But this introduces a dependency on an outside data source that may go away.

Any thoughts on the best way to handle this? Thank you for your advice, and for a great package-management system!
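For option 2, a build.rs could shell out to fetch and unpack a tarball, roughly like the sketch below (the URL is hypothetical, and relying on curl/tar being installed is exactly the kind of external dependency that worries me):

```rust
// build.rs — option 2 sketch: fetch and unpack the sources at build time.
// The bucket URL is hypothetical; needing curl and tar on the build machine
// is part of what makes this approach fragile.
use std::env;
use std::process::Command;

fn main() {
    let out_dir = env::var("OUT_DIR").unwrap();
    let url = "https://example-bucket.s3.amazonaws.com/cld2-src.tar.gz"; // hypothetical
    let archive = format!("{}/cld2-src.tar.gz", out_dir);

    // Download the compressed source tarball.
    let status = Command::new("curl")
        .args(&["-L", "-o", archive.as_str(), url])
        .status()
        .expect("failed to spawn curl");
    assert!(status.success(), "download failed");

    // Unpack it into OUT_DIR.
    let status = Command::new("tar")
        .args(&["xzf", archive.as_str(), "-C", out_dir.as_str()])
        .status()
        .expect("failed to spawn tar");
    assert!(status.success(), "extraction failed");

    // ...then compile the unpacked sources with the gcc crate as usual.
}
```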

@DanielKeep

I thought about this a little when I was working out how to get binary dependencies for a Rust project. In the end, I decided that the build script should try the following (note: this was in Python, before Cargo had build scripts):

  • Check a standard drop location to see if the necessary files are already present (local override).
  • Run any system-specific locators that might help (pkg-config on *nix, shrug and give up on Windows).
  • Try to download a pre-compiled binary from the official website, for the current platform, to a reasonable cache location.
  • Try to check out the source from the official repository, cross its fingers, and hope the user has the necessary software to build it (probably after prompting them).

I've always felt that just compiling the source is dicey, as Windows doesn't have a C compiler by default. Since Rust no longer depends on GCC, you can't even assume one is present on Windows. Besides which, it basically ignores any version installed on the system, which might cause surprising behaviour (“but I updated libsplang on my system to close the security vulnerability; how'd I get exploited?!”, or “why can't prog-a and prog-b share files? They're both using libsplang!”).

It might be worth having a standard sysdep package that abstracts all this, so it doesn't have to be re-engineered for every project.
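As a rough illustration of that cascade in a Cargo build script (CLD2_DIR, the pkg-config name, and the stubbed-out fallback are all placeholders for the sketch):

```rust
// build.rs — a sketch of the lookup cascade described above; CLD2_DIR and
// the pkg-config name are placeholders, and the final fallback is stubbed out.
extern crate pkg_config; // [build-dependencies] pkg-config = "0.3"

use std::env;

fn main() {
    // 1. Standard drop location / local override.
    if let Ok(dir) = env::var("CLD2_DIR") {
        println!("cargo:rustc-link-search=native={}", dir);
        println!("cargo:rustc-link-lib=cld2");
        return;
    }

    // 2. System-specific locator: pkg-config on *nix (it prints the link
    //    directives itself on success).
    if pkg_config::probe_library("cld2").is_ok() {
        return;
    }

    // 3. Pre-built binary cache and 4. source checkout/build would go here.
    panic!("cld2 not found; a real script would now fetch a binary or build from source");
}
```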

@alexcrichton (Member)

Another possible route here would be to compress with xz or bzip2. For me, that shaves 10MB off the size of the packed-up cld2 directory. In general, though, @steveklabnik was right on reddit: we don't want to let this get out of control too fast.

@emk commented Nov 21, 2014

@DanielKeep: I'd use system packages for cld2, but it's not a very widely-packaged library. Plus, I need a build solution for Heroku, where I have no control over the installed libraries.

lifthrasiir has just sent me emk/rust-cld2#1, which removes cld2's documentation, deletes some unused data tables, and strips comments from the source code (which substantially boosts compression performance). This gets the rust-cld2 crate under 10MB, at least for this version, though the recent update to the upstream project may break it.

Is there any way to run a custom script during the packaging process? If not, maybe I need to fork cld2 and produce a stripped down git repo. Or cache tarballs on S3, but I'm trying to avoid that.

I'd love to find a good solution here.

@lifthrasiir

@alexcrichton If the crate has data whose inherent entropy exceeds 10MB, we are left with no choice but workarounds.

In the particular case of cld2, the main source of excess entropy is a comment (with UTF-8-encoded words for each entry), and removing comments really helps, but the table itself already exceeds 10MB and no common general-purpose compressor can easily pack it. (My estimate is that the actual entropy is some 7 or 8MB, as about 40% of the data is somewhat correlated with the rest, but that wouldn't be very easy to exploit.)

@alexcrichton (Member)

@lifthrasiir we've got to draw the line somewhere in terms of package uploads, or it'll get out of hand. Some crates will always fall on the other side of the line (and this one may, for example).

@emk commented Nov 21, 2014

Yeah, I can see there's an obvious tension between:

  1. Wanting reproducible builds coming entirely from inside crates.io.
  2. Keeping crate sizes reasonable.
  3. Packaging libraries according to the *-sys convention (and therefore being able to easily deploy them to Heroku, etc).

cld2 is a very interesting case, because it legitimately needs large data tables to do its job, and the official version is an unpackaged Subversion repository. On the other hand, it's a pretty useful library, and I have some server-side Rails projects that use it quite successfully in production.

Then there are the semi-evil solutions, including breaking cld2 up into multiple packages by detected language, or some such. I'm going to try to figure out how these tables fit together, and see if I can find a clever solution.

@emk commented Nov 22, 2014

Using @lifthrasiir's well-researched patch as a starting point, I've created a new git mirror of the upstream cld2 repository, stripped the comments as proposed, and built an exclude list in my Cargo.toml file. With all these tweaks, the cld2-sys package is now down to 6.5MB.
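For anyone wanting to do the same, the exclude list is along these lines (the paths here are simplified placeholders rather than the exact entries):

```toml
# Cargo.toml — paths below are placeholders, not the exact cld2-sys entries.
[package]
name = "cld2-sys"
version = "0.0.1"
exclude = [
    "cld2/docs/*",
    "cld2/internal/*_unittest.cc",
    "cld2/internal/unused_tables/*",
]
```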

There are a bunch of table files that aren't getting included in the current build, and I'll need to look into those later, so we may see this problem again in the future.

But at least for now, for this one package, we appear to have a workable solution. Thank you to everybody who helped out, especially to @lifthrasiir for figuring out how to cut down the package size.

alexcrichton added a commit that referenced this issue Nov 18, 2015

There's still a global limit on the nginx server but each crate can now have its
own maximum limit as well which is larger than the standard limit.

Closes #40
Closes #195
@alexcrichton (Member)

With the change I just merged, just contact me over IRC/email/whatnot and I can raise the limit for individual crates.
