Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caching the contents of jars on disk from run-to-run #83

Merged
merged 3 commits into from
May 23, 2013

Conversation

d40cht
Copy link

@d40cht d40cht commented May 21, 2013

Here is a basic implementation of on-disk jar content caching from run-to-run. I've added the sha1 hash of the jar itself into the name of the unzip contents directory and the .jarName file. Also the .jarName file is created last (only after a successful unzip) and it is the existence of this file that is used to check whether to unzip or use a cache. So under most circumstances the assembly should be good even if the user aborts and restarts the build midway through.

Is this OK as-is, or would you like the caching to be conditional on a key? This change results in a very significant speedup on my setup.

@@ -151,8 +151,7 @@ object Plugin extends sbt.Plugin {
import Cache._
import FileInfo.{hash, exists}

IO.delete(tempDir)
tempDir.mkdir()
if ( !tempDir.exists ) tempDir.mkdir()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if someone keeps adding/removing library deps, it'll accumulate here, but it won't be included in the final assembly?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. Should I implement a garbage collect at the end, keeping track of referenced directories and discarding the others?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as it doesn't end up in the final assembly, I think it's ok.

@eed3si9n
Copy link
Member

Adding a conditional key would be prudent in case someone wants to opt-out. But by default it should be enabled. What do you think about cacheUnzip as the name?

lazy val cacheUnzip = SettingKey[Boolean]("assembly-cache-unzip")

@d40cht
Copy link
Author

d40cht commented May 21, 2013

cacheUnzip sounds good to me. I'll add it in.

@eed3si9n
Copy link
Member

Also could you start notes/0.9.0.markdown release note (you can take a look notes/0.8.0.markdown as template)? It'd be nice to see some number comparison in there.

@d40cht
Copy link
Author

d40cht commented May 21, 2013

By the way: how do I run/add tests? I'm not very familiar with scripted.

@eed3si9n
Copy link
Member

I usually read my own tutorial: http://eed3si9n.com/testing-sbt-plugins

You probably have to change the version to 0.9.0-SNAPSHOT for the plugin, and manually change all the plugins.sbt under test.

@d40cht
Copy link
Author

d40cht commented May 21, 2013

Also: if cachedMakeJar were to sha1 each jar (where applicable) rather each individual contained file, presumably that would speed up assemblyCacheOutput somewhat. Is it worth me plumbing that through too?

@eed3si9n
Copy link
Member

Totally yes.

…e mappings to keep track of their parent jars where applicable. Compute the sha1 hash of the jar rather than extracted class files when deciding whether to rebuild when cacheOutput is set. Significant speedups.
@d40cht
Copy link
Author

d40cht commented May 22, 2013

Would it be worth putting sbt-assembly under autobuild (for scripted tests) via Travis CI?

@eed3si9n
Copy link
Member

github doesn' send me a notification when a commit gets added to a pull req, so I didn't see the commit 4h ago.
Travis integration should be another pull req.

@@ -4,7 +4,7 @@ name := "sbt-assembly"

organization := "com.eed3si9n"

version := "0.8.8"
version := "0.8.9"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During development SNAPSHOT version should be used.

@d40cht
Copy link
Author

d40cht commented May 22, 2013

Some timings for a largish build on a machine with reasonable speed local disks (having already run assembly once):

0.8.8 with cacheOutput: ~60s

0.9.0 with cacheUnzip, without cacheOutput: ~33s
0.9.0 with cacheUnzip, with cacheOutput: ~12s

@eed3si9n
Copy link
Member

What's 0.8.8 without cacheOutput? That's the baseline current users have.
Could you include first run and the second run? Currently cacheOutput is off by default because it was so slow, but I might turn it back on if the first run is decent.

@eed3si9n
Copy link
Member

Let me know if you're still working on this pull req or if it's ready to be merged.

@d40cht
Copy link
Author

d40cht commented May 23, 2013

Still working on it. Here are a load of runtimes as requested. Two sets, both on the same build, the first for a single project, the second for all the projects within the build that have an assembly target (11 of them). I'm including both sets because the numbers for running multiple assemblies at once are somewhat unexpected for the first assembly on 0.8.8. Here we go, and bear in mind that I've run each of these several times and there is about a +/- 10% variation in runtimes from build to build. So I'm assuming that if the numbers are within 10% of each other, they're effectively the same.

Single project

0.8.8

Run number No caching Cache output
1st 12s 14s
2nd 13s 14s
3rd 14s 13s

0.9.0

Run number No caching Cache unzip Cache unzip + cache output
1st 10s 10s 11s
2nd 11s 9s 2s
3rd 12s 9s 2s

Multiple projects, 11 with assembly target, sharing a lot of dependency jars

0.8.8

Run number No caching Cache output
1st 100s 53s(!)
2nd 102s 62s(!)

0.9.0

Run number No caching Cache unzip Cache unzip + cache output
1st 99s 101s 85s
2nd 119s 34s 12s
3rd 102s 32s 12s

I've not investigated the 0.8.8 multi-project results in enormous depth, but I have checked them and checked them again. I don't really have any idea why it's faster to run the first time with cacheoutput enabled - clearly it's doing more work, but I presume the IO is somehow being scheduled more efficiently as this assembly is pretty IO-bound. Any thoughts? Unfortunately the project concerned is my non-open work one - so I'm not able to share it for wider investigation.

As far as finishing the PR - I'll write some notes as requested, and add a test for the caching behaviour. And then it'll be ready - subject to your approval of course.

@eed3si9n
Copy link
Member

Interesting stuff. No rush on finishing up the pull req. I just wanted to make sure I'm not holding anything up.

…nsure that a jar built before and after caching has the same content hash.
@d40cht
Copy link
Author

d40cht commented May 23, 2013

I've pushed up another commit including a test for the caching and some changes to notes and README.md. Hopefully that'll be the last apart from any problems you spot. I've not enabled cacheOutput by default - I thought I'd let you make your own judgement (possibly including your own performance checks) and enable it if you'd like. But I guess this PR addresses #68 too?

eed3si9n added a commit that referenced this pull request May 23, 2013
Caching the contents of jars on disk from run-to-run
@eed3si9n eed3si9n merged commit 647381a into sbt:master May 23, 2013
@eed3si9n
Copy link
Member

Merged. Thanks for your contribution!

@d40cht
Copy link
Author

d40cht commented May 23, 2013

Thanks sir. Out of interest - when you would be planning to publish this?

@eed3si9n
Copy link
Member

Probably tonight or tomorrow, unless there are more things to work on.

@d40cht
Copy link
Author

d40cht commented May 23, 2013

Cool. And thanks for all the help.

@eed3si9n
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants