
Save space by compressing at least some of the builds #23

Closed
AlexDaniel opened this issue Aug 12, 2016 · 16 comments
Labels
testneeded: Issue is generally resolved but tests were not written yet
whateverable: Issues affecting the bot framework (and therefore all of the bots)

Comments

@AlexDaniel
Member

AlexDaniel commented Aug 12, 2016

OK, so here is the math!

Each build uncompressed is ≈28 MB. This does not include unnecessary source files or anything, so it does not look like we can go lower than that (unless we use compression).

So how many builds do we want to keep? Well, looking one year back from now is about 3000 commits. This gives roughly 84 GB, and that is just one year of builds for just the MoarVM backend. In about 10 years we will slowly start approaching the 1 TB mark. Multiply that by the number of backends we want to track.
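
For reference, the back-of-envelope arithmetic (assuming ≈28 MB per build and ≈3000 commits per year, as above):

echo "$(( 28 * 3000 )) MB per year"            # 84000 MB ≈ 84 GB
echo "$(( 28 * 3000 * 10 )) MB per decade"     # 840000 MB ≈ 840 GB, approaching 1 TB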

Is it a lot? Well, no, but for folks who keep this on an SSD (like me) it might be a problem.

Given that people commit stuff at a slightly faster pace than storage becomes significantly cheaper, I think that we should compress it anyway (even if it is moved to a server with more space). It is a good idea in the long run. And it will make it easier for us to throw in some extra stuff (JVM builds, or maybe even 32-bit builds or something? You never know).

OK, so what can we do?

Filesystem compression

The most obvious one is to use compression in btrfs. The problem is that it is applied to each file individually, so we are not going to save anything across many builds. Also, it is only viable if you already have btrfs, so it does not look like the best option.
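
For reference, a minimal sketch of what btrfs compression would look like (device and mount point are placeholders):

# transparent filesystem-level compression (zlib by default), only on btrfs
mount -o compress /dev/sdXN /mnt/builds

# or flag just the builds directory so that newly written files get compressed
chattr +c /mnt/builds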

Compress each build individually

While it may sound like a great idea to compress all builds together, it does not work that well in practice. Well, it does, but keep reading.

The best compression I got is with 7z. Each build is ≈4 MB (vs ≈28 MB uncompressed, so a ≈7× space saving!)
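
Roughly what the per-build archiving looks like (archive and directory names are illustrative, not the exact paths used by the bots):

# one archive per build, named after the commit SHA
7z a c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.7z rakudo-moar-c587b9d3/

# unpacking a single build later takes one command
7z x c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.7z -o/tmp/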

Compressing each build individually is also good for things like local bisect. That is, we can make these archives publicly available, and then write a script that will pull these archives for your local git bisect. How cool is that! That will be about 40 MB of stuff to download per git bisect, and you cannot really compress it any further anyway because you don't know ahead of time which files you would need.
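
A rough sketch of how that local bisect could work; fetch-build is a hypothetical stand-in for the archive-pulling script, and the path to perl6 inside the unpacked build is an assumption:

#!/bin/sh
# test-commit.sh: driver script for "git bisect run"
sha=$(git rev-parse HEAD)
fetch-build "$sha" /tmp/"$sha" || exit 125    # 125 tells bisect to skip commits with no build
/tmp/"$sha"/bin/perl6 my-test-case.p6         # non-zero exit marks the commit as bad

With something like that in place, git bisect run ./test-commit.sh would walk the history while downloading only the ≈4 MB archives it actually needs.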

This gives us ≈120 GB per 10 years. Good enough, I like it.

Is there anything that performs better than 7z? Well, yes:

27M     c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar
9.0M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lz4
7.0M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.zip
6.7M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.gz
6.7M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lzfse
6.4M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.bz2
5.7M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L15.tar.zst
4.9M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L22.tar.zst
4.7M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lzham
4.5M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.brotli
4.3M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lrz
4.1M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.7z
4.1M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tpxz
4.0M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.xz
3.8M    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-extremely-slow.tar.lrz

xz is much slower during compression and a tiny bit slower when decompressing, so the win is insignificant. lrz with extreme options is much slower at everything, so forget it.

Let's compress everything together!

For the tests, I took only 7 builds:

c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7
3878066a953195276ef99739f157d38793395d06
ef04e1e07f1de0d4eb2666985c7290f96c912be6
eea786e2a2febef0ab0bfabca956beae95ab81fd
76be77c9d6e697c26e92dc704109b7b8780845aa
d30806bd00dee201cd891550ac65f084f18e8285
9bfbab9186d710e0603b1eb86be1e5ba2e0c84d1

Now, there are some funny entries here. Obviously, .tar is the size of all builds together, uncompressed. git-repo is a git repo with every build committed one after another in the right order. Look, it performs better than some of the other things! Amazing! And even better if you compress it afterwards. Who would've thought, right?

187M    all.tar
63M     all.tar.lz4
49M     all.zip
47M     all.tar.gz
45M     all.tar.bz2
42M     git-repo.tar
35M     all.tar.zstd # -19
28M     all.tar.xz
28M     all.7z
19M     git-repo.7z
9.5M    all.tar.zstd # -19 --long
8.3M    all.tar.lrz
7.3M    all.tar.zstd # --ultra -22 --long=31 --format=lzma # but it's much slower than lrz when compressing

However, lrz clearly wins this time. And that's with default settings! Wow.
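
Roughly what these whole-archive experiments look like on the command line (file names are illustrative; both tools are at their defaults unless noted in the list above):

# everything into one tar, then compress the whole thing
tar -cf all.tar rakudo-moar-*/

lrzip all.tar                              # default settings, produces all.tar.lrz
zstd -19 --long all.tar -o all.tar.zstd    # zstd with long-range matching enabled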

Conclusion

I think that there are ways to fiddle with options to get even better results. Suggestions are welcome!

However, at this point it looks like the best way is to use 7z to compress each build individually.
Messing around with one huge archive is probably not worth the savings: it would make decompression significantly slower, and we want decompression to be as fast as possible.

@AlexDaniel AlexDaniel self-assigned this Aug 12, 2016
@niner

niner commented Aug 12, 2016

On Friday, 12 August 2016 07:51:40 CEST, Aleks-Daniel Jakimenko-Aleksejev wrote:

> Now, there are some funny entries here. Obviously, .tar is the size of
> all builds together, uncompressed. git-repo is a git repo with every
> build committed one after another in the right order. Look, it performs
> better than some of the other things! Amazing! And even better if you
> compress it afterwards. Who would've thought, right?
>
> However, lrz clearly wins this time. And that's with default settings!
> Wow.
>
> Conclusion
>
> I think that there are ways to fiddle with options to get even better
> results. Suggestions are welcome!

Actually I like the git repo idea very much. With the build files in a git repo
instead of the source files, one can use plain old git bisect to find the
offending commit. I'm not surprised that git performs so well on this task, as
it stores each piece of content only once. So unless a file got changed, it will
not be stored again for the new version.

Have you run git repack after committing the different versions? That should
reduce the repository size considerably, since it uses delta compression, which
should help especially with the larger build files that change a lot, like
CORE.setting.moarvm.

Stefan

@AlexDaniel
Member Author

> Actually I like the git repo idea very much.

It is hard to tell if it is going to perform better when we put all builds into it. Currently, with just 7 builds in, 28 MB repo size is equivalent to storing each build separately (≈4 MB per build).

Also, I'm not sure if performance is going to be adequate. Bisect has to jump a couple of hundred commits back and forth, and that is definitely slower than just unpacking a 4 MB archive (or am I wrong?).

> Have you run git repack after committing the different versions?

Well, yes, it says there's nothing to repack (perhaps git gc called it automatically?).
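
For completeness, a repack can still be forced by hand; a sketch (the window/depth values are arbitrary, just to illustrate):

# force a full repack with an aggressive delta search
git repack -a -d -f --window=250 --depth=250

# or let gc redo everything, including pruning
git gc --aggressive --prune=now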

@MasterDuke17
Collaborator

How about testing LZ4 or LZHAM? I suspect they won't compress as well, but they are supposed to be very fast at decompressing, so the trade-off might be worth it.

@AlexDaniel
Member Author

AlexDaniel commented Aug 13, 2016

@MasterDuke17 I've added lz4 (and a bunch of other stuff) to the main post.

LZ4 is actually a very good finding, thank you very much. Indeed, we should probably forget about space savings and think about decompression speed instead.

How long does it take to decompress one build compressed with 7z? 0.4s. Why? See this. Basically, LZMA is not a good fit for multithreading (besides just being slow).
7z does not look like a good candidate anymore. Typical bisect takes a bit less than 15 steps, so that's a 5 second delay just to decompress the data. We can speed it up a bit by keeping at least the starting points decompressed, but in the end it is still a significant delay.

So let's see how fast things decompress:

0m45.291s   c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-extremely-slow.tar.lrz
0m0.716s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.bz2
0m0.548s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lzfse
0m0.421s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lrz
0m0.415s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.xz
0m0.414s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.7z
0m0.325s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tpxz
0m0.262s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.zip
0m0.225s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.gz
0m0.208s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lzham
0m0.169s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.brotli
0m0.130s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L22.tar.zst
0m0.120s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L15.tar.zst
0m0.087s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lz4
0m0.049s    c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar

Almost everything is with default options, so feel free to recommend something specific.
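
Timings of this sort can be reproduced with something like the following (a sketch; the exact invocations and the target directory are my assumptions):

mkdir -p /tmp/bench

time tar -xf c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.gz -C /tmp/bench
time 7z x c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.7z -o/tmp/bench
time zstd -d -c c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7-L22.tar.zst | tar -x -C /tmp/bench
time lz4 -d -c c587b9d3e6f9c34819e2e8a9d63b8dc98a20a6d7.tar.lz4 | tar -x -C /tmp/bench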

As stupid as it sounds, brotli is a clear winner right now (UPDATE: nope. See next comment). It is a bit slow during compression, but I don't mind it at all.

@AlexDaniel
Member Author

AlexDaniel commented Aug 14, 2016

We have a new winner: https://github.com/Cyan4973/zstd

≈0m0.130s decompression, ≈4.9M size, compression faster than brotli. Basically, it is a winner on all criteria except for file size, and it is only ≈0.4MB worse. Where is the catch??
In fact, zstd is rather young, and if I understand correctly it is not multithreaded yet. So perhaps it will get even better?

We can tweak it a bit by using a different compression level. The numbers above are with the max level (22), but we can make it ≈10 ms faster by sacrificing ≈0.8 MB (level 15). I don't care about either of these.
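
Concretely, the two settings being compared are something like this (file names are illustrative; levels above 19 need --ultra):

# level 22: the ≈4.9M / ≈0.130s entry in the list above
zstd --ultra -22 build.tar -o build-L22.tar.zst

# level 15: ≈0.8 MB bigger but ≈10 ms faster to decompress
zstd -15 build.tar -o build-L15.tar.zst

# decompression is the same command either way
zstd -d build-L22.tar.zst -o build.tar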

@xenu

xenu commented Aug 14, 2016

@AlexDaniel: were you using "plain" xz or maybe pixz in these benchmarks? pixz claims to be much faster on both compression and decompression, thanks to its parallelization.

@AlexDaniel
Member Author

@xenu pixz is .tpxz in these tests. It does perform faster, but not by much.

@AlexDaniel
Member Author

AlexDaniel commented Aug 14, 2016

By the way, I found this blog post very interesting: http://fastcompression.blogspot.com/2013/12/finite-state-entropy-new-breed-of.html
It is by the author of lz4 and zstd. Amazing stuff.

AlexDaniel added a commit that referenced this issue Aug 23, 2016
Also, bash scripts were replaced with perl 6 code.

zstd was chosen for its fast decompression speed and good compression ratio. See
issue #23 for more info.

Rakudo is now being installed into “/tmp”. This way it will be easier for
others to extract archives on their system if we ever publish them (because
/tmp is writeable on almost any system). For example, we can let people run
git bisect locally by downloading prepared builds.

This means that all previous builds have to be rebuilt. That's not a big deal,
but it will take up to three days to process.

Another side effect of these changes is that bots will not see new commits
until they are actually built. Now, when running something on HEAD, you can be
sure that the commit HEAD points to is actually ready to roll. No more
“whoops, try using some older commit” error messages.

Currently it is very unstable and more testing is required.
@AlexDaniel
Member Author

AlexDaniel commented Aug 26, 2016

OK, so this has been implemented some time ago along with other major changes. Given that everything is written in 6lang now, some things tend to segfault sometimes… but otherwise everything is fine. At least, the compression is definitely there, so I am closing this.

@MasterDuke17
Collaborator

MasterDuke17 commented Aug 31, 2016

@AlexDaniel AlexDaniel added the whateverable Issues affecting the bot framework (and therefore all of the bots) label Sep 21, 2016
@AlexDaniel AlexDaniel added the testneeded Issue is generally resolved but tests were not written yet label Jan 6, 2017
@AlexDaniel AlexDaniel reopened this Mar 12, 2017
@AlexDaniel AlexDaniel removed their assignment May 14, 2017
@AlexDaniel
Member Author

AlexDaniel commented Aug 8, 2020

Some news! I never liked the dependency on lrzip because it is only really needed when working with old builds. Now that zstd has a long-range mode, I think it's better to rely on zstd for everything.

I have updated the post, but basically this is my finding:

9.5M    all.tar.zstd # -19 --long
8.3M    all.tar.lrz
7.3M    all.tar.zstd # -19 --long --format=lzma # but it's much slower than lrz when compressing

So lrzip is still somewhat winning when it comes to compression ratio. Moreover, it is blazing fast during compression. zstd takes ≈70 seconds to finish the job, while lrzip gets it done in ≈20s.

However, what I actually care about is decompression speed (because we need fast access to slowly accumulating builds):

9.5M    all.tar.zstd  # 0.338s
8.3M    all.tar.lrz   # 1.332s
7.3M    all.tar.zstd  # 0.946s

So lrzip stands out as being noticeably slow. It's clear that after the transition we will win on both compression ratio and decompression speed.

This test was done using 7 builds in a single archive; I chose that number based on the behavior of lrzip, and really it was picked semi-randomly for benchmarking only. The actual number of builds per archive is 20. Putting more builds into one archive tends to make decompression slower but obviously improves the overall compression ratio.

I guess the next step is to increase the number of builds in zstd archives until I reach the same decompression speed, then compare the ratio.
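
For reference, a sketch of the long-range setup for a batch of builds (the window size and file names are assumptions, not necessarily what whateverable ends up using):

# one archive per batch; --long enables a large matching window so identical
# files shared between builds get deduplicated
tar -cf batch.tar build-*/
zstd -19 --long=31 batch.tar -o batch.tar.zst

# the enlarged window has to be allowed explicitly on decompression too
zstd -d --long=31 batch.tar.zst -o batch.tar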

@AlexDaniel
Member Author

<MasterDuke> AlexDaniel`: i wonder if https://github.com/mhx/dwarfs would be good for the *ables. might make managing the builds simpler

(log)

Fantastic find! We should definitely bench it one day.

@AlexDaniel
Member Author

I think using zstd for everything is a good idea. It'd make some code paths more generic, and it'll drop the dependency on lrzip. dwarfs is fine locally, but whateverable also serves files for Blin and other remote use cases, so zstd is still needed.

@AlexDaniel
Member Author

Closing this in favor of #389.

@MasterDuke17
Collaborator

> I think using zstd for everything is a good idea. It'd make some code paths more generic, and it'll drop the dependency on lrzip. dwarfs is fine locally, but whateverable also serves files for Blin and other remote use cases, so zstd is still needed.

If the whole architecture were being redone from scratch, using dwarfs for local storage and compressing on the fly with zstd when serving files would be an interesting experiment.

@AlexDaniel
Member Author

AlexDaniel commented Nov 7, 2023

Yes and no 🤷🏼 Depending on how much you're using the mothership remotely, it might be that you'll have lots of archives locally. If they're compressed in long-range mode, your local setup will be as efficient (storage-wise) as the remote one. Compressing on the fly in long-range mode doesn't work (it takes roughly a minute to do that for 20 builds). Of course, nothing stops you from using dwarfs locally, but the current system with archives seems easier and simpler.
