[bug] Inflated build time due to heavy dependencies #3571
Are you using anything like this in your Cargo.toml:
codegen-units will DEFINITELY slow you down
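For context, a low `codegen-units` value reduces how much of each crate rustc can compile in parallel. A generic illustration of the kind of setting meant here (not taken from any project in this thread):

```toml
[profile.release]
codegen-units = 1  # maximizes optimization, but each crate compiles with less parallelism
lto = true         # similarly trades build time for runtime performance
```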
I think we can get rid of bzip2 if we disable the default features of the `zip` crate. The situation looks worse for the other two crates though (at least at first glance).
Also, I pushed a PR that disables zip entirely if you're not using the updater. We can extend that to disable bzip2 and add a feature for it.
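A sketch of the kind of dependency change this implies (the version and feature names are assumptions, not the exact PR contents):

```toml
[dependencies]
# bzip2 support is one of the zip crate's default features; opting out of the
# defaults and re-enabling only what's needed drops the bzip2-sys build.
zip = { version = "0.5", default-features = false, features = ["deflate"] }
```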
I think we can easily replace `blake3`; it's only used to hash the assets. We just need to find a good replacement.
For the release profile, yes. For the debug profile, I have the following:

```toml
[profile.dev]
incremental = true
strip = false
lto = false
opt-level = 0

[profile.dev.package."*"]
opt-level = 0
incremental = true

[profile.dev.build-override]
opt-level = 0
incremental = true
```
blake3 is not only the name of the crate but also of the algorithm (afaik)
Hmm... is the high performance of Blake3 needed? Because that's the only benefit of using Blake3 that I can find, unless it was a decision from the security audit for a specific reason?
hey @chippers can you give us your thoughts here since you introduced blake3 and zstd to the asset system? |
At the time of #1430 I had used the cargo timings to test compilation times in general. When I was running it, the linking time absolutely swamped everything else. On computers that are more CPU constrained than IO constrained (like this issue's example), these slow-to-build crates can become a larger issue. Over time, it seems there have also been issues with longer compile times in some of these projects.

Overall, I was not really concerned with the crate dependency compile times because I was much more focused on dirty builds (where the dependencies have already been compiled), since that is the typical developer workflow loop. Additionally, I was not aware of how severely being CPU constrained (crate compilation) instead of IO constrained (linking) could affect build times. Along with the above reasoning, I chose blake3 for hashing and ended up adding zstd for compression. The runtime performance was important because that runtime cost is also paid during compile time, due to their use in the codegen.

tl;dr - I mostly focused on dirty build compile times (where the dependency is already compiled) and otherwise focused on runtime performance.

There are multiple ways to go for solutions; I'll start with blake3.
|
method | note | time
---|---|---
b3sum (blake3 w/ rayon) | --- | 0.017s
b3sum (blake3 w/ rayon) | --no-mmap | 0.018s
b3sum (blake3 w/ rayon) | --no-mmap --num-threads=1 | 0.02s
rust reference | release | 0.026s
rust reference | debug | 0.85s
rust reference | debug opt-level = 1 | 0.03s
rust reference | debug opt-level = 2 | 0.027s
rust reference | debug opt-level = 3 | 0.026s
The rust reference sum wrapper code:
```rust
use std::env::args_os;
use std::fmt::Write;
use std::fs::File;
use std::io::{self, Read};

use crate::reference::Hasher;

mod reference;

// Stream the reader into the hasher in 64 KiB chunks, returning the total bytes read.
fn copy_wide(mut reader: impl Read, hasher: &mut Hasher) -> io::Result<u64> {
    let mut buffer = [0; 65536];
    let mut total = 0;
    loop {
        match reader.read(&mut buffer) {
            Ok(0) => return Ok(total),
            Ok(n) => {
                hasher.update(&buffer[..n]);
                total += n as u64;
            }
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    let input = args_os().nth(1).expect("at least 1 argument of file path");
    let file = File::open(&input).expect("unable to open up input file path");

    let mut hasher = Hasher::new();
    let _copied = copy_wide(file, &mut hasher).expect("io error while copying file to hasher");

    // Finalize to a 32-byte digest and hex-encode it, mirroring b3sum's output format.
    let mut output = [0u8; 32];
    hasher.finalize(&mut output);
    let mut s = String::with_capacity(2 * output.len());
    for byte in output {
        write!(s, "{:02x}", byte).expect("can't write hex byte to hex buffer");
    }
    println!("{} {}", s, input.to_string_lossy());
}
```
As a sanity check against b3sum to make sure the output was the same, here is the output for both. Side note: the release and debug builds of the reference take a very similar amount of compile time (both ~1s) since there are no dependencies. It may be worth enabling an opt-level for it in the proc macro with a profile override (if we can do that in a proc-macro?) to change the runtime from 0.85s -> 0.03s. I'm not sure if the overrides work on the proc macro or only on the root crate being built.
```
PS C:\Users\chip\Documents\b3sum-reference> .\target\release\b3sum-reference.exe ..\..\Downloads\vendors_init.js
668031821b2ae54e9856e2b09fbc3403d5052567904fb76f47c9e2e42370bb18 ..\..\Downloads\vendors_init.js
PS C:\Users\chip\Documents\b3sum-reference> C:\Users\chip\.cargo\bin\b3sum.exe --no-mmap --num-threads=1 ..\..\Downloads\vendors_init.js
668031821b2ae54e9856e2b09fbc3403d5052567904fb76f47c9e2e42370bb18 ../../Downloads/vendors_init.js
```
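On the profile-override question above: the Cargo book documents `build-override` as applying to build scripts, proc macros, and their dependencies, so if that covers this case, a sketch of the override in the app's root Cargo.toml could look like:

```toml
# Optimize code that runs at compile time (build scripts and proc macros)
# without optimizing the app's own debug build.
[profile.dev.build-override]
opt-level = 2
```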
Summary: if disabling `rayon` doesn't give us enough compile-time gains on `blake3`, using the reference implementation is almost instant to compile, and only 50% slower (of a very fast runtime).
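For reference, the multithreading in the `blake3` crate is opt-in via its `rayon` feature; with that feature off, asset hashing is just the plain single-threaded API. A minimal sketch:

```rust
// Single-threaded blake3 hash of an asset, hex-encoded like b3sum's output.
// (With the `rayon` feature enabled, Hasher::update_rayon is the parallel variant.)
fn asset_hash_hex(bytes: &[u8]) -> String {
    blake3::hash(bytes).to_hex().to_string()
}
```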
zstd alternatives
I started off building `zstd` in that same virtual machine (2 CPUs). A clean build took only 13s, which seems really low compared to the timings in this issue, given that's half the time the `blake3` build took with rayon. The timings report thinks it takes longer than blake3; perhaps this is another issue that's difficult to see on all computers.

I did get a warning while compiling the `zstd-sys` crate (`warning: cl : Command line warning D9002 : ignoring unknown option '-fvisibility=hidden'`), but `cargo test` passes, so I don't believe it has an effect.
brotli
I first checked out brotli because that is actually what I used when first adding compression a long time ago. A clean build of the `brotli` crate took 7s on the VM with `ffi-api` disabled (compared to 12s with it). Dropbox's implementation of brotli (the `brotli` crate) includes a few things over the base brotli project, most notably multithreading, which brings it back into the performance ballpark of zstd.

This is looking promising, so I did some comparisons using the JS vendor file from https://app.element.io (6.7MB). These timings were taken on the same 2 CPU VM. Note that the `brotli` command used was the binary provided by the rust crate, and that brotli's default profile is the same as best.
algorithm | profile | time | size
---|---|---|---
none | none | 0s | 6.7MB
zstd | best | 4s | 1.42MB
brotli | best | 12.1s | 1.35MB
brotli | 10 | 6.8s | 1.38MB
--- | --- | --- | ---
zstd | default | 0.07s | 1.81MB
brotli | 2 | 0.08s | 1.83MB
--- | --- | --- | ---
zstd | 14 | 0.8s | 1.54MB
brotli | 9 | 0.8s | 1.51MB
I actually really like brotli(9) here, since it's still sub-second compression (and brotli has good decompression speed), along with a slightly lower file size than the compression-time-equivalent zstd profile. I think using brotli(2) for debug builds and brotli(9) for release builds is a good balance. We can always add some kind of hurt-me-plenty option that uses best to try and crank out the last kilobytes of the assets, at the cost of runtime (i.e. compile-time, in the codegen) performance, for those who really want it.
brotli would be my choice hands down as the replacement. Here's why:

- The Rust implementation is well-maintained (by Dropbox)
- It includes some optional extras over base brotli, including multi-threading
- Compatible licensing (I think... BSD 3-clause)
- Faster to compile than zstd, along with really good sub-second compression options
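A minimal sketch of what compressing an asset with the `brotli` crate could look like (illustrative only, not Tauri's actual codegen; the function name is made up):

```rust
use std::io::Write;

// quality 2 ≈ the fast debug-build profile and quality 9 ≈ the release profile
// benchmarked above; 4096 is the internal buffer size and 22 a common lg_window_size.
fn compress_asset(bytes: &[u8], quality: u32) -> std::io::Result<Vec<u8>> {
    let mut out = Vec::new();
    {
        let mut writer = brotli::CompressorWriter::new(&mut out, 4096, quality, 22);
        writer.write_all(bytes)?;
    } // dropping the writer finishes and flushes the brotli stream
    Ok(out)
}
```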
miniz_oxide
A Rust implementation of DEFLATE. It compiles clean in ~2.8s. I didn't look into it further because the compression ratio and decompression performance (32k window size) are not ideal. This doesn't really knock it out as a contender; I just prefer the brotli solution first.
Wow, thanks for digging this far. Out of interest, what's the underlying CPU used by the VM? Considering my CPU is a mobile Kaby Lake architecture, that could explain the vast differences. Could you try doing a clean build of my project to see if you get similar timings on your machine?
Host OS: Windows 10 Pro. I used VirtualBox, with the processor set to 2 cores (out of 16, although only 8 are selectable). Looking at it again, I had also allocated 8GiB of RAM, which I should probably knock down to 4 the next time I test with it. The timings were just done with PowerShell.

Next time I do tests I can try your project too.

A side note that I forgot about until this morning: another reason why I prefer brotli, and why I originally used it in my proof of concept a year ago, is that browsers accept brotli as a standard Content-Encoding.
I also have an old ThinkPad T410, which has an Intel Core i5-520M 2.4GHz CPU (2 cores, 4 threads), that I can do some of these tests on. I do have the storage upgraded to an SSD, and I believe I also upgraded the RAM to 8GiB. The CPU is still stock though, so it should be a good test of a CPU-constrained build. I will probably stick to Linux when testing on it, but maybe if I want to feel the pain I can try to get a Windows install on there too.
I'm glad I'm not just going insane! That's some impressive performance-problem recreation there, chippers. And it just goes to show how a little insight into how tools operate can be a massive boon 😄

As for improving zstd, given it's the sole crate that is building C code: I wonder if, as a stopgap solution, it's possible to use zstd without a sys crate, or at least with a rust-native implementation of the sys crate. Obviously the route forward is to use brotli, but I imagine that would take some refactoring. So if there was a way to at least lessen the impact of zstd with minimal effort, it's probably worth putting in place in the meantime.
It's easier to switch to brotli, as changing zstd implementations would mean there needs to be a good-quality rust-only crate for it, which I don't believe exists. As is, it's rather encapsulated, because it decompresses before handing it off to the asset handler. We can leave handing it over still compressed until later as a performance gain. I have a branch on that ThinkPad that replaces zstd with brotli.
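A sketch of the decompress-before-handing-off step described above, using the `brotli` crate's reader-based decompressor (the function name is hypothetical):

```rust
use std::io::Read;

// Decompress an embedded asset fully in memory before passing it to the asset handler.
fn decompress_asset(compressed: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut out = Vec::new();
    brotli::Decompressor::new(compressed, 4096).read_to_end(&mut out)?;
    Ok(out)
}
```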
Ok, so an update on the situation: Chip is a literal god, pulling a full 20% performance increase with #3773 🎉

There ended up being literally no crates that take longer than 20 seconds to compile, so I had to reduce the min unit time slider to 15 seconds. Looks like Serde, Tao, and tauri-utils are now the biggest hitters in the compile time. For those interested, here's a gist containing the full cargo timings report html if you want to dig into it yourself. Oh, and here's a link to my comment on Chip's PR.

As a reminder, I did this using the dev configuration here, disabling all optimisations, and running `cargo clean && cargo +nightly build --timings`.
Describe the bug
At times, I've gotten build times as long as 10 minutes. I've even seen a 20-minute build, but I've lost the cargo output to my bash terminal buffer.
To investigate, I set up some profile settings in my project to disable LTO and optimisation (in order to get the best-case scenario), and then did a clean build using `cargo clean && cargo +nightly build --timings`. The report shows that the three biggest hindrances are `zstd-sys`, `blake3`, and `bzip2-sys`. Using `cargo tree -i`, I found the following:

As you can see, they all impact the compile time of Tauri, and this is the best-case scenario (no overhead of optimisation or LTO to increase the compile time).
I know there's nothing Tauri can do about the compile time of the libraries, but are there any lightweight alternatives that could replace these behemoths?
I only have an Intel Core i7-7500U, so use that to put the performance in perspective: only 4 hardware threads, a maximum concurrency of 7, and 321 total compilation units. Even so, the fact that `zstd-sys` took half of the total build time is insane and is something worth at least looking further into. With those 3 crates hogging 3 CPU threads, I only had 1 thread remaining for everything else. And quad core isn't exactly a niche setup.
Reproduction
No response
Expected behavior
No response
Platform and versions
Stack trace
No response
Additional context
No response