
[bug] Inflated build time due to heavy dependencies #3571

Closed
JamesYeoman opened this issue Feb 28, 2022 · 18 comments

Comments

@JamesYeoman

Describe the bug

At times, I've gotten build times as long as 10 minutes. I've even had a 20-minute build, but I've lost that cargo output to my bash terminal's scrollback buffer.

To investigate, I set up some profile settings in my project to disable LTO and optimisation (to get the best-case scenario), and then did a clean build using cargo clean && cargo +nightly build --timings.

[screenshot: cargo build --timings report]

The report shows that the three biggest hindrances are zstd-sys, blake3, and bzip2-sys.

Using cargo tree -i, I found the following

$ cargo tree -i zstd-sys
zstd-sys v1.6.3+zstd.1.5.2
└── zstd-safe v4.1.4+zstd.1.5.2
    └── zstd v0.10.0+zstd.1.5.2
        ├── tauri-codegen v1.0.0-rc.2
        │   └── tauri-macros v1.0.0-rc.2 (proc-macro)
        │       └── tauri v1.0.0-rc.3
        │           └── dsrbmm v0.1.0
        ├── tauri-utils v1.0.0-rc.2
        │   ├── tauri-build v1.0.0-rc.3
        │   │   [build-dependencies]
        │   │   └── dsrbmm v0.1.0
        │   ├── tauri-codegen v1.0.0-rc.2 (*)
        │   └── tauri-macros v1.0.0-rc.2 (proc-macro) (*)
        └── tauri-utils v1.0.0-rc.2
            ├── tauri v1.0.0-rc.3 (*)
            ├── tauri-runtime v0.3.2
            │   ├── tauri v1.0.0-rc.3 (*)
            │   └── tauri-runtime-wry v0.3.2
            │       └── tauri v1.0.0-rc.3 (*)
            └── tauri-runtime-wry v0.3.2 (*)

$ cargo tree -i blake3
blake3 v1.3.1
└── tauri-codegen v1.0.0-rc.2
    └── tauri-macros v1.0.0-rc.2 (proc-macro)
        └── tauri v1.0.0-rc.3
            └── dsrbmm v0.1.0

$ cargo tree -i bzip2-sys
bzip2-sys v0.1.11+1.0.8
└── bzip2 v0.4.3
    └── zip v0.5.13
        └── tauri v1.0.0-rc.3
            └── dsrbmm v0.1.0

As you can see, they all impact the compile time of Tauri, and this is the best-case scenario (no overhead of optimising or LTO to increase the compile time).

I know there's nothing Tauri can do about the compile time of the libraries, but are there any lightweight alternatives that could replace these behemoths?

I only have an Intel Core i7-7500U, so use that to put the performance in perspective: 4 hardware threads (2 cores), a maximum concurrency of 7, and 321 total compilation units. Even so, the fact that zstd-sys took half of the total build time is insane and worth looking into further. With those 3 crates hogging 3 threads, I only had 1 thread remaining for everything else. And a 4-thread laptop CPU isn't exactly a niche setup.

Reproduction

No response

Expected behavior

No response

Platform and versions

Operating System - Windows, version 10.0.19043 X64
Webview2 - 98.0.1108.56
Visual Studio Build Tools:
   - Visual Studio Build Tools 2019
WARNING: no lock files found, defaulting to npm

Node.js environment
  Node.js - 16.10.0
  @tauri-apps/cli - 1.0.0-rc.5
  @tauri-apps/api - Not installed

Global packages
  npm - 8.4.1
  pnpm - 6.29.1
  yarn - 3.1.1

Rust environment
  rustup - 1.24.3
  rustc - 1.59.0
  cargo - 1.59.0
  toolchain - stable-x86_64-pc-windows-msvc

App directory structure
/icons
/src
/target
/WixTools

App
  tauri - 1.0.0-rc.3
  tauri-build - 1.0.0-rc.3
  tao - 0.6.2
  wry - 0.13.3
  build-type - bundle
  CSP - default-src blob: data: filesystem: ws: wss: http: https: tauri: 'unsafe-eval' 'unsafe-inline' 'self' img-src: 'self'
  distDir - ../frontend/build/dist
  devPath - http://localhost:3000/
  framework - React

Stack trace

No response

Additional context

No response

@nothingismagick
Member

Are you using anything like this in your Cargo.toml:

[profile.release]
panic = "abort" # Strip expensive panic clean-up logic
codegen-units = 1 # Compile crates one after another so the compiler can optimize better
lto = true # Enables link-time optimizations
opt-level = "s" # Optimize for binary size

codegen-units will DEFINITELY slow you down

@FabianLars
Member

FabianLars commented Feb 28, 2022

I think we can get rid of bzip2 if we disable the default features of the zip crate.
Edit: This ^ would also remove 1 of our last 2 usages of time@v1 🥳
The updater doesn't use bzip2, but I'm not sure about the extract APIs right now, so maybe we need to enable it conditionally.

The situation looks worse for the other 2 crates though (at least at first glance).
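To be concrete about the zip part, something like this in tauri's Cargo.toml is what I have in mind (untested sketch, and assuming zip 0.5's default features are deflate, bzip2, and time):

# Untested sketch: keep deflate for regular archives, drop the bzip2/bzip2-sys chain.
zip = { version = "0.5", default-features = false, features = ["deflate"] }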

@lucasfernog
Member

I think we can get rid of bzip2 if we disable the default features of the zip crate. The updater doesn't use bzip2, but I'm not sure about the extract APIs right now, so maybe we need to enable it conditionally.

Also, I pushed a PR that disables zip entirely if you're not using the updater. We can extend that to disable bzip and add a feature for it.

@lucasfernog
Member

I think we can easily replace blake3, it's only used to hash the assets. Just need to find a good replacement.
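For context, the codegen-side usage boils down to producing a stable digest of each asset's bytes; here's a simplified sketch (not the actual tauri-codegen code) of what any replacement has to provide:

// Simplified sketch, not the actual tauri-codegen code: a replacement hasher
// just needs to give a stable digest of the asset bytes so unchanged files
// can be detected on dirty builds.
fn asset_digest(bytes: &[u8]) -> String {
    // swap this call for whichever hash we end up choosing
    blake3::hash(bytes).to_hex().to_string()
}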

@JamesYeoman
Author

Are you using anything like this in your Cargo.toml:

[profile.release]
panic = "abort" # Strip expensive panic clean-up logic
codegen-units = 1 # Compile crates one after another so the compiler can optimize better
lto = true # Enables link-time optimizations
opt-level = "s" # Optimize for binary size

codegen-units will DEFINITELY slow you down

For the release profile, yes. For the debug profile, I have the following

[profile.dev]
incremental = true
strip = false
lto = false
opt-level = 0

[profile.dev.package."*"]
opt-level = 0
incremental = true

[profile.dev.build-override]
opt-level = 0
incremental = true

@JamesYeoman
Author

JamesYeoman commented Feb 28, 2022

I think we can easily replace blake3, it's only used to hash the assets. Just need to find a good replacement.

What hash algorithm is used for the assets? Because there's md5-tools, which also supports the SHA family of hashing algos, or there's sha256, both of which look to be lightweight enough.

@FabianLars
Member

FabianLars commented Feb 28, 2022

What hash algorithm is used for the assets?

blake3 is not only the name of the crate but also of the algorithm (afaik)

@JonasKruckenberg
Contributor

blake3 is not only the name of the crate but also of the algorithm (afaik)

https://en.m.wikipedia.org/wiki/BLAKE_(hash_function)

@JamesYeoman
Author

Hmm... is the high performance of Blake3 needed? Because that's the only benefit of using Blake3 that I can find, unless it was a decision from the security audit for a specific reason?

@lucasfernog
Member

hey @chippers can you give us your thoughts here since you introduced blake3 and zstd to the asset system?

@chippers
Member

chippers commented Mar 1, 2022

At the time of #1430 I used the cargo timings to test compilation times in general. When I was running it, the linking time absolutely swamped everything else. On computers that are more CPU constrained than IO constrained (like this issue's example), these slow-to-build crates can become a larger issue.

Over time, it seems there have also been issues with longer compile times in some of these projects, specifically zstd, which seems to get slower to compile with each release while speeding up its runtime performance. Since zstd-sys builds the actual zstd C code, it keeps getting slower to compile over time. I don't recall blake3 taking up nearly as much time as in this timing output, so it may be that some CPU platforms are slower to build than others, because it adds in CPU-specific code for performance reasons.

Overall, I was not really concerned with dependency compile times because I was much more focused on dirty builds (where the dependencies have already been compiled), since that is the typical developer workflow loop. Additionally, I was not aware of how severely being CPU constrained (crate compilation) instead of IO constrained (linking) could affect build times.

Along with the above reasoning, I chose zstd because:

  1. good rust bindings
  2. zstd has very good compression ratio and decompression speed/memory usage
  3. compatible licensing

I ended up adding blake3 in #1430 to prevent additional work from being performed if an asset file was unchanged during a dirty build. This was much more important when compression was not an optional feature, as it prevented not only IO work but also CPU work. Nowadays, if compression is disabled it only prevents IO work. This brought good wins for large asset files without adding extra dirty-build work. That said, I don't recall it having such an effect on clean build times, but perhaps that is because I was not CPU constrained. I chose blake3 because:

  1. good rust bindings
    • the rust crate itself is the main project, although they also offer a C implementation
  2. compatible licensing
  3. great performance, especially great debug performance
    • proc-macros and their dependencies are compiled with the same cargo profile as the main crate, so debug builds use a debug-compiled blake3

The runtime performance was important because it effectively runs at compile time, due to its use in tauri-codegen. Hashing the file had to be faster than writing the contents out to disk, even with a debug profile.

tl;dr - I mostly focused on dirty-build compile times (the dependency is already compiled) and otherwise focused on runtime performance, which zstd and blake3 bring plenty of.

There are multiple ways to go for solutions; I'll start with blake3.

blake3 alternatives

blake3 actually comes with a Rust reference implementation. It is a single file with <400 LoC and no generics (from a quick glance), which usually translates to very fast compile times. I tested a release build on a Windows 10 VM with 2 CPUs and it took 1s flat for a clean build with no caches. Downside: it is not published as a crate. The file changes very little (the logic changes even less), so vendoring it should be fine as long as we keep a bit of an eye on the original reference. We may also be able to convince the blake3 project to publish it as a crate. It lives in a repo that is dual-licensed Apache-2.0/CC0 1.0 Universal, but the file itself doesn't specify a license, so we may want to make sure it is covered by the same terms.

I didn't really look into anything else because this seemed good enough. I will add, though, that if the reference implementation's slower hashing no longer pays off for asset inclusion on HDDs (thinking of a JS project with megabytes of dependencies), we may also want to only enable it when compression is enabled, since more work is saved there.

The 1s compilation time for the reference can be compared to 13s for the main crate on the same machine, and 32.37s for the main crate with the rayon feature enabled. Perhaps just disabling rayon support will also bring a big-enough compilation time win with minimal performance impact.
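If tauri-codegen is what opts into the rayon feature, the change is a one-liner in its Cargo.toml, since rayon is not a default feature of blake3 (untested sketch):

# Untested sketch: rayon is an optional, non-default feature of blake3,
# so simply not enabling it is enough.
blake3 = "1"
# instead of: blake3 = { version = "1", features = ["rayon"] }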

As for runtime performance (6.7MB JavaScript file)...

method                     note                         time
b3sum (blake3 w/ rayon)                                 0.017s
b3sum (blake3 w/ rayon)    --no-mmap                    0.018s
b3sum (blake3 w/ rayon)    --no-mmap --num-threads=1    0.02s
rust reference             release                      0.026s
rust reference             debug                        0.85s
rust reference             debug, opt-level = 1         0.03s
rust reference             debug, opt-level = 2         0.027s
rust reference             debug, opt-level = 3         0.026s
Click me to see the rust reference sum wrapper code
use crate::reference::Hasher;
use std::env::args_os;
use std::fmt::Write;
use std::fs::File;
use std::io::{self, Read};

mod reference;

fn copy_wide(mut reader: impl Read, hasher: &mut reference::Hasher) -> io::Result<u64> {
    let mut buffer = [0; 65536];
    let mut total = 0;
    loop {
        match reader.read(&mut buffer) {
            Ok(0) => return Ok(total),
            Ok(n) => {
                hasher.update(&buffer[..n]);
                total += n as u64;
            }
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    let input = args_os().nth(1).expect("at least 1 argument of file path");
    let file = File::open(&input).expect("unable to open up input file path");
    let mut hasher = Hasher::new();
    let _copied = copy_wide(file, &mut hasher).expect("io error while copying file to hasher");
    let mut output = [0u8; 32];
    hasher.finalize(&mut output);
    let mut s = String::with_capacity(2 * output.len());
    for byte in output {
        write!(s, "{:02x}", byte).expect("cant write hex byte to hex buffer");
    }
    println!("{} {}", s, input.to_string_lossy());
}

As a sanity check against b3sum to make sure the output was the same, here is the output for both. Side note: the release and debug builds of the reference take a very similar amount of compile time (both ~1s) because it has no dependencies. It may be worth enabling an opt-level for it in the proc macro with a profile override (if we can do that for the proc-macro?) to change the runtime from 0.8s -> 0.03s. Not sure if the overrides apply to the proc macro or only to the root crate being built; there's a sketch of what I mean below the output.

PS C:\Users\chip\Documents\b3sum-reference> .\target\release\b3sum-reference.exe ..\..\Downloads\vendors_init.js
668031821b2ae54e9856e2b09fbc3403d5052567904fb76f47c9e2e42370bb18 ..\..\Downloads\vendors_init.js
PS C:\Users\chip\Documents\b3sum-reference> C:\Users\chip\.cargo\bin\b3sum.exe --no-mmap --num-threads=1 ..\..\Downloads\vendors_init.js
668031821b2ae54e9856e2b09fbc3403d5052567904fb76f47c9e2e42370bb18  ../../Downloads/vendors_init.js
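On the override question: Cargo's build-override table is documented to apply to build scripts, proc macros, and their dependencies, so if that holds here, something like this in the app's Cargo.toml might be enough (untested sketch):

# Untested sketch: raises opt-level only for build scripts, proc-macros and
# their dependencies (which should cover tauri-macros/tauri-codegen and a
# vendored hasher) without slowing down the rest of the debug build.
[profile.dev.build-override]
opt-level = 3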

Summary: if disabling rayon doesn't give us enough compile-time gains on blake3, using the reference implementation is almost instant to compile and only ~50% slower at runtime (of an already very fast runtime).

zstd alternatives

I started off building zstd in that same virtual machine (2 CPUs). A clean build took only 13s, which seems really low compared to the timings in this issue: that's half of what the blake3 build took with rayon, while the timing report here shows zstd taking longer than blake3. Perhaps this is another thing that's difficult to see across all computers.

I did get a warning while compiling the zstd-sys crate (warning: cl : Command line warning D9002 : ignoring unknown option '-fvisibility=hidden'), but cargo test passes so I don't believe it has an effect.

brotli

I first checked out brotli because that is actually what I used when first adding compression a long time ago. A clean build of brotli took 7s on the VM when ffi-api was disabled (compared to 12s). Dropbox's implementation of brotli (the brotli crate) includes a few things over the base brotli project, most notably multithreading, which brings it back into the performance ballpark of zstd.

This is looking promising, so I did some comparisons using the JS vendor file from https://app.element.io (6.7MB). These timings were taken on the same 2-CPU VM. The brotli command used was the binary provided by the Rust crate, and note that brotli's default profile is the same as best.

algorithm    profile    time     size
none         none       0s       6.7MB
zstd         best       4s       1.42MB
brotli       best       12.1s    1.35MB
brotli       10         6.8s     1.38MB
zstd         default    0.07s    1.81MB
brotli       2          0.08s    1.83MB
zstd         14         0.8s     1.54MB
brotli       9          0.8s     1.51MB

I actually really like brotli(9) here, since it's still sub-second compression (and brotli has good decompression), along with a slightly smaller file size than the zstd profile with equivalent compression time. I think using brotli(2) for debug builds and brotli(9) for release builds is a good balance. We can always add a hurt-me-plenty option that uses best to squeeze out the last kilobytes of the assets, at the cost of runtime (during compile time in the codegen) performance, for those that really want it.
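Roughly, the codegen side would just pick a quality per build profile and stream the asset through the compressor. Here's an illustrative sketch using the brotli crate's CompressorWriter (not the tauri-codegen implementation; buffer size and window are just reasonable defaults):

use std::io::Write;

// Illustrative sketch only. CompressorWriter::new takes
// (writer, buffer_size, quality, lg_window_size).
fn compress_asset(bytes: &[u8], release: bool) -> std::io::Result<Vec<u8>> {
    let quality = if release { 9 } else { 2 }; // brotli(9) for release, brotli(2) for debug
    let mut out = Vec::new();
    {
        let mut writer = brotli::CompressorWriter::new(&mut out, 4096, quality, 22);
        writer.write_all(bytes)?;
    } // dropping the writer flushes the remaining brotli stream
    Ok(out)
}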

brotli would be my choice hands down as the replacement. Here's why:

  1. The Rust implementation is well-maintained (by Dropbox)
  2. It includes some optional extras over base brotli, including multi-threading
  3. Compatible licensing (I think... BSD 3-clause)
  4. Faster to compile than zstd, along with really good sub-second compression options.

miniz_oxide

A Rust implementation of DEFLATE. It compiles clean in ~2.8s. I didn't look into it further because the compression ratio and decompression performance (32k window size) are not ideal. This doesn't really knock it out as a contender; I just prefer the brotli solution first.

@JamesYeoman
Author

Wow, thanks for digging this far. Out of interest, what's the underlying CPU used by the VM? Considering my cpu is a mobile Kaby Lake architecture, that could explain the vast differences. Could you try doing a clean build of my project to see if you get similar timings on your machine?

@chippers
Member

chippers commented Mar 1, 2022

Wow, thanks for digging this far. Out of interest, what's the underlying CPU used by the VM? Considering my cpu is a mobile Kaby Lake architecture, that could explain the vast differences. Could you try doing a clean build of my project to see if you get similar timings on your machine?

Host OS: Windows 10 Pro
Host CPU: AMD Ryzen 7 2700X 8 cores (16 logical)

I used VirtualBox, with the processor set to 2 cores (out of 16, although only 8 are selectable). Looking at it again, I had also allocated 8GiB of RAM, which I should probably knock down to 4 the next time I test with it.
Guest OS: Windows 10 Home

The timings were just done with PowerShell, using commands like Measure-Command {brotli --quality=9 vendor-init.js -o vendor-init.js.br}, typically run twice to reduce the impact of caching on commands that would be run multiple times.

Next time I do tests I can try your project too, although I will probably focus on the api example mostly. I plan to work on this sometime today or perhaps slightly later this week.


A side note that I forgot about until this morning: another reason I prefer brotli, and why I originally used it in my proof of concept a year ago, is that browsers accept br in their Accept-Encoding headers. This means that when we detect br being accepted by the browser's headers, we can pass the compressed brotli file directly to the browser without decompressing it in Tauri. I think the switch from gzip to zstd happened because max compression for zstd was 3x faster than brotli, for only a ~5% size difference in the worst cases. Relooking at this and evaluating other brotli levels (like -9), I think they give better tradeoffs in the end. This still leaves us room to add a "really hurt me with the compression time" option in the future that developers can enable when they are building their final public release build.
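To make the idea concrete, here's a hypothetical sketch of the decision (the real asset handler and header plumbing in tauri/wry look different; decompress stands in for whatever decoder is in use):

// Hypothetical sketch of the idea above, not the actual tauri/wry API.
fn asset_body(
    accept_encoding: &str,
    compressed: &[u8],
    decompress: impl Fn(&[u8]) -> Vec<u8>,
) -> (Vec<u8>, Option<&'static str>) {
    let browser_takes_br = accept_encoding
        .split(',')
        .any(|enc| enc.trim().starts_with("br"));
    if browser_takes_br {
        // Hand the brotli bytes straight through and set Content-Encoding: br.
        (compressed.to_vec(), Some("br"))
    } else {
        // Fall back to decompressing in-process before responding.
        (decompress(compressed), None)
    }
}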

@chippers
Member

chippers commented Mar 1, 2022

I also have an old ThinkPad T410 with an Intel Core i5-520M 2.4GHz CPU (2 cores, 4 threads) that I can do some of these tests on. The storage is upgraded to an SSD and I believe I upgraded the RAM to 8GiB, but the CPU is still stock, so it should be a good test of a CPU-constrained build. I will probably stick to Linux when testing on it, but maybe if I want to feel the pain I can try to get a Windows install on there too.

@JamesYeoman
Author

I'm glad I'm not just going insane! That's some impressive performance-problem recreation there, chippers. And it just goes to show how a little bit of insight into how tools operate can be a massive boon 😄

As for improving zstd while it's the sole crate building C code, I wonder if, as a stopgap, it's possible to use zstd without a sys crate, or at least with a Rust-native implementation of the sys crate. Obviously the route forward is to use brotli, but I imagine that would take some refactoring. So if there were a way to lessen the impact of zstd with minimal effort, it's probably worth putting in place in the meantime.

@chippers
Member

It's easier to switch to brotli, as changing zstd implementations would require a good-quality Rust-only crate for it, which I don't believe exists. As is, it's rather encapsulated because it decompresses before handing data off to the asset handler; we can leave handing it over still compressed as a later performance gain. I have a branch on that ThinkPad that replaces zstd, blake3, and ring, but I haven't uploaded it because I haven't set up my SSH keys or GPG key on it yet. I'll do that and PR it. I was hoping to maybe use the blake3 reference as a crate, but that can be done later once a decision on it is made.

@JamesYeoman
Author

Ok, so an update on the situation: Chip is a literal god, pulling a full 20% performance increase with #3773 🎉

There ended up being literally no crates that take longer than 20 seconds to compile, so I had to reduce the min unit time slider to 15 seconds.
[screenshot: updated cargo timings report]

Looks like Serde, Tao, and tauri-utils are now the biggest hitters in the compile time.

For those interested, here's a gist containing the full cargo timings report html if you decide you want to inflict pain upon yourself and explore the timings further.

Oh, and here's a link to my comment on Chip's PR.

As a reminder, I did this using the dev configuration here, disabling all optimisations, and running cargo clean before running cargo +nightly build --timings

@chippers
Member

chippers commented Mar 25, 2022

tauri-utils building twice is probably due to it being built inside a proc-macro crate as well as by non-proc-macro dependents. It's a little strange since the codegen features are a strict subset of the non-codegen ones, but I recall this being a difference in the new Cargo feature resolver, which allows different copies for build/proc-macro deps vs normal deps. Maybe we can find a way to share it again.
