Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cargo test performance regression on Windows on version 1.75.0 #119560

Open
Jefffrey opened this issue Jan 3, 2024 · 9 comments
Open

cargo test performance regression on Windows on version 1.75.0 #119560

Jefffrey opened this issue Jan 3, 2024 · 9 comments
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. O-windows Operating system: Windows regression-untriaged Untriaged performance or correctness regression. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@Jefffrey
Copy link
Contributor

Jefffrey commented Jan 3, 2024

Code

DataFusion issue with complete details: apache/datafusion#8696

arrow-datafusion runs cargo test on Windows runner as part of CI. When on Rust version 1.74.1 (and below), the check takes under 30 minutes. After upgrading to Rust version 1.75.0, it now takes over 3 hours, with no other change in code on our side. This seems to only take effect on Windows, as Linux/Mac tests didn't seem to be affected.

After debugging, I found the regression occurs between toolchains nightly-2023-10-29 (rust e5cfc5547) and nightly-2023-10-30 (rust 608e9682f).

We are running on GitHub actions runner windows-latest.

So expected run is here, on toolchain nightly-2023-10-29: https://github.com/apache/arrow-datafusion/actions/runs/7394674719/job/20116418078

When I bump to toolchain nightly-2023-10-30, with no other code changes: https://github.com/apache/arrow-datafusion/actions/runs/7394848426/job/20116910586

Slow tests

The slowness occurs primarily in two tests.

tpcds_planning

Test code: https://github.com/apache/arrow-datafusion/blob/1179a76567892b259c88f08243ee01f05c4c3d5c/datafusion/core/tests/tpcds_planning.rs

On the good run (before regression):

On the bad run (after regression):

sqllogictest

Test code: https://github.com/apache/arrow-datafusion/tree/1179a76567892b259c88f08243ee01f05c4c3d5c/datafusion/sqllogictest

On the good run (before regression):

On the bad run (after regression):

These two tests are the only ones with a significant delta, the rest don't seem affected by the upgrade.

Version it worked on

Ran fast with Rust 1.74.1 (and nightly-2023-10-29)

Version with regression

Ran slow on Rust 1.75.0 (and nightly-2023-10-30)

Additional context

Apologies if the example is too large to easily determine where the issue is. I'll try to reduce this to a smaller MRE, as I don't have a Windows machine to locally test on, so have had to check via CI.

@Jefffrey Jefffrey added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Jan 3, 2024
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Jan 3, 2024
@comphead
Copy link

comphead commented Jan 3, 2024

Maybe this also helps a bit to investigate, if we run datafusion windows CI with

RUSTFLAGS: "-C debuginfo=0 -C opt-level=1 -C target-feature=+crt-static -C incremental=false -C codegen-units=256"

it helps to reduce timing from 190m to 60m, but it still way longer than before

@ChrisDenton
Copy link
Contributor

It might help to look at the raw log and narrow down which steps specifically are taking a long time.

So the issue is maybe around the generation of debuginfo?

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 4, 2024

Just a quick update, narrowed down the testing:

Limit to running only tpcds_physical_q44 with backtrace feature:

good (2023-10-29 nightly) = 70 seconds

bad (2023-10-30 nightly) = 692 seconds

Without backtrace:

  • cargo test -p datafusion --test tpcds_planning tpcds_physical_q44

good (2023-10-29 nightly) = 1.25 seconds

bad (2023-10-30 nightly) = 1.26 seconds

Edit:

Also just to be clear, we are running with env vars:

  • RUSTFLAGS: "-C debuginfo=0"
  • RUST_BACKTRACE: "1"

For both good & bad runs.

Maybe this also helps a bit to investigate, if we run datafusion windows CI with

RUSTFLAGS: "-C debuginfo=0 -C opt-level=1 -C target-feature=+crt-static -C incremental=false -C codegen-units=256"

it helps to reduce timing from 190m to 60m, but it still way longer than before

The speedup seems to occur due to adding -C opt-level=1 -C target-feature=+crt-static -C incremental=false -C codegen-units=256 flags as the debuginfo flag was already set previously

@saethlin
Copy link
Member

saethlin commented Jan 4, 2024

#117089 was merged in the right window, and the PR description notes that it contains rust-lang/backtrace-rs#569 which sounds relevant.

@saethlin saethlin added I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue. O-windows Operating system: Windows and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Jan 4, 2024
@apiraino
Copy link
Contributor

apiraino commented Jan 4, 2024

cc @wesleywiser who authored #117089 and changes to the backtrace-rs submodule (IIUC)

@apiraino apiraino removed the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Jan 4, 2024
@wesleywiser
Copy link
Member

wesleywiser commented Jan 6, 2024

I'm able to repro locally:

> $env:RUSTFLAGS="-Cdebuginfo=0"
>  cargo +nightly-2023-10-29 test -p datafusion --test tpcds_planning tpcds_physical_q44 --features backtrace
...
test tpcds_physical_q44 ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 197 filtered out; finished in 36.23s

> cargo +nightly-2023-10-30 test -p datafusion --test tpcds_planning tpcds_physical_q44 --features backtrace
...
running 1 test
test tpcds_physical_q44 has been running for over 60 seconds

Looking at profiling data, the vast majority of time in the regressed case is spent in dbghelp!SymAddrIncludeInlineTrace which is a necessary call I introduced in rust-lang/backtrace-rs#569 when symbolizing the backtrace. I'm investigating if there is some other way to avoid calling that function but haven't had any luck.

One thing that's very interesting is that if do not set RUSTFLAGS="-Cdebuginfo=0", the regression disappears. This suggests to me we're hitting a pathological issue in dbghelp when there isn't any debug info present. On Windows, debug info is basically required to get any kind of meaningful backtrace so setting both -Cdebuginfo=0 and --feature backtrace in the test case doesn't make much sense to me.

@Jefffrey do you know why -Cdebuginfo=0 is being set? My assumption is that the intent is to make builds faster while still being able to test --feature backtrace. If that's the case, I would suggest actually setting -Cdebuginfo=line-tables-only which generates the minimal amount of information to give proper backtraces and doesn't have much impact on compile times (especially in comparison to -Cdebuginfo=2 which is the default in this case):

> $env:RUSTFLAGS="-Cdebuginfo=0"
>  cargo +nightly-2023-10-30 test -p datafusion --test tpcds_planning tpcds_physical_q44 --features backtrace
...
    Finished test [unoptimized + debuginfo] target(s) in 1m 25s

> $env:RUSTFLAGS="-Cdebuginfo=line-tables-only"
> cargo +nightly-2023-10-30 test -p datafusion --test tpcds_planning tpcds_physical_q44 --features backtrace
...
    Finished test [unoptimized + debuginfo] target(s) in 1m 32s

> $env:RUSTFLAGS="-Cdebuginfo=2"
> cargo +nightly-2023-10-30 test -p datafusion --test tpcds_planning tpcds_physical_q44 --features backtrace
...
    Finished test [unoptimized + debuginfo] target(s) in 2m 26s

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 6, 2024

Thank you for the detailed breakdown! @wesleywiser

@Jefffrey do you know why -Cdebuginfo=0 is being set? My assumption is that the intent is to make builds faster while still being able to test --feature backtrace. If that's the case, I would suggest actually setting -Cdebuginfo=line-tables-only which generates the minimal amount of information to give proper backtraces and doesn't have much impact on compile times (especially in comparison to -Cdebuginfo=2 which is the default in this case):

The reasoning you provide makes sense, I'm not sure why both debuginfo=0 was set along with backtrace (probably had debuginfo=0 set first then backtrace feature was added as a separate feature so needed to be added to test).

  • There was a comment stating do not produce debug symbols to keep memory usage down for the debuginfo=0 setting fwiw, so I guess that's the main reason

I'll experiment with setting debuginfo=line-tables-only and will see if that helps speed things back up 👍

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 6, 2024

Thanks for the config fix suggested @wesleywiser

This does eliminate the performance regression (or at least bring the times within a more reasonable range of each other).

Even though it was caused by some weird settings on our end (disabling debuginfo but still using backtrace, on Windows), it was still a significant regression in performance. Let me know if this is the expected behaviour now and I can close this issue, or to keep this open if you're planning to investigate this weird edge case further 👍 (or move it to backtrace repo?)

@ChrisDenton
Copy link
Contributor

This was discussed in the libs meeting. While this is an edge case and a likely bug in dbghlp, it was suggested that maybe backtrace could workaround it by somehow detecting if there is no debug info present? Even if it can only tell after trying, it could perhaps cache this knowledge so it doesn't try again the next time?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. O-windows Operating system: Windows regression-untriaged Untriaged performance or correctness regression. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants