Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use undef for (some) partially-uninit constants #94130

Merged
merged 5 commits into from Feb 25, 2022

Conversation

erikdesjardins
Copy link
Contributor

@erikdesjardins erikdesjardins commented Feb 18, 2022

There needs to be some limit to avoid perf regressions on large arrays
with undef in each element (see comment in the code).

Fixes: #84565
Original PR: #83698

Depends on LLVM 14: #93577

There needs to be some limit to avoid perf regressions on large arrays
with undef in each element (see comment in the code).
@rustbot rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Feb 18, 2022
@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 18, 2022
@lqd
Copy link
Member

lqd commented Feb 18, 2022

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 18, 2022
@bors
Copy link
Contributor

bors commented Feb 18, 2022

⌛ Trying commit b7e5597 with merge 796a3796f114911f2d4b98ea97c8ff28a6f37634...

@RalfJung
Copy link
Member

I know next to nothing about how LLVM represents consts and when or how that might cause slowdowns, and this doesn't even touch the CTFE engine, so I don't think I am a suitable reviewer for this.

@erikdesjardins
Copy link
Contributor Author

r? @nagisa

@rust-highfive rust-highfive assigned nagisa and unassigned RalfJung Feb 18, 2022
@nagisa
Copy link
Member

nagisa commented Feb 18, 2022

cc @nikic

@nikic
Copy link
Contributor

nikic commented Feb 18, 2022

I wonder if the right heuristic here isn't related to size, but rather whether the constant would otherwise be zeroinitializer? If it isn't, then the value needs to be made explicit anyway, and it shouldn't matter whether the explicit value is undef or some other value.

@bors
Copy link
Contributor

bors commented Feb 18, 2022

☀️ Try build successful - checks-actions
Build commit: 796a3796f114911f2d4b98ea97c8ff28a6f37634 (796a3796f114911f2d4b98ea97c8ff28a6f37634)

@rust-timer
Copy link
Collaborator

Queued 796a3796f114911f2d4b98ea97c8ff28a6f37634 with parent b8c56fa, future comparison URL.

@erikdesjardins
Copy link
Contributor Author

erikdesjardins commented Feb 18, 2022

I wonder if the right heuristic here isn't related to size, but rather whether the constant would otherwise be zeroinitializer

I think that makes some sense* -- but I do want to emit undef for some values that would otherwise be zeroinitializer.

e.g. a type like

struct Foo {
    init: bool,
    data: MaybeUninit<SomeLargeStruct>,
}
const FOO: Foo = Foo { init: false, data: MaybeUninit::uninit() };

could be zeroinitializer, but would benefit from being partially undef since only one byte is initialized.
(i.e. so memcpys of it could be shrunk to just one byte--doesn't look like LLVM currently does this, but it could be done)

Combining the two heuristics is an option of course (using undef for consts that are either nonzero or < size limit), although at that point I'm not sure it would be worth it.

Another option: limit the number of fields in the anonymous struct that we generate, i.e., the number of contiguous chunks of all-init or all-uninit bytes. I think this would be a decently close match to the actual cost of generating the const expression. It would also let us handle (potentially) profitable cases like my struct Foo example regardless of how large they are. (Hmm, I think I've convinced myself that this is better than size, at least.)

* I do think it's still not fully accurate though--it ought to be more expensive to generate { [1 x i8] c"1", [1 x i8] undef, [1 x i8] c"2", ... } than { [1000 x i8] c"..." }, since for the former we have to allocate 1000500 const strings vs. just 1 large one for the latter.

@erikdesjardins
Copy link
Contributor Author

(Hmm, I think I've convinced myself that this is better than size, at least.)

Switched to limiting the number of chunks. Let me know if you'd prefer something else.

@erikdesjardins erikdesjardins changed the title Use undef for partially-uninit constants up to 1024 bytes Use undef for (some) partially-uninit constants Feb 19, 2022
@rust-log-analyzer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (796a3796f114911f2d4b98ea97c8ff28a6f37634): comparison url.

Summary: This benchmark run did not return any relevant results. 5 results were found to be statistically significant but too small to be relevant.

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf -perf-regression

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 19, 2022
@nikic
Copy link
Contributor

nikic commented Feb 19, 2022

Yeah, using the number of chunks makes a lot of sense to me.

@nikic
Copy link
Contributor

nikic commented Feb 19, 2022

@bors r+

@RalfJung
Copy link
Member

@bors r=nikic

@bors
Copy link
Contributor

bors commented Feb 20, 2022

📌 Commit 067f628 has been approved by nikic

@rust-log-analyzer

This comment has been minimized.

@RalfJung
Copy link
Member

@bors r-

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Feb 20, 2022
@nikic
Copy link
Contributor

nikic commented Feb 20, 2022

@bors r+

@bors
Copy link
Contributor

bors commented Feb 20, 2022

📌 Commit 5bf8303 has been approved by nikic

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 20, 2022
@bors
Copy link
Contributor

bors commented Feb 24, 2022

⌛ Testing commit 5bf8303 with merge 3450b91e88c8e5343cffeead37603a63a8f904a9...

@bors
Copy link
Contributor

bors commented Feb 24, 2022

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Feb 24, 2022
@rust-log-analyzer
Copy link
Collaborator

The job aarch64-gnu failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)

@nikic
Copy link
Contributor

nikic commented Feb 24, 2022

@bors retry spurious network error

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 24, 2022
@bors
Copy link
Contributor

bors commented Feb 25, 2022

⌛ Testing commit 5bf8303 with merge ece55d4...

@bors
Copy link
Contributor

bors commented Feb 25, 2022

☀️ Test successful - checks-actions
Approved by: nikic
Pushing ece55d4 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Feb 25, 2022
@bors bors merged commit ece55d4 into rust-lang:master Feb 25, 2022
@rustbot rustbot added this to the 1.61.0 milestone Feb 25, 2022
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (ece55d4): comparison url.

Summary: This benchmark run shows 3 relevant improvements 🎉 but 6 relevant regressions 😿 to instruction counts.

  • Arithmetic mean of relevant regressions: 0.7%
  • Arithmetic mean of relevant improvements: -1.9%
  • Arithmetic mean of all relevant changes: -0.1%
  • Largest improvement in instruction counts: -2.9% on full builds of ctfe-stress-4 opt
  • Largest regression in instruction counts: 0.9% on incr-unchanged builds of externs opt

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression

@rustbot rustbot added the perf-regression Performance regression. label Feb 25, 2022
@erikdesjardins erikdesjardins deleted the partially branch February 25, 2022 19:14
@erikdesjardins
Copy link
Contributor Author

erikdesjardins commented Feb 25, 2022

Perf changes:

ctfe-stress-4 opt full improvement:

LLVM does less work. Judging from the involved functions, this is likely because the generated bitcode is much smaller, since we emit just undef for large parts of some constant(s) instead of a large string.

--------------------------------------------------------------------------------
Ir             
--------------------------------------------------------------------------------
-1,234,509,749  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir             file:function
--------------------------------------------------------------------------------
 -293,601,221  ???:(anonymous namespace)::ModuleBitcodeWriter::writeConstants(unsigned int, unsigned int, bool)
 -270,535,020  ???:void llvm::BitstreamWriter::EmitRecordWithAbbrevImpl<unsigned long>(unsigned int, llvm::ArrayRef<unsigned long>, llvm::StringRef, llvm::Optional<unsigned int>)
 -255,855,613  ???:llvm::BitstreamCursor::readRecord(unsigned int, llvm::SmallVectorImpl<unsigned long>&, llvm::StringRef*)
 -184,944,003  ???:llvm::SHA1::hashBlock()
 -117,440,526  ???:llvm::Type::getPrimitiveSizeInBits() const
  -87,031,635  ???:llvm::ConstantDataArray::getString(llvm::LLVMContext&, llvm::StringRef, bool)
  -66,059,028  ???:(anonymous namespace)::BitcodeReader::parseConstants()
   17,301,570  ???:llvm::raw_svector_ostream::write_impl(char const*, unsigned long)
   15,204,381  ???:llvm::raw_ostream::write(char const*, unsigned long)
  -15,006,677  ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_sse2_unaligned_erms
    9,831,661  ???:<rustc_const_eval::interpret::eval_context::InterpCx<rustc_const_eval::const_eval::machine::CompileTimeInterpreter>>::run
    9,437,220  ???:llvm::raw_ostream::flush_tied_then_write(char const*, unsigned long)
   -7,077,468  ???:llvm::SHA1::update(llvm::ArrayRef<unsigned char>)
    5,244,443  ???:llvm::MCAssembler::writeSectionData(llvm::raw_ostream&, llvm::MCSection const*, llvm::MCAsmLayout const&) const
    5,240,004  ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:memcpy@GLIBC_2.2.5
    2,753,552  ???:<rustc_middle::mir::interpret::allocation::InitMask>::find_bit
   -1,310,727  ???:<rustc_const_eval::const_eval::machine::CompileTimeInterpreter as rustc_const_eval::interpret::machine::Machine>::find_mir_or_eval_fn

ucd debug full regression:

LLVM does more work, likely because we generate constants with more fields (which is expected).

Functions like <&mut alloc::vec::Vec<ena::unify::VarValue<rustc_type_ir::TyVid>> as core::convert::AsRef<[ena::unify::VarValue<rustc_type_ir::TyVid>]>>::as_ref (this is impl<T> AsRef<T> for &mut T) should never show up in diffs since they're trivial (and Vec<T> as AsRef<[T]> is also pretty simple). I'll open a PR adding #[inline] to those to see if it causes improvements more generally.

Edit: opened #94372

--------------------------------------------------------------------------------
Ir          
--------------------------------------------------------------------------------
112,870,638  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir           file:function
--------------------------------------------------------------------------------
-19,111,572  ???:<&mut alloc::vec::Vec<ena::unify::VarValue<rustc_type_ir::TyVid>> as core::convert::AsRef<[ena::unify::VarValue<rustc_type_ir::TyVid>]>>::as_ref
 15,031,469  ???:llvm::DataLayout::getAlignment(llvm::Type*, bool) const
 14,773,437  ???:llvm::MCAssembler::writeSectionData(llvm::raw_ostream&, llvm::MCSection const*, llvm::MCAsmLayout const&) const
  6,930,326  ???:llvm::ConstantDataArray::getString(llvm::LLVMContext&, llvm::StringRef, bool)
  6,135,110  ???:llvm::SmallPtrSetImplBase::insert_imp_big(void const*)
  5,341,507  ???:llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const
  5,186,440  ???:llvm::MCAsmLayout::layoutFragment(llvm::MCFragment*)
 -4,139,756  ???:<rustc_infer::infer::sub::Sub as rustc_middle::ty::relate::TypeRelation>::tys
  4,015,730  ???:llvm::MCAssembler::layout(llvm::MCAsmLayout&)
  3,861,034  ???:<rustc_const_eval::interpret::validity::ValidityVisitor<rustc_const_eval::const_eval::machine::CompileTimeInterpreter> as rustc_const_eval::interpret::visitor::ValueVisitor<rustc_const_eval::const_eval::machine::CompileTimeInterpreter>>::walk_value
  3,765,544  ???:llvm::MCObjectStreamer::emitFill(llvm::MCExpr const&, unsigned long, llvm::SMLoc)
  3,615,753  obj/build/x86_64-unknown-linux-gnu/stage1-rustc/x86_64-unknown-linux-gnu/release/build/tikv-jemalloc-sys-93177ab130254195/out/build/src/arena.c:arena_dalloc_bin_locked_impl
  3,447,968  ???:llvm::MCObjectStreamer::getOrCreateDataFragment(llvm::MCSubtargetInfo const*)
...

all the externs regressions look like:

--------------------------------------------------------------------------------
Ir         
--------------------------------------------------------------------------------
13,448,267  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir          file:function
--------------------------------------------------------------------------------
13,504,500  ???:rustc_middle::ty::context::tls::TLV::__getit
  -145,666  ???:<rustc_query_system::dep_graph::graph::CurrentDepGraph<rustc_middle::dep_graph::dep_node::DepKind>>::intern_node
   -66,004  ???:<rustc_middle::dep_graph::dep_node::DepKind as rustc_query_system::dep_graph::DepKind>::read_deps::<<rustc_query_system::dep_graph::graph::DepGraph<rustc_middle::dep_graph::dep_node::DepKind>>::read_index::{closure
    27,018  ./nptl/../nptl/pthread_mutex_trylock.c:pthread_mutex_trylock
   -20,467  obj/build/x86_64-unknown-linux-gnu/stage1-rustc/x86_64-unknown-linux-gnu/release/build/tikv-jemalloc-sys-93177ab130254195/out/build/src/extent.c:_rjem_je_extent_heap_remove_first
    18,004  ???:<rustc_ast::ast::NestedMetaItem>::from_tokens::<rustc_ast::tokenstream::Cursor>
   -15,157  obj/build/x86_64-unknown-linux-gnu/stage1-rustc/x86_64-unknown-linux-gnu/release/build/tikv-jemalloc-sys-93177ab130254195/out/build/src/base.c:base_alloc_impl
    14,724  obj/build/x86_64-unknown-linux-gnu/stage1-rustc/x86_64-unknown-linux-gnu/release/build/tikv-jemalloc-sys-93177ab130254195/out/build/include/jemalloc/internal/hash.h:extent_try_coalesce_impl
    14,315  obj/build/x86_64-unknown-linux-gnu/stage1-rustc/x86_64-unknown-linux-gnu/release/build/tikv-jemalloc-sys-93177ab130254195/out/build/include/jemalloc/internal/hash.h:extent_lock_from_addr
    14,220  ./nptl/pthread_mutex_unlock.c:__pthread_mutex_unlock_usercnt

Judging from callgrind (omitted from this comment), this is a __getit call that was inlined before this change and isn't afterwards.
__getit is already #[inline], I'll open a PR to try making it #[inline(always)] to see if that is worthwhile.

Edit: opened #94373

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. perf-regression Performance regression. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Partially-undef constants cause a large size regression in src/test/run-make/wasm-panic-small
10 participants