
Implement a Concurrent Generational Typed Arena (and AtomicRef{,Mut}::map) #13777

Closed
wants to merge 2 commits

Conversation

@bholley (Contributor) commented Oct 14, 2016

This is a core part of the new incremental restyle architecture.

CC @Manishearth @SimonSapin @heycam @emilio



MozReview-Commit-ID: 8iOALQylOuK
@highfive commented Oct 14, 2016

Warning:

  • These commits modify unsafe code. Please review it carefully!
@bholley (Contributor, Author) commented Oct 14, 2016

r? @emilio

@highfive assigned emilio and unassigned Manishearth Oct 14, 2016
@bholley (Contributor, Author) commented Oct 14, 2016

@bors-servo (Contributor) commented Oct 14, 2016

Trying commit 183a3e4 with merge eba3113...

bors-servo added a commit that referenced this pull request Oct 14, 2016
Implement a Concurrent Generational Typed Arena (and AtomicRef{,Mut}::map)

This is a core part of the new incremental restyle architecture.

CC @Manishearth @SimonSapin @heycam @emilio

@emilio (Member) left a comment

A few comments from skimming this only.

```rust
let result = Arc::new(AtomicRefCell::new(Arena {
    generation: 1,
    current_index: AtomicIsize::new(0),
    chunk_list: ChunkList::<Item>::new(),
```

emilio (Member) Oct 14, 2016

nit: I think ChunkList::new would work here.
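For illustration, the inference emilio is pointing at (hypothetical minimal types, not the PR's actual definitions): when the field's type is already known, the turbofish on `new` is redundant.

```rust
// Hypothetical minimal sketch of the nit: the type parameter on
// `ChunkList::new` can be inferred from the field it initializes.
struct ChunkList<T> {
    chunks: Vec<Vec<T>>,
}

impl<T> ChunkList<T> {
    fn new() -> Self {
        ChunkList { chunks: Vec::new() }
    }
}

struct Arena<T> {
    chunk_list: ChunkList<T>,
}

fn make_arena<T>() -> Arena<T> {
    Arena {
        // `ChunkList::<T>::new()` also compiles, but the parameter is
        // inferred from the field type, so plain `ChunkList::new()` works.
        chunk_list: ChunkList::new(),
    }
}
```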

```rust
/// operations for callers that don't need to dereference the result.
pub fn allocate_raw(&self) -> TokenData {
    let idx_signed = self.current_index.fetch_add(1, Ordering::Relaxed);
    if idx_signed < 0 {
```

emilio (Member) Oct 14, 2016

Wasn't this undefined behavior? Could we instead use add_fetch and check current_index == 0? Also, you're using isize, which means that on 64-bit you will overflow in the integer conversion below before this point.

emilio (Member) Oct 14, 2016

I mean using add_fetch with an AtomicUsize here, ofc.
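The unsigned variant emilio suggests might look like the sketch below (hypothetical names, not the PR's code). One nuance: in Rust, atomic `fetch_add` wraps rather than being undefined, so the point of the explicit bound check is to panic well before the counter can wrap or exceed the index budget.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch of an unsigned counter with an explicit overflow check.
// MAX_INDEX matches the 31-bit index budget discussed later in the
// review (the top and bottom bits of the token encoding are reserved).
const MAX_INDEX: usize = (1 << 31) - 1;

struct Counter {
    current_index: AtomicUsize,
}

impl Counter {
    fn allocate_index(&self) -> usize {
        let idx = self.current_index.fetch_add(1, Ordering::Relaxed);
        // Panic before the index can exceed the encodable range. Atomic
        // fetch_add wraps (defined behavior), but a wrapped index would
        // silently alias earlier slots, so we check eagerly.
        assert!(idx < MAX_INDEX, "arena index overflow");
        idx
    }
}
```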


```rust
self.chunk_list.allocate(idx as usize);
TokenData {
    generation: self.generation,
```

emilio (Member) Oct 14, 2016

nit: Spacing is off here.

@emilio (Member) left a comment

As another unrelated note, a benchmark against concurrent allocations of Box<Foo> would probably be interesting to see, and even to test.

```rust
use std::sync::{Arc, Weak};
use std::sync::atomic::{AtomicPtr, AtomicIsize, Ordering};

///! A Concurrent Generational Typed Arena.
```

emilio (Member) Oct 14, 2016

nit: I think the preferred style is either /// or //!, but not this one.
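A quick illustration of the two preferred forms (hypothetical module, not the PR's code). The `///!` in the diff is not a distinct doc-comment form: it parses as an outer doc comment whose text merely begins with `!`.

```rust
/// `///` is an *outer* doc comment: it documents the item that
/// follows (here, the module).
mod arena {
    //! `//!` is an *inner* doc comment: it documents the enclosing
    //! item (here, the `arena` module), which is what a module-level
    //! header like "A Concurrent Generational Typed Arena" wants.

    /// An outer doc comment attached to the struct itself.
    pub struct Arena;
}
```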

@samlh (Contributor) commented Oct 14, 2016

[...] the new incremental restyle architecture

Just curious - is there any information available about what this is referring to?

@bholley (Contributor, Author) commented Oct 14, 2016

This is the design document: https://docs.google.com/a/mozilla.com/document/d/1TcTAQMm-jIiNgzkDgpO04Pja_aQNRNkrD0CNCCYfbo8/edit?usp=sharing

The plan was designed around stylo, but with some tweaks it should work for regular Servo as well.

@bholley (Contributor, Author) commented Oct 14, 2016

And the tl;dr is that Servo's incremental restyle is pretty basic, and we believe the new architecture will make it world-class.

@bors-servo (Contributor) commented Oct 14, 2016

💔 Test failed - linux-rel-wpt

@bholley (Contributor, Author) commented Oct 14, 2016

@bholley (Contributor, Author) commented Oct 14, 2016

(will retry once I push a few fixups)

@samlh (Contributor) commented Oct 14, 2016

Thank you for the quick response! It'll take me a little while to read through the document, but it definitely sounds interesting. Cheers!

MozReview-Commit-ID: JyTV5VqKG8U
@bholley force-pushed the bholley:cgt_arena branch from 183a3e4 to 9941d3b Oct 15, 2016
@bholley (Contributor, Author) commented Oct 15, 2016

@bors-servo (Contributor) commented Oct 15, 2016

Trying commit 9941d3b with merge d7b8cd4...

bors-servo added a commit that referenced this pull request Oct 15, 2016
Implement a Concurrent Generational Typed Arena (and AtomicRef{,Mut}::map)

This is a core part of the new incremental restyle architecture.

CC @Manishearth @SimonSapin @heycam @emilio

@bholley (Contributor, Author) commented Oct 15, 2016

Benchmark numbers on my machine:

```text
sequential_allocations_arena             : 216 ns/iter (+/- 126)
sequential_allocations_arena_preborrowed : 111 ns/iter (+/- 10)
sequential_allocations_box               :  19 ns/iter (+/- 5)
sequential_raw_allocations_arena_preborrowed : 30 ns/iter (+/- 6)
sequential_stack_allocations             :   1 ns/iter (+/- 0)
```

This suggests there's some room for optimization, particularly around the handles (which appear to be the lion's share of the overhead). I'll look into it more next time, but I think it's probably reasonable to land this and address those issues as a followup.

@bors-servo (Contributor) commented Oct 15, 2016

💔 Test failed - linux-rel-css

@nox (Member) commented Oct 16, 2016

Could we get a comparison with the usual allocation stuff from Rust itself?

@nox (Member) left a comment

I have some questions.

```rust
}

/// Not publicly exposed, invoked internally by `Token`. The caller must
/// ensure that the `TokenData` does not represent an expired `Token`.
```

nox (Member) Oct 16, 2016

What happens if the caller does not ensure that?

bholley (Author, Contributor) Oct 16, 2016

Token ensures it, and it's an internal API, so it doesn't particularly matter.


```rust
// Shift the index left one bit to leave the bottom bit unused. This
// allows the callers to use tokens as tagged pointers.
debug_assert!(self.index & (1_u32 << 31) == 0);
```

nox (Member) Oct 16, 2016

Shouldn't that be assert!? It can happen in practice that self.index becomes big enough, right?

bholley (Author, Contributor) Oct 16, 2016

Nope, because allocate_raw panics if we overflow.


```rust
/// Deserializes a TokenData from a `u64`.
pub fn deserialize(x: u64) -> Self {
    debug_assert!(x & 1 == 0, "Caller should strip the tag bit");
```

nox (Member) Oct 16, 2016

Shouldn't this be assert!? What happens if this is false in release mode?

bholley (Author, Contributor) Oct 16, 2016

It will just get shifted out. But anyway, none of this violates safety because Token::create is unsafe.

That said, it looks like there's a bug right now where deserialize doesn't unshift. I'll fix that and add a test.
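The fixed roundtrip might look like the sketch below. This is a standalone illustration, not the PR's code: the packing of the generation into the high 32 bits is an assumption made so the example is self-contained. The key point is that serialization shifts the index left one bit (freeing the bottom bit for a pointer tag), so deserialization must shift it back.

```rust
// Hypothetical TokenData layout: generation in the high 32 bits,
// shifted index in the low 32 bits (an assumption for illustration).
#[derive(Debug, PartialEq)]
struct TokenData {
    generation: u32,
    index: u32,
}

impl TokenData {
    fn serialize(&self) -> u64 {
        // The index must fit in 31 bits so the left shift can't lose it.
        debug_assert!(self.index & (1_u32 << 31) == 0);
        ((self.generation as u64) << 32) | ((self.index << 1) as u64)
    }

    fn deserialize(x: u64) -> Self {
        debug_assert!(x & 1 == 0, "Caller should strip the tag bit");
        TokenData {
            generation: (x >> 32) as u32,
            // The unshift the original code omitted: undo `index << 1`.
            index: (x as u32) >> 1,
        }
    }
}
```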

@bholley (Contributor, Author) commented Oct 17, 2016

I've done some interesting analysis that suggests that this arena may be the wrong approach for what we need. I'll write more later this evening or tomorrow.

@bholley (Contributor, Author) commented Oct 17, 2016

So, the original motivation for this was that we want to create tons and tons of short-lived TransientStyleData objects during re-styling (on various threads), store them across FFI boundaries, and free them all at once after styling completes. There were a few reasons that an arena was attractive:
(1) Removing the requirement to locate all the TransientStyleData objects to free them.
(2) Better allocation / deallocation performance
(3) Reduced fragmentation

(1) is the core architectural reason to use an arena. This originally seemed like a big plus that would potentially reduce our traversal requirements, but given further thought and recent architectural changes I think we're probably going to end up traversing any node with a TSD anyway.

(2) turned out to be a premature optimization, at least with the performance characteristics on my local machine with simple benchmarks. I took the following reference measurements:

  • Arc clone + release: 13 ns (just clone is 6 ns)
  • Mutable AtomicRefCell borrow + release: 11 ns (just borrow is 5 ns)
  • Immutable AtomicRefCell borrow + release: 21 ns (just borrow is 10 ns)
  • Foo::new() on the stack: 1 ns
  • 8-word uninitialized heap allocation (box mem::uninitialized()): reliably 22-27 ns
  • 8-word initialized heap allocation (Box::new(Foo::new())): bimodal variation between ~25 ns and ~50 ns, presumably based on cache locality

The upshot here is that heap allocation is almost certainly not a bottleneck for our uses here. It's roughly equivalent to the memory traffic of memmove-initializing the allocated memory on a cache miss, and is on the same order as any of the various atomic operations that we perform all over the place.

Using uninitialized memory, I was able to get the raw allocation performance of the arena down to 10ns (pretty much the cost of the atomic index bump, with the actual allocations amortized). So there's some win to be had, but it mostly disappears once you start interacting with the allocated memory.
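The amortization being measured can be illustrated with a toy sketch (not the PR's code; `CHUNK_SIZE` and the layout are invented, and the race between bumping the counter and growing the chunk list is ignored for brevity): the common-case cost of a raw allocation is a single atomic add, with chunk growth paid once per `CHUNK_SIZE` allocations.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

const CHUNK_SIZE: usize = 1024;

// Toy chunked bump allocator: hands out global slot indices.
struct ToyArena {
    next: AtomicUsize,
    chunks: Mutex<Vec<Box<[u64; CHUNK_SIZE]>>>,
}

impl ToyArena {
    fn new() -> Self {
        ToyArena { next: AtomicUsize::new(0), chunks: Mutex::new(Vec::new()) }
    }

    /// Common case: one relaxed fetch_add. The mutex-guarded chunk
    /// growth runs only on the first allocation of each chunk, so its
    /// cost is amortized over CHUNK_SIZE allocations.
    fn allocate_raw(&self) -> usize {
        let idx = self.next.fetch_add(1, Ordering::Relaxed);
        if idx % CHUNK_SIZE == 0 {
            self.chunks.lock().unwrap().push(Box::new([0u64; CHUNK_SIZE]));
        }
        idx
    }
}
```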

As for (3), I think we'll probably be ok, and I don't think fragmentation itself is enough to justify the complexity here.

I think this arena would still be very useful for somebody needing the GC-like characteristics of (1), but given that we don't need them for stylo I can't justify spending any more time on it. I'll re-open a PR if something changes.

@bholley closed this Oct 17, 2016
@bholley (Contributor, Author) commented Oct 17, 2016

For posterity, here are some fixes that should be made if this code ever gets used:

  • Add the missing shift to TokenData::deserialize.
  • Optimize the API to reduce expensive atomic operations.
  • Increase the chunk list length by a factor of 8 or so.
  • Make dropping smarter by not atomically checking every chunk slot.
  • Avoid memmoves (may require unstable |box| syntax).
@Manishearth (Member) commented Oct 17, 2016

(2) Better allocation / deallocation performance

FWIW jemalloc already does some arena-ing, so arenas aren't that effective at improving allocation performance.

bors-servo added a commit that referenced this pull request Nov 1, 2016
Implement AtomicRef{,Mut}::map

I was originally bundling this with #13777 but am splitting it out since that's been deprioritized.

r? @SimonSapin
