
core: consider using an atomic linked list for the callsite registry #860

Closed
hawkw opened this issue Jul 29, 2020 · 6 comments · Fixed by #988
Assignees
Labels
crate/core Related to the `tracing-core` crate kind/perf Performance improvements

Comments

@hawkw
Member

hawkw commented Jul 29, 2020

Feature Request

Crates

  • tracing-core

Motivation

tracing-core's callsite registry stores every active dispatcher, and a &'static reference to every callsite, so that the callsite can be re-registered when the Interest cache is invalidated.

Currently, this is represented as a pair of Vecs inside a Mutex:

static ref REGISTRY: Mutex<Registry> = Mutex::new(Registry {
    callsites: Vec::new(),
    dispatchers: Vec::new(),
});

The callsite registry is accessed in a few ways:

  1. New callsites are added to the registry. This happens the first time a callsite is hit.
  2. A Dispatch is added or removed from the registry. This happens whenever a Dispatch is created or dropped, which is infrequent — if a program only uses the global default dispatcher, a Dispatch is created once in the process' lifetime, and is never dropped. This will iterate over every registered callsite to register it with the new Dispatch.
  3. The registered callsites are iterated over to invalidate the cache. Depending on the program, this is either somewhat infrequent (a filtering change is triggered by user input), or never occurs (all filter configurations are static and don't change).

Of these, adding new callsites to the registry is the hottest path. Although it is not terribly frequent, it still happens several orders of magnitude more often than iterating over the registered callsites. Because the callsite registry is behind a Mutex, every registration is forced to synchronize. This means that if multiple threads hit different callsites for the first time, they must all contend for the lock.

In most programs, registrations will happen more frequently at the beginning of the process' lifetime, and decrease over time as more and more callsites are hit for the first time. In long-running programs, the impact of contending the registry lock is highly amortized over the program's lifetime, and has a fairly small performance impact. However, in shorter-running programs, like CLI tools, it may be more meaningful. Also, even in longer-running programs, decreased performance at startup may be an issue.

Proposal

We may want to consider replacing the Mutex<Vec<...>> representation for the callsite registry with an atomic singly-linked list. With the current implementation, iterating over the registry is fast (because it's a Vec) but inserting must always contend the lock. A linked list is slower to iterate, but there is no Mutex in the insertion path. This probably makes more sense in this use case, since (as we discussed above) adding new callsites to the registry occurs much more frequently than iterating over currently registered callsites.

Since we only need to iterate in one direction, and never remove from the middle of the list (or from anywhere in the list, in point of fact), a singly-linked list is fine. This means that we can use an atomic linked list without too much pain. We don't even need to worry about the ABA problem, because (as I mentioned) callsites are never removed, and have unique addresses. We would treat the list somewhat like a stack (except without pops), storing the pointer to the most recently appended callsite, and use this as the head for both appends and iterations. This means that we don't have to link-hop in appends.
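The push-at-the-head scheme described above can be sketched as follows. This is an illustrative, non-intrusive version (nodes are boxed, unlike the intrusive variant discussed below), and the `Node`, `push`, and `for_each` names are made up for this sketch, not `tracing-core` APIs:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Hypothetical node type: in `tracing-core` the payload would be a
// `&'static dyn Callsite`; a plain `usize` id stands in for it here.
struct Node {
    id: usize,
    next: *const Node,
}

// The head is the most recently appended node, so appends never
// have to walk the list.
static HEAD: AtomicPtr<Node> = AtomicPtr::new(ptr::null_mut());

// Lock-free push: set the new node's `next` to the current head,
// then CAS the head to the new node, retrying on contention.
// Nodes are never removed, so there is no ABA problem to worry about.
fn push(id: usize) {
    let node = Box::into_raw(Box::new(Node { id, next: ptr::null() }));
    let mut head = HEAD.load(Ordering::Relaxed);
    loop {
        unsafe { (*node).next = head };
        match HEAD.compare_exchange_weak(head, node, Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) => return,
            Err(actual) => head = actual,
        }
    }
}

// Iteration walks from the newest node toward the oldest.
fn for_each(mut f: impl FnMut(usize)) {
    let mut cur = HEAD.load(Ordering::Acquire) as *const Node;
    while !cur.is_null() {
        unsafe {
            f((*cur).id);
            cur = (*cur).next;
        }
    }
}

fn main() {
    for i in 0..3 {
        push(i);
    }
    let mut seen = Vec::new();
    for_each(|id| seen.push(id));
    assert_eq!(seen, vec![2, 1, 0]); // newest-first order
}
```

Note that iteration observes nodes newest-first; that's fine for the registry, since re-registration doesn't depend on order.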

If we want to be fancy, we could even make this an intrusive list, by adding a next pointer to the callsites themselves. That way, the static callsite would store the next pointer's address, rather than putting it on the heap. Then, we wouldn't need to make any allocations. I don't think the performance impact of allocations in this code path is terribly significant — getting rid of the lock is probably good enough for the callsite registration path. However, it does show up as a "leak" in tools like valgrind, since any allocations we make to register a callsite are never freed, and this can look scary to users who aren't aware of what this code is doing. It could be nice for this to not show up in valgrind. A side benefit is that this would also move tracing-core one step closer to compiling on no_std without liballoc. I don't know if there's really any current demand for this (and it would take some additional work to pull off, such as making the global dispatcher work without Arc) but it would be cool to be able to say we can do it.

Obviously, this change should be guided by benchmarking. It's possible my intuition is just wrong, and the Mutex is faster. Of course, we would want to benchmark this under concurrency, since that's when the Mutex contention might actually matter.

Alternatives

  • We could just not do this. Since registering callsites only happens once, it is entirely possible that there are bigger things to focus on optimizing...and they may not require writing Yet Another Atomic Linked List, too.
  • We could get rid of runtime callsite registration entirely, and — @nagisa will love this one, I think — use a #[link_section] attribute to get the linker to put a reference to each callsite in a special section, which we would then treat as an array at runtime. This is similar to what @mystor did here: https://github.com/hawkw/mycelium/blob/03ef7cdefb2563d04c2c22219d24da32e8db387f/util/src/testing.rs#L38-L69.
    • The main advantages of this approach are:
      • We could iterate over the section as an array immediately when the first subscriber is created, so every callsite will already be registered the first time it's hit — no registration overhead in user code at all
      • There's no need for any concurrency control of the callsite registry, since it will already be generated for us and never need to be mutated
      • Array iteration is fast
      • This would never use the heap at all
      • It's a cool hack that would make me feel like a real systems programmer
    • However, it has some downsides:
      • Adds a bunch of cool new unsafe code that some people will definitely side-eye
      • I have no idea how portable this is. My understanding is that #[link_section] is not specific to a particular linker like lld or GNU ld, but I don't know if this would work at all on exotic targets like WASM, or if someone tries to test code using tracing in Miri. We could try to guard the whole system behind conditional compilation flags, and fall back to runtime registration if we can't guarantee that Cool Linker Tricks will work on the target platform...but that is a whole new can of worms in and of itself...
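For the curious, the linker trick can be sketched on Linux/ELF targets, where the linker synthesizes `__start_<section>`/`__stop_<section>` symbols for any section whose name is a valid C identifier. The section name `tracing_callsites` and the `usize` payloads are made up for illustration (real callsites would be static structs), and this is exactly the kind of platform-specific code that would need to be gated:

```rust
// Keep the statics even though nothing references them directly.
#[used]
#[link_section = "tracing_callsites"]
static CALLSITE_A: usize = 1;

#[used]
#[link_section = "tracing_callsites"]
static CALLSITE_B: usize = 2;

extern "C" {
    // Symbols the ELF linker synthesizes to delimit the section.
    static __start_tracing_callsites: usize;
    static __stop_tracing_callsites: usize;
}

fn main() {
    unsafe {
        let start = &__start_tracing_callsites as *const usize;
        let stop = &__stop_tracing_callsites as *const usize;
        // Treat the whole section as an array at runtime.
        let len = (stop as usize - start as usize) / std::mem::size_of::<usize>();
        let callsites = std::slice::from_raw_parts(start, len);
        // The order of statics within the section is unspecified,
        // so check the sum rather than the sequence.
        assert_eq!(callsites.len(), 2);
        assert_eq!(callsites.iter().sum::<usize>(), 3);
        println!("found {} callsites", callsites.len());
    }
}
```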
@hawkw hawkw added crate/core Related to the `tracing-core` crate kind/perf Performance improvements labels Jul 29, 2020
@hawkw hawkw self-assigned this Jul 29, 2020
@hawkw
Member Author

hawkw commented Jul 29, 2020

@LucioFranco I imagine you might be interested in hacking on this?

@LucioFranco
Member

Yes!

@LucioFranco LucioFranco self-assigned this Jul 30, 2020
@hawkw
Member Author

hawkw commented Jul 30, 2020

Great! Some notes for starting out on this:

  • Let's punt on the intrusive list thing for now. I don't think we can do it nicely without adding a new method to the callsite API to link a callsite into the list (and perhaps also to load its next pointer), which is tricky to do (perhaps not possible) without breaking the API. We can revisit that for 0.2.
  • Let's make sure that we have a benchmark that actually measures the registration overhead before we work on this. If we want to specifically measure the overhead of adding a callsite to the list, we'll have to create a new process for every iteration, so I don't think we can do that easily with criterion.

@hawkw
Member Author

hawkw commented Aug 18, 2020

Since 0.2 is coming up, we could potentially make changes to the callsite trait etc to support an intrusive, non-allocating list...

@hawkw
Member Author

hawkw commented Aug 19, 2020

The intrusive list would also be a major step in getting tracing-core to work without liballoc, which could be handy for no_std users (I imagine @steveklabnik, @Hoverbear, and @baloo would all really appreciate getting rid of tracing-core's no. 1 use of heap space).

It would require adding a new method to the callsite API to get a next callsite pointer, and that would have to be a breaking change (since a default impl returning None would break the callsite registry).

The other issue is that &'static dyn Callsite is a wide pointer (ptr+vtable) and therefore does not easily fit in an AtomicUsize. So, we would need to indirect this somehow, perhaps by building the list out of pointers to a struct that internally owns a &'static dyn Callsite pointer. That could be a good thing, since we could hide all of the linked-list operations inside tracing-core, and downstream implementations wouldn't have to worry about it...
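The size mismatch is easy to demonstrate with a stand-in trait (the real one is `tracing_core::Callsite`):

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicUsize;

// Stand-in for `tracing_core::Callsite`.
trait Callsite {}

fn main() {
    // A `&dyn Callsite` is a (data pointer, vtable pointer) pair,
    // twice the size of a thin pointer...
    assert_eq!(size_of::<&dyn Callsite>(), 2 * size_of::<usize>());
    // ...while an `AtomicUsize` holds only a single pointer-sized value,
    // so the trait-object reference cannot be stored in one atomically.
    assert_eq!(size_of::<AtomicUsize>(), size_of::<usize>());
}
```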

@hawkw
Member Author

hawkw commented Aug 19, 2020

I'm imagining something like

pub struct Registration {
    callsite: &'static (dyn Callsite + 'static),
    next: AtomicPtr<Registration>,
}

impl Registration {
    // === internal APIs ===
    // called when traversing the registry (re-registering)
    fn next(&self) -> Option<&Registration> { todo!() }
    // called when registering
    fn set_next(&self, next: &'static Self) { todo!() }

    // called by callsite implementations
    pub const fn new(callsite: &'static (dyn Callsite + 'static)) -> Self { todo!() }
}

pub trait Callsite {
    // implemented by `Callsite` impls; basically, they have to store a registration in a static
    fn registration(&'static self) -> &'static Registration;

    // ... other trait methods ...
}

LucioFranco added a commit that referenced this issue Sep 25, 2020
This change adds an intrusive `LinkedList` for the
callsite registry. The linked list is lock-free and
can be written to from multiple threads. This list
also does not require any allocations due to its
intrusive nature. This though does require a
breaking change to the `Callsite` trait to allow
callsites to store the pointer to the next item in
the intrusive list.

Properties of the intrusive atomic linked list:

- Only supports `LinkedList::push` and `LinkedList::for_each`.
- The items in the list can only be added and not removed.

Closes #860
@hawkw hawkw closed this as completed in #988 Oct 2, 2020
hawkw pushed a commit that referenced this issue Oct 2, 2020