Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow register_callsite() can cause other subscribers to not receive register_callsite() at all #2743

Open
hds opened this issue Oct 10, 2023 · 1 comment
Labels
crate/core Related to the `tracing-core` crate kind/bug Something isn't working

Comments

@hds
Copy link
Contributor

hds commented Oct 10, 2023

Bug Report

There is code below that reproduces this issue consistently (increase the sleep time to make it 100%).

I've uploaded a branch containing the test that reproduces the issue and my first attempt at a fix (which doesn't appear to fix anything). There's a draft PR open against v0.1.x to show the different #2744.

Version

  • tracing-core: 0.1.30
  • tracing: 0.1.37 (not necessary, can also be reproduced from just tracing-core)

Also confirmed present on the v0.1.x branch.

This error is not reproducible on master.

Platform

Linux and macOS confirmed. Have not tested on Windows.

  • Linux <name> 5.19.0-46-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  • Darwin <name> 23.4.0 Darwin Kernel Version 23.4.0: Wed Feb 21 21:44:31 PST 2024; root:xnu-10063.101.15~2/RELEASE_X86_64 x86_64

Crates

tracing-core

Description

If a Subscriber's register_callsite() function takes too long to complete, it can cause other subscribers to not have register_callsite called at all. Even if the second subscriber is the one which is then recording the tracing event at that callsite.

  1. We start two threads, each thread:
    a. Registers a local Subscriber (via tracing::subscriber::set_default)
    b. Records a single event.

When the first subscriber takes too long inside register_callsite(), the second subscriber doesn't receive a call to register_callsite() at all, instead it receives a call to event() directly.

What is too long?

By ensuring the order in which the threads start executing, a thread::sleep of 50 microseconds will

I can reliably reproduce this issue with the code below with a thread::sleep of 50 microseconds, although it will sometimes succeed. I have seen it reproduce with a sleep of only 20 microseconds, although that was less frequent (3 failures out of 10 tries just now). The machine is a 6 core 3.xGHz Dell desktop that's about 6 years old - so it's not top of the line, but also not underpowered. Keep in mind that the test is calling println!, so that may be adding additional delay.

Here is a full reproduction of the issue as an integration test is below. Note that the assert that would actually make this test fail has been commented out so that it can be observed that subscriber-2 doesn't receive a call to register_callsite() at all (it is not that it receives it after the call to event() which I had originally suspected).

use std::{
    ptr,
    sync::atomic::{AtomicPtr, Ordering},
    thread::{self, JoinHandle},
    time::Duration,
};

use tracing_core::{
    callsite::{Callsite as _, DefaultCallsite},
    dispatcher::set_default,
    field::{FieldSet, Value},
    span, Dispatch, Event, Kind, Level, Metadata, Subscriber,
};

struct TestSubscriber {
    sleep: Duration,
    callsite: AtomicPtr<Metadata<'static>>,
}

impl TestSubscriber {
    fn new(sleep_micros: u64) -> Self {
        Self {
            sleep: Duration::from_micros(sleep_micros),
            callsite: AtomicPtr::new(ptr::null_mut()),
        }
    }
}

impl Subscriber for TestSubscriber {
    fn register_callsite(&self, metadata: &'static Metadata<'static>) -> tracing_core::Interest {
        if !self.sleep.is_zero() {
            thread::sleep(self.sleep);
        }

        self.callsite
            .store(metadata as *const _ as *mut _, Ordering::SeqCst);

        tracing_core::Interest::always()
    }

    fn event(&self, event: &tracing_core::Event<'_>) {
        let stored_callsite = self.callsite.load(Ordering::SeqCst);
        let event_callsite: *mut Metadata<'static> = event.metadata() as *const _ as *mut _;

        // This assert is the actual test.
        assert_eq!(
            stored_callsite, event_callsite,
            "stored callsite: {stored_callsite:#?} does not match event \
            callsite: {event_callsite:#?}. Was `event` called before \
            `register_callsite`?"
        );
    }

    fn enabled(&self, _metadata: &Metadata<'_>) -> bool {
        true
    }
    fn new_span(&self, _span: &span::Attributes<'_>) -> span::Id {
        span::Id::from_u64(0)
    }
    fn record(&self, _span: &span::Id, _values: &span::Record<'_>) {}
    fn record_follows_from(&self, _span: &span::Id, _follows: &span::Id) {}
    fn enter(&self, _span: &tracing_core::span::Id) {}
    fn exit(&self, _span: &tracing_core::span::Id) {}
}

fn subscriber_thread(idx: usize, register_sleep_micros: u64) -> JoinHandle<()> {
    thread::Builder::new()
        .name(format!("subscriber-{idx}"))
        .spawn(move || {
            // We use a sleep to ensure the starting order of the 2 threads.
            let subscriber = TestSubscriber::new(register_sleep_micros);
            let _dispatch_guard = set_default(&Dispatch::new(subscriber));

            static CALLSITE: DefaultCallsite = {
                // The values of the metadata are unimportant
                static META: Metadata<'static> = Metadata::new(
                    "event ",
                    "module::path",
                    Level::INFO,
                    None,
                    None,
                    None,
                    FieldSet::new(&["message"], tracing_core::callsite::Identifier(&CALLSITE)),
                    Kind::EVENT,
                );
                DefaultCallsite::new(&META)
            };
            let _interest = CALLSITE.interest();

            let meta = CALLSITE.metadata();
            let field = meta.fields().field("message").unwrap();
            let message = format!("event-from-{idx}", idx = idx);
            let values = [(&field, Some(&message as &dyn Value))];
            let value_set = CALLSITE.metadata().fields().value_set(&values);

            Event::dispatch(meta, &value_set);

            // Wait a bit for everything to end (we don't want to remove the subscriber
            // immediately because that will influence the test).
            thread::sleep(Duration::from_millis(10));
        })
        .expect("failed to spawn thread")
}

/// Regression test for missing register_callsite call (#2743)
///
/// This test provokes the race condition which causes the second subscriber to not receive a
/// call to `register_callsite` before it receives a call to `event`.
///
/// Because the test depends on the interaction of multiple dispatchers in different threads,
/// it needs to be in a test file by itself.
#[test]
fn event_before_register() {
    let subscriber_1_register_sleep_micros = 30;
    let subscriber_2_register_sleep_micros = 0;

    let jh1 = subscriber_thread(1, subscriber_1_register_sleep_micros);

    // This delay ensures that the event callsite has interest() called first.
    thread::sleep(Duration::from_micros(10));
    let jh2 = subscriber_thread(2, subscriber_2_register_sleep_micros);

    jh1.join().expect("failed to join thread");
    jh2.join().expect("failed to join thread");
}

If we add println! messages to the register_callsite and event methods, we get the following output:

subscriber-1: register_callsite: 0x000055555564bb88
subscriber-2: event (with callsite): 0x000055555564bb88 (stored callsite: 0x0000000000000000)
subscriber-1: event (with callsite): 0x000055555564bb88 (stored callsite: 0x000055555564bb88)

When the sleep in the first subscriber is set to 0, the expected behavior is observed:

subscriber-1: register_callsite: 0x000055555564bb88
subscriber-1: event (with callsite): 0x000055555564bb88 (stored callsite: 0x000055555564bb88)
subscriber-1: register_callsite: 0x000055555564bb88
subscriber-2: register_callsite: 0x000055555564bb88
subscriber-2: event (with callsite): 0x000055555564bb88 (stored callsite: 0x000055555564bb88)

My best guess is that the reregistration of the callsite (because there's a new subscriber subscriber-2 in the new thread) causes subscriber-2 to skip registering the callsite of the event when it executes it.

A description of the fix can be found in #2938.

hds added a commit that referenced this issue Oct 10, 2023
This branch contains code to reproduce an issue where 2 thread local
subscribers interfer so that when the first one takes too long to
complete `register_callsite()`, the second subscriber doesn't have
`register_callsite()` called at all.

It also contains an attempt at a fix by removing a race condition in
`DefaultCallsite::register`. This fix may be necessary anyway, but it
isn't sufficient.

The issue for this problem is #2743.
hds added a commit that referenced this issue Oct 10, 2023
This branch contains code to reproduce an issue where 2 thread local
subscribers interfer so that when the first one takes too long to
complete `register_callsite()`, the second subscriber doesn't have
`register_callsite()` called at all.

It also contains an attempt at a fix by removing a race condition in
`DefaultCallsite::register`. This fix may be necessary anyway, but it
isn't sufficient.

The issue for this problem is #2743.
hds added a commit that referenced this issue Oct 10, 2023
This branch contains code to reproduce an issue where 2 thread local
subscribers interfer so that when the first one takes too long to
complete `register_callsite()`, the second subscriber doesn't have
`register_callsite()` called at all.

It also contains an attempt at a fix by removing a race condition in
`DefaultCallsite::register`. This fix may be necessary anyway, but it
isn't sufficient.

The issue for this problem is #2743.
hds added a commit that referenced this issue Oct 10, 2023
This branch contains code to reproduce an issue where 2 thread local
subscribers interfer so that when the first one takes too long to
complete `register_callsite()`, the second subscriber doesn't have
`register_callsite()` called at all.

It also contains an attempt at a fix by removing a race condition in
`DefaultCallsite::register`. This fix may be necessary anyway, but it
isn't sufficient.

The issue for this problem is #2743.
hds added a commit that referenced this issue Oct 10, 2023
This branch contains code to reproduce an issue where 2 thread local
subscribers interfer so that when the first one takes too long to
complete `register_callsite()`, the second subscriber doesn't have
`register_callsite()` called at all.

It also contains an attempt at a fix by removing a race condition in
`DefaultCallsite::register`. This fix may be necessary anyway, but it
isn't sufficient.

The issue for this problem is #2743.
@hawkw
Copy link
Member

hawkw commented Oct 10, 2023

Thanks for the repro, I can look into the potential fix and see what else is necessary to make it pass!

@hawkw hawkw added kind/bug Something isn't working crate/core Related to the `tracing-core` crate labels Oct 10, 2023
hds added a commit to tokio-rs/console that referenced this issue Apr 10, 2024
We have identified that some of the `console-subscriber` integration
tests are flaky (#473). This appears to be a result of the following
issue in `tracing` tokio-rs/tracing#2743.

Flaky tests are really worse than no tests. So these tests will be
removed until the flakiness can be fixed.
hds added a commit to tokio-rs/console that referenced this issue Apr 10, 2024
We have identified that some of the `console-subscriber` integration
tests are flaky (#473). This appears to be a result of the following
issue in `tracing` tokio-rs/tracing#2743.

Flaky tests are really worse than no tests. So these tests will be
removed until the flakiness can be fixed.
hds added a commit that referenced this issue Apr 16, 2024
There are 2 triggers which will cause a subscriber to receive a call to
`Subscriber::register_callsite` for a specific callsite.
1. The first time the event or span at that callsite is executed.
2. When a new subscriber is added or removed (for example, calls to
   `set_default` or `with_default`)

It is trigger (2) that will cause a new subscriber to receive
`Subscriber::register_callsite` for all the callsites which had already
been registered before it became active.

When a callsite is registered for trigger (1), the callsite starts in
state `UNREGISTERED`.

The first thread to encounter the callsite will transition it to
`REGISTERING` and determine the overall interest for the callsite by
registering with all known dispatchers (which will call into
`Subscriber::register_callsite`).

Once that is complete, the callsite is added to the list of all known
callsites and its state is transitioned to `REGISTERED`.

is (re)built for all known dispatchers. The callsite starts in state
`UNREGISTERED`. The  This calls down into
`Subscriber::register_callsite` for each subscriber. Once that is
complete, the callsite is added to the global list of known callsites.

While the callsite interest is being rebuilt, other threads that
encounter the callsite will be given `Interest::sometimes()` until the
registration is complete. However, if a new subscriber is added during
this window, all the interest for all callsites will be rebuilt, but
because the new callsite (in state `REGISTERING`) won't be included
because it isn't yet in the global list of callsites.

This can cause a case where that new subscriber being added won't
receive `Subscriber::register_callsite` before it receives the subsequent
call to `Subscriber::event` or `Subscriber::new_span`.

The documentation on [Registering Callsites] is not very explicit on
this point, but it does suggest that `Subscriber::register_callsite`
will be called before the call to either `Subscriber::event` or
`Subscriber::new_span`, and the current behavior can break this implicit
contract.

[Registering Callsites]: https://docs.rs/tracing-core/0.1.32/tracing_core/callsite/index.html#registering-callsites

This change swaps the order of rebuilding the callsite interest and
adding the callsite to the global list so that the callsite gets pushed
first, avoiding this window in which a subscriber won't get a call to
`register_callsite`.

As such, a callsite may have its interest read before it is set. In this
case, the existing implementation will return `Interest::sometimes()`
for the `DefaultCallsite` implementation. Other implementations (outside
of the `tracing` project) may perform this differently, but in this
case, there is no documented guarantee regarding the ordering.

A regression test is included which provokes the race condition 100% of
the time before the changes in this fix.

Fixes: #2743
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
crate/core Related to the `tracing-core` crate kind/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants