Performance degradation after enabling tracing in a service #1474

siningma · 2021-07-16T17:49:40Z

Hey folks,

I am having a problem after enabling tracing in a Rust service. This service serves as an in-memory cache for data loaded from persistence. At peak, each host handles 7K-8K QPS of gRPC requests from clients.
Before enabling tracing, the average and P95 response latency are below 1 ms. CPU usage of each host is below 40%.
However, after turning on tracing, the average response latency went up to around 5 ms and P95 can exceed 100 ms. CPU usage increased to 50-60%.

The dependencies that are used in this service:

opentelemetry = { version = "0.15", features = ["rt-tokio"] }
opentelemetry-datadog = { version = "0.3.1", features = [ "reqwest-client"] }
opentelemetry-http = "0.4"
tracing = "0.1"
tracing-attributes = "0.1"
tracing-futures = { version = "0.2", features = ["futures-03"] }
tracing-log = "0.1"
tracing-opentelemetry = "0.14"
tracing-subscriber = "0.2"
tonic = "0.4"

Basic information about the service:

The service runs on the tokio runtime.
The tracing spans were added around the persistence calls by instrumenting an async fn which executes the persistent load call and awaits the response.
Trace level was set to INFO.
The gRPC server is tonic.
Tracing sample rate is 0.1%.
Some environment variables settings:
- Environment="OTEL_BSP_SCHEDULE_DELAY_MILLIS=1000"
- Environment="OTEL_BSP_MAX_QUEUE_SIZE=100000"
- Environment="OTEL_BSP_MAX_EXPORT_BATCH_SIZE=10000"
- Environment="OTEL_BSP_EXPORT_TIMEOUT_MILLIS=10000"

Tracing pipeline configuration:

    pub fn install(self) -> Result<Uninstall, Box<dyn std::error::Error + Send + Sync + 'static>> {
        // Setup Datadog propagator
        opentelemetry::global::set_text_map_propagator(DatadogPropagator::default());

        let datadog_agent_host = std::env::var("DD_AGENT_HOST")
            .or_else(|_| std::env::var("DATADOG_TRACE_AGENT_HOSTNAME"))
            .unwrap_or("127.0.0.1".into());

        let sampler = Sampler::ParentBased(Box::new(Sampler::TraceIdRatioBased(self.sample_rate)));
        let trace_config = trace::config()
            .with_sampler(sampler)
            .with_id_generator(IdGenerator::default());

        // Create an OpenTelemetry tracer with a Datadog exporter if requested.
        let tracer = opentelemetry_datadog::new_pipeline()
            .with_agent_endpoint(format!("http://{}:8126", datadog_agent_host))
            .with_service_name(self.service_name)
            .with_version(opentelemetry_datadog::ApiVersion::Version03)
            .with_trace_config(trace_config)
            .install_batch(opentelemetry::runtime::Tokio)?;

        let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);

        // Configure our default JSON logging format.
        let fmt = fmt::Layer::default()
            .json()
            .with_timer(ChronoUtc::rfc3339());

        // Configure our default logging level.
        // Try to parse from RUST_LOG, falling back to defaults.
        let mut logging_level_from_conf = false;
        let filter = EnvFilter::try_from_default_env().unwrap_or_else(|_| {
            logging_level_from_conf = true;
            EnvFilter::default().add_directive(self.default_log_level.into())
        });

        // Create a Tokio tracing subscriber with our desired configuration.
        // CadenceLayer reports the numbers of active spans as a statsd metric.
        let subscriber = Registry::default()
            .with(telemetry)
            .with(CadenceLayer::new())
            .with(fmt)
            .with(filter);

        // Register our subscriber as the global default, making it usable by
        // the `#[tracing::instrument]` attribute macro.
        tracing::subscriber::set_global_default(subscriber)
            .expect("failed to set global tracing subscriber");

        // Now that logging is configured, report any log level fallback from earlier.
        if logging_level_from_conf {
            tracing::info!(
                "did not detect logging level from RUST_LOG, using configured value instead"
            );
        }

        // Initialize the LogTracer.
        let trace_level: Option<Level> = LevelFilter::current().into();
        if let Some(level) = trace_level {
            if let Err(e) = LogTracer::init_with_filter(level.as_log().to_level_filter()) {
                tracing::warn!("could not init LogTracer: {:?}", e);
            }
        }

        Ok(Uninstall)
    }

We are running out of ideas on how to resolve the performance degradation after tracing is enabled.
I am happy to provide more details.

Could anyone shed some light?

Thank you very much in advance.

davidbarsky · 2021-07-18T18:00:32Z

Maintainer

Sorry for the delay in responding. Here are my disorganized thoughts:

I have some suspicions as to where the performance issue is occurring, but without data from a profiler, any advice that I can give you will be limited. If it's possible, can you attach a profiler and upload a (partially redacted, if necessary) flamegraph here?
Is CadenceLayer something that y'all wrote yourselves? I haven't been able to find that publicly.
My (probably inaccurate!) gut feeling is that the issue might be either within the CadenceLayer or within the OpenTelemetry stack. A flamegraph would be wildly helpful in helping us understand where the bottlenecks are.

3 replies

siningma Jul 20, 2021
Author

Hi @davidbarsky ,
Thanks for the reply. Yes, you are right. The issue is in the CadenceLayer, which we developed ourselves.
The performance issue went away after we commented out the line .with(CadenceLayer::new()).

davidbarsky Jul 20, 2021
Maintainer

That's interesting to know. Can you share the CadenceLayer, by any chance? I'm curious whether the CadenceLayer is accessing the Registry in a potentially problematic way; it's been an issue for metrics-rs/metrics' tracing integration as well. I might be able to help you figure out how to reduce the impact.

siningma Jul 21, 2021
Author

Hey David,
The CadenceLayer that we wrote was some crappy code. It was very slow due to lock contention on a mutex.
We have rewritten the code and verified the new implementation in production.
Thanks again for your help, David.
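
The CadenceLayer source was never posted in this thread, so the following is only a minimal sketch of one common way to avoid mutex contention when counting active spans: keep the count in an atomic integer that the span-open and span-close callbacks of a layer update, and let a background reporter read it. The names `ActiveSpanGauge`, `span_opened`, `span_closed`, and `simulate` are hypothetical, and the sketch uses only the standard library rather than the real `tracing-subscriber` layer API.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the state a span-counting layer would hold.
/// The real CadenceLayer was never shared, so this is illustrative only.
#[derive(Default)]
struct ActiveSpanGauge {
    active: AtomicI64,
}

impl ActiveSpanGauge {
    /// Would be called from the layer's span-created callback.
    fn span_opened(&self) {
        self.active.fetch_add(1, Ordering::Relaxed);
    }

    /// Would be called from the layer's span-closed callback.
    fn span_closed(&self) {
        self.active.fetch_sub(1, Ordering::Relaxed);
    }

    /// Read by a background reporter (for example, once per second) that
    /// emits the statsd gauge, keeping metric I/O off the request path.
    fn current(&self) -> i64 {
        self.active.load(Ordering::Relaxed)
    }
}

/// Opens and closes `spans_per_thread` spans on each of `threads` threads,
/// then leaves `left_open` spans open per thread. Returns the final gauge.
fn simulate(threads: usize, spans_per_thread: usize, left_open: usize) -> i64 {
    let gauge = Arc::new(ActiveSpanGauge::default());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let gauge = Arc::clone(&gauge);
            thread::spawn(move || {
                for _ in 0..spans_per_thread {
                    gauge.span_opened();
                    gauge.span_closed();
                }
                for _ in 0..left_open {
                    gauge.span_opened();
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    gauge.current()
}

fn main() {
    // 8 threads, 10_000 short-lived spans each, 3 spans still open per thread.
    println!("active spans: {}", simulate(8, 10_000, 3));
}
```

Because each update is a single atomic add or subtract, request threads never wait on one another to update the count, which is the property a mutex-guarded counter loses under load.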

hawkw · 2021-07-19T16:19:06Z

Maintainer

Typically, aggressive sampling is necessary for reducing performance overhead when using distributed tracing. I notice you're setting a sample rate of 0.1%, which should be sufficient. However, the sampling is implemented purely in opentelemetry, so it only determines which spans are actually sent to the OpenTelemetry collector. The tracing crate will currently still perform all the work of actually recording those spans, meaning that they will still introduce some overhead.

We'd like to add better support for sampling in tracing, so that we can entirely avoid recording spans that are not sampled, similar to the way we can currently skip filtered-out spans inexpensively. However, currently, tracing itself is not aware of sampling.

I suspect that baked-in sampling support in tracing will significantly reduce performance impact in this kind of use case.
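
To make the distinction above concrete, here is a rough, standard-library-only sketch of a pipeline in which every span is recorded but only a sampled fraction is exported. It is a simplified model, not the `opentelemetry` or `tracing` implementation: `is_sampled`, `next_id`, `Counts`, and `simulate` are hypothetical names, the trace IDs come from a small deterministic generator, and one span per request is assumed.

```rust
/// Simplified ratio-based sampling decision: keep a trace when its ID falls
/// in the lowest `ratio` fraction of the ID range. Illustrative only; this is
/// not the `opentelemetry` crate's code.
fn is_sampled(trace_id: u64, ratio: f64) -> bool {
    if ratio >= 1.0 {
        return true;
    }
    if ratio <= 0.0 {
        return false;
    }
    let threshold = (ratio * u64::MAX as f64) as u64;
    trace_id < threshold
}

/// Small deterministic generator (SplitMix64) so the sketch needs no crates.
fn next_id(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// How many spans paid the recording cost versus how many were exported.
struct Counts {
    recorded: u64,
    exported: u64,
}

/// Models `requests` requests with one span each (an assumption).
fn simulate(requests: u64, ratio: f64) -> Counts {
    let mut state: u64 = 42;
    let mut counts = Counts { recorded: 0, exported: 0 };
    for _ in 0..requests {
        let trace_id = next_id(&mut state);
        // Recording work happens for every span that passes the level filter,
        // regardless of the sampling decision.
        counts.recorded += 1;
        // The sampler only decides whether the finished span is exported.
        if is_sampled(trace_id, ratio) {
            counts.exported += 1;
        }
    }
    counts
}

fn main() {
    // Roughly one second of peak traffic from the original post at 0.1%.
    let counts = simulate(8_000, 0.001);
    println!("recorded: {}, exported: {}", counts.recorded, counts.exported);
}
```

Under these assumptions, a host at about 8K requests per second exports only around 8 spans per second, yet still pays the recording cost for all 8,000, which is why sampling alone does not remove the overhead.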

1 reply

siningma Jul 20, 2021
Author

Hi @hawkw ,
Thanks for the reply.
David's suggestion was correct. The issue has been solved.
