This repository has been archived by the owner on Oct 31, 2022. It is now read-only.

Race condition on PrometheusStatsReceiver.counter #24

Closed
seanbrant opened this issue Jun 21, 2018 · 10 comments

Comments

@seanbrant

I'm getting the error java.lang.IllegalArgumentException: Collector already registered that provides name: finagle_my_service_443_connect_latency_ms_count.

This is my current theory:

  1. Thread 1 calls counters.getOrElseUpdate, which doesn't have the counter, so it grabs the lock
  2. Thread 2 calls counters.getOrElseUpdate, which doesn't have the counter either and can't grab the lock, so it blocks
  3. Thread 1 registers a new counter and releases the lock
  4. Thread 2 grabs the lock and registers a new counter, which produces the error

I think the lock needs to be around https://github.com/samstarling/finagle-prometheus/blob/master/src/main/scala/com/samstarling/prometheusfinagle/PrometheusStatsReceiver.scala#L39-L41 to prevent this error.
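To illustrate the failure mode (a minimal sketch, not the receiver's actual code; the map and method names are just stand-ins):

```scala
import scala.collection.mutable
import io.prometheus.client.{CollectorRegistry, Counter}

class CounterCache(registry: CollectorRegistry) {
  private val counters = mutable.Map.empty[String, Counter]

  // Racy: getOrElseUpdate on a plain mutable.Map is not atomic, so two threads
  // can both miss the cache and both call register() for the same name; the
  // second registration throws "Collector already registered".
  def counterRacy(name: String): Counter =
    counters.getOrElseUpdate(name, Counter.build().name(name).help(name).register(registry))

  // Widened lock: the lookup and the registration happen under one lock, so a
  // given collector can only be registered once.
  def counterSafe(name: String): Counter = synchronized {
    counters.getOrElseUpdate(name, Counter.build().name(name).help(name).register(registry))
  }
}
```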

FWIW, I was able to reproduce this issue in the PrometheusStatsReceiverRaceTest; however, since it's a race condition, it doesn't fail consistently.

samstarling added a commit that referenced this issue Jun 22, 2018
The chaining of two getOrElseUpdate calls could result in two separate threads
attempting to register the same counter. Widening the scope of the
synchronised section should remedy this.
@samstarling
Owner

Hey @seanbrant – thanks for the clear report! I don't get much of a chance to use this in production, so it's really useful to get things like this. Sorry you've come across this bug, though.

I just opened PR #25: could you take a look and see if it looks reasonable to you? I'm also curious if you made any changes to the PrometheusStatsReceiverRaceTest: it'd be nice to tighten up the testing there a little, so if you came across a way to replicate it more easily then please let me know.

@seanbrant
Author

That looks correct to me. Oddly enough, adding a print statement above https://github.com/samstarling/finagle-prometheus/blob/master/src/test/scala/com/samstarling/prometheusfinagle/PrometheusStatsReceiverRaceTest.scala#L19 caused it to happen, though not consistently. We are seeing it consistently in production when our services get restarted.

Thanks for jumping on this so quickly!

samstarling added a commit that referenced this issue Jun 22, 2018
Fix race condition issue reported in #24
@samstarling
Owner

No problem. I just released 0.0.8, and it's syncing from bintray to Maven Central at the moment – but it can take a little while (hours). I'll leave this issue open for now, but if you get a chance to try it out in production then I'd be really interested to know if this fixes it! 🤞🏼

@seanbrant
Author

Seeing a new error now when my service starts up.

14:49:41.809 [main] ERROR com.twitter.app.LoadService - LoadService: failed to instantiate 'com.samstarling.prometheusfinagle.PrometheusStatsReceiver' for the requested service 'com.twitter.finagle.stats.StatsReceiver'
java.lang.NullPointerException: null
	at com.twitter.util.ProxyTimer.schedulePeriodically(Timer.scala:148)
	at com.twitter.util.Timer.schedule(Timer.scala:56)
	at com.twitter.util.Timer.schedule$(Timer.scala:53)
	at com.twitter.util.ProxyTimer.schedule(Timer.scala:139)
	at com.twitter.util.Timer.schedule(Timer.scala:66)
	at com.twitter.util.Timer.schedule$(Timer.scala:65)
	at com.twitter.util.ProxyTimer.schedule(Timer.scala:139)
	at com.samstarling.prometheusfinagle.PrometheusStatsReceiver.<init>(PrometheusStatsReceiver.scala:24)
	at com.samstarling.prometheusfinagle.PrometheusStatsReceiver.<init>(PrometheusStatsReceiver.scala:13)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.lang.Class.newInstance(Class.java:442)
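
From the trace it looks like the constructor schedules a periodic task on a Finagle timer; roughly this shape (illustrative only, I'm guessing at the code from the stack trace):

```scala
import com.twitter.util.{Duration, Timer}

// Illustrative sketch: when the receiver is instantiated by LoadService via its
// no-arg constructor, the timer it ends up scheduling on can still be a
// ProxyTimer whose underlying timer isn't set, which would explain the NPE.
class PeriodicExporter(timer: Timer) {
  timer.schedule(Duration.fromSeconds(1)) {
    // periodically refresh/export metrics here (placeholder)
  }
}
```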

@samstarling
Owner

Hey @seanbrant – sorry it's taken me so long to reply to this. Can I ask what version of Finagle you're using? Which version of this library were you upgrading from?

@coduinix

coduinix commented Jul 3, 2018

It looks like the synchronized block for stat metrics is not in the right place. The synchronization should wrap the getOrElseUpdate call itself; otherwise the side effect (registering the collector) can still run twice.
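To make the placement concrete (a rough sketch; the map and helper are stand-ins, not the library's actual fields):

```scala
import scala.collection.mutable
import io.prometheus.client.{CollectorRegistry, Histogram}

class StatCache(registry: CollectorRegistry) {
  private val stats = mutable.Map.empty[String, Histogram]

  private def newHistogram(name: String): Histogram =
    Histogram.build().name(name).help(name).register(registry)

  // Still racy: the lookup inside getOrElseUpdate happens outside the lock, so
  // two threads can both miss and both evaluate the default, registering twice.
  def statRacy(name: String): Histogram =
    stats.getOrElseUpdate(name, synchronized(newHistogram(name)))

  // Safe: the whole check-then-register sequence runs under a single lock.
  def statSafe(name: String): Histogram = synchronized {
    stats.getOrElseUpdate(name, newHistogram(name))
  }
}
```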

I've added a patch file with a test for stat which fails every now and then:
stat-test.txt

@seanbrant
Author

@samstarling finagle 18.2.0, finagle-prometheus 0.0.7

@Fluxx

Fluxx commented Jul 6, 2018

Also wanted to add that we're seeing this too on the latest finagle 18.6.0 and finagle-prometheus 0.0.9:

[info] 14:55:18.532 [main] ERROR com.twitter.app.LoadService - LoadService: failed to instantiate 'com.samstarling.prometheusfinagle.PrometheusStatsReceiver' for the requested service 'com.twitter.finagle.stats.StatsReceiver'
[info] java.lang.NullPointerException: null
[info] 	at com.twitter.util.ProxyTimer.schedulePeriodically(Timer.scala:148)
[info] 	at com.twitter.util.Timer$class.schedule(Timer.scala:55)
[info] 	at com.twitter.util.ProxyTimer.schedule(Timer.scala:139)
[info] 	at com.twitter.util.Timer$class.schedule(Timer.scala:66)
[info] 	at com.twitter.util.ProxyTimer.schedule(Timer.scala:139)
[info] 	at com.samstarling.prometheusfinagle.PrometheusStatsReceiver.<init>(PrometheusStatsReceiver.scala:23)
[info] 	at com.samstarling.prometheusfinagle.PrometheusStatsReceiver.<init>(PrometheusStatsReceiver.scala:13)
[info] 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[info] 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[info] 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[info] 	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
[info] 	at java.lang.Class.newInstance(Class.java:442)
[info] 	at com.twitter.app.LoadService$$anonfun$loadImpls$2.apply(LoadService.scala:191)
[info] 	at com.twitter.app.LoadService$$anonfun$loadImpls$2.apply(LoadService.scala:180)

@samstarling
Owner

@coduinix Sorry for neglecting this issue. I've just raised a pull request (#28) that will fix part of the issue (the NPE), and I'll try and get this released ASAP.

Aside from that, do you think the synchronized for stat metrics is still in the wrong place? If so, I'll raise a separate issue for that.

@samstarling
Owner

Hey @seanbrant: I pushed a variety of fixes recently, but do you still think this race condition exists? I've been hammering the tests today and haven't come across any failures. If you still think it's a problem, let me know – otherwise I'll close this issue. Thanks!
