Race condition on PrometheusStatsReceiver.counter #24
Comments
The chaining of two `getOrElseUpdate` calls could result in two separate threads attempting to register the same counter. Widening the scope of the `synchronized` section should remedy this.
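For readers following along, here is a minimal sketch of the racy shape being described. It collapses the two chained lookups into a single map whose default registers the collector; `RacySketch`, `counters`, and `newCounter` are illustrative names, not the library's actual fields:

```scala
import scala.collection.concurrent.TrieMap

import io.prometheus.client.{CollectorRegistry, Counter}

class RacySketch(registry: CollectorRegistry) {
  private val counters = TrieMap.empty[String, Counter]

  // The default is passed by name: two threads that both miss the map
  // can both evaluate newCounter, and the second register call then
  // throws "Collector already registered that provides name: ...".
  def counter(name: String): Counter =
    counters.getOrElseUpdate(name, newCounter(name))

  // Locking only the registration doesn't help: by the time the second
  // thread gets here, it has already committed to building a duplicate.
  private def newCounter(name: String): Counter = synchronized {
    Counter.build().name(name).help(s"Counter for $name").register(registry)
  }
}
```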
Hey @seanbrant – thanks for the clear report! I don't get much of a chance to use this in production, so it's really useful to get things like this. Sorry you've come across this bug, though. I just opened PR #25: could you take a look and see if it looks reasonable to you? I'm also curious if you made any changes to the …
That looks correct to me. Oddly enough, adding a print above https://github.com/samstarling/finagle-prometheus/blob/master/src/test/scala/com/samstarling/prometheusfinagle/PrometheusStatsReceiverRaceTest.scala#L19 caused it to happen, though not consistently. We are seeing it in production consistently when our services get restarted. Thanks for jumping on this so quickly!
Fix race condition issue reported in #24
No problem. I just released a new version with the fix.
Seeing a new error now when my service starts up.
Hey @seanbrant – sorry it's taken me so long to reply to this. Can I ask what version of Finagle you're using? Which version of this library were you upgrading from?
It looks like the `synchronized` for … I've added a patch file with a test for …
@samstarling finagle …
Also wanted to add that we're seeing this too on the latest Finagle.
@coduinix Sorry for neglecting this issue. I've just raised a pull request (#28) that will fix part of the issue (the NPE), and I'll try and get this released ASAP. Aside from that, do you think the `synchronized` for …
Hey @seanbrant: I pushed a variety of fixes recently, but do you still think this race condition exists? I've been hammering the tests today and haven't come across any failures. If you still think it's a problem, let me know – otherwise I'll close this issue. Thanks!
I'm getting the error:

```
java.lang.IllegalArgumentException: Collector already registered that provides name: finagle_my_service_443_connect_latency_ms_count
```

This is my current theory:

1. Thread 1 calls `counters.getOrElseUpdate`, which doesn't have the counter, so it grabs the lock.
2. Thread 2 calls `counters.getOrElseUpdate`, which doesn't have the counter and can't grab the lock, so it blocks.
3. Thread 1 registers the counter and releases the lock; thread 2 then acquires the lock and registers the same counter again, which throws the error above.

I think the lock needs to be around https://github.com/samstarling/finagle-prometheus/blob/master/src/main/scala/com/samstarling/prometheusfinagle/PrometheusStatsReceiver.scala#L39-L41 to prevent this error, as sketched below.
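To illustrate the suggestion, here is the earlier sketch with the lock widened so the lookup and the registration form one critical section; the names are assumptions carried over from that sketch, not the actual `PrometheusStatsReceiver` code:

```scala
import scala.collection.concurrent.TrieMap

import io.prometheus.client.{CollectorRegistry, Counter}

class FixedSketch(registry: CollectorRegistry) {
  private val counters = TrieMap.empty[String, Counter]

  // The lookup and the registration now happen under the same monitor,
  // so a second thread blocks before it can observe the entry as
  // missing and re-register the same collector.
  def counter(name: String): Counter = synchronized {
    counters.getOrElseUpdate(name, newCounter(name))
  }

  private def newCounter(name: String): Counter =
    Counter.build().name(name).help(s"Counter for $name").register(registry)
}
```

The trade-off is that every metric lookup now contends on a single lock; a `computeIfAbsent` on a `ConcurrentHashMap` would give the same evaluate-once guarantee with less contention, at the cost of switching map types.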
FWIW, I was able to reproduce this issue in the `PrometheusStatsReceiverRaceTest`; however, since it's a race condition, it's not consistent.
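For reference, a minimal shape for such a reproduction is below. This is a hypothetical harness against the `RacySketch` from the first sketch, not the actual `PrometheusStatsReceiverRaceTest`, and being a race it only fails intermittently:

```scala
import java.util.concurrent.{CountDownLatch, Executors}

import io.prometheus.client.CollectorRegistry

object RaceRepro extends App {
  val receiver = new RacySketch(new CollectorRegistry)
  val start = new CountDownLatch(1)
  val pool = Executors.newFixedThreadPool(8)

  // Release all threads at once so several of them miss the map lookup
  // for the same metric name at the same time.
  val tasks = (1 to 8).map { _ =>
    pool.submit(new Runnable {
      def run(): Unit = {
        start.await()
        receiver.counter("connect_latency_ms")
      }
    })
  }

  start.countDown()
  // When the race fires, get() throws an ExecutionException wrapping
  // the IllegalArgumentException from the duplicate registration.
  tasks.foreach(_.get())
  pool.shutdown()
}
```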