Remove unnecessary blocking in stats package. #1587
Conversation
Otherwise it imposes global locking in VTGate.
Not too sure about this. RWMutex is expensive compared to Mutex. I think it pays off only for very long-running operations.
```
BenchmarkLock-12    50000000    20.5 ns/op    // Mutex.Lock(); Mutex.Unlock()
```

RWMutex.RLock costs about the same as Mutex.Lock, but it does not block other goroutines from accessing the critical region.
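A minimal sketch of the kind of microbenchmark being discussed, comparing the uncontended cost of the two lock types (the names and structure here are illustrative, not the original benchmark code):

```go
package stats_test

import (
	"sync"
	"testing"
)

var (
	mu sync.Mutex
	rw sync.RWMutex
)

// BenchmarkMutexLock times an uncontended Lock/Unlock pair.
func BenchmarkMutexLock(b *testing.B) {
	for i := 0; i < b.N; i++ {
		mu.Lock()
		mu.Unlock()
	}
}

// BenchmarkRWMutexRLock times an uncontended RLock/RUnlock pair.
func BenchmarkRWMutexRLock(b *testing.B) {
	for i := 0; i < b.N; i++ {
		rw.RLock()
		rw.RUnlock()
	}
}
```

Run with `go test -bench=.`. With a single goroutine neither lock is ever contended, so this only measures the fixed overhead per operation, not the blocking behavior discussed below.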
In that case, just use Lock. No need to complicate the code. For example, doing read lock, then write lock, then reverifying... Not worth it.
Hmm... I think this should improve 99th-percentile latency, as goroutines competing for the critical region are one of the major causes of high 99th-percentile latency.
Let's talk about this. I still don't see how a 20ns operation can add to tail latency.
It's definitely better. I'll let you guys discuss whether it's worth it.
I like the change, and in particular the clever two-level approach of using a read lock to retrieve the counter from the map and then using the atomic function to update the counter.

Liang, besides the average duration/op, the tail latency numbers for Mutex.Lock(), RWMutex.RLock(), and no locking at all would also be interesting :)

Anthony, what was the number of concurrent goroutines in your benchmark?

Some other ideas on how to further improve this:
Reviewed 3 of 3 files at r1.

go/stats/counters.go, line 20 [r1] (raw file):

go/stats/counters.go, line 70 [r1] (raw file):

go/stats/histogram.go, line 68 [r1] (raw file):
As a consequence, a user can observe a total value that is no longer the sum of all buckets. I'm totally fine with this limitation, but you should document it in MarshalJSON().

go/stats/timings.go, line 20 [r1] (raw file):
This seems to be a recommended Go practice, see https://talks.golang.org/2014/readability.slide#21, i.e.:

```go
totalCount sync2.AtomicInt64
totalTime  sync2.AtomicInt64

mu         sync.RWMutex
histograms map[string]*Histogram

hook func(string, time.Duration)
```

=> the counters are not part of the second group since they are not guarded by that mutex.
I like this change too. I'd like to see the variables protected by a mutex grouped together in the structs, as Michael mentioned; it makes the code much more readable, as it's obvious who is protecting what.
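A sketch of what that grouping might look like for Timings, using the field names from the review above (sync2 is Vitess's sync utility package; the import path and grouping comments are my own assumptions):

```go
package stats

import (
	"sync"
	"time"

	"github.com/youtube/vitess/go/sync2"
)

// Timings with its fields grouped by how they are protected.
type Timings struct {
	// Updated atomically; not guarded by mu.
	totalCount sync2.AtomicInt64
	totalTime  sync2.AtomicInt64

	// mu guards only the fields grouped directly below it.
	mu         sync.RWMutex
	histograms map[string]*Histogram

	hook func(string, time.Duration)
}
```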
My main objection is to using RWMutex for performance improvements.

Finally, @enisoc's p99 benchmarks show that there's no material impact. All his numbers are in the nanoseconds.
For the tail latency test, I used:

Benchmark code: enisoc@679ce12
Anthony, thanks for benchmarking the change. Can you add the benchmark code to the tests? Currently VTGate uses at least one global counter and timing to track QPS and latency. When QPS is high, a global lock causes high 99th-percentile latency.
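For context, the usage pattern in question looks roughly like this (a sketch with illustrative names, assuming the stats constructors of that era of Vitess, not the actual VTGate code):

```go
package vtgate

import (
	"time"

	"github.com/youtube/vitess/go/stats"
)

// Illustrative globals: one counter and one timing shared by every request.
var (
	queryCounts = stats.NewCounters("QueryCounts")
	queryTimes  = stats.NewTimings("QueryTimes")
)

// handleExecute sketches the hot path: every request goroutine touches the
// same two stats objects, so a lock inside Add/Record serializes all requests.
func handleExecute(run func() error) error {
	start := time.Now()
	err := run()
	queryCounts.Add("Execute", 1)
	queryTimes.Record("Execute", start)
	return err
}
```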
I added the benchmarks to this branch.
It seems we need to share the lock on the maps to allow concurrent access.

Before:

After, in most cases:

At the spot marked X, multiple goroutines could be stuck, whereas in the new version they are not. Sugu, am I missing something here? (I agree with you overall that we should not use this pattern until we actually need it.)
Measuring 20 ns per operation, you need something on the order of 1e9/20 = 50M QPS to start running into contention. If measurements show something contradictory, they may be too synthetic or flawed. OTOH, the cost of acquiring a Mutex lock (https://golang.org/src/sync/mutex.go?s=1095:1117#L32) is much cheaper than acquiring one for RWMutex (https://golang.org/src/sync/rwmutex.go?s=2113:2138#L67). In effect, you're doing more work for no additional benefit. If you break the total operations down, what we had before was: Lock, update the map entry, Unlock. Now, that becomes: RLock, read the map entry, RUnlock, atomic add.
The main change is to replace Mutex.Lock with RWMutex.RLock (not RWMutex.Lock). RWMutex.RLock is about as cheap as Mutex.Lock when the mutex is not locked. However, when the mutex is locked, Mutex.Lock becomes much more expensive than RWMutex.RLock.

Originally, for Counters.Add(), other goroutines were blocked not just by the Lock but by the whole critical section: lock, map update, unlock. Now, other goroutines are only blocked by the RLock itself; the map read and the atomic update don't block them. Comparing the two (see the sketch below), other goroutines are blocked for a much shorter time with the new change.
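A sketch of the before/after shapes being described, assuming a `counts map[string]*int64` field so both versions can share one struct (the original pre-change code most likely stored plain int64 values; method names other than Add are mine):

```go
package stats

import (
	"sync"
	"sync/atomic"
)

type Counters struct {
	mu     sync.RWMutex
	counts map[string]*int64
}

// addWithMutex shows the old shape: every Add serializes on the mutex for
// the whole critical section (lock, map update, unlock).
func (c *Counters) addWithMutex(name string, value int64) {
	c.mu.Lock()
	a, ok := c.counts[name]
	if !ok {
		a = new(int64)
		c.counts[name] = a
	}
	*a += value
	c.mu.Unlock()
}

// Add shows the new shape: a read lock fetches the counter's address and an
// atomic add updates it, so concurrent Adds rarely block each other.
func (c *Counters) Add(name string, value int64) {
	c.mu.RLock()
	a, ok := c.counts[name]
	c.mu.RUnlock()
	if !ok {
		// Slow path for a new key: take the write lock and re-check,
		// since another goroutine may have inserted it in the meantime.
		c.mu.Lock()
		if a, ok = c.counts[name]; !ok {
			a = new(int64)
			c.counts[name] = a
		}
		c.mu.Unlock()
	}
	atomic.AddInt64(a, value)
}
```

This is the "read lock, then write lock, then reverifying" pattern mentioned earlier in the thread: the re-check under the write lock is what keeps two concurrent inserts of the same key from clobbering each other.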
Sugu, are you saying the 400ns vs 50,000,000ns difference in p99 is attributable to a flaw in the benchmark? My interpretation of the benchmark results is that the main benefit of RWMutex is that Add() no longer causes the goroutine to yield, because the atomic operations determine that they don't need to wait. This then avoids the problem of goroutines being unfairly scheduled, improving p99 latency of Add().
Had a chat with @guoliang100 (more on that later).

@enisoc I mis-read your benchmarks. Sorry about that. However, I think the benchmark is still synthetic: you observed the 50ms p99 only because you were doing 10M QPS. So @guoliang100 is going to run a new benchmark at 10K QPS and record the p99, which is the more practical use case. I've bet USD 5.00 that there will be no material difference :). However, he has provided me data from previous benchmarks: getting rid of locks in certain parts of the code improved tail latency. After a long brainstorm, we're thinking that locks may have some form of indirect effect on other parts of the system.

As background, my main objection to this change is that lock -> CPU-bound op -> unlock is considered idiomatic Go. It's used everywhere, including in third-party libraries we use. If the performance is unacceptable, we have to file a bug with the Go team instead of changing our code base. Having said that, I've agreed to LGTM this code because of the empirical data. The added condition is that we spend time figuring out the real root cause.

PS: The LGTM is valid even if the new benchmark shows no difference between the two runs: @guoliang100 deserves the latitude because he's trying to solve a difficult problem :).
Here is the benchmark that is supposed to send 10K QPS, with the code snippet showing the two scenarios I tried.

Cmd: go test -bench=. -benchtime=10s

Scenario 1: 1000 goroutines, ~10 QPS per goroutine
(RWMutex.RLock)
(Mutex.Lock)

Scenario 2: 10K goroutines, ~1 QPS per goroutine
(RWMutex.RLock)
(Mutex.Lock)
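A sketch of the kind of harness such scenarios imply (the actual benchmark is in the branch linked earlier; the names, pacing mechanism, and p99 computation here are my own):

```go
package stats_test

import (
	"sort"
	"sync"
	"testing"
	"time"
)

// runScenario paces each goroutine at ~1/interval QPS, times every call to
// add, and returns the observed 99th-percentile latency across all calls.
func runScenario(goroutines, callsPerGoroutine int, interval time.Duration, add func()) time.Duration {
	perG := make([][]time.Duration, goroutines)
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			ticker := time.NewTicker(interval)
			defer ticker.Stop()
			for i := 0; i < callsPerGoroutine; i++ {
				<-ticker.C // wait for this goroutine's next slot
				start := time.Now()
				add()
				perG[g] = append(perG[g], time.Since(start))
			}
		}(g)
	}
	wg.Wait()

	var all []time.Duration
	for _, l := range perG {
		all = append(all, l...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i] < all[j] })
	return all[len(all)*99/100]
}

// Scenario 1: 1000 goroutines at ~10 QPS each, against a Mutex-guarded map.
func TestAddTailLatency(t *testing.T) {
	var mu sync.Mutex
	counts := map[string]int64{}
	p99 := runScenario(1000, 10, 100*time.Millisecond, func() {
		mu.Lock()
		counts["q"]++
		mu.Unlock()
	})
	t.Logf("Mutex p99: %v", p99)
}
```

Scenario 2 would be `runScenario(10000, 10, time.Second, ...)`, and the RWMutex variant swaps the closure for the RLock-plus-atomic version of Add.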
Looks like I missed the party :-)

So first, I'm in support of this change: an RWLock is an RWLock; if you have far more reads than writes, it serves the scenario better. But the benchmark reminds me of something I did before when debugging a concurrency issue with Go's channels. Liang, I think the key here is that the goroutines shouldn't run very long (and in production the Stubby handler goroutines never do). And second, the Go problem that your benchmark surfaces is the same one we discussed with Russ Cox before: the unfairness of Go's mutex.
Very interesting. This is $5 I'll be happy to lose :).
@sougou wrote:
I see your point here. It's not the time per operation that is concerning, but the fact that Mutex.Lock() acts as a global barrier whereas RWMutex.RLock() supports concurrency. The Go runtime probably just amplifies this problem. That means our parallel code now has serial sections. Let me try to illustrate it:

All requests arrive at the same time

Let's assume the worst case: 10k requests hit vtgate simultaneously. In the case of a Mutex.Lock(), all requests have to go through this critical section of 20 ns, one at a time. Now let's also assume that the requests are ordered (like in a queue) and are processed fairly, first-in first-out. (This is a hypothetical example ignoring the actual implementation.)
While req 1 is processed immediately, req 10,000 has to wait 9,999 * 20 ns until it's up, i.e. its latency increases by ~200 microseconds (~0.2 ms). In contrast, with an RLock() the requests don't have to wait for each other. There's still a delay, since the machine doesn't have 10k CPUs and it takes a while to work through them. Nonetheless, the delay decreases with the number of cores.

Uniform distribution of arrival times

Now I thought: it's probably very unlikely that we'll get hit by 10k requests at the same time. Instead, their arrival times are probably distributed uniformly. For example, let's assume that 10k is also our QPS rate. When we divide 1 second by 10,000 arriving requests, the gap between two requests is 100,000 ns.
In this case, 100,000 ns is more than enough to get through the critical section of 20 ns, and there won't be any impact on the tail latency. In practice, I think we cannot assume a uniform distribution, for two reasons:

1. Correlated requests: the average gap is probably shorter, since requests are correlated, e.g. serving one page triggers multiple DB requests.
2. Global barrier synchronizes arrival times over time: what I'm concerned about most is that the global barrier will synchronize the requests over time and turn the uniform distribution of arrival times into one where all requests arrive at the same time. Let's assume that the QPS rate of 10,000 is backed by 10,000 application threads which constantly send requests, one at a time. Let's also assume that the request duration (latency) is very similar and doesn't vary in this thought experiment. Three "rounds" of requests will look like this from vtgate's point of view:
Now let's assume that, for whatever timing reasons, the requests of threads 1 and 2 arrive at the critical section at the same time. From then on, their arrival times are synchronized:
i.e. the gap between them is gone and thread 2 is delayed by thread 1. Over time, more threads may get "synchronized" to the same arrival time and happily wait in front of the critical section together. Eventually, we may end up in the state I described first, where all requests arrive at the same time.

Conclusion: I admit that my examples are synthetic to some extent, but I wouldn't be surprised if we actually see such behaviors in practice. Highly parallel code should have as few serial sections as possible. I also agree with you that we must balance complexity and performance, but I think it's worth experimenting with this improvement :)
I ran the 10K benchmark (10K goroutines with 1 QPS each) with short-lived goroutines, as @guokeno0 suggested.

(Mutex.Lock)
(RWMutex.RLock)
Shall we report this to the internal Go team?