
Remove unnecessary blocking in stats package. #1587

Merged
merged 4 commits into master from vtgate on Mar 23, 2016

Conversation

guoliang100
Contributor

@alainjobart @sougou

Otherwise it imposes global locking in VTGate.


@sougou
Contributor

sougou commented Mar 21, 2016

Not too sure about this. RWMutex is expensive compared to Mutex. I think it pays off only for very long-running operations.
In cases where we're just copying maps, the regular Mutex should be much faster. However, this is based on benchmarks I ran a few years back. Can you verify first?

@coveralls

Coverage Status

Changes Unknown when pulling 2ced68d on vtgate into master.

@guoliang100
Contributor Author

BenchmarkLock-12 50000000 20.5 ns/op // Mutex.Lock(); Mutex.Unlock()
BenchmarkRLock-12 100000000 20.0 ns/op // RWMutex.RLock(); RWMutex.RUnlock()
BenchmarkWLock-12 30000000 43.4 ns/op // RWMutex.Lock(); RWMutex.Unlock()

RWMutex.RLock costs about the same as Mutex.Lock, but it does not block other goroutines from accessing the critical region.
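
For reference, a minimal sketch of how such lock micro-benchmarks are typically written with the testing package (the exact benchmark file isn't shown in this thread, so the package and variable names are illustrative):

    package stats

    import (
        "sync"
        "testing"
    )

    var (
        mu   sync.Mutex
        rwmu sync.RWMutex
    )

    // BenchmarkLock: uncontended Mutex.Lock(); Mutex.Unlock().
    func BenchmarkLock(b *testing.B) {
        for i := 0; i < b.N; i++ {
            mu.Lock()
            mu.Unlock()
        }
    }

    // BenchmarkRLock: uncontended RWMutex.RLock(); RWMutex.RUnlock().
    func BenchmarkRLock(b *testing.B) {
        for i := 0; i < b.N; i++ {
            rwmu.RLock()
            rwmu.RUnlock()
        }
    }

    // BenchmarkWLock: uncontended RWMutex.Lock(); RWMutex.Unlock().
    func BenchmarkWLock(b *testing.B) {
        for i := 0; i < b.N; i++ {
            rwmu.Lock()
            rwmu.Unlock()
        }
    }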

@sougou
Contributor

sougou commented Mar 22, 2016

In that case, just use Lock. No need to complicate the code. For example, doing read lock, then write lock, then reverifying... Not worth it.

@guoliang100
Contributor Author

Hmm.. I think this should improve 99th percentile latency, since goroutines competing for the critical region are one of the major causes of high 99th percentile latency.

@sougou
Contributor

sougou commented Mar 22, 2016

Let's talk about this. I still don't see how a 20ns operation can add to tail latency.

@enisoc
Member

enisoc commented Mar 22, 2016

It's definitely better. I'll let you guys discuss whether it's worth it.

master branch:

BenchmarkCounters-12            10000000               174 ns/op
--- BENCH: BenchmarkCountersTailLatency-12
        counters_test.go:136: 99th Percentile (for N=5000000): 67.092µs

vtgate branch:

BenchmarkCounters-12            30000000                51.8 ns/op
--- BENCH: BenchmarkCountersTailLatency-12
        counters_test.go:136: 99th Percentile (for N=5000000): 277ns

@michael-berlin
Contributor

I like the change and in particular the clever two-level approach of using a read-lock to retrieve the counter from the map and then using the atomic function to update the counter.

Liang, besides the average duration/op, the tail latency numbers for Mutex.Lock(), RWMutex.RLock() and no locking at all would also be interesting :)

Anthony, what was the number of concurrent Go routines in your benchmark?

Some other ideas for how to improve this further:

  • make the map in Counter immutable and require that all names are registered in the constructor (sounds too impractical to me though)
  • shard the counter (assuming the sync/atomic functions are very fast, this probably doesn't buy you much); a rough sketch of this idea follows below
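
A rough sketch of the sharding idea mentioned above; the shardedCounter type and the random shard selection are illustrative only and not part of this PR:

    package stats

    import (
        "math/rand"
        "sync/atomic"
    )

    // shardedCounter spreads increments over several padded slots so that
    // concurrent Adds rarely contend on the same cache line; Get sums the
    // slots.
    type shardedCounter struct {
        shards [16]paddedInt64
    }

    type paddedInt64 struct {
        n int64
        _ [7]int64 // pad to a full cache line
    }

    func (c *shardedCounter) Add(delta int64) {
        // A real implementation would pick the shard from a per-CPU or
        // per-goroutine hint; a random shard is enough for the sketch.
        atomic.AddInt64(&c.shards[rand.Intn(len(c.shards))].n, delta)
    }

    func (c *shardedCounter) Get() int64 {
        var total int64
        for i := range c.shards {
            total += atomic.LoadInt64(&c.shards[i].n)
        }
        return total
    }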

Reviewed 3 of 3 files at r1.
Review status: all files reviewed at latest revision, 4 unresolved discussions.


go/stats/counters.go, line 20 [r1] (raw file):
Maybe it's worth documenting here that the mutex protects retrieving values from and adding values to the map, but that modifications to the values themselves have to be made using the sync/atomic functions?


go/stats/counters.go, line 70 [r1] (raw file):
Awesome that you considered this :) Maybe add a comment here to explain why you check for existence again.
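
For context, the pattern being discussed is a double-checked lookup; a minimal sketch of the idea, assuming a Counters struct with an RWMutex and a map[string]*int64 (which may differ from the exact code in the diff):

    package stats

    import "sync"

    // Counters maps names to int64 counters. mu protects adding to and
    // reading from the map; the values themselves are updated with the
    // sync/atomic functions.
    type Counters struct {
        mu     sync.RWMutex
        counts map[string]*int64
    }

    // getValueAddr returns the address of the counter for name, creating it
    // if necessary. The common case takes only the read lock. The existence
    // check is repeated under the write lock because another goroutine may
    // have inserted the entry between RUnlock and Lock.
    func (c *Counters) getValueAddr(name string) *int64 {
        c.mu.RLock()
        a, ok := c.counts[name]
        c.mu.RUnlock()
        if ok {
            return a
        }

        c.mu.Lock()
        defer c.mu.Unlock()
        // Check again: the entry may have been added while no lock was held.
        a, ok = c.counts[name]
        if !ok {
            a = new(int64)
            c.counts[name] = a
        }
        return a
    }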


go/stats/histogram.go, line 68 [r1] (raw file):
This breaks the previous transactional relationship between the total count and each bucket.

As a consequence, a user may observe that the total value is no longer the sum of all buckets.

I'm totally fine with this limitation, but you should document it in MarshalJSON().
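
To make the limitation concrete, a sketch of why a reader can observe total != sum(buckets) once buckets and total are updated with separate atomic operations (the type and field names here are illustrative, not necessarily the ones in histogram.go):

    package stats

    import "sync/atomic"

    // histo is an illustrative stand-in for the Histogram type.
    type histo struct {
        cutoffs []int64 // upper bound of each bucket
        buckets []int64 // len(cutoffs)+1 per-bucket counts, updated atomically
        total   int64   // running total, updated atomically
    }

    func (h *histo) Add(value int64) {
        i := 0
        for i < len(h.cutoffs) && value > h.cutoffs[i] {
            i++
        }
        atomic.AddInt64(&h.buckets[i], 1)
        // A snapshot (e.g. MarshalJSON) that runs between these two adds sees
        // the bucket incremented but not yet the total, so the reported total
        // can briefly differ from the sum of the reported buckets.
        atomic.AddInt64(&h.total, 1)
    }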


go/stats/timings.go, line 20 [r1] (raw file):
Can you please group the fields by their mutex?

This seems to be a recommended Go practice, see here: https://talks.golang.org/2014/readability.slide#21

i.e.

totalCount sync2.AtomicInt64
totalTime  sync2.AtomicInt64

mu         sync.RWMutex
histograms map[string]*Histogram
hook       func(string, time.Duration)

=> counters are not part of the second group since they are not guarded by that mutex.
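
A sketch of what the grouped struct could look like, using the field names from the snippet above (the sync2 import path and the exact set of fields are assumptions, not taken from the diff):

    package stats

    import (
        "sync"
        "time"

        "github.com/youtube/vitess/go/sync2"
    )

    // Timings tracks aggregate counts/times plus per-category histograms.
    type Timings struct {
        // Updated atomically; no mutex needed.
        totalCount sync2.AtomicInt64
        totalTime  sync2.AtomicInt64

        // mu guards the fields below it.
        mu         sync.RWMutex
        histograms map[string]*Histogram
        hook       func(string, time.Duration)
    }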



@alainjobart
Contributor

I like this change too. I'd like to see the variables in the structs grouped by the mutex that protects them, as Michael mentioned; it makes the code much more readable, since it's obvious what is protecting what.


Review status: all files reviewed at latest revision, 4 unresolved discussions.



@sougou
Contributor

sougou commented Mar 22, 2016

My main objection is against using RWMutex for performance improvements.
RWMutex should be used for sharing locks. If someone can show me such a need here, I'll buy the arguments.

  • Acquiring and releasing locks multiple times in a function is less readable than doing once at the beginning.
  • RLock introduces more concurrent code paths. It makes it harder to reason about deadlocks and data races. It's an additional burden that was not there before.
  • RLock introduces more rules. You have to know when you can change data and when you can only read. It may look obvious for a single function. But once functions start calling each other, the situation becomes very bad.

Finally, @enisoc's p99 benchmarks show that there's no material impact. All his numbers are in the nanoseconds.

@enisoc
Member

enisoc commented Mar 22, 2016

Anthony, what was the number of concurrent Go routines in your benchmark?

For the tail latency test, I used B.RunParallel() which by default launches GOMAXPROCS goroutines. I set the number of goroutines to 1000 instead and that makes a big difference:

master branch:

BenchmarkCounters-12            10000000               184 ns/op
--- BENCH: BenchmarkCountersTailLatency-12
        counters_test.go:138: 99th Percentile (for N=2000000): 50.776945ms

vtgate branch:

BenchmarkCounters-12            30000000                51.2 ns/op
--- BENCH: BenchmarkCountersTailLatency-12
        counters_test.go:138: 99th Percentile (for N=2000000): 437ns

Benchmark code: enisoc@679ce12

@guoliang100
Contributor Author

Review status: 0 of 3 files reviewed at latest revision, 4 unresolved discussions.


go/stats/counters.go, line 20 [r1] (raw file):
Done.


go/stats/counters.go, line 70 [r1] (raw file):
Done.


go/stats/histogram.go, line 68 [r1] (raw file):
Done.


go/stats/timings.go, line 20 [r1] (raw file):
Done.



@guoliang100
Contributor Author

Anthony, thanks for benchmarking the change. Can you add the benchmark code to the tests?

Currently VTGate uses at least one global counter and timing to track QPS and latency. When QPS is high, a global lock causes high 99th percentile latency.


Review status: 0 of 3 files reviewed at latest revision, 4 unresolved discussions.



@enisoc
Member

enisoc commented Mar 22, 2016

I added the benchmarks to this branch.

@coveralls

Coverage Status

Coverage decreased (-0.006%) to 60.471% when pulling 32d95aa on vtgate into cd63133 on master.

@alainjobart
Contributor

It seems we need to share the lock on the maps to allow concurrent access here?

Before:

  • X - lock the map
  • lookup and change the value
  • unlock the map

After, in most cases:

  • Rlock the map
  • lookup the value
  • Runlock the map
  • Y - use an atomic increment to change the value

At the spot marked X, multiple goroutines could be stuck, whereas that bottleneck disappears in the second case, where the contention at Y resolves much faster and doesn't introduce a big wait with a possibly long tail latency?

Sugu, am I missing something here?

(I agree with you overall, we should not use this pattern until we actually measure a problem, but I think Liang did measure a problem there).


@coveralls

Coverage Status

Coverage increased (+0.03%) to 60.508% when pulling 32d95aa on vtgate into cd63133 on master.

@sougou
Contributor

sougou commented Mar 22, 2016

Measuring 20ns per operation, you need something in the order of 1e9/20 = 50MQPS to start running into contention. If measurements show something contradictory, they may be too synthetic or flawed.

OTOH, acquiring a Mutex lock (https://golang.org/src/sync/mutex.go?s=1095:1117#L32) is much cheaper than acquiring an RWMutex lock (https://golang.org/src/sync/rwmutex.go?s=2113:2138#L67). In effect, you're doing more work for no additional benefit. If you break the total operations down, what we had before was:

atomic inc
map access
atomic dec

Now, that becomes:

atomic inc * 2
map access
atomic dec * 2
atomic access

@guoliang100
Contributor Author

The main change is to replace Mutex.Lock with RWMutex.RLock (not RWMutex.Lock). RWMutex.RLock costs about the same as Mutex.Lock when the lock is uncontended. However, when the lock is contended, Mutex.Lock has to go through a much more expensive slow path, while RWMutex.RLock still lets concurrent readers through.

Originally, for Counters.Add(), other goroutines were blocked not just by the Lock call but for the whole critical section below:

lock
map read
add
map set
unlock

Now, other goroutines are only blocked by the RLock:

rlock
map read
runlock
atomic add

(Note that the entries in the map become effectively constant, so the Lock path in getValueAddr() is not exercised most of the time.)

Comparing the two, with the new change other goroutines are blocked for a shorter time.
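
A self-contained sketch of the two Add paths being compared (the type names lockedCounters and rlockedCounters are made up for this comparison; the real code is in go/stats/counters.go):

    package stats

    import (
        "sync"
        "sync/atomic"
    )

    // lockedCounters: the old shape. Every Add holds the mutex for the whole
    // read-modify-write, so concurrent Adds serialize on the critical section.
    type lockedCounters struct {
        mu     sync.Mutex
        counts map[string]int64
    }

    func (c *lockedCounters) Add(name string, value int64) {
        c.mu.Lock()
        c.counts[name] += value
        c.mu.Unlock()
    }

    // rlockedCounters: the new shape. The read lock is held only for the map
    // lookup; the increment is a lock-free atomic add. The write lock is taken
    // only the first time a name is seen.
    type rlockedCounters struct {
        mu     sync.RWMutex
        counts map[string]*int64
    }

    func (c *rlockedCounters) Add(name string, value int64) {
        c.mu.RLock()
        a, ok := c.counts[name]
        c.mu.RUnlock()
        if !ok {
            c.mu.Lock()
            if a, ok = c.counts[name]; !ok {
                a = new(int64)
                c.counts[name] = a
            }
            c.mu.Unlock()
        }
        atomic.AddInt64(a, value)
    }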


Review status: 0 of 4 files reviewed at latest revision, 4 unresolved discussions.



@enisoc
Member

enisoc commented Mar 22, 2016

Sugu, are you saying the 400ns vs 50,000,000ns difference in p99 is attributable to a flaw in the benchmark?

My interpretation of the benchmark results is that the main benefit of RWMutex is that Add() no longer causes the goroutine to yield, because the atomic operations determine that they don't need to wait. This then avoids the problem of goroutines being unfairly scheduled, improving p99 latency of Add().

@sougou
Contributor

sougou commented Mar 22, 2016

Had a chat with @guoliang100 (more on that later).

@enisoc I misread your benchmarks. Sorry about that. However, I think it's still synthetic. You observed the 50ms p99 only because you were doing 10MQPS. So, @guoliang100 is going to run a new benchmark where he runs 10KQPS and records p99, which is the more practical use case. I've bet USD 5.00 that there will be no material difference :).

However, he provided me with data from previous benchmarks: getting rid of locks in certain parts of the code improved tail latency. After a long brainstorm, we're thinking that locks may have some form of indirect effect on other parts of the system.
TL;DR: We have no clue what the real problem is. I've suggested that we collect more data.

As background, my main objection to this change is that lock -> CPU-bound op -> unlock is considered idiomatic Go. It's used everywhere, including in third-party libraries we use. If the performance is unacceptable, we have to file a bug with the Go team instead of changing our code base.

Having said that, I've agreed to LGTM this code because of empirical data. The added condition is that we spend time figuring out the real root cause.

PS: The LGTM is valid even if the new benchmark shows no difference between the two runs: @guoliang100 deserves the latitude because he's trying to solve a difficult problem :).

@guoliang100
Contributor Author

Here is the benchmark that is supposed to send 10K QPS, with code snippets showing the two scenarios I tried.
Let me know if you think there is something wrong with the benchmark code.

Cmd: go test -bench=. -benchtime=10s
(GOMAXPROCS=12)

Scenario 1: 1000 goroutines, ~10 QPS per goroutine

    b.SetParallelism(1000)
    b.RunParallel(func(pb *testing.PB) {
        var start time.Time
        for pb.Next() {
            start = time.Now()
            benchCounter.Add("c1", 1)
            c <- time.Since(start)
            time.Sleep(time.Duration(rand.Int63n(200)) * time.Millisecond)
        }
    })

(RWMutex.RLock)
BenchmarkCounters-12 300000000 52.3 ns/op
BenchmarkCountersTailLatency-12 2000000 8834 ns/op
--- BENCH: BenchmarkCountersTailLatency-12
counters_test.go:142: 99th Percentile (for N=1): 576ns
counters_test.go:142: 99th Percentile (for N=100): 2.209µs
counters_test.go:142: 99th Percentile (for N=10000): 1.166µs
counters_test.go:142: 99th Percentile (for N=500000): 1.654µs
counters_test.go:142: 99th Percentile (for N=2000000): 1.687µs

(Mutex.Lock)
BenchmarkCounters-12 30000000 447 ns/op
BenchmarkCountersTailLatency-12 2000000 8827 ns/op
--- BENCH: BenchmarkCountersTailLatency-12
counters_test.go:142: 99th Percentile (for N=1): 7.318µs
counters_test.go:142: 99th Percentile (for N=100): 201.303µs
counters_test.go:142: 99th Percentile (for N=10000): 4.285986ms
counters_test.go:142: 99th Percentile (for N=1000000): 144.387µs
counters_test.go:142: 99th Percentile (for N=2000000): 35.092µs

Scenario 2: 10K goroutines, ~1 QPS per goroutine

    b.SetParallelism(10000)
    b.RunParallel(func(pb *testing.PB) {
        var start time.Time
        for pb.Next() {
            start = time.Now()
            benchCounter.Add("c1", 1)
            c <- time.Since(start)
            time.Sleep(time.Duration(rand.Int63n(2000)) * time.Millisecond)
        }
    })

(RWMutex.RLock)
BenchmarkCounters-12 300000000 51.3 ns/op
BenchmarkCountersTailLatency-12 1000000 16142 ns/op
--- BENCH: BenchmarkCountersTailLatency-12
counters_test.go:142: 99th Percentile (for N=1): 643ns
counters_test.go:142: 99th Percentile (for N=20): 966ns
counters_test.go:142: 99th Percentile (for N=200): 828ns
counters_test.go:142: 99th Percentile (for N=2000): 490ns
counters_test.go:142: 99th Percentile (for N=20000): 471ns
counters_test.go:142: 99th Percentile (for N=200000): 1.504µs
counters_test.go:142: 99th Percentile (for N=1000000): 1.799µs

(Mutex.Lock)
BenchmarkCounters-12 30000000 463 ns/op
BenchmarkCountersTailLatency-12 1000000 14763 ns/op
--- BENCH: BenchmarkCountersTailLatency-12
counters_test.go:142: 99th Percentile (for N=1): 5.851µs
counters_test.go:142: 99th Percentile (for N=10): 4.571µs
counters_test.go:142: 99th Percentile (for N=100): 145.61µs
counters_test.go:142: 99th Percentile (for N=1000): 884.191µs
counters_test.go:142: 99th Percentile (for N=10000): 11.210028ms
counters_test.go:142: 99th Percentile (for N=100000): 244.048755ms
counters_test.go:142: 99th Percentile (for N=500000): 199.402598ms
counters_test.go:142: 99th Percentile (for N=1000000): 98.10374ms


Review status: 0 of 4 files reviewed at latest revision, 4 unresolved discussions.



@guokeno0
Contributor

Looks like I missed the party :-)

So firstly, I'm in support of this change: an RWLock is an RWLock; if you have far more reads than writes, it serves the scenario better. But the benchmark reminds me of something I did before when debugging a concurrency issue with Go's channels. Liang, I think the key here is that the goroutines shouldn't run very long (and in production the Stubby handler goroutines never do). And second, the Go problem that your benchmark surfaces is the same problem we discussed with Russ Cox before: the unfairness of Go's mutex.

@sougou
Contributor

sougou commented Mar 23, 2016

Very interesting. This is $5 I'll be happy to lose :).
I think we've found an easily reproducible problem to report to the Go team. Per @guokeno0's comment, it may be the known issue, but let's not make assumptions: let's just report it, point them to the other issue, and they can dup this one if they think it's the same.
We should also mention that this is causing real-life tail latencies.

@michael-berlin
Contributor

@sougou wrote:

Measuring 20ns per operation, you need something in the order of 1e9/20 = 50MQPS to start running
into contention. If measurements show something contradictory, they may be too synthetic or flawed.

I see your point here.

It's not the time per operation which is concerning but the fact that Mutex.Lock() acts as a global barrier whereas RWMutex.RLock() supports concurrency. The Go runtime probably just amplifies this problem. That means our parallel code now has serial sections.

Let me try to illustrate it:

All requests arrive at the same time

Let's assume the worst case that 10k requests hit vtgate simultaneously. In case of a Mutex.Lock() all requests have to go through this critical section of 20 ns, one at a time. Now let's also assume that the requests are ordered (like in a queue) and they are processed fairly first-in first-out. (This is a hypothetical example ignoring the actual implementation.)

req 1
req 2 (arrived after 0 ns)
...
req 10,000 (arrived after 0 ns)

While req 1 is processed immediately, req 10,000 has to wait 9,999 * 20ns until it's up i.e. its latency increases by ~200 microseconds (~0.2ms). In contrast, with an RLock() the requests don't have to wait for each other. There's still a delay since the machine doesn't have 10k CPUs and it takes a while to go through them. Nonetheless, the delay decreases with the number of cores.

Uniform Distribution of Arrival Times

Now I thought: It's probably very unlikely that we'll get hit by 10k requests at the same time. Instead, their arrival time is probably distributed uniformly. For example, let's assume that 10k is also our QPS rate. When we divide 1 second by 10,000 arriving requests, the gap between two requests is 100,000 ns.

req 1
req 2 (arrived after 100,000 ns)
req 3 (arrived after 200,000 ns)

In this case, 100,000 ns are more than enough to go through the critical section of 20 ns and there won't be any impact on the tail latency.

In practice, I think we cannot assume a uniform distribution for two reasons:

1. Correlated requests:

The average gap is probably shorter since requests are correlated e.g. serving one page triggers multiple DB requests.

req 1
req 2 (arrived after 1 ns)
req 3 (arrived after 1 ns)
...
req 101 (arrived after 10,000,000 ns)
req 102 (arrived after 10,000,001 ns)

2. Global barrier synchronizes arrival times over time

What I'm concerned about most is that the global barrier will synchronize the requests over time and turn the uniform distribution of arrival times into one where the requests all arrive at the same time.

Therefore, let's assume that the QPS rate of 10,000 is backed by 10,000 application threads which constantly send requests, each one at a time. Let's also assume that the request duration (latency) is very similar and doesn't vary in this thought experiment. Three "rounds" of requests will look like this from vtgate's point of view:

app thread 1: req 1
app thread 2: req 2 (arrived after 100,000 ns)
app thread 3: req 3 (arrived after 200,000 ns)
...
app thread 1: req 10,001 (arrived after 1 s)
app thread 2: req 10,002 (arrived after 1 s + 100,000 ns)
app thread 3: req 10,003 (arrived after 1 s + 200,000 ns)
...
app thread 1: req 20,001 (arrived after 2 s)
...

Now let's assume for whatever timing reasons, the requests of thread 1 and 2 arrived at the critical section at the same time. From now on their arrival times are synchronized:

app thread 1: req 30,001 (arrived after 3 s)
app thread 2: req 30,002 (arrived after 3 s)
...
app thread 1: req 40,001 (arrived after 4 s)
app thread 2: req 40,002 (arrived after 4 s + 20 ns)

i.e. the gap between them is missing and thread 2 is delayed by thread 1. Over time, more threads may get "synchronized" to the same arrival time and happily wait in front of the critical section together. Eventually, we may end up in the state which I described first where all requests arrive at the same time.

Conclusion:

I admit that my examples are synthetic to some extent. But I wouldn't be surprised if we actually see such behaviors in practice.

Highly parallel code should have as few serial sections as possible.

I also agree with you that we must balance complexity and performance. But I think it's worth experimenting with this improvement :)

@guoliang100
Contributor Author

I ran the 10K benchmark (10K goroutines with ~1 QPS each) with short-lived goroutines, as @guokeno0 suggested.
It still shows huge contention with Mutex.Lock.

b.SetParallelism(10000)
b.RunParallel(func(pb *testing.PB) {
    for pb.Next() {
        go func() {
            // Measure inside the goroutine; a shared start variable
            // would be a data race between the spawned goroutines.
            start := time.Now()
            benchCounter.Add("c1", 1)
            c <- time.Since(start)
        }()
        rTime := time.Duration(rand.Int63n(2000)) * time.Millisecond
        time.Sleep(rTime)
    }
})

(Mutex.Lock)
BenchmarkCounters-12 30000000 455 ns/op
BenchmarkCountersTailLatency-12 1000000 16241 ns/op
--- BENCH: BenchmarkCountersTailLatency-12
counters_test.go:142: 99th Percentile (for N=1): 3.619µs
counters_test.go:142: 99th Percentile (for N=20): 72.917µs
counters_test.go:142: 99th Percentile (for N=200): 166.896µs
counters_test.go:142: 99th Percentile (for N=2000): 1.406406ms
counters_test.go:142: 99th Percentile (for N=20000): 16.559897ms
counters_test.go:142: 99th Percentile (for N=200000): 141.337784ms
counters_test.go:142: 99th Percentile (for N=1000000): 107.917461ms

(RWMutex.RLock)
BenchmarkCounters-12 300000000 49.9 ns/op
BenchmarkCountersTailLatency-12 1000000 16426 ns/op
--- BENCH: BenchmarkCountersTailLatency-12
counters_test.go:142: 99th Percentile (for N=1): 547ns
counters_test.go:142: 99th Percentile (for N=20): 1.577µs
counters_test.go:142: 99th Percentile (for N=200): 1.726µs
counters_test.go:142: 99th Percentile (for N=2000): 783ns
counters_test.go:142: 99th Percentile (for N=20000): 981ns
counters_test.go:142: 99th Percentile (for N=200000): 1.166µs
counters_test.go:142: 99th Percentile (for N=1000000): 1.309µs


Review status: 0 of 4 files reviewed at latest revision, 4 unresolved discussions.



@sougou
Contributor

sougou commented Mar 23, 2016

LGTM

Approved with PullApprove

@coveralls

Coverage Status

Coverage increased (+0.01%) to 60.437% when pulling 6313e3b on vtgate into da3ab78 on master.

@guokeno0
Contributor

LGTM

Approved with PullApprove

@guokeno0
Contributor

Shall we report this to the internal Go team?

@guokeno0 guokeno0 closed this Mar 23, 2016
@guoliang100 guoliang100 reopened this Mar 23, 2016
@guoliang100 guoliang100 merged commit 268fcf9 into master Mar 23, 2016
@coveralls

Coverage Status

Coverage increased (+0.02%) to 60.439% when pulling 6313e3b on vtgate into da3ab78 on master.
