Impact of Lock Contention
Credit: this very informative analysis was done by Rao Fu.
The performance testing environment is as follows:
- Server: two 8-core CPUs, 72 GB RAM, Intel 82576 NIC, running twemcache forked from https://github.com/twitter/twemcache.
- Load generator: mcperf. Each mcperf instance establishes one persistent TCP connection to the server and a command request is only submitted after the response for the previous request has been received. Multiple mcperf instances run simultaneously on the mesos cluster and there can be at most 8 mcperf instances running on one mesos slave machine.
- Request type: Requests are GET only and the size of all objects is 100 bytes. Each mcperf instance sends 500K GET requests. The cache is warmed up prior to the start of the mcperf instances and all GET requests hit in the cache.
- Latency measurement: all reported latencies are measured on the client side, as the time from when the `send` system call is made to submit the request to when the `epoll` system call returns and indicates the response is ready to be received.
Performance of twemcache with respect to the number of threads.
The graph below shows how twemcache scales with the number of threads. The best performance is observed with eight threads. Compared with four threads (twemcache's default), eight threads yield a 17% improvement in average latency and an 8% improvement in P95 latency, at the cost of roughly 15% worse P999 latency. Increasing the number of threads beyond eight hurts performance.
Performance data on breaking the global cache lock.
twemcache has one global lock (cache_lock) that protects both the hashtable and the LRU lists. It's the primary reason why twemcache does not scale well with the number of threads. The top 10 functions are shown below for both 4 threads and 8 threads.
`__lll_lock_wait` and a second low-level glibc function implement the pthread mutex. Together they account for only 4.2% of the CPU time with 4 threads, but 19.1% with 8 threads. With 16 threads, the two functions take 60% of the CPU time, which wipes out any gain from using more threads.
Top 10 functions for 4 threads
Top 10 functions for 8 threads
To evaluate how much benefit we can get by breaking the global lock, we made a simple code change to make the hashtable lock-free. Since the workload is `GET`-only, twemcache still behaves correctly with this change. As shown in the table below, `__lll_lock_wait` becomes negligible even with 8 threads once the hashtable is lock-free. This results in a 31% reduction in average latency.
`GET` is a relatively "cheap" operation; for other operations such as `SET`, a thread likely spends more time in the critical regions, so the performance gain from breaking the global lock is potentially even larger.
Top 10 functions for 8 threads with the lock-free hashtable
Performance gain from making the hashtable lock free
| top of trunk | 0.343 | 0.065 | 15.473 | 0.146 | 0.570 | 0.787 | 1.200 |