
Questions about twemproxy and performance #513

Closed
jennyfountain opened this issue Jan 26, 2017 · 19 comments
Comments

@jennyfountain

We are seeing a few issues that I was hoping someone could help me resolve or point me in the right direction.

  1. During high loads, we are seeing a lot of backup in the out_queue_bytes. On normal traffic loads, this is 0.

Example (sometimes goes into 2k/3k range as well):
"out_queue_bytes": 33
"out_queue_bytes": 91
"out_queue_bytes": 29
"out_queue_bytes": 29
"out_queue_bytes": 174

In addition, it shows that our time spent in memcache goes up from 400 ms to 1000-2000 ms. This seriously affects our application.

  2. Auto eject also does not seem to work as expected. A server goes down and our app freaks out, saying it cannot access a memcache server.

Here is an example of a config:

web:
    listen: /var/run/nutcracker/web.sock 0777
    auto_eject_hosts: true
    distribution: ketama
    hash: one_at_a_time
    backlog: 65536
    server_connections: 16
    server_failure_limit: 3
    server_retry_timeout: 30000
    timeout: 2000
    servers:
     - 1.2.3.4:11211:1
     - 1.2.3.5:11211:1
     - 1.2.3.6:11211:1

somaxconn = 128
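One side note on the two numbers above: the kernel silently caps the effective listen() backlog at net.core.somaxconn, so with somaxconn = 128 the backlog: 65536 in the config is truncated to 128. If the larger backlog is actually intended, the cap has to be raised too; a minimal sketch (file name and value are hypothetical):

# /etc/sysctl.d/90-somaxconn.conf (hypothetical)
# Raise the accept-queue cap so listen(backlog=65536) is not clamped to 128.
net.core.somaxconn = 65536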

What we tried that did not help:

  1. Setting mbuf to 512
  2. Varying server_connections from 1 to 200

Thank you for any guidance on this problem.

@jennyfountain
Author

jennyfountain commented Feb 22, 2017

@manjuraj We are seeing issues similar to those referenced in #145.

[screenshot attached: 2017-02-21, 8:50 PM]

Here is our config.

listen: /var/run/nutcracker/our.socket 0777
auto_eject_hosts: true
distribution: ketama
hash: one_at_a_time
backlog: 65536
server_connections: 16
server_failure_limit: 3
server_retry_timeout: 30000
timeout: 2000
servers:
 - x.x.x.x:11211:1
 - x.x.x.x:11211:1
 - x.x.x.x:11211:1

We are seeing a major backup in out_queue and it basically makes our site unusable.

In addition, auto_eject_hosts: true is not working as we thought.

Thanks for any insight or information!
-J

@manjuraj
Collaborator

@jennyfountain - I believe some of the issues can be solved by following the recommendations listed here: https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md

Could you try applying the timeout: parameter and setting server_connections to 1 for your cluster?
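A minimal sketch of the 'web' pool from the first config in this issue with those two suggestions applied (the timeout value here is illustrative only; everything else is copied from that config):

web:
    listen: /var/run/nutcracker/web.sock 0777
    auto_eject_hosts: true
    distribution: ketama
    hash: one_at_a_time
    backlog: 65536
    server_connections: 1        # down from 16, per the suggestion above
    server_failure_limit: 3
    server_retry_timeout: 30000
    timeout: 400                 # illustrative value; pick one that fits your latency budget
    servers:
     - 1.2.3.4:11211:1
     - 1.2.3.5:11211:1
     - 1.2.3.6:11211:1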

@manjuraj
Collaborator

Also, do you have client-side retries? If so, the number of retries must be at least 4 for a request to succeed in the event of a server failure and ejection.
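To make that concrete: with server_failure_limit: 3, the failing server is ejected only after enough consecutive failed attempts, so a client needs several attempts (at least 4, per the comment above) before a request can be rerouted to a healthy server. A hypothetical client-side sketch in PHP (getWithRetries and the attempt count are illustrative, not part of the application code quoted later in this thread):

// Hypothetical sketch: retry a read enough times to outlast
// server_failure_limit (3) plus one attempt after ejection.
function getWithRetries(Memcached $cache, $key, $attempts = 4) {
    for ($i = 0; $i < $attempts; $i++) {
        $value = $cache->get($key);
        // Treat a genuine miss as a final answer; only retry on errors.
        if ($value !== false || $cache->getResultCode() === Memcached::RES_NOTFOUND) {
            return $value;
        }
        usleep(10000); // brief back-off before the next attempt
    }
    return false;
}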

The reason you are noticing a backlog is that memcache is getting more load than expected and taking longer to respond.

How many memcache servers do you have? Across how many physical machines are they distributed, and how much load does each instance get at peak? Do you have p50, p90, and p95 latency numbers?

@jennyfountain
Author

We had set it to 1 and saw the same results. We increased it hoping that would help, but it didn't.

We also set timeout to 2000.

We currently have 29 memcache servers in the pool.

We do not have client side retries. When we went directly to the memcache servers, we did not see this issue.

Listed below is the current config we are using. Could this be a hash issue?

listen: /var/run/nutcracker/our.socket 0777
auto_eject_hosts: true
distribution: ketama
hash: one_at_a_time
backlog: 65536
server_connections: 16
server_failure_limit: 3
server_retry_timeout: 30000
timeout: 2000
servers:
 - x.x.x.x:11211:1
 - x.x.x.x:11211:1
 - x.x.x.x:11211:1

Thank you for your help on this!

@manjuraj
Collaborator

one_at_a_time is not an ideal hash function, but I doubt that is the issue. Also, changing the hash at this point would change the routing of keys to the backend memcache servers (redistribute the sharding). Unless you are bringing up a new memcache cluster, that is not such a good idea.

@jennyfountain
Author

Just curious: what would you suggest as the ideal hash function?

Looking at our configs, does anything stand out? Could this be a socket issue? A timeout issue?

@manjuraj
Collaborator

murmur would be my go-to hash function; fnv1a_64 is good too.
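For a new pool (where the redistribution caveat mentioned earlier does not apply), switching is a one-line change in the pool config; a sketch assuming fnv1a_64:

    hash: fnv1a_64    # or: murmur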

@manjuraj
Collaborator

Your config looks fine.

At high load, is your CPU for twemproxy machines maxed out?

@manjuraj
Collaborator

@jennyfountain
Author

No, CPU/memory look perfect. Our out_queue backs up and time spent in memcache increases from 100 ms to 600 ms.

@jennyfountain
Author

Yes! I will push this and paste in a second.

@manjuraj
Collaborator

Also paste values for the per-server stats defined at https://github.com/twitter/twemproxy/blob/master/src/nc_stats.h#L48, not just out_queue_bytes.
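For reference, one hypothetical way to pull those counters: nutcracker serves its aggregated stats as JSON on its stats TCP port (22222 by default, changeable with --stats-port). A PHP sketch, assuming the default port on localhost and the pool name 'web' from the earlier config:

// Hypothetical sketch: read the JSON stats nutcracker exposes on its stats
// port and print the per-pool / per-server counters.
$fp = stream_socket_client('tcp://127.0.0.1:22222', $errno, $errstr, 1.0);
if ($fp === false) {
    die("stats connect failed: $errstr\n");
}
$stats = json_decode(stream_get_contents($fp), true);
fclose($fp);
// Per-server counters (requests, responses, in_queue, out_queue,
// server_timedout, ...) appear as nested objects under each pool entry.
print_r($stats['web']);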

@jennyfountain
Author

stats.txt (attached)

[screenshot attached: 2017-02-23, 2:16 PM]

I included all of the stats for each server (sanitized :D) during our test.

Thank you so much!

@manjuraj
Collaborator

@jennyfountain I looked at stats.txt and nothing really jumps out :( Have you used tools like mctop or twctop for monitoring or twemperf for load testing?

Also, in the load testing, are you requesting the same key over and over again?

@jennyfountain
Author

I have mctop installed. I will start that during a load test and see if I can spot anything.

Does it matter that I am using memcached (https://memcached.org/), version 1.4.17?

twctop seems to be throwing some errors and won't run for me so I will investigate that more.

thanks!

@manjuraj
Collaborator

manjuraj commented Mar 1, 2017

twctop only works with twemcache

@jennyfountain
Author

jennyfountain commented Mar 28, 2017

We narrowed it down to PHP. No matter what version of PHP, libmemcached, or memcached module, it's about 50% slower than nodejs.

@manjuraj

Here is sample code that I am using as a test. Nodejs and memslap are seeing no issues :(

private function createCacheObject() {
    $this->cache = new Memcached('foo');
    $this->cache->setOption( Memcached::OPT_DISTRIBUTION, Memcached::DISTRIBUTION_CONSISTENT );
    $this->cache->setOption( Memcached::OPT_LIBKETAMA_COMPATIBLE, true );
    $this->cache->setOption( Memcached::OPT_SERIALIZER, Memcached::SERIALIZER_PHP );
    $this->cache->setOption( Memcached::OPT_TCP_NODELAY, false );
    $this->cache->setOption( Memcached::OPT_HASH, Memcached::HASH_MURMUR );
    $this->cache->setOption( Memcached::OPT_SERVER_FAILURE_LIMIT, 256 );
    $this->cache->setOption( Memcached::OPT_COMPRESSION, false );
    $this->cache->setOption( Memcached::OPT_RETRY_TIMEOUT, 1 );
    $this->cache->setOption( Memcached::OPT_CONNECT_TIMEOUT, 1 * 1000 );
    $this->cache->addServers($this->servers[$this->serverConfig]);
}
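One detail worth double-checking in the snippet above: when going through twemproxy, the server list handed to addServers() should point at the local nutcracker socket rather than at the memcache hosts themselves (nutcracker already does the ketama distribution). A hypothetical sketch of that entry, assuming the socket path from the earlier config and php-memcached's convention of addressing a unix socket as the host with port 0:

// Hypothetical sketch: route all traffic through the local nutcracker socket.
$this->servers[$this->serverConfig] = [
    ['/var/run/nutcracker/our.socket', 0],   // unix socket path, port 0
];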

@TysonAndre
Collaborator

I believe this can be closed. I'm working on the same application as jennyfountain and the issue no longer occurs. I'm guessing the original issue was cpu starvation (or maybe some other misconfiguration causing slow syscalls)

After this issue was filed,

  1. The rpm was updated and the strategy used is now significantly
  2. The servers were rebuilt on new generation hardware and peak cpu usage is much lower so cpu starvation is no longer an issue
  3. The application was upgraded to a newer php version
  4. Other bottlenecks/bugs were fixed

@TysonAndre
Collaborator

This application continues to be stable after avoiding cpu starvation and moving to 4/8 nutcracker instances per host to keep nutcracker cpu usage consistently below 100%.
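For anyone landing here with the same symptom: since nutcracker is single-threaded, running one process per core spreads the load. A hypothetical launch sketch (config file names and stats ports are illustrative); each instance needs its own config with a distinct listen socket and its own stats port:

# Hypothetical: four single-threaded nutcracker instances on one host.
nutcracker -d -c /etc/nutcracker/web-0.yml -s 22220
nutcracker -d -c /etc/nutcracker/web-1.yml -s 22221
nutcracker -d -c /etc/nutcracker/web-2.yml -s 22222
nutcracker -d -c /etc/nutcracker/web-3.yml -s 22223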
