
Questions about twemproxy and performance #513

Closed
jennyfountain opened this issue Jan 26, 2017 · 19 comments
Comments

@jennyfountain

We are seeing a few issues that I was hoping someone could help me resolve or point me in the right direction.

  1. During high loads, we are seeing a lot of backup in the out_queue_bytes. On normal traffic loads, this is 0.

Example (sometimes goes into 2k/3k range as well):
"out_queue_bytes": 33
"out_queue_bytes": 91
"out_queue_bytes": 29
"out_queue_bytes": 29
"out_queue_bytes": 174

In addition, it shows that our time spent in memcache goes up from 400 ms to 1000-2000 ms. This seriously affects our application.

  2. Auto eject also does not seem to work as expected. A server goes down and our app freaks out, saying it cannot access a memcache server.

Here is an example of a config:

web:
    listen: /var/run/nutcracker/web.sock 0777
    auto_eject_hosts: true
    distribution: ketama
    hash: one_at_a_time
    backlog: 65536
    server_connections: 16
    server_failure_limit: 3
    server_retry_timeout: 30000
    timeout: 2000
    servers:
     - 1.2.3.4:11211:1
     - 1.2.3.5:11211:1
     - 1.2.3.6:11211:1

somaxconn = 128
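One side note on the two numbers above: the kernel silently caps the effective listen() backlog at net.core.somaxconn, so with somaxconn = 128 the backlog: 65536 in the config is truncated to 128. If the larger backlog is actually intended, the cap has to be raised too; a minimal sketch (file name and value are hypothetical):

# /etc/sysctl.d/90-somaxconn.conf (hypothetical)
# Raise the accept-queue cap so listen(backlog=65536) is not clamped to 128.
net.core.somaxconn = 65536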

What we tried that did not help:

  1. Setting mbuf to 512
  2. Varying server_connections from 1 to 200

Thank you for any guidance on this problem.

@jennyfountain
Author

jennyfountain commented Feb 22, 2017

@manjuraj We are seeing issues similar to those referenced in #145.

[screenshot attached: 2017-02-21, 8:50 PM]

Here is our config.

listen: /var/run/nutcracker/our.socket 0777
auto_eject_hosts: true
distribution: ketama
hash: one_at_a_time
backlog: 65536
server_connections: 16
server_failure_limit: 3
server_retry_timeout: 30000
timeout: 2000
servers:
 - x.x.x.x:11211:1
 - x.x.x.x:11211:1
 - x.x.x.x:11211:1

We are seeing a major backup in out_queue and it basically makes our site unusable.

In addition, auto_eject_hosts: true is not working as we thought.

Thanks for any insight or information!
-J

@manjuraj
Collaborator

@jennyfountain - I believe some of the issues can be solved by following the recommendations listed here: https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md

Could you try applying the timeout: parameter and setting server_connections to 1 for your cluster?
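A minimal sketch of the 'web' pool from the first config in this issue with those two suggestions applied (the timeout value here is illustrative only; everything else is copied from that config):

web:
    listen: /var/run/nutcracker/web.sock 0777
    auto_eject_hosts: true
    distribution: ketama
    hash: one_at_a_time
    backlog: 65536
    server_connections: 1        # down from 16, per the suggestion above
    server_failure_limit: 3
    server_retry_timeout: 30000
    timeout: 400                 # illustrative value; pick one that fits your latency budget
    servers:
     - 1.2.3.4:11211:1
     - 1.2.3.5:11211:1
     - 1.2.3.6:11211:1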

@manjuraj
Collaborator

Also, do you have client-side retries? If so, the number of retries must be at least 4 for a request to succeed in the event of a server failure and ejection.
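To make that concrete: with server_failure_limit: 3, the failing server is ejected only after enough consecutive failed attempts, so a client needs several attempts (at least 4, per the comment above) before a request can be rerouted to a healthy server. A hypothetical client-side sketch in PHP (getWithRetries and the attempt count are illustrative, not part of the application code quoted later in this thread):

// Hypothetical sketch: retry a read enough times to outlast
// server_failure_limit (3) plus one attempt after ejection.
function getWithRetries(Memcached $cache, $key, $attempts = 4) {
    for ($i = 0; $i < $attempts; $i++) {
        $value = $cache->get($key);
        // Treat a genuine miss as a final answer; only retry on errors.
        if ($value !== false || $cache->getResultCode() === Memcached::RES_NOTFOUND) {
            return $value;
        }
        usleep(10000); // brief back-off before the next attempt
    }
    return false;
}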

The reason you are noticing a backlog is that memcache is getting more load than expected and taking longer to respond.

How many memcache servers do you have? Across how many physical machines are they distributed, and how much load does each instance get at peak? Do you have p50, p90, and p95 latency numbers?

@jennyfountain
Author

We had set it to 1 and saw the same results. We increased it hoping that would help, but it didn't.

We also set timeout to 2000.

We currently have 29 memcache servers in the pool.

We do not have client side retries. When we went directly to the memcache servers, we did not see this issue.

Listed below is the current config we are using. Could this be a hash issue?

listen: /var/run/nutcracker/our.socket 0777
auto_eject_hosts: true
distribution: ketama
hash: one_at_a_time
backlog: 65536
server_connections: 16
server_failure_limit: 3
server_retry_timeout: 30000
timeout: 2000
servers:
 - x.x.x.x:11211:1
 - x.x.x.x:11211:1
 - x.x.x.x:11211:1

Thank you for your help on this!

@manjuraj
Collaborator

one_at_a_time is not an ideal hash function, but I doubt that is the issue. Also, changing the hash at this point would change the routing of keys to the backend memcache servers (redistribute the sharding). Unless you are bringing up a new memcache cluster, that is not such a good idea.

@jennyfountain
Author

Just curious: what would you suggest as the ideal hash function?

Looking at our configs, does anything stand out? Could this be a socket issue? A timeout issue?

@manjuraj
Collaborator

murmur would be my go-to hash function; fnv1a_64 is good too.
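For a new pool (where the redistribution caveat mentioned earlier does not apply), switching is a one-line change in the pool config; a sketch assuming fnv1a_64:

    hash: fnv1a_64    # or: murmur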

@manjuraj
Collaborator

Your config looks fine.

At high load, is your CPU for twemproxy machines maxed out?

@manjuraj
Collaborator

@jennyfountain
Author

No, CPU/memory look perfect. Our out_queue backs up and time spent in memcache increases from 100 ms to 600 ms.

@jennyfountain
Author

Yes! I will push this and paste in a second.

@manjuraj
Collaborator

Also paste values for the per-server stats defined at https://github.com/twitter/twemproxy/blob/master/src/nc_stats.h#L48, not just out_queue_bytes.
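For reference, one hypothetical way to pull those counters: nutcracker serves its aggregated stats as JSON on its stats TCP port (22222 by default, changeable with --stats-port). A PHP sketch, assuming the default port on localhost and the pool name 'web' from the earlier config:

// Hypothetical sketch: read the JSON stats nutcracker exposes on its stats
// port and print the per-pool / per-server counters.
$fp = stream_socket_client('tcp://127.0.0.1:22222', $errno, $errstr, 1.0);
if ($fp === false) {
    die("stats connect failed: $errstr\n");
}
$stats = json_decode(stream_get_contents($fp), true);
fclose($fp);
// Per-server counters (requests, responses, in_queue, out_queue,
// server_timedout, ...) appear as nested objects under each pool entry.
print_r($stats['web']);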

@jennyfountain
Author

stats.txt (attached)

[screenshot attached: 2017-02-23, 2:16 PM]

I included all of the stats for each server (sanitized :D) during our test.

Thank you so much!

@manjuraj
Collaborator

@jennyfountain I looked at stats.txt and nothing really jumps out :( Have you used tools like mctop or twctop for monitoring or twemperf for load testing?

Also, in the load testing, are you requesting the same key over and over again?

@jennyfountain
Author

I have mctop installed. I will start that during a load test and see if I can spot anything.

Does it matter that I am using memcached (https://memcached.org/), version 1.4.17?

twctop seems to be throwing some errors and won't run for me so I will investigate that more.

thanks!

@manjuraj
Collaborator

manjuraj commented Mar 1, 2017

twctop only works with twemcache

@jennyfountain
Author

jennyfountain commented Mar 28, 2017

We narrowed it down to PHP. No matter what version of PHP, libmemcached, or memcached module, it's about 50% slower than nodejs.

@manjuraj

Here is sample code that I am using as a test. Nodejs and memslap are seeing no issues :(

private function createCacheObject() {
    $this->cache = new Memcached('foo');
    $this->cache->setOption( Memcached::OPT_DISTRIBUTION, Memcached::DISTRIBUTION_CONSISTENT );
    $this->cache->setOption( Memcached::OPT_LIBKETAMA_COMPATIBLE, true );
    $this->cache->setOption( Memcached::OPT_SERIALIZER, Memcached::SERIALIZER_PHP );
    $this->cache->setOption( Memcached::OPT_TCP_NODELAY, false );
    $this->cache->setOption( Memcached::OPT_HASH, Memcached::HASH_MURMUR );
    $this->cache->setOption( Memcached::OPT_SERVER_FAILURE_LIMIT, 256 );
    $this->cache->setOption( Memcached::OPT_COMPRESSION, false );
    $this->cache->setOption( Memcached::OPT_RETRY_TIMEOUT, 1 );
    $this->cache->setOption( Memcached::OPT_CONNECT_TIMEOUT, 1 * 1000 );
    $this->cache->addServers($this->servers[$this->serverConfig]);
}
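One detail worth double-checking in the snippet above: when going through twemproxy, the server list handed to addServers() should point at the local nutcracker socket rather than at the memcache hosts themselves (nutcracker already does the ketama distribution). A hypothetical sketch of that entry, assuming the socket path from the earlier config and php-memcached's convention of addressing a unix socket as the host with port 0:

// Hypothetical sketch: route all traffic through the local nutcracker socket.
$this->servers[$this->serverConfig] = [
    ['/var/run/nutcracker/our.socket', 0],   // unix socket path, port 0
];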

@TysonAndre
Collaborator

I believe this can be closed. I'm working on the same application as jennyfountain and the issue no longer occurs. I'm guessing the original issue was cpu starvation (or maybe some other misconfiguration causing slow syscalls)

After this issue was filed,

  1. The rpm was updated and the strategy used is now significantly
  2. The servers were rebuilt on new generation hardware and peak cpu usage is much lower so cpu starvation is no longer an issue
  3. The application was upgraded to a newer php version
  4. Other bottlenecks/bugs were fixed

@TysonAndre
Collaborator

This application continues to be stable after avoiding cpu starvation and moving to 4/8 nutcracker instances per host to keep nutcracker cpu usage consistently below 100%.
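For anyone landing here with the same symptom: since nutcracker is single-threaded, running one process per core spreads the load. A hypothetical launch sketch (config file names and stats ports are illustrative); each instance needs its own config with a distinct listen socket and its own stats port:

# Hypothetical: four single-threaded nutcracker instances on one host.
nutcracker -d -c /etc/nutcracker/web-0.yml -s 22220
nutcracker -d -c /etc/nutcracker/web-1.yml -s 22221
nutcracker -d -c /etc/nutcracker/web-2.yml -s 22222
nutcracker -d -c /etc/nutcracker/web-3.yml -s 22223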
