TCPMemcachedNodeImpl Queues overflow #24

Closed
GoogleCodeExporter opened this issue Mar 22, 2016 · 7 comments

Comments

Using spymemcached version 2.0.1.
Runtime environment is: Linux 2.4.21-40.ELsmp #1 SMP Thu Feb 2 22:22:39 EST 2006 i686 i686 i386 GNU/Linux

Using the spymemcached memcached API library, during our high-load tests we've observed all three of the queues in TCPMemcachedNodeImpl overflow, resulting in "Queue full" exceptions. Both the calling MemcachedClient users (our servlet processing threads) and the background spymemcached thread/loop hit this and fail. Once the spymemcached thread breaks, all the queues back up since there is nothing left to process/empty them, and everything using the memcached library breaks.

I've downloaded the source and added some customization to deal with each of the three BlockingQueues. I can contribute the code if you like, Dustin; we don't really WANT to maintain a custom version of your API, so just say so.

- Our fix to prevent overflows for writeQ is rather straightforward: just a small change to the copyInputQueue() method.
- Our fix for readQ is more complex, as the code path(s) to the transitionWriteItem() method are more involved and tied up in the IO channel behavior.
- Our fix for inputQueue is rather simple, but perhaps the most open to debate: nearly all MemcachedClient public methods in our branch can now throw an IOException if there is no room to enqueue on inputQueue, allowing our servlet handlers to decide whether to sleep/ignore/etc. when that condition happens (sketched below).
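
For illustration only, a minimal sketch of that inputQueue behavior (the class and method names here are hypothetical stand-ins, not our actual patch or spymemcached API):

import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration of the fail-fast enqueue behavior described above.
final class FailFastInputQueue<T> {
    private final BlockingQueue<T> inputQueue;

    FailFastInputQueue(int capacity) {
        this.inputQueue = new ArrayBlockingQueue<T>(capacity);
    }

    // offer() returns false instead of blocking when the queue is full, so the
    // caller finds out immediately and can decide to sleep/ignore/etc.
    void enqueue(T op) throws IOException {
        if (!inputQueue.offer(op)) {
            throw new IOException("inputQueue is full; operation rejected");
        }
    }
}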

----

Even with these changes, we still have Queue full issues, which we are now trying to address by creating multiple MemcachedClient objects... not so much because we will have "larger" queues, but because we will have more spymemcached background threads to get scheduled.

We are running in Tomcat, so we have to have a lot of servlet handler threads, and the spymemcached thread is, even on a multicore box, basically getting CPU-starved.

It's currently something like 200 servlet processing threads to 2 spymemcached threads. The reason we have to have so many servlet processing threads is that Tomcat will flat-out reject requests if there isn't an available servlet handler thread, rather than queuing them.

Obviously creating more MemcachedClient objects is not a clean solution, but I thought I'd put the general problem out there for you to noodle on.
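
As a rough sketch of that multi-client workaround (the pooling class is a hypothetical stand-in; only MemcachedClient and AddrUtil are spymemcached API):

import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;

// Hypothetical round-robin pool of clients; not part of spymemcached itself.
final class MemcachedClientPool {
    private final MemcachedClient[] clients;
    private final AtomicInteger next = new AtomicInteger(0);

    MemcachedClientPool(int size, String addresses) throws IOException {
        clients = new MemcachedClient[size];
        for (int i = 0; i < size; i++) {
            // Each client brings its own background IO thread to be scheduled.
            clients[i] = new MemcachedClient(AddrUtil.getAddresses(addresses));
        }
    }

    MemcachedClient get() {
        // Spread callers across clients so no single IO thread is the bottleneck.
        return clients[(next.getAndIncrement() & Integer.MAX_VALUE) % clients.length];
    }
}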

Original issue reported on code.google.com by bfalconn...@gmail.com on 6 May 2008 at 11:03

Well, what I've thought would be the simple solution in the past would be to block when adding to the queue (optionally).  I figured an end user would want one of two behaviors:

1)  Block when enqueuing.
2)  Get an exception and be sent on your way.

You seem to suggest there's a third thing going on, which is to have the IO thread crash.  That's definitely undesirable.

The thread scheduling thing is understandable.  It currently ends up doing two jobs, and one of them is not IO (decoding the results).  I'd really like to push that to the caller more, but previous attempts have resulted in rather ugly code.  Simply having a thread pool the transcoders pass through should speed up throughput and can be implemented completely outside of the library, but I haven't tried it.
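
Conceptually, an outside-the-library pool might look something like this sketch (the class and the decode stand-in are hypothetical, not spymemcached API; the point is just that the IO thread hands off raw bytes and a pool does the decoding):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Conceptual sketch: keep the IO thread to pure IO and hand decoding to a pool.
final class CallerSideDecoder {
    private final ExecutorService decodePool = Executors.newFixedThreadPool(4);

    // 'raw' is whatever undecoded bytes come back from the cache; decode(byte[])
    // is a stand-in for the application's transcoder work.
    Future<Object> decodeLater(final byte[] raw) {
        return decodePool.submit(new Callable<Object>() {
            public Object call() {
                return decode(raw);
            }
        });
    }

    private Object decode(byte[] raw) {
        // Placeholder: a real version would apply the application's transcoder here.
        return new String(raw);
    }
}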

I've got a v3 branch where I'm breaking some compatibility to make room for more of the things people have asked for and the kinds of things I've recognized as being inflexibilities.  I'm hoping I can isolate the IO thread more and avoid decoding data within it, but I haven't got to that part yet.

I'd rather not introduce checked exceptions on the interface, but an unchecked exception for signaling such things wouldn't be too bad.
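
For what it's worth, a minimal sketch of that unchecked-exception idea (the class name is hypothetical, not something in the library):

// Hypothetical unchecked exception: callers that care about queue overflow can
// catch it, but method signatures stay the same and nobody is forced to handle it.
public class QueueOverflowException extends RuntimeException {
    public QueueOverflowException(String message) {
        super(message);
    }
}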

And of course, if people are having to have custom builds of my library, then that's a failure in the library itself, so I do want to make sure that it is flexible enough to match a variety of deployment criteria while not slowing down or growing in complication (which is why I'm having to introduce incompatibilities moving forward).

If you have fixes for issues you're seeing, I'd certainly like to at least see them.

Original comment by dsalli...@gmail.com on 6 May 2008 at 11:28

I'm attaching the latest versions of the files I've changed. I started with the 2.0.1 codebase and haven't integrated back your most recent updates.

When I was first making changes, I started with just blocking the thread that was attempting to enqueue onto the inputQueue. The problem there was that each thread had a dbConnection, so eventually there were so many blocked threads with a DB connection that our dbConnectionPool ran dry.

Now, with more time I could have each thread give up its dbConnection before making an attempt to enqueue onto the inputQueue... but in practice we would rather have the operation fail immediately and take appropriate action (log an error and continue for sets, and return an XML exception message for gets), which is why the code now throws an exception. I figured users of the API client could always sleep and re-attempt the operation if they wanted blocking-type behavior.
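
As a rough sketch of that caller-side retry idea (the CacheClient interface here is a stand-in for our patched client, which throws when inputQueue is full):

import java.io.IOException;

// Illustrative caller-side policy: fail fast by default, but a caller that wants
// blocking-type behavior can sleep and re-attempt a bounded number of times.
final class RetryingSetter {
    // Stand-in for the patched client interface described above.
    interface CacheClient {
        void set(String key, int expiration, String value) throws IOException;
    }

    void setWithRetry(CacheClient client, String key, String value)
            throws IOException, InterruptedException {
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                client.set(key, 3600, value);   // may throw if inputQueue is full
                return;
            } catch (IOException queueFull) {
                Thread.sleep(50);               // back off briefly, then re-attempt
            }
        }
        throw new IOException("giving up: inputQueue stayed full");
    }
}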

I haven't come up with a clean idea for the IO thread scheduling issue, mostly because JVMs don't honor thread priorities very well. In my last project this was really a core problem, and our solution was an intelligent thread scheduler that would give time to the appropriate threads to keep the IO buffers moving along...

That project was entirely written around that concept, and it's not really viable for the amount of time I have for this work.

Original comment by bfalconn...@gmail.com on 12 May 2008 at 7:13

Attachments:

Here is an example of the IO thread getting an unhandled queue full exception:

Exception in thread "Memcached IO over {MemcachedConnection to
xxx.xxx/xx.xx.xx.xx:11211 xxx.xxx/xx.xx.xx.xx:11211 }"
java.lang.IllegalStateException: Queue full
at java.util.AbstractQueue.add(AbstractQueue.java:64)
at java.util.AbstractQueue.addAll(AbstractQueue.java:143)
at net.spy.memcached.MemcachedNodeImpl.copyInputQueue(MemcachedNodeImpl.java:62)
at 
net.spy.memcached.MemcachedConnection.handleInputQueue(MemcachedConnection.java:
210)
at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:141)
at net.spy.memcached.MemcachedClient.run(MemcachedClient.java:715) 

This is the writeQ running out of space. I have an example of the readQ doing the same if you'd like to see :)

Original comment by bfalconn...@gmail.com on 12 May 2008 at 7:19

I didn't much like the IOException thing.  Without that, I got your change down to this:

diff --git a/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java b/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java
index 23ea3a7..6cdf5f3 100644
--- a/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java
+++ b/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java
@@ -61,7 +61,10 @@ public abstract class TCPMemcachedNodeImpl extends SpyObject
         */
        public final void copyInputQueue() {
                Collection<Operation> tmp=new ArrayList<Operation>();
-               inputQueue.drainTo(tmp);
+
+               // don't drain more than we have space to place
+               inputQueue.drainTo(tmp, writeQ.remainingCapacity());
+
                writeQ.addAll(tmp);
        }

@@ -108,7 +111,7 @@ public abstract class TCPMemcachedNodeImpl extends SpyObject
         * @see net.spy.memcached.MemcachedNode#fillWriteBuffer(boolean)
         */
        public final void fillWriteBuffer(boolean optimizeGets) {
-               if(toWrite == 0) {
+               if(toWrite == 0 && readQ.remainingCapacity() > 0) {
                        getWbuf().clear();
                        Operation o=getCurrentWriteOp();
                        while(o != null && toWrite < getWbuf().capacity()) {


I don't have a test that overflows, though.  The stack trace provides some hints as to where it might be possible to cut it off, but the code you sent me looks like it might do a decent job of that itself.

I'll have to play around with some manual tests.

Original comment by dsalli...@gmail.com on 13 May 2008 at 6:06

OK, as of f1969bf1f88b62a71dcc9f392c4c9f0756fcea09 I think I've got more control over this.

Specifically, I've written some tests to contrive queue overflows and worked on getting them to pass.

Please enhance the test if you have another failing case.

Original comment by dsalli...@gmail.com on 14 May 2008 at 5:21

  • Changed state: Fixed

Using 2.1rc2, I get the "Queue full" regularly.  It's a simple load generator,
iterating over a key collection, putting key:value pairs into a memcached.  
When my
collection gets bigger then ~75,000, the exception happens consistently.  
(50,000
works consistently so the threshold is between 50-75K).

I'm running the spy client and the native memcached on separate Sun W2200s.  The OS is Solaris 10u5; Java 1.6.0_02.

I'm happy to apply any patches and make changes on request to run them in my environment.

--clc

Original comment by ccheet...@gmail.com on 22 May 2008 at 7:25

This bug is really more about having the client stop functioning when queues burst internally.  It's OK to fill a queue and be notified of it.  What you don't want is the client falling apart and having to get a new one.

Your case is somewhat unique and I don't optimize for it.  That is, lots of unattended sets is not a normal mode of operation.  If you do want to do that, you can just call waitForQueues or something every once in a while (maybe after every 50k in your example) and things would be better.
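
As a minimal sketch of that load-generator pattern (assuming MemcachedClient.waitForQueues(long, TimeUnit) and set(String, int, Object); the host, data source, and pause interval are placeholders):

import java.util.Map;
import java.util.concurrent.TimeUnit;

import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;

public class BulkLoader {
    public static void main(String[] args) throws Exception {
        // Placeholder address; substitute the real memcached host.
        MemcachedClient client =
            new MemcachedClient(AddrUtil.getAddresses("localhost:11211"));

        int count = 0;
        for (Map.Entry<String, String> e : loadData().entrySet()) {
            client.set(e.getKey(), 0, e.getValue());
            // Every 50k sets, let the internal queues drain before continuing.
            if (++count % 50000 == 0) {
                client.waitForQueues(1, TimeUnit.MINUTES);
            }
        }
        client.shutdown();
    }

    // Stand-in for however the key collection is produced.
    private static Map<String, String> loadData() {
        return java.util.Collections.singletonMap("example-key", "example-value");
    }
}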

Original comment by dsalli...@gmail.com on 22 May 2008 at 9:43
