performance regression with Redis protocol parser #488
Comments
This is probably this issue: #487. For my investigation see: #310 (comment)
My service does not use any pub/sub features, just GET/SET/EXISTS commands. #486 looks like it is only concerned with pub/sub commands. Am I missing some sort of more fundamental change made by that PR?
@tdyas What are you using that is using the aio stream? Are you hitting this line: https://github.com/mitsuhiko/redis-rs/blob/master/src/parser.rs#L148
Read this: #310 (comment)
I'm using the Tokio …
Hard to tell from the backtraces I posted above. Unfortunately, they don't have line number info, and the one call that is in the parser code is just listed as …
I see `Framed<...>` and `ValueCodec` in your stacktrace. Are you processing a Stream from redis somehow?
Yes, this is a storage service using a gRPC streaming RPC. The service implements the Google ByteStream gRPC service where both reads and writes are gRPC streaming RPCs. The …
Does …
That's certainly a possibility to look at. The assumption that I've seen in the code is that the response always comes after the request... which is not true in at least one case: when the server sends a message that isn't a response to a request.
Looking at the Redis protocol documentation, the protocol is described as "simple" request-response but with two exceptions: pub/sub and pipelining.
My service does use Redis pipelining in part of its implementation. Maybe relevant?
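For reference, here is a rough sketch of how pipelined commands are typically issued with the redis crate (assuming the `tokio-comp` feature, a local test instance, and made-up key names; this is an illustration, not the code from the service discussed here):

```rust
#[tokio::main]
async fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    // Multiplexed connection driven by the Tokio runtime.
    let mut con = client.get_multiplexed_tokio_connection().await?;

    // The queued commands are written out as one batch; the replies come back
    // in order, so the parser has to pair every reply with its command.
    let (exists, value): (bool, Option<String>) = redis::pipe()
        .cmd("SET").arg("example_key").arg("example_value").ignore()
        .cmd("EXISTS").arg("example_key")
        .cmd("GET").arg("example_key")
        .query_async(&mut con)
        .await?;

    println!("exists = {exists}, value = {value:?}");
    Ok(())
}
```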
Also the …
I think you're definitely on the right track with pipelining.
I would create an MVCE that does what you're doing now (minus the gRPC, etc.) but much faster, so it runs into the issue. We can then throw that into some debugging and find out what the protocol stream is doing.
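A starting point for such an MVCE might look roughly like this: many concurrent tasks hammering a single multiplexed connection with pipelines and single commands. The URL, key names, and iteration counts here are placeholders I picked, not values from this thread:

```rust
#[tokio::main]
async fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    // One multiplexed connection shared by every task, mirroring how a busy
    // service would drive it.
    let con = client.get_multiplexed_tokio_connection().await?;

    let mut tasks = Vec::new();
    for task_id in 0..64u32 {
        let mut con = con.clone();
        tasks.push(tokio::spawn(async move {
            for i in 0..10_000u32 {
                let key = format!("mvce:{task_id}:{i}");
                // Interleave pipelined and single commands to stress the
                // request/response matching in the parser.
                let _: () = redis::pipe()
                    .cmd("SET").arg(&key).arg(i).ignore()
                    .cmd("EXISTS").arg(&key).ignore()
                    .query_async(&mut con)
                    .await?;
                let _: Option<u32> = redis::cmd("GET").arg(&key).query_async(&mut con).await?;
            }
            Ok::<(), redis::RedisError>(())
        }));
    }
    for task in tasks {
        task.await.expect("task panicked")?;
    }
    Ok(())
}
```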
I wrote this branch to throw away unparseable lines. It consumes them so that the parser can continue. It would need to be combined with the #486 PR to not break sub/unsub. See this for why it gets stuck: https://docs.rs/combine/4.5.2/combine/fn.unexpected_any.html
The parser shouldn't hang regardless, but why is the server sending something that is not expected by the protocol in the first place?
For context, the server in this case is AWS ElastiCache for Redis with the redis6 profile.
Is there any recommended way to see what the server is sending in these cases? Probably just run my server with a modified version of the parser to log the data that is causing the issue?
The change I posted (https://github.com/mitsuhiko/redis-rs/compare/master...enzious:fix/parser-infinite-loop?expand=1) will have the offending line returned in the error. You could also change it to print to stdout.
The code as-is doesn't seem to print an error though, so changing it to try and print a line probably won't help.
@Marwes I've gotten it to print out a partial message with the subscription issue I fixed in my other PR.
I think this may be fixed by 306797c, which hasn't been released. The stacktraces point into …
FYI I'm going to try and get out a deploy of my service today with v0.20.1.
0.20.1 didn't fix the issue. The infinite loop occurred again this morning. Stack trace taken via …
Hi, I think I hit this bug too, but for me it seems to have been introduced when I upgraded from 0.20.0 to 0.20.1.
Downgrading to 0.20.0 seems to solve the issue for my usage.
Without a reproduction it is going to be hard to figure out :/
What additional observability would help debug this issue?
Some input that triggers this behavior. There should be some specific input causing this.
From the latest backtrace, it looks like … If not, maybe it would help to add a counter to detect the recursion and then dump the input that caused it?
What makes you think it is reentered? That in and of itself is not a problem, as the multiplexed connection may need to tell multiple requests that the connection is at EOF (has shut down).
There are two calls to …
If you want to try anything, it is easy to set up a git dependency with any patches that you may want. We do not need to do an entire release for things like temporary logging.
Yeah, I am not really sure how to read that trace :/
Right. What would be useful would be any suggestions you may have for where to log (assuming you are more familiar with this code than me).
I guess logging when …
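Picking up the counter suggestion from earlier, instrumentation along these lines could be wrapped around the decode loop to catch the spin and dump whatever input triggered it. This is only a generic sketch: `SpinGuard`, `observe`, and the loop shape are hypothetical and not part of redis-rs.

```rust
/// Hypothetical guard for a decode loop: it counts consecutive iterations
/// that neither produce an item nor consume any bytes, and dumps the
/// pending input once a threshold is exceeded.
struct SpinGuard {
    stalled_iterations: u32,
    limit: u32,
}

impl SpinGuard {
    fn new(limit: u32) -> Self {
        SpinGuard { stalled_iterations: 0, limit }
    }

    /// Called once per iteration with the buffer length before and after the
    /// decode attempt and whether an item was produced.
    fn observe(&mut self, len_before: usize, len_after: usize, produced_item: bool, buf: &[u8]) {
        if produced_item || len_after < len_before {
            // Progress was made, so reset the counter.
            self.stalled_iterations = 0;
            return;
        }
        self.stalled_iterations += 1;
        if self.stalled_iterations >= self.limit {
            // Dump the offending bytes so the input that wedges the parser
            // can be reproduced offline.
            eprintln!(
                "decode loop made no progress for {} iterations; pending input: {:?}",
                self.stalled_iterations,
                String::from_utf8_lossy(buf)
            );
            self.stalled_iterations = 0;
        }
    }
}

fn main() {
    // Simulate a parser that keeps seeing the same undecodable bytes.
    let mut guard = SpinGuard::new(3);
    for _ in 0..5 {
        guard.observe(9, 9, false, b"+PARTIAL\r");
    }
}
```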
Are both of you using the …
Yes, I'm using …
No requests return errors prior to this issue. When the CPU consumption occurs, the connection to Redis is not usable anymore: any request hangs forever without returning an error.
I had two clients with redis-rs go into an infinite parser loop at the same time, so it's getting triggered externally somehow. I put some debug statements in, so I'll see where that goes.
I believe I have found the cause of this; still trying to figure out where the correct place to apply the fix is, however.
The async decoder would return unexpected end of input errors forever instead of `None` to indicate that the stream was done. The `MultiplexedConnection` would then just drop these errors if there were no `in_flight` requests, ending up in an infinite loop inside `poll_read` on shutdown. cc redis-rs#488
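To illustrate the failure mode that commit message describes, here is a simplified model based on my reading of it; it is not the actual redis-rs code, and every name in it is made up:

```rust
// Simplified model: the "decoder" reports EOF as an error instead of a
// terminal "done" signal, and the driver drops errors whenever nothing is
// in flight, so it never stops polling.
enum DecodePoll {
    Item(&'static str),
    Error(&'static str),
    Done,
}

fn poll_stream(eof: bool, fixed: bool) -> DecodePoll {
    if !eof {
        return DecodePoll::Item("+OK");
    }
    if fixed {
        DecodePoll::Done // fixed behaviour: signal that the stream has ended
    } else {
        DecodePoll::Error("unexpected end of input") // buggy behaviour: error forever
    }
}

fn drive(in_flight: usize, fixed: bool) {
    let mut iterations = 0u32;
    loop {
        iterations += 1;
        match poll_stream(true, fixed) {
            DecodePoll::Item(reply) => println!("dispatching {reply} to a waiting request"),
            DecodePoll::Done => {
                println!("stream done after {iterations} iteration(s)");
                break;
            }
            DecodePoll::Error(err) if in_flight > 0 => {
                println!("reporting {err:?} to an in-flight request");
                break;
            }
            DecodePoll::Error(_) => {
                // Nothing in flight, so the error is silently dropped and the
                // loop polls again, getting the same error forever: 100% CPU.
                if iterations > 5 {
                    println!("still spinning after {iterations} iterations (would never stop)");
                    break;
                }
            }
        }
    }
}

fn main() {
    drive(0, false); // buggy path: spins until the demo cut-off
    drive(0, true);  // fixed path: terminates on the first poll
}
```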
Released 0.20.2 with a hopeful fix.
Awesome! Thank you!!
Thank you! I can confirm that 0.20.2 seems to fix the issue (at least for my usage).
I have encountered what I believe to be a performance bug in the Redis protocol parser: thread dumps of my service, which uses the `redis` crate, show execution stuck in the parser. I would like advice on good ways to continue debugging this issue and narrow down a root cause.
Versions
Background
My company is running a service that uses the `redis` crate to access an AWS ElastiCache Redis instance. The service serves requests from its own clients via a Tokio event loop. The service is a cache service and typically uses less than 0.1 vCPU.

About 1-2 times per week, the service suddenly starts consuming all of its CPU limit and then gets throttled by the Linux kernel. This has occurred both with the limit set to 1 vCPU and set to 0.5 vCPU. When this occurs, the service stops responding to requests entirely.
Diagnosis
I took two thread dumps of the same process about 10 minutes apart using `gdb`, and it appears that execution is stuck in the Redis parser.

Backtrace #1:
Note: The first trace was taken before I installed glibc debug symbols onto the node. However, the service runs in a container using a different Linux distribution, so I'm not sure the `malloc` decoding in the second trace is correct.

Backtrace #2:
I would like advice on what data would be useful for narrowing this down.