Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERR Protocol error: too big inline request (Redis::CommandError) #1185

Open
Jakanapes opened this issue Mar 16, 2023 · 23 comments
Open

ERR Protocol error: too big inline request (Redis::CommandError) #1185

Jakanapes opened this issue Mar 16, 2023 · 23 comments

Comments

@Jakanapes
Copy link

Redis 5.3.3
Ruby 3.1.3
Rails 7.0.4.2
redis gem 5.0.6
resque gem 2.4.0

Just upgraded the redis/resque gems and their dependencies and am getting this intermittent error ERROR -- : ERR Protocol error: too big inline request (Redis::CommandError)

The backtrace is not super helpful

E, [2023-03-14T14:41:13.543261 #44489] ERROR -- : ERR Protocol error: too big inline request (Redis::CommandError)
.....shared/bundle/ruby/3.1.0/gems/redis-5.0.6/lib/redis/subscribe.rb:52:in `subscription'
.....shared/bundle/ruby/3.1.0/gems/redis-5.0.6/lib/redis/subscribe.rb:17:in `subscribe'
.....shared/bundle/ruby/3.1.0/gems/redis-5.0.6/lib/redis.rb:188:in `_subscription'
.....shared/bundle/ruby/3.1.0/gems/redis-5.0.6/lib/redis/commands/pubsub.rb:17:in `subscribe'
.....shared/bundle/ruby/3.1.0/gems/actioncable-7.0.4.2/lib/action_cable/subscription_adapter/redis.rb:83:in `block in listen'
.....shared/bundle/ruby/3.1.0/gems/redis-client-0.12.1/lib/redis_client.rb:645:in `ensure_connected'
.....shared/bundle/ruby/3.1.0/gems/redis-5.0.6/lib/redis/client.rb:104:in `disable_reconnection'
.....shared/bundle/ruby/3.1.0/gems/redis-5.0.6/lib/redis.rb:79:in `without_reconnect'
.....shared/bundle/ruby/3.1.0/gems/actioncable-7.0.4.2/lib/action_cable/subscription_adapter/redis.rb:80:in `listen'
.....shared/bundle/ruby/3.1.0/gems/actioncable-7.0.4.2/lib/action_cable/subscription_adapter/redis.rb:154:in `block in ensure_listener_running'

seems to indicate that action_cable is involved, but we use it a lot and having trouble tracking it down. I can't reproduce it and it appears on calls that don't trigger any actioncable actions.

The error is generated in redis, which has not changed, but looking at the gem difference, the error being raised in subscribe.rb is new. Am I looking at something that was happening all along, but was being swallowed by earlier versions of the redis gem?

@byroot
Copy link
Collaborator

byroot commented Mar 17, 2023

So this error is thrown by the server here: https://github.com/redis/redis/blob/2dd5c3a180a9892c339b66f76cc203f8b0f6fe5a/src/networking.c#L2112-L2115

And that constant is defined as 64kB:

#define PROTO_INLINE_MAX_SIZE   (1024*64)

So this seem to suggest that somehow the client sends more than 64kB of data without a single newline, that is very fishy.

But somehow that error rings a bell, it feels like I researched this before.

@Jakanapes
Copy link
Author

What would be sending inline, though? Looking at the ActionCable source, it's calling publish in Redis::Commands::Pubsub, I believe. We're trying to eliminate possible causes so we're tracking any AC broadcast messages over 64k, but they don't correspond to the redis protocol errors because they bubble up in a separate thread from the AC listener. Neither AC nor Redis versions have changed, just the redis gem getting bumped up from 4.1.0

@byroot
Copy link
Collaborator

byroot commented Mar 17, 2023

What would be sending inline, though?

Yeah, I don't know, it could be an overly long channel name on something like that. It will be hard to debug without a repro.

I'm not sure what you setup look like, but if you have that flexibility I'd try to add some logging in there

call_v([start, *channels])
to see what can of command is emitted and whether you can figure out a smaller repro?

@Jakanapes
Copy link
Author

So looking at the backtrace, the line in actioncable that ends up at call_v is here

https://github.com/rails/rails/blob/7-0-stable/actioncable/lib/action_cable/subscription_adapter/redis.rb#L92

and it's explicitly sending _action_cable_internal, which is sent all the way through to call_v in the *channels params. Am I reading the code correctly there that the events pulled from client.next_event are a direct result of the call_v command?

I can try to add some logging there, but if that's the case, I don't see how it could be anything other than the hardcoded channel, let along a 64k long channel name

@byroot
Copy link
Collaborator

byroot commented Mar 17, 2023

Hum, you're right.

@Jakanapes
Copy link
Author

actioncable sets the message callback of the Subscription class to

                on.message do |chan, message|
                  broadcast(chan, message)
                end

If I'm reading it right, this block under call_v

      while event = @client.next_event(timeout)
        if event.is_a?(::RedisClient::CommandError)
          raise Client::ERROR_MAPPING.fetch(event.class), event.message
        end

        type, *rest = event
        if callback = sub.callbacks[type]
          callback.call(*rest)
        end
        break if type == stop && rest.last == 0
      end

is blocking while loop in its own thread? So it's likely that it subscribes to the channel, then waits for incoming messages.
Eventually, it gets a message, calls the callback which attempts to broadcast it on the channel, and that is what triggers to the "too big" error, which it picks up on the next loop?

If that's the case, though, we've attempted sending large messages through to trigger the exception with no luck

@byroot
Copy link
Collaborator

byroot commented Mar 17, 2023

Note that it could be a total red herring. When googling for that error, you can see lots of these being resolved in various ways on the server side.

Given that you say the issue started when you upgraded, it makes sense to assume the client doesn't something wrong, but that wrong thing may not be to send a packet too big.

just the redis gem getting bumped up from 4.1.0

Have you tried the latest 4.x? Mostly to try to pinpoint when the issue might have been introduced.

Another good thing to try would be to enable the hiredis bindings, see if that changes anything. That could help track the source of the issue.

@Jakanapes
Copy link
Author

That is a concern, that since the exception being raised is new in this version, that it's something that could have been happening all along and was either being swallowed or surfaced as another error that we haven't noticed is no longer happening.

[1] pry(main)> Redis::Connection.drivers
=> [RedisClient::HiredisConnection]

we're already using the hiredis-client gem.

Is there a way to see if the AC redis instance is using inline? I thought resp was what was always used?

@byroot
Copy link
Collaborator

byroot commented Mar 17, 2023

we're already using the hiredis-client gem.

Then the opposite, try to disable it. It's really just a basic bisection to try to narrow down where the bug could be.

Is there a way to see if the AC redis instance is using inline?

I'm not sure what you are referring to by inline.

@byroot
Copy link
Collaborator

byroot commented Mar 17, 2023

Oh wait, Hum, another thing, is there SSL into the mix? I cant remember nor find where I saw that too big inline request before, but I wonder if it wasn't a client trying to connect to an unsecured server using SSL, and the SSL handshake paquets were tripping the server.

Just a wild guess though.

@Jakanapes
Copy link
Author

It does look like the gem uses RESP2

      def config(**kwargs)
        super(protocol: 2, **kwargs)
      end

The error itself says "inline request" and the comments on the method in the redis source say
"Like processMultibulkBuffer(), but for the inline protocol instead of RESP"
https://github.com/redis/redis/blob/2dd5c3a180a9892c339b66f76cc203f8b0f6fe5a/src/networking.c#L2094-L2117

however, it seems like every client in existence uses resp.

I'll have to check on the SSL issue, but if that was the case, I'd expect it to happen on every message, no?

@byroot
Copy link
Collaborator

byroot commented Mar 17, 2023

Oh wait, this is an intermittent issue?

@byroot
Copy link
Collaborator

byroot commented Mar 17, 2023

however, it seems like every client in existence uses resp.

Yeah, I have no idea what "inline protocol" is supposed to be never heard of it, and I can confirm that redis-rb uses RESP2.

Is there a proxy or something between the client and Redis? Or anything else that isn't just a raw standalone redis-server?

@Jakanapes
Copy link
Author

Oh wait, this is an intermittent issue?

Oh yes, intermittent and extremely elusive. We've had 0 luck in recreating it on purpose. We see the error 6-8 times a day among thousands and thousands of requests. No other issues we can tell on straight redis calls for caching and resque use, just something with action cable.

We use consul to manage the url, but ATM we only have a single redis server in production.

@byroot
Copy link
Collaborator

byroot commented Mar 17, 2023

Hum, that is extremely intriguing. I'll try to find some time to review if it's possible that on 4.x that error was swallowed or retried. That could explain it, e.g. a pre-existing issue with your server, and now we're just reporting it.

@Jakanapes
Copy link
Author

It looks like in the previous version, it looks like the errors were being raised in call_loop in client.rb, but the ProtocolError child class was overwriting initialize

  # Raised by the connection when a protocol error occurs.
  class ProtocolError < BaseError
    def initialize(reply_type)
      super(<<-EOS.gsub(/(?:^|\n)\s*/, " "))
        Got '#{reply_type}' as initial reply byte.
        If you're in a forking environment, such as Unicorn, you need to
        connect to Redis after forking.
      EOS
    end
  end

@Matt-Yorkley
Copy link

I'm also seeing this intermittently after upgrading redis 4.8.1 to redis 5.0.6 and switching out hiredis 0.6.3 to hiredis-client 0.14.1. Same error and backtrace, and it's also rare and difficult to pin down...

@Jakanapes
Copy link
Author

Spent the time after my last attempt slow rolling the upgrade. Redis gem up to 4.8.1 and resque up to 2.5.0 with no issues.

We have since moved from our old redis servers to GCP memorystore, which hosts
redis_version:6.2.11

In addition, there was a rails upgrade that fixed an ActionCable redis config with sentinel
https://github.com/rails/rails/blob/v7.0.6/actioncable/CHANGELOG.md#rails-706-june-29-2023

The final upgrade was swapping out hiredis for hiredis-client and upgrading the redis gem to 5.0.6

Within an hour I had received the error from a very, very small action cable call

@Jakanapes
Copy link
Author

hmm, in fact, I can't find us using ActionCable in the call that Airbrake captures as the source at all. Could be some sneaky metaprogramming somewhere, but I'm not finding it.

@Jakanapes
Copy link
Author

@Matt-Yorkley, did you ever find any cause? I haven't been able to try again in a while, but I'm working on some more logging in the action cable process to see if I can track it down.

@mkllnk
Copy link

mkllnk commented Jan 25, 2024

No, Matt didn't solve the issue for us. I just ran CI again on our latest code and still saw this error popping up. Since Sam claimed that hiredis doesn't bring any performance improvement I decided to remove the gem from our code and just use the default redis gem. Let me know if you come to another conclusion.

@Jakanapes
Copy link
Author

No, Matt didn't solve the issue for us. I just ran CI again on our latest code and still saw this error popping up. Since Sam claimed that hiredis doesn't bring any performance improvement I decided to remove the gem from our code and just use the default redis gem. Let me know if you come to another conclusion.

I added more logging around what gets passed to the redis gem from action cable, but no smoking guns. Still very intermittent and haven't been able to consistently reproduce it.

Are you saying that removing hiredis eliminated the issue?

tiredpixel added a commit to openownership/register-sources-bods that referenced this issue Jan 25, 2024
Redis bulk load commands need to be under 64 KB per line (more or less):

redis/redis-rb#1185

https://github.com/redis/redis/blob/07b292af5e86459e435da5986f87bc4f7415d437/src/server.h#L178

If too many IDs are included in the batch file, the lines are too long,
causing an error:

    ERR Protocol error: too big inline request

Fix by chunking IDs written to lines.

What chunk size to use? 1000 IDs works, and leads to lines of
approximately 23 KB. But Register 2 generates statement IDs of around 20
characters in length, whereas BODS 0.2 and BODS 0.3 says they must be
between 32 and 64 chars (the fact that Register 2 doesn't comply with
the minimum is a separate issue). Multiplying by 20/64 gives a target of
around 20 KB, so reducing to 750 IDs should be comfortable, leading to
lines of around 17 KB (as measured with `wc -c`).
@mkllnk
Copy link

mkllnk commented Jan 26, 2024

Are you saying that removing hiredis eliminated the issue?

Yes, at least in our CI environment. We haven't tested production yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants