
Better handling of `disconnect!`/`close`. #262

Merged: 16 commits into slack-ruby:master from dblock:async-disconnect, Apr 8, 2019

Conversation

@ioquatix (Contributor) commented Apr 5, 2019

Fixes #257

@ioquatix ioquatix force-pushed the dblock:async-disconnect branch from b333f92 to 8ca3fd4 Apr 5, 2019

@dblock (Collaborator) commented Apr 6, 2019

Ok, I'll take this. Care to clean it up? CHANGELOG and rubocop -a ; rubocop --auto-gen-config?

@ioquatix (Contributor, Author) commented Apr 6, 2019

I am taking kids to the park so can you take it from here? Otherwise I can look at it later.

@dblock (Collaborator) commented Apr 6, 2019

👪 > 👨‍💻 👩‍💻

@dblock dblock force-pushed the dblock:async-disconnect branch from 8ca3fd4 to 9f08609 Apr 6, 2019

@dblock (Collaborator) left a comment

Take a look at the updated diff @ioquatix. There were two disconnect! methods; @driver.close is called in Slack::RealTime::Socket#disconnect!, so I made that super.

@@ -43,11 +43,6 @@ def restart_async(client, new_url)
         end
       end
 
-      def disconnect!
-        super
-        @reactor.cancel

@dblock (Collaborator) commented Apr 6, 2019

@ioquatix Do we not need this anymore? Are we leaving the reactor running when we explicitly call disconnect! now?

@ioquatix (Contributor, Author) commented Apr 6, 2019

Let me check...

@dblock (Collaborator) commented Apr 6, 2019

I think we do, seeing this otherwise:

I, [2019-04-06T15:53:21.696868 #251]  INFO -- : Successfully connected team  ...
  1m2s    error: <Async::Task:0x1f3a650 connected to #<Addrinfo: 18.204.248.223:443 TCP (cerberus-xxxx.lb.slack-msgs.com)> failed> [pid=251]
               |   Async::Wrapper::Cancelled: The operation has been cancelled!
               |   → /app/vendor/bundle/ruby/2.6.0/gems/async-1.17.1/lib/async/task.rb:51 in `yield'
               |     /app/vendor/bundle/ruby/2.6.0/gems/async-1.17.1/lib/async/wrapper.rb:211 in `wait_for'
               |     /app/vendor/bundle/ruby/2.6.0/gems/async-1.17.1/lib/async/wrapper.rb:102 in `wait_readable'
               |     /app/vendor/bundle/ruby/2.6.0/gems/async-io-1.20.0/lib/async/io/generic.rb:145 in `async_send'
               |     /app/vendor/bundle/ruby/2.6.0/gems/async-io-1.20.0/lib/async/io/generic.rb:47 in `block in wrap_blocking_method'
               |     /app/vendor/bundle/ruby/2.6.0/gems/async-websocket-0.8.0/lib/async/websocket/connection.rb:60 in `next_event'
               |     /app/vendor/bundle/ruby/2.6.0/bundler/gems/slack-ruby-client-9f0860949c4f/lib/slack/real_time/concurrency/async.rb:68 in `run_loop'
               |     /app/vendor/bundle/ruby/2.6.0/bundler/gems/slack-ruby-client-9f0860949c4f/lib/slack/real_time/concurrency/async.rb:52 in `connect!'
               |     /app/vendor/bundle/ruby/2.6.0/bundler/gems/slack-ruby-client-9f0860949c4f/lib/slack/real_time/client.rb:83 in `run_loop'
               |     /app/vendor/bundle/ruby/2.6.0/bundler/gems/slack-ruby-client-9f0860949c4f/lib/slack/real_time/concurrency/async.rb:30 in `block (3 levels) in start_async'
               |     /app/vendor/bundle/ruby/2.6.0/gems/async-1.17.1/lib/async/task.rb:204 in `block in make_fiber'

Updated.

@dblock dblock force-pushed the dblock:async-disconnect branch from 9f08609 to 8863848 Apr 6, 2019

@dblock (Collaborator) commented Apr 6, 2019

Hm. That didn't do it. Do we just rescue and ignore it?

@dblock dblock force-pushed the dblock:async-disconnect branch from 8863848 to 055b5a7 Apr 6, 2019

@ioquatix (Contributor, Author) commented Apr 6, 2019

Can you add a spec that fails in the way you've shown above? Otherwise it's hard to know exactly what is going on.

@dblock (Collaborator) commented Apr 6, 2019

It happens on every server-side disconnect in a production bot. I tried to write one, but so far I haven't managed to.

@dblock (Collaborator) commented Apr 6, 2019

Note that spec/integration/integration_spec.rb is failing with these changes, notably at :112 with a `fiber called across threads` error.

-while true
+loop do
   subtask.annotate 'client keep-alive'

@dblock (Collaborator) commented Apr 6, 2019

That was the effect of rubocop -a

@ioquatix (Contributor, Author) commented Apr 6, 2019

I know and it's objectively worse.

@dblock (Collaborator) commented Apr 6, 2019

Style/InfiniteLoop:
   Enabled: false

should fix it in rubocop.yml

@ioquatix (Contributor, Author) commented Apr 6, 2019

Okay, I'm taking a look at the integration specs.

@ioquatix (Contributor, Author) commented Apr 7, 2019

@dblock there is something weird going on. I'm just doing my best to analyse how things are working, not trying to blame anyone here, so let's just go through it:

There are several ways a connection can be closed:

  1. The driver receives a close event from the server. This should gracefully close the connection and possibly cause a restart.
  2. The socket stops responding or is otherwise closed.
  3. The client wants to send a close event and gracefully terminate the connection.
  4. The client wants to immediately terminate the connection (i.e. call close).

The problem is I don't know how this is supposed to work in all 4 cases, because the flow control is a bit... odd?

From my POV, the logic is independent of the concurrency model. What I mean by that is: there should be a working implementation between Client, Socket, and Driver.

Let me give you an example of one case that is confusing me.

In the integration specs, stop! is invoked to gracefully shut down the connection (case 3 above). The way I think this should work is:

driver.close # Send shutdown event
socket.close # Close socket file descriptor

Now, around that, what events do you think should be triggered?

driver.emit(:close) won't be triggered because we didn't receive a close frame; we only sent one. In theory, it might be possible to receive a close frame between driver.close and socket.close, but let's ignore that. So, if the driver never emits close, then client.on(:close) is also never triggered, and because of that the server closed item is never added to the integration spec @queue.

So, I tried to think about the situations in which we want to invoke client.callback(:close), i.e. cases 1-4 above. I get the feeling that client.close should be triggered in any situation where the socket is closed, not the driver. But the way it's set up right now, if you try to do that, the callback actually invokes close on itself. That causes infinite recursion.

Basically, client.close eventually invokes socket.close, which eventually invokes client.close. The only way to break this weird cycle is to nullify something... but that just seems like a hack around the broken underlying assumptions being made.

I believe we should still invoke client.callback(:close) but it should not itself call close. I'll try this.
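
For illustration, here is a minimal sketch of one way to break that cycle, with hypothetical class names rather than the gem's actual code: make close idempotent, and have the socket notify the client without the client calling back into close.

# Hypothetical sketch only; names are illustrative, not the library's.
class FakeSocket
  def initialize(driver, io, client)
    @driver = driver
    @io = io
    @client = client
    @closed = false
  end

  def close
    return if @closed # a re-entrant call becomes a no-op, breaking the cycle
    @closed = true

    @driver.close            # send the close frame
    @io.close                # close the underlying file descriptor
    @client.callback(:close) # notify the client; the callback must not call close again
  end
end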

@ioquatix (Contributor, Author) commented Apr 7, 2019

Okay, so that issue was resolved by breaking the recursion.

The next problem is calling client.stop!. IMHO, you can only call this method from the same thread that is handling the connection. Trying to call this between threads is problematic.

@ioquatix (Contributor, Author) commented Apr 7, 2019

There is a race condition in Client#run_loop.

        @socket.connect! do |driver|
          driver.on :open do |event|
            logger.debug("#{self.class}##{__method__}") { event.class.name }
            open(event)
            callback(event, :open)
          end

          driver.on :message do |event|
            logger.debug("#{self.class}##{__method__}") { "#{event.class}, #{event.data}" }
            dispatch(event)
          end

          driver.on :close do |event|
            logger.debug("#{self.class}##{__method__}") { event.class.name }
            callback(event, :close)
            close(event)
            callback(event, :closed)
          end

          # This must be called last to ensure any events are registered before invoking user code.
          @callback.call(driver) if @callback
        end

We shouldn't invoke the @callback until all events are registered, as those events are doing some pretty critical state management (mostly :close).

@ioquatix (Contributor, Author) commented Apr 7, 2019

I'm trying to understand how disconnect! works. In Celluloid, it stops the ping timer. But is that right? Is there any normal flow control that could happen during the execution of socket.close which could hit this and disable the ping timer, even though the connection still continues to run (i.e. the client calls connect! after calling disconnect!)?

Honestly, this state handling is super tricky.

@ioquatix (Contributor, Author) commented Apr 7, 2019

I believe this solves all the problems and more robustly handles disconnects, but you'll probably want to test it. I didn't test the EventMachine or Celluloid implementations. Honestly, I think 99% of the logic should be implemented in the Socket base class, and we should just be implementing the resource management and concurrency model in the sub-class. So I tried to improve that a bit.
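
As a rough sketch of that suggested split (hypothetical class names, not the library's real structure): the base class owns the driver wiring and close/disconnect state, while a concurrency subclass only supplies resource management for its own event loop.

# Hypothetical layering sketch; not the gem's actual classes.
class BaseSocket
  # Shared logic: driver wiring, close/disconnect state, client callbacks.
  def disconnect!
    @driver.close if @driver # ask the driver to send a close frame
    close                    # shared, idempotent close/state handling
  end

  def close
    # shared teardown of the underlying connection goes here
  end
end

class AsyncSocket < BaseSocket
  # The subclass only supplies the concurrency model and resource management,
  # e.g. how its run loop is scheduled and torn down.
  def close
    super
    # backend-specific teardown (stopping tasks, timers, etc.) would go here
  end
end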

ioquatix and others added some commits Apr 7, 2019

@dblock (Collaborator) commented Apr 7, 2019

First, huge thank you @ioquatix. You're clearly following this code a lot better than I am, took me a minute to wrap my head around the changes ;)

I deployed this on a bot and on a server-side disconnect I am still seeing the same stack:

  1m5s    error: <Async::Task:0x2aa6d90 connected to #<Addrinfo: 18.204.248.223:443 TCP (cerberus-xxxx.lb.slack-msgs.com)> failed> [pid=250]
               |   Async::Wrapper::Cancelled: The operation has been cancelled!
               |   → /app/vendor/bundle/ruby/2.6.0/gems/async-1.17.1/lib/async/task.rb:51 in `yield'
               |     /app/vendor/bundle/ruby/2.6.0/gems/async-1.17.1/lib/async/wrapper.rb:211 in `wait_for'
               |     /app/vendor/bundle/ruby/2.6.0/gems/async-1.17.1/lib/async/wrapper.rb:102 in `wait_readable'
               |     /app/vendor/bundle/ruby/2.6.0/gems/async-io-1.20.0/lib/async/io/generic.rb:145 in `async_send'
               |     /app/vendor/bundle/ruby/2.6.0/gems/async-io-1.20.0/lib/async/io/generic.rb:47 in `block in wrap_blocking_method'
               |     /app/vendor/bundle/ruby/2.6.0/gems/async-websocket-0.8.0/lib/async/websocket/connection.rb:60 in `next_event'
               |     /app/vendor/bundle/ruby/2.6.0/bundler/gems/slack-ruby-client-da457b9b8a60/lib/slack/real_time/concurrency/async.rb:95 in `run_loop'
               |     /app/vendor/bundle/ruby/2.6.0/bundler/gems/slack-ruby-client-da457b9b8a60/lib/slack/real_time/concurrency/async.rb:71 in `connect!'
               |     /app/vendor/bundle/ruby/2.6.0/bundler/gems/slack-ruby-client-da457b9b8a60/lib/slack/real_time/client.rb:83 in `run_loop'
               |     /app/vendor/bundle/ruby/2.6.0/bundler/gems/slack-ruby-client-da457b9b8a60/lib/slack/real_time/concurrency/async.rb:48 in `block (2 levels) in start_reactor'
               |     /app/vendor/bundle/ruby/2.6.0/gems/async-1.17.1/lib/async/task.rb:204 in `block in make_fiber'

The reconnect happens correctly afterwards.

This seems to make sense if I understand it correctly, as the client is waiting for a message and gets canceled. Should that just be rescued and ignored? How did we not have this problem before?

@ioquatix (Contributor, Author) commented Apr 7, 2019

First, huge thank you @ioquatix. You're clearly following this code a lot better than I am, took me a minute to wrap my head around the changes ;)

it took almost a day to create this PR, so if it only took you a minute you have one up on me :p

Should that just be rescued and ignored? How did we not have this problem before?

I don't think you were handling the cancellation properly before.

You should probably not change it, because it's actually telling you something about the failure. This is an error condition that's being recovered from, so logging the exception is probably the right thing to do.

The reason this happens is that one task is waiting for a socket to become readable and another task cancels it, e.g.

task = Async do
	socket.read # Will be cancelled
end

task.stop

What this is telling you is that the stream is still connected, but the remote end is no longer communicating.

I would say, if I had to take a guess, that somewhere upstream there is a proxy that is keeping the connection alive even though the remote server has disconnected/timed out.

@dblock (Collaborator) commented Apr 8, 2019

I see. I'll probably want to truncate logs here with a rescue/raise.

... still need to fix specs for Celluloid/EM ... I'll get to it this week unless you beat me to it
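
One possible shape for that rescue/raise, sketched on the assumption that the cancellation surfaces as the Async::Wrapper::Cancelled seen in the stacks above (not necessarily what landed in 3e5b222):

# Sketch: log the cancellation as a single line instead of a full backtrace.
begin
  client.run_loop
rescue Async::Wrapper::Cancelled => e
  logger.warn(e.message) # e.g. "The operation has been cancelled!"
  raise                  # re-raise so the caller can still drive the reconnect
end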

@ioquatix (Contributor, Author) commented Apr 8, 2019

There is zero chance I will be able to spend any more time on this for at least a month lol, sorry. Happy to answer questions.

@dblock dblock force-pushed the dblock:async-disconnect branch from dcf4a85 to 3e5b222 Apr 8, 2019

@dblock (Collaborator) commented Apr 8, 2019

Fixed concurrency specs in 3a9df01 and reduced logger noise (removed the stack) in 3e5b222; seems to behave well in production for now. Will let it cook for a bit and merge.

W, [2019-04-08T11:17:32.919058 #252]  WARN -- <Async::Task:0x1e7e3ec connected to #<Addrinfo: 18.204.248.223:443 TCP (cerberus-xxxx.lb.slack-msgs.com)> running>: The operation has been cancelled!

@dblock dblock force-pushed the dblock:async-disconnect branch from e88fb0c to cf6d601 Apr 8, 2019

dblock added a commit to dblock/code.dblock.org that referenced this pull request Apr 8, 2019

@dblock dblock merged commit 8e2df24 into slack-ruby:master Apr 8, 2019

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)

@dblock dblock deleted the dblock:async-disconnect branch Apr 8, 2019
