Slack-side disconnects #208

Closed · Kyrremann opened this issue Jun 11, 2018 · 89 comments

@Kyrremann

In the last week my Slack bot has started disconnecting from our Slack team. There are no disconnect logs or stack traces. I've started running the bot in debug mode, and the last thing it logged before it disconnected was the following.

2018-06-08 20:44:59 20:44:59 bot.1  | D, [2018-06-08T20:44:59.795621 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#handle_read: �
2018-06-08 20:44:59 20:44:59 bot.1  | D, [2018-06-08T20:44:59.795799 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#write: ��iP¸
2018-06-08 20:46:59 20:46:59 bot.1  | D, [2018-06-08T20:46:59.795748 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#handle_read: �
2018-06-08 20:46:59 20:46:59 bot.1  | D, [2018-06-08T20:46:59.795976 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#write: ����
2018-06-08 20:48:59 20:48:59 bot.1  | D, [2018-06-08T20:48:59.796764 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#handle_read: �
2018-06-08 20:48:59 20:48:59 bot.1  | D, [2018-06-08T20:48:59.796915 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#write: ����o
2018-06-08 20:50:59 20:50:59 bot.1  | D, [2018-06-08T20:50:59.797632 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#handle_read: �
2018-06-08 20:50:59 20:50:59 bot.1  | D, [2018-06-08T20:50:59.797844 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#write: ���
2018-06-08 20:50:59 20:50:59 bot.1  | �#

I'm not sure how to debug this myself, and I have no idea what is wrong.

My setup runs on Google App Engine, with one process for a Sinatra web app and another process running the Slack bot. Neither process stops, but the bot shows as offline in Slack.

The code for the project is available here: https://github.com/navikt/standbot

Any help would be welcome!

@dblock
Collaborator

dblock commented Jun 11, 2018

I am seeing the same thing on larger production bots. I am going to guess this is Slack-side, but it also looks like we're not seeing or handling the disconnect properly somehow. I would say first let's understand what we expect on a disconnect like this. An exception?

@dblock dblock added the bug? label Jun 11, 2018
@dblock
Collaborator

dblock commented Jun 11, 2018

For Celluloid we want to emit a :close like here. So I would add debug code around there to find out what causes this.
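
For illustration, instrumentation along these lines could confirm whether the read loop ever sees the remote close; the method and variable names below only approximate the gem's Celluloid socket internals and are not its exact code:

# Hypothetical debug patch; run_loop/handle_read/driver approximate
# the real Celluloid socket internals rather than quoting them.
def run_loop
  loop { handle_read }
rescue EOFError, Errno::ECONNRESET, IOError => e
  logger.debug "#{self.class}#run_loop: socket closed (#{e.class})"
  driver.emit(:close) # surface the disconnect so the client can restart
end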

@Kyrremann
Author

Thanks for the reply.
We have a little over 1,100 users in our team, but I don't think the problem started until more of them began using my app. In the beginning it was only 10 users, but now we are just below 50.

One strange thing is that the bot goes down in the evening, when no one is at work.

About what we expect: I guess an exception with a reason would be a good start. I will also reach out to Slack and hear what they say.

@dblock
Collaborator

dblock commented Jun 13, 2018

I stuck a restart into my bots (e.g. dblock/slack-strava@bc29324) and I am seeing about a dozen of these every hour on a bot system with hundreds of connections.

Did you hear back from Slack? Either way, we definitely don't have correct handling code here and the client thinks the connection is still open. Maybe we can reproduce this, or at least attempt to handle it based on the data being received? What that data is is a good question.

@aoberoi

aoberoi commented Jun 13, 2018

hey, there! i'm from slack and i'd like to help you folks get this issue resolved. while i do have some expertise in our RTM API and Websockets (i helped build our node implementation), my Ruby is a little rusty. as a first step, i'd like to reproduce this with minimal steps, in my own app/workspace. if anyone has any shortcuts for me, please share. otherwise i'll update here as i find things.

@alexagranov
Contributor

@dblock @Kyrremann I can also confirm the strange disconnects, but I'm not seeing any issues restarting via the latest master code.

We're not running close to hundreds of teams yet and we're getting to the restart! handler just fine: https://github.com/slack-ruby/slack-ruby-bot-server/blob/master/lib/slack-ruby-bot-server/server.rb#L19

Attaching a log from production with names redacted to protect the innocent. The disconnects start happening around 10 PM UTC.
[attached image: slack_restarts]

@alexagranov
Contributor

Oh whoops, just realized we're in the slack-ruby-client project, not slack-ruby-bot-server. Well, I'll leave my comment in case it's helpful.

@dblock
Collaborator

dblock commented Jun 14, 2018

Hi @aoberoi and thanks for jumping in!

I think the easiest thing is to run a full-blown bot. Alternatively, we can write some code to open hundreds of RTM connections and see what happens. Here's a full bot:

Clone https://github.com/slack-ruby/slack-shellbot.
Create a Slack app and create a .env file with the following.

SLACK_CLIENT_ID=...
SLACK_CLIENT_SECRET=...

Run bundle install, make sure you have a local MongoDB running, then foreman start. This will start a local instance of the bot. Navigate to http://localhost:5000 and you can install the bot into any team. Try DMing ls or help to the test bot and it should respond.

Registration creates a Team object. I think this happens more often with many connections, so I would just add this to app.rb:

team = Team.first
100.times do
  SlackShellbot::Service.instance.start!(team)
end

You should get 100 instances of the bot running.

@dblock
Collaborator

dblock commented Jun 14, 2018

Here's a recent down log from a bot with 297 teams.

W, [2018-06-13T22:08:38.463423 #212]  WARN -- : DOWN: game=pong, name=, id=T02BTFT45
W, [2018-06-13T22:08:39.509848 #212]  WARN -- : DOWN: game=pong, name=, id=T5ATQ5K6W
W, [2018-06-13T22:08:47.103501 #212]  WARN -- : DOWN: game=pong, name=, id=T04LD63KU
W, [2018-06-13T22:08:47.571911 #212]  WARN -- : DOWN: game=pong, name=, id=T6GF0T8UT
W, [2018-06-13T22:08:54.902196 #212]  WARN -- : DOWN: game=pong, name=, id=TAV9ZRUBY
W, [2018-06-13T22:09:04.200495 #212]  WARN -- : DOWN: game=tic-tac-toe, name=, id=TB1QA85PH
W, [2018-06-13T22:09:06.035861 #212]  WARN -- : DOWN: game=pong, name=, id=T024XAHD7
W, [2018-06-13T22:09:06.789929 #212]  WARN -- : DOWN: game=pong, name=, id=T0JT3RRV2
W, [2018-06-13T22:09:13.165823 #212]  WARN -- : DOWN: game=pong, name=, id=T06N1SCTB
I, [2018-06-13T22:09:38.589368 #212]  INFO -- : RESTART: game=pong, name=, id=T02BTFT45
I, [2018-06-13T22:09:38.589496 #212]  INFO -- : Starting team game=pong, name=, id=T02BTFT45.
...

I left team IDs in there for you @aoberoi in case you have logs to look at on your side. These all went offline in the 5 minutes preceding this log.

Note that we also see successful disconnects/restarts.

I, [2018-06-14T00:10:29.391865 #212]  INFO -- : T02990GNF: socket closed, restarting ...
I, [2018-06-14T00:10:29.586615 #212]  INFO -- : Successfully connected team  (T02990GNF) to https://....slack.com.

@dblock
Collaborator

dblock commented Jun 14, 2018

If you want to add more logging to the code that handles disconnects, put it here.

@dblock dblock changed the title from "Strange and silent disconnects" to "Slack-side disconnects" Jun 14, 2018
@Kyrremann
Author

@dblock I forgot to contact the Slack team, but since @aoberoi joined the issue, I'll wait.

I haven't had time to work on this any more, but the bot has now been running since Monday. Also, I'm only running one team.

@andrewhood125

Any luck with this? Our bot disconnects daily. It's an important piece of the infrastructure we use to coordinate deployments.

@jhawksley

My bot disconnects (silently) about once every three days, on a team with only a handful of users. Not sure if that helps, but at least it's a data point.

@Kyrremann
Author

I sent a support ticket to the Slack team this morning. I'll let you know when I hear something. I'm afraid we won't get much help, as it's kind of outside their scope to support us (maybe?).

@dblock
Collaborator

dblock commented Jun 29, 2018

If someone understands what's being sent on the wire, or wants to try to find out, that would be helpful. In the end, the websocket library simply doesn't see the disconnect.

@ioquatix
Contributor

I'm not really able to help with this issue directly, but there have been some rumblings about using async for concurrency, so I thought I'd pass along what I've done in the past to make this work.

https://gist.github.com/ioquatix/c24f107e2cc7f48e571a37e8e93b0cda?ts=2

Just FYI. Yes, I know it's a different gem, but it can't be all that different.

Kyrremann added a commit to navikt/standbot that referenced this issue Jul 2, 2018
@aoberoi

aoberoi commented Jul 3, 2018

I'm sorry I've been absent from this issue for so long. I was traveling for Slack over the last two weeks and hadn't had any time to focus on what was going on here.

Notes:

  • I tracked down logs on some of the specific disconnects that @dblock provided. From our logs, there was nothing abnormal.

  • In that sample, we observed 2-3 disconnects a day per bot user, which is slightly more than usual but not extraordinary. RTM clients are expected to gracefully handle disconnects by reconnecting, since the server may choose to close connections at any time when it's load balancing or updating.

  • On the days this was reported, Slack was experiencing some connectivity issues due to upstream AWS problems, but no degraded service. This is a potential explanation for the increased rate of disconnects.

  • For some of the disconnects, the server recorded a cause=inactivity, which is the result of our servers not receiving a pong for a ping sent to the websocket client (see the sketch after these notes).
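
To make that ping/pong contract concrete, here is what travels over the websocket, per Slack's RTM documentation; the ws handle is illustrative, not part of this gem's API:

require 'json'

# The client sends a ping frame with an id; the server echoes the id
# back in a pong's reply_to. `ws` stands in for any raw websocket handle.
ws.send(JSON.generate(type: 'ping', id: 42))
# Expected reply from the server:
#   {"type":"pong","reply_to":42}
# If the server's own pings to the client go unanswered, it closes the
# connection with cause=inactivity, as described above.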

Action Items:

  • I'll keep trying to reproduce this issue on my side (still have not observed it). I'd like to get a better understanding of how this library deals with routine disconnects (as described above).

  • I get the impression that the underlying library is unaware of a disconnect until it attempts to send a message and fails; this is less than ideal. I'd like to help improve that situation if possible. Even if we cannot know until a message is sent, it would be better to queue the message, initiate a reconnect, sync upon reconnect, and then resume sending the message.

  • I'd also like to verify the ping/pong behavior, since there's a non-zero number of inactivity disconnects.

Questions:

  • Is anyone experiencing a much larger volume of disconnects (more than 5 a day per bot user)?

  • Are many disconnects (either across bot users or with the same bot user) occurring in small windows of time? For example, three disconnects at the exact same time and then none for the rest of the day.

What can you do?

  • If you can answer yes to any of the questions above, or have any new information, please send me the timestamp of your disconnects and the bot user ID and team ID you are connecting to. If you don't want to share that information here, you can email me at ankur@slack-corp.com

  • If you can assist with any of the action items above, that would speed things up. I'm not a ruby expert so this would take me some time.

@Kyrremann
Author

I can't answer yes to any of the questions, but I was curious whether other people were also running on Google App Engine?

@dblock
Collaborator

dblock commented Jul 10, 2018

My bots are all on DigitalOcean.

@jhawksley

I also can't answer yes, unfortunately; my bot runs locally here on-site (Ubuntu).

@olingern

Just wanted to chime in and say I'm having the same problem, and am on Heroku.

My logs show that the bot is still logging conversations in the slack channel, but the Slack desktop client does not show its presence and the bot does not respond to commands.

Restarting the bot does fix the issue.

@Kyrremann
Author

Not sure if this is a step forward, but I added a ping function that tries to stop and restart my client when it is offline from Slack.

Sadly that didn't work, as I ended up with a DeadActorError from Celluloid. The line that crashed was instance.stop!, which makes me guess that Celluloid already knows the bot is dead. I'm going to change the code to remove the .stop! and see if it works with only start_async (see the sketch after the stack trace below).

The code I used is this:

get '/bot_presence' do
  client = ::Slack::Web::Client.new(token: ENV['SLACK_API_TOKEN'])
  status = client.users_getPresence(user: ENV['SLACK_BOT_ID'])
  if status['presence'] == 'away'
    # Slack thinks the bot is gone; bounce the RTM connection.
    instance = Standbot::Bot.instance
    instance.stop!
    instance.start_async
    json(message: 'Restarting bot successfully')
    return
  end
  json(message: status)
end

And the stack trace is this:

2018-07-18 06:31:00 default[20180716t093850]  "GET /api/v2/bot_presence" 500
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 2018-07-18 06:36:00 - Celluloid::DeadActorError - attempted to call a dead actor: connected?:
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/celluloid-0.17.3/lib/celluloid/proxy/sync.rb:9:in `method_missing'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/slack-ruby-client-0.11.1/lib/slack/real_time/client.rb:68:in `started?'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/slack-ruby-client-0.11.1/lib/slack/real_time/client.rb:63:in `stop!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/slack-ruby-bot-0.11.1/lib/slack-ruby-bot/server.rb:48:in `stop!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/standweb/routes/api.rb:92:in `block (3 levels) in <class:Web>'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1635:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1635:in `block in compile!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:992:in `block (3 levels) in route!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1011:in `route_eval'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:992:in `block (2 levels) in route!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1040:in `block in process_route'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1038:in `catch'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1038:in `process_route'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:990:in `block in route!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:989:in `each'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:989:in `route!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1097:in `block in dispatch!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `block in invoke'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `catch'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `invoke'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1094:in `dispatch!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:924:in `block in call!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `block in invoke'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `catch'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `invoke'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:924:in `call!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:913:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/xss_header.rb:18:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/path_traversal.rb:16:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/json_csrf.rb:26:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/frame_options.rb:31:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/session/abstract/id.rb:232:in `context'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/session/abstract/id.rb:226:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/logger.rb:15:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/common_logger.rb:33:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:231:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:224:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/head.rb:12:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:194:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1958:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/configuration.rb:225:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/server.rb:632:in `handle_request'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/server.rb:446:in `process_client'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/server.rb:306:in `block in run'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/thread_pool.rb:120:in `block in spawn_thread'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 10.0.0.1, 10.0.0.1, 2002:ae8:f509:: - - [18/Jul/2018:06:36:00 +0000] "GET /api/v2/bot_presence HTTP/1.1" 500 30 0.1545
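
A revised version along the lines Kyrremann describes (dropping the unconditional stop! and guarding against an already-dead actor) might look like this; it is a sketch of that change, not tested code:

get '/bot_presence' do
  client = ::Slack::Web::Client.new(token: ENV['SLACK_API_TOKEN'])
  status = client.users_getPresence(user: ENV['SLACK_BOT_ID'])
  return json(message: status) unless status['presence'] == 'away'

  instance = Standbot::Bot.instance
  begin
    instance.stop! # raises if Celluloid has already reaped the actor
  rescue Celluloid::DeadActorError
    # nothing left to stop; fall through and start a fresh client
  end
  instance.start_async
  json(message: 'Restarting bot')
end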

@dblock
Collaborator

dblock commented Jul 18, 2018

I couldn't figure out how to "see" this in celluloid-io, and it's apparently no longer maintained. Maybe someone can add support for the most-bestest async IO library du jour? #210

@jonmchan

My bot is also hosted on DigitalOcean. I had a restart once a day, which seemed to fix it for quite a while, but recently the problem has been getting worse. I just restarted the bot in the afternoon, only to find it offline in the evening. Is there no way we can take an active approach to determining whether the connection is still alive, by sending our own ping to the Slack server every few minutes and expecting a response back? If there's no response, just restart the connection. Wouldn't that be a workaround if the library is not properly detecting a disconnect?
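
A sketch of that workaround (and roughly what the eventual fix does): send an RTM-level ping on a timer and restart when pongs stop arriving. Here send_json and restart! are illustrative names rather than confirmed API, and this assumes pong frames reach event handlers like any other RTM message:

PING_EVERY = 120 # seconds between pings; an arbitrary choice
last_pong = Time.now

client.on :pong do |_data| # assumes pongs are dispatched to handlers
  last_pong = Time.now
end

Thread.new do
  id = 0
  loop do
    sleep PING_EVERY
    # no pong since the previous ping: assume the connection is dead
    restart! if Time.now - last_pong > PING_EVERY * 2
    client.send_json(type: 'ping', id: id += 1)
  end
end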

@RodneyU215
Collaborator

RodneyU215 commented Oct 29, 2018

Thanks for that information. I believe you've highlighted a new issue: slack-ruby-client's reconnect logic is conflicting with the reconnect logic that exists within slack-ruby-bot.

I'm looking into this now.

@stormsilver

stormsilver commented Oct 29, 2018 via email

@ryankemper

ryankemper commented Oct 30, 2018

I'm also using slack-ruby-bot and also ran into the ClientNotStartedError. The logs I have are from before the remove-retry-logic branch, though, so I'm not sure how useful they would be.

@dblock
Collaborator

dblock commented Oct 30, 2018

Seems reasonable, thanks everyone. I am excited about the possibility of killing any kind of retry logic in slack-ruby-bot and the ping worker inside slack-ruby-bot-server. LMK how I can help.

@zenmatt

zenmatt commented Oct 30, 2018

I've been monitoring this issue since it was reported. Over the past few days our connections have been failing repeatedly, multiple times per day, across the majority of teams. I thought it might have been a network problem or a temporary issue, but it hasn't gone away. I implemented a ping-and-restart method borrowed from slack-ruby-bot, which has helped. Has anyone else seen this problem ramp up recently?

@Mahito

Mahito commented Oct 30, 2018

I have just encountered this problem.
I've been using slack-ruby-client for about a month, and the Slack connection is lost almost every day.
I also tried using start_async, but that has not resolved it.

@dblock
Collaborator

dblock commented Oct 30, 2018

For everyone above trying @RodneyU215's branch and using slack-ruby-bot, upgrade to slack-ruby/slack-ruby-bot#203, which is:

gem 'slack-ruby-bot', github: 'RodneyU215/slack-ruby-bot', branch: 'removing_restart_logic'

Let us know if it's better!

@stormsilver

My bot still dies even with both gems pulled from @RodneyU215's branches. It eventually crashed and systemd restarted it. Here are some logs: https://gist.github.com/stormsilver/b940b0df2fde5825e46087fc80f14039

  1. It appears that we are making a request to rtm.start every 2 minutes, except sometimes it's every 2.5 minutes instead (strange).
  2. At some point we get an SSL error talking to Slack (which is mighty strange) and then we begin pinging rtm.start every second. Notice that we are now not receiving the Status: 200 line.
  3. After several minutes of this, we are able to get a 200 response again. However, we are already in some bad state - my logs now have PG connection errors (not included above since y'all don't care 'bout Postgres - the error is could not translate host name "localhost" to address: System error).
  4. And then, the next time we ping, we get E_EMFILE (too many open files), which isn't real great. It's possible that's related to my PG errors, but I think it more likely that the PG errors are caused by step 2 above. Seems like maybe after we have a failure connecting to rtm.start we enter some Bad Place where we open connections and never close them (a quick way to check is sketched below).
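
To confirm a descriptor leak like that, one quick check (Linux-only, and logger here stands in for whatever logger the bot already has) is to record the process's open file count periodically:

# Count this process's open file descriptors so a leak like the
# E_EMFILE above shows up as steady growth in the logs.
Thread.new do
  loop do
    logger.info "open file descriptors: #{Dir.glob('/proc/self/fd/*').size}"
    sleep 60
  end
end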

@RodneyU215
Collaborator

Thanks for everyone's help and continued efforts in triaging this. I've been on bereavement leave the last few weeks, but I'm back and will be diving into this issue again soon.

@stormsilver

@RodneyU215 Ahhh man. That stinks. Sorry to hear this :( Thanks for all your hard work on the disconnecting issues.

@RodneyU215
Collaborator

RodneyU215 commented Dec 22, 2018

@stormsilver I'd like to try to duplicate your issues. Where's your bot hosted, and which concurrency library are you using?

Thus far I've been using the hi.rb example to triage this issue, and there it appears to be resolved. I'd like to find an example or some code that reproduces any remaining errors.

@stormsilver

stormsilver commented Jan 9, 2019

@RodneyU215 Our bot is hosted on-premises on an Ubuntu 18.04 machine. We are using async-websocket as our concurrency library. Ruby 2.5.1, Rails 5.2.2.

I would say it's definitely much more reliable than a few months ago; however, we do still see problems where the bot goes AWOL. I have debug logging turned on permanently now, so if and when it happens again, I will update this thread.

@schneidmaster

I ended up fixing the symptoms by wrapping all of the ActiveRecord calls in my bot with ActiveRecord::Base.connection_pool.with_connection do ... end; that keeps the bot from leaking connections when it gets restarted.
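
A minimal sketch of that workaround: check a connection out of the pool for the duration of each handler, so a restart never strands a checked-out connection. The Standup model and the handler body are illustrative, not from the bot in question:

client.on :message do |data|
  # Checks a connection out of ActiveRecord's pool and returns it when
  # the block exits, even if the RTM client is restarted mid-handler.
  ActiveRecord::Base.connection_pool.with_connection do
    Standup.create!(channel: data.channel, text: data.text)
  end
end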

@RodneyU215
Collaborator

Just an FYI: all pending changes related to this issue have been merged into the master branch of both slack-ruby-client and slack-ruby-bot. It is no longer necessary to use any of the temporary branches mentioned above.

Anyone affected by this issue should use master until the next release.

gem 'slack-ruby-client', github: 'slack-ruby/slack-ruby-client', branch: 'master'
gem 'slack-ruby-bot', github: 'slack-ruby/slack-ruby-bot', branch: 'master'
gem 'async-websocket'

@ioquatix
Contributor

Do you mind giving a brief summary of what the root problem was and how it was fixed?

@RodneyU215
Collaborator

This issue was opened to triage a problem where apps using this client would silently lose their connections. @dblock investigated and found that the client did not gracefully handle disconnects. @aoberoi also investigated Slack's infrastructure and did not find any changes or issues that would cause these disconnects. We've not seen the same impact on any other libraries.

Over the last few months I worked with @dblock to implement some features that should drastically reduce the impact of unexpected disconnects. When using the client with a supported asynchronous library (i.e. async-websocket), we'll be running a regular Slack-specific ping. See Slack's RTM Ping and Pong section for more info.

This ping is sent to Slack’s message server which validates from our end that the connection is truly alive. In the event that a response is not received we now:

  1. Disconnect any phantom connections.
  2. Retrieve a new web socket url from Slack and reconnect.
  3. Finally, we restart the client’s run loop to resume app activity.

All this being said, investigations are still underway to both reproduce the issue and identify an underlying root cause. I truly appreciate everyone's help, support, and contributions on this issue.
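
For reference, a minimal setup that opts into this behavior might look like the following; the websocket_ping interval is an assumption here, so check the README for the option's actual name and default:

# Gemfile, per the comment above:
#   gem 'slack-ruby-client', github: 'slack-ruby/slack-ruby-client', branch: 'master'
#   gem 'async-websocket'

require 'slack-ruby-client'

Slack.configure do |config|
  config.token = ENV['SLACK_API_TOKEN']
end

Slack::RealTime::Client.configure do |config|
  config.websocket_ping = 30 # seconds between Slack-level pings (assumed value)
end

client = Slack::RealTime::Client.new
client.on :hello do
  puts 'connected'
end
client.start! # pings Slack on a timer and reconnects when pongs stop arriving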

@stormsilver

Since updating to master a month ago, we have had zero further issues with our bot going AWOL. I notice it goes offline/online quite often during the day, but AIUI that's the expected behavior given the nature of the fixes. Great work everyone, and thank you!

@dblock
Collaborator

dblock commented Feb 11, 2019

That's good news, @stormsilver. Which concurrency library do you use? And by offline/online you mean just briefly as it reconnects, I assume?

@stormsilver

@dblock Correct, just a brief blip as the bot reconnects. We are using async-websocket.

@dblock
Collaborator

dblock commented Feb 11, 2019

BTW, I want to release this. I just haven't had time to roll it out to my production bots, which is always a good test. Hopefully I will get to it soon.

@TallOrderDev

TL;DR: does master have any breaking changes, or can we add it and assume most everything works the same, just with the fix for disconnecting from the socket?

@TallOrderDev

BTW, Thanks to everyone for all this work!!! Sorry I didn't say it in the last post!!! You all rock!

@dblock
Collaborator

dblock commented Feb 13, 2019

TL;DR: does master have any breaking changes, or can we add it and assume most everything works the same, just with the fix for disconnecting from the socket?

No other breaking changes; see https://github.com/slack-ruby/slack-ruby-client/blob/master/CHANGELOG.md for details. If you use slack-ruby-bot you need slack-ruby/slack-ruby-bot#203, and if you use slack-ruby-bot-server we still have to remove the ping worker from it. If someone wants to PR this I'd be grateful.

I also think we need to bump the major version here and make those gems' next versions depend on a minimum slack-ruby-client version, to make sure we don't run into incompatibilities.
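
Such a version floor would be a one-line gemspec constraint; 0.14.0 below is illustrative, being the release that eventually shipped (see the list further down):

# In slack-ruby-bot.gemspec / slack-ruby-bot-server.gemspec (illustrative):
spec.add_dependency 'slack-ruby-client', '>= 0.14.0'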

@jvargh

jvargh commented Feb 15, 2019

Is there a way to disable keepalive messages in the debug logs?

[DEBUG] @slack/client:RTMClient2 send() in state: connected,ready
[DEBUG] @slack/client:KeepAlive1 setting ping timer
[DEBUG] @slack/client:RTMClient2 sending message on websocket: {"type":"ping","id":13}
[DEBUG] @slack/client:KeepAlive1 setting pong timer
[DEBUG] @slack/client:RTMClient2 received message on websocket: {"type":"pong","reply_to":13}
[DEBUG] @slack/client:KeepAlive1 received pong, clearing pong timer
[DEBUG] @slack/client:KeepAlive1 ping timer expired, sending ping
[DEBUG] @slack/client:RTMClient2 send() in state: connected,ready
[DEBUG] @slack/client:KeepAlive1 setting ping timer
[DEBUG] @slack/client:RTMClient2 sending message on websocket: {"type":"ping","id":14}
[DEBUG] @slack/client:KeepAlive1 setting pong timer

@dblock
Collaborator

dblock commented Feb 25, 2019

I've released:

  • slack-ruby-client 0.14.0
  • slack-ruby-bot 0.12.0
  • slack-ruby-bot-server 0.9.0

These should all address the problems above. Open a new issue if they don't. Closing this one.

@dblock
Collaborator

dblock commented Mar 5, 2019

I wrote this epic up: https://code.dblock.org/2019/03/04/solving-slack-side-disconnects-in-slack-ruby-client.html
