Slack-side disconnects #208

Kyrremann · 2018-06-11T09:35:19Z

In the last week my slack bot has started to disconnect from our Slack team. There are no logs of disconnect, or stack traces. I've started running the bot in debug-mode, and the last thing it logged before it disconnected was the following.

2018-06-08 20:44:59 20:44:59 bot.1  | D, [2018-06-08T20:44:59.795621 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#handle_read: �
2018-06-08 20:44:59 20:44:59 bot.1  | D, [2018-06-08T20:44:59.795799 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#write: ��iP¸
2018-06-08 20:46:59 20:46:59 bot.1  | D, [2018-06-08T20:46:59.795748 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#handle_read: �
2018-06-08 20:46:59 20:46:59 bot.1  | D, [2018-06-08T20:46:59.795976 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#write: ����
2018-06-08 20:48:59 20:48:59 bot.1  | D, [2018-06-08T20:48:59.796764 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#handle_read: �
2018-06-08 20:48:59 20:48:59 bot.1  | D, [2018-06-08T20:48:59.796915 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#write: ����o
2018-06-08 20:50:59 20:50:59 bot.1  | D, [2018-06-08T20:50:59.797632 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#handle_read: �
2018-06-08 20:50:59 20:50:59 bot.1  | D, [2018-06-08T20:50:59.797844 #34] DEBUG -- Slack::RealTime::Concurrency::Celluloid::Socket#write: ���
2018-06-08 20:50:59 20:50:59 bot.1  | �#

I'm not sure how to debug this my self, and I have no idea for what is wrong.

My setup is running on Google App Engine, with one process for a Sinatra web and a different process running the slackbot. Neither of the process stops, but the bot is offline in Slack.

The code for the project is available here: https://github.com/navikt/standbot

Any help would be welcoming!

The text was updated successfully, but these errors were encountered:

dblock · 2018-06-11T13:23:31Z

I am seeing the same thing on larger production bots. I am going to guess this is slack-side, but it also looks like we're not seeing/handling the disconnect properly somehow. I would say first lets understand what we expect on a disconnect like this? An exception?

dblock · 2018-06-11T13:26:01Z

For celluloid we want to emit a :close like here. So I would add debug code around there to find out how this is caused.

Kyrremann · 2018-06-12T06:58:26Z

Thanks for the reply.
We have a little over 1100 users in our team, but I don't feel the problem started before more of the users started using my app. In the beginning it was only 10 users, but now we are just below 50 users.

One strange thing is that the bot goes down in the evening, when no one is at work.

About what we expect. I guess an exception with a reason would be a good start. I will also reach out to Slack, and hear with them.

dblock · 2018-06-13T13:28:18Z

I stuck a restart into my bots (eg. dblock/slack-strava@bc29324) and I am seeing about a dozen of these every hour on a bot system with hundreds of connections.

Did you hear back from slack? Either way we definitely don’t have correct handling code here and the client thinks the connection is still open - maybe we can reproduce this or at least attempt to handle this based on that data being recieved? What’s that data is a good question.

aoberoi · 2018-06-13T23:57:54Z

hey, there! i'm from slack and i'd like to help you folks get this issue resolved. while i do have some expertise in our RTM API and Websockets (i helped build our node implementation), my Ruby is a little rusty. as a first step, i'd like to reproduce this with minimal steps, in my own app/workspace. if anyone has any shortcuts for me, please share. otherwise i'll update here as i find things.

alexagranov · 2018-06-14T00:26:38Z

@dblock @Kyrremann I can also confirm the strange disconnects but not seeing any issues restarting via latest master code.

We're not running close to hundreds of teams yet and getting to the restart! handler just fine: https://github.com/slack-ruby/slack-ruby-bot-server/blob/master/lib/slack-ruby-bot-server/server.rb#L19

Attaching a log from Prod with names redacted to protect the innocent. The disconnects start happening around 10PM UTC.

alexagranov · 2018-06-14T00:29:24Z

Oh woops, just realized we're in slack-ruby-client project, not slack-ruby-bot-server. Well, I'll leave my comment in case it's helpful.

dblock · 2018-06-14T00:42:13Z

Hi @aoberoi and thanks for jumping in!

I think the easiest is to run a full blown bot. Alternatively we can write some code to open hundreds of RTM connections and see what happens. Here's a full bot:

Clone https://github.com/slack-ruby/slack-shellbot.
Create a slack app with Slack and create .env with the following.

SLACK_CLIENT_ID=...
SLACK_CLIENT_SECRET=...

Run bundle install, make sure you have a local MongoDB, then foreman start. This will start a local instance of the bot. Navigate to http://localhost:5000 and you can install the bot with any team. Try DMing ls or help to the test bot and it should respond.

Registration creates a Team object. I think this happens more often with many connections, so I would just add this to app.rb:

team = Team.first
100.times do
   SlackShellbot::Service.instance.start!(team)
end

You should get 100 instances of the bot running.

dblock · 2018-06-14T01:37:55Z

Here's a recent down log from a bot with 297 teams.

W, [2018-06-13T22:08:38.463423 #212]  WARN -- : DOWN: game=pong, name=, id=T02BTFT45
W, [2018-06-13T22:08:39.509848 #212]  WARN -- : DOWN: game=pong, name=, id=T5ATQ5K6W
W, [2018-06-13T22:08:47.103501 #212]  WARN -- : DOWN: game=pong, name=, id=T04LD63KU
W, [2018-06-13T22:08:47.571911 #212]  WARN -- : DOWN: game=pong, name=, id=T6GF0T8UT
W, [2018-06-13T22:08:54.902196 #212]  WARN -- : DOWN: game=pong, name=, id=TAV9ZRUBY
W, [2018-06-13T22:09:04.200495 #212]  WARN -- : DOWN: game=tic-tac-toe, name=, id=TB1QA85PH
W, [2018-06-13T22:09:06.035861 #212]  WARN -- : DOWN: game=pong, name=, id=T024XAHD7
W, [2018-06-13T22:09:06.789929 #212]  WARN -- : DOWN: game=pong, name=, id=T0JT3RRV2
W, [2018-06-13T22:09:13.165823 #212]  WARN -- : DOWN: game=pong, name=, id=T06N1SCTB
I, [2018-06-13T22:09:38.589368 #212]  INFO -- : RESTART: game=pong, name=, id=T02BTFT45
I, [2018-06-13T22:09:38.589496 #212]  INFO -- : Starting team game=pong, name=, id=T02BTFT45.
...

I left team IDs in there for you @aoberoi in case you have logs to look at on your side. These all went offline in the 5 minutes preceding this log.

Note that we also see successful disconnects/restarts.

I, [2018-06-14T00:10:29.391865 #212]  INFO -- : T02990GNF: socket closed, restarting ...
I, [2018-06-14T00:10:29.586615 #212]  INFO -- : Successfully connected team  (T02990GNF) to https://....slack.com.

dblock · 2018-06-14T01:43:04Z

If you want to add more logging to the code that handles disconnects, put it here.

Kyrremann · 2018-06-14T06:07:09Z

@dblock I forgot to contact the Slack team, but since @aoberoi joined the issue, I'll wait.

I havne't had time to work any more on this, but I have now had the bot running since Monday. Also, I'm just running one team.

andrewhood125 · 2018-06-25T23:29:01Z

Any luck with this? Our bot disconnects daily. It's a important piece of our infrastructure we use to coordinate deployments.

jhawksley · 2018-06-29T07:30:59Z

My bot disconnects (silently) about once every three days, on a team with only a handful of users. Not sure if that helps, but at least it's a data point.

Kyrremann · 2018-06-29T08:09:49Z

I sent a support-ticket to the Slack team this morning. I'll let you know when I hear something. I'm afraid we won't get much help, as it's kinda outside their scope to support us (maybe?).

dblock · 2018-06-29T14:53:46Z

If someone understands what's being sent on the wire or wants to try to that would be helpful. In the end the websocket library simply doesn't see the disconnect.

ioquatix · 2018-06-29T22:18:41Z

I'm not really able to help with this issue directly, but there has been some rumblings of using a async for concurrency, so I thought I'd pass over what I've done in the past to make this work.

https://gist.github.com/ioquatix/c24f107e2cc7f48e571a37e8e93b0cda?ts=2

Just FYI. Yes, I know it's a different gem, but it can't be all that different.

Se issue på Github for mer info: slack-ruby/slack-ruby-client#208

aoberoi · 2018-07-03T02:01:34Z

I'm sorry I've been absent from this issue for so long. I was traveling over the last 2 weeks for Slack and hadn't had any time to focus on what was going on here.

Notes:

I tracked down logs on some of the specific disconnects that @dblock provided. From our logs, there was nothing abnormal.
In that sample, we observed 2-3 disconnects a day per bot user, which is slightly more than usual, but not extraordinary. RTM clients are expected to gracefully handle disconnects by reconnecting since the server may choose to close connections at any time when it's load balancing or updating.
On the days this was reported, Slack was experiencing some connectivity issues due to upstream AWS problems, but no degraded service. This is a potential explanation for the increased need for disconnects.
For some of the disconnects, the server recorded a cause=inactivity, which is the result of our servers not receiving a pong message for a ping it sent to the websocket client.

Action Items:

I'll keep trying to reproduce this issue on my side (still have not observed it). I'd like to get a better understanding of how this library deals with routine disconnects (as described above).
I get the impression that the underlying library is unaware of a disconnect until it attempts to send a message and it fails, this is less than ideal. I'd like to help improve that situation if possible. Even if we cannot know until a message is sent, it would be better if we queued the message that needs to be sent, initiate a reconnect, sync upon reconnect, and then resume sending the message.
I'd also like to verify the ping/pong behavior, since there's a non-zero number of inactivity disconnects.

Questions:

Is anyone experiencing a much larger volume of disconnects (more than 5 a day per bot user)?
Are many disconnects (either across bot users or with the same bot user) occurring in small windows of time? For example, if there's 3 disconnects that happen at the exact same time and then no disconnects for the entire day after that.

What can you do?

If you can answer yes to any of the questions above, or have any new information, please send me the timestamp of your disconnects and the bot user ID and team ID you are connecting to. If you don't want to share that information here, you can email me at ankur@slack-corp.com
If you can assist with any of the action items above, that would speed things up. I'm not a ruby expert so this would take me some time.

Kyrremann · 2018-07-10T07:47:18Z

I can't answer yes on any of the questions, but I was curios if other people also where running on Google Cloud Engine?

dblock · 2018-07-10T13:08:37Z

My bots are all on DigitalOcean.

jhawksley · 2018-07-11T07:21:50Z

I also can't answer yes, unfortunately; my bot runs locally here on-site (Ubuntu).

olingern · 2018-07-18T05:35:44Z

Just wanted to chime in and say I'm having the same problem, and am on Heroku.

My logs show that the bot is still logging conversations in the slack channel, but the Slack desktop client does not show its presence and the bot does not respond to commands.

Restarting the bot does fix the issue.

Kyrremann · 2018-07-18T07:06:27Z

Not sure if this is a step forward, but, I added a ping function, that would try to stop and restart my client when it was offline from Slack.

Sadly that didn't work, as I ended up with a DeadActorError from Celluloid. The line that crashed was instance.stop!, which makes me guess that Celluloid already knows the bot is dead. Going to change the code to remove the .stop!. See if it works with only start_async.

The code I used is this:

        get '/bot_presence' do
          client = ::Slack::Web::Client.new(token: ENV['SLACK_API_TOKEN'])
          status = client.users_getPresence(user: ENV['SLACK_BOT_ID'])
          if status['presence'] == 'away'
            instance = Standbot::Bot.instance
            instance.stop!
            instance.start_async
            json(message: 'Restarting bot successfully')
            return
          end
          json(message: status)
        end

And the stack trace is this:

2018-07-18 06:31:00 default[20180716t093850]  "GET /api/v2/bot_presence" 500
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 2018-07-18 06:36:00 - Celluloid::DeadActorError - attempted to call a dead actor: connected?:
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/celluloid-0.17.3/lib/celluloid/proxy/sync.rb:9:in `method_missing'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/slack-ruby-client-0.11.1/lib/slack/real_time/client.rb:68:in `started?'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/slack-ruby-client-0.11.1/lib/slack/real_time/client.rb:63:in `stop!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/slack-ruby-bot-0.11.1/lib/slack-ruby-bot/server.rb:48:in `stop!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/standweb/routes/api.rb:92:in `block (3 levels) in <class:Web>'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1635:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1635:in `block in compile!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:992:in `block (3 levels) in route!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1011:in `route_eval'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:992:in `block (2 levels) in route!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1040:in `block in process_route'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1038:in `catch'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1038:in `process_route'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:990:in `block in route!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:989:in `each'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:989:in `route!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1097:in `block in dispatch!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `block in invoke'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `catch'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `invoke'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1094:in `dispatch!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:924:in `block in call!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `block in invoke'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `catch'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `invoke'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:924:in `call!'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:913:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/xss_header.rb:18:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/path_traversal.rb:16:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/json_csrf.rb:26:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/frame_options.rb:31:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/session/abstract/id.rb:232:in `context'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/session/abstract/id.rb:226:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/logger.rb:15:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/common_logger.rb:33:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:231:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:224:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/head.rb:12:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:194:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1958:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/configuration.rb:225:in `call'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/server.rb:632:in `handle_request'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/server.rb:446:in `process_client'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/server.rb:306:in `block in run'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 	/app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/thread_pool.rb:120:in `block in spawn_thread'
2018-07-18 06:36:00 default[20180716t093850]  06:36:00 web.1  | 10.0.0.1, 10.0.0.1, 2002:ae8:f509:: - - [18/Jul/2018:06:36:00 +0000] "GET /api/v2/bot_presence HTTP/1.1" 500 30 0.1545

dblock · 2018-07-18T08:33:23Z

I failed to find how to "see" this in celluloid-io and it's apparently no longer maintained. Maybe someone can add support for the most-bestest async io library du jour? #210

jonmchan · 2018-07-26T03:53:49Z

My bot is also hosted on digital ocean. I had a restart once a day which seemed to fix it for quite a while, but recently the problem has been getting worst. I just restarted the bot in the afternoon to find the bot offline in the evening. Is there no way that we can take an active approach in determining the connection is still alive by sending our own PING to the slack server every XX minutes and expecting some response back? If no response, just restart the connection? Wouldn't that be a workaround if the library is not properly discovering a disconnect?

RodneyU215 · 2018-10-29T20:05:02Z

Thanks for that information. I believe you've highlighted a new issue; slack-ruby-client reconnect logic is conflicting with the reconnect logic that exists within slack-ruby-bot.

I'm looking into this now.

stormsilver · 2018-10-29T23:51:16Z

FWIW I am also using slack-ruby-bot. Could be a thing. I’ll work on getting some debug logs together.

…

-- Eric

On Oct 29, 2018, at 15:05, Rodney Urquhart ***@***.***> wrote: Thanks for that information. I believe you've highlighted a new issue; our reconnect logic is conflicting with the reconnect logic that exists within slack-ruby-bot. I'm looking into this now. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

ryankemper · 2018-10-30T00:41:30Z

Also using slack-ruby-bot and also ran into the ClientNotStartedError. The logs I have are from before the remove retry logic branch though, so not sure how useful those would be.

dblock · 2018-10-30T13:11:16Z

Seems reasonable, thanks everyone. I am excited about the possibility of killing any kind of retry stuff in slack-ruby-bot and ping inside slack-ruby-bot-server. LMK how I can help.

zenmatt · 2018-10-30T14:17:12Z

I've been monitoring this issue since it was reported. Over the past few days our connections are failing repeatedly, multiple times per day across the majority of teams. I thought it might have been a network problem or temporary issue, but it hasn't gone away. I implemented a ping and restart method borrowed from slack-ruby-bot, which has helped. Has anyone else seen this problem ramp up recently?

Mahito · 2018-10-30T15:12:26Z

I have just encountered this problem.
I use slack-ruby-client from a month ago, but the slack connection is lost like every day.
I also tried using start_async, but it has not been resolved.

dblock · 2018-10-30T22:53:15Z

For everyone above trying @RodneyU215's branch and using slack-ruby-bot, upgrade to slack-ruby/slack-ruby-bot#203, which is

gem 'slack-ruby-bot', github: 'RodneyU215/slack-ruby-bot', branch: 'removing_restart_logic'

Let us know if it's better!

stormsilver · 2018-11-02T15:02:29Z

My bot still dies even with both gems pulled from @RodneyU215 's branches. It eventually crashed and systemd restarted it. Here are some logs: https://gist.github.com/stormsilver/b940b0df2fde5825e46087fc80f14039

It appears that we are making a request to rtm.start every 2 minutes unless it's every 2.5 minutes instead (strange).
At some point we get an SSL error talking to Slack (which is mighty strange) and then we begin pinging rtm.start every second. Notice that we are now not receiving the Status: 200 line.
After several minutes of this, we are able to get a 200 response again. However, we are already in some bad state - my logs now have PG connection errors (not included above since ya'll don't care 'bout Postgres - the error is could not translate host name "localhost" to address: System error)
And then, next time we ping, we get E_EMFILE which isn't real great. It's possible that's related to my PG errors but I think it more likely that the PG errors are caused by step 2 above. Seems like maybe after we have a failure connecting to rtm.start we enter some Bad Place where we open connections and never close them.

RodneyU215 · 2018-12-01T00:49:09Z

Thanks for everyone's help and continued efforts on triaging this. I've been on bereavement leave the last few weeks, but I'm back and will be diving back into this issue soon.

stormsilver · 2018-12-03T13:53:44Z

@RodneyU215 Ahhh man. That stinks. Sorry to hear this :( Thanks for all your hard work on the disconnecting issues.

RodneyU215 · 2018-12-22T16:57:56Z

@stormsilver I'd like to try and duplicate your issues. Where's your bot being hosted? And which concurrency library are you using?

Thus far I've been using the hi.rb example to triage this issue and it appears to be resolved. I'd like to find an example or some code that reproduces any current errors.

stormsilver · 2019-01-09T15:05:50Z

@RodneyU215 Our bot is hosted on premise on an Ubuntu 18.04 machine. We are using async-websocket as our concurrency library. Ruby 2.5.1, Rails 5.2.2

I would say it's definitely much more reliable than a few months ago; however, we do still see problems where the bot goes AWOL. I have debug logging turned on permanently now so when and if it happens again, I will update this thread.

schneidmaster · 2019-01-09T15:16:22Z

I ended up fixing the symptoms by wrapping all of the activerecord calls in my bot with ActiveRecord::Base.connection_pool.with_connection do -- that keeps it from leaking connections when it gets restarted.

RodneyU215 · 2019-01-10T00:24:55Z

Just an FYI, all pending changes related to this issue have been merged into the master branch for both slack-ruby-client and slack-ruby-bot. It is no longer necessary to use any other temporary branches we mentioned above.

Anyone affected by this issue should use master until the next release.

gem 'slack-ruby-client', github: 'slack-ruby/slack-ruby-client', branch: 'master'
gem 'slack-ruby-bot', github: 'slack-ruby/slack-ruby-bot', branch: 'master'
gem 'async-websocket'

ioquatix · 2019-01-10T00:38:48Z

Do you mind giving a brief summary of what the root problem was and how it was fixed?

RodneyU215 · 2019-01-11T22:51:57Z

This issue was opened to triage a problem where apps using this client would silently lose its connection. @dblock investigated the issue and found that the client did not gracefully handle disconnects. @aoberoi also investigated Slack’s infrastructure and did not find any changes or issues that would cause these disconnects. We’ve not seen the same impact on any other libraries.

Over the last few months I worked with @dblock to implement some features that should drastically reduce the impact of unexpected disconnections. When using the client with a supported asynchronous library (i.e. async-websocket) we’ll be running a regular Slack specific ping. See Slack's RTM Ping and Pong section for more info.

This ping is sent to Slack’s message server which validates from our end that the connection is truly alive. In the event that a response is not received we now:

Disconnect any phantom connections.
Retrieve a new web socket url from Slack and reconnect.
Finally, we restart the client’s run loop to resume app activity.

All this being said investigations are still underway to both reproduce the issue and identify an underlying root cause. I truly appreciate everyone’s help, support and contributions to this issue.

stormsilver · 2019-02-11T13:31:39Z

Since updating to master a month ago, we have had zero further issues with our bot going AWOL. I notice it goes offline/online quite often during the day, but AIUI that's the expected behavior given the nature of the fixes. Great work everyone, and thank you!

dblock · 2019-02-11T13:53:20Z

That's good news @stormsilver, which concurrency do you use? And offline/online you mean just briefly as it reconnects I assume?

stormsilver · 2019-02-11T13:56:01Z

@dblock Correct, just a brief blip as the bot reconnects. We are using async-websocket.

dblock · 2019-02-11T13:56:20Z

Btw, I want to release this. Just didn't have time to roll it in my production bots, which is always a good test. Hopefully will get to it soon.

TallOrderDev · 2019-02-12T20:29:50Z

TL:DR Does the Master have any breaking changes, or can we add it in an assume most everything works the same, just the fix for disconnecting from the socket?

TallOrderDev · 2019-02-12T20:30:56Z

BTW, Thanks to everyone for all this work!!! Sorry I didn't say it in the last post!!! You all rock!

dblock · 2019-02-13T18:26:26Z

TL:DR Does the Master have any breaking changes, or can we add it in an assume most everything works the same, just the fix for disconnecting from the socket?

No other breaking changes, see https://github.com/slack-ruby/slack-ruby-client/blob/master/CHANGELOG.md for details. If you use slack-ruby-bot you need slack-ruby/slack-ruby-bot#203 and if you use slack-ruby-bot-server we still have to remove the ping worker from that. If someone wants to PR this I'd be grateful.

I also think we need to bump major version here and make those gems next versions dependent on a minimum of slack-ruby-client to make sure we don't run into incompatibilities.

jvargh · 2019-02-15T05:51:51Z

is there a way to disable keepalive messages from debug logs?

[DEBUG] @slack/client:RTMClient2 send() in state: connected,ready
[DEBUG] @slack/client:KeepAlive1 setting ping timer
[DEBUG] @slack/client:RTMClient2 sending message on websocket: {"type":"ping","id":13}
[DEBUG] @slack/client:KeepAlive1 setting pong timer
[DEBUG] @slack/client:RTMClient2 received message on websocket: {"type":"pong","reply_to":13}
[DEBUG] @slack/client:KeepAlive1 received pong, clearing pong timer
[DEBUG] @slack/client:KeepAlive1 ping timer expired, sending ping
[DEBUG] @slack/client:RTMClient2 send() in state: connected,ready
[DEBUG] @slack/client:KeepAlive1 setting ping timer
[DEBUG] @slack/client:RTMClient2 sending message on websocket: {"type":"ping","id":14}
[DEBUG] @slack/client:KeepAlive1 setting pong timer

dblock · 2019-02-25T19:51:44Z

I've released:

slack-ruby-client 0.14.0
slack-ruby-bot 0.12.0
slack-ruby-bot-server 0.9.0

These should all address the problems above. Open a new issue if they don't. Closing this one.

dblock · 2019-03-05T03:23:40Z

I wrote this epic up: https://code.dblock.org/2019/03/04/solving-slack-side-disconnects-in-slack-ruby-client.html

dblock added the bug? label Jun 11, 2018

dblock changed the title ~~Strange and silent disconnects~~ Slack-side disconnects Jun 14, 2018

dblock mentioned this issue Jun 18, 2018

pongbot does not reconnect slack-ruby/slack-ruby-bot#107

Closed

dblock mentioned this issue Jun 29, 2018

help with server-side disconnects in slack-ruby-client celluloid/celluloid-io#187

Open

Kyrremann added a commit to navikt/standbot that referenced this issue Jul 2, 2018

Bytter logging tilbake til info, debug gir ikke noe mer informasjon

0838822

Se issue på Github for mer info: slack-ruby/slack-ruby-client#208

dblock mentioned this issue Jul 10, 2018

Memory Leak? slack-ruby/slack-ruby-bot#190

Open

dblock mentioned this issue Jan 18, 2019

Slack Bot online / offline slack-ruby/slack-ruby-bot#212

Closed

dblock closed this as completed Feb 25, 2019

gsmetal mentioned this issue Feb 26, 2019

Issue with latest releases and Celluloid #254

Closed

dblock mentioned this issue Mar 5, 2019

Occasional failures to reconnect #257

Closed

Slack-side disconnects #208

Slack-side disconnects #208

Comments

Kyrremann commented Jun 11, 2018

dblock commented Jun 11, 2018 • edited Loading

dblock commented Jun 11, 2018

Kyrremann commented Jun 12, 2018

dblock commented Jun 13, 2018 • edited Loading

aoberoi commented Jun 13, 2018

alexagranov commented Jun 14, 2018

alexagranov commented Jun 14, 2018

dblock commented Jun 14, 2018

dblock commented Jun 14, 2018 • edited Loading

dblock commented Jun 14, 2018 • edited Loading

Kyrremann commented Jun 14, 2018

andrewhood125 commented Jun 25, 2018

jhawksley commented Jun 29, 2018

Kyrremann commented Jun 29, 2018

dblock commented Jun 29, 2018

ioquatix commented Jun 29, 2018

aoberoi commented Jul 3, 2018

Notes:

Action Items:

Questions:

What can you do?

Kyrremann commented Jul 10, 2018

dblock commented Jul 10, 2018

jhawksley commented Jul 11, 2018

olingern commented Jul 18, 2018

Kyrremann commented Jul 18, 2018

dblock commented Jul 18, 2018 • edited Loading

jonmchan commented Jul 26, 2018

RodneyU215 commented Oct 29, 2018 • edited Loading

stormsilver commented Oct 29, 2018 via email

ryankemper commented Oct 30, 2018 • edited Loading

dblock commented Oct 30, 2018

zenmatt commented Oct 30, 2018

Mahito commented Oct 30, 2018

dblock commented Oct 30, 2018

stormsilver commented Nov 2, 2018

RodneyU215 commented Dec 1, 2018

stormsilver commented Dec 3, 2018

RodneyU215 commented Dec 22, 2018 • edited Loading

stormsilver commented Jan 9, 2019 • edited Loading

schneidmaster commented Jan 9, 2019

RodneyU215 commented Jan 10, 2019

ioquatix commented Jan 10, 2019

RodneyU215 commented Jan 11, 2019

stormsilver commented Feb 11, 2019

dblock commented Feb 11, 2019 • edited Loading

stormsilver commented Feb 11, 2019

dblock commented Feb 11, 2019

TallOrderDev commented Feb 12, 2019

TallOrderDev commented Feb 12, 2019

dblock commented Feb 13, 2019

jvargh commented Feb 15, 2019

dblock commented Feb 25, 2019

dblock commented Mar 5, 2019

dblock commented Jun 11, 2018 •

edited

Loading

dblock commented Jun 13, 2018 •

edited

Loading

dblock commented Jun 14, 2018 •

edited

Loading

dblock commented Jun 14, 2018 •

edited

Loading

dblock commented Jul 18, 2018 •

edited

Loading

RodneyU215 commented Oct 29, 2018 •

edited

Loading

ryankemper commented Oct 30, 2018 •

edited

Loading

RodneyU215 commented Dec 22, 2018 •

edited

Loading

stormsilver commented Jan 9, 2019 •

edited

Loading

dblock commented Feb 11, 2019 •

edited

Loading