
Safer socketmode #1150

Merged
merged 9 commits into from Aug 24, 2023

Conversation

iaburton
Contributor

@iaburton iaburton commented Dec 21, 2022

PR preparation

Run make pr-prep from the root of the repository to run formatting, linting and tests.

Done, though I only modified socketmode, and I was already running race tests with -count 1000 while making changes.

Should this be an issue instead

Eh, one of the things it fixes was already an issue, #1125, as mentioned in the relevant commit message.
I decided not to open a(nother) larger issue for my other changes, for the reasons listed below.

API changes

As far as I can tell, there are no breaking API changes (though there is a small behavior change in one location that used to have the potential to panic). There are, however, small API "additions": two methods that allow context use where previous versions did not.

The why

Hello!

I'm looking to use the socketmode package of this repo for a potentially largish app, and while browsing the code & existing issues I decided to fix some of them up myself, as well as address some I found on my own. Below is a TL;DR-ish summary of the changes, with minimal sketches of each fix after the list:

  • Fixes Socketmode panic, write on closed channel #1125 by allowing the producer of the messages on the affected channel to close it when messages are no longer being written; this lets the consumer safely see the close, and neither side will panic (first sketch below).

  • Fix the potential for a race condition on firstErr. While all of these goroutines go through a sync.Once to write to the variable, one of them is not waited on by the outer goroutine. Since the outer goroutine does not go through a sync type to read the variable, a race was possible between the read and the write linked earlier. Simple enough to fix: use a buffered channel and some selects; the first goroutine to exit with an error sends it into the buffer, and the others fall through the default case. The outer goroutine then selects for the buffered error (second sketch below).

  • Fix potential deadlock in socketmode.Client.Events due to the use of buffering and single channel sends. The fact that Events is a public field is unfortunate, but not reason enough for this problem. If consumer goroutines are slow and the buffer fills while another goroutine is writing to Events, while yet another is canceling the context provided to RunContext, you may have deadlocked goroutines unable to send: the buffer is full, and they are not in a select they can use to break out of the send. Again a relatively easy fix: change all Events writes into a method that selects on the propagated context, negating any chance of getting stuck (third sketch below). It would be better to force Events to be a receive-only channel on the Client struct, or use a method to return it, but that isn't doable without breaking API changes.

    • For the small API addition mentioned earlier, this means SendCtx and AckCtx, as those too write to an internal channel in an unsafe manner, so context-aware methods were needed to break a potential deadlock. The original methods are still there, and perhaps a comment should be added noting that the context methods are preferred.
  • Other, more miscellaneous changes are mostly around error handling and the use of the error (un)wrapping introduced in Go 1.13.

  • Fix race on the deadman timer's Elapsed+Reset. The deadman timer's channel was being read from one goroutine and reset from the pingHandler, which is not allowed:

    • As in the !timer.Stop() case you could have more than one goroutine doing concurrent receives. This was not a data race, but a race condition affecting ping/pong/timeout behavior; as the time.Timer documentation puts it, "This should not be done concurrent to other receives from the Timer's channel." (Final sketch below.)
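
A minimal sketch of the producer-side close from the first bullet, with an illustrative channel rather than the library's actual types — only the producer closes, so the consumer can never hit a send-on-closed-channel panic:

```go
package main

import "fmt"

// produce is the sole writer to ch, so it alone closes the channel;
// the consumer observes the close instead of anyone panicking on a send.
func produce(n int) <-chan int {
	ch := make(chan int)
	go func() {
		defer close(ch) // safe: the producer is the only writer
		for i := 0; i < n; i++ {
			ch <- i
		}
	}()
	return ch
}

func main() {
	for v := range produce(3) { // range exits cleanly when the producer closes
		fmt.Println(v)
	}
}
```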
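
A minimal sketch of the buffered-error pattern from the firstErr fix; run and the worker funcs here are hypothetical stand-ins for the client's goroutines:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

func run(ctx context.Context, workers ...func(context.Context) error) error {
	errCh := make(chan error, 1) // capacity 1: the first error wins
	var wg sync.WaitGroup
	for _, w := range workers {
		w := w
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := w(ctx); err != nil {
				select {
				case errCh <- err: // first failing goroutine records its error
				default: // buffer already holds an error; others fall through
				}
			}
		}()
	}
	wg.Wait() // the outer goroutine now waits on *all* inner goroutines
	select {
	case err := <-errCh:
		return err
	default:
		return nil
	}
}

func main() {
	err := run(context.Background(),
		func(context.Context) error { return errors.New("boom") },
		func(context.Context) error { return nil },
	)
	fmt.Println(err) // boom
}
```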
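
A minimal sketch of the context-aware send; Client, Event, and sendEvent are illustrative stand-ins, not the actual socketmode types or method names:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type Event struct{ Type string }

type Client struct{ Events chan Event }

// sendEvent never blocks forever: it delivers the event or gives up
// when the propagated context is canceled.
func (c *Client) sendEvent(ctx context.Context, ev Event) error {
	select {
	case c.Events <- ev: // buffer has room or a consumer is ready
		return nil
	case <-ctx.Done(): // escape hatch when the buffer is full
		return ctx.Err()
	}
}

func main() {
	c := &Client{Events: make(chan Event)} // no consumer: a plain send would hang
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	fmt.Println(c.sendEvent(ctx, Event{Type: "hello"})) // context deadline exceeded
}
```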
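
And a minimal sketch of single-owner timer handling for the deadman fix, assuming a pings channel fed by the ping handler; the receive from timer.C and the Reset happen in one goroutine, so the drain in the !Stop() branch can never race another receive:

```go
package main

import (
	"fmt"
	"time"
)

func watch(pings <-chan struct{}, timeout time.Duration) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case <-timer.C: // deadman elapsed: no ping arrived in time
			fmt.Println("timed out waiting for ping")
			return
		case <-pings:
			if !timer.Stop() {
				<-timer.C // drain before Reset, per the time.Timer docs
			}
			timer.Reset(timeout)
		}
	}
}

func main() {
	pings := make(chan struct{})
	go func() { pings <- struct{}{} }() // one ping, then silence
	watch(pings, 50*time.Millisecond)
}
```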

#### DRAFT Status
While I'm technically submitting this now, I don't consider it 100% done yet, as I'd like to verify some things in more real-world use,
plus I'm considering performance and other changes, though those are likely to come in another PR. That, and it lets me get early feedback if any maintainers have comments/questions/etc. 🙂
EDIT: Avoiding scope creep and stale PRs; performance and other changes can be followed up on.

…nel to close when it is finished, and consumer to see the close
…outer goroutine did not wait on all inner goroutines that had a chance to set it, also make sure to check error for context.Canceled appropriately
…l via a method that has an escape hatch; unable to change public Events field without breaking api, though
…ntexts for channel ops, though they are very similar now
@kanata2 kanata2 added api cleanup SocketMode about SocketMode bugfix [PR] bugfix labels Dec 21, 2022
@iaburton
Contributor Author

Hi @kanata2

I know I haven't followed up with this as I intended. Unfortunately some other things came up and I haven't been able to add to or otherwise refine this PR further.

However, that doesn't mean it cannot go in as-is, with a follow-up if needed. Have you or any other devs/maintainers had a chance to look it over?

@kanata2
Member

kanata2 commented Apr 15, 2023

@iaburton
Sorry for the slow response. I'll confirm later. 🙏

FYI: #1125 is fixed by #1170

Functionally revert slack-go#1170 as drafted approach takes correct path of
closing chan on goroutine producer side
@iaburton
Contributor Author

iaburton commented Apr 24, 2023

Hey @kanata2 , no problem on the slow response.

Btw, I've functionally reverted #1170 in favor of what I had already fixed here (including that panic), together with the other issues I've described above.

I would appreciate it if you, any other maintainers, and those involved with 1170 ( @lololozhkin & @parsley42 ) double-checked, as the scope of this PR is creeping, which I would rather it not do 🙂

Further, I noticed another race on the deadman timer that I've yet to fix (but can). At the moment it has a 'fixme', in case we want to merge this first and let someone else tackle it, or I can in a follow-up (or here if desired).
EDIT: Just went ahead and fixed the timer issue; see the relevant commit and the OP for information.

@iaburton iaburton changed the title DRAFT: Safer socketmode Safer socketmode Apr 24, 2023
@iaburton iaburton marked this pull request as ready for review April 24, 2023 18:57
…er and ping time channel. Remove deadman file as it is no longer needed
@parsley42
Member

@iaburton FYI, I'm running your branch in production Gopherbot right now. I've reached out to @kanata2 , but it may be that he's gotten too busy with other things to devote much time to this.

I've been reading your patch for a while now this morning, but can't really take the time to fully ingest it (I'm not a daily Go programmer, just a hobbyist w/ Gopherbot, which helps me with my real day job, DevOps). Your thinking and explanations are clear and sensible, so my inclination is to approve/merge, as this fix covers a potential deadlock and race, and has a nicer implementation for the timeout. I'm going to wait a few more days to hear back from Naoki, though.

My other inclination is to inquire if you'd be willing to serve as a maintainer. Looking at the commit history, Naoki has been doing most of the reviews and merges. Let me know if you'd be up for helping out.

@iaburton
Contributor Author

iaburton commented Aug 1, 2023

> @iaburton FYI, I'm running your branch in production Gopherbot right now. I've reached out to @kanata2 , but it may be that he's gotten too busy with other things to devote much time to this.
>
> I've been reading your patch for a while now this morning, but can't really take the time to fully ingest it (I'm not a daily Go programmer, just a hobbyist w/ Gopherbot, which helps me with my real day job, DevOps). Your thinking and explanations are clear and sensible, so my inclination is to approve/merge, as this fix covers a potential deadlock and race, and has a nicer implementation for the timeout. I'm going to wait a few more days to hear back from Naoki, though.
>
> My other inclination is to inquire if you'd be willing to serve as a maintainer. Looking at the commit history, Naoki has been doing most of the reviews and merges. Let me know if you'd be up for helping out.

Hi @parsley42 thanks for running my branch! Pardon my absence as I've been dealing with a few things IRL 🙂

Before I comment on helping with maintenance, may I ask how the branch has fared w/ Gopherbot? Is there anything I can help explain with respect to your having looked over the patch?

@parsley42
Member

Hi @iaburton yeah, work and RL are keeping me busy too, as well as Naoki, I guess.

The original issue (IIRC) was that when the library stopped getting pings from Slack, that alone wasn't sufficient to trigger a reconnect; instead nothing happened until ReadJSON timed out, which took a couple of minutes, causing a long pause whenever a connection went bad. The first bad fix just stopped waiting on ReadJSON and errored out on a missed ping. I didn't see how this could lead to a panic later.

Right now I'm heavily in AWS/Terraform land, and it would be a major context switch to fully grok your patch; I've been reading it for 20 minutes (about all I can allot) and still don't completely follow your changes in e.g. receiveMessagesInto. But since you asked: if you wouldn't mind expanding on how this patch reconnects when pings are missed, but avoids the panic related to ReadJSON, that might help.

In any case, as far as Gopherbot goes, I can only tell you that two of my production robots haven't panicked in the last two weeks, and I'm not seeing any issues. When I looked at your GH profile, it appeared you spend a lot of time writing Go, so I thought maybe this would be a good fit; if not, I'll continue examining contributors to see who else might be able to pitch in.

What we don't want is me trying to review and merge patches. 😬

@parsley42
Member

@iaburton Ok, it looks like I'm going to need to fill in for a while. I'd like the checks to run, but I think you'll need to close and re-open the PR to get them out of the expired state.

@parsley42
Member

Closing and re-opening for tests.

@parsley42 parsley42 closed this Aug 18, 2023
@parsley42 parsley42 reopened this Aug 18, 2023
@parsley42
Member

@lololozhkin Have you looked at this patch? It deals with the same issue as your patch, but with some additions and changes. I wanted to get your thoughts on it.

@lololozhkin
Contributor

@parsley42, hi, I hadn't seen your messages, thank you for mentioning me. Cool patch! I hadn't seen the opportunity to avoid waiting on one of the goroutines, because I didn't know that goroutine was the only producer. Well done!

I am not a maintainer of this repo, so I cannot approve it on GitHub; I can only do it in the comments. Approve :)

Yes, this code fixes my issue too. It's a pity that my code goes in the trash, but this project doesn't need it :)

@lololozhkin
Contributor

I've read this discussion, and I don't understand: did my patch fix your problem with slow reconnects? My patch fixed the panics; it's been running in my production for more than a month with zero panics. And yes, reconnects are handled normally.

If a ping is not received before the deadman timer elapses, the context is cancelled and the cancellation propagates to the JSON reading; the goroutine returns after the ctx cancel.
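
A minimal sketch of that flow; ctx, cancel, and the deadman timer are illustrative stand-ins for the client's internals:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	deadman := time.NewTimer(20 * time.Millisecond) // no ping arrives in this sketch
	defer deadman.Stop()

	go func() {
		select {
		case <-deadman.C: // deadman elapsed without a ping
			cancel() // cancellation propagates to the JSON-reading goroutine
		case <-ctx.Done():
		}
	}()

	<-ctx.Done() // stands in for the reader observing the cancel and returning
	fmt.Println("reader returned:", ctx.Err())
}
```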

Am I missing something?

@parsley42
Member

@lololozhkin Sorry, more context: this patch is by @iaburton , not me. I am a maintainer and I could merge this patch, but I wanted your opinion since it is an alternate fix. Based on the description, it may have value, even if only in allowing the caller to pass in a context. What do you think?

@lololozhkin
Contributor

lololozhkin commented Aug 24, 2023

@parsley42 As I said: yes, from my point of view that patch fixes the problem from my PR. Moreover, I think it fixes it in a better way. So from my point of view, this PR should be merged fully, including allowing usage of contexts in Send and Ack, and the fix for panics on reconnects :)

@parsley42 parsley42 left a comment
Member

The patch looks fine and has been pretty heavily tested; also reviewed by @lololozhkin - merging.

@parsley42 parsley42 merged commit 65cefed into slack-go:master Aug 24, 2023
4 checks passed
@parsley42
Member

@lololozhkin Ok, thanks - you approved in your first message but then replied again with other questions, so I just wanted to clarify. Thanks for being responsive. I'm in the process of trying to recruit more active maintainers for this library and have sent you an invite; I think more seasoned Go developers probably wouldn't pick me as "the" maintainer, but right now it seems others are busy, so I'm trying to get some things moved through. If you'd like to help out, that'd be great; if you know other devs you'd recommend, I'd be happy to hear.

@lololozhkin
Contributor

@parsley42 TY! Accepted your invitation; I think I could be useful for this project :)

I don't know other interested Go devs, but if any appear in my life, I'll let you know.

Successfully merging this pull request may close these issues.

Socketmode panic, write on closed channel
4 participants