
Safer socketmode #1150

Merged
merged 9 commits into from Aug 24, 2023

Conversation

iaburton
Contributor

@iaburton iaburton commented Dec 21, 2022

PR preparation

Run make pr-prep from the root of the repository to run formatting, linting and tests.

Done, though I only modified socketmode, and I was already running race tests with -count 1000 while making changes.

Should this be an issue instead

Eh, one of the things it fixes was already an issue, #1125, as mentioned in the relevant commit message.
I decided not to open a(nother) larger issue for my other changes, for the reasons listed below.

API changes

As far as I can tell, there are no breaking API changes (though there is a small behavior change in one location that used to have the potential to panic). There are, however, small API "additions": two methods that allow context use where previous versions did not.

The why

Hello!

I'm looking to use the socketmode package of this repo for a potentially largish app, and while browsing the code & existing issues I decided to fix some of them up myself, as well as address some I found on my own. Below is a TL;DR-ish summary of the changes, with minimal sketches of each fix after the list:

  • Fixes Socketmode panic, write on closed channel #1125 by allowing the producer of the messages on the affected channel to close it when messages are no longer being written; this lets the consumer safely see the close, and neither side will panic (first sketch below).

  • Fix the potential for a race condition on firstErr. While all of these goroutines go through a sync.Once to write to the variable, one of them is not waited on by the outer goroutine. Since the outer goroutine does not go through a sync type to read the variable, a race was possible between the read and the write linked earlier. Simple enough to fix: use a buffered channel and some selects; the first goroutine to exit with an error sends it into the buffer, and the others fall through the default case. The outer goroutine then selects for the buffered error (second sketch below).

  • Fix potential deadlock in socketmode.Client.Events due to the use of buffering and single channel sends. The fact that Events is a public field is unfortunate, but not reason enough for this problem. If consumer goroutines are slow and the buffer fills while another goroutine is writing to Events, while yet another is canceling the context provided to RunContext, you may have deadlocked goroutines unable to send: the buffer is full, and they are not in a select they can use to break out of the send. Again a relatively easy fix: change all Events writes into a method that selects on the propagated context, negating any chance of getting stuck (third sketch below). It would be better to force Events to be a receive-only channel on the Client struct, or use a method to return it, but that isn't doable without breaking API changes.

    • For the small API addition mentioned earlier, this means SendCtx and AckCtx, as those too write to an internal channel in an unsafe manner, so context-aware methods were needed to break a potential deadlock. The original methods are still there, and perhaps a comment should be added noting that the context methods are preferred.
  • Other, more miscellaneous changes are mostly around error handling and the use of the error (un)wrapping introduced in Go 1.13.

  • Fix race on the deadman timer's Elapsed+Reset. The deadman timer's channel was being read from one goroutine and reset from the pingHandler, which is not allowed:

    • As in the !timer.Stop() case you could have more than one goroutine doing concurrent receives. This was not a data race, but a race condition affecting ping/pong/timeout behavior; as the time.Timer documentation puts it, "This should not be done concurrent to other receives from the Timer's channel." (Final sketch below.)
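
A minimal sketch of the producer-side close from the first bullet, with an illustrative channel rather than the library's actual types — only the producer closes, so the consumer can never hit a send-on-closed-channel panic:

```go
package main

import "fmt"

// produce is the sole writer to ch, so it alone closes the channel;
// the consumer observes the close instead of anyone panicking on a send.
func produce(n int) <-chan int {
	ch := make(chan int)
	go func() {
		defer close(ch) // safe: the producer is the only writer
		for i := 0; i < n; i++ {
			ch <- i
		}
	}()
	return ch
}

func main() {
	for v := range produce(3) { // range exits cleanly when the producer closes
		fmt.Println(v)
	}
}
```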
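
A minimal sketch of the buffered-error pattern from the firstErr fix; run and the worker funcs here are hypothetical stand-ins for the client's goroutines:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

func run(ctx context.Context, workers ...func(context.Context) error) error {
	errCh := make(chan error, 1) // capacity 1: the first error wins
	var wg sync.WaitGroup
	for _, w := range workers {
		w := w
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := w(ctx); err != nil {
				select {
				case errCh <- err: // first failing goroutine records its error
				default: // buffer already holds an error; others fall through
				}
			}
		}()
	}
	wg.Wait() // the outer goroutine now waits on *all* inner goroutines
	select {
	case err := <-errCh:
		return err
	default:
		return nil
	}
}

func main() {
	err := run(context.Background(),
		func(context.Context) error { return errors.New("boom") },
		func(context.Context) error { return nil },
	)
	fmt.Println(err) // boom
}
```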
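
A minimal sketch of the context-aware send; Client, Event, and sendEvent are illustrative stand-ins, not the actual socketmode types or method names:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type Event struct{ Type string }

type Client struct{ Events chan Event }

// sendEvent never blocks forever: it delivers the event or gives up
// when the propagated context is canceled.
func (c *Client) sendEvent(ctx context.Context, ev Event) error {
	select {
	case c.Events <- ev: // buffer has room or a consumer is ready
		return nil
	case <-ctx.Done(): // escape hatch when the buffer is full
		return ctx.Err()
	}
}

func main() {
	c := &Client{Events: make(chan Event)} // no consumer: a plain send would hang
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	fmt.Println(c.sendEvent(ctx, Event{Type: "hello"})) // context deadline exceeded
}
```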
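
And a minimal sketch of single-owner timer handling for the deadman fix, assuming a pings channel fed by the ping handler; the receive from timer.C and the Reset happen in one goroutine, so the drain in the !Stop() branch can never race another receive:

```go
package main

import (
	"fmt"
	"time"
)

func watch(pings <-chan struct{}, timeout time.Duration) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case <-timer.C: // deadman elapsed: no ping arrived in time
			fmt.Println("timed out waiting for ping")
			return
		case <-pings:
			if !timer.Stop() {
				<-timer.C // drain before Reset, per the time.Timer docs
			}
			timer.Reset(timeout)
		}
	}
}

func main() {
	pings := make(chan struct{})
	go func() { pings <- struct{}{} }() // one ping, then silence
	watch(pings, 50*time.Millisecond)
}
```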

#### DRAFT Status
While I'm technically submitting this now, I don't consider it 100% done yet, as I'd like to verify some things in more real-world use,
plus I'm considering performance and other changes, though those are likely to come in another PR. That, and it lets me get early feedback if any maintainers have comments/questions/etc. 🙂
EDIT: Avoiding scope creep and stale PRs; performance and other changes can be followed up on.

…nel to close when it is finished, and consumer to see the close
…outer goroutine did not wait on all inner goroutines that had a chance to set it, also make sure to check error for context.Canceled appropriately
…l via a method that has an escape hatch; unable to change public Events field without breaking api, though
…ntexts for channel ops, though they are very similar now
@kanata2 kanata2 added api cleanup SocketMode about SocketMode bugfix [PR] bugfix labels Dec 21, 2022
@iaburton
Contributor Author

Hi @kanata2

I know I haven't followed up with this as I intended. Unfortunately some other things came up and I haven't been able to add to or otherwise refine this PR further.

However, that doesn't mean it cannot go in as-is, with a follow-up if needed. Have you or any other devs/maintainers had a chance to look it over?

@kanata2
Member

kanata2 commented Apr 15, 2023

@iaburton
Sorry for the slow response. I'll confirm later. 🙏

FYI: #1125 is fixed by #1170

Functionally revert slack-go#1170 as drafted approach takes correct path of
closing chan on goroutine producer side
@iaburton
Contributor Author

iaburton commented Apr 24, 2023

Hey @kanata2 , no problem on the slow response.

Btw, I've functionally reverted #1170 in favor of what I had already fixed here (including that panic), together with the other issues I've described above.

I would appreciate it if you, any other maintainers, and those involved with 1170 ( @lololozhkin & @parsley42 ) double-checked, as the scope of this PR is creeping, which I would rather it not do 🙂

Further, I noticed another race on the deadman timer that I've yet to fix (but can). At the moment it has a 'fixme', in case we want to merge this first and let someone else tackle it, or I can in a follow-up (or here if desired).
EDIT: Just went ahead and fixed the timer issue; see the relevant commit and the OP for information.

@iaburton iaburton changed the title DRAFT: Safer socketmode Safer socketmode Apr 24, 2023
@iaburton iaburton marked this pull request as ready for review April 24, 2023 18:57
…er and ping time channel. Remove deadman file as it is no longer needed
@parsley42
Member

@iaburton FYI, I'm running your branch in production Gopherbot right now. I've reached out to @kanata2 , but it may be that he's gotten too busy with other things to devote much time to this.

I've been reading your patch for a while now this morning, but can't really take the time to fully ingest it (I'm not a daily Go programmer, just a hobbyist w/ Gopherbot, which helps me with my real day job, DevOps). Your thinking and explanations are clear and sensible, so my inclination is to approve/merge, as this fix covers a potential deadlock and race, and has a nicer implementation for the timeout. I'm going to wait a few more days to hear back from Naoki, though.

My other inclination is to inquire if you'd be willing to serve as a maintainer. Looking at the commit history, Naoki has been doing most of the reviews and merges. Let me know if you'd be up for helping out.

@iaburton
Contributor Author

iaburton commented Aug 1, 2023

> @iaburton FYI, I'm running your branch in production Gopherbot right now. I've reached out to @kanata2 , but it may be that he's gotten too busy with other things to devote much time to this.
>
> I've been reading your patch for a while now this morning, but can't really take the time to fully ingest it (I'm not a daily Go programmer, just a hobbyist w/ Gopherbot, which helps me with my real day job, DevOps). Your thinking and explanations are clear and sensible, so my inclination is to approve/merge, as this fix covers a potential deadlock and race, and has a nicer implementation for the timeout. I'm going to wait a few more days to hear back from Naoki, though.
>
> My other inclination is to inquire if you'd be willing to serve as a maintainer. Looking at the commit history, Naoki has been doing most of the reviews and merges. Let me know if you'd be up for helping out.

Hi @parsley42 thanks for running my branch! Pardon my absence as I've been dealing with a few things IRL 🙂

Before I comment on helping with maintenance, may I ask how the branch has fared w/ Gopherbot? Is there anything I can help explain with respect to your having looked over the patch?

@parsley42
Member

Hi @iaburton yeah, work and RL are keeping me busy too, as well as Naoki, I guess.

The original issue (IIRC) was that when the library stopped getting pings from Slack, that alone wasn't sufficient to trigger a reconnect; instead nothing happened until ReadJSON timed out, which took a couple of minutes, causing a long pause whenever a connection went bad. The first bad fix just stopped waiting on ReadJSON and errored out on a missed ping. I didn't see how this could lead to a panic later.

Right now I'm heavily in AWS/Terraform land, and it would be a major context switch to fully grok your patch; I've been reading it for 20 minutes (about all I can allot) and still don't completely follow your changes in e.g. receiveMessagesInto. But since you asked: if you wouldn't mind expanding on how this patch reconnects when pings are missed, but avoids the panic related to ReadJSON, that might help.

In any case, as far as Gopherbot goes, I can only tell you that two of my production robots haven't panicked in the last two weeks, and I'm not seeing any issues. When I looked at your GH profile, it appeared you spend a lot of time writing Go, so I thought maybe this would be a good fit; if not, I'll continue examining contributors to see who else might be able to pitch in.

What we don't want is me trying to review and merge patches. 😬

@parsley42
Member

@iaburton Ok, it looks like I'm going to need to fill in for a while. I'd like the checks to run, but I think you'll need to close and re-open the PR to get them out of the expired state.

@parsley42
Member

Closing and re-opening for tests.

@parsley42 parsley42 closed this Aug 18, 2023
@parsley42 parsley42 reopened this Aug 18, 2023
@parsley42
Member

@lololozhkin Have you looked at this patch? It deals with the same issue as your patch, but with some additions and changes. I wanted to get your thoughts on it.

@lololozhkin
Contributor

@parsley42, hi, I hadn't seen your messages, thank you for mentioning me. Cool patch! I hadn't seen the opportunity to avoid waiting on one of the goroutines, because I didn't know that goroutine was the only producer. Well done!

I am not a maintainer of this repo, so I cannot approve it on GitHub; I can only do it in the comments. Approve :)

Yes, this code fixes my issue too. It's a pity that my code goes in the trash, but this project doesn't need it :)

@lololozhkin
Contributor

I've read this discussion, and I don't understand: did my patch fix your problem with slow reconnects? My patch fixed the panics; it's been running in my production for more than a month with zero panics. And yes, reconnects are handled normally.

If a ping is not received before the deadman timer elapses, the context is cancelled and the cancellation propagates to the JSON reading; the goroutine returns after the ctx cancel.
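
A minimal sketch of that flow; ctx, cancel, and the deadman timer are illustrative stand-ins for the client's internals:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	deadman := time.NewTimer(20 * time.Millisecond) // no ping arrives in this sketch
	defer deadman.Stop()

	go func() {
		select {
		case <-deadman.C: // deadman elapsed without a ping
			cancel() // cancellation propagates to the JSON-reading goroutine
		case <-ctx.Done():
		}
	}()

	<-ctx.Done() // stands in for the reader observing the cancel and returning
	fmt.Println("reader returned:", ctx.Err())
}
```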

Am I missing something?

@parsley42
Member

@lololozhkin Sorry, more context: this patch is by @iaburton , not me. I am a maintainer and I could merge this patch, but I wanted your opinion since it is an alternate fix. Based on the description, it may have value, even if only in allowing the caller to pass in a context. What do you think?

@lololozhkin
Contributor

lololozhkin commented Aug 24, 2023

@parsley42 As I said: yes, from my point of view that patch fixes the problem from my PR. Moreover, I think it fixes it in a better way. So from my point of view, this PR should be merged fully, including allowing usage of contexts in Send and Ack, and the fix for panics on reconnects :)

@parsley42 parsley42 left a comment
Member

The patch looks fine and has been pretty heavily tested; also reviewed by @lololozhkin - merging.

@parsley42 parsley42 merged commit 65cefed into slack-go:master Aug 24, 2023
4 checks passed
@parsley42
Member

@lololozhkin Ok, thanks - you approved in your first message but then replied again with other questions, so I just wanted to clarify. Thanks for being responsive. I'm in the process of trying to recruit more active maintainers for this library and have sent you an invite; I think more seasoned Go developers probably wouldn't pick me as "the" maintainer, but right now it seems others are busy, so I'm trying to get some things moved through. If you'd like to help out, that'd be great; if you know other devs you'd recommend, I'd be happy to hear.

@lololozhkin
Contributor

@parsley42 TY! Accepted your invitation; I think I could be useful for this project :)

I don't know other interested Go devs, but if any appear in my life, I'll let you know.

Successfully merging this pull request may close these issues.

Socketmode panic, write on closed channel
4 participants