
v1.6.0 causing rabbit connection errors #160

Closed
rlk833 opened this issue Jan 30, 2023 · 33 comments · Fixed by #161 or #162

Comments

@rlk833

rlk833 commented Jan 30, 2023

We just pulled in v1.6.0 on Friday and within 5 minutes of normal heavy load we started to see connection errors. Sometimes it got so bad we couldn't get a solid connection for an hour. We backed down to v1.5.0 and connections started working again.

We are using:

  • golang v1.18.8
  • RabbitMQ 3.9

In our client logs we see this error:

Channel closed with error - Exception (501) Reason: "short write"

In the Rabbit logs we are seeing:

0.1072.86>  operation none caused a connection exception frame_error: "type 20, all octets = <<>>: {frame_too_large,1936943471,131064}"
Jan 30 07:31:01 ibm-cloud-databases-prod 2023-01-30T12:31:01.498290249Z stdout F 2023-01-30 12:31:01.496657+00:00 [error] <0.22553.86>  operation none caused a connection exception frame_error: "type 3, first 16 octets = <<31,139,8,0,0,0,0,0,0,255,140,85,201,174,163,74>>: {invalid_frame_end_marker,\n                                                                               222}"
Jan 30 08:06:31 ibm-cloud-databases-prod 2023-01-30T13:06:31.1258836Z stdout F 2023-01-30 13:06:31.090056+00:00 [error] <0.5473.88>  operation none caused a connection exception frame_error: "type 20, all octets = <<>>: {frame_too_large,1936943471,131064}"
Jan 30 08:23:40 ibm-cloud-databases-prod 2023-01-30T13:23:40.184025484Z stdout F 2023-01-30 13:23:40.181798+00:00 [error] <0.17411.117>  operation none caused a connection exception frame_error: "type 26, all octets = <<>>: {frame_too_large,2037148769,131064}"
Jan 30 09:19:15 ibm-cloud-databases-prod 2023-01-30T14:19:14.732511148Z stdout F 2023-01-30 14:19:14.708010+00:00 [error] <0.22659.106>  operation none caused a connection exception frame_error: "type 59, all octets = <<>>: {frame_too_large,2621440,131064}"
@lukebakken
Contributor

If I had to guess, it is #142 that caused this change in behavior.

@rlk833 if you could provide more information that would be great.

Could you please quantify "normal heavy load":

  • What is the message rate?
  • Message size?
  • How many connections, channels, queues, etc are in place?

Ideally, code that reproduces the issue would help us out the most. Otherwise, if you can reproduce this issue in a test environment, use git bisect between the v1.5.0 and v1.6.0 tags to figure out which commit caused this behavior.

@lukebakken lukebakken added this to the 1.6.1 milestone Jan 30, 2023
@lukebakken
Contributor

cc @fadams - if you have a second to look at the symptoms and chime in I would appreciate it.

@rlk833
Author

rlk833 commented Jan 30, 2023

Sorry but unfortunately we don't have any usage metrics on Rabbit calls.
All I have is the Rabbit dashboard and it only goes back one day.
But:
Publish and Publish confirm rate: about 8/sec
Deliver Auto-ack: about 40/sec
Message size: I don't know. It varies from small to large, but the distribution of sizes hasn't changed in over a year.

On Avg:
Connections: 32
Channels: 49
Exchanges: 24
Queues: 21
Consumers: 16

@rlk833
Author

rlk833 commented Jan 30, 2023

Most of the errors showing up in the Rabbit logs are frame_too_large; we also got some invalid_frame_end_marker errors.

@lukebakken
Contributor

lukebakken commented Jan 30, 2023

Thanks. I'm asking these questions so I can try to reproduce this issue, so please be as specific as possible.

  • "varies from small to large". Are we talking 1K to several megabytes?
  • Do your publishing applications use one connection or several connections?
  • For each publishing connection, how many channels are used?
  • Are there multiple application goroutines accessing the same connection to publish messages?

@rlk833
Author

rlk833 commented Jan 30, 2023

I understand, but I just don't have the metrics.

It can range from a few bytes to a megabyte (maybe). We actually gzip the body if it is > 2K (see the sketch at the end of this comment).

We use multiple connections - usually one or two per queue that we publish to.

We never use the same channel/connection reentrantly. It is one channel per connection. We only ever do one transaction at a time on a connection/channel; I didn't want a failure in one transaction causing a failure in another transaction simultaneously using the connection/channel. If there are simultaneous requests we have a pool of a few connections that we hand out to each request, and the connection is then returned to the pool.

We try to use one connection per channel.

Connections: 32
Channels: 49
Exchanges: 24
Queues: 21
Consumers: 16
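
Roughly the shape of what we do for the gzip-over-2K case (this is a sketch using the amqp091-go PublishWithContext API, not our actual code; the package, function and content-type names and the exact threshold are illustrative):

    // Hypothetical helper, not our real code: gzip the body when it exceeds
    // 2K and mark the message so consumers know to decompress it.
    package rabbitutil

    import (
        "bytes"
        "compress/gzip"
        "context"

        amqp "github.com/rabbitmq/amqp091-go"
    )

    func publishMaybeGzipped(ctx context.Context, ch *amqp.Channel, exchange, key string, body []byte) error {
        encoding := ""
        if len(body) > 2048 {
            var buf bytes.Buffer
            zw := gzip.NewWriter(&buf)
            if _, err := zw.Write(body); err != nil {
                return err
            }
            if err := zw.Close(); err != nil {
                return err
            }
            body = buf.Bytes()
            encoding = "gzip"
        }
        return ch.PublishWithContext(ctx, exchange, key, false, false, amqp.Publishing{
            ContentType:     "application/octet-stream",
            ContentEncoding: encoding,
            Body:            body,
        })
    }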

@rlk833
Author

rlk833 commented Jan 30, 2023

Just took a closer look. Publishing ALWAYS uses its own channel and connection. No sharing.

Consuming: we have a few consumers that share a channel when consuming.

@fadams
Contributor

fadams commented Jan 31, 2023

Hi @lukebakken, I'm afraid I won't be able to have a proper look/think until the weekend 'cause work.
Some quick thoughts. The thing I've mostly been working on is something of a "wrapper" around amqp091-go intended to make reconnections more "transparent" from a client perspective. It also does "fancy" things like allowing a connection pool, so a client publish can get sent to a Go channel and multiple goroutines consume from that and publish to the underlying AMQP connection/channel. I mention that 'cause it's vaguely fancy and I've been hitting it with multiple hundreds of thousands of messages/s across multiple producer/consumer connections quite happily.

The only place I can think of where I've seen anything resembling what has been described here was where my main goroutine terminated (too) early, but the publisher goroutine didn't have cancellation in place.

So: 1. If I were to bet, I suspect there is a bug somewhere in the (user's) application client code where a connection/channel has been closed and pulled the rug from under a publish. 2. What you've done with the defer to force a flush on an error looks like a good call.

TBH I can't recall the exact error I saw; it was intermittent IIRC and I put it down to my (rather hacky) application code terminating pretty uncleanly at the time. When I have a little spare time I'll see if I can "unfix" what I did to my application and try to reproduce. But your "flush on error" change looks like a good call.

The only other thing I can think of off the top of my head is the bit around chunking the body into frame-sized pieces. In my original change I had:

		// chunk body into size (max frame size - frame header size)
		for i, j := 0, size; i < len(body); i, j = j, j+size {
			if j > len(body) {
				j = len(body)
			}

			if err = ch.connection.sendUnflushed(&bodyFrame{
				ChannelId: ch.id,
				Body:      body[i:j],
			}); err != nil {
				return
			}
		}

		// Flush the buffer only after all the Frames that comprise the Message
		// have been written to maximise benefits of using a buffered writer.
		err = ch.connection.flush()

and I did wonder whether doing a flush per chunk might be the thing to do. But with buffered IO, when the underlying buffer length is exceeded the buffer flushes implicitly; explicit flushes only matter when you have writes smaller than the buffer length, if you see what I mean. I suspect that's a red herring, but it's worth mentioning. If issues had only ever been observed for large messages I might wonder about that more, but I suspect latent client application "lifecycle" issues exposing an error-handling edge case.
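
A minimal standalone illustration of that buffered-writer behaviour (plain bufio, nothing amqp091-go specific): a write that doesn't fit in the remaining buffer space triggers an implicit flush, so an explicit Flush only matters for data still sitting below the buffer size.

    package main

    import (
        "bufio"
        "bytes"
        "fmt"
    )

    func main() {
        var underlying bytes.Buffer
        w := bufio.NewWriterSize(&underlying, 16)

        w.Write([]byte("tiny")) // fits: stays buffered, nothing reaches `underlying` yet
        fmt.Println(underlying.Len(), w.Buffered()) // 0 4

        w.Write(bytes.Repeat([]byte("x"), 32)) // doesn't fit: buffer is flushed implicitly
        fmt.Println(underlying.Len(), w.Buffered()) // 36 0

        w.Write([]byte("tail")) // small again: buffered until an explicit Flush
        w.Flush()
        fmt.Println(underlying.Len(), w.Buffered()) // 40 0
    }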

That's all I can think of for now I'm afraid. I'll try to dig more deeply when I have a little more time.

@lukebakken
Contributor

lukebakken commented Jan 31, 2023

I really appreciate it @fadams

@rlk833 - if you can test the changes I have made in #161 that would be great. I'm still working on reproducing this issue using this project. Of course everything works fine in my environment at the moment.

@rlk833
Author

rlk833 commented Jan 31, 2023

Take a look at this stackoverflow question about "short write"

https://stackoverflow.com/questions/66174520/how-to-solve-short-write-error-when-writing-to-csv

Are your sync locks correct? The last answer there mentions using bufio and writing and flushing at the same time.

Though on the Rabbit side we are also getting a lot of frame_too_large errors.

@fadams
Contributor

fadams commented Jan 31, 2023

Ahh interesting, so on https://github.com/rabbitmq/amqp091-go/blob/main/connection.go#L490 it probably needs to read

	c.sendM.Lock()
	err = buf.Flush()
	c.sendM.Unlock()

That's definitely possible/plausible, and if so, my bad. In all honesty I actively avoid publishing from multiple goroutines concurrently; I can't think of a good reason to do it other than convenience. Certainly from a performance/throughput perspective, multiple goroutines publishing to a given channel, or even to multiple channels on the same connection, has never seemed to make any positive difference. So for throughput I always end up creating a pool of connections, each with an (AMQP) channel, and a handler goroutine for each connection in the pool receiving data from a (Go) channel.
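
A rough sketch of that pool pattern (illustrative only, not my wrapper code; it assumes the amqp091-go Dial/Channel/PublishWithContext API, and all other names are made up):

    package publisherpool

    import (
        "context"
        "log"

        amqp "github.com/rabbitmq/amqp091-go"
    )

    type publishReq struct {
        exchange, key string
        body          []byte
    }

    // startPublisherPool opens `size` connections, each with one channel and one
    // handler goroutine draining the shared work channel.
    func startPublisherPool(url string, size int, work <-chan publishReq) error {
        for i := 0; i < size; i++ {
            conn, err := amqp.Dial(url)
            if err != nil {
                return err
            }
            ch, err := conn.Channel()
            if err != nil {
                return err
            }
            go func(ch *amqp.Channel) {
                // Only this goroutine ever touches this connection/channel,
                // so publishers can never interleave frames on the wire.
                for req := range work {
                    err := ch.PublishWithContext(context.Background(),
                        req.exchange, req.key, false, false,
                        amqp.Publishing{Body: req.body})
                    if err != nil {
                        log.Printf("publish failed: %v", err)
                    }
                }
            }(ch)
        }
        return nil
    }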

@rlk833
Author

rlk833 commented Jan 31, 2023

I too try to single-thread use of a connection, except that for some reason my consume queues share a connection. But I am worried there may be some weird spot where it is not single-threaded on a connection.

But is there a lock on the write too? It is possible that during a write a flush occurs under the covers if the buffer becomes full.

@lukebakken
Contributor

lukebakken commented Jan 31, 2023

OK, now I think I see the issue. ALL sendUnflushed and subsequent flush() calls must be protected by the same lock. I'm making the change in my PR now.

If these writes aren't serialized you can get interleaved frame writes on the connection if multiple goroutines use the same connection.
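
As a toy illustration of the failure mode (not the library code): if each frame write takes the mutex separately, two publishers sharing a connection can interleave their frames in the shared buffered writer; holding one lock across all the frames of a message, plus the flush, keeps each message contiguous, which is roughly what the PR arranges.

    package main

    import (
        "bufio"
        "os"
        "sync"
    )

    var (
        mu sync.Mutex
        w  = bufio.NewWriter(os.Stdout)
    )

    // Lock per frame: frames from concurrently published messages may interleave.
    func sendFramesLockPerFrame(frames []string) {
        for _, f := range frames {
            mu.Lock()
            w.WriteString(f)
            mu.Unlock()
        }
        mu.Lock()
        w.Flush()
        mu.Unlock()
    }

    // Lock per message: the whole frame sequence plus the flush is atomic with
    // respect to other publishers on the same connection.
    func sendFramesLockPerMessage(frames []string) {
        mu.Lock()
        defer mu.Unlock()
        for _, f := range frames {
            w.WriteString(f)
        }
        w.Flush()
    }

    func main() {
        var wg sync.WaitGroup
        for i := 0; i < 2; i++ {
            wg.Add(1)
            go func(tag string) {
                defer wg.Done()
                // Swap in sendFramesLockPerMessage to see contiguous output.
                sendFramesLockPerFrame([]string{tag + "-header ", tag + "-body ", tag + "-end\n"})
            }(string(rune('A' + i)))
        }
        wg.Wait()
    }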

@lukebakken
Contributor

lukebakken commented Jan 31, 2023

@rlk833 I'm reasonably certain that #161 will fix your issue. It would be greatly appreciated if you can test that PR in your environment, since it seems you can readily reproduce the issue.

If you need me to merge and tag a beta release of 1.6.1 I can do that as well.

@rlk833
Author

rlk833 commented Jan 31, 2023

Busy at the moment - day of meetings, yuck. But I will get to it. Tomorrow is free so far.

I think I can ask for a specific commit ID in go.mod. I'll need to look at that.

@fadams
Contributor

fadams commented Jan 31, 2023

@lukebakken minor thing, but do you need startSendUnflushed()/endSendUnflushed() as explicit methods when they simply lock and unlock the mutex respectively?

Wouldn't it actually be clearer to directly have ch.connection.sendM.Lock() where you have ch.connection.startSendUnflushed() and ch.connection.sendM.Unlock() where you have ch.connection.endSendUnflushed()?

There are arguments both ways I guess and it is really minor, but it'd make it more explicit that all the writes and flush for the message are being protected by that mutex.

@lukebakken
Contributor

I find it easier to understand what is going on with the code as I modified it. I just moved the flush into endSendUnflushed() as well.

I thought about adding an assertion to ensure that sendUnflushed is called only after startSendUnflushed but that seems like overkill at this time.
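
For anyone following along, the resulting shape is roughly this (a toy stand-in, not the actual amqp091-go connection code; sendM, flush, sendUnflushed and start/endSendUnflushed are the names from this thread, everything else is assumed):

    package main

    import (
        "bufio"
        "os"
        "sync"
    )

    type connection struct {
        sendM sync.Mutex
        buf   *bufio.Writer
    }

    func (c *connection) flush() error { return c.buf.Flush() }

    func (c *connection) startSendUnflushed() { c.sendM.Lock() }

    // endSendUnflushed flushes and only then releases the lock, so every frame
    // of a message plus its flush sits inside one critical section.
    func (c *connection) endSendUnflushed() error {
        defer c.sendM.Unlock()
        return c.flush()
    }

    func (c *connection) sendUnflushed(frame []byte) error {
        _, err := c.buf.Write(frame)
        return err
    }

    func main() {
        c := &connection{buf: bufio.NewWriter(os.Stdout)}
        c.startSendUnflushed()
        c.sendUnflushed([]byte("method "))
        c.sendUnflushed([]byte("header "))
        c.sendUnflushed([]byte("body\n"))
        _ = c.endSendUnflushed()
    }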

@fadams
Contributor

fadams commented Jan 31, 2023

Haha NP, we all visualise things differently, right :-) TBF the startSendUnflushed/endSendUnflushed imply a sort of semantic "transaction", as I said it's minor.

On the plus side, what you've done is likely to give a (tiny) performance improvement in the non-concurrent case, as the lock is taken at the start of the "transaction" and released at the end rather than for each frame write. TBF, for the case of no lock contention I think it'll only save a couple of atomic CAS operations, but hey, it's a bonus :-)

lukebakken added a commit that referenced this issue Jan 31, 2023
…flushed writes (#161)

* Ensure flush happens even with errors

Fixes #160

* Use correct lock around unflushed frame writes

* formatting

* Move flushing to endSendUnflushed
@lukebakken
Contributor

@rlk833 I have tagged version v1.6.1-rc.1 that you should be able to use to test in your environment. Once we get confirmation from you I can produce the final release. Again, thanks a lot for helping out!

@fadams
Contributor

fadams commented Jan 31, 2023

@lukebakken sorry, I've just had another thought on this. In your recent change the rationale is "ALL sendUnflushed and subsequent flush() calls must be protected by the same lock" and "If these writes aren't serialized you can get interleaved frame writes on the connection if multiple goroutines use the same connection".

But thinking back to the original 1.5.0 code (which just used send()), the mutex there just protected WriteFrame() (https://github.com/rabbitmq/amqp091-go/blob/main/connection.go#L429), so even before the flush code was separated out, frames could have been interleaved if multiple goroutines wrote to the connection.

I think there's a subtlety. My guess is that most people are likely to stand up multiple channels if they are planning on writing from multiple goroutines, and in that case frames from different channels interleaved on the connection should be fine - that, I think, was the original behaviour - so simply putting the mutex around the flush would have been good enough.

With your recent change the mutex protects concurrent access with a wider critical section, so it now covers the case of, say, multiple concurrent writes to the same channel, because it prevents interleaving. The consequence is that it now (I think) serialises access from multiple channels too. If you have, say, two goroutines each writing to a different channel, it would have been OK for those frames to be interleaved (though the Flush should still be protected to prevent an accidental flush mid-write). By serialising the writing of the entire message, it could (I think) potentially increase latency in the case of two channels with two concurrent writers where a large message is being written, since previously each "chunked" body frame from each channel could have been interleaved.

@lukebakken
Contributor

I see what you're saying. Let me think about it, read up on how the underlying writer flushes data when it's a socket, and see what @rlk833 reports back.

@lukebakken
Contributor

though the Flush should be protected to prevent an accidental flush mid-write

Aha, yep, this is the root of the issue introduced by #142. I'll open a new PR.

@lukebakken
Contributor

@fadams thanks again for continuing to think about this issue.

@rlk833 - I just tagged v1.6.1-rc.2

@Zerpet
Contributor

Zerpet commented Feb 1, 2023

I've written a benchmark test that publishes 1_000_000 messages, split between 10 goroutines, with a body of between 10 and 15 bytes. Publishers await confirmations, with up to 100 confirmations in flight. RabbitMQ 3.11.8 in Docker on my Mac laptop. The results confirm what has been discussed here 🙂 In summary, #142 provided a huge boost in performance: tag v1.6.1-rc.1 shows this boost compared to v1.5.0, and tag v1.6.1-rc.2 shows a small further boost compared to v1.6.1-rc.1. (A rough sketch of the benchmark setup follows the results.)

The results:

### 10 goroutines for channel.PublishWithContext

v1.5.0
goos: darwin
goarch: arm64
pkg: github.com/rabbitmq/amqp091-go
BenchmarkChannel_PublishWithContext
BenchmarkChannel_PublishWithContext-10    	       1	16480017750 ns/op	     60679 msg/s

------

v1.6.1-rc.1
goos: darwin
goarch: arm64
pkg: github.com/rabbitmq/amqp091-go
BenchmarkChannel_PublishWithContext
BenchmarkChannel_PublishWithContext-10    	       1	12594736667 ns/op	     79398 msg/s

-------

v1.6.1-rc.2
goos: darwin
goarch: arm64
pkg: github.com/rabbitmq/amqp091-go
BenchmarkChannel_PublishWithContext
BenchmarkChannel_PublishWithContext-10    	       1	12244268958 ns/op	     81671 msg/s

I ran each benchmark 3 times; they all yield similar results (a difference of 100-200 msg/s between runs).
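
For reference, a rough sketch of the benchmark setup (not the actual benchmark code; the connection URL, queue name and per-goroutine topology are assumptions):

    package main

    import (
        "context"
        "log"
        "sync"

        amqp "github.com/rabbitmq/amqp091-go"
    )

    func main() {
        const (
            publishers  = 10
            perWorker   = 100_000 // 1_000_000 messages in total
            maxInFlight = 100
        )

        conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        var wg sync.WaitGroup
        for p := 0; p < publishers; p++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                ch, err := conn.Channel()
                if err != nil {
                    log.Fatal(err)
                }
                if err := ch.Confirm(false); err != nil {
                    log.Fatal(err)
                }
                if _, err := ch.QueueDeclare("bench-queue", false, false, false, false, nil); err != nil {
                    log.Fatal(err)
                }
                confirms := ch.NotifyPublish(make(chan amqp.Confirmation, maxInFlight))

                sem := make(chan struct{}, maxInFlight) // tokens = unconfirmed publishes
                done := make(chan struct{})
                go func() { // release one token per confirmation
                    for range confirms {
                        <-sem
                    }
                    close(done)
                }()

                for i := 0; i < perWorker; i++ {
                    sem <- struct{}{}
                    err := ch.PublishWithContext(context.Background(),
                        "", "bench-queue", false, false,
                        amqp.Publishing{Body: []byte("hello world!")})
                    if err != nil {
                        log.Fatal(err)
                    }
                }
                ch.Close() // closing the channel closes the confirms channel
                <-done
            }()
        }
        wg.Wait()
    }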

@lukebakken lukebakken reopened this Feb 1, 2023
@lukebakken
Contributor

I re-opened this issue because we're waiting on confirmation from @rlk833 that v1.6.1-rc.2 fixes the issue.

@rlk833
Author

rlk833 commented Feb 1, 2023

We just pushed it to our staging/test system. After two hours, no errors yet. Hopefully we get a load as heavy as the one we had when we saw the errors.

We are going to let it run for a day to see if anything develops.

@lukebakken
Contributor

Thanks for the update.

@rlk833
Author

rlk833 commented Feb 2, 2023

All right. It has been over 24 hours and there have been zero errors with the Rabbit Client.

I think this patch is OK to go live.

@lukebakken
Contributor

Roger that, thanks again!

@lukebakken
Contributor

@rlk833 - https://github.com/rabbitmq/amqp091-go/releases/tag/v1.6.1

@fadams
Contributor

fadams commented Feb 3, 2023

That's great news!

@slagiewka
Contributor

I can also confirm that 1.6.0 was broken for our top producer, which was updated to 1.6.0 this morning. Unfortunately the update came via our internal library, which had 1.6.0 locked in earlier, so 1.6.1 hadn't made it there yet.

The producer started showing issues immediately with 1.6.0. Luckily 1.6.1 is ready; I tested it and everything works fine. Thank you all for reporting and fixing 😉

@fadams
Contributor

fadams commented Feb 8, 2023

By way of completeness I put together something roughly based on the send.go tutorial https://github.com/rabbitmq/rabbitmq-tutorials/blob/main/go/send.go that reproduces this issue.

https://gist.github.com/fadams/d427e1faba74942d429a446027751689

It barfs with "panic: Failed to publish a message: short write" with 1.6.0 and seems happy when the flush is protected by the connection Mutex.

When I was playing with that it was interesting to note that multiple goroutines writing to the same AMQP channel didn't seem to barf, which is kind of surprising TBH, but having multiple AMQP channels as per the gist barfs pretty much immediately.
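
For completeness, here is a condensed sketch of the shape of that reproducer (the real code is in the gist above; the queue name, message count and goroutine count here are illustrative): one connection, several channels, one publishing goroutine per channel.

    package main

    import (
        "context"
        "log"
        "sync"

        amqp "github.com/rabbitmq/amqp091-go"
    )

    func main() {
        conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        var wg sync.WaitGroup
        for i := 0; i < 4; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                // Each goroutine gets its own channel, but they all share one connection.
                ch, err := conn.Channel()
                if err != nil {
                    log.Fatal(err)
                }
                q, err := ch.QueueDeclare("hello", false, false, false, false, nil)
                if err != nil {
                    log.Fatal(err)
                }
                for n := 0; n < 100_000; n++ {
                    err := ch.PublishWithContext(context.Background(),
                        "", q.Name, false, false,
                        amqp.Publishing{ContentType: "text/plain", Body: []byte("Hello World!")})
                    if err != nil {
                        log.Panicf("Failed to publish a message: %v", err)
                    }
                }
            }()
        }
        wg.Wait()
    }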

It's also kind of amusing to note that the throughput is higher for one goroutine than for multiple, which is probably counterintuitive. My suspicion is that with one goroutine and no lock contention the mutexes are basically just doing atomic CAS operations, whereas with multiple goroutines there is a lot of contention, hitting the mutex slow path and causing context switches. Using multiple connections seems to be a more reliable way to improve throughput.
