[1.19] Erratic (I/O?) behavior from release target that's non-existent in debug target #42610
I'm experiencing erratic behavior in my Rust project when I compile the release target binary that I don't experience at all with the debug target binary. When I perform a database query using
I've created an SSCCE to reproduce the issue. The SSCCE includes much of the research I did while troubleshooting the issue.
While the program runs the same query, the "Packet out of order" error occurs at random locations in the network stream. As stated in the SSCCE readme, Wireshark confirms (to the best of my understanding) there is no data corruption from the MySQL server or network.
The program should receive all query results and print "done" to the console.
At least one time in ten, a "Packet out of order" error occurs, interrupts the transfer, and invalidates the query. Backtrace is provided further below.
I've run this SSCCE 355 times with the debug target with no errors whatsoever. When I run the release target, I find the "Packet out of order" error upwards of 40% of the time. This seems like an optimization bug to me, and I fear I lack the experience to track this down.
I considered there may be buffer corruption somewhere, particularly with the
I reformatted the backtrace to make it easier to read.
I've enlisted the reverse debugger to troubleshoot this problem and identified a call to
At the time the program flakes out, the
Now that the MySQL header is read, it is known that 115 more bytes are needed to complete the packet. Since there are no more bytes to process, another attempt to read from
At this point, I decided to play with the
I don't know if this bug has caught anyone's attention and concern. If so, I have a pcap tcpdump, a reverse debugger recording that enables forward and reverse debugging in gdb, and the source files I've used with slight modifications I'm willing to share.
If someone is interested in pairing via Google hangouts, I can accommodate that, too, and walk them through the debugging I've done thus far.
This bug is holding up a commercial project, and if I can't find a resolution soon, then my company will abandon Rust and pursue alternatives. I'd rather that didn't happen. My boss and I are rather concerned that optimized Rust code exhibits this behavior at all, and wonder what other problems this will cause with our product.
Data point: running the test program on a 32-bit system, I hit the panic quickly, every time.
I've been able to reproduce as well, though on a 64-bit system -- and not every time. Interestingly, I most commonly see the expected seq id to be 102 (
I suspect that this may not be a compiler bug (especially since @inejge sees this every time). However, I'm still investigating, so I may be wrong. The consistency in failure (102/50) leads me to suspect something odd is going on within the program -- not really data racy, either, since the numbers are so far apart. This is all just guesswork though.
I've been unable to find the problem so far. I see about 30% failure rate with --release (115/361 runs as of writing) so it's clear that something is wrong. One thing that I've considered as a potential problem is that I think mysql_async's module for constants isn't correct -- it specifies the following, which may not be what is configured in your database. I'm also worried that some of these are unused, too. Overall, from my surface level debugging, I suspect that this is a bug in mysql_async, though I cannot confirm this. My theory is that packets become unaligned after a specific length packet is sent, leading to this problem, but I have not confirmed it as of yet (and doubt I have the time to do so).
pub static PAYLOAD_OFFSET: usize = 4;
pub static MAX_PACKET_LEN: usize = 16777219;
pub static MAX_PAYLOAD_LEN: usize = 16777215;
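For context on what those numbers mean, here is a minimal sketch of how they relate to the MySQL wire format, where every packet starts with a 3-byte little-endian payload length followed by a 1-byte sequence id. The constant names mirror the ones quoted above, but `parse_header` is a hypothetical helper written for illustration and is not taken from mysql_async.

```rust
/// Bytes of header that precede the payload: 3-byte length + 1-byte sequence id.
pub const PAYLOAD_OFFSET: usize = 4;
/// Largest payload a 3-byte length field can describe: 2^24 - 1.
pub const MAX_PAYLOAD_LEN: usize = (1 << 24) - 1;
/// Largest packet on the wire: header plus maximum payload.
pub const MAX_PACKET_LEN: usize = PAYLOAD_OFFSET + MAX_PAYLOAD_LEN;

/// Hypothetical helper: decodes a header into (payload length, sequence id).
pub fn parse_header(header: [u8; PAYLOAD_OFFSET]) -> (usize, u8) {
    // The length is little-endian in the first three bytes; pad to four for u32.
    let len = u32::from_le_bytes([header[0], header[1], header[2], 0]) as usize;
    (len, header[3])
}
```

If this reading of the protocol is right, the three quoted values are consistent with each other (16777215 + 4 = 16777219) and are fixed by the wire format rather than by server configuration.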
Thanks for the feedback. I'm in the same boat as @Mark-Simulacrum, the error doesn't occur 100%. I've not checked the TCP sequence id myself. I had fixated on the MySQL packet number which varied with each test. Looking back at my packet capture, I have over 2000 TCP packets from the database. I'm not sure how that compares with @Mark-Simulacrum's finding.
The only reason I have to suspect it's a compiler bug is that the debug target runs flawlessly while the optimized release target fails sporadically, but I can't say that's definitive.
One thing to note: I am running my database on a separate machine on my LAN (SSCCE <-> Switch <-> Database). If @Mark-Simulacrum is doing something similar while @inejge is running the database on the same machine as the SSCCE, then that would help explain why the error is consistent for @inejge.
I considered this, too. If
I've gathered my binary, the
I'll spend time today cleaning up my notes and will provide instructions to replay the trace as I have.
To be clear, I think the sequence IDs I'm talking about are not the TCP sequence IDs, but I could be wrong. I just added an assert above the point where the error is created in mysql.
I've attached both a good run and a bad run pcap file in a .zip file for what I'm seeing locally.
Yes, I agree that this is suspicious. I see the same thing with additions of
I am also utilizing a local server (running in the same docker container, actually). I'm actually suspecting that 32-bit may be leading to something else changing (e.g., different byte order in-memory by default)... but I didn't see anything in my overview of the codebase for mysql_async that might be the cause of this. Certainly suspicious, though.
I agree that the constants seem to be unrelated for this specific case, but I worry that we might be seeing a case where the data being loaded is accidentally correct.
@boxofrox To clarify, the client and the server in my test are separate machines on the same LAN. I ran the test on the 32-bit system after being unable to reproduce the bug on the 64-bit machine which hosts the MySQL server, running CentOS 6 -- not a single failure after 500 runs. The crashing 32-bit program ran on an old Ubuntu derivative. I have another 64-bit CentOS machine where I could try to install the test later. I'll report the findings.
One more data point: I cannot reproduce on a 32-bit Ubuntu VM reliably (in fact, the reproduction is harder -- about 4/400 fail). I've also started noticing that it sometimes hangs on start, though I don't know why. Could be related to mysql itself, but not sure. I see similar "good behavior" on the 64-bit VM -- 4/600 runs fail.
I do get the hanging behavior at times myself. Best I can figure is that the program is waiting for network data that did arrive (verified with Wireshark), but the program lost track of it and is expecting more to come.
I suspect the cause is the same for both problems, and it's a matter of whether the bug appears in the middle of the TCP stream (PacketOutOfOrder), or near the end (hung waiting on I/O).
@boxofrox This is a shot in the dark. After staring at mysql_async's packet construction code for a while, one thing started bothering me, so I removed it:
With this change, my predictably-crashing test program passed 500 iterations without a hitch. I don't really know what I'm doing, but it looked to me like the only thing that could explain having the length of 4 at one moment, and zero at the next. If that's the culprit, @Mark-Simulacrum was right in pinning this on mysql_async. But please test it thoroughly, I may be doing something nasty without realizing.
This patch corroborates the story of that previous comment.
Seems all this bug needs is for a network socket read to end with the 4-byte header of a MySQL packet, and fail to read more bytes before the next attempt at parsing.
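To make that failure condition concrete, here is a minimal sketch of an incremental parser that tolerates a socket read ending exactly on the 4-byte header. It is an illustration only, not mysql_async's code and not the patch discussed above; `PacketParser`, `ParseError`, `feed`, and `next_packet` are hypothetical names. The key assumption is that nothing is consumed from the buffer until the whole packet (header plus payload) has arrived, so a retry after a short read sees the same header again.

```rust
/// Hypothetical incremental parser for MySQL-style framed packets.
#[derive(Default)]
pub struct PacketParser {
    buf: Vec<u8>, // bytes received but not yet consumed
    next_seq: u8, // sequence id expected on the next packet
}

#[derive(Debug, PartialEq)]
pub enum ParseError {
    OutOfOrder { expected: u8, got: u8 },
}

impl PacketParser {
    /// Appends the bytes of one socket read; a read may end at any offset.
    pub fn feed(&mut self, chunk: &[u8]) {
        self.buf.extend_from_slice(chunk);
    }

    /// Returns Ok(Some(payload)) for a complete packet,
    /// or Ok(None) when more bytes are needed.
    pub fn next_packet(&mut self) -> Result<Option<Vec<u8>>, ParseError> {
        if self.buf.len() < 4 {
            return Ok(None); // header incomplete
        }
        let len = u32::from_le_bytes([self.buf[0], self.buf[1], self.buf[2], 0]) as usize;
        let seq = self.buf[3];
        if self.buf.len() < 4 + len {
            // Header is present but the payload is not: leave the header
            // in the buffer so the next attempt parses it again.
            return Ok(None);
        }
        if seq != self.next_seq {
            return Err(ParseError::OutOfOrder { expected: self.next_seq, got: seq });
        }
        self.next_seq = self.next_seq.wrapping_add(1);
        let payload = self.buf[4..4 + len].to_vec();
        self.buf.drain(..4 + len); // consume header and payload together
        Ok(Some(payload))
    }
}
```

Feeding only a header and calling `next_packet` any number of times returns `Ok(None)` without losing state; the packet is returned once the payload bytes are fed. A parser that instead discards or resets the header between attempts would misread payload bytes as the next header, which is one way a spurious out-of-order error could arise.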
My hat goes off to you, @inejge. 500 runs with the release build, no errors. I'll file a PR with
Many thanks to you and @Mark-Simulacrum for your involvement.