New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1.19] Erratic (I/O?) behavior from release target that's non-existent in debug target #42610

Closed
boxofrox opened this Issue Jun 12, 2017 · 12 comments

Comments

Projects
None yet
3 participants
@boxofrox

boxofrox commented Jun 12, 2017

Problem:

I'm experiencing erratic behavior in my Rust project when I compile the release target binary that I don't experience at all with the debug target binary. When I perform a database query using tokio-rs and the mysql_async library involving a result set with 100k+ rows, sometimes the results are acquired without issue, seldomly the program hangs (I presume waiting for I/O that arrived according to Wireshark on the same machine, but was lost), otherwise a "Packet out of order" error occurs in the MySQL protocol parser.

I've created an SSCCE[1] to reproduce the issue. The SCCEE includes much of the research I performed troubleshooting the issue.

While the program runs the same query, the "Packet out of order" error occurs at random locations in the network stream. As stated in the SSCCE readme[2], Wireshark confirms (to the best of my understanding) there is no data corruption from the MySQL server or network.

[1]: https://github.com/boxofrox/sscce-future-is-buggy-01
[2]: https://github.com/boxofrox/sscce-future-is-buggy-01#how-i-know-this-isnt-a-networkdatabase-server-issue

Expected Behavior:

The program should receive all query results and print "done" to the console.

Actual behavior:

At least one in tens times, a "Packet out of order" error occurs and interrupts the transfer, and invalidates the query. Backtrace is provided further below.

Additional Notes:

I've run this SSCCE 355 times with the debug target with no errors whatsoever. When I run the release target, I find the "Packet out of order" error upwards of 40% of the time. This seems like an optimization bug to me, and I fear I lack the experience to track this down.

I considered there may be buffer corruption somewhere, particularly with the std::collections::VecDeque ring buffer used by mysql_async to buffer data before parsing MySQL packets. Perhaps the buffer head wraps around, laps the tail, and overwrites to-be-used data? However, I can't reconcile the notion that the slower debug target should have trouble reading the TCP buffer down to zero bytes and move on to parsing before the next network packet arrives, and yet it runs fine.

Meta:

▶ rustup run nightly rustc --version --verbose
rustc 1.19.0-nightly (f062832b2 2017-06-07)
binary: rustc
commit-hash: f062832b208e94f2f0f26ed7fb5c48c172069fbe
commit-date: 2017-06-07
host: x86_64-unknown-linux-gnu
release: 1.19.0-nightly
LLVM version: 4.0

Backtrace:

I reformatted the backtrace to make it easier to read.

▶ DSN=mysql://testbot:testbot@192.168.70.162/fake_data RUST_BACKTRACE=1 RUST_LOG=bug_test_01=debug target/release/bug-test-01
DEBUG:bug_test_01: running
thread 'main' panicked at 'error resolving future:
  Error ( Msg("error resolving future")
        , State { next_error: Some ( Error ( PacketOutOfOrder
                                           , State { next_error: None
                                                   , backtrace: Some ( stack backtrace:

   0:     0x5573abad9c2e - backtrace::backtrace::libunwind::trace
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.2/src/backtrace/libunwind.rs:53
                         - backtrace::backtrace::trace<closure>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.2/src/backtrace/mod.rs:42
   1:     0x5573abad9d73 - backtrace::capture::{{impl}}::new
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.2/src/capture.rs:64
   2:     0x5573aba6b916 - error_chain::make_backtrace
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.10.0/src/lib.rs:417
   3:     0x5573aba6b9bd - error_chain::{{impl}}::default
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.10.0/src/lib.rs:504
   4:     0x5573aba59739 - mysql_async::errors::{{impl}}::from_kind
                        at /home/boxofrox/files/development/phreebooks-search/work/bugs/001 - spurious packet out of order/bug-test-01/<error_chain_processed macros>:52
                         - mysql_async::errors::{{impl}}::from
                        at /home/boxofrox/files/development/phreebooks-search/work/bugs/001 - spurious packet out of order/bug-test-01/<error_chain_processed macros>:8
                         - core::convert::{{impl}}::into<mysql_async::errors::ErrorKind,mysql_async::errors::Error>
                        at /checkout/src/libcore/convert.rs:398
                         - mysql_async::conn::futures::read_packet::{{impl}}::poll
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/mysql_async-0.9.3/src/conn/futures/read_packet.rs:49
   5:     0x5573aba572b0 - mysql_async::conn::futures::query_result::{{impl}}::either_poll<mysql_async::conn::futures::query_result::TextResult>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/mysql_async-0.9.3/src/conn/futures/query_result/mod.rs:202
   6:     0x5573aba58505 - mysql_async::conn::futures::query_result::{{impl}}::poll<mysql_async::conn::futures::query_result::TextResult>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/mysql_async-0.9.3/src/conn/futures/query_result/mod.rs:220
   7:     0x5573aba5926c - mysql_async::conn::futures::query_result::{{impl}}::poll
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/mysql_async-0.9.3/src/conn/futures/query_result/mod.rs:433
   8:     0x5573aba2d348 - mysql_async::conn::futures::query_result::futures::drop_result::{{impl}}::poll<mysql_async::conn::futures::query_result::TextQueryResult>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/mysql_async-0.9.3/src/conn/futures/query_result/futures/drop_result.rs:37
   9:     0x5573aba2dea7 - futures::future::chain::{{impl}}::poll<futures::future::and_then::AndThen<mysql_async::conn::pool::futures::get_conn::GetConn, mysql_async::conn::futures::query::Query, closure>,mysql_async::conn::futures::query_result::futures::drop_result::DropResult<mysql_async::conn::futures::query_result::TextQueryResult>,closure,closure>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.14/src/future/chain.rs:32
                         - futures::future::and_then::{{impl}}::poll<futures::future::and_then::AndThen<mysql_async::conn::pool::futures::get_conn::GetConn, mysql_async::conn::futures::query::Query, closure>,mysql_async::conn::futures::query_result::futures::drop_result::DropResult<mysql_async::conn::futures::query_result::TextQueryResult>,closure>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.14/src/future/and_then.rs:32
                         - futures::task_impl::std::{{impl}}::poll_future::{{closure}}<futures::future::and_then::AndThen<futures::future::and_then::AndThen<mysql_async::conn::pool::futures::get_conn::GetConn, mysql_async::conn::futures::query::Query, closure>, mysql_async::conn::futures::query_result::futures::drop_result::DropResult<mysql_async::conn::futures::query_result::TextQueryResult>, closure>>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.14/src/task_impl/std/mod.rs:242
                         - futures::task_impl::{{impl}}::enter::{{closure}}<futures::future::and_then::AndThen<futures::future::and_then::AndThen<mysql_async::conn::pool::futures::get_conn::GetConn, mysql_async::conn::futures::query::Query, closure>, mysql_async::conn::futures::query_result::futures::drop_result::DropResult<mysql_async::conn::futures::query_result::TextQueryResult>, closure>,closure,core::result::Result<futures::poll::Async<mysql_async::conn::Conn>, mysql_async::errors::Error>>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.14/src/task_impl/mod.rs:352
  10:     0x5573aba373de - futures::task_impl::std::set<closure,core::result::Result<futures::poll::Async<mysql_async::conn::Conn>, mysql_async::errors::Error>>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.14/src/task_impl/std/mod.rs:90
                         - futures::task_impl::{{impl}}::enter<futures::future::and_then::AndThen<futures::future::and_then::AndThen<mysql_async::conn::pool::futures::get_conn::GetConn, mysql_async::conn::futures::query::Query, closure>, mysql_async::conn::futures::query_result::futures::drop_result::DropResult<mysql_async::conn::futures::query_result::TextQueryResult>, closure>,closure,core::result::Result<futures::poll::Async<mysql_async::conn::Conn>, mysql_async::errors::Error>>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.14/src/task_impl/mod.rs:352
                         - futures::task_impl::std::{{impl}}::poll_future<futures::future::and_then::AndThen<futures::future::and_then::AndThen<mysql_async::conn::pool::futures::get_conn::GetConn, mysql_async::conn::futures::query::Query, closure>, mysql_async::conn::futures::query_result::futures::drop_result::DropResult<mysql_async::conn::futures::query_result::TextQueryResult>, closure>>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.14/src/task_impl/std/mod.rs:242
                         - tokio_core::reactor::{{impl}}::run::{{closure}}<futures::future::and_then::AndThen<futures::future::and_then::AndThen<mysql_async::conn::pool::futures::get_conn::GetConn, mysql_async::conn::futures::query::Query, closure>, mysql_async::conn::futures::query_result::futures::drop_result::DropResult<mysql_async::conn::futures::query_result::TextQueryResult>, closure>>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-core-0.1.7/src/reactor/mod.rs:243
                         - scoped_tls::{{impl}}::set<tokio_core::reactor::Core,closure,core::result::Result<futures::poll::Async<mysql_async::conn::Conn>, mysql_async::errors::Error>>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/scoped-tls-0.1.0/src/lib.rs:135
                         - tokio_core::reactor::{{impl}}::run<futures::future::and_then::AndThen<futures::future::and_then::AndThen<mysql_async::conn::pool::futures::get_conn::GetConn, mysql_async::conn::futures::query::Query, closure>, mysql_async::conn::futures::query_result::futures::drop_result::DropResult<mysql_async::conn::futures::query_result::TextQueryResult>, closure>>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-core-0.1.7/src/reactor/mod.rs:242
                         - bug_test_01::main
                        at src/main.rs:37
  11:     0x5573abafce1a - panic_unwind::__rust_maybe_catch_panic
                        at /checkout/src/libpanic_unwind/lib.rs:98
  12:     0x5573abaf6518 - std::panicking::try<(),closure>
                        at /checkout/src/libstd/panicking.rs:433
                         - std::panic::catch_unwind<closure,()>
                        at /checkout/src/libstd/panic.rs:361
                         - std::rt::lang_start
                        at /checkout/src/libstd/rt.rs:59
  13:     0x7f0081732439 - __libc_start_main
  14:     0x5573aba2d209 - _start
  15:                0x0 - <unknown>

                                                                     ) // end Some
                                                   } // end State
                                           ) // end Error
                                   ) // end Some
                , backtrace: Some ( stack backtrace:

   0:     0x5573abad9c2e - backtrace::backtrace::libunwind::trace
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.2/src/backtrace/libunwind.rs:53
                         - backtrace::backtrace::trace<closure>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.2/src/backtrace/mod.rs:42
   1:     0x5573abad9d73 - backtrace::capture::{{impl}}::new
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.2/src/capture.rs:64
   2:     0x5573aba6b916 - error_chain::make_backtrace
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.10.0/src/lib.rs:417
   3:     0x5573aba37652 - core::ops::FnOnce::call_once<fn() -> core::option::Option<alloc::arc::Arc<backtrace::capture::Backtrace>>,()>
                        at /checkout/src/libcore/ops.rs:2683
                         - core::option::{{impl}}::or_else<alloc::arc::Arc<backtrace::capture::Backtrace>,fn() -> core::option::Option<alloc::arc::Arc<backtrace::capture::Backtrace>>>
                        at /checkout/src/libcore/option.rs:658
                         - error_chain::{{impl}}::new<bug_test_01::Error>
                        at /home/boxofrox/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.10.0/src/lib.rs:518
                         - bug_test_01::{{impl}}::chain_err::{{closure}}<mysql_async::conn::Conn,mysql_async::errors::Error,closure,&str>
                        at /home/boxofrox/files/development/phreebooks-search/work/bugs/001 - spurious packet out of order/bug-test-01/<error_chain_processed macros>:131
                         - core::result::{{impl}}::map_err<mysql_async::conn::Conn,mysql_async::errors::Error,bug_test_01::Error,closure>
                        at /checkout/src/libcore/result.rs:487
                         - bug_test_01::{{impl}}::chain_err<mysql_async::conn::Conn,mysql_async::errors::Error,closure,&str>
                        at /home/boxofrox/files/development/phreebooks-search/work/bugs/001 - spurious packet out of order/bug-test-01/<error_chain_processed macros>:129
                         - bug_test_01::main
                        at src/main.rs:37
   4:     0x5573abafce1a - panic_unwind::__rust_maybe_catch_panic
                        at /checkout/src/libpanic_unwind/lib.rs:98
   5:     0x5573abaf6518 - std::panicking::try<(),closure>
                        at /checkout/src/libstd/panicking.rs:433
                         - std::panic::catch_unwind<closure,()>
                        at /checkout/src/libstd/panic.rs:361
                         - std::rt::lang_start
                        at /checkout/src/libstd/rt.rs:59
   6:     0x7f0081732439 - __libc_start_main
   7:     0x5573aba2d209 - _start
   8:                0x0 - <unknown>

                                  ) // Some
                } // end State
        )', /checkout/src/libcore/result.rs:860

stack backtrace:
   0: std::sys::imp::backtrace::tracing::imp::unwind_backtrace
             at /checkout/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::_print
             at /checkout/src/libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at /checkout/src/libstd/sys_common/backtrace.rs:60
             at /checkout/src/libstd/panicking.rs:355
   3: std::panicking::default_hook
             at /checkout/src/libstd/panicking.rs:371
   4: std::panicking::rust_panic_with_hook
             at /checkout/src/libstd/panicking.rs:549
   5: std::panicking::begin_panic
             at /checkout/src/libstd/panicking.rs:511
   6: std::panicking::begin_panic_fmt
             at /checkout/src/libstd/panicking.rs:495
   7: rust_begin_unwind
             at /checkout/src/libstd/panicking.rs:471
   8: core::panicking::panic_fmt
             at /checkout/src/libcore/panicking.rs:69
   9: core::result::unwrap_failed
             at /checkout/src/libcore/macros.rs:29
  10: bug_test_01::main
             at /checkout/src/libcore/result.rs:762
             at src/main.rs:37
  11: __rust_maybe_catch_panic
             at /checkout/src/libpanic_unwind/lib.rs:98
  12: std::rt::lang_start
             at /checkout/src/libstd/panicking.rs:433
             at /checkout/src/libstd/panic.rs:361
             at /checkout/src/libstd/rt.rs:59
  13: __libc_start_main
  14: _start

@boxofrox boxofrox changed the title from [1.9] Erratic (I/O?) behavior from release build that's non-existent in debug build to [1.19] Erratic (I/O?) behavior from release target that's non-existent in debug target Jun 12, 2017

@boxofrox

This comment has been minimized.

boxofrox commented Jun 17, 2017

I've enlisted the reverse debugger to troubleshoot this problem and identified a call to Vec::len() that gets optimized away into a musical chairs dance between the stack and registers.

At the time the program flakes out, the mysql_async library has read the last four bytes retrieved from the tokio_core::net::TcpStream into an intermediate data structure mysql_async::proto::NewPacket. These last four bytes are the 4-byte header of the next MySQL packet to parse. These last four bytes are copied into the NewPacket.header, a std::vec::Vec (assertion NewPacket.header.len() == 4).

Now that the MySQL header is read, it is known that 115 more bytes are needed to complete the packet. Since there are no more bytes to process, another attempt to read from TcpStream commences, and fails with Err(WouldBlock), at which point the mysql_async library schedules a future read, then tries to reparse the intermediate data structure NewPacket one more time in case some bytes had been read.

The NewPacket.parse() function is a state machine transitioner that checks the current state, and with its return value, schedules where the next bytes read from the TcpStream shall go. It's already pulled four bytes for the header, the buffer was empty, and there are no more bytes to be had at this point. When the parse function checks to see if the header has all its bytes, it's here that the register--that represents NewPacket.header.len()--contains 0x00 instead of 0x04. The code erroneously returns a ParseResult::NeedHeader, which corrupts further reading of the TcpStream.

At this point, I decided to play with the opt-level settings in Cargo.toml. opt-level = 1 runs fine, while opt-level = 2 and opt-level = 3 reproduces the error.

I don't know if this bug has caught anyone's attention and concern. If so, I have a pcap tcpdump, a reverse debugger recording that enables forward and reverse debugging in gdb, and the source files I've used with slight modifications I'm willing to share.

If someone is interested in pairing via Google hangouts, I can accommodate that, too, and walk them through the debugging I've done thus far.

This bug is holding up a commercial project, and if I can't find a resolution soon, then my company will abandon Rust and pursue alternatives. I'd rather that didn't happen. My boss and I are rather concerned that optimized Rust code exhibits this behavior at all, and wonder what other problems this will cause with our product.

@Mark-Simulacrum

This comment has been minimized.

Member

Mark-Simulacrum commented Jun 17, 2017

I've nominated for the compiler team to look at, since we shouldn't be generating different code between debug and release. I suspect it isn't strictly a data race, though, since IIRC rr runs single-threaded. I could be wrong though.

@inejge

This comment has been minimized.

Contributor

inejge commented Jun 17, 2017

Data point: running the test program on a 32-bit system, I hit the panic quickly, every time.

rustc 1.18.0 (03fc9d622 2017-06-06)
binary: rustc
commit-hash: 03fc9d622e0ea26a3d37f5ab030737fcca6928b9
commit-date: 2017-06-06
host: i686-unknown-linux-gnu
release: 1.18.0
LLVM version: 3.9
@Mark-Simulacrum

This comment has been minimized.

Member

Mark-Simulacrum commented Jun 17, 2017

I've been able to reproduce as well, though on a 64-bit system -- and not every time. Interestingly, I most commonly see the expected seq id to be 102 (conn_like.get_seq_id()) and the packet's seq_id to be 50. I've seen one instance of 51 and 136 though.

I suspect that this may not be a compiler bug (especially since @inejge sees this every time). However, I'm still investigating, so I may be wrong. The consistency in failure (102/50) leads me to suspect something odd is going on within the program -- not really data racy, either, since the numbers are so far apart. This is all just guess work though.

@Mark-Simulacrum

This comment has been minimized.

Member

Mark-Simulacrum commented Jun 17, 2017

I've been unable to find the problem so far. I see about 30% failure rate with --release (115/361 runs as of writing) so it's clear that something is wrong. One thing that I've considered as a potential problem is that I think mysql_async's module for constants isn't correct -- it specifies the following, which may not be what is configured in your database. I'm also worried that some of these are unused, too. Overall, from my surface level debugging, I suspect that this is a bug in mysql_async, though I cannot confirm this. My theory is that packets become unaligned after a specific length packet is sent, leading to this problem, but I have not confirmed it as of yet (and doubt I have the time to do so).

pub static PAYLOAD_OFFSET: usize = 4;
pub static MAX_PACKET_LEN: usize = 16777219;
pub static MAX_PAYLOAD_LEN: usize = 16777215;
@boxofrox

This comment has been minimized.

boxofrox commented Jun 17, 2017

Thanks for the feedback. I'm in the same boat as @Mark-Simulacrum, the error doesn't occur 100%. I've not checked the TCP sequence id myself. I had fixated on the MySQL packet number which varied with each test. Looking back at my packet capture, I have over 2000 TCP packets from the database. I'm not sure how that compares with @Mark-Simulacrum's finding.

The only reason I have to suspect its a compiler bug is the debug target runs flawlessly while the optimized release target is sporadic, but I can't say that's definitive.

While my rr testing is single threaded, I imagine the timing by which packets are received from the database server may be creating the perfect storm for the code to fail.

One thing to note, I am running my database on a separate machine on my LAN (SSCEE <-> Switch <-> Database). If @Mark-Simulacrum is doing something similar while @inejge is running the database on the same machine as the SSCCE, then that would help explain why the error is consistent for @inejge.

My theory is that packets become unaligned after a specific length packet is sent, leading to this problem

I considered this, too. If N > MAX_PAYLOAD_LEN of data is pulled into the NextPacket.data vector, then only MAX_PAYLOAD_LEN of that data is used, and the rest would be discarded and unavailable for subsequent parsing. To verify this, I added an assertion to crash the program if self.data.len() >= consts::MAX_PAYLOAD_LEN at mysql_async/src/proto.rs:299. The next 20 test runs either passed or failed because of PacketOutOfOrder, but never from that assertion. While these constants are questionable, I believe in this instance they are a no-op.

I've gathered my binary, the rr trace directory, and source files into a 20MB tarball. I don't recommend running the executable as it should be considered unsafe unless you can verify it was compiled from the source provided. The executable is included for the rr trace. It appears rr has a binary-encoded absolute path in its trace, and I haven't found a way to point rr to at the desired executable in new location.

I'll spend time today cleaning up my notes and will provide instructions to replay the trace as I have.

@Mark-Simulacrum

This comment has been minimized.

Member

Mark-Simulacrum commented Jun 17, 2017

I've not checked the TCP sequence id myself.

To be clear, the sequence IDs I'm talking about I think are not the TCP sequence ids, but I could be wrong. I just added an assert above the point where the error is created in mysql.

I have over 2000 TCP packets from the database

I've attached both a good run and a bad run pcap file in a .zip file for what I'm seeing locally.

pcaps.zip

The only reason I have to suspect its a compiler bug is the debug target runs flawlessly while the optimized release target is sporadic, but I can't say that's definitive.

Yes, I agree that this is suspicious. I see the same thing with additions of println!, so my suspicion is the problem is that how fast we send/receive packets may be changing something, for example, the mysql/mariadb server is sending larger packets sooner (ramping up to a larger packet size). This seems somewhat unlikely though.

One thing to note, I am running my database on a separate machine on my LAN (SSCEE <-> Switch <-> Database).

I am also utilizing a local server (running in the same docker container, actually). I'm actually suspecting that 32-bit may be leading to something else changing (e.g., different byte order in-memory by default)... but I didn't see anything in my overview of the codebase for mysql_async that might be the cause of this. Certainly suspicious, though.

I considered this, too. If N > MAX_PAYLOAD_LEN of data is pulled into the NextPacket.data vector, then only MAX_PAYLOAD_LEN of that data is used, and the rest would be discarded and unavailable for subsequent parsing. To verify this, I added an assertion to crash the program if self.data.len() >= consts::MAX_PAYLOAD_LEN at mysql_async/src/proto.rs:299. The next 20 test runs either passed or failed because of PacketOutOfOrder, but never from that assertion. While these constants are questionable, I believe in this instance they are a no-op.

I agree that the constants seem to be unrelated for this specific case, but I worry that we might be seeing a case where the data being loaded is accidentally correct.

@inejge

This comment has been minimized.

Contributor

inejge commented Jun 17, 2017

@boxofrox To clarify, the client and the server in my test are separate machines on the same LAN. I ran the test on the 32-bit system after being unable to reproduce the bug on the 64-bit machine which hosts the MySQL server, running CentOS 6 -- not a single failure after 500 runs. The crashing 32-bit program ran on an old Ubuntu derivative. I have another 64-bit CentOS machine where I could try to install the test later. I'll report the findings.

@Mark-Simulacrum

This comment has been minimized.

Member

Mark-Simulacrum commented Jun 18, 2017

One more data point: I cannot reproduce on a 32-bit Ubuntu VM reliably (in fact, the reproduction is harder -- about 4/400 fail). I've also started noticing that it sometimes hangs on start, though I don't know why. Could be related to mysql itself, but not sure. I see similar "good behavior" on the 64-bit VM -- 4/600 runs fail.

@boxofrox

This comment has been minimized.

boxofrox commented Jun 18, 2017

I do get the hanging behavior at times myself. Best I can figure is that the program is waiting for network data that did arrive (verified with Wireshark), but the program lost track of it and is expecting more to come.

I suspect the cause is the same for both problems, and it's a matter of whether the bug appears in the middle of the TCP stream (PacketOutOfOrder), or near the end (hung waiting on I/O).

@inejge

This comment has been minimized.

Contributor

inejge commented Jun 18, 2017

@boxofrox This is a shot in the dark. After staring at mysql_async's packet construction code for a while, one thing started bothering me, so I removed it:

--- a/src/proto.rs
+++ b/src/proto.rs
@@ -305,7 +305,6 @@ impl NewPacket {
                 let length = u24_le(&*self.header).unwrap();
                 self.last_seq_id = self.header[3];
                 debug!("Last seq id {}", self.last_seq_id);
-                self.header.clear();
                 if length == 0 {
                     return ParseResult::Done(Packet { payload: self.data }, self.last_seq_id);
                 } else {

With this change, my predictably-crashing test program passed 500 iterations without a hitch. I don't really know what I'm doing, but it looked to me as the only thing that could explain having the length of 4 at one moment, and zero at the next. It that's the culprit, @Mark-Simulacrum was right in pinning this on mysql_async. But please test it thoroughly, I may be doing something nasty without realizing.

@boxofrox

This comment has been minimized.

boxofrox commented Jun 19, 2017

🤦‍♂️ I feel like an idiot now. I stared at the segment of code for hours this past week stepping through the debugger as I mentioned earlier. This next week would involve tracing down when in the code that header length reset to zero, and Vec.clear() kept racing through my head this weekend in anticipation. The culprit has been staring me in the face all this time.

This patch corroborates the story of that previous comment.

  1. The first call of NewPacket.parse() has the last four bytes from the I/O ring buffer (which is the header for the next MySQL packet). NewPacket.parse() takes those four bytes, decomposes them into length and self.last_seq_id, then clears the NewPacket.header. ParseResult::Incomplete is returned so the next set of reads populate the NewPacket.data vector.
  2. A second pass is made to parse the rest of the packet. There's no data available on the socket, and the I/O ring buffer is now empty, so no bytes are copied into NewPacket.data.
  3. The second pass finishes with a call to NewPacket.parse(), which goes through it's checklist. First, did we get all the bytes for the header? NewPacket.header.len() is zero (it was cleared in the previous pass), so nope, we need to get the header again... and the parser is now desynchronized from the network stream.

Seems all this bug needs is for a network socket read to end with the 4-byte header of a MySQL packet, and fail to read more bytes before the next attempt at parsing. mysql_async should not use the header vector size in its parsing checklist and then clear the header vector in the middle of parsing.

My hat goes off to you, @inejge. 500 runs with the release build, no errors. I'll file a PR with mysql_async. If you have a donation/tip link (e.g. Paypal), pass it to me and I'll buy you a cup of coffee.

Many thanks to you and @Mark-Simulacrum for your involvement.

@boxofrox boxofrox closed this Jun 19, 2017

boxofrox added a commit to boxofrox/mysql_async that referenced this issue Jun 19, 2017

NewPacket.header.len() is required for subsequent parsing, do not cle…
…ar the vector

See rust-lang/rust#42610 for details.

Signed-off-by: Justin Charette <charetjc@gmail.com>

boxofrox added a commit to boxofrox/mysql_async that referenced this issue Jun 19, 2017

NewPacket.header.len() is required for parsing, do not clear the vector
See rust-lang/rust#42610 for details.

Signed-off-by: Justin Charette <charetjc@gmail.com>

boxofrox added a commit to boxofrox/mysql_async that referenced this issue Jun 19, 2017

NewPacket.header.len() is required for parsing, do not clear the vector
See rust-lang/rust#42610 for details.

Signed-off-by: Justin Charette <charetjc@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment