tcp_stream test fails with TcpStream::set_ttl crash #1117

Closed
dtacalau opened this issue Oct 30, 2019 · 26 comments

@dtacalau (Contributor) commented Oct 30, 2019

Problem:
When running the unit tests on Windows 8.1 Pro, the tcp_stream ttl test crashes in TcpStream::set_ttl:

RUST_BACKTRACE=1 cargo test --test tcp_stream
    Finished dev [unoptimized + debuginfo] target(s) in 0.10s
     Running target\debug\deps\tcp_stream-7340f8b6dbab4dde.exe

running 12 tests
test is_send_and_sync ... ok
test shutdown_write ... ignored
test tcp_stream_ipv4 ... ignored
test tcp_stream_ipv6 ... ignored
test ttl ... FAILED
test registering ... ok
test deregistering ... ok
test nodelay ... ok
test reregistering ... ok
test shutdown_both ... ok
test shutdown_read ... ok
test try_clone ... ok

failures:

---- ttl stdout ----
thread 'ttl' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 10022, kind: InvalidInput, message: "An invalid argument was supplied." }', src\libcore\result.rs:1084:5
stack backtrace:
   0: backtrace::backtrace::trace_unsynchronized
             at C:\Users\VssAdministrator\.cargo\registry\src\github.com-1ecc6299db9ec823\backtrace-0.3.34\src\backtrace\mod.rs:66
   1: std::sys_common::backtrace::_print
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\sys_common\backtrace.rs:47
   2: std::sys_common::backtrace::print
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\sys_common\backtrace.rs:36
   3: std::panicking::default_hook::{{closure}}
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\panicking.rs:200
   4: std::panicking::default_hook
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\panicking.rs:211
   5: std::panicking::rust_panic_with_hook
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\panicking.rs:477
   6: std::panicking::continue_panic_fmt
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\panicking.rs:384
   7: std::panicking::rust_begin_panic
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\panicking.rs:311
   8: core::panicking::panic_fmt
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libcore\panicking.rs:85
   9: core::result::unwrap_failed
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libcore\result.rs:1084
  10: core::result::Result<(), std::io::error::Error>::unwrap<(),std::io::error::Error>
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\src\libcore\result.rs:852
  11: tcp_stream::ttl
             at .\tests\tcp_stream.rs:177
  12: tcp_stream::ttl::{{closure}}
             at .\tests\tcp_stream.rs:168
  13: core::ops::function::FnOnce::call_once<closure-0,()>
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\src\libcore\ops\function.rs:235
  14: alloc::boxed::{{impl}}::call_once<(),FnOnce<()>>
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\src\liballoc\boxed.rs:787
  15: panic_unwind::__rust_maybe_catch_panic
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libpanic_unwind\lib.rs:80
  16: std::panicking::try
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\src\libstd\panicking.rs:275
  17: std::panic::catch_unwind
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\src\libstd\panic.rs:394
  18: test::run_test::run_test_inner::{{closure}}
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libtest\lib.rs:1408
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    ttl

test result: FAILED. 8 passed; 1 failed; 3 ignored; 0 measured; 0 filtered out

Cause:
According to @PerfectLaugh:

I figured out that on Windows a TcpStream created via connect is not allowed to call set_ttl before the connection completes, since we use a non-blocking connect.

Solution:
?

Thomasdezeeuw added this to the v0.7 milestone Oct 30, 2019
@Thomasdezeeuw (Collaborator):

Should we document that set_ttl (and maybe more methods) require a writable (or readable?) event first?

@dtacalau (Contributor, Author) commented Oct 30, 2019

> Should we document that set_ttl (and maybe more methods) require a writable (or readable?) event first?

Not sure if documenting is enough: someone might miss reading the docs, and there might be APIs sensitive to connect status on other platforms too. In my case, the crash did not always reproduce, depending on which other tests from the same test file I enabled or disabled.

Maybe implement support in mio for checking whether a stream is connected? Or, even better, have the APIs that are sensitive to connect status check for it and return an error if the socket is not connected yet.

@dtacalau (Contributor, Author):

@Thomasdezeeuw please set the windows label on this issue

Thomasdezeeuw added the windows label Nov 1, 2019
@Thomasdezeeuw (Collaborator):

> Not sure if documenting is enough: someone might miss reading the docs, and there might be APIs sensitive to connect status on other platforms too. In my case, the crash did not always reproduce, depending on which other tests from the same test file I enabled or disabled.

I don't entirely agree that documenting is not enough; although an API that can't be misused would be better, documentation is sometimes simply the best we can do without introducing overhead.

> Maybe implement support in mio for checking whether a stream is connected? Or, even better, have the APIs that are sensitive to connect status check for it and return an error if the socket is not connected yet.

How do you propose we do this without introducing (too much) overhead?

If I remember correctly, at some point the Mio documentation said that Evented types (now event::Source) were only usable (e.g. methods such as set_ttl callable) once an event had been received for them. This is how we could get away with TcpStream::connect not blocking if the stream was not yet connected. Maybe that documentation should be moved to a more central (i.e. prominent) place in the docs.
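
A minimal sketch of that usage pattern, assuming the mio 0.7 registry API; the token, timeout, and TTL value below are arbitrary illustration choices:

use std::io;
use std::net::SocketAddr;
use std::time::Duration;

use mio::net::TcpStream;
use mio::{Events, Interest, Poll, Token};

// Connect, wait for the first writable event (the connect has
// completed), and only then call `set_ttl`.
fn connect_then_set_ttl(addr: SocketAddr) -> io::Result<()> {
    let mut poll = Poll::new()?;
    let mut events = Events::with_capacity(8);
    let mut stream = TcpStream::connect(addr)?; // non-blocking connect

    poll.registry()
        .register(&mut stream, Token(0), Interest::WRITABLE)?;
    poll.poll(&mut events, Some(Duration::from_secs(5)))?;

    for event in events.iter() {
        if event.token() == Token(0) && event.is_writable() {
            // Connect completed; setsockopt is now safe, also on Windows.
            return stream.set_ttl(64);
        }
    }
    Err(io::Error::new(io::ErrorKind::TimedOut, "connect did not complete"))
}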

@dtacalau (Contributor, Author) commented Nov 1, 2019

@Thomasdezeeuw thinking out loud here: on Linux you are allowed to call setsockopt with IP_TTL before connect, while on Windows set_ttl requires the socket to be connected. An application using mio's set_ttl interface would have to have different implementations for Linux and Windows; maybe we don't want that, though I'm not sure whether this is a design goal of mio. Otherwise, just documenting it seems reasonable ATM.

A possible solution would be to use completion ports the same way they are already used to simulate polling. A connected-stream event is just another poll for WRITABLE on that stream. Here's some pseudocode:

fn check_stream_connected(&mut self) -> bool {
    if !self.is_stream_connected {
        // A completed connect is reported as a WRITABLE event.
        if poll(WRITABLE).is_ok() {
            self.is_stream_connected = true;
        }
    }
    self.is_stream_connected
}

fn set_ttl(&mut self, ttl: u32) -> io::Result<()> {
    if self.check_stream_connected() {
        self.inner.set_ttl(ttl)
    } else {
        Err(io::Error::from(io::ErrorKind::WouldBlock)) // EAGAIN
    }
}

I'm not sure about the overhead introduced by this solution or how it would affect performance; it could be minimal, since we'd just reuse the code already present for simulating poll via the AFD driver.
Does mio have benchmarks for profiling the performance of various interface calls? I might have missed them.

@Thomasdezeeuw (Collaborator):

@dtacalau Mio is currently missing benchmarks; a lot of the API changed in the v0.6 -> v0.7 transition, but it's something to add again after the initial v0.7 release.

As for keeping track of whether or not the stream is connected: that would require (at least) a boolean value. But if we document that the TcpStream is only ready after an initial read/writable event has been received (pretending it doesn't work on Linux either), we can get away with not using any additional state. It would make the API harder to use, but personally I feel that since we already have to wait before writing/reading, it shouldn't add too much complexity.

What do you think?

@dtacalau (Contributor, Author) commented Nov 1, 2019

> But if we document that the TcpStream is only ready after an initial read/writable event has been received (pretending it doesn't work on Linux either), we can get away with not using any additional state.

@Thomasdezeeuw I'm not sure it's possible to pretend the stream is only ready after an initial read/writable event has been received on all mio-supported platforms. What if mio needs to offer an interface for setting a socket option that requires a not-yet-connected stream, the exact opposite of set_ttl (for example, an option like IPV6_V6ONLY, which must be set before bind)? That's why I think it is safer to handle the issue per specific option, depending on that option's needs.

Or maybe introduce a new interest called CONFIGURABLE: the ability to configure the behavior of a stream (call setsockopt on the stream's socket), which would be platform-specific behind the scenes. So, on Windows, taking set_ttl as an example:

poll.register(stream, CONFIGURABLE)
poll.wait(events)
if (events matches CONFIGURABLE)
    set_ttl(stream)

Anyway, if we can pretend ATM that all currently supported options require a connected stream, I'm fine with just documenting it; just keep the opposite case in mind :).

@dtacalau (Contributor, Author) commented Nov 1, 2019

@Thomasdezeeuw I could also implement some benchmarks and have the CI run them along with the rest of the tests (or resurrect the old benchmarks). Let me know what you think!

dtacalau added a commit to dtacalau/mio that referenced this issue Nov 4, 2019
…s, this test was crashing on Windows because calling setsockopt on a not-connected socket is undefined behavior. More details can be found in issue tokio-rs#1117.

Signed-off-by: Daniel Tacalau <dst4096@gmail.com>
@dtacalau (Contributor, Author) commented Nov 4, 2019

@Thomasdezeeuw to wrap up our discussion: I've committed a fix for the set_ttl test, but I'm not really satisfied with it (since this is only a Windows limitation, other platforms would not need the extra register and wait).

I'd like to know your thoughts on my ideas above: benchmarks, and adding a new CONFIGURABLE interest. Thanks!

@Thomasdezeeuw (Collaborator):

> @Thomasdezeeuw to wrap up our discussion: I've committed a fix for the set_ttl test, but I'm not really satisfied with it (since this is only a Windows limitation, other platforms would not need the extra register and wait).

This is more or less the best we can do without introducing overhead elsewhere, if you're talking about registering before calling set_ttl as done in #1120. Personally I'm OK with that, since most, if not all, use cases involve registering the stream at some point.

> I'd like to know your thoughts on my ideas above: benchmarks, and adding a new CONFIGURABLE interest. Thanks!

I'm not really in favour of a configurable interest, because I don't know what it would be. I think it would just be another name for writable on Windows? But then you have an awkward situation in which you'll get a writable event even though you only registered configurable interest.

As for the benchmarks: they're very welcome. I think the API is starting to stabilise, so the work to maintain them shouldn't be too much.

@carllerche (Member):

We should make the test work on Windows or make the API a *nix-only API, IMO.

I think we can make this work on Windows by storing the TTL in internal state until we detect the socket is connected, then applying it. @Thomasdezeeuw wdyt?
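
A rough sketch of this caching idea (not mio's actual implementation; the type and field names are hypothetical):

use std::io;
use mio::net::TcpStream;

// Cache the TTL until the socket is known to be connected, then apply it.
struct StreamInner {
    inner: TcpStream,
    connected: bool,
    pending_ttl: Option<u32>,
}

impl StreamInner {
    fn set_ttl(&mut self, ttl: u32) -> io::Result<()> {
        if self.connected {
            self.inner.set_ttl(ttl)
        } else {
            // Defer until the connect completes.
            self.pending_ttl = Some(ttl);
            Ok(())
        }
    }

    // Called when a writable event signals that the connect finished.
    fn on_connected(&mut self) -> io::Result<()> {
        self.connected = true;
        match self.pending_ttl.take() {
            Some(ttl) => self.inner.set_ttl(ttl),
            None => Ok(()),
        }
    }
}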

carllerche reopened this Nov 6, 2019
@Thomasdezeeuw (Collaborator):

After 60bf56b this works on Windows, but it does require the user to receive an event for it first. Do you mean that calling set_ttl should work without waiting for an event?

@dtacalau (Contributor, Author) commented Nov 7, 2019

@carllerche ATM the ttl test works on Windows, and the nodelay tests work too (they had similar issues).
@Thomasdezeeuw @MarkLivsc from Windows networking has started looking into this. I'd leave this issue open a little longer to get his feedback. He replied to me on the mio channel saying he was not able to reproduce it and asked for repro steps.

@dtacalau (Contributor, Author) commented Nov 14, 2019

Mark from the Windows core networking team offered to help investigate this issue. I will keep this issue updated with progress and other interesting findings. The first update follows.

Mark was unable to reproduce it: "I've been working on reproing the TTL issue you mentioned, and I'm having some trouble doing so. I tried setting + getting the option on client and server TCP AF_INET sockets before any connection attempt is made, after the client calls connect but before the server accepts, and after the server accepts, and they pass the smoke test of being able to set, and then retrieve, the same value. I tried this on my development machine, and on a Windows 8.1 Pro 64-bit VM I spun up, and both pass the tests."

He asked me to collect a time travel debugging trace of the repro.

The issue does not reproduce when run under the time travel debugging tool (tttracer). I think tracing makes things run slower and changes the timing. However, I've prepared a branch where the issue reproduces almost always; I've tested it under both Windows 8.1 and Windows 10:

https://github.com/dtacalau/mio/tree/reproduce_set_get_ttl_issue_1117
To reproduce: cargo build; cargo test --test=tcp_stream

@MarkLivsc:

Carl and I are continuing to investigate this issue. Some other findings: running the test in isolation, or running it with --test-threads 1, makes the repro go away. However, putting a 2-second sleep in the server thread with multi-threading enabled makes the repro 100% reliable. Carl identified what look like uninitialized reads using Dr. Memory. So, overall, the investigation is ongoing.
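
For reference, the single-threaded run that hides the issue uses libtest's standard flag:

cargo test --test tcp_stream -- --test-threads=1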

@dtacalau (Contributor, Author) commented Nov 18, 2019

Hey @MarkLivsc, it's Daniel not Carl :).

In addition to what Mark said, more facts:

  • The issue doesn't reproduce under tracing under any conditions. Without tracing, we can always reproduce it if a 2-second sleep is added before the accept call in the server thread, with multi-threaded tests enabled.
  • The uninitialized-read errors found by the profiler are present for all unit tests; I've opened a separate issue for them: Investigate UNINITIALIZED READ errors reported by profiler on Windows #1151.
  • App Verifier revealed runtime exceptions. We enabled Windows App Verifier checks (Basic & Networking) for the tcp_stream.exe app and ran it under a debugger:
Exception thrown at 0x00007FFC2660908C in tcp_stream-7340f8b6dbab4dde.exe: Microsoft C++ exception: unsigned __int64 at memory location 0x00000017E495ECD0.

VERIFIER STOP 000000000000E10C: pid 0x1A48: WSPCleanup aborted and closed existing service provider SOCKET handles underneath the code that owned those SOCKETs 
	0000000000000002 : Number of service provider sockets that were outstanding

VERIFIER STOP 000000000000E10B: pid 0x1A48: WSACleanup aborted and closed existing SOCKET handles underneath the code that owned those SOCKETs 
	0000000000000002 : Number of sockets that were outstanding
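
For context, the server-side change that makes the repro reliable looks roughly like this (a sketch, not the actual test code; the helper name is illustrative):

use std::net::TcpListener;
use std::thread;
use std::time::Duration;

// Delay accept() by two seconds so the client's non-blocking connect
// stays in progress while the test calls set_ttl.
fn spawn_slow_accept_server(listener: TcpListener) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        thread::sleep(Duration::from_secs(2)); // the crucial delay
        let _ = listener.accept();
    })
}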

@dtacalau (Contributor, Author):

@Thomasdezeeuw please re-assign this issue to me, thanks.

@Thomasdezeeuw (Collaborator):

@dtacalau is there anything left to do here?

@dtacalau (Contributor, Author) commented Dec 4, 2019

@Thomasdezeeuw yes, this is WIP. Mark and I have continued investigating this. The biggest problem is that the issue reproduces only on my machine; Mark was unable to reproduce it on his machines.

Mark prepared a set of custom drivers with more debug info, which I installed on a machine where the issue reproduces. Unfortunately, the issue did not reproduce under the custom drivers; anything that slows down the machine makes the issue stop reproducing.

Next, Mark prepared a minimalist repro test that just keeps running until it hits the issue, to isolate the issue from mio and Rust and reproduce it in another context. The most important part is finding a way to reproduce it on other machines too.

@Thomasdezeeuw (Collaborator):

@dtacalau do you think this should block the v0.7 release? Or can we move it to the v1.0?

@dtacalau (Contributor, Author) commented Dec 4, 2019

@Thomasdezeeuw not a blocker for the v0.7 release, IMO.

Thomasdezeeuw modified the milestones: v0.7 → v1.0 Dec 4, 2019
@dtacalau (Contributor, Author) commented Dec 6, 2019

We have a repro; this Rust test always fails:

#[test]
fn set_get_ttl_mod() {
    let (mut poll, mut events) = init_with_poll();

    // Nothing is listening on this port, so the non-blocking connect
    // will fail asynchronously.
    let address = "127.0.0.1:9999".parse().unwrap();

    let stream = TcpStream::connect(address).unwrap();
    assert!(stream.take_error().unwrap().is_none());

    // Set TTL, get TTL, and make sure it has the expected value, all
    // while the connect is still in progress; on Windows this is
    // where the failure shows up.
    const TTL: u32 = 10;
    stream.set_ttl(TTL).unwrap();
    assert_eq!(stream.ttl().unwrap(), TTL);
    assert!(stream.take_error().unwrap().is_none());
}

@MarkLivsc also implemented a test that runs N instances of essentially the test above (but written in C) in parallel. The issue reproduces with Mark's test, but only when run multithreaded; when run on one thread it does not reproduce.

The key here was the failing async connect, so now we can define the problem better: set_ttl fails during a failing asynchronous connect when run multithreaded. This is not something documented on MSDN.

Since Mark now has a repro, he'll continue the investigation on his end and promised to keep us posted with updates.

@Thomasdezeeuw I think we can close this issue, since there's nothing else to be done on the Rust/mio side.

@Thomasdezeeuw (Collaborator):

I prefer to keep this open, as it's still an issue.

In any case, maybe we should document it and share the info with other projects such as libuv.

@dtacalau (Contributor, Author) commented Dec 7, 2019

> In any case, maybe we should document it and share the info with other projects such as libuv.

@Thomasdezeeuw it's already documented:

/// On Windows make sure the stream is connected before calling this method,

@dtacalau (Contributor, Author):

The behavior observed and investigated in this issue is documented in the Windows docs. @MarkLivsc pointed us to the location:

https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-connect

> If the connection is not completed immediately, the client should wait for connection completion before attempting to set socket options using setsockopt. Calling setsockopt while a connection is in progress is not supported.

Closing this issue, as we've already documented the proper usage:

    /// On Windows make sure the stream is connected before calling this method,
    /// by receiving an (writable) event. Trying to set `ttl` on an
    /// unconnected `TcpStream` is undefined behavior.
    pub fn set_ttl(&self, ttl: u32) -> io::Result<()> {

@Thomasdezeeuw (Collaborator):

@dtacalau and @MarkLivsc thanks for figuring this out!
