Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests with network requests frequently time out under rr #22892

Open
jdm opened this issue Feb 14, 2019 · 19 comments
Open

Tests with network requests frequently time out under rr #22892

jdm opened this issue Feb 14, 2019 · 19 comments

Comments

@jdm
Copy link
Member

@jdm jdm commented Feb 14, 2019

I have observed this in an unmodified Servo build using RUST_LOG=tokio,net,hyper ./mach test-wpt --debugger=rr --debugger-args="record --chaos" --no-pause-after-test --repeat-until-unexpected tests/wpt/web-platform-tests/fetch/content-type/script.window.js. I suspect this is not limited to this particular test.

The symptom is that the test makes a number of HTTP requests to the WPT python HTTP server, the server observes all of the requests and responds appropriately, and Servo only observes a subset of the responses. The RUST_LOG output shows that the sockets aren't being notified that there is data available.

It's not clear to me whether this is a problem in tokio/hyper that rr exposes, or whether this is a problem caused by rr behaving incorrectly.

@jdm
Copy link
Member Author

@jdm jdm commented Feb 14, 2019

My attempt to write a standalone rust program that made the same kinds of requests to the server did not reproduce the same problem when run under rr.

@seanmonstar
Copy link
Contributor

@seanmonstar seanmonstar commented Feb 14, 2019

I spent a few hours today tracing through the logs you gave me, and started to notice subtle differences with what is in the latest versions of tokio. Would it be possible to update hyper and all tokio crates to the newest version?

@jdm
Copy link
Member Author

@jdm jdm commented Feb 14, 2019

That is already the case as far as I know.

@jdm
Copy link
Member Author

@jdm jdm commented Feb 14, 2019

That being said, I am starting from master and upgrading everything again and I'll grab new logs.

@seanmonstar
Copy link
Contributor

@seanmonstar seanmonstar commented Feb 14, 2019

I noticed a couple lines "starting background reactor", which with newer tokio, shouldn't happen automatically anymore, as each worker has its own reactor instead of them all sharing a background one.

@jdm
Copy link
Member Author

@jdm jdm commented Feb 14, 2019

Reproduced with master...jdm:foo2.

@seanmonstar
Copy link
Contributor

@seanmonstar seanmonstar commented Feb 15, 2019

Could I bother you for logs from the newest versions? Since a few things had changed, it was difficult to trace execution with the older logs and comparing with the tokio repo.

@jdm
Copy link
Member Author

@jdm jdm commented Feb 15, 2019

https://gist.github.com/jdm/d91481985d554250a1988c3c34808639

The problem requests start on line 654.

@seanmonstar
Copy link
Contributor

@seanmonstar seanmonstar commented Feb 15, 2019

There is still a line about background reactor, which if on the newest version, should only happen when a socket is being used outside of a runtime. I suspect this line here:

let (res, msg) = match response_future.wait() {

What if you replace that with something that spawns the ResponseFuture on the runtime, like this:

let fut = futures::sync::oneshot::spawn(response_future, &HANDLE.lock().unwrap().executor());
let (res, msg) = match fut.wait() {
@jdm
Copy link
Member Author

@jdm jdm commented Feb 15, 2019

error[E0277]: `dyn futures::future::Future<Item=(http::response::Response<decoder::Decoder>, std::option::Option<devtools_traits::ChromeToDevtoolsControlMsg>), Error=net_traits::NetworkError>` cannot be sent between threads safely
    --> components/net/http_loader.rs:1188:15
     |
1188 |     let fut = futures::sync::oneshot::spawn(response_future, &HANDLE.lock().unwrap().executor());
     |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ `dyn futures::future::Future<Item=(http::response::Response<decoder::Decoder>, std::option::Option<devtools_traits::ChromeToDevtoolsControlMsg>), Error=net_traits::NetworkError>` cannot be sent between threads safely
     |
     = help: the trait `std::marker::Send` is not implemented for `dyn futures::future::Future<Item=(http::response::Response<decoder::Decoder>, std::option::Option<devtools_traits::ChromeToDevtoolsControlMsg>), Error=net_traits::NetworkError>`
     = note: required because of the requirements on the impl of `std::marker::Send` for `std::ptr::Unique<dyn futures::future::Future<Item=(http::response::Response<decoder::Decoder>, std::option::Option<devtools_traits::ChromeToDevtoolsControlMsg>), Error=net_traits::NetworkError>>`
     = note: required because it appears within the type `std::boxed::Box<dyn futures::future::Future<Item=(http::response::Response<decoder::Decoder>, std::option::Option<devtools_traits::ChromeToDevtoolsControlMsg>), Error=net_traits::NetworkError>>`
     = note: required because it appears within the type `futures::sync::oneshot::Execute<std::boxed::Box<dyn futures::future::Future<Item=(http::response::Response<decoder::Decoder>, std::option::Option<devtools_traits::ChromeToDevtoolsControlMsg>), Error=net_traits::NetworkError>>>`
     = note: required because of the requirements on the impl of `futures::future::Executor<futures::sync::oneshot::Execute<std::boxed::Box<dyn futures::future::Future<Item=(http::response::Response<decoder::Decoder>, std::option::Option<devtools_traits::ChromeToDevtoolsControlMsg>), Error=net_traits::NetworkError>>>>` for `tokio::runtime::threadpool::task_executor::TaskExecutor`
     = note: required by `futures::sync::oneshot::spawn`

error: aborting due to previous error
@seanmonstar
Copy link
Contributor

@seanmonstar seanmonstar commented Feb 15, 2019

@jdm
Copy link
Member Author

@jdm jdm commented Feb 15, 2019

It compiled, but I was still able to reproduce the timeout.

@seanmonstar
Copy link
Contributor

@seanmonstar seanmonstar commented Feb 15, 2019

I'm specifically trying to understand why at the end, there are no event Readable Token(...) after the Writable events get logged. Is it possible to see the logs of the version using the oneshot?

Oh also, is this Mac-specific (and so I should be looking at kqueue), or Mac + Linux?

@jdm
Copy link
Member Author

@jdm jdm commented Feb 15, 2019

I have only observed this under rr, which is linux-specific.

@jdm
Copy link
Member Author

@jdm jdm commented Feb 15, 2019

Here's a log from the oneshot build - first there's a log of a timing out run, and then there's a log of a successful run: https://gist.github.com/jdm/66cf22f47c568e59ee6c59721faa59c9

@seanmonstar
Copy link
Contributor

@seanmonstar seanmonstar commented Feb 15, 2019

Oh, this only happens in rr, not in any other situation? Hmmming intensifies

@jdm
Copy link
Member Author

@jdm jdm commented Feb 15, 2019

Yeah, that was the motivation for my comment: "It's not clear to me whether this is a problem in tokio/hyper that rr exposes, or whether this is a problem caused by rr behaving incorrectly."

It could be that rr's chaos mode is exposing a legitimate bug (which is its whole purpose). It could also be that this code is exposing a bug in rr's replaying of the relevant syscalls. I have no idea how to determine which it is.

@jdm
Copy link
Member Author

@jdm jdm commented Feb 15, 2019

Which is to say - I have observed inexplicable network-related timeouts in other tests when running tests outside of rr before. I cannot guarantee that this timeout only reproduces in rr, only that I have consistently reproduced it in rr.

@seanmonstar
Copy link
Contributor

@seanmonstar seanmonstar commented Feb 15, 2019

Ya, I didn't mean to imply the bug is in rr.

So, the confusing part is that the logs claim the requests have been written to socket, but then the reactor, when waiting on epoll, never sees Readable events for some of them. They are register with epoll, because we saw Writable events. And the reactor isn't stuck, because it sees events for other sockets. I'm at a loss...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.