
Manage InnerEndpoint timers with a DelayQueue #80

Closed
little-dude wants to merge 4 commits into master from delay-queue

Conversation

little-dude

No description provided.

@little-dude
Author

This is my attempt at fixing #64

There are also two unrelated clippy fixes but I can remove them from the PR if you prefer.
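
For context, a rough sketch of the approach under discussion: a single `DelayQueue` holds all of the endpoint's timers, and the endpoint remembers each entry's `Key` so that later timer start/stop events can reset or remove the corresponding entry. This is a minimal illustration assuming the tokio-timer 0.2 API; the names (`Timers`, `Timer`, `start`, `stop`, the `u64` connection handle) are placeholders, not quinn's actual code.

```rust
use std::time::Instant;
use tokio_timer::delay_queue::{DelayQueue, Key};

/// Timer kinds kept per connection (illustrative).
enum Timer {
    Idle,
    LossDetection,
}

/// One queue shared by all connections; each entry carries enough
/// information to find the connection and timer kind it belongs to.
struct Timers {
    queue: DelayQueue<(u64, Timer)>, // (connection handle, timer kind)
}

impl Timers {
    /// Start or restart a timer: if we still hold a key for it, reset the
    /// existing entry; otherwise insert a new one and remember the key.
    fn start(&mut self, slot: &mut Option<Key>, conn: u64, timer: Timer, deadline: Instant) {
        match slot {
            Some(key) => self.queue.reset_at(key, deadline),
            None => *slot = Some(self.queue.insert_at((conn, timer), deadline)),
        }
    }

    /// Stop a timer: remove its entry (if any) and forget the key.
    fn stop(&mut self, slot: &mut Option<Key>) {
        if let Some(key) = slot.take() {
            self.queue.remove(&key);
        }
    }
}
```

The keys are the only handle back into the queue, which is why the bookkeeping around them matters in the discussion below.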

@little-dude
Author

little-dude commented Oct 30, 2018

I hope I'm not changing the behavior here. It would be helpful to have tests for this kind of thing. I have some experience in Python and QA, so if that fits your plan, I can probably automate testing a bit. I'd have to read the RFCs first, but let me know if you're interested.

Collaborator

@Ralith left a comment


This looks great, thanks! Just a few possible tweaks. Cosmetic clean-up is always welcome, especially when gracefully isolated from other changes.

It looks like you preserved the behavior fine, and I'll do some manual testing before I merge. While the state machine in quinn-proto has a decent number of tests, these higher-level tokio bindings are definitely missing some tests of their own; some end-to-end tests would be very welcome. Originally I'd hoped that the bindings would be simple to the point of obviousness but that hasn't quite worked out. I don't think you'd need a deep understanding of the protocol for basic smoke tests, which can just use the high-level API like the examples do. See also #59.

[5 inline review comments on quinn/src/lib.rs, since resolved]
@Ralith
Collaborator

Ralith commented Oct 30, 2018

Also, it looks like you're running a mismatched version of rustfmt. We're standardized on the current stable release.

@codecov

codecov bot commented Oct 30, 2018

Codecov Report

Merging #80 into master will increase coverage by 0.82%.
The diff coverage is 0%.


@@            Coverage Diff             @@
##           master      #80      +/-   ##
==========================================
+ Coverage   56.65%   57.48%   +0.82%     
==========================================
  Files          17       17              
  Lines        4423     4436      +13     
==========================================
+ Hits         2506     2550      +44     
+ Misses       1917     1886      -31
| Impacted Files | Coverage Δ |
|---|---|
| quinn/src/lib.rs | 0% <0%> (ø) ⬆️ |
| quinn-proto/src/packet.rs | 79.89% <0%> (+1%) ⬆️ |
| quinn-proto/src/endpoint.rs | 71.37% <0%> (+1.54%) ⬆️ |
| quinn-proto/src/tests.rs | 91.5% <0%> (+3.56%) ⬆️ |
| quinn-proto/src/lib.rs | 52.63% <0%> (+5.26%) ⬆️ |
| quinn-proto/src/frame.rs | 68.35% <0%> (+7.42%) ⬆️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 39267e8...23bba11. Read the comment docs.

@little-dude
Author

> Also, it looks like you're running a mismatched version of rustfmt. We're standardized on the current stable release.

Ah thank you, I was a bit confused. I amended the last commit.

I'll address the rest of your comment tomorrow morning. Thank you for the thorough review, it's much appreciated :)

@djc
Member

djc commented Oct 30, 2018

Actually, #78 already fixes the formatting problems, so if we merge that and this PR gets rebased on top of it, we don't need the extraneous commit anymore.

@little-dude force-pushed the delay-queue branch 2 times, most recently from 70853ac to 1251ae4 on October 30, 2018 at 14:46
@little-dude
Author

I removed the rustfmt commit, and addressed the comments (except the reset_at one).
I'll wait for #78 to be merged and rebase on top of it (and I'll probably squash the last two commits).

Collaborator

@Ralith left a comment


Second pass, just some cosmetics. #78 is merged, so we should be good to go afterwards.

[5 inline review comments on quinn/src/lib.rs, 4 since resolved]
@Ralith
Collaborator

Ralith commented Nov 1, 2018

Running the example client/server produces a panic at the `reset_at` call with the message `assertion failed: old >= delay.deadline()`.

@little-dude
Author

Sorry, I should have run that first.

I don't really understand the code at https://docs.rs/tokio-timer/0.2.7/src/tokio_timer/delay_queue.rs.html#502

I'll try to reproduce this on a small example to get a better understanding of what triggers the panic.

@Ralith
Collaborator

Ralith commented Nov 1, 2018

It seems that a timer is being reset after it should have already expired. It's not obvious to me why this is deemed panic-worthy by DelayQueue, but I'm also surprised that it's happening in the first place.

@little-dude
Author

little-dude commented Nov 1, 2018

> but I'm also surprised that it's happening in the first place.

One scenario: when the timers are polled we may get back an expired entry, but we don't reset `loss_detect_timer` or `idle_timer` to `None`, so in the

`while let Some(io) = endpoint.inner.poll_io(now) {`

loop we may use the key of an already-expired timer.

I'm seeing something like that in the server:

# [...]
polling delay queue
expired timer : Expired { data: (ConnectionHandle(0), LossDetection), deadline: Instant { tv_sec: 11105, tv_nsec: 615920999 }, key: Key { index: 1 } }
polling delay queue
no expired timer
resetting timer: duration = 29.28249s, key = Key { index: 0 }, timer = Idle
resetting timer: duration = 19.29249s, key = Key { index: 1 }, timer = LossDetection
thread 'main' panicked at 'invalid key', /home/corentih/.cargo/registry/src/github.com-1ecc6299db9ec823/slab-0.4.1/src/lib.rs:756:18

However, for the client I'm not seeing that. There's something weird though:

inserting timer: duration = 10.000764s, timer = Idle
new key = Some(Key { index: 0 })
inserting timer: duration = 200.764ms, timer = LossDetection
new key = Some(Key { index: 1 })
polling delay queue
no expired timer
resetting timer: duration = 10.005912s, key = Key { index: 0 }, timer = Idle
resetting timer: duration = 16.208ms, key = Key { index: 1 }, timer = LossDetection
polling delay queue
no expired timer
resetting timer: duration = 10.010598s, key = Key { index: 0 }, timer = Idle
polling delay queue
no expired timer
resetting timer: duration = 10.012813s, key = Key { index: 0 }, timer = Idle
polling delay queue
no expired timer
resetting timer: duration = 10.015086s, key = Key { index: 0 }, timer = Idle
polling delay queue
no expired timer
resetting timer: duration = 10.023696s, key = Key { index: 0 }, timer = Idle
polling delay queue
no expired timer
resetting timer: duration = 10.028703s, key = Key { index: 0 }, timer = Idle
inserting timer: duration = 38.703ms, timer = LossDetection
new key = Some(Key { index: 1 }) <---------- why do we get key = 1 here? It's already in use.
polling delay queue
no expired timer
connected at 0.03149764
stream opened at 0.03159706
resetting timer: duration = 10.031907s, key = Key { index: 0 }, timer = Idle
resetting timer: duration = 38.703ms, key = Key { index: 1 }, timer = LossDetection
thread 'main' panicked at 'assertion failed: old >= delay.deadline()', /home/corentih/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.7/src/delay_queue.rs:502:13

@Ralith
Collaborator

Ralith commented Nov 1, 2018

> we may use a key for an expired timer.

Great catch! That's almost certainly the root cause.

> There's something weird though:

Timer 1 was probably removed due to a TimerStop event.
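
A minimal sketch of the bookkeeping this diagnosis implies, again assuming the tokio-timer 0.2 API (the `EndpointTimers` type and its fields are hypothetical, not quinn's code): when an expired entry is drained from the queue, the stored key must be forgotten so that a later timer start inserts a fresh entry instead of reusing a key the queue has already retired.

```rust
use futures::{Async, Stream};
use tokio_timer::delay_queue::{DelayQueue, Key};

enum Timer {
    Idle,
    LossDetection,
}

/// Hypothetical per-endpoint timer state.
struct EndpointTimers {
    queue: DelayQueue<(u64, Timer)>, // (connection handle, timer kind)
    idle: Option<Key>,
    loss_detection: Option<Key>,
}

impl EndpointTimers {
    /// Drain expired entries (called from within a task, e.g. the endpoint
    /// driver's poll). Forgetting the key here is the important part: without
    /// it, a later reset can reuse a key the queue has already retired, which
    /// is what produced the "invalid key" panic above.
    fn poll_expired(&mut self) -> Vec<(u64, Timer)> {
        let mut fired = Vec::new();
        while let Ok(Async::Ready(Some(expired))) = self.queue.poll() {
            let (conn, timer) = expired.into_inner();
            match timer {
                Timer::Idle => self.idle = None,
                Timer::LossDetection => self.loss_detection = None,
            }
            fired.push((conn, timer));
        }
        fired
    }
}
```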

@little-dude
Author

I fixed the "invalid key" panic. But the client still crashes consistently, and the server once in a while.

@Ralith
Collaborator

Ralith commented Nov 1, 2018

If you enable trace logging in slog (replace the slog line with `slog = { version = "2.2", features = ["max_level_trace"] }` in one of the Cargo.toml files), you get verbose timestamped logging, which produces this perplexing sequence of events:

inserting timer: duration = 37.949ms, timer = LossDetection
new key = Some(Key { index: 1 })
Oct 31 22:21:36.966 TRCE timer start, timer: LossDetection, time: 37.949ms
...
Oct 31 22:21:36.967 TRCE resetting timer: duration = 37.949ms, key = Key { index: 1 }, timer = LossDetection
thread 'main' panicked at 'assertion failed: old >= delay.deadline()', /home/ralith/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.7/src/delay_queue.rs:502:13

Note the 1ms difference in time before LossDetection's `old` value is deemed nonsensical, despite it still theoretically being about 37ms in the future. Hmmm.

@little-dude
Author

little-dude commented Nov 1, 2018

Thanks for the tip with slog. I had implemented my own `now_in_ms()` function that I was calling in each `println!()`. It was tedious... I get the exact same results as you: it crashes when we reset the timer 1ms after inserting it.
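
For reference, a standalone reproduction attempt could look roughly like this (a sketch assuming tokio 0.1 and tokio-timer 0.2; it mirrors the insert-then-reset pattern from the logs above, but whether this exact sequence trips the `old >= delay.deadline()` assertion is not verified here):

```rust
extern crate futures;
extern crate tokio;
extern crate tokio_timer;

use std::time::{Duration, Instant};

use futures::future::poll_fn;
use futures::{Async, Poll, Stream};
use tokio_timer::DelayQueue;

fn main() {
    // Run inside a tokio task so the DelayQueue has a timer to register with.
    tokio::run(poll_fn(|| -> Poll<(), ()> {
        let mut queue = DelayQueue::new();

        // Insert an entry roughly 38ms out, mirroring the LossDetection timer.
        let key = queue.insert_at("loss-detection", Instant::now() + Duration::from_millis(38));

        // Poll once: nothing has expired yet, but the entry gets scheduled.
        assert!(queue.poll().unwrap().is_not_ready());

        // Reset the same key to a deadline that is still in the future. In the
        // logs above, the equivalent reset (about 1ms after the insert) hit
        // `assertion failed: old >= delay.deadline()` with tokio-timer 0.2.7.
        queue.reset_at(&key, Instant::now() + Duration::from_millis(37));

        Ok(Async::Ready(()))
    }));
}
```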

Collaborator

@Ralith left a comment


These changes may be causing additional confusion.

[2 inline review comments on quinn/src/lib.rs, since resolved]
@little-dude
Author

Sorry for the confusion, I was trying things to see if that got rid of the error. I cleaned things up a little and squashed the commits.

@djc
Member

djc commented Nov 15, 2018

So what's the status here? How do we move forward?

@little-dude
Author

I need to repro the panic in tokio-timer. I haven't had time for the past 2 weeks due to unexpected events IRL. I have to move out by the end of the month so it's unlikely that I'll be able to work on this before December :(

@djc
Member

djc commented Nov 16, 2018

No worries, good luck with the move!

@djc
Member

djc commented Dec 29, 2018

I think this will become blocking soon; do you still have time to work on this in the next week or so? If not, I'll build on your work and finish it off.

@Ralith
Collaborator

Ralith commented Jan 14, 2019

I'm having another go at this in the course of experimenting with async/await, and am making progress on narrowing down a root cause: tokio-rs/tokio#849

@Ralith
Collaborator

Ralith commented Feb 7, 2019

Testing on the nightly branch suggests that DelayQueue itself is now relatively debugged.

@Ralith
Collaborator

Ralith commented Mar 29, 2019

Mooted by #264

@Ralith closed this Mar 29, 2019