
io: add scheduler #16633

Merged · 15 commits · Feb 23, 2024
Conversation

@dotnwat (Member) commented Feb 18, 2024

Adds a simple scheduler to the I/O subsystem. The purpose of the scheduler is to manage all of the active I/O queues (e.g. implementing global policies such as preventing starvation or enforcing priorities). However, this initial version of the scheduler only implements a policy that limits the total number of open file handles.
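For orientation, here is a minimal standalone sketch of the limiting idea (names are hypothetical; the real implementation additionally tracks waiters and an LRU list of open queues, as discussed in the review below). Each open file holds one semaphore unit, so opens beyond the limit wait until a unit is returned:

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/util/noncopyable_function.hh>

// hypothetical: every open file holds one unit for as long as it is open
seastar::semaphore open_files{128}; // illustrative limit

seastar::future<>
with_open_file(seastar::noncopyable_function<seastar::future<>()> use) {
    // waits here if the open-file limit has been reached
    auto units = co_await seastar::get_units(open_files, 1);
    co_await use(); // the file counts against the limit while units are held
    // units are released when they go out of scope, letting another open proceed
}
```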

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
This is done for consistency with other test helpers.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
@dotnwat dotnwat marked this pull request as draft February 19, 2024 15:55
@dotnwat (Member, Author) commented Feb 19, 2024

Converting to draft while I fix a memory usage issue with a large test.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
@dotnwat dotnwat marked this pull request as ready for review February 21, 2024 01:55
@andrwng (Contributor) left a comment

Looking pretty good! I found it a bit tough to dive into, but having stared for a while I think it mostly makes sense. I added some naming suggestions that I felt would've helped.

size_t waiters_{0};
seastar::semaphore nofiles_;
intrusive_list<queue, &queue::cache_hook_> lru_;
seastar::future<> monitor(queue*) noexcept;
Contributor:

nit: mind moving this above the members? Not sure if others feel this way but I always find it way easier to navigate a header if the methods are above the members.

@dotnwat (Member, Author):

no preference at all. fixed. i wonder if this should be added to our style-guide.md?

* detailed description.
*/
size_t waiters_{0};
seastar::semaphore nofiles_;
Contributor:

question/nit: is this "number of files/no. files" or "no files"? A comment describing what the semaphore protects would be appreciated. Or maybe name it available_file_units_ or something?

Contributor:

Also it might be worth considering using ssx::semaphore for improved debuggability

@dotnwat (Member, Author):

it's the number of open files. i've renamed it to be clearer.

switched to ssx::semaphore. forgot it exists!
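For context, the debuggability win is that ssx::semaphore carries a label that is reported when the semaphore is broken or times out. Hypothetical usage (name and limit are illustrative):

```cpp
#include "ssx/semaphore.h"

// hypothetical declaration; the label shows up in broken-semaphore and
// timeout exceptions, unlike a bare seastar::semaphore
ssx::semaphore open_files_sem_{128, "io/scheduler/open-files"};
```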

* see scheduler::monitor implementation for more details.
*/
seastar::future<> monitor_{seastar::make_ready_future<>()};
seastar::semaphore sem_{0};
Contributor:

nit: could you add a comment describing what this semaphore is meant to be used for (and maybe any special considerations to be made when thinking about the nofile_ semaphore, if any)?

Contributor:

one thought: perhaps this should be named has_work_sem or somesuch?

@dotnwat (Member, Author):

addressed both comments

*/
seastar::future<> monitor_{seastar::make_ready_future<>()};
seastar::semaphore sem_{0};
seastar::semaphore_units<> units_;
Contributor:

nit: at least jumping into this code for the first time, I think it'd be helpful to name this scheduler_units_ or nofiles_units_ or somesuch to set readers' expectations on what usage should look like

@dotnwat (Member, Author):

fixed

co_return;
}

co_await queue->sem_.wait(std::max<size_t>(queue->sem_.current(), 1));
Contributor:

I'm a little confused by this. What's the significance of using a size-0 semaphore vs a condition variable?

@dotnwat (Member, Author):

added a comment in the code to explain. basically, a condition variable will drop wake-ups if there are no waiters. you can fix that with a predicate, but in this case the predicate isn't a simple flag or two to check. so we use a semaphore instead, to ensure that the loop will run again if it's signaled while the worker is busy doing other things.
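A minimal sketch of that pattern (names hypothetical): signals accumulate as semaphore units, so one delivered while the worker is busy still forces another loop iteration.

```cpp
#include <algorithm>
#include <seastar/core/coroutine.hh>
#include <seastar/core/semaphore.hh>

seastar::semaphore work_sem{0};   // starts empty; signal() deposits units
seastar::future<> process_work(); // hypothetical work body; may suspend

seastar::future<> worker_loop() {
    for (;;) {
        // drain all pending signals in one wait, or block for the next one;
        // a plain condition variable would drop a signal() that arrives
        // while process_work() below is still running
        co_await work_sem.wait(std::max<size_t>(work_sem.current(), 1));
        co_await process_work();
    }
}
```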

Contributor:

Ah got it, thanks for clarifying. A while ago we found that a similar (maybe the same) concern had recently been addressed in seastar.

https://github.com/scylladb/seastar/blob/f0b2971a24ab7716933a720572207fa9a1df38bd/tests/unit/condition_variable_test.cc#L44-L49

Is this the scenario you're thinking about?

@dotnwat (Member, Author):

Oh, that's interesting. Seems they addressed that 2 years ago and it hasn't yet solidified in my brain. Part of that is because I don't think pthread condition variables have this property. Anyway, good to know. I'll likely switch this back to a condition variable in a later PR.
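For reference, a sketch of the behavior the linked test exercises (this relies on seastar's condition_variable remembering a signal delivered with no waiters, per the discussion above):

```cpp
#include <seastar/core/condition-variable.hh>
#include <seastar/core/coroutine.hh>

seastar::future<> demo() {
    seastar::condition_variable cv;
    cv.signal();        // nobody is waiting yet
    co_await cv.wait(); // resolves immediately: the signal was remembered
}
```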

* a queue to close in order to enforce the maximum open queue limit.
*/
intrusive_list_hook sched_hook_;
intrusive_list_hook cache_hook_;
Contributor:

nit: again might just be a first-time-reader thing, but I think it'd be more easily followable naming this lru_hook_ or renaming lru_ to cache_

@dotnwat (Member, Author):

fixed

*/
size_t waiters_{0};
seastar::semaphore nofiles_;
intrusive_list<queue, &queue::cache_hook_> lru_;
Contributor:

comment suggestion: "LRU-ordered list of currently opened queues"

@dotnwat (Member, Author):

fixed

Comment on lines +131 to +135
/*
* this case is possible, if for example, `nofiles_` was
* initialized with 0 units.
*/
waiters_++;
Contributor:

I'm a bit confused by this: I suspect a subsequent change will allow the semaphore's units to be adjusted dynamically, so for now it's odd to see. Assuming adjusting the semaphore will be the way we'd unblock the get_units call, shouldn't we also be decrementing waiters_ once we add ourselves to lru_?

@dotnwat (Member, Author):

i added a comment about waiters being used to handle cases that arise when the semaphore units are dynamically adjusted.

shouldn't we also be decrementing waiters_ once we add ourselves to lru_?

I think the net effect is that this does happen, but we can't easily have a worker decrement unconditionally when it gets added to the lru list (even if it knows it contributed a +1), because it's hard to know whether another worker already woke up some waiters (the cooperative wake-up at the top is a while loop that wakes up as many as possible). not being careful would have us double-count decrements.

i'm not convinced its 100% correct, but here is a simple example:

say the open file limit is 0 and three queues show up. the lru is empty, so waiters will bump to 3. then we deposit one unit into the semaphore and a waiter wakes up. it'll add itself to the lru and loop back around to wake up waiters. waiters is still 3, but the lru only has 1 entry, so that limits the evictions to 1 (correctly: the currently running worker is the only open file, and now it'll self-close so another worker can run). waiters is decremented to 2, reflecting reality.
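For concreteness, a tiny self-contained trace of that arithmetic (the containers are hypothetical stand-ins for the real types): the wake-up loop performs at most min(waiters, lru size) evictions.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

int main() {
    std::size_t waiters = 3;         // three queues arrived while limit was 0
    std::deque<std::string> lru;     // stands in for the intrusive lru_ list
    lru.push_back("q1");             // the one waiter that acquired the unit

    std::size_t evictions = 0;
    while (waiters > 0 && !lru.empty()) {
        waiters--;
        lru.pop_front();             // signal the victim to self-close
        evictions++;
    }
    assert(evictions == 1 && waiters == 2); // matches the walk-through above
}
```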

Contributor:

Ah thanks for the example! Very helpful. I'd skipped over the fact that we can still proceed with opening without waiters_ being decremented, as long as units are added to open_file_limit_. I initially thought another monitor had to pop from the lru_ for us to proceed, which was confusing because lru_ here is empty!

@dotnwat (Member, Author):

right. i guess waiters has kind of an optimistic property, in that it's incremented before the waiter waits and decremented before the waiter releases.

Comment on lines 164 to 169
for (auto& driver : drivers) {
try {
seastar::remove_file(driver->queue.path().string()).get();
} catch (...) {
}
}
Contributor:

Maybe stick this in a deferred block so it runs even if the test fails?

@dotnwat (Member, Author):

fixed!
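For reference, a sketch of the suggested shape (assuming the test runs in a seastar thread, as the .get() calls imply; `drivers` is the fixture from the quoted test):

```cpp
#include <seastar/core/seastar.hh>
#include <seastar/util/defer.hh>

// cleanup now runs on scope exit, even if an assertion later in the test throws
auto cleanup = seastar::defer([&]() noexcept {
    for (auto& driver : drivers) {
        try {
            seastar::remove_file(driver->queue.path().string()).get();
        } catch (...) {
            // best effort; ignore files that were never created
        }
    }
});
```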

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
@dotnwat dotnwat requested a review from andrwng February 23, 2024 06:00
@dotnwat dotnwat merged commit 7a3b3d5 into redpanda-data:dev Feb 23, 2024
16 checks passed
*/
while (waiters_ > 0 && !lru_.empty()) {
waiters_--;
auto& queue = lru_.front();
Contributor:

queue local var shadows the parameter

@dotnwat (Member, Author) commented Feb 28, 2024:

oh, gross. good catch! at least they are different types. i wonder if that is why clangd wasn't flagging this in my ide.
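For illustration, the offending line and a rename that removes the ambiguity (`victim` is a hypothetical choice):

```cpp
// before: the local shadows monitor()'s `queue*` parameter (and the type name)
// auto& queue = lru_.front();

// after: a distinct name makes the eviction target unambiguous
auto& victim = lru_.front();
lru_.pop_front();
victim.sem_.signal();
```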

queue->sem_.signal();
auto monitor = std::exchange(
queue->monitor_, seastar::make_ready_future<>());
co_await std::move(monitor);
Contributor:

nit: it feels like this could be easier to read if the gate is used to control the lifetime of the background fiber

@dotnwat (Member, Author):

do you mean use a gate instead?
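Presumably the suggestion is something like the following sketch (hypothetical names, assuming redpanda's ssx::spawn_with_gate helper), where the gate, rather than a stashed future, owns the background fiber's lifetime:

```cpp
#include <seastar/core/gate.hh>
#include "ssx/future-util.h"

class queue;                       // the PR's per-file queue type
seastar::future<> monitor(queue*); // stand-in for scheduler::monitor

struct monitor_owner {             // hypothetical sketch
    seastar::gate gate_;

    void start(queue* q) {
        // fire-and-forget fiber tracked by the gate, rather than stashing
        // its future in queue::monitor_
        ssx::spawn_with_gate(gate_, [q] { return monitor(q); });
    }

    seastar::future<> stop() {
        // resolves once every tracked monitor fiber has finished
        return gate_.close();
    }
};
```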

lru_.pop_front();
queue.sem_.signal();
}
units = co_await seastar::get_units(nofiles_, 1);
Contributor:

do I understand correctly that the invariant is: if the queue is in the lru_ list it should have units, so if we are waking up the queue it's guaranteed that we will eventually acquire units on this line?

@dotnwat (Member, Author):

yes. but, as you point out, a queue may still hold on to its units even if it isn't on the lru list. this is a temporary state, and not being on the list is a signal to the queue monitor that, when it runs, it should release its units after stopping the io_queue and closing the file.
