Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scx_userland: allocate tasks array based on kernel.pid_max #29

Merged
merged 1 commit into from
Dec 13, 2023

Conversation

arighi
Copy link
Collaborator

@arighi arighi commented Dec 13, 2023

Currently the array of enqueued tasks is statically allocated to a fixed size of USERLAND_MAX_TASKS to avoid potential deadlocks that could be introduced by performing dynamic allocations in the enqueue path.

However, this also adds a limit on the maximum pid that the scheduler can handle, since the pid is used as the index to access the array.

In fact, it is quite easy to trigger the following failure on an average desktop system (making this scheduler pretty much unusable in such scenario):

$ sudo scx_userland
...
Failed to enqueue task 33258: No such file or directory
EXIT: BPF scheduler unregistered

Prevent this by using sysctl's kernel.pid_max as the size of the tasks array (and still allocate it all at once during initialization).

The downside of this change is that scx_userland may require additional memory to start and in small systems it could even trigger OOMs. For this reason add an explicit message to the command help, suggesting to reduce kernel.pid_max in case of OOM conditions.

@htejun htejun requested a review from Byte-Lab December 13, 2023 15:25
Copy link
Contributor

@Byte-Lab Byte-Lab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the fix Andrea. Just left one comment but otherwise it's good to go.

Comment on lines 7 to 10
/*
* Maximum amount of tasks enqueued/dispatched between kernel and user-space.
*/
#define USERLAND_MAX_TASKS 8192
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this macro still used? If not let's just delete it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still used in .bpf.c. There, it's just the max size of the ring buffer and overflow handling seems okay. Maybe it should be renamed for clarity?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep as @htejun says, it's still used in .bpf.c as the max amount of tasks enqueued/dispatched between kernel and userspace, that's why I added the comment to better clarify its usage, but I think it would be nice to also rename it at this point...

Copy link
Contributor

@Byte-Lab Byte-Lab Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gah, right, my bad. Yeah, maybe something like MAX_ENQUEUED_TASKS? And we can probably just bring it into scx_userland.bpf.c directly. It's also an option to set the max entries value from userspace before the map is loaded, but keeping the macro seems preferable given the fallback logic as you pointed out.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's quite an internal parameter, basically it defines how many items can be pushed at most to the enqueued/dispatched maps, so I don't think there's much value to export this to user-space as a configurable option... but I think we can definitely bring it into .bpf.c, since it's used only there at this point. And I like the name MAX_ENQUEUED_TASKS, it's definitely more clear.

Copy link
Contributor

@Byte-Lab Byte-Lab Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I meant that you could set the max entries in the map to the value of /proc/sys/kernel/pid_max (so not exposed to the end user, just set from user space). But yeah I think bringing the macro into the bpf file and statically setting it there is fine / preferable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and updated.

Setting pid-max entries in the bpf maps would be a bit of a waste IMHO, there's no way a system will have pid-max concurrent tasks running. :)

And if we fail to enqueue a task from in the bpf part the task will just be dispatched immediately. OTOH if we fail to push the task to the dispatched map from the .c we would error and exit, maybe this part can be improved a bit, i.e., re-adding the pid to the vruntime_head list?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, sounds like another nice improvement!

Currently the array of enqueued tasks is statically allocated to a fixed
size of USERLAND_MAX_TASKS to avoid potential deadlocks that could be
introduced by performing dynamic allocations in the enqueue path.

However, this also adds a limit on the maximum pid that the scheduler
can handle, since the pid is used as the index to access the array.

In fact, it is quite easy to trigger the following failure on an average
desktop system (making this scheduler pretty much unusable in such
scenario):

 $ sudo scx_userland
 ...
 Failed to enqueue task 33258: No such file or directory
 EXIT: BPF scheduler unregistered

Prevent this by using sysctl's kernel.pid_max as the size of the tasks
array (and still allocate it all at once during initialization).

The downside of this change is that scx_userland may require additional
memory to start and in small systems it could even trigger OOMs. For
this reason add an explicit message to the command help, suggesting to
reduce kernel.pid_max in case of OOM conditions.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Copy link
Contributor

@Byte-Lab Byte-Lab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks a lot @arighi

@Byte-Lab Byte-Lab merged commit a68885f into sched-ext:main Dec 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants