feat(scheduler): add GPU-aware DRF scheduling with resource requirements#434
Conversation
- Add DRF (Dominant Resource Fairness) scheduler plugin with GPU support - Add ResourceRequirement (cpu, memory, gpu) to SessionSpec for explicit resource requests - Add priority field to SessionSpec for session ordering - Regenerate Python SDK protobuf code with new fields - Add comprehensive tests for ResourceRequirement parsing and SessionAttributes This enables sessions to request specific GPU/CPU/memory resources instead of relying on slots, and introduces DRF-based fair scheduling across sessions.
There was a problem hiding this comment.
Code Review
This pull request introduces GPU support and Dominant Resource Fairness (DRF) scheduling by adding GPU tracking to resource requirements, implementing GPU detection, and allowing sessions to specify explicit resource requests (resreq). While the architectural changes are significant, the review identifies several critical bugs: the resreq field is currently dropped during session creation, the DRF plugin fails to calculate resources for sessions using explicit requests, and GPU data is not persisted in the filesystem storage engine. Furthermore, feedback points out potential integer casting issues in slot calculations and a failure to implement the GPU detection priority specified in the design document.
| max_instances: ssn_spec.max_instances, | ||
| batch_size: ssn_spec.batch_size.max(1), | ||
| priority: ssn_spec.priority, | ||
| resreq: ssn_spec.resreq.map(apis::ResourceRequirement::from), |
There was a problem hiding this comment.
The resreq is correctly extracted from the SessionSpec, but it appears to be dropped when self.controller.create_session(attr) is called. The SessionInfo struct in session_manager/src/model/mod.rs does not have a field for resreq, and its constructor SessionInfo::new doesn't accept it.
This is a critical omission, as the resource requirements are lost at session creation and cannot be used by the scheduler. This breaks the primary goal of this feature.
To fix this, you'll need to:
- Add
pub resreq: Option<ResourceRequirement>to theSessionInfostruct. - Update
SessionInfo::newto accept and storeresreq. - Update
controller::create_sessionto passresreqfromSessionAttributestoSessionInfo::new. - Ensure
resreqis also handled correctly for session persistence in storage.
|
|
||
| fn is_allocatable(&self, node: &NodeInfoPtr, ssn: &SessionInfoPtr) -> Option<bool> { | ||
| let node_alloc = self.node_allocations.get(&node.name)?; | ||
| let ssn_resreq = ResourceRequirement::new(ssn.slots, &self.unit); |
There was a problem hiding this comment.
This line calculates the session's resource requirements based on slots. However, this PR introduces resreq for explicit resource requests. If a session is created using resreq, its slots value will be 0, and ssn_resreq will be calculated as all zeros. This will lead to incorrect scheduling decisions, as the scheduler will think the session requires no resources.
This is a critical bug that undermines the DRF scheduling for sessions using resreq.
This is related to the missing resreq field in SessionInfo. Once that is fixed, this function should be updated to handle both cases:
- If
ssn.resreqisSome, use that value. - If
ssn.resreqisNone, then calculate based onssn.slotsandself.unit.
This logic needs to be applied consistently across the plugin, including in on_executor_allocate, on_executor_unallocate, etc.
| let gpu_slots = if unit.gpu > 0 { | ||
| (self.gpu / unit.gpu) as u64 | ||
| } else { | ||
| u64::MAX // No GPU requirement = unlimited GPU slots | ||
| }; |
There was a problem hiding this comment.
There's a potential issue with integer casting if self.gpu is negative. If self.gpu is negative (e.g., -1) and unit.gpu is positive, self.gpu / unit.gpu will be negative. Casting a negative i32 to u64 results in a very large positive number due to two's complement representation. This would incorrectly suggest a huge number of available GPU slots, leading to incorrect scheduling decisions.
let gpu_slots = if unit.gpu > 0 {
if self.gpu < 0 {
0
} else {
(self.gpu as u64) / (unit.gpu as u64)
}
} else {
u64::MAX // No GPU requirement = unlimited GPU slots
};| capacity: ResourceRequirement { | ||
| cpu: meta.capacity_cpu, | ||
| memory: meta.capacity_memory, | ||
| gpu: 0, |
There was a problem hiding this comment.
The gpu field is hardcoded to 0 when loading a Node from filesystem storage. The NodeMeta struct used for serialization should be updated to include capacity_gpu and allocatable_gpu fields, and these should be used here.
A similar issue exists for executors on line 1488, where resreq_gpu is also hardcoded to 0 because ExecutorMeta is missing the field.
Without these changes, GPU resource information for nodes and executors will be lost across restarts when using the filesystem storage engine.
| fn detect_gpus(ctx: &FlameClusterContext) -> i32 { | ||
| // Priority: config > env var > NVML detection (always enabled by default) | ||
|
|
||
| // 1. Check configuration override | ||
| if let Some(gpu_count) = ctx.executor.gpu_count { | ||
| return gpu_count; | ||
| } |
There was a problem hiding this comment.
The design specifies that GPU detection should prioritize a configuration override from ctx.executor.gpu_count. However, the current implementation of detect_gpus in common/src/apis/types.rs does not include this check, as it doesn't have access to the FlameClusterContext. This is a deviation from the documented design. To align with the design, detect_gpus should probably be moved to a location where it can access the context, or the context needs to be passed down to it.
- Add resreq field to Session and SessionInfo to preserve resource requirements - Fix DRF plugin to use session.resreq when available instead of slots calculation - Fix negative GPU handling in to_slots() to prevent incorrect u64 casting - Add GPU fields to filesystem storage NodeMeta and ExecutorMeta - Add session resreq fields to SQLite migration and storage - Update all test SessionInfo initializers with resreq: None
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Summary
ResourceRequirement(cpu, memory, gpu) toSessionSpecfor explicit resource requestspriorityfield toSessionSpecfor session orderingResourceRequirementparsing andSessionAttributesChanges
Scheduler
drf.rsplugin implementing Dominant Resource Fairness algorithmresreqfieldProto/SDK Updates
types.proto: Addedpriorityandresreqfields toSessionSpecResourceRequirementclass withfrom_string()parserTests
ResourceRequirementclass covering:cpu=4,mem=16g,gpu=2)SessionAttributeswith/withoutresreqApplicationAttributes.installerfieldRelated Issue
RFE433 - GPU-aware DRF scheduling