Is your feature request related to a problem? Please describe.
Currently, BERT processing typically runs on a fixed device (CPU or GPU), with no dynamic adaptation to query complexity. This leads to suboptimal resource utilization:
- Simple queries processed on GPU waste valuable GPU resources.
- Complex queries processed on CPU suffer from slow inference times.
- There is no mechanism to dynamically switch between CPU and GPU based on actual computational demands.
Describe the solution you'd like
Automatic CPU/GPU Switching
Implement a dynamic resource manager that:
- Estimates the computational complexity of incoming queries.
- Automatically routes simple queries to CPU and complex ones to GPU.
- Balances latency, throughput, and hardware utilization.
This could leverage profiling metrics such as token length, syntactic complexity, or historical processing times to make real-time routing decisions.
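A minimal sketch of what such a profiler could look like, in Go, assuming a hypothetical ComputeProfiler calibrated with a token threshold and a CPU latency budget (all names, fields, and thresholds are illustrative, not existing APIs):

package router // hypothetical package name

import (
    "strings"
    "time"
)

// ComputeProfiler estimates per-query cost from cheap heuristics.
// Field names and heuristics are illustrative assumptions.
type ComputeProfiler struct {
    TokenThreshold int           // queries up to this many tokens stay on CPU
    LatencyBudget  time.Duration // maximum acceptable per-query latency on CPU
    AvgCPUPerToken time.Duration // rolling average CPU time per token from past runs
}

// Complexity carries the routing inputs for a single query.
type Complexity struct {
    tokenCount  int
    expectedCPU time.Duration
    profiler    *ComputeProfiler
}

func (p *ComputeProfiler) EstimateComplexity(query string) Complexity {
    tokens := len(strings.Fields(query)) // cheap proxy; a real tokenizer would be more accurate
    return Complexity{
        tokenCount:  tokens,
        expectedCPU: time.Duration(tokens) * p.AvgCPUPerToken,
        profiler:    p,
    }
}

// IsCPUBound: short queries that comfortably fit the CPU latency budget.
func (c Complexity) IsCPUBound() bool {
    return c.tokenCount <= c.profiler.TokenThreshold && c.expectedCPU <= c.profiler.LatencyBudget
}

// IsGPUBound: queries whose expected CPU time exceeds the budget and thus
// justify the GPU transfer cost; anything in between falls back to CPU.
func (c Complexity) IsGPUBound() bool {
    return c.expectedCPU > c.profiler.LatencyBudget
}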
To reduce GPU transfer overhead and improve throughput, introduce a queue-based batching mechanism inspired by continuous batching in vLLM:
- GPU-bound queries are batched efficiently.
- Data transfers and computation are overlapped where possible.
- CPU and GPU workloads are decoupled via a shared batch queue.
Example design:
type ResourceManager struct {
    CPUProcessor *CPUProcessor
    GPUProcessor *GPUProcessor
    Profiler     *ComputeProfiler
    Queue        *BatchQueue
}

func (rm *ResourceManager) ProcessQuery(query string) (*ClassificationResult, error) {
    // 1. Estimate query complexity
    complexity := rm.Profiler.EstimateComplexity(query)

    // 2. Route based on compute bounds
    if complexity.IsCPUBound() {
        return rm.CPUProcessor.Process(query)
    } else if complexity.IsGPUBound() {
        return rm.Queue.AddToBatch(query) // Enqueue for GPU batch processing
    }

    // Fallback to CPU
    return rm.CPUProcessor.Process(query)
}
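To complement the design above, here is a rough outline of the BatchQueue it refers to, assuming a simple size-or-timeout flush policy and a hypothetical batched ProcessBatch method on GPUProcessor (again, names are illustrative rather than existing APIs; imports as in the profiler sketch):

// BatchQueue accumulates GPU-bound queries and flushes them as one batch,
// either when the batch is full or when a timeout expires.
type BatchQueue struct {
    gpu      *GPUProcessor
    maxBatch int
    maxWait  time.Duration
    pending  chan pendingQuery
}

type pendingQuery struct {
    query  string
    result chan batchResult
}

type batchResult struct {
    res *ClassificationResult
    err error
}

func NewBatchQueue(gpu *GPUProcessor, maxBatch int, maxWait time.Duration) *BatchQueue {
    q := &BatchQueue{gpu: gpu, maxBatch: maxBatch, maxWait: maxWait, pending: make(chan pendingQuery, maxBatch*4)}
    go q.run() // background loop that forms and dispatches batches
    return q
}

// AddToBatch enqueues a query and blocks until its batch has been processed.
func (q *BatchQueue) AddToBatch(query string) (*ClassificationResult, error) {
    pq := pendingQuery{query: query, result: make(chan batchResult, 1)}
    q.pending <- pq
    r := <-pq.result
    return r.res, r.err
}

func (q *BatchQueue) run() {
    for {
        batch := []pendingQuery{<-q.pending} // block until the first query arrives
        timer := time.NewTimer(q.maxWait)
    fill:
        for len(batch) < q.maxBatch {
            select {
            case pq := <-q.pending:
                batch = append(batch, pq)
            case <-timer.C:
                break fill // timeout: flush whatever we have
            }
        }
        timer.Stop()

        queries := make([]string, len(batch))
        for i, pq := range batch {
            queries[i] = pq.query
        }
        // Hypothetical batched GPU call; the real signature may differ.
        results, err := q.gpu.ProcessBatch(queries)
        for i, pq := range batch {
            if err != nil {
                pq.result <- batchResult{nil, err}
            } else {
                pq.result <- batchResult{results[i], nil}
            }
        }
    }
}

Decoupling callers from the GPU through this queue is what lets transfers and computation overlap: while one batch runs on the GPU, the next one is already being assembled.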
Describe alternatives you've considered
Static Profiling at Startup:
Run a benchmark with sample data during initialization to calibrate CPU compute-bound vs. I/O-bound thresholds based on system configuration.
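A rough sketch of how that calibration could work, reusing the hypothetical ComputeProfiler fields from the sketch above and the CPUProcessor.Process call from the example design (the sample set and latency budget are assumptions):

// CalibrateThresholds runs sample queries on CPU at startup and derives a
// token threshold from the observed per-token latency.
func CalibrateThresholds(cpu *CPUProcessor, samples []string, budget time.Duration) *ComputeProfiler {
    var totalTokens int
    var totalTime time.Duration
    for _, s := range samples {
        start := time.Now()
        _, _ = cpu.Process(s) // ignore the result; only the timing matters here
        totalTime += time.Since(start)
        totalTokens += len(strings.Fields(s))
    }
    if totalTokens == 0 {
        totalTokens = 1 // guard against an empty sample set
    }
    perToken := totalTime / time.Duration(totalTokens)
    if perToken <= 0 {
        perToken = time.Microsecond // avoid division by zero on very fast samples
    }
    return &ComputeProfiler{
        TokenThreshold: int(budget / perToken), // largest query that still fits the CPU budget
        LatencyBudget:  budget,
        AvgCPUPerToken: perToken,
    }
}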
Additional context
If this feature aligns with the project's roadmap, would it be possible to assign this issue to me? I’d appreciate the opportunity to start working on it.