Add DE-Surrogate hybrid autotuner algorithm + early stopping option for DE and DE-Surrogate #1096
Conversation
This introduces DESurrogateHybrid, a novel hybrid optimization algorithm that combines Differential Evolution's robust exploration with a Random Forest surrogate model's sample efficiency for GPU kernel autotuning.

Key features:
- Generates 3× more candidates than standard DE but evaluates only the most promising ones, as predicted by the Random Forest surrogate
- Achieves a 6.53% average performance improvement over standard DE
- 1.20× faster wall-clock time despite considering more configurations
- Learns kernel-specific optimization patterns automatically

Implementation:
- Works directly with Helion's discrete parameter spaces
- Uses ConfigEncoder to convert configurations to numerical vectors
- Refits the surrogate model every 5 generations for continuous learning
- Configurable parameters: population_size, candidate_ratio, surrogate_threshold (see the usage sketch below)

Testing on 3 diverse kernels (MatMul, GELU, FusedReLU) shows:
- MatMul (compute-bound): -15.0% runtime (improvement), 1.39× faster convergence
- GELU (bandwidth-bound): -5.4% runtime
- FusedReLU (memory-bound): +0.8% runtime (competitive, within the margin of error)
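For concreteness, a hypothetical usage sketch: the add kernel follows Helion's documented example, but the DESurrogateHybrid import path, constructor signature, and autotune() entry point are assumptions modeled on the parameter names above, not the merged API.

```python
import torch
import helion
import helion.language as hl
from helion.autotuner import DESurrogateHybrid  # assumed import path

# Helion's canonical elementwise-add kernel, used here only as a tuning target.
@helion.kernel()
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] + y[tile]
    return out

args = (torch.randn(4096, device="cuda"), torch.randn(4096, device="cuda"))
search = DESurrogateHybrid(
    add.bind(args),          # bound kernel supplying the discrete config space
    args,
    population_size=40,      # DE population per generation (assumed default)
    candidate_ratio=3,       # generate 3x candidates, measure only the top N
    surrogate_threshold=20,  # assumed meaning: measured samples required
                             # before the Random Forest is used for ranking
)
best_config = search.autotune()
```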
Add test_de_surrogate_hybrid following the same pattern as test_differential_evolution_search. Uses small population (5) and few generations (3) for quick verification.
ConfigEncoder converts Helion's discrete configurations into numerical vectors suitable for machine learning models like Random Forests and Gaussian Processes. This is a required dependency for DESurrogateHybrid and other ML-assisted autotuners. It handles:
- Power-of-2 values with log2 encoding
- Categorical variables with one-hot encoding
- Proper bounds computation for optimization
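A minimal sketch of that encoding scheme, for illustration: the ConfigEncoder name comes from this PR, while the internals and the dict-based search-space representation are assumptions.

```python
import math
import numpy as np

class ConfigEncoder:
    """Illustrative encoder: log2 for power-of-2 parameters, one-hot for
    categoricals (not the PR's actual implementation)."""

    def __init__(self, space):
        # space: parameter name -> list of allowed values
        self.space = space

    def encode(self, config):
        vec = []
        for name, choices in self.space.items():
            value = config[name]
            if all(isinstance(c, int) and c > 0 and (c & (c - 1)) == 0
                   for c in choices):
                # Power-of-2 parameter (e.g. block sizes): log2 keeps the
                # numeric scale linear for the surrogate model.
                vec.append(math.log2(value))
            else:
                # Categorical parameter: one-hot over the allowed choices.
                vec.extend(1.0 if value == c else 0.0 for c in choices)
        return np.asarray(vec)

space = {"block_size": [32, 64, 128, 256], "indexing": ["pointer", "block_ptr"]}
enc = ConfigEncoder(space)
print(enc.encode({"block_size": 128, "indexing": "block_ptr"}))  # [7. 0. 1.]
```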
Do you have more results you could share? Which kernels have you tested this on? What is the impact on tuning time and resulting performance?
@jansel sure. I've plotted the convergence of this versus regular DE for 3 different kernels:
[Convergence plots: DE-Surrogate vs. standard DE on the 3 kernels]
@jansel A better write-up:

Detailed Benchmark Results

Hardware Configuration

Kernels Tested

I evaluated DE-Surrogate on 3 diverse kernels spanning different computational characteristics. All algorithms ran with ~1600 evaluations per kernel for a fair comparison.

Performance Results

Average: 6.53% better performance

Tuning Time Results

1.20× faster in total wall-clock time despite evaluating the same number of configs.

Key Insights

How It Works

Standard DE generates N candidates → evaluates ALL N. DE-Surrogate generates 3×N candidates → the Random Forest predicts their performance → it evaluates only the top N. This allows exploring more of the search space while learning kernel-specific patterns like "block_size=128 with num_warps=8 is fast".
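For illustration, a minimal sketch of one generation's filter step, assuming scikit-learn's RandomForestRegressor as the surrogate (the model family this PR names); the encode/evaluate callables and the bookkeeping are stand-ins for ConfigEncoder and the benchmark harness.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def surrogate_filtered_step(candidates, encode, evaluate, history, pop_size):
    # candidates: 3*pop_size configs produced by DE mutation/crossover
    # encode:     config -> feature vector (ConfigEncoder.encode)
    # evaluate:   measures a config on the GPU; returns time, lower is better
    # history:    non-empty list of (config, time) pairs measured so far
    #
    # The PR refits the surrogate every 5 generations; refitting on every
    # call keeps this sketch short.
    X = np.stack([encode(cfg) for cfg, _ in history])
    y = np.array([t for _, t in history])
    surrogate = RandomForestRegressor(n_estimators=100).fit(X, y)

    # Rank all 3N candidates by predicted time, then measure only the top N.
    preds = surrogate.predict(np.stack([encode(c) for c in candidates]))
    for i in np.argsort(preds)[:pop_size]:
        history.append((candidates[i], evaluate(candidates[i])))
    return min(history, key=lambda pair: pair[1])
```

A forest prediction costs microseconds versus milliseconds for a GPU measurement, which is how the hybrid can consider 3× the candidates per generation and still come out ahead on wall-clock time.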
How does it compare to PatternSearch?
#1095 might be helpful for evaluation.
@jansel this is what I've got (I've trimmed the X-axis, since it converged somewhat fast).
Detailed Results by Kernel

MatMul-1024 (Compute-Bound)

Winner: DE-Surrogate - best performance (0.01747 ms) and faster than standard DE

GELU-1M (Bandwidth-Bound)

Winner: DE-Surrogate - best performance and 25% faster than standard DE

FusedReLUAdd-1M (Memory-Bound)

Winner: 3-way tie - all algorithms found essentially the same optimum (0.0064 ms)
So to summarize, DE-Surrogate is consistently faster and better than DE.
- Add min_improvement_delta and patience parameters (defaults: 0.001 and 3)
- Stop when the relative improvement stays below 0.1% for 3 consecutive generations (see the sketch below)
- DE-Surrogate benefits most: a 37% reduction in evaluations once converged
- DifferentialEvolution uses it as a safety net to prevent an unbounded search
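A minimal sketch of that stopping rule: only the min_improvement_delta and patience semantics come from the commit message above; the function name and list-based bookkeeping are illustrative.

```python
def should_stop(best_times, min_improvement_delta=0.001, patience=3):
    # best_times: best measured time after each completed generation
    if len(best_times) <= patience:
        return False
    recent = best_times[-(patience + 1):]
    # Stop once each of the last `patience` generations improved the best
    # time by less than min_improvement_delta in relative terms.
    return all(
        (prev - curr) / prev < min_improvement_delta
        for prev, curr in zip(recent, recent[1:])
    )

assert should_stop([1.00, 0.90, 0.8999, 0.8999, 0.8999])  # stalled for 3
assert not should_stop([1.00, 0.90, 0.80, 0.70, 0.60])    # still improving
```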
Looks like tests/lints are failing.
jansel left a comment
Lints/tests still failing, other than that looks good.
Lints still failing? You can run:
Fixed it, sorry for the back and forth |
All green, ready to merge! |




This introduces DESurrogateHybrid, a novel hybrid optimization algorithm that combines Differential Evolution's robust exploration with Random Forest surrogate model's sample efficiency for GPU kernel autotuning.
Key features:
Implementation:
Testing on 3 diverse kernels (MatMul, GELU, FusedReLU) shows: