- 
                Notifications
    
You must be signed in to change notification settings  - Fork 25.8k
 
[xpu][feature] Introduce ExpandableSegment for XPU #166299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
          
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166299
 Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ✅ No FailuresAs of commit f8699ce with merge base 3206677 ( This comment was automatically generated by Dr. CI and updates every 15 minutes.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces the ExpandableSegment struct for XPU, which manages virtual memory segments that can be dynamically expanded by mapping physical memory on demand. This is part of supporting an expandable segment feature for XPU memory allocation.
Key changes:
- Added 
SegmentRangestruct to represent contiguous virtual memory segments - Implemented 
ExpandableSegmentclass with map/unmap operations for virtual memory management - Integrated SYCL's virtual memory APIs for reservation, mapping, and access control
 
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
        
          
                c10/xpu/XPUCachingAllocator.cpp
              
                Outdated
          
        
      | .get_info<sycl::info::device::global_mem_size>(); | ||
| // The extra 1/8 allows flexibility for remapping or moving pages within the | ||
| // segment when unmapping earlier regions. | ||
| max_handles_ = numSegments(device_total * (1 + 1.0 / 8)); | 
    
      
    
      Copilot
AI
    
    
    
      Oct 31, 2025 
    
  
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The magic number 1.0 / 8 is hardcoded without a named constant. Consider defining a named constant like VIRTUAL_MEM_OVERSUBSCRIPTION_FACTOR to improve code readability and make the purpose of this calculation clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reasonable comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| size_t offset = p - ptr(); | ||
| return offset / segment_size_; | 
    
      
    
      Copilot
AI
    
    
    
      Oct 31, 2025 
    
  
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential undefined behavior if p is less than ptr(), resulting in a negative offset that wraps around. Add a check to ensure p >= ptr() before computing the offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this comment is valid? We should add assert to confirm the p is always greater than or equal to ptr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| // If `p` lies exactly on a segment boundary, this is equal to segmentLeft(p). | ||
| // Otherwise, it rounds up and returns segmentLeft(p) + 1. | ||
| size_t segmentRight(char* p) const { | ||
| size_t offset = p - ptr(); | 
    
      
    
      Copilot
AI
    
    
    
      Oct 31, 2025 
    
  
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as in segmentLeft(): potential undefined behavior if p < ptr(). Add validation to ensure p >= ptr() before computing the offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
        
          
                c10/xpu/XPUCachingAllocator.cpp
              
                Outdated
          
        
      | return numSegments(offset); | ||
| } | ||
| 
               | 
          ||
| // Constructs a SegmentRange starting at [start, end) indices. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Constructs a SegmentRange starting at [start, end) indices. | |
| // Constructs a SegmentRange in the range of [begin, end). | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| // bound, useful for [begin, end) style ranges. | ||
| // If `p` lies exactly on a segment boundary, this is equal to segmentLeft(p). | ||
| // Otherwise, it rounds up and returns segmentLeft(p) + 1. | ||
| size_t segmentRight(char* p) const { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should define a specific type for segment index, instead of using size_t directly?
Otherwise, it's not easy to read, and prone to ambiguity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion — using SegmentIndex could improve readability. However, it’s used in very few places, and size_t is the standard type for array/container indices in C++. Since it specifically represents an index into handles_, I think it’s fine to keep it as is for now and refactor it if needed during the code unification.
| // Ensure handles_ vector is large enough to hold all segments. | ||
| while (end > handles_.size()) { | ||
| handles_.emplace_back(std::nullopt); | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can do same thing without the while loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, use resize instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
| 
           Starting merge as part of PR stack under #166424  | 
    
# Motivation This PR intends to add expandable segment feature support on XPU. This will help - Reduce memory fragmentation; - Gradually map physical pages into virtual address space as needed. # Additional Context The traditional caching allocator frequently allocates and frees device memory blocks. However, over time, with varying tensor size, the device address space becomes fragmented. Even when there's enough total free memory, a lack of contiguous space can cause large allocations to fail. The **expandable segment** feature addresses this by dynamically extending physical memory within a reserved virtual address range, reducing fragmentation and minimizing reallocation overhead. The potential drawbacks are - Virtual memory overhead; - Potential page mapping overhead; - Increased complexity. Pull Request resolved: #166292 Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui ghstack dependencies: #166299
# Motivation This PR introduces support for peer-to-peer (P2P) access between devices, including querying and enabling P2P connections between two devices. It supports two categories of allocations: - Regular allocations; - Expandable segment allocations. # Additional Context The follow-up is that we should use this feature to optimize our copy kernel when P2P is supported. Pull Request resolved: #166424 Approved by: https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: #166299, #166292
# Motivation This PR aims to reuse some UT to validate the expandable segment feature. # Additional Context Currently, the failure is related to the internal track `GSD-11403`, we could get the fix when upgrading the driver to `ci-neo-master-034630` or greater TODO: add test conv and gemm into this test case when upgrading the driver. Pull Request resolved: #166495 Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui ghstack dependencies: #166299, #166292, #166424
# Motivation This PR intends to add `ExpandableSegment` struct, which is used to help support the expandable segment feature. I split it to a single PR to facilitate the code review. Pull Request resolved: #166299 Approved by: https://github.com/EikanWang, https://github.com/albanD, https://github.com/gujinghui
# Motivation This PR intends to add expandable segment feature support on XPU. This will help - Reduce memory fragmentation; - Gradually map physical pages into virtual address space as needed. # Additional Context The traditional caching allocator frequently allocates and frees device memory blocks. However, over time, with varying tensor size, the device address space becomes fragmented. Even when there's enough total free memory, a lack of contiguous space can cause large allocations to fail. The **expandable segment** feature addresses this by dynamically extending physical memory within a reserved virtual address range, reducing fragmentation and minimizing reallocation overhead. The potential drawbacks are - Virtual memory overhead; - Potential page mapping overhead; - Increased complexity. Pull Request resolved: #166292 Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui ghstack dependencies: #166299
# Motivation This PR introduces support for peer-to-peer (P2P) access between devices, including querying and enabling P2P connections between two devices. It supports two categories of allocations: - Regular allocations; - Expandable segment allocations. # Additional Context The follow-up is that we should use this feature to optimize our copy kernel when P2P is supported. Pull Request resolved: #166424 Approved by: https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: #166299, #166292
# Motivation This PR aims to reuse some UT to validate the expandable segment feature. # Additional Context Currently, the failure is related to the internal track `GSD-11403`, we could get the fix when upgrading the driver to `ci-neo-master-034630` or greater TODO: add test conv and gemm into this test case when upgrading the driver. Pull Request resolved: #166495 Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui ghstack dependencies: #166299, #166292, #166424
Stack from ghstack (oldest at bottom):
Motivation
This PR intends to add
ExpandableSegmentstruct, which is used to help support the expandable segment feature. I split it to a single PR to facilitate the code review.