Fix Vec2/Vec4 UVM performance regression with vectorized at::Half copy #5491

Closed
q10 wants to merge 1 commit into pytorch:main from q10:export-D96381299


q10 commented Mar 18, 2026

Summary:
Apply the vectorized copy optimization pattern to at::Half types in the Vec2 and Vec4 classes on ROCm. This ensures that at::Half copies use 32-bit or 64-bit memory operations instead of scalar, element-by-element access.

With UVM (managed memory), each separate copy can trigger a page fault, so element-by-element access causes a significant slowdown. Vectorized operations issue fewer, wider accesses and reduce this overhead.
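As a rough illustration of the difference between scalar and vectorized copies, here is a minimal host-side C++ sketch. It is not the code from this change: `Half16` is a hypothetical stand-in for at::Half (any trivially copyable 16-bit type), and `copy4_scalar` and `copy4_vectorized` are illustrative names. The vectorized version assumes both pointers are 8-byte aligned.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for at::Half: a trivially copyable 16-bit type.
struct Half16 {
  std::uint16_t bits;
};
static_assert(sizeof(Half16) == 2, "Half16 must be 16 bits wide");

// Scalar copy: four separate 16-bit loads and four 16-bit stores.
inline void copy4_scalar(Half16* dst, const Half16* src) {
  for (int i = 0; i < 4; ++i) {
    dst[i] = src[i];
  }
}

// Vectorized copy: move all four 16-bit values as one 64-bit word.
// Assumption: src and dst are both 8-byte aligned.
inline void copy4_vectorized(Half16* dst, const Half16* src) {
  assert(reinterpret_cast<std::uintptr_t>(src) % alignof(std::uint64_t) == 0);
  assert(reinterpret_cast<std::uintptr_t>(dst) % alignof(std::uint64_t) == 0);
  std::uint64_t packed;
  // A fixed-size 8-byte memcpy is typically lowered by the compiler
  // to a single 64-bit load and a single 64-bit store.
  std::memcpy(&packed, src, sizeof(packed));
  std::memcpy(dst, &packed, sizeof(packed));
}
```

The two-element case would follow the same shape with a 32-bit word. In device code the same idea is usually expressed through a wider built-in vector type, but the exact types used in this change are not shown here.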

Reviewed By: spcyppt

Differential Revision: D96381299

meta-codesync Bot commented Mar 18, 2026

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96381299.

meta-codesync Bot commented Mar 19, 2026

This pull request has been merged in fc7c8f2.
