
Conversation

@jeffdaily (Collaborator) commented Feb 5, 2025

Allocations made with cudaHostRegister should be released with the corresponding cudaHostUnregister, and similarly for cudaHostAlloc / cudaFreeHost. In test_cuda.py, the allocator config changes from test to test, but the cache is not emptied before the config changes. This results in the wrong free being called later. Unit test sharding happens to hide this issue, but running test_cuda.py as a single shard will fail.

The following reproducer demonstrates the problem.

```C++
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    void *ptr;
    assert(cudaSuccess == cudaHostAlloc(&ptr, 1024, cudaHostAllocDefault));
    // Mismatched free path: memory from cudaHostAlloc must be released with cudaFreeHost.
    assert(cudaSuccess == cudaHostUnregister(ptr));
    std::free(ptr);
    return 0;
}
```

The above code fails because ptr was never registered with cudaHostRegister and is therefore an invalid argument to cudaHostUnregister:

```
a.out: test.cpp:53: int main(int, char**): Assertion `cudaSuccess == cudaHostUnregister(ptr)' failed.
```
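For contrast, here is a minimal sketch of the two valid pairings (illustrative only, not code from this PR; error handling is reduced to asserts):

```C++
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    // Pairing 1: cudaHostAlloc / cudaFreeHost.
    void *a = nullptr;
    assert(cudaSuccess == cudaHostAlloc(&a, 1024, cudaHostAllocDefault));
    assert(cudaSuccess == cudaFreeHost(a));

    // Pairing 2: malloc + cudaHostRegister / cudaHostUnregister + free.
    void *b = std::malloc(1024);
    assert(cudaSuccess == cudaHostRegister(b, 1024, cudaHostRegisterDefault));
    assert(cudaSuccess == cudaHostUnregister(b));
    std::free(b);
    return 0;
}
```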

Users may change the allocator config at will; the torch unit tests do this. However, allocations using cudaHostRegister should use the corresponding cudaHostUnregister, and similarly for cudaHostAlloc / cudaFreeHost.

pytorch-bot bot commented Feb 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146520

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 1 Pending, 2 Unrelated Failures

As of commit b6ad576 with merge base fa0fdc0:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jeffdaily (Collaborator Author)

Notably, ROCm PyTorch would fail to run test_cuda.py as a single module without sharding. The root cause was a sequence of tests changing the allocator config, which resulted in the host allocator's empty_cache() seg faulting due to allocations made with hipHostMalloc() later being freed with hipHostUnregister().
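Per the PR title, the fix is to track how each block was allocated so the matching free is always called. Below is a minimal sketch of that idea, using a hypothetical HostBlock record and helper functions; it is not the actual CachingHostAllocator implementation, and it omits error handling and caching.

```C++
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical bookkeeping: remember how each pinned block was created so the
// matching free path is used even if the allocator config changes later.
struct HostBlock {
  void *ptr = nullptr;
  size_t size = 0;
  bool registered = false;  // true: malloc + cudaHostRegister, false: cudaHostAlloc
};

HostBlock allocate_pinned(size_t size, bool use_host_register) {
  HostBlock block;
  block.size = size;
  block.registered = use_host_register;
  if (use_host_register) {
    block.ptr = std::malloc(size);
    cudaHostRegister(block.ptr, size, cudaHostRegisterDefault);
  } else {
    cudaHostAlloc(&block.ptr, size, cudaHostAllocDefault);
  }
  return block;
}

void free_pinned(const HostBlock &block) {
  // Free according to how the block was allocated, not the current config.
  if (block.registered) {
    cudaHostUnregister(block.ptr);
    std::free(block.ptr);
  } else {
    cudaFreeHost(block.ptr);
  }
}
```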

@jeffdaily added the "topic: not user facing" label Feb 5, 2025
@mikaylagawarecki added the "triaged" label Feb 7, 2025
@jeffdaily (Collaborator Author)

Alternative approaches would be to have the host caching allocator empty itself whenever the allocator config changes, or to have the unit tests empty the cache to ensure a consistent state.
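For illustration, a minimal sketch of the first alternative, using a hypothetical HostCachingAllocator interface; the names and the set_config() hook are placeholders, not the real ATen or allocator-config API.

```C++
#include <string>

// Hypothetical interface; names are placeholders, not the actual
// CachingHostAllocator or allocator-config API.
struct HostCachingAllocator {
  // Release every cached block using the free path recorded when it was allocated.
  void empty_cache() { /* ... */ }

  // Alternative: flush the cache whenever the allocator config changes, so that
  // no block allocated under the old config is ever freed under the new one.
  void set_config(const std::string& new_config) {
    empty_cache();
    config_ = new_config;
  }

 private:
  std::string config_;
};
```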

@jeffdaily (Collaborator Author)

@zdevito can I get a review and opinion on this approach vs others suggested?

@jeffdaily (Collaborator Author)

@mikaylagawarecki / @zdevito ping -- still waiting for a review

@pruthvistony added the "rocm", "rocm priority", "ciflow/rocm", and "ciflow/rocm-mi300" labels Mar 21, 2025
@jeffdaily (Collaborator Author)

@ngimel could I perhaps get your opinion on this PR, since it affects CUDA too?

@jeffdaily (Collaborator Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/146520/head returned non-zero exit code 1

Rebasing (1/1)
Auto-merging aten/src/ATen/cuda/CachingHostAllocator.cpp
CONFLICT (content): Merge conflict in aten/src/ATen/cuda/CachingHostAllocator.cpp
error: could not apply 6ff673ddd1c... CUDA CachingHostAllocator tracks registrations to call correct free
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 6ff673ddd1c... CUDA CachingHostAllocator tracks registrations to call correct free

Raised by https://github.com/pytorch/pytorch/actions/runs/14207074188

@ngimel (Collaborator) left a comment


Good catch!

@jeffdaily (Collaborator Author)

@pytorchbot merge

@pytorch-bot bot added the "ciflow/trunk" label Apr 3, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025

CUDA CachingHostAllocator tracks registrations to call correct free (pytorch#146520)
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025

CUDA CachingHostAllocator tracks registrations to call correct free (pytorch#146520)

Labels

ciflow/rocm - Trigger "default" config CI on ROCm
ciflow/rocm-mi300 - Trigger "default" config CI on ROCm MI300
ciflow/trunk - Trigger trunk jobs on your pull request
Merged
open source
rocm priority - high priority ROCm PRs from performance or other aspects
rocm - This tag is for PRs from ROCm team
topic: not user facing - topic category
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module


6 participants