GDR contrib- Suballocators fixes #22989
Conversation
Thanks for the fix! I will check on our environment soon.
Great! Note that I experienced a hang during initialization with the master branch, but this also happened in vanilla grpc, so it's probably unrelated. I haven't isolated the grpc problem yet, but grpc and the patched gdr worked with the 5-day-old branch (96a6333).
@byronyi let me know when you have confirmed that it works.
GPUProcessState::singleton()->AddCUDAHostAllocVisitor(bus_id,
                                                      alloc_visitor);
GPUProcessState::singleton()->AddCUDAHostAllocVisitor(0, alloc_visitor);
GPUProcessState::singleton()->AddCUDAHostAllocVisitor(1, alloc_visitor);
This seems fine for now. NUMA-specific allocation has not yet been turned on, pending bringing TF's Eigen version up to date. It might be better to add a visitor for every available NUMA node by using port::NUMANumNodes.
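E.g., a minimal sketch (assuming the alloc_visitor defined earlier in this function):

// Register the CUDA host alloc visitor on every NUMA node instead of
// hard-coding specific nodes.
for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
  GPUProcessState::singleton()->AddCUDAHostAllocVisitor(numa_idx,
                                                        alloc_visitor);
}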
Thanks for the tip. Added it to the latest commit.
Thanks for the fix!
GPUProcessState::singleton()->AddCUDAHostAllocVisitor(bus_id,
                                                      alloc_visitor);

for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
This should be moved out of the above if (IsGDRAvailable()), as the host memory should be registered regardless. Otherwise, it fails for GDR-incapable GPUs, such as those GTX cards.
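Something like this (a sketch; the surrounding code may differ):

// Host memory registration happens unconditionally...
for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
  GPUProcessState::singleton()->AddCUDAHostAllocVisitor(numa_idx,
                                                        alloc_visitor);
}
// ...while GPU memory registration stays behind the capability check.
if (IsGDRAvailable()) {
  for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
    GPUProcessState::singleton()->AddGPUAllocVisitor(numa_idx,
                                                     cuda_alloc_visitor);
  }
}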
Probably should change the name of IsGDRAvailable to P2PKernelModuleEnabled to avoid confusion. Any suggestions?
GDR is the correct name here. Transporting between GPUs' host-pinned memory is technically just "GD".
I would like to eventually add the possibility to selectively avoid GPU-direct on slow PCIe routes (like QPI), but I prefer to just focus on hotfixes for now, until the upcoming changes regarding the transport contributions.
                                                        alloc_visitor);
  LOG(INFO) << "Instrumenting Cuda host allocator on numa " << numa_idx;
}

GPUProcessState::singleton()->AddCUDAHostFreeVisitor(bus_id, free_visitor);
Should rename bus_id to numa_idx, and move it inside the loop.
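I.e. (sketch, assuming the free_visitor defined earlier):

for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
  GPUProcessState::singleton()->AddCUDAHostFreeVisitor(numa_idx,
                                                       free_visitor);
}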
@poxvoculi I saw different signatures for AddGPUAllocVisitor(int bus_id, const SubAllocator::Visitor& visitor) and AddCUDAHostAllocVisitor(int numa_node, const SubAllocator::Visitor& visitor). It seems rather confusing. Maybe change bus_id to numa_node for AddGPUAllocVisitor as well?
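For reference, the two declarations side by side:

void AddGPUAllocVisitor(int bus_id, const SubAllocator::Visitor& visitor);
void AddCUDAHostAllocVisitor(int numa_node,
                             const SubAllocator::Visitor& visitor);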
With the requested changes, I have tested the patch with CPU only, GPU without GDR, GPU with GDR on a different NUMA node, and GPU with GDR on the same NUMA node. Thanks again!
Thanks for the comments. Will go over them and re-commit tomorrow.
for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
  GPUProcessState::singleton()->AddGPUAllocVisitor(numa_idx,
                                                   cuda_alloc_visitor);
  GPUProcessState::singleton()->AddCUDAHostFreeVisitor(numa_idx,
This line should also be moved above (i.e. outside IsGDRAvailable()).
@drpngx LGTM
@@ -640,8 +642,8 @@ void GdrMemoryManager::TensorFromTransportOptions(
   } else {
     checksum = GPUUtil::Checksum(*tensor);
   }
-  CHECK(checksum == remote_mr.checksum())
-      << "Checksum mismatch: " << checksum << "!=" << remote_mr.checksum();
+  CHECK(checksum == remote_mr.checksum()) << "Checksum mismatch: " << checksum
Oh this is going to create a problem internally. Could you change that into a return error::InternalError?
FYI: this only hits when VLOG >= 2 is turned on, and it is being used to debug memory corruption problems from hardware direct access. It works like an assertion, and it's easier to debug if we could keep the original stack instead of returning.
Thoughts?
It's a policy that we have internally. You can LOG(ERROR) something and then return.
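A sketch of that pattern (assuming the surrounding function returns a Status; errors::Internal is from tensorflow/core/lib/core/errors.h):

// Log the mismatch for debugging, then return an error instead of crashing.
if (checksum != remote_mr.checksum()) {
  LOG(ERROR) << "Checksum mismatch: " << checksum
             << " != " << remote_mr.checksum();
  return errors::Internal("Checksum mismatch: ", checksum, " != ",
                          remote_mr.checksum());
}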
Good to know. @yanivbl6 mind unpatching this line, and I'll submit a separate PR to clean things up?
Sure, I will do it tomorrow morning.
@drpngx Will just re-committing the line back to the non-formatted version be sufficient to comply with the policy or do I need to force-push over the commit with that change?
I guess the latter as they always squash the PR to a single commit :p
Yes it doesn't matter how the commits are organized (I prefer a linear chain, just piling up so that we can match up the review history).
done
  GPUProcessState::singleton()->AddGPUAllocVisitor(numa_idx,
                                                   cuda_alloc_visitor);
}
LOG(INFO) << "Instrumenting GPU allocator(s) for all numas ";
Remove the trailing space. Consider using VLOG.
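E.g.:

VLOG(1) << "Instrumenting GPU allocator(s) for all numas";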
Changed the allocator registration so the CudaHost allocator is registered for NUMA nodes 0 and 1
OK, but the referenced patch was my initial starting point, and I have done some rebasing since without issues.
Force-pushed from 223af86 to 5c750c2
@yifeif I'm not sure why it's not creating the CL after it's pulled and approved. Any idea?
It might be due to https://blog.github.com/2018-10-21-october21-incident-report/
Not sure. The checks from the force push aren't done yet, but I don't think that's the problem, as I didn't see any collisions when rebasing. @byronyi, I feel like I have been delaying the fix with code you already have for a while. Do you want me to close the PR so you can push what you have been testing? I can try a new PR, but other than that I am out of ideas.
OK, the process got unstuck. Now testing internally and waiting for internal review.
@yanivbl6 It is alright. GitHub went down yesterday so let's just be patient to get this merged :)
PiperOrigin-RevId: 218263330
This patch includes two fixes, both in the gdr contrib.
The first changes the ordering of the calls to gdr memory manager initialization and grpc initialization.
This is necessary because gdr initialization may initiate the GPU device and call GetCPUAllocator (or GetGPUAllocator), both of which are asserted to never occur before the registration of suballocators to ProcessState's singleton (which happens during gdr memory manager init).
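Roughly, the reordering looks like this (a sketch; the exact init sequence in gdr_server_lib.cc differs in detail):

Status GdrServer::Init() {
  // Register the GDR suballocator visitors before any allocator can be
  // requested during the grpc server bring-up.
  TF_RETURN_IF_ERROR(remote_memory_manager_->Init());
  return GrpcServer::Init(/*...*/);
}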
The second fix changes the cuda host suballocators to be registered on both NUMA nodes, 0 and 1.
This fixed an error on my servers, and I think it should be fine on different setups too, but please comment if you think I missed something.
@byronyi @poxvoculi