
GDR contrib: Suballocator fixes #22989

Merged
merged 8 commits into tensorflow:master from gdr_allocators_fix on Oct 23, 2018

Conversation

yanivbl6

This patch includes two fixes, both in the GDR contrib.

The first changes the ordering of the calls to GDR memory manager initialization and gRPC initialization.

This is necessary because GDR initialization may initiate the GPU device and call GetCPUAllocator (or GetGPUAllocator), which are asserted to never occur before the suballocators are registered with ProcessState's singleton (which happens during GDR memory manager init).

The second fix registers the CUDA host suballocators on both NUMA nodes, 0 and 1.
This fixed an error on my servers, and I think it should be fine on other setups too, but please comment if you think I missed something.
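As a rough illustration of the first fix, the sketch below shows the intended reordering. It is hypothetical and simplified: the GdrServer/GrpcServer method names, the remote_memory_manager_ member, and the exact signatures are assumptions based on the description above, not the actual patch.

    // Hypothetical sketch of the reordering described in the first fix.
    Status GdrServer::Init() {
      // Register the GDR suballocator visitors with ProcessState first, so
      // that any GetCPUAllocator()/GetGPUAllocator() call triggered by GPU
      // device setup happens only after registration.
      TF_RETURN_IF_ERROR(remote_memory_manager_->Init());
      // gRPC (and with it, device) initialization now runs afterwards;
      // previously it ran before the memory manager was initialized.
      return GrpcServer::Init();
    }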

@byronyi @poxvoculi

@byronyi
Contributor

byronyi commented Oct 15, 2018

Thanks for the fix! I will check on our environment soon.

@yanivbl6
Author

yanivbl6 commented Oct 15, 2018

Great! Note that I experienced a hang during initialization with the master branch, but this also happened with vanilla gRPC, so it's probably unrelated. I haven't isolated the gRPC problem yet, but gRPC and the patched GDR worked with the five-day-old branch (96a6333).

Edit:
Tested again today; gRPC is working well again on master.

@drpngx drpngx self-assigned this Oct 15, 2018
@drpngx
Contributor

drpngx commented Oct 15, 2018

@byronyi let me know when you have confirmed that it works.

@drpngx drpngx requested a review from poxvoculi October 15, 2018 17:30
@drpngx drpngx added the 'stat:awaiting response' and 'awaiting review' labels Oct 15, 2018
poxvoculi
poxvoculi previously approved these changes Oct 15, 2018
-  GPUProcessState::singleton()->AddCUDAHostAllocVisitor(bus_id,
-                                                         alloc_visitor);
+  GPUProcessState::singleton()->AddCUDAHostAllocVisitor(0, alloc_visitor);
+  GPUProcessState::singleton()->AddCUDAHostAllocVisitor(1, alloc_visitor);
Contributor

@poxvoculi poxvoculi Oct 15, 2018

This seems fine for now. NUMA-specific allocation has not yet been turned on, pending bringing TF's Eigen version up to date. It might be better to add a visitor for every available NUMA node by using port::NUMANumNodes.
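A minimal sketch of that suggestion, reusing the AddCUDAHostAllocVisitor call from the diff above (surrounding context simplified):

    // Register the host-memory visitor on every NUMA node reported by the
    // platform layer, instead of hard-coding nodes 0 and 1.
    for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
      GPUProcessState::singleton()->AddCUDAHostAllocVisitor(numa_idx,
                                                            alloc_visitor);
    }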

Author

Thanks for the tip. Added it to the latest commit.

@drpngx drpngx added the 'awaiting testing (then merge)' and 'kokoro:force-run' labels and removed the 'awaiting review' label Oct 15, 2018
@kokoro-team kokoro-team removed the 'kokoro:force-run' label Oct 15, 2018
@tensorflowbutler tensorflowbutler removed the 'stat:awaiting response' label Oct 17, 2018
Contributor

@byronyi byronyi left a comment

Thanks for the fix!

-  GPUProcessState::singleton()->AddCUDAHostAllocVisitor(bus_id,
-                                                         alloc_visitor);
+  for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
Contributor

This should be moved out of the above if (IsGDRAvailable()), as the host memory should be registered regardless. Otherwise, it fails for GDR-incapable GPUs, such as GTX cards.
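A hedged sketch of the structure being requested; the calls are the ones visible in this diff, but the exact surrounding code may differ:

    // Host (pinned) memory visitors: registered on every NUMA node,
    // regardless of whether the GDR kernel module is available.
    for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
      GPUProcessState::singleton()->AddCUDAHostAllocVisitor(numa_idx,
                                                            alloc_visitor);
      GPUProcessState::singleton()->AddCUDAHostFreeVisitor(numa_idx,
                                                           free_visitor);
    }
    if (IsGDRAvailable()) {
      // GPU (device) memory visitor: only registered when peer-direct
      // access to device memory is supported.
      for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
        GPUProcessState::singleton()->AddGPUAllocVisitor(numa_idx,
                                                         cuda_alloc_visitor);
      }
    }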

Contributor

Probably should change the name of IsGDRAvailable to P2PKernelModuleEnabled to avoid confusion.

Any suggestions?

Author

GDR is the correct name here; transporting to and from the GPU's host-pinned memory is technically just "GD".

I would like to eventually add the option to selectively avoid GPU Direct on slow PCIe routes (like QPI), but I prefer to focus on hotfixes for now, until the upcoming changes regarding the transport contributions.

                                                            alloc_visitor);
      LOG(INFO) << "Instrumenting Cuda host allocator on numa " << numa_idx;
    }

    GPUProcessState::singleton()->AddCUDAHostFreeVisitor(bus_id, free_visitor);
Contributor

Should rename bus_id to numa_idx, and move it inside the loop.

Contributor

@poxvoculi I saw different signatures for AddGPUAllocVisitor(int bus_id, const SubAllocator::Visitor& visitor) and AddCUDAHostAllocVisitor(int numa_node, const SubAllocator::Visitor& visitor). It seems rather confusing.

Maybe change bus_id to numa_node for AddGPUAllocVisitor as well?

Contributor

@byronyi byronyi left a comment

Please also remove this line and change bus_id here to numa_node_.

@byronyi
Contributor

byronyi commented Oct 18, 2018

With the requested changes, I have tested the patch with CPU only, GPU without GDR, GPU with GDR on a different NUMA node, and GPU with GDR on the same NUMA node.

Thanks again!

@yanivbl6
Author

Thanks for the comments. I will go over them and re-commit tomorrow.

    for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
      GPUProcessState::singleton()->AddGPUAllocVisitor(numa_idx,
                                                       cuda_alloc_visitor);
      GPUProcessState::singleton()->AddCUDAHostFreeVisitor(numa_idx,
Contributor

This line should also be moved above (i.e., outside IsGDRAvailable()).

@byronyi
Contributor

byronyi commented Oct 19, 2018

@drpngx LGTM

@@ -640,8 +642,8 @@ void GdrMemoryManager::TensorFromTransportOptions(
   } else {
     checksum = GPUUtil::Checksum(*tensor);
   }
-  CHECK(checksum == remote_mr.checksum())
-      << "Checksum mismatch: " << checksum << "!=" << remote_mr.checksum();
+  CHECK(checksum == remote_mr.checksum()) << "Checksum mismatch: " << checksum
+                                          << "!=" << remote_mr.checksum();
Contributor

Oh this is going to create a problem internally. Could you change that into a return error::InternalError?

Contributor

FYI: this only hits when VLOG >= 2 is turned on, and it is being used to debug memory-corruption problems from hardware direct access. It works like an assertion, and it's easier to debug if we keep the original stack instead of returning.

Thoughts?

Contributor

It's a policy that we have internally. You can LOG(ERROR) something and then return.
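For illustration only, a hedged sketch of that pattern applied to the CHECK in the diff above. It assumes the enclosing function reports failures through a status callback named done (an assumption; TensorFromTransportOptions is declared void in the hunk header) and uses errors::Internal:

    // Log the mismatch and report an error instead of crashing the process.
    if (checksum != remote_mr.checksum()) {
      LOG(ERROR) << "Checksum mismatch: " << checksum
                 << " != " << remote_mr.checksum();
      done(errors::Internal("Checksum mismatch"));
      return;
    }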

Contributor

Good to know. @yanivbl6, would you mind unpatching this line? I'll submit a separate PR to clean things up.

Author

Sure, I will do it tomorrow morning.

Author

@drpngx Will just re-committing the line back to the non-formatted version be sufficient to comply with the policy, or do I need to force-push over the commit with that change?

Contributor

I guess the latter as they always squash the PR to a single commit :p

Contributor

Yes, it doesn't matter how the commits are organized (I prefer a linear chain, just piling up, so that we can match up the review history).

Author

Done.

      GPUProcessState::singleton()->AddGPUAllocVisitor(numa_idx,
                                                       cuda_alloc_visitor);
    }
    LOG(INFO) << "Instrumenting GPU allocator(s) for all numas ";
Contributor

Remove the trailing space. Consider using VLOG.
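A minimal sketch of the suggested change to the LOG(INFO) line above; the verbosity level is an assumption:

    VLOG(1) << "Instrumenting GPU allocator(s) for all NUMA nodes";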

@drpngx drpngx added the 'ready to pull' label Oct 21, 2018
@drpngx drpngx added and removed the 'ready to pull' label Oct 22, 2018
@drpngx
Contributor

drpngx commented Oct 22, 2018

@yanivbl6 there's some import issue with another patch (#22559). Could you pull, rebase, and push again?

@yanivbl6
Author

yanivbl6 commented Oct 22, 2018

OK, but the referenced patch was my initial starting point, and I have done some rebasing since without issues.

Edit:
Fetched and rebased on master, then force-pushed.

@drpngx drpngx added and removed the 'ready to pull' label Oct 22, 2018
@drpngx
Contributor

drpngx commented Oct 22, 2018

@yifeif I'm not sure why it's not creating the CL after it's pulled and approved. Any idea?

@yifeif
Contributor

yifeif commented Oct 22, 2018

@yifeif yifeif added and removed the 'ready to pull' label Oct 22, 2018
@yanivbl6
Author

Not sure. The checks from the force push aren't done yet, but I don't think that's the problem, as I didn't see any conflicts when rebasing.

@byronyi, I feel like I am delaying the fix while you have already had working code for a while. Do you want me to close the PR so you can push what you have been testing?

I can try a new PR, but other than that I am out of ideas.

@yifeif yifeif added the 'kokoro:force-run' label Oct 22, 2018
@kokoro-team kokoro-team removed the 'kokoro:force-run' label Oct 22, 2018
@drpngx
Contributor

drpngx commented Oct 22, 2018

OK, the process got unstuck. Now testing internally and waiting for internal review.

@byronyi
Contributor

byronyi commented Oct 22, 2018

@yanivbl6 It's alright. GitHub went down yesterday, so let's just be patient to get this merged :)

@tensorflow-copybara tensorflow-copybara merged commit 5c750c2 into tensorflow:master Oct 23, 2018
tensorflow-copybara pushed a commit that referenced this pull request Oct 23, 2018
@yanivbl6 yanivbl6 deleted the gdr_allocators_fix branch October 23, 2018 06:13
Labels: cla: yes, ready to pull

9 participants