GDR contrib- Suballocators fixes #22989
Conversation
Thanks for the fix! I will check on our environment soon.
Great! Note that I experienced a hang during initialization with the master branch, but this also happened in vanilla grpc, so it's probably unrelated. I haven't isolated the grpc problem yet, but grpc and the patched gdr worked with the 5-day-old branch (96a6333).
@byronyi let me know when you have confirmed that it works.
GPUProcessState::singleton()->AddCUDAHostAllocVisitor(bus_id,
                                                      alloc_visitor);
GPUProcessState::singleton()->AddCUDAHostAllocVisitor(0, alloc_visitor);
GPUProcessState::singleton()->AddCUDAHostAllocVisitor(1, alloc_visitor);
This seems fine for now. NUMA-specific allocation has not yet been turned on, pending bringing TF's Eigen version up to date. It might be better to add a visitor for every available NUMA node by using port::NUMANumNodes.
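E.g., a minimal sketch (assuming the alloc_visitor defined earlier in this function):

// Register the CUDA host alloc visitor on every NUMA node instead of
// hard-coding specific nodes.
for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
  GPUProcessState::singleton()->AddCUDAHostAllocVisitor(numa_idx,
                                                        alloc_visitor);
}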
Thanks for the tip. Added it to the latest commit.
Thanks for the fix!
GPUProcessState::singleton()->AddCUDAHostAllocVisitor(bus_id,
                                                      alloc_visitor);

for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
This should be moved out of the above if (IsGDRAvailable()), as the host memory should be registered regardless. Otherwise, it fails for GDR-incapable GPUs, such as those GTX cards.
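Something like this (a sketch; the surrounding code may differ):

// Host memory registration happens unconditionally...
for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
  GPUProcessState::singleton()->AddCUDAHostAllocVisitor(numa_idx,
                                                        alloc_visitor);
}
// ...while GPU memory registration stays behind the capability check.
if (IsGDRAvailable()) {
  for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
    GPUProcessState::singleton()->AddGPUAllocVisitor(numa_idx,
                                                     cuda_alloc_visitor);
  }
}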
Probably should change the name of IsGDRAvailable to P2PKernelModuleEnabled to avoid confusion. Any suggestions?
GDR is the correct name here. Transporting between GPUs' host-pinned memory is technically just "GD".
I would like to eventually add the possibility to selectively avoid GPU-direct on slow PCIe routes (like QPI), but I prefer to just focus on hotfixes for now, until the upcoming changes regarding the transport contributions.
                                                        alloc_visitor);
  LOG(INFO) << "Instrumenting Cuda host allocator on numa " << numa_idx;
}

GPUProcessState::singleton()->AddCUDAHostFreeVisitor(bus_id, free_visitor);
Should rename bus_id to numa_idx, and move it inside the loop.
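I.e. (sketch, assuming the free_visitor defined earlier):

for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
  GPUProcessState::singleton()->AddCUDAHostFreeVisitor(numa_idx,
                                                       free_visitor);
}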
@poxvoculi I saw different signatures for AddGPUAllocVisitor(int bus_id, const SubAllocator::Visitor& visitor) and AddCUDAHostAllocVisitor(int numa_node, const SubAllocator::Visitor& visitor). It seems rather confusing. Maybe change bus_id to numa_node for AddGPUAllocVisitor as well?
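For reference, the two declarations side by side:

void AddGPUAllocVisitor(int bus_id, const SubAllocator::Visitor& visitor);
void AddCUDAHostAllocVisitor(int numa_node,
                             const SubAllocator::Visitor& visitor);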
With the requested changes, I have tested the patch with CPU only, GPU without GDR, GPU with GDR on a different NUMA node, and GPU with GDR on the same NUMA node. Thanks again!
Thanks for the comments. Will go over them and re-commit tomorrow.
for (int numa_idx = 0; numa_idx < port::NUMANumNodes(); ++numa_idx) {
  GPUProcessState::singleton()->AddGPUAllocVisitor(numa_idx,
                                                   cuda_alloc_visitor);
  GPUProcessState::singleton()->AddCUDAHostFreeVisitor(numa_idx,
This line should also be moved above (i.e. outside IsGDRAvailable()).
@drpngx LGTM
@@ -640,8 +642,8 @@ void GdrMemoryManager::TensorFromTransportOptions(
   } else {
     checksum = GPUUtil::Checksum(*tensor);
   }
-  CHECK(checksum == remote_mr.checksum())
-      << "Checksum mismatch: " << checksum << "!=" << remote_mr.checksum();
+  CHECK(checksum == remote_mr.checksum()) << "Checksum mismatch: " << checksum
Oh this is going to create a problem internally. Could you change that into a return error::InternalError?
FYI: this only hits when VLOG >= 2 is turned on, and it is being used to debug memory corruption problems from hardware direct access. It works like an assertion, and it's easier to debug if we could keep the original stack instead of returning.
Thoughts?
It's a policy that we have internally. You can LOG(ERROR) something and then return.
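A sketch of that pattern (assuming the surrounding function returns a Status; errors::Internal is from tensorflow/core/lib/core/errors.h):

// Log the mismatch for debugging, then return an error instead of crashing.
if (checksum != remote_mr.checksum()) {
  LOG(ERROR) << "Checksum mismatch: " << checksum
             << " != " << remote_mr.checksum();
  return errors::Internal("Checksum mismatch: ", checksum, " != ",
                          remote_mr.checksum());
}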
Good to know. @yanivbl6 mind unpatching this line, and I'll submit a separate PR to clean things up?
Sure, I will do it tomorrow morning.
@drpngx Will just re-committing the line back to the non-formatted version be sufficient to comply with the policy or do I need to force-push over the commit with that change?
I guess the latter as they always squash the PR to a single commit :p
Yes it doesn't matter how the commits are organized (I prefer a linear chain, just piling up so that we can match up the review history).
done
  GPUProcessState::singleton()->AddGPUAllocVisitor(numa_idx,
                                                   cuda_alloc_visitor);
}
LOG(INFO) << "Instrumenting GPU allocator(s) for all numas ";
Remove the trailing space. Consider using VLOG.
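E.g.:

VLOG(1) << "Instrumenting GPU allocator(s) for all numas";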
Changed the allocator registration so the CudaHost allocator is registered for NUMA nodes 0 and 1
OK, but the referenced patch was my initial starting point, and I have done some rebasing since without issues.
Force-pushed from 223af86 to 5c750c2
@yifeif I'm not sure why it's not creating the CL after it's pulled and approved. Any idea?
It might be due to https://blog.github.com/2018-10-21-october21-incident-report/
Not sure. The checks from the force push aren't done yet, but I don't think that's the problem, as I didn't see any collisions when rebasing. @byronyi, I feel like I have been delaying the fix with code you already have for a while. Do you want me to close the PR so you can push what you have been testing? I can try a new PR, but other than that I am out of ideas.
OK, the process got unstuck. Now testing internally and waiting for internal review.
@yanivbl6 It is alright. GitHub went down yesterday so let's just be patient to get this merged :)
PiperOrigin-RevId: 218263330
This patch includes two fixes, both in the gdr contrib.
The first changes the ordering of the calls to gdr memory manager initialization and grpc initialization.
This is necessary because gdr initialization may initiate the GPU device and call GetCPUAllocator (or GetGPUAllocator), both of which are asserted to never occur before the registration of suballocators to ProcessState's singleton (which happens during gdr memory manager init).
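Roughly, the reordering looks like this (a sketch; the exact init sequence in gdr_server_lib.cc differs in detail):

Status GdrServer::Init() {
  // Register the GDR suballocator visitors before any allocator can be
  // requested during the grpc server bring-up.
  TF_RETURN_IF_ERROR(remote_memory_manager_->Init());
  return GrpcServer::Init(/*...*/);
}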
The second fix changes the cuda host suballocators to be registered on both NUMA nodes, 0 and 1.
This fixed an error on my servers, and I think it should be fine on different setups too, but please comment if you think I missed something.
@byronyi @poxvoculi