
Conversation

@janewangfb
Contributor

Summary: as titled

Differential Revision: D13266063

Contributor

@pietern left a comment


Some nits and one structural thing.

Right now there is an implied assumption that all outputs in the nested output vector are placed on the same device. We should test that this is the case and throw if it isn't. If some part of this functionality is reused and the assumption doesn't hold, it can lead to synchronization issues.
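A minimal sketch of such a check, assuming the nested outputs are exposed as a std::vector<std::vector<at::Tensor>>; the function name and error message are illustrative, not the actual ProcessGroup code:

#include <ATen/ATen.h>

#include <stdexcept>
#include <vector>

// Throw if any tensor in the nested output vector lives on a different
// device than the first one, guarding the implied same-device assumption.
void checkOutputsOnSameDevice(const std::vector<std::vector<at::Tensor>>& outputs) {
  if (outputs.empty() || outputs.front().empty()) {
    return;
  }
  const auto device = outputs.front().front().device();
  for (const auto& inner : outputs) {
    for (const auto& tensor : inner) {
      if (tensor.device() != device) {
        throw std::invalid_argument(
            "Expected all output tensors to be placed on the same device");
      }
    }
  }
}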

There is also the possibility of using flattenDenseTensors directly instead of first copying into temporary tensors on the CPU side and then flattening them. That would save another copy. Since this is a backfill op it is not critical, but it would improve performance. Can you file an issue to track this? It would make for a good starter task for somebody new to the code base.
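To make the saved copy concrete, here is a rough ATen-level sketch of the two patterns. It assumes, as described above, that the current path stages each tensor in a CPU temporary before flattening; the function names are illustrative and this is not the op's actual code:

#include <ATen/ATen.h>

#include <vector>

// Current pattern (simplified): copy every tensor into a CPU temporary,
// then flatten the temporaries into a single contiguous buffer.
at::Tensor flattenViaCpuTemporaries(const std::vector<at::Tensor>& tensors) {
  std::vector<at::Tensor> staged;
  staged.reserve(tensors.size());
  for (const auto& t : tensors) {
    staged.push_back(t.cpu().reshape({-1}));  // one staging copy per tensor
  }
  return at::cat(staged);                     // plus the flattening copy
}

// Suggested pattern: flatten where the tensors live (a cat of 1-D views,
// roughly what flattenDenseTensors amounts to), then move the single flat
// buffer once, saving the per-tensor staging copies.
at::Tensor flattenDirect(const std::vector<at::Tensor>& tensors) {
  std::vector<at::Tensor> views;
  views.reserve(tensors.size());
  for (const auto& t : tensors) {
    views.push_back(t.reshape({-1}));
  }
  return at::cat(views).cpu();
}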

@janewangfb
Contributor Author

Pieter, regarding the possibility of using flattenDenseTensors: I think this also requires the tensors in the nested output vector to be on the same device?

Contributor

@pietern left a comment


Looking good. There is a problem in CI though, something about invalid events.

@pietern
Contributor

pietern commented Dec 4, 2018

Re: flattenDenseTensors, I think it would work with multiple devices, looking at the implementation:

void THCTensor_(catArray)(THCState *state, THCTensor *result,
                          THCTensor **inputs, int numInputs, int dimension)
{
  // previously, size [0] tensors were the only possible empty tensors; thus, it wasn't possible
  // to cat empty tensors unless all the other tensors were 1-dimensional, so we allowed these tensors
  // to be "skipped". We maintain this behavior for backwards compatibility, but only for this specific
  // size (i.e. other empty sizes are not skipped).
  // FIXME: warn if this is the case
  int i, j, cohortMax;
  int64_t offset;
  bool hasSkippedInput = false;
  THCTensor *notSkippedTensor = NULL;  // non-owning reference
  auto should_skip = [](THCTensor *t) { return t->is_empty() && t->dim() == 1; };
  int nDims = 0;

  for (i = 0; i < numInputs; i++)
  {
    if (should_skip(inputs[i])) {
      hasSkippedInput = true;
      continue;
    }
    nDims = inputs[i]->dim();
    notSkippedTensor = inputs[i];
  }

  // If all inputs are empty tensors, return an empty tensor
  if (notSkippedTensor == NULL) {
    return;
  }

  THArgCheck(numInputs > 0, 3, "invalid number of inputs %d", numInputs);
  THArgCheck(dimension >= 0, 4, "invalid dimension %d", dimension);

  std::vector<int64_t> size(nDims);

  // Compute size of the result in the cat dimension
  int64_t cat_dim_size = 0;
  for (int i = 0; i < numInputs; i++) {
    THCTensor *tensor = inputs[i];
    if (should_skip(tensor)) {
      continue;
    }
    THCTensor_(check_shape_except_dim)(state, notSkippedTensor, tensor, dimension);
    cat_dim_size += THCTensor_(size)(state, tensor, dimension);
  }

  // Compute the size of the result
  for (int dim = 0; dim < nDims; dim++) {
    int64_t result_dim_size = THCTensor_(size)(state, notSkippedTensor, dim);
    if (dim == dimension) {
      result_dim_size = cat_dim_size;
    }
    size[dim] = result_dim_size;
  }
  THCTensor_(resize)(state, result, size, {});

  // We parallelize the copy if all 7 conditions pass:
  //
  // 1. There is more than one input tensor
  // 2. No empty inputs
  // 3. The result tensor is 32-bit indexable
  // 4. The number of dimensions is <= 4
  // 5. All input tensors are contiguous (output tensor may be non-contig)
  // 6. All input tensors can use 32-bit indexing
  // 7. All input tensors are on the same device
  if (numInputs > 1 &&
      !hasSkippedInput &&
      result->dim() <= CAT_ARRAY_MAX_INPUT_DIMS &&
      THCTensor_canUse32BitIndexMath(state, result) &&
      THCTensor_allContiguous(state, inputs, numInputs) &&
      THCTensor_all32BitIndexable(state, inputs, numInputs) &&
      THCTensor_allSameDevice(state, inputs, numInputs)) {

    // First, let's set up our kernel parameters. We start with a raw pointer to the storage
    // for the output Tensor.
    scalar_t *data = THCTensor_(data)(state, result);

    // Kernel Parameter
    size_t tensorMetadataSize = sizeof(CatArrInputTensor<scalar_t, unsigned int>) * CAT_ARRAY_BATCH_SIZE;
    auto d_inputs = static_cast<CatArrInputTensor<scalar_t, unsigned int> *>(THCudaMalloc(state, tensorMetadataSize));

    OutputTensorSizeStride<unsigned int, CAT_ARRAY_MAX_INPUT_DIMS> param;

    // Next, let's initialize the size, stride arrays for the output Tensor.
    for (i = 0; i < nDims; ++i) {
      param.outputSize[i] = THCTensor_(size)(state, result, i);
      param.outputStride[i] = THCTensor_(stride)(state, result, i);
    }

    at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream();

    // Template Declarations for dim = 1, 2, 3, 4
    #define HANDLE_CASE(DIMS) \
      CatArrayBatchedCopy<scalar_t, unsigned int, DIMS><<<catGrid, applyBlock, 0, stream.stream()>>>(data, d_inputs, param, dimension, param.outputStride[dimension]);

    // Now we loop
    offset = 0;
    for (i = 0; i < numInputs; i += CAT_ARRAY_BATCH_SIZE) {
      // Re-allocate stackInputs every iteration to avoid read-after-write hazard
      {
        auto stackInputs_owner = THCudaHostAlloc(state, tensorMetadataSize);
        CatArrInputTensor<scalar_t, unsigned int>* stackInputs = static_cast<CatArrInputTensor<scalar_t, unsigned int>*>(stackInputs_owner.get());
        cohortMax = 0;
        for (j = 0; j < CAT_ARRAY_BATCH_SIZE && (i+j) < numInputs; ++j) {
          int64_t dimSize = THCTensor_(size)(state, inputs[i+j], dimension);

          stackInputs[j].input = THCTensor_(data)(state, inputs[i+j]);
          stackInputs[j].offset = offset;
          stackInputs[j].dimSize = dimSize;
          stackInputs[j].nElements = THCTensor_(nElement)(state, inputs[i+j]);
          cohortMax = cohortMax > (int) stackInputs[j].nElements ? cohortMax : (int) stackInputs[j].nElements;

          // update offset
          offset += dimSize;
        }
        THCudaCheck(cudaMemcpyAsync(
            d_inputs,
            stackInputs,
            j * sizeof(CatArrInputTensor<scalar_t, unsigned int>),
            cudaMemcpyHostToDevice,
            stream.stream()));
        THCudaHostRecord(state, stackInputs);
      }

      // Next, let's consider how we set our kernel launch parameters.
      // We borrow from THCApply, which the kernel's internal indexing
      // is based on.
      dim3 applyBlock = getApplyBlock();

      // Get a grid where the x dim fills half the GPU and the y dim is the number of tensors.
      // This will have catting two tensors fill the entire grid, but prevents
      // many threads from needlessly loading metadata if their sizes are small.
      dim3 catGrid;
      getCatGrid(state, j, catGrid);

      switch (nDims) {
        case 1:
          HANDLE_CASE(1);
          break;
        case 2:
          HANDLE_CASE(2);
          break;
        case 3:
          HANDLE_CASE(3);
          break;
        case 4:
          HANDLE_CASE(4);
          break;
      }
      THCudaCheck(cudaGetLastError());
    }
    THCudaFree(state, d_inputs);
    #undef HANDLE_CASE
  } else {
    offset = 0;
    for (j = 0; j < numInputs; j++)
    {
      if (should_skip(inputs[j])) continue;

      int64_t dimSize = THCTensor_(size)(state, inputs[j], dimension);
      THCTensor *nt = THCTensor_(newWithTensor)(state, result);
      THCTensor_(narrow)(state, nt, NULL, dimension, offset, dimSize);
      THCTensor_(copy)(state, nt, inputs[j]);
      THCTensor_(free)(state, nt);
      offset += dimSize;
    }
  }
}
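The parts relevant to the multi-device question are the THCTensor_allSameDevice check that gates the batched-kernel path and the else branch at the end: when the inputs are not all on one device, the code falls back to narrowing the result and calling THCTensor_(copy) per input, and that copy handles cross-device transfers. An illustrative ATen-level sketch of that fallback (not code from this PR):

#include <ATen/ATen.h>

#include <vector>

// Narrow the result along the cat dimension and copy each input into its
// slot. copy_() supports copies across devices, which is why this path does
// not require all inputs to be on the same device.
void catFallback(at::Tensor& result,
                 const std::vector<at::Tensor>& inputs,
                 int64_t dimension) {
  int64_t offset = 0;
  for (const auto& input : inputs) {
    const int64_t dimSize = input.size(dimension);
    result.narrow(dimension, offset, dimSize).copy_(input);
    offset += dimSize;
  }
}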

@pietern
Contributor

pietern commented Dec 4, 2018

Regarding DeviceIndex, also see #14729.

@janewangfb
Contributor Author

@pietern it seems the failures are random. When I run an individual test, it always passes, but when I run all the tests together it sometimes fails. Investigating...

@janewangfb
Contributor Author

Created #14812 for newcomers.

Summary: as titled

Reviewed By: pietern

Differential Revision: D13266063

fbshipit-source-id: 413140d80df24f4d6db26d1fcb5051fc41b2ab9a