
Conversation

@janewangfb
Contributor

Summary: as titled

Differential Revision: D13266063

Contributor

@pietern left a comment


Some nits and one structural thing.

Right now there is an implied assumption that all outputs in the nested output vector are placed on the same device. We should test that this is the case and throw if it isn't. If some part of this functionality is reused and the assumption doesn't hold, it can lead to synchronization issues.
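A minimal sketch of such a check, assuming the nested outputs are exposed as a std::vector<std::vector<at::Tensor>>; the function name and error message are illustrative, not the actual ProcessGroup code:

#include <ATen/ATen.h>

#include <stdexcept>
#include <vector>

// Throw if any tensor in the nested output vector lives on a different
// device than the first one, guarding the implied same-device assumption.
void checkOutputsOnSameDevice(const std::vector<std::vector<at::Tensor>>& outputs) {
  if (outputs.empty() || outputs.front().empty()) {
    return;
  }
  const auto device = outputs.front().front().device();
  for (const auto& inner : outputs) {
    for (const auto& tensor : inner) {
      if (tensor.device() != device) {
        throw std::invalid_argument(
            "Expected all output tensors to be placed on the same device");
      }
    }
  }
}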

There is also the possibility of using flattenDenseTensors directly instead of first copying into temporary tensors on the CPU side and then flattening them. That would save another copy. Since this is a backfill op it is not critical, but it would improve performance. Can you file an issue to track this? It would make for a good starter task for somebody new to the code base.
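To make the saved copy concrete, here is a rough ATen-level sketch of the two patterns. It assumes, as described above, that the current path stages each tensor in a CPU temporary before flattening; the function names are illustrative and this is not the op's actual code:

#include <ATen/ATen.h>

#include <vector>

// Current pattern (simplified): copy every tensor into a CPU temporary,
// then flatten the temporaries into a single contiguous buffer.
at::Tensor flattenViaCpuTemporaries(const std::vector<at::Tensor>& tensors) {
  std::vector<at::Tensor> staged;
  staged.reserve(tensors.size());
  for (const auto& t : tensors) {
    staged.push_back(t.cpu().reshape({-1}));  // one staging copy per tensor
  }
  return at::cat(staged);                     // plus the flattening copy
}

// Suggested pattern: flatten where the tensors live (a cat of 1-D views,
// roughly what flattenDenseTensors amounts to), then move the single flat
// buffer once, saving the per-tensor staging copies.
at::Tensor flattenDirect(const std::vector<at::Tensor>& tensors) {
  std::vector<at::Tensor> views;
  views.reserve(tensors.size());
  for (const auto& t : tensors) {
    views.push_back(t.reshape({-1}));
  }
  return at::cat(views).cpu();
}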

@janewangfb
Contributor Author

Pieter, regarding the possibility of using flattenDenseTensors: I think this also requires the tensors in the nested output vector to be on the same device?

Contributor

@pietern left a comment


Looking good. There is a problem in CI though, something about invalid events.

@pietern
Contributor

pietern commented Dec 4, 2018

Re: flattenDenseTensors, I think it would work with multiple devices, looking at the implementation:

void THCTensor_(catArray)(THCState *state, THCTensor *result,
                          THCTensor **inputs, int numInputs, int dimension)
{
  // previously, size [0] tensors were the only possible empty tensors; thus, it wasn't possible
  // to cat empty tensors unless all the other tensors were 1-dimensional, so we allowed these tensors
  // to be "skipped". We maintain this behavior for backwards compatibility, but only for this specific
  // size (i.e. other empty sizes are not skipped).
  // FIXME: warn if this is the case
  int i, j, cohortMax;
  int64_t offset;
  bool hasSkippedInput = false;
  THCTensor *notSkippedTensor = NULL;  // non-owning reference
  auto should_skip = [](THCTensor *t) { return t->is_empty() && t->dim() == 1; };
  int nDims = 0;

  for (i = 0; i < numInputs; i++)
  {
    if (should_skip(inputs[i])) {
      hasSkippedInput = true;
      continue;
    }
    nDims = inputs[i]->dim();
    notSkippedTensor = inputs[i];
  }

  // If all inputs are empty tensors, return an empty tensor
  if (notSkippedTensor == NULL) {
    return;
  }

  THArgCheck(numInputs > 0, 3, "invalid number of inputs %d", numInputs);
  THArgCheck(dimension >= 0, 4, "invalid dimension %d", dimension);

  std::vector<int64_t> size(nDims);

  // Compute size of the result in the cat dimension
  int64_t cat_dim_size = 0;
  for (int i = 0; i < numInputs; i++) {
    THCTensor *tensor = inputs[i];
    if (should_skip(tensor)) {
      continue;
    }
    THCTensor_(check_shape_except_dim)(state, notSkippedTensor, tensor, dimension);
    cat_dim_size += THCTensor_(size)(state, tensor, dimension);
  }

  // Compute the size of the result
  for (int dim = 0; dim < nDims; dim++) {
    int64_t result_dim_size = THCTensor_(size)(state, notSkippedTensor, dim);
    if (dim == dimension) {
      result_dim_size = cat_dim_size;
    }
    size[dim] = result_dim_size;
  }
  THCTensor_(resize)(state, result, size, {});

  // We parallelize the copy if all 7 conditions pass:
  //
  // 1. There is more than one input tensor
  // 2. No empty inputs
  // 3. The result tensor is 32-bit indexable
  // 4. The number of dimensions is <= 4
  // 5. All input tensors are contiguous (output tensor may be non-contig)
  // 6. All input tensors can use 32-bit indexing
  // 7. All input tensors are on the same device
  if (numInputs > 1 &&
      !hasSkippedInput &&
      result->dim() <= CAT_ARRAY_MAX_INPUT_DIMS &&
      THCTensor_canUse32BitIndexMath(state, result) &&
      THCTensor_allContiguous(state, inputs, numInputs) &&
      THCTensor_all32BitIndexable(state, inputs, numInputs) &&
      THCTensor_allSameDevice(state, inputs, numInputs)) {

    // First, let's set up our kernel parameters. We start with a raw pointer to the storage
    // for the output Tensor.
    scalar_t *data = THCTensor_(data)(state, result);

    // Kernel Parameter
    size_t tensorMetadataSize = sizeof(CatArrInputTensor<scalar_t, unsigned int>) * CAT_ARRAY_BATCH_SIZE;
    auto d_inputs = static_cast<CatArrInputTensor<scalar_t, unsigned int> *>(THCudaMalloc(state, tensorMetadataSize));

    OutputTensorSizeStride<unsigned int, CAT_ARRAY_MAX_INPUT_DIMS> param;

    // Next, let's initialize the size, stride arrays for the output Tensor.
    for (i = 0; i < nDims; ++i) {
      param.outputSize[i] = THCTensor_(size)(state, result, i);
      param.outputStride[i] = THCTensor_(stride)(state, result, i);
    }

    at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream();

    // Template Declarations for dim = 1, 2, 3, 4
    #define HANDLE_CASE(DIMS) \
      CatArrayBatchedCopy<scalar_t, unsigned int, DIMS><<<catGrid, applyBlock, 0, stream.stream()>>>(data, d_inputs, param, dimension, param.outputStride[dimension]);

    // Now we loop
    offset = 0;
    for (i = 0; i < numInputs; i += CAT_ARRAY_BATCH_SIZE) {
      // Re-allocate stackInputs every iteration to avoid read-after-write hazard
      {
        auto stackInputs_owner = THCudaHostAlloc(state, tensorMetadataSize);
        CatArrInputTensor<scalar_t, unsigned int>* stackInputs = static_cast<CatArrInputTensor<scalar_t, unsigned int>*>(stackInputs_owner.get());
        cohortMax = 0;
        for (j = 0; j < CAT_ARRAY_BATCH_SIZE && (i+j) < numInputs; ++j) {
          int64_t dimSize = THCTensor_(size)(state, inputs[i+j], dimension);

          stackInputs[j].input = THCTensor_(data)(state, inputs[i+j]);
          stackInputs[j].offset = offset;
          stackInputs[j].dimSize = dimSize;
          stackInputs[j].nElements = THCTensor_(nElement)(state, inputs[i+j]);
          cohortMax = cohortMax > (int) stackInputs[j].nElements ? cohortMax : (int) stackInputs[j].nElements;

          // update offset
          offset += dimSize;
        }
        THCudaCheck(cudaMemcpyAsync(
            d_inputs,
            stackInputs,
            j * sizeof(CatArrInputTensor<scalar_t, unsigned int>),
            cudaMemcpyHostToDevice,
            stream.stream()));
        THCudaHostRecord(state, stackInputs);
      }

      // Next, let's consider how we set our kernel launch parameters.
      // We borrow from THCApply, which the kernel's internal indexing
      // is based on.
      dim3 applyBlock = getApplyBlock();

      // Get a grid where the x dim fills half the GPU and the y dim is the number of tensors.
      // This will have catting two tensors fill the entire grid, but prevents
      // many threads from needlessly loading metadata if their sizes are small.
      dim3 catGrid;
      getCatGrid(state, j, catGrid);

      switch (nDims) {
        case 1:
          HANDLE_CASE(1);
          break;
        case 2:
          HANDLE_CASE(2);
          break;
        case 3:
          HANDLE_CASE(3);
          break;
        case 4:
          HANDLE_CASE(4);
          break;
      }
      THCudaCheck(cudaGetLastError());
    }
    THCudaFree(state, d_inputs);
    #undef HANDLE_CASE
  } else {
    offset = 0;
    for (j = 0; j < numInputs; j++)
    {
      if (should_skip(inputs[j])) continue;

      int64_t dimSize = THCTensor_(size)(state, inputs[j], dimension);
      THCTensor *nt = THCTensor_(newWithTensor)(state, result);
      THCTensor_(narrow)(state, nt, NULL, dimension, offset, dimSize);
      THCTensor_(copy)(state, nt, inputs[j]);
      THCTensor_(free)(state, nt);
      offset += dimSize;
    }
  }
}
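The parts relevant to the multi-device question are the THCTensor_allSameDevice check that gates the batched-kernel path and the else branch at the end: when the inputs are not all on one device, the code falls back to narrowing the result and calling THCTensor_(copy) per input, and that copy handles cross-device transfers. An illustrative ATen-level sketch of that fallback (not code from this PR):

#include <ATen/ATen.h>

#include <vector>

// Narrow the result along the cat dimension and copy each input into its
// slot. copy_() supports copies across devices, which is why this path does
// not require all inputs to be on the same device.
void catFallback(at::Tensor& result,
                 const std::vector<at::Tensor>& inputs,
                 int64_t dimension) {
  int64_t offset = 0;
  for (const auto& input : inputs) {
    const int64_t dimSize = input.size(dimension);
    result.narrow(dimension, offset, dimSize).copy_(input);
    offset += dimSize;
  }
}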

@pietern
Contributor

pietern commented Dec 4, 2018

Regarding DeviceIndex, also see #14729.

@janewangfb
Contributor Author

@pietern it seems the failures are random. When I run an individual test, it always passes, but when I run all the tests together it sometimes fails. Investigating...

@janewangfb
Contributor Author

Created #14812 for newcomers.

Summary: as titled

Reviewed By: pietern

Differential Revision: D13266063

fbshipit-source-id: 413140d80df24f4d6db26d1fcb5051fc41b2ab9a