Support SparseSegmentSum and SparseSegmentSumWithNumSegments on GPU #45085
Conversation
@Lifann Can you please resolve conflicts? Thanks!
(force-pushed from 0ef4ada to 69228f3)
Thanks for your reminder! I have resolved the conflicts.
segment_indices = [0, 1, 2, 2]
num_indices = len(segment_indices)
for tf_op in [math_ops.sparse_segment_sum_with_num_segments]:
  with self.cached_session(use_gpu=test_util.is_gpu_available()):
Generally you can do use_gpu=True; it will use GPUs only when available.
Sorry for the delay. It's very nice to have your review. I've changed use_gpu in the session parameter to True.
const TensorShape input_shape = input_data.shape();
Index element_size_muta = 1;
if (input_dims > 1) {
  ACCUMULATE_MUL(input_shape, input_dims, accum);
I'd suggest just inlining the macro here. If you definitely need some abstraction, please use a function.
        "same length."),
    done);

ScratchSpace<Index> output_rows_host(context, 1, /* on_host */ true);
Please use the following spelling:
ScratchSpace<Index> output_rows_host(context, /*size=*/1, /*on_host=*/true);
Our internal tooling recognizes this format and will check that the parameter names actually match up.
const Tensor& num_segments = context->input(3);
se::DeviceMemoryBase num_segments_device(
    const_cast<Tensor&>(num_segments).template flat<Index>().data());
OP_REQUIRES_ASYNC(
    context,
    stream
        ->ThenMemcpy(output_rows_host.mutable_data(), num_segments_device,
                     sizeof(Index))
        .ok(),
    errors::Internal(
        "SparseSegmentSumGpuOp: failed to copy num_segments to host."),
    done);
If I'm reading the code correctly, the correct way to do this is to use the HostMemory annotation. The runtime will then copy that input to the host before running the OpKernel.
Great idea! It seems more concise!
registration
simpler impl
se::DeviceMemoryBase last_segment_id_on_device(
    const_cast<Tensor&>(segment_ids).template flat<Index>().data() +
    (num_indices - 1));
Are you sure the const_cast is needed? I think you should be able to create and pass a const DeviceMemoryBase to ThenMemcpy.
The pointer here is used to get the value of the last id in segment_ids, so I think it always needs a conversion for the pointer offset; maybe it doesn't need to be const, though.
se::DeviceMemoryBase last_segment_id_on_device(
    const_cast<Tensor&>(segment_ids).template flat<Index>().data() +
    (num_indices - 1));
OP_REQUIRES_ASYNC(
Please use OP_REQUIRES_OK_ASYNC.
I checked the OP_REQUIRES_OK_ASYNC macro. It seems that a Status condition is needed, but ThenMemcpy returns a Stream, which only has an ok() method with a bool return value. It's a little confusing with Status::ok().
}
const Index element_size = element_size_muta;

functor::SparseSegmentSumFunctor<T, Index> functor_;
Local vars don't have a _ suffix.
Accepted. To distinguish it from the functor namespace, I renamed it to executant.
auto create_and_check_output = [context, num_indices, element_size,
                                output_rows_host, &input_data, &indices,
                                &segment_ids, &functor_, done]() {
Wouldn't we have a use-after-free on segment_ids and functor_?
I'd recommend creating a functor that has the captured values as member fields. This will make the types explicit and catch use-after-free issues.
if (!has_num_segments) {
  output_rows++;
}
Please add a short comment on why this is needed.
Thanks for your advice. I've added the comment.
@Lifann Any update on this PR? Please. Thanks!
(force-pushed from 1636b36 to 13c00d7)
Sorry for the delay, and thanks for @sanjoy's review. Besides, I also found and fixed an asynchronicity problem in the D2H copy, which may lead to the host getting
(force-pushed from 13c00d7 to 3c1dffa)
(force-pushed from 3c1dffa to 6ac8ef4)
(force-pushed from 324bd24 to dcf8c57)
@Lifann Is this ready for review or are you still working on it?
This PR is ready for review.
Hi @Lifann, sorry for the delayed response; this PR somehow fell through the cracks. It seems like @benbarsdell has also implemented these ops in #47974. Would it be OK if we reviewed & merged that PR (please feel free to leave a review there)? At first glance that implementation looks vectorized, so I'm expecting it to be faster.
Thanks for the reply. I checked #47974, and it seems to be much better than this PR. It's good to know there is a more experienced developer aiming at the same goal, and it's also a nice chance for me to learn from a better implementation.
I will close this PR, then.
This is a PR from the JIZHI Team & TaiJi AI platform at Tencent.
Currently, there is no GPU support for the sparse_segment_reduction ops in TensorFlow. This commit adds GPU support for SparseSegmentSum and SparseSegmentSumWithNumSegments. We are looking forward to supporting more sparse reduction ops for training and inference in sparse scenarios.