
Cleanup tensorflow/c/experimental/gradients Part 1 #45547

Merged
merged 6 commits into tensorflow:master from the gradients-cleanup branch on Dec 11, 2020

Conversation

vnghia
Contributor

@vnghia vnghia commented Dec 9, 2020

@saxenasaurabh

One thing I don't understand: if computing numerical gradients with TensorFloat-32 is numerically unstable, wouldn't it make more sense to disable TensorFloat-32 inside gradient_checker rather than inside the test, since we will have to disable TF-32 in all binaries that depend on gradient_checker anyway?
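For reference, a minimal sketch of what disabling TF-32 inside gradient_checker could look like, assuming the helpers declared in tensorflow/core/platform/tensor_float_32_utils.h; the body of the numerical-gradient routine (shown here as CalcNumericalGrad) is illustrative, not the actual implementation:

#include "tensorflow/core/platform/tensor_float_32_utils.h"

Status CalcNumericalGrad(/* ctx, forward model, inputs, ... */) {
  // Numerical gradients are too imprecise under TensorFloat-32, so force
  // full-precision float32 execution for the duration of the check and
  // restore the previous setting afterwards.
  const bool tf32_enabled = tensorflow::tensor_float_32_execution_enabled();
  tensorflow::enable_tensor_float_32_execution(false);
  // ... evaluate the forward model at perturbed inputs and form the
  // central finite differences ...
  tensorflow::enable_tensor_float_32_execution(tf32_enabled);
  return Status::OK();
}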

@rthadur rthadur added this to Assigned Reviewer in PR Queue via automation Dec 9, 2020
@saxenasaurabh
Member

tensorflow/c/eager/unified_api_test.cc:124:76: error: invalid conversion from 'tensorflow::int64* {aka long long int*}' to 'int64_t* {aka long int*}' [-fpermissive]
TestTensorHandleWithDimsFloat(ctx.get(), data, dim_sizes, 2, &x_raw);

Please fix.

@vnghia
Contributor Author

vnghia commented Dec 10, 2020

tensorflow/c/eager/unified_api_test.cc:124:76: error: invalid conversion from 'tensorflow::int64* {aka long long int*}' to 'int64_t* {aka long int*}' [-fpermissive]
TestTensorHandleWithDimsFloat(ctx.get(), data, dim_sizes, 2, &x_raw);

Done
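For context, a hedged sketch of the kind of change that resolves this -fpermissive error (the actual diff is not shown in this thread, and the dimension and data values below are made up): on LP64 Linux, tensorflow::int64 is long long while int64_t is long, so pointers to them do not convert implicitly; declaring the buffer with the type the callee expects fixes the call.

// was: tensorflow::int64 dim_sizes[] = {2, 2};
int64_t dim_sizes[] = {2, 2};
float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
TestTensorHandleWithDimsFloat(ctx.get(), data, dim_sizes, 2, &x_raw);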

PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Dec 10, 2020
@rthadur
Contributor

rthadur commented Dec 10, 2020

@vnvo2409 can you please resolve conflicts?

@vnghia
Contributor Author

vnghia commented Dec 10, 2020

@vnvo2409 can you please resolve conflicts?

Done

@saxenasaurabh
Member

Looks like eager/BUILD has conflicts. Rebase maybe? Also nn_grad_test is segfaulting internally. I will try to patch and see what's going on.

@vnghia
Contributor Author

vnghia commented Dec 11, 2020

@saxenasaurabh

eager/BUILD has conflicts.

I've already fixed it.

Also nn_grad_test is segfaulting internally.

There are three memory problems in nn_grad_test:


1. One comes from SoftMaxModel and RunAndMaybeSum:

Status RunAndMaybeSum(AbstractContext* ctx, Model forward,
                      absl::Span<AbstractTensorHandle* const> inputs,
                      absl::Span<AbstractTensorHandle*> outputs,
                      bool use_function) {
  GradientRegistry registry;
  std::vector<AbstractTensorHandle*> model_outputs(1);

SoftMaxModel returns 2 tensors but model_outputs holds only 1, which causes a buffer overflow (see the standalone sketch after this list). I intend to fix this issue in the next PR (running without ASan is fine), but we could fix it in this PR if needed. WDYT?


2. gradient_checker leaks a lot of AbstractTensorHandles. This will be fixed in the next PR too.


3. It seems that tape->ComputeGradients leaks a tensor. The traceback points to the allocation of a new tensor in BuildOnesLike. However, I have not found a fix for it yet.
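A standalone illustration of the first problem and its fix (this is not the TensorFlow code; the types and names below are stand-ins): a model that writes two outputs into a caller-provided span sized for only one overflows the buffer, and the fix is to size the output vector for every tensor the model returns.

#include <cassert>
#include <span>
#include <vector>

struct Handle { int id; };

// Stand-in for a two-output model such as the softmax-loss model above:
// it writes BOTH a loss tensor and a backprop tensor into `outputs`.
void TwoOutputModel(std::span<Handle*> outputs) {
  assert(outputs.size() >= 2 && "caller must provide room for both outputs");
  outputs[0] = new Handle{0};  // loss
  outputs[1] = new Handle{1};  // backprop
}

int main() {
  // Buggy pattern: `model_outputs(1)` has room for one handle, but the
  // model writes two, which is an out-of-bounds write under ASan.
  // Fixed pattern: size the buffer for every output the model produces.
  std::vector<Handle*> model_outputs(2);
  TwoOutputModel(std::span<Handle*>(model_outputs));
  for (Handle* h : model_outputs) delete h;
  return 0;
}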

@saxenasaurabh
Member

Ah got it. Let's fix the SoftmaxModel. The leaks we can address in a follow-up.

@vnghia
Contributor Author

vnghia commented Dec 11, 2020

Let's fix the SoftmaxModel.

Done

@vnghia
Contributor Author

vnghia commented Dec 11, 2020

@saxenasaurabh
The copybara/feedback check failed.

@copybara-service copybara-service bot merged commit c4e8492 into tensorflow:master Dec 11, 2020
PR Queue automation moved this from Approved by Reviewer to Merged Dec 11, 2020
@vnghia vnghia deleted the gradients-cleanup branch December 11, 2020 23:57