Change Redzone space limit for XLA GPU #54860

kaixih · 2022-03-02T19:30:43Z

This PR changes how the redzone space limit is set in the XLA GPU conv algorithm picker.

It sets the space limit of the input/output allocator to the numeric max of int64, so the behavior is consistent with the gemm picker.
It allows the space limit of the scratch allocator to be adjusted via an env var, so users can set it with, for example, XLA_FLAGS=--xla_gpu_redzone_scratch_max_megabytes=6144.
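
For anyone setting the flag from a Python script rather than the shell, here is a minimal sketch. The helper `with_xla_flag` is hypothetical (it is not part of TensorFlow or XLA), and the sketch assumes the variable has to be set before the process initializes XLA, so it should run before TensorFlow is imported.

```python
import os


def with_xla_flag(flag, env=None):
    """Return an XLA_FLAGS string with `flag` appended, keeping other flags.

    Hypothetical helper for illustration; not a TensorFlow or XLA API.
    """
    env = os.environ if env is None else env
    existing = env.get("XLA_FLAGS", "").split()
    name = flag.split("=", 1)[0]
    # Drop any earlier setting of the same flag so the new value wins.
    kept = [f for f in existing if f.split("=", 1)[0] != name]
    return " ".join(kept + [flag])


# Assumption: this runs before TensorFlow (and therefore XLA) is imported.
os.environ["XLA_FLAGS"] = with_xla_flag(
    "--xla_gpu_redzone_scratch_max_megabytes=6144"
)
```

Appending instead of overwriting keeps any other XLA flags that were already exported in the environment.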

cheshire · 2022-03-03T19:20:52Z

Hi @kaixih, could you provide some more context on the desired goal?

kaixih · 2022-03-07T17:25:36Z

@cheshire Sure. We found that the max space limit of the redzone allocator for the XLA conv is set to 4GB, which is insufficient for some models that use large input/output tensors. We also noticed that this limit cannot be adjusted at runtime. The XLA gemm already sets the limit of its input/output redzone allocator to the numeric max of int64, so we think it is appropriate for the XLA conv to do the same. In addition, we introduced a new env var to control the max limit of the scratch redzone allocator in case it needs to be adjusted.
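
To make the sizes in this comment concrete, here is a rough arithmetic sketch. It assumes "4GB" means 4 GiB and that one megabyte in the flag value is 2**20 bytes. The tensor size is a made-up example, and `fits` is a hypothetical helper, not the check XLA performs.

```python
INT64_MAX = 2**63 - 1          # numeric max of a signed 64-bit integer
OLD_CONV_LIMIT = 4 * 1024**3   # assumption: "4GB" read as 4 GiB, in bytes


def megabytes_to_bytes(megabytes):
    # Assumption: one megabyte is 2**20 bytes.
    return megabytes * 1024 * 1024


def fits(required_bytes, limit_bytes):
    # Hypothetical check: does a buffer of this size stay within the limit?
    return required_bytes <= limit_bytes


# Made-up example: a float32 buffer with 2**31 elements needs 8 GiB.
tensor_bytes = 4 * 2**31

print(fits(tensor_bytes, OLD_CONV_LIMIT))   # False: exceeds the 4 GiB limit
print(fits(tensor_bytes, INT64_MAX))        # True: the int64 max is effectively unlimited
print(megabytes_to_bytes(6144) // 1024**3)  # 6, so 6144 megabytes is 6 GiB
```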

tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc

kaixih · 2022-03-10T17:41:21Z

@cheshire Can you help check what is blocking the merge? It seems some "Google internal checks" failed. Thanks.

PiperOrigin-RevId: 434441627

akuegel · 2022-03-14T13:04:26Z

There was a merge conflict in xla.proto: you used the same tag that a recent change had already used. I fixed that and got your PR merged.

gbaned · 2022-03-14T14:10:59Z

It seems auto-merge is not happening, but the changes are merged into master now, so we can close this. Thank you for the PR.

google-ml-butler bot added the size:S CL Change Size: Small label Mar 2, 2022

google-ml-butler bot assigned gbaned Mar 2, 2022

kaixih force-pushed the fix_xla_oom_upstream branch from c5548ae to 86fbb3a Compare March 2, 2022 19:49

gbaned requested a review from chsigg March 3, 2022 12:26

google-ml-butler bot added the awaiting review Pull request awaiting review label Mar 3, 2022

gbaned added this to Assigned Reviewer in PR Queue via automation Mar 3, 2022

chsigg requested review from cheshire and removed request for chsigg March 3, 2022 12:27

gbaned removed the awaiting review Pull request awaiting review label Mar 4, 2022

Added XLA env var for redzone space limit

2886d6d

kaixih force-pushed the fix_xla_oom_upstream branch from 86fbb3a to 2886d6d Compare March 7, 2022 17:43

cheshire requested changes Mar 7, 2022

tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc

PR Queue automation moved this from Assigned Reviewer to Reviewer Requested Changes Mar 7, 2022

Remove kDefaultMemoryLimit

741c650

cheshire approved these changes Mar 7, 2022

google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 7, 2022

PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Mar 7, 2022

kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 7, 2022

kaixih added the kokoro:force-run Tests on submitted change label Mar 7, 2022

kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 7, 2022

Ran clang format check

11ec173

google-ml-butler bot removed the ready to pull PR ready for merge process label Mar 7, 2022

gbaned requested a review from cheshire March 8, 2022 13:51

google-ml-butler bot added the awaiting review Pull request awaiting review label Mar 8, 2022

cheshire approved these changes Mar 8, 2022

google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 8, 2022

kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 8, 2022

gbaned added ready to pull PR ready for merge process and removed awaiting review Pull request awaiting review ready to pull PR ready for merge process labels Mar 9, 2022

copybara-service bot pushed a commit that referenced this pull request Mar 14, 2022

Merge pull request #54860 from kaixih:fix_xla_oom_upstream

9768822

PiperOrigin-RevId: 434441627

gbaned closed this Mar 14, 2022

PR Queue automation moved this from Approved by Reviewer to Closed/Rejected Mar 14, 2022

google-ml-butler bot removed the ready to pull PR ready for merge process label Mar 14, 2022

bhack mentioned this pull request Apr 12, 2022

PR internally edited as CL are not automatically merged tensorflow/community#413

Open
