
[XLA] When ptxas does not know about an SM, fall back to the driver. #43888

Merged
merged 1 commit into from Oct 16, 2020

Conversation


@nouiz nouiz commented Oct 8, 2020

Currently XLA always uses ptxas. If a user has an old container but a newer GPU, ptxas won't know that GPU's SM version.
In that case, instead of erroring out, fall back to the driver to compile instead of ptxas.
The driver won't apply all optimizations, but it will work.
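The intended decision can be sketched as follows. This is a minimal illustration with hypothetical names, not the PR's actual code; the real change inspects the status returned by the ptxas compilation path:

```cpp
#include <cassert>

// Hypothetical status values standing in for the real status codes
// (assumption: ptxas rejecting an unknown SM surfaces as a distinct
// status from a genuine compilation error).
enum class PtxasStatus { kOk, kUnknownSm, kCompileError };

// Sketch of the new behavior: fall back to the driver's JIT only when
// ptxas does not recognize the SM version; real compile errors still fail.
bool ShouldFallBackToDriver(PtxasStatus status) {
  return status == PtxasStatus::kUnknownSm;
}
```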

@google-ml-butler google-ml-butler bot added the size:M CL Change Size: Medium label Oct 8, 2020
@google-cla google-cla bot added the cla: yes label Oct 8, 2020
@gbaned gbaned self-assigned this Oct 8, 2020
@gbaned gbaned added the comp:xla XLA label Oct 8, 2020
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Oct 8, 2020
@@ -198,6 +198,42 @@ absl::optional<bool> CanShareBufferHint(const HloInstruction* user,
return absl::nullopt;
}

// Try to load ptx from files defined in the FLAGS. If successful, return true.
bool MaybeLoadPtxFromFile(const HloModule* module, std::string* ptx) {
Contributor Author

The diff is showing upstream code; I didn't move MaybeLoadPtxFromFile.
I moved the function WarnIfBadDriverJITVersion outside the unnamed namespace.
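The linkage change can be illustrated with a minimal sketch (stub bodies and names are illustrative, not the real implementations):

```cpp
#include <cassert>

namespace {
// Internal linkage: a function inside an unnamed namespace is visible
// only within this translation unit (as MaybeLoadPtxFromFile remains).
bool InUnnamedNamespaceStub() { return true; }
}  // namespace

// External linkage: moving a declaration out of the unnamed namespace,
// as the PR does with WarnIfBadDriverJITVersion, lets other translation
// units (and other build targets) reference it.
bool OutsideUnnamedNamespaceStub() { return InUnnamedNamespaceStub(); }
```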

@@ -415,7 +415,10 @@ std::vector<uint8> NVPTXCompiler::CompileGpuAsmOrGetCachedResult(
"using $PATH.",
hlo_module_config);
}
} else {
} else if (maybe_cubin.status().code() !=
Member

XLA used to silently fall back to the driver when ptxas couldn't be found or when compilation failed. That silent fallback led to several user-reported bugs that were hard to reproduce and diagnose. We therefore decided to turn ptxas issues into fatal errors (few people notice warnings in logs) and to allow a manual override via the flag --xla_gpu_unsafe_fallback_to_driver_on_ptxas_not_found.
Have you considered introducing a flag for the ptxas-too-old case? I think that would be the better option.

Contributor Author

Here it isn't a real compilation failure, just that ptxas doesn't know a specific SM version.

Where can I find more information about the bugs that triggered this decision?

I'll think about your suggestion and get back to you.

Contributor Author

Quick follow-up:
For currently supported GPUs there is no change. The new fallback only triggers for newer GPUs, so it is used when a current or older container runs on a newer GPU.

@gbaned gbaned added the awaiting review Pull request awaiting review label Oct 13, 2020

nouiz commented Oct 13, 2020

I amended the commit, as it had a debug leftover.

PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Oct 14, 2020
@gbaned gbaned added ready to pull PR ready for merge process and removed awaiting review Pull request awaiting review ready to pull PR ready for merge process labels Oct 14, 2020
@@ -113,6 +113,7 @@ cc_library(
"//tensorflow/compiler/xla/service/gpu:ir_emission_utils",
"//tensorflow/compiler/xla/service/gpu:nvptx_compiler_impl",
"//tensorflow/compiler/xla/service/gpu:launch_dimensions",
"//tensorflow/compiler/xla/service/gpu:nvptx_compiler",
Member

This extra dependency breaks tensorflow/compiler/xla/service/mlir_gpu/tests. The linking step registers a compiler, and these tests fail with: Check failed: factories->find(platform_id) == factories->end() Compiler factory already registered for platform.
Please consider moving WarnIfBadDriverJITVersion into a different place. I believe asm_compiler.cc would work.
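The failure mode can be modeled with a minimal sketch (the class and method names here are hypothetical, not XLA's actual registry API): a second registration for the same platform id trips the CHECK, modeled below as returning false.

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch of a platform-to-compiler-factory registry that, like the one
// described above, rejects duplicate registrations for a platform id.
class CompilerFactoryRegistry {
 public:
  bool Register(const std::string& platform_id) {
    // Mirrors: Check failed: factories->find(platform_id) == factories->end()
    if (factories_.find(platform_id) != factories_.end()) return false;
    factories_[platform_id] = true;
    return true;
  }

 private:
  std::map<std::string, bool> factories_;
};
```

Linking two targets that each register a compiler for the same platform into one test binary triggers exactly this duplicate-registration check.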

Contributor Author

I amended the commit to just remove this line, which fixes the mlir_gpu tests. //tensorflow/compiler/xla/service/gpu:nvptx_compiler_impl is already included, and it is enough. Since it was already included, I do not see value in moving that function.

Member

You are right, the dependency on nvptx_compiler_impl is enough and the other one was redundant.

PR Queue automation moved this from Approved by Reviewer to Reviewer Requested Changes Oct 15, 2020
PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Oct 16, 2020
@copybara-service copybara-service bot merged commit de2c020 into tensorflow:master Oct 16, 2020
PR Queue automation moved this from Approved by Reviewer to Merged Oct 16, 2020