2 changes: 1 addition & 1 deletion CODEGEN_MIGRATION_GUIDE.md
@@ -52,7 +52,7 @@ When you work on your first few codegens, we generally recommend you to start wi
```
if (!IsSupportedAdaptivePool(XlaHelpers::I64List(self.sizes()),
output_size_list, /*pool_dim=*/3)) {
-return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP(_adaptive_avg_pool3d)>::call(self, output_size);
+return at::native::call_fallback_fn<&xla_fallback, ATEN_OP(_adaptive_avg_pool3d)>::call(self, output_size);
}
```
2. Results in dynamic shape as these ops are WIP and may evolve over time. At some future point, we may bring the ops into codegen.
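The `&xla_fallback` template argument in the snippet above refers to PyTorch/XLA's boxed fallback kernel, declared in `aten_fallback.h` after this rename. As a rough, simplified sketch of the general pattern only (not the repository's actual implementation, which also supports the CUDA fallback path and tracks fallback counters), such a boxed kernel and its dispatcher registration typically look like this:

```cpp
#include <ATen/core/dispatch/Dispatcher.h>
#include <ATen/native/CPUFallback.h>
#include <torch/library.h>

// Boxed kernel: receives any unlowered op plus its arguments on the stack.
void xla_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
  // Delegate to PyTorch's generic boxed CPU fallback helper.
  at::native::cpu_fallback(op, stack);
}

// Register it as the catch-all kernel for ops under the XLA dispatch key.
TORCH_LIBRARY_IMPL(_, XLA, m) {
  m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_fallback>());
}
```

The `call_fallback_fn<&xla_fallback, ATEN_OP(...)>::call(...)` helper used in the guide simply routes one specific op through that same boxed kernel.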
4 changes: 2 additions & 2 deletions OP_LOWERING_GUIDE.md
@@ -16,15 +16,15 @@ export PJRT_DEVICE=CPU
```

## Understanding the operation
-You can find the definition of the C++ ATen operations in [native_functions.yaml](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml). After you build Pytorch/XLA from source, you will also find our default implementation (a boxed kernel which forwards calls to PyTorch native CPU) in `xla/torch_xla/csrc/aten_cpu_fallback.h/cpp`. Pytorch operations can usually be mapped to [PyTorch tensor api](https://pytorch.org/docs/stable/index.html) easily. If that is not the case searching the PyTorch native implementation under [PyTorch repo](https://github.com/pytorch/pytorch) is recommended. The goal is to lower the PyTorch operations into a sequence of XLA operations defined in [here](https://www.tensorflow.org/xla/operation_semantics).
+You can find the definition of the C++ ATen operations in [native_functions.yaml](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml). After you build Pytorch/XLA from source, you will also find our default implementation (a boxed kernel which forwards calls to either PyTorch native CPU or CUDA kernels) in `xla/torch_xla/csrc/aten_fallback.h/cpp`. Pytorch operations can usually be mapped to [PyTorch tensor api](https://pytorch.org/docs/stable/index.html) easily. If that is not the case searching the PyTorch native implementation under [PyTorch repo](https://github.com/pytorch/pytorch) is recommended. The goal is to lower the PyTorch operations into a sequence of XLA operations defined in [here](https://www.tensorflow.org/xla/operation_semantics).
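As a purely illustrative aside (not code from this repository), "a sequence of XLA operations" concretely means composing `xla::XlaOp` values through the XLA client builder API. For example, a softplus could be expressed as:

```cpp
// Illustrative only; the exact header path depends on the OpenXLA version
// (e.g. "xla/client/xla_builder.h" in many checkouts).
#include "xla/client/xla_builder.h"

// softplus(x) = log(1 + exp(x)), written as two chained XLA operations.
xla::XlaOp BuildSoftplus(xla::XlaOp input) {
  return xla::Log1p(xla::Exp(input));
}
```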

## File structure
All files mentioned below live under the `xla/torch_xla/csrc` folder, with the exception of `codegen/xla_native_functions.yaml`.

1. `xla_native_functions.yaml` contains the list of all operators (from the [Core Aten list](https://pytorch.org/docs/stable/torch.compiler_ir.html)) that are explicitly lowered. Composed operators are not listed here. Each operator name here must directly match a pytorch operator listed in [native_functions.yaml](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml). This file serves as the interface to adding new xla operators, and is an input to PyTorch's [codegen machinery](https://github.com/pytorch/pytorch/blob/main/torchgen/gen_backend_stubs.py). It generates the below 3 files: `XLANativeFunctions.h`, `RegisterXLA.cpp`, and `RegisterAutogradXLA.cpp`
2. `XLANativeFunctions.h` and `aten_xla_type.cpp` are entry points of PyTorch to the pytorch_xla world, and contain the manually written lowerings to XLA for each operator. `XLANativeFunctions.h` is auto-generated through a combination of `xla_native_functions.yaml` and the PyTorch core `native_functions.yaml` file, and contains declarations for kernels that need to be defined in `aten_xla_type.cpp`. The kernels written here need to construct 'XLATensor' using the input `at::Tensor` and other parameters. The resulting `XLATensor` needs to be converted back to the `at::Tensor` before returning to the PyTorch world.
3. `RegisterXLA.cpp` and `RegisterAutogradXLA.cpp` are auto-generated files that register all lowerings to the PyTorch Dispatcher. They also include auto-generated wrapper implementations of `out=` and `inplace` operators.
-4. `aten_cpu_fallback.h/.cpp` contain our boxed fallback implementation to CPU. The boxed fallback kernel will be used if a lowering is not explicitly defined in `xla_native_functions.yaml` + `aten_xla_type.cpp`, and the operator is not composite.
+4. `aten_fallback.h/.cpp` contain our boxed fallback implementation. The boxed fallback kernel will be used if a lowering is not explicitly defined in `xla_native_functions.yaml` + `aten_xla_type.cpp`, and the operator is not composite.
5. `tensor_methods.h` contains the `XLATensor` declarations. These declarations are usually a one-to-one mapping of the `at::Tensor` methods we declared in `XLANativeFunctions.h`.
6. `tensor_methods.cpp` contains the implementations of the `XLATensor` methods declared in `tensor_methods.h`. We construct the corresponding `ir::op` from the parameter's `ir::Value` and wrap it inside an `XLATensor`. IR stands for intermediate representation.
7. `ops/` directory contains all `ir::ops` declaration and definition. Smaller nodes can be put in `ops/ops.h/.cpp`. More complicated nodes can be put into a separate file. All ops inherit from `ir::ops::Node` and provide a way to lower input `ir::Value` to a sequence of `XlaOp`.
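To make items 2, 5 and 6 above concrete, here is a hypothetical sketch of the usual shape of a hand-written kernel in `aten_xla_type.cpp`. The op name `my_op` is made up; a real kernel must match a declaration generated from `xla_native_functions.yaml`, and `tensor_methods::my_op` stands in for the corresponding tensor method.

```cpp
#include "torch_xla/csrc/XLANativeFunctions.h"
#include "torch_xla/csrc/aten_xla_bridge.h"
#include "torch_xla/csrc/tensor_methods.h"

namespace torch_xla {

at::Tensor XLANativeFunctions::my_op(const at::Tensor& self) {
  // Unwrap the at::Tensor into the lazy XLATensor representation.
  XLATensorPtr xla_self = bridge::GetXlaTensor(self);
  // The (hypothetical) tensor method builds the ir::op for this operation;
  // the result is wrapped back into an at::Tensor before returning to PyTorch.
  return bridge::AtenFromXlaTensor(tensor_methods::my_op(xla_self));
}

}  // namespace torch_xla
```

Ops that are not listed in `xla_native_functions.yaml` (and are not composite) skip this path entirely and land in the boxed fallback from `aten_fallback.cpp` instead.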
5 changes: 1 addition & 4 deletions benchmarks/experiment_runner.py
@@ -645,9 +645,6 @@ def _collect_cuda_cpu_metrics_individual_ops(
def is_aten_op(op_name):
  return 'aten::' in op_name

-def get_xla_cpu_fallback_ops(met):
-  return set(name for name in met.counter_names() if is_aten_op(name))
-
extract_prof_info = lambda event: {
    "self_cpu_time_s": us_to_s(event.self_cpu_time_total),
    "self_cuda_time_s": us_to_s(event.self_cuda_time_total),
@@ -657,7 +654,7 @@ def get_xla_cpu_fallback_ops(met):
}

if benchmark_experiment.xla:
-  unlowered_ops = get_xla_cpu_fallback_ops(met)
+  unlowered_ops = met.executed_fallback_ops()
  if not unlowered_ops:
    return
  if "xla_unlowered_ops" not in metrics:
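As a hedged usage sketch of the metrics call adopted above (it assumes the `executed_fallback_ops()` helper is exposed by `torch_xla.debug.metrics`, as its use in the benchmark runner suggests), fallback ops can be inspected after a step like so:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
x = torch.randn(8, 8, device=device)
y = (x @ x).sum()
xm.mark_step()  # materialize the pending XLA computation

# ATen ops that executed through the OpenXLA fallback
# (empty if everything was lowered to XLA).
print(met.executed_fallback_ops())
```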
5 changes: 5 additions & 0 deletions configuration.yaml
@@ -394,3 +394,8 @@ variables:
        your code.
    type: bool
    default_value: false
+  XLA_FALLBACK_CPU:
+    description:
+      - Forces CPU OpenXLA fallback. By default, PyTorch/XLA runs any operation
+        that doesn't have an XLA lowering on PyTorch CUDA as a fallback. Setting
+        this flag forces PyTorch/XLA to use PyTorch CPU as the fallback instead.
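A minimal sketch of how the new flag would be used, assuming it is read from the environment like the other variables in this file (the equivalent shell form would be `XLA_FALLBACK_CPU=1 python train.py`):

```python
import os

# Must be set before torch_xla is imported so the runtime picks it up.
os.environ["XLA_FALLBACK_CPU"] = "1"

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
# Ops without an XLA lowering will now fall back to PyTorch CPU kernels
# instead of the default CUDA fallback.
x = torch.ones(2, 2, device=device)
```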
4 changes: 2 additions & 2 deletions torch_xla/csrc/BUILD
@@ -32,7 +32,7 @@ ptxla_cc_library(
name = "tensor",
srcs = [
"aten_autograd_ops.cpp",
"aten_cpu_fallback.cpp",
"aten_fallback.cpp",
"aten_xla_bridge.cpp",
"aten_xla_type.cpp",
"autocast_mode.cpp",
@@ -75,7 +75,7 @@ ptxla_cc_library(
] + glob(["ops/*.cpp"]),
hdrs = [
"aten_autograd_ops.h",
"aten_cpu_fallback.h",
"aten_fallback.h",
"aten_cuda_functions.h",
"aten_xla_bridge.h",
"batch_norm.h",
2 changes: 1 addition & 1 deletion torch_xla/csrc/aten_autograd_ops.cpp
@@ -5,7 +5,7 @@
#include <ATen/native/CPUFallback.h>
#include <c10/core/impl/PythonDispatcherTLS.h>

#include "torch_xla/csrc/aten_cpu_fallback.h"
#include "torch_xla/csrc/aten_fallback.h"
#include "torch_xla/csrc/aten_xla_bridge.h"
#include "torch_xla/csrc/helpers.h"
#include "torch_xla/csrc/tensor_methods.h"
2 changes: 1 addition & 1 deletion torch_xla/csrc/aten_cuda_functions.cpp
@@ -7,7 +7,7 @@

// Context
// =======
-// aten_cpu_fallback.cpp (compiled into _XLAC.so library) uses these functions
+// aten_fallback.cpp (compiled into _XLAC.so library) uses these functions
// for providing OpenXLA fallback on CUDA. Therefore, they must be defined at
// some point, somewhere.
//
@@ -1,4 +1,4 @@
#include "torch_xla/csrc/aten_cpu_fallback.h"
#include "torch_xla/csrc/aten_fallback.h"

#include <ATen/DLConvertor.h>
#include <ATen/ops/_copy_from_and_resize.h>
File renamed without changes.
2 changes: 1 addition & 1 deletion torch_xla/csrc/aten_xla_type.cpp
@@ -24,7 +24,7 @@
#include "torch_xla/csrc/LazyIr.h"
#include "torch_xla/csrc/XLANativeFunctions.h"
#include "torch_xla/csrc/aten_autograd_ops.h"
#include "torch_xla/csrc/aten_cpu_fallback.h"
#include "torch_xla/csrc/aten_fallback.h"
#include "torch_xla/csrc/aten_xla_bridge.h"
#include "torch_xla/csrc/debug_util.h"
#include "torch_xla/csrc/device.h"
2 changes: 1 addition & 1 deletion torch_xla/csrc/generated_file_include.h
@@ -3,7 +3,7 @@

#include <torch/csrc/lazy/core/shape.h>

#include "torch_xla/csrc/aten_cpu_fallback.h"
#include "torch_xla/csrc/aten_fallback.h"
#include "torch_xla/csrc/aten_xla_bridge.h"
#include "torch_xla/csrc/ir.h"
#include "torch_xla/csrc/ops/ops_xla_shape_fn.h"
2 changes: 1 addition & 1 deletion torch_xla/csrc/init_python_bindings.cpp
@@ -34,7 +34,7 @@
#include "pybind11/stl_bind.h"
#include "torch_xla/csrc/XLANativeFunctions.h"
#include "torch_xla/csrc/aten_autograd_ops.h"
#include "torch_xla/csrc/aten_cpu_fallback.h"
#include "torch_xla/csrc/aten_fallback.h"
#include "torch_xla/csrc/aten_xla_bridge.h"
#include "torch_xla/csrc/device.h"
#include "torch_xla/csrc/dl_convertor.h"
2 changes: 1 addition & 1 deletion torch_xla/csrc/xla_manual_registration.cpp
@@ -1,7 +1,7 @@
#include <ATen/ATen.h>
#include <torch/library.h>

#include "torch_xla/csrc/aten_cpu_fallback.h"
#include "torch_xla/csrc/aten_fallback.h"
#include "torch_xla/csrc/aten_xla_bridge.h"
#include "torch_xla/csrc/debug_util.h"
#include "torch_xla/csrc/ops/nms.h"