Update on "[2/2] Intel GPU Runtime Upstreaming for Generator"

# Motivation According to [[1/2] Intel GPU Runtime Upstreaming for Generator](#118528), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](#114842), the second PR covers the changes under `python frontend`. # Design Currently, it primarily offers geneartor-related APIs, including - `torch.xpu.default_generators` - `torch.xpu.get_rng_state` - `torch.xpu.get_rng_state_all` - `torch.xpu.initial_seed` - `torch.xpu.manual_seed` - `torch.xpu.manual_seed_all` - `torch.xpu.seed` - `torch.xpu.seed_all` - `torch.xpu.set_rng_state` - `torch.xpu.set_rng_state_all` # Additional Context The differences with CUDA: The generator-related frontend python APIs are 1:1 mapping with CUDA. cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
pytorch · Feb 26, 2024 · 9b8b4db · 9b8b4db
2 parents b42325e + 70a01c3
commit 9b8b4db
Show file tree

Hide file tree

Showing 6,276 changed files with 4,437 additions and 9,275 deletions.
diff --git a/.ci/pytorch/test.sh b/.ci/pytorch/test.sh
@@ -982,7 +982,8 @@ test_bazel() {
 
     tools/bazel test --config=cpu-only --test_timeout=480 --test_output=all --test_tag_filters=-gpu-required --test_filter=-*CUDA :all_tests
   else
-    tools/bazel test --test_output=errors \
+    # Increase the test timeout to 480 like CPU tests because modules_test frequently timeout
+    tools/bazel test --test_timeout=480 --test_output=errors \
       //:any_test \
       //:autograd_test \
       //:dataloader_test \

diff --git a/.github/scripts/s390x-ci/README.md b/.github/scripts/s390x-ci/README.md
@@ -0,0 +1,51 @@
+# Configuring the builder.
+
+## Install prerequisites.
+
+```
+$ sudo dnf install docker
+```
+
+## Add services.
+
+```
+$ sudo cp self-hosted-builder/*.service /etc/systemd/system/
+$ sudo systemctl daemon-reload
+```
+
+## Download qemu-user-static image
+
+```
+# sudo docker pull docker.io/iiilinuxibmcom/qemu-user-static:6.1.0-1
+```
+
+## Autostart the x86_64 emulation support.
+
+```
+$ sudo systemctl enable --now qemu-user-static
+```
+
+## Rebuild the image
+
+In order to build or update the `iiilinuxibmcom/actions-runner` image, e.g. to get the
+latest OS security fixes, use the following commands:
+
+```
+$ cd self-hosted-builder
+$ sudo docker build \
+      --build-arg repo=<owner>/<name> \
+      --build-arg token=<***> \
+      --pull \
+      -f actions-runner.Dockerfile \
+      -t iiilinuxibmcom/actions-runner \
+      .
+```
+
+If it fails, ensure that selinux doesn't prevent it from working.
+In worst case, selinux can be disabled with `setenforce 0`.
+
+## Autostart the runner.
+
+```
+$ sudo systemctl enable --now actions-runner@$NAME
+```
diff --git a/.github/scripts/s390x-ci/self-hosted-builder/actions-runner.Dockerfile b/.github/scripts/s390x-ci/self-hosted-builder/actions-runner.Dockerfile
@@ -0,0 +1,66 @@
+# Self-Hosted IBM Z Github Actions Runner.
+
+# Temporary image: amd64 dependencies.
+FROM docker.io/amd64/ubuntu:22.04 as ld-prefix
+ENV DEBIAN_FRONTEND=noninteractive
+RUN apt-get update && apt-get -y install ca-certificates libicu70 libssl3
+
+# Main image.
+FROM docker.io/s390x/ubuntu:22.04
+
+# Packages for pytorch building and testing.
+ENV DEBIAN_FRONTEND=noninteractive
+RUN apt-get update && apt-get -y install \
+        cmake \
+        curl \
+        gcc \
+        git \
+        jq \
+        libxml2-dev \
+        libxslt-dev \
+        ninja-build \
+        python-is-python3 \
+        python3 \
+        python3-dev \
+        python3-pip \
+        pybind11-dev \
+        python3-numpy \
+        libopenblas-dev \
+        liblapack-dev \
+        libgloo-dev \
+        python3-yaml \
+        python3-scipy \
+        virtualenv
+
+# amd64 dependencies.
+COPY --from=ld-prefix / /usr/x86_64-linux-gnu/
+RUN ln -fs ../lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 /usr/x86_64-linux-gnu/lib64/
+RUN ln -fs /etc/resolv.conf /usr/x86_64-linux-gnu/etc/
+ENV QEMU_LD_PREFIX=/usr/x86_64-linux-gnu
+
+# Scripts.
+COPY fs/ /
+
+RUN chmod +x /usr/bin/actions-runner /usr/bin/entrypoint
+
+# amd64 Github Actions Runner.
+RUN useradd -m actions-runner
+USER actions-runner
+WORKDIR /home/actions-runner
+RUN curl -L https://github.com/actions/runner/releases/download/v2.309.0/actions-runner-linux-x64-2.309.0.tar.gz | tar -xz
+
+# repository
+ARG repo
+
+# repository token
+ARG token
+
+RUN ./config.sh \
+        --unattended \
+        --url "https://github.com/${repo}" \
+        --token "${token}" \
+        --no-default-labels \
+        --labels self-hosted,linux.s390x
+
+ENTRYPOINT ["/usr/bin/entrypoint"]
+CMD ["/usr/bin/actions-runner"]
diff --git a/.github/scripts/s390x-ci/self-hosted-builder/actions-runner@.service b/.github/scripts/s390x-ci/self-hosted-builder/actions-runner@.service
@@ -0,0 +1,22 @@
+[Unit]
+Description=Self-Hosted IBM Z Github Actions Runner
+Wants=qemu-user-static
+After=qemu-user-static
+StartLimitIntervalSec=0
+
+[Service]
+Type=simple
+Restart=always
+ExecStartPre=-/usr/bin/docker rm --force actions-runner.%i
+ExecStart=/usr/bin/docker run \
+              --init \
+              --interactive \
+              --name=actions-runner.%i \
+              --rm \
+              iiilinuxibmcom/actions-runner
+ExecStop=/bin/sh -c "docker exec actions-runner.%i kill -INT -- -1"
+ExecStop=/bin/sh -c "docker wait actions-runner.%i"
+ExecStop=/bin/sh -c "docker rm actions-runner.%i"
+
+[Install]
+WantedBy=multi-user.target
diff --git a/.github/scripts/s390x-ci/self-hosted-builder/fs/usr/bin/actions-runner b/.github/scripts/s390x-ci/self-hosted-builder/fs/usr/bin/actions-runner
@@ -0,0 +1,6 @@
+#!/usr/bin/env bash
+
+set -e -u
+
+# Run one job.
+./run.sh --once
diff --git a/.github/scripts/s390x-ci/self-hosted-builder/fs/usr/bin/entrypoint b/.github/scripts/s390x-ci/self-hosted-builder/fs/usr/bin/entrypoint
@@ -0,0 +1,30 @@
+#!/usr/bin/env bash
+
+#
+# Container entrypoint that waits for all spawned processes.
+#
+
+set -e -u
+
+# Create a FIFO and start reading from its read end.
+tempdir=$(mktemp -d "/tmp/done.XXXXXXXXXX")
+trap 'rm -r "$tempdir"' EXIT
+done="$tempdir/pipe"
+mkfifo "$done"
+cat "$done" & waiter=$!
+
+# Start the workload. Its descendants will inherit the FIFO's write end.
+status=0
+if [ "$#" -eq 0 ]; then
+    bash 9>"$done" || status=$?
+else
+    "$@" 9>"$done" || status=$?
+fi
+
+# When the workload and all of its descendants exit, the FIFO's write end will
+# be closed and `cat "$done"` will exit. Wait until it happens. This is needed
+# in order to handle SelfUpdater, which the workload may start in background
+# before exiting.
+wait "$waiter"
+
+exit "$status"
diff --git a/.github/scripts/s390x-ci/self-hosted-builder/qemu-user-static.service b/.github/scripts/s390x-ci/self-hosted-builder/qemu-user-static.service
@@ -0,0 +1,11 @@
+[Unit]
+Description=Support for transparent execution of non-native binaries with QEMU user emulation
+
+[Service]
+Type=oneshot
+# The source code for iiilinuxibmcom/qemu-user-static is at https://github.com/iii-i/qemu-user-static/tree/v6.1.0-1
+# TODO: replace it with multiarch/qemu-user-static once version >6.1 is available
+ExecStart=/usr/bin/docker run --rm --interactive --privileged docker.io/iiilinuxibmcom/qemu-user-static:6.1.0-1 --reset -p yes
+
+[Install]
+WantedBy=multi-user.target
diff --git a/.lintrunner.toml b/.lintrunner.toml
@@ -191,7 +191,8 @@ include_patterns = [
     'c10/**/*.h',
     # Enable coverage of headers in torch/csrc and excluding sub-directories for now.
     'torch/csrc/*.h',
-    'torch/csrc/autograd/**/*.h',
+    'torch/csrc/*.cpp',
+    'torch/csrc/**/*.h',
     'torch/csrc/**/*.cpp',
 ]
 exclude_patterns = [
@@ -200,6 +201,8 @@ exclude_patterns = [
     # CUDA files are also excluded.
     '**/fb/**',
     '**/*pb.h',
+    'c10/**/cuda/*pp',
+    'aten/**/cuda/*pp',
     '**/cuda/*pp',
     '**/*XPU*',
     '**/xpu/*pp',
@@ -219,8 +222,6 @@ exclude_patterns = [
     'third_party/**/*',
     'torch/csrc/api/**',
     'torch/csrc/autograd/generated/**',
-    'torch/csrc/autograd/profiler_legacy.cpp',
-    'torch/csrc/cuda/**',
     'torch/csrc/dynamo/*',
     'torch/csrc/distributed/**/*',
     'torch/csrc/inductor/**/*',
@@ -2651,7 +2652,7 @@ init_command = [
     'python3',
     'tools/linter/adapters/pip_init.py',
     '--dry-run={{DRYRUN}}',
-    'ruff==0.1.14',
+    'ruff==0.2.2',
 ]
 is_formatter = true
 

diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -43,9 +43,9 @@ set(CMAKE_C_STANDARD   11 CACHE STRING "The C standard whose features are reques
 # ---[ Utils
 include(cmake/public/utils.cmake)
 
-# --- [ Check that minimal gcc version is 9.4+
-if(CMAKE_COMPILER_IS_GNUCXX AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.4)
-  message(FATAL "GCC-9.4 or newer is required to compile PyTorch, but found ${CMAKE_CXX_COMPILER_VERSION}")
+# --- [ Check that minimal gcc version is 9.3+
+if(CMAKE_COMPILER_IS_GNUCXX AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.3)
+  message(FATAL_ERROR "GCC-9.3 or newer is required to compile PyTorch, but found ${CMAKE_CXX_COMPILER_VERSION}")
 endif()
 
 # This define is needed to preserve behavior given anticpated changes to cccl/thrust

diff --git a/aten/src/ATen/LegacyVmapTransforms.cpp b/aten/src/ATen/LegacyVmapTransforms.cpp
@@ -135,9 +135,7 @@ static Tensor alignBatchDimsAtFront(
     const Tensor& self,
     std::bitset<kVmapNumLevels> requested_levels,
     int64_t requested_example_dim) {
-  Tensor physical_tensor;
-  std::bitset<kVmapNumLevels> tensor_levels;
-  std::tie(physical_tensor, tensor_levels) = getPhysicalTensorAndLevels(self);
+  auto [physical_tensor, tensor_levels] = getPhysicalTensorAndLevels(self);
 
   TORCH_INTERNAL_ASSERT(
     (tensor_levels | requested_levels) == requested_levels,
@@ -263,10 +261,7 @@ VmapPhysicalViewVec BroadcastingVmapTransform::logicalToPhysical(TensorList logi
 
   VmapPhysicalViewVec result;
 
-  std::bitset<kVmapNumLevels> levels;
-  // NOLINTNEXTLINE(cppcoreguidelines-init-variables)
-  int64_t largest_logical_dim;
-  std::tie(levels, largest_logical_dim) = getLevelsAndLargestLogicalDim(logical_tensors);
+  auto [levels, largest_logical_dim] = getLevelsAndLargestLogicalDim(logical_tensors);
 
   for (const auto& tensor : logical_tensors) {
     // NB: It's possible that we didn't actually need to align `tensor`.

diff --git a/aten/src/ATen/SparseCsrTensorImpl.cpp b/aten/src/ATen/SparseCsrTensorImpl.cpp
@@ -55,7 +55,8 @@ SparseCsrTensorImpl::SparseCsrTensorImpl(
                   "to https://github.com/pytorch/pytorch/issues.");
 
   TORCH_INTERNAL_ASSERT(((key_set.has(DispatchKey::SparseCsrCPU) && device().type() == kCPU)
-                         || (key_set.has(DispatchKey::SparseCsrCUDA) && device().type() == kCUDA)),
+                         || (key_set.has(DispatchKey::SparseCsrCUDA) && device().type() == kCUDA)
+                         || (key_set.has(DispatchKey::SparseCsrMeta) && device().type() == kMeta)),
                         "Inconsistent key_set (=", key_set, ") and device (=", device(), ")");
 
   set_storage_access_should_throw();
@@ -166,9 +167,7 @@ void SparseCsrTensorImpl::resize_as_sparse_compressed_tensor_(
       src.layout(),
       ")");
 
-  Tensor compressed_indices;
-  Tensor plain_indices;
-  std::tie(compressed_indices, plain_indices) =
+  auto [compressed_indices, plain_indices] =
       sparse_csr::getCompressedPlainIndices(src);
   // reuse self indices storage
   if (crow_indices_.sizes() != compressed_indices.sizes()) {

diff --git a/aten/src/ATen/cpu/vec/functional_base.h b/aten/src/ATen/cpu/vec/functional_base.h
@@ -76,6 +76,34 @@ struct VecReduceAllSIMD<float, Op> {
   }
 };
 #endif // defined(CPU_CAPABILITY_AVX512)
+#if defined(CPU_CAPABILITY_NEON)
+template <typename Op>
+struct VecReduceAllSIMD<float, Op> {
+  static inline float apply(const Op& vec_fun, const Vectorized<float>& acc_vec) {
+    using Vec = Vectorized<float>;
+    Vec v = acc_vec;
+
+    // 128-bit shuffle: [a1, a2, a3, a4, a5, a6, a7, a8] -> [a5, a6, a7, a8, a1, a2, a3, a4]
+    Vec v1 = {v.get_high(), v.get_low()};
+    // [a1+a5, a2+a6, a3+a7, a4+a8, -, -, -, -] ('+' stands for the reduction function. Note that the last 4 elements are not required)
+    v = vec_fun(v, v1);
+
+    // 64-bit shuffle: [a1+a5, a2+a6, a3+a7, a4+a8, -, -, -, -] -> [a3+a7, a4+a8, a1+a5, a2+a6, -, -, -, -]
+    float32x4_t v1_1 = vextq_f32(v.get_low(), v.get_low(), 2);
+    v1 = {v1_1, v1_1};
+    // [a1+a3+a5+a7, a2+a4+a6+a8, a1+a3+a5+a7, a2+a4+a6+a8, -, -, -, -]
+    v = vec_fun(v, v1);
+
+    // 32-bit shuffle: [a1+a3+a5+a7, a2+a4+a6+a8, a1+a3+a5+a7, a2+a4+a6+a8, -, -, -, -] -> [a2+a4+a6+a8, a1+a3+a5+a7, a2+a4+a6+a8, a1+a3+a5+a7, -, -, -, -]
+    v1_1 = vrev64q_f32(v.get_low());
+    v1 = {v1_1, v1_1};
+    // [a1+a2+a3+a4+a5+a6+a7+a8, a1+a2+a3+a4+a5+a6+a7+a8, a1+a2+a3+a4+a5+a6+a7+a8, a1+a2+a3+a4+a5+a6+a7+a8, -, -, -, -]
+    v = vec_fun(v, v1);
+
+    return v.get_low()[0];
+  }
+};
+#endif // defined(CPU_CAPABILITY_NEON)
 #endif // defined(__GNUC__) && (__GNUC__ > 5) && !defined(_MSC_VER) && !defined(C10_MOBILE)
 
 template <typename scalar_t, typename Op>

diff --git a/aten/src/ATen/cudnn/Descriptors.h b/aten/src/ATen/cudnn/Descriptors.h
@@ -318,6 +318,15 @@ struct TORCH_CUDA_CPP_API RNNDescriptor : public Descriptor<
       }
     }
 #else
+    cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties();
+    auto math_type = CUDNN_DEFAULT_MATH;
+    if (prop->major >= 7) {
+      if (input_type == CUDNN_DATA_HALF) {
+        math_type = CUDNN_TENSOR_OP_MATH;
+      } else if (!allow_tf32) {
+        math_type = CUDNN_FMA_MATH;
+      }
+    }
     AT_CUDNN_CHECK(cudnnSetRNNDescriptor_v8(
           mut_desc(),
           algo,
@@ -327,7 +336,7 @@ struct TORCH_CUDA_CPP_API RNNDescriptor : public Descriptor<
           input_mode,
           input_type,
           datatype,
-          allow_tf32 ? CUDNN_DEFAULT_MATH : CUDNN_FMA_MATH,
+          math_type,
           input_size,
           hidden_size,
           proj_size ? proj_size : hidden_size,

diff --git a/aten/src/ATen/native/Bucketization.cpp b/aten/src/ATen/native/Bucketization.cpp
@@ -16,7 +16,7 @@
 
 /* Implement a numpy like searchsorted and a TF like bucketize function running on cpu
  *
- * - torch.searchsorted(sorted_sequence, values, right=False, side='left', out_int32=False, sorter=None)
+ * - torch.searchsorted(sorted_sequence, values, right=False, side=None, out_int32=False, sorter=None)
  *   sorted_sequence - N*D or 1D (apply to all values) tensor containing sorted sequences in last dimension
  *   values          - N*D tensor or a Scalar (when sorted_sequence is 1D) containing the search values
  *   right           - corresponding to lower bound if False and upper bound if True

diff --git a/aten/src/ATen/native/TensorCompare.cpp b/aten/src/ATen/native/TensorCompare.cpp
@@ -13,6 +13,7 @@
 #include <ATen/native/TensorCompare.h>
 #include <ATen/native/TypeProperties.h>
 #include <ATen/TensorSubclassLikeUtils.h>
+#include <iostream>
 #include <c10/util/Exception.h>
 
 #ifndef AT_PER_OPERATOR_HEADERS
@@ -22,6 +23,7 @@
 #include <ATen/ops/_aminmax_native.h>
 #include <ATen/ops/_assert_async_native.h>
 #include <ATen/ops/_functional_assert_async_native.h>
+#include <ATen/ops/_print_native.h>
 #include <ATen/ops/_assert_scalar_native.h>
 #include <ATen/ops/_functional_assert_scalar_native.h>
 #include <ATen/ops/_make_per_tensor_quantized_tensor.h>
@@ -71,6 +73,7 @@
 #include <ATen/ops/where_native.h>
 #include <ATen/ops/zeros_like.h>
 
+#include <iostream>
 #include <utility>
 #endif
 
@@ -440,6 +443,9 @@ Tensor _functional_assert_async_msg_cpu(
   return dep_token.clone();
 }
 
+void _print(c10::string_view s) {
+  std::cout << s << "\n";
+}
 
 // Sorting-based algorithm for isin(); used when the number of test elements is large.
 static void isin_sorting(