Skip to content

Commit

Permalink
Update on "[2/2] Intel GPU Runtime Upstreaming for Generator"
Browse files Browse the repository at this point in the history
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Generator](#118528), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](#114842), the second PR covers the changes under `python frontend`.

# Design
Currently, it primarily offers geneartor-related APIs, including

- `torch.xpu.default_generators`
- `torch.xpu.get_rng_state`
- `torch.xpu.get_rng_state_all`
- `torch.xpu.initial_seed`
- `torch.xpu.manual_seed`
- `torch.xpu.manual_seed_all`
- `torch.xpu.seed`
- `torch.xpu.seed_all`
- `torch.xpu.set_rng_state`
- `torch.xpu.set_rng_state_all`

# Additional Context
The differences with CUDA:
The generator-related frontend python APIs are 1:1 mapping with CUDA.

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
  • Loading branch information
guangyey committed Feb 26, 2024
2 parents b42325e + 70a01c3 commit 9b8b4db
Show file tree
Hide file tree
Showing 6,276 changed files with 4,437 additions and 9,275 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
3 changes: 2 additions & 1 deletion .ci/pytorch/test.sh
Original file line number Diff line number Diff line change
Expand Up @@ -982,7 +982,8 @@ test_bazel() {

tools/bazel test --config=cpu-only --test_timeout=480 --test_output=all --test_tag_filters=-gpu-required --test_filter=-*CUDA :all_tests
else
tools/bazel test --test_output=errors \
# Increase the test timeout to 480 like CPU tests because modules_test frequently timeout
tools/bazel test --test_timeout=480 --test_output=errors \
//:any_test \
//:autograd_test \
//:dataloader_test \
Expand Down
51 changes: 51 additions & 0 deletions .github/scripts/s390x-ci/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# Configuring the builder.

## Install prerequisites.

```
$ sudo dnf install docker
```

## Add services.

```
$ sudo cp self-hosted-builder/*.service /etc/systemd/system/
$ sudo systemctl daemon-reload
```

## Download qemu-user-static image

```
# sudo docker pull docker.io/iiilinuxibmcom/qemu-user-static:6.1.0-1
```

## Autostart the x86_64 emulation support.

```
$ sudo systemctl enable --now qemu-user-static
```

## Rebuild the image

In order to build or update the `iiilinuxibmcom/actions-runner` image, e.g. to get the
latest OS security fixes, use the following commands:

```
$ cd self-hosted-builder
$ sudo docker build \
--build-arg repo=<owner>/<name> \
--build-arg token=<***> \
--pull \
-f actions-runner.Dockerfile \
-t iiilinuxibmcom/actions-runner \
.
```

If it fails, ensure that selinux doesn't prevent it from working.
In worst case, selinux can be disabled with `setenforce 0`.

## Autostart the runner.

```
$ sudo systemctl enable --now actions-runner@$NAME
```
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
# Self-Hosted IBM Z Github Actions Runner.

# Temporary image: amd64 dependencies.
FROM docker.io/amd64/ubuntu:22.04 as ld-prefix
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get -y install ca-certificates libicu70 libssl3

# Main image.
FROM docker.io/s390x/ubuntu:22.04

# Packages for pytorch building and testing.
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get -y install \
cmake \
curl \
gcc \
git \
jq \
libxml2-dev \
libxslt-dev \
ninja-build \
python-is-python3 \
python3 \
python3-dev \
python3-pip \
pybind11-dev \
python3-numpy \
libopenblas-dev \
liblapack-dev \
libgloo-dev \
python3-yaml \
python3-scipy \
virtualenv

# amd64 dependencies.
COPY --from=ld-prefix / /usr/x86_64-linux-gnu/
RUN ln -fs ../lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 /usr/x86_64-linux-gnu/lib64/
RUN ln -fs /etc/resolv.conf /usr/x86_64-linux-gnu/etc/
ENV QEMU_LD_PREFIX=/usr/x86_64-linux-gnu

# Scripts.
COPY fs/ /

RUN chmod +x /usr/bin/actions-runner /usr/bin/entrypoint

# amd64 Github Actions Runner.
RUN useradd -m actions-runner
USER actions-runner
WORKDIR /home/actions-runner
RUN curl -L https://github.com/actions/runner/releases/download/v2.309.0/actions-runner-linux-x64-2.309.0.tar.gz | tar -xz

# repository
ARG repo

# repository token
ARG token

RUN ./config.sh \
--unattended \
--url "https://github.com/${repo}" \
--token "${token}" \
--no-default-labels \
--labels self-hosted,linux.s390x

ENTRYPOINT ["/usr/bin/entrypoint"]
CMD ["/usr/bin/actions-runner"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
[Unit]
Description=Self-Hosted IBM Z Github Actions Runner
Wants=qemu-user-static
After=qemu-user-static
StartLimitIntervalSec=0

[Service]
Type=simple
Restart=always
ExecStartPre=-/usr/bin/docker rm --force actions-runner.%i
ExecStart=/usr/bin/docker run \
--init \
--interactive \
--name=actions-runner.%i \
--rm \
iiilinuxibmcom/actions-runner
ExecStop=/bin/sh -c "docker exec actions-runner.%i kill -INT -- -1"
ExecStop=/bin/sh -c "docker wait actions-runner.%i"
ExecStop=/bin/sh -c "docker rm actions-runner.%i"

[Install]
WantedBy=multi-user.target
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
#!/usr/bin/env bash

set -e -u

# Run one job.
./run.sh --once
30 changes: 30 additions & 0 deletions .github/scripts/s390x-ci/self-hosted-builder/fs/usr/bin/entrypoint
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
#!/usr/bin/env bash

#
# Container entrypoint that waits for all spawned processes.
#

set -e -u

# Create a FIFO and start reading from its read end.
tempdir=$(mktemp -d "/tmp/done.XXXXXXXXXX")
trap 'rm -r "$tempdir"' EXIT
done="$tempdir/pipe"
mkfifo "$done"
cat "$done" & waiter=$!

# Start the workload. Its descendants will inherit the FIFO's write end.
status=0
if [ "$#" -eq 0 ]; then
bash 9>"$done" || status=$?
else
"$@" 9>"$done" || status=$?
fi

# When the workload and all of its descendants exit, the FIFO's write end will
# be closed and `cat "$done"` will exit. Wait until it happens. This is needed
# in order to handle SelfUpdater, which the workload may start in background
# before exiting.
wait "$waiter"

exit "$status"
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
[Unit]
Description=Support for transparent execution of non-native binaries with QEMU user emulation

[Service]
Type=oneshot
# The source code for iiilinuxibmcom/qemu-user-static is at https://github.com/iii-i/qemu-user-static/tree/v6.1.0-1
# TODO: replace it with multiarch/qemu-user-static once version >6.1 is available
ExecStart=/usr/bin/docker run --rm --interactive --privileged docker.io/iiilinuxibmcom/qemu-user-static:6.1.0-1 --reset -p yes

[Install]
WantedBy=multi-user.target
9 changes: 5 additions & 4 deletions .lintrunner.toml
Original file line number Diff line number Diff line change
Expand Up @@ -191,7 +191,8 @@ include_patterns = [
'c10/**/*.h',
# Enable coverage of headers in torch/csrc and excluding sub-directories for now.
'torch/csrc/*.h',
'torch/csrc/autograd/**/*.h',
'torch/csrc/*.cpp',
'torch/csrc/**/*.h',
'torch/csrc/**/*.cpp',
]
exclude_patterns = [
Expand All @@ -200,6 +201,8 @@ exclude_patterns = [
# CUDA files are also excluded.
'**/fb/**',
'**/*pb.h',
'c10/**/cuda/*pp',
'aten/**/cuda/*pp',
'**/cuda/*pp',
'**/*XPU*',
'**/xpu/*pp',
Expand All @@ -219,8 +222,6 @@ exclude_patterns = [
'third_party/**/*',
'torch/csrc/api/**',
'torch/csrc/autograd/generated/**',
'torch/csrc/autograd/profiler_legacy.cpp',
'torch/csrc/cuda/**',
'torch/csrc/dynamo/*',
'torch/csrc/distributed/**/*',
'torch/csrc/inductor/**/*',
Expand Down Expand Up @@ -2651,7 +2652,7 @@ init_command = [
'python3',
'tools/linter/adapters/pip_init.py',
'--dry-run={{DRYRUN}}',
'ruff==0.1.14',
'ruff==0.2.2',
]
is_formatter = true

Expand Down
6 changes: 3 additions & 3 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -43,9 +43,9 @@ set(CMAKE_C_STANDARD 11 CACHE STRING "The C standard whose features are reques
# ---[ Utils
include(cmake/public/utils.cmake)

# --- [ Check that minimal gcc version is 9.4+
if(CMAKE_COMPILER_IS_GNUCXX AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.4)
message(FATAL "GCC-9.4 or newer is required to compile PyTorch, but found ${CMAKE_CXX_COMPILER_VERSION}")
# --- [ Check that minimal gcc version is 9.3+
if(CMAKE_COMPILER_IS_GNUCXX AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.3)
message(FATAL_ERROR "GCC-9.3 or newer is required to compile PyTorch, but found ${CMAKE_CXX_COMPILER_VERSION}")
endif()

# This define is needed to preserve behavior given anticpated changes to cccl/thrust
Expand Down
9 changes: 2 additions & 7 deletions aten/src/ATen/LegacyVmapTransforms.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -135,9 +135,7 @@ static Tensor alignBatchDimsAtFront(
const Tensor& self,
std::bitset<kVmapNumLevels> requested_levels,
int64_t requested_example_dim) {
Tensor physical_tensor;
std::bitset<kVmapNumLevels> tensor_levels;
std::tie(physical_tensor, tensor_levels) = getPhysicalTensorAndLevels(self);
auto [physical_tensor, tensor_levels] = getPhysicalTensorAndLevels(self);

TORCH_INTERNAL_ASSERT(
(tensor_levels | requested_levels) == requested_levels,
Expand Down Expand Up @@ -263,10 +261,7 @@ VmapPhysicalViewVec BroadcastingVmapTransform::logicalToPhysical(TensorList logi

VmapPhysicalViewVec result;

std::bitset<kVmapNumLevels> levels;
// NOLINTNEXTLINE(cppcoreguidelines-init-variables)
int64_t largest_logical_dim;
std::tie(levels, largest_logical_dim) = getLevelsAndLargestLogicalDim(logical_tensors);
auto [levels, largest_logical_dim] = getLevelsAndLargestLogicalDim(logical_tensors);

for (const auto& tensor : logical_tensors) {
// NB: It's possible that we didn't actually need to align `tensor`.
Expand Down
7 changes: 3 additions & 4 deletions aten/src/ATen/SparseCsrTensorImpl.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,8 @@ SparseCsrTensorImpl::SparseCsrTensorImpl(
"to https://github.com/pytorch/pytorch/issues.");

TORCH_INTERNAL_ASSERT(((key_set.has(DispatchKey::SparseCsrCPU) && device().type() == kCPU)
|| (key_set.has(DispatchKey::SparseCsrCUDA) && device().type() == kCUDA)),
|| (key_set.has(DispatchKey::SparseCsrCUDA) && device().type() == kCUDA)
|| (key_set.has(DispatchKey::SparseCsrMeta) && device().type() == kMeta)),
"Inconsistent key_set (=", key_set, ") and device (=", device(), ")");

set_storage_access_should_throw();
Expand Down Expand Up @@ -166,9 +167,7 @@ void SparseCsrTensorImpl::resize_as_sparse_compressed_tensor_(
src.layout(),
")");

Tensor compressed_indices;
Tensor plain_indices;
std::tie(compressed_indices, plain_indices) =
auto [compressed_indices, plain_indices] =
sparse_csr::getCompressedPlainIndices(src);
// reuse self indices storage
if (crow_indices_.sizes() != compressed_indices.sizes()) {
Expand Down
28 changes: 28 additions & 0 deletions aten/src/ATen/cpu/vec/functional_base.h
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,34 @@ struct VecReduceAllSIMD<float, Op> {
}
};
#endif // defined(CPU_CAPABILITY_AVX512)
#if defined(CPU_CAPABILITY_NEON)
template <typename Op>
struct VecReduceAllSIMD<float, Op> {
static inline float apply(const Op& vec_fun, const Vectorized<float>& acc_vec) {
using Vec = Vectorized<float>;
Vec v = acc_vec;

// 128-bit shuffle: [a1, a2, a3, a4, a5, a6, a7, a8] -> [a5, a6, a7, a8, a1, a2, a3, a4]
Vec v1 = {v.get_high(), v.get_low()};
// [a1+a5, a2+a6, a3+a7, a4+a8, -, -, -, -] ('+' stands for the reduction function. Note that the last 4 elements are not required)
v = vec_fun(v, v1);

// 64-bit shuffle: [a1+a5, a2+a6, a3+a7, a4+a8, -, -, -, -] -> [a3+a7, a4+a8, a1+a5, a2+a6, -, -, -, -]
float32x4_t v1_1 = vextq_f32(v.get_low(), v.get_low(), 2);
v1 = {v1_1, v1_1};
// [a1+a3+a5+a7, a2+a4+a6+a8, a1+a3+a5+a7, a2+a4+a6+a8, -, -, -, -]
v = vec_fun(v, v1);

// 32-bit shuffle: [a1+a3+a5+a7, a2+a4+a6+a8, a1+a3+a5+a7, a2+a4+a6+a8, -, -, -, -] -> [a2+a4+a6+a8, a1+a3+a5+a7, a2+a4+a6+a8, a1+a3+a5+a7, -, -, -, -]
v1_1 = vrev64q_f32(v.get_low());
v1 = {v1_1, v1_1};
// [a1+a2+a3+a4+a5+a6+a7+a8, a1+a2+a3+a4+a5+a6+a7+a8, a1+a2+a3+a4+a5+a6+a7+a8, a1+a2+a3+a4+a5+a6+a7+a8, -, -, -, -]
v = vec_fun(v, v1);

return v.get_low()[0];
}
};
#endif // defined(CPU_CAPABILITY_NEON)
#endif // defined(__GNUC__) && (__GNUC__ > 5) && !defined(_MSC_VER) && !defined(C10_MOBILE)

template <typename scalar_t, typename Op>
Expand Down
11 changes: 10 additions & 1 deletion aten/src/ATen/cudnn/Descriptors.h
Original file line number Diff line number Diff line change
Expand Up @@ -318,6 +318,15 @@ struct TORCH_CUDA_CPP_API RNNDescriptor : public Descriptor<
}
}
#else
cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties();
auto math_type = CUDNN_DEFAULT_MATH;
if (prop->major >= 7) {
if (input_type == CUDNN_DATA_HALF) {
math_type = CUDNN_TENSOR_OP_MATH;
} else if (!allow_tf32) {
math_type = CUDNN_FMA_MATH;
}
}
AT_CUDNN_CHECK(cudnnSetRNNDescriptor_v8(
mut_desc(),
algo,
Expand All @@ -327,7 +336,7 @@ struct TORCH_CUDA_CPP_API RNNDescriptor : public Descriptor<
input_mode,
input_type,
datatype,
allow_tf32 ? CUDNN_DEFAULT_MATH : CUDNN_FMA_MATH,
math_type,
input_size,
hidden_size,
proj_size ? proj_size : hidden_size,
Expand Down
2 changes: 1 addition & 1 deletion aten/src/ATen/native/Bucketization.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@

/* Implement a numpy like searchsorted and a TF like bucketize function running on cpu
*
* - torch.searchsorted(sorted_sequence, values, right=False, side='left', out_int32=False, sorter=None)
* - torch.searchsorted(sorted_sequence, values, right=False, side=None, out_int32=False, sorter=None)
* sorted_sequence - N*D or 1D (apply to all values) tensor containing sorted sequences in last dimension
* values - N*D tensor or a Scalar (when sorted_sequence is 1D) containing the search values
* right - corresponding to lower bound if False and upper bound if True
Expand Down
6 changes: 6 additions & 0 deletions aten/src/ATen/native/TensorCompare.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
#include <ATen/native/TensorCompare.h>
#include <ATen/native/TypeProperties.h>
#include <ATen/TensorSubclassLikeUtils.h>
#include <iostream>
#include <c10/util/Exception.h>

#ifndef AT_PER_OPERATOR_HEADERS
Expand All @@ -22,6 +23,7 @@
#include <ATen/ops/_aminmax_native.h>
#include <ATen/ops/_assert_async_native.h>
#include <ATen/ops/_functional_assert_async_native.h>
#include <ATen/ops/_print_native.h>
#include <ATen/ops/_assert_scalar_native.h>
#include <ATen/ops/_functional_assert_scalar_native.h>
#include <ATen/ops/_make_per_tensor_quantized_tensor.h>
Expand Down Expand Up @@ -71,6 +73,7 @@
#include <ATen/ops/where_native.h>
#include <ATen/ops/zeros_like.h>

#include <iostream>
#include <utility>
#endif

Expand Down Expand Up @@ -440,6 +443,9 @@ Tensor _functional_assert_async_msg_cpu(
return dep_token.clone();
}

void _print(c10::string_view s) {
std::cout << s << "\n";
}

// Sorting-based algorithm for isin(); used when the number of test elements is large.
static void isin_sorting(
Expand Down

0 comments on commit 9b8b4db

Please sign in to comment.