
[1/2] Intel GPU Runtime Upstreaming for Generator #118528

Closed

wants to merge 65 commits
Changes from 51 commits
Commits (65)
8115ce3  [1/2] Intel GPU Runtime Upstreaming for Generator (guangyey, Jan 29, 2024)
1e7a78a  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 29, 2024)
b79871d  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 30, 2024)
cfa8d20  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 30, 2024)
6e04041  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 30, 2024)
f326cac  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 30, 2024)
015afbc  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 30, 2024)
b124234  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 30, 2024)
b08c032  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 30, 2024)
a86685f  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 30, 2024)
3436cab  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 30, 2024)
a9eef30  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 30, 2024)
97d46c1  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 30, 2024)
9cfb90f  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 31, 2024)
6a980a6  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 31, 2024)
01195cc  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 31, 2024)
dde8977  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 31, 2024)
ffcc027  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 31, 2024)
9cf585d  Update on "[WIP] [1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 31, 2024)
3fd1741  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 31, 2024)
4d3e231  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Jan 31, 2024)
1c22b4b  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 1, 2024)
cbf0fd0  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 1, 2024)
eb7ce1b  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 1, 2024)
adc16bd  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 1, 2024)
9244bd9  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 1, 2024)
92c7960  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 1, 2024)
1a22017  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 1, 2024)
394810a  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 1, 2024)
8425282  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 1, 2024)
69be3b0  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 2, 2024)
e9918aa  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 2, 2024)
7d166f9  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 2, 2024)
3ec9249  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 2, 2024)
164479b  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 4, 2024)
94d76f2  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 6, 2024)
a718e2a  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 8, 2024)
0d01458  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 8, 2024)
2dde0b8  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 8, 2024)
1f4b17a  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 8, 2024)
4c96b2b  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 8, 2024)
ef37ac5  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 8, 2024)
3c0b395  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 8, 2024)
2e463d3  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 8, 2024)
0e2d656  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 9, 2024)
a1cf537  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 9, 2024)
b100940  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 10, 2024)
feb8c86  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 10, 2024)
10cc032  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 10, 2024)
b7b1a81  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 13, 2024)
f12e8ad  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 13, 2024)
c847086  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 16, 2024)
b5ddd52  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 21, 2024)
b6d21d9  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 21, 2024)
859f118  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 21, 2024)
e9cef1a  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 21, 2024)
c015f4a  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 21, 2024)
ae25166  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 21, 2024)
89790a2  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 22, 2024)
954b19c  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 22, 2024)
9cf7d35  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 22, 2024)
2a20a29  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 22, 2024)
75f6754  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 22, 2024)
b1c711b  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 22, 2024)
03217cd  Update on "[1/2] Intel GPU Runtime Upstreaming for Generator" (guangyey, Feb 26, 2024)
aten/src/ATen/Context.h (13 additions, 0 deletions)

@@ -489,6 +489,19 @@ static inline void manual_seed(uint64_t seed) {
}
}

const auto xpu_num_gpus = detail::getXPUHooks().getNumGPUs();
if (hasXPU() && xpu_num_gpus) {
for (const auto i : c10::irange(xpu_num_gpus)) {
auto xpu_gen = globalContext().defaultGenerator(
Device(at::kXPU, static_cast<c10::DeviceIndex>(i)));
{
// See Note [Acquire lock when using random generators]
std::lock_guard<std::mutex> lock(xpu_gen.mutex());
xpu_gen.set_current_seed(seed);
}
}
}

if (hasMPS()) {
auto mps_gen = globalContext().defaultGenerator(c10::DeviceType::MPS);
// See Note [Acquire lock when using random generators]
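As a hedged illustration of what this hunk enables (a sketch, not part of the diff; assumes a PyTorch build with XPU support): after at::manual_seed, the default generator of every XPU device should report the same seed.

// Minimal sketch; per the surrounding code in Context.h, at::manual_seed
// also seeds the CPU, CUDA, and MPS default generators.
#include <ATen/ATen.h>
#include <ATen/xpu/XPUContext.h>
#include <ATen/xpu/XPUGeneratorImpl.h>

int main() {
  at::manual_seed(42);
  if (at::xpu::is_available()) {
    auto gen = at::xpu::detail::getDefaultXPUGenerator(0);
    TORCH_CHECK(gen.current_seed() == 42, "expected seed 42");
  }
  return 0;
}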
aten/src/ATen/test/CMakeLists.txt (1 addition, 0 deletions)

@@ -119,6 +119,7 @@ endif()
list(APPEND ATen_XPU_TEST_SRCS
${CMAKE_CURRENT_SOURCE_DIR}/xpu_device_test.cpp
${CMAKE_CURRENT_SOURCE_DIR}/xpu_event_test.cpp
${CMAKE_CURRENT_SOURCE_DIR}/xpu_generator_test.cpp
)

# Caffe2 specific tests
aten/src/ATen/test/xpu_generator_test.cpp (new file, 82 additions)

@@ -0,0 +1,82 @@
#include <gtest/gtest.h>

#include <ATen/ATen.h>
#include <ATen/xpu/XPUContext.h>
#include <ATen/xpu/XPUGeneratorImpl.h>
#include <ATen/core/PhiloxRNGEngine.h>

#include <assert.h>
#include <thread>

TEST(XpuGeneratorTest, testGeneratorDynamicCast) {
if (!at::xpu::is_available()) {
return;
}
auto foo = at::xpu::detail::createXPUGenerator();
auto result = foo.get<at::XPUGeneratorImpl>();
EXPECT_EQ(typeid(at::XPUGeneratorImpl*).hash_code(), typeid(result).hash_code());
}

TEST(XpuGeneratorTest, testDefaultGenerator) {
if (!at::xpu::is_available()) {
return;
}
auto foo = at::xpu::detail::getDefaultXPUGenerator();
auto bar = at::xpu::detail::getDefaultXPUGenerator();
EXPECT_EQ(foo, bar);

auto offset = foo.get_offset() << 1;
foo.set_offset(offset);
EXPECT_EQ(foo.get_offset(), offset);

if (c10::xpu::device_count() >= 2) {
foo = at::xpu::detail::getDefaultXPUGenerator(0);
bar = at::xpu::detail::getDefaultXPUGenerator(0);
EXPECT_EQ(foo, bar);

foo = at::xpu::detail::getDefaultXPUGenerator(0);
bar = at::xpu::detail::getDefaultXPUGenerator(1);
EXPECT_NE(foo, bar);
}
}

TEST(XpuGeneratorTest, testCloning) {
if (!at::xpu::is_available()) {
return;
}
auto gen1 = at::xpu::detail::createXPUGenerator();
gen1.set_current_seed(123); // modify gen1 state
auto xpu_gen1 = at::check_generator<at::XPUGeneratorImpl>(gen1);
xpu_gen1->set_philox_offset_per_thread(4);
auto gen2 = at::xpu::detail::createXPUGenerator();
gen2 = gen1.clone();
auto xpu_gen2 = at::check_generator<at::XPUGeneratorImpl>(gen2);
EXPECT_EQ(gen1.current_seed(), gen2.current_seed());
EXPECT_EQ(
xpu_gen1->philox_offset_per_thread(),
xpu_gen2->philox_offset_per_thread()
);
}

void thread_func_get_set_current_seed(at::Generator generator) {
std::lock_guard<std::mutex> lock(generator.mutex());
auto current_seed = generator.current_seed();
current_seed++;
generator.set_current_seed(current_seed);
}

TEST(XpuGeneratorTest, testMultithreadingGetSetCurrentSeed) {
// See Note [Acquire lock when using random generators]
if (!at::xpu::is_available()) {
return;
}
auto gen1 = at::xpu::detail::getDefaultXPUGenerator();
auto initial_seed = gen1.current_seed();
std::thread t0{thread_func_get_set_current_seed, gen1};
std::thread t1{thread_func_get_set_current_seed, gen1};
std::thread t2{thread_func_get_set_current_seed, gen1};
t0.join();
t1.join();
t2.join();
EXPECT_EQ(gen1.current_seed(), initial_seed+3);
}
aten/src/ATen/xpu/XPUGeneratorImpl.cpp (new file, 176 additions)

@@ -0,0 +1,176 @@
#include <ATen/Utils.h>
#include <ATen/xpu/XPUGeneratorImpl.h>
#include <c10/core/StreamGuard.h>
#include <c10/util/CallOnce.h>
#include <c10/xpu/XPUFunctions.h>

namespace at {
namespace xpu::detail {
namespace {

/*
 * Currently, there is one generator pool containing one XPU generator per
 * device. Each generator is lazily initialized the first time a generator
 * is requested for its device.
 */
c10::once_flag init_flag;
DeviceIndex num_gpus = -1;
std::deque<c10::once_flag> xpu_gens_init_flag;
[Review thread anchored here]
albanD (Collaborator): Why not a vector like below?
guangyey (Author): Changed to std::vector<c10::once_flag> xpu_gens_init_flag; here std::vector is more efficient than std::deque.
guangyey (Author, Feb 26, 2024): Sorry @albanD, it looks like std::vector<c10::once_flag> doesn't support resize, because once_flag lacks a copy constructor: once_flag(const once_flag&) = delete;. I also tried it on godbolt, so I changed back to std::deque. I think I missed this error before because I didn't save my code change when I rebuilt on my local machine. I'm very sorry about this.
guangyey (Author): @albanD Could you help review again?
albanD (Collaborator): Ho interesting. Sounds good!
std::vector<Generator> default_gens_xpu;

void initXPUGenVector() {
num_gpus = device_count();
xpu_gens_init_flag.resize(num_gpus);
default_gens_xpu.resize(num_gpus);
}

inline void check_device(DeviceIndex device) {
TORCH_CHECK(
device >= 0 && device < num_gpus,
"device is out of range, device is ",
static_cast<int16_t>(device),
", total number of device is ",
static_cast<int16_t>(num_gpus),
".");
}

} // anonymous namespace

const Generator& getDefaultXPUGenerator(DeviceIndex device) {
c10::call_once(init_flag, initXPUGenVector);
if (device == -1) {
device = c10::xpu::current_device();
}
check_device(device);
c10::call_once(xpu_gens_init_flag[device], [&]() {
default_gens_xpu[device] = make_generator<XPUGeneratorImpl>(device);
default_gens_xpu[device].seed();
});
return default_gens_xpu[device];
}

Generator createXPUGenerator(DeviceIndex device) {
c10::call_once(init_flag, initXPUGenVector);
if (device == -1) {
device = c10::xpu::current_device();
}
check_device(device);
auto gen = make_generator<XPUGeneratorImpl>(device);
auto xpu_gen = check_generator<XPUGeneratorImpl>(gen);
xpu_gen->set_current_seed(default_rng_seed_val);
xpu_gen->set_philox_offset_per_thread(0);
return gen;
}

} // namespace xpu::detail

XPUGeneratorImpl::XPUGeneratorImpl(DeviceIndex device_index)
: GeneratorImpl{
Device(DeviceType::XPU, device_index),
DispatchKeySet(c10::DispatchKey::XPU)} {}

void XPUGeneratorImpl::set_current_seed(uint64_t seed) {
seed_ = seed;
philox_offset_per_thread_ = 0;
}

void XPUGeneratorImpl::set_offset(uint64_t offset) {
set_philox_offset_per_thread(offset);
}

uint64_t XPUGeneratorImpl::get_offset() const {
return philox_offset_per_thread_;
}

uint64_t XPUGeneratorImpl::current_seed() const {
return seed_;
}

uint64_t XPUGeneratorImpl::seed() {
auto random = c10::detail::getNonDeterministicRandom(true);
this->set_current_seed(random);
return random;
}

c10::intrusive_ptr<c10::TensorImpl> XPUGeneratorImpl::get_state() const {
// The RNG state comprises the seed, and an offset used for Philox.
static const size_t seed_size = sizeof(uint64_t);
static const size_t offset_size = sizeof(uint64_t);
static const size_t total_size = seed_size + offset_size;

auto state_tensor = at::detail::empty_cpu(
{(int64_t)total_size},
ScalarType::Byte,
c10::nullopt,
c10::nullopt,
c10::nullopt,
c10::nullopt);
auto rng_state = state_tensor.data_ptr<uint8_t>();
auto current_seed = this->current_seed();
auto offset = this->philox_offset_per_thread();
memcpy(rng_state, &current_seed, seed_size);
memcpy(rng_state + seed_size, &offset, offset_size);

return state_tensor.getIntrusivePtr();
}

void XPUGeneratorImpl::set_state(const c10::TensorImpl& new_state) {
static const size_t seed_size = sizeof(uint64_t);
static const size_t offset_size = sizeof(uint64_t);
static const size_t total_size = seed_size + offset_size;

at::detail::check_rng_state(new_state);

bool no_philox_seed = false;
auto new_state_size = new_state.numel();
if (new_state_size == total_size - offset_size) {
no_philox_seed = true;
} else {
TORCH_CHECK(new_state_size == total_size, "RNG state is wrong size");
}

uint64_t input_seed;
auto new_rng_state = new_state.data_dtype_initialized<uint8_t>();
memcpy(&input_seed, new_rng_state, seed_size);
this->set_current_seed(input_seed);
uint64_t philox_offset = 0;
if (!no_philox_seed) {
memcpy(&philox_offset, new_rng_state + seed_size, offset_size);
}
this->set_philox_offset_per_thread(philox_offset);
}
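A hedged standalone sketch (not part of the diff) of the 16-byte state layout that get_state() and set_state() agree on: the seed followed by the Philox offset, each a uint64_t copied through a raw byte buffer.

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

int main() {
  constexpr std::size_t kSeedSize = sizeof(uint64_t);
  constexpr std::size_t kOffsetSize = sizeof(uint64_t);
  uint8_t state[kSeedSize + kOffsetSize];

  // Pack, as in get_state(): seed first, Philox offset second.
  uint64_t seed = 123, offset = 4;
  std::memcpy(state, &seed, kSeedSize);
  std::memcpy(state + kSeedSize, &offset, kOffsetSize);

  // Unpack, as in set_state().
  uint64_t seed_out = 0, offset_out = 0;
  std::memcpy(&seed_out, state, kSeedSize);
  std::memcpy(&offset_out, state + kSeedSize, kOffsetSize);
  assert(seed_out == 123 && offset_out == 4);
  return 0;
}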

void XPUGeneratorImpl::set_philox_offset_per_thread(uint64_t offset) {
TORCH_CHECK(offset % 4 == 0, "offset must be a multiple of 4");
philox_offset_per_thread_ = offset;
}

uint64_t XPUGeneratorImpl::philox_offset_per_thread() const {
return philox_offset_per_thread_;
}

std::pair<uint64_t, uint64_t> XPUGeneratorImpl::philox_engine_inputs(
uint64_t increment) {
increment = ((increment + 3) / 4) * 4;
TORCH_INTERNAL_ASSERT(this->philox_offset_per_thread_ % 4 == 0);
uint64_t offset = this->philox_offset_per_thread_;
this->philox_offset_per_thread_ += increment;
return std::make_pair(this->seed_, offset);
}
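The rounding in philox_engine_inputs deserves a worked example. Below is a minimal sketch (not part of the diff; OffsetSketch is a hypothetical stand-in): the requested increment rounds up to a multiple of 4 since each Philox4x32 counter tick yields four 32-bit values, the caller receives the pre-advance offset, and the stored offset moves past the reservation.

#include <cassert>
#include <cstdint>
#include <utility>

struct OffsetSketch {
  uint64_t seed = 0;
  uint64_t offset = 0;
  std::pair<uint64_t, uint64_t> engine_inputs(uint64_t increment) {
    increment = ((increment + 3) / 4) * 4;  // round up to a multiple of 4
    uint64_t current = offset;              // handed to the kernel launch
    offset += increment;                    // reserve the range
    return {seed, current};
  }
};

int main() {
  OffsetSketch g{42, 0};
  assert(g.engine_inputs(10).second == 0);  // 10 rounds up to 12
  assert(g.engine_inputs(1).second == 12);  // next launch starts at 12
  assert(g.offset == 16);                   // 1 rounds up to 4
  return 0;
}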

DeviceType XPUGeneratorImpl::device_type() {
return DeviceType::XPU;
}

std::shared_ptr<XPUGeneratorImpl> XPUGeneratorImpl::clone() const {
return std::shared_ptr<XPUGeneratorImpl>(this->clone_impl());
}

XPUGeneratorImpl* XPUGeneratorImpl::clone_impl() const {
auto gen = new XPUGeneratorImpl(this->device().index());
gen->set_current_seed(this->seed_);
gen->set_philox_offset_per_thread(this->philox_offset_per_thread_);
return gen;
}

} // namespace at
aten/src/ATen/xpu/XPUGeneratorImpl.h (new file, 39 additions)

@@ -0,0 +1,39 @@
#pragma once

#include <ATen/core/Generator.h>

namespace at {

struct TORCH_API XPUGeneratorImpl : public GeneratorImpl {
// Constructors
XPUGeneratorImpl(DeviceIndex device_index = -1);
~XPUGeneratorImpl() override = default;

// XPUGeneratorImpl methods
std::shared_ptr<XPUGeneratorImpl> clone() const;
void set_current_seed(uint64_t seed) override;
void set_offset(uint64_t offset) override;
uint64_t get_offset() const override;
uint64_t current_seed() const override;
uint64_t seed() override;
void set_state(const c10::TensorImpl& new_state) override;
c10::intrusive_ptr<c10::TensorImpl> get_state() const override;
void set_philox_offset_per_thread(uint64_t offset);
uint64_t philox_offset_per_thread() const;
std::pair<uint64_t, uint64_t> philox_engine_inputs(uint64_t increment);
static c10::DeviceType device_type();

private:
XPUGeneratorImpl* clone_impl() const override;
uint64_t seed_ = default_rng_seed_val;
uint64_t philox_offset_per_thread_ = 0;
};

namespace xpu::detail {

TORCH_XPU_API const Generator& getDefaultXPUGenerator(DeviceIndex device = -1);

TORCH_XPU_API Generator createXPUGenerator(DeviceIndex device = -1);

} // namespace xpu::detail
} // namespace at
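A hedged usage sketch of the API declared above (not part of the diff): getDefaultXPUGenerator returns a shared per-device default generator, while createXPUGenerator returns a fresh one seeded with default_rng_seed_val.

#include <ATen/xpu/XPUGeneratorImpl.h>
#include <mutex>

void sketch() {
  // Shared default generator for device 0; repeated calls return the same object.
  const at::Generator& def = at::xpu::detail::getDefaultXPUGenerator(0);

  // Private generator, independent of the default one.
  at::Generator priv = at::xpu::detail::createXPUGenerator(0);
  {
    // See Note [Acquire lock when using random generators]
    std::lock_guard<std::mutex> lock(priv.mutex());
    priv.set_current_seed(2024);
  }
  (void)def;
}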
aten/src/ATen/xpu/detail/XPUHooks.cpp (6 additions, 0 deletions)

@@ -1,5 +1,6 @@
#include <ATen/xpu/XPUContext.h>
#include <ATen/xpu/XPUDevice.h>
#include <ATen/xpu/XPUGeneratorImpl.h>
#include <ATen/xpu/detail/XPUHooks.h>
#include <c10/util/CallOnce.h>
#include <c10/util/Logging.h>
@@ -26,6 +27,11 @@ int XPUHooks::getGlobalIdxFromDevice(const at::Device& device) const {
return at::xpu::getGlobalIdxFromDevice(device.index());
}

const Generator& XPUHooks::getDefaultXPUGenerator(
DeviceIndex device_index) const {
return at::xpu::detail::getDefaultXPUGenerator(device_index);
}

Device XPUHooks::getDeviceFromPtr(void* data) const {
return at::xpu::getDeviceFromPtr(data);
}
aten/src/ATen/xpu/detail/XPUHooks.h (2 additions, 0 deletions)

@@ -14,6 +14,8 @@ struct XPUHooks : public at::XPUHooksInterface {
bool hasXPU() const override;
std::string showConfig() const override;
int getGlobalIdxFromDevice(const at::Device& device) const override;
const Generator& getDefaultXPUGenerator(
DeviceIndex device_index = -1) const override;
Device getDeviceFromPtr(void* data) const override;
int getNumGPUs() const override;
void deviceSynchronize(DeviceIndex device_index) const override;