
Add support for AMDGPU #6434

Open · galeselee opened this issue Oct 26, 2022 · 13 comments

Labels: amdgpu (AMDGPU backend), discussion (Welcome discussion!), feature request (Suggest an idea on this project)

Comments

galeselee (Contributor) commented Oct 26, 2022

Purpose
I find that Taichi users have a demand for an AMDGPU backend (related issue: #4586).
There is also a discussion of the AMD GPU backend in #412.
Thus, I would like to add an AMDGPU backend to the compiler so that my 6900XT can be utilized.

Solution
The goal is to give Taichi preliminary support for AMDGPU via LLVM-14 and ROCm.
I plan to complete the following steps one by one:

  1. Add the relevant dependencies in cmake.
  2. Build a specific version of the Taichi runtime module (taichi/runtime/llvm/runtime_module) for AMDGPU. During the implementation, I found that I couldn't reuse the CPU's LLVM IR directly the way Taichi does for CUDA: I needed to do some addrspace conversions, so I temporarily used hipcc to handle these issues. (The final approach taken is to hack the address space, so we don't need an extra runtime.cpp.)
  3. Add interaction between Taichi and the ROCm library functions. The CUDA backend uses some driver functions; it's convenient to replace these with functions provided by ROCm, such as hipMemcpy and hipModuleLaunchKernel (see the sketch after this list).
  4. Add a codegen skeleton (like taichi/codegen/cuda).
  5. Convert LLVM IR to AMDGPU ISA / hsaco (ref link).
  6. Add unit test configuration for AMDGPU.
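For concreteness, here is a hedged sketch (not Taichi's actual code) of how the CUDA driver calls from step 3 map onto the HIP module API; the file name `kernel.hsaco`, the kernel name `my_kernel`, and the launch dimensions are placeholders, and error checking is omitted.

```
#include <hip/hip_runtime.h>
#include <cstddef>

// Hypothetical helper: copy input to the device, load a compiled hsaco
// module, and launch one kernel from it via the HIP module API.
void launch_from_hsaco(void *device_dst, const void *host_src, size_t bytes) {
  // cuMemcpyHtoD -> hipMemcpy
  hipMemcpy(device_dst, host_src, bytes, hipMemcpyHostToDevice);

  // cuModuleLoad / cuModuleGetFunction -> hipModuleLoad / hipModuleGetFunction
  hipModule_t module;
  hipModuleLoad(&module, "kernel.hsaco");
  hipFunction_t func;
  hipModuleGetFunction(&func, module, "my_kernel");

  // cuLaunchKernel -> hipModuleLaunchKernel
  void *args[] = {&device_dst};
  hipModuleLaunchKernel(func,
                        /*gridDim=*/128, 1, 1,
                        /*blockDim=*/256, 1, 1,
                        /*sharedMemBytes=*/0,
                        /*stream=*/nullptr, args, nullptr);
}
```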
galeselee added the feature request, discussion, and amdgpu labels on Oct 26, 2022
galeselee (Contributor, Author) commented:

It seems better to add cmake at the end (after step 5).

galeselee added a commit that referenced this issue Oct 28, 2022
Issue: #6434

### Brief Summary
Add a PR tag for amdgpu.
ITCJ commented Nov 16, 2022

Amazing work.

I'm also considering this task, since our new HPC is using the Sugon gtx106 (known as Vega 20). I am quite interested; is there any possibility of cooperating?

Besides, are you also working on the ASC 2022 competition? I noticed that you forked DeePMD-kit, which is one of the ASC 2022 tasks.

galeselee (Contributor, Author) commented Nov 16, 2022

> I'm also considering this task, since our new HPC is using the Sugon gtx106 (known as Vega 20). I am quite interested; is there any possibility of cooperating?

@ITCJ Certainly! I have now implemented a prototype of the AMDGPU backend on an RDNA2-arch card (don't worry, there is still plenty of work to do). If you are interested, here is the code.
By the way, I don't know anything about the Sugon gtx106 :-). To my knowledge, GTX is a product line from NVIDIA. Do you mean a DCU?

ITCJ commented Nov 16, 2022

> @ITCJ Certainly! I have now implemented a prototype of the AMDGPU backend on an RDNA2-arch card (don't worry, there is still plenty of work to do). If you are interested, here is the code. By the way, I don't know anything about the Sugon gtx106 :-). To my knowledge, GTX is a product line from NVIDIA. Do you mean a DCU?

@galeselee Yeah, it's a Sugon DCU. I rechecked rocminfo and found it is gfx906. I was confused by that name for a while but never checked rocminfo again, ORZ.

  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906:sramecc-:xnack-

Also, would you mind if I add you on WeChat? I have sent my WeChat ID and QR code to your edu email.

feisuzhu pushed a commit that referenced this issue Nov 24, 2022
Issue: #6434

### Brief Summary
This Dockerfile is derived from Dockerfile.ubuntu.20.04. The differences are
as follows:
1. a different base docker image
2. python3-pip and mesa-common-dev are dropped, since they are already
installed in the base image
3. the vulkan module is removed (the amdgpu backend is not yet adapted to vulkan)
4. the user is added to the video group (some Ubuntu-20.04 setups also
require the render group)
galeselee added a commit that referenced this issue Dec 6, 2022
Issue: #6434

### Brief Summary
Add the docker logic for AMDGPU CI:
1. The LLVM used, which only contains the AMDGPU and X86 targets, comes from
the docker image.
2. Using an AMDGPU inside docker requires that `/dev/kfd` and the directory
`/dev/dri` be mounted. The `dev` user has no permission to access these
character devices by default.
3. `TI_WITH_CUDA=OFF`
4. `TI_RUN_RELEASE_TESTS=OFF`
5. Currently only CPU-related tests are run.
galeselee added a commit that referenced this issue Dec 30, 2022
Issue: #6434

### Brief Summary
1. This is a special part of the Taichi runtime module for the `AMDGPU`
backend. The `LLVM IR` that Taichi's runtime module generates with `clang++`
differs in its memory allocation from the CPU-generated `LLVM IR`.
The following is an example.
```
C/C++ code
void func(int *a, int *b) {
    *a = *b;
}
x86_64 backend LLVM IR
define dso_local void @cpu_func(i32* %0, i32* %1) #2 {
  %3 = alloca i32*, align 8
  %4 = alloca i32*, align 8
  store i32* %0, i32** %3, align 8
  store i32* %1, i32** %4, align 8
  %5 = load i32*, i32** %4, align 8
  %6 = load i32, i32* %5, align 4
  %7 = load i32*, i32** %3, align 8
  store i32 %6, i32* %7, align 4
  ret void
}
__global__ function on AMDGPU
define protected amdgpu_kernel void @global_func(i32 addrspace(1)* %0, i32 addrspace(1)* %1) #4 {
  %3 = alloca i32*, align 8, addrspace(5)
  %4 = alloca i32*, align 8, addrspace(5)
  %5 = alloca i32*, align 8, addrspace(5)
  %6 = alloca i32*, align 8, addrspace(5)
  %7 = addrspacecast i32* addrspace(5)* %3 to i32**
  %8 = addrspacecast i32* addrspace(5)* %4 to i32**
  %9 = addrspacecast i32* addrspace(5)* %5 to i32**
  %10 = addrspacecast i32* addrspace(5)* %6 to i32**
  %11 = addrspacecast i32 addrspace(1)* %0 to i32*
  store i32* %11, i32** %7, align 8
  %12 = load i32*, i32** %7, align 8
  %13 = addrspacecast i32 addrspace(1)* %1 to i32*
  store i32* %13, i32** %8, align 8
  %14 = load i32*, i32** %8, align 8
  store i32* %12, i32** %9, align 8
  store i32* %14, i32** %10, align 8
  %15 = load i32*, i32** %10, align 8
  %16 = load i32, i32* %15, align 4
  %17 = load i32*, i32** %9, align 8
  store i32 %16, i32* %17, align 4
  ret void
}
__device__ function on AMDGPU
define hidden void @device_func(i32* %0, i32* %1) #2 {
  %3 = alloca i32*, align 8, addrspace(5)
  %4 = alloca i32*, align 8, addrspace(5)
  %5 = addrspacecast i32* addrspace(5)* %3 to i32**
  %6 = addrspacecast i32* addrspace(5)* %4 to i32**
  store i32* %0, i32** %5, align 8
  store i32* %1, i32** %6, align 8
  %7 = load i32*, i32** %6, align 8
  %8 = load i32, i32* %7, align 4
  %9 = load i32*, i32** %5, align 8
  store i32 %8, i32* %9, align 4
  ret void
}
```
2. There are some differences around `alloca` instructions, specifically
their address spaces (for `AMDGPU`,
[this](https://llvm.org/docs/AMDGPUUsage.html#address-spaces) is
helpful). I have not found documentation describing how to write
correct `LLVM IR` for `AMDGPU`, so I relied on observing the `LLVM IR`
generated by `clang++/hipcc`. We need to handle both the arguments of a
`__global__` function and its `alloca` instructions (specifying the
addrspace of each `alloca` and performing the addrspace casts), while for a
`__device__` function we do not need to touch the function arguments
(see the sketch below).

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
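As a rough illustration of the addrspace handling described in point 2, here is a minimal sketch using the LLVM 14-era C++ `IRBuilder` API; it is not Taichi's actual pass, and the helper name is hypothetical.

```
#include "llvm/IR/IRBuilder.h"

// Illustrative only: create an alloca in AMDGPU's private address space (5)
// and cast the result back to a generic pointer, mirroring the
// `alloca ..., addrspace(5)` + `addrspacecast` pairs in the IR above.
llvm::Value *make_private_alloca(llvm::IRBuilder<> &builder,
                                 llvm::Type *elem_ty) {
  llvm::AllocaInst *slot =
      builder.CreateAlloca(elem_ty, /*AddrSpace=*/5, nullptr, "slot");
  return builder.CreateAddrSpaceCast(
      slot, llvm::PointerType::get(elem_ty, /*AddressSpace=*/0),
      "slot.generic");
}
```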
galeselee added a commit that referenced this issue Dec 30, 2022
Issue: #6434

Add some unit tests at the AMDGPU RHI level.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
galeselee added a commit that referenced this issue Dec 30, 2022
Issue: #6434

### Brief Summary
It contains four parts (`driver`, `context`, `device` and
`caching_allocator`). The code is similar to `cuda/rhi`; however, there
are still some differences between `amdgpu/rhi` and `cuda/rhi`:
#### context
1. The method of obtaining the hardware version
2. `Context::launch`
#### driver
1. ROCm/hip internal functions
#### cmake
The current cmake build system is sufficient to support the unit
tests in #6597.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
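For readers unfamiliar with the `driver` part, the usual pattern is to bind the HIP symbols from `libamdhip64.so` at runtime; below is a hypothetical dlopen/dlsym sketch of that idea (the typedef and names are illustrative, not Taichi's real wrapper).

```
#include <dlfcn.h>
#include <cstddef>
#include <cstdio>

// Illustrative function-pointer type matching hipMalloc's C signature.
using hipMalloc_t = int (*)(void **ptr, size_t size);

int main() {
  void *lib = dlopen("libamdhip64.so", RTLD_NOW | RTLD_LOCAL);
  if (!lib) {
    std::fprintf(stderr, "HIP runtime not found: %s\n", dlerror());
    return 1;
  }
  // Resolve one symbol by name, as a driver wrapper would for each API.
  auto hip_malloc = reinterpret_cast<hipMalloc_t>(dlsym(lib, "hipMalloc"));
  void *dev_ptr = nullptr;
  int err = hip_malloc(&dev_ptr, 1024);  // 0 == hipSuccess
  std::printf("hipMalloc returned %d, ptr=%p\n", err, dev_ptr);
  dlclose(lib);
  return 0;
}
```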
@turbo0628 turbo0628 modified the milestones: v1.5.0, v1.6.0 Mar 9, 2023
@feisuzhu feisuzhu modified the milestones: v1.6.0, v1.7.0 May 11, 2023
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
…7023)

Issue: taichi-dev#6434

### Brief Summary
These unit tests are for taichi-dev#6486.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
Issue: taichi-dev#6434

### Brief Summary
1. Add an exclusive way to process kernel arguments.
2. Update the `launch` API in `AMDGPUContext`.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
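A plausible shape for the argument-passing path mentioned above, sketched with the plain HIP module API: the packed buffer goes through the `extra` parameter of `hipModuleLaunchKernel`. `func`, the launch dimensions, and the helper are assumptions, not Taichi's actual implementation.

```
#include <hip/hip_runtime.h>
#include <cstddef>

// Illustrative: pass one packed argument buffer via the `extra` mechanism.
void launch_with_packed_args(hipFunction_t func, void *arg_buf,
                             size_t arg_size) {
  void *config[] = {
      HIP_LAUNCH_PARAM_BUFFER_POINTER, arg_buf,
      HIP_LAUNCH_PARAM_BUFFER_SIZE, &arg_size,
      HIP_LAUNCH_PARAM_END};
  // kernelParams must be null when `extra` carries the argument buffer.
  hipModuleLaunchKernel(func, 64, 1, 1, 256, 1, 1, 0, nullptr,
                        /*kernelParams=*/nullptr, config);
}
```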
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
Issue: taichi-dev#6434

### Brief Summary
This part contains the `CHI IR -> LLVM IR` stage. As with part 0, the
skeleton is similar to `cuda`. Some important notes follow.
1. `ocml` is the math library, analogous to `libdevice` on `nvidia`. It
provides most of the functions we need; the remainder, such as `sgn` and
`abs`, have been hand-written.
2. Kernel parameter passing differs from `CUDA`: `extra_args` is the only
form the device kernel accepts.
3. It still uses `nvidia` jargon (e.g. `sm`, `block` and `grid`).

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
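To make note 1 concrete, here is a hedged sketch of declaring and calling an `ocml` routine from generated IR via the LLVM C++ API; `__ocml_sin_f32` is a symbol from ROCm's device library, but the helper itself is illustrative.

```
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

// Illustrative: declare float __ocml_sin_f32(float) and emit a call to it,
// the ocml analogue of calling a libdevice intrinsic on the CUDA side.
llvm::Value *emit_sin_f32(llvm::Module &m, llvm::IRBuilder<> &builder,
                          llvm::Value *x /* float */) {
  llvm::FunctionCallee ocml_sin = m.getOrInsertFunction(
      "__ocml_sin_f32",
      llvm::FunctionType::get(builder.getFloatTy(), {builder.getFloatTy()},
                              /*isVarArg=*/false));
  return builder.CreateCall(ocml_sin, {x});
}
```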
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
Issue: taichi-dev#6434

### Brief Summary

1. This is the third part of adding the amdgpu backend: the runtime part
of the implementation. The main job of the runtime is turning LLVM IR into
GCN ISA objects and hsaco (a file format that can be accepted by the
module-launch API provided by hip).

2. After calling the relevant APIs to generate the GCN ISA object, the
LLVM linker (ld.lld) needs to be invoked to produce the hsaco file
format, so there is a command-line call `ld.lld -shared xxx.o -o
xxx.hsaco` in the code, and the temporarily generated files are stored in
the `/tmp/taichi_hsaco/` folder (see the sketch after the listing below).

3. To deal with multiple `hsaco` files being generated at the same time, a
random number is used to name the generated files: `JITSessionAMDGPU` has a
`random_num_` and a `tmp_dir_` which are assigned when the
`JITSessionAMDGPU` instance is created. Each `ti.kernel` is divided into
offload tasks, each of which is compiled into a separate `hsaco` file. A
random number bound to the `hsaco` file is obtained when the `hsaco` file
is generated. Here is an example of the files after running
`ti example mpm128`:
```
taichi_hsaco/
└── 4858208420434830779
    ├── taichi_amdgcn_10476395765980093855.hsaco
    ├── taichi_amdgcn_10476395765980093855.o
    ├── taichi_amdgcn_11369096326162657620.hsaco
    ├── taichi_amdgcn_11369096326162657620.o
    ├── taichi_amdgcn_11700031850871498261.hsaco
    ├── taichi_amdgcn_11700031850871498261.o
    ├── taichi_amdgcn_14803499569653867868.hsaco
    ├── taichi_amdgcn_14803499569653867868.o
    ├── taichi_amdgcn_14949458395707884954.hsaco
    ├── taichi_amdgcn_14949458395707884954.o
    ├── taichi_amdgcn_15955762247261446379.hsaco
    ├── taichi_amdgcn_15955762247261446379.o
    ├── taichi_amdgcn_16891452471041191610.hsaco
    ├── taichi_amdgcn_16891452471041191610.o
    ├── taichi_amdgcn_17615766226135707772.hsaco
    ├── taichi_amdgcn_17615766226135707772.o
    ├── taichi_amdgcn_18033844193337069056.hsaco
    ├── taichi_amdgcn_18033844193337069056.o
    ├── taichi_amdgcn_5951151729973841331.hsaco
    ├── taichi_amdgcn_5951151729973841331.o
    ├── taichi_amdgcn_6012043323411824926.hsaco
    ├── taichi_amdgcn_6012043323411824926.o
    ├── taichi_amdgcn_6796840558965541322.hsaco
    ├── taichi_amdgcn_6796840558965541322.o
    ├── taichi_amdgcn_6835984424286808860.hsaco
    ├── taichi_amdgcn_6835984424286808860.o
    ├── taichi_amdgcn_7872622170129629907.hsaco
    ├── taichi_amdgcn_7872622170129629907.o
    ├── taichi_amdgcn_8760441738982760858.hsaco
    ├── taichi_amdgcn_8760441738982760858.o
    ├── taichi_amdgcn_9006625347419529255.hsaco
    └── taichi_amdgcn_9006625347419529255.o
```

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
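A minimal sketch of the link-and-load step from points 1-2, assuming `ld.lld` is on the PATH; the path scheme mirrors the listing above, but the helper itself is hypothetical and skips error reporting.

```
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <string>

// Illustrative: turn an emitted object into an hsaco shared object with
// ld.lld, then load it through the HIP module API.
hipModule_t link_and_load(const std::string &stem) {
  // e.g. stem = "/tmp/taichi_hsaco/<run-id>/taichi_amdgcn_<rand>"
  std::string cmd = "ld.lld -shared " + stem + ".o -o " + stem + ".hsaco";
  if (std::system(cmd.c_str()) != 0)
    return nullptr;
  hipModule_t module = nullptr;
  hipModuleLoad(&module, (stem + ".hsaco").c_str());
  return module;
}
```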
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
Issue: taichi-dev#6434

### Brief Summary
1. Put the amdgpu-related bitcode files in external/amdgpu_libdevice (so
Taichi only needs libamdhip64.so).
2. Link the amdgpu-related bc files and the Taichi module together
(precisely, clone the bc files into the Taichi module); see the sketch
below.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
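As an illustration of point 2, here is a hedged sketch of cloning device-library bitcode into the main module with `llvm::Linker`; the bitcode file name is a placeholder.

```
#include <memory>
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Linker/Linker.h"
#include "llvm/Support/SourceMgr.h"

// Illustrative: parse one device-library .bc file and clone it into the
// main module. linkModules returns true on error, so negate for success.
bool link_device_bitcode(llvm::Module &main_module, llvm::LLVMContext &ctx) {
  llvm::SMDiagnostic err;
  std::unique_ptr<llvm::Module> dev_lib =
      llvm::parseIRFile("external/amdgpu_libdevice/ocml.bc", err, ctx);
  if (!dev_lib)
    return false;
  return !llvm::Linker::linkModules(main_module, std::move(dev_lib));
}
```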
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
Issue: taichi-dev#6434

### Brief Summary
1. Update `runtime/llvm/runtime_module/runtime.cpp` to enable the amdgpu
backend.
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
Issue: taichi-dev#6434

### Brief Summary
1. Refactor `add_struct_for_func` into an llvm_pass.
2. Replace `bool spmd` with `string spmd` to distinguish `amdgpu` from
`nvgpu`.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
Issue: taichi-dev#6434

### Brief Summary
1. Enable the amdgpu API in Taichi (except struct-for).

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
Issue: taichi-dev#6434

### Brief Summary
1. Fix amdgpu backend bugs:
    a. codegen typos
    b. add support for more types in `sgn` (see the sketch below)
    c. use a temporary method to handle the `runtime.cpp` error
2. Enable the amdgpu backend Python unit tests:
    a. because `print` is not yet supported, all `print`-related tests are
temporarily disabled on the amdgpu backend.
    b. there is still something wrong in the `gdar_mpm` and `ad_if` tests.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
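As a concrete example of the kind of hand-written `sgn` fallback referenced in 1.b, here is a minimal `IRBuilder` sketch for the i32 case using the classic `(x > 0) - (x < 0)` formulation; it is illustrative, not the actual patch.

```
#include "llvm/IR/IRBuilder.h"

// Illustrative: integer sgn(x) = (x > 0) - (x < 0), emitted as IR.
llvm::Value *emit_sgn_i32(llvm::IRBuilder<> &builder, llvm::Value *x) {
  llvm::Value *zero = builder.getInt32(0);
  llvm::Value *gt = builder.CreateICmpSGT(x, zero);  // x > 0
  llvm::Value *lt = builder.CreateICmpSLT(x, zero);  // x < 0
  return builder.CreateSub(builder.CreateZExt(gt, builder.getInt32Ty()),
                           builder.CreateZExt(lt, builder.getInt32Ty()),
                           "sgn");
}
```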
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
Issue: taichi-dev#6434

### Brief Summary
1. Currently only the default (event) profiler is available on AMDGPU.
2. Screenshot:
https://user-images.githubusercontent.com/47965866/217734581-4c7f7fa7-4d17-4243-b4bd-0a70d1c88f4a.png

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
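For reference, an event-based timer of the sort a default (event) profiler relies on can be written entirely with public HIP runtime calls; this sketch is illustrative, not Taichi's profiler code.

```
#include <hip/hip_runtime.h>

// Illustrative: measure elapsed GPU time around a kernel launch region
// using HIP events, returning milliseconds.
float time_region(hipStream_t stream) {
  hipEvent_t start, stop;
  hipEventCreate(&start);
  hipEventCreate(&stop);
  hipEventRecord(start, stream);
  // ... launch the kernel(s) to be profiled on `stream` here ...
  hipEventRecord(stop, stream);
  hipEventSynchronize(stop);
  float ms = 0.0f;
  hipEventElapsedTime(&ms, start, stop);
  hipEventDestroy(start);
  hipEventDestroy(stop);
  return ms;
}
```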
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
Issue: taichi-dev#6434

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@liuyunpu commented:

Is there any plan to support the CDNA or CDNA2 architecture? I hope to run my Taichi code on MI-series GPUs.

galeselee (Contributor, Author) commented Sep 28, 2023

> Is there any plan to support the CDNA or CDNA2 architecture? I hope to run my Taichi code on MI-series GPUs.

Actually, you can run your Taichi code on a GPU with the CDNA or CDNA2 architecture with a little modification. (I have tested some cases on an MI210.)

@liuyunpu commented:

How? I thought the current Taichi doesn't support AMD GPUs.

galeselee (Contributor, Author) commented:

> How? I thought the current Taichi doesn't support AMD GPUs.

It looks like the amdgpu backend hasn't been released yet; you can either do the compilation yourself (I can sort out how to compile it if you need help) or try using the vulkan backend.

galeselee (Contributor, Author) commented Oct 12, 2023

> How? I thought the current Taichi doesn't support AMD GPUs.

  1. The current Taichi AMDGPU backend is based on ROCm, so the first thing you need to do is install ROCm. (The official docs may be helpful.)

  2. Build LLVM for the AMDGPU backend:
    2.1. Download the llvm source code.
    2.2. Compile:

    cmake .. -DLLVM_ENABLE_RTTI:BOOL=ON -DBUILD_SHARED_LIBS:BOOL=OFF -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD="X86;AMDGPU" -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_TERMINFO=OFF -DLLVM_INCLUDE_BENCHMARKS=OFF -DCMAKE_INSTALL_PREFIX=llvm_binary_path
    make -j `nproc`
    make install

    2.3. Check: there should be four folders (bin, include, lib and share) under the CMAKE_INSTALL_PREFIX.

  3. Download the taichi source code:
    git clone --recursive https://github.com/taichi-dev/taichi.git

  4. Prepare the environment to compile Taichi:
    virtualenv -p python3 your_env_path
    pip install -r requirements_dev.txt

  5. Compile:
    TAICHI_CMAKE_ARGS="-DTI_WITH_VULKAN=OFF -DTI_WITH_OPENGL=OFF -DTI_BUILD_TESTS=ON -DTI_BUILD_EXAMPLES=OFF -DCMAKE_PREFIX_PATH=llvm_binary_path/lib/cmake -DCMAKE_CXX_COMPILER=clang++ -DTI_WITH_AMDGPU=ON" python setup.py develop
    Please choose the backends to compile according to your needs (e.g. if you need the VULKAN backend, pass -DTI_WITH_VULKAN=ON).

  6. Test:
    pip install -r requirements_test.txt
    cd tests
    python run_test.py -a amdgpu -k mpm88
    If your output is similar to the one below, then you have finished the compilation.

    (screenshot of the mpm88 test passing)
    Please enjoy!

@expenses commented:

I've managed to build Taichi for AMDGPU, but one problem I ran into is that recent (5.3+) versions of ROCm require PCIe atomics, which are not available on my setup. The most recent version, 5.7, includes a workaround for setups without PCIe atomics: https://rocm.docs.amd.com/en/latest/release.html#non-hostcall-hip-printf.

I had a go at building Taichi against LLVM 17, which is the LLVM version that ROCm 5.7 works with. I haven't been able to test those changes because they require the ROCm 5.7 runtime to be installed, and that isn't out in the Arch Linux repositories yet. Here are the changes I've had to make: expenses@800c693.

I'm not sure if this information is super useful, but it might help some people.

@jim19930609 jim19930609 removed this from the v1.7.0 milestone Oct 25, 2023
@GZGavinZhao commented:

I'd just like to note that the AMDGPU backend (TI_ARCH=amdgpu) is already working with ROCm 6.0 on my gfx1032 GPU. It passes all (Python) tests and in many cases is more stable / crashes less than the Vulkan and OpenGL backends (albeit at the cost of a slightly slower startup).

Currently the latest LLVM version Taichi officially supports is LLVM 15. I have LLVM 16 working locally, but LLVM 17+ needs some thought because of the need (or not?) to migrate to LLVM's new pass manager infrastructure (e.g. PassManagerBuilder no longer exists).

galeselee (Contributor, Author) commented:

@GZGavinZhao Thanks for sharing; I'm very glad to hear it. I have previous experience migrating from a lower LLVM version to 14. Some of the LLVM-related code needs to be changed according to the LLVM docs as the LLVM API changes, like the PassManagerBuilder you mentioned.
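For anyone attempting the LLVM 17 migration discussed above: the replacement for `PassManagerBuilder` is the new pass manager's `PassBuilder`. The following generic sketch follows the shape of LLVM's own new-pass-manager documentation; it is not Taichi code.

```
#include "llvm/IR/Module.h"
#include "llvm/Passes/PassBuilder.h"

// Illustrative: build and run the default O2 pipeline with the new PM.
void optimize_with_new_pm(llvm::Module &m) {
  llvm::PassBuilder pb;
  llvm::LoopAnalysisManager lam;
  llvm::FunctionAnalysisManager fam;
  llvm::CGSCCAnalysisManager cgam;
  llvm::ModuleAnalysisManager mam;
  pb.registerModuleAnalyses(mam);
  pb.registerCGSCCAnalyses(cgam);
  pb.registerFunctionAnalyses(fam);
  pb.registerLoopAnalyses(lam);
  pb.crossRegisterProxies(lam, fam, cgam, mam);
  llvm::ModulePassManager mpm =
      pb.buildPerModuleDefaultPipeline(llvm::OptimizationLevel::O2);
  mpm.run(m, mam);
}
```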
