# ZLUDA

**ZLUDA** is a CUDA wrapper that allows CUDA-based applications to run on otherwise unsupported GPUs, such as AMD GPUs on Windows.  

> [!IMPORTANT]
> If you are looking for the official ROCm support through AMD's [TheRock](https://github.com/ROCm/TheRock) project, please refer to [ROCm on Windows](https://github.com/vladmandic/sdnext/wiki/AMD-ROCm#rocm-on-windows) section of [AMD ROCm](https://github.com/vladmandic/sdnext/wiki/AMD-ROCm) page.

## Warning

ZLUDA support is unofficial and currently limited.  

- For unofficial instructions on how to manually build ROCm libraries, see the ROCm Custom Build section below.  
- For unofficial instructions on how to install ROCm for older GPUs such as Polaris and Vega, see [ROCm for Polaris and Vega](https://github.com/vladmandic/sdnext/issues/3898) post  

## Installing ZLUDA for AMD GPUs in Windows

> [!NOTE]
> This guide assumes you have [Git and Python](Installation#install-python-and-git) installed,  
> and are comfortable using the command prompt, navigating Windows Explorer, renaming files and folders, and working with zip files.

> [!IMPORTANT]
> If you have an integrated AMD GPU (iGPU), you may need to disable it,  
> or use the `HIP_VISIBLE_DEVICES` environment variable.
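
One way to apply `HIP_VISIBLE_DEVICES` without changing system-wide settings is to set it only for the SD.Next process. The following is a minimal sketch, not part of SD.Next: the helper names are hypothetical, and the device index you pass in depends on how your system enumerates GPUs.

```python
import os
import subprocess


def env_with_visible_devices(indices, base_env=None):
    """Return a copy of the environment that exposes only the given HIP device indices."""
    env = dict(os.environ if base_env is None else base_env)
    # HIP_VISIBLE_DEVICES takes a comma-separated list of device indices.
    env["HIP_VISIBLE_DEVICES"] = ",".join(str(i) for i in indices)
    return env


def launch_webui(indices):
    # Assumption: this is run on Windows from the sdnext folder, where webui.bat lives.
    return subprocess.run(
        ["cmd", "/c", "webui.bat", "--use-zluda"],
        env=env_with_visible_devices(indices),
        check=False,
    )
```

For example, `launch_webui([1])` would hide every device except index 1, assuming that index is your discrete GPU.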

### Install Visual C++ Runtime

> [!NOTE]
> Many systems already have this because it is bundled with many apps and games, but reinstalling is safe.

Download the latest Visual C++ Runtime from <https://aka.ms/vs/17/release/vc_redist.x64.exe> and run it.  
If you see Repair or Uninstall, it is already installed. Otherwise, install it.  

### Install ZLUDA

ZLUDA is auto-installed and added to `PATH` when you start `webui.bat` with `--use-zluda`.

### Install HIP SDK

Install HIP SDK 6.2 (or 6.4) from <https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html>  
If your regular AMD GPU driver is up to date, you do not need the PRO driver suggested by HIP SDK.

> [!IMPORTANT]
> HIP SDK 7.x is NOT supported at the moment.
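
To confirm which HIP SDK version is installed, you can inspect the folder that `HIP_PATH` points to. This is a rough sketch that assumes the last path component is the version number, as in the default `C:\Program Files\AMD\ROCm\6.2\` layout; the function names are hypothetical.

```python
import re
from pathlib import PureWindowsPath

SUPPORTED = {"6.2", "6.4"}  # the versions named in this guide


def hip_sdk_version(hip_path):
    """Extract the version folder name (e.g. '6.2') from a HIP SDK install path."""
    name = PureWindowsPath(hip_path).name
    return name if re.fullmatch(r"\d+\.\d+", name) else None


def is_supported(hip_path):
    """True only when the install path ends in one of the supported versions."""
    return hip_sdk_version(hip_path) in SUPPORTED
```

You would typically call it as `is_supported(os.environ["HIP_PATH"])` after importing `os`.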

### Replace HIP SDK library files for unsupported GPU architectures

Go to <https://rocm.docs.amd.com/projects/install-on-windows/en/develop/reference/system-requirements.html> and find your GPU model.  
If your GPU model has a ✅ in both columns, skip to [Install SD.Next](#install-sdnext).  
If your GPU model has an ❌ in the HIP SDK column, or if your GPU isn't listed, follow the instructions below:  

1. Open Windows Explorer and paste `C:\Program Files\AMD\ROCm\6.2\bin\rocblas` into the location bar.  
   *(This assumes you installed the HIP SDK in the default location and Windows is on `C:`.)*
2. Make a copy of the `library` folder as a backup.  
3. Download the [unofficial rocBLAS library](https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU/releases/tag/v0.6.2.4) that matches your GPU architecture:  
   - gfx1010: RX 5700, RX 5700 XT  
   - gfx1012: RX 5500, RX 5500 XT  
   - gfx1031: RX 6700, RX 6700 XT, RX 6750 XT  
   - gfx1032: RX 6600, RX 6600 XT, RX 6650 XT  
   - gfx1103: Radeon 780M  
   - gfx803: RX 570, RX 580  

   For other architecture names, see the [AMD GPU processor list](https://llvm.org/docs/AMDGPUUsage.html#processors).
4. Open the zip file.
5. Drag and drop the `library` folder from the zip file into `%HIP_PATH%bin\rocblas` (the folder you opened in step 1), overwriting the existing files.
6. Reboot the PC.
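
The backup and overwrite steps above can also be scripted. This is a minimal sketch under two assumptions: the downloaded zip contains a top-level `library/` folder (as step 5 implies), and the script has write access to the HIP SDK folder. The function name and the `library.bak` backup name are hypothetical.

```python
import shutil
import zipfile
from pathlib import Path


def replace_rocblas_library(rocblas_dir, zip_path, backup_name="library.bak"):
    """Back up rocblas/library, then overlay the `library/` entries from the zip."""
    rocblas_dir = Path(rocblas_dir)
    library = rocblas_dir / "library"
    backup = rocblas_dir / backup_name
    if not backup.exists():
        shutil.copytree(library, backup)  # keep the original files
    with zipfile.ZipFile(zip_path) as archive:
        # Zip member names always use forward slashes.
        members = [n for n in archive.namelist() if n.startswith("library/")]
        archive.extractall(rocblas_dir, members=members)  # overwrites same-named files
    return len(members)
```

Only entries under `library/` are extracted, so other files in the archive are left untouched.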

If your GPU is not supported by the HIP SDK and is not in the list above, follow the [ROCm Support guide](AMD-ROCm#rocm-on-windows) to build your own rocBLAS libraries.  

> [!WARNING]
> Building your own libraries is not for the faint of heart

### Install SD.Next

Using Windows Explorer, navigate to a place you'd like to install SD.Next. This should be a folder which your user account has read/write/execute access to. Installing SD.Next in a directory which requires admin permissions may cause it to not launch properly.  

Note: Do not install SD.Next into Program Files, Users, Windows folders, OneDrive, Desktop, or a dot-prefixed folder (for example `.sdnext`).  

The best place would be on an SSD for model loading.  
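
The folder restrictions above can be checked before cloning. This sketch is only a heuristic: it flags path components whose names match the folders listed in the note or that start with a dot, and the function name is hypothetical.

```python
from pathlib import PureWindowsPath

# Folder names this guide says to avoid anywhere in the install path.
BLOCKED = {"program files", "users", "windows", "onedrive", "desktop"}


def install_path_problems(path):
    """Return the path components that break the guide's folder restrictions."""
    p = PureWindowsPath(path)
    parts = [part for part in p.parts if part != p.anchor]  # drop the drive root
    return [part for part in parts if part.lower() in BLOCKED or part.startswith(".")]
```

An empty result, as for `D:\AI\sdnext`, means none of the listed restrictions were hit; it does not verify folder permissions.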

In the Location Bar, type `cmd`, then hit [Enter]. This will open a Command Prompt window at that location.  

![image](https://github.com/vladmandic/sdnext/assets/1969381/8a24ff53-4fe9-4260-8674-badcdc3d5aa5)

Run these commands in Command Prompt, one at a time:  

> `git clone https://github.com/vladmandic/sdnext`  
> `cd sdnext`  
> `.\webui.bat --use-zluda --debug --autolaunch`

### Compilation and First Generation

Generate a test image. First-time compilation can take 10-15 minutes, and sometimes longer.  
The text `Compilation is in progress. Please wait...` may appear repeatedly. This is expected.  
Subsequent generations will be significantly quicker.  

### Upgrading ZLUDA

If ZLUDA stops working after an SD.Next update, reinstalling ZLUDA may help.

1. Remove the `.zluda` folder.
2. Launch WebUI. The installer will download and install a newer version of ZLUDA.

You may need to wait for recompilation on the first generation after reinstall.
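
Step 1 can be scripted as follows. This is a small sketch that assumes the `.zluda` folder sits directly inside the SD.Next folder; the function name is hypothetical.

```python
import shutil
from pathlib import Path


def remove_zluda_folder(sdnext_dir):
    """Delete the `.zluda` folder so the next launch downloads ZLUDA again."""
    target = Path(sdnext_dir) / ".zluda"
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False  # nothing to remove
```

Close SD.Next before running it, since files in use cannot be deleted on Windows.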

## Experimental features

### cuDNN

Speed-up: ★★★☆☆  
VRAM: ★★★★☆  
Stability: ★★★☆☆  
Compatible with: Navi cards

MIOpen, the equivalent of cuDNN for AMD GPUs, has not been released on Windows yet.

However, you can enable it with a custom build of MIOpen.

This section describes how to enable cuDNN.

1. Install HIP SDK 6.2. If you already have an older HIP SDK, uninstall it before installing 6.2.  
2. Download and install the HIP SDK extension from the [ZLUDA releases page](https://github.com/lshqqytiger/ZLUDA/releases).  
   (Unzip it and paste the folders over `path/to/AMD/ROCm/6.2`.)  
3. Remove the `.zluda` folder if it exists.  
4. Launch WebUI with command line arguments `--use-zluda --use-nightly`.  

The first generation will take a long time because MIOpen has to find the optimal solution and cache it.

If you get driver crashes, restart WebUI and try again.

### cuBLASLt

Speed-up: ★☆☆☆☆  
VRAM: ★☆☆☆☆  
Stability: ★★☆☆☆  
Compatible with: gfx1100, or CDNA accelerators

hipBLASLt, the equivalent of cuBLASLt for AMD GPUs, hasn't been released on Windows yet.

However, unofficial builds are available.

This section describes how to enable cuBLASLt.

1. Install HIP SDK 6.2. If you already have an older HIP SDK, uninstall it before installing 6.2.  
2. Download and install the HIP SDK extension from the [ZLUDA releases page](https://github.com/lshqqytiger/ZLUDA/releases).  
   (Unzip it and paste the folders over `path/to/AMD/ROCm/6.2`.)  
3. Remove the `.zluda` folder if it exists.  
4. Launch WebUI with command line arguments `--use-zluda --use-nightly`.  

### triton

Speed-up: ★★★★★  
VRAM: ★★★★☆  
Stability: ★★★★☆  
Compatible with: Navi cards

1. Prepare Python 3.11 (or 3.12) environment.  
2. Download a triton wheel that matches your Python version from the [Triton releases page](https://github.com/lshqqytiger/triton/releases).  
   (cp312 is Python 3.12, cp311 is Python 3.11 and cp310 is Python 3.10)  
3. Open a PowerShell window in the SD.Next folder and install via pip.  

```shell
venv\scripts\python -m pip install --upgrade setuptools
venv\scripts\python -m pip install --upgrade path/to/downloaded/triton.whl
```
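
Before installing, you can check that the wheel you downloaded in step 2 matches your interpreter. This sketch relies on the standard wheel naming scheme, where the file name ends in `-{python tag}-{abi tag}-{platform tag}.whl`; the function names are hypothetical.

```python
import sys


def wheel_python_tag(filename):
    """Return the Python tag (e.g. 'cp311') from a wheel file name."""
    stem = filename.removesuffix(".whl")
    # Wheel names end with -{python tag}-{abi tag}-{platform tag}.
    return stem.split("-")[-3]


def matches_interpreter(filename, version_info=sys.version_info):
    """True when the wheel's Python tag matches the given (major, minor) version."""
    return wheel_python_tag(filename) == f"cp{version_info[0]}{version_info[1]}"
```

Run it with the Python from `venv\scripts` so that `sys.version_info` reflects the environment SD.Next actually uses.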

> [!IMPORTANT]
> Developer PowerShell for Visual Studio (or Developer Command Prompt) is needed to compile kernels using triton.

#### Flash Attention 2

Using triton, you can enable Flash Attention 2.

1. Go to Settings.
2. Set attention method to `Scaled Dot-product`.
3. Enable `Triton Flash attention`.
4. Restart WebUI.

#### torch.compile

Using triton, you can enable `torch.compile`.

1. Go to Settings.
2. Enable compilation.
3. Set compilation method to `inductor` or `cuda-graph`.

※ `torch.compile` is currently not compatible with flash attention 2 on ZLUDA.

---

## Comparison (DirectML)

| Feature | DirectML | ZLUDA |
| --- | --- | --- |
| Speed | Slower | Faster |
| VRAM Usage | More | Less |
| VRAM GC | ❌ | ✅ |
| Training | * | ✅ |
| Flash Attention | ❌ | ✅ |
| FFT | ✅ | ⚠️ |
| DNN | ❓ | ✅ |
| RTC | ❓ | ✅ |
| Source Code | Closed-source | Open-source |

❓: unknown  
⚠️: partially supported  
*: reportedly possible, but uses too much VRAM to train Stable Diffusion models, LoRAs, etc.

## Compatibility

| DTYPE | Support |
| --- | --- |
| FP64 | ✅ |
| FP32 | ✅ |
| FP16 | ✅ |
| BF16 | ✅ |
| LONG | ✅ |
| INT8 | ✅ |
| UINT8 | ✅* |
| INT4 | ❓ |
| FP8 | ⚠️ |
| BF8 | ⚠️ |

*: Not tested.

## Building rocBLAS for unsupported architectures

This section explains how to build rocBLAS based on official ROCm documentation.

You may have an AMD GPU without official support in the ROCm [HIP SDK](https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html).
If you use an integrated AMD GPU (iGPU) and want HIP SDK support on Windows, you can also use this process.

*If you do not need to build ROCmLibs or already have the library, please skip this.*

Make sure the following software is available on your PC. Otherwise, the ROCmLibs build may fail:

1. Visual Studio 2022
2. Python
3. Strawberry Perl
4. CMake
5. Git
6. HIP SDK (mentioned above)
7. Source code for [rocBLAS](https://github.com/ROCm/rocBLAS) and [Tensile](https://github.com/ROCm/Tensile) (on Windows, download Tensile 4.38.0 for ROCm 5.7.0)

Edit line 41 of `rdeps.py` in the rocBLAS folder. The `vcpkg` version pinned there is outdated and can cause build failures. Replace the `vcpkg` clone command on that line with:

```shell
git clone -b 2024.02.14 https://github.com/microsoft/vcpkg
```

Download `Tensile 4.38.0` from the release page.

Download [Tensile-fix-fallback-arch-build.patch](https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU/blob/main/Tensile-fix-fallback-arch-build.patch) and place it in the `Tensile` folder. In this example, the path is `C:\ROCm\Tensile-rocm-5.7.0`.

Run the following command in a terminal opened in `Tensile-rocm-5.7.0`:

```shell
git apply Tensile-fix-fallback-arch-build.patch
```

If your `vcpkg` was built after April 2023, replace `Tensile\Source\lib\CMakeLists.txt` in your Tensile folder with this [CMakeLists.txt](https://github.com/ROCm/Tensile/tree/develop/Tensile/Source/lib/CMakeLists.txt). For details, see the [ROCm Official Guide](https://rocmdocs.amd.com/projects/rocBLAS/en/latest/install/Windows_Install_Guide.html#windows-install).

In `C:\ROCm\rocBLAS-rocm-5.7.0`, run:

```shell
python rdeps.py
```

If the script fails, search for the error message, fix the cause, and run it again. On Linux, use `install.sh -d` instead.

Once done, run:

```shell
python rmake.py -a "gfx906;gfx1012" --lazy-library-loading --no-merge-architectures -t "C:\ROCm\Tensile-rocm-5.7.0"
```

Change `gfx906;gfx1012` to your GPU LLVM Target. If you want to build multiple ones at a time, make sure to separate with `;`.
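
If you build for different targets often, the `rmake.py` command line can be assembled programmatically. This is a sketch only: it reuses exactly the flags shown above, and the function name is hypothetical.

```python
def rmake_command(architectures, tensile_dir):
    """Build the rmake.py argument list for one or more LLVM targets."""
    if not architectures:
        raise ValueError("at least one architecture is required")
    return [
        "python", "rmake.py",
        "-a", ";".join(architectures),  # multiple targets are separated with ';'
        "--lazy-library-loading",
        "--no-merge-architectures",
        "-t", str(tensile_dir),
    ]
```

The resulting list can be passed to `subprocess.run(..., cwd=r"C:\ROCm\rocBLAS-rocm-5.7.0", check=True)`, which avoids shell quoting problems with the `;` separator.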

Upon successful compilation, rocblas.dll will be generated. In this example, the file path is `C:\ROCm\rocBLAS-rocm-5.7.0\build\release\staging\rocblas.dll`. In addition, some Tensile data files will also be produced in `C:\ROCm\rocBLAS-rocm-5.7.0\build\release\Tensile\library`.

To compile HIP SDK programs that use hipBLAS/rocBLAS, replace the SDK's `rocblas.dll` with the one you just built: place `rocblas.dll` into `C:\Program Files\AMD\ROCm\5.7\bin` and the Tensile data files into `C:\Program Files\AMD\ROCm\5.7\bin\rocblas\library`.

Your programs should run smooth as silk on the designated graphics card now.

## ROCm Custom Build

This guide will walk you through building rocBLAS using the official ROCm documentation.

This guide is for users whose AMD GPUs lack official ROCm/[HIP SDK](https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html) support, or who want to enable HIP SDK 5.7 or 6.1.2 support on Windows for integrated AMD GPUs (iGPUs).

If you already have the libraries, you can skip this section!

**Prerequisites:** Ensure the following software is installed on your PC. `python`, `git`, and the `HIP SDK` are essential. The script `rdeps.py` will automatically download any missing dependencies when you run it.

* **Visual Studio 2022:** (Download from [https://visualstudio.microsoft.com/](https://visualstudio.microsoft.com/))
* **Python:** (Download from [https://www.python.org/](https://www.python.org/))
* **Strawberry Perl:**  (Download from [https://strawberryperl.com/](https://strawberryperl.com/))
* **CMake:** (Download from [https://cmake.org/download/](https://cmake.org/download/))
* **Git:** (Download from [https://git-scm.com/](https://git-scm.com/))
* **HIP SDK:** (Download from [https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html](https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html))

### Downloading the Source Code

1. **rocBLAS:** Download the release that matches your ROCm version from the [rocBLAS releases page](https://github.com/ROCm/rocBLAS/releases).
   * **ROCm 5.7.0:** Download `rocBLAS 3.1.0`: [rocBLAS 3.1.0 for ROCm 5.7.0](https://github.com/ROCm/rocBLAS/releases/tag/rocm-5.7.0)
   * **ROCm 6.1.2:** Download `rocBLAS 4.1.2`: [rocBLAS 4.1.2 for ROCm 6.1.2](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.1.2)

2. **Tensile:** Download the matching release from the [Tensile releases page](https://github.com/ROCm/Tensile/releases).
   * **ROCm 5.7.0:** Download `Tensile 4.38.0`: [Tensile 4.38.0 for ROCm 5.7.0](https://github.com/ROCm/Tensile/releases/tag/rocm-5.7.0)
   * **ROCm 6.1.2:** Download `Tensile 4.40.0`: [Tensile 4.40.0 for ROCm 6.1.2](https://github.com/ROCm/Tensile/releases/tag/rocm-6.1.2)
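
The version pairs above can be kept in a small lookup so the rocBLAS and Tensile downloads always come from the same ROCm tag. This is an illustrative sketch; the names `ROCM_SOURCES` and `release_urls` are hypothetical, and the URL pattern is taken from the links in this section.

```python
# rocBLAS and Tensile versions that belong to each ROCm release listed above.
ROCM_SOURCES = {
    "5.7.0": {"rocBLAS": "3.1.0", "Tensile": "4.38.0"},
    "6.1.2": {"rocBLAS": "4.1.2", "Tensile": "4.40.0"},
}


def release_urls(rocm_version):
    """Return the GitHub release-tag pages for a ROCm version listed above."""
    if rocm_version not in ROCM_SOURCES:
        raise KeyError(f"no versions listed for ROCm {rocm_version}")
    return {
        repo: f"https://github.com/ROCm/{repo}/releases/tag/rocm-{rocm_version}"
        for repo in ROCM_SOURCES[rocm_version]
    }
```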

### Patching Tensile for ROCm (Advanced Users, Optional)

These steps are necessary for specific configurations of ROCm and may not be required in all cases.
If you already have optimized logic for your GPU architecture, you may skip these steps, especially when building libraries for `xnack-` features.

### Determine Your ROCm Version

* **ROCm 5.7.0:** Follow the instructions under "**For hip 5.7.0**" below.
* **ROCm 6.1.2:** Follow the instructions under "**For hip 6.1.2**" below.


### Patches for Tensile

### For hip 5.7.0

1. Download
[Tensile-fix-fallback-arch-build.patch](https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU/blob/main/Tensile-fix-fallback-arch-build.patch).

2. Place the patch file in your `Tensile` folder (e.g., `C:\ROCM\Tensile-rocm-5.7.0`).

3. Open a terminal within the `Tensile` folder.

4. Apply the patch:

   ```bash
   git apply Tensile-fix-fallback-arch-build.patch
   ```

   If nothing appears after applying, the patch succeeded. Otherwise, you may need to manually add the patch content to `TensileCreateLibrary.py`. You can also skip this step if you already have optimized logic.

### For hip 6.1.2

1. Download
[Tensile-fix-fallback-arch-build-hip-6.1.2.patch](https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU/blob/main/Tensile-fix-fallback-arch-build-hip-6.1.2.patch).

2. Place the patch file in your `Tensile` folder (e.g., `C:\ROCM\Tensile-rocm-6.1.2`).

3. Open a terminal within the `Tensile` folder.

4. Apply the patch:

   ```bash
   git apply Tensile-fix-fallback-arch-build-hip-6.1.2.patch
   ```

   If nothing appears after applying, the patch succeeded. Otherwise, you may need to manually add the patch content to `TensileCreateLibrary.py`.
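
Because a successful `git apply` prints nothing, it can help to check the exit code instead. The sketch below first does a dry run with `git apply --check` and only then applies the patch. The function name is hypothetical, and the `run` parameter exists only so the logic can be exercised without a real repository.

```python
import subprocess


def apply_patch(tensile_dir, patch_file, run=subprocess.run):
    """Dry-run the patch with `git apply --check`, then apply it if the check passes."""
    check = run(["git", "apply", "--check", patch_file],
                cwd=tensile_dir, capture_output=True, text=True)
    if check.returncode != 0:
        return False, check.stderr  # patch does not apply cleanly; edit manually
    result = run(["git", "apply", patch_file],
                 cwd=tensile_dir, capture_output=True, text=True)
    return result.returncode == 0, result.stderr
```

A `False` result with an error message corresponds to the case where you need to add the patch content to `TensileCreateLibrary.py` by hand.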

### Update vcpkg (skip this step for ROCm 6.1.2)

Edit line 41 of `rdeps.py` in the rocBLAS folder. The `vcpkg` version pinned there is outdated and will cause the build to fail. To update the `vcpkg` version, replace that line with the following:

```shell
git clone -b 2024.02.14 https://github.com/microsoft/vcpkg
```

* **vcpkg Version:** If your vcpkg version was built after April 2023, replace `Tensile\Source\lib\CMakeLists.txt` in your Tensile folder with this [Tensile CMakeLists.txt replacement](https://github.com/ROCm/Tensile/tree/develop/Tensile/Source/lib/CMakeLists.txt), keeping it in the same folder (e.g., under `rocm`).
  * For more information, see the [official ROCm guide](https://rocmdocs.amd.com/projects/rocBLAS/en/latest/install/Windows_Install_Guide.html#windows-install).

### Build with rdeps and rmake

1. Navigate to the `rocm/rocBLAS` directory in your terminal.
2. Run `rdeps.py`. This script configures your environment and downloads the necessary packages.

   ```shell
   python rdeps.py
   ```

   On Linux, use `install.sh -d` instead. If the script fails, search for the error message, fix the cause, and run it again.

3. After `rdeps.py` completes, run `rmake.py` (adjust paths and architectures as needed):

   ```shell
   python rmake.py -a "gfx1101;gfx1103" --lazy-library-loading --no-merge-architectures -t "C:\rocm\Tensile-rocm-5.7.0"
   ```

**Important:**

* Replace `"gfx1101;gfx1103"` with the correct GPU or APU architecture names for your system. If you build for more than one architecture, separate the names with `;`.
* Read the section on editing `Tensile/Common.py` below before you build.
* For ROCm 6.1.2, change the path to `C:\rocm\Tensile-rocm-6.1.2`.
* The specific commands and patch files may vary depending on your setup and ROCm version.


After successfully building rocBLAS from source, you need to replace the default `rocblas.dll` with your compiled
version for your HIP programs to utilize it. Here's how:

1. **Locate your Compiled Files:**
   * `rocblas.dll`: Located in `C:\ROCM\rocBLAS-rocm-5.7.0\build\release\staging\` (or a similar path based on
your build location).
   * Tensile data files: Found within `C:\ROCM\rocBLAS-rocm-5.7.0\build\release\Tensile\library\` (adjust the
path if needed).

2. **Replace the Default rocBLAS:**

   * Copy `rocblas.dll` to `C:\Program Files\AMD\ROCm\5.7\bin`. This is where the HIP SDK looks for it by default. (Make sure to back up the original `rocblas.dll` first.)

3. **Place Tensile Data Files:**

   * Navigate to `C:\Program Files\AMD\ROCm\5.7\bin\rocblas\`.
   * Replace the `library` folder with the newly built one, which holds all the Tensile data files from your build directory. Back up the original first by renaming it, for example to `bklibrary`.


4. **Test Your HIP Program:**

    * Now, when you run your HIP program, it should use your newly compiled `rocblas.dll` and its associated
Tensile data files.

**Important Notes:**
* For ROCm 6.1.2, change the path to `C:\Program Files\AMD\ROCm\6.1\bin\`.
* Always double-check the paths to ensure they match your installation configuration.
* Make sure the ROCm version in the `bin` directory matches the version of rocBLAS you built.
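
The replacement steps above can be scripted as a rough sketch. It assumes `build_dir` is the `build\release` folder of your rocBLAS build and `sdk_bin` is the HIP SDK `bin` folder, and that the script has permission to write there. The function name and the `.bak` suffix are hypothetical; `bklibrary` is the backup name used in step 3.

```python
import shutil
from pathlib import Path


def install_build(build_dir, sdk_bin, backup_suffix=".bak"):
    """Copy rocblas.dll and the Tensile library into the SDK, keeping backups."""
    build_dir, sdk_bin = Path(build_dir), Path(sdk_bin)
    new_dll = build_dir / "staging" / "rocblas.dll"
    new_lib = build_dir / "Tensile" / "library"
    old_dll = sdk_bin / "rocblas.dll"
    old_lib = sdk_bin / "rocblas" / "library"

    if old_dll.exists():
        # Keep the original DLL next to the new one.
        shutil.copy2(old_dll, old_dll.with_name(old_dll.name + backup_suffix))
    if old_lib.exists():
        # Fails if a `bklibrary` folder already exists, so an earlier backup is never overwritten.
        old_lib.rename(old_lib.with_name("bklibrary"))
    shutil.copy2(new_dll, old_dll)
    shutil.copytree(new_lib, old_lib)
```

Writing under `C:\Program Files` usually requires an elevated (administrator) prompt.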

### Note: Editing Tensile/Common.py

This file contains general parameters used by the Tensile library. To ensure compatibility with your GPU, you need to update two settings: `globalParameters["SupportedISA"]` and `CACHED_ASM_CAPS`.

* Add your GPU's ISA to `globalParameters["SupportedISA"]`.
* In `CACHED_ASM_CAPS`, find the entry for a GPU with a similar architecture (for example, an RDNA2 entry for gfx1031 or gfx1032), copy it, and paste the copy below it with your GPU's ISA number, keeping the rest of the data.
* For HIP SDK 6.1.2, the `CACHED_ASM_CAPS` data has moved to `Tensile/AsmCaps.py`. Also edit `architectureMap` (lines 299 to 310) and add an entry that maps your architecture to the correct logic files.

Some optimized logic files do not exist in the official release and must be created. Otherwise, the build produces a fallback, unoptimized `rocblas.dll` and library.

**Here's a step-by-step guide:**

1. **Choose Your Architecture:**
   * Select an existing architecture folder within `rocBLAS\library\src\blas3\Tensile\Logic\asm_full` (e.g.,
`navi21`). This will serve as a template for your new architecture.
   * Create a new folder with the name of your target architecture (e.g., `navi22`).

2. **Copy Files:**
    * Copy all the files from your chosen template folder into your new architecture folder.

3. **Modify Files:**
   * Open the copied files in a code editor (like VS Code or Visual Studio).
   * Search for instances of `navi21` and replace them with `navi22`.
   * Update any `gfx1030` references to `gfx1031`  (or your target GPU's identifier).
   * Find lines containing `ISA: [10, 3, 0]` and replace them with `ISA: [10, 3, 1]`. (Adjust the ISA code according to your GPU.)
   * Rename all files within the new folder to reflect your architecture name (e.g., change `navi21` to `navi22`). You can use a file renaming tool such as "File Rename APP", a free application available in the Windows Store, for this task.
   * If the build fails, the likely cause is that ROCm architectures have different capabilities. Make sure your `rocblas` logic is tailored to each architecture you are targeting:
      * **gfx90c:** Doesn't support `4x8II`. Delete any logic or files related to `4x8II` within the `asm_full` folder under `rocBLAS\library\src\blas3\Tensile\Logic`.
      * **gfx1010:** Doesn't support `8II`. Do the same for files related to `8II` in the `asm_full` folder.
   * **Checking Logic Files:** The renamed logic files are likely where these operations are defined. Carefully review them and remove any unsupported operations.

4. **Use Your New Architecture:**
   * In `Tensile/Common.py`, update `"CACHED_ASM_CAPS"` or the relevant entries in  `architectureMap` to reference
your new `navi22` folder.
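
The copy, search-and-replace, and rename steps above can be sketched in a short script instead of a manual editor pass. It assumes the logic files are UTF-8 text and that the ISA entries are spelled exactly as `ISA: [10, 3, 0]`; check one file by hand first, since the spacing in your Tensile version may differ. The function name and parameters are hypothetical.

```python
import shutil
from pathlib import Path


def clone_logic_folder(asm_full, template="navi21", target="navi22",
                       old_gfx="gfx1030", new_gfx="gfx1031",
                       old_isa=(10, 3, 0), new_isa=(10, 3, 1)):
    """Copy a Tensile logic folder and retarget names, gfx ids, and ISA entries."""
    src, dst = Path(asm_full) / template, Path(asm_full) / target
    shutil.copytree(src, dst)  # the template folder itself is left unchanged
    replacements = [
        (template, target),
        (old_gfx, new_gfx),
        ("ISA: [{}, {}, {}]".format(*old_isa), "ISA: [{}, {}, {}]".format(*new_isa)),
    ]
    # sorted() materializes the listing first, so renaming while looping is safe.
    for path in sorted(dst.rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")
        for old, new in replacements:
            text = text.replace(old, new)
        path.write_text(text, encoding="utf-8")
        if template in path.name:
            path.rename(path.with_name(path.name.replace(template, target)))
    return dst
```

This does not remove unsupported kernels such as `4x8II` or `8II`; those still need to be reviewed per architecture as described in step 3.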

**Important Notes:**

* Carefully review the changes you make, as incorrect modifications can lead to errors.

**(Skip this for HIP 5.7, Necessary for HIP 6.1.2)**

**Key Changes:**

* **Search for `gfx1030`:** Begin by searching within both the Tensile and rocBLAS folders for instances of
`gfx1030`. This identifier represents a gfx1030 GPU architecture.
* **Replace with Your Target Architecture:** Replace all occurrences of `gfx1030` with the corresponding code for
your desired GPU architecture (e.g., `gfx1031`).
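
To see where `gfx1030` appears before changing anything, a read-only search is a safe first step. This is a minimal sketch with a hypothetical function name; it skips files that cannot be read as UTF-8 text.

```python
from pathlib import Path


def find_references(roots, needle="gfx1030"):
    """Map each file under the given folders to the line numbers that mention `needle`."""
    hits = {}
    for root in roots:
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            try:
                lines = path.read_text(encoding="utf-8").splitlines()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            numbers = [i for i, line in enumerate(lines, 1) if needle in line]
            if numbers:
                hits[path] = numbers
    return hits
```

Calling it with the Tensile and rocBLAS folder paths returns the files to review, which is less error-prone than a blind replace across both trees.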

**Important Files to Modify:**

* **Tensile:** Within the Tensile folder, make changes to:
  * `CMakeLists.txt`: This file configures the build process and needs adjustments for new architectures.
  * `AMDGPU.hpp`: Defines the architecture-specific interface.
  * `PlaceholderLibrary.hpp`, `Predicates.hpp`, `OclUtils.cpp`: These files contain code related to specific functionalities, which might require modifications for your target GPU.

* **rocBLAS:** In the rocBLAS folder:
  * `CMakeLists.txt`: Similar to Tensile, update this file for your new architecture.
  * `handle.cpp`, `tensile_host.cpp`, `handle.hpp`: These files are likely involved in communication and
interactions between rocBLAS and the GPU.

**Caution:**

* Modifying these core files can have unintended consequences.

**Advanced Usage:**

For maximum performance optimization, delve deeper into Tensile's logic files. Examples are provided in
`rocBLAS\library\src\blas3\Tensile\Logic\asm_full`.

For truly optimized libraries, you'll need to fine-tune these logic files specifically for your target hardware. The [Tensile Tuning Guide](https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU/wiki/Tensile-tuning-Guide) provides practical guidance and techniques for starting this process. Keep in mind that the tuning process requires patience, time, and a willingness to delve into Tensile's inner workings.

More detail can be found in the Tensile [tuning](https://github.com/ROCm/Tensile/tree/develop/tuning) folder and the Tensile [tuning .tex](https://github.com/ROCm/Tensile/blob/develop/tuning_docs/tensile_tuning.tex) source. A PDF version is available in the [tensile tuning guide](https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU/blob/main/tensile_tuning.pdf).

Please feel welcome to edit this post and contribute optimized logic links. Remember to carefully consider the
impact of any edits or additions.