Home
Welcome to the spine-FlagGems support status page for SpacemiT SoCs.
FlagGems is a high-performance generic operator library implemented in the Triton language and is part of FlagOS. It registers with the ATen backend of PyTorch to provide seamless operator replacement, enabling "develop once, run anywhere" across diverse AI accelerators.
The SpacemiT backend (_spacemit) is a CPU backend powered by Spine Triton (triton.backends.spine_triton.driver.CPUDriver). It leverages SpacemiT's SMT (SpacemiT Tensor) instruction set extensions — including smt.descriptor_load, smt.view, smt.dot, smt.alloc, and smt.mbarrier — to deliver high-performance GEMM and pointwise operations on K1/K3 SoCs.
- Repo: flagos-ai/FlagGems
- Triton dependency: `triton==3.6.0+spacemit.a5` (SpacemiT Triton package)
- Supported SoCs:
| Aspect | Detail |
|---|---|
| Vendor | `SPACEMIT = 13` in vendors enum (`runtime/common.py`) |
| Device | CPU via Spine Triton; BF16 and INT64 marked unsupported in `device.py` |
| Driver | `CPUDriver` from `triton.backends.spine_triton.driver`, set once at backend init |
| GEMM strategy | 4 execution paths: EVEN_K (fast SMT dot), SPLIT_M, SPLIT_N (pipelined with smt.mbarrier), SPLIT_K (tiled reduction) |
| Autotuning | `TunedConfigLoader` (migrated from `ConfigLoader` in PR #3793) with pre_hook validation |
| Arch configs | `LEGAL_CONFIGS` keyed by arch_id (0x503C, 0xA03C, 0xA064, 0xF000) with per-op, per-dtype MICRO_M/K/N constraints |
| Pointwise codegen | Custom KernelGenerator / ModuleGenerator overriding block pointer stride order for SpacemiT memory layout |
| Heuristics | `HEURISTICS_CONFIGS` for 20+ operators; block sizes adapted to tensor dimensions and SM count |
| Component | Submitted time | Status | Link | Owner | Comments |
|---|---|---|---|---|---|
| Add SpacemiT runtime backend | Apr 19, 2026 | Merged (May 14, 2026) | #2527 | alex-spacemit / zuoweixia497 | 20 operators, vendor detection, codegen config, tuning infra |
| Fix mm/bmm/gelu/argmax & bump Triton to a5 | Jun 5, 2026 | Open | #3793 | alex-spacemit | out-variant ops, in-place GELU, empty-tensor argmax, TunedConfigLoader migration |
| Operator | Status | SMT dot | EVEN_K fast path | Comments |
|---|---|---|---|---|
| `mm` | ✔️ | ✔️ | ✔️ | 4-path GEMM (EVEN_K / SPLIT_M / SPLIT_N / SPLIT_K) |
| `bmm` | ✔️ | ✔️ | ✔️ | K==1 fast path via `outer/mul` |
| `addmm` | ✔️ | ✔️ | ✔️ | Implemented but commented out in `__init__.py` |
| `mv` | ✔️ | — | — | Matrix-vector multiply |
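
The K==1 fast path listed for `bmm` rests on a simple identity: when the shared dimension is 1, each batch's matrix product is an outer product, which can be computed as a broadcast multiply. The following is a minimal pure-Python sketch of that identity on nested lists, not the backend's kernel; the function names `bmm_reference` and `bmm_k1_outer` are hypothetical.

```python
def bmm_reference(a, b):
    """Naive batched matmul over nested lists: a is (B, M, K), b is (B, K, N)."""
    out = []
    for a_mat, b_mat in zip(a, b):
        k_dim = len(b_mat)
        n_dim = len(b_mat[0])
        out.append([
            [sum(row[k] * b_mat[k][n] for k in range(k_dim)) for n in range(n_dim)]
            for row in a_mat
        ])
    return out


def bmm_k1_outer(a, b):
    """K == 1 special case: each batch is a column vector times a row vector."""
    out = []
    for a_mat, b_mat in zip(a, b):
        # Guard: this shortcut is only valid when the shared dimension is 1.
        assert len(b_mat) == 1 and all(len(row) == 1 for row in a_mat)
        col = [row[0] for row in a_mat]  # shape (M,)
        row_vec = b_mat[0]               # shape (N,)
        out.append([[x * y for y in row_vec] for x in col])
    return out


a = [[[1.0], [2.0]], [[3.0], [4.0]]]        # (B=2, M=2, K=1)
b = [[[5.0, 6.0, 7.0]], [[1.0, 0.0, 2.0]]]  # (B=2, K=1, N=3)
result = bmm_k1_outer(a, b)
```

Because no reduction loop over K is needed, the special case does a single multiply per output element, which is presumably why it is worth a dedicated path.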
| Operator | Forward | Backward | In-place | Comments |
|---|---|---|---|---|
| `gelu` | ✔️ | ✔️ | ✔️ | `tanh`/`none` approximate; uses `geluTanh`/`geluNone` SMT intrinsics |
| `silu` | ✔️ | ✔️ | — | Uses `_silu` SMT intrinsic |
| `sigmoid` | ✔️ | ✔️ | — | Pointwise dynamic codegen |
| Operator | Status | Comments |
|---|---|---|
| `softmax` | ✔️ | SpacemiT-specific kernel with 2-pass max+exp; backward delegates to common |
| `layer_norm` | ✔️ | Implemented but commented out in `__init__.py` |
| `batch_norm` | ✔️ | Supports 2D–4D input via 3D view conversion |
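
The "2-pass max+exp" note for `softmax` can be read as the standard max-subtraction scheme: one sweep finds the row maximum, a second sweep exponentiates the shifted values and accumulates their sum. The sketch below shows that scheme in plain Python for a single row. It is an illustration under that assumption, not the SpacemiT kernel, and `softmax_row` is a hypothetical name.

```python
import math


def softmax_row(row):
    """Numerically stable softmax of one row using max subtraction."""
    # Pass 1: find the row maximum so exp() never sees a large positive input.
    row_max = max(row)
    # Pass 2: exponentiate the shifted values and accumulate their sum.
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    # Final normalization by the accumulated sum.
    return [e / total for e in exps]


# Inputs this large would overflow math.exp() without the max shift.
probs = softmax_row([1000.0, 1001.0, 1002.0])
```

Subtracting the maximum leaves the result mathematically unchanged, since the common factor cancels between numerator and denominator.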
| Operator | Status | Comments |
|---|---|---|
| `argmax` | ✔️ | 2-stage kernel + multi-dim kernel; empty tensor early-return (PR #3793) |
| `argmin` | ✔️ | Mirrors argmax structure |
| `mean_dim` | ✔️ | 2-stage reduction |
| `global_avg_pool` | ✔️ | Fixed block size config |
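
A two-stage reduction, as listed for `argmax` and `mean_dim`, typically has each block produce a partial result that a second stage then combines. The sketch below shows that pattern for argmax over a flat list in plain Python. The name `argmax_two_stage` and the `block_size` parameter are hypothetical, and the empty-tensor early return from PR #3793 is not reproduced because this page does not state what it returns.

```python
def argmax_two_stage(values, block_size=4):
    """Two-stage argmax over a flat list; ties resolve to the lowest index."""
    if not values:
        # Assumption: empty input is out of scope for this sketch.
        raise ValueError("argmax of an empty sequence is not modeled here")
    # Stage 1: each block reports its local (max value, global index).
    partials = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        local = max(range(len(block)), key=lambda i: block[i])
        partials.append((block[local], start + local))
    # Stage 2: reduce the per-block partial results to a single winner.
    best_val, best_idx = partials[0]
    for val, idx in partials[1:]:
        if val > best_val:  # strict comparison keeps the earliest index on ties
            best_val, best_idx = val, idx
    return best_idx


data = [3, 9, 1, 4, 9, 2, 7, 0, 5]
winner = argmax_two_stage(data, block_size=4)
```

Carrying the global index alongside each partial maximum is what lets the second stage stay a plain reduction over a much smaller array.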
| Operator | Status | Variants |
|---|---|---|
| `where` | ✔️ | `where_self`, `where_self_out`, `where_scalar_self`, `where_scalar_other` |
| `pow` | ✔️ | `pow_tensor_tensor`, `pow_tensor_scalar`, `pow_scalar`, + in-place variants |
| `rsqrt` | ✔️ | Forward + in-place (`rsqrt_`) |
| Operator | Status | Comments |
|---|---|---|
| `conv2d` | ✔️ (code) | Fused im2col + bmm kernel; not yet exported |
| `conv1d` | ✔️ (code) | Reshapes to 2D → delegates to `conv2d` |
| `conv_depthwise2d` | ✔️ (code) | Delegates to `conv2d` with `groups=C` |
| `thnn_conv2d` | ✔️ (code) | THNN-compatible wrapper |
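
The im2col approach named for `conv2d` turns convolution into a matrix product: every kernel-sized patch is unfolded into a row, and the rows are multiplied by the flattened kernel. Below is a minimal single-channel sketch in plain Python (stride 1, no padding), followed by `conv1d` expressed as a height-1 `conv2d`. All function names are hypothetical, the height-1 reshape is an assumption about how the delegation works, and the backend's fused kernel is not reproduced here.

```python
def im2col(image, kh, kw):
    """Unfold each kh x kw patch of a 2D image (stride 1, no padding) into a row."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj] for di in range(kh) for dj in range(kw)])
    return rows


def conv2d_im2col(image, kernel):
    """Single-channel cross-correlation as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, kh, kw)
    out_w = len(image[0]) - kw + 1
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    # Fold the flat result back into (out_h, out_w).
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]


def conv1d_via_conv2d(signal, kernel):
    """conv1d treated as conv2d on a height-1 image (assumed reshape)."""
    return conv2d_im2col([signal], [kernel])[0]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
out2d = conv2d_im2col(image, kernel)
out1d = conv1d_via_conv2d([1, 2, 3, 4], [1, -1])
```

With multiple channels and batches the patch matrix gains a batch dimension, which is where a batched matmul would take over from the inner `sum` used here.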
| Operator | Status | Comments |
|---|---|---|
| `flash_attention` | ✔️ (code) | SMT-accelerated; supports causal masking |
| `scaled_dot_product_attention` | ✔️ (code) | Wraps `flash_attention` |
The `LEGAL_CONFIGS` table maps CPU arch_id → operator → dtype → valid (MICRO_M, MICRO_K, MICRO_N) tuples. At kernel launch, a pre_hook validates the autotuner's chosen config and replaces any illegal combination with the arch-legal default.
| arch_id | SoC | float32 MICRO (M, K, N) | float16 MICRO (M, K, N) |
|---|---|---|---|
| 0x503C | K1 / X60 | (8, 32, 32) | (16, 8, 32) |
| 0xA03C | K1 / A60 | (8, 32, 32) | (8, 16, 16) |
| 0xA064 | K3 / A100 | (8, 32, 32) | (16, 8, 32) |
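
The validate-and-fix step described above can be sketched as a nested lookup followed by a fallback. The snippet below is a hedged illustration only: the dictionary layout, the `"mm"` operator key, the dtype strings, and the function name `validate_micro_config` are assumptions, and the tuple values are copied from the table above. The real pre_hook signature in the backend is not shown on this page.

```python
# Hypothetical layout: arch_id -> operator -> dtype -> legal (MICRO_M, MICRO_K, MICRO_N).
# "mm" is used as an example operator key; the first tuple is treated as the default.
LEGAL_CONFIGS = {
    0x503C: {"mm": {"float32": [(8, 32, 32)], "float16": [(16, 8, 32)]}},
    0xA03C: {"mm": {"float32": [(8, 32, 32)], "float16": [(8, 16, 16)]}},
    0xA064: {"mm": {"float32": [(8, 32, 32)], "float16": [(16, 8, 32)]}},
}


def validate_micro_config(arch_id, op, dtype, config):
    """Return a copy of config whose MICRO_* values are legal for this arch."""
    legal = LEGAL_CONFIGS[arch_id][op][dtype]
    chosen = (config["MICRO_M"], config["MICRO_K"], config["MICRO_N"])
    fixed = dict(config)
    if chosen not in legal:
        # Illegal combination: fall back to the arch-legal default tuple.
        micro_m, micro_k, micro_n = legal[0]
        fixed.update(MICRO_M=micro_m, MICRO_K=micro_k, MICRO_N=micro_n)
    return fixed


tuned = {"MICRO_M": 16, "MICRO_K": 8, "MICRO_N": 32, "BLOCK_M": 64}
on_a60 = validate_micro_config(0xA03C, "mm", "float16", tuned)
on_x60 = validate_micro_config(0x503C, "mm", "float16", tuned)
```

In this example the same tuned config passes unchanged on 0x503C but is rewritten on 0xA03C, where the float16 tuple differs; keys other than MICRO_M/K/N are left untouched.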
- BF16 / INT64 unsupported: Marked as unsupported dtypes in `device.py`
- Backward passes incomplete: Most backward tests skipped; only GELU, SiLU, sigmoid, and softmax have backward kernels
- Convolution & attention not exported: Code exists but commented out in `ops/__init__.py`
- CPU-only: No GPU or other accelerator support; targets SpacemiT RISC-V CPUs
- Operator count gap: ~20 exported operators vs 200+ in upstream FlagGems; significant coverage gap remains
- TunedConfigLoader migration in progress (PR #3793): Still uses fallback `ConfigLoader` import
| Month | Summary | Updated by |
|---|---|---|
| 2026-06 | Initial wiki created | alex-spacemit / yutingnie |