alex-spacemit edited this page Jun 5, 2026 · 1 revision

spine-FlagGems Support Status

Welcome to the spine-FlagGems support status page for SpacemiT SoCs.

Overview

FlagGems is a high-performance, general-purpose operator library written in the Triton language and part of FlagOS. It registers with PyTorch's ATen backend to replace operators transparently, enabling "develop once, run anywhere" across diverse AI accelerators.

The SpacemiT backend (_spacemit) is a CPU backend powered by Spine Triton (triton.backends.spine_triton.driver.CPUDriver). It leverages SpacemiT's SMT (SpacemiT Tensor) instruction set extensions — including smt.descriptor_load, smt.view, smt.dot, smt.alloc, and smt.mbarrier — to deliver high-performance GEMM and pointwise operations on K1/K3 SoCs.
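
As a usage sketch, operator replacement is normally switched on once at program start through the upstream `flag_gems.enable()` entry point. The helper name `enable_flaggems_if_available` below is hypothetical, and the sketch assumes (the page does not state this) that vendor detection then selects the SpacemiT backend without further configuration.

```python
import importlib.util


def enable_flaggems_if_available() -> bool:
    """Enable FlagGems operator replacement if the package is installed.

    Hypothetical helper: returns False instead of raising when flag_gems
    is missing, so the same script also runs on machines without it.
    """
    if importlib.util.find_spec("flag_gems") is None:
        return False
    import flag_gems

    # Upstream entry point: registers the FlagGems kernels with ATen.
    flag_gems.enable()
    return True
```

Calling the helper before building a model leaves the rest of the PyTorch code unchanged; operators without a FlagGems implementation keep using the stock ATen kernels.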

Architecture

Key Design Decisions

| Aspect | Detail |
| --- | --- |
| Vendor | SPACEMIT = 13 in vendors enum (runtime/common.py) |
| Device | CPU via Spine Triton; BF16 and INT64 marked unsupported in device.py |
| Driver | CPUDriver from triton.backends.spine_triton.driver, set once at backend init |
| GEMM strategy | 4 execution paths: EVEN_K (fast SMT dot), SPLIT_M, SPLIT_N (pipelined with smt.mbarrier), SPLIT_K (tiled reduction) |
| Autotuning | TunedConfigLoader with pre_hook validation (migration from ConfigLoader is part of PR #3793, still open) |
| Arch configs | LEGAL_CONFIGS keyed by arch_id (0x503C, 0xA03C, 0xA064, 0xF000) with per-op, per-dtype MICRO_M/K/N constraints |
| Pointwise codegen | Custom KernelGenerator / ModuleGenerator overriding block pointer stride order for SpacemiT memory layout |
| Heuristics | HEURISTICS_CONFIGS for 20+ operators; block sizes adapted to tensor dimensions and SM count |
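
The heuristics row can be illustrated with a minimal sketch. None of the names below come from the SpacemiT sources: `pick_block_size` and `is_even_k` are hypothetical, the 1024 cap is an arbitrary example value, and the EVEN_K condition follows the common Triton convention (K divisible by the K block size), which this page does not confirm for the SpacemiT backend.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def pick_block_size(dim: int, max_block: int = 1024) -> int:
    """Hypothetical heuristic: round the dimension up to a power of two,
    then clamp to an upper bound (1024 is an assumed example cap)."""
    return min(max_block, next_power_of_2(dim))


def is_even_k(k: int, block_k: int) -> bool:
    """Assumed EVEN_K condition: K splits into whole BLOCK_K tiles, so the
    kernel needs no boundary masking along K."""
    return k % block_k == 0


print(pick_block_size(48), is_even_k(128, 32), is_even_k(100, 32))
```

When `is_even_k` holds, a GEMM kernel can skip the masked tail iteration, which is the usual reason such a path is the fastest of the four.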

Upstream Component Status

| Component | Submitted | Status | Link | Owner | Comments |
| --- | --- | --- | --- | --- | --- |
| Add SpacemiT runtime backend | Apr 19, 2026 | Merged (May 14, 2026) | #2527 | alex-spacemit / zuoweixia497 | 20 operators, vendor detection, codegen config, tuning infra |
| Fix mm/bmm/gelu/argmax & bump Triton to a5 | Jun 5, 2026 | Open | #3793 | alex-spacemit | out-variant ops, in-place GELU, empty-tensor argmax, TunedConfigLoader migration |

Operator Support Status

BLAS / GEMM Operators

| Operator | Status | SMT dot | EVEN_K fast path | Comments |
| --- | --- | --- | --- | --- |
| mm | ✔️ | ✔️ | ✔️ | 4-path GEMM (EVEN_K / SPLIT_M / SPLIT_N / SPLIT_K) |
| bmm | ✔️ | ✔️ | ✔️ | K==1 fast path via outer/mul |
| addmm | ✔️ | ✔️ | ✔️ | Implemented but commented out in __init__.py |
| mv | ✔️ | | | Matrix-vector multiply |
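
The K==1 fast path noted for bmm rests on a simple identity: when the shared dimension is 1, each batch product is an outer product, so no reduction loop is needed. The pure-Python sketch below illustrates the identity on nested lists; the function names are hypothetical and it says nothing about how the actual Triton kernel is written.

```python
def bmm_k1(a, b):
    """a: [B][M][1], b: [B][1][N] -> [B][M][N] using only multiplies."""
    out = []
    for a_mat, b_mat in zip(a, b):
        column = [row[0] for row in a_mat]  # the single column of A
        row_vec = b_mat[0]                  # the single row of B
        out.append([[x * y for y in row_vec] for x in column])
    return out


def bmm_reference(a, b):
    """General batched matmul with an explicit reduction over K."""
    out = []
    for a_mat, b_mat in zip(a, b):
        k, n = len(b_mat), len(b_mat[0])
        out.append(
            [[sum(a_row[p] * b_mat[p][j] for p in range(k)) for j in range(n)]
             for a_row in a_mat]
        )
    return out


a = [[[1.0], [2.0]]]       # B=1, M=2, K=1
b = [[[3.0, 4.0, 5.0]]]    # B=1, K=1, N=3
print(bmm_k1(a, b))
```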

Activation Operators

| Operator | Forward | Backward | In-place | Comments |
| --- | --- | --- | --- | --- |
| gelu | ✔️ | ✔️ | ✔️ | tanh/none approximate; uses geluTanh/geluNone SMT intrinsics |
| silu | ✔️ | ✔️ | | Uses _silu SMT intrinsic |
| sigmoid | ✔️ | ✔️ | | Pointwise dynamic codegen |

Normalization Operators

| Operator | Status | Comments |
| --- | --- | --- |
| softmax | ✔️ | SpacemiT-specific kernel with 2-pass max+exp; backward delegates to the common implementation |
| layer_norm | ✔️ | Implemented but commented out in __init__.py |
| batch_norm | ✔️ | Supports 2D–4D input via 3D view conversion |
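
One reading of the "2-pass max+exp" note on softmax is the standard numerically stable formulation: a first pass finds the row maximum, a second pass exponentiates the shifted values and accumulates their sum. The sketch below shows that formulation for a single row in plain Python; it is an assumption about what the note means, not a transcription of the kernel.

```python
import math


def softmax_two_pass(row):
    """Numerically stable softmax over one row of floats."""
    if not row:
        return []
    # Pass 1: row maximum, used to shift values so exp() cannot overflow.
    row_max = max(row)
    # Pass 2: exponentiate the shifted values and accumulate their sum.
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


print(softmax_two_pass([1000.0, 1000.0]))
```

Subtracting the maximum leaves the result unchanged mathematically but keeps every exponent at or below zero, which matters for float16 inputs in particular.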

Reduction Operators

| Operator | Status | Comments |
| --- | --- | --- |
| argmax | ✔️ | 2-stage kernel + multi-dim kernel; empty tensor early-return (PR #3793) |
| argmin | ✔️ | Mirrors argmax structure |
| mean_dim | ✔️ | 2-stage reduction |
| global_avg_pool | ✔️ | Fixed block size config |
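
The 2-stage structure used by argmax and mean_dim can be sketched as follows: stage 1 reduces each block to a partial result, stage 2 reduces the partials. The block size, the function name, and the choice to return `None` for empty input are all assumptions made for illustration; the page only says that PR #3793 adds an early return for empty tensors, not what it returns.

```python
def argmax_two_stage(values, block_size=4):
    """Index of the first maximum, computed in two reduction stages."""
    if not values:
        # Early return for empty input (the return value is an assumption).
        return None

    # Stage 1: each block yields its local (max value, global index) pair.
    partials = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        local = max(range(len(block)), key=block.__getitem__)
        partials.append((block[local], start + local))

    # Stage 2: reduce the per-block partials; strict '>' keeps the
    # earliest index when several blocks share the same maximum.
    best_val, best_idx = partials[0]
    for val, idx in partials[1:]:
        if val > best_val:
            best_val, best_idx = val, idx
    return best_idx


print(argmax_two_stage([3, 1, 4, 1, 5, 9, 2, 6]))
```

In a kernel, stage 1 runs in parallel across blocks and stage 2 is a small follow-up launch over the partial buffer; the sequential loops here only model the data flow.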

Pointwise Operators

| Operator | Status | Variants |
| --- | --- | --- |
| where | ✔️ | where_self, where_self_out, where_scalar_self, where_scalar_other |
| pow | ✔️ | pow_tensor_tensor, pow_tensor_scalar, pow_scalar, + in-place variants |
| rsqrt | ✔️ | Forward + in-place (rsqrt_) |

Convolution Operators (code complete, not exported)

| Operator | Status | Comments |
| --- | --- | --- |
| conv2d | ✔️ (code) | Fused im2col + bmm kernel; not yet exported |
| conv1d | ✔️ (code) | Reshapes to 2D → delegates to conv2d |
| conv_depthwise2d | ✔️ (code) | Delegates to conv2d with groups=C |
| thnn_conv2d | ✔️ (code) | THNN-compatible wrapper |
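
The im2col-plus-matmul idea behind conv2d, and the conv1d delegation, can be illustrated in plain Python. This sketch handles one sample with stride 1 and no padding, and it materializes the column matrix explicitly, whereas the table describes a fused kernel. All function names are hypothetical.

```python
def im2col(x, kh, kw):
    """x: [C][H][W] -> (columns [out_h*out_w][C*kh*kw], out_h, out_w)."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = []
    for i in range(out_h):
        for j in range(out_w):
            # Flatten one receptive field in (channel, row, col) order.
            cols.append([x[ch][i + di][j + dj]
                         for ch in range(c)
                         for di in range(kh)
                         for dj in range(kw)])
    return cols, out_h, out_w


def conv2d_im2col(x, weight):
    """weight: [O][C][kh][kw] -> output [O][out_h][out_w]."""
    kh, kw = len(weight[0][0]), len(weight[0][0][0])
    cols, out_h, out_w = im2col(x, kh, kw)
    # Flatten each filter in the same (channel, row, col) order.
    flat_w = [[v for ch in filt for row in ch for v in row] for filt in weight]
    out = []
    for w_row in flat_w:
        flat = [sum(a * b for a, b in zip(w_row, col)) for col in cols]
        out.append([flat[r * out_w:(r + 1) * out_w] for r in range(out_h)])
    return out


def conv1d_via_conv2d(x, weight):
    """x: [C][L], weight: [O][C][k] -> [O][out_l] via a height-1 2D conv."""
    x2d = [[ch] for ch in x]                         # [C][1][L]
    w2d = [[[k] for k in filt] for filt in weight]   # [O][C][1][k]
    return [o[0] for o in conv2d_im2col(x2d, w2d)]


print(conv1d_via_conv2d([[1, 2, 3, 4]], [[[1, 1]]]))
```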

Attention Operators (code complete, not exported)

| Operator | Status | Comments |
| --- | --- | --- |
| flash_attention | ✔️ (code) | SMT-accelerated; supports causal masking |
| scaled_dot_product_attention | ✔️ (code) | Wraps flash_attention |
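
For reference, the computation these operators implement is softmax(QKᵀ / √d) · V, optionally with a causal mask. The sketch below is a plain reference version for a single head, not a flash-attention (tiled, online-softmax) implementation, and it assumes the usual top-left causal alignment in which query i attends to keys 0..i.

```python
import math


def attention(q, k, v, causal=False):
    """q: [Lq][D], k: [Lk][D], v: [Lk][Dv] -> [Lq][Dv]."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, q_row in enumerate(q):
        # With causal masking, query i only sees keys 0..i.
        limit = min(i + 1, len(k)) if causal else len(k)
        scores = [scale * sum(a * b for a, b in zip(q_row, k[j]))
                  for j in range(limit)]
        # Stable softmax over the visible keys.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights)) / total
                    for d in range(len(v[0]))])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(attention(q, k, v, causal=True)[0])
```

With the causal mask, the first query sees only the first key, so its output equals the first value row exactly.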

Tuning Infrastructure

Architecture-Specific GEMM Configs (config_pre_hook.py)

The LEGAL_CONFIGS table maps CPU arch_id → operator → dtype → valid (MICRO_M, MICRO_K, MICRO_N) tuples. At kernel launch, a pre_hook validates the config chosen by the autotuner and replaces any illegal combination with the arch-legal default.

| arch_id | SoC | float32 MICRO (M, K, N) | float16 MICRO (M, K, N) |
| --- | --- | --- | --- |
| 0x503C | K1 / X60 | (8, 32, 32) | (16, 8, 32) |
| 0xA03C | K1 / A60 | (8, 32, 32) | (8, 16, 16) |
| 0xA064 | K3 / A100 | (8, 32, 32) | (16, 8, 32) |
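
The validation step can be sketched as a lookup followed by a fallback. The nesting below mirrors the arch_id → operator → dtype description and uses the tuple values from the table, but the operator key `"mm"`, the dtype strings, the function name `legalize_config`, and the rule "first legal tuple is the default" are assumptions made for illustration; the real config_pre_hook.py may differ.

```python
# Values from the table above; the "mm" key is an assumed example operator.
LEGAL_CONFIGS = {
    0x503C: {"mm": {"float32": [(8, 32, 32)], "float16": [(16, 8, 32)]}},
    0xA03C: {"mm": {"float32": [(8, 32, 32)], "float16": [(8, 16, 16)]}},
    0xA064: {"mm": {"float32": [(8, 32, 32)], "float16": [(16, 8, 32)]}},
}


def legalize_config(arch_id, op, dtype, config):
    """Return a copy of `config` whose MICRO_* values are legal for the arch.

    Hypothetical stand-in for the pre_hook: a legal config is passed through
    unchanged; an illegal one gets the first legal tuple for that arch.
    """
    legal = LEGAL_CONFIGS[arch_id][op][dtype]
    chosen = (config["MICRO_M"], config["MICRO_K"], config["MICRO_N"])
    fixed = dict(config)
    if chosen not in legal:
        micro_m, micro_k, micro_n = legal[0]
        fixed.update(MICRO_M=micro_m, MICRO_K=micro_k, MICRO_N=micro_n)
    return fixed


tuned = {"MICRO_M": 16, "MICRO_K": 8, "MICRO_N": 32, "BLOCK_M": 64}
print(legalize_config(0xA03C, "mm", "float16", tuned))
```

In this example the tuned float16 config is legal on 0x503C and 0xA064 but not on 0xA03C, so only the MICRO_* fields are rewritten there and unrelated keys such as `BLOCK_M` are preserved.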

Known Limitations

  1. BF16 / INT64 unsupported: Both are marked as unsupported dtypes in device.py
  2. Backward passes incomplete: Most backward tests are skipped; only GELU, SiLU, sigmoid, and softmax have backward kernels
  3. Convolution & attention not exported: The code exists but is commented out in ops/__init__.py
  4. CPU-only: No GPU or other accelerator support; targets SpacemiT RISC-V CPUs
  5. Operator coverage gap: About 20 operators are exported, versus 200+ in upstream FlagGems
  6. TunedConfigLoader migration in progress (PR #3793): The code still falls back to importing ConfigLoader

Monthly Update Log

| Month | Summary | Updated by |
| --- | --- | --- |
| 2026-06 | Initial wiki created | alex-spacemit yutingnie |