Home

spine-triton Support Status

Welcome to the spine-triton support status page for SpacemiT SoCs.

Overview

spine-triton is a fork of microsoft/triton-shared, which provides a shared middle layer for Triton compilation. spine-triton implements a CPU backend for Triton that targets SpacemiT's RISC-V-based AI processors (X60, X100, A60, A100), so Triton kernels can be compiled and executed on SpacemiT hardware without a GPU.

The project lowers Triton's high-level kernel language to efficient RISC-V Vector (RVV), SpacemiT IME, and SpacemiT AME code through a multi-stage MLIR-based compilation pipeline. Custom dialects cover SpacemiT-specific hardware features, including packed tensor cores, descriptor-based memory access, and thread synchronization primitives.

Current version: 3.6.0+spacemit.a5

Architecture

┌─────────────────────────────────────────────────┐
│                Triton Kernel (Python)              │
├─────────────────────────────────────────────────┤
│  language/smt  │  language/tle  │  language/cpu   │
│  (XSMT builtins)│ (Tile Ops)    │  (CPU utils)    │
├─────────────────────────────────────────────────┤
│              Triton IR (TTIR)                     │
├─────────────────────────────────────────────────┤
│         spine-triton-opt (MLIR Passes)            │
│  TTIR → Structured → Unstructured → Memref        │
│  → Linalg (with XSMT/TLE dialect lowering)       │
├─────────────────────────────────────────────────┤
│           spine-opt (spine-mlir)                  │
│  Linalg MLIR → LLVM MLIR → LLVM IR               │
├─────────────────────────────────────────────────┤
│        LLVM opt/llc → .so (RISC-V RVV)           │
├─────────────────────────────────────────────────┤
│           CPUDriver / CPULauncher                 │
│         (Dynamic loading & execution)             │
└─────────────────────────────────────────────────┘

Compilation Pipeline

Compilation proceeds through four stages, defined in backend/compiler.py:

Stage	Input	Output	Tool
`ttir`	Triton IR	Optimized TTIR	Triton pass manager
`linalgdir`	TTIR	Linalg MLIR	`spine-triton-opt --triton-to-linalg-experimental`
`llir`	Linalg MLIR	LLVM IR	`spine-opt --spine-triton-e2e-pipeline` + `mlir-translate`
`so`	LLVM IR	Shared object	`opt` + `llc` + `g++`
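
The stage table can be read as a chain of transforms, each consuming the previous stage's output. The sketch below models only that chaining in plain Python. The `Stage` class and the `make_stage` and `run_pipeline` helpers are hypothetical names introduced for illustration, and the stage bodies are placeholders that tag the artifact instead of invoking the listed tools; it is not the contents of backend/compiler.py.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str                   # stage key from the table above
    tool: str                   # tool listed for that stage
    run: Callable[[str], str]   # placeholder transform (hypothetical)


def make_stage(name: str, tool: str) -> Stage:
    # A real stage would invoke the tool; this one only tags the artifact.
    return Stage(name, tool, lambda artifact: f"{name}({artifact})")


STAGES: List[Stage] = [
    make_stage("ttir", "Triton pass manager"),
    make_stage("linalgdir", "spine-triton-opt --triton-to-linalg-experimental"),
    make_stage("llir", "spine-opt --spine-triton-e2e-pipeline + mlir-translate"),
    make_stage("so", "opt + llc + g++"),
]


def run_pipeline(source: str) -> str:
    """Feed the output of each stage into the next, in table order."""
    artifact = source
    for stage in STAGES:
        artifact = stage.run(artifact)
    return artifact


if __name__ == "__main__":
    print(run_pipeline("kernel"))
```

Keeping the stages in an ordered list makes the dependency explicit: each stage can only see what the previous one produced.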

Custom MLIR Dialects

spine-triton defines five custom MLIR dialects for SpacemiT-specific operations:

Dialect	Namespace	Purpose
XSMT	`xsmt`	Core SpacemiT ops: pack/unpack/repack, mmt4d, alloc, barriers
XSMTAsync	`xsmt_async`	Async memory barrier lifecycle (alloc/arrive/wait/release)
TLE	`tle`	Triton Language Extension: extract_tile, insert_tile
TritonTilingExt	`ttx`	Extended tiling ops (cumsum) implementing MLIR's TilingInterface
TritonStructured	`tts`	Structured operations dialect

Key XSMT Operations

Operation	Description
`xsmt.pack`	Pack 2D tensor → 4D packed layout (tile decomposition)
`xsmt.unpack`	Unpack 4D packed tensor → 2D
`xsmt.repack`	Change 4D packed tile size (unpack + repack)
`xsmt.subview`	Create pointer subview preserving packing
`xsmt.subview_pack`	Create subview with new packed tile layout
`xsmt.mmt4d`	4D matrix multiplication with packed tensors
`xsmt.alloc`	Allocate tensor in specified memory (l2/shared)
`xsmt.alloc_copies`	Allocate multi-copy buffer tensor
`xsmt.mbarrier_copies`	Allocate multiple memory barrier instances
`xsmt.descriptor_load_view`	Fused descriptor load + view operation

Python Language Layer

The language/ directory provides Python-level APIs for kernel authors:

Module	Key Functions
`smt` (SpacemiT Triton)	`descriptor_load`, `view` (pack/unpack/repack/subview), `alloc`, `alloc_copies`, `dot` (mmt4d), `mbarrier`, `barrier_arrive`/`barrier_wait`, `parallel`, `compile_hint`, `get_num_of_thread`
`tle` (Triton Lang Extension)	`extract_tile`, `insert_tile`
`cpu`	`utils`, `libdevice` (CPU-specific math functions)
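
The `tle` tile functions can be pictured with a small list-based model. This is a sketch of the intended semantics only: the argument names, the use of tile coordinates (rather than element offsets), and the value-returning behaviour are assumptions for illustration, not the actual `tle.extract_tile` or `tle.insert_tile` signatures.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def extract_tile(x: Matrix, tile_index: Tuple[int, int],
                 tile_shape: Tuple[int, int]) -> Matrix:
    """Return the tile at tile coordinates (ti, tj) as a new 2D list."""
    ti, tj = tile_index
    th, tw = tile_shape
    return [row[tj * tw:(tj + 1) * tw] for row in x[ti * th:(ti + 1) * th]]


def insert_tile(x: Matrix, tile: Matrix, tile_index: Tuple[int, int]) -> Matrix:
    """Return a copy of x with `tile` written at tile coordinates (ti, tj)."""
    ti, tj = tile_index
    th, tw = len(tile), len(tile[0])
    out = [row[:] for row in x]  # copy so the input stays unchanged
    for r in range(th):
        out[ti * th + r][tj * tw:(tj + 1) * tw] = tile[r]
    return out


if __name__ == "__main__":
    x = [[r * 4 + c for c in range(4)] for r in range(4)]
    tile = extract_tile(x, (1, 0), (2, 2))
    print(tile)
    print(insert_tile(x, [[0, 0], [0, 0]], (0, 1)))
```

In this model the tile index can be a literal or a value computed at run time, which mirrors the static and dynamic indexing modes.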

Key Design Features

4D Packed Tensor Layout: 2D matrices of shape [M, N] are packed into a 4D [M/m, N/n, m, n] layout, where m × n is the tile size, for efficient tensor core operations. The pack/unpack/repack operations handle the layout transformations.
Destination-Passing Style (DPS): Operations support optional destination tensors to avoid intermediate allocations, enabling memory-efficient operation chaining.
Descriptor-Based Load: The descriptor_load operation provides efficient block memory access with boundary checking.
Memory Barriers: Hardware memory barriers (mbarrier) for thread synchronization in multi-core execution, supporting double/triple buffering patterns.
Multi-Copy Buffers: alloc_copies and mbarrier_copies support software pipelining with multiple buffer copies.
Tile Operations: extract_tile/insert_tile support fine-grained tile manipulation with both static (compile-time) and dynamic (runtime) indexing.
Proton Profiling: Kernel profiling based on the RISC-V rdtime instruction, with Chrome Trace and Hatchet format output.
RISC-V Vector Extension: Targets RVV 1.0 (the v extension) together with zfh (scalar half-precision float), zvfh (vector half-precision float), zicbop (cache-block prefetch), and xsmtvdotii (SpacemiT IME2).
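
As an illustration of one feature from the list above, the boundary checking of a descriptor-based load can be modelled in a few lines. This is a minimal sketch: the function signature is hypothetical, and filling out-of-bounds elements with a pad value of zero is an assumption, since this page does not say what the real operation returns past the tensor edge.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def descriptor_load(base: Matrix, shape: Tuple[int, int],
                    offsets: Tuple[int, int], block_shape: Tuple[int, int],
                    pad: int = 0) -> Matrix:
    """Load a block at `offsets`, substituting `pad` outside `shape`."""
    rows, cols = shape
    r0, c0 = offsets
    bh, bw = block_shape
    return [[base[r][c] if 0 <= r < rows and 0 <= c < cols else pad
             for c in range(c0, c0 + bw)]
            for r in range(r0, r0 + bh)]


if __name__ == "__main__":
    data = [[r * 3 + c for c in range(3)] for r in range(3)]
    # The 2x2 block at (2, 2) hangs over the bottom-right corner.
    print(descriptor_load(data, (3, 3), (2, 2), (2, 2)))
```

Carrying the tensor shape in the descriptor lets every block load use the same code path, with edge blocks handled by the bounds test instead of separate mask arithmetic in the kernel.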

Supported CPU Architectures

Arch ID	CPU Name	Target
`0x503C`	spacemit-x60	K1
`0x5064`	spacemit-x100	K3
`0xA03C`	spacemit-a60	K1
`0xA064`	spacemit-a100	K3
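
The table above maps directly onto a lookup structure. The sketch below shows one way to express it; `ARCH_TABLE` and `lookup_arch` are hypothetical names, and how the arch ID is read from the hardware at run time is not covered on this page.

```python
from typing import Dict, Tuple

# Arch ID -> (CPU name, target), copied from the table above.
ARCH_TABLE: Dict[int, Tuple[str, str]] = {
    0x503C: ("spacemit-x60", "K1"),
    0x5064: ("spacemit-x100", "K3"),
    0xA03C: ("spacemit-a60", "K1"),
    0xA064: ("spacemit-a100", "K3"),
}


def lookup_arch(arch_id: int) -> Tuple[str, str]:
    """Return (cpu_name, target) or raise ValueError for unknown IDs."""
    try:
        return ARCH_TABLE[arch_id]
    except KeyError:
        raise ValueError(f"unsupported arch ID {arch_id:#06x}") from None


if __name__ == "__main__":
    cpu_name, target = lookup_arch(0x5064)
    print(cpu_name, target)
```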

Module Support Status

Component	Submitted time	Status	Link	Owner	Comments
TLE dialect (extract/insert tile)	-	WIP	-	zuoweixia497	Triton Language Extension dialect
RISC-V target support (AME)	-	WIP	-	alex-spacemit	-
TLE language module	-	WIP	-	zuoweixia497	extract_tile, insert_tile
Proton CPU profiling	-	WIP	-	zuoweixia497	rdtime-based timing, Chrome Trace output

Monthly Update Log

Month	Summary	Updated by
2026-05	Initial wiki created from spine-triton internal source documentation	alex-spacemit
