alex-spacemit edited this page Jun 5, 2026 · 3 revisions

spine-triton Support Status

Welcome to the spine-triton support status page for SpacemiT SoCs.

Overview

spine-triton is a fork of microsoft/triton-shared, which provides a shared middle layer for Triton compilation. The fork implements a CPU backend for Triton that targets SpacemiT's RISC-V-based AI processors (X60, X100, A60, A100), so Triton kernels can be compiled and executed on SpacemiT hardware without a GPU.

The project lowers Triton's high-level kernel language to RISC-V Vector (RVV), SpacemiT IME, and SpacemiT AME code through a multi-stage MLIR-based compilation pipeline. Custom dialects cover SpacemiT-specific hardware features, including packed tensor cores, descriptor-based memory access, and thread synchronization primitives.

  • Current version: 3.6.0+spacemit.a5

Architecture

┌──────────────────────────────────────────────────┐
│              Triton Kernel (Python)              │
├──────────────────────────────────────────────────┤
│  language/smt  │  language/tle  │  language/cpu  │
│ (XSMT builtins)│   (Tile Ops)   │  (CPU utils)   │
├──────────────────────────────────────────────────┤
│                 Triton IR (TTIR)                 │
├──────────────────────────────────────────────────┤
│          spine-triton-opt (MLIR Passes)          │
│  TTIR → Structured → Unstructured → Memref       │
│  → Linalg (with XSMT/TLE dialect lowering)       │
├──────────────────────────────────────────────────┤
│              spine-opt (spine-mlir)              │
│  Linalg MLIR → LLVM MLIR → LLVM IR               │
├──────────────────────────────────────────────────┤
│         LLVM opt/llc → .so (RISC-V RVV)          │
├──────────────────────────────────────────────────┤
│             CPUDriver / CPULauncher              │
│          (Dynamic loading & execution)           │
└──────────────────────────────────────────────────┘

Compilation Pipeline

The compilation flows through four stages defined in backend/compiler.py:

| Stage | Input | Output | Tool |
|---|---|---|---|
| ttir | Triton IR | Optimized TTIR | Triton pass manager |
| linalgdir | TTIR | Linalg MLIR | spine-triton-opt --triton-to-linalg-experimental |
| llir | Linalg MLIR | LLVM IR | spine-opt --spine-triton-e2e-pipeline + mlir-translate |
| so | LLVM IR | Shared object | opt + llc + g++ |
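
The stage chain in the table above can be modeled as an ordered list in which each stage consumes the artifact produced by the previous one. The sketch below is only an illustration of that structure: the `Stage` class, `make_stub`, and `run_pipeline` are hypothetical names, not code from backend/compiler.py, and the stub runners merely record what they were given instead of invoking any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str       # stage key, as listed in the table
    consumes: str   # artifact the stage takes as input
    produces: str   # artifact the stage emits
    tool: str       # tool(s) named in the table


# Rows copied from the table; the Stage type itself is hypothetical.
STAGES = [
    Stage("ttir", "Triton IR", "Optimized TTIR", "Triton pass manager"),
    Stage("linalgdir", "TTIR", "Linalg MLIR",
          "spine-triton-opt --triton-to-linalg-experimental"),
    Stage("llir", "Linalg MLIR", "LLVM IR",
          "spine-opt --spine-triton-e2e-pipeline + mlir-translate"),
    Stage("so", "LLVM IR", "Shared object", "opt + llc + g++"),
]


def make_stub(stage, trace):
    """Build a stand-in runner that logs its input and returns the stage's output label."""
    def run(artifact):
        trace.append((stage.name, artifact))
        return stage.produces
    return run


def run_pipeline(source, stages, runners):
    """Thread one artifact through every stage, in order."""
    artifact = source
    for stage in stages:
        artifact = runners[stage.name](artifact)
    return artifact


trace = []
runners = {stage.name: make_stub(stage, trace) for stage in STAGES}
result = run_pipeline("Triton IR", STAGES, runners)
print(result)
for name, artifact in trace:
    print(f"{name}: received {artifact}")
```

Keeping the stages in an ordered list makes the dependency between them explicit: a stage can only run once the artifact from the stage before it exists.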

Custom MLIR Dialects

spine-triton defines five custom MLIR dialects for SpacemiT-specific operations:

| Dialect | Namespace | Purpose |
|---|---|---|
| XSMT | xsmt | Core SpacemiT ops: pack/unpack/repack, mmt4d, alloc, barriers |
| XSMTAsync | xsmt_async | Async memory barrier lifecycle (alloc/arrive/wait/release) |
| TLE | tle | Triton Language Extension: extract_tile, insert_tile |
| TritonTilingExt | ttx | Extended tiling interface (cumsum) with TilingInterface |
| TritonStructured | tts | Structured operations dialect |

Key XSMT Operations

| Operation | Description |
|---|---|
| xsmt.pack | Pack 2D tensor → 4D packed layout (tile decomposition) |
| xsmt.unpack | Unpack 4D packed tensor → 2D |
| xsmt.repack | Change 4D packed tile size (unpack + repack) |
| xsmt.subview | Create pointer subview preserving packing |
| xsmt.subview_pack | Create subview with new packed tile layout |
| xsmt.mmt4d | 4D matrix multiplication with packed tensors |
| xsmt.alloc | Allocate tensor in specified memory (l2/shared) |
| xsmt.alloc_copies | Allocate multi-copy buffer tensor |
| xsmt.mbarrier_copies | Allocate multiple memory barrier instances |
| xsmt.descriptor_load_view | Fused descriptor load + view operation |
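
To make the packed layout concrete, here is a minimal NumPy reference model of pack, unpack, repack, and a 4D packed matmul. It is a sketch under stated assumptions, not the XSMT implementation: the function names and signatures are illustrative, dimensions are assumed to divide evenly by the tile size, and the operand layout of `mmt4d` (right-hand side stored transposed) is borrowed from upstream MLIR's `linalg.mmt4d`, which this page does not confirm for `xsmt.mmt4d`.

```python
import numpy as np


def pack(x, m, n):
    """2D [M, N] -> 4D [M/m, N/n, m, n]. Assumes M % m == 0 and N % n == 0."""
    M, N = x.shape
    if M % m or N % n:
        raise ValueError("tile size must divide the matrix shape")
    return x.reshape(M // m, m, N // n, n).transpose(0, 2, 1, 3).copy()


def unpack(p):
    """4D [Mo, No, m, n] -> 2D [Mo*m, No*n]; inverse of pack."""
    Mo, No, m, n = p.shape
    return p.transpose(0, 2, 1, 3).reshape(Mo * m, No * n)


def repack(p, m, n):
    """Change the tile size of a packed tensor (unpack, then pack again)."""
    return pack(unpack(p), m, n)


def mmt4d(lhs, rhs, acc=None):
    """Packed matmul, ASSUMING linalg.mmt4d-style operands:
    lhs [M1, K1, M0, K0], rhs [N1, K1, N0, K0] (transposed), result [M1, N1, M0, N0]."""
    out = np.einsum("mkac,nkbc->mnab", lhs, rhs)
    return out if acc is None else acc + out


A = np.arange(8 * 6).reshape(8, 6)   # [M, K]
B = np.arange(6 * 4).reshape(6, 4)   # [K, N]

lhs = pack(A, 4, 3)      # tiles of 4x3 -> shape (2, 2, 4, 3)
rhs = pack(B.T, 2, 3)    # B transposed, tiles of 2x3 -> shape (2, 2, 2, 3)
C = unpack(mmt4d(lhs, rhs))

print(lhs.shape, rhs.shape, C.shape)
print(np.array_equal(C, A @ B))
```

The round trip through `pack` and `unpack` is lossless, so the packed product can be checked directly against the ordinary 2D `A @ B`.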

Python Language Layer

The language/ directory provides Python-level APIs for kernel authors:

| Module | Key Functions |
|---|---|
| smt (SpacemiT Triton) | descriptor_load, view (pack/unpack/repack/subview), alloc, alloc_copies, dot (mmt4d), mbarrier, barrier_arrive/barrier_wait, parallel, compile_hint, get_num_of_thread |
| tle (Triton Language Extension) | extract_tile, insert_tile |
| cpu | utils, libdevice (CPU-specific math functions) |
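
The tile operations in the tle module can be pictured with a small NumPy model. This is a sketch only: the signatures below are hypothetical and are not the `tle.extract_tile` / `tle.insert_tile` API, and it assumes that a tile index counts whole tiles rather than elements.

```python
import numpy as np


def extract_tile(x, index, tile_shape):
    """Return a copy of tile (i, j) of x, where the index counts whole tiles."""
    (i, j), (th, tw) = index, tile_shape
    return x[i * th:(i + 1) * th, j * tw:(j + 1) * tw].copy()


def insert_tile(x, tile, index):
    """Return a copy of x with `tile` written at tile position (i, j)."""
    th, tw = tile.shape
    i, j = index
    out = x.copy()
    out[i * th:(i + 1) * th, j * tw:(j + 1) * tw] = tile
    return out


x = np.arange(16).reshape(4, 4)
tile = extract_tile(x, (1, 0), (2, 2))        # rows 2..3, columns 0..1
cleared = insert_tile(x, np.zeros_like(tile), (1, 0))

print(tile)
print(cleared)
```

Both helpers return new arrays instead of mutating their input, which mirrors value semantics on tensors; an index computed at run time works the same way as a constant one in this model.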

Key Design Features

  1. 4D Packed Tensor Layout: 2D matrices of shape [M, N] are packed into a 4D [M/m, N/n, m, n] layout, where m × n is the tile size, for efficient tensor core operations. The pack/unpack/repack operations handle the layout transformations.

  2. Destination-Passing Style (DPS): Operations support optional destination tensors to avoid intermediate allocations, enabling memory-efficient operation chaining.

  3. Descriptor-Based Load: The descriptor_load operation provides efficient block memory access with boundary checking.

  4. Memory Barriers: Hardware memory barriers (mbarrier) for thread synchronization in multi-core execution, supporting double/triple buffering patterns.

  5. Multi-Copy Buffers: alloc_copies and mbarrier_copies support software pipelining with multiple buffer copies.

  6. Tile Operations: extract_tile/insert_tile support fine-grained tile manipulation with both static (compile-time) and dynamic (runtime) indexing.

  7. Proton Profiling: RISC-V rdtime instruction-based kernel profiling with Chrome Trace and Hatchet format output.

  8. RISC-V Vector Extension: Targets RVV 1.0 (the v extension) together with zfh (scalar half-precision float), zvfh (vector half-precision float), zicbop (cache-block prefetch), and xsmtvdotii (SpacemiT IME2).
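
Two of the features above, descriptor-based loads with boundary checking and destination-passing style, can be sketched together in NumPy. This is an illustrative model, not the smt API: `load_block` and its parameters are hypothetical, and filling out-of-bounds elements with zero is an assumption about what boundary checking produces.

```python
import numpy as np


def load_block(src, offsets, block_shape, out=None, fill=0):
    """Copy a block of src starting at `offsets` into `out`.

    Elements outside src are set to `fill` (assumed behavior). Passing `out`
    reuses an existing buffer, in destination-passing style.
    """
    r0, c0 = offsets
    bh, bw = block_shape
    if out is None:
        out = np.empty((bh, bw), dtype=src.dtype)
    elif out.shape != (bh, bw):
        raise ValueError("destination shape must match block_shape")
    out[...] = fill

    # Clip the requested window to the bounds of the source.
    r_lo, r_hi = max(r0, 0), min(r0 + bh, src.shape[0])
    c_lo, c_hi = max(c0, 0), min(c0 + bw, src.shape[1])
    if r_lo < r_hi and c_lo < c_hi:
        out[r_lo - r0:r_hi - r0, c_lo - c0:c_hi - c0] = src[r_lo:r_hi, c_lo:c_hi]
    return out


src = np.arange(12).reshape(3, 4)

edge = load_block(src, (2, 3), (2, 2))          # block hangs over the corner
buf = np.empty((2, 2), dtype=src.dtype)
same = load_block(src, (0, 0), (2, 2), out=buf)  # writes into buf, no new allocation

print(edge)
print(same is buf)
```

Because the caller owns the destination, a chain of such operations can keep reusing one buffer instead of allocating an intermediate result at every step.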

Supported CPU Architectures

| Arch ID | CPU Name | Target |
|---|---|---|
| 0x503C | spacemit-x60 | K1 |
| 0x5064 | spacemit-x100 | K3 |
| 0xA03C | spacemit-a60 | K1 |
| 0xA064 | spacemit-a100 | K3 |
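
As a small illustration of how the table above could be consulted, the sketch below maps an arch ID to its CPU name and target. The `ARCH_TABLE` dictionary and `cpu_name_for` helper are hypothetical; how spine-triton actually reads the arch ID and selects a CPU is not described on this page.

```python
# Entries copied from the table; the lookup helper is hypothetical.
ARCH_TABLE = {
    0x503C: ("spacemit-x60", "K1"),
    0x5064: ("spacemit-x100", "K3"),
    0xA03C: ("spacemit-a60", "K1"),
    0xA064: ("spacemit-a100", "K3"),
}


def cpu_name_for(arch_id):
    """Return the CPU name for a supported arch ID, or raise ValueError."""
    try:
        return ARCH_TABLE[arch_id][0]
    except KeyError:
        raise ValueError(f"unsupported arch ID: {arch_id:#06x}") from None


print(cpu_name_for(0x5064))
```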

Module Support Status

| Component | Submitted time | Status | Link | Owner | Comments |
|---|---|---|---|---|---|
| TLE dialect (extract/insert tile) | - | WIP | - | zuoweixia497 | Triton Language Extension dialect |
| RISC-V target support (AME) | - | WIP | - | alex-spacemit | |
| TLE language module | - | WIP | - | zuoweixia497 | extract_tile, insert_tile |
| Proton CPU profiling | - | WIP | - | zuoweixia497 | rdtime-based timing, Chrome Trace output |

Monthly Update Log

| Month | Summary | Updated by |
|---|---|---|
| 2026-05 | Initial wiki created from spine-triton internal source documentation | alex-spacemit |