Add load/store words for reduced-width memory types #52

@tetsuo-cpp

Summary

Add load/store words for 32-bit and narrower memory types (i32, f32, f16, bf16, i8, i16) for both global and shared memory. Values are widened to the native i64 stack cell on load and narrowed on store.

Motivation

GPU workloads use a variety of data widths:

  • f32/i32: The native compute precision for GPUs. f32 is the baseline for ML; i32 is the standard integer width.
  • f16/bf16: Half-precision types used for bandwidth-efficient storage. Most ML inference and training uses these for activations and weights.
  • i8/i16: Used in quantized models and integer indexing.

Currently, @/! load and store i64 and F@/F! load and store f64, but real GPU kernels also need to access 32-bit and narrower memory types. The stack remains memref<256xi64> because GPU pointers are 64-bit and must fit in a stack cell. Narrower values are therefore widened to i64 when loaded and narrowed when stored.

Design

Load/store words

All words take an address (i64) from the stack and either load a value (widened to i64) or store a value (narrowed from i64).

| Global | Shared | Memory type | Load widening | Store narrowing |
|---|---|---|---|---|
| @ / ! | S@ / S! | i64 | (unchanged) | (unchanged) |
| F@ / F! | SF@ / SF! | f64 | (unchanged) | (unchanged) |
| HF@ / HF! | SHF@ / SHF! | f16 | extf f16→f64, bitcast f64→i64 | bitcast i64→f64, truncf f64→f16 |
| BF@ / BF! | SBF@ / SBF! | bf16 | extf bf16→f64, bitcast f64→i64 | bitcast i64→f64, truncf f64→bf16 |
| I8@ / I8! | SI8@ / SI8! | i8 | extsi i8→i64 | trunci i64→i8 |
| I16@ / I16! | SI16@ / SI16! | i16 | extsi i16→i64 | trunci i64→i16 |
| I32@ / I32! | SI32@ / SI32! | i32 | extsi i32→i64 | trunci i64→i32 |
| F32@ / F32! | SF32@ / SF32! | f32 | extf f32→f64, bitcast f64→i64 | bitcast i64→f64, truncf f64→f32 |
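
To make the widening and narrowing rules concrete, here is a rough Python model of what ends up in a stack cell. It is only a sketch of the intended semantics, not the lowering itself: the helper names are made up for illustration, cells are modelled as unsigned 64-bit bit patterns, and `struct` stands in for `extf`/`truncf` (it raises `OverflowError` for out-of-range values, where a real `truncf` would be expected to produce infinity). The bf16 store path is left out because its rounding is not modelled here.

```python
import struct

MASK64 = (1 << 64) - 1  # a stack cell is modelled as an unsigned 64-bit pattern


def sign_extend(value, bits):
    """Model extsi iN->i64: sign-extend the low `bits` bits into a cell."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):      # sign bit set, so the value is negative
        value -= 1 << bits
    return value & MASK64


def truncate(cell, bits):
    """Model trunci i64->iN: keep only the low `bits` bits of the cell."""
    return cell & ((1 << bits) - 1)


def f64_to_cell(x):
    """Model bitcast f64->i64."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def cell_to_f64(cell):
    """Model bitcast i64->f64."""
    return struct.unpack("<d", struct.pack("<Q", cell))[0]


def load_float(raw, fmt):
    """Model extf then bitcast; fmt is 'e' for f16 or 'f' for f32."""
    (x,) = struct.unpack("<" + fmt, raw)
    return f64_to_cell(x)


def store_float(cell, fmt):
    """Model bitcast then truncf; returns the bytes written to memory."""
    return struct.pack("<" + fmt, cell_to_f64(cell))


def load_bf16(raw):
    """Model the bf16 load: bf16 is the top 16 bits of an f32, so widening is exact."""
    (bits,) = struct.unpack("<H", raw)
    (x,) = struct.unpack("<f", struct.pack("<I", bits << 16))
    return f64_to_cell(x)
```

One consequence worth calling out: an integer store followed by a load only round-trips values that fit in the narrow type, and a float store followed by a load returns the value rounded to the narrow precision, not the original f64.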

What does NOT change

  • Arithmetic: All operations remain i64/f64 as they are today.
  • Stack: Stays memref<256xi64> with i64 cells.
  • Existing words: @/! (i64), F@/F! (f64), S@/S!, SF@/SF! remain unchanged.
  • Kernel parameters: Still declared as i64/f64 in \! headers.
  • CELLS: Still 8 (sizeof i64).
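
Since CELLS stays at 8, offsets into arrays of narrower elements would have to be scaled by the element size instead. A small sketch, assuming addresses on the stack are byte addresses (the issue does not state this) and using a hypothetical `element_address` helper:

```python
CELL_SIZE = 8  # what CELLS multiplies by (sizeof i64)

# Element sizes in bytes for each memory type covered by this issue.
ELEMENT_SIZE = {
    "i8": 1, "i16": 2, "f16": 2, "bf16": 2,
    "i32": 4, "f32": 4, "i64": 8, "f64": 8,
}


def element_address(base, index, mem_type):
    """Byte address of element `index` in an array of `mem_type` at `base`.

    Assumes byte addressing; only i64/f64 offsets coincide with `index CELLS`.
    """
    return base + index * ELEMENT_SIZE[mem_type]
```

Under that assumption, `index CELLS +` would only land on the right element for i64 and f64 arrays; an f16 array would need `index 2 * +` before HF@ or HF!.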

Implementation notes

  • Each new word needs a dialect op in ForthOps.td and a conversion pattern in ForthToMemRef.cpp.
  • The parser (ForthToMLIR.cpp) maps each word name to the corresponding op.
  • Shared variants use the same shared memory infrastructure as existing S@/S!/SF@/SF!.
  • f16 and bf16 are MLIR builtin types, so no new type definitions are needed in the dialect.
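
The word names in the design table follow a regular pattern (optional S prefix for shared memory, a type prefix, then @ or !), so the parser mapping could be generated rather than written out by hand. The sketch below illustrates that pattern in Python only; the actual parser is C++, and the tuple layout and function name are assumptions for illustration, not the real op names.

```python
# Type prefixes for the new words, taken from the design table.
NEW_TYPES = {
    "HF": "f16", "BF": "bf16",
    "I8": "i8", "I16": "i16", "I32": "i32", "F32": "f32",
}


def build_word_table():
    """Map each new word name to (kind, memory space, memory type)."""
    table = {}
    for prefix, mem_type in NEW_TYPES.items():
        for space, space_prefix in (("global", ""), ("shared", "S")):
            for suffix, kind in (("@", "load"), ("!", "store")):
                table[space_prefix + prefix + suffix] = (kind, space, mem_type)
    return table
```

This yields 24 new words (6 types × 2 memory spaces × load/store) and leaves the existing @/!, F@/F!, S@/S!, and SF@/SF! entries untouched.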
