Add load/store words for reduced-width memory types #52

## Summary

Add load/store words for 32-bit and narrower memory types (i32, f32, f16, bf16, i8, i16) for both global and shared memory. Values are widened to the native i64 stack cell on load and narrowed on store.

## Motivation

GPU workloads use a variety of data widths:

- **f32/i32**: The native compute precision for GPUs. f32 is the baseline for ML; i32 is the standard integer width.
- **f16/bf16**: Half-precision types used for bandwidth-efficient storage. Most ML inference and training uses these for activations and weights.
- **i8/i16**: Used in quantized models and integer indexing.

Currently `@`/`!` load and store i64, and `F@`/`F!` load and store f64. Real GPU kernels also need to access 32-bit and narrower memory types. The stack remains `memref<256xi64>` because GPU pointers are 64-bit and must fit in a stack cell. Narrower values are widened to i64 when loaded and narrowed when stored; floating-point values are extended to f64 and held on the stack as the f64 bit pattern.

## Design

### Load/store words

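All words take an address (i64) from the stack and either load a value (widened to i64) or store a value (narrowed from i64). Integer types are sign-extended on load (`extsi`) and truncated on store (`trunci`). Floating-point types are extended to f64 and bitcast to i64 on load, then bitcast back to f64 and truncated on store.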

| Global | Shared | Memory type | Load widening | Store narrowing |
|--------|--------|-------------|---------------|-----------------|
| `@` / `!` | `S@` / `S!` | i64 | (unchanged) | (unchanged) |
| `F@` / `F!` | `SF@` / `SF!` | f64 | (unchanged) | (unchanged) |
| `HF@` / `HF!` | `SHF@` / `SHF!` | f16 | extf f16→f64, bitcast f64→i64 | bitcast i64→f64, truncf f64→f16 |
| `BF@` / `BF!` | `SBF@` / `SBF!` | bf16 | extf bf16→f64, bitcast f64→i64 | bitcast i64→f64, truncf f64→bf16 |
| `I8@` / `I8!` | `SI8@` / `SI8!` | i8 | extsi i8→i64 | trunci i64→i8 |
| `I16@` / `I16!` | `SI16@` / `SI16!` | i16 | extsi i16→i64 | trunci i64→i16 |
| `I32@` / `I32!` | `SI32@` / `SI32!` | i32 | extsi i32→i64 | trunci i64→i32 |
| `F32@` / `F32!` | `SF32@` / `SF32!` | f32 | extf f32→f64, bitcast f64→i64 | bitcast i64→f64, truncf f64→f32 |
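
The widening and narrowing columns can be modelled outside the compiler. The following is a minimal Python sketch of those semantics, not the MLIR lowering itself. All function names are hypothetical, and memory is assumed to be little-endian. A bf16 store is left out because Python's `struct` module has no bf16 format and correct rounding needs more care than this sketch takes.

```python
import struct

def load_int(raw, bits):
    """extsi iN -> i64: sign-extend the low `bits` bits of a memory value."""
    raw &= (1 << bits) - 1
    return raw - (1 << bits) if raw >> (bits - 1) else raw

def store_int(cell, bits):
    """trunci i64 -> iN: keep only the low `bits` bits of the stack cell."""
    return cell & ((1 << bits) - 1)

def f64_to_cell(x):
    """bitcast f64 -> i64: reinterpret the f64 bit pattern as a signed cell."""
    return struct.unpack("<q", struct.pack("<d", x))[0]

def cell_to_f64(cell):
    """bitcast i64 -> f64: reinterpret the cell as an f64."""
    return struct.unpack("<d", struct.pack("<q", cell))[0]

def load_f32(raw_bytes):
    """extf f32 -> f64, then bitcast f64 -> i64."""
    return f64_to_cell(struct.unpack("<f", raw_bytes)[0])

def store_f32(cell):
    """bitcast i64 -> f64, then truncf f64 -> f32 (rounds to nearest f32).

    Note: struct raises OverflowError for out-of-range values, whereas a
    hardware truncation would typically produce infinity.
    """
    return struct.pack("<f", cell_to_f64(cell))

def load_f16(raw_bytes):
    """extf f16 -> f64, then bitcast f64 -> i64."""
    return f64_to_cell(struct.unpack("<e", raw_bytes)[0])

def store_f16(cell):
    """bitcast i64 -> f64, then truncf f64 -> f16 (same overflow caveat)."""
    return struct.pack("<e", cell_to_f64(cell))

def load_bf16(raw16):
    """extf bf16 -> f64: bf16 is the upper 16 bits of an f32, so the
    widening is exact. The result is then bitcast to i64."""
    f32_bits = (raw16 & 0xFFFF) << 16
    x = struct.unpack("<f", struct.pack("<I", f32_bits))[0]
    return f64_to_cell(x)
```

Two consequences are visible in this model. An i8 byte of `0xFF` loads as -1 rather than 255 because the table specifies `extsi`, and a value such as 0.1 does not survive a store through `F32!` unchanged because `truncf` rounds it to the nearest f32.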

### What does NOT change

- **Arithmetic**: All operations remain i64/f64 as they are today.
- **Stack**: Stays `memref<256xi64>` with i64 cells.
- **Existing words**: `@`/`!` (i64), `F@`/`F!` (f64), `S@`/`S!`, `SF@`/`SF!` remain unchanged.
- **Kernel parameters**: Still declared as i64/f64 in `\!` headers.
- **CELLS**: Still 8 (sizeof i64).

## Implementation notes

- Each new word needs a dialect op in `ForthOps.td` and a conversion pattern in `ForthToMemRef.cpp`.
- The parser (`ForthToMLIR.cpp`) maps each word name to the corresponding op.
- Shared variants use the same shared memory infrastructure as existing `S@`/`S!`/`SF@`/`SF!`.
- f16 and bf16 are MLIR builtin types (`f16`, `bf16`); no dialect extension needed.
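
The word names follow a regular scheme, so the full set the parser has to recognise can be enumerated mechanically. The sketch below is a Python model of that naming scheme only. The actual mapping lives in the C++ parser, and the tuple layout here is a hypothetical stand-in for whatever op each word maps to.

```python
# (word prefix, memory type) for each new reduced-width type in the table.
NEW_TYPES = [
    ("HF", "f16"),
    ("BF", "bf16"),
    ("I8", "i8"),
    ("I16", "i16"),
    ("I32", "i32"),
    ("F32", "f32"),
]

# Existing words, which must keep their current meaning.
EXISTING_WORDS = {"@", "!", "F@", "F!", "S@", "S!", "SF@", "SF!"}

def word_table():
    """Map each new word to (memory space, memory type, load or store)."""
    table = {}
    for prefix, mem_type in NEW_TYPES:
        # Shared-memory variants prepend "S" to the global word.
        for space, space_prefix in (("global", ""), ("shared", "S")):
            for suffix, kind in (("@", "load"), ("!", "store")):
                table[space_prefix + prefix + suffix] = (space, mem_type, kind)
    return table

if __name__ == "__main__":
    for word, info in sorted(word_table().items()):
        print(word, info)
```

Six types, two memory spaces, and two directions give 24 new words, none of which collides with an existing word.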
