### UOps annotated

Here is a comprehensive list of UOps, together with (AI-generated) annotations on their purpose and use:

#### Non-rendered UOps
- **NAME**: Assigns a name, typically to a kernel function. Used for naming generated functions for readability and debugging.
  - arg: str (the name)
  - src: ()

- **SINK**: Marks the termination point of a computation graph or kernel. Defines the final outputs of a kernel and acts as the root for graph traversal and scheduling.
  - arg: Any (often None or Metadata)
  - src: tuple[UOp, ...] (The UOps whose results are the final outputs)

- **CONTIGUOUS**: Marker indicating that the input tensor data must be made contiguous in memory. Often optimized away if the input is already contiguous.
  - arg: None
  - src: (UOp,) (The tensor to make contiguous)

- **CONTIGUOUS_BACKWARD**: Similar to CONTIGUOUS, but specifically inserted during the backward pass.
  - arg: None
  - src: (UOp,)

- **DETACH**: Detaches the tensor from the computation graph for gradient calculation. Stops gradients from flowing back through this tensor.
  - arg: None
  - src: (UOp,) (The tensor to detach)

- **KERNEL**: Represents a computation kernel to be executed. Encapsulates a unit of work for the scheduler and code generator.
  - arg: KernelInfo (Contains metadata about the kernel)
  - src: tuple[UOp, ...] (Source buffers and values needed by the kernel)

- **UNIQUE**: Represents a unique identifier, typically associated with a buffer. Prevents unwanted fusion or aliasing.
  - arg: int (A unique integer ID)
  - src: ()

- **EMPTY**: Represents an uninitialized buffer placeholder. Used during Tensor creation before data is allocated.
  - arg: None
  - src: (UOp,) (Typically a VIEW wrapping a DEVICE UOp)
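
To make the `arg`/`src` notation used throughout this list concrete, here is a minimal sketch of a graph node and a traversal rooted at a SINK. `ToyUOp` and `toposort` are hypothetical names invented for this illustration; they are not the library's classes and only mirror the op/arg/src shape described above.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class ToyUOp:
    """Hypothetical stand-in for a UOp: an op name, an arg, and source nodes."""
    op: str
    arg: Any = None
    src: tuple["ToyUOp", ...] = ()

def toposort(root: ToyUOp) -> list[ToyUOp]:
    """Return every node reachable from root, with sources before consumers."""
    order: list[ToyUOp] = []
    seen: set[int] = set()

    def visit(node: ToyUOp) -> None:
        if id(node) in seen:
            return
        seen.add(id(node))
        for s in node.src:
            visit(s)
        order.append(node)

    visit(root)
    return order

# A tiny graph: a buffer on a device, made contiguous, then marked as the output.
device = ToyUOp("DEVICE", arg="CPU")
unique = ToyUOp("UNIQUE", arg=0)
buf = ToyUOp("BUFFER", arg=16, src=(device, unique))
out = ToyUOp("CONTIGUOUS", src=(buf,))
sink = ToyUOp("SINK", src=(out,))

ops_in_order = [u.op for u in toposort(sink)]
print(ops_in_order)
```

Starting the walk at the SINK visits exactly the nodes that contribute to the final outputs, which is why it serves as the root for traversal.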

#### Meta Operations
- **COPY**: Copies data from one buffer/device to another. Handles data movement between devices or explicit cloning.
  - arg: bool (clone=True forces a new buffer allocation)
  - src: (UOp<DEVICE>, UOp) (Destination device, Source tensor/buffer UOp)

- **BUFFER_VIEW**: Creates a view into an existing buffer with a specified size and offset. For accessing sub-regions without copying.
  - arg: tuple[int, int] (size, offset in bytes)
  - src: (UOp<BUFFER>,) (The base buffer)
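
The idea behind BUFFER_VIEW (a sub-region that shares storage with its base) can be sketched with Python's built-in `memoryview`. This is only an analogy: the `size` and `offset` values are both treated as byte counts here, which is an assumption of this sketch.

```python
base = bytearray(range(16))      # a 16-byte base buffer holding 0..15
size, offset = 4, 8              # assumed (size, offset) pair, both in bytes

# Slicing a memoryview does not copy: the view aliases the base buffer.
view = memoryview(base)[offset:offset + size]
view_contents = list(view)

# Writing through the view is visible in the base buffer.
view[0] = 255
print(view_contents, base[8])
```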

#### Block Operations
- **BLOCK**: Internal node representing a basic block of UOps during linearization. Groups UOps for code generation.
  - arg: BasicBlock (Contains context and list of UOps in the block)
  - src: tuple[UOp, ...] (Dependencies for the block)

- **BLOCKSTART**: Internal node marking the start of a block context. Used by the linearizer to manage block context.
  - arg: None
  - src: (UOp,) (The RANGE or IF UOp that starts the block)

- **BLOCKFORK**: Internal node representing a point where a block's result is used by multiple subsequent blocks.
  - arg: int (Number of children using this fork)
  - src: (UOp<BLOCK>,) (The block being forked)

- **BLOCKEND**: Internal node marking the end of a block context. Used to manage block context and potentially merge blocks.
  - arg: BasicBlock (Contains context, end instruction, and the RANGE/IF UOp being ended)
  - src: tuple[UOp, ...] (UOps feeding into the end of the block)

#### Movement Operations
- **RESHAPE, PERMUTE, EXPAND, PAD, SHRINK, FLIP**: Represent memory layout transformations without changing the underlying data.
  - arg: Varies by operation (shapes, dimensions, padding info, etc.)
  - src: (UOp,) (The source tensor/buffer UOp)
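
A small sketch of why these operations need not touch the data: with a flat storage list plus a shape, strides, and an offset, several of them reduce to changing the indexing arithmetic. The `read` helper is hypothetical and exists only to materialize a view for inspection.

```python
from itertools import product

def read(data, shape, strides, offset=0):
    """Materialize a strided view as a flat list in row-major order."""
    return [data[offset + sum(i * s for i, s in zip(idx, strides))]
            for idx in product(*(range(n) for n in shape))]

data = [0, 1, 2, 3, 4, 5]                 # flat storage for a 2x3 tensor
original = read(data, (2, 3), (3, 1))

# PERMUTE: swap the axes by swapping the shape and stride entries.
permuted = read(data, (3, 2), (1, 3))

# EXPAND: broadcast a size-1 axis by giving it stride 0.
expanded = read([7, 8, 9], (2, 3), (0, 1))

# FLIP: negate the stride and start at the last element of that axis.
flipped = read(data, (2, 3), (3, -1), offset=2)

print(original, permuted, expanded, flipped)
```

In every case `data` is left untouched; only the mapping from logical index to storage position changes.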

#### Miscellaneous Operations
- **UNROLL**: Represents an unrolled loop dimension during vectorization/expansion.
  - arg: tuple[tuple[int, int], ...] (Axis and size for each unrolled dimension)
  - src: (UOp,) (The UOp being unrolled)

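- **CONTRACT**: Represents the contraction of unrolled dimensions during vectorization/expansion. It is the counterpart of UNROLL: the listed axes are gathered into a single vector value rather than summed.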
  - arg: tuple[tuple[int, int], ...] (Axis and size for each contracted dimension)
  - src: (UOp,) (The UOp being contracted)

- **VIEW**: Applies a ShapeTracker to a base UOp, representing memory layout without copying. Core mechanism for all movement ops.
  - arg: ShapeTracker
  - src: (UOp,) (The base UOp, typically a BUFFER or another VIEW)

- **DEFINE_GLOBAL**: Defines a global buffer used in a kernel. Represents kernel arguments (input/output buffers).
  - arg: int | str (Buffer index or name)
  - dtype: PtrDType | ImageDType
  - src: ()

- **BUFFER**: Represents a memory buffer allocated on a specific device. Base for most data operations.
  - arg: int (Size in elements)
  - dtype: DType | ImageDType
  - src: (UOp<DEVICE>, UOp<UNIQUE>)

- **DEFINE_VAR**: Defines a symbolic variable (e.g., for tensor shapes). Allows representing shapes or loop bounds symbolically.
  - arg: tuple[str, ConstType, ConstType] (name, min_val, max_val)
  - dtype: DType (Typically int)
  - src: ()

- **DEFINE_LOCAL**: Defines a buffer in local (shared) memory. Used for communication between threads in a workgroup.
  - arg: str (Name/ID)
  - dtype: PtrDType(local=True)
  - src: ()

- **DEFINE_ACC**: Defines an accumulator register, typically for reductions. Holds intermediate values during reduction operations.
  - arg: int (Accumulator index)
  - src: (UOp<CONST>, *UOp<RANGE>) (Initial value, followed by the loop ranges it depends on)

- **VALID**: Represents the validity mask derived from a ShapeTracker's padding or shrinking. Used in indexing to determine bounds.
  - arg: None
  - dtype: bool
  - src: (UOp<VIEW>,)

- **SPECIAL**: Represents special variables like thread or block indices. Provides access to hardware-specific indices.
  - arg: tuple[tuple[str, int], int] ((name, axis), limit)
  - dtype: int
  - src: ()

- **NOOP**: No operation, passes through its source. Used as a placeholder or to break graph structures.
  - arg: None
  - src: (UOp,)
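
The SPECIAL entry above can be illustrated by simulating a one-dimensional launch grid in plain Python. The `launch` helper, the `gidx`/`lidx` parameter names, and the sequential execution order are all assumptions of this sketch; real hardware runs the work items in parallel.

```python
def launch(global_size, local_size, kernel):
    """Call kernel once per (gidx, lidx) pair to imitate a 1-D launch grid."""
    for gidx in range(global_size):
        for lidx in range(local_size):
            kernel(gidx, lidx)

LOCAL_SIZE = 4
a = list(range(8))
out = [0] * len(a)

def double_kernel(gidx, lidx):
    # Each work item derives its own element index from the two special indices.
    i = gidx * LOCAL_SIZE + lidx
    out[i] = 2 * a[i]

launch(len(a) // LOCAL_SIZE, LOCAL_SIZE, double_kernel)
print(out)
```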

#### Reduction Operations
- **REDUCE_AXIS**: Performs a reduction operation (sum, max, etc.) along specified axes.
  - arg: tuple[Ops, tuple[int, ...]] (The reduction ALU op, Tuple of axes to reduce)
  - src: (UOp,) (The tensor to reduce)
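
As a rough sketch of what the `(op, axes)` argument expresses, the following reduces a 2-D nested list along the requested axes. The string op names and the choice to keep each reduced axis with size 1 are assumptions made for this example.

```python
import operator
from functools import reduce

def reduce_axis(rows, arg):
    """Reduce a 2-D nested list; arg mirrors (op, axes) with axes drawn from {0, 1}."""
    op, axes = arg
    fn = {"ADD": operator.add, "MUL": operator.mul, "MAX": max}[op]
    if axes == (1,):
        return [[reduce(fn, row)] for row in rows]
    if axes == (0,):
        return [[reduce(fn, col) for col in zip(*rows)]]
    if axes == (0, 1):
        return [[reduce(fn, [x for row in rows for x in row])]]
    raise ValueError(f"unsupported axes: {axes}")

x = [[1, 2, 3], [4, 5, 6]]
row_sums = reduce_axis(x, ("ADD", (1,)))
col_maxes = reduce_axis(x, ("MAX", (0,)))
total = reduce_axis(x, ("ADD", (0, 1)))
print(row_sums, col_maxes, total)
```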

#### Helper Operations
- **GEP**: Get Element Pointer. Extracts scalar elements from a vector.
  - arg: tuple[int, ...] (Indices of elements to extract)
  - src: (UOp,) (The vector UOp)

- **VECTORIZE**: Combines multiple scalar UOps into a single vector UOp. Inverse of GEP.
  - arg: None
  - src: tuple[UOp, ...] (Scalar UOps to combine)

- **CAT**: Concatenates multiple vectors. (Often rewritten to VECTORIZE with GEPs).
  - arg: None
  - src: tuple[UOp, ...] (Vectors to concatenate)
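
The relationship between these three helpers can be sketched with tuples standing in for vector values. The function names are lowercase stand-ins chosen for this illustration, and returning a bare scalar for a single index is a simplification of this sketch.

```python
def vectorize(*scalars):
    """Pack scalars into one vector (a tuple here)."""
    return tuple(scalars)

def gep(vec, idxs):
    """Extract the elements at idxs; a single index yields a scalar."""
    picked = tuple(vec[i] for i in idxs)
    return picked[0] if len(picked) == 1 else picked

def cat(*vecs):
    """Concatenate vectors, expressed as a VECTORIZE over per-element GEPs."""
    return vectorize(*(gep(v, (i,)) for v in vecs for i in range(len(v))))

v = vectorize(1.0, 2.0, 3.0, 4.0)
third = gep(v, (2,))
swizzled = gep(v, (3, 0))
joined = cat((1, 2), (3, 4))
print(third, swizzled, joined)
```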

#### Unary Operations
- **CAST, BITCAST, EXP2, LOG2, SIN, SQRT, RECIP, NEG**: Unary elementwise operations. CAST changes type and potentially value, BITCAST changes type but preserves bits.
  - arg: None
  - src: (UOp,)
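
The difference between CAST and BITCAST is easy to see on a float32 to int32 conversion. The sketch below uses the standard `struct` module; the two helper names are hypothetical, and Python's `int()` truncation is used as a stand-in for a value-converting cast.

```python
import struct

def cast_f32_to_i32(x: float) -> int:
    """Value conversion: the number is converted, truncating toward zero."""
    return int(x)

def bitcast_f32_to_i32(x: float) -> int:
    """Bit reinterpretation: the same 4 bytes are read back as a signed int."""
    return struct.unpack("<i", struct.pack("<f", x))[0]

casted = cast_f32_to_i32(1.0)
bitcasted = bitcast_f32_to_i32(1.0)
print(casted, hex(bitcasted))
```

The cast keeps the numeric value, while the bitcast exposes the IEEE 754 encoding of `1.0`.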

#### Memory Operations
- **LOAD, STORE**: Memory access operations. LOAD reads, STORE writes.
  - arg: None
  - src: Varies depending on early/late phase (see the comments in the original Ops definition for details)

- **INDEX**: Represents an indexed memory address, potentially with a validity gate. Used as input to late LOAD/STORE ops.
  - arg: None
  - src: (UOp<DEFINE_GLOBAL/LOCAL>, UOp<index_calculation>, Optional<UOp<gate>>)
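
A minimal sketch of how a gated index might behave when consumed by a load or a store. The `index`, `load`, and `store` helpers and the `alt` fallback value are assumptions of this example, not the library's interface.

```python
def index(buf, idx, gate=True):
    """Bundle a buffer, an element offset, and an optional validity gate."""
    return (buf, idx, gate)

def load(ix, alt=0):
    buf, idx, gate = ix
    # When the gate is false the buffer is never dereferenced.
    return buf[idx] if gate else alt

def store(ix, value):
    buf, idx, gate = ix
    if gate:
        buf[idx] = value

data = [10, 20, 30]
# Read one element past each end; the gate masks the out-of-bounds accesses.
reads = [load(index(data, i, gate=(0 <= i < len(data)))) for i in range(-1, 4)]

store(index(data, 5, gate=False), 99)   # skipped: gate is false
store(index(data, 1), 99)               # performed
print(reads, data)
```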

#### Math Operations
- **WMMA**: Warp Matrix Multiply Accumulate. Represents a hardware tensor core operation.
  - arg: tuple (Contains TC parameters)
  - src: (UOp<A>, UOp<B>, UOp<C>) (Input matrices A, B, Accumulator C)
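
Mathematically, a matrix multiply-accumulate computes `A @ B + C` on a small tile. The pure-Python sketch below shows only that arithmetic; the function name is a lowercase stand-in, and tile sizes, data layouts, and per-thread fragment ownership on real tensor cores are not modeled.

```python
def wmma(a, b, c):
    """Return a @ b + c for small nested-list matrices."""
    m, k, n = len(a), len(b), len(b[0])
    return [[c[i][j] + sum(a[i][p] * b[p][j] for p in range(k))
             for j in range(n)]
            for i in range(m)]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
d = wmma(a, b, c)
print(d)
```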

#### Binary Operations
- **ADD, MUL, IDIV, MAX, MOD, CMPLT, CMPNE, XOR, SHL, SHR, OR, AND, THREEFRY, SUB, FDIV, POW**: Binary elementwise operations.
  - arg: None
  - src: (UOp, UOp)
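
The list above contains only two comparison ops, CMPLT and CMPNE. The following sketch shows one way the remaining comparisons could be expressed in terms of them; it is an illustration and not necessarily how any particular frontend lowers them. The lowercase function names are hypothetical.

```python
def cmplt(a, b):
    return a < b

def cmpne(a, b):
    return a != b

# Derived comparisons. The negation-based forms assume totally ordered
# values such as integers; they do not hold for NaN.
def gt(a, b):
    return cmplt(b, a)

def ge(a, b):
    return not cmplt(a, b)

def le(a, b):
    return not cmplt(b, a)

def eq(a, b):
    return not cmpne(a, b)

checks = [gt(3, 2), ge(2, 2), le(2, 3), eq(4, 4), gt(2, 3), eq(4, 5)]
print(checks)
```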

#### Ternary Operations
- **WHERE, MULACC**: Ternary elementwise operations. WHERE is conditional select, MULACC is fused multiply-accumulate.
  - arg: None
  - src: (UOp, UOp, UOp)
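
Both ternary ops are simple when written elementwise over lists. The helpers below are a hypothetical sketch of the semantics described above; the fused rounding behavior of a hardware multiply-accumulate is not modeled.

```python
def where(cond, a, b):
    """Elementwise select: take from a where cond is true, else from b."""
    return [x if c else y for c, x, y in zip(cond, a, b)]

def mulacc(a, b, c):
    """Elementwise a * b + c."""
    return [x * y + z for x, y, z in zip(a, b, c)]

xs = [-2, -1, 0, 1, 2]
# A ReLU written as a select on the comparison 0 < x.
relu = where([0 < x for x in xs], xs, [0] * len(xs))
fma = mulacc([1, 2, 3], [4, 5, 6], [10, 10, 10])
print(relu, fma)
```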

#### Assignment Operations
- **ASSIGN**: Assigns a value to a destination (Accumulator or Buffer). Used heavily in scheduling/kernel construction.
  - arg: None
  - src: (UOp<DEFINE_ACC/BUFFER/VIEW>, UOp<value>)

- **BIND**: Binds a symbolic variable (DEFINE_VAR) to a specific constant value.
  - arg: None
  - src: (UOp<DEFINE_VAR>, UOp<CONST>)
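
Binding a bounded symbolic variable can be sketched as follows. `ToyVar` and `concretize` are hypothetical names for this illustration; the range check reflects the (name, min_val, max_val) argument described for DEFINE_VAR, and whether the real implementation validates at bind time is an assumption here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyVar:
    """Hypothetical symbolic variable with inclusive bounds."""
    name: str
    vmin: int
    vmax: int

    def bind(self, value: int) -> tuple["ToyVar", int]:
        if not self.vmin <= value <= self.vmax:
            raise ValueError(f"{self.name}={value} outside [{self.vmin}, {self.vmax}]")
        return (self, value)

def concretize(shape, bindings):
    """Replace each symbolic dimension with its bound value."""
    table = {var: val for var, val in bindings}
    return tuple(table[d] if isinstance(d, ToyVar) else d for d in shape)

n = ToyVar("n", 1, 10)
symbolic_shape = (n, 4)
concrete_shape = concretize(symbolic_shape, [n.bind(3)])
print(concrete_shape)
```

Keeping the variable and its value as a pair lets the same symbolic graph be reused with different bindings.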

#### Control Flow Operations
- **BARRIER**: Synchronization barrier, typically for local memory. Ensures memory operations are visible across threads.
  - arg: None
  - src: tuple[UOp<STORE>, ...] (Stores that must complete before the barrier)

- **RANGE**: Represents a loop range (iterator). Defines the bounds for loops during code generation.
  - arg: int (Loop variable ID)
  - dtype: int
  - src: (UOp<start>, UOp<end>)

- **IF**: Start of a conditional block. Defines conditional execution paths.
  - arg: None
  - src: (UOp<condition>, Optional<UOp<BARRIER>>)

- **ENDRANGE, ENDIF**: Mark the end of a RANGE loop or IF block, respectively. Define the scope of loops and conditionals.
  - arg: None
  - src: (UOp<RANGE/IF>,)
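
To show how paired start/end markers define scope, here is a toy renderer that turns a linear list of ops into C-like text. The `(op, arg)` tuple format, the `STMT` pseudo-op, and the variable names in the example are all invented for this sketch and do not match the real data structures.

```python
def render(uops):
    """Render a linear list of (op, arg) pairs as indented C-like source."""
    lines, depth = [], 0
    for op, arg in uops:
        if op in ("ENDRANGE", "ENDIF"):
            depth -= 1                       # leaving a scope
            lines.append("  " * depth + "}")
        elif op == "RANGE":
            name, start, end = arg
            lines.append("  " * depth +
                         f"for (int {name} = {start}; {name} < {end}; {name}++) {{")
            depth += 1                       # entering a loop body
        elif op == "IF":
            lines.append("  " * depth + f"if ({arg}) {{")
            depth += 1                       # entering a conditional body
        else:
            lines.append("  " * depth + arg)  # plain statement
    return "\n".join(lines)

program = [
    ("STMT", "int acc0 = 0;"),
    ("RANGE", ("ridx0", 0, 4)),
    ("IF", "ridx0 < 3"),
    ("STMT", "acc0 += data0[ridx0];"),
    ("ENDIF", None),
    ("ENDRANGE", None),
]
rendered = render(program)
print(rendered)
```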

#### Constants
- **VCONST, CONST**: Represent constant values. VCONST for vectors, CONST for scalars.
  - arg: ConstType | tuple[ConstType, ...] (The constant value(s))
  - src: () or (UOp<VIEW(DEVICE)>,) for tensor constants

#### Device Operations
- **DEVICE**: Represents a compute device. Specifies the target device for operations or buffer allocation.
  - arg: str (Device name, e.g., "CPU", "CUDA:0")
  - src: ()

- **MULTI**: Represents a tensor sharded across multiple devices. Manages distributed tensors and operations.
  - arg: tuple[Optional[int], tuple[bool, ...]] (Axis of sharding, tuple indicating which shards are 'real')
  - src: tuple[UOp, ...] (The UOps representing data on each device shard)

#### Custom Operations
- **CUSTOM, CUSTOMI**: Allow embedding backend-specific code or intrinsics. CUSTOMI suggests an inlined variant.
  - arg: Any (Backend-specific data, often a format string or identifier)
  - src: tuple[UOp, ...] (Inputs to the custom operation)
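
When the argument is a format string, rendering a custom op can be pictured as substituting the already-rendered sources into it. The sketch below assumes positional `{0}`, `{1}` placeholders; the `render_custom` helper and the `fast_rsqrt` intrinsic name are hypothetical.

```python
def render_custom(fmt, rendered_srcs):
    """Fill a format-string arg with the rendered names of the sources."""
    return fmt.format(*rendered_srcs)

# "val0" and "val1" stand for the rendered expressions of the two inputs.
code = render_custom("fast_rsqrt({0}) * {1}", ["val0", "val1"])
print(code)
```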
