# 1 - UOps

In [None]:
#| default_exp uops

As we saw in the previous chapter, UOps are the device-independent intermediate representation of the computation tree. They sit between the user-facing `Tensor` and the device-specific code that is generated to perform the computations.

In [None]:
import os

os.environ["CPU"] = "1"
# os.environ["TRACEMETA"] = "0"
os.environ["DEBUG"]="4"
# os.environ["NOOPT"]="1"


In [None]:
import tinygrad as tg
from tinygrad import Tensor, dtypes

### UOp is a singleton
As noted by [mesozoic-egg@github](https://mesozoic-egg.github.io/tinygrad-notes/20250119_uop_singleton.html), UOp is a singleton: there is at most one live instance for each unique combination of `op`, `dtype`, `src` and `arg`.

It's implemented using a metaclass:
[tinygrad/ops.py](https://github.com/tinygrad/tinygrad/blob/7b865ed03d314dc73debd6ffc2975218fbe6c4a4/tinygrad/ops.py#L226)

```python
class UOpMetaClass(type):
  ucache:dict[tuple, weakref.ReferenceType[UOp]] = {}
  def __call__(cls, op:Ops, dtype:DType=dtypes.void, src:tuple[UOp,...]=tuple(), arg:Any=None, _buffer:Buffer|None=None):
    if (wret:=UOpMetaClass.ucache.get(key:=(op, dtype, src, arg), None)) is not None and (ret:=wret()) is not None: return ret
    UOpMetaClass.ucache[key] = ref = weakref.ref(created:=super().__call__(*key))
    ...
    return created

@dataclass(eq=False, slots=True)
class UOp(MathTrait, metaclass=UOpMetaClass):
    def __del__(self):
        if (ref:=UOpMetaClass.ucache.get(k:=(self.op, self.dtype, self.src, self.arg))) is not None:
            ...
            del UOpMetaClass.ucache[k]
```

(TinyGrad really loves its `:=` operators)

The main idea is that comparing two UOp (sub-)trees becomes very cheap: if the trees are identical, their roots are the same object.
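
To make the caching mechanism concrete, here is a minimal standalone sketch of the same idea in plain Python. `InternMeta` and `Node` are hypothetical names made up for this illustration, and a `WeakValueDictionary` stands in for the hand-rolled weakref cache shown above, so treat it as an approximation rather than TinyGrad's actual code.

```python
import weakref

class InternMeta(type):
    # Shared cache: constructor key -> weakly referenced instance (hypothetical)
    _cache = weakref.WeakValueDictionary()

    def __call__(cls, op, src=(), arg=None):
        key = (op, src, arg)
        cached = InternMeta._cache.get(key)
        if cached is not None:
            return cached                      # reuse the existing object
        created = super().__call__(op, src, arg)
        InternMeta._cache[key] = created       # entry disappears when the object dies
        return created

class Node(metaclass=InternMeta):
    __slots__ = ("op", "src", "arg", "__weakref__")

    def __init__(self, op, src, arg):
        self.op, self.src, self.arg = op, src, arg

def three_plus(n):
    # Builds the tree 3 + n from scratch on every call
    return Node("ADD", (Node("CONST", arg=3), Node("CONST", arg=n)))

a, b, c = three_plus(2), three_plus(2), three_plus(1)
print(a is b)  # True: structurally identical trees share one root object
print(a is c)  # False: one leaf differs, so the roots differ
```

Because children are themselves interned, the key tuple of a parent compares equal whenever the children are the same objects, which is what makes whole-tree comparison a single identity check.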


In [None]:
from tinygrad.ops import UOp, Ops

In [None]:
# Create two identical UOp trees (3 * 5 + 2)
x1 = UOp(Ops.CONST, dtype=dtypes.int, arg=5)
mul1 = UOp(Ops.MUL, dtype=dtypes.int, src=(UOp(Ops.CONST, dtype=dtypes.int, arg=3), x1))
add1 = UOp(Ops.ADD, dtype=dtypes.int, src=(mul1, UOp(Ops.CONST, dtype=dtypes.int, arg=2)))

# Second tree
x2 = UOp(Ops.CONST, dtype=dtypes.int, arg=5)
mul2 = UOp(Ops.MUL, dtype=dtypes.int, src=(UOp(Ops.CONST, dtype=dtypes.int, arg=3), x2))
add2 = UOp(Ops.ADD, dtype=dtypes.int, src=(mul2, UOp(Ops.CONST, dtype=dtypes.int, arg=2)))

id(add1) == id(add2)

True

In [None]:
# Third tree is different (3 * 5 + 1)
x3 = UOp(Ops.CONST, dtype=dtypes.int, arg=5)
mul3 = UOp(Ops.MUL, dtype=dtypes.int, src=(UOp(Ops.CONST, dtype=dtypes.int, arg=3), x3))
add3 = UOp(Ops.ADD, dtype=dtypes.int, src=(mul3, UOp(Ops.CONST, dtype=dtypes.int, arg=1)))

id(add1) == id(add3)

False

### Symbolic evaluation

Another cool feature of UOps: if all inputs are constants and the result is a scalar, the tree can be evaluated without generating any device code at all:

In [None]:
add1

UOp(Ops.ADD, dtypes.int, arg=None, src=(
  UOp(Ops.MUL, dtypes.int, arg=None, src=(
    UOp(Ops.CONST, dtypes.int, arg=3, src=()),
    UOp(Ops.CONST, dtypes.int, arg=5, src=()),)),
  UOp(Ops.CONST, dtypes.int, arg=2, src=()),))

In [None]:
add1.simplify()

UOp(Ops.CONST, dtypes.int, arg=17, src=())

Another way to do the same is to cast the UOp to a `float` or an `int`, depending on its dtype.

In [None]:
int(add1)

17

This does not seem to work on non-scalars, though.
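
The idea behind evaluating a constant-only tree can be sketched in a few lines of plain Python. The tuple-based mini-IR and the `fold` helper below are assumptions made for illustration; TinyGrad's own simplification goes through its rewrite rules, not through this function.

```python
import operator

# Hypothetical mini-IR: a node is either a Python number (a constant)
# or a tuple (op_name, lhs, rhs).
FOLDERS = {"ADD": operator.add, "MUL": operator.mul}

def fold(node):
    """Evaluate a tree whose leaves are all constants, entirely in Python."""
    if not isinstance(node, tuple):
        return node                      # constant leaf
    op, lhs, rhs = node
    return FOLDERS[op](fold(lhs), fold(rhs))

tree = ("ADD", ("MUL", 3, 5), 2)         # 3 * 5 + 2
print(fold(tree))                        # 17
```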

### UOp reference

UOps are used throughout TinyGrad. Some are specific to certain stages of processing (from Tensors to code), while others are valid at any stage.

Here is the full list of all UOps, with (AI-generated) annotations and notes:

[UOp Reference](uops_annotates.html)


### UOp creation helpers

In many cases, the UOp class has methods for creating specific UOps. It's often more convenient and concise to use them.

For example `UOp.const()` creates either a `CONST` or a `VCONST` (vector const, used internally for buffers), and also takes care of the arg type matching dtype:

In [None]:
UOp.const(dtypes.float16, 2)

UOp(Ops.CONST, dtypes.half, arg=2.0, src=())

Note that the arg has been converted to a `float`, even though we gave it an `int`.

There are a few that are very straightforward:
```python

# The SINK is the end of a computation graph
def sink(self, *srcs:UOp): return UOp(Ops.SINK, dtypes.void, (self,)+srcs)

# Detach from the backprop
def detach(self): return UOp(Ops.DETACH, self.dtype, (self,))

def cast(self, dtype:DType): return UOp(Ops.CAST, dtype, (self,))
def bitcast(self, dtype:DType): return UOp(Ops.BITCAST, dtype, (self,))
def load(self, *src:UOp, **kwargs): return UOp(Ops.LOAD, src=(self,)+src, **kwargs)
def store(self, *src:UOp, **kwargs): return UOp(Ops.STORE, dtypes.void, (self,)+src, **kwargs)

# The RANGE UOp takes 2 UOps as start/end of the range.
def range(dtype:DType, start:sint, end:sint, idx:int): return UOp(Ops.RANGE, dtype=dtype, src=(sint_to_uop(start), sint_to_uop(end)), arg=idx)

def assign(self, x:UOp): return UOp(Ops.ASSIGN, self.dtype, (self,x))
def contiguous(self): return self.alu(Ops.CONTIGUOUS)
def contiguous_backward(self): return self.alu(Ops.CONTIGUOUS_BACKWARD)

```




### Toposort

Quite often we need to access a UOp tree in "topological order".

`UOp.toposort` is a property (a method that is accessed like an attribute) that returns a dictionary with the UOps as keys and `None` as values.

This emulates an ordered set, which Python lacks (dictionaries preserve insertion order):

In [None]:
print("===== 3 * 5 + 2 =====")
for o in add1.toposort.keys():
    print(o.op, o.arg)

===== 3 * 5 + 2 =====
Ops.CONST 3
Ops.CONST 5
Ops.MUL None
Ops.CONST 2
Ops.ADD None


You get the idea: the children always come before the parents.
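
As a rough illustration of how such an ordering can be produced, here is a small post-order walk that uses a dict as an insertion-ordered set. The tuple-based tree and this `toposort` function are hypothetical stand-ins, not TinyGrad's implementation.

```python
def toposort(node, seen=None):
    """Post-order walk; the dict keys act as an insertion-ordered set."""
    seen = {} if seen is None else seen
    if node in seen:
        return seen                       # already visited (shared sub-tree)
    if isinstance(node, tuple):           # (op, child, child, ...)
        for child in node[1:]:
            toposort(child, seen)
    seen[node] = None                     # added only after all of its children
    return seen

tree = ("ADD", ("MUL", 3, 5), 2)          # 3 * 5 + 2
for n in toposort(tree):
    print(n)
```

The nodes come out in the same order as above: `3`, `5`, the `MUL` node, `2`, and finally the `ADD` root.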

### Other UOp methods

When reading the TinyGrad code, you will often see other UOp methods called. To make this task easier, let's go over some popular ones.

##### `.replace()`

Despite its name, this does not replace anything in place. It creates a new UOp that is a copy of the original, except for the fields (`op`, `dtype`, `arg`, `src`) you want to change:

In [None]:
add1.replace(op=Ops.SUB)

UOp(Ops.SUB, dtypes.int, arg=None, src=(
  UOp(Ops.MUL, dtypes.int, arg=None, src=(
    UOp(Ops.CONST, dtypes.int, arg=3, src=()),
    UOp(Ops.CONST, dtypes.int, arg=5, src=()),)),
  UOp(Ops.CONST, dtypes.int, arg=2, src=()),))

`add1` did not change:

In [None]:
add1

UOp(Ops.ADD, dtypes.int, arg=None, src=(
  UOp(Ops.MUL, dtypes.int, arg=None, src=(
    UOp(Ops.CONST, dtypes.int, arg=3, src=()),
    UOp(Ops.CONST, dtypes.int, arg=5, src=()),)),
  UOp(Ops.CONST, dtypes.int, arg=2, src=()),))

UOps are actually supposed to be immutable, but this is not enforced for performance reasons:
```python
# NOTE: this should be frozen, but frozen is slower
@dataclass(eq=False, slots=True)
class UOp(MathTrait, metaclass=UOpMetaClass):
    ...
```
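
For comparison, this is roughly what enforced immutability looks like with a frozen dataclass from the standard library, together with `dataclasses.replace`, which has a similar copy-with-changes behaviour to `UOp.replace()`. `Node` is a hypothetical stand-in used only for this sketch.

```python
from dataclasses import dataclass, replace, FrozenInstanceError

@dataclass(frozen=True)
class Node:                      # hypothetical stand-in for UOp
    op: str
    src: tuple = ()
    arg: object = None

add = Node("ADD", (Node("CONST", arg=3), Node("CONST", arg=2)))
sub = replace(add, op="SUB")     # new object; the other fields are copied over

print(sub.op, add.op)            # SUB ADD
print(sub.src is add.src)        # True: children are shared, not deep-copied

try:
    add.op = "MUL"               # frozen=True turns mutation into an error
except FrozenInstanceError as e:
    print("cannot mutate:", e)
```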

### UOp to code

In [None]:
from tinygrad.engine.schedule import create_schedule_with_vars
from tinygrad.engine.realize import lower_schedule_item

You did a bunch of Tensor operations, constructed a chonky UOp tree, and now you want to actually compute it.

In [None]:
a = (Tensor.full((10, 10), 1) + Tensor.full((10, 10), 2)).contiguous()
a.lazydata

UOp(Ops.CONTIGUOUS, dtypes.int, arg=None, src=(
  UOp(Ops.ADD, dtypes.int, arg=None, src=(
    UOp(Ops.EXPAND, dtypes.int, arg=(10, 10), src=(
      UOp(Ops.RESHAPE, dtypes.int, arg=(1, 1), src=(
        UOp(Ops.CONST, dtypes.int, arg=1, src=(
          x4:=UOp(Ops.VIEW, dtypes.void, arg=ShapeTracker(views=(View(shape=(), strides=(), offset=0, mask=None, contiguous=True),)), src=(
            UOp(Ops.DEVICE, dtypes.void, arg='CPU', src=()),)),)),)),)),
    UOp(Ops.EXPAND, dtypes.int, arg=(10, 10), src=(
      UOp(Ops.RESHAPE, dtypes.int, arg=(1, 1), src=(
        UOp(Ops.CONST, dtypes.int, arg=2, src=(
           x4,)),)),)),)),))

The first step is to "schedule" the computation. This converts the UOp tree to a lower-level one. You might also notice that it computed `1+2=3`.
> Note: We will cover the `ShapeTracker` in a separate chapter soon

In [None]:
schedule, vars = a.schedule_with_vars()
schedule, vars

([ScheduleItem(ast=UOp(Ops.SINK, dtypes.void, arg=None, src=(
    UOp(Ops.STORE, dtypes.void, arg=None, src=(
      UOp(Ops.DEFINE_GLOBAL, dtypes.int.ptr(100), arg=0, src=()),
      UOp(Ops.VIEW, dtypes.void, arg=ShapeTracker(views=(View(shape=(10, 10), strides=(10, 1), offset=0, mask=None, contiguous=True),)), src=()),
      UOp(Ops.CONST, dtypes.int, arg=3, src=(
        UOp(Ops.VIEW, dtypes.void, arg=ShapeTracker(views=(View(shape=(10, 10), strides=(0, 0), offset=0, mask=None, contiguous=False),)), src=()),)),)),)), bufs=(<buf real:False device:CPU size:100 dtype:dtypes.int offset:0>,), metadata=(contiguous, __add__))],
 {})

The next step is to convert the `ScheduleItem` into executable code.

In [None]:
ei = lower_schedule_item(schedule[0])
ei

opened device CPU from pid:549064
E_25_4
 0: (25, 4)                   int.ptr(100)         (4, 1)                         ShapeTracker(views=(View(shape=(25, 4), strides=(4, 1), offset=0, mask=None, contiguous=True),))
[Opt(op=OptOps.UPCAST, axis=0, arg=4)]

void E_25_4(int* restrict data0) {
  for (int ridx0 = 0; ridx0 < 25; ridx0++) {
    int alu0 = (ridx0<<2);
    *(data0+alu0) = 3;
    *(data0+(alu0+1)) = 3;
    *(data0+(alu0+2)) = 3;
    *(data0+(alu0+3)) = 3;
  }
}



ExecItem(prg=<tinygrad.engine.realize.CompiledRunner object>, bufs=[<buf real:False device:CPU size:100 dtype:dtypes.int offset:0>], metadata=(contiguous, __add__))

This brings the UOp tree to the lowest level, which maps ~1:1 to the generated code:

In [None]:
for o in ei.prg.p.uops:
    print(o.op, o.arg, [s.arg for s in o.src if s.op == Ops.CONST] if o.src else "")

Ops.NAME E_25_4 
Ops.DEFINE_GLOBAL 0 
Ops.CONST 0 
Ops.CONST 1 
Ops.CONST 2 
Ops.CONST 3 
Ops.CONST 25 
Ops.RANGE 0 [0, 25]
Ops.SHL None [2]
Ops.INDEX None []
Ops.STORE None [3]
Ops.ADD None [1]
Ops.INDEX None []
Ops.STORE None [3]
Ops.ADD None [2]
Ops.INDEX None []
Ops.STORE None [3]
Ops.ADD None [3]
Ops.INDEX None []
Ops.STORE None [3]
Ops.ENDRANGE None []


In [None]:
print(ei.prg.p.src)


void E_25_4(int* restrict data0) {
  for (int ridx0 = 0; ridx0 < 25; ridx0++) {
    int alu0 = (ridx0<<2);
    *(data0+alu0) = 3;
    *(data0+(alu0+1)) = 3;
    *(data0+(alu0+2)) = 3;
    *(data0+(alu0+3)) = 3;
  }
}
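
To read the kernel more easily, here is a line-by-line Python transliteration (an illustration only, with a plain list standing in for the device buffer). The 100-element buffer is treated as 25 groups of 4, and the `UPCAST` by 4 reported in the debug output shows up as four unrolled stores per loop iteration.

```python
def E_25_4(data0):
    # Python transliteration of the generated C kernel
    for ridx0 in range(25):
        alu0 = ridx0 << 2            # ridx0 * 4: start of this group of 4
        data0[alu0] = 3
        data0[alu0 + 1] = 3
        data0[alu0 + 2] = 3
        data0[alu0 + 3] = 3

buf = [0] * 100                      # stands in for the 10x10 int buffer
E_25_4(buf)
print(all(v == 3 for v in buf))      # True
```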



This compiles and runs the code. We will go into much more detail on the individual steps later.

In [None]:
ei.run()

*** CPU        1 E_25_4                                    arg  1 mem  0.00 GB tm      7.96us/     0.01ms (     0.00 GFLOPS    0.1|0.1     GB/s) ['contiguous', '__add__']


7.961993105709553e-06

The result has been saved to the buffer:

In [None]:
import numpy as np

view = memoryview(a.lazydata.base.realized._buf)
np.frombuffer(view, dtype=np.int32).reshape(a.shape)


array([[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]], dtype=int32)

#### The Pattern Matcher

The next step is to progressively rewrite the UOp tree using the Pattern Matcher (PM).

The PM is used all over TinyGrad for different purposes, and I will cover it in greater detail later, but let's take a quick peek.

In [None]:
from tinygrad.ops import PatternMatcher, UPat, graph_rewrite

The PM operates on a list of rules.

Each rule consists of a `UPat`, and a function that is called when the pattern matches part of the tree.

The return value of the function is the result of the "match".

In [None]:
test_rules = PatternMatcher([
    (UPat(Ops.SINK), lambda: "U stink"),                                                # This rule matches any `SINK` UOp
    (UPat(Ops.CONST, name="x"), lambda x: f"Got a CONST dtype {x.dtype} arg {x.arg}"),  # A named UPat passes the matched UOp to the function
    (UPat(Ops.CONST), lambda: "Another rule for CONST"),                                 # Never reached: the first rule that returns a result wins
    (UPat((Ops.ADD, Ops.MUL)), lambda: "ADD or MUL"),                                   # Can match more than one UOp type
    (UPat(Ops.EXPAND, src=(UPat(Ops.RESHAPE, src=UPat(Ops.CONST, arg=2)))),
        lambda: "Expand with reshape from a const with arg=2")                          # Can match a specific sub-tree.
                                                                                        # Note: This one only matches the EXPAND for 2, not 1
    # If no rule matches, `rewrite` returns None
])

# Assumption: `a_sink` is the SINK of `a`'s UOp tree, built with the `sink()` helper shown earlier
a_sink = a.lazydata.sink()
[test_rules.rewrite(op) for op in a_sink.toposort]

A more interesting pattern is to replace the matched UOps with some other UOps. We can also use `graph_rewrite` to operate on a tree.

In [None]:
insanity = PatternMatcher([
    (UPat(Ops.ADD, name="x"), lambda x: UOp(Ops.SUB, dtype=x.dtype, arg=x.arg, src=x.src)),
    (UPat(Ops.MUL, dtype=dtypes.ints, name="x"), lambda x: UOp(Ops.IDIV, dtype=x.dtype, src=x.src))
])

rewritten = graph_rewrite(add1, insanity)
rewritten

UOp(Ops.SUB, dtypes.int, arg=None, src=(
  UOp(Ops.IDIV, dtypes.int, arg=None, src=(
    UOp(Ops.CONST, dtypes.int, arg=3, src=()),
    UOp(Ops.CONST, dtypes.int, arg=5, src=()),)),
  UOp(Ops.CONST, dtypes.int, arg=2, src=()),))

In [None]:
int(rewritten)

-2
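
The mechanics of rule-based rewriting can be sketched without TinyGrad. The example below uses a hypothetical tuple-based mini-IR and does a single bottom-up pass, with the first matching rule winning. The real `graph_rewrite` works on UOps and is more involved, so treat this only as an illustration of the idea.

```python
import operator

# Hypothetical mini-IR: constants are ints, other nodes are (op, lhs, rhs) tuples.
# Each rule is (predicate, replacement function), mirroring (UPat, function).
RULES = [
    (lambda n: n[0] == "ADD", lambda n: ("SUB",) + n[1:]),
    (lambda n: n[0] == "MUL", lambda n: ("IDIV",) + n[1:]),
]

def rewrite_once(node):
    """Return the result of the first matching rule, or None if nothing matches."""
    for matches, fn in RULES:
        if matches(node):
            return fn(node)
    return None

def rewrite_tree(node):
    """Single bottom-up pass: rewrite the children first, then the node itself."""
    if not isinstance(node, tuple):
        return node
    node = (node[0],) + tuple(rewrite_tree(c) for c in node[1:])
    new = rewrite_once(node)
    return node if new is None else new

# Floor division is enough here because the operands are non-negative.
OPS = {"ADD": operator.add, "SUB": operator.sub,
       "MUL": operator.mul, "IDIV": operator.floordiv}

def evaluate(node):
    if not isinstance(node, tuple):
        return node
    return OPS[node[0]](evaluate(node[1]), evaluate(node[2]))

rewritten_sketch = rewrite_tree(("ADD", ("MUL", 3, 5), 2))
print(rewritten_sketch)            # ('SUB', ('IDIV', 3, 5), 2)
print(evaluate(rewritten_sketch))  # -2, since 3 // 5 - 2 == 0 - 2
```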

In [None]:
schedule, var_vals, becomes_map = create_schedule_with_vars(a_sink)
assert len(schedule) == 1
assert len(var_vals) == 0
si = schedule[0]
type(si)

tinygrad.engine.schedule.ScheduleItem

`ScheduleItem` is a `dataclass` with:
- `ast` - the new UOp tree
- `bufs` - the buffers used in the calculation
- `metadata`

In [None]:
print("AST:", si.ast)
print("bufs:", si.bufs)
print("Metadata", si.metadata)

AST: UOp(Ops.SINK, dtypes.void, arg=None, src=(
  UOp(Ops.STORE, dtypes.void, arg=None, src=(
    UOp(Ops.DEFINE_GLOBAL, dtypes.int.ptr(100), arg=0, src=()),
    UOp(Ops.VIEW, dtypes.void, arg=ShapeTracker(views=(View(shape=(10, 10), strides=(10, 1), offset=0, mask=None, contiguous=True),)), src=()),
    UOp(Ops.CONST, dtypes.int, arg=3, src=(
      UOp(Ops.VIEW, dtypes.void, arg=ShapeTracker(views=(View(shape=(10, 10), strides=(0, 0), offset=0, mask=None, contiguous=False),)), src=()),)),)),))
bufs: (<buf real:False device:CPU size:100 dtype:dtypes.int offset:0>,)
Metadata (contiguous, __add__)


The next step is to "lower" the `ScheduleItem`, which generates the code.

In [None]:
compiled_runner, bufs = tg.engine.realize.si_lowerer.rewrite(si.ast, si.bufs)
compiled_runner, bufs

(<tinygrad.engine.realize.CompiledRunner>,
 [<buf real:False device:CPU size:100 dtype:dtypes.int offset:0>])

In [None]:
compiled_runner.lib

b'U\xc4\xe2}\x18\x05f\x00\x00\x00H\x89\xe5\xc5\xfc\x11G`\xc5\xfc\x11G@\xc5\xfc\x11G \xc5\xfc\x11\x07\xc5\xfc\x11\x87\xe0\x00\x00\x00\xc5\xfc\x11\x87\xc0\x00\x00\x00\xc5\xfc\x11\x87\xa0\x00\x00\x00\xc5\xfc\x11\x87\x80\x00\x00\x00\xc5\xfc\x11\x87`\x01\x00\x00\xc5\xfc\x11\x87@\x01\x00\x00\xc5\xfc\x11\x87 \x01\x00\x00\xc5\xfc\x11\x87\x00\x01\x00\x00\xc5\xf8\x11\x87\x80\x01\x00\x00]\xc5\xf8w\xc3\x00\x00\x00\x03\x00\x00\x00\x00Ubuntu clang version 18.1.3 (1ubuntu1)\x00'

In [None]:
ei = lower_schedule_item(si)

In [None]:
for k, v in becomes_map.items():
    print(f"  {k.op} -> {v.op} {v.arg if v.op is Ops.CONST else ''}")

  Ops.CONTIGUOUS -> Ops.VIEW 
  Ops.ADD -> Ops.CONST 3
  Ops.EXPAND -> Ops.CONST 2
  Ops.RESHAPE -> Ops.CONST 2
  Ops.EXPAND -> Ops.CONST 1
  Ops.RESHAPE -> Ops.CONST 1


In [None]:
import pickle
from os import getenv

In [None]:
PatternMatcher = tg.ops.TrackedPatternMatcher  # type: ignore
def print_match_stats():
    with open(fn:=tg.helpers.temp("rewrites.pkl", append_user=True), "wb") as f:
        print(f"rewrote {len(tg.ops.tracked_ctxs)} graphs and matched {sum(len(r.matches) for x in tg.ops.tracked_ctxs for r in x)} times, saved to {fn}")
        with tg.helpers.Context(PICKLE_BUFFERS=0): pickle.dump((tg.ops.tracked_keys, tg.ops.tracked_ctxs), f)
    # if getenv("VIZ"): tg.ops.launch_viz("VIZ", tg.helpers.temp("rewrites.pkl", append_user=True))
    # if getenv("PRINT_MATCH_STATS", 1):
    #     ret = [0,0,0.0,0.0]
    #     for k,v in sorted(list(tg.ops.match_stats.items()), key=lambda x: x[1][2]+x[1][3]):
    #         loc_str = f"{k.location[0].split('/')[-1]}:{k.location[1]}"
    #         if v[1] != 0: print(f"{v[0]:6d} / {v[1]:7d} -- {v[3]*1000.:9.2f} / {(v[2]+v[3])*1000.:9.2f} ms -- {loc_str:15s}", k.printable())
    #         ret = [x+y for x,y in zip(ret, v)]
    #         print(f"{ret[0]:6d} / {ret[1]:7d} -- {ret[3]*1000.:9.2f} / {(ret[2]+ret[3])*1000.:9.2f} ms -- TOTAL")

In [None]:
print_match_stats()

rewrote 0 graphs and matched 0 times, saved to /tmp/rewrites.pkl.xl0
