[Draft] One hierarchical view is all you need #3988

YichengDWu · 2024-03-29T06:41:17Z

It's still a draft, but it's a better abstraction.

Let's see what the View in this PR can do.

from tinygrad.shape.view2 import View

a = View.create((10,10)); print(a); print(a.render());
# View(shape=(10, 10), strides=(10, 1), continuous=True)
# ((idx0*10)+idx1)

a = a.permute((1,0)); print(a); print(a.render());
# View(shape=(10, 10), strides=(1, 10), continuous=False)
# ((idx1*10)+idx0)

a = a.reshape((5,2,5,2)); print(a); print(a.render());
# View(shape=(5, 2, 5, 2), strides=(2, 1, 20, 10), continuous=False)
# ((idx0*2)+(idx2*20)+(idx3*10)+idx1)

a = a.reshape((100,)); print(a); print(a.render());
# View(shape=((10, 10),), strides=((1, 10),), continuous=False)
# (((idx0%10)*10)+(idx0//10))

a = a.permute((1,0)); print(a); print(a.render());
# View(shape=(10, 10), strides=(10, 1), continuous=True)
# ((idx0*10)+idx1)

assert a.continuous == True

This means that using symbolic to track indices in multiple views is redundant. A view records all information about the data layout.

I will add conceptual explanations later (need to go to sleep).

This PR does not modify any existing source code yet; it is purely an addition, because the source code of ShapeTracker is spaghetti.

github-actions · 2024-03-29T06:43:41Z

Changes

Name                           Lines    Diff    Tokens/Line    Diff
---------------------------  -------  ------  -------------  ------
tinygrad/shape/view2.py          101    +101            9.9    +9.9
tinygrad/shape/int_tuple.py       86     +86           10.2   +10.2


total lines changes: +187

geohot · 2024-03-29T17:40:54Z

I'll believe this is real when you have all the tests passing with the new View class

YichengDWu · 2024-03-29T18:07:44Z

It is real. I will post the mathematics behind this with flawless logic.

As for passing all tests, it's a bit tricky, because the source code may contain cases of "in order to achieve A, we implemented B and added tests for B". If I solve A directly, there is no need for B, but identifying B can be time-consuming.

geohot · 2024-03-29T18:15:28Z

Going to close this PR. No need to pass the view specific tests, but all models and training should work. Feel free to reopen when they do.

YichengDWu · 2024-03-29T23:20:51Z

I'll be moving the day after tomorrow and might not have access to a computer for a while. I'm posting the underlying principles here so that anyone interested can take a look or help accelerate the refactoring.

First glance

Hierarchical views are a strict superset of normal views. For example, a normal view cannot represent the following data layout:

0 4 1 5
2 6 3 7

But it can be represented with a hierarchical view:

shape:   (2, (2, 2))
strides: (2, (1, 4))

The second dimension has two dimensions folded into it. If we slice along the second dimension, we get

shape:  (2, 2)
strides: (1, 4)

which is

0 4 
1 5

aka the first row of the original data layout.
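To make the example concrete, here is a minimal sketch that expands the hierarchical view by hand. It hard-codes the shape `(2, (2, 2))` and strides `(2, (1, 4))` from above; the variable names are only illustrative and are not taken from the PR.

```python
# Hierarchical view from the text: shape (2, (2, 2)), strides (2, (1, 4)).
stride_i, (stride_j, stride_k) = 2, (1, 4)

# Row i; the column is the folded pair (j, k), with k varying fastest.
layout = [
    [i * stride_i + j * stride_j + k * stride_k
     for j in range(2) for k in range(2)]
    for i in range(2)
]
for row in layout:
    print(*row)
# 0 4 1 5
# 2 6 3 7

# Fixing i = 0 leaves the folded sub-view with shape (2, 2) and strides (1, 4).
sub = [[j * stride_j + k * stride_k for k in range(2)] for j in range(2)]
for row in sub:
    print(*row)
# 0 4
# 1 5
```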

Coordinate lists

In the following text, I will use the term "int tuple" to represent a nested tuple of integers.

We say two int tuples are compatible if they have the same hierarchical structure. For example, ((2,3), 4) is compatible with ((5,6), 8).

A coordinate list is a lexicographically ordered list of integers or compatible int tuples. For example,

{
 (0, 0), (0, 1), (0, 2), (0, 3),
 (1, 0), (1, 1), (1, 2), (1, 3)
}

{
 (0, (0, 0)), (0, (0, 1)), (0, (1, 0)), (0, (1, 1)),
 (1, (0, 0)), (1, (0, 1)), (1, (1, 0)), (1, (1, 1))
}

This kind of order reflects the row-major memory layout.

I will use the notation $C_{(2,4)}$ for the first coordinate list above and $C_{(2,(2,2))}$ for the second one. The meaning of this notation is self-evident.
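A coordinate list can be generated mechanically from a shape. The sketch below is one possible way to do it, relying on `itertools.product`, which already yields tuples in lexicographic order; `coordinate_list` is a hypothetical helper name, not something from the PR.

```python
from itertools import product

def coordinate_list(shape):
    """Return C_shape: every coordinate of a nested shape, in lexicographic order."""
    if isinstance(shape, int):
        return list(range(shape))
    # Recurse into each mode, then combine; the last mode varies fastest.
    return list(product(*(coordinate_list(s) for s in shape)))

c_flat = coordinate_list((2, 4))         # C_{(2,4)}
c_nested = coordinate_list((2, (2, 2)))  # C_{(2,(2,2))}
print(c_flat[:4])    # [(0, 0), (0, 1), (0, 2), (0, 3)]
print(c_nested[:4])  # [(0, (0, 0)), (0, (0, 1)), (0, (1, 0)), (0, (1, 1))]
```

Both lists have 8 entries, and position `n` in one list corresponds to position `n` in the other.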

We can then define a partial order on the coordinate lists themselves by factorization: one list sits below another if the latter is obtained by factoring some of its dimensions. For example, $C_{(2,4)}\leq C_{(2,(2,2))}$ since 4 can be factored into 2 by 2. Similarly,

$$ C_{36}\leq C_{(6,6)} \leq C_{((2,3), 6)}\leq C_{((2,3), (2,3))} $$

$$ C_{36}\leq C_{(4,9)} \leq C_{(4, (3,3))}\leq C_{((2,2),(3,3))} $$
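The ordering is only described by example here, so the following check is my reading of it rather than a definition from the PR: $C_P \leq C_S$ when every integer in $P$ equals the total size of the corresponding (possibly nested) part of $S$. The helpers `size` and `leq` are hypothetical names.

```python
def size(shape):
    """Total number of elements described by a nested shape."""
    if isinstance(shape, int):
        return shape
    total = 1
    for s in shape:
        total *= size(s)
    return total

def leq(p, s):
    """Assumed ordering: True if C_p <= C_s, i.e. s factors some dimensions of p."""
    if isinstance(p, int):
        return size(s) == p
    if isinstance(s, int) or len(p) != len(s):
        return False
    return all(leq(a, b) for a, b in zip(p, s))

chain_a = [36, (6, 6), ((2, 3), 6), ((2, 3), (2, 3))]
chain_b = [36, (4, 9), (4, (3, 3)), ((2, 2), (3, 3))]
chains_hold = all(
    leq(lo, hi) for chain in (chain_a, chain_b) for lo, hi in zip(chain, chain[1:])
)
print(chains_hold)            # True
print(leq((6, 6), (4, 9)))    # False: neither refines the other
```

The last line shows why the order is only partial: $C_{(6,6)}$ and $C_{(4,9)}$ are not comparable.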

Shape

A view contains a shape and a strides (I hate using the plural). They are both int tuples.

We can think of a view as a function that maps valid coordinates to an offset in memory.

The function is defined by taking the (nested) inner product of a coordinate and the strides.

For example,

shape:   (2, (2, 2))
strides: (2, (1, 4))
coordinate: (1, (1, 0))
offset: <(1, (1, 0)), (2, (1, 4))> -> 3
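The nested inner product can be written as a short recursion. This is a sketch with a hypothetical helper name (`nested_inner`), assuming the coordinate and the strides have the same hierarchical structure.

```python
def nested_inner(crd, strides):
    """Inner product of a nested coordinate with strides of the same structure."""
    if isinstance(crd, int):
        return crd * strides
    return sum(nested_inner(c, d) for c, d in zip(crd, strides))

offset = nested_inner((1, (1, 0)), (2, (1, 4)))
print(offset)  # 3, i.e. 1*2 + 1*1 + 0*4
```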

But we should also be able to use a linear coordinate to index into a view.

0 4 1 5
2 6 3 7

We know that 3 is the 7th element, i.e. it sits at linear index 6 (counting from 0), so if we pass the linear coordinate 6 into the view (as a function) we should also get 3 as the output.

More interestingly, if we treat the view as a 2D tensor, the coordinate (1, 2) will also lead us to 3.

Now we introduce the following definition:

A coordinate family generated by a shape $S$ is a set of coordinate lists defined by

$$G(S)=\{C_P \mid C_P\leq C_S\}$$

For example,

$$ S=(2,(2,2)), \quad G(S)=\{C_{8},C_{(2,4)}, C_{(2,(2,2))}\} $$

Observations: $C_S\in G(S)$ and $C_{|S|}\in G(S)$.
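As a sanity check, $G(S)$ can be enumerated by deciding, at every nested tuple, whether to collapse it into its total size or to keep it and recurse. This is a sketch under my reading of the ordering (a dimension may be replaced by a factorization of itself); `size` and `family` are hypothetical names.

```python
from itertools import product

def size(shape):
    """Total number of elements described by a nested shape."""
    if isinstance(shape, int):
        return shape
    total = 1
    for s in shape:
        total *= size(s)
    return total

def family(shape):
    """Shapes P with C_P <= C_shape, under the assumed ordering."""
    if isinstance(shape, int):
        return [shape]
    collapsed = [size(shape)]                            # fold the whole tuple away
    kept = list(product(*(family(s) for s in shape)))    # keep it, recurse per mode
    return collapsed + kept

g = family((2, (2, 2)))
print(g)  # [8, (2, 4), (2, (2, 2))]
```

The first entry is always $|S|$ and the last is $S$ itself, which matches the two observations above.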

A coordinate $c$ of a shape $S$ is an element of an element $C_P \in G(S)$.

Therefore, (1, (1, 0)), (1, 2) and 6 are all valid coordinates of the shape (2,(2,2)), and they all refer to the same element.

It turns out there is a bijective map between any two coordinate lists in $G(S)$.

It suffices to show that there is a bijective map between $C_{|S|}$ and any $C_P\in G(S)$, because then

$$C_{P_1}\leftrightarrow C_{|S|} \leftrightarrow C_{P_2}$$

for any $C_{P_1}, C_{P_2}\in G(S)$.

The proof is constructive: crd2idx takes you from $C_{P_1}$ to $C_{|S|}$ and idx2crd takes you from $C_{|S|}$ to $C_{P_2}$.

For example,

crd2idx((1,(1,0)), (2, (2,2))) = 6
idx2crd(6, (2, 4)) = (1, 2)
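A minimal sketch of the two maps, assuming 0-based coordinates and the row-major (lexicographic) ordering used for the coordinate lists above. The names `crd2idx` and `idx2crd` and their argument order follow the text; the bodies are my own and are not copied from the PR.

```python
def size(shape):
    """Total number of elements described by a nested shape."""
    if isinstance(shape, int):
        return shape
    total = 1
    for s in shape:
        total *= size(s)
    return total

def crd2idx(crd, shape):
    """Position of a nested coordinate within C_shape (row-major, 0-based)."""
    if isinstance(shape, int):
        return crd
    idx = 0
    for c, s in zip(crd, shape):
        idx = idx * size(s) + crd2idx(c, s)
    return idx

def idx2crd(idx, shape):
    """Inverse of crd2idx: the coordinate at position idx of C_shape."""
    if isinstance(shape, int):
        return idx
    crd = []
    for s in reversed(shape):  # peel off the fastest-varying mode first
        idx, rem = divmod(idx, size(s))
        crd.append(idx2crd(rem, s))
    return tuple(reversed(crd))

i = crd2idx((1, (1, 0)), (2, (2, 2)))
print(i)                    # 6
print(idx2crd(i, (2, 4)))   # (1, 2)
```

Round-tripping every index through `idx2crd` and back through `crd2idx` returns the index unchanged, which is the bijection claimed above.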

Reshape

Given a view $v$ with shape $S_1$ and strides $D_1$, reshaping it to a new shape $S_2$ means that, for a coordinate $\text{crd}\in C_{S_2}$, we compute

idx = crd2idx(crd, S2)
new_crd = idx2crd(idx, S1)
offset = v(new_crd)

If we have multiple reshape operations, we simply repeat this process.
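The three steps can be replayed on the demo from the top of this PR: the permuted view with shape `(10, 10)` and strides `(1, 10)`, reshaped to `(5, 2, 5, 2)`. The sketch below assumes row-major, 0-based `crd2idx`/`idx2crd` (my own implementations) and compares the result with the strides `(2, 1, 20, 10)` reported above.

```python
from itertools import product

def size(shape):
    if isinstance(shape, int):
        return shape
    total = 1
    for s in shape:
        total *= size(s)
    return total

def crd2idx(crd, shape):
    if isinstance(shape, int):
        return crd
    idx = 0
    for c, s in zip(crd, shape):
        idx = idx * size(s) + crd2idx(c, s)
    return idx

def idx2crd(idx, shape):
    if isinstance(shape, int):
        return idx
    crd = []
    for s in reversed(shape):
        idx, rem = divmod(idx, size(s))
        crd.append(idx2crd(rem, s))
    return tuple(reversed(crd))

def reshaped_offset(crd, new_shape, old_shape, old_strides):
    """Steps from the text: crd -> idx -> coordinate of the old view -> offset."""
    idx = crd2idx(crd, new_shape)
    old_crd = idx2crd(idx, old_shape)
    return sum(c * d for c, d in zip(old_crd, old_strides))  # flat old view only

old_shape, old_strides = (10, 10), (1, 10)
new_shape, expected_strides = (5, 2, 5, 2), (2, 1, 20, 10)

mismatches = [
    crd for crd in product(*(range(n) for n in new_shape))
    if reshaped_offset(crd, new_shape, old_shape, old_strides)
    != sum(c * d for c, d in zip(crd, expected_strides))
]
print(len(mismatches))  # 0
```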

From my understanding, this is the reason you need to maintain multiple views, and additionally, a symbolic coordinate is required to record the final expression.

The purpose of this PR is to completely discard this complexity.

We consider reshape as function composition.

$$(v\circ v_2)(\text{crd})=v(v_2(\text{crd}))=\text{offset}$$

Here $v_2$ is defined to be a continuous view with shape $S_2$. Then, the computation process of this composition is completely consistent with the above algorithm.

The trick is to find a new view, namely $v\circ v_2$, then we can simply maintain the new view and discard $v$ and $v_2$.
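The sketch below does not implement the composition algorithm itself. It only checks, for the demo at the top of this PR, that the single hierarchical view `shape=((10, 10),), strides=((1, 10),)` agrees with $v\circ v_2$ when the permuted view is reshaped to `(100,)`. The evaluator `view` is a hypothetical helper that accepts a linear coordinate for a folded mode, and the row-major `idx2crd` is my own assumption.

```python
def size(shape):
    if isinstance(shape, int):
        return shape
    total = 1
    for s in shape:
        total *= size(s)
    return total

def idx2crd(idx, shape):
    if isinstance(shape, int):
        return idx
    crd = []
    for s in reversed(shape):
        idx, rem = divmod(idx, size(s))
        crd.append(idx2crd(rem, s))
    return tuple(reversed(crd))

def view(crd, shape, strides):
    """Evaluate a hierarchical view; an int coordinate may address a folded mode."""
    if isinstance(shape, int):
        return crd * strides
    if isinstance(crd, int):
        crd = idx2crd(crd, shape)  # unfold the linear coordinate
    return sum(view(c, s, d) for c, s, d in zip(crd, shape, strides))

old_shape, old_strides = (10, 10), (1, 10)          # v: the permuted view
new_shape, new_strides = ((10, 10),), ((1, 10),)    # single view standing in for v o v_2

composed = [view(idx2crd(i, old_shape), old_shape, old_strides) for i in range(100)]
merged = [view((i,), new_shape, new_strides) for i in range(100)]
print(composed == merged)  # True
print(merged[37])          # 73, i.e. (37 % 10) * 10 + 37 // 10
```

Both lists also match the rendered expression `((idx0%10)*10)+(idx0//10)` from the demo.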

YichengDWu added 3 commits March 28, 2024 22:36

first try

2a6ce53

format

f1647c7

format

5b4ee11

YichengDWu marked this pull request as draft March 29, 2024 06:43

geohot closed this Mar 29, 2024

YichengDWu mentioned this pull request May 8, 2024

Better Symbolic Algebra Library jafioti/luminal#47

Open
