# TensorIR: A Case Study in Tensor Program Abstraction

## 2.4.2 Preface
1. What are possible abstractions for representing tensor functions?
2. What are possible transformations between tensor functions?

## 2.4.3 TensorIR

In [1]:
import numpy as np
import tvm
from tvm.ir.module import IRModule
from tvm.script import tir as T

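The main purpose of a tensor program abstraction is to represent loops and the hardware-acceleration choices associated with them, such as multithreading, the use of special hardware instructions, and memory access. The running example is a matrix multiplication followed by a ReLU: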

$$
\begin{aligned}
& Y_{i,j} = \sum_{k}{A_{i,k}\times{B_{k,j}}} \\
& C_{i,j} = \mathrm{ReLU}(Y_{i,j}) = \max(Y_{i,j}, 0)
\end{aligned}
$$

In [3]:
# numpy version
dtype = 'float32'
a_np = np.random.rand(128, 128).astype(dtype)
b_np = np.random.rand(128, 128).astype(dtype)
# a @ b is equivalent to np.matmul(a,b)
c_mm_relu = np.maximum(a_np @ b_np, 0)

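Low-level NumPy
1. Use loops instead of array functions to expose the possible loop computations.
2. Allocate arrays explicitly with numpy.empty and pass them around.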

In [4]:
# low level numpy version
def lnumpy_mm_relu(A: np.ndarray, B: np.ndarray, C: np.ndarray):     # multi-dimensional arrays (buffers): inputs, outputs, intermediate results
    Y = np.empty((128, 128), dtype='float32')
    for i in range(128):                                             # loop nest: drives the iteration
        for j in range(128):
            for k in range(128):
                if k == 0:
                    Y[i, j] = 0
                Y[i, j] = Y[i, j] + A[i, k] * B[k, j]                # computation definition
    
    for i in range(128):
        for j in range(128):
            C[i, j] = max(Y[i, j], 0)

In [5]:
c_np = np.empty((128, 128), dtype=dtype)
lnumpy_mm_relu(a_np, b_np, c_np)
np.testing.assert_allclose(c_np, c_mm_relu, rtol=1e-5)

TensorIR: TVMScript

In [23]:
@tvm.script.ir_module
class MyModel():
    @T.prim_func
    def mm_relu(A: T.Buffer[(128, 128), 'float32'],
                B: T.Buffer[(128, 128), 'float32'],
                C: T.Buffer[(128, 128), 'float32']):
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        Y = T.alloc_buffer((128, 128), dtype='float32')
        for i, j, k in T.grid(128, 128, 128):
            with T.block("Y"):
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j)
                vk = T.axis.reduce(128, k)
                with T.init():
                    Y[vi, vj] = T.float32(0)
                Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
        
        for i, j in T.grid(128, 128):
            with T.block("C"):
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j)
                C[vi, vj] = T.max(Y[vi, vj], T.float32(0))

### 2.4.3.1 Multi-dimensional buffers

**Multi-dimensional arrays**: function parameters and buffers \
1. Inputs and outputs
```python
# TensorIR
def mm_relu(A: T.Buffer[(128, 128), 'float32'],
            B: T.Buffer[(128, 128), 'float32'],
            C: T.Buffer[(128, 128), 'float32']):
    pass
```
```python
# numpy
def lnumpy_mm_relu(A: np.ndarray, B: np.ndarray, C: np.ndarray):
    pass
```
2. Intermediate results
```python
# TensorIR
Y = T.alloc_buffer((128, 128), dtype='float32')
```
```python
# numpy
Y = np.empty((128, 128), dtype='float32')
```

### 2.4.3.2 For: Loop iteration

```python
# TensorIR
for i, j, k in T.grid(128, 128, 128):    # T.grid is TensorIR syntactic sugar for nested loops
    pass
```
```python
# numpy
for i in range(128):
    for j in range(128):
        for k in range(128):
            pass
```

### 2.4.3.3 Computation Block

```python
# TensorIR
with T.block("Y"):
    vi = T.axis.spatial(128, i)
    vj = T.axis.spatial(128, j)
    vk = T.axis.reduce(128, k)
    with T.init():
        Y[vi, vj] = T.float32(0)
    Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
```
```python
# corresponding numpy code
vi, vj, vk = i, j, k
if vk == 0:
    Y[vi, vj] = 0
Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
```

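A **block** is the basic unit of computation in TensorIR. Notably, a block carries more information than the plain NumPy code: it consists of a set of block axes (vi, vj, vk) and the computation defined around them.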

```python
vi = T.axis.spatial(128, i)
vj = T.axis.spatial(128, j)
vk = T.axis.reduce(128, k)
# [block_axis] = T.axis.[axis_type]([axis_range], [mapped_value])
```
Key properties declared by a block axis:
1. Where vi, vj, vk are bound (here, to i, j, k).
2. The original (intended) range of vi, vj, vk (the 128 in T.axis.spatial(128, i)).
3. The type of the axis (spatial or reduce).

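### 2.4.3.4 Block axis properties
Block Y computes Y[vi, vj] by reading values from A[vi, vk] and B[vk, vj] and summing over all possible vk. For a fixed vi and vj, the block produces a single value at the spatial location Y[vi, vj], and that value is independent of every other location in Y (those with different vi, vj values). \
vi, vj -> spatial axes: they correspond directly to the start of the spatial region of the buffer that the block writes \
vk -> reduce axis: it takes part in the reduction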
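
The independence of spatial locations can be checked with a small sketch. It uses plain Python lists and a 4x4 problem instead of the 128x128 buffers in the text, and `matmul_in_order` is a hypothetical helper, not part of TVM. The spatial points (vi, vj) are visited once in row-major order and once in a shuffled order, while the reduction over vk is always completed for each point.

```python
import random

N = 4  # small size for illustration; the text uses 128

# Integer inputs so that the comparison below is exact.
A = [[i + k for k in range(N)] for i in range(N)]
B = [[k * j + 1 for j in range(N)] for k in range(N)]

def matmul_in_order(spatial_points):
    """Compute Y = A @ B, visiting the (vi, vj) points in the given order."""
    Y = [[None] * N for _ in range(N)]
    for vi, vj in spatial_points:
        # Reduce axis: vk is summed over for each fixed (vi, vj).
        acc = 0
        for vk in range(N):
            acc += A[vi][vk] * B[vk][vj]
        Y[vi][vj] = acc
    return Y

row_major = [(i, j) for i in range(N) for j in range(N)]
shuffled = row_major[:]
random.Random(0).shuffle(shuffled)  # fixed seed, any permutation works

Y_ref = matmul_in_order(row_major)
Y_shuffled = matmul_in_order(shuffled)
print(Y_ref == Y_shuffled)
```

Because each Y[vi, vj] depends only on its own row of A and column of B, any visiting order of the spatial axes gives the same result. That freedom is what the spatial annotation records.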


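### 2.4.3.5 Why blocks need the extra information
The block-axis declarations make the block self-contained, independent of the outer loop nest i, j, k, and they also help verify that the loops driving the computation are correct. In the example below, the block axis expects a range of 128 but is bound to a loop of only 127 iterations, so the binding is rejected.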
```python
for i in range(127):
    with T.block('c'):
        vi = T.axis.spatial(128, i)
        # error here due to iterator size mismatch
```
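
The kind of consistency check described above can be modeled in a few lines of plain Python. This is a toy sketch under the assumption that an axis bound directly to a loop variable must cover its whole declared range. `check_binding` is a hypothetical name, and the code is not how TVM implements the check.

```python
def check_binding(axis_name, axis_extent, loop_extent):
    """Toy check (not TVM's implementation): a block axis bound directly to a
    loop variable must see every value in its declared range [0, axis_extent)."""
    if loop_extent != axis_extent:
        raise ValueError(
            f"block axis {axis_name} expects extent {axis_extent}, "
            f"but the bound loop has extent {loop_extent}"
        )
    return True

ok = check_binding("vi", 128, 128)   # loop matches the declared range

try:
    check_binding("vi", 128, 127)    # the mismatch from the snippet above
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)

print(ok, mismatch_message)
```

The declared range lives on the block, so a check like this can run without inspecting what the block body computes.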

### 2.4.3.6 Syntactic sugar for block-axis binding

```python
# "SSR": the axes are, in order, spatial, spatial, reduce
vi, vj, vk = T.axis.remap("SSR", [i, j, k])
```
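
To make the correspondence concrete, the sketch below prints the explicit declarations that a remap string stands for. `expand_remap` and `AXIS_KINDS` are hypothetical names used only for illustration. The extents are passed in explicitly here, on the assumption (true of the examples in these notes) that each axis range equals the extent of the loop it is bound to.

```python
AXIS_KINDS = {"S": "spatial", "R": "reduce"}

def expand_remap(kinds, loop_vars, extents):
    """Return the explicit axis declarations that a remap string stands for.

    Hypothetical helper for illustration; it only builds text and does not
    call into TVM.
    """
    if not (len(kinds) == len(loop_vars) == len(extents)):
        raise ValueError("kinds, loop_vars, and extents must have the same length")
    return [
        f"v{var} = T.axis.{AXIS_KINDS[kind]}({extent}, {var})"
        for kind, var, extent in zip(kinds, loop_vars, extents)
    ]

lines = expand_remap("SSR", ["i", "j", "k"], [128, 128, 128])
for line in lines:
    print(line)
# vi = T.axis.spatial(128, i)
# vj = T.axis.spatial(128, j)
# vk = T.axis.reduce(128, k)
```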

In [18]:
@tvm.script.ir_module
class MyModelWithAxisRemapSuger():
    @T.prim_func
    def mm_relu(A: T.Buffer[(128, 128), 'float32'],
                B: T.Buffer[(128, 128), 'float32'],
                C: T.Buffer[(128, 128), 'float32']):
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        Y = T.alloc_buffer((128, 128), dtype='float32')
        for i, j, k in T.grid(128, 128, 128):
            with T.block("Y"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])            # axis remap sugar
                with T.init():
                    Y[vi, vj] = T.float32(0)
                Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
        
        for i, j in T.grid(128, 128):
            with T.block("C"):
                vi, vj = T.axis.remap("SS", [i, j])                    # axis remap sugar
                C[vi, vj] = T.max(Y[vi, vj], T.float32(0))

### 2.4.3.7 Function attributes and decorators

```python
T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
```
global_symbol -> the name of the function; tir.noalias -> an attribute stating that none of the buffers overlap in memory
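
A rough illustration of why a no-overlap guarantee is useful, written in plain Python rather than TensorIR. The function names are hypothetical and the example is not taken from TVM. With separate input and output buffers, reordering the loop does not change the result; when both arguments refer to the same buffer, it does.

```python
def add_neighbor(src, dst):
    """dst[i] = src[i - 1] + src[i] for i >= 1, iterating forward."""
    for i in range(1, len(src)):
        dst[i] = src[i - 1] + src[i]

def add_neighbor_reversed(src, dst):
    """Same computation with the loop order reversed."""
    for i in reversed(range(1, len(src))):
        dst[i] = src[i - 1] + src[i]

# Distinct buffers: both loop orders agree.
src = [1, 2, 3, 4]
out_fwd, out_rev = [0] * 4, [0] * 4
add_neighbor(src, out_fwd)
add_neighbor_reversed(src, out_rev)
no_alias_agree = out_fwd == out_rev

# Aliased buffers (dst is src): the two loop orders now disagree.
buf_fwd, buf_rev = [1, 2, 3, 4], [1, 2, 3, 4]
add_neighbor(buf_fwd, buf_fwd)
add_neighbor_reversed(buf_rev, buf_rev)
alias_agree = buf_fwd == buf_rev

print(no_alias_agree, alias_agree)  # True False
```

Declaring that buffers never alias rules out the second case up front, so transformations that reorder loops can be reasoned about without it.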

```python
# decorators that declare the type of the object they annotate
@tvm.script.ir_module
@T.prim_func
```

```python
# marks MyModel as an IRModule -> an IRModule is the container object that holds a collection of tensor functions in MLC
@tvm.script.ir_module
```

In [25]:
print(type(MyModel), type(MyModel['mm_relu']))
print(MyModel)
# an IRModule can contain multiple tensor functions

<class 'tvm.ir.module.IRModule'> <class 'tvm.tir.function.PrimFunc'>
@mm_relu = primfn(A_handle: handle, B_handle: handle, C_handle: handle) -> ()
  attr = {"tir.noalias": True, "global_symbol": "mm_relu"}
  buffers = {A: Buffer(A_1: Pointer(global float32), float32, [128, 128], []),
             B: Buffer(B_1: Pointer(global float32), float32, [128, 128], []),
             C: Buffer(C_1: Pointer(global float32), float32, [128, 128], [])}
  buffer_map = {A_handle: A, B_handle: B, C_handle: C} {
  block([], "root") {
    tir.reads([])
    tir.writes([])
    Y = alloc_buffer(float32[128, 128])
     {
      for (i: int32, 0, 128) {
        for (j: int32, 0, 128) {
          for (k: int32, 0, 128) {
            block([128, 128, tir.reduce_axis(0, 128)], "Y") as [vi, vj, vk] {
              bind(vi, i)
              bind(vj, j)
              bind(vk, k)
              tir.reads([A[vi, vk], B[vk, vj]])
              tir.writes([Y[vi, vj]])
              with init() {
                Y[vi, vj]