<a href="https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/2_tensor_program_abstraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tensor Program Abstraction in Action





## Install packages 

This course uses features that are still under active development in TVM, an open-source machine learning compilation framework. The commands below install a packaged version for the MLC course.

In [1]:
%pip install ml_dtypes
%pip install apache-tvm-ffi

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Install mlc-ai-nightly by directly downloading the wheel file
# This works around Python 3.13 compatibility issues with pip's wheel finder
import sys
import subprocess
import urllib.request
import os

# Use FULL wheel filename - pip requires proper wheel naming format
# For Linux x86_64 (the wheel named below is the CPU build)
wheel_filename = "mlc_ai_nightly_cpu-0.20.dev748-py3-none-manylinux_2_28_x86_64.whl"
wheel_url = f"https://github.com/mlc-ai/package/releases/download/v0.9.dev0/{wheel_filename}"

# Download the wheel file
print(f"Downloading {wheel_url}...")
urllib.request.urlretrieve(wheel_url, wheel_filename)
print(f"Downloaded {wheel_filename}")

# Install the wheel file (use absolute path for reliability)
wheel_path = os.path.abspath(wheel_filename)
print(f"Installing {wheel_filename}...")
subprocess.check_call([sys.executable, "-m", "pip", "install", wheel_path, "--force-reinstall", "--no-deps"])

# Install dependencies
print("Installing dependencies...")
subprocess.check_call([sys.executable, "-m", "pip", "install", "attrs", "synr==0.6.0", "decorator", "numpy", "psutil", "scipy", "tornado", "cloudpickle"])

# Clean up
os.remove(wheel_filename)
print("Installation complete!")

Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_ai_nightly_cpu-0.20.dev748-py3-none-manylinux_2_28_x86_64.whl...


KeyboardInterrupt: 

## Constructing Tensor Program

Let us begin by constructing a tensor program that adds two vectors element-wise.

In [3]:
import tvm
from tvm.ir.module import IRModule
from tvm.script import ir as I
from tvm.script import tir as T
import numpy as np

In [4]:
@I.ir_module
class MyModule:
    @T.prim_func
    def main(A: T.Buffer((128,), "float32"), 
             B: T.Buffer((128,), "float32"), 
             C: T.Buffer((128,), "float32")):
        # extra annotations for the function
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        for i in range(128):
            with T.sblock("C"):
                # declare a data parallel iterator on spatial domain
                vi = T.axis.spatial(128, i)
                C[vi] = A[vi] + B[vi]

TVMScript is a way to express a tensor program using the Python AST (abstract syntax tree). Note that this code does not correspond to a Python program that is executed directly; it is a tensor program that can be used in the MLC (machine learning compilation) process. The language is designed to stay close to Python syntax while adding structures that make analysis and transformation easier.
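
To make the semantics concrete, here is a plain-Python sketch of what the `main` function above computes. It is not TVMScript and is not generated by TVM; the name `vector_add_reference` is made up for this illustration.

```python
def vector_add_reference(a, b):
    """Plain-Python reading of the tensor program: C[vi] = A[vi] + B[vi]."""
    assert len(a) == len(b)
    c = [0.0] * len(a)
    for i in range(len(a)):
        vi = i  # the spatial axis binds the loop variable i to the index vi
        c[vi] = a[vi] + b[vi]
    return c


a = [float(i) for i in range(128)]
b = [1.0] * 128
c = vector_add_reference(a, b)
print(c[:4], c[-1])
```

The tensor program describes the same loop, but as a data structure that a compiler can inspect and rewrite before any code runs.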

In [5]:
type(MyModule)

tvm.ir.module.IRModule

MyModule is an instance of an **IRModule** data structure, which is used to hold a collection of tensor functions. 

We can use the `script()` method to get a string representation of the IRModule, or `show()` to print a syntax-highlighted version. Both are useful for inspecting the module after each transformation step.

In [6]:
print(MyModule.script())
# Convert the TVM IRModule object to a string in Python-script form


# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((128,), "float32"), B: T.Buffer((128,), "float32"), C: T.Buffer((128,), "float32")):
        T.func_attr({"tir.noalias": True})
        # with T.sblock("root"):
        for i in range(128):
            with T.sblock("C"):
                vi = T.axis.spatial(128, i)
                T.reads(A[vi], B[vi])
                T.writes(C[vi])
                C[vi] = A[vi] + B[vi]


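An IR (intermediate representation) is the intermediate form a compiler uses while translating source code into machine code.

In TVM, the IR acts as a unified representation: models from different frameworks such as TensorFlow and PyTorch are converted into one common IR. TVM calls this container an IRModule, which is a collection of tensor functions.

When TVM parses the code into its internal IR, it analyzes the memory access pattern of each block and automatically generates `T.reads()` and `T.writes()`.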
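
As a rough illustration of what these region annotations express, the sketch below records in plain Python which elements each instance of block `C` reads and writes. The helper `block_c_regions` is hypothetical and only mirrors the `T.reads(A[vi], B[vi])` and `T.writes(C[vi])` lines in the printed module; it does not reproduce how TVM performs its analysis.

```python
def block_c_regions(vi):
    """Return (reads, writes) for one instance of block "C" at index vi."""
    reads = {("A", vi), ("B", vi)}
    writes = {("C", vi)}
    return reads, writes


all_writes = [block_c_regions(vi)[1] for vi in range(128)]
written = set().union(*all_writes)

# Every instance writes a different element of C, which is consistent with
# declaring the axis as spatial (data parallel).
writes_are_disjoint = len(written) == sum(len(w) for w in all_writes)
print(writes_are_disjoint)
```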

In [7]:
MyModule.show()

In [9]:
from tvm import s_tir
sch = s_tir.Schedule(MyModule)

In [10]:
print(sch.mod.script())

# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((128,), "float32"), B: T.Buffer((128,), "float32"), C: T.Buffer((128,), "float32")):
        T.func_attr({"tir.noalias": True})
        # with T.sblock("root"):
        for i in range(128):
            with T.sblock("C"):
                vi = T.axis.spatial(128, i)
                T.reads(A[vi], B[vi])
                T.writes(C[vi])
                C[vi] = A[vi] + B[vi]


In [12]:
block = sch.get_sblock("C")  # get block C
i, = sch.get_loops(block)  # get the loops around block C (we will split this for loop)


In [13]:
i0, i1, i2 = sch.split(i, factors=[None, 4, 4])  # split the for loop

print(sch.mod.script())

# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((128,), "float32"), B: T.Buffer((128,), "float32"), C: T.Buffer((128,), "float32")):
        T.func_attr({"tir.noalias": True})
        # with T.sblock("root"):
        for i_0, i_1, i_2 in T.grid(8, 4, 4):
            with T.sblock("C"):
                vi = T.axis.spatial(128, i_0 * 16 + i_1 * 4 + i_2)
                T.reads(A[vi], B[vi])
                T.writes(C[vi])
                C[vi] = A[vi] + B[vi]


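What was originally a single loop of 128 iterations has become a nest of three loops with extents 8, 4, and 4.
Splitting the loop this way turns the 128 iterations into 8 x 4 x 4 iterations, so each level can be optimized separately.
For example, when we want to run a loop in parallel on GPU threads, we can choose the split so that a loop extent matches the number of threads (such as 32).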
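
The index expression `i_0 * 16 + i_1 * 4 + i_2` in the printed module can be checked with a few lines of plain Python. This is only a sketch of the index arithmetic, not a TVM API; `split_indices` is a name invented here.

```python
def split_indices(extent=128, inner_factors=(4, 4)):
    """Enumerate vi for a loop of `extent` split into (outer, f1, f2)."""
    f1, f2 = inner_factors
    outer = extent // (f1 * f2)  # factors=[None, 4, 4] leaves this one inferred
    order = []
    for i_0 in range(outer):
        for i_1 in range(f1):
            for i_2 in range(f2):
                order.append(i_0 * (f1 * f2) + i_1 * f2 + i_2)
    return outer, order


outer, order = split_indices()
print(outer)
print(order == list(range(128)))
```

The split loop nest visits exactly the same indices, in the same order, as the original `for i in range(128)`.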

In [15]:
sch.reorder(i2, i1)  # change the loop order

In [16]:
print(sch.mod.script())

# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((128,), "float32"), B: T.Buffer((128,), "float32"), C: T.Buffer((128,), "float32")):
        T.func_attr({"tir.noalias": True})
        # with T.sblock("root"):
        for i_0, i_2, i_1 in T.grid(8, 4, 4):
            with T.sblock("C"):
                vi = T.axis.spatial(128, i_0 * 16 + i_1 * 4 + i_2)
                T.reads(A[vi], B[vi])
                T.writes(C[vi])
                C[vi] = A[vi] + B[vi]


In [17]:
sch.parallel(i0)  # parallelize the outermost loop
print(sch.mod.script())


# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((128,), "float32"), B: T.Buffer((128,), "float32"), C: T.Buffer((128,), "float32")):
        T.func_attr({"tir.noalias": True})
        # with T.sblock("root"):
        for i_0 in T.parallel(8):
            for i_2, i_1 in T.grid(4, 4):
                with T.sblock("C"):
                    vi = T.axis.spatial(128, i_0 * 16 + i_1 * 4 + i_2)
                    T.reads(A[vi], B[vi])
                    T.writes(C[vi])
                    C[vi] = A[vi] + B[vi]


What this means:
Using the schedule as a tool, we can try different variants of the same tensor program and in that way attempt to optimize it.
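
The sketch below mimics the parallelized loop nest in plain Python, using a thread pool from the standard library in place of `T.parallel(8)`. It is an assumption-laden illustration: `run_chunk` is a made-up helper, and Python threads are used only to show that the eight outer iterations are independent, not to gain speed.

```python
from concurrent.futures import ThreadPoolExecutor


def run_chunk(i_0, a, b, c):
    """Body of one iteration of the parallel outer loop (16 elements)."""
    for i_2 in range(4):
        for i_1 in range(4):
            vi = i_0 * 16 + i_1 * 4 + i_2
            c[vi] = a[vi] + b[vi]


a = [float(i) for i in range(128)]
b = [1.0] * 128
c = [0.0] * 128

# Each value of i_0 touches a disjoint slice of c, so the 8 chunks can run
# in any order, or concurrently, without coordinating writes to c.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(lambda i_0: run_chunk(i_0, a, b, c), range(8)))

print(c == [x + y for x, y in zip(a, b)])
```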

### Build and run

At any point, we can turn an IRModule into runnable functions by calling a build function.

In [19]:
rt_mod = tvm.build(sch.mod, target="llvm")  # The module for CPU backends.
print(type(rt_mod))

<class 'tvm.runtime.module.Module'>


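`tvm.build()` compiles an IRModule into executable code. Its input is an IRModule and its output is a `tvm.runtime.module.Module`.

With `target="llvm"`, it generates LLVM code for the CPU. Other targets such as `"cuda"`, `"opencl"`, and `"metal"` are also possible.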

After the build, `rt_mod` contains a collection of runnable functions. We can retrieve each function by its name.

In [20]:
func = rt_mod["main"]

In [21]:
func

ffi.Function(0x41ae360)

In [28]:
a = tvm.runtime.tensor(np.arange(128, dtype="float32"))
b = tvm.runtime.tensor(np.ones(128, dtype="float32"))
c = tvm.runtime.tensor(np.zeros(128, dtype="float32"))

In [29]:
print(a)
print(b)
print(c)


[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
  14.  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.
  28.  29.  30.  31.  32.  33.  34.  35.  36.  37.  38.  39.  40.  41.
  42.  43.  44.  45.  46.  47.  48.  49.  50.  51.  52.  53.  54.  55.
  56.  57.  58.  59.  60.  61.  62.  63.  64.  65.  66.  67.  68.  69.
  70.  71.  72.  73.  74.  75.  76.  77.  78.  79.  80.  81.  82.  83.
  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.  94.  95.  96.  97.
  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.
 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
 126. 127.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.

To invoke the function, we create three tensors in the TVM runtime, as in the cells above, and then pass them to the generated function.

In [30]:
func(a, b, c)
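
After this call, `c` should hold the element-wise sum of `a` and `b`. The following plain-Python sketch computes the values we would expect for the inputs used above, together with a small comparison helper. `all_close` is a hypothetical helper written for this note; how the contents of the TVM tensor are copied into a Python list is left out on purpose.

```python
import math


def all_close(actual, expected, rel_tol=1e-6):
    """Element-wise comparison helper (hypothetical, for illustration)."""
    return len(actual) == len(expected) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(actual, expected)
    )


# Inputs used above: a = arange(128), b = ones(128).
a_vals = [float(i) for i in range(128)]
b_vals = [1.0] * 128
expected_c = [x + y for x, y in zip(a_vals, b_vals)]  # 1.0, 2.0, ..., 128.0

print(expected_c[:3], expected_c[-1])
```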


In [32]:
a

<tvm.runtime.Tensor shape=(128,), cpu:0>
array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
        11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,
        22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,
        33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,
        44.,  45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,
        55.,  56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,
        66.,  67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,
        77.,  78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,
        88.,  89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,
        99., 100., 101., 102., 103., 104., 105., 106., 107., 108., 109.,
       110., 111., 112., 113., 114., 115., 116., 117., 118., 119., 120.,
       121., 122., 123., 124., 125., 126., 127.], dtype=float32)

In [31]:
print(a)
print(b)
print(c)

[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
  14.  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.
  28.  29.  30.  31.  32.  33.  34.  35.  36.  37.  38.  39.  40.  41.
  42.  43.  44.  45.  46.  47.  48.  49.  50.  51.  52.  53.  54.  55.
  56.  57.  58.  59.  60.  61.  62.  63.  64.  65.  66.  67.  68.  69.
  70.  71.  72.  73.  74.  75.  76.  77.  78.  79.  80.  81.  82.  83.
  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.  94.  95.  96.  97.
  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.
 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
 126. 127.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.

In [34]:
sch.mod.show()

## Transform the Tensor Program

Now let us start to transform the tensor program. A tensor program can be transformed using an auxiliary data structure called a schedule.


In [None]:
sch = tvm.s_tir.Schedule(MyModule)
print(type(sch))

<class 'tvm.s_tir.schedule.schedule.Schedule'>

Let us first try to split the loops.

In [None]:
# Get the block by its name
block_c = sch.get_sblock("C")
# Get the loops surrounding the block
(i,) = sch.get_loops(block_c)
# Tile the loop nest
i_0, i_1, i_2 = sch.split(i, factors=[None, 4, 4])
sch.mod.show()

We can also reorder the loops. Here we move loop `i_2` outside of `i_1`.




In [None]:
sch.reorder(i_0, i_2, i_1)
sch.mod.show()
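
Reordering changes the order in which elements are visited, not which elements are visited. The plain-Python sketch below makes that visible for the loop extents used here; `visit_order` is a name invented for this illustration and is not part of TVM.

```python
def visit_order(loop_order):
    """Return the sequence of vi values for a given nesting of i_0, i_1, i_2."""
    extents = {"i_0": 8, "i_1": 4, "i_2": 4}
    first, second, third = loop_order
    order = []
    for x in range(extents[first]):
        for y in range(extents[second]):
            for z in range(extents[third]):
                idx = {first: x, second: y, third: z}
                order.append(idx["i_0"] * 16 + idx["i_1"] * 4 + idx["i_2"])
    return order


before = visit_order(("i_0", "i_1", "i_2"))
after = visit_order(("i_0", "i_2", "i_1"))

print(before[:8])
print(after[:8])
print(sorted(after) == before)
```

With `i_1` innermost, consecutive iterations are 4 elements apart, whereas the original order walks through memory contiguously.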

Finally, we can add a hint to the program generator that we want to parallelize the outermost loop.

In [None]:
sch.mod.show()

In [None]:
sch.parallel(i_0)
sch.mod.show()

We can build and run the transformed program.


In [None]:
transformed_mod = tvm.build(sch.mod, target="llvm")  # The module for CPU backends.
transformed_mod["main"](a, b, c)

## Constructing Tensor Program using Tensor Expression

In the previous example, we used TVMScript directly to construct the tensor program. In practice, it is usually helpful to construct these functions programmatically from existing definitions. Tensor expression is an API that helps us build expression-like array computations.

In [37]:
# namespace for tensor expression utility
from tvm import te

# declare the computation using the expression API
A = te.placeholder((128, ), name="A")
B = te.placeholder((128, ), name="B")
C = te.compute((128,), lambda i: A[i] + B[i], name="C")

# create a function with the specified list of arguments. 
func = te.create_prim_func([A, B, C])
# mark that the function name is main
func = func.with_attr("global_symbol", "main")
ir_mod_from_te = IRModule({"main": func})

ir_mod_from_te.show()
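
The idea behind `te.compute` is that an output is described by its shape plus a function from an output index to a value. The sketch below imitates that idea eagerly in plain Python. The `compute` function here is a hypothetical stand-in: `te.compute` itself builds a symbolic description rather than calculating values.

```python
def compute(shape, fcompute):
    """Eagerly evaluate fcompute at every index of a 1-D shape.

    Hypothetical stand-in for illustration; it calculates values directly
    instead of building a symbolic computation.
    """
    (n,) = shape
    return [fcompute(i) for i in range(n)]


A = [float(i) for i in range(128)]
B = [1.0] * 128
C = compute((128,), lambda i: A[i] + B[i])
print(C[:4])
```

Because the real API keeps the description symbolic, `te.create_prim_func` can turn it into the same kind of loop nest that we wrote by hand in TVMScript.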

## Transforming a matrix multiplication program

In the above example, we showed how to transform a vector add. Now let us apply the same approach to a slightly more complicated program (matrix multiplication). Let us first build the initial code using the tensor expression API.


In [42]:
from tvm import te

M = 1024
K = 1024
N = 1024

# The default tensor type in tvm
dtype = "float32"

target = "llvm"
dev = tvm.device(target, 0)

# Algorithm
k = te.reduce_axis((0, K), "k")
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")

# Default schedule
func = te.create_prim_func([A, B, C])
func = func.with_attr("global_symbol", "main")
ir_module = IRModule({"main": func})
ir_module.show()


func = tvm.build(ir_module, target="llvm")  # The module for CPU backends.

a = tvm.runtime.tensor(np.random.rand(M, K).astype(dtype), dev)
b = tvm.runtime.tensor(np.random.rand(K, N).astype(dtype), dev)
c = tvm.runtime.tensor(np.zeros((M, N), dtype=dtype), dev)
func(a, b, c)

evaluator = func.time_evaluator(func.entry_name, dev, number=1)
print("Baseline: %f" % evaluator(a, b, c).mean)

Baseline: 6.092609


We can transform the loop access pattern to make it more cache-friendly. Let us use the following schedule.

In [41]:
sch = tvm.s_tir.Schedule(ir_module)
print(type(sch))
block_c = sch.get_sblock("C")
# Get the loops surrounding the block
(y, x, k) = sch.get_loops(block_c)
block_size = 32
yo, yi = sch.split(y, [None, block_size])  # yi: inner loop of extent 32, yo: outer loop
xo, xi = sch.split(x, [None, block_size])  # xi: inner loop of extent 32, xo: outer loop

sch.reorder(yo, xo, k, yi, xi)
sch.mod.show()

func = tvm.build(sch.mod, target="llvm")  # The module for CPU backends.

c = tvm.runtime.tensor(np.zeros((M, N), dtype=dtype), dev)
func(a, b, c)

evaluator = func.time_evaluator(func.entry_name, dev, number=1)
print("after transformation: %f" % evaluator(a, b, c).mean)

<class 'tvm.s_tir.schedule.schedule.Schedule'>


after transformation: 0.325078
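
The loop order `yo, xo, k, yi, xi` produced by this schedule can be written out in plain Python to see what it does. This is a sketch on small matrices, with function names invented here; it shows only that the tiled order computes the same result, not the speedup itself.

```python
def matmul_naive(A, B, M, N, K):
    """Reference loop nest in the default order y, x, k."""
    C = [[0.0] * N for _ in range(M)]
    for y in range(M):
        for x in range(N):
            for k in range(K):
                C[y][x] += A[y][k] * B[k][x]
    return C


def matmul_tiled(A, B, M, N, K, block_size):
    """Same computation with loops ordered as yo, xo, k, yi, xi."""
    assert M % block_size == 0 and N % block_size == 0
    C = [[0.0] * N for _ in range(M)]
    for yo in range(M // block_size):
        for xo in range(N // block_size):
            for k in range(K):
                for yi in range(block_size):
                    for xi in range(block_size):
                        y = yo * block_size + yi
                        x = xo * block_size + xi
                        C[y][x] += A[y][k] * B[k][x]
    return C


M = N = K = 8  # small sizes so the pure-Python loops finish quickly
A = [[float(i + j) for j in range(K)] for i in range(M)]
B = [[float(i - j) for j in range(N)] for i in range(K)]

print(matmul_tiled(A, B, M, N, K, block_size=4) == matmul_naive(A, B, M, N, K))
```

For a fixed `(yo, xo)` pair, all `K` steps update the same `block_size` by `block_size` tile of `C`, so that tile can stay in cache while it is being accumulated.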


Try changing the value of `block_size` to see what performance you can get. In practice, we will leverage an automated system to search over a set of possible transformations to find an optimal one.

In [46]:
sch = tvm.s_tir.Schedule(ir_module)
print(type(sch))
block_c = sch.get_sblock("C")
# Get the loops surrounding the block
(y, x, k) = sch.get_loops(block_c)
block_size = 32
yo, yi = sch.split(y, [None, block_size])  # yi: inner loop of extent 32, yo: outer loop
xo, xi = sch.split(x, [None, block_size])  # xi: inner loop of extent 32, xo: outer loop

sch.reorder(yo, xo, k, yi, xi)
sch.parallel(yo)
sch.mod.show()

func = tvm.build(sch.mod, target="llvm")  # The module for CPU backends.

c = tvm.runtime.tensor(np.zeros((M, N), dtype=dtype), dev)
func(a, b, c)

evaluator = func.time_evaluator(func.entry_name, dev, number=1)
print("after transformation: %f" % evaluator(a, b, c).mean)

<class 'tvm.s_tir.schedule.schedule.Schedule'>


after transformation: 0.014044
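
To put the three timings printed above side by side, the short calculation below derives speedups and an approximate throughput. It assumes the usual count of 2 * M * N * K floating-point operations for a matrix multiplication; the timings are the single-run values from this notebook (`number=1`), so they are noisy and will differ on other machines.

```python
M = K = N = 1024
flops = 2 * M * N * K  # one multiply and one add per (m, n, k) triple

# Mean times in seconds, copied from the outputs above.
timings = {
    "baseline": 6.092609,
    "tiled": 0.325078,
    "tiled + parallel": 0.014044,
}

for name, seconds in timings.items():
    speedup = timings["baseline"] / seconds
    gflops = flops / seconds / 1e9
    print(f"{name}: {speedup:.1f}x, {gflops:.2f} GFLOP/s")
```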
