<a href="https://colab.research.google.com/github/vin136/llm-infer/blob/main/Geeking_out.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Goal: Build a good mental model of the `mlc`/`tvm` compilation process

`Note`: This is an attempt to understand the internals of the `mlc` and `tvm` compilation engines. It will at least help me use these tools confidently, even if I never write a custom kernel myself.

Practically speaking, for fast inference we only want to take our LLM, express it as a sequence of functions (e.g. `linear` => `relu` ...), and map that to an (almost) corresponding sequence of CUDA kernels, without ever having to write a custom CUDA kernel.

## First things first - what's `tvm/mlc` doing to my model?


1. Dependency minimization: Remove fluff from your development code. Keep only what's needed to run the model.

2. Leverage hardware native acceleration: change the model to a `form` that directly invokes native acceleration libraries.

3. Optimization in general: There are many equivalent ways of doing an operation (say conv or attention); find the best one.

## Let's dig in.

LLM = weights (`tensors`) + a sequence of transformations on them (`tensor-functions`).

As an engineering discipline, software engineering/computer science is mostly a search for good abstractions. `tvm` takes a `tensor-function`, or a sequence of them, and represents it in an **equivalent way** that maps better to the metal (e.g. the CUDA architecture). More concretely:

high-level code => intermediate representation (`mlc`) => map to low-level primitives.
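
To make "weights + sequence of tensor-functions" concrete, here is a minimal plain-Python sketch. It does not use `tvm`; lists stand in for tensors, and the names `linear`, `relu`, `weights`, `pipeline` and `run` are made up for illustration.

```python
# A toy "model": weights plus an ordered list of tensor functions.
# Plain Python lists stand in for tensors; all names are illustrative.

def linear(x, w, b):
    """y[j] = sum_i x[i] * w[i][j] + b[j]"""
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]

weights = {
    "w": [[1.0, -1.0], [0.5, 2.0]],  # shape (2, 2)
    "b": [0.0, -4.0],                # shape (2,)
}

# The model is the composition linear -> relu.
pipeline = [
    lambda x: linear(x, weights["w"], weights["b"]),
    relu,
]

def run(x):
    for fn in pipeline:
        x = fn(x)
    return x

out = run([2.0, 1.0])
print(out)
```

A compiler like `tvm` starts from this kind of description (data plus a chain of functions) and is free to change how each function is computed, as long as the result stays the same.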







In [1]:
!python3 -m  pip install mlc-ai-nightly -f https://mlc.ai/wheels

Looking in links: https://mlc.ai/wheels
Collecting mlc-ai-nightly
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_ai_nightly-0.12.dev2119-cp310-cp310-manylinux_2_28_x86_64.whl (90.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90.5/90.5 MB 4.8 MB/s eta 0:00:00
Installing collected packages: mlc-ai-nightly
Successfully installed mlc-ai-nightly-0.12.dev2119


In [2]:
import tvm
from tvm.ir.module import IRModule
from tvm.script import tir as T
import numpy as np


@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(A: T.Buffer((128,), "float32"),
             B: T.Buffer((128,), "float32"),
             C: T.Buffer((128,), "float32")):
        # extra annotations for the function
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        for i in range(128):
            with T.block("C"):
                # declare a data parallel iterator on spatial domain
                vi = T.axis.spatial(128, i)
                C[vi] = A[vi] + B[vi]
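
For intuition, the `prim_func` above computes the same thing as the plain-Python loop below. This is only a sketch of the semantics, not what `tvm` generates: lists replace buffers, and `vi = i` mimics the spatial axis binding inside the block.

```python
def main(A, B, C):
    # Same iteration space as the TensorIR function: one loop of extent 128.
    for i in range(128):
        # A spatial axis means every vi is written exactly once,
        # and no iteration depends on another one.
        vi = i
        C[vi] = A[vi] + B[vi]

A = [float(i) for i in range(128)]
B = [1.0] * 128
C = [0.0] * 128
main(A, B, C)
```

Because the output buffer `C` is passed in and filled in place, the function has no return value. The TensorIR version follows the same calling convention.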




In [3]:
type(MyModule)

tvm.ir.module.IRModule

In [4]:
# Let's inspect the module; the printed script adds some extra annotations
MyModule.show()

In [5]:
# Wrap the annotated module in a Schedule. The schedule is the handle for applying
# transformations that yield equivalent programs. Nothing has been applied yet, so the module is unchanged.


sch = tvm.tir.Schedule(MyModule)
print(type(sch))

<class 'tvm.tir.schedule.schedule.Schedule'>


In [7]:
print(sch.mod.script())

# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((128,), "float32"), B: T.Buffer((128,), "float32"), C: T.Buffer((128,), "float32")):
        T.func_attr({"tir.noalias": T.bool(True)})
        # with T.block("root"):
        for i in range(128):
            with T.block("C"):
                vi = T.axis.spatial(128, i)
                T.reads(A[vi], B[vi])
                T.writes(C[vi])
                C[vi] = A[vi] + B[vi]


In [8]:
# Let's manually create a different but equivalent form: first, split the loop

# Get block by its name
block_c = sch.get_block("C")
# Get loops surrounding the block
(i,) = sch.get_loops(block_c)
# Split the single loop into three nested loops; None lets tvm infer the outer extent (128 / (4 * 4) = 8).
i_0, i_1, i_2 = sch.split(i, factors=[None, 4, 4])
sch.mod.show()
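
As a sanity check on what the split means, here is a plain-Python sketch of the resulting loop nest (no `tvm` involved). It assumes the usual index reconstruction `vi = i_0 * 16 + i_1 * 4 + i_2`, which is what a split with inner factors 4 and 4 implies.

```python
N = 128
A = [float(i) for i in range(N)]
B = [1.0] * N
C = [0.0] * N
visited = []

for i_0 in range(8):          # 128 // (4 * 4), the extent inferred for None
    for i_1 in range(4):
        for i_2 in range(4):
            # Recombine the three loop variables into the original index.
            vi = i_0 * 16 + i_1 * 4 + i_2
            visited.append(vi)
            C[vi] = A[vi] + B[vi]
```

The three nested loops visit exactly the indices 0..127 in the original order, so the result matches the single loop. Only the loop structure changed.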

In [9]:
# We can also reorder the loops (here: swap the two inner loops)

sch.reorder(i_0, i_2, i_1)
sch.mod.show()
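
Reordering changes the order in which indices are visited, not which indices are visited. A small plain-Python sketch (again without `tvm`, using the same assumed index formula as the split) shows this:

```python
order = []
for i_0 in range(8):
    for i_2 in range(4):      # i_2 is now the middle loop
        for i_1 in range(4):  # i_1 is now the innermost loop
            order.append(i_0 * 16 + i_1 * 4 + i_2)

print(order[:8])
```

The innermost loop now strides through memory in steps of 4 rather than 1. That is still correct for an element-wise add, because each output element is independent, but the memory access pattern differs. Which order is faster depends on the hardware.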

In [10]:
# Parallelize the outermost loop

sch.parallel(i_0)
sch.mod.show()
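
Why is it safe to parallelize `i_0`? Each value of `i_0` owns a disjoint chunk of 16 output elements. The sketch below mimics that with a thread pool from the standard library; `run_chunk` is a made-up helper. It illustrates independence only: Python threads will not give the speedup a compiled parallel loop does.

```python
from concurrent.futures import ThreadPoolExecutor

N = 128
A = [float(i) for i in range(N)]
B = [2.0] * N
C = [0.0] * N

def run_chunk(i_0):
    # Each i_0 writes only indices [16 * i_0, 16 * i_0 + 16), so chunks never overlap.
    for i_2 in range(4):
        for i_1 in range(4):
            vi = i_0 * 16 + i_1 * 4 + i_2
            C[vi] = A[vi] + B[vi]

# Run the 8 outer iterations concurrently; completion order does not matter.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(run_chunk, range(8)))
```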

In [None]:
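# Compile the IRModule into a runtime module with callable functions.
# Note: this builds the original MyModule; pass sch.mod instead to build the transformed version.

rt_mod = tvm.build(MyModule, target="llvm")  # The module for CPU backends.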
print(type(rt_mod))

So, given a backend, `mlc` takes the PyTorch code => converts it into an intermediate representation (`IRModule`) => which is conducive to program search for the best variant on that backend (e.g. NVIDIA GPU, Apple GPU, etc.).
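
The "program search" idea can be sketched in a few lines of plain Python. This is a toy, not how `tvm` tunes programs: it enumerates a handful of split factors, checks that every variant computes the same result, and keeps whichever is fastest on the current machine. `make_variant` and the candidate factors are made up for illustration.

```python
import timeit

N = 128
A = [float(i) for i in range(N)]
B = [1.0] * N

def make_variant(inner):
    """Return an add kernel whose loop is split as (N // inner, inner)."""
    def kernel():
        C = [0.0] * N
        for i_0 in range(N // inner):
            for i_1 in range(inner):
                vi = i_0 * inner + i_1
                C[vi] = A[vi] + B[vi]
        return C
    return kernel

candidates = {inner: make_variant(inner) for inner in (1, 4, 16, 128)}
reference = [a + b for a, b in zip(A, B)]

# Every candidate must be equivalent to the reference computation ...
all_equivalent = all(k() == reference for k in candidates.values())

# ... so we are free to pick whichever one measures fastest here.
timings = {inner: timeit.timeit(k, number=200) for inner, k in candidates.items()}
best = min(timings, key=timings.get)
print("fastest inner factor on this machine:", best)
```

The winner can differ from machine to machine and even from run to run, which is exactly why the search is done per backend.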