In [1]:
%matplotlib inline

import IPython



## Auto-Optimization with TVM

This notebook is adapted from the TVM tutorial "Auto-scheduling a Neural Network for NVIDIA GPU"
by Lianmin Zheng, modified to tune a PyTorch model.


In [2]:
import numpy as np

import tvm
from tvm import relay, auto_scheduler
import tvm.relay.testing
from tvm.contrib import graph_executor

# Define a Network

We trace a model to TorchScript and import it into TVM.

In [3]:
import torchvision
import torch
model = torchvision.models.resnet18(pretrained=True).eval().cuda()
input_shape = 1, 3, 224, 224
inp = torch.randn(input_shape, device="cuda")
traced_model = torch.jit.trace(model, inp)
output_shape = traced_model(inp).shape


In [5]:
target = tvm.target.Target("cuda")
mod, params = tvm.relay.frontend.from_pytorch(traced_model, [('input', input_shape)])


Extract Search Tasks
--------------------
Next, we extract the search tasks and their weights from a network.
The weight of a task is the number of appearances of the task's subgraph
in the whole network.
By using the weight, we can approximate the end-to-end latency of the network
as :code:`sum(latency[t] * weight[t])`, where :code:`latency[t]` is the
latency of a task and :code:`weight[t]` is the weight of the task.
The task scheduler optimizes exactly this objective.
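
As a minimal sketch of the objective described above, the weighted sum can be written in a few lines of plain Python. The helper name `estimated_latency` and the three latency/weight values are hypothetical, chosen only for illustration; they are not measured from the model in this notebook.

```python
def estimated_latency(latencies_ms, weights):
    """Return sum(latency[t] * weight[t]) over all tasks."""
    if len(latencies_ms) != len(weights):
        raise ValueError("need exactly one weight per task")
    return sum(lat * w for lat, w in zip(latencies_ms, weights))


# Hypothetical per-task latencies (ms) and weights, for illustration only.
latencies_ms = [0.041, 0.112, 0.004]
weights = [1, 2, 4]

total_ms = estimated_latency(latencies_ms, weights)
print(f"estimated end-to-end latency: {total_ms:.3f} ms")
```

A task that appears many times in the network contributes proportionally more to the estimate, which is why a scheduler driven by this sum tends to spend more trials on heavily weighted or slow tasks.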



In [6]:
# Extract tasks from the network
print("Extract tasks...")
#mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

for idx, task in enumerate(tasks):
    print("========== Task %d  (workload key: %s) ==========" % (idx, task.workload_key))
    #print(task.compute_dag)


Extract tasks...


Begin Tuning
------------
Now, we set some options for tuning and launch the search tasks.

* :code:`measure_ctx` launches a different process for measurement to
  provide isolation. It can protect the master process from GPU crashes
  during measurement and avoid other runtime conflicts.
* :code:`min_repeat_ms` defines the minimum duration of one "repeat" in every measurement.
  Running for at least this long warms up the GPU, which is necessary for accurate measurement results.
  Typically, we recommend a value >= 300 ms.
* :code:`num_measure_trials` is the number of measurement trials we can use during the tuning.
  You can set it to a small number (e.g., 200) for a fast demonstrative run.
  In practice, we recommend setting it around :code:`900 * len(tasks)`,
  which is typically enough for the search to converge.
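  For example, with 24 tasks, :code:`900 * 24 = 21600`, so a value around 20000 works;
  the run shown below lists 18 tasks (IDs 0 to 17), which gives about 16000 by the same rule.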
  You can adjust this parameter according to your time budget.
* In addition, we use :code:`RecordToFile` to dump measurement records into a log file.
  The measurement records can be used to query the best schedules found so far, resume the search,
  and do further analysis later.
* See :any:`auto_scheduler.TuningOptions`,
  :any:`auto_scheduler.LocalRPCMeasureContext` for more parameters.
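
The rule of thumb for :code:`num_measure_trials` can be sketched as a small helper. The function name `suggest_num_measure_trials` is hypothetical, and rounding the budget up to a whole number of measurement rounds is a convenience assumed here, not something TVM requires. The default of 64 per round mirrors the commented value in the tuning cell.

```python
def suggest_num_measure_trials(num_tasks, trials_per_task=900, per_round=64):
    """Return trials_per_task * num_tasks, rounded up to whole measurement rounds."""
    if num_tasks <= 0 or per_round <= 0:
        raise ValueError("num_tasks and per_round must be positive")
    raw = trials_per_task * num_tasks
    rounds = -(-raw // per_round)  # ceiling division without importing math
    return rounds * per_round


# Illustrative task counts; in the notebook you would pass len(tasks).
budget_24 = suggest_num_measure_trials(24)
budget_18 = suggest_num_measure_trials(18, per_round=8)
print(budget_24, budget_18)
```

The result is only a starting point: a smaller value shortens the run at the cost of less converged schedules, so adjust it to your time budget.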




In [7]:

class ClearOutput(auto_scheduler.task_scheduler.TaskSchedulerCallback):
    def pre_tune(self, task_scheduler, task_id):
        IPython.display.clear_output()

log_file = 'tune.log'
def run_tuning():
    print("Begin tuning...")
    measure_ctx = auto_scheduler.LocalRPCMeasureContext(repeat=1, min_repeat_ms=300, timeout=10)

    tuner = auto_scheduler.TaskScheduler(tasks, task_weights, callbacks=[
        ClearOutput(),
        tvm.auto_scheduler.task_scheduler.PrintTableInfo(),
        tvm.auto_scheduler.task_scheduler.LogEstimatedLatency('total_latency.tsv')
    ])
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=200,  # change this to 20000 to achieve the best performance
        num_measures_per_round=8, # 64
        runner=measure_ctx.runner,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )

    tuner.tune(tune_option)


run_tuning()

|  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
-------------------------------------------------
|    0 |        0.041 |        5839.50 |      8 |
|    1 |        0.112 |        2073.29 |      8 |
|    2 |        0.004 |         504.65 |      8 |
|    3 |        0.053 |        2177.95 |     16 |
|    4 |        0.004 |          -0.00 |      8 |
|    5 |        0.019 |         663.25 |      8 |
|    6 |        0.052 |        4481.00 |      8 |
|    7 |        0.052 |        4471.45 |     16 |
|    8 |        0.038 |        3053.44 |      8 |
|    9 |        0.016 |         798.09 |      8 |
|   10 |        0.041 |         316.21 |      8 |
|   11 |        0.119 |        1943.11 |      8 |
|   12 |        0.018 |          56.89 |      8 |
|   13 |        0.073 |        3162.12 |     16 |
|   14 |        0.052 |        2226.61 |      8 |
|   15 |        0.124 |        1859.29 |     24 |
|   16 |        0.154 |        1498.07 |     16 |
|   17 |        0.056 |        4156.56 |      8 |


Compile and Evaluate
--------------------
After auto-tuning, we can compile the network with the best schedules we found.
All measurement records are dumped into the log file during auto-tuning,
so we can read the log file and load the best schedules.



#### Compile with the history best
print("Compile...")
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
        lib = relay.build(mod, target=target, params=params)
# Create graph executor
dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))
#data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
import torch.utils.dlpack
data_tvm = tvm.nd.from_dlpack(torch.utils.dlpack.to_dlpack(inp))

module.set_input("input", data_tvm)


module.run(); dev.sync()
%timeit module.run(); dev.sync()

In [9]:
traced_model(inp); torch.cuda.synchronize()
%timeit traced_model(inp); torch.cuda.synchronize()

1.28 ms ± 161 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
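
Outside a notebook, where `%timeit` is not available, the same measurement pattern (one warm-up call, then timed calls that each wait for the device) can be sketched with the standard library. The helper `benchmark` and its parameters are hypothetical names introduced for this example; it is a rough stand-in for `%timeit`, not a reimplementation of it.

```python
import time


def benchmark(run, sync=lambda: None, warmup=1, repeats=7, number=100):
    """Return (mean, std) of the time per call in seconds over `repeats` batches."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        run()
    sync()

    per_call = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(number):
            run()
            sync()  # wait for queued device work before the next call
        per_call.append((time.perf_counter() - start) / number)

    mean = sum(per_call) / repeats
    std = (sum((t - mean) ** 2 for t in per_call) / repeats) ** 0.5
    return mean, std


# Stand-in workload so the sketch runs without a GPU.
mean_s, std_s = benchmark(lambda: sum(range(1000)), repeats=3, number=10)
print(f"{mean_s * 1e6:.1f} us +/- {std_s * 1e6:.1f} us per call")
```

With the objects from this notebook, the calls would look like `benchmark(module.run, dev.sync)` for TVM and `benchmark(lambda: traced_model(inp), torch.cuda.synchronize)` for PyTorch. Without the synchronization step, the timer would only measure how long it takes to queue the GPU work.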


Other Tips
----------
1. During tuning, the auto-scheduler needs to compile many programs and
   extract features from them. This part is CPU-intensive,
   so a high-performance CPU with many cores is recommended for faster search.
2. You can use :code:`python3 -m tvm.auto_scheduler.measure_record --mode distill -i log.json`
   to distill a large log file and keep only the best records.
3. You can resume a search from a previous log file. You just need to
   add the argument :code:`load_log_file` when creating the task scheduler
   in the function :code:`run_tuning`. For example:
   :code:`tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)`
4. If you have multiple target GPUs, you can use all of them to
   parallelize the measurements. Check this `section <tutorials-autotvm-scale-up-rpc-tracker>`
   to learn how to use the RPC Tracker and RPC Server.
   To use the RPC Tracker in auto-scheduler, replace the runner in :code:`TuningOptions`
   with :any:`auto_scheduler.RPCRunner`.
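
To illustrate what distilling a log means in tip 2, the sketch below keeps only the fastest record per workload. It assumes a deliberately simplified, hypothetical record shape, `(workload_key, latency_seconds)` tuples; this is not the real auto-scheduler log format, and the workload names and latencies are made up. For real log files, use the command-line tool from tip 2.

```python
def best_per_workload(records):
    """Keep the lowest latency seen for each workload key."""
    best = {}
    for workload_key, latency_s in records:
        if workload_key not in best or latency_s < best[workload_key]:
            best[workload_key] = latency_s
    return best


# Hypothetical records: (workload key, measured latency in seconds).
records = [
    ("conv_a", 0.0021),
    ("conv_b", 0.0009),
    ("conv_a", 0.0013),
    ("conv_b", 0.0011),
]

best = best_per_workload(records)
for key, latency_s in sorted(best.items()):
    print(f"{key}: {latency_s * 1e3:.2f} ms")
```

The distilled result has one entry per workload regardless of how many trials were recorded, which is why a distilled log is much smaller than the full tuning log.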

