# The PyTorch JIT
## PyCon 2019 - Berlin
### Tilman Krokotsch

# About Me

### Tilman Krokotsch
#### Deep Learning Engineer @ IAV Digital Lab
#### PhD Student @ TU Berlin under Prof. Clemens Gühmann

First of all, we have to explain what a JIT is. The acronym stands for just-in-time compiler. It compiles source code to machine code at runtime. This way the code can be serialized for easy transfer and optimized for faster execution. In our case, the JIT compiles the forward pass of a neural network from Python into a computation graph representation. This graph can then be executed directly in the C++ runtime of PyTorch without intermediate callbacks to Python. The graph representation has several advantages for deploying neural network models, which we will investigate in this talk.

# What are we talking about?

### JIT == Just-in-time compiler

> \[..\] is a way to create serializable and optimizable models from PyTorch code. _**- PyTorch Docs**_


### Compiles NN forward pass from Python into computation graph

### Executes computation graph directly in the C++ runtime of PyTorch

# What do we need it for?

### Deployment!

These are the promises I made in the title of this talk. They are based on my understanding of the JIT at the time of writing the talk proposal. A lot has happened since then (e.g. a minor PyTorch release), so let's see how these promises hold up.

# The Promises

### 1. Minimize Dependencies
### 2. Hide Code
### 3. Boost Performance

# Imports and Stuff

First of all, we need imports for PyTorch itself, torchvision for its pretrained models and the JIT module. pickle, NumPy and SciPy are only needed later for the runtime analysis.

In [1]:
import torch
import torchvision
import torch.jit as jit

import pickle
import numpy as np
from scipy.stats import f_oneway

We will use a pretrained AlexNet from the torchvision model zoo for this example. Let's load it and have a look at its architecture. Printing the network lets us know pretty much everything: layer types, layer order, kernel sizes and so on.

In [2]:
net = torchvision.models.alexnet(pretrained=True)
print(net)

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
 

# What are we competing against?

There are two conventional ways to save a PyTorch model. The first calls torch.save() on the model itself. The second calls it on the state_dict of the model, a dictionary of all its parameters and buffers. Both methods use the pickle module internally: the first pickles the entire model object, the second only the state_dict. The docs recommend the second method. We will see why later.

In [3]:
# Save the whole module object to a pickle file
torch.save(net, 'untraced_model.pth')

In [4]:
# Save only the weights and buffers to a pickle file
torch.save(net.state_dict(), 'state_dict_model.pth')
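To make the trade-off of the state_dict method concrete, here is a minimal sketch of the round trip. It assumes a hypothetical `TinyNet` standing in for AlexNet and writes to an in-memory buffer instead of a file. Note that the class definition has to be available again at load time.

```python
import io

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for AlexNet, only for illustration."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


net = TinyNet().eval()

# Save only the parameters and buffers (in-memory buffer instead of a file)
buffer = io.BytesIO()
torch.save(net.state_dict(), buffer)
buffer.seek(0)

# Loading needs the Python class: we build a fresh instance first
# and then copy the stored weights into it
restored = TinyNet().eval()
restored.load_state_dict(torch.load(buffer))

x = torch.randn(1, 4)
outputs_match = torch.allclose(net(x), restored(x))
print(outputs_match)
```

The file alone is not enough to run the model. Whoever loads it also needs the source code of the network.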

A less common way to save your model is to export it to ONNX. The Open Neural Network eXchange format is supported by several frameworks (e.g. CNTK and Caffe2) and aims to make transferring models between them easier. To export our model, we need to record the computational graph with an exemplary input. The graph and the weights are then saved into a single file.

In [5]:
# Build an ONNX graph and save the module to an ONNX file
x = torch.randn(1, 3, 244, 244)
net.eval()
torch.onnx.export(net, x, "onnx_model.onnx")

# How to use the JIT

Now we can convert our network into TorchScript, the language used by the JIT. For most feed-forward networks this is done by tracing. We set the network to evaluation mode, as we want to deploy it, and define a representative input. The trace() function feeds the input through the forward() function of our network and records all operations. Out comes our desired ScriptModule.

In [6]:
# Define an exemplary input
x = torch.randn(1, 3, 244, 244)
# Set the module to evaluation mode
net.eval()
# Build the graph by tracing the forward function
traced_net = jit.trace(net, x)
print(traced_net.graph)

graph(%self : ClassType<AlexNet>,
      %input.1 : Float(1, 3, 244, 244)):
  %1 : ClassType<Sequential> = prim::GetAttr[name="features"](%self)
  %2 : ClassType<Conv2d> = prim::GetAttr[name="0"](%1)
  %weight.1 : Tensor = prim::GetAttr[name="weight"](%2)
  %4 : Tensor = prim::GetAttr[name="bias"](%2)
  %7 : ClassType<Conv2d> = prim::GetAttr[name="3"](%1)
  %weight.2 : Tensor = prim::GetAttr[name="weight"](%7)
  %9 : Tensor = prim::GetAttr[name="bias"](%7)
  %12 : ClassType<Conv2d> = prim::GetAttr[name="6"](%1)
  %weight.3 : Tensor = prim::GetAttr[name="weight"](%12)
  %14 : Tensor = prim::GetAttr[name="bias"](%12)
  %16 : ClassType<Conv2d> = prim::GetAttr[name="8"](%1)
  %weight.4 : Tensor = prim::GetAttr[name="weight"](%16)
  %18 : Tensor = prim::GetAttr[name="bias"](%16)
  %20 : ClassType<Conv2d> = prim::GetAttr[name="10"](%1)
  %weight.5 : Tensor = prim::GetAttr[name="weight"](%20)
  %22 : Tensor = prim::GetAttr[name="bias"](%20)
  %26 : ClassType<Sequential> = prim::GetAttr[name="c

Our traced network is now ready to be written to disk. For that we use the save() function of the jit module. It works the same way as the conventional torch.save() function.

In [7]:
# Save the traced module to file
jit.save(traced_net, 'traced_model.pth')
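For comparison with the conventional methods, here is a minimal sketch of loading a traced module again. It assumes a hypothetical `TinyNet` in place of AlexNet and an in-memory buffer in place of the file. The class is deleted before loading to show that the saved graph does not depend on it.

```python
import io

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for AlexNet, only for illustration."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return torch.relu(self.fc(x))


x = torch.randn(1, 4)
model = TinyNet().eval()
traced = torch.jit.trace(model, x)
expected = traced(x)

# Save the traced module (in-memory buffer instead of a file)
buffer = io.BytesIO()
torch.jit.save(traced, buffer)
buffer.seek(0)

# Remove the Python class and the original objects
del TinyNet, model, traced

# Loading works without the class definition
loaded = torch.jit.load(buffer)
outputs_match = torch.allclose(loaded(x), expected)
print(outputs_match)
```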

# That's it folks!

We have saved our network. Now let's head over to a fresh notebook where we can load and test it out. (Open the notebook "Deployment")

# Boost Performance

Let us have a look at the runtime of our models. First we will run our untraced network and then the one traced by the JIT.

In [8]:
# Measure runtime for untraced network
untraced_runs = []
x = torch.randn(1, 3, 224, 224)
for _ in range(10):
    with torch.autograd.profiler.profile() as profile:
        for _ in range(10):
            net(x)
    untraced_runs.append(profile.self_cpu_time_total / 1000)
print('%.0fms mean with %.0fms std for 10 runs with 10 loops each' % (np.mean(untraced_runs), np.std(untraced_runs)))

997ms mean with 255ms std for 10 runs with 10 loops each


In [9]:
# Measure runtime for traced network
traced_runs = []
x = torch.randn(1, 3, 224, 224)
for _ in range(10):
    with torch.autograd.profiler.profile() as profile:
        for _ in range(10):
            traced_net(x)
    traced_runs.append(profile.self_cpu_time_total / 1000)
print('%.0fms mean with %.0fms std for 10 runs with 10 loops each' % (np.mean(traced_runs), np.std(traced_runs)))

796ms mean with 97ms std for 10 runs with 10 loops each
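The measurements above show a large spread. One possible variation, not used for the numbers in this talk, is to add a few warm-up calls and disable gradient tracking before timing. The sketch below assumes a hypothetical helper `time_runs` and a tiny stand-in network. It measures wall-clock time with `time.perf_counter` instead of the autograd profiler.

```python
import statistics
import time

import torch
import torch.nn as nn


def time_runs(fn, x, runs=5, loops=10, warmup=3):
    """Hypothetical helper: wall-clock time per run in milliseconds."""
    timings = []
    with torch.no_grad():
        # Warm-up calls are not timed
        for _ in range(warmup):
            fn(x)
        for _ in range(runs):
            start = time.perf_counter()
            for _ in range(loops):
                fn(x)
            timings.append((time.perf_counter() - start) * 1000)
    return timings


# Tiny stand-in network, only for illustration
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU()).eval()
x = torch.randn(1, 8)
traced = torch.jit.trace(model, x)

eager_ms = time_runs(model, x)
traced_ms = time_runs(traced, x)
print('Eager:  %.3fms +- %.3fms' % (statistics.mean(eager_ms), statistics.stdev(eager_ms)))
print('Traced: %.3fms +- %.3fms' % (statistics.mean(traced_ms), statistics.stdev(traced_ms)))
```

On a GPU, the timed region would additionally need a `torch.cuda.synchronize()` before reading the clock, because CUDA kernels run asynchronously.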


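It seems that our traced network is slightly faster. An ANOVA on the runs can tell us if the difference is statistically significant. In the run shown below, the p-value is about 0.04, just under the usual 0.05 threshold. With standard deviations this large, that is weak evidence for a real speedup.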

In [10]:
# Calculate if runtime difference is statistically significant
f_oneway(untraced_runs, traced_runs)

F_onewayResult(statistic=4.917216861649398, pvalue=0.039699343550967055)
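As a side note on the test itself: with only two groups, a one-way ANOVA is equivalent to a two-sample t-test with equal variances, so the F statistic is the squared t statistic and the p-values agree. A small sketch on synthetic data (the numbers are made up and are not the measurements from this talk):

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Synthetic runtimes in ms, made up for illustration only
rng = np.random.default_rng(0)
group_a = rng.normal(loc=1000, scale=250, size=10)
group_b = rng.normal(loc=800, scale=100, size=10)

anova = f_oneway(group_a, group_b)
ttest = ttest_ind(group_a, group_b)  # equal variances assumed by default

# For two groups: F == t**2 and both tests give the same p-value
print(np.isclose(anova.statistic, ttest.statistic ** 2))
print(np.isclose(anova.pvalue, ttest.pvalue))
```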

But this was only on the CPU. What about performance on the GPU? As we have no GPU available here, we will import the timeit runs from a machine equipped with a GTX 1080Ti. This time the ANOVA shows a p-value of about 1e-8, which is significant by all accounts. Unfortunately, the traced network is only about 1.5 ms faster on average, which is quite a small improvement. For better illustration: the untraced network can process 53 images per second, while the traced one can do 58.

In [11]:
# Load and compare runtime of AlexNet on GPU
with open('torchvision_timings.pkl', mode='rb') as f:
    time_dict = pickle.load(f)
print(f_oneway(time_dict['alexnet']['untraced'], time_dict['alexnet']['traced']))
print('Untraced: %.0fms +- %.2fms' % (np.mean(time_dict['alexnet']['untraced']), np.std(time_dict['alexnet']['untraced'])))
print('Traced: %.0fms +- %.2fms' % (np.mean(time_dict['alexnet']['traced']), np.std(time_dict['alexnet']['traced'])))

F_onewayResult(statistic=99.06722724800497, pvalue=9.598289543000189e-09)
Untraced: 19ms +- 0.42ms
Traced: 17ms +- 0.03ms
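The relation between throughput and latency is simple arithmetic: at batch size one, the time per image in milliseconds is 1000 divided by the images per second. A short sketch that converts the two throughput figures quoted above (the helper `latency_ms` is a hypothetical name):

```python
def latency_ms(images_per_second):
    """Hypothetical helper: time per image in ms at batch size one."""
    return 1000.0 / images_per_second


untraced_ms = latency_ms(53)
traced_ms = latency_ms(58)
difference_ms = untraced_ms - traced_ms

print('Untraced: %.1fms per image' % untraced_ms)   # 18.9ms
print('Traced:   %.1fms per image' % traced_ms)     # 17.2ms
print('Difference: %.1fms' % difference_ms)         # 1.6ms
```

The result is close to the roughly 1.5 ms difference stated above.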


The last promise is theoretically fulfilled for this network. No clear speedup is measurable for our simple network on the CPU, and only a small one on the GPU. We will look at some advanced uses of the JIT next and see if our findings hold up there.

# Performance of TorchVision Networks

![Runtime comparisons of Torchvision networks](./torchvision_plot.png)

For the relatively simple AlexNet we now know what we can expect from tracing it. What about more complex networks? For this we will repeat the process above for each network in torchvision. The JIT will warn us that the outputs of some traced networks do not match those of the untraced ones. As we are using untrained versions of the networks, we will ignore this warning for now. Again, we will have a look at the results from a machine with a GPU.

As we can see, there is next to no difference in mean execution time for any of the networks. While most networks report significant differences in execution time, the absolute difference is, again, orders of magnitude smaller than the execution time itself. Only the googlenet architecture seems faster when traced. Looking at the source code, I could not make out anything that sets this network apart from the others.

# Boost Performance 😐

# The Promises

### 1. Minimize Dependencies ✅
### 2. Hide Code 😐
### 3. Boost Performance 😐

# Tips for Migrating Your Model Code

* Avoid non-traceable code when possible
    * Tracing is easy, scripting is hard
    * Most networks are feed forward in inference anyway

* Convert module-by-module for debugging
    * Only script the needed modules
    * Tracing and scripting can be mixed

* Include your interface in the graph
    * Define a general interface for your deployment code
    * Include bridging code in the graph (e.g. input normalization)
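The tips above can be sketched in a few lines. The example assumes a hypothetical `Deployable` wrapper: its feed-forward body is traced, a small module with data-dependent control flow (`Gate`) is scripted, and the input normalization lives inside the graph. All names and the normalization constants are made up for illustration.

```python
import torch
import torch.nn as nn


class Gate(nn.Module):
    """Data-dependent control flow: tracing would record only one branch."""

    def forward(self, x):
        if x.sum() > 0:
            out = x
        else:
            out = -x
        return out


class Deployable(nn.Module):
    """Hypothetical wrapper that bundles normalization, body and gate."""

    def __init__(self):
        super().__init__()
        # Bridging code: normalization constants are stored with the model
        self.register_buffer("mean", torch.tensor([0.5]))
        self.register_buffer("std", torch.tensor([0.25]))
        # Feed-forward part: tracing is enough
        self.body = torch.jit.trace(nn.Linear(3, 3).eval(), torch.randn(1, 3))
        # Only the module with control flow is scripted
        self.gate = torch.jit.script(Gate())

    def forward(self, x):
        x = (x - self.mean) / self.std
        return self.gate(self.body(x))


# Script the wrapper; the traced and scripted submodules are reused
deployable = torch.jit.script(Deployable())
out = deployable(torch.ones(2, 3))
print(out.shape)
```

The deployment code then only has to pass raw inputs to a single module, regardless of how the parts inside were converted.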

## This Talk on GitHub: _github.com/tilman151/pytorch_jit_pycon19_