
Please follow the introduction of the TVM tutorial before running this. The code below assumes you have already set up TVM, and merely loads it from your Google Drive.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:

try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

if IN_COLAB:
    ! gsutil cp "gs://tvm-fcrc-binariesd5fce43e-8373-11e9-bfb6-0242ac1c0002/tvm.tar.gz" /tmp/tvm.tar.gz
    ! mkdir -p /tvm
    ! tar -xf /tmp/tvm.tar.gz --strip-components=4 --directory /tvm
    ! ls -la /tvm
    ! bash /tvm/package.sh
    # Add TVM to the Python path.
    import sys
    sys.path.append('/tvm/python')
    sys.path.append('/tvm/topi/python')
else:
    print("Notebook executing locally, skipping Colab setup ...")

Copying gs://tvm-fcrc-binariesd5fce43e-8373-11e9-bfb6-0242ac1c0002/tvm.tar.gz...
| [1 files][119.5 MiB/119.5 MiB]                                                
Operation completed over 1 objects/119.5 MiB.                                    
total 164
drwxr-xr-x 21 root root  4096 Jun 21 04:43 .
drwxr-xr-x  1 root root  4096 Jun 21 04:43 ..
drwx------  8 root root  4096 May 31 08:14 3rdparty
drwx------ 12 root root  4096 May 31 08:14 apps
drwx------  3 root root  4096 Jun 19 07:58 build
drwx------  4 root root  4096 May 31 08:14 cmake
-rw-------  1 root root 11053 Jun 19 04:54 CMakeLists.txt
drwx------  6 root root  4096 May 31 08:14 conda
-rw-------  1 root root  5736 Jun 19 04:54 CONTRIBUTORS.md
drwx------  3 root root  4096 May 31 08:14 docker
drwx------ 11 root root  4096 May 31 08:14 docs
drwx------  4 root root  4096 May 31 08:14 golang
drwx------  3 root root  4096 May 31 08:14 include
-rw-------  1 root root 10607 Jun 19 04:54 Jenkinsfile
drwx------  6 root root  4096 May 31 

# **Ultra Low Precision Operators**

This tutorial shows how TVM can be used to define new operators, specifically the ultra low precision operators used in networks like XNOR-Net and DoReFa-Net, which compute on activations and weights quantized to a few bits.
The first half shows how to write a simple low precision dot product operator to demonstrate the basic concepts of bitserial computation. The second half shows how to call TVM's ultra low precision operators in TOPI.

In [0]:
import tvm
import numpy as np
from topi.transform import concatenate

# Ultra Low Precision Dot Product
**Step 1: Bitpacking:**
Efficient ultra low precision operators compute data *bit-serially*, processing each bit-position one at a time, on data that is *bitpacked*. The input vectors must be separated into *bitplanes* that represent the binary value of each bit-position of the inputs. Each bitplane is then packed into an integer.

For this tutorial we assume that the data has already been quantized to the desired precision and is in integer format. Different networks employ different quantization schemes that map floating point values to low-bit integers.
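
As a rough illustration of what such a scheme does, the sketch below uniformly quantizes floating point values to `bits`-bit unsigned integers. The helper name `quantize_uniform` and the assumed input range of [0, 1] are choices made for this example only; this is not the scheme used by XNOR-Net or DoReFa-Net.

```python
def quantize_uniform(values, bits):
    """Map floats in [0, 1] to integers in [0, 2**bits - 1] (illustrative only)."""
    levels = 2 ** bits - 1
    quantized = []
    for v in values:
        v = min(max(v, 0.0), 1.0)              # clamp to the assumed input range
        quantized.append(int(round(v * levels)))
    return quantized

activations = [0.0, 0.3, 0.6, 0.9, 1.0]
print(quantize_uniform(activations, bits=2))   # [0, 1, 2, 3, 3]
```

The resulting integers are the kind of input the rest of this tutorial starts from.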

In [4]:
# Ultra low precision dot product between a 1-bit vector (a) and a 2-bit vector (b)
K = 8
input_shape = (K,)
a_bits = 1
b_bits = 2
input_dtype="uint8"
assert(K%8 == 0)

# TVM parameters to generate code for CPU
target = 'llvm'
ctx = tvm.context(target, 0)


# Creating two ultra low precision vectors a and b.
a = tvm.nd.array(np.random.randint(0, 2**a_bits, input_shape).astype(input_dtype), ctx)
b = tvm.nd.array(np.random.randint(0, 2**b_bits, input_shape).astype(input_dtype), ctx)

a_dot_b = np.dot(a.asnumpy(), b.asnumpy())

print("a:", a)
print("b:", b)
print("dot product:", a_dot_b)

a: [1 0 0 1 0 0 0 1]
b: [2 2 3 1 3 1 0 0]
dot product: 3


Bitpacking a 1-bit vector is simply packing the single bit elements into an integer.

![Bitpacking 1-bit vector](https://docs.google.com/uc?export=download&id=1pYbyxNlvx-QYq1y8UxSRKimukwLPRXwe)



In [5]:
# Bitpacking efficiently stores low precision data in a packed data type.
# Here we pick uint8 because our input vectors have only 8 elements.
pack_type = 'uint8'
pack_size = 8

# Since vector a contains 1-bit data, it is simple to bitpack:
# the bits of vector a are just packed into a single uint8.
print("a:", a)
print("a bitpacked:", np.packbits(a.asnumpy()))


a: [1 0 0 1 0 0 0 1]
a bitpacked: [145]
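
`np.packbits` places the first element in the most significant bit by default. A minimal pure-Python sketch of the same packing makes the bit order explicit; the helper name `pack_bits_msb_first` is introduced here for illustration.

```python
def pack_bits_msb_first(bits):
    """Pack a list of 0/1 values into one integer, first element in the highest bit."""
    packed = 0
    for bit in bits:
        packed = (packed << 1) | (bit & 1)
    return packed

a_values = [1, 0, 0, 1, 0, 0, 0, 1]        # the vector a printed above
packed = pack_bits_msb_first(a_values)
print(packed, format(packed, '08b'))       # 145 10010001
```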


To bitpack a multi-bit vector, first separate the data into bitplanes and then pack the bitplanes.
![Bitpacking 2-bit vector](https://docs.google.com/uc?export=download&id=1hla9JcjnOL6ZgmatyeRW3Z1zVWnAhUHF)

In [7]:
# To bitpack b, first separate it into bitplanes representing the values of each bit position.
# Vector b holds 2-bit data, so there are two bitplanes.
# Each bitplane of vector b is then packed into a uint8.
b_bit0 = b.asnumpy() & 0x1
b_bit1 = (b.asnumpy() & 0x2) >> 1
print("b:", b)
print("b bitpacked", np.packbits(b_bit0), np.packbits(b_bit1))

b: [2 2 3 1 3 1 0 0]
b bitpacked [60] [232]


In [0]:
# Here is how to express a flexible bitpacking routine in TVM.
# Unlike most TVM operators, which express one-to-one or many-to-one functions,
# bitpacking is a many-to-many function.

def bitpack(data, bits, name):
    bitplane_shape = (K // pack_size, 1)
    masks = np.array([0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80])

    def _bitpack(i, j):
        bitplane = [tvm.const(0, pack_type)] * bits
        
        # Iterate over elements that are being packed
        for k in range(pack_size):
            element = data[i*pack_size + k]
            
            # Extract each bit of the element and pack it into a separate element
            for b in range(bits):
                extracted_bit = ((element & tvm.const(masks[b], pack_type)) >> b).astype(pack_type)
                bitplane[b] = (bitplane[b] | extracted_bit)
                if k < pack_size - 1:
                    bitplane[b] = bitplane[b] << 1

            if k == pack_size - 1:
                return tuple(bitplane)

    output_tuple = tvm.compute(bitplane_shape, _bitpack, name=name)

    # If we have more than one bit, combine the bitplanes with concatenate
    if bits > 1:
        return concatenate(output_tuple, axis=1).astype(pack_type)
    else:
        return output_tuple


In [13]:
# Schedule and call packing vector b
# Declaring inputs and outputs
B = tvm.placeholder(b.shape, dtype=input_dtype, name='B')
BPacked = bitpack(B, b_bits, "PackedB")

s = tvm.create_schedule(BPacked.op)
f = tvm.build(s, [B, BPacked], target=target)
b_packed = tvm.nd.array(np.zeros((K//pack_size, b_bits), dtype = pack_type), ctx)
f(b, b_packed)

print("b:          ", b)
print("bits 0 of b:", b.asnumpy()&0x1)
print("bits 1 of b:", (b.asnumpy()&0x2) >> 1)
print("bitpacked b:", b_packed)

b:           [2 2 3 1 3 1 0 0]
bits 0 of b: [0 0 1 1 1 1 0 0]
bits 1 of b: [1 1 1 0 1 0 0 0]
bitpacked b: [[ 60 232]]
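
To cross-check the TVM routine on longer vectors, a plain-Python reference can produce the same `(K // pack_size, bits)` layout. This is only a sketch: `bitpack_reference` is a name introduced here, not a TVM function, and it assumes the same MSB-first packing order as the routine above.

```python
def bitpack_reference(data, bits, pack_size=8):
    """Return one row per group of pack_size elements.

    row[b] holds bitplane b of that group, first element in the highest bit.
    """
    assert len(data) % pack_size == 0
    rows = []
    for start in range(0, len(data), pack_size):
        group = data[start:start + pack_size]
        row = []
        for b in range(bits):
            word = 0
            for element in group:
                # Extract bit b of the element and append it to the packed word
                word = (word << 1) | ((element >> b) & 1)
            row.append(word)
        rows.append(row)
    return rows

b_values = [2, 2, 3, 1, 3, 1, 0, 0]
print(bitpack_reference(b_values, bits=2))              # [[60, 232]]
print(bitpack_reference(b_values + b_values, bits=2))   # [[60, 232], [60, 232]]
```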


In [0]:
# Bitpack vector a with the same routine - just indicate that the number of bits is different
A = tvm.placeholder(a.shape, dtype=input_dtype, name='A')
APacked = bitpack(A, a_bits, "PackedA")

s = tvm.create_schedule(APacked.op)
f = tvm.build(s, [A, APacked], target=target)
a_packed = tvm.nd.array(np.zeros((K//pack_size, a_bits), dtype = pack_type), ctx)
f(a, a_packed)

**Step 2: Dot product:**
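
The bitserial dot product relies on a simple identity. If each element of a is written as the sum over i of `2**i * a_i` and each element of b as the sum over j of `2**j * b_j`, where `a_i` and `b_j` are single bits, then the dot product is the sum over all pairs (i, j) of `2**(i + j)` times the number of positions where bitplane i of a and bitplane j of b are both 1. On packed data, that count is `popcount(A_i & B_j)`.

A minimal pure-Python sketch of this identity follows. The helper names are illustrative, and the planes are stored as `[plane][word]` for readability, which differs from the `[word, plane]` layout of the TVM tensors.

```python
def popcount(x):
    """Count the set bits of a non-negative integer."""
    return bin(x).count("1")

def bitserial_dot(a_planes, b_planes):
    """a_planes[i] and b_planes[j] are lists of packed words for bitplane i and j."""
    total = 0
    for i, a_words in enumerate(a_planes):
        for j, b_words in enumerate(b_planes):
            for a_word, b_word in zip(a_words, b_words):
                # Matching set bits contribute 2**(i + j) each
                total += popcount(a_word & b_word) << (i + j)
    return total

a_planes = [[145]]           # 1-bit vector a, packed
b_planes = [[60], [232]]     # bitplanes 0 and 1 of vector b, packed
print(bitserial_dot(a_planes, b_planes))   # 3
```

With the packed values printed earlier, `popcount(145 & 60)` is 1 with shift 0 and `popcount(145 & 232)` is 1 with shift 1, giving 1 + 2 = 3.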

In [19]:
# Declare reduction axes
ab = tvm.reduce_axis((0, a_bits), name='ab')
bb = tvm.reduce_axis((0, b_bits), name='bb')
k = tvm.reduce_axis((0, K//pack_size), name='k')

out_dtype = 'int8'
out_shape = (1,)

C = tvm.compute(out_shape, lambda i: 
  tvm.sum(tvm.popcount(APacked[k, ab] & BPacked[k, bb]).astype(out_dtype) 
          << (ab+bb).astype(out_dtype), axis=[k, ab, bb]))


s = tvm.create_schedule(C.op)
f = tvm.build(s, [A, B, C], target=target)
c = tvm.nd.array(np.zeros(out_shape, dtype = out_dtype), ctx)
f(a, b, c)

print("Correct:", a_dot_b)
print("Calculated:", c)
np.testing.assert_allclose(c.asnumpy(), a_dot_b)

Correct: 3
Calculated: [3]


# TVM Low Precision Operators
We also provide a flexible bitpacking operator that accepts generic shapes and lets the user specify which axis to pack and where to place the new bitplane axis.
The low precision operators support a variety of configurations:
- Packing data types. For example uint8 or uint32.
- Output data types. For example int16 or int32.
- Bitserial dot product style.
- For 2D convolutions, NHWC and NCHW layouts.
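
To give a feel for what a different dot product style can mean, the sketch below reads 1-bit weights as +1 (bit set) or -1 (bit clear) instead of 1 or 0, while activations stay 0 or 1. This encoding is an assumption made for illustration; how it maps onto the `unipolar` argument used in the example that follows should be checked against the TOPI documentation.

```python
def popcount(x):
    """Count the set bits of a non-negative integer."""
    return bin(x).count("1")

def dot_plus_minus_one(act_word, weight_word, width=8):
    """Dot product of 0/1 activations with weights read as +1 (bit set) or -1 (bit clear)."""
    mask = (1 << width) - 1
    positive = popcount(act_word & weight_word)           # activations meeting a +1 weight
    negative = popcount(act_word & ~weight_word & mask)   # activations meeting a -1 weight
    return positive - negative

act = 0b10010001      # activations [1, 0, 0, 1, 0, 0, 0, 1]
wgt = 0b11101000      # weights     [+1, +1, +1, -1, +1, -1, -1, -1]
print(dot_plus_minus_one(act, wgt))   # 1 - 2 = -1
```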



In [20]:
import topi
import topi.testing
from topi.util import get_const_tuple


batch = 1
in_height = in_width = 56
in_channel = 64
num_filter = 64
kernel = 3
padding = 0
stride = (1, 1)
activation_bits = 2
weight_bits = 1
unipolar=False
input_dtype='uint32'
out_dtype='int32'

with tvm.target.create('llvm'):
    # Create input placeholders
    A = tvm.placeholder((batch, in_height, in_width, in_channel), dtype=input_dtype, name='A')
    W = tvm.placeholder((kernel, kernel, in_channel, num_filter), dtype=input_dtype, name='W')
    # Declare computation
    B = topi.nn.bitserial_conv2d_nhwc(A, W, stride, padding, activation_bits, weight_bits,
                                      out_dtype=out_dtype, unipolar=unipolar)
    # Schedule computation
    s = topi.generic.schedule_bitserial_conv2d_nhwc([B])

    
# Declare some random inputs
a_shape = get_const_tuple(A.shape)
w_shape = get_const_tuple(W.shape)
a_np = np.random.randint(0, 2**activation_bits, a_shape).astype(input_dtype)
w_np = np.random.randint(0, 2**weight_bits, w_shape).astype(input_dtype)

# Call the function with inputs
ctx = tvm.cpu(0)
a = tvm.nd.array(a_np, ctx)
w = tvm.nd.array(w_np, ctx)
b = tvm.nd.array(np.zeros(get_const_tuple(B.shape), dtype=B.dtype), ctx)
func = tvm.build(s, [A, W, B], 'llvm')

func(a, w, b)

Cannot find config for target=llvm, workload=('bitserial_conv2d_nhwc', (1, 56, 56, 64, 'uint32'), (3, 3, 64, 64, 'uint32'), (1, 1), 0, 2, 1, 'uint32', 'int32', False). A fallback configuration is used, which may bring great performance regression.
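
A quick sanity check on the result is its shape, which follows the usual convolution size formula. The sketch below uses a hypothetical helper, `conv2d_nhwc_out_shape`, to compute the expected NHWC output shape for the parameters above so it can be compared with `b.shape`.

```python
def conv2d_nhwc_out_shape(batch, in_height, in_width, num_filter, kernel, padding, stride):
    """Expected NHWC output shape for a square kernel with symmetric padding."""
    stride_h, stride_w = stride
    out_height = (in_height + 2 * padding - kernel) // stride_h + 1
    out_width = (in_width + 2 * padding - kernel) // stride_w + 1
    return (batch, out_height, out_width, num_filter)

print(conv2d_nhwc_out_shape(1, 56, 56, 64, kernel=3, padding=0, stride=(1, 1)))
# (1, 54, 54, 64)
```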
