[quant] QuantizedCUDA #30813

@alnfedorov

Description

🚀 Feature

  1. Introduction of the QuantizedCUDA backend (similar to the existing QuantizedCPU backend).
  2. Implementation of the PER_TENSOR_AFFINE QScheme for QuantizedCUDA tensors (other schemes can be implemented later).

End users should be able to quantize (to qint8, quint8, or qint32) and dequantize CUDA tensors with a float16, float32, or float64 dtype.

Motivation

General necessity:
The latest generation of NVIDIA GPUs (Turing) includes support for integer Tensor Cores (TC). Besides cheap and fast inference, there are several ways we could make use of them:

  1. Eventually, we could replace fake quantization during training with a fully quantized forward pass to reduce possible training-inference biases and, perhaps, speed things up (see the sketch after this list).
  2. This could stimulate researchers to investigate the limits of cheap and dirty, fully quantized training (backward + forward).
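
For reference, fake quantization performs a quantize-dequantize round trip in float, so downstream ops still run in float and only the rounding error is simulated. A minimal sketch of what a fully quantized forward pass would replace, assuming torch.fake_quantize_per_tensor_affine is available in your build:

import torch

x = torch.rand(10)

# Fake quantization: simulate qint8 rounding while keeping float32 values,
# i.e. clamp(round(x / scale) + zero_point, -128, 127), then dequantize back.
x_fq = torch.fake_quantize_per_tensor_affine(x, 0.01, 0, -128, 127)

# A fully quantized forward pass would instead keep the integer values and
# run the ops on them directly (e.g., via integer Tensor Cores).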

Either way, the first step toward making use of the Tensor Cores is to introduce quantized CUDA tensors.

My use case:
So far, I have implemented TVM modules for common nn layers (conv, pooling) that use the Tensor Cores, for both the forward and backward passes. Unfortunately, they are not simple to use with the PyTorch framework:

  1. Autograd makes many assumptions about gradients (enforcing a float dtype, etc.), so I had to remove several of them.
  2. I have to store quantization parameters for integer tensors somewhere.

The first problem should be considered elsewhere (it is out of the scope of this PR).
None of the hacks that came to my mind solves the second problem (see Alternatives), so it seems that I need a new backend with simple functionality: passing tensors around and quantizing/dequantizing them. We could add more support later.

Pitch

Here is the code:

import torch

t = torch.rand(10, device='cuda')
print(t)
# tensor([0.6088, 0.3496, 0.3973, 0.0884, 0.5340, 0.9819, 0.5057, 0.2072, 0.6677,
#         0.6197], device='cuda:0')
t = torch.quantize_per_tensor(t, 0.01, 0, torch.qint8)
print(t)
# tensor([0.6100, 0.3500, 0.4000, 0.0900, 0.5300, 0.9800, 0.5100, 0.2100, 0.6700,
#         0.6200], size=(10,), dtype=torch.qint8, device='cuda:0',
#        quantization_scheme=torch.per_tensor_affine, scale=0.01, zero_point=0)
t = t.dequantize()
print(t)
# tensor([0.6100, 0.3500, 0.4000, 0.0900, 0.5300, 0.9800, 0.5100, 0.2100, 0.6700,
#         0.6200], device='cuda:0')

This should work without errors.
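
For reference, here is roughly what PER_TENSOR_AFFINE quantization computes with a single scale and zero point per tensor; a manual sketch of the math, not the proposed implementation:

import torch

t = torch.rand(10, device='cuda')
scale, zero_point = 0.01, 0

# Quantize: q = clamp(round(x / scale) + zero_point, qmin, qmax);
# for qint8, qmin = -128 and qmax = 127.
q = torch.clamp(torch.round(t / scale) + zero_point, -128, 127).to(torch.int8)

# Dequantize: x ≈ (q - zero_point) * scale
x = (q.to(torch.float32) - zero_point) * scale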

Alternatives

If end users wish to simulate a QuantizedCUDA backend, they have to store each tensor's quantization parameters somewhere, and there is no reliable way to do so, especially given autograd's backward strategy. A tensor's hash is not reliable, because different views of the same memory have different hashes. One can use data_ptr() to simulate a hash, yet that leads to errors when PyTorch's caching allocator reuses memory. I didn't come up with any other simple hack, so adding a QuantizedCUDA backend looks more fruitful.
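
For illustration, a data_ptr-keyed side table might look like the sketch below (the helper names are hypothetical); it breaks as soon as the caching allocator hands a freed address to an unrelated tensor:

import torch

# Hypothetical side table: data_ptr() -> (scale, zero_point)
_qparams = {}

def register_qparams(t, scale, zero_point):
    _qparams[t.data_ptr()] = (scale, zero_point)

def lookup_qparams(t):
    # Fragile: after t is freed, PyTorch's caching allocator may reuse the
    # same address for a new tensor, so this returns stale quantization
    # parameters instead of failing loudly.
    return _qparams.get(t.data_ptr())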

Additional context

I have started to dig into the QuantizedCPU PRs; it looks like I will need to go the same way. I would highly appreciate any comments and suggestions, though; ideally, just a list of files to update to get things working :).

Edit: revised for clarity.

cc @jerryzh168 @jianyuh @dzhulgakov @raghuramank100 @jamesr66a
