[quant] QuantizedCUDA #30813

@alnfedorov

Description

🚀 Feature

  1. Introduction of the QuantizedCUDA backend (similar to the existing QuantizedCPU backend).
  2. Implementation of the PER_TENSOR_AFFINE QScheme for QuantizedCUDA tensors (other schemes can be implemented later).

End users should be able to quantize (to qint8, quint8, or qint32) and dequantize CUDA tensors with a float16, float32, or float64 dtype.

Motivation

General necessity:
The latest generation of NVIDIA GPUs (Turing) includes support for integer Tensor Cores (TC). Besides cheap and fast inference, there are several ways we could make use of them:

  1. Eventually, we could replace fake quantization during training with a fully quantized forward pass to reduce possible training-inference biases and, perhaps, speed things up (see the sketch after this list).
  2. This could stimulate researchers to investigate the limits of cheap and dirty, fully quantized training (backward + forward).
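
For reference, fake quantization performs a quantize-dequantize round trip in float, so downstream ops still run in float and only the rounding error is simulated. A minimal sketch of what a fully quantized forward pass would replace, assuming torch.fake_quantize_per_tensor_affine is available in your build:

import torch

x = torch.rand(10)

# Fake quantization: simulate qint8 rounding while keeping float32 values,
# i.e. clamp(round(x / scale) + zero_point, -128, 127), then dequantize back.
x_fq = torch.fake_quantize_per_tensor_affine(x, 0.01, 0, -128, 127)

# A fully quantized forward pass would instead keep the integer values and
# run the ops on them directly (e.g., via integer Tensor Cores).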

Either way, the first step toward making use of the Tensor Cores is to introduce quantized CUDA tensors.

My use case:
So far, I have implemented TVM modules for common nn layers (conv, pooling) that use the Tensor Cores, for both the forward and backward passes. Unfortunately, they are not simple to use with the PyTorch framework:

  1. Autograd makes many assumptions about gradients (enforcing a float dtype, etc.), so I had to remove several of them.
  2. I have to store quantization parameters for integer tensors somewhere.

The first problem should be considered elsewhere (it is out of the scope of this PR).
None of the hacks that came to my mind solves the second problem (see Alternatives), so it seems that I need a new backend with simple functionality: passing tensors around and quantizing/dequantizing them. We could add more support later.

Pitch

Here is the code:

import torch

t = torch.rand(10, device='cuda')
print(t)
# tensor([0.6088, 0.3496, 0.3973, 0.0884, 0.5340, 0.9819, 0.5057, 0.2072, 0.6677,
#         0.6197], device='cuda:0')
t = torch.quantize_per_tensor(t, 0.01, 0, torch.qint8)
print(t)
# tensor([0.6100, 0.3500, 0.4000, 0.0900, 0.5300, 0.9800, 0.5100, 0.2100, 0.6700,
#         0.6200], size=(10,), dtype=torch.qint8, device='cuda:0',
#        quantization_scheme=torch.per_tensor_affine, scale=0.01, zero_point=0)
t = t.dequantize()
print(t)
# tensor([0.6100, 0.3500, 0.4000, 0.0900, 0.5300, 0.9800, 0.5100, 0.2100, 0.6700,
#         0.6200], device='cuda:0')

This should work without errors.
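
For reference, here is roughly what PER_TENSOR_AFFINE quantization computes with a single scale and zero point per tensor; a manual sketch of the math, not the proposed implementation:

import torch

t = torch.rand(10, device='cuda')
scale, zero_point = 0.01, 0

# Quantize: q = clamp(round(x / scale) + zero_point, qmin, qmax);
# for qint8, qmin = -128 and qmax = 127.
q = torch.clamp(torch.round(t / scale) + zero_point, -128, 127).to(torch.int8)

# Dequantize: x ≈ (q - zero_point) * scale
x = (q.to(torch.float32) - zero_point) * scale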

Alternatives

If end users wish to simulate a QuantizedCUDA backend, they have to store each tensor's quantization parameters somewhere, and there is no reliable way to do so, especially given autograd's backward strategy. A tensor's hash is not reliable, because different views of the same memory have different hashes. One can use data_ptr() to simulate a hash, yet that leads to errors when PyTorch's caching allocator reuses memory. I didn't come up with any other simple hack, so adding a QuantizedCUDA backend looks more fruitful.
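
For illustration, a data_ptr-keyed side table might look like the sketch below (the helper names are hypothetical); it breaks as soon as the caching allocator hands a freed address to an unrelated tensor:

import torch

# Hypothetical side table: data_ptr() -> (scale, zero_point)
_qparams = {}

def register_qparams(t, scale, zero_point):
    _qparams[t.data_ptr()] = (scale, zero_point)

def lookup_qparams(t):
    # Fragile: after t is freed, PyTorch's caching allocator may reuse the
    # same address for a new tensor, so this returns stale quantization
    # parameters instead of failing loudly.
    return _qparams.get(t.data_ptr())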

Additional context

I have started to dig into the QuantizedCPU PRs; it looks like I will need to go the same way. I would highly appreciate any comments and suggestions, though; ideally, just a list of files to update to get things working :).

Edit: revised for clarity.

cc @jerryzh168 @jianyuh @dzhulgakov @raghuramank100 @jamesr66a
