## Quantisation in Transformer Models

![image.png](attachment:image.png)

## Integer and Floating-Point Precision Table


| Format                    | Bits | Bytes | Typical Range/Notes                            |
|---------------------------|------|-------|-----------------------------------------------|
| **Integer (unsigned 8-bit)**   | 8    | 1     | 0 to 255 (256 discrete values)               |
| **Integer (signed 8-bit)**     | 8    | 1     | −128 to 127                                  |
| **Integer (unsigned 16-bit)**  | 16   | 2     | 0 to 65,535 (65,536 discrete values)         |
| **Integer (signed 16-bit)**    | 16   | 2     | −32,768 to 32,767                            |
| **Integer (unsigned 32-bit)**  | 32   | 4     | 0 to 4,294,967,295                           |
| **Integer (signed 32-bit)**    | 32   | 4     | −2,147,483,648 to 2,147,483,647              |
| **Integer (unsigned 64-bit)**  | 64   | 8     | 0 to 18,446,744,073,709,551,615              |
| **Integer (signed 64-bit)**    | 64   | 8     | −9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
| **Floating-point (FP16)**      | 16   | 2     | 5 exponent bits, 10 fraction bits; ~3 decimal digits precision |
| **Floating-point (FP32)**      | 32   | 4     | 8 exponent bits, 23 fraction bits; ~7 decimal digits precision |
| **Floating-point (FP64)**      | 64   | 8     | 11 exponent bits, 52 fraction bits; ~15–16 decimal digits precision |

**Notes**:
- For unsigned integers, the range always starts at 0.
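- For signed integers (two's complement), half of the values are negative, so an n-bit type spans −2^(n−1) to 2^(n−1) − 1.
- Floating-point formats follow IEEE 754 in most implementations. The decimal-digit precision figures are approximate because the formats store binary, not decimal, fractions.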
- FP16 is also known as half-precision, FP32 as single-precision, and FP64 as double-precision.
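
The integer rows of the table follow directly from the bit width. As a minimal sketch (the helper name `int_range` is made up for illustration and is not part of PyTorch), the ranges can be recomputed like this:

```python
def int_range(bits: int, signed: bool) -> tuple[int, int]:
    """Return (min, max) for an integer type with the given bit width."""
    if signed:
        # Two's complement: one more negative value than positive values.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Unsigned: all 2**bits patterns encode non-negative values.
    return 0, 2 ** bits - 1


for bits in (8, 16, 32, 64):
    print(bits, int_range(bits, signed=False), int_range(bits, signed=True))
```
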



The following image illustrates quantisation:
![image.png](attachment:image.png)
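
The figure is not reproduced in text form, so the exact scheme it shows is unknown. As a rough sketch of one common approach, assuming symmetric linear ("absmax") quantisation of floats to signed 8-bit integers, the idea looks like this. The function names `quantize_absmax` and `dequantize` are illustrative, not library calls:

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers using one shared scale (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1                   # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the symmetric range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer levels."""
    return [qi * scale for qi in q]


weights = [1.27, -0.64, 0.0, 0.333]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
print(q)
print(restored)
```

Each restored value differs from the original by at most about half a quantisation step (`scale / 2`), which is the precision given up in exchange for storing one byte per value.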

In [1]:
import torch

In [18]:
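# Limits of the 8-bit and 64-bit unsigned integer types.
# A notebook cell echoes only its last expression, so only the uint64 result appears below.
torch.iinfo(torch.uint8)
torch.iinfo(torch.uint64)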

iinfo(min=0, max=1.84467e+19, dtype=uint64)

In [21]:
# Limits and precision of the 32-bit floating-point type (float32)
torch.finfo(torch.float32)

finfo(resolution=1e-06, min=-3.40282e+38, max=3.40282e+38, eps=1.19209e-07, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=float32)
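
The `eps`, `smallest_normal`, and `max` fields reported above follow from the IEEE 754 binary32 layout (1 sign bit, 8 exponent bits, 23 fraction bits). A small sketch in plain Python, with variable names chosen here for illustration, rederives them:

```python
# IEEE 754 binary32 layout: 1 sign bit, 8 exponent bits, 23 fraction bits.
FRACTION_BITS = 23
EXPONENT_BITS = 8
bias = 2 ** (EXPONENT_BITS - 1) - 1            # 127

eps = 2.0 ** -FRACTION_BITS                    # gap between 1.0 and the next float32
smallest_normal = 2.0 ** (1 - bias)            # 2^-126
largest = (2.0 - eps) * 2.0 ** bias            # (2 - 2^-23) * 2^127

print(f"eps={eps:.5e}, smallest_normal={smallest_normal:.5e}, max={largest:.5e}")
# prints: eps=1.19209e-07, smallest_normal=1.17549e-38, max=3.40282e+38
```
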

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

In [24]:
# 1000 random values drawn uniformly from [0, 1), stored as float32
tensor_fp32 = torch.rand(1000, dtype=torch.float32)
tensor_fp32[:10]

tensor([0.3869, 0.0937, 0.6786, 0.4006, 0.4614, 0.9926, 0.6237, 0.7446, 0.7571,
        0.9503])

In [27]:
# Downcast to bfloat16: same exponent range as float32, but only 7 fraction bits
tensor_bfloat16 = tensor_fp32.to(dtype=torch.bfloat16)
tensor_bfloat16[:10]

tensor([0.3867, 0.0938, 0.6797, 0.4004, 0.4609, 0.9922, 0.6250, 0.7461, 0.7578,
        0.9492], dtype=torch.bfloat16)

In [29]:
mfloat = torch.dot(tensor_fp32, tensor_fp32)
mfloat

tensor(332.9919)

In [30]:
bfloat_m = torch.dot(tensor_bfloat16, tensor_bfloat16)
bfloat_m

tensor(334., dtype=torch.bfloat16)
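
The gap between 332.9919 and 334 comes from how coarse bfloat16 is. It stores only 7 fraction bits, so between 256 and 512 the representable values are 2 apart. The sketch below emulates float32 to bfloat16 rounding in plain Python, assuming round-to-nearest-even and ignoring NaN handling. The helper `round_to_bfloat16` is illustrative, not a PyTorch function, and how `torch.dot` accumulates internally is not shown in this notebook.

```python
import struct


def round_to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even); NaN is not handled."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits; add a bias so that truncation rounds to nearest even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(round_to_bfloat16(0.3869))    # 0.38671875, displayed as 0.3867 above
print(round_to_bfloat16(332.9919))  # 332.0
print(round_to_bfloat16(333.5))     # 334.0
```

Near 333 the only candidates are 332 and 334, so the bfloat16 dot product cannot land closer to the float32 result than about 1, even before counting the rounding error already present in each input element.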

![image.png](attachment:image.png)