# Lesson 4: Quantization Theory

In this lab, you will perform Linear Quantization.

#### Libraries to install
- If you are running this notebook on your local machine, you can install the following:

```Python
!pip install transformers==4.35.0
!pip install quanto==0.0.11
!pip install torch==2.1.1
```

# Linear quantization

Linear quantization is the most popular quantization scheme so far, and it is used in most state-of-the-art quantization methods. You will then apply linear quantization to real models using Quanto, a Python quantization toolkit from Hugging Face.

# Quantization

Quantization refers to the process of mapping a large set of values to a smaller set of values.

### Example: 8-bit quantization of a simple tensor

We will go from FP32 values to INT8 values. The intuition behind linear quantization:

- Start with an original tensor in FP32 with 9 floats, and convert the FP32 weights to INT8 without losing too much information.
- The result is a quantized tensor in INT8, with values between -128 and 127. Map the extreme values first, then convert the other values proportionally.
- Keep the linear mapping parameters: s (the scale) and z (the zero point).
- Delete the original tensor, which occupied a lot of memory.

### Going back from the quantized tensor to FP32

You cannot get the exact numbers back, but you can get an approximation. Dequantization maps the quantized INT8 tensor (values between -128 and 127) back to an FP32 tensor by following the linear mapping defined previously. Quantization therefore results in a loss of information, but when you compare the original and the dequantized tensors, the dequantized FP32 values are pretty accurate.
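
The round trip described above can be sketched in a few lines of PyTorch. This is only an illustration: the tensor values are made up, and the scale and zero point use the formulas given in the "scale and zero point" notes later in this lesson.

```python
import torch

# A small FP32 tensor with 9 values (illustrative numbers, not from the lesson).
original = torch.tensor(
    [[191.6, -13.5, 728.6],
     [92.14, 295.5, -184.0],
     [0.0, 684.6, 245.5]],
    dtype=torch.float32,
)

q_min, q_max = -128, 127  # INT8 range
r_min, r_max = original.min().item(), original.max().item()

# Linear mapping parameters: scale s and zero point z.
scale = (r_max - r_min) / (q_max - q_min)
zero_point = int(round(q_min - r_min / scale))

# Quantize: q = round(r / s + z), clamped to the INT8 range.
quantized = torch.clamp(
    torch.round(original / scale + zero_point), q_min, q_max
).to(torch.int8)

# Dequantize: r ≈ s * (q - z). The result is an approximation of the original.
dequantized = scale * (quantized.to(torch.float32) - zero_point)

max_abs_error = (original - dequantized).abs().max().item()
print(quantized)
print(dequantized)
print(f"max absolute error: {max_abs_error:.3f}")
```

When no value is clipped, the error of each element stays within about half of the scale, which is why the dequantized tensor looks close to the original.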

The dequantized tensor still carries some quantization error. Even though linear quantization looks very simple, it is used in many SOTA quantization methods:

- AWQ: Activation-aware Weight Quantization
- GPTQ: GPT Quantized
- BNB: bitsandbytes quantization

# What is a model parameter?

- Each layer has weights, which are usually stored in matrices.
- In FP32, each number takes 32 bits: 1 sign bit, 8 exponent bits, and 23 fraction bits.
- It is possible to inspect each parameter's data type, for example the data type of a layer's weights.
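
To make the 1 + 8 + 23 bit layout concrete, here is a small standard-library sketch that splits a number stored as FP32 into its three fields. The helper name `fp32_fields` is hypothetical and only used for this illustration.

```python
import struct

def fp32_fields(value: float) -> tuple[int, int, int]:
    """Split a number, stored as FP32, into its sign, exponent, and fraction fields."""
    # Pack as a big-endian 32-bit float, then reinterpret the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits (stored with a bias of 127)
    fraction = bits & 0x7FFFFF      # 23 bits
    return sign, exponent, fraction

sign, exponent, fraction = fp32_fields(-0.15625)
print(f"sign={sign:01b} exponent={exponent:08b} fraction={fraction:023b}")
```

Every FP32 weight in a model occupies these 32 bits, which is where the 4 bytes per parameter used in the memory estimates come from.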

In [None]:
# !pip install transformers==4.35.0
# !pip install quanto==0.0.11
# !pip install torch==2.1.1

# If you run into an error, restart the runtime and skip these installs.

## T5-FLAN
- Please note that due to hardware memory constraints, and in order to offer this course for free to everyone, the code you'll run here is for the T5-FLAN model instead of the EleutherAI Pythia model.  
- Thank you for your understanding! 🤗

For the T5-FLAN model, here is one more library to install if you are running locally:
```Python
!pip install sentencepiece==0.2.0
```


In [None]:
# !pip install sentencepiece==0.2.0

### Without Quantization

In [None]:
model_name = "google/flan-t5-small"

In [None]:
# !pip install --upgrade transformers huggingface_hub


In [None]:
import sentencepiece as spm
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [None]:
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")


In [None]:
input_text = "Hello, my name is "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

<pad> annie scott</s>




In [None]:
from helper_4 import compute_module_sizes
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")

The model size is 0.307844608 GB


## Quantize the model (8-bit precision)

In [None]:
# Quanto is Hugging Face's quantization toolkit; quantize and freeze are the two functions used here.
from quanto import quantize, freeze
import torch

In [None]:
quantize(model, weights=torch.int8, activations=None)

In [None]:
print(model)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): QLinear(in_features=512, out_features=384, bias=False)
              (k): QLinear(in_features=512, out_features=384, bias=False)
              (v): QLinear(in_features=512, out_features=384, bias=False)
              (o): QLinear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): QLinear(in_features=512, out_features=1024, bias=False)
              (wi_1): QLinear(in_features=512, out_features=1024, bias=False)
              

### Freeze the model
- This step takes a bit of memory, and so for the Pythia model that is shown in the lecture video, it will not run in the classroom.
- This will work fine with the smaller T5-Flan model.

In [None]:
freeze(model)

In [None]:
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")

The model size is 0.12682868 GB


### Try running inference on the quantized model

In [24]:
input_text = "Hello, my name is "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Hello, my name is 
<somsip>!ask | james_
<


## Note: Quantizing the model used in the lecture video will not work due to classroom hardware limitations.
- Here is the code that Marc, the instructor, is walking through.  
- It will likely run on your local computer if you have 8GB of memory, which is usually the minimum for personal computers.
  - To run locally, you can download the notebook and the helper.py file by clicking on the "Jupyter icon" at the top of the notebook and navigating the file directory of this classroom.  Also download the requirements.txt to install all the required libraries.

### Without Quantization



- Load [EleutherAI/pythia-410m](https://huggingface.co/EleutherAI/pythia-410m) model and tokenizer.

```Python
from transformers import AutoModelForCausalLM
model_name = "EleutherAI/pythia-410m"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             low_cpu_mem_usage=True)
print(model.gpt_neox)


from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

- Write a start of a (`text`) sentence which you'd like the model to complete.
```Python
text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
outputs
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

- Compute the model's size using the helper function, `compute_module_sizes`.
```Python
from helper import compute_module_sizes
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")
print(model.gpt_neox.layers[0].attention.dense.weight)
```
**Note:** The weights are in `fp32`.

### 8-bit Quantization

```Python
from quanto import quantize, freeze
import torch

quantize(model, weights=torch.int8, activations=None)
# after performing quantization
print(model.gpt_neox)
print(model.gpt_neox.layers[0].attention.dense.weight)
```

- The "freeze" function requires more memory than is available in this classroom.
- This code will run on a machine that has 8GB of memory, and so it will likely work if you run this code on your local machine.

```Python
# freeze the model
freeze(model)
print(model.gpt_neox.layers[0].attention.dense.weight)

# get model size after quantization
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")

# run inference after quantizing the model
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

#### Comparing "linear quantization" to "downcasting"

To recap the difference between the "linear quantization" method in this lesson with the "downcasting" method in the previous lesson:

- When downcasting a model, you convert the model's parameters to a more compact data type (bfloat16).  During inference, the model performs its calculations in this data type, and its activations are in this data type.  Downcasting may work with the bfloat16 data type, but the model performance will likely degrade with any smaller data type, and won't work if you convert to an integer data type (like the int8 in this lesson).


- In this lesson, you used another quantization method, "linear quantization", which enables the quantized model to maintain performance much closer to the original model by converting from the compressed data type back to the original FP32 data type during inference. So when the model makes a prediction, it is performing the matrix multiplications in FP32, and the activations are in FP32.  This enables you to quantize the model in data types smaller than bfloat16, such as int8, in this example.
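
As a rough sketch of the second point, the toy layer below stores its weights in INT8 and converts them back to FP32 right before the matrix multiplication. `ToyInt8Linear` is a hypothetical class written for this note, and the symmetric, one-scale-per-row scheme is an assumption chosen for brevity. It is not Quanto's `QLinear` implementation.

```python
import torch
from torch import nn

class ToyInt8Linear(nn.Module):
    """Hypothetical illustration: INT8 weights in storage, FP32 math at inference."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        weight = linear.weight.detach()
        # One scale per output row (symmetric quantization, so the zero point is 0).
        scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127
        self.register_buffer("q_weight", torch.round(weight / scale).to(torch.int8))
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize back to FP32, then run the usual FP32 matrix multiplication.
        weight_fp32 = self.q_weight.to(torch.float32) * self.scale
        return nn.functional.linear(x, weight_fp32, self.bias)

torch.manual_seed(0)
fp32_layer = nn.Linear(16, 8)
int8_layer = ToyInt8Linear(fp32_layer)

x = torch.randn(4, 16)
max_diff = (fp32_layer(x) - int8_layer(x)).abs().max().item()
print(int8_layer.q_weight.dtype, int8_layer(x).dtype, max_diff)
```

The stored weights take one byte each, while the activations coming out of the layer remain FP32, so the output stays close to that of the original layer.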


In [18]:
from transformers import AutoModelForCausalLM

In [19]:
model_name = "EleutherAI/pythia-410m"

In [22]:
model = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage = True)


In [23]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)




In [33]:
text = "Hello my name is Alice"
inputs = tokenizer(text, return_tensors = "pt")

In [34]:
outputs = model.generate(**inputs, max_new_tokens = 10)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


In [35]:
outputs

tensor([[12092,   619,  1416,   310, 16922,   285,   309,   717,   247,   747,
         17782,   281,   436,  2670,    15]])

In [36]:
print(tokenizer.decode(outputs[0], skip_special_tokens = True))

Hello my name is Alice and I am a newbie to this site.


## Estimate memory usage

### Pythia model in FP32

How many parameters?
- About 400 × 10^6 parameters
- 32 bits = 4 bytes per parameter

How much memory?
- 400 × 10^6 parameters × 4 bytes
- = 1600 × 10^6 bytes
- = 1600 megabytes
- = 1.6 gigabytes
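
The same back-of-the-envelope estimate can be written as a small function, which also makes it easy to try other data types. The function name is hypothetical, and the estimate counts weights only, using the rounded 400 million parameter figure.

```python
def estimate_model_size_gb(num_params: float, bits_per_param: int) -> float:
    """Estimate the weight memory in gigabytes (1 GB = 10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9

num_params = 400e6  # rounded parameter count used in the estimate above
for name, bits in [("FP32", 32), ("BF16", 16), ("INT8", 8)]:
    print(f"{name}: {estimate_model_size_gb(num_params, bits):.1f} GB")
```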

In [37]:
# just as before
from helper_4 import compute_module_sizes
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")

# as expected

The model size is 1.6402112960000002 GB


In [38]:
print(model.gpt_neox.layers[0].attention.dense.weight)

Parameter containing:
tensor([[ 0.0061, -0.0016, -0.0068,  ..., -0.0062,  0.0138,  0.0222],
        [ 0.0077,  0.0157, -0.0090,  ...,  0.0013, -0.0132,  0.0109],
        [-0.0330,  0.0008,  0.0281,  ...,  0.0026,  0.0456, -0.0077],
        ...,
        [-0.0105,  0.0091, -0.0137,  ..., -0.0046,  0.0371, -0.0077],
        [-0.0063,  0.0035,  0.0147,  ...,  0.0220,  0.0158,  0.0224],
        [-0.0299,  0.0129,  0.0208,  ..., -0.0040, -0.0065,  0.0122]],
       requires_grad=True)


In [40]:
from quanto import quantize, freeze
import torch

In [44]:
quantize(model, weights=torch.int8, activations=None)  # quantize only the weights, to int8
# The activations can also be quantized, but that is not done here.

In [45]:
print(model.gpt_neox)

GPTNeoXModel(
  (embed_in): Embedding(50304, 1024)
  (emb_dropout): Dropout(p=0.0, inplace=False)
  (layers): ModuleList(
    (0-23): 24 x GPTNeoXLayer(
      (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (post_attention_dropout): Dropout(p=0.0, inplace=False)
      (post_mlp_dropout): Dropout(p=0.0, inplace=False)
      (attention): GPTNeoXSdpaAttention(
        (rotary_emb): GPTNeoXRotaryEmbedding()
        (query_key_value): QLinear(in_features=1024, out_features=3072, bias=True)
        (dense): QLinear(in_features=1024, out_features=1024, bias=True)
        (attention_dropout): Dropout(p=0.0, inplace=False)
      )
      (mlp): GPTNeoXMLP(
        (dense_h_to_4h): QLinear(in_features=1024, out_features=4096, bias=True)
        (dense_4h_to_h): QLinear(in_features=4096, out_features=1024, bias=True)
        (act): GELUActivation()
      )
    )
  )
  (final_lay

In [46]:
print(model.gpt_neox.layers[0].attention.dense.weight)

# The weights still print as floats: the model is not fully converted to integers until it is frozen.

Parameter containing:
tensor([[ 0.0061, -0.0016, -0.0068,  ..., -0.0062,  0.0138,  0.0222],
        [ 0.0077,  0.0157, -0.0090,  ...,  0.0013, -0.0132,  0.0109],
        [-0.0330,  0.0008,  0.0281,  ...,  0.0026,  0.0456, -0.0077],
        ...,
        [-0.0105,  0.0091, -0.0137,  ..., -0.0046,  0.0371, -0.0077],
        [-0.0063,  0.0035,  0.0147,  ...,  0.0220,  0.0158,  0.0224],
        [-0.0299,  0.0129,  0.0208,  ..., -0.0040, -0.0065,  0.0122]],
       requires_grad=True)


In [47]:
# then, we freeze the model
freeze(model)

In [49]:
print(model.gpt_neox.layers[0].attention.dense.weight)
# The weights are now stored as int8 integers, together with a scale.

QTensor(tensor([[ 12,  -3, -14,  ..., -12,  28,  45],
        [ 18,  37, -21,  ...,   3, -31,  26],
        [-75,   2,  64,  ...,   6, 104, -18],
        ...,
        [-25,  22, -33,  ..., -11,  89, -19],
        [-14,   8,  33,  ...,  49,  35,  50],
        [-56,  24,  39,  ...,  -8, -12,  23]], dtype=torch.int8), scale=tensor([[0.0005],
        [0.0004],
        [0.0004],
        ...,
        [0.0004],
        [0.0004],
        [0.0005]]), public_dtype=torch.float32, requires_grad=True)


In [53]:
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes['']*1e-9} GB")

The model size is 0.580794472 GB


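# Memory footprint
### Pythia model:
- before quantization: ~1.64 GB
- after quantization: ~0.58 GB
  - about 35% of the original size (roughly 1/3). This is short of the ideal 1/4 for FP32 to INT8, likely because only the linear layers appear as `QLinear` in the printout above, while the embedding and layer-norm modules are unchanged.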
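
The size reduction can be checked with a little arithmetic on the numbers that `compute_module_sizes` printed in this notebook, for both models. The values below are copied from those outputs.

```python
# Sizes in GB, copied from the compute_module_sizes outputs in this notebook.
sizes_gb = {
    "google/flan-t5-small": (0.307844608, 0.12682868),
    "EleutherAI/pythia-410m": (1.6402112960000002, 0.580794472),
}

ratios = {}
for name, (before, after) in sizes_gb.items():
    ratios[name] = after / before
    print(f"{name}: {before:.3f} GB -> {after:.3f} GB ({ratios[name]:.0%} of the original)")
```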

# check for performance degradation
- run inference using the quantized model

In [54]:
outputs = model.generate(**inputs, max_new_tokens = 10)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


In [56]:
print(tokenizer.decode(outputs[0], skip_special_tokens = True))
# same outputs

Hello my name is Alice and I am a newbie to this site.


In [57]:
# Quantization did not affect the output for this prompt.

# What is happening here: linear quantization

Linear quantization is a linear mapping from one range to another, like projecting a longer line segment onto a shorter one:

$[r_{\min}, r_{\max}] \Rightarrow [q_{\min}, q_{\max}]$

- Formula: $r = s(q - z)$
  - $r$: original value (e.g. in FP32)
  - $q$: quantized value (e.g. in INT8)
  - $z$: zero point (e.g. in INT8)
  - $s$: scale (e.g. in FP32)


## Scale and zero point

If we look at the extreme values, we should get:

$$r_{\min} = s(q_{\min} - z)$$

$$r_{\max} = s(q_{\max} - z)$$

If we subtract the first equation from the second one, we get the scale $s$:

$$s = \frac{r_{\max} - r_{\min}}{q_{\max} - q_{\min}}$$

Solving the first equation for the zero point gives $z = q_{\min} - r_{\min}/s$. We need to round this value, since $z$ is an n-bit integer:

$$z = \mathrm{int}\left(\mathrm{round}\left(q_{\min} - \frac{r_{\min}}{s}\right)\right)$$
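
A minimal sketch of these two formulas in plain Python follows. The function names are hypothetical, the code assumes `r_max > r_min`, and clamping the zero point into the quantized range is an extra assumption that the notes do not state.

```python
def get_scale_and_zero_point(r_min: float, r_max: float,
                             q_min: int = -128, q_max: int = 127):
    """Compute the linear-quantization parameters s and z for a value range."""
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    # Assumption: keep z inside the quantized range so it fits in the same integer type.
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point

def linear_quantize(r: float, scale: float, zero_point: int,
                    q_min: int = -128, q_max: int = 127) -> int:
    """q = round(r / s + z), clamped to [q_min, q_max]."""
    return max(q_min, min(q_max, int(round(r / scale + zero_point))))

def linear_dequantize(q: int, scale: float, zero_point: int) -> float:
    """r = s * (q - z)."""
    return scale * (q - zero_point)

s, z = get_scale_and_zero_point(r_min=-1.0, r_max=3.0)
print(f"scale={s:.6f}, zero_point={z}")
# The extreme values of the float range land on the extremes of the INT8 range.
print(linear_quantize(-1.0, s, z), linear_quantize(3.0, s, z))
```

Because the zero point is rounded, dequantizing the extremes gives values close to, but not exactly equal to, the original `r_min` and `r_max`.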


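### How do you get the scale and zero point?

Linear quantization maps the floating-point range $[r_{\min}, r_{\max}]$ to the quantized range $[q_{\min}, q_{\max}]$, so both parameters follow from the extreme values of these two ranges.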

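## Intermediate state

Between `quantize` and `freeze`, the model is in an intermediate state: as the weight printout above shows, the weights still print as floats until the model is frozen.

In [61]:
# Standard workflow: quantize, then freeze.
# from quanto import quantize, freeze
# quantize(model, weights=torch.int8)
# freeze(model)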

# Calibration

Calibrate the model when quantizing its activations:
- The range of activation values depends on what input was given.
  - e.g. a different input text will generate different activations.
- The min/max of the activation ranges are used to perform linear quantization.
- How do you get the min and max range of the activations?
  - Gather sample input data.
  - Run inference.
  - Calculate the min/max of the activations.

Result: better quantized activations.
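
The three calibration steps can be sketched with PyTorch forward hooks on a toy network. Everything here is illustrative: `toy_model`, the random calibration batches, and the hook bookkeeping are assumptions made for this note, and this is not how Quanto implements calibration.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical toy model standing in for a real network.
toy_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Running min/max of each Linear layer's output activations.
activation_ranges = {}

def make_hook(name):
    def hook(module, inputs, output):
        lo, hi = output.min().item(), output.max().item()
        old_lo, old_hi = activation_ranges.get(name, (lo, hi))
        activation_ranges[name] = (min(lo, old_lo), max(hi, old_hi))
    return hook

handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in toy_model.named_modules()
    if isinstance(module, nn.Linear)
]

# 1. Gather sample input data (random here; real calibration uses representative inputs).
calibration_batches = [torch.randn(32, 8) for _ in range(10)]

# 2. Run inference; 3. the hooks record the min/max of the activations.
with torch.no_grad():
    for batch in calibration_batches:
        toy_model(batch)

for handle in handles:
    handle.remove()

# The recorded ranges provide r_min and r_max for each activation's scale.
for name, (lo, hi) in activation_ranges.items():
    scale = (hi - lo) / 255  # q_max - q_min = 127 - (-128)
    print(f"layer {name}: range=({lo:.3f}, {hi:.3f}), scale={scale:.5f}")
```

Different calibration inputs would produce different ranges, which is why the sample data should resemble what the model will see at inference time.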

# Quantization-aware training

Training in a way that controls how the model performs once it is quantized.

- The intermediate state holds both:
  - a quantized version of the weights
  - the original, unquantized weights
- Forward pass (inference)
  - Use the quantized version of the model weights to make predictions
    - e.g. in BF16
- Backpropagation (updating model weights)
  - Update the original, unquantized version of the model weights
    - e.g. in FP32
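
One common way to realize this idea is "fake quantization" with a straight-through estimator, sketched below on a toy regression problem. This is a swapped-in illustration, not the method from the lecture: it quantizes to an INT8 grid rather than BF16, and `fake_quantize` and `ToyQATLinear` are hypothetical names.

```python
import torch
from torch import nn

def fake_quantize(weight: torch.Tensor) -> torch.Tensor:
    """Quantize to the INT8 grid and dequantize again (symmetric, per tensor)."""
    scale = weight.detach().abs().max().clamp(min=1e-8) / 127
    quantized = torch.clamp(torch.round(weight / scale), -127, 127)
    dequantized = quantized * scale
    # Straight-through estimator: the forward pass sees the quantized weights,
    # while gradients flow to the original weights as if no rounding happened.
    return weight + (dequantized - weight).detach()

class ToyQATLinear(nn.Linear):
    """Hypothetical layer: FP32 master weights, quantized weights in the forward pass."""

    def forward(self, x):
        return nn.functional.linear(x, fake_quantize(self.weight), self.bias)

torch.manual_seed(0)
layer = ToyQATLinear(4, 1)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
x = torch.randn(64, 4)
y = x @ torch.tensor([[1.0], [-2.0], [0.5], [3.0]])  # synthetic regression target

losses = []
for _ in range(200):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(layer(x), y)  # forward pass uses quantized weights
    loss.backward()                             # gradients reach the FP32 weights
    optimizer.step()                            # update the unquantized weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The layer keeps a single FP32 copy of the weights and derives the quantized version on every forward pass, which mirrors the "holds both" intermediate state described in the list above.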