In [None]:
!BUILD_CUDA_EXT=0 pip install -q auto-gptq transformers datasets

In [None]:
# Code for 4-bit Quantization with GPTQ
import random
import torch
from typing import List, Dict
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer

def prepare_data(model_id: str, n_samples: int) -> List[Dict[str, torch.Tensor]]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Load a slice of C4 and tokenize it as one long sequence of shape [1, total_tokens]
    data = load_dataset("allenai/c4",
                        data_files="en/c4-train.00001-of-01024.json.gz",
                        split=f"train[:{n_samples*5}]")
    tokenized_data = tokenizer("\n\n".join(data['text']), return_tensors='pt')

    # Pick n_samples random start positions that leave room for a full window.
    # Note: this assumes tokenizer.model_max_length is a real context length;
    # some tokenizers report a very large placeholder value instead.
    start_indices = []
    for _ in range(n_samples):
        max_index = tokenized_data.input_ids.shape[1] - tokenizer.model_max_length - 1
        random_index = random.randint(0, max_index)
        start_indices.append(random_index)

    # Cut one window of model_max_length tokens per start position
    examples_ids = []
    for i in start_indices:
        j = i + tokenizer.model_max_length
        input_ids = tokenized_data.input_ids[:, i:j]
        # 1 for real tokens, 0 for padding tokens
        attention_mask = (input_ids != tokenizer.pad_token_id).type(input_ids.dtype)
        examples_ids.append({'input_ids': input_ids, 'attention_mask': attention_mask})

    return examples_ids

def quantize_llm_with_gptq(model_id: str,
                           out_dir: str,
                           n_samples: int = 512,
                           bits: int = 4,
                           group_size: int = 128,
                           damp_percent: float = 0.01,
                           desc_act: bool = False) -> str:
    quantize_config = BaseQuantizeConfig(bits=bits,
                                         group_size=group_size,
                                         damp_percent=damp_percent,
                                         desc_act=desc_act)
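    # Load the full-precision model together with the quantization settings
    model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

    # Quantize layer by layer using the calibration samples, then save the result
    examples_ids = prepare_data(model_id, n_samples)
    model.quantize(examples_ids, batch_size=1, use_triton=True)
    model.save_quantized(out_dir, use_safetensors=True)
    # Save the tokenizer next to the weights so the folder can be loaded on its own
    AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)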

    return out_dir

# Example usage
model_id = "HuggingFaceH4/zephyr-7b-beta"
out_dir_quantized = model_id + "-GPTQ"
quantized_dir = quantize_llm_with_gptq(model_id, out_dir_quantized)

In [None]:
#######################################################################
# Now, load quantized model
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Reload model and tokenizer
model = AutoGPTQForCausalLM.from_quantized(
    out_dir_quantized,
    device=device,
    use_triton=True,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(out_dir_quantized)

from transformers import pipeline

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

result = generator("My favourite destination", do_sample=True, max_length=50)[0]['generated_text']

print(result)

📌 GPTQ is a post-training quantization (PTQ) method for 4-bit quantization that focuses primarily on GPU inference and performance.

The idea behind the method is to compress every weight to 4 bits while minimizing the mean squared error that quantization introduces in each layer's output. During inference, the weights are dequantized to float16 on the fly, which keeps memory usage low while preserving performance.
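To make the quantize/dequantize round trip concrete, here is a minimal sketch in plain Python. It uses simple round-to-nearest quantization on one group of weights, which is not the GPTQ algorithm itself (GPTQ additionally adjusts the remaining weights to compensate for the error). The helper names `quantize_group`, `dequantize_group` and `mean_squared_error` and the sample weights are made up for illustration.

```python
from typing import List, Tuple

def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Round-to-nearest quantization of one group of weights (illustration only)."""
    levels = 2 ** bits - 1                     # 15 steps between min and max for 4 bits
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:                           # constant group: avoid division by zero
        scale = 1.0
    # Map each weight to the nearest integer code in [0, levels]
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min

def dequantize_group(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Recover approximate float weights from the integer codes."""
    return [c * scale + w_min for c in codes]

def mean_squared_error(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

group = [-0.8, -0.3, 0.0, 0.1, 0.45, 0.7]      # hypothetical weights
codes, scale, w_min = quantize_group(group)
restored = dequantize_group(codes, scale, w_min)
print("codes:", codes)
print(f"MSE after round trip: {mean_squared_error(group, restored):.6f}")
```

Each weight is stored as an integer between 0 and 15 plus a shared scale and minimum per group, and the reconstruction error per weight is at most half a quantization step.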

For a CPU-friendly approach, GGML is currently your best option.

The attached code performs 4-bit quantization with GPTQ.
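To put the memory saving mentioned above into numbers, the following back-of-the-envelope sketch estimates weight storage at different bit widths. The function `weight_memory_gb` is hypothetical, the parameter count is a round number rather than an exact figure for any model, and the per-group overhead (one 16-bit scale and one 4-bit zero-point per group) is an assumption about the storage layout.

```python
def weight_memory_gb(n_params: int, bits: float, group_size: int = 0,
                     scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    bits_per_weight = float(bits)
    if group_size > 0:
        # Assumed overhead: one scale and one zero-point shared by each group
        bits_per_weight += (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7_000_000_000  # round number for a "7B" model, not an exact count
fp16_gb = weight_memory_gb(n_params, bits=16)
int4_gb = weight_memory_gb(n_params, bits=4, group_size=128)
print(f"fp16: {fp16_gb:.2f} GB, 4-bit with group_size=128: {int4_gb:.2f} GB")
```

This counts weights only; activations, the KV cache and any layers left unquantized add to the real footprint.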

------------------

## Explanation of the `start_indices` block from the code above

```py
start_indices = []
for _ in range(n_samples):
    max_index = tokenized_data.input_ids.shape[1] - tokenizer.model_max_length - 1
    random_index = random.randint(0, max_index)
    start_indices.append(random_index)
```

📌 This block generates a list of `start_indices` that are used to select sequences from the tokenized data for model quantization. Each index in `start_indices` serves as a starting point for a sequence of tokens.

📌 `max_index = tokenized_data.input_ids.shape[1] - tokenizer.model_max_length - 1`
   
Here `tokenized_data.input_ids.shape[1]` gives the length of the tokenized sequence.
     
`tokenizer.model_max_length` is the maximum length of sequences that the model can handle.

From the Hugging Face documentation: `model_max_length` (`int`, optional): The maximum length (in number of tokens) for the inputs to the transformer model. When the tokenizer is loaded with `from_pretrained()`, this will be set to the value stored for the associated model in `max_model_input_sizes`. If no value is provided, it defaults to `VERY_LARGE_INTEGER` (`int(1e30)`). In that case `max_index` becomes negative and `random.randint` raises a `ValueError`, so check the value before relying on it.

Subtracting these values and an additional `1` ensures that any sequence starting at `random_index` and extending for `tokenizer.model_max_length` tokens stays within `tokenized_data.input_ids`. Slicing past the end of a tensor does not raise an error; it silently returns a shorter sequence, so this subtraction is what guarantees that every sample has the full length.

`random_index = random.randint(0, max_index)`: Generates a random integer between `0` and `max_index`, both inclusive. This index is used as the start point for slicing a sequence from the tokenized data.

`start_indices.append(random_index)`: Appends the generated `random_index` to the `start_indices` list.

📌 After the loop, `start_indices` contains `n_samples` number of starting points, each chosen randomly within the valid range, ensuring the subsequences extracted will be within the model's maximum sequence length and within the bounds of `tokenized_data.input_ids`.
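The sampling logic can be tried in isolation with plain integers instead of a real tokenizer. The sketch below is an illustration, not the notebook's code: `sample_start_indices` is a hypothetical helper, and the `max_window` cap of 2048 is an assumed fallback for tokenizers whose `model_max_length` is the `int(1e30)` placeholder.

```python
import random
from typing import List

def sample_start_indices(total_len: int, window: int, n_samples: int,
                         max_window: int = 2048, seed: int = 0) -> List[int]:
    """Draw start positions so that [start, start + window) fits in total_len tokens."""
    # Assumed guard: cap placeholder values such as int(1e30) to a usable window
    window = min(window, max_window)
    max_index = total_len - window - 1
    if max_index < 0:
        raise ValueError("tokenized text is shorter than one window")
    rng = random.Random(seed)                  # seeded for reproducibility
    # randint is inclusive on both ends, so max_index itself can be drawn
    return [rng.randint(0, max_index) for _ in range(n_samples)]

# A tokenizer reporting the placeholder length still yields valid windows
starts = sample_start_indices(total_len=10_000, window=int(1e30), n_samples=8)
print(starts)
```

Seeding the generator is a design choice made here so that the same calibration windows are drawn on every run; the original code uses the global `random` state instead.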

----------------------

### `input_ids = tokenized_data.input_ids[:, i:j]` - Why do we slice along the second dimension (and keep the first with `:`)?

- In NLP models, especially those based on transformers, input data is often represented in a 2D tensor format. The first dimension typically represents different examples or sequences (batch dimension), and the second dimension represents the tokens within each sequence.

- The `input_ids` tensor from tokenized data usually has a shape of `[batch_size, seq_length]`, where `batch_size` is the number of sequences and `seq_length` is the length of each sequence. If you're processing a single sequence, `batch_size` is 1. If you have 10 sequences of text data that you want to process at the same time, your batch size is 10. In the code above, all documents are joined into one string before tokenization, so `batch_size` is 1.

`seq_length`, the second dimension, is the number of tokens in each sequence, i.e. the length of each input text sequence after tokenization.

- In the given code, the slicing operation `[:, i:j]` is designed to select a specific range of tokens from each sequence in the batch. The `:` indicates that we are selecting all sequences (across the batch dimension), and `i:j` specifies the range of tokens to select from each sequence. This way, you get a continuous slice of tokens from every sequence in the batch. This operation is vital for creating uniform-length sequences for model input.

- **Why Not Row-wise**: Row-wise slicing would imply selecting specific sequences (entire sequences) rather than specific tokens within each sequence. This is not the desired operation here, as the goal is to select a particular section of tokens (from `i` to `j`) across all sequences for the model to process.
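The difference between the two slicing directions can be seen on a small example. The sketch below uses nested Python lists as a stand-in for the `[batch_size, seq_length]` tensor, so it runs without PyTorch; `slice_tokens` is a hypothetical helper that mimics what `input_ids[:, i:j]` does on a tensor.

```python
from typing import List

def slice_tokens(batch: List[List[int]], i: int, j: int) -> List[List[int]]:
    """List-based analogue of tensor[:, i:j]: keep every row, take columns i..j-1."""
    return [row[i:j] for row in batch]

# One long "sequence" of 12 token ids, i.e. shape [1, 12]
batch = [list(range(100, 112))]

window = slice_tokens(batch, 3, 8)   # tokens 3..7 of every row -> shape [1, 5]
rows = batch[0:1]                    # row-wise slice: whole sequences, shape [1, 12]

print(window)
print(len(rows), len(rows[0]))
```

The token-wise slice keeps the batch dimension intact, which is why each calibration sample still has shape `[1, window_length]`.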

---------

### `attention_mask = (input_ids != tokenizer.pad_token_id).type(input_ids.dtype)`

📌 This line builds the `attention_mask` so that it marks exactly the valid (non-padding) tokens.

📌 In a typical NLP model setup, especially in large language models, input sequences may be padded to a fixed length for batching purposes. Padding tokens are not actual content and should not be attended to by the model.

📌 The expression `(input_ids != tokenizer.pad_token_id)` creates a boolean mask where each position is `True` if the corresponding token is not a padding token, and `False` otherwise. This effectively identifies the non-padding tokens.

📌 The method `.type(input_ids.dtype)` then converts this boolean mask to the same data type as `input_ids`, so the mask becomes an integer tensor of 1s and 0s, which is the format models conventionally expect for attention masks.

📌 With the mask in place, attention is computed only over real tokens. In this notebook every window is cut from one continuous, unpadded token stream, so the mask will normally be all ones.
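The same mask logic can be sketched without tensors. In the example below, `build_attention_mask` is a hypothetical helper operating on nested lists, and the fallback for a missing pad token (treat every token as real) is an assumption, since some tokenizers define no `pad_token_id` at all.

```python
from typing import List, Optional

def build_attention_mask(batch: List[List[int]],
                         pad_token_id: Optional[int]) -> List[List[int]]:
    """Return 1 for real tokens and 0 for padding, mirroring (ids != pad_id)."""
    if pad_token_id is None:
        # Assumed fallback: without a pad token, nothing can be padding
        return [[1] * len(row) for row in batch]
    return [[int(tok != pad_token_id) for tok in row] for row in batch]

padded = [[5, 6, 7, 0, 0]]                 # hypothetical ids, with 0 as the pad id
print(build_attention_mask(padded, pad_token_id=0))
print(build_attention_mask(padded, pad_token_id=None))
```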

The C4 (Colossal Clean Crawled Corpus) dataset is used to generate our samples. The quantization process relies heavily on samples to evaluate and enhance the quality of the quantization. They provide a means of comparison between the outputs produced by the original and the newly quantized model. The larger the number of samples provided, the greater the potential for more accurate and effective comparisons, leading to improved quantization quality.

In the example above, we load the first 2,560 C4 documents (`n_samples * 5`), tokenize them as one long sequence, and cut 512 random windows from it, each formatted as a dict with `input_ids` and `attention_mask`.

------

The `desc_act` parameter (also called act order) processes rows in order of decreasing activation, meaning the most important or impactful rows (determined by sampled inputs and outputs) are processed first. This method aims to place most of the quantization error (inevitably introduced during quantization) on less significant weights. This approach improves the overall accuracy of the quantization process by ensuring the most significant weights are processed with greater precision. However, when used alongside group size, `desc_act` can lead to performance slowdowns due to the need to frequently reload quantization parameters. For this reason, we won't use it here (it will probably be fixed in the future, however).
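The ordering idea behind act order can be illustrated with a few lines of plain Python. This is a simplified sketch, not AutoGPTQ's implementation: it assumes that importance is measured as the summed squared activation of each input feature over the calibration samples, and `activation_order` and the sample values are made up for illustration.

```python
from typing import List

def activation_order(calib_inputs: List[List[float]]) -> List[int]:
    """Return feature indices sorted by decreasing summed squared activation."""
    n_features = len(calib_inputs[0])
    # Assumed importance measure: sum of squared activations per feature
    importance = [sum(row[c] ** 2 for row in calib_inputs) for c in range(n_features)]
    return sorted(range(n_features), key=lambda c: importance[c], reverse=True)

# Two hypothetical calibration rows with three input features each
calib_inputs = [[0.1, 2.0, -0.5],
                [0.2, -1.5, 0.4]]
order = activation_order(calib_inputs)
print(order)  # the feature with the largest activations comes first
```

Quantizing in this order means that a group no longer covers a contiguous block of the weight matrix, which is consistent with the slowdown described above when `desc_act` is combined with a group size.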

https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.model_max_length