In [1]:
%autosave 300
%load_ext autoreload
%autoreload 2
%reload_ext autoreload
%config Completer.use_jedi = False

Autosaving every 300 seconds


#### transformers meets bitsandbytes for model quantization

In [2]:
# !pip install -q -U bitsandbytes
# !pip install -q -U git+https://github.com/huggingface/transformers.git
# !pip install -q -U git+https://github.com/huggingface/peft.git
# !pip install -q -U git+https://github.com/huggingface/accelerate.git

##### Baseline: pretrained model loaded with `load_in_4bit=True`

In [23]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-350m"

model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


In [24]:
print(model)

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear4bit(in_features=1024, out_features=512, bias=False)
      (project_in): Linear4bit(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear4bit(in_features=1024, out_features=4096, bias=True)
          (

In [25]:
print(model.state_dict()['model.decoder.embed_tokens.weight'])

tensor([[-0.0353,  0.0629, -0.0628,  ..., -0.0625,  0.0188,  0.0313],
        [ 0.0213,  0.0379, -0.0625,  ..., -0.0625, -0.0167,  0.0313],
        [-0.0484, -0.0648,  0.0690,  ...,  0.0656, -0.0626, -0.0485],
        ...,
        [ 0.0723,  0.0312, -0.0634,  ..., -0.0625, -0.0053, -0.0755],
        [ 0.0596, -0.0695, -0.0626,  ...,  0.0736, -0.0040,  0.0409],
        [-0.0237,  0.0327, -0.0636,  ..., -0.0625, -0.0248,  0.0315]],
       device='cuda:0', dtype=torch.float16)
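
The module tree and the tensor printed above suggest that only the `Linear` layers were swapped for `Linear4bit`, while the embedding weights stay in `float16`. A minimal sketch for summarizing this, assuming only that the model exposes the standard `named_modules()` iterator; `count_module_types` and `FakeModel` are hypothetical names used here so the snippet runs without a GPU:

```python
from collections import Counter


def count_module_types(model):
    """Count the modules of a model by class name."""
    # named_modules() yields (name, module) pairs, as torch.nn.Module does.
    return Counter(type(module).__name__ for _, module in model.named_modules())


# Hypothetical stand-ins so the sketch runs anywhere.
# With the real model the call is simply count_module_types(model).
class Linear4bit:
    pass


class Embedding:
    pass


class FakeModel:
    def named_modules(self):
        return [
            ("embed_tokens", Embedding()),
            ("q_proj", Linear4bit()),
            ("k_proj", Linear4bit()),
        ]


counts = count_module_types(FakeModel())
print(counts)
```

On the real model, `model.get_memory_footprint()` from transformers gives a complementary view: the total size in bytes of the loaded parameters and buffers.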


In [26]:
text = "Hello my name is"
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hello my name is jimmy and I am a new member of the reddit clan. I am a new member of


#### Quantize the model to 4-bit weights with a configurable compute dtype using bitsandbytes

This section reviews advanced usage of the 4-bit integration, starting with the arguments that can be tuned.

All of these parameters are set through `BitsAndBytesConfig` from transformers, which is passed to the `quantization_config` argument of `from_pretrained`.

Make sure to pass `load_in_4bit=True` when using `BitsAndBytesConfig` for 4-bit loading!

The 4-bit integration comes with two quantization types: FP4 and NF4. The code immediately below uses FP4 (the default); NF4 is covered afterwards.

In [27]:
import torch
from transformers import BitsAndBytesConfig

```python
class BitsAndBytesConfig(
    load_in_8bit: bool = False,
    load_in_4bit: bool = False,
    llm_int8_threshold: float = 6,
    llm_int8_skip_modules: Any | None = None,
    llm_int8_enable_fp32_cpu_offload: bool = False,
    llm_int8_has_fp16_weight: bool = False,
    bnb_4bit_compute_dtype: Any | None = None,
    bnb_4bit_quant_type: str = "fp4",
    bnb_4bit_use_double_quant: bool = False,
    bnb_4bit_quant_storage: Any | None = None,
    **kwargs: Any
)
```

In [33]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
)
print(quantization_config)
# bnb_4bit_compute_dtype sets the dtype used during computation.
# bnb_4bit_quant_storage sets the dtype used to store the packed 4-bit weights.

BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "fp4",
  "bnb_4bit_use_double_quant": false,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}
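
The config above reports `"bnb_4bit_quant_storage": "uint8"`, which is consistent with two 4-bit codes sharing one byte. The sketch below illustrates that idea in plain Python. The helper names are hypothetical, and the nibble order (first code in the high bits) is an assumption for illustration, not necessarily the layout bitsandbytes uses internally.

```python
def pack_nibbles(codes):
    """Pack 4-bit codes (integers 0..15) into bytes, two codes per byte."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("every code must fit in 4 bits")
    padded = list(codes)
    if len(padded) % 2:
        padded.append(0)  # pad so the last byte is complete
    # First code of each pair goes in the high nibble (assumed order).
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)
        codes.append(byte & 0x0F)
    return codes[:count]


codes = [3, 15, 0, 7, 9]
packed = pack_nibbles(codes)
roundtrip = unpack_nibbles(packed, len(codes))
print(len(codes), "codes stored in", len(packed), "bytes")
```

Before a matrix multiplication the codes are mapped back to real values in the compute dtype, which is why storage and compute dtypes can be chosen independently.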



In [29]:
model_cd_bf16 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [30]:
outputs = model_cd_bf16.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hello my name is jimmy and I am a new member of the reddit clan. I am a new member of


##### NF4 quantization (the data type introduced with QLoRA)
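
To make the idea concrete, here is a minimal pure-Python sketch of blockwise absmax quantization against the 16 NF4 levels. It is an illustration under stated assumptions, not the bitsandbytes kernel: the level values are the published NF4 values rounded to four decimals, the function names are hypothetical, and the example block is far shorter than the 64-value blocks typically used.

```python
# NF4 levels rounded to 4 decimals (the library stores them at higher precision).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(values, levels=NF4_LEVELS):
    """Return (absmax, codes): one scale per block and one 4-bit index per value."""
    # Fall back to 1.0 so an all-zero block does not divide by zero.
    absmax = max(abs(v) for v in values) or 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - v / absmax))
        for v in values
    ]
    return absmax, codes


def dequantize_block(absmax, codes, levels=NF4_LEVELS):
    """Map 4-bit indices back to approximate real values."""
    return [levels[c] * absmax for c in codes]


weights = [0.031, -0.062, 0.004, 0.019, -0.048, 0.062, -0.011, 0.0]
absmax, codes = quantize_block(weights)
restored = dequantize_block(absmax, codes)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"largest round-trip error: {max_error:.4f}")
```

The levels are denser near zero than near ±1, which suits weights that are roughly normally distributed; the largest-magnitude value in each block is reproduced exactly because ±1 are both levels.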

In [34]:
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)
print(nf4_config)

BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "float32",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": false,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}



In [36]:
model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [37]:
outputs = model_nf4.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hello my name is John and I am a very happy man. I am a very happy man. I am a very


#### Pushing the limits of the system
How far can we go using 4-bit quantization? The (commented-out) cells below show how to load a 20B-scale model (40 GB in half precision) entirely on the GPU using this quantization method! 🤯

Let's load the model with the NF4 quantization type for better results, a bfloat16 compute dtype, and nested (double) quantization for more memory-efficient loading.
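
A rough back-of-the-envelope estimate shows why this fits. The sketch below counts weight storage only and rests on assumptions that are not stated in this notebook: blocks of 64 weights with one 32-bit scale each, and, for nested quantization, 8-bit scales with a second-level 32-bit constant per 256 blocks (the figures described for QLoRA). It ignores activations, the KV cache, and any layers left unquantized, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_weight, overhead_bits=0.0):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * (bits_per_weight + overhead_bits) / 8 / 1e9


N_PARAMS = 20e9  # "20B-scale"; the exact checkpoint size may differ

# Half precision: 16 bits per weight, no quantization constants.
fp16_gb = weight_memory_gb(N_PARAMS, 16)

# 4-bit weights plus one fp32 scale per 64 weights (assumed block size).
four_bit_gb = weight_memory_gb(N_PARAMS, 4, overhead_bits=32 / 64)

# Nested quantization: 8-bit scales plus one fp32 constant per 256 scales (assumed).
nested_gb = weight_memory_gb(N_PARAMS, 4, overhead_bits=8 / 64 + 32 / (64 * 256))

print(f"fp16: {fp16_gb:.1f} GB")
print(f"4-bit: {four_bit_gb:.2f} GB")
print(f"4-bit + nested quantization: {nested_gb:.2f} GB")
```

Under these assumptions nested quantization saves a little under 0.4 bits per parameter, which adds up to roughly 1 GB at this scale.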

In [39]:
# model_id = "EleutherAI/gpt-neox-20b"
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

# tokenizer = AutoTokenizer.from_pretrained(model_id)
# model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

In [None]:
# text = "Hello my name is"
# device = "cuda:0"
# inputs = tokenizer(text, return_tensors="pt").to(device)

# outputs = model_4bit.generate(**inputs, max_new_tokens=20)
# print(tokenizer.decode(outputs[0], skip_special_tokens=True))
