For models that we want to use in a quantized state (e.g. Llama 3 70B), we quantize the model once and store the quantized version locally to reduce load times.
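
For a rough sense of what quantization saves, weight storage can be estimated as parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate only. The helper name approx_weight_gib and the nominal 70-billion parameter count are assumptions made for illustration, and the estimate ignores quantization constants, layers kept in higher precision, activations, and the KV cache.

```python
def approx_weight_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB (hypothetical helper; weights only)."""
    return n_params * bits_per_weight / 8 / 2**30


# Nominal parameter count, assumed for illustration.
n_params = 70e9
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{approx_weight_gib(n_params, bits):.0f} GiB")
```

Halving the bits per weight halves the estimate, which is why a model that does not fit in GPU memory at 16 bits may fit at 8 or 4 bits.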

In [None]:
!pip install --quiet --upgrade transformers

In [None]:
!pip install --quiet --upgrade torch

In [None]:
!pip install --quiet --upgrade bitsandbytes accelerate

In [None]:
%%time

from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from transformers import BitsAndBytesConfig

# model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
model_id = "google/gemma-2-27b-it"

# 8-bit quantization (the configuration used below)
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True
)

# 4-bit NF4 quantization with double quantization (alternative; not used
# unless passed as quantization_config below)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    # bnb_4bit_compute_dtype=torch.bfloat16
    bnb_4bit_compute_dtype="bfloat16"
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config_8bit,
    # torch_dtype=torch.bfloat16,
    torch_dtype="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    # "sequential" instead of "auto"/"balanced": with those, lm_head was placed on the CPU.
    device_map="sequential",
)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True
)

In [None]:
quantized_model_dir = "models/" + model_id.replace("/", "__")

model.save_pretrained(quantized_model_dir)
_ = tokenizer.save_pretrained(quantized_model_dir)
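
As a quick sanity check on the save step, one could confirm that the quantization settings were written alongside the weights. The sketch below assumes that save_pretrained records them under a quantization_config key in config.json, which is the behavior we would expect from recent transformers versions for bitsandbytes-quantized models. The helper name read_quantization_config is hypothetical.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the quantization_config entry from a saved model's config.json, or None."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open() as f:
        config = json.load(f)
    # Assumption: quantization settings live under this key when present.
    return config.get("quantization_config")
```

In this notebook it would be called as read_quantization_config(quantized_model_dir). A result of None would suggest that the stored model is not marked as quantized.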

If quantization reduced the model to a size that fits into GPU memory, every item in the printed device map should map to one of the GPU devices listed by print_device_info, and none should map to the CPU. For example, with two GPU devices available (0 and 1), every item listed by print_device_map, including all layers, lm_head, and norm, should have the value 0 or 1.

In [None]:
import utils
utils.print_device_info()
utils.print_model_info(model)
utils.print_device_map(model)
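
The check described above can also be done programmatically rather than by eye. The sketch below assumes the placement is available as a dict of module name to device, as transformers exposes in model.hf_device_map when a model is loaded with a device_map. The helper name offloaded_modules and the example map are made up for illustration.

```python
def offloaded_modules(device_map):
    """Return names of modules placed on the CPU or disk rather than a GPU."""
    return sorted(
        name for name, device in device_map.items()
        if str(device) in ("cpu", "disk")
    )


# Made-up device map for illustration; GPU devices appear as integer indices.
example_map = {"model.embed_tokens": 0, "model.layers.0": 0, "model.norm": 1, "lm_head": "cpu"}
print(offloaded_modules(example_map))
```

Applied to the loaded model, offloaded_modules(model.hf_device_map) should return an empty list when everything fits on the GPU devices.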

We'll now clear the quantized model from memory so that we can reload the locally stored version and confirm that the complete process was successful.

In [None]:
import gc
import torch

del model
gc.collect()
torch.cuda.empty_cache()

Ensure that the memory on the GPU device(s) has been freed.

In [None]:
import utils
utils.print_device_info()
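
If the utils helper is not at hand, a similar check can be sketched directly with PyTorch's allocator counters. This is a minimal sketch, not a replacement for print_device_info: the helper name cuda_memory_in_use is hypothetical, and the counters only cover memory managed by PyTorch in the current process, not memory held by other processes on the same GPU.

```python
def cuda_memory_in_use():
    """Map each visible CUDA device index to (allocated_bytes, reserved_bytes)."""
    try:
        import torch  # imported lazily so the sketch also runs where torch is absent
    except ImportError:
        return {}
    return {
        i: (torch.cuda.memory_allocated(i), torch.cuda.memory_reserved(i))
        for i in range(torch.cuda.device_count())
    }


for index, (allocated, reserved) in cuda_memory_in_use().items():
    print(f"cuda:{index}: allocated={allocated / 2**20:.0f} MiB, reserved={reserved / 2**20:.0f} MiB")
```

After deleting the model and emptying the cache, both numbers should be close to zero on every device.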

Now we'll load the locally stored, quantized model into memory.

In [None]:
%%time

from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(
    quantized_model_dir,
    local_files_only=True
)

model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    device_map="sequential",
    local_files_only=True
)

The reloaded model should show the same quantization and memory usage as the model that was quantized on the fly when it was first loaded.

In [None]:
import utils
utils.print_device_info()
utils.print_model_info(model)
utils.print_device_map(model)