In [None]:
!pip install transformers torch bitsandbytes accelerate



## Quantization Resources

https://huggingface.co/docs/peft/en/developer_guides/quantization

https://huggingface.co/docs/transformers/quantization

### Load Tokenizer

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")



### Using 16bit Precision

This will likely crash on a memory-constrained runtime: at 2 bytes per parameter, a 7B-parameter model needs about 14 GB of GPU RAM for the weights alone.
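The 14 GB estimate is parameter count times bytes per parameter. Below is a minimal sketch of that arithmetic, using the parameter count this notebook prints for Falcon-7B (6,921,720,704). The helper name `weight_memory_gb` is made up for illustration, and the figures cover weights only, not activations or the generation cache.

```python
def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Value printed by model.num_parameters() elsewhere in this notebook.
FALCON_7B_PARAMS = 6_921_720_704

for label, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(FALCON_7B_PARAMS, bits):.2f} GB")
```

The 16-bit row is the one that motivates quantization here: the weights alone come close to the full memory of a 16 GB card.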

In [None]:
import torch
from transformers import AutoModelForCausalLM

# Load the full model in bfloat16 (2 bytes per parameter, about 14 GB of weights).
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", torch_dtype=torch.bfloat16)

### Using 8bit Precision

In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Ask bitsandbytes to store the linear-layer weights as 8-bit integers.
config = BitsAndBytesConfig(
    load_in_8bit=True
)

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", quantization_config=config)

gbs = model.get_memory_footprint() / 1e9

print(f"Number of parameters: {model.num_parameters()}")
print(f"Memory footprint if FP32: {(model.num_parameters()*4)/1e9} GB")
print(f"Memory footprint of the model with 8bit quantization: {gbs:.2f} GB")

`low_cpu_mem_usage` was None, now set to True since model is quantized.



Number of parameters: 6921720704
Memory footprint if FP32: 27.686882816 GB
Memory footprint of the model with 8bit quantization: 7.23 GB
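The reported 7.23 GB is close to one byte per parameter, which is what 8-bit storage implies. The small excess is consistent with some parts of the model staying in higher precision. As a rough illustration of the idea, here is a simplified absmax round trip in plain Python. It is not the bitsandbytes implementation, which quantizes per vector and treats outlier features separately; the function names and sample weights are made up.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    quantized = [round(v * scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q / scale for q in quantized]


weights = [0.12, -0.80, 0.33, 0.05, -0.41]  # made-up sample weights
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", q)
print(f"max round-trip error: {max_error:.5f}")
```

Each weight now needs one byte plus a share of the scale, and the rounding error is bounded by half a quantization step.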


In [None]:
prompt = "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:"
prompt

'Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:'

In [None]:
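def generate(prompt):
    # Tokenize the prompt and move the tensors to the device the model lives on.
    tokenized_text = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Sample up to 100 new tokens, stopping early at the end-of-sequence token.
    output = model.generate(**tokenized_text, eos_token_id=tokenizer.eos_token_id, do_sample=True, max_new_tokens=100)
    # Decode the first (and only) sequence back to text, dropping special tokens.
    result = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return result


result = generate(prompt)
print(result)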

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.
Daniel: Hello, Girafatron!
Girafatron: Hi!
Daniel: What brings you here? What's going on with you? How are you feeling?
Girafatron: Oh, not so good, Daniel.
Daniel: What's the matter?
Girafatron: My body is really hurting.
Daniel: I'm sorry, my dear. What sort of pain are you feeling?
Girafatron: Well, its my right leg, it just feels like... it
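Because `generate` passes `do_sample=True`, each run draws tokens at random from the model's predicted distribution, so repeated runs (and the 8-bit and 4-bit models) produce different completions for reasons beyond quantization. The `max_new_tokens=100` limit is why the text stops mid-sentence. The toy sketch below contrasts greedy and sampled token choice; the logits and helper names are made up, and the real `generate` applies further settings from the model's generation config.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(logits):
    """Always choose the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def pick_sampled(logits, rng):
    """Choose a token at random, weighted by its probability."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.5, 0.3, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)
print("greedy :", [pick_greedy(toy_logits) for _ in range(8)])
print("sampled:", [pick_sampled(toy_logits, rng) for _ in range(8)])
```

Greedy choice repeats the same token index every time, while sampling spreads over the likelier tokens.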


### Using 4bit Precision

Model Card: https://huggingface.co/tiiuae/falcon-7b

In [None]:
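from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,               # store weights in 4 bits
    bnb_4bit_quant_type="nf4",       # NormalFloat4 data type instead of plain fp4
    bnb_4bit_use_double_quant=True   # also quantize the quantization constants
)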
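Block-wise quantization stores one scaling constant per block of weights, and `bnb_4bit_use_double_quant=True` shrinks that overhead by quantizing the constants themselves. The sketch below works through the arithmetic under assumed block sizes (64 weights per block, 256 constants per second-level block), which are the values described in the QLoRA paper and may not match every bitsandbytes version. The helper name is made up.

```python
def constant_overhead_bits(block_size=64, double_quant=False, second_block_size=256):
    """Extra bits per parameter spent on quantization constants."""
    if not double_quant:
        # One fp32 constant shared by each block of weights.
        return 32 / block_size
    # One 8-bit constant per block, plus one fp32 constant
    # shared by each group of `second_block_size` constants.
    return 8 / block_size + 32 / (block_size * second_block_size)


NUM_PARAMS = 6_921_720_704  # parameter count printed in this notebook

for dq in (False, True):
    bits = 4 + constant_overhead_bits(double_quant=dq)
    gb = NUM_PARAMS * bits / 8 / 1e9
    print(f"double_quant={dq}: {bits:.3f} bits/param, {gb:.2f} GB if every weight were 4-bit")
```

Under these assumptions the saving is a little under 0.4 bits per parameter, which adds up to a few hundred megabytes on a 7B model.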

In [None]:
from transformers import AutoModelForCausalLM

four_bit_model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", quantization_config=config)

`low_cpu_mem_usage` was None, now set to True since model is quantized.



In [None]:
gbs = four_bit_model.get_memory_footprint() / 1e9

print(f"Number of parameters: {four_bit_model.num_parameters()}")
print(f"Memory footprint if FP32: {(four_bit_model.num_parameters()*4)/1e9} GB")
print(f"Memory footprint of the model with 4bit quantization: {gbs:.2f} GB")

Number of parameters: 6921720704
Memory footprint if FP32: 27.686882816 GB
Memory footprint of the model with 4bit quantization: 3.92 GB
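To put the reported footprints side by side, the sketch below computes the compression ratio against FP32 and the effective bits per parameter for each model. All inputs are copied from the output printed in this notebook; the helper names are made up.

```python
NUM_PARAMS = 6_921_720_704
FP32_GB = NUM_PARAMS * 4 / 1e9  # matches "Memory footprint if FP32"

# Footprints reported by get_memory_footprint() in this notebook.
reported_gb = {"8-bit": 7.23, "4-bit (nf4, double quant)": 3.92}


def compression_ratio(gb):
    """How many times smaller than the FP32 footprint."""
    return FP32_GB / gb


def effective_bits(gb):
    """Average bits actually stored per parameter."""
    return gb * 1e9 * 8 / NUM_PARAMS


for name, gb in reported_gb.items():
    print(f"{name}: {compression_ratio(gb):.1f}x smaller, {effective_bits(gb):.2f} bits/param")
```

The effective bits per parameter land above the nominal 8 and 4, which suggests that not every tensor in the model is stored at the quantized width.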


In [None]:
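# Redefine generate() so that it uses the 4-bit model.
def generate(prompt):
    # Tokenize the prompt and move the tensors to the device the model lives on.
    tokenized_text = tokenizer(prompt, return_tensors="pt").to(four_bit_model.device)
    # Sample up to 100 new tokens, stopping early at the end-of-sequence token.
    output = four_bit_model.generate(**tokenized_text, eos_token_id=tokenizer.eos_token_id, do_sample=True, max_new_tokens=100)
    # Decode the first (and only) sequence back to text, dropping special tokens.
    result = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return result


result = generate(prompt)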

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


In [None]:
print(result)

Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.
Daniel: Hello, Girafatron!
Girafatron: Hi, Daniel! Girafatron is pleased to be in today's question time. Girafatron has been observing you and your questions, and I know I will answer your question with a great and glorious response!
Daniel: You know, I think it's about time. Everyone here is just dying to know what you will say this question time!
Girafatron: Girafatron answers by answering questions?
Daniel: Yes! The questions that have


## We were able to fit 2 models into memory!
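A quick check of that claim from the numbers printed earlier: the 8-bit and 4-bit footprints together are smaller than a single 16-bit copy would be. The 15 GB budget below is an assumption (roughly what a 16 GB card leaves usable), not a value measured in this notebook.

```python
# Footprints reported by get_memory_footprint() in this notebook.
footprints_gb = {"8-bit model": 7.23, "4-bit model": 3.92}

GPU_BUDGET_GB = 15.0                      # assumption: usable memory on a 16 GB card
BF16_GB = 6_921_720_704 * 2 / 1e9         # one unquantized 16-bit copy, for comparison

total = sum(footprints_gb.values())
print(f"both quantized models: {total:.2f} GB")
print(f"one bf16 model:        {BF16_GB:.2f} GB")
print("fits in budget" if total <= GPU_BUDGET_GB else "does not fit in budget")
```

On a live runtime, `torch.cuda.memory_allocated()` can confirm the actual usage, which will be somewhat higher once generation allocates activations.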
