# What is Quantization?

 - Quantization reduces the numerical precision of a model's parameters in order to cut memory usage and computation cost.
 - Instead of storing weights as 32-bit floating-point numbers (float32), a quantized model stores them in a lower-precision format such as 16-bit floating-point (float16), 8-bit integers (int8), or a 4-bit format.
 - This can shrink the model substantially (going from float32 to int8 cuts weight storage by roughly 4x) and speed up inference while keeping accuracy at an acceptable level.
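
To make the idea concrete, here is a minimal sketch of symmetric "absmax" int8 quantization in plain Python. It is only an illustration of the principle, not what bitsandbytes does internally; the helper names `quantize_int8` and `dequantize_int8` and the sample weights are made up for this example.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale factor for the whole list; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# The rounding error per weight is at most half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("quantized:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight now needs one byte plus a shared scale factor, and the round-trip error is bounded by half a quantization step, which is why accuracy usually degrades only slightly.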

# Why Use Quantization for LLMs?

 - Enables the deployment of large-scale models on devices with constrained hardware resources.
 - Keeps inference fast while typically incurring only a small loss in accuracy.
 - Enhances the ability to run LLMs on mobile devices and in real-time applications.
 - Offers a practical option for users running models on resource-limited cloud platforms such as Google Colab.
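
A rough back-of-the-envelope calculation shows why this matters. The sketch below estimates the memory needed for the weights alone at different precisions. The helper `weight_memory_gb` is hypothetical, the figure of 8 billion parameters is an assumed round number for an 8B model, and the estimate ignores activations, the KV cache, and quantization metadata.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: round figure for an "8B" model

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>8}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, 4-bit weights take about an eighth of the float32 footprint, which is the difference between not fitting and fitting on a single consumer or free-tier GPU.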

In [None]:
!pip install transformers einops accelerate bitsandbytes torch

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import getpass
import os

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

In [None]:
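# The Llama 3 weights are gated on the Hugging Face Hub, so an access token is required.
os.environ["HF_TOKEN"] = getpass.getpass("Hugging Face token: ")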

# Quantization

In [None]:
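quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # use the NormalFloat4 (NF4) data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the actual computation in bfloat16
)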

### Loading Model

In [None]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

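# device_map="auto" lets accelerate place the quantized weights on the available GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quantization_config, device_map="auto"
)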
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
prompt = "Who was the first person in space?"
messages = [{"role": "user", "content": prompt}]

In [None]:
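# Format the conversation with the model's chat template. add_generation_prompt=True
# appends the assistant header so the model answers rather than continuing the user turn.
model_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)  # keep the inputs on the same device as the model
generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True,
                               pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)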


In [None]:
res = decoded[0]
res