<a href="https://colab.research.google.com/github/rhiga2/AIHawaii/blob/main/workshops/transformers_and_gpt/QuantizationAndDistillation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cheaper, Faster Machine Learning with Quantization and Distillation


In [1]:
!pip install autoawq;
!pip install mistralai;



In [2]:
# Make sure that the runtime has a GPU (I chose T4)
import torch

if torch.cuda.is_available():
    print("CUDA is available")
    print("Number of GPUs:", torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        print("Device", i, ":", torch.cuda.get_device_name(i))
else:
    print("CUDA is not available")

CUDA is available
Number of GPUs: 1
Device 0 : Tesla T4


# Quantization
Quantization refers to techniques that reduce the precision of parameters and operations in a deep learning model.

## Why quantize?
1. Lower memory footprint. Each parameter in the network is represented by fewer bits. GPU memory is expensive!
2. Faster inference. Lower-precision operations are sometimes faster (highly hardware dependent!).
3. Can run on cheaper, less energy-demanding hardware.

## What are you trading off?
The quantized model usually performs slightly worse than the full-precision model. See [quantized LLM experiments](https://neuralmagic.com/blog/we-ran-over-half-a-million-evaluations-on-quantized-llms-heres-what-we-found/).

## Floating point and fixed point basics
* WXAY-(INT or FLOAT): weights are quantized to X bits and activations to Y bits. The representation is either integer or floating point. For example, W4A16-INT means 4-bit integer weights and 16-bit activations.
* INT = fixed-point representation.

In [3]:
import torch

def float_to_fixed(tensor, fractional_bits):
    scale = 2 ** fractional_bits
    # Clamp to the int8 range before casting: casting first does not saturate,
    # so out-of-range values could wrap around instead of being clipped
    min_val = -(2 ** (8 - 1))
    max_val = (2 ** (8 - 1)) - 1
    fixed_tensor = torch.clamp((tensor * scale).round(), min_val, max_val)

    return fixed_tensor.to(torch.int8)

def fixed_to_float(tensor, fractional_bits):
    scale = 2 ** fractional_bits
    float_tensor = tensor.float() / scale
    return float_tensor

A = torch.randn(5, 5, dtype=torch.float32)
print(A)

quantized_A = float_to_fixed(A, 4)
print(quantized_A)
reconstructed_A = fixed_to_float(quantized_A, 4)
print(reconstructed_A)


tensor([[ 0.1435, -0.1650, -0.5434, -0.9724, -2.3872],
        [ 0.4304, -0.8603, -0.1000, -1.3351, -0.4415],
        [-0.6909, -0.8130, -1.6561,  1.8352, -0.1478],
        [ 1.0231,  2.2060,  0.5524,  1.0452, -0.0205],
        [ 0.9922,  0.3053, -0.1902, -0.6974, -0.9759]])
tensor([[  2,  -3,  -9, -16, -38],
        [  7, -14,  -2, -21,  -7],
        [-11, -13, -26,  29,  -2],
        [ 16,  35,   9,  17,   0],
        [ 16,   5,  -3, -11, -16]], dtype=torch.int8)
tensor([[ 0.1250, -0.1875, -0.5625, -1.0000, -2.3750],
        [ 0.4375, -0.8750, -0.1250, -1.3125, -0.4375],
        [-0.6875, -0.8125, -1.6250,  1.8125, -0.1250],
        [ 1.0000,  2.1875,  0.5625,  1.0625,  0.0000],
        [ 1.0000,  0.3125, -0.1875, -0.6875, -1.0000]])
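The precision and range of this scheme follow directly from the number of fractional bits. The sketch below, in plain Python with no dependencies, works out the step size, worst-case rounding error, and representable range for the 8-bit, 4-fractional-bit setting used above. The helper name `fixed_point_stats` is hypothetical and introduced only for illustration.

```python
def fixed_point_stats(total_bits: int, fractional_bits: int) -> dict:
    """Summarize a signed fixed-point format (hypothetical helper, for illustration)."""
    step = 2.0 ** -fractional_bits      # spacing between representable values
    q_min = -(2 ** (total_bits - 1))    # smallest stored integer
    q_max = 2 ** (total_bits - 1) - 1   # largest stored integer
    return {
        "step": step,
        # Round-to-nearest is off by at most half a step for in-range inputs
        "max_rounding_error": step / 2,
        "min_value": q_min * step,
        "max_value": q_max * step,
    }


stats = fixed_point_stats(total_bits=8, fractional_bits=4)
print(stats)
```

With 4 fractional bits, every reconstructed value is a multiple of 0.0625 and differs from the original by at most 0.03125, as long as the original lies within [-8, 7.9375]. Values outside that range are clipped, so adding fractional bits buys precision at the cost of range.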


## Quantization Techniques
* Quantization-aware training: During training, quantization is simulated in the forward pass, which allows the network to adapt to it. Both weights and activations are quantized throughout the network before inference. This usually results in higher accuracy than post-training methods.
* Static post-training quantization: Training occurs in full precision. Weights and activations are quantized before inference.
* Dynamic post-training quantization: Weights are quantized before inference, but activations are quantized at runtime.
* In this lab, we will focus on a type of post-training quantization: activation-aware weight quantization (AWQ). In the simplified picture used here, AWQ keeps about 1% of the weights (the ones that matter most for the activations) at full precision and quantizes the remaining 99% to 4 bits. See the vLLM documentation for the autoawq library: https://docs.vllm.ai/en/latest/features/quantization/auto_awq.html.
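The difference between static and dynamic quantization comes down to when the scale is chosen. The following is a minimal sketch in plain Python, not the API of any quantization library: the function names, the symmetric absmax scheme, and the example activations are all assumptions made for illustration.

```python
def absmax_scale(values, num_bits=8):
    """Pick the scale so the largest magnitude maps to the largest integer."""
    q_max = 2 ** (num_bits - 1) - 1
    return max(abs(v) for v in values) / q_max


def quantize_dequantize(values, scale, num_bits=8):
    """Symmetric fake quantization: round to the integer grid, clamp, map back."""
    q_max = 2 ** (num_bits - 1) - 1
    q_min = -q_max - 1
    out = []
    for v in values:
        q = round(v / scale)             # nearest integer on the grid
        q = max(q_min, min(q_max, q))    # saturate instead of wrapping
        out.append(q * scale)
    return out


# Hypothetical activations: the runtime batch contains a value (4.0)
# larger than anything seen during calibration.
calibration_activations = [0.5, -1.0, 0.25, 2.0]
runtime_activations = [0.1, -0.3, 4.0, 0.7]

# Static: the scale is fixed ahead of time from calibration data.
static_scale = absmax_scale(calibration_activations)
static_out = quantize_dequantize(runtime_activations, static_scale)

# Dynamic: the scale is recomputed from the activations seen at runtime.
dynamic_scale = absmax_scale(runtime_activations)
dynamic_out = quantize_dequantize(runtime_activations, dynamic_scale)

print("static: ", static_out)
print("dynamic:", dynamic_out)
```

In this toy setup the static scale clips the outlier 4.0 to about 2.0, while the dynamic scale represents it at the cost of a coarser grid for the small values and extra work at runtime. Quantization-aware training typically applies a quantize-then-dequantize step like `quantize_dequantize` in the forward pass so the network sees the rounding error while it trains.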

In [4]:
# Estimate memory. These are lower bounds because we are only estimating weights
def estimate_memory_gb(num_bits_per_weight: int, num_weights_billions: float):
    num_bytes_per_weight = num_bits_per_weight / 8
    total_bytes = num_weights_billions * 1e9 * num_bytes_per_weight
    total_gb = total_bytes / 2**30
    return total_gb

def estimate_awq(full_precision_bits: int,
                 quantized_bits: int,
                 num_weights_billions: float,
                 fraction_full_precision: float):
    # Weights that are quantized to the low-bit format
    quantized_mem = estimate_memory_gb(
        quantized_bits,
        (1 - fraction_full_precision) * num_weights_billions
    )
    # Weights that are kept at full precision
    full_precision_mem = estimate_memory_gb(
        full_precision_bits,
        fraction_full_precision * num_weights_billions
    )
    return quantized_mem + full_precision_mem


print("Estimate memory of mistral-8B: ",
      round(estimate_memory_gb(32, 8), 2), "GB")
print("Estimate memory of mistral-8B_AWQ: ",
      round(estimate_awq(32, 4, 8, 0.01), 2), "GB")

Estimate memory of mistral-8B:  29.8 GB
Estimate memory of mistral-8B_AWQ:  3.99 GB


In [5]:
messages = [
    {
        "role": "user",
        "content": "What is the best French cheese?",
    },
]

In [6]:
from google.colab import userdata
from mistralai import Mistral
api_key = userdata.get('MISTRAL_API_KEY')
model = "ministral-8b-latest"

client = Mistral(api_key=api_key)

chat_response = client.chat.complete(
    model = model,
    messages = messages,
    max_tokens = 500,
)

print(chat_response.choices[0].message.content)

Choosing the "best" French cheese can be quite subjective as it depends on personal preferences, but here are a few highly regarded French cheeses that are often considered among the best:

1. **Roquefort**: This is a sheep's milk cheese with a distinctive blue mold. It has a strong, pungent flavor and is often served with fruit or nuts.

2. **Brie de Meaux**: This is a soft, creamy cheese made from cow's milk. It has a mild, buttery flavor and is often served with crackers or bread.

3. **Camembert de Normandie**: Another soft cheese, Camembert has a creamy texture and a mild, slightly tangy flavor. It's often served with bread or crackers.

4. **Comté**: This is a hard cheese made from cow's milk. It has a nutty, slightly sweet flavor and is often used in cooking.

5. **Munster**: This is a soft cheese with a strong, pungent flavor. It's often served with bread or crackers.

6. **Reblochon**: This is a soft cheese with a creamy texture and a mild, slightly nutty flavor. It's often se

In [25]:
# Run a quantized model with AWQ
# A demo that quantizes a comparable model was dropped because the free Colab tier cannot run it.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-v0.1-AWQ")
model = AutoModelForCausalLM.from_pretrained("TheBloke/Mistral-7B-v0.1-AWQ",
                                             device_map="cuda:0")

tokenized = tokenizer(messages[0]["content"], return_tensors="pt").to("cuda:0")
output = model.generate(**tokenized, max_new_tokens=200)
print(tokenizer.batch_decode(output)[0])


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> What is the best French cheese?

- The 10 Best French Cheeses
- Brie.
- Camembert.
- Roquefort.
- Comté.
- Saint-Nectaire.
- Cantal.
- Brillat-Savarin.

## What is the most popular cheese in France?

Brie is the most popular cheese in France.

## What is the most famous cheese in the world?

The 10 Most Popular Cheeses in the World

- Brie.
- Camembert.
- Cheddar.
- Feta.
- Gouda.
- Mozzarella.
- Parmesan.
- Roquefort.

## What is the most famous French cheese?

Brie is the most popular cheese in France.

## What is the most popular cheese in France 2020?



# Knowledge Distillation
Knowledge distillation means training one model (the student) on the outputs of another model (the teacher).

## Why distill?
* The most common use case is model compression. With distillation, a smaller model can often achieve performance similar to a larger model.
* Another common use case is model simplification. This aims to make the student model simpler than the teacher model, allowing us to run the student model on cheaper hardware.
* Mimicking a closed-source model with an open-source one (may not be legal). [DeepSeek is accused of using this technique with ChatGPT](https://theconversation.com/openai-says-deepseek-inappropriately-copied-chatgpt-but-its-facing-copyright-claims-too-248863).




## Distillation Techniques
* LLMs output a distribution over the next token in the sequence. The student is trained on the teacher's distribution, softened with an increased temperature. In this case, the loss is usually the KL divergence between the student and teacher distributions.
* We can also take a hidden layer of the transformer and train the student with the loss 1 - cosine similarity between the student and teacher hidden states.
* Both techniques are glass-box approaches to distillation, since they need the teacher's output distribution or hidden states. Black-box distillation is not covered in detail here.
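Both losses above can be written in a few lines. The following is a minimal sketch in plain Python under stated assumptions: the logits and hidden vectors are made-up numbers, the KL direction KL(teacher || student) is one common choice rather than the only one, and the student and teacher hidden states are assumed to have the same dimension.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q has no zero entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical next-token logits over a 3-token vocabulary
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 1.5, 0.2]
temperature = 2.0

teacher_probs = softmax_with_temperature(teacher_logits, temperature)
student_probs = softmax_with_temperature(student_logits, temperature)
logit_loss = kl_divergence(teacher_probs, student_probs)

# Hypothetical hidden states taken from one layer of each model
teacher_hidden = [0.2, -1.0, 0.7]
student_hidden = [0.1, -0.8, 0.9]
hidden_loss = cosine_distance(teacher_hidden, student_hidden)

print("KL loss:", logit_loss)
print("Cosine loss:", hidden_loss)
```

Raising the temperature spreads probability mass onto the less likely tokens, so the student also learns how the teacher ranks the alternatives rather than only its top choice. In a real training loop these losses would be computed on tensors and backpropagated through the student only.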