# Quantized inference example

Gazelle natively supports quantization through Hugging Face Transformers. This notebook demonstrates loading the model with 8-bit and 4-bit configurations via bitsandbytes.

Half precision (i.e., bfloat16) requires 24GB of VRAM. However, the 8-bit and 4-bit variants can be loaded and run for inference on a free T4 instance in Colab without much trouble!
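As a rough sanity check on these numbers, the weight footprint can be estimated from the checkpoint size alone. The sketch below takes the three shard sizes reported in the download log further down this notebook and scales them to other weight widths. It assumes the checkpoint stores every weight as a 16-bit value, which the notebook does not state explicitly, and `estimate_weight_gb` is a hypothetical helper name used only for illustration.

```python
# Shard sizes in decimal GB, copied from the download log later in this notebook.
SHARD_SIZES_GB = [4.96, 4.92, 4.86]

# Assumption: the checkpoint stores every weight as a 16-bit value.
CHECKPOINT_BITS = 16


def estimate_weight_gb(bits_per_weight: float) -> float:
    """Scale the on-disk checkpoint size to a different weight width."""
    checkpoint_gb = sum(SHARD_SIZES_GB)
    return checkpoint_gb * bits_per_weight / CHECKPOINT_BITS


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{estimate_weight_gb(bits):.2f} GB")
```

These figures are lower bounds for the weights only. Activations, the KV cache, quantization constants, and any modules left in higher precision all add to the totals reported by `torch.cuda.memory_allocated` later on.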

Try it yourself [on Colab](https://colab.research.google.com/drive/1wZN7LNebPf75s2jsJHhXtF0LnoNO7orx?usp=sharing)!

### 8-bit quantization

This configuration uses ~15.3GB of VRAM. For shorter sequences, it should fit on a card with 16GB of VRAM (e.g., T4, V100).
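Sequence length matters because the key/value cache grows linearly with the number of tokens, and audio frames count toward that length. The sketch below estimates the cache size for a decoder. The default layer count, KV head count, and head dimension are hypothetical values typical of a 7B-class model with grouped-query attention; they are assumptions, not figures taken from the Gazelle configuration, so read the real ones from `config` before relying on the result.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int = 32,        # hypothetical: number of decoder layers
    n_kv_heads: int = 8,       # hypothetical: key/value heads per layer
    head_dim: int = 128,       # hypothetical: dimension of each head
    bytes_per_value: int = 2,  # 16-bit cache entries
    batch_size: int = 1,
) -> int:
    """Return the approximate KV cache size in bytes for one forward pass."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


for seq_len in (512, 2048, 8192):
    gib = kv_cache_bytes(seq_len) / 1024**3
    print(f"{seq_len:>5} tokens: {gib:.3f} GiB of KV cache")
```

Under these assumed dimensions the cache costs 128 KiB per token, so a few thousand tokens can decide whether the model fits when the weights already occupy most of a 16GB card.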

In [None]:
!pip install git+https://github.com/tincans-ai/gazelle.git --no-deps

Collecting git+https://github.com/tincans-ai/gazelle.git
  Cloning https://github.com/tincans-ai/gazelle.git to /tmp/pip-req-build-7m03ati_
  Running command git clone --filter=blob:none --quiet https://github.com/tincans-ai/gazelle.git /tmp/pip-req-build-7m03ati_
  Resolved https://github.com/tincans-ai/gazelle.git to commit 5fba88477868dc9a3438d1b19df67cc15c94c2aa
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: gazelle
  Building wheel for gazelle (pyproject.toml) ... done
  Created wheel for gazelle: filename=gazelle-0.1.0-py2.py3-none-any.whl size=17297 sha256=f43df71547f707a51731be934178fbcd8c86af6c944c230dbf18502205f60858
  Stored in directory: /tmp/pip-ephem-wheel-cache-_9ln995n/wheels/84/79/80/5eac2b07d2c1e1d8234ca584bfcc87e79af9e6829fe6d882cc
Successfully built gazelle
Installing collected packages: gazel

In [None]:
!pip install accelerate bitsandbytes --no-deps

In [None]:
import torch
import transformers
from transformers import BitsAndBytesConfig

from gazelle import (
    GazelleConfig,
    GazelleForConditionalGeneration,
    GazelleProcessor,
)

model_id = "tincans-ai/gazelle-v0.2"
config = GazelleConfig.from_pretrained(model_id)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

audio_processor = transformers.Wav2Vec2Processor.from_pretrained(
    "facebook/wav2vec2-base-960h"
)

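def inference_collator(audio_input, prompt="Transcribe the following \n<|audio|>", audio_dtype=torch.float16):
    # Convert the raw waveform (expected at 16 kHz) into normalized input values.
    audio_values = audio_processor(
        audio=audio_input, return_tensors="pt", sampling_rate=16000
    ).input_values
    msgs = [
        {"role": "user", "content": prompt},
    ]
    # Tokenize the prompt with the chat template and the generation prefix appended.
    input_ids = tokenizer.apply_chat_template(
        msgs, return_tensors="pt", add_generation_prompt=True
    )
    # Move both inputs to the GPU; the audio is cast to the requested dtype.
    return {
        "audio_values": audio_values.squeeze(0).to("cuda").to(audio_dtype),
        "input_ids": input_ids.to("cuda"),
    }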

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers


In [None]:
import torchaudio
import requests
from IPython.display import Audio

remote_file_url = "https://r2proxy.tincans.ai/test6.wav"
local_file_path = "test6.wav"

response = requests.get(remote_file_url, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed

with open(local_file_path, "wb") as file:
    file.write(response.content)

test_audio, sr = torchaudio.load(local_file_path)

if sr != 16000:
    test_audio = torchaudio.transforms.Resample(sr, 16000)(test_audio)

In [None]:
quantization_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
)

model = GazelleForConditionalGeneration.from_pretrained(
    model_id,
    device_map="cuda:0",
    quantization_config=quantization_config_8bit,
)

model.safetensors.index.json:   0%|          | 0.00/49.2k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.86G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

In [None]:
inputs = inference_collator(test_audio, "<|audio|>")
tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'<s> [INST] <|audio|>  [/INST]My greatest accomplishment is being able to help people.</s>'

In [None]:
print("torch.cuda.memory_allocated: %fGB"%(torch.cuda.memory_allocated(0)/1024/1024/1024))
print("torch.cuda.memory_reserved: %fGB"%(torch.cuda.memory_reserved(0)/1024/1024/1024))
print("torch.cuda.max_memory_reserved: %fGB"%(torch.cuda.max_memory_reserved(0)/1024/1024/1024))

torch.cuda.memory_allocated: 7.774353GB
torch.cuda.memory_reserved: 10.361328GB
torch.cuda.max_memory_reserved: 10.361328GB


In [None]:
del model

torch.cuda.empty_cache()

### 4-bit quantization

This setup uses 4-bit weights with bfloat16 compute and needs just ~9.4GB of VRAM.

In [None]:
quantization_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)


model = GazelleForConditionalGeneration.from_pretrained(
    model_id,
    device_map="cuda:0",
    quantization_config=quantization_config_4bit,
)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
inputs = inference_collator(test_audio, "<|audio|>")
tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'<s> [INST] <|audio|>  [/INST]My greatest accomplishment is being able to raise my children to be good people.</s>'

In [None]:
print("torch.cuda.memory_allocated: %fGB"%(torch.cuda.memory_allocated(0)/1024/1024/1024))
print("torch.cuda.memory_reserved: %fGB"%(torch.cuda.memory_reserved(0)/1024/1024/1024))
print("torch.cuda.max_memory_reserved: %fGB"%(torch.cuda.max_memory_reserved(0)/1024/1024/1024))

torch.cuda.memory_allocated: 4.763987GB
torch.cuda.memory_reserved: 5.265625GB
torch.cuda.max_memory_reserved: 10.361328GB
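Note that `max_memory_reserved` above still reads 10.361328GB, the same value as in the 8-bit run. PyTorch tracks this peak from the start of the process unless it is reset, so the figure reflects the earlier 8-bit model rather than the 4-bit one. A minimal sketch of helpers that reset the peak between runs follows; `bytes_to_gib`, `report_cuda_memory`, and `release_cuda_memory` are hypothetical names, while `torch.cuda.reset_peak_memory_stats` is the PyTorch call that does the reset.

```python
import gc

GIB = 1024**3


def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (the unit used by the prints in this notebook)."""
    return num_bytes / GIB


def report_cuda_memory(device: int = 0) -> None:
    """Print current and peak CUDA memory for one device."""
    import torch  # imported lazily so bytes_to_gib works without torch installed

    print(f"memory_allocated:     {bytes_to_gib(torch.cuda.memory_allocated(device)):.3f} GiB")
    print(f"memory_reserved:      {bytes_to_gib(torch.cuda.memory_reserved(device)):.3f} GiB")
    print(f"max_memory_reserved:  {bytes_to_gib(torch.cuda.max_memory_reserved(device)):.3f} GiB")


def release_cuda_memory(device: int = 0) -> None:
    """Call after `del model` to free cached blocks and restart peak tracking."""
    import torch

    gc.collect()                                # drop lingering Python references
    torch.cuda.empty_cache()                    # return cached blocks to the driver
    torch.cuda.reset_peak_memory_stats(device)  # restart peak tracking from here
```

Calling `release_cuda_memory()` right after `del model` and `report_cuda_memory()` after the next generation should make the peak figure describe only the configuration being measured.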


In [None]:
del model

torch.cuda.empty_cache()

You can also enable double (nested) quantization, described in the [Hugging Face blog post on 4-bit quantization](https://huggingface.co/blog/4bit-transformers-bitsandbytes#nested-quantization).

This further reduces VRAM usage to under 9GB!
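The saving comes from the quantization constants rather than the weights themselves. The arithmetic below is a sketch that follows the scheme described in the QLoRA paper: one 32-bit constant per block of 64 weights, which double quantization replaces with an 8-bit constant plus a second-level 32-bit constant shared across 256 blocks. Those block sizes are assumed to match what bitsandbytes does here, and `N_PARAMS` is a hypothetical round number, not a measured parameter count for Gazelle.

```python
def constant_overhead_bits(block_size: int = 64, constant_bits: int = 32) -> float:
    """Per-weight overhead of storing one full-precision constant per block."""
    return constant_bits / block_size


def double_quant_overhead_bits(
    block_size: int = 64,
    quantized_constant_bits: int = 8,
    second_block_size: int = 256,
    second_constant_bits: int = 32,
) -> float:
    """Per-weight overhead once the constants themselves are quantized."""
    first_level = quantized_constant_bits / block_size
    second_level = second_constant_bits / (block_size * second_block_size)
    return first_level + second_level


saved_bits = constant_overhead_bits() - double_quant_overhead_bits()

N_PARAMS = 7e9  # hypothetical number of quantized weights
saved_gib = saved_bits * N_PARAMS / 8 / 1024**3

print(f"overhead saved per weight: {saved_bits:.3f} bits")
print(f"estimated saving for {N_PARAMS:.0e} weights: {saved_gib:.2f} GiB")
```

For comparison, the measured `memory_allocated` in this notebook drops from 4.763987GB to 4.437950GB when double quantization is enabled, a difference of roughly 0.33GB, which is in the same range as this estimate.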

In [None]:
quantization_config_4bit_dq = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)


model = GazelleForConditionalGeneration.from_pretrained(
    model_id,
    device_map="cuda:0",
    quantization_config=quantization_config_4bit_dq,
)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
inputs = inference_collator(test_audio, "<|audio|>")
tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'<s> [INST] <|audio|>  [/INST]My greatest accomplishment is being able to raise my children to be good people.</s>'

In [None]:
print("torch.cuda.memory_allocated: %fGB"%(torch.cuda.memory_allocated(0)/1024/1024/1024))
print("torch.cuda.memory_reserved: %fGB"%(torch.cuda.memory_reserved(0)/1024/1024/1024))
print("torch.cuda.max_memory_reserved: %fGB"%(torch.cuda.max_memory_reserved(0)/1024/1024/1024))

torch.cuda.memory_allocated: 4.437950GB
torch.cuda.memory_reserved: 4.937500GB
torch.cuda.max_memory_reserved: 10.361328GB
