## VPTQ inference example

<a target="_blank" href="https://colab.research.google.com/github/microsoft/VPTQ/blob/main/notebooks/vptq_example.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Install VPTQ package and requirements
The latest versions of transformers and accelerate are required.

In [1]:
%%capture
!pip install -U vptq
!pip install -U transformers accelerate
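After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of VPTQ.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the packages this notebook depends on.
for name, version in installed_versions(["vptq", "transformers", "accelerate", "torch"]).items():
    print(f"{name}: {version or 'not installed'}")
```

If a package shows as not installed, or a version is older than expected, rerun the install cell and restart the runtime before loading the model.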


## Load model and tokenizer as usual
Note that the T4 GPU does not support bf16, so set `dtype=torch.half` for this model.

Set `device_map='auto'` to load the model onto the GPU when one is available.
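The dtype choice can also be derived from the GPU instead of being hard-coded. The sketch below assumes that native bf16 support starts at CUDA compute capability 8.0 (Ampere), and that a T4 reports 7.5; the helper name `pick_dtype_name` is hypothetical.

```python
def pick_dtype_name(compute_capability):
    """Return "bfloat16" for capability >= 8.0 (assumed bf16-capable), else "float16"."""
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


# A T4 reports compute capability (7, 5); an A100 reports (8, 0).
print(pick_dtype_name((7, 5)))  # float16
print(pick_dtype_name((8, 0)))  # bfloat16

# In the notebook, the result could be turned into a torch dtype with:
#   capability = torch.cuda.get_device_capability(0)
#   dtype = getattr(torch, pick_dtype_name(capability))
```

`torch.cuda.is_bf16_supported()` is another way to ask the same question directly, at the cost of needing torch imported first.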

In [2]:
import vptq
import transformers
import torch

tokenizer = transformers.AutoTokenizer.from_pretrained("VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-0-woft")
m = vptq.AutoModelForCausalLM.from_pretrained("VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-0-woft", device_map='auto', dtype=torch.half)



Successfully loaded VPTQ CUDA kernels.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/7.33k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/106k [00:00<?, ?B/s]

Replacing linear layers...: 100%|██████████| 371/371 [00:00<00:00, 1109.53it/s]


Fetching 11 files:   0%|          | 0/11 [00:00<?, ?it/s]

.gitattributes:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/289 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/964 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/4.02G [00:00<?, ?B/s]

  return torch.load(checkpoint_file, map_location=torch.device("cpu"))


## Inference example with text generation

In [3]:
inputs = tokenizer("Explain: Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Explain: Do Not Go Gentle into That Good Night

The poem “Do Not Go Gentle into That Good Night” by Dylan Thomas is a poem about death. The poem is written in the form of a sonnet, and it is written in the form of a monologue. The poem is written in the form of a monologue, and it is written in the form of a sonnet. The poem is written in the form of a sonnet, and it is written in the form of a monologue. The poem is written in the
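The completion above falls into a loop and repeats the same sentence. One possible mitigation for an instruct-tuned model is to wrap the prompt in chat messages and add a mild repetition penalty. The sketch below only builds the inputs; the helper names and the value `1.1` are illustrative assumptions, not settings recommended by VPTQ.

```python
def build_chat_messages(user_prompt, system_prompt=None):
    """Build the list of role/content dicts that chat templates expect."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


def build_generation_kwargs(eos_token_id, max_new_tokens=100, repetition_penalty=1.1):
    """Collect generate() keyword arguments; the penalty value is an assumption."""
    return {
        "max_new_tokens": max_new_tokens,
        "pad_token_id": eos_token_id,
        "repetition_penalty": repetition_penalty,
    }


messages = build_chat_messages("Explain: Do Not Go Gentle into That Good Night")
print(messages)

# With the notebook's `tokenizer` and `m`, these might be used as:
#   inputs = tokenizer.apply_chat_template(
#       messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
#   ).to("cuda")
#   out = m.generate(**inputs, **build_generation_kwargs(tokenizer.eos_token_id))
```

Whether this removes the repetition for a given quantized checkpoint would need to be checked by running it.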


## Generate tokens in streaming mode

In [4]:
inputs = tokenizer("share me a story,", return_tensors="pt").to("cuda")

streamer = transformers.TextStreamer(tokenizer)
_ = m.generate(**inputs, streamer=streamer, max_new_tokens=60, pad_token_id=tokenizer.eos_token_id)

share me a story, please
Certainly! Here's a short story for you:

### The Forgotten Garden

In the heart of a bustling city, there was a small, forgotten garden hidden behind a row of tall buildings. It was a place where the city's residents often forgot to visit, but it was a sanctuary for


### VPTQ-community/Qwen2.5-32B-Instruct-v8-k256-256-woft
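The suffix of each model ID encodes its quantization settings. The sketch below assumes the naming convention `v<vector length>-k<codebook size>-<residual codebook size>` and estimates the index bitwidth per weight as `log2(k) / v` plus `log2(residual) / v`. Both the convention and the formula are assumptions here, and the estimate ignores codebook storage and other overheads. The helper name `parse_vptq_name` is hypothetical.

```python
import math
import re

# Matches a trailing "-v8-k65536-256-woft" style suffix (assumed convention).
_NAME_PATTERN = re.compile(r"-v(?P<v>\d+)-k(?P<k>\d+)-(?P<res>\d+)(?:-(?P<suffix>\w+))?$")


def parse_vptq_name(model_id):
    """Extract vector length, codebook sizes, and an estimated index bitwidth."""
    match = _NAME_PATTERN.search(model_id)
    if match is None:
        raise ValueError(f"unrecognized model name: {model_id}")
    vector_len = int(match["v"])
    codebook = int(match["k"])
    residual = int(match["res"])
    bits = math.log2(codebook) / vector_len
    if residual > 0:
        bits += math.log2(residual) / vector_len
    return {
        "vector_len": vector_len,
        "codebook": codebook,
        "residual": residual,
        "suffix": match["suffix"],
        "index_bits": bits,
    }


for model_id in [
    "VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-0-woft",
    "VPTQ-community/Qwen2.5-32B-Instruct-v8-k256-256-woft",
    "VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v16-k65536-256-woft",
]:
    print(model_id, parse_vptq_name(model_id)["index_bits"])
```

Under these assumptions, the three checkpoints used in this notebook come out at 2.0, 2.0, and 1.5 index bits per weight respectively.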

Working.

In [1]:
import vptq
import transformers
import torch

tokenizer = transformers.AutoTokenizer.from_pretrained("VPTQ-community/Qwen2.5-32B-Instruct-v8-k256-256-woft")
m = vptq.AutoModelForCausalLM.from_pretrained("VPTQ-community/Qwen2.5-32B-Instruct-v8-k256-256-woft", device_map='auto', dtype=torch.half)



Successfully loaded VPTQ CUDA kernels.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/7.33k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/241k [00:00<?, ?B/s]

Replacing linear layers...: 100%|██████████| 839/839 [00:00<00:00, 1273.97it/s]


Fetching 14 files:   0%|          | 0/14 [00:00<?, ?it/s]

README.md:   0%|          | 0.00/965 [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/261k [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/557 [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/1.56G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.39G [00:00<?, ?B/s]

In [2]:
inputs = tokenizer("Explain: Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Explain: Do Not Go Gentle into That Good Night
"Do Not Go Gentle into That Good Night" is a poem by Dylan Thomas, written in 1954. It is a poem about death and the struggle against it. The poem is a plea to resist death and to fight against it. The poem is a plea to resist death and to fight against it. The poem is a plea to resist death and to fight against it. The poem is a plea to resist death and to fight against it. The poem is a plea to resist


In [None]:
import xformers

m.enable_xformers_memory_efficient_attention()

Working!

In [1]:
import os

# 1. Enable expandable memory segments. Set this first so it is in place
#    before torch initializes CUDA:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import vptq
import transformers
import torch
import xformers

# 2. Define device dynamically:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 3. Load tokenizer and model:
model_id = "VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v16-k65536-256-woft"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
m = vptq.AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    dtype=torch.bfloat16,  # bfloat16 needs a GPU with bf16 support (not a T4)
    # output_attentions=False,  # Disable attention outputs if not needed
    # output_hidden_states=False,  # Disable hidden states outputs if not needed
    offload_folder="offload",  # Folder for weights offloaded to disk
)

# Instead of directly calling `enable_xformers_memory_efficient_attention()` on 'm',
# attempt to apply it to the underlying Hugging Face model:
try:
    m.model.enable_xformers_memory_efficient_attention()  # Access the underlying model
except AttributeError:
    print("Warning: Could not enable xformers attention. The model might not be compatible or xformers may not be installed correctly.")

# 4. Disable gradient checkpointing during inference:
m.gradient_checkpointing_disable()

# 5. Move inputs to the selected device:
inputs = tokenizer("share me a story,", return_tensors="pt").to(device)

# 6. Reduce max_new_tokens and use TextStreamer for streaming output:
streamer = transformers.TextStreamer(tokenizer)
_ = m.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=1,  # Further reduce for less memory usage
    pad_token_id=tokenizer.eos_token_id
)

# 7. Clear the cache after generation:
del inputs
torch.cuda.empty_cache()

    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.5.1+cu124)
    Python  3.11.11 (you have 3.11.11)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Successfully loaded VPTQ CUDA kernels.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Replacing linear layers...: 100%|██████████| 1047/1047 [00:01<00:00, 550.33it/s]


Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

  0%|          | 0/738 [00:00<?, ?w/s]

  0%|          | 0/1238 [00:00<?, ?w/s]

  0%|          | 0/1259 [00:00<?, ?w/s]

  0%|          | 0/288 [00:00<?, ?w/s]



<|begin_of_text|>share me a story, and
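The xformers warning in the output above says the installed build targets PyTorch 2.6.0 while 2.5.1 is present. A mismatch like this can be detected before loading a model. The sketch below reads package metadata with the standard library; the helper names `pinned_version` and `versions_match` are hypothetical.

```python
import re
from importlib import metadata


def pinned_version(distribution, dependency):
    """Return the version of `dependency` that `distribution` pins with ==, or None."""
    try:
        requirements = metadata.requires(distribution) or []
    except metadata.PackageNotFoundError:
        return None
    pattern = re.compile(
        rf"^{re.escape(dependency)}\s*\(?\s*==\s*([^\s;,)]+)", re.IGNORECASE
    )
    for requirement in requirements:
        match = pattern.match(requirement)
        if match:
            return match.group(1)
    return None


def versions_match(pinned, installed):
    """Compare a pin with an installed version, ignoring a local tag such as +cu124."""
    if pinned is None:
        return True  # nothing is pinned, so there is nothing to conflict with
    if installed is None:
        return False
    return installed.split("+", 1)[0] == pinned


print(versions_match("2.6.0", "2.5.1+cu124"))  # False: the mismatch from the warning
print(versions_match("2.6.0", "2.6.0+cu124"))  # True
print(pinned_version("xformers", "torch"))  # result depends on the environment
```

When the check fails, reinstalling xformers or torch so that the two agree, then restarting the runtime, is the usual way out.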


In [None]:
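# Note: on its own this assignment has no effect; to take effect the value
# would need to be passed to the model loader.
attn_implementation = "sdpa"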

In [2]:
!pip install torchvision

Collecting torch==2.5.1 (from torchvision)
  Downloading torch-2.5.1-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting triton==3.1.0 (from torch==2.5.1->torchvision)
  Downloading triton-3.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)
Downloading torch-2.5.1-cp311-cp311-manylinux1_x86_64.whl (906.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 906.5/906.5 MB 1.1 MB/s eta 0:00:00
Downloading triton-3.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (209.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.5/209.5 MB 6.5 MB/s eta 0:00:00
Installing collected packages: triton, torch
  Attempting uninstall: triton
    Found existing installation: triton 3.2.0
    Uninstalling triton-3.2.0:
      Successfully uninstalled triton-3.2.0
  Attempting uninstall: torch
    Found existing installation: torch 2.6.0
    Uninstalling torch-2.6.0:


In [3]:
!pip install sentence-transformers



In [2]:
!pip install xformers

Collecting xformers
  Downloading xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting torch==2.6.0 (from xformers)
  Downloading torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting nvidia-cusparselt-cu12==0.6.2 (from torch==2.6.0->xformers)
  Downloading nvidia_cusparselt_cu12-0.6.2-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting triton==3.2.0 (from torch==2.6.0->xformers)
  Downloading triton-3.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Downloading xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl (43.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.4/43.4 MB 11.2 MB/s eta 0:00:00
Downloading torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl (766.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 766.7/766.7 MB 1.5 MB/s eta 0:00:00
Downloading nvidia_cusparselt_cu12-0.6.2-py3

Working.

In [3]:

from transformers import pipeline
import torch
generator = pipeline('text-generation', model="VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-0-woft", device_map='auto', torch_dtype=torch.float16)
generator("What are we having for dinner?")

Device set to use cuda:0


[{'generated_text': 'What are we having for dinner? A chicken and rice dish, a vegetable stir fry, and a dessert. We will have a chicken'}]

Working.

In [3]:
# 5. Move inputs to the selected device:
inputs = tokenizer("share me a story,", return_tensors="pt").to(device)

# 6. Reduce max_new_tokens and use TextStreamer for streaming output:
streamer = transformers.TextStreamer(tokenizer)
_ = m.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=11,  # Further reduce for less memory usage
    pad_token_id=tokenizer.eos_token_id
)

# 7. Clear the cache after generation:
del inputs
torch.cuda.empty_cache()

<|begin_of_text|>share me a story, and I'll share you a story, and we'll


Working.

In [4]:
inputs = tokenizer("write python code for print welcom world", return_tensors="pt").to(device)

# 6. Reduce max_new_tokens and use TextStreamer for streaming output:
streamer = transformers.TextStreamer(tokenizer)
_ = m.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=11,  # Further reduce for less memory usage
    pad_token_id=tokenizer.eos_token_id
)

# 7. Clear the cache after generation:
del inputs
torch.cuda.empty_cache()

<|begin_of_text|>write python code for print welcom world
print("Welcome World")
print("Welcome World")



Working.

### VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v16-k65536-256-woft

In [1]:
import os

# 1. Enable expandable memory segments. Set this first so it is in place
#    before torch initializes CUDA:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import vptq
import transformers
import torch
import xformers

# 2. Define device dynamically:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 3. Load tokenizer and model:
model_id = "VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v16-k65536-256-woft"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
m = vptq.AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    dtype=torch.bfloat16,  # bfloat16 needs a GPU with bf16 support (not a T4)
    # output_attentions=False,  # Disable attention outputs if not needed
    # output_hidden_states=False,  # Disable hidden states outputs if not needed
    offload_folder="offload",  # Folder for weights offloaded to disk
)

# Instead of directly calling `enable_xformers_memory_efficient_attention()` on 'm',
# attempt to apply it to the underlying Hugging Face model:
try:
    m.model.enable_xformers_memory_efficient_attention()  # Access the underlying model
except AttributeError:
    print("Warning: Could not enable xformers attention. The model might not be compatible or xformers may not be installed correctly.")

# 4. Disable gradient checkpointing during inference:
m.gradient_checkpointing_disable()

# 5. Move inputs to the selected device:
inputs = tokenizer("share me a story,", return_tensors="pt").to(device)

# 6. Reduce max_new_tokens and use TextStreamer for streaming output:
streamer = transformers.TextStreamer(tokenizer)
_ = m.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=11,  # Further reduce for less memory usage
    pad_token_id=tokenizer.eos_token_id
)

# 7. Clear the cache after generation:
del inputs
torch.cuda.empty_cache()

    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.5.1+cu124)
    Python  3.11.11 (you have 3.11.11)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Successfully loaded VPTQ CUDA kernels.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Replacing linear layers...: 100%|██████████| 1047/1047 [00:02<00:00, 400.25it/s]


Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

  0%|          | 0/738 [00:00<?, ?w/s]

  0%|          | 0/1238 [00:00<?, ?w/s]

  0%|          | 0/1259 [00:00<?, ?w/s]

  0%|          | 0/288 [00:00<?, ?w/s]



<|begin_of_text|>share me a story, and I'll share you a story, and we'll
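Step 1 of the cell above assigns `PYTORCH_CUDA_ALLOC_CONF` directly, which would overwrite any allocator options already set in the environment. A minimal sketch of a helper that merges the option instead is shown below; the name `set_alloc_conf` is hypothetical, and it assumes the variable holds comma-separated `key:value` pairs.

```python
import os


def set_alloc_conf(option="expandable_segments:True", env=os.environ):
    """Add one option to PYTORCH_CUDA_ALLOC_CONF while keeping unrelated options."""
    existing = [item for item in env.get("PYTORCH_CUDA_ALLOC_CONF", "").split(",") if item]
    key = option.split(":", 1)[0]
    # Drop any earlier value for the same key so the new one wins.
    kept = [item for item in existing if item.split(":", 1)[0] != key]
    env["PYTORCH_CUDA_ALLOC_CONF"] = ",".join(kept + [option])
    return env["PYTORCH_CUDA_ALLOC_CONF"]


print(set_alloc_conf())

# import torch  # import torch only after the variable has been set
```

Calling the helper before torch is imported is the conservative ordering, since the allocator reads the variable when it initializes.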
