<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/Smaller_LLMs_with_AutoRound_Low_bit_Quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


*More details in this article: [Intel AutoRound: Accurate Low-bit Quantization for LLMs](https://newsletter.kaitchup.com/p/intel-autoround-accurate-low-bit)*

This notebook shows how to use Intel AutoRound to quantize LLMs.

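The first section runs AutoRound for Llama 3 8B with different hyperparameters (bit width, symmetric or asymmetric quantization, number of iterations, and sequence length).
The second section evaluates the quantized models with the LM Evaluation Harness on WinoGrande, HellaSwag, and ARC Challenge.
The third section benchmarks the inference throughput with vLLM.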

# Quantization with AutoRound

In [None]:
!pip install auto-round

Collecting auto-round
  Downloading auto_round-0.2-py3-none-any.whl (66 kB)
Collecting accelerate (from auto-round)
  Downloading accelerate-0.31.0-py3-none-any.whl (309 kB)
Collecting auto-gptq (from auto-round)
  Downloading auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.5 MB)
Collecting datasets (from auto-round)
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->auto-round)
  Using cached nvidia

The following code quantizes the model to 4-bit (asymmetric, group size 128) and saves it in the directory `./Llama-3-8B-4bit-AutoRound-GPTQ/`, the path that the evaluation and benchmarking sections load.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the FP16 base model and its tokenizer.
model_name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

# 4-bit weights, quantization parameters shared by groups of 128 weights, asymmetric.
bits, group_size, sym = 4, 128, False
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, batch_size=2, seqlen=512, sym=sym, gradient_accumulate_steps=4, device='cuda')
autoround.quantize()
output_dir = "./Llama-3-8B-4bit-AutoRound-GPTQ/"
autoround.save_quantized(output_dir)  # alternative: save_quantized(output_dir, format="auto_round")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-06-20 06:06:13 INFO autoround.py L464: using torch.float16 for quantization tuning
2024-06-20 06:07:15 INFO autoround.py L1306: quantizing 1/32, model.layers.0
2024-06-20 06:10:04 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000004 -> iter 194: 0.000001
2024-06-20 06:10:09 INFO autoround.py L1306: quantizing 2/32, model.layers.1
2024-06-20 06:12:56 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000076 -> iter 43: 0.000042
2024-06-20 06:13:00 INFO autoround.py L1306: quantizing 3/32, model.layers.2
2024-06-20 06:15:47 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000057 -> iter 198: 0.000036
2024-06-20 06:15:51 INFO autoround.py L1306: quantizing 4/32, model.layers.3
2024-06-20 06:18:38 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000062 -> iter 152: 0.000038
2024-06-20 06:18:42 INFO autoround.py L1306: quantizing 5/32, model.layers.4
2024-06-20 06:21:29 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000081 -> iter 196: 0.000
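
To make the three hyperparameters `bits`, `group_size`, and `sym` concrete, here is a minimal pure-Python sketch of group-wise quantization. It uses plain round-to-nearest with one common textbook convention for the integer grid. It is not AutoRound's algorithm (AutoRound tunes the rounding, as the `loss iter ...` log lines show) and not necessarily the exact grid used by GPTQ kernels. The function names are hypothetical.

```python
import random


def quantize_group(weights, bits, sym):
    """Quantize one group with round-to-nearest and return the dequantized values."""
    if bits < 2:
        raise ValueError("this sketch needs bits >= 2")
    if sym:
        # Symmetric: integer grid centered on zero, so only a scale is needed.
        q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        max_abs = max(abs(w) for w in weights)
        if max_abs == 0:
            return list(weights)  # all-zero group, nothing to quantize
        scale = max_abs / q_max
        zero = 0
    else:
        # Asymmetric: the grid spans [min, max] of the group, shifted by a zero point.
        q_min, q_max = 0, 2 ** bits - 1
        w_min, w_max = min(weights), max(weights)
        if w_max == w_min:
            return list(weights)  # constant group, nothing to quantize
        scale = (w_max - w_min) / q_max
        zero = round(-w_min / scale)
    quantized = [min(q_max, max(q_min, round(w / scale) + zero)) for w in weights]
    return [(q - zero) * scale for q in quantized]


def quantize(weights, bits, group_size, sym):
    """Quantize a flat list of weights, one scale (and zero point) per group."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(quantize_group(weights[start:start + group_size], bits, sym))
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
    for bits in (4, 2):
        for sym in (False, True):
            error = mean_squared_error(weights, quantize(weights, bits, 128, sym))
            print(f"bits={bits} sym={sym} mse={error:.3e}")
```

Lower `bits` means a coarser grid and a larger reconstruction error, which is why the 2-bit runs in this notebook start from a much higher loss than the 4-bit run.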

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size, sym = 2, 128, False
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, batch_size=2, seqlen=512, sym=sym, gradient_accumulate_steps=4, device='cuda')
autoround.quantize()
output_dir = "./Llama-3-8B-2bit-AutoRound-GPTQ/"
autoround.save_quantized(output_dir)  # alternative: save_quantized(output_dir, format="auto_round")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-06-24 10:32:28 INFO autoround.py L464: using torch.float16 for quantization tuning
2024-06-24 10:33:29 INFO autoround.py L1306: quantizing 1/32, model.layers.0
2024-06-24 10:36:18 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000065 -> iter 183: 0.000015
2024-06-24 10:36:22 INFO autoround.py L1306: quantizing 2/32, model.layers.1
2024-06-24 10:39:10 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000400 -> iter 51: 0.000151
2024-06-24 10:39:14 INFO autoround.py L1306: quantizing 3/32, model.layers.2
2024-06-24 10:42:01 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000490 -> iter 193: 0.000146
2024-06-24 10:42:04 INFO autoround.py L1306: quantizing 4/32, model.layers.3
2024-06-24 10:44:52 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000943 -> iter 190: 0.000297
2024-06-24 10:44:55 INFO autoround.py L1306: quantizing 5/32, model.layers.4
2024-06-24 10:47:43 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.001249 -> iter 192: 0.000
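
As a rough guide to what 4-bit versus 2-bit buys in model size, the sketch below estimates the storage cost per weight and the resulting checkpoint size. It rests on several assumptions that are not stated in this notebook: each group stores one FP16 scale and one zero point of `bits` bits, only the seven linear layers of each decoder block are quantized (consistent with the `quantized 7/7 layers in the block` log lines), the embeddings and LM head stay in FP16, and the layer shapes are those of the published Llama 3 8B configuration. Norm weights and packing overhead are ignored.

```python
# Shapes assumed from the published Llama 3 8B configuration.
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024        # 8 key/value heads of dimension 128
LAYERS = 32
VOCAB = 128256


def bits_per_weight(bits, group_size, scale_bits=16):
    """Effective bits per weight, assuming one scale and one zero point per group."""
    return bits + (scale_bits + bits) / group_size


def quantized_linear_params():
    """Number of weights in the seven linear layers of all decoder blocks."""
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj, k_proj, v_proj
    mlp = 3 * HIDDEN * INTERMEDIATE                         # gate_proj, up_proj, down_proj
    return LAYERS * (attention + mlp)


def estimated_size_gb(bits, group_size=128):
    """Approximate checkpoint size in GB (1e9 bytes) under the assumptions above."""
    quantized_bytes = quantized_linear_params() * bits_per_weight(bits, group_size) / 8
    fp16_bytes = 2 * VOCAB * HIDDEN * 2  # embeddings and LM head at 2 bytes per weight
    return (quantized_bytes + fp16_bytes) / 1e9


if __name__ == "__main__":
    for bits in (4, 2):
        print(f"{bits}-bit: {bits_per_weight(bits, 128):.3f} bits/weight, "
              f"about {estimated_size_gb(bits):.2f} GB")
```

Under these assumptions the estimate is about 5.7 GB for 4-bit and about 4.0 GB for 2-bit: halving the bit width does not halve the checkpoint, because the unquantized embeddings and LM head take up a growing share.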

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size, sym = 2, 128, True
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, batch_size=2, seqlen=512, sym=sym, gradient_accumulate_steps=4, device='cuda')
autoround.quantize()
output_dir = "./Llama-3-8B-2bit-Sym-AutoRound-GPTQ/"
autoround.save_quantized(output_dir)  # alternative: save_quantized(output_dir, format="auto_round")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-06-25 10:43:47 INFO autoround.py L464: using torch.float16 for quantization tuning
2024-06-25 10:44:38 INFO autoround.py L1306: quantizing 1/32, model.layers.0
2024-06-25 10:47:25 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000058 -> iter 199: 0.000015
2024-06-25 10:47:28 INFO autoround.py L1306: quantizing 2/32, model.layers.1
2024-06-25 10:50:15 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000404 -> iter 51: 0.000153
2024-06-25 10:50:19 INFO autoround.py L1306: quantizing 3/32, model.layers.2
2024-06-25 10:53:05 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000498 -> iter 192: 0.000152
2024-06-25 10:53:09 INFO autoround.py L1306: quantizing 4/32, model.layers.3
2024-06-25 10:55:56 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000957 -> iter 186: 0.000284
2024-06-25 10:55:59 INFO autoround.py L1306: quantizing 5/32, model.layers.4
2024-06-25 10:58:46 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.001160 -> iter 195: 0.000

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size, sym = 4, 128, True
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, batch_size=2, seqlen=512, sym=sym, gradient_accumulate_steps=4, device='cuda')
autoround.quantize()
output_dir = "./Llama-3-8B-4bit-Symm-AutoRound-GPTQ/"
autoround.save_quantized(output_dir)  # alternative: save_quantized(output_dir, format="auto_round")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-06-26 12:54:58 INFO autoround.py L464: using torch.float16 for quantization tuning
2024-06-26 12:55:59 INFO autoround.py L1306: quantizing 1/32, model.layers.0
2024-06-26 12:58:45 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000004 -> iter 195: 0.000002
2024-06-26 12:58:49 INFO autoround.py L1306: quantizing 2/32, model.layers.1
2024-06-26 13:01:34 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000474 -> iter 40: 0.000116
2024-06-26 13:01:38 INFO autoround.py L1306: quantizing 3/32, model.layers.2
2024-06-26 13:04:23 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000137 -> iter 194: 0.000104
2024-06-26 13:04:26 INFO autoround.py L1306: quantizing 4/32, model.layers.3
2024-06-26 13:07:11 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000139 -> iter 185: 0.000101
2024-06-26 13:07:15 INFO autoround.py L1306: quantizing 5/32, model.layers.4
2024-06-26 13:09:59 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000162 -> iter 176: 0.000

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size, sym = 2, 128, False
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, iters=1000, batch_size=2, seqlen=512, sym=sym, gradient_accumulate_steps=4, device='cuda')
autoround.quantize()
output_dir = "./Llama-3-8B-2bit-iter1000-AutoRound-GPTQ/"
autoround.save_quantized(output_dir)  # alternative: save_quantized(output_dir, format="auto_round")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-06-24 17:18:35 INFO autoround.py L464: using torch.float16 for quantization tuning
2024-06-24 17:19:37 INFO autoround.py L1306: quantizing 1/32, model.layers.0
2024-06-24 17:33:20 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000060 -> iter 973: 0.000013
2024-06-24 17:33:24 INFO autoround.py L1306: quantizing 2/32, model.layers.1
2024-06-24 17:47:06 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000383 -> iter 320: 0.000130
2024-06-24 17:47:10 INFO autoround.py L1306: quantizing 3/32, model.layers.2
2024-06-24 18:00:52 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000484 -> iter 941: 0.000123
2024-06-24 18:00:56 INFO autoround.py L1306: quantizing 4/32, model.layers.3
2024-06-24 18:14:38 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000844 -> iter 553: 0.000240
2024-06-24 18:14:42 INFO autoround.py L1306: quantizing 5/32, model.layers.4
2024-06-24 18:28:25 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.001133 -> iter 705: 0.00
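
The timestamps in the log give a rough idea of what `iters=1000` costs. The sketch below parses the `quantizing N/32` lines copied from the output above, averages the time between consecutive blocks, and extrapolates to all 32 blocks. The extrapolation assumes every block takes about as long as the first few, and the helper name is hypothetical.

```python
import re
from datetime import datetime

# Lines copied from the iters=1000 run above.
LOG_1000_ITERS = """\
2024-06-24 17:19:37 INFO autoround.py L1306: quantizing 1/32, model.layers.0
2024-06-24 17:33:24 INFO autoround.py L1306: quantizing 2/32, model.layers.1
2024-06-24 17:47:10 INFO autoround.py L1306: quantizing 3/32, model.layers.2
2024-06-24 18:00:56 INFO autoround.py L1306: quantizing 4/32, model.layers.3
2024-06-24 18:14:42 INFO autoround.py L1306: quantizing 5/32, model.layers.4
"""

BLOCK_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) INFO .*quantizing (\d+)/(\d+)"
)


def estimate_quantization_time(log_text):
    """Return (mean seconds per block, estimated hours for all blocks)."""
    stamps = []
    total_blocks = None
    for line in log_text.splitlines():
        match = BLOCK_START.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
            total_blocks = int(match.group(3))
    if len(stamps) < 2:
        raise ValueError("need at least two 'quantizing' lines")
    # Time between the starts of consecutive blocks.
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]
    seconds_per_block = sum(gaps) / len(gaps)
    return seconds_per_block, seconds_per_block * total_blocks / 3600


if __name__ == "__main__":
    seconds, hours = estimate_quantization_time(LOG_1000_ITERS)
    print(f"{seconds:.0f} s per block, roughly {hours:.1f} h for the whole model")
```

On this log the estimate is a little under 14 minutes per block, or roughly 7 hours for the full model, compared with about 3 minutes per block in the runs that do not set `iters`.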

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size, sym = 2, 128, False
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, batch_size=2, seqlen=1024, sym=sym, gradient_accumulate_steps=4, device='cuda')
autoround.quantize()
output_dir = "./Llama-3-8B-2bit-len1024-AutoRound-GPTQ/"
autoround.save_quantized(output_dir)  # alternative: save_quantized(output_dir, format="auto_round")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-06-25 00:48:07 INFO autoround.py L464: using torch.float16 for quantization tuning
2024-06-25 00:48:59 INFO autoround.py L1306: quantizing 1/32, model.layers.0
2024-06-25 00:52:11 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000066 -> iter 195: 0.000014
2024-06-25 00:52:20 INFO autoround.py L1306: quantizing 2/32, model.layers.1
2024-06-25 00:55:32 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000257 -> iter 51: 0.000106
2024-06-25 00:55:39 INFO autoround.py L1306: quantizing 3/32, model.layers.2
2024-06-25 00:58:51 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000470 -> iter 198: 0.000146
2024-06-25 00:58:58 INFO autoround.py L1306: quantizing 4/32, model.layers.3
2024-06-25 01:02:11 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.000855 -> iter 180: 0.000267
2024-06-25 01:02:18 INFO autoround.py L1306: quantizing 5/32, model.layers.4
2024-06-25 01:05:30 INFO autoround.py L1237: quantized 7/7 layers in the block, loss iter 0: 0.001239 -> iter 192: 0.000
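
The six quantization cells above differ only in `bits`, `sym`, `iters`, and `seqlen`. The sketch below collects them in one table so the differences are easier to see. The `Run` class and the labels are hypothetical, and `iters=200` is an assumed default (the notebook sets `iters` only in the 1000-iteration run, and the other logs report a best iteration below 200).

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Run:
    label: str
    bits: int
    sym: bool
    iters: int = 200       # assumed AutoRound default, not set explicitly in the notebook
    seqlen: int = 512
    group_size: int = 128


# Same order as the cells above.
RUNS = [
    Run("4-bit asymmetric", bits=4, sym=False),
    Run("2-bit asymmetric", bits=2, sym=False),
    Run("2-bit symmetric", bits=2, sym=True),
    Run("4-bit symmetric", bits=4, sym=True),
    Run("2-bit asymmetric, 1000 iterations", bits=2, sym=False, iters=1000),
    Run("2-bit asymmetric, seqlen 1024", bits=2, sym=False, seqlen=1024),
]


def autoround_kwargs(run):
    """Keyword arguments for AutoRound(model, tokenizer, **kwargs) for one run."""
    kwargs = asdict(run)
    kwargs.pop("label")
    # Settings shared by every cell in this notebook.
    kwargs.update(batch_size=2, gradient_accumulate_steps=4, device="cuda")
    return kwargs


if __name__ == "__main__":
    for run in RUNS:
        print(f"{run.label}: {autoround_kwargs(run)}")
```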

# Evaluation with the Evaluation Harness

In [None]:
!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git
!pip install auto-gptq optimum

Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git
  Cloning https://github.com/EleutherAI/lm-evaluation-harness.git to /tmp/pip-req-build-g52qe58j
  Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-g52qe58j
  Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit 6e49b1f6910931882a4b3b105794c6faf96b74e5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting evaluate (from lm_eval==0.4.2)
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
Collecting jsonlines (from lm_eval==0.4.2)
  Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)
Collecting pybind11>=2.6.2 (from lm_eval==0.4.2)
  Downloading pybind11-2.13.0-py3-none-a

In [None]:
!lm_eval --model hf --model_args pretrained=./Llama-3-8B-4bit-AutoRound-GPTQ/,trust_remote_code=True --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 0 --batch_size 16 --output_path ./eval_gptq/4bit

2024-06-24 09:37:21.769582: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-24 09:37:21.819353: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-24 09:37:21.819406: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-24 09:37:21.820733: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-24 09:37:21.828023: I tensorflow/core/platform/cpu_feature_guar

In [None]:
!lm_eval --model hf --model_args pretrained=./Llama-3-8B-2bit-AutoRound-GPTQ/,trust_remote_code=True --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 0 --batch_size 16 --output_path ./eval_gptq/2bit

2024-06-24 12:46:30.793376: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-24 12:46:30.846039: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-24 12:46:30.846090: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-24 12:46:30.847848: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-24 12:46:30.855979: I tensorflow/core/platform/cpu_feature_guar

In [None]:
!lm_eval --model hf --model_args pretrained=./Llama-3-8B-2bit-iter1000-AutoRound-GPTQ/,trust_remote_code=True --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 0 --batch_size 16 --output_path ./eval_gptq/2bit-iter1000

2024-06-25 06:46:09.311787: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-25 06:46:09.368281: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-25 06:46:09.368343: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-25 06:46:09.370370: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-25 06:46:09.379133: I tensorflow/core/platform/cpu_feature_guar

In [None]:
!lm_eval --model hf --model_args pretrained=./Llama-3-8B-2bit-len1024-AutoRound-GPTQ/,trust_remote_code=True --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 0 --batch_size 16 --output_path ./eval_gptq/2bit-len1024

2024-06-25 07:41:00.468812: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-25 07:41:00.521575: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-25 07:41:00.521625: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-25 07:41:00.523583: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-25 07:41:00.531925: I tensorflow/core/platform/cpu_feature_guar

In [None]:
!lm_eval --model hf --model_args pretrained=./Llama-3-8B-4bit-Symm-AutoRound-GPTQ/ --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 0 --batch_size 16 --output_path ./eval_gptq/4bit-sym

2024-06-26 14:36:01.366739: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-26 14:36:01.422191: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-26 14:36:01.422242: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-26 14:36:01.424789: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-26 14:36:01.435155: I tensorflow/core/platform/cpu_feature_guar

In [None]:
!lm_eval --model hf --model_args pretrained=./Llama-3-8B-2bit-Sym-AutoRound-GPTQ/ --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 0 --batch_size 16 --output_path ./eval_gptq/2bit-sym

2024-06-25 13:06:11.428493: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-25 13:06:12.302150: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-25 13:06:12.302214: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-25 13:06:12.420607: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-25 13:06:12.648372: I tensorflow/core/platform/cpu_feature_guar
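
Each `lm_eval` run writes its scores as JSON under the directory given by `--output_path`. The sketch below collects the accuracy of the three tasks from such files so the models can be compared side by side. It assumes the JSON has a top-level `"results"` object keyed by task name and that accuracy is stored under the key `"acc,none"`, as lm_eval 0.4.x does to my knowledge. The file names and directory layout vary between versions, so the code simply searches for any JSON file with a `"results"` key. The function names are hypothetical.

```python
import json
from pathlib import Path

TASKS = ("winogrande", "hellaswag", "arc_challenge")


def extract_accuracy(report, tasks=TASKS):
    """Return {task: accuracy} from a parsed lm_eval results dictionary."""
    scores = {}
    for task in tasks:
        metrics = report.get("results", {}).get(task, {})
        # Assumed metric key format: "<metric>,<filter>".
        if "acc,none" in metrics:
            scores[task] = metrics["acc,none"]
    return scores


def load_reports(output_dir):
    """Yield (path, scores) for every JSON file under output_dir that holds results."""
    for path in sorted(Path(output_dir).rglob("*.json")):
        with open(path) as handle:
            report = json.load(handle)
        if isinstance(report, dict) and "results" in report:
            yield path, extract_accuracy(report)


if __name__ == "__main__":
    for path, scores in load_reports("./eval_gptq"):
        row = ", ".join(f"{task}={score:.4f}" for task, score in scores.items())
        print(f"{path}: {row}")
```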

# Benchmarking Inference Throughput with vLLM

In [None]:
!git clone https://github.com/vllm-project/vllm.git
!cd vllm && pip install -e .  # This may take 5-10 minutes.

fatal: destination path 'vllm' already exists and is not an empty directory.
Obtaining file:///content/vllm
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting ninja (from vllm==0.5.0.post1+cu122)
  Using cached ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
Collecting fastapi (from vllm==0.5.0.post1+cu122)
  Using cached fastapi-0.111.0-py3-none-any.whl (91 kB)
Collecting openai (from vllm==0.5.0.post1+cu122)
  Using cached openai-1.35.4-py3-none-any.whl (327 kB)
Collecting uvicorn[standard] (from vllm==0.5.0.post1+cu122)
  Using cached uvicorn-0.30.1-py3-none-any.whl (62 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm==0.5.0.post1+cu122)
  Using cached prometheus_fastapi_instrumentator-7.0.0-py3-none-any.whl (19 kB)
Colle

In [None]:
!wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

--2024-06-26 16:00:43--  https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
Resolving huggingface.co (huggingface.co)... 18.164.174.23, 18.164.174.17, 18.164.174.118, ...
Connecting to huggingface.co (huggingface.co)|18.164.174.23|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/58/74/5874e8234cbcd37dd31ca486e8492d9f1370bdd04829001f53991a866851e83f/35f0e213ce091ed9b9af2a1f0755e9d39f9ccec34ab281cd4ca60d70f6479ba4?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27ShareGPT_V3_unfiltered_cleaned_split.json%3B+filename%3D%22ShareGPT_V3_unfiltered_cleaned_split.json%22%3B&response-content-type=application%2Fjson&Expires=1719676844&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxOTY3Njg0NH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy81OC83NC81ODc0ZTgyMzRjYmNkMzdkZDMxY2E0ODZ

In [None]:
!python vllm/benchmarks/benchmark_throughput.py --backend vllm --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model ./Llama-3-8B-4bit-AutoRound-GPTQ/

Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=None, model='./drive/MyDrive/autoround/Llama-3-8B-4bit-AutoRound-GPTQ/', tokenizer='./drive/MyDrive/autoround/Llama-3-8B-4bit-AutoRound-GPTQ/', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, device='cuda', enable_prefix_caching=False, enable_chunked_prefill=False, max_num_batched_tokens=None, download_dir=None, output_json=None, distributed_executor_backend=None, load_format='auto')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-26 08:05:02 llm_engine.py:164] Initializing an LLM engine (v0.5.0.post1) with config: model='./drive/MyDrive/autoround/Llama-3-8B-4bit-AutoRound-GP

In [None]:
!python vllm/benchmarks/benchmark_throughput.py --backend vllm --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model ./Llama-3-8B-4bit-Symm-AutoRound-GPTQ/

Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=None, model='./drive/MyDrive/autoround/Llama-3-8B-4bit-Symm-AutoRound-GPTQ/', tokenizer='./drive/MyDrive/autoround/Llama-3-8B-4bit-Symm-AutoRound-GPTQ/', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, device='cuda', enable_prefix_caching=False, enable_chunked_prefill=False, max_num_batched_tokens=None, download_dir=None, output_json=None, distributed_executor_backend=None, load_format='auto')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-26 16:00:58 gptq_marlin.py:134] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 06-26 16:00:58 llm

In [None]:
!python vllm/benchmarks/benchmark_throughput.py --backend vllm --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model kaitchup/Meta-Llama-3-8B-gptq-4bit

Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=None, model='kaitchup/Meta-Llama-3-8B-gptq-4bit', tokenizer='kaitchup/Meta-Llama-3-8B-gptq-4bit', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, device='cuda', enable_prefix_caching=False, enable_chunked_prefill=False, max_num_batched_tokens=None, download_dir=None, output_json=None, distributed_executor_backend=None, load_format='auto')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-26 08:13:44 gptq_marlin.py:134] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 06-26 08:13:44 llm_engine.py:164] Initializing an LLM engine (v0.5.0.post1
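
The benchmark script reports throughput in requests per second and tokens per second. The sketch below shows how such figures can be computed and compared between two models. It assumes that tokens per second counts both prompt and generated tokens, which is my understanding of `benchmark_throughput.py` and is not confirmed by the truncated output above. The function names are hypothetical.

```python
def throughput(requests, elapsed_seconds):
    """Return (requests/s, tokens/s) for a list of (prompt_len, output_len) pairs."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # Assumption: prompt and generated tokens both count toward tokens/s.
    total_tokens = sum(prompt_len + output_len for prompt_len, output_len in requests)
    return len(requests) / elapsed_seconds, total_tokens / elapsed_seconds


def speedup(candidate_tokens_per_s, baseline_tokens_per_s):
    """Relative throughput of a candidate model against a baseline."""
    if baseline_tokens_per_s <= 0:
        raise ValueError("baseline_tokens_per_s must be positive")
    return candidate_tokens_per_s / baseline_tokens_per_s
```

With the three runs above, the natural baseline is `kaitchup/Meta-Llama-3-8B-gptq-4bit`, since its log and the symmetric AutoRound model's log both report the `gptq_marlin` kernel.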