<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/vLLM_Inference_with_Marlin_for_GPTQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how to convert a GPTQ model to the Marlin format.

It also benchmarks the inference throughput of the Marlin-format model with vLLM, using Mistral 7B.

Marlin requires an Ampere GPU or more recent (compute capability 8.0 or higher). On Google Colab, only the A100 works.

First, install the following libraries:

In [None]:
!pip install --upgrade transformers auto-gptq accelerate nvidia-ml-py3 optimum

Collecting transformers
  Downloading transformers-4.39.0-py3-none-any.whl (8.8 MB)
Collecting auto-gptq
  Downloading auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.5 MB)
Collecting accelerate
  Downloading accelerate-0.28.0-py3-none-any.whl (290 kB)
Collecting nvidia-ml-py3
  Downloading nvidia-ml-py3-7.352.0.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting optimum
  Downloading optimum-1.17.1-py3-none-any.whl (407 kB)
Collectin

Check whether your GPU supports Marlin. The cell also defines a function, "print_gpu_utilization()", to measure GPU memory consumption in the following cells.

In [None]:
import torch

major_version, minor_version = torch.cuda.get_device_capability()
if major_version >= 8:
  print("Your GPU supports Marlin!")
else:
  print("Your GPU doesn't support Marlin... You need an Ampere GPU or more recent (RTX 30xx/40xx, A100, H100, ...)")


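from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
# Import only the NVML functions that are used instead of a wildcard import.
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
import time

def print_gpu_utilization():
    """Print the memory currently in use on GPU 0, in MB."""
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")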


GPTQ_MODEL = "kaitchup/Mistral-7B-v0.1-gptq-4bit"

Your GPU supports Marlin!


# Conversion to Marlin
We convert the GPTQ model simply by loading it with AutoGPTQForCausalLM and passing "use_marlin=True". The converted model and its tokenizer are then saved to a local directory.



In [None]:
tokenizer = AutoTokenizer.from_pretrained(GPTQ_MODEL, use_fast=True)
marlin_model = AutoGPTQForCausalLM.from_quantized(
      GPTQ_MODEL,
      use_marlin=True,
      device_map='auto')

save_dir = "Mistral-7B-v0.1-gptq-marlin-4bit"

marlin_model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

INFO - The layer lm_head is not quantized.
INFO:auto_gptq.modeling._base:The layer lm_head is not quantized.
Overriding QuantLinear layers to use Marlin's QuantLinear...: 100%|██████████| 454/454 [00:37<00:00, 12.16it/s]


('Mistral-7B-v0.1-gptq-marlin-4bit/tokenizer_config.json',
 'Mistral-7B-v0.1-gptq-marlin-4bit/special_tokens_map.json',
 'Mistral-7B-v0.1-gptq-marlin-4bit/tokenizer.model',
 'Mistral-7B-v0.1-gptq-marlin-4bit/added_tokens.json',
 'Mistral-7B-v0.1-gptq-marlin-4bit/tokenizer.json')
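
After saving, it can be useful to check what was actually written to disk. The following is a minimal sketch using only the standard library; the helper name `list_checkpoint_files` is hypothetical and not part of AutoGPTQ, and the directory name is the `save_dir` used in the cell above.

```python
from pathlib import Path

def list_checkpoint_files(directory):
    """Return (relative path, size in MB) for every file under `directory`, sorted by path."""
    root = Path(directory)
    files = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size_mb = path.stat().st_size / 1024**2
            files.append((str(path.relative_to(root)), size_mb))
    return files

# Same directory name as `save_dir` in the conversion cell.
checkpoint_dir = Path("Mistral-7B-v0.1-gptq-marlin-4bit")
if checkpoint_dir.is_dir():
    for name, size_mb in list_checkpoint_files(checkpoint_dir):
        print(f"{name}: {size_mb:.2f} MB")
else:
    print(f"{checkpoint_dir} not found; run the conversion cell first.")
```

The listing should include the tokenizer files reported above along with the model weights and configuration files.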

# Usage and benchmarking with vLLM

In [None]:
!pip install vllm

Collecting vllm
  Downloading vllm-0.3.3-cp310-cp310-manylinux1_x86_64.whl (44.3 MB)
Collecting ninja (from vllm)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
Collecting ray>=2.9 (from vllm)
  Downloading ray-2.10.0-cp310-cp310-manylinux2014_x86_64.whl (65.1 MB)
Collecting torch==2.1.2 (from vllm)
  Downloading torch-2.1.2-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
Collecting xformers==0.0.23.post1 (from vllm)
  Downloading xformers-0.0.23.post1-cp310-

Marlin's throughput is benchmarked for the following batch sizes: 1, 2, 4, 8, 16, 32, 64, and 128. The reported speed counts both prompt tokens and generated tokens.

In [None]:
import time
from vllm import LLM, SamplingParams

batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128]
p = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. \n\n Tell me about gravity."

sampling_params = SamplingParams(max_tokens=1000)

loading_start = time.time()
llm = LLM(model="kaitchup/Mistral-7B-v0.1-gptq-marlin-4bit")
print("--- Loading time: %s seconds ---" % (time.time() - loading_start))


for b in batch_sizes:
  # Repeat the same prompt b times to form a batch.
  prompts = [p] * b

  generation_start = time.time()
  outputs = llm.generate(prompts, sampling_params)
  duration = time.time() - generation_start
  # Count prompt tokens plus generated tokens for every sequence in the batch.
  total_tokens = 0
  for output in outputs:
    total_tokens += len(output.prompt_token_ids) + len(output.outputs[0].token_ids)
  print('\nBatch size: '+str(b))
  print("--- Speed: %s tokens/second ---" % (round(total_tokens/duration,2)))



config.json:   0%|          | 0.00/1.00k [00:00<?, ?B/s]

INFO 03-23 01:51:43 llm_engine.py:87] Initializing an LLM engine with config: model='kaitchup/Mistral-7B-v0.1-gptq-marlin-4bit', tokenizer='kaitchup/Mistral-7B-v0.1-gptq-marlin-4bit', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=marlin, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)


tokenizer_config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

INFO 03-23 01:51:50 weight_utils.py:163] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/4.13G [00:00<?, ?B/s]

INFO 03-23 01:52:05 llm_engine.py:357] # GPU blocks: 12730, # CPU blocks: 2048
INFO 03-23 01:52:07 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-23 01:52:07 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 03-23 01:52:13 model_runner.py:756] Graph capturing finished in 5 secs.
--- Loading time: 31.898792028427124 seconds ---


Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.64it/s]



Batch size: 1
--- Speed: 456.38 tokens/second ---


Processed prompts: 100%|██████████| 2/2 [00:06<00:00,  3.17s/it]



Batch size: 2
--- Speed: 208.43 tokens/second ---


Processed prompts: 100%|██████████| 4/4 [00:04<00:00,  1.05s/it]



Batch size: 4
--- Speed: 407.9 tokens/second ---


Processed prompts: 100%|██████████| 8/8 [00:01<00:00,  4.55it/s]



Batch size: 8
--- Speed: 988.37 tokens/second ---


Processed prompts: 100%|██████████| 16/16 [00:07<00:00,  2.26it/s]



Batch size: 16
--- Speed: 1087.77 tokens/second ---


Processed prompts: 100%|██████████| 32/32 [00:07<00:00,  4.12it/s]



Batch size: 32
--- Speed: 1623.01 tokens/second ---


Processed prompts: 100%|██████████| 64/64 [00:09<00:00,  6.93it/s]



Batch size: 64
--- Speed: 2431.62 tokens/second ---


Processed prompts: 100%|██████████| 128/128 [00:15<00:00,  8.41it/s]


Batch size: 128
--- Speed: 3431.21 tokens/second ---
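
The speed printed by the benchmark divides prompt tokens plus generated tokens by the wall-clock duration, so long prompts inflate the figure. The sketch below separates the two views of throughput. The function name `throughput` and the example numbers are hypothetical, chosen only to illustrate the arithmetic; they are not measurements from this notebook.

```python
def throughput(prompt_tokens, generated_tokens, seconds):
    """Return (total tokens/s, generated tokens/s) for one batch."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    total = (prompt_tokens + generated_tokens) / seconds
    generated_only = generated_tokens / seconds
    return total, generated_only

# Hypothetical numbers for illustration only (not measured in this notebook).
total_tps, gen_tps = throughput(prompt_tokens=1000, generated_tokens=3000, seconds=4.0)
print(f"total: {total_tps:.1f} tokens/s, generated only: {gen_tps:.1f} tokens/s")
```

In the benchmark loop, `prompt_tokens` would correspond to the summed `len(output.prompt_token_ids)` and `generated_tokens` to the summed `len(output.outputs[0].token_ids)`.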





Benchmarking the original GPTQ model for reference, using the same code:

In [None]:
import time
from vllm import LLM, SamplingParams

batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128]
p = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. \n\n Tell me about gravity."

sampling_params = SamplingParams(max_tokens=1000)

loading_start = time.time()
llm = LLM(model="kaitchup/Mistral-7B-v0.1-gptq-4bit")
print("--- Loading time: %s seconds ---" % (time.time() - loading_start))


for b in batch_sizes:
  # Repeat the same prompt b times to form a batch.
  prompts = [p] * b

  generation_start = time.time()
  outputs = llm.generate(prompts, sampling_params)
  duration = time.time() - generation_start
  # Count prompt tokens plus generated tokens for every sequence in the batch.
  total_tokens = 0
  for output in outputs:
    total_tokens += len(output.prompt_token_ids) + len(output.outputs[0].token_ids)
  print('\nBatch size: '+str(b))
  print("--- Speed: %s tokens/second ---" % (round(total_tokens/duration,2)))



INFO 03-23 01:59:30 llm_engine.py:87] Initializing an LLM engine with config: model='kaitchup/Mistral-7B-v0.1-gptq-4bit', tokenizer='kaitchup/Mistral-7B-v0.1-gptq-4bit', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-23 01:59:34 weight_utils.py:163] Using model weights format ['*.safetensors']
INFO 03-23 01:59:38 llm_engine.py:357] # GPU blocks: 14035, # CPU blocks: 2048
INFO 03-23 01:59:40 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-23 01:59:40 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.69it/s]



Batch size: 1
--- Speed: 464.51 tokens/second ---


Processed prompts: 100%|██████████| 2/2 [00:06<00:00,  3.17s/it]



Batch size: 2
--- Speed: 208.45 tokens/second ---


Processed prompts: 100%|██████████| 4/4 [00:03<00:00,  1.25it/s]



Batch size: 4
--- Speed: 411.04 tokens/second ---


Processed prompts: 100%|██████████| 8/8 [00:07<00:00,  1.08it/s]



Batch size: 8
--- Speed: 542.88 tokens/second ---


Processed prompts: 100%|██████████| 16/16 [00:10<00:00,  1.49it/s]



Batch size: 16
--- Speed: 722.92 tokens/second ---


Processed prompts: 100%|██████████| 32/32 [00:13<00:00,  2.39it/s]



Batch size: 32
--- Speed: 924.51 tokens/second ---


Processed prompts: 100%|██████████| 64/64 [00:18<00:00,  3.44it/s]



Batch size: 64
--- Speed: 1266.42 tokens/second ---


Processed prompts: 100%|██████████| 128/128 [00:30<00:00,  4.25it/s]


Batch size: 128
--- Speed: 1676.06 tokens/second ---
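
To compare the two runs side by side, the short sketch below computes the Marlin-over-GPTQ speedup for each batch size. The throughput values are copied from the outputs of the two benchmark cells in this notebook; each comes from a single run, so small differences should not be over-interpreted.

```python
# Tokens/second reported by the two benchmark cells, keyed by batch size.
marlin = {1: 456.38, 2: 208.43, 4: 407.9, 8: 988.37,
          16: 1087.77, 32: 1623.01, 64: 2431.62, 128: 3431.21}
gptq = {1: 464.51, 2: 208.45, 4: 411.04, 8: 542.88,
        16: 722.92, 32: 924.51, 64: 1266.42, 128: 1676.06}

# Speedup of the Marlin model relative to the original GPTQ model.
speedups = {b: marlin[b] / gptq[b] for b in marlin}

for b, s in speedups.items():
    print(f"batch size {b:>3}: Marlin {marlin[b]:>8.2f} | GPTQ {gptq[b]:>8.2f} | speedup {s:.2f}x")
```

In this run, the two formats are roughly on par for batch sizes 1 to 4, while Marlin is faster from batch size 8 upward and about twice as fast at batch size 128.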



