<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/Yi_Fine_tuning%2C_Inference%2C_Quantization%2C_and_Benchmarking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how to:

*   Fine-tune Yi-6B with QLoRA
*   Quantize Yi-6B with bitsandbytes, GPTQ, and AWQ
*   Run inference with Transformers and vLLM
*   Benchmark Yi-6B with optimum-benchmark and the Evaluation Harness

Each section of this notebook can be run independently.

More details and comments: [Yi: Fine-tune and Run One of the Best Bilingual LLMs on Your Computer](https://kaitchup.substack.com/p/yi-fine-tune-and-run-one-the-best)


# Inference

With vLLM

In [None]:
!pip install vllm

Collecting vllm
  Downloading vllm-0.3.3-cp310-cp310-manylinux1_x86_64.whl (44.3 MB)
Collecting ninja (from vllm)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
Collecting ray>=2.9 (from vllm)
  Downloading ray-2.9.3-cp310-cp310-manylinux2014_x86_64.whl (64.9 MB)
Collecting torch==2.1.2 (from vllm)
  Downloading torch-2.1.2-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
Collecting xformers==0.0.23.post1 (from vllm)
  Downloading xformers-0.0.23.post1-cp310-c

Without quantization, you need at least 12 GB of GPU VRAM for the 16-bit weights alone.
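The 12 GB figure follows from the parameter count: roughly 6 billion weights stored in 16 bits. Below is a minimal sketch of that arithmetic, assuming a round 6e9 parameters (the exact count for Yi-6B is slightly different) and 1 GB = 1e9 bytes. `weight_memory_gb` is a hypothetical helper name, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory taken by the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 6e9  # assumption: round figure for Yi-6B

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

These numbers cover the weights only. The KV cache, CUDA graphs, and quantization metadata add to them.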

In [None]:
import time
from vllm import LLM, SamplingParams

prompts = [
    "The best recipe for pasta is"
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=150)

loading_start = time.time()
llm = LLM(model="01-ai/Yi-6B")
print("--- Loading time: %s seconds ---" % (time.time() - loading_start))

generation_start = time.time()
outputs = llm.generate(prompts, sampling_params)
print("--- Generation time: %s seconds ---" % (time.time() - generation_start))

for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)
    print('------')

config.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

INFO 03-17 06:39:52 llm_engine.py:87] Initializing an LLM engine with config: model='01-ai/Yi-6B', tokenizer='01-ai/Yi-6B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)


tokenizer_config.json:   0%|          | 0.00/320 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/1.03M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.56M [00:00<?, ?B/s]

INFO 03-17 06:40:00 weight_utils.py:163] Using model weights format ['*.safetensors']


model-00002-of-00002.safetensors:   0%|          | 0.00/2.18G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

INFO 03-17 06:49:11 llm_engine.py:357] # GPU blocks: 23922, # CPU blocks: 4096
INFO 03-17 06:49:13 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-17 06:49:13 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 03-17 06:49:19 model_runner.py:756] Graph capturing finished in 6 secs.
--- Loading time: 568.9558756351471 seconds ---


Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.77s/it]

--- Generation time: 1.7790272235870361 seconds ---
 a simple one: cook the pasta al dente, then toss with the sauce of your choice.
What is the best pasta to make with a simple sauce?
The best pasta to make with a simple sauce is the one that suits your taste buds the most.
What is the best pasta to make with a simple sauce?
The best pasta to make with a simple sauce is the one that suits your taste buds the most.
What is the best pasta to make with a simple sauce?
The best pasta to make with a simple sauce is the one that suits your taste buds the most.
What is the best pasta to make with a simple sauce?
The best pasta to make with a simple sauce is the one that suits
------
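The `SamplingParams` used in this cell (`temperature=0.7`, `top_p=0.8`, `top_k=20`) control how each next token is drawn. The sketch below illustrates the general technique on a toy list of logits: temperature scaling, then top-k truncation, then nucleus (top-p) truncation. It is an illustration only; `filter_distribution` is a hypothetical helper, and vLLM's actual order of operations and tie handling may differ.

```python
import math
import random


def filter_distribution(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up values for illustration
dist = filter_distribution(toy_logits)
rng = random.Random(0)
next_token = rng.choices(list(dist), weights=list(dist.values()))[0]
print(dist, next_token)
```

A lower temperature or a smaller `top_p` concentrates probability on fewer tokens, which makes the output more deterministic.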





The same example with the AWQ-quantized model.
To see how this quantized model was made, refer to the "Quantization" section below.

In [None]:
import time
from vllm import LLM, SamplingParams

prompts = [
    "The best recipe for pasta is"
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=150)

loading_start = time.time()
llm = LLM(model="kaitchup/Yi-6B-awq-4bit", quantization="awq")
print("--- Loading time: %s seconds ---" % (time.time() - loading_start))

generation_start = time.time()
outputs = llm.generate(prompts, sampling_params)
print("--- Generation time: %s seconds ---" % (time.time() - generation_start))

for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)
    print('------')

INFO 03-17 06:51:02 llm_engine.py:87] Initializing an LLM engine with config: model='kaitchup/Yi-6B-awq-4bit', tokenizer='kaitchup/Yi-6B-awq-4bit', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-17 06:51:06 weight_utils.py:163] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/3.93G [00:00<?, ?B/s]

INFO 03-17 06:51:21 llm_engine.py:357] # GPU blocks: 31680, # CPU blocks: 4096
INFO 03-17 06:51:23 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-17 06:51:23 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 03-17 06:51:30 model_runner.py:756] Graph capturing finished in 7 secs.
--- Loading time: 29.90104651451111 seconds ---


Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.32s/it]

--- Generation time: 1.3277652263641357 seconds ---
 a simple one: cook pasta in boiling salted water, drain and serve with a little butter, grated cheese and freshly ground pepper.
What is the best way to cook pasta?
Cooking pasta is simple. Just bring a large pot of salted water to a boil, add the pasta, cook for 8-12 minutes, drain, and serve. But there are so many ways to cook pasta. You can make it al dente, or you can make it mushy.
What is the best way to cook pasta for 1?
Cooking pasta is simple. Just bring a large pot of salted water to a boil, add the pasta, cook for 8-12 minutes, drain,
------
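The two generation times reported above can be turned into a rough throughput comparison. This is a sketch under one assumption: both runs generated the full `max_tokens=150` (both outputs stop mid-sentence, which is consistent with hitting the limit). It is a single prompt and a single run, so treat the result as indicative rather than a benchmark.

```python
MAX_TOKENS = 150  # from SamplingParams(max_tokens=150)

# Generation times copied from the two runs above, in seconds
generation_seconds = {
    "Yi-6B (bfloat16)": 1.7790272235870361,
    "Yi-6B AWQ 4-bit": 1.3277652263641357,
}


def tokens_per_second(num_tokens: int, seconds: float) -> float:
    return num_tokens / seconds


throughput = {
    name: tokens_per_second(MAX_TOKENS, seconds)
    for name, seconds in generation_seconds.items()
}
speedup = generation_seconds["Yi-6B (bfloat16)"] / generation_seconds["Yi-6B AWQ 4-bit"]

for name, tps in throughput.items():
    print(f"{name}: {tps:.1f} tokens/s")
print(f"AWQ speedup: {speedup:.2f}x")
```

The loading times are not compared here because they include downloading the checkpoints, whose sizes differ.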





With Hugging Face's Transformers (16-bit version)

In [None]:
!pip install accelerate transformers autoawq

Collecting autoawq
  Downloading autoawq-0.2.3-cp310-cp310-manylinux2014_x86_64.whl (79 kB)
Collecting datasets (from autoawq)
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting zstandard (from autoawq)
  Downloading zstandard-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
Collecting autoawq-kernels (from autoawq)
  Downloading autoawq_kernels-0.0.6-cp310-cp310-manylinux2014_x86_64.whl (33.4 MB)
Collecting dill<0.3.9,>=0.3.0 (from datasets->autoawq)
  Downl

Using Yi (16-bit version)

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

set_seed(1234)  # For reproducibility

prompt = "The best recipe for pasta is"

checkpoint = "01-ai/Yi-6B"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="cuda")

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=150)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The best recipe for pasta is the one your mama made. Here is an elegant Italian pasta recipe from an old, old cookbook which uses the classic combination of butter and olive oil as a sauce. It can either stand alone as a sauce or be served over your favorite pasta noodles as a main course. Try it! It was my father's favorite "going to Italy" dish.
Linguine and Spinach Pasta Sauce
1/2 lb linguine
10 oz frozen spinach (or frozen chopped spinach if available), cooked, cooled and drained
1/2 oz butter
1/2 teaspoon salt (or more to taste)
1/2 teaspoon white wine vinegar (or more to taste)
1/4 cup Parmesan


With Hugging Face's Transformers, using the 4-bit AWQ-quantized model

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

set_seed(1234)  # For reproducibility

prompt = "The best recipe for pasta is"

checkpoint = "kaitchup/Yi-6B-awq-4bit"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map='cuda')

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=150)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)

The best recipe for pasta is the one your mama made. Here is an elegant Italian pasta recipe from an old, old cookbook which uses the classic combination of butter and garlic. For a mains dish you could make a creamy chicken or fish dish , or serve with veal or pork with a light sauce. It was my way of saying 'Thank you!', 'I love you.' and 'Goodbye,' all at the same time. The other great recipe for pasta is the one your mama made. Here are some suggestions for preparing pasta dishes in one way or another: pasta with sauce, pasta with vegetables. How to Choose a Pasta Dish and What to Offer with It. There are countless recipes for the pasta that you enjoy most often, and

In [None]:
!pip install accelerate transformers autoawq peft

Collecting peft
  Downloading peft-0.9.0-py3-none-any.whl (190 kB)
Installing collected packages: peft
Successfully installed peft-0.9.0


# Quantization

Bitsandbytes NF4

In [None]:
!pip install --upgrade transformers bitsandbytes accelerate datasets

Collecting bitsandbytes
  Downloading bitsandbytes-0.43.0-py3-none-manylinux_2_24_x86_64.whl (102.2 MB)
Collecting accelerate
  Downloading accelerate-0.28.0-py3-none-any.whl (290 kB)
Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
else:
  compute_dtype = torch.float16

model_name = "01-ai/Yi-6B"
quant_path = 'Yi-6B-bnb-4bit'
tokenizer = AutoTokenizer.from_pretrained(model_name)
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, quantization_config=bnb_config
)


# Save the quantized model in safetensors format
model.save_pretrained("./"+quant_path, safe_serialization=True)
tokenizer.save_pretrained("./"+quant_path)


tokenizer_config.json:   0%|          | 0.00/320 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/1.03M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.56M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.18G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

('./Yi-6B-bnb-4bit/tokenizer_config.json',
 './Yi-6B-bnb-4bit/special_tokens_map.json',
 './Yi-6B-bnb-4bit/tokenizer.model',
 './Yi-6B-bnb-4bit/added_tokens.json',
 './Yi-6B-bnb-4bit/tokenizer.json')
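The `bnb_4bit_use_double_quant=True` option quantizes the per-block scaling constants themselves, which lowers the effective number of bits stored per parameter. The sketch below reproduces the arithmetic from the QLoRA paper. The block sizes (64 weights per block, 256 constants per second-level block) and the constant widths (32-bit and 8-bit) are assumptions taken from that paper rather than read from this notebook's run, and `bits_per_param` is a hypothetical helper.

```python
def bits_per_param(weight_bits=4, block_size=64, double_quant=True,
                   scale_bits=32, dq_scale_bits=8, dq_block_size=256):
    """Effective bits per parameter, counting the quantization constants."""
    if not double_quant:
        # One full-precision scale per block of weights
        return weight_bits + scale_bits / block_size
    # 8-bit scales per block, plus one 32-bit scale per group of 256 scales
    return (weight_bits
            + dq_scale_bits / block_size
            + scale_bits / (block_size * dq_block_size))


plain = bits_per_param(double_quant=False)
nested = bits_per_param(double_quant=True)
print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {nested:.3f} bits/param")
```

For a model with about 6 billion parameters, the difference of roughly 0.37 bits per parameter amounts to a few hundred megabytes.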

GPTQ

More details about the GPTQ quantization in this article:

[Quantize and Fine-tune LLMs with GPTQ Using Transformers and TRL](https://kaitchup.substack.com/p/quantize-and-fine-tune-llms-with)


In [None]:
!pip install --upgrade transformers auto-gptq accelerate datasets
!python -m pip install git+https://github.com/huggingface/optimum.git

Collecting auto-gptq
  Downloading auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.5 MB)
Collecting rouge (from auto-gptq)
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Collecting gekko (from auto-gptq)
  Downloading gekko-1.0.7-py3-none-any.whl (13.1 MB)
Collecting peft>=0.5.0 (from auto-gptq)
  Downloading peft-0.9.0-py3-none-any.whl (190 kB)
Installing collected packages: rouge, gekko, peft, auto-gptq
Successfully installed auto-gptq-0.7.1 gekko-1.0.7 peft-0.9.0 rouge-1.0.1
Collecting git+https://github.com/huggingface/optimum.git
  Cloning https://github.com/huggingface/optimum.git to /tm

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer
import torch

model_path = '01-ai/Yi-6B'
w = 4  # Quantize to 4-bit; change to 2, 3, or 8 for another precision

quant_path = 'Yi-6B-gptq-'+str(w)+'bit'

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

# Quantize, using C4 samples for calibration
quantizer = GPTQQuantizer(bits=w, dataset="c4", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)

# Save the quantized model in safetensors format
quantized_model.save_pretrained("./"+quant_path, safe_serialization=True)
tokenizer.save_pretrained("./"+quant_path)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading readme:   0%|          | 0.00/41.1k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/319M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (4487 > 4096). Running this sequence through the model will result in indexing errors


Quantizing model.layers blocks :   0%|          | 0/32 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]


The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class
The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class


('./Yi-6B-gptq-4bit/tokenizer_config.json',
 './Yi-6B-gptq-4bit/special_tokens_map.json',
 './Yi-6B-gptq-4bit/tokenizer.model',
 './Yi-6B-gptq-4bit/added_tokens.json',
 './Yi-6B-gptq-4bit/tokenizer.json')

AWQ

More details about the AWQ quantization in this article:

[Fast and Small Llama 2 with Activation-Aware Quantization (AWQ)](https://kaitchup.substack.com/p/fast-and-small-llama-2-with-activation)


In [None]:
!pip install --upgrade transformers autoawq optimum accelerate

Collecting autoawq
  Downloading autoawq-0.2.3-cp310-cp310-manylinux2014_x86_64.whl (79 kB)
Collecting optimum
  Downloading optimum-1.17.1-py3-none-any.whl (407 kB)
Collecting accelerate
  Downloading accelerate-0.28.0-py3-none-any.whl (290 kB)
Collecting datasets (from autoawq)
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting zstandard (from autoawq)
  Downloading zstandard-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)

In [None]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = '01-ai/Yi-6B'
quant_path = 'Yi-6B-awq-4bit'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model with safetensors
model.save_quantized("./"+quant_path, safetensors=True)
tokenizer.save_pretrained("./"+quant_path)


config.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

Fetching 13 files:   0%|          | 0/13 [00:00<?, ?it/s]

md5:   0%|          | 0.00/184 [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/59.4k [00:00<?, ?B/s]

Yi.svg:   0%|          | 0.00/980 [00:00<?, ?B/s]

LICENSE:   0%|          | 0.00/17.4k [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.18G [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.56M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/1.03M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/320 [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading readme:   0%|          | 0.00/167 [00:00<?, ?B/s]



Downloading data:   0%|          | 0.00/471M [00:00<?, ?B/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (8947 > 4096). Running this sequence through the model will result in indexing errors
AWQ: 100%|██████████| 32/32 [14:17<00:00, 26.81s/it]


('./Yi-6B-awq-4bit/tokenizer_config.json',
 './Yi-6B-awq-4bit/special_tokens_map.json',
 './Yi-6B-awq-4bit/tokenizer.model',
 './Yi-6B-awq-4bit/added_tokens.json',
 './Yi-6B-awq-4bit/tokenizer.json')
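The `quant_config` fields used in this cell (`w_bit`, `q_group_size`, `zero_point`) describe group-wise asymmetric quantization: weights are split into groups, and each group gets its own scale and zero point. The pure-Python sketch below shows only that round-to-nearest step on made-up values. It does not include the activation-aware scaling that AWQ adds on top, and the function names are hypothetical.

```python
import math


def quantize_group(weights, w_bit=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    q_max = 2 ** w_bit - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0.0:
        scale = 1.0  # constant group: any scale works, avoid dividing by zero
    # Integer offset so that w_min maps near 0, clamped to the valid range
    zero_point = min(q_max, max(0, round(-w_min / scale)))
    q = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]


def fake_quantize(weights, q_group_size=128, w_bit=4):
    """Quantize then dequantize, group by group, to inspect the rounding error."""
    restored = []
    for start in range(0, len(weights), q_group_size):
        group = weights[start:start + q_group_size]
        q, scale, zero_point = quantize_group(group, w_bit)
        restored.extend(dequantize_group(q, scale, zero_point))
    return restored


weights = [math.sin(i) for i in range(256)]  # toy data, two groups of 128
restored = fake_quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max absolute error: {max_error:.4f}")
```

A smaller `q_group_size` gives each scale fewer weights to cover, which reduces the error but stores more scales and zero points.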

# Fine-tuning

QLoRA

More details about QLoRA fine-tuning in this article:

[QLoRa: Fine-Tune a Large Language Model on Your GPU](https://kaitchup.substack.com/p/qlora-fine-tune-a-large-language-model-on-your-gpu-27bed5a03e2b)

In [None]:
!pip install --upgrade bitsandbytes transformers peft accelerate datasets trl flash_attn

Collecting bitsandbytes
  Downloading bitsandbytes-0.43.0-py3-none-manylinux_2_24_x86_64.whl (102.2 MB)
Collecting peft
  Downloading peft-0.9.0-py3-none-any.whl (190 kB)
Collecting accelerate
  Downloading accelerate-0.28.0-py3-none-any.whl (290 kB)
Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting trl
  Downloading trl-0.7.11-py3-none-any.whl (155 kB)

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
from trl import SFTTrainer, SFTConfig

#use bf16 and FlashAttention if supported
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'

model_name = "01-ai/Yi-6B"
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id =  tokenizer.eos_token_id
tokenizer.padding_side = 'left'

ds = load_dataset("timdettmers/openassistant-guanaco")

bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, torch_dtype=compute_dtype, quantization_config=bnb_config, device_map={"": 0}, attn_implementation=attn_implementation
)

model = prepare_model_for_kbit_training(model, gradient_checkpointing_kwargs={'use_reentrant':True})

#Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False # Gradient checkpointing is used by default but not compatible with caching

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)

training_arguments = SFTConfig(
        output_dir="./Yi-6B_QLoRA",
        evaluation_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        per_device_eval_batch_size=8,
        log_level="debug",
        save_strategy="epoch",
        logging_steps=100,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=100,
        num_train_epochs=3,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        dataset_text_field="text",
        max_seq_length=512,
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Using auto half precision backend
Currently training with a batch size of: 8
***** Running training *****
  Num examples = 9,846
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 4
  Total optimization steps = 921
  Number of trainable parameters = 36,306,944
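The number of trainable parameters reported above can be checked by hand. Each LoRA adapter on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters. The sketch below applies this to the seven target modules, using the dimensions from the Yi-6B model config printed in the training log (hidden size 4096, intermediate size 11008, 32 layers, 32 attention heads, 4 key-value heads). The variable names are illustrative only.

```python
R = 16  # LoRA rank from LoraConfig(r=16)

HIDDEN, INTERMEDIATE = 4096, 11008
NUM_LAYERS, NUM_HEADS, NUM_KV_HEADS = 32, 32, 4
HEAD_DIM = HIDDEN // NUM_HEADS      # 128
KV_DIM = NUM_KV_HEADS * HEAD_DIM    # 512, because of grouped-query attention

# (input features, output features) of each targeted linear layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    # A is (r x in_features), B is (out_features x r)
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 36,306,944
```

The result matches the 36,306,944 trainable parameters in the log, which confirms that only the adapters are trained.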



***** Running Evaluation *****
  Num examples = 518
  Batch size = 8
***** Running Evaluation *****
  Num examples = 518
  Batch size = 8
***** Running Evaluation *****
  Num examples = 518
  Batch size = 8
Saving model checkpoint to ./drive/MyDrive/Yi-6B_QLoRA/tmp-checkpoint-307

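| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 1.3465 | 1.273141 |
| 200 | 1.2374 | 1.254405 |
| 300 | 1.2251 | 1.244891 |
| 400 | 1.1841 | 1.24356 |
| 500 | 1.1892 | 1.239184 |
| 600 | 1.179 | 1.235739 |
| 700 | 1.1436 | 1.243706 |
| 800 | 1.1245 | 1.243297 |
| 900 | 1.1092 | 1.243611 |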


***** Running Evaluation *****
  Num examples = 518
  Batch size = 8
***** Running Evaluation *****
  Num examples = 518
  Batch size = 8
Saving model checkpoint to ./drive/MyDrive/Yi-6B_QLoRA/tmp-checkpoint-921
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--01-ai--Yi-6B/snapshots/58332cd8ca449a6a28a424dde80d03321ab03b41/config.json
Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 4,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 5000000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_ve

TrainOutput(global_step=921, training_loss=1.1919173531630658, metrics={'train_runtime': 13513.6549, 'train_samples_per_second': 2.186, 'train_steps_per_second': 0.068, 'total_flos': 5.260809536156467e+17, 'train_loss': 1.1919173531630658, 'epoch': 2.99})
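The `global_step=921` and `'epoch': 2.99` values follow from the dataset size and the batch settings. The sketch below reproduces them. The floor division per epoch and the rounding up of the warmup steps are assumptions about how the trainer derives these values; they are consistent with the numbers in this log.

```python
import math

num_examples = 9846        # "Num examples" in the training log
per_device_batch = 8       # per_device_train_batch_size
grad_accum = 4             # gradient_accumulation_steps
num_epochs = 3             # num_train_epochs
warmup_ratio = 0.1

effective_batch = per_device_batch * grad_accum                 # 32
batches_per_epoch = math.ceil(num_examples / per_device_batch)  # 1231
# Assumption: leftover micro-batches at the end of an epoch do not form a step
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)     # 307
total_steps = updates_per_epoch * num_epochs                    # 921
warmup_steps = math.ceil(total_steps * warmup_ratio)            # 93
epochs_seen = total_steps * effective_batch / num_examples

print(total_steps, warmup_steps, round(epochs_seen, 2))
```

With `save_strategy="epoch"`, checkpoints land every 307 steps, which matches the `tmp-checkpoint-307` and `tmp-checkpoint-921` paths in the log.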

# Benchmarking

With optimum-benchmark: Memory-Efficiency and Inference Speed

More details about using optimum-benchmark in this article:

[Optimum-Benchmark: How Fast and Memory-Efficient Is Your LLM?](https://kaitchup.substack.com/p/optimum-benchmark-how-fast-and-memory)

In [None]:
!pip install optimum
!git clone https://github.com/huggingface/optimum-benchmark.git
!cd optimum-benchmark && pip install -e .
!pip install bitsandbytes
!pip install auto-gptq
!pip install autoawq
!pip install --upgrade transformers

Collecting optimum
  Downloading optimum-1.17.1-py3-none-any.whl (407 kB)
Collecting coloredlogs (from optimum)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
Collecting datasets (from optimum)
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11->optimum)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.10

Benchmarking Yi 6B 16-bit

In [None]:
import os
YAML_DEFAULT="""
defaults:
  - backend: pytorch # default backend
  - launcher: process # default launcher
  - benchmark: inference # default benchmark
  - experiment # inheriting experiment schema
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

experiment_name: Yi-6B

backend:
  device: cuda
  device_ids: 0
  no_weights: true
  torch_dtype: float16
  model: 01-ai/Yi-6B

launcher:
  device_isolation: true

benchmark:
  memory: true #We will monitor the memory usage
  warmup_runs: 10 #Before the monitoring starts, the inference will be run 10 times for warming up
  input_shapes:
    batch_size: 4
    sequence_length: 512
  new_tokens: 1000

# hydra/cli specific settings
hydra:
  run:
    # where to store run results
    dir: runs/${experiment_name}
  sweep:
    # where to store sweep results
    dir: sweeps/${experiment_name}
  job:
    # change working directory to the run directory
    chdir: true
    env_set:
      # set environment variable OVERRIDE_BENCHMARKS to 1
      # to not skip benchmarks that have been run before
      OVERRIDE_BENCHMARKS: 1
"""

with open("Yi-6B.yaml", 'w') as f:
  f.write(YAML_DEFAULT)
os.system("optimum-benchmark --config-dir ./ --config-name Yi-6B")

0
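
The benchmark configurations in this section share everything except `experiment_name` and the `backend` block. As a rough sketch, they could be generated from one template instead of being copied by hand. The names `HEAD`, `TAIL`, `VARIANTS`, and `make_config` are hypothetical helpers, not part of optimum-benchmark, and the bitsandbytes variant is left out because its backend block is nested.

```python
# Hypothetical helper: builds the YAML text used in this section from shared parts.
HEAD = """defaults:
  - backend: pytorch
  - launcher: process
  - benchmark: inference
  - experiment
  - _self_
  - override hydra/job_logging: colorlog
  - override hydra/hydra_logging: colorlog
"""

# Plain (non f-string) text so that ${experiment_name} is left for Hydra to resolve.
TAIL = """launcher:
  device_isolation: true

benchmark:
  memory: true
  warmup_runs: 10
  input_shapes:
    batch_size: 4
    sequence_length: 512
  new_tokens: 1000

hydra:
  run:
    dir: runs/${experiment_name}
  sweep:
    dir: sweeps/${experiment_name}
  job:
    chdir: true
    env_set:
      OVERRIDE_BENCHMARKS: 1
"""

# Backend options that differ between runs (values copied from the notebook cells).
VARIANTS = {
    "Yi-6B": {"torch_dtype": "float16", "model": "01-ai/Yi-6B"},
    "Yi-6B-gptq-4bit": {"model": "kaitchup/Yi-6B-gptq-4bit"},
    "Yi-6B-awq-4bit": {"model": "kaitchup/Yi-6B-awq-4bit"},
}


def make_config(name, backend_options):
    """Return the YAML text for one experiment."""
    backend = ["backend:", "  device: cuda", "  device_ids: 0", "  no_weights: true"]
    backend += [f"  {key}: {value}" for key, value in backend_options.items()]
    return HEAD + f"\nexperiment_name: {name}\n\n" + "\n".join(backend) + "\n\n" + TAIL


print(make_config("Yi-6B-gptq-4bit", VARIANTS["Yi-6B-gptq-4bit"]))
```

Each generated string can be written to `<name>.yaml` and passed to `optimum-benchmark --config-dir ./ --config-name <name>` exactly as in the cells of this section.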

In [None]:
!cat runs/Yi-6B/benchmark_report.json

{
    "prefill": {
        "memory": {
            "unit": "MB",
            "max_ram": 1745.854464,
            "max_vram": 0.0,
            "max_reserved": 13589.54496,
            "max_allocated": 13152.01792
        },
        "latency": {
            "unit": "s",
            "mean": 0.1487202606201172,
            "stdev": 0,
            "values": [
                0.1487202606201172
            ]
        },
        "throughput": {
            "unit": "tokens/s",
            "value": 13770.820407794321
        },
        "energy": null,
        "efficiency": null
    },
    "decode": {
        "memory": {
            "unit": "MB",
            "max_ram": 1746.1248,
            "max_vram": 0.0,
            "max_reserved": 14273.216512,
            "max_allocated": 13899.628032
        },
        "latency": {
            "unit": "s",
            "mean": 41.41763340377807,
            "stdev": 0,
            "values": [
                41.41763340377807
            ]
        },
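
The prefill throughput in this report can be reproduced from the configuration: it equals the number of prompt tokens (`batch_size` × `sequence_length`) divided by the prefill latency. The sketch below checks that against the reported value. The decode figure is computed the same way from `new_tokens`, but that formula is an assumption, since the decode throughput is cut off in the output above.

```python
# Values copied from the configuration and the report above.
batch_size = 4
sequence_length = 512
new_tokens = 1000
prefill_latency_s = 0.1487202606201172
decode_latency_s = 41.41763340377807

# Prompt tokens processed per second during prefill.
prefill_throughput = batch_size * sequence_length / prefill_latency_s

# Assumption: generated tokens per second, counted over the whole batch.
decode_throughput = batch_size * new_tokens / decode_latency_s

print(f"prefill: {prefill_throughput:.2f} tokens/s")
print(f"decode (estimated): {decode_throughput:.2f} tokens/s")
```

The prefill result agrees with the `13770.82` tokens/s in the report, which confirms how that number is derived.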
      

Benchmarking with 4-bit bitsandbytes

In [None]:
import os
YAML_DEFAULT="""
defaults:
  - backend: pytorch # default backend
  - launcher: process # default launcher
  - benchmark: inference # default benchmark
  - experiment # inheriting experiment schema
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

experiment_name: Yi-6B-bnb-4bit

backend:
  device: cuda
  device_ids: 0
  no_weights: true
  model: 01-ai/Yi-6B
  quantization_config:
    load_in_4bit: true

launcher:
  device_isolation: true

benchmark:
  memory: true #We will monitor the memory usage
  warmup_runs: 10 #Before the monitoring starts, the inference will be run 10 times for warming up
  input_shapes:
    batch_size: 4
    sequence_length: 512
  new_tokens: 1000

# hydra/cli specific settings
hydra:
  run:
    # where to store run results
    dir: runs/${experiment_name}
  sweep:
    # where to store sweep results
    dir: sweeps/${experiment_name}
  job:
    # change working directory to the run directory
    chdir: true
    env_set:
      # set environment variable OVERRIDE_BENCHMARKS to 1
      # to not skip benchmarks that have been run before
      OVERRIDE_BENCHMARKS: 1
"""

with open("Yi-6B-bnb-4bit.yaml", 'w') as f:
  f.write(YAML_DEFAULT)
os.system("optimum-benchmark --config-dir ./ --config-name Yi-6B-bnb-4bit")

0

In [None]:
!cat runs/Yi-6B-bnb-4bit/benchmark_report.json

{
    "prefill": {
        "memory": {
            "unit": "MB",
            "max_ram": 1579.159552,
            "max_vram": 0.0,
            "max_reserved": 26124.222464,
            "max_allocated": 25345.396224
        },
        "latency": {
            "unit": "s",
            "mean": 1.3678662109375,
            "stdev": 0,
            "values": [
                1.3678662109375
            ]
        },
        "throughput": {
            "unit": "tokens/s",
            "value": 1497.2224502836093
        },
        "energy": null,
        "efficiency": null
    },
    "decode": {
        "memory": {
            "unit": "MB",
            "max_ram": 1579.429888,
            "max_vram": 0.0,
            "max_reserved": 27598.52032,
            "max_allocated": 27249.014272
        },
        "latency": {
            "unit": "s",
            "mean": 50.4643653793335,
            "stdev": 0,
            "values": [
                50.4643653793335
            ]
        },
        "th
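
The `max_allocated` values in this report (about 25 GB) are higher than in the 16-bit run (about 13 GB), which is not what 4-bit weights would be expected to produce. A back-of-the-envelope estimate of the weight memory helps read these numbers. The parameter count below is an approximation that takes "6B" literally, and the estimate ignores activations and the KV cache.

```python
# Rough weight-memory estimate; N_PARAMS is an assumption, not read from the model.
N_PARAMS = 6e9


def weight_memory_mb(n_params, bits_per_param):
    """Memory needed for the weights alone, in MB (10**6 bytes)."""
    return n_params * bits_per_param / 8 / 1e6


for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_mb(N_PARAMS, bits):,.0f} MB")
```

The reported 25 GB is close to the float32 estimate, which suggests that quantization may not have been applied in this run and that the model was loaded in full precision. This is an inference from the numbers, not something the logs confirm. One thing to check is whether the installed optimum-benchmark version expects an additional backend option alongside `quantization_config` (for example a `quantization_scheme` key, named here as an assumption).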

Benchmarking with 4-bit GPTQ

In [None]:
import os
YAML_DEFAULT="""
defaults:
  - backend: pytorch # default backend
  - launcher: process # default launcher
  - benchmark: inference # default benchmark
  - experiment # inheriting experiment schema
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

experiment_name: Yi-6B-gptq-4bit

backend:
  device: cuda
  device_ids: 0
  no_weights: true
  model: kaitchup/Yi-6B-gptq-4bit

launcher:
  device_isolation: true

benchmark:
  memory: true #We will monitor the memory usage
  warmup_runs: 10 #Before the monitoring starts, the inference will be run 10 times for warming up
  input_shapes:
    batch_size: 4
    sequence_length: 512
  new_tokens: 1000

# hydra/cli specific settings
hydra:
  run:
    # where to store run results
    dir: runs/${experiment_name}
  sweep:
    # where to store sweep results
    dir: sweeps/${experiment_name}
  job:
    # change working directory to the run directory
    chdir: true
    env_set:
      # set environment variable OVERRIDE_BENCHMARKS to 1
      # to not skip benchmarks that have been run before
      OVERRIDE_BENCHMARKS: 1
"""

with open("Yi-6B-gptq-4bit.yaml", 'w') as f:
  f.write(YAML_DEFAULT)
os.system("optimum-benchmark --config-dir ./ --config-name Yi-6B-gptq-4bit")

0

In [None]:
!cat runs/Yi-6B-gptq-4bit/benchmark_report.json

{
    "prefill": {
        "memory": {
            "unit": "MB",
            "max_ram": 5881.495552,
            "max_vram": 0.0,
            "max_reserved": 5532.286976,
            "max_allocated": 5102.75584
        },
        "latency": {
            "unit": "s",
            "mean": 0.17228857421875,
            "stdev": 0,
            "values": [
                0.17228857421875
            ]
        },
        "throughput": {
            "unit": "tokens/s",
            "value": 11887.033190022872
        },
        "energy": null,
        "efficiency": null
    },
    "decode": {
        "memory": {
            "unit": "MB",
            "max_ram": 5881.495552,
            "max_vram": 0.0,
            "max_reserved": 6241.124352,
            "max_allocated": 5850.68544
        },
        "latency": {
            "unit": "s",
            "mean": 41.75734943389888,
            "stdev": 0,
            "values": [
                41.75734943389888
            ]
        },
        "thr

Benchmarking with 4-bit AWQ

In [None]:
import os
YAML_DEFAULT="""
defaults:
  - backend: pytorch # default backend
  - launcher: process # default launcher
  - benchmark: inference # default benchmark
  - experiment # inheriting experiment schema
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

experiment_name: Yi-6B-awq-4bit

backend:
  device: cuda
  device_ids: 0
  no_weights: true
  model: kaitchup/Yi-6B-awq-4bit

launcher:
  device_isolation: true

benchmark:
  memory: true #We will monitor the memory usage
  warmup_runs: 10 #Before the monitoring starts, the inference will be run 10 times for warming up
  input_shapes:
    batch_size: 4
    sequence_length: 512
  new_tokens: 1000

# hydra/cli specific settings
hydra:
  run:
    # where to store run results
    dir: runs/${experiment_name}
  sweep:
    # where to store sweep results
    dir: sweeps/${experiment_name}
  job:
    # change working directory to the run directory
    chdir: true
    env_set:
      # set environment variable OVERRIDE_BENCHMARKS to 1
      # to not skip benchmarks that have been run before
      OVERRIDE_BENCHMARKS: 1
"""

with open("Yi-6B-awq-4bit.yaml", 'w') as f:
  f.write(YAML_DEFAULT)
os.system("optimum-benchmark --config-dir ./ --config-name Yi-6B-awq-4bit")

0

In [None]:
!cat runs/Yi-6B-awq-4bit/benchmark_report.json

{
    "prefill": {
        "memory": {
            "unit": "MB",
            "max_ram": 2151.776256,
            "max_vram": 0.0,
            "max_reserved": 5570.035712,
            "max_allocated": 5005.762048
        },
        "latency": {
            "unit": "s",
            "mean": 0.17603324890136718,
            "stdev": 0,
            "values": [
                0.17603324890136718
            ]
        },
        "throughput": {
            "unit": "tokens/s",
            "value": 11634.16577709993
        },
        "energy": null,
        "efficiency": null
    },
    "decode": {
        "memory": {
            "unit": "MB",
            "max_ram": 2151.776256,
            "max_vram": 0.0,
            "max_reserved": 6184.501248,
            "max_allocated": 5753.634304
        },
        "latency": {
            "unit": "s",
            "mean": 51.347379650115975,
            "stdev": 0,
            "values": [
                51.347379650115975
            ]
        },
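
To compare the four runs side by side, the reports can be loaded and reduced to a few headline numbers. This is a minimal sketch that assumes the `runs/<experiment_name>/benchmark_report.json` layout used above and only reads keys visible in the printed reports. The function names `summarize` and `compare` are hypothetical.

```python
import json
from pathlib import Path


def summarize(report):
    """Pick a few headline numbers out of one benchmark report dict."""
    return {
        "prefill_latency_s": report["prefill"]["latency"]["mean"],
        "decode_latency_s": report["decode"]["latency"]["mean"],
        "peak_allocated_mb": max(
            report["prefill"]["memory"]["max_allocated"],
            report["decode"]["memory"]["max_allocated"],
        ),
    }


def compare(runs_dir="runs"):
    """Summarize every report found under runs_dir, keyed by experiment name."""
    rows = {}
    for path in sorted(Path(runs_dir).glob("*/benchmark_report.json")):
        with open(path) as f:
            rows[path.parent.name] = summarize(json.load(f))
    return rows


# Prints nothing if no benchmark has been run yet.
for name, row in compare().items():
    print(
        f"{name:<18} prefill {row['prefill_latency_s']:.3f} s | "
        f"decode {row['decode_latency_s']:.1f} s | "
        f"peak {row['peak_allocated_mb']:,.0f} MB"
    )
```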
   

With Evaluation Harness


More details about using the Evaluation Harness in this article:

[Behind the OpenLLM Leaderboard: The Evaluation Harness](https://kaitchup.substack.com/p/behind-the-openllm-leaderboard-the)

In [None]:
!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git
!pip install bitsandbytes
!pip install --upgrade transformers
!pip install auto-gptq optimum autoawq

Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git
  Cloning https://github.com/EleutherAI/lm-evaluation-harness.git to /tmp/pip-req-build-shixds6c
  Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-shixds6c
  Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit dc90fecc8ba31b682d30e096980f678c18ddc435
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting accelerate>=0.21.0 (from lm_eval==0.4.1)
  Downloading accelerate-0.28.0-py3-none-any.whl (290 kB)
Collecting evaluate (from lm_eval==0.4.1)
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)

Yi 6B (original, not quantized)

In [None]:
!lm_eval --model hf --model_args pretrained=01-ai/Yi-6B --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 1 --batch_size 8 --output_path ./eval_harness/Yi-6B

2024-03-15 11:32:28.244804: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-15 11:32:28.244851: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-15 11:32:28.246334: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Downloading builder script: 100% 5.67k/5.67k [00:00<00:00, 19.2MB/s]
2024-03-15:11:32:35,968 INFO     [__main__.py:225] Verbosity set to INFO
2024-03-15:11:32:35,968 INFO     [__init__.py:373] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-03-15:11:32:41,903 INFO

Yi 6B quantized with BnB 4-bit

In [None]:
!lm_eval --model hf --model_args pretrained=kaitchup/Yi-6B-bnb-4bit --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 1 --batch_size 8 --output_path ./eval_harness/Yi-6B-bnb-4bit

2024-03-17 01:01:31.105648: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-17 01:01:31.105721: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-17 01:01:31.107172: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Downloading builder script: 100% 5.67k/5.67k [00:00<00:00, 17.8MB/s]
2024-03-17:01:01:38,684 INFO     [__main__.py:225] Verbosity set to INFO
2024-03-17:01:01:38,685 INFO     [__init__.py:373] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-03-17:01:01:45,016 INFO

Yi 6B quantized with GPTQ 4-bit

In [None]:
!lm_eval --model hf --model_args pretrained=kaitchup/Yi-6B-gptq-4bit --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 1 --batch_size 8 --output_path ./eval_harness/Yi-6B-gptq-4bit

2024-03-15 13:59:31.956389: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-15 13:59:31.956440: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-15 13:59:31.958016: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-15:13:59:37,062 INFO     [__main__.py:225] Verbosity set to INFO
2024-03-15:13:59:37,062 INFO     [__init__.py:373] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-03-15:13:59:42,971 INFO     [__main__.py:311] Selected Tasks: ['arc_challenge', 'hellaswag',

Yi 6B quantized with AWQ 4-bit

In [None]:
!lm_eval --model hf --model_args pretrained=kaitchup/Yi-6B-awq-4bit --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 1 --batch_size 8 --output_path ./eval_harness/Yi-6B-awq-4bit

2024-03-15 21:25:05.564965: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-15 21:25:05.565016: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-15 21:25:05.566898: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Downloading builder script: 100% 5.67k/5.67k [00:00<00:00, 21.0MB/s]
2024-03-15:21:25:14,036 INFO     [__main__.py:225] Verbosity set to INFO
2024-03-15:21:25:14,037 INFO     [__init__.py:373] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-03-15:21:25:20,100 INFO
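
Once the four evaluations have finished, the scores written under `./eval_harness/` can be collected into one view. The sketch below is only an illustration and rests on two assumptions: that each results file is JSON with a top-level `"results"` mapping from task name to metrics, and that metric keys look like `"acc,none"` with matching `"..._stderr,..."` entries. The exact file names and layout depend on the lm-evaluation-harness version, so the code simply searches for `*.json` files. `extract_scores` is a hypothetical helper name.

```python
import json
from pathlib import Path


def extract_scores(data):
    """Map task -> {metric: value}, dropping stderr entries and non-numeric fields."""
    scores = {}
    for task, metrics in data.get("results", {}).items():
        kept = {}
        for key, value in metrics.items():
            if "stderr" in key or not isinstance(value, (int, float)):
                continue
            kept[key.split(",")[0]] = value  # "acc,none" -> "acc"
        scores[task] = kept
    return scores


# Assumed location: the --output_path values used in the cells above.
for path in sorted(Path("eval_harness").rglob("*.json")):
    with open(path) as f:
        data = json.load(f)
    if isinstance(data, dict) and "results" in data:
        print(path)
        for task, metrics in extract_scores(data).items():
            print(f"  {task}: {metrics}")
```

Printing the file path alongside the scores keeps it clear which model each set of numbers belongs to.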