This notebook shows how to:

*   Fine-tune Qwen1.5-7B with QLoRA
*   Quantize Qwen1.5-7B with bitsandbytes NF4, GPTQ, and AWQ
*   Do inference with Transformers and vLLM
*   Benchmark Qwen1.5-7B with optimum-benchmark and the Evaluation Harness

Each section of this notebook can be run independently.



# Inference

With vLLM

In [None]:
!pip install vllm

Collecting vllm
  Downloading vllm-0.3.2-cp310-cp310-manylinux1_x86_64.whl (41.4 MB)
Collecting ninja (from vllm)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
Collecting ray>=2.9 (from vllm)
  Downloading ray-2.9.3-cp310-cp310-manylinux2014_x86_64.whl (64.9 MB)
Collecting torch==2.1.2 (from vllm)
  Downloading torch-2.1.2-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
Collecting transformers>=4.38.0 (from vllm)
  Downloading transformers-4.38.1-py3-n

Without quantization, loading Qwen1.5-7B requires at least 17 GB of GPU VRAM.
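As a rough sanity check on the 17 GB figure, the sizes of the four safetensors shards of Qwen1.5-7B (taken from the download progress lines in this notebook) can be summed to estimate the memory taken by the weights alone. This is only a back-of-the-envelope sketch: it treats the reported file sizes as the in-memory size, and the split between weights and headroom is an illustration, not a measurement.

```python
# Shard sizes in GB, copied from the download progress lines in this notebook.
shard_sizes_gb = [3.99, 3.96, 3.96, 3.54]

# Memory needed just to hold the 16-bit weights.
weights_gb = sum(shard_sizes_gb)

# Whatever remains of the stated 17 GB is what is left for the KV cache,
# activations, and the CUDA context.
stated_requirement_gb = 17
headroom_gb = stated_requirement_gb - weights_gb

print(f"Weights: {weights_gb:.2f} GB")
print(f"Headroom within {stated_requirement_gb} GB: {headroom_gb:.2f} GB")
```

The weights account for most of the budget, which is why the 4-bit checkpoints used later in the notebook leave much more room for the KV cache.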

In [None]:
import time
from vllm import LLM, SamplingParams

prompts = [
    "The best recipe for pasta is"
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=150)

loading_start = time.time()
llm = LLM(model="Qwen/Qwen1.5-7B")
print("--- Loading time: %s seconds ---" % (time.time() - loading_start))

generation_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print("--- Generation time: %s seconds ---" % (time.time() - generation_time))

for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)
    print('------')

config.json:   0%|          | 0.00/663 [00:00<?, ?B/s]

INFO 02-23 00:40:52 llm_engine.py:79] Initializing an LLM engine with config: model='Qwen/Qwen1.5-7B', tokenizer='Qwen/Qwen1.5-7B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)


tokenizer_config.json:   0%|          | 0.00/1.16k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/295 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 02-23 00:41:01 weight_utils.py:163] Using model weights format ['*.safetensors']


model-00003-of-00004.safetensors:   0%|          | 0.00/3.96G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/3.96G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/3.54G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/3.99G [00:00<?, ?B/s]

INFO 02-23 00:41:53 llm_engine.py:337] # GPU blocks: 2284, # CPU blocks: 512
INFO 02-23 00:41:55 model_runner.py:676] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 02-23 00:41:55 model_runner.py:680] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 02-23 00:42:01 model_runner.py:748] Graph capturing finished in 6 secs.
--- Loading time: 71.13580441474915 seconds ---


Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it]

--- Generation time: 2.1062939167022705 seconds ---
 to boil the water first, add the salt, then the pasta. It's a simple and straightforward method. But what if you don't have a pasta pot? You can still make delicious pasta, you just need to know how. 

Question: Why is pasta so simple to make but so hard to get right?

Pasta is simple to make because it only requires boiling water and adding salt, but it can be hard to get right because it is a complex dish that requires a lot of trial and error. It is also important to use the right type of pasta and to cook it for the right amount of time. Additionally, the pasta should be drained properly and the sauce should be added just before serving, as the pasta will continue to cook
------





The same test with the AWQ-quantized model.
The "Quantization" section below shows how this quantized model was made.

In [None]:
import time
from vllm import LLM, SamplingParams

prompts = [
    "The best recipe for pasta is"
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=150)

loading_start = time.time()
llm = LLM(model="kaitchup/Qwen1.5-7B-awq-4bit", quantization="awq")
print("--- Loading time: %s seconds ---" % (time.time() - loading_start))

generation_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print("--- Generation time: %s seconds ---" % (time.time() - generation_time))

for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)
    print('------')

INFO 02-17 04:39:38 llm_engine.py:79] Initializing an LLM engine with config: model='kaitchup/Qwen1.5-7B-awq-4bit', tokenizer='kaitchup/Qwen1.5-7B-awq-4bit', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 02-17 04:39:41 weight_utils.py:163] Using model weights format ['*.safetensors']
INFO 02-17 04:39:46 llm_engine.py:337] # GPU blocks: 3377, # CPU blocks: 512
INFO 02-17 04:39:48 model_runner.py:666] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 02-17 04:39:48 model_runner.py:670] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 02-17 04:39:55 model_runner.py:738] Graph capturing finished in 7 secs.
--- Loading time: 18.355043649673462 seconds ---


Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.50s/it]

--- Generation time: 1.5052404403686523 seconds ---
 to use good ingredients, to make a simple sauce, and to cook it with a lot of love. The result is a dish that is both comforting and healthy.
1. Bring a large pot of salted water to a boil. Add the pasta and cook, stirring occasionally, until al dente, 7 to 8 minutes. Reserve 1 cup of the pasta cooking water, then drain the pasta in a colander and return it to the pot.
2. While the pasta cooks, make the sauce. Heat the oil in a large skillet over medium heat. Add the garlic and cook, stirring, until fragrant, about 1 minute. Add the tomatoes, salt, and pepper and cook, stirring, until the tomatoes have softened
------
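The two vLLM runs above can be compared directly from their logs. The sketch below computes the loading and generation speedups and converts the reported number of GPU blocks into an approximate KV-cache capacity. The block size of 16 tokens is an assumption (vLLM's default at the time), and the dictionary keys are names chosen for this illustration.

```python
# Timings (seconds) and GPU block counts copied from the two vLLM logs above.
runs = {
    "bf16": {
        "load_s": 71.13580441474915,
        "gen_s": 2.1062939167022705,
        "gpu_blocks": 2284,
    },
    "awq-4bit": {
        "load_s": 18.355043649673462,
        "gen_s": 1.5052404403686523,
        "gpu_blocks": 3377,
    },
}

BLOCK_SIZE_TOKENS = 16  # assumption: vLLM's default KV-cache block size

base, quant = runs["bf16"], runs["awq-4bit"]
load_speedup = base["load_s"] / quant["load_s"]
gen_speedup = base["gen_s"] / quant["gen_s"]

print(f"Loading speedup:    {load_speedup:.2f}x")
print(f"Generation speedup: {gen_speedup:.2f}x")
for name, run in runs.items():
    cached_tokens = run["gpu_blocks"] * BLOCK_SIZE_TOKENS
    print(f"{name}: room for about {cached_tokens} cached tokens")
```

Two caveats apply. The loading time of the first run includes downloading the checkpoint, so the loading speedup overstates the effect of quantization. The generation times come from a single prompt each, so they are indicative at best.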





# Quantization

Bitsandbytes NF4

In [None]:
!pip install --upgrade transformers bitsandbytes accelerate datasets

Collecting transformers
  Downloading transformers-4.38.1-py3-none-any.whl (8.5 MB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Collecting datasets
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "Qwen/Qwen1.5-7B"
quant_path = 'Qwen1.5-7B-bnb-4bit'
tokenizer = AutoTokenizer.from_pretrained(model_name)
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, quantization_config=bnb_config
)


model.save_pretrained("./"+quant_path, safe_serialization=True)
tokenizer.save_pretrained("./"+quant_path)


tokenizer_config.json:   0%|          | 0.00/1.16k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/295 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


config.json:   0%|          | 0.00/663 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/31.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/3.99G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/3.96G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/3.96G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/3.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/138 [00:00<?, ?B/s]

('./Qwen1.5-7B-bnb-4bit/tokenizer_config.json',
 './Qwen1.5-7B-bnb-4bit/special_tokens_map.json',
 './Qwen1.5-7B-bnb-4bit/vocab.json',
 './Qwen1.5-7B-bnb-4bit/merges.txt',
 './Qwen1.5-7B-bnb-4bit/added_tokens.json',
 './Qwen1.5-7B-bnb-4bit/tokenizer.json')
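To see what `bnb_4bit_use_double_quant=True` buys, the storage cost per quantized weight can be worked out by hand. The sketch below follows the arithmetic given in the QLoRA paper: one 32-bit constant per block of 64 weights, and with double quantization those constants are themselves stored in 8 bits with one 32-bit constant per 256 of them. The block sizes are assumed to match the bitsandbytes defaults, and the function name is made up for this illustration.

```python
def nf4_bits_per_weight(
    double_quant: bool,
    block_size: int = 64,          # assumption: weights per quantization block
    second_block_size: int = 256,  # assumption: constants per second-level block
) -> float:
    """Average number of bits stored per 4-bit quantized weight."""
    weight_bits = 4.0
    if double_quant:
        # 8-bit first-level constants plus 32-bit second-level constants.
        overhead = 8 / block_size + 32 / (block_size * second_block_size)
    else:
        # One 32-bit constant per block.
        overhead = 32 / block_size
    return weight_bits + overhead


print(f"Without double quantization: {nf4_bits_per_weight(False):.3f} bits/weight")
print(f"With double quantization:    {nf4_bits_per_weight(True):.3f} bits/weight")
```

Only the linear layers are quantized, so the size of the saved checkpoint will be somewhat larger than this per-weight figure suggests.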

GPTQ

More details about GPTQ quantization in this article:

[Quantize and Fine-tune LLMs with GPTQ Using Transformers and TRL](https://kaitchup.substack.com/p/quantize-and-fine-tune-llms-with)


In [None]:
!pip install --upgrade transformers auto-gptq accelerate datasets
!python -m pip install git+https://github.com/huggingface/optimum.git

Collecting auto-gptq
  Downloading auto_gptq-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.5 MB)
Collecting rouge (from auto-gptq)
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Collecting gekko (from auto-gptq)
  Downloading gekko-1.0.6-py3-none-any.whl (12.2 MB)
Collecting peft>=0.5.0 (from auto-gptq)
  Downloading peft-0.8.2-py3-none-any.whl (183 kB)
Installing collected packages: rouge, gekko, peft, auto-gptq
Successfully installed auto-gptq-0.7.0 gekko-1.0.6 peft-0.8.2 rouge-1.0.1
Collecting git+https://github.com/huggingface/optimum.git
  Cloning https://github.com/huggingface/optimum.git to /tm

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer
import torch
model_path = 'Qwen/Qwen1.5-7B'
w = 4  # Quantize to 4-bit. Change to 2, 3, or 8 to quantize with another precision.

quant_path = 'Qwen1.5-7B-gptq-'+str(w)+'bit'

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
quantizer = GPTQQuantizer(bits=w, dataset="c4", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)

quantized_model.save_pretrained("./"+quant_path, safe_serialization=True)
tokenizer.save_pretrained("./"+quant_path)



tokenizer_config.json:   0%|          | 0.00/1.16k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/295 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


config.json:   0%|          | 0.00/663 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/31.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/3.99G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/3.96G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/3.96G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/3.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/138 [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/41.1k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/319M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Quantizing model.layers blocks :   0%|          | 0/32 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]



('./Qwen1.5-7B-gptq-4bit/tokenizer_config.json',
 './Qwen1.5-7B-gptq-4bit/special_tokens_map.json',
 './Qwen1.5-7B-gptq-4bit/vocab.json',
 './Qwen1.5-7B-gptq-4bit/merges.txt',
 './Qwen1.5-7B-gptq-4bit/added_tokens.json',
 './Qwen1.5-7B-gptq-4bit/tokenizer.json')
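Once the GPTQ checkpoint is saved, a quick way to check it is to load it back with Transformers and generate a few tokens. The helper below is a minimal sketch, not part of the original notebook: the function name is hypothetical, the sampling values simply mirror the vLLM example in the Inference section, and it assumes a CUDA GPU with `auto-gptq` and `optimum` installed.

```python
def generate_with_transformers(model_dir: str, prompt: str, max_new_tokens: int = 150) -> str:
    """Load a (quantized) checkpoint with Transformers and sample a completion."""
    # Imported lazily so that defining the helper does not need the GPU stack.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # The quantization settings are read from the checkpoint's config.json.
    model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.8,
        top_k=20,
    )
    # The decoded text contains the prompt followed by the completion.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (requires a GPU and the checkpoint saved by the cell above):
# print(generate_with_transformers("./Qwen1.5-7B-gptq-4bit", "The best recipe for pasta is"))
```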

AWQ

More details about AWQ quantization in this article:

[Fast and Small Llama 2 with Activation-Aware Quantization (AWQ)](https://kaitchup.substack.com/p/fast-and-small-llama-2-with-activation)


In [None]:
!pip install --upgrade transformers autoawq optimum accelerate

Collecting autoawq
  Downloading autoawq-0.2.1-cp310-cp310-manylinux2014_x86_64.whl (79 kB)
Collecting optimum
  Downloading optimum-1.17.0-py3-none-any.whl (407 kB)
Collecting zstandard (from autoawq)
  Downloading zstandard-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
Collecting autoawq-kernels (from autoawq)
  Downloading autoawq_kernels-0.0.5-cp310-cp310-manylinux2014_x86_64.whl (29.7 MB)
Installing collected packages: zstandard, autoawq-kernels, autoawq, optimum
  Attempti

In [None]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'Qwen/Qwen1.5-7B'
quant_path = 'Qwen1.5-7B-awq-4bit'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model with safetensors
model.save_quantized("./"+quant_path, safetensors=True)
tokenizer.save_pretrained("./"+quant_path)




Fetching 15 files:   0%|          | 0/15 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Generating validation split: 0 examples [00:00, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (57053 > 32768). Running this sequence through the model will result in indexing errors
AWQ: 100%|██████████| 32/32 [15:10<00:00, 28.44s/it]


('./Qwen1.5-7B-awq-4bit/tokenizer_config.json',
 './Qwen1.5-7B-awq-4bit/special_tokens_map.json',
 './Qwen1.5-7B-awq-4bit/vocab.json',
 './Qwen1.5-7B-awq-4bit/merges.txt',
 './Qwen1.5-7B-awq-4bit/added_tokens.json',
 './Qwen1.5-7B-awq-4bit/tokenizer.json')
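The `quant_config` used above determines how many bits each weight costs once the per-group metadata is counted. The sketch below estimates that cost under two assumptions that are not stated in this notebook: each group of `q_group_size` weights stores one fp16 scale, and, when `zero_point` is enabled, one zero point at the same precision as the weights.

```python
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

SCALE_BITS = 16  # assumption: one fp16 scale per group


def effective_bits_per_weight(cfg: dict) -> float:
    """Weight bits plus the per-group metadata amortized over the group."""
    per_group_bits = SCALE_BITS
    if cfg["zero_point"]:
        # assumption: the zero point is stored with w_bit bits
        per_group_bits += cfg["w_bit"]
    return cfg["w_bit"] + per_group_bits / cfg["q_group_size"]


bits = effective_bits_per_weight(quant_config)
print(f"Effective bits per weight: {bits}")
print(f"Compression vs. 16-bit weights: {16 / bits:.2f}x")
```

A smaller `q_group_size` generally improves accuracy at the price of more metadata, which is the trade-off this number makes visible.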

# Fine-tuning

QLoRA

More details about QLoRA fine-tuning in this article:

[QLoRa: Fine-Tune a Large Language Model on Your GPU](https://kaitchup.substack.com/p/qlora-fine-tune-a-large-language-model-on-your-gpu-27bed5a03e2b)

In [None]:
!pip install --upgrade bitsandbytes transformers peft accelerate datasets trl

Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting transformers
  Downloading transformers-4.37.2-py3-none-any.whl (8.4 MB)
Collecting peft
  Downloading peft-0.8.2-py3-none-any.whl (183 kB)
Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Collecting datasets
  Downloading datasets-2.17.0-py3-none-any.whl (536 kB)

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

model_name = "Qwen/Qwen1.5-7B"
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'

ds = load_dataset("timdettmers/openassistant-guanaco")
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, quantization_config=bnb_config, device_map={"": 0}
)
model = prepare_model_for_kbit_training(model)
#Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False # Gradient checkpointing is used by default but not compatible with caching

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)

training_arguments = TrainingArguments(
        output_dir="./drive/Mydrive/results_qlora",
        evaluation_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        log_level="debug",
        save_steps=50,
        logging_steps=50,
        learning_rate=2e-5,
        eval_steps=50,
        max_steps=300,
        warmup_steps=30,
        lr_scheduler_type="linear",
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()



tokenizer_config.json:   0%|          | 0.00/1.16k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/295 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Downloading readme:   0%|          | 0.00/395 [00:00<?, ?B/s]



Downloading data:   0%|          | 0.00/20.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.11M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

config.json:   0%|          | 0.00/663 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/31.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/3.99G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/3.96G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/3.96G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/3.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/138 [00:00<?, ?B/s]

Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
Currently training with a batch size of: 8
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 300
  Number of trainable parameters = 39,976,960


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 1.3694        | 1.489497        |
| 100  | 1.27          | 1.446679        |
| 150  | 1.2432        | 1.435053        |
| 200  | 1.2326        | 1.431196        |
| 250  | 1.2537        | 1.429125        |
| 300  | 1.2348        | 1.42962         |
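The number of trainable parameters reported in the training log (39,976,960) can be reproduced from the LoRA configuration. For each targeted linear layer, LoRA adds two matrices of rank `r`, which amounts to `r * (in_features + out_features)` parameters. The layer shapes below are derived from the `hidden_size`, `intermediate_size`, and `num_hidden_layers` values of the model configuration printed in the log; the helper name is chosen for this sketch.

```python
# Values taken from the model configuration and from peft_config.
hidden_size = 4096
intermediate_size = 11008
num_layers = 32
r = 16

# (in_features, out_features) of each targeted linear layer in one decoder block.
# The k/v projections are square here because num_key_value_heads equals
# num_attention_heads for this model.
target_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, hidden_size),
    "v_proj": (hidden_size, hidden_size),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by LoRA: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_block = sum(lora_params(i, o, r) for i, o in target_shapes.values())
total = per_block * num_layers
print(f"Trainable LoRA parameters: {total:,}")
```

The count scales linearly with `r`, so halving the rank to 8 would halve the number of trainable parameters.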


***** Running Evaluation *****
  Num examples = 518
  Batch size = 8
Saving model checkpoint to ./drive/Mydrive/results_qlora/tmp-checkpoint-50
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen1.5-7B/snapshots/e52fa2ef47411cc8bc9f752d1d8d9072b37742e7/config.json
Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.37.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936


TrainOutput(global_step=300, training_loss=1.2672652435302734, metrics={'train_runtime': 4462.5801, 'train_samples_per_second': 0.538, 'train_steps_per_second': 0.067, 'total_flos': 5.160105818947584e+16, 'train_loss': 1.2672652435302734, 'epoch': 0.24})
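After training, the adapter saved in a checkpoint directory can be attached to the 4-bit base model for inference. The following is a minimal sketch rather than the notebook's own procedure: the helper name is hypothetical, and the checkpoint path in the example call is an assumption, since the log only shows a temporary checkpoint directory.

```python
def load_finetuned(base_model_name: str, adapter_dir: str):
    """Load the 4-bit base model and attach a trained LoRA adapter to it."""
    # Imported lazily so that defining the helper does not need the GPU stack.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Same quantization settings as the ones used for fine-tuning.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name, quantization_config=bnb_config, device_map={"": 0}
    )
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    return model, tokenizer


# Example call. The checkpoint directory name is assumed, not taken from the log:
# model, tokenizer = load_finetuned("Qwen/Qwen1.5-7B", "./drive/Mydrive/results_qlora/checkpoint-300")
```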

# Benchmarking

With optimum-benchmark: Memory-Efficiency and Inference Speed

More details about using optimum-benchmark in this article:

[Optimum-Benchmark: How Fast and Memory-Efficient Is Your LLM?](https://kaitchup.substack.com/p/optimum-benchmark-how-fast-and-memory)

In [None]:
!pip install optimum
!python -m pip install git+https://github.com/huggingface/optimum-benchmark.git
!pip install bitsandbytes
!pip install auto-gptq
!pip install autoawq
!pip install --upgrade transformers

Collecting optimum
  Downloading optimum-1.17.0-py3-none-any.whl (407 kB)
Collecting coloredlogs (from optimum)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
Collecting datasets (from optimum)
  Downloading datasets-2.17.0-py3-none-any.whl (536 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->optimum)
  Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
Collecting pyarrow>=12.0.0 (from datasets->optimum)
  Downloading pyarrow-15.0.0-cp310-cp310-manylinux_2_28_x86_64.wh

Benchmarking Qwen1.5 7B fp16

In [None]:
import os
YAML_DEFAULT="""
defaults:
  - backend: pytorch # default backend
  - benchmark: inference # we will monitor the inference
  - launcher: process
  - experiment # inheriting from experiment config
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

hydra:
  run:
    dir: experiments/${experiment_name} #The results will be reported in this directory. Note that "experiment_name" refers to the configuration field name "experiment_name" below
  sweep:
    dir: experiments/${experiment_name}
  job:
    chdir: true
    env_set: #These are environment variables that you may want to set before running the benchmark
      CUDA_VISIBLE_DEVICES: 0
      CUDA_DEVICE_ORDER: PCI_BUS_ID

experiment_name: Qwen1.5-7B-fp16
model: Qwen/Qwen1.5-7B #The model that we want to evaluate. It can be from the Hugging Face Hub or local directory
device: cuda #Which device to use for the benchmark. We will use CUDA, i.e., the GPU

backend:
  torch_dtype: float16 #The model will be loaded with fp16

benchmark:
  memory: true #We will monitor the memory usage
  warmup_runs: 10 #Before the monitoring starts, the inference will be run 10 times for warming up

  new_tokens: 1000 #Inference will generate 1000 tokens
  input_shapes:
    sequence_length: 512 #Prompt will have 512 tokens
    batch_size: 4
"""

with open("Qwen1.5-7B-fp16.yaml", 'w') as f:
  f.write(YAML_DEFAULT)
os.system("optimum-benchmark --config-dir ./ --config-name Qwen1.5-7B-fp16")

0

In [None]:
!cat experiments/Qwen1.5-7B-fp16/benchmark_report.json

{
    "batch_size": 4,
    "sequence_length": 512,
    "num_new_tokens": 1000,
    "num_return_sequences": 1,
    "prefill": {
        "memory": {
            "max_vram_used(MB)": 20511,
            "max_memory_reserved(MB)": 19501,
            "max_memory_allocated(MB)": 18949
        },
        "latency": {
            "list[s]": [
                0.192934482,
                0.194949987,
                0.195894077,
                0.19556984,
                0.195216664,
                0.195924599,
                0.196097198,
                0.194839468,
                0.195526616,
                0.196561958,
                0.195556153,
                0.196075542,
                0.196143761,
                0.196234932,
                0.195558996,
                0.19638102,
                0.196230546,
                0.195916014,
                0.19560153,
                0.197120475,
                0.195986556,
                0.19717193,
                0.19626747,
  

Benchmarking with 4-bit GPTQ

In [None]:
import os
YAML_DEFAULT="""
defaults:
  - backend: pytorch # default backend
  - benchmark: inference # we will monitor the inference
  - launcher: process
  - experiment # inheriting from experiment config
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

hydra:
  run:
    dir: experiments/${experiment_name} #The results will be reported in this directory. Note that "experiment_name" refers to the configuration field name "experiment_name" below
  sweep:
    dir: experiments/${experiment_name}
  job:
    chdir: true
    env_set: #Environment variables to set before running the benchmark
      CUDA_VISIBLE_DEVICES: 0
      CUDA_DEVICE_ORDER: PCI_BUS_ID

experiment_name: Qwen1.5-7B-gptq-4bit
model: kaitchup/Qwen1.5-7B-gptq-4bit #The model to benchmark. It can come from the Hugging Face Hub or a local directory
device: cuda #Which device to use for the benchmark. We will use CUDA, i.e., the GPU

backend:
  torch_dtype: float16 #The model will be loaded with fp16

benchmark:
  memory: true #We will monitor the memory usage
  warmup_runs: 10 #Before the monitoring starts, the inference will be run 10 times for warming up

  new_tokens: 1000 #Inference will generate 1000 tokens
  input_shapes:
    sequence_length: 512 #Prompt will have 512 tokens
    batch_size: 4
"""

with open("Qwen1.5-7B-gptq-4bit.yaml", 'w') as f:
  f.write(YAML_DEFAULT)
os.system("optimum-benchmark --config-dir ./ --config-name Qwen1.5-7B-gptq-4bit")

0

In [None]:
!cat experiments/Qwen1.5-7B-gptq-4bit/benchmark_report.json

{
    "batch_size": 4,
    "sequence_length": 512,
    "num_new_tokens": 1000,
    "num_return_sequences": 1,
    "prefill": {
        "memory": {
            "max_vram_used(MB)": 11434,
            "max_memory_reserved(MB)": 10424,
            "max_memory_allocated(MB)": 9418
        },
        "latency": {
            "list[s]": [
                0.340050463,
                0.340974996,
                0.340027008,
                0.340453168,
                0.340192991,
                0.339539135,
                0.339419286,
                0.339408833,
                0.340415142,
                0.340898065,
                0.341128107,
                0.340863328,
                0.340189869,
                0.339947456,
                0.339436328,
                0.339495412,
                0.339523525,
                0.340904062,
                0.340851158,
                0.341105999,
                0.340976774,
                0.341066036,
                0.340194348

Benchmarking with 4-bit AWQ

In [None]:
import os
YAML_DEFAULT="""
defaults:
  - backend: pytorch # default backend
  - benchmark: inference # we will monitor the inference
  - launcher: process
  - experiment # inheriting from experiment config
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

hydra:
  run:
    dir: experiments/${experiment_name} #The results will be reported in this directory. Note that "experiment_name" refers to the configuration field name "experiment_name" below
  sweep:
    dir: experiments/${experiment_name}
  job:
    chdir: true
    env_set: #Environment variables to set before running the benchmark
      CUDA_VISIBLE_DEVICES: 0
      CUDA_DEVICE_ORDER: PCI_BUS_ID

experiment_name: Qwen1.5-7B-awq-4bit
model: kaitchup/Qwen1.5-7B-awq-4bit #The model to benchmark. It can come from the Hugging Face Hub or a local directory
device: cuda #Which device to use for the benchmark. We will use CUDA, i.e., the GPU

backend:
  torch_dtype: float16 #The model will be loaded with fp16

benchmark:
  memory: true #We will monitor the memory usage
  warmup_runs: 10 #Before the monitoring starts, the inference will be run 10 times for warming up

  new_tokens: 1000 #Inference will generate 1000 tokens
  input_shapes:
    sequence_length: 512 #Prompt will have 512 tokens
    batch_size: 4
"""

with open("Qwen1.5-7B-awq-4bit.yaml", 'w') as f:
  f.write(YAML_DEFAULT)
os.system("optimum-benchmark --config-dir ./ --config-name Qwen1.5-7B-awq-4bit")

0
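
The three benchmark configurations in this section differ only in `experiment_name` and `model`. As a minimal sketch, the files could be generated from one shared template instead of three copies. The `@EXPERIMENT_NAME@` and `@MODEL@` placeholders and the helper names `render_config` and `write_configs` are made up for this illustration; they are not part of optimum-benchmark.

```python
from pathlib import Path

# Shared template copied from the configs in this section (comments trimmed).
# The @...@ placeholders are hypothetical markers used only by this sketch.
# They are replaced with str.replace so that Hydra's ${experiment_name}
# interpolation is left untouched.
CONFIG_TEMPLATE = """
defaults:
  - backend: pytorch
  - benchmark: inference
  - launcher: process
  - experiment
  - _self_
  - override hydra/job_logging: colorlog
  - override hydra/hydra_logging: colorlog

hydra:
  run:
    dir: experiments/${experiment_name}
  sweep:
    dir: experiments/${experiment_name}
  job:
    chdir: true
    env_set:
      CUDA_VISIBLE_DEVICES: 0
      CUDA_DEVICE_ORDER: PCI_BUS_ID

experiment_name: @EXPERIMENT_NAME@
model: @MODEL@
device: cuda

backend:
  torch_dtype: float16

benchmark:
  memory: true
  warmup_runs: 10

  new_tokens: 1000
  input_shapes:
    sequence_length: 512
    batch_size: 4
"""

# Experiment name -> model ID, as used in this section.
MODELS = {
    "Qwen1.5-7B-fp16": "Qwen/Qwen1.5-7B",
    "Qwen1.5-7B-gptq-4bit": "kaitchup/Qwen1.5-7B-gptq-4bit",
    "Qwen1.5-7B-awq-4bit": "kaitchup/Qwen1.5-7B-awq-4bit",
}


def render_config(experiment_name, model):
    """Fill the two fields that vary between benchmark runs."""
    return (
        CONFIG_TEMPLATE
        .replace("@EXPERIMENT_NAME@", experiment_name)
        .replace("@MODEL@", model)
    )


def write_configs(config_dir="."):
    """Write one <experiment_name>.yaml file per model and return the paths."""
    paths = []
    for name, model in MODELS.items():
        path = Path(config_dir) / f"{name}.yaml"
        path.write_text(render_config(name, model))
        paths.append(path)
    return paths
```

After calling `write_configs()`, each file can be passed to the same `optimum-benchmark --config-dir ./ --config-name <experiment_name>` command used in the cells of this section.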

In [None]:
!cat experiments/Qwen1.5-7B-awq-4bit/benchmark_report.json

{
    "batch_size": 4,
    "sequence_length": 512,
    "num_new_tokens": 1000,
    "num_return_sequences": 1,
    "prefill": {
        "memory": {
            "max_vram_used(MB)": 11642,
            "max_memory_reserved(MB)": 10632,
            "max_memory_allocated(MB)": 9412
        },
        "latency": {
            "list[s]": [
                0.468850611,
                0.467203082,
                0.470352865,
                0.468097367,
                0.468764615,
                0.467559894,
                0.469575453,
                0.467329432,
                0.466830243,
                0.466693559,
                0.467303087,
                0.475011398,
                0.466497257,
                0.467188389,
                0.467027904,
                0.466514316,
                0.466937372,
                0.468115238,
                0.46681191,
                0.467071308,
                0.466831776,
                0.466269733
            ],
            "m

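The raw reports are long, so a small summary makes the three runs easier to compare. The sketch below reads only the keys visible in the outputs above (`prefill`, `memory`, `max_vram_used(MB)`, `latency`, `list[s]`). The helper names `load_report` and `summarize_prefill` are hypothetical, and the sample dictionary holds just the first three latency values of the fp16 report so that the example runs without the report files.

```python
import json
from statistics import mean


def load_report(path):
    """Load a benchmark_report.json file written by the benchmark run."""
    with open(path) as f:
        return json.load(f)


def summarize_prefill(report):
    """Return peak VRAM and mean prefill latency from one report dict."""
    prefill = report["prefill"]
    return {
        "max_vram_used_mb": prefill["memory"]["max_vram_used(MB)"],
        "mean_prefill_latency_s": mean(prefill["latency"]["list[s]"]),
    }


# Small sample shaped like the fp16 report (first three latency values only).
sample = {
    "batch_size": 4,
    "sequence_length": 512,
    "prefill": {
        "memory": {"max_vram_used(MB)": 20511},
        "latency": {"list[s]": [0.192934482, 0.194949987, 0.195894077]},
    },
}

summary = summarize_prefill(sample)
print(summary)

# With the real files, assuming the paths used in the cells above:
# for name in ["Qwen1.5-7B-fp16", "Qwen1.5-7B-gptq-4bit", "Qwen1.5-7B-awq-4bit"]:
#     print(name, summarize_prefill(load_report(f"experiments/{name}/benchmark_report.json")))
```

In the reports shown above, peak VRAM during prefill is 20511 MB for fp16, 11434 MB for GPTQ, and 11642 MB for AWQ, while the listed prefill latencies are roughly 0.20 s, 0.34 s, and 0.47 s respectively.
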
With Evaluation Harness


More details about using the Evaluation Harness in this article:

[Behind the OpenLLM Leaderboard: The Evaluation Harness](https://kaitchup.substack.com/p/behind-the-openllm-leaderboard-the)

In [None]:
!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git
!pip install bitsandbytes
!pip install --upgrade transformers
!pip install auto-gptq optimum autoawq

Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git
  Cloning https://github.com/EleutherAI/lm-evaluation-harness.git to /tmp/pip-req-build-y9jbprev
  Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-y9jbprev
  Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit f3b7917091afba325af3980a35d8a6dcba03dc3f
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting evaluate (from lm_eval==0.4.1)
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
     84.1/84.1 kB 1.4 MB/s eta 0:00:00
Collecting jsonlines (from lm_eval==0.4.1)
  Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)
Collecting pybind11>=2.6.2 (from lm_eval==0.4.1)
  Downloading pybind11-2.11.1-py3-none-a

In [None]:
!lm_eval --model hf --model_args pretrained=Qwen/Qwen1.5-7B --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 1 --batch_size 8 --output_path ./eval_harness/Qwen1.5-7B

2024-02-17 13:08:18.355350: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-17 13:08:18.355400: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-17 13:08:18.356973: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Downloading builder script: 100% 5.67k/5.67k [00:00<00:00, 21.1MB/s]
2024-02-17:13:08:22,147 INFO     [__main__.py:200] Verbosity set to INFO
2024-02-17:13:08:22,148 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-17:13:08:26,650 INFO

In [None]:
!lm_eval --model hf --model_args pretrained=kaitchup/Qwen1.5-7B-gptq-4bit --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 1 --batch_size 8 --output_path ./eval_harness/Qwen1.5-7B-gptq-4bit

2024-02-17 13:23:50.229429: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-17 13:23:50.229490: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-17 13:23:50.230842: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-17:13:23:53,584 INFO     [__main__.py:200] Verbosity set to INFO
2024-02-17:13:23:53,584 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-17:13:23:58,096 INFO     [__main__.py:276] Selected Tasks: ['arc_challenge', 'hellaswag',

In [None]:
!lm_eval --model hf --model_args pretrained=kaitchup/Qwen1.5-7B-awq-4bit --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 1 --batch_size 8 --output_path ./eval_harness/Qwen1.5-7B-awq-4bit

2024-02-17 13:52:57.981846: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-17 13:52:57.981996: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-17 13:52:57.983808: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-17:13:53:01,356 INFO     [__main__.py:200] Verbosity set to INFO
2024-02-17:13:53:01,356 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-17:13:53:05,848 INFO     [__main__.py:276] Selected Tasks: ['arc_challenge', 'hellaswag',
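
The three `lm_eval` calls above differ only in the model ID and the output directory. A minimal sketch of building them in a loop follows; it uses only the flags that appear in the commands above. The helper name `build_lm_eval_command` is hypothetical, and the `subprocess.run` call is left commented out so the block only prints the commands.

```python
import shlex

# Output directory name -> model ID, as used in the commands above.
MODELS = {
    "Qwen1.5-7B": "Qwen/Qwen1.5-7B",
    "Qwen1.5-7B-gptq-4bit": "kaitchup/Qwen1.5-7B-gptq-4bit",
    "Qwen1.5-7B-awq-4bit": "kaitchup/Qwen1.5-7B-awq-4bit",
}
TASKS = ["winogrande", "hellaswag", "arc_challenge"]


def build_lm_eval_command(name, pretrained):
    """Build the argument list for one evaluation run."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(TASKS),
        "--device", "cuda:0",
        "--num_fewshot", "1",
        "--batch_size", "8",
        "--output_path", f"./eval_harness/{name}",
    ]


for name, pretrained in MODELS.items():
    cmd = build_lm_eval_command(name, pretrained)
    print(shlex.join(cmd))
    # To actually run the evaluation (requires a GPU and the packages installed above):
    # import subprocess; subprocess.run(cmd, check=True)
```

Keeping the task list and few-shot setting in one place makes it harder for the three runs to drift apart, which matters when the scores are compared against each other.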