<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/Quantize_and_Evaluate_Mistral_NeMo_Minitron_8B_Base_and_Llama_3_1_Minitron_4B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*All the details in this article: [Mistral-NeMo: 4.1x Smaller with Quantized Minitron](https://newsletter.kaitchup.com/p/mistral-nemo-41x-smaller-with-quantized)*


To quantize, run, and evaluate the Minitron models with AutoRound and bitsandbytes, we need to install the following libraries.

*Note: At the time of writing, the latest stable version of Transformers does not support the Minitron models, so we also install Transformers from source.*

This notebook has only been tested on RTX 3090 and A40 GPUs. It should work with any NVIDIA GPU from the Ampere generation or more recent.
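
Before installing anything, it can help to confirm that the GPU meets this requirement. The following is a minimal sketch, assuming that "Ampere or more recent" corresponds to a CUDA compute capability of 8.0 or higher (the RTX 3090 and A40 both report 8.6). The helper name `is_ampere_or_newer` is hypothetical and not part of any library.

```python
def is_ampere_or_newer(capability):
    """Return True if a (major, minor) CUDA compute capability is 8.0 or higher."""
    major, _minor = capability
    return major >= 8


try:
    import torch
except ImportError:  # PyTorch may not be installed yet
    torch = None

if torch is not None and torch.cuda.is_available():
    # get_device_capability returns a (major, minor) tuple for the given device
    cap = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    status = "OK" if is_ampere_or_newer(cap) else "older than Ampere"
    print(f"{name}: compute capability {cap[0]}.{cap[1]} ({status})")
else:
    print("No CUDA GPU visible to PyTorch.")
```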

In [None]:
!pip install --upgrade transformers auto-round flash_attn optimum auto-gptq bitsandbytes
!pip install git+https://github.com/huggingface/transformers

Collecting transformers
  Downloading transformers-4.44.2-py3-none-any.whl.metadata (43 kB)
Collecting auto-round
  Downloading auto_round-0.3-py3-none-any.whl.metadata (18 kB)
Collecting flash_attn
  Downloading flash_attn-2.6.3.tar.gz (2.6 MB)
  Preparing metadata (setup.py) ... done
Collecting optimum
  Downloading optimum-1.21.4-py3-none-any.whl.metadata (19 kB)
Collecting auto-gptq
  Downloading auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting datasets (from auto-round)
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting intel-extension-f

# Quantization

## With bitsandbytes

Example for 8-bit quantization of nvidia/Mistral-NeMo-Minitron-8B-Base. Change "model_name" to quantize another model.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Preferred 16-bit dtype for this GPU. The 8-bit config below does not use it;
# it would be passed as bnb_4bit_compute_dtype for 4-bit loading.
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
else:
  compute_dtype = torch.float16

model_name = "nvidia/Mistral-NeMo-Minitron-8B-Base"
quant_path = 'Mistral-NeMo-Minitron-8B-Base-bnb-8bit'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantize the weights to 8-bit while loading the model
bnb_config = BitsAndBytesConfig(
        load_in_8bit=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, quantization_config=bnb_config
)

# Save the quantized model and tokenizer; safe_serialization=True writes safetensors files
model.save_pretrained("./"+quant_path, safe_serialization=True)
tokenizer.save_pretrained("./"+quant_path)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

('./Mistral-Nemo-Base-2407-bnb-8bit/tokenizer_config.json',
 './Mistral-Nemo-Base-2407-bnb-8bit/special_tokens_map.json',
 './Mistral-Nemo-Base-2407-bnb-8bit/tokenizer.json')
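
The 8-bit cell above computes `compute_dtype` but never passes it anywhere. For 4-bit loading, `BitsAndBytesConfig` takes a `bnb_4bit_compute_dtype` argument. Below is a minimal sketch of a 4-bit variant. NF4 and double quantization are illustrative assumptions, not settings stated in this notebook, and the helper names `bnb_4bit_kwargs` and `make_4bit_config` are hypothetical.

```python
def bnb_4bit_kwargs(bf16_supported):
    """Keyword arguments for a 4-bit BitsAndBytesConfig.

    NF4 and double quantization are illustrative choices, not settings
    taken from this notebook.
    """
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": True,
        # The dtype is given by name; BitsAndBytesConfig resolves it to a torch dtype
        "bnb_4bit_compute_dtype": "bfloat16" if bf16_supported else "float16",
    }


def make_4bit_config(bf16_supported):
    """Build the config; needs transformers, torch, and bitsandbytes installed."""
    from transformers import BitsAndBytesConfig
    return BitsAndBytesConfig(**bnb_4bit_kwargs(bf16_supported))


print(bnb_4bit_kwargs(bf16_supported=True))
```

In the cell above, `bnb_config = make_4bit_config(torch.cuda.is_bf16_supported())` would take the place of the 8-bit config, and `quant_path` should be renamed to match.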

## With AutoRound

Symmetric quantization.

Change "model_name" to quantize another model.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound
import torch

# Load the model in float16; AutoRound tunes the quantization on this copy
model_name = "nvidia/Mistral-NeMo-Minitron-8B-Base"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weights, one scale per group of 128 weights, symmetric quantization
bits, group_size, sym = 4, 128, True
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, batch_size=2, seqlen=512, sym=sym, gradient_accumulate_steps=4, device='cuda')
autoround.quantize()

# Save the quantized model
output_dir = "./AutoRound/Mistral-NeMo-Minitron-8B-Base-AutoRound-GPTQ-sym-4bit/"
autoround.save_quantized(output_dir)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

2024-08-24 16:11:39 INFO autoround.py L209: using torch.float16 for quantization tuning
2024-08-24 16:11:44,360 INFO utils.py L145: Note: detected 96 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-08-24 16:11:44,362 INFO utils.py L148: Note: NumExpr detected 96 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2024-08-24 16:11:44,363 INFO utils.py L161: NumExpr defaulting to 16 threads.
2024-08-24 16:11:44,550 INFO config.py L59: PyTorch version 2.1.0+cu118 available.


Downloading readme:   0%|          | 0.00/373 [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/921 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/33.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/10000 [00:00<?, ? examples/s]

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/internal/generation_utils#transformers.Cache)
2024-08-24 16:12:23 INFO autoround.py L1039: quantizing 1/40, model.layers.0
2024-08-24 16:13:29 INFO autoround.py L966: quantized 7/7 layers in the block, loss iter 0: 0.005015 -> iter 101: 0.001826
2024-08-24 16:13:30 INFO autoround.py L1039: quantizing 2/40, model.layers.1
2024-08-24 16:14:35 INFO autoround.py L966: quantized 7/7 layers in the block, loss iter 0: 0.001793 -> iter 178: 0.000575
2024-08-24 16:14:36 INFO autoround.py L1039: quantizing 3/40, model.layers.2
2024-08-24 16:15:42 INFO autoround.py L966: quantized 7/7 layers in the block, loss iter 0: 0.014820 -> iter 48: 0.001951
2024-08-24 16:15:43 INFO autoround.py L1039: quantizing 4/40, model.layers.3
2024-08-24 16:16:50 INFO autoround.py L966: quantized 7/7 layers in the block, lo
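
The cell above sets `bits=4`, `group_size=128`, and `sym=True`. To make these parameters concrete, here is a simplified sketch of symmetric group-wise quantization with plain round-to-nearest. It is not AutoRound's algorithm, which tunes the rounding instead of rounding naively, and the range convention (`[-8, 7]` with `scale = max|w| / 7`) is one common choice that I assume for illustration. A tiny `group_size` keeps the output readable.

```python
def quantize_sym(weights, bits=4, group_size=4):
    """Round-to-nearest symmetric quantization; returns (int codes, per-group scales)."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # One scale per group; the grid is centered on zero, so no zero-point is stored
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(min(qmax, max(qmin, round(w / scale))) for w in group)
    return codes, scales


def dequantize_sym(codes, scales, group_size=4):
    """Map integer codes back to floats using the scale of their group."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


weights = [0.12, -0.40, 0.33, 0.05, 1.50, -0.75, 0.20, -1.10]
codes, scales = quantize_sym(weights, bits=4, group_size=4)
restored = dequantize_sym(codes, scales, group_size=4)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"max reconstruction error: {max_err:.4f}")
```

Smaller groups mean more scales to store but a finer grid per group, which is the trade-off behind `group_size`.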

Same as above, but with asymmetric quantization (`sym=False`).

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "nvidia/Mistral-NeMo-Minitron-8B-Base"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size, sym = 4, 128, False  # sym=False selects asymmetric quantization
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, batch_size=2, seqlen=512, sym=sym, gradient_accumulate_steps=4, device='cuda')
autoround.quantize()
output_dir = "./AutoRound/Mistral-NeMo-Minitron-8B-Base-AutoRound-GPTQ-asym-4bit/"
autoround.save_quantized(output_dir)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

2024-08-24 17:02:32 INFO autoround.py L209: using torch.float16 for quantization tuning
2024-08-24 17:02:36,883 INFO utils.py L145: Note: detected 96 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-08-24 17:02:36,885 INFO utils.py L148: Note: NumExpr detected 96 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2024-08-24 17:02:36,886 INFO utils.py L161: NumExpr defaulting to 16 threads.
2024-08-24 17:02:37,047 INFO config.py L59: PyTorch version 2.1.0+cu118 available.
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/internal/generation_utils#transformers.Cache)
2024-08-24 17:02:48 INFO autoround.py L1039: quantizing 1/40, model.layers.0
2024-08-24 17:03:54 INFO autoround.py L966: quantized 7/7 layers in the block, loss iter 0: 0.002202 -> iter 3: 0.001155
2024-08-
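
Asymmetric quantization stores a zero-point per group in addition to the scale, so the grid spans the group's actual `[min, max]` range instead of a range centered on zero. The sketch below illustrates this with plain round-to-nearest on a single skewed group. It is a simplified illustration under my own conventions, not AutoRound's tuned rounding, and the function names are hypothetical.

```python
def quantize_asym(group, bits=4):
    """Round-to-nearest asymmetric quantization of one group.

    Returns (codes, scale, zero_point) with codes in [0, 2**bits - 1].
    """
    levels = 2 ** bits - 1            # 15 for 4-bit
    lo = min(min(group), 0.0)         # keep 0.0 inside the representable range
    hi = max(max(group), 0.0)
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero_point = round(-lo / scale)   # integer code that represents 0.0
    codes = [min(levels, max(0, round(w / scale) + zero_point)) for w in group]
    return codes, scale, zero_point


def dequantize_asym(codes, scale, zero_point):
    """Map integer codes back to floats."""
    return [(q - zero_point) * scale for q in codes]


# A skewed group: mostly positive values with one small negative value
group = [-0.10, 0.02, 0.45, 0.90, 1.30, 0.05, 0.60, 1.50]
codes, scale, zero_point = quantize_asym(group, bits=4)
restored = dequantize_asym(codes, scale, zero_point)

# Step size a symmetric 4-bit grid would need to cover the same group
sym_scale = max(abs(w) for w in group) / (2 ** (4 - 1) - 1)
print(f"asymmetric step: {scale:.4f}, zero-point: {zero_point}")
print(f"symmetric step:  {sym_scale:.4f}")
```

On a skewed group like this one, the asymmetric step is smaller, so the rounding error is lower. The cost is the extra zero-point stored for each group.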

Example of 2-bit quantization (not used in the article; contact me if you want the model).

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "nvidia/Mistral-NeMo-Minitron-8B-Base"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size, sym = 2, 128, True  # 2-bit weights, symmetric quantization
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, batch_size=2, seqlen=512, sym=sym, gradient_accumulate_steps=4, device='cuda')
autoround.quantize()
output_dir = "./AutoRound/Mistral-NeMo-Minitron-8B-Base-AutoRound-GPTQ-sym-2bit/"
autoround.save_quantized(output_dir)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

2024-08-22 20:02:05 INFO autoround.py L209: using torch.float16 for quantization tuning
2024-08-22 20:02:08,430 INFO config.py L59: PyTorch version 2.1.0+cu118 available.
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
2024-08-22 20:02:21 INFO autoround.py L1039: quantizing 1/40, model.layers.0
2024-08-22 20:03:16 INFO autoround.py L966: quantized 7/7 layers in the block, loss iter 0: 0.197240 -> iter 193: 0.001448
2024-08-22 20:03:17 INFO autoround.py L1039: quantizing 2/40, model.layers.1
2024-08-22 20:04:12 INFO autoround.py L966: quantized 7/7 layers in the block, loss iter 0: 0.003790 -> iter 152: 0.000544
2024-08-22 20:04:13 INFO autoround.py L1039: quantizing 3/40, model.layers.2
2024-08-22 20:05:08 INFO autoround.py L966: quantized 7/7 layers in the block, loss iter 0: 0.2227
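
To get a feel for what 2-bit saves over 4-bit, the storage of the quantized weights can be estimated with simple arithmetic. The sketch below assumes a GPTQ-style layout in which each group stores one 16-bit scale and one zero-point packed at the weight bit width. It ignores everything kept in 16-bit (embeddings, norms, output head) and any other metadata, so real checkpoints are larger. The parameter count is a round placeholder, not the exact size of any model in this notebook.

```python
def effective_bits(bits, group_size, scale_bits=16):
    """Average bits stored per weight, counting per-group scale and zero-point."""
    zero_point_bits = bits  # assumption: zero-points are packed at the weight bit width
    return bits + (scale_bits + zero_point_bits) / group_size


def size_gb(n_params, bits_per_weight):
    """Storage in gigabytes (10**9 bytes) for n_params weights."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8e9  # placeholder round number, not an exact parameter count

for bits in (4, 2):
    bpw = effective_bits(bits, group_size=128)
    ratio = 16 / bpw
    print(f"{bits}-bit: {bpw:.3f} bits/weight, {size_gb(N_PARAMS, bpw):.2f} GB, "
          f"{ratio:.2f}x smaller than 16-bit")
```

With `group_size=128`, the per-group overhead is small compared with the weights themselves, which is why the effective bit width stays close to the nominal one.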

Another example, for nvidia/Llama-3.1-Minitron-4B-Width-Base:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "nvidia/Llama-3.1-Minitron-4B-Width-Base"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size, sym = 4, 128, False
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, batch_size=2, seqlen=512, sym=sym, gradient_accumulate_steps=4, device='cuda')
autoround.quantize()
output_dir = "./AutoRound/Llama-3.1-Minitron-4B-Width-Base-AutoRound-GPTQ-asym-4bit/"
autoround.save_quantized(output_dir)

config.json:   0%|          | 0.00/906 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.05G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/126 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

2024-08-25 06:54:59 INFO autoround.py L209: using torch.float16 for quantization tuning
2024-08-25 06:55:01,134 INFO utils.py L145: Note: detected 96 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-08-25 06:55:01,135 INFO utils.py L148: Note: NumExpr detected 96 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2024-08-25 06:55:01,137 INFO utils.py L161: NumExpr defaulting to 16 threads.
2024-08-25 06:55:01,286 INFO config.py L59: PyTorch version 2.1.0+cu118 available.


Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/10000 [00:00<?, ? examples/s]

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/internal/generation_utils#transformers.Cache)
2024-08-25 06:55:40 INFO autoround.py L1039: quantizing 1/32, model.layers.0
The attention layers in this model are transitioning from computing the RoPE embeddings internally through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed `position_embeddings` (Tuple of tensors, containing cos and sin). In v4.45 `position_ids` will be removed and `position_embeddings` will be mandatory.
2024-08-25 06:56:23 INFO autoround.py L966: quantized 7/7 layers in the block, loss iter 0: 0.000005 -> iter 191: 0.000002
2024-08-25 06:56:23 INFO autoround.py L1039: quantizing 2/32, model.layers.1
2024-08-25 06:57:07 INFO autoround.py L966: quantized 7/7 layers in the block, loss iter 0: 0.000306 -> iter 63: 0.000150
2024-08-25 0

And the same for nvidia/Llama-3.1-Minitron-4B-Depth-Base:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "nvidia/Llama-3.1-Minitron-4B-Depth-Base"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size, sym = 4, 128, False
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, batch_size=2, seqlen=512, sym=sym, gradient_accumulate_steps=4, device='cuda')
autoround.quantize()
output_dir = "./AutoRound/Llama-3.1-Minitron-4B-Depth-Base-AutoRound-GPTQ-asym-4bit/"
autoround.save_quantized(output_dir)

config.json:   0%|          | 0.00/883 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/12.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.10G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/121 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

2024-08-25 15:22:29 INFO autoround.py L209: using torch.float16 for quantization tuning
2024-08-25 15:22:32,899 INFO utils.py L145: Note: detected 96 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-08-25 15:22:32,900 INFO utils.py L148: Note: NumExpr detected 96 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2024-08-25 15:22:32,902 INFO utils.py L161: NumExpr defaulting to 16 threads.
2024-08-25 15:22:33,044 INFO config.py L59: PyTorch version 2.1.0+cu118 available.


Downloading readme:   0%|          | 0.00/373 [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/921 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/33.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/10000 [00:00<?, ? examples/s]

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/internal/generation_utils#transformers.Cache)
2024-08-25 15:23:08 INFO autoround.py L1039: quantizing 1/16, model.layers.0
The attention layers in this model are transitioning from computing the RoPE embeddings internally through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed `position_embeddings` (Tuple of tensors, containing cos and sin). In v4.45 `position_ids` will be removed and `position_embeddings` will be mandatory.
2024-08-25 15:24:24 INFO autoround.py L966: quantized 7/7 layers in the block, loss iter 0: 0.000007 -> iter 191: 0.000002
2024-08-25 15:24:25 INFO autoround.py L1039: quantizing 2/16, model.layers.1
2024-08-25 15:25:40 INFO autoround.py L966: quantized 7/7 layers in the block, loss iter 0: 0.000103 -> iter 25: 0.000072
2024-08-25 1

# Evaluation

We need to install EleutherAI's LM Evaluation Harness from source:

In [None]:
!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git

Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git
  Cloning https://github.com/EleutherAI/lm-evaluation-harness.git to /tmp/pip-req-build-feqo38wu
  Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-feqo38wu
  Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit aab42ba836b4af28cc1c5c1e697ea334c6ea7ced
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting evaluate (from lm_eval==0.4.3)
  Downloading evaluate-0.4.2-py3-none-any.whl.metadata (9.3 kB)
Collecting jsonlines (from lm_eval==0.4.3)
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting numexpr (from lm_eval==0.4.3)
  Downloading numexpr-2.10.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.2 kB)
Collecting pybind11>=2.6.2 (from lm_eval==0.4.3)
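
The evaluation cells below all pass the same flags to `lm_eval` and differ only in `--model_args`. As a convenience, the command can be built programmatically. This is a minimal sketch using only the flags that appear in this notebook; the helper name `lm_eval_command` is hypothetical.

```python
import shlex

TASKS = "mmlu,arc_challenge,leaderboard_mmlu_pro"


def lm_eval_command(pretrained, extra_model_args=""):
    """Build the lm_eval argument list used in this notebook for one model."""
    model_args = f"pretrained={pretrained}"
    if extra_model_args:
        model_args += f",{extra_model_args}"
    return [
        "lm_eval", "--model", "hf",
        "--model_args", model_args,
        "--tasks", TASKS,
        "--device", "cuda:0",
        "--num_fewshot", "0",
        "--batch_size", "4",
        "--output_path", "./eval/",
    ]


base = "nvidia/Mistral-NeMo-Minitron-8B-Base"
for extra in ("", "load_in_8bit=True", "load_in_4bit=True"):
    # shlex.join quotes the arguments so the line can be pasted into a shell
    print(shlex.join(lm_eval_command(base, extra)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to run the evaluations in a loop instead of one cell per configuration.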

### nvidia/Mistral-NeMo-Minitron-8B-Base

In [None]:
!lm_eval --model hf --model_args pretrained=nvidia/Mistral-NeMo-Minitron-8B-Base --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

Running loglikelihood requests:   0%|                | 0/174850 [00:00<?, ?it/s]We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
Running loglikelihood requests: 100%|██| 174850/174850 [17:47<00:00, 163.80it/s]
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
2024-08-23:15:04:45,837 INFO     [evaluation_tracker.py:206] Saving results aggregated
hf (pretrained=nvidia/Mistral-NeMo-Minitron-8B-Base), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 4
|                 Tasks                 |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge                          |    1.0|non

In [None]:
!lm_eval --model hf --model_args pretrained=nvidia/Mistral-NeMo-Minitron-8B-Base,load_in_8bit=True --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-23:19:02:59,002 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-23:19:02:59,115 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-23:19:03:13,340 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-23:19:03:13,342 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-23:19:03:13,342 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'nvidia/Mistral-NeMo-Minitron-8B-Base', 'load_in_8bit': True}
2024-08-23:19:03:13,572 INFO     [huggingface.py:130] Using device 'cuda:0'
2024-08-23:19

In [None]:
!lm_eval --model hf --model_args pretrained=nvidia/Mistral-NeMo-Minitron-8B-Base,load_in_4bit=True --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:10:23:07,557 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:10:23:07,667 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:10:23:19,848 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:10:23:19,850 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:10:23:19,850 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'nvidia/Mistral-NeMo-Minitron-8B-Base', 'load_in_4bit': True}
2024-08-24:10:23:20,079 INFO     [huggingface.py:130] Using device 'cuda:0'
2024-08-24:10

In [None]:
!lm_eval --model hf --model_args pretrained=kaitchup/Mistral-NeMo-Minitron-8B-Base-AutoRound-GPTQ-sym-4bit --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-23:18:02:20,361 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-23:18:02:20,461 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-23:18:02:34,470 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-23:18:02:34,472 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-23:18:02:34,472 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'kaitchup/Mistral-NeMo-Minitron-8B-Base-AutoRound-GPTQ-sym-4bit'}
2024-08-23:18:02:34,723 INFO     [huggingface.py:130] Using device 'cuda:0'
config.js

In [None]:
!lm_eval --model hf --model_args pretrained=./AutoRound/Mistral-NeMo-Minitron-8B-Base-AutoRound-GPTQ-asym-4bit/ --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:18:47:15,965 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:18:47:16,057 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:18:47:28,545 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:18:47:28,547 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:18:47:28,547 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': './AutoRound/Mistral-NeMo-Minitron-8B-Base-AutoRound-GPTQ-asym-4bit/'}
2024-08-24:18:47:28,784 INFO     [huggingface.py:130] Using device 'cuda:0'
2024

### mistralai/Mistral-Nemo-Base-2407

In [None]:
!lm_eval --model hf --model_args pretrained=mistralai/Mistral-Nemo-Base-2407,dtype=float16 --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

100%|████████████████████████████████████████| 238/238 [00:00<00:00, 702.84it/s]
2024-08-23:17:36:57,910 INFO     [task.py:423] Building contexts for mmlu_high_school_psychology on rank 0...
100%|████████████████████████████████████████| 545/545 [00:00<00:00, 715.55it/s]
2024-08-23:17:36:58,693 INFO     [task.py:423] Building contexts for mmlu_human_sexuality on rank 0...
100%|████████████████████████████████████████| 131/131 [00:00<00:00, 717.06it/s]
2024-08-23:17:36:58,882 INFO     [task.py:423] Building contexts for mmlu_professional_psychology on rank 0...
100%|████████████████████████████████████████| 612/612 [00:00<00:00, 715.79it/s]
2024-08-23:17:36:59,761 INFO     [task.py:423] Building contexts for mmlu_public_relations on rank 0...
100%|████████████████████████████████████████| 110/110 [00:00<00:00, 715.22it/s]
2024-08-23:17:36:59,919 INFO     [task.py:423] Building contexts for mmlu_security_studies on rank 0...
100%|████████████████████████████████████████| 245/245 [00:00<0

In [None]:
!lm_eval --model hf --model_args pretrained=mistralai/Mistral-Nemo-Base-2407,load_in_8bit=True --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-23:19:47:15,078 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-23:19:47:15,164 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-23:19:47:28,019 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-23:19:47:28,022 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-23:19:47:28,022 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'mistralai/Mistral-Nemo-Base-2407', 'load_in_8bit': True}
2024-08-23:19:47:28,264 INFO     [huggingface.py:130] Using device 'cuda:0'
2024-08-23:19:47:

In [None]:
!lm_eval --model hf --model_args pretrained=mistralai/Mistral-Nemo-Base-2407,load_in_4bit=True --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:09:37:38,572 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:09:37:38,676 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:09:37:51,983 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:09:37:51,985 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:09:37:51,985 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'mistralai/Mistral-Nemo-Base-2407', 'load_in_4bit': True}
2024-08-24:09:37:52,212 INFO     [huggingface.py:130] Using device 'cuda:0'
config.json: 100%

In [None]:
!lm_eval --model hf --model_args pretrained=kaitchup/Mistral-Nemo-Base-2407-AutoRound-GPTQ-sym-4bit --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

100%|████████████████████████████████████████| 783/783 [00:01<00:00, 601.46it/s]
2024-08-23:15:09:39,978 INFO     [task.py:423] Building contexts for mmlu_nutrition on rank 0...
100%|████████████████████████████████████████| 306/306 [00:00<00:00, 708.83it/s]
2024-08-23:15:09:40,422 INFO     [task.py:423] Building contexts for mmlu_professional_accounting on rank 0...
100%|████████████████████████████████████████| 282/282 [00:00<00:00, 708.26it/s]
2024-08-23:15:09:40,831 INFO     [task.py:423] Building contexts for mmlu_professional_medicine on rank 0...
100%|████████████████████████████████████████| 272/272 [00:00<00:00, 706.95it/s]
2024-08-23:15:09:41,227 INFO     [task.py:423] Building contexts for mmlu_virology on rank 0...
100%|████████████████████████████████████████| 166/166 [00:00<00:00, 706.50it/s]
2024-08-23:15:09:41,468 INFO     [task.py:423] Building contexts for mmlu_econometrics on rank 0...
100%|████████████████████████████████████████| 114/114 [00:00<00:00, 709.42it/s]
2

In [None]:
!lm_eval --model hf --model_args pretrained=./AutoRound/Mistral-Nemo-Base-2407-AutoRound-GPTQ-asym-4bit --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


2024-08-25:16:34:46,861 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-25:16:34:46,967 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-25:16:35:00,418 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-25:16:35:00,420 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-25:16:35:00,420 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': './AutoRound/Mistral-Nemo-Base-2407-AutoRound-GPTQ-asym-4bit'}
2024-08-25:16:35:00,527 INFO     [huggingface.py:130] Using device 'cuda:0'
2024-08-25:1

### meta-llama/Meta-Llama-3.1-8B

In [None]:
!lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B,dtype=float16 --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:10:48:17,252 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:10:48:17,347 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:10:48:29,710 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:10:48:29,713 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:10:48:29,713 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'meta-llama/Meta-Llama-3.1-8B', 'dtype': 'float16'}
2024-08-24:10:48:29,974 INFO     [huggingface.py:130] Using device 'cuda:0'
config.json: 100%|█████

In [None]:
!lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B,load_in_8bit=True --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:11:08:01,372 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:11:08:01,476 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:11:08:16,263 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:11:08:16,265 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:11:08:16,265 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'meta-llama/Meta-Llama-3.1-8B', 'load_in_8bit': True}
2024-08-24:11:08:16,513 INFO     [huggingface.py:130] Using device 'cuda:0'
2024-08-24:11:08:17,0

In [None]:
!lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B,load_in_4bit=True --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:11:41:12,578 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:11:41:12,691 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:11:41:26,137 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:11:41:26,139 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:11:41:26,139 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'meta-llama/Meta-Llama-3.1-8B', 'load_in_4bit': True}
2024-08-24:11:41:26,372 INFO     [huggingface.py:130] Using device 'cuda:0'
2024-08-24:11:41:26,7

In [None]:
!lm_eval --model hf --model_args pretrained=kaitchup/Meta-Llama-3.1-8B-AutoRound-GPTQ-sym-4bit --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:12:03:35,651 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:12:03:35,780 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:12:03:50,124 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:12:03:50,126 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:12:03:50,126 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'kaitchup/Meta-Llama-3.1-8B-AutoRound-GPTQ-sym-4bit'}
2024-08-24:12:03:50,360 INFO     [huggingface.py:130] Using device 'cuda:0'
config.json: 100%|███

In [None]:
!lm_eval --model hf --model_args pretrained=./AutoRound/Meta-Llama-3.1-8B-AutoRound-GPTQ-asym-4bit --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


2024-08-25:15:46:02,156 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-25:15:46:02,250 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-25:15:46:14,968 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-25:15:46:14,970 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-25:15:46:14,970 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': './AutoRound/Meta-Llama-3.1-8B-AutoRound-GPTQ-asym-4bit'}
2024-08-25:15:46:15,071 INFO     [huggingface.py:130] Using device 'cuda:0'
2024-08-25:15:46:
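The `huggingface/tokenizers` warning in the output above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly before the process forks. A minimal sketch, assuming the `!lm_eval` shell commands inherit the notebook kernel's environment (the usual behavior of IPython's `!`), could be run once in a cell before the evaluations:

```python
import os

# Set the variable the warning asks for before any `!lm_eval` cell runs.
# "false" turns off the tokenizers library's thread parallelism,
# which is what triggers the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

This only silences the warning; it should not change the evaluation scores.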

### nvidia/Llama-3.1-Minitron-4B-Width-Base

In [None]:
!lm_eval --model hf --model_args pretrained=nvidia/Llama-3.1-Minitron-4B-Width-Base,dtype=float16 --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:13:29:10,893 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:13:29:10,996 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:13:29:25,356 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:13:29:25,359 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:13:29:25,359 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'nvidia/Llama-3.1-Minitron-4B-Width-Base', 'dtype': 'float16'}
2024-08-24:13:29:25,596 INFO     [huggingface.py:130] Using device 'cuda:0'
2024-08-24:1

In [None]:
!lm_eval --model hf --model_args pretrained=nvidia/Llama-3.1-Minitron-4B-Width-Base,load_in_8bit=True --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:13:42:33,070 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:13:42:33,182 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:13:42:46,698 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:13:42:46,700 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:13:42:46,700 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'nvidia/Llama-3.1-Minitron-4B-Width-Base', 'load_in_8bit': True}
2024-08-24:13:42:46,934 INFO     [huggingface.py:130] Using device 'cuda:0'
2024-08-24

In [None]:
!lm_eval --model hf --model_args pretrained=nvidia/Llama-3.1-Minitron-4B-Width-Base,load_in_4bit=True --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:14:42:04,039 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:14:42:04,131 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:14:42:18,942 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:14:42:18,944 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:14:42:18,944 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'nvidia/Llama-3.1-Minitron-4B-Width-Base', 'load_in_4bit': True}
2024-08-24:14:42:19,184 INFO     [huggingface.py:130] Using device 'cuda:0'
2024-08-24

In [None]:
!lm_eval --model hf --model_args pretrained=kaitchup/Llama-3.1-Minitron-4B-Width-Base-AutoRound-GPTQ-sym-4bit --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:14:12:51,656 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:14:12:51,749 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:14:13:04,680 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:14:13:04,682 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:14:13:04,682 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'kaitchup/Llama-3.1-Minitron-4B-Width-Base-AutoRound-GPTQ-sym-4bit'}
2024-08-24:14:13:04,894 INFO     [huggingface.py:130] Using device 'cuda:0'
2024-0
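The evaluation cells differ only in the `pretrained` value and the optional loading arguments (`dtype=float16`, `load_in_8bit=True`, `load_in_4bit=True`). As a sketch, the command strings could be generated rather than copied by hand. The helper name `build_lm_eval_command` and its parameters are hypothetical; the flags and values are the ones used in the cells of this notebook.

```python
import shlex

# Task list used by every evaluation cell in this notebook.
TASKS = "mmlu,arc_challenge,leaderboard_mmlu_pro"


def build_lm_eval_command(pretrained, extra_model_args=None,
                          batch_size=4, output_path="./eval/"):
    """Return the lm_eval command line for one model variant (hypothetical helper)."""
    model_args = [f"pretrained={pretrained}"]
    if extra_model_args:
        # e.g. {"load_in_8bit": True} becomes "load_in_8bit=True"
        model_args += [f"{key}={value}" for key, value in extra_model_args.items()]
    argv = [
        "lm_eval", "--model", "hf",
        "--model_args", ",".join(model_args),
        "--tasks", TASKS,
        "--device", "cuda:0",
        "--num_fewshot", "0",
        "--batch_size", str(batch_size),
        "--output_path", output_path,
    ]
    # shlex.join quotes any argument that needs it for the shell.
    return shlex.join(argv)


# Print the three bitsandbytes/float16 variants for one model.
for extra in ({"dtype": "float16"}, {"load_in_8bit": True}, {"load_in_4bit": True}):
    print(build_lm_eval_command("nvidia/Llama-3.1-Minitron-4B-Width-Base", extra))
```

Each printed line can be pasted into a cell after a leading `!`, or passed to `subprocess.run` with `shell=True`.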

### nvidia/Llama-3.1-Minitron-4B-Depth-Base

In [None]:
!lm_eval --model hf --model_args pretrained=nvidia/Llama-3.1-Minitron-4B-Depth-Base --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:14:57:49,411 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:14:57:49,493 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:14:58:02,316 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:14:58:02,319 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:14:58:02,319 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'nvidia/Llama-3.1-Minitron-4B-Depth-Base'}
2024-08-24:14:58:02,547 INFO     [huggingface.py:130] Using device 'cuda:0'
config.json: 100%|██████████████

In [None]:
!lm_eval --model hf --model_args pretrained=nvidia/Llama-3.1-Minitron-4B-Depth-Base,load_in_8bit=True --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:15:11:20,152 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:15:11:20,264 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:15:11:33,454 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:15:11:33,455 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:15:11:33,455 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'nvidia/Llama-3.1-Minitron-4B-Depth-Base', 'load_in_8bit': True}
2024-08-24:15:11:33,690 INFO     [huggingface.py:130] Using device 'cuda:0'
2024-08-24

In [None]:
!lm_eval --model hf --model_args pretrained=nvidia/Llama-3.1-Minitron-4B-Depth-Base,load_in_4bit=True --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:15:30:23,758 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:15:30:23,843 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:15:30:37,144 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:15:30:37,146 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:15:30:37,147 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'nvidia/Llama-3.1-Minitron-4B-Depth-Base', 'load_in_4bit': True}
2024-08-24:15:30:37,383 INFO     [huggingface.py:130] Using device 'cuda:0'
2024-08-24

In [None]:
!lm_eval --model hf --model_args pretrained=kaitchup/Llama-3.1-Minitron-4B-Depth-Base-AutoRound-GPTQ-sym-4bit --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-24:15:43:55,554 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-24:15:43:55,639 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-24:15:44:07,383 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-24:15:44:07,385 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-24:15:44:07,385 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'kaitchup/Llama-3.1-Minitron-4B-Depth-Base-AutoRound-GPTQ-sym-4bit'}
2024-08-24:15:44:07,665 INFO     [huggingface.py:130] Using device 'cuda:0'
config

In [None]:
!lm_eval --model hf --model_args pretrained=./AutoRound/Llama-3.1-Minitron-4B-Depth-Base-AutoRound-GPTQ-asym-4bit --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


2024-08-25:17:47:23,329 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-25:17:47:23,627 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-25:17:47:35,634 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-25:17:47:35,636 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-25:17:47:35,636 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': './AutoRound/Llama-3.1-Minitron-4B-Depth-Base-AutoRound-GPTQ-asym-4bit'}
2024-08-25:17:47:35,715 INFO     [huggingface.py:130] Using device 'cuda:0'
20

In [None]:
!lm_eval --model hf --model_args pretrained=./AutoRound/Llama-3.1-Minitron-4B-Width-Base-AutoRound-GPTQ-asym-4bit --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


2024-08-25:18:13:49,567 INFO     [__main__.py:279] Verbosity set to INFO
2024-08-25:18:13:49,669 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the official way to create groups with addition of group-wide configurations.
2024-08-25:18:14:03,823 INFO     [__main__.py:383] Selected Tasks: ['arc_challenge', 'leaderboard_mmlu_pro', 'mmlu']
2024-08-25:18:14:03,826 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-08-25:18:14:03,826 INFO     [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': './AutoRound/Llama-3.1-Minitron-4B-Width-Base-AutoRound-GPTQ-asym-4bit'}
2024-08-25:18:14:03,928 INFO     [huggingface.py:130] Using device 'cuda:0'
20
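Every run above writes its report under `./eval/`, so the scores can be gathered into one table afterwards. The sketch below rests on two assumptions about lm-eval's output that should be checked against the files actually produced: that reports are JSON files named `results_*.json` somewhere below the output path, and that each report has a top-level `results` mapping from task name to metrics keyed like `acc,none`. The function names are hypothetical.

```python
import glob
import json
import os


def extract_accuracy(report, tasks=("mmlu", "arc_challenge", "leaderboard_mmlu_pro")):
    """Pull one accuracy value per task out of a parsed report (assumed layout)."""
    scores = {}
    for task in tasks:
        metrics = report.get("results", {}).get(task, {})
        for key, value in metrics.items():
            # Assumed key format "acc,<filter>"; skips acc_norm and acc_stderr.
            if key.startswith("acc,"):
                scores[task] = value
                break
    return scores


def collect_reports(output_path="./eval/"):
    """Map each report file found under output_path to its per-task accuracy."""
    pattern = os.path.join(output_path, "**", "results_*.json")
    table = {}
    for path in sorted(glob.glob(pattern, recursive=True)):
        with open(path) as handle:
            table[path] = extract_accuracy(json.load(handle))
    return table


for path, scores in collect_reports().items():
    print(path, scores)
```

Tasks that are missing from a report, or that do not expose an `acc` metric under the assumed key, are simply left out of that row.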