Deprecate TransformerEvalWrapper, LMEvalInputRecorder #3617

namgyu-youn wants to merge 4 commits into pytorch:main

Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3617
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: deprecation"
Oh, does this work with gemma3 as well? I remember I needed to make some modifications before: 2a98f58
torchao/quantization/GPTQ/README.md (outdated)

> For using lm_eval as a calibration data source, you would:
> 1. Run lm_eval on your model with calibration tasks
> 2. Collect the inputs during that run using `MultiTensorInputRecorder`
I didn't include example code because: 1) I'm not familiar with the GPTQ workflow, and 2) the GPTQ API migration is a work in progress at #3517. How about focusing on the migration instead?
Updated the code example as a temporary placeholder; please take a look.
Since the GPTQ refactor is WIP, I think you could also wait for that to land before proceeding with this refactor?
I don't think we need to delay this PR on work with no fixed due date; we can handle that in a future PR. Moreover, #3517 uses a custom implementation for the calibration flow, not lm-eval.
Not yet, but isn't it
@jerryzh168 could you please take a look at this again? This PR aims to drop the wrapper classes in favor of running lm-eval directly.
Thanks, can you include the results from running quantize_and_upload and the GPTQ example as well, to make sure it works? Also please make sure to test gemma-3.
See https://huggingface.co/namgyu-youn/Llama-3.2-1B-AWQ-INT4 for a Llama 1B checkpoint. Actually, this PR was inspired by #3602 (comment) and has already been tested in that benchmark module. GPTQ will be done tomorrow.
Quantizing gemma-3 failed; filed it at #3648. I will look into it after this PR.
@jerryzh168 updated the GPTQ example and filed the gemma3 log in #3648. Could you please help me merge this?
Force-pushed from 77a9471 to 9f92623
I think we should make gemma3 work; otherwise this is making the support worse than before.
Understood. Presuming it's our issue, since the gemma3 issue hasn't been reported to lm-eval, let me investigate further.
@jerryzh168 The root cause of the gemma-3 issue is that the vision tower isn't compatible with text-based calibration data. There are two potential solutions:

1. Update AWQ to skip incompatible layers when they are encountered. I successfully quantized gemma-3 using AWQ with this approach (checkpoint: https://huggingface.co/namgyu-youn/gemma-3-12b-it-AWQ-INT4/blob/main/config.json).
2. A more robust solution would be to add a "layers to ignore" option, similar to llm-compressor's approach (ref: https://github.com/vllm-project/llm-compressor/blob/f68ff8dc2355b4fef2a47fdd9129d41c285aac3b/examples/multimodal_vision/README_internvl3.md?plain=1#L61). This would prevent multimodal issues affecting the vision tower and other sensitive layers.

What are your thoughts? My suggestion is to remove the wrappers (`TransformerEvalWrapper`, `LMEvalInputRecorder`) first.
We already have layers_to_ignore support through FqnToConfig, I think: https://docs.pytorch.org/ao/main/api_reference/generated/torchao.quantization.FqnToConfig.html#torchao.quantization.FqnToConfig
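For intuition, the `re:`-prefixed keys accepted by `FqnToConfig`/`ModuleFqnToConfig` match against fully-qualified module names, so "layers to ignore" falls out of simply not listing those layers. Below is a minimal, pure-Python sketch of that selection logic; the helper `select_fqns` is hypothetical and for illustration only, not torchao's actual matcher:

```python
import re

# Hypothetical helper: mimic how "re:"-prefixed patterns select
# fully-qualified module names (fqns) for quantization. Unmatched
# fqns (e.g. the vision tower) are simply left unquantized.
def select_fqns(fqns, patterns):
    selected = []
    for fqn in fqns:
        for pat in patterns:
            if pat.startswith("re:") and re.fullmatch(pat[3:], fqn):
                selected.append(fqn)
                break
    return selected

fqns = [
    "language_model.model.layers.0.self_attn.q_proj",
    "language_model.model.layers.0.mlp.gate_proj",
    "vision_tower.encoder.layers.0.self_attn.q_proj",  # ignored
]
patterns = [
    r"re:language_model\.model\.layers\..+\.mlp\..+_proj",
    r"re:language_model\.model\.layers\..+\.self_attn\..+_proj",
]

print(select_fqns(fqns, patterns))
# Only the language-model projections are selected.
```

This mirrors the `LAYER_PATTERNS` used in the gemma-3 repro later in this thread.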
Can you give more details here?
Vision tower layers can't be calibrated using text-only calibration data. Maybe we can add the following guard inside the transform step, so that layers with no recorded inputs are returned unquantized:

```python
if observed_linear.act_obs.inputs is None or len(observed_linear.act_obs.inputs) == 0:
    return observed_linear
```
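Filled out as a runnable sketch, the guard would make the convert step a no-op for layers that saw no calibration data. The observer classes below are stand-ins invented for illustration; torchao's real observed-linear structure differs:

```python
# Hypothetical stand-ins for an observed linear layer and its
# activation observer (not torchao's actual classes).
class FakeActObs:
    def __init__(self, inputs):
        self.inputs = inputs

class FakeObservedLinear:
    def __init__(self, inputs):
        self.act_obs = FakeActObs(inputs)
        self.converted = False

def convert(observed_linear):
    # Skip layers that never received calibration data, e.g. a
    # vision tower calibrated with text-only inputs.
    if observed_linear.act_obs.inputs is None or len(observed_linear.act_obs.inputs) == 0:
        return observed_linear
    observed_linear.converted = True  # real code would quantize here
    return observed_linear

text_layer = convert(FakeObservedLinear(inputs=[1.0, 2.0]))
vision_layer = convert(FakeObservedLinear(inputs=None))
print(text_layer.converted, vision_layer.converted)  # → True False
```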
We only need to calibrate the text part, I think. You can check https://huggingface.co/pytorch/gemma-3-27b-it-AWQ-INT4#quantization-recipe; we did selectively quantize some parts of the model.
Yeah, this is exactly what I wanted, thanks. So, back to the start: we can drop the external wrappers, right? Those are related to config setup, not the AWQ issue.
Force-pushed from 26ecc96 to fe0a08d
@jerryzh168 rebased on main and updated a little. The wrappers are already deprecated by #3720, so this PR should be easy to land.
Please still check that it works with the selective quantization; we can land it if it works, I think.
Is it possible to do this in a separate PR? The gemma3 check is quite resource-intensive. I have limited GPU access through an affiliate grant, but it's allocated for a specific purpose. RunPod is an option, but I'd like to minimize my spending.
@namgyu-youn OK, I can check this for you next week |
Repro: Gemma3 + AWQ (W4A16-INT)

```python
import torch
from huggingface_hub import create_repo, get_token
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM
from transformers import AutoModelForImageTextToText, AutoTokenizer, TorchAoConfig

from torchao.prototype.awq import AWQConfig
from torchao.quantization import (
    Int4WeightOnlyConfig,
    ModuleFqnToConfig,
    quantize_,
)
from torchao.quantization.quantize_.common.quantization_step import (
    QuantizationStep,
)
from torchao.quantization.quantize_.workflows import (
    Int4ChooseQParamsAlgorithm,
    Int4PackingFormat,
)

MODEL_ID = "google/gemma-3-27b-it"
USER_ID = "namgyu-youn"

# NOTE: This config requires an H100+ device
BASE_CONFIG = Int4WeightOnlyConfig(
    group_size=128,
    int4_packing_format=Int4PackingFormat.TILE_PACKED_TO_4D,
    int4_choose_qparams_algorithm=Int4ChooseQParamsAlgorithm.TINYGEMM,
)

# Quantize only the text (language-model) projections; the vision
# tower is left untouched since it can't use text-only calibration.
LAYER_PATTERNS = [
    r"re:language_model\.model\.layers\..+\.mlp\..+_proj",
    r"re:language_model\.model\.layers\..+\.self_attn\..+_proj",
]


def get_quant_config(linear_config):
    return ModuleFqnToConfig({pat: linear_config for pat in LAYER_PATTERNS})


# --- Load model ---
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, device_map="auto", dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# --- AWQ: prepare + calibrate ---
quantize_(
    model,
    get_quant_config(AWQConfig(BASE_CONFIG, step=QuantizationStep.PREPARE)),
    filter_fn=None,
)
evaluator.simple_evaluate(
    HFLM(pretrained=model, tokenizer=tokenizer),
    tasks=["mmlu_philosophy"],
    limit=30,
    batch_size=1,
)

# --- AWQ: convert ---
quantize_(
    model,
    get_quant_config(AWQConfig(BASE_CONFIG, step=QuantizationStep.CONVERT)),
    filter_fn=None,
)
quant_config = AWQConfig(BASE_CONFIG, step=QuantizationStep.PREPARE_FOR_LOADING)
model.config.quantization_config = TorchAoConfig(quant_config)

# --- Push to Hub ---
save_to = f"{USER_ID}/{MODEL_ID.split('/')[-1]}-AWQ-INT4"
token = get_token()
create_repo(save_to, token=token, exist_ok=True)
model.push_to_hub(save_to, token=token, safe_serialization=False)
tokenizer.push_to_hub(save_to, token=token)
```
Great, could you do a quick accuracy eval for the checkpoint as well?
@jerryzh168 I ran into the AWQ + gemma3 issue again (checkpoint: https://huggingface.co/namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2). The quantization recipe (including the fqn selection) worked well, and evaluation in lm-eval succeeded. However, I wasn't able to benchmark it in vLLM. Since my experience here is still limited, I'm not sure of the root cause, as I mentioned before. Would it be okay to hold off on the vLLM issue until a user-side issue report comes in?
Summary:

Follow-up to #3720. AWQ has been using `TransformerEvalWrapper`, which wraps an ao model to run `lm-eval`. Instead, update to run `lm-eval` directly by using `HFLM`.

Change log:

Test plan:
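The migration described in the summary can be sketched as follows: instead of wrapping the model in `TransformerEvalWrapper`, hand it straight to lm-eval's `HFLM`. The model ID below is an example; running this requires `lm_eval`, `transformers`, a model download, and real hardware, so treat it as a sketch under those assumptions rather than a tested script:

```python
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.2-1B"  # example model; any HF causal LM works

model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Previously: TransformerEvalWrapper(model, tokenizer, ...).run_eval(...)
# Now: pass the (possibly quantize_()-ed) model to lm-eval's HFLM directly.
results = evaluator.simple_evaluate(
    HFLM(pretrained=model, tokenizer=tokenizer),
    tasks=["wikitext"],
    limit=5,
)
print(results["results"])
```

The same pattern is what the gemma-3 repro above uses for calibration, so deprecating the wrappers leaves a single eval path.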