
Deprecate TransformerEvalWrapper, LMEvalInputRecorder #3617

Open
namgyu-youn wants to merge 4 commits into pytorch:main from namgyu-youn:drop-wrapper

Conversation

@namgyu-youn
Contributor

@namgyu-youn namgyu-youn commented Jan 10, 2026

Summary:
Follow-up to #3720: AWQ has been using TransformerEvalWrapper, which wraps an ao model to run lm-eval. This PR updates it to run lm-eval directly via HFLM.

Change log:

# Old
from torchao._models._eval import TransformerEvalWrapper

TransformerEvalWrapper(
    model=model,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
).run_eval(
    tasks=tasks,
    limit=calibration_limit,
)
# New (direct lm-eval)
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

evaluator.simple_evaluate(
    HFLM(pretrained=model, tokenizer=tokenizer),
    tasks=tasks,
    limit=calibration_limit,
    batch_size=1,
)
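For reference, simple_evaluate hands back lm-eval's results dict directly. A minimal sketch of reading a metric out of it; the dict below is a stub with the typical lm-eval layout (exact metric keys vary by task and lm-eval version), since a real run needs a model:

```python
# Stubbed results dict in the typical lm-eval layout (illustrative values,
# not from a real run); evaluator.simple_evaluate returns this shape.
results = {
    "results": {
        "mmlu_philosophy": {"acc,none": 0.70, "acc_stderr,none": 0.08},
    }
}

# Pull one task's accuracy out of the nested dict.
acc = results["results"]["mmlu_philosophy"]["acc,none"]
print(acc)
```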

Test plan:

python quantize_and_upload.py --model_id meta-llama/Llama-3.2-1B --quant AWQ-INT4

@pytorch-bot

pytorch-bot bot commented Jan 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3617

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 10, 2026
@namgyu-youn
Contributor Author

@pytorchbot label "topic: deprecation"

@pytorch-bot pytorch-bot bot added the module: deprecation Use this tag if this PR deprecates a feature label Jan 10, 2026
@jerryzh168
Contributor

jerryzh168 commented Jan 10, 2026

oh does this work with gemma3 as well? I remember I need to make some modifications before: 2a98f58

For using lm_eval as a calibration data source, you would:
1. Run lm_eval on your model with calibration tasks
2. Collect the inputs during that run using `MultiTensorInputRecorder`
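The two steps above boil down to a record-inputs-during-eval pattern. A plain-Python sketch of that pattern, for illustration only (`InputRecorder` here is a stand-in, not the real `MultiTensorInputRecorder` API):

```python
# Illustrative stand-in for the record-inputs-during-eval pattern: wrap a
# callable, capture every input it sees while an eval harness (here, a
# plain loop) drives the model, then reuse the captured inputs for
# calibration. Not the real MultiTensorInputRecorder API.
class InputRecorder:
    def __init__(self, fn):
        self.fn = fn
        self.recorded = []

    def __call__(self, *args):
        self.recorded.append(args)  # capture calibration inputs
        return self.fn(*args)

def fake_layer(x):
    return x * 2

layer = InputRecorder(fake_layer)
for sample in [1, 2, 3]:  # stands in for an lm-eval calibration run
    layer(sample)

print(layer.recorded)  # [(1,), (2,), (3,)]
```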
Contributor

code examples?

Contributor Author

@namgyu-youn namgyu-youn Jan 10, 2026

I didn't include example code because: 1) I'm not familiar with the GPTQ workflow, and 2) the GPTQ API migration is work in progress at #3517. How about focusing on the migration instead?

Contributor Author

Updated the code example as a temporary fill-in; please take a look.

Contributor

since GPTQ refactor is WIP, I think you can also wait for that to be landed before proceeding with this refactor?

Contributor Author

@namgyu-youn namgyu-youn Jan 16, 2026

I don't think we have to delay this PR on work with no fixed due date; we can handle that in a future PR. Moreover, #3517 uses a custom implementation for the calibration flow, not lm-eval.

@namgyu-youn
Contributor Author

namgyu-youn commented Jan 10, 2026

oh does this work with gemma3 as well? I remember I need to make some modifications before: 2a98f58

Not yet, but isn't that an lm-eval issue? Should we update this opaque wrapper whenever an architecture issue happens?

@namgyu-youn
Contributor Author

@jerryzh168 could you please take a look at this again? This PR aims to drop the wrapper classes in favor of direct lm-eval use.

@jerryzh168
Contributor

jerryzh168 commented Jan 16, 2026

thanks, can you include the results for running quantize_and_upload and the GPTQ example as well to make sure it works?

also please make sure to test gemma-3

@namgyu-youn
Contributor Author

namgyu-youn commented Jan 16, 2026

thanks, can you include the results for running quantize_and_upload and the GPTQ example as well to make sure it works?

https://huggingface.co/namgyu-youn/Llama-3.2-1B-AWQ-INT4 for a Llama 1B checkpoint. Actually, this PR was inspired by #3602 (comment) and has already been tested in that benchmark module. GPTQ will be done tomorrow.

also please make sure to test gemma-3

Quantizing gemma-3 failed; I filed it at #3648 and will look into it after this PR.

@namgyu-youn
Contributor Author

@jerryzh168 updated the GPTQ example and filed the gemma3 log in #3648. Could you please help me merge this?

@jerryzh168
Contributor

jerryzh168 commented Jan 28, 2026

I think we should make gemma3 work; otherwise this makes the support worse than before.

@namgyu-youn
Contributor Author

namgyu-youn commented Jan 29, 2026

I think we should make gemma3 work; otherwise this makes the support worse than before.

Understood. Presuming it's an issue on our side, since no gemma3 issue has been reported to lm-eval, let me investigate further.

@namgyu-youn namgyu-youn marked this pull request as draft January 29, 2026 07:59
@namgyu-youn
Contributor Author

@jerryzh168 The root cause of the gemma-3 issue is that the vision tower isn't compatible with text-based calibration data.

There are two potential solutions:

(1) Update AWQ to skip incompatible layers. I successfully quantized gemma-3 using AWQ with this approach (checkpoint: https://huggingface.co/namgyu-youn/gemma-3-12b-it-AWQ-INT4/blob/main/config.json)

(2) A more robust solution would be to add a "layers to ignore" option, similar to llm-compressor's approach (ref: https://github.com/vllm-project/llm-compressor/blob/f68ff8dc2355b4fef2a47fdd9129d41c285aac3b/examples/multimodal_vision/README_internvl3.md?plain=1#L61). This would prevent multimodal issues affecting the vision tower, sensitive layers (lm_head), and other components.

What are your thoughts? My suggestion is to remove the wrappers (TransformerEvalWrapper, LMEvalInputRecorder) and implement solution (1) in this PR, then add the ignore option to AWQ in a subsequent PR.

@jerryzh168
Contributor

we already have layers_to_ignore support through FqnToConfig (https://docs.pytorch.org/ao/main/api_reference/generated/torchao.quantization.FqnToConfig.html#torchao.quantization.FqnToConfig), I think
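The fqn-matching idea can be sketched with plain `re`; this is an illustration of the concept, not the real FqnToConfig implementation. Only modules whose fully-qualified name matches a pattern get a quantization config, so vision-tower layers fall through untouched:

```python
import re

# Patterns in the spirit of the "re:" keys used in this thread's repro
# script; names are illustrative.
PATTERNS = [
    r"language_model\.model\.layers\..+\.mlp\..+_proj",
    r"language_model\.model\.layers\..+\.self_attn\..+_proj",
]

def should_quantize(fqn):
    # Quantize only when the module's fully-qualified name matches a pattern.
    return any(re.fullmatch(p, fqn) for p in PATTERNS)

print(should_quantize("language_model.model.layers.0.mlp.up_proj"))  # True
print(should_quantize("vision_tower.encoder.layers.0.mlp.fc1"))      # False
```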

@jerryzh168
Contributor

Update AWQ to pass when encountering incompatible layers. I successfully quantized gemma-3 using AWQ with this approach

can you give more details here?

@namgyu-youn
Contributor Author

Update AWQ to pass when encountering incompatible layers. I successfully quantized gemma-3 using AWQ with this approach

can you give more details here?

Vision tower layers can't be calibrated with text-only calibration data (lm-eval). Since AWQ requires inputs to calculate scales, we can skip uncalibrated layers.

Maybe we can add the following block inside the transform step (_awq_transform)?

if observed_linear.act_obs.inputs is None or len(observed_linear.act_obs.inputs) == 0:
    return observed_linear
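A self-contained sketch of that guard with toy classes; `act_obs`/`inputs` mirror the names in the snippet above, but the real torchao observer API may differ:

```python
# Toy model of the proposed guard: during convert, leave any observed
# linear untouched if its observer never saw calibration inputs (e.g.
# vision-tower layers under text-only calibration).
class Observer:
    def __init__(self, inputs=None):
        self.inputs = inputs  # calibration inputs seen by the observer

class ObservedLinear:
    def __init__(self, inputs=None):
        self.act_obs = Observer(inputs)

def awq_transform(observed_linear):
    obs = observed_linear.act_obs
    if obs.inputs is None or len(obs.inputs) == 0:
        return "skipped"  # no calibration data: keep the layer as-is
    return "quantized"

print(awq_transform(ObservedLinear()))            # skipped
print(awq_transform(ObservedLinear([0.1, 0.2])))  # quantized
```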

@jerryzh168
Contributor

Update AWQ to pass when encountering incompatible layers. I successfully quantized gemma-3 using AWQ with this approach

can you give more details here?

Vision tower layers can't be calibrated with text-only calibration data (lm-eval). Since AWQ requires inputs to calculate scales, we can skip uncalibrated layers.

Maybe we can add the following block inside the transform step (_awq_transform)?

if observed_linear.act_obs.inputs is None or len(observed_linear.act_obs.inputs) == 0:
    return observed_linear

we only need to calibrate the text part, I think. You can check https://huggingface.co/pytorch/gemma-3-27b-it-AWQ-INT4#quantization-recipe; we selectively quantized some parts of the model there.

@namgyu-youn
Contributor Author

we only need to calibrate the text part I think, you can check https://huggingface.co/pytorch/gemma-3-27b-it-AWQ-INT4#quantization-recipe we did selectively quantize some part of the model

Yeah, this is exactly what I wanted, thanks. So back to the start, we can drop the external wrapper, right? Those are related to config setup, not the AWQ issue.

@namgyu-youn namgyu-youn marked this pull request as ready for review February 5, 2026 07:28
@namgyu-youn
Contributor Author

namgyu-youn commented Feb 5, 2026

@jerryzh168 rebased on main and updated a little bit. Wrappers are already deprecated by #3720, so this PR could be easy to land.

@jerryzh168
Contributor

we only need to calibrate the text part I think, you can check huggingface.co/pytorch/gemma-3-27b-it-AWQ-INT4#quantization-recipe we did selectively quantize some part of the model

Yeah, this is exactly what I wanted, thanks. So back to the start, we can drop the external wrapper, right? Those are related to config setup, not the AWQ issue.

please still check that it works with the selective quantization; we can land it if it works, I think

@namgyu-youn
Contributor Author

please still check that it works with the selective quantization; we can land it if it works, I think

Is it possible to do it in a separate PR? The gemma3 check is quite heavy since it requires substantial resources.

I have limited GPU access through an affiliate grant, but it's allocated for a specific purpose. RunPod is an option, but I'd like to minimize my spending.

@jerryzh168
Contributor

@namgyu-youn OK, I can check this for you next week

@namgyu-youn
Contributor Author

namgyu-youn commented Feb 20, 2026

Repro: Gemma3 + AWQ (W4A16-INT)

  • code:
import torch
from huggingface_hub import create_repo, get_token
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM
from transformers import AutoModelForImageTextToText, AutoTokenizer, TorchAoConfig

from torchao.prototype.awq import AWQConfig
from torchao.quantization import (
    Int4WeightOnlyConfig,
    ModuleFqnToConfig,
    quantize_,
)
from torchao.quantization.quantize_.common.quantization_step import (
    QuantizationStep,
)
from torchao.quantization.quantize_.workflows import (
    Int4ChooseQParamsAlgorithm,
    Int4PackingFormat,
)

MODEL_ID = "google/gemma-3-27b-it"
USER_ID = "namgyu-youn"

# NOTE: This config requires an H100+ device
BASE_CONFIG = Int4WeightOnlyConfig(
    group_size=128,
    int4_packing_format=Int4PackingFormat.TILE_PACKED_TO_4D,
    int4_choose_qparams_algorithm=Int4ChooseQParamsAlgorithm.TINYGEMM,
)

LAYER_PATTERNS = [
    r"re:language_model\.model\.layers\..+\.mlp\..+_proj",
    r"re:language_model\.model\.layers\..+\.self_attn\..+_proj",
]


def get_quant_config(linear_config):
    return ModuleFqnToConfig({pat: linear_config for pat in LAYER_PATTERNS})


# --- Load model ---
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, device_map="auto", dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# --- AWQ: prepare + calibrate ---
quantize_(
    model,
    get_quant_config(AWQConfig(BASE_CONFIG, step=QuantizationStep.PREPARE)),
    filter_fn=None,
)

evaluator.simple_evaluate(
    HFLM(pretrained=model, tokenizer=tokenizer),
    tasks=["mmlu_philosophy"],
    limit=30,
    batch_size=1,
)

# --- AWQ: convert ---
quantize_(
    model,
    get_quant_config(AWQConfig(BASE_CONFIG, step=QuantizationStep.CONVERT)),
    filter_fn=None,
)

quant_config = AWQConfig(BASE_CONFIG, step=QuantizationStep.PREPARE_FOR_LOADING)
model.config.quantization_config = TorchAoConfig(quant_config)

# --- Push to Hub ---
save_to = f"{USER_ID}/{MODEL_ID.split('/')[-1]}-AWQ-INT4"
token = get_token()
create_repo(save_to, token=token, exist_ok=True)
model.push_to_hub(save_to, token=token, safe_serialization=False)
tokenizer.push_to_hub(save_to, token=token)

@jerryzh168
Contributor

Repro: Gemma3 + AWQ (W4A16-INT)


great, could you do a quick accuracy eval for the checkpoint as well

@namgyu-youn
Contributor Author

@jerryzh168 I ran into the AWQ + gemma3 issue again (checkpoint: https://huggingface.co/namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2). The quantization recipe (including fqn) worked well, and evaluation inside lm-eval succeeded.

However, I wasn't able to benchmark it inside vLLM. Since my experience here is still limited, I'm not sure of the root cause, as I mentioned before. Would it be okay to hold off on the vLLM issue until a user-side issue report comes in?
