
Deprecate TransformerEvalWrapper, LMEvalInputRecorder #3617

Open
namgyu-youn wants to merge 4 commits into pytorch:main from namgyu-youn:drop-wrapper

Conversation

@namgyu-youn
Contributor

@namgyu-youn namgyu-youn commented Jan 10, 2026

Summary:
Follow-up to #3720: AWQ has been using TransformerEvalWrapper, which wraps an ao model to run lm-eval. This PR updates it to run lm-eval directly via HFLM.

Change log:

# Old
from torchao._models._eval import TransformerEvalWrapper

TransformerEvalWrapper(
    model=model,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
).run_eval(
    tasks=tasks,
    limit=calibration_limit,
)
# New (direct lm-eval)
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

evaluator.simple_evaluate(
    HFLM(pretrained=model, tokenizer=tokenizer),
    tasks=tasks,
    limit=calibration_limit,
    batch_size=1,
)
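For reference, simple_evaluate hands back lm-eval's results dict directly. A minimal sketch of reading a metric out of it; the dict below is a stub with the typical lm-eval layout (exact metric keys vary by task and lm-eval version), since a real run needs a model:

```python
# Stubbed results dict in the typical lm-eval layout (illustrative values,
# not from a real run); evaluator.simple_evaluate returns this shape.
results = {
    "results": {
        "mmlu_philosophy": {"acc,none": 0.70, "acc_stderr,none": 0.08},
    }
}

# Pull one task's accuracy out of the nested dict.
acc = results["results"]["mmlu_philosophy"]["acc,none"]
print(acc)
```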

Test plan:

python quantize_and_upload.py --model_id meta-llama/Llama-3.2-1B --quant AWQ-INT4

@pytorch-bot

pytorch-bot bot commented Jan 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3617

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 10, 2026
@namgyu-youn
Contributor Author

@pytorchbot label "topic: deprecation"

@pytorch-bot pytorch-bot bot added the module: deprecation Use this tag if this PR deprecates a feature label Jan 10, 2026
@jerryzh168
Contributor

jerryzh168 commented Jan 10, 2026

oh does this work with gemma3 as well? I remember I need to make some modifications before: 2a98f58

For using lm_eval as a calibration data source, you would:
1. Run lm_eval on your model with calibration tasks
2. Collect the inputs during that run using `MultiTensorInputRecorder`
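The two steps above boil down to a record-inputs-during-eval pattern. A plain-Python sketch of that pattern, for illustration only (`InputRecorder` here is a stand-in, not the real `MultiTensorInputRecorder` API):

```python
# Illustrative stand-in for the record-inputs-during-eval pattern: wrap a
# callable, capture every input it sees while an eval harness (here, a
# plain loop) drives the model, then reuse the captured inputs for
# calibration. Not the real MultiTensorInputRecorder API.
class InputRecorder:
    def __init__(self, fn):
        self.fn = fn
        self.recorded = []

    def __call__(self, *args):
        self.recorded.append(args)  # capture calibration inputs
        return self.fn(*args)

def fake_layer(x):
    return x * 2

layer = InputRecorder(fake_layer)
for sample in [1, 2, 3]:  # stands in for an lm-eval calibration run
    layer(sample)

print(layer.recorded)  # [(1,), (2,), (3,)]
```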
Contributor

code examples?

Contributor Author

@namgyu-youn namgyu-youn Jan 10, 2026

I didn't include example code because: 1) I'm not familiar with the GPTQ workflow, and 2) the GPTQ API migration is work in progress at #3517. How about focusing on the migration instead?

Contributor Author

Updated the code example as a temporary fill-in; please take a look.

Contributor

since GPTQ refactor is WIP, I think you can also wait for that to be landed before proceeding with this refactor?

Contributor Author

@namgyu-youn namgyu-youn Jan 16, 2026

I don't think we have to delay this PR on work with no fixed due date; we can handle that in a future PR. Moreover, #3517 uses a custom implementation for the calibration flow, not lm-eval.

@namgyu-youn
Contributor Author

namgyu-youn commented Jan 10, 2026

oh does this work with gemma3 as well? I remember I need to make some modifications before: 2a98f58

Not yet, but isn't that an lm-eval issue? Should we update this opaque wrapper whenever an architecture issue happens?

@namgyu-youn
Contributor Author

@jerryzh168 could you please take a look at this again? This PR aims to drop the wrapper classes in favor of direct lm-eval use.

@jerryzh168
Contributor

jerryzh168 commented Jan 16, 2026

thanks, can you include the results for running quantize_and_upload and the GPTQ example as well to make sure it works?

also please make sure to test gemma-3

@namgyu-youn
Contributor Author

namgyu-youn commented Jan 16, 2026

thanks, can you include the results for running quantize_and_upload and the GPTQ example as well to make sure it works?

https://huggingface.co/namgyu-youn/Llama-3.2-1B-AWQ-INT4 for a Llama 1B checkpoint. Actually, this PR was inspired by #3602 (comment) and has already been tested in that benchmark module. GPTQ will be done tomorrow.

also please make sure to test gemma-3

Quantizing gemma-3 failed; I filed it at #3648 and will look into it after this PR.

@namgyu-youn
Contributor Author

@jerryzh168 updated the GPTQ example and filed the gemma3 log in #3648. Could you please help me merge this?

@jerryzh168
Contributor

jerryzh168 commented Jan 28, 2026

I think we should make gemma3 work; otherwise this makes the support worse than before.

@namgyu-youn
Contributor Author

namgyu-youn commented Jan 29, 2026

I think we should make gemma3 work; otherwise this makes the support worse than before.

Understood. Presuming it's an issue on our side, since no gemma3 issue has been reported to lm-eval, let me investigate further.

@namgyu-youn namgyu-youn marked this pull request as draft January 29, 2026 07:59
@namgyu-youn
Contributor Author

@jerryzh168 The root cause of the gemma-3 issue is that the vision tower isn't compatible with text-based calibration data.

There are two potential solutions:

(1) Update AWQ to skip incompatible layers. I successfully quantized gemma-3 using AWQ with this approach (checkpoint: https://huggingface.co/namgyu-youn/gemma-3-12b-it-AWQ-INT4/blob/main/config.json)

(2) A more robust solution would be to add a "layers to ignore" option, similar to llm-compressor's approach (ref: https://github.com/vllm-project/llm-compressor/blob/f68ff8dc2355b4fef2a47fdd9129d41c285aac3b/examples/multimodal_vision/README_internvl3.md?plain=1#L61). This would prevent multimodal issues affecting the vision tower, sensitive layers (lm_head), and other components.

What are your thoughts? My suggestion is to remove the wrappers (TransformerEvalWrapper, LMEvalInputRecorder) and implement solution (1) in this PR, then add the ignore option to AWQ in a subsequent PR.

@jerryzh168
Contributor

we already have layers_to_ignore support through FqnToConfig (https://docs.pytorch.org/ao/main/api_reference/generated/torchao.quantization.FqnToConfig.html#torchao.quantization.FqnToConfig), I think
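The fqn-matching idea can be sketched with plain `re`; this is an illustration of the concept, not the real FqnToConfig implementation. Only modules whose fully-qualified name matches a pattern get a quantization config, so vision-tower layers fall through untouched:

```python
import re

# Patterns in the spirit of the "re:" keys used in this thread's repro
# script; names are illustrative.
PATTERNS = [
    r"language_model\.model\.layers\..+\.mlp\..+_proj",
    r"language_model\.model\.layers\..+\.self_attn\..+_proj",
]

def should_quantize(fqn):
    # Quantize only when the module's fully-qualified name matches a pattern.
    return any(re.fullmatch(p, fqn) for p in PATTERNS)

print(should_quantize("language_model.model.layers.0.mlp.up_proj"))  # True
print(should_quantize("vision_tower.encoder.layers.0.mlp.fc1"))      # False
```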

@jerryzh168
Contributor

Update AWQ to pass when encountering incompatible layers. I successfully quantized gemma-3 using AWQ with this approach

can you give more details here?

@namgyu-youn
Contributor Author

Update AWQ to pass when encountering incompatible layers. I successfully quantized gemma-3 using AWQ with this approach

can you give more details here?

Vision tower layers can't be calibrated with text-only calibration data (lm-eval). Since AWQ requires inputs to calculate scales, we can skip uncalibrated layers.

Maybe we can add the following block inside the transform step (_awq_transform)?

if observed_linear.act_obs.inputs is None or len(observed_linear.act_obs.inputs) == 0:
    return observed_linear
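A self-contained sketch of that guard with toy classes; `act_obs`/`inputs` mirror the names in the snippet above, but the real torchao observer API may differ:

```python
# Toy model of the proposed guard: during convert, leave any observed
# linear untouched if its observer never saw calibration inputs (e.g.
# vision-tower layers under text-only calibration).
class Observer:
    def __init__(self, inputs=None):
        self.inputs = inputs  # calibration inputs seen by the observer

class ObservedLinear:
    def __init__(self, inputs=None):
        self.act_obs = Observer(inputs)

def awq_transform(observed_linear):
    obs = observed_linear.act_obs
    if obs.inputs is None or len(obs.inputs) == 0:
        return "skipped"  # no calibration data: keep the layer as-is
    return "quantized"

print(awq_transform(ObservedLinear()))            # skipped
print(awq_transform(ObservedLinear([0.1, 0.2])))  # quantized
```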

@jerryzh168
Contributor

Update AWQ to pass when encountering incompatible layers. I successfully quantized gemma-3 using AWQ with this approach

can you give more details here?

Vision tower layers can't be calibrated with text-only calibration data (lm-eval). Since AWQ requires inputs to calculate scales, we can skip uncalibrated layers.

Maybe we can add the following block inside the transform step (_awq_transform)?

if observed_linear.act_obs.inputs is None or len(observed_linear.act_obs.inputs) == 0:
    return observed_linear

we only need to calibrate the text part, I think. You can check https://huggingface.co/pytorch/gemma-3-27b-it-AWQ-INT4#quantization-recipe; we selectively quantized some parts of the model there.

@namgyu-youn
Contributor Author

we only need to calibrate the text part I think, you can check https://huggingface.co/pytorch/gemma-3-27b-it-AWQ-INT4#quantization-recipe we did selectively quantize some part of the model

Yeah, this is exactly what I wanted, thanks. So back to the start, we can drop the external wrapper, right? Those are related to config setup, not the AWQ issue.

@namgyu-youn namgyu-youn marked this pull request as ready for review February 5, 2026 07:28
@namgyu-youn
Contributor Author

namgyu-youn commented Feb 5, 2026

@jerryzh168 rebased on main and updated a little bit. Wrappers are already deprecated by #3720, so this PR could be easy to land.

@jerryzh168
Contributor

we only need to calibrate the text part I think, you can check huggingface.co/pytorch/gemma-3-27b-it-AWQ-INT4#quantization-recipe we did selectively quantize some part of the model

Yeah, this is exactly what I wanted, thanks. So back to the start, we can drop the external wrapper, right? Those are related to config setup, not the AWQ issue.

please still check that it works with the selective quantization; we can land it if it works, I think

@namgyu-youn
Contributor Author

please still check that it works with the selective quantization; we can land it if it works, I think

Is it possible to do it in a separate PR? The gemma3 check is quite heavy since it requires substantial resources.

I have limited GPU access through an affiliate grant, but it's allocated for a specific purpose. RunPod is an option, but I'd like to minimize my spending.

@jerryzh168
Contributor

@namgyu-youn OK, I can check this for you next week

@namgyu-youn
Contributor Author

namgyu-youn commented Feb 20, 2026

Repro: Gemma3 + AWQ (W4A16-INT)

  • code:
import torch
from huggingface_hub import create_repo, get_token
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM
from transformers import AutoModelForImageTextToText, AutoTokenizer, TorchAoConfig

from torchao.prototype.awq import AWQConfig
from torchao.quantization import (
    Int4WeightOnlyConfig,
    ModuleFqnToConfig,
    quantize_,
)
from torchao.quantization.quantize_.common.quantization_step import (
    QuantizationStep,
)
from torchao.quantization.quantize_.workflows import (
    Int4ChooseQParamsAlgorithm,
    Int4PackingFormat,
)

MODEL_ID = "google/gemma-3-27b-it"
USER_ID = "namgyu-youn"

# NOTE: This config requires an H100+ device
BASE_CONFIG = Int4WeightOnlyConfig(
    group_size=128,
    int4_packing_format=Int4PackingFormat.TILE_PACKED_TO_4D,
    int4_choose_qparams_algorithm=Int4ChooseQParamsAlgorithm.TINYGEMM,
)

LAYER_PATTERNS = [
    r"re:language_model\.model\.layers\..+\.mlp\..+_proj",
    r"re:language_model\.model\.layers\..+\.self_attn\..+_proj",
]


def get_quant_config(linear_config):
    return ModuleFqnToConfig({pat: linear_config for pat in LAYER_PATTERNS})


# --- Load model ---
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, device_map="auto", dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# --- AWQ: prepare + calibrate ---
quantize_(
    model,
    get_quant_config(AWQConfig(BASE_CONFIG, step=QuantizationStep.PREPARE)),
    filter_fn=None,
)

evaluator.simple_evaluate(
    HFLM(pretrained=model, tokenizer=tokenizer),
    tasks=["mmlu_philosophy"],
    limit=30,
    batch_size=1,
)

# --- AWQ: convert ---
quantize_(
    model,
    get_quant_config(AWQConfig(BASE_CONFIG, step=QuantizationStep.CONVERT)),
    filter_fn=None,
)

quant_config = AWQConfig(BASE_CONFIG, step=QuantizationStep.PREPARE_FOR_LOADING)
model.config.quantization_config = TorchAoConfig(quant_config)

# --- Push to Hub ---
save_to = f"{USER_ID}/{MODEL_ID.split('/')[-1]}-AWQ-INT4"
token = get_token()
create_repo(save_to, token=token, exist_ok=True)
model.push_to_hub(save_to, token=token, safe_serialization=False)
tokenizer.push_to_hub(save_to, token=token)

@jerryzh168
Contributor

Repro: Gemma3 + AWQ (W4A16-INT)


great, could you do a quick accuracy eval for the checkpoint as well

@namgyu-youn
Contributor Author

@jerryzh168 I ran into the AWQ + gemma3 issue again (checkpoint: https://huggingface.co/namgyu-youn/gemma-3-27b-it-AWQ-INT4-v2). The quantization recipe (including fqn) worked well, and evaluation inside lm-eval succeeded.

However, I wasn't able to benchmark it inside vLLM. Since my experience here is still limited, I'm not sure of the root cause, as I mentioned before. Would it be okay to hold off on the vLLM issue until a user-side issue report comes in?
