
Allow compile with bnb #38886


Open · wants to merge 4 commits into main

Conversation

@SunMarc (Member) commented Jun 18, 2025

What does this PR do?

This PR enables compilation when generating with bnb models. This is supported with the latest bnb release.
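For context, a minimal sketch of the intended usage, assuming `CompileConfig` is exposed at the top level of transformers and that `generate` accepts `compile_config` as a generation-config override; the checkpoint and generation arguments below are only illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, CompileConfig

model_id = "meta-llama/Llama-3.2-1B"  # illustrative checkpoint
quant_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
# The static cache is what triggers the automatic torch.compile path in generate().
out = model.generate(
    **inputs,
    max_new_tokens=20,
    cache_implementation="static",
    compile_config=CompileConfig(fullgraph=False),
)
print(tokenizer.decode(out[0], skip_special_tokens=True))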

@SunMarc requested review from gante and matthewdouglas on June 18, 2025 14:27
@SunMarc (Member, Author) commented Jun 18, 2025

When generating, is fullgraph set to False by default? From the codebase it looks like it is, but I wanted to be sure. cc @gante

@SunMarc (Member, Author) commented Jun 18, 2025

Any idea why the test is not passing, @matthewdouglas?

RUN_SLOW=True pytest tests/quantization/bnb/test_4bit.py::Bnb4bitCompile -s -vvvvv

I'm using torch 2.7.1.

Getting the following traceback:

FAILED tests/quantization/bnb/test_4bit.py::Bnb4bitCompile::test_generate_compile - torch._dynamo.exc.Unsupported: Unsupported method call
  Explanation: Dynamo does not know how to trace method `t` of class `Params4bit`
  Hint: Avoid calling `Params4bit.t` in your code.
  Hint: Please report an issue to PyTorch.

  Developer debug context: call_method UserDefinedObjectVariable(Params4bit) t [] {}


from user code:
   File "/admin/home/marc/transformers/src/transformers/utils/generic.py", line 943, in wrapper
    output = func(self, *args, **kwargs)
  File "/admin/home/marc/transformers/src/transformers/models/llama/modeling_llama.py", line 555, in forward
    outputs: BaseModelOutputWithPast = self.model(
  File "/admin/home/marc/transformers/src/transformers/utils/generic.py", line 943, in wrapper
    output = func(self, *args, **kwargs)
  File "/admin/home/marc/transformers/src/transformers/models/llama/modeling_llama.py", line 443, in forward
    layer_outputs = decoder_layer(
  File "/admin/home/marc/transformers/src/transformers/modeling_layers.py", line 48, in __call__
    return super().__call__(*args, **kwargs)
  File "/admin/home/marc/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/admin/home/marc/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/admin/home/marc/transformers/src/transformers/models/llama/modeling_llama.py", line 294, in forward
    hidden_states, self_attn_weights = self.self_attn(
  File "/admin/home/marc/transformers/src/transformers/models/llama/modeling_llama.py", line 235, in forward
    query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
  File "/admin/home/marc/miniconda3/envs/hf/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 496, in forward
    return bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state).to(inp_dtype)

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

Same for 8bit:


FAILED tests/quantization/bnb/test_mixed_int8.py::Bnb8bitCompile::test_generate_compile - torch._dynamo.exc.Unsupported: Dynamic shape operator
  Explanation: Operator `bitsandbytes.int8_vectorwise_quant.default`'s output shape depends on input Tensor data.
  Hint: Enable tracing of dynamic shape operators with `torch._dynamo.config.capture_dynamic_output_shape_ops = True`

  Developer debug context: bitsandbytes.int8_vectorwise_quant.default


from user code:
   File "/admin/home/marc/transformers/src/transformers/utils/generic.py", line 943, in wrapper
    output = func(self, *args, **kwargs)
  File "/admin/home/marc/transformers/src/transformers/models/llama/modeling_llama.py", line 555, in forward
    outputs: BaseModelOutputWithPast = self.model(
  File "/admin/home/marc/transformers/src/transformers/utils/generic.py", line 943, in wrapper
    output = func(self, *args, **kwargs)
  File "/admin/home/marc/transformers/src/transformers/models/llama/modeling_llama.py", line 443, in forward
    layer_outputs = decoder_layer(
  File "/admin/home/marc/transformers/src/transformers/modeling_layers.py", line 48, in __call__
    return super().__call__(*args, **kwargs)
  File "/admin/home/marc/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/admin/home/marc/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/admin/home/marc/transformers/src/transformers/models/llama/modeling_llama.py", line 294, in forward
    hidden_states, self_attn_weights = self.self_attn(
  File "/admin/home/marc/transformers/src/transformers/models/llama/modeling_llama.py", line 235, in forward
    query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
  File "/admin/home/marc/miniconda3/envs/hf/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 1010, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/admin/home/marc/miniconda3/envs/hf/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 369, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/admin/home/marc/miniconda3/envs/hf/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 196, in forward
    CA, SCA, outlier_cols = F.int8_vectorwise_quant(A.to(torch.float16), threshold=state.threshold)
  File "/admin/home/marc/miniconda3/envs/hf/lib/python3.10/site-packages/bitsandbytes/functional.py", line 2245, in int8_vectorwise_quant
    return torch.ops.bitsandbytes.int8_vectorwise_quant.default(A, threshold)

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@matthewdouglas (Member)

Thanks for adding this!

For 4bit this is just a graph break so it does seem to be using fullgraph=True. That test would pass on torch >= 2.8.

I think this default comes from CompileConfig here?

class CompileConfig:
    """
    Class that holds arguments relative to `torch.compile` behavior, when using automatic compilation in `generate`.
    See [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html) for more details on the arguments.

    Args:
        fullgraph (`bool`, *optional*, defaults to `True`):
            If `True`, requires that the whole forward be capturable in a single graph.
        dynamic (`bool` or `None`, *optional*):
            Whether to try to use dynamic shape graphs.
        backend (`str` or `Callable`, *optional*, defaults to `"inductor"`):
            Backend to be used.
        mode (`str`, *optional*, defaults to `"reduce-overhead"`):
            Controls balance between performance and overhead.
        options (`dict`, *optional*):
            A dictionary of options to pass to the backend.

If that's what we expect by default, it might be simplest to guard on torch >= 2.8 for 4bit. Otherwise we might want to find a way to improve UX, like catching this and providing a better trace/error message, or disabling fullgraph like here:

elif generation_config.compile_config.fullgraph:
    logger.warning_once(
        "When using Flash Attention 2 and a static cache, you cannot use the option `CompileConfig(fullgraph=True)` as "
        "FA2 introduces graph breaks. We overrode the option with `fullgraph=False`."
    )
    generation_config.compile_config.fullgraph = False
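By analogy, a guard for bnb 4-bit on older torch could look roughly like the following; this is a purely hypothetical sketch, and `model_is_bnb_4bit` and `torch_is_at_least_2_8` are illustrative placeholders, not existing helpers:

# Hypothetical sketch, not part of this PR: override fullgraph for bnb 4-bit on torch < 2.8.
elif model_is_bnb_4bit and not torch_is_at_least_2_8 and generation_config.compile_config.fullgraph:
    logger.warning_once(
        "bitsandbytes 4-bit models introduce graph breaks on torch < 2.8, so the option "
        "`CompileConfig(fullgraph=True)` cannot be used. We overrode the option with `fullgraph=False`."
    )
    generation_config.compile_config.fullgraph = False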

Regarding 8bit, it's expected that you need torch._dynamo.config.capture_dynamic_output_shape_ops = True unless threshold=0.0. Fortunately the error message here isn't too bad and has the right advice, but I think this is also something we should just check for in is_compileable.
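For anyone reproducing the 8-bit failure locally, either workaround below should apply; this is a sketch, and the checkpoint name is illustrative. Setting `llm_int8_threshold=0.0` disables the outlier path that produces the data-dependent output shapes:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Option 1: allow dynamo to trace ops whose output shape depends on tensor data,
# as the error message suggests.
torch._dynamo.config.capture_dynamic_output_shape_ops = True

# Option 2: avoid the data-dependent int8_vectorwise_quant path altogether by
# disabling the outlier threshold (may change accuracy on models with outliers).
quant_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.0)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",  # illustrative checkpoint
    quantization_config=quant_config,
    device_map="auto",
)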

@gante (Member) left a comment

A question about torch versions, but in general LGTM!

def is_compileable(self) -> bool:
    # Compatible with PyTorch 2.4+ for fullgraph=False.
    # Requires PyTorch 2.8 nightly for fullgraph=True.
    return version.parse(importlib.metadata.version("bitsandbytes")) >= version.parse("0.46.0")
Member

Should we also do torch version checks here? To simplify logic, we should require torch>=2.8.0

Member

I agree that to keep this simple we can go ahead and just require torch>=2.8.0 here.

Longer term, in a separate PR, maybe we can do some refactoring here and let is_compileable consider additional context from a CompileConfig too.
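For illustration, such a combined gate might look roughly like this for the 4-bit quantizer; a sketch only, not necessarily what gets merged:

# Sketch only: combined bitsandbytes / torch version gate for the 4-bit quantizer,
# requiring torch >= 2.8.0 for fullgraph=True as discussed above.
import importlib.metadata

from packaging import version


@property
def is_compileable(self) -> bool:
    bnb_ok = version.parse(importlib.metadata.version("bitsandbytes")) >= version.parse("0.46.0")
    torch_ok = version.parse(importlib.metadata.version("torch")) >= version.parse("2.8.0")
    return bnb_ok and torch_ok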

@@ -314,3 +314,7 @@ def _dequantize(self, model):
             model, self.modules_to_not_convert, quantization_config=self.quantization_config
         )
         return model
+
+    @property
+    def is_compileable(self) -> bool:
Member

Does this one also have minimum torch requirements, or does it work with all torch versions?

Member

Good point. The requirement here is torch>=2.4.0.
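For illustration, by the same pattern as the 4-bit sketch above, the 8-bit check would differ only in the torch floor (again a sketch, reusing the same version-parsing approach):

# Sketch only: 8-bit quantizer gate, where torch >= 2.4.0 suffices (together with
# capture_dynamic_output_shape_ops or threshold=0.0, per the discussion above).
@property
def is_compileable(self) -> bool:
    bnb_ok = version.parse(importlib.metadata.version("bitsandbytes")) >= version.parse("0.46.0")
    torch_ok = version.parse(importlib.metadata.version("torch")) >= version.parse("2.4.0")
    return bnb_ok and torch_ok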

@SunMarc (Member, Author) commented Jul 3, 2025

Thanks for everyone's advice! I will check that later and probably merge it after torch 2.8 is out.

Labels: None yet
Projects: None yet
4 participants