
Will this support 4bit quantisation? #2

Closed
distbit0 opened this issue Mar 14, 2023 · 4 comments

Comments

@distbit0

Awesome project. Thanks so much for creating a FOSS Alpaca codebase.

@Semisol

Semisol commented Mar 16, 2023

Most probably yes; you can merge the LoRA weights into the model and quantize that.
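
A minimal sketch of that merge step, for concreteness; base_linear, lora_A, lora_B, and scaling are illustrative names rather than identifiers from this repo. The low-rank update is folded into the frozen weight, and the merged fp16/fp32 checkpoint can then be handed to any 4-bit quantizer.

import torch
import torch.nn as nn

@torch.no_grad()
def merge_lora_into_linear(base_linear: nn.Linear,
                           lora_A: nn.Linear,
                           lora_B: nn.Linear,
                           scaling: float) -> None:
    # Fold the LoRA update into the frozen weight in place:
    # W_merged = W + scaling * (B @ A), matching y = W x + scaling * B(A(x)).
    delta = (lora_B.weight @ lora_A.weight) * scaling  # (out_features, in_features)
    base_linear.weight += delta.to(base_linear.weight.dtype)

Once every adapted Linear has been merged this way, the model is an ordinary dense checkpoint again and GPTQ-style 4-bit quantization applies as usual.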

@gururise
Contributor

> Most probably yes, you can merge the LoRA weights into the model and quantize that

Training in 4-bit or just 4-bit inference?

Training in 4-bit would be a game-changer.

@johnsmith0031

johnsmith0031 commented Mar 16, 2023

I think technically we can do both, with the original model weights in int4 + fp16/fp32 and the LoRA weights in fp32. The dtype of the output from a QuantLinear layer can be exactly the dtype of its input, so inserting a LoRA layer after every QuantLinear layer won't be very difficult.

I made an adapter for peft to support QuantLinear:

import math

import torch
import torch.nn as nn

from peft.tuners.lora import LoraLayer  # LoRA mixin from peft (import path may differ across peft versions)
from quant import QuantLinear           # 4-bit linear layer from GPTQ-for-LLaMa's quant.py

class Linear4bitLt(QuantLinear, LoraLayer):
    # Lora implemented in a dense layer
    def __init__(
            self,
            in_features,
            out_features,
            r: int = 0,
            lora_alpha: int = 1,
            lora_dropout: float = 0.0,
            **kwargs,
    ):
        # the frozen 4-bit base layer; the first argument (4) is the bit width
        QuantLinear.__init__(
            self,
            4,
            in_features,
            out_features
        )
        LoraLayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout, merge_weights=False)
        # Actual trainable parameters
        if r > 0:
            self.lora_A = nn.Linear(in_features, r, bias=False)
            self.lora_B = nn.Linear(r, out_features, bias=False)
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.qweight.requires_grad = False
            self.scales.requires_grad = False
            self.zeros.requires_grad = False
            self.bias.requires_grad = False
        self.reset_parameters()

    def reset_parameters(self):
        if hasattr(self, "lora_A"):
            # initialize A the same way as the default for nn.Linear and B to zero
            nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B.weight)

    def forward(self, x: torch.Tensor):
        result = super().forward(x)

        if self.disable_adapters:
            return result
        elif self.r > 0:
            if not torch.is_autocast_enabled():
                expected_dtype = result.dtype

                if x.dtype != torch.float32:
                    x = x.float()
                output = self.lora_B(self.lora_A(self.lora_dropout(x))).to(expected_dtype) * self.scaling
                result += output
            else:
                output = self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling
                result += output
        return result

The only thing left to do is to add support for backpropagating gradients through the 4-bit matmul.
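
One way to sketch that missing piece, under assumptions that do not match quant.py exactly: wrap the quantized matmul in a custom torch.autograd.Function that dequantizes on the fly, so the backward pass is an ordinary dense matmul. dequantize_4bit and its packing scheme below are illustrative, not the repo's real kernels; only the activation gradient is returned, since the quantized weight stays frozen and the LoRA layers carry the trainable update.

import torch

def dequantize_4bit(qweight, scales, zeros):
    # Illustrative unpacking only: assumes qweight is int32 of shape
    # (in_features // 8, out_features) with eight 4-bit values packed per int32,
    # and per-output-channel scales/zeros. The real quant.py layout differs.
    shifts = torch.arange(0, 32, 4, device=qweight.device).view(-1, 1, 1)
    nibbles = (qweight.unsqueeze(0) >> shifts) & 0xF  # (8, in//8, out)
    w_int = nibbles.permute(1, 0, 2).reshape(-1, qweight.shape[1]).float()
    return (w_int - zeros) * scales  # (in_features, out_features)

class QuantMatMul4bit(torch.autograd.Function):
    # Matmul against a frozen 4-bit weight with a dense backward pass.

    @staticmethod
    def forward(ctx, x, qweight, scales, zeros):
        ctx.save_for_backward(qweight, scales, zeros)
        return x @ dequantize_4bit(qweight, scales, zeros).to(x.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        qweight, scales, zeros = ctx.saved_tensors
        w = dequantize_4bit(qweight, scales, zeros).to(grad_output.dtype)
        grad_x = grad_output @ w.t()     # gradient w.r.t. the activations only
        return grad_x, None, None, None  # frozen quantized parameters get no gradient

QuantLinear.forward would then route through QuantMatMul4bit.apply(x, self.qweight, self.scales, self.zeros), so gradients can reach the LoRA parameters of earlier layers.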

@FNsi

FNsi commented Mar 22, 2023

> Most probably yes, you can merge the LoRA weights into the model and quantize that

May I ask how to merge the LoRA?
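
One common route, sketched here rather than taken from this thread, is to let peft do the merge, assuming a peft release recent enough to ship merge_and_unload(); the paths below are placeholders.

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Placeholder paths; merge_and_unload() needs a peft version that provides it.
base = AutoModelForCausalLM.from_pretrained("path/to/base-llama", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
merged = model.merge_and_unload()  # folds scaling * (lora_B @ lora_A) into each base weight
merged.save_pretrained("path/to/merged-model")  # this dense checkpoint can then be quantized to 4 bits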

distbit0 closed this as completed on Jan 4, 2024.