I am trying to quantize Qwen3-VL (a multimodal LLM) to INT4 using AIMET. Due to the hardware (Snapdragon 8295), I can only use per‑channel quantization. I first tried basic PTQ, but the accuracy was insufficient. Then I used SpinQuant, which improved accuracy, but it is still not enough. I read the SpinQuant paper, which states that after applying the rotation, GPTQ is used for quantization. However, AIMET's implementation does not seem to follow this approach.
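For reference, here is my understanding of the pipeline the paper describes: rotate a linear layer, then quantize the rotated weights with GPTQ using per-channel symmetric INT4. This is a minimal NumPy sketch and not AIMET's API or the paper's code. Several things are assumptions. A random orthogonal matrix (QR of a Gaussian) stands in for SpinQuant's learned rotation. GPTQ is simplified, with no block-wise lazy updates and no act-order. The helper names (`random_orthogonal`, `per_channel_scale`, `quantize`, `gptq_per_channel`) are hypothetical.

```python
import numpy as np

def random_orthogonal(n, rng):
    # Assumption: random orthogonal matrix as a stand-in for SpinQuant's *learned* rotation.
    a = rng.standard_normal((n, n))
    q, r = np.linalg.qr(a)
    return q * np.sign(np.diag(r))  # sign fix; q stays orthogonal

def per_channel_scale(w, qmax=7):
    # Symmetric INT4, one scale per output channel (row), integer grid [-8, 7].
    return np.maximum(np.abs(w).max(axis=1), 1e-8) / qmax

def quantize(v, scale, qmin=-8, qmax=7):
    # Round-to-nearest onto the INT4 grid, then dequantize (elementwise, broadcasts).
    return np.clip(np.round(v / scale), qmin, qmax) * scale

def gptq_per_channel(w, x, damp=0.01):
    """Simplified GPTQ for W of shape (out, in) with calibration inputs X of shape (n, in)."""
    w = w.astype(np.float64).copy()
    in_f = w.shape[1]
    scale = per_channel_scale(w)               # fixed up front (per-channel, no groups)
    h = 2.0 * x.T @ x                          # Hessian of the layer-wise squared error
    h += damp * np.mean(np.diag(h)) * np.eye(in_f)  # dampening for numerical stability
    u = np.linalg.cholesky(np.linalg.inv(h)).T # upper Cholesky factor of H^-1
    q = np.zeros_like(w)
    for i in range(in_f):
        q[:, i] = quantize(w[:, i], scale)
        err = (w[:, i] - q[:, i]) / u[i, i]
        # Feed this column's rounding error into the columns not yet quantized.
        w[:, i + 1:] -= np.outer(err, u[i, i + 1:])
    return q, scale

rng = np.random.default_rng(0)
in_f, out_f, n = 64, 32, 512
w = rng.standard_normal((out_f, in_f))
# Correlated calibration activations, so the Hessian is far from diagonal.
x = rng.standard_normal((n, in_f)) @ rng.standard_normal((in_f, in_f))

r = random_orthogonal(in_f, rng)
w_rot, x_rot = w @ r, x @ r  # x W^T == (x R)(W R)^T because R R^T = I

q_rot, scale = gptq_per_channel(w_rot, x_rot)
rtn_rot = quantize(w_rot, scale[:, None])  # plain per-channel RTN baseline

ref = x_rot @ w_rot.T
print("RTN  output error:", np.linalg.norm(ref - x_rot @ rtn_rot.T))
print("GPTQ output error:", np.linalg.norm(ref - x_rot @ q_rot.T))
```

The point I am unsure about is exactly this last step: as far as I can tell, AIMET applies the rotation but then quantizes with its own PTQ flow rather than a GPTQ-style error-feedback pass like `gptq_per_channel` above.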