Tags: pytorch/ao
Refactor `is_ROCm_mx_supported` function for improved readability - Reformatted the return statement to enhance clarity and maintainability of the code.
Uses torch.version.cuda to compile CUDA extensions (#2193)
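For illustration, a minimal sketch of what keying the build off torch.version.cuda can look like. torch.version.cuda is the toolkit version PyTorch was built with (or None on CPU/ROCm builds); the helper name and the flag choices below are assumptions, not torchao's actual setup.py logic.

```python
# Hypothetical sketch: derive CUDA-extension build flags from
# torch.version.cuda instead of probing a locally installed nvcc.
import torch

def cuda_nvcc_flags():
    cuda_version = torch.version.cuda  # e.g. "12.4"; None on CPU/ROCm builds
    if cuda_version is None:
        return None  # skip building CUDA extensions entirely
    major, minor = (int(p) for p in cuda_version.split(".")[:2])
    flags = ["-O3", "--use_fast_math"]
    if (major, minor) >= (12, 0):
        # Newer toolkits can target newer architectures; illustrative only.
        flags.append("-gencode=arch=compute_90a,code=sm_90a")
    return flags
```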
Move moe quant to better prototype dir (#2192)
Summary: The old quantization/prototype dir is being deprecated, so moe_quant is moved into the correct one.
Test Plan: see CI
Enabling MOE Quantization using linear decomposition (#2043)
Summary: This PR is a first step toward optimizing MoE inference with torchAO. The goal for this step is to enable existing quantization kernels and workflows to work for MoE quantization by decomposing the grouped gemm into a sequence of unbalanced linear ops that can use the existing quantized kernels. To enable this we had to add support for quantizing these 3D tensors as well as slicing and indexing them. Two methods of achieving this were implemented. For int8wo, int8dq, int4wo, fp8wo, and fp8dq, the underlying quantized tensor subclass was adapted to support 3D tensors, indexing, and slicing, along with an updated transformation function that can handle the ConditionalFeedForwardAOQuantizable modules when the filter function in quantize_ is used to target them. For some complex kernels which use packed data and couldn't easily be made to work in 3D, we also added FakeExtraDimTensor, which can wrap any quantized tensor subclass to support the slice and index operations needed for MoE quantization; this option is enabled by using MoeQuantConfig. This can be applied to Hugging Face Llama 4, for instance, as shown in the llama4_quant.py example. Since the HF MoE module is implemented in a way that's not conducive to quantization, it first requires a module swap to MOEFeedForwardAOQuantizable.
TODO: final benchmark numbers from run.sh; consolidate the three implementations of MOEFeedForwardAOQuantizable and ConditionalFeedForwardAOQuantizable; verify hqq.
Test Plan:
python test/quantization/test_moe_quant.py
python test/torchao/experimental/tests/test_int8_dynamic_activation_intx_weight.py -k "test_moe_quant_intx"
sh torchao/_models/mixtral-moe/run.sh
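A minimal, hypothetical sketch of the workflow this entry describes: target only the decomposed expert module with the filter function of quantize_. ToyExpert and Int8WeightOnlyConfig are illustrative stand-ins, not the PR's API; the actual classes are MOEFeedForwardAOQuantizable / ConditionalFeedForwardAOQuantizable and MoeQuantConfig, and import paths may differ across torchao versions.

```python
# Hypothetical sketch; ToyExpert stands in for the PR's decomposed MoE
# expert module, and Int8WeightOnlyConfig for whichever base config the
# real MoeQuantConfig would wrap.
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int8WeightOnlyConfig

class ToyExpert(nn.Module):
    """Stand-in for an MoE expert decomposed into plain linear ops."""
    def __init__(self, dim=64):
        super().__init__()
        self.w1 = nn.Linear(dim, 4 * dim, bias=False)
        self.w2 = nn.Linear(4 * dim, dim, bias=False)

    def forward(self, x):
        return self.w2(torch.nn.functional.silu(self.w1(x)))

model = nn.Sequential(ToyExpert(), nn.Linear(64, 10))

# Quantize only the expert's linears, leaving the final projection untouched,
# by passing a filter function to quantize_ (the PR uses the same mechanism
# to target its MoE modules).
quantize_(
    model,
    Int8WeightOnlyConfig(),
    filter_fn=lambda mod, fqn: isinstance(mod, nn.Linear) and fqn.startswith("0."),
)

out = model(torch.randn(2, 64))
```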
Add quantized q @ k test for intended use in quantized attention. Differential Revision: D71370604. Pull Request resolved: #2006
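For context, an illustrative version of the kind of check such a test makes: quantize q and k to int8 with symmetric per-tensor scales, emulate the integer matmul, and compare against the float reference. This does not call the torchao kernels the actual test exercises; helper names and tolerances are assumptions.

```python
# Illustrative quantized q @ k check, not the PR's test.
import torch

def quantize_symmetric_int8(x):
    # Per-tensor symmetric quantization to int8.
    scale = x.abs().amax() / 127
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

q = torch.randn(8, 16, 64)   # (heads, seq_len, head_dim)
k = torch.randn(8, 16, 64)

q_int, q_scale = quantize_symmetric_int8(q)
k_int, k_scale = quantize_symmetric_int8(k)

# Emulate integer accumulation in float (int8*int8 products summed over 64
# elements fit exactly in float32), then rescale back to the float domain.
scores_q = (q_int.float() @ k_int.float().transpose(-1, -2)) * (q_scale * k_scale)
scores_ref = q @ k.transpose(-1, -2)

torch.testing.assert_close(scores_q, scores_ref, atol=1.0, rtol=0.1)
```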