Llama 3.1 405B fp4 changes upstreaming from 355_wip #25135
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 798e475 to f9626ee.
Review threads on vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py (resolved).
Can we get some unit tests for the batched_rotary_embedding kernel?
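For reference, a minimal sketch of what such a unit test could check, comparing a batched application of rotary embeddings against a per-sequence reference in pure PyTorch; names, shapes, and tolerances are assumptions, and a real test would call the kernel under review instead of the Python reference used for the batched path here:

import torch


def apply_rope_ref(x: torch.Tensor, positions: torch.Tensor,
                   base: float = 10000.0) -> torch.Tensor:
    """Reference interleaved RoPE. x: [seq_len, num_heads, head_dim],
    positions: [seq_len]."""
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (
        torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = positions[:, None].float() * inv_freq[None, :]
    cos = angles.cos()[:, None, :]  # [seq_len, 1, head_dim // 2]
    sin = angles.sin()[:, None, :]
    x1, x2 = x[..., ::2].float(), x[..., 1::2].float()
    out = torch.empty_like(x, dtype=torch.float32)
    out[..., ::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out.to(x.dtype)


def apply_rope_batched(x: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
    """Stand-in for the batched kernel: flatten [batch, seq] into one token
    axis and apply RoPE once. A real test would invoke the CUDA/HIP kernel here."""
    batch, seq, heads, dim = x.shape
    flat = apply_rope_ref(x.reshape(batch * seq, heads, dim),
                          positions.reshape(batch * seq))
    return flat.reshape(batch, seq, heads, dim)


def test_batched_matches_per_sequence():
    torch.manual_seed(0)
    batch, seq, heads, dim = 4, 16, 8, 64
    x = torch.randn(batch, seq, heads, dim, dtype=torch.float16)
    positions = torch.arange(seq).expand(batch, seq)
    batched = apply_rope_batched(x, positions)
    looped = torch.stack(
        [apply_rope_ref(x[b], positions[b]) for b in range(batch)])
    torch.testing.assert_close(batched, looped, rtol=1e-3, atol=1e-3)


if __name__ == "__main__":
    test_batched_matches_per_sequence()
    print("ok")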
Removed batched rope for now to speed up landing this PR.
Great that the CDNA4 MXFP4 GEMM gets upstreamed!
CC @mgoin
The CI test model-executor-test also fails on main at f552d5e578077574276aa9d83139b91e1d5ae163, which this branch is based on. Please force-merge this PR. Thanks.
Let's remove the x_quant_scales change to the linear layer
Removed, please look again.
The CI test that failed passes locally:
LGTM now, thank you!
LGTM
if self.emulate:
    layer.weight_scale = torch.nn.Parameter(layer.weight_scale.data,
                                            requires_grad=False)
    try:
        from quark.torch.export.nn.modules import realquantizer
        from quark.torch.quantization.config.config import (
            QuantizationSpec)
    except ImportError as err:
        raise ImportError(
            "The package `amd-quark` is required to use AMD Quark "
            "MX-FP4 models. Please install it with `pip install "
            "amd-quark`.") from err

    weight_quant_spec = QuantizationSpec.from_dict(
        self.weight_quant_spec)

    # Build a real quantizer from the weight quantization spec and attach
    # the scales loaded with the checkpoint.
    weight_quantizer = realquantizer.get_real_quantizer(
        qspec=weight_quant_spec,
        quantizer=None,
        real_quantized=True,
        reorder=False,
        float_dtype=self.out_dtype,
        scale_shape=layer.weight_scale.shape,
        zero_point_shape=None,
    )
    weight_quantizer.scale.data = layer.weight_scale.data

    # Dequantize the packed MX-FP4 weight into out_dtype so the emulation
    # path can use a regular high-precision GEMM.
    layer.weight = torch.nn.Parameter(
        weight_quantizer(layer.weight.data).to(self.out_dtype),
        requires_grad=False,
    )
    layer.weight_scale = None

    # This call is necessary to release the scales memory.
    torch.cuda.empty_cache()
I insist that this is unnecessary: https://github.com/vllm-project/vllm/pull/25135/files#r2378191214 - unfortunately, I was not able to reopen the thread that was closed.
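For context, the emulation branch in the snippet above amounts to dequantizing block-scaled FP4 weights into a regular dtype so a plain GEMM can be used. Below is a minimal conceptual sketch of that dequantization, assuming the MXFP4 layout of 32-element blocks with E2M1 elements and power-of-two shared scales; the function name, code packing, and tensor layout are illustrative, not the quark API.

import torch

# The 16 representable E2M1 (FP4) values; the code-to-value mapping here is an
# illustrative choice, not necessarily the packing used by the kernels.
FP4_E2M1_VALUES = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])


def dequant_mxfp4(codes: torch.Tensor, scale_exp: torch.Tensor,
                  block_size: int = 32,
                  out_dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    """codes: [rows, cols] integer codes in 0..15, one per weight element;
    scale_exp: [rows, cols // block_size] integer exponents of the per-block
    power-of-two shared scales."""
    values = FP4_E2M1_VALUES[codes.long()]                  # decode FP4 codes
    rows, cols = values.shape
    blocks = values.reshape(rows, cols // block_size, block_size)
    scales = torch.pow(2.0, scale_exp.float())[..., None]   # [rows, nblocks, 1]
    return (blocks * scales).reshape(rows, cols).to(out_dtype)


# Tiny usage example.
codes = torch.randint(0, 16, (4, 64))
scale_exp = torch.randint(-2, 3, (4, 2))
w = dequant_mxfp4(codes, scale_exp)
print(w.shape, w.dtype)  # torch.Size([4, 64]) torch.bfloat16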
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Doug Lehr <douglehr@amd.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Perf is the same between upstream (tp1) and 355_wip (tp1).
Command:
Run the client benchmark:
Correctness - shows reasonable answers for the command:
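The exact serve and benchmark commands are not reproduced in this excerpt. Purely as an illustrative sketch (server address, model id, prompt, and token limit are assumptions, not the commands actually used), a quick correctness spot-check against a running vLLM OpenAI-compatible server could look like:

from openai import OpenAI

# Assumes a vLLM server is already running locally with the FP4 checkpoint and
# exposing the OpenAI-compatible API on the default port; the model id below is
# an assumption, not necessarily the checkpoint used in this PR.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-405B-Instruct",
    messages=[{"role": "user",
               "content": "What is the capital of France?"}],
    max_tokens=64,
)
# A reasonable answer (e.g. "Paris") indicates the FP4 path is numerically sane.
print(resp.choices[0].message.content)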