
[Feature Request] Multi-Head Latent Attention (DeepSeek) support on CPU/NPU #23925

Open
@bkaruman

Description


Describe the feature request

DeepSeek's models use Multi-Head Latent Attention (MLA), but the current ONNX release (https://huggingface.co/onnxruntime/DeepSeek-R1-Distill-ONNX) uses GroupQueryAttention instead.

Is MLA on the roadmap for ONNX Runtime?

Describe scenario use case

Multi-Head Latent Attention lowers the KV-cache footprint, improving mobile and edge inference.
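To make the KV-cache saving concrete, here is a rough back-of-the-envelope sketch comparing per-token cache sizes for standard multi-head attention, grouped-query attention (as in the current ONNX release), and MLA. The head counts and latent dimensions below are assumptions chosen for illustration (loosely modeled on DeepSeek-V2's published MLA configuration), not values taken from the DeepSeek-R1-Distill ONNX model itself.

```python
# Per-token KV-cache size comparison (illustrative; all dims are assumptions).
def kv_cache_bytes_per_token(entries_per_token: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache stored per token, assuming fp16 (2 bytes/entry)."""
    return entries_per_token * dtype_bytes

# Standard multi-head attention: cache full K and V for every head.
n_heads, head_dim = 32, 128
mha = kv_cache_bytes_per_token(2 * n_heads * head_dim)

# Grouped-query attention: K/V are shared across groups of query heads,
# so only n_kv_heads K/V pairs are cached.
n_kv_heads = 8
gqa = kv_cache_bytes_per_token(2 * n_kv_heads * head_dim)

# Multi-head latent attention: cache one compressed latent vector per token
# plus a small decoupled rotary-position key (dims are assumptions).
d_latent, d_rope = 512, 64
mla = kv_cache_bytes_per_token(d_latent + d_rope)

print(mha, gqa, mla)  # MLA's cache is a fraction of both MHA's and GQA's
```

Under these assumed dimensions, MLA caches roughly 14x less than full MHA and about 3.5x less than GQA per token, which is the motivation for supporting it natively on memory-constrained CPU/NPU targets.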

Metadata

Assignees

No one assigned

    Labels

    feature request — request for unsupported feature or enhancement
    platform:mobile — issues related to ONNX Runtime mobile; typically submitted using template
