Describe the feature request
DeepSeek's models use Multi-head Latent Attention (MLA), but the current ONNX release at https://huggingface.co/onnxruntime/DeepSeek-R1-Distill-ONNX instead uses GroupQueryAttention.
Is MLA support on the ONNX Runtime roadmap?
Describe scenario use case
Multi-Head Latent Attention has a lower KV-cache footprint, which would improve mobile and edge inference.
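To make the memory argument concrete, here is a back-of-envelope comparison of KV-cache sizes. All parameter values (sequence length, layer count, head counts, latent dimension) are illustrative assumptions, not the linked model's actual configuration:

```python
# Illustrative KV-cache size comparison. GQA caches full per-head K and V
# tensors; MLA caches a single compressed latent vector per position
# (plus a small decoupled RoPE key component).

def gqa_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # K and V for every KV head, at every layer and sequence position
    return seq_len * n_layers * 2 * n_kv_heads * head_dim * dtype_bytes

def mla_cache_bytes(seq_len, n_layers, latent_dim, rope_dim, dtype_bytes=2):
    # one shared latent vector plus a decoupled RoPE key per position
    return seq_len * n_layers * (latent_dim + rope_dim) * dtype_bytes

# hypothetical fp16 config, chosen only for illustration
gqa = gqa_cache_bytes(seq_len=4096, n_layers=32, n_kv_heads=8, head_dim=128)
mla = mla_cache_bytes(seq_len=4096, n_layers=32, latent_dim=512, rope_dim=64)
print(f"GQA: {gqa / 2**20:.0f} MiB, MLA: {mla / 2**20:.0f} MiB")
# With these assumed numbers: GQA 512 MiB vs MLA 144 MiB
```

Under these assumptions the latent cache is several times smaller than the GQA cache, which is exactly the property that matters on memory-constrained mobile and edge devices.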