com.microsoft.Attention do_rotary flag doesn't work on apple silicon

### Describe the issue

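The `com.microsoft.Attention` contrib operator defines a `do_rotary` attribute (per its schema), but on Apple Silicon (ARM64) with onnxruntime-silicon, setting `do_rotary=1` does not change the output. With the CPU execution provider, the result is identical to the one produced with `do_rotary=0`, so the flag appears to be ignored.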

### To reproduce

```python
import numpy as np
import onnx
from onnx import helper, TensorProto, numpy_helper
import onnxruntime as ort

def attention_output(x, W, b, num_heads, qkv_hidden_size, do_rotary):
    # Create a minimal graph: one Attention node
    input_vi = helper.make_tensor_value_info("input", TensorProto.FLOAT, x.shape)
    output_vi = helper.make_tensor_value_info("output", TensorProto.FLOAT, x.shape)
    W_init = numpy_helper.from_array(W, name="weights")
    B_init = numpy_helper.from_array(b, name="bias")
    attn_node = helper.make_node(
        "Attention",
        inputs=["input", "weights", "bias"],
        outputs=["output"],
        domain="com.microsoft",
        num_heads=num_heads,
        unidirectional=0,
        qkv_hidden_sizes=[qkv_hidden_size]*3,
        do_rotary=do_rotary
    )
    graph = helper.make_graph(
        [attn_node],
        "MinimalAttnGraph",
        [input_vi],
        [output_vi],
        initializer=[W_init, B_init]
    )
    model = helper.make_model(
        graph,
        opset_imports=[helper.make_operatorsetid("com.microsoft", 1)]
    )

    sess = ort.InferenceSession(model.SerializeToString(),
                                providers=['CPUExecutionProvider'])
    return sess.run(None, {"input": x})[0]

if __name__ == "__main__":
    # Dummy data
    batch, seq_len, in_hid = 1, 5, 8
    num_heads, head_size = 4, 2
    hidden_size = num_heads * head_size

    x = np.random.rand(batch, seq_len, in_hid).astype(np.float32)
    W = np.random.rand(in_hid, hidden_size * 3).astype(np.float32)
    b = np.random.rand(hidden_size * 3).astype(np.float32)

    out0 = attention_output(x, W, b, num_heads, hidden_size, do_rotary=0)
    out1 = attention_output(x, W, b, num_heads, hidden_size, do_rotary=1)

    diff = np.linalg.norm(out0 - out1)
    print(f"Norm difference (do_rotary=0 vs 1): {diff:.6f}")

    # If outputs are identical, error out
    if diff == 0.0:
        raise RuntimeError(
            "do_rotary flag was ignored by the CPU Execution Provider: "
            "outputs are identical (norm difference == 0)"
        )

    print("Success: do_rotary had an effect on the outputs.")
```
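
For context on why a non-zero difference is expected, here is a minimal sketch of a textbook rotary position embedding applied to a single query/key pair with `head_size = 2`, as in the script above. The pairing of adjacent elements, the `base=10000.0` constant, the `rotate_pair` and `dot` helpers, and the sample values are all assumptions made for illustration. ONNX Runtime's actual `do_rotary` layout may differ, so this only shows the general effect and is not a reference for the operator.

```python
import math

def rotate_pair(vec, pos, base=10000.0):
    """Apply a textbook rotary embedding to an even-length vector.

    Assumption: elements (i, i + 1) form a rotation pair; the layout used
    by ONNX Runtime for do_rotary is not confirmed here.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Rotation angle grows with position and shrinks with pair index
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query/key values for one head with head_size = 2
q = [0.3, 0.7]
k = [0.9, 0.2]

plain_score = dot(q, k)
# Different positions: the score depends on the relative offset
rotary_score = dot(rotate_pair(q, pos=3), rotate_pair(k, pos=1))
# Equal positions: the rotations cancel and the score is unchanged
same_pos_score = dot(rotate_pair(q, pos=2), rotate_pair(k, pos=2))

print(f"plain: {plain_score:.6f}")
print(f"rotary, positions 3 and 1: {rotary_score:.6f}")
print(f"rotary, equal positions: {same_pos_score:.6f}")
```

Under these assumptions, the attention score changes whenever the query and key sit at different positions. With `seq_len = 5` and random inputs, an exact zero norm difference between the two runs would therefore be surprising if the rotation were applied.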

### Urgency

_No response_

### Platform

Mac

### OS Version

macOS 15.4 (24E248)

### ONNX Runtime Installation

Released Package

### ONNX Runtime Version or Commit ID

1.16.3

### ONNX Runtime API

Python

### Architecture

ARM64

### Execution Provider

Default CPU

### Execution Provider Library Version

_No response_

com.microsoft.Attention do_rotary flag doesn't work on apple silicon #24528
