graph: doc, interface, backend: support SDPA training #3396
base: main
Conversation
The branch was force-pushed from 2828827 to 9e6b119.
| Q/K/V | Dst             | Stats |
|:------|:----------------|:------|
| f32   | f32, bf16, f16  | f32   |
| bf16  | bf16            | f32   |
| f16   | f16             | f32   |
@LuFinch Could you please confirm that the stats tensor is always f32 in PyTorch?
@ElaineBao It would be better to clarify this in the RFC document as well. Thanks.
Confirmed that all the CUDA FlashAttention/EfficientAttention/CudnnAttention backends use fp32 for logsumexp.
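For reference, a rough sketch of how a graph could declare the stats output in f32 while Q/K/V stay in f16. The tensor ids, shapes, and the wiring of stats as a second SoftMax output are illustrative assumptions following the RFC's direction, not code from this PR:

```cpp
#include "oneapi/dnnl/dnnl_graph.hpp"
using namespace dnnl::graph;

// Illustrative shapes/ids; exposing stats as a second SoftMax output is an
// assumption based on the RFC, not code taken from this PR.
void build_softmax_with_stats(graph &g) {
    logical_tensor score {0, logical_tensor::data_type::f16, {1, 16, 384, 384},
            logical_tensor::layout_type::strided};
    logical_tensor prob {1, logical_tensor::data_type::f16, {1, 16, 384, 384},
            logical_tensor::layout_type::strided};
    // stats (logsumexp) stays f32, even though Q/K/V and dst are f16.
    logical_tensor stats {2, logical_tensor::data_type::f32, {1, 16, 384, 1},
            logical_tensor::layout_type::strided};

    op softmax {0, op::kind::SoftMax, "softmax"};
    softmax.set_attr<int64_t>(op::attr::axis, -1);
    softmax.add_input(score);
    softmax.add_output(prob);
    softmax.add_output(stats); // hypothetical second output carrying stats

    g.add_op(softmax);
}
```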
// The order of input logical tensors in inputs is not certain, we need
// to record the input offset in a certain order of ops.
CHECK_BOOL(record_input_offset(sg, inputs));
dims src1_user_dims = ltw(inputs[graph_inport[mm1_src]]).vdims();
ndims = src1_user_dims.size();
VCHECK_SDP_DECOMP(ndims == 4 || ndims == 5, false,
        "Input dims should be 4 or 5, but got %zu", src1_user_dims.size());
VCHECK_SDP_DECOMP(
        outputs.size() == 1, false, "Doesn't support SDPA training yet");
Better to make the error message consistent with the checks.
Suggested change:
-         outputs.size() == 1, false, "Doesn't support SDPA training yet");
+         outputs.size() == 1, false, "does not support multiple outputs");
// At least 3 inputs: Q, K, V
VCHECK_SDP_PRIMITIVE(inputs.size() >= 3, status::invalid_arguments,
        "At least 3 inputs are required");
VCHECK_SDP_PRIMITIVE(outputs.size() == 1, status::unimplemented,
        "Doesn't support SDPA training yet");
Ditto.
}

auto f32_dst = dst;
if (f32_dst->get_logical_tensor().data_type == dnnl::impl::data_type::f32) {
I guess below should work:
Suggested change:
- if (f32_dst->get_logical_tensor().data_type == dnnl::impl::data_type::f32) {
+ if (f32_dst->get_logical_tensor().data_type == data_type::f32) {
        = empty_logical_tensor_with_default_id();
f32_dst = std::make_shared<value_t>(
        *new_softmax_op, 0, softmax_op_out_lt, true);
f32_dst->set_data_type(dnnl::impl::data_type::f32);
Same here. I think there's no need to spell out all the nested namespaces.
@@ -698,6 +698,168 @@ static status_t select_handler(
    return status::success;
}

static status_t softmax_handler(
        const std::shared_ptr<op_t> &op, subgraph_rewriter_t &rewriter) {
What's the purpose of the lowering function?
The main purpose is to compute stats with multiple primitives. If the stats output does not exist, a single softmax primitive is used instead.
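For readers unfamiliar with the decomposition, here is a plain, self-contained C++ sketch of the math such a multi-primitive lowering would produce (reduce-max, subtract, exp, reduce-sum, log). It only illustrates the computation; it is not the backend code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> row = {1.5f, -0.25f, 3.0f, 0.5f}; // one row of the score matrix
    // reduce-max for numerical stability
    float row_max = *std::max_element(row.begin(), row.end());
    // subtract the max and exponentiate, then reduce-sum
    std::vector<float> ex(row.size());
    float sum = 0.f;
    for (size_t i = 0; i < row.size(); ++i) {
        ex[i] = std::exp(row[i] - row_max);
        sum += ex[i];
    }
    // softmax output
    for (size_t i = 0; i < row.size(); ++i)
        printf("softmax[%zu] = %f\n", i, ex[i] / sum);
    // stats = logsumexp = max + log(sum(exp(x - max))), kept in f32
    float stats = row_max + std::log(sum);
    printf("stats (logsumexp) = %f\n", stats);
    return 0;
}
```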
@@ -177,6 +177,48 @@ DNNL_BACKEND_REGISTER_PATTERN_MATCHER_PASS(dnnl, float_sdp_fusion_gpu)
    return std::make_shared<sdp_base_t<>>();
});

DNNL_BACKEND_REGISTER_PATTERN_MATCHER_PASS(dnnl, float_sdp_backward_fusion)
Does that mean the inference fusion pattern will be reused for training forward?
Yes, correct
The branch was later force-pushed from 9e6b119 to 6f12739, then from 6f12739 to 4efa167.
switch (exec_arg) {
    case DNNL_ARG_SRC:
        SAFE(::custom::fill_mem(mem, ref_mem, -8, 7), WARN);
How do we decide this data range?
@@ -203,6 +210,13 @@ int ref_partition_t::init_graph_mem(
}

void ref_partition_t::exec_ops(res_t *res) {
    // check if there's softmax backward op in the partition,
    // which will be a candidate for sdpa training backward pattern
    bool has_softmax_backward = std::any_of(partition_ops_ref_.begin(),
May use `static`?
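For context, the detection amounts to a simple scan of the partition's ops. Below is an illustrative standalone version that uses plain strings instead of benchdnn's actual op representation (an assumption for brevity):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative only: return true if any op kind in the partition is SoftMaxBackward.
bool contains_softmax_backward(const std::vector<std::string> &op_kinds) {
    return std::any_of(op_kinds.begin(), op_kinds.end(),
            [](const std::string &kind) { return kind == "SoftMaxBackward"; });
}
```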
@@ -263,7 +288,7 @@ void ref_partition_t::exec_ops(res_t *res) {
            || (parent0 == "Multiply" && parent1 == "MatMul");
}

-    if (is_sdpa_pattern || is_gated_mlp_pattern) {
+    if (is_sdpa_pattern || is_sdpa_bwd_pattern || is_gated_mlp_pattern) {
The names are a little confusing: they actually indicate whether the current op needs precision downcasting, not the status of the whole pattern. Could you help improve them?
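One possible way to address this (the name `requires_dt_downcast` below is only an illustration, not taken from the PR):

```cpp
// Hypothetical rename so the flag describes what it controls for the current op:
const bool requires_dt_downcast
        = is_sdpa_pattern || is_sdpa_bwd_pattern || is_gated_mlp_pattern;
if (requires_dt_downcast) {
    // downcast the op's f32 reference data to the partition's lower precision
}
```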
Description
This PR implements the RFC "graph api: support SDPA training" (#3233).