Restore mixed dtypes GEMM auto-tuning for Ampere #129058
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129058
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures as of commit 44caacd with merge base f565d16.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Recent changes in the CUTLASS-based auto-tuning code for Inductor (the stack of changes roughly listed here) pretty much disabled any kind of CUTLASS-based auto-tuning for the Ampere architecture, including auto-tuning for mixed dtypes GEMM and sparse semi-structured GEMM. The PRs in this stack are intended to restore the functionality for the two mentioned special GEMM types. (The rationale for the original changes is provided in the comments section of PR 124577, starting from this comment.)

The regression is not detected by CI because the tests for CUTLASS-based auto-tuning check only that there were no crashes during the auto-tuning procedure, not whether any CUTLASS-based candidate kernels were generated at all. This kind of testing actually makes sense, because CUTLASS lists configurations that are known to work on, for example, A100 GPUs, while on less capable GPUs of the Ampere architecture some of these configurations may crash the corresponding candidate kernel during execution (for lack of resources, etc.). However, some kind of test is apparently needed that would actually check the number of CUTLASS candidate kernels generated; a rough sketch of such a test is given below.

Besides this extended testing, it remains to be decided how changes from this stack are to be incorporated. @kadeng suggested that the CUTLASS 2.x and 3.x code paths should be kept separate, and this is how this particular PR is implemented: it adds a new `CUTLASS2xGemmTemplate` class (in `gemm_template_2x.py`), alongside the existing `CUTLASSGemmTemplate`.

Another thing to discuss: is there any need for CUTLASS-based auto-tuning on Ampere for ordinary MM/ADDMM operators? The PRs in this stack restore the auto-tuning functionality on Ampere for the two special GEMM types mentioned, but for ordinary dense arguments of the same dtype, auto-tuning is still not possible on Ampere (while it was available before the mentioned changes).

(As seen from the comments above: the PRs in this stack are not yet to be considered for merging.)
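For example, a test along these lines could assert that candidate kernels are actually generated. This is only a sketch: the patching point `torch._inductor.select_algorithm.autotune_select_algorithm` and the substring-based check are assumptions that would need to be verified against the current Inductor code, and call sites that import the function directly would not be intercepted by this patch:

```python
# Sketch: count the CUTLASS candidate kernels offered during auto-tuning,
# instead of only checking that auto-tuning does not crash.
import unittest.mock as mock

import torch
import torch._inductor.config as inductor_config
from torch._inductor import select_algorithm


def test_cutlass_candidates_generated():
    counts = []
    orig = select_algorithm.autotune_select_algorithm

    def hook(name, choices, *args, **kwargs):
        # Heuristic: treat a choice as CUTLASS-based if "cutlass" appears in
        # its string representation.
        counts.append(sum("cutlass" in str(c).lower() for c in choices))
        return orig(name, choices, *args, **kwargs)

    with inductor_config.patch(
        {"max_autotune": True, "max_autotune_gemm_backends": "CUTLASS"}
    ), mock.patch.object(select_algorithm, "autotune_select_algorithm", hook):
        a = torch.randn(256, 128, device="cuda", dtype=torch.float16)
        b = torch.randn(128, 256, device="cuda", dtype=torch.float16)
        torch.compile(torch.mm)(a, b)

    assert counts and counts[0] > 0, "no CUTLASS candidate kernels generated"
```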
gemm_template_2x.py is almost identical to the existing gemm_template.py.
What is the justification for having two isolated implementations for this? It looks like it would be very easy to cleanly refactor this into a common base + two small derived classes.
```python
CUTLASSGemmTemplate.add_cutlass_gemm_choices(
    choices, layout, [mat1, mat2], fuseable=True, non_fuseable=True
)
CUTLASS2xGemmTemplate.add_cutlass_gemm_choices(
```
Can both of these be used unconditionally? Or does there need to be a branch on the cutlass version?
If we can use both, do we need both?
Edit: from @alexsamardzic's comment I see this is adding back support for Ampere, so if the existing template does not work for Ampere, should there be a branch on SM version to decide which to add?
Roughly speaking: add_cutlass_gemm_choices() will go through all configurations offered by CUTLASS, and will then call filter_op() on each one. For CUTLASSGemmTemplate, filter_op() will eliminate operations that are not for Hopper, and for CUTLASS2xGemmTemplate, it will eliminate operations that are not for Ampere. So the only harm here, as is, is going through all the configurations twice - but that's just another reason to unify these two classes.
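In code, the gating amounts to roughly the following (a sketch, assuming the `cutlass_library` GemmOperation objects carry an integer `arch` field; the function names are illustrative):

```python
# Per-template architecture gate: each template walks the full list of
# CUTLASS configurations and drops the ops of the "wrong" generation.
def filter_op_3x(op):
    # CUTLASSGemmTemplate: keep only Hopper (SM 9.x) ops.
    return op if op.arch == 90 else None


def filter_op_2x(op):
    # CUTLASS2xGemmTemplate: keep only Ampere (SM 8.x) ops.
    return op if op.arch == 80 else None
```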
I would think that it makes sense to factor out the gen_ops() and filter_op() methods from CUTLASSGemmTemplate and CUTLASS2xGemmTemplate into a shared module and make them functions, not methods. Then you can pass in a flag that controls the filtering. @alexsamardzic, is that the kind of refactoring you already had in mind?
(This comment applies to your suggestions below too.) The main point is not the commonalities, but exactly that the differences between the two classes are actually minor. Thus, my question is: do you still hold the position that we need two separate classes, plus a base class? Frankly, I think a single class with some if-else statements, as was the case before the major refactoring, would be simpler. Plus, it would avoid going through all the ops offered by the CUTLASS generator twice in cases where we want to support both CUTLASS 2.x and CUTLASS 3.x architecture kernels for auto-tuning the same operator.
I can understand where you're coming from. It might appear simpler and more compact to write at the moment, but it definitely impacts readability and testability (and thereby maintainability) very badly, and that's not to be underestimated. So, yes, I still think it's better to have two classes. But that's a bit of software design philosophy: I personally prioritize readability, modularity, testability, and orthogonality in software design over the DRY principle, but I know that opinions on that differ.
Going through all ops twice can be solved differently: sort them into separate lists/sets just once and remember these (see the sketch below). It's inefficient anyway at the moment. CUTLASS 3.x will certainly evolve a lot in the future, while CUTLASS 2.x likely won't, so I would not judge what's easy and what isn't just by how the code looks at the moment. It should be possible to evolve both variants independently.
You could factor out shared code into a shared parent class or a utility module if that helps.
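For the sort-once idea, something along these lines could work (a sketch; `generate_all_gemm_ops` is a hypothetical enumeration helper standing in for the CUTLASS generator call):

```python
# Enumerate the CUTLASS generator output once, partition the ops by
# architecture, and cache the result, so that neither template has to
# re-walk the full configuration list.
import functools


@functools.lru_cache(maxsize=None)
def partitioned_ops():
    ops_2x, ops_3x = [], []
    for op in generate_all_gemm_ops():  # hypothetical: yields GemmOperation objects
        (ops_3x if op.arch >= 90 else ops_2x).append(op)
    return tuple(ops_2x), tuple(ops_3x)
```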
OK. I'm on leave now for several weeks, but as soon as I'm back, I'll refactor the code and will ping you for a review then.
Same for me, I will be back in a bit more than two weeks. Sorry I had to delay this.
Removed bits of …
| """ | ||
|
|
||
| # Additional includes which are neccessary if the standalone test / debug runner is generated as wel | ||
| GEMM_STANDALONE_RUNNER_ADDITIONAL_INCLUDES = r""" |
I think there's likely no harm in deduplicating the standalone runner related stuff, e.g. refactor it such that these template strings are included from a single module. Or are there any changes in there that I don't see?
```python
return None

@staticmethod
def flip_cutlass_layout(
```
I think these static utility methods would also be cases where it's likely OK to just call the original implementations, or to factor them out into a separate module that both classes use.
```python
new_op.D.layout = CUTLASS2xGemmTemplate.cutlass_layout(d_layout)
return new_op

def filter_op(
```
This should likely be factored out into a separate module, and take a flag as input that tells whether to return CUTLASS 2.x or 3.x ops.
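For example (a sketch only; the signature and the `arch` field are assumptions for illustration):

```python
# A single module-level filter shared by both templates, with a flag that
# selects CUTLASS 2.x (Ampere) vs. 3.x (Hopper) ops.
def filter_op(op, layout, *, cutlass_2x: bool):
    if op.arch != (80 if cutlass_2x else 90):
        return None
    # ... dtype / layout / alignment checks shared by both templates ...
    return op
```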
```python
op.C.layout = op.D.layout
return op

def gen_ops(self) -> "List[cutlass_gemm_op.GemmOperation]":  # type: ignore[name-defined]  # noqa: F821
```
Same here, a candidate for refactoring (@alexsamardzic, I assume you already had this in mind, as you mentioned).
```python
res += "\n\n" + test_runner_code
return res

def test_call_statement(
```
I think this can likely also be factored out of this and the original class.
@kadeng: This PR is now ready for review, as well as #123742. I made the changes according to your comments above, i.e. the common parts of the 2.x and 3.x generators are now extracted into an abstract base class.
Side question, regarding a previous discussion: I looked recently into the Python part of the CUTLASS library - is there a reason it wasn't used for generating the kernel code here?
Yes, the Python part of the CUTLASS library was not fully available yet when this CUTLASS backend was initially written. To me, the Python backend appears to be well written and would likely provide a cleaner way to generate the code (I would not use the compilation/linking implementation it provides, though). That said, it's likely pretty much an all-or-nothing thing that would require a complete rewrite of the CUTLASS backend, without any performance gains.
Agreed. Its main advantage is that it has a quick and nice method of generating epilogues - for example, it would make the job much easier for me, and the solution much cleaner, if/when I proceed to auto-tuning for mixed data types and/or sparse semi-structured ADDMM operators. In any case: as mentioned above, the code is now changed according to your request that there be two separate classes for the CUTLASS 2.x and 3.x based code, together with an abstract base class for the commonalities - so, would it be possible for you to give a full review of this PR?
```python
# TODO: Enable dynamic test cases when dynamic support is added.
@unittest.skipIf(not SM80OrLater, "need sm_80")
@unittest.skipIf(not SM80, "need sm_80 exactly")
```
Why does this need SM80 exactly, not SM80 or up?
Mixed data types MM is enabled only for SM 8.x in eager mode - it's implemented quite differently in CUTLASS for SM 9.x, and I simply never had an opportunity to test it there. So, for now, the corresponding ATen operator can be auto-tuned only on SM 8.x too, and thus there is no need to enable the test for SM 9.x (also, if I remember correctly, it was failing CI when enabled).
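The guard in question boils down to checking the major compute capability exactly, e.g. (a sketch of how such flags could be defined; the actual test utilities may define them differently):

```python
# Distinguish "SM 8.x exactly" from "SM 8.0 or later" when gating the
# mixed-dtype test, since the SM 9.x CUTLASS path is untested here.
import torch

major, _minor = torch.cuda.get_device_capability()
SM80OrLater = major >= 8
SM80 = major == 8  # SM 8.x only; excludes Hopper (SM 9.x)
```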
Hmm... I think we should ensure that this doesn't lead to exceptions, or silently to incorrect results, on SM90 systems. Do you know how it failed?
Looks good to me. I didn't understand why that one test is SM80 exactly, but that's not blocking IMO.
@pytorchbot merge |
Merge failed. Reason: This PR needs a `release notes:` label. If not, please add the `topic: not user facing` label. To add a label, you can comment to pytorchbot, for example `@pytorchbot label "topic: not user facing"`. For more information, see the PyTorch labeling wiki.
Details for Dev Infra team: raised by workflow job.
Frankly, I don't remember it any more, as it was on the first CI run, when I created the PR... Eager mode has …
@pytorchbot merge |
@alexsamardzic The ciflow/trunk label that's automatically added when merging adds a few more tests, and some of them failed. Could you try to verify whether any of these are real failures? (hud.pytorch.org helps with that.)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 jobs have failed; the first few of them are: trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable), trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable).
Details for Dev Infra team: raised by workflow job.
I've seen these and looked into the logs - they seem unrelated to my changes.
@pytorchbot merge -i |
Merge started. Your change will be merged while ignoring the following 3 checks: trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable), trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable), trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
```python
if epilogue_template is None:
    arguments = self._template_from_string(argument_template).render(
        split_k=1, **options
```
@alexsamardzic Wondering why we fixed the split_k value here? Thanks!
cc @kadeng
If I remember it correctly, it won't compile with any other value.
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang