
[Performance]: Fuse padding onto GEMM by making the GEMM out-of-place #24917

@ProExpertProg

Description


Proposal to improve performance

Currently, at the start of `fused_moe`, we pad the hidden dim of the activations to comply with the fused MoE kernel's requirements. This results in a copy following the router GEMM (GPT-OSS, DeepSeek). In the captured `fx.Graph`, it looks something like:

```py
        ...
        mul_22: "bf16[s72, 2880]" = torch.ops.aten.mul.Tensor(convert_element_type_6, arg4_1);  convert_element_type_6 = arg4_1 = None

        constant_pad_nd: "bf16[s72, 3072]" = torch.ops.aten.constant_pad_nd.default(mul_22, [0, 192], 0.0)
        ...
```

Instead, we should write the output of `mul_22` directly into a pre-padded tensor by replacing the sequence of these two operations with an out-of-place mm call that takes a pre-allocated output tensor as an argument.
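
A minimal sketch of the idea in plain PyTorch (not vLLM's actual fusion pass): the result is written straight into a view of a pre-allocated padded buffer via the `out=` variant, so the separate `constant_pad_nd` copy disappears. All names and shapes here (`num_tokens`, `hidden`, `padded_hidden`, `router_weight`, `scale`) are illustrative, and float32 is used only so the snippet runs anywhere.

```py
import torch

num_tokens, hidden, padded_hidden = 16, 2880, 3072  # pad 2880 -> 3072 as in the graph above

x = torch.randn(num_tokens, hidden)
router_weight = torch.randn(hidden, hidden)
scale = torch.randn(hidden)

# Current pattern: GEMM -> mul -> constant_pad_nd (an extra full-tensor copy).
reference = torch.nn.functional.pad(torch.mm(x, router_weight) * scale,
                                    (0, padded_hidden - hidden))

# Proposed pattern: pre-allocate the padded buffer and write the result into a
# view of it, skipping the pad copy. The GEMM itself could equally take the view
# as its out= destination.
padded = torch.zeros(num_tokens, padded_hidden)
torch.mul(torch.mm(x, router_weight), scale, out=padded[:, :hidden])

torch.testing.assert_close(padded, reference)
```

Whether a given backend kernel writes into the strided view directly or falls back to an internal temporary likely depends on the GEMM implementation, so the rewrite should be confirmed with a profile.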

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
