
[Inductor] [CPU] Torchbench model soft_actor_critic performance regression > 10% on ww02.3 #93505

@yudongsi

Description


🐛 Describe the bug

Compared with the TorchInductor CPU Performance Dashboard results on ww02.2, the Torchbench model soft_actor_critic shows a performance regression of more than 10% on ww02.3, as shown below:

|        | batch_size | speedup | inductor (s) | eager (s)   |
|--------|------------|---------|--------------|-------------|
| ww02.3 | 256        | 1.6536  | 0.0004336    | 0.000717001 |
| ww02.2 | 256        | 1.8405  | 0.0003333    | 0.000613439 |

Ratios (ww02.3 vs. ww02.2): speedup ratio 0.9, eager ratio 0.86, inductor ratio 0.77.
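For clarity, a short sketch of how the ratio columns are derived (the direction of each ratio is an assumption inferred from the reported values: speedup is new/old, while the latency ratios are old/new, so values below 1.0 indicate a regression):

```python
# Reproduce the ratio columns of the comparison table from the raw numbers.
ww023 = {"speedup": 1.6536, "inductor_s": 0.0004336, "eager_s": 0.000717001}
ww022 = {"speedup": 1.8405, "inductor_s": 0.0003333, "eager_s": 0.000613439}

# Speedup ratio: ww02.3 speedup relative to ww02.2 (higher is better).
speedup_ratio = ww023["speedup"] / ww022["speedup"]
# Latency ratios: ww02.2 time relative to ww02.3 (below 1.0 means ww02.3 got slower).
eager_ratio = ww022["eager_s"] / ww023["eager_s"]
inductor_ratio = ww022["inductor_s"] / ww023["inductor_s"]

print(round(speedup_ratio, 2), round(eager_ratio, 2), round(inductor_ratio, 2))
```

The inductor latency regressed more (0.77) than the eager latency (0.86), which is why the end-to-end speedup also dropped.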

WW02.3 SW info:

| SW          | Nightly commit | Master/Main commit |
|-------------|----------------|--------------------|
| PyTorch     | fac4361        | 73e5379            |
| Torchbench  | /              | 354378b            |
| torchaudio  | ecc2781        | 4a037b0            |
| torchtext   | 112d757        | c7cc5fc            |
| torchvision | ac06efe        | 35f68a0            |
| torchdata   | 049fb62        | c0934b9            |

WW02.2 SW info:

| SW          | Nightly commit | Master/Main commit |
|-------------|----------------|--------------------|
| PyTorch     | fac4361        | 73e5379            |
| Torchbench  | /              | ff361c6            |
| torchaudio  | 1c98d76        | 0be8423            |
| torchtext   | 6cbfd3e        | 7c7b640            |
| torchvision | b7637f6        | 0dceac0            |
| torchdata   | 0d9aa37        | 0a0ae5d            |

Error logs

graph.py of this model on ww02.3:
GRAPH_INDEX:0
class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[1024, 3], arg1_1: f32[1024], arg2_1: f32[2638049, 1], arg3_1: f32[1024, 1024], arg4_1: f32[1024], arg5_1: f32[3490017, 1], arg6_1: f32[2, 1024], arg7_1: f32[2], arg8_1: f32[2310369, 1], arg9_1: f32[256, 3]):
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:114, code: x = F.relu(self.fc1(state))
        _mkl_linear: f32[256, 1024] = torch.ops.mkl._mkl_linear.default(arg9_1, arg2_1, arg0_1, arg1_1, 256);  arg9_1 = arg2_1 = arg0_1 = arg1_1 = None
        relu: f32[256, 1024] = torch.ops.aten.relu.default(_mkl_linear);  _mkl_linear = None
        
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:115, code: x = F.relu(self.fc2(x))
        _mkl_linear_1: f32[256, 1024] = torch.ops.mkl._mkl_linear.default(relu, arg5_1, arg3_1, arg4_1, 256);  relu = arg5_1 = arg3_1 = arg4_1 = None
        relu_1: f32[256, 1024] = torch.ops.aten.relu.default(_mkl_linear_1);  _mkl_linear_1 = None
        
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:116, code: out = self.fc3(x)
        _mkl_linear_2: f32[256, 2] = torch.ops.mkl._mkl_linear.default(relu_1, arg8_1, arg6_1, arg7_1, 256);  relu_1 = arg8_1 = arg6_1 = arg7_1 = None
        
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:117, code: mu, log_std = out.chunk(2, dim=1)
        split = torch.ops.aten.split.Tensor(_mkl_linear_2, 1, 1);  _mkl_linear_2 = None
        getitem: f32[256, 1] = split[0]
        getitem_1: f32[256, 1] = split[1];  split = None
        
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:119, code: log_std = torch.tanh(log_std)
        tanh: f32[256, 1] = torch.ops.aten.tanh.default(getitem_1);  getitem_1 = None
        
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:122, code: ) * (log_std + 1)
        add: f32[256, 1] = torch.ops.aten.add.Tensor(tanh, 1);  tanh = None
        
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:120, code: log_std = self.log_std_low + 0.5 * (
        mul: f32[256, 1] = torch.ops.aten.mul.Tensor(add, 6.0);  add = None
        add_1: f32[256, 1] = torch.ops.aten.add.Tensor(mul, -10.0);  mul = None
        
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:123, code: std = log_std.exp()
        exp: f32[256, 1] = torch.ops.aten.exp.default(add_1);  add_1 = None
        return (getitem, exp, getitem, exp)
        

graph.py of this model on ww02.2:
GRAPH_INDEX:0
class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[1024, 3], arg1_1: f32[1024], arg2_1: f32[2638049, 1], arg3_1: f32[1024, 1024], arg4_1: f32[1024], arg5_1: f32[3490017, 1], arg6_1: f32[2, 1024], arg7_1: f32[2], arg8_1: f32[2310369, 1], arg9_1: f32[256, 3]):
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:114, code: x = F.relu(self.fc1(state))
        _mkl_linear: f32[256, 1024] = torch.ops.mkl._mkl_linear.default(arg9_1, arg2_1, arg0_1, arg1_1, 256);  arg9_1 = arg2_1 = arg0_1 = arg1_1 = None
        relu: f32[256, 1024] = torch.ops.aten.relu.default(_mkl_linear);  _mkl_linear = None
        
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:115, code: x = F.relu(self.fc2(x))
        _mkl_linear_1: f32[256, 1024] = torch.ops.mkl._mkl_linear.default(relu, arg5_1, arg3_1, arg4_1, 256);  relu = arg5_1 = arg3_1 = arg4_1 = None
        relu_1: f32[256, 1024] = torch.ops.aten.relu.default(_mkl_linear_1);  _mkl_linear_1 = None
        
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:116, code: out = self.fc3(x)
        _mkl_linear_2: f32[256, 2] = torch.ops.mkl._mkl_linear.default(relu_1, arg8_1, arg6_1, arg7_1, 256);  relu_1 = arg8_1 = arg6_1 = arg7_1 = None
        
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:117, code: mu, log_std = out.chunk(2, dim=1)
        split = torch.ops.aten.split.Tensor(_mkl_linear_2, 1, 1);  _mkl_linear_2 = None
        getitem: f32[256, 1] = split[0]
        getitem_1: f32[256, 1] = split[1];  split = None
        
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:119, code: log_std = torch.tanh(log_std)
        tanh: f32[256, 1] = torch.ops.aten.tanh.default(getitem_1);  getitem_1 = None
        
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:122, code: ) * (log_std + 1)
        add: f32[256, 1] = torch.ops.aten.add.Tensor(tanh, 1);  tanh = None
        
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:120, code: log_std = self.log_std_low + 0.5 * (
        mul: f32[256, 1] = torch.ops.aten.mul.Tensor(add, 6.0);  add = None
        add_1: f32[256, 1] = torch.ops.aten.add.Tensor(mul, -10.0);  mul = None
        
        # File: /workspace/benchmark/torchbenchmark/models/soft_actor_critic/nets.py:123, code: std = log_std.exp()
        exp: f32[256, 1] = torch.ops.aten.exp.default(add_1);  add_1 = None
        return (getitem, exp, getitem, exp)
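
The two graph dumps above are identical, so the captured graph itself did not change between ww02.2 and ww02.3; the regression must come from elsewhere (e.g. dependency updates or codegen). For reference, here is a minimal sketch of the eager model this graph corresponds to. The class name, constructor signature, and the `log_std_low=-10.0` / `log_std_high=2.0` bounds are assumptions reconstructed from the inline nets.py comments and the `mul 6.0` / `add -10.0` constants in the graph, not the actual Torchbench source:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticActorSketch(nn.Module):
    """Hypothetical reconstruction of the traced soft_actor_critic policy net."""

    def __init__(self, state_dim=3, hidden=1024, act_dim=1,
                 log_std_low=-10.0, log_std_high=2.0):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, 2 * act_dim)  # outputs mu and log_std halves
        self.log_std_low = log_std_low
        self.log_std_high = log_std_high

    def forward(self, state):
        x = F.relu(self.fc1(state))            # nets.py:114
        x = F.relu(self.fc2(x))                # nets.py:115
        out = self.fc3(x)                      # nets.py:116
        mu, log_std = out.chunk(2, dim=1)      # nets.py:117
        log_std = torch.tanh(log_std)          # nets.py:119
        # Matches the graph's mul-by-6.0 then add -10.0:
        # low + 0.5 * (high - low) * (tanh(log_std) + 1)
        log_std = self.log_std_low + 0.5 * (
            self.log_std_high - self.log_std_low
        ) * (log_std + 1)                      # nets.py:120-122
        std = log_std.exp()                    # nets.py:123
        return mu, std

model = StochasticActorSketch()
mu, std = model(torch.randn(256, 3))  # batch_size 256, as in the benchmark
```

The `torch.ops.mkl._mkl_linear` calls in the dump are the MKL-prepacked equivalents of the three `nn.Linear` layers above.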
        

Minified repro

python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/torchbench.py --performance --float32 -dcpu --output=inductor_log/ww022.csv -n50 --inductor --no-skip --dashboard --only soft_actor_critic --cold_start_latency

cc @ezyang @soumith @msaroufim @wconstab @ngimel @bdhirsh

Status: Done