Conversation

asmeurer (Collaborator) commented Dec 6, 2023

Getting a test failure here that I don't understand:

    raise RuntimeError(
RuntimeError: Difference from float64 is larger with decomposition put.default than original on output 0. Original max diff: 0.015625, Decomp max diff: 7.203125
atol = 1e-07
args = (tensor([[ 1.0547,  3.4375, -7.3750],
        [ 7.3750,  1.8281, -6.8750],
        [ 3.0312,  2.1094,  7.5938]], dtype=torch.bfloat16, requires_grad=True), tensor([6, 4, 2]), tensor([-0.2812, -7.1875, -0.7031], dtype=torch.bfloat16, requires_grad=True), True)
kwargs = {}

However, in pure Python (on CPU), torch.put and my decomposition seem to give the same result:

>>> def put(self, index, source, accumulate=False):
...     flattened = self.flatten().clone()
...     if accumulate:
...         flattened[index] += source
...     else:
...         flattened[index] = source
...     return flattened.reshape(self.shape)
>>> import torch
>>> x = torch.tensor([[ 1.0547,  3.4375, -7.3750],
...         [ 7.3750,  1.8281, -6.8750],
...         [ 3.0312,  2.1094,  7.5938]], dtype=torch.bfloat16)
>>> index = torch.tensor([6, 4, 2])
>>> source = torch.tensor([-0.2812, -7.1875, -0.7031], dtype=torch.bfloat16)
>>> put(x, index, source, True)
tensor([[ 1.0547,  3.4375, -8.0625],
        [ 7.3750, -5.3750, -6.8750],
        [ 2.7500,  2.1094,  7.5938]], dtype=torch.bfloat16)
>>> torch.put(x, index, source, True)
tensor([[ 1.0547,  3.4375, -8.0625],
        [ 7.3750, -5.3750, -6.8750],
        [ 2.7500,  2.1094,  7.5938]], dtype=torch.bfloat16)
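
To make the failure concrete, here's a minimal sketch of the kind of float64-reference comparison the check performs (my own hypothetical helper, not the harness's actual code):

import torch

def max_diff_vs_float64(fn, *args):
    # Hypothetical helper: build a float64 reference by upcasting the
    # floating-point tensor args, then measure the worst-case deviation
    # of the low-precision result from it.
    def upcast(a):
        if isinstance(a, torch.Tensor) and a.is_floating_point():
            return a.to(torch.float64)
        return a

    out = fn(*args)
    ref = fn(*(upcast(a) for a in args))
    return (out.to(torch.float64) - ref).abs().max().item()

Running both torch.put and the decomposition through something like this should report the same diff, given that they produce identical outputs above.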

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

pytorch-bot bot commented Dec 6, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/115306

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures

As of commit 876e0fd with merge base e5f2ac1:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Comment on lines 4348 to 4352
flattened = self.flatten().clone()
if accumulate:
flattened[index] += source
else:
flattened[index] = source
Collaborator commented:

Mutation isn't legal inside decompositions of functional operators (functionalization and tracing assume a decomposition is pure). If I'm not mistaken, you could just use torch.index_put, though.

Suggested change
flattened = self.flatten().clone()
if accumulate:
flattened[index] += source
else:
flattened[index] = source
flattened = self.flatten()
flattened = torch.index_put(flattened, [index], source, accumulate)

asmeurer (Collaborator, Author) commented Dec 7, 2023

It seems this fails because put supports NumPy-style "reverse broadcasting":

>>> x = torch.tensor(1.0)
>>> index = torch.tensor(0)
>>> source = torch.tensor([0.0])
>>> torch.put(x, index, source)
tensor(0.)

(which is IMO a bad idea, but it is tested)

asmeurer (Collaborator, Author) commented:

I guess source.reshape(index.shape) should work, assuming the actual shape conditions are all tested separately.
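
Putting that together with the torch.index_put suggestion, a non-mutating sketch of the decomposition (assuming the shape/numel validity checks are handled separately, e.g. by a meta function):

import torch

def put(self, index, source, accumulate=False):
    # Functional throughout: no in-place mutation, so this is legal
    # inside a decomposition.
    flattened = self.flatten()
    # reshape covers the NumPy-style "reverse broadcasting" case above,
    # where source's shape need not match index's.
    flattened = torch.index_put(
        flattened, [index], source.reshape(index.shape), accumulate
    )
    return flattened.reshape(self.shape)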

asmeurer (Collaborator, Author) commented Dec 7, 2023

Benchmark script based on #114813 (comment)

import torch
from torch.testing import make_tensor
from torch.fx.experimental.proxy_tensor import make_fx
from torch.utils.benchmark import Timer, Compare
from torch._inductor.compile_fx import compile_fx_inner, cudagraphify_impl
from torch._inductor.decomposition import decompositions
from itertools import product
from functools import partial

torch._logging.set_logs(output_code=True)

benchmark_name = "put"
Ss = [512]


def gen_inputs():
    make_arg = partial(torch.randn, dtype=torch.float32, device="cuda")
    make_source = partial(torch.randn, dtype=torch.float32, device="cuda")

    def make_idx(n):
        return make_tensor((n,), device="cuda", dtype=torch.int64, low=0, high=n)

    for b, s in product(Ss, Ss):
        yield make_arg((b * s)), make_idx(b), make_source(b)


def benchmark(label, f, x, idx, source):
    return Timer("f([x, idx, source])",
                 globals=locals(),
                 label=benchmark_name,
                 description=label,
                 sub_label=f"{tuple(x.shape)}",
                 num_threads=torch.get_num_threads()).blocked_autorange(min_run_time=2)


def compare(x, idx, source):
    def f(args):
        x, idx, source = args
        val = torch.ops.aten.put(x, idx, source)
        return (val,)

    print(f"{tuple(x.shape)}")

    args = [x, idx, source]

    decomposed = make_fx(f, decomposition_table=decompositions, tracing_mode="fake")(args)
    compiled_decomposed = compile_fx_inner(decomposed, args, cudagraphs=False)
    yield benchmark("Decomposed", compiled_decomposed, *args)

    non_decomposed = make_fx(f, tracing_mode="fake")(args)
    compiled_nondecomposed = compile_fx_inner(non_decomposed, args, cudagraphs=False)
    yield benchmark("Lowering", compiled_nondecomposed, *args)

    # Just show the first two generated kernels
    torch._logging.set_logs(output_code=False)

    cuda_f = cudagraphify_impl(f, args, static_input_idxs=tuple(range(len(args))))
    yield benchmark("Eager", cuda_f, *args)


results = []
for args in gen_inputs():
    for res in compare(*args):
        results.append(res)

compare = Compare(results)
compare.trim_significant_figures()
compare.print()

Output:

(262144,)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] Output code: 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from ctypes import c_void_p, c_long
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import torch
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import math
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import random
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import os
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import tempfile
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from math import inf, nan
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.hooks import run_intermediate_hooks
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.utils import maybe_profile
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.codegen.memory_planning import _align as align
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch import device, empty, empty_strided
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.codecache import AsyncCompile
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.select_algorithm import extern_kernels
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] aten = torch.ops.aten
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] inductor_ops = torch.ops.inductor
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] alloc_from_pool = torch.ops.inductor._alloc_from_pool
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] reinterpret_tensor = torch.ops.inductor._reinterpret_tensor
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] async_compile = AsyncCompile()
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] # kernel path: /tmp/torchinductor_aaronmeurer/qo/cqo4ledzek5kxu67v3n5tprrtu74ty5mwywp7hf25zb5gnnxkwic.py
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] # Source Nodes: [], Original ATen: []
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] triton_poi_fused_0 = async_compile.triton('triton_', '''
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import triton
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import triton.language as tl
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.ir import ReductionHint
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.ir import TileHint
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.triton_heuristics import AutotuneHint, pointwise
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.utils import instance_descriptor
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor import triton_helpers
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] @pointwise(
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     size_hints=[262144], 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     filename=__file__,
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=(2,))]},
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     inductor_meta={'autotune_hints': set(), 'kernel_name': 'triton_poi_fused_0', 'mutated_arg_names': []},
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     min_elem_per_thread=0
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] )
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] @triton.jit
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xnumel = 262144
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xoffset = tl.program_id(0) * XBLOCK
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xindex = xoffset + tl.arange(0, XBLOCK)[:]
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xmask = xindex < xnumel
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     x0 = xindex
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tmp0 = tl.load(in_ptr0 + (x0), None)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tl.store(out_ptr0 + (x0), tmp0, None)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] ''')
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import triton
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import triton.language as tl
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.triton_heuristics import grid, start_graph, end_graph
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._C import _cuda_getCurrentRawStream as get_cuda_stream
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] # kernel path: /tmp/torchinductor_aaronmeurer/ld/cld3iubcoadahloiptwcsjn574iblirjkejdnxx3beeubrudjlhj.py
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] # Source Nodes: [], Original ATen: []
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] triton_poi_fused_1 = async_compile.triton('triton_', '''
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import triton
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import triton.language as tl
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.ir import ReductionHint
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.ir import TileHint
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.triton_heuristics import AutotuneHint, pointwise
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.utils import instance_descriptor
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor import triton_helpers
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] @pointwise(
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     size_hints=[512], 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     filename=__file__,
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     triton_meta={'signature': {0: '*i64', 1: '*fp32', 2: '*fp32', 3: 'i32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2, 3), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=(3,))]},
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     inductor_meta={'autotune_hints': set(), 'kernel_name': 'triton_poi_fused_1', 'mutated_arg_names': ['out_ptr0']},
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     min_elem_per_thread=0
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] )
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] @triton.jit
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xnumel = 512
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xoffset = tl.program_id(0) * XBLOCK
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xindex = xoffset + tl.arange(0, XBLOCK)[:]
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xmask = xindex < xnumel
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     x0 = xindex
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tmp0 = tl.load(in_ptr0 + (x0), xmask)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tmp4 = tl.load(in_ptr1 + (x0), xmask)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tmp1 = tmp0 + 262144
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tmp2 = tmp0 < 0
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tmp3 = tl.where(tmp2, tmp1, tmp0)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tl.device_assert(((0 <= tmp3) & (tmp3 < 262144)) | ~xmask, "index out of bounds: 0 <= tmp3 < 262144")
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tl.store(out_ptr0 + (tl.broadcast_to(tmp3, [XBLOCK])), tmp4, xmask)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] ''')
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] async_compile.wait(globals())
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] del async_compile
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] def call(args):
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     args_1, args_2, args_3 = args
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     args.clear()
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     assert_size_stride(args_1, (262144, ), (1, ))
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     assert_size_stride(args_2, (512, ), (1, ))
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     assert_size_stride(args_3, (512, ), (1, ))
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     with torch.cuda._DeviceGuard(0):
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         torch.cuda.set_device(0) # no-op to ensure context
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         buf0 = empty((262144, ), device='cuda', dtype=torch.float32)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         # Source Nodes: [], Original ATen: []
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         stream0 = get_cuda_stream(0)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         triton_poi_fused_0.run(args_1, buf0, 262144, grid=grid(262144), stream=stream0)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         del args_1
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         # Source Nodes: [], Original ATen: []
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         triton_poi_fused_1.run(args_2, args_3, buf0, 512, grid=grid(512), stream=stream0)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         del args_2
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         del args_3
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         return (buf0, )
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] def benchmark_compiled_module(times=10, repeat=10):
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     from torch._dynamo.testing import rand_strided
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     from torch._inductor.utils import print_performance
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     args_1 = rand_strided((262144, ), (1, ), device='cuda:0', dtype=torch.float32)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     args_2 = rand_strided((512, ), (1, ), device='cuda:0', dtype=torch.int64)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     args_3 = rand_strided((512, ), (1, ), device='cuda:0', dtype=torch.float32)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     return print_performance(lambda: call([args_1, args_2, args_3]), times=times, repeat=repeat)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] if __name__ == "__main__":
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     from torch._inductor.wrapper_benchmark import compiled_module_main
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     compiled_module_main('None', benchmark_compiled_module)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [INFO] Output code written to: /tmp/torchinductor_aaronmeurer/ze/cze4wgypbgwqi6vtoecbjgjw6nmwsahnmbz5adspksgfq4hh7pyn.py
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] Output code: 
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from ctypes import c_void_p, c_long
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] import torch
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] import math
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] import random
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] import os
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] import tempfile
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from math import inf, nan
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.hooks import run_intermediate_hooks
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.utils import maybe_profile
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.codegen.memory_planning import _align as align
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from torch import device, empty, empty_strided
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.codecache import AsyncCompile
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.select_algorithm import extern_kernels
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] aten = torch.ops.aten
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] inductor_ops = torch.ops.inductor
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] alloc_from_pool = torch.ops.inductor._alloc_from_pool
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] reinterpret_tensor = torch.ops.inductor._reinterpret_tensor
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] async_compile = AsyncCompile()
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] async_compile.wait(globals())
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] del async_compile
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] def call(args):
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     args_1, args_2, args_3 = args
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     args.clear()
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     assert_size_stride(args_1, (262144, ), (1, ))
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     assert_size_stride(args_2, (512, ), (1, ))
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     assert_size_stride(args_3, (512, ), (1, ))
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     with torch.cuda._DeviceGuard(0):
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         torch.cuda.set_device(0) # no-op to ensure context
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         # Source Nodes: [], Original ATen: []
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         buf0 = aten.put(args_1, args_2, args_3)
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         del args_1
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         del args_2
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         del args_3
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         buf1 = buf0
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         return (buf1, )
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] def benchmark_compiled_module(times=10, repeat=10):
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     from torch._dynamo.testing import rand_strided
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     from torch._inductor.utils import print_performance
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     args_1 = rand_strided((262144, ), (1, ), device='cuda:0', dtype=torch.float32)
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     args_2 = rand_strided((512, ), (1, ), device='cuda:0', dtype=torch.int64)
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     args_3 = rand_strided((512, ), (1, ), device='cuda:0', dtype=torch.float32)
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     return print_performance(lambda: call([args_1, args_2, args_3]), times=times, repeat=repeat)
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] if __name__ == "__main__":
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     from torch._inductor.wrapper_benchmark import compiled_module_main
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     compiled_module_main('None', benchmark_compiled_module)
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [INFO] Output code written to: /tmp/torchinductor_aaronmeurer/bd/cbdiox4iocae7tvdx24g5jl6sgr55m3dkd65o2samwoqk3bday6m.py
[----------------------- put -----------------------]
                 |  Decomposed  |  Lowering  |  Eager
1 threads: ------------------------------------------
      (262144,)  |     41.8     |     23     |    12

Times are in microseconds (us).

colesbury added the triaged label Dec 8, 2023
asmeurer and others added 2 commits December 8, 2023 14:27
make_fallback(aten.polygamma)
make_fallback(aten.put)
make_fallback(aten.reflection_pad1d)
make_fallback(aten.replication_pad1d)
Collaborator commented:

I think your rebase has gone awry.

Collaborator commented:

You need to remove the aten.put fallback since we're adding a decomposition that replaces it.
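
That is, the fallback list quoted above would become (a sketch; the neighboring lines stay as they are):

make_fallback(aten.polygamma)
make_fallback(aten.reflection_pad1d)
make_fallback(aten.replication_pad1d)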

norm = x.norm(2, keep_dim, keepdim=True)
return x * (y / norm), norm

@register_decomposition(aten.put)
Collaborator commented:

For the test_has_decomposition failure, you can run the test as

EXPECTTEST_ACCEPT=1 pytest test/test_decomp.py -k HasDecomp

and it will update the expected test output files:

test/expect/HasDecompTest.test_has_decomposition.expect
test/expect/HasDecompTest.test_aten_core_operators.expect

pearu added a commit that referenced this pull request Feb 18, 2024
As in the title. It is an updated copy of #115306.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Mar 4, 2024
github-actions bot (Contributor) commented Mar 6, 2024

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions bot added the Stale label Mar 6, 2024
lezcano (Collaborator) commented Mar 6, 2024

landed in #120179

lezcano closed this Mar 6, 2024