Conversation

asmeurer (Collaborator) commented Dec 6, 2023

Getting a test failure here that I don't understand:

    raise RuntimeError(
RuntimeError: Difference from float64 is larger with decomposition put.default than original on output 0. Original max diff: 0.015625, Decomp max diff: 7.203125
atol = 1e-07
args = (tensor([[ 1.0547,  3.4375, -7.3750],
        [ 7.3750,  1.8281, -6.8750],
        [ 3.0312,  2.1094,  7.5938]], dtype=torch.bfloat16, requires_grad=True), tensor([6, 4, 2]), tensor([-0.2812, -7.1875, -0.7031], dtype=torch.bfloat16, requires_grad=True), True)
kwargs = {}

However, in pure Python (on CPU), torch.put and my decomposition seem to give the same result:

>>> def put(self, index, source, accumulate=False):
...     flattened = self.flatten().clone()
...     if accumulate:
...         flattened[index] += source
...     else:
...         flattened[index] = source
...     return flattened.reshape(self.shape)
>>> import torch
>>> x = torch.tensor([[ 1.0547,  3.4375, -7.3750],
...         [ 7.3750,  1.8281, -6.8750],
...         [ 3.0312,  2.1094,  7.5938]], dtype=torch.bfloat16)
>>> index = torch.tensor([6, 4, 2])
>>> source = torch.tensor([-0.2812, -7.1875, -0.7031], dtype=torch.bfloat16)
>>> put(x, index, source, True)
tensor([[ 1.0547,  3.4375, -8.0625],
        [ 7.3750, -5.3750, -6.8750],
        [ 2.7500,  2.1094,  7.5938]], dtype=torch.bfloat16)
>>> torch.put(x, index, source, True)
tensor([[ 1.0547,  3.4375, -8.0625],
        [ 7.3750, -5.3750, -6.8750],
        [ 2.7500,  2.1094,  7.5938]], dtype=torch.bfloat16)
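
To make the failure concrete, here's a minimal sketch of the kind of float64-reference comparison the check performs (my own hypothetical helper, not the harness's actual code):

import torch

def max_diff_vs_float64(fn, *args):
    # Hypothetical helper: build a float64 reference by upcasting the
    # floating-point tensor args, then measure the worst-case deviation
    # of the low-precision result from it.
    def upcast(a):
        if isinstance(a, torch.Tensor) and a.is_floating_point():
            return a.to(torch.float64)
        return a

    out = fn(*args)
    ref = fn(*(upcast(a) for a in args))
    return (out.to(torch.float64) - ref).abs().max().item()

Running both torch.put and the decomposition through something like this should report the same diff, given that they produce identical outputs above.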

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

pytorch-bot bot commented Dec 6, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/115306

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures

As of commit 876e0fd with merge base e5f2ac1:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Comment on lines 4348 to 4352
flattened = self.flatten().clone()
if accumulate:
flattened[index] += source
else:
flattened[index] = source
Collaborator commented:

Mutation isn't legal inside decompositions of functional operators (functionalization and tracing assume a decomposition is pure). If I'm not mistaken, you could just use torch.index_put, though.

Suggested change
flattened = self.flatten().clone()
if accumulate:
flattened[index] += source
else:
flattened[index] = source
flattened = self.flatten()
flattened = torch.index_put(flattened, [index], source, accumulate)

asmeurer (Collaborator, Author) commented Dec 7, 2023

It seems this fails because put supports NumPy-style "reverse broadcasting":

>>> x = torch.tensor(1.0)
>>> index = torch.tensor(0)
>>> source = torch.tensor([0.0])
>>> torch.put(x, index, source)
tensor(0.)

(which is IMO a bad idea, but it is tested)

asmeurer (Collaborator, Author) commented:

I guess source.reshape(index.shape) should work, assuming the actual shape conditions are all tested separately.
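
Putting that together with the torch.index_put suggestion, a non-mutating sketch of the decomposition (assuming the shape/numel validity checks are handled separately, e.g. by a meta function):

import torch

def put(self, index, source, accumulate=False):
    # Functional throughout: no in-place mutation, so this is legal
    # inside a decomposition.
    flattened = self.flatten()
    # reshape covers the NumPy-style "reverse broadcasting" case above,
    # where source's shape need not match index's.
    flattened = torch.index_put(
        flattened, [index], source.reshape(index.shape), accumulate
    )
    return flattened.reshape(self.shape)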

asmeurer (Collaborator, Author) commented Dec 7, 2023

Benchmark script based on #114813 (comment)

import torch
from torch.testing import make_tensor
from torch.fx.experimental.proxy_tensor import make_fx
from torch.utils.benchmark import Timer, Compare
from torch._inductor.compile_fx import compile_fx_inner, cudagraphify_impl
from torch._inductor.decomposition import decompositions
from itertools import product
from functools import partial

torch._logging.set_logs(output_code=True)

benchmark_name = "put"
Ss = [512]


def gen_inputs():
    make_arg = partial(torch.randn, dtype=torch.float32, device="cuda")
    make_source = partial(torch.randn, dtype=torch.float32, device="cuda")

    def make_idx(n):
        return make_tensor((n,), device="cuda", dtype=torch.int64, low=0, high=n)

    for b, s in product(Ss, Ss):
        yield make_arg((b * s)), make_idx(b), make_source(b)


def benchmark(label, f, x, idx, source):
    return Timer("f([x, idx, source])",
                 globals=locals(),
                 label=benchmark_name,
                 description=label,
                 sub_label=f"{tuple(x.shape)}",
                 num_threads=torch.get_num_threads()).blocked_autorange(min_run_time=2)


def compare(x, idx, source):
    def f(args):
        x, idx, source = args
        val = torch.ops.aten.put(x, idx, source)
        return (val,)

    print(f"{tuple(x.shape)}")

    args = [x, idx, source]

    decomposed = make_fx(f, decomposition_table=decompositions, tracing_mode="fake")(args)
    compiled_decomposed = compile_fx_inner(decomposed, args, cudagraphs=False)
    yield benchmark("Decomposed", compiled_decomposed, *args)

    non_decomposed = make_fx(f, tracing_mode="fake")(args)
    compiled_nondecomposed = compile_fx_inner(non_decomposed, args, cudagraphs=False)
    yield benchmark("Lowering", compiled_nondecomposed, *args)

    # Just show the first two generated kernels
    torch._logging.set_logs(output_code=False)

    cuda_f = cudagraphify_impl(f, args, static_input_idxs=tuple(range(len(args))))
    yield benchmark("Eager", cuda_f, *args)


results = []
for args in gen_inputs():
    for res in compare(*args):
        results.append(res)

compare = Compare(results)
compare.trim_significant_figures()
compare.print()

Output:

(262144,)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] Output code: 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from ctypes import c_void_p, c_long
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import torch
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import math
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import random
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import os
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import tempfile
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from math import inf, nan
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.hooks import run_intermediate_hooks
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.utils import maybe_profile
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.codegen.memory_planning import _align as align
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch import device, empty, empty_strided
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.codecache import AsyncCompile
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.select_algorithm import extern_kernels
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] aten = torch.ops.aten
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] inductor_ops = torch.ops.inductor
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] alloc_from_pool = torch.ops.inductor._alloc_from_pool
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] reinterpret_tensor = torch.ops.inductor._reinterpret_tensor
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] async_compile = AsyncCompile()
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] # kernel path: /tmp/torchinductor_aaronmeurer/qo/cqo4ledzek5kxu67v3n5tprrtu74ty5mwywp7hf25zb5gnnxkwic.py
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] # Source Nodes: [], Original ATen: []
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] triton_poi_fused_0 = async_compile.triton('triton_', '''
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import triton
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import triton.language as tl
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.ir import ReductionHint
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.ir import TileHint
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.triton_heuristics import AutotuneHint, pointwise
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.utils import instance_descriptor
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor import triton_helpers
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] @pointwise(
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     size_hints=[262144], 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     filename=__file__,
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=(2,))]},
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     inductor_meta={'autotune_hints': set(), 'kernel_name': 'triton_poi_fused_0', 'mutated_arg_names': []},
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     min_elem_per_thread=0
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] )
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] @triton.jit
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xnumel = 262144
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xoffset = tl.program_id(0) * XBLOCK
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xindex = xoffset + tl.arange(0, XBLOCK)[:]
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xmask = xindex < xnumel
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     x0 = xindex
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tmp0 = tl.load(in_ptr0 + (x0), None)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tl.store(out_ptr0 + (x0), tmp0, None)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] ''')
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import triton
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import triton.language as tl
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.triton_heuristics import grid, start_graph, end_graph
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._C import _cuda_getCurrentRawStream as get_cuda_stream
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] # kernel path: /tmp/torchinductor_aaronmeurer/ld/cld3iubcoadahloiptwcsjn574iblirjkejdnxx3beeubrudjlhj.py
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] # Source Nodes: [], Original ATen: []
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] triton_poi_fused_1 = async_compile.triton('triton_', '''
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import triton
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] import triton.language as tl
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.ir import ReductionHint
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.ir import TileHint
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.triton_heuristics import AutotuneHint, pointwise
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.utils import instance_descriptor
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor import triton_helpers
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] @pointwise(
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     size_hints=[512], 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     filename=__file__,
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     triton_meta={'signature': {0: '*i64', 1: '*fp32', 2: '*fp32', 3: 'i32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2, 3), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=(3,))]},
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     inductor_meta={'autotune_hints': set(), 'kernel_name': 'triton_poi_fused_1', 'mutated_arg_names': ['out_ptr0']},
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     min_elem_per_thread=0
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] )
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] @triton.jit
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xnumel = 512
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xoffset = tl.program_id(0) * XBLOCK
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xindex = xoffset + tl.arange(0, XBLOCK)[:]
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     xmask = xindex < xnumel
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     x0 = xindex
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tmp0 = tl.load(in_ptr0 + (x0), xmask)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tmp4 = tl.load(in_ptr1 + (x0), xmask)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tmp1 = tmp0 + 262144
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tmp2 = tmp0 < 0
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tmp3 = tl.where(tmp2, tmp1, tmp0)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tl.device_assert(((0 <= tmp3) & (tmp3 < 262144)) | ~xmask, "index out of bounds: 0 <= tmp3 < 262144")
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     tl.store(out_ptr0 + (tl.broadcast_to(tmp3, [XBLOCK])), tmp4, xmask)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] ''')
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] async_compile.wait(globals())
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] del async_compile
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] def call(args):
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     args_1, args_2, args_3 = args
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     args.clear()
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     assert_size_stride(args_1, (262144, ), (1, ))
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     assert_size_stride(args_2, (512, ), (1, ))
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     assert_size_stride(args_3, (512, ), (1, ))
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     with torch.cuda._DeviceGuard(0):
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         torch.cuda.set_device(0) # no-op to ensure context
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         buf0 = empty((262144, ), device='cuda', dtype=torch.float32)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         # Source Nodes: [], Original ATen: []
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         stream0 = get_cuda_stream(0)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         triton_poi_fused_0.run(args_1, buf0, 262144, grid=grid(262144), stream=stream0)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         del args_1
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         # Source Nodes: [], Original ATen: []
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         triton_poi_fused_1.run(args_2, args_3, buf0, 512, grid=grid(512), stream=stream0)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         del args_2
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         del args_3
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]         return (buf0, )
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] def benchmark_compiled_module(times=10, repeat=10):
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     from torch._dynamo.testing import rand_strided
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     from torch._inductor.utils import print_performance
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     args_1 = rand_strided((262144, ), (1, ), device='cuda:0', dtype=torch.float32)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     args_2 = rand_strided((512, ), (1, ), device='cuda:0', dtype=torch.int64)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     args_3 = rand_strided((512, ), (1, ), device='cuda:0', dtype=torch.float32)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     return print_performance(lambda: call([args_1, args_2, args_3]), times=times, repeat=repeat)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] if __name__ == "__main__":
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     from torch._inductor.wrapper_benchmark import compiled_module_main
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG]     compiled_module_main('None', benchmark_compiled_module)
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:10,177] torch._inductor.graph.__output_code: [INFO] Output code written to: /tmp/torchinductor_aaronmeurer/ze/cze4wgypbgwqi6vtoecbjgjw6nmwsahnmbz5adspksgfq4hh7pyn.py
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] Output code: 
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] 
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from ctypes import c_void_p, c_long
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] import torch
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] import math
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] import random
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] import os
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] import tempfile
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from math import inf, nan
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.hooks import run_intermediate_hooks
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.utils import maybe_profile
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.codegen.memory_planning import _align as align
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from torch import device, empty, empty_strided
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.codecache import AsyncCompile
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] from torch._inductor.select_algorithm import extern_kernels
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] aten = torch.ops.aten
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] inductor_ops = torch.ops.inductor
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] alloc_from_pool = torch.ops.inductor._alloc_from_pool
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] reinterpret_tensor = torch.ops.inductor._reinterpret_tensor
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] async_compile = AsyncCompile()
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] async_compile.wait(globals())
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] del async_compile
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] def call(args):
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     args_1, args_2, args_3 = args
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     args.clear()
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     assert_size_stride(args_1, (262144, ), (1, ))
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     assert_size_stride(args_2, (512, ), (1, ))
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     assert_size_stride(args_3, (512, ), (1, ))
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     with torch.cuda._DeviceGuard(0):
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         torch.cuda.set_device(0) # no-op to ensure context
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         # Source Nodes: [], Original ATen: []
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         buf0 = aten.put(args_1, args_2, args_3)
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         del args_1
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         del args_2
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         del args_3
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         buf1 = buf0
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]         return (buf1, )
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] def benchmark_compiled_module(times=10, repeat=10):
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     from torch._dynamo.testing import rand_strided
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     from torch._inductor.utils import print_performance
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     args_1 = rand_strided((262144, ), (1, ), device='cuda:0', dtype=torch.float32)
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     args_2 = rand_strided((512, ), (1, ), device='cuda:0', dtype=torch.int64)
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     args_3 = rand_strided((512, ), (1, ), device='cuda:0', dtype=torch.float32)
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     return print_performance(lambda: call([args_1, args_2, args_3]), times=times, repeat=repeat)
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG] if __name__ == "__main__":
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     from torch._inductor.wrapper_benchmark import compiled_module_main
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]     compiled_module_main('None', benchmark_compiled_module)
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [DEBUG]
[2023-12-06 18:08:12,809] torch._inductor.graph.__output_code: [INFO] Output code written to: /tmp/torchinductor_aaronmeurer/bd/cbdiox4iocae7tvdx24g5jl6sgr55m3dkd65o2samwoqk3bday6m.py
[----------------------- put -----------------------]
                 |  Decomposed  |  Lowering  |  Eager
1 threads: ------------------------------------------
      (262144,)  |     41.8     |     23     |    12

Times are in microseconds (us).

colesbury added the triaged label Dec 8, 2023
asmeurer and others added 2 commits December 8, 2023 14:27
make_fallback(aten.polygamma)
make_fallback(aten.put)
make_fallback(aten.reflection_pad1d)
make_fallback(aten.replication_pad1d)
Collaborator commented:

I think your rebase has gone awry.

Collaborator commented:

You need to remove the aten.put fallback since we're adding a decomposition that replaces it.
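
That is, the fallback list quoted above would become (a sketch; the neighboring lines stay as they are):

make_fallback(aten.polygamma)
make_fallback(aten.reflection_pad1d)
make_fallback(aten.replication_pad1d)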

norm = x.norm(2, keep_dim, keepdim=True)
return x * (y / norm), norm

@register_decomposition(aten.put)
Collaborator commented:

For the test_has_decomposition failure, you can run the test as

EXPECTTEST_ACCEPT=1 pytest test/test_decomp.py -k HasDecomp

and it will update the expected test output files:

test/expect/HasDecompTest.test_has_decomposition.expect
test/expect/HasDecompTest.test_aten_core_operators.expect

pearu added a commit that referenced this pull request Feb 18, 2024
As in the title. It is an updated copy of #115306.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Mar 4, 2024
github-actions bot (Contributor) commented Mar 6, 2024

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions bot added the Stale label Mar 6, 2024
lezcano (Collaborator) commented Mar 6, 2024

landed in #120179

lezcano closed this Mar 6, 2024