Fix unbacked replacement where LHS is purely backed expr and RHS is unbacked expr #164013
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164013. Note: links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 1d1f148 with merge base e21b037. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
torch/_inductor/sizevars.py (Outdated)

```python
    return all(self.shape_env.is_unbacked_symint(s) for s in expr.free_symbols)

def all_symbols_are_backed(expr: Expr) -> bool:
    return all(symbol_is_type(s, SymT.SIZE) for s in expr.free_symbols)
```
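As a rough illustration of what the `all_symbols_are_backed` predicate checks — using a hypothetical name-prefix check as a stand-in for the internal `symbol_is_type(s, SymT.SIZE)` helper, since backed sizes are conventionally named `s0, s1, …` and unbacked ones `u0, u1, …`:

```python
import sympy

def all_symbols_are_backed(expr: sympy.Expr) -> bool:
    # stand-in predicate: treat "s"-prefixed symbols as backed sizes and
    # "u"-prefixed ones as unbacked; the real code uses symbol_is_type(s, SymT.SIZE)
    return all(str(s).startswith("s") for s in expr.free_symbols)

s0, s1, u0 = sympy.symbols("s0 s1 u0")
assert all_symbols_are_backed(s0 + s1)        # purely backed expression
assert not all_symbols_are_backed(s0 + u0)    # mixed expression fails the check
```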
Is `expr.free_symbols` iterable, or does it materialize all symbols? Make sure we don't repeat this mistake:
https://github.com/pytorch/pytorch/pull/140027/files
I'm using the `free_symbols` attribute stored on the `sympy.Expr`, which stores the free symbols in a `Set`.
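For reference, `sympy.Expr.free_symbols` does return a materialized `set` of `Symbol` objects rather than a lazy iterator, so it can be traversed repeatedly without re-walking the expression tree:

```python
import sympy

s0, u0 = sympy.symbols("s0 u0")
expr = s0 * 2 + u0

syms = expr.free_symbols
assert isinstance(syms, set)  # a materialized set, not a generator
assert syms == {s0, u0}       # contains exactly the free symbols of the expression
```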
Shall we use an efficient API, like the one above?
torch/_inductor/sizevars.py (Outdated)

```python
    return self.unbacked_replacements

def all_symbols_are_unbacked(expr: Expr) -> bool:
    return all(self.shape_env.is_unbacked_symint(s) for s in expr.free_symbols)
```
Ditto, same as the other place.
torch/_inductor/sizevars.py (Outdated)

```python
# assuming lhs is the expr to be replaced (src), rhs is the replacement (dst)
# checking if we should keep them for the replacement rule or swap
if all_symbols_are_unbacked(lhs) and all_symbols_are_backed(rhs):
```
n00b q: is `all_symbols_are_backed(rhs)` good enough on its own? Why do we need to check `all_symbols_are_unbacked(lhs)` here and `all_symbols_are_unbacked(rhs)` in line 744?
That's a good point. Likely `all_symbols_are_backed(rhs)` is sufficient, given that deferred runtime assertions should always include an unbacked symbol today.
Maybe rename `all_symbols_are_unbacked` -> `some_symbols_are_backed`?
```python
else:
    # fallback to sympy.Basic.compare for a deterministic ordering
    return lhs.compare(rhs) == 1
```
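For context on why this fallback is a stable tie-breaker: `sympy.Basic.compare` returns -1, 0, or 1 and defines a deterministic total order over expressions, so the same pair of expressions always resolves in the same direction:

```python
import sympy

u0 = sympy.Symbol("u0")
s0 = sympy.Symbol("s0")

assert u0.compare(u0) == 0                 # equal expressions compare as 0
assert u0.compare(s0) == -s0.compare(u0)   # antisymmetric, so the order is total
# a plain symbol and a compound expression also get a definite ordering
assert (s0 + u0).compare(u0) in (-1, 1)
```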
I guess the existing issue comes from `sympy.Basic.compare` not preferring a backed symint over an unbacked one?
It does prefer a backed symbol over an unbacked symbol, but not a backed expression over an unbacked symbol. A symbol (backed or unbacked) is always preferred over an expression (e.g. `sym1 + sym2`).
## Scenario

- If there's a `torch._check(backed_expr == unbacked_symbol)`
- then we should replace unbacked_symbol with backed_expr
- currently, we don't do that when generating inputs for autotune_at_compile_time

## Error traceback

```
$ python test/inductor/test_aot_inductor.py -k test_size_with_unbacked_add_expr_transitive
...
  File "/data/users/colinpeppler/pytorch/torch/_inductor/compile_fx.py", line 1696, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/compile_fx.py", line 1187, in codegen_and_compile
    dynamo_utils.preserve_rng_state(),
  File "/home/colinpeppler/.conda/envs/pytorch/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/data/users/colinpeppler/pytorch/torch/_dynamo/utils.py", line 2236, in preserve_rng_state
    torch.cuda.set_rng_state(cuda_rng_state)  # type: ignore[possibly-undefined]
  File "/data/users/colinpeppler/pytorch/torch/cuda/random.py", line 79, in set_rng_state
    _lazy_call(cb)
  File "/data/users/colinpeppler/pytorch/torch/cuda/__init__.py", line 341, in _lazy_call
    callable()
  File "/data/users/colinpeppler/pytorch/torch/cuda/random.py", line 77, in cb
    default_generator.set_state(new_state)
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
```

## Bad autotuning input generation

```
# assume unbacked_symint_fallback = 16
# we generate too small of an input (16)
buf11 = generate_example_value((16, 256), (256, 1), 'cuda:0', torch.float32, 0, (16, 256))
triton_poi_fused_ones_1.run(buf11, 4096, stream=stream0)
stream0 = get_raw_stream(0)
buf12 = generate_example_value((16, 256), (256, 1), 'cuda:0', torch.float32, 0, (16, 256))
buf13 = generate_example_value((16, 256), (256, 1), 'cuda:0', torch.float32, 0, (16, 256))
add_kernel_1.run(buf11, buf12, buf13, 4096, 16, 1, 1, stream=stream0)
del buf11, buf12
stream0 = get_raw_stream(0)
buf15 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
triton_poi_fused_add_mul_2.run(buf2, buf13, buf15, 2688000, stream=stream0)
```

## Good autotuning input generation

```
# notice we generate with the proper size now (10500)
buf11 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
triton_poi_fused_ones_1.run(buf11, 2688000, stream=stream0)
stream0 = get_raw_stream(0)
buf12 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
buf13 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
add_kernel_1.run(buf11, buf12, buf13, 2688000, 10500, 1, 1, stream=stream0)
del buf11, buf12
stream0 = get_raw_stream(0)
buf15 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
triton_poi_fused_add_mul_2.run(buf2, buf13, buf15, 2688000, stream=stream0)
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben
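The replacement described above can be sketched with plain sympy: once `torch._check(backed_expr == unbacked_symbol)` justifies the substitution, the unbacked size expression resolves to its real hint instead of the fallback. The symbol names and the concrete values `10000 + 500 = 10500` are illustrative, chosen to match the sizes in the traces above:

```python
import sympy

u0 = sympy.Symbol("u0")          # unbacked symbol
s0, s1 = sympy.symbols("s0 s1")  # backed symbols with known hints

# torch._check(s0 + s1 == u0) yields the replacement rule u0 -> s0 + s1
replacements = {u0: s0 + s1}

# the numel expression u0 * 256 then resolves to a real size hint
size_expr = u0 * 256
hinted = size_expr.xreplace(replacements).subs({s0: 10000, s1: 500})
assert hinted == 2688000  # 10500 * 256, instead of the 16 * 256 = 4096 fallback
```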
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few are: trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m2-15). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben