
[RFC][inductor][index_put] fallback to aten in torch deterministic mode #96898

Closed
wants to merge 1 commit

Conversation

yuguo68 (Contributor) commented Mar 15, 2023

pytorch-bot (bot) commented Mar 15, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/96898

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 580c3a5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@yuguo68 yuguo68 requested review from jansel and ngimel and removed request for jansel March 15, 2023 22:41
@yuguo68 yuguo68 marked this pull request as ready for review March 15, 2023 22:49
yuguo68 (Contributor, Author) commented Mar 16, 2023

Working on the failed tests.

ngimel (Collaborator) commented Mar 17, 2023

Wires crossed, #96695 is doing pretty much the same thing. cc @colesbury

yuguo68 (Contributor, Author) commented Mar 17, 2023

Wires crossed, #96695 is doing pretty much the same thing. cc @colesbury

Oh, thanks for sharing! #96695 has dynamo support, including guards. This PR only has the inductor part for scatter ops and index_put. I still see test failures and will look into them.
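For context, the inductor-side fallback amounts to checking the global deterministic flag during lowering and routing index_put to the eager ATen kernel instead of generating a Triton kernel built on atomic adds. A minimal sketch of that idea (hypothetical function name and structure, not the actual diff in this PR):

```python
# Hypothetical sketch of an inductor-style fallback check; not the code in this PR.
import torch

def lower_index_put(self_tensor, indices, values, accumulate=False):
    # In deterministic mode, skip Triton codegen (atomic adds are
    # non-deterministic when accumulate=True) and call the eager ATen
    # kernel, which has a deterministic implementation.
    if torch.are_deterministic_algorithms_enabled():
        return torch.ops.aten.index_put_(self_tensor, indices, values, accumulate)
    raise NotImplementedError("normal lowering path omitted in this sketch")
```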

colesbury (Member)

@yuguo68 Let me know what your plans are. Do you want to combine this with #96695? Or land this first and then I'll land the guards?

yuguo68 (Contributor, Author) commented Mar 17, 2023

@yuguo68 Let me know what your plans are. Do you want to combine this with #96695? Or land this first and then I'll land the guards?

Thanks for asking! May I land this first and then you land the dynamo part? I think I am close to fixing the test errors.

yuguo68 (Contributor, Author) commented Mar 17, 2023

@ngimel, it seems the index_put fallback will cause a benchmarks/dynamo/huggingface.py failure, due to cudagraphs interacting with the sorting in the deterministic implementation of index_put with accumulate=True. Any suggestions? Should we disable cudagraphs or torch.use_deterministic_algorithms for this model? Thanks!

+ python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp --output /var/lib/jenkins/workspace/test/test-reports/training_huggingface.csv
cuda train AlbertForMaskedLM                   ERROR:common:CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 336, in cudagraphify_impl
    static_outputs = model(list(static_inputs))
  File "/tmp/tmpce9dhnjz/6x/c6xuuwpof6ncjaongbprvu332nxskskhnd7pdmcgpfw4smch44ft.py", line 3373, in call
    aten.index_put_(buf608, [slice_2], buf609, True)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 528, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: operation not permitted when stream is capturing
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

...
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 265, in cudagraphify
    return cudagraphify_impl(model, inputs, static_input_idxs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 335, in cudagraphify_impl
    with torch.cuda.graph(graph, stream=stream):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/graphs.py", line 173, in __exit__
    self.cuda_graph.capture_end()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/graphs.py", line 79, in capture_end
    super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
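A rough standalone reproducer of the conflict described above; the shapes, the toy function, and the use of mode="reduce-overhead" (which enables cudagraphs) are illustrative assumptions, not taken from the CI job:

```python
# Illustrative sketch only; the failing CI model is AlbertForMaskedLM, not this toy function.
import torch

torch.use_deterministic_algorithms(True)

def f(buf, idx, vals):
    # With accumulate=True, deterministic mode uses a synchronizing, sort-based
    # ATen kernel, which cannot run inside CUDA graph capture.
    return buf.index_put_([idx], vals, accumulate=True)

buf = torch.zeros(1024, device="cuda")
idx = torch.randint(0, 1024, (4096,), device="cuda")
vals = torch.randn(4096, device="cuda")

compiled = torch.compile(f, mode="reduce-overhead")  # "reduce-overhead" turns on cudagraphs
out = compiled(buf, idx, vals)  # expected to hit a capture error like the one above
```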

ngimel (Collaborator) commented Mar 18, 2023

Yeah, deterministic index_put is a synchronizing operation, so we should not run it with CUDA graphs. It's fine to run the accuracy jobs that have index_put without cudagraphs.
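One way to follow that suggestion while keeping deterministic mode on is to disable cudagraphs in the inductor config; the exact knob shown here (torch._inductor.config.triton.cudagraphs) is an assumption based on the 2.0-era config layout, and the sketch reuses f, buf, idx, and vals from the reproducer above:

```python
# Sketch: keep deterministic mode but compile without CUDA graph capture.
# The config attribute below is assumed from the PyTorch 2.0-era inductor config.
import torch
import torch._inductor.config as inductor_config

torch.use_deterministic_algorithms(True)
inductor_config.triton.cudagraphs = False  # skip cudagraph capture entirely

compiled = torch.compile(f)  # f as in the sketch above
out = compiled(buf, idx, vals)
```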

@yuguo68 yuguo68 force-pushed the inductor_deterministic branch 2 times, most recently from e58fb9a to 246ce92 Compare March 28, 2023 01:41
@yuguo68 yuguo68 changed the title from "[RFC][inductor][scatter_ops] fallback to aten in torch deterministic mode" to "[RFC][inductor][index_put] fallback to aten in torch deterministic mode" Mar 29, 2023
yuguo68 (Contributor, Author) commented Mar 29, 2023

Dropped the change to the scatter/scatter_reduce fallback since it causes numerical issues: 1. we might need an ir.ScatterReduceFallback similar to ir.IndexPutFallback; 2. currently scatter/scatter_reduce do not have a deterministic implementation anyway, whereas scatter_add does, and we can utilize it.
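As a quick sanity check of the behavior this comment relies on, repeated runs of scatter_add_ and index_put_ with accumulate=True on CUDA should be bitwise identical once deterministic mode is enabled; a small sketch (sizes are arbitrary):

```python
# Sketch: verify that the ops kept in this PR behave deterministically on CUDA.
import torch

torch.use_deterministic_algorithms(True)

idx = torch.randint(0, 128, (100_000,), device="cuda")
src = torch.randn(100_000, device="cuda")

def run_once():
    out = torch.zeros(128, device="cuda")
    out.scatter_add_(0, idx, src)                 # has a deterministic CUDA path
    out.index_put_([idx], src, accumulate=True)   # sort-based deterministic path
    return out

# Without deterministic mode, floating-point atomic adds can make these differ bitwise.
assert torch.equal(run_once(), run_once())
```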

yuguo68 (Contributor, Author) commented Mar 29, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label (trigger trunk jobs on your pull request) Mar 29, 2023
@yuguo68 yuguo68 added the topic: not user facing label Mar 29, 2023
pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check is pending or has not yet run:

  • EasyCLA

Dig deeper by viewing the pending checks on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

yuguo68 (Contributor, Author) commented Mar 29, 2023

/easycla

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Successfully merging this pull request may close these issues.

Inductor ignores global pytorch deterministic setting
5 participants